Skip to content
This repository has been archived by the owner on Nov 26, 2017. It is now read-only.

Fix bug in JString::parse_url that causes urlencoded strings to be decoded #1458

Merged
merged 1 commit into from Aug 18, 2012

Conversation

aaronschmitz
Copy link
Contributor

Switch JString::parse_url to return false on failure as specified in it's doc block.
JString::parse_url is now identical to parse_url for ascii strings.

…coded.

Switch JString::parse_url to return false on failure as specified in it's doc block.
JString::parse_url is now identical to parse_url for ascii strings.
@stefanneculai
Copy link
Contributor

Looks good.

@@ -62,7 +62,7 @@ public function test(SimpleXMLElement $element, $value, $group = null, JRegistry
* accurately without a scheme.
* @see http://php.net/manual/en/function.parse-url.php
*/
if (!array_key_exists('scheme', $urlParts))
if ($urlParts && !array_key_exists('scheme', $urlParts))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it make more sense to error out right after the call to parse_url if it returns false? Or am I missing a usecase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You would know better than I do. This simply maintains current functionality. If we want to investigate changing that functionality, perhaps that could be considered in another pull request.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically missing scheme is what causes parse_url in php in general not to work correctly ( because it will mistake the host for a string if there is no scheme) but that doesn't necessarily mean that there is nothing to parse or worth parsing.

@realityking
Copy link
Contributor

Sorry for the many questions, I'm a bit afraid we break something.

@stefanneculai
Copy link
Contributor

@realityking I have pulled this out and tried all kind of tests and looks good. I think @aaronschmitz can explain all those strange str_replace and urldecode. I have tried to tackle this issue several times but Aaron's solution doesn't fail any test - or at least I could not find one that fails.

@aaronschmitz
Copy link
Contributor Author

@realityking, the original code is incredibly confusing at first glance, and I only make it worse - I'm guessing my pseudo-code explanation will shed some light on the black magic.

  1. Urlencode the url
    • This converts unicode to escape codes so it doesn't get filtered
  2. Decode "important" characters
    • If the characters in the array on line 968 are left encoded parseurl will not be able to identify any or the parts of the url)
  3. Run the string through parseurl.
  4. Re-encode the "important" characters in each chunk of the array
    • This puts the strings back in a consistent state where all characters are urlencoded
  5. Urldecode each chunk

@dianaprajescu
Copy link
Contributor

@aaronschmitz's solution solves my problem too and it also seems to be OK with @stefanneculai. See pull requests #1285 and #1382.

@LouisLandry
Copy link
Contributor

This is great work @aaronschmitz ... really great.

LouisLandry added a commit that referenced this pull request Aug 18, 2012
Fix bug in JString::parse_url that causes urlencoded strings to be decoded
@LouisLandry LouisLandry merged commit f030fc4 into joomla:staging Aug 18, 2012
@infograf768
Copy link
Member

This has broken multilang in the CMS when using unicode aliases

@aaronschmitz
Copy link
Contributor Author

I'm happy to try to resolve the issue if you can give me the url that causes the trouble.

@mbabker
Copy link
Contributor

mbabker commented Sep 2, 2012

Using JM's multi-lingual data set, the URL http://localhost/joomla-cms/es/inicio/19-categorías-en-español.html is
supposed to load the category with ID 19 but is loading the article with ID 19. Reverting the changes to the JString method makes this load as expected.

EDIT: Likewise, http://localhost/joomla-cms/es/federico-garcía-lorca.html should go to article ID 57 but is attempting to load category ID 57.

@infograf768
Copy link
Member

@aaronschmitz
I can send you the db dump if it can help

@aaronschmitz
Copy link
Contributor Author

Bug Acknowledged. I got a test database from @mbabker.

The issue is that the browser doesn't actually send http://localhost/joomla-cms/es/inicio/19-categorías-en-español.html. Instead it sends http://localhost/joomla-cms/es/inicio/19-categor%C3%ADas-en-espa%C3%B1ol.html. Before, JString::parse_url kindly converted this back to unicode.

While this was convenient, it's not really proper behavior and it broke things because JString is also used by JHTTP where we definitely don't want URL's getting converted to unicode. Likewise, we could fix this in JUri::getInstance where the URI is retrieved from $_SERVER, but to me that also seems like an incorrect solution because that's not the actual URI (true unicode URI's are illegal - hence the browser's punycode conversion).

The problem manifests itself in the router. I'm looking at how to fix it now.

@elinw
Copy link
Contributor

elinw commented Sep 3, 2012

I'm seeing more general set of routing issues in some places in the cms trunk ... and michael what you are mentioning sounds like the id clashing we had a long time ago in which case it's a cms issue or at least a different issue.

@aaronschmitz
Copy link
Contributor Author

Ok, @infograf768. I've put together two distinct solutions for the issue.

  1. Decode URL before sending it to the router
  2. Encode the URL's within the router that comparisons are performed against

RFC 3986 - the present standard on Uniform Resource Identifiers clearly defines the allowable characters in a URI (Page 49). Clearly, non-ASCII characters are prohibited yet the Joomla currently outputs whatever is in the slug/alias.

Solution 1 does not resolve this problem. Solution 2 resolves this problem by replacing unicode characters with their percent encoded equivalents. Most browsers still display these encodings as the unicode character, but the URL's themselves are now RFC compliant.

Please look over these two solutions and let me know which is more desirable. If you think it needs more discussion, I can easily post it to the mailing list as well.

@infograf768
Copy link
Member

The simplest the best IMHO (routing_hack), even if we do not exactly follow exactly RFC 3986, as it is fully B/C.
The fact that ASCII unreserved characters should be presented as ASCII instead of percent encodings should be solved indeed globally. Another time I guess.

@realityking
Copy link
Contributor

Another vote for option No. 1.

@aaronschmitz
Copy link
Contributor Author

Here you go. As requested I moved it into the router itself.
joomla/joomla-cms#347

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants