Pipe character not supported in URL #6

Closed
EmielBruijntjes opened this Issue Jul 20, 2015 · 5 comments


EmielBruijntjes commented Jul 20, 2015

$url = new http\Url("http://www.example.com/?x=a|b");

The above code results in a fatal error:

Uncaught exception 'http\Exception\BadUrlException' with message 'http\Url::__construct(): Failed to parse query; unexpected byte 0x7c at pos 3 in 'x=a|b''

However, pipe characters are valid in URLs, and widely used (for example by Google).

In general, I would be in favour of making http\Url::__construct() as tolerant as possible: even when you pass an invalid URL to the constructor, the object should do its best to parse as much of it as is reasonably possible.

m6w6 (Owner) commented Jul 20, 2015

Do you have a real-life example URL? They're probably percent-encoded.
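For illustration, here is the percent-encoded form of the URL from the report; a minimal sketch (assuming pecl_http 2.x semantics, where the escaped byte is accepted even though the raw 0x7c is rejected):

// The raw pipe throws http\Exception\BadUrlException, but the
// percent-encoded form (%7C) is expected to parse without error.
$url = new http\Url("http://www.example.com/?x=a%7Cb");
echo $url->query; // x=a%7Cb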

EmielBruijntjes commented Jul 20, 2015

I'm not sure where the errors come from or which real-life URLs trigger them. I just see the errors pop up in the server log files after upgrading the pecl-http extension, which is undesirable. For example, with the following query string:

__utma=1152894289.1017686999.9107388726.1439222726.1494721726.1&__utmb=115739289.1.10.1437388726&__utmc=115883619&__utmx=-&__utmz=115111289.14310476.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided)&__utmv=-&__utmk=112678937

But the same problem occurs when parsing URLs with other characters:

http://www.example.com/?id={$id}

It may well be that curly braces and pipes are not allowed in URLs; I'm not sure about that. However, even if such characters turn out to be formally illegal in URLs, it is still considered good practice to be forgiving when parsing input.
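As a point of reference (an illustrative aside, not part of the original comment), PHP's built-in parse_url() tolerates the raw pipe byte:

// parse_url() accepts the raw pipe and returns the query string as-is.
$parts = parse_url("http://www.example.com/?x=a|b");
var_dump($parts['query']); // string(5) "x=a|b"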

m6w6 (Owner) commented Jul 20, 2015

Allowing illegal characters is "undesirable", too.

m6w6 added the suspended label Jul 20, 2015

m6w6 (Owner) commented Jul 20, 2015

BTW, you could use http\Url::PARSE_TOPCT to percent-encode unexpected characters:

new http\Url("?utm=foo|bar", null, http\Url::PARSE_MBUTF8|http\Url::PARSE_TOPCT);

See http://devel-m6w6.rhcloud.com/mdref/http/Url#PARSE_
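Applied to the URL from the original report, a minimal sketch (the exact encoded output is an expectation under pecl_http 2.x, not verified here):

// With PARSE_TOPCT, the unexpected pipe byte is percent-encoded instead of
// triggering http\Exception\BadUrlException.
$url = new http\Url("http://www.example.com/?x=a|b", null,
    http\Url::PARSE_MBUTF8 | http\Url::PARSE_TOPCT);
echo $url->query; // expected: x=a%7Cb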

EmielBruijntjes commented Jul 21, 2015

That does indeed solve the issue, so I'm going to add this flag to our code. However, I am still of the opinion that this is a bug in the pecl-http extension, even though the workaround is probably sufficient for us. Our code now looks like this:

// flags to be passed to the http\Url constructor
$flags = 0;

// newer versions of the pecl-http extension do their own multi-byte decoding,
// which results in exceptions being thrown when a user enters a URL that
// contains pipe characters, even though such a URL is perfectly well accepted
// by the browser, the Apache web server and the Zend engine. As a workaround,
// we pass a number of flags to the http\Url constructor to rewrite the URL so
// that no exceptions are thrown
if (defined('http\\Url::PARSE_MBUTF8')) $flags |= http\Url::PARSE_MBUTF8;
if (defined('http\\Url::PARSE_TOPCT')) $flags |= http\Url::PARSE_TOPCT;

// construct current url
$url = new http\Url($_SERVER['REQUEST_URI'], null, $flags);

As you can see, we're just parsing the $_SERVER['REQUEST_URI'] variable here, which holds the incoming URL. We have no control over the contents of this variable: it is simply the address that the user typed into the browser, or the address of a hyperlink on a remote website. We're not in a position to "allow" or "disallow" a URL; we're faced with the reality that someone we don't know entered this address and ended up on our website, and all we want to do is split the address into a scheme, hostname, and so on. This has always worked, but the upgrade to the latest version of the pecl-http extension suddenly breaks it. This is a regression.

As I mentioned before, when you deal with data it is good practice (in fact, a requirement in many RFCs!) to be tolerant and forgiving when parsing and strict when generating. It would therefore be much better if the default behavior of the http\Url constructor were as tolerant as possible, throwing exceptions on (small) errors only when you explicitly pass a specific flag (for example, an http\Url::STRICT).
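A defensive variant of the workaround above, as a hypothetical sketch (parse_request_url is a made-up helper, not part of any API): try the strict parse first and fall back to percent-encoding only when it throws.

// Hypothetical helper: attempt a strict parse first, and only rewrite
// unexpected bytes when the strict parse throws.
function parse_request_url($uri) {
    try {
        return new http\Url($uri);
    } catch (http\Exception\BadUrlException $e) {
        return new http\Url($uri, null,
            http\Url::PARSE_MBUTF8 | http\Url::PARSE_TOPCT);
    }
}

$url = parse_request_url($_SERVER['REQUEST_URI']);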

m6w6 added the in progress label and removed the suspended label Jul 22, 2015

m6w6 added a commit that referenced this issue Jul 22, 2015

Fix gh-issue #6
Allow RFC1738 unsafe characters in URL query/fragment.
Closes issue #6.

m6w6 closed this Jul 22, 2015

m6w6 removed the in progress label Jul 22, 2015
