Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #72811 - parse_url fails with IPv6 only host #2079

Closed
wants to merge 7 commits into from

Conversation

6 participants
@bp1222
Copy link
Contributor

commented Aug 12, 2016

Bug report shows that when using an IPv6 formatted address as
a URL host it would incorrectly parse. parse_url detects the first ':'
as denoting a scheme, which isn't properly parsed. This relegates the
[IPv6] formatted host to ending up as the "path" portion of the URL.
This fix skips sheme checking if the first character of the URL string
is a '[', and will ensure not to fall into the catch-all else if the
URL string starts with '[' then a digit, alpha, or :.

Fix #72811 - parse_url fails with IPv6 only host
Bug report shows that when using an IPv6 formatted address as
a URL host it would incorrectly parse.  parse_url detects the first ':'
as denoting a scheme, which isn't properly parsed.  This relegates the
[IPv6] formatted host to ending up as the "path" portion of the URL.
This fix skips sheme checking if the first character of the URL string
is a '[', and will ensure not to fall into the catch-all else if the
URL string starts with '[' then a digit, alpha, or :.
@smalyshev

This comment has been minimized.

Copy link
Contributor

commented Sep 5, 2016

@yohgaki what to take a look?

@smalyshev smalyshev added the Bugfix label Sep 5, 2016

@yohgaki

This comment has been minimized.

Copy link
Contributor

commented Sep 5, 2016

@stas I suupose this patch is for
https://tools.ietf.org/html/rfc3986#section-3.2.2

3.2.2. Host

  IP-literal = "[" ( IPv6address / IPvFuture  ) "]"


  IPv6address =                            6( h16 ":" ) ls32
              /                       "::" 5( h16 ":" ) ls32
              / [               h16 ] "::" 4( h16 ":" ) ls32
              / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
              / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
              / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
              / [ *4( h16 ":" ) h16 ] "::"              ls32
              / [ *5( h16 ":" ) h16 ] "::"              h16
              / [ *6( h16 ":" ) h16 ] "::"

  ls32        = ( h16 ":" h16 ) / IPv4address
              ; least-significant 32 bits of address

  h16         = 1*4HEXDIG
              ; 16 bits of address represented in hexadecima

Address format in phpt is shown as an example URI.
https://tools.ietf.org/html/rfc3986#section-1.1.2

We need to support this kind of IPv6 format, but priority is not high because browsers do not support IPv6 address format well, AFAIK.

This patch seems incomplete because parse_url should take care of port. It seems port number handling is missing at least. I think patch should look for ']'. I took a look the patch only, so I could be wrong though.

@bp1222

This comment has been minimized.

Copy link
Contributor Author

commented Sep 6, 2016

@yohgaki - Correct, that's what the bug reported. I believe you're right that I should be more verbose in checking for the finishing ] of the v6 address scheme. As for what I can tell, chrome, safari, and firefox all support http://[::1] formatted address schemes for URL access. I will round out tests, checking for port, and other options when using the bracketed v6 host scheme.

@bp1222

This comment has been minimized.

Copy link
Contributor Author

commented Sep 6, 2016

@yohgaki - Now that I have the cycle to look, parse_url already handles IPv6 like URL's, but only if the bracketed host comes after a scheme. //[::1] would have been valid and PHP_URL_HOST.

The bug is that [::1]/foo.php has a null value for PHP_URL_HOST, whereas 127.0.0.1/foo.php would have a valid PHP_URL_HOST

@cmb69

This comment has been minimized.

Copy link
Contributor

commented Sep 7, 2016

In my humble opinion, the current implementation of php_url_parse_ex() is broken beyond repair, and should better be replaced by a clean FSA implementing a concrete specification, without referring to memchr() to make some guesses, and later trying to fix up. At least, I have given up long ago trying to fix the various reported issues.

Having said this, I'm not against merging this PR.

@bp1222

This comment has been minimized.

Copy link
Contributor Author

commented Sep 7, 2016

I wouldn't be opposed to dropping this PR, and taking a stab at a better state based implementation of parsing a complex URI. I'm more surprised there isn't a library, or some PHP license compatible implementation somewhere.

@cmb69

This comment has been minimized.

Copy link
Contributor

commented Sep 8, 2016

I'm more surprised there isn't a library, or some PHP license compatible implementation somewhere.

There may be – I've never looked for one, though.

@bp1222

This comment has been minimized.

Copy link
Contributor Author

commented Sep 8, 2016

@cmb69 There apparently are no shortages of libraries for this purpose (who knew he said sarcastically). My question here, and this is more a the-proper-way-to-do-things question, would it be better to a) find a BSD licensed (or PHP-acceptable licenced) library and include it into the source, or, create a config --with-whateverlib option to use the library installed on the host machine?

I guess the former would be smarter since maintaining parse_url would fall to the library and we could remove what exists today. But I'm not sure the preferred means of incorporating a library into the core.

@cmb69

This comment has been minimized.

Copy link
Contributor

commented Sep 8, 2016

@bp1222 As parse_url() is part of ext/standard, most likely we should bundle the library (I'd suppose the library to be small, anyway), and perhaps offer an option to use the system library as alternative.

@yohgaki

This comment has been minimized.

Copy link
Contributor

commented Sep 9, 2016

A while ago, I was looking for URL parser implemented by re2c and couldn't find one. Now I found
https://github.com/staskobzar/url_parser_re2c
It's missing schemaless URL support, e.g. //example.com/, but it could be added.
The author does not write license, so I've asked as an issue.

@bp1222

This comment has been minimized.

Copy link
Contributor Author

commented Sep 9, 2016

@yohgaki - Seems your request was answered, he MIT'd it. I can take a stab at getting it to work with schemeless-urls, and the other fringe cases defined in tests.

@bp1222

This comment has been minimized.

Copy link
Contributor Author

commented Sep 13, 2016

I've been working on getting the library updated with schemeless URL and better handling of IPv6 addresses, and I've come across a couple oddities in the PHP parse_url tests. For example:

   --> 64.246.30.37/: array(1) {
     ["path"]=>
     string(13) "64.246.30.37/"
   }

   --> 64.246.30.37:80/: array(3) {
     ["host"]=>
     string(12) "64.246.30.37"
     ["port"]=>
     int(80)
     ["path"]=>
     string(1) "/"
   }

I'm wondering why adding a port number would trigger parse_url() to treat the string as having a host, and port component rather than path. This seems contradictory to the RFC. in Section 3&4 (specifically 3, 4.2, and 4.3) it lays out the URI to consist either of a hier-part or relative-part. Both of these begin by requiring "//" authority, and it's only within authority do we parse for host. So I think in the PHP unit test the first is correct, and the second incorrect (per RFC). Not sure if this is a conscious abnormality, or just due to the current parser.

Thoughts?

@nikic

This comment has been minimized.

Copy link
Member

commented Sep 13, 2016

parse_url() is supposed to be able to deal with partial URLs, as such I would consider the second output correct and the first wrong.

@cmb69

This comment has been minimized.

Copy link
Contributor

commented Sep 13, 2016

parse_url() is supposed to be able to deal with partial URLs, […]

What might be a problem in the first place. Consider the partial URL "example.com" – is this the host or the path?

@nikic

This comment has been minimized.

Copy link
Member

commented Sep 13, 2016

I would have said host (and add that nothing should be a path unless it starts with a /), but ... that does not seem to match what parse_url() does. Dammit.

@bp1222

This comment has been minimized.

Copy link
Contributor Author

commented Sep 13, 2016

example.com should be path, per RFC. I think the current implementation of parse_url() does trickery to allow people to parse things like example.com/path/to/foo.php?a=b and get the host, path, and query as we would assume. Whereas the RFC would require you to make that //example.com/path/to/foo.php?a=b

I think the trickery, and special handling in parse_url() to be incorrect behavior, and that it should more-strongly adhere to the RFC. As it defines the hier-part pretty specifically.

@bp1222

This comment has been minimized.

Copy link
Contributor Author

commented Sep 14, 2016

The other oddity I'm coming across with the RFC style of parsing a url would be something like this:

www.php.net:80

This would oddly parse into

scheme: www.php.net
path: 80

This is because the RFC defines SCHEME as

Scheme names consist of a sequence of characters beginning with a
   letter and followed by any combination of letters, digits, plus
   ("+"), period ("."), or hyphen ("-").

the 80 would then be caught by the path-rootless rule.

Although this parsing is true to the spec, this is an example that I could see us fiddling around with the spec to say that a scheme, terminated by ':' followed by a digit like

SCHEME ":" DIGIT

is NOT a scheme and should be treated as a path.

Thoughts?

@bp1222

This comment has been minimized.

Copy link
Contributor Author

commented Sep 14, 2016

Ignore my thought above. That would break parsing for tel scheme like:

tel:555-1212
@cmb69

This comment has been minimized.

Copy link
Contributor

commented Sep 15, 2016

Thanks for reworking the PR, David!

I tried to compile, but my re2c didn't like the C++ style comments in url_parser_ex.re; replacing these with classic C style comments satisfied my re2c. I found that I'm using 0.13.5, but C++ style comments are allowed only as of 0.13.6. For compatibility reasons it's probably best to stick with classic C style comments.

Anyhow, with the patch several tests are now failing, so I guess we can't simply switch the implementations. At least that would require further discussion on the internals mailing list, and probably an RFC. Another option might be to leave parse_url() as it is, and add a new function (perhaps deprecating the old parse_url()).

Thoughts?

@yohgaki

This comment has been minimized.

Copy link
Contributor

commented Sep 15, 2016

@CMB I would rather go for new re2c based parser without new function (both internal and user API).
We don't want to maintain both, we can't probably... We have time for fixing issues.

About RFC and discussion, I agree totally.

@bp1222

This comment has been minimized.

Copy link
Contributor Author

commented Sep 15, 2016

Oh yes. This commit is just work-in progress (could probably drop the pr until finished). I didn't commit all the test updates I've been plugging away at. As noted this implementation does change many of the tests. Most of the previous "failure" tests aren't fails now. They're just parsed slightly weird (though probably correct).

I concur that this should probably be a replacement rather than maintaining two parsers. Found so probably would warrant an RFC and be targeted for 7.2 or 7.3. I'm traveling on vacation the next two weeks but should be able to pick back up the.

@cmb69

This comment has been minimized.

Copy link
Contributor

commented Sep 15, 2016

I' fine with a replacement for 7.2, if there'll be an RFC. And yes, there's plenty of time until 7.2 feature freeze, so no need to hurry. :-)

@bp1222

This comment has been minimized.

Copy link
Contributor Author

commented Dec 28, 2016

Closing this PR, going to implement more of a replacement.

@bp1222 bp1222 closed this Dec 28, 2016

@bp1222 bp1222 deleted the bp1222:fix-72811 branch Dec 28, 2016

@jchook

This comment has been minimized.

Copy link

commented Apr 6, 2018

Any update on this? Seems like it didn't make the 7.2 cut?

@bp1222

This comment has been minimized.

Copy link
Contributor Author

commented Apr 6, 2018

It did not, I haven’t spent time to work on a more of a replacement over a patch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.