Fix #72811 - parse_url fails with IPv6 only host #2079

bp1222 · 2016-08-12T20:25:41Z

Bug report shows that when using an IPv6 formatted address as
a URL host it would incorrectly parse. parse_url detects the first ':'
as denoting a scheme, which isn't properly parsed. This relegates the
[IPv6] formatted host to ending up as the "path" portion of the URL.
This fix skips sheme checking if the first character of the URL string
is a '[', and will ensure not to fall into the catch-all else if the
URL string starts with '[' then a digit, alpha, or :.

Bug report shows that when using an IPv6 formatted address as a URL host it would incorrectly parse. parse_url detects the first ':' as denoting a scheme, which isn't properly parsed. This relegates the [IPv6] formatted host to ending up as the "path" portion of the URL. This fix skips sheme checking if the first character of the URL string is a '[', and will ensure not to fall into the catch-all else if the URL string starts with '[' then a digit, alpha, or :.

smalyshev · 2016-09-05T06:49:18Z

@yohgaki what to take a look?

yohgaki · 2016-09-05T22:56:01Z

@stas I suupose this patch is for
https://tools.ietf.org/html/rfc3986#section-3.2.2

3.2.2. Host

  IP-literal = "[" ( IPv6address / IPvFuture  ) "]"


  IPv6address =                            6( h16 ":" ) ls32
              /                       "::" 5( h16 ":" ) ls32
              / [               h16 ] "::" 4( h16 ":" ) ls32
              / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
              / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
              / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
              / [ *4( h16 ":" ) h16 ] "::"              ls32
              / [ *5( h16 ":" ) h16 ] "::"              h16
              / [ *6( h16 ":" ) h16 ] "::"

  ls32        = ( h16 ":" h16 ) / IPv4address
              ; least-significant 32 bits of address

  h16         = 1*4HEXDIG
              ; 16 bits of address represented in hexadecima

Address format in phpt is shown as an example URI.
https://tools.ietf.org/html/rfc3986#section-1.1.2

We need to support this kind of IPv6 format, but priority is not high because browsers do not support IPv6 address format well, AFAIK.

This patch seems incomplete because parse_url should take care of port. It seems port number handling is missing at least. I think patch should look for ']'. I took a look the patch only, so I could be wrong though.

bp1222 · 2016-09-06T16:22:30Z

@yohgaki - Correct, that's what the bug reported. I believe you're right that I should be more verbose in checking for the finishing ] of the v6 address scheme. As for what I can tell, chrome, safari, and firefox all support http://[::1] formatted address schemes for URL access. I will round out tests, checking for port, and other options when using the bracketed v6 host scheme.

bp1222 · 2016-09-06T17:53:03Z

@yohgaki - Now that I have the cycle to look, parse_url already handles IPv6 like URL's, but only if the bracketed host comes after a scheme. //[::1] would have been valid and PHP_URL_HOST.

The bug is that [::1]/foo.php has a null value for PHP_URL_HOST, whereas 127.0.0.1/foo.php would have a valid PHP_URL_HOST

cmb69 · 2016-09-07T21:39:01Z

In my humble opinion, the current implementation of php_url_parse_ex() is broken beyond repair, and should better be replaced by a clean FSA implementing a concrete specification, without referring to memchr() to make some guesses, and later trying to fix up. At least, I have given up long ago trying to fix the various reported issues.

Having said this, I'm not against merging this PR.

bp1222 · 2016-09-07T21:43:26Z

I wouldn't be opposed to dropping this PR, and taking a stab at a better state based implementation of parsing a complex URI. I'm more surprised there isn't a library, or some PHP license compatible implementation somewhere.

cmb69 · 2016-09-08T12:07:52Z

I'm more surprised there isn't a library, or some PHP license compatible implementation somewhere.

There may be – I've never looked for one, though.

bp1222 · 2016-09-08T14:35:40Z

@cmb69 There apparently are no shortages of libraries for this purpose (who knew he said sarcastically). My question here, and this is more a the-proper-way-to-do-things question, would it be better to a) find a BSD licensed (or PHP-acceptable licenced) library and include it into the source, or, create a config --with-whateverlib option to use the library installed on the host machine?

I guess the former would be smarter since maintaining parse_url would fall to the library and we could remove what exists today. But I'm not sure the preferred means of incorporating a library into the core.

cmb69 · 2016-09-08T14:45:00Z

@bp1222 As parse_url() is part of ext/standard, most likely we should bundle the library (I'd suppose the library to be small, anyway), and perhaps offer an option to use the system library as alternative.

yohgaki · 2016-09-09T01:05:13Z

A while ago, I was looking for URL parser implemented by re2c and couldn't find one. Now I found
https://github.com/staskobzar/url_parser_re2c
It's missing schemaless URL support, e.g. //example.com/, but it could be added.
The author does not write license, so I've asked as an issue.

bp1222 · 2016-09-09T21:58:48Z

@yohgaki - Seems your request was answered, he MIT'd it. I can take a stab at getting it to work with schemeless-urls, and the other fringe cases defined in tests.

bp1222 · 2016-09-13T19:53:27Z

I've been working on getting the library updated with schemeless URL and better handling of IPv6 addresses, and I've come across a couple oddities in the PHP parse_url tests. For example:

   --> 64.246.30.37/: array(1) {
     ["path"]=>
     string(13) "64.246.30.37/"
   }

   --> 64.246.30.37:80/: array(3) {
     ["host"]=>
     string(12) "64.246.30.37"
     ["port"]=>
     int(80)
     ["path"]=>
     string(1) "/"
   }

I'm wondering why adding a port number would trigger parse_url() to treat the string as having a host, and port component rather than path. This seems contradictory to the RFC. in Section 3&4 (specifically 3, 4.2, and 4.3) it lays out the URI to consist either of a hier-part or relative-part. Both of these begin by requiring "//" authority, and it's only within authority do we parse for host. So I think in the PHP unit test the first is correct, and the second incorrect (per RFC). Not sure if this is a conscious abnormality, or just due to the current parser.

Thoughts?

nikic · 2016-09-13T20:26:11Z

parse_url() is supposed to be able to deal with partial URLs, as such I would consider the second output correct and the first wrong.

cmb69 · 2016-09-13T23:14:01Z

parse_url() is supposed to be able to deal with partial URLs, […]

What might be a problem in the first place. Consider the partial URL "example.com" – is this the host or the path?

nikic · 2016-09-13T23:33:17Z

I would have said host (and add that nothing should be a path unless it starts with a /), but ... that does not seem to match what parse_url() does. Dammit.

bp1222 · 2016-09-13T23:38:14Z

example.com should be path, per RFC. I think the current implementation of parse_url() does trickery to allow people to parse things like example.com/path/to/foo.php?a=b and get the host, path, and query as we would assume. Whereas the RFC would require you to make that //example.com/path/to/foo.php?a=b

I think the trickery, and special handling in parse_url() to be incorrect behavior, and that it should more-strongly adhere to the RFC. As it defines the hier-part pretty specifically.

bp1222 · 2016-09-14T02:12:29Z

The other oddity I'm coming across with the RFC style of parsing a url would be something like this:

www.php.net:80

This would oddly parse into

scheme: www.php.net
path: 80

This is because the RFC defines SCHEME as

Scheme names consist of a sequence of characters beginning with a
   letter and followed by any combination of letters, digits, plus
   ("+"), period ("."), or hyphen ("-").

the 80 would then be caught by the path-rootless rule.

Although this parsing is true to the spec, this is an example that I could see us fiddling around with the spec to say that a scheme, terminated by ':' followed by a digit like

SCHEME ":" DIGIT

is NOT a scheme and should be treated as a path.

Thoughts?

bp1222 · 2016-09-14T02:14:38Z

Ignore my thought above. That would break parsing for tel scheme like:

tel:555-1212

cmb69 · 2016-09-15T09:49:57Z

Thanks for reworking the PR, David!

I tried to compile, but my re2c didn't like the C++ style comments in url_parser_ex.re; replacing these with classic C style comments satisfied my re2c. I found that I'm using 0.13.5, but C++ style comments are allowed only as of 0.13.6. For compatibility reasons it's probably best to stick with classic C style comments.

Anyhow, with the patch several tests are now failing, so I guess we can't simply switch the implementations. At least that would require further discussion on the internals mailing list, and probably an RFC. Another option might be to leave parse_url() as it is, and add a new function (perhaps deprecating the old parse_url()).

Thoughts?

yohgaki · 2016-09-15T11:58:46Z

@CMB I would rather go for new re2c based parser without new function (both internal and user API).
We don't want to maintain both, we can't probably... We have time for fixing issues.

About RFC and discussion, I agree totally.

bp1222 · 2016-09-15T12:18:44Z

Oh yes. This commit is just work-in progress (could probably drop the pr until finished). I didn't commit all the test updates I've been plugging away at. As noted this implementation does change many of the tests. Most of the previous "failure" tests aren't fails now. They're just parsed slightly weird (though probably correct).

I concur that this should probably be a replacement rather than maintaining two parsers. Found so probably would warrant an RFC and be targeted for 7.2 or 7.3. I'm traveling on vacation the next two weeks but should be able to pick back up the.

cmb69 · 2016-09-15T12:32:17Z

I' fine with a replacement for 7.2, if there'll be an RFC. And yes, there's plenty of time until 7.2 feature freeze, so no need to hurry. :-)

bp1222 · 2016-12-28T20:29:01Z

Closing this PR, going to implement more of a replacement.

jchook · 2018-04-06T01:00:42Z

Any update on this? Seems like it didn't make the 7.2 cut?

bp1222 · 2018-04-06T02:01:56Z

It did not, I haven’t spent time to work on a more of a replacement over a patch.

smalyshev added the Bug label Sep 5, 2016

move test. Add more tests to pull out more pieces

96d342d

Commit of re2c based parser for URLs adhering to RFC 3986

67b1896

bp1222 added 4 commits October 3, 2016 16:12

Update to tests for new parser

9fdc3f0

Remove RFC break, and remove old test

de19fc5

Had in wrong order

ff66377

We no longer need to worry about validity of return value

eab254d

bp1222 closed this Dec 28, 2016

bp1222 deleted the fix-72811 branch December 28, 2016 23:51

kraftbj mentioned this pull request Oct 21, 2020

Protect: Needs better IPv6 handling Automattic/jetpack#17566

Closed

Fix #72811 - parse_url fails with IPv6 only host #2079

Fix #72811 - parse_url fails with IPv6 only host #2079

Uh oh!

Conversation

bp1222 commented Aug 12, 2016

Uh oh!

smalyshev commented Sep 5, 2016

Uh oh!

yohgaki commented Sep 5, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bp1222 commented Sep 6, 2016

Uh oh!

bp1222 commented Sep 6, 2016

Uh oh!

cmb69 commented Sep 7, 2016

Uh oh!

bp1222 commented Sep 7, 2016

Uh oh!

cmb69 commented Sep 8, 2016

Uh oh!

bp1222 commented Sep 8, 2016

Uh oh!

cmb69 commented Sep 8, 2016

Uh oh!

yohgaki commented Sep 9, 2016

Uh oh!

bp1222 commented Sep 9, 2016

Uh oh!

bp1222 commented Sep 13, 2016

Uh oh!

nikic commented Sep 13, 2016

Uh oh!

cmb69 commented Sep 13, 2016

Uh oh!

nikic commented Sep 13, 2016

Uh oh!

bp1222 commented Sep 13, 2016

Uh oh!

bp1222 commented Sep 14, 2016

Uh oh!

bp1222 commented Sep 14, 2016

Uh oh!

cmb69 commented Sep 15, 2016

Uh oh!

yohgaki commented Sep 15, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bp1222 commented Sep 15, 2016

Uh oh!

cmb69 commented Sep 15, 2016

Uh oh!

bp1222 commented Dec 28, 2016

Uh oh!

jchook commented Apr 6, 2018

Uh oh!

bp1222 commented Apr 6, 2018

Uh oh!

Uh oh!

yohgaki commented Sep 5, 2016 •

edited

Loading

yohgaki commented Sep 15, 2016 •

edited

Loading