Fix #72811 - parse_url fails with IPv6 only host #2079
The bug report shows that a URL whose host is an IPv6-formatted address is parsed incorrectly. parse_url treats the first ':' as the end of a scheme, which then fails to parse, so the [IPv6] formatted host ends up as the "path" portion of the URL. This fix skips scheme checking if the first character of the URL string is a '[', and ensures that we do not fall into the catch-all else when the URL string starts with '[' followed by a digit, an alpha, or ':'.
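A minimal sketch of the idea in the description, not the actual patch: if the input starts with '[', do not go looking for a scheme-terminating ':' at all. The helper names `starts_with_bracketed_host` and `find_scheme_colon` are hypothetical and do not exist in php_url_parse_ex(); the "first ':' ends the scheme" rule is a deliberate simplification of what the parser does.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true when the input begins with a bracketed
 * IPv6 literal such as "[::1]". */
static bool starts_with_bracketed_host(const char *url)
{
    return url != NULL && url[0] == '[';
}

/* Hypothetical helper: returns a pointer to the first ':' that a naive
 * parser would treat as the end of the scheme, or NULL when scheme
 * detection should be skipped (bracketed host) or no ':' exists. */
static const char *find_scheme_colon(const char *url)
{
    if (url == NULL || starts_with_bracketed_host(url)) {
        return NULL; /* "[::1]..." is a host, not "scheme:rest" */
    }
    return strchr(url, ':');
}
```

With this guard, an input like "[::1]/path" is never split at the ':' inside the brackets, while "http://[::1]/" still finds the ':' after "http".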
@yohgaki want to take a look? |
@stas I suppose this patch is for 3.2.2. Host
The address format in the phpt is shown as an example URI. We need to support this kind of IPv6 format, but the priority is not high because browsers do not support the IPv6 address format well, AFAIK. This patch seems incomplete because parse_url should take care of the port, and port number handling seems to be missing, at least. I think the patch should look for ']'. I only looked at the patch itself, so I could be wrong though. |
@yohgaki - Correct, that's what the bug reported. I believe you're right that I should be more thorough in checking for the closing ] of the v6 address. As far as I can tell, Chrome, Safari, and Firefox all support http://[::1] formatted addresses for URL access. I will round out the tests, checking for port and other options when using the bracketed v6 host format. |
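To make the two review points concrete (look for the closing ']' and handle the port), here is a rough sketch under stated assumptions. The `bracketed_host` struct and `parse_bracketed_host` function are hypothetical names for illustration and are not part of the PHP source. The sketch does not validate the IPv6 text inside the brackets, and it rejects an empty port after ':' as a simplification.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
typedef struct {
    const char *host;  /* points at the opening '[' */
    size_t host_len;   /* length including both brackets */
    int port;          /* -1 when no port was given */
    const char *rest;  /* remainder after host[:port] */
} bracketed_host;

/* Parses "[...]" optionally followed by ":port". Returns false on an
 * unterminated bracket, an empty "[]", a non-numeric or out-of-range
 * port, or unexpected characters after the host. On failure the
 * contents of *out are unspecified. */
static bool parse_bracketed_host(const char *s, bracketed_host *out)
{
    if (s == NULL || out == NULL || s[0] != '[') {
        return false;
    }
    const char *close = strchr(s, ']');
    if (close == NULL || close == s + 1) {
        return false;
    }
    out->host = s;
    out->host_len = (size_t)(close - s) + 1;
    out->port = -1;

    const char *p = close + 1;
    if (*p == ':') {
        const char *d = p + 1;
        long port = 0;
        if (*d < '0' || *d > '9') {
            return false; /* simplification: require at least one digit */
        }
        while (*d >= '0' && *d <= '9') {
            port = port * 10 + (*d - '0');
            if (port > 65535) {
                return false;
            }
            d++;
        }
        out->port = (int)port;
        p = d;
    }
    /* Only end of input or the start of path, query, or fragment may follow. */
    if (*p != '\0' && *p != '/' && *p != '?' && *p != '#') {
        return false;
    }
    out->rest = p;
    return true;
}
```

For "[::1]:8080/index" this reports a 5-character host, port 8080, and "/index" as the remainder; "[::1" is rejected because no ']' is found.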
@yohgaki - Now that I have the cycles to look, parse_url already handles IPv6-like URLs, but only if the bracketed host comes after a scheme. The bug is that |
In my humble opinion, the current implementation of php_url_parse_ex() is broken beyond repair, and should better be replaced by a clean FSA implementing a concrete specification, without referring to memchr() to make some guesses, and later trying to fix up. At least, I have given up long ago trying to fix the various reported issues. Having said this, I'm not against merging this PR. |
I wouldn't be opposed to dropping this PR, and taking a stab at a better state based implementation of parsing a complex URI. I'm more surprised there isn't a library, or some PHP license compatible implementation somewhere. |
There may be – I've never looked for one, though. |
@cmb69 There is apparently no shortage of libraries for this purpose (who knew, he said sarcastically). My question here, and this is more of a the-proper-way-to-do-things question: would it be better to a) find a BSD-licensed (or otherwise PHP-compatible) library and include it in the source, or b) create a config --with-whateverlib option to use the library installed on the host machine? I guess the former would be smarter, since maintaining parse_url would fall to the library and we could remove what exists today. But I'm not sure what the preferred means of incorporating a library into the core is. |
@bp1222 As parse_url() is part of ext/standard, most likely we should bundle the library (I'd suppose the library to be small, anyway), and perhaps offer an option to use the system library as alternative. |
A while ago, I was looking for a URL parser implemented with re2c and couldn't find one. Now I found |
@yohgaki - Seems your request was answered, he MIT'd it. I can take a stab at getting it to work with schemeless-urls, and the other fringe cases defined in tests. |
I've been working on getting the library updated with schemeless URLs and better handling of IPv6 addresses, and I've come across a couple of oddities in the PHP parse_url tests. For example:
I'm wondering why adding a port number would trigger Thoughts? |
parse_url() is supposed to be able to deal with partial URLs, as such I would consider the second output correct and the first wrong. |
Which might be a problem in the first place. Consider the partial URL "example.com": is this the host or the path? |
I would have said host (and add that nothing should be a path unless it starts with a /), but ... that does not seem to match what parse_url() does. Dammit. |
I think the trickery, and special handling in |
The other oddity I'm coming across with the RFC style of parsing a URL would be something like this:
This would oddly parse into
This is because the RFC defines SCHEME as
the Although this parsing is true to the spec, this is a case where I could see us bending the spec to say that a scheme terminated by ':' followed by a digit, like
is NOT a scheme and should be treated as a path. Thoughts? |
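For reference, RFC 3986 defines a scheme as ALPHA followed by any number of ALPHA / DIGIT / "+" / "-" / ".", terminated by ':'. The sketch below is only an illustration of that grammar plus the "':' followed by a digit" check floated above. The function names are hypothetical, and the input "example.com:80" in the note is an illustrative example chosen here, not necessarily the one from the tests.

```c
#include <stdbool.h>
#include <stddef.h>

static bool is_ascii_alpha(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

static bool is_ascii_digit(char c)
{
    return c >= '0' && c <= '9';
}

/* Returns the length of a leading RFC 3986 scheme (excluding the ':'),
 * or 0 when the input does not start with "scheme:". */
static size_t rfc3986_scheme_len(const char *s)
{
    if (s == NULL || !is_ascii_alpha(s[0])) {
        return 0;
    }
    size_t i = 1;
    while (is_ascii_alpha(s[i]) || is_ascii_digit(s[i]) ||
           s[i] == '+' || s[i] == '-' || s[i] == '.') {
        i++;
    }
    return s[i] == ':' ? i : 0;
}

/* The heuristic under discussion: true when the character right after
 * the scheme's ':' is a digit, which looks more like "host:port". */
static bool colon_followed_by_digit(const char *s, size_t scheme_len)
{
    return scheme_len > 0 && is_ascii_digit(s[scheme_len + 1]);
}
```

Under the strict grammar, "example.com:80" has a valid 11-character scheme ("example.com"), which is exactly the spec-conformant but surprising result; the second helper flags it as a port-like case.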
Ignore my thought above. That would break parsing for
|
Thanks for reworking the PR, David! I tried to compile, but my re2c didn't like the C++ style comments in url_parser_ex.re; replacing these with classic C style comments satisfied my re2c. I found that I'm using 0.13.5, but C++ style comments are allowed only as of 0.13.6. For compatibility reasons it's probably best to stick with classic C style comments. Anyhow, with the patch several tests are now failing, so I guess we can't simply switch the implementations. At least that would require further discussion on the internals mailing list, and probably an RFC. Another option might be to leave parse_url() as it is, and add a new function (perhaps deprecating the old parse_url()). Thoughts? |
@CMB I would rather go for a new re2c-based parser without a new function (both internal and user API). About the RFC and discussion, I totally agree. |
Oh yes. This commit is just a work in progress (I could probably drop the PR until it's finished). I didn't commit all the test updates I've been plugging away at. As noted, this implementation does change many of the tests. Most of the previous "failure" tests don't fail now; they're just parsed slightly weirdly (though probably correctly). I concur that this should probably be a replacement rather than maintaining two parsers. So it would probably warrant an RFC and be targeted for 7.2 or 7.3. I'm traveling on vacation for the next two weeks but should be able to pick it back up then. |
I'm fine with a replacement for 7.2, if there'll be an RFC. And yes, there's plenty of time until the 7.2 feature freeze, so no need to hurry. :-) |
Closing this PR, going to implement more of a replacement. |
Any update on this? Seems like it didn't make the 7.2 cut? |
It did not. I haven’t spent time working on more of a replacement rather than a patch. |