Extra '/' collapsing #981

Closed
deweerdt opened this Issue Jul 8, 2016 · 10 comments

Projects

None yet

3 participants

@deweerdt
Contributor
deweerdt commented Jul 8, 2016

I'm opening this as an issue instead of sending a PR because I'm not sure it's an issue at all.

The current code in rewrite_special_paths combines consecutive slashes: https://github.com/h2o/h2o/blob/d31e4729aae30dc60212e006679611ef6dbb8d01/lib/common/url.c#L109 . I haven't found a rationale for that in the parts of rfc3986 that deal with the path component or the path normalization. In particular in the context of proxying, is it valid to merge consecutive slashes?

@kazuho
Member
kazuho commented Jul 9, 2016

Thank you for opening the issue.

It seems that RFC 3986 is written to allow empty path segments (see the rules defined in section 3.3; I also believe that the path resolution rule defined in section 5 is written with the fact in mind).

So if this is an issue, I would suggest fixing rewrite_special_paths to preserve consecutive slashes. But in such case, we might also need to check how the output of the function is used; I assume the output is stored in path_normalized and is used for relative path resolution, etc.

@deweerdt
Contributor

I'm starting to think that the issue could be resolved by having an option to proxy the URL as it was received: we would still normalize it as usual, but what's forwarded would be the original URL. This would have the advantage of minimizing the differences between what clients send and what gets passed to the backend. What do you think?

@deweerdt
Contributor

Mmm, even passing the URL as-is is tricky because we at least want to normalize the part that needs to be replaced, but I think it's doable. I'll send a PR.

@kazuho
Member
kazuho commented Jul 12, 2016

@deweerdt (cc @miyagawa)

I'm starting to think that the issue could be resolved by having an option to proxy the URL as it was received: we would still normalize it as usual, but what's forwarded would be the original URL.

Sounds like an attractive way to resolve the problem.

However, if the issue is just about consecutive slashes, I would appreciate it if you could fix the issue by modifying rewrite_special_paths, due to the following reason.

path_normalized is modeled after PATH_INFO of RFC 3875, and the FastCGI handler uses the value for the purpose. Although RFC 3875 does not allow consecutive slashes within PATH_INFO, it seems that many web servers traditionally have not merged them (see https://github.com/plack/Plack/blob/4d39588569a8f45529af7d69a33308d8c99c3931/lib/Plack/Handler/FCGI.pm#L125-L128).

The current behavior of H2O merging them into is likely to cause interoperability issues for FastCGI users as well, and therefore I would appreciate it if you could fix the issue by modifying rewrite_special_paths (or the caller of the function).

@kazuho
Member
kazuho commented Jul 12, 2016

But if you need to preserve other edge cases (e.g. a URI path containing /../) then using path_normalized (modeled after PATH_INFO) cannot be an option. In the case, the only viable way to fix the issue would be to pass th URL as-is as you proposed.

@miyagawa
miyagawa commented Jul 12, 2016 edited

Normalizing the path makes sense when you're writing an app server or some sort and having h2o as a gateway to pass these environment values to the handler. When you put h2o as an H2 terminator and proxy/forward to another HTTP based server, it would be useful to have a way to access unmodified paths, since a lot of servers interpret URL encodings and paths in a sligthly different way.

Similar to what REQUEST_URI variable would do in FCGI and *SGI protocols. my 2c 😄

@deweerdt
Contributor

However, if the issue is just about consecutive slashes, I would appreciate it if you could fix the issue by modifying rewrite_special_paths, due to the following reason.

It isn't just about consecutive slashes. It was one of the changes in the URLS that I saw, but I think that in the end I'll be working on passing the URLs as-is, for the reasons that @miyagawa stated.

I'm happy to supply a fix for the consecutive slashes after i'm done with passing the url down transparently.

@kazuho
Member
kazuho commented Jul 12, 2016

@deweerdt @miyagawa Thank you for the response.

It isn't just about consecutive slashes. It was one of the changes in the URLS that I saw, but I think that in the end I'll be working on passing the URLs as-is, for the reasons that @miyagawa stated.

Understood.

I'm happy to supply a fix for the consecutive slashes after i'm done with passing the url down transparently.

Great! Thank you.

Looking forward to see the updated PR.

@deweerdt
Contributor
deweerdt commented Jul 13, 2016 edited

I've just pushed #985, which still needs some work to decide what to do about consecutive slahes.

The idea of the PR is essentially: when we normalize the url (that we'll still use to find the handler), we record which character mapped to which input character. This allows us pass the portion of the URL that we forward as we received it.

Please let me know what you think about the approach.

@deweerdt
Contributor

I've just added tests to #985 that verify that slashes are now preserved. I think we can close this issue.

@deweerdt deweerdt closed this Jul 14, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment