Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling of double slashes in path info for relative URLs #491

Closed
arnaud-fontaine opened this issue Feb 4, 2014 · 13 comments · Fixed by #1696
Closed

Handling of double slashes in path info for relative URLs #491

arnaud-fontaine opened this issue Feb 4, 2014 · 13 comments · Fixed by #1696
Labels
Milestone

Comments

@arnaud-fontaine
Copy link

Since commit 7486573 and the introduction of absolute URLs support in HTTP requests, it is not possible anymore to have a double-slash at the beginning of PATH_INFO. Actually, the environ variables from serving.py:WSGIRequestHandler.make_environ() is completely wrong when requesting 'http://localhost:5001//supplySupply' (same issue with current werkzeug Git HEAD) and raise a 404 Not Found error:

{'CONTENT_LENGTH': '',
 'CONTENT_TYPE': '',
 'HTTP_ACCEPT': '*/*',
 'HTTP_CONNECTION': 'Keep-Alive',
 'HTTP_HOST': 'supplySupply',
 'HTTP_USER_AGENT': 'Wget/1.15',
 'PATH_INFO': '',
 'QUERY_STRING': '',
 'REMOTE_ADDR': '127.0.0.1',
 'REMOTE_PORT': 45501,
 'REQUEST_METHOD': 'GET',
 'SCRIPT_NAME': '',
 'SERVER_NAME': '127.0.0.1',
 'SERVER_PORT': '5001',
 'SERVER_PROTOCOL': 'HTTP/1.1',
 'SERVER_SOFTWARE': 'Werkzeug/0.9.4',
 'werkzeug.server.shutdown': <function shutdown_server at 0x24356e0>,
 'wsgi.errors': <open file '<stderr>', mode 'w' at 0x7f5a67e261e0>,
 'wsgi.input': <socket._fileobject object at 0x2432ed0>,
 'wsgi.multiprocess': False,
 'wsgi.multithread': False,
 'wsgi.run_once': False,
 'wsgi.url_scheme': 'http',
 'wsgi.version': (1, 0)}
127.0.0.1 - - [04/Feb/2014 12:25:47] "GET //supplySupply HTTP/1.1" 404 -

As you can see, HTTP_HOST has been wrongly set to 'supplySupply' and PATH_INFO to '' because of urls.py:url_parse() which considers '//supplySupply' to be a netloc and not the path itself.

It seems that Python urlparse() behaves the same even though I'm not sure why because if scheme is not given, AFAIU from RFC 3986, scheme should always be given... Anyhow, this issue is similar to this one reported on urllib2.urlopen() some years ago:
http://bugs.python.org/issue2776

As a side note, Most HTTP server such as Apache and Zope just ignore double-slashes, so URLs such as 'http://localhost:80//path/info' and 'http://localhost:80/path/info' are the same.

FTR, here is the environ dict set in previous version of werkzeug (0.8.3, before using urlparse in WSGIRequestHandler.make_environ()):

{'CONTENT_LENGTH': '',
 'CONTENT_TYPE': '',
 'HTTP_ACCEPT': '*/*',
 'HTTP_CONNECTION': 'Keep-Alive',
 'HTTP_HOST': 'localhost:5001',
 'HTTP_USER_AGENT': 'Wget/1.15 (linux-gnu)',
 'PATH_INFO': '//supplySupply',
 'QUERY_STRING': '',
 'REMOTE_ADDR': '127.0.0.1',
 'REMOTE_PORT': 47148,
 'REQUEST_METHOD': 'GET',
 'SCRIPT_NAME': '',
 'SERVER_NAME': '127.0.0.1',
 'SERVER_PORT': '5001',
 'SERVER_PROTOCOL': 'HTTP/1.1',
 'SERVER_SOFTWARE': 'Werkzeug/0.8.3',
 'werkzeug.server.shutdown': <function shutdown_server at 0x24dd578>,
 'wsgi.errors': <open file '<stderr>', mode 'w' at 0x7fb58fb4b1e0>,
 'wsgi.input': <socket._fileobject object at 0x24dc2d0>,
 'wsgi.multiprocess': False,
 'wsgi.multithread': False,
 'wsgi.run_once': False,
 'wsgi.url_scheme': 'http',
 'wsgi.version': (1, 0)}
127.0.0.1 - - [04/Feb/2014 13:37:45] "GET //supplySupply HTTP/1.1" 200 -

Here is the sample application I used to reproduce this bug easily:

from werkzeug.exceptions import HTTPException
from werkzeug.routing import Map, Rule, NotFound, RequestRedirect

url_map = Map([Rule('/supplySupply')])

def application(environ, start_response):
    urls = url_map.bind_to_environ(environ)
    try:
        endpoint, args = urls.match()
    except HTTPException, e:
        return e(environ, start_response)
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return ['Rule points to %r with arguments %r' % (endpoint, args)]

from werkzeug.serving import run_simple
run_simple('127.0.0.1', 5001, application, use_debugger=True, use_reloader=True)
arnaud-fontaine added a commit to arnaud-fontaine/werkzeug that referenced this issue Feb 4, 2014
@arnaud-fontaine
Copy link
Author

Here is a patch for url_parse(). I'm not sure which one is the best, either patching urlparse() like I did or modifying serving.py:WSGIRequestHandler.make_environ() similarly to what it used to be before commit 7486573... If you think the second way is better, then I can submit another patch.

@untitaker
Copy link
Contributor

Patching url_parse seems more sensible to me, not only because i authored the commit you're referring to, but also because i think url_parse could get more URLs like that. I'm not sure if those are even valid though.

@untitaker
Copy link
Contributor

But actually i can't reproduce this issue with either gunicorn + Apache or the builtin server.

@arnaud-fontaine
Copy link
Author

Well, with the sample code I provided above, I can reproduce the problem straightaway (404 Not Found), so I'm not sure what you mean then... Initially, I stumbled upon this issue with some code using Flask relying in turn on Werkzeug.

I'm not sure either this kind of URL are valid as I may have misinterpreted RFC 3986 and/or I don't understand the rationale behind Python Standard Library url_parse() behavior... Anyone?

@untitaker
Copy link
Contributor

Ok i admit i only tried out one of my Flask apps if they return a 404 if i add some double slashes into the path, but they handled it very well. I will try your code later.

@untitaker
Copy link
Contributor

I can reproduce this bug only with http://127.0.0.1:5001//supplySupply, but not with http://127.0.0.1:5001///supplySupply or http://127.0.0.1:5001////supplySupply.

@untitaker
Copy link
Contributor

Your patch also seems to break for schemeless URLs:

rv = urls.url_parse('//foobar.example.com/test')
self.assert_strict_equal(rv.scheme, '')
self.assert_strict_equal(rv.host, 'foobar.example.com')
self.assert_strict_equal(rv.path, '/test')

Actually i now have to revert my opinion about url_parse having to be patched. It's make_environ which provides a syntactically valid but semantically incorrect URL.

EDIT: http://stackoverflow.com/a/20524044 and this from the RFC:

 If a URI does not contain an authority component, then the path cannot begin
 with two slash characters ("//").

I still wonder how one would find out if //foobar.com/test contains an
authority component (schemeless URL), or if this is just a path (scheme- and
hostless). I am not sure if the HTTP
standard
allows schemeless URLs:

URIs in HTTP can be represented in absolute form or relative to some known base
URI [11], depending upon the context of their use.

[11] only refers to the relevant RFC for relative URLs, and there's no
mentioning of the keyword scheme in the whole HTTP RFC. I think we should
start to disallow either schemeless or absolute URLs in HTTP requests.

@arnaud-fontaine
Copy link
Author

Yes, the patch break schemeless URLs, but I'm not sure how this is supposed to work if no scheme has been provided or at least it does not make sense to me in Werkzeug context. As far as I understand RFC 3986, scheme should always be given... My only concern with my patch on url_parse() is that the behavior would be different from Python urlparse() and I'm not sure that's the way to go...

By the way, why there would be absolute URLs in HTTP requests to start with? Would you like me to submit another patch for make_environ() instead? (similar to what have been done there: http://bugs.python.org/issue2776).

Thanks for your review and quick responses!

@untitaker
Copy link
Contributor

I don't see a reason why absolute URLs should be supported, but by the HTTP RFC it is valid and if we don't implement it Werkzeug would choke on some clients.

@untitaker
Copy link
Contributor

There hasn't been a clear outcome on this issue. Is it still one?

@kparal
Copy link

kparal commented May 13, 2015

Sorry, is this supposed to have been fixed? I assume it wasn't.

There are some reports around the internet mentioning this issue, e.g.:
semanticize/semanticizer#21

And with our own project, we've hit this as well. We have a Flask application deployed with werkzeug and access it from a different tool which supports defining a URL of the Flask app in a config file. If the user sets http://0.0.0.0:5000/api/v1.0, we then append something like /jobs and everything works fine. But if the user sets http://0.0.0.0:5000/api/v1.0/ (a trailing slash), the result is http://0.0.0.0:5000/api/v1.0//jobs and that doesn't work, per this bug, yielding 404 Not Found.

It's not that difficult to work around, but it would be nice if it worked out of the box. Thanks.

@untitaker
Copy link
Contributor

It wasn't fixed, but just stale.

@untitaker untitaker reopened this May 14, 2015
@lzecca78

This comment has been minimized.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants