Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question mark eaten from request URL if no params specified #2912

Closed
mloskot opened this issue Dec 2, 2015 · 22 comments
Closed

Question mark eaten from request URL if no params specified #2912

mloskot opened this issue Dec 2, 2015 · 22 comments

Comments

@mloskot
Copy link

mloskot commented Dec 2, 2015

Here is an URL with question mark included, but without any parameters:

url='http://server.com/api/item.json?'

That URL (w/ question mark) controls server response in such way that it allows to request certain metadata about an item, instead of representation of an item iteself:

import urllib3
http = urllib3.PoolManager()
r = http.request('GET', url)
print (r.data)
{ "ItemMetadata" : { ... }}

The above works well.

Requests, however, eat-or-skip the question mark if the params argument is missing/None/{}, so my request is responded with the resource representation, instead of the metadata:

from requests import get
r = get(url, stream=True)
print(r.text)
{ "Item" : { ... }}

That seem to indicate a bug in URL parsing.

@Lukasa
Copy link
Member

Lukasa commented Dec 2, 2015

This is not a bug in URL parsing: quite the opposite.

Requests does its best to normalise URLs where it is safe to do so. This is necessary to ensure that various stages of URL building work appropriately, and that we don't accidentally end up with invalid headers (e.g. multiple question marks or none at all when parameters are present). A trailing question mark with no parameters is superfluous (and arguably somewhat invalid), so requests kindly strips it off for you. This is entirely intentional.

For what it's worth, my opinion is that this API is poorly designed. It relies on the idea that no middle-box will rewrite that URL to remove the empty query part, which is a risk. Generally speaking, the better API would be to have a sub-resource or a proper query string (e.g. /api/item.json?metadata=true, or /api/item/metadata.json). This API design is super fragile, and I guarantee that it'll be prone to breaking in mysterious ways. I doubt we're the only framework that strips it. In fact, anything that relies on Python's standard urlsplit function will probably do exactly what we do and ignore the query part.

However, both browsers and curl will, if requested, send the empty query string. Having chatted to @bagder I don't think this is a deliberate decision on the part of those entities, so much as it just happens to fall out of the way they handle query strings.

Given that we have a duty to be better than the Python standard library, that means I think we should come up with a way to make it possible, at least when using the PreparedRequest API. Sadly, urlsplit makes this really hard for us, because it doesn't have a difference in the query portion of the URL for http://http2bin.org/get' andhttp://http2bin.org/get?`. That makes our lives really tricky because I'd rather not hand roll URL parsing if I can possibly avoid it.

I think the only way we can fix this is to change the library we use to parse URLs to one that can safely inform us of the difference between these two cases. @sigmavirus24, do you think your URL parsing library would do better here?

I think this is a reasonable request, but we can't get to it before 3.0.0 because moving to a new URL parsing library will probably break a whole lot of stuff.

@mloskot
Copy link
Author

mloskot commented Dec 2, 2015

This is not a bug in URL parsing: quite the opposite.

I would appreciate clarification that would help me to understand the rules behind the URL parsers behaviour at large.

Is that URL parsers interpret the following from RFC 1738, section 3.3, that the "?" (question mark) is not part of the query string?

http://<host>:<port>/<path>?<searchpart>

Does also the grammar in RFC 3986, Appendix A., indicate the "?" is not part of the actual query string?

URI           = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

If so, for URL like http://server.com/item.json?, while applying the suggested (de|re)composition algorithm, i.e.:

      if defined(query) then
         append "?" to result;
         append query to result;
      endif;

parsers rightly assume the query component as undefined, hence superfluous.

Is my understanding correct?

@Lukasa
Copy link
Member

Lukasa commented Dec 2, 2015

Your understanding is correct, at least as far as the standard library URL parser goes. That is not necessarily generally true of URL parsers, which is why I asked @sigmavirus24 if his URL parser has the same behaviour or not. In particular, it's not clear to me that an absent query part is semantically identical to an empty query part: Appendix A of RFC 3986 does appear to allow zero-length query strings.

Again, I think this is strictly acceptable, but it's likely to be extremely fragile.

@mloskot
Copy link
Author

mloskot commented Dec 2, 2015

@Lukasa I see. Yes, I'm concerned about the standard library URL parser, but also curious what is a common practice in other parsers. So, thanks for the clarifications so far.

From your earlier response:

This API design is super fragile, and I guarantee that it'll be prone to breaking in mysterious ways.

That is exactly what worries me, so I'm trying to understand what are the super fragile traits exactly and what are the ways the API may fails.

@Lukasa
Copy link
Member

Lukasa commented Dec 2, 2015

Sure. So, my reason for thinking it's fragile is mostly because this is clearly a grey area of the URL specification. URLs and URL normalisation are fraught topics, and as a result it's unwise to build an API that assumes that two semantically equivalent URLs with different representations will be transmitted without potentially being translated between those representations.

What's not clear to me is whether the zero-length query string is one of them. I think this is because it's simply in a grey area: some entities treat it as different to no query string at all, others do not. For that reason it's fragile, not wrong: while it may work in many cases, I think it's likely that it'd be a source of subtle bugs.

@sigmavirus24
Copy link
Contributor

@Lukasa I think we strip it but I can imagine a way to avoid that.

@mloskot
Copy link
Author

mloskot commented Dec 2, 2015

@Lukasa Thanks very much. I sympathise with your critique.

@mloskot
Copy link
Author

mloskot commented Dec 3, 2015

FYI,

Out of curiosity, not to prove my point, I've probed a bit the landscape of HTTP client libraries.

I've tried my target question-mark-powered API with popular clients for

  • Python (http.client, requests, urllib, urllib3)
  • Node (request, shred, unirest, urllib)
  • C# (WebRequest, WebClient, HttpClient, EasyHttp, RestSharp)
  • Go (http/net)
  • D (std.net.curl - powered by libcurl)

Here is complete project https://github.com/mloskot/http-url-test

The result is: 4 of those libraries behave like the Python Requests:

It makes a wild statistics of ~30% clients affected by a fragile question-mark-powered resource API.

@mloskot
Copy link
Author

mloskot commented Dec 6, 2015

I received some pointers to further details in RFC 3986 (see Is question mark in URL part of query string?). In particular, 6.2.3. Scheme-Based Normalization which somewhat applies to this issue and the "question mark" handling. Quoting:

Normalization should not remove delimiters when their associated component is empty unless licensed to do so by the scheme specification. For example, the URI http://example.com/? cannot be assumed to be equivalent to any of the examples above.

Where "examples above" refers to:

http://example.com
http://example.com/
http://example.com:/
http://example.com:80/

@Lukasa
Copy link
Member

Lukasa commented Dec 6, 2015

So I agree that we should be persisting the query part. The problem is that our supporting libraries don't make it clear whether there was no query part or just an empty one.

@sigmavirus24
Copy link
Contributor

So rfc3986 now supports this use case as of 0.3.1.

@mloskot
Copy link
Author

mloskot commented Dec 10, 2015

FYI, here is an example of API similar to the one in my issue above:

https://confluence.ucop.edu/display/Curation/ARK

a brief metadata record if you append a single question mark to the ARK
(...)
Adding '??' to the end should return a policy statement.

@Lukasa
Copy link
Member

Lukasa commented Dec 10, 2015

@mloskot Sadly, lots of people make bad decisions. =(

@sigmavirus24
Copy link
Contributor

@mloskot constantly posting information here will not get this fixed any faster. You're merely spamming 781 people (- the people who have already unsubscribed).

@mloskot
Copy link
Author

mloskot commented Dec 10, 2015

@Lukasa Yes. That example surprised me and I thought I will share it as an example of bad design..

@sigmavirus24

constantly posting information here will not get this fixed any faster.

You are misreading my intention - I'm not trying to press anything at all.
I'm simply sharing some further details which I believe are relevant to this issue.
I'm sorry if I caused any annoyance.

@kennethreitz
Copy link
Contributor

Workaround: you could probably percent-encode the question mark.

@cabrinha
Copy link

cabrinha commented Apr 8, 2016

I'd just like some way to remove the url that requests is inserting into my url. I never asked requests to insert a '?' into my url...

@Lukasa
Copy link
Member

Lukasa commented Apr 8, 2016

@internaught What code do you have that causes requests to insert a question-mark?

@cabrinha
Copy link

cabrinha commented Apr 8, 2016

@Lukasa I'm simply doing this:

return requests.get(
            'https://my-{}-api-url.com/api/0.1'.format(
                random.randrange(1, 9)),
            '/thing/my_thing/{}'.format(other_thing), headers=self.headers
        )

Thanks for the quick reply!

@Lukasa
Copy link
Member

Lukasa commented Apr 8, 2016

@internaught You have a comma between the two parts of your URL. That means your second string is being passed to the params parameter of the requests.get call, which means requests assumes it's a query string. Replace the comma with a + and you'll be fine.

@cabrinha
Copy link

cabrinha commented Apr 8, 2016

@Lukasa thanks for the help, working great now. Simple mistake, thanks!

@poleguy
Copy link

poleguy commented Oct 20, 2017

I'll leave this here for the next guy bitten by this...
The mini-circuits ZTM boxes use a bare ? to query the state. https://www.minicircuits.com/softwaredownload/ztm_rcm.html

I was able to work around the behavior of requests by putting a space after the question mark.

@psf psf locked and limited conversation to collaborators Oct 21, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

6 participants