Is it safe to rely on presence of content-encoding response header? #20

Closed
inactivist opened this Issue Mar 30, 2013 · 3 comments


After doing some code diving, it appears Py-StackExchange assumes uncompressed responses in the absence of a received Content-Encoding header. If I'm reading the WebRequestManager.request() code correctly, that is. (It's late, and I may be loopy.)

Per the StackExchange API documentation the only times a response will not be GZIP-encoded are:

  1. If you requested other (DEFLATE) encoding, or
  2. if an error occurs very early in the API request processing.

How to properly consume API responses

If Content-Encoding is set on the response, use the specified algorithm. If it is missing, assume GZIP.

If a response is not compressed, this suggests that a proxy between the user and us is intentionally decompressing content, or that errors are occurring very early in processing requests. You can detect uncompressed content by checking for the appropriate magic numbers, assuming your library cannot detect this error for you.

The API will never return an uncompressed response during normal operation.

So: if the Content-Encoding header is missing and you requested GZIP encoding in the original request, it's GZIP. If you didn't request any particular encoding, it's GZIP. If it can't be decoded as GZIP, it's because a processing error occurred (or, per the docs quoted above, because a proxy decompressed it along the way).

I'm not sure how to detect compressed vs. uncompressed responses but it appears that peeking into the response data stream to determine actual data format may be necessary.
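One way to peek without consuming the stream, as a rough sketch rather than anything taken from Py-StackExchange: wrap the response in a buffered reader and look at the first two bytes. The name `peek_is_gzip` and the in-memory sample streams are hypothetical and only there for illustration.

```python
import gzip
import io

# First two bytes of any gzip member (ID1, ID2).
GZIP_MAGIC = b"\x1f\x8b"

def peek_is_gzip(stream):
    """Return True if the next two bytes are the gzip magic number.

    `stream` must be an io.BufferedReader; peek() does not advance the
    read position, so the full body can still be read afterwards.
    """
    head = stream.peek(2)[:2]  # peek() may return more than 2 bytes
    return head == GZIP_MAGIC

# Stand-ins for a compressed API response and an early, uncompressed error.
compressed = io.BufferedReader(io.BytesIO(gzip.compress(b'{"items": []}')))
plain = io.BufferedReader(io.BytesIO(b'{"error_id": 400}'))

print(peek_is_gzip(compressed))
print(peek_is_gzip(plain))
```

Because nothing is consumed, the caller can still hand the same stream to a gzip decoder or a JSON parser afterwards.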

Or is it safe to assume that, in the absence of the Content-Encoding response header, GZIP content was already decompressed by an upstream agent (a proxy?) that also removed the header? This might be true, but I don't know.

(This is one of those edge case issues that may only arise in rare circumstances, just wanted to raise the issue and get some feedback.)

Owner
lucjon commented Mar 30, 2013

Hello,

It's been a while, but I'm sure I've had problems with the gzip compression/Content-Encoding header before. Looking over the code and the docs you've pointed out (thanks!), it looks like checking for a magic number is really the correct solution here. The gzip format magic number, which I am presuming is also used in web responses, is 0x1f 0x8b. This should be relatively foolproof, especially as raw DEFLATE decompression, etc. don't need to be supported.
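A minimal sketch of that check, assuming the whole response body is already in memory as bytes. The helper name `maybe_gunzip` is made up for this example and is not the function used in the library.

```python
import gzip

# gzip members always start with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"

def maybe_gunzip(body):
    """Decompress `body` if it starts with the gzip magic number.

    Anything else is returned unchanged, which covers bodies that a
    proxy already decompressed and early errors sent uncompressed.
    """
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body)
    return body

raw_json = b'{"items": [], "has_more": false}'
print(maybe_gunzip(gzip.compress(raw_json)) == raw_json)
print(maybe_gunzip(raw_json) == raw_json)
```

A JSON body begins with `{` or `[`, so an uncompressed response should not be mistaken for gzip by this test.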

Presuming this is the case, I'll rustle up a patch, hopefully tonight.

Thanks for getting in touch.

@lucjon lucjon referenced this issue Mar 30, 2013 in: 2.0 API status #21 (Closed)

@lucjon Thanks for the prompt response. This may be one of those "shouldn't happen" problems, but I thought it might be worth investigating -- especially since I saw comments on your stackapps.com page, and read the API docs about compression.

@lucjon lucjon added a commit that referenced this issue Mar 31, 2013
@lucjon Fix issue #20 (gzip encoding)
Now detect when gzip is actually being used via magic number instead of
taking the server's word for it.
65112fe
Owner
lucjon commented Mar 31, 2013

I've switched it to use the magic number method, and, at least in my "perfect" environment (no proxies, nefarious routers, etc.) the tests still pass. The only issue I can think of is perhaps an endianness one as far as the magic value itself is concerned, but I'm fairly sure Python's networking library deals with that somewhere along the line; regardless, I only have an x86 machine to test it on.
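On the endianness question: the gzip magic number is defined as two bytes in a fixed order (0x1f, then 0x8b), so a check that compares bytes directly should not depend on the host CPU. Byte order only comes into play if the two bytes are unpacked into a 16-bit integer first. The snippet below is an illustrative sketch, not the code from the commit.

```python
import gzip
import struct

data = gzip.compress(b"payload")

# Comparing raw bytes: the order is fixed by the gzip format itself.
bytes_match = data[:2] == b"\x1f\x8b"

# Unpacking as an integer: the result depends on the byte order chosen.
as_little_endian = struct.unpack("<H", data[:2])[0]
as_big_endian = struct.unpack(">H", data[:2])[0]

print(bytes_match)
print(hex(as_little_endian), hex(as_big_endian))
```

Sticking to the byte comparison avoids having to pick a byte order at all.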

Thanks for the report; the downside of network programming is definitely the frequency with which "impossible" situations tend to occur.

@lucjon lucjon closed this Sep 2, 2014