Permalink
4 comments
on commit
sign in to comment.
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Browse files
updated `get_encoding_from_headers` to return utf-8 if the content ty…
…pe is set to application/json, following RFC 4627. fixes #5667
- Loading branch information
This comment has been minimized.
The
Response.jsonmethod already includes proper guessing of the JSON encoding based on the first four octets as explained in RFC 4627 section 3. This newly introduced simplification fully disables this more advanced detection, as it sets self.encoding, and break usage of exotic JSON APIs serving JSON encoded in UTF variations with Bit Order Mark. I would recommend to revert this change. Especially for a patch release (2.5.1) this change is an unexpected behaviour change.This comment has been minimized.
As the contributor I don't know enough about the library to weigh in, but the reason for this PR was based on #5667. Can you take a look and see if there is another solution that will solve that issue?
In short, the issue is that the
Response.textmethod does not default to utf-8 encoding whenapplication/jsonis set as the content type. RFC 4627 says that utf-8 should be the default encoding, and like you said it's possible to detect if the encoding is instead utf-16 or utf-32 based on the first two octets. That said, the functionality that I encountered and want to contribute to fixing is thatResponse.textuseschardetto identify the encoding for the content even ifapplication/jsonis set as the content type, rather than defaulting to utf-* based on RFC 4627.I'm more than happy to create a different PR for a quick review based on what you would like.
To quickly reproduce the issue in #5667, you can run the following code:
I would expect the 2nd print statement to be identical to the first -- i.e.
r.json()andjson.loads(r.text)to be identical -- provided thatapplication/jsonis set in the response header.This comment has been minimized.
Thanks for bringing this to our attention, @k0ssi! It sounds like this needs some more work. We may need to move the
json()content-type logic out into another function we can reuse. Would one of you mind opening a new issue to track the discussion? These comments don't have great discoverability here.This comment has been minimized.
Unfortunately I dont have much time to take a closer look at the this at the moment. I'll have a look at it on the weekend and then get back to you next week. At the moment we first have to figure out what causes the issue @jjmaldonis tells. At first glance this looks to me as a encoding error within text() not json()
@jjmaldonis by setting an charset within get_encoding_from_headers function which is used to build the response there is no guessing while using .json() since charset is already set. Since there is no guarantee everyone is using 'utf-8' we cannot simply assume it is.
In my case this causes an error since api im working with is using 'utf-8 bom'