Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
updated
get_encoding_from_headers
to return utf-8 if the content ty…
…pe is set to application/json, following RFC 4627. fixes #5667
- Loading branch information
5855dd7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The
Response.json
method already includes proper guessing of the JSON encoding based on the first four octets as explained in RFC 4627 section 3. This newly introduced simplification fully disables this more advanced detection, as it sets self.encoding, and break usage of exotic JSON APIs serving JSON encoded in UTF variations with Bit Order Mark. I would recommend to revert this change. Especially for a patch release (2.5.1) this change is an unexpected behaviour change.5855dd7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the contributor I don't know enough about the library to weigh in, but the reason for this PR was based on #5667. Can you take a look and see if there is another solution that will solve that issue?
In short, the issue is that the
Response.text
method does not default to utf-8 encoding whenapplication/json
is set as the content type. RFC 4627 says that utf-8 should be the default encoding, and like you said it's possible to detect if the encoding is instead utf-16 or utf-32 based on the first two octets. That said, the functionality that I encountered and want to contribute to fixing is thatResponse.text
useschardet
to identify the encoding for the content even ifapplication/json
is set as the content type, rather than defaulting to utf-* based on RFC 4627.I'm more than happy to create a different PR for a quick review based on what you would like.
To quickly reproduce the issue in #5667, you can run the following code:
I would expect the 2nd print statement to be identical to the first -- i.e.
r.json()
andjson.loads(r.text)
to be identical -- provided thatapplication/json
is set in the response header.5855dd7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for bringing this to our attention, @k0ssi! It sounds like this needs some more work. We may need to move the
json()
content-type logic out into another function we can reuse. Would one of you mind opening a new issue to track the discussion? These comments don't have great discoverability here.5855dd7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately I dont have much time to take a closer look at the this at the moment. I'll have a look at it on the weekend and then get back to you next week. At the moment we first have to figure out what causes the issue @jjmaldonis tells. At first glance this looks to me as a encoding error within text() not json()
@jjmaldonis by setting an charset within get_encoding_from_headers function which is used to build the response there is no guessing while using .json() since charset is already set. Since there is no guarantee everyone is using 'utf-8' we cannot simply assume it is.
In my case this causes an error since api im working with is using 'utf-8 bom'