Skip to content

Commit

Permalink
updated get_encoding_from_headers to return utf-8 if the content ty…
Browse files Browse the repository at this point in the history
…pe is set to application/json, following RFC 4627.

fixes #5667
  • Loading branch information
jjmaldonis committed Nov 29, 2020
1 parent d0359c9 commit 5855dd7
Showing 1 changed file with 4 additions and 0 deletions.
4 changes: 4 additions & 0 deletions requests/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -503,6 +503,10 @@ def get_encoding_from_headers(headers):
if 'text' in content_type:
return 'ISO-8859-1'

if 'application/json' in content_type:
# Assume UTF-8 based on RFC 4627: https://www.ietf.org/rfc/rfc4627.txt since the charset was unset
return 'utf-8'


def stream_decode_response_unicode(iterator, r):
"""Stream decodes a iterator."""
Expand Down

4 comments on commit 5855dd7

@k0ssi
Copy link

@k0ssi k0ssi commented on 5855dd7 Dec 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Response.json method already includes proper guessing of the JSON encoding based on the first four octets as explained in RFC 4627 section 3. This newly introduced simplification fully disables this more advanced detection, as it sets self.encoding, and break usage of exotic JSON APIs serving JSON encoded in UTF variations with Bit Order Mark. I would recommend to revert this change. Especially for a patch release (2.5.1) this change is an unexpected behaviour change.

@jjmaldonis
Copy link
Contributor Author

@jjmaldonis jjmaldonis commented on 5855dd7 Dec 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the contributor I don't know enough about the library to weigh in, but the reason for this PR was based on #5667. Can you take a look and see if there is another solution that will solve that issue?

In short, the issue is that the Response.text method does not default to utf-8 encoding when application/json is set as the content type. RFC 4627 says that utf-8 should be the default encoding, and like you said it's possible to detect if the encoding is instead utf-16 or utf-32 based on the first two octets. That said, the functionality that I encountered and want to contribute to fixing is that Response.text uses chardet to identify the encoding for the content even if application/json is set as the content type, rather than defaulting to utf-* based on RFC 4627.

I'm more than happy to create a different PR for a quick review based on what you would like.

To quickly reproduce the issue in #5667, you can run the following code:

import requests
import json

r = requests.get("https://api.covidtracking.com/v1/us/current.json")
r._content = b'{"name":"rd\xce\xba","uuid":"1234"}'
r.headers = {
    "Content-Type": "application/json",
}
print(r.json())            # Prints:  {'name': 'rdκ', 'uuid': '1234'}  which is perfect
print(json.loads(r.text))  # Prints:  {'name': 'rdκ', 'uuid': '1234'}

I would expect the 2nd print statement to be identical to the first -- i.e. r.json() and json.loads(r.text) to be identical -- provided that application/json is set in the response header.

@nateprewitt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for bringing this to our attention, @k0ssi! It sounds like this needs some more work. We may need to move the json() content-type logic out into another function we can reuse. Would one of you mind opening a new issue to track the discussion? These comments don't have great discoverability here.

@k0ssi
Copy link

@k0ssi k0ssi commented on 5855dd7 Dec 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately I dont have much time to take a closer look at the this at the moment. I'll have a look at it on the weekend and then get back to you next week. At the moment we first have to figure out what causes the issue @jjmaldonis tells. At first glance this looks to me as a encoding error within text() not json()

@jjmaldonis by setting an charset within get_encoding_from_headers function which is used to build the response there is no guessing while using .json() since charset is already set. Since there is no guarantee everyone is using 'utf-8' we cannot simply assume it is.
In my case this causes an error since api im working with is using 'utf-8 bom'

Please sign in to comment.