requests.get() ignores charset=UTF-8 and BOM #654
Comments
If the server doesn't specify an encoding, one will automatically be detected via chardet.
If you're saying that requests uses chardet, then this isn't working properly for me. Perhaps I need to dive into the code to see what's happening?
Requests uses chardet if it is available, yes. Can you share the URL you're having trouble with?
Ah, this is because the server responds with [...]. This is standards compliance behavior: http://en.wikipedia.org/wiki/ISO/IEC_8859-1#History
You can do the following to ignore server-set encodings:
    r = requests.get('http://www.stanford.edu/dept/visitorinfo/activities/dining.html')
    r.encoding = None
Then, the contents of [...]
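To illustrate the standards rule referred to above: HTTP/1.1 (RFC 2616) says a `text/*` response with no `charset` parameter defaults to ISO-8859-1. The following is a minimal stdlib-only sketch of that rule, not the code requests actually runs; the function name `encoding_from_content_type` is hypothetical.

```python
from email.message import Message


def encoding_from_content_type(content_type):
    """Return the encoding implied by a Content-Type header value.

    Hypothetical helper: illustrates the header-only rule, nothing more.
    """
    msg = Message()
    msg["Content-Type"] = content_type

    charset = msg.get_param("charset")
    if isinstance(charset, tuple):
        # RFC 2231 encoded parameter: (charset, language, value)
        charset = charset[2]
    if charset:
        return charset

    # No explicit charset: text/* falls back to ISO-8859-1 per RFC 2616.
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"

    # Non-text types carry no default text encoding.
    return None
```

Under this rule, a `charset=UTF-8` declared only in an HTML META tag never comes into play, because only the header is inspected.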
If there is a BOM, wouldn't it be reasonable to override [...]
I've seen some discussion that charset=UTF-8 isn't sufficient because requests looks at the HTTP headers rather than the body of the response (and I can understand the desire to avoid parsing the HTML looking for the META tag). However, if a page has a UTF-8 Byte Order Mark, I'd think that would be sufficient indication to treat the body as UTF-8.
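The BOM check proposed here could look roughly like the sketch below. It is only an illustration of the suggestion, not something requests is known to do; `decode_with_bom_override` and its `header_encoding` parameter are hypothetical names.

```python
import codecs


def decode_with_bom_override(body, header_encoding="ISO-8859-1"):
    """Decode raw response bytes, letting a UTF-8 BOM win over the header.

    Hypothetical helper: `header_encoding` stands in for whatever encoding
    was derived from the HTTP headers.
    """
    if body.startswith(codecs.BOM_UTF8):
        # Strip the 3-byte BOM and trust it as a UTF-8 signal.
        return body[len(codecs.BOM_UTF8):].decode("utf-8")

    # No BOM: fall back to the header-derived encoding.
    return body.decode(header_encoding, errors="replace")
```

With requests, the raw bytes are available as `r.content`, so a caller might try `decode_with_bom_override(r.content, r.encoding or "ISO-8859-1")` instead of reading `r.text`.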