Remove ISO-8859-1 charset fallback #2086

Open
Lukasa opened this Issue Jun 7, 2014 · 33 comments


@Lukasa
Collaborator
Lukasa commented Jun 7, 2014

For a long time we've had a fallback value in response.encoding of ISO-8859-1, because RFC 2616 told us to. RFC 2616 is now obsolete, replaced by RFCs 7230, 7231, 7232, 7233, 7234, and 7235. The authoritative RFC on this issue is RFC 7231, which has this to say:

The default charset of ISO-8859-1 for text media types has been removed; the default is now whatever the media type definition says.

The media type definitions for text/* are most recently affected by RFC 6657, which has this to say:

In accordance with option (a) above, registrations for "text/*" media types that can transport charset information inside the corresponding payloads (such as "text/html" and "text/xml") SHOULD NOT specify the use of a "charset" parameter, nor any default value, in order to avoid conflicting interpretations should the "charset" parameter value and the value specified in the payload disagree.

I checked the registration for text/html here. Unsurprisingly, it provides no default values. It does allow a charset parameter which overrides anything in the content itself.

I propose the following changes:

  1. Remove the ISO-8859-1 fallback, as it's no longer valid (it was only ever mandated by RFC 2616). We should definitely do this. (The current fallback behaviour is sketched below.)
  2. Consider writing a module that has the appropriate fallback encodings for other text/* content and use them where appropriate. This isn't vital, just a "might be nice".
  3. Begin checking HTML content for meta tags again, in order to fall back appropriately. This is controversial, and we'll want @kennethreitz to consider it carefully.
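
For reference, this is roughly how the fallback surfaces today through requests.utils.get_encoding_from_headers (a minimal demonstration against the 2.x helper, not part of the proposal itself):

from requests.utils import get_encoding_from_headers

# An explicit charset parameter in the header always wins.
get_encoding_from_headers({'content-type': 'text/html; charset=utf-8'})
# -> 'utf-8'

# Any text/* type without a charset currently gets the RFC 2616 default.
get_encoding_from_headers({'content-type': 'text/html'})
# -> 'ISO-8859-1'

# Non-text types get no guess at all, so Response.text ends up using chardet.
get_encoding_from_headers({'content-type': 'application/xml'})
# -> None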
@sigmavirus24
Collaborator

Remove the ISO-8859-1 fallback, as it's no longer valid (being only enforced by RFC 2616). We should definitely do this.

I agree that we should remove the fallback. I do wonder how we should handle Response#text in the event that the server does not specify a charset (in any way, including the meta tags of the body). Should we disable Response#text conditionally, either through an exception or something else? Not doing so will rely more heavily on chardet, which I have decreasing confidence in given the number of new encodings it does not detect.

Consider writing a module that has the appropriate fallback encodings for other text/* content and use them where appropriate. This isn't vital, just is a "might be nice".

Given that this is not guaranteed to be included in requests, I'm fine with adding it to the toolbelt. That said, I'm also okay with making this a separate package so users can just use it without having to install the rest of the toolbelt. That, however, is a separate discussion.

Begin checking HTML content for meta tags again, in order to appropriately fall back. This is controversial, and we'll want @kennethreitz to consider it carefully.

We still have a method to do this in utils, right? I don't like the idea in the slightest, but it won't cost extra effort. That said, we have to make sure any charset provided in the media type takes precedence.
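
Assuming the utils helper meant here is requests.utils.get_encodings_from_content, keeping that precedence could look roughly like this (a sketch only; the URL is hypothetical):

import requests
from requests.utils import get_encodings_from_content

r = requests.get('http://example.invalid/page')  # hypothetical HTML page

if 'charset=' in r.headers.get('Content-Type', ''):
    # A charset in the media type must take precedence.
    encoding = r.encoding
else:
    # Otherwise fall back to whatever the markup itself declares, if anything.
    declared = get_encodings_from_content(r.content.decode('ISO-8859-1'))
    encoding = declared[0] if declared else None

text = r.content.decode(encoding) if encoding else r.text  # r.text uses chardet here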

@sigmavirus24
Collaborator

Upon reading more into RFC 7231, specifically Section 3.1.1.5, I think the third option should ideally be opt-in, not opt-out. My specific reasoning for this is:

Clients that do so [examine a payload's content] risk drawing incorrect conclusions, which might expose additional security risks (e.g., "privilege escalation").

Taken from the same section I linked above.

@Lukasa
Collaborator
Lukasa commented Jun 8, 2014

Agreed from a correctness perspective, but I wonder if @kennethreitz is going to want it from a usability perspective.

@sigmavirus24
Collaborator

I wonder how easy it would be to prop up a simple app to demonstrate the security risks involved to give a concrete example why not to do it.

@untitaker
Contributor

If only 1.) is going to be implemented, I guess r.encoding will be None and requests will use chardet?

@Lukasa
Collaborator
Lukasa commented Jun 12, 2014

Correct. =)

@sigmavirus24
Collaborator

That's how it works now, so I don't think we'd change that.
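
Concretely, that path looks like this today (the URL is hypothetical, standing in for a server that sends text/html with no charset):

import requests

r = requests.get('http://example.invalid/')  # hypothetical: text/html, no charset

r.encoding            # 'ISO-8859-1' today, from the RFC 2616 fallback
r.apparent_encoding   # chardet's guess, computed from the body bytes

# With the fallback removed, this is what would happen implicitly:
r.encoding = None
text = r.text         # decoded using the chardet guess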

@gsnedders

I wonder how easy it would be to prop up a simple app to demonstrate the security risks involved to give a concrete example why not to do it.

Look up all the UTF-7 XSS attacks. (None work in any current browser, as everyone simply dropped UTF-7 sniffing — and most UTF-7 support entirely — to avoid making sites thus vulnerable.)

In a very real sense, using chardet is worse than option three above: it will draw different conclusions from any implementation that follows a specification defining how to sniff the content (and both HTML and XML provide such a specification). The only safe thing to do, if you don't know how to determine the charset, is not to try. You can probably support the vast majority of users by implementing the (standardised) HTML and XML character encoding detection algorithms.
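
To make the shape of that concrete, here is a heavily simplified sketch of the precedence being described (header charset, then BOM, then a bounded meta prescan); the real HTML and XML algorithms are considerably more involved:

import codecs
import re

def sniff_encoding(body, header_charset=None):
    # 1. A charset from the Content-Type header wins outright.
    if header_charset:
        return header_charset
    # 2. Then a byte order mark, if present.
    for bom, name in ((codecs.BOM_UTF8, 'utf-8'),
                      (codecs.BOM_UTF16_LE, 'utf-16-le'),
                      (codecs.BOM_UTF16_BE, 'utf-16-be')):
        if body.startswith(bom):
            return name
    # 3. Finally, a prescan of the first 1024 bytes for a <meta> declaration.
    m = re.search(br'<meta[^>]+charset=["\']?([\w.-]+)', body[:1024], re.IGNORECASE)
    return m.group(1).decode('ascii') if m else None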

I checked the registration for text/html here. Unsurprisingly, it provides no default values. It does allow a charset parameter which overrides anything in the content itself.

Hmm, the registration (which is included inline in the HTML spec) contradicts the HTML spec itself — per the HTML spec, UTF-8/UTF-16 BOMs are given precedence over the MIME type. I've filed bug 26100 for that.

@kennethreitz
Owner

Hmmm....

@gsnedders

http://html5.org/tools/web-apps-tracker?from=8723&to=8724 fixes the IANA registration in the HTML spec to match the body of it. It now reads:

The charset parameter may be provided to specify the document's character encoding, overriding any character encoding declarations in the document other than a Byte Order Mark (BOM). The parameter's value must be one of the labels of the character encoding used to serialise the file. [ENCODING]

@sigmavirus24 sigmavirus24 added this to the 3.0.0 milestone Oct 5, 2014
@erowan
erowan commented Feb 25, 2015

Hello,

I just got hit by this reading XML files that were encoded as UTF-8. On OS X the content type was being returned as 'application/xml', but on Linux it was set to 'text/xml', so requests assumed its default encoding of 'ISO-8859-1' because 'text' was in the content type. Most XML files will be encoded in UTF-8, so setting the encoding to 'ISO-8859-1' for 'text/xml' content is surely wrong, as discussed.

@kennethreitz
Owner

Just because an RFC specifies something, doesn't mean we should do it. Especially if it makes the code crazy.

I believe that our current behavior is elegant and actually works quite effectively. Is this not the case?

As always, what does Chrome do?

@Lukasa
Collaborator
Lukasa commented Mar 5, 2015

Chrome introspects the HTML, something we've always decided we don't want to do. We could optionally add support for hooks to do content-type-specific encoding heuristics if we wanted. We already kinda do that for JSON; it might not hurt to do it more generally for other content types.

@kennethreitz
Owner

Grumble.

@gsnedders

Note the HTML case is even worse than that, really, because the pre-scan in browsers just looks at the first 1024 bytes or thereabouts, and the parser itself can then change it.

@kennethreitz
Owner

I still feel like our current behavior is a great implementation.

@Lukasa
Collaborator
Lukasa commented Mar 9, 2015

Ok. =)

How about I knock up a proposal for smarter encoding heuristics, that takes certain known content-types and attempts to gently introspect them for their encodings. At least then we'll have something concrete to discuss.
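
One possible shape for such a proposal, purely as a sketch for discussion: a per-media-type table of sniffers consulted only when the header carries no charset. None of the names below exist in requests, and the sniffers are deliberately stubby:

import codecs

def _bom_sniff(body):
    # Shared helper: look for a byte order mark only.
    for bom, name in ((codecs.BOM_UTF8, 'utf-8'),
                      (codecs.BOM_UTF16_LE, 'utf-16-le'),
                      (codecs.BOM_UTF16_BE, 'utf-16-be')):
        if body.startswith(bom):
            return name
    return None

ENCODING_SNIFFERS = {
    'text/html': _bom_sniff,          # a real version would add the <meta> prescan
    'application/xml': _bom_sniff,    # a real version would also read the <?xml ?> declaration
    'application/json': lambda body: 'utf-8',  # JSON is UTF-8/16/32; UTF-8 dominates
}

def guess_encoding(content_type, body):
    media_type = content_type.split(';')[0].strip().lower()
    sniffer = ENCODING_SNIFFERS.get(media_type)
    return sniffer(body) if sniffer else None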

@est
est commented Nov 24, 2015

@Lukasa @kennethreitz
Hey guys, for the time being I don't think we have an obvious solution yet, but can we at least make this ISO-8859-1 fallback optional?

    if 'text' in content_type:
        return 'ISO-8859-1'

This looks way too brutal. A parameter like Session(auto_encoding=False) would be nice. What do you guys think?
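
In the meantime, a session-level workaround is possible with a response hook; auto_encoding itself doesn't exist, this just clears the fallback whenever the header carried no explicit charset:

import requests

def drop_charset_fallback(response, **kwargs):
    # No charset in the header: forget the ISO-8859-1 guess so that
    # accessing .text falls back to chardet instead.
    if 'charset=' not in response.headers.get('Content-Type', ''):
        response.encoding = None
    return response

session = requests.Session()
session.hooks['response'].append(drop_charset_fallback)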

@Lukasa
Collaborator
Lukasa commented Nov 24, 2015

@est ISO-8859-1 is optional, you can simply set response.encoding yourself. It's a one-line change. =)

@gsnedders

@Lukasa But you can't determine whether the initial response.encoding came from the Content-Type header or whether it's the ISO-8859-1 fallback, which means if you want to avoid the fallback you have to start parsing the Content-Type header yourself, and that's quite a lot of extra complexity all of a sudden.

@Lukasa
Collaborator
Lukasa commented Dec 15, 2015

@gsnedders Sure you can. if 'charset' in response.headers['Content-Type'].

@gsnedders

While yes, that works under the assumption the other party is doing something sane, it doesn't work in the face of madness and something like text/html; foo=charset.

@Lukasa
Collaborator
Lukasa commented Dec 15, 2015

@gsnedders Try if 'charset=' in response.headers['Content-Type']. At this point we're out of 'crazy' and into 'invalid formatting'.

@gsnedders

@Lukasa uh, I thought there was whitespace (or rather what 2616 had as implicit *LWS) allowed around the equals; apparently not.

The grammar appears to be:

     media-type = type "/" subtype *( OWS ";" OWS parameter )
     type       = token
     subtype    = token

   The type/subtype MAY be followed by parameters in the form of
   name=value pairs.

     parameter      = token "=" ( token / quoted-string )

So I guess the only issue here is something like text/html; foo="charset=bar".

FWIW, html5lib's API changes are going to make the correct behaviour with requests require something like:

import requests
import html5lib

r = requests.get('https://api.github.com/events')
e = r.encoding if 'charset=' in r.headers['Content-Type'] else None
tree = html5lib.parse(r.content, transport_encoding=e)
@Lukasa
Collaborator
Lukasa commented Dec 15, 2015

That seems reasonable. =)

FWIW, in requests 3.0.0 we'll reconsider this, or at least consider adding some way of working out what decision we made.

@dentarg
dentarg commented Jan 9, 2016

Interesting discussion.

I still feel like our current behavior is a great implementation.

If I may, a counterexample. I use requests to extract the <title> from a requested document, and here is facebook.com.

>>> import requests
>>> r = requests.get("https://www.facebook.com/mathiassundin/posts/10153748227675479")
>>> r.encoding
'ISO-8859-1'
>>> import re
>>> m = re.search('<title[^>]*>\s*(.+?)\s*<\/title>', r.text, re.IGNORECASE|re.MULTILINE)
>>> m.group(1)
u'Mathias Sundin - M\xc3\xa5nga roliga samtal mellan Tony Blair och... | Facebook'
>>> r.headers['Content-Type']
'text/html'

Apparently they don't specify the encoding in their headers. But they do in the HTML (<meta charset="utf-8" />, full example at https://gist.github.com/dentarg/f13ef7cc0ce55753faf6).

As mentioned in #2161, "requests aren't a HTML library", and I can understand that point of view. Perhaps I should be looking into parsing the HTML with some library that also can take the specified encoding into consideration.

Thank you for your work on requests. I'm looking forward to 3.0.0.

@Lukasa
Collaborator
Lukasa commented Jan 9, 2016

@dentarg That's not really a counterexample: it's an example of why this system works.

The guidance from the IETF is that for all text/* media types, one of the following things MUST be true:

  • the peer MUST send a charset in the content-type
  • the content MUST carry an encoding specifier in it (HTML, XML)

Requests' default behaviour here works well: in the case of XML and HTML, ISO-8859-1 is guaranteed to safely decode the text well enough to let you see the <meta> tag. Anyone working with HTML really should go looking for that tag, because servers almost never correctly advertise the character encoding of the HTML they deliver, but the HTML itself is usually (though sadly not always) right.

The behaviour requests has right now is probably less prone to failure with HTML than the one proposed for 3.0.0, and part of me does wonder if we should try to solve this more by documentation than by code change.

@dentarg dentarg added a commit to dentarg/pynik that referenced this issue Jan 9, 2016
@dentarg dentarg Make title reader more robust
facebook.com doesn't specify the charset in HTTP headers, and then
requests falls back to ISO-8859-1 (the fallback might be removed in
3.0.0, see kennethreitz/requests#2086).

Use Beautiful Soup which can be intelligent about the encoding[1].

[1]: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Beautiful%20Soup%20Gives%20You%20Unicode,%20Dammit

Close #17.
c18d6b8
@dentarg
dentarg commented Jan 9, 2016

Yes, perhaps documentation is the way forward.

I can share my experience. I'm a very new user of requests; I haven't read all of the documentation, but I have looked at some of it. The requests website has this in the code snippet on the front page:

>>> r.encoding
'utf-8'

That information, combined with me somehow (maybe from browsing issues here on GitHub) finding out that requests uses chardet, gave me the impression that requests would solve the charset/encoding problem for me "all the way".

Once I found the right documentation, it was straightforward to work around the limitations with another library. I can fully understand that requests just wants to be the HTTP parser, not the HTML parser.

All that said, let me share some short examples where I think the ISO-8859-1 fallback might cause some unexpected behavior.

>>> import requests
>>> r = requests.get("https://www.facebook.com/mathiassundin/posts/10153748227675479")
>>> from bs4 import BeautifulSoup

You can't use r.text:

>>> print BeautifulSoup(r.text, "html5lib", from_encoding="").title.text
Mathias Sundin - MÃ¥nga roliga samtal mellan Tony Blair och... | Facebook
>>> print BeautifulSoup(r.text, "html5lib", from_encoding=r.encoding).title.text
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/bs4/__init__.py", line 215, in __init__
    self._feed()
  File "/usr/local/lib/python2.7/site-packages/bs4/__init__.py", line 239, in _feed
    self.builder.feed(self.markup)
  File "/usr/local/lib/python2.7/site-packages/bs4/builder/_html5lib.py", line 50, in feed
    doc = parser.parse(markup, encoding=self.user_specified_encoding)
  File "/usr/local/lib/python2.7/site-packages/html5lib/html5parser.py", line 236, in parse
    parseMeta=parseMeta, useChardet=useChardet)
  File "/usr/local/lib/python2.7/site-packages/html5lib/html5parser.py", line 89, in _parse
    parser=self, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/html5lib/tokenizer.py", line 40, in __init__
    self.stream = HTMLInputStream(stream, encoding, parseMeta, useChardet)
  File "/usr/local/lib/python2.7/site-packages/html5lib/inputstream.py", line 144, in HTMLInputStream
    raise TypeError("Cannot explicitly set an encoding with a unicode string")
TypeError: Cannot explicitly set an encoding with a unicode string

You need to use r.content, but if requests has fallen back to ISO-8859-1, r.encoding will cause trouble:

>>> print BeautifulSoup(r.content, "html5lib", from_encoding=r.encoding).title.text
Mathias Sundin - MÃ¥nga roliga samtal mellan Tony Blair och... | Facebook

Working as intended:

>>> print BeautifulSoup(r.content, "html5lib", from_encoding="").title.text
Mathias Sundin - Många roliga samtal mellan Tony Blair och... | Facebook
>>> print BeautifulSoup(r.content, "html5lib", from_encoding="utf-8").title.text
Mathias Sundin - Många roliga samtal mellan Tony Blair och... | Facebook
@Lukasa
Collaborator
Lukasa commented Jan 9, 2016

So here we hit a problem, which is that we can't really document the way Requests interacts with every possible tool you may want to use it with: there are just too many possibilities! So instead we try to provide general documentation.

The best advice I can give is that the more specific the tool, the more likely it is to handle bytes as input and DTRT with them. BeautifulSoup is an HTML tool and so can almost certainly take a stream of arbitrary bytes and find the relevant meta tag (and indeed, it can), so you can just pass it r.content. The same would be true if you were passing it to an XML library, or to a JSON library, or to anything else like that.

@Lukasa Lukasa referenced this issue Jan 14, 2016
@jgorset jgorset Default the encoding of "text" media subtypes to "ISO-8859-1"
Ref. RFC2616 (HyperText Transfer Protocol), section 3.7.1 (Canonicalization and Text Defaults).
a0ae2e6
@thp thp added a commit to thp/urlwatch that referenced this issue Jan 27, 2016
@thp thp Workaround a requests shortcoming related to encoding
We want to try UTF-8 before falling back to ISO-8859-1.

See also: kennethreitz/requests#2086
d1e38d7
@gsnedders

Requests' default behaviour here works well: in the case of XML and HTML, ISO-8859-1 is guaranteed to safely decode the text well enough to let you see the <meta> tag.

Sadly, that isn't true, because of UTF-16. In both the HTML and XML cases, parsers should be able to handle BOMs. Furthermore, a conforming XML parser should be able to handle a UTF-16 encoded <?xml version="1.0" encoding="UTF-16"?> with no preceding BOM.
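
To illustrate: UTF-16 content decoded as ISO-8859-1 comes out NUL-interleaved, so a naive ASCII substring search misses the declaration even though nothing is lost (a small standalone demonstration, not requests code):

xml = '<?xml version="1.0" encoding="UTF-16"?><doc/>'.encode('utf-16-be')

latin1 = xml.decode('ISO-8859-1')
'encoding=' in latin1          # False: every other character is U+0000
'\x00e\x00n\x00c' in latin1    # True: the declaration is still there, interleaved

# Sniffing the first bytes (00 3C 00 3F) identifies BOM-less UTF-16-BE.
xml[:4] == b'\x00<\x00?'       # True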

@Lukasa
Collaborator
Lukasa commented Mar 29, 2016

@gsnedders I'm not sure what your point is. ISO-8859-1 will decode incorrectly, but you can still search for the initial <meta> tag. It's not trivial but it is not enormously difficult either.

@gsnedders

@Lukasa My point is that you're not necessarily just looking for a <meta> tag: you could also want to see whether the opening bytes are 00 3C 00 3F (which obviously you can do, as they'll just be U+0000 U+003C U+0000 U+003F), and then start decoding as UTF-16 until you find the inline encoding declaration. There's a lot more complexity (especially in the XML case) than just looking for one ASCII string. ISO-8859-1 isn't that bad insofar as it causes no data loss, but it doesn't really make searching any easier.

@Lukasa
Collaborator
Lukasa commented Mar 29, 2016

So agreed that there is more complexity here, but I don't believe that requests can meaningfully take that complexity on. The spectrum of options in this situation is:

  1. Requests chooses ISO-8859-1 for .text, ensuring that accessing .text doesn't explode. This may well be wrong, but it's fast and ensures that the user can progress.
  2. Requests makes no decision, falling back on chardet. Chardet can be quite slow, which slows down the initial load of .text. Additionally, it can get it wrong, meaning that sometimes .text will randomly throw UnicodeDecodeError when accessed incautiously.
  3. Requests explicitly chooses no encoding and throws its own special exception when accessing .text.

I'd argue that of these three options, option 1 is the least user hostile. Essentially, we promise that accessing .text will never cause random exceptions for HTML or XML. We can argue over whether .text should exist at all, but it does, and in a world where it does I'd rather that everything doesn't randomly explode on users simply because they were incautious.
