
utils.get_encoding_from_headers returns ISO-8859-1 incorrectly #480

Closed
umbrae opened this Issue · 16 comments

4 participants

@umbrae

When I call get_encoding_from_headers on this url:

http://thelastpsychiatrist.com/2012/02/my_fiancee_is_pushing_me_away.html

The response is ISO-8859-1:

(Pdb) get_encoding_from_headers(self.response.headers)
'ISO-8859-1'

Even though the headers don't contain that character set:

(Pdb) self.response.headers
{'date': 'Sun, 11 Mar 2012 21:10:40 GMT', 'transfer-encoding': 'chunked', 'content-type': 'text/html', 'server': 'Apache/2.2.22'}

It looks like this was an intentional choice in the source, but this is problematic for me because, if I knew that the encoding was guessed, I'd want to check the HTML meta tag myself - which would then properly parse as UTF-8.

I think the better solution here is to either return None explicitly, or to provide a default kwarg that people could set to an encoding manually if they wanted to.

I can patch this if it sounds like a good solution.

@umbrae

(Also, for the record, this is causing encoding errors for words like fiancée, which is how I noticed the discrepancy.)

@kennethreitz

You actually can set an encoding manually today — it's just not very obvious/documented.

Here's the workflow:

r = requests.get(...)
r.encoding = 'myencoding'

Then, when you call r.text, that encoding will be used.


The initial implementation of response encoding did take meta tags into consideration, but I decided to remove it to keep things within a small scope. In short, meta tags are HTML, not HTTP :)

@umbrae

@kennethreitz If I'm understanding you, I don't think this issue is solved from my perspective. Let me give you my use case.

When I fetch an HTML resource with requests, I want to parse the HTML with the encoding as it was intended. That means:

  1. Looking in the HTTP header.
  2. Looking in the meta tag.
  3. Making a dumb guess using something like chardet.

Setting the encoding on a request doesn't help me there, because I want to know the encoding the response naturally declared. If it wasn't in the headers, I can continue down the list and look at the meta tag.

Right now in my code I'm working around this by just checking to see if requests responded with ISO-8859-1 and ignoring it because I don't have any guarantees it was actually there. The better solution for me would be for this method to return None.
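For reference, the workaround is roughly this (just a sketch; requests.utils.get_encodings_from_content handles the meta-tag step, and chardet handles the final fallback as described below):

import requests

r = requests.get('http://thelastpsychiatrist.com/2012/02/my_fiancee_is_pushing_me_away.html')

# ISO-8859-1 may just be the RFC 2616 default, so don't trust it blindly.
if not r.encoding or r.encoding.upper() == 'ISO-8859-1':
    declared = requests.utils.get_encodings_from_content(r.content)
    if declared:
        r.encoding = declared[0]
    else:
        r.encoding = None   # forces r.text to fall back to chardet

html = r.text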

@kennethreitz

Ah, well Requests already makes a dumb guess using chardet if no header is specified. Hopefully that makes the meta tag parsing unnecessary.

r.encoding is set automatically if the headers specify an encoding.

r = requests.get(...)

# Check meta tags if no encoding was in headers
if not r.encoding or r.encoding == 'ISO-8859-1':
    r.encoding = my_meta_parser(r)

# If no encoding was returned, chardet will automatically be used.

Also, ISO-8859-1 is only used as a default if "text" is present in the content type. There's some weird RFC for that or something.
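For context, the header logic is roughly this (a simplified sketch of what get_encoding_from_headers does, not the exact source):

import cgi

def get_encoding_from_headers(headers):
    content_type = headers.get('content-type')
    if not content_type:
        return None

    content_type, params = cgi.parse_header(content_type)
    if 'charset' in params:
        return params['charset'].strip("'\"")

    # RFC 2616 sec. 3.7.1: text/* received via HTTP defaults to ISO-8859-1
    if 'text' in content_type:
        return 'ISO-8859-1'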

@umbrae

Oh, I didn't realize that was to spec—Bizarre! In that case then I'll just keep my current workflow, which is essentially the same as the code chunk you've got up there. Thanks.

@umbrae

For future reference to anyone who stumbles upon this, the spec is:

http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7.1

The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See section 3.4.1 for compatibility problems.

@kennethreitz

Excellent :)

I really should document this stuff...

@jhalcrow

I'm running into this issue as well. RFC 2616 does say that subtypes of text are defined to have a default charset value of ISO-8859-1. However, the W3C also says:

The HTTP protocol ([RFC2616], section 3.7.1) mentions ISO-8859-1 as a default character encoding when the "charset" parameter is absent from the "Content-Type" header field. In practice, this recommendation has proved useless because some servers don't allow a "charset" parameter to be sent, and others may not be configured to send the parameter. Therefore, user agents must not assume any default value for the "charset" parameter.

http://www.w3.org/TR/html4/charset.html#h-5.2.2

In practice, site admins seem to be pretty lazy about properly specifying an encoding in headers. The upshot is that for non-English-language sites, defaulting to ISO-8859-1 will almost certainly be wrong. They go on to say:

To sum up, conforming user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):

  1. An HTTP "charset" parameter in a "Content-Type" field.
  2. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
  3. The charset attribute set on an element that designates an external resource.

So then I think the more compliant thing to do would be to just drop the check for 'text' in Content-Type and fall back to the meta declaration.

@Lukasa
Collaborator

While it's possible that guessing is wrong, we can't fall back to the META declaration because Requests doesn't parse HTML. That's really the job of the user, not Requests.

@kennethreitz

I'm confident that my current character-set workflow is correct.

However, I'm not against removing the 'text' default.

@kennethreitz

That is, if I can be convinced :)

@jhalcrow

Fair enough. :)

For HTML, the value of encoding is going to be misleading, so there should at least be a big red sticker somewhere. The ISO-8859-1 default is a pretty obscure point that will be missed by most users of the library. In fact, ISO-8859-1 is probably not what the server meant either. This seems out of keeping with the spirit of the rest of this package.

This also means that Response.encoding isn't even a good starting point if you are trying to get HTML, because of the default behavior (which is what .text relies on right now).

For .text, the docstring basically says it will make its best effort to give you unicode. It even goes as far as using chardet to make a guess. To do that but not run a regex against the content to check for a meta charset seems silly.

With the current behavior, if someone requests a UTF-8 encoded document from a server that serves text/html but neglects to declare a charset in the Content-Type header (which people may not even have control over), Response.text will mangle the document. It will use the assumed ISO-8859-1 encoding, skipping chardet entirely, and just do str(content, 'ISO-8859-1', errors='replace'). This basically makes Response.text useless.
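To make the failure mode concrete, here is a tiny illustration (Python 3 style) of what decoding UTF-8 bytes as ISO-8859-1 does:

raw = u'fiancée'.encode('utf-8')      # b'fianc\xc3\xa9e'
print(raw.decode('iso-8859-1'))       # fiancÃ©e  -- mojibake
print(raw.decode('utf-8'))            # fiancée   -- what was meant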

For Response.text, something like the code below seems like a more appropriate way to guess. It would also be nice to have the guess_encoding function available in .utils.

import cgi

import chardet
import requests


def is_valid(charset, data):
    """Return True if `data` decodes cleanly with `charset`."""
    try:
        data.decode(charset)
    except (UnicodeDecodeError, LookupError):
        return False
    return True


def guess_encoding(response):
    """
    Guess the encoding for a document in the following order:

    1. charset parameter of the Content-Type header
    2. charset declared in <meta ...> tags
    3. encoding detected by chardet
    """
    # 1. Content-Type header
    if 'content-type' in response.headers:
        content_type, params = cgi.parse_header(response.headers['content-type'])
        if 'charset' in params:
            charset = params['charset'].strip("'\"")
            if is_valid(charset, response.content):
                return charset

    # 2. <meta> declarations in the body
    meta_encodings = requests.utils.get_encodings_from_content(response.content)
    for charset in meta_encodings:
        if is_valid(charset, response.content):
            return charset

    # 3. Let chardet take a guess
    detected = chardet.detect(response.content)
    if detected:
        return detected['encoding']
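A caller would then use it along these lines (sketch; the URL is just a placeholder):

r = requests.get('http://example.com/some-page')
r.encoding = guess_encoding(r)   # override the header-based guess
html = r.text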
@kennethreitz

No, requests is not an HTML library. We used to do meta tags and it was a mistake.

@jhalcrow

That's totally reasonable, and I respect the decision to limit the scope of the library. But, please at least consider returning None when the character encoding is unspecified by the content-type. In practice, an unspecified content-type rarely actually implies ISO-8859-1.

Doing that would at least allow someone processing HTML to determine that they need to do more work, and would lead to Response.text making use of chardet which is far more likely to produce the correct encoding. As it stands now Response.text is completely unreliable.

Edit: Just to add, reliably returning unicode is basically impossible without deeper processing of the content. I'm not sure what the use case is for attempting to get unicode without checking this. The .text function should probably get the axe if it can't do that. Otherwise it is basically just offering people rides in a dynamite truck: it will probably get you there OK, unless you hit a pothole, that is.

@jhalcrow

One more thing to bug you further about this :)

What if I added code so that, if you set r.encoding to a callable, it would be used to resolve the encoding? Or maybe some sort of hook? That would be pretty much in line with the current encoding behavior without introducing HTML-specific code into the library, and it would also be usable for other things like XML.
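Purely as a hypothetical sketch of that idea (this is not current requests behavior), usage might look like:

def html_aware_encoding(response):
    # headers first, then meta tags, then chardet -- same order as guess_encoding above
    return guess_encoding(response)

r = requests.get('http://example.com/page')   # placeholder URL
r.encoding = html_aware_encoding              # proposed: allow a callable here
html = r.text                                 # would call html_aware_encoding(r) to resolve the encoding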
