Fix of UnicodeDecodeError on unicode header name that can be converted to ascii. #1181

Merged
2 commits merged on Feb 12, 2013

3 participants

@denis-ryzhkov

Problem

Please see issue #1082.

Solution

Let's fix at least the Python 2 unicode issues with header names,
because their representation as bytes is well defined in the HTTP/1.1 spec:
http://tools.ietf.org/html/rfc2616#section-4.2

message-header = field-name ":" [ field-value ]
field-name     = token
token          = 1*<any CHAR except CTLs or separators>
CHAR           = <any US-ASCII character (octets 0 - 127)>

So a header name should always be sent as ASCII.

If the user provides a header name that can be converted to ASCII without errors,
then it should be converted, e.g. u'Content-Type' --> 'Content-Type'.

Otherwise a helpful error should be raised, indicating that the header name is the cause.

  • Created: test_unicode_header_name.
  • Updated: prepare_headers.
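
For illustration only, the rule described above could be sketched roughly as follows. This is not the code from this pull request: the helper name to_ascii_header_name and the choice of ValueError are assumptions made for the example, and the sketch is written against Python 3, where text is str and encoded data is bytes.

```python
def to_ascii_header_name(name):
    """Hypothetical helper: return the header name as ASCII bytes.

    Raises ValueError that names the offending header, so the caller
    can see that the header name (not the value) is the problem.
    """
    if isinstance(name, bytes):
        # Already bytes: only verify that every octet is in the ASCII range.
        try:
            name.decode('ascii')
        except UnicodeDecodeError:
            raise ValueError('Header name is not ASCII: %r' % (name,))
        return name
    try:
        # Text: an explicit encode, rather than relying on str().
        return name.encode('ascii')
    except UnicodeEncodeError:
        raise ValueError('Header name is not ASCII: %r' % (name,))


print(to_ascii_header_name(u'Content-Type'))
```

The sketch checks only the ASCII range; it does not reject the CTLs or separators that the token grammar above also excludes.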
@Lukasa
Collaborator
Lukasa commented Feb 11, 2013

Why use a side-effect of the str function to do this, when you can easily do: name.encode('ascii')? Additionally, if you do that you can remove the 'if py2' tests.

@denis-ryzhkov

@Lukasa
Thanks! Please check the fix.

@Lukasa
Collaborator
Lukasa commented Feb 12, 2013

I'm going to close and reopen to get Travis to re-run the tests.

@Lukasa Lukasa closed this Feb 12, 2013
@Lukasa Lukasa reopened this Feb 12, 2013
@Lukasa Lukasa closed this Feb 12, 2013
@Lukasa Lukasa reopened this Feb 12, 2013
@Lukasa
Collaborator
Lukasa commented Feb 12, 2013

This fix looks broadly right. My only concern is that it covers only a subset of the problem, but it is a perfectly good start. =)

@denis-ryzhkov

One step at a time )

@kennethreitz
Owner

I don't recall off the top of my head where I've seen this, but I believe latin1 (or something similar) is recommended for header encodings. @mitsuhiko do you know?

@denis-ryzhkov

@kennethreitz

latin1 (aka ISO-8859-1) is recommended for header values by the HTTP/1.1 spec itself:
http://tools.ietf.org/html/rfc2616#section-4.2

  • Header name is always ascii:
    message-header = field-name ":" [ field-value ]
    field-name     = token
    token          = 1*<any CHAR except CTLs or separators>
    CHAR           = <any US-ASCII character (octets 0 - 127)>
  • Header value is latin1 by default, but may be any other charset too:
    message-header = field-name ":" [ field-value ]
    field-value    = *( field-content | LWS )
    field-content  = <the OCTETs making up the field-value
                     and consisting of either *TEXT or combinations
                     of token, separators, and quoted-string>
    TEXT           = <any OCTET except CTLs, but including LWS>
    OCTET          = <any 8-bit sequence of data>

    Words of *TEXT MAY contain characters from character sets
    other than ISO-8859-1 (aka latin1)
    only when encoded according to the rules of RFC 2047.

RFC 2047 is the MIME extension for non-ASCII text in message headers: http://tools.ietf.org/html/rfc2047

Example of encoded value:

    =?UTF-8?Q?=E2=98=91?=

It can be decoded with:
http://docs.python.org/2/library/email.header.html#email.header.decode_header

So it's a bit more complex than just latin1.
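
To make the header value rules concrete, here is a rough Python 3 sketch using the standard email.header module. The helper names encode_header_value and decode_header_value are hypothetical, and the choice of UTF-8 as the RFC 2047 charset is an assumption for the example; none of this is part of the pull request.

```python
from email.header import Header, decode_header


def encode_header_value(value):
    """Hypothetical helper: latin1 where possible, RFC 2047 otherwise."""
    try:
        return value.encode('latin-1')
    except UnicodeEncodeError:
        # Outside latin1: wrap the text as an RFC 2047 encoded-word,
        # which is itself pure ASCII.
        return Header(value, 'utf-8').encode().encode('ascii')


def decode_header_value(raw):
    """Hypothetical helper: decode a value that may hold encoded-words."""
    parts = decode_header(raw)
    return u''.join(
        # Encoded parts come back as bytes plus a charset; a value with
        # no encoded-words comes back as a plain str.
        chunk.decode(charset or 'latin-1') if isinstance(chunk, bytes) else chunk
        for chunk, charset in parts
    )


# The encoded value quoted above decodes to a single character, U+2611.
print(decode_header_value('=?UTF-8?Q?=E2=98=91?='))
```

The sketch does not validate control characters, and Header.encode() may fold long values across lines, so real code would need more care.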

I guess it is a good idea to merge the fix for header name first,
and then return to header value issues.

@kennethreitz
Owner

beautiful, thanks :)

@kennethreitz kennethreitz merged commit cdec20a into kennethreitz:master Feb 12, 2013

1 check passed

The Travis build passed