UnicodeDecodeError raised on some cache max-age headers #84

hakanw · 2015-06-29T14:32:18Z

I'm encountering this exception when fetching some URLs:

UnicodeDecodeError
 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

The function that is raising is this:

def _b64_encode_str(s):
    return _b64_encode_bytes(s.encode("utf8"))

Some HTTP servers seem to send header values like this one: '\u201cmax-age=31536000\u2033'

It would be great if cache-control could handle these edge cases without dying.

Thanks for a great package!

sigmavirus24 · 2015-06-29T15:30:17Z

The code you provided does not exist in cachecontrol.

Could you provide the full header?

sigmavirus24 · 2015-06-29T15:30:37Z

Also, what version of Python are you using?

ionrock · 2015-06-29T15:37:40Z

@sigmavirus24 I think @hakanw is looking at https://github.com/ionrock/cachecontrol/blob/master/cachecontrol/serialize.py#L15

@hakanw Is that a public URL you are accessing? It is odd to me that a server is returning headers with opening and closing quotes rather than double quotes.

ionrock · 2015-06-29T15:38:09Z

@hakanw Thanks for creating the issue btw!

hakanw · 2015-06-29T15:46:39Z

I'm running python 2.7.6 and cachecontrol 0.11.5.

Yes, exactly it's that code that is failing!

Yeah I know it's odd but you know, it's the web :) Lots of strange HTTP servers and clients are out there... I'm writing a spider so I'm running into a lot of these issues.

sigmavirus24 · 2015-06-29T18:46:02Z

> @sigmavirus24 I think @hakanw is looking at https://github.com/ionrock/cachecontrol/blob/master/cachecontrol/serialize.py#L15

Ah, GitHub's search failed at finding that.

hakanw · 2015-07-29T13:30:15Z

This is still happening a lot. Maybe there's some specific server that emits this particular kind of cache header. I just wish cachecontrol didn't fail outright with an exception.

Screenshot from the stacktrace and my logging software:

sigmavirus24 · 2015-07-29T13:48:32Z

So on Python 2.7.9, if I do

>>> import base64
>>> uniheader = u'\u201cmax-age=31536000\u2033'
>>> strheader =  '\u201cmax-age=31536000\u2033'
>>> base64.b64encode(uniheader.encode('utf-8'))
'4oCcbWF4LWFnZT0zMTUzNjAwMOKAsw=='
>>> base64.b64encode(strheader.encode('utf-8'))
'XHUyMDFjbWF4LWFnZT0zMTUzNjAwMFx1MjAzMw=='

This particular header value doesn't cause any problems. That said, the two calls produce different values because:

>>> uniheader.encode('utf-8')
'\xe2\x80\x9cmax-age=31536000\xe2\x80\xb3'
>>> strheader.encode('utf-8')
'\\u201cmax-age=31536000\\u2033'

This almost seems like something that absolutely should work, but something else is causing problems. Perhaps system settings or something else?

hakanw · 2015-07-31T12:27:24Z

Hmm, good question. Here's my locale settings if that is relevant:

LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

sigmavirus24 · 2015-07-31T13:39:06Z

The locale settings I have are almost identical (although shorter):

LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

The difference is that LC_ALL on my side isn't set.

I looked at the original report again and this stood out:

UnicodeDecodeError
 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

Specifically byte 0xe2 which is the first byte when you encode the unicode string.

# From my last comment
>>> uniheader.encode('utf-8')
'\xe2\x80\x9cmax-age=31536000\xe2\x80\xb3'

Which gave me an idea:

>>> uniheader.encode('utf-8').encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

So the header is already encoded when _b64_encode_str receives it. On Python 2, calling .encode() on a byte string first implicitly decodes it as ASCII, and that decode is what fails on byte 0xe2. This tells me we probably need one more layer of indirection that decides whether to use _b64_encode_str or _b64_encode_bytes for values whose type we're not 100% certain of (e.g., headers), which should protect against this.

I'll work on a PR for this right now.

hakanw · 2015-07-31T14:09:27Z

Ah, good catch. Sounds like a great plan!

ionrock · 2015-07-31T17:11:54Z

@hakanw I just pushed @sigmavirus24's changes. Mind giving it a try?

hakanw · 2015-08-01T00:17:40Z

Of course, I'll give it a try!

hakanw · 2015-08-01T00:37:52Z

Works fine on master now. Thanks a bunch guys! 🌟

sigmavirus24 · 2015-08-03T15:00:32Z

@hakanw @ionrock I guess one of you can close this then

hakanw · 2015-08-04T10:04:07Z

Thanks guys!

hakanw changed the title ~~UnicodeDecodeError on some cache max-age headers~~ UnicodeDecodeError raised on some cache max-age headers Jun 29, 2015

hakanw closed this as completed Aug 4, 2015

hakanw mentioned this issue Aug 5, 2015

UnicodeDecodeError on some utf8 content in headers in cachecontrol #91

Closed

dstufft mentioned this issue Mar 7, 2016

Use msgpack for cache serialization #115

Merged
