UnicodeDecodeError raised on some cache max-age headers #84

Closed
hakanw opened this issue Jun 29, 2015 · 16 comments

@hakanw

hakanw commented Jun 29, 2015

I'm encountering this exception when fetching some URLs:

UnicodeDecodeError
 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

The function that is raising is this:

def _b64_encode_str(s):
    return _b64_encode_bytes(s.encode("utf8"))

An example of the data that some HTTP servers seem to be sending: '\u201cmax-age=31536000\u2033'

It would be great if cache-control could handle these edge cases without dying.

Thanks for a great package!

@hakanw hakanw changed the title UnicodeDecodeError on some cache max-age headers UnicodeDecodeError raised on some cache max-age headers Jun 29, 2015
@sigmavirus24
Contributor

The code you provided does not exist in cachecontrol.

Could you provide the full header?

@sigmavirus24
Contributor

Also, what version of Python are you using?

@ionrock
Contributor

ionrock commented Jun 29, 2015

@sigmavirus24 I think @hakanw is looking at https://github.com/ionrock/cachecontrol/blob/master/cachecontrol/serialize.py#L15

@hakanw Is that a public URL you are accessing? It is odd to me that a server is returning headers with opening and closing quotes rather than double quotes.

@ionrock
Contributor

ionrock commented Jun 29, 2015

@hakanw Thanks for creating the issue btw!

@hakanw
Author

hakanw commented Jun 29, 2015

I'm running python 2.7.6 and cachecontrol 0.11.5.

Yes, exactly, it's that code that is failing!

Yeah I know it's odd but you know, it's the web :) Lots of strange HTTP servers and clients are out there... I'm writing a spider so I'm running into a lot of these issues.

@sigmavirus24
Contributor

> @sigmavirus24 I think @hakanw is looking at https://github.com/ionrock/cachecontrol/blob/master/cachecontrol/serialize.py#L15

Ah, GitHub's search failed at finding that.

@hakanw
Author

hakanw commented Jul 29, 2015

This is still happening a lot. Maybe there's some specific server that emits this particular kind of cache header. I just wish cachecontrol didn't fail outright with an exception.

Screenshot from the stacktrace and my logging software:

[Screenshot: skarmavbild 2015-07-29 kl 15 29 36]

@sigmavirus24
Contributor

So on Python 2.7.9, if I do

>>> import base64
>>> uniheader = u'\u201cmax-age=31536000\u2033'
>>> strheader =  '\u201cmax-age=31536000\u2033'
>>> base64.b64encode(uniheader.encode('utf-8'))
'4oCcbWF4LWFnZT0zMTUzNjAwMOKAsw=='
>>> base64.b64encode(strheader.encode('utf-8'))
'XHUyMDFjbWF4LWFnZT0zMTUzNjAwMFx1MjAzMw=='

This particular header value doesn't cause any problems. That said, the two calls produce different values because the encoded bytes differ:

>>> uniheader.encode('utf-8')
'\xe2\x80\x9cmax-age=31536000\xe2\x80\xb3'
>>> strheader.encode('utf-8')
'\\u201cmax-age=31536000\\u2033'
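
The difference comes from how Python 2 reads the two literals: `\u` escapes are only interpreted in a `u''` literal, so the plain `str` literal keeps a backslash followed by `u201c`. As a rough illustration on Python 3 (where every string literal is unicode), a raw string can stand in for the Python 2 byte-string literal. That substitution is an assumption made for this sketch, not something taken from the report:

```python
import base64

# Unicode literal: \u201c and \u2033 each become a single code point.
uniheader = u'\u201cmax-age=31536000\u2033'
# Raw literal: stands in for the Python 2 str literal, where each
# escape is kept as six separate ASCII characters.
strheader = r'\u201cmax-age=31536000\u2033'

uni_b64 = base64.b64encode(uniheader.encode('utf-8'))
str_b64 = base64.b64encode(strheader.encode('utf-8'))

print(len(uniheader), len(strheader))
print(uni_b64)
print(str_b64)
```

The two base64 values should match the ones in the session above, which suggests the header text itself is not what breaks the encoding.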

This seems like something that absolutely should work, so something else must be causing the problem. Perhaps system settings?

@hakanw
Author

hakanw commented Jul 31, 2015

Hmm, good question. Here's my locale settings if that is relevant:

LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

@sigmavirus24
Contributor

The locale settings I have are almost identical (although shorter):

LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

The difference is that LC_ALL on my side isn't set.
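
When comparing environments like this, it can help to ask the interpreter what it is actually using rather than reading the shell variables. The following is only a diagnostic sketch; `describe_encodings` is a hypothetical helper name. On Python 2, `sys.getdefaultencoding()` is normally `ascii` whatever the `LC_*` variables say, and that is the codec used for implicit `str`/`unicode` conversions:

```python
import locale
import sys


def describe_encodings():
    """Collect the encoding settings the interpreter is actually using."""
    return {
        "default_encoding": sys.getdefaultencoding(),
        "filesystem_encoding": sys.getfilesystemencoding(),
        # False: report the setting without calling setlocale().
        "preferred_encoding": locale.getpreferredencoding(False),
    }


for name, value in sorted(describe_encodings().items()):
    print("%s = %s" % (name, value))
```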

I looked at the original report again and this stood out:

UnicodeDecodeError
 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

Specifically byte 0xe2, which is the first byte of the UTF-8 encoding of the unicode string.

# From my last comment
>>> uniheader.encode('utf-8')
'\xe2\x80\x9cmax-age=31536000\xe2\x80\xb3'

Which gave me an idea:

>>> uniheader.encode('utf-8').encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
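
On Python 2, calling `.encode()` on a byte string first decodes it implicitly with the default `ascii` codec, and that hidden decode is what raises. Python 3 has no `bytes.encode()`, so the sketch below spells out the implicit step to show the same error. The variable names are illustrative only:

```python
uniheader = u'\u201cmax-age=31536000\u2033'
already_encoded = uniheader.encode('utf-8')

try:
    # Equivalent to the implicit step Python 2 performs before re-encoding.
    already_encoded.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)
```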

So the header is already an encoded byte string when _b64_encode_str receives it, and encoding it a second time fails. This tells me we probably need one more layer of indirection that decides whether to use _b64_encode_str or _b64_encode_bytes for the values whose type we're not 100% certain of (e.g., headers), which should protect against this.

I'll work on a PR for this right now.

@hakanw
Author

hakanw commented Jul 31, 2015

Ah, good catch. Sounds like a great plan!

@ionrock
Contributor

ionrock commented Jul 31, 2015

@hakanw I just pushed @sigmavirus24's changes. Mind giving it a try?

@hakanw
Author

hakanw commented Aug 1, 2015

Of course, I'll give it a try!

@hakanw
Author

hakanw commented Aug 1, 2015

Works fine on master now. Thanks a bunch guys! 🌟

@sigmavirus24
Contributor

@hakanw @ionrock I guess one of you can close this then

@hakanw
Author

hakanw commented Aug 4, 2015

Thanks guys!
