UnicodeDecodeError raised on some cache max-age headers #84
Comments
The code you provided does not exist in cachecontrol. Could you provide the full header? |
Also, what version of Python are you using? |
@sigmavirus24 I think @hakanw is looking at https://github.com/ionrock/cachecontrol/blob/master/cachecontrol/serialize.py#L15 @hakanw Is that a public URL you are accessing? It is odd to me that a server is returning headers with opening and closing quotes rather than double quotes. |
@hakanw Thanks for creating the issue btw! |
I'm running Python 2.7.6 and cachecontrol 0.11.5. Yes, exactly, it's that code that is failing! Yeah, I know it's odd, but you know, it's the web :) Lots of strange HTTP servers and clients are out there... I'm writing a spider, so I'm running into a lot of these issues. |
Ah, GitHub's search failed at finding that. |
So on Python 2.7.9, if I do
>>> import base64
>>> uniheader = u'\u201cmax-age=31536000\u2033'
>>> strheader = '\u201cmax-age=31536000\u2033'
>>> base64.b64encode(uniheader.encode('utf-8'))
'4oCcbWF4LWFnZT0zMTUzNjAwMOKAsw=='
>>> base64.b64encode(strheader.encode('utf-8'))
'XHUyMDFjbWF4LWFnZT0zMTUzNjAwMFx1MjAzMw=='
This particular header value doesn't cause any problems. That said, it does produce two different values because:
>>> uniheader.encode('utf-8')
'\xe2\x80\x9cmax-age=31536000\xe2\x80\xb3'
>>> strheader.encode('utf-8')
'\\u201cmax-age=31536000\\u2033'
This almost seems like something that absolutely should work, but something else is causing problems. Perhaps system settings or something else? |
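For readers following along on Python 3, the two values above can be reproduced roughly as follows. This is only a sketch: it assumes a raw string is a fair stand-in for the Python 2 byte-string literal, where `\u` is not an escape and stays as a literal backslash followed by `u`. The variable names are the ones used in the comment above; `uni_bytes` and `str_bytes` are added here for illustration.

```python
import base64

# Real curly-quote and double-prime characters, as in the u'...' literal.
uniheader = "\u201cmax-age=31536000\u2033"
# Raw string: literal backslash sequences, mimicking the Python 2 '...' literal.
strheader = r"\u201cmax-age=31536000\u2033"

uni_bytes = uniheader.encode("utf-8")  # non-ASCII characters become multi-byte sequences
str_bytes = strheader.encode("utf-8")  # pure ASCII, so the bytes match the text one-to-one

print(base64.b64encode(uni_bytes).decode("ascii"))
print(base64.b64encode(str_bytes).decode("ascii"))
```

The two base64 strings differ because the inputs are different byte sequences, not because of any locale setting.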
Hmm, good question. Here are my locale settings, if that is relevant: LANG=en_US.UTF-8 |
The locale settings I have are almost identical (although shorter):
The difference is that I looked at the original report again and this stood out:
Specifically
# From my last comment
>>> uniheader.encode('utf-8')
'\xe2\x80\x9cmax-age=31536000\xe2\x80\xb3'
Which gave me an idea:
>>> uniheader.encode('utf-8').encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
So the header is already encoded properly by the time it is encoded again. I'll work on a PR for this right now. |
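One defensive way to avoid the double encode is to encode only when the value is still text and to pass bytes through untouched. The sketch below is an illustration of that idea, not the change that was actually merged into cachecontrol; `b64_encode_header` is a hypothetical name that does not come from the library.

```python
import base64


def b64_encode_header(value):
    """Hypothetical helper: base64-encode a header value given as text or bytes."""
    if isinstance(value, bytes):
        # Already encoded (a plain str on Python 2): do not encode a second time.
        raw = value
    else:
        # Text (unicode on Python 2, str on Python 3): encode exactly once.
        raw = value.encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


header = u"\u201cmax-age=31536000\u2033"
print(b64_encode_header(header))
print(b64_encode_header(header.encode("utf-8")))
```

Both calls print the same base64 string, because the already-encoded bytes are used as they are instead of being run through `.encode()` again.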
Ah, good catch. Sounds like a great plan! |
@hakanw I just pushed @sigmavirus24's changes. Mind giving it a try? |
Of course, I'll give it a try! |
Works fine on master now. Thanks a bunch guys! 🌟 |
Thanks guys! |
I'm encountering this exception when fetching some URLs:
The function that is raising is this:
And some example data that some HTTP servers seem to be sending is for example:
'\u201cmax-age=31536000\u2033'
It would be great if cache-control could handle these edge cases without dying.
Thanks for a great package!