UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 169: ordinal not in range(128) #1252

Closed
michaelhelmick opened this Issue · 48 comments

8 participants

@michaelhelmick

Something similar has been posted before: #403

This is using requests 1.1.0
But this problem is still popping up while trying to post just a file and a file with data.

On top of the similar issue, I've posted about this before, and in requests_oauthlib it was said to have been fixed. If you wish, I'll try and find the issue in that lib; I'm just too lazy to open a new tab now ;P

Error:

Traceback (most recent call last):
  File "/Users/mikehelmick/.virtualenv/twython/lib/python2.7/site-packages/requests/sessions.py", line 340, in post
    return self.request('POST', url, data=data, **kwargs)
  File "/Users/mikehelmick/.virtualenv/twython/lib/python2.7/site-packages/requests/sessions.py", line 279, in request
    resp = self.send(prep, stream=stream, timeout=timeout, verify=verify, cert=cert, proxies=proxies)
  File "/Users/mikehelmick/.virtualenv/twython/lib/python2.7/site-packages/requests/sessions.py", line 374, in send
    r = adapter.send(request, **kwargs)
  File "/Users/mikehelmick/.virtualenv/twython/lib/python2.7/site-packages/requests/adapters.py", line 174, in send
    timeout=timeout
  File "/Users/mikehelmick/.virtualenv/twython/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 422, in urlopen
    body=body, headers=headers)
  File "/Users/mikehelmick/.virtualenv/twython/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 274, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 955, in request
    self._send_request(method, url, body, headers)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 989, in _send_request
    self.endheaders(body)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 951, in endheaders
    self._send_output(message_body)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 809, in _send_output
    msg += message_body
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 169: ordinal not in range(128)

I posted a gist with sample code if any of you needed to test it. If you guys are really being lazy, I can post some app/user tokens for you to use (let me know).

Gist: https://gist.github.com/michaelhelmick/5199754

@sigmavirus24
Collaborator

So here's my first guess (but I'll obviously look into this more): I think that OAuth1 might be generating unicode and opening the file in binary form will cause this issue due to how httplib sends the message body. Also this may be related to #1250.

@Lukasa, opinions?

@Lukasa
Collaborator

That seems like a reasonable diagnosis. We get lots of obscure bugs in Requests from problems with encoding, I'm adding it to my list of things I need to do. =P

@husainihisan

still no fix? :(

@grillermo

Virtually all maintained Twitter Python libraries have migrated to requests > 1.0, and posting images is broken in them. If you want a very specific example, try twython with its twythonAPI.updateProfileImage(open('file','r')) method, where this bug is causing pain.

@Lukasa
Collaborator

Ok, so I've tracked this down. Gimme a second to write a fix.

@sigmavirus24
Collaborator

@grillermo it isn't a matter of reproducing the bug or not believing you. The stack trace is fairly clear on the matter. The problem is that we're all quite busy.

@Lukasa
Collaborator

So, really, this is because oauthlib is converting all the headers to unicode objects. We're then concatenating these unicode objects with the bytes of the file. Python tries to implicitly decode the bytes into unicode using its default codec (ASCII on Python 2), and obviously fails.
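To make the failure mode concrete, here is a minimal sketch that emulates, on Python 3, what Python 2 does behind the scenes when a unicode object is concatenated with a byte string. The helper name `py2_style_concat` is made up for illustration; Python 2 performs this decode invisibly rather than through any such function.

```python
def py2_style_concat(text, raw):
    """Emulate Python 2's `unicode + str`.

    Python 2 decodes the byte string with its default codec (ASCII)
    and only then joins the two values.
    """
    return text + raw.decode("ascii")


# ASCII-only bytes survive the implicit decode.
print(py2_style_concat("Content-Length: 3\r\n\r\n", b"abc"))

# A JPEG starts with the bytes 0xff 0xd8, which ASCII cannot decode.
try:
    py2_style_concat("Content-Length: 2\r\n\r\n", b"\xff\xd8")
except UnicodeDecodeError as exc:
    print("failed:", exc)
```

Python 3 refuses to mix `str` and `bytes` at all, which is why the decode has to be spelled out here.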

We can do the 'easy' fix and have requests-oauthlib just encode the headers using Latin-1, but that defers the problem. Alternatively, we can do the 'right' thing and take control of header encoding ourselves. I don't know if Kenneth is up for that, though. @kennethreitz, thoughts?

@Lukasa
Collaborator

Just spoke with Kenneth, and he and I agree that we should fix requests-oauthlib. This is no longer blocking v1.2.

@t-8ch

@Lukasa We recently had a discussion about header encoding in shazow/urllib3#164 at which you might want to look.

@sigmavirus24
Collaborator
@Lukasa
Collaborator

So wait, urllib3 is expecting us to provide unicode objects, not bytes?

@shazow
Collaborator

I think the strategy, as with the rest of Python, is to use the appropriate type where appropriate. I would argue that it makes more sense for header keys to be strings rather than bytes. Is there a counter-argument? (Things like request body should definitely be bytes.)

@Lukasa
Collaborator

My position on the matter would be that headers have a defined encoding on the wire (ISO-8859-1), which means that the only valid headers are ones that can be encoded in that encoding. You can't send strings on the wire, you can only send bytes, and the user shouldn't have to know what bytes those are. I'm happy to leave that encoding to urllib3 though. =)

Incidentally, if urllib3 is passing unicode headers through to httplib without encoding them it might be the cause of our issue.
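As a rough illustration of what "encoding the headers" for the wire could look like, the sketch below converts a header dict to byte strings using ISO-8859-1 (Latin-1). It is written for Python 3, where `str` is the unicode type, and `encode_headers` is a hypothetical helper, not something urllib3 or Requests provides.

```python
def encode_headers(headers, encoding="latin-1"):
    """Return a new dict whose keys and values are all byte strings."""
    encoded = {}
    for name, value in headers.items():
        # Values that are already bytes are passed through untouched.
        if isinstance(name, str):
            name = name.encode(encoding)
        if isinstance(value, str):
            value = value.encode(encoding)
        encoded[name] = value
    return encoded


wire_headers = encode_headers({"X-Name": "Zo\u00eb", b"Accept": b"*/*"})
print(wire_headers)

# Characters outside Latin-1 cannot be represented in a header at all.
try:
    encode_headers({"X-Price": "\u20ac5"})
except UnicodeEncodeError as exc:
    print("not encodable:", exc)
```

The second call shows the point made above: only headers that fit the wire encoding are valid, so the failure surfaces at encode time rather than deep inside the HTTP stack.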

@Lukasa
Collaborator

Further debugging suggests the problem is in the interface between urllib3 and httplib. This exact problem can be reproduced using the following short program:

import httplib
conn = httplib.HTTPConnection('httpbin.org', 80)

conn.request('POST', u'/post', '\xff', {'test': 'value'}) # Exception here.

Any unicode value, whether in the method, the url, or the keys/values of the headers dict, will cause the entire body of the message to be 'promoted' to a unicode string. This is fine unless you are uploading a file that isn't ascii text, which might contain out-of-range bytes. Exceptions will then be dramatically thrown.

I think either urllib3 or requests needs to ensure that by this stage, everything is bytes.
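A minimal sketch of the "everything is bytes by this stage" idea, written for Python 3: every text component is encoded before the message is assembled, so the binary body is only ever concatenated with other bytes. `as_bytes` and `build_request` are hypothetical names used for illustration; this is not how httplib or urllib3 build requests internally.

```python
def as_bytes(value, encoding="latin-1"):
    """Encode text to bytes; leave bytes untouched."""
    return value if isinstance(value, bytes) else value.encode(encoding)


def build_request(method, path, headers, body):
    """Assemble a raw HTTP/1.1 request from already-coerced parts."""
    lines = [as_bytes(method) + b" " + as_bytes(path) + b" HTTP/1.1"]
    for name, value in headers.items():
        lines.append(as_bytes(name) + b": " + as_bytes(value))
    head = b"\r\n".join(lines) + b"\r\n\r\n"
    # The body is required to be bytes already, so no decoding can occur.
    return head + body


message = build_request(
    "POST",
    "/post",
    {"Host": "httpbin.org", "Content-Length": "1"},
    b"\xff",
)
print(message)
```

Because the head is bytes before the body is appended, an out-of-range byte such as `\xff` passes through unchanged.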

@sigmavirus24
Collaborator

If we ensure everything is bytes, this will work well for Python 2 because strs are bytes objects. In Python 3 this seems to produce an issue like @t-8ch mentioned. Naturally it's perfectly fine for there to be multiple header values and there are no bizarre characters in headers (and cannot be if I remember the spec properly) so the coercion to whatever will be fine. You might think this falls on our shoulders because it doesn't seem that too many urllib3 users have reported this issue, but you're wrong.

The problem with doing this is exactly the case where we're reading binary data, which is a very common use case. If we're provided a file (or file-like object), we have no way of knowing if it's binary data or not, and images and the like can't be coerced to text. This makes me think that the burden lies on urllib3 to coerce everything together.

Either way, I feel obligated to leave this behind.

@t-8ch

Wouldn't this mean urllib3 has to mess with Content-Length? (and maybe other
things I am not aware of)

(Python 3):

>>> requests.post('http://httpbin.org/post', data='u').json()
{
 'data': 'u', # data looks correct
 'headers': {
  'Content-Length': '1',
  # [..]
  },
}
# note the "ü"                                     v
>>> requests.post('http://httpbin.org/post', data='ü').json()
{
 'data': 'data:application/octet-stream;base64,/A==', # ??
 'headers': {
  'Content-Length': '1',
  # [..]
 }
 # [..]
}
>>> requests.post('http://httpbin.org/post', data='ü'.encode()).json()
{
 'data': 'ü', # works
 'headers': {
  'Content-Length': '2',
  # [..]
 }
 # [..]
}
$ curl --data-binary ü http://httpbin.org/post
{
  "data": "\u00fc" # == 'ü'
  "headers": {
    "Content-Length": "2",
    # [..]
  },
  # [..]
}

RFC2616:

OCTET = <any 8-bit sequence of data>

# [..]

The Content-Length entity-header field indicates the size of the
entity-body, in decimal number of OCTETs, sent to the recipient or,
in the case of the HEAD method, the size of the entity-body that
would have been sent had the request been a GET.
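One way to read the second response above is to decode the base64 payload and look at the bytes that actually arrived. The check below uses only the standard library; that httpbin falls back to a base64 data URI when it cannot decode a body as text is my reading of the output, not something stated in the thread.

```python
import base64

# Payload taken from 'data:application/octet-stream;base64,/A==' above.
sent = base64.b64decode("/A==")
print(sent)

# Compare with the two candidate encodings of U+00FC ("ü").
as_latin1 = "\u00fc".encode("latin-1")
as_utf8 = "\u00fc".encode("utf-8")
print(sent == as_latin1, len(as_latin1))
print(sent == as_utf8, len(as_utf8))
```

The single byte matches the Latin-1 form, which agrees with the Content-Length of 1 in that response, whereas the `.encode()` and curl cases sent the two-byte UTF-8 form.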
@Lukasa
Collaborator

I disagree with your assessment of correct. =)

Content-Length, as you rightly pointed out, asks for the length of the data in octets. The unicode string u'a' (using Python 2.7 notation to avoid ambiguity) does not have a length in octets, because it's unicode. Only encoded text has any octet-based length. For example:

>>> len(u'a'.encode('utf-8'))
1
>>> len(u'a'.encode('utf-16')) # Don't forget the BOM will be here too!
4
>>> len(u'a'.encode('utf-32'))
8

This means that if urllib3 gets unicode data, but no explicit Content-Length header, urllib3 should encode that data and then set the content-length based on that encoding. However, if urllib3 gets an explicit Content-Length header, I'd argue that it should just assume the user knows what they're doing and let it go.

From where I'm sitting, the problem here is that urllib3 needs to assume that it might get unicode values for any of these strings, but the wire needs bytes. httplib isn't doing the right thing here, so to avoid the Python interpreter doing its totally bogus implicit encoding/decoding, urllib3 needs to take it into its own hands. This means encoding the unicode.

It is totally legitimate to ask users of urllib3 to do their own encoding, and if you conclude that that is what you want to do then we can make the fix in requests-oauthlib. However, I think that someone in the stack, either requests or urllib3, needs to take responsibility for this encoding stuff, because Python 2 just does it all wrong.
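The rule described above (encode first, then measure) can be sketched in a few lines of Python 3. `prepare_body` is a hypothetical helper, and defaulting to UTF-8 is an assumption made for the example rather than anything urllib3 is known to do.

```python
def prepare_body(body, encoding="utf-8"):
    """Return (body_bytes, content_length) for a text or bytes body.

    Text is encoded first, so the length is counted in octets,
    not in characters.
    """
    if isinstance(body, str):
        body = body.encode(encoding)
    return body, str(len(body))


print(prepare_body("\u00fc"))             # one character, UTF-8
print(prepare_body("\u00fc", "latin-1"))  # same character, Latin-1
print(prepare_body(b"\xff\xd8"))          # bytes are measured as-is
```

The same one-character string yields different Content-Length values depending on the codec, which is why the length can only be computed after the encoding decision has been made.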

@t-8ch

(Python 2 works, following is Python 3)

This means that if urllib3 gets unicode data, but no explicit Content-Length header,
urllib3 should encode that data and then set the content-length based on that encoding. 

Urllib3 does get an explicit Content-Length.

>>> r = requests.Request('POST', 'http://httpbin.org', data=u'ü').prepare()
>>> r.headers
{'Content-Length': '1'}

My 2 cents:
Urllib3 should assume native strings for headers and bytes for the body.
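A small sketch of what that contract might look like from the caller's side, written for Python 3 (where the native string is `str`). `to_native` and `normalize` are hypothetical helpers for illustration only, and decoding header bytes as ASCII is an assumption of this example.

```python
def to_native(value, encoding="ascii"):
    """Return `value` as a native string, decoding bytes if needed."""
    if isinstance(value, str):
        return value
    return value.decode(encoding)


def normalize(headers, body):
    """Coerce headers to native strings and insist on a bytes body."""
    if not isinstance(body, (bytes, bytearray)):
        raise TypeError("body must be bytes; encode text explicitly first")
    native = {to_native(k): to_native(v) for k, v in headers.items()}
    return native, bytes(body)


headers, body = normalize({b"Content-Type": b"image/jpeg"}, b"\xff\xd8")
print(headers, body)
```

Rejecting a text body outright, instead of guessing an encoding, keeps the Content-Length question out of the transport layer.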

@Lukasa
Collaborator

Yeah, I was excluding Requests' behaviour for a moment, and just trying to nail down what urllib3 should be doing. Then we could change Requests to program to that interface. =)

Thomas, I'm also quite happy with your proposal there. If @shazow thinks that's the way it should go, the fix belongs outside urllib3. :fireworks:

@shazow
Collaborator

I would prefer to avoid doing aggressive type coercion for every input on the urllib3 side.

@sigmavirus24
Collaborator

So this cannot block 1.2.0 unless @kennethreitz really wants it to.

Perhaps to satisfy @michaelhelmick and company we should add a notice to the release that we realize that this is broken and a fix is being worked on in shazow/urllib3.

@michaelhelmick

I was sleeping while all this convo was going on :blush: haha

But, first and foremost I want to thank all of you for the participation in the issue!

Although I feel that it would be weird if 1.2.0 was released while this issue was still valid, and then all of a sudden, 2 weeks later and without any version bump to requests, this issue was just solved and file uploading etc. worked.

@michaelhelmick

Although, urllib3 is contained within requests, so I guess Kenneth would have to update the internal package anyway, therefore forcing some sort of version bump? So I guess this technically isn't a blocker for 1.2.0; my apologies.

@sigmavirus24
Collaborator

@michaelhelmick I hope you got better sleep than I did. :) And yes, as soon as this gets fixed, I would be certain to bug @kennethreitz about a bump to 1.2.1

And there's no need to apologize.

@michaelhelmick

@sigmavirus24 I got about 9 hours, haha. And alright, and if he doesn't bump on your first request; we'll start a trending topic on Twitter ;D

@sigmavirus24
Collaborator
@michaelhelmick

xD hahah, I just lol'd haha

@Lukasa referenced this issue from a commit in requests/requests-oauthlib
Only pass bytes to urllib3.
This should resolve requests-oauthlib's problems with uploading binary
data, as demonstrated in kennethreitz/requests#1252.
c1a0f56
@Lukasa referenced this issue in requests/requests-oauthlib
Merged

Only pass bytes to urllib3. #26

@Lukasa
Collaborator

So, everyone in this thread who cares about the requests end of this, I've pushed a fix up to requests-oauthlib. Anyone who cares to test it should download from the unicodedecodeerror branch (yes, you will mis-type that at least once), and I welcome code review on the PR at requests/requests-oauthlib#26.

@husainihisan

I would like to try it, but I don't know how. Is there a guide? I'm using Debian on a Raspberry Pi, mainly with twython to upload pictures.

@sigmavirus24
Collaborator

So since this seems to be fixed in requests/requests-oauthlib and since it seems like we all agree this should be done in urllib3, can we close this?

@sigmavirus24
Collaborator

Actually I misread shazow's comment. I thought he said he'd prefer to do the coercion in urllib3. It seemed bizarre but at least I got it right the second time around, right?

@shazow
Collaborator

@sigmavirus24 I'd like to treat urllib3 as more of an expected-input-expected-output library, and Requests to do the "do silly thing to input to make behaviour more user-friendly" stuff. Does that make sense?

@sigmavirus24
Collaborator
@Lukasa
Collaborator

Seems fair to me. I'll try to take a look into it at some point over the long weekend. No guarantees though!

@Lukasa
Collaborator

Possibly in order to punish us (:wink:), requests-oauthlib does not work on Python3 if you upload files. That's because Requests uses encode_multipart_formdata from urllib3, which returns the content-type as bytes. @shazow: is that intentional? If so, I can work around it here. If not, I can offer you a PR to fix it.
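To illustrate the type split being discussed (a bytes body with the content type as a native string), here is a self-contained multipart encoder sketch for Python 3. It is not urllib3's `encode_multipart_formdata`; `encode_multipart` and its argument shape are assumptions made for this example.

```python
import uuid


def encode_multipart(fields):
    """Encode file fields as multipart/form-data.

    `fields` maps a field name (str) to a (filename, content_bytes) pair.
    Returns (body, content_type): the body is bytes, while the content
    type is a native string because it ends up as a header value.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, (filename, content) in fields.items():
        parts.append(("--" + boundary).encode("ascii"))
        disposition = 'Content-Disposition: form-data; name="{}"; filename="{}"'.format(
            name, filename
        )
        parts.append(disposition.encode("utf-8"))
        parts.append(b"Content-Type: application/octet-stream")
        parts.append(b"")  # blank line separates part headers from content
        parts.append(content)
    parts.append(("--" + boundary + "--").encode("ascii"))
    parts.append(b"")
    body = b"\r\n".join(parts)
    content_type = "multipart/form-data; boundary=" + boundary
    return body, content_type


body, content_type = encode_multipart({"file": ("photo.jpg", b"\xff\xd8\xff")})
print(type(body).__name__, content_type)
```

Returning the content type as `str` means it can be dropped into a headers dict alongside other native-string values without any further conversion.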

@sigmavirus24
Collaborator
@Lukasa
Collaborator

Sure, but the content-type is a header value. =)

@sigmavirus24
Collaborator
@shazow
Collaborator

Hmm yes I think that's a mistake. If everyone agrees, a PR sounds good. :)

@marselester

I think my problem is the same. UnicodeDecodeError appears when the method param is unicode and requests.request() gets a files argument, e.g.:

>>> requests.request(u'post', u'http://httpbin.org/post',
...                  files={u'file': open('README.rst', 'rb')})
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 1759: ordinal not in range(128)

But requests.request(u'post', u'http://httpbin.org/post') is ok.

@Lukasa
Collaborator

@marselester: What version of Requests are you using? I can't reproduce this in Requests v1.2.0, using either Python 2.7 or Python 3.3.

@marselester

I use Python 2.7.3, Requests 1.2.0:

Python 2.7.3 (default, Mar  9 2013, 17:38:02) 
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.24)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> requests.__version__
'1.2.0'
>>> requests.request(u'post', u'http://httpbin.org/post',
...                  files={u'file': open('README.rst', 'rb')})
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/local/lib/python2.7/site-packages/requests/api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 354, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 460, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/adapters.py", line 211, in send
    timeout=timeout
  File "/usr/local/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 421, in urlopen
    body=body, headers=headers)
  File "/usr/local/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 273, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 958, in request
    self._send_request(method, url, body, headers)
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 992, in _send_request
    self.endheaders(body)
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 954, in endheaders
    self._send_output(message_body)
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 812, in _send_output
    msg += message_body
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 1759: ordinal not in range(128)
@Lukasa
Collaborator

Oh, hang on, this just occurred to me: are you uploading Requests' README.rst file?

@marselester

I have tried to upload an image file:

>>> requests.request(u'post', u'http://httpbin.org/post', files={u'file': open('IMG_1365.JPG', 'rb')})
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 134: ordinal not in range(128)
@marselester

But if I convert the method param to str, then it is fine:

>>> requests.request(str(u'post'), u'http://httpbin.org/post', files={u'file': open('README.rst', 'rb')})
<Response [200]>
@Lukasa
Collaborator

Oh, that makes perfect sense. I suggest you just use a normal, Python 2.7 native string, e.g. 'POST'. Or, even better, use requests.post() and save yourself this trouble entirely. =)

The issue is that Python 2.7 thinks it can convert between unicode and byte strings without your input, which it can't. When you concatenate two strings, if one of them is unicode, the other is decoded using the default encoding (almost always ASCII). HTTP is a text-based format, so building an HTTP message involves a lot of string concatenation. When you upload anything with non-ascii bytes in it, and you've used unicode in a place where we don't change it, Bad Stuff Happens(tm).

Requests is aiming to improve sanitising of this stuff (see #1338). However, there are no plans to sanitise the verb string. You must always provide that verb string as a native string (bytes on Python 2.x, unicode on Python 3.x).
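A caller that cannot control where its verb comes from could normalise it before handing it to Requests. The sketch below is written for Python 3, where the native string type is `str`; `native_verb` is a hypothetical helper, not part of Requests.

```python
def native_verb(method):
    """Return the HTTP verb as an upper-case native string."""
    if isinstance(method, bytes):
        # HTTP verbs are plain ASCII tokens, so ASCII is a safe codec here.
        method = method.decode("ascii")
    return str(method).upper()


print(native_verb(b"post"))
print(native_verb("post"))
```

On Python 2 the direction is reversed: the native string is the byte string, so the equivalent step is the `str(u'post')` call used in the workaround above.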

@marselester

@Lukasa, thank you. When I can use requests.post() I use it :)

@Lukasa closed this