
revising parse_utf8_qsl and updating tests

maxcountryman committed Jun 9, 2012
1 parent d179ea4 commit 2e18639fb48cce9462a3b0680f92b7df4b16a79a
Showing with 33 additions and 16 deletions.
  1. +15 −9 rauth/service.py
  2. +18 −7 tests/test_service.py
rauth/service.py

@@ -17,11 +17,19 @@
 def parse_utf8_qsl(s):
-    if isinstance(s, unicode):
-        s = s.encode('utf-8')
-    else:
-        s = unicode(s, 'utf-8').encode('utf-8')
-    return parse_qsl(s)
+    d = dict(parse_qsl(s))
+
+    for k, v in d.items():
+        if isinstance(k, unicode):
+            k = k.encode('utf-8')
+        else:
+            k = unicode(k, 'utf-8').encode('utf-8')

@joeshaw

joeshaw Jun 9, 2012

Contributor

Not sure I understand this block. You're taking a utf-8 byte string, decoding it and putting it into a unicode object, and then reencoding it back to a utf-8 byte string. Later it's decoded to be put into the returned dict.

I think you could do:

if not isinstance(k, unicode):
    s = unicode(k, 'utf-8')

and omit the second decoding (the dict key/value assignment) altogether.
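For readers on Python 3, where `str` is already Unicode and `bytes` plays the role of Python 2's byte string, the decode-once idea suggested here can be sketched as follows (a hypothetical port for illustration, not code from this commit):

```python
from urllib.parse import parse_qsl

def parse_utf8_qsl(s):
    # Decode only when we actually hold encoded bytes; str is already Unicode.
    if isinstance(s, bytes):
        s = s.decode('utf-8')
    return dict(parse_qsl(s))

print(parse_utf8_qsl(b'oauth_token=\xc3\xbc&oauth_token_secret=b'))
# {'oauth_token': 'ü', 'oauth_token_secret': 'b'}
```

No second encode/decode round-trip is needed: once the input is `str`, `parse_qsl` returns `str` pairs directly.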


@maxcountryman

maxcountryman Jun 9, 2012

Contributor

> Not sure I understand this block. You're taking a utf-8 byte string, decoding it and putting it into a unicode object, and then reencoding it back to a utf-8 byte string. Later it's decoded to be put into the returned dict.

I don't think that's quite right. We're encoding a byte string (note: we have no guarantee about the encoding here) as UTF-8 if it's unicode; otherwise we're casting the string to unicode and then encoding it as UTF-8. Thus we have some semblance of normalization. This is necessary because Python doesn't appreciate decoding/encoding strings blindly, in particular when parse_qsl is called over a string containing UTF-8 encoded chars. Check out the unit tests and feel free to experiment with removing this block. I think you'll find that Python blows up without these guards. Please correct me if I'm wrong, however. Maybe I've overlooked something in the unit tests?


@joeshaw

joeshaw Jun 11, 2012

Contributor

> note: we have no guarantee about the encoding here

That may be true, but the line k = unicode(k, 'utf-8').encode('utf-8') will blow up if the encoding is not UTF-8.

Consider the Euro symbol. It has the Unicode code point U+20AC, is encoded in UTF-8 as \xe2\x82\xac, but is encoded in ISO-8859-15 as \xa4. If the data is encoded as ISO-8859-15:

>>> unicode("\xa4", "utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa4 in position 0: invalid start byte

So we either need to grab the encoding from the Content-Type, or assume that str objects coming in are UTF-8 encoded. I think that's a reasonable assumption -- isn't UTF-8 mandated by the OAuth spec?
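The failure mode described here carries over unchanged to Python 3, where it is `bytes.decode` that raises: the ISO-8859-15 byte for the Euro sign is not a valid UTF-8 sequence (a standalone illustration, not code from the commit):

```python
# U+20AC (the Euro sign) is b'\xe2\x82\xac' in UTF-8 but b'\xa4' in ISO-8859-15.
assert b'\xe2\x82\xac'.decode('utf-8') == '\u20ac'
assert b'\xa4'.decode('iso-8859-15') == '\u20ac'

try:
    b'\xa4'.decode('utf-8')
except UnicodeDecodeError as exc:
    # 0xa4 is a continuation byte, so it cannot start a UTF-8 sequence
    print('decode failed:', exc.reason)
```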


@maxcountryman

maxcountryman Jun 11, 2012

Contributor

> isn't UTF-8 mandated by the OAuth spec?

Great question. I think you're probably right. But maybe a corollary to that is: can we trust all providers to adhere to the spec strictly in this regard?

> That may be true, but the line k = unicode(k, 'utf-8').encode('utf-8') will blow up if the encoding is not UTF-8.

Correct me if I'm wrong (I'm not in a good position to write up a test at the moment) but does conditionally encoding a string based on type, e.g. in our if/else branching, protect against this?

Thanks for your feedback on this, Joe!


@joeshaw

joeshaw Jun 11, 2012

Contributor

I did find the reference in the spec to this: http://oauth.net/core/1.0/#encoding_parameters Lots of sections refer to "parameter encoding".

> Correct me if I'm wrong (I'm not in a good position to write up a test at the moment) but does conditionally encoding a string based on type, e.g. in our if/else branching, protect against this?

I don't think so. In the isinstance(k, unicode) true case, that means that parse_qsl() was passed unicode objects, meaning the data was already decoded from whatever its native encoding was. Ultimately what we want out of this function are unicode objects, not encoded strings, so re-encoding and then decoding them again isn't necessary. Unfortunately, I don't think this function ever gets called with unicode objects? At least, it doesn't in the case where we were breaking previously (in Response.content, I think?)

In the false case, we are decoding the string objects but are assuming UTF-8. That's fine according to the spec, but if we want to do something more we'll have to pull it from the Content-Type header... I don't think it's worth it, personally. In that case, we should simply get a unicode object by calling unicode(k, 'utf-8') and using that.


@maxcountryman

maxcountryman Jun 11, 2012

Contributor

> Unfortunately I don't think this function ever gets called with unicode objects? At least, it doesn't in the case where we were breaking previously (in Response.content, I think?)

I think this is right, to the best of my understanding. :)

> In the false case, we are decoding the string objects but are assuming UTF-8. That's fine according to the spec, but if we want to do something more we'll have to pull it from the Content-Type header... I don't think it's worth it, personally.

Yeah, I guess for the moment we can fall back to calling unicode over the string, just in case. If we find this is an issue later we can always revisit it.

Phew, Python 2.x byte encoding is hard. ;) But hey, I kind of like GitHub for code review! 🍰 This is nice.


+        if isinstance(v, unicode):
+            v = v.encode('utf-8')
+        else:
+            v = unicode(v, 'utf-8').encode('utf-8')
+        d[k.decode('utf-8')] = v.decode('utf-8')

@joeshaw

joeshaw Jun 9, 2012

Contributor

if the key was a utf-8 encoded byte string before and it's been decoded into a unicode object, you're likely to have (kinda) duplicate keys here. I would probably wait to convert to a dict until after the strings have been converted, i.e.:

d = {}
for k, v in parse_qsl(s):
    if not isinstance(k, unicode):
        k = unicode(k, 'utf-8')
    if not isinstance(v, unicode):
        v = unicode(v, 'utf-8')
    d[k] = v

(Given that parse_qsl() never gives you unicode objects back, it's probably safe to do:

d = dict((unicode(k, 'utf-8'), unicode(v, 'utf-8')) for k,v in parse_qsl(s))

It is annoying that unicode(x) is not idempotent.)


@maxcountryman

maxcountryman Jun 9, 2012

Contributor

Unfortunately this line

d = dict((unicode(k, 'utf-8'), unicode(v, 'utf-8')) for k,v in parse_qsl(s))

does not square with Python's byte-encoding assumptions for certain kinds of strings, and it does blow up in the unit tests.

fwiw parse_qsl() does seem to respect the encoding of whatever string it's passed:

>>> isinstance(parse_qsl(u'foo=bar')[0][0], unicode)
True

>>> parse_qsl(u'ü=bar')
[(u'\xfc', u'bar')]

I augmented the unit tests to make sure the code is aware of and respects this.

I think you're right about the duplicate keys issue, good catch! I'll update that now.
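The type-preserving behaviour pointed out above survives into Python 3, where `urllib.parse.parse_qsl` returns `str` pairs for `str` input and `bytes` pairs for `bytes` input (illustrative only, not code from the commit):

```python
from urllib.parse import parse_qsl

# str in, str out -- non-ASCII characters pass through already decoded
assert parse_qsl('ü=bar') == [('ü', 'bar')]

# bytes in, bytes out -- nothing is decoded for you
assert parse_qsl(b'foo=bar') == [(b'foo', b'bar')]
```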


@maxcountryman

maxcountryman Jun 9, 2012

Contributor

Actually hmm, would you mind whipping up a simple unit test that demonstrates the key dup issue so that I have a base to run this against? :)


@joeshaw

joeshaw Jun 11, 2012

Contributor

d = {}
euro_utf8 = '\xe2\x82\xac'
euro_unicode = unicode(euro_utf8, 'utf-8')
d[euro_utf8] = 'euro'
d[euro_unicode] = 'euro'
print d
# {'\xe2\x82\xac': 'euro', u'\u20ac': 'euro'}

So if you passed in this to parse_utf8_qsl():

parse_qsl('\xe2\x82\xac=euro')

you'd end up with a dictionary similar to the above.
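In Python 3 the duplicate-key hazard is even sharper, because `bytes` and `str` never compare equal, so the two spellings of the Euro sign always occupy separate dict slots (a standalone illustration of the same pitfall):

```python
euro_utf8 = b'\xe2\x82\xac'           # UTF-8 encoded bytes
euro_str = euro_utf8.decode('utf-8')  # the decoded text '\u20ac'

d = {}
d[euro_utf8] = 'euro'
d[euro_str] = 'euro'

# bytes and str hash and compare independently, so both keys survive
assert len(d) == 2
```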


@maxcountryman

maxcountryman Jun 11, 2012

Contributor

Awesome, thank you. I'll use this as the basis of another unit test.


+    return d

 class Request(object):
@@ -78,7 +86,7 @@ def content(self):
         try:
             content = json.loads(self.response.content)
         except ValueError:
-            content = dict(parse_utf8_qsl(self.response.content))
+            content = parse_utf8_qsl(self.response.content)
         else:
             content = self.response.content
         return content
@@ -414,9 +422,7 @@ def get_request_token(self, method='GET', **kwargs):
         response.raise_for_status()

-        encoded_content = response.content
-
-        data = dict(parse_utf8_qsl(encoded_content))
+        data = parse_utf8_qsl(response.content)

         return data['oauth_token'], data['oauth_token_secret']

     def get_authorize_url(self, request_token, **params):
tests/test_service.py

@@ -438,23 +438,33 @@ def test_other_response(self, mock_request):

     def test_parse_utf8_qsl_non_unicode(self, mock_request):
         mock_request.return_value = self.response

-        self.response.content = 'oauth_token=ö&oauth_token_secret=b'
+        self.response.content = 'oauth_token=\xc3\xbc&oauth_token_secret=b'

         request_token, request_token_secret = \
             self.service.get_request_token('GET')
-        self.assertEqual(request_token, '\xc3\xb6')
+        self.assertEqual(request_token, u'\xfc')
+        self.assertEqual(request_token_secret, 'b')
+
+    @patch.object(requests.Session, 'request')
+    def test_parse_utf8_qsl_unicode_encoded(self, mock_request):
+        mock_request.return_value = self.response
+
+        self.response.content = u'oauth_token=\xfc&oauth_token_secret=b'
+
+        request_token, request_token_secret = \
+            self.service.get_request_token('GET')
+        self.assertEqual(request_token, u'\xfc')
         self.assertEqual(request_token_secret, 'b')

     @patch.object(requests.Session, 'request')
     def test_parse_utf8_qsl_unicode(self, mock_request):
         mock_request.return_value = self.response

-        self.response.content = unicode('oauth_token=a&oauth_token_secret=b',
-                                        'utf-8')
+        self.response.content = u'oauth_token=ü&oauth_token_secret=b'

         request_token, request_token_secret = \
             self.service.get_request_token('GET')
-        self.assertEqual(request_token, 'a')
+        self.assertEqual(request_token, u'\xfc')
         self.assertEqual(request_token_secret, 'b')

     @patch.object(requests.Session, 'request')
@@ -468,6 +478,7 @@ def test_parse_utf8_qsl_joe(self, mock_request):
             '/',
             access_token='a',
             access_token_secret='b')
-        expected = {'username': 'joeshaw \xc3\xa9\xc3\xa9\xc3\xa9',
-                    'fullname': 'Joe Shaw'}
+
+        expected = {u'username': u'joeshaw \xe9\xe9\xe9',
+                    u'fullname': u'Joe Shaw'}
         self.assertEqual(response.content, expected)
