Resolve unicode problems #15

merged 5 commits into from Jul 4, 2012



ziima commented Jun 6, 2011

Commits that fix problems caused by improper handling of unicode strings in the library:
* incorrect encoding of data in responses with AX/Sreg extensions
* an error raised when discovering a page with unicode and HTML entities in HTML attribute values
* tests added for these problems

dokai and others added some commits Oct 14, 2010

@dokai dokai Fixed a bug in Message.toFormMarkup() related to encoding UTF-8 encoded form values.

The .toFormMarkup() method that generates a <form> HTML structure had a bug
when the form field values contained UTF-8 encoded strings with characters
outside the 7-bit ASCII space.

If the lxml implementation of the ElementTree API was in use these values
would result in a ValueError being raised (ValueError: All strings must be XML
compatible: Unicode or ASCII, no NULL bytes or control characters). If the
stdlib implementation of ElementTree was used these characters were silently
replaced by their XML character reference equivalents (&#XXX;).

This patch generates the form using Unicode values for everything and then
serializes the form to a UTF-8 encoded string, ensuring that the final form is
what is expected and consistent regardless of which ElementTree implementation
is in use.
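
The approach described in this commit message can be sketched roughly as follows. This is not the patch itself: it is a minimal illustration written for Python 3 (where `str` is the Unicode type and `bytes` is the byte string) using the stdlib ElementTree, and the function name `form_markup` and its parameters are hypothetical.

```python
import xml.etree.ElementTree as ET


def form_markup(action_url, fields, submit_text="Continue"):
    """Build a <form> from text values and serialize it once to UTF-8 bytes.

    Hypothetical helper for illustration; not the library's toFormMarkup().
    """
    form = ET.Element("form", {
        "action": action_url,
        "method": "post",
        "accept-charset": "UTF-8",
    })
    for name, value in fields.items():
        # Assumption: any byte-string value is UTF-8. Decode it first so the
        # tree only ever holds text, never a mix of text and bytes.
        if isinstance(value, bytes):
            value = value.decode("utf-8")
        ET.SubElement(form, "input",
                      {"type": "hidden", "name": name, "value": value})
    ET.SubElement(form, "input", {"type": "submit", "value": submit_text})
    # Encode exactly once, at the output boundary.
    return ET.tostring(form, encoding="utf-8")
```

Because every value is text by the time it reaches the tree, the serialized form should be the same whether the caller passed Unicode text or UTF-8 bytes.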
@dokai dokai Fixed a bug in Message.toPostArgs() related to UTF-8 encoded values.
In generating the argument dictionary the .toPostArgs() method (apparently)
assumed that values were all Unicode objects and called
``value.encode('utf-8')`` on them unconditionally. However, the values appear
to be a mixed set of Unicode objects and UTF-8 encoded strings (most being of
the latter group).

Calling .encode('utf-8') on a byte string (a Python 2 ``str``) implicitly
decodes the string into a Unicode object before encoding it to the selected
encoding. This automatic decoding uses the ``sys.getdefaultencoding()``
encoding, which is 'ascii' by default. The original call therefore works only
as long as the values are 7-bit ASCII and breaks when they contain non-ASCII
bytes.

The patch ensures that the resulting values in the returned dictionary are
UTF-8 encoded strings regardless of whether the input values were Unicode
objects or UTF-8 strings.
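
The idea of encoding only when needed might look something like the sketch below. It is written for Python 3, where the implicit ASCII decode described above no longer exists (`bytes` has no `.encode()` method), so the type check is explicit. The names `to_utf8` and `encode_post_args` are hypothetical, and the sketch assumes that any byte-string input is already UTF-8.

```python
def to_utf8(value):
    """Return value as UTF-8 bytes, whether it arrives as text or as bytes."""
    if isinstance(value, str):
        # Text is encoded explicitly; no default-encoding guesswork.
        return value.encode("utf-8")
    if isinstance(value, bytes):
        # Assumption: byte strings are already UTF-8, so pass them through.
        return value
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


def encode_post_args(args):
    """Return a copy of args in which every value is a UTF-8 byte string."""
    return {key: to_utf8(value) for key, value in args.items()}
```

For example, `encode_post_args({"openid.sreg.fullname": "José", "openid.mode": b"id_res"})` leaves the second value untouched and encodes only the first.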
@ziima ziima Fix problem with decoding of HTML that has mixture of unicode and ent…

Original commits:
 - 08382e5
 - 8ced77c

Thanks, you rock :)

@ziima ziima Yet another problem with unicode - some HTML pages can not be decoded because they contain undecodable characters.

This raises a UnicodeDecodeError deep inside Python. It only happens if the XRDS location is not found before
the undecodable character.

 - Catch UnicodeDecodeError when searching for yadis
 - Update check of whether yadis was used - if xrds location is none it was not
 - Added tests, update previous unicode test with comment
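
The first two points above could be sketched as below. This is a simplified illustration, not the library's code: `find_xrds_location` and `used_yadis` are hypothetical names, the default `utf-8` encoding is an assumption, and the sketch decodes the whole page up front, so unlike the behaviour described above it gives up even when the XRDS location appears before the undecodable bytes.

```python
from html.parser import HTMLParser


class _MetaXRDSParser(HTMLParser):
    """Record the content of the first <meta http-equiv="X-XRDS-Location">."""

    def __init__(self):
        super().__init__()
        self.location = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.location is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() == "x-xrds-location":
            self.location = attrs.get("content")


def find_xrds_location(html_bytes, encoding="utf-8"):
    """Return the XRDS location from an HTML page, or None if unavailable."""
    try:
        text = html_bytes.decode(encoding)
    except UnicodeDecodeError:
        # Undecodable page: report "no location" instead of propagating.
        return None
    parser = _MetaXRDSParser()
    parser.feed(text)
    parser.close()
    return parser.location


def used_yadis(xrds_location):
    """Yadis counts as used only when an XRDS location was actually found."""
    return xrds_location is not None
```

Returning `None` on a decode failure lets the caller treat an undecodable page the same way as a page with no XRDS location, which is what the updated "was Yadis used" check relies on.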

ziima commented Jul 18, 2011

I discovered yet another problem. The commit was included in this pull request automatically.

@ziima ziima Fix encoding of namespace uris.
The OpenID namespace is usually a unicode object, but it was not encoded to a byte string in the response.

hades commented Apr 9, 2012

Hi! I've recently stumbled upon a manifestation of the problem you're fixing here and traced it back right here.

Thanks for your work! Any idea why this is still not merged?


ziima commented Apr 10, 2012

I tried to push it to upstream but it seems nobody cares :-(

@willnorris willnorris added a commit that referenced this pull request Jul 4, 2012

@willnorris willnorris Merge pull request #15 from vzima/unicode
Resolve unicode problems

@willnorris willnorris merged commit b19e822 into openid:master Jul 4, 2012
