New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode & string processing #159

Merged
merged 4 commits into from Feb 21, 2015

Conversation

Projects
None yet
2 participants
@opottone
Contributor

opottone commented Feb 21, 2015

Commits related to strings, encodings, and allowed characters.

  • Check that all strings consists of allowed XML characters (no\ufffe, \uffff, or surrogates).
  • Check that names are valid as specified in current XML standard, which is more tolerant than earlier, outdated editions (http://www.w3.org/TR/REC-xml/#NT-Name vs. http://www.w3.org/TR/2006/REC-xml-20060816/#NT-Name)
  • Python 2 expects __repr__ to return str, not unicode, and wrong type may lead to UnicodeEncodeError. Fix this.
  • Some test contained _str('ha\u1234\x07ho') or similar things. The \u escape does not work on python 2 '' string, unless we do some magic.

opottone added some commits Feb 21, 2015

Update name validation.
Validate names accordingy to XML 1.0 standard fifth edition, instead of
the outdated fourth edition. (See also Namespaces in XML 1.0 third edition.)
@scoder

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Feb 21, 2015

what does _str('\\u1234') give with this change?

what does _str('\\u1234') give with this change?

This comment has been minimized.

Show comment
Hide comment
@opottone

opottone Feb 21, 2015

Owner

Since '\u1234' == '\\u1234', _str('\u1234') == _str('\\u1234') == u'\u1234' (with python 2).

Owner

opottone replied Feb 21, 2015

Since '\u1234' == '\\u1234', _str('\u1234') == _str('\\u1234') == u'\u1234' (with python 2).

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Feb 21, 2015

Right - I guess it's acceptable to allow that behaviour in the test suite (assuming it's not currently used anywhere).

Right - I guess it's acceptable to allow that behaviour in the test suite (assuming it's not currently used anywhere).

This comment has been minimized.

Show comment
Hide comment
@opottone

opottone Feb 21, 2015

Owner

Yes _str('\\u1234) == u'\u1234' (after the change) is a nasty pitfall, but then again, so is _str('\u1234) == u'u1234' (before the change), and it just is not possible to get both cases right.

Grepping through the code, I did not find a single _str('\\uxxxx'), but there were couple of lines (not many) with _str('\uxxxx').

Owner

opottone replied Feb 21, 2015

Yes _str('\\u1234) == u'\u1234' (after the change) is a nasty pitfall, but then again, so is _str('\u1234) == u'u1234' (before the change), and it just is not possible to get both cases right.

Grepping through the code, I did not find a single _str('\\uxxxx'), but there were couple of lines (not many) with _str('\uxxxx').

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Feb 21, 2015

After looking through these cases, I think it doesn't hurt to make this change.

After looking through these cases, I think it doesn't hurt to make this change.

@scoder

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Feb 21, 2015

cdef bint is_valid_xml_utf8()

cdef bint is_valid_xml_utf8()

@scoder

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Feb 21, 2015

0x00eda080 <= next3 <= 0x00edbfbf

0x00eda080 <= next3 <= 0x00edbfbf

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Feb 21, 2015

I tried the following, just to be safe:
assert all(0x00eda080 <= sum((ord(s) << (i*8)) for i, s in enumerate(unichr(u).encode('utf8')[::-1])) <= 0x00edbfbf for u in range(ord(u'\ud800'), ord(u'\udfff')+1))
works for me (but would make a good additional test).

I tried the following, just to be safe:
assert all(0x00eda080 <= sum((ord(s) << (i*8)) for i, s in enumerate(unichr(u).encode('utf8')[::-1])) <= 0x00edbfbf for u in range(ord(u'\ud800'), ord(u'\udfff')+1))
works for me (but would make a good additional test).

@scoder

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Feb 21, 2015

strrepr()

@scoder

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Feb 21, 2015

unicode-escape tends to look funny when displayed in bytes strings, but I don't see a better option.

unicode-escape tends to look funny when displayed in bytes strings, but I don't see a better option.

This comment has been minimized.

Show comment
Hide comment
@opottone

opottone Feb 21, 2015

Owner

Yep, funny looking results are unavoidable, and this closely mimics what repr(s) gives for non-ascii string s.

Owner

opottone replied Feb 21, 2015

Yep, funny looking results are unavoidable, and this closely mimics what repr(s) gives for non-ascii string s.

@scoder

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Feb 21, 2015

the tests look like they should go into test_unicode().

scoder commented on f53dc9c Feb 21, 2015

the tests look like they should go into test_unicode().

@scoder

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Feb 21, 2015

Member

Looks good to me, thanks!

Member

scoder commented Feb 21, 2015

Looks good to me, thanks!

scoder added a commit that referenced this pull request Feb 21, 2015

@scoder scoder merged commit 1c04899 into lxml:master Feb 21, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment