HTML Unicode parsing #51

mgiuca · 2012-05-22T00:18:05Z

Attached is a fix for LP #1002581:
https://bugs.launchpad.net/lxml/+bug/1002581

Note: This patch is funded by my employer, Google. If I am credited, please use my work email address mgiuca@google.com.

jonashaag · 2014-01-02T00:16:50Z

I have debugged this issue for the last few hours and came up with a patch that is essentially the same. Mine isn't as complete as this one is, so I'm not going to post it here.

Long story short, I second this patch. It even looks like it's pretty easy to merge.

This issue came up for me with Python 3.

>>> import lxml.html
>>> lxml.html.fromstring("ä").text
'Ã¤'
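
The garbled result above is what one would expect if the UTF-8 bytes of the input were decoded as Latin-1 somewhere along the way. A minimal stdlib-only sketch of that round trip, assuming Python 3 (it does not call lxml, and the variable names are illustrative only):

```python
# Reproduce the mojibake without lxml: encode as UTF-8, then decode as Latin-1.
original = "\u00e4"                      # the character 'ä'
utf8_bytes = original.encode("utf-8")    # two bytes: 0xC3 0xA4
garbled = utf8_bytes.decode("latin-1")   # each byte becomes its own character

print(repr(garbled))

# Undoing the wrong decode step recovers the original text.
recovered = garbled.encode("latin-1").decode("utf-8")
```

This only illustrates the symptom; where exactly the mismatched decode happens inside the parser is what the patch addresses.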

scoder · 2014-01-02T12:59:20Z

I've implemented a fix here: 3169b0c

Also, I've implemented PEP393 support for the Unicode string parser: 293302c

Unicode file parsing hasn't been changed yet, so that will still fail in some cases.

Closing this pull request as it no longer applies to the current master branch.

jonashaag · 2014-01-02T13:13:17Z

Thanks, 293302c fixed this issue.

mgiuca-google added 6 commits May 21, 2012 15:59

test_htmlparser: Fix Unicode test input.

3169b0c

Can't use \u escape because it doesn't work on Python 2 without a u prefix. Instead use literal UTF-8 bytes.

test_htmlparser: Added new Unicode tests.

a7a574b

These fail on a system where libxml doesn't support iconv (try commenting out _UNICODE_ENCODING = enc in parser.pxi). The second test always fails because it is converted to UTF-8 and back as Latin-1.

parser: Fixed _parseMemoryDocument so that if it encodes the string as UTF-8, it overrides the encoding in the parser to interpret the string as UTF-8.

3ea7ba6

parser: Fixed parsing from a StringIO object that returns unicode strings. If the encoding is unspecified, explicitly sets it to UTF-8 to match the way it will actually be encoded.

8cea80f
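
The idea behind the two parser commits can be sketched in plain Python 3. This is not the lxml implementation: `read_unicode_source` is a hypothetical helper written only to show the rule that text encoded as UTF-8 on the way in must also be declared as UTF-8 to whatever parses the bytes.

```python
import io


def read_unicode_source(source, encoding=None):
    """Hypothetical helper: return (data_bytes, encoding_to_declare).

    Mirrors the rule described in the commit messages, not lxml's code.
    """
    data = source.read()
    if isinstance(data, str):
        # Text input is handed on as UTF-8 bytes ...
        data = data.encode("utf-8")
        if encoding is None:
            # ... so an unspecified encoding is set to UTF-8 to match.
            encoding = "utf-8"
    return data, encoding


data, declared = read_unicode_source(io.StringIO("\u00e4"))
print(data, declared)
```

Decoding `data` with `declared` gives back the original character, which is the property the new tests are meant to check. Byte input passes through unchanged, along with whatever encoding the caller supplied.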

Added link to Launchpad bug.

bfb57dd

Fix test case name for consistency.

8b4edd1

scoder closed this Jan 2, 2014
