HTML Unicode parsing #51

wants to merge 6 commits into



mgiuca commented May 22, 2012

Attached is a fix for LP #1002581.

Note: This patch is funded by my employer, Google. If I am credited, please use my work email address

mgiuca-google added some commits May 21, 2012

@mgiuca-google mgiuca-google test_htmlparser: Fix Unicode test input.
Can't use \u escape because it doesn't work on Python 2 without a u prefix. Instead use literal UTF-8 bytes.
@mgiuca-google mgiuca-google test_htmlparser: Added new Unicode tests.
These fail on a system where libxml doesn't support iconv (try commenting out _UNICODE_ENCODING = enc in parser.pxi).
The second test always fails because it is converted to UTF-8 and back as Latin-1.
@mgiuca-google mgiuca-google parser: Fixed _parseMemoryDocument so that if it encodes the string as UTF-8, it overrides the encoding in the parser to interpret the string as UTF-8.
@mgiuca-google mgiuca-google parser: Fixed parsing from a StringIO object that returns unicode str…

If the encoding is unspecified, explicitly sets it to UTF-8 to match the way it will actually be encoded.
@mgiuca-google mgiuca-google Added link to Launchpad bug. bfb57dd
@mgiuca-google mgiuca-google Fix test case name for consistency. 8b4edd1
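
The failure mode named in the commit messages above (text converted to UTF-8 and then read back as Latin-1) can be reproduced in plain Python, without lxml. The following is only a minimal sketch, assuming Python 3; the sample character is illustrative and is not taken from the patch's test suite.

```python
# Illustration only: reproduces "UTF-8 bytes decoded as Latin-1" in plain
# Python 3. The sample text is an assumption, not the patch's test input.
text = "\u00e4"  # LATIN SMALL LETTER A WITH DIAERESIS

# Encoding as UTF-8 produces two bytes for this one character.
utf8_bytes = text.encode("utf-8")

# Decoding those bytes as Latin-1 maps each byte to its own character,
# so one character silently becomes two.
misread = utf8_bytes.decode("latin-1")

# Decoding with the encoding that was actually used round-trips cleanly.
correct = utf8_bytes.decode("utf-8")

print(repr(utf8_bytes))
print(repr(misread))
print(correct == text)
```

A byte literal such as `b"\xc3\xa4"` spells the same two bytes without relying on a `\u` escape, which is the approach the first commit message describes for the test input.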

scoder commented on a7a574b Jun 26, 2012

lxml.html has its own set of tests. However, this is not lxml.html specific, so it should use the normal HTML parser in lxml.etree instead.

scoder commented on 8cea80f Jun 26, 2012

I don't like the fact that this special-cases StringIO. What if a file object was opened in Unicode text mode?


mgiuca replied Jul 4, 2012

I don't like it either. But as far as I can tell, there is no general way to know whether a file-like object will return a byte or Unicode string without calling read(), and that permanently consumes input. I think the only general solution will be your suggestion (on the bug tracker) to delay choosing the encoding until after the first string is read.
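
The "delay the decision until after the first read" idea can be sketched in plain Python. This is an assumption-laden illustration, not lxml's implementation: `read_first_chunk` and `choose_encoding` are hypothetical helpers, and the rule "text is encoded as UTF-8 and the parser is told UTF-8" is taken from the commit messages above.

```python
import io


def read_first_chunk(source, size=4096):
    """Read one chunk and report whether the source yields text or bytes.

    Hypothetical helper, not part of lxml. The chunk is returned so the
    caller can pass it on to the parser instead of losing the input.
    """
    chunk = source.read(size)
    return chunk, isinstance(chunk, str)


def choose_encoding(source, declared_encoding=None):
    """Pick the parser encoding only after seeing the first chunk.

    Returns (first_chunk_as_bytes, encoding). For text input, the chunk is
    encoded as UTF-8 and the encoding is forced to "utf-8" so that it
    matches the bytes actually handed to the parser. For byte input, the
    declared encoding (possibly None) is kept unchanged.
    """
    chunk, is_text = read_first_chunk(source)
    if is_text:
        return chunk.encode("utf-8"), "utf-8"
    return chunk, declared_encoding


# Works the same for StringIO and for any other object whose read()
# returns str, such as a file opened in text mode.
text_chunk, text_enc = choose_encoding(io.StringIO("<p>\u00e4</p>"))
byte_chunk, byte_enc = choose_encoding(io.BytesIO(b"<p>x</p>"))
print(text_enc, byte_enc)
```

Because the check looks at the value returned by `read()` rather than at the type of the file object, it does not need to special-case `StringIO`.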

scoder commented on 3ea7ba6 Jun 26, 2012

This looks ok.

I have debugged this issue for the last few hours and came up with a patch that is essentially the same. Mine isn't as complete as this one is, so I'm not going to post it here.

Long story short, I second this patch. It even looks like it's pretty easy to merge.

This issue came up for me with Python 3.

>>> import lxml.html
>>> lxml.html.fromstring("ä").text

scoder commented Jan 2, 2014

I've implemented a fix here: 3169b0c

Also, I've implemented PEP393 support for the Unicode string parser: 293302c

Unicode file parsing hasn't been changed yet, so that will still fail in some cases.

Closing this pull request as it no longer applies to the current master branch.

scoder closed this Jan 2, 2014

Thanks, 293302c fixed this issue.
