Attached is a fix for LP #1002581:
Note: This patch is funded by my employer, Google. If I am credited, please use my work email address firstname.lastname@example.org.
test_htmlparser: Fix Unicode test input.
Can't use \u escape because it doesn't work on Python 2 without a u prefix. Instead use literal UTF-8 bytes.
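As a minimal sketch of that point (the sample markup is hypothetical, not the actual test input): on Python 2 a plain "\u00e9" literal without the u prefix stays as six literal characters, while bytes written with \x escapes mean the same thing on Python 2 and Python 3.

```python
# Hypothetical sample markup; not the input used in test_htmlparser.
# UTF-8 bytes for U+00E9, spelled out so the literal is identical on
# Python 2 and Python 3 (no reliance on the \u escape).
html_bytes = b"<p>\xc3\xa9</p>"

# Decoding as UTF-8 yields the intended single non-ASCII character.
html_text = html_bytes.decode("utf-8")

assert len(html_bytes) == 9   # 3 + 2 + 4 bytes
assert len(html_text) == 8    # 3 + 1 + 4 characters
```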
test_htmlparser: Added new Unicode tests.
These fail on a system where libxml2 is built without iconv support (to reproduce, try commenting out _UNICODE_ENCODING = enc in parser.pxi).
The second test always fails because the input is encoded as UTF-8 and then decoded as Latin-1.
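The round trip described above can be reproduced without lxml. This is only an illustration with a hypothetical one-character input, not the code path inside the parser:

```python
# Hypothetical sample text; not the input used in the actual test.
text = u"\u00e9"

# Step 1: the string is encoded as UTF-8, giving two bytes.
utf8_bytes = text.encode("utf-8")

# Step 2: those bytes are wrongly interpreted as Latin-1, so each byte
# becomes its own character instead of the pair forming one character.
misread = utf8_bytes.decode("latin-1")

assert utf8_bytes == b"\xc3\xa9"
assert misread == u"\u00c3\u00a9"
assert misread != text
```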
parser: Fixed _parseMemoryDocument so that if it encodes the string as UTF-8, it overrides the encoding in the parser to interpret the string as UTF-8.
parser: Fixed parsing from a StringIO object that returns unicode str…
If the encoding is unspecified, explicitly sets it to UTF-8 to match the way it will actually be encoded.
Added link to Launchpad bug.
Fix test case name for consistency.
lxml.html has its own set of tests. However, this is not lxml.html specific, so it should use the normal HTML parser in lxml.etree instead.
I don't like the fact that this special-cases StringIO. What if a file object was opened in Unicode text mode?
I don't like it either. But as far as I can tell, there is no general way to know whether a file-like object will return a byte string or a Unicode string without calling read(), and that permanently consumes input. I think the only general solution will be your suggestion (on the bug tracker) to delay choosing the encoding until after the first string is read.
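A rough sketch of what "delay choosing the encoding until after the first string is read" could look like in plain Python. The helper name read_first_chunk and its return convention are assumptions made for illustration; this is not lxml's implementation.

```python
import io


def read_first_chunk(fileobj, size=4096):
    """Read one chunk and decide the encoding from its type.

    Hypothetical helper, not part of lxml. Returns (data, encoding):
    data is always bytes; encoding is "utf-8" when the source returned
    Unicode text, or None when it returned bytes and the encoding
    should be left to the parser's own detection.
    """
    chunk = fileobj.read(size)
    if isinstance(chunk, bytes):
        return chunk, None
    # Unicode text: encode it ourselves and record the encoding we used,
    # so the bytes are later interpreted the same way they were produced.
    return chunk.encode("utf-8"), "utf-8"


text_data, text_enc = read_first_chunk(io.StringIO(u"<p>\u00e9</p>"))
byte_data, byte_enc = read_first_chunk(io.BytesIO(b"<p>\xe9</p>"))
```

The decision is made on the chunk that has already been read, so nothing is consumed just to probe the type, and any file-like object is handled the same way rather than only StringIO.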
This looks ok.
I have debugged this issue for the last few hours and came up with a patch that is essentially the same. Mine isn't as complete as this one is, so I'm not going to post it here.
Long story short, I second this patch. It even looks like it's pretty easy to merge.
This issue came up for me with Python 3.
>>> import lxml.html
I've implemented a fix here: 3169b0c
Also, I've implemented PEP 393 support for the Unicode string parser: 293302c
Unicode file parsing hasn't been changed yet, so that will still fail in some cases.
Closing this pull request as it no longer applies to the current master branch.
Thanks, 293302c fixed this issue.