@mgiuca mgiuca commented May 22, 2012

Attached is a fix for LP #1002581:
https://bugs.launchpad.net/lxml/+bug/1002581

Note: This patch is funded by my employer, Google. If I am credited, please use my work email address mgiuca@google.com.

Can't use \u escape because it doesn't work on Python 2 without a u prefix. Instead use literal UTF-8 bytes.
These fail on a system where libxml doesn't support iconv (try commenting out _UNICODE_ENCODING = enc in parser.pxi).
The second test always fails because it is converted to UTF-8 and back as Latin-1.
…s UTF-8, it overrides the encoding in the parser to interpret the string as UTF-8.
…ings.

If the encoding is unspecified, explicitly sets it to UTF-8 to match the way it will actually be encoded.
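For readers unfamiliar with the failure mode described in the commit notes above (text converted to UTF-8 and then read back as Latin-1), here is a minimal sketch in plain Python. It does not call lxml or libxml2 at all; it only reproduces the byte-level round trip, and the variable names are illustrative rather than taken from the patch.

```python
# Illustrative only: reproduces the "UTF-8 out, Latin-1 back in" round trip
# in pure Python. This is not lxml code; the names here are hypothetical.
text = "\u00e4"                          # 'ä', a single code point

utf8_bytes = text.encode("utf-8")        # two bytes: 0xC3 0xA4
garbled = utf8_bytes.decode("latin-1")   # each byte becomes its own character

# The damage is reversible as long as nothing else touched the string.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(utf8_bytes), repr(garbled), repr(repaired))
```

One character going in and two coming out is the typical signature of this mismatch. Presumably this is why the patch pins the parser encoding to UTF-8 whenever the input has actually been encoded as UTF-8.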
@jonashaag

I have debugged this issue for the last few hours and came up with a patch that is essentially the same. Mine isn't as complete as this one is, so I'm not going to post it here.

Long story short, I second this patch. It even looks like it's pretty easy to merge.

This issue came up for me with Python 3.

>>> import lxml.html
>>> lxml.html.fromstring("ä").text
'ä'

@scoder
Member

scoder commented Jan 2, 2014

I've implemented a fix here: 3169b0c

Also, I've implemented PEP 393 support for the Unicode string parser: 293302c
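As background on why PEP 393 matters for a parser that reads a Unicode string's memory directly: since CPython 3.3, a `str` is stored with 1, 2, or 4 bytes per character depending on the widest code point it contains, so there is no longer one fixed internal encoding to hand to the parser. The sketch below only observes this from Python and assumes CPython 3.3 or later. `bytes_per_char` is a hypothetical helper, not part of lxml, and it says nothing about how commit 293302c is implemented.

```python
import sys

def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Estimate CPython's per-character storage for a string made only of `ch`.

    Compares two freshly built strings so that the fixed object header
    cancels out. Assumes CPython 3.3+ (PEP 393 flexible representation).
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

samples = [
    ("ASCII", "a"),
    ("Latin-1", "\u00e4"),       # ä
    ("BMP", "\u20ac"),           # euro sign
    ("astral", "\U0001F600"),    # outside the Basic Multilingual Plane
]
for label, ch in samples:
    print(label, bytes_per_char(ch))
```

On CPython 3.3+ the widths follow PEP 393's 1/2/4-byte kinds. The Latin-1 case is the one relevant to this thread: "ä" is stored as the single byte 0xE4, which is not valid UTF-8 on its own.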

Unicode file parsing hasn't been changed yet, so that will still fail in some cases.

Closing this pull request as it no longer applies to the current master branch.

@scoder scoder closed this Jan 2, 2014
@jonashaag

Thanks, 293302c fixed this issue.
