Commit
Use libxml2 properly, that is, give xmlCtxtResetPush() at least 4 bytes of XML data. It then processes the byte order mark itself, so the BOM need not be handled in lxml.
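Why four bytes matter can be shown with a small sketch. This is not libxml2's detection code, only a simplified, BOM-only illustration; the function name sniff_bom and its return labels are hypothetical. The point is that the first two bytes of a little-endian UTF-32 BOM are identical to the little-endian UTF-16 BOM, so a shorter prefix is ambiguous.

```python
import codecs

# Longest BOMs first, so FF FE 00 00 (UTF-32, little-endian) is tried
# before FF FE (UTF-16, little-endian), which is a prefix of it.
_BOMS = [
    (codecs.BOM_UTF32_LE, ("UTF-32", "little")),
    (codecs.BOM_UTF32_BE, ("UTF-32", "big")),
    (codecs.BOM_UTF8, ("UTF-8", None)),
    (codecs.BOM_UTF16_LE, ("UTF-16", "little")),
    (codecs.BOM_UTF16_BE, ("UTF-16", "big")),
]


def sniff_bom(prefix: bytes):
    """Return ((family, byte_order), bom_length), or (None, 0) if no BOM.

    Hypothetical helper for illustration; it looks only at BOMs and
    ignores every other heuristic a real parser may apply.
    """
    for bom, label in _BOMS:
        if prefix.startswith(bom):
            return label, len(bom)
    return None, 0


# With four bytes the two cases are distinguishable:
four_utf16 = sniff_bom(b"\xff\xfe<\x00")       # UTF-16 BOM followed by "<"
four_utf32 = sniff_bom(b"\xff\xfe\x00\x00")    # UTF-32 BOM
# With only two bytes, a UTF-32 document looks like UTF-16:
two_bytes = sniff_bom(b"\xff\xfe")
```

Feeding the parser a prefix shorter than four bytes therefore forces a guess, which is the situation the commit avoids.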
Showing 1 changed file with 13 additions and 25 deletions.
0df8e59
Have you tested this with libxml2 2.7?
0df8e59
I agree that the general approach is better than what's there, but I'm not sure we should drop the BOM handling code.
0df8e59
I could not directly test this with libxml2 2.7, because the lxml tests won't run:
However, a test in C indicates that libxml2 2.7.2 behaves like the latest version. Git blame shows that xmlCtxtResetPush() has had encoding detection since 2003-10-28.

The BOM handling here is somewhat wrong, though it is compatible with libxml2 also being wrong, so the end result is correct. Suppose the encoding is little-endian UTF-16 (with BOM). Then xmlDetectCharEncoding() will say it is UTF-16LE, which it is not. UTF-16LE and UTF-16 are two different encoding schemes; UTF-16LE does not have a BOM. Now assume libxml2 gave the correct answer, UTF-16. Once lxml strips the BOM, the mechanism for distinguishing little-endian from big-endian is gone.
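The difference between the two schemes can be illustrated with Python's own codecs. This is only a sketch of the general UTF-16 versus UTF-16LE distinction, not of libxml2 or lxml internals; the sample document and variable names are made up for the example.

```python
import codecs

doc = "<a/>"

# Build the byte streams explicitly so the result does not depend on the
# byte order of the machine running the example.
le_with_bom = codecs.BOM_UTF16_LE + doc.encode("utf-16-le")
be_with_bom = codecs.BOM_UTF16_BE + doc.encode("utf-16-be")

# "utf-16": the BOM selects the byte order and is consumed.
from_le = le_with_bom.decode("utf-16")
from_be = be_with_bom.decode("utf-16")

# "utf-16-le": the scheme has no BOM, so the leading U+FEFF is kept
# as an ordinary character in the decoded text.
le_scheme = le_with_bom.decode("utf-16-le")

# After stripping the BOM the byte order is no longer recorded in the
# data; decoding big-endian bytes as little-endian yields other text.
stripped_be = be_with_bom[len(codecs.BOM_UTF16_BE):]
wrong_guess = stripped_be.decode("utf-16-le")
```

Here from_le and from_be both equal the original document, le_scheme carries an extra leading U+FEFF, and wrong_guess differs from the document because nothing in the stripped bytes says which byte order was used.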