Skip to content

Commit

Permalink
Simplify encoding detection.
Browse files Browse the repository at this point in the history
Use libxml2 properly, that is, give xmlCtxtResetPush() at least 4 bytes
of xml data. Then it properly processes byte order mark, and it need not
be processed in lxml.
  • Loading branch information
opottone committed Jun 21, 2015
1 parent 3b77ce2 commit 0df8e59
Showing 1 changed file with 13 additions and 25 deletions.
38 changes: 13 additions & 25 deletions src/lxml/parser.pxi
Expand Up @@ -1243,34 +1243,24 @@ cdef class _FeedParser(_BaseParser):
c_filename = (_cstr(self._filename)
if self._filename is not None else NULL)

if c_encoding is NULL and py_buffer_len >= 2:
# libxml2 can't handle BOMs here, so let's try ourselves
if c_data[0] in b'\xfe\xef\xff':
# likely a BOM, let's take a closer look
c_encoding = _findEncodingName(
<const_xmlChar*>c_data,
4 if py_buffer_len > 4 else <int>py_buffer_len)
if c_encoding is not NULL:
# found it => skip over BOM (if there is one)
if (c_data[0] == b'\xef' and
c_data[1] == b'\xbb' and
c_data[2] == b'\xbf'):
c_data += 3 # UTF-8 BOM
py_buffer_len -= 3
elif (c_data[0] == b'\xfe' and c_data[1] == b'\xff' or
c_data[0] == b'\xff' and c_data[1] == b'\xfe'):
# UTF-16 BE/LE
c_data += 2
py_buffer_len -= 2

# We have to give *mlCtxtResetPush() enough input to figure
# out the character encoding (at least four bytes),
# however if we give it all we got, we'll have nothing for
# *mlParseChunk() and things go wrong.
if py_buffer_len > 4:
buffer_len = 4
else:
buffer_len = <int>py_buffer_len
if self._for_html:
error = _htmlCtxtResetPush(
pctxt, NULL, 0, c_filename, c_encoding,
pctxt, c_data, buffer_len, c_filename, c_encoding,
self._parse_options)
else:
xmlparser.xmlCtxtUseOptions(pctxt, self._parse_options)
error = xmlparser.xmlCtxtResetPush(
pctxt, NULL, 0, c_filename, c_encoding)
pctxt, c_data, buffer_len, c_filename, c_encoding)
py_buffer_len -= buffer_len
c_data += buffer_len
if error:
raise MemoryError()
__GLOBAL_PARSER_CONTEXT.initParserDict(pctxt)
Expand Down Expand Up @@ -1365,7 +1355,7 @@ cdef int _htmlCtxtResetPush(xmlparser.xmlParserCtxt* c_ctxt,
cdef xmlparser.xmlParserInput* c_input_stream
# libxml2 lacks an HTML push parser setup function
error = xmlparser.xmlCtxtResetPush(
c_ctxt, NULL, 0, c_filename, c_encoding)
c_ctxt, c_data, buffer_len, c_filename, c_encoding)
if error:
return error

Expand All @@ -1374,8 +1364,6 @@ cdef int _htmlCtxtResetPush(xmlparser.xmlParserCtxt* c_ctxt,
c_ctxt.html = 1
htmlparser.htmlCtxtUseOptions(c_ctxt, parse_options)

if c_data is not NULL and buffer_len > 0:
return htmlparser.htmlParseChunk(c_ctxt, c_data, buffer_len, 0)
return 0

############################################################
Expand Down

3 comments on commit 0df8e59

@scoder
Copy link

@scoder scoder commented on 0df8e59 Jun 21, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tested this with libxml2 2.7?

@scoder
Copy link

@scoder scoder commented on 0df8e59 Jun 21, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the general approach is better than what's there, but I'm not sure we should drop the BOM handling code.

@opottone
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could not directly test this with libxml2 2.7, because the lxml tests won't run:

ImportError: dlopen(/Users/opottone/omat/code/lxml/src/lxml/etree.so, 2): Library not loaded: /opt/local/lib/libxml2.2.dylib
Referenced from: /Users/opottone/omat/code/lxml/src/lxml/etree.so
Reason: Incompatible library version: etree.so requires version 12.0.0 or later, but libxml2.2.dylib provides version 11.0.0

However test in C indicates that libxml2 2.7.2 works like the latest version. Git blame tells that xmlCtxtResetPush() has had encoding detection since 2003-10-28.

The BOM handling here is somewhat wrong, though compatible with libxml2 also being wrong, so the end result is correct. Let us say the encoding is little-endian UTF-16 (with BOM). Then xmlDetectCharEncoding() will say it is UTF-16LE, which it is not. UTF-16LE and UTF-16 are two different encoding schemes; UTF-16LE does not have a BOM. Assume libxml2 actually gave the correct answer, UTF-16. When lxml strips the BOM, the mechanism for distinguishing between little endian and big endian is broken.

Please sign in to comment.