Unicode literal in BufferedStream._readFromBuffer causes failure in HTMLBinaryInputStream #67

niklasl · 2013-06-17T20:40:31Z

In html5lib/inputstream.py, unicode_literals is imported from __future__. This causes html5lib.inputstream.BufferedStream to misbehave, specifically the _readFromBuffer method, which ends with return "".join(rv). Because "" is then a unicode literal, every read after the first returns a chunk of unicode instead of a chunk of bytes.
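The type change can be seen in isolation. The following is a minimal sketch, not code from html5lib: the function names and sample chunks are made up for illustration, and an explicit u"" stands in for what "" becomes once unicode_literals is in effect.

```python
def join_with_text_literal(chunks):
    # u"" is what a bare "" means under `from __future__ import unicode_literals`.
    return u"".join(chunks)


def join_with_bytes_literal(chunks):
    # The b prefix keeps the separator, and therefore the result, as bytes.
    return b"".join(chunks)


# Hypothetical stand-in for chunks previously read from a binary stream.
chunks = [b"<html>", b"<body>"]

fixed = join_with_bytes_literal(chunks)

try:
    # Python 2 implicitly decodes the chunks and returns a unicode string.
    broken = join_with_text_literal(chunks)
except TypeError:
    # Python 3 refuses to join bytes with a text separator.
    broken = None

print(type(fixed).__name__, type(broken).__name__)
```

Either way, the text-literal join never yields bytes, so an isinstance(..., bytes) check downstream cannot pass.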

An example that triggers the problem:

from urllib2 import Request, urlopen
from html5lib.inputstream import HTMLBinaryInputStream

req = Request(url='http://example.org/')
source = urlopen(req)
HTMLBinaryInputStream(source)

This raises:

Traceback (most recent call last):
  File "<stdin>", line 6, in <module>
  File ".../html5lib/inputstream.py", line 411, in __init__
    self.charEncoding = self.detectEncoding(parseMeta, chardet)
  File ".../html5lib/inputstream.py", line 448, in detectEncoding
    encoding = self.detectEncodingMeta()
  File ".../html5lib/inputstream.py", line 535, in detectEncodingMeta
    assert isinstance(buffer, bytes)
AssertionError

That is, when HTMLBinaryInputStream is used with a file-like object, such as the result of urllib2.urlopen, it wraps the object in a BufferedStream, which then fails at line 535 on assert isinstance(buffer, bytes).

This can be fixed by using a bytes literal in _readFromBuffer instead, i.e. return b"".join(rv). (There are at least three places in inputstream.py where string literals are used like this: at lines 117, 318 and 348.)
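As a rough illustration of the proposed fix, here is a toy buffer reader that collects slices into rv and joins them with a bytes literal. This is a hypothetical sketch and not html5lib's actual _readFromBuffer: the function name, its parameters, and the sample data are assumptions made for the example.

```python
def read_from_buffer(chunks, offset, count):
    """Return up to `count` bytes starting at `offset` across a list of byte chunks."""
    rv = []
    remaining = count
    for chunk in chunks:
        if remaining <= 0:
            break
        if offset >= len(chunk):
            # The requested range starts after this chunk, so skip it.
            offset -= len(chunk)
            continue
        piece = chunk[offset:offset + remaining]
        rv.append(piece)
        remaining -= len(piece)
        offset = 0
    # Bytes literal: the result stays bytes even when unicode_literals is imported.
    return b"".join(rv)


# Hypothetical buffered chunks, as if already read from a socket.
chunks = [b"<!DOCTYPE html>", b"<meta charset=utf-8>"]
data = read_from_buffer(chunks, 10, 12)

# The same kind of check that detectEncodingMeta performs on its buffer.
assert isinstance(data, bytes)
```

A check like the final assertion could serve as a regression test for any of the joins that are switched to bytes literals.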



Due to an html5lib regression (described in the thread at https://groups.google.com/d/msg/rdflib-dev/ZcAgKzhS3vI/3mxIJz4rwWUJ), we opened html5lib/html5lib-python#67 on June 17th, 2013 and pinned html5lib to 0.95 in setup.py. The bug was fixed and a new release of html5lib (1.0b3) was cut on July 24, 2013. The current version of html5lib is 0.999; let's unpin that html5lib requirement. Signed-off-by: Dan Scott <dan@coffeecode.net>


gsnedders added a commit to gsnedders/html5lib-python that referenced this issue Jun 26, 2013

Fix html5lib#67: BufferedStream returns unicode string.

b995fe5

This is a simple case of using a unicode string to join where we should be using a bytes string.

gsnedders closed this as completed in 80c9044 Jul 9, 2013

dbs mentioned this issue Feb 26, 2014

Unpin html5lib 0.95 requirement RDFLib/rdflib#360

Merged

jechols pushed a commit to uoregon-libraries/oregonnews that referenced this issue Sep 29, 2014

api for parsing rdfa changed a bit in rdflib 4.x ; need to pin html5l…

ae87113

…ib to 0.95 until html5lib/html5lib-python#67 is resolved. fixes #54
