Unicode literal in BufferedStream._readFromBuffer causes failure in HTMLBinaryInputStream #67

Closed
niklasl opened this issue Jun 17, 2013 · 0 comments

niklasl commented Jun 17, 2013

In html5lib/inputstream.py, unicode_literals is imported from __future__. This causes html5lib.inputstream.BufferedStream to misbehave, specifically in the _readFromBuffer method, which ends with return "".join(rv). Because "" is now a unicode literal, any read after the first returns a chunk of unicode instead of a chunk of bytes.

An example of the problem caused:

from urllib2 import Request, urlopen
from html5lib.inputstream import HTMLBinaryInputStream

req = Request(url='http://example.org/')
source = urlopen(req)
HTMLBinaryInputStream(source)

Causing:

Traceback (most recent call last):
  File "<stdin>", line 6, in <module>
  File ".../html5lib/inputstream.py", line 411, in __init__
    self.charEncoding = self.detectEncoding(parseMeta, chardet)
  File ".../html5lib/inputstream.py", line 448, in detectEncoding
    encoding = self.detectEncodingMeta()
  File ".../html5lib/inputstream.py", line 535, in detectEncodingMeta
    assert isinstance(buffer, bytes)
AssertionError

(That is, when HTMLBinaryInputStream is given a file-like object, such as the result of urllib2.urlopen, it wraps that object in a BufferedStream, and the assert isinstance(buffer, bytes) at line 535 then fails.)

This can be fixed by using a byte literal in _readFromBuffer instead, i.e. return b"".join(rv). (There are at least three places in inputstream.py where string literals are used like this: lines 117, 318 and 348.)
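
For illustration, here is a minimal sketch of the difference between the two joins. It does not use html5lib at all; read_chunks is a hypothetical helper that only stands in for the way a buffered stream collects byte chunks, and u"" is written out explicitly to mimic what "" means once unicode_literals is imported.

```python
import io


def read_chunks(stream, size):
    """Hypothetical helper: read a binary stream in fixed-size byte chunks."""
    chunks = []
    while True:
        chunk = stream.read(size)
        if not chunk:
            break
        chunks.append(chunk)
    return chunks


chunks = read_chunks(io.BytesIO(b"<html><body>hi</body></html>"), 8)

# Joining with a byte literal keeps the result as bytes.
joined = b"".join(chunks)

# Joining with a unicode literal does not: Python 2 silently returns
# unicode, while Python 3 refuses to join bytes with a str separator.
try:
    wrong = u"".join(chunks)
except TypeError:
    wrong = None
```

On Python 2 the second join succeeds and hands back unicode, which is how the value can reach the isinstance(buffer, bytes) assertion unnoticed. On Python 3 the same mistake raises a TypeError at the join itself.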

gsnedders added a commit to gsnedders/html5lib-python that referenced this issue Jun 26, 2013
This is a simple case of using a unicode string to join where we
should be using a bytes string.
dbs added a commit to dbs/rdflib that referenced this issue Feb 26, 2014
Due to an html5lib regression (described in the thread at
https://groups.google.com/d/msg/rdflib-dev/ZcAgKzhS3vI/3mxIJz4rwWUJ), we
opened html5lib/html5lib-python#67 on June 17th, 2013
and pinned html5lib to 0.95 in setup.py. The bug was fixed and a new release of
html5lib (1.0b3) was cut on July 24, 2013. The current version of html5lib is
0.999; let's unpin that html5lib requirement.

Signed-off-by: Dan Scott <dan@coffeecode.net>
jechols pushed a commit to uoregon-libraries/oregonnews that referenced this issue Sep 29, 2014