In html5lib/inputstream.py, unicode_literals is imported from __future__. This causes html5lib.inputstream.BufferedStream to misbehave, specifically in the _readFromBuffer method, which ends with return "".join(rv). Because "" is then a unicode literal, any read after the first returns a chunk of unicode instead of a chunk of bytes. The resulting failure:
Traceback (most recent call last):
  File "<stdin>", line 6, in <module>
  File ".../html5lib/inputstream.py", line 411, in __init__
    self.charEncoding = self.detectEncoding(parseMeta, chardet)
  File ".../html5lib/inputstream.py", line 448, in detectEncoding
    encoding = self.detectEncodingMeta()
  File ".../html5lib/inputstream.py", line 535, in detectEncodingMeta
    assert isinstance(buffer, bytes)
AssertionError
(That is, when HTMLBinaryInputStream is used with a file-like object (such as the result of urllib2.urlopen), it wraps it in a BufferedStream, which then fails (at line 535) with the assert isinstance(buffer, bytes).)
This can be fixed by using a byte literal in _readFromBuffer instead, i.e. return b"".join(rv). (String literals are used like this in at least three places in inputstream.py: at lines 117, 318 and 348.)
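To illustrate the proposed fix, here is a minimal sketch of a buffering wrapper that joins its chunks with a byte literal. `TinyBufferedStream` is a hypothetical, heavily simplified stand-in written for this comment, not the actual `BufferedStream` code, and it only supports seeking back into data that has already been read.

```python
import io


class TinyBufferedStream(object):
    """Hypothetical, simplified stand-in for html5lib's BufferedStream."""

    def __init__(self, stream):
        self.stream = stream
        self.buffer = []    # byte chunks read from the underlying stream so far
        self.position = 0   # current offset into the buffered data

    def seek(self, pos):
        # Only rewinding into already-buffered data is supported in this sketch.
        self.position = pos

    def read(self, size):
        buffered = b"".join(self.buffer)
        # Serve as much as possible from the buffer first.
        rv = [buffered[self.position:self.position + size]]
        missing = size - len(rv[0])
        if missing > 0:
            # Fetch the remainder from the underlying stream and remember it.
            fresh = self.stream.read(missing)
            self.buffer.append(fresh)
            rv.append(fresh)
        self.position += sum(len(chunk) for chunk in rv)
        # Byte literal on purpose: in a module that imports unicode_literals,
        # a plain "" here would be a unicode string, not bytes.
        return b"".join(rv)


stream = TinyBufferedStream(io.BytesIO(b"<html><meta charset=utf-8>"))
first = stream.read(4)
stream.seek(0)
second = stream.read(6)  # partly served from the buffer, partly freshly read
assert isinstance(first, bytes) and isinstance(second, bytes)
```

With the `b""` joiner, a read that mixes buffered and fresh data still returns bytes, which is what the `assert isinstance(buffer, bytes)` check in `detectEncodingMeta` expects.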
gsnedders added a commit to gsnedders/html5lib-python that referenced this issue on Jun 26, 2013
Due to an html5lib regression (described in the thread at
https://groups.google.com/d/msg/rdflib-dev/ZcAgKzhS3vI/3mxIJz4rwWUJ), we
opened html5lib/html5lib-python#67 on June 17th, 2013
and pinned html5lib to 0.95 in setup.py. The bug was fixed and a new release of
html5lib (1.0b3) was cut on July 24, 2013. The current version of html5lib is
0.999; let's unpin that html5lib requirement.
Signed-off-by: Dan Scott <dan@coffeecode.net>