Allow bytes as input to lxml.html.fromstring, thereby fixing issue #33. #72

wants to merge 8 commits into from

2 participants


The new code in fromstring now passes startswith prefixes of the matching type, depending on whether the input is a bytes object or not. I also added a test case that passes UTF-8 encoded data and supplies the encoding via the parser argument.
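The type-dependent prefix check described above can be sketched in isolation as follows. This is only an illustration written for Python 3, not the code from the patch: the function name `looks_like_full_document` is hypothetical, and it reproduces only the prefix test, not the rest of `fromstring`.

```python
def looks_like_full_document(html):
    """Return True if the input appears to start with <html or <!doctype.

    Hypothetical helper for illustration; accepts either str or bytes.
    """
    # Look only at the first few characters, ignoring leading whitespace and case.
    start = html[:10].lstrip().lower()
    if isinstance(start, bytes):
        # bytes.startswith() requires bytes prefixes, so encode the literals.
        prefixes = ('<html'.encode('ascii'), '<!doctype'.encode('ascii'))
    else:
        prefixes = ('<html', '<!doctype')
    # startswith() accepts a tuple and returns True if any prefix matches.
    return start.startswith(prefixes)
```

Passing a tuple to `startswith` keeps the check to a single call per input type; the two branches exist only because text and bytes prefixes cannot be mixed in Python 3.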

I only tried the test case with Python 2.7 and 3.2. Hopefully earlier versions simply ignore the b in front of string literals, as I have read. The expected output at the end also differs from the actual output in Python 3, which has a b in front. But as far as I have seen, this also happens in


The b'' string prefix is not available in Py2.4.
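Because the `b''` prefix is a syntax feature, it cannot be guarded at runtime on interpreters that do not know it. One way around this, sketched here under the assumption that the literals are plain ASCII, is to build the byte strings by encoding ordinary string literals. The helper name `as_bytes` and the constant `HTML_PREFIXES` are hypothetical and not part of lxml.

```python
def as_bytes(text, encoding='ascii'):
    """Hypothetical helper: build a byte string without using the b'' prefix."""
    return text.encode(encoding)


# Byte-string prefixes created once, with no b'' literal in the source.
HTML_PREFIXES = (as_bytes('<html'), as_bytes('<!doctype'))
```

The encoding is spelled out explicitly so the result does not depend on the interpreter's default encoding.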

@canaaerus canaaerus closed this

Sorry, I can't accept code that doesn't work in Py2.4/5 and that contains a test that (IIUC) fails in Py3.

But why did you close the pull request?


Because I just read your comment on my commit. I wish I had been informed about it somehow...
Please see my comment on the issue. To be honest, this feels like a terrible way to do a discussion.

@canaaerus canaaerus reopened this

OK, I hope it now works in all the necessary versions of Python.


Thanks. However, there's way too much code churn in your changes now. It's even hard to see if you really managed to undo all the accidental changes. They look like a broken merge or something.

Could you try to remove those changes that introduced and reverted all the whitespace changes etc.?


In the “Files Changed” view you can see that the reverted changes are OK. But if they would mess up the commit history, I’ll try to remove all the commits and push only a single clean one, although I don’t know yet how to do this with git…
The mess was caused by my making the changes in my local (outdated) lxml version first and then copying the result into the git tree, which of course was a mistake. When I tried to restore the current version, I still got the whitespace wrong at first.

@canaaerus canaaerus closed this
Showing with 14 additions and 1 deletion.
  1. +2 −1  src/lxml/html/
  2. +12 −0 src/lxml/html/tests/test_basic.txt
3  src/lxml/html/
@@ -658,7 +658,8 @@ def fromstring(html, base_url=None, parser=None, **kw):
     if parser is None:
         parser = html_parser
     start = html[:10].lstrip().lower()
-    if start.startswith('<html') or start.startswith('<!doctype'):
+    if (isinstance(start, bytes) and start.startswith( ('<html'.encode(), '<!doctype'.encode()) ) or
+        not isinstance(start, bytes) and start.startswith( ('<html', '<!doctype') )):
         # Looks like a full HTML document
         return document_fromstring(html, parser=parser, base_url=base_url, **kw)
     # otherwise, lets parse it out...
12 src/lxml/html/tests/test_basic.txt
@@ -99,3 +99,15 @@ drop the comment. Here, ``drop_tag()`` behaves exactly like ``drop_tree()``:
+In Python3 it should be possible to parse strings given as bytes objects, at
+least if an encoding is given.
+ >>> from lxml.html import HTMLParser
+ >>> enc = 'utf-8'
+ >>> html_parser = HTMLParser(encoding=enc)
+ >>> src = '<html><body>Test</body></html>'.encode(enc)
+ >>> doc = fromstring(src, parser=html_parser)
+ >>> print(tostring(doc, encoding=unicode))
+ '<html><body>Test</body></html>'