Allow bytes as input to lxml.html.fromstring, thereby fixing issue #33. #72

Closed
wants to merge 8 commits into from

2 participants

@canaaerus

The new code in fromstring now passes arguments of the appropriate type to startswith, depending on whether a bytes object was given as input. I also added a test case that passes UTF-8 encoded data and supplies the encoding via the parser argument.

I only tried the test case with Python 2.7 and 3.2. Hopefully earlier versions simply ignore the b in front of string literals, as I have read. The expected output at the end will also differ from the actual output in Python 3, because the b in front is missing. But as far as I have seen, this also happens in selftest.py.
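
For illustration, the type-dependent check described above could be sketched as follows. This is a minimal sketch and not the code from the pull request: the helper name `looks_like_full_document` is hypothetical, and it assumes an interpreter where the name `bytes` exists (Python 2.6 or later, or Python 3). The prefixes are built with `.encode()` so that no `b''` literals are needed.

```python
def looks_like_full_document(html):
    """Return True if `html` (str or bytes) starts like a full HTML document.

    Hypothetical helper for illustration; it mirrors the prefix check that
    fromstring performs on the first ten characters of its input.
    """
    start = html[:10].lstrip().lower()
    prefixes = ('<html', '<!doctype')
    # On Python 2.6+, bytes is an alias of str, so this branch is only
    # taken for real bytes objects on Python 3.
    if isinstance(start, bytes) and not isinstance(start, str):
        # Build bytes prefixes without using b'' literals.
        prefixes = tuple(p.encode('ascii') for p in prefixes)
    # startswith accepts a tuple of candidate prefixes.
    return start.startswith(prefixes)


print(looks_like_full_document('<html><body>Test</body></html>'))
print(looks_like_full_document('<html><body>Test</body></html>'.encode('utf-8')))
print(looks_like_full_document('<div>fragment</div>'))
```

Choosing the prefix type once and calling `startswith` a single time keeps the condition short and avoids repeating the `isinstance` test in both halves of an `or` expression.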

@scoder

The b'' string prefix is not available in Py2.4.

@canaaerus canaaerus closed this
@scoder
Owner

Sorry, I can't accept code that doesn't work in Py2.4/5 and that contains a test that (IIUC) fails in Py3.

But why did you close the pull request?

@canaaerus

Because I just read your comment on my commit. I wish I had been notified about it somehow...
Please see my comment on the issue. To be honest, this feels like a terrible way to hold a discussion.

@canaaerus canaaerus reopened this
@canaaerus

OK, I hope it now works in all required versions of Python.

@scoder
Owner

Thanks. However, there's way too much code churn in your changes now. It's even hard to see if you really managed to undo all the accidental changes. They look like a broken merge or something.

Could you try to remove those changes that introduced and reverted all the whitespace changes etc.?

@canaaerus

In the “Files Changed” view you can see that the reverted changes are OK. But if these things mess up the commit history, I’ll try to remove all the commits and push only a single clean one, although I don’t know yet how to do this with git…
The cause of the mess was that I first made the changes to my local (outdated) lxml version and then copied it into the git tree, which of course was a mistake. When trying to restore the current version, I initially still messed up the whitespace.

@canaaerus canaaerus closed this
Showing with 14 additions and 1 deletion.
  1. +2 −1  src/lxml/html/__init__.py
  2. +12 −0 src/lxml/html/tests/test_basic.txt
3  src/lxml/html/__init__.py
@@ -658,7 +658,8 @@ def fromstring(html, base_url=None, parser=None, **kw):
     if parser is None:
         parser = html_parser
     start = html[:10].lstrip().lower()
-    if start.startswith('<html') or start.startswith('<!doctype'):
+    if (isinstance(start, bytes) and start.startswith( ('<html'.encode(), '<!doctype'.encode()) ) or
+        not isinstance(start, bytes) and start.startswith( ('<html', '<!doctype') )):
         # Looks like a full HTML document
         return document_fromstring(html, parser=parser, base_url=base_url, **kw)
     # otherwise, lets parse it out...
12 src/lxml/html/tests/test_basic.txt
@@ -99,3 +99,15 @@ drop the comment. Here, ``drop_tag()`` behaves exactly like ``drop_tree()``:
<div>footer</div>
</body>
</html>
+
+In Python3 it should be possible to parse strings given as bytes objects, at
+least if an encoding is given.
+
+    >>> from lxml.html import HTMLParser
+    >>> enc = 'utf-8'
+    >>> html_parser = HTMLParser(encoding=enc)
+    >>> src = '<html><body>Test</body></html>'.encode(enc)
+    >>> doc = fromstring(src, parser=html_parser)
+    >>> print(tostring(doc, encoding=unicode))
+    '<html><body>Test</body></html>'
+
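
The discussion notes that the expected output of this test differs between Python 2 and Python 3 because of the `b` prefix in the representation of bytes objects. One possible way to get the same printed output on both is to convert the serialized result to text before printing it. The following is only a sketch under that assumption: the helper `as_text` is hypothetical and not part of lxml, and a plain encoded string stands in for the serialized document so that the example runs without lxml installed.

```python
def as_text(serialized, encoding='utf-8'):
    """Return `serialized` as text, decoding it first if it is a bytes object.

    Hypothetical helper for illustration only.
    """
    # On Python 2.6+, bytes is an alias of str, so nothing is decoded there.
    if isinstance(serialized, bytes) and not isinstance(serialized, str):
        return serialized.decode(encoding)
    return serialized


# Stand-in for a serialized document; a real test would use the tostring() result.
raw = '<html><body>Test</body></html>'.encode('utf-8')
print(as_text(raw))  # prints <html><body>Test</body></html>
```

Because `print` writes the text itself rather than its representation, the expected doctest line would then contain neither quotes nor a `b` prefix.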