New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Soup hyphen in comment #251

Merged
merged 4 commits into from Jul 29, 2017

Conversation

Projects
None yet
2 participants
@mozbugbox
Copy link
Contributor

mozbugbox commented Jul 28, 2017

Fix a couple of issues with soupparser

mozbugbox added some commits Jul 28, 2017

Soupparser: Fix "No parser was explicitly specified" warning
Seems that beautifulsoup won't handle using list as `features`
properly.
@scoder

This comment has been minimized.

Copy link
Member

scoder commented Jul 28, 2017

Thanks. Could you come up with a test that shows why the "--" special handling is necessary? At first glance, it seems wrong to me to add a character escape in text content (which is not serialised XML).

@mozbugbox

This comment has been minimized.

Copy link
Contributor

mozbugbox commented Jul 28, 2017

Something like

<html>
<body>
<div> Test </div>
<!-- This is -- a comment with double hyphen -->
</body>
</html>

When parsing with soupparser, exception raised

 raise ValueError("Comment may not contain '--' or end with '-'")

The lxml code that raise the exception is in lxml.etree.pyx, Comment().

@mozbugbox

This comment has been minimized.

Copy link
Contributor

mozbugbox commented Jul 28, 2017

hmm... It seems lxml.html can parse comment with 2 hyphens with no problem.

from lxml.html import soupparser
from lxml import html

content = "<html><!-- something with -- 2 hyphen --></html>"

tree = html.fromstring(content)
print(html.tostring(tree))

tree = soupparser.fromstring(content)
print(html.tostring(tree))
@scoder

This comment has been minimized.

Copy link
Member

scoder commented Jul 28, 2017

I think the soup parser is simply using the wrong way to instantiate an HTML comment:

>>> lxml.html.HtmlComment("something with -- 2 hyphen")
<!--something with -- 2 hyphen-->
@mozbugbox

This comment has been minimized.

Copy link
Contributor

mozbugbox commented Jul 29, 2017

aha, thanks for the advice. Pull updated

@scoder

This comment has been minimized.

Copy link
Member

scoder commented Jul 29, 2017

Could you please also add tests to test_elementsoup.py ?

@scoder scoder merged commit e70dad6 into lxml:master Jul 29, 2017

1 check was pending

continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment