Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Input validation & encoding detection #172

Merged
merged 2 commits into from
Sep 4, 2015
Merged

Input validation & encoding detection #172

merged 2 commits into from
Sep 4, 2015

Conversation

opottone
Copy link
Contributor

Here are two unrelated commits.

The first one is straightforward input validation (this time Python 3 compatible).

The second one is about encoding detection. In progressive parsing libxml2 does not properly detect the encoding and lxml has a workaround for that. However the problem with libxml2 is that it is not used properly (can't blame you, it's not properly documented.) It's simpler if we use libxml2 as intended and drop the workaround.

Use libxml2 properly, that is, give xmlCtxtResetPush() at least 4 bytes
of xml data. Then it properly processes byte order mark, and it need not
be processed in lxml.
@scoder
Copy link
Member

scoder commented Jun 21, 2015

It's always better to keep unrelated commits in separate pull requests. I would have merged the first change immediately (and fixed it up a tiny bit), but I'm not sure I'll merge the second one.

@scoder
Copy link
Member

scoder commented Sep 4, 2015

Thanks!

scoder added a commit that referenced this pull request Sep 4, 2015
Input validation & encoding detection
@scoder scoder merged commit 9553f4d into lxml:master Sep 4, 2015
@adamailru
Copy link

Hi
please look at http://stackoverflow.com/questions/34595275/disable-comments-check-for-in-lxml
Your commit crashed my parser, please suggest how to upgrade parser code, in order to be compatible with lxml after this commit.

reproduce able code:

import requests
from lxml.html import html5parser
data = requests.get("https://www.banca-romaneasca.ro/en/tools-and-resources/exchange-rates/").content
context = html5parser.document_fromstring(data)

...
      File "src/lxml/lxml.etree.pyx", line 3017, in lxml.etree.Comment (src/lxml/lxml.etree.c:80806)
ValueError: Comment may not contain '--' or end with '-'

looking forward

@opottone
Copy link
Contributor Author

opottone commented Jan 5, 2016

@adamailru, it seems to me that that is a bug in html5parser, not in lxml. If it even is a bug; html5 standard does not allow double hyphen in a comment, so it is reasonable to throw an exception. Although arguably it would be better to log a warning instead of throwing an exception.

@adamailru
Copy link

@opottone it is lxml for sure

  1. cloned lxml

changed src/lxml/lxml.etree.pyx
from:

if text is None:
    text = b''
else:
    text = _utf8(text)
    if b'--' in text or text.endswith(b'-'):
        raise ValueError("Comment may not contain '--' or end with '-'")

to:

if text is None:
    text = b''
else:
    text = _utf8(text)
    if b'--' in text or text.endswith(b'-'):
        # raise ValueError("Comment may not contain '--' or end with '-'")
        import warnings
        warnings.warn(
            u"Comment may not contain '--' or end with '-'",
            FutureWarning
            )
  1. sudo /opt/python-env/lxml/bin/pip install cython
  2. sudo /opt/python-env/lxml/bin/pip install "requests[security]"
  3. sudo /opt/python-env/lxml/bin/pip install html5lib
  4. sudo /opt/python-env/lxml/bin/pip install /_github.com/lxml

run the testa.py code

import requests
from lxml.html import html5parser
data = requests.get("https://www.banca-romaneasca.ro/en/tools-and-resources/exchange-rates/").content
context = html5parser.document_fromstring(data)
print context

and got:

/opt/python-env/lxml/bin/python2.7 /_github.com/lxml/testa.py
/opt/python-env/lxml/local/lib/python2.7/site-packages/html5lib/ihatexml.py:262: DataLossWarning: Coercing non-XML name
  warnings.warn("Coercing non-XML name", DataLossWarning)
/opt/python-env/lxml/local/lib/python2.7/site-packages/html5lib/treebuilders/etree.py:148: FutureWarning: Comment may not contain '--' or end with '-'
  self._element = ElementTree.Comment(data)
<Element {http://www.w3.org/1999/xhtml}html at 0x7fb6760c4a70>

SO

  1. How and can I propagate my change into https://github.com/lxml/lxml ?
  2. Will it be accepted at all ? If not then which workaround there is to handle my issue ?
  3. In case if it accepted, how long time will it be until next release ? How can I apply patch until next release is available, is it enough to change only /opt/python-env/lxml/local/lib/python2.7/site-packages/lxml/{etree.so,lxml.etree.h} ?

@opottone
Copy link
Contributor Author

opottone commented Jan 5, 2016

It seems to me that html5parser misuses lxml by calling lxml.etree.Comment. It is meant for xml, but html5parser uses it for html. With xml, throwing an error is the correct thing to do when there is a double hyphen.

@opottone
Copy link
Contributor Author

opottone commented Jan 6, 2016

I tried installing latest html5parser from https://github.com/html5lib/html5lib-python. Now I only get a warning, not an error.

@adamailru
Copy link

@opottone, yes !!!
you are right, its really working

/opt/python-env/ciur_2/bin/python /_bitbucket.com/ciur.example.exchange/country/ro/banks/banca_romaneasca_ro/test_parse.py
/opt/python-env/ciur_2/local/lib/python2.7/site-packages/html5lib/ihatexml.py:265: DataLossWarning: Coercing non-XML name
  warnings.warn("Coercing non-XML name", DataLossWarning)
/opt/python-env/ciur_2/local/lib/python2.7/site-packages/html5lib/ihatexml.py:226: DataLossWarning: Comments cannot contain adjacent dashes
  warnings.warn("Comments cannot contain adjacent dashes", DataLossWarning)
/opt/python-env/ciur_2/local/lib/python2.7/site-packages/html5lib/ihatexml.py:229: DataLossWarning: Comments cannot end in a dash
  warnings.warn("Comments cannot end in a dash", DataLossWarning)
{
    "root": {
        "date": "datetime.datetime(2016, 1, 6, 0, 0)", 
        "item_list": [
            {
                "currency": "EUR", 
                "rate": 1.0, 
                "buy": 4.4876, 
                "sell": 4.5546
            }, 

So I can say now:

ISSUE is CLOSED

All credits and thanks goes to @opottone

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants