New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Input validation & encoding detection #172

Merged
merged 2 commits into from Sep 4, 2015

Conversation

Projects
None yet
4 participants
@opottone
Contributor

opottone commented Jun 21, 2015

Here are two unrelated commits.

The first one is straightforward input validation (this time Python 3 compatible).

The second one is about encoding detection. In progressive parsing libxml2 does not properly detect the encoding and lxml has a workaround for that. However the problem with libxml2 is that it is not used properly (can't blame you, it's not properly documented.) It's simpler if we use libxml2 as intended and drop the workaround.

opottone added some commits Jun 14, 2015

Simplify encoding detection.
Use libxml2 properly, that is, give xmlCtxtResetPush() at least 4 bytes
of xml data. Then it properly processes byte order mark, and it need not
be processed in lxml.
@scoder

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Jun 21, 2015

Just indent this by one level and drop the None check.

scoder commented on src/lxml/lxml.etree.pyx in 3b77ce2 Jun 21, 2015

Just indent this by one level and drop the None check.

@scoder

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Jun 21, 2015

No need to check the empty string here (see code right above).

scoder commented on src/lxml/lxml.etree.pyx in 3b77ce2 Jun 21, 2015

No need to check the empty string here (see code right above).

@scoder

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Jun 21, 2015

That's not an "end tag", I'd rather call it "end marker" or actually rather nothing at all.

scoder commented on src/lxml/lxml.etree.pyx in 3b77ce2 Jun 21, 2015

That's not an "end tag", I'd rather call it "end marker" or actually rather nothing at all.

@scoder

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Jun 21, 2015

Have you tested this with libxml2 2.7?

scoder commented on 0df8e59 Jun 21, 2015

Have you tested this with libxml2 2.7?

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Jun 21, 2015

I agree that the general approach is better than what's there, but I'm not sure we should drop the BOM handling code.

scoder replied Jun 21, 2015

I agree that the general approach is better than what's there, but I'm not sure we should drop the BOM handling code.

This comment has been minimized.

Show comment
Hide comment
@opottone

opottone Jul 8, 2015

Owner

I could not directly test this with libxml2 2.7, because the lxml tests won't run:

ImportError: dlopen(/Users/opottone/omat/code/lxml/src/lxml/etree.so, 2): Library not loaded: /opt/local/lib/libxml2.2.dylib
Referenced from: /Users/opottone/omat/code/lxml/src/lxml/etree.so
Reason: Incompatible library version: etree.so requires version 12.0.0 or later, but libxml2.2.dylib provides version 11.0.0

However test in C indicates that libxml2 2.7.2 works like the latest version. Git blame tells that xmlCtxtResetPush() has had encoding detection since 2003-10-28.

The BOM handling here is somewhat wrong, though compatible with libxml2 also being wrong, so the end result is correct. Let us say the encoding is little-endian UTF-16 (with BOM). Then xmlDetectCharEncoding() will say it is UTF-16LE, which it is not. UTF-16LE and UTF-16 are two different encoding schemes; UTF-16LE does not have a BOM. Assume libxml2 actually gave the correct answer, UTF-16. When lxml strips the BOM, the mechanism for distinguishing between little endian and big endian is broken.

Owner

opottone replied Jul 8, 2015

I could not directly test this with libxml2 2.7, because the lxml tests won't run:

ImportError: dlopen(/Users/opottone/omat/code/lxml/src/lxml/etree.so, 2): Library not loaded: /opt/local/lib/libxml2.2.dylib
Referenced from: /Users/opottone/omat/code/lxml/src/lxml/etree.so
Reason: Incompatible library version: etree.so requires version 12.0.0 or later, but libxml2.2.dylib provides version 11.0.0

However test in C indicates that libxml2 2.7.2 works like the latest version. Git blame tells that xmlCtxtResetPush() has had encoding detection since 2003-10-28.

The BOM handling here is somewhat wrong, though compatible with libxml2 also being wrong, so the end result is correct. Let us say the encoding is little-endian UTF-16 (with BOM). Then xmlDetectCharEncoding() will say it is UTF-16LE, which it is not. UTF-16LE and UTF-16 are two different encoding schemes; UTF-16LE does not have a BOM. Assume libxml2 actually gave the correct answer, UTF-16. When lxml strips the BOM, the mechanism for distinguishing between little endian and big endian is broken.

@scoder

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Jun 21, 2015

Member

It's always better to keep unrelated commits in separate pull requests. I would have merged the first change immediately (and fixed it up a tiny bit), but I'm not sure I'll merge the second one.

Member

scoder commented Jun 21, 2015

It's always better to keep unrelated commits in separate pull requests. I would have merged the first change immediately (and fixed it up a tiny bit), but I'm not sure I'll merge the second one.

@scoder

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Sep 4, 2015

Member

Thanks!

Member

scoder commented Sep 4, 2015

Thanks!

scoder added a commit that referenced this pull request Sep 4, 2015

Merge pull request #172 from opottone/misc
Input validation & encoding detection

@scoder scoder merged commit 9553f4d into lxml:master Sep 4, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@adamailru

This comment has been minimized.

Show comment
Hide comment
@adamailru

adamailru Jan 4, 2016

Hi
please look at http://stackoverflow.com/questions/34595275/disable-comments-check-for-in-lxml
Your commit crashed my parser, please suggest how to upgrade parser code, in order to be compatible with lxml after this commit.

reproduce able code:

import requests
from lxml.html import html5parser
data = requests.get("https://www.banca-romaneasca.ro/en/tools-and-resources/exchange-rates/").content
context = html5parser.document_fromstring(data)

...
      File "src/lxml/lxml.etree.pyx", line 3017, in lxml.etree.Comment (src/lxml/lxml.etree.c:80806)
ValueError: Comment may not contain '--' or end with '-'

looking forward

adamailru commented Jan 4, 2016

Hi
please look at http://stackoverflow.com/questions/34595275/disable-comments-check-for-in-lxml
Your commit crashed my parser, please suggest how to upgrade parser code, in order to be compatible with lxml after this commit.

reproduce able code:

import requests
from lxml.html import html5parser
data = requests.get("https://www.banca-romaneasca.ro/en/tools-and-resources/exchange-rates/").content
context = html5parser.document_fromstring(data)

...
      File "src/lxml/lxml.etree.pyx", line 3017, in lxml.etree.Comment (src/lxml/lxml.etree.c:80806)
ValueError: Comment may not contain '--' or end with '-'

looking forward

@opottone

This comment has been minimized.

Show comment
Hide comment
@opottone

opottone Jan 5, 2016

Contributor

@adamailru, it seems to me that that is a bug in html5parser, not in lxml. If it even is a bug; html5 standard does not allow double hyphen in a comment, so it is reasonable to throw an exception. Although arguably it would be better to log a warning instead of throwing an exception.

Contributor

opottone commented Jan 5, 2016

@adamailru, it seems to me that that is a bug in html5parser, not in lxml. If it even is a bug; html5 standard does not allow double hyphen in a comment, so it is reasonable to throw an exception. Although arguably it would be better to log a warning instead of throwing an exception.

@adamailru

This comment has been minimized.

Show comment
Hide comment
@adamailru

adamailru Jan 5, 2016

@opottone it is lxml for sure

  1. cloned lxml

changed src/lxml/lxml.etree.pyx
from:

if text is None:
    text = b''
else:
    text = _utf8(text)
    if b'--' in text or text.endswith(b'-'):
        raise ValueError("Comment may not contain '--' or end with '-'")

to:

if text is None:
    text = b''
else:
    text = _utf8(text)
    if b'--' in text or text.endswith(b'-'):
        # raise ValueError("Comment may not contain '--' or end with '-'")
        import warnings
        warnings.warn(
            u"Comment may not contain '--' or end with '-'",
            FutureWarning
            )
  1. sudo /opt/python-env/lxml/bin/pip install cython
  2. sudo /opt/python-env/lxml/bin/pip install "requests[security]"
  3. sudo /opt/python-env/lxml/bin/pip install html5lib
  4. sudo /opt/python-env/lxml/bin/pip install /_github.com/lxml

run the testa.py code

import requests
from lxml.html import html5parser
data = requests.get("https://www.banca-romaneasca.ro/en/tools-and-resources/exchange-rates/").content
context = html5parser.document_fromstring(data)
print context

and got:

/opt/python-env/lxml/bin/python2.7 /_github.com/lxml/testa.py
/opt/python-env/lxml/local/lib/python2.7/site-packages/html5lib/ihatexml.py:262: DataLossWarning: Coercing non-XML name
  warnings.warn("Coercing non-XML name", DataLossWarning)
/opt/python-env/lxml/local/lib/python2.7/site-packages/html5lib/treebuilders/etree.py:148: FutureWarning: Comment may not contain '--' or end with '-'
  self._element = ElementTree.Comment(data)
<Element {http://www.w3.org/1999/xhtml}html at 0x7fb6760c4a70>

SO

  1. How and can I propagate my change into https://github.com/lxml/lxml ?
  2. Will it be accepted at all ? If not then which workaround there is to handle my issue ?
  3. In case if it accepted, how long time will it be until next release ? How can I apply patch until next release is available, is it enough to change only /opt/python-env/lxml/local/lib/python2.7/site-packages/lxml/{etree.so,lxml.etree.h} ?

adamailru commented Jan 5, 2016

@opottone it is lxml for sure

  1. cloned lxml

changed src/lxml/lxml.etree.pyx
from:

if text is None:
    text = b''
else:
    text = _utf8(text)
    if b'--' in text or text.endswith(b'-'):
        raise ValueError("Comment may not contain '--' or end with '-'")

to:

if text is None:
    text = b''
else:
    text = _utf8(text)
    if b'--' in text or text.endswith(b'-'):
        # raise ValueError("Comment may not contain '--' or end with '-'")
        import warnings
        warnings.warn(
            u"Comment may not contain '--' or end with '-'",
            FutureWarning
            )
  1. sudo /opt/python-env/lxml/bin/pip install cython
  2. sudo /opt/python-env/lxml/bin/pip install "requests[security]"
  3. sudo /opt/python-env/lxml/bin/pip install html5lib
  4. sudo /opt/python-env/lxml/bin/pip install /_github.com/lxml

run the testa.py code

import requests
from lxml.html import html5parser
data = requests.get("https://www.banca-romaneasca.ro/en/tools-and-resources/exchange-rates/").content
context = html5parser.document_fromstring(data)
print context

and got:

/opt/python-env/lxml/bin/python2.7 /_github.com/lxml/testa.py
/opt/python-env/lxml/local/lib/python2.7/site-packages/html5lib/ihatexml.py:262: DataLossWarning: Coercing non-XML name
  warnings.warn("Coercing non-XML name", DataLossWarning)
/opt/python-env/lxml/local/lib/python2.7/site-packages/html5lib/treebuilders/etree.py:148: FutureWarning: Comment may not contain '--' or end with '-'
  self._element = ElementTree.Comment(data)
<Element {http://www.w3.org/1999/xhtml}html at 0x7fb6760c4a70>

SO

  1. How and can I propagate my change into https://github.com/lxml/lxml ?
  2. Will it be accepted at all ? If not then which workaround there is to handle my issue ?
  3. In case if it accepted, how long time will it be until next release ? How can I apply patch until next release is available, is it enough to change only /opt/python-env/lxml/local/lib/python2.7/site-packages/lxml/{etree.so,lxml.etree.h} ?
@opottone

This comment has been minimized.

Show comment
Hide comment
@opottone

opottone Jan 5, 2016

Contributor

It seems to me that html5parser misuses lxml by calling lxml.etree.Comment. It is meant for xml, but html5parser uses it for html. With xml, throwing an error is the correct thing to do when there is a double hyphen.

Contributor

opottone commented Jan 5, 2016

It seems to me that html5parser misuses lxml by calling lxml.etree.Comment. It is meant for xml, but html5parser uses it for html. With xml, throwing an error is the correct thing to do when there is a double hyphen.

@opottone

This comment has been minimized.

Show comment
Hide comment
@opottone

opottone Jan 6, 2016

Contributor

I tried installing latest html5parser from https://github.com/html5lib/html5lib-python. Now I only get a warning, not an error.

Contributor

opottone commented Jan 6, 2016

I tried installing latest html5parser from https://github.com/html5lib/html5lib-python. Now I only get a warning, not an error.

@adamailru

This comment has been minimized.

Show comment
Hide comment
@adamailru

adamailru Jan 6, 2016

@opottone, yes !!!
you are right, its really working

/opt/python-env/ciur_2/bin/python /_bitbucket.com/ciur.example.exchange/country/ro/banks/banca_romaneasca_ro/test_parse.py
/opt/python-env/ciur_2/local/lib/python2.7/site-packages/html5lib/ihatexml.py:265: DataLossWarning: Coercing non-XML name
  warnings.warn("Coercing non-XML name", DataLossWarning)
/opt/python-env/ciur_2/local/lib/python2.7/site-packages/html5lib/ihatexml.py:226: DataLossWarning: Comments cannot contain adjacent dashes
  warnings.warn("Comments cannot contain adjacent dashes", DataLossWarning)
/opt/python-env/ciur_2/local/lib/python2.7/site-packages/html5lib/ihatexml.py:229: DataLossWarning: Comments cannot end in a dash
  warnings.warn("Comments cannot end in a dash", DataLossWarning)
{
    "root": {
        "date": "datetime.datetime(2016, 1, 6, 0, 0)", 
        "item_list": [
            {
                "currency": "EUR", 
                "rate": 1.0, 
                "buy": 4.4876, 
                "sell": 4.5546
            }, 

So I can say now:

ISSUE is CLOSED

All credits and thanks goes to @opottone

adamailru commented Jan 6, 2016

@opottone, yes !!!
you are right, its really working

/opt/python-env/ciur_2/bin/python /_bitbucket.com/ciur.example.exchange/country/ro/banks/banca_romaneasca_ro/test_parse.py
/opt/python-env/ciur_2/local/lib/python2.7/site-packages/html5lib/ihatexml.py:265: DataLossWarning: Coercing non-XML name
  warnings.warn("Coercing non-XML name", DataLossWarning)
/opt/python-env/ciur_2/local/lib/python2.7/site-packages/html5lib/ihatexml.py:226: DataLossWarning: Comments cannot contain adjacent dashes
  warnings.warn("Comments cannot contain adjacent dashes", DataLossWarning)
/opt/python-env/ciur_2/local/lib/python2.7/site-packages/html5lib/ihatexml.py:229: DataLossWarning: Comments cannot end in a dash
  warnings.warn("Comments cannot end in a dash", DataLossWarning)
{
    "root": {
        "date": "datetime.datetime(2016, 1, 6, 0, 0)", 
        "item_list": [
            {
                "currency": "EUR", 
                "rate": 1.0, 
                "buy": 4.4876, 
                "sell": 4.5546
            }, 

So I can say now:

ISSUE is CLOSED

All credits and thanks goes to @opottone

@ActionsPerMinute

This comment has been minimized.

Show comment
Hide comment
@ActionsPerMinute

ActionsPerMinute Apr 22, 2016

Should anyone get this far and wonder why 'xml' is excluded from the Processing instruction, I believe that the <?xml header is not a Processing Instruction it's an xml declaration, although they look the same. Therefore, the tree should be printed with that option i.e.:

etree.tostring(root, pretty_print=True, encoding='UTF-8', xml_declaration=True)

ActionsPerMinute commented on 3b77ce2 Apr 22, 2016

Should anyone get this far and wonder why 'xml' is excluded from the Processing instruction, I believe that the <?xml header is not a Processing Instruction it's an xml declaration, although they look the same. Therefore, the tree should be printed with that option i.e.:

etree.tostring(root, pretty_print=True, encoding='UTF-8', xml_declaration=True)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment