Input validation & encoding detection #172
Use libxml2 properly: give xmlCtxtResetPush() at least 4 bytes of XML data. It then processes the byte order mark itself, so lxml no longer needs to handle it.
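To illustrate why 4 bytes matter, here is a minimal sketch of BOM sniffing using only the Python standard library. It is not libxml2's or lxml's actual code, and `sniff_bom` is a hypothetical helper name. The point is that the UTF-32 LE byte order mark begins with the same two bytes as the UTF-16 LE one, so a detector that sees fewer than 4 bytes can guess wrong.

```python
import codecs

# Check the 4-byte UTF-32 marks first: the UTF-32 LE BOM (FF FE 00 00)
# starts with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(prefix: bytes):
    """Return (encoding, bom_length), or (None, 0) if prefix has no BOM.

    Hypothetical helper for illustration only.
    """
    for bom, name in _BOMS:
        if prefix.startswith(bom):
            return name, len(bom)
    return None, 0


# A UTF-32 LE document with an explicit byte order mark.
doc = codecs.BOM_UTF32_LE + "<a/>".encode("utf-32-le")

two_byte_guess = sniff_bom(doc[:2])   # only 2 bytes: looks like UTF-16 LE
four_byte_guess = sniff_bom(doc[:4])  # 4 bytes: correctly seen as UTF-32 LE
```

With only the first two bytes the sniffer reports UTF-16 LE, and with four bytes it reports UTF-32 LE, which is the kind of ambiguity that handing the parser at least 4 bytes up front avoids.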
It's always better to keep unrelated commits in separate pull requests. I would have merged the first change immediately (and fixed it up a tiny bit), but I'm not sure I'll merge the second one.
Thanks!
Hi, here is reproducible code:
Looking forward to your reply.
@adamailru, it seems to me that this is a bug in html5parser, not in lxml, if it is a bug at all: the HTML5 standard does not allow a double hyphen in a comment, so it is reasonable to throw an exception. Arguably, though, it would be better to log a warning instead of throwing an exception.
@opottone, it is lxml for sure.
I changed
to:
ran the testa.py code
and got:
SO
It seems to me that html5parser misuses lxml by calling lxml.etree.Comment. That function is meant for XML, but html5parser uses it for HTML. With XML, throwing an error is the correct thing to do when a comment contains a double hyphen.
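For reference, the XML 1.0 comment rule can be sketched in a few lines. This is a standalone illustration, not lxml's implementation, and `is_valid_xml_comment` is a hypothetical name. The XML grammar forbids `--` inside comment text and also forbids text ending in `-`, since that would produce `--->` at the end of the comment.

```python
def is_valid_xml_comment(text: str) -> bool:
    """Check comment text against the XML 1.0 comment production.

    Hypothetical helper: comment text may not contain "--" and may not
    end with "-". It does not check for other illegal XML characters.
    """
    return "--" not in text and not text.endswith("-")


ok = is_valid_xml_comment("a - b")    # single hyphens are fine
bad = is_valid_xml_comment("a -- b")  # double hyphen is rejected
```

A caller that builds comments from HTML input could run a check like this first and decide whether to warn, rewrite the text, or raise.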
I tried installing the latest html5parser from https://github.com/html5lib/html5lib-python. Now I only get a warning, not an error.
@opottone, yes!!!
So I can say now: the ISSUE is CLOSED. All credit and thanks go to @opottone.
Here are two unrelated commits.
The first one is straightforward input validation (this time Python 3 compatible).
The second one is about encoding detection. In progressive parsing, libxml2 does not properly detect the encoding, and lxml has a workaround for that. However, the underlying problem is that libxml2 is not being used properly (can't blame you, it's not properly documented). It's simpler if we use libxml2 as intended and drop the workaround.