Change HtmlParser to use Beautiful Soup #337
Conversation
UnicodeDammit input has to be non-unicode to trigger character set detection.
Read-only property with new Beautiful Soup parser.
Needs miniboa >= 1.0.8 for telnet test on Python 3.7. Test with older Beautiful Soup without line number support on Python 3.5. Resolve tox deprecation warning: Matching undeclared envs is deprecated. Be sure all the envs that Tox should run are declared in the tox config.
Shown every time linkchecker is run:

```
/usr/lib/python3.7/site-packages/bs4/element.py:16: UserWarning: The soupsieve package is not installed. CSS selectors cannot be used.
  'The soupsieve package is not installed. CSS selectors cannot be used.'
```
Else non-UTF-8 codes are misinterpreted:

```python
>>> from urllib import parse
>>> parse.unquote("%FF")
'�'
>>> parse.unquote("%FF", "latin1")
'ÿ'
```
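To illustrate the UnicodeDammit point above, a small sketch (not linkchecker code; the byte string is invented): detection only runs when UnicodeDammit is handed bytes, while a str passes through untouched.

```python
from bs4 import UnicodeDammit

# Bytes trigger character set detection; the guess is recorded on the result.
detected = UnicodeDammit(b"caf\xe9 link")
print(detected.original_encoding)   # some 8-bit guess, e.g. a windows-1252 variant

# A str is already unicode, so no detection happens and original_encoding is None.
print(UnicodeDammit("café link").original_encoding)  # None
```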
```diff
@@ -84,8 +84,6 @@ def content_allows_robots (self):
         handler = linkparse.MetaRobotsFinder()
         parser = htmlsax.parser(handler)
         handler.parser = parser
-        if self.charset:
-            parser.encoding = self.charset
```
I have doubts about this entire commit.
If we have charset information from the Content-Type header, there should be a way to force the HTML parser to use it. (The spec requires that if the document itself specifies charset X in a `<meta>` tag, but the server specifies charset Y in the Content-Type header, then the parser must use charset Y.)
It's straightforward to implement (add a get_content() method to HttpUrl that takes the encoding from requests and tells Beautiful Soup to use it). Harder to test all the permutations.
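A rough sketch of that idea (the function name and response attributes are illustrative, not the actual linkchecker API): hand Beautiful Soup the raw bytes plus the charset that requests took from the Content-Type header, so the header encoding wins over any `<meta>` declaration.

```python
from bs4 import BeautifulSoup

def get_soup_for_response(response):
    """Build a soup from a requests response, preferring the header charset.

    `response` is assumed to be a requests.Response; `response.encoding` is
    the charset taken from the Content-Type header (None if absent).
    """
    return BeautifulSoup(
        response.content,                 # raw bytes, so bs4 does the decoding
        "html.parser",
        from_encoding=response.encoding,  # header charset takes priority
    )
```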
linkcheck/httputil.py

```diff
@@ -44,6 +44,7 @@ def x509_to_dict(x509):
     }
     notAfter = x509.get_notAfter()
     if notAfter is not None:
+        notAfter = notAfter.decode()
```
There's no regression test for this fix. :(
Indeed the limited https testing is completely disabled. There's a lot of code around certificates using sockets which probably needs ripping out and replacing with getting the certificate from requests. Best left until after the bs4 changes though.
This change is just a result of the pyOpenSSL API - which itself recommends using cryptography, and that would give us a nice datetime!
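For illustration, a hedged sketch of what the cryptography-based alternative could look like (the helper name and PEM loading are assumptions, not linkchecker code):

```python
from cryptography import x509
from cryptography.hazmat.backends import default_backend

def cert_not_after(pem_bytes):
    """Return the certificate expiry as a datetime instead of an ASN.1 byte string."""
    cert = x509.load_pem_x509_certificate(pem_bytes, default_backend())
    # pyOpenSSL's get_notAfter() returns bytes like b"20301231235959Z" that need
    # decoding and parsing; cryptography exposes a ready-made datetime instead.
    return cert.not_valid_after
```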
Looks like there is no simple way to get the certificate from requests. Session.cert is the default client certificate (must update my local copy of the requests docs...).
I've included this commit with a test in #338.
linkcheck/parser/__init__.py

```diff
@@ -128,9 +128,9 @@ def find_links (url_data, callback, tags):
     handler.parser = parser
     # parse
     try:
-        content = url_data.get_raw_content()
+        soup = url_data.get_soup()
         with parse_mutex:
```
Hey, we can drop the parse_mutex now!
I did a commit-by-commit review this time, except I skipped commits
- Pass encoding when unquoting
- Always use utf-8 encoding when quoting
- Add new tests for URL quoting
because it's late and they're complicated.
I'll note that playing with this branch is complicated by the need to manually run `git clean -dfx` (probably to remove htmlsax.so so it doesn't shadow htmlsax.py in the same directory); otherwise all the parser tests fail on Python 3.
I'll also note that we have one test marked xfail that now passes, so maybe we can drop the `@pytest.mark.xfail` decorator from it!
I'm not approving the PR at this time because
- WIP in title
- drops Python 2.7 support before January 1 2020 ;)
So, remember the browser tests I did in some other GH issue/PR and gleefully reported that modern browsers decode everything and treat all links as UTF-8? Well, about that.
The tests added in this branch make linkchecker treat that link as http://localhost:8001/?quoted=%C3%BF! This makes me nervous. Linkchecker should behave the same way browsers do. And it's not a difference between file:// vs http://, I can reproduce this even if I do
I should dig up my earlier experiment with a custom wsgi script and figure out what makes Chromium behave differently.
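To make the concern concrete, here is a small sketch (plain urllib, not linkchecker code) of the decode/re-encode round trip that produces the %C3%BF form mentioned above:

```python
from urllib.parse import quote, unquote

# A latin-1 percent-escape in the source document...
original = "%FF"

# ...decoded with the latin1 encoding gives the character 'ÿ'...
decoded = unquote(original, encoding="latin1")

# ...and re-quoting it (quote() emits UTF-8 by default) yields '%C3%BF',
# which is the form the new tests expect in the checked URL.
requoted = quote(decoded)
print(decoded, requoted)   # ÿ %C3%BF
```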
I think this is the key reason for the WIP on this PR (#311 demonstrates it and is included here).
How do we want to manage changes to this? PRs against the branch?
I'm going to blanket approve this because it's been too long, and the tests are green (due to me cheating with the "allowed_failures" config).
We can decide what to do about py35-oldbs4 later.
i guess we should remove the WIP here then?
god i love this: +677 −11,375
removing 11,000 lines of code! and tests are green! let's just ship this damn thing :)
For me personally, and for Gentoo Linux, requiring beautifulsoup >= 4.8.2 is no problem. If we did go that way I think we should remove the allow_failures from Travis (and remove the alternative test from tests/checker/test_all_parts.py). But that seems extreme just to get the tests to pass when it shouldn't impact the user - the beautifulsoup parsing has been done by this point. Skipping the tests for beautifulsoup < 4.8.2 seems the best idea. Maybe in future all the doctype code can just be ripped out, but that would take time to check, which probably isn't a priority now.
Not just allow_failures, the entire oldbs4 job from the travis matrix (and from tox.ini).
+1 Does BeautifulSoup define its version in the bs4 package somewhere? Or do we use pkg_resources to get its version?
Or maybe we can do a feature-based test somehow? `bs_can_handle_doctypes_correctly = BeautifulSoup("<!doctype html>", "html.parser").something == something`?
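Something along those lines could look like the sketch below (the helper name and the exact behavioural difference being probed are assumptions; it mirrors the startswith('DOCTYPE ') check from the compatibility patch further down):

```python
from bs4 import BeautifulSoup
from bs4.element import Doctype

def bs_reports_bare_doctype():
    """Feature probe: does bs4 hand back the doctype without the 'DOCTYPE ' prefix?"""
    soup = BeautifulSoup("<!DOCTYPE html>", "html.parser")
    doctype = next(
        (node for node in soup.contents if isinstance(node, Doctype)), ""
    )
    return not doctype.upper().startswith("DOCTYPE")
```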
Haven't checked the docs but a quick grep got me:

```python
>>> bs4.__version__
'4.8.2'
```

Thinking about this a bit more - if it turns out to be too fiddly to isolate the DOCTYPE test strings, making the code compatible with all versions would only take something like:

```diff
--- a/linkcheck/HtmlParser/htmllib.py
+++ b/linkcheck/HtmlParser/htmllib.py
@@ -151,7 +151,7 @@ class HtmlPrettyPrinter (object):
         @type data: string
         @return: None
         """
-        self.fd.write("<!DOCTYPE%s>" % data)
+        self.fd.write("<!DOCTYPE %s>" % data)

     def pi (self, data):
         """
--- a/linkcheck/HtmlParser/htmlsax.py
+++ b/linkcheck/HtmlParser/htmlsax.py
@@ -93,7 +93,8 @@ class Parser(object):
                 self.handler.comment(comment)
             elif isinstance(content, Doctype):
                 if hasattr(self.handler, 'doctype'):
-                    self.handler.doctype(content[7:])
+                    self.handler.doctype(content[8:] if
+                        content.upper().startswith('DOCTYPE ') else content)
             elif isinstance(content, Comment):
                 if hasattr(self.handler, 'comment'):
                     self.handler.comment(content.strip())
```

TBC, but I think it may never actually be called by running the
I missed the thumbs up - they don't generate an email notification. @mgedmin do you want to do something with the three "Split the oldbs4 into a separate Travis job" commits? By all means squash them or otherwise; the "allow_failures" needs to go? I've done some testing of moving and removing more parsing related code (including the doctype handling), nearly ready, but I think that would better wait for another PR.
Sorry about that! I meant to apply your suggestion and push it, but ran out of motivation.
I didn't want to force-push to a shared branch. We could squash them if you want to, but having them as separate commits doesn't bother me.
Yes: allow_failures needs to go once the code is fixed. I would commit the removal as a separate commit, but I won't mind if you want to play with some history rewriting instead ;)
I would prefer to have the oldbs4 job run as a separate Travis job, for two reasons:
tox-travis has been disappointing to me: I hoped to make the Python 3.5 job run just the py35 tox job without an explicit TOXENV override in the environment, but couldn't make it happen by configuring tox-travis in tox.ini. https://tox-travis.readthedocs.io/en/stable/envlist.html#advanced-configuration did nothing. Now I have a new idea: if the tox environment were called I'll experiment, stay tuned!
I want the Python 3.5 travis job to run just tox -e py35, without the oldbs4 job, and without an explicit TOXENV setting that is awkward to insert in the .travis.yml (also, it reorders the jobs putting 3.5 below 3.8 which annoys me). I think I found a way of doing that by renaming py35-oldbs4 to oldbs4.
Changes pushed. I have no motivation to play with history rewriting (and I think it's interesting to see all the missteps made along the way). Maybe let's merge this?
The builds are green on Travis CI; I don't know why GitHub doesn't see that yet:
humm... maybe we can rerun travis to fix that? (just close and reopen the PR) my approval still stands, in any case, but i will let you folks do the honors. :)
I reran a single job in one of the builds, and I don't see any changes here. We can temporarily disable the branch protection, if there's consensus that this can be merged. @cjmayo, what's your opinion?
i tried to kick CI, let's see if GH picks it up now. should we consider switching to GH Actions? :)
Maybe? I don't like them.
One more idea for Travis. I had a simple commit available to remove doc/python3.txt; I've pushed that (hopefully it doesn't make things worse...). Anyway, all the tests I can run locally pass, and I'm happy for it to be merged - removed the WIP!
There are some problems reported on previous days at https://www.traviscistatus.com/
Uhh, why is the Travis CI service inactive? I've been postponing service → app migration in my personal projects in hopes GitHub or Travis will automate the process. Travis is now telling me
so I don't know if that'll ever happen. The other link says
and yeah, we have a working webhook: This explains why the service was disabled.
maybe we can move this to a different issue, but could you expand? :) that said, travis still seems to not be running:
i can't help but wonder if travis is suffering because of GH actions.. they're bound to have more integration troubles than GH itself... :/ unfair, really...
No objective reason, just the UX feels bad. (Like the checks tab in pull requests suddenly expands to take 100% of the screen width, while all the other tabs are narrow columns with wide margins on both sides. Also the YAML I've seen feels redundant, repetitive, and the opposite of beautiful.)
It is running, just not reporting back. Like here's the PR build for cjmayo's latest commit: https://travis-ci.org/github/linkchecker/linkchecker/builds/668847616
I've disabled the branch protection in Travis temporarily to merge this, then reenabled it for future PRs.
This PR replaces the HTML parsing code included in LinkChecker with Beautiful Soup 4, which enables us to support Python 3. It includes the outstanding commits from Petr Dlouhy in #40, #119 and #125.
It is a continuation of #249, created because the branch has moved to this repository.
Currently it does not support Python 2 because urllib.unquote() does not accept an encoding argument. #249 (comment)