
Change HtmlParser to use Beautiful Soup #337

Merged

merged 30 commits into master from htmlparser-beautifulsoup on Mar 30, 2020
Conversation

@cjmayo (Contributor) commented Oct 30, 2019

This PR replaces the HTML parsing code included in LinkChecker with Beautiful Soup 4 which enables us to support Python 3. It includes the outstanding commits from Petr Dlouhy in #40 and #119 and #125.

It is a continuation of #249, created because the branch has moved to this repository.

Currently it does not support Python 2, because urllib.unquote() does not accept an encoding argument (see #249 (comment)).

PetrDlouhy and others added 19 commits January 5, 2018 22:48
UnicodeDammit input has to be non-unicode to trigger character set
detection.
Read-only property with new Beautiful Soup parser.
Needs miniboa >= 1.0.8 for telnet test on Python 3.7.
Test with older Beautiful Soup without line number support on
Python 3.5.

Resolve tox deprecation warning:
  Matching undeclared envs is deprecated. Be sure all the envs that Tox
  should run are declared in the tox config.
Shown every time linkchecker is run:

/usr/lib/python3.7/site-packages/bs4/element.py:16: UserWarning: The
soupsieve package is not installed. CSS selectors cannot be used.
  'The soupsieve package is not installed. CSS selectors cannot be used.'
Else non-UTF-8 codes are misinterpreted:

>>> from urllib import parse
>>> parse.unquote("%FF")
'�'
>>> parse.unquote("%FF", "latin1")
'ÿ'
@@ -84,8 +84,6 @@ def content_allows_robots (self):
handler = linkparse.MetaRobotsFinder()
parser = htmlsax.parser(handler)
handler.parser = parser
if self.charset:
parser.encoding = self.charset
Contributor:

I have doubts about this entire commit.

If we have charset information from the Content-Type header, there should be a way to force the HTML parser to use it. (The spec requires that if the document itself specifies charset X in a <meta> tag, but the server specifies charset Y in the Content-Type header, then the parser must use charset Y.)

Contributor Author:

It's straightforward to implement (add a get_content() method to HttpUrl that takes the encoding from requests and tells Beautiful Soup to use it). Harder to test all the permutations.
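For reference, Beautiful Soup already accepts an externally known charset; a minimal sketch of the idea being discussed (the get_content()/requests wiring above is hypothetical, only the from_encoding argument is real bs4 API):

```python
from bs4 import BeautifulSoup

# Bytes that are valid Latin-1 but not valid UTF-8.
raw = b"<html><body><a href='/caf\xe9'>caf\xe9</a></body></html>"

# Without a hint, bs4's UnicodeDammit guesses the character set from the bytes.
guessed = BeautifulSoup(raw, "html.parser")

# With the charset from the Content-Type header (e.g. requests'
# response.encoding), bs4 is told to decode with it directly, which is
# what the spec requires when the server declares a charset.
forced = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")
print(forced.a["href"])  # /café
```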

@@ -44,6 +44,7 @@ def x509_to_dict(x509):
}
notAfter = x509.get_notAfter()
if notAfter is not None:
notAfter = notAfter.decode()
Contributor:

There's no regression test for this fix. :(

Contributor Author:

Indeed the limited https testing is completely disabled. There's a lot of code around certificates using sockets which probably needs ripping out and replacing with getting the certificate from requests. Best left until after the bs4 changes though.
This change is just a result of the pyOpenSSL API - which itself recommends using cryptography, and that would give us a nice datetime!
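For context on the decode() in the diff: pyOpenSSL's X509.get_notAfter() returns an ASN.1 time as bytes (e.g. b'20200330174500Z'), so it must be decoded before use. A stdlib-only sketch of turning it into the datetime that cryptography would hand us directly (helper name hypothetical):

```python
from datetime import datetime

def parse_asn1_time(stamp: bytes) -> datetime:
    # pyOpenSSL returns GENERALIZEDTIME bytes such as b"20200330174500Z";
    # decode to str first (the bytes-vs-str issue this commit fixes), then parse.
    return datetime.strptime(stamp.decode("ascii"), "%Y%m%d%H%M%SZ")

print(parse_asn1_time(b"20200330174500Z"))  # 2020-03-30 17:45:00
```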

Contributor Author:

Looks like there is no simple way to get the certificate from requests. Session.cert is the default client certificate (must update my local copy of the requests docs...).

Contributor Author:

I've included this commit with a test in #338.

@@ -128,9 +128,9 @@ def find_links (url_data, callback, tags):
handler.parser = parser
# parse
try:
content = url_data.get_raw_content()
soup = url_data.get_soup()
with parse_mutex:
Contributor:

Hey, we can drop the parse_mutex now!

@mgedmin (Contributor) left a comment:

I did a commit-by-commit review this time, except I skipped commits

  • Pass encoding when unquoting
  • Always use utf-8 encoding when quoting
  • Add new tests for URL quoting

because it's late and they're complicated.

I'll note that playing with this branch is complicated by the need to manually run git clean -dfx (probably to remove htmlsax.so so it doesn't shadow htmlsax.py in the same directory), otherwise all the parser tests fail on Python 3.

I'll also note that we have one test marked xfail that now passes, so maybe we can drop the @pytest.mark.xfail decorator from it!

I'm not approving the PR at this time because

  • WIP in title
  • drops Python 2.7 support before January 1 2020 ;)

@mgedmin (Contributor) commented Oct 30, 2019

So, remember the browser tests I did in some other GH issue/PR and gleefully reported that modern browsers decode everything and treat all links as UTF-8? Well, about that.

The tests added in this branch make linkchecker treat that link as http://localhost:8001/?quoted=%C3%BF!

This makes me nervous. Linkchecker should behave the same way browsers do.

And it's not a difference between file:// vs http://, I can reproduce this even if I do

I should dig up my earlier experiment with a custom wsgi script and figure out what makes Chromium behave differently.

@cjmayo (Contributor Author) commented Nov 2, 2019

So, remember the browser tests I did in some other GH issue/PR and gleefully reported that modern browsers decode everything and treat all links as UTF-8? Well, about that.

* xdg-open tests/checker/data/file_url_quote.html

* click the l3 link

* observe http://localhost:8001/?quoted=%FF in the address bar!

The tests added in this branch make linkchecker treat that link as http://localhost:8001/?quoted=%C3%BF!

I think this is the key reason for the WIP on this PR. (#311 demonstrates this and is included here.)
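The %FF to %C3%BF transformation being discussed is what you get from unquoting the raw byte with a Latin-1 fallback and re-quoting the resulting character as UTF-8; a small stdlib illustration (my reading of the behaviour, not the exact code path in the branch):

```python
from urllib.parse import quote, unquote

# %FF is not valid UTF-8, so the default unquote() yields the
# U+FFFD replacement character:
print(unquote("%FF"))  # �

# Decoding it as Latin-1 instead recovers U+00FF (ÿ) ...
char = unquote("%FF", encoding="latin1")
print(char)  # ÿ

# ... and re-quoting encodes that character as UTF-8, giving the
# two-byte percent-encoded form:
print(quote(char))  # %C3%BF
```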

@cjmayo (Contributor Author) commented Nov 2, 2019

How do we want to manage changes to this? PRs against the branch?
I guess avoiding reviewing the same thing twice is a key objective, but so is having an understandable set of commits for the future. Should we have PRs with squashed commits, and then rebase before merging into master?

@mgedmin (Contributor) left a comment:

I'm going to blanket approve this because it's been too long, and the tests are green (due to me cheating with the "allow_failures" config).

We can decide what to do about py35-oldbs4 later.

@anarcat (Contributor) commented Mar 23, 2020

i guess we should remove the WIP here then?

@anarcat (Contributor) left a comment:

god i love this: +677 −11,375

removing 11,000 lines of code! and tests are green! let's just ship this damn thing :)

@cjmayo (Contributor Author) commented Mar 23, 2020

For me personally, and for Gentoo Linux, there's no problem requiring beautifulsoup >= 4.8.2. If we did go that way I think we should remove the allow_failures from Travis (and remove the alternative test from tests/checker/test_all_parts.py).

But that seems extreme just to get the tests to pass, when it shouldn't impact the user: the beautifulsoup parsing has been done by this point. Skipping the tests for beautifulsoup < 4.8.2 seems the best idea.

Maybe in future all the doctype code can just be ripped out but that would take time to check which probably isn't a priority now.

@mgedmin (Contributor) commented Mar 23, 2020

For me personally, and for Gentoo Linux, there's no problem requiring beautifulsoup >= 4.8.2. If we did go that way I think we should remove the allow_failures from Travis (and remove the alternative test from tests/checker/test_all_parts.py).

Not just allow_failures, the entire oldbs4 job from the travis matrix (and from tox.ini).

But that seems extreme just to get the tests to pass, when it shouldn't impact the user: the beautifulsoup parsing has been done by this point. Skipping the tests for beautifulsoup < 4.8.2 seems the best idea.

+1

Does BeautifulSoup define its version in the bs4 package somewhere? Or do we use pkg_resources to get its version?

@mgedmin (Contributor) commented Mar 23, 2020

Does BeautifulSoup define its version in the bs4 package somewhere? Or do we use pkg_resources to get its version?

Or maybe we can do a feature-based test somehow?

bs_can_handle_doctypes_correctly = BeautifulSoup("<!doctype html>", "html.parser").something == something?
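Filling in that "something == something" sketch: one concrete probe would be to parse a doctype and check whether this Beautiful Soup version strips the DOCTYPE keyword from the Doctype node. This is a guess at a workable probe, not something tested against every bs4 release:

```python
from bs4 import BeautifulSoup
from bs4.element import Doctype

def bs_strips_doctype_keyword():
    """True if this Beautiful Soup reports <!DOCTYPE html> as 'html',
    False if the Doctype node still carries the 'DOCTYPE ' prefix."""
    soup = BeautifulSoup("<!DOCTYPE html>", "html.parser")
    doctype = next(node for node in soup.contents if isinstance(node, Doctype))
    return not doctype.upper().startswith("DOCTYPE")
```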

@cjmayo (Contributor Author) commented Mar 24, 2020

Does BeautifulSoup define its version in the bs4 package somewhere? Or do we use pkg_resources to get its version?

Haven't checked the docs but a quick grep got me:

>>> bs4.__version__
'4.8.2'
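If we go the version-based route, that string compares safely once split into integers; a stdlib sketch (the 4.8.2 threshold is the one from this discussion, and pre-release suffixes would need extra handling):

```python
def version_tuple(version):
    """Turn a dotted version string like '4.8.2' into a comparable tuple."""
    return tuple(int(part) for part in version.split(".")[:3])

# e.g. OLD_BS4 = version_tuple(bs4.__version__) < (4, 8, 2)
print(version_tuple("4.8.2"))              # (4, 8, 2)
print(version_tuple("4.7.1") < (4, 8, 2))  # True
```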

Thinking about this a bit more - if it turns out to be too fiddly to isolate the DOCTYPE test strings, making the code compatible with all versions would only take something like:

--- a/linkcheck/HtmlParser/htmllib.py
+++ b/linkcheck/HtmlParser/htmllib.py
@@ -151,7 +151,7 @@ class HtmlPrettyPrinter (object):
         @type data: string
         @return: None
         """
-        self.fd.write("<!DOCTYPE%s>" % data)
+        self.fd.write("<!DOCTYPE %s>" % data)
 
     def pi (self, data):
         """
--- a/linkcheck/HtmlParser/htmlsax.py
+++ b/linkcheck/HtmlParser/htmlsax.py
@@ -93,7 +93,8 @@ class Parser(object):
                             self.handler.comment(comment)
             elif isinstance(content, Doctype):
                 if hasattr(self.handler, 'doctype'):
-                    self.handler.doctype(content[7:])
+                    self.handler.doctype(content[8:] if
+                        content.upper().startswith('DOCTYPE ') else content)
             elif isinstance(content, Comment):
                 if hasattr(self.handler, 'comment'):
                     self.handler.comment(content.strip())

TBC, but I think it may never actually be called by running the linkchecker command. Part of the future work might even involve not just dropping the doctype handler but moving the HtmlPrettyPrinter class into the test suite.

@cjmayo (Contributor Author) commented Mar 29, 2020

I missed the thumbs up - they don't generate an email notification.

@mgedmin do you want to do something with the three "Split the oldbs4 into a separate Travis job" commits? By all means squash them or otherwise; the "allow_failures" needs to go?
And for the "Fix doctype tests" commit, feel free to replace it with the change in the comment above (or just remove it; I have the comment as a commit, I'm just not sure what to do with the Travis changes).

I've done some testing of moving and removing more parsing related code (including the doctype handling), nearly ready, but I think that would better wait for another PR.

@mgedmin (Contributor) commented Mar 30, 2020

I missed the thumbs up - they don't generate an email notification.

Sorry about that! I meant to apply your suggestion and push it, but ran out of motivation.

@mgedmin do you want to do something with the three "Split the oldbs4 into a separate Travis job" commits?

I didn't want to force-push to a shared branch. We could squash them if you want to, but having them as separate commits doesn't bother me.

By all means squash them or otherwise; the "allow_failures" needs to go?

Yes: allow_failures needs to go once the code is fixed. I would commit the removal as a separate commit, but I won't mind if you want to play with some history rewriting instead ;)

And for the "Fix doctype tests" commit, feel free to replace it with the change in the comment above (or just remove it; I have the comment as a commit, I'm just not sure what to do with the Travis changes).

I would prefer to have the oldbs4 job run as a separate Travis job, for two reasons:

  • so its failures don't get mistaken with Python 3.5 compatibility problems
  • so we don't accidentally drop it when we drop Python 3.5 support

tox-travis has been disappointing to me: I hoped to make the Python 3.5 job run just the py35 tox job without an explicit TOXENV override in the environment, but couldn't make it happen by configuring tox-travis in tox.ini. https://tox-travis.readthedocs.io/en/stable/envlist.html#advanced-configuration did nothing.

Now I have a new idea: if the tox environment were called oldbs4 without the py35- prefix, maybe tox-travis wouldn't run it by default on Python 3.5? And then we can have a cleaner .travis.yml.

I'll experiment, stay tuned!

I want the Python 3.5 travis job to run just tox -e py35, without the
oldbs4 job, and without an explicit TOXENV setting that is awkward to
insert in the .travis.yml (also, it reorders the jobs putting 3.5 below
3.8 which annoys me).

I think I found a way of doing that by renaming py35-oldbs4 to oldbs4.

@mgedmin (Contributor) commented Mar 30, 2020

Changes pushed.

I have no motivation to play with history rewriting (and I think it's interesting to see all the missteps made along the way).

Maybe let's merge this?

@mgedmin (Contributor) commented Mar 30, 2020

The builds are green on Travis CI; I don't know why GitHub doesn't see that yet.

@anarcat (Contributor) commented Mar 30, 2020

hmm... maybe we can rerun travis to fix that? (just close and reopen the PR)

my approval still stands, in any case, but i will let you folks do the honors. :)

@mgedmin (Contributor) commented Mar 30, 2020

I reran a single job in one of the builds, and I don't see any changes here.

We can temporarily disable the branch protection, if there's consensus that this can be merged.

@cjmayo, what's your opinion?

@anarcat anarcat closed this Mar 30, 2020
@anarcat anarcat reopened this Mar 30, 2020
@anarcat (Contributor) commented Mar 30, 2020

i tried to kick CI, let's see if GH picks it up now.

should we consider switching to GH Actions? :)

@mgedmin (Contributor) commented Mar 30, 2020

should we consider switching to GH Actions? :)

Maybe? I don't like them.

@cjmayo cjmayo changed the title [WIP] Change HtmlParser to use Beautiful Soup Change HtmlParser to use Beautiful Soup Mar 30, 2020
@cjmayo (Contributor Author) commented Mar 30, 2020

One more idea for Travis: I had a simple commit available to remove doc/python3.txt, so I've pushed that (hopefully it doesn't make things worse...).

Anyway, all the tests I can run locally pass, I'm happy for it to be merged - removed the WIP!

@cjmayo (Contributor Author) commented Mar 30, 2020

There are some problems reported on previous days at https://www.traviscistatus.com/ related to "repositories using the legacy Services integration":

You can migrate your repositories to GitHub Apps on https://travis-ci.com/account/repositories.

@mgedmin (Contributor) commented Mar 30, 2020

Uhh, why is the Travis CI service inactive?
[Screenshot from 2020-03-30 20-01-06]

I've been postponing service → app migration in my personal projects in hopes GitHub or Travis will automate the process. Travis is now telling me

We are only able to migrate accounts that have 50 or fewer repositories using the Legacy Services Integration. Please refer to our documentation on how to migrate your account.

so I don't know if that'll ever happen.

The other link says

Note: While this process is ongoing, we have been migrating from GitHub services to webhooks on the https://travis-ci.org enabled repositories. The closure of GitHub services will not affect your repositories even if you are currently on .org.

and yeah, we have a working webhook:

[Screenshot from 2020-03-30 20-08-02]

This explains why the service was disabled.

@anarcat (Contributor) commented Mar 30, 2020

Maybe? I don't like them.

maybe we can move this in a different issue, but could you expand? :)

that said, travis still seems to not be running:

Expected — Waiting for status to be reported

i can't help but wonder if travis is suffering because of GH actions.. they're bound to have more integration troubles than GH itself... :/ unfair, really...

@mgedmin (Contributor) commented Mar 30, 2020

maybe we can move this in a different issue, but could you expand? :)

No objective reason, just the UX feels bad. (Like the checks tab in pull requests suddenly expands to take 100% of the screen width, while all the other tabs are narrow columns with wide margins on both sides. Also the YAML I've seen feels redundant, repetitive, and the opposite of beautiful.)

that said, travis still seems to not be running:

It is running, just not reporting back. Like here's the PR build for cjmayo's latest commit: https://travis-ci.org/github/linkchecker/linkchecker/builds/668847616

@mgedmin mgedmin merged commit 7853095 into master Mar 30, 2020
@mgedmin mgedmin deleted the htmlparser-beautifulsoup branch March 30, 2020 17:45
@mgedmin (Contributor) commented Mar 30, 2020

I've temporarily disabled the branch protection (the required Travis check) to merge this, then re-enabled it for future PRs.
