Change HtmlParser to use Beautiful Soup #337

cjmayo · 2019-10-30T19:55:19Z

This PR replaces the HTML parsing code included in LinkChecker with Beautiful Soup 4 which enables us to support Python 3. It includes the outstanding commits from Petr Dlouhy in #40 and #119 and #125.

It is a continuation of #249, created because the branch has moved to this repository.

Currently it does not support Python 2 because urllib.unquote() does not accept an encoding argument. #249 (comment)

UnicodeDammit input has to be non-unicode to trigger character set detection.

Read-only property with new Beautiful Soup parser.

Needs miniboa >= 1.0.8 for telnet test on Python 3.7. Test with older Beautiful Soup without line number support on Python 3.5. Resolve tox deprecation warning: Matching undeclared envs is deprecated. Be sure all the envs that Tox should run are declared in the tox config.

Shown every time linkchecker is run: /usr/lib/python3.7/site-packages/bs4/element.py:16: UserWarning: The soupsieve package is not installed. CSS selectors cannot be used. 'The soupsieve package is not installed. CSS selectors cannot be used.'

Else non-UTF-8 codes are misinterpreted: >>> from urllib import parse >>> parse.unquote("%FF") '�' >>> parse.unquote("%FF", "latin1") 'ÿ'

mgedmin · 2019-10-30T21:54:22Z

linkcheck/checker/httpurl.py

@@ -84,8 +84,6 @@ def content_allows_robots (self):
        handler = linkparse.MetaRobotsFinder()
        parser = htmlsax.parser(handler)
        handler.parser = parser
-        if self.charset:
-            parser.encoding = self.charset


I have doubts about this entire commit.

If we have charset information from the Content-Type header, there should be a way to force the HTML parser to use it. (The spec requires that if the document itself specifies charset X in a <meta> tag, but the server specifies charset Y in the Content-Type header, then the parser must use charset Y.)

It's straightforward to implement (add a get_content() method to HttpUrl that takes the encoding from requests and tells Beautiful Soup to use it). Harder to test all the permutations.

mgedmin · 2019-10-30T22:03:39Z

linkcheck/httputil.py

@@ -44,6 +44,7 @@ def x509_to_dict(x509):
    }
    notAfter = x509.get_notAfter()
    if notAfter is not None:
+        notAfter = notAfter.decode()


There's no regression test for this fix. :(

Indeed the limited https testing is completely disabled. There's a lot of code around certificates using sockets which probably needs ripping out and replacing with getting the certificate from requests. Best left until after the bs4 changes though.
This change is just a result of the pyOpenSSL API - which itself recommends using cryptography, and that would give us a nice datetime!

Looks like there is no simple way to get the certificate from requests. Session.cert is the default client certificate (must update my local copy of the requests docs...).

I've included this commit with a test in #338.

mgedmin · 2019-10-30T22:05:17Z

linkcheck/parser/__init__.py

@@ -128,9 +128,9 @@ def find_links (url_data, callback, tags):
    handler.parser = parser
    # parse
    try:
-        content = url_data.get_raw_content()
+        soup = url_data.get_soup()
        with parse_mutex:


Hey, we can drop the parse_mutex now!

mgedmin

I did a commit-by-commit review this time, except I skipped commits

Pass encoding when unquoting
Always use utf-8 encoding when quoting
Add new tests for URL quoting

because it's late and they're complicated.

I'll note that playing with this branch is complicated by the need to manually run git clean -dfx (probably to remove htmlsax.so so it doesn't shadow htmlsax.py in the same directory), otherwise all the parser tests fail on Python 3.

I'll also note that we have one test marked xfail that now passes, so maybe we can drop the @pytest.mark.xfail decorator from it!

I'm not approving the PR at this time because

WIP in title
drops Python 2.7 support before January 1 2020 ;)

mgedmin · 2019-10-30T22:19:51Z

So, remember the browser tests I did in some other GH issue/PR and gleefully reported that modern browsers decode everything and treat all links as UTF-8? Well, about that.

xdg-open tests/checker/data/file_url_quote.html
click the l3 link
observe http://localhost:8001/?quoted=%FF in the address bar!

The tests added in this branch make linkchecker treat that link as http://localhost:8001/?quoted=%C3%BF!

This makes me nervous. Linkchecker should behave the same way browsers do.

And it's not a difference between file:// vs http://, I can reproduce this even if I do

python3 -m http.server -d tests/checker/data
click file_url_quote.html
click the l3 link
observe http://localhost:8001/?quoted=%FF in the address bar!

I should dig up my earlier experiment with a custom wsgi script and figure out what makes Chromium behave differently.

cjmayo · 2019-11-02T17:59:45Z

So, remember the browser tests I did in some other GH issue/PR and gleefully reported that modern browsers decode everything and treat all links as UTF-8? Well, about that.
* xdg-open tests/checker/data/file_url_quote.html

* click the l3 link

* observe http://localhost:8001/?quoted=%FF in the address bar!
The tests added in this branch make linkchecker treat that link as http://localhost:8001/?quoted=%C3%BF!

I think this is the key reason for the WIP on this PR. (#311 demonstrates and is included here).

cjmayo · 2019-11-02T18:04:11Z

How do we want to manage changes to this? PRs against the branch?
I guess avoiding reviewing the same thing twice is a key objective, but also having an understandable set of commits for the future. Have PRs with squash commits, and then rebase before merging into master?

mgedmin

I'm going to blanket approve this because it's been too long, and the tests are green (due to me cheating with the "allowed_failures" config).

We can decide what to do about py35-oldbs4 later.

anarcat · 2020-03-23T15:09:55Z

i guess we should remove the WIP here then?

anarcat

god i love this: +677 −11,375

removing 11,000 lines of code! and tests are green! let's just ship this damn thing :)

cjmayo · 2020-03-23T19:46:37Z

For me personally, and Gentoo Linux, no problem requiring beautifulsoup >= 4.8.2. If we did go that way I think we should remove the allow_failures from Travis. (and remove the alternative test from tests/checker/test_all_parts.py)

But seems extreme just to get the tests to pass when it shouldn't impact the user - the beautifulsoup parsing has been done by this point. Skipping the tests for beautifulsoup < 4.8.2 seems the best idea.

Maybe in future all the doctype code can just be ripped out but that would take time to check which probably isn't a priority now.

mgedmin · 2020-03-23T21:04:05Z

For me personally, and Gentoo Linux, no problem requiring beautifulsoup >= 4.8.2. If we did go that way I think we should remove the allow_failures from Travis. (and remove the alternative test from tests/checker/test_all_parts.py)

Not just allow_failures, the entire oldbs4 job from the travis matrix (and from tox.ini).

But seems extreme just to get the tests to pass when it shouldn't impact the user - the beautifulsoup parsing has been done by this point. Skipping the tests for beautifulsoup < 4.8.2 seems the best idea.

+1

Does BeautifulSoup define its version in the bs4 package somewhere? Or do we use pkg_resources to get its version?

mgedmin · 2020-03-23T21:05:20Z

Does BeautifulSoup define its version in the bs4 package somewhere? Or do we use pkg_resources to get its version?

Or maybe we can do a feature-based test somehow?

bs_can_handle_doctypes_correctly = BeautifulSoup("<!doctype html>", "html.parser").something == something?

cjmayo · 2020-03-24T08:34:48Z

Does BeautifulSoup define its version in the bs4 package somewhere? Or do we use pkg_resources to get its version?

Haven't checked the docs but a quick grep got me:

>>> bs4.__version__
'4.8.2'

Thinking about this a bit more - if it turns out to be too fiddly to isolate the DOCTYPE test strings, making the code compatible with all versions would only take something like:

--- a/linkcheck/HtmlParser/htmllib.py
+++ b/linkcheck/HtmlParser/htmllib.py
@@ -151,7 +151,7 @@ class HtmlPrettyPrinter (object):
         @type data: string
         @return: None
         """
-        self.fd.write("<!DOCTYPE%s>" % data)
+        self.fd.write("<!DOCTYPE %s>" % data)
 
     def pi (self, data):
         """
--- a/linkcheck/HtmlParser/htmlsax.py
+++ b/linkcheck/HtmlParser/htmlsax.py
@@ -93,7 +93,8 @@ class Parser(object):
                             self.handler.comment(comment)
             elif isinstance(content, Doctype):
                 if hasattr(self.handler, 'doctype'):
-                    self.handler.doctype(content[7:])
+                    self.handler.doctype(content[8:] if
+                        content.upper().startswith('DOCTYPE ') else content)
             elif isinstance(content, Comment):
                 if hasattr(self.handler, 'comment'):
                     self.handler.comment(content.strip())

TBC, but I think it may never actually be called by running the linkchecker command. Part of the future work might even involve not just dropping the doctype handler but moving the HtmlPrettyPrinter class into the test suite.

cjmayo · 2020-03-29T18:58:48Z

I missed the thumbs up - they don't generate an email notification.

@mgedmin do you want to do something with the three "Split the oldbs4 into a separate Travis job" commits? By all means squash them or otherwise; the "allow_failures" needs to go?
And the "Fix doctype tests" commit, feel free to replace with the change in the comment above (or just remove it, I have the coment as a commit - I'm just not sure what to do with the Travis changes),

I've done some testing of moving and removing more parsing related code (including the doctype handling), nearly ready, but I think that would better wait for another PR.

mgedmin · 2020-03-30T11:45:17Z

I missed the thumbs up - they don't generate an email notification.

Sorry about that! I meant to apply your suggestion and push it, but ran out of motivation.

@mgedmin do you want to do something with the three "Split the oldbs4 into a separate Travis job" commits?

I didn't want to force-push to a shared branch. We could squash them if you want to, but having them as separate commits doesn't bother me.

By all means squash them or otherwise; the "allow_failures" needs to go?

Yes: allow_failures needs to go once the code is fixed. I would commit the removal as a separate commit, but I won't mind if you want to play with some history rewriting instead ;)

And the "Fix doctype tests" commit, feel free to replace with the change in the comment above (or just remove it, I have the coment as a commit - I'm just not sure what to do with the Travis changes),

I would prefer to have the oldbs4 job run as a separate Travis job, for two reasons:

so its failures don't get mistaken with Python 3.5 compatibility problems
so we don't accidentally drop it when we drop Python 3.5 support

tox-travis has been disappointing to me: I hoped to make the Python 3.5 job run just the py35 tox job without an explicit TOXENV override in the environment, but couldn't make it happen by configuring tox-travis in tox.ini. https://tox-travis.readthedocs.io/en/stable/envlist.html#advanced-configuration did nothing.

Now I have a new idea: if the tox environment were called oldbs4 without the py35- prefix, maybe tox-travis wouldn't run it by default on Python 3.5? And then we can have a cleaner .travis.yml.

I'll experiment, stay tuned!

I want the Python 3.5 travis job to run just tox -e py35, without the oldbs4 job, and without an explicit TOXENV setting that is awkward to insert in the .travis.yml (also, it reorders the jobs putting 3.5 below 3.8 which annoys me). I think I found a way of doing that by renaming py35-oldbs4 to oldbs4.

mgedmin · 2020-03-30T11:55:44Z

Changes pushed.

I have no motivation to play with history rewriting (and I think it's interesting to see all the missteps made along the way).

Maybe let's merge this?

mgedmin · 2020-03-30T11:57:13Z

The builds are green on Travis CI; I don't know why GitHub doesn't see that yet:

anarcat · 2020-03-30T13:46:21Z

humm... mayeb we can rerun travis to fix that? (just close and reopen the PR)

my approval still stands, in any case, but i will let you folks do the honors. :)

mgedmin · 2020-03-30T16:00:04Z

I reran a single job in one of the builds, and I don't see any changes here.

We can temporarily disable the branch protection, if there's consensus that this can be merged.

@cjmayo, what's your opinion?

anarcat · 2020-03-30T16:05:56Z

i tried to kick CI, let's see if GH picks it up now.

should we consider switching to GH Actions? :)

mgedmin · 2020-03-30T16:22:23Z

should we consider switching to GH Actions? :)

Maybe? I don't like them.

cjmayo · 2020-03-30T16:43:49Z

One more idea for Travis. I had a simple commit available, to remove doc/python3.txt, I've pushed that (hopefully it doesn't make things worse...).

Anyway, all the tests I can run locally pass, I'm happy for it to be merged - removed the WIP!

cjmayo · 2020-03-30T16:52:05Z

There are some problems reported on previous days at https://www.traviscistatus.com/
related to "repositories using the legacy Services integration"

You can migrate your repositories to GitHub Apps on https://travis-ci.com/account/repositories.

mgedmin · 2020-03-30T17:11:40Z

Uhh, why is the Travis CI service inactive?

I've been postponing service → app migration in my personal projects in hopes GitHub or Travis will automate the process. Travis is now telling me

We are only able to migrate accounts that have 50 or fewer repositories using the Legacy Services Integration. Please refer to our documentation on how to migrate your account.

so I don't know if that'll ever happen.

The other link says

Note: While this process is ongoing, we have been migrating from GitHub services to webhooks on the https://travis-ci.org enabled repositories. The closure of GitHub services will not affect your repositories even if you are currently on .org.

and yeah, we have a working webhook:

This explains why the service was disabled.

anarcat · 2020-03-30T17:34:16Z

Maybe? I don't like them.

maybe we can move this in a different issue, but could you expand? :)

that said, travis still seems to not be running:

Expected — Waiting for status to be reported

i can't help but wonder if travis is suffering because of GH actions.. they're bound to have more integration troubles than GH itself... :/ unfair, really...

mgedmin · 2020-03-30T17:44:07Z

maybe we can move this in a different issue, but could you expand? :)

No objective reason, just the UX feels bad. (Like the checks tab in pull requests suddenly expands to take 100% of the screen width, while all the other tabs are narrow columns with wide margins on both sides. Also the YAML I've seen feels redundant, repetitive, and the opposite of beautiful.)

that said, travis still seems to not be running:

It is running, just not reporting back. Like here's the PR build for cjmayo's latest commit: https://travis-ci.org/github/linkchecker/linkchecker/builds/668847616

mgedmin · 2020-03-30T17:45:52Z

I've disabled the branch protection in Travis temporarily to merge this, then reenabled it for future PRs.

PetrDlouhy and others added 19 commits January 5, 2018 22:48

Python3: fix few htmllib problems

2daf685

add charset tests

d1844a5

Remove home-cooked htmlparser and use BeautifulSoup

51a06d8

html parser: use name instead of peeking

d6d48b4

fix parser tests after parser change

0089349

fix http test after parser change

2c3c794

change test_parse encoding to UTF-8

b511145

fix parser encoding tests after change of parser

69d426b

UnicodeDammit input has to be non-unicode to trigger character set detection.

Don't set parser.encoding

4f8c295

Read-only property with new Beautiful Soup parser.

Support Beautiful Soup line numbers

607328d

Improve pretty printed comments

30df69c

Hide Beautiful Soup soupsieve warning

978042a

Shown every time linkchecker is run: /usr/lib/python3.7/site-packages/bs4/element.py:16: UserWarning: The soupsieve package is not installed. CSS selectors cannot be used. 'The soupsieve package is not installed. CSS selectors cannot be used.'

Reuse soup object used for detecting encoding in the HTML parser

153e53b

Pass encoding when unquoting

646e138

Else non-UTF-8 codes are misinterpreted: >>> from urllib import parse >>> parse.unquote("%FF") '�' >>> parse.unquote("%FF", "latin1") 'ÿ'

Update strformat.ascii_safe() because paths are now strings

5bb4524

Update fileutil.pathencode() because paths are now strings

a9f147c

Always use utf-8 encoding when quoting

b7ec71d

Add new tests for URL quoting

74d5c68

This was referenced Oct 30, 2019

[WIP] Convert HtmlParser to use Beautiful Soup #249

Closed

Python 3 support #205

Closed

Htmlparser beatifulsoup #119

Closed

(Not to be pulled) Charset tests #125

Closed

mgedmin reviewed Oct 30, 2019

View reviewed changes

mgedmin approved these changes Mar 23, 2020

View reviewed changes

anarcat approved these changes Mar 23, 2020

View reviewed changes

mgedmin added 2 commits March 30, 2020 14:46

Restore support for older BeautifulSoup4 versions

af0f50e

anarcat closed this Mar 30, 2020

anarcat reopened this Mar 30, 2020

Remove Python 3 status document

9030050

cjmayo changed the title ~~[WIP] Change HtmlParser to use Beautiful Soup~~ Change HtmlParser to use Beautiful Soup Mar 30, 2020

mgedmin merged commit 7853095 into master Mar 30, 2020

mgedmin deleted the htmlparser-beautifulsoup branch March 30, 2020 17:45

cjmayo mentioned this pull request Mar 30, 2020

Add new tests for URL quoting #311

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change HtmlParser to use Beautiful Soup #337

Change HtmlParser to use Beautiful Soup #337

cjmayo commented Oct 30, 2019

mgedmin Oct 30, 2019

cjmayo Nov 2, 2019

mgedmin Oct 30, 2019

cjmayo Nov 2, 2019

cjmayo Nov 4, 2019

cjmayo Nov 5, 2019

mgedmin Oct 30, 2019

mgedmin left a comment •

edited

Loading

mgedmin commented Oct 30, 2019

cjmayo commented Nov 2, 2019

cjmayo commented Nov 2, 2019

mgedmin left a comment •

edited

Loading

anarcat commented Mar 23, 2020

anarcat left a comment

cjmayo commented Mar 23, 2020

mgedmin commented Mar 23, 2020

mgedmin commented Mar 23, 2020

cjmayo commented Mar 24, 2020

cjmayo commented Mar 29, 2020

mgedmin commented Mar 30, 2020

mgedmin commented Mar 30, 2020

mgedmin commented Mar 30, 2020

anarcat commented Mar 30, 2020

mgedmin commented Mar 30, 2020

anarcat commented Mar 30, 2020

mgedmin commented Mar 30, 2020

cjmayo commented Mar 30, 2020

cjmayo commented Mar 30, 2020

mgedmin commented Mar 30, 2020

anarcat commented Mar 30, 2020

mgedmin commented Mar 30, 2020

mgedmin commented Mar 30, 2020

Change HtmlParser to use Beautiful Soup #337

Change HtmlParser to use Beautiful Soup #337

Conversation

cjmayo commented Oct 30, 2019

mgedmin Oct 30, 2019

Choose a reason for hiding this comment

cjmayo Nov 2, 2019

Choose a reason for hiding this comment

mgedmin Oct 30, 2019

Choose a reason for hiding this comment

cjmayo Nov 2, 2019

Choose a reason for hiding this comment

cjmayo Nov 4, 2019

Choose a reason for hiding this comment

cjmayo Nov 5, 2019

Choose a reason for hiding this comment

mgedmin Oct 30, 2019

Choose a reason for hiding this comment

mgedmin left a comment • edited Loading

Choose a reason for hiding this comment

mgedmin commented Oct 30, 2019

cjmayo commented Nov 2, 2019

cjmayo commented Nov 2, 2019

mgedmin left a comment • edited Loading

Choose a reason for hiding this comment

anarcat commented Mar 23, 2020

anarcat left a comment

Choose a reason for hiding this comment

cjmayo commented Mar 23, 2020

mgedmin commented Mar 23, 2020

mgedmin commented Mar 23, 2020

cjmayo commented Mar 24, 2020

cjmayo commented Mar 29, 2020

mgedmin commented Mar 30, 2020

mgedmin commented Mar 30, 2020

mgedmin commented Mar 30, 2020

anarcat commented Mar 30, 2020

mgedmin commented Mar 30, 2020

anarcat commented Mar 30, 2020

mgedmin commented Mar 30, 2020

cjmayo commented Mar 30, 2020

cjmayo commented Mar 30, 2020

mgedmin commented Mar 30, 2020

anarcat commented Mar 30, 2020

mgedmin commented Mar 30, 2020

mgedmin commented Mar 30, 2020

mgedmin left a comment •

edited

Loading

mgedmin left a comment •

edited

Loading