Consider using BeautifulSoup instead of built-in HTML parser? #26

astrofrog · 2017-02-05T10:34:37Z

I wonder whether it would be worth considering using an existing HTML parser such as BeautifulSoup to avoid having to include C code in the linkchecker package? This might lower the maintenance burden in the long term (since keeping C extensions working across platforms is not trivial).

anarcat · 2017-02-06T15:06:44Z

I absolutely agree. We also have tons of duct tape around the HTML parsing, and it gives us weird results; see #23 for example.

ghost · 2017-02-07T10:16:47Z

Definitely. I need an immediate fix for #23 just to be able to use linkchecker, but what we've added now is an ugly patch, and I can see us needing to add more of these with the current homegrown parser.

PetrDlouhy · 2018-01-07T02:30:58Z

Much of my work in #40 is related to the HTML parser. Two problems with it remain that cause failed tests on Python 3, and I am unable to solve them right now.

There are a ton of special cases that a more widely used parser handles properly and that the built-in parser might not.

So I think it would be a huge benefit if this gets implemented.

PetrDlouhy · 2018-01-09T20:52:56Z

I have worked on this. See #119. It would require extensive testing and possibly some improvements, though.

cjmayo · 2020-08-06T18:55:50Z

Good idea! Thanks Petr for showing it was possible.
Done.

anarcat added the enhancement label Feb 6, 2017

PetrDlouhy mentioned this issue Jan 7, 2018

Htmlparser beatifulsoup #119

Closed

cjmayo closed this as completed Aug 6, 2020
