Consider using BeautifulSoup instead of built-in HTML parser? #26
Comments
I absolutely agree. We also have tons of duct tape for parsing HTML that gives us weird results; see #23, for example.
Definitely. I need an immediate fix for #23 just to be able to use linkchecker, but what we've added now is an ugly patch, and I can see us needing to add more of these with the current homegrown parser.
Much of my work in #40 is related to the HTML parser, and two problems remain there that cause failed tests on Python 3; I am unable to solve them right now. A more widely used parser properly handles a ton of special cases that the built-in parser might not. So I think it would be a huge benefit if this gets implemented.
I have worked on this; see #119. It would require extensive testing and possibly some improvements, though.
Good idea! Thanks, Petr, for showing it was possible.
I wonder whether it would be worth considering using an existing HTML parser such as BeautifulSoup to avoid having to include C code in the linkchecker package? This might lower the maintenance burden in the long term (since keeping C extensions working across platforms is not trivial).
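To make the proposal concrete, here is a rough sketch of what link extraction with BeautifulSoup might look like. It is not taken from linkchecker or from #119: the `URL_ATTRS` table and the `extract_links` helper are hypothetical names for illustration, and the sketch assumes the `beautifulsoup4` package is installed and uses its pure-Python `html.parser` backend, so no C code is involved.

```python
from bs4 import BeautifulSoup

# Hypothetical table: tag name -> attribute that carries a URL.
# A real link checker would need to cover many more tags and attributes.
URL_ATTRS = {"a": "href", "img": "src", "link": "href", "script": "src"}


def extract_links(html):
    """Return (tag_name, url) pairs found in `html`, in document order."""
    # "html.parser" selects the pure-Python backend from the standard library.
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for tag in soup.find_all(list(URL_ATTRS)):
        url = tag.get(URL_ATTRS[tag.name])
        if url:  # skip tags without the attribute or with an empty value
            links.append((tag.name, url))
    return links


if __name__ == "__main__":
    # Deliberately sloppy markup: unclosed tags, an unquoted attribute,
    # and upper-case tag and attribute names.
    sample = "<a href=\"a.html\">x<img src=b.png><A HREF='c.html'>y"
    for name, url in extract_links(sample):
        print(name, url)
```

The point is only that tolerance for malformed markup comes from the library rather than from special cases maintained in linkchecker itself. Whether it reports everything the current parser does (line and column numbers, for instance) would still need checking.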