Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use bleach for HTML sanitizing #257

Open
kurtmckee opened this issue Jan 4, 2021 · 4 comments
Open

Use bleach for HTML sanitizing #257

kurtmckee opened this issue Jan 4, 2021 · 4 comments

Comments

@kurtmckee
Copy link
Owner

This will likely be a breaking change.

feedparser's HTML sanitizing should not rely on custom internal code anymore. Using an external package like bleach will allow feedparser to focus more closely on feed parsing, and allow developers to consolidate HTML sanitizing efforts so that everyone benefits.

Interestingly, browsing the source code for Mozilla's bleach module links to the WHAT-WG documentation which states that its early work was based on feedparser's HTML sanitizing, so it appears that things have come full circle in the last ~17 years!

Early testing suggests that this will affect feedparser's output so that it is HTML5 but perhaps not XHTML or HTML4. For example, quotes may or may not always be used with element attributes.

@thigg
Copy link

thigg commented Aug 12, 2021

Is there already work on this feature that could be tested?

@kurtmckee
Copy link
Owner Author

kurtmckee commented Aug 12, 2021 via email

@masklinn
Copy link
Contributor

@kurtmckee in looking at the code of BaseHTMLProcessor it looks to have quite a bit of a mixup between the sanitisation and the loose parsing support, would better splitting those up (and possibly removing some of the dead code which looks to be in it) be a good first step?

Playing around it looks like all the tests pass if all the handle_, unknown_, and convert_ methods are removed from the LooseFeedParser as XMLParserMixin tends to implement those and not delegate (call super()).

And then HTMLSanitizer overrides a few methods which BaseHTMLProcessor defines/overrides, either with or without delegation.

Those without delegation mean the one in BaseHTMLProcessor is dead code, and the rest might be opportunities for simplification (or dead code as well).

@twm
Copy link
Contributor

twm commented Feb 8, 2023

A heads up — you probably don't to go down this road, as Bleach is deprecated because html5lib is not actively maintained (it's a bit circular 🙃).

CC @lemon24 as this was also discussed in #296.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants