Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Future of the sanitizer #443

Open
gsnedders opened this issue Feb 26, 2020 · 9 comments
Open

Future of the sanitizer #443

gsnedders opened this issue Feb 26, 2020 · 9 comments

Comments

@gsnedders
Copy link
Member

As it is, many of the open issues relate to the sanitizer and the default set of elements/attributes allowed in it.

I've had occasional discussions with @willkg about whether or not it makes sense to keep on maintaining the sanitizer as part of html5lib, and on the whole my opinion has for a long time been that it doesn't. IMO, it should either become a project in its own right or incorporated into Bleach (cc/ @g-k who's probably the other person there who needs to be involved in this discussion).

Why do I think it makes sense to split out? At this point it's relatively tangential to the rest of the project (it's not tightly coupled to any part of html5lib and it operates purely with the public API), but it's arguably the most in need of maintenance part of the project (as it is in many ways more security sensitive than the majority of the rest).

One relatively simple option is to split it out into a project of its own right (potentially initially as a cyclic dependency, whereby the existing API continues to function), and see if anyone wants to maintain it as a separate project.

@g-k
Copy link

g-k commented May 14, 2020

I'm +1 on removing it or splitting it out.

I don't know if anyone is using it directly, but we could host it in bleach to reduce surface area and prevent things like mozilla/bleach#534

@gsnedders
Copy link
Member Author

I don't know if anyone is using it directly, but we could host it in bleach to reduce surface area and prevent things like mozilla/bleach#534

Some people are, but really not many. I don't think any are really using it in any way that it doesn't make sense to push people towards Bleach. And that html5lib is, uh, "lightly maintained" becomes doubly-problematic if we have security bugs in the sanitizer like CVE-2020-6817.

@twm
Copy link
Contributor

twm commented Jun 27, 2020

To provide the feedback requested in the release notes—

I'm currently using html5lib's sanitizer directly. I originally moved to it from Bleach because I needed to be able to chain filters in front of the sanitizer. It looks like Bleach now exposes its Filter directly so I may be able to adopt it. However I also use html5lib directly, so I don't love taking a second copy (Bleach vendors it).

If that doesn't work I'll probably fork and vendor html5lib's sanitizer. From my perspective that is roughly the same maintenance burden as now, but fewer monkeypatches.

@pllim
Copy link

pllim commented Sep 17, 2020

Hello. This deprecation has come up in our project's CI as such (full log):

.../astropy/utils/xml/writer.py:195: in <lambda>
    self.xml_escape_cdata = lambda x: bleach.clean(x, **clean_kwargs)
.../bleach/__init__.py:84: in clean
    return cleaner.clean(text)
.../bleach/sanitizer.py:175: in clean
    filtered = BleachSanitizerFilter(
.../bleach/sanitizer.py:273: in __init__
    return super(BleachSanitizerFilter, self).__init__(source, **kwargs)
...
E       DeprecationWarning: html5lib's sanitizer is deprecated;
see https://github.com/html5lib/html5lib-python/issues/443 and
please let us know if Bleach is unsuitable for your needs

As the traceback says, we are already using Bleach (3.2.0), so how do we get rid of the warning? Thank you.

@gsnedders
Copy link
Member Author

Ah, Bleach 3.2.0 came out yesterday/today with the vendored html5lib updated to 1.1. They probably should've avoided the warning showing up through their vendored copy! I suggest filing a bug on Bleach for this?

@sferencik
Copy link

This comment should maybe be a new issue in Bleach, but before that (in the spirit of "please let us know if Bleach is unsuitable for your needs"):

In my project, I've updated to html5lib 1.1 and switched to using Bleach (v4.1.0) for sanitizing. I've replaced this

html5lib.serialize(html5parser.HTMLParser().parseFragment("<u>hi</u>"), sanitize=True)
# -> <u>hi</u>

by this

bleach.clean("<u>hi</u>")
# -> &lt;u&gt;hi&lt;/u&gt;

Notice how Bleach escapes my <u>hi</u> because the u-tag isn't on its very conservative allow-list. I could override the list like so:

bleach.clean("<u>hi</u>", tags=MY_LIST_THAT_INCLUDES_U)
# -> <u>hi</u>

but I don't see how I could construct a practical MY_LIST_THAT_INCLUDES_U. The deprecated html5lib.sanitizer has a long allow-list and, ironically, Bleach respects/uses that list when sanitizing. However, it uses the shorter list for tokenizing, and so "<u>hi</u>" comes out as a text fragment, rather than a u-node (which could then be sanitized - or not - by the sanitizer).

Has someone run into this? Am I missing something?

@facundochaud-eb
Copy link

Hi @sferencik ! I'm facing something similar when replacing html5lib.filters.sanitizer.Filter with bleach.sanitizer.BleachSanitizerFilter. The BleachSanitizerFilter has some default ALLOWED_ATTRIBUTES which is a very reduced list compared with the html5lib's one. Where you able to solve your issue?

@sferencik
Copy link

sferencik commented Mar 9, 2022

Hi, yes. I raised this with Bleach, mozilla/bleach#624, and ended up writing up the migration notes in that repo: https://github.com/mozilla/bleach/blob/main/docs/migrating.rst. Feel free to elaborate on it, adding your experience!

@twm
Copy link
Contributor

twm commented Jan 25, 2023

Bleach has been deprecated because html5lib is unmaintained.

@ambv ambv changed the title Future of the saniziter Future of the sanitizer Mar 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants