-
Notifications
You must be signed in to change notification settings - Fork 251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
unclosed script tag is escaped incorrectly #271
Comments
In both cases, the output HTML is safe--so we haven't lost anything there. With Bleach 2.0, you get some additional tags that get escaped. I think what's going on is that the open With Bleach 1.5, it worked as part of the tokenizing step of html5lib parsing, so it would see the Two related issues are these:
This is the failing test that you're reporting the issue from: That seems like junk text to me. Does that sort of thing show up in the AMO corpus? Is this regression in cleaning meaningful or is it just a thing? If it's meaningful, maybe "pre-treat" your text and escape the script tags, then run it through |
Its junk text for the test, but users have a tendency to write junk in descriptions on occasion so it's testing for it. The other failing test is similar but it's an unknown tag rather than Ideally we'd turn off whatever is deciding to add end tags when we wish it wouldn't ( |
It's not turn-offable as far as I could tell. You're welcome to look into it and see if there's a way to deal with it I didn't think of. That html5lib bug I wrote up is probably a good start for both issues. |
With Standup, I went through and clean/linkified the corpus with Bleach 1.5, then did it again with Bleach 2.0 and compared the two to see what users actually did and we didn't have any changes. So while users do crazy things in general, you might find that in your specific corpus, it's not a big deal. |
Since bleach is used to sanitize user input, perhaps often for the purposes of displaying it in web interfaces, I would think the use case of not closing disallowed tags would be valid since the disallowed tags would get rendered in web app UIs (and users would be surprised to see closing tags they didn't type). To give specific examples, I think this behavior is expected:
But this surprises me:
Since no tags are allowed in my second example, we probably want to print the text as typed by the user into a web interface. Since the user didn't type the closing "" it would be surpring for it to be rendered. |
@bowlofeggs I agree! This is just waiting on someone to work on it. I wrote out some thoughts in #271 (comment) including the upstream bug if you want to see if you can get further. Issue #255 covers looking for alternatives and/or vendoring/forking html5lib. All those solutions stink, but I mention it in case that gives you more flexibility for options for fixing this issue. |
This is especially unfortunate, because it also affects the RFC 5322 address specification:
To cover this particular case, I wrote a simple regex (as a preprocessor for python markdown):
|
@ValentinVoigt I agree! This is just waiting on someone like you to work on it. I wrote out some thoughts in #271 (comment) including the upstream bug if you want to see if you can get further. |
@eviljeff @ValentinVoigt @bowlofeggs: I just landed changes that fix this issue. Given that this issue was so important to you all, can you take some time and test those out with the data you were having problems with in the next few days? You can get a zipfile of master tip that has the changes here: https://github.com/mozilla/bleach/archive/master.zip |
I just imported the current master into my development system: It works as expected on our data. Thank you very much for your work! |
@willkg I think this seems reasonable - thanks! |
bleach.clean('<ul><li><script></li></ul>')
used to (correctly, I would assume,) just escape the<script>
and output<ul><li><script></li></ul>
now the output is
<ul><li><script></li></ul></script></li></ul>
Previously we were using Bleach 1.5.0 with html5lib 0.9999999; now we're using Bleach 2.0.0 with html5lib 0.999999999
The text was updated successfully, but these errors were encountered: