Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unclosed script tag is escaped incorrectly #271

Closed
eviljeff opened this issue Mar 17, 2017 · 11 comments · Fixed by #391
Closed

unclosed script tag is escaped incorrectly #271

eviljeff opened this issue Mar 17, 2017 · 11 comments · Fixed by #391

Comments

@eviljeff
Copy link
Member

eviljeff commented Mar 17, 2017

bleach.clean('<ul><li><script></li></ul>') used to (correctly, I would assume,) just escape the <script> and output <ul><li>&lt;script&gt;</li></ul>

now the output is <ul><li>&lt;script&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/script&gt;</li></ul>

Previously we were using Bleach 1.5.0 with html5lib 0.9999999; now we're using Bleach 2.0.0 with html5lib 0.999999999

@willkg
Copy link
Member

willkg commented Mar 17, 2017

In both cases, the output HTML is safe--so we haven't lost anything there.

With Bleach 2.0, you get some additional tags that get escaped. I think what's going on is that the open <script> slurps the text until a </script> or the end of the string and so all of that becomes part of the <script> and then html5lib closes off the <ul> and <li> tags.

With Bleach 1.5, it worked as part of the tokenizing step of html5lib parsing, so it would see the <script> token, escape it and move on before html5lib tries to fix the HTML so it parses correctly.

Two related issues are these:

This is the failing test that you're reporting the issue from:

https://github.com/mozilla/addons-server/blob/master/src/olympia/translations/tests/test_helpers.py#L83

That seems like junk text to me. Does that sort of thing show up in the AMO corpus? Is this regression in cleaning meaningful or is it just a thing?

If it's meaningful, maybe "pre-treat" your text and escape the script tags, then run it through bleach.clean?

@eviljeff
Copy link
Member Author

Its junk text for the test, but users have a tendency to write junk in descriptions on occasion so it's testing for it.

The other failing test is similar but it's an unknown tag rather than <script> (so it's only hitting html5lib/html5lib-python#312). It oddly even still adds an ending tag when the tag is written not to have one (e.g. bleach.clean('<ul><li><madeuptag/></li></ul>'))

Ideally we'd turn off whatever is deciding to add end tags when we wish it wouldn't (<script>) and arguably shouldn't (<madeuptag>) but if that's not possible I suppose we'll just workaround/ignore it.

@willkg
Copy link
Member

willkg commented Mar 17, 2017

It's not turn-offable as far as I could tell.

You're welcome to look into it and see if there's a way to deal with it I didn't think of. That html5lib bug I wrote up is probably a good start for both issues.

@willkg
Copy link
Member

willkg commented Mar 17, 2017

With Standup, I went through and clean/linkified the corpus with Bleach 1.5, then did it again with Bleach 2.0 and compared the two to see what users actually did and we didn't have any changes.

So while users do crazy things in general, you might find that in your specific corpus, it's not a big deal.

@bowlofeggs
Copy link

Since bleach is used to sanitize user input, perhaps often for the purposes of displaying it in web interfaces, I would think the use case of not closing disallowed tags would be valid since the disallowed tags would get rendered in web app UIs (and users would be surprised to see closing tags they didn't type).

To give specific examples, I think this behavior is expected:

>>> bleach.clean('<b>This is fine')
u'<b>This is fine</b>'

But this surprises me:

>>> bleach.clean('<b>This is surprising', tags=[])
u'&lt;b&gt;This is surprising&lt;/b&gt;'

Since no tags are allowed in my second example, we probably want to print the text as typed by the user into a web interface. Since the user didn't type the closing "" it would be surpring for it to be rendered.

@willkg
Copy link
Member

willkg commented Aug 16, 2017

@bowlofeggs I agree! This is just waiting on someone to work on it. I wrote out some thoughts in #271 (comment) including the upstream bug if you want to see if you can get further.

Issue #255 covers looking for alternatives and/or vendoring/forking html5lib. All those solutions stink, but I mention it in case that gives you more flexibility for options for fixing this issue.

@ValentinVoigt
Copy link

ValentinVoigt commented Dec 16, 2017

This is especially unfortunate, because it also affects the RFC 5322 address specification:

>>> clean("Foo <bar@example.com>")
'Foo &lt;bar@example.com&gt;&lt;/bar@example.com&gt;'

To cover this particular case, I wrote a simple regex (as a preprocessor for python markdown):

re.sub(
    r'<([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)>',
    r'&lt;\1&gt;',
    line
)

@willkg
Copy link
Member

willkg commented Dec 16, 2017

@ValentinVoigt I agree! This is just waiting on someone like you to work on it. I wrote out some thoughts in #271 (comment) including the upstream bug if you want to see if you can get further.

@willkg willkg added this to the v3.0.0 milestone Sep 4, 2018
@willkg willkg added clean and removed clean labels Sep 4, 2018
@willkg
Copy link
Member

willkg commented Sep 19, 2018

@eviljeff @ValentinVoigt @bowlofeggs: I just landed changes that fix this issue.

Given that this issue was so important to you all, can you take some time and test those out with the data you were having problems with in the next few days?

You can get a zipfile of master tip that has the changes here: https://github.com/mozilla/bleach/archive/master.zip

@ValentinVoigt
Copy link

I just imported the current master into my development system: It works as expected on our data.

Thank you very much for your work!

@bowlofeggs
Copy link

@willkg I think this seems reasonable - thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants