Fix regressions from Bleach 2.0 rewrite #391

Merged

willkg merged 5 commits into mozilla:master from willkg:280-regression-fix on Sep 19, 2018
Conversation

@willkg (Member) commented Sep 13, 2018

The Bleach 2.0 rewrite changed Bleach's clean() from making changes
during the tokenizing step to operating as an html5lib filter that runs
between the parsing and serializing steps. One consequence of that
rewrite was that the html5lib parser got to "fix" the HTML before Bleach
could strip/escape disallowed tags. Because of that, there were a
variety of regressions from Bleach 1.x behavior to Bleach 2.x behavior.
For example:

1. Added end tags:

   bleach.clean('<sarcasm>') -> '&lt;sarcasm&gt;&lt;/sarcasm&gt;'

2. Parsed malformed tags as comments:

   bleach.clean('</sarcasm>') -> ''


Et cetera.

This change fixes that in a few ways. First, it overrides emitCurrentToken()
in the tokenizer so that it knows about the allowed tags and whether
disallowed tags are going to get stripped or escaped. It does the stripping
at this point and converts tags that are to be escaped into character data.
That way the parser doesn't do any "fixing".

Second, it adds some additional scrutiny for the
"expected-closing-tag-but-got-char" parser error case. That gets kicked
up when the tag is malformed. The parser will treat this as a comment
and then Bleach clean() will drop it. Instead, this fixes that to
turn that into character data to be escaped later.
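That parse-error handling can be sketched as a small lookahead filter over the token stream. This is a minimal illustration with made-up names and html5lib-style token dicts, not the shim's actual code:

```python
def fix_malformed_closing_tags(tokens):
    """Sketch: when the tokenizer flags expected-closing-tag-but-got-char,
    the Comment token that follows is really a malformed tag; convert it
    to Characters so it gets escaped later instead of silently dropped."""
    last_error = None
    for token in tokens:
        if token["type"] == "ParseError":
            # Hold on to the error and peek at the next token.
            last_error = token
            continue
        if (last_error is not None
                and last_error["data"] == "expected-closing-tag-but-got-char"
                and token["type"] == "Comment"):
            token = {"type": "Characters", "data": "<%s>" % token["data"]}
        last_error = None
        yield token

stream = [
    {"type": "ParseError", "data": "expected-closing-tag-but-got-char"},
    {"type": "Comment", "data": "/sarcasm"},
]
print(list(fix_malformed_closing_tags(stream)))
# [{'type': 'Characters', 'data': '</sarcasm>'}]
```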

Fixes #271
Fixes #279
Fixes #280

willkg requested a review from g-k on September 13, 2018 at 21:12
Comment thread bleach/html5lib_shim.py
last_error_token = token
continue

yield token

willkg (Member, Author):

This has the same behavior as the original tokenizing, except that in the case of ParseError tokens it looks at the next token, and if the error is an expected-closing-tag-but-got-char error, it does this funky thing--the comments above should make that clear, and if not, we should fix them.

Otherwise this does the same thing as what the tokenizer used to do.

Comment thread bleach/html5lib_shim.py
self.state = self.dataState
return

super(BleachHTMLTokenizer, self).emitCurrentToken()

willkg (Member, Author):

This is the crux of the regression fix. Many things in the tokenizer call emitCurrentToken. This overrides that, looks at the token that's going to be emitted, and if it looks like a tag but isn't an allowed tag, converts it to a Characters token.

This prevents the parser from "fixing" the HTML tags for tags that are going to be stripped/escaped. Now they'll remain exactly as they are in the stream.

I claim this is the right thing to do because Characters tokens get escaped and handled appropriately in the sanitizer filter later on.

The one thing I'm not 100% positive about is whether it's possible to create a sequence of characters such that converting a StartTag/EndTag to a Characters tag in the middle yields a bad thing. For example, something like this:

    [Characters] [StartTag] [Characters]

becomes:

    [Characters] [Characters] [Characters]

And because the middle one is a separate token from the outer ones and thus sanitized in isolation, combining the three tokens in serializing somehow creates a bad thing.

If it creates a bad thing, then we're going to need to further fix this to merge consecutive Characters tokens somehow. I think we'd need to wrap the tokenizer in a Characters-tokens merging thing.
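If merging ever turned out to be necessary, a Characters-merging wrapper might look roughly like this (purely hypothetical; not part of this PR):

```python
def merge_characters_tokens(tokens):
    """Sketch: coalesce runs of adjacent Characters tokens into one token
    so downstream sanitizing/serializing sees them as a single unit."""
    buffer = []
    for token in tokens:
        if token["type"] == "Characters":
            buffer.append(token["data"])
            continue
        if buffer:
            yield {"type": "Characters", "data": "".join(buffer)}
            buffer = []
        yield token
    if buffer:
        # Flush any trailing run of Characters tokens.
        yield {"type": "Characters", "data": "".join(buffer)}

stream = [
    {"type": "Characters", "data": "a "},
    {"type": "Characters", "data": "<sarcasm>"},
    {"type": "StartTag", "name": "b"},
]
print(list(merge_characters_tokens(stream)))
# [{'type': 'Characters', 'data': 'a <sarcasm>'}, {'type': 'StartTag', 'name': 'b'}]
```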

That's the thing that has me slightly uneasy. I don't know what else is possible here that I'm not thinking of.

g-k (Collaborator):

Nice!

I'd worry about escaping somehow leading to: 1) dangling tags (an opening tag without a closing one and potentially leaking data), 2) extra tags, or 3) escaped characters that get ignored but result in an HTML tag.

However, I don't think that should be possible if the [StartTag] to [Characters] conversion is escaping things properly for an HTML document/body context. I think the caution is justified though. I'll have to think about this some more.

willkg (Member, Author):

Just to clarify, this covers StartTag and EndTag, so we should never end up with dangling tags. In both cases, the tokenizer will check to see if they're allowed and if not, convert them to Characters.

Comment thread tests/data/10.test
<IMG SRC="javascript:alert('XSS');">
--
&lt;img src="javascript:alert('XSS');"&gt;
&lt;IMG SRC="javascript:alert('XSS');"&gt;

willkg (Member, Author):

Yay! We now maintain case for escaped things. That's a regression that didn't even have an issue written up for it.

g-k (Collaborator):

Oh, nice

Comment thread tests/test_clean.py
'Yeah right &lt;sarcasm/&gt;'
)


willkg (Member, Author):

I merged a bunch of tests into a big stripping test with many test cases and a big escaping test with many test cases. It cleaned up this file a bunch plus added some additional coverage.

@willkg (Member, Author) commented Sep 13, 2018

I ran this through the Standups status message corpus (58,000+ messages, some of which are from people working on stylo and other HTML-tag-heavy things) and it looks much better now. I didn't hit anything new that should get fixed.

@g-k (Collaborator) left a review:

r+ w/ a couple random nits/questions but lgtm

Huge improvement, great job, it's really good to fix those regressions, and the unreported case-sensitivity issue too.

Comment thread bleach/html5lib_shim.py Outdated
name_reversed.append('/')
name_reversed_len = len(name_reversed)

quotey_things = '"\''

g-k (Collaborator):

this var name 😄

Comment thread bleach/html5lib_shim.py Outdated

quotey_things = '"\''
pile = []
in_quotes = []

g-k (Collaborator):

So I think if we made this and quotey_things params we could support escaping for other contexts now, e.g. if someone wants to inject bleach output into tags.

Not something I want to think about right now, but could be interesting down the line.

willkg (Member, Author):

That is interesting. The quotey things here are characters that delineate HTML attributes, so I was thinking we only need to be concerned with those characters.

Because this code is balancing quote characters, it assumes that the tag is well-formed HTML. I'm not sure what happens if it's not well-formed and I don't think I have any tests for that. I'll look at that.
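For illustration, the quote-balancing idea behind quotey_things/in_quotes might be sketched like this; the function name and exact behavior are assumptions, since the snippet above only shows the variable setup:

```python
def split_tag_at_close(text, quotey_things='"\''):
    """Sketch: return the index just past the first '>' that is not inside
    an HTML attribute quote, assuming the tag is well-formed."""
    in_quotes = []
    for i, ch in enumerate(text):
        if ch in quotey_things:
            if in_quotes and in_quotes[-1] == ch:
                in_quotes.pop()        # matching closing quote
            else:
                in_quotes.append(ch)   # opening quote
        elif ch == ">" and not in_quotes:
            return i + 1
    return len(text)

# A '>' inside a quoted attribute value doesn't end the tag:
print(split_tag_at_close('<a title="x>y">rest'))  # 15
```

As the comment above notes, this relies on quotes being balanced; malformed input would fall through to the end of the string.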

Comment thread bleach/html5lib_shim.py
:arg tags: list of allowed tags--everything else is either stripped or
escaped; if None, then this doesn't look at tags at all
:arg strip: whether to strip disallowed tags (True) or escape them (False);
if tags=None, then this doesn't have any effect

g-k (Collaborator):

Would it make sense to have separate parsers for strip and escape modes?

willkg (Member, Author):

That's really interesting. It would make sense if it reduced the likelihood of mistakes and/or was more performant. I'll write up an issue to look into that after I land this.

Comment thread tests/data/10.test
<IMG SRC="javascript:alert('XSS');">
--
&lt;img src="javascript:alert('XSS');"&gt;
&lt;IMG SRC="javascript:alert('XSS');"&gt;

g-k (Collaborator):

Oh, nice

Comment thread tests/data/3.test
>"'><img%20src%3D%26%23x6a;%26%23x61;%26%23x76;%26%23x61;%26%23x73;%26%23x63;%26%23x72;%26%23x69;%26%23x70;%26%23x74;%26%23x3a;alert(%26quot;%26%23x20;XSS%26%23x20;Test%26%23x20;Successful%26quot;)>
--
&gt;"'&gt;&lt;img%20src%3d%26%23x6a;%26%23x61;%26%23x76;%26%23x61;%26%23x73;%26%23x63;%26%23x72;%26%23x69;%26%23x70;%26%23x74;%26%23x3a;alert(%26quot;%26%23x20;xss%26%23x20;test%26%23x20;successful%26quot;)&gt;&lt;/img%20src%3d%26%23x6a;%26%23x61;%26%23x76;%26%23x61;%26%23x73;%26%23x63;%26%23x72;%26%23x69;%26%23x70;%26%23x74;%26%23x3a;alert(%26quot;%26%23x20;xss%26%23x20;test%26%23x20;successful%26quot;)&gt;
&gt;"'&gt;&lt;img%20src%3D%26%23x6a;%26%23x61;%26%23x76;%26%23x61;%26%23x73;%26%23x63;%26%23x72;%26%23x69;%26%23x70;%26%23x74;%26%23x3a;alert(%26quot;%26%23x20;XSS%26%23x20;Test%26%23x20;Successful%26quot;)&gt;

g-k (Collaborator):

😄

Comment thread tests/test_clean.py


def test_invalid_char_in_tag():
# NOTE(willkg): Two possible outcomes because attrs aren't ordered

g-k (Collaborator):

yesss

Comment thread tests/test_clean.py
# namespace didn't exist. After the fixes for Bleach 3.0, this no longer
# goes through the HTML parser as a tag, so it doesn't tickle the bad
# namespace code.
assert clean('<d {c}>') == '&lt;d {c}&gt;'

g-k (Collaborator):

👍 💯

@willkg (Member, Author) commented Sep 17, 2018

@g-k These comments are awesome! I'll work through them over the next couple of days.

@willkg (Member, Author) commented Sep 18, 2018

^^^ That fixes/addresses many of the comments.

Work that still needs to get done:

@willkg (Member, Author) commented Sep 18, 2018

vvv That nixes get_recent_tag_string() altogether by abusing tagOpenState(). This is a muuuuuch better idea.

(Edit: It was above this comment, but now it's below because I fixed something else and then amended and force-pushed. I don't know why I did that. But now it's done and while I could fiddle with git to undo it, it's 7:30pm and I'm tired.)

tagOpenState() is always called for every < that denotes a StartTag,
EndTag, or ParseError and never any other time. We can (ab)use that
fact to reset the input stream history and then we don't have
to search backwards in the history for the < to find the original
tag string which is super because that was fragile and not very
performant.
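
The stream-reset trick in the commit message could be sketched like this; the class and method names are hypothetical, standing in for the overridden tagOpenState() and the tokenizer's input stream:

```python
class RecordingStream:
    """Sketch: an input stream that records consumed characters so the raw
    text of the current tag can be recovered without searching backwards."""
    def __init__(self, text):
        self._chars = iter(text)
        self.history = []

    def char(self):
        c = next(self._chars, None)
        if c is not None:
            self.history.append(c)
        return c

    def start_tag_history(self):
        # Called from an overridden tagOpenState(): every StartTag, EndTag,
        # or ParseError begins with "<", so resetting here means history
        # always holds exactly the raw source of the current tag.
        self.history = ["<"]

    def get_tag_string(self):
        return "".join(self.history)

stream = RecordingStream("sarcasm>more text")
stream.start_tag_history()          # the "<" was just consumed
while stream.char() not in (">", None):
    pass                            # consume through the end of the tag
print(stream.get_tag_string())      # <sarcasm>
```
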

@willkg (Member, Author) commented Sep 19, 2018

I think this is as good as we're going to get it at this stage. I'm going to land it and solicit help testing it out.

Thank you, @g-k!

willkg merged commit a6b14bd into mozilla:master on Sep 19, 2018
willkg deleted the 280-regression-fix branch on October 1, 2018

Development

Successfully merging this pull request may close these issues.

- tags get added when tokenizing with bleach 2.0
- <isindex> tag being replaced with a string
- unclosed script tag is escaped incorrectly

2 participants