Fix regressions from Bleach 2.0 rewrite #391
Conversation
The Bleach 2.0 rewrite changed Bleach's clean() from making
changes during the tokenizing step to operating as an html5lib filter
that runs between the parsing and serializing steps. One consequence
of that rewrite was that the html5lib parser got to
"fix" the HTML before Bleach could strip/escape disallowed tags. Because
of that, there were a variety of regressions between Bleach 1.x behavior
and Bleach 2.x behavior. For example:
1. Added end tags:
bleach.clean('<sarcasm>') -> '<sarcasm></sarcasm>'
2. Parsed malformed tags as comments:
bleach.clean('</sarcasm>') -> ''
Et cetera.
This change fixes that in a few ways. First, it overrides emitCurrentToken()
in the tokenizer so it knows about the allowed tags and whether disallowed
tags are going to get stripped or escaped. It does the stripping at that point
and converts tags that are to be escaped into character data. That way the
parser doesn't do any "fixing".
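The core of that override can be sketched as a small helper. This is a minimal illustration of the idea, not Bleach's actual code: the dict-shaped tokens and the `convert_disallowed` name are simplified stand-ins for html5lib's real token dicts and the tokenizer's `emitCurrentToken()` method.

```python
# Simplified stand-ins for html5lib token types (assumption: the real
# code uses html5lib's tokenTypes constants instead of strings).
START_TAG, END_TAG, CHARACTERS = "StartTag", "EndTag", "Characters"

def convert_disallowed(token, allowed_tags, strip):
    """Mimic the emitCurrentToken() override: allowed tags pass
    through unchanged; disallowed tags are dropped when stripping,
    otherwise turned back into character data so the parser never
    "fixes" them and the sanitizer can escape them later."""
    if token["type"] in (START_TAG, END_TAG) and token["name"] not in allowed_tags:
        if strip:
            return None  # stripped: emit nothing
        slash = "/" if token["type"] == END_TAG else ""
        # Rough reconstruction of the tag text as Characters data.
        return {"type": CHARACTERS, "data": "<%s%s>" % (slash, token["name"])}
    return token
```

With `strip=False`, a disallowed `<sarcasm>` start tag comes out as a Characters token holding the literal text, which the sanitizer filter then escapes instead of the parser inventing a matching end tag.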
Second, it adds additional scrutiny for the
"expected-closing-tag-but-got-char" parser error, which gets kicked
up when a tag is malformed. The parser would treat the malformed tag
as a comment, and then Bleach's clean() would drop it. This change
instead turns it into character data to be escaped later.
Fixes #271
Fixes #279
Fixes #280
```python
last_error_token = token
continue

yield token
```
This has the same behavior as the original tokenizing, except that it looks at the next token in the case of ParseError tokens, and if it's an expected-closing-tag-but-got-char error, it does this funky thing; the comments above should make that clear, and if not, we should fix them.
Otherwise this does the same thing as what the tokenizer used to do.
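The look-ahead described above can be sketched as a small generator. This is an illustrative simplification, not the actual filter: the dict tokens and the `filter_parse_errors` name are made up for the example, and the real code works on html5lib's token stream.

```python
def filter_parse_errors(tokens):
    """Hold on to a ParseError token; if it's the
    expected-closing-tag-but-got-char case, re-emit the token that
    follows (which the parser would otherwise treat as a comment and
    drop) as character data to be escaped later."""
    last_error_token = None
    for token in tokens:
        if token.get("type") == "ParseError":
            last_error_token = token
            continue
        if (last_error_token is not None
                and last_error_token["data"] == "expected-closing-tag-but-got-char"):
            # The malformed tag becomes Characters instead of a Comment.
            token = {"type": "Characters", "data": token.get("data", "")}
        last_error_token = None
        yield token
```

Anything that isn't preceded by that specific parser error passes through untouched, matching the old tokenizer behavior.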
```python
self.state = self.dataState
return

super(BleachHTMLTokenizer, self).emitCurrentToken()
```
This is the crux of the regression fix. Many things in the tokenizer call emitCurrentToken. This overrides that, looks at the token that's going to be emitted, and if it looks like a tag but isn't an allowed tag, converts it to a Characters token.
This prevents the parser from "fixing" the HTML tags for tags that are going to be stripped/escaped. Now they'll remain exactly as they are in the stream.
I claim this is the right thing to do because Characters tokens get escaped and handled appropriately in the sanitizer filter later on.
The one thing I'm not 100% positive about is whether it's possible to create a sequence of characters such that converting a StartTag/EndTag to a Characters token in the middle yields a bad thing. For example, something like this:
[Characters] [StartTag] [Characters]
becomes:
[Characters] [Characters] [Characters]
And because the middle one is a separate token from the outer ones and thus sanitized in isolation, combining the three tokens in serializing somehow creates a bad thing.
If it creates a bad thing, then we're going to need to further fix this to merge consecutive Characters tokens somehow. I think we'd need to wrap the tokenizer in a Characters-tokens merging thing.
That's the thing that has me slightly uneasy. I don't know what else is possible here that I'm not thinking of.
Nice!
I'd worry about escaping somehow leading to: 1) dangling tags (an opening tag without a closing one and potentially leaking data), 2) extra tags, or 3) escaped characters that get ignored but result in an HTML tag.
However, I don't think that should be possible if the [StartTag] to [Characters] conversion is escaping things properly for an HTML document/body context. I think the caution is justified though. I'll have to think about this some more.
Just to clarify, this covers StartTag and EndTag, so we should never end up with dangling tags. In both cases, the tokenizer will check to see if they're allowed and if not, convert them to Characters.
```
<IMG SRC="javascript:alert('XSS');">
--
<img src="javascript:alert('XSS');">
<IMG SRC="javascript:alert('XSS');">
```
Yay! We now maintain case for escaped things. That's a regression that didn't even have an issue written up for it.
```python
'Yeah right <sarcasm/>'
)
```
I merged a bunch of tests into a big stripping test with many test cases and a big escaping test with many test cases. It cleaned up this file a bunch plus added some additional coverage.
I ran this through the Standups status message corpus (58,000+ messages, some of which are from people working on stylo and other HTML tag things) and it looks much better now. I didn't hit anything new that should get fixed.
g-k
left a comment
r+ w/ a couple random nits/questions but lgtm
Huge improvement, great job, it's really good to fix those regressions, and the unreported case-sensitivity issue too.
```python
name_reversed.append('/')
name_reversed_len = len(name_reversed)

quotey_things = '"\''
pile = []
in_quotes = []
```
So I think if we made these and quotey_things parameters, we could support escaping for other contexts, e.g. if someone wants to inject bleach output into tags.
Not something I want to think about right now, but could be interesting down the line.
That is interesting. The quotey things here are characters that delineate HTML attributes, so I was thinking we only need to be concerned with those characters.
Because this code is balancing quote characters, it assumes that the tag is well-formed HTML. I'm not sure what happens if it's not well-formed and I don't think I have any tests for that. I'll look at that.
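The quote-balancing being discussed can be illustrated with a tiny checker. This is only a sketch of the concept, not Bleach's implementation: the `quotes_balanced` name is invented here, and it ignores everything except the attribute-delimiting quote characters.

```python
QUOTEY_THINGS = '"\''  # characters that delimit HTML attribute values

def quotes_balanced(tag_text):
    """Walk the tag text tracking which quote character (if any) we're
    currently inside, so a quote of the other kind appearing inside a
    quoted value doesn't count as a delimiter."""
    in_quote = None
    for ch in tag_text:
        if in_quote is None and ch in QUOTEY_THINGS:
            in_quote = ch          # opening quote
        elif ch == in_quote:
            in_quote = None        # matching closing quote
    return in_quote is None
```

For well-formed tags this comes out balanced; for a tag with a dangling quote it doesn't, which is exactly the not-well-formed case the comment above says is untested.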
```
:arg tags: list of allowed tags--everything else is either stripped or
    escaped; if None, then this doesn't look at tags at all
:arg strip: whether to strip disallowed tags (True) or escape them (False);
    if tags=None, then this doesn't have any effect
```
Would it make sense to have separate parsers for strip and escape modes?
That's really interesting. It would make sense if it reduced the likelihood of mistakes and/or was more performant. I'll write up an issue to look into that after I land this.
```
<IMG SRC="javascript:alert('XSS');">
--
<img src="javascript:alert('XSS');">
<IMG SRC="javascript:alert('XSS');">

>"'><img%20src%3D%26%23x6a;%26%23x61;%26%23x76;%26%23x61;%26%23x73;%26%23x63;%26%23x72;%26%23x69;%26%23x70;%26%23x74;%26%23x3a;alert(%26quot;%26%23x20;XSS%26%23x20;Test%26%23x20;Successful%26quot;)>
--
>"'><img%20src%3d%26%23x6a;%26%23x61;%26%23x76;%26%23x61;%26%23x73;%26%23x63;%26%23x72;%26%23x69;%26%23x70;%26%23x74;%26%23x3a;alert(%26quot;%26%23x20;xss%26%23x20;test%26%23x20;successful%26quot;)></img%20src%3d%26%23x6a;%26%23x61;%26%23x76;%26%23x61;%26%23x73;%26%23x63;%26%23x72;%26%23x69;%26%23x70;%26%23x74;%26%23x3a;alert(%26quot;%26%23x20;xss%26%23x20;test%26%23x20;successful%26quot;)>
>"'><img%20src%3D%26%23x6a;%26%23x61;%26%23x76;%26%23x61;%26%23x73;%26%23x63;%26%23x72;%26%23x69;%26%23x70;%26%23x74;%26%23x3a;alert(%26quot;%26%23x20;XSS%26%23x20;Test%26%23x20;Successful%26quot;)>
```
```python
def test_invalid_char_in_tag():
    # NOTE(willkg): Two possible outcomes because attrs aren't ordered

# namespace didn't exist. After the fixes for Bleach 3.0, this no longer
# goes through the HTML parser as a tag, so it doesn't tickle the bad
# namespace code.
assert clean('<d {c}>') == '<d {c}>'
```
@g-k These comments are awesome! I'll work through them over the next couple of days.
^^^ That fixes/addresses many of the comments. Work that still needs to get done:

vvv That nixes (Edit: It was above this comment, but now it's below because I fixed something else and then amended and force-pushed. I don't know why I did that. But now it's done and while I could fiddle with git to undo it, it's 7:30pm and I'm tired.)
tagOpenState() is always called for every < that denotes a StartTag, EndTag, or ParseError and never any other time. We can (ab)use that fact to reset the input stream history and then we don't have to search backwards in the history for the < to find the original tag string which is super because that was fragile and not very performant.
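The trick described above can be sketched with a minimal reader that marks its position at each `<`. This is a hypothetical stand-in for the html5lib input-stream machinery, invented purely to illustrate why resetting at tagOpenState() removes the need to search backwards:

```python
class TagTextRecorder:
    """Remember the stream position at each '<' so the original tag
    text can be sliced out verbatim later, with no backwards search
    through the stream history."""
    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.tag_start = None

    def advance(self):
        """Consume and return one character (None at end of input)."""
        ch = self.text[self.pos] if self.pos < len(self.text) else None
        if ch == "<":
            # Equivalent of resetting the input stream history in
            # tagOpenState(): every tag-ish token starts here.
            self.tag_start = self.pos
        self.pos += 1
        return ch

    def original_tag_text(self):
        """Raw text from the last '<' up to the current position."""
        return self.text[self.tag_start:self.pos]
```

Because tagOpenState() fires for every `<` that could start a StartTag, EndTag, or ParseError, the mark is always fresh when a tag needs to be reconstructed, so case and malformed-ness are preserved exactly as they appeared in the stream.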
I think this is as good as we're going to get it at this stage. I'm going to land it and solicit help testing it out. Thank you, @g-k!