Fix regressions from Bleach 2.0 rewrite #391
Conversation
The Bleach 2.0 rewrite changed Bleach's clean() from making
changes during the tokenizing step to operating as an html5lib filter
that runs between the parsing and serializing steps. One consequence
of that rewrite was that the html5lib parser got to
"fix" the HTML before Bleach could strip/escape disallowed tags. Because
of that, there were a variety of regressions between Bleach 1.x behavior
and Bleach 2.x behavior. For example:
1. Added end tags:
bleach.clean('<sarcasm>') -> '<sarcasm></sarcasm>'
2. Parsed malformed tags as comments:
bleach.clean('</sarcasm>') -> ''
Et cetera.
This change fixes that in a few ways. First, it overrides emitCurrentToken()
in the tokenizer so it knows about the allowed tags and whether disallowed
tags are going to get stripped or escaped. It does the stripping at that point
and converts tags that are to be escaped into character data. That way the
parser doesn't do any "fixing".
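The core of that override can be sketched as a small helper. This is a minimal illustration of the idea, not Bleach's actual code: the dict-shaped tokens and the `convert_disallowed` name are simplified stand-ins for html5lib's real token dicts and the tokenizer's `emitCurrentToken()` method.

```python
# Simplified stand-ins for html5lib token types (assumption: the real
# code uses html5lib's tokenTypes constants instead of strings).
START_TAG, END_TAG, CHARACTERS = "StartTag", "EndTag", "Characters"

def convert_disallowed(token, allowed_tags, strip):
    """Mimic the emitCurrentToken() override: allowed tags pass
    through unchanged; disallowed tags are dropped when stripping,
    otherwise turned back into character data so the parser never
    "fixes" them and the sanitizer can escape them later."""
    if token["type"] in (START_TAG, END_TAG) and token["name"] not in allowed_tags:
        if strip:
            return None  # stripped: emit nothing
        slash = "/" if token["type"] == END_TAG else ""
        # Rough reconstruction of the tag text as Characters data.
        return {"type": CHARACTERS, "data": "<%s%s>" % (slash, token["name"])}
    return token
```

With `strip=False`, a disallowed `<sarcasm>` start tag comes out as a Characters token holding the literal text, which the sanitizer filter then escapes instead of the parser inventing a matching end tag.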
Second, it adds additional scrutiny for the
"expected-closing-tag-but-got-char" parser error, which gets kicked
up when a tag is malformed. The parser would treat the malformed tag
as a comment, and then Bleach's clean() would drop it. This change
instead turns it into character data to be escaped later.
Fixes #271
Fixes #279
Fixes #280
```python
last_error_token = token
continue

yield token
```
This has the same behavior as the original tokenizing, except that it looks at the next token in the case of ParseError tokens, and if it's an expected-closing-tag-but-got-char error, it does this funky thing; the comments above should make that clear, and if not, we should fix them.
Otherwise this does the same thing as what the tokenizer used to do.
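The look-ahead described above can be sketched as a small generator. This is an illustrative simplification, not the actual filter: the dict tokens and the `filter_parse_errors` name are made up for the example, and the real code works on html5lib's token stream.

```python
def filter_parse_errors(tokens):
    """Hold on to a ParseError token; if it's the
    expected-closing-tag-but-got-char case, re-emit the token that
    follows (which the parser would otherwise treat as a comment and
    drop) as character data to be escaped later."""
    last_error_token = None
    for token in tokens:
        if token.get("type") == "ParseError":
            last_error_token = token
            continue
        if (last_error_token is not None
                and last_error_token["data"] == "expected-closing-tag-but-got-char"):
            # The malformed tag becomes Characters instead of a Comment.
            token = {"type": "Characters", "data": token.get("data", "")}
        last_error_token = None
        yield token
```

Anything that isn't preceded by that specific parser error passes through untouched, matching the old tokenizer behavior.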
```python
self.state = self.dataState
return

super(BleachHTMLTokenizer, self).emitCurrentToken()
```
This is the crux of the regression fix. Many things in the tokenizer call emitCurrentToken. This overrides that, looks at the token that's going to be emitted, and if it looks like a tag but isn't an allowed tag, converts it to a Characters token.
This prevents the parser from "fixing" the HTML tags for tags that are going to be stripped/escaped. Now they'll remain exactly as they are in the stream.
I claim this is the right thing to do because Characters tokens get escaped and handled appropriately in the sanitizer filter later on.
The one thing I'm not 100% positive about is whether it's possible to create a sequence of characters such that converting a StartTag/EndTag to a Characters token in the middle yields a bad thing. For example, something like this:
[Characters] [StartTag] [Characters]
becomes:
[Characters] [Characters] [Characters]
And because the middle one is a separate token from the outer ones and thus sanitized in isolation, combining the three tokens in serializing somehow creates a bad thing.
If it creates a bad thing, then we're going to need to further fix this to merge consecutive Characters tokens somehow. I think we'd need to wrap the tokenizer in a Characters-tokens merging thing.
That's the thing that has me slightly uneasy. I don't know what else is possible here that I'm not thinking of.
Nice!
I'd worry about escaping somehow leading to: 1) dangling tags (an opening tag without a closing one and potentially leaking data), 2) extra tags, or 3) escaped characters that get ignored but result in an HTML tag.
However, I don't think that should be possible if the [StartTag] to [Characters] conversion is escaping things properly for an HTML document/body context. I think the caution is justified though. I'll have to think about this some more.
Just to clarify, this covers StartTag and EndTag, so we should never end up with dangling tags. In both cases, the tokenizer will check to see if they're allowed and if not, convert them to Characters.
```
<IMG SRC="javascript:alert('XSS');">
--
<img src="javascript:alert('XSS');">
<IMG SRC="javascript:alert('XSS');">
```
Yay! We now maintain case for escaped things. That's a regression that didn't even have an issue written up for it.
```python
'Yeah right <sarcasm/>'
)
```
I merged a bunch of tests into a big stripping test with many test cases and a big escaping test with many test cases. It cleaned up this file a bunch plus added some additional coverage.
I ran this through the Standups status message corpus (58,000+ messages, some of which are from people working on stylo and other HTML tag things) and it looks much better now. I didn't hit anything new that should get fixed.
g-k
left a comment
r+ w/ a couple random nits/questions but lgtm
Huge improvement, great job, it's really good to fix those regressions, and the unreported case-sensitivity issue too.
```python
name_reversed.append('/')
name_reversed_len = len(name_reversed)

quotey_things = '"\''
pile = []
in_quotes = []
```
So I think if we made these and quotey_things parameters, we could support escaping for other contexts, e.g. if someone wants to inject bleach output into tags.
Not something I want to think about right now, but could be interesting down the line.
That is interesting. The quotey things here are characters that delineate HTML attributes, so I was thinking we only need to be concerned with those characters.
Because this code is balancing quote characters, it assumes that the tag is well-formed HTML. I'm not sure what happens if it's not well-formed and I don't think I have any tests for that. I'll look at that.
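The quote-balancing being discussed can be illustrated with a tiny checker. This is only a sketch of the concept, not Bleach's implementation: the `quotes_balanced` name is invented here, and it ignores everything except the attribute-delimiting quote characters.

```python
QUOTEY_THINGS = '"\''  # characters that delimit HTML attribute values

def quotes_balanced(tag_text):
    """Walk the tag text tracking which quote character (if any) we're
    currently inside, so a quote of the other kind appearing inside a
    quoted value doesn't count as a delimiter."""
    in_quote = None
    for ch in tag_text:
        if in_quote is None and ch in QUOTEY_THINGS:
            in_quote = ch          # opening quote
        elif ch == in_quote:
            in_quote = None        # matching closing quote
    return in_quote is None
```

For well-formed tags this comes out balanced; for a tag with a dangling quote it doesn't, which is exactly the not-well-formed case the comment above says is untested.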
```
:arg tags: list of allowed tags--everything else is either stripped or
    escaped; if None, then this doesn't look at tags at all
:arg strip: whether to strip disallowed tags (True) or escape them (False);
    if tags=None, then this doesn't have any effect
```
Would it make sense to have separate parsers for strip and escape modes?
That's really interesting. It would make sense if it reduced the likelihood of mistakes and/or was more performant. I'll write up an issue to look into that after I land this.
```
<IMG SRC="javascript:alert('XSS');">
--
<img src="javascript:alert('XSS');">
<IMG SRC="javascript:alert('XSS');">

>"'><img%20src%3D%26%23x6a;%26%23x61;%26%23x76;%26%23x61;%26%23x73;%26%23x63;%26%23x72;%26%23x69;%26%23x70;%26%23x74;%26%23x3a;alert(%26quot;%26%23x20;XSS%26%23x20;Test%26%23x20;Successful%26quot;)>
--
>"'><img%20src%3d%26%23x6a;%26%23x61;%26%23x76;%26%23x61;%26%23x73;%26%23x63;%26%23x72;%26%23x69;%26%23x70;%26%23x74;%26%23x3a;alert(%26quot;%26%23x20;xss%26%23x20;test%26%23x20;successful%26quot;)></img%20src%3d%26%23x6a;%26%23x61;%26%23x76;%26%23x61;%26%23x73;%26%23x63;%26%23x72;%26%23x69;%26%23x70;%26%23x74;%26%23x3a;alert(%26quot;%26%23x20;xss%26%23x20;test%26%23x20;successful%26quot;)>
>"'><img%20src%3D%26%23x6a;%26%23x61;%26%23x76;%26%23x61;%26%23x73;%26%23x63;%26%23x72;%26%23x69;%26%23x70;%26%23x74;%26%23x3a;alert(%26quot;%26%23x20;XSS%26%23x20;Test%26%23x20;Successful%26quot;)>
```
```python
def test_invalid_char_in_tag():
    # NOTE(willkg): Two possible outcomes because attrs aren't ordered

# namespace didn't exist. After the fixes for Bleach 3.0, this no longer
# goes through the HTML parser as a tag, so it doesn't tickle the bad
# namespace code.
assert clean('<d {c}>') == '<d {c}>'
```
@g-k These comments are awesome! I'll work through them over the next couple of days.
^^^ That fixes/addresses many of the comments. Work that still needs to get done:

vvv That nixes (Edit: It was above this comment, but now it's below because I fixed something else and then amended and force-pushed. I don't know why I did that. But now it's done and while I could fiddle with git to undo it, it's 7:30pm and I'm tired.)
tagOpenState() is always called for every < that denotes a StartTag, EndTag, or ParseError and never any other time. We can (ab)use that fact to reset the input stream history and then we don't have to search backwards in the history for the < to find the original tag string which is super because that was fragile and not very performant.
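The trick described above can be sketched with a minimal reader that marks its position at each `<`. This is a hypothetical stand-in for the html5lib input-stream machinery, invented purely to illustrate why resetting at tagOpenState() removes the need to search backwards:

```python
class TagTextRecorder:
    """Remember the stream position at each '<' so the original tag
    text can be sliced out verbatim later, with no backwards search
    through the stream history."""
    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.tag_start = None

    def advance(self):
        """Consume and return one character (None at end of input)."""
        ch = self.text[self.pos] if self.pos < len(self.text) else None
        if ch == "<":
            # Equivalent of resetting the input stream history in
            # tagOpenState(): every tag-ish token starts here.
            self.tag_start = self.pos
        self.pos += 1
        return ch

    def original_tag_text(self):
        """Raw text from the last '<' up to the current position."""
        return self.text[self.tag_start:self.pos]
```

Because tagOpenState() fires for every `<` that could start a StartTag, EndTag, or ParseError, the mark is always fresh when a tag needs to be reconstructed, so case and malformed-ness are preserved exactly as they appeared in the stream.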
I think this is as good as we're going to get it at this stage. I'm going to land it and solicit help testing it out. Thank you, @g-k!