Remove <script> tags with their content. #57

originell · 2012-02-28T11:28:19Z

I needed to remove HTML tags and found that the contents of script tags were not stripped along with them. Therefore I implemented an option to do just that 😄

originell · 2012-02-28T11:39:06Z

Just found a bug. Gonna fix and extend the commit range.

jsocol · 2012-02-28T15:08:16Z

bleach/__init__.py

@@ -1,8 +1,6 @@
-import itertools


Ah thanks, I missed those.

jsocol · 2012-02-28T15:12:21Z

Ping me if I don't see the new commits. :)

originell · 2012-02-28T16:34:36Z

I will try to get it ironed out by Thursday. The problem is that this case is a bit hard to detect:

<p>Some random text <script>function with_a_(script) { alert("tag <script> inside a js string"); }</script> without cutting away the after-script-tag text</p>

html5lib sees the <script> inside the JavaScript String as a new HTML element. I have a solution but it's not working perfectly yet ;-)

It is now possible to detect <script> tags inside <script> tags... This fixes a bug with content being deleted after a <script> tag.
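For comparison, here is a minimal sketch of the behaviour being aimed at, run on the example above. It uses only the standard library's `html.parser` and is not the code from this pull request, nor html5lib or bleach. `ScriptDropper` and `drop_scripts` are hypothetical names, and the sketch does no sanitizing at all. It relies on `HTMLParser` treating the content of a `<script>` element as raw text until the closing tag, so the `<script>` inside the JavaScript string never shows up as a start tag.

```python
from html.parser import HTMLParser


class ScriptDropper(HTMLParser):
    """Re-emit markup as written, minus <script> elements and their content.

    Illustration only: comments and doctypes are silently dropped, and
    nothing here escapes or whitelists anything.
    """

    def __init__(self):
        # Keep entities as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        elif not self.in_script:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; emit them once, unchanged.
        if tag != "script" and not self.in_script:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        elif not self.in_script:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Script content arrives here as plain data and is skipped.
        if not self.in_script:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.in_script:
            self.out.append("&%s;" % name)

    def handle_charref(self, name):
        if not self.in_script:
            self.out.append("&#%s;" % name)


def drop_scripts(html):
    parser = ScriptDropper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


src = ('<p>Some random text <script>function with_a_(script) '
       '{ alert("tag <script> inside a js string"); }</script> '
       'without cutting away the after-script-tag text</p>')
print(drop_scripts(src))
```

The text after the closing `</script>` survives, which is the part that was getting lost in the bug described above.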

originell · 2012-03-04T13:58:12Z

Soo. This should fix that nasty bug. I have the feeling that I've missed something in the tests. However this could just be me being paranoid haha :)

To be honest, it seems to work better than I thought *g*

originell · 2012-03-04T14:44:15Z

Found something else I missed ;-) See the commit. Furthermore, I added another, more real-life test.

jsocol · 2012-03-06T14:56:18Z

bleach/sanitizer.py

+                                           basestring)):
+                        self.skip_token = False
+                    # This might be too dumb.
+                    elif any([keyw in self.previous_token['data']


Add a .lower() so we catch VAR etc? What's the worst case if this misses? I assume that any errant <script> tags would still get escaped if they're not whitelisted?

Ah sure I will add that... Mhh before I answer the second question I'm going to write some tests ;-)

I added some more tests to ensure correct handling of failure cases. See tests_security.py L167+

jsocol · 2012-03-06T15:06:04Z

One nit and one question but otherwise it's looking good and tests are passing.

My preference is to land this on master and not backport to 1.1.x, but since it's a backwards compatible API change I'm open to doing it as a 1.1.2.

jsocol · 2012-05-12T02:31:35Z

Rebased everything here (surprisingly, the rebase just worked!)

I want another set of eyes on this so I'm going to ask some security folks to take a gander before I merge to master, and I'll probably do some squashing to get it cleaned up. It'll make 1.2.

originell · 2012-05-12T08:47:23Z

Ha very nice! 👍

Yeah, I'd be glad to have another pair of eyes looking over this too. I'm using this in production right now and it seems to perform just fine. However, you never know! Maybe I should've added the bitwise operators to the keywords.

jbalogh · 2012-05-13T18:07:00Z

bleach/sanitizer.py

+                              for keyw in ('"', "'", 'var', ';', '=', '{',
+                                           '}', '[', ']', '++', '--', '+=',
+                                           '-=', '*=', '/=', '%=', 'return',
+                                           'function')]):


What does this do? Why do you have an incomplete list of javascript productions? I'm not familiar with the data structures html5lib is producing, but I'm suspicious of this part. What happens if my javascript doesn't start with one of these?

Also, is this loop getting run on every token inside a script tag? How does that affect performance?

html5lib returns dicts with the tags split into their parts. This means, for example, that the following HTML:

<p>This is a text.</p>

might be returned as

{'name': 'P', 'data': 'This is a text', 'type': …}

So what happens in the whole block (starting at L80) is that I'm trying to be "smart" about the tag. Is there another <script> tag inside the <script> tag? If not, I move on to detect what is happening inside the <script> tag. It is possible that someone is just passing in faulty HTML, which might mean that there is a script tag without a closing tag. However, on second thought I'm not sure this is needed at all; maybe we should just skip the token in all its glory. Need to test this.

The list is not complete because I was trying to reduce the keywords to the ones which are common in JavaScript. I couldn't think of any line of JavaScript which would not have at least one of these strings in it. However, as I noted in the comment on the line above the check, I'm not quite sure this isn't too dumb. I'd be happy to be proven wrong by some tests!

When it comes to performance: yes, this runs on every token after an opening script tag until html5lib tells me it's over. I didn't do any benchmarks on this. It might be a bottleneck on huge script tags, but usually Python's iteration and string in foobar checks are pretty fast. One thing which might be nice would be to move the tuple to the top of the module so it gets built once at import time.
Anyway, a regular expression might be faster.
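To make the two ideas in this reply concrete, here is a rough sketch of the keyword check with the tuple hoisted to module level, plus a precompiled regular expression built from the same keywords. The keyword list is the one quoted in the review above. The names `JS_HINTS`, `JS_HINT_RE`, `looks_like_js_any` and `looks_like_js_re` are made up for illustration and are not part of bleach. No timing claims are made here, since nothing in this thread was benchmarked.

```python
import re

# Built once at import time instead of on every token.
JS_HINTS = ('"', "'", 'var', ';', '=', '{', '}', '[', ']', '++', '--',
            '+=', '-=', '*=', '/=', '%=', 'return', 'function')

# Same keywords as a single alternation; re.escape guards the operators.
JS_HINT_RE = re.compile('|'.join(re.escape(k) for k in JS_HINTS))


def looks_like_js_any(data):
    """Substring test, lowercased so that VAR, RETURN etc. also match."""
    lowered = data.lower()
    return any(k in lowered for k in JS_HINTS)


def looks_like_js_re(data):
    """The same test expressed as one regular-expression search."""
    return JS_HINT_RE.search(data.lower()) is not None


for sample in ('VAR x = 1;', 'alert(1)', 'plain text'):
    print(repr(sample), looks_like_js_any(sample), looks_like_js_re(sample))
```

Note that `alert(1)` contains none of the listed strings, so both variants report `False` for it. That is the kind of miss the review question is asking about. The compound operators such as `+=` are also redundant for matching purposes, because `=` alone already matches them.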

jbalogh · 2012-05-14T17:58:15Z

Removing script tags is a valid thing for bleach to do, but I'm concerned about all the additional code added in the sanitize_token function. That function is already hairy (@jsocol!) and keeping track of skip_token and previous_token in every branch looks bad for future maintenance.

Is it possible to strip script tags in a more self-contained manner?

originell · 2012-05-15T06:34:32Z

I totally agree with you that it is very hairy, and I was not happy with the solution either. Back when I authored the code I needed to get this done ASAP. Actually, I can't quite remember why I needed to put it here, but I guess it had something to do with the architecture of bleach's cleaning mechanisms. After writing this code I opened a ticket (#58) which proposes some changes to this very function so we have something like a 'pipeline' to run things through.

jsocol · 2012-05-15T11:52:34Z

OK, this is probably naive, because Luis might have tried this originally (but maybe if you were under a tight deadline you didn't have time to really figure it out) but is there a way we can, if skip_script is True, just pass or continue or otherwise fail to yield the node (not the token) completely? Maybe this is something that is better served at a different step?

jsocol · 2012-05-15T11:54:21Z

And @jbalogh: try not to blame me, this is the HTML5 tokenizing algorithm, taken pretty directly from html5lib and tweaked. The hairiness belongs to Hixie and Henri.

Which is also why I'm reluctant, on #58, to move to a cleaning pipeline--because moving away from that algorithm is a big deal.

originell · 2012-05-15T13:09:16Z

I remember trying to just pass/continue. But there was something that made me do it the way I did it. However I honestly can't remember if it was my deadline or anything else. So yeah that might actually work!

And I totally agree with you on #58, hence why I never touched the subject again. If someone could reliably port that algo to another architecture it would be great.

jsocol · 2012-05-15T14:09:43Z

Maybe we should do a tree-walking step to strip scripts and their content, instead of a tokenizing step. We'll make clean() bigger, internally, but we don't have to do the tree-walking unless someone wants to drop script content.
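A tree-walking pass along these lines could look roughly like the sketch below. It is only an illustration of the idea: it uses the standard library's `xml.etree.ElementTree` on well-formed markup rather than the tree html5lib builds, and `strip_scripts` is a hypothetical helper, not a bleach function. The one subtle point is that the text following a removed element is stored as that element's tail, so it has to be moved onto the previous sibling or the parent before the element is dropped.

```python
import xml.etree.ElementTree as ET


def strip_scripts(root):
    """Remove every <script> element and its content below root, in place.

    The text that follows a removed element (its tail) is preserved.
    """
    for parent in list(root.iter()):
        prev = None  # last sibling that was kept
        for child in list(parent):
            is_script = (isinstance(child.tag, str)
                         and child.tag.lower() == "script")
            if not is_script:
                prev = child
                continue
            tail = child.tail or ""
            if prev is None:
                # No kept sibling before it: the tail joins the parent's text.
                parent.text = (parent.text or "") + tail
            else:
                prev.tail = (prev.tail or "") + tail
            parent.remove(child)
    return root


doc = ET.fromstring(
    '<p>Some text <script>alert("hi");</script> after the script '
    '<b>bold</b><script>x = 1;</script> end</p>'
)
strip_scripts(doc)
print(ET.tostring(doc, encoding="unicode"))
# -> <p>Some text  after the script <b>bold</b> end</p>
```

Because the walk happens on the finished tree, it can be skipped entirely unless the caller asks for script content to be dropped, and the tokenizing code does not need to track any extra state.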

originell · 2012-05-16T21:27:27Z

I like that 👍

jsocol · 2012-07-03T14:39:50Z

Luis, I'm going to close this pull req and look at doing a tree-walk version like in my previous comment. (Unless you want to take a crack at that?) I think we all agree that's the right way to go here.

originell · 2012-07-03T18:17:16Z

Totally agree. I'm not 100% sure that I currently have the time to take a crack at that (ha never heard that phrase before).

Garito · 2013-03-10T14:59:58Z

May I ask a question?

I don't fully understand why you are making the modification inside an allowed tag instead of making it in a disallowed one.

I'm preparing another approach to the problem that acts in, in my view, the correct place.

Could you enlighten me, please?

Garito · 2013-03-10T15:40:58Z

I don't have any experience working on GitHub, making pull requests and so on, so please be patient if I'm not doing things correctly (feel free to point out what I'm missing).

I uploaded a branch with my preliminary tests for your review here: Garito@9aeb234

Could you check the new tests to see the weaknesses of this approach?

Please be indulgent with this poor human ;)

originell · 2013-03-10T22:58:04Z

Never mind :) There is always a first time and you'll get used/addicted to the github workflow pretty quickly, I'm sure! 👍

Going back to your first question, I don't completely understand what you mean. That might not be because you explained it wrong, but rather because I don't quite get the code anymore, since that modification was made more than half a year ago haha

Could you maybe give some code lines to your question? Hopefully I can then answer you :)

Garito · 2013-03-11T00:26:47Z

Did you check my branch?

jsocol · 2013-03-11T18:41:57Z

@Garito — this will sound pedantic but can we keep conversation of this issue to issue #67? I'd just like to keep discussion coherent for myself in the future. Thanks!

Add option to strip <script> tags and their content.

f8d95b1

jsocol reviewed Feb 28, 2012

originell added 2 commits March 4, 2012 11:50

trying to fix a bug in script content stripping

a2780e4

Fixes <script>-stripping <script> inside <script>

a9b79fe

It is now possible to detect <script> tags inside <script> tags... This fixes a bug with content being deleted after a <script> tag.

add a strip_script "reallife" test.

9565ca8

jsocol reviewed Mar 6, 2012

originell added 2 commits March 6, 2012 16:15

transform script tag content to lowercase

236c486

add more strip_script_tag tests

e40ba10

jbalogh reviewed May 13, 2012

jsocol closed this Jul 3, 2012

jsocol mentioned this pull request Jul 3, 2012

Remove script content with tag #67

Closed
