Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove <script> tags with their content. #57

Closed
wants to merge 6 commits into from

Conversation

originell
Copy link
Contributor

I needed to remove html tags and found out that the contents of the script tags did not get stripped too. Therefore I implemented an option to do just that 😄

@originell
Copy link
Contributor Author

Just found a bug. Gonna fix and extend the commit range.

@@ -1,8 +1,6 @@
import itertools
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah thanks, I missed those.

@jsocol
Copy link
Contributor

jsocol commented Feb 28, 2012

Ping me if I don't see the new commits. :)

@originell
Copy link
Contributor Author

I will try to get it ironed out by thursday. Problem is that it is a bit hard to detect:

<p>Some random text <script>function with_a_(script) { alert("tag <script> inside a js string"); }</script> without cutting away the after-script-tag text</p>

html5lib sees the <script> inside the JavaScript String as a new HTML element. I have a solution but it's not working perfectly yet ;-)

It is now possible to detect <script> tags inside <script> tags...
This fixes a bug with content being deleted after a <script> tag.
@originell
Copy link
Contributor Author

Soo. This should fix that nasty bug. I have the feeling that I've missed something in the tests. However this could just be me being paranoid haha :)

To be honest it seems to work better than I thought g

@originell
Copy link
Contributor Author

Found something else I missed ;-) See the commit. Furthermore I added another more real test

basestring)):
self.skip_token = False
# This might be too dumb.
elif any([keyw in self.previous_token['data']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a .lower() so we catch VAR etc? What's the worst case if this misses? I assume that any errant <script> tags would still get escaped if they're not whitelisted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sure I will add that... Mhh before I answer the second question I'm going to write some tests ;-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some more tests to ensure correct handling of failure cases. See tests_security.py L167+

@jsocol
Copy link
Contributor

jsocol commented Mar 6, 2012

One nit and one question but otherwise it's looking good and tests are passing.

My preference is to land this on master and not backport to 1.1.x, but since it's a backwards compatible API change I'm open to doing it as a 1.1.2.

@jsocol
Copy link
Contributor

jsocol commented May 12, 2012

Rebased everything here (surprisingly, the rebase just worked!)

I want another set of eyes on this so I'm going to ask some security folks to take a gander before I merge to master, and I'll probably do some squashing to get it cleaned up. It'll make 1.2.

@originell
Copy link
Contributor Author

Ha very nice! 👍

Yeah I'd be glad to have another set of eyepairs looking over this too. Using this in a production right now and it seems to perform just fine. However.. you never know! Maybe I should've added the bitwise operators to the keywords.

for keyw in ('"', "'", 'var', ';', '=', '{',
'}', '[', ']', '++', '--', '+=',
'-=', '*=', '/=', '%=', 'return',
'function')]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this do? Why do you have an incomplete list of javascript productions? I'm not familiar with the data structures html5lib is producing, but I'm suspicious of this part. What happens if my javascript doesn't start with one of these?

Also, is this loop getting run on every token inside a script tag? How does that affect performance?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HTML5Lib returns dicts with tags split into their contents. This means f.e. that the following html:

<p>This is a text.</p>

might be returned as

{'name': 'P', 'data': 'This is a text', 'type': …}

So what happens in the whole block (I'm starting with l80) is that I'm trying to be "smart" about the tag. Is there another <script> tag inside the <script> tag? If not then I'm moving on to detect what is happening inside the <script> tag. It might be possible that someone is just passing in faulty html, which might mean that there is a script tag without a closing end. However on second thought I'm not sure if this is needed in any way and we should just skip the token in all it's glory. Need to test this.

The list is not complete because I was trying to reduce the keywords to the ones which are common javascript. I couldn't think of any line of javascript which would not have at least one of these strings in it. However, as I commented it in the line above the check, I'm not quite sure if this isn't too dumb. I'd be happy to see me proven wrong by some tests!

When it comes down to performance.. Yes this is run on every token after an opening script tag until htm5llib tells me it's over. I didn't do any benchmarks on this. It might be a bottleneck on huge script tags, but usually python's iteration and string in foobar are pretty fast. One thing which might be pretty would be to move this on the top so the tuple get's loaded on import.
Anyways, a regular expression might be faster.

@jbalogh
Copy link
Contributor

jbalogh commented May 14, 2012

Removing script tags is a valid thing for bleach to do, but I'm concerned about all the additional code added in the sanitize_token function. That function is already hairy (@jsocol!) and keeping track of skip_token and previous_token in every branch looks bad for future maintenance.

Is it possible strip script tags in a more self-contained manner?

@originell
Copy link
Contributor Author

I totally agree with you that it is very hairy and I was not happy with the solution either. Back when I authored the code I needed to get this done asap. Actually, I can't quite remember why I really need to put it here, but I guess it had something to do with the architecture of bleach's cleaning mechanisms. After writing this code I opened a ticket (#58) which proposes some changes to this very function so we have something like a 'pipeline' to run things through.

@jsocol
Copy link
Contributor

jsocol commented May 15, 2012

OK, this is probably naive, because Luis might have tried this originally (but maybe if you were under a tight deadline you didn't have time to really figure it out) but is there a way we can, if skip_script is True, just pass or continue or otherwise fail to yield the node (not the token) completely? Maybe this is something that is better served at a different step?

@jsocol
Copy link
Contributor

jsocol commented May 15, 2012

And @jbalogh: try not to blame me, this is the HTML5 tokenizing algorithm, taken pretty directly from html5lib and tweaked. The hairiness belongs to Hixie and Henri.

Which is also why I'm reluctant, on #58, to move to a cleaning pipeline--because moving away from that algorithm is a big deal.

@originell
Copy link
Contributor Author

I remember trying to just pass/continue. But there was something that made me do it the way I did it. However I honestly can't remember if it was my deadline or anything else. So yeah that might actually work!

And I totally agree with you on #58, hence why I never touched the subject again. If someone could reliably port that algo to another architecture it would be great.

@jsocol
Copy link
Contributor

jsocol commented May 15, 2012

Maybe we should do a tree-walking step to strip scripts and their content, instead of a tokenizing step. We'll make clean() bigger, internally, but we don't have to do the tree-walking unless someone wants to drop script content.

@originell
Copy link
Contributor Author

I like that 👍

@jsocol
Copy link
Contributor

jsocol commented Jul 3, 2012

Luis, I'm going to close this pull req and look at doing a tree-walk version like in my previous comment. (Unless you want to take a crack at that?) I think we all agree that's the right way to go here.

@jsocol jsocol closed this Jul 3, 2012
@originell
Copy link
Contributor Author

Totally agree. I'm not 100% sure that I currently have the time to take a crack at that (ha never heard that phrase before).

@Garito
Copy link

Garito commented Mar 10, 2013

May I ask a question?

I don't fully understand why are you make the modification inside an allowed tag instead of making it in a not allowed one

I'm preparing another aproximation to the problem acting in, in my view, the correct place

Could you enlight me, please?

@Garito
Copy link

Garito commented Mar 10, 2013

I don't have any experience working on github and making pull request and so on, so, please be patience if I'm not doing things correctly (fell free to point me what I'm missing)

I upload a branch with my preliminar tests for your review here: Garito@9aeb234

May you check the new tests to see the weaknesses of this approximation?

Please be indulgent with this poor human ;)

@originell
Copy link
Contributor Author

Never mind :) There is always a first time and you'll get used/addicted to the github workflow pretty quickly, I'm sure! 👍

Going back to your first question, I don't completely understand what you mean. Which might not be because you explained it wrong, but rather that I don't quite get the code anymore since that modification is already more than half a year ago haha

Could you maybe give some code lines to your question? Hopefully I can then answer you :)

@Garito
Copy link

Garito commented Mar 11, 2013

Did you check my branch?

@jsocol
Copy link
Contributor

jsocol commented Mar 11, 2013

@Garito — this will sound pedantic but can we keep conversation of this issue to issue #67? I'd just like to keep discussion coherent for myself in the future. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants