-
Notifications
You must be signed in to change notification settings - Fork 251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove <script> tags with their content. #57
Conversation
Just found a bug. Gonna fix and extend the commit range. |
@@ -1,8 +1,6 @@ | |||
import itertools |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah thanks, I missed those.
Ping me if I don't see the new commits. :) |
I will try to get it ironed out by thursday. Problem is that it is a bit hard to detect: <p>Some random text <script>function with_a_(script) { alert("tag <script> inside a js string"); }</script> without cutting away the after-script-tag text</p>
|
It is now possible to detect <script> tags inside <script> tags... This fixes a bug with content being deleted after a <script> tag.
Soo. This should fix that nasty bug. I have the feeling that I've missed something in the tests. However this could just be me being paranoid haha :) To be honest it seems to work better than I thought g |
Found something else I missed ;-) See the commit. Furthermore I added another more real test |
basestring)): | ||
self.skip_token = False | ||
# This might be too dumb. | ||
elif any([keyw in self.previous_token['data'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a .lower()
so we catch VAR
etc? What's the worst case if this misses? I assume that any errant <script>
tags would still get escaped if they're not whitelisted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sure I will add that... Mhh before I answer the second question I'm going to write some tests ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some more tests to ensure correct handling of failure cases. See tests_security.py
L167+
One nit and one question but otherwise it's looking good and tests are passing. My preference is to land this on master and not backport to 1.1.x, but since it's a backwards compatible API change I'm open to doing it as a 1.1.2. |
Rebased everything here (surprisingly, the rebase just worked!) I want another set of eyes on this so I'm going to ask some security folks to take a gander before I merge to master, and I'll probably do some squashing to get it cleaned up. It'll make 1.2. |
Ha very nice! 👍 Yeah I'd be glad to have another set of eyepairs looking over this too. Using this in a production right now and it seems to perform just fine. However.. you never know! Maybe I should've added the bitwise operators to the keywords. |
for keyw in ('"', "'", 'var', ';', '=', '{', | ||
'}', '[', ']', '++', '--', '+=', | ||
'-=', '*=', '/=', '%=', 'return', | ||
'function')]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this do? Why do you have an incomplete list of javascript productions? I'm not familiar with the data structures html5lib is producing, but I'm suspicious of this part. What happens if my javascript doesn't start with one of these?
Also, is this loop getting run on every token inside a script tag? How does that affect performance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HTML5Lib returns dicts with tags split into their contents. This means f.e. that the following html:
<p>This is a text.</p>
might be returned as
{'name': 'P', 'data': 'This is a text', 'type': …}
So what happens in the whole block (I'm starting with l80) is that I'm trying to be "smart" about the tag. Is there another <script>
tag inside the <script>
tag? If not then I'm moving on to detect what is happening inside the <script>
tag. It might be possible that someone is just passing in faulty html, which might mean that there is a script tag without a closing end. However on second thought I'm not sure if this is needed in any way and we should just skip the token in all it's glory. Need to test this.
The list is not complete because I was trying to reduce the keywords to the ones which are common javascript. I couldn't think of any line of javascript which would not have at least one of these strings in it. However, as I commented it in the line above the check, I'm not quite sure if this isn't too dumb. I'd be happy to see me proven wrong by some tests!
When it comes down to performance.. Yes this is run on every token after an opening script tag until htm5llib tells me it's over. I didn't do any benchmarks on this. It might be a bottleneck on huge script tags, but usually python's iteration and string in foobar
are pretty fast. One thing which might be pretty would be to move this on the top so the tuple get's loaded on import.
Anyways, a regular expression might be faster.
Removing script tags is a valid thing for bleach to do, but I'm concerned about all the additional code added in the Is it possible strip script tags in a more self-contained manner? |
I totally agree with you that it is very hairy and I was not happy with the solution either. Back when I authored the code I needed to get this done asap. Actually, I can't quite remember why I really need to put it here, but I guess it had something to do with the architecture of bleach's cleaning mechanisms. After writing this code I opened a ticket (#58) which proposes some changes to this very function so we have something like a 'pipeline' to run things through. |
OK, this is probably naive, because Luis might have tried this originally (but maybe if you were under a tight deadline you didn't have time to really figure it out) but is there a way we can, if |
I remember trying to just pass/continue. But there was something that made me do it the way I did it. However I honestly can't remember if it was my deadline or anything else. So yeah that might actually work! And I totally agree with you on #58, hence why I never touched the subject again. If someone could reliably port that algo to another architecture it would be great. |
Maybe we should do a tree-walking step to strip scripts and their content, instead of a tokenizing step. We'll make |
I like that 👍 |
Luis, I'm going to close this pull req and look at doing a tree-walk version like in my previous comment. (Unless you want to take a crack at that?) I think we all agree that's the right way to go here. |
Totally agree. I'm not 100% sure that I currently have the time to take a crack at that (ha never heard that phrase before). |
May I ask a question? I don't fully understand why are you make the modification inside an allowed tag instead of making it in a not allowed one I'm preparing another aproximation to the problem acting in, in my view, the correct place Could you enlight me, please? |
I don't have any experience working on github and making pull request and so on, so, please be patience if I'm not doing things correctly (fell free to point me what I'm missing) I upload a branch with my preliminar tests for your review here: Garito@9aeb234 May you check the new tests to see the weaknesses of this approximation? Please be indulgent with this poor human ;) |
Never mind :) There is always a first time and you'll get used/addicted to the github workflow pretty quickly, I'm sure! 👍 Going back to your first question, I don't completely understand what you mean. Which might not be because you explained it wrong, but rather that I don't quite get the code anymore since that modification is already more than half a year ago haha Could you maybe give some code lines to your question? Hopefully I can then answer you :) |
Did you check my branch? |
I needed to remove html tags and found out that the contents of the
script
tags did not get stripped too. Therefore I implemented an option to do just that 😄