Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove script content with tag #67

Closed
jsocol opened this issue Jul 3, 2012 · 10 comments
Closed

Remove script content with tag #67

jsocol opened this issue Jul 3, 2012 · 10 comments
Labels

Comments

@jsocol
Copy link
Contributor

jsocol commented Jul 3, 2012

In clean(), there's no way to remove the contents of a <script> tag. In discussion in #57 we decided the best approach was an additional kwarg and optional treewalk to remove a <script> and any of its children, including text nodes (e.g. the script content).

@originell
Copy link
Contributor

👍 :-)

@Garito
Copy link

Garito commented Mar 12, 2013

Ok, for me to continue here...

@Garito
Copy link

Garito commented Mar 12, 2013

So... should I continue with my aproach or not?

@brutasse
Copy link

brutasse commented Jun 8, 2013

This probably shouldn't be limited to <script> tags. This use case is valid for <style> tags or anything that could contain non-human-readable content.

I'd suggest clean() gets a new strip_content kwarg with a list of tags for which to strip the content, although the API would be a bit weird with strip being a boolean. Maybe there are cleaner options but it'd be nice to have this not limited to <script> tags.

@kimus
Copy link

kimus commented Jun 28, 2013

👍

@EmilStenstrom
Copy link

Another use-case for this feature is to clean incoming HTML e-mail. These are ofter comprised of full HTML documents, with html, head, body, script and style tags. Running bleach on these leaves the full content of the script and style tags as the page content.

Only workaround I've found so far is to allow script and style tags in bleach, and clear them in a later step with lxml.html.clean.Cleaner.

@jsocol
Copy link
Contributor Author

jsocol commented Aug 21, 2015

Another use-case for this feature is to clean incoming HTML e-mail. These are ofter comprised of full HTML documents

Bleach operates on document fragments, not full documents. Full document support is explicitly listed in the "Non Goals" section of the docs

@EmilStenstrom
Copy link

That's a shame. This is the only missing feature for that use-case.

@willkg
Copy link
Member

willkg commented Nov 3, 2016

I've thought about this for a while and I think I'm going to pass on it for .clean().

There are two big reasons for doing this:

  1. .clean() is about removing malicious content--not about transforming HTML documents for other mediums or prettifying content. .clean() is a security-focused function and as such, keeping its functionality minimal reduces the likelihood of bugs that have security-related impact. That's really important.
  2. Seems like the use cases for stripping content in tags are related to transforming HTML documents and prettifying content. I think that should get handled by a different function or possibly not by bleach at all.

Towards that, I've clarified the goals/non-goals language and updated the docs to make it clearer what .clean() is supposed to be doing and what it's not going to be good for.

Given that, I'm going to pass on this and close it out.

I'm game for talking about building a .prettify() function which is about prettifying HTML which would cover the use these changes cover. That should be a new issue, though.

@EmilStenstrom
Copy link

I'm opened a new issue for adding a prettify() method in #234.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants