Remove script content with tag #67

jsocol · 2012-07-03T14:44:59Z

In clean(), there's no way to remove the contents of a <script> tag. In discussion in #57 we decided the best approach was an additional kwarg and optional treewalk to remove a <script> and any of its children, including text nodes (e.g. the script content).

originell · 2012-07-03T18:17:41Z

👍 :-)

Garito · 2013-03-12T00:39:50Z

Ok, for me to continue here...

Garito · 2013-03-12T21:13:55Z

So... should I continue with my approach or not?

brutasse · 2013-06-08T11:59:20Z

This probably shouldn't be limited to <script> tags. This use case is valid for <style> tags or anything that could contain non-human-readable content.

I'd suggest clean() gets a new strip_content kwarg with a list of tags for which to strip the content, although the API would be a bit weird with strip being a boolean. Maybe there are cleaner options but it'd be nice to have this not limited to <script> tags.
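To make the proposal concrete, here is a minimal sketch of what "strip a tag together with its content" could look like as a separate pre-processing pass. It is not bleach's API and not the implementation discussed in #57: the names `ContentStripper` and `strip_content` are hypothetical, and it uses the standard library's `html.parser` event stream instead of a treewalk. It only drops the listed tags and everything inside them; it does not sanitize anything.

```python
from html.parser import HTMLParser


class ContentStripper(HTMLParser):
    """Re-emit HTML, dropping selected tags together with everything inside them.

    Hypothetical helper for illustration only; not part of bleach.
    Comments and doctypes are also dropped, because this sketch does not
    override those handlers.
    """

    def __init__(self, tags):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.tags = {t.lower() for t in tags}
        self.skip_depth = 0  # > 0 while inside a tag that is being dropped
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: it has no content to skip.
        if tag not in self.tags and self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.tags:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&#{name};")


def strip_content(html, tags=("script", "style")):
    """Return html with the given tags and all of their children removed."""
    parser = ContentStripper(tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = '<p>Hi</p><script>alert("x")</script><style>p{color:red}</style><p>a &amp; b</p>'
print(strip_content(sample))
```

Because the tag list is a parameter, the same pass covers <script>, <style>, or any other tag. Its output would still need to go through clean() afterwards, since this step makes no attempt to remove malicious markup.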

kimus · 2013-06-28T17:16:30Z

👍

EmilStenstrom · 2015-08-21T12:28:54Z

Another use-case for this feature is to clean incoming HTML e-mail. These are often full HTML documents, with html, head, body, script and style tags. Running bleach on these leaves the full content of the script and style tags as the page content.

The only workaround I've found so far is to allow script and style tags in bleach, and clear them in a later step with lxml.html.clean.Cleaner.

jsocol · 2015-08-21T12:36:53Z

> Another use-case for this feature is to clean incoming HTML e-mail. These are often full HTML documents

Bleach operates on document fragments, not full documents. Full document support is explicitly listed in the "Non Goals" section of the docs.

EmilStenstrom · 2015-08-21T12:40:32Z

That's a shame. This is the only missing feature for that use-case.

willkg · 2016-11-03T13:39:53Z

I've thought about this for a while and I think I'm going to pass on it for .clean().

There are two big reasons for doing this:

1. .clean() is about removing malicious content--not about transforming HTML documents for other mediums or prettifying content. .clean() is a security-focused function and as such, keeping its functionality minimal reduces the likelihood of bugs that have security-related impact. That's really important.
2. Seems like the use cases for stripping content in tags are related to transforming HTML documents and prettifying content. I think that should get handled by a different function or possibly not by bleach at all.

Towards that, I've clarified the goals/non-goals language and updated the docs to make it clearer what .clean() is supposed to be doing and what it's not going to be good for.

Given that, I'm going to pass on this and close it out.

I'm game for talking about building a .prettify() function for prettifying HTML, which would cover the use cases these changes were meant to address. That should be a new issue, though.

EmilStenstrom · 2016-11-03T14:20:52Z

I've opened a new issue for adding a prettify() method in #234.

jsocol mentioned this issue Mar 11, 2013

Remove <script> tags with their content. #57

Closed

twm mentioned this issue Oct 27, 2013

<style> tag content displayed in feeds radiac/django-yarr#21

Open

jaap3 mentioned this issue Oct 24, 2016

Add option to remove content inside of stripped tags #185

Closed

willkg modified the milestones: v1.6, v2.0 Oct 31, 2016

willkg closed this as completed Nov 3, 2016

willkg removed this from the v2.0 milestone Nov 3, 2016

EmilStenstrom mentioned this issue Nov 3, 2016

Support cleaning HTML for presentational purposes #234

Closed

xmo-odoo mentioned this issue Nov 14, 2022

[IMP] base: replace LXML HTML cleaner by bleach odoo/odoo#90965

Open
