Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SVG cleaning should not use HTML cleaner #1894

Closed
akx opened this issue Oct 28, 2022 · 10 comments · Fixed by #2018
Closed

SVG cleaning should not use HTML cleaner #1894

akx opened this issue Oct 28, 2022 · 10 comments · Fixed by #2018

Comments

@akx
Copy link
Contributor

akx commented Oct 28, 2022

The templates for actual image/svg+xml cells (e.g.

{{ output.data['image/svg+xml'].encode("utf-8") | clean_html }}
) should use an XML-savvy cleaner since a image/svg+xml is not HTML to begin with.

Originally posted by @akx in #1890 (comment)


(I just wanted to make an issue out of this so it's searchable...)

@hadim
Copy link

hadim commented Oct 28, 2022

I confirm that | clean_html render corrupted SVG (black background, deformed shapes, etc). When removing it the SVG is rendered correctly.

It happens when generating HTML MKDocs page with mkdocs-jupyter on SVG rendered with data and rdkit.

@desilinguist
Copy link

I am also seeing the same issue in my nbconvert-ed HTML pages. See #1849 (comment)

@therzka
Copy link

therzka commented Nov 15, 2022

👋 Myself and a colleague have been looking into this as well. It looks like despite this PR although the style attribute is allowed on valid SVG elements, SVGs still don't render because the value of the style attribute is stripped.

image

Our current workaround is to revert back to the old sanitization behavior and define our own lxml.Cleaner() to override default_filters["clean_html"]:

cleaner = Cleaner(
    style=True,
    scripts=True,
    inline_style=False,
    safe_attrs_only=False,
    remove_unknown_tags=False
)
default_filters["clean_html"] = cleaner.clean_html

@stefmolin
Copy link

stefmolin commented Nov 20, 2022

I'll also share my hack while we wait for a fix. Change {{ output.data['image/svg+xml'].encode("utf-8") | clean_html }} to just {{ output.data['image/svg+xml'] }} in share/jupyter/nbconvert/templates/lab/base.html.j2 (this will be in your virtual environment).

I'm curious why there is an option to turn off HTML cleaning, but it doesn't apply to the SVG output (like line 173).

@desilinguist
Copy link

desilinguist commented Dec 1, 2022

I use nbconvert programmatically and so modifying the base.html.j2 template, while effective, wasn't really an ideal solution for me. Here's the fix that worked for me. Basically, the goal is to turn the offending clean_html into a noop. To achieve this, I did:

from nbconvert.exporters import HTMLExporter
from nbconvert.exporters.templateexporter import default_filters

def convert_to_html(notebook_file, html_file):

    def custom_clean_html(element):
        return element.decode() if isinstance(element, bytes) else str(element)

    default_filters["clean_html"] = custom_clean_html

    exportHtml = HTMLExporter()
    output, _ = exportHtml.from_filename(notebook_file)
    open(html_file, mode="w", encoding="utf-8").write(output)

This works for me with nbconvert=7.2.5 and my SVG figures are now all back! 🎉

@hadim
Copy link

hadim commented Dec 1, 2022

Thanks for sharing this fix.

For people like me using it trough tan external library (to generate a documentation in my case with mkdocs-jupyter), we will need that to be fixed in nbconvert directly.

@carlosefr
Copy link

This is also mentioned in #1863.

@josephmcasey
Copy link

Related to mkdocs-jupyter a consumer of nbconvert:
Thanks for writing that workaround @desilinguist . @hadim , I opened that leverages this solution for this mkdocs plugin, and it looks fully functional with the existing test suite. If you think you have a particularly complex SVG to render then it would be great to add your example to the repository.

Related to nbconvert:
As for this particular break, I think the obvious solution that would be suggested is that this library abide by the W3 specifications for SVG, so I will try to pose a question I think might be slightly more original and reduce the overall burden of overhead of library maintenance.

If this solution is not too lazy as to denigrate the character of the changeset author, would the maintainers find it an acceptable solution to introduce a github action that renders the html produced on a headless browser like chrome as a form of validation testing? I ask because it seems like that would quickly outsource to Google and consumers of the library most of the complexity that would come with writing unit tests that cover the entire spectrum of W3 Standards. When a break is found by a consumer, grab their notebook and add it to the array of validation tests.

@yksantaro
Copy link

I have the same problem on <xlink:href> SVG attribute. (Ex. axes labels in matplotlib objects)
I think that inline SVG image data can be bypassed rather than processed by clean_html() because the most images will probably be displayed in JupyterLab window and will be exported and displayed correctly in HTML without any cleaning.

So I added bypass_svg() function for strings.py (in lib/python3.11/site-packages/nbconvert/filters) and related files: templateexporter.py (in lib/python3.11/site-packages/nbconvert/exporters) and base.html.j2 (in share/jupyter/nbconvert/templates/lab) as attached patches below.

templateexporter.py.patch
base.html.j2.patch
strings.py.patch

This work around is so fine for me.

@jstorrs
Copy link
Contributor

jstorrs commented Jul 6, 2023

I don't know if this is correct, but while researching methods to sanitize SVG this site was mentioned on security.stackexchange.com which suggests that SVG loaded via img tags won't execute scripts. Since the SVG is already inside an img tag [Edit: incorrect, misread the template] Could we just put it in an img tag and base64 pack the SVG content?

Edit: this fixed nbconvert (7.6.0) html and webpdf output for me when using matplotlib with SVG output:

$ cd share/jupyter/nbconvert/templates/lab/
$ diff base.html.j2 base.html.j2.orig
167c167
< <img src="data:image/svg+xml;base64,{{ output.data['image/svg+xml'] | text_base64 | escape_html }}">
---
> {{ output.data['image/svg+xml'].encode("utf-8") | clean_html }}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants