Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove HTML escaping JSON-encoded widget state #1934

Merged
merged 5 commits into from Jul 17, 2023

Conversation

manzt
Copy link
Contributor

@manzt manzt commented Jan 10, 2023

Fixes #1900

The widget data in the templates are either the special JSON mime-type or within a <script> tag (JS). Transforming the JSON-encoded widget data with HTML escapes breaks application widget code. The widget author should be responsible for escaping HTML strings if necessary before encoding the data as JSON in the template - not a post-encoding mutation.

@manzt manzt changed the title fix: remove HTML escaping from widget state Remove HTML escaping from widget state Jan 10, 2023
@manzt manzt changed the title Remove HTML escaping from widget state Remove HTML escaping JSON-encoded widget state Jan 10, 2023
@manzt
Copy link
Contributor Author

manzt commented Jan 10, 2023

cc: @keller-mark

@maartenbreddels
Copy link
Collaborator

Overview of the PR's / commits that touched this line:

With this PR we are reverting #1665, which is not what we want I guess (since it fixed a lot of issues).

I don't know why the 2nd and 3rd commit come from (ping @SylvainCorlay and @martinRenou ) but it seems that that has a different purpose (can you find the PR for this?).

@maartenbreddels
Copy link
Collaborator

@maartenbreddels
Copy link
Collaborator

I don't think GHSL-2021-1025 applied anymore after spatialaudio/nbsphinx#611 (which is the improved version of #1665 cc @mgeier)

@manzt
Copy link
Contributor Author

manzt commented Jan 11, 2023

Thanks for digging into this and having a look. I understand why avoiding a closing a script tag early is desirable, but the current html escaping modifies much more than this specific case. Would a method which is more precise to cover this case be accepted?

As a side note, I think I understand issue in GHSL-2021-1025 but Jupyter widgets generally expose a mechanism to run any JavaScript on the page (in the global scope, with or without the identified </script> vulnerability).

@martinRenou
Copy link
Member

martinRenou commented Jan 11, 2023

I don't think GHSL-2021-1025 applied anymore after spatialaudio/nbsphinx#611

Unfortunately the security issue applies more broadly than the nbsphinx case, so the nbsphinx PR does not fix the nbconvert issue.

but Jupyter widgets generally expose a mechanism to run any JavaScript on the page

That's true, but this comment also applies to the following code cell, and we would probably like to escape it:

from IPython.display import Javascript
display(Javascript('alert("this code runs")'))

I think those were the main reasons for introducing the --sanitize_html=False option and you should use it if you need widgets working.

In the long term we could probably think of a way to trust some well-known widgets libraries like ipywidgets/ipyleaflet.

@maartenbreddels
Copy link
Collaborator

Why is spatialaudio/nbsphinx#611 not sufficient for the widget output? So only addressing GHSL-2021-1025 and the widget output, not things like the title etc.

@martinRenou
Copy link
Member

Why is spatialaudio/nbsphinx#611 not sufficient for the widget output?

Because nbsphinx is not nbconvert, nbconvert is used in other tools, unless I misunderstand your point?

@maartenbreddels
Copy link
Collaborator

Yes, I mean #1665, but with the improvement of making it case insensitive (like what was done in spatialaudio/nbsphinx#611)

@maartenbreddels
Copy link
Collaborator

I think those were the main reasons for introducing the --sanitize_html=False option and you should use it if you need widgets working.

Does this mean that with --sanitize_html=False, we still have the issues that #1665 tried to address? If so, we should still make sure we escape closing script tags right?

@martinRenou
Copy link
Member

martinRenou commented Jan 11, 2023

I think your fix still applies after c90b746 because it should be escaped by escape_html (which was later renamed escape_html_keep_quotes).

I may have read that thread and the code too fast, setting --sanitize_html=False would not fix the initial PR purpose to not escape the HTML in the widget JSON.

Maybe reintroducing #1665 instead of using escape_html_keep_quotes would make everyone happy here? Though I fear that it would not handle all the security breaches.

@martinRenou
Copy link
Member

Though I fear that it would not handle all the security breaches.

But when --sanitize_html=False we should not sanitize the widget output, so let's change this PR and replace the escape_html_keep_quotes call with the escape_html_script from https://github.com/jupyter/nbconvert/pull/1665/files so that we don't escape anything except the closing script tag.

@maartenbreddels
Copy link
Collaborator

Just for clarity, #1665 does not try to solve a security issue, but fixes a bug when a script end tag is included in a string for the JSON. So what #1665 does (or better, the way spatialaudio/nbsphinx#611 does it) should always happen indeed.

@maartenbreddels
Copy link
Collaborator

Just for my understanding, if --sanitize_html=True, why would we need to sanitize the widget output, what are possible security issues? GHSL-2021-1025 only mentions the closing script tag, which would already be solved by #1665/spatialaudio/nbsphinx#611 . Maybe I'm slow this morning :)

@martinRenou
Copy link
Member

So what #1665 does (or better, the way spatialaudio/nbsphinx#611 does it) should always happen indeed.

Agreed 👍🏽

Just for my understanding, if --sanitize_html=True, why would we need to sanitize the widget output, what are possible security issues?

You could inject any HTML in the widget JSON repr (the same way you can close the script tag). There are other ways than script tags for injecting malicious code in the page (through image sources for example).

I'm the one being slow this morning sorry. That code that the PR touches is behind a {%- if not resources.should_sanitize_html %} check so it should not do any sanitizing (apart from your script tag closing bug fix?).

@martinRenou
Copy link
Member

The code should probably be the following:

{%- block footer %}
{% set mimetype = 'application/vnd.jupyter.widget-state+json'%}
{%- if not resources.should_sanitize_html %}
  {% if mimetype in nb.metadata.get("widgets",{})%}
    <script type="{{ mimetype }}">
    {{ nb.metadata.widgets[mimetype] | json_dumps | escape_html_script }}
    </script>
  {% endif %}
{%- elif %}
  {% if mimetype in nb.metadata.get("widgets",{})%}
    <script type="{{ mimetype }}">
    {{ nb.metadata.widgets[mimetype] | json_dumps | escape_html_keep_quotes }}
    </script>
  {% endif %}
{% endif %}
{{ super() }}
{%- endblock footer-%}

With escape_html_script from #1665 that only removes the closing scripts.

@maartenbreddels
Copy link
Collaborator

With escape_html_script from #1665 that only removes the closing scripts.

Better, the regex version from here: spatialaudio/nbsphinx#611

You could inject any HTML in the widget JSON repr

But we don't insert whatever is in nb.metadata.widgets[mimetype] verbatim, but we do a json_dumps. This means that everything is numbers, strings etc. Only the strings can include a closing script tag, and that's already solved by escape_html_script.

For comparison:

{% set nb_title = nb.metadata.get('title', resources['metadata']['name']) | clean_html %}
<title>{{nb_title}}</title>

Here, we just include the value of the nb_title variable in the HTML, without any quotes/json_dumps, so we indeed need to clean using clean_html.

But I think that after a json_dumps | escape_html_script there shouldn't be a security issue at all, and we can skip clean_html.

@martinRenou
Copy link
Member

martinRenou commented Jan 11, 2023

I'm not an expert in security, though escape_html_script seems to do less than clean_html does? It does not handle some cases like

<[%00]/script><script>alert('malicious')

@maartenbreddels
Copy link
Collaborator

Good question. So, can we write a water-tight escape_html_script so we don't need clean_html?

@martinRenou
Copy link
Member

I also don't like clean_html as it would remove any < character which could be in the JSON repr (e.g. a widget with a label containing this character would not render properly). But that was the safest to use for me who does not have background in security.

@maartenbreddels
Copy link
Collaborator

[%00]

What is this called? I have never seen this, is this some kind of encoding? What should it do?

@maartenbreddels
Copy link
Collaborator

I wonder if replacing / by \/ is the safest option for escape_html_script. It basically never allows the browser to see a close tag ...</... since there is always something between the < and the / (i.e. the browser will see ...<\/...`

@martinRenou
Copy link
Member

martinRenou commented Jan 11, 2023

What is this called? I have never seen this, is this some kind of encoding? What should it do?

This is a NULL byte, from what I've seen online this notation is only valid (valid, as in it would read </script>) on IE (I don't know which version).

@manzt
Copy link
Contributor Author

manzt commented Jan 11, 2023

Thanks for the active discussion. I'm just catching up...

That's true, but this comment also applies to the following code cell, and we would probably like to escape it.

I meant more generally that jupyter widgets provide a mechanism to run third-party JS. The security issue here is concerned with preventing widget authors from injecting malicious scripts into the widget data, but every widget has third-party JS which is loaded from a CDN by @jupyter-widgets/html-manager and executed in the global scope.

https://github.com/jupyter-widgets/ipywidgets/blob/9fd4c3052ca90591d635ff802c2487196025a721/packages/html-manager/src/embed.ts#L10-L14

For example, if I use a ipyobservable widget in my notebook, https://cdn.jsdelivr.net/npm/ipyobservable@0.1.1/dist/index.js is loaded and executed by the widgets manger.

Widget authors control both the widget data (which may be exploited) and widget JS (which needs no exploit to run arbitrary JS). If the widget JS is treated as a trusted source, by extension I don't see why the widget data must be sanitized.

With escape_html_script from #1665 that only removes the closing scripts.

I would be in support of this solution, although with the point above don't see why this shouldn't just be the default.

This is a NULL byte, from what I've seen online this notation is only valid (valid, as in it would read </script>) on IE (I don't know which version).

I'm not sure how prevalent this edge case is (no pun intended) since IE support officially ended as of June 2022.

@maartenbreddels
Copy link
Collaborator

replacing / by \/

I think this is a water-tight solution. I can't think how this can be circumvented (famous last words?)

@ankostis
Copy link

ankostis commented Apr 2, 2023

Please apply the same fixes also to share/templates/lab/base.html.j2 file.

@flekschas
Copy link

Are there any plans to get this merged? I'd love to use v7 and this PR seems to fix the issue. Happy to help getting it merged if there's something I can do.

@blink1073 blink1073 enabled auto-merge (squash) July 17, 2023 10:21
@blink1073 blink1073 merged commit f2fc3e1 into jupyter:main Jul 17, 2023
22 of 23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Widget with "<" or ">" characters in Unicode state embeds "&lt" and "&gt" in emitted HTML
7 participants