do not pipe non-json data through json.dumps #910

sgsaenger · 2018-11-06T12:27:05Z

fixes #904 and possible similar errors by treating json data explicitely

mgeier · 2018-11-13T13:55:28Z

I'm not sure if this makes sense here, but instead of mime_type == 'application/json' you might want to consider using this regular expression:

^application/(.*\+)?json$

https://github.com/jupyter/nbformat/blob/5c18151bab30c4b6c81da279812bf1434ec54271/nbformat/v4/nbformat.v4.schema.json#L407

sgsaenger · 2018-11-18T10:33:02Z

I don't even know how i would end up with plain json-data in my notebooks, therefore I'd rather not add any complexity I can't judge myself.

mgeier · 2018-11-20T11:33:16Z

AFAICT, Jupyter widgets use JSON to store their data in the notebook file. They use MIME types like application/vnd.jupyter.widget-view+json, which are matched by the regular expression mentioned above.

An example notebook: https://github.com/jupyter-widgets/ipywidgets/blob/master/docs/source/examples/Widget%20List.ipynb

I don't know if those output data should be ignored by the ExtractOutputPreprocessor, though.

If not, they should be added extract_output_types, too.

sgsaenger · 2018-11-20T12:43:15Z

At least in this example, I'd wager it should explicitly not be extracted. (What can the user want with the model_id that seems to be for communication with the widget-handler?)
But my intention was just fixing #904, maybe @akhmerov and @minrk who handled the PR #847 that introduced the json-handling can have a look at this?

akhmerov · 2018-11-20T19:44:50Z

I'm quite confused why a byte string ended up in the notebook data to be honest. I thought according to the nbformat everything is a JSON data type, so either a string or another JSON. In that case the fix would be to ensure that the data is read as a string instead.

The problem with the spec is that it's a bit ambiguous: JSON output may consist of a single string. That's why the logic I applied in determining the output type is to check whether it's not a string, and therefore must be JSON, or a string, in which case it may be a JSON and we have to check the mime type.

If for some reason we do expect that byte arrays may appear in cell outputs, I believe a better fix would be to include those in the check if isinstance(data, (text_type, bytes)): ....

In addition to ipywidgets storing their outputs as JSON, so does plotly (its mime type is application/vnd.plotly.v1+json) and ipython.display.JSON (jupyterlab even has a nice rendering for it).

sgsaenger · 2018-11-21T01:05:40Z

I guess you're right about the specs. The file itself even says that it expects images to be enclosed in a string:

nbconvert/nbconvert/preprocessors/extractoutput.py

Line 94 in 061543f

# data is b64-encoded as text (str, unicode),

Still, i'm not sold on your solution.
We know the mime type before we reach this part of the code. By the contract of this function, we trust the mime type (else we wouldn't be here. Note that you omitted including json in the list of accepted mime types). So we perform some specific actions on the data, based on the mime_type.
Previously, we didn't care much for the actual python types handled; if I understand you correctly, you also don't really care whether you have a string or non-string, only whether we should treat the object as json, else you wouldn't include the ´isinstance´-dance.
So where do you see a problem in just checking the mime type against ´application/json´?
The ´isinstance´ call ONLY comes into play if the data is jpg, png, pdf or svg (or anything else that may be included in the future), shouldn't this be independent from json-specific code-paths?

akhmerov · 2018-11-21T07:43:00Z

Note that you omitted including json in the list of accepted mime types).

The list of mimetypes accepted by default is a separate issue from extracting the output correctly, and therefore I did not touch it. That traitlet isn't a complete list of mime types that should ever be treated by the extract output preprocessor, but merely what the extract output treats by default. Several preprocessors override that list. We therefore may not assume that we are never going to encounter other mimetypes. Specifically your PR would break nbconvert for those who want to extract widget information as pointed by @mgeier, or a plotly plot, which is also JSON.

We know the mime type before we reach this part of the code. By the contract of this function, we trust the mime type (else we wouldn't be here.

That is certainly true, but this bit of code is not about the meaning of the data (represented as mime type), but rather about the storage format of the data. Unfortunately right now the storage format is not explicitly specified in the metadata (see also #858). Roughly speaking, there are 3 options currently:

any JSON-like data is stored as JSON without double serialization
any other plaintext-ish mime type is stored as a string
anything else is base64 encoded

The second and third option would become strings at this point in the pipeline, therefore the logic I am applying is the following;

If we encounter any python object other than a string (something like a list/dict), we're dealing with JSON-encoded data (that's the reason for what you called isinstance-dance)
Otherwise if we are either dealing with either a single JSON string or the other two options. Storing a single JSON string is rather uncommon, but allowed by the spec, and therefore only in that case I do rely on the explicit mimetype.

That's my understanding of what is happening in the preprocessor, largely based on discussions with @minrk.

What I really do not understand is: why do we ever encounter a byte string! This seems like a bug: JSON doesn't have a way to specify the type of string data, svg may contain arbitrary unicode symbols, and nothing in the spec tells about byte strings. Instead we're base64 encoding anything that isn't unicode.

sgsaenger · 2018-11-21T09:37:04Z

That traitlet isn't a complete list of mime types that should ever be treated by the extract output preprocessor, but merely what the extract output treats by default. Several preprocessors override that list.

Ah, totally didn't catch that usage, but fair point.

If we encounter any python object other than a string (something like a list/dict), we're dealing with JSON-encoded data (that's the reason for what you called isinstance-dance)

Ok, I now understand your assumptions. But still, is this the right place to (implicitly) introduce strong typing? I have two issues with this:

We're implicitly introducing strong type-dependencies on data we don't care about. If it is important that we more closely match some specs, shouldn't this be done explicitly, and possibly elsewhere?
Isn't there a way to do the json-conversion while being both more explicit in our intent and/or being more robust?
We could e.g. check for json in mime_type to match the widget information, or wrap the dump in try and ignore possible exceptions when the data is not wanted.

akhmerov · 2018-11-21T13:46:41Z

Some remarks:

The best way to avoid guessing would be WIP: output extractor: infer data encoding from output metadata #858.
For now if we let something that doesn't look like a string go further down the pipeline, the follow-up code will break. This would happen with this PR and a mime-type that we don't explicitly catch. I don't like strong typing just likewise, but here I don't see a good reason to avoid it. (Obviously there's something subtle happening with bytes, so I'm not claiming that the current code is correct.)
I don't know enough about mimetypes combined with how jupyter decides which data storage format to apply to a datatype. I am therefore not certain that a check 'json' in mimetype is safe. Maybe some json-related mimetypes would somehow be still string-serialized and we would end up double-serializing them? Maybe some mimetypes that don't contain JSON in their name are stored as a JSON data structure?

MSeal · 2019-04-13T21:48:04Z

Related to this topic, I ended up tackling the application.json vs other data type paths in #985. It doesn't expand the scope of data transformation but fixed a bug where latex exports ended up in the data.decode path.

do not pipe non-json data through json.dumps

061543f

sdpython mentioned this pull request Dec 18, 2018

conversion of notebook including svg fails sdpython/pyquickhelper#216

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

do not pipe non-json data through json.dumps #910

do not pipe non-json data through json.dumps #910

sgsaenger commented Nov 6, 2018

mgeier commented Nov 13, 2018

sgsaenger commented Nov 18, 2018

mgeier commented Nov 20, 2018

sgsaenger commented Nov 20, 2018

akhmerov commented Nov 20, 2018

sgsaenger commented Nov 21, 2018 •

edited

Loading

akhmerov commented Nov 21, 2018

sgsaenger commented Nov 21, 2018

akhmerov commented Nov 21, 2018

MSeal commented Apr 13, 2019

do not pipe non-json data through json.dumps #910

Are you sure you want to change the base?

do not pipe non-json data through json.dumps #910

Conversation

sgsaenger commented Nov 6, 2018

mgeier commented Nov 13, 2018

sgsaenger commented Nov 18, 2018

mgeier commented Nov 20, 2018

sgsaenger commented Nov 20, 2018

akhmerov commented Nov 20, 2018

sgsaenger commented Nov 21, 2018 • edited Loading

akhmerov commented Nov 21, 2018

sgsaenger commented Nov 21, 2018

akhmerov commented Nov 21, 2018

MSeal commented Apr 13, 2019

sgsaenger commented Nov 21, 2018 •

edited

Loading