
nbformat v4 #6045

Merged
merged 47 commits into from Nov 3, 2014

Conversation

minrk
Member

@minrk minrk commented Jun 25, 2014

Still quite a bit to do, but the implemented parts of v4 are working pretty well. I did find and fix some things while testing the validation. Both v3 and v4 have a Draft 4 jsonschema, and use the standard jsonschema format for references, so jsonpointer is no longer used.

closes #5074

TODO in separate PRs:

  • backport v4->v3 to 2.x
  • what to do when uploading old notebooks (upgrade on upload, or leave alone and upgrade on open)?
  • add convert 3<->4
  • make v4 current
  • update stream messages to name, text
  • move display_data.png to display_data.data['image/png']
  • what's new
  • decide on UI for downgrade
  • upgrade existing example notebooks
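
The Draft 4 jsonschema validation mentioned above can be sketched with the third-party jsonschema library. The schema below is a toy stand-in for illustration only; the real nbformat v4 schema is far more detailed and ships with the package:

```python
import jsonschema  # third-party: pip install jsonschema

# Toy Draft 4 schema in the spirit of the nbformat schemas (NOT the real one),
# using a standard "$ref" to a local definition instead of jsonpointer.
schema = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
    "required": ["nbformat", "nbformat_minor", "cells", "metadata"],
    "properties": {
        "nbformat": {"type": "integer", "minimum": 4, "maximum": 4},
        "nbformat_minor": {"type": "integer", "minimum": 0},
        "metadata": {"type": "object"},
        "cells": {"type": "array", "items": {"$ref": "#/definitions/cell"}},
    },
    "definitions": {
        "cell": {
            "type": "object",
            "required": ["cell_type", "source", "metadata"],
        }
    },
}

nb = {
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [{"cell_type": "markdown", "source": "# Hi", "metadata": {}}],
}

validator = jsonschema.Draft4Validator(schema)
errors = [e.message for e in validator.iter_errors(nb)]
print(errors)  # [] -- the toy notebook validates
```

`iter_errors` collects all violations rather than raising on the first, which is what makes a warn-instead-of-raise policy practical.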

minrk added a commit to minrk/ipython that referenced this pull request Jun 25, 2014
@minrk
Member Author

minrk commented Jun 26, 2014

General validation question:

When a notebook is converted from one version to another, it makes sense to me to validate the input notebook prior to conversion, and validate the result after conversion. The question is: if either of these validations fails, should we warn or raise?

The argument for raising, at least on the second validation, is that we shouldn't be creating invalid notebooks.

The argument against raising is that most 'invalid' notebooks aren't actually problematic - they may have some extra keys defined, or some empty fields that are safely handled by default values.

Another argument against raising is that, since there aren't any v4 notebooks, once we make the switch, all notebooks will go through an upgrade step, so raise on invalid in upgrade means all notebooks that fail validation will not open.
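
The warn-instead-of-raise policy under discussion can be sketched as follows; every name here is hypothetical, and the validator is a stand-in for real schema validation:

```python
import warnings

def validation_errors(nb):
    """Stand-in for real schema validation: return a list of problems."""
    errors = []
    if "cells" not in nb:
        errors.append("missing key: cells")
    return errors

def convert_with_warnings(nb, convert):
    """Hypothetical wrapper: validate before and after conversion, but only
    warn on failure, so slightly-invalid notebooks still open."""
    for err in validation_errors(nb):
        warnings.warn("input notebook invalid: %s" % err)
    out = convert(nb)
    for err in validation_errors(out):
        warnings.warn("converted notebook invalid: %s" % err)
    return out

# a slightly invalid notebook still converts, with a warning
upgraded = convert_with_warnings({"metadata": {}}, lambda nb: dict(nb, cells=[]))
print(sorted(upgraded))  # ['cells', 'metadata']
```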

@jdfreder
Member

Another argument against raising is that, since there aren't any v4 notebooks, once we make the switch, all notebooks will go through an upgrade step, so raise on invalid in upgrade means all notebooks that fail validation will not open.

I think that would be a disaster. If we are able to upgrade a notebook, invalid or not, we should do so and warn if it's invalid. I think it's important to allow the user to attempt to open the invalid notebook and salvage what converted. The only time I think raising would be appropriate is if no output was produced.

@minrk
Member Author

minrk commented Jun 30, 2014

We should figure out the order of operations on this one. Various tasks:

  1. finish implementation of v4 (pretty much done)
  2. implement 3<->4 conversion (pretty much done)
  3. Make downgrade to v3 easier via public API, for working with (to do)
  4. point nbformat.current to v4, and the various consumers of the API (nbconvert done, html partially done)
  5. backport v4 to 2.x (to do)

And figure out exactly what order to do these things, and in how many PRs.

@ellisonbg
Member

@minrk do you need feedback from us on the process for this? IIRC we discussed the order of operations at a dev meeting, but didn't put the conclusion here or there.

@minrk
Member Author

minrk commented Jul 25, 2014

It's been a while since I looked at this one. I'll need to better enumerate the current state.

One question:

Following the msg-spec/output consistency fixes, I updated the stream output in the file format to match the msg spec, so stream output changes from having stream, text to having name, data. Alternately, I could change the msg spec to have stream, text instead, or some mixture (e.g. change both to have name, text).

@minrk
Member Author

minrk commented Jul 26, 2014

Do we want to add any UI for 'save a copy as v3' or otherwise upgrade/downgrade? We have Python APIs for it, but no CLI or GUI access to it.

@minrk minrk added this to the 3.0 milestone Jul 28, 2014
@minrk
Member Author

minrk commented Jul 28, 2014

There are some questions about UI for up/downgrade, stream outputs, etc. that I have enumerated as tasks in the description, but at least I have all of the tests passing with nbformat.current = v4.

@minrk minrk changed the title WIP: nbformat v4 nbformat v4 Jul 28, 2014
@minrk
Member Author

minrk commented Jul 28, 2014

I've rebased this one on #5938, since there were conflicts to resolve. I'll clean it back up once that one is merged.

@ivanov
Member

ivanov commented Aug 5, 2014

one of the changes this introduces that may be worth noting is that, because the 'worksheets' have been flattened into the root namespace, the metadata key now appears last (at the end of the file) as opposed to at the top. I tried to cook up something that would restore the old order, but it's ugly. That's because json.dumps supports the sort_keys boolean flag, but doesn't allow you to specify a custom sort ordering.
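
The sort_keys limitation is easy to demonstrate:

```python
import json

d = {"cells": [], "metadata": {}}
# sort_keys only sorts alphabetically; there is no hook for a custom
# ordering, so "cells" always lands before "metadata"
print(json.dumps(d, sort_keys=True))
```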

'new_code_cell', 'new_markdown_cell', 'new_notebook', 'new_output',
'new_heading_cell', 'nbformat', 'nbformat_minor', 'output_from_msg',
'to_notebook_json', 'convert', 'validate', 'nbformat_schema',
'reads_json', 'writes_json', 'reads', 'writes', 'read', 'write',
Member


Since this is explicit public API, I think we should leave new_text_cell as a deprecated alias for one release.

@ivanov
Member

ivanov commented Aug 6, 2014

given that we're changing the file format, @fperez had the suggestion of sticking with the singular "worksheet" key name in place of "cells", in order to preserve "metadata" being at the top of the sorted JSON file (so we can count on being able to peek at just the head of the file down the line to find out kernel information for a given notebook, for example)

@minrk
Member Author

minrk commented Aug 6, 2014

'count on' seems a bit strong to me, since sorting keys is not actually a part of the file format specification, so we cannot assume that metadata comes first, but we can at least ensure that it is for notebook files that we do write. I'm not thrilled by the idea of picking names based on their lexicographical sorting, but I don't have a big problem with worksheet instead of cells.

We could call it nb['ze cells'] to be extra sure.

@Carreau
Member

Carreau commented Aug 6, 2014

Meh... I would prefer a solution based on extended metadata when possible, like xattr; it exists with ext4, and probably NTFS, though I don't know of a package that can use it. And it is not the end of the world if we can't distinguish kernels by peeking at the file header. So -0.5 for worksheet, -0.4 for 'ze cells' :-)

@fperez
Member

fperez commented Aug 6, 2014

I don't particularly like 'worksheet', but I'm even less thrilled about solutions that rely on filesystem-support for extended metadata. Our notebooks may well be stored on non-FS backends, so we shouldn't make assumptions about what properties a FS has.

@Carreau
Member

Carreau commented Aug 6, 2014

That's why I say "when possible": if it is non-FS, then it is trivial to store the kernel along with the file content.
This is basically saying that the backend is "maybe" capable of reading the kernel without reading the whole file:

  • It might be in the file head,
  • it might be in xattr,
  • it might be in a DB that stores "kernel" as another column,
  • it might be Oracle.

We shouldn't rely on any of these.

@Carreau
Member

Carreau commented Aug 6, 2014

Moreover, as metadata can be arbitrarily big, you are not even sure the kernel spec will be in the first X lines of the file. I don't think we should design the file format around such things.

@takluyver
Member

The trouble with external metadata isn't just storage, it's also
transmission - we want to be able to send notebooks around as email
attachments, gists, etc., and I doubt those things all preserve FS metadata.

I think it is worth trying to keep metadata at the top when we write files.
It's not just about assuming that it's in the first X lines - there are
iterative JSON parsers that can read values without loading the whole file
into memory. This is potentially useful even if the order isn't guaranteed,
because of the memory savings, but it's even more useful if you know
certain keys are likely to be near the top of the file.
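
A minimal sketch of that head-peeking idea, using only the stdlib json module: raw_decode stops after the first complete JSON value and ignores the truncated remainder, so a reader only needs the head of the file if the writer put the interesting key first. The 'kernel_info' key below is hypothetical:

```python
import json

def peek_key(head, key):
    """Toy sketch: given only the head of a notebook file whose writer put
    `key` first, decode just that key's value and ignore the rest."""
    marker = '"%s":' % key
    idx = head.index(marker) + len(marker)
    value, _end = json.JSONDecoder().raw_decode(head[idx:].lstrip())
    return value

# first few dozen bytes of a hypothetical notebook file, cut off mid-stream
head = '{"metadata": {"kernel_info": {"name": "python3"}}, "cells": ['
print(peek_key(head, "metadata"))  # {'kernel_info': {'name': 'python3'}}
```

This is fragile (it assumes the key is spelled exactly once near the top), which is why libraries like iterative JSON parsers do the same thing with a real tokenizer.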

@Carreau
Member

Carreau commented Aug 6, 2014

And do you really need to peek at the kernel? If we do something like that, it should be in the spec. Otherwise we define a "kernel hint" class that can be overridden.
BDFL said that notebooks might not be stored on a filesystem. Can you guarantee the order of keys on MongoDB? CouchDB? Postgres?
Is it the end of the world if you do not know a notebook's kernel without first reading it? Do people have .ipynb files in 75 languages in one folder?

I'm just not convinced that baking that into the file format before needing it is wise. You know, like... multiple worksheets :-)

I know that you are attached to .ipynb, but why not file extensions also?



@takluyver
Member

File extensions for different kernels would be a horrible mess, and I
definitely don't think we want to go down that route.

The metadata is there, and it's conceivable that tools might want to load
metadata when they're not interested in the rest of the file - that's kind
of the nature of metadata, after all. Other backends may have different
ways of making that convenient, like storing it in a separate database
field. For files, the best we can do is to try to keep metadata near the
beginning (some formats can read from the end, but that would be distinctly
awkward with JSON).

@Carreau
Member

Carreau commented Aug 7, 2014

File extensions for different kernels would be a horrible mess, and I
definitely don't think we want to go down that route.

I understand that. But keep in mind that this would also allow for different icons in GUIs. (Note: I personally am -0.6 on changing the extension, just raising the point.)

Now, if it is just a question of sorting metadata to the top, use an OrderedDict at write time, which can be arbitrarily sorted and dumped in the correct order:

Would that suit @fperez ?

In [18]: d = OrderedDict({'cell':{},'metadata':{}});d
Out[18]: OrderedDict([('cell', {}), ('metadata', {})])

In [19]: json.dumps(d)
Out[19]: '{"cell": {}, "metadata": {}}'

In [20]: d.move_to_end('metadata',last=False)

In [21]: json.dumps(d)
Out[21]: '{"metadata": {}, "cell": {}}'

@ellisonbg
Member

I think we should not try to monkey with key order in JSON. We are not the
only ones who will write notebook files in the long run and we don't want
to create an approach where a plain JSON writer won't suffice.


@Carreau
Member

Carreau commented Aug 7, 2014

We already mess with the indentation and ensure order is preserved for reproducible saves. I don't see why trying to save with a specific order is problematic. But of course we should not rely on it.
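
A write-time reordering along those lines might look like the sketch below. The key list is assumed for illustration; the point is that only the writer changes, and any plain JSON reader still works unchanged:

```python
import json
from collections import OrderedDict

# assumed top-level keys; only the write-time ordering matters here
KEY_ORDER = ["metadata", "nbformat", "nbformat_minor", "cells"]

def with_metadata_first(nb):
    """Sketch: reorder top-level keys at write time so 'metadata' is
    serialized first, without changing the notebook's content."""
    ordered = OrderedDict()
    for key in KEY_ORDER:
        if key in nb:
            ordered[key] = nb[key]
    for key in nb:  # preserve any keys we didn't anticipate
        if key not in ordered:
            ordered[key] = nb[key]
    return ordered

nb = {"cells": [], "nbformat": 4, "nbformat_minor": 0, "metadata": {}}
print(json.dumps(with_metadata_first(nb)))
# the serialized text now starts with the "metadata" key
```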

@minrk
Member Author

minrk commented Aug 14, 2014

Getting back to work on this. A few questions:

  • Are we decided that nb.worksheet is better than nb.cells for the cell list? If so, do we want it to be like the v3 worksheet, which was actually a dict with metadata and a cells attribute, or just a list? (i.e. is it v4nb.worksheet = v3nb.worksheets[0] or v4nb.worksheet = v3nb.worksheets[0].cells?)

  • Do we want notebook GUI for saving a downgraded copy, or is nbconvert CLI adequate?

  • Does anyone have a preference for the stream output keys? Code currently represents nbformat changing to match msg spec, but since we are updating both, we can reasonably make any change. It's currently:

    • name (msg) vs stream (nb)
    • data (msg) vs text (nb)

    The advantage of updating nb to match the msg spec is that I think more authors are writing kernels against the msg spec than are writing JavaScript or nbconvert code against the nbformat. The advantage of updating the msg spec to match nb is that there will be ever-so-slightly less git churn when updating (although the indentation will all change, so it will still be a ~100% diff).

Related to the nb.cells / nb.worksheet, we could change the top-level structure of the notebook to be a list instead of a dict, so that we can actually guarantee that a header comes before the body. That would be a more drastic change (1.x would not even be able to identify v4 notebooks as v4), but it would solve the header sniffing problem without resorting to key sorting.
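
The stream-output question above amounts to a one-key rename during conversion. A sketch of the v3-to-v4 direction, assuming the msg-spec name key wins; whether the payload key stays text or becomes data is exactly the open question, and text is kept here:

```python
def upgrade_stream_output(output):
    """Sketch of the v3 -> v4 stream-output rename discussed above,
    assuming the msg-spec naming is chosen: 'stream' -> 'name'."""
    out = dict(output)  # work on a copy; leave the v3 dict untouched
    if out.get("output_type") == "stream" and "stream" in out:
        out["name"] = out.pop("stream")
    return out

v3 = {"output_type": "stream", "stream": "stdout", "text": "hello\n"}
print(upgrade_stream_output(v3))
```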

mattvonrocketstein pushed a commit to mattvonrocketstein/ipython that referenced this pull request Nov 3, 2014
@satra
Contributor

satra commented Nov 3, 2014

FWIW: adding to wes' comment on the mailing list, i would strongly encourage adding one simple field to the nbformat structure:

{
"@context": ...
}

which by default could simply point to:

{
"@context": "http://ipython.org/nbformat/v4/context"
}

here is an example from the foaf world:

{
  "@context": "http://xmlns.com/foaf/context",
  "name": "Manu Sporny",
  "homepage": "http://manu.sporny.org/"
}

you could then say that you may not interpret anything in "@context" in IPython or jupyter for now, but it would allow an immediate entry into a much wider Web. some of the key advantages are:

  • a machine can interpret this document without having to read some html page or wiki
  • every key that you use has an explicit meaning
  • one can use fairly standard tools even for validation
  • every ipython notebook document itself becomes a graph on the Linked Data Web
  • and you conform to established terms and practices (see schema.org that wes pointed to)

in this situation, one additional keyword gains you a whole lot more, without much maintenance on your side.

ps: you guys are doing a terrific job and i can no longer practically keep up with all the things going on. however, when it comes to standards, especially those that play well with others, i have drunk some Kool-Aid here in cambridge. so this is merely a suggestion to support a format that links with the wider Linked Data Web and a lot of standardization efforts (http://5stardata.info/ , http://schema.org). also it might be worthwhile to encourage an RFC-type approach for things like this after you have fleshed out an initial draft. i do continue to monitor the dev list :) and wes did bring this up in a sep 13 email, but i didn't see anything since then till matthias' announcement.

pps: btw, if you were seriously interested in computational provenance and the notebook, the format will be key to this. i would be happy to join one of the dev-calls and present some thoughts on it. we have done a fair bit of work in other contexts. it does seem that your effort to characterize the notebook as a computational standard has to take provenance into account - otherwise it picks up on a very narrow slice of "reproducibility".

@minrk
Member Author

minrk commented Nov 3, 2014

it might be worthwhile to encourage an RFC type approach for things like this after you have fleshed out an initial draft. i do continue to monitor the dev list

I would say that's what we are doing with our IPEP approach, which includes the nbformat 4 IPEP, which began prior to the release of IPython 1.0.

I'm not particularly persuaded by JSON-LD for notebooks, but if such data can be added to notebook-level metadata, there shouldn't be an issue with adoption by those who are interested. If it needs to be top-level, a proposal can be made to include it in nbformat v4.1.

I think provenance is outside the scope of the notebook format itself, but I think different provenance approaches could be pursued via metadata in the notebook, without affecting the nbformat itself.
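
Carrying such data in notebook-level metadata, as suggested, would look roughly like this; the placement is illustrative, the context URL is taken from the comment above, and no tool is assumed to interpret it:

```python
import json

nb = {
    "nbformat": 4,
    "nbformat_minor": 0,
    "cells": [],
    "metadata": {
        # hypothetical placement: a JSON-LD context in notebook metadata,
        # harmlessly ignored by tools that don't understand it
        "@context": "http://ipython.org/nbformat/v4/context",
    },
}
print(json.dumps(nb["metadata"]))
```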

takluyver added a commit that referenced this pull request Nov 3, 2014
@takluyver takluyver merged commit 482c7bd into ipython:master Nov 3, 2014
@Carreau
Member

Carreau commented Nov 3, 2014

Yeahhh! Now we wait for the bug report flow.

@satra
Contributor

satra commented Nov 3, 2014

@minrk - thanks - i stopped watching ipython on github because of the intense amount of activity (which is a great thing) and just focused on the dev list for monitoring and didn't see anything there. if there is a better place for us to discuss these issues, i'm happy to move it there.

re: json-ld - the key questions here are:

  1. do you see the notebook as a document or data on the web?
  2. do you see the notebook as encapsulating some form of data model (whether for jupyter or ipython)?

if the answers are yes (and i would argue even yes to 1 is a sufficient condition), then the world of linked data would suggest that it follow one of the formats for such a document (json-ld is one such representation). and under such circumstances one should try to follow http://5stardata.info/ to prevent further silos being built. if we had arpanet, http, metatcp, myproto, etc., we would never have had the web work the way it does today.

re: provenance - is the notebook defined to capture a slice of reproducibility, i.e. some static snapshot that can be rerun? if so, i agree that it's outside the scope. if not, then provenance as metadata is not the optimal route to provenance.

@fperez
Member

fperez commented Nov 3, 2014

I'm sadly mostly offline for a few weeks, and I'm typing this just about to
board a plane to PyCon Brasil, so forgive me for the fire and forget...

Full provenance is a gnarly problem, and I'd like us to "play nice" in that
space, but not attempt to tackle it head on. It's simply too large of a
question for us to handle effectively given our resources, and our
strengths lie in the interactive environment itself. But obviously we want
IPython to be part of the toolbox for better reproducibility in research,
so to the extent that small adjustments to what we do can help in that
direction, we should look at them.

The project is getting large enough that there's really lots of room for individuals or small teams to take point on specific topics, in a way that they can then relay back to the whole core team with knowledge, ideas, suggestions, and even implementation once some agreement is reached.

Satra, I know you have a lot of knowledge on this topic. It would be great
to put this on the agenda for one of the meetings, potentially the Tuesday
ones (we try to keep the Thursday one for core 'critical path' topics, and
right now 3.0 is 100% of the critical path). I'd love to be part of that
discussion, if you're willing to lead it. Given my current travel, it
would need to be post-Nov 20.

Cheers

f


@satra
Contributor

satra commented Nov 4, 2014

@fperez - happy to - let's plan on something after thanksgiving. it looks like we should announce on the mailing list depending on what we plan to cover. i'm quite positive there are others who might want to be part of the discussion. bon voyage!

@fperez
Member

fperez commented Nov 4, 2014

Great, thanks! (typed from Brasilia, as I watch outside the kind of
rainstorm that only happens in tropical jungles, and wonder if my next
flight will make it anywhere :-)
