
nbformat v4 #6045

Merged
merged 47 commits into from Nov 3, 2014

Conversation

minrk
Member

@minrk minrk commented Jun 25, 2014

Still quite a bit to do, but the implemented parts of v4 are working pretty well. I did find and fix some things while testing the validation. Both v3 and v4 have a Draft 4 jsonschema, and use the standard jsonschema format for references, so jsonpointer is no longer used.

closes #5074

TODO in separate PRs:

  • backport v4->v3 to 2.x
  • what to do when uploading old notebooks (upgrade on upload, or leave alone and upgrade on open)?
  • add convert 3<->4
  • make v4 current
  • update stream messages to name, text
  • move display_data.png to display_data.data['image/png']
  • what's new
  • decide on UI for downgrade
  • upgrade existing example notebooks
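
The Draft 4 jsonschema validation mentioned above can be sketched with the third-party jsonschema library. The schema below is a toy stand-in for illustration only; the real nbformat v4 schema is far more detailed and ships with the package:

```python
import jsonschema  # third-party: pip install jsonschema

# Toy Draft 4 schema in the spirit of the nbformat schemas (NOT the real one),
# using a standard "$ref" to a local definition instead of jsonpointer.
schema = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
    "required": ["nbformat", "nbformat_minor", "cells", "metadata"],
    "properties": {
        "nbformat": {"type": "integer", "minimum": 4, "maximum": 4},
        "nbformat_minor": {"type": "integer", "minimum": 0},
        "metadata": {"type": "object"},
        "cells": {"type": "array", "items": {"$ref": "#/definitions/cell"}},
    },
    "definitions": {
        "cell": {
            "type": "object",
            "required": ["cell_type", "source", "metadata"],
        }
    },
}

nb = {
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [{"cell_type": "markdown", "source": "# Hi", "metadata": {}}],
}

validator = jsonschema.Draft4Validator(schema)
errors = [e.message for e in validator.iter_errors(nb)]
print(errors)  # [] -- the toy notebook validates
```

`iter_errors` collects all violations rather than raising on the first, which is what makes a warn-instead-of-raise policy practical.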

minrk added a commit to minrk/ipython that referenced this pull request Jun 25, 2014
@minrk
Member Author

minrk commented Jun 26, 2014

General validation question:

When a notebook is converted from one version to another, it makes sense to me to validate the input notebook prior to conversion, and validate the result after conversion. The question is: if either of these validations fails, should we warn or raise?

The argument for raising, at least on the second validation, is that we shouldn't be creating invalid notebooks.

The argument against raising is that most 'invalid' notebooks aren't actually problematic - they may have some extra keys defined, or some empty fields that are safely handled by default values.

Another argument against raising is that, since there aren't any v4 notebooks, once we make the switch, all notebooks will go through an upgrade step, so raise on invalid in upgrade means all notebooks that fail validation will not open.
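
The warn-instead-of-raise policy under discussion can be sketched as follows; every name here is hypothetical, and the validator is a stand-in for real schema validation:

```python
import warnings

def validation_errors(nb):
    """Stand-in for real schema validation: return a list of problems."""
    errors = []
    if "cells" not in nb:
        errors.append("missing key: cells")
    return errors

def convert_with_warnings(nb, convert):
    """Hypothetical wrapper: validate before and after conversion, but only
    warn on failure, so slightly-invalid notebooks still open."""
    for err in validation_errors(nb):
        warnings.warn("input notebook invalid: %s" % err)
    out = convert(nb)
    for err in validation_errors(out):
        warnings.warn("converted notebook invalid: %s" % err)
    return out

# a slightly invalid notebook still converts, with a warning
upgraded = convert_with_warnings({"metadata": {}}, lambda nb: dict(nb, cells=[]))
print(sorted(upgraded))  # ['cells', 'metadata']
```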

@jdfreder
Member

Another argument against raising is that, since there aren't any v4 notebooks, once we make the switch, all notebooks will go through an upgrade step, so raise on invalid in upgrade means all notebooks that fail validation will not open.

I think that would be a disaster. If we are able to upgrade a notebook, invalid or not, we should do so and warn if it's invalid. I think it's important to allow the user to attempt to open the invalid notebook and salvage what converted. The only time I think raising would be appropriate is if no output was produced.

@minrk
Member Author

minrk commented Jun 30, 2014

We should figure out the order of operations on this one. Various tasks:

  1. finish implementation of v4 (pretty much done)
  2. implement 3<->4 conversion (pretty much done)
  3. Make downgrade to v3 easier via public API, for working with (to do)
  4. point nbformat.current to v4, and the various consumers of the API (nbconvert done, html partially done)
  5. backport v4 to 2.x (to do)

And figure out exactly what order to do these things, and in how many PRs.

@ellisonbg
Member

@minrk do you need feedback from us on the process for this? IIRC we discussed the order of operations at a dev meeting, but didn't put the conclusion here or there.

@minrk
Member Author

minrk commented Jul 25, 2014

It's been a while since I looked at this one. I'll need to better enumerate the current state.

One question:

Following the msg-spec/output consistency fixes, I updated the stream output in the file format to match the msg spec, so stream output changes from having stream, text to having name, data. Alternately, I could change the msg spec to have stream, text instead, or some mixture (e.g. change both to have name, text).

@minrk
Member Author

minrk commented Jul 26, 2014

Do we want to add any UI for 'save a copy as v3' or otherwise upgrade/downgrade? We have Python APIs for it, but no CLI or GUI access to it.

@minrk minrk added this to the 3.0 milestone Jul 28, 2014
@minrk
Member Author

minrk commented Jul 28, 2014

There are some questions about UI for up/downgrade, stream outputs, etc. that I have enumerated as tasks in the description, but at least I have all of the tests passing with nbformat.current = v4.

@minrk minrk changed the title WIP: nbformat v4 nbformat v4 Jul 28, 2014
@minrk
Member Author

minrk commented Jul 28, 2014

I've rebased this one on #5938, since there were conflicts to resolve. I'll clean it back up once that one is merged.

@ivanov
Member

ivanov commented Aug 5, 2014

one of the changes this introduces that may be worth noting is that, because the 'worksheets' have been flattened into the root namespace, the metadata key now appears last (at the end of the file) as opposed to at the top. I tried to cook up something that would restore the old order, but it's ugly. That's because json.dumps supports the sort_keys boolean flag, but doesn't allow you to specify a custom sort ordering.
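
The sort_keys limitation is easy to demonstrate:

```python
import json

d = {"cells": [], "metadata": {}}
# sort_keys only sorts alphabetically; there is no hook for a custom
# ordering, so "cells" always lands before "metadata"
print(json.dumps(d, sort_keys=True))
```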

'new_code_cell', 'new_markdown_cell', 'new_notebook', 'new_output',
'new_heading_cell', 'nbformat', 'nbformat_minor', 'output_from_msg',
'to_notebook_json', 'convert', 'validate', 'nbformat_schema',
'reads_json', 'writes_json', 'reads', 'writes', 'read', 'write',
Member


Since this is explicit public API, I think we should leave new_text_cell as a deprecated alias for one release.

@ivanov
Member

ivanov commented Aug 6, 2014

given that we're changing the file format, @fperez had the suggestion of sticking with the singular "worksheet" key name in place of "cells", in order to preserve "metadata" being at the top of the sorted JSON file (so we can count on being able to peek at just the head of the file down the line to find out kernel information for a given notebook, for example)

@minrk
Member Author

minrk commented Aug 6, 2014

'count on' seems a bit strong to me, since sorting keys is not actually a part of the file format specification, so we cannot assume that metadata comes first, but we can at least ensure that it is for notebook files that we do write. I'm not thrilled by the idea of picking names based on their lexicographical sorting, but I don't have a big problem with worksheet instead of cells.

We could call it nb['ze cells'] to be extra sure.

@Carreau
Member

Carreau commented Aug 6, 2014

Meh... I would prefer a solution based on extended metadata when possible, like xattr; it exists with ext4, and probably NTFS, though I don't know of a package that can use it. And it is not the end of the world if we can't distinguish kernels by peeking at the file header. So -0.5 for worksheet, -0.4 for 'ze cells' :-)

@fperez
Member

fperez commented Aug 6, 2014

I don't particularly like 'worksheet', but I'm even less thrilled about solutions that rely on filesystem-support for extended metadata. Our notebooks may well be stored on non-FS backends, so we shouldn't make assumptions about what properties a FS has.

@Carreau
Member

Carreau commented Aug 6, 2014

That's why I say "when possible": if it is non-FS, then it is trivial to store the kernel along with the file content.
This is basically saying that the backend is "maybe" capable of reading the kernel without reading the whole file:

  • It might be in the file head,
  • it might be in xattr,
  • it might be in a DB that stores "kernel" as another column,
  • it might be Oracle.

We shouldn't rely on any of these.

@Carreau
Member

Carreau commented Aug 6, 2014

Moreover, as metadata can be arbitrarily big, you are not even sure the kernel spec will be in the first X lines of the file. I don't think we should design the file format around such things.

@takluyver
Member

The trouble with external metadata isn't just storage, it's also
transmission - we want to be able to send notebooks around as email
attachments, gists, etc., and I doubt those things all preserve FS metadata.

I think it is worth trying to keep metadata at the top when we write files.
It's not just about assuming that it's in the first X lines - there are
iterative JSON parsers that can read values without loading the whole file
into memory. This is potentially useful even if the order isn't guaranteed,
because of the memory savings, but it's even more useful if you know
certain keys are likely to be near the top of the file.
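
A minimal sketch of that head-peeking idea, using only the stdlib json module: raw_decode stops after the first complete JSON value and ignores the truncated remainder, so a reader only needs the head of the file if the writer put the interesting key first. The 'kernel_info' key below is hypothetical:

```python
import json

def peek_key(head, key):
    """Toy sketch: given only the head of a notebook file whose writer put
    `key` first, decode just that key's value and ignore the rest."""
    marker = '"%s":' % key
    idx = head.index(marker) + len(marker)
    value, _end = json.JSONDecoder().raw_decode(head[idx:].lstrip())
    return value

# first few dozen bytes of a hypothetical notebook file, cut off mid-stream
head = '{"metadata": {"kernel_info": {"name": "python3"}}, "cells": ['
print(peek_key(head, "metadata"))  # {'kernel_info': {'name': 'python3'}}
```

This is fragile (it assumes the key is spelled exactly once near the top), which is why libraries like iterative JSON parsers do the same thing with a real tokenizer.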

@Carreau
Member

Carreau commented Aug 6, 2014

And do you really need to peek at the kernel? If we do something like that, it should be in the spec. Otherwise we define a "kernel hint" class that can be overridden.
BDFL said that notebooks might not be stored on a filesystem. Can you guarantee the order of keys on MongoDB? CouchDB? Postgres?
Is it the end of the world if you do not know a notebook's kernel without first reading it? Do people have .ipynb files in 75 languages in one folder?

I'm just not convinced that baking that into the file format before needing it is wise. You know, like... multiple worksheets :-)

I know that you are attached to .ipynb, but why not file extensions also?



@takluyver
Member

File extensions for different kernels would be a horrible mess, and I
definitely don't think we want to go down that route.

The metadata is there, and it's conceivable that tools might want to load
metadata when they're not interested in the rest of the file - that's kind
of the nature of metadata, after all. Other backends may have different
ways of making that convenient, like storing it in a separate database
field. For files, the best we can do is to try to keep metadata near the
beginning (some formats can read from the end, but that would be distinctly
awkward with JSON).

@Carreau
Member

Carreau commented Aug 7, 2014

File extensions for different kernels would be a horrible mess, and I
definitely don't think we want to go down that route.

I understand that. But keep in mind that this would also allow for different icons in GUIs. (Note: I personally am -0.6 on changing the extension, just raising the point.)

Now, if it is just a question of sorting metadata to the top, use an OrderedDict at write time, which can be arbitrarily sorted and dumped in the correct order:

Would that suit @fperez ?

In [18]: d = OrderedDict({'cell':{},'metadata':{}});d
Out[18]: OrderedDict([('cell', {}), ('metadata', {})])

In [19]: json.dumps(d)
Out[19]: '{"cell": {}, "metadata": {}}'

In [20]: d.move_to_end('metadata',last=False)

In [21]: json.dumps(d)
Out[21]: '{"metadata": {}, "cell": {}}'

@ellisonbg
Member

I think we should not try to monkey with key order in JSON. We are not the
only ones who will write notebook files in the long run and we don't want
to create an approach where a plain JSON writer won't suffice.


@Carreau
Member

Carreau commented Aug 7, 2014

We already mess with the indentation and ensure order is preserved for reproducible saves. I don't see why trying to save with a specific order is problematic. But of course we should not rely on it.
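
A write-time reordering along those lines might look like the sketch below. The key list is assumed for illustration; the point is that only the writer changes, and any plain JSON reader still works unchanged:

```python
import json
from collections import OrderedDict

# assumed top-level keys; only the write-time ordering matters here
KEY_ORDER = ["metadata", "nbformat", "nbformat_minor", "cells"]

def with_metadata_first(nb):
    """Sketch: reorder top-level keys at write time so 'metadata' is
    serialized first, without changing the notebook's content."""
    ordered = OrderedDict()
    for key in KEY_ORDER:
        if key in nb:
            ordered[key] = nb[key]
    for key in nb:  # preserve any keys we didn't anticipate
        if key not in ordered:
            ordered[key] = nb[key]
    return ordered

nb = {"cells": [], "nbformat": 4, "nbformat_minor": 0, "metadata": {}}
print(json.dumps(with_metadata_first(nb)))
# the serialized text now starts with the "metadata" key
```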

@minrk
Member Author

minrk commented Aug 14, 2014

Getting back to work on this. A few questions:

  • Are we decided that nb.worksheet is better than nb.cells for the cell list? If so, do we want it to be like the v3 worksheet, which was actually a dict with metadata and a cells attribute, or just a list? (i.e. is it v4nb.worksheet = v3nb.worksheets[0] or v4nb.worksheet = v3nb.worksheets[0].cells?)

  • Do we want notebook GUI for saving a downgraded copy, or is nbconvert CLI adequate?

  • Does anyone have a preference for the stream output keys? Code currently represents nbformat changing to match msg spec, but since we are updating both, we can reasonably make any change. It's currently:

    • name (msg) vs stream (nb)
    • data (msg) vs text (nb)

    The advantage of updating nb to match the msg spec is that I think more authors are writing kernels against the msg spec than are writing JavaScript or nbconvert code against the nbformat. The advantage of updating the msg spec to match nb is that there will be ever-so-slightly less git churn when updating (although the indentation will all change, so it will still be a ~100% diff).

Related to the nb.cells / nb.worksheet, we could change the top-level structure of the notebook to be a list instead of a dict, so that we can actually guarantee that a header comes before the body. That would be a more drastic change (1.x would not even be able to identify v4 notebooks as v4), but it would solve the header sniffing problem without resorting to key sorting.
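
The stream-output question above amounts to a one-key rename during conversion. A sketch of the v3-to-v4 direction, assuming the msg-spec name key wins; whether the payload key stays text or becomes data is exactly the open question, and text is kept here:

```python
def upgrade_stream_output(output):
    """Sketch of the v3 -> v4 stream-output rename discussed above,
    assuming the msg-spec naming is chosen: 'stream' -> 'name'."""
    out = dict(output)  # work on a copy; leave the v3 dict untouched
    if out.get("output_type") == "stream" and "stream" in out:
        out["name"] = out.pop("stream")
    return out

v3 = {"output_type": "stream", "stream": "stdout", "text": "hello\n"}
print(upgrade_stream_output(v3))
```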

mattvonrocketstein pushed a commit to mattvonrocketstein/ipython that referenced this pull request Nov 3, 2014
@satra
Contributor

satra commented Nov 3, 2014

FWIW: adding to wes' comment on the mailing list, i would strongly encourage adding one simple field to the nbformat structure:

{
"@context": ...
}

which by default could simply point to:

{
"@context": "http://ipython.org/nbformat/v4/context"
}

here is an example from the foaf world:

{
  "@context": "http://xmlns.com/foaf/context",
  "name": "Manu Sporny",
  "homepage": "http://manu.sporny.org/"
}

you could then say that you may not interpret anything in "@context" in IPython or jupyter for now, but it would allow an immediate entry into a much wider Web. some of the key advantages are:

  • a machine can interpret this document without having to read some html page or wiki
  • every key that you use has an explicit meaning
  • one can use fairly standard tools even for validation
  • every ipython notebook document itself becomes a graph on the Linked Data Web
  • and you conform to established terms and practices (see schema.org that wes pointed to)

in this situation, one additional keyword gains you a whole lot more, without much maintenance on your side.

ps: you guys are doing a terrific job and i can no longer practically keep up with all the things going on. however, when it comes to standards, especially those that play well with others, i have drunk some Kool-Aid here in cambridge. so this is merely a suggestion to support a format that links with the wider Linked Data Web and a lot of standardization efforts (http://5stardata.info/ , http://schema.org). also it might be worthwhile to encourage an RFC-type approach for things like this after you have fleshed out an initial draft. i do continue to monitor the dev list :) and wes did bring this up in a sep 13 email, but i didn't see anything since then till matthias' announcement.

pps: btw, if you were seriously interested in computational provenance and the notebook, the format will be key to this. i would be happy to join one of the dev-calls and present some thoughts on it. we have done a fair bit of work in other contexts. it does seem that your effort to characterize the notebook as a computational standard has to take provenance into account - otherwise it picks up on a very narrow slice of "reproducibility".

@minrk
Member Author

minrk commented Nov 3, 2014

it might be worthwhile to encourage an RFC type approach for things like this after you have fleshed out an initial draft. i do continue to monitor the dev list

I would say that's what we are doing with our IPEP approach, which includes the nbformat 4 IPEP, which began prior to the release of IPython 1.0.

I'm not particularly persuaded by JSON-LD for notebooks, but if such data can be added to notebook-level metadata, there shouldn't be an issue with adoption by those who are interested. If it needs to be top-level, a proposal can be made to include it in nbformat v4.1.

I think provenance is outside the scope of the notebook format itself, but I think different provenance approaches could be pursued via metadata in the notebook, without affecting the nbformat itself.
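
Carrying such data in notebook-level metadata, as suggested, would look roughly like this; the placement is illustrative, the context URL is taken from the comment above, and no tool is assumed to interpret it:

```python
import json

nb = {
    "nbformat": 4,
    "nbformat_minor": 0,
    "cells": [],
    "metadata": {
        # hypothetical placement: a JSON-LD context in notebook metadata,
        # harmlessly ignored by tools that don't understand it
        "@context": "http://ipython.org/nbformat/v4/context",
    },
}
print(json.dumps(nb["metadata"]))
```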

takluyver added a commit that referenced this pull request Nov 3, 2014
@takluyver takluyver merged commit 482c7bd into ipython:master Nov 3, 2014
@Carreau
Member

Carreau commented Nov 3, 2014

Yeahhh! Now we wait for the bug report flow.

@satra
Contributor

satra commented Nov 3, 2014

@minrk - thanks - i stopped watching ipython on github because of the intense amount of activity (which is a great thing) and just focused on the dev list for monitoring and didn't see anything there. if there is a better place for us to discuss these issues, i'm happy to move it there.

re: json-ld - the key questions here are:

  1. do you see the notebook as a document or data on the web?
  2. do you see the notebook as encapsulating some form of data model (whether for jupyter or ipython)?

if the answers are yes (and i would argue even yes to 1 is a sufficient condition), then the world of linked data would suggest that it follow one of the formats for such a document (json-ld is one such representation). and under such circumstances one should try to follow http://5stardata.info/ to prevent further silos being built. if we had arpanet, http, metatcp, myproto, etc., we would never have had the web work the way it does today.

re: provenance - is the notebook defined to capture a slice of reproducibility, i.e. some static snapshot that can be rerun? if so, i agree that it's outside the scope. if not, then provenance as metadata is not the optimal route to provenance.

@fperez
Member

fperez commented Nov 3, 2014

I'm sadly mostly offline for a few weeks, and I'm typing this just about to
board a plane to PyCon Brasil, so forgive me for the fire and forget...

Full provenance is a gnarly problem, and I'd like us to "play nice" in that
space, but not attempt to tackle it head on. It's simply too large of a
question for us to handle effectively given our resources, and our
strengths lie in the interactive environment itself. But obviously we want
IPython to be part of the toolbox for better reproducibility in research,
so to the extent that small adjustments to what we do can help in that
direction, we should look at them.

The project is getting large enough that there's really lots of room for individuals or small teams to take point on specific topics, in a way that they can then relay back to the whole core team with knowledge, ideas, suggestions, and even implementation once some agreement is reached.

Satra, I know you have a lot of knowledge on this topic. It would be great
to put this on the agenda for one of the meetings, potentially the Tuesday
ones (we try to keep the Thursday one for core 'critical path' topics, and
right now 3.0 is 100% of the critical path). I'd love to be part of that
discussion, if you're willing to lead it. Given my current travel, it
would need to be post-Nov 20.

Cheers

f


@satra
Contributor

satra commented Nov 4, 2014

@fperez - happy to - let's plan on something after thanksgiving. it looks like we should announce on the mailing list depending on what we plan to cover. i'm quite positive there are others who might want to be part of the discussion. bon voyage!

@fperez
Member

fperez commented Nov 4, 2014

Great, thanks! (typed from Brasilia, as I watch outside the kind of
rainstorm that only happens in tropical jungles, and wonder if my next
flight will make it anywhere :-)
