IPEP 17: Notebook Format 4

minrk edited this page Nov 3, 2014 · 19 revisions
Status: Implemented
Author: Min RK <benjaminrk@gmail.com>
Created: April 29, 2013
Updated: November 3, 2014
Discussion: 5733
Implementation: 6045

There are a few changes we need to make to the notebook format that will not be backward compatible. We do not intend to make these changes for 1.0, because nbformat changes are quite painful. This is a catalog of the changes we intend to make the next time we rev the nbformat.

Remove multiple worksheets

The worksheets field is a list, but we have no UI to support multiple worksheets. Our design has since shifted to a heading-cell-based structure, so we do not intend to ever support the multiple worksheet model. The worksheets list of lists shall be replaced with a single list, called cells.
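As a rough illustration of this change, the sketch below merges every worksheet into one top-level cells list. It assumes each worksheet is an object holding its own "cells" list; the function name flatten_worksheets is hypothetical and not part of nbformat.

```python
def flatten_worksheets(nb):
    """Return a copy of a notebook dict with all worksheets merged into 'cells'.

    Assumption: each worksheet is a dict with a "cells" list.
    """
    # Copy every top-level key except the old worksheets list.
    new_nb = {key: value for key, value in nb.items() if key != "worksheets"}
    cells = []
    for worksheet in nb.get("worksheets", []):
        # Preserve worksheet order, then cell order within each worksheet.
        cells.extend(worksheet.get("cells", []))
    new_nb["cells"] = cells
    return new_nb


old_nb = {
    "metadata": {},
    "worksheets": [
        {"cells": [{"cell_type": "code", "input": "1 + 1"}]},
        {"cells": [{"cell_type": "markdown", "source": "hello"}]},
    ],
}
new_nb = flatten_worksheets(old_nb)
```

The original dict is left untouched, which keeps the conversion easy to test against its input.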

Use mime-type output keys

We currently transform the mimetype keys of output data to short names, like json or png. These should be restored to the proper mimetype values used by the message spec, such as image/png and application/json. The output should be generated by a simple passthrough of the messages, rather than a whitelist transform.
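For existing notebooks, restoring the keys could look something like the sketch below. Only the png and json entries come from the text above; the html entry and the helper name restore_mime_keys are assumptions added for illustration, and a real converter would need the complete list of short names.

```python
# Short name -> mimetype. Deliberately incomplete.
SHORT_TO_MIME = {
    "png": "image/png",          # named in the text above
    "json": "application/json",  # named in the text above
    "html": "text/html",         # assumption: illustrative extra entry
}


def restore_mime_keys(output):
    """Return a copy of an output dict with short-name keys renamed to mimetypes."""
    # Keys without a known short name (e.g. output_type) pass through unchanged.
    return {SHORT_TO_MIME.get(key, key): value for key, value in output.items()}


old_output = {"output_type": "display_data", "png": "<image data>"}
new_output = restore_mime_keys(old_output)
```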

Remove python-centric names

Following IPEP 13, Python-specific keys in the message spec and notebook will be removed. Those affecting the notebook format:

  • pyout will become execute_result
  • pyerr will become error
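A minimal sketch of the rename as it might apply to a stored output, assuming the type is recorded under an output_type key as in the specification outline below. The helper name is hypothetical.

```python
RENAMED_OUTPUT_TYPES = {"pyout": "execute_result", "pyerr": "error"}


def rename_output_type(output):
    """Return a copy of an output dict using the language-neutral type names."""
    renamed = dict(output)
    if "output_type" in renamed:
        old_type = renamed["output_type"]
        # Types that were never Python-specific are left as they are.
        renamed["output_type"] = RENAMED_OUTPUT_TYPES.get(old_type, old_type)
    return renamed


result = rename_output_type({"output_type": "pyout", "prompt_number": 3})
```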

Make cell content key uniform

Currently text cells have a source key, which contains the text, and code cells have an input key. There is no reason for the two cell types to have a different name for their content:

  • CodeCell.input will become CodeCell.source, matching TextCell.source.
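In dict form, the rename might be applied as in this sketch. The function name unify_content_key is an assumption for illustration only.

```python
def unify_content_key(cell):
    """Return a copy of a cell whose content is stored under 'source'."""
    unified = dict(cell)
    if unified.get("cell_type") == "code" and "input" in unified:
        # Move the code text from the old key to the shared one.
        unified["source"] = unified.pop("input")
    return unified


code_cell = unify_content_key({"cell_type": "code", "input": "x = 1"})
text_cell = unify_content_key({"cell_type": "markdown", "source": "# Title"})
```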

Metadata changes

  • remove notebook name from metadata
  • move language key from code cells to top-level notebook metadata
  • add kernel info to top-level notebook metadata in some form
  • add format key to raw_cell metadata
  • add state for show/hide (already have) and auto-scroll.

Implementation and Coordination

Tasks involved in creating nbformat v4:

  • thoroughly define the v4 spec
  • update message spec keys (pyout, pyerr, etc.)
  • mime-type keys for output (affects nbconvert, nbformat, javascript)
  • remove worksheets, move cells to top-level list
  • add conversions to nbformat: v3->v2, v4->v3, v3->v4
  • metadata changes
  • widget-related changes (TBD)
  • we may need v4->v4 to track changes to v4 during development. If so, this should probably not be included in the release, right?

I think this is the logical order of these tasks:

  1. Define v4 in a doc (not just changes, full spec - v3 was never fully defined)
  2. add downgrade API to nbformat (or nbconvert, unclear which), and implement v3->v2
  3. copy v3 to v4, adding empty v4->v3 and v3->v4, removing the py/json distinction (nbconvert is responsible for .py now)
  4. remove worksheet in v4
  5. update msg spec keys that are reflected in notebook
  6. use mime-type output keys
  7. update various metadata keys (this mainly affects javascript code)

v2<->v3 conversion APIs can be done while v4 is being defined, but no part of v4 should be implemented until the spec is documented. Incremental implementations of v4 features, starting with step 4, can be implemented in discrete PRs, probably on a v4 feature branch. Their order relative to each other isn't critically important.

Each time a change is made to the in-development v4 spec:

  • update spec doc
  • update nbformat.v4
  • update v4->v3 and v3->v4
  • update v4->v4?
  • update javascript, if affected
  • update nbconvert, if affected
  • TEST EVERY NEW CHANGE

Full v4 Specification

The specification is being defined using a JSON schema, which notebooks can then be validated against. The actual schema document is being developed in 5733. Additionally, here is an outline of the specification:

Notebook-level format

  • metadata: an object containing any top-level notebook metadata. There are three reserved metadata keys which are optional, but if included must follow the following format:
    • kernel_info: an object containing information about the kernel that the notebook should be run with (see also IPEP 13). It should include the following keys:
      • name: the name of the kernel specification
      • language: the language that the kernel runs
      • codemirror_mode: (optional) the codemirror mode to use when displaying the notebook
    • signature: a string containing the hash of the notebook, for verification purposes
    • orig_nbformat: if the notebook was converted from a different format, this should be an integer indicating the major version of that format
  • nbformat_minor: notebook format minor number
  • nbformat: notebook format major number (should be 4)
  • cells: an array of cells, which should be of type raw, markdown, heading, or code.
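To make the outline concrete, here is a sketch that builds a minimal notebook with these four top-level keys and serializes it as JSON. The helper name new_notebook, the default minor version of 0, and the kernel name "python3" are assumptions for illustration, not values fixed by this proposal.

```python
import json


def new_notebook(cells=None, metadata=None, nbformat_minor=0):
    """Build a notebook dict with the four top-level keys listed above."""
    return {
        "metadata": metadata or {},
        "nbformat": 4,
        "nbformat_minor": nbformat_minor,  # assumption: 0 as a starting value
        "cells": cells or [],
    }


nb = new_notebook(
    metadata={
        "kernel_info": {
            "name": "python3",     # assumption: example kernel name
            "language": "python",
        }
    }
)
text = json.dumps(nb, indent=1, sort_keys=True)
print(text)
```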

Cell-level formats

In general, cells should have:

  • cell_type: a string indicating the cell type, one of "raw", "markdown", "heading", or "code"
  • metadata: an object containing any cell-level metadata. There are two reserved keys, which are optional but if used must conform to the following format (see also IPEP 20)
    • name: a non-empty string representing the cell's name
    • tags: an array of cell tags, each of which is a string. Tags should not contain commas, and should be unique.
  • source: a "multiline string", which is either an array of strings that will be concatenated, or a single string
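A reader of the format has to accept both forms of a "multiline string". A minimal sketch of that normalization (the helper name join_multiline is hypothetical):

```python
def join_multiline(value):
    """Return a multiline string as one str, whichever form it was stored in."""
    if isinstance(value, str):
        return value
    # An array of strings is concatenated as-is; no separators are inserted.
    return "".join(value)


as_array = join_multiline(["import os\n", "print(os.getcwd())"])
as_string = join_multiline("import os\nprint(os.getcwd())")
```

Both calls yield the same text, so downstream code only has to deal with a single string.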

Raw cell format

Raw cells have an additional reserved metadata key:

  • format: a string indicating the raw cell format for use with nbconvert

Markdown cell format

Markdown cells have no additional properties.

Heading cell format

Heading cells should have one additional property:

  • level: an integer from 1-6 indicating the heading level

Code cell format

Code cells should have a few additional properties:

  • outputs: an array of outputs; see the Output formats section below
  • prompt_number: the cell's prompt number, which is either an integer value or null

Code cells also have a few additional reserved metadata keys:

  • collapsed: a boolean indicating whether the cell is collapsed or expanded
  • autoscroll: a value indicating whether the cell should be autoscrolled; should be one of true, false, or "auto"
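The code cell rules above can be checked by hand with something like the following sketch. It is not the JSON schema mentioned earlier, only an illustration of the listed constraints; check_code_cell and its messages are hypothetical.

```python
def check_code_cell(cell):
    """Return a list of problems found; an empty list means the cell looks valid."""
    problems = []
    if cell.get("cell_type") != "code":
        problems.append("cell_type must be 'code'")
    if not isinstance(cell.get("outputs"), list):
        problems.append("outputs must be an array")
    if "prompt_number" not in cell:
        problems.append("prompt_number is missing")
    else:
        number = cell["prompt_number"]
        # bool is a subclass of int in Python, so it is excluded explicitly.
        if number is not None and (isinstance(number, bool) or not isinstance(number, int)):
            problems.append("prompt_number must be an integer or null")
    metadata = cell.get("metadata", {})
    if "collapsed" in metadata and not isinstance(metadata["collapsed"], bool):
        problems.append("collapsed must be a boolean")
    if "autoscroll" in metadata:
        value = metadata["autoscroll"]
        # Identity checks avoid accepting 1 or 0 in place of true or false.
        if not (value is True or value is False or value == "auto"):
            problems.append("autoscroll must be true, false, or 'auto'")
    return problems


cell = {
    "cell_type": "code",
    "metadata": {"collapsed": False, "autoscroll": "auto"},
    "source": "print('hi')",
    "outputs": [],
    "prompt_number": None,
}
problems = check_code_cell(cell)
```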

Output formats

There are four different types of outputs that may be associated with a code cell: execute_result (the result of executing the cell), display_data (data that is displayed from the cell), stream (text that is printed from a stream, usually standard out), and error (the traceback that is produced when an error occurs).

All output formats should have the following properties:

  • output_type: a string, either "execute_result", "display_data", "stream", or "error"
  • metadata: an object containing output metadata. This is mainly used for execute_result and display_data outputs, and should include the same mimetype keys as the output itself. See also IPEP 13.

Execute result

The execute_result output should have the following additional properties:

  • prompt_number: the prompt number of the output (should be the same as the cell's prompt number)
  • mimetype: the key itself should be a valid mimetype (e.g., "text/plain" or "image/png"). The value should be either a string, or an array of strings.

Display data

The display_data output should have the following additional properties:

  • mimetype: the key itself should be a valid mimetype (e.g., "text/plain" or "image/png"). The value should be either a string, or an array of strings.
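Because the mimetype keys sit next to the structural keys in execute_result and display_data outputs, a consumer has to separate the two. The sketch below does this under the assumption that output_type, metadata, and prompt_number are the only non-mimetype keys; the helper name mime_bundle and the sample values are placeholders.

```python
# Keys the outline above uses for structure rather than for mimetype data.
RESERVED_KEYS = {"output_type", "metadata", "prompt_number"}


def mime_bundle(output):
    """Return only the mimetype-keyed entries, joining arrays of strings."""
    bundle = {}
    for key, value in output.items():
        if key in RESERVED_KEYS:
            continue
        # Values may be a single string or an array of strings.
        bundle[key] = value if isinstance(value, str) else "".join(value)
    return bundle


display = {
    "output_type": "display_data",
    "metadata": {},
    "text/plain": ["<Figure>", "\n"],
    "image/png": "<image data>",  # placeholder, not real image data
}
bundle = mime_bundle(display)
```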

Stream

The stream output should have the following additional properties:

  • name: a string denoting the stream type or destination (e.g. "stdout")
  • data: the stream's text output, which is a "multiline string" (stored as either a single string, or an array of strings).
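As an illustration, the sketch below captures text printed to standard out and stores it as a stream output, with the text kept as an array of lines. The helper capture_stream is hypothetical and only mimics what a frontend would record from kernel messages.

```python
import contextlib
import io


def capture_stream(func):
    """Run func and return its printed text as a stream output dict."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        func()
    return {
        "output_type": "stream",
        "metadata": {},
        "name": "stdout",
        # Store the text as an array of strings, one per line, newlines kept.
        "data": buffer.getvalue().splitlines(keepends=True),
    }


stream = capture_stream(lambda: print("hello\nworld"))
```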

Error

The error output should have the following additional properties:

  • ename: the name of the error
  • evalue: the value, or message, of the error
  • traceback: the error's traceback, represented as an array of strings
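For a Python kernel, an error output with these properties could be assembled from a caught exception roughly as follows. This is a sketch using the standard traceback module; the helper name error_output is an assumption, and a real kernel may format its traceback lines differently.

```python
import traceback


def error_output(exc):
    """Build an error output dict from a caught exception."""
    return {
        "output_type": "error",
        "metadata": {},
        "ename": type(exc).__name__,
        "evalue": str(exc),
        # format_exception returns a list of strings, matching the array form.
        "traceback": traceback.format_exception(type(exc), exc, exc.__traceback__),
    }


try:
    1 / 0
except ZeroDivisionError as caught:
    err = error_output(caught)
```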