Split likely multiline text when writing JSON notebooks #981

Merged
merged 3 commits into from Nov 10, 2011

Projects

None yet

3 participants

@minrk
Member
minrk commented Nov 9, 2011

I did not increment the nbformat, because this is JSON-specific, and doesn't actually change the notebook data structure itself. Also, when loading a notebook multiline strings are only rejoined if they have been split, which means that all existing json notebooks are still perfectly valid after the change, but notebooks saved after the change will have split multiline blocks.

I was able to open/re-save all the example notebooks after making the change, and it seems to work just fine, and nbformat tests continue to pass without alteration.

So for now, it should play a bit nicer with VCS, while we can think further about a v3 format with separate output-caching, etc. as discussed on-list.

@fperez
Member
fperez commented Nov 9, 2011

Awesome!! I don't have time to check it carefully right now, but many thanks for this! It's excellent, and it really will be very useful moving forward...

@minrk
Member
minrk commented Nov 9, 2011

Since this is a change to the file format, we do have to be very careful. Though I'm not too concerned about it, because the only adversely affected parties are users who update to master trying to share notebooks with uses who are running IPython dev, but are behind, who will not be able to read the files until they catch up. Sharing in the other direction is not affected.

@fperez fperez commented on the diff Nov 10, 2011
IPython/frontend/html/notebook/notebookmanager.py
@@ -118,7 +118,12 @@ class NotebookManager(LoggingConfigurable):
if format not in self.allowed_formats:
raise web.HTTPError(415, u'Invalid notebook format: %s' % format)
last_modified, nb = self.get_notebook_object(notebook_id)
- data = current.writes(nb, format)
+ kwargs = {}
+ if format == 'json':
+ # don't split lines for sending over the wire, because it
+ # should match the Python in-memory format.
+ kwargs['split_lines'] = False
+ data = current.writes(nb, format, **kwargs)
@fperez
fperez Nov 10, 2011 IPython member

Do we really need to create kwargs here? Why not just

split_lines = False if format=='json' else True
data = current.writes(nb, format, split_lines=split_lines)
@minrk
minrk Nov 10, 2011 IPython member

because it's not a valid kwarg for writers other than json. So it's either create kwargs or have two distinct current.writes(nb,format...) lines. Not that there's a significant difference, but I could imagine that there might be further customization that we want to specify when sending over the wire that differ from the defaults that we use when writing a file.

@fperez fperez commented on the diff Nov 10, 2011
IPython/nbformat/v2/tests/test_json.py
@@ -16,6 +16,19 @@ class TestJSON(TestCase):
# print
# print s
self.assertEquals(reads(s),nb0)
+
+ def test_roundtrip_nosplit(self):
@fperez
fperez Nov 10, 2011 IPython member

Do these tests fully exercise the two functions above, or would we need further tests for them? Just curious...

@minrk
minrk Nov 10, 2011 IPython member

They test the roundtrip-reproducibility of the notebook through a save action, nothing further. So they are tested exactly as much as other exports.

There is nothing about the actual exported document that is tested, and there never has been. I can add further tests of the intermediate nb structure.

What we really need is a utility to generate all the various combinations of valid notebooks - with/without all combinations of outputs, etc. Right now, we just have a couple that have extremely basic structure.

@ellisonbg
Member

@minrk does this change also affect Markdown cells and output as well?

@ellisonbg ellisonbg closed this Nov 10, 2011
@minrk minrk reopened this Nov 10, 2011
@minrk
Member
minrk commented Nov 10, 2011

@ellisonbg - yes, everything that is reasonably likely to be multiline:

  • input
  • outputs: text, html, json, javascript, svg
  • both source and rendered for text cells
@fperez
Member
fperez commented Nov 10, 2011

I've just tested it by making a toy git repo with a notebook and making changes to it, it's great! The diffs are indeed much more readable, so I'm a big +1 on this one.

One last question: @ellisonbg and I were chatting yesterday, and wondering if the json text wouldn't be more readable if we decreased the indentation of the writer to two, or even just one space per level. That way, the code blocks would be much closer to the left margin, reducing the need for scrolling.

Since in this case the spaces are not semantically meaningful, I like the idea of trying out like that. The code is here. I tried it out locally and even indent of just 1 looks actually pretty good. Because there's so much nesting in json, even just one space per level actually produces pretty readable json.

Give it a try and see what you think; after trying 2 and 1, my vote is actually for 1 (I'm surprised, I would have intuitively thought that would be too tight, but I actually like how the notebooks read this way).

I'd like to merge this one with all these changes included, so we can tell people who update to expect just one round of changes to the on-disk format of their notebooks.

For reference, here's the dump of a toy notebook with indent=1:

{
 "metadata": {
  "name": "nb"
 }, 
 "nbformat": 2, 
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown", 
     "source": [
      "# new title", 
      "", 
      "some text that changes", 
      "", 
      "* one is the first", 
      "* two", 
      "* three", 
      "", 
      "and more!"
     ]
    }, 
    {
     "cell_type": "code", 
     "collapsed": false, 
     "input": [
      "if 2:", 
      "    print 'hi ho'", 
      "    ", 
      "def ff(x):", 
      "    \"\"\"func\"\"\"", 
      "    return x**3", 
      "", 
      "ff(10)"
     ], 
     "language": "python", 
     "outputs": [
      {
       "output_type": "stream", 
       "stream": "stdout", 
       "text": [
        "hi ho"
       ]
      }, 
      {
       "output_type": "pyout", 
       "prompt_number": 3, 
       "text": [
        "1000"
       ]
      }
     ], 
     "prompt_number": 3
    }, 
    {
     "cell_type": "code", 
     "collapsed": true, 
     "input": [], 
     "language": "python", 
     "outputs": []
    }
   ]
  }
 ]
}
@minrk
Member
minrk commented Nov 10, 2011

given the level of nesting, (input lines are indented 6x, and outputs are 8), I'm fine with 1.

@fperez
Member
fperez commented Nov 10, 2011

Awesome; I'll merge now and will send an announcement on the list.

@fperez fperez merged commit d6aa307 into ipython:master Nov 10, 2011
@minrk minrk deleted the minrk:nblines branch Mar 31, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment