nbformat 4
nbformat_minor 2
markdown
This notebook is part of the `jupyter_format` documentation:
https://jupyter-format.readthedocs.io/.
cell_metadata
{
"nbsphinx": "hidden"
}
markdown
# Motivation
## Status Quo
The original format for [Jupyter](https://jupyter.org/) notebooks
uses [JSON](http://json.org/) as the underlying storage format.
This has the great advantage that such files are very easy to handle
programmatically in many different environments,
because JSON parsers are readily available for many programming languages.
One disadvantage, however, is that the format is only semi-human-readable
and not easily human-editable.
All textual content
(e.g. text in Markdown cells and source code in code cells)
is stored in lists of JSON strings -- one string for each line.
This means that each line is surrounded by quotes (`"`) and
strings are separated by commas (`,`),
while lists of strings are surrounded by brackets (`[` and `]`).
On top of that,
several common characters are not allowed in JSON strings,
which means that they have to be escaped by backslashes,
e.g. `\"` and `\n`.
And since a backslash is used for escaping,
a literal backslash occurring in the text
(which is quite common in programming languages and markup languages)
has to be escaped itself (`\\`).
markdown
As an example, let's create a notebook
containing the previous two sentences:
code
import nbformat
nb = nbformat.v4.new_notebook()
nb.cells.append(nbformat.v4.new_markdown_cell(
r"""On top of that,
several common characters are not allowed in JSON strings,
which means that they have to be escaped by backslashes,
e.g. `\"` and `\n`.
And since a backslash is used for escaping,
a literal backslash occurring in the text
(which is quite common in programming languages and markup languages)
has to be escaped itself (`\\`)."""))
markdown
The JSON-based storage of this minimal notebook looks like this:
code
print(nbformat.writes(nb))
markdown
Escaped characters and JSON syntax elements
make this harder than necessary to read,
and even harder to modify with a text editor.
When editing this by hand,
it is easy to mess up the JSON representation
by e.g. forgetting a comma.
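
A minimal sketch of that failure mode, using only Python's standard `json` module.
The two strings are made-up stand-ins for a hand-edited cell
(they are not real notebook files),
and `try_parse` is a hypothetical helper:

```python
import json

# Hypothetical fragment of a cell: a list with one JSON string per line.
intact = '{"source": ["first line\\n", "second line"]}'
# The same fragment after a careless edit removed the comma between the strings.
broken = '{"source": ["first line\\n" "second line"]}'


def try_parse(text):
    """Return the parsed object, or a short message if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as error:
        return f"invalid JSON at line {error.lineno}, column {error.colno}"


print(try_parse(intact))
print(try_parse(broken))
```

A single missing character is enough to make the whole file unreadable for a JSON parser.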
As a comparison,
the same notebook is stored like this in the proposed new format:
code
import jupyter_format
print(jupyter_format.serialize(nb))
markdown
This is exactly the same as the original Markdown content,
except that it is indented by 4 spaces.
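
The idea of storing text by indentation alone can be sketched in a few lines.
This is only an illustration under the assumption that every line gets the same 4-space prefix;
`indent_block` and `unindent_block` are hypothetical helpers,
not functions of `jupyter_format`:

```python
def indent_block(text, prefix="    "):
    """Prefix every line of `text`; nothing is quoted or escaped."""
    return "".join(prefix + line for line in text.splitlines(keepends=True))


def unindent_block(block, prefix="    "):
    """Undo `indent_block`; reject lines that lack the prefix."""
    lines = []
    for line in block.splitlines(keepends=True):
        if not line.startswith(prefix):
            raise ValueError(f"line is not indented: {line!r}")
        lines.append(line[len(prefix):])
    return "".join(lines)


source = r"""e.g. `\"` and `\n`.
A literal backslash has to be escaped itself (`\\`)."""

stored = indent_block(source)
print(stored)
# Quotes and backslashes survive the round trip unchanged.
assert unindent_block(stored) == source
```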
markdown
## YAML as Almost-Solution
It has been known for a long time
(probably since the inception of Jupyter/IPython notebooks)
that lists of JSON strings are not nicely readable for humans.
An obvious alternative would be to use [YAML](https://yaml.org),
which provides multiple ways to store text content.
One of those ways is the so-called [literal style],
which doesn't require any escaping,
making the text much more readable.
[literal style]: https://yaml.org/spec/1.2/spec.html#style/block/literal
This was already suggested in several blog posts:
* https://matthiasbussonnier.com/posts/05-YAML%20Notebook.html
* http://droettboom.com/blog/2018/01/18/diffable-jupyter-notebooks/
And there are even some implementations available:
* https://github.com/prabhuramachandran/ipyaml
* https://github.com/mdboom/nbconvert_vc
markdown
This is an example of what a YAML-based storage format could look like:
code
yaml_content = """
nbformat: 4
nbformat_minor: 2
cells:
- cell_type: markdown
  source: |+2
    # A Jupyter Notebook
    This is a code cell:
  metadata: {}
- cell_type: code
  source: |+2
    print('Hello, world!')
  outputs:
  - output_type: stream
    name: stdout
    text: |+2
      Hello, world!
  execution_count: 1
  metadata: {}
metadata: {}
"""
markdown
This is valid YAML, compatible with both version 1.1 and 1.2.
Let's use [PyYAML](https://pyyaml.org) to read this:
code
import yaml
nb_dict = yaml.safe_load(yaml_content)
nb_dict
markdown
This Python dictionary can easily be converted to a notebook node:
code
nb = nbformat.from_dict(nb_dict)
markdown
And we can use `nbconvert` to convert this to HTML:
code
from nbconvert.exporters import HTMLExporter
html_content, resources = HTMLExporter().from_notebook_node(nb)
code
import urllib.parse
data_uri = 'data:text/html;charset=utf-8,' + urllib.parse.quote(html_content)
code
from IPython.display import IFrame
IFrame(data_uri, width='100%', height='250')
markdown
This looks promising, doesn't it?
The problem is, as so often, in the details.
It's great that we can use *literal style* without
littering the text with quotes and escape characters,
but sadly, YAML only allows
[printable characters](https://yaml.org/spec/1.2/spec.html#printable%20character).
This means that we cannot use some control characters
which might occur in cell outputs,
for example the escape character (`ESC`, `\x1b`) that starts ANSI escape sequences.
There are two options here:
* Go back to escaped strings, at least in some circumstances.
But that's exactly what we wanted to avoid by using YAML!
* Don't use YAML after all
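
The restriction can be checked without a YAML library.
The following sketch transcribes the printable character set from the YAML 1.2 specification
into a regular expression;
`NON_PRINTABLE`, `non_printable_chars` and the sample string are made up for illustration:

```python
import json
import re

# Everything outside YAML 1.2's set of printable characters:
# tab, line feed, carriage return, U+0020-U+007E, U+0085,
# U+00A0-U+D7FF, U+E000-U+FFFD and U+10000-U+10FFFF.
NON_PRINTABLE = re.compile(
    r"[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def non_printable_chars(text):
    """Return the distinct characters of `text` that YAML does not allow unescaped."""
    return sorted(set(NON_PRINTABLE.findall(text)))


# Hypothetical cell output: red text produced with an ANSI escape sequence.
colored = "\x1b[31mError\x1b[0m"

print(non_printable_chars(colored))
# JSON has the same limitation, which is why it falls back to an escape.
print(json.dumps(colored))
```

The only offending character in this example is `ESC`,
but a single one is enough to rule out an unescaped block of text.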
markdown
## Other Partial Solutions
Some alternative notebook formats are supported by the very popular projects
https://github.com/aaren/notedown (Markdown) and
https://github.com/mwouts/jupytext (Markdown, Rmd, Julia/Python/R-scripts etc.).
Those can be very useful, but none of them can store cell outputs,
so they cannot fully replace the current storage format.
markdown
## The Need for a Custom Format
It looks like none of the existing formats is sufficient.
Perhaps we can achieve our goals with a custom format.
Having to implement a custom parser for such a custom format
is of course a disadvantage,
but if we keep it really simple,
maybe we can get away with it?
Remember the YAML example from [above](#YAML-as-Almost-Solution)?
code
print(yaml_content)
markdown
The contained text (Markdown and Python source code)
is quite readable,
but it is still stuffed with many distracting things in between.
Since we are not limited by YAML anymore,
we can aggressively reduce this to only contain
the absolutely necessary information:
code
content = """nbformat 4
nbformat_minor 2
markdown
    # A Jupyter Notebook
    This is a code cell:
code 1
    print('Hello, world!')
stream stdout
    Hello, world!
"""
markdown
And that's the proposed new format!
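
To get a feeling for how little parsing such a layout needs,
here is a rough sketch that only groups lines into header lines and indented bodies.
It assumes that bodies are indented by exactly 4 spaces and that headers start in the first column;
`split_blocks` is a hypothetical helper and by no means the parser used by `jupyter_format`,
which has to handle more (e.g. metadata and other output types):

```python
def split_blocks(text, indent="    "):
    """Group lines into (header, body) pairs based on indentation only."""
    blocks = []
    for line in text.splitlines():
        if line.startswith(indent):
            if not blocks:
                raise ValueError("indented line before any header")
            # Body line: drop the indentation, keep the rest verbatim.
            blocks[-1][1].append(line[len(indent):])
        elif line.strip():
            # Non-indented, non-empty line: start a new block.
            blocks.append((line, []))
        elif blocks:
            # Empty line: keep it as part of the current body.
            blocks[-1][1].append("")
    return [(header, "\n".join(body)) for header, body in blocks]


example = (
    "nbformat 4\n"
    "nbformat_minor 2\n"
    "markdown\n"
    "    # A Jupyter Notebook\n"
    "    This is a code cell:\n"
    "code 1\n"
    "    print('Hello, world!')\n"
    "stream stdout\n"
    "    Hello, world!\n"
)

for header, body in split_blocks(example):
    print(repr(header), repr(body))
```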
The `content` string can be converted to a notebook node
(which will look the same as in the YAML example above):
code
jupyter_format.deserialize(content)
markdown
Just to make sure it is a valid Jupyter notebook node:
code
nbformat.validate(_)
markdown
## Complementary Tools
Oftentimes cell outputs (e.g. plots) stored in notebooks
make it hard to read and manipulate the text representation
of such notebooks.
They also make such notebooks hard to use with version control systems
(e.g. Git).
The proposed new format has the same problem:
outputs are still stored in the notebook file,
right next to the code cells that generated them.
It is recommended to remove all outputs from a notebook
before storing it in version control
or before doing any manipulations with a text editor.
Outputs can be removed manually in the Jupyter user interface,
but there are also tools to remove outputs programmatically:
* https://github.com/kynan/nbstripout
* https://github.com/choldgraf/nbclean
* https://github.com/toobaz/ipynb_output_filter
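
What such tools do at their core can be sketched on a plain notebook dictionary
(as obtained from the JSON-based format).
This is a simplified illustration, not the implementation of any of the tools listed above;
`strip_outputs` and the example notebook are hypothetical:

```python
import copy


def strip_outputs(notebook):
    """Return a copy of an nbformat-4 notebook dict without code cell outputs."""
    stripped = copy.deepcopy(notebook)
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


# Hypothetical minimal notebook with one executed code cell.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": "# A Jupyter Notebook"},
        {
            "cell_type": "code",
            "metadata": {},
            "source": "print('Hello, world!')",
            "execution_count": 1,
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": "Hello, world!\n"}
            ],
        },
    ],
}

clean = strip_outputs(notebook)
print(clean["cells"][1])
```

The original dictionary is left untouched, only the returned copy has its outputs removed.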
If you want to present your notebooks publicly,
you often want to show the outputs to your audience,
without them having to run the notebooks themselves.
So do you have to store your outputs after all?
No! You can still store your notebooks without outputs
and run your notebooks on a server that will re-create
the outputs.
One tool to do this is:
* https://nbsphinx.readthedocs.io/
This is a [Sphinx](http://www.sphinx-doc.org/)
extension that can convert a bunch
of Jupyter notebooks (and other source files)
to HTML and PDF pages (and other output formats).
This way you have the best of both worlds:
No outputs in your (version controlled) notebook files,
but full outputs in the public HTML (or PDF) version.
There are still some cases where you do want to store
the outputs for some reason.
Because of the outputs, it is hard to see the changes
to the text/code content of the notebook
with traditional tools like `diff`.
But luckily, there is a tool that can make
meaningful "diffs" for Jupyter notebooks:
* https://github.com/jupyter/nbdime
notebook_metadata
{
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.2+"
}
}