
execution: add option to write notebook dataframe to disk as JSON #56

Conversation

@lukasheinrich (Contributor) commented Oct 25, 2017

Hi,

papermill allows the notebook to attach arbitrary data during its execution; it would be useful to be able to write that data to disk so that it can be further manipulated by tools outside of the papermill ecosystem, or even outside of Python.

A similar option could be useful for the .data property (which seems to be a JSON-serializable dict).

This PR:

  • adds an option to execute_notebook() to write the dataframe to disk via pandas' to_json() function (see the sketch below)
  • adds the corresponding CLI option --dataframe-file
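A minimal sketch of the intended usage; dataframe_file is the parameter this PR proposes and does not exist in released papermill:

import papermill as pm

# Execute the notebook as usual, and additionally dump its recorded
# dataframe to a standalone JSON file (the behavior proposed here).
pm.execute_notebook(
    'input.ipynb',
    'output.ipynb',
    parameters={'alpha': 0.5},      # illustrative parameter
    dataframe_file='records.json',  # proposed in this PR
)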
@@ -169,7 +170,10 @@ def execute_notebook(notebook, output, parameters=None, kernel_name=None, progre
     # Write final Notebook to disk.
     write_ipynb(nb, output)
     raise_for_execution_errors(nb, output)

+    if dataframe_file:
+        outputnb = read_notebook(output)
Contributor Author (@lukasheinrich):
I'm not sure this is the most elegant way, but nb.dataframe didn't seem to work.

Member:

Yeah it looks like it's not easy to get a papermill.api.Notebook object straightaway with this.

Looks like you'd need to take this nb (nbformat object) and do the following:

from .api import Notebook

n = Notebook() # we should probably let this take an `nbformat` node object
n.path = 'whateverthepathis'
n.node = nb

Member:

Probably in a separate PR, I think we should let papermill.api.Notebook accept both the path (optional) and the node in its __init__, so it can take the nbformat node directly. That would be super helpful for this PR and future ones.
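A minimal sketch of what that could look like; the signature and defaults are assumptions, only the path and node attributes come from the snippet above:

class Notebook(object):
    # Hedged sketch: accept the nbformat node directly, with an optional path.
    def __init__(self, node=None, path=''):
        self.path = path  # where the notebook lives on disk, if anywhere
        self.node = node  # the nbformat NotebookNode produced by execution

Construction would then collapse to n = Notebook(node=nb, path='whateverthepathis').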

Contributor Author (@lukasheinrich):

Something like this: #57?

Member:

Merged, thanks!

Member:

Coming back to this now, I am confused about why nb == outputnb isn't true. It would be good to add a comment, because I predict we will be confused by this again :-/

codecov bot commented Oct 25, 2017

Codecov Report

Merging #56 into master will decrease coverage by 0.02%.
The diff coverage is 57.14%.

@@            Coverage Diff             @@
##           master      #56      +/-   ##
==========================================
- Coverage   52.99%   52.96%   -0.03%     
==========================================
  Files           8        8              
  Lines         836      842       +6     
==========================================
+ Hits          443      446       +3     
- Misses        393      396       +3

@betatim (Member) commented Oct 25, 2017

What is the advantage of writing it to an external file instead of the output notebook (which is just a big JSON document)?

@lukasheinrich (Contributor, author):

In that case you need to understand the notebook format. Say you just want the raw recorded data (actually this may be more applicable to .data than .dataframe); the downstream tool should not need to know that it came from a notebook at all.

@rgbkrk (Member) commented Oct 25, 2017

First off, thanks for making a contribution and diving straight into what's here!

Approach-wise, we'd like to keep the notebook itself as the container format, while writing functions within papermill (whether these Python bindings, R bindings, or a new binding) that know how to extract the right records from the notebook.

My worry over writing out a separate JSON file is that there's more than one way to serialize a dataframe to JSON. Once we start doing it, we have to support it.
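To make the extract-from-the-notebook approach concrete, a minimal sketch built on the reader API this thread already uses (read_notebook, .dataframe, .data); the file names and orient value are illustrative:

import papermill as pm

# The notebook stays the container; papermill does the extraction.
nb = pm.read_notebook('output.ipynb')
df = nb.dataframe    # recorded values as a pandas DataFrame
records = nb.data    # recorded values as a plain dict

# Serializing to JSON is then the caller's decision, not papermill's,
# which sidesteps the "more than one way to serialize" problem:
df.to_json('records.json', orient='records')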

@lukasheinrich (Contributor, author) commented Oct 25, 2017

I agree that supporting multiple persistence formats may be an issue, but that could actually be handled in a uniform way with e.g. #53.

i.e. there might be "persistence options" in general that could include

  • format (json, pickle, dill, cloudpickle, ...)
  • format-specific options, which may differ between e.g. .dataframe/.data and the output format; e.g. if the target is a dataframe and the format is JSON, one may be able to choose an orientation

I guess one way to think of the disk persistence is as a very low-level POSIX binding. The use case I'm looking at is e.g. generating some data and processing it further with jq or other tools that natively support JSON structures.
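For instance, a hypothetical pipeline assuming the --dataframe-file flag from this PR existed (it is not in released papermill):

# Run the notebook, emit the recorded dataframe as a sidecar JSON file,
# then hand it to any JSON-native tool without touching the notebook format.
papermill input.ipynb output.ipynb --dataframe-file records.json
jq '.' records.json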

Alternatively, the notebook could of course write that data to an actual JSON file itself, but it seems the records were purposefully added to the notebook to give it some "output" semantics (to match the "input", which is the parameters).

@lukasheinrich (Contributor, author) commented Oct 25, 2017

By the way, what's the model for .data? Is it expected to be a JSON-serializable thing?

@rgbkrk (Member) commented Oct 25, 2017

I'm going to have to look through it again and come back to this PR after thinking about it overnight. The primary lead on this library just had a baby, so I'll wait until he gets back from paternity leave to pester him.

Hey @adparker what do you think of this PR + discussion?

@lukasheinrich (Contributor, author):

Hi, should we revisit this PR?

@rgbkrk (Member) commented Jan 15, 2018

My opinion on writing out sidecar data files still holds. However, I don't wish to be a blocker.

What do you think @MSeal @willingc?

@MSeal (Member) commented Jan 15, 2018

I think I agree with your opinion @rgbkrk. In principle, the capability sounds like it makes certain workflows simpler. In practice, at a general level, it would open a rather large shoot-yourself-in-the-foot door, where giant dataframes, or dataframes generated from a variety of sources, would need to be supported. A few concerns come to mind: custom types, pandas version support/new features, data size, graceful failures, and testing potential downstream consumers.

I think the idea could be useful in more controlled environments but I don't think it belongs in papermill as a generally supported feature.

@willingc (Member):

@rgbkrk For now, let's stick with simple. Until we have this fully documented and coverage increased, I'm hesitant to add an additional file to keep track of. Once that happens, let's revisit which approach makes most sense longer-term.

@rgbkrk (Member) commented Jan 23, 2018

> Until we have this fully documented and coverage increased, I'm hesitant to add an additional file to keep track of. Once that happens, let's revisit which approach makes most sense longer-term.

Based on this, let's go ahead and close for now @lukasheinrich. Once we've got tests and documentation up, we can start entertaining ideas like this. I'm still super appreciative that you went through the review and followed up with some smaller PRs to fix up other bits in papermill. Keep it coming!

@rgbkrk closed this Jan 23, 2018