
execution: add option to write notebook dataframe to disk as JSON #56

Conversation

@lukasheinrich (Contributor) commented Oct 25, 2017

Hi,

papermill allows the notebook to attach arbitrary data during its execution; it would be useful to be able to write that data to disk so that it can be further manipulated by tools outside of the papermill ecosystem, or even outside of Python.

A similar option could be useful for the .data property (which seems to be a JSON-serializable dict).

This PR:

  • adds an option to execute_notebook() to write the dataframe to disk via pandas' to_json() function (see the sketch below)
  • adds the corresponding CLI option --dataframe-file
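A minimal sketch of the intended usage; dataframe_file is the parameter this PR proposes and does not exist in released papermill:

import papermill as pm

# Execute the notebook as usual, and additionally dump its recorded
# dataframe to a standalone JSON file (the behavior proposed here).
pm.execute_notebook(
    'input.ipynb',
    'output.ipynb',
    parameters={'alpha': 0.5},      # illustrative parameter
    dataframe_file='records.json',  # proposed in this PR
)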
@@ -169,7 +170,10 @@ def execute_notebook(notebook, output, parameters=None, kernel_name=None, progre
     # Write final Notebook to disk.
     write_ipynb(nb, output)
     raise_for_execution_errors(nb, output)

+    if dataframe_file:
+        outputnb = read_notebook(output)
Contributor Author (@lukasheinrich):
I'm not sure this is the most elegant way, but nb.dataframe didn't seem to work.

Member:

Yeah it looks like it's not easy to get a papermill.api.Notebook object straightaway with this.

Looks like you'd need to take this nb (nbformat object) and do the following:

from .api import Notebook

n = Notebook() # we should probably let this take an `nbformat` node object
n.path = 'whateverthepathis'
n.node = nb

Member:

Probably in a separate PR, I think we should let papermill.api.Notebook accept both the path (optional) and the node in its __init__, so it can take the nbformat node directly. That would be super helpful for this PR and future ones.
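A minimal sketch of what that could look like; the signature and defaults are assumptions, only the path and node attributes come from the snippet above:

class Notebook(object):
    # Hedged sketch: accept the nbformat node directly, with an optional path.
    def __init__(self, node=None, path=''):
        self.path = path  # where the notebook lives on disk, if anywhere
        self.node = node  # the nbformat NotebookNode produced by execution

Construction would then collapse to n = Notebook(node=nb, path='whateverthepathis').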

Contributor Author (@lukasheinrich):

Something like this: #57?

Member:

Merged, thanks!

Member:

Coming back to this now, I am confused about why nb == outputnb isn't true. It would be good to add a comment, because I predict we will be confused by this again :-/

codecov bot commented Oct 25, 2017

Codecov Report

Merging #56 into master will decrease coverage by 0.02%.
The diff coverage is 57.14%.

@@            Coverage Diff             @@
##           master      #56      +/-   ##
==========================================
- Coverage   52.99%   52.96%   -0.03%     
==========================================
  Files           8        8              
  Lines         836      842       +6     
==========================================
+ Hits          443      446       +3     
- Misses        393      396       +3

@betatim (Member) commented Oct 25, 2017

What is the advantage of writing it to an external file instead of the output notebook (which is just a big JSON document)?

@lukasheinrich (Contributor, author):

In that case you need to understand the notebook format. Say you just want the raw recorded data (actually this may be more applicable to .data than .dataframe); the downstream tool should not need to know that it came from a notebook at all.

@rgbkrk (Member) commented Oct 25, 2017

First off, thanks for making a contribution and diving straight into what's here!

Approach-wise, we'd like to keep the notebook itself as the container format, while writing functions within papermill (whether these Python bindings, R bindings, or a new binding) that know how to extract the right records from the notebook.

My worry over writing out a separate JSON file is that there's more than one way to serialize a dataframe to JSON. Once we start doing it, we have to support it.
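To make the extract-from-the-notebook approach concrete, a minimal sketch built on the reader API this thread already uses (read_notebook, .dataframe, .data); the file names and orient value are illustrative:

import papermill as pm

# The notebook stays the container; papermill does the extraction.
nb = pm.read_notebook('output.ipynb')
df = nb.dataframe    # recorded values as a pandas DataFrame
records = nb.data    # recorded values as a plain dict

# Serializing to JSON is then the caller's decision, not papermill's,
# which sidesteps the "more than one way to serialize" problem:
df.to_json('records.json', orient='records')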

@lukasheinrich (Contributor, author) commented Oct 25, 2017

I agree that supporting multiple persistence formats may be an issue, but that could actually be handled in a uniform way with e.g. #53.

i.e. there might be "persistence options" in general that could include

  • format (json, pickle, dill, cloudpickle, ...)
  • format-specific options, which may differ between e.g. .dataframe/.data and the output format; e.g. if the target is a dataframe and the format is JSON, one may be able to choose an orientation

I guess one way to think of the disk persistence is as a very low-level POSIX binding. The use case I'm looking at is e.g. generating some data and processing it further with jq or other tools that natively support JSON structures.
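For instance, a hypothetical pipeline assuming the --dataframe-file flag from this PR existed (it is not in released papermill):

# Run the notebook, emit the recorded dataframe as a sidecar JSON file,
# then hand it to any JSON-native tool without touching the notebook format.
papermill input.ipynb output.ipynb --dataframe-file records.json
jq '.' records.json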

Alternatively, the notebook could of course write that data to an actual JSON file itself, but it seems the records were purposefully added to the notebook to give it some "output" semantics (to match the "input", which is the parameters).

@lukasheinrich (Contributor, author) commented Oct 25, 2017

By the way, what's the model for .data? Is it expected to be a JSON-serializable thing?

@rgbkrk (Member) commented Oct 25, 2017

I'm going to have to look through it again and come back to this PR after thinking about it overnight. The primary lead on this library just had a baby, so I'll wait until he gets back from paternity leave to pester him.

Hey @adparker what do you think of this PR + discussion?

@lukasheinrich (Contributor, author):

Hi, should we revisit this PR?

@rgbkrk (Member) commented Jan 15, 2018

My opinion on writing out sidecar data files still holds. However, I don't wish to be a blocker.

What do you think @MSeal @willingc?

@MSeal (Member) commented Jan 15, 2018

I think I agree with your opinion @rgbkrk. In principle, the capability sounds like it makes certain workflows simpler. In practice, at a general level, it would open a rather large shoot-yourself-in-the-foot door, where giant dataframes, or dataframes generated from a variety of sources, would need to be supported. A few concerns come to mind: custom types, pandas version support/new features, data size, graceful failures, and testing potential downstream consumers.

I think the idea could be useful in more controlled environments but I don't think it belongs in papermill as a generally supported feature.

@willingc (Member):

@rgbkrk For now, let's stick with simple. Until we have this fully documented and coverage increased, I'm hesitant to add an additional file to keep track of. Once that happens, let's revisit which approach makes most sense longer-term.

@rgbkrk (Member) commented Jan 23, 2018

> Until we have this fully documented and coverage increased, I'm hesitant to add an additional file to keep track of. Once that happens, let's revisit which approach makes most sense longer-term.

Based on this, let's go ahead and close for now @lukasheinrich. Once we've got tests and documentation up, we can start entertaining ideas like this. I'm still super appreciative that you went through the review and followed up with some smaller PRs to fix up other bits in papermill. Keep it coming!

@rgbkrk closed this Jan 23, 2018