execution: add option to write notebook dataframe to disk as JSON #56
Conversation
* adds an option in execute_notebook() to write the dataframe to disk via pandas' to_json() function
* adds a corresponding CLI option --dataframe-file
@@ -169,7 +170,10 @@ def execute_notebook(notebook, output, parameters=None, kernel_name=None, progre
     # Write final Notebook to disk.
     write_ipynb(nb, output)
     raise_for_execution_errors(nb, output)

+    if dataframe_file:
+        outputnb = read_notebook(output)
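A self-contained sketch of what the new branch could go on to do, under the assumption stated in the PR description (pandas' to_json() as the serializer). The helper name `write_dataframe_file` is hypothetical, not papermill API:

```python
import json

def write_dataframe_file(df, dataframe_file):
    """Hypothetical helper: persist the executed notebook's dataframe
    to `dataframe_file` as JSON. A pandas DataFrame exposes to_json()
    (the approach this PR takes); plain dict/list records fall back to
    the stdlib json module."""
    if hasattr(df, "to_json"):
        df.to_json(dataframe_file)  # pandas serialization path
    else:
        with open(dataframe_file, "w") as f:
            json.dump(df, f)
```

The fallback branch only exists to keep the sketch runnable without pandas installed.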
I'm not sure this is the most elegant way, but nb.dataframe didn't seem to work.
Yeah, it looks like it's not easy to get a papermill.api.Notebook object straight away with this. Looks like you'd need to take this nb (an nbformat object) and do the following:
from .api import Notebook
n = Notebook() # we should probably let this take an `nbformat` node object
n.path = 'whateverthepathis'
n.node = nb
Probably in a separate PR, I think we should let papermill.api.Notebook accept both the path (optional) and the node in its __init__, so it can take the nbformat node directly. That would be super helpful for this PR and future ones.
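The suggested constructor change might look like the following sketch; parameter and attribute names mirror the comments above but are assumptions, not papermill's actual code:

```python
class Notebook:
    """Sketch of the proposed papermill.api.Notebook constructor:
    accept the nbformat node directly, with an optional path.
    Names and defaults are assumptions for illustration."""

    def __init__(self, node=None, path=None):
        self.node = node
        self.path = path

# The manual three-step construction in the comment above would
# then collapse to a single call, e.g.:
# n = Notebook(node=nb, path='whateverthepathis')
```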
something like this #57 ?
Merged, thanks!
Coming back to this now, I am confused why nb == outputnb isn't true. It would be good to add a comment, because I predict we will be confused by this again. :-/
Codecov Report
@@ Coverage Diff @@
## master #56 +/- ##
==========================================
- Coverage 52.99% 52.96% -0.03%
==========================================
Files 8 8
Lines 836 842 +6
==========================================
+ Hits 443 446 +3
- Misses 393 396 +3
What is the advantage of writing it to an external file instead of the output notebook (which is just a big JSON document)?
In that case you need to understand the notebook format. Say if you just want the raw
First off, thanks for making a contribution and diving straight in to what's here! Approach-wise, we'd like to keep the container format as the notebook itself, while writing functions within papermill (whether these Python bindings, R bindings, or a new binding) that know how to extract the right records from the notebook. My worry about writing out a separate JSON file is that there's more than one way to serialize a dataframe to JSON. Once we start doing it, we have to support it.
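The "more than one way" point can be illustrated with the stdlib alone. The two layouts below mirror pandas' orient="records" and its default orient="columns" for DataFrame.to_json, though the snippet itself does not depend on pandas:

```python
import json

# The same two-row, two-column table serialized in two common layouts.
# A consumer of a sidecar JSON file must know which one the producer
# chose before it can read the data back.
records_layout = json.dumps([{"a": 1, "b": "x"}, {"a": 2, "b": "y"}])
columns_layout = json.dumps({"a": {"0": 1, "1": 2}, "b": {"0": "x", "1": "y"}})
```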
I agree that supporting multiple ways to persist may be an issue, but that could actually be handled in a uniform way with e.g. #53, i.e. there might be "persistence options" in general that could include
I guess one way to think of the disk persistence is as very low-level POSIX bindings. The use-case I'm looking at is e.g. generating some data and processing it further with
Alternatively, the notebook could of course write that data to an actual JSON file itself, but it seems the records were purposefully added to the notebook to give the notebook some "output" semantics (to match the "input", which is the parameters).
By the way, what's the model for
I'm going to have to look through it again and come back to this PR after thinking about it overnight. The primary lead on this library just had a baby, so I'll wait until he gets back from paternity leave to pester him. Hey @adparker, what do you think of this PR + discussion?
Hi, should we revisit this PR?
My opinion on writing out sidecar data files still holds. However, I don't wish to be a blocker.
I think I agree with your opinion @rgbkrk. In principle the capability sounds like it makes certain workflows simpler. In practice, at a general level, this would open up a rather large shoot-yourself-in-the-foot door where giant dataframes or dataframes generated from a variety of sources would need to be supported. A few concerns come to mind: custom types, pandas version support/new features, data size, graceful failures, and testing potential downstream consumers. I think the idea could be useful in more controlled environments, but I don't think it belongs in papermill as a generally supported feature.
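The custom-types concern above in miniature, using only the stdlib (no papermill involved): values that live happily in a dataframe are not necessarily JSON-serializable, so a generic "write dataframe to JSON" feature needs an explicit policy (custom encoders, coercion, or rejection).

```python
import json
from datetime import date

# json.dumps has no default encoding for datetime.date, so a naive
# dump of mixed-type tabular data fails with a TypeError.
try:
    json.dumps({"when": date(2018, 1, 1)})
    serializable = True
except TypeError:
    serializable = False
```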
@rgbkrk For now, let's stick with simple. Until we have this fully documented and coverage increased, I'm hesitant to add an additional file to keep track of. Once that happens, let's revisit which approach makes the most sense longer-term.
Based on this, let's go ahead and close for now, @lukasheinrich. Once we've got tests and documentation up, we can start entertaining ideas like this. I'm still super appreciative that you went through the review and followed up with some smaller PRs to fix up other bits in papermill. Keep it coming!
Hi,
papermill allows the notebook to attach arbitrary data during its execution; it would be useful to be able to write that data to disk, so that it can be further manipulated by tools outside of the papermill ecosystem, or even outside of Python.
A similar option could be useful for the .data property (which seems to be a JSON'able dict).
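If .data really is a JSON-able dict, persisting it would need nothing beyond the stdlib. A sketch, not papermill API; the helper name is hypothetical:

```python
import json

def write_data_file(data, path):
    """Hypothetical sidecar writer for the .data dict: a plain
    json.dump suffices, since the property is described as already
    JSON-serializable."""
    with open(path, "w") as f:
        json.dump(data, f, indent=2, sort_keys=True)
```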