This repository has been archived by the owner on Mar 26, 2019. It is now read-only.

Notebook normalization for version control #174

Closed
akhmerov opened this issue May 30, 2017 · 12 comments

@akhmerov
Member

akhmerov commented May 30, 2017

For a while we have been using nbconvert with ClearOutputPreprocessor to make notebooks git-friendly. However, I have observed that it leaves several cell and notebook metadata fields intact. Looking at the latest nbformat, it appears that the following cell metadata doesn't belong in version control:

  • collapsed
  • autoscroll
  • output metadata isolated

I cannot quite find the standard kernelspec JSON entries, but I remember that at the very least the minor language version sometimes caused trouble.

Now to my question: what is the best way to strip the above metadata, and is there anything available short of writing our own preprocessor?

@takluyver
Member

nbdime can hide metadata changes, which may be some help. We also try to leave fields like those out of the metadata when they would have their default value, to minimise clutter in version control.

@akhmerov
Member Author

Yes, we use nbdime in our workflow, but I didn't realize that it can be used as a git filter; can it?

Would it be reasonable to remove output-related metadata when the corresponding output is removed?

@takluyver
Member

Possibly not as a filter; it looks like it integrates with the diff and merge operations themselves:
http://nbdime.readthedocs.io/en/latest/vcs.html

Yes, I think metadata related to outputs should probably be cleared when the outputs are.
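
As an illustration of the idea above, here is a minimal sketch that clears code-cell outputs together with the output-related cell metadata keys named at the top of the thread (`collapsed`, `autoscroll`). The function name and the key list are assumptions made for this example, not part of nbconvert or nbformat. It works on any dict-like notebook in the v4 layout, such as the object returned by `nbformat.read(path, as_version=4)`.

```python
# Cell metadata keys that only describe how outputs are displayed.
# This list is an assumption taken from the keys named in this thread.
OUTPUT_RELATED_KEYS = ("collapsed", "autoscroll")


def clear_outputs_and_metadata(nb):
    """Clear code-cell outputs and the cell metadata that refers to them.

    `nb` is a dict-like notebook in the v4 layout. It is modified in place
    and also returned for convenience.
    """
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells have no outputs
        cell["outputs"] = []
        cell["execution_count"] = None
        metadata = cell.get("metadata", {})
        for key in OUTPUT_RELATED_KEYS:
            metadata.pop(key, None)
    return nb
```

Per-output metadata such as `isolated` lives inside each output, so it disappears along with the outputs and needs no separate handling here.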

@akhmerov
Member Author

Without having a filter there's extra VCS noise generated.

> Yes, I think metadata related to outputs should probably be cleared when the outputs are.

Where shall I open the issue? Definitely nbconvert, but also the notebook UI doesn't clean the metadata on clearing output.

@takluyver
Member

I'd go for it on the notebook repo.

@rsvp

rsvp commented May 30, 2017

In the meantime, is there a way to clear output cells with just images?
For example, lines containing "image/png": -- without losing trust state.

A major issue in committing notebooks to repositories is, of course, the
bulk introduced by images.

@takluyver
Member

I don't think there's anything ready-made to do that, but it should be pretty easy (~10 lines in Python) to load a notebook, filter out image outputs and write the result out.
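
A rough sketch of what such a filter might look like, assuming the v4 notebook layout in which rich outputs carry a `data` dict keyed by MIME type (for example `"image/png"`). The helper names are hypothetical. The functions accept any dict-like notebook, so they can be applied to the result of `nbformat.read(path, as_version=4)` and the result saved with `nbformat.write(nb, path)`.

```python
def has_image(output):
    """Return True if an output carries any image/* representation."""
    return any(mime.startswith("image/") for mime in output.get("data", {}))


def drop_image_outputs(nb):
    """Remove every code-cell output that contains image data.

    Text-only outputs (streams, errors, plain-text results) are kept.
    The notebook is modified in place and also returned.
    """
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = [
                out for out in cell.get("outputs", []) if not has_image(out)
            ]
    return nb
```

This drops the whole output when it has an image representation. Keeping the output and deleting only the `image/*` entries from its `data` dict would be an equally plausible variant.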

@rsvp

rsvp commented May 31, 2017

@takluyver I suppose the filter could be a sed one-liner, but
the notebook will probably lose its trusted state, so which
non-interactive utilities did you have in mind?

@takluyver
Member

I wouldn't try to sed JSON. That's not going to be remotely reliable. I was thinking Python code, using the nbformat library.

@akhmerov
Member Author

closing as answered

@minrk
Member

minrk commented Jun 1, 2017

nbstripout provides git filters for notebooks for doing this kind of thing. If it doesn't cover these metadata keys, they can be added pretty easily. It may be appropriate to bring nbstripout's functionality into nbdime eventually if we want to have a single package providing a "notebooks in version control" toolkit, but for now nbdime is only a renderer and merge tool.
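
For readers unfamiliar with the mechanism: a git clean filter is a program that receives a file's content on standard input and writes the version to be staged on standard output. The sketch below illustrates that shape only. It is not nbstripout's implementation, and the function name is made up for this example.

```python
import json


def clean_filter(stream_in, stream_out):
    """Read a notebook from stream_in and write it without outputs to stream_out."""
    nb = json.load(stream_in)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    # Stable key order and indentation keep the staged text diff-friendly.
    json.dump(nb, stream_out, indent=1, sort_keys=True, ensure_ascii=False)
    stream_out.write("\n")


# When run as a script, the entry point would be:
#     import sys
#     clean_filter(sys.stdin, sys.stdout)
```

Assuming the script is saved as `strip_filter.py` (a hypothetical name), it could be registered with `git config filter.stripnb.clean "python strip_filter.py"` and enabled for notebooks by a `.gitattributes` line such as `*.ipynb filter=stripnb`. The filter name `stripnb` is likewise only an example.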

@rsvp

rsvp commented Jun 1, 2017

@minrk Thanks for your great suggestion. Does nbstripout
need to sign its output using nbformat.sign.NotebookNotary()
to leave the modified notebook in a trusted state?

Re: stripping only images: kynan/nbstripout#58


4 participants