strip output extension data/metadata #85

Closed
stas00 opened this issue Aug 6, 2018 · 5 comments · Fixed by #92

Comments

@stas00

stas00 commented Aug 6, 2018

Would it be possible to have an option to strip out extension data as well?

Different users use different extensions in different ways, and nbstripout currently doesn't strip that data, causing conflicts and/or unnecessary commit noise.

Examples:

a) If I use the ToC extension, but others don't, I end up with:

-   "toc_position": {},
+   "toc_position": {
+    "height": "calc(100% - 180px)",
+    "left": "10px",
+    "top": "150px",
+    "width": "290.391px"
+   },
    "toc_section_display": "block",
-   "toc_window_display": false
+   "toc_window_display": true

This is inside the "toc" entry. Perhaps there is a known set of Jupyter core top-level entries that could be kept, with all the extension top-level entries removed during stripout?

b) If I use the Collapse Headers extension, it adds a bunch of metadata noise:

   {
    "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {
+    "heading_collapsed": true
+   },
    "source": [
     "# Fin"
    ]
@@ -1154,7 +1310,9 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "hidden": true
+   },
    "outputs": [],
    "source": []

Perhaps there could be an option to force metadata to be set to {}? That would solve the problem for this particular extension.
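
For illustration, a minimal sketch of what such an option could do, assuming the notebook has been loaded as a plain dict with the standard library `json` module. The function name `clear_cell_metadata` is hypothetical and not part of nbstripout:

```python
import json


def clear_cell_metadata(nb):
    """Reset every cell's metadata to {} in a notebook dict.

    Assumes the nbformat 4 layout, where cells live under nb["cells"].
    Hypothetical helper, not an nbstripout function.
    """
    for cell in nb.get("cells", []):
        cell["metadata"] = {}
    return nb


# Small in-memory notebook mirroring the diff above.
example = {
    "cells": [
        {"cell_type": "markdown",
         "metadata": {"heading_collapsed": True},
         "source": ["# Fin"]},
        {"cell_type": "code", "execution_count": None,
         "metadata": {"hidden": True},
         "outputs": [], "source": []},
    ],
    "metadata": {},
}

cleaned = clear_cell_metadata(example)
cleaned_json = json.dumps(cleaned, indent=1)
```

Note that this drops all cell metadata, including keys a user may want to keep (for example tags), so it is a blunt instrument.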

Thank you!

@stas00
Author

stas00 commented Aug 7, 2018

I modified the code to do what I wanted above.

Perhaps these could be made configurable? Different extensions add different metadata inside cells and push their own entries into the global metadata entry.

diff --git a/nbstripout.py b/nbstripout.py
index ef16ff3..c67c080 100755
--- a/nbstripout.py
+++ b/nbstripout.py
@@ -152,6 +152,8 @@ def strip_output(nb, keep_output, keep_count):

     nb.metadata.pop('signature', None)
     nb.metadata.pop('widgets', None)
+    nb.metadata.pop('toc', None)
+    nb.metadata.pop('varInspector', None)

     for cell in _cells(nb):

@@ -189,7 +191,7 @@ def strip_output(nb, keep_output, keep_count):
             if output_style in cell.metadata:
                 cell.metadata[output_style] = False
         if 'metadata' in cell:
-            for field in ['collapsed', 'scrolled', 'ExecuteTime']:
+            for field in ['collapsed', 'scrolled', 'ExecuteTime', 'heading_collapsed', 'hidden']:
                 cell.metadata.pop(field, None)
     return nb
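
A rough sketch of how the hard-coded lists in the diff above might be made configurable, assuming a notebook loaded as a plain dict. The function `strip_extra_metadata` and its parameters `extra_nb_keys` and `extra_cell_keys` are hypothetical names chosen for illustration, not nbstripout's actual interface:

```python
def strip_extra_metadata(nb, extra_nb_keys=(), extra_cell_keys=()):
    """Remove caller-supplied keys from notebook-level and cell-level metadata.

    Hypothetical helper: the key lists are passed in by the caller
    instead of being hard-coded.
    """
    nb_meta = nb.get("metadata", {})
    for key in extra_nb_keys:
        nb_meta.pop(key, None)

    for cell in nb.get("cells", []):
        cell_meta = cell.get("metadata", {})
        for key in extra_cell_keys:
            cell_meta.pop(key, None)
    return nb


example = {
    "metadata": {"toc": {"toc_window_display": True},
                 "varInspector": {},
                 "kernelspec": {"name": "python3"}},
    "cells": [
        {"cell_type": "markdown",
         "metadata": {"heading_collapsed": True},
         "source": ["# Fin"]},
        {"cell_type": "code",
         "metadata": {"hidden": True, "tags": ["keep-me"]},
         "outputs": [], "source": []},
    ],
}

result = strip_extra_metadata(
    example,
    extra_nb_keys=["toc", "varInspector"],
    extra_cell_keys=["heading_collapsed", "hidden"],
)
```

Keys that are not listed, such as `kernelspec` or `tags` here, are left untouched.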

@kynan
Owner

kynan commented Aug 12, 2018

Do you want to submit a PR with your changes?

Making those configurable would be an option. However, I'm leaning more and more towards a whitelist model where we specify which fields to keep and drop all the rest.
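
As a sketch only, a whitelist model along these lines could look like the following. The sets `KEEP_NB_KEYS` and `KEEP_CELL_KEYS` and the function `strip_with_whitelist` are assumptions made for illustration. Which keys belong on the whitelist is exactly the open design question, and this is not how nbstripout is known to implement it:

```python
# Assumed whitelists, for illustration only.
KEEP_NB_KEYS = {"kernelspec", "language_info"}
KEEP_CELL_KEYS = {"tags", "name"}


def keep_only(metadata, allowed):
    """Return a copy of metadata containing only the allowed keys."""
    return {k: v for k, v in metadata.items() if k in allowed}


def strip_with_whitelist(nb, nb_keys=KEEP_NB_KEYS, cell_keys=KEEP_CELL_KEYS):
    """Drop every metadata key that is not explicitly whitelisted."""
    nb["metadata"] = keep_only(nb.get("metadata", {}), nb_keys)
    for cell in nb.get("cells", []):
        cell["metadata"] = keep_only(cell.get("metadata", {}), cell_keys)
    return nb


example = {
    "metadata": {"kernelspec": {"name": "python3"},
                 "toc": {"toc_window_display": True},
                 "some_future_extension": {"x": 1}},
    "cells": [
        {"cell_type": "code",
         "metadata": {"hidden": True, "tags": ["slow"]},
         "outputs": [], "source": []},
    ],
}

result = strip_with_whitelist(example)
```

The appeal of this design is that metadata from extensions nobody has heard of yet (`some_future_extension` above) is dropped without any change to the tool.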

@stas00
Author

stas00 commented Aug 12, 2018

Since my post I have added even more fields to remove, and I'm sure people using other extensions will have others. Maintaining such a list won't be very efficient, and it could also surprise users if a field that wasn't stripped before suddenly gets stripped after someone updates nbstripout.

So yes, I totally agree with you that a whitelist model is a much better way to go.

Thank you.

It would also be nice to switch to a faster JSON parser. With many notebooks under git, nbstripout used as a git filter now noticeably slows things down (git status, git prompt, etc.). This is a totally different issue; the reason I mention it here is that I'm experimenting with a much faster way of doing it with jq, except that it comes with a bunch of dependencies.

Update: I have recoded nbstripout's core functionality using jq and it is now about 10-20 times faster; I no longer experience slowdowns when working with git.
If anybody is curious, there are 2 different nbstripout jq versions here:
https://github.com/fastai/fastai_v1/blob/master/tools/fastai-nbstripout-jq

@kynan
Owner

kynan commented Aug 14, 2018

Using jq to improve performance has been suggested before (see #33). Please move any further performance discussion to that issue. That said, I'm open to a PR iff jq is an optional dependency.

I have created #86 to track the whitelist model.

@stas00
Author

stas00 commented Aug 14, 2018

Perfect. I will close this issue then, as your whitelist ticket references this one already. Thank you.


2 participants