New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance issue #33

Open
michaelaye opened this Issue Aug 10, 2016 · 10 comments

Comments

Projects
None yet
4 participants
@michaelaye
Contributor

michaelaye commented Aug 10, 2016

Everytime i go into a repo with lots of notebooks, it takes several seconds before i get my prompt back, which can be annoying...
I'm wondering if the official PreProcessor of the notebook tools is faster than your manual filtering?

@jankatins

This comment has been minimized.

jankatins commented Aug 10, 2016

I'm not sure how the prompt (of your shell?) can be affected by nbstripoutput, but if you want fast, try jq: http://janschulz.github.io/windows-dev-environment.html -> "proper diffs and commits for notebooks" (sorry, my homepage seems to not put in anchors in headlines :-( )

@michaelaye

This comment has been minimized.

Contributor

michaelaye commented Aug 10, 2016

oh, that's because my prompt (zsh->garrett) includes git status information, and the status is defined by git getting active, using the nbstripout filter to define the status for each notebook.

@jankatins

This comment has been minimized.

jankatins commented Aug 10, 2016

Huih, I've currently 38 ipynb in a folder with a jq based stripoutput filter (*.ipynb filter=stripoutput) and it's fast in cmder (which also tells me the current git status). in my case git status is also fast (instant?). cmder uses git status --porcelain 2>nul to check if the dir is dirty...

@michaelaye

This comment has been minimized.

Contributor

michaelaye commented Aug 10, 2016

are u aware that git apparently is able to cache the status somehow? It's only slow for me for the first time entering the folder after I-dont-know-what-interval-or-action. After that it's seemless as well. 66 notebooks here though. ;)

@jankatins

This comment has been minimized.

jankatins commented Aug 10, 2016

Thats sounds more like a filesystem cache (the git status manpage doesn't say anything about a cache for status information)? Do you have a ssd or hdd?

@michaelaye

This comment has been minimized.

Contributor

michaelaye commented Aug 10, 2016

ssd.

@kynan

This comment has been minimized.

Owner

kynan commented Aug 18, 2016

@michaelaye Do you have core.preloadindex enabled? That might be it:

Enable parallel index preload for operations like git diff

This can speed up operations like git diff and git status especially on filesystems like NFS that have weak caching semantics and thus
relatively high IO latencies. With this set to true, Git will do the index comparison to the filesystem data in parallel, allowing overlapping
IO’s.

@janschulz Thanks for the pointer to jq. Not sure I'd want to introduce such a heavy dependency, but will have a play with it.

@stas00

This comment has been minimized.

stas00 commented Aug 17, 2018

FYI, I uploaded two versions of much faster nbstripout-like tools. They focus on being fast git strip out filters, without much else.

  1. nbstripout-fast - a pure python json version (it also supports rapidjson, but I found it to be slower)
  2. nbstripout-jq - jq version with a thin /bin/sh wrapper

As of this writing nbstripout-fast and nbstripout-jq are respectively 13 and 9 times faster than nbstripout with the notebooks I tested them with.

They aren't exactly the same as nbstripout, so you may need to apply a few minor tweaks if you need the exact behavior.

These were written for fastai - we ended up using the pure python version as it was the fastest.

Enjoy.

@kynan

This comment has been minimized.

Owner

kynan commented Aug 19, 2018

Thanks, @stas00! Feel free to send a PR for the rapidjson variant. I'd also be open to add the nbstripout-jq as a separate script.

@stas00

This comment has been minimized.

stas00 commented Aug 19, 2018

@kynan, the python rewrite is very different internally - so PR won't be possible. It doesn't use any of the nbconvert/nbformat stuff. it's faster than rapidjson or jq. On the other hand it only does the stripping out (no installation tools, validation, etc.).

I first tried to get nbstripout to use rapidjson or ujson, but neither has an identical API to json. So that didn't work.

Given that most likely the strip out has to be as fast as possible for using git with many notebooks, i'd say the original nbstripout needs to be split into 2 pieces - one optimized for speed of the critical path of doing the strip out and another for installation/configuration/validation/etc. If you do that then you can just use nbstripout-fast as is - with a few tweaks to support keep_outputs and and a few other flags. And leave the rest of the features in the other script.

Also, nbstripout-fast is already written in the whitelist format as you suggested here, except the fields are hardcoded and will probably need to be configurable to make it generic enough for anybody to use without changes to the source code.

And yes, feel free to incorporate nbstripout-jq in your repo. it too will need some tweaks to be identical to nbstripout. We dropped the jq version since pure python json rewrite nbstripout-fast is faster.

p.s. if you decide to integrate/adopt the code the -d option keeps outputs and also preserves a few other metadata fields that we use for documentation notebooks, which differs from code notebooks. Obviously this won't be useful for a general user.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment