Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Diffs very slow #358

Closed
jakezcruit opened this issue Mar 23, 2018 · 10 comments
Closed

Diffs very slow #358

jakezcruit opened this issue Mar 23, 2018 · 10 comments

Comments

@jakezcruit
Copy link

When I use nbdiff it takes several minutes for the diff to appear. This happens when I do it from the terminal, use the web diff command, or use the jupyter extension.

@vidartf
Copy link
Collaborator

vidartf commented Mar 26, 2018

Is this always the case, or just for specific notebooks? The diffing algorithm in use has not been optimized a lot. If this is for specific notebooks, it would be very helpful if you could supply an example set of such notebooks. We could add those to the set of test notebooks we use, and include them in an optimization effort!

@vidartf
Copy link
Collaborator

vidartf commented Mar 26, 2018

PS: If the notebooks in question are not publicly sharable, you could possibly send them to me privately, and I would try to profile the differ. Then, I could engineer a new set of notebooks that trigger the same snags, but which does not include the original content.

@jakezcruit
Copy link
Author

Actually let's close this for now, since I can't recreate it on the notebooks I'm working on today. I'll re-open w/ example notebooks if it happens again.

@ianozsvald
Copy link

Whilst this is a closed issue, I figured adding some additional context might be useful for future discussions. I'm working on private (work) Notebooks that diff and merge very slowly. I'm using nbdime 1.0.0 (thanks - this is a great tool!).

My primary Notebook has circa 200 cells, lots of embedded for loops, tqdm output, various images, lots of pandas and matplotlib. Some of the cells contain a lot of text output (e.g. hundreds of lines of text output from print statments), some contain long DataFrames.

I do diffs using -s to avoid image comparisons. A diff takes circa 50 minutes even if only 1 cell is different.

This Notebook is, pragmatically speaking, far too large and ought to be refactored. It is a part of a research pipeline and is still evolving. Right now it won't be refactored and two of us occasionally work on it, hence the use of nbdime.

Hopefully the above could provide enough ideas for someone to create a dummy long Notebook to test speed issues?

@vidartf
Copy link
Collaborator

vidartf commented Jun 12, 2018

A diff takes circa 50 minutes even if only 1 cell is different.

Is this with the -s flag?

The current diffing algorithm is known to not scale very well with increasing number of items to compare. There is an issue for re-implementing this part using the Meyers algorithm here: #3 This should allow the diffing performance to scale much better. That being said, it would be nice to profile such huge notebooks to verify where it's spending the time.

@ianozsvald
Copy link

Yes using -s.

In addition - using nbmerge earlier this morning on the same Notebook, after 1.5 hours it raised an Exception and failed to do the merge. 5 cells were different between the 2 Notebooks (visually checked using nbdiff-web after circa 50 mins earlier in the morning).

Maybe a useful "beginner bug" that could be filed would be for someone to make some "realistic and large" Notebooks using different tools (e.g. pure Python in 1, some Pandas DataFrames in another, some Matplotlib output in another) to see which ones are diff'd poorly. That might help getting others to prioritise how the diffing is done?

@girisandeep
Copy link

girisandeep commented May 28, 2019

It took a really long time for this: https://github.com/cloudxlab/ml/blob/master/deep_learning/tensorflow_keras_regression.ipynb

I had made some changes to this notebook and had enabled nbdime in git using 'nbdime config-git --enable'. I was trying to do 'git diff tensorflow_keras_regression.ipynb'. It took more than 10mins and then I gave up.

After disabling nbdime, "git diff" worked pretty well.

Hope that helps.

Also, when I was killing git diff with CTRL+C, I noticed that it is breaking at some method of nbdime which is computing lowest common subsequence. So, the chances are that implementation is slow. So, probably moving the logic from python to cython might improve. Also, I would suggest that instead of implementing entire diff logic again, use existing libraries (probably that might hurt compatibility.)

@vidartf
Copy link
Collaborator

vidartf commented May 28, 2019

@girisandeep Thanks for the report! This seems like a bug rather than an inherent problem with the algorithm (it is not optimized, but >10 min is still completely out of bounds). Which versions (git tags) were you comparing? In order to reproduce locally, I would need both versions of the notebook.

Also for reference: Which version of nbdime/python where you using?

@vidartf
Copy link
Collaborator

vidartf commented May 28, 2019

@girisandeep Actually, I was able to reproduce based on the most recent commit on that file. I'll open a separate issue for it once I've profiled it a little closer. Thanks again!

@mikofski
Copy link

mikofski commented Jun 1, 2019

I know this is closed, but this is happening to me, so just posting my environment for others in case they encounter this issue too.
nbdime-1.0.4 (pypi)
Python 3.7.3 (conda)
I'll try updating to latest to see if that resovles the slowdown.
Thanks! :)

eduardobonet-gitlab added a commit to eduardobonet-gitlab/ipynb-mr-sample that referenced this issue Aug 17, 2021
While the diffs are parseable, and can be used to create a Merge Review,
diffing these 2 small files took a good minute. It seems to be a known
issue: jupyter/nbdime#358
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants