Graphical tool for diffing notebooks #3355

Open
wolever opened this Issue May 24, 2013 · 10 comments

7 participants
@wolever
Contributor

wolever commented May 24, 2013

As per the discussion leading from: https://twitter.com/swcarpentry/status/337611439382593537

I'll be taking a crack at this some time the week of May 27th.

@Carreau

Member

Carreau commented May 24, 2013

I took a shot at this some time ago. I think it would require a cell ID at some point; cf. #2342, and here is an example of a diffed notebook:
http://nbviewer.ipython.org/urls/raw.github.com/Carreau/ipython/cell_id/docs/examples/notebooks/octave_diff.ipynb

@takluyver

Member

takluyver commented May 24, 2013

I've got a vague idea for a kind of 'rich diff protocol', because I find there are a lot of file types where standard diff isn't much use (e.g. word processor documents: even if you use a text-based format like LyX or flat ODT, the noise makes the diffs all but unusable). This is kind of a long-term thing: it might be interesting to build the prototype framework with a notebook diff tool, but you're equally welcome to just solve the problem at hand directly if you prefer.

This has actually been going around in my head this morning, so here are a few notes on what I envisage:

  • The diffs only need to be for viewing: if you need to actually send a change to be applied, use a standard diff or binary diff format.
  • Format adapters only need to handle extracting features from one file: the master program will compare those features.
  • The master calls the format adapter with a list of mimetypes it can diff and display, like the HTTP accept header. So you might have a command-line diff tool that can only display text, and a web frontend that can display rich text and images.
  • The format adapter returns a series of chunks with data in those mimetypes. The master program aligns them and displays a rich diff.
  • If a block of data can't be represented in any of the requested mimetypes, the format adapter emits it as a hash and a mimetype, so the frontend can optionally show something like [Image data changed].
  • It possibly needs some features to deal with unordered container files (e.g. zip files are an unordered collection of files, sqlite files are an unordered collection of tables). I haven't thought much about this yet.

This also draws a bit from lesspipe, which I just learned about this morning.
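
A minimal Python sketch of the adapter/master split described in the notes above. Everything here is an assumption made for illustration: the names `notebook_adapter` and `rich_diff`, the chunk layout `(mimetype, payload, is_hash)`, and the use of `difflib.SequenceMatcher` for alignment are not part of any existing tool or protocol.

```python
import hashlib
from difflib import SequenceMatcher


def notebook_adapter(blocks, accept):
    """Hypothetical format adapter: extract chunks from ONE file.

    blocks: list of (mimetype, data) pairs, data being str or bytes.
    accept: mimetypes the master can diff and display.
    Returns a list of (mimetype, payload, is_hash) tuples.
    """
    chunks = []
    for mimetype, data in blocks:
        if mimetype in accept:
            chunks.append((mimetype, data, False))
        else:
            # Not displayable: emit only a hash plus the mimetype, so the
            # frontend can still report that the block changed.
            raw = data if isinstance(data, bytes) else data.encode("utf-8")
            chunks.append((mimetype, hashlib.sha256(raw).hexdigest(), True))
    return chunks


def rich_diff(old_chunks, new_chunks):
    """Hypothetical master step: align two chunk lists for display."""
    matcher = SequenceMatcher(a=old_chunks, b=new_chunks, autojunk=False)
    return [
        (tag, old_chunks[i1:i2], new_chunks[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


# A text-only frontend: the image is reduced to a hash on both sides.
accept = ["text/plain"]
old = notebook_adapter([("text/plain", "x = 1"), ("image/png", b"old-bytes")], accept)
new = notebook_adapter([("text/plain", "x = 1"), ("image/png", b"new-bytes")], accept)
for tag, before, after in rich_diff(old, new):
    if tag != "equal" and all(chunk[2] for chunk in before + after):
        print("[Image data changed]")
```

The adapter never sees the second file; it only extracts features, and the comparison stays in the master, as the notes suggest. Unordered containers (zip, sqlite) are not handled by this sketch.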

@wolever

Contributor

wolever commented May 24, 2013

@Carreau Ah, yes — sorry, I should have gone into more detail here.

For a first pass, I'd like to build a tool that would just diff entire notebooks, not cell contents. Imagine a 3-way merge where, instead of lines of code, you have complete notebook cells.

This way you can sidestep a lot of the Very Hard problems, and get something that will be immediately (well, with only a few hours of work) somewhat useful.

Again, as a first pass: I imagine building it as a standalone tool which can be called in place of merge(1)… Something like: nbmerge a.ipynb b.ipynb base.ipynb. It would pull up a browser which would look more or less like a standard 3-way merge, and when the merge is complete, it would save the merged notebook.

Of course, from there, it would be straightforward to diff only two notebooks.
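
A rough Python sketch of what a cell-level 3-way merge could look like, treating each cell as an opaque unit. It assumes cells are represented by their source strings; the function names are hypothetical, and this is a simplified diff3-style approach built on `difflib`, not the implementation of nbmerge or any other tool.

```python
from difflib import SequenceMatcher


def _unchanged(base, other):
    """Map base index -> other index for cells difflib matches exactly."""
    mapping = {}
    matcher = SequenceMatcher(a=base, b=other, autojunk=False)
    for i, j, size in matcher.get_matching_blocks():
        for k in range(size):
            mapping[i + k] = j + k
    return mapping


def merge_cells(base, ours, theirs):
    """Merge two edited cell lists against their common ancestor.

    Returns (merged, conflicts). Each conflict is a
    (base_chunk, our_chunk, their_chunk) triple; on conflict this sketch
    keeps our side in `merged` as a placeholder.
    """
    to_ours = _unchanged(base, ours)
    to_theirs = _unchanged(base, theirs)
    # Sync points: base cells left untouched by both sides.
    sync = [(i, to_ours[i], to_theirs[i])
            for i in range(len(base)) if i in to_ours and i in to_theirs]
    sync.append((len(base), len(ours), len(theirs)))  # end sentinel

    merged, conflicts = [], []
    pb = po = pt = 0
    for i, o, t in sync:
        b_chunk, o_chunk, t_chunk = base[pb:i], ours[po:o], theirs[pt:t]
        if o_chunk == b_chunk:
            merged.extend(t_chunk)      # only they changed (or nobody did)
        elif t_chunk == b_chunk or o_chunk == t_chunk:
            merged.extend(o_chunk)      # only we changed, or same change
        else:
            conflicts.append((b_chunk, o_chunk, t_chunk))
            merged.extend(o_chunk)
        if i < len(base):
            merged.append(base[i])      # the shared, unchanged cell
        pb, po, pt = i + 1, o + 1, t + 1
    return merged, conflicts


base = ["import numpy", "x = 1", "plot(x)"]
ours = ["import numpy", "x = 2", "plot(x)"]
theirs = ["import numpy", "x = 1", "plot(x)", "save()"]
merged, conflicts = merge_cells(base, ours, theirs)
print(merged)
print(conflicts)
```

Diffing only two notebooks falls out of the same pieces: `SequenceMatcher(a=old, b=new).get_opcodes()` over the two cell lists gives the cell-level changes directly.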

@fperez

Member

fperez commented May 24, 2013

Glad to see you're taking a shot at this! Needless to say, this should be done as a purely standalone experiment for now, so you have the freedom to control development without worrying too much about integration with the core.

While I agree with @takluyver that this problem fits into the larger context of complex format diffing, I also think that you should start by focusing on one specific thing, namely IPython notebooks, for a first prototype. It can be generalized later once you have something that works, but this is exactly the kind of problem where trying to build from the outset a completely generic tool is likely to lead to an abstraction monstrosity that's both unmanageable and sub-optimal in any specific case.

There's a ton of room for interesting experimentation here on what will be good output. I personally really like LaTeXdiff, as a tool for diffing latex-sourced files in a rich context. I'd encourage you to have a look at it for inspiration.

@minrk

Member

minrk commented May 24, 2013

I've done a few toy attempts of this kind of thing, and I don't think that a cell ID needs to, or even should be, a part of it.

@diego898

diego898 commented Aug 20, 2015

Are there any plans to incorporate something like this? I just found this tool:

https://github.com/tarmstrong/nbdiff

@Carreau

Member

Carreau commented Aug 20, 2015

Yes, we are aware of nbdiff (which also has a website: http://nbdiff.org/); we would need a full-time person to actually work on that.

@yarikoptic

Contributor

yarikoptic commented Jan 30, 2016

FWIW, the more I use notebooks, the more I run into the need for a visual diff, as have many others: nbdiff, https://github.com/csiro-scientific-computing/NotebookDiff, and who knows what else. Unfortunately, none of those seems able to fully survive on its own, partially due to the rapid pace of IPython development and the lack of dedicated funding for their development.
@fperez Is there some "native" diffing support coming to IPython? If not, could IPython core maybe "adopt" at least one of the solutions and help maintain it, at least to the degree of usability with some/current IPython version(s)?

@takluyver

Member

takluyver commented Jan 30, 2016

The current effort is nbdime, short for Notebook Diff and Merge. @minrk works with the author, so it should stay up to date.

@yarikoptic

Contributor

yarikoptic commented Jan 30, 2016

Great, thank you for the pointer @takluyver
