[prototype] cell id and diff tool #2342

Closed
wants to merge 30 commits into
from

Conversation

Projects
None yet
5 participants
Owner

Carreau commented Aug 26, 2012

This introduce cell_id and "family tree" for cell to allows smart diffing.

With a little example script and some test notebook in doc to diff notebooks.

You should see that the diff tool is capable of findings cell ancesters, making the diff between the concatenated input of all ancesters, and the current input. Same for split.

There is no detection of deleted cell for now.

Hoping to convince you that adding a little more than just cell_id might allow to do lot more stuff.

Thought ?

ocatave magic
modified octave magic
generated diff

This pull request passes (merged 58d78fc into 5bb0ae4).

Owner

ellisonbg commented Sep 7, 2012

The idea of notebook diffs seems quite powerful, but I wonder if there are larger issues here about how we save notebooks. @fperez and @minrk do you think there is an IPEP lurking here? I am not quite sure what it would be though. But I love the idea of being able to do nb diffs, just not sure how to integrate it with the rest of our plans.

Owner

Carreau commented Sep 8, 2012

What do you mean with issues here about how we save notebooks?

The main point of this PR is to discuss how we handle UUID, is it worth keeping track of ancestors UUIDs or not.
I think it is for having good custom merging and diffing.

Owner

minrk commented Sep 8, 2012

I think it is probably not worth it to track ancestors.

Owner

Carreau commented Sep 8, 2012

Then how do you diff when you have cell merge, or uuid changes ?
there are some operation like copying/pasting/merging cell, that can't conserve uuid.

Owner

ellisonbg commented Sep 8, 2012

How do you diff a simple cell move?

How do you handle multiple merges?

It seems to me that there are cases where the ancestor id wont work.

On Saturday, September 8, 2012, Bussonnier Matthias wrote:

Then how do you diff when you have cell merge, or uuid changes ?
there are some operation like copying/pasting/merging cell, that con't
conserve uuid.


Reply to this email directly or view it on GitHubhttps://github.com/ipython/ipython/pull/2342#issuecomment-8393471.

Brian E. Granger
Cal Poly State University, San Luis Obispo
bgranger@calpoly.edu and ellisonbg@gmail.com

Contributor

tkf commented Sep 8, 2012

Is this diff tool going to be a building block for "Google Doc like" notebook sync across clients?

Owner

minrk commented Sep 8, 2012

On Sat, Sep 8, 2012 at 10:59 AM, Bussonnier Matthias <
notifications@github.com> wrote:

Then how do you diff when you have cell merge, or uuid changes ?
there are some operation like copying/pasting/merging cell, that con't
conserve uuid.

Sometimes the diff won't deduce everything, and might show the change as
larger than a human would.
This happens all the time with regular diffs, and is not generally a
problem.

I just don't think the benefit is worth the complexity.

For example, compare these two actions:

users A and B are both given the same notebook with a single cell
user A does split-cell to create a two-cell notebook
user B creates a new cell, and does copy/paste to move text from one cell
to the next.

Their two notebooks are now identical, and created from the same source.
If we depend on ancestor ID, wouldn't A's diff be significantly different
from B's? This doesn't seem desirable.


Reply to this email directly or view it on GitHubhttps://github.com/ipython/ipython/pull/2342#issuecomment-8393471.

Owner

Carreau commented Sep 8, 2012

@ellisonbg

How do you diff a simple cell move?

just extend the diff syntax to

@@ diff for cell 4 @@
--- 1_old_cell
+++ 1_new_cell
@@ -1,4 +1,3, move from position 4 to 18 @@

for example, or store in the metadata of the newly created cell to show a special UI

How do you handle multiple merges?
it does work with current implementation.

assumming A,B,C to AB,C to C,AB to CAB

your family tree woull be smth like

upper_cell:{
    C
    }
lower_cell:
    {
    AB : { upper_cell: A,
           lower_cell: B
    }

so wether you have (C,AB), or (C,A,B) you can reconstitute the merge.

It seems to me that there are cases where the ancestor id wont work.

Maybe, but introducing cell ID, allows us to do something better than simple
json diffing in many cases.

@tkf,

Is this diff tool going to be a building block for "Google Doc like" notebook
sync across clients?

I don't thinkn so, with live collaborating, document will be in sync, so you
don't need to look for differences. But UUID will help with syncing.

@minrk,

Sometimes the diff won't deduce everything, and might show the change as
larger than a human would.
This happens all the time with regular diffs, and is not generally a
problem.

Yes, and we have the occasion to reduce this effect, we can remove it totally but it will be better than anything.

Lots of interpreters don't show colored tracebacks, and it's not generally a problem.

I just don't think the benefit is worth the complexity.

The added complexity is small, and stored in metadata.

We have the occasion of storing additionnal metadata on the document and to
build a format that might use that to offer better versionning. Like some
Software allow you to see who edited what, let not miss the occasion.

users A and B are both given the same notebook with a single cell
user A does split-cell to create a two-cell notebook
user B creates a new cell, and does copy/paste to move text from one cell
to the next.

Their two notebooks are now identical, and created from the same source.
If we depend on ancestor ID, wouldn't A's diff be significantly different
from B's? This doesn't seem desirable.

I know some people that don't use/know git mv, should I consider that git-mv is worthless ?

I don't say that we should alway rely the extra-information to do diff and/or merge,
but we can have it, in metadata, the point of metadata is to add additionnal
informations, and we have the occasion of adding something that is awfully
difficult to guess afterward, and would make a really good plus for the fileformat.

git allows to manually edit chunk when commiting because diff tool does not always get things right.
let's do better and give versioning tool what they need to do their job better !

Contributor

tkf commented Sep 8, 2012

@Carreau Ah, sorry, this is a sub-cell level diff tool, not worksheet level (this tool is not for detecting reordering of cells, etc.).

Owner

Carreau commented Sep 8, 2012

@Carreau Ah, sorry, this is a sub-cell level diff tool, not worksheet level (this tool is not for detecting reordering of cells, etc.).

Yes This is is to detect cell movement and merge.

You don't need to detect cell move and merge while live editting because you are actually dooing it
Live collaboration does not need this tool.

Owner

ellisonbg commented Sep 9, 2012

The reason this is related to how notebooks are saved is that if we want to add a UI to the notebook for viewing diffs, we would need access to different versions of the notebooks. This would require some sort of version control for the notebooks. We could build that ourselves or use git, but it definitely gets back to how the notebooks are saved.

But if we just want to have this diff capability as a separate tool, we don't have to think about that.

Owner

ellisonbg commented Sep 9, 2012

It also seems like there are other overlaps between:

  • Notebook saving
  • Live collaboration
  • Diffs and version control

Let me try to explain. Live notebook collaboration requires that the client sends cell level diffs to the server, which are then broadcast to other clients. These diffs are things like:

  • Move cell 2 down
  • Merge cells 4 and 5
  • Update the contents of cell 4 with the text "..."

Even without multiple people editing documents, this is exactly the information required to do things like track versions through time, rewind to older versions, etc. Essentially we are inventing structured version control for notebooks.

But there is then the question of where do we store that information. If we don't save it to a db/file, it will be lost. If we want to save it to file, we would have to invent a new file format.

But this all takes me back to my desire to see a new IPEP created where we can discuss all of these things. There are huge issues here that span a number of different areas of the code base.

Owner

Carreau commented Sep 10, 2012

Even without multiple people editing documents, this is exactly the information required to do things like track versions through time, rewind to older versions, etc. Essentially we are inventing structured version control for notebooks.

Ok, you think about storing "patches" and use them to do diffs and other.
I didn't thought of this one.

This is (I think) what darcs is trying to do. It is IMHO, a much to complex approach.
Patch could be used to lower the data that have to transit on the wire, but storing and using them give you theoretically more freedom, but would be a nightmare to program.

I have a better understanding of what you want to discuss, and will try to comme up with a new PEP.

Owner

Carreau commented Sep 30, 2012

Hi,
This PR is without activities for 2 month now,
I'm going to close it and open an issue to reference it.
This does not mean than we refuse the code that is here, but that we try to keep the number of opened Pull request as low as possible to have better bird view of active code.

Fell free to reopen this PR whenever you want or contact us if you have any questions.

Thanks for contributing.

@Carreau Carreau closed this Sep 30, 2012

@Carreau Carreau deleted the Carreau:cell_id branch Dec 15, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment