-
Notifications
You must be signed in to change notification settings - Fork 65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Diffing and merging notebooks #8
Conversation
* Command line tools for interactive resolution of merge conflicts | ||
* Make the merge tool git compatible | ||
* Make a web gui for displaying notebook diffs | ||
* Make a web gui for interactive resolution of merge conflicts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to make sure that the web gui is fully integrated with Jupyter. Otherwise it will be very difficult for this to gain adoption, especially on systems (remote JupyterHub) where users only have access to the system through Jupyter in the browser.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the most part, I think this should be standalone, but it should at least play well with the new UI. I would expect most cases to be triggered by command-line calls, with a web view that only lives for a short period of time, dedicated to one diff view. If we properly separate how to request a diff of two notebooks from how they are displayed, adding a UI to request a diff from the web shouldn't be difficult, but I also think it shouldn't be the priority.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should really have similar APIs for requesting diffs and merges from python, javascript and commandline. If the UI can be built as extensions to other JupyterLab work, that sounds great to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@minrk I don't quite understand how to use this pull request for discussion. Should I add text to the document?
@ellisonbg @fperez should we merge this enhancement proposal as "accepted"? We are working on the nbdime repo, and can iron out the details of implementation later, but I think the "we are going to do something about this" bar is cleared, yes? |
I sure think we should merge this as is and keep iterating. |
+1 On Thu, Feb 11, 2016 at 1:31 PM, Kyle Kelley notifications@github.com
Brian E. Granger |
reuse as much as possible from jupyter notebook. An initial | ||
version of conflict resolution can be to output a notebook with | ||
conflicts marked within cells, to be manually edited as a regular | ||
jupyter notebook. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though below it says that merge of arbitrary output isn't in scope, it should clarify what it will produce in the scenario where the input doesn't change but the output does. We don't have tools to edit output, nor to present multiple outputs and delete them selectively. So this important edge case will need to be handled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do!
Yup @minrk , "we are doing something about this" :) But one key thing that seems missing to me (and I don't find it in the nbdime repo either) is a high-level description of the approach taken. To me, the notebook diff problem seems separable in two phases: identification of changes to the 'container' by finding cell insertions, deletions, transpositions, moves, etc, and then within-cell diffing, that then becomes content/mimetype-specific. This is a more complex version of the problem of finding the location of changes in a linear pure-text file, and then applying the local diff itself. That's the approach that seems natural to me, but if that's not the case, then what approach is being used should be described. Second, I think at least mention of what happens to metadata should be made. We have a lot of metadata in the notebook, and there can be differences therein. So, while I apologize for the delay, I'm -1 on merging this document until these high-level ideas are a bit better described. While I realize that the actual implementation in the nbdime repo is where the 'real work' happens, I think it's also important that the high-level description is reasonably complete. It shouldn't be too difficult to add a bit of language to that effect, and the document in the long run will be much more valuable. We want our JEPs to serve a similar role to Python's PEPs, in that they provide reasonable stand-alone descriptions of the problem and the proposed solution, that one can read to understand the whole thing at a high-level without digging into the implementation itself. |
FYI for those that would like more info on Python's PEP process. It's detailed in PEP-0001 https://www.python.org/dev/peps/pep-0001/#what-belongs-in-a-successful-pep There is also a shorter description of the PEP process in the Python Dev Guide https://docs.python.org/devguide/langchanges.html Anecdotally, the Achille's heel of the PEP process is that it is a slow slog with long periods of inactivity before acceptance of a PEP. Given Jupyter's rapid pace of development, I wonder if the PEP model, which is more a gatekeeper process than rapid innovation, is the best analogy. I love the focus on openness and transparency for JEPs; I would find it sad to see development flow ebb, due to PEP process inefficiences. Perhaps "concept" status (which would equate to 5 or so committers/community members thinking it is worth exploring) would be a reasonable step before "draft" status; the benefit would be a place for a document to be worked on in the open and versioned while giving people an opportunity to iterate on the "concept" doc toward "draft" status. 🐢 🐰 🏃 |
@willingc that's a good point, and I think one thing that helps the JEP process be perhaps less cumbersome to the project as a whole is that relatively few changes warrant a JEP (e.g. a totally new project, like this one). The vast majority of changes, even relatively large ones, can go through the regular PR process. @fperez thanks for the feedback! It's AOK to add things to do here, there just weren't any todo items for a while, so I thought I'd ping for a merge. Perhaps I should have phrased it as "Should we merge, or are there more things people would like added?" We'll get right on addressing your points. |
Indeed, the Python PEP process can be too slow sometimes, but I also think that for key ideas it can be useful to stop and think carefully through the key implications. We don't need to set the bar that high, and can still allow for iteration/evolution, but I think it's also good practice to have the base document in a reasonably complete state of ideas before merge. Thanks @minrk and @martinal for responding, I think with a few such changes, the overall description will be much improved and will help others understand the direction of these ideas. Nothing prevents the actual implementation from evolving further and even changing this initial description, of course. |
* Provided a high-level description of the diff algorithm approach. * Updated with the major changes to the diff format that nbdime already implements (mostly copy-paste from current nbdime docs). * Many smaller updates.
Update notebook-diff.md
I've updated the JEP in some significant ways, recommend reading through the new version instead of looking at the diffs. Hope this covers the most important things. I believe the questions not covered here are best handled as specific issues towards the nbdime repository. |
Thanks @martinal! Much better indeed. The only thing I see missing is at least some acknowledgment of the metadata issue and some indication of what your strategy there would be. It doesn't need to be too detailed, and I'm sure it will evolve. But metadata is pervasive to the notebook, and will be the source of interesting questions and technical challenges, so at least these questions should be acknowledged here for the proposal to really cover the key points of the problem. Once that is done, I'm happy merging. Thanks again! |
For the Big Ol Unstructured JSON Data Tree, They, too, offer yet another transformation language: Reminds me of a certain XKCD... On Tue, Feb 23, 2016 at 2:18 PM Fernando Perez notifications@github.com
|
@fperez I'm not quite sure what "the metadata issue" is. Basically the plan is to treat it as a json blob in general, for which diff and merge basically works fine using already functional generic tools. @minrk has mentioned there are cases where we can know more about what the metadata means, in which case we can add options to do validation, use left, use right, etc. For visualization of arbitrary metadata, the demo @bollwyvl links to looks pretty nice. @bollwyvl I think I saw this project before and rejected it because the way they approach comparing items in array diffs is too limiting. Also note that their diff format is even more terse than my original format ;) We should make a note of checking if they do something smart though. |
@martinal, it's possible e.g. to have metadata changes in cells that are otherwise identical. But our UI doesn't by default display this explicitly, and for most users this is "hidden state". So the system will need to surface these changes and handle them in parallel to input/output, which is what users probably think most of the time about. It's not so much a technical problem (I agree with you, they are just another JSON blob), more of a UI/UX/workflow one. But you don't need much more, just a quick pointer of what your plan is there for now, so we flag the issue. The implementation details can emerge in development. |
@fperez so I see this as a two part problem. One is computing the diff (the merge follows), where the idea so far has been to make source code the primary identifier for cell identity, output a secondary identifier, and metadata ignored at the moment. There is a framework in place for implementing a sequence of predicates that include more or less of the cell contents to determine if they're to be treated as 'the same cell'. This can be extended to include e.g. edit distance of metadata blobs at the stricter equality levels, while ignoring metadata at the lower strictness. At the least strict level any change to output and metadata is ignored as long as source is approximately equal. These predicates will need testing and experimenting to make the algorithm behave sensibly. The other is the UX part. In the diff visualization there can be for example another box above, between or below cell source and output, showing metadata conflicts by showing the json tree, one line for each leaf key:value, aligned with corresponding keys between left/right view. For somewhat structured data I think a simple approach like this could be the best. Either the user can edit the json directly, and we validate it on the fly, or a UI is built to handle editing in a safer way. |
add note about displaying metadata
@fperez I added a note about diffing metadata. Anything left to do on this? |
Great, thanks! Merging now. |
High-level description of a toolchain for diffing and merging of notebooks. Currently prototype implementation exists as the nbdime project.
Done, thanks everyone for patiently taking feedback into consideration. |
Thanks! |
No description provided.