Improve the base diffing algorithm #160
Comments
I haven't looked at the implementation yet, but I agree that the output is preferable. I think the reason the output seems better is not purely attributable to humans preferring diffs with longer chunks removed/added (although that's part of it); instead it's that more of the edits in the diff your Python version produces are insertions or deletions of spaces, and more of the edits in the diff jsdiff produces are insertions or deletions of words. It's illustrative to visualise just how they would diff the text
against
Here is what your library comes up with. (
And here's what jsdiff comes up with:
I think everyone would agree your version is better, in the sense that it's a more intuitive diff to a human. But I only half-agree with you about why. You say it is because your version has fewer components - i.e. after you join together adjacent insertions/deletions, there are only 3 in your diff versus 10 in jsdiff's - and that humans in general prefer diffs with fewer components. I agree that's a factor that makes diffs more intuitive, but I think it's secondary here. More important is that your diff preserved three out of six words ("TERMS", "AND", "CONDITIONS") in the text, whereas jsdiff's diff preserved only ONE WORD (namely "AND"). The idea that these two word diffs can even be considered to have the same edit distance when one deleted/added a total of 10 words and the other deleted/added only 6 words seems crazy! The only reason they do from the algorithm's perspective is that deleting a space between words is considered to be as costly as deleting a word, and therefore these two strategies both roughly come out as inserting/deleting "half" of each text:
A side effect of preferring fewer contiguous sequences of insertions/deletions is that your algorithm tends to prefer the second strategy, which does yield a better result, but which I view as indirectly working around a more fundamental problem: those two strategies shouldn't be considered to have equal edit distance in the first place! According to the docs, (I'll still take a look at your implementation, though, and it may be that it would still be superior even if the edit cost bug above were fixed.)
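To make the cost argument concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the texts "A B" and "B C", the (op, token) tuple representation, and the helpers `edit_cost` and `rebuild` are hypothetical and are not taken from jsdiff or pydiff. It scores two hand-written diffs of the same pair of texts, first with a space costing as much as a word and then with spaces costing nothing.

```python
# Hypothetical representation: a diff is a list of (op, token) pairs,
# where op is "=" (keep), "-" (delete) or "+" (insert).

def edit_cost(ops, word_cost=1, space_cost=1):
    """Sum the cost of every inserted or deleted token."""
    total = 0
    for op, token in ops:
        if op == "=":
            continue  # kept tokens are free
        total += space_cost if token.isspace() else word_cost
    return total


def rebuild(ops):
    """Return the (old, new) texts that a diff describes."""
    old = "".join(token for op, token in ops if op in ("=", "-"))
    new = "".join(token for op, token in ops if op in ("=", "+"))
    return old, new


# Strategy 1 keeps the shared word "B" and edits the spaces around it.
keep_word = [("-", "A"), ("-", " "), ("=", "B"), ("+", " "), ("+", "C")]
# Strategy 2 keeps the space and replaces both words.
keep_space = [("-", "A"), ("+", "B"), ("=", " "), ("-", "B"), ("+", "C")]

for name, ops in (("keep the word", keep_word), ("keep the space", keep_space)):
    assert rebuild(ops) == ("A B", "B C")  # both are valid diffs of the same texts
    print(name, edit_cost(ops), edit_cost(ops, space_cost=0))
```

With equal token costs the two strategies tie at 4 edits each, so an algorithm that only minimises edit distance has no reason to prefer one. Once spaces are cheaper than words, the word-preserving diff wins on its own (2 versus 4 here). Whether a weighted cost is the right fix for the library is a separate question; the sketch only shows why the tie arises.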
I just took a look and I revise my opinion and now even more strongly think that the algo change here doesn't really improve
If I understand right, the one pertinent change is this condition in base.py...
which differs from the equivalent condition in
The context of these lines is that we're trying to figure out the furthest we can get onto diagonal
Let's illustrate the change by considering diffing
First we consider how far we can get along the graph with 0 edits. That's easy: we preserve the
Next we consider how far we can get on each diagonal with 1 edit. With 1 edit we can move to either diagonal 1 or -1, respectively corresponding to an insertion of a
Next we consider how far we can get on each diagonal with 2 edits. There's only one path to diagonal 2 or -2, but there are two ways to get to diagonal 0, shown here as dotted lines, and they both get to exactly the same point on the diagonal, which means neither is more optimal than the other:
These correspond to the following diffs:
In jsdiff, we break the tie by preferring to add a deletion onto a path that has already done an extra insertion - i.e. we choose option 2:
In pydiff, we break the tie by preferring to add an insertion onto a path that has already done an extra deletion - i.e. we choose option 1:
This is the better choice, because by convention diff tools prefer breaking these ties by putting deletions first. If you diff the file
against
by running
BUT! This does not in any way whatsoever skew pydiff in favour of a diff with fewer components. The fact that it had that effect in the particular example @gdavoianb used in this issue is pure coincidence.
The argument for why this must be the case is based on symmetry. If you reverse which of two texts you treat as the "old" vs "new" text, then you just get the mirror image edit graph where the insertion-first paths are now removal-first paths. So in any situation where pydiff's rule results in simpler diffs, then by symmetry, if you reverse the order of the arguments to the diffing function, you must get a situation where pydiff's rule results in more complicated diffs. We can see this in action with the example above:
Sure, when you diff
Despite all the above, I still think we should make the suggested change, just for consistency with the convention followed by other diffing tools. But the reduction in the number of change objects returned by calls to
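The tie-break rule and the symmetry argument above can be sketched in Python. This is a minimal greedy edit-graph search written for illustration under stated assumptions; it is not the code from jsdiff or pydiff, and the names `diff_tokens`, `delete_first` and `mirror` are hypothetical. Diagonal k is x - y, where x counts consumed old tokens and y counts consumed new tokens.

```python
def diff_tokens(old, new, delete_first=True):
    """Return a shortest edit script as a list of (op, token) pairs."""
    n, m = len(old), len(new)
    frontier = {}  # diagonal k -> (furthest x, ops that got there)
    for d in range(n + m + 1):
        reached = {}
        for k in range(-d, d + 1, 2):
            if d == 0:
                x, ops = 0, []
            else:
                via_delete = via_insert = None
                left = frontier.get(k - 1)
                if left is not None and left[0] < n:
                    # Delete one old token: x grows by one, y is unchanged.
                    via_delete = (left[0] + 1, left[1] + [("-", old[left[0]])])
                right = frontier.get(k + 1)
                if right is not None and right[0] - (k + 1) < m:
                    # Insert one new token: y grows by one, x is unchanged.
                    via_insert = (right[0], right[1] + [("+", new[right[0] - (k + 1)])])
                if via_delete is None and via_insert is None:
                    continue
                if via_delete is None:
                    x, ops = via_insert
                elif via_insert is None:
                    x, ops = via_delete
                elif via_delete[0] == via_insert[0]:
                    # Tie: both paths reach the same point on diagonal k.
                    # delete_first appends the insertion to the path that
                    # already did the extra deletion.
                    x, ops = via_insert if delete_first else via_delete
                else:
                    x, ops = max(via_delete, via_insert, key=lambda c: c[0])
            y = x - k
            # Follow the diagonal for free while tokens match.
            while x < n and y < m and old[x] == new[y]:
                ops.append(("=", old[x]))
                x, y = x + 1, y + 1
            if x == n and y == m:
                return ops
            reached[k] = (x, ops)
        frontier = reached
    return []  # not reached: d = n + m edits always suffice


def mirror(ops):
    """Swap insertions and deletions, as happens when old and new are swapped."""
    swap = {"-": "+", "+": "-", "=": "="}
    return [(swap[op], token) for op, token in ops]


old, new = ["a", "b"], ["b", "a"]
print(diff_tokens(old, new, delete_first=True))   # [('-', 'a'), ('=', 'b'), ('+', 'a')]
print(diff_tokens(old, new, delete_first=False))  # [('+', 'b'), ('=', 'a'), ('-', 'b')]
# Swapping the arguments and the rule gives exactly the mirror image.
assert mirror(diff_tokens(new, old, delete_first=False)) == diff_tokens(old, new)
```

Both rules return scripts with the same number of edits; they differ only in which of the tied paths is kept. The final assertion is the symmetry in miniature: whatever one rule produces for (old, new), the other rule produces in mirrored form for (new, old), so neither rule can systematically yield fewer components.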
I've got a PR at #439 making us favour delete-then-insert paths over insert-then-delete paths, which brings us into alignment with the Myers diff paper and with the Unix
I'd welcome a review of #439!
Hi, folks. I was so inspired by the jsdiff library that I decided to port it to Python, see pydiff (@kpdecker, I hope I am not violating anything, but if so, then let me know please).
The current implementation of the algorithm tends to perform additions first, and then removals, and sometimes even swapping added and removed parts afterwards doesn't help much, because we might already have obtained a "bad" solution.
Let me show you an example. Consider the following two strings:
and
The content of the strings doesn't matter; the most interesting part here is the clause C, where there is a deleted part, AND IBOOKS STORE, and an inserted part, AND SOMETHING ELSE.
Using jsdiff.diffWords to compare these strings, we obtain the following diff:
Though this diff is optimal in terms of edit distance (17 removals/additions), it is not good enough: it is too complicated and has too many components.
But there exists yet another optimal diff with 17 changes (obtained in Python):
IMHO, this diff looks much better, because it has fewer components and is thus more human-readable. Also, it tries to remove as many tokens as possible, and only then tries to add tokens; as a result we get bigger disjoint groups of removed/added tokens instead of many one-word changes like removed/added, removed/added, etc.
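One way to make "fewer components" measurable is to merge adjacent tokens that share the same operation and count the resulting runs. Below is a minimal Python sketch, assuming a hypothetical (op, token) representation and the made-up texts "A B" and "B C" rather than the strings above; `components` is an illustrative helper, not part of jsdiff or pydiff.

```python
from itertools import groupby


def components(ops):
    """Merge adjacent tokens with the same op into (op, text) runs."""
    return [
        (op, "".join(token for _, token in run))
        for op, run in groupby(ops, key=lambda pair: pair[0])
    ]


# Two diffs of "A B" against "B C", each with four added/removed tokens.
interleaved = [("-", "A"), ("+", "B"), ("=", " "), ("-", "B"), ("+", "C")]
grouped = [("-", "A"), ("-", " "), ("=", "B"), ("+", " "), ("+", "C")]

print(len(components(interleaved)))  # 5
print(components(grouped))           # [('-', 'A '), ('=', 'B'), ('+', ' C')]
```

Both diffs change the same number of tokens, but the second collapses into three runs while the first stays at five, which is the readability difference described above.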
What do you think about this change? Do you find it reasonable? When I started porting the JS code to Python, I tried to make the Python output as close as possible to the JS output, but I decided to take an additional step to improve the readability of diffs, and now I am suggesting back-porting this change to jsdiff. I don't have a JS implementation yet, but my Python implementation can be used as a baseline.
Any suggestions, comments, alternative opinions, and also criticism are welcome :)
Thank you for your attention.