Improve the base diffing algorithm #160

gdavoianb · 2017-01-21T16:05:49Z

Hi, folks. I was so inspired by jsdiff library, that I decided to port it to Python, see pydiff (@kpdecker, I hope I am not violating anything, but if so, then let me know please).

The current implementation of the algorithm tends to perform additions first, and then removals, and sometimes even swapping added and removed parts afterwards doesn't help too much, because we might already obtain a "bad" solution.

Let me show you an example. Consider the following two strings:

TERMS AND CONDITIONS
A. TERMS OF SALE
B. ITUNES STORE TERMS AND CONDITIONS
C. MAC APP STORE, APP STORE, APP STORE FOR APPLE TV AND IBOOKS STORE TERMS AND CONDITIONS
D. APPLE MUSIC TERMS AND CONDITIONS
THE LEGAL AGREEMENTS SET OUT BELOW GOVERN YOUR USE OF THE ITUNES STORE, MAC APP STORE, APP STORE, APP STORE FOR APPLE TV, IBOOKS STORE AND APPLE MUSIC SERVICES ("SERVICES").

and

TERMS OR CONDITIONS
A. TERMS OF SALE
B. ITUNES STORE TERMS AND CONDITIONS
C. MAC APP STORE, APP STORE, APP STORE FOR APPLE TV TERMS AND CONDITIONS, AND SOMETHING ELSE
D. APPLE MUSIC TERMS AND CONDITIONS
THE LEGAL CONTRACTS SET OUT BELOW GOVERN YOUR USE OF THE ITUNES STORE, MAC APP STORE, APP STORE, APP STORE FOR APPLE TV, IBOOKS STORE AND APPLE MUSIC SERVICES ("SERVICES").

The content of the strings doesn't matter, the most interesting part here is the clause C, where there are a deleted part AND IBOOKS STORE and an inserted part , AND SOMETHING ELSE.

Using jsdiff.diffWords for comparing these string we obtain the following diff:

[ { count: 2, value: 'TERMS ' },
  { count: 1, added: undefined, removed: true, value: 'AND' },
  { count: 1, added: true, removed: undefined, value: 'OR' },
  { count: 50,
    value: ' CONDITIONS\nA. TERMS OF SALE\nB. ITUNES STORE TERMS AND CONDITIONS\nC. MAC APP STORE, APP STORE, APP STORE FOR APPLE TV ' },
  { count: 2, added: true, removed: undefined, value: 'TERMS ' },
  { count: 2, value: 'AND ' },
  { count: 1, added: undefined, removed: true, value: 'IBOOKS' },
  { count: 2,
    added: true,
    removed: undefined,
    value: 'CONDITIONS,' },
  { count: 1, value: ' ' },
  { count: 1, added: undefined, removed: true, value: 'STORE' },
  { count: 1, added: true, removed: undefined, value: 'AND' },
  { count: 1, value: ' ' },
  { count: 1, added: undefined, removed: true, value: 'TERMS' },
  { count: 1, added: true, removed: undefined, value: 'SOMETHING' },
  { count: 1, value: ' ' },
  { count: 1, added: undefined, removed: true, value: 'AND' },
  { count: 1, added: true, removed: undefined, value: 'ELSE' },
  { count: 1, value: '\n' },
  { count: 2,
    added: undefined,
    removed: true,
    value: 'CONDITIONS\n' },
  { count: 17,
    value: 'D. APPLE MUSIC TERMS AND CONDITIONS\nTHE LEGAL ' },
  { count: 1, added: undefined, removed: true, value: 'AGREEMENTS' },
  { count: 1, added: true, removed: undefined, value: 'CONTRACTS' },
  { count: 60,
    value: ' SET OUT BELOW GOVERN YOUR USE OF THE ITUNES STORE, MAC APP STORE, APP STORE, APP STORE FOR APPLE TV, IBOOKS STORE AND APPLE MUSIC SERVICES ("SERVICES").' } ]

Though this diff is optimal in terms of edit distance (17 removals/additions), it is not good enough, it is too complicated and has too many components.

But there exists yet another optimal diff with 17 changes (obtained in Python):

[{'count': 2, 'value': 'TERMS '},
 {'count': 1, 'removed': True, 'added': None, 'value': 'AND'},
 {'count': 1, 'removed': None, 'added': True, 'value': 'OR'},
 {'count': 50, 'value': ' CONDITIONS\nA. TERMS OF SALE\nB. ITUNES STORE TERMS AND CONDITIONS\nC. MAC APP STORE, APP STORE, APP STORE FOR APPLE TV '},
 {'count': 6, 'removed': True, 'value': 'AND IBOOKS STORE ', 'added': None},
 {'count': 5, 'value': 'TERMS AND CONDITIONS'},
 {'count': 1, 'removed': None, 'added': True, 'value': ','},
 {'count': 1, 'value': ' '},
 {'count': 6, 'removed': None, 'value': 'AND SOMETHING ELSE\n', 'added': True},
 {'count': 17, 'value': 'D. APPLE MUSIC TERMS AND CONDITIONS\nTHE LEGAL '},
 {'count': 1, 'removed': True, 'added': None, 'value': 'AGREEMENTS'},
 {'count': 1, 'removed': None, 'added': True, 'value': 'CONTRACTS'},
 {'count': 60, 'value': ' SET OUT BELOW GOVERN YOUR USE OF THE ITUNES STORE, MAC APP STORE, APP STORE, APP STORE FOR APPLE TV, IBOOKS STORE AND APPLE MUSIC SERVICES ("SERVICES").'}]

IMHO, this diff looks much better, because it has less components and thus are more human-readable. Also it tries to remove as many tokens as possible, and only then tries to add tokens, and as a result we get bigger disjoint groups of removed/added tokens instead of many one-word changes like removed/added, removed/add etc.

What do you think about this change? Do you find it reasonable? When I started porting JS code to Python, I tried to make Python-output to be as close as possible to JS-output, but I decided to make an additional step ahead to improve readability of diffs, and now I am suggesting to back-port this change to jsdiff. I don't have any JS implementation yet, but my Python implementation can be used as a baseline.

Any suggestions, comments, alternative opinions, and also criticism are welcome :)

Thank you for attention.

The text was updated successfully, but these errors were encountered:

ExplodingCabbage · 2023-12-21T11:34:56Z

I haven't looked at the implementation yet, but I agree that the output is preferable. I think the reason the output seems better is not purely attributable to humans preferring diffs with longer chunks removed/added (although that's part of it); instead it's that more of the edits in the diff your Python version produces are insertions or deletions of spaces, and more of the edits in the diff jsdiff produces are insertions or deletions of words.

It's illustrative to visualise just how they would diff the text

AND IBOOKS STORE TERMS AND CONDITIONS

against

TERMS AND CONDITIONS, AND SOMETHING ELSE

Here is what your library comes up with. (-s are deletions; +s are insertions; numbers on the third row are cumulative edit costs after the block of insertions/deletions they appear beneath):

AND IBOOKS STORE TERMS AND CONDITIONS+ ++++++++++++++++++
-----------------TERMS AND CONDITIONS, AND SOMETHING ELSE
        6                            7          12

And here's what jsdiff comes up with:

++++++AND IBOOKS +++++++++++ STORE+++ TERMS+++++++++ AND++++ CONDITIONS
TERMS AND ------ CONDITIONS, -----AND -----SOMETHING ---ELSE ----------
  2         3        5         6   7    8      9     10  11      12

I think everyone would agree your version is better, in the sense that it's a more intuitive diff to a human. But I only half-agree with you about why. You say it is because your version has fewer components - i.e. after you join together adjacent insertions/deletions, there are only 3 in your diff versus 10 in jsdiff's - and that humans in general prefer diffs with fewer components. I agree that's a factor that makes diffs more intuitive, but I think it's secondary here. More important is that your diff preserved three out of six words ("TERMS", "AND"m "CONDITIONS") in the text, whereas jsdiff's diff preserved only ONE WORD (namely "AND").

The idea that these two word diffs can even be considered to have the same edit distance when one deleted/added a total of 10 words and the other deleted/added only 6 words seems crazy! The only reason they do from the algorithm's perspective is that deleting a space between words is considered to be as costly as deleting a word, and therefore these two strategies both roughly come out as inserting/deleting "half" of each text:

lining up all the spaces ("half" of each text as measured by edit distance score!) from the two texts and then replacing literally all the words (the other "half" of the text)
lining up the second half of the first text with the perfectly-matching first half of the second text, then deleting the entirety of the non-matching half, including spaces

A side effect of preferring fewer contiguous sequences of insertions/deletions is that your algorithm tends to prefer the second strategy, which does yield a better result, but I view as kind of indirectly working around a more fundamental problem here, which is that those two strategies shouldn't be considered to have equal edit distance in the first place! According to the docs, diffWords does its diff "ignoring whitespace", but here we are treating a diff that preserves only 2/12 words in the two texts as equal to one that preserves 6/12 words purely because the former preserves more whitespace. If we want to live up to the description in the README, whitespace insertions and deletions should cost zero, and then we'd probably get something similar to your diff anyway.

(I'll still take a look at your implementation, though, and it may be that it would still be superior even if the edit cost bug above were fixed.)

ExplodingCabbage · 2023-12-21T15:56:47Z

I just took a look and I revise my opinion and now even more strongly think that the algo change here doesn't really improve diffWords output at all. I still think your change to the algorithm is good, but only because it brings jsdiff in line with conventions followed by other diffing tools. I think it doesn't even tend to produce diffs with longer runs of deletions/additions or to favour deleting spaces over words in diffWords, and the fact that it did so in the particular case shown in the issue description is 100% coincidence.

If I understand right, the one pertinent change is this condition in base.py...

if diag == -edit_dist or diag != edit_dist and \
        self._get_endpoint(add_path, diag - 1)[0] <= \
        self._get_endpoint(remove_path, diag + 1)[0]:

which differs from the equivalent condition in base.js:

if (!canAdd || (canRemove && addPath.newPos < removePath.newPos)) {

The context of these lines is that we're trying to figure out the furthest we can get onto diagonal diagonalPath of the edit graph with editLength edits, and we're choosing between taking the furthest-reaching path of length editLength-1 on diagonal diagonalPath+1 and adding a move corresponding to deleting a token from the old sequence, or taking the furthest-reaching path of length editLength-1 on diagonal diagonalPath-1 and adding a move corresponding to adding a token from the new sequence. Both conditions agree that if one of those options will place you strictly further along diagonalPath than the other, you should take it. But they disagree about how to break a tie; pydiff favours doing an insert, while jsdiff favours doing a removal. Counterintuitively, that means pydiff favours paths that do removals earlier (because when we're choosing between stepping onto a diagonal via a character deletion or stepping to the same spot on that diagonal via a character insertion, that's respectively equivalent to continuing a path that has already done one extra insertion or continuing a path that has already done one extra removal).

Let's illustrate the change by considering diffing abcd (old) against acbd (new). Here's the edit graph...

First we consider how far we can get along the graph with 0 edits. That's easy: we preserve the a, and then we're stuck:

Next we consider how far we can get on each diagonal with 1 edit. With 1 edit we can move to either diagonal 1 or -1, respectively corresponding to an insertion of a c and a deletion of a b. Either way we then greedily include the next character, which is common:

Next we consider how far we can get on each diagonal with 2 edits. There's only one path to diagonal 2 or -2, but there are two ways to get to diagonal 0, shown here as dotted lines, and they both get to exactly the same point on the diagonal, which means neither is more optimal than the other:

These correspond to the following diffs:

Keep "a", delete "b", keep "c", insert "b", keep "d"
Keep "a", insert "c", keep "b", delete "c", keep "d"

In jsdiff, we break the tie by preferring to add a deletion onto a path that has already done an extra insertion - i.e. we choose option 2:

In pydiff, we break the tie by preferring to add an insertion onto a path that has already done an extra deletion - i.e. we choose option 1:

This is the better choice, because by convention diff tools prefer breaking these ties by putting deletions first. If you diff the file

a
b
c
d

against

a
c
b
d

by running diff old.txt new.txt you'll see that diff prefers a diff that does a deletion first and then an insertion.

BUT! this does not in any way whatsoever skew pydiff in favour of a diff with fewer components. The fact that it had that effect in the particular example @gdavoianb used in this issue is pure coincidence.

The argument for why this must be the case is based on symmetry. If you reverse which of two texts you treat as the "old" vs "new" text, then you just get the mirror image edit graph where the insertion-first paths are now removal-first paths. So in any situation where pydiff's rule results in simpler diffs, then by symmetry, if you reverse the order of the arguments to the diffing function, you must get a situation where pydiff's rule results in more complicated diffs.

We can see this in action with the example above:

>>> from pprint import pprint
>>> import pydiff
>>> old = "AND IBOOKS STORE TERMS AND CONDITIONS"
>>> new = "TERMS AND CONDITIONS, AND SOMETHING ELSE"
>>> pprint(pydiff.diff_words(old,new))
[{'added': None, 'count': 6, 'removed': True, 'value': 'AND IBOOKS STORE '},
 {'count': 5, 'value': 'TERMS AND CONDITIONS'},
 {'added': True, 'count': 7, 'removed': None, 'value': ', AND SOMETHING ELSE'}]
>>> pprint(pydiff.diff_words(new,old))
[{'added': None, 'count': 2, 'removed': True, 'value': 'TERMS '},
 {'count': 2, 'value': 'AND '},
 {'added': None, 'count': 2, 'removed': True, 'value': 'CONDITIONS,'},
 {'added': True, 'count': 1, 'removed': None, 'value': 'IBOOKS'},
 {'count': 1, 'value': ' '},
 {'added': None, 'count': 1, 'removed': True, 'value': 'AND'},
 {'added': True, 'count': 1, 'removed': None, 'value': 'STORE'},
 {'count': 1, 'value': ' '},
 {'added': None, 'count': 1, 'removed': True, 'value': 'SOMETHING'},
 {'added': True, 'count': 1, 'removed': None, 'value': 'TERMS'},
 {'count': 1, 'value': ' '},
 {'added': None, 'count': 1, 'removed': True, 'value': 'ELSE'},
 {'added': True, 'count': 3, 'removed': None, 'value': 'AND CONDITIONS'}]
>>> 
mark@fractal:~$ cd jsdiff
mark@fractal:~/jsdiff$ node
Welcome to Node.js v21.1.0.
Type ".help" for more information.
> diff = require('.')
{
  Diff: [Getter],
  diffChars: [Getter],
  diffWords: [Getter],
  diffWordsWithSpace: [Getter],
  diffLines: [Getter],
  diffTrimmedLines: [Getter],
  diffSentences: [Getter],
  diffCss: [Getter],
  diffJson: [Getter],
  canonicalize: [Getter],
  diffArrays: [Getter],
  applyPatch: [Getter],
  applyPatches: [Getter],
  parsePatch: [Getter],
  merge: [Getter],
  structuredPatch: [Getter],
  createTwoFilesPatch: [Getter],
  createPatch: [Getter],
  convertChangesToDMP: [Getter],
  convertChangesToXML: [Getter]
}
> old = "AND IBOOKS STORE TERMS AND CONDITIONS"
'AND IBOOKS STORE TERMS AND CONDITIONS'
> nyoo = "TERMS AND CONDITIONS, AND SOMETHING ELSE"
'TERMS AND CONDITIONS, AND SOMETHING ELSE'
> diff.diffWords(old, nyoo)
[
  { count: 2, added: true, removed: undefined, value: 'TERMS ' },
  { count: 2, value: 'AND ' },
  { count: 1, added: undefined, removed: true, value: 'IBOOKS' },
  { count: 2, added: true, removed: undefined, value: 'CONDITIONS,' },
  { count: 1, value: ' ' },
  { count: 1, added: undefined, removed: true, value: 'STORE' },
  { count: 1, added: true, removed: undefined, value: 'AND' },
  { count: 1, value: ' ' },
  { count: 1, added: undefined, removed: true, value: 'TERMS' },
  { count: 1, added: true, removed: undefined, value: 'SOMETHING' },
  { count: 1, value: ' ' },
  {
    count: 3,
    added: undefined,
    removed: true,
    value: 'AND CONDITIONS'
  },
  { count: 1, added: true, removed: undefined, value: 'ELSE' }
]
> diff.diffWords(nyoo, old)
[
  {
    count: 6,
    added: true,
    removed: undefined,
    value: 'AND IBOOKS STORE '
  },
  { count: 5, value: 'TERMS AND CONDITIONS' },
  {
    count: 7,
    added: undefined,
    removed: true,
    value: ', AND SOMETHING ELSE'
  }
]

Sure, when you diff AND IBOOKS STORE TERMS AND CONDITIONS against TERMS AND CONDITIONS, AND SOMETHING ELSE, pydiff gives a result with 3 change objects and jsdiff gives a result with 13 change objects. Looks very bad for jsdiff. But that automatically means that when you diff TERMS AND CONDITIONS, AND SOMETHING ELSE against AND IBOOKS STORE TERMS AND CONDITIONS, jsdiff will give you a result with 3 change objects while pydiff will give you one with 13 change objects - and sure enough, that's what happens! Every scenario where pydiff outperforms jsdiff on this metric has a mirror scenario where jsdiff equally outperforms pydiff. There's no general sense in which pydiff's algorithm can be expected to yield results with fewer change objects.

Despite all the above, I still think we should make the suggested change, just for consistency with the convention followed by other diffing tools. But the reduction in the number of change objects returned by calls to diffWords that this issue suggests will be delivered as a result of such a change is 100% illusory, and anyone expecting to see such an improvement is going to be disappointed.

ExplodingCabbage · 2023-12-21T20:04:50Z

I've got a PR at #439 making us favour delete-then-insert paths over insert-then-delete paths, which brings us into alignment with the Myers diff paper and with the Unix diff tool. It won't merge for a while since it's kind of a breaking change (given that it changes what results jsdiff gives) and I have some important non-breaking changes I very deliberately want to release before we release any breaking ones, but it'll probably go out at some point in January. I'm gonna go ahead and close this issue now, though, just to help me keep track of what I do and don't still need to do.

I'd welcome a review of #439!

ExplodingCabbage mentioned this issue Dec 18, 2023

Feature request: intelligent diff #16

Closed

ExplodingCabbage added the diffWords behaviour label Dec 20, 2023

This was referenced Dec 21, 2023

diffWords doesn't consistently ignore whitespace, despite the README saying it should #436

Closed

Prefer to order deletions before insertions when the edit cost is the same, like the Myers diff paper does #439

Merged

ExplodingCabbage closed this as completed Dec 21, 2023

ExplodingCabbage mentioned this issue Jan 2, 2024

What is count in a change object? #210

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve the base diffing algorithm #160

Improve the base diffing algorithm #160

gdavoianb commented Jan 21, 2017

ExplodingCabbage commented Dec 21, 2023

ExplodingCabbage commented Dec 21, 2023 •

edited

Loading

ExplodingCabbage commented Dec 21, 2023

Improve the base diffing algorithm #160

Improve the base diffing algorithm #160

Comments

gdavoianb commented Jan 21, 2017

ExplodingCabbage commented Dec 21, 2023

ExplodingCabbage commented Dec 21, 2023 • edited Loading

ExplodingCabbage commented Dec 21, 2023

ExplodingCabbage commented Dec 21, 2023 •

edited

Loading