Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a performance bug that leads to worse time complexity and cripplingly slow runtime in some cases #411

Merged

Conversation

ExplodingCabbage
Copy link
Collaborator

@ExplodingCabbage ExplodingCabbage commented Nov 22, 2023

On master, diff keeps track of an array of components (changes) for the current best path terminating on each diagonal of the edit graph. Whenever a vertical move happens, a new copy of this array gets created using path.components.slice(0).

The trouble is that this array copying takes time proportional to the size of the array. This increases the worst case time complexity of diff such that it contains an O(d³) term, which shouldn't be the case for the Myers algorithm. To illustrate this, consider the following case:

diffChars("xy".repeat(100) + "z".repeat(100), "x".repeat(100))

If you log every execEditLength call, and also log every clonePath call along with the size of the components array being cloned, you will note that for every edit length from 100-200, 100 calls to clonePath happen, with the average size of the components array being 100. Since cloning the components array with .slice requires iterating over every item of it, that's 100 * 100 * 100 array accesses - i.e. an O(d³) term in our time complexity.

The fix for this is to record components in a way that doesn't necessitate any array cloning. An easy way to do this that I see is to store the components on the basePath object as a linked list, with the most recently added component as the head and (0,0) as the tail; this ensures that adding a component to a path object is always an O(1) operation. It doesn't even really make the code more complicated; we just need a little bit of logic at the start of buildValues to traverse the list, build an array from it, and reverse it.

A nice benchmark to illustrate the size of the speedup this achieves is to diff these two files using diffChars:

words.txt
words-formatted.txt

The first contains a JavaScript array of 1379 words, as single-quoted strings on one line. The latter contains the same array, but formatted using Prettier, a JavaScript formatter. After making the change I propose in this PR, jsdiff can diff them in 5 seconds on my machine; using the code from master, it takes over 10 minutes. That's over a 100-fold speedup!

(Aside: this was actually a real-world scenario I encountered, when my editor tried to use Prettier to autoformat a dictionary of words for a word game, since Prettier uses jsdiff under the hood to figure out where to position the text cursor after formatting. See prettier/prettier#4801.)

There's more performance to be gained, to be sure; @gliese1337's fast-myers-diff is another JavaScript Myers implementation that manages my benchmark above in well under a second, and Git (configured with CLI flags to do a character diff) can do it in under 30ms! In due course I will look at @gliese1337's implementation and see if I can speed jsdiff up more. In the meantime, though, this is a fairly surgical change that already provides a 100x speedup in some cases, so it seems worthwhile. :)

All 178 tests pass when I run npm test, which seems to confirm that I haven't changed the diff results at all - just made things faster, as hoped!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant