-
-
Notifications
You must be signed in to change notification settings - Fork 4.5k
formatWithCursor performance bottleneck #4801
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
This can cause my entire kernel to go over memory even when just calling from node, much less inside of prettier-atom (prettier/prettier-atom#480). I'm getting a bunch of bug reports over there because of this code and my Atom debugger likewise points to this diff arrays code the OP mentions as causing an OOM error. Is there any way we can improve the performance of this diff check? There are example HTML code files in that issue that can reproduce the problem. You can reproduce with this simple harness, assuming you have a file called prettier = require('prettier');
fs = require('fs');
fs.readFile('./test.html', 'utf8', function(err, contents) {
prettier.formatWithCursor(contents, { cursorOffset: 2, parser: 'html' });
});
console.log('FINISHED'); An example of HTML that causes things to blow up is this guy's homepage copy and pasted into that test.html file: https://miloslav.website/ |
#6164 is the current outstanding PR. Unfortunately it's quite tricky to produce exactly the same behavior as the current diff-based implementation, and thus none of the PRs are landing. |
I don’t know at all how formatWithCursor works, and I’ve never used it. So I might not know what I’m talking about, but to me it sounds sensible to merge a PR that makes this feature usable at the expensive of not working 100% the same in all cases (and probably the current implementation isn’t perfect either so it probably matters even less). |
Even if it were some type of trigger that realizes when something is going to be extremely unperformant it just gives up and returns an arbitrary cursor position it would be worth it. Honestly if it were just a 10s lag, we might be able to live with that since every successive diff should be much faster since most of the issues were fixed (it seems to just be when a huge amount of stuff got changed). However, it's the fact that it completely runs out of memory and blows up that is really hurting us over at prettier-atom. So yeah it seems like everyone's on the same page that it's better than nothing? Do you think it's possible to retain the old behavior and you just haven't figured it out, or is it just not going to be possible with this alternative library? |
I don't think it's possible to keep exactly diff's behavior while using diff-sequences, but anyone is welcome to try it out and see if they can come up with a fix, or create a fork that maintains the same behavior. Fwiw, we @observablehq have been using this code in production for a few months now. |
@tmcw Using this code in production, did you feel this difference in the behavior anyhow? As per the changed snapshots, there is only one change in the behavior and it really looks negligible. |
As far as I can tell, nobody has noticed the difference in behavior. |
As per prettier/prettier#4801 , formatWithCursor can be extremely slow and consume large amounts of memory. We're actually doing something very similar to the algorithm described in that ticket already, namely character-wise diffs. This means that the cursor is actually fine if we just let Emacs manage the position.
As per prettier/prettier#4801 , formatWithCursor can be extremely slow and consume large amounts of memory. We're actually doing something very similar to the algorithm described in that ticket already, namely character-wise diffs. This means that the cursor is actually fine if we just let Emacs manage the position.
A very simple test case to reproduce this performance bug is a one-line file containing nothing but an array of a few thousand short strings on one line. Try to format it with Below is an example file you can copy and save as
|
After taking a look at this, I'm very confident that it should be possible to make this enormously faster. I think there are actually two problems here. The first is that we're diffing way bigger blocks of text than we need to. Consider my example above where we format a giant single-line array. Doing this with This is crazy! There's no node in the AST for the open square bracket of the array, so we can't use that as our The second problem - and this is the angle from which I see @tmcw previously attacked this issue - is that the diff package Prettier is using is absurdly slow compared to what's possible. If I diff my example file above against the output of formatting it with Prettier using Git's diffing implementation (configured to do a word diff treating every character as a "word", therefore effectively doing a character diff), like this...
then it's instant, whereas doing the same diff with I intend to keep working on this next week, but welcome any comments others have! |
Okay, I have a PR on jsdiff that makes it orders of magnitude faster in at least my test case above. (I haven't tried out @tmcw's one from the first post here, yet.) It is still undesirably slow for some large diffs, though. Things left for me to do:
|
@tmcw's example actually takes way longer than 14s for me - almost 5 minutes, in fact! Not sure what accounts for the difference in what we observed; perhaps something relevant has changed since 2018. Regardless, it's an instructive benchmark to have when looking at the performance problem that exists right now. We end up diffing more than needed for a reason similar to my example: the cursor lies between two top-level elements of the program (the definition of the My performance improvement to jsdiff gets the runtime for @tmcw's benchmark on my machine down from 4 minutes & 45 seconds to under 11 seconds, so what I've done so far is definitely progress, but it'll be the reduction in how big the strings are that Prettier diffs that finally gets us down to a truly acceptable run time. I'm starting work on that now! |
I've got a draft PR at #15709 implementing the idea of recording what two nodes the cursor falls between and thus being able to diff much smaller strings than when diffing the entire node containing the cursor (which is often huge). There are a whole bunch of nuances I haven't explored properly yet, though, and need to look into before the PR will be ready. I've listed them as a checklist in the description. I would be grateful for any input that maintainers or people with some knowledge of Prettier's internals can offer. :) |
Oh, by the way, my PR totally works on @tmcw's benchmark. Runtime down from 5 minutes on main to 150 milliseconds on my branch! Just need to get it to a state where it's not breaking anything... |
I've been in touch with @kpdecker, the maintainer of jsdiff. He's busy with other stuff in his life and doesn't want to spend much time on jsdiff, so at my suggestion he's given me collaborator access there; I'm planning to get my own stuff merged, clean up existing PRs and issues, and maybe implement further performance improvements. Ideally I'd like it if someone would review my work in kpdecker/jsdiff#411 before I merge it. Since it was Prettier's performance problems that motivated me doing that PR, maybe someone here would like to review? You'll probably want to read through the Myers diff paper beforehand if you're not already familiar with the basic underlying algorithm - but it's an interesting read, I promise! |
In other news, I reckon #15709 is review-ready. |
I've isolated a bottleneck from our production environment and here's a nifty self-contained benchmark for it: https://gist.github.com/tmcw/1a4e8ee47941454337dc5952dbf90180 (swap require('./') for require('prettier') if that's intended)
What I'm seeing in this and in testing different cases is that
formatWithCursor
has a huge performance penalty over format, and it's entirely rooted in the call to diff.diffArrays, which creates may intermediate objects and arrays. The further into the document the cursor is, the more of a problem it becomes. This testcase takes upwards of 14s to format about 300 lines of code.Reading the code, it's a bit insidery - I suspect that an approach that doesn't require a char array, or does the diff progressively with the checking, or skips the creation of so many intermediate objects, may be ideal.
The text was updated successfully, but these errors were encountered: