-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Benchmarks, and 'interface' traits? #10
Comments
When benchmarking it really doesn't matter what the contents of the text you're editing are, you're usually only interested in figuring out how long each
Yes, but that's because of the editing pattern. The positions of the insertions and deletions in those benchmarks are usually randomly generated: you insert a character here, then delete some there, and so on. This is not how humans actually edit text. We usually insert or delete entire runs of characters before moving the cursor to another place in the document. We might jump around occasionally, but most of the edits are localized. If a rope is designed with this assumption in mind (and both crop and Jumprope are) it's nice to have benchmarks that don't make you ping-pong around. Luckily the author of Automerge (a CRDT library) and the author of Jumprope recorded several character-by-character editing traces of them typing out a document in real time. You can find those traces here.
I already did something along these lines in this repo. There you can find a bunch of benchmarks, including those editing traces I mentioned. You can clone that repo, add your rope to the dependencies, implement the |
Is that strictly true, given that rust strings are stored as vectors of bytes/u8, encoded as UTF-8? A string entirely composed of ASCII characters could be indexed just by the byte position in I've personally been playing around with indexing various things about the character composition of the strings (character widths, positions of newlines, etc.) to improve performance of certain operations, but this then does mean the distribution affects the actual times. It's in testing how successful strategies I'm experimenting with here that purely ASCII benchmarks (or ones with a non-representative distribution of newlines) aren't likely to give me particularly useful results.
Hm, might this assumption be missing some things? I've not thought too deeply on this yet, but alongside adding or removing a line (for example), operations like find/replace which could be operating across an entire document aren't all that uncommon, just as an example. I'm not sure how best to create benchmarks/test data to cover different types of edits, so that's just something I'm pondering on...
Ah, I think these are exactly the sort of thing I'm thinking about, thanks. =) I may try modifying this to use some rough unicode test files too, actually... |
Yes, but that time complexity is set at "design" time, not at runtime. If you design your rope to work with byte offsets then indexing is O(1), Unicode or not. It's the user's responsibility to make sure that the byte offsets they provide are valid. And if you instead design it around As an aside, I would advise against choosing codepoints as the indexing metric. Like you said it brings a performance penalty over bytes, and it doesn't provide any advantages over it. The fundamental unit of text in a real editor is not a Unicode code point, it's much closer to an extended grapheme cluster. Imo bytes and ECGs are the only two sensible metrics to choose from, and (for reasons that I won't go over here) using graphemes doesn't work.
Well it's a general rule-of-thumb, not a theorem ;) There are definitely cases where it breaks down, e.g. find&replace and multiple cursors, but it's still a useful insight. As a piece of anecdotal evidence supporting this, after reimplementing crop's leaves as gap buffers instead of simple
The traces already contain find&replace and all other kinds of edits, so I wouldn't worry about that. Of course you could record your own trace if you really wanted to. I think there's a vscode plugin for it but I'm not sure. |
I've been beavering away learning rust by working on
omnikron13/braid
, another generic rope implementation.I'm wondering what benchmarks you might have that would be worth stealing? I already lifted more or less any of the ones from
cessen/ropey
that I was able to apply to my codebase. =PAlthough useful, they don't seem to blend in much unicode, and they're definitely pretty artificial, being composed of Lorem Ipsum text (v.s. for example source code, presumably a primary target for a rope implementation). I'd be very interested, and much appreciative, for extra and perhaps more 'realistic' benchmarks you might be able to propose?
On an only partially related note... Do you think it would be worth attempting to abstract out any 'interface' traits related to accessing these string+ data types? I've been thinking of maybe trying some other 'text editor buffer' type string implementations too at some point, and looking at the few ropes that exist already and such, I wonder if trying to standardise something for plug-n-play use, and benchmarking efforts, might work? Idk, just an idea, lemme know your thoughts.
The text was updated successfully, but these errors were encountered: