-
Notifications
You must be signed in to change notification settings - Fork 0
Text Buffer Research
I don't have an exact goal with this document; I'm kind of just doing research and familiarizing myself with what a text buffer is and how it's going to fit into the application. Really, just getting my thoughts out there.
I started off looking into which data structure is the best for a text buffer but learned through research that I was looking at the problem wrong altogether, which makes sense because it started off as a pretty open-ended question. But we're locking in on what we're actually looking for. What isn't important is the data structure; what is important is the data, regardless of the data structure. The input/output remains the same, and that's something I think I overlooked. So, I don't need the fastest data structure out of the gate or the best data structure at the outset. I don't know what that looks like yet. It would be hard to optimize for that at this stage, if not impossible.
As far as researching which data structure, I'm going to default to flat string to get the ball rolling, and create a baseline for performance. This document will focus more on technical decisions around the SDK for the text buffer itself, like; what will the methods look like? I know there are underlying decisions as well that need to be made when designing a text buffer. That's what I'm still researching more about. And we'll use this document to answer those questions.
A text buffer is the data structure holding a documents text in memory. The data structure could be optimized so that preforming operations on the document stays fast, regardless of document size.
Stores and edits document text in memory while providing access to the content, lines, positions, and text replacements independent of any ui.
It seems like at the most basic level of abstraction, a text buffer typically has the following operations, at a minimum
insert(position, text)
delete(range)
replace(range, text)
getValue()
for example; take the following text: Hello world if i were to call the insert method
insert({line: 0, column: 5}, " marvelous")
i can expect the output to read as Hello marvelous world - this would be a consistent input/output regardless of the underlying data structure. Out of all the research in this document, this is probably going to be the most important takeaway.
this table was generated by AI. I need to review each of these row items and decide if that's sufficient or if we need to make adjustments as necessary. For example, I've already begun the offset unit research and decided that UTF-16 is what I want by default because it aligns with javascript runtimes. It also seems like other text buffers accommodate other forms of UTF such as UTF-8. I'm wondering if that's something I'll want to account for at some point, or if it's something only to account for out of the gate. I'm just leaving this here as a reference. We will come through and finalize and update this notice when complete.
| Question | M0 Decision |
|---|---|
| Offset unit | UTF-16 code units |
| Line endings | Normalize to \n |
| Empty document | One empty line |
| Final newline | Creates empty final line |
| Position base | Zero-based |
| Range style | Half-open [start, end) |
| Multi-line ranges | Allowed |
| Empty ranges | Allowed |
| Overlapping edits | Rejected |
| Multiple edits | Applied against original snapshot |
| Undo | Return inverse edits, manage history elsewhere |
| Versioning | Yes |
| Storage | Start simple, test against future piece table |
later problem not an mvp porblem
Plain Array / String (baseline), Rope, Line Array / Array of Lines, Gap Buffer, and Piece Table as solutions.
This is a large, but focused research effort with defined objectives.
- Be able to comprehend each text buffer data structure, document known pros/ cons with sources
- test correctness, invariants, API shape, and workload realism
- Publish the pros and cons found during benchmark testing (to a blog on this repo)
- Implement buffer data structure for rawmark
- Which algo is the best for the use-case? (Be able to explain pros/cons of each in conversation, and writing)
- https://en.wikipedia.org/wiki/Piece_table
- https://en.wikipedia.org/wiki/Rope_%28data_structure%29
- https://learn.microsoft.com/en-us/dotnet/api/microsoft.visualstudio.text.itextbuffer?view=visualstudiosdk-2022
- https://zed.dev/blog/zed-decoded-rope-sumtree
- https://www.cs.unm.edu/~crowley/papers/sds/sds.html - Charles Crowley, "Data Structures for Text Sequences" (1998)
- https://code.visualstudio.com/blogs/2018/03/23/text-buffer-reimplementation
- https://www.cs.tufts.edu/comp/150FP/archive/hans-boehm/ropes.pdf
- https://www.averylaird.com/programming/the%20text%20editor/2017/09/30/the-piece-table
- https://developer.mozilla.org/en-US/docs/Glossary/Buffer
- https://coredumped.dev/2023/08/09/text-showdown-gap-buffers-vs-ropes
- https://stackoverflow.com/questions/496321/utf-8-utf-16-and-utf-32
- Note to self; these are called text buffer data structures or text sequence representations, not algorithms.