Text Buffer Research

M0 — Buffer (`@rawmark/buffer`)

I'm essentially implementing pieces of the Language Server Protocol

Language Server Protocol (LSP)
            ↑
 Editor API
            ↑
 Selection / Cursor Model <- M1 and so on..
            ↑
 Text Buffer <- We are here!
            ↑
 Data Structure (and here)
(String Array / Piece Table / Rope / etc.)

What is a text buffer?

A text buffer is the data structure holding a documents text in memory. The data structure could be optimized so that preforming operations on the document stays fast, regardless of document size.

What does a text buffer do?

Stores and edits document text in memory while providing access to the content, lines, positions, and text replacements independent of any ui.

Abstracting

It seems like at the most basic level of abstraction, a text buffer typically has the following operations, at a minimum

insert(position, text)
delete(range)
replace(range, text)
getValue()

for example; take the following text: Hello world if i were to call the insert method

insert({line: 0, column: 5}, " marvelous")

i can expect the output to read as Hello marvelous world - this would be a consistent input/output regardless of the underlying data structure. Out of all the research in this document, this is probably going to be the most important takeaway.

this table was generated by AI. I need to review each of these row items and decide if that's sufficient or if we need to make adjustments as necessary. For example, I've already begun the offset unit research and decided that UTF-16 is what I want by default because it aligns with javascript runtimes. It also seems like other text buffers accommodate other forms of UTF such as UTF-8. I'm wondering if that's something I'll want to account for at some point, or if it's something only to account for out of the gate. I'm just leaving this here as a reference. We will come through and finalize and update this notice when complete.

Question	M0 Decision	Human reviewd
Offset unit	UTF-16 code units	✅
Line endings	Normalize to \n
Empty document	One empty line
Final newline	Creates empty final line
Position base	Zero-based	✅
Range style	Half-open [start, end)
Multi-line ranges	Allowed
Empty ranges	Allowed
Overlapping edits	Rejected
Multiple edits	Applied against original snapshot
Undo	Return inverse edits, manage history elsewhere
Versioning	Yes
Storage	Start simple, test against future piece table

Spec

This specification is about compatibility with established standards and specifications. im going to lean as close to the Microsoft Language Server Protocol (LSP) specifications, specifically version 3.18, as possible.

Position

Positions are represented using zero-based line and character indexes to align with the coordinate conventions defined by the LSP. A position is not a character. It is a location where a cursor can exist. There are 4 valid positions in cat, for example |c|a|t|. Note how the cursor sits between characters, not on top of them. Positions must be represented by actual coordinates; sentinel values are not supported.

Offset Units

Character offsets are represented as UTF-16 code units. This aligns with JavaScript's native string representation and the default position encoding used by the Language Server Protocol (LSP).

LSP 3.17+ supports multiple encodings, but UTF-16 remains the default and backwards-compatible encoding.

Worth exploring for future optimizations

later problem not an mvp porblem

Plain Array / String (baseline), Rope, Line Array / Array of Lines, Gap Buffer, and Piece Table as solutions.

Objectives

This is a large, but focused research effort with defined objectives.

Be able to comprehend each text buffer data structure, document known pros/ cons with sources
test correctness, invariants, API shape, and workload realism
Publish the pros and cons found during benchmark testing (to a blog on this repo)
Implement buffer data structure for rawmark
Which algo is the best for the use-case? (Be able to explain pros/cons of each in conversation, and writing)

Reading Materials

Implementation Examples

https://docs.rs/ropey/latest/ropey/struct.Rope.html

Notes

Note to self; these are called text buffer data structures or text sequence representations, not algorithms.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Text Buffer Research

M0 — Buffer (`@rawmark/buffer`)

What is a text buffer?

What does a text buffer do?

Abstracting

Spec

Position

Offset Units

Worth exploring for future optimizations

Objectives

Reading Materials

Implementation Examples

Notes

Uh oh!

Uh oh!

Clone this wiki locally

Text Buffer Research

M0 — Buffer (@rawmark/buffer)

What is a text buffer?

What does a text buffer do?

Abstracting

Spec

Position

Offset Units

Worth exploring for future optimizations

Objectives

Reading Materials

Implementation Examples

Notes

Uh oh!

Uh oh!

Clone this wiki locally

M0 — Buffer (`@rawmark/buffer`)