Text Buffer Research

Matthew Bub edited this page Jun 19, 2026 · 27 revisions

M0 — Buffer (@rawmark/buffer)

Language Server Protocol (LSP)
            ↑
 Editor API
            ↑
 Selection / Cursor Model <- M1 and so on..
            ↑
 Text Buffer <- We are here!
            ↑
 Data Structure (and here)
(String Array / Piece Table / Rope / etc.)

Abstracting

At the most basic level of abstraction, a text buffer typically has the following operations:

insert(position, text)
delete(range)
replace(range, text)
getValue()

For example, take the following text: Hello world. If I call the insert method:

insert({line: 0, column: 5}, " marvelous")

I can expect the output to read Hello marvelous world. This input/output behavior stays the same regardless of the underlying data structure. Of all the research in this document, this is probably the most important takeaway.
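To make the contract concrete, here is a minimal sketch of the four operations backed by a single string. The names BufferPosition, BufferRange, TextBuffer and NaiveStringBuffer are hypothetical and are not the @rawmark/buffer API. The sketch also assumes "\n" line endings and does no optimization at all.

```typescript
// Hypothetical types for illustration only; not the @rawmark/buffer API.
interface BufferPosition { line: number; column: number; }
interface BufferRange { start: BufferPosition; end: BufferPosition; }

interface TextBuffer {
  insert(position: BufferPosition, text: string): void;
  delete(range: BufferRange): void;
  replace(range: BufferRange, text: string): void;
  getValue(): string;
}

// Simplest possible storage: one string. Assumes "\n" line endings.
class NaiveStringBuffer implements TextBuffer {
  private value: string;

  constructor(initial: string = "") {
    this.value = initial;
  }

  // Convert a zero-based line/column position into a string offset.
  private offsetAt(position: BufferPosition): number {
    const lines = this.value.split("\n");
    const target = lines[position.line];
    if (target === undefined) {
      throw new RangeError("line out of bounds");
    }
    if (position.column < 0 || position.column > target.length) {
      throw new RangeError("column out of bounds");
    }
    // Every preceding line contributes its length plus one "\n".
    const before = lines
      .slice(0, position.line)
      .reduce((sum, l) => sum + l.length + 1, 0);
    return before + position.column;
  }

  // insert and delete are both special cases of replace.
  insert(position: BufferPosition, text: string): void {
    this.replace({ start: position, end: position }, text);
  }

  delete(range: BufferRange): void {
    this.replace(range, "");
  }

  replace(range: BufferRange, text: string): void {
    const start = this.offsetAt(range.start);
    const end = this.offsetAt(range.end);
    if (start > end) {
      throw new RangeError("range start is after range end");
    }
    this.value = this.value.slice(0, start) + text + this.value.slice(end);
  }

  getValue(): string {
    return this.value;
  }
}

const demoBuffer = new NaiveStringBuffer("Hello world");
demoBuffer.insert({ line: 0, column: 5 }, " marvelous");
console.log(demoBuffer.getValue()); // Hello marvelous world
```

Swapping the single string for a piece table or rope should only change the class internals; callers of the TextBuffer interface would see the same results.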

Note: this table was generated by AI. I still need to review each row and decide whether it is sufficient or needs adjusting. For example, I've already begun the offset unit research and decided that UTF-16 is what I want by default because it aligns with JavaScript runtimes. Other text buffers also seem to accommodate other encodings, such as UTF-8. I'm wondering whether that's something I can account for later or something I need to design for out of the gate. I'm leaving the table here as a reference, and will finalize it and update this notice when the review is complete.

Question            M0 Decision                                       Human reviewed
Offset unit         UTF-16 code units
Line endings        Normalize to \n
Empty document      One empty line
Final newline       Trailing newline creates an empty final line
Position base       Zero-based
Range style         Half-open [start, end)
Multi-line ranges   Allowed
Empty ranges        Allowed
Overlapping edits   Rejected
Multiple edits      Applied against original snapshot
Undo                Return inverse edits, manage history elsewhere
Versioning          Yes
Storage             Start simple, test against future piece table

Spec

This specification aims for compatibility with the established standards and conventions referenced in the Language Server Protocol (LSP) specification, version 3.18: https://microsoft.github.io/language-server-protocol/specifications/lsp/3.18/specification

Position

Positions are represented using zero-based line and character indexes to align with the coordinate conventions defined by the LSP. A position is not a character. It is a location where a cursor can exist. There are 4 valid positions in cat, for example |c|a|t|. Note how the cursor sits between characters, not on top of them. Positions must be represented by actual coordinates; sentinel values are not supported.
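As a rough sketch of the "cursor sits between characters" idea, the snippet below enumerates the positions on one line and validates a position against a list of lines. CursorPosition, positionsOnLine and isValidPosition are hypothetical names used only for illustration.

```typescript
// Hypothetical shape, mirroring the LSP's zero-based line/character pair.
interface CursorPosition { line: number; character: number; }

// A line of length n has n + 1 cursor positions: before each character,
// plus one after the last character.
function positionsOnLine(line: number, text: string): CursorPosition[] {
  const result: CursorPosition[] = [];
  for (let character = 0; character <= text.length; character++) {
    result.push({ line, character });
  }
  return result;
}

// Only real coordinates are accepted; a sentinel such as -1 is rejected.
function isValidPosition(
  lines: readonly string[],
  pos: CursorPosition
): boolean {
  if (!Number.isInteger(pos.line) || !Number.isInteger(pos.character)) {
    return false;
  }
  const text = lines[pos.line];
  return (
    text !== undefined &&
    pos.character >= 0 &&
    pos.character <= text.length
  );
}

console.log(positionsOnLine(0, "cat").length); // 4
console.log(isValidPosition(["cat"], { line: 0, character: -1 })); // false
```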

Offset Units

Character offsets are represented as UTF-16 code units. This aligns with JavaScript's native string representation and the default position encoding used by the Language Server Protocol (LSP).

LSP 3.17+ supports multiple encodings, but UTF-16 remains the default and backwards-compatible encoding.
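A quick illustration of what counting in UTF-16 code units means in practice. The sample string and the splitsSurrogatePair helper are illustrative assumptions, not part of the spec. The emoji is written as an escaped surrogate pair so the example does not depend on file encoding.

```typescript
// "a", then U+1F600 (one emoji, two UTF-16 code units), then "b".
const sample = "a\uD83D\uDE00b";

// String.length counts UTF-16 code units, which is what the buffer uses.
const utf16Units = sample.length;

// Array.from iterates by code point, so the emoji counts once here.
const codePoints = Array.from(sample).length;

// The character offset of "b" is 3, not 2, because the emoji occupies
// offsets 1 and 2.
const columnOfB = sample.indexOf("b");

// Hypothetical helper: true when an offset lands between the two halves
// of a surrogate pair, which is a place a cursor should normally not rest.
function splitsSurrogatePair(text: string, offset: number): boolean {
  if (offset <= 0 || offset >= text.length) {
    return false;
  }
  const before = text.charCodeAt(offset - 1);
  const after = text.charCodeAt(offset);
  const isHigh = before >= 0xd800 && before <= 0xdbff;
  const isLow = after >= 0xdc00 && after <= 0xdfff;
  return isHigh && isLow;
}

console.log(utf16Units, codePoints, columnOfB); // 4 3 3
console.log(splitsSurrogatePair(sample, 2)); // true
```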

End-of-Line Sequences

The buffer recognizes \n, \r\n, and \r as valid end-of-line sequences, consistent with the line-splitting behavior defined by the Language Server Protocol (LSP).

Implementations may normalize line endings internally to simplify editing, position calculation, and line indexing semantics. The original end-of-line style of the document should be preserved and used by default when serializing text.
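One way this could look, as a sketch only: split on all three sequences, remember the first end-of-line style seen, and reuse it when serializing. Eol, splitLines, detectEol and serialize are hypothetical names. Treating the first sequence found as the document's style is an assumption, and a document with mixed line endings would need a more deliberate policy.

```typescript
type Eol = "\n" | "\r\n" | "\r";

// "\r\n" is listed first so it is matched as one sequence, not as "\r"
// followed by "\n".
const EOL_PATTERN = /\r\n|\r|\n/;

// Split text into lines, treating all three sequences as line breaks.
function splitLines(text: string): string[] {
  return text.split(EOL_PATTERN);
}

// Assumption: the first end-of-line sequence found is the document's style.
function detectEol(text: string, fallback: Eol = "\n"): Eol {
  const match = EOL_PATTERN.exec(text);
  return match ? (match[0] as Eol) : fallback;
}

// Join lines back together with the preserved end-of-line style.
function serialize(lines: readonly string[], eol: Eol): string {
  return lines.join(eol);
}

const original = "first\r\nsecond\r\nthird";
const eol = detectEol(original);
const roundTripped = serialize(splitLines(original), eol);
console.log(roundTripped === original); // true
```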

Empty Documents

An empty document contains exactly one empty line. This ensures that line-based operations always have a valid line at index 0, even when the document contains no text.

Final Newlines

A trailing end-of-line sequence creates an additional empty line.
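Both rules fall out naturally if lines are produced by splitting on end-of-line sequences. A small sketch, where linesOf is a hypothetical helper:

```typescript
// Split on any of the three recognized end-of-line sequences.
function linesOf(text: string): string[] {
  return text.split(/\r\n|\r|\n/);
}

// An empty document still has exactly one (empty) line at index 0.
const emptyDocument = linesOf("");

// A trailing newline produces an additional empty final line.
const withFinalNewline = linesOf("a\n");

// Without the trailing newline there is no extra line.
const withoutFinalNewline = linesOf("a\nb");

console.log(emptyDocument.length); // 1
console.log(withFinalNewline.length); // 2
console.log(withoutFinalNewline.length); // 2
```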

Ranges

Ranges are represented using half-open semantics [start, end) to align with the Language Server Protocol (LSP). The start position is inclusive and the end position is exclusive.

Multi-line ranges

Ranges may span one or more lines. The start and end positions are not required to be on the same line.

Empty ranges

Empty ranges are allowed. A range whose start and end positions are equal represents a location in the document rather than a span of text.

This is consistent with the LSP range model and common editor implementations.
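The range rules above can be sketched against a plain array of lines. Pos, Span, textInRange and isEmptyRange are hypothetical names, and the sketch assumes the range has already been validated and that extracted lines are joined with "\n".

```typescript
interface Pos { line: number; character: number; }
interface Span { start: Pos; end: Pos; }

// Extract the text covered by a half-open range [start, end).
function textInRange(lines: readonly string[], range: Span): string {
  const { start, end } = range;
  if (start.line === end.line) {
    // slice is itself half-open: end.character is excluded.
    return (lines[start.line] ?? "").slice(start.character, end.character);
  }
  const first = (lines[start.line] ?? "").slice(start.character);
  const middle = lines.slice(start.line + 1, end.line);
  const last = (lines[end.line] ?? "").slice(0, end.character);
  return [first, ...middle, last].join("\n");
}

// An empty range marks a location rather than a span of text.
function isEmptyRange(range: Span): boolean {
  return (
    range.start.line === range.end.line &&
    range.start.character === range.end.character
  );
}

const docLines = ["Hello", "world"];
const multiLine: Span = {
  start: { line: 0, character: 3 },
  end: { line: 1, character: 2 },
};
console.log(JSON.stringify(textInRange(docLines, multiLine))); // "lo\nwo"
```

An insert at a position can then be expressed as a replace over an empty range at that position.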