
Core Text Utilities

Mike Strobel edited this page Jun 26, 2026 · 1 revision

Text utilities

Cursorial.Core ships two grapheme-aware text helpers in the Cursorial.Text namespace: GraphemeWidth (how many terminal cells a string occupies) and AnsiTextWrap (word-wrap that measures by rendered glyph and leaves embedded escape sequences untouched). They're small but load-bearing — without correct width accounting, any layer that paints to a grid misaligns the moment a user types an emoji or a CJK character.

Both are static utility classes with no setup. Reach for them whenever you need to know how wide a run of text will be on screen, or to fit styled text into a fixed column budget while keeping its ANSI styling intact.

These live one layer below the cell buffer and the UI framework. If you're building on Cursorial.Rendering or Cursorial.UI, width measurement already happens for you — these utilities are for direct/lower-level use.

Measuring width — GraphemeWidth

Terminals lay text out in fixed monospace cells. A "width" here is the number of those cells a glyph claims: 0 for combining marks and zero-width characters, 1 for ordinary printable characters, 2 for East Asian wide / fullwidth glyphs and most emoji. Three entry points:

using Cursorial.Text;

GraphemeWidth.CodepointWidth(0x41);          // 'A'      -> 1
GraphemeWidth.CodepointWidth(0x4E2D);        // '中'     -> 2  (CJK)
GraphemeWidth.CodepointWidth(0x0301);        // combining acute accent -> 0

GraphemeWidth.ClusterWidth("👩‍💻".AsSpan());  // one ZWJ emoji glyph -> 2
GraphemeWidth.StringWidth("héllo 中");        // measures the whole string -> 8

  • CodepointWidth(int codepoint) — width of a single Unicode scalar value (0–0x10FFFF, surrogates excluded), ignoring any surrounding cluster context. Throws ArgumentOutOfRangeException for negative or above-max values. There's also a CodepointWidth(Rune) overload.
  • ClusterWidth(ReadOnlySpan<char> cluster) — width of exactly one grapheme cluster (one user-visible glyph, possibly several codepoints). The caller is responsible for passing a single cluster; pass arbitrary text to StringWidth instead.
  • StringWidth(ReadOnlySpan<char> text) — total width of arbitrary text. It enumerates grapheme clusters (via the BCL's System.Globalization.StringInfo) and sums ClusterWidth over each, so emoji-with-modifiers, ZWJ sequences, and combining marks count once at their combined width.

If you only need the count of glyphs rather than their width, GraphemeWidth.ClusterCount(text) returns the number of grapheme clusters.
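
To make the difference between UTF-16 code units, codepoints, and grapheme clusters concrete, here is a small sketch that uses only the BCL and does not call Cursorial. The TextUnits class and its method names are hypothetical, and the cluster counts assume .NET 5 or later, where StringInfo follows the Unicode extended grapheme cluster rules.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helper, for illustration only (not part of Cursorial).
static class TextUnits
{
    // UTF-16 code units: what string.Length reports.
    public static int CodeUnits(string text) => text.Length;

    // Unicode scalar values (codepoints); a surrogate pair counts once.
    public static int Scalars(string text) => text.EnumerateRunes().Count();

    // Grapheme clusters (user-visible glyphs) as segmented by StringInfo.
    public static int Clusters(string text) => new StringInfo(text).LengthInTextElements;
}
```

For the ZWJ sequence "\U0001F469\u200D\U0001F4BB" (👩‍💻) this reports 5 code units, 3 scalar values, and 1 cluster. Only the cluster count matches what a user sees as a single glyph, which is why width is summed per cluster rather than per char.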

Cluster-level adjustments

ClusterWidth honors the standard presentation/joining rules so a real glyph measures the way a terminal actually draws it:

  • VS16 (U+FE0F, emoji presentation selector) retroactively bumps the preceding base to width 2 (e.g. ❤️ renders as a wide emoji even though the bare heart is narrow).
  • VS15 (U+FE0E, text presentation selector) pins the preceding base to width 1.
  • ZWJ (U+200D) is a zero-width continuation, so a joined sequence like 👩‍💻 counts as a single glyph rather than its parts.
  • Combining marks within the cluster add nothing — only the base contributes.
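
The four rules above can be sketched as a single pass over the codepoints of one cluster. This is an illustration of the rules as described here, not Cursorial's actual implementation: ClusterWidthSketch and Measure are hypothetical names, and the per-codepoint lookup is passed in as a delegate so that any table can be plugged in (for example a lambda that forwards to GraphemeWidth.CodepointWidth).

```csharp
using System;
using System.Text;

// Hypothetical sketch of the cluster-level rules; not Cursorial's code.
static class ClusterWidthSketch
{
    const int Vs15 = 0xFE0E; // text presentation selector
    const int Vs16 = 0xFE0F; // emoji presentation selector

    // cluster must hold exactly one grapheme cluster.
    // codepointWidth is a stand-in for a per-codepoint width lookup.
    public static int Measure(ReadOnlySpan<char> cluster, Func<int, int> codepointWidth)
    {
        int width = 0;
        bool haveBase = false;

        foreach (Rune rune in cluster.EnumerateRunes())
        {
            if (rune.Value == Vs16) { if (haveBase) width = 2; continue; } // force wide
            if (rune.Value == Vs15) { if (haveBase) width = 1; continue; } // force narrow
            if (haveBase) continue; // ZWJ, joined parts, and combining marks add nothing

            int w = codepointWidth(rune.Value);
            if (w > 0) { width = w; haveBase = true; } // first non-zero-width codepoint is the base
        }

        return width;
    }
}
```

With a lookup that reports U+2764 (the bare heart) as width 1, Measure("\u2764", lookup) yields 1 and Measure("\u2764\uFE0F", lookup) yields 2, which mirrors the VS16 rule. A cluster made only of zero-width codepoints measures 0.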

Coverage and its limits

The wide-character ranges are a hand-coded table rather than a full import of Unicode's EastAsianWidth.txt. The table covers Hangul, CJK Unified Ideographs (Plane 0 + Plane 2/3 extensions), Compatibility Ideographs, Fullwidth Forms, and the major emoji blocks (Misc Symbols & Pictographs, Emoticons, Transport, Supplemental Symbols, and others), which is enough for the overwhelming majority of real-world text.

What to know about the edges:

  • A recently added wide codepoint that falls outside the covered ranges reports width 1, not 2. That is the conservative fallback. The public surface is just int in / int out, so a fuller EastAsianWidth.txt-backed table can drop in later without an API change.

  • East-Asian-Ambiguous glyphs resolve to width 1. Box-drawing rules, block elements, geometric shapes (▲ ▼), arrows, Greek, Cyrillic, and many Latin-1 / punctuation symbols are ambiguous in the Unicode standard — width 1 in a non-East-Asian context, width 2 in a CJK one. Width 1 is the correct default for most terminals. For the cases where a terminal is configured to render them wide, GraphemeWidth.IsAmbiguousWidth(int codepoint) lets a renderer detect those glyphs and defend the cursor (the cell-buffer renderer uses exactly this to re-anchor after each ambiguous glyph on terminals flagged unreliable for wide glyphs):

    if (GraphemeWidth.IsAmbiguousWidth(0x2502))  // '│' box-drawing vertical -> true
    {
        // glyph may render as 1 or 2 cells depending on the terminal; treat with care
    }

    ASCII input (< 0x80) is never ambiguous; IsAmbiguousWidth returns false for it immediately, so plain prose stays on the fast path.
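
As a rough picture of what a hand-coded range table with a narrow fallback looks like, here is a minimal sketch. It is not Cursorial's table: WideRangeSketch is a hypothetical name, the five ranges are a deliberately tiny subset chosen for illustration, and zero-width codepoints are ignored entirely.

```csharp
using System;

// Hypothetical sketch of a range-table lookup; not Cursorial's table.
static class WideRangeSketch
{
    // Sorted, non-overlapping, inclusive ranges (a small illustrative subset).
    static readonly (int First, int Last)[] Wide =
    {
        (0x4E00, 0x9FFF),   // CJK Unified Ideographs
        (0xAC00, 0xD7A3),   // Hangul Syllables
        (0xF900, 0xFAFF),   // CJK Compatibility Ideographs
        (0xFF01, 0xFF60),   // Fullwidth Forms (fullwidth ASCII variants)
        (0x1F600, 0x1F64F), // Emoticons
    };

    // int in, int out: binary search over the ranges, width 1 when nothing matches.
    public static int Width(int codepoint)
    {
        int lo = 0, hi = Wide.Length - 1;
        while (lo <= hi)
        {
            int mid = (lo + hi) / 2;
            if (codepoint < Wide[mid].First) hi = mid - 1;
            else if (codepoint > Wide[mid].Last) lo = mid + 1;
            else return 2;
        }
        return 1; // not in the table: conservative narrow fallback
    }
}
```

Because the signature is just an int-to-int mapping, swapping the array for a generated table changes no caller. A codepoint that is wide in a newer Unicode version but missing from the array simply falls through to 1.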

Wrapping styled text — AnsiTextWrap

AnsiTextWrap.Wrap is word-wrap that measures width by rendered glyph only. Embedded ANSI escape sequences — SGR color/attributes, cursor moves, OSC hyperlinks and text sizing, DCS, SS3 — contribute zero columns but are preserved verbatim in their original positions. The result is wrapped text whose styling survives the split.

using Cursorial.Text;

string styled = "\x1b[31mThe quick brown fox jumps over the lazy dog\x1b[0m";

// Wrap to 20 columns; the red SGR sequence passes through and stays in effect across lines.
string wrapped = AnsiTextWrap.Wrap(styled, columns: 20);

// Output (the \x1b[31m / \x1b[0m are preserved in place):
//   The quick brown fox
//   jumps over the lazy
//   dog

Width is measured with the same grapheme-cluster accounting as GraphemeWidth.ClusterWidth, so emoji-with-modifiers, ZWJ sequences, and CJK glyphs are wrapped on correct boundaries.

Hard line breaks in the source (\n, \r, or \r\n) are honored as a single break each. Word-break opportunities are ASCII spaces and tabs.
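
The "single break each" rule for hard line breaks can be sketched as follows. This only illustrates the rule; it is not AnsiTextWrap's code, and HardBreakSketch.SplitHardBreaks is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of hard-break handling; not AnsiTextWrap's code.
static class HardBreakSketch
{
    // Splits on \n, \r, or \r\n, treating each occurrence as exactly one break.
    public static List<string> SplitHardBreaks(string text)
    {
        var lines = new List<string>();
        int start = 0;

        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (c != '\n' && c != '\r') continue;

            lines.Add(text.Substring(start, i - start));

            // CR immediately followed by LF is one break, not two.
            if (c == '\r' && i + 1 < text.Length && text[i + 1] == '\n') i++;
            start = i + 1;
        }

        lines.Add(text.Substring(start)); // final segment (may be empty)
        return lines;
    }
}
```

Under this rule "a\r\nb" splits into two segments, while "a\n\nb" splits into three with an empty one in the middle.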

Options — WrapOptions

The simple overload Wrap(text, columns) uses all defaults. Pass a WrapOptions to override. It's a readonly record struct; default(WrapOptions) and new WrapOptions() both mean "all defaults", so you can set only the fields you care about with object-initializer syntax:

string wrapped = AnsiTextWrap.Wrap(text, columns: 40, new WrapOptions
{
    NewLine = "\r\n",          // raw-mode terminals (OPOST off) want CR-LF
    BreakLongWords = true,     // split a word that's wider than the column limit
    TrimTrailingSpaces = true, // drop trailing spaces so lines end flush
});

  • BreakLongWords (default true) — when a single word exceeds the column limit, break it at the column boundary instead of overflowing. Set false to let an over-long word run past the limit intact (useful for URLs or paths you don't want fractured).
  • NewLine (default "\n") — the separator inserted between wrapped lines. Raw-mode terminal output (with OPOST disabled) typically needs "\r\n".
  • TrimTrailingSpaces (default true) — drop runs of trailing ASCII spaces/tabs from each wrapped line, so no line ends in whitespace left over from the break.

How SGR state crosses a break

The wrapper deliberately injects no reset at a line break. An SGR sequence emitted before the break stays in effect on the next line, because the escape passes through untouched — exactly what you want when a colored phrase wraps mid-color. If your downstream consumer instead needs every line to start from a clean SGR state, emit your own SGR 0 ("\x1b[0m") after each break; the wrapper does not assume a reset policy on your behalf.
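
If you do want every line to start from a clean SGR state, one way is to post-process the wrapped string. The sketch below is plain BCL string handling; SgrResetSketch.ResetAfterBreaks is a hypothetical helper, and it assumes you pass the same separator you gave Wrap as NewLine.

```csharp
using System;

// Hypothetical post-processing helper; not part of Cursorial.
static class SgrResetSketch
{
    const string SgrReset = "\u001b[0m"; // SGR 0: reset all attributes

    // Inserts SGR 0 immediately after every line break in already-wrapped text.
    // newLine must match the NewLine value used when wrapping ("\n" by default).
    public static string ResetAfterBreaks(string wrapped, string newLine = "\n")
        => wrapped.Replace(newLine, newLine + SgrReset);
}
```

Note the consequence: a color that was open before the break is cleared on the next line, so you would have to re-emit it yourself if the phrase should stay colored.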

Locating an escape sequence directly

If you're writing your own column-accounting pass and just need to skip over escapes, the same classifier Wrap uses is exposed as a static helper:

// Returns the length (in UTF-16 code units) of the escape sequence starting at index,
// or 0 if no escape sequence starts there.
int escLen = AnsiTextWrap.MeasureEscapeSequence("\x1b[31mred", 0);  // -> 5

It recognizes CSI, OSC (terminated by BEL or ESC \), DCS/SOS/PM/APC, SS3, and the ESC + intermediates + final byte form.
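
A column-accounting pass built on such a classifier might look like the sketch below. To keep it self-contained, the classifier is passed in as a delegate and a CSI-only stand-in is included. In real use you would pass a lambda that forwards to AnsiTextWrap.MeasureEscapeSequence, which recognizes far more forms. EscapeSkipSketch, StripEscapes, and MeasureCsi are hypothetical names.

```csharp
using System;
using System.Text;

// Hypothetical sketch of an escape-skipping pass; not Cursorial's code.
static class EscapeSkipSketch
{
    // Returns text with every span that measureEscape identifies removed.
    // measureEscape(text, index) returns the escape length in UTF-16 code units, or 0.
    public static string StripEscapes(string text, Func<string, int, int> measureEscape)
    {
        var visible = new StringBuilder(text.Length);
        int i = 0;
        while (i < text.Length)
        {
            int escLen = measureEscape(text, i);
            if (escLen > 0) { i += escLen; continue; } // skip the whole sequence
            visible.Append(text[i]);                   // ordinary character: keep it
            i++;
        }
        return visible.ToString();
    }

    // Stand-in classifier that handles CSI only: ESC [ params intermediates final.
    public static int MeasureCsi(string text, int index)
    {
        if (index + 1 >= text.Length || text[index] != '\u001b' || text[index + 1] != '[')
            return 0;

        int i = index + 2;
        while (i < text.Length && text[i] >= 0x30 && text[i] <= 0x3F) i++; // parameter bytes
        while (i < text.Length && text[i] >= 0x20 && text[i] <= 0x2F) i++; // intermediate bytes
        if (i < text.Length && text[i] >= 0x40 && text[i] <= 0x7E)         // final byte
            return i + 1 - index;

        return 0; // incomplete or malformed: treat as ordinary text
    }
}
```

The stripped result contains only visible text, so it can then be measured cluster by cluster. StripEscapes("\u001b[31mred\u001b[0m", EscapeSkipSketch.MeasureCsi) returns "red".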

Notes

  • All three width entry points and AnsiTextWrap are pure and allocation-frugal; StringWidth / ClusterCount enumerate clusters without materializing substrings. Wrap builds a new string for its result.
  • These utilities measure and reshape text; they emit no bytes to the terminal themselves. For turning a Style into SGR bytes or moving the cursor, see Core output.

See also

  • Core output — SGR encoding, cursor/screen writers, and the style primitives the escape sequences carry.
  • Rendering: cell buffer and renderer — the grid layer that uses this width accounting to place wide glyphs and continuation cells.
  • Core input — parsing the terminal's input side (the counterpart to output).
