lexer: decode \f \b \v \a \0 \/ escape sequences in strings #293

Merged: danieljohnmorris merged 3 commits into main from fix/spl-escape-recognition on May 16, 2026

@danieljohnmorris (Collaborator)

Summary

ilo's lexer only decoded five escape sequences in string literals (\n \t \r \" \\). The other standard C/JSON escapes fell through to the unknown-escape fallback and were emitted as a literal backslash + letter. pdf-analyst hit this on the Llama 3 paper rerun: spl txt "\f" returned 1 over a 92-page PDF, because pdftotext writes 0x0C between pages, ilo's "\f" was two chars, and spl saw no separator. The same failure shape applies to null-terminated formats (\0), ANSI cursor edits (\b), vertical tab (\v), bell (\a), and JSON-style \/.

This adds the rest of the standard escape table so PDF/log/ANSI pipelines get the control characters they expect on the first try, with no regex escape hatch needed.
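
As an illustration of the escape table described above, here is a minimal Rust sketch of a decoder with the same observable behaviour. It is not the code from src/lexer/mod.rs (the real change lives inside a logos callback); the function name `decode_escapes` and the handling of a trailing lone backslash are assumptions made for this example.

```rust
/// Decode backslash escapes in the body of a string literal
/// (surrounding quotes already stripped).
/// Hypothetical helper for illustration, not the ilo lexer itself.
fn decode_escapes(body: &str) -> String {
    let mut out = String::with_capacity(body.len());
    let mut chars = body.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            // The five escapes the lexer already handled.
            Some('n') => out.push('\n'),
            Some('t') => out.push('\t'),
            Some('r') => out.push('\r'),
            Some('"') => out.push('"'),
            Some('\\') => out.push('\\'),
            // The six escapes added by this PR.
            Some('f') => out.push('\u{0C}'), // form feed
            Some('b') => out.push('\u{08}'), // backspace
            Some('v') => out.push('\u{0B}'), // vertical tab
            Some('a') => out.push('\u{07}'), // bell
            Some('0') => out.push('\0'),     // NUL
            Some('/') => out.push('/'),      // JSON-style solidus
            // Unknown escape: pass the backslash and the letter through.
            Some(other) => {
                out.push('\\');
                out.push(other);
            }
            // Assumption: a trailing lone backslash is kept as-is.
            None => out.push('\\'),
        }
    }
    out
}
```

With this sketch, `decode_escapes(r"page1\fpage2\fpage3")` contains two 0x0C characters, while `decode_escapes(r"\z")` still yields the two characters backslash and `z`.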

Manifesto framing: every persona doing per-page PDF analytics, ANSI-aware logs, or JSON-embedded content was paying a tax — either reach for a regex pattern they didn't need, or build the literal control byte some other way (workarounds included reporting "pages: 1" knowingly and moving on). One missing escape was eating retries across every PDF persona. The fix is a few lines of lexer match arms; the win compounds across every future agent encountering the same pattern.

Repro

Before:

ilo 'f>n;len (spl "page1\fpage2\fpage3" "\f")' f
1   # WRONG: lexer emitted two chars for "\f", split found no separator

After:

ilo 'f>n;len (spl "page1\fpage2\fpage3" "\f")' f
3   # correct: form-feed decoded to 0x0C, split saw two separators
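
The pdf-analyst failure from the summary can also be reproduced outside ilo. The Rust sketch below models text shaped like pdftotext output (a real 0x0C between pages) and splits it once on the undecoded two-character separator and once on the decoded form feed. The helper names `sample_pdf_text` and `count_pages` are hypothetical and exist only for this example.

```rust
/// Text shaped like pdftotext output: a real form feed (0x0C) between pages.
/// Hypothetical sample data for illustration.
fn sample_pdf_text() -> String {
    ["page1", "page2", "page3"].join("\u{0C}")
}

/// Count the pieces produced by splitting `text` on `sep`.
fn count_pages(text: &str, sep: &str) -> usize {
    text.split(sep).count()
}
```

Splitting on the two-character string backslash + `f` (`"\\f"` in Rust) finds no separator and reports one page; splitting on `"\u{0C}"` reports three.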

What's in the diff

Three commits:

  1. lexer: decode standard C-style string escapes (\f \b \v \a \0 \/) — adds the six new arms to the logos string-literal callback in src/lexer/mod.rs. Updates SPEC.md and ai.txt to document the full escape table with codepoints, the unknown-escape passthrough behaviour, and a pdftotext example. Also fixes a stale spec example that had spl "\n" text arg order backwards (canonical signature is spl t sep). Lexer unit tests cover the full set plus the unknown-escape fallback.

  2. test: cross-engine regression for string escape decoding. Adds tests/regression_string_escapes.rs, which runs five assertions across tree, VM, and Cranelift: the original pdf-analyst case (spl text "\f" returns 3), len of each escape is 1 (catches the "two chars" failure mode), form-feed survives chars as one scalar, mixed escapes in one literal all decode, and unknown escapes still pass through.

  3. example: string-escapes pins the pdftotext page-split case. Adds examples/string-escapes.ilo with -- run: / -- out: assertions so tests/examples_engines.rs exercises the new behaviour on every engine, and so agents have an in-context example.

Compatibility note

\/ previously decoded to literal \ + / (two chars via the unknown-escape passthrough) and now decodes to / (one char, matching JSON spec). I grepped the repo and there's no idiomatic ilo usage of \/ in strings; the other five escapes (\f \b \v \a \0) were vanishingly unlikely to be load-bearing in existing programs since control chars in source strings are almost always a mistake. The unknown-escape fallback is preserved for everything else and explicitly pinned by tests.
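
For anyone who wants to audit their own programs beyond a grep, the following Rust sketch lists every escape in a piece of source text whose meaning changes with this PR. It is an assumption-laden helper, not part of the repo: the name `changed_escapes` is hypothetical, and it scans raw text without tracking whether a match sits inside a string literal, so hits need manual review.

```rust
/// Return (byte offset of the backslash, escape character) for every escape
/// that decoded as backslash + char before this PR and decodes to a single
/// character after it. Hypothetical audit helper for illustration.
fn changed_escapes(src: &str) -> Vec<(usize, char)> {
    let mut hits = Vec::new();
    let mut iter = src.char_indices();
    while let Some((pos, c)) = iter.next() {
        if c != '\\' {
            continue;
        }
        // Consume the character after the backslash so that `\\f` is read
        // as an escaped backslash followed by a plain `f`, not as `\f`.
        if let Some((_, next)) = iter.next() {
            if matches!(next, 'f' | 'b' | 'v' | 'a' | '0' | '/') {
                hits.push((pos, next));
            }
        }
    }
    hits
}
```

An empty result means the text contains none of the six newly decoded escapes.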

Test plan

  • cargo test --release --features cranelift --lib lex_string → 3 passed
  • cargo test --release --features cranelift --test regression_string_escapes → 5 passed
  • cargo test --release --features cranelift --test examples_engines → passed (picks up examples/string-escapes.ilo)
  • cargo fmt --check → clean
  • CI green on this branch

Follow-ups

  • codegen/fmt.rs::escape_text only escapes \\ \" \n \r on emit. After this PR, re-formatting a string containing 0x0C will produce a literal form-feed byte rather than \f. Round-tripping was already incomplete (\t \0 etc.); worth a separate sweep when someone leans on the formatter for human-readable output.
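
One possible shape for that follow-up sweep, sketched in Rust under stated assumptions: the function below is not the existing codegen/fmt.rs::escape_text, its name `escape_text_full` is hypothetical, and it simply mirrors the decode table so that control characters are emitted as their escape sequences.

```rust
/// Escape a decoded string for emission as an ilo string literal body,
/// covering the full escape table. Hypothetical sketch, not the repo's code.
fn escape_text_full(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            '\u{0C}' => out.push_str("\\f"),
            '\u{08}' => out.push_str("\\b"),
            '\u{0B}' => out.push_str("\\v"),
            '\u{07}' => out.push_str("\\a"),
            '\0' => out.push_str("\\0"),
            // `/` needs no escaping: it decodes to itself either way.
            other => out.push(other),
        }
    }
    out
}
```

Under this sketch a string containing 0x0C is emitted as `\f` rather than as a raw form-feed byte.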

@codecov

codecov Bot commented May 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

Add the C/JSON escape table beyond the existing five (\n \t \r \" \\)
so PDF/log/ANSI pipelines get the control characters they expect.
pdf-analyst on the Llama 3 paper rerun hit this on `spl txt "\f"`
returning 1 over a 92-page PDF: pdftotext writes 0x0C between pages,
ilo's lexer was emitting two chars (backslash + 'f'), and split saw
no separator. Same shape for null-terminated formats (\0), ANSI
cursor edits (\b), and JSON-style \/.

Codepoints: \f=0x0C, \b=0x08, \v=0x0B, \a=0x07, \0=0x00, \/=/.
Unknown escapes (e.g. \z) keep the existing backslash-passthrough
fallback so programs that lean on it don't shift meaning.

SPEC.md + ai.txt updated to document the full table; also fixes a
stale spec example that had `spl "\n" text` arg order backwards
(canonical signature is `spl t sep`).

Lexer unit tests cover the full set + the unknown-escape fallback.

Pin the lexer escape fix across tree, VM, and Cranelift. The original
pdf-analyst case (`spl text "\f"` over a multi-page document) is the
load-bearing assertion; the rest catch silent backend drift.

- formfeed_split: 3 pages from `page1\fpage2\fpage3` on every engine
- single_escape_len_one: every escape decodes to exactly one scalar
- formfeed_codepoint: form-feed survives `chars` as one element
- mixed_escapes: six escapes in one literal all decode
- unknown_escape_preserves_backslash: locks in the passthrough fallback

In-context example for agents writing PDF/log pipelines: shows that
`spl txt "\f"` actually splits on form-feed, and that the rest of the
escape table decodes to single scalars. `tests/examples_engines.rs`
picks it up automatically so it doubles as a higher-level cross-engine
regression alongside `tests/regression_string_escapes.rs`.

danieljohnmorris force-pushed the fix/spl-escape-recognition branch from a84e4c4 to 2e7d299 on May 16, 2026 10:02
danieljohnmorris merged commit 891cf84 into main on May 16, 2026 (5 checks passed)
danieljohnmorris deleted the fix/spl-escape-recognition branch on May 16, 2026 10:10