lexer: decode \f \b \v \a \0 \/ escape sequences in strings by danieljohnmorris · Pull Request #293 · ilo-lang/ilo

danieljohnmorris · 2026-05-16T00:57:27Z

Summary

ilo's lexer only decoded five escape sequences in string literals (\n \t \r \" \\). The other standard C/JSON escapes fell through to the unknown-escape fallback and were emitted as literal backslash + letter. pdf-analyst on the Llama 3 paper rerun hit this on spl txt "\f" returning 1 over a 92-page PDF: pdftotext writes 0x0C between pages, ilo's "\f" was two chars, and spl saw no separator. Same shape for null-terminated formats (\0), ANSI cursor edits (\b), vertical tab (\v), bell (\a), and JSON-style \/.

This adds the rest of the standard escape table so PDF/log/ANSI pipelines get the control characters they expect on the first try, with no regex escape hatch needed.
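
For readers who want the shape of the change without opening the diff, here is a minimal sketch of an escape decoder with the same table and the same unknown-escape passthrough. It is not the logos callback in src/lexer/mod.rs: the function name `decode_escapes` and the idea of decoding an already-extracted literal body are assumptions made for illustration.

```rust
/// Hypothetical helper (not the real lexer callback): decode backslash
/// escapes in the body of a string literal, i.e. the text between the quotes.
fn decode_escapes(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            // The five escapes that were already decoded.
            Some('n') => out.push('\n'),
            Some('t') => out.push('\t'),
            Some('r') => out.push('\r'),
            Some('"') => out.push('"'),
            Some('\\') => out.push('\\'),
            // The six escapes described in this PR.
            Some('f') => out.push('\u{0C}'), // form feed
            Some('b') => out.push('\u{08}'), // backspace
            Some('v') => out.push('\u{0B}'), // vertical tab
            Some('a') => out.push('\u{07}'), // bell
            Some('0') => out.push('\0'),     // NUL
            Some('/') => out.push('/'),      // JSON-style solidus
            // Unknown escape: keep the backslash and the letter unchanged.
            Some(other) => {
                out.push('\\');
                out.push(other);
            }
            // Lone trailing backslash: keep it (an assumption of this sketch).
            None => out.push('\\'),
        }
    }
    out
}
```

Under this sketch, `decode_escapes(r"a\fb")` yields a three-character string, while `decode_escapes(r"a\zb")` keeps all four characters because `\z` is not in the table.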

Manifesto framing: every persona doing per-page PDF analytics, ANSI-aware logs, or JSON-embedded content was paying a tax, either reaching for a regex pattern they didn't need or building the literal control byte some other way (workarounds included knowingly reporting "pages: 1" and moving on). One missing escape was eating retries across every PDF persona. The fix is a few lines of lexer match arms; the win compounds across every future agent that encounters the same pattern.

Repro

Before:

ilo 'f>n;len (spl "page1\fpage2\fpage3" "\f")' f
1   # WRONG: lexer emitted two chars for "\f", split found no separator

After:

ilo 'f>n;len (spl "page1\fpage2\fpage3" "\f")' f
3   # correct: form-feed decoded to 0x0C, split saw two separators
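
The pdftotext failure described in the Summary can also be modelled outside ilo. The sketch below assumes the input text already contains real 0x0C bytes (as pdftotext output does) and compares the separator an undecoded "\f" literal amounts to with the one a decoded literal amounts to. `piece_count` and the constants are hypothetical stand-ins, not ilo's `spl` implementation.

```rust
/// Count the pieces produced by splitting `text` on `sep`
/// (a stand-in for ilo's `len (spl t sep)`, not its implementation).
fn piece_count(text: &str, sep: &str) -> usize {
    text.split(sep).count()
}

/// Simulated pdftotext output: three pages separated by real form feeds (0x0C).
const PDF_TEXT: &str = "page1\u{0C}page2\u{0C}page3";

/// What an undecoded "\f" literal amounts to: a backslash followed by 'f'.
const UNDECODED_SEP: &str = "\\f";

/// What a decoded "\f" literal amounts to: a single form-feed character.
const DECODED_SEP: &str = "\u{0C}";
```

Here `piece_count(PDF_TEXT, UNDECODED_SEP)` is 1, because the two-character separator never occurs in the text, and `piece_count(PDF_TEXT, DECODED_SEP)` is 3.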

What's in the diff

Three commits:

lexer: decode standard C-style string escapes (\f \b \v \a \0 \/) — adds the six new arms to the logos string-literal callback in src/lexer/mod.rs. Updates SPEC.md and ai.txt to document the full escape table with codepoints, the unknown-escape passthrough behaviour, and a pdftotext example. Also fixes a stale spec example that had spl "\n" text arg order backwards (canonical signature is spl t sep). Lexer unit tests cover the full set plus the unknown-escape fallback.
test: cross-engine regression for string escape decoding — tests/regression_string_escapes.rs runs five assertions across tree, VM, and Cranelift: the original pdf-analyst case (spl text "\f" returns 3), len of each escape is 1 (catches the "two chars" failure mode), form-feed survives chars as one scalar, mixed escapes in one literal all decode, and unknown escapes still passthrough.
example: string-escapes pins the pdftotext page-split case — examples/string-escapes.ilo with -- run: / -- out: assertions so tests/examples_engines.rs exercises the new behaviour on every engine, and so agents have an in-context example.

Compatibility note

\/ previously decoded to literal \ + / (two chars via the unknown-escape passthrough) and now decodes to / (one char, matching JSON spec). I grepped the repo and there's no idiomatic ilo usage of \/ in strings; the other five escapes (\f \b \v \a \0) were vanishingly unlikely to be load-bearing in existing programs since control chars in source strings are almost always a mistake. The unknown-escape fallback is preserved for everything else and explicitly pinned by tests.

Test plan

cargo test --release --features cranelift --lib lex_string → 3 passed
cargo test --release --features cranelift --test regression_string_escapes → 5 passed
cargo test --release --features cranelift --test examples_engines → passed (picks up examples/string-escapes.ilo)
cargo fmt --check → clean
CI green on this branch

Follow-ups

codegen/fmt.rs::escape_text only escapes \\ \" \n \r on emit. After this PR, re-formatting a string containing 0x0C will produce a literal form-feed byte rather than \f. Round-tripping was already incomplete (\t \0 etc.); worth a separate sweep when someone leans on the formatter for human-readable output.
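
One possible shape for that follow-up sweep is an emit-side table that mirrors the lexer's decode table. This is a sketch only: `escape_for_source` is a hypothetical name, it is not the existing `escape_text`, and whether the formatter should emit every one of these escapes is a design choice left open here.

```rust
/// Hypothetical emit-side counterpart to the lexer's escape table.
/// NOT the real `codegen/fmt.rs::escape_text`; written for illustration.
fn escape_for_source(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            '\u{0C}' => out.push_str("\\f"), // form feed
            '\u{08}' => out.push_str("\\b"), // backspace
            '\u{0B}' => out.push_str("\\v"), // vertical tab
            '\u{07}' => out.push_str("\\a"), // bell
            '\0' => out.push_str("\\0"),     // NUL
            // '/' needs no escaping on emit, so it falls through unchanged.
            other => out.push(other),
        }
    }
    out
}
```

With a table like this, a string containing 0x0C would be re-emitted as the two visible characters `\f` instead of a raw form-feed byte.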

codecov · 2026-05-16T01:00:42Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

Add the C/JSON escape table beyond the existing five (\n \t \r \" \\) so PDF/log/ANSI pipelines get the control characters they expect. pdf-analyst on the Llama 3 paper rerun hit this on `spl txt "\f"` returning 1 over a 92-page PDF: pdftotext writes 0x0C between pages, ilo's lexer was emitting two chars (backslash + 'f'), and split saw no separator. Same shape for null-terminated formats (\0), ANSI cursor edits (\b), and JSON-style \/.

Codepoints: \f=0x0C, \b=0x08, \v=0x0B, \a=0x07, \0=0x00, \/=/.

Unknown escapes (e.g. \z) keep the existing backslash-passthrough fallback so programs that lean on it don't shift meaning.

SPEC.md + ai.txt updated to document the full table; also fixes a stale spec example that had `spl "\n" text` arg order backwards (canonical signature is `spl t sep`). Lexer unit tests cover the full set + the unknown-escape fallback.

Pin the lexer escape fix across tree, VM, and Cranelift. The original pdf-analyst case (`spl text "\f"` over a multi-page document) is the load-bearing assertion; the rest catch silent backend drift.

- formfeed_split: 3 pages from `page1\fpage2\fpage3` on every engine
- single_escape_len_one: every escape decodes to exactly one scalar
- formfeed_codepoint: form-feed survives `chars` as one element
- mixed_escapes: six escapes in one literal all decode
- unknown_escape_preserves_backslash: locks in the passthrough fallback

In-context example for agents writing PDF/log pipelines: shows that `spl txt "\f"` actually splits on form-feed, and that the rest of the escape table decodes to single scalars. `tests/examples_engines.rs` picks it up automatically so it doubles as a higher-level cross-engine regression alongside `tests/regression_string_escapes.rs`.

danieljohnmorris added 3 commits May 16, 2026 11:00

danieljohnmorris force-pushed the fix/spl-escape-recognition branch from a84e4c4 to 2e7d299 Compare May 16, 2026 10:02

danieljohnmorris merged commit 891cf84 into main May 16, 2026
5 checks passed

danieljohnmorris deleted the fix/spl-escape-recognition branch May 16, 2026 10:10
