lexer: decode \f \b \v \a \0 \/ escape sequences in strings #293
Merged
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Add the C/JSON escape table beyond the existing five (\n \t \r \" \\) so PDF/log/ANSI pipelines get the control characters they expect. pdf-analyst on the Llama 3 paper rerun hit this on `spl txt "\f"` returning 1 over a 92-page PDF: pdftotext writes 0x0C between pages, ilo's lexer was emitting two chars (backslash + 'f'), and split saw no separator. Same shape for null-terminated formats (\0), ANSI cursor edits (\b), and JSON-style \/. Codepoints: \f=0x0C, \b=0x08, \v=0x0B, \a=0x07, \0=0x00, \/=/. Unknown escapes (e.g. \z) keep the existing backslash-passthrough fallback so programs that lean on it don't shift meaning. SPEC.md + ai.txt updated to document the full table; also fixes a stale spec example that had `spl "\n" text` arg order backwards (canonical signature is `spl t sep`). Lexer unit tests cover the full set + the unknown-escape fallback.
Pin the lexer escape fix across tree, VM, and Cranelift. The original pdf-analyst case (`spl text "\f"` over a multi-page document) is the load-bearing assertion; the rest catch silent backend drift.

- formfeed_split: 3 pages from `page1\fpage2\fpage3` on every engine
- single_escape_len_one: every escape decodes to exactly one scalar
- formfeed_codepoint: form-feed survives `chars` as one element
- mixed_escapes: six escapes in one literal all decode
- unknown_escape_preserves_backslash: locks in the passthrough fallback
In-context example for agents writing PDF/log pipelines: shows that `spl txt "\f"` actually splits on form-feed, and that the rest of the escape table decodes to single scalars. `tests/examples_engines.rs` picks it up automatically so it doubles as a higher-level cross-engine regression alongside `tests/regression_string_escapes.rs`.
Summary
ilo's lexer only decoded five escape sequences in string literals (`\n \t \r \" \\`). The other standard C/JSON escapes fell through to the unknown-escape fallback and were emitted as literal backslash + letter. pdf-analyst on the Llama 3 paper rerun hit this on `spl txt "\f"` returning 1 over a 92-page PDF: pdftotext writes 0x0C between pages, ilo's `"\f"` was two chars, and `spl` saw no separator. Same shape for null-terminated formats (`\0`), ANSI cursor edits (`\b`), vertical tab (`\v`), bell (`\a`), and JSON-style `\/`.

This adds the rest of the standard escape table so PDF/log/ANSI pipelines get the control characters they expect on the first try, with no regex escape hatch needed.
Manifesto framing: every persona doing per-page PDF analytics, ANSI-aware logs, or JSON-embedded content was paying a tax — either reach for a regex pattern they didn't need, or build the literal control byte some other way (workarounds included reporting "pages: 1" knowingly and moving on). One missing escape was eating retries across every PDF persona. The fix is a few lines of lexer match arms; the win compounds across every future agent encountering the same pattern.
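For readers who want the failure mode in miniature, here is a standalone Rust sketch of the decoding rule described above. It is not the code in this PR: the real change lives in the logos string-literal callback, and the `decode_escapes` name and signature here are assumptions made for illustration only.

```rust
/// Illustrative only: decode backslash escapes in the body of a string
/// literal (surrounding quotes already stripped). `decode_escapes` is a
/// hypothetical helper, not the function in src/lexer/mod.rs.
fn decode_escapes(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            // The five escapes the lexer already handled.
            Some('n') => out.push('\n'),
            Some('t') => out.push('\t'),
            Some('r') => out.push('\r'),
            Some('"') => out.push('"'),
            Some('\\') => out.push('\\'),
            // The six escapes this PR adds, with their codepoints.
            Some('f') => out.push('\u{0C}'),
            Some('b') => out.push('\u{08}'),
            Some('v') => out.push('\u{0B}'),
            Some('a') => out.push('\u{07}'),
            Some('0') => out.push('\0'),
            Some('/') => out.push('/'),
            // Unknown escape: pass the backslash and the character through.
            Some(other) => {
                out.push('\\');
                out.push(other);
            }
            // Trailing lone backslash (assumed behaviour): keep it as-is.
            None => out.push('\\'),
        }
    }
    out
}

fn main() {
    // Stand-in for pdftotext output: 0x0C between pages.
    let text = "page1\u{0C}page2\u{0C}page3";

    // Old behaviour: the separator was backslash + 'f', so nothing matched.
    println!("before: {}", text.split(r"\f").count()); // before: 1

    // New behaviour: the separator is the single form-feed scalar.
    let sep = decode_escapes(r"\f");
    println!("after: {}", text.split(sep.as_str()).count()); // after: 3
}
```

The catch-all arm is what keeps `\z` meaning backslash + `z`, so only the six newly listed escapes change behaviour.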
Repro
Before:
After:
What's in the diff
Three commits:
- `lexer: decode standard C-style string escapes (\f \b \v \a \0 \/)`: adds the six new arms to the logos string-literal callback in `src/lexer/mod.rs`. Updates `SPEC.md` and `ai.txt` to document the full escape table with codepoints, the unknown-escape passthrough behaviour, and a pdftotext example. Also fixes a stale spec example that had `spl "\n" text` arg order backwards (canonical signature is `spl t sep`). Lexer unit tests cover the full set plus the unknown-escape fallback.
- `test: cross-engine regression for string escape decoding`: `tests/regression_string_escapes.rs` runs five assertions across tree, VM, and Cranelift: the original pdf-analyst case (`spl text "\f"` returns 3), `len` of each escape is 1 (catches the "two chars" failure mode), form-feed survives `chars` as one scalar, mixed escapes in one literal all decode, and unknown escapes still pass through.
- `example: string-escapes pins the pdftotext page-split case`: `examples/string-escapes.ilo` with `-- run:` / `-- out:` assertions so `tests/examples_engines.rs` exercises the new behaviour on every engine, and so agents have an in-context example.

Compatibility note
`\/` previously decoded to literal `\` + `/` (two chars via the unknown-escape passthrough) and now decodes to `/` (one char, matching JSON spec). I grepped the repo and there's no idiomatic ilo usage of `\/` in strings; the other five escapes (`\f \b \v \a \0`) were vanishingly unlikely to be load-bearing in existing programs since control chars in source strings are almost always a mistake. The unknown-escape fallback is preserved for everything else and explicitly pinned by tests.

Test plan
- `cargo test --release --features cranelift --lib lex_string` → 3 passed
- `cargo test --release --features cranelift --test regression_string_escapes` → 5 passed
- `cargo test --release --features cranelift --test examples_engines` → passed (picks up `examples/string-escapes.ilo`)
- `cargo fmt --check` → clean

Follow-ups
`codegen/fmt.rs::escape_text` only escapes `\\ \" \n \r` on emit. After this PR, re-formatting a string containing 0x0C will produce a literal form-feed byte rather than `\f`. Round-tripping was already incomplete (`\t`, `\0`, etc.); worth a separate sweep when someone leans on the formatter for human-readable output.
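One possible shape for that sweep, sketched as standalone Rust. The real `escape_text` signature isn't shown in this PR, so `escape_text_full` below is a hypothetical name and the function is an assumption about what an emit-side mirror of the lexer table could look like, not a proposed patch.

```rust
/// Hypothetical emit-side counterpart to the lexer's escape table.
/// Not the existing `escape_text`; the name and signature are assumptions.
fn escape_text_full(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            // The four characters the follow-up says are escaped today.
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            // Control characters the lexer decodes but the emitter would
            // otherwise write out as raw bytes.
            '\t' => out.push_str("\\t"),
            '\u{0C}' => out.push_str("\\f"),
            '\u{08}' => out.push_str("\\b"),
            '\u{0B}' => out.push_str("\\v"),
            '\u{07}' => out.push_str("\\a"),
            '\0' => out.push_str("\\0"),
            // '/' needs no escape on emit; everything else is copied as-is.
            other => out.push(other),
        }
    }
    out
}

fn main() {
    // A form-feed between two letters is emitted as the two-character
    // escape rather than a raw 0x0C byte.
    println!("{}", escape_text_full("a\u{0C}b")); // prints a\fb
}
```

Because a literal backslash is always emitted as `\\`, a string that came from the unknown-escape passthrough (for example backslash + `z`) would still decode back to the same two characters.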