Snake_case field names in dot-access position #160
Merged
Real-world JSON is overwhelmingly snake_case (stargazers_count, change_1d, people_vaccinated_per_hundred), but record.snake_case dot-access used to fail because the lexer tokenises `_` as a separate Underscore token (used as wildcard). Every persona that touched real-world JSON in the assessment runs hit this and fell back to the verbose `jpth!` per-field workaround.

Added a post-lex pass that runs immediately before the ILO-L002 friendly-error scan from PR #154. After consuming `Dot` or `DotQuestion`, contiguous `Ident (Underscore (Ident|Number Ident?))*` runs (with strict span-contiguity, no whitespace gaps) get spliced back into a single `Token::Ident` using the original source slice. The L002 scan immediately below is byte-for-byte unchanged, so plain bindings like `my_var=5` still emit the friendly underscore error.

The merge handles three shapes uniformly inside one loop:
- `field_name` (Ident _ Ident)
- `change_1d` (Ident _ Number Ident)
- `ema_20d_change_5d` (alternating Ident / Number-Ident segments)

The trailing-letter-after-Number absorb lives inside the loop so each `_Number Ident?` group is consumed atomically and the loop continues to absorb further segments. Without this, alternating real-world field names like `ema_20d_change_5d` would only stitch the first two segments and leave the rest as separate tokens.

Single-pass linear walk with in-place splice. No parser changes needed: `expect_ident()` after `Dot` now sees one merged `Ident`.
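The merge described above can be sketched roughly as follows. This is a minimal illustration, not the code in `src/lexer/mod.rs`: the `Token` variants, the `Spanned` tuple, the toy `lex` function, and the names `merge_snake_fields`, `adjacent`, and `token_texts` are all assumptions made for the example, and the sketch does not require the `Dot` itself to touch the field name, which the description leaves unspecified.

```rust
use std::ops::Range;

/// Toy token kinds for this sketch; the real `Token` enum is assumed to be richer.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Ident(String),
    Number,
    Underscore,
    Dot,
    DotQuestion,
    Eq,
}

/// A token paired with its byte span in the source.
type Spanned = (Token, Range<usize>);

/// Hypothetical mini-lexer: letters start an Ident, digits form a Number,
/// and whitespace (or any other byte) is skipped.
fn lex(src: &str) -> Vec<Spanned> {
    let bytes = src.as_bytes();
    let (mut out, mut i) = (Vec::new(), 0);
    while i < bytes.len() {
        let (start, c) = (i, bytes[i]);
        i += 1;
        let tok = if c.is_ascii_alphabetic() {
            while i < bytes.len() && bytes[i].is_ascii_alphanumeric() { i += 1; }
            Token::Ident(src[start..i].to_string())
        } else if c.is_ascii_digit() {
            while i < bytes.len() && bytes[i].is_ascii_digit() { i += 1; }
            Token::Number
        } else {
            match c {
                b'_' => Token::Underscore,
                b'=' => Token::Eq,
                b'.' if bytes.get(i) == Some(&b'?') => { i += 1; Token::DotQuestion }
                b'.' => Token::Dot,
                _ => continue,
            }
        };
        out.push((tok, start..i));
    }
    out
}

fn is_ident(t: &Spanned) -> bool {
    matches!(t.0, Token::Ident(_))
}

/// True when token `next` exists and starts exactly where token `prev` ends.
fn adjacent(toks: &[Spanned], prev: usize, next: usize) -> bool {
    next < toks.len() && toks[prev].1.end == toks[next].1.start
}

/// After a Dot/DotQuestion, splice `Ident (Underscore (Ident|Number Ident?))*`
/// into one Ident whose text is the original source slice.
fn merge_snake_fields(src: &str, toks: &mut Vec<Spanned>) {
    let mut i = 0;
    while i + 1 < toks.len() {
        let head = i + 1;
        let after_dot = matches!(toks[i].0, Token::Dot | Token::DotQuestion);
        i = head;
        if !after_dot || !is_ident(&toks[head]) { continue; }
        let mut last = head; // index of the last token absorbed so far
        loop {
            let (us, seg) = (last + 1, last + 2);
            if !adjacent(toks, last, us) || toks[us].0 != Token::Underscore { break; }
            if !adjacent(toks, us, seg) { break; }
            match &toks[seg].0 {
                Token::Ident(_) => last = seg,
                Token::Number => {
                    // Absorb trailing letters (the `d` in `_1d`) in the same step.
                    let tail = seg + 1;
                    last = if adjacent(toks, seg, tail) && is_ident(&toks[tail]) { tail } else { seg };
                }
                _ => break,
            }
        }
        if last > head {
            let span = toks[head].1.start..toks[last].1.end;
            let merged = (Token::Ident(src[span.clone()].to_string()), span);
            toks.splice(head..=last, std::iter::once(merged));
        }
    }
}

/// Convenience for inspection: lex, merge, and return each token's source text.
fn token_texts(src: &str) -> Vec<&str> {
    let mut toks = lex(src);
    merge_snake_fields(src, &mut toks);
    toks.iter().map(|(_, span)| &src[span.clone()]).collect()
}
```

In this sketch, `token_texts("r.ema_20d_change_5d")` yields `["r", ".", "ema_20d_change_5d"]`, while `token_texts("my_var=5")` leaves the five original tokens untouched because no `Dot` precedes the run.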
22 cross-engine regression tests covering: simple `r.field_name`, multi-segment `r.people_vaccinated_per_hundred`, digit-trailing-letter `r.change_1d`, bare-digit `r.x_1`, alternating `r.x_2y_3z`, real-world `r.ema_20d_change_5d`, double-question-mark `r.?field_name?`, plus negative regressions confirming `my_var=5` bindings still emit the ILO-L002 friendly error and `r.foo bar` keeps tokens separate.

examples/snake-fields.ilo demonstrates the pattern on a record with three snake_case fields (single, multi-segment, trailing-digit). Three `-- run:` / `-- out:` cases so examples_engines.rs exercises each shape across every engine. One-line header.

PR #154's 30 friendly-error tests still pass unchanged. The L002 scan is downstream of the new merge pass; merged Ident tokens never reach L002, plain Ident-Underscore-Ident bindings still do.
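To illustrate why the pass ordering matters, here is a small sketch of a downstream scan that flags any Ident-Underscore-Ident triple still present in the stream. The real ILO-L002 scan from PR #154 is not shown in this PR, so `first_l002_hit`, the `Tok` enum, and the `ident` helper are hypothetical stand-ins; in particular, this toy ignores spans and only looks at token kinds.

```rust
/// Toy token kinds for this sketch (hypothetical, not the real lexer's enum).
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    Ident(String),
    Underscore,
    Dot,
    Eq,
    Number,
}

/// Stand-in for the friendly-error scan: returns the index of the first
/// Ident-Underscore-Ident triple left in the stream, if any.
fn first_l002_hit(toks: &[Tok]) -> Option<usize> {
    toks.windows(3)
        .position(|w| matches!(w, [Tok::Ident(_), Tok::Underscore, Tok::Ident(_)]))
}

/// Small constructor to keep the examples readable.
fn ident(s: &str) -> Tok {
    Tok::Ident(s.to_string())
}
```

Under these assumptions, the stream for `my_var=5` (`my`, `_`, `var`, `=`, `5`) is flagged at index 0, whereas an already-merged `r.field_name` stream (`r`, `.`, `field_name`) contains no triple and passes through. If the merge ran after the scan instead, the unmerged `r`, `.`, `field`, `_`, `name` stream would be flagged at index 2.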
Summary
Resolves the doc-cited "real-world JSON is overwhelmingly snake_case, so dot-access is a minority path" friction. Every persona that touched real JSON in the assessment runs (`stargazers_count`, `change_1d`, `people_vaccinated_per_hundred`, `budget_authority_amount`, `ema_20d_change_5d`) had to fall back to the verbose `jpth!` per-field workaround because `r.snake_case` failed at the lexer.

Through the manifesto lens: every API call agents make against real-world endpoints used to require a token-expensive workaround. Now the natural dot-access form works.
Approach
Lexer-cooperative: added a post-lex merge pass that runs immediately before PR #154's `ILO-L002` friendly-error scan. After consuming `Dot` or `DotQuestion`, contiguous `Ident (Underscore (Ident|Number Ident?))*` runs (with strict span-contiguity, no whitespace gaps) get spliced into a single `Token::Ident` using the original source slice. The L002 scan immediately below is byte-for-byte unchanged, so plain bindings like `my_var=5` still emit the friendly underscore error; only dot-access position gets the loosened identifier rule.

The merge handles three shapes uniformly inside one loop:
- `field_name` (Ident _ Ident)
- `change_1d` (Ident _ Number Ident)
- `ema_20d_change_5d` (alternating `Ident / Number-Ident` segments)

The trailing-letter-after-Number absorb lives inside the loop so each `_Number Ident?` group is consumed atomically and the loop continues. Without this, alternating real-world names like `ema_20d_change_5d` would only stitch the first two segments.

What's in the diff
- `lexer: merge snake_case field names in dot-access position`: `src/lexer/mod.rs` post-lex pass, +64 lines. No parser changes needed; `expect_ident()` after a `Dot` now sees one merged `Ident`.
- `tests + example: pin snake_case dot-access across engines`: 22 cross-engine tests covering simple, multi-segment, digit-trailing-letter, bare-digit, alternating, and real-world shapes, plus negative regressions (`my_var=5` still errors with ILO-L002; `r.foo bar` keeps tokens separate). `examples/snake-fields.ilo` demonstrates the pattern with three `-- run:` / `-- out:` cases so `tests/examples_engines.rs` exercises each shape across every engine.

Test plan
- `cargo test --release --features cranelift`: full suite green, +10 new regression tests + 6 new example-engine cases
- `cargo fmt --all -- --check` clean
- `cargo clippy --all-targets --features cranelift -- -D warnings` clean
- `r.ema_20d_change_5d` shape (alternating segments) stitches correctly across tree/vm/cranelift
- `my_var=5` bindings still emit ILO-L002 (binding-position underscore still flagged)
- `r.foo bar` keeps tokens separate (no false merge across whitespace)

Follow-ups
- `r.foo._bar` (leading underscore on inner field) is not handled. Deferred: the doc-cited shapes are all flat snake_case, and starting a field with `_` is unusual in JSON.