refactor(lexer): split read_word_token into classify + advance + dispatch helpers#63
Merged
Merged
Conversation
…atch helpers `read_word_token` was 130 lines with `#[allow(clippy::too_many_lines)]`, interleaving three logical phases: character-by-character word assembly, post-assembly classification, and context-flag update for the next token. Split into six private helpers on `impl Lexer`: - `classify_word` — encodes the reserved-word / AssignmentWord / Word classification plus the `]]` → Word fixup when not inside `[[ ]]` (issues #35, #37). - `advance_word_context` — sets `command_start` / `reserved_words_ok` for the next token, including the AssignmentWord special case from issue #44. - `read_open_paren_word` — handles the `(` mid-word dispatch (array assignment, extglob, or break). Returns `ControlFlow<()>` so the outer `while let` loop knows whether to break. - `read_angle_bracket_word` — same pattern for mid-word `<(…)` / `>(…)` process substitutions. - `is_extglob_trigger` — consolidates the twin `@|?|+` and `!|*` match arms into one predicate, correctly gating `!`/`*` behind `config.extglob`. - `word_enters_bracket_subscript` — predicate extracting the guard condition for `[` bracket-subscript absorption. Main `read_word_token` body drops from 130 to 45 lines; the `too_many_lines` allow is gone. Add two regression guards in `src/lexer/tests.rs`: - `extglob_prefix_without_paren_is_ordinary_char` — bare `@`, `?`, `+`, `!`, `*` without a following `(` stay ordinary word chars. - `extglob_disabled_does_not_absorb_paren` — with `extglob=false`, `!(cmd)` and `foo*(bar)` must produce distinct `LeftParen` tokens rather than being absorbed as extglob words. Pins the `config.extglob` gate of `is_extglob_trigger`, which was otherwise only covered by integration tests that enable extglob. No behavior change — 243 pre-existing tests still pass (+2 new guard tests = 245). Downstream rippy (1332 tests) and tokf (cargo check) both green against a local patched build. Closes #52 Refs #61 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mpecan
added a commit
that referenced
this pull request
Apr 16, 2026
…pers
`read_matched_parens_inner` in `src/lexer/expansions.rs` was ~140 lines
with `#[allow(clippy::too_many_lines)]`, driving all `$((…))`
arithmetic and `@(…)`/`?(…)`/etc. extglob tokenisation. A second stale
`too_many_lines` allow on the 10-line `read_matched_parens` wrapper was
cruft left over from a previous consolidation.
Introduce a private `ParenLoopState { depth, case, word_buf }` struct
(adapted from the issue's `ParenDepth` suggestion to reflect what the
loop actually tracks) with an `end_word()` method that bundles the
`case.check_word(&word_buf)` + `word_buf.clear()` pair called from
every structural match arm.
Split the 12-arm dispatch into 7 per-arm helpers on `impl Lexer`,
reusing the `Result<ControlFlow<()>>` pattern established in PR 2
(#63) for the `)` arm that decides whether the outer loop should
return:
- `handle_paren_close` — `)` arm with case-pattern-close branching.
- `handle_paren_open` — `(` arm with pattern-open gating.
- `read_paren_quote` — shared `'` / `"` dispatch.
- `handle_paren_escape` — `\` with line-continuation handling.
- `handle_paren_comment` — `#` with whitespace-prefix gating.
- `handle_paren_semi` — `;` / `;;` / `;&` / `;;&` variants.
- `handle_paren_word_boundary` — shared space/tab/newline/`|`.
`read_matched_parens_inner` body shrinks from ~140 to ~40 lines; both
`too_many_lines` allows are gone.
Behavior-preserving — every sharp-edge invariant is encoded verbatim
in the helpers: `end_word` order, which arms clear vs. end-word, case
tracker branching on `is_pattern_close`/`is_pattern_open`, comment
gating on whitespace prefix, `;;`/`;&`/`;;&` `resume_pattern` calls,
backslash-newline line continuation.
Add five regression guards in `src/lexer/tests.rs`:
- `arithmetic_double_paren_closes_at_double_right_paren` — `$((1+2))`
exercises the `close_count=2` path through `handle_paren_close`.
- `extglob_with_alternation_is_one_word` — `@(foo|bar|baz)` exercises
the `|` word-boundary arm under extglob.
- `paren_content_comment_stops_at_newline` — `$(( # comment\n 1+2 ))`
exercises `handle_paren_comment`'s whitespace-prefix gate and the
re-entry into the close arm after the newline.
- `empty_extglob_closes_at_immediate_right_paren` — `@()` hits
depth==0 immediately.
- `trailing_backslash_in_arith_is_graceful_error` — `\` at EOF inside
`$((…))` must produce a matched-pair error (not panic).
Document the `case.*` branches as defensive — not reached by current
callers since `$(…)` forks to `parse_paren_body` and `case` inside
`$((…))` is rejected by bash, but preserved for future reuse.
No behavior change — the 245 pre-existing tests pass, plus 5 new
guards (250 total). Downstream rippy (1332 tests) and tokf (cargo
check) both green against a local patched build.
Closes #53
Refs #61
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5 tasks
mpecan
added a commit
that referenced
this pull request
Apr 16, 2026
…pers (#64) Closes #53 · Part of #61 (v0.2.0 Refactoring Cycle, PR 3 of 10) > ⚠ This is the highest-risk PR of the cycle. Nested parens drive `$((…))` arithmetic and `@(…)`/`?(…)` extglob tokenisation, so extra reviewer attention on behavior preservation is warranted. ## Summary Decomposes the ~140-line `read_matched_parens_inner` in `src/lexer/expansions.rs` into 7 per-arm helpers backed by a new private `ParenLoopState { depth, case, word_buf }` struct. Reuses the `Result<ControlFlow<()>>` pattern established in PR 2 (#63) for the `)` arm that signals "break outer loop". Also removes a stale `#[allow(clippy::too_many_lines)]` from the 10-line `read_matched_parens` wrapper (cruft left over from a previous consolidation). Adapts the issue's original plan where the code had moved on: - **Fork/merge extraction** was already done as `read_paren_body_forked` — skipped. - **`ParenDepth { parens, braces, brackets }`** suggestion doesn't fit since the loop only tracks paren depth; adapted to `ParenLoopState { depth, case, word_buf }` matching actual state. ## Test plan - [x] `cargo fmt && cargo clippy --all-targets -- -D warnings` — clean. - [x] `cargo test` — 250 passed (was 245; +5 new guard tests). - [x] Downstream smoke: `cargo test` against `../rippy` (1332 tests) with rable patched — all green. - [x] Downstream smoke: `cargo check --all-targets -p tokf` against `../tokf` — clean. - [x] `grep -n "too_many_lines" src/lexer/expansions.rs` — zero matches. ## Five new guard tests 1. `arithmetic_double_paren_closes_at_double_right_paren` — `$((1+2))` exercises the `close_count=2` path through `handle_paren_close`. 2. `extglob_with_alternation_is_one_word` — `@(foo|bar|baz)` exercises the `|` word-boundary arm under extglob. 3. `paren_content_comment_stops_at_newline` — `$(( # comment\n 1+2 ))` exercises `handle_paren_comment`'s whitespace-prefix gate and re-entry into the close arm after newline. 4. `empty_extglob_closes_at_immediate_right_paren` — `@()` hits depth==0 immediately (raised by Agent 4 of the review). 5. `trailing_backslash_in_arith_is_graceful_error` — `\` at EOF inside `$((…))` must produce a matched-pair error rather than panic (raised by Agent 4 of the review). ## Behavior preservation All eight sharp-edge invariants preserved verbatim, documented in the plan: - `end_word` order: `case.check_word` BEFORE `word_buf.clear`. - Which arms call `end_word` vs only `clear`: `)`, `(`, `'`, `"`, `#`, `;`, whitespace, `|` → `end_word`; `\`, `$`, `` ` `` → bare `clear`. - `)` branching on `case.is_pattern_close()` — case-pattern `)` does NOT decrement depth. - `(` branching on `case.is_pattern_open()` — case-pattern leading `(` does NOT increment depth. - `#` comment gating: only starts a comment when preceded by whitespace/newline. - `;;` / `;&` / `;;&` variants call `case.resume_pattern()`. - `\\\n` line continuation: both chars consumed, neither pushed. - Bare char accumulates `word_buf` (no clear). The `case.*` branches are documented as **defensive** — not reached by current callers since `$(…)` forks to `parse_paren_body` and `case` inside `$((…))` is invalid bash, but preserved for future reuse. ## Architectural note Agent 3 flagged that this pattern is a direct fit for `read_param_expansion_inner` and `read_backtick_inner` in the same file. Not in scope here, but de-risks potential future refactors. ## Stack Fresh from `main`; no stack. Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mpecan
pushed a commit
that referenced
this pull request
Apr 19, 2026
🤖 I have created a release *beep* *boop* --- ## [0.2.0](rable-v0.1.15...rable-v0.2.0) (2026-04-18) ### ⚠ BREAKING CHANGES * tighten lexer API surface and relocate WordSpan to ast ([#70](#70)) ### Bug Fixes * **format:** align cmdsub reformatter with bash canonical form ([#49](#49)) ([c7a4411](c7a4411)) * **lexer:** accept sloppy heredoc terminator in cmdsub mode ([#50](#50)) ([40f394f](40f394f)) * **lexer:** backticks opaque when content is invalid ([#71](#71)) ([e72166f](e72166f)), closes [#38](#38) * **lexer:** disable reserved-word recognition after assignment words ([#44](#44)) ([42e1fc0](42e1fc0)) * **lexer:** stop treating ]] and unbalanced [...] as special outside conditionals ([#45](#45)) ([4bf5a5c](4bf5a5c)) * **parser:** fall back from (( … )) arith to nested subshells ([#48](#48)) ([1437f00](1437f00)) ### Code Refactoring * **format:** introduce Formatter struct ([#65](#65)) ([d965a8f](d965a8f)) * **lexer:** drop Result<Token> wrapper from operator readers ([#62](#62)) ([d52a841](d52a841)) * **lexer:** split read_word_token into classify + advance + dispatch helpers ([#63](#63)) ([3ba09f5](3ba09f5)) * **parser:** extract fill_heredoc_contents visitor helpers ([#68](#68)) ([40e6165](40e6165)) * **parser:** extract helpers from three oversize parsers ([#69](#69)) ([25d0762](25d0762)) * **sexp:** dispatch NodeKind Display to per-category helpers ([#66](#66)) ([44b0330](44b0330)) * **sexp:** table-drive ANSI-C escape dispatch ([#67](#67)) ([91a5267](91a5267)) * tighten lexer API surface and relocate WordSpan to ast ([#70](#70)) ([5171d01](5171d01)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). Co-authored-by: repository-butler[bot] <166800726+repository-butler[bot]@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #52 · Part of #61 (v0.2.0 Refactoring Cycle, PR 2 of 10)
Summary
Splits the 130-line
read_word_tokeninsrc/lexer/words.rsinto six private helpers onimpl Lexer:classify_word— reserved-word / AssignmentWord / Word classification +]]→ Word fixup (issues lexer:]]not tokenized as separate word outside[[ ... ]]#35, parser: reserved words should parse as plain words in non-reserved positions #37).advance_word_context— updatescommand_start/reserved_words_okfor the next token (issue fix(lexer): disable reserved-word recognition after assignment words #44).read_open_paren_word— the(mid-word dispatch (array assignment, extglob, or break). Returnsstd::ops::ControlFlow<()>so the outer loop knows whether to break.read_angle_bracket_word— mid-word<(…)/>(…)process substitution.is_extglob_trigger— consolidates the twin@|?|+and!|*match arms into one predicate, correctly gating!/*behindconfig.extglob.word_enters_bracket_subscript— predicate extracting the[bracket-subscript guard.read_word_tokenbody shrinks from 130 to 45 lines; the#[allow(clippy::too_many_lines)]allow is gone.Net diff: +148 / −64 across 2 files.
Test plan
cargo fmt && cargo clippy --all-targets -- -D warnings— clean.cargo test— 245 passed, 0 failed (was 243; +2 new guard tests).cargo testagainst../rippy(1332 tests) with rable patched to this branch — all green.cargo check --all-targets -p tokfagainst../tokf— clean.grep -n "too_many_lines" src/lexer/words.rs— zero matches.New guard tests
Two tests added to
src/lexer/tests.rspinning invariants this refactor centralizes:extglob_prefix_without_paren_is_ordinary_char— bare@,?,+,!,*without a following(stay ordinary word chars.extglob_disabled_does_not_absorb_paren— raised by the multi-agent review (Agent 4). Withextglob=false,!(cmd)andfoo*(bar)must produce distinctLeftParentokens rather than being absorbed as extglob words. Pins theconfig.extglobgate ofis_extglob_trigger, which was otherwise only covered by integration tests that enable extglob.Behavior preservation
All four sharp-edge invariants noted in the issue are preserved verbatim:
command_startANDreserved_words_ok.]]not tokenized as separate word outside[[ ... ]]#35:]]outside[[ ]]is a Word, notDoubleRightBracket.AssignmentWordsetscommand_start=truebut clearsreserved_words_ok.@|?|+always trigger on(;!|*only whenconfig.extglobis on.Architectural note
The
Result<ControlFlow<()>>pattern established here is a direct fit for PR 3 (#53,lexer/expansions.rs—read_matched_parens_inner), flagged as the highest-risk PR of the cycle. Landing this pattern on the simplerwords.rscase first de-risks that follow-up.Stack
Fresh from
main; no stack.