refactor(lexer): split read_word_token into classify + advance + dispatch helpers by mpecan · Pull Request #63 · mpecan/rable

mpecan · 2026-04-16T13:56:24Z

Closes #52 · Part of #61 (v0.2.0 Refactoring Cycle, PR 2 of 10)

Summary

Splits the 130-line read_word_token in src/lexer/words.rs into six private helpers on impl Lexer:

classify_word — reserved-word / AssignmentWord / Word classification + ]] → Word fixup (issues #35 "lexer: ]] not tokenized as separate word outside [[ ... ]]" and #37 "parser: reserved words should parse as plain words in non-reserved positions").
advance_word_context — updates command_start / reserved_words_ok for the next token (issue #44 "fix(lexer): disable reserved-word recognition after assignment words").
read_open_paren_word — the ( mid-word dispatch (array assignment, extglob, or break). Returns std::ops::ControlFlow<()> so the outer loop knows whether to break.
read_angle_bracket_word — mid-word <(…) / >(…) process substitution.
is_extglob_trigger — consolidates the twin @|?|+ and !|* match arms into one predicate, correctly gating !/* behind config.extglob.
word_enters_bracket_subscript — predicate extracting the [ bracket-subscript guard.

read_word_token body shrinks from 130 to 45 lines; the #[allow(clippy::too_many_lines)] allow is gone.

Net diff: +148 / −64 across 2 files.

Test plan

cargo fmt && cargo clippy --all-targets -- -D warnings — clean.
cargo test — 245 passed, 0 failed (was 243; +2 new guard tests).
Downstream smoke: cargo test against ../rippy (1332 tests) with rable patched to this branch — all green.
Downstream smoke: cargo check --all-targets -p tokf against ../tokf — clean.
grep -n "too_many_lines" src/lexer/words.rs — zero matches.

New guard tests

Two tests added to src/lexer/tests.rs pinning invariants this refactor centralizes:

extglob_prefix_without_paren_is_ordinary_char — bare @, ?, +, !, * without a following ( stay ordinary word chars.
extglob_disabled_does_not_absorb_paren — raised by the multi-agent review (Agent 4). With extglob=false, !(cmd) and foo*(bar) must produce distinct LeftParen tokens rather than being absorbed as extglob words. Pins the config.extglob gate of is_extglob_trigger, which was otherwise only covered by integration tests that enable extglob.
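
As a rough illustration of the gate these two tests pin, here is a minimal sketch of such a predicate. The `Config` struct, the `(c, next, config)` signature, and the free-function form are assumptions made for the example; the real `is_extglob_trigger` is a private method on `impl Lexer` and may look different.

```rust
/// Hypothetical stand-in for the lexer configuration; only the
/// `extglob` flag is taken from the PR description.
struct Config {
    extglob: bool,
}

/// Sketch: does prefix char `c`, followed by `next`, open an extglob group?
/// The signature is an assumption for illustration.
fn is_extglob_trigger(c: char, next: Option<char>, config: &Config) -> bool {
    // Without a following `(` the prefix is an ordinary word character.
    if next != Some('(') {
        return false;
    }
    match c {
        // These prefixes trigger on `(` regardless of the extglob setting.
        '@' | '?' | '+' => true,
        // These prefixes trigger only when extglob is enabled.
        '!' | '*' => config.extglob,
        _ => false,
    }
}
```

With `extglob` off, `is_extglob_trigger('!', Some('('), &cfg)` is false in this sketch, so the caller would leave the `(` to be emitted as its own token, which is the behavior the second guard test describes.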

Behavior preservation

All four sharp-edge invariants noted in the issue are preserved verbatim:

Issue #37 (parser: reserved words should parse as plain words in non-reserved positions): reserved-word classification requires both command_start AND reserved_words_ok.
Issue #35 (lexer: ]] not tokenized as separate word outside [[ ... ]]): ]] outside [[ ]] is a Word, not DoubleRightBracket.
Issue #44 (fix(lexer): disable reserved-word recognition after assignment words): AssignmentWord sets command_start=true but clears reserved_words_ok.
Extglob gating: @|?|+ always trigger on (; !|* only when config.extglob is on.
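
The first three invariants can be sketched as a classify step followed by a context update. Everything below is a simplified model, not the crate's code: `TokenKind`, `Ctx`, the `in_cond` flag, the reserved-word list, and the assignment check are hypothetical, and only the AssignmentWord arm of the context update is taken from the list above.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum TokenKind {
    Reserved,
    DoubleRightBracket,
    AssignmentWord,
    Word,
}

/// Hypothetical lexer context; `in_cond` models "inside [[ ... ]]".
struct Ctx {
    command_start: bool,
    reserved_words_ok: bool,
    in_cond: bool,
}

/// Abbreviated reserved-word list, for illustration only.
fn is_reserved(text: &str) -> bool {
    matches!(text, "if" | "then" | "else" | "fi" | "while" | "do" | "done")
}

/// Simplified `name=value` check (assumption: a plain identifier before `=`).
fn looks_like_assignment(text: &str) -> bool {
    match text.split_once('=') {
        Some((name, _)) => {
            !name.is_empty()
                && !name.starts_with(|c: char| c.is_ascii_digit())
                && name.chars().all(|c| c.is_ascii_alphanumeric() || c == '_')
        }
        None => false,
    }
}

fn classify_word(text: &str, ctx: &Ctx) -> TokenKind {
    // #35: `]]` is only special inside `[[ ... ]]`.
    if text == "]]" {
        return if ctx.in_cond { TokenKind::DoubleRightBracket } else { TokenKind::Word };
    }
    // #37: reserved words need BOTH flags.
    if ctx.command_start && ctx.reserved_words_ok && is_reserved(text) {
        return TokenKind::Reserved;
    }
    // Assumption: assignments are recognized only at command start.
    if ctx.command_start && looks_like_assignment(text) {
        return TokenKind::AssignmentWord;
    }
    TokenKind::Word
}

fn advance_word_context(kind: TokenKind, ctx: &mut Ctx) {
    match kind {
        // #44: still at command start, but reserved words are off.
        TokenKind::AssignmentWord => {
            ctx.command_start = true;
            ctx.reserved_words_ok = false;
        }
        // Assumption: the other arms are placeholders for the example.
        TokenKind::Reserved => ctx.command_start = true,
        TokenKind::Word | TokenKind::DoubleRightBracket => ctx.command_start = false,
    }
}
```

In this model, classifying `FOO=1` and then `if` yields AssignmentWord followed by Word, because the assignment clears `reserved_words_ok` before the next token is classified.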

Architectural note

The Result<ControlFlow<()>> pattern established here is a direct fit for PR 3 (#53, lexer/expansions.rs — read_matched_parens_inner), flagged as the highest-risk PR of the cycle. Landing this pattern on the simpler words.rs case first de-risks that follow-up.
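
For readers unfamiliar with the pattern, the following is a minimal standalone sketch of a helper returning `Result<ControlFlow<()>, E>` to an outer loop. The `LexError` type, the function names, and the toy word rules are hypothetical and much simpler than the real lexer; only `std::ops::ControlFlow` and the idea of "helper tells the loop whether to break" come from the text above.

```rust
use std::iter::Peekable;
use std::ops::ControlFlow;
use std::str::Chars;

/// Hypothetical error type for the sketch.
#[derive(Debug, PartialEq)]
struct LexError(&'static str);

/// Handles a `(` seen mid-word. Returns `Break` when the paren ends the
/// word, `Continue` when it was absorbed, `Err` on an unterminated group.
fn read_open_paren(
    word: &mut String,
    chars: &mut Peekable<Chars<'_>>,
) -> Result<ControlFlow<()>, LexError> {
    // Toy rule: only `@(`, `?(` and `+(` are absorbed into the word.
    if !matches!(word.chars().last(), Some('@' | '?' | '+')) {
        return Ok(ControlFlow::Break(()));
    }
    // Consume from the `(` through the first `)`.
    for c in chars.by_ref() {
        word.push(c);
        if c == ')' {
            return Ok(ControlFlow::Continue(()));
        }
    }
    Err(LexError("unterminated extglob group"))
}

/// Outer loop: `?` propagates errors, `is_break()` decides whether to stop.
fn read_word(input: &str) -> Result<String, LexError> {
    let mut chars = input.chars().peekable();
    let mut word = String::new();
    while let Some(&c) = chars.peek() {
        match c {
            ' ' | '\t' | '\n' => break,
            '(' => {
                if read_open_paren(&mut word, &mut chars)?.is_break() {
                    break;
                }
            }
            _ => {
                word.push(c);
                chars.next();
            }
        }
    }
    Ok(word)
}
```

The design point is that the helper cannot `break` the caller's loop directly, so it reports the decision as a value, while the surrounding `Result` keeps error propagation on the usual `?` path.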

Stack

Fresh from main; no stack.

…atch helpers

`read_word_token` was 130 lines with `#[allow(clippy::too_many_lines)]`, interleaving three logical phases: character-by-character word assembly, post-assembly classification, and context-flag update for the next token.

Split into six private helpers on `impl Lexer`:

- `classify_word` — encodes the reserved-word / AssignmentWord / Word classification plus the `]]` → Word fixup when not inside `[[ ]]` (issues #35, #37).
- `advance_word_context` — sets `command_start` / `reserved_words_ok` for the next token, including the AssignmentWord special case from issue #44.
- `read_open_paren_word` — handles the `(` mid-word dispatch (array assignment, extglob, or break). Returns `ControlFlow<()>` so the outer `while let` loop knows whether to break.
- `read_angle_bracket_word` — same pattern for mid-word `<(…)` / `>(…)` process substitutions.
- `is_extglob_trigger` — consolidates the twin `@|?|+` and `!|*` match arms into one predicate, correctly gating `!`/`*` behind `config.extglob`.
- `word_enters_bracket_subscript` — predicate extracting the guard condition for `[` bracket-subscript absorption.

Main `read_word_token` body drops from 130 to 45 lines; the `too_many_lines` allow is gone.

Add two regression guards in `src/lexer/tests.rs`:

- `extglob_prefix_without_paren_is_ordinary_char` — bare `@`, `?`, `+`, `!`, `*` without a following `(` stay ordinary word chars.
- `extglob_disabled_does_not_absorb_paren` — with `extglob=false`, `!(cmd)` and `foo*(bar)` must produce distinct `LeftParen` tokens rather than being absorbed as extglob words. Pins the `config.extglob` gate of `is_extglob_trigger`, which was otherwise only covered by integration tests that enable extglob.

No behavior change — 243 pre-existing tests still pass (+2 new guard tests = 245). Downstream rippy (1332 tests) and tokf (cargo check) both green against a local patched build.

Closes #52
Refs #61

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…pers

`read_matched_parens_inner` in `src/lexer/expansions.rs` was ~140 lines with `#[allow(clippy::too_many_lines)]`, driving all `$((…))` arithmetic and `@(…)`/`?(…)`/etc. extglob tokenisation. A second stale `too_many_lines` allow on the 10-line `read_matched_parens` wrapper was cruft left over from a previous consolidation.

Introduce a private `ParenLoopState { depth, case, word_buf }` struct (adapted from the issue's `ParenDepth` suggestion to reflect what the loop actually tracks) with an `end_word()` method that bundles the `case.check_word(&word_buf)` + `word_buf.clear()` pair called from every structural match arm.

Split the 12-arm dispatch into 7 per-arm helpers on `impl Lexer`, reusing the `Result<ControlFlow<()>>` pattern established in PR 2 (#63) for the `)` arm that decides whether the outer loop should return:

- `handle_paren_close` — `)` arm with case-pattern-close branching.
- `handle_paren_open` — `(` arm with pattern-open gating.
- `read_paren_quote` — shared `'` / `"` dispatch.
- `handle_paren_escape` — `\` with line-continuation handling.
- `handle_paren_comment` — `#` with whitespace-prefix gating.
- `handle_paren_semi` — `;` / `;;` / `;&` / `;;&` variants.
- `handle_paren_word_boundary` — shared space/tab/newline/`|`.

`read_matched_parens_inner` body shrinks from ~140 to ~40 lines; both `too_many_lines` allows are gone.

Behavior-preserving — every sharp-edge invariant is encoded verbatim in the helpers: `end_word` order, which arms clear vs. end-word, case tracker branching on `is_pattern_close`/`is_pattern_open`, comment gating on whitespace prefix, `;;`/`;&`/`;;&` `resume_pattern` calls, backslash-newline line continuation.

Add five regression guards in `src/lexer/tests.rs`:

- `arithmetic_double_paren_closes_at_double_right_paren` — `$((1+2))` exercises the `close_count=2` path through `handle_paren_close`.
- `extglob_with_alternation_is_one_word` — `@(foo|bar|baz)` exercises the `|` word-boundary arm under extglob.
- `paren_content_comment_stops_at_newline` — `$(( # comment\n 1+2 ))` exercises `handle_paren_comment`'s whitespace-prefix gate and the re-entry into the close arm after the newline.
- `empty_extglob_closes_at_immediate_right_paren` — `@()` hits depth==0 immediately.
- `trailing_backslash_in_arith_is_graceful_error` — `\` at EOF inside `$((…))` must produce a matched-pair error (not panic).

Document the `case.*` branches as defensive — not reached by current callers since `$(…)` forks to `parse_paren_body` and `case` inside `$((…))` is rejected by bash, but preserved for future reuse.

No behavior change — the 245 pre-existing tests pass, plus 5 new guards (250 total). Downstream rippy (1332 tests) and tokf (cargo check) both green against a local patched build.

Closes #53
Refs #61

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…pers (#64)

Closes #53 · Part of #61 (v0.2.0 Refactoring Cycle, PR 3 of 10)

> ⚠ This is the highest-risk PR of the cycle. Nested parens drive `$((…))` arithmetic and `@(…)`/`?(…)` extglob tokenisation, so extra reviewer attention on behavior preservation is warranted.

## Summary

Decomposes the ~140-line `read_matched_parens_inner` in `src/lexer/expansions.rs` into 7 per-arm helpers backed by a new private `ParenLoopState { depth, case, word_buf }` struct. Reuses the `Result<ControlFlow<()>>` pattern established in PR 2 (#63) for the `)` arm that signals "break outer loop". Also removes a stale `#[allow(clippy::too_many_lines)]` from the 10-line `read_matched_parens` wrapper (cruft left over from a previous consolidation).

Adapts the issue's original plan where the code had moved on:

- **Fork/merge extraction** was already done as `read_paren_body_forked` — skipped.
- **`ParenDepth { parens, braces, brackets }`** suggestion doesn't fit since the loop only tracks paren depth; adapted to `ParenLoopState { depth, case, word_buf }` matching actual state.

## Test plan

- [x] `cargo fmt && cargo clippy --all-targets -- -D warnings` — clean.
- [x] `cargo test` — 250 passed (was 245; +5 new guard tests).
- [x] Downstream smoke: `cargo test` against `../rippy` (1332 tests) with rable patched — all green.
- [x] Downstream smoke: `cargo check --all-targets -p tokf` against `../tokf` — clean.
- [x] `grep -n "too_many_lines" src/lexer/expansions.rs` — zero matches.

## Five new guard tests

1. `arithmetic_double_paren_closes_at_double_right_paren` — `$((1+2))` exercises the `close_count=2` path through `handle_paren_close`.
2. `extglob_with_alternation_is_one_word` — `@(foo|bar|baz)` exercises the `|` word-boundary arm under extglob.
3. `paren_content_comment_stops_at_newline` — `$(( # comment\n 1+2 ))` exercises `handle_paren_comment`'s whitespace-prefix gate and re-entry into the close arm after newline.
4. `empty_extglob_closes_at_immediate_right_paren` — `@()` hits depth==0 immediately (raised by Agent 4 of the review).
5. `trailing_backslash_in_arith_is_graceful_error` — `\` at EOF inside `$((…))` must produce a matched-pair error rather than panic (raised by Agent 4 of the review).

## Behavior preservation

All eight sharp-edge invariants preserved verbatim, documented in the plan:

- `end_word` order: `case.check_word` BEFORE `word_buf.clear`.
- Which arms call `end_word` vs only `clear`: `)`, `(`, `'`, `"`, `#`, `;`, whitespace, `|` → `end_word`; `\`, `$`, `` ` `` → bare `clear`.
- `)` branching on `case.is_pattern_close()` — case-pattern `)` does NOT decrement depth.
- `(` branching on `case.is_pattern_open()` — case-pattern leading `(` does NOT increment depth.
- `#` comment gating: only starts a comment when preceded by whitespace/newline.
- `;;` / `;&` / `;;&` variants call `case.resume_pattern()`.
- `\\\n` line continuation: both chars consumed, neither pushed.
- Bare char accumulates `word_buf` (no clear).

The `case.*` branches are documented as **defensive** — not reached by current callers since `$(…)` forks to `parse_paren_body` and `case` inside `$((…))` is invalid bash, but preserved for future reuse.

## Architectural note

Agent 3 flagged that this pattern is a direct fit for `read_param_expansion_inner` and `read_backtick_inner` in the same file. Not in scope here, but de-risks potential future refactors.

## Stack

Fresh from `main`; no stack.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
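
The `end_word` ordering invariant listed above can be illustrated with a small sketch. The `CaseTracker` type, its recording `check_word`, and the `usize` depth are hypothetical stand-ins chosen so the ordering is observable; only the struct's field names and the check-then-clear order are taken from the message.

```rust
/// Hypothetical case tracker that records each word it is shown.
#[derive(Default)]
struct CaseTracker {
    seen: Vec<String>,
}

impl CaseTracker {
    fn check_word(&mut self, word: &str) {
        self.seen.push(word.to_string());
    }
}

/// Field names follow the PR text; the field types are assumptions.
struct ParenLoopState {
    depth: usize,
    case: CaseTracker,
    word_buf: String,
}

impl ParenLoopState {
    /// Show the finished word to the case tracker, THEN clear the buffer.
    /// Swapping the two lines would hand the tracker an empty string.
    fn end_word(&mut self) {
        self.case.check_word(&self.word_buf);
        self.word_buf.clear();
    }
}
```

Bundling the pair in one method means each structural arm makes a single call and cannot get the order wrong.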

🤖 I have created a release *beep* *boop*

---

## [0.2.0](rable-v0.1.15...rable-v0.2.0) (2026-04-18)

### ⚠ BREAKING CHANGES

* tighten lexer API surface and relocate WordSpan to ast ([#70](#70))

### Bug Fixes

* **format:** align cmdsub reformatter with bash canonical form ([#49](#49)) ([c7a4411](c7a4411))
* **lexer:** accept sloppy heredoc terminator in cmdsub mode ([#50](#50)) ([40f394f](40f394f))
* **lexer:** backticks opaque when content is invalid ([#71](#71)) ([e72166f](e72166f)), closes [#38](#38)
* **lexer:** disable reserved-word recognition after assignment words ([#44](#44)) ([42e1fc0](42e1fc0))
* **lexer:** stop treating ]] and unbalanced [...] as special outside conditionals ([#45](#45)) ([4bf5a5c](4bf5a5c))
* **parser:** fall back from (( … )) arith to nested subshells ([#48](#48)) ([1437f00](1437f00))

### Code Refactoring

* **format:** introduce Formatter struct ([#65](#65)) ([d965a8f](d965a8f))
* **lexer:** drop Result<Token> wrapper from operator readers ([#62](#62)) ([d52a841](d52a841))
* **lexer:** split read_word_token into classify + advance + dispatch helpers ([#63](#63)) ([3ba09f5](3ba09f5))
* **parser:** extract fill_heredoc_contents visitor helpers ([#68](#68)) ([40e6165](40e6165))
* **parser:** extract helpers from three oversize parsers ([#69](#69)) ([25d0762](25d0762))
* **sexp:** dispatch NodeKind Display to per-category helpers ([#66](#66)) ([44b0330](44b0330))
* **sexp:** table-drive ANSI-C escape dispatch ([#67](#67)) ([91a5267](91a5267))
* tighten lexer API surface and relocate WordSpan to ast ([#70](#70)) ([5171d01](5171d01))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: repository-butler[bot] <166800726+repository-butler[bot]@users.noreply.github.com>

mpecan merged commit 3ba09f5 into main Apr 16, 2026
5 checks passed

mpecan deleted the feat/52-split-read-word-token branch April 16, 2026 13:57

repository-butler Bot mentioned this pull request Apr 16, 2026

chore(main): release rable 0.2.0 #47

Merged

mpecan mentioned this pull request Apr 16, 2026

refactor(lexer): decompose read_matched_parens_inner into per-arm helpers #64

Merged

5 tasks

mpecan merged 1 commit into main from feat/52-split-read-word-token
