Skip to content

refactor(lexer): split read_word_token into classify + advance + dispatch helpers#63

Merged
mpecan merged 1 commit into
mainfrom
feat/52-split-read-word-token
Apr 16, 2026
Merged

refactor(lexer): split read_word_token into classify + advance + dispatch helpers#63
mpecan merged 1 commit into
mainfrom
feat/52-split-read-word-token

Conversation

@mpecan
Copy link
Copy Markdown
Owner

@mpecan mpecan commented Apr 16, 2026

Closes #52 · Part of #61 (v0.2.0 Refactoring Cycle, PR 2 of 10)

Summary

Splits the 130-line read_word_token in src/lexer/words.rs into six private helpers on impl Lexer:

read_word_token body shrinks from 130 to 45 lines; the #[allow(clippy::too_many_lines)] allow is gone.

Net diff: +148 / −64 across 2 files.

Test plan

  • cargo fmt && cargo clippy --all-targets -- -D warnings — clean.
  • cargo test — 245 passed, 0 failed (was 243; +2 new guard tests).
  • Downstream smoke: cargo test against ../rippy (1332 tests) with rable patched to this branch — all green.
  • Downstream smoke: cargo check --all-targets -p tokf against ../tokf — clean.
  • grep -n "too_many_lines" src/lexer/words.rs — zero matches.

New guard tests

Two tests added to src/lexer/tests.rs pinning invariants this refactor centralizes:

  • extglob_prefix_without_paren_is_ordinary_char — bare @, ?, +, !, * without a following ( stay ordinary word chars.
  • extglob_disabled_does_not_absorb_paren — raised by the multi-agent review (Agent 4). With extglob=false, !(cmd) and foo*(bar) must produce distinct LeftParen tokens rather than being absorbed as extglob words. Pins the config.extglob gate of is_extglob_trigger, which was otherwise only covered by integration tests that enable extglob.

Behavior preservation

All four sharp-edge invariants noted in the issue are preserved verbatim:

Architectural note

The Result<ControlFlow<()>> pattern established here is a direct fit for PR 3 (#53, lexer/expansions.rsread_matched_parens_inner), flagged as the highest-risk PR of the cycle. Landing this pattern on the simpler words.rs case first de-risks that follow-up.

Stack

Fresh from main; no stack.

…atch helpers

`read_word_token` was 130 lines with `#[allow(clippy::too_many_lines)]`,
interleaving three logical phases: character-by-character word assembly,
post-assembly classification, and context-flag update for the next token.

Split into six private helpers on `impl Lexer`:

- `classify_word` — encodes the reserved-word / AssignmentWord / Word
  classification plus the `]]` → Word fixup when not inside `[[ ]]`
  (issues #35, #37).
- `advance_word_context` — sets `command_start` / `reserved_words_ok`
  for the next token, including the AssignmentWord special case from
  issue #44.
- `read_open_paren_word` — handles the `(` mid-word dispatch (array
  assignment, extglob, or break). Returns `ControlFlow<()>` so the
  outer `while let` loop knows whether to break.
- `read_angle_bracket_word` — same pattern for mid-word `<(…)` /
  `>(…)` process substitutions.
- `is_extglob_trigger` — consolidates the twin `@|?|+` and `!|*` match
  arms into one predicate, correctly gating `!`/`*` behind
  `config.extglob`.
- `word_enters_bracket_subscript` — predicate extracting the guard
  condition for `[` bracket-subscript absorption.

Main `read_word_token` body drops from 130 to 45 lines; the
`too_many_lines` allow is gone.

Add two regression guards in `src/lexer/tests.rs`:

- `extglob_prefix_without_paren_is_ordinary_char` — bare `@`, `?`, `+`,
  `!`, `*` without a following `(` stay ordinary word chars.
- `extglob_disabled_does_not_absorb_paren` — with `extglob=false`,
  `!(cmd)` and `foo*(bar)` must produce distinct `LeftParen` tokens
  rather than being absorbed as extglob words. Pins the
  `config.extglob` gate of `is_extglob_trigger`, which was otherwise
  only covered by integration tests that enable extglob.

No behavior change — 243 pre-existing tests still pass (+2 new guard
tests = 245). Downstream rippy (1332 tests) and tokf (cargo check)
both green against a local patched build.

Closes #52
Refs #61

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mpecan mpecan merged commit 3ba09f5 into main Apr 16, 2026
5 checks passed
@mpecan mpecan deleted the feat/52-split-read-word-token branch April 16, 2026 13:57
mpecan added a commit that referenced this pull request Apr 16, 2026
…pers

`read_matched_parens_inner` in `src/lexer/expansions.rs` was ~140 lines
with `#[allow(clippy::too_many_lines)]`, driving all `$((…))`
arithmetic and `@(…)`/`?(…)`/etc. extglob tokenisation. A second stale
`too_many_lines` allow on the 10-line `read_matched_parens` wrapper was
cruft left over from a previous consolidation.

Introduce a private `ParenLoopState { depth, case, word_buf }` struct
(adapted from the issue's `ParenDepth` suggestion to reflect what the
loop actually tracks) with an `end_word()` method that bundles the
`case.check_word(&word_buf)` + `word_buf.clear()` pair called from
every structural match arm.

Split the 12-arm dispatch into 7 per-arm helpers on `impl Lexer`,
reusing the `Result<ControlFlow<()>>` pattern established in PR 2
(#63) for the `)` arm that decides whether the outer loop should
return:

- `handle_paren_close` — `)` arm with case-pattern-close branching.
- `handle_paren_open` — `(` arm with pattern-open gating.
- `read_paren_quote` — shared `'` / `"` dispatch.
- `handle_paren_escape` — `\` with line-continuation handling.
- `handle_paren_comment` — `#` with whitespace-prefix gating.
- `handle_paren_semi` — `;` / `;;` / `;&` / `;;&` variants.
- `handle_paren_word_boundary` — shared space/tab/newline/`|`.

`read_matched_parens_inner` body shrinks from ~140 to ~40 lines; both
`too_many_lines` allows are gone.

Behavior-preserving — every sharp-edge invariant is encoded verbatim
in the helpers: `end_word` order, which arms clear vs. end-word, case
tracker branching on `is_pattern_close`/`is_pattern_open`, comment
gating on whitespace prefix, `;;`/`;&`/`;;&` `resume_pattern` calls,
backslash-newline line continuation.

Add five regression guards in `src/lexer/tests.rs`:

- `arithmetic_double_paren_closes_at_double_right_paren` — `$((1+2))`
  exercises the `close_count=2` path through `handle_paren_close`.
- `extglob_with_alternation_is_one_word` — `@(foo|bar|baz)` exercises
  the `|` word-boundary arm under extglob.
- `paren_content_comment_stops_at_newline` — `$(( # comment\n 1+2 ))`
  exercises `handle_paren_comment`'s whitespace-prefix gate and the
  re-entry into the close arm after the newline.
- `empty_extglob_closes_at_immediate_right_paren` — `@()` hits
  depth==0 immediately.
- `trailing_backslash_in_arith_is_graceful_error` — `\` at EOF inside
  `$((…))` must produce a matched-pair error (not panic).

Document the `case.*` branches as defensive — not reached by current
callers since `$(…)` forks to `parse_paren_body` and `case` inside
`$((…))` is rejected by bash, but preserved for future reuse.

No behavior change — the 245 pre-existing tests pass, plus 5 new
guards (250 total). Downstream rippy (1332 tests) and tokf (cargo
check) both green against a local patched build.

Closes #53
Refs #61

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mpecan added a commit that referenced this pull request Apr 16, 2026
…pers (#64)

Closes #53 · Part of #61 (v0.2.0 Refactoring Cycle, PR 3 of 10)

> ⚠ This is the highest-risk PR of the cycle. Nested parens drive
`$((…))` arithmetic and `@(…)`/`?(…)` extglob tokenisation, so extra
reviewer attention on behavior preservation is warranted.

## Summary

Decomposes the ~140-line `read_matched_parens_inner` in
`src/lexer/expansions.rs` into 7 per-arm helpers backed by a new private
`ParenLoopState { depth, case, word_buf }` struct. Reuses the
`Result<ControlFlow<()>>` pattern established in PR 2 (#63) for the `)`
arm that signals "break outer loop".

Also removes a stale `#[allow(clippy::too_many_lines)]` from the 10-line
`read_matched_parens` wrapper (cruft left over from a previous
consolidation).

Adapts the issue's original plan where the code had moved on:
- **Fork/merge extraction** was already done as `read_paren_body_forked`
— skipped.
- **`ParenDepth { parens, braces, brackets }`** suggestion doesn't fit
since the loop only tracks paren depth; adapted to `ParenLoopState {
depth, case, word_buf }` matching actual state.

## Test plan

- [x] `cargo fmt && cargo clippy --all-targets -- -D warnings` — clean.
- [x] `cargo test` — 250 passed (was 245; +5 new guard tests).
- [x] Downstream smoke: `cargo test` against `../rippy` (1332 tests)
with rable patched — all green.
- [x] Downstream smoke: `cargo check --all-targets -p tokf` against
`../tokf` — clean.
- [x] `grep -n "too_many_lines" src/lexer/expansions.rs` — zero matches.

## Five new guard tests

1. `arithmetic_double_paren_closes_at_double_right_paren` — `$((1+2))`
exercises the `close_count=2` path through `handle_paren_close`.
2. `extglob_with_alternation_is_one_word` — `@(foo|bar|baz)` exercises
the `|` word-boundary arm under extglob.
3. `paren_content_comment_stops_at_newline` — `$(( # comment\n 1+2 ))`
exercises `handle_paren_comment`'s whitespace-prefix gate and re-entry
into the close arm after newline.
4. `empty_extglob_closes_at_immediate_right_paren` — `@()` hits depth==0
immediately (raised by Agent 4 of the review).
5. `trailing_backslash_in_arith_is_graceful_error` — `\` at EOF inside
`$((…))` must produce a matched-pair error rather than panic (raised by
Agent 4 of the review).

## Behavior preservation

All eight sharp-edge invariants preserved verbatim, documented in the
plan:
- `end_word` order: `case.check_word` BEFORE `word_buf.clear`.
- Which arms call `end_word` vs only `clear`: `)`, `(`, `'`, `"`, `#`,
`;`, whitespace, `|` → `end_word`; `\`, `$`, `` ` `` → bare `clear`.
- `)` branching on `case.is_pattern_close()` — case-pattern `)` does NOT
decrement depth.
- `(` branching on `case.is_pattern_open()` — case-pattern leading `(`
does NOT increment depth.
- `#` comment gating: only starts a comment when preceded by
whitespace/newline.
- `;;` / `;&` / `;;&` variants call `case.resume_pattern()`.
- `\\\n` line continuation: both chars consumed, neither pushed.
- Bare char accumulates `word_buf` (no clear).

The `case.*` branches are documented as **defensive** — not reached by
current callers since `$(…)` forks to `parse_paren_body` and `case`
inside `$((…))` is invalid bash, but preserved for future reuse.

## Architectural note

Agent 3 flagged that this pattern is a direct fit for
`read_param_expansion_inner` and `read_backtick_inner` in the same file.
Not in scope here, but de-risks potential future refactors.

## Stack

Fresh from `main`; no stack.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mpecan pushed a commit that referenced this pull request Apr 19, 2026
🤖 I have created a release *beep* *boop*
---


##
[0.2.0](rable-v0.1.15...rable-v0.2.0)
(2026-04-18)


### ⚠ BREAKING CHANGES

* tighten lexer API surface and relocate WordSpan to ast
([#70](#70))

### Bug Fixes

* **format:** align cmdsub reformatter with bash canonical form
([#49](#49))
([c7a4411](c7a4411))
* **lexer:** accept sloppy heredoc terminator in cmdsub mode
([#50](#50))
([40f394f](40f394f))
* **lexer:** backticks opaque when content is invalid
([#71](#71))
([e72166f](e72166f)),
closes [#38](#38)
* **lexer:** disable reserved-word recognition after assignment words
([#44](#44))
([42e1fc0](42e1fc0))
* **lexer:** stop treating ]] and unbalanced [...] as special outside
conditionals ([#45](#45))
([4bf5a5c](4bf5a5c))
* **parser:** fall back from (( … )) arith to nested subshells
([#48](#48))
([1437f00](1437f00))


### Code Refactoring

* **format:** introduce Formatter struct
([#65](#65))
([d965a8f](d965a8f))
* **lexer:** drop Result&lt;Token&gt; wrapper from operator readers
([#62](#62))
([d52a841](d52a841))
* **lexer:** split read_word_token into classify + advance + dispatch
helpers ([#63](#63))
([3ba09f5](3ba09f5))
* **parser:** extract fill_heredoc_contents visitor helpers
([#68](#68))
([40e6165](40e6165))
* **parser:** extract helpers from three oversize parsers
([#69](#69))
([25d0762](25d0762))
* **sexp:** dispatch NodeKind Display to per-category helpers
([#66](#66))
([44b0330](44b0330))
* **sexp:** table-drive ANSI-C escape dispatch
([#67](#67))
([91a5267](91a5267))
* tighten lexer API surface and relocate WordSpan to ast
([#70](#70))
([5171d01](5171d01))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: repository-butler[bot] <166800726+repository-butler[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PR 2: lexer/words.rs — split read_word_token

1 participant