fix(lexer): disable reserved-word recognition after assignment words by mpecan · Pull Request #44 · mpecan/rable

mpecan · 2026-04-14T15:04:16Z

Closes #37

Summary

Disables bash reserved-word recognition after a simple command has consumed at least one AssignmentWord, matching bash behavior.

Root cause

After a simple command has consumed an AssignmentWord, bash stops recognizing reserved words in that same simple command even though it's still at command-word position for assignment-detection purposes. Rable incorrectly kept command_start=true for reserved-word purposes too, so the lexer emitted For / Do / Then reserved tokens where plain Word is expected.

Bash verification:

bash -n -c 'foo=bar for'    -> OK (for is a plain word)
bash -n -c 'foo=bar do'     -> OK
bash -n -c 'foo=bar then'   -> OK

Implementation

Splits the LexerContext flag into two independent concepts:

command_start — eligible to begin a new simple command or to accept an AssignmentWord (unchanged semantics).
reserved_words_ok (new) — narrowly gates reserved-word classification in read_word_token. Cleared when an AssignmentWord is emitted; re-armed on command separators (;, ;;, ;;&, ;&, |, ||, |&, &, &&), newlines, and all parser-driven set_command_start() calls.

All sites that set command_start=true now also set reserved_words_ok=true; the only place they diverge is the AssignmentWord branch in read_word_token, which keeps command_start=true but clears reserved_words_ok.
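
The two-flag split can be sketched as a toy model. This is a minimal illustration, not rable's actual code: only the flag names `command_start` and `reserved_words_ok` and the method name `set_command_start` come from the description above, while `Ctx`, `Kind`, `classify`, `is_reserved`, and `is_assignment` are hypothetical names, and the assignment check ignores subscripts such as `arr[0]=`.

```rust
// Toy model of the two-flag context. Names other than `command_start`,
// `reserved_words_ok`, and `set_command_start` are hypothetical.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Reserved,
    AssignmentWord,
    Word,
}

#[derive(Debug, Clone, Copy)]
pub struct Ctx {
    /// Eligible to begin a simple command or accept an assignment word.
    pub command_start: bool,
    /// Narrow gate for reserved-word classification.
    pub reserved_words_ok: bool,
}

// Small illustrative subset of bash reserved words.
fn is_reserved(word: &str) -> bool {
    matches!(word, "if" | "then" | "fi" | "for" | "while" | "do" | "done" | "case" | "esac")
}

// Simplified check: `name=...` with a plain identifier as the name.
fn is_assignment(word: &str) -> bool {
    match word.split_once('=') {
        Some((name, _)) => {
            let mut chars = name.chars();
            matches!(chars.next(), Some(c) if c.is_ascii_alphabetic() || c == '_')
                && chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
        }
        None => false,
    }
}

impl Ctx {
    pub fn new() -> Self {
        Self { command_start: true, reserved_words_ok: true }
    }

    /// Re-arm both flags, as on a separator, newline, or parser request.
    pub fn set_command_start(&mut self) {
        self.command_start = true;
        self.reserved_words_ok = true;
    }

    /// Classify one word, then update the flags for the next token.
    pub fn classify(&mut self, word: &str) -> Kind {
        let kind = if self.command_start && self.reserved_words_ok && is_reserved(word) {
            Kind::Reserved
        } else if self.command_start && is_assignment(word) {
            Kind::AssignmentWord
        } else {
            Kind::Word
        };
        match kind {
            // Still at command position for further assignments, but
            // reserved words are no longer recognized.
            Kind::AssignmentWord => self.reserved_words_ok = false,
            // Simplification: assume the parser re-arms after a reserved word.
            Kind::Reserved => self.set_command_start(),
            // An ordinary word ends command position in this toy model.
            Kind::Word => {
                self.command_start = false;
                self.reserved_words_ok = false;
            }
        }
        kind
    }
}
```

Feeding `foo=bar` and then `for` through `classify` yields an `AssignmentWord` followed by a plain `Word`; calling `set_command_start()` (standing in for a `;` or newline) makes a subsequent `for` classify as `Reserved` again.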

Fixed cases (issue #37)

reserved_word_as_word 1 — if foo= for $(pwd); then foo=$(pwd); fi
reserved_word_as_word 3 — while arr[0]=$fo do o; do ...
reserved_word_as_word 4 — case $x in a) x=$ then [1+2];; esac

Incidental bonus fix

rbracket_outside_cond 3 — Cdeclare -n ref=t ]] arget. Reclassifying the post-assignment ]] as a plain Word (instead of the DoubleRightBracket reserved token) is a side effect of the same change and is documented inline in KNOWN_ORACLE_FAILURES.

Deferred cases

reserved_word_as_word 2 — ((x\n> 0)\n). Root cause is a (( vs ( ( disambiguation bug in parse_arith_command: the raw-char scanner for )) never finds a match when the body contains a newline, and there is no fallback to nested subshells. Tracked as #42 ("parser: (( ... )) should fall back to nested subshells when body is not a valid arithmetic expression").
reserved_word_as_word 5 — if [ $"yes" = yes9 ][ $; then echo ok; fi. Same root cause as #36 ("lexer: words containing unbalanced [...] absorb || / && / spaces"): read_bracket_subscript runs to EOF looking for ]. Being fixed as a side effect of #35 ("lexer: ]] not tokenized as separate word outside [[ ... ]]").

Test plan

New unit tests in src/lexer/tests.rs:
- reserved_word_after_assignment_is_plain_word — locks in token classification for cases 1, 3, 4.
- reserved_word_re_armed_after_separator — verifies reserved-word recognition comes back after ; and |.
Existing assignment_before_command_keeps_command_start still passes unchanged.
tests/oracle/bash_valid_divergences.tests cases 1, 3, 4 removed from KNOWN_ORACLE_FAILURES; inline comments on cases 2 and 5 point at their respective tracking issues.
cargo fmt && cargo clippy --all-targets -- -D warnings && cargo test → 221 pass (166 unit + 1 tree-sitter + 54 integration).

🤖 Generated with Claude Code

After a simple command has consumed an AssignmentWord, bash stops recognizing reserved words in that same simple command, even though the word position is still eligible for more assignments. Rable was conflating these two concepts, which caused it to emit For/Do/Then reserved tokens for `foo= for`, `arr[0]=$fo do`, `x=$ then`, etc.

Split the LexerContext flag: command_start continues to gate assignment-word detection and new-command eligibility, while the new reserved_words_ok flag narrowly controls reserved-word classification. Both re-arm on command separators, newlines, and parser-driven set_command_start calls; only reserved_words_ok is cleared when an AssignmentWord is emitted.

Fixes reserved_word_as_word cases 1, 3, 4 (issue #37). Incidentally also fixes rbracket_outside_cond 3 (whose input leads with an AssignmentWord before the stray ]]).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…zzing (#46)

## Summary

Adds 26 test cases that bash-oracle parses successfully but for which rable produces a different AST (or an error). These are **real rable bugs**, collected from differential fuzzing. Every case is listed in `KNOWN_ORACLE_FAILURES` in `tests/integration.rs`, so this PR does not introduce any failing tests; it gives the 7 tracked bugs a stable regression corpus to land fixes against.

## What this PR contains

- **`tests/fuzz.py`** — new `--valid-only` flag on both `mutate` and `generate` modes. Filters out mismatches where the reference parser rejects the input, so only divergences on *valid* bash remain. This is the change that surfaced the 26 cases.
- **`tests/oracle/bash_valid_divergences.tests`** — 26 test cases grouped by cluster, each with the canonical bash-oracle output. Header references the 7 tracked issues.
- **`tests/integration.rs`** — registers the new oracle file and lists all 26 cases in `KNOWN_ORACLE_FAILURES` with inline comments grouping them by issue.

## Clusters and tracking issues

| Cluster | Issue | Cases |
|---|---|---|
| `rbracket_outside_cond` — `]]` tokenization outside `[[ ]]` | #35 | 3 |
| `bracket_op_split` — unbalanced `[...]` absorbing `\|\|` / `&&` | #36 | 3 |
| `reserved_word_as_word` — reserved words in non-reserved positions | #37 | 5 |
| `backtick_opaque` — backticks opaque on invalid content | #38 | 6 |
| `heredoc_in_cmdsub` — heredoc inside `$( ... )` | #39 | 2 |
| `cmdsub_reformat` — command-sub canonical reformat drift | #40 | 6 |
| `linecont_in_arith` — `\<newline>` inside `$(( ))` | #41 | 1 |

## Motivation

Landing this first de-risks the three open fix PRs (#43 / #44 / #45) by making their diffs show only the fix, not the 260-line corpus addition as baseline noise.

## Test plan

- [ ] `cargo test --test integration oracle_bash_valid_divergences` — reports `0/26 passed (26 remaining)`, which is the expected baseline
- [ ] `cargo test --all-targets` — full suite stays green (nothing new fails; the corpus only lives in `KNOWN_ORACLE_FAILURES`)
- [ ] Quality gate: `cargo fmt` + `cargo clippy --all-targets -- -D warnings`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

…atch helpers

`read_word_token` was 130 lines with `#[allow(clippy::too_many_lines)]`, interleaving three logical phases: character-by-character word assembly, post-assembly classification, and context-flag update for the next token.

Split into six private helpers on `impl Lexer`:

- `classify_word` — encodes the reserved-word / AssignmentWord / Word classification plus the `]]` → Word fixup when not inside `[[ ]]` (issues #35, #37).
- `advance_word_context` — sets `command_start` / `reserved_words_ok` for the next token, including the AssignmentWord special case from issue #44.
- `read_open_paren_word` — handles the `(` mid-word dispatch (array assignment, extglob, or break). Returns `ControlFlow<()>` so the outer `while let` loop knows whether to break.
- `read_angle_bracket_word` — same pattern for mid-word `<(…)` / `>(…)` process substitutions.
- `is_extglob_trigger` — consolidates the twin `@|?|+` and `!|*` match arms into one predicate, correctly gating `!`/`*` behind `config.extglob`.
- `word_enters_bracket_subscript` — predicate extracting the guard condition for `[` bracket-subscript absorption.

Main `read_word_token` body drops from 130 to 45 lines; the `too_many_lines` allow is gone.

Add two regression guards in `src/lexer/tests.rs`:

- `extglob_prefix_without_paren_is_ordinary_char` — bare `@`, `?`, `+`, `!`, `*` without a following `(` stay ordinary word chars.
- `extglob_disabled_does_not_absorb_paren` — with `extglob=false`, `!(cmd)` and `foo*(bar)` must produce distinct `LeftParen` tokens rather than being absorbed as extglob words. Pins the `config.extglob` gate of `is_extglob_trigger`, which was otherwise only covered by integration tests that enable extglob.

No behavior change — 243 pre-existing tests still pass (+2 new guard tests = 245). Downstream rippy (1332 tests) and tokf (cargo check) both green against a local patched build.

Closes #52
Refs #61

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
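
The extglob gating rule described for `is_extglob_trigger` can be sketched as a standalone predicate. This is only an illustration under assumptions: the real helper is a private method on `impl Lexer` that reads `config.extglob`, whereas the free-function signature and parameter names here are hypothetical.

```rust
/// Hypothetical free-function version of the extglob predicate.
/// `prefix` is the candidate extglob character, `next` is the character
/// that follows it, and `extglob` stands in for `config.extglob`.
pub fn is_extglob_trigger(prefix: char, next: Option<char>, extglob: bool) -> bool {
    // Without a following `(`, the prefix is an ordinary word character.
    if next != Some('(') {
        return false;
    }
    match prefix {
        // These prefixes trigger on `(` regardless of the extglob setting.
        '@' | '?' | '+' => true,
        // These prefixes trigger only when extglob is enabled.
        '!' | '*' => extglob,
        _ => false,
    }
}
```

With `extglob` off, `!(cmd)` does not trigger, which matches the behavior the `extglob_disabled_does_not_absorb_paren` guard test is described as pinning.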

…atch helpers (#63)

Closes #52 · Part of #61 (v0.2.0 Refactoring Cycle, PR 2 of 10)

## Summary

Splits the 130-line `read_word_token` in `src/lexer/words.rs` into six private helpers on `impl Lexer`:

- `classify_word` — reserved-word / AssignmentWord / Word classification + `]]` → Word fixup (issues #35, #37).
- `advance_word_context` — updates `command_start` / `reserved_words_ok` for the next token (issue #44).
- `read_open_paren_word` — the `(` mid-word dispatch (array assignment, extglob, or break). Returns `std::ops::ControlFlow<()>` so the outer loop knows whether to break.
- `read_angle_bracket_word` — mid-word `<(…)` / `>(…)` process substitution.
- `is_extglob_trigger` — consolidates the twin `@|?|+` and `!|*` match arms into one predicate, correctly gating `!`/`*` behind `config.extglob`.
- `word_enters_bracket_subscript` — predicate extracting the `[` bracket-subscript guard.

`read_word_token` body shrinks from 130 to 45 lines; the `#[allow(clippy::too_many_lines)]` allow is gone. Net diff: +148 / −64 across 2 files.

## Test plan

- [x] `cargo fmt && cargo clippy --all-targets -- -D warnings` — clean.
- [x] `cargo test` — 245 passed, 0 failed (was 243; +2 new guard tests).
- [x] Downstream smoke: `cargo test` against `../rippy` (1332 tests) with rable patched to this branch — all green.
- [x] Downstream smoke: `cargo check --all-targets -p tokf` against `../tokf` — clean.
- [x] `grep -n "too_many_lines" src/lexer/words.rs` — zero matches.

## New guard tests

Two tests added to `src/lexer/tests.rs` pinning invariants this refactor centralizes:

- `extglob_prefix_without_paren_is_ordinary_char` — bare `@`, `?`, `+`, `!`, `*` without a following `(` stay ordinary word chars.
- `extglob_disabled_does_not_absorb_paren` — raised by the multi-agent review (Agent 4). With `extglob=false`, `!(cmd)` and `foo*(bar)` must produce distinct `LeftParen` tokens rather than being absorbed as extglob words. Pins the `config.extglob` gate of `is_extglob_trigger`, which was otherwise only covered by integration tests that enable extglob.

## Behavior preservation

All four sharp-edge invariants noted in the issue are preserved verbatim:

- Issue #37: reserved-word classification requires both `command_start` AND `reserved_words_ok`.
- Issue #35: `]]` outside `[[ ]]` is a Word, not `DoubleRightBracket`.
- Issue #44: `AssignmentWord` sets `command_start=true` but clears `reserved_words_ok`.
- Extglob gating: `@|?|+` always trigger on `(`; `!|*` only when `config.extglob` is on.

## Architectural note

The `Result<ControlFlow<()>>` pattern established here is a direct fit for PR 3 (#53, `lexer/expansions.rs` — `read_matched_parens_inner`), flagged as the highest-risk PR of the cycle. Landing this pattern on the simpler `words.rs` case first de-risks that follow-up.

## Stack

Fresh from `main`; no stack.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
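
The `Result<ControlFlow<()>>` pattern can be illustrated with a small standalone sketch. It is not taken from rable: `read_open_paren` and `read_word` are hypothetical stand-ins, the error type is a plain `String`, and the only dispatch modeled is "absorb a balanced `( ... )` after `=`, otherwise stop the word".

```rust
use std::iter::Peekable;
use std::ops::ControlFlow;
use std::str::Chars;

/// Hypothetical mid-word `(` helper. If the word so far ends in `=`, absorb
/// a balanced `( ... )` as an array-assignment value and ask the caller to
/// continue; otherwise ask the caller to break so `(` becomes its own token.
/// The caller must have peeked `(` as the next character.
fn read_open_paren(
    word: &mut String,
    chars: &mut Peekable<Chars<'_>>,
) -> Result<ControlFlow<()>, String> {
    if !word.ends_with('=') {
        return Ok(ControlFlow::Break(()));
    }
    let mut depth = 0usize;
    for c in chars.by_ref() {
        word.push(c);
        match c {
            '(' => depth += 1,
            ')' => {
                depth -= 1;
                if depth == 0 {
                    return Ok(ControlFlow::Continue(()));
                }
            }
            _ => {}
        }
    }
    Err("unterminated ( in word".to_string())
}

/// Hypothetical outer loop: assembles one word and lets the helper decide
/// whether to keep going. `?` propagates errors; `is_break` ends the loop.
pub fn read_word(input: &str) -> Result<String, String> {
    let mut chars = input.chars().peekable();
    let mut word = String::new();
    while let Some(&c) = chars.peek() {
        match c {
            ' ' => break,
            '(' => {
                if read_open_paren(&mut word, &mut chars)?.is_break() {
                    break;
                }
            }
            _ => {
                word.push(c);
                chars.next();
            }
        }
    }
    Ok(word)
}
```

The outer `Result` carries lexing errors while the inner `ControlFlow` carries the loop decision, so the helper can signal "stop the word here" without treating it as a failure.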

🤖 I have created a release *beep* *boop*

---

## [0.2.0](rable-v0.1.15...rable-v0.2.0) (2026-04-18)

### ⚠ BREAKING CHANGES

* tighten lexer API surface and relocate WordSpan to ast ([#70](#70))

### Bug Fixes

* **format:** align cmdsub reformatter with bash canonical form ([#49](#49)) ([c7a4411](c7a4411))
* **lexer:** accept sloppy heredoc terminator in cmdsub mode ([#50](#50)) ([40f394f](40f394f))
* **lexer:** backticks opaque when content is invalid ([#71](#71)) ([e72166f](e72166f)), closes [#38](#38)
* **lexer:** disable reserved-word recognition after assignment words ([#44](#44)) ([42e1fc0](42e1fc0))
* **lexer:** stop treating ]] and unbalanced [...] as special outside conditionals ([#45](#45)) ([4bf5a5c](4bf5a5c))
* **parser:** fall back from (( … )) arith to nested subshells ([#48](#48)) ([1437f00](1437f00))

### Code Refactoring

* **format:** introduce Formatter struct ([#65](#65)) ([d965a8f](d965a8f))
* **lexer:** drop Result<Token> wrapper from operator readers ([#62](#62)) ([d52a841](d52a841))
* **lexer:** split read_word_token into classify + advance + dispatch helpers ([#63](#63)) ([3ba09f5](3ba09f5))
* **parser:** extract fill_heredoc_contents visitor helpers ([#68](#68)) ([40e6165](40e6165))
* **parser:** extract helpers from three oversize parsers ([#69](#69)) ([25d0762](25d0762))
* **sexp:** dispatch NodeKind Display to per-category helpers ([#66](#66)) ([44b0330](44b0330))
* **sexp:** table-drive ANSI-C escape dispatch ([#67](#67)) ([91a5267](91a5267))
* tighten lexer API surface and relocate WordSpan to ast ([#70](#70)) ([5171d01](5171d01))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: repository-butler[bot] <166800726+repository-butler[bot]@users.noreply.github.com>

mpecan mentioned this pull request Apr 14, 2026

test(oracle): add 26 bash-valid divergence cases from differential fuzzing #46

Merged

3 tasks

Merge branch 'main' into feat/37-reserved-word-as-word

77e9f81

mpecan merged commit 42e1fc0 into main Apr 14, 2026
5 checks passed

mpecan deleted the feat/37-reserved-word-as-word branch April 14, 2026 20:20

repository-butler Bot mentioned this pull request Apr 14, 2026

chore(main): release rable 0.2.0 #47

Merged

mpecan mentioned this pull request Apr 16, 2026

PR 2: lexer/words.rs — split read_word_token #52

Closed

mpecan mentioned this pull request Apr 16, 2026

refactor(lexer): split read_word_token into classify + advance + dispatch helpers #63

Merged

5 tasks

mpecan merged 2 commits into main from feat/37-reserved-word-as-word
