fix(lexer): disable reserved-word recognition after assignment words #44
Merged
After a simple command has consumed an AssignmentWord, bash stops recognizing reserved words in that same simple command, even though the word position is still eligible for more assignments. Rable was conflating these two concepts, which caused it to emit For/Do/Then reserved tokens for `foo= for`, `arr[0]=$fo do`, `x=$ then`, etc.

Split the LexerContext flag: command_start continues to gate assignment-word detection and new-command eligibility, while the new reserved_words_ok flag narrowly controls reserved-word classification. Both re-arm on command separators, newlines, and parser-driven set_command_start calls; only reserved_words_ok is cleared when an AssignmentWord is emitted.

Fixes reserved_word_as_word cases 1, 3, 4 (issue #37). Incidentally also fixes rbracket_outside_cond 3 (whose input leads with an AssignmentWord before the stray ]]).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mpecan added a commit that referenced this pull request on Apr 14, 2026
…zzing (#46)

## Summary

Adds 26 test cases that bash-oracle parses successfully but rable produces a different AST (or errors) for. These are **real rable bugs**, collected from differential fuzzing. Every case is listed in `KNOWN_ORACLE_FAILURES` in `tests/integration.rs`, so this PR does not introduce any failing tests — it just gives the 7 tracked bugs a stable regression corpus to land fixes against.

## What this PR contains

- **`tests/fuzz.py`** — new `--valid-only` flag on both `mutate` and `generate` modes. Filters out mismatches where the reference parser rejects the input, so only divergences on *valid* bash remain. This is the change that surfaced the 26 cases.
- **`tests/oracle/bash_valid_divergences.tests`** — 26 test cases grouped by cluster, each with the canonical bash-oracle output. Header references the 7 tracked issues.
- **`tests/integration.rs`** — registers the new oracle file and lists all 26 cases in `KNOWN_ORACLE_FAILURES` with inline comments grouping them by issue.

## Clusters and tracking issues

| Cluster | Issue | Cases |
|---|---|---|
| `rbracket_outside_cond` — `]]` tokenization outside `[[ ]]` | #35 | 3 |
| `bracket_op_split` — unbalanced `[...]` absorbing `\|\|` / `&&` | #36 | 3 |
| `reserved_word_as_word` — reserved words in non-reserved positions | #37 | 5 |
| `backtick_opaque` — backticks opaque on invalid content | #38 | 6 |
| `heredoc_in_cmdsub` — heredoc inside `$( ... )` | #39 | 2 |
| `cmdsub_reformat` — command-sub canonical reformat drift | #40 | 6 |
| `linecont_in_arith` — `\<newline>` inside `$(( ))` | #41 | 1 |

## Motivation

Landing this first de-risks the three open fix PRs (#43 / #44 / #45) by making their diffs show only the fix, not the 260-line corpus addition as baseline noise.

## Test plan

- [ ] `cargo test --test integration oracle_bash_valid_divergences` — reports `0/26 passed (26 remaining)`, which is the expected baseline
- [ ] `cargo test --all-targets` — full suite stays green (nothing new fails; the corpus only lives in `KNOWN_ORACLE_FAILURES`)
- [ ] Quality gate: `cargo fmt` + `cargo clippy --all-targets -- -D warnings`

🤖 Generated with [Claude Code](https://claude.com/claude-code)
mpecan added a commit that referenced this pull request on Apr 16, 2026
…atch helpers

`read_word_token` was 130 lines with `#[allow(clippy::too_many_lines)]`, interleaving three logical phases: character-by-character word assembly, post-assembly classification, and context-flag update for the next token.

Split into six private helpers on `impl Lexer`:

- `classify_word` — encodes the reserved-word / AssignmentWord / Word classification plus the `]]` → Word fixup when not inside `[[ ]]` (issues #35, #37).
- `advance_word_context` — sets `command_start` / `reserved_words_ok` for the next token, including the AssignmentWord special case from issue #44.
- `read_open_paren_word` — handles the `(` mid-word dispatch (array assignment, extglob, or break). Returns `ControlFlow<()>` so the outer `while let` loop knows whether to break.
- `read_angle_bracket_word` — same pattern for mid-word `<(…)` / `>(…)` process substitutions.
- `is_extglob_trigger` — consolidates the twin `@|?|+` and `!|*` match arms into one predicate, correctly gating `!`/`*` behind `config.extglob`.
- `word_enters_bracket_subscript` — predicate extracting the guard condition for `[` bracket-subscript absorption.

Main `read_word_token` body drops from 130 to 45 lines; the `too_many_lines` allow is gone.

Add two regression guards in `src/lexer/tests.rs`:

- `extglob_prefix_without_paren_is_ordinary_char` — bare `@`, `?`, `+`, `!`, `*` without a following `(` stay ordinary word chars.
- `extglob_disabled_does_not_absorb_paren` — with `extglob=false`, `!(cmd)` and `foo*(bar)` must produce distinct `LeftParen` tokens rather than being absorbed as extglob words. Pins the `config.extglob` gate of `is_extglob_trigger`, which was otherwise only covered by integration tests that enable extglob.

No behavior change — 243 pre-existing tests still pass (+2 new guard tests = 245). Downstream rippy (1332 tests) and tokf (cargo check) both green against a local patched build.

Closes #52
Refs #61

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
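The extglob gate described above can be illustrated with a minimal sketch. The `Config` struct and the function signature here are assumptions made for illustration; only the `config.extglob` field name and the gating rule (`@`, `?`, `+` always trigger before `(`, while `!` and `*` trigger only when extglob is enabled) come from the commit message. rable's real helper may look different.

```rust
/// Hypothetical stand-in for the lexer configuration; only the `extglob`
/// field name is taken from the commit message.
struct Config {
    extglob: bool,
}

/// Sketch of the predicate: does `prefix`, immediately followed by `next`,
/// start an extglob group? The signature is an assumption.
fn is_extglob_trigger(config: &Config, prefix: char, next: Option<char>) -> bool {
    // A bare prefix without a following '(' is an ordinary word character.
    if next != Some('(') {
        return false;
    }
    match prefix {
        // These prefixes trigger regardless of the extglob setting.
        '@' | '?' | '+' => true,
        // These prefixes are gated behind the extglob setting.
        '!' | '*' => config.extglob,
        _ => false,
    }
}
```

Folding both match arms into one predicate keeps the `config.extglob` check in a single place, which is the property the `extglob_disabled_does_not_absorb_paren` guard test pins down.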
mpecan added a commit that referenced this pull request on Apr 16, 2026
…atch helpers (#63)

Closes #52 · Part of #61 (v0.2.0 Refactoring Cycle, PR 2 of 10)

## Summary

Splits the 130-line `read_word_token` in `src/lexer/words.rs` into six private helpers on `impl Lexer`:

- `classify_word` — reserved-word / AssignmentWord / Word classification + `]]` → Word fixup (issues #35, #37).
- `advance_word_context` — updates `command_start` / `reserved_words_ok` for the next token (issue #44).
- `read_open_paren_word` — the `(` mid-word dispatch (array assignment, extglob, or break). Returns `std::ops::ControlFlow<()>` so the outer loop knows whether to break.
- `read_angle_bracket_word` — mid-word `<(…)` / `>(…)` process substitution.
- `is_extglob_trigger` — consolidates the twin `@|?|+` and `!|*` match arms into one predicate, correctly gating `!`/`*` behind `config.extglob`.
- `word_enters_bracket_subscript` — predicate extracting the `[` bracket-subscript guard.

`read_word_token` body shrinks from 130 to 45 lines; the `#[allow(clippy::too_many_lines)]` allow is gone. Net diff: +148 / −64 across 2 files.

## Test plan

- [x] `cargo fmt && cargo clippy --all-targets -- -D warnings` — clean.
- [x] `cargo test` — 245 passed, 0 failed (was 243; +2 new guard tests).
- [x] Downstream smoke: `cargo test` against `../rippy` (1332 tests) with rable patched to this branch — all green.
- [x] Downstream smoke: `cargo check --all-targets -p tokf` against `../tokf` — clean.
- [x] `grep -n "too_many_lines" src/lexer/words.rs` — zero matches.

## New guard tests

Two tests added to `src/lexer/tests.rs` pinning invariants this refactor centralizes:

- `extglob_prefix_without_paren_is_ordinary_char` — bare `@`, `?`, `+`, `!`, `*` without a following `(` stay ordinary word chars.
- `extglob_disabled_does_not_absorb_paren` — raised by the multi-agent review (Agent 4). With `extglob=false`, `!(cmd)` and `foo*(bar)` must produce distinct `LeftParen` tokens rather than being absorbed as extglob words. Pins the `config.extglob` gate of `is_extglob_trigger`, which was otherwise only covered by integration tests that enable extglob.

## Behavior preservation

All four sharp-edge invariants noted in the issue are preserved verbatim:

- Issue #37: reserved-word classification requires both `command_start` AND `reserved_words_ok`.
- Issue #35: `]]` outside `[[ ]]` is a Word, not `DoubleRightBracket`.
- Issue #44: `AssignmentWord` sets `command_start=true` but clears `reserved_words_ok`.
- Extglob gating: `@|?|+` always trigger on `(`; `!|*` only when `config.extglob` is on.

## Architectural note

The `Result<ControlFlow<()>>` pattern established here is a direct fit for PR 3 (#53, `lexer/expansions.rs` — `read_matched_parens_inner`), flagged as the highest-risk PR of the cycle. Landing this pattern on the simpler `words.rs` case first de-risks that follow-up.

## Stack

Fresh from `main`; no stack.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
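The `Result<ControlFlow<()>>` pattern mentioned in the architectural note can be sketched as follows. This is an illustrative toy, not rable's code: `read_open_paren` and `read_word` are hypothetical names, the error type is a plain `String`, and the only mid-word case handled is an array assignment such as `arr=(a b)`. It shows how a helper can report "keep scanning" versus "end the word here" to the outer loop while still propagating errors with `?`.

```rust
use std::iter::Peekable;
use std::ops::ControlFlow;
use std::str::Chars;

/// Hypothetical helper, called when the next unconsumed char is '('.
/// Returns Continue after absorbing an array literal into the word,
/// or Break when the '(' does not belong to the word.
fn read_open_paren(
    word: &mut String,
    chars: &mut Peekable<Chars<'_>>,
) -> Result<ControlFlow<()>, String> {
    if !word.ends_with('=') {
        // Not an array assignment: leave '(' unconsumed and end the word.
        return Ok(ControlFlow::Break(()));
    }
    // Array assignment: absorb everything through the matching ')'.
    let mut depth = 0usize;
    for c in chars.by_ref() {
        word.push(c);
        match c {
            '(' => depth += 1,
            ')' => {
                depth -= 1;
                if depth == 0 {
                    return Ok(ControlFlow::Continue(()));
                }
            }
            _ => {}
        }
    }
    Err(format!("unterminated array literal in `{word}`"))
}

/// Hypothetical outer loop: reads one word from the start of `input`.
fn read_word(input: &str) -> Result<String, String> {
    let mut chars = input.chars().peekable();
    let mut word = String::new();
    while let Some(&c) = chars.peek() {
        match c {
            ' ' | '\t' | '\n' => break,
            '(' => {
                // The helper decides whether the loop should stop.
                if read_open_paren(&mut word, &mut chars)?.is_break() {
                    break;
                }
            }
            _ => {
                word.push(c);
                chars.next();
            }
        }
    }
    Ok(word)
}
```

Under these assumptions, `read_word("arr=(a b) rest")` yields `arr=(a b)`, while `read_word("foo(bar)")` stops at `foo` and leaves the parenthesis for the caller to tokenize separately.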
mpecan pushed a commit that referenced this pull request on Apr 19, 2026
🤖 I have created a release *beep* *boop*

---

## [0.2.0](rable-v0.1.15...rable-v0.2.0) (2026-04-18)

### ⚠ BREAKING CHANGES

* tighten lexer API surface and relocate WordSpan to ast ([#70](#70))

### Bug Fixes

* **format:** align cmdsub reformatter with bash canonical form ([#49](#49)) ([c7a4411](c7a4411))
* **lexer:** accept sloppy heredoc terminator in cmdsub mode ([#50](#50)) ([40f394f](40f394f))
* **lexer:** backticks opaque when content is invalid ([#71](#71)) ([e72166f](e72166f)), closes [#38](#38)
* **lexer:** disable reserved-word recognition after assignment words ([#44](#44)) ([42e1fc0](42e1fc0))
* **lexer:** stop treating ]] and unbalanced [...] as special outside conditionals ([#45](#45)) ([4bf5a5c](4bf5a5c))
* **parser:** fall back from (( … )) arith to nested subshells ([#48](#48)) ([1437f00](1437f00))

### Code Refactoring

* **format:** introduce Formatter struct ([#65](#65)) ([d965a8f](d965a8f))
* **lexer:** drop Result<Token> wrapper from operator readers ([#62](#62)) ([d52a841](d52a841))
* **lexer:** split read_word_token into classify + advance + dispatch helpers ([#63](#63)) ([3ba09f5](3ba09f5))
* **parser:** extract fill_heredoc_contents visitor helpers ([#68](#68)) ([40e6165](40e6165))
* **parser:** extract helpers from three oversize parsers ([#69](#69)) ([25d0762](25d0762))
* **sexp:** dispatch NodeKind Display to per-category helpers ([#66](#66)) ([44b0330](44b0330))
* **sexp:** table-drive ANSI-C escape dispatch ([#67](#67)) ([91a5267](91a5267))
* tighten lexer API surface and relocate WordSpan to ast ([#70](#70)) ([5171d01](5171d01))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: repository-butler[bot] <166800726+repository-butler[bot]@users.noreply.github.com>
Closes #37

## Summary

Disables bash reserved-word recognition after a simple command has consumed at least one `AssignmentWord`, matching bash behavior.

## Root cause

After a simple command has consumed an `AssignmentWord`, bash stops recognizing reserved words in that same simple command even though it's still at command-word position for assignment-detection purposes. Rable incorrectly kept `command_start=true` for reserved-word purposes too, so the lexer emitted `For`/`Do`/`Then` reserved tokens where plain `Word` is expected.

Bash verification:

## Implementation

Splits the `LexerContext` flag into two independent concepts:

- `command_start` — eligible to begin a new simple command or to accept an `AssignmentWord` (unchanged semantics).
- `reserved_words_ok` (new) — narrowly gates reserved-word classification in `read_word_token`. Cleared when an `AssignmentWord` is emitted; re-armed on command separators (`;`, `;;`, `;;&`, `;&`, `|`, `||`, `|&`, `&`, `&&`), newlines, and all parser-driven `set_command_start()` calls.

All sites that set `command_start=true` now also set `reserved_words_ok=true`; the only place they diverge is the `AssignmentWord` branch in `read_word_token`, which keeps `command_start=true` but clears `reserved_words_ok`.

## Fixed cases (issue #37)

- `reserved_word_as_word 1` — `if foo= for $(pwd); then foo=$(pwd); fi`
- `reserved_word_as_word 3` — `while arr[0]=$fo do o; do ...`
- `reserved_word_as_word 4` — `case $x in a) x=$ then [1+2];; esac`

## Incidental bonus fix

- `rbracket_outside_cond 3` — `Cdeclare -n ref=t ]] arget`. Reclassifying the post-assignment `]]` as a plain `Word` (instead of the `DoubleRightBracket` reserved token) is a side effect of the same change and is documented inline in `KNOWN_ORACLE_FAILURES`.

## Deferred cases

- `reserved_word_as_word 2` — `((x\n> 0)\n)`. Root cause is a `((` vs `( (` disambiguation bug in `parse_arith_command`: the raw-char scanner for `))` never finds a match when the body contains a newline, and there is no fallback to nested subshells. Tracked as #42 (parser: `(( ... ))` should fall back to nested subshells when body is not a valid arithmetic expression).
- `reserved_word_as_word 5` — `if [ $"yes" = yes9 ][ $; then echo ok; fi`. Same root cause as #36 (lexer: words containing unbalanced `[...]` absorb `||` / `&&` / spaces): `read_bracket_subscript` runs to EOF looking for `]`. Being fixed as a side effect of #35 (lexer: `]]` not tokenized as separate word outside `[[ ... ]]`).

## Test plan

- `src/lexer/tests.rs`:
  - `reserved_word_after_assignment_is_plain_word` — locks in token classification for cases 1, 3, 4.
  - `reserved_word_re_armed_after_separator` — verifies reserved-word recognition comes back after `;` and `|`.
  - `assignment_before_command_keeps_command_start` still passes unchanged.
- `tests/oracle/bash_valid_divergences.tests` cases 1, 3, 4 removed from `KNOWN_ORACLE_FAILURES`; inline comments on cases 2 and 5 point at their respective tracking issues.
- `cargo fmt && cargo clippy --all-targets -- -D warnings && cargo test` → 221 pass (166 unit + 1 tree-sitter + 54 integration).

🤖 Generated with Claude Code
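The flag split described under Implementation can be modelled with a small, self-contained sketch. Everything here is hypothetical and heavily simplified: `Ctx`, `Kind`, and `classify_all` are invented names, words are split on whitespace, subscripted assignments such as `arr[0]=` are not recognized, and every reserved word is assumed to re-arm both flags (standing in for the parser-driven `set_command_start()` calls). Only the two flag names and their set/clear rules come from the PR text.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Kind {
    Reserved,
    AssignmentWord,
    Word,
    Separator,
}

/// Hypothetical stand-in for the two LexerContext flags.
struct Ctx {
    command_start: bool,
    reserved_words_ok: bool,
}

fn is_reserved(w: &str) -> bool {
    matches!(w, "if" | "then" | "else" | "elif" | "fi" | "for" | "while" | "until" | "do" | "done" | "case" | "esac")
}

fn is_separator(w: &str) -> bool {
    matches!(w, ";" | ";;" | ";;&" | ";&" | "|" | "||" | "|&" | "&" | "&&")
}

/// Simplified check: NAME=... with a plain identifier as NAME.
fn is_assignment(w: &str) -> bool {
    match w.split_once('=') {
        Some((name, _)) => {
            let mut chars = name.chars();
            matches!(chars.next(), Some(c) if c.is_ascii_alphabetic() || c == '_')
                && chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
        }
        None => false,
    }
}

impl Ctx {
    fn new() -> Self {
        Ctx { command_start: true, reserved_words_ok: true }
    }

    fn classify(&self, w: &str) -> Kind {
        if is_separator(w) {
            Kind::Separator
        } else if self.command_start && self.reserved_words_ok && is_reserved(w) {
            // Reserved-word classification needs BOTH flags.
            Kind::Reserved
        } else if self.command_start && is_assignment(w) {
            Kind::AssignmentWord
        } else {
            Kind::Word
        }
    }

    fn advance(&mut self, kind: Kind) {
        match kind {
            // Separators (and, in this sketch, reserved words) re-arm both flags.
            Kind::Separator | Kind::Reserved => {
                self.command_start = true;
                self.reserved_words_ok = true;
            }
            // The one place the flags diverge.
            Kind::AssignmentWord => {
                self.command_start = true;
                self.reserved_words_ok = false;
            }
            Kind::Word => {
                self.command_start = false;
                self.reserved_words_ok = false;
            }
        }
    }
}

/// Classifies every whitespace-separated word of `input` in order.
fn classify_all(input: &str) -> Vec<Kind> {
    let mut ctx = Ctx::new();
    input
        .split_whitespace()
        .map(|w| {
            let kind = ctx.classify(w);
            ctx.advance(kind);
            kind
        })
        .collect()
}
```

In this model, `classify_all("if foo= for $(pwd)")` classifies `for` as a plain `Word` because `foo=` cleared `reserved_words_ok`, while `classify_all("foo=1 ; for")` classifies `for` as `Reserved` because the `;` re-armed both flags. This mirrors what the two new lexer tests are described as checking.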