
fix(lexer): disable reserved-word recognition after assignment words#44

Merged
mpecan merged 2 commits into main from feat/37-reserved-word-as-word on Apr 14, 2026
Conversation

@mpecan
Owner

@mpecan mpecan commented Apr 14, 2026

Closes #37

Summary

Disables bash reserved-word recognition after a simple command has consumed at least one AssignmentWord, matching bash behavior.

Root cause

After a simple command has consumed an AssignmentWord, bash stops recognizing reserved words in that same simple command even though it's still at command-word position for assignment-detection purposes. Rable incorrectly kept command_start=true for reserved-word purposes too, so the lexer emitted For / Do / Then reserved tokens where plain Word is expected.

Bash verification:

bash -n -c 'foo=bar for'    -> OK (for is a plain word)
bash -n -c 'foo=bar do'     -> OK
bash -n -c 'foo=bar then'   -> OK

Implementation

Splits the LexerContext flag into two independent concepts:

  • command_start — eligible to begin a new simple command or to accept an AssignmentWord (unchanged semantics).
  • reserved_words_ok (new) — narrowly gates reserved-word classification in read_word_token. Cleared when an AssignmentWord is emitted; re-armed on command separators (;, ;;, ;;&, ;&, |, ||, |&, &, &&), newlines, and all parser-driven set_command_start() calls.

All sites that set command_start=true now also set reserved_words_ok=true; the only place they diverge is the AssignmentWord branch in read_word_token, which keeps command_start=true but clears reserved_words_ok.
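
The flag split can be illustrated with a toy lexer. This is a minimal sketch, not rable's code: the `Tok` enum, the `lex` and `is_assignment` functions, and the whitespace-only tokenization are all assumptions made for illustration, and keeping both flags armed after a reserved word stands in for the parser-driven set_command_start() calls.

```rust
/// Hypothetical, simplified stand-ins for the lexer's token kinds.
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    Reserved(String),
    AssignmentWord(String),
    Word(String),
    Separator(String),
}

/// Toy version of the two context flags described above.
#[derive(Debug, Clone, Copy)]
struct LexerContext {
    command_start: bool,
    reserved_words_ok: bool,
}

const RESERVED: &[&str] = &[
    "if", "then", "else", "elif", "fi", "for", "while", "until", "do", "done", "case", "esac",
];
const SEPARATORS: &[&str] = &[";", ";;", ";;&", ";&", "|", "||", "|&", "&", "&&"];

/// Toy rule: `NAME=...` where NAME is a plain identifier (no subscripts).
fn is_assignment(word: &str) -> bool {
    match word.split_once('=') {
        Some((name, _)) => {
            !name.is_empty()
                && !name.starts_with(|c: char| c.is_ascii_digit())
                && name.chars().all(|c| c.is_ascii_alphanumeric() || c == '_')
        }
        None => false,
    }
}

/// Classifies whitespace-separated words; separators must be space-delimited.
fn lex(input: &str) -> Vec<Tok> {
    let armed = LexerContext { command_start: true, reserved_words_ok: true };
    let mut ctx = armed;
    let mut out = Vec::new();
    for w in input.split_whitespace() {
        if SEPARATORS.contains(&w) {
            // Command separators re-arm both flags.
            ctx = armed;
            out.push(Tok::Separator(w.to_string()));
        } else if ctx.command_start && ctx.reserved_words_ok && RESERVED.contains(&w) {
            // Simplification: both flags stay armed after a reserved word.
            out.push(Tok::Reserved(w.to_string()));
        } else if ctx.command_start && is_assignment(w) {
            // The one place the flags diverge: command_start stays true,
            // reserved_words_ok is cleared.
            ctx.reserved_words_ok = false;
            out.push(Tok::AssignmentWord(w.to_string()));
        } else {
            // An ordinary word ends command-start position.
            ctx.command_start = false;
            ctx.reserved_words_ok = false;
            out.push(Tok::Word(w.to_string()));
        }
    }
    out
}
```

With this sketch, `lex("foo=bar for")` classifies `for` as a plain `Word`, `lex("a=1 b=2 do")` still accepts the second assignment, and `lex("foo=bar ; for")` yields a `Reserved` token again because the `;` re-arms reserved_words_ok.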

Fixed cases (issue #37)

  • reserved_word_as_word 1: `if foo= for $(pwd); then foo=$(pwd); fi`
  • reserved_word_as_word 3: `while arr[0]=$fo do o; do ...`
  • reserved_word_as_word 4: `case $x in a) x=$ then [1+2];; esac`

Incidental bonus fix

  • rbracket_outside_cond 3: `Cdeclare -n ref=t ]] arget`. Reclassifying the post-assignment ]] as a plain Word (instead of the DoubleRightBracket reserved token) is a side effect of the same change and is documented inline in KNOWN_ORACLE_FAILURES.

Deferred cases

Test plan

  • New unit tests in src/lexer/tests.rs:
    • reserved_word_after_assignment_is_plain_word — locks in token classification for cases 1, 3, 4.
    • reserved_word_re_armed_after_separator — verifies reserved-word recognition comes back after ; and |.
  • Existing assignment_before_command_keeps_command_start still passes unchanged.
  • tests/oracle/bash_valid_divergences.tests cases 1, 3, 4 removed from KNOWN_ORACLE_FAILURES; inline comments on cases 2 and 5 point at their respective tracking issues.
  • cargo fmt && cargo clippy --all-targets -- -D warnings && cargo test → 221 pass (166 unit + 1 tree-sitter + 54 integration).

🤖 Generated with Claude Code

After a simple command has consumed an AssignmentWord, bash stops
recognizing reserved words in that same simple command, even though
the word position is still eligible for more assignments. Rable was
conflating these two concepts, which caused it to emit For/Do/Then
reserved tokens for `foo= for`, `arr[0]=$fo do`, `x=$ then`, etc.

Split the LexerContext flag: command_start continues to gate
assignment-word detection and new-command eligibility, while the new
reserved_words_ok flag narrowly controls reserved-word classification.
Both re-arm on command separators, newlines, and parser-driven
set_command_start calls; only reserved_words_ok is cleared when an
AssignmentWord is emitted.

Fixes reserved_word_as_word cases 1, 3, 4 (issue #37). Incidentally
also fixes rbracket_outside_cond 3 (whose input leads with an
AssignmentWord before the stray ]]).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mpecan added a commit that referenced this pull request Apr 14, 2026
…zzing (#46)

## Summary

Adds 26 test cases that bash-oracle parses successfully but for which
rable produces a different AST (or an error). These are **real rable bugs**,
collected from differential fuzzing. Every case is listed in
`KNOWN_ORACLE_FAILURES` in `tests/integration.rs`, so this PR does not
introduce any failing tests — it just gives the 7 tracked bugs a stable
regression corpus to land fixes against.

## What this PR contains

- **`tests/fuzz.py`** — new `--valid-only` flag on both `mutate` and
`generate` modes. Filters out mismatches where the reference parser
rejects the input, so only divergences on *valid* bash remain. This is
the change that surfaced the 26 cases.
- **`tests/oracle/bash_valid_divergences.tests`** — 26 test cases
grouped by cluster, each with the canonical bash-oracle output. Header
references the 7 tracked issues.
- **`tests/integration.rs`** — registers the new oracle file and lists
all 26 cases in `KNOWN_ORACLE_FAILURES` with inline comments grouping
them by issue.
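
The `--valid-only` filter lives in the Python fuzz script; the Rust below is only a sketch of the filtering rule it describes, written under assumptions. The `Comparison` struct, its field names, and the `divergences` function are hypothetical and do not come from `tests/fuzz.py`.

```rust
/// Hypothetical record of one fuzz comparison; field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
struct Comparison {
    input: String,
    /// `Some(ast)` when the reference parser accepts the input, `None` when it rejects it.
    oracle: Option<String>,
    /// The parser under test: its output, or its error message.
    rable: Result<String, String>,
}

impl Comparison {
    /// A mismatch is any disagreement between the two parsers.
    fn is_mismatch(&self) -> bool {
        match (&self.oracle, &self.rable) {
            (Some(expected), Ok(actual)) => expected != actual,
            (None, Err(_)) => false, // both reject: they agree
            _ => true,               // one accepts, the other rejects
        }
    }
}

/// Returns the mismatches; with `valid_only`, drops those where the
/// reference parser rejected the input, leaving divergences on valid bash.
fn divergences(results: &[Comparison], valid_only: bool) -> Vec<&Comparison> {
    results
        .iter()
        .filter(|c| c.is_mismatch())
        .filter(|c| !valid_only || c.oracle.is_some())
        .collect()
}
```

Under these assumptions, an input the reference parser rejects but the tested parser accepts still counts as a mismatch without the flag, and disappears from the report with it.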

## Clusters and tracking issues

| Cluster | Issue | Cases |
|---|---|---|
| `rbracket_outside_cond` — `]]` tokenization outside `[[ ]]` | #35 | 3 |
| `bracket_op_split` — unbalanced `[...]` absorbing `\|\|` / `&&` | #36 | 3 |
| `reserved_word_as_word` — reserved words in non-reserved positions | #37 | 5 |
| `backtick_opaque` — backticks opaque on invalid content | #38 | 6 |
| `heredoc_in_cmdsub` — heredoc inside `$( ... )` | #39 | 2 |
| `cmdsub_reformat` — command-sub canonical reformat drift | #40 | 6 |
| `linecont_in_arith` — `\<newline>` inside `$(( ))` | #41 | 1 |

## Motivation

Landing this first de-risks the three open fix PRs (#43 / #44 / #45) by
making their diffs show only the fix, not the 260-line corpus addition
as baseline noise.

## Test plan

- [ ] `cargo test --test integration oracle_bash_valid_divergences` —
reports `0/26 passed (26 remaining)`, which is the expected baseline
- [ ] `cargo test --all-targets` — full suite stays green (nothing new
fails; the corpus only lives in `KNOWN_ORACLE_FAILURES`)
- [ ] Quality gate: `cargo fmt` + `cargo clippy --all-targets -- -D
warnings`

🤖 Generated with [Claude Code](https://claude.com/claude-code)
@mpecan mpecan merged commit 42e1fc0 into main Apr 14, 2026
5 checks passed
@mpecan mpecan deleted the feat/37-reserved-word-as-word branch April 14, 2026 20:20
mpecan added a commit that referenced this pull request Apr 16, 2026
…atch helpers

`read_word_token` was 130 lines with `#[allow(clippy::too_many_lines)]`,
interleaving three logical phases: character-by-character word assembly,
post-assembly classification, and context-flag update for the next token.

Split into six private helpers on `impl Lexer`:

- `classify_word` — encodes the reserved-word / AssignmentWord / Word
  classification plus the `]]` → Word fixup when not inside `[[ ]]`
  (issues #35, #37).
- `advance_word_context` — sets `command_start` / `reserved_words_ok`
  for the next token, including the AssignmentWord special case from
  PR #44.
- `read_open_paren_word` — handles the `(` mid-word dispatch (array
  assignment, extglob, or break). Returns `ControlFlow<()>` so the
  outer `while let` loop knows whether to break.
- `read_angle_bracket_word` — same pattern for mid-word `<(…)` /
  `>(…)` process substitutions.
- `is_extglob_trigger` — consolidates the twin `@|?|+` and `!|*` match
  arms into one predicate, correctly gating `!`/`*` behind
  `config.extglob`.
- `word_enters_bracket_subscript` — predicate extracting the guard
  condition for `[` bracket-subscript absorption.

Main `read_word_token` body drops from 130 to 45 lines; the
`too_many_lines` allow is gone.

Add two regression guards in `src/lexer/tests.rs`:

- `extglob_prefix_without_paren_is_ordinary_char` — bare `@`, `?`, `+`,
  `!`, `*` without a following `(` stay ordinary word chars.
- `extglob_disabled_does_not_absorb_paren` — with `extglob=false`,
  `!(cmd)` and `foo*(bar)` must produce distinct `LeftParen` tokens
  rather than being absorbed as extglob words. Pins the
  `config.extglob` gate of `is_extglob_trigger`, which was otherwise
  only covered by integration tests that enable extglob.

No behavior change — 243 pre-existing tests still pass (+2 new guard
tests = 245). Downstream rippy (1332 tests) and tokf (cargo check)
both green against a local patched build.

Closes #52
Refs #61

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mpecan added a commit that referenced this pull request Apr 16, 2026
…atch helpers (#63)

Closes #52 · Part of #61 (v0.2.0 Refactoring Cycle, PR 2 of 10)

## Summary

Splits the 130-line `read_word_token` in `src/lexer/words.rs` into six
private helpers on `impl Lexer`:

- `classify_word` — reserved-word / AssignmentWord / Word classification
+ `]]` → Word fixup (issues #35, #37).
- `advance_word_context` — updates `command_start` / `reserved_words_ok`
for the next token (PR #44).
- `read_open_paren_word` — the `(` mid-word dispatch (array assignment,
extglob, or break). Returns `std::ops::ControlFlow<()>` so the outer
loop knows whether to break.
- `read_angle_bracket_word` — mid-word `<(…)` / `>(…)` process
substitution.
- `is_extglob_trigger` — consolidates the twin `@|?|+` and `!|*` match
arms into one predicate, correctly gating `!`/`*` behind
`config.extglob`.
- `word_enters_bracket_subscript` — predicate extracting the `[`
bracket-subscript guard.

`read_word_token` body shrinks from 130 to 45 lines; the
`#[allow(clippy::too_many_lines)]` allow is gone.

Net diff: +148 / −64 across 2 files.

## Test plan

- [x] `cargo fmt && cargo clippy --all-targets -- -D warnings` — clean.
- [x] `cargo test` — 245 passed, 0 failed (was 243; +2 new guard tests).
- [x] Downstream smoke: `cargo test` against `../rippy` (1332 tests)
with rable patched to this branch — all green.
- [x] Downstream smoke: `cargo check --all-targets -p tokf` against
`../tokf` — clean.
- [x] `grep -n "too_many_lines" src/lexer/words.rs` — zero matches.

## New guard tests

Two tests added to `src/lexer/tests.rs` pinning invariants this refactor
centralizes:

- `extglob_prefix_without_paren_is_ordinary_char` — bare `@`, `?`, `+`,
`!`, `*` without a following `(` stay ordinary word chars.
- `extglob_disabled_does_not_absorb_paren` — raised by the multi-agent
review (Agent 4). With `extglob=false`, `!(cmd)` and `foo*(bar)` must
produce distinct `LeftParen` tokens rather than being absorbed as
extglob words. Pins the `config.extglob` gate of `is_extglob_trigger`,
which was otherwise only covered by integration tests that enable
extglob.

## Behavior preservation

All four sharp-edge invariants noted in the issue are preserved
verbatim:

- Issue #37: reserved-word classification requires both `command_start`
AND `reserved_words_ok`.
- Issue #35: `]]` outside `[[ ]]` is a Word, not `DoubleRightBracket`.
- PR #44: `AssignmentWord` sets `command_start=true` but clears
`reserved_words_ok`.
- Extglob gating: `@|?|+` always trigger on `(`; `!|*` only when
`config.extglob` is on.
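
The `]]` and extglob invariants above can be written as two small pure predicates. This is a sketch under assumptions: the signatures, the `Kind` enum, and the `Config` struct are hypothetical and are not taken from `src/lexer/words.rs`.

```rust
/// Hypothetical token kinds relevant to the `]]` rule.
#[derive(Debug, PartialEq)]
enum Kind {
    DoubleRightBracket,
    Word,
}

/// Hypothetical stand-in for the lexer configuration.
struct Config {
    extglob: bool,
}

/// `]]` is only special while inside `[[ ]]`; elsewhere it is a plain word.
fn classify_rbracket(word: &str, inside_cond: bool) -> Kind {
    if word == "]]" && inside_cond {
        Kind::DoubleRightBracket
    } else {
        Kind::Word
    }
}

/// A prefix character starts an extglob pattern only when followed by `(`.
/// `@`, `?`, `+` always qualify; `!` and `*` only with extglob enabled.
fn is_extglob_trigger(prefix: char, next: Option<char>, config: &Config) -> bool {
    if next != Some('(') {
        return false; // a bare prefix char stays an ordinary word char
    }
    match prefix {
        '@' | '?' | '+' => true,
        '!' | '*' => config.extglob,
        _ => false,
    }
}
```

Keeping the `config.extglob` check in a single match arm is what lets a test with extglob disabled pin the gate directly, rather than relying on tests that only ever run with it enabled.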

## Architectural note

The `Result<ControlFlow<()>>` pattern established here is a direct fit
for PR 3 (#53, `lexer/expansions.rs` — `read_matched_parens_inner`),
flagged as the highest-risk PR of the cycle. Landing this pattern on the
simpler `words.rs` case first de-risks that follow-up.
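
For readers unfamiliar with the pattern, here is a minimal sketch of a helper returning `Result<ControlFlow<()>>` to a `while let` loop. The function names, the `String` error type, and the toy rule (a `(` directly after `=` is absorbed as an array literal, any other `(` ends the word) are assumptions for illustration, not the code in `words.rs`.

```rust
use std::iter::Peekable;
use std::ops::ControlFlow;
use std::str::Chars;

/// Hypothetical helper: decide what a mid-word `(` means.
/// `Break` tells the caller to stop the word; `Continue` means the
/// helper consumed input and the caller should keep reading.
fn read_open_paren(
    chars: &mut Peekable<Chars<'_>>,
    word: &mut String,
) -> Result<ControlFlow<()>, String> {
    if !word.ends_with('=') {
        // Not an array assignment: leave `(` unconsumed and end the word.
        return Ok(ControlFlow::Break(()));
    }
    // Absorb everything up to and including the closing `)`.
    for c in chars.by_ref() {
        word.push(c);
        if c == ')' {
            return Ok(ControlFlow::Continue(()));
        }
    }
    Err(format!("unterminated array literal in `{word}`"))
}

/// Reads one word and returns it together with the unread remainder.
fn read_word(input: &str) -> Result<(String, String), String> {
    let mut chars = input.chars().peekable();
    let mut word = String::new();
    while let Some(&c) = chars.peek() {
        if c.is_whitespace() {
            break;
        }
        if c == '(' {
            // `?` propagates errors; the ControlFlow value drives the loop.
            if read_open_paren(&mut chars, &mut word)?.is_break() {
                break;
            }
            continue;
        }
        word.push(c);
        chars.next();
    }
    Ok((word, chars.collect()))
}
```

The outer loop stays flat: errors leave through `?`, and the only decision left at the call site is whether to `break`.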

## Stack

Fresh from `main`; no stack.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mpecan pushed a commit that referenced this pull request Apr 19, 2026
🤖 I have created a release *beep* *boop*
---


## [0.2.0](rable-v0.1.15...rable-v0.2.0) (2026-04-18)


### ⚠ BREAKING CHANGES

* tighten lexer API surface and relocate WordSpan to ast
([#70](#70))

### Bug Fixes

* **format:** align cmdsub reformatter with bash canonical form
([#49](#49))
([c7a4411](c7a4411))
* **lexer:** accept sloppy heredoc terminator in cmdsub mode
([#50](#50))
([40f394f](40f394f))
* **lexer:** backticks opaque when content is invalid
([#71](#71))
([e72166f](e72166f)),
closes [#38](#38)
* **lexer:** disable reserved-word recognition after assignment words
([#44](#44))
([42e1fc0](42e1fc0))
* **lexer:** stop treating ]] and unbalanced [...] as special outside
conditionals ([#45](#45))
([4bf5a5c](4bf5a5c))
* **parser:** fall back from (( … )) arith to nested subshells
([#48](#48))
([1437f00](1437f00))


### Code Refactoring

* **format:** introduce Formatter struct
([#65](#65))
([d965a8f](d965a8f))
* **lexer:** drop `Result<Token>` wrapper from operator readers
([#62](#62))
([d52a841](d52a841))
* **lexer:** split read_word_token into classify + advance + dispatch
helpers ([#63](#63))
([3ba09f5](3ba09f5))
* **parser:** extract fill_heredoc_contents visitor helpers
([#68](#68))
([40e6165](40e6165))
* **parser:** extract helpers from three oversize parsers
([#69](#69))
([25d0762](25d0762))
* **sexp:** dispatch NodeKind Display to per-category helpers
([#66](#66))
([44b0330](44b0330))
* **sexp:** table-drive ANSI-C escape dispatch
([#67](#67))
([91a5267](91a5267))
* tighten lexer API surface and relocate WordSpan to ast
([#70](#70))
([5171d01](5171d01))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: repository-butler[bot] <166800726+repository-butler[bot]@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

parser: reserved words should parse as plain words in non-reserved positions
