fix(lexer): stop treating ]] and unbalanced [...] as special outside conditionals #45
Merged
This was referenced Apr 14, 2026
mpecan added a commit that referenced this pull request Apr 14, 2026
…zzing (#46)

## Summary

Adds 26 test cases that bash-oracle parses successfully but rable produces a different AST (or errors) for. These are **real rable bugs**, collected from differential fuzzing. Every case is listed in `KNOWN_ORACLE_FAILURES` in `tests/integration.rs`, so this PR does not introduce any failing tests; it just gives the 7 tracked bugs a stable regression corpus to land fixes against.

## What this PR contains

- **`tests/fuzz.py`**: new `--valid-only` flag on both `mutate` and `generate` modes. Filters out mismatches where the reference parser rejects the input, so only divergences on *valid* bash remain. This is the change that surfaced the 26 cases.
- **`tests/oracle/bash_valid_divergences.tests`**: 26 test cases grouped by cluster, each with the canonical bash-oracle output. Header references the 7 tracked issues.
- **`tests/integration.rs`**: registers the new oracle file and lists all 26 cases in `KNOWN_ORACLE_FAILURES` with inline comments grouping them by issue.

## Clusters and tracking issues

| Cluster | Issue | Cases |
|---|---|---|
| `rbracket_outside_cond`: `]]` tokenization outside `[[ ]]` | #35 | 3 |
| `bracket_op_split`: unbalanced `[...]` absorbing `\|\|` / `&&` | #36 | 3 |
| `reserved_word_as_word`: reserved words in non-reserved positions | #37 | 5 |
| `backtick_opaque`: backticks opaque on invalid content | #38 | 6 |
| `heredoc_in_cmdsub`: heredoc inside `$( ... )` | #39 | 2 |
| `cmdsub_reformat`: command-sub canonical reformat drift | #40 | 6 |
| `linecont_in_arith`: `\<newline>` inside `$(( ))` | #41 | 1 |

## Motivation

Landing this first de-risks the three open fix PRs (#43 / #44 / #45) by making their diffs show only the fix, not the 260-line corpus addition as baseline noise.

## Test plan

- [ ] `cargo test --test integration oracle_bash_valid_divergences`: reports `0/26 passed (26 remaining)`, which is the expected baseline
- [ ] `cargo test --all-targets`: full suite stays green (nothing new fails; the corpus only lives in `KNOWN_ORACLE_FAILURES`)
- [ ] Quality gate: `cargo fmt` + `cargo clippy --all-targets -- -D warnings`

🤖 Generated with [Claude Code](https://claude.com/claude-code)
…prefix
Two independent bugs around `]]` and `[...]` outside a `[[ ]]` conditional.
Bug A — `]]` was unconditionally classified as `DoubleRightBracket` and
`is_list_terminator` treated any `Word "]]"` as a terminator, so a stray
`]]` in the middle of a plain command (e.g. `Cdeclare -n ref=t ]] arget`)
silently terminated the list instead of becoming an ordinary word.
Bug B — `read_bracket_subscript` fired on any `[` mid-word as long as the
word so far was non-empty and didn't already end with `[`. That absorbed
whitespace and control operators inside constructs like `][[`, `[c[`,
`$$[a||b]`, `case[a&&b]` and `foo^[...]` where bash does not.
Fix:
* Only enter `read_bracket_subscript` when (a) we're at command-start
and the word so far is a plain identifier (`arr[i]=val`, `foo[...]`)
or (b) we're inside `[[ ]]` (regex char classes). Everywhere else
`[` is a literal word character.
* Downgrade `]]` from `DoubleRightBracket` to `Word` outside
`cond_expr`; drop the `]]` branch from `Parser::is_list_terminator`.
Inside `[[ ]]`, `parse_cond_command` keeps setting `cond_expr`, so
`]]` remains the reserved closing token there.
Also removes the stale `read_bracket_subscript` call sites in
`continue_word` and `read_array_element` (they were never needed once
the main loop is guarded).
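
The gating rule from the Fix list can be illustrated with a small standalone predicate. This is only a sketch under stated assumptions: `should_read_bracket_subscript` and `is_plain_identifier` are hypothetical names, and the real guard in `src/lexer/words.rs` may track more state than the three inputs used here.

```rust
/// Hypothetical helper: true for words like `arr` or `foo`
/// (a letter or underscore followed by letters, digits, or underscores).
fn is_plain_identifier(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' => {
            chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
        }
        _ => false,
    }
}

/// Sketch of the guard (assumed shape, not the actual rable code):
/// enter subscript mode only (a) at command start after a plain
/// identifier, or (b) inside `[[ ]]`. Otherwise `[` is a literal character.
fn should_read_bracket_subscript(
    at_command_start: bool,
    word_so_far: &str,
    in_cond_expr: bool,
) -> bool {
    in_cond_expr || (at_command_start && is_plain_identifier(word_so_far))
}

fn main() {
    // `arr[i]=val` at command start: subscript mode applies.
    assert!(should_read_bracket_subscript(true, "arr", false));
    // `case[a&&b]` in argument position: `[` stays literal.
    assert!(!should_read_bracket_subscript(false, "case", false));
    // `$$[a||b]` and `foo^[...]`: the prefix is not a plain identifier.
    assert!(!should_read_bracket_subscript(false, "$$", false));
    assert!(!should_read_bracket_subscript(true, "foo^", false));
    // Inside `[[ ]]`, regex character classes still use subscript mode.
    assert!(should_read_bracket_subscript(false, "", true));
    println!("guard sketch ok");
}
```

Expressing the rule as a pure predicate over lexer state keeps the decision testable without running the whole tokenizer.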
Closes #35.
Side-effect fix for cases originally reported in #36 (all three
`bracket_op_split` cases) and #37 (`reserved_word_as_word 5`) — they
share the same root cause and now pass. #36 and #37 remain open for
whoever verifies there are no further manifestations.
Unit tests in `src/lexer/tests.rs` directly exercise the fix at the
lexer level (no reliance on the oracle corpus), and the three
`rbracket_outside_cond`, three `bracket_op_split`, and one
`reserved_word_as_word 5` entries are removed from
`KNOWN_ORACLE_FAILURES`.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
62790a6 to
719e3c1
mpecan pushed a commit that referenced this pull request Apr 19, 2026
🤖 I have created a release *beep* *boop*

---

## [0.2.0](rable-v0.1.15...rable-v0.2.0) (2026-04-18)

### ⚠ BREAKING CHANGES

* tighten lexer API surface and relocate WordSpan to ast ([#70](#70))

### Bug Fixes

* **format:** align cmdsub reformatter with bash canonical form ([#49](#49)) ([c7a4411](c7a4411))
* **lexer:** accept sloppy heredoc terminator in cmdsub mode ([#50](#50)) ([40f394f](40f394f))
* **lexer:** backticks opaque when content is invalid ([#71](#71)) ([e72166f](e72166f)), closes [#38](#38)
* **lexer:** disable reserved-word recognition after assignment words ([#44](#44)) ([42e1fc0](42e1fc0))
* **lexer:** stop treating ]] and unbalanced [...] as special outside conditionals ([#45](#45)) ([4bf5a5c](4bf5a5c))
* **parser:** fall back from (( … )) arith to nested subshells ([#48](#48)) ([1437f00](1437f00))

### Code Refactoring

* **format:** introduce Formatter struct ([#65](#65)) ([d965a8f](d965a8f))
* **lexer:** drop Result<Token> wrapper from operator readers ([#62](#62)) ([d52a841](d52a841))
* **lexer:** split read_word_token into classify + advance + dispatch helpers ([#63](#63)) ([3ba09f5](3ba09f5))
* **parser:** extract fill_heredoc_contents visitor helpers ([#68](#68)) ([40e6165](40e6165))
* **parser:** extract helpers from three oversize parsers ([#69](#69)) ([25d0762](25d0762))
* **sexp:** dispatch NodeKind Display to per-category helpers ([#66](#66)) ([44b0330](44b0330))
* **sexp:** table-drive ANSI-C escape dispatch ([#67](#67)) ([91a5267](91a5267))
* tighten lexer API surface and relocate WordSpan to ast ([#70](#70)) ([5171d01](5171d01))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: repository-butler[bot] <166800726+repository-butler[bot]@users.noreply.github.com>
Closes #35
## Root cause

Two independent bugs in the lexer / parser, both triggered by `]]` and `[...]` appearing outside a `[[ ]]` conditional.

**Bug A: `]]` was always a reserved word.** `TokenType::reserved_word("]]")` returned `DoubleRightBracket` regardless of context, and `Parser::is_list_terminator` treated any `Word "]]"` as a list terminator. A stray `]]` in the middle of a plain command (e.g. `Cdeclare -n ref=t ]] arget`) silently terminated the list instead of becoming an ordinary word.

**Bug B: `read_bracket_subscript` was unguarded.** The helper fired on any mid-word `[` as long as the word so far was non-empty and didn't already end with `[`. That absorbed whitespace and control operators inside constructs like `][[`, `[c[`, `$$[a||b]`, `case[a&&b]`, and `foo^[...]` where bash does not.

## Fix

- `src/lexer/words.rs`: gate `read_bracket_subscript` so it only runs when:
  - we're at command start and the word so far is a plain identifier (`arr[i]=val` assignments, `foo[...]` invocations, which bash parses as a single word and in which subscripts permissively absorb whitespace), or
  - we're inside `[[ ]]` (so regex character classes like `[[:alpha:][:digit:]]` still tokenize as one word).
- `src/lexer/words.rs`: after the existing reserved-word classification, downgrade `TokenType::DoubleRightBracket` to `TokenType::Word` whenever `cond_expr` is false. `parse_cond_command` sets `cond_expr` across the body of a `[[ ]]`, so the reserved-word meaning is preserved exactly where bash expects it.
- `src/parser/mod.rs`: drop the `Word "]]"` branch from `is_list_terminator`. A bare `]]` is now a plain word that flows into the current command instead of ending it.
- `src/lexer/words.rs`: remove the stale `read_bracket_subscript` call sites in `continue_word` and `read_array_element` (they are no longer needed once the main-loop branch is guarded).
- `tests/integration.rs`: remove the #35 entries and the bonus side-effect entries from `KNOWN_ORACLE_FAILURES` so they become active regression guards.

## Fixed test cases from #35

- `rbracket_outside_cond 1`: `][[ "$file" == *.txt ]]`
- `rbracket_outside_cond 2`: `[c[ $x =~ ]+[a-z] ]]`
- `rbracket_outside_cond 3`: `Cdeclare -n ref=t ]] arget`

## Side-effect fixes (tests, not auto-closing issues)

Side-effect fix for cases originally reported in #36 and #37 (case 5):

- `bracket_op_split 1`: `echo ho $$[a||b]`
- `bracket_op_split 2`: `Decho $ case[a&&b]`
- `bracket_op_split 3`: `foo^[a-echo ${foo^[a-z]}`
- `reserved_word_as_word 5`: `if [ $"yes" = yes9 ][ $; then echo ok; fi`

All four share the same root cause as Bug B and now pass. `KNOWN_ORACLE_FAILURES` has been updated and unit tests were added as direct regression guards. #36 and #37 are intentionally not auto-closed: #36 can be closed by whoever verifies there are no further manifestations, and #37 still has cases 1-4 open (parallel work).

## Test plan

- `cargo fmt`
- `cargo clippy --all-targets -- -D warnings`: no warnings
- `cargo test`: 230 passed (175 unit + 1 doc + 48 integration + 6)
- Unit tests in `src/lexer/tests.rs` exercise each failing scenario directly at the lexer level (no reliance on the oracle corpus)
- Array subscripts (`arr[0 foo]`) and `[[ ]]` regex char classes (`[[:alpha:][:dig||it:]]`)
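
The context-sensitive treatment of `]]` described in this PR can be sketched in isolation. The block below is a minimal illustration, not the rable implementation: the `TokenType` enum is cut down to two variants, the signature of `reserved_word` is an assumption, and `classify_word` is a hypothetical name for the step that runs after reserved-word classification.

```rust
// Cut-down stand-in for the lexer's token kinds (assumption: the real
// enum has many more variants).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenType {
    Word,
    DoubleRightBracket,
}

/// Assumed shape of the reserved-word lookup: context-free, so on its
/// own it always maps `]]` to the reserved token.
fn reserved_word(text: &str) -> Option<TokenType> {
    match text {
        "]]" => Some(TokenType::DoubleRightBracket),
        _ => None,
    }
}

/// Hypothetical post-classification step: keep `]]` reserved only while
/// `cond_expr` is set (that is, inside a `[[ ]]` body); otherwise
/// downgrade it to an ordinary word.
fn classify_word(text: &str, cond_expr: bool) -> TokenType {
    let ty = reserved_word(text).unwrap_or(TokenType::Word);
    if ty == TokenType::DoubleRightBracket && !cond_expr {
        TokenType::Word
    } else {
        ty
    }
}

fn main() {
    // Outside a conditional, a stray `]]` is just a word.
    assert_eq!(classify_word("]]", false), TokenType::Word);
    // Inside `[[ ]]`, it remains the closing reserved token.
    assert_eq!(classify_word("]]", true), TokenType::DoubleRightBracket);
    // Ordinary words are unaffected by the flag.
    assert_eq!(classify_word("arget", true), TokenType::Word);
    println!("classification sketch ok");
}
```

With the downgrade applied at classification time, a list-terminator check keyed on the reserved token no longer needs a separate `Word "]]"` branch, which matches the removal described for `is_list_terminator`.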