fix(lexer): stop treating ]] and unbalanced [...] as special outside conditionals #45

Merged: mpecan merged 1 commit into main from feat/35-rbracket-outside-cond on Apr 14, 2026
mpecan (Owner) commented Apr 14, 2026

Closes #35

Root cause

Two independent bugs in the lexer and parser were both triggered by ]] and [...] appearing outside a [[ ]] conditional.

Bug A — ]] was always a reserved word. TokenType::reserved_word("]]") returned DoubleRightBracket regardless of context, and Parser::is_list_terminator treated any Word "]]" as a list terminator. A stray ]] in the middle of a plain command (e.g. Cdeclare -n ref=t ]] arget) silently terminated the list instead of becoming an ordinary word.

Bug B — read_bracket_subscript was unguarded. The helper fired on any mid-word [ as long as the word so far was non-empty and didn't already end with [. That absorbed whitespace and control operators inside constructs like ][[, [c[, $$[a||b], case[a&&b], and foo^[...] where bash does not.

Fix

  • src/lexer/words.rs — gate read_bracket_subscript so it only runs when:
    • we're at command-start and the word so far is a plain bash identifier (arr[i]=val assignments, foo[...] invocations, which bash parses as a single word and in which subscripts permissively absorb whitespace), or
    • we're inside [[ ]] (so regex character classes like [[:alpha:][:digit:]] still tokenize as one word).
  • src/lexer/words.rs — after the existing reserved-word classification, downgrade TokenType::DoubleRightBracket to TokenType::Word whenever cond_expr is false. parse_cond_command sets cond_expr across the body of a [[ ]], so the reserved-word meaning is preserved exactly where bash expects it.
  • src/parser/mod.rs — drop the Word "]]" branch from is_list_terminator. A bare ]] is now a plain word that flows into the current command instead of ending it.
  • src/lexer/words.rs — remove the stale read_bracket_subscript call sites in continue_word and read_array_element (they are no longer needed once the main-loop branch is guarded).
  • tests/integration.rs — remove the #35 entries (lexer: ]] not tokenized as separate word outside [[ ... ]]) and the bonus side-effect entries from KNOWN_ORACLE_FAILURES so they become active regression guards.
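
The two gates described in the list above can be sketched as a pair of small predicates. This is only an illustration under stated assumptions: `LexState`, `is_plain_identifier`, `should_read_bracket_subscript`, and `classify` are hypothetical names, the two-variant `TokenType` stands in for the crate's real enum, and the sketch assumes the subscript gate is consulted only when the lexer meets a `[` in the middle of a word.

```rust
/// Hypothetical stand-in for the crate's `TokenType`; only the two
/// variants discussed in this PR are modelled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenType {
    Word,
    DoubleRightBracket,
}

/// Hypothetical slice of lexer state relevant to the two gates.
struct LexState {
    /// True while the lexer is reading the first word of a command.
    at_command_start: bool,
    /// True while the parser is inside the body of a `[[ ]]`.
    cond_expr: bool,
}

/// A plain bash identifier: a letter or underscore followed by
/// letters, digits, or underscores.
fn is_plain_identifier(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' => {}
        _ => return false,
    }
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
}

/// Gate for treating a mid-word `[` as the start of a subscript
/// (assumed to be called only when `word_so_far` is non-empty).
fn should_read_bracket_subscript(state: &LexState, word_so_far: &str) -> bool {
    state.cond_expr || (state.at_command_start && is_plain_identifier(word_so_far))
}

/// Classify a finished word: `]]` is reserved only inside `[[ ]]`.
fn classify(state: &LexState, text: &str) -> TokenType {
    if text == "]]" && state.cond_expr {
        TokenType::DoubleRightBracket
    } else {
        TokenType::Word
    }
}
```

Under this sketch, `arr` at command-start passes the subscript gate, while prefixes such as `]`, `[c`, or `$$` do not, so the `[` that follows them stays a literal word character.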

Fixed test cases from #35

  • rbracket_outside_cond 1: ][[ "$file" == *.txt ]]
  • rbracket_outside_cond 2: [c[ $x =~ ]+[a-z] ]]
  • rbracket_outside_cond 3: Cdeclare -n ref=t ]] arget
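
To make the intended tokenization of these inputs concrete, here is a toy whitespace tokenizer. It is not the project's lexer: `Tok` and `toy_tokenize` are hypothetical, quoting and operators are ignored, and `[[` is recognised only as the first word. It shows only the one property this PR is about, namely that `]]` is a closing token when a `[[` is open and an ordinary word otherwise.

```rust
/// Hypothetical token type for the toy tokenizer below.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Tok {
    Word(String),
    DoubleLeftBracket,
    DoubleRightBracket,
}

/// Split on whitespace and classify each piece. `[[` opens a conditional
/// only in first position; `]]` closes it only while one is open.
fn toy_tokenize(input: &str) -> Vec<Tok> {
    let mut cond_expr = false;
    let mut out = Vec::new();
    for (i, piece) in input.split_whitespace().enumerate() {
        let tok = match piece {
            "[[" if i == 0 => {
                cond_expr = true;
                Tok::DoubleLeftBracket
            }
            "]]" if cond_expr => {
                cond_expr = false;
                Tok::DoubleRightBracket
            }
            other => Tok::Word(other.to_string()),
        };
        out.push(tok);
    }
    out
}
```

With this model, every piece of `Cdeclare -n ref=t ]] arget` comes out as a `Tok::Word`, whereas `[[ -n $x ]]` ends in `Tok::DoubleRightBracket`.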

Side-effect fixes (tests, not auto-closing issues)

Side-effect fix for cases originally reported in #36 and #37 (case 5):

  • bracket_op_split 1: echo ho $$[a||b]
  • bracket_op_split 2: Decho $ case[a&&b]
  • bracket_op_split 3: foo^[a-echo ${foo^[a-z]}
  • reserved_word_as_word 5: if [ $"yes" = yes9 ][ $; then echo ok; fi

All four share the same root cause as Bug B and now pass. KNOWN_ORACLE_FAILURES has been updated and unit tests were added as direct regression guards. #36 and #37 are intentionally not auto-closed — #36 can be closed by whoever verifies there are no further manifestations, and #37 still has cases 1-4 open (parallel work).
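
As a rough model of why removing an entry from KNOWN_ORACLE_FAILURES turns it into a regression guard, the sketch below classifies one oracle case from two facts: whether the parser output matches the oracle, and whether the case is on the list. The `Verdict` enum, the `verdict` function, and the sample list contents are assumptions made for illustration; the actual harness in tests/integration.rs may handle a listed case that starts passing differently.

```rust
/// Hypothetical outcome of checking one oracle case.
#[derive(Debug, PartialEq, Eq)]
enum Verdict {
    /// Matches the oracle and is not listed: an active guard that holds.
    Pass,
    /// Diverges but is listed: tolerated, the suite stays green.
    KnownFailure,
    /// Diverges and is not listed: the suite should fail.
    Regression,
    /// Matches the oracle but is still listed: the entry can be removed.
    StaleEntry,
}

/// Illustrative list contents only, not the real list.
const KNOWN_ORACLE_FAILURES: &[&str] = &["reserved_word_as_word 1", "reserved_word_as_word 2"];

/// Combine the comparison result with list membership.
fn verdict(case_name: &str, matches_oracle: bool, known_failures: &[&str]) -> Verdict {
    let listed = known_failures.contains(&case_name);
    match (matches_oracle, listed) {
        (true, false) => Verdict::Pass,
        (false, true) => Verdict::KnownFailure,
        (false, false) => Verdict::Regression,
        (true, true) => Verdict::StaleEntry,
    }
}
```

In this model, a case such as `rbracket_outside_cond 1` that is no longer listed yields `Verdict::Regression` as soon as it diverges from the oracle again.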

Test plan

  • cargo fmt
  • cargo clippy --all-targets -- -D warnings — no warnings
  • cargo test — 230 passed (175 unit + 1 doc + 48 integration + 6)
  • 11 new unit tests in src/lexer/tests.rs exercise each failing scenario directly at the lexer level (no reliance on the oracle corpus)
  • Regression guards in place for identifier-prefixed array subscripts (arr[0 foo]) and [[ ]] regex char classes ([[:alpha:][:dig||it:]])
  • No regressions in the Parable corpus or the remaining oracle suites

mpecan added a commit that referenced this pull request Apr 14, 2026
…zzing (#46)

## Summary

Adds 26 test cases that bash-oracle parses successfully but rable
produces a different AST (or errors) for. These are **real rable bugs**,
collected from differential fuzzing. Every case is listed in
`KNOWN_ORACLE_FAILURES` in `tests/integration.rs`, so this PR does not
introduce any failing tests — it just gives the 7 tracked bugs a stable
regression corpus to land fixes against.

## What this PR contains

- **`tests/fuzz.py`** — new `--valid-only` flag on both `mutate` and
`generate` modes. Filters out mismatches where the reference parser
rejects the input, so only divergences on *valid* bash remain. This is
the change that surfaced the 26 cases.
- **`tests/oracle/bash_valid_divergences.tests`** — 26 test cases
grouped by cluster, each with the canonical bash-oracle output. Header
references the 7 tracked issues.
- **`tests/integration.rs`** — registers the new oracle file and lists
all 26 cases in `KNOWN_ORACLE_FAILURES` with inline comments grouping
them by issue.

## Clusters and tracking issues

| Cluster | Issue | Cases |
|---|---|---|
| `rbracket_outside_cond` — `]]` tokenization outside `[[ ]]` | #35 | 3 |
| `bracket_op_split` — unbalanced `[...]` absorbing `\|\|` / `&&` | #36 | 3 |
| `reserved_word_as_word` — reserved words in non-reserved positions | #37 | 5 |
| `backtick_opaque` — backticks opaque on invalid content | #38 | 6 |
| `heredoc_in_cmdsub` — heredoc inside `$( ... )` | #39 | 2 |
| `cmdsub_reformat` — command-sub canonical reformat drift | #40 | 6 |
| `linecont_in_arith` — `\<newline>` inside `$(( ))` | #41 | 1 |

## Motivation

Landing this first de-risks the three open fix PRs (#43 / #44 / #45) by
making their diffs show only the fix, not the 260-line corpus addition
as baseline noise.

## Test plan

- [ ] `cargo test --test integration oracle_bash_valid_divergences` —
reports `0/26 passed (26 remaining)`, which is the expected baseline
- [ ] `cargo test --all-targets` — full suite stays green (nothing new
fails; the corpus only lives in `KNOWN_ORACLE_FAILURES`)
- [ ] Quality gate: `cargo fmt` + `cargo clippy --all-targets -- -D
warnings`

🤖 Generated with [Claude Code](https://claude.com/claude-code)
…prefix

Two independent bugs around `]]` and `[...]` outside a `[[ ]]` conditional.

Bug A — `]]` was unconditionally classified as `DoubleRightBracket` and
`is_list_terminator` treated any `Word "]]"` as a terminator, so a stray
`]]` in the middle of a plain command (e.g. `Cdeclare -n ref=t ]] arget`)
silently terminated the list instead of becoming an ordinary word.

Bug B — `read_bracket_subscript` fired on any `[` mid-word as long as the
word so far was non-empty and didn't already end with `[`. That absorbed
whitespace and control operators inside constructs like `][[`, `[c[`,
`$$[a||b]`, `case[a&&b]` and `foo^[...]` where bash does not.

Fix:
  * Only enter `read_bracket_subscript` when (a) we're at command-start
    and the word so far is a plain identifier (`arr[i]=val`, `foo[...]`)
    or (b) we're inside `[[ ]]` (regex char classes). Everywhere else
    `[` is a literal word character.
  * Downgrade `]]` from `DoubleRightBracket` to `Word` outside
    `cond_expr`; drop the `]]` branch from `Parser::is_list_terminator`.
    Inside `[[ ]]`, `parse_cond_command` keeps setting `cond_expr`, so
    `]]` remains the reserved closing token there.

Also removes the stale `read_bracket_subscript` call sites in
`continue_word` and `read_array_element` (they are no longer needed once
the main loop is guarded).

Closes #35.

Side-effect fix for cases originally reported in #36 (all three
`bracket_op_split` cases) and #37 (`reserved_word_as_word 5`) — they
share the same root cause and now pass. #36 and #37 remain open for
whoever verifies there are no further manifestations.

Unit tests in `src/lexer/tests.rs` directly exercise the fix at the
lexer level (no reliance on the oracle corpus), and the three
`rbracket_outside_cond`, three `bracket_op_split`, and one
`reserved_word_as_word 5` entries are removed from
`KNOWN_ORACLE_FAILURES`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mpecan force-pushed the feat/35-rbracket-outside-cond branch from 62790a6 to 719e3c1 on April 14, 2026 20:25
mpecan merged commit 4bf5a5c into main on Apr 14, 2026
5 checks passed
mpecan deleted the feat/35-rbracket-outside-cond branch on April 14, 2026 20:28
mpecan pushed a commit that referenced this pull request Apr 19, 2026
🤖 I have created a release *beep* *boop*
---


## [0.2.0](rable-v0.1.15...rable-v0.2.0) (2026-04-18)


### ⚠ BREAKING CHANGES

* tighten lexer API surface and relocate WordSpan to ast
([#70](#70))

### Bug Fixes

* **format:** align cmdsub reformatter with bash canonical form
([#49](#49))
([c7a4411](c7a4411))
* **lexer:** accept sloppy heredoc terminator in cmdsub mode
([#50](#50))
([40f394f](40f394f))
* **lexer:** backticks opaque when content is invalid
([#71](#71))
([e72166f](e72166f)),
closes [#38](#38)
* **lexer:** disable reserved-word recognition after assignment words
([#44](#44))
([42e1fc0](42e1fc0))
* **lexer:** stop treating ]] and unbalanced [...] as special outside
conditionals ([#45](#45))
([4bf5a5c](4bf5a5c))
* **parser:** fall back from (( … )) arith to nested subshells
([#48](#48))
([1437f00](1437f00))


### Code Refactoring

* **format:** introduce Formatter struct
([#65](#65))
([d965a8f](d965a8f))
* **lexer:** drop `Result<Token>` wrapper from operator readers
([#62](#62))
([d52a841](d52a841))
* **lexer:** split read_word_token into classify + advance + dispatch
helpers ([#63](#63))
([3ba09f5](3ba09f5))
* **parser:** extract fill_heredoc_contents visitor helpers
([#68](#68))
([40e6165](40e6165))
* **parser:** extract helpers from three oversize parsers
([#69](#69))
([25d0762](25d0762))
* **sexp:** dispatch NodeKind Display to per-category helpers
([#66](#66))
([44b0330](44b0330))
* **sexp:** table-drive ANSI-C escape dispatch
([#67](#67))
([91a5267](91a5267))
* tighten lexer API surface and relocate WordSpan to ast
([#70](#70))
([5171d01](5171d01))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: repository-butler[bot] <166800726+repository-butler[bot]@users.noreply.github.com>