feat: fallback static parsing for completely invalid notebooks#8723
feat: fallback static parsing for completely invalid notebooks#8723dmadisetti merged 1 commit intomainfrom
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
marimo/_ast/parse.py
Outdated
| """When ast.parse() fails, use scanner to recover individual cells.""" | ||
| from marimo._ast.scanner import scan_parse_fallback | ||
|
|
||
| return scan_parse_fallback(source, filepath) |
There was a problem hiding this comment.
does this do anything? should we just inline this?
There was a problem hiding this comment.
Pull request overview
Adds a tokenizer-driven fallback parser so marimo notebooks with broken Python syntax can still be statically parsed by recovering @app.cell-style boundaries, improving resilience for --watch/VS Code workflows.
Changes:
- Introduces
marimo._ast.scannerwith token-based boundary detection plus recovery logic, and ascan_parse_fallback()that parses cells individually. - Updates
Parser.node_stack()to fall back to the scanner onSyntaxError, and emits a dedicated violation for scanner-generated unparsable cells. - Adjusts linting/tests to treat syntax-broken files without cell boundaries as skipped/unrecognisable, adds an encoding-bytes regression test, and avoids duplicate diagnostics for unparsable cells.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
marimo/_ast/scanner.py |
New tokenizer-based scanner and per-cell fallback parsing to recover boundaries in syntactically invalid notebooks. |
marimo/_ast/parse.py |
Hooks scanner fallback into parsing flow and tags scanner-generated unparsable cells with a specific violation. |
marimo/_ast/load.py |
Reads notebook text with errors="replace" to avoid crashing on invalid UTF-8 bytes. |
marimo/_lint/rules/formatting/general.py |
Skips scanner-specific unparsable violations to prevent duplicate reporting alongside MB001. |
tests/_lint/test_run_check.py |
Updates expectations for syntax-broken non-notebooks, adds encoding and deduplication regression tests. |
tests/_lint/test_json_formatter.py |
Updates JSON output expectations when broken files are skipped rather than errored. |
tests/_ast/test_load.py |
Updates expected status behavior for syntax-broken inputs with/without notebook boundaries. |
| # Error recovery: scan forward from error line for boundaries | ||
| error_line_in_chunk, _exc = error_info | ||
| error_line_abs = error_line_in_chunk + offset | ||
| found_restart = False | ||
|
|
||
| for candidate_line_0 in range(error_line_abs, total_lines): | ||
| line_text = lines[candidate_line_0] |
There was a problem hiding this comment.
In scan_notebook() error recovery, error_line_in_chunk is 1-indexed (tokenize reports line numbers starting at 1), but error_line_abs is used as a 0-indexed list index into lines. This causes an off-by-one (skipping the actual error line and potentially indexing past the end), which can prevent finding the next boundary and break recovery. Convert the absolute error line to 0-index before iterating over lines (and consider clamping to [0, total_lines)).
| def _has_cell_boundaries(source: str) -> bool: | ||
| """Quick check whether source has any cell boundary markers.""" | ||
| return ( | ||
| "@app.cell" in source | ||
| or "@app.function" in source | ||
| or "@app.class_definition" in source | ||
| or "with app.setup" in source | ||
| or "app._unparsable_cell" in source | ||
| ) | ||
|
|
||
|
|
||
| def scan_parse_fallback( | ||
| source: str, filepath: str | ||
| ) -> tuple[list[ast.stmt], frozenset[int]]: | ||
| """Fallback parser: scan for cell boundaries, parse each cell individually. | ||
|
|
||
| Called when ast.parse() on the full file fails due to syntax errors. | ||
| Returns a tuple of (nodes, scanner_generated_lines) where | ||
| scanner_generated_lines contains the 1-indexed start line numbers of | ||
| unparsable cells created by the scanner (vs. pre-existing | ||
| app._unparsable_cell() calls in the source). | ||
| Returns ([], frozenset()) if no cell boundaries are found. | ||
| """ | ||
| from marimo._ast.parse import ast_parse | ||
|
|
||
| if not _has_cell_boundaries(source): | ||
| return [], frozenset() | ||
|
|
||
| scan = scan_notebook(source) |
There was a problem hiding this comment.
_has_cell_boundaries() is a raw substring check, so it can return true for @app.cell appearing in a string/comment. In that case scan_parse_fallback() proceeds, scan_notebook() may find zero real boundaries, and then the preamble parse re-raises SyntaxError (fatal) even though the file is effectively “no boundaries” and should be skipped gracefully. Consider removing this precheck or making it token/line-anchored (column-0) and, importantly, early-returning ([], frozenset()) when scan_notebook() finds no boundaries before attempting to parse scan.preamble.
| line_idx = start_line - 2 # 0-indexed, one line before | ||
| while line_idx >= 0: | ||
| line = lines[line_idx].strip() | ||
| if line.startswith("@"): | ||
| adjusted_start = line_idx + 1 # 1-indexed | ||
| line_idx -= 1 | ||
| elif not line: | ||
| # Skip blank lines between decorators | ||
| line_idx -= 1 | ||
| else: | ||
| break |
There was a problem hiding this comment.
When expanding a boundary upward to include preceding decorators, using lines[line_idx].strip().startswith("@") ignores indentation. That means an indented line like @decorator (e.g., inside a previous cell/function, or in malformed code) can be incorrectly pulled into the next cell’s start, corrupting cell slicing. To keep boundaries stable, only treat decorators at column 0 (e.g., lines[line_idx].startswith("@")) or match the boundary line’s indentation explicitly.
marimo/_ast/parse.py
Outdated
| nodes, scanner_lines = _scan_parse_fallback( | ||
| self.extractor.contents or "", self.filepath | ||
| ) | ||
| self._scanner_generated_lines = scanner_lines | ||
| return PeekStack(iter(nodes)) |
There was a problem hiding this comment.
Parser.node_stack() claims the scanner fallback will “Never re-raise”, but _scan_parse_fallback() / scan_parse_fallback() can still raise SyntaxError (e.g., if there’s a syntax error in the preamble before the first cell boundary, it is explicitly re-raised as “fatal”). This can still break --watch/IPC, which the comment (and PR description) suggests should not happen. Either catch SyntaxError around _scan_parse_fallback() here and return a best-effort empty node list (or an Unparsable marker), or adjust the comment/behavior so the contract is accurate.
| nodes, scanner_lines = _scan_parse_fallback( | |
| self.extractor.contents or "", self.filepath | |
| ) | |
| self._scanner_generated_lines = scanner_lines | |
| return PeekStack(iter(nodes)) | |
| try: | |
| nodes, scanner_lines = _scan_parse_fallback( | |
| self.extractor.contents or "", self.filepath | |
| ) | |
| self._scanner_generated_lines = scanner_lines | |
| return PeekStack(iter(nodes)) | |
| except SyntaxError: | |
| # If the scanner itself encounters a fatal syntax error (e.g., | |
| # in the preamble before any cell boundary), fall back to an | |
| # empty node list rather than propagating the exception. | |
| self._scanner_generated_lines = frozenset() | |
| return PeekStack(iter(())) |
| else: | ||
| self._reset() | ||
|
|
||
| # if __name__ == "__main__": |
There was a problem hiding this comment.
oh lol nvm. i see what is it
Use a token scanner to recover individual cells when the notebook has syntax errors, so --watch and IPC are never broken by a syntax error. Unparsable cells are flagged with a lint violation.
|
🚀 Development release published. You may be able to view the changes at https://marimo.app?v=0.21.2-dev91 |
📝 Summary
Allows notebooks with broken syntax to be parsed by extracting the "cell boundaries" (i.e. @app.cell) and leveraging indentation. Possibly a bit heavy handed, but leverages a simple state machine to work through the tokens to find the boundaries.
NB, this is ONLY a fallback mechanism, but should prevent breakage in vs-code/ watch when the source file breaks.