feat: fallback static parsing for completely invalid notebooks #8723

Merged
dmadisetti merged 1 commit into main from dm/parse on Mar 30, 2026

Conversation

Collaborator

@dmadisetti dmadisetti commented Mar 16, 2026

📝 Summary

Allows notebooks with broken syntax to be parsed by extracting the "cell boundaries" (e.g. @app.cell) and leveraging indentation. Possibly a bit heavy-handed, but it uses a simple state machine to work through the tokens and find the boundaries.

NB: this is ONLY a fallback mechanism, but it should prevent breakage in VS Code and --watch when the source file has a syntax error.
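
For readers unfamiliar with the approach, here is a minimal sketch of a token-driven boundary scan. It is not the code from this PR: `find_boundaries` and `BOUNDARY_ATTRS` are hypothetical names, and the sketch only recognises column-0 `@app.<attr>` decorators. It relies on the standard-library `tokenize` module, which can keep producing tokens for code that `ast.parse()` rejects.

```python
import io
import tokenize

# Hypothetical set of decorator attributes treated as cell boundaries.
BOUNDARY_ATTRS = {"cell", "function", "class_definition"}


def find_boundaries(source: str) -> list[int]:
    """Return 1-indexed line numbers of column-0 `@app.<attr>` decorators."""
    boundaries: list[int] = []
    state = "idle"  # idle -> at -> app -> dot -> (match) -> idle
    start_line = 0
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    try:
        for tok in tokens:
            is_op = tok.type == tokenize.OP
            is_name = tok.type == tokenize.NAME
            if is_op and tok.string == "@" and tok.start[1] == 0:
                state, start_line = "at", tok.start[0]
            elif state == "at" and is_name and tok.string == "app":
                state = "app"
            elif state == "app" and is_op and tok.string == ".":
                state = "dot"
            elif state == "dot" and is_name and tok.string in BOUNDARY_ATTRS:
                boundaries.append(start_line)
                state = "idle"
            else:
                state = "idle"
    except (tokenize.TokenError, SyntaxError):
        # The tokenizer itself can give up on badly broken input;
        # keep whatever boundaries were found before that point.
        pass
    return boundaries
```

Because the match is made on tokens rather than raw text, an `"@app.cell"` inside a string literal is seen as a single STRING token and is not reported as a boundary.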

vercel bot commented Mar 16, 2026

Deployment status: marimo-docs is Ready (updated Mar 30, 2026 7:28pm UTC).

@dmadisetti dmadisetti changed the title from "stash" to "feat: fallback static parsing for completely invalid notebooks" Mar 16, 2026
@dmadisetti dmadisetti added the "bug" (Something isn't working) label Mar 16, 2026
@dmadisetti dmadisetti marked this pull request as ready for review March 25, 2026 18:11
@dmadisetti dmadisetti requested review from Copilot and manzt and removed request for Copilot March 25, 2026 18:11
@mscolnick mscolnick requested a review from Copilot March 25, 2026 18:51
"""When ast.parse() fails, use scanner to recover individual cells."""
from marimo._ast.scanner import scan_parse_fallback

return scan_parse_fallback(source, filepath)
Contributor

does this do anything? should we just inline this?

Contributor

Copilot AI left a comment

Pull request overview

Adds a tokenizer-driven fallback parser so marimo notebooks with broken Python syntax can still be statically parsed by recovering @app.cell-style boundaries, improving resilience for --watch/VS Code workflows.

Changes:

  • Introduces marimo._ast.scanner with token-based boundary detection plus recovery logic, and a scan_parse_fallback() that parses cells individually.
  • Updates Parser.node_stack() to fall back to the scanner on SyntaxError, and emits a dedicated violation for scanner-generated unparsable cells.
  • Adjusts linting/tests to treat syntax-broken files without cell boundaries as skipped/unrecognisable, adds an encoding-bytes regression test, and avoids duplicate diagnostics for unparsable cells.
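
The overall flow described in the changes above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the merged implementation: `Chunk` and `parse_with_fallback` are hypothetical names, boundaries are found by a plain line-prefix match instead of the tokenizer, and the preamble before the first cell is dropped.

```python
import ast
from dataclasses import dataclass
from typing import Optional

# Hypothetical boundary markers, matched at column 0 only.
BOUNDARY_PREFIXES = ("@app.cell", "@app.function", "@app.class_definition")


@dataclass
class Chunk:
    start_line: int  # 1-indexed line of the boundary decorator
    source: str
    tree: Optional[ast.Module]  # None when this chunk fails to parse


def parse_with_fallback(source: str) -> list[Chunk]:
    """Parse the whole file; on SyntaxError, parse each cell separately."""
    try:
        return [Chunk(1, source, ast.parse(source))]
    except SyntaxError:
        pass  # fall through to per-cell recovery

    lines = source.splitlines()
    starts = [
        i for i, line in enumerate(lines)
        if line.startswith(BOUNDARY_PREFIXES)
    ]
    chunks: list[Chunk] = []
    for n, start in enumerate(starts):
        end = starts[n + 1] if n + 1 < len(starts) else len(lines)
        text = "\n".join(lines[start:end])
        try:
            tree = ast.parse(text)
        except SyntaxError:
            tree = None  # would be reported as an unparsable cell
        chunks.append(Chunk(start + 1, text, tree))
    return chunks
```

The point of the design is that a syntax error in one cell only costs that cell: every other chunk still yields a usable AST.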

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| marimo/_ast/scanner.py | New tokenizer-based scanner and per-cell fallback parsing to recover boundaries in syntactically invalid notebooks. |
| marimo/_ast/parse.py | Hooks scanner fallback into parsing flow and tags scanner-generated unparsable cells with a specific violation. |
| marimo/_ast/load.py | Reads notebook text with errors="replace" to avoid crashing on invalid UTF-8 bytes. |
| marimo/_lint/rules/formatting/general.py | Skips scanner-specific unparsable violations to prevent duplicate reporting alongside MB001. |
| tests/_lint/test_run_check.py | Updates expectations for syntax-broken non-notebooks, adds encoding and deduplication regression tests. |
| tests/_lint/test_json_formatter.py | Updates JSON output expectations when broken files are skipped rather than errored. |
| tests/_ast/test_load.py | Updates expected status behavior for syntax-broken inputs with/without notebook boundaries. |
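
As a small illustration of the errors="replace" change noted for marimo/_ast/load.py, the sketch below shows how the standard `Path.read_text` behaves with that option. The helper name `read_notebook_text` is hypothetical, and UTF-8 is assumed as the encoding.

```python
from pathlib import Path


def read_notebook_text(path: Path) -> str:
    """Read a file as UTF-8, replacing undecodable bytes with U+FFFD."""
    # With errors="replace", an invalid byte no longer raises
    # UnicodeDecodeError; it becomes the replacement character instead.
    return path.read_text(encoding="utf-8", errors="replace")
```

The trade-off is that the replaced bytes are lost, which seems acceptable here since the text is only being parsed and linted.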

Comment on lines +264 to +270
# Error recovery: scan forward from error line for boundaries
error_line_in_chunk, _exc = error_info
error_line_abs = error_line_in_chunk + offset
found_restart = False

for candidate_line_0 in range(error_line_abs, total_lines):
    line_text = lines[candidate_line_0]

Copilot AI Mar 25, 2026

In scan_notebook() error recovery, error_line_in_chunk is 1-indexed (tokenize reports line numbers starting at 1), but error_line_abs is used as a 0-indexed list index into lines. This causes an off-by-one (skipping the actual error line and potentially indexing past the end), which can prevent finding the next boundary and break recovery. Convert the absolute error line to 0-index before iterating over lines (and consider clamping to [0, total_lines)).
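
One way to express the suggested conversion is sketched below. `recovery_start_index` is a hypothetical helper, and it assumes `offset` is the number of lines that precede the chunk in the full file.

```python
def recovery_start_index(
    error_line_in_chunk: int, offset: int, total_lines: int
) -> int:
    """Map a 1-indexed line within a chunk to a 0-indexed file index."""
    # Subtract 1 to move from the tokenizer's 1-indexed line numbers
    # to 0-indexed list positions.
    zero_indexed = error_line_in_chunk + offset - 1
    # Clamp into [0, total_lines - 1]; for an empty file this yields 0,
    # and range(0, 0) is then simply empty.
    return max(0, min(zero_indexed, total_lines - 1))
```

With this, `range(recovery_start_index(...), total_lines)` starts at the error line itself instead of the line after it.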

Comment on lines +584 to +612
def _has_cell_boundaries(source: str) -> bool:
    """Quick check whether source has any cell boundary markers."""
    return (
        "@app.cell" in source
        or "@app.function" in source
        or "@app.class_definition" in source
        or "with app.setup" in source
        or "app._unparsable_cell" in source
    )


def scan_parse_fallback(
    source: str, filepath: str
) -> tuple[list[ast.stmt], frozenset[int]]:
    """Fallback parser: scan for cell boundaries, parse each cell individually.

    Called when ast.parse() on the full file fails due to syntax errors.
    Returns a tuple of (nodes, scanner_generated_lines) where
    scanner_generated_lines contains the 1-indexed start line numbers of
    unparsable cells created by the scanner (vs. pre-existing
    app._unparsable_cell() calls in the source).
    Returns ([], frozenset()) if no cell boundaries are found.
    """
    from marimo._ast.parse import ast_parse

    if not _has_cell_boundaries(source):
        return [], frozenset()

    scan = scan_notebook(source)

Copilot AI Mar 25, 2026

_has_cell_boundaries() is a raw substring check, so it can return true for @app.cell appearing in a string/comment. In that case scan_parse_fallback() proceeds, scan_notebook() may find zero real boundaries, and then the preamble parse re-raises SyntaxError (fatal) even though the file is effectively “no boundaries” and should be skipped gracefully. Consider removing this precheck or making it token/line-anchored (column-0) and, importantly, early-returning ([], frozenset()) when scan_notebook() finds no boundaries before attempting to parse scan.preamble.
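
A line-anchored variant of the precheck could look like the sketch below. The name `has_cell_boundaries` and the exact pattern are assumptions for illustration; the marker strings are taken from the snippet under review.

```python
import re

# Match boundary markers only when they start a line (column 0).
_BOUNDARY_RE = re.compile(
    r"^(?:@app\.(?:cell|function|class_definition)\b"
    r"|with app\.setup\b"
    r"|app\._unparsable_cell\b)",
    re.MULTILINE,
)


def has_cell_boundaries(source: str) -> bool:
    """Return True if any line begins with a cell boundary marker."""
    return _BOUNDARY_RE.search(source) is not None
```

This still matches a marker that happens to start a line inside a multi-line string, so a token-based check would be needed to rule that case out as well.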

Comment on lines +305 to +315
line_idx = start_line - 2  # 0-indexed, one line before
while line_idx >= 0:
    line = lines[line_idx].strip()
    if line.startswith("@"):
        adjusted_start = line_idx + 1  # 1-indexed
        line_idx -= 1
    elif not line:
        # Skip blank lines between decorators
        line_idx -= 1
    else:
        break

Copilot AI Mar 25, 2026

When expanding a boundary upward to include preceding decorators, using lines[line_idx].strip().startswith("@") ignores indentation. That means an indented line like @decorator (e.g., inside a previous cell/function, or in malformed code) can be incorrectly pulled into the next cell’s start, corrupting cell slicing. To keep boundaries stable, only treat decorators at column 0 (e.g., lines[line_idx].startswith("@")) or match the boundary line’s indentation explicitly.
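
The column-0 variant suggested here might be written as in the sketch below. `expand_to_decorators` is a hypothetical standalone helper that mirrors the loop under review, with the only change being that the unstripped line is tested for a leading "@".

```python
def expand_to_decorators(lines: list[str], start_line: int) -> int:
    """Return the 1-indexed line where the decorator stack begins."""
    adjusted_start = start_line
    line_idx = start_line - 2  # 0-indexed line just above the boundary
    while line_idx >= 0:
        raw = lines[line_idx]
        if raw.startswith("@"):
            # Only decorators at column 0 extend the cell upward.
            adjusted_start = line_idx + 1
        elif raw.strip():
            # Any other non-blank line, including an indented
            # decorator, ends the stack.
            break
        line_idx -= 1
    return adjusted_start
```

Blank lines are still skipped, so a decorator separated from the boundary by an empty line is picked up as before.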

Comment on lines +528 to +532
nodes, scanner_lines = _scan_parse_fallback(
    self.extractor.contents or "", self.filepath
)
self._scanner_generated_lines = scanner_lines
return PeekStack(iter(nodes))

Copilot AI Mar 25, 2026

Parser.node_stack() claims the scanner fallback will “Never re-raise”, but _scan_parse_fallback() / scan_parse_fallback() can still raise SyntaxError (e.g., if there’s a syntax error in the preamble before the first cell boundary, it is explicitly re-raised as “fatal”). This can still break --watch/IPC, which the comment (and PR description) suggests should not happen. Either catch SyntaxError around _scan_parse_fallback() here and return a best-effort empty node list (or an Unparsable marker), or adjust the comment/behavior so the contract is accurate.

Suggested change
-nodes, scanner_lines = _scan_parse_fallback(
-    self.extractor.contents or "", self.filepath
-)
-self._scanner_generated_lines = scanner_lines
-return PeekStack(iter(nodes))
+try:
+    nodes, scanner_lines = _scan_parse_fallback(
+        self.extractor.contents or "", self.filepath
+    )
+    self._scanner_generated_lines = scanner_lines
+    return PeekStack(iter(nodes))
+except SyntaxError:
+    # If the scanner itself encounters a fatal syntax error (e.g.,
+    # in the preamble before any cell boundary), fall back to an
+    # empty node list rather than propagating the exception.
+    self._scanner_generated_lines = frozenset()
+    return PeekStack(iter(()))

else:
    self._reset()

# if __name__ == "__main__":
Contributor

delete?

Contributor

oh lol nvm. i see what it is

mscolnick previously approved these changes Mar 25, 2026
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

mscolnick previously approved these changes Mar 26, 2026
mscolnick previously approved these changes Mar 30, 2026
Use a token scanner to recover individual cells when the notebook has
syntax errors, so --watch and IPC are never broken by a syntax error.
Unparsable cells are flagged with a lint violation.
@dmadisetti dmadisetti marked this pull request as ready for review March 30, 2026 19:28
@dmadisetti dmadisetti merged commit 9565d52 into main Mar 30, 2026
41 of 43 checks passed
@dmadisetti dmadisetti deleted the dm/parse branch March 30, 2026 19:49
@github-actions

🚀 Development release published. You may be able to view the changes at https://marimo.app?v=0.21.2-dev91

Labels

bug Something isn't working merge when ready

3 participants