
Fix _code_mode cell ID collisions on large notebooks #8951

Merged
mscolnick merged 2 commits into main from push-lyzmszsrltzw
Mar 31, 2026

Conversation

@manzt
Collaborator

@manzt manzt commented Mar 31, 2026

CellIdGenerator uses a deterministic seed (42) for reproducible cell IDs, but the code_mode context was using the same seed when generating new cells. This meant it replayed the exact same ID sequence that created the notebook's existing cells, causing 100% collision rates on notebooks with ~50+ cells and eventually hitting the 100-attempt limit.

The fix gives CellIdGenerator a configurable seed parameter (defaulting to 42 for backwards compatibility) and uses a different seed (7) in the code_mode context. The seen_ids set is now populated from document.cell_ids rather than kernel.graph.cells.keys(), since the graph may not contain all cells (e.g. cells that exist on disk but were never executed).
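
The failure mode described above can be reproduced with a minimal sketch. `SketchIdGenerator` is a hypothetical stand-in, not marimo's actual `CellIdGenerator`; the 4-letter ID format and the method and attribute names are assumptions modelled on the description. It shows that a second generator with the same seed replays IDs the notebook already holds and runs out of attempts, while a generator with a different seed finds a free ID.

```python
import random
import string
from typing import Optional, Set

MAX_ATTEMPTS = 100  # attempt cap mentioned in the PR description


class SketchIdGenerator:
    """Hypothetical stand-in for a seeded cell ID generator."""

    def __init__(self, seed: Optional[int] = 42) -> None:
        self._rng = random.Random(seed)
        self.seen_ids: Set[str] = set()

    def create_cell_id(self) -> str:
        # Draw candidates until one is unseen, giving up after the cap.
        for _ in range(MAX_ATTEMPTS):
            candidate = "".join(self._rng.choices(string.ascii_letters, k=4))
            if candidate not in self.seen_ids:
                self.seen_ids.add(candidate)
                return candidate
        raise ValueError(f"no unique ID after {MAX_ATTEMPTS} attempts")


# A notebook whose 120 cells were created with the default seed.
notebook = SketchIdGenerator(seed=42)
existing_ids = [notebook.create_cell_id() for _ in range(120)]

# Same seed: every draw replays an ID the notebook already holds.
replaying = SketchIdGenerator(seed=42)
replaying.seen_ids = set(existing_ids)
try:
    replaying.create_cell_id()
    replay_failed = False
except ValueError:
    replay_failed = True

# Different seed: a different sequence, so a free ID is found.
fresh = SketchIdGenerator(seed=7)
fresh.seen_ids = set(existing_ids)
new_id = fresh.create_cell_id()
```

In this sketch the same-seed generator only fails once the notebook holds at least as many IDs as the attempt cap; the exact threshold in marimo depends on details not shown in this PR.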

Copilot AI review requested due to automatic review settings March 31, 2026 18:39
@manzt manzt requested a review from dmadisetti as a code owner March 31, 2026 18:39

Contributor

Copilot AI left a comment

Pull request overview

This PR addresses deterministic cell ID collisions in code_mode by making CellIdGenerator seed configurable and ensuring the code_mode context avoids reusing the same ID sequence as existing notebooks.

Changes:

  • Add a configurable seed parameter to CellIdGenerator (defaulting to 42 for compatibility).
  • Initialize AsyncCodeModeContext’s ID generator with a different seed and seed seen_ids from document.cell_ids to avoid collisions with document-only cells.
  • Update snapshots and add a regression test covering document-only cell IDs.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
marimo/_ast/cell_id.py Adds seed parameter to make deterministic ID generation configurable.
marimo/_code_mode/_context.py Uses a non-default seed in code_mode and seeds seen_ids from the document to avoid collisions.
tests/_code_mode/test_context.py Updates expected deterministic IDs and adds a regression test for document-only ID collisions.

Comment on lines +253 to +257
# ID generator for new cells — use a different seed than the
# default (42) so we don't replay the same ID sequence that
# created the notebook's existing cells. Also seed seen_ids
# with graph + document IDs to avoid collisions.
self._id_generator = CellIdGenerator(seed=7)

Copilot AI Mar 31, 2026

CellIdGenerator(seed=7) is still deterministic and replays the same ID sequence every time cm.get_context() constructs a new AsyncCodeModeContext. Once a notebook accumulates >100 IDs from this seed (e.g. after many separate code_mode requests), create_cell_id() will hit the 100-attempt cap and raise again. Consider using a per-notebook/per-document varying seed (e.g. derived from document.version / document.cell_ids), persisting the generator across contexts, or switching code_mode IDs to a UUID-based prefix so the generator doesn’t have to “skip past” previously created IDs each time.

Suggested change
- # ID generator for new cells — use a different seed than the
- # default (42) so we don't replay the same ID sequence that
- # created the notebook's existing cells. Also seed seen_ids
- # with graph + document IDs to avoid collisions.
- self._id_generator = CellIdGenerator(seed=7)
+ # ID generator for new cells — derive a seed from the current
+ # document so we don't replay the same ID sequence across
+ # notebooks or document states. Also seed seen_ids with the
+ # document's existing IDs to avoid collisions.
+ seed = hash(("code_mode", tuple(sorted(document.cell_ids)))) & 0xFFFFFFFF
+ if not seed:
+     # Extremely unlikely fallback to a non-zero constant.
+     seed = 7
+ self._id_generator = CellIdGenerator(seed=seed)
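
One caveat worth checking before adopting a `hash()`-derived seed: Python salts `str` hashes per process unless `PYTHONHASHSEED` is set, so the value computed above would generally differ between interpreter runs. If a seed that is stable for a given set of cell IDs were wanted, a digest-based derivation is one option. The helper `derive_seed` and its `namespace` parameter are hypothetical and not part of this PR, and the cell IDs are made-up examples.

```python
import hashlib
from typing import Iterable


def derive_seed(cell_ids: Iterable[str], namespace: str = "code_mode") -> int:
    """Hypothetical helper: map a set of cell IDs to a stable 32-bit seed."""
    digest = hashlib.sha256()
    digest.update(namespace.encode("utf-8"))
    for cell_id in sorted(cell_ids):
        # Length-prefix each ID so ("ab", "c") and ("a", "bc") hash differently.
        encoded = cell_id.encode("utf-8")
        digest.update(len(encoded).to_bytes(4, "big"))
        digest.update(encoded)
    # Keep the first 4 bytes to stay in the same 32-bit range as the mask above.
    return int.from_bytes(digest.digest()[:4], "big")


seed_a = derive_seed(["Hbol", "MJUe", "vblA"])
seed_b = derive_seed(["vblA", "Hbol", "MJUe"])  # same IDs, different order
seed_c = derive_seed(["Hbol", "MJUe"])          # one cell fewer
```

Any seeded approach, stable or not, still relies on `seen_ids` to rule out collisions with existing cells.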

Comment on lines +698 to +702
"""create_cell must not generate IDs that collide with cells
that exist in the document but not in the kernel graph (B4)."""
# Build a large set of document-only cells whose IDs come from
# the same deterministic generator used by AsyncCodeModeContext.
from marimo._ast.cell_id import CellIdGenerator

Copilot AI Mar 31, 2026

The new test’s docstring/comments say the doc-only IDs come from “the same deterministic generator used by AsyncCodeModeContext”, but AsyncCodeModeContext now uses CellIdGenerator(seed=7) while this test uses the default seed (42). Please update the wording (and/or pass seed=42 explicitly) to reflect that these IDs represent the notebook’s existing (default-seeded) IDs that code_mode must avoid colliding with. Also, the inline comment about expecting a ValueError is a bit misleading since the assertion is about not reusing an existing document ID.

@mscolnick mscolnick added the bug Something isn't working label Mar 31, 2026
@manzt manzt force-pushed the push-lyzmszsrltzw branch from d1108a6 to 38a3900 Compare March 31, 2026 20:17
manzt and others added 2 commits March 31, 2026 16:17
CellIdGenerator uses a deterministic seed (42) for reproducible cell
IDs, but the code_mode context was using the same seed when generating
new cells. This meant it replayed the exact same ID sequence that
created the notebook's existing cells, causing 100% collision rates on
notebooks with ~50+ cells and eventually hitting the 100-attempt limit.

The fix gives CellIdGenerator a configurable `seed` parameter
(defaulting to 42 for backwards compatibility) and uses a different seed
(7) in the code_mode context. The seen_ids set is now populated from
`document.cell_ids` rather than `kernel.graph.cells.keys()`, since the
graph may not contain all cells (e.g. cells that exist on disk but were
never executed).

The previous fix (seed=7) would still replay the same ID sequence across
sessions, eventually exhausting the 100-attempt collision limit on
notebooks where many cells were created via code_mode. Switching to
`seed=None` (OS entropy) eliminates sequence replay entirely while
`seen_ids` from the document still prevents collisions with existing
cells.
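
The difference between the two commits can be illustrated with a small simulation. This is a sketch under stated assumptions, not marimo's code: `create_cell_id` and `simulate_sessions` are hypothetical names, each "session" builds a new generator and adds one cell, and IDs are assumed to be 4 random letters. With a fixed seed, each session has to skip past every ID earlier sessions produced; with `seed=None`, `random.Random` seeds itself from OS entropy and each session starts from an unrelated sequence.

```python
import random
import string
from typing import Optional, Set

MAX_ATTEMPTS = 100  # attempt cap mentioned in the commit message


def create_cell_id(rng: random.Random, seen_ids: Set[str]) -> str:
    """Hypothetical ID creation: retry until an unseen 4-letter ID appears."""
    for _ in range(MAX_ATTEMPTS):
        candidate = "".join(rng.choices(string.ascii_letters, k=4))
        if candidate not in seen_ids:
            seen_ids.add(candidate)
            return candidate
    raise ValueError(f"no unique ID after {MAX_ATTEMPTS} attempts")


def simulate_sessions(seed: Optional[int], sessions: int) -> int:
    """Create one cell per session; return how many sessions succeeded."""
    document_ids: Set[str] = set()
    for created in range(sessions):
        rng = random.Random(seed)  # a new generator per session
        try:
            create_cell_id(rng, document_ids)
        except ValueError:
            return created
    return sessions


# Fixed seed: the replayed prefix grows by one ID per session until the cap.
fixed_seed_sessions = simulate_sessions(seed=7, sessions=200)

# OS entropy: no shared sequence, so the cap is not reached in practice.
entropy_sessions = simulate_sessions(seed=None, sessions=200)
```

In this model the fixed-seed run stops after at most 100 successful sessions, while the unseeded run completes all 200.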
Contributor

@kirangadhave kirangadhave left a comment

🚀 thanks for the quick fix!

I'm curious why we are setting seeds outside the test setup in notebook.py and cell_manager.py.

  self.prefix = prefix
  self.unparsable = False
- self._cell_id_generator = CellIdGenerator(prefix)
+ self._cell_id_generator = CellIdGenerator(prefix, seed=42)
Contributor

do we need the seed outside the test setup?

  NotebookV1: The notebook v1.
  """
- cell_id_generator = CellIdGenerator()
+ cell_id_generator = CellIdGenerator(seed=42)
Contributor

do we need this to be deterministic?

Collaborator Author

Good question. I was just trying to preserve the previous behavior... but you are probably right that we only need a seed for fixtures.

We could make a follow-up PR with a ContextVar that we set only in tests, and probably remove the seed param entirely.
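
A rough sketch of what such a follow-up might look like, assuming a module-level `ContextVar` that defaults to `None` (OS entropy) and is overridden only in tests. The names `_cell_id_seed`, `make_rng`, and `first_draw` are hypothetical and do not come from marimo.

```python
import random
from contextvars import ContextVar
from typing import Optional

# Hypothetical: None means "seed from OS entropy"; tests override it.
_cell_id_seed: ContextVar[Optional[int]] = ContextVar("cell_id_seed", default=None)


def make_rng() -> random.Random:
    """Build the RNG a generator would use, honouring the context seed."""
    return random.Random(_cell_id_seed.get())


def first_draw() -> float:
    """Return the first value from a freshly built RNG."""
    return make_rng().random()


# Test path: pin the seed for the duration of a test, then restore it.
token = _cell_id_seed.set(42)
try:
    pinned_a = first_draw()
    pinned_b = first_draw()  # same seed, so the same first value
finally:
    _cell_id_seed.reset(token)

# Outside the pinned block the variable is back to its default.
restored = _cell_id_seed.get()
```

In a test suite the set/reset pair would typically live in a fixture, so production call sites never mention a seed.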

@mscolnick mscolnick merged commit 010cac0 into main Mar 31, 2026
35 of 43 checks passed
@mscolnick mscolnick deleted the push-lyzmszsrltzw branch March 31, 2026 20:31
@github-actions

🚀 Development release published. You may be able to view the changes at https://marimo.app?v=0.21.2-dev104
