Fix _code_mode cell ID collisions on large notebooks#8951
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
There was a problem hiding this comment.
Pull request overview
This PR addresses deterministic cell ID collisions in code_mode by making CellIdGenerator seed configurable and ensuring the code_mode context avoids reusing the same ID sequence as existing notebooks.
Changes:
- Add a configurable
seedparameter toCellIdGenerator(defaulting to42for compatibility). - Initialize
AsyncCodeModeContext’s ID generator with a different seed and seedseen_idsfromdocument.cell_idsto avoid collisions with document-only cells. - Update snapshots and add a regression test covering document-only cell IDs.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
marimo/_ast/cell_id.py |
Adds seed parameter to make deterministic ID generation configurable. |
marimo/_code_mode/_context.py |
Uses a non-default seed in code_mode and seeds seen_ids from the document to avoid collisions. |
tests/_code_mode/test_context.py |
Updates expected deterministic IDs and adds a regression test for document-only ID collisions. |
marimo/_code_mode/_context.py
Outdated
| # ID generator for new cells — use a different seed than the | ||
| # default (42) so we don't replay the same ID sequence that | ||
| # created the notebook's existing cells. Also seed seen_ids | ||
| # with graph + document IDs to avoid collisions. | ||
| self._id_generator = CellIdGenerator(seed=7) |
There was a problem hiding this comment.
CellIdGenerator(seed=7) is still deterministic and replays the same ID sequence every time cm.get_context() constructs a new AsyncCodeModeContext. Once a notebook accumulates >100 IDs from this seed (e.g. after many separate code_mode requests), create_cell_id() will hit the 100-attempt cap and raise again. Consider using a per-notebook/per-document varying seed (e.g. derived from document.version / document.cell_ids), persisting the generator across contexts, or switching code_mode IDs to a UUID-based prefix so the generator doesn’t have to “skip past” previously created IDs each time.
| # ID generator for new cells — use a different seed than the | |
| # default (42) so we don't replay the same ID sequence that | |
| # created the notebook's existing cells. Also seed seen_ids | |
| # with graph + document IDs to avoid collisions. | |
| self._id_generator = CellIdGenerator(seed=7) | |
| # ID generator for new cells — derive a seed from the current | |
| # document so we don't replay the same ID sequence across | |
| # notebooks or document states. Also seed seen_ids with the | |
| # document's existing IDs to avoid collisions. | |
| seed = hash(("code_mode", tuple(sorted(document.cell_ids)))) & 0xFFFFFFFF | |
| if not seed: | |
| # Extremely unlikely fallback to a non-zero constant. | |
| seed = 7 | |
| self._id_generator = CellIdGenerator(seed=seed) |
tests/_code_mode/test_context.py
Outdated
| """create_cell must not generate IDs that collide with cells | ||
| that exist in the document but not in the kernel graph (B4).""" | ||
| # Build a large set of document-only cells whose IDs come from | ||
| # the same deterministic generator used by AsyncCodeModeContext. | ||
| from marimo._ast.cell_id import CellIdGenerator |
There was a problem hiding this comment.
The new test’s docstring/comments say the doc-only IDs come from “the same deterministic generator used by AsyncCodeModeContext”, but AsyncCodeModeContext now uses CellIdGenerator(seed=7) while this test uses the default seed (42). Please update the wording (and/or pass seed=42 explicitly) to reflect that these IDs represent the notebook’s existing (default-seeded) IDs that code_mode must avoid colliding with. Also, the inline comment about expecting a ValueError is a bit misleading since the assertion is about not reusing an existing document ID.
d1108a6 to
38a3900
Compare
CellIdGenerator uses a deterministic seed (42) for reproducible cell IDs, but the code_mode context was using the same seed when generating new cells. This meant it replayed the exact same ID sequence that created the notebook's existing cells, causing 100% collision rates on notebooks with ~50+ cells and eventually hitting the 100-attempt limit. The fix gives CellIdGenerator a configurable `seed` parameter (defaulting to 42 for backwards compatibility) and uses a different seed (7) in the code_mode context. The seen_ids set is now populated from `document.cell_ids` rather than `kernel.graph.cells.keys()`, since the graph may not contain all cells (e.g. cells that exist on disk but were never executed).
The previous fix (seed=7) would still replay the same ID sequence across sessions, eventually exhausting the 100-attempt collision limit on notebooks where many cells were created via code_mode. Switching to `seed=None` (OS entropy) eliminates sequence replay entirely while `seen_ids` from the document still prevents collisions with existing cells.
38a3900 to
a11b279
Compare
kirangadhave
left a comment
There was a problem hiding this comment.
🚀 thanks for the quick fix!
I'm curious why we are setting seeds outside the test setup in notebook.py and cell_manager.py
| self.prefix = prefix | ||
| self.unparsable = False | ||
| self._cell_id_generator = CellIdGenerator(prefix) | ||
| self._cell_id_generator = CellIdGenerator(prefix, seed=42) |
There was a problem hiding this comment.
do we need the seed outside the test setup?
| NotebookV1: The notebook v1. | ||
| """ | ||
| cell_id_generator = CellIdGenerator() | ||
| cell_id_generator = CellIdGenerator(seed=42) |
There was a problem hiding this comment.
do we need this to be deterministic?
There was a problem hiding this comment.
Good question. I was just trying to preserve the previous behavior... but you are probably right that we only need a seed for fixtures.
We could make a follow up PR with some ContextVar that we just set in tests and probably remove the seed param entirely.
|
🚀 Development release published. You may be able to view the changes at https://marimo.app?v=0.21.2-dev104 |
CellIdGenerator uses a deterministic seed (42) for reproducible cell IDs, but the code_mode context was using the same seed when generating new cells. This meant it replayed the exact same ID sequence that created the notebook's existing cells, causing 100% collision rates on notebooks with ~50+ cells and eventually hitting the 100-attempt limit.
The fix gives CellIdGenerator a configurable
seedparameter (defaulting to 42 for backwards compatibility) and uses a different seed (7) in the code_mode context. The seen_ids set is now populated fromdocument.cell_idsrather thankernel.graph.cells.keys(), since the graph may not contain all cells (e.g. cells that exist on disk but were never executed).