Skip to content

fix(runtimed): resolve duplicate rooms after saving untitled notebook#987

Merged
rgbkrk merged 2 commits intomainfrom
quill/fix-942-bug
Mar 20, 2026
Merged

fix(runtimed): resolve duplicate rooms after saving untitled notebook#987
rgbkrk merged 2 commits intomainfrom
quill/fix-942-bug

Conversation

@rgbkrk
Copy link
Member

@rgbkrk rgbkrk commented Mar 20, 2026

Summary

  • When rekey_ephemeral_room found a different room already at the canonical path (created by a racing get_or_create_room call), it silently returned None — leaving the UUID-keyed room orphaned in the rooms map. This produced two rooms for the same file path, causing edits from Python clients to be invisible in the UI.
  • Now rekey_ephemeral_room evicts the interloper room (which is empty/just-created) and proceeds with the normal rekey, so only one room exists for a given path.
  • Adds an Arc::ptr_eq guard to the eviction task to prevent a stale eviction timer from accidentally removing a rekeyed room that now lives at the same key.

Closes #942

Verification

  • Create an untitled notebook, add some cells/content, then save it to a file path. Run runt notebooks and confirm only one room exists for that path.
  • Open the saved notebook from a Python MCP client — edits should be visible in the UI and vice versa.
  • Stress test: rapidly save an untitled notebook while simultaneously opening the same path from another connection. Confirm no duplicate rooms appear.

PR submitted by @rgbkrk's agent, Quill

When `rekey_ephemeral_room` found a different room already at the
canonical path (created by a racing `get_or_create_room`), it silently
returned None — leaving the UUID-keyed room orphaned. Now it evicts
the interloper and proceeds with the rekey.

Also adds an `Arc::ptr_eq` guard to the eviction task so a stale timer
from an evicted room cannot accidentally remove a rekeyed room at the
same key.

Closes #942
@github-actions github-actions bot added sync Automerge CRDT sync protocol daemon runtimed daemon, kernel management, sync server quill PR authored by Quill Agent 🦆 labels Mar 20, 2026
@rgbkrk rgbkrk requested a review from Copilot March 20, 2026 06:56
Copy link
Member Author

@rgbkrk rgbkrk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixes a real race condition — the Arc::ptr_eq guard on eviction is the key safety addition. Without it, a stale eviction timer could remove a rekeyed room that happens to be at the same key.

The interloper eviction logic is sound: the ephemeral room has the user's content (cells, outputs, kernel), while the interloper is an empty/just-created room from a racing get_or_create_room. Removing the interloper and proceeding with rekey is the right call.

One concern on the interloper cleanup:

The tokio::spawn for interloper shutdown runs detached. If the interloper had a kernel (unlikely but possible if a second client connected and triggered auto-launch during the race window), the shutdown runs asynchronously. The interloper's watcher_shutdown_tx and kernel shutdown are fire-and-forget — if they fail, the resources leak silently. This is acceptable for a rare race condition, but worth a warn! if the kernel shutdown fails:

if let Err(e) = kernel.shutdown().await {
    warn!("[notebook-sync] Failed to shut down interloper kernel: {}", e);
}

Also worth noting: the interloper's persist_tx and changed_tx channels will be dropped when the Arc is dropped, which is correct — the debounced persistence task will exit when the watch channel closes.

The Arc::ptr_eq guard on eviction is the more important fix here and is clean. Good PR.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a race in runtimed’s notebook room re-keying (UUID → canonical file path) that could leave two rooms mapped to the same notebook path, causing clients (notably Python) to sync against a different room than the UI.

Changes:

  • Adds an Arc::ptr_eq guard in the idle-eviction task to avoid removing a room that has since been replaced under the same key.
  • Updates rekey_ephemeral_room to evict an “interloper” room already present at the canonical path and proceed with the re-key, cleaning up interloper resources asynchronously.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 2423 to +2434
warn!(
"[notebook-sync] Re-key skipped: canonical path {} already has a different room",
"[notebook-sync] Re-key collision: evicting interloper room at {}",
canonical
);
return None;
// Remove the interloper — the ephemeral room has the real content.
let interloper = rooms_guard.remove(&canonical);
// Clean up interloper resources in background
if let Some(interloper) = interloper {
tokio::spawn(async move {
if let Some(mut kernel) = interloper.kernel.lock().await.take() {
let _ = kernel.shutdown().await;
}
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In rekey_ephemeral_room, the collision path unconditionally removes the room currently keyed by canonical and then shuts down its kernel/watcher. This is unsafe if that “interloper” already has (or is about to have) an in-flight/connected peer that obtained an Arc from get_or_create_room (active_peers may still be 0 at this moment). In that case, removing it from the map and shutting down resources can orphan/disrupt that other connection. Consider only evicting when you can prove the interloper is truly unused (e.g., interloper.active_peers == 0 AND Arc::strong_count(&interloper) == 1, and ideally no running kernel), otherwise keep the prior behavior (skip rekey) or implement a safer merge/redirect strategy.

Copilot uses AI. Check for mistakes.
Comment on lines 2418 to 2422
if rooms_guard.contains_key(&canonical) {
let existing = rooms_guard.get(&canonical);
let is_same_room = existing
.is_some_and(|r| Arc::ptr_eq(r, rooms_guard.get(old_notebook_id).unwrap_or(r)));
if !is_same_room {
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The is_same_room calculation is hard to reason about and can produce surprising results because it falls back to comparing Arc::ptr_eq(r, r) when old_notebook_id is missing from the map (unwrap_or(r)). That masks the “old room not found” condition and makes the collision logic fragile. A clearer approach is to explicitly look up the old room Arc first (or compare existing.as_deref() with the room: &NotebookRoom parameter via std::ptr::eq) and only treat the rooms as the same when that real comparison succeeds.

Copilot uses AI. Check for mistakes.
Comment on lines 2423 to +2428
warn!(
"[notebook-sync] Re-key skipped: canonical path {} already has a different room",
"[notebook-sync] Re-key collision: evicting interloper room at {}",
canonical
);
return None;
// Remove the interloper — the ephemeral room has the real content.
let interloper = rooms_guard.remove(&canonical);
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new “collision → evict interloper” behavior changes an important concurrency edge case and the file already has a sizeable test module, but there are no tests covering rekey_ephemeral_room. Please add a test that sets up (1) a UUID-keyed room with content and (2) a concurrently-created canonical-path room, then asserts the post-rekey map contains exactly one room for the canonical path and that the correct room is retained (and that eviction is skipped when the canonical room is not safe to evict).

Copilot uses AI. Check for mistakes.
@rgbkrk rgbkrk merged commit f98d384 into main Mar 20, 2026
19 checks passed
@rgbkrk rgbkrk deleted the quill/fix-942-bug branch March 20, 2026 11:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

daemon runtimed daemon, kernel management, sync server quill PR authored by Quill Agent 🦆 sync Automerge CRDT sync protocol

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(daemon): duplicate rooms for same notebook path after save

2 participants