Recover fast after a node freeze instead of waiting on the announce cadence #19

Merged
mrjeeves merged 1 commit into main from claude/determined-einstein-gEcXX on May 29, 2026

@mrjeeves
Owner

Problem

When a node's process is paused long enough (OS suspend, container freeze, or a host model-load memory thrash that starves it; in the MyOwnLLM scenario, loading a model freezes the box, which also freezes the myownmesh serve daemon), it sends no traffic, so its peers tear it down after the HEARTBEAT_TIMEOUT_MS + WAKE_DETECTION_THRESHOLD_MS (~90 s) silence grace and eventually prune it.

On resume the connection came back very slowly (often minutes), because:

  1. No re-advertise on wake. on_wake fired a Tier-2 probe (ping known peers, escalate silent ones to re-handshake), but re-handshake sends Hello over the now-dead WebRTC data channel, and on_wake never emitted an announce. A torn-down connection can only be rebuilt via the signaling path (PeerAnnounced → ensure_peer_session → offer/answer), so dropped neighbors didn't rediscover the node until its next steady-state announce, up to ANNOUNCE_STEADY_MS (5 min) away.
  2. Stale-session wedge on the woken node. The woken node still held its peers as Active with dead sessions; ensure_peer_session short-circuits on an existing entry and the re-offer path only fires at Sighted, so an inbound fresh offer was applied onto the stale PeerConnection — the "set_remote_description on an already-stable PC wedges WebRTC" hazard. The documented remedy (STALE_INBOUND_MS inbound-recency zombie clearing, see CONNECTION-ENGINE.md "Edge cases handled") existed only as a constant in the signaling crate; it was never wired into the engine.
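The short-circuit in item 2 can be illustrated with a toy session map. This is a minimal sketch, not the engine's code: `Engine`, `Session`, `mark_dead`, `drop_session`, and the `generation` counter are hypothetical stand-ins, and only the name `ensure_peer_session` comes from the description above.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a peer session; `generation` identifies
/// which underlying PeerConnection the session wraps.
#[derive(Debug, Clone, PartialEq)]
pub struct Session {
    pub generation: u32,
    pub alive: bool,
}

/// Hypothetical, heavily simplified engine state.
#[derive(Default)]
pub struct Engine {
    sessions: HashMap<String, Session>,
    next_generation: u32,
}

impl Engine {
    /// Creates a session only if none exists. An existing entry is
    /// returned as-is, even when its connection is already dead.
    pub fn ensure_peer_session(&mut self, peer: &str) -> &Session {
        if !self.sessions.contains_key(peer) {
            self.next_generation += 1;
            let session = Session {
                generation: self.next_generation,
                alive: true,
            };
            self.sessions.insert(peer.to_string(), session);
        }
        &self.sessions[peer]
    }

    /// Simulates the connection dying while the process was frozen.
    pub fn mark_dead(&mut self, peer: &str) {
        if let Some(session) = self.sessions.get_mut(peer) {
            session.alive = false;
        }
    }

    /// Removes the entry so the next `ensure_peer_session` rebuilds it.
    pub fn drop_session(&mut self, peer: &str) {
        self.sessions.remove(peer);
    }
}
```

In this sketch, calling `ensure_peer_session` after `mark_dead` hands back the same dead generation, which is the shape of the wedge: the fresh offer would be applied to the old connection. Only an explicit `drop_session` beforehand yields a new generation.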

Fix

  • Re-announce on wake (engine/wake.rs): on_wake now emits a fresh SignalingOutbound::Announce, so peers that dropped the node during the pause rediscover it within ~1 s via reactive reflection. One send per wake event (already coalesced by WAKE_COALESCE_MS; reflected announces are rate-limited by REACTIVE_ANNOUNCE_MIN_INTERVAL_MS), so it can't storm the relay.
  • Wire the documented STALE_INBOUND_MS zombie clearing (engine/mod.rs): in the PeerAnnounced and Offer arms of handle_signaling_inbound, if a peer we still hold has been silent past the threshold, drop the stale session before ensure_peer_session so it rebuilds a fresh PC instead of wedging. STALE_INBOUND_MS is re-exported from myownmesh-signaling so engine and signaling share one value.
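The "one send per wake event" behaviour of the first bullet can be sketched as follows. This is an illustration under stated assumptions: `WakeState` is a hypothetical type, the `SignalingOutbound` enum here is a one-variant stand-in for the real one, and the value given to `WAKE_COALESCE_MS` is made up because the PR does not state it.

```rust
use std::time::{Duration, Instant};

/// Hypothetical one-variant stand-in for the engine's outbound message type.
#[derive(Debug, PartialEq)]
pub enum SignalingOutbound {
    Announce,
}

/// Assumed value for illustration only; the real constant is not given in this PR.
pub const WAKE_COALESCE_MS: u64 = 2_000;

/// Hypothetical holder for the time of the last accepted wake event.
#[derive(Default)]
pub struct WakeState {
    last_wake: Option<Instant>,
}

impl WakeState {
    /// Returns the messages to send for a wake event observed at `now`.
    /// Wake events inside the coalesce window produce nothing, so a burst
    /// of wake detections results in a single Announce.
    pub fn on_wake(&mut self, now: Instant) -> Vec<SignalingOutbound> {
        let window = Duration::from_millis(WAKE_COALESCE_MS);
        if let Some(prev) = self.last_wake {
            if now.saturating_duration_since(prev) < window {
                return Vec::new();
            }
        }
        self.last_wake = Some(now);
        vec![SignalingOutbound::Announce]
    }
}
```

A coalesced event does not move the window forward in this sketch, so the next announce is allowed `WAKE_COALESCE_MS` after the last accepted wake rather than after the last detection. Whether the real engine makes the same choice is not stated in the PR.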

Why it's safe

The staleness gate makes the change self-protecting on the benign laptop-sleep path: in-place ICE recovery (Tier 2.5 / ice_poll) restores ping/pong and resets last_recv_at within seconds of wake, so a genuinely-recovering peer has a small gap when any announce arrives and is not torn down. Recently-active and never-received (last_recv_at == None, e.g. mid-handshake / stuck-Sighted) peers are left untouched — the latter is handled by the existing re-offer path. Rostered peers auto-re-approve on rebuild, so trusted peers reconnect without a prompt.
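The three cases above (true zombie, recently active, never received) reduce to a small predicate. The sketch below is hypothetical: `is_zombie` is an invented name, timestamps are plain milliseconds rather than whatever clock type the engine uses, and the `STALE_INBOUND_MS` value is assumed for illustration since the PR does not give it.

```rust
/// Assumed value for illustration only; the real constant is
/// re-exported from myownmesh-signaling and is not stated in this PR.
pub const STALE_INBOUND_MS: u64 = 30_000;

/// Decides whether a held peer session should be dropped before rebuilding.
///
/// `last_recv_at_ms` is the time of the last inbound traffic from the peer,
/// or `None` if nothing has ever been received (for example mid-handshake).
pub fn is_zombie(last_recv_at_ms: Option<u64>, now_ms: u64) -> bool {
    match last_recv_at_ms {
        // Never-received peers are left to the existing re-offer path.
        None => false,
        // Only peers silent past the threshold count as zombies.
        Some(last) => now_ms.saturating_sub(last) > STALE_INBOUND_MS,
    }
}
```

Under these assumptions, a peer last heard from two minutes ago is cleared, while one heard from a second ago (as after in-place ICE recovery) is kept. `saturating_sub` keeps the predicate from panicking if a timestamp is ever ahead of `now_ms`.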

Tests

  • engine::wake::tests::on_wake_emits_announce_for_rediscovery: on_wake emits an Announce.
  • engine::tests::zombie_session_cleared_on_stale_inbound / recently_active_peer_not_cleared / peer_without_inbound_not_cleared: the staleness predicate drops only true zombies.

cargo fmt --all --check, cargo clippy --workspace --all-targets -- -D warnings, and cargo test --workspace all pass (109 core lib tests incl. the new ones, plus the webrtc integration tests).

Follow-up (not in this PR)

No protocol/API change → PATCH bump 0.1.2 → 0.1.3. For MyOwnLLM to pick it up, bump the tag in its src-tauri/Cargo.toml and .myownmesh-rev to v0.1.3.

https://claude.ai/code/session_017UZ6AKBqV2ae2E6XbyoAgq


Generated by Claude Code

…he announce cadence

When a node's process is paused long enough (OS suspend, container freeze,
or a host model-load memory-thrash that starves it), its peers tear it
down after the ~90s heartbeat grace. On resume the wake detector fired a
Tier-2 probe but never re-advertised, so dropped neighbors only
rediscovered the node on its next steady-state announce — up to
ANNOUNCE_STEADY_MS (5 min) away. The woken node also kept its peers as
Active with dead sessions, so an inbound offer was applied onto a stale
PeerConnection (the WebRTC "set_remote_description on a stable PC" wedge).

- on_wake now emits a fresh SignalingOutbound::Announce so neighbors
  rediscover the node within ~1s via reactive reflection.
- Wire the documented STALE_INBOUND_MS inbound-recency zombie clearing
  into handle_signaling_inbound: an announce/offer from a peer silent
  past the threshold drops the stale session first so ensure_peer_session
  rebuilds cleanly. Recently-active and never-received peers are left
  alone, preserving in-place ICE recovery.

Adds unit tests for announce-on-wake and the staleness predicate.

https://claude.ai/code/session_017UZ6AKBqV2ae2E6XbyoAgq
@mrjeeves mrjeeves merged commit 3717c29 into main May 29, 2026
6 checks passed
@mrjeeves mrjeeves deleted the claude/determined-einstein-gEcXX branch May 29, 2026 07:41