Recover fast after a node freeze instead of waiting on the announce cadence by mrjeeves · Pull Request #19 · mrjeeves/MyOwnMesh

mrjeeves · 2026-05-29T07:30:21Z

Problem

When a node's process is paused long enough — OS suspend, container freeze, or a host model-load memory-thrash that starves it (the MyOwnLLM scenario: loading a model freezes the box, which also freezes the myownmesh serve daemon) — it sends no traffic, so its peers tear it down after the HEARTBEAT_TIMEOUT_MS + WAKE_DETECTION_THRESHOLD_MS ≈ 90 s silence grace and eventually prune it.

On resume the connection came back very slowly (often minutes), because:

- No re-advertise on wake. on_wake fired a Tier-2 probe (ping known peers, escalate silent ones to re-handshake), but re-handshake sends Hello over the now-dead WebRTC data channel, and on_wake never emitted an announce. A torn-down connection can only be rebuilt via the signaling path (PeerAnnounced → ensure_peer_session → offer/answer), so dropped neighbors didn't rediscover the node until its next steady-state announce, up to ANNOUNCE_STEADY_MS (5 min) away.
- Stale-session wedge on the woken node. The woken node still held its peers as Active with dead sessions; ensure_peer_session short-circuits on an existing entry and the re-offer path only fires at Sighted, so an inbound fresh offer was applied onto the stale PeerConnection: the "set_remote_description on an already-stable PC wedges WebRTC" hazard. The documented remedy (STALE_INBOUND_MS inbound-recency zombie clearing, see CONNECTION-ENGINE.md "Edge cases handled") existed only as a constant in the signaling crate; it was never wired into the engine.

Fix

- Re-announce on wake (engine/wake.rs): on_wake now emits a fresh SignalingOutbound::Announce, so peers that dropped the node during the pause rediscover it within ~1 s via reactive reflection. It sends one announce per wake event (wake events are already coalesced by WAKE_COALESCE_MS, and reflected announces are rate-limited by REACTIVE_ANNOUNCE_MIN_INTERVAL_MS), so it can't storm the relay.
- Wire the documented STALE_INBOUND_MS zombie clearing (engine/mod.rs): in the PeerAnnounced and Offer arms of handle_signaling_inbound, if a peer we still hold has been silent past the threshold, drop the stale session before calling ensure_peer_session, so that ensure_peer_session builds a fresh PC instead of wedging the old one. STALE_INBOUND_MS is re-exported from myownmesh-signaling so engine and signaling share one value.
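The re-announce step can be pictured with a minimal sketch. This is not the code in engine/wake.rs: `WakeState`, its fields, the millisecond timestamps, and the numeric value given to `WAKE_COALESCE_MS` are all assumptions made for illustration. It only shows the shape of the behavior described above, one announce queued per coalesced wake event.

```rust
// Hypothetical stand-in for the real outbound signaling type; only the
// variant name comes from the PR description.
#[derive(Debug, PartialEq)]
enum SignalingOutbound {
    Announce,
}

/// Illustrative value only; the real WAKE_COALESCE_MS is not stated in this PR.
const WAKE_COALESCE_MS: u64 = 2_000;

/// Hypothetical container for wake bookkeeping and queued signaling messages.
#[derive(Default)]
struct WakeState {
    last_wake_ms: Option<u64>,
    outbox: Vec<SignalingOutbound>,
}

impl WakeState {
    /// Handle a wake event observed at `now_ms`.
    /// Returns true if an announce was queued, false if the event was coalesced.
    fn on_wake(&mut self, now_ms: u64) -> bool {
        if let Some(prev) = self.last_wake_ms {
            // A wake seen shortly after the previous one is treated as the
            // same event, so it does not produce a second announce.
            if now_ms.saturating_sub(prev) < WAKE_COALESCE_MS {
                return false;
            }
        }
        self.last_wake_ms = Some(now_ms);
        // The existing Tier-2 probe would also run at this point (omitted).
        // New behavior: advertise again so peers that dropped us can rebuild
        // the connection through the signaling path.
        self.outbox.push(SignalingOutbound::Announce);
        true
    }
}
```

In this sketch, two wake events 500 ms apart queue a single announce, while a wake event several seconds later queues another.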

Why it's safe

The staleness gate makes the change self-protecting on the benign laptop-sleep path: in-place ICE recovery (Tier 2.5 / ice_poll) restores ping/pong and resets last_recv_at within seconds of wake, so by the time any announce arrives, a genuinely recovering peer has been silent for far less than STALE_INBOUND_MS and is not torn down. Recently-active peers and never-received peers (last_recv_at == None, e.g. mid-handshake or stuck at Sighted) are left untouched; the latter are handled by the existing re-offer path. Rostered peers auto-re-approve on rebuild, so trusted peers reconnect without a prompt.
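The three cases above (stale, recently active, never received) can be sketched as a small predicate plus a toy inbound handler. Everything here is an assumption for illustration: `is_zombie`, `Session`, `Engine`, `on_announce_or_offer`, the string peer ids, and the numeric value of `STALE_INBOUND_MS` are hypothetical, and the real handle_signaling_inbound is not shown in this PR.

```rust
use std::collections::HashMap;

/// Illustrative value only; the real constant lives in myownmesh-signaling.
const STALE_INBOUND_MS: u64 = 30_000;

/// True only for a peer that has received traffic before and has since been
/// silent for longer than the threshold. Peers with no inbound traffic yet
/// (None) are never treated as zombies.
fn is_zombie(last_recv_at: Option<u64>, now_ms: u64) -> bool {
    match last_recv_at {
        None => false,
        Some(t) => now_ms.saturating_sub(t) > STALE_INBOUND_MS,
    }
}

/// Hypothetical per-peer session record.
struct Session {
    last_recv_at: Option<u64>,
}

/// Hypothetical engine holding sessions keyed by peer id.
#[derive(Default)]
struct Engine {
    sessions: HashMap<String, Session>,
    rebuilt: u32,
}

impl Engine {
    /// Mirrors the short-circuit described above: an existing entry is reused.
    fn ensure_peer_session(&mut self, peer: &str) {
        if self.sessions.contains_key(peer) {
            return;
        }
        self.sessions
            .insert(peer.to_string(), Session { last_recv_at: None });
        self.rebuilt += 1;
    }

    /// Toy version of the PeerAnnounced / Offer arms: drop a zombie session
    /// first, so ensure_peer_session creates a fresh one instead of reusing it.
    fn on_announce_or_offer(&mut self, peer: &str, now_ms: u64) {
        let stale = self
            .sessions
            .get(peer)
            .is_some_and(|s| is_zombie(s.last_recv_at, now_ms));
        if stale {
            self.sessions.remove(peer);
        }
        self.ensure_peer_session(peer);
    }
}
```

With these assumed values, a peer last heard from 100 s ago is rebuilt when its announce arrives, while a peer heard from 1 s ago keeps its existing session and in-place ICE recovery is left to finish.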

Tests

engine::wake::tests::on_wake_emits_announce_for_rediscovery — on_wake emits an Announce.
engine::tests::zombie_session_cleared_on_stale_inbound / recently_active_peer_not_cleared / peer_without_inbound_not_cleared — the staleness predicate drops only true zombies.

cargo fmt --all --check, cargo clippy --workspace --all-targets -- -D warnings, and cargo test --workspace all pass (109 core lib tests incl. the new ones, plus the webrtc integration tests).

Follow-up (not in this PR)

No protocol/API change → PATCH bump 0.1.2 → 0.1.3. For MyOwnLLM to pick it up, bump the tag in its src-tauri/Cargo.toml and .myownmesh-rev to v0.1.3.

https://claude.ai/code/session_017UZ6AKBqV2ae2E6XbyoAgq

Generated by Claude Code

…he announce cadence

When a node's process is paused long enough (OS suspend, container freeze, or a host model-load memory-thrash that starves it), its peers tear it down after the ~90s heartbeat grace. On resume the wake detector fired a Tier-2 probe but never re-advertised, so dropped neighbors only rediscovered the node on its next steady-state announce, up to ANNOUNCE_STEADY_MS (5 min) away. The woken node also kept its peers as Active with dead sessions, so an inbound offer was applied onto a stale PeerConnection (the WebRTC "set_remote_description on a stable PC" wedge).

- on_wake now emits a fresh SignalingOutbound::Announce so neighbors rediscover the node within ~1s via reactive reflection.
- Wire the documented STALE_INBOUND_MS inbound-recency zombie clearing into handle_signaling_inbound: an announce/offer from a peer silent past the threshold drops the stale session first so ensure_peer_session rebuilds cleanly. Recently-active and never-received peers are left alone, preserving in-place ICE recovery.

Adds unit tests for announce-on-wake and the staleness predicate.

https://claude.ai/code/session_017UZ6AKBqV2ae2E6XbyoAgq

mrjeeves merged commit 3717c29 into main May 29, 2026
6 checks passed

mrjeeves deleted the claude/determined-einstein-gEcXX branch May 29, 2026 07:41

mrjeeves mentioned this pull request May 29, 2026

Adopt MyOwnMesh v0.1.3 (fast recovery after a node freeze) mrjeeves/MyOwnLLM#211

Merged
