Recover fast after a node freeze instead of waiting on the announce cadence #19
Merged
…he announce cadence

When a node's process is paused long enough (OS suspend, container freeze, or a host model-load memory-thrash that starves it), its peers tear it down after the ~90s heartbeat grace. On resume the wake detector fired a Tier-2 probe but never re-advertised, so dropped neighbors only rediscovered the node on its next steady-state announce, up to ANNOUNCE_STEADY_MS (5 min) away. The woken node also kept its peers as Active with dead sessions, so an inbound offer was applied onto a stale PeerConnection (the WebRTC "set_remote_description on a stable PC" wedge).

- on_wake now emits a fresh SignalingOutbound::Announce so neighbors rediscover the node within ~1s via reactive reflection.
- Wire the documented STALE_INBOUND_MS inbound-recency zombie clearing into handle_signaling_inbound: an announce/offer from a peer silent past the threshold drops the stale session first so ensure_peer_session rebuilds cleanly. Recently-active and never-received peers are left alone, preserving in-place ICE recovery.

Adds unit tests for announce-on-wake and the staleness predicate.

https://claude.ai/code/session_017UZ6AKBqV2ae2E6XbyoAgq
Problem
When a node's process is paused long enough (OS suspend, container freeze, or a host model-load memory-thrash that starves it; the MyOwnLLM scenario: loading a model freezes the box, which also freezes the `myownmesh serve` daemon), it sends no traffic, so its peers tear it down after the `HEARTBEAT_TIMEOUT_MS + WAKE_DETECTION_THRESHOLD_MS` ≈ 90 s silence grace and eventually prune it.

On resume the connection came back very slowly (often minutes), because:

- `on_wake` fired a Tier-2 probe (ping known peers, escalate silent ones to re-handshake), but re-handshake sends `Hello` over the now-dead WebRTC data channel, and `on_wake` never emitted an announce. A torn-down connection can only be rebuilt via the signaling path (`PeerAnnounced` → `ensure_peer_session` → offer/answer), so dropped neighbors didn't rediscover the node until its next steady-state announce, up to `ANNOUNCE_STEADY_MS` (5 min) away.
- The woken node also kept its peers as `Active` with dead sessions; `ensure_peer_session` short-circuits on an existing entry and the re-offer path only fires at `Sighted`, so an inbound fresh offer was applied onto the stale `PeerConnection`: the "set_remote_description on an already-stable PC wedges WebRTC" hazard. The documented remedy (`STALE_INBOUND_MS` inbound-recency zombie clearing, see `CONNECTION-ENGINE.md` "Edge cases handled") existed only as a constant in the signaling crate; it was never wired into the engine.

Fix
- `engine/wake.rs`: `on_wake` now emits a fresh `SignalingOutbound::Announce`, so peers that dropped the node during the pause rediscover it within ~1 s via reactive reflection. One send per wake event (already coalesced by `WAKE_COALESCE_MS`; reflected announces are rate-limited by `REACTIVE_ANNOUNCE_MIN_INTERVAL_MS`), so it can't storm the relay.
- `STALE_INBOUND_MS` zombie clearing (`engine/mod.rs`): in the `PeerAnnounced` and `Offer` arms of `handle_signaling_inbound`, if a peer we still hold has been silent past the threshold, drop the stale session before `ensure_peer_session` so it rebuilds a fresh PC instead of wedging. `STALE_INBOUND_MS` is re-exported from `myownmesh-signaling` so engine and signaling share one value.

Why it's safe
The staleness gate makes the change self-protecting on the benign laptop-sleep path: in-place ICE recovery (Tier 2.5 / `ice_poll`) restores ping/pong and resets `last_recv_at` within seconds of wake, so a genuinely recovering peer has only a small gap when any announce arrives and is not torn down. Recently-active and never-received (`last_recv_at == None`, e.g. mid-handshake / stuck-`Sighted`) peers are left untouched; the latter is handled by the existing re-offer path. Rostered peers auto-re-approve on rebuild, so trusted peers reconnect without a prompt.

Tests
- `engine::wake::tests::on_wake_emits_announce_for_rediscovery`: `on_wake` emits an `Announce`.
- `engine::tests::zombie_session_cleared_on_stale_inbound` / `recently_active_peer_not_cleared` / `peer_without_inbound_not_cleared`: the staleness predicate drops only true zombies.
- `cargo fmt --all --check`, `cargo clippy --workspace --all-targets -- -D warnings`, and `cargo test --workspace` all pass (109 core lib tests incl. the new ones, plus the webrtc integration tests).

Follow-up (not in this PR)
No protocol/API change → PATCH bump `0.1.2 → 0.1.3`. For MyOwnLLM to pick it up, bump the `tag` in its `src-tauri/Cargo.toml` and `.myownmesh-rev` to `v0.1.3`.

https://claude.ai/code/session_017UZ6AKBqV2ae2E6XbyoAgq
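For readers who want the staleness gate in concrete form, here is a minimal sketch of the rule described above: clear a held session only when the peer has been heard from before and has since been silent past the threshold. This is not the engine's code. `PeerEntry`, `is_zombie`, `on_inbound_signal`, the peer map, and the 60 s value given to `STALE_INBOUND` are all hypothetical stand-ins; the real constant lives in `myownmesh-signaling` and its value is not stated in this PR.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// ASSUMED value for illustration only; the real STALE_INBOUND_MS is
/// defined in the signaling crate and is not quoted in this PR.
const STALE_INBOUND: Duration = Duration::from_millis(60_000);

/// Hypothetical, simplified peer record (the real engine holds much more).
#[derive(Debug, Clone, PartialEq)]
struct PeerEntry {
    /// Time of the last inbound traffic; None if nothing was ever received
    /// (e.g. mid-handshake or stuck at Sighted).
    last_recv_at: Option<Instant>,
}

/// True only for a peer that has received traffic before and has been
/// silent for longer than `stale_after`. Never-received peers are not
/// zombies; the description leaves them to the existing re-offer path.
fn is_zombie(last_recv_at: Option<Instant>, now: Instant, stale_after: Duration) -> bool {
    match last_recv_at {
        Some(t) => now.saturating_duration_since(t) > stale_after,
        None => false,
    }
}

/// Sketch of the announce/offer arm: drop a stale session first, then let a
/// stand-in for ensure_peer_session create a fresh entry if none exists.
/// Returns true when a stale session was dropped.
fn on_inbound_signal(
    peers: &mut HashMap<String, PeerEntry>,
    peer_id: &str,
    now: Instant,
    stale_after: Duration,
) -> bool {
    let stale = peers
        .get(peer_id)
        .is_some_and(|e| is_zombie(e.last_recv_at, now, stale_after));
    if stale {
        peers.remove(peer_id);
    }
    // Stand-in for ensure_peer_session: a no-op when an entry already exists.
    peers
        .entry(peer_id.to_string())
        .or_insert(PeerEntry { last_recv_at: None });
    stale
}
```

In this sketch, calling `on_inbound_signal(&mut peers, "peer-a", Instant::now(), STALE_INBOUND)` rebuilds `peer-a` only if its entry was a zombie. A recently active entry keeps its `last_recv_at`, so in-place ICE recovery would not be disturbed.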
Generated by Claude Code