[Bug]: worker last_seen frozen for ~12 active workers — heartbeat-ingest appears to stop while workers stay functional
Component: Host (orchestrator)
Describe the bug
On the live kv.run:8000/flowmesh host (cluster lumid0, namespace lumid), every worker except three long-stale rows reports a last_seen clustered tightly between 2026-05-27T18:18:31 and 2026-05-27T18:19:01 UTC: a ~30-second window that 12 workers fell into and none has left. As of 2026-05-28T10:52Z that is a ~16.6h freeze across 12 disparate workers on 8 different nodes (incl. CPU + RTX 5080 + RTX 6000 Ada + GB10 + RTX PRO 4000 boxes). The workers themselves are clearly still functional: three of them (wkr-24, wkr-21, wkr-12) report status: BUSY against the same stale last_seen, so they are actively executing tasks. Only the host's heartbeat-ingest side seems to have stopped writing at that point.
Expected: last_seen should advance per worker on each heartbeat regardless of execution state.
Actual: all workers stalled at near-identical timestamps in a ~30s window, despite continuing to pick up jobs.
Reproduction (read-only probe):
curl -sS -H "Authorization: Bearer $PAT" \
https://kv.run:8000/flowmesh/api/v1/workers \
| jq '.[] | {id, node_id, status, last_seen}' | head -40
Output (representative subset, taken at 2026-05-28T10:52Z):
{"id":"wkr-15","node_id":"nde-10","status":"IDLE","last_seen":"2026-05-27T18:18:54.…Z"}
{"id":"wkr-10","node_id":"nde-13","status":"IDLE","last_seen":"2026-05-27T18:18:34.…Z"}
{"id":"wkr-16","node_id":"nde-14","status":"IDLE","last_seen":"2026-05-27T18:18:35.…Z"}
{"id":"wkr-24","node_id":"nde-20","status":"BUSY","last_seen":"2026-05-27T18:18:34.…Z"}
{"id":"wkr-21","node_id":"nde-6", "status":"BUSY","last_seen":"2026-05-27T18:18:37.…Z"}
{"id":"wkr-12","node_id":"nde-7", "status":"BUSY","last_seen":"2026-05-27T18:18:58.…Z"}
... (12 rows total, all between 18:18:31 and 18:19:01)
{"id":"wkr-23","node_id":"nde-18","status":"IDLE","last_seen":"2026-05-26T00:21:19.…Z"} // 58h, predates the freeze
{"id":"wkr-2", "node_id":"nde-2", "status":"IDLE","last_seen":"2026-05-19T08:21:11.…Z"} // genuine zombie (nde-2 offline)
{"id":"wkr-3", "node_id":"nde-2", "status":"IDLE","last_seen":"2026-05-19T08:21:11.…Z"} // genuine zombie
That tight clustering across 8 nodes + the BUSY-with-stale-heartbeat combination point to a host-side ingest issue (Redis stream consumer / worker-state writer stalled, channel desync, etc.) rather than 12 simultaneous worker failures.
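The reasoning above can be checked mechanically. The following is a minimal Python sketch, not part of FlowMesh: it takes rows shaped like the /api/v1/workers output, finds the largest group of last_seen values that fit inside one time window, and lists the BUSY workers inside that group. The function names, the 60-second window, and the sample rows (the subset shown above, with the elided fractional seconds dropped) are assumptions for illustration.

```python
from datetime import datetime, timedelta


def parse_ts(ts: str) -> datetime:
    # Treat a trailing "Z" as UTC so fromisoformat accepts it on older Pythons.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def largest_cluster(workers, window=timedelta(seconds=60)):
    """Return the largest set of rows whose last_seen values span <= window."""
    rows = sorted(workers, key=lambda w: parse_ts(w["last_seen"]))
    best, start = [], 0
    for end in range(len(rows)):
        # Shrink the window from the left until it fits again.
        while parse_ts(rows[end]["last_seen"]) - parse_ts(rows[start]["last_seen"]) > window:
            start += 1
        if end - start + 1 > len(best):
            best = rows[start:end + 1]
    return best


def summarize(workers, window=timedelta(seconds=60)):
    cluster = largest_cluster(workers, window)
    times = [parse_ts(w["last_seen"]) for w in cluster]
    return {
        "cluster_size": len(cluster),
        "spread_seconds": (max(times) - min(times)).total_seconds(),
        # BUSY rows inside the frozen cluster: working, but heartbeat not advancing.
        "busy_in_cluster": sorted(w["id"] for w in cluster if w["status"] == "BUSY"),
    }


# Subset of the rows in this report; fractional seconds are elided there and omitted here.
SAMPLE = [
    {"id": "wkr-15", "status": "IDLE", "last_seen": "2026-05-27T18:18:54Z"},
    {"id": "wkr-10", "status": "IDLE", "last_seen": "2026-05-27T18:18:34Z"},
    {"id": "wkr-16", "status": "IDLE", "last_seen": "2026-05-27T18:18:35Z"},
    {"id": "wkr-24", "status": "BUSY", "last_seen": "2026-05-27T18:18:34Z"},
    {"id": "wkr-21", "status": "BUSY", "last_seen": "2026-05-27T18:18:37Z"},
    {"id": "wkr-12", "status": "BUSY", "last_seen": "2026-05-27T18:18:58Z"},
    {"id": "wkr-23", "status": "IDLE", "last_seen": "2026-05-26T00:21:19Z"},
    {"id": "wkr-2", "status": "IDLE", "last_seen": "2026-05-19T08:21:11Z"},
    {"id": "wkr-3", "status": "IDLE", "last_seen": "2026-05-19T08:21:11Z"},
]

report = summarize(SAMPLE)
print(report)
```

On the sample rows, the six recent workers land in one cluster and the three BUSY workers appear inside it, which is the pattern that is hard to explain by independent worker failures.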
Impact:
- Registry can no longer distinguish live workers from dead ones (every monitoring threshold built on last_seen flags healthy workers as stale).
- Pre-existing zombies (wkr-2, wkr-3 on offline node nde-2/luyao1) become indistinguishable from the rest of the fleet.
- Downstream dashboards / alerting (e.g. lum.id cluster registry) lose ground truth.
Suggested investigation:
- Grep the host logs around 2026-05-27T18:18:31Z ± 60s; look for an exception, a Redis reconnect, or a deploy/restart.
- Check whether the worker-heartbeat consumer group is stuck on a pending entry (Redis XPENDING).
- Confirm whether last_seen is updated only on heartbeat ingest, or also on result-submission / task-claim paths (status: BUSY updates without last_seen updates suggests the two paths diverged).
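For the XPENDING check, one way to tell whether a stuck entry dates from the freeze is to decode its ID. This is a hedged Python sketch: it assumes the host really uses a Redis stream with auto-generated entry IDs of the form `<milliseconds>-<sequence>` (custom IDs would not decode this way), and the stream and consumer-group names are not known from this report, so no Redis call is shown. The function names and the example ID are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Freeze window as reported in this issue.
FREEZE_START = datetime(2026, 5, 27, 18, 18, 31, tzinfo=timezone.utc)
FREEZE_END = datetime(2026, 5, 27, 18, 19, 1, tzinfo=timezone.utc)


def stream_id_to_utc(entry_id: str) -> datetime:
    """Decode the millisecond part of an auto-generated Redis stream entry ID."""
    ms, _, _seq = entry_id.partition("-")
    return EPOCH + timedelta(milliseconds=int(ms))


def in_freeze_window(entry_id: str, start=FREEZE_START, end=FREEZE_END) -> bool:
    """True if the entry was created inside the observed freeze window."""
    return start <= stream_id_to_utc(entry_id) <= end


# Illustrative ID built from the freeze start; NOT taken from the real stream.
example_id = f"{int(FREEZE_START.timestamp() * 1000)}-0"
print(stream_id_to_utc(example_id).isoformat(), in_freeze_window(example_id))
```

If the smallest pending ID reported by XPENDING for the heartbeat consumer group decodes to a time inside that window, it would support the theory that the consumer stalled on one entry rather than the workers going silent.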
Environment
- FlowMesh version: deployed image as of 2026-05-13T18:06:47Z (worker started_at); pre-v0.1.2 redeploy.
- Cluster: lumid0, namespace lumid, multi-node (8 nodes, 15 workers: 12 GPU + 3 CPU).
- Observation captured at: 2026-05-28T10:52Z.
Additional context
This was surfaced while validating the upcoming lumid-flowmesh-plugin v0.2.0 rollout. The plugin work is unrelated; the heartbeat-ingest behavior reproduces against the current v0.1.1 stack. A POST /api/v1/admin/workers/prune (or equivalent) endpoint that drops workers with last_seen older than a configurable threshold would also help. Currently the host exposes only GET on /api/v1/workers/{id} and /api/v1/nodes/{id} (confirmed via OPTIONS), so stale rows can't be cleaned out via the API.