Skip to content

fix(distributed): scope Upgrade All to nodes that have the backend installed#9678

Merged
mudler merged 1 commit into
masterfrom
fix/distributed-upgrade-smart-fanout
May 5, 2026
Merged

fix(distributed): scope Upgrade All to nodes that have the backend installed#9678
mudler merged 1 commit into
masterfrom
fix/distributed-upgrade-smart-fanout

Conversation

@mudler
Copy link
Copy Markdown
Owner

@mudler mudler commented May 5, 2026

In distributed mode the React UI's "Upgrade All" button fanned every detected outdated backend out to every healthy backend node, including nodes that never had that backend installed. On heterogeneous clusters this surfaced as platform errors (e.g. mac-mini-m4 asked to upgrade cpu-insightface-development, which has no darwin/arm64 variant) and left forever-retrying pending_backend_ops rows.

DistributedBackendManager.UpgradeBackend now queries ListBackends() first, builds the target node-ID set from SystemBackend.Nodes, and only fans out to those nodes — every per-node primitive (adapter.InstallBackend, the pending-ops queue, BackendOpResult) is unchanged. enqueueAndDrainBackendOp gains an optional targetNodeIDs allowlist; Install/Delete keep their fan-to-everyone semantics by passing nil. If no node reports the backend installed, UpgradeBackend now returns a clear "not installed on any node" error instead of producing a stuck queue.

Adds Ginkgo coverage for the smart fan-out: backend on a subset of nodes goes only to those nodes; backend on no node returns the new error and never sends a NATS install request.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Description

This PR fixes #

Notes for Reviewers

Signed commits

  • Yes, I signed my commits.

…stalled

In distributed mode the React UI's "Upgrade All" button fanned every
detected outdated backend out to every healthy backend node, including
nodes that never had that backend installed. On heterogeneous clusters
this surfaced as platform errors (e.g. mac-mini-m4 asked to upgrade
cpu-insightface-development, which has no darwin/arm64 variant) and left
forever-retrying pending_backend_ops rows.

DistributedBackendManager.UpgradeBackend now queries ListBackends()
first, builds the target node-ID set from SystemBackend.Nodes, and only
fans out to those nodes — every per-node primitive
(adapter.InstallBackend, the pending-ops queue, BackendOpResult) is
unchanged. enqueueAndDrainBackendOp gains an optional targetNodeIDs
allowlist; Install/Delete keep their fan-to-everyone semantics by
passing nil. If no node reports the backend installed, UpgradeBackend
now returns a clear "not installed on any node" error instead of
producing a stuck queue.

Adds Ginkgo coverage for the smart fan-out: backend on a subset of
nodes goes only to those nodes; backend on no node returns the new
error and never sends a NATS install request.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
@mudler mudler merged commit 75fba9e into master May 5, 2026
50 checks passed
@mudler mudler deleted the fix/distributed-upgrade-smart-fanout branch May 5, 2026 22:28
@localai-bot localai-bot added the bug Something isn't working label May 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants