fix(distributed): make backend upgrade actually re-install on workers #9708
Merged
Conversation
UpgradeBackend dispatched a vanilla backend.install NATS event to every node hosting the backend. The worker's installBackend short-circuits on "already running for this (model, replica) slot" and returns the existing address — so the gallery install path was skipped, no artifact was re-downloaded, no metadata was written. The frontend's drift detection then re-flagged the same backends every cycle (installedDigest stays empty → mismatch → "Backend upgrade available (new build)") while "Backend upgraded successfully" landed in the logs at the same time. The user-visible symptom: clicking "Upgrade All" silently does nothing and the same N backends sit on the upgrade list forever.

Two coupled fixes, one PR:

1. Force flag on backend.install. Add `Force bool` to BackendInstallRequest and thread it through NodeCommandSender -> RemoteUnloaderAdapter. UpgradeBackend (and the reconciler's pending-op drain when retrying an upgrade) sets force=true; routine load events and admin install endpoints keep force=false. On the worker, force=true stops every live process that uses this backend (resolveProcessKeys for peer replicas, plus the exact request processKey), skips the findBackend short-circuit, and passes force=true into gallery.InstallBackendFromGallery so the on-disk artifact is overwritten. After the gallery install completes, startBackend brings up a fresh process at the same processKey on a new port.

2. Liveness check on the fast path. installBackend's "already running" branch read getAddr without verifying the process was alive, so a gRPC backend that died without the supervisor noticing left a stale (key, addr) entry. The reconciler then dialed that address, got ECONNREFUSED, marked the replica failed, retried install — and the supervisor said "already running addr=…" again. Loop forever, exactly what we observed on a node whose llama-cpp process had died but whose supervisor record persisted. Verify s.isRunning(processKey) before trusting getAddr; if the entry is stale, stopBackendExact cleans up and we fall through to a real install.

Backwards-compatible: the new Force field is omitempty, older workers ignore it (their default behavior matches force=false). The signature change on NodeCommandSender.InstallBackend is internal-only.

Verified: unit tests in core/services/nodes pass (108s suite). The pre-existing core/backend build break (proto regen pending for word-level timestamps) blocks core/cli and core/http/endpoints/localai package tests but is unrelated to this change.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
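A minimal sketch of how the new flag threads through the sender layer, under stated assumptions: the interface below is illustrative and trims the real parameter list (the actual `InstallBackend` takes more arguments than shown), and the `upgradeBackend` helper is a stand-in for the dispatch loop the commit describes, not LocalAI's actual code.

```go
package nodes

import "context"

// Illustrative only: the real NodeCommandSender.InstallBackend takes
// more parameters than shown; this sketch keeps just enough to show
// where the new force flag flows.
type NodeCommandSender interface {
	InstallBackend(ctx context.Context, nodeID, backend string, force bool) error
}

// upgradeBackend dispatches a forced install to every node hosting the
// backend, so workers reinstall instead of short-circuiting on
// "already running". Routine load paths pass force=false instead.
func upgradeBackend(ctx context.Context, sender NodeCommandSender, nodeIDs []string, backend string) error {
	for _, node := range nodeIDs {
		if err := sender.InstallBackend(ctx, node, backend, true /* force */); err != nil {
			return err
		}
	}
	return nil
}
```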
NodeCommandSender.InstallBackend gained a final force bool in the upgrade-force commit; the e2e distributed lifecycle tests still called the old 8-arg signature and broke compilation. These tests exercise the routine install path (single replica, default behavior), so force=false preserves their existing semantics.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
## Summary
Fixes the persistent "N backends have updates available" loop where clicking Upgrade All appears to succeed (toast says "Upgrade started", logs say "Backend upgraded successfully") but the upgrade list never drops and backends stay on their old versions indefinitely.
Two coupled bugs, both in the worker's `installBackend` fast path:

### 1. `backend.install` is a no-op when the backend is already running

`UpgradeBackend` dispatches a vanilla `backend.install` NATS event to every node hosting the backend. The worker's `installBackend` (`core/cli/worker.go`) short-circuits on "already running for this (model, replica) slot" and returns the existing address — so the gallery install path is skipped, no artifact is re-downloaded, no metadata is written. The frontend's drift detection then re-flags the same backends every cycle (`installedDigest` stays empty → mismatch → `"Backend upgrade available (new build)"`) while `"Backend upgraded successfully"` lands in the logs at the same time. The user clicks Upgrade All, sees nothing change, and repeats forever.

**Fix:** Add `Force bool` to `messaging.BackendInstallRequest` and thread it through `NodeCommandSender` → `RemoteUnloaderAdapter`. `UpgradeBackend` (and the reconciler's pending-op drain when retrying an upgrade op) sets `force=true`; routine load events and admin install endpoints keep `force=false`. On the worker, `force=true`:

- stops every live process that uses this backend (`resolveProcessKeys` for peer replicas, plus the exact request `processKey` for model-prefixed keys)
- skips the `findBackend` early return so we don't restart the same stale binary
- passes `force=true` into `gallery.InstallBackendFromGallery` so the on-disk artifact is overwritten
- brings up a fresh process at the same `processKey` on a new port once the gallery install completes
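A minimal sketch of the worker-side `force=true` branch. The interface and helper names below are stand-ins reconstructed from this description (`installFromGallery` here wraps what the PR calls `gallery.InstallBackendFromGallery`; all signatures are assumptions, not LocalAI's actual API):

```go
package worker

// Illustrative supervisor surface; LocalAI's actual helpers differ.
type backendSupervisor interface {
	resolveProcessKeys(backend string) []string          // process keys of peer replicas using this backend
	stopBackendExact(processKey string)                  // stop one process and drop its (key, addr) entry
	installFromGallery(backend string, force bool) error // stand-in for gallery.InstallBackendFromGallery
	startBackend(processKey string) (addr string, err error)
}

// forceReinstall is the force=true path: stop every consumer of the
// backend, overwrite the on-disk artifact, then restart the requested
// slot. It never takes the "already running" short-circuit.
func forceReinstall(s backendSupervisor, backend, processKey string) (string, error) {
	for _, key := range s.resolveProcessKeys(backend) {
		s.stopBackendExact(key) // peer replicas sharing the artifact
	}
	s.stopBackendExact(processKey) // the exact (model, replica) slot

	// force=true tells the gallery to re-download and overwrite even
	// though an install already exists on disk.
	if err := s.installFromGallery(backend, true); err != nil {
		return "", err
	}

	// Fresh process at the same processKey, on a new port.
	return s.startBackend(processKey)
}
```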
### 2. Stale `(key, addr)` survives a dead process and traps the reconciler in a retry loop

`installBackend`'s "already running" branch read `getAddr` without verifying the process was alive. A gRPC backend that died without the supervisor noticing left a stale entry. The reconciler then dialed that address, got `ECONNREFUSED`, marked the replica failed, retried install — and the supervisor said `"Backend already running for model replica … addr=127.0.0.1:30230"` again. Loop forever, exactly what we observed on a Jetson Thor node whose `llama-cpp` process had died but whose supervisor record persisted (logs in the tracking issue).

**Fix:** Verify `s.isRunning(processKey)` before trusting `getAddr`; if the entry is stale, `stopBackendExact` cleans up and we fall through to a real install.
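A minimal sketch of the guarded fast path, assuming a map-backed supervisor; `isRunning`, `getAddr`, and `stopBackendExact` are named in this PR, but their shapes here are assumptions:

```go
package worker

// Illustrative supervisor state; LocalAI's real bookkeeping differs.
type supervisor struct {
	addrs map[string]string     // processKey -> address of the supervised process
	alive func(key string) bool // stand-in for a real liveness probe
}

func (s *supervisor) getAddr(key string) (string, bool) { a, ok := s.addrs[key]; return a, ok }
func (s *supervisor) isRunning(key string) bool         { return s.alive(key) }
func (s *supervisor) stopBackendExact(key string)       { delete(s.addrs, key) }

// tryFastPath returns (addr, true) only when a live process already
// serves processKey. The old code trusted getAddr unconditionally, so
// a dead process left a stale (key, addr) entry and the reconciler
// looped on ECONNREFUSED forever.
func (s *supervisor) tryFastPath(processKey string) (string, bool) {
	addr, ok := s.getAddr(processKey)
	if !ok {
		return "", false // no record: caller performs a real install
	}
	if s.isRunning(processKey) {
		return addr, true // genuinely running: safe to reuse
	}
	s.stopBackendExact(processKey) // stale entry: clean up first
	return "", false               // then fall through to a real install
}
```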
## Backwards compatibility

- The new `Force` field is `omitempty` — older workers ignore it, and absence is interpreted as `false`, which preserves their current "already running → return addr" behavior.
- The signature change on `NodeCommandSender.InstallBackend` is internal-only; one test fake (`router_test.go`) is updated alongside.
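A small, runnable illustration of why `omitempty` keeps the wire format backwards-compatible. The field set is trimmed to the essentials; only `Force` is from this PR, and the real `messaging.BackendInstallRequest` carries more fields:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Trimmed illustration of messaging.BackendInstallRequest.
type BackendInstallRequest struct {
	Backend string `json:"backend"`
	Force   bool   `json:"force,omitempty"`
}

func main() {
	routine, _ := json.Marshal(BackendInstallRequest{Backend: "llama-cpp"})
	upgrade, _ := json.Marshal(BackendInstallRequest{Backend: "llama-cpp", Force: true})

	// Routine installs serialize exactly as before the PR, so older
	// workers that don't know the field see an unchanged payload:
	fmt.Println(string(routine)) // {"backend":"llama-cpp"}
	fmt.Println(string(upgrade)) // {"backend":"llama-cpp","force":true}
}
```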
## Test plan

- `go test ./core/services/messaging` — passes
- `go test ./core/services/nodes` — passes (108s suite, includes router/reconciler)
- Manual upgrade path: trigger an upgrade and verify the worker logs `"Force install: stopping running backend before reinstall"` then `"Installing backend from gallery"` with `force=true`
- Manual stale-entry path: kill a backend process out from under the supervisor (e.g. `kill -9` of the per-model `local-ai` child) without restarting the worker; trigger any model load that routes to that node; verify the worker logs `"Stale process entry for backend (dead process); cleaning up before reinstall"` and the load completes against a freshly started process instead of looping on `ECONNREFUSED`
## Build note

The pre-existing `core/backend/transcript.go:189` build break (proto regen pending for the word-level-timestamps feature added in af83518) blocks `core/cli` and `core/http/endpoints/localai` package tests on master HEAD. This PR doesn't touch that path; tests in unaffected packages pass.

🤖 Generated with Claude Code