
fix(distributed): make backend upgrade actually re-install on workers #9708

Merged
mudler merged 2 commits into master from fix/distributed-upgrade-force-reinstall on May 7, 2026
Conversation

@localai-bot
Collaborator

Summary

Fixes the persistent "N backends have updates available" loop where clicking Upgrade All appears to succeed (toast says "Upgrade started", logs say "Backend upgraded successfully") but the upgrade list never drops and backends stay on their old versions indefinitely.

Two coupled bugs, both in the worker's installBackend fast path:

1. backend.install is a no-op when the backend is already running

UpgradeBackend dispatches a vanilla backend.install NATS event to every node hosting the backend. The worker's installBackend (core/cli/worker.go) short-circuits on "already running for this (model, replica) slot" and returns the existing address — so the gallery install path is skipped, no artifact is re-downloaded, no metadata is written. The frontend's drift detection then re-flags the same backends every cycle (installedDigest stays empty → mismatch → "Backend upgrade available (new build)") while "Backend upgraded successfully" lands in the logs at the same time. The user clicks Upgrade All, sees nothing change, repeats forever.
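
For concreteness, a minimal sketch of that pre-fix fast path; the types and helper names below are simplified stand-ins, not the actual core/cli/worker.go code:

```go
// Illustrative pre-fix worker fast path (names simplified).
package worker

import "fmt"

type installRequest struct {
	Backend string
	Model   string
	Replica int
}

type supervisor struct {
	addrs map[string]string // processKey -> address of the running process
}

func processKey(model string, replica int) string {
	return fmt.Sprintf("%s/%d", model, replica)
}

func (s *supervisor) installBackend(req installRequest) (string, error) {
	key := processKey(req.Model, req.Replica)

	// Bug 1: an upgrade arrives as a plain backend.install event, takes
	// this branch, and returns the old process's address. The gallery
	// install below never runs, so nothing is re-downloaded.
	if addr, ok := s.addrs[key]; ok {
		// Bug 2: the entry is trusted without checking that the process
		// behind it is still alive, so a dead process leaves a stale
		// (key, addr) pair here indefinitely.
		return addr, nil
	}

	// Stand-in for gallery.InstallBackendFromGallery plus process start.
	if err := s.installFromGallery(req.Backend); err != nil {
		return "", err
	}
	return s.startBackend(key, req)
}

func (s *supervisor) installFromGallery(name string) error { return nil } // stub

func (s *supervisor) startBackend(key string, req installRequest) (string, error) {
	addr := "127.0.0.1:0" // stub; the real worker picks a free port
	s.addrs[key] = addr
	return addr, nil
}
```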

Fix: Add Force bool to messaging.BackendInstallRequest and thread it through NodeCommandSender → RemoteUnloaderAdapter. UpgradeBackend (and the reconciler's pending-op drain when retrying an upgrade op) sets force=true; routine load events and admin install endpoints keep force=false. On the worker, force=true (sketched after the list):

  • stops every live process that uses this backend (resolveProcessKeys for peer replicas, plus the exact request processKey for model-prefixed keys)
  • skips the findBackend early return so we don't restart the same stale binary
  • passes force=true into gallery.InstallBackendFromGallery so the on-disk artifact is overwritten
  • starts a fresh process at the same processKey on a new port
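
Revising the sketch above, the post-fix flow could look roughly like this. The Force field, resolveProcessKeys, and stopBackendExact names come from this PR; the struct layout, the JSON tags other than force, and all function bodies are illustrative:

```go
// BackendInstallRequest with the new Force field; omitempty keeps the
// wire format unchanged for force=false senders.
type BackendInstallRequest struct {
	Backend string `json:"backend"`
	Model   string `json:"model"`
	Replica int    `json:"replica"`
	Force   bool   `json:"force,omitempty"`
}

func (s *supervisor) installBackend(req BackendInstallRequest) (string, error) {
	key := processKey(req.Model, req.Replica)

	if req.Force {
		// Stop every live process using this backend: peer replicas via
		// resolveProcessKeys, plus the exact request processKey. Also
		// deliberately skip the early return below, so we never restart
		// the same stale binary.
		for _, k := range s.resolveProcessKeys(req.Backend) {
			s.stopBackendExact(k)
		}
		s.stopBackendExact(key)
	} else if addr, ok := s.addrs[key]; ok {
		return addr, nil // routine load keeps the fast path
	}

	// force=true is passed down (ultimately into
	// gallery.InstallBackendFromGallery) so the on-disk artifact is
	// overwritten instead of being skipped as already-installed.
	if err := s.installFromGalleryForce(req.Backend, req.Force); err != nil {
		return "", err
	}
	// Fresh process at the same processKey, on a new port.
	return s.startBackend(key, installRequest{req.Backend, req.Model, req.Replica})
}

func (s *supervisor) resolveProcessKeys(backend string) []string            { return nil } // stub
func (s *supervisor) stopBackendExact(key string)                           {}            // stub
func (s *supervisor) installFromGalleryForce(name string, force bool) error { return nil } // stub
```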

2. Stale (key, addr) survives a dead process and traps the reconciler in a retry loop

installBackend's "already running" branch read getAddr without verifying the process was alive. A gRPC backend that died without the supervisor noticing left a stale entry. The reconciler then dialed that address, got ECONNREFUSED, marked the replica failed, retried install — and the supervisor said "Backend already running for model replica … addr=127.0.0.1:30230" again. Loop forever, exactly what we observed on a Jetson Thor node whose llama-cpp process had died but whose supervisor record persisted (logs in tracking issue).

Fix: Verify s.isRunning(processKey) before trusting getAddr; if the entry is stale, stopBackendExact cleans up and we fall through to a real install.
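
Continuing the same sketch, the hardened fast path replaces the bare map read; isRunning and stopBackendExact are the names this PR describes, the rest is illustrative:

```go
// Liveness-checked fast path: trust the recorded address only while
// the process behind it is alive; otherwise clean up the stale entry
// and report a miss so the caller falls through to a real install.
func (s *supervisor) fastPath(key string) (string, bool) {
	addr, ok := s.addrs[key]
	if !ok {
		return "", false
	}
	if s.isRunning(key) {
		return addr, true // healthy process: routine loads stay fast
	}
	// Stale (key, addr): the gRPC process died without the supervisor
	// noticing. Cleaning up here breaks the reconciler's dial ->
	// ECONNREFUSED -> retry -> "already running" loop.
	s.stopBackendExact(key)
	return "", false
}

func (s *supervisor) isRunning(key string) bool { return false } // stub
```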

Backwards compatibility

  • The new Force field is omitempty — older workers ignore it, and its absence is interpreted as false, which preserves their current "already running → return addr" behavior (demonstrated below).
  • The signature change on NodeCommandSender.InstallBackend is internal-only; one test fake (router_test.go) was updated alongside.
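
The omitempty claim is ordinary encoding/json behavior and easy to check in isolation (field names here mirror the sketch above, not necessarily the real struct):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal stand-in for the request type; only Force matters here.
type req struct {
	Backend string `json:"backend"`
	Force   bool   `json:"force,omitempty"`
}

func main() {
	// force=false serializes without the key at all, so older workers
	// receive exactly the payload they received before this PR.
	b, _ := json.Marshal(req{Backend: "llama-cpp"})
	fmt.Println(string(b)) // {"backend":"llama-cpp"}

	// Decoding a payload that lacks "force" yields the zero value,
	// i.e. the pre-PR "already running -> return addr" behavior.
	var r req
	_ = json.Unmarshal([]byte(`{"backend":"llama-cpp"}`), &r)
	fmt.Println(r.Force) // false
}
```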

Test plan

  • go test ./core/services/messaging — passes
  • go test ./core/services/nodes — passes (108s suite, includes router/reconciler)
  • Manual: in a distributed cluster with a backend pre-installed on workers, edit the gallery to a newer version (or wait for upstream to publish one); click "Upgrade All" in the UI; verify
    • workers log "Force install: stopping running backend before reinstall" then "Installing backend from gallery" with force=true
    • on-disk artifact mtime updates on workers
    • upgrade-availability list converges to 0 within one check cycle
  • Manual: kill a worker's gRPC process out-of-band (kill -9 of the per-model local-ai child) without restarting the worker; trigger any model load that routes to that node; verify the worker logs "Stale process entry for backend (dead process); cleaning up before reinstall" and the load completes against a freshly started process instead of looping on ECONNREFUSED.
  • Regression: routine model load on an existing healthy backend still returns immediately via the fast path (no spurious gallery install).

Build note

The pre-existing core/backend/transcript.go:189 build break (proto regen pending for the word-level-timestamps feature added in af83518) blocks core/cli and core/http/endpoints/localai package tests on master HEAD. This PR doesn't touch that path; tests in unaffected packages pass.

🤖 Generated with Claude Code

mudler added 2 commits May 7, 2026 13:35

UpgradeBackend dispatched a vanilla backend.install NATS event to every
node hosting the backend. The worker's installBackend short-circuits on
"already running for this (model, replica) slot" and returns the
existing address — so the gallery install path was skipped, no artifact
was re-downloaded, no metadata was written. The frontend's drift
detection then re-flagged the same backends every cycle (installedDigest
stays empty → mismatch → "Backend upgrade available (new build)") while
"Backend upgraded successfully" landed in the logs at the same time.
The user-visible symptom: clicking "Upgrade All" silently does nothing
and the same N backends sit on the upgrade list forever.

Two coupled fixes, one PR:

1. Force flag on backend.install. Add `Force bool` to
   BackendInstallRequest and thread it through NodeCommandSender ->
   RemoteUnloaderAdapter. UpgradeBackend (and the reconciler's pending-op
   drain when retrying an upgrade) sets force=true; routine load events
   and admin install endpoints keep force=false. On the worker, force=true
   stops every live process that uses this backend (resolveProcessKeys
   for peer replicas, plus the exact request processKey), skips the
   findBackend short-circuit, and passes force=true into
   gallery.InstallBackendFromGallery so the on-disk artifact is
   overwritten. After the gallery install completes, startBackend brings
   up a fresh process at the same processKey on a new port.

2. Liveness check on the fast path. installBackend's "already running"
   branch read getAddr without verifying the process was alive, so a
   gRPC backend that died without the supervisor noticing left a stale
   (key, addr) entry. The reconciler then dialed that address, got
   ECONNREFUSED, marked the replica failed, retried install — and the
   supervisor said "already running addr=…" again. Loop forever, exactly
   what we observed on a node whose llama-cpp process had died but whose
   supervisor record persisted. Verify s.isRunning(processKey) before
   trusting getAddr; if the entry is stale, stopBackendExact cleans up
   and we fall through to a real install.

Backwards-compatible: the new Force field is omitempty, older workers
ignore it (their default behavior matches force=false). The signature
change on NodeCommandSender.InstallBackend is internal-only.

Verified: unit tests in core/services/nodes pass (108s suite). The
pre-existing core/backend build break (proto regen pending for
word-level timestamps) blocks core/cli and core/http/endpoints/localai
package tests but is unrelated to this change.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

NodeCommandSender.InstallBackend gained a final force bool in the
upgrade-force commit; the e2e distributed lifecycle tests still called
the old 8-arg signature and broke compilation. These tests exercise the
routine install path (single replica, default behavior), so force=false
preserves their existing semantics.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
@mudler mudler merged commit 447c186 into master May 7, 2026
49 of 51 checks passed
@mudler mudler deleted the fix/distributed-upgrade-force-reinstall branch May 7, 2026 15:28
@localai-bot localai-bot added the bug Something isn't working label May 9, 2026