fix(distributed): make backend upgrade actually re-install on workers #9708
Merged
Conversation
UpgradeBackend dispatched a vanilla backend.install NATS event to every node hosting the backend. The worker's installBackend short-circuits on "already running for this (model, replica) slot" and returns the existing address — so the gallery install path was skipped, no artifact was re-downloaded, no metadata was written. The frontend's drift detection then re-flagged the same backends every cycle (installedDigest stays empty → mismatch → "Backend upgrade available (new build)") while "Backend upgraded successfully" landed in the logs at the same time. The user-visible symptom: clicking "Upgrade All" silently does nothing and the same N backends sit on the upgrade list forever.

Two coupled fixes, one PR:

1. Force flag on backend.install. Add `Force bool` to BackendInstallRequest and thread it through NodeCommandSender -> RemoteUnloaderAdapter. UpgradeBackend (and the reconciler's pending-op drain when retrying an upgrade) sets force=true; routine load events and admin install endpoints keep force=false. On the worker, force=true stops every live process that uses this backend (resolveProcessKeys for peer replicas, plus the exact request processKey), skips the findBackend short-circuit, and passes force=true into gallery.InstallBackendFromGallery so the on-disk artifact is overwritten. After the gallery install completes, startBackend brings up a fresh process at the same processKey on a new port.

2. Liveness check on the fast path. installBackend's "already running" branch read getAddr without verifying the process was alive, so a gRPC backend that died without the supervisor noticing left a stale (key, addr) entry. The reconciler then dialed that address, got ECONNREFUSED, marked the replica failed, retried install — and the supervisor said "already running addr=…" again. Loop forever, exactly what we observed on a node whose llama-cpp process had died but whose supervisor record persisted. Verify s.isRunning(processKey) before trusting getAddr; if the entry is stale, stopBackendExact cleans up and we fall through to a real install.

Backwards-compatible: the new Force field is omitempty, older workers ignore it (their default behavior matches force=false). The signature change on NodeCommandSender.InstallBackend is internal-only.

Verified: unit tests in core/services/nodes pass (108s suite). The pre-existing core/backend build break (proto regen pending for word-level timestamps) blocks core/cli and core/http/endpoints/localai package tests but is unrelated to this change.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
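A minimal sketch of how the new flag threads through the sender layer, under stated assumptions: the interface below is illustrative and trims the real parameter list (the actual `InstallBackend` takes more arguments than shown), and the `upgradeBackend` helper is a stand-in for the dispatch loop the commit describes, not LocalAI's actual code.

```go
package nodes

import "context"

// Illustrative only: the real NodeCommandSender.InstallBackend takes
// more parameters than shown; this sketch keeps just enough to show
// where the new force flag flows.
type NodeCommandSender interface {
	InstallBackend(ctx context.Context, nodeID, backend string, force bool) error
}

// upgradeBackend dispatches a forced install to every node hosting the
// backend, so workers reinstall instead of short-circuiting on
// "already running". Routine load paths pass force=false instead.
func upgradeBackend(ctx context.Context, sender NodeCommandSender, nodeIDs []string, backend string) error {
	for _, node := range nodeIDs {
		if err := sender.InstallBackend(ctx, node, backend, true /* force */); err != nil {
			return err
		}
	}
	return nil
}
```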
NodeCommandSender.InstallBackend gained a final force bool in the upgrade-force commit; the e2e distributed lifecycle tests still called the old 8-arg signature and broke compilation. These tests exercise the routine install path (single replica, default behavior), so force=false preserves their existing semantics.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
## Summary
Fixes the persistent "N backends have updates available" loop where clicking Upgrade All appears to succeed (toast says "Upgrade started", logs say "Backend upgraded successfully") but the upgrade list never drops and backends stay on their old versions indefinitely.
Two coupled bugs, both in the worker's `installBackend` fast path:

### 1. `backend.install` is a no-op when the backend is already running

`UpgradeBackend` dispatches a vanilla `backend.install` NATS event to every node hosting the backend. The worker's `installBackend` (`core/cli/worker.go`) short-circuits on "already running for this (model, replica) slot" and returns the existing address — so the gallery install path is skipped, no artifact is re-downloaded, no metadata is written. The frontend's drift detection then re-flags the same backends every cycle (`installedDigest` stays empty → mismatch → `"Backend upgrade available (new build)"`) while `"Backend upgraded successfully"` lands in the logs at the same time. The user clicks Upgrade All, sees nothing change, and repeats forever.

**Fix:** Add `Force bool` to `messaging.BackendInstallRequest` and thread it through `NodeCommandSender` → `RemoteUnloaderAdapter`. `UpgradeBackend` (and the reconciler's pending-op drain when retrying an upgrade op) sets `force=true`; routine load events and admin install endpoints keep `force=false`. On the worker, `force=true`:

- stops every live process that uses this backend (`resolveProcessKeys` for peer replicas, plus the exact request `processKey` for model-prefixed keys)
- skips the `findBackend` early return so we don't restart the same stale binary
- passes `force=true` into `gallery.InstallBackendFromGallery` so the on-disk artifact is overwritten
- brings up a fresh process at the same `processKey` on a new port once the gallery install completes
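A minimal sketch of the worker-side `force=true` branch. The interface and helper names below are stand-ins reconstructed from this description (`installFromGallery` here wraps what the PR calls `gallery.InstallBackendFromGallery`; all signatures are assumptions, not LocalAI's actual API):

```go
package worker

// Illustrative supervisor surface; LocalAI's actual helpers differ.
type backendSupervisor interface {
	resolveProcessKeys(backend string) []string          // process keys of peer replicas using this backend
	stopBackendExact(processKey string)                  // stop one process and drop its (key, addr) entry
	installFromGallery(backend string, force bool) error // stand-in for gallery.InstallBackendFromGallery
	startBackend(processKey string) (addr string, err error)
}

// forceReinstall is the force=true path: stop every consumer of the
// backend, overwrite the on-disk artifact, then restart the requested
// slot. It never takes the "already running" short-circuit.
func forceReinstall(s backendSupervisor, backend, processKey string) (string, error) {
	for _, key := range s.resolveProcessKeys(backend) {
		s.stopBackendExact(key) // peer replicas sharing the artifact
	}
	s.stopBackendExact(processKey) // the exact (model, replica) slot

	// force=true tells the gallery to re-download and overwrite even
	// though an install already exists on disk.
	if err := s.installFromGallery(backend, true); err != nil {
		return "", err
	}

	// Fresh process at the same processKey, on a new port.
	return s.startBackend(processKey)
}
```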
### 2. Stale `(key, addr)` survives a dead process and traps the reconciler in a retry loop

`installBackend`'s "already running" branch read `getAddr` without verifying the process was alive. A gRPC backend that died without the supervisor noticing left a stale entry. The reconciler then dialed that address, got `ECONNREFUSED`, marked the replica failed, retried install — and the supervisor said `"Backend already running for model replica … addr=127.0.0.1:30230"` again. Loop forever, exactly what we observed on a Jetson Thor node whose `llama-cpp` process had died but whose supervisor record persisted (logs in the tracking issue).

**Fix:** Verify `s.isRunning(processKey)` before trusting `getAddr`; if the entry is stale, `stopBackendExact` cleans up and we fall through to a real install.
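A minimal sketch of the guarded fast path, assuming a map-backed supervisor; `isRunning`, `getAddr`, and `stopBackendExact` are named in this PR, but their shapes here are assumptions:

```go
package worker

// Illustrative supervisor state; LocalAI's real bookkeeping differs.
type supervisor struct {
	addrs map[string]string     // processKey -> address of the supervised process
	alive func(key string) bool // stand-in for a real liveness probe
}

func (s *supervisor) getAddr(key string) (string, bool) { a, ok := s.addrs[key]; return a, ok }
func (s *supervisor) isRunning(key string) bool         { return s.alive(key) }
func (s *supervisor) stopBackendExact(key string)       { delete(s.addrs, key) }

// tryFastPath returns (addr, true) only when a live process already
// serves processKey. The old code trusted getAddr unconditionally, so
// a dead process left a stale (key, addr) entry and the reconciler
// looped on ECONNREFUSED forever.
func (s *supervisor) tryFastPath(processKey string) (string, bool) {
	addr, ok := s.getAddr(processKey)
	if !ok {
		return "", false // no record: caller performs a real install
	}
	if s.isRunning(processKey) {
		return addr, true // genuinely running: safe to reuse
	}
	s.stopBackendExact(processKey) // stale entry: clean up first
	return "", false               // then fall through to a real install
}
```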
## Backwards compatibility

- The new `Force` field is `omitempty` — older workers ignore it, and absence is interpreted as `false`, which preserves their current "already running → return addr" behavior.
- The signature change on `NodeCommandSender.InstallBackend` is internal-only; one test fake (`router_test.go`) is updated alongside.
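A small, runnable illustration of why `omitempty` keeps the wire format backwards-compatible. The field set is trimmed to the essentials; only `Force` is from this PR, and the real `messaging.BackendInstallRequest` carries more fields:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Trimmed illustration of messaging.BackendInstallRequest.
type BackendInstallRequest struct {
	Backend string `json:"backend"`
	Force   bool   `json:"force,omitempty"`
}

func main() {
	routine, _ := json.Marshal(BackendInstallRequest{Backend: "llama-cpp"})
	upgrade, _ := json.Marshal(BackendInstallRequest{Backend: "llama-cpp", Force: true})

	// Routine installs serialize exactly as before the PR, so older
	// workers that don't know the field see an unchanged payload:
	fmt.Println(string(routine)) // {"backend":"llama-cpp"}
	fmt.Println(string(upgrade)) // {"backend":"llama-cpp","force":true}
}
```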
## Test plan

- `go test ./core/services/messaging` — passes
- `go test ./core/services/nodes` — passes (108s suite, includes router/reconciler)
- Manual upgrade path: trigger an upgrade and verify the worker logs `"Force install: stopping running backend before reinstall"` then `"Installing backend from gallery"` with `force=true`
- Manual stale-entry path: kill a backend process out from under the supervisor (e.g. `kill -9` of the per-model `local-ai` child) without restarting the worker; trigger any model load that routes to that node; verify the worker logs `"Stale process entry for backend (dead process); cleaning up before reinstall"` and the load completes against a freshly started process instead of looping on `ECONNREFUSED`
## Build note

The pre-existing `core/backend/transcript.go:189` build break (proto regen pending for the word-level-timestamps feature added in af83518) blocks `core/cli` and `core/http/endpoints/localai` package tests on master HEAD. This PR doesn't touch that path; tests in unaffected packages pass.

🤖 Generated with Claude Code