fix(distributed): split NATS backend.upgrade off install + dedup loads#9717

Merged
mudler merged 11 commits into master from fix/distributed-nats-backend-upgrade-split
May 8, 2026

@localai-bot
Collaborator

Summary

The live cluster was hitting 5-minute `nats: timeout` errors on routine `/v1/chat/completions` model loads whenever a slow gallery upgrade was in flight on the same worker. The nats.go client delivers each subscription's callbacks on a single goroutine, so synchronous `backend.install` handlers head-of-line-blocked everything else.

This PR fixes the cascade in four layers:

  • Subject split — `backend.upgrade` is now its own NATS subject with its own subscription, so the slow force-reinstall path can no longer block `backend.install` on the same worker.
  • Goroutine-wrapped handlers + per-backend mutex — each install/upgrade message spawns a goroutine, so unrelated backends install in parallel; a per-backend `sync.Mutex` keeps two operations on the same backend serialized at the gallery directory.
  • Master-side singleflight — six concurrent chat completions for one not-yet-loaded model now produce one NATS round-trip via `golang.org/x/sync/singleflight` keyed by `(nodeID, backend, modelID, replica)`.
  • Self-healing rolling update — master keeps the legacy `backend.install Force=true` path as a `nats.ErrNoResponders` fallback for old workers; new workers still honor `req.Force=true` for old masters. Both directions converge.

Timeouts: `InstallBackend` 5min → 3min, new `UpgradeBackend` 15min, fallback 15min.

Test plan

  • `go test ./core/services/messaging/ ./core/services/nodes/ ./core/cli/` — 192/192 nodes specs, 41/41 cli specs, all messaging specs pass
  • `go test -race` clean across all three packages
  • `go vet` clean on touched packages
  • New specs cover: per-backend mutex serialization (same/different names), backend.upgrade subscription, `UpgradeBackend` adapter, `UpgradeBackend` rolling-update fallback (NoResponders → install Force=true), singleflight coalescing + non-coalescing
  • Smoke-test in the dev cluster: trigger an upgrade on `agx-orin-slow` while `/v1/chat/completions` lands on `dgx-spark` — confirm the chat load completes without 5-minute timeout
  • Mixed-version sanity check: deploy new master against an old worker (or vice-versa) and confirm fallback path fires (`Falling back to legacy backend.install Force=true (old worker)` log line on master)
  • CI lint gate (golangci-lint not on local PATH)

Follow-ups (not blocking)

  • Reconciler's `OpBackendUpgrade` drain path duplicates the `DistributedBackendManager` fallback logic; either add a reconciler spec or extract a shared helper.
  • Singleflight failure cascade: if the leader's NATS round-trip fails, all coalesced callers see the same error. This is an acceptable trade-off for the load-amplification fix, but worth a doc note.
  • Phase 6 cleanup: drop `BackendInstallRequest.Force`, `installWithForceFallback`, both NoResponders fallback branches, and the `req.Force` honoring in worker once every deployed worker is on this commit or newer.

mudler added 11 commits May 8, 2026 08:08
Splits the slow force-reinstall path off backend.install so it can run on
its own subscription goroutine, eliminating head-of-line blocking between
routine model loads and full gallery upgrades.

Wire-level Force flag on BackendInstallRequest is kept for one release as
the rolling-update fallback target; doc note marks it deprecated.

Assisted-by: Claude:claude-sonnet-4-6
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…rvisor

Different backend names lock independently; same backend serializes. This
is the synchronization primitive used by the upcoming concurrent install
handler — without it, wrapping the NATS callback in a goroutine would
race the gallery directory when two requests target the same backend.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
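
To make the locking rule above concrete, here is a minimal sketch of a per-name mutex registry. It is not the PR's `lockBackend` implementation: the `backendLocks` type, its `lock` method, and the backend names are hypothetical and only illustrate "same name serializes, different names are independent".

```go
package main

import (
	"fmt"
	"sync"
)

// backendLocks hands out one mutex per backend name.
// Hypothetical type for illustration; the PR's helper is lockBackend.
type backendLocks struct {
	mu    sync.Mutex             // guards the map itself
	locks map[string]*sync.Mutex // one entry per backend name seen so far
}

// lock blocks until the named backend is free and returns its unlock func.
func (b *backendLocks) lock(name string) (unlock func()) {
	b.mu.Lock()
	if b.locks == nil {
		b.locks = make(map[string]*sync.Mutex)
	}
	m, ok := b.locks[name]
	if !ok {
		m = &sync.Mutex{}
		b.locks[name] = m
	}
	b.mu.Unlock() // release the registry lock before waiting on the backend

	m.Lock()
	return m.Unlock
}

func main() {
	var locks backendLocks
	unlockA := locks.lock("llama-cpp")
	unlockB := locks.lock("whisper") // different name: does not wait for llama-cpp
	unlockB()
	unlockA()
	fmt.Println("two different backends were held at the same time")
}
```

In this sketch the map only grows. That is probably fine for a small, bounded set of backend names, but it is an assumption of the sketch rather than something the PR states.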
NATS subscriptions deliver messages serially on a single per-subscription
goroutine. With a synchronous install handler, a multi-minute gallery
download would head-of-line-block every other install request to the
same worker — manifesting upstream as a 5-minute "nats: timeout" on
unrelated routine model loads.

The body now runs in its own goroutine, with a per-backend mutex
(lockBackend) protecting the gallery directory from concurrent operations
on the same backend. Different backend names install in parallel.

Backward-compat: req.Force=true is still honored here, so an older master
that hasn't been updated to send on backend.upgrade keeps working.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…path

Slow force-reinstall now lives on its own NATS subscription, so a
multi-minute gallery pull cannot head-of-line-block the routine
backend.install handler on the same worker. Same per-backend mutex
guards both — concurrent install + upgrade for the same backend
serialize at the gallery directory; different backends are independent.

upgradeBackend stops every live process for the backend, force-installs
from gallery, and re-registers. It does not start a new process — the
next backend.install will spawn one with the freshly-pulled binary.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…e from InstallBackend

Master now sends to backend.upgrade for force-reinstall, with a
nats.ErrNoResponders fallback to the legacy backend.install Force=true
path so a rolling update with a new master + an old worker still
converges. The Force parameter leaves the public Go API surface
entirely — only the internal fallback sets it on the wire.

InstallBackend timeout drops 5min -> 3min (most replies are sub-second
since the worker short-circuits on already-running or already-installed).
UpgradeBackend timeout is 15min, sized for real-world Jetson-on-WiFi
gallery pulls.

Updates the admin install HTTP endpoint
(core/http/endpoints/localai/nodes.go) to the new signature too.

router_test.go's fakeUnloader does not yet implement the new interface
shape; Task 3.2 will catch it up before the next package-level test run.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
InstallBackend lost its force bool param (Force is no longer part of the
public Go API; only the internal upgrade-fallback path sets it on the
wire). The fake gained an UpgradeBackend method. It records both call
slices and provides an installHook concurrency seam for the upcoming
singleflight tests.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…e fallback

Task 3.1 changed the master to publish UpgradeBackend on the new
backend.upgrade subject; the existing UpgradeBackend tests scripted the
old install subject and so all 3 began failing as expected. Updates them
to script SubjectNodeBackendUpgrade with BackendUpgradeReply.

Adds two new specs for the rolling-update fallback:
  - ErrNoResponders on backend.upgrade triggers a backend.install
    Force=true retry on the same node.
  - Non-NoResponders errors propagate to the caller unchanged.

scriptedMessagingClient gains scriptNoResponders (real nats sentinel) and
scriptReplyMatching (predicate-matched canned reply, used to assert that
the fallback path actually sets Force=true on the install retry).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…ingleflight

Six simultaneous chat completions for the same not-yet-loaded model were
observed firing six independent NATS install requests, each serializing
through the worker's per-subscription goroutine and amplifying queue
depth. SmartRouter now wraps the NATS round-trip in a singleflight.Group
keyed by (nodeID, backend, modelID, replica): N concurrent identical
loads share one round-trip and one reply.

Distinct (modelID, replica) keys still fire independent calls, so
multi-replica scaling and multi-model fan-out are unaffected.

fakeUnloader gains a sync.Mutex around its recording slices to keep
concurrent test goroutines race-clean.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
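
The coalescing behaviour can be sketched with the standard library alone. The `group` type below is a stripped-down stand-in for `singleflight.Group`, not the real package, and `loadKey`, `waiting`, and `coalescedLoads` are hypothetical helpers. The `|` separator in the key assumes the IDs never contain that character.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// call is one in-flight load whose result is shared with duplicates.
type call struct {
	done chan struct{} // closed when the leader finishes
	err  error
	dups int // followers currently waiting on this call
}

// group is a stripped-down, stdlib-only stand-in for singleflight.Group.
type group struct {
	mu       sync.Mutex
	inflight map[string]*call
}

// do runs fn once per key among overlapping callers. shared is true for
// callers that reused the leader's result instead of running fn.
func (g *group) do(key string, fn func() error) (shared bool, err error) {
	g.mu.Lock()
	if c, ok := g.inflight[key]; ok {
		c.dups++
		g.mu.Unlock()
		<-c.done
		return true, c.err
	}
	if g.inflight == nil {
		g.inflight = make(map[string]*call)
	}
	c := &call{done: make(chan struct{})}
	g.inflight[key] = c
	g.mu.Unlock()

	c.err = fn()

	g.mu.Lock()
	delete(g.inflight, key) // later callers start a fresh round-trip
	g.mu.Unlock()
	close(c.done)
	return false, c.err
}

// waiting reports how many followers are parked on key (demo helper).
func (g *group) waiting(key string) int {
	g.mu.Lock()
	defer g.mu.Unlock()
	if c, ok := g.inflight[key]; ok {
		return c.dups
	}
	return 0
}

// loadKey builds the coalescing key from the four fields named in the PR.
func loadKey(nodeID, backend, modelID string, replica int) string {
	return fmt.Sprintf("%s|%s|%s|%d", nodeID, backend, modelID, replica)
}

// coalescedLoads fires n concurrent loads for one key and returns how many
// simulated round-trips were actually sent.
func coalescedLoads(n int) int32 {
	var g group
	var sent int32
	release := make(chan struct{})
	key := loadKey("node-1", "llama-cpp", "my-model", 0)

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			g.do(key, func() error {
				atomic.AddInt32(&sent, 1)
				<-release // hold the simulated round-trip open
				return nil
			})
		}()
	}
	for g.waiting(key) < n-1 { // wait until every follower has joined
		runtime.Gosched()
	}
	close(release)
	wg.Wait()
	return atomic.LoadInt32(&sent)
}

func main() {
	fmt.Println("round-trips for 6 concurrent loads:", coalescedLoads(6)) // prints 1
}
```

Because followers receive the leader's `err` as-is, a failed round-trip fails every coalesced caller, which is the trade-off noted in the follow-ups.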
Two e2e test call sites still passed the trailing force bool that was
removed from RemoteUnloaderAdapter.InstallBackend in 9bde76d. Caught
by golangci-lint typecheck on the upgrade-split branch (master CI was
already green because these tests don't run in the standard test path).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…/worker

core/cli/worker.go grew to 1212 lines after the backend.upgrade split.
The CLI package was carrying backendSupervisor, NATS lifecycle handlers,
gallery install/upgrade orchestration, S3 file staging, and registration
helpers — all distributed-worker business logic that doesn't belong in
the cobra surface.

Move it to a new core/services/worker package, mirroring the existing
core/services/{nodes,messaging,galleryop} pattern. core/cli/worker.go
shrinks to ~19 lines: a kong-tagged shim that embeds worker.Config and
delegates Run.

No behavior change. All symbols stay unexported except Config and Run.
The three worker-specific tests (addr/replica/concurrency) move with
the code via git mv so history follows them.

Files split as:
  worker.go        - Run entry point
  config.go        - Config struct (kong tags retained, kong not imported)
  supervisor.go    - backendProcess, backendSupervisor, process lifecycle
  install.go       - installBackend, upgradeBackend, findBackend, lockBackend
  lifecycle.go     - subscribeLifecycleEvents (verbatim, decomposition is
                     a follow-up commit)
  file_staging.go  - subscribeFileStaging, isPathAllowed
  registration.go  - advertiseAddr, registrationBody, heartbeatBody, etc.
  reply.go         - replyJSON
  process_helpers.go - readLastLinesFromFile

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
… per-event handlers

The 226-line subscribeLifecycleEvents method packed eight NATS subscriptions
inline. Each grew context-shaped doc comments mixed with subscription
plumbing, making it hard to read any one handler without scrolling past the
others. Extract each handler into its own method on *backendSupervisor; the
subscriber becomes a thin 8-line dispatcher.

No behavior change: each method body is byte-equivalent to its corresponding
inline goroutine + handler. Doc comments that were attached to the inline
SubscribeReply calls migrate to the new method godocs.

Adding the next NATS subject is now a 2-line patch to the dispatcher plus
one new method, instead of grafting onto a monolith.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
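
The dispatcher shape described above might look roughly like this. Everything here is illustrative: the `supervisor` type, the `onInstall`/`onUpgrade` method names, and the `subscribeFunc` abstraction are hypothetical, and only two of the eight subjects are shown.

```go
package main

import "fmt"

// supervisor is a hypothetical stand-in for the worker's backendSupervisor.
type supervisor struct{ handled []string }

// One method per subject; the bodies are placeholders that record the call.
func (s *supervisor) onInstall(payload string) {
	s.handled = append(s.handled, "install "+payload)
}

func (s *supervisor) onUpgrade(payload string) {
	s.handled = append(s.handled, "upgrade "+payload)
}

// subscribeFunc abstracts "register this handler for this subject".
type subscribeFunc func(subject string, handler func(payload string))

// subscribeLifecycleEvents is the thin dispatcher: adding a subject means
// adding one registration here plus one new method.
func (s *supervisor) subscribeLifecycleEvents(subscribe subscribeFunc) {
	subscribe("backend.install", s.onInstall)
	subscribe("backend.upgrade", s.onUpgrade)
}

func main() {
	routes := map[string]func(string){}
	s := &supervisor{}
	s.subscribeLifecycleEvents(func(subject string, h func(string)) {
		routes[subject] = h
	})

	routes["backend.upgrade"]("llama-cpp")
	fmt.Println(s.handled) // prints [upgrade llama-cpp]
}
```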
@mudler mudler merged commit e5d7b84 into master May 8, 2026
50 of 51 checks passed
@mudler mudler deleted the fix/distributed-nats-backend-upgrade-split branch May 8, 2026 14:24
@localai-bot localai-bot added the bug Something isn't working label May 9, 2026
