fix(distributed): route per request across loaded replicas + cache probeHealth #9968
Merged
…thModel

Lifts the replica-selection policy (in_flight ASC, last_used ASC, available_vram DESC) out of the SQL ORDER BY into a pure Go function in the new replicapicker.go. The SQL clause keeps its FOR UPDATE atomicity and remains the production path used by SmartRouter; PickBestReplica is the canonical implementation that the future per-frontend rotating replica cache (TODO referenced from pkg/model) will call against an in-memory snapshot without paying a DB round-trip per inference.

A new registry_test mirror spec seeds a multi-tier scenario and asserts both layers pick the same replica, so any future tweak to either side fails the test until the other side is updated. No behavior change.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
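The three-tier ordering described in this commit (in_flight ASC, last_used ASC, available_vram DESC) can be sketched as a pure function over an in-memory snapshot. This is a minimal illustration only: the `Replica` struct, its field names, and the lower-case `pickBestReplica` are assumptions made for the example and are not taken from replicapicker.go.

```go
package main

import (
	"fmt"
	"time"
)

// Replica is a hypothetical in-memory view of one loaded replica.
// Field names are illustrative, not the project's actual schema.
type Replica struct {
	NodeID        string
	ReplicaIndex  int
	InFlight      int
	LastUsed      time.Time
	AvailableVRAM uint64
}

// less reports whether a should be preferred over b:
// fewer in-flight requests first, then least recently used,
// then the larger amount of available VRAM.
func less(a, b Replica) bool {
	if a.InFlight != b.InFlight {
		return a.InFlight < b.InFlight
	}
	if !a.LastUsed.Equal(b.LastUsed) {
		return a.LastUsed.Before(b.LastUsed)
	}
	return a.AvailableVRAM > b.AvailableVRAM
}

// pickBestReplica returns the preferred candidate, or false when
// the candidate list is empty.
func pickBestReplica(cands []Replica) (Replica, bool) {
	if len(cands) == 0 {
		return Replica{}, false
	}
	best := cands[0]
	for _, c := range cands[1:] {
		if less(c, best) {
			best = c
		}
	}
	return best, true
}

func main() {
	now := time.Now()
	replicas := []Replica{
		{NodeID: "dgx-spark1", InFlight: 1, LastUsed: now, AvailableVRAM: 64},
		{NodeID: "nvidia-thor1", InFlight: 0, LastUsed: now.Add(-time.Minute), AvailableVRAM: 32},
	}
	if best, ok := pickBestReplica(replicas); ok {
		fmt.Println("picked:", best.NodeID)
	}
}
```

In this example nvidia-thor1 is selected because it has fewer in-flight requests; the later tiers only matter when the earlier ones tie. Keeping the comparison in one small function is what makes a mirror test against the SQL ORDER BY practical.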
Two related fixes that together restore load balancing across loaded replicas of the same model.

1. ModelLoader.Load and LoadModel bypass the local *Model cache when modelRouter is set. The cached *Model wraps an InFlightTrackingClient bound to a single (nodeID, replicaIndex); reusing it pinned every subsequent request to whichever node won the very first pick, so FindAndLockNodeWithModel's round-robin never got a chance to run even after the reconciler scaled the model out to a second node. In distributed mode SmartRouter.Route now runs per request, and PickBestReplica picks the least-loaded replica each time. SmartRouter has its own coalescing (advisory DB lock for first-time loads + singleflight on the backend.install RPC), so concurrent first requests for a not-yet-loaded model still produce a single worker-side install.

2. SmartRouter.probeHealth memoizes successful gRPC HealthCheck results in a new probeCache (probe_cache.go) with a 30s TTL. With per-request routing every inference call hits probeHealth, and llama.cpp-style backends serialize HealthCheck behind active Predict, so a burst of incoming requests stalled on the probe to a node already mid-stream, tripping the 2s timeout and falling through to the install path. singleflight collapses N concurrent first-time probes for the same (node, addr) into one round-trip, failed probes invalidate the entry so the staleness-recovery path still triggers, and the TTL matches pkg/model/model.go's healthCheckTTL so the single-process and distributed paths share a staleness budget. The background HealthMonitor still reaps actually-dead backends within ~45s.

The bypass introduces one short FindAndLockNodeWithModel transaction per inference. A TODO in pkg/model/loader.go documents the future per-modelID rotating-replica cache that would reuse PickBestReplica against an in-memory snapshot and skip the DB round-trip for hot paths.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
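The probe-caching behavior described above (memoize successes for a TTL, coalesce concurrent first-time probes, drop the entry on failure) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the contents of probe_cache.go: the type and method names, the injectable clock, and the hand-rolled in-flight map (standing in for singleflight) are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// demoTTL mirrors the 30s TTL mentioned in the commit message.
const demoTTL = 30 * time.Second

var errUnhealthy = errors.New("backend unhealthy")

// probeKey identifies one backend endpoint (hypothetical shape).
type probeKey struct{ NodeID, Addr string }

// call tracks a probe that is currently running for one key.
type call struct {
	done chan struct{}
	err  error
}

type probeCache struct {
	mu       sync.Mutex
	ttl      time.Duration
	now      func() time.Time // injectable clock, for tests
	okUntil  map[probeKey]time.Time
	inflight map[probeKey]*call
}

func newProbeCache(ttl time.Duration, now func() time.Time) *probeCache {
	return &probeCache{ttl: ttl, now: now,
		okUntil: map[probeKey]time.Time{}, inflight: map[probeKey]*call{}}
}

// Check returns nil without probing while a successful result is fresh.
// Otherwise concurrent callers for the same key share a single probe.
// A failed probe leaves no cached entry, so the next call probes again.
func (c *probeCache) Check(k probeKey, probe func() error) error {
	c.mu.Lock()
	if exp, ok := c.okUntil[k]; ok && c.now().Before(exp) {
		c.mu.Unlock()
		return nil
	}
	if cl, ok := c.inflight[k]; ok {
		c.mu.Unlock()
		<-cl.done // wait for the probe another caller started
		return cl.err
	}
	cl := &call{done: make(chan struct{})}
	c.inflight[k] = cl
	c.mu.Unlock()

	cl.err = probe() // run outside the lock; probes may be slow

	c.mu.Lock()
	delete(c.inflight, k)
	if cl.err == nil {
		c.okUntil[k] = c.now().Add(c.ttl)
	} else {
		delete(c.okUntil, k)
	}
	c.mu.Unlock()
	close(cl.done)
	return cl.err
}

// fakeClock is a manually advanced clock for deterministic demos.
type fakeClock struct{ t time.Time }

func (f *fakeClock) Now() time.Time          { return f.t }
func (f *fakeClock) Advance(d time.Duration) { f.t = f.t.Add(d) }

func main() {
	clk := &fakeClock{}
	pc := newProbeCache(demoTTL, clk.Now)
	k := probeKey{NodeID: "dgx-spark1", Addr: "10.0.0.1:50051"}
	probes := 0
	healthy := func() error { probes++; return nil }

	_ = pc.Check(k, healthy)
	_ = pc.Check(k, healthy) // served from cache
	clk.Advance(demoTTL + time.Second)
	_ = pc.Check(k, healthy) // TTL expired, probes again
	fmt.Println("probes issued:", probes)
}
```

Only successes are cached, so a backend that starts failing is re-probed on every call until it recovers, which is what keeps the staleness-recovery path reachable.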
mudler
previously approved these changes
May 24, 2026
… callback

errcheck (new-from-merge-base) flagged the new distributed-mode branch in Load() for ignoring ml.ShutdownModel's return value. The grandfathered local-mode branch a few lines below ignores it too, but a fresh identical pattern doesn't carry the baseline exemption. Log the error via xlog.Warn: the connection-evicting wrapper is the failure path, so silent eviction errors would lose useful debugging context anyway.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
mudler
approved these changes
May 24, 2026
Summary
Restores cross-node load balancing in distributed mode. Two related fixes:
1. Bypass the *Model cache in distributed mode. ModelLoader.Load / LoadModel were returning a cached Model whose embedded InFlightTrackingClient is bound to a single (nodeID, replicaIndex). After the first request loaded a model and cached the wrapper, every subsequent request reused the same client and stayed pinned to whichever node won the first pick, even after the reconciler scaled the model out to a second node. SmartRouter.Route (→ FindAndLockNodeWithModel) now runs per request, so the existing in_flight ASC, last_used ASC, available_vram DESC round-robin actually takes effect.
2. Cache probeHealth per (nodeID, addr) with a 30s TTL + singleflight. With per-request routing, every inference call now lands in probeHealth. llama.cpp-style gRPC backends serialize HealthCheck against active Predict, so a burst of new requests stalled on the probe to a node already mid-stream, tripping the 2s timeout and falling through to the install path. The new probeCache (core/services/nodes/probe_cache.go) memoizes successful probes, coalesces concurrent first-time probes via singleflight, and invalidates on failure so the staleness-recovery path still fires. The TTL matches pkg/model/model.go's healthCheckTTL so the single-process and distributed paths share a staleness budget.

To keep the two layers of the policy in lock-step, the replica-selection rule is now defined in a single place (PickBestReplica in core/services/nodes/replicapicker.go). The SQL ORDER BY in FindAndLockNodeWithModel is documented as a faithful mirror of that function, and a new registry_test mirror spec asserts both pick the same replica on a seeded multi-tier dataset, so future tweaks to either side fail a test until the other side is updated. A TODO(distributed-cache) in pkg/model/loader.go flags the planned per-frontend rotating-replica cache that would reuse PickBestReplica against an in-memory snapshot to skip the per-request DB round-trip on hot paths.

Reproducer the original report matched
With two healthy backend nodes (dgx-spark1, nvidia-thor1) that have the same model loaded, all requests landed on dgx-spark1. After this PR a second concurrent request goes to nvidia-thor1.

Test plan
- go test ./core/services/nodes/ -count=1 (210 specs, real PG via testcontainers): green
- go test ./pkg/model/ -count=1: green
- gofmt -l clean on touched files
- PickBestReplica unit specs (7) cover each tiebreaker tier + edge cases
- probeCache unit specs (7) cover cold/TTL/failure/singleflight/per-key independence/disabled mode/Invalidate