fix(distributed): route per request across loaded replicas + cache probeHealth by localai-bot · Pull Request #9968 · mudler/LocalAI

localai-bot · 2026-05-24T07:45:19Z

Summary

Restores cross-node load balancing in distributed mode. Two related fixes:

Bypass the local *Model cache in distributed mode. ModelLoader.Load / LoadModel were returning a cached Model whose embedded InFlightTrackingClient is bound to a single (nodeID, replicaIndex). After the first request loaded a model and cached the wrapper, every subsequent request reused the same client and pinned to whichever node won the first pick — even after the reconciler scaled the model out to a second node. SmartRouter.Route (→ FindAndLockNodeWithModel) now runs per request so the existing in_flight ASC, last_used ASC, available_vram DESC round-robin actually takes effect.
Cache probeHealth per (nodeID, addr) with a 30s TTL + singleflight. With per-request routing, every inference call now lands in probeHealth. llama.cpp-style gRPC backends serialize HealthCheck against active Predict, so a burst of new requests stalled on the probe to a node already mid-stream, tripping the 2s timeout and falling through to the install path. The new probeCache (core/services/nodes/probe_cache.go) memoizes successful probes, coalesces concurrent first-time probes via singleflight, and invalidates on failure so the staleness-recovery path still fires. TTL matches pkg/model/model.go's healthCheckTTL so the single-process and distributed paths share a staleness budget.
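The caching behaviour described in the second fix can be sketched as follows. This is a minimal illustration, not the code in probe_cache.go: the type and method names (probeCache, Probe, probeKey), the injected probe callback, and the hand-rolled coalescing (a standard-library stand-in for singleflight) are all assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// probeKey identifies one backend address on one node (hypothetical type).
type probeKey struct {
	nodeID string
	addr   string
}

// inflight is a probe that is currently running; waiters block on done.
type inflight struct {
	done chan struct{}
	err  error
}

// probeCache memoizes successful probes for ttl and coalesces concurrent
// probes for the same key. It is an illustrative stand-in only.
type probeCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	okUntil map[probeKey]time.Time
	running map[probeKey]*inflight
}

func newProbeCache(ttl time.Duration) *probeCache {
	return &probeCache{
		ttl:     ttl,
		okUntil: map[probeKey]time.Time{},
		running: map[probeKey]*inflight{},
	}
}

// Probe returns nil straight from the cache when a recent success exists.
// Otherwise it runs probe once per key and shares the result with any
// callers that arrive while it is still running.
func (c *probeCache) Probe(nodeID, addr string, probe func() error) error {
	key := probeKey{nodeID, addr}

	c.mu.Lock()
	if until, ok := c.okUntil[key]; ok && time.Now().Before(until) {
		c.mu.Unlock()
		return nil // fresh success: skip the gRPC round-trip
	}
	if call, ok := c.running[key]; ok {
		c.mu.Unlock()
		<-call.done // another caller is already probing this key
		return call.err
	}
	call := &inflight{done: make(chan struct{})}
	c.running[key] = call
	c.mu.Unlock()

	call.err = probe()

	c.mu.Lock()
	delete(c.running, key)
	if call.err == nil {
		c.okUntil[key] = time.Now().Add(c.ttl)
	} else {
		delete(c.okUntil, key) // failures are never cached
	}
	c.mu.Unlock()
	close(call.done)
	return call.err
}

var errUnreachable = errors.New("backend unreachable")

func main() {
	cache := newProbeCache(30 * time.Second)
	sent := 0
	healthy := func() error { sent++; return nil }

	_ = cache.Probe("node-a", "10.0.0.1:50051", healthy)
	_ = cache.Probe("node-a", "10.0.0.1:50051", healthy) // served from cache
	fmt.Println("health checks sent:", sent)

	err := cache.Probe("node-b", "10.0.0.2:50051", func() error { return errUnreachable })
	fmt.Println("node-b probe:", err)
}
```

Dropping the cached entry on failure means the next caller probes again, which is what lets a staleness-recovery path run instead of trusting an old success.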

To keep the two layers of the policy in lock-step, the replica-selection rule is now defined in a single place (PickBestReplica in core/services/nodes/replicapicker.go). The SQL ORDER BY in FindAndLockNodeWithModel is documented as a faithful mirror of that function, and a new registry_test mirror spec asserts both pick the same replica on a seeded multi-tier dataset — so future tweaks to either side fail a test until the other side is updated. A TODO(distributed-cache) in pkg/model/loader.go flags the planned per-frontend rotating-replica cache that would reuse PickBestReplica against an in-memory snapshot to skip the per-request DB round-trip on hot paths.
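As a rough sketch of the ordering named above (in_flight ASC, last_used ASC, available_vram DESC), a pure Go picker could look like the following. The Replica struct, its field names, and the pickBestReplica signature are assumptions for illustration; the actual PickBestReplica in replicapicker.go may differ.

```go
package main

import (
	"fmt"
	"time"
)

// Replica is a hypothetical in-memory view of one loaded replica.
type Replica struct {
	NodeID        string
	ReplicaIndex  int
	InFlight      int
	LastUsed      time.Time
	AvailableVRAM uint64
}

// better reports whether a should be preferred over b, applying the tiers
// in order: fewer in-flight requests, then older last use, then more VRAM.
func better(a, b Replica) bool {
	if a.InFlight != b.InFlight {
		return a.InFlight < b.InFlight
	}
	if !a.LastUsed.Equal(b.LastUsed) {
		return a.LastUsed.Before(b.LastUsed)
	}
	return a.AvailableVRAM > b.AvailableVRAM
}

// pickBestReplica returns the preferred candidate, or false when there are
// none. On a full tie the earliest candidate in the slice is kept.
func pickBestReplica(cands []Replica) (Replica, bool) {
	if len(cands) == 0 {
		return Replica{}, false
	}
	best := cands[0]
	for _, c := range cands[1:] {
		if better(c, best) {
			best = c
		}
	}
	return best, true
}

func main() {
	now := time.Now()
	cands := []Replica{
		{NodeID: "dgx-spark1", InFlight: 6, LastUsed: now},
		{NodeID: "nvidia-thor1", InFlight: 0, LastUsed: now},
	}
	if r, ok := pickBestReplica(cands); ok {
		fmt.Println("picked:", r.NodeID)
	}
}
```

Keeping the comparison in one small function is what makes a mirror test practical: the same seeded rows can be fed to the SQL query and to the picker, and the two answers compared.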

Reproducer the original report matched

Two healthy backend nodes (dgx-spark1, nvidia-thor1) with the same model loaded:

dgx-spark1     loaded   in_flight=6
nvidia-thor1   loaded   in_flight=0

All requests landed on dgx-spark1. After this PR a second concurrent request goes to nvidia-thor1.
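The pinning effect can be reproduced in a toy model. The sketch below is not LocalAI code: the node, router, and loader types, and the simulate helper, are hypothetical, and requests are assumed to stay in flight for the whole run. With the cached client every request follows the first pick; with the cache bypassed, each request is routed to the least-loaded node.

```go
package main

import "fmt"

// node is a hypothetical backend node that has the model loaded.
type node struct {
	id       string
	inFlight int
}

// router stands in for per-request routing: it picks the node with the
// fewest in-flight requests, keeping the earlier node on a tie.
type router struct{ nodes []*node }

func (r *router) pick() *node {
	best := r.nodes[0]
	for _, n := range r.nodes[1:] {
		if n.inFlight < best.inFlight {
			best = n
		}
	}
	return best
}

// loader models the two behaviours: reuse the first pick forever (old),
// or ask the router on every request (new, distributed mode).
type loader struct {
	router      *router
	cached      *node
	bypassCache bool
}

func (l *loader) load() *node {
	n := l.cached
	if l.bypassCache || n == nil {
		n = l.router.pick()
	}
	if !l.bypassCache {
		l.cached = n // pins all later requests to this node
	}
	n.inFlight++
	return n
}

// simulate issues the given number of long-running requests and returns
// the resulting in-flight count per node.
func simulate(bypassCache bool, requests int) map[string]int {
	nodes := []*node{{id: "dgx-spark1"}, {id: "nvidia-thor1"}}
	l := &loader{router: &router{nodes: nodes}, bypassCache: bypassCache}
	for i := 0; i < requests; i++ {
		l.load()
	}
	out := map[string]int{}
	for _, n := range nodes {
		out[n.id] = n.inFlight
	}
	return out
}

func main() {
	fmt.Println("cached client:", simulate(false, 6))
	fmt.Println("per-request routing:", simulate(true, 6))
}
```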

Test plan

go test ./core/services/nodes/ -count=1 (210 specs, real PG via testcontainers) — green
go test ./pkg/model/ -count=1 — green
gofmt -l clean on touched files
New PickBestReplica unit specs (7) cover each tiebreaker tier + edge cases
New SQL-vs-picker mirror spec ("agrees with PickBestReplica on a seeded dataset") catches drift
New probeCache unit specs (7) cover cold/TTL/failure/singleflight/per-key independence/disabled mode/Invalidate
Cluster smoke: redeploy and confirm a second concurrent chat completion lands on the previously-idle node

…thModel

Lifts the replica-selection policy (in_flight ASC, last_used ASC, available_vram DESC) out of the SQL ORDER BY into a pure Go function in the new replicapicker.go. The SQL clause keeps its FOR UPDATE atomicity and remains the production path used by SmartRouter; PickBestReplica is the canonical implementation that the future per-frontend rotating replica cache (TODO referenced from pkg/model) will call against an in-memory snapshot without paying a DB round-trip per inference.

A new registry_test mirror spec seeds a multi-tier scenario and asserts both layers pick the same replica, so any future tweak to either side fails the test until the other side is updated.

No behavior change.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Two related fixes that together restore load balancing across loaded replicas of the same model.

1. ModelLoader.Load and LoadModel bypass the local *Model cache when modelRouter is set. The cached *Model wraps an InFlightTrackingClient bound to a single (nodeID, replicaIndex). Reusing it pinned every subsequent request to whichever node won the very first pick, so FindAndLockNodeWithModel's round-robin never got a chance to run, even after the reconciler scaled the model out to a second node. In distributed mode SmartRouter.Route now runs per request, and PickBestReplica picks the least-loaded replica each time. SmartRouter has its own coalescing (advisory DB lock for first-time loads + singleflight on the backend.install RPC), so concurrent first requests for a not-yet-loaded model still produce a single worker-side install.

2. SmartRouter.probeHealth memoizes successful gRPC HealthCheck results in a new probeCache (probe_cache.go) with a 30s TTL. With per-request routing every inference call hits probeHealth, and llama.cpp-style backends serialize HealthCheck behind active Predict, so a burst of incoming requests stalled on the probe to a node already mid-stream, tripping the 2s timeout and falling through to the install path. singleflight collapses N concurrent first-time probes for the same (node, addr) into one round-trip, failed probes invalidate the entry so the staleness-recovery path still triggers, and the TTL matches pkg/model/model.go's healthCheckTTL so the single-process and distributed paths share a staleness budget. The background HealthMonitor still reaps actually-dead backends within ~45s.

The bypass introduces one short FindAndLockNodeWithModel transaction per inference. A TODO in pkg/model/loader.go documents the future per-modelID rotating-replica cache that would reuse PickBestReplica against an in-memory snapshot and skip the DB round-trip for hot paths.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code]

… callback

errcheck (new-from-merge-base) flagged the new distributed-mode branch in Load() for ignoring ml.ShutdownModel's return value. The grandfathered local-mode branch a few lines below ignores it too, but a fresh identical pattern doesn't carry the baseline exemption. Log the error via xlog.Warn: the connection-evicting wrapper is the failure path, so silent eviction errors would lose useful debugging context anyway.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code]
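The pattern behind that lint fix, sketched with stand-ins: evictModel, warnFunc, and errBusy are hypothetical names, and the standard library's slog.Warn is used here in place of LocalAI's xlog.Warn, whose exact signature is not shown in this PR text.

```go
package main

import (
	"errors"
	"log/slog"
)

// warnFunc has the same shape as slog.Warn, so a test can swap in a recorder.
type warnFunc func(msg string, args ...any)

// evictModel shuts a model down and logs, rather than discards, any error.
// The shutdown callback stands in for a call like ml.ShutdownModel.
func evictModel(name string, shutdown func(string) error, warn warnFunc) {
	if err := shutdown(name); err != nil {
		warn("model shutdown failed during eviction", "model", name, "error", err)
	}
}

var errBusy = errors.New("backend still busy")

func main() {
	failing := func(string) error { return errBusy }
	evictModel("example-model", failing, slog.Warn)
}
```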

mudler added 2 commits May 24, 2026 07:44

mudler enabled auto-merge (squash) May 24, 2026 07:47

mudler previously approved these changes May 24, 2026

mudler dismissed their stale review via 1fd1702 May 24, 2026 07:57

mudler approved these changes May 24, 2026

mudler merged commit 8bbe89a into master May 24, 2026
56 checks passed

mudler deleted the fix/distributed-per-request-routing branch May 24, 2026 08:15

localai-bot added the bug Something isn't working label May 24, 2026

mudler merged 3 commits into master from fix/distributed-per-request-routing

2 participants
