fix(distributed): route per request across loaded replicas + cache probeHealth #9968

Merged
mudler merged 3 commits into master from fix/distributed-per-request-routing
May 24, 2026
@localai-bot
Collaborator

Summary

Restores cross-node load balancing in distributed mode. Two related fixes:

  • Bypass the local *Model cache in distributed mode. ModelLoader.Load / LoadModel were returning a cached Model whose embedded InFlightTrackingClient is bound to a single (nodeID, replicaIndex). After the first request loaded a model and cached the wrapper, every subsequent request reused the same client and pinned to whichever node won the first pick — even after the reconciler scaled the model out to a second node. SmartRouter.Route (→ FindAndLockNodeWithModel) now runs per request so the existing in_flight ASC, last_used ASC, available_vram DESC round-robin actually takes effect.
  • Cache probeHealth per (nodeID, addr) with a 30s TTL + singleflight. With per-request routing, every inference call now lands in probeHealth. llama.cpp-style gRPC backends serialize HealthCheck against active Predict, so a burst of new requests stalled on the probe to a node already mid-stream, tripping the 2s timeout and falling through to the install path. The new probeCache (core/services/nodes/probe_cache.go) memoizes successful probes, coalesces concurrent first-time probes via singleflight, and invalidates on failure so the staleness-recovery path still fires. TTL matches pkg/model/model.go's healthCheckTTL so the single-process and distributed paths share a staleness budget.
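
The caching behaviour described in the second bullet can be sketched roughly as follows. This is a minimal illustration, not the code in probe_cache.go: the `ProbeCache` type, the `Check` method, `errProbeFailed`, and the node addresses are all hypothetical, and a small hand-rolled in-flight map stands in for the singleflight package so the example needs only the standard library.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errProbeFailed is a hypothetical error used only by this sketch.
var errProbeFailed = errors.New("probe failed")

// probeKey mirrors the (nodeID, addr) cache key described in the PR.
type probeKey struct {
	NodeID string
	Addr   string
}

// inflightProbe lets late callers wait for a probe that is already running.
type inflightProbe struct {
	done chan struct{}
	err  error
}

// ProbeCache is an illustrative stand-in, not the PR's probeCache type.
type ProbeCache struct {
	mu       sync.Mutex
	ttl      time.Duration
	healthy  map[probeKey]time.Time // time of the last successful probe
	inflight map[probeKey]*inflightProbe
}

func NewProbeCache(ttl time.Duration) *ProbeCache {
	return &ProbeCache{
		ttl:      ttl,
		healthy:  make(map[probeKey]time.Time),
		inflight: make(map[probeKey]*inflightProbe),
	}
}

// Check returns nil from the cache while a successful probe is fresher than
// the TTL, joins a probe already in flight for the same key, and otherwise
// runs probe itself. Failures are never cached and drop any existing entry.
func (c *ProbeCache) Check(key probeKey, probe func() error) error {
	c.mu.Lock()
	if at, ok := c.healthy[key]; ok && time.Since(at) < c.ttl {
		c.mu.Unlock()
		return nil
	}
	if call, ok := c.inflight[key]; ok {
		c.mu.Unlock()
		<-call.done // wait for the leader's result
		return call.err
	}
	call := &inflightProbe{done: make(chan struct{})}
	c.inflight[key] = call
	c.mu.Unlock()

	call.err = probe() // run outside the lock so other keys are not blocked

	c.mu.Lock()
	delete(c.inflight, key)
	if call.err == nil {
		c.healthy[key] = time.Now()
	} else {
		delete(c.healthy, key)
	}
	c.mu.Unlock()
	close(call.done)
	return call.err
}

func main() {
	cache := NewProbeCache(30 * time.Second)
	probes := 0

	up := probeKey{NodeID: "nvidia-thor1", Addr: "10.0.0.2:50051"} // made-up address
	healthy := func() error { probes++; return nil }
	for i := 0; i < 5; i++ {
		_ = cache.Check(up, healthy)
	}
	fmt.Println("probes after 5 checks of a healthy node:", probes)

	down := probeKey{NodeID: "dgx-spark1", Addr: "10.0.0.1:50051"} // made-up address
	failing := func() error { probes++; return errProbeFailed }
	_ = cache.Check(down, failing)
	_ = cache.Check(down, failing)
	fmt.Println("probes after 2 more checks of a failing node:", probes)
}
```

In this sketch the five checks against the healthy node issue a single probe, while each check against the failing node probes again, which is the property that lets a staleness-recovery path keep firing.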

To keep the two layers of the policy in lock-step, the replica-selection rule is now defined in a single place (PickBestReplica in core/services/nodes/replicapicker.go). The SQL ORDER BY in FindAndLockNodeWithModel is documented as a faithful mirror of that function, and a new registry_test mirror spec asserts both pick the same replica on a seeded multi-tier dataset — so future tweaks to either side fail a test until the other side is updated. A TODO(distributed-cache) in pkg/model/loader.go flags the planned per-frontend rotating-replica cache that would reuse PickBestReplica against an in-memory snapshot to skip the per-request DB round-trip on hot paths.
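
As a rough sketch of what a pure-Go version of the ordering (in_flight ASC, last_used ASC, available_vram DESC) could look like: the `Replica` struct, its field names, and the `pickBest` signature below are assumptions made for illustration and are not taken from replicapicker.go.

```go
package main

import "fmt"

// Replica is a hypothetical in-memory snapshot of one loaded replica.
type Replica struct {
	NodeID        string
	InFlight      int    // requests currently being served
	LastUsedUnix  int64  // last use as a Unix timestamp; smaller means idle longer
	AvailableVRAM uint64 // free VRAM in bytes
}

// better reports whether a should be preferred over b, applying the tiers in
// order: fewer in-flight requests, then least recently used, then more VRAM.
func better(a, b Replica) bool {
	if a.InFlight != b.InFlight {
		return a.InFlight < b.InFlight
	}
	if a.LastUsedUnix != b.LastUsedUnix {
		return a.LastUsedUnix < b.LastUsedUnix
	}
	return a.AvailableVRAM > b.AvailableVRAM
}

// pickBest returns the preferred replica, or false when the slice is empty.
// On a full tie the earliest element wins.
func pickBest(replicas []Replica) (Replica, bool) {
	if len(replicas) == 0 {
		return Replica{}, false
	}
	best := replicas[0]
	for _, r := range replicas[1:] {
		if better(r, best) {
			best = r
		}
	}
	return best, true
}

func main() {
	replicas := []Replica{
		{NodeID: "dgx-spark1", InFlight: 6},
		{NodeID: "nvidia-thor1", InFlight: 0},
	}
	if best, ok := pickBest(replicas); ok {
		fmt.Println("picked:", best.NodeID)
	}
}
```

Keeping the comparison in one small function is what makes a mirror test practical: the same seeded rows can be fed to the SQL query and to the Go function, and any disagreement points at a tier that drifted.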

Reproducer matching the original report

Two healthy backend nodes (dgx-spark1, nvidia-thor1) with the same model loaded:

dgx-spark1     loaded   in_flight=6
nvidia-thor1   loaded   in_flight=0

All requests landed on dgx-spark1. After this PR a second concurrent request goes to nvidia-thor1.
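
The pinning effect can be reproduced with a toy simulation. The sketch below is only a model of the behaviour described above: `simulate`, `leastLoaded`, and the `node` type are hypothetical, requests are assumed never to complete during the run, and the real router also considers last_used and available_vram.

```go
package main

import "fmt"

// node is a hypothetical, simplified view of a backend node.
type node struct {
	ID       string
	InFlight int
}

// leastLoaded returns the index of the node with the fewest in-flight
// requests; the first node wins ties.
func leastLoaded(nodes []node) int {
	best := 0
	for i := range nodes {
		if nodes[i].InFlight < nodes[best].InFlight {
			best = i
		}
	}
	return best
}

// simulate dispatches n long-running requests and returns the resulting
// in-flight counts. With cacheFirstPick set, the first routing decision is
// reused for every later request, mimicking a cached client bound to one node.
func simulate(n int, cacheFirstPick bool) []node {
	nodes := []node{{ID: "dgx-spark1"}, {ID: "nvidia-thor1"}}
	pinned := -1
	for i := 0; i < n; i++ {
		target := pinned
		if target < 0 || !cacheFirstPick {
			target = leastLoaded(nodes) // route this request on its own
		}
		if cacheFirstPick && pinned < 0 {
			pinned = target // remember the very first pick
		}
		nodes[target].InFlight++
	}
	return nodes
}

func main() {
	for _, cached := range []bool{true, false} {
		nodes := simulate(6, cached)
		fmt.Printf("cacheFirstPick=%v: %s=%d %s=%d\n", cached,
			nodes[0].ID, nodes[0].InFlight, nodes[1].ID, nodes[1].InFlight)
	}
}
```

With the first pick cached, all six requests pile onto dgx-spark1 (6 and 0, as in the report); routing each request independently splits them 3 and 3 in this model.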

Test plan

  • go test ./core/services/nodes/ -count=1 (210 specs, real PG via testcontainers) — green
  • go test ./pkg/model/ -count=1 — green
  • gofmt -l clean on touched files
  • New PickBestReplica unit specs (7) cover each tiebreaker tier + edge cases
  • New SQL-vs-picker mirror spec ("agrees with PickBestReplica on a seeded dataset") catches drift
  • New probeCache unit specs (7) cover cold/TTL/failure/singleflight/per-key independence/disabled mode/Invalidate
  • Cluster smoke: redeploy and confirm a second concurrent chat completion lands on the previously-idle node

mudler added 2 commits May 24, 2026 07:44
…thModel

Lifts the replica-selection policy (in_flight ASC, last_used ASC,
available_vram DESC) out of the SQL ORDER BY into a pure Go function in
the new replicapicker.go. The SQL clause keeps its FOR UPDATE atomicity
and remains the production path used by SmartRouter; PickBestReplica is
the canonical implementation that the future per-frontend rotating
replica cache (TODO referenced from pkg/model) will call against an
in-memory snapshot without paying a DB round-trip per inference.

A new registry_test mirror spec seeds a multi-tier scenario and asserts
both layers pick the same replica, so any future tweak to either side
fails the test until the other side is updated.

No behavior change.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Two related fixes that together restore load balancing across loaded
replicas of the same model.

1. ModelLoader.Load and LoadModel bypass the local *Model cache when
   modelRouter is set. The cached *Model wraps an InFlightTrackingClient
   bound to a single (nodeID, replicaIndex) — reusing it pinned every
   subsequent request to whichever node won the very first pick, so
   FindAndLockNodeWithModel's round-robin never got a chance to run
   even after the reconciler scaled the model out to a second node. In
   distributed mode SmartRouter.Route now runs per request, and
   PickBestReplica picks the least-loaded replica each time.

   SmartRouter has its own coalescing (advisory DB lock for first-time
   loads + singleflight on backend.install RPC) so concurrent first
   requests for a not-yet-loaded model still produce a single worker
   side install.

2. SmartRouter.probeHealth memoizes successful gRPC HealthCheck results
   in a new probeCache (probe_cache.go) with a 30s TTL. With per-request
   routing every inference call hits probeHealth, and llama.cpp-style
   backends serialize HealthCheck behind active Predict — so a burst of
   incoming requests stalled on the probe to a node already mid-stream,
   tripping the 2s timeout and falling through to the install path.
   singleflight collapses N concurrent first-time probes for the same
   (node, addr) into one round-trip, failed probes invalidate the entry
   so the staleness-recovery path still triggers, and the TTL matches
   pkg/model/model.go's healthCheckTTL so the single-process and
   distributed paths share a staleness budget. The background
   HealthMonitor still reaps actually-dead backends within ~45s.

The bypass introduces one short FindAndLockNodeWithModel transaction per
inference. A TODO in pkg/model/loader.go documents the future per modelID
rotating-replica cache that would reuse PickBestReplica against an
in-memory snapshot and skip the DB round-trip for hot paths.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
mudler enabled auto-merge (squash) May 24, 2026 07:47
mudler previously approved these changes May 24, 2026
… callback

errcheck (new-from-merge-base) flagged the new distributed-mode branch
in Load() for ignoring ml.ShutdownModel's return. The grandfathered
local-mode branch a few lines below ignores it too, but a fresh
identical pattern doesn't carry the baseline exemption. Log the error
via xlog.Warn — the connection-evicting wrapper is the failure path so
silent eviction errors would lose useful debugging context anyway.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
mudler merged commit 8bbe89a into master May 24, 2026
56 checks passed
mudler deleted the fix/distributed-per-request-routing branch May 24, 2026 08:15
localai-bot added the bug (Something isn't working) label May 24, 2026