fix(distributed): orchestrator resilience — auto-upgrade routing, worker bind-wait, RAG-init crash, log spam #9657

Merged

mudler merged 3 commits into master from fix/distributed-orchestrator-resilience on May 4, 2026

Conversation

@localai-bot
Collaborator

Summary

Four small but impactful fixes for distributed-mode orchestration. Triggered by a real cluster (5-node K3s + a NATS-attached Jetson Orin / Spark) where the orchestrator pod was crash-looping on every chat that touched RAG.

  • fix(distributed): route auto-upgrade through BackendManager (a39c0b6)
    UpgradeChecker.runCheck already routed the check through the active BackendManager, but the auto-upgrade branch right below it called gallery.UpgradeBackend directly using the frontend's SystemState. In distributed mode the frontend has no backends installed locally, so every reported upgrade failed with backend "<name>": backend not found. The fix routes the upgrade (and the post-upgrade re-check) through bm.UpgradeBackend(...) so it fans out to workers via NATS (see the sketch after this list). Local mode still falls back to gallery.UpgradeBackend.

  • fix(worker): wait for backend gRPC bind before replying to backend.install (97a8c74)
    The supervisor used to wait 4s (20 × 200ms) for the backend's gRPC server to answer a HealthCheck, then log a warning and reply Success anyway. On a Jetson Orin doing first-boot CUDA init that wasn't enough — the frontend's first LoadModel dial got connect: connection refused instead of a real error. The wait window is now 30s, and on deadline-exceeded the supervisor stops the half-started process, recycles the port, and returns an error with the backend's stderr tail.

  • fix(distributed): bump LocalAGI/LocalRecall — RAG-init no longer crashes the host (a39c0b6, paired with the dep bumps)
    LocalRecall's NewPersistent{Chrome,LocalAI,Postgres}Collection used to call os.Exit(1) on init failure (acceptable for the standalone CLI, catastrophic for a long-running embedder). When a chat lazily initialized RAG and the embedding worker was unreachable, os.Exit(1) killed the orchestrator pod. LocalRecall now returns errors (mudler/LocalRecall@6138c1f), LocalAGI surfaces them as a nil collection (mudler/LocalAGI@e83bf51), and the existing RAGProviderFromState returns (nil, nil, false) — the same path agents take when no RAG is configured.

  • fix(nodes/health): skip stale-marking already-offline nodes (5c10fb1)
    The health monitor re-emitted "Node heartbeat stale" + "Marking stale node offline" + MarkOffline every cycle for nodes that were already offline, flooding the logs whenever an operator intentionally took a node down. The fix skips the staleness branch when the node is already StatusOffline / StatusUnhealthy.
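
A minimal Go sketch of the first fix's routing decision. Only UpgradeChecker, BackendManager, bm.UpgradeBackend, gallery.UpgradeBackend, and SystemState are named in this PR; the field names, signatures, and stub types below are illustrative assumptions, not the actual LocalAI code:

```go
// Sketch only: stub types stand in for the real LocalAI interfaces.
package upgrades

type SystemState struct{} // frontend state; lists no backends in distributed mode

type BackendManager interface {
	// UpgradeBackend fans the upgrade out to workers via NATS.
	UpgradeBackend(name string) error
}

// galleryUpgradeBackend stands in for gallery.UpgradeBackend.
func galleryUpgradeBackend(s *SystemState, name string) error { return nil }

type UpgradeChecker struct {
	bm          BackendManager // non-nil when running distributed
	systemState *SystemState
}

func (c *UpgradeChecker) autoUpgrade(name string) error {
	// Before the fix this branch always called gallery.UpgradeBackend with
	// the frontend's SystemState; with no backends installed locally, every
	// reported upgrade failed with: backend "<name>": backend not found.
	if c.bm != nil {
		return c.bm.UpgradeBackend(name) // distributed: route through the workers
	}
	return galleryUpgradeBackend(c.systemState, name) // local mode fallback
}
```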

Test plan

  • go build ./core/application/... ./core/services/agentpool/... ./core/services/nodes/... ./core/cli/... ./core/http/endpoints/openai/...
  • go vet clean on the same set
  • go test ./core/services/nodes/ (all node tests pass, ~94s)
  • go test ./core/cli/
  • Roll the orchestrator pod onto an image built from this branch + bumped LocalAGI dep and confirm:
    • chat that triggers RAG init survives an unreachable embedding worker (no pod restart, falls back to "no RAG available")
    • first cold-load on the Jetson Orin returns a real error (or succeeds) instead of connection refused
    • dgx-spark stays offline without log spam
    • auto-upgrade actually upgrades workers instead of failing with "backend not found"

mudler added 3 commits May 4, 2026 16:57
The health monitor re-emitted "Node heartbeat stale" + "Marking stale
node offline" + MarkOffline on every cycle for nodes that were already
in the offline (or unhealthy) state. For an operator-stopped node this
flooded the logs with the same WARN+INFO pair every check interval.

Skip the staleness branch when the node is already StatusOffline /
StatusUnhealthy — the state is already what we'd write, so neither the
log lines nor the DB update carry information.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
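
A sketch of the guard this commit describes. StatusOffline/StatusUnhealthy, MarkOffline, and the two log messages come from the commit text; the monitor shape, field names, and zerolog usage are assumptions:

```go
package nodes

import (
	"time"

	"github.com/rs/zerolog/log"
)

type Status int

const (
	StatusOnline Status = iota
	StatusOffline
	StatusUnhealthy
)

type Node struct {
	ID            string
	Status        Status
	LastHeartbeat time.Time
}

type Monitor struct {
	nodes      []*Node
	staleAfter time.Duration
}

func (m *Monitor) MarkOffline(id string) { /* persist StatusOffline; elided */ }

func (m *Monitor) checkStale() {
	for _, n := range m.nodes {
		if time.Since(n.LastHeartbeat) < m.staleAfter {
			continue // heartbeat still fresh
		}
		// The fix: an already-offline/unhealthy node would be written back
		// to the same state, so neither the logs nor the DB update carry
		// information; skip the whole branch.
		if n.Status == StatusOffline || n.Status == StatusUnhealthy {
			continue
		}
		log.Warn().Str("node", n.ID).Msg("Node heartbeat stale")
		log.Info().Str("node", n.ID).Msg("Marking stale node offline")
		m.MarkOffline(n.ID)
	}
}
```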
fix(worker): wait for backend gRPC bind before replying to backend.install

The backend supervisor used to wait up to 4s (20 × 200ms) for the
backend's gRPC server to answer a HealthCheck, then log a warning and
reply Success with the bind address anyway. On slower nodes (a Jetson
Orin doing first-boot CUDA init, large CGO library load) the gRPC
listener wasn't up yet, so the frontend's first LoadModel dial returned
"connect: connection refused" and the operator chased a phantom network
issue instead of a startup-timing one.

Two changes:

  - Bump the readiness window to 30s. CUDA init on Orin/Thor first boot
    measures in seconds, not milliseconds.
  - On deadline-exceeded, stop the half-started process, recycle the
    port, and return an error with the backend's stderr tail. The
    frontend now gets a real failure with diagnostic context instead of
    a misleading ECONNREFUSED on a downstream dial.

Process death during the wait window keeps its existing fast-fail path.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
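
A sketch of the new wait loop. The 30s window, 200ms poll, process stop, port recycle, and stderr tail are from the commit; the function shape and injected helper callbacks are assumptions, since the real supervisor internals aren't shown here:

```go
package supervisor

import (
	"fmt"
	"os/exec"
	"time"
)

// waitForBackendReady polls the backend's gRPC HealthCheck (via the
// injected healthy func) before the supervisor replies to backend.install.
func waitForBackendReady(cmd *exec.Cmd, addr string,
	healthy func(addr string) bool,
	stderrTail func() string,
	releasePort func(),
) error {
	deadline := time.Now().Add(30 * time.Second) // was 4s (20 × 200ms)
	for !healthy(addr) {
		if time.Now().After(deadline) {
			_ = cmd.Process.Kill() // stop the half-started process
			releasePort()          // recycle the port for the next attempt
			return fmt.Errorf("backend on %s not ready after 30s: %s",
				addr, stderrTail())
		}
		// Process death during the window keeps its existing fast-fail
		// path in the real supervisor (elided here).
		time.Sleep(200 * time.Millisecond)
	}
	return nil // only now reply Success with the bind address
}
```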
fix(distributed): route auto-upgrade through BackendManager, bump LocalAGI/LocalRecall

Two distributed-mode bugs that surfaced together in the orchestrator
logs:

1. Auto-upgrade always failed with "backend not found".

   UpgradeChecker correctly routed CheckUpgrades through the active
   BackendManager (so the frontend aggregates worker state), but the
   auto-upgrade branch right below called gallery.UpgradeBackend
   directly with the frontend's SystemState. In distributed mode the
   frontend has no backends installed locally, so ListSystemBackends
   returned empty and Get(name) failed for every reported upgrade.
   Auto-upgrade now also goes through BackendManager.UpgradeBackend,
   which fans out to workers via NATS.

2. Embedding-load failure on a remote node crashed the orchestrator.

   When RAG init lazily called NewPersistentPostgresCollection and the
   remote embedding worker was unreachable, LocalRecall called
   os.Exit(1) inside the constructor, killing the orchestrator pod.
   LocalRecall now returns errors instead, LocalAGI surfaces them as a
   nil collection, and the existing RAGProviderFromState path returns
   (nil, nil, false) — the same code path the agent pool already takes
   when no RAG is configured. The orchestrator stays up; chat requests
   degrade to "no RAG available" until the embedding worker recovers
   (see the sketch below).

Bumps:
  github.com/mudler/LocalAGI    → e83bf515d010
  github.com/mudler/localrecall → 6138c1f535ab

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
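
A sketch of the degraded path described in point 2. NewPersistentPostgresCollection, RAGProviderFromState, and the (nil, nil, false) return are named above; the stub types, return types, and constructor arguments are illustrative assumptions, not the actual LocalAGI wiring:

```go
package agentpool

import "errors"

// Stubs standing in for LocalRecall/LocalAGI types; all illustrative.
type Collection struct{}
type Provider struct{ col *Collection }
type Memory struct{ col *Collection }

// newPersistentPostgresCollection stands in for LocalRecall's
// NewPersistentPostgresCollection, which now returns an error on init
// failure instead of calling os.Exit(1).
func newPersistentPostgresCollection() (*Collection, error) {
	return nil, errors.New("embedding worker unreachable")
}

func ragProviderFromState() (*Provider, *Memory, bool) {
	col, err := newPersistentPostgresCollection()
	if err != nil {
		col = nil // LocalAGI surfaces the init error as a nil collection
	}
	if col == nil {
		// Same path as "no RAG configured": the orchestrator stays up and
		// chat degrades to "no RAG available" until the worker recovers.
		return nil, nil, false
	}
	return &Provider{col: col}, &Memory{col: col}, true
}
```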
@mudler mudler merged commit de83b72 into master May 4, 2026
49 checks passed
@mudler mudler deleted the fix/distributed-orchestrator-resilience branch May 4, 2026 17:09
@localai-bot localai-bot added the bug Something isn't working label May 9, 2026
