fix(distributed): orchestrator resilience — auto-upgrade routing, worker bind-wait, RAG-init crash, log spam #9657
Merged
fix(nodes/health): skip stale-marking already-offline nodes

The health monitor re-emitted "Node heartbeat stale" + "Marking stale node offline" + MarkOffline on every cycle for nodes that were already in the offline (or unhealthy) state. For an operator-stopped node this flooded the logs with the same WARN+INFO pair every check interval. Skip the staleness branch when the node is already StatusOffline / StatusUnhealthy — the state is already what we'd write, so neither the log lines nor the DB update carry information.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
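A minimal Go sketch of the fixed branch. All names here (HealthMonitor, NodeStore, checkStaleness, the status constants) are illustrative assumptions, not the actual LocalAI node-service types:

```go
package nodes

import (
	"time"

	"github.com/rs/zerolog/log"
)

// Hypothetical types; the real LocalAI node service differs.
type NodeState int

const (
	StatusOnline NodeState = iota
	StatusUnhealthy
	StatusOffline
)

type Node struct {
	ID            string
	State         NodeState
	LastHeartbeat time.Time
}

type NodeStore interface {
	MarkOffline(id string) error
}

type HealthMonitor struct {
	Store        NodeStore
	StaleTimeout time.Duration
}

// checkStaleness sketches the fixed branch: once a node is already
// Offline/Unhealthy, a stale heartbeat carries no new information, so the
// WARN/INFO pair and the MarkOffline write are skipped.
func (m *HealthMonitor) checkStaleness(n *Node) error {
	if time.Since(n.LastHeartbeat) < m.StaleTimeout {
		return nil // heartbeat is fresh
	}
	if n.State == StatusOffline || n.State == StatusUnhealthy {
		return nil // already in the state we'd write; nothing to log or persist
	}
	log.Warn().Str("node", n.ID).Msg("Node heartbeat stale")
	log.Info().Str("node", n.ID).Msg("Marking stale node offline")
	n.State = StatusOffline
	return m.Store.MarkOffline(n.ID)
}
```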
fix(worker): wait for backend gRPC bind before replying to backend.install
The backend supervisor used to wait up to 4s (20 × 200ms) for the
backend's gRPC server to answer a HealthCheck, then log a warning and
reply Success with the bind address anyway. On slower nodes (a Jetson
Orin doing first-boot CUDA init, large CGO library load) the gRPC
listener wasn't up yet, so the frontend's first LoadModel dial returned
"connect: connection refused" and the operator chased a phantom network
issue instead of a startup-timing one.
Two changes:
- Bump the readiness window to 30s. CUDA init on Orin/Thor first boot
measures in seconds, not milliseconds.
- On deadline-exceeded, stop the half-started process, recycle the
port, and return an error with the backend's stderr tail. The
frontend now gets a real failure with diagnostic context instead of
a misleading ECONNREFUSED on a downstream dial.
Process death during the wait window keeps its existing fast-fail path.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
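A hedged sketch of the new readiness wait under these assumptions: waitForBind, stderrTail, and releasePort are illustrative stand-ins for whatever the real supervisor uses, and a plain TCP dial substitutes for the actual gRPC HealthCheck probe:

```go
package worker

import (
	"context"
	"fmt"
	"net"
	"os/exec"
	"time"
)

// readinessWindow replaces the old 4s (20 × 200ms) budget; CUDA init on an
// Orin/Thor first boot takes seconds, not milliseconds.
const readinessWindow = 30 * time.Second

// waitForBind polls the backend's listener until it answers or the deadline
// passes. On timeout the half-started process is stopped, the port recycled,
// and the caller gets a real error carrying the stderr tail instead of a
// misleading ECONNREFUSED on the frontend's first LoadModel dial.
func waitForBind(ctx context.Context, cmd *exec.Cmd, addr string,
	stderrTail func() string, releasePort func()) error {
	deadline := time.Now().Add(readinessWindow)
	for time.Now().Before(deadline) {
		if err := ctx.Err(); err != nil {
			return err
		}
		// Keep the existing fast-fail path if the process dies while we wait.
		if cmd.ProcessState != nil && cmd.ProcessState.Exited() {
			return fmt.Errorf("backend exited during startup: %s", stderrTail())
		}
		// A plain TCP dial stands in for the real gRPC HealthCheck probe.
		if conn, err := net.DialTimeout("tcp", addr, time.Second); err == nil {
			conn.Close()
			return nil // listener is up; safe to reply Success with the bind address
		}
		time.Sleep(200 * time.Millisecond)
	}
	// Deadline exceeded: stop the half-started process, recycle the port, and
	// return an error with diagnostic context.
	_ = cmd.Process.Kill()
	releasePort()
	return fmt.Errorf("backend did not bind %s within %s; stderr tail: %s",
		addr, readinessWindow, stderrTail())
}
```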
…LocalAGI/LocalRecall

Two distributed-mode bugs that surfaced together in the orchestrator logs:

1. Auto-upgrade always failed with "backend not found". UpgradeChecker correctly routed CheckUpgrades through the active BackendManager (so the frontend aggregates worker state), but the auto-upgrade branch right below called gallery.UpgradeBackend directly with the frontend's SystemState. In distributed mode the frontend has no backends installed locally, so ListSystemBackends returned empty and Get(name) failed for every reported upgrade. Auto-upgrade now also goes through BackendManager.UpgradeBackend, which fans out to workers via NATS.

2. Embedding-load failure on a remote node crashed the orchestrator. When RAG init lazily called NewPersistentPostgresCollection and the remote embedding worker was unreachable, LocalRecall called os.Exit(1) inside the constructor, killing the orchestrator pod. LocalRecall now returns errors instead, LocalAGI surfaces them as a nil collection, and the existing RAGProviderFromState path returns (nil, nil, false) — the same code path the agent pool already takes when no RAG is configured. The orchestrator stays up; chat requests degrade to "no RAG available" until the embedding worker recovers.

Bumps:
github.com/mudler/LocalAGI → e83bf515d010
github.com/mudler/localrecall → 6138c1f535ab

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
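A rough sketch of the routing change in point 1. The BackendManager interface, autoUpgrade, and galleryUpgradeBackend below are illustrative names under assumed signatures, not the actual LocalAI API:

```go
package services

import "context"

// BackendManager is assumed to expose an UpgradeBackend that fans the
// upgrade out to the workers that actually have the backend installed
// (via NATS in distributed mode).
type BackendManager interface {
	UpgradeBackend(ctx context.Context, name string) error
}

// SystemState stands in for the frontend-local install state.
type SystemState struct{}

// galleryUpgradeBackend stands in for gallery.UpgradeBackend, which only
// sees backends installed in the local SystemState.
func galleryUpgradeBackend(ctx context.Context, s *SystemState, name string) error {
	return nil
}

// autoUpgrade mirrors the fix: route through the active BackendManager when
// one is configured, so distributed frontends no longer fail with
// `backend "<name>": backend not found`; local mode keeps the gallery path.
func autoUpgrade(ctx context.Context, bm BackendManager, state *SystemState, name string) error {
	if bm != nil {
		return bm.UpgradeBackend(ctx, name)
	}
	return galleryUpgradeBackend(ctx, state, name)
}
```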
mudler approved these changes on May 4, 2026
Summary
Four small but impactful fixes for distributed-mode orchestration. Triggered by a real cluster (5-node K3s + a NATS-attached Jetson Orin / Spark) where the orchestrator pod was crash-looping on every chat that touched RAG.
fix(distributed): route auto-upgrade through BackendManager (a39c0b6)
UpgradeChecker.runCheck already routed the check through the active BackendManager, but the auto-upgrade branch right below it called gallery.UpgradeBackend directly using the frontend's SystemState. In distributed mode the frontend has no backends installed locally, so every reported upgrade failed with backend "<name>": backend not found. The fix routes the upgrade (and the post-upgrade re-check) through bm.UpgradeBackend(...) so it fans out to workers via NATS. Local mode still falls back to gallery.UpgradeBackend.

fix(worker): wait for backend gRPC bind before replying to backend.install (97a8c74)
The supervisor used to wait 4s (20 × 200ms) for the backend's gRPC server to answer a HealthCheck and then log a warning and reply Success anyway. On Jetson Orin first-boot CUDA init that wasn't enough — the frontend's first LoadModel dial got connect: connection refused instead of a real error. The wait window is now 30s, and on deadline-exceeded the supervisor stops the half-started process, recycles the port, and returns an error with the backend's stderr tail.

fix(distributed): bump LocalAGI/LocalRecall — RAG-init no longer crashes the host (a39c0b6, paired with the dep bumps)
LocalRecall's NewPersistent{Chrome,LocalAI,Postgres}Collection used to call os.Exit(1) on init failure (acceptable for the standalone CLI, catastrophic for a long-running embedder). When a chat lazily initialized RAG and the embedding worker was unreachable, os.Exit(1) killed the orchestrator pod. LocalRecall now returns errors (mudler/LocalRecall@6138c1f), LocalAGI surfaces them as a nil collection (mudler/LocalAGI@e83bf51), and the existing RAGProviderFromState returns (nil, nil, false) — the same path agents take when no RAG is configured (sketched below).

fix(nodes/health): skip stale-marking already-offline nodes (5c10fb1)
The health monitor re-emitted Node heartbeat stale + Marking stale node offline + MarkOffline every cycle for nodes that were already offline, flooding the logs whenever an operator intentionally took a node down. Skip the staleness branch when the node is already StatusOffline / StatusUnhealthy.
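A rough sketch of the degrade path from the LocalAGI/LocalRecall fix above. The types and signatures here are hypothetical; the actual LocalAGI and LocalRecall APIs differ:

```go
package rag

import "errors"

// Hypothetical types standing in for the LocalRecall collection and the
// LocalAGI RAG provider.
type Collection interface {
	Search(query string, k int) ([]string, error)
}

type Provider struct {
	col Collection
}

// newPostgresCollection stands in for LocalRecall's
// NewPersistentPostgresCollection after the fix: it returns an error instead
// of calling os.Exit(1) when the embedding worker is unreachable.
func newPostgresCollection(dsn, embeddingURL string) (Collection, error) {
	return nil, errors.New("embedding worker unreachable")
}

// RAGProviderFromState returns (nil, nil, false) when the collection could
// not be initialized, the same path agents already take when no RAG is
// configured, so the orchestrator stays up and chat degrades to "no RAG
// available" until the embedding worker recovers.
func RAGProviderFromState(dsn, embeddingURL string) (*Provider, Collection, bool) {
	col, err := newPostgresCollection(dsn, embeddingURL)
	if err != nil || col == nil {
		return nil, nil, false
	}
	return &Provider{col: col}, col, true
}
```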
Test plan

- go build ./core/application/... ./core/services/agentpool/... ./core/services/nodes/... ./core/cli/... ./core/http/endpoints/openai/...
- go vet clean on the same set
- go test ./core/services/nodes/ (all node tests pass, ~94s)
- go test ./core/cli/ (connection refused)