fix(distributed): orchestrator resilience — auto-upgrade routing, worker bind-wait, RAG-init crash, log spam #9657

Merged

mudler merged 3 commits into master from fix/distributed-orchestrator-resilience on May 4, 2026

Conversation

@localai-bot
Collaborator

Summary

Four small but impactful fixes for distributed-mode orchestration. Triggered by a real cluster (5-node K3s + a NATS-attached Jetson Orin / Spark) where the orchestrator pod was crash-looping on every chat that touched RAG.

  • fix(distributed): route auto-upgrade through BackendManager (a39c0b6)
    UpgradeChecker.runCheck already routed the check through the active BackendManager, but the auto-upgrade branch right below it called gallery.UpgradeBackend directly using the frontend's SystemState. In distributed mode the frontend has no backends installed locally, so every reported upgrade failed with backend "<name>": backend not found. The fix routes the upgrade (and the post-upgrade re-check) through bm.UpgradeBackend(...) so it fans out to workers via NATS (see the sketch after this list). Local mode still falls back to gallery.UpgradeBackend.

  • fix(worker): wait for backend gRPC bind before replying to backend.install (97a8c74)
    The supervisor used to wait 4s (20 × 200ms) for the backend's gRPC server to answer a HealthCheck, then log a warning and reply Success anyway. On a Jetson Orin doing first-boot CUDA init that wasn't enough — the frontend's first LoadModel dial got connect: connection refused instead of a real error. The wait window is now 30s, and on deadline-exceeded the supervisor stops the half-started process, recycles the port, and returns an error with the backend's stderr tail.

  • fix(distributed): bump LocalAGI/LocalRecall — RAG-init no longer crashes the host (a39c0b6, paired with the dep bumps)
    LocalRecall's NewPersistent{Chrome,LocalAI,Postgres}Collection used to call os.Exit(1) on init failure (acceptable for the standalone CLI, catastrophic for a long-running embedder). When a chat lazily initialized RAG and the embedding worker was unreachable, os.Exit(1) killed the orchestrator pod. LocalRecall now returns errors (mudler/LocalRecall@6138c1f), LocalAGI surfaces them as a nil collection (mudler/LocalAGI@e83bf51), and the existing RAGProviderFromState returns (nil, nil, false) — the same path agents take when no RAG is configured.

  • fix(nodes/health): skip stale-marking already-offline nodes (5c10fb1)
    The health monitor re-emitted "Node heartbeat stale" + "Marking stale node offline" + MarkOffline every cycle for nodes that were already offline, flooding the logs whenever an operator intentionally took a node down. The fix skips the staleness branch when the node is already StatusOffline / StatusUnhealthy.
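
A minimal Go sketch of the first fix's routing decision. Only UpgradeChecker, BackendManager, bm.UpgradeBackend, gallery.UpgradeBackend, and SystemState are named in this PR; the field names, signatures, and stub types below are illustrative assumptions, not the actual LocalAI code:

```go
// Sketch only: stub types stand in for the real LocalAI interfaces.
package upgrades

type SystemState struct{} // frontend state; lists no backends in distributed mode

type BackendManager interface {
	// UpgradeBackend fans the upgrade out to workers via NATS.
	UpgradeBackend(name string) error
}

// galleryUpgradeBackend stands in for gallery.UpgradeBackend.
func galleryUpgradeBackend(s *SystemState, name string) error { return nil }

type UpgradeChecker struct {
	bm          BackendManager // non-nil when running distributed
	systemState *SystemState
}

func (c *UpgradeChecker) autoUpgrade(name string) error {
	// Before the fix this branch always called gallery.UpgradeBackend with
	// the frontend's SystemState; with no backends installed locally, every
	// reported upgrade failed with: backend "<name>": backend not found.
	if c.bm != nil {
		return c.bm.UpgradeBackend(name) // distributed: route through the workers
	}
	return galleryUpgradeBackend(c.systemState, name) // local mode fallback
}
```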

Test plan

  • go build ./core/application/... ./core/services/agentpool/... ./core/services/nodes/... ./core/cli/... ./core/http/endpoints/openai/...
  • go vet clean on the same set
  • go test ./core/services/nodes/ (all node tests pass, ~94s)
  • go test ./core/cli/
  • Roll the orchestrator pod onto an image built from this branch + bumped LocalAGI dep and confirm:
    • chat that triggers RAG init survives an unreachable embedding worker (no pod restart, falls back to "no RAG available")
    • first cold-load on the Jetson Orin returns a real error (or succeeds) instead of connection refused
    • dgx-spark stays offline without log spam
    • auto-upgrade actually upgrades workers instead of failing with "backend not found"

mudler added 3 commits May 4, 2026 16:57
The health monitor re-emitted "Node heartbeat stale" + "Marking stale
node offline" + MarkOffline on every cycle for nodes that were already
in the offline (or unhealthy) state. For an operator-stopped node this
flooded the logs with the same WARN+INFO pair every check interval.

Skip the staleness branch when the node is already StatusOffline /
StatusUnhealthy — the state is already what we'd write, so neither the
log lines nor the DB update carry information.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
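
A sketch of the guard this commit describes. StatusOffline/StatusUnhealthy, MarkOffline, and the two log messages come from the commit text; the monitor shape, field names, and zerolog usage are assumptions:

```go
package nodes

import (
	"time"

	"github.com/rs/zerolog/log"
)

type Status int

const (
	StatusOnline Status = iota
	StatusOffline
	StatusUnhealthy
)

type Node struct {
	ID            string
	Status        Status
	LastHeartbeat time.Time
}

type Monitor struct {
	nodes      []*Node
	staleAfter time.Duration
}

func (m *Monitor) MarkOffline(id string) { /* persist StatusOffline; elided */ }

func (m *Monitor) checkStale() {
	for _, n := range m.nodes {
		if time.Since(n.LastHeartbeat) < m.staleAfter {
			continue // heartbeat still fresh
		}
		// The fix: an already-offline/unhealthy node would be written back
		// to the same state, so neither the logs nor the DB update carry
		// information; skip the whole branch.
		if n.Status == StatusOffline || n.Status == StatusUnhealthy {
			continue
		}
		log.Warn().Str("node", n.ID).Msg("Node heartbeat stale")
		log.Info().Str("node", n.ID).Msg("Marking stale node offline")
		m.MarkOffline(n.ID)
	}
}
```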
fix(worker): wait for backend gRPC bind before replying to backend.install

The backend supervisor used to wait up to 4s (20 × 200ms) for the
backend's gRPC server to answer a HealthCheck, then log a warning and
reply Success with the bind address anyway. On slower nodes (a Jetson
Orin doing first-boot CUDA init, large CGO library load) the gRPC
listener wasn't up yet, so the frontend's first LoadModel dial returned
"connect: connection refused" and the operator chased a phantom network
issue instead of a startup-timing one.

Two changes:

  - Bump the readiness window to 30s. CUDA init on Orin/Thor first boot
    measures in seconds, not milliseconds.
  - On deadline-exceeded, stop the half-started process, recycle the
    port, and return an error with the backend's stderr tail. The
    frontend now gets a real failure with diagnostic context instead of
    a misleading ECONNREFUSED on a downstream dial.

Process death during the wait window keeps its existing fast-fail path.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
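
A sketch of the new wait loop. The 30s window, 200ms poll, process stop, port recycle, and stderr tail are from the commit; the function shape and injected helper callbacks are assumptions, since the real supervisor internals aren't shown here:

```go
package supervisor

import (
	"fmt"
	"os/exec"
	"time"
)

// waitForBackendReady polls the backend's gRPC HealthCheck (via the
// injected healthy func) before the supervisor replies to backend.install.
func waitForBackendReady(cmd *exec.Cmd, addr string,
	healthy func(addr string) bool,
	stderrTail func() string,
	releasePort func(),
) error {
	deadline := time.Now().Add(30 * time.Second) // was 4s (20 × 200ms)
	for !healthy(addr) {
		if time.Now().After(deadline) {
			_ = cmd.Process.Kill() // stop the half-started process
			releasePort()          // recycle the port for the next attempt
			return fmt.Errorf("backend on %s not ready after 30s: %s",
				addr, stderrTail())
		}
		// Process death during the window keeps its existing fast-fail
		// path in the real supervisor (elided here).
		time.Sleep(200 * time.Millisecond)
	}
	return nil // only now reply Success with the bind address
}
```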
fix(distributed): route auto-upgrade through BackendManager, bump LocalAGI/LocalRecall

Two distributed-mode bugs that surfaced together in the orchestrator
logs:

1. Auto-upgrade always failed with "backend not found".

   UpgradeChecker correctly routed CheckUpgrades through the active
   BackendManager (so the frontend aggregates worker state), but the
   auto-upgrade branch right below called gallery.UpgradeBackend
   directly with the frontend's SystemState. In distributed mode the
   frontend has no backends installed locally, so ListSystemBackends
   returned empty and Get(name) failed for every reported upgrade.
   Auto-upgrade now also goes through BackendManager.UpgradeBackend,
   which fans out to workers via NATS.

2. Embedding-load failure on a remote node crashed the orchestrator.

   When RAG init lazily called NewPersistentPostgresCollection and the
   remote embedding worker was unreachable, LocalRecall called
   os.Exit(1) inside the constructor, killing the orchestrator pod.
   LocalRecall now returns errors instead, LocalAGI surfaces them as a
   nil collection, and the existing RAGProviderFromState path returns
   (nil, nil, false) — the same code path the agent pool already takes
   when no RAG is configured. The orchestrator stays up; chat requests
   degrade to "no RAG available" until the embedding worker recovers
   (see the sketch below).

Bumps:
  github.com/mudler/LocalAGI    → e83bf515d010
  github.com/mudler/localrecall → 6138c1f535ab

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
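
A sketch of the degraded path described in point 2. NewPersistentPostgresCollection, RAGProviderFromState, and the (nil, nil, false) return are named above; the stub types, return types, and constructor arguments are illustrative assumptions, not the actual LocalAGI wiring:

```go
package agentpool

import "errors"

// Stubs standing in for LocalRecall/LocalAGI types; all illustrative.
type Collection struct{}
type Provider struct{ col *Collection }
type Memory struct{ col *Collection }

// newPersistentPostgresCollection stands in for LocalRecall's
// NewPersistentPostgresCollection, which now returns an error on init
// failure instead of calling os.Exit(1).
func newPersistentPostgresCollection() (*Collection, error) {
	return nil, errors.New("embedding worker unreachable")
}

func ragProviderFromState() (*Provider, *Memory, bool) {
	col, err := newPersistentPostgresCollection()
	if err != nil {
		col = nil // LocalAGI surfaces the init error as a nil collection
	}
	if col == nil {
		// Same path as "no RAG configured": the orchestrator stays up and
		// chat degrades to "no RAG available" until the worker recovers.
		return nil, nil, false
	}
	return &Provider{col: col}, &Memory{col: col}, true
}
```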
@mudler mudler merged commit de83b72 into master May 4, 2026
49 checks passed
@mudler mudler deleted the fix/distributed-orchestrator-resilience branch May 4, 2026 17:09
@localai-bot localai-bot added the bug Something isn't working label May 9, 2026
