fix(distributed): persist per-model load info so reconciler survives frontend restart by localai-bot · Pull Request #9981 · mudler/LocalAI

localai-bot · 2026-05-25T10:38:19Z

Summary

After a LocalAI frontend container restart (rolling upgrade, env-var change, etc.), the Replica Reconciler logs "failed to scale up replica ... no load info" every 30s for each affected model and cannot maintain min_replicas until someone sends a fresh inference request to that model. Production impact: if a frontend has restarted and no traffic has yet hit a given model, a dying worker for that model will not be replaced.

Root cause

router.scheduleAndLoad already persisted the (backendType, pb.ModelOptions) pair, but it stamped it onto the NodeModel row of the loading replica. That row gets deleted by every healthy-row-removal path: MarkOffline reaping stale workers, RemoveAllNodeModelReplicas on stop, RemoveNodeModel on probe failure or stale health check, evictLRUAndFreeNode. When the last replica row goes away, the only copy of load info goes with it. GetModelLoadInfo's scan over NodeModel then returns gorm.ErrRecordNotFound, and the reconciler has nothing to replicate.

Fix

New dedicated ModelLoadInfo table keyed by model_name, decoupled from NodeModel. Three small hooks:

- NodeRegistry.UpsertModelLoadInfo(ctx, modelName, backendType, optsBlob) writes the per-model row with ON CONFLICT (model_name) DO UPDATE (last-write-wins under concurrent multi-frontend dispatch, matching the existing per-replica blob semantics).
- router.scheduleAndLoad calls both the existing per-replica SetNodeModelLoadInfo (backward compat / hot-path) and the new per-model UpsertModelLoadInfo after a successful load.
- NodeRegistry.GetModelLoadInfo now reads the per-model row first and falls back to the legacy NodeModel-blob scan so a rolling upgrade does not regress already-loaded models that were stamped only the old way.

Migration: the new model_load_infos table joins the existing AutoMigrate call under the same advisory lock; no separate migration step.

Test plan

- go build ./core/services/nodes/... ./pkg/model/... ./core/services/distributed/... succeeds
- go test -race -count=1 ./core/services/nodes/... passes (233 specs, 7 new)
- Registry-level integration specs cover: survival across RemoveAllNodeModelReplicas, ON CONFLICT last-write-wins, legacy NodeModel-blob fallback, empty-source ErrRecordNotFound, empty-model-name rejection
- Reconciler-level integration spec proves recovery: with min_replicas=2, NodeModel rows wiped, ModelLoadInfo row present, one reconcile tick calls the scheduler twice (before the fix: zero calls, "no load info" warning)
- Manual e2e against a multi-frontend deployment: reproduce the original symptom, apply the patch, confirm the reconciler scales replicas up on its own after a frontend restart (no fresh inference request required)

Caveats

- Concurrent dispatch by two frontends of the same model with different opts converges last-write-wins on model_load_infos.model_opts_blob. That is identical to the existing per-NodeModel-row semantics; if stronger ordering is ever needed, the row carries updated_at.
- The blob is the marshalled pb.ModelOptions proto from the dispatch path. If a proto field is renamed/removed in a backward-incompatible way, stale rows would replay obsolete options; standard proto compatibility rules apply.
- Schema migration is gorm AutoMigrate under the existing advisory lock; no explicit roll-back is provided. Dropping the table on downgrade requires manual SQL (consistent with how the other distributed tables are managed today).

Follow-up to #9976 (distributed-mode observability + middleware refactor).

Assisted-by: Claude:claude-opus-4-7[1m]

Adds a dedicated ModelLoadInfo table keyed by model name, decoupled from the per-replica NodeModel rows. The reconciler can now recover model load metadata after every NodeModel row has been removed (worker death, eviction, MarkOffline reaping, frontend restart with stale heartbeats), which is the read side of Bug-1 from the distributed mode bug hunt.

Registry exposes:
- UpsertModelLoadInfo: ON CONFLICT (model_name) update; last-write-wins, matching the existing per-replica blob semantics under concurrent multi-frontend dispatch.
- GetModelLoadInfo: read from the new table first; fall back to the legacy NodeModel-blob scan for rows written before any frontend in the cluster ran an UpsertModelLoadInfo (rolling-upgrade transition).

SetNodeModelLoadInfo (per-replica blob) is preserved for backward compatibility and per-replica diagnostics; the dispatch-path hook in the next commit calls both.

The new table joins the existing nodes AutoMigrate set under the same schema-migration advisory lock.

Refs: Bug-1, docs/superpowers/specs/2026-05-24-distributed-mode-bug-hunt-findings.md
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7[1m]

scheduleAndLoad now writes the (backendType, ModelOptions blob) pair to the new ModelLoadInfo table in addition to the existing per-replica NodeModel.model_opts_blob field. The per-replica blob still works for the hot path; the per-model row outlives every NodeModel row going away, which is what unblocks the reconciler on the read side.

Both writes are best-effort with warn-level logging on failure: a write miss here just means the reconciler may need a fresh inference request to repopulate, which is the pre-fix behavior.

Concurrency: two frontends loading the same model at the same time both fire UpsertModelLoadInfo; ON CONFLICT (model_name) makes the row converge to whichever commits last. Matches the existing per-replica blob semantics.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7[1m]

Adds Ginkgo specs that prove the persistence layer behaves correctly and that the reconciler actually recovers from the frontend-restart scenario that was failing in production:

registry_test.go:
- per-model row survives RemoveAllNodeModelReplicas (the bug repro)
- ON CONFLICT (model_name) updates backend type + blob, last-write-wins
- legacy NodeModel-blob fallback still works (rolling-upgrade transition)
- GetModelLoadInfo returns ErrRecordNotFound when both sources are empty
- UpsertModelLoadInfo rejects empty model names

reconciler_test.go:
- Bug-1 end-to-end: with min_replicas=2, no NodeModel rows, but a ModelLoadInfo row present, one reconcile tick fires two scheduler calls. Pre-fix this returned "no load info" and the scheduler never got called until a fresh inference request arrived.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7[1m]

Adds a bullet to the Replica Reconciler section explaining that per-model load metadata is persisted across frontend restarts via the new model_load_infos PostgreSQL table, so a rolling upgrade no longer needs a fresh inference request per model before the reconciler can replace dead replicas.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7[1m]

localai-bot changed the title from "fix(distributed): persist per-model load info so reconciler survives frontend restart (Bug-1)" to "fix(distributed): persist per-model load info so reconciler survives frontend restart" on May 25, 2026

mudler added 4 commits May 25, 2026 10:49

mudler force-pushed the fix/distributed-reconciler-load-info-persistence branch from c9468fd to 558977e Compare May 25, 2026 10:49

mudler merged commit a891eed into master May 25, 2026
56 checks passed

mudler deleted the fix/distributed-reconciler-load-info-persistence branch May 25, 2026 11:00

BrewTestBot mentioned this pull request May 27, 2026

localai 4.3.2 Homebrew/homebrew-core#285003

Merged
