fix(distributed): persist per-model load info so reconciler survives frontend restart#9981

Merged
mudler merged 4 commits into master from fix/distributed-reconciler-load-info-persistence
May 25, 2026

localai-bot (Collaborator) commented May 25, 2026

Summary

After a LocalAI frontend container restart (rolling upgrade, env-var change, etc.), the Replica Reconciler logs "failed to scale up replica ... no load info" every 30s for each affected model and cannot maintain min_replicas until someone sends a fresh inference request to that model. Production impact: if a frontend has restarted and no traffic has yet hit a given model, a dying worker serving that model will not be replaced.

Root cause

router.scheduleAndLoad already persisted the (backendType, pb.ModelOptions) pair, but it stamped it onto the NodeModel row of the loading replica. That row is deleted by every healthy-row-removal path: MarkOffline reaping stale workers, RemoveAllNodeModelReplicas on stop, RemoveNodeModel on probe failure or stale health check, and evictLRUAndFreeNode. When the last replica row goes away, the only copy of the load info goes with it. GetModelLoadInfo's scan over NodeModel then returns gorm.ErrRecordNotFound, and the reconciler has nothing to replicate.
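The failure mode can be reproduced in miniature. The following is an in-memory sketch only, not the project's gorm code: legacyRegistry, nodeModel, loadInfo, removeNode and errNotFound are hypothetical stand-ins (errNotFound plays the role of gorm.ErrRecordNotFound), and the model and worker names are made up.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for gorm.ErrRecordNotFound in this sketch.
var errNotFound = errors.New("record not found")

// loadInfo is the (backendType, options blob) pair from the description.
type loadInfo struct {
	BackendType string
	OptsBlob    []byte
}

// nodeModel is a simplified, hypothetical stand-in for one replica row.
type nodeModel struct {
	NodeID    string
	ModelName string
	Info      *loadInfo // stamped only on the replica that performed the load
}

// legacyRegistry keeps load info on replica rows only (the pre-fix layout).
type legacyRegistry struct {
	rows []nodeModel
}

// removeNode drops every replica row owned by nodeID, as any removal path would.
func (r *legacyRegistry) removeNode(nodeID string) {
	kept := r.rows[:0]
	for _, row := range r.rows {
		if row.NodeID != nodeID {
			kept = append(kept, row)
		}
	}
	r.rows = kept
}

// getModelLoadInfo scans the replica rows for a stamped blob.
func (r *legacyRegistry) getModelLoadInfo(model string) (*loadInfo, error) {
	for _, row := range r.rows {
		if row.ModelName == model && row.Info != nil {
			return row.Info, nil
		}
	}
	return nil, errNotFound
}

func main() {
	reg := &legacyRegistry{rows: []nodeModel{{
		NodeID:    "worker-1",
		ModelName: "example-model",
		Info:      &loadInfo{BackendType: "example-backend", OptsBlob: []byte("opts")},
	}}}

	_, err := reg.getModelLoadInfo("example-model")
	fmt.Println("before removal:", err)

	// The last replica row goes away, and the only copy of the load info with it.
	reg.removeNode("worker-1")
	_, err = reg.getModelLoadInfo("example-model")
	fmt.Println("after removal:", err)
}
```

Once the only replica row is removed, the lookup has nothing left to scan, which corresponds to the state in which the reconciler reports "no load info".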

Fix

Adds a dedicated ModelLoadInfo table keyed by model_name, decoupled from NodeModel, with three small hooks:

  1. NodeRegistry.UpsertModelLoadInfo(ctx, modelName, backendType, optsBlob) writes the per-model row with ON CONFLICT (model_name) DO UPDATE (last-write-wins under concurrent multi-frontend dispatch, matching the existing per-replica blob semantics).
  2. router.scheduleAndLoad calls both the existing per-replica SetNodeModelLoadInfo (backward compat / hot-path) and the new per-model UpsertModelLoadInfo after a successful load.
  3. NodeRegistry.GetModelLoadInfo now reads the per-model row first and falls back to the legacy NodeModel-blob scan so a rolling upgrade does not regress already-loaded models that were stamped only the old way.

Migration: the new model_load_infos table joins the existing AutoMigrate call under the same advisory lock; no separate migration step.
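The read and write semantics of the three hooks can be sketched with plain maps. This is an illustrative in-memory model under stated assumptions, not the actual NodeRegistry implementation: the registry type, its fields, the lowercase method names and errNotFound are hypothetical, and a map assignment stands in for the ON CONFLICT (model_name) DO UPDATE upsert.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for gorm.ErrRecordNotFound in this sketch.
var errNotFound = errors.New("record not found")

type loadInfo struct {
	BackendType string
	OptsBlob    []byte
}

// registry is a hypothetical in-memory model of the two storage locations.
type registry struct {
	perModel   map[string]loadInfo            // stand-in for model_load_infos, keyed by model name
	perReplica map[string]map[string]loadInfo // legacy blobs: model name -> node ID -> info
}

func newRegistry() *registry {
	return &registry{
		perModel:   map[string]loadInfo{},
		perReplica: map[string]map[string]loadInfo{},
	}
}

// upsertModelLoadInfo overwrites the per-model row: last write wins.
func (r *registry) upsertModelLoadInfo(model, backend string, blob []byte) error {
	if model == "" {
		return errors.New("model name must not be empty")
	}
	r.perModel[model] = loadInfo{BackendType: backend, OptsBlob: append([]byte(nil), blob...)}
	return nil
}

// setNodeModelLoadInfo stamps the legacy per-replica blob.
func (r *registry) setNodeModelLoadInfo(model, nodeID, backend string, blob []byte) {
	if r.perReplica[model] == nil {
		r.perReplica[model] = map[string]loadInfo{}
	}
	r.perReplica[model][nodeID] = loadInfo{BackendType: backend, OptsBlob: append([]byte(nil), blob...)}
}

// removeAllNodeModelReplicas deletes replica rows only; the per-model row stays.
func (r *registry) removeAllNodeModelReplicas(model string) {
	delete(r.perReplica, model)
}

// getModelLoadInfo reads the per-model row first, then falls back to any legacy blob.
func (r *registry) getModelLoadInfo(model string) (loadInfo, error) {
	if info, ok := r.perModel[model]; ok {
		return info, nil
	}
	for _, info := range r.perReplica[model] {
		return info, nil
	}
	return loadInfo{}, errNotFound
}

func main() {
	reg := newRegistry()
	// A successful load writes both locations, as in hook 2.
	reg.setNodeModelLoadInfo("example-model", "worker-1", "backend-a", []byte("v1"))
	_ = reg.upsertModelLoadInfo("example-model", "backend-a", []byte("v1"))
	// A second frontend dispatches the same model with different options.
	_ = reg.upsertModelLoadInfo("example-model", "backend-b", []byte("v2"))

	reg.removeAllNodeModelReplicas("example-model")
	info, err := reg.getModelLoadInfo("example-model")
	fmt.Println(info.BackendType, string(info.OptsBlob), err)
}
```

Keeping the legacy scan as the second step means a model that was stamped only on its replica rows is still resolvable until some frontend writes the per-model row for it.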

Test plan

  • go build ./core/services/nodes/... ./pkg/model/... ./core/services/distributed/... succeeds
  • go test -race -count=1 ./core/services/nodes/... passes (233 specs, 7 new)
  • Registry-level integration specs cover: survival across RemoveAllNodeModelReplicas, ON CONFLICT last-write-wins, legacy NodeModel-blob fallback, empty-source ErrRecordNotFound, empty-model-name rejection
  • Reconciler-level integration spec proves recovery: with min_replicas=2, NodeModel rows wiped, ModelLoadInfo row present, one reconcile tick calls the scheduler twice (before the fix: zero calls, "no load info" warning)
  • Manual e2e against a multi-frontend deployment - reproduce the original symptom, apply the patch, confirm the reconciler scales replicas up on its own after a frontend restart (no fresh inference request required)

Caveats

  • Concurrent dispatch by two frontends of the same model with different opts converges last-write-wins on model_load_infos.model_opts_blob. That is identical to the existing per-NodeModel-row semantics; if stronger ordering is ever needed, the row carries updated_at.
  • The blob is the marshalled pb.ModelOptions proto from the dispatch path. If a proto field is renamed/removed in a backward-incompatible way, stale rows would replay obsolete options; standard proto compatibility rules apply.
  • Schema migration is gorm AutoMigrate under the existing advisory lock; no explicit roll-back is provided. Dropping the table on downgrade requires manual SQL (consistent with how the other distributed tables are managed today).

Follow-up to #9976 (distributed-mode observability + middleware refactor).

Assisted-by: Claude:claude-opus-4-7[1m]

localai-bot changed the title from "fix(distributed): persist per-model load info so reconciler survives frontend restart (Bug-1)" to "fix(distributed): persist per-model load info so reconciler survives frontend restart" on May 25, 2026
mudler added 4 commits May 25, 2026 10:49

Adds a dedicated ModelLoadInfo table keyed by model name, decoupled from
the per-replica NodeModel rows. The reconciler can now recover model load
metadata after every NodeModel row has been removed (worker death,
eviction, MarkOffline reaping, frontend restart with stale heartbeats),
which is the read side of Bug-1 from the distributed mode bug hunt.

Registry exposes:
  - UpsertModelLoadInfo: ON CONFLICT (model_name) update; last-write-wins,
    matching the existing per-replica blob semantics under concurrent
    multi-frontend dispatch.
  - GetModelLoadInfo: read from the new table first; fall back to the
    legacy NodeModel-blob scan for rows written before any frontend in
    the cluster ran an UpsertModelLoadInfo (rolling-upgrade transition).

SetNodeModelLoadInfo (per-replica blob) is preserved for backward
compatibility and per-replica diagnostics; the dispatch-path hook in the
next commit calls both.

The new table joins the existing nodes AutoMigrate set under the same
schema-migration advisory lock.

Refs: Bug-1, docs/superpowers/specs/2026-05-24-distributed-mode-bug-hunt-findings.md

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7[1m]

scheduleAndLoad now writes the (backendType, ModelOptions blob) pair to
the new ModelLoadInfo table in addition to the existing per-replica
NodeModel.model_opts_blob field. The per-replica blob still works for
the hot path; the per-model row outlives every NodeModel row going away,
which is what unblocks the reconciler on the read side.

Both writes are best-effort with warn-level logging on failure: a write
miss here just means the reconciler may need a fresh inference request
to repopulate, which is the pre-fix behavior.

Concurrency: two frontends loading the same model at the same time both
fire UpsertModelLoadInfo; ON CONFLICT (model_name) makes the row
converge to whichever commits last. Matches the existing per-replica
blob semantics.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7[1m]

Adds Ginkgo specs that prove the persistence layer behaves correctly and
that the reconciler actually recovers from the frontend-restart scenario
that was failing in production:

registry_test.go:
  - per-model row survives RemoveAllNodeModelReplicas (the bug repro)
  - ON CONFLICT (model_name) updates backend type + blob, last-write-wins
  - legacy NodeModel-blob fallback still works (rolling-upgrade transition)
  - GetModelLoadInfo returns ErrRecordNotFound when both sources are empty
  - UpsertModelLoadInfo rejects empty model names

reconciler_test.go:
  - Bug-1 end-to-end: with min_replicas=2, no NodeModel rows, but a
    ModelLoadInfo row present, one reconcile tick fires two scheduler
    calls. Pre-fix this returned "no load info" and the scheduler never
    got called until a fresh inference request arrived.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7[1m]

Adds a bullet to the Replica Reconciler section explaining that per-model
load metadata is persisted across frontend restarts via the new
model_load_infos PostgreSQL table, so a rolling upgrade no longer needs a
fresh inference request per model before the reconciler can replace dead
replicas.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7[1m]

mudler force-pushed the fix/distributed-reconciler-load-info-persistence branch from c9468fd to 558977e on May 25, 2026 10:49
mudler merged commit a891eed into master on May 25, 2026
56 checks passed
mudler deleted the fix/distributed-reconciler-load-info-persistence branch on May 25, 2026 11:00

2 participants