fix(distributed): sync gallery OpCache + caches across frontend replicas#9983

Merged
mudler merged 1 commit into master from worktree-distributed-gallery-ops on May 25, 2026

Conversation

@localai-bot
Collaborator

@localai-bot localai-bot commented May 25, 2026

Summary

When scaling the LocalAI frontend past one replica, several pieces of in-memory state were per-process while writes already touched the shared PostgreSQL + NATS layer. This PR closes the read-side gap.

  • OpCache + OpStatus (UI's /api/operations 1Hz poll): operations flickered in and out as the load balancer alternated. The Models page re-fetched the whole gallery on every flicker because its useEffect([operations.length]) re-fires when the count changes.
  • ModelConfigLoader: a chat completion that landed on replica B after the install completed on replica A failed to find the new model — B's loader was still the old one.
  • UpgradeChecker: 6-hour cache stayed stale on peer replicas after a backend upgrade, so /api/backends/upgrades kept surfacing an upgrade that had already shipped.

This mirrors the jobs.Dispatcher distributed pattern for gallery ops: NATS wildcard subscriptions for live sync, PostgreSQL gallery_operations hydration on startup, plus broadcast invalidation for caches that mirror disk state.
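
For reviewers who want the hydration rule in concrete terms: the commit message describes GalleryStore.ListActive() as returning rows whose status is pending, downloading, or processing and that were updated within the last 30 minutes. Below is a minimal in-memory sketch of that predicate, not the actual SQL-backed implementation. The `operationRow` type, the helper names, the `completed` status value, and the inclusive cutoff are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// operationRow is a hypothetical, trimmed-down view of a gallery_operations row.
type operationRow struct {
	ID        string
	Status    string
	UpdatedAt time.Time
}

// activeWindow mirrors the 30-minute freshness rule from the commit message.
const activeWindow = 30 * time.Minute

// isActive reports whether a row is in-flight and recent enough to restore.
// Treating the cutoff as inclusive is an assumption of this sketch.
func isActive(r operationRow, now time.Time) bool {
	switch r.Status {
	case "pending", "downloading", "processing":
		return !r.UpdatedAt.Before(now.Add(-activeWindow))
	default:
		return false
	}
}

// listActive filters rows the way a startup hydration pass might.
func listActive(rows []operationRow, now time.Time) []operationRow {
	var active []operationRow
	for _, r := range rows {
		if isActive(r, now) {
			active = append(active, r)
		}
	}
	return active
}

// demoActiveIDs runs the filter over three sample rows at a fixed clock.
func demoActiveIDs() []string {
	now := time.Date(2026, 5, 25, 12, 0, 0, 0, time.UTC)
	rows := []operationRow{
		{ID: "fresh-download", Status: "downloading", UpdatedAt: now.Add(-2 * time.Minute)},
		{ID: "stale-download", Status: "downloading", UpdatedAt: now.Add(-45 * time.Minute)},
		{ID: "finished", Status: "completed", UpdatedAt: now.Add(-1 * time.Minute)}, // hypothetical terminal status
	}
	var ids []string
	for _, r := range listActive(rows, now) {
		ids = append(ids, r.ID)
	}
	return ids
}

func main() {
	fmt.Println(demoActiveIDs())
}
```

In this sketch only the fresh in-flight row survives; the stale download and the finished operation are dropped, which is the behaviour that keeps a restarted replica from resurrecting abandoned operations.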

What's in scope

  • OpCache gains SetMessagingClient / SetGalleryStore / Start(ctx). Set/SetBackend upsert cache_key + is_backend_op and broadcast OpCacheEvent on gallery.opcache.start. DeleteUUID broadcasts on gallery.opcache.end. Start hydrates from PostgreSQL via the new GalleryStore.ListActive() then subscribes to both subjects.
  • GalleryService.SubscribeBroadcasts subscribes to SubjectGalleryProgressWildcard (new lock-light mergeStatus), SubjectGalleryCancelWildcard (new applyCancel), and the new SubjectCacheInvalidateModels / SubjectCacheInvalidateBackends. Two hooks (the new OnModelsChanged and the existing OnBackendOpCompleted) get wired in startup.go to LoadModelConfigsFromPath and UpgradeChecker.TriggerCheck respectively. Hydrate() restores active rows on startup. CancelOperation tolerates the cancel func living on a different replica.
  • modelHandler and backendHandler publish on the cache-invalidation subjects after their successful local reload, so peers refresh too. The originating replica reloads inline and never enters the broadcast handler.
  • OpStatus.MarshalJSON/UnmarshalJSON round-trip the error as a string via an internal opStatusWire. The published payload now wraps the status in GalleryProgressEvent{JobID, Status} so wildcard subscribers don't need to parse the subject string.
  • GalleryOperationRecord gains CacheKey + IsBackendOp columns; Create uses OnConflict-DoUpdates so OpCache.Set racing ahead of the service goroutine no longer breaks the row.
  • core/http/app.go wires the OpCache distributed setters when application.Distributed() != nil. core/application/startup.go calls Hydrate + sets OnModelsChanged + SubscribeBroadcasts right after SetGalleryStore/SetNATSClient.
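
To illustrate the OpStatus wire-format item above, here is a minimal sketch of the problem and the shim. The field set, JSON tags, and helper functions are assumptions for illustration and are much smaller than the real structs; only the names OpStatus, opStatusWire, and GalleryProgressEvent{JobID, Status} come from this PR.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// OpStatus is a hypothetical, trimmed-down stand-in for the real struct.
type OpStatus struct {
	Message string
	Error   error
}

// opStatusWire is the JSON-facing mirror: the error travels as a string.
type opStatusWire struct {
	Message string `json:"message"`
	Error   string `json:"error,omitempty"`
}

// MarshalJSON flattens the error interface into its message.
func (s OpStatus) MarshalJSON() ([]byte, error) {
	w := opStatusWire{Message: s.Message}
	if s.Error != nil {
		w.Error = s.Error.Error()
	}
	return json.Marshal(w)
}

// UnmarshalJSON rebuilds a plain error from the string, if one was sent.
func (s *OpStatus) UnmarshalJSON(data []byte) error {
	var w opStatusWire
	if err := json.Unmarshal(data, &w); err != nil {
		return err
	}
	*s = OpStatus{Message: w.Message}
	if w.Error != "" {
		s.Error = errors.New(w.Error)
	}
	return nil
}

// GalleryProgressEvent carries the job ID in the payload, so a wildcard
// subscriber does not have to parse it out of the subject.
type GalleryProgressEvent struct {
	JobID  string   `json:"job_id"`
	Status OpStatus `json:"status"`
}

// naiveJSON shows what happens without the shim: the error's fields are
// unexported, so encoding/json emits an empty object for it.
func naiveJSON(msg string) string {
	b, _ := json.Marshal(struct{ Error error }{Error: errors.New(msg)})
	return string(b)
}

// roundTripError publishes and re-reads an event, returning what a peer sees.
func roundTripError(jobID, msg string) (string, string) {
	payload, err := json.Marshal(GalleryProgressEvent{
		JobID:  jobID,
		Status: OpStatus{Message: "install failed", Error: errors.New(msg)},
	})
	if err != nil {
		return "", ""
	}
	var out GalleryProgressEvent
	if err := json.Unmarshal(payload, &out); err != nil || out.Status.Error == nil {
		return out.JobID, ""
	}
	return out.JobID, out.Status.Error.Error()
}

func main() {
	fmt.Println(naiveJSON("checksum mismatch"))
	fmt.Println(roundTripError("job-1", "checksum mismatch"))
}
```

The round-trip loses the concrete error type (the peer gets a plain string-backed error), which seems acceptable here because the UI only needs the message for its failure banner.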

Test plan

  • go test ./core/services/galleryop/... — 44/44 specs pass (6 new spec groups: OpStatus JSON wire format, OpCache distributed sync, OpCache PostgreSQL hydration, GalleryService broadcast sync, cache-invalidation broadcasts, GalleryService PostgreSQL hydration)
  • go test ./core/http/routes/... — passes
  • go test ./core/services/jobs/... — passes
  • go test ./core/services/agents/... — passes
  • go build ./core/... && go vet ./core/... — clean
  • golangci-lint run on changed packages — clean
  • Manual: scale local-ai Deployment to 2 replicas, install a model — operation card should remain stable across the 1Hz poll, Models page should not continuously refetch, and a chat completion routed to the non-installing replica should find the model.

@mudler mudler force-pushed the worktree-distributed-gallery-ops branch from c4ae628 to f822277 Compare May 25, 2026 13:48
When the LocalAI frontend deployment is scaled past one replica, the UI's
/api/operations poll round-robins between pods. Each pod kept the OpCache
(galleryID->jobID), OpStatus map, and the post-install in-memory caches
(ModelConfigLoader, UpgradeChecker) purely in-process. Reads never
consulted PostgreSQL or NATS even though writes already published to PG.
Symptoms:

- A user installing a model on replica A saw the operation card flicker
  in and out as the load balancer alternated.
- The Models page re-fetched the whole gallery on every flicker because
  useEffect([operations.length]) re-fires when the count changes.
- A chat completion that landed on replica B after the install completed
  on replica A failed to find the new model — B's ModelConfigLoader was
  still the old one because nothing told it to reload from disk.
- The UpgradeChecker 6-hour cache stayed stale on peer replicas after a
  backend upgrade, so /api/backends/upgrades kept surfacing an upgrade
  that had already shipped.

Mirror the jobs Dispatcher pattern for gallery ops:

- OpCache learns SetMessagingClient/SetGalleryStore + a Start(ctx) that
  hydrates from PostgreSQL and subscribes to gallery.opcache.{start,end}.
  Set/SetBackend now upsert cache_key + is_backend_op on the
  gallery_operations row and broadcast OpCacheEvent so peers merge it in. The
  hydrate path uses a new GalleryStore.ListActive() (status in {pending,
  downloading, processing} and updated within 30 min).
- GalleryService.SubscribeBroadcasts wires a
  SubjectGalleryProgressWildcard subscriber that calls a new lock-light
  mergeStatus into the local statuses map, plus a
  SubjectGalleryCancelWildcard subscriber that
  runs the locally-registered cancel func. Hydrate() restores active rows
  from PostgreSQL on startup so a freshly-started replica is not
  observably empty mid-install. CancelOperation tolerates the cancel func
  living on a different replica and publishes anyway.
- modelHandler and backendHandler publish on the new
  SubjectCacheInvalidateModels / SubjectCacheInvalidateBackends after
  a successful install/delete/upgrade. SubscribeBroadcasts wires peers
  to refresh: OnModelsChanged (re-runs LoadModelConfigsFromPath) and
  OnBackendOpCompleted (re-triggers UpgradeChecker). The originating
  replica reloads inline so it never enters the broadcast handler.
- OpStatus.Error (an error interface) flat-marshalled to "{}" over JSON,
  so a failed install replicated to a peer arrived with a nil error and
  the UI's failure banner never appeared. Add MarshalJSON/UnmarshalJSON
  via an opStatusWire shim that round-trips Error as a string.
- UpdateStatus and CancelOperation now drop the mutex before publishing
  to NATS or persisting to PostgreSQL. The wildcard subscriber's
  mergeStatus loops back into the same service on the publishing replica
  and would deadlock otherwise; this also prevents future PG round-trips
  from stalling concurrent readers on every progress tick.
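
The locking change in the last bullet can be sketched as follows. This is a minimal model, not the service code: `progressEvent`, `publisher`, `loopbackBus`, and the string-valued status map are hypothetical, and the bus delivers synchronously on the caller's goroutine to model the worst case, which a real NATS client may not do.

```go
package main

import (
	"fmt"
	"sync"
)

// progressEvent and publisher are hypothetical stand-ins for the real
// payload and messaging client.
type progressEvent struct {
	JobID   string
	Message string
}

type publisher interface {
	Publish(ev progressEvent)
}

type galleryService struct {
	mu       sync.Mutex
	statuses map[string]string
	bus      publisher
}

// mergeStatus is what the wildcard subscriber calls for every event,
// including events this same replica just published.
func (s *galleryService) mergeStatus(ev progressEvent) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.statuses[ev.JobID] = ev.Message
}

// UpdateStatus mutates local state under the lock, then releases it
// BEFORE publishing. Holding the lock across Publish would deadlock as
// soon as the event loops back into mergeStatus.
func (s *galleryService) UpdateStatus(jobID, message string) {
	s.mu.Lock()
	s.statuses[jobID] = message
	s.mu.Unlock()

	s.bus.Publish(progressEvent{JobID: jobID, Message: message})
}

// Status reads a single entry under the lock.
func (s *galleryService) Status(jobID string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.statuses[jobID]
}

// loopbackBus hands each event straight back to the subscriber.
type loopbackBus struct{ deliver func(progressEvent) }

func (b loopbackBus) Publish(ev progressEvent) { b.deliver(ev) }

// newLoopbackService wires a service to a bus that echoes to itself.
func newLoopbackService() *galleryService {
	s := &galleryService{statuses: map[string]string{}}
	s.bus = loopbackBus{deliver: s.mergeStatus}
	return s
}

func main() {
	s := newLoopbackService()
	s.UpdateStatus("job-1", "downloading")
	fmt.Println(s.Status("job-1"))
}
```

Moving the Unlock below the Publish call in this sketch makes the program hang on the second Lock, since sync.Mutex is not reentrant.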

Tests cover the OpStatus error round-trip, OpCache propagation through a
shared in-memory bus, OpCache PostgreSQL hydration (active-only),
GalleryService progress + cancel broadcast, Nodes preservation across a
peer's bare progress tick, GalleryService hydration from PG, and the
two cache-invalidation broadcasts (models + backends). 44 specs total
in galleryop; routes/operations specs and jobs/agents suites still pass.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7
@mudler mudler force-pushed the worktree-distributed-gallery-ops branch from f822277 to aaa474c Compare May 25, 2026 14:57
@localai-bot localai-bot changed the title fix(distributed): sync gallery OpCache + statuses across frontend replicas fix(distributed): sync gallery OpCache + caches across frontend replicas May 25, 2026
@mudler mudler merged commit 8d6548c into master May 25, 2026
57 checks passed
@mudler mudler deleted the worktree-distributed-gallery-ops branch May 25, 2026 15:28