fix(distributed): sync gallery OpCache + caches across frontend replicas #9983
Merged
When the LocalAI frontend deployment is scaled past one replica, the UI's
/api/operations poll round-robins between pods. Each pod kept the OpCache
(galleryID->jobID), OpStatus map, and the post-install in-memory caches
(ModelConfigLoader, UpgradeChecker) purely in-process. Reads never
consulted PostgreSQL or NATS even though writes already published to PG.
Symptoms:
- A user installing a model on replica A saw the operation card flicker
in and out as the load balancer alternated.
- The Models page re-fetched the whole gallery on every flicker because
useEffect([operations.length]) re-fires when the count changes.
- A chat completion that landed on replica B after the install completed
on replica A failed to find the new model — B's ModelConfigLoader was
still the old one because nothing told it to reload from disk.
- The UpgradeChecker 6-hour cache stayed stale on peer replicas after a
backend upgrade, so /api/backends/upgrades kept surfacing an upgrade
that had already shipped.
Mirror the jobs Dispatcher pattern for gallery ops:
- OpCache learns SetMessagingClient/SetGalleryStore + a Start(ctx) that
hydrates from PostgreSQL and subscribes to gallery.opcache.{start,end}.
Set/SetBackend now upsert cache_key + is_backend_op on the
gallery_operations row and broadcast OpCacheEvent so peers merge it in. The
hydrate path uses a new GalleryStore.ListActive() (status in {pending,
downloading, processing} and updated within 30 min).
- GalleryService.SubscribeBroadcasts wires a
SubjectGalleryProgressWildcard subscriber that calls a new lock-light mergeStatus into the
local statuses map, plus a SubjectGalleryCancelWildcard subscriber that
runs the locally-registered cancel func. Hydrate() restores active rows
from PostgreSQL on startup so a freshly-started replica is not
observably empty mid-install. CancelOperation tolerates the cancel func
living on a different replica and publishes anyway.
- modelHandler and backendHandler publish on the new
SubjectCacheInvalidateModels / SubjectCacheInvalidateBackends after
a successful install/delete/upgrade. SubscribeBroadcasts wires peers
to refresh: OnModelsChanged (re-runs LoadModelConfigsFromPath) and
OnBackendOpCompleted (re-triggers UpgradeChecker). The originating
replica reloads inline so it never enters the broadcast handler.
- OpStatus.Error (an error interface) flat-marshalled to "{}" over JSON,
so a failed install replicated to a peer arrived with a nil error and
the UI's failure banner never appeared. Add MarshalJSON/UnmarshalJSON
via an opStatusWire shim that round-trips Error as a string.
- UpdateStatus and CancelOperation now drop the mutex before publishing
to NATS or persisting to PostgreSQL. The wildcard subscriber's
mergeStatus loops back into the same service on the publishing replica
and would deadlock otherwise; this also prevents future PG round-trips
from stalling concurrent readers on every progress tick.
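A minimal sketch of the unlock-before-publish ordering from the last bullet, not the code in this PR. Service, Status, and the publish field are hypothetical stand-ins; publish loops straight back into mergeStatus to imitate the wildcard subscriber firing on the publishing replica.

```go
package main

import (
	"fmt"
	"sync"
)

// Status is a hypothetical, trimmed-down stand-in for OpStatus.
type Status struct {
	Progress float64
	Message  string
}

// Service is a hypothetical stand-in for GalleryService.
type Service struct {
	mu       sync.Mutex
	statuses map[string]Status
	// publish stands in for the NATS publish. In this sketch it calls
	// mergeStatus synchronously, as a loopback subscriber would.
	publish func(jobID string, st Status)
}

func NewService() *Service {
	s := &Service{statuses: map[string]Status{}}
	s.publish = func(jobID string, st Status) { s.mergeStatus(jobID, st) }
	return s
}

// UpdateStatus mutates local state under the lock, then releases the
// lock before publishing.
func (s *Service) UpdateStatus(jobID string, st Status) {
	s.mu.Lock()
	s.statuses[jobID] = st
	s.mu.Unlock() // released before any I/O or callback

	s.publish(jobID, st)
}

// mergeStatus takes the same mutex. sync.Mutex is not reentrant, so if
// UpdateStatus still held the lock while publishing, this call would
// block forever.
func (s *Service) mergeStatus(jobID string, st Status) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.statuses[jobID] = st
}

// Get returns the locally known status for a job.
func (s *Service) Get(jobID string) (Status, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	st, ok := s.statuses[jobID]
	return st, ok
}

func main() {
	s := NewService()
	s.UpdateStatus("job-1", Status{Progress: 42, Message: "downloading"})
	st, _ := s.Get("job-1")
	fmt.Println(st.Progress, st.Message)
}
```

Moving the Unlock below the publish call in this sketch makes UpdateStatus hang on the first tick, which is the deadlock the bullet describes.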
Tests cover the OpStatus error round-trip, OpCache propagation through a
shared in-memory bus, OpCache PostgreSQL hydration (active-only),
GalleryService progress + cancel broadcast, Nodes preservation across a
peer's bare progress tick, GalleryService hydration from PG, and the
two cache-invalidation broadcasts (models + backends). 44 specs total
in galleryop; routes/operations specs and jobs/agents suites still pass.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7
Summary
When the LocalAI frontend is scaled past one replica, several pieces of in-memory state remained per-process even though writes already went through the shared PostgreSQL + NATS layer. This PR closes the read-side gap.
- `/api/operations` (1Hz poll): operations flickered in and out as the load balancer alternated. The Models page re-fetched the whole gallery on every flicker because its `useEffect([operations.length])` re-fires when the count changes.
- `/api/backends/upgrades` kept surfacing an upgrade that had already shipped.
This mirrors the `jobs.Dispatcher` distributed pattern for gallery ops: NATS wildcard subscriptions for live sync, PostgreSQL `gallery_operations` hydration on startup, plus broadcast invalidation for caches that mirror disk state.
What's in scope
- `OpCache` gains `SetMessagingClient` / `SetGalleryStore` / `Start(ctx)`. `Set` / `SetBackend` upsert `cache_key` + `is_backend_op` and broadcast `OpCacheEvent` on `gallery.opcache.start`. `DeleteUUID` broadcasts on `gallery.opcache.end`. `Start` hydrates from PostgreSQL via the new `GalleryStore.ListActive()`, then subscribes to both subjects.
- `GalleryService.SubscribeBroadcasts` subscribes to `SubjectGalleryProgressWildcard` (new lock-light `mergeStatus`), `SubjectGalleryCancelWildcard` (new `applyCancel`), and the new `SubjectCacheInvalidateModels` / `SubjectCacheInvalidateBackends`. Two hooks (the new `OnModelsChanged` and the existing `OnBackendOpCompleted`) get wired in `startup.go` to `LoadModelConfigsFromPath` and `UpgradeChecker.TriggerCheck` respectively. `Hydrate()` restores active rows on startup. `CancelOperation` tolerates the cancel func living on a different replica.
- `modelHandler` and `backendHandler` publish on the cache-invalidation subjects after their successful local reload, so peers refresh too. The originating replica reloads inline and never enters the broadcast handler.
- `OpStatus.MarshalJSON` / `UnmarshalJSON` round-trip the error as a string via an internal `opStatusWire`. The published payload now wraps the status in `GalleryProgressEvent{JobID, Status}` so wildcard subscribers don't need to parse the subject string.
- `GalleryOperationRecord` gains `CacheKey` + `IsBackendOp` columns; `Create` uses `OnConflict`-`DoUpdates` so `OpCache.Set` racing ahead of the service goroutine no longer breaks the row.
- `core/http/app.go` wires the OpCache distributed setters when `application.Distributed() != nil`.
- `core/application/startup.go` calls `Hydrate` + sets `OnModelsChanged` + `SubscribeBroadcasts` right after `SetGalleryStore` / `SetNATSClient`.
Test plan
- `go test ./core/services/galleryop/...`: 44/44 specs pass (new spec groups: OpStatus JSON wire format, OpCache distributed sync, OpCache PostgreSQL hydration, GalleryService broadcast sync, cache-invalidation broadcasts, GalleryService PostgreSQL hydration)
- `go test ./core/http/routes/...`: passes
- `go test ./core/services/jobs/...`: passes
- `go test ./core/services/agents/...`: passes
- `go build ./core/... && go vet ./core/...`: clean
- `golangci-lint run` on changed packages: clean
- Scale the `local-ai` Deployment to 2 replicas and install a model: the operation card should remain stable across the 1Hz poll, the Models page should not continuously refetch, and a chat completion routed to the non-installing replica should find the model.