fix(distributed): broadcast file-staging progress across replicas#10440

Merged
mudler merged 1 commit into master from fix/distributed-staging-progress-broadcast
Jun 22, 2026

@localai-bot

Collaborator

Problem

File-staging progress lived only in the SmartRouter's in-memory StagingTracker on the replica performing the transfer. In a multi-replica deployment behind a round-robin load balancer, a /api/operations poll that lands on any other replica saw no staging row, so the progress line (processing file ... Total ... Current ...) flickered in and out as polls rotated between frontends.

This is the same class of cross-replica problem as gallery-install progress (already solved via NATS broadcast + merge), but staging never got the equivalent treatment.

Fix

Mirror the gallery-install pattern:

  • The origin replica broadcasts staging ticks over NATS on a new staging.<model>.progress subject (SubjectStagingProgress).
  • Peers subscribe to the wildcard (SubscribeBroadcasts) and merge via ApplyRemote.
  • Byte-level ticks are leading-edge debounced (~1/s); Start/FileComplete/Complete always publish so peers never miss a transition.
  • A locally-owned op stays authoritative: the origin's own echo and any stray peer event can't clobber or delete it.
  • Mirrored remote ops expire after a TTL, so a missed Done event (NATS is fire-and-forget) can't leave a phantom row.

The UI read path (StagingTracker.GetAll, consumed by /api/operations) is unchanged.

Test

staging_progress_broadcast_test.go: a peer tracker surfaces an op it did not originate after merging broadcasts; the op is removed on completion; a locally-owned op is not clobbered by peer events; standalone mode (no publisher) does not broadcast. Full core/services/nodes suite passes; golangci-lint --new-from-merge-base=origin/master reports 0 issues.

Related

Companion to #10438 (staging context detach). Both came out of the same multi-replica deployment investigation; this one is the cosmetic flicker, #10438 is the model-load outage.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

File-staging progress lived only in the SmartRouter's in-memory
StagingTracker on the replica performing the transfer. In a multi-replica
deployment behind a round-robin load balancer, a /api/operations poll
that lands on any other replica saw no staging row, so the progress
("processing file ... Total ... Current ...") flickered in and out as
polls rotated between frontends.

Mirror the pattern already used for gallery-install progress: the origin
replica broadcasts staging ticks over NATS (SubjectStagingProgress, a
new staging.<model>.progress subject), and peers merge them via
ApplyRemote (SubscribeBroadcasts on the wildcard). Byte-level ticks are
leading-edge debounced (~1/s); Start/FileComplete/Complete always
publish. A locally-owned op stays authoritative so the origin's own echo
and stray peer events can't clobber it, and mirrored remote ops expire
after a TTL so a missed Done event can't leave a phantom row. The UI read
path (StagingTracker.GetAll) is unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
@mudler merged commit 569d9bb into master Jun 22, 2026
59 checks passed
@mudler deleted the fix/distributed-staging-progress-broadcast branch June 22, 2026 07:28