
fix(nodes): make per-node backend install async via gallery job queue #9928

Merged: mudler merged 11 commits into master from worktree-async-node-backend-install on May 21, 2026


@localai-bot
Collaborator

Summary

POST /api/nodes/:id/backends/install now returns HTTP 202 with a jobID immediately, instead of blocking the request for up to 3 minutes while the worker downloads and registers the backend. This unfreezes the React UI when installing on one or more nodes from the Backends picker.

The change is wired through the gallery service's existing async job queue, the same pattern /api/backends/install/:id already uses:

  • galleryop.ManagementOp gains a TargetNodeID field so a single ManagementOp enqueued on BackendGalleryChannel can be scoped to one worker
  • DistributedBackendManager.InstallBackend builds a one-element targetNodeIDs allowlist when TargetNodeID is set, reusing the same path UpgradeBackend already takes
  • The HTTP handler enqueues the op, stores a node-scoped opcache row (node:<nodeID>:<backend>) so concurrent installs on different nodes don't collide, and returns { jobID, statusUrl, message }
  • /api/operations now surfaces a nodeID field for node-scoped ops so the Operations panel can render attribution (and the bare backend slug shows in name instead of the prefixed key)
  • NodeInstallPicker dispatches all installs in parallel, then polls /api/backends/job/:uid per job (1.5s interval, 6 min hard cap) until each settles; the modal stays closeable mid-install

Test plan

  • In distributed mode, open the Backends picker on a meta backend, select one node, click Install: HTTP 202 with jobID (Network tab), row shows "Installing" immediately, Operations panel surfaces the job with nodeID, row eventually flips to "Installed"
  • Install the same backend on two different nodes concurrently: two distinct jobs in the Operations panel, both complete independently, neither stomps the other's opcache row
  • Install a backend whose gallery doesn't resolve on the target node: row flips to "Failed" with the worker's error in the tooltip; "Retry failed nodes" re-runs
  • Verify backwards compatibility: POST /api/backends/install/:id (global, no node target) still fans out across the cluster unchanged
  • Verify the picker modal can be closed while installs are in flight without losing visibility (Operations panel keeps tracking)

Follow-ups (not blockers)

  • Orphan opcache rows are not garbage-collected automatically if the handler's dispatch goroutine never drains (panic, process restart). /api/operations/:jobID/dismiss lets users clear them manually.
  • A concurrent duplicate dispatch of the same (nodeID, backend) pair leaks the first jobID in galleryService.statuses until the process restarts. The JS picker dedupes via its selected Set, so the realistic UI flow is safe; this PR adds no server-side dedupe.

Assisted-by: Claude:opus-4-7 [Edit] [Bash] [Agent]

mudler added 11 commits May 21, 2026 19:06
…talls

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…t empty nodeID

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
… node via TargetNodeID

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
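
The TargetNodeID scoping described in the summary (a one-element allowlist when the field is set, full fan-out otherwise) might look roughly like this. It is a sketch under assumptions: `ManagementOp` here is a pared-down stand-in for galleryop.ManagementOp, and `targetAllowlist` and `selectNodes` are hypothetical helper names, not functions from the codebase.

```go
package main

// ManagementOp is a pared-down, hypothetical stand-in for galleryop.ManagementOp.
type ManagementOp struct {
	Backend      string
	TargetNodeID string // empty means "install on every node"
}

// targetAllowlist returns nil for a cluster-wide op, or a one-element
// allowlist when the op is scoped to a single node.
func targetAllowlist(op ManagementOp) []string {
	if op.TargetNodeID == "" {
		return nil
	}
	return []string{op.TargetNodeID}
}

// selectNodes filters the known node IDs by the allowlist.
// A nil allowlist selects every node, preserving the legacy fan-out.
func selectNodes(all, allow []string) []string {
	if allow == nil {
		return append([]string(nil), all...)
	}
	allowed := make(map[string]struct{}, len(allow))
	for _, id := range allow {
		allowed[id] = struct{}{}
	}
	var out []string
	for _, id := range all {
		if _, ok := allowed[id]; ok {
			out = append(out, id)
		}
	}
	return out
}
```

Treating a nil allowlist as "all nodes" keeps the global install path unchanged, so only ops that set TargetNodeID see different behaviour.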
…rvice job queue

The handler previously called unloader.InstallBackend synchronously and
blocked the browser for up to 3 minutes waiting on the NATS reply. It now
enqueues a TargetNodeID-scoped ManagementOp on BackendGalleryChannel and
returns HTTP 202 + jobID immediately, matching /api/backends/install/:id.

The opcache key is built via NodeScopedKey(nodeID, backend) so concurrent
installs of the same backend across different nodes do not stomp each
other. galleryService/opcache/appConfig are threaded through
RegisterNodeAdminRoutes for this.

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
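
The enqueue-and-return-202 flow from the commit message above can be sketched with the standard library. Several things here are assumptions made for illustration rather than the project's real code: the plain `net/http` handler (using `r.PathValue`, Go 1.22+), the `{"backend": "..."}` request body, the 503 response when the queue is full, the statusUrl shape, and the names `ManagementOp`, `enqueueNodeInstall`, and `InstallHandler`.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"encoding/json"
	"errors"
	"net/http"
)

// ManagementOp is a pared-down, hypothetical stand-in for galleryop.ManagementOp.
type ManagementOp struct {
	ID           string
	Backend      string
	TargetNodeID string
}

// installResponse mirrors the { jobID, statusUrl, message } shape from the PR summary.
type installResponse struct {
	JobID     string `json:"jobID"`
	StatusURL string `json:"statusUrl"`
	Message   string `json:"message"`
}

var errQueueFull = errors.New("install queue is full")

// newJobID returns a random hex identifier for the job.
func newJobID() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// enqueueNodeInstall puts a node-scoped op on the queue without blocking
// and returns the payload the client needs to track the job.
func enqueueNodeInstall(queue chan<- ManagementOp, nodeID, backend string) (installResponse, error) {
	id, err := newJobID()
	if err != nil {
		return installResponse{}, err
	}
	select {
	case queue <- ManagementOp{ID: id, Backend: backend, TargetNodeID: nodeID}:
	default:
		return installResponse{}, errQueueFull // assumption: reject rather than block
	}
	return installResponse{
		JobID:     id,
		StatusURL: "/api/backends/job/" + id, // assumed to match the polled status route
		Message:   "backend install queued",
	}, nil
}

// InstallHandler answers 202 as soon as the op is queued; the worker's
// download and registration happen later, off the request path.
func InstallHandler(queue chan<- ManagementOp) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var body struct {
			Backend string `json:"backend"`
		}
		if err := json.NewDecoder(r.Body).Decode(&body); err != nil || body.Backend == "" {
			http.Error(w, "backend is required", http.StatusBadRequest)
			return
		}
		nodeID := r.PathValue("id") // requires a mux pattern such as "POST /api/nodes/{id}/backends/install"
		if nodeID == "" {
			http.Error(w, "node id is required", http.StatusBadRequest)
			return
		}
		resp, err := enqueueNodeInstall(queue, nodeID, body.Backend)
		if errors.Is(err, errQueueFull) {
			http.Error(w, err.Error(), http.StatusServiceUnavailable)
			return
		}
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusAccepted)
		_ = json.NewEncoder(w).Encode(resp)
	}
}
```

Keeping the enqueue step in its own function makes the non-blocking behaviour testable without an HTTP server, and the handler itself never waits on the worker's reply.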
…t drain goroutine

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Node-scoped backend installs land in opcache under "node:<nodeID>:<backend>"
keys. Without splitting that prefix back out, the operations panel renders
the full key as the display name and has no structured way to label which
worker an install is targeting. Detect the prefix, surface nodeID as its own
response field, and reduce the display name back to the bare backend slug.
Bare (non-scoped) ops are left untouched so legacy installs do not gain a
misleading empty nodeID.

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
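
The key scheme from the commit message above ("node:<nodeID>:<backend>", split back into a nodeID field and a bare backend name) can be illustrated as below. `NodeScopedKey` is named in this PR, but its body here is a guess from the documented key format; `SplitNodeScopedKey` and `OperationView` are hypothetical helpers, and the sketch assumes node IDs never contain a colon.

```go
package main

import "strings"

const nodeKeyPrefix = "node:"

// NodeScopedKey builds the opcache key for an install targeting one node.
func NodeScopedKey(nodeID, backend string) string {
	return nodeKeyPrefix + nodeID + ":" + backend
}

// SplitNodeScopedKey reverses NodeScopedKey. For bare (non-scoped) keys it
// returns ok == false and the key unchanged as the backend name.
// Assumption: node IDs contain no colon; backend names may.
func SplitNodeScopedKey(key string) (nodeID, backend string, ok bool) {
	rest, found := strings.CutPrefix(key, nodeKeyPrefix)
	if !found {
		return "", key, false
	}
	nodeID, backend, found = strings.Cut(rest, ":")
	if !found || nodeID == "" || backend == "" {
		return "", key, false
	}
	return nodeID, backend, true
}

// OperationView shapes one opcache key for an operations listing.
// The nodeID field is added only for node-scoped ops, so legacy
// installs do not gain an empty nodeID.
func OperationView(key string) map[string]string {
	view := map[string]string{"name": key}
	if nodeID, backend, ok := SplitNodeScopedKey(key); ok {
		view["name"] = backend
		view["nodeID"] = nodeID
	}
	return view
}
```

Because the node ID is part of the key, installing the same backend on two nodes yields two distinct opcache rows, while a bare key passes through `OperationView` untouched.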
Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…cancellations as errors

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…ch codebase precedent

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@mudler merged commit a39e025 into master on May 21, 2026
57 checks passed
@mudler deleted the worktree-async-node-backend-install branch on May 21, 2026 at 20:25
@localai-bot added the bug (Something isn't working) label on May 24, 2026