feat: surface distributed backend management errors#9552
Merged
Conversation
DistributedBackendManager.{Install,Upgrade,Delete}Backend discarded the
per-node BackendOpResult from enqueueAndDrainBackendOp with `_, err :=`.
When workers replied Success=false (e.g. an OCI image with no arm64
variant on a Jetson host), the per-node Error string was recorded in
result.Nodes[].Error but never reached the toplevel return value, so
OpStatus.Error stayed empty and the UI reported the install as
"completed" while the backend was nowhere on the cluster.
Add BackendOpResult.Err() that aggregates per-node Status=="error"
entries into a single error. Queued nodes (waiting for reconciler retry)
are deliberately not treated as failures. Wire the three callers and
DeleteBackendDetailed to call result.Err() so reply.Success=false
finally reaches OpStatus.Error → /api/backends/job/:uid → the UI.
The Delete closures had a related bug: they discarded the reply with
`_` and only checked the NATS round-trip error, so reply.Success=false
was a silent success even with the new aggregation. Check both.
Standalone mode (LocalBackendManager) already surfaces gallery errors
correctly through the same OpStatus.Error path; no change needed there.
Tests: 9 new Ginkgo specs covering all-success / all-fail with distinct
errors / mixed / all-queued / no-nodes for Install, Upgrade, Delete.
Assisted-by: Claude:claude-opus-4-7 [Bash] [Edit] [Read] [Write]
The Nodes page exposed a per-node "reinstall" button (fa-sync-alt, tooltip "Reinstall backend") but no per-node delete, even though the Go side has had POST /api/nodes/:id/backends/delete → RemoteUnloaderAdapter.DeleteBackend → NATS-to-specific-node wired up for a while. Sync icons read as "refresh data" — the action is functionally an upgrade (re-pulls the gallery image), so the affordance was misleading. Per-node backend row now renders two icon buttons: - Upgrade: btn-secondary btn-sm + fa-arrow-up, tooltip "Upgrade backend on this node". Names both action and scope to differentiate from the cluster-wide upgrade on the Backends page. - Delete: btn-danger-ghost btn-sm + fa-trash, tooltip "Delete backend from this node". Matches the node-level destructive style at the row action column rather than the solid btn-danger of primary destructive pages, since this is a secondary action inside a busy row. Delete goes through the existing ConfirmDialog (danger=true) with copy that names the backend and the node explicitly — it's a non-recoverable op on a specific scope. Reuses nodesApi.deleteBackend(id, backend) which already existed in the API client. Tests: 4 new Playwright specs covering upgrade clarity (icon + tooltip), delete button presence, confirm dialog flow with POST body assertion, and cancel-doesn't-POST. Assisted-by: Claude:claude-opus-4-7 [Bash] [Edit] [Read] [Write]
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
DistributedBackendManager.{Install,Upgrade,Delete}Backend discarded the
per-node BackendOpResult from enqueueAndDrainBackendOp with
_, err :=.When workers replied Success=false (e.g. an OCI image with no arm64
variant on a Jetson host), the per-node Error string was recorded in
result.Nodes[].Error but never reached the toplevel return value, so
OpStatus.Error stayed empty and the UI reported the install as
"completed" while the backend was nowhere on the cluster.
Add BackendOpResult.Err() that aggregates per-node Status=="error"
entries into a single error. Queued nodes (waiting for reconciler retry)
are deliberately not treated as failures. Wire the three callers and
DeleteBackendDetailed to call result.Err() so reply.Success=false
finally reaches OpStatus.Error → /api/backends/job/:uid → the UI.
The Delete closures had a related bug: they discarded the reply with
_and only checked the NATS round-trip error, so reply.Success=falsewas a silent success even with the new aggregation. Check both.
=Notes for Reviewers
Signed commits