🐛 Fix silent error handling in ClusterRole update and sync endpoint #4129

clubanderson merged 1 commit into main
Conversation
- Check ClusterRole update errors instead of discarding them with `_, _`, ensuring RBAC failures are surfaced before dependent GPU health jobs run (fixes #4113)
- Return 501 with `synced: false` from the sync endpoint instead of falsely reporting success when no sync logic is implemented (fixes #4114)

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Pull request overview
Fixes two cases where the system reported success despite failing/being unimplemented: ClusterRole updates now propagate errors, and the persistence sync endpoint now returns an honest “not implemented” response instead of a false success.
Changes:
- Return errors from ClusterRole Get/Update when Create hits AlreadyExists, preventing silent RBAC drift.
- Update `SyncNow` to return HTTP 501 with `synced: false` and an error message (instead of 200/true).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pkg/k8s/client.go | Stops discarding ClusterRole update errors when the role already exists. |
| pkg/api/handlers/console_persistence.go | Makes /sync return 501 + synced: false until real sync logic exists. |
```go
if getErr != nil {
	return fmt.Errorf("getting existing ClusterRole for update: %w", getErr)
}
existing.Rules = rules
if _, updateErr := client.RbacV1().ClusterRoles().Update(ctx, existing, metav1.UpdateOptions{}); updateErr != nil {
	return fmt.Errorf("updating ClusterRole: %w", updateErr)
```
ClusterRole updates can fail with a conflict if the resource was modified between Get and Update. Consider wrapping the update in retry.RetryOnConflict(...) (re-Get, set Rules, then Update) so transient conflicts don’t cause the install to fail unnecessarily.
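In client-go, `retry.RetryOnConflict` implements exactly this re-Get/mutate/Update loop. As a rough, stdlib-only sketch of the pattern (here `retryOnConflict`, `errConflict`, and `maxRetries` are illustrative stand-ins, not the client-go API):

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the apierrors.IsConflict case seen
// when an object is modified between Get and Update.
var errConflict = errors.New("conflict: object was modified")

// retryOnConflict mirrors the shape of client-go's retry.RetryOnConflict:
// it re-runs fn as long as fn returns a conflict error, up to maxRetries.
// In the real code, fn would re-Get the ClusterRole, set Rules, then Update.
func retryOnConflict(maxRetries int, fn func() error) error {
	var err error
	for i := 0; i < maxRetries; i++ {
		if err = fn(); !errors.Is(err, errConflict) {
			return err // nil, or a non-conflict error worth surfacing
		}
	}
	return err // conflicts persisted through every retry
}

func main() {
	attempts := 0
	// Simulate one transient conflict followed by a clean update.
	err := retryOnConflict(3, func() error {
		attempts++
		if attempts == 1 {
			return errConflict
		}
		return nil
	})
	fmt.Println(attempts, err)
}
```

With this shape, a single transient conflict costs one extra round trip instead of failing the whole install.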
```go
	return fmt.Errorf("getting existing ClusterRole for update: %w", getErr)
}
existing.Rules = rules
if _, updateErr := client.RbacV1().ClusterRoles().Update(ctx, existing, metav1.UpdateOptions{}); updateErr != nil {
	return fmt.Errorf("updating ClusterRole: %w", updateErr)
```
The wrapped error messages don’t include which ClusterRole is being updated. Including the ClusterRole name (e.g., gpuHealthClusterRole) in both messages will make operational debugging and log triage significantly easier.
```diff
-	return fmt.Errorf("getting existing ClusterRole for update: %w", getErr)
+	return fmt.Errorf("getting existing ClusterRole %q for update: %w", gpuHealthClusterRole, getErr)
 }
 existing.Rules = rules
 if _, updateErr := client.RbacV1().ClusterRoles().Update(ctx, existing, metav1.UpdateOptions{}); updateErr != nil {
-	return fmt.Errorf("updating ClusterRole: %w", updateErr)
+	return fmt.Errorf("updating ClusterRole %q: %w", gpuHealthClusterRole, updateErr)
```
```go
// Sync logic is not yet implemented — return an honest status
return c.Status(501).JSON(fiber.Map{
	"synced": false,
	"error": "sync not implemented",
```
The error text is quite generic. Consider making it more actionable/consistent with other API errors (e.g., include what to do next or a stable error code like SYNC_NOT_IMPLEMENTED) so clients can handle it predictably without string matching.
```diff
-// Sync logic is not yet implemented — return an honest status
+// Sync logic is not yet implemented — return a clear, machine-readable status
 return c.Status(501).JSON(fiber.Map{
 	"synced": false,
-	"error": "sync not implemented",
+	"error":     "Sync operation is not implemented for this API endpoint. Please upgrade the console backend to a version that supports /api/persistence/sync.",
+	"errorCode": "SYNC_NOT_IMPLEMENTED",
```
Thank you for your contribution! Your PR has been merged.
🔄 Auto-Applying Copilot Code Review

Copilot code review found 3 code suggestion(s) and 1 general comment(s). @copilot Please apply all of the following code review suggestions, and also address the general comments. Push all fixes in a single commit.

Auto-generated by copilot-review-apply workflow.
- Include ClusterRole name in error messages for easier debugging (client.go)
- Add machine-readable errorCode to sync endpoint 501 response (console_persistence.go)
- Fix test hygiene: check json.Marshal/io.ReadAll errors, close response bodies, explicitly discard hash.Write returns (feedback_test.go)
- Add onFocus/onBlur and aria-describedby to InfoTooltip for keyboard and screen-reader accessibility (CardWrapper.tsx)
- Reuse h.httpClient with context.WithTimeout instead of creating a new http.Client per screenshot upload (feedback.go)
- Scope elevated 20 MB body limit to feedback creation route only via per-route middleware guard (server.go)

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Summary
- The `InstallGPUHealthCronJob` function discarded the error from `ClusterRoles().Update()` with `_, _ =`. If the update failed, the system continued with outdated RBAC rules, which could cause subsequent GPU health job failures. Now both the Get and Update errors are checked and returned.
- The `SyncNow` handler returned `"synced": true` with a 200 status even though the sync logic was never implemented (just a TODO comment). Now it returns 501 with `"synced": false` and an error message. The frontend already handles non-ok responses gracefully.

Fixes #4113
Fixes #4114
Test plan
- `go build ./pkg/...` compiles cleanly
- `go test ./pkg/api/handlers/...` passes
- `go test ./pkg/k8s/...` passes
- `usePersistence.ts` already handles a non-ok response from the sync endpoint (falls through to return false)