🐛 Fix silent error handling in ClusterRole update and sync endpoint #4129

Merged on Apr 1, 2026: clubanderson merged 1 commit into main from fix/api-error-handling

Conversation

@clubanderson
Collaborator

Summary

  • ClusterRole update error ignored (ClusterRole Update Error Ignored #4113): The InstallGPUHealthCronJob function discarded the error from ClusterRoles().Update() with _, _ =. If the update failed, the system continued with outdated RBAC rules, which could cause subsequent GPU health job failures. Now both the Get and Update errors are checked and returned.
  • Sync endpoint reports false success (Sync Endpoint Reports Success Without Implementation #4114): The SyncNow handler returned "synced": true with a 200 status even though the sync logic was never implemented (just a TODO comment). Now returns 501 with "synced": false and an error message. The frontend already handles non-ok responses gracefully.
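
The control flow of the first fix (create, fall back to Get and Update when the object already exists, and return every error) can be sketched as follows. This is a minimal illustration only: `roleStore`, `role`, `errAlreadyExists`, and `ensureRole` are hypothetical in-memory stand-ins, not the client-go types or the actual `InstallGPUHealthCronJob` code.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists stands in for the Kubernetes AlreadyExists API error (hypothetical).
var errAlreadyExists = errors.New("already exists")

// role is a simplified stand-in for an RBAC ClusterRole.
type role struct {
	Name  string
	Rules []string
}

// roleStore is a hypothetical in-memory replacement for the ClusterRole client.
type roleStore struct {
	roles     map[string]*role
	updateErr error // injected failure returned by Update; nil means success
}

func (s *roleStore) Create(r *role) error {
	if _, ok := s.roles[r.Name]; ok {
		return errAlreadyExists
	}
	s.roles[r.Name] = r
	return nil
}

func (s *roleStore) Get(name string) (*role, error) {
	r, ok := s.roles[name]
	if !ok {
		return nil, fmt.Errorf("role %q not found", name)
	}
	return r, nil
}

func (s *roleStore) Update(r *role) error {
	if s.updateErr != nil {
		return s.updateErr
	}
	s.roles[r.Name] = r
	return nil
}

// ensureRole creates the role, or updates its rules if it already exists.
// Every error is returned to the caller instead of being discarded with `_, _ =`.
func ensureRole(s *roleStore, name string, rules []string) error {
	err := s.Create(&role{Name: name, Rules: rules})
	if err == nil {
		return nil
	}
	if !errors.Is(err, errAlreadyExists) {
		return fmt.Errorf("creating role %q: %w", name, err)
	}
	existing, getErr := s.Get(name)
	if getErr != nil {
		return fmt.Errorf("getting existing role %q for update: %w", name, getErr)
	}
	existing.Rules = rules
	if updateErr := s.Update(existing); updateErr != nil {
		return fmt.Errorf("updating role %q: %w", name, updateErr)
	}
	return nil
}
```

With this shape, a failed update stops the install path early rather than letting later steps run against outdated rules.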

Fixes #4113
Fixes #4114

Test plan

  • go build ./pkg/... compiles cleanly
  • go test ./pkg/api/handlers/... passes
  • go test ./pkg/k8s/... passes
  • Frontend usePersistence.ts already handles non-ok response from sync endpoint (falls through to return false)

- Check ClusterRole update errors instead of discarding them with `_, _`,
  ensuring RBAC failures are surfaced before dependent GPU health jobs run
  (fixes #4113)
- Return 501 with synced:false from the sync endpoint instead of falsely
  reporting success when no sync logic is implemented (fixes #4114)

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Copilot AI review requested due to automatic review settings April 1, 2026 13:01
@kubestellar-prow kubestellar-prow bot added the dco-signoff: yes Indicates the PR's author has signed the DCO. label Apr 1, 2026
@kubestellar-prow
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign clubanderson for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@netlify

netlify bot commented Apr 1, 2026

Deploy Preview for kubestellarconsole ready!

Name | Link
🔨 Latest commit | 97c62d1
🔍 Latest deploy log | https://app.netlify.com/projects/kubestellarconsole/deploys/69cd171a1a2f2f0008007f25
😎 Deploy Preview | https://deploy-preview-4129.console-deploy-preview.kubestellar.io

@clubanderson clubanderson merged commit aa66cbe into main Apr 1, 2026
18 of 19 checks passed
@kubestellar-prow kubestellar-prow bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Apr 1, 2026
@kubestellar-prow kubestellar-prow bot deleted the fix/api-error-handling branch April 1, 2026 13:01
Contributor

Copilot AI left a comment

Pull request overview

Fixes two cases where the system reported success despite failing/being unimplemented: ClusterRole updates now propagate errors, and the persistence sync endpoint now returns an honest “not implemented” response instead of a false success.

Changes:

  • Return errors from ClusterRole Get/Update when Create hits AlreadyExists, preventing silent RBAC drift.
  • Update SyncNow to return HTTP 501 with synced: false and an error message (instead of 200/true).
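
A handler that reports "not implemented" honestly might look like the sketch below. The PR's handler uses Fiber; this version swaps in the standard library's `net/http` so it stays self-contained. The `syncResponse` type, the `callSyncNow` helper, and the use of POST are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"net/http/httptest"
)

// syncResponse mirrors the JSON fields named in the PR ("synced", "error").
type syncResponse struct {
	Synced bool   `json:"synced"`
	Error  string `json:"error,omitempty"`
}

// syncNow reports that sync is unimplemented: HTTP 501 with synced=false.
func syncNow(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusNotImplemented) // 501
	resp := syncResponse{Synced: false, Error: "sync not implemented"}
	if err := json.NewEncoder(w).Encode(resp); err != nil {
		log.Printf("writing sync response: %v", err)
	}
}

// callSyncNow exercises the handler in-process and returns the status code
// and the decoded body. The HTTP method is an assumption.
func callSyncNow() (int, syncResponse, error) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/api/persistence/sync", nil)
	syncNow(rec, req)
	var body syncResponse
	err := json.NewDecoder(rec.Body).Decode(&body)
	return rec.Code, body, err
}
```

Calling `callSyncNow()` returns status 501 and a body with `Synced` set to false, which a client can treat as a non-ok response.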

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File | Description
pkg/k8s/client.go | Stops discarding ClusterRole update errors when the role already exists.
pkg/api/handlers/console_persistence.go | Makes /sync return 501 + synced: false until real sync logic exists.

Comment on lines +2598 to +2603
if getErr != nil {
	return fmt.Errorf("getting existing ClusterRole for update: %w", getErr)
}
existing.Rules = rules
if _, updateErr := client.RbacV1().ClusterRoles().Update(ctx, existing, metav1.UpdateOptions{}); updateErr != nil {
	return fmt.Errorf("updating ClusterRole: %w", updateErr)

Copilot AI Apr 1, 2026

ClusterRole updates can fail with a conflict if the resource was modified between Get and Update. Consider wrapping the update in retry.RetryOnConflict(...) (re-Get, set Rules, then Update) so transient conflicts don’t cause the install to fail unnecessarily.
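
The re-Get, set Rules, then Update loop described in this comment can be sketched with the standard library alone. In real code the reviewer's suggestion, `retry.RetryOnConflict` from `k8s.io/client-go/util/retry`, would supply the loop and backoff. Everything below (`fakeServer`, `versionedRole`, `errConflict`, `updateRulesWithRetry`) is a hypothetical model of optimistic concurrency, not the PR's code, and it omits backoff.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the Kubernetes Conflict API error (hypothetical).
var errConflict = errors.New("conflict: object was modified")

// versionedRole models an object carrying a resource version.
type versionedRole struct {
	Version int
	Rules   []string
}

// fakeServer models optimistic concurrency: Update succeeds only if the
// caller's Version matches the stored one.
type fakeServer struct {
	current      versionedRole
	racingWrites int // how many Gets are followed by a simulated concurrent write
}

// Get returns a copy of the current object, then simulates another writer
// bumping the version while racingWrites remains positive.
func (f *fakeServer) Get() versionedRole {
	r := f.current
	if f.racingWrites > 0 {
		f.racingWrites--
		f.current.Version++
	}
	return r
}

func (f *fakeServer) Update(r versionedRole) error {
	if r.Version != f.current.Version {
		return errConflict
	}
	r.Version++
	f.current = r
	return nil
}

// updateRulesWithRetry re-reads the object on every attempt so each Update
// is based on the latest version, and retries only on conflict errors.
func updateRulesWithRetry(f *fakeServer, rules []string, maxAttempts int) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		latest := f.Get()
		latest.Rules = rules
		lastErr = f.Update(latest)
		if lastErr == nil {
			return nil
		}
		if !errors.Is(lastErr, errConflict) {
			return fmt.Errorf("updating role: %w", lastErr)
		}
	}
	return fmt.Errorf("updating role after %d attempts: %w", maxAttempts, lastErr)
}
```

Non-conflict errors are returned immediately; only conflicts trigger another attempt, so a persistent failure still reaches the caller.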

Comment on lines +2599 to +2603
	return fmt.Errorf("getting existing ClusterRole for update: %w", getErr)
}
existing.Rules = rules
if _, updateErr := client.RbacV1().ClusterRoles().Update(ctx, existing, metav1.UpdateOptions{}); updateErr != nil {
	return fmt.Errorf("updating ClusterRole: %w", updateErr)

Copilot AI Apr 1, 2026

The wrapped error messages don’t include which ClusterRole is being updated. Including the ClusterRole name (e.g., gpuHealthClusterRole) in both messages will make operational debugging and log triage significantly easier.

Suggested change
Before:
	return fmt.Errorf("getting existing ClusterRole for update: %w", getErr)
}
existing.Rules = rules
if _, updateErr := client.RbacV1().ClusterRoles().Update(ctx, existing, metav1.UpdateOptions{}); updateErr != nil {
	return fmt.Errorf("updating ClusterRole: %w", updateErr)
After:
	return fmt.Errorf("getting existing ClusterRole %q for update: %w", gpuHealthClusterRole, getErr)
}
existing.Rules = rules
if _, updateErr := client.RbacV1().ClusterRoles().Update(ctx, existing, metav1.UpdateOptions{}); updateErr != nil {
	return fmt.Errorf("updating ClusterRole %q: %w", gpuHealthClusterRole, updateErr)
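
As a small illustration of why this wrapping style is useful: `%q` prints the name quoted, and `%w` keeps the underlying error reachable through `errors.Is`. The constant's value and the `errForbidden` sentinel below are hypothetical; only the identifier `gpuHealthClusterRole` comes from the review.

```go
package main

import (
	"errors"
	"fmt"
)

// errForbidden is a hypothetical sentinel standing in for an API error.
var errForbidden = errors.New("forbidden")

// gpuHealthClusterRole: the identifier appears in the review; this value is assumed.
const gpuHealthClusterRole = "gpu-health-checker"

// wrapUpdateErr adds the ClusterRole name to the message while preserving
// the original error in the chain via %w.
func wrapUpdateErr(name string, err error) error {
	return fmt.Errorf("updating ClusterRole %q: %w", name, err)
}

// isForbidden shows that callers can still match the wrapped cause.
func isForbidden(err error) bool {
	return errors.Is(err, errForbidden)
}
```

Here `wrapUpdateErr(gpuHealthClusterRole, errForbidden)` yields the message `updating ClusterRole "gpu-health-checker": forbidden`, and `isForbidden` still reports true for it.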

Comment on lines +2599 to +2603
	return fmt.Errorf("getting existing ClusterRole for update: %w", getErr)
}
existing.Rules = rules
if _, updateErr := client.RbacV1().ClusterRoles().Update(ctx, existing, metav1.UpdateOptions{}); updateErr != nil {
	return fmt.Errorf("updating ClusterRole: %w", updateErr)

Copilot AI Apr 1, 2026

The wrapped error messages don’t include which ClusterRole is being updated. Including the ClusterRole name (e.g., gpuHealthClusterRole) in both messages will make operational debugging and log triage significantly easier.

Suggested change
Before:
	return fmt.Errorf("getting existing ClusterRole for update: %w", getErr)
}
existing.Rules = rules
if _, updateErr := client.RbacV1().ClusterRoles().Update(ctx, existing, metav1.UpdateOptions{}); updateErr != nil {
	return fmt.Errorf("updating ClusterRole: %w", updateErr)
After:
	return fmt.Errorf("getting existing ClusterRole %q for update: %w", gpuHealthClusterRole, getErr)
}
existing.Rules = rules
if _, updateErr := client.RbacV1().ClusterRoles().Update(ctx, existing, metav1.UpdateOptions{}); updateErr != nil {
	return fmt.Errorf("updating ClusterRole %q: %w", gpuHealthClusterRole, updateErr)

Comment on lines +788 to +791
// Sync logic is not yet implemented — return an honest status
return c.Status(501).JSON(fiber.Map{
	"synced": false,
	"error":  "sync not implemented",

Copilot AI Apr 1, 2026

The error text is quite generic. Consider making it more actionable/consistent with other API errors (e.g., include what to do next or a stable error code like SYNC_NOT_IMPLEMENTED) so clients can handle it predictably without string matching.

Suggested change
Before:
// Sync logic is not yet implemented — return an honest status
return c.Status(501).JSON(fiber.Map{
	"synced": false,
	"error":  "sync not implemented",
After:
// Sync logic is not yet implemented — return a clear, machine-readable status
return c.Status(501).JSON(fiber.Map{
	"synced":    false,
	"error":     "Sync operation is not implemented for this API endpoint. Please upgrade the console backend to a version that supports /api/persistence/sync.",
	"errorCode": "SYNC_NOT_IMPLEMENTED",

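
On the client side, a stable `errorCode` lets callers branch without matching on message text. The sketch below assumes the response shape from the suggestion above. `syncResult`, `errSyncUnsupported`, and `interpretSync` are hypothetical names, and the project's actual frontend (`usePersistence.ts`) is TypeScript, so this Go version only illustrates the idea.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
)

// codeSyncNotImplemented is the stable error code proposed in the review.
const codeSyncNotImplemented = "SYNC_NOT_IMPLEMENTED"

// errSyncUnsupported is a hypothetical sentinel callers can test for.
var errSyncUnsupported = errors.New("sync is not supported by this backend")

// syncResult models the JSON body of the sync endpoint response.
type syncResult struct {
	Synced    bool   `json:"synced"`
	Error     string `json:"error"`
	ErrorCode string `json:"errorCode"`
}

// interpretSync maps an HTTP status and body to a result. It branches on
// the machine-readable errorCode instead of the human-readable message.
func interpretSync(status int, body []byte) (bool, error) {
	var res syncResult
	if err := json.Unmarshal(body, &res); err != nil {
		return false, fmt.Errorf("decoding sync response (HTTP %d): %w", status, err)
	}
	if status == http.StatusOK && res.Synced {
		return true, nil
	}
	if res.ErrorCode == codeSyncNotImplemented {
		return false, errSyncUnsupported
	}
	return false, fmt.Errorf("sync failed (HTTP %d): %s", status, res.Error)
}
```

Because the branch keys on the code, the server can reword the `error` text later without breaking clients.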
@github-actions
Contributor

github-actions bot commented Apr 1, 2026

👋 Hey @clubanderson — thanks for opening this PR!

🤖 This project is developed exclusively using AI coding assistants.

Please do not attempt to code anything for this project manually.
All contributions should be authored using an AI coding tool such as:

This ensures consistency in code style, architecture patterns, test coverage,
and commit quality across the entire codebase.


This is an automated message.

@github-actions
Contributor

github-actions bot commented Apr 1, 2026

Thank you for your contribution! Your PR has been merged.

Check out what's new:

Stay connected: Slack #kubestellar-dev | Multi-Cluster Survey

@clubanderson
Collaborator Author

🔄 Auto-Applying Copilot Code Review

Copilot code review found 3 code suggestion(s) and 1 general comment(s).

@copilot Please apply all of the following code review suggestions:

  • pkg/k8s/client.go (line 2603): return fmt.Errorf("getting existing ClusterRole %q for update: %w", gpuHealthClu...
  • pkg/k8s/client.go (line 2603): return fmt.Errorf("getting existing ClusterRole %q for update: %w", gpuHealthClu...
  • pkg/api/handlers/console_persistence.go (line 791): // Sync logic is not yet implemented — return a clear, machine-readable status ...

Also address these general comments:

  • pkg/k8s/client.go (line 2603): ClusterRole updates can fail with a conflict if the resource was modified between Get and Update. Consider wrapping

Push all fixes in a single commit. Run cd web && npm run build && npm run lint before committing.


Auto-generated by copilot-review-apply workflow.

clubanderson added a commit that referenced this pull request Apr 1, 2026
- Include ClusterRole name in error messages for easier debugging (client.go)
- Add machine-readable errorCode to sync endpoint 501 response (console_persistence.go)
- Fix test hygiene: check json.Marshal/io.ReadAll errors, close response
  bodies, explicitly discard hash.Write returns (feedback_test.go)
- Add onFocus/onBlur and aria-describedby to InfoTooltip for keyboard
  and screen-reader accessibility (CardWrapper.tsx)
- Reuse h.httpClient with context.WithTimeout instead of creating a new
  http.Client per screenshot upload (feedback.go)
- Scope elevated 20 MB body limit to feedback creation route only via
  per-route middleware guard (server.go)

Signed-off-by: Andrew Anderson <andy@clubanderson.com>

Labels

dco-signoff: yes (Indicates the PR's author has signed the DCO.)
size/S (Denotes a PR that changes 10-29 lines, ignoring generated files.)

Development

Successfully merging this pull request may close these issues.

Sync Endpoint Reports Success Without Implementation
ClusterRole Update Error Ignored

3 participants