🐛 Fix 5 backend robustness issues #4328

Merged
clubanderson merged 1 commit into main from fix/backend-robustness-batch
Apr 2, 2026

@clubanderson
Collaborator

@clubanderson clubanderson commented Apr 2, 2026

  • Fix %s format verb for client.userID → client.userID.String() in websocket.go
  • Rate-limit pong-drop log: only log on transition into dropping state (new pongDropLogged field), reset on successful send
  • Use msg.ID as fallback sessionID in streaming chat path before generating UUID (server.go)
  • Use msg.ID as fallback sessionID in non-streaming chat path (server.go)
  • Add rollbackBinary() helper: removes target before rename (cross-platform fix), checks/logs rollback errors
  • Apply rollbackBinary() to all 4 rollback sites in update_checker.go with distinct failure messages when rollback itself fails
  • Go build passes, Go tests pass, CodeQL clean

- Use strings.EqualFold for WebSocket Upgrade header comparison (#4281)
- Add non-blocking send on ping to prevent reader loop deadlock (#4280)
- Log errors from fallback ListClusters in prediction worker (#4265)
- Generate UUID session ID when client omits one to prevent key collision (#4263)
- Check os.Rename/os.Chmod errors during binary auto-update with rollback (#4262)

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Copilot AI review requested due to automatic review settings April 2, 2026 20:59
@kubestellar-prow kubestellar-prow bot added the dco-signoff: yes Indicates the PR's author has signed the DCO. label Apr 2, 2026
@netlify

netlify bot commented Apr 2, 2026

Deploy Preview for kubestellarconsole ready!

Name Link
🔨 Latest commit 7fe224a
🔍 Latest deploy log https://app.netlify.com/projects/kubestellarconsole/deploys/69ced8bf5b44c20007627367
😎 Deploy Preview https://deploy-preview-4328.console-deploy-preview.kubestellar.io

@kubestellar-prow kubestellar-prow bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 2, 2026
@github-actions
Contributor

github-actions bot commented Apr 2, 2026

👋 Hey @clubanderson — thanks for opening this PR!

🤖 This project is developed exclusively using AI coding assistants.

Please do not attempt to code anything for this project manually.
All contributions should be authored using an AI coding tool such as:

This ensures consistency in code style, architecture patterns, test coverage,
and commit quality across the entire codebase.


This is an automated message.

@clubanderson
Collaborator Author

/lgtm
/approve

@kubestellar-prow
Contributor

@clubanderson: you cannot LGTM your own PR.

Details

In response to this:

/lgtm
/approve

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kubestellar-prow
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clubanderson

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubestellar-prow kubestellar-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 2, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR addresses multiple backend robustness bugs across WebSocket handling, chat session management, prediction-worker diagnostics, and the agent auto-update flow.

Changes:

  • Make WebSocket upgrade detection case-insensitive in HTTP middleware.
  • Prevent WebSocket reader goroutine blockage by making pong sending non-blocking (with a drop path).
  • Improve operational robustness/diagnostics: log fallback ListClusters errors, avoid chat session ID collisions when omitted, and handle filesystem errors/rollback in binary auto-update.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| pkg/api/middleware/auth.go | Uses case-insensitive Upgrade header comparison to avoid rejecting valid WS upgrades. |
| pkg/api/handlers/websocket.go | Makes pong send non-blocking to prevent reader-loop stalls under backpressure. |
| pkg/agent/update_checker.go | Adds error handling for backup/replace/chmod steps during binary update and attempts rollback. |
| pkg/agent/server.go | Generates a unique session ID when omitted to avoid cancellation collisions for anonymous chats. |
| pkg/agent/prediction_worker.go | Logs errors from fallback ListClusters call instead of silently dropping them. |

select {
case client.send <- []byte(`{"type":"pong"}`):
default:
	log.Printf("[WebSocket] Dropping pong for client %s: send channel full", client.userID)
Copilot AI Apr 2, 2026

client.userID is a uuid.UUID, but this log line uses %s, which will print %!s(uuid.UUID=...) instead of the UUID. Use client.userID.String() or switch the format verb to %v for correct output.

Suggested change
-log.Printf("[WebSocket] Dropping pong for client %s: send channel full", client.userID)
+log.Printf("[WebSocket] Dropping pong for client %s: send channel full", client.userID.String())

Comment on lines +378 to +380
default:
	log.Printf("[WebSocket] Dropping pong for client %s: send channel full", client.userID)
}
Copilot AI Apr 2, 2026

This will log once per received ping while the send channel is full, which can easily turn into noisy/log-amplifying behavior for slow or misbehaving clients. Consider sampling/rate-limiting this message or logging only on transition into the "dropping" state.
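
One way to log only on the transition into the dropping state is sketched below. The client type here is a stripped-down stand-in, the sendPong method name is hypothetical, and the sketch assumes the flag is touched only by the reader goroutine, so it takes no lock. The pongDropLogged field name follows the PR description.

```go
package main

import (
	"fmt"
	"log"
)

// client is a minimal stand-in for the handler's Client type.
type client struct {
	id             string
	send           chan []byte
	pongDropLogged bool // true while pongs are being dropped
}

// sendPong attempts a non-blocking send and reports whether the pong was
// queued. It logs only the first drop after a successful send.
func (c *client) sendPong() bool {
	select {
	case c.send <- []byte(`{"type":"pong"}`):
		c.pongDropLogged = false // recovered; the next drop logs again
		return true
	default:
		if !c.pongDropLogged {
			log.Printf("[WebSocket] Dropping pong for client %s: send channel full", c.id)
			c.pongDropLogged = true
		}
		return false
	}
}

func main() {
	c := &client{id: "demo", send: make(chan []byte, 1)}
	fmt.Println(c.sendPong()) // queued
	fmt.Println(c.sendPong()) // dropped, logged once
	fmt.Println(c.sendPong()) // dropped, silent
	<-c.send                  // drain one message
	fmt.Println(c.sendPong()) // queued again, flag reset
}
```

If the flag could be read from more than one goroutine, it would need a mutex or an atomic.Bool instead of a plain bool.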

Comment on lines +749 to +766
if err := os.Rename(stagingDir+"/console", consolePath); err != nil {
	// Attempt to restore the backup before returning
	os.Rename(consolePath+".backup", consolePath) //nolint:errcheck
	uc.recordError(fmt.Sprintf("replace rename failed: %v", err))
	uc.broadcast("update_progress", UpdateProgressPayload{
		Status:  "failed",
		Message: "Failed to install new binary, rolled back",
		Error:   err.Error(),
	})
	return
}

// fileModeBinary is the permission bits for the installed console binary.
const fileModeBinary = 0755
if err := os.Chmod(consolePath, fileModeBinary); err != nil {
	// Attempt to restore the backup before returning
	os.Rename(consolePath+".backup", consolePath) //nolint:errcheck
	uc.recordError(fmt.Sprintf("chmod failed: %v", err))
Copilot AI Apr 2, 2026

Rollback paths ignore errors and attempt os.Rename(backup, consolePath) even when consolePath already exists (after a successful replace). os.Rename cannot overwrite an existing file on some platforms (notably Windows), so rollback can silently fail and leave the install broken. Consider removing the target first (best-effort), checking/logging rollback errors, and reporting a distinct failure state when rollback does not succeed.

Comment on lines +2512 to +2515
// Generate a unique session ID when the client omits one so that
// concurrent anonymous chats do not collide in activeChatCtxs (#4263).
if req.SessionID == "" {
	req.SessionID = uuid.New().String()
Copilot AI Apr 2, 2026

Generating a random session ID server-side avoids collisions, but clients that omit sessionId cannot cancel until they learn the generated ID from a later stream/result message. If early cancellation is a goal, consider deriving the fallback from a client-known value (e.g., msg.ID) or sending an immediate response that includes the generated sessionId before starting the long-running work.

Suggested change
-// Generate a unique session ID when the client omits one so that
-// concurrent anonymous chats do not collide in activeChatCtxs (#4263).
-if req.SessionID == "" {
-	req.SessionID = uuid.New().String()
+// Prefer a client-known identifier when the client omits SessionID so
+// cancel_chat can target the session immediately. Fall back to a random
+// UUID only when neither the request nor the message envelope provides
+// a usable identifier, while still avoiding collisions in activeChatCtxs
+// for anonymous chats (#4263).
+if req.SessionID == "" {
+	if msg.ID != "" {
+		req.SessionID = msg.ID
+	} else {
+		req.SessionID = uuid.New().String()
+	}
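
The fallback order can be exercised in isolation with a sketch like the one below. The helper resolveSessionID is hypothetical, and randomID is a standard-library stand-in for uuid.New().String() so the example has no external dependency; it returns 32 hex characters rather than a formatted UUID.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// randomID stands in for uuid.New().String() in this sketch.
func randomID() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err) // no usable randomness source; nothing sensible to return
	}
	return hex.EncodeToString(b)
}

// resolveSessionID prefers the client-supplied session ID, then the
// client-known message ID, and only then a freshly generated random ID.
func resolveSessionID(sessionID, msgID string) string {
	switch {
	case sessionID != "":
		return sessionID
	case msgID != "":
		return msgID
	default:
		return randomID()
	}
}

func main() {
	fmt.Println(resolveSessionID("sess-1", "msg-1")) // explicit session ID wins
	fmt.Println(resolveSessionID("", "msg-1"))       // falls back to message ID
	fmt.Println(len(resolveSessionID("", "")))       // length of the random fallback
}
```

With this ordering a client that omitted the session ID can still cancel by message ID, since it already knows that value.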

Comment on lines +2874 to +2877
// Generate a unique session ID when the client omits one so that
// concurrent anonymous chats do not collide (#4263).
if req.SessionID == "" {
	req.SessionID = uuid.New().String()
Copilot AI Apr 2, 2026

Same concern as the streaming path: when sessionId is omitted, generating a new UUID makes the session identifier unpredictable to the client and unstable across requests, which can break continuity (selected agent lookup, provider session tracking) unless the client persists the returned ID. Consider using a client-known fallback (like msg.ID) and/or documenting that clients must echo back the returned sessionId for multi-turn behavior.

Suggested change
-// Generate a unique session ID when the client omits one so that
-// concurrent anonymous chats do not collide (#4263).
-if req.SessionID == "" {
-	req.SessionID = uuid.New().String()
+// Fall back to the client-known message ID when sessionId is omitted so
+// the session identifier remains predictable to the client and stable for
+// continuity-sensitive lookups during this request flow.
+if req.SessionID == "" {
+	req.SessionID = msg.ID

@clubanderson
Collaborator Author

🔄 Auto-Applying Copilot Code Review

Copilot code review found 3 code suggestion(s) and 2 general comment(s).

@copilot Please apply all of the following code review suggestions:

  • pkg/api/handlers/websocket.go (line 379): log.Printf("[WebSocket] Dropping pong for client %s: send channel full", client....
  • pkg/agent/server.go (line 2515): // Prefer a client-known identifier when the client omits SessionID so // cance...
  • pkg/agent/server.go (line 2877): // Fall back to the client-known message ID when sessionId is omitted so // the...

Also address these general comments:

  • pkg/api/handlers/websocket.go (line 380): This will log once per received ping while the send channel is full, which can easily turn into noisy/log-amplifying b
  • pkg/agent/update_checker.go (line 766): Rollback paths ignore errors and attempt os.Rename(backup, consolePath) even when consolePath already exists (after

Push all fixes in a single commit. Run cd web && npm run build && npm run lint before committing.


Auto-generated by copilot-review-apply workflow.

@clubanderson clubanderson merged commit 30c7a96 into main Apr 2, 2026
27 of 29 checks passed
@kubestellar-prow kubestellar-prow bot deleted the fix/backend-robustness-batch branch April 2, 2026 21:06
@github-actions
Contributor

github-actions bot commented Apr 2, 2026

Thank you for your contribution! Your PR has been merged.

Check out what's new:

Stay connected: Slack #kubestellar-dev | Multi-Cluster Survey

Contributor

Copilot AI commented Apr 2, 2026

🔄 Auto-Applying Copilot Code Review

Copilot code review found 3 code suggestion(s) and 2 general comment(s)....

Applied all requested changes in commit 0f43952:

Code suggestions applied:

  • websocket.go line 379: Fixed format verb argument: client.userID → client.userID.String()
  • server.go line ~2515 (streaming path): Now prefers msg.ID as the session fallback before generating a UUID, so cancel_chat can target the session immediately without waiting for a returned ID
  • server.go line ~2877 (non-streaming path): Uses msg.ID directly as the session ID fallback for stable, predictable identifiers

General comments addressed:

  • websocket.go log noise: Added pongDropLogged bool field to Client; the drop warning now logs only on transition into the dropping state and resets when a pong is successfully sent — no more per-ping log spam
  • update_checker.go rollback: Introduced rollbackBinary(backup, consolePath string) error helper that does a best-effort os.Remove(consolePath) before renaming (required on Windows where rename cannot atomically overwrite an existing file). All four rollback sites now check and log errors from rollback, and the first two report a distinct "rollback also failed" failure state when the rollback itself fails

All changes pass go build ./..., go test ./pkg/agent/... ./pkg/api/handlers/..., and CodeQL scan.

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files.
dco-signoff: yes Indicates the PR's author has signed the DCO.
size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

3 participants