
🐛 Fix WebSocket race, nil-pointer crash, shutdown panic #4306

Merged
clubanderson merged 1 commit into main from fix/backend-crash-fixes on Apr 2, 2026

Conversation

@clubanderson
Collaborator

Summary

Fixes three crash-path bugs in the Go backend:

Fixes #4260, Fixes #4261, Fixes #4279

Test plan

  • go build ./... passes
  • go test ./pkg/api/handlers/... ./pkg/agent/... -run "DeviceTracker|Hub|GPUUtil|WebSocket" passes
  • CI build and lint pass

…#4279)

Fix three crash-path bugs in the Go backend:

1. WebSocket concurrent-write race (#4260): The agent server used two
   separate mutexes for the same connection — wsClient.writeMu for
   broadcasts (prediction_worker) and a local writeMu for request/stream
   handlers. Now the handler loop reuses the wsClient.writeMu so all
   writes to the same connection are serialized through a single mutex.

2. Nil-pointer crash in DeviceTracker (#4261): scanDevices() called
   t.k8sClient.ListClusters() without checking for nil. When k8s client
   initialisation fails at startup, the server continues with a nil
   client but still starts DeviceTracker. Added a nil guard.

3. Shutdown panic on double-close (#4279): Hub.Close() and
   GPUUtilizationWorker.Stop() close channels unconditionally. A second
   shutdown call panics. Protected both with sync.Once.

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Copilot AI review requested due to automatic review settings April 2, 2026 19:10
@kubestellar-prow kubestellar-prow bot added the dco-signoff: yes Indicates the PR's author has signed the DCO. label Apr 2, 2026
@netlify

netlify bot commented Apr 2, 2026

Deploy Preview for kubestellarconsole canceled.

Name Link
🔨 Latest commit ac25a8f
🔍 Latest deploy log https://app.netlify.com/projects/kubestellarconsole/deploys/69cebf1977fa2e0008f20033

@github-actions
Contributor

github-actions bot commented Apr 2, 2026

👋 Hey @clubanderson — thanks for opening this PR!

🤖 This project is developed exclusively using AI coding assistants.

Please do not attempt to code anything for this project manually.
All contributions should be authored using an AI coding tool such as:

This ensures consistency in code style, architecture patterns, test coverage,
and commit quality across the entire codebase.


This is an automated message.

@kubestellar-prow kubestellar-prow bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 2, 2026
@clubanderson
Collaborator Author

/lgtm
/approve

@kubestellar-prow
Contributor

@clubanderson: you cannot LGTM your own PR.

Details

In response to this:

/lgtm
/approve

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kubestellar-prow
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clubanderson

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubestellar-prow kubestellar-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 2, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR hardens the Go backend against three crash paths: a gorilla/websocket concurrent-write race in the agent server, a nil-pointer dereference in DeviceTracker when k8s client initialisation fails, and shutdown panics caused by closing the same channel twice.

Changes:

  • Serialize all per-connection WebSocket writes in pkg/agent/server.go by reusing the same mutex used by broadcast code paths.
  • Make DeviceTracker.scanDevices() resilient to a nil k8sClient to prevent startup-time nil dereferences.
  • Make hub/worker shutdown idempotent by guarding channel closes with sync.Once.
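
The first change in the list above, routing every write on a connection through one mutex, can be illustrated with a minimal sketch. It is not the code from pkg/agent/server.go: the `frameWriter` interface, the `send` helper, `checkedConn`, and `runWriters` are hypothetical names invented for this example, and `WriteMessage` is given a simplified signature rather than the real gorilla/websocket one. Only the `wsClient.writeMu` name comes from the commit message. The stand-in connection counts overlapping writes so the effect of the shared mutex can be observed.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// frameWriter is a hypothetical stand-in for a WebSocket connection that
// tolerates only one writer at a time.
type frameWriter interface {
	WriteMessage(data []byte) error
}

// wsClient pairs a connection with the single mutex guarding all writes to it.
type wsClient struct {
	conn    frameWriter
	writeMu sync.Mutex
}

// send is the one write path shared by broadcasters and request handlers.
func (c *wsClient) send(data []byte) error {
	c.writeMu.Lock()
	defer c.writeMu.Unlock()
	return c.conn.WriteMessage(data)
}

// checkedConn records how many writes ran and how many overlapped in time.
type checkedConn struct {
	inFlight int32
	overlaps int32
	writes   int32
}

func (c *checkedConn) WriteMessage(data []byte) error {
	if atomic.AddInt32(&c.inFlight, 1) > 1 {
		atomic.AddInt32(&c.overlaps, 1) // another write was already in progress
	}
	atomic.AddInt32(&c.writes, 1)
	atomic.AddInt32(&c.inFlight, -1)
	return nil
}

// runWriters starts n broadcast goroutines and n handler goroutines,
// all writing to the same client, and waits for them to finish.
func runWriters(client *wsClient, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); _ = client.send([]byte("broadcast")) }()
		go func() { defer wg.Done(); _ = client.send([]byte("response")) }()
	}
	wg.Wait()
}

func main() {
	conn := &checkedConn{}
	client := &wsClient{conn: conn}
	runWriters(client, 50)
	fmt.Println("writes:", conn.writes, "overlaps:", conn.overlaps)
}
```

If the handler goroutines locked a second, local mutex instead of `client.writeMu`, a broadcast and a response could enter `WriteMessage` at the same time, which is the situation the commit message describes for #4260.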

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| pkg/api/handlers/websocket.go | Makes Hub.Close() idempotent via sync.Once to prevent double-close panics. |
| pkg/api/gpu_utilization_worker.go | Makes GPUUtilizationWorker.Stop() idempotent via sync.Once to prevent double-close panics. |
| pkg/agent/server.go | Ensures all writes to a WebSocket connection share one per-connection mutex, preventing concurrent-write races. |
| pkg/agent/device_tracker.go | Adds a nil guard around k8sClient usage to avoid nil-pointer crashes when init fails. |
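
The sync.Once guard described for Hub.Close() and GPUUtilizationWorker.Stop() follows a common Go pattern, sketched below under assumptions. The `Hub` type here is a pared-down illustration, not the one in pkg/api/handlers/websocket.go; the `done` channel, the `closeOnce` field, `NewHub`, and the `Closed` helper are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// Hub is a hypothetical, minimal type that signals shutdown by closing a channel.
type Hub struct {
	done      chan struct{}
	closeOnce sync.Once
}

// NewHub returns a Hub whose done channel is still open.
func NewHub() *Hub {
	return &Hub{done: make(chan struct{})}
}

// Close may be called any number of times, from any goroutine.
// Only the first call closes the channel; later calls do nothing.
func (h *Hub) Close() {
	h.closeOnce.Do(func() { close(h.done) })
}

// Closed reports whether shutdown has been signalled.
func (h *Hub) Closed() bool {
	select {
	case <-h.done:
		return true
	default:
		return false
	}
}

func main() {
	hub := NewHub()
	hub.Close()
	hub.Close() // a bare close(hub.done) here would panic: close of closed channel
	fmt.Println("closed:", hub.Closed())
}
```

A boolean flag checked before `close` would also need its own lock to be safe under concurrent shutdown calls; sync.Once provides both the check and the synchronization.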

Comment on lines +122 to +126
// Guard against nil client — k8s client initialisation may have failed
// at startup, leaving DeviceTracker running with a nil reference.
if t.k8sClient == nil {
	return
}

Copilot AI Apr 2, 2026


Returning silently when k8sClient is nil avoids the panic, but it also makes DeviceTracker fail silently (no alerts ever, no log signal). Consider logging a one-time warning (e.g., reuse loggedClusterError) and/or stopping the loop when the client is nil so operators can diagnose why device tracking is inactive.
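
One way the suggestion above could look is sketched here, with loud assumptions: the `clusterClient` type, the `warn` callback, and the `warnNilOnce` field are hypothetical and do not come from pkg/agent/device_tracker.go, and the sketch uses its own sync.Once instead of the existing `loggedClusterError` mechanism, whose definition is not shown in this PR. `scanDevices` returns a count purely so the example can be checked.

```go
package main

import (
	"log"
	"sync"
)

// clusterClient is a hypothetical stand-in for the real k8s client.
type clusterClient struct {
	clusters []string
}

// ListClusters returns the configured cluster names.
func (c *clusterClient) ListClusters() []string { return c.clusters }

// DeviceTracker is a pared-down sketch; only k8sClient mirrors a name from the PR.
type DeviceTracker struct {
	k8sClient   *clusterClient
	warn        func(msg string) // hypothetical logging hook
	warnNilOnce sync.Once
}

// scanDevices returns how many clusters were scanned. With a nil client it
// emits a single warning across all calls and scans nothing.
func (t *DeviceTracker) scanDevices() int {
	if t.k8sClient == nil {
		t.warnNilOnce.Do(func() {
			t.warn("device tracker: k8s client is nil, device scanning is disabled")
		})
		return 0
	}
	return len(t.k8sClient.ListClusters())
}

func main() {
	tracker := &DeviceTracker{warn: func(msg string) { log.Println(msg) }}
	tracker.scanDevices() // logs the warning
	tracker.scanDevices() // stays quiet on later ticks
}
```

Logging once keeps a periodic scan loop from flooding the log while still leaving operators a signal that device tracking is inactive.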

@clubanderson
Collaborator Author

🔄 Auto-Applying Copilot Code Review

Copilot code review found 0 code suggestion(s) and 1 general comment(s).

Also address these general comments:

  • pkg/agent/device_tracker.go (line 126): Returning silently when k8sClient is nil avoids the panic, but it also makes DeviceTracker fail silently (no alerts ever

Push all fixes in a single commit. Run cd web && npm run build && npm run lint before committing.


Auto-generated by copilot-review-apply workflow.

@clubanderson clubanderson merged commit f78ad13 into main Apr 2, 2026
27 of 28 checks passed
@kubestellar-prow kubestellar-prow bot deleted the fix/backend-crash-fixes branch April 2, 2026 19:16
@github-actions
Contributor

github-actions bot commented Apr 2, 2026

Thank you for your contribution! Your PR has been merged.

Check out what's new:

Stay connected: Slack #kubestellar-dev | Multi-Cluster Survey

