🐛 Fix WebSocket race, nil-pointer crash, shutdown panic #4306
clubanderson merged 1 commit into main
…#4279)

Fix three crash-path bugs in the Go backend:

1. WebSocket concurrent-write race (#4260): The agent server used two separate mutexes for the same connection: wsClient.writeMu for broadcasts (prediction_worker) and a local writeMu for request/stream handlers. Now the handler loop reuses wsClient.writeMu so all writes to the same connection are serialized through a single mutex.
2. Nil-pointer crash in DeviceTracker (#4261): scanDevices() called t.k8sClient.ListClusters() without checking for nil. When k8s client initialisation fails at startup, the server continues with a nil client but still starts DeviceTracker. Added a nil guard.
3. Shutdown panic on double-close (#4279): Hub.Close() and GPUUtilizationWorker.Stop() close channels unconditionally. A second shutdown call panics. Protected both with sync.Once.

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
/lgtm

@clubanderson: you cannot LGTM your own PR.

[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: clubanderson
Pull request overview
This PR hardens the Go backend against three crash-paths: a gorilla/websocket concurrent-write race in the agent server, a nil-pointer dereference in DeviceTracker when k8s client init fails, and shutdown panics caused by double-closing channels.
Changes:
- Serialize all per-connection WebSocket writes in `pkg/agent/server.go` by reusing the same mutex used by broadcast code paths.
- Make `DeviceTracker.scanDevices()` resilient to a nil `k8sClient` to prevent startup-time nil dereferences.
- Make hub/worker shutdown idempotent by guarding channel closes with `sync.Once`.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pkg/api/handlers/websocket.go | Makes Hub.Close() idempotent via sync.Once to prevent double-close panics. |
| pkg/api/gpu_utilization_worker.go | Makes GPUUtilizationWorker.Stop() idempotent via sync.Once to prevent double-close panics. |
| pkg/agent/server.go | Ensures all writes to a WebSocket connection share one per-connection mutex, preventing concurrent-write races. |
| pkg/agent/device_tracker.go | Adds a nil guard around k8sClient usage to avoid nil-pointer crashes when init fails. |
    // Guard against nil client — k8s client initialisation may have failed
    // at startup, leaving DeviceTracker running with a nil reference.
    if t.k8sClient == nil {
        return
    }
Returning silently when k8sClient is nil avoids the panic, but it also makes DeviceTracker fail silently (no alerts ever, no log signal). Consider logging a one-time warning (e.g., reuse loggedClusterError) and/or stopping the loop when the client is nil so operators can diagnose why device tracking is inactive.
Summary
Fixes three crash-path bugs in the Go backend:
- WebSocket concurrent-write race (#4260): The agent server used two separate mutexes for the same connection, one for broadcasts (`wsClient.writeMu`) and one for request/stream handlers (a local `writeMu`), allowing concurrent writes on the same `*websocket.Conn`. Now the handler loop reuses `wsClient.writeMu` so all writes are serialized through a single per-connection mutex.
- Nil-pointer crash in DeviceTracker (#4261): `scanDevices()` called `t.k8sClient.ListClusters()` without a nil check. When k8s client init fails at startup the server continues with a nil client but still starts DeviceTracker, causing a nil-pointer dereference. Added a nil guard.
- Shutdown panic on double-close (#4279): `Hub.Close()` and `GPUUtilizationWorker.Stop()` close channels unconditionally. A second call to `Shutdown()` panics with "close of closed channel". Protected both with `sync.Once`.

Fixes #4260, Fixes #4261, Fixes #4279
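The double-close protection can be illustrated with a small sketch. The `worker` type, its field names, and the `stopped` helper are hypothetical and only mirror the shape of `Hub.Close()` and `GPUUtilizationWorker.Stop()` as described here; the real types are not shown in this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// worker is a hypothetical type with a stop channel, similar in shape to
// the hub and GPU utilization worker described in this PR.
type worker struct {
	stopCh   chan struct{}
	stopOnce sync.Once
}

func newWorker() *worker {
	return &worker{stopCh: make(chan struct{})}
}

// Stop closes stopCh at most once. Without stopOnce, a second call would
// panic with "close of closed channel".
func (w *worker) Stop() {
	w.stopOnce.Do(func() {
		close(w.stopCh)
	})
}

// stopped reports whether Stop has been called, without blocking.
func (w *worker) stopped() bool {
	select {
	case <-w.stopCh:
		return true
	default:
		return false
	}
}

func main() {
	w := newWorker()
	w.Stop()
	w.Stop() // safe: the close runs only on the first call
	fmt.Println(w.stopped())
}
```

`sync.Once` also makes `Stop` safe to call from several goroutines at once, which a plain boolean flag would not guarantee without extra locking.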
Test plan
- `go build ./...` passes
- `go test ./pkg/api/handlers/... ./pkg/agent/... -run "DeviceTracker|Hub|GPUUtil|WebSocket"` passes