Fix send-on-closed-channel panic in deploy explain mode#822
Conversation
The build status callback runs from per-stream goroutines that can outlive the parent RPC, so closing pw.Status() after BuildFromTar returned could race with an in-flight send and crash the CLI. Wrap the channel in a small mutex-guarded helper that serializes sends with close, and add a panic-recovery defer that flips the deployment to failed so a CLI crash no longer leaves the server-side lock stuck for its 30-minute TTL. Fixes MIR-1167.
📝 WalkthroughWalkthroughThe deploy command now includes two concurrency safety improvements. A new Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cli/commands/deploy.go`:
- Around line 490-492: The defer recovery handler calls updateDeploymentOnError
which concatenates the shared slices buildLogs and buildErrors while
createBuildStatusCallback goroutines may still be appending to them; guard by
snapshotting or mutex-protecting those slices before calling
updateDeploymentOnError. Specifically, add or reuse a mutex that protects access
to buildLogs/buildErrors (the same one used by createBuildStatusCallback) or
acquire it in the deferred func inside the recover block, copy the slices into
local variables (e.g., logsCopy := append([]string(nil), buildLogs...), errsCopy
:= append([]string(nil), buildErrors...)), release the mutex, and then call
updateDeploymentOnError with the copies so updateDeploymentOnError operates on a
consistent snapshot without racing with buildCall goroutines.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 85ddbf8b-0381-449b-8337-34c48e631d46
📒 Files selected for processing (2)
cli/commands/deploy.gocli/commands/safe_status_ch_test.go
phinze
left a comment
There was a problem hiding this comment.
Once upon a time, in the kingdom of explain-mode deploys, there lived a tortoise and a hare. The hare was an RPC stream-handler goroutine: fast, exuberant, bounding through the forest firing build-status callbacks at every opportunity, occasionally circling back to drop another status update at the finish line long after everyone assumed the race was decided. The tortoise was the main deploy goroutine, plodding along to its one job: close the channel and call it a day.
In the original telling, the hare loses by napping. In ours, both racers reach the finish line at exactly the same instant, collide, and explode. The hare is mid-stride placing a status onto the channel; the tortoise is calmly closing it. Send on closed channel. Panic. Forest catches fire. Worse: the umpire (the deployment lock) saw nobody cross cleanly, shrugged, and went home for thirty minutes. No other animal in the kingdom could race until he returned.
This PR hires a referee. safeStatusCh stands at the finish line with a clipboard and lets exactly one racer touch the tape at a time; if the hare arrives after the tortoise has already closed the race, the referee politely waves him off instead of letting him bowl the tortoise over. The panic-recovery defer clips a pager to the umpire's belt, so if the racers somehow still crash into each other, at least somebody calls him back from lunch immediately instead of waiting half an hour. Two small interventions, both right where the trouble was.
We sat down and traced this end to end: through callInline, the explain-mode happy path, the error returns, the test alongside the race-detector semantics. The shape holds up. CodeRabbit's note about the slice race in the recovery defer is a fair sighting (the hare keeps appending to buildLogs/buildErrors while the umpire is being paged) but the worst case is a slightly fuzzy failure message, not a missed lock release. The 30-minute stuck lock was the bear; a garbled error string is not.
Two small woodland observations inline, neither blocking.
- Defer safeStatus.Close() right after construction so error returns no longer leak the progresswriter drainer goroutine. - Guard the shared buildErrors/buildLogs/deployWarnings slices with a mutex; readers go through a snapshotBuildState helper so callback goroutines and updateDeploymentOnError can't race on the slice header. - Track a deploymentFinalized flag set after UpdateDeploymentStatus "active" so a panic in post-success code (warnings, displayAccessInfo) doesn't flip a live deployment to failed. - Rework safeStatusCh to use a stop channel + sync.WaitGroup. Close no longer holds a mutex across the blocking channel send, so it can't deadlock if the channel's consumer stops draining. - Deterministic Close-mid-stream test via a midflight WaitGroup, plus a regression test for Close unblocking a parked Send.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
cli/commands/safe_status_ch_test.go (1)
100-102: ⚡ Quick winAdd a timeout when waiting for
doneafterClose().
Close()has a timeout guard, but<-donecan still block forever if a regression letsClose()return before the parkedSend()exits. A bounded wait keeps this test fail-fast instead of hanging.Proposed change
- if err := <-done; err != nil { - t.Fatalf("Send returned error: %v", err) - } + select { + case err := <-done: + if err != nil { + t.Fatalf("Send returned error: %v", err) + } + case <-time.After(2 * time.Second): + t.Fatal("Send() did not return after Close()") + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cli/commands/safe_status_ch_test.go` around lines 100 - 102, The test waiting on the done channel after calling Close() can hang; change the blocking receive on done to a bounded wait using a select with a timeout (e.g., time.After). Specifically, after invoking Close() in the test (and around the existing Receive from done), replace the plain "<-done" check with a select that either receives err := <-done and fails on err, or times out after a short duration and calls t.Fatalf("timed out waiting for Send/Close to finish") so the test fails fast if Send() stays parked; reference the done channel, Close(), and Send() in the test when making this change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@cli/commands/safe_status_ch_test.go`:
- Around line 100-102: The test waiting on the done channel after calling
Close() can hang; change the blocking receive on done to a bounded wait using a
select with a timeout (e.g., time.After). Specifically, after invoking Close()
in the test (and around the existing Receive from done), replace the plain
"<-done" check with a select that either receives err := <-done and fails on
err, or times out after a short duration and calls t.Fatalf("timed out waiting
for Send/Close to finish") so the test fails fast if Send() stays parked;
reference the done channel, Close(), and Send() in the test when making this
change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 233c0269-593f-4f2c-a8c5-50b7c0d08f85
📒 Files selected for processing (2)
cli/commands/deploy.gocli/commands/safe_status_ch_test.go
Summary
BuildFromTarreturns) and the main goroutine runningclose(pw.Status()). Two such panics were reported in MIR-1167, both leaving the server-side deployment lock stuck for its 30-minute TTL.pw.Status()in a smallsafeStatusChhelper that serializesSendandClosebehind a mutex so close cannot race with an in-flight send. The drainer goroutine inprogresswriterstill consumes normally, so this introduces no new blocking risk.defernear the top ofDeploythat callsupdateDeploymentOnErrorand re-panics. A crash now releases the server lock immediately instead of waiting 30 minutes, while still surfacing the original stack trace.Test plan
go test -race ./cli/commands/ -run SafeStatusCh -count=3(new tests: concurrent send/close, send-after-close no-op, ctx-cancel respected)go test -race ./cli/commands/...(189 tests pass)make lint(0 issues)make test(4332 passed, 20 skipped, 0 failures)Fixes MIR-1167.