Fix send-on-closed-channel panic in deploy explain mode#822

Merged
evanphx merged 3 commits into
mainfrom
mir-1167-panic-s-while-claude-was-deploying
May 27, 2026

@evanphx evanphx commented May 26, 2026

Summary

  • Deploy's explain-mode progress handler has a race: RPC stream-handler goroutines can call back into the handler after BuildFromTar returns, while the main goroutine is running close(pw.Status()). Two such panics were reported in MIR-1167, both leaving the server-side deployment lock stuck for its 30-minute TTL.
  • Wrap pw.Status() in a small safeStatusCh helper that serializes Send and Close behind a mutex so close cannot race with an in-flight send. The drainer goroutine in progresswriter still consumes normally, so this introduces no new blocking risk.
  • Add a panic-recovery defer near the top of Deploy that calls updateDeploymentOnError and re-panics. A crash now releases the server lock immediately instead of waiting 30 minutes, while still surfacing the original stack trace.
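
As a rough illustration of the second bullet, a mutex-guarded wrapper might look like the sketch below. Only the names safeStatusCh, Send, and Close come from this PR; the status type, the field layout, the constructor, and the demo function are assumptions made for the example, not the code that was merged.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// status is a hypothetical stand-in for the real build status type.
type status struct{ msg string }

// safeStatusCh sketches a wrapper that serializes Send and Close behind
// one mutex so that close can never race with an in-flight send.
type safeStatusCh struct {
	mu     sync.Mutex
	ch     chan *status
	closed bool
}

func newSafeStatusCh(ch chan *status) *safeStatusCh {
	return &safeStatusCh{ch: ch}
}

// Send delivers st unless the channel is already closed, in which case it
// is a no-op. It returns ctx.Err() if ctx is done before the send completes.
func (s *safeStatusCh) Send(ctx context.Context, st *status) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.closed {
		return nil
	}
	select {
	case s.ch <- st:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// Close closes the underlying channel exactly once.
func (s *safeStatusCh) Close() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.closed {
		s.closed = true
		close(s.ch)
	}
}

// demo exercises the three behaviours named in the test plan.
func demo() (received int, afterClose error, cancelled error) {
	ch := make(chan *status, 4)
	s := newSafeStatusCh(ch)
	for i := 0; i < 3; i++ {
		_ = s.Send(context.Background(), &status{msg: "step"})
	}
	s.Close()
	s.Close() // second Close is a no-op
	afterClose = s.Send(context.Background(), &status{msg: "late"})
	for range ch {
		received++
	}

	// Unbuffered channel with no consumer: only ctx cancellation can end Send.
	blocked := newSafeStatusCh(make(chan *status))
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	cancelled = blocked.Send(ctx, &status{msg: "never delivered"})
	return received, afterClose, cancelled
}

func main() {
	fmt.Println(demo())
}
```

Note that in this shape Send holds the mutex across a potentially blocking channel send, so Close waits behind any parked sender. That is acceptable only while a consumer keeps draining the channel, which is the condition the bullet relies on.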

Test plan

  • go test -race ./cli/commands/ -run SafeStatusCh -count=3 (new tests: concurrent send/close, send-after-close no-op, ctx-cancel respected)
  • go test -race ./cli/commands/... (189 tests pass)
  • make lint (0 issues)
  • make test (4332 passed, 20 skipped, 0 failures)
  • Smoke deploy: run an explain-mode deploy against a live cluster and confirm clean exit (not exercised in this PR; covered by unit-level race test)

Fixes MIR-1167.

The build status callback runs from per-stream goroutines that can
outlive the parent RPC, so closing pw.Status() after BuildFromTar
returned could race with an in-flight send and crash the CLI. Wrap the
channel in a small mutex-guarded helper that serializes sends with
close, and add a panic-recovery defer that flips the deployment to
failed so a CLI crash no longer leaves the server-side lock stuck for
its 30-minute TTL.

Fixes MIR-1167.
@evanphx evanphx requested a review from a team as a code owner May 26, 2026 19:08

coderabbitai Bot commented May 26, 2026

Walkthrough

The deploy command now includes two concurrency safety improvements. A new safeStatusCh wrapper protects the buildkit status channel from a race in which buildkit goroutines keep sending status updates while the deploy command closes the channel during shutdown. The wrapper guards Close with a mutex and makes Send context-aware, so a send after close is handled gracefully instead of panicking. Additionally, the deploy flow now wraps build/deploy execution in a deferred panic-recovery handler that records the panic as a deployment failure before re-panicking, so the server-side lock is released when an unexpected error occurs.
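
The recover-record-re-panic pattern described above can be sketched as follows. The deployment type, markFailed, deploy, and runDeploy are hypothetical names for illustration; markFailed only plays the role this PR assigns to updateDeploymentOnError and does not reproduce it.

```go
package main

import "fmt"

// deployment is a hypothetical stand-in for the server-side record.
type deployment struct {
	status string
	reason string
}

// markFailed plays the role of updateDeploymentOnError in this sketch.
func (d *deployment) markFailed(reason string) {
	d.status = "failed"
	d.reason = reason
}

// deploy runs build and marks the deployment failed if build panics.
// The panic is re-raised after the failure is recorded, so the crash
// still reaches the caller.
func deploy(d *deployment, build func()) {
	d.status = "in_progress"
	defer func() {
		if r := recover(); r != nil {
			d.markFailed(fmt.Sprintf("panic: %v", r))
			panic(r)
		}
	}()
	build()
	d.status = "active"
}

// runDeploy catches the re-panic so the example can inspect the outcome.
func runDeploy(build func()) (d *deployment, repanicked interface{}) {
	d = &deployment{}
	defer func() { repanicked = recover() }()
	deploy(d, build)
	return d, nil
}

func main() {
	d, r := runDeploy(func() { panic("send on closed channel") })
	fmt.Println(d.status, d.reason, r)

	ok, r2 := runDeploy(func() {})
	fmt.Println(ok.status, r2)
}
```

The deferred function runs before the stack unwinds, which is why the failure can be recorded first and the panic raised again afterwards.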


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Inline comments:
In `@cli/commands/deploy.go`:
- Around line 490-492: The defer recovery handler calls updateDeploymentOnError
which concatenates the shared slices buildLogs and buildErrors while
createBuildStatusCallback goroutines may still be appending to them; guard by
snapshotting or mutex-protecting those slices before calling
updateDeploymentOnError. Specifically, add or reuse a mutex that protects access
to buildLogs/buildErrors (the same one used by createBuildStatusCallback) or
acquire it in the deferred func inside the recover block, copy the slices into
local variables (e.g., logsCopy := append([]string(nil), buildLogs...), errsCopy
:= append([]string(nil), buildErrors...)), release the mutex, and then call
updateDeploymentOnError with the copies so updateDeploymentOnError operates on a
consistent snapshot without racing with buildCall goroutines.
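
A minimal sketch of the snapshot approach suggested in this comment, assuming the slices live in a small struct guarded by a mutex. The buildState type and its methods are hypothetical; only the buildLogs and buildErrors names and the append-to-nil copy idiom come from the comment.

```go
package main

import (
	"fmt"
	"sync"
)

// buildState is a hypothetical holder for the slices shared between the
// status callbacks and the error-reporting path.
type buildState struct {
	mu          sync.Mutex
	buildLogs   []string
	buildErrors []string
}

func (b *buildState) appendLog(line string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.buildLogs = append(b.buildLogs, line)
}

func (b *buildState) appendError(line string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.buildErrors = append(b.buildErrors, line)
}

// snapshot copies both slices under the lock so the caller can read them
// without racing against goroutines that are still appending.
func (b *buildState) snapshot() (logs, errs []string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	logs = append([]string(nil), b.buildLogs...)
	errs = append([]string(nil), b.buildErrors...)
	return logs, errs
}

// demo appends from several goroutines, snapshots while they may still be
// running, and checks that a snapshot is independent of the live slices.
func demo() (total int, independent bool) {
	st := &buildState{}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			st.appendLog(fmt.Sprintf("log %d", i))
		}(i)
	}
	_, _ = st.snapshot() // safe even while appenders are in flight
	wg.Wait()

	logs, _ := st.snapshot()
	logs[0] = "mutated copy"
	again, _ := st.snapshot()
	return len(again), again[0] != "mutated copy"
}

func main() {
	fmt.Println(demo())
}
```

Copying under the lock means the recovery path reads a consistent view, at the cost of one allocation per slice on the rarely taken error path.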

📥 Commits

Reviewing files that changed from the base of the PR and between 926db8a and a272d02.

📒 Files selected for processing (2)
  • cli/commands/deploy.go
  • cli/commands/safe_status_ch_test.go


@phinze phinze left a comment

Once upon a time, in the kingdom of explain-mode deploys, there lived a tortoise and a hare. The hare was an RPC stream-handler goroutine: fast, exuberant, bounding through the forest firing build-status callbacks at every opportunity, occasionally circling back to drop another status update at the finish line long after everyone assumed the race was decided. The tortoise was the main deploy goroutine, plodding along to its one job: close the channel and call it a day.

In the original telling, the hare loses by napping. In ours, both racers reach the finish line at exactly the same instant, collide, and explode. The hare is mid-stride placing a status onto the channel; the tortoise is calmly closing it. Send on closed channel. Panic. Forest catches fire. Worse: the umpire (the deployment lock) saw nobody cross cleanly, shrugged, and went home for thirty minutes. No other animal in the kingdom could race until he returned.

This PR hires a referee. safeStatusCh stands at the finish line with a clipboard and lets exactly one racer touch the tape at a time; if the hare arrives after the tortoise has already closed the race, the referee politely waves him off instead of letting him bowl the tortoise over. The panic-recovery defer clips a pager to the umpire's belt, so if the racers somehow still crash into each other, at least somebody calls him back from lunch immediately instead of waiting half an hour. Two small interventions, both right where the trouble was.

We sat down and traced this end to end: through callInline, the explain-mode happy path, the error returns, the test alongside the race-detector semantics. The shape holds up. CodeRabbit's note about the slice race in the recovery defer is a fair sighting (the hare keeps appending to buildLogs/buildErrors while the umpire is being paged) but the worst case is a slightly fuzzy failure message, not a missed lock release. The 30-minute stuck lock was the bear; a garbled error string is not.

Two small woodland observations inline, neither blocking.

Comment thread cli/commands/deploy.go
Comment thread cli/commands/safe_status_ch_test.go Outdated
- Defer safeStatus.Close() right after construction so error returns no
  longer leak the progresswriter drainer goroutine.
- Guard the shared buildErrors/buildLogs/deployWarnings slices with a
  mutex; readers go through a snapshotBuildState helper so callback
  goroutines and updateDeploymentOnError can't race on the slice header.
- Track a deploymentFinalized flag set after UpdateDeploymentStatus
  "active" so a panic in post-success code (warnings, displayAccessInfo)
  doesn't flip a live deployment to failed.
- Rework safeStatusCh to use a stop channel + sync.WaitGroup. Close no
  longer holds a mutex across the blocking channel send, so it can't
  deadlock if the channel's consumer stops draining.
- Deterministic Close-mid-stream test via a midflight WaitGroup, plus a
  regression test for Close unblocking a parked Send.
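
One way the stop channel + sync.WaitGroup rework could be shaped is sketched below. This is an assumption-laden illustration, not the merged code: the status type, field names, constructor, and demo function are hypothetical, and the short mutex that guards the closed flag is one possible way to keep wg.Add from racing with wg.Wait.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// status is a hypothetical stand-in for the real build status type.
type status struct{ msg string }

// safeStatusCh sketches the stop-channel variant: the mutex is held only
// to check the closed flag and register the sender, never across the
// blocking channel send.
type safeStatusCh struct {
	ch     chan *status
	stop   chan struct{}
	mu     sync.Mutex
	closed bool
	wg     sync.WaitGroup
}

func newSafeStatusCh(ch chan *status) *safeStatusCh {
	return &safeStatusCh{ch: ch, stop: make(chan struct{})}
}

// Send is a no-op after Close. A Send parked on the channel is released
// when Close signals stop or when ctx is done.
func (s *safeStatusCh) Send(ctx context.Context, st *status) error {
	s.mu.Lock()
	if s.closed {
		s.mu.Unlock()
		return nil
	}
	s.wg.Add(1)
	s.mu.Unlock()
	defer s.wg.Done()

	select {
	case s.ch <- st:
		return nil
	case <-s.stop:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// Close signals stop, waits for in-flight senders to leave, and only then
// closes the underlying channel.
func (s *safeStatusCh) Close() {
	s.mu.Lock()
	if s.closed {
		s.mu.Unlock()
		return
	}
	s.closed = true
	close(s.stop)
	s.mu.Unlock()

	s.wg.Wait()
	close(s.ch)
}

// demo parks a Send on an undrained channel and checks that Close frees it.
func demo() (sendErr error, unblocked bool) {
	s := newSafeStatusCh(make(chan *status)) // unbuffered, nobody draining
	done := make(chan error, 1)
	started := make(chan struct{})
	go func() {
		close(started)
		done <- s.Send(context.Background(), &status{msg: "parked"})
	}()
	<-started
	s.Close()

	select {
	case err := <-done:
		return err, true
	case <-time.After(2 * time.Second):
		return nil, false
	}
}

func main() {
	fmt.Println(demo())
}
```

Because no new sender can register once closed is set, wg.Wait returns as soon as the parked senders observe stop, so close(s.ch) cannot coincide with a send even when the consumer has stopped draining.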

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
cli/commands/safe_status_ch_test.go (1)

100-102: ⚡ Quick win

Add a timeout when waiting for done after Close().

Close() has a timeout guard, but <-done can still block forever if a regression lets Close() return before the parked Send() exits. A bounded wait keeps this test fail-fast instead of hanging.

Proposed change
-	if err := <-done; err != nil {
-		t.Fatalf("Send returned error: %v", err)
-	}
+	select {
+	case err := <-done:
+		if err != nil {
+			t.Fatalf("Send returned error: %v", err)
+		}
+	case <-time.After(2 * time.Second):
+		t.Fatal("Send() did not return after Close()")
+	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cli/commands/safe_status_ch_test.go` around lines 100 - 102, The test waiting
on the done channel after calling Close() can hang; change the blocking receive
on done to a bounded wait using a select with a timeout (e.g., time.After).
Specifically, after invoking Close() in the test (and around the existing
Receive from done), replace the plain "<-done" check with a select that either
receives err := <-done and fails on err, or times out after a short duration and
calls t.Fatalf("timed out waiting for Send/Close to finish") so the test fails
fast if Send() stays parked; reference the done channel, Close(), and Send() in
the test when making this change.


📥 Commits

Reviewing files that changed from the base of the PR and between a272d02 and 11f06b4.

📒 Files selected for processing (2)
  • cli/commands/deploy.go
  • cli/commands/safe_status_ch_test.go

@evanphx evanphx merged commit e07178e into main May 27, 2026
19 checks passed
@evanphx evanphx deleted the mir-1167-panic-s-while-claude-was-deploying branch May 27, 2026 23:08