Fix send-on-closed-channel panic in deploy explain mode by evanphx · Pull Request #822 · mirendev/runtime

evanphx · 2026-05-26T19:08:18Z

Summary

Deploy's explain-mode progress handler races between RPC stream-handler goroutines (which call back into the handler after BuildFromTar returns) and the main goroutine running close(pw.Status()). Two such panics were reported in MIR-1167, both leaving the server-side deployment lock stuck for its 30-minute TTL.
Wrap pw.Status() in a small safeStatusCh helper that serializes Send and Close behind a mutex so close cannot race with an in-flight send. The drainer goroutine in progresswriter still consumes normally, so this introduces no new blocking risk.
Add a panic-recovery defer near the top of Deploy that calls updateDeploymentOnError and re-panics. A crash now releases the server lock immediately instead of waiting 30 minutes, while still surfacing the original stack trace.

Test plan

go test -race ./cli/commands/ -run SafeStatusCh -count=3 (new tests: concurrent send/close, send-after-close no-op, ctx-cancel respected)
go test -race ./cli/commands/... (189 tests pass)
make lint (0 issues)
make test (4332 passed, 20 skipped, 0 failures)
Smoke deploy: run an explain-mode deploy against a live cluster and confirm clean exit (not exercised in this PR; covered by unit-level race test)

The build status callback runs from per-stream goroutines that can outlive the parent RPC, so closing pw.Status() after BuildFromTar returned could race with an in-flight send and crash the CLI. Wrap the channel in a small mutex-guarded helper that serializes sends with close, and add a panic-recovery defer that flips the deployment to failed so a CLI crash no longer leaves the server-side lock stuck for its 30-minute TTL. Fixes MIR-1167.

coderabbitai · 2026-05-26T19:08:33Z

📝 Walkthrough

Walkthrough

The deploy command now includes two concurrency safety improvements. A new safeStatusCh wrapper protects the buildkit status channel from race conditions when buildkit goroutines continue sending status updates while the deploy command attempts to close the channel during shutdown. The wrapper uses a mutex to guard Close operations and context-aware Send to gracefully handle sends-after-close scenarios. Additionally, the deploy flow now wraps the build/deploy execution with a deferred panic recovery handler that records panics as deployment failures before re-panicing, ensuring server-side locking is released when unexpected errors occur.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cli/commands/deploy.go`:
- Around line 490-492: The defer recovery handler calls updateDeploymentOnError
which concatenates the shared slices buildLogs and buildErrors while
createBuildStatusCallback goroutines may still be appending to them; guard by
snapshotting or mutex-protecting those slices before calling
updateDeploymentOnError. Specifically, add or reuse a mutex that protects access
to buildLogs/buildErrors (the same one used by createBuildStatusCallback) or
acquire it in the deferred func inside the recover block, copy the slices into
local variables (e.g., logsCopy := append([]string(nil), buildLogs...), errsCopy
:= append([]string(nil), buildErrors...)), release the mutex, and then call
updateDeploymentOnError with the copies so updateDeploymentOnError operates on a
consistent snapshot without racing with buildCall goroutines.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 85ddbf8b-0381-449b-8337-34c48e631d46

📥 Commits

Reviewing files that changed from the base of the PR and between 926db8a and a272d02.

📒 Files selected for processing (2)

cli/commands/deploy.go
cli/commands/safe_status_ch_test.go

phinze

Once upon a time, in the kingdom of explain-mode deploys, there lived a tortoise and a hare. The hare was an RPC stream-handler goroutine: fast, exuberant, bounding through the forest firing build-status callbacks at every opportunity, occasionally circling back to drop another status update at the finish line long after everyone assumed the race was decided. The tortoise was the main deploy goroutine, plodding along to its one job: close the channel and call it a day.

In the original telling, the hare loses by napping. In ours, both racers reach the finish line at exactly the same instant, collide, and explode. The hare is mid-stride placing a status onto the channel; the tortoise is calmly closing it. Send on closed channel. Panic. Forest catches fire. Worse: the umpire (the deployment lock) saw nobody cross cleanly, shrugged, and went home for thirty minutes. No other animal in the kingdom could race until he returned.

This PR hires a referee. safeStatusCh stands at the finish line with a clipboard and lets exactly one racer touch the tape at a time; if the hare arrives after the tortoise has already closed the race, the referee politely waves him off instead of letting him bowl the tortoise over. The panic-recovery defer clips a pager to the umpire's belt, so if the racers somehow still crash into each other, at least somebody calls him back from lunch immediately instead of waiting half an hour. Two small interventions, both right where the trouble was.

We sat down and traced this end to end: through callInline, the explain-mode happy path, the error returns, the test alongside the race-detector semantics. The shape holds up. CodeRabbit's note about the slice race in the recovery defer is a fair sighting (the hare keeps appending to buildLogs/buildErrors while the umpire is being paged) but the worst case is a slightly fuzzy failure message, not a missed lock release. The 30-minute stuck lock was the bear; a garbled error string is not.

Two small woodland observations inline, neither blocking.

- Defer safeStatus.Close() right after construction so error returns no longer leak the progresswriter drainer goroutine. - Guard the shared buildErrors/buildLogs/deployWarnings slices with a mutex; readers go through a snapshotBuildState helper so callback goroutines and updateDeploymentOnError can't race on the slice header. - Track a deploymentFinalized flag set after UpdateDeploymentStatus "active" so a panic in post-success code (warnings, displayAccessInfo) doesn't flip a live deployment to failed. - Rework safeStatusCh to use a stop channel + sync.WaitGroup. Close no longer holds a mutex across the blocking channel send, so it can't deadlock if the channel's consumer stops draining. - Deterministic Close-mid-stream test via a midflight WaitGroup, plus a regression test for Close unblocking a parked Send.

coderabbitai

🧹 Nitpick comments (1)

cli/commands/safe_status_ch_test.go (1)

100-102: ⚡ Quick win

Add a timeout when waiting for done after Close().

Close() has a timeout guard, but <-done can still block forever if a regression lets Close() return before the parked Send() exits. A bounded wait keeps this test fail-fast instead of hanging.

Proposed change

-	if err := <-done; err != nil {
-		t.Fatalf("Send returned error: %v", err)
-	}
+	select {
+	case err := <-done:
+		if err != nil {
+			t.Fatalf("Send returned error: %v", err)
+		}
+	case <-time.After(2 * time.Second):
+		t.Fatal("Send() did not return after Close()")
+	}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cli/commands/safe_status_ch_test.go` around lines 100 - 102, The test waiting
on the done channel after calling Close() can hang; change the blocking receive
on done to a bounded wait using a select with a timeout (e.g., time.After).
Specifically, after invoking Close() in the test (and around the existing
Receive from done), replace the plain "<-done" check with a select that either
receives err := <-done and fails on err, or times out after a short duration and
calls t.Fatalf("timed out waiting for Send/Close to finish") so the test fails
fast if Send() stays parked; reference the done channel, Close(), and Send() in
the test when making this change.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@cli/commands/safe_status_ch_test.go`:
- Around line 100-102: The test waiting on the done channel after calling
Close() can hang; change the blocking receive on done to a bounded wait using a
select with a timeout (e.g., time.After). Specifically, after invoking Close()
in the test (and around the existing Receive from done), replace the plain
"<-done" check with a select that either receives err := <-done and fails on
err, or times out after a short duration and calls t.Fatalf("timed out waiting
for Send/Close to finish") so the test fails fast if Send() stays parked;
reference the done channel, Close(), and Send() in the test when making this
change.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 233c0269-593f-4f2c-a8c5-50b7c0d08f85

📥 Commits

Reviewing files that changed from the base of the PR and between a272d02 and 11f06b4.

📒 Files selected for processing (2)

cli/commands/deploy.go
cli/commands/safe_status_ch_test.go

evanphx requested a review from a team as a code owner May 26, 2026 19:08

coderabbitai Bot reviewed May 26, 2026

View reviewed changes

Comment thread cli/commands/deploy.go Outdated

phinze approved these changes May 26, 2026

View reviewed changes

Comment thread cli/commands/deploy.go

Comment thread cli/commands/safe_status_ch_test.go Outdated

coderabbitai Bot reviewed May 27, 2026

View reviewed changes

Merge branch 'main' into mir-1167-panic-s-while-claude-was-deploying

563a8e1

evanphx merged commit e07178e into main May 27, 2026
19 checks passed

evanphx deleted the mir-1167-panic-s-while-claude-was-deploying branch May 27, 2026 23:08

phinze mentioned this pull request May 28, 2026

Changelog catchup: PRs since v0.8.0 #827

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix send-on-closed-channel panic in deploy explain mode#822

Fix send-on-closed-channel panic in deploy explain mode#822
evanphx merged 3 commits into
mainfrom
mir-1167-panic-s-while-claude-was-deploying

evanphx commented May 26, 2026

Uh oh!

coderabbitai Bot commented May 26, 2026 •

edited

Loading

Walkthrough

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

phinze left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

evanphx commented May 26, 2026

Summary

Test plan

Uh oh!

coderabbitai Bot commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

phinze left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented May 26, 2026 •

edited

Loading