test(run): add race detector and edge case tests by dpup · Pull Request #179 · majorcontext/moat

dpup · 2026-02-17T18:34:44Z

Summary

Closes #82 — adds race detector to CI/Makefile and comprehensive edge case tests.

Race detector: make test-unit and make coverage now run with -race flag. CI already had it; this aligns local dev workflow.
28 new tests covering: firewall setup failures (iptables errors), cleanup error paths (idempotent stop, container removal errors, keep-container), concurrent access (state reads/writes, proxy shutdown, manager operations), privileged mode edge cases (dind/host/nil), log capture (idempotent, concurrent, error handling), and monitor-exit behavior.
Race fix: Pre-existing data race in Stop() — was writing r.State under m.mu (manager mutex) while monitorContainerExit reads via r.stateMu (run mutex). Fixed to use r.SetStateWithTime() consistently.
Test helper: New flexibleRuntime mock with configurable function fields for targeted error-path testing.

Test plan

All 28 new tests pass with go test -race ./internal/run/
Full suite go test -race ./... passes
go vet ./... clean
golangci-lint run clean
Verified make test-unit uses -race
Verified make coverage uses -race -covermode=atomic
No race conditions detected by race detector

🤖 Generated with Claude Code

claude · 2026-02-17T18:35:04Z

📦 Previous Review (superseded by newer review)

Claude finished @dpup's task in 2m 12s —— View job

Code Review — PR #179

Gather context
Read changed files
Review code
Post review feedback

Overall this is good work — the race detector integration, mock infrastructure, and most of the tests are solid. One issue stands out: the race fix in Stop() is incomplete, and a stale comment in the new test file incorrectly says the fix is still pending.

Issues

1. Incomplete race fix — `Start()` and `StartAttached()` still have the same bug

The PR fixes Stop() to use r.SetStateWithTime() (which holds r.stateMu), but Start() and StartAttached() still write r.State directly under m.mu — the same class of race that was fixed:

// manager.go:2052
r.State = StateStarting   // ← direct write under m.mu, not r.stateMu
m.mu.Unlock()

// manager.go:2067–2068
m.mu.Lock()
r.State = StateFailed     // ← same issue
r.Error = err.Error()
m.mu.Unlock()

// manager.go:2121–2122
m.mu.Lock()
r.State = StateRunning    // ← same issue
r.StartedAt = time.Now()
m.mu.Unlock()

And StartAttached() at line 2158:

r.State = StateStarting   // ← direct write under m.mu

Since monitorContainerExit() (running concurrently in a goroutine launched at line 2142) reads state via r.GetState() which acquires r.stateMu, these are live data races even after this PR. Fix this →

2. Stale comment in the new test contradicts the fix included in this PR

edge_cases_test.go lines 1081–1085:

// NOTE: There is a known data race in Stop() where r.State and r.StoppedAt
// are written under m.mu instead of r.stateMu, while monitorContainerExit
// reads them via r.GetState() which uses r.stateMu. This test exercises the
// captureLogs and sync.Once paths which ARE safe. A separate fix is needed
// for the state mutation race in Stop().

This comment says "A separate fix is needed for the state mutation race in Stop()" — but this PR includes that fix (manager.go line 2412 now uses r.SetStateWithTime()). The comment should either be removed or updated to reflect the actual situation (which, per finding #1 above, is that Start()/StartAttached() still need the same treatment). Fix this →

3. Double lock acquisition and state window in `monitorContainerExit`

Lines 2687–2688:

r.SetStateWithError(StateFailed, errMsg)  // acquires/releases r.stateMu
r.SetStateWithTime(StateFailed, time.Now())  // acquires/releases r.stateMu again

Two separate lock acquisitions for what should be a single atomic update. Between the two calls, another goroutine could observe StateFailed with no StoppedAt set. This is pre-existing, but worth noting since you're touching the surrounding code.

Minor observations

TestConcurrentStopAndMonitorExit_CaptureLogsIdempotent at line 1086 doesn't actually test concurrent Stop() + monitorContainerExit() — it only tests captureLogs/stopProxyServer directly. The test name implies more than it covers. This is fine as-is (the comment explains the limitation), but a follow-up that exercises the real concurrent path would be more valuable.
os.MkdirTemp vs t.TempDir() — the tests using os.MkdirTemp have a good comment explaining why (background goroutines outlive the test). This is correct.
The flexibleRuntime mock is well-structured and should be reusable across future test cases. Consider moving it to a _test.go helper file shared across the package if it grows further.

Verdict

The Makefile changes, flexibleRuntime mock, and most of the 28 tests are clean and correct. The core race fix in Stop() is right, but the same pattern in Start() and StartAttached() was missed — those are still data races that the race detector will catch in concurrent test scenarios. The stale comment compounds the confusion. Recommend addressing items 1 and 2 before merging.

claude · 2026-02-23T21:22:56Z

📦 Previous Review (superseded by newer review)

Claude finished @dpup's task in 4m 44s —— View job

Code Review — PR #179

Gather context
Read changed files
Review code
Post review feedback

The follow-up commit (eaa3011) correctly addresses all three issues from the previous review. Here's the full picture.

What was fixed

Start() race conditions — all three direct r.State writes now use the proper setters:

r.State = StateStarting under m.mu → r.SetState(StateStarting) after m.mu.Unlock()
r.State = StateFailed; r.Error = err.Error() under m.mu → r.SetStateWithError(StateFailed, err.Error())
r.State = StateRunning; r.StartedAt = time.Now() under m.mu → r.SetStateWithTime(StateRunning, time.Now())

StartAttached() race conditions — fixed consistently:

r.State = StateStarting while holding m.mu → r.SetState(StateStarting) after unlock
The conditional if r.State == StateStarting check now correctly uses r.GetState() / r.SetStateWithTime()

Stale comment — updated to accurately reflect the completed fix and its scope.

The lock ordering is also correct: by moving SetState(StateStarting) to after m.mu.Unlock(), the code avoids holding two mutexes simultaneously (m.mu + r.stateMu).

Remaining pre-existing issues (not introduced by this PR)

1. Double lock in `monitorContainerExit` for failed exits

manager.go:2678–2679:

r.SetStateWithError(StateFailed, errMsg)  // acquires/releases r.stateMu
r.SetStateWithTime(StateFailed, time.Now())  // acquires/releases r.stateMu again

Between these two calls there's a window where r.State == StateFailed but r.StoppedAt is still zero. Any concurrent reader (e.g., moat list) could observe this partial state. A combined SetStateWithErrorAndTime method would make this atomic. Fix this →

2. `StartAttached()` firewall failure leaves state as `StateRunning`

manager.go:2259–2268: when SetupFirewall fails in StartAttached(), the code stops the container and returns an error — but the run state is already StateRunning (set at line 2223) and is never updated to StateFailed. The run is permanently stuck in StateRunning with no container. Start() has the same gap (firewall failure returns without updating state from StateStarting to StateFailed), but the impact is smaller there since the state isn't yet StateRunning. Fix this →

Minor observations

TestConcurrentStopAndMonitorExit_CaptureLogsIdempotent — the test name still implies concurrent Stop() + monitorContainerExit(), but the comment correctly explains what's actually being tested (idempotency of captureLogs/stopProxyServer). Fine as-is.
The flexibleRuntime mock is well-structured. If it grows further, moving it to a shared _test.go helper in the package would help future test authors discover it.
os.MkdirTemp vs t.TempDir() usage is correctly explained by the comments about background goroutines outliving tests.

Verdict

All three issues from the previous review are fixed correctly. The race conditions in Start() and StartAttached() are now consistent with the Stop() fix, and the stale comment is updated. The two remaining issues above are pre-existing bugs unrelated to this PR's stated goal. This is good to merge.

claude · 2026-02-23T23:42:32Z

📦 Previous Review (superseded by newer review)

Claude finished @dpup's task in 4m 23s —— View job

Code Review — PR #179

Gather context
Read changed files
Review code
Post review feedback

All three issues from the previous reviews are resolved. The latest commit (6ad6b36) is clean and completes the work. Here's the full picture.

What was fixed in `6ad6b36`

SetStateFailedAt (run.go:235–241) — new method atomically sets State, Error, and StoppedAt under a single r.stateMu acquisition. Clean solution to the double-lock problem flagged in the previous review.

monitorContainerExit (manager.go:2680) — the two-call sequence

r.SetStateWithError(StateFailed, errMsg)  // was: acquires/releases separately
r.SetStateWithTime(StateFailed, time.Now())

is now a single atomic r.SetStateFailedAt(errMsg, time.Now()). No more observation window.

Firewall failure paths (manager.go:2077, 2264) — both Start() and StartAttached() now call r.SetStateFailedAt(...) before stopping the container, so the run is never stuck in StateStarting/StateRunning with no container underneath it.

Minor observation

TestStartFirewallFailureStopsContainer (edge_cases_test.go:150) verifies the error message and containerStopped, but doesn't assert r.GetState() == StateFailed. Since SetStateFailedAt is the core addition in 6ad6b36, adding that assertion would directly validate it:

if r.GetState() != StateFailed {
    t.Errorf("expected StateFailed after firewall error, got %s", r.GetState())
}

Same gap in TestStartFirewallFailureStopContainerAlsoFails.

Pre-existing issues (not introduced by this PR)

`Destroy()` reads `r.State` without `r.stateMu`

manager.go:2769 reads r.State directly while holding m.mu but not r.stateMu:

if r.State == StateRunning {  // ← direct read, no r.stateMu

This is the same class of race that was fixed everywhere else in this PR. Any concurrent SetState* call (which holds only r.stateMu) can race with this read. Fix this →

`SaveMetadata` reads protected fields without `r.stateMu`

run.go:155–157 reads r.State, r.StartedAt, r.StoppedAt, r.Error without holding r.stateMu. Since Stop() and monitorContainerExit() can both call SaveMetadata() concurrently (one after SetStateWithTime, the other after SetStateFailedAt), the race detector can observe a write to these fields from one goroutine while another goroutine reads them in SaveMetadata. Extracting a lockAndSaveMetadata() that reads under stateMu would close this.

Verdict

All previous review items are fully resolved. SetStateFailedAt is the right abstraction — small, precise, and eliminates the only remaining atomicity gap. The Destroy() read is the only bug with the same class as the fixed ones that was missed; the SaveMetadata issue is lower severity. Both are pre-existing. This is ready to merge.

- Enable -race flag in Makefile test-unit and coverage targets - Add 28 edge case tests: firewall failures, cleanup error paths, concurrent access, privileged mode, log capture, monitor exit - Fix pre-existing data race in Stop() by using SetStateWithTime() instead of direct field writes under wrong mutex - Add flexibleRuntime test helper for configurable mock testing - Update CLAUDE.md and CONTRIBUTING.md to reference make targets

Add SetStateFailedAt to atomically set state, error, and timestamp in a single lock acquisition. Use it in monitorContainerExit (fixes double lock) and both Start/StartAttached firewall failure paths (fixes stale StateRunning/StateStarting on failure).

Address remaining review feedback from PR #179: - SaveMetadata: snapshot stateMu-guarded fields under lock before disk write, fixing race between Stop() and monitorContainerExit() - Destroy: use r.GetState() instead of direct r.State read - StartContainer failure: use SetStateFailedAt for consistency with firewall failure path - CLI layer: migrate all r.State reads to r.GetState() across attach, clean, destroy, exec, list, resolve, status, stop, and wt commands - Tests: add StateFailed assertions to firewall failure tests, fix flexibleRuntime field naming consistency (containerLogsAllFn)

dpup · 2026-02-25T01:17:24Z

Picked up this PR, rebased onto main, and addressed all review feedback in commit 4a87b72:

Review items addressed:

SaveMetadata race (flagged as pre-existing) — SaveMetadata() now snapshots State, StartedAt, StoppedAt, and Error under stateMu before building the Metadata struct. This closes the race between Stop() and monitorContainerExit() calling SaveMetadata concurrently.
Destroy() unguarded r.State read (flagged as pre-existing) — Replaced r.State == StateRunning with r.GetState() == StateRunning, releasing m.mu first to maintain consistent lock ordering.
StartContainer failure path inconsistency — Changed from SetStateWithError to SetStateFailedAt so the run state includes StoppedAt, consistent with the firewall failure path.
Missing StateFailed assertions in firewall tests — Added r.GetState() == StateFailed checks to both TestStartFirewallFailureStopsContainer and TestStartFirewallFailureStopContainerAlsoFails.
CLI r.State direct reads — Migrated all r.State reads in cmd/moat/cli/ (attach, clean, destroy, exec, list, resolve, status, stop, wt) to use r.GetState(). These were real races with monitorContainerExit.
Minor cleanup — Fixed flexibleRuntime doc comment accuracy, renamed containerLogsAll → containerLogsAllFn for field naming consistency.

All tests pass with -race, go vet clean, golangci-lint clean.

claude · 2026-02-25T01:17:29Z

Claude encountered an error —— View job

I'll analyze this and get back to you.

dpup added 4 commits February 25, 2026 01:13

fix(run): complete race condition fixes for state mutations

371fa5b

dpup force-pushed the fix/82-race-detector-tests branch from 6ad6b36 to 4a87b72 Compare February 25, 2026 01:17

dpup merged commit 01b7f8f into main Feb 25, 2026
3 of 4 checks passed

dpup deleted the fix/82-race-detector-tests branch February 25, 2026 01:45

Conversation

dpup commented Feb 17, 2026

Summary

Test plan

Uh oh!

claude Bot commented Feb 17, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review — PR #179

Issues

1. Incomplete race fix — Start() and StartAttached() still have the same bug

2. Stale comment in the new test contradicts the fix included in this PR

3. Double lock acquisition and state window in monitorContainerExit

Minor observations

Verdict

Uh oh!

claude Bot commented Feb 23, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review — PR #179

What was fixed

Remaining pre-existing issues (not introduced by this PR)

1. Double lock in monitorContainerExit for failed exits

2. StartAttached() firewall failure leaves state as StateRunning

Minor observations

Verdict

Uh oh!

claude Bot commented Feb 23, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review — PR #179

What was fixed in 6ad6b36

Minor observation

Pre-existing issues (not introduced by this PR)

Destroy() reads r.State without r.stateMu

SaveMetadata reads protected fields without r.stateMu

Verdict

Uh oh!

dpup commented Feb 25, 2026

Uh oh!

claude Bot commented Feb 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented Feb 17, 2026 •

edited by github-actions Bot

Loading

1. Incomplete race fix — `Start()` and `StartAttached()` still have the same bug

3. Double lock acquisition and state window in `monitorContainerExit`

claude Bot commented Feb 23, 2026 •

edited by github-actions Bot

Loading

1. Double lock in `monitorContainerExit` for failed exits

2. `StartAttached()` firewall failure leaves state as `StateRunning`

claude Bot commented Feb 23, 2026 •

edited by github-actions Bot

Loading

What was fixed in `6ad6b36`

`Destroy()` reads `r.State` without `r.stateMu`

`SaveMetadata` reads protected fields without `r.stateMu`

claude Bot commented Feb 25, 2026 •

edited

Loading