Skip to content

test(run): add race detector and edge case tests#179

Merged
dpup merged 4 commits into
mainfrom
fix/82-race-detector-tests
Feb 25, 2026
Merged

test(run): add race detector and edge case tests#179
dpup merged 4 commits into
mainfrom
fix/82-race-detector-tests

Conversation

@dpup
Copy link
Copy Markdown
Collaborator

@dpup dpup commented Feb 17, 2026

Summary

Closes #82 — adds race detector to CI/Makefile and comprehensive edge case tests.

  • Race detector: make test-unit and make coverage now run with -race flag. CI already had it; this aligns local dev workflow.
  • 28 new tests covering: firewall setup failures (iptables errors), cleanup error paths (idempotent stop, container removal errors, keep-container), concurrent access (state reads/writes, proxy shutdown, manager operations), privileged mode edge cases (dind/host/nil), log capture (idempotent, concurrent, error handling), and monitor-exit behavior.
  • Race fix: Pre-existing data race in Stop() — was writing r.State under m.mu (manager mutex) while monitorContainerExit reads via r.stateMu (run mutex). Fixed to use r.SetStateWithTime() consistently.
  • Test helper: New flexibleRuntime mock with configurable function fields for targeted error-path testing.

Test plan

  • All 28 new tests pass with go test -race ./internal/run/
  • Full suite go test -race ./... passes
  • go vet ./... clean
  • golangci-lint run clean
  • Verified make test-unit uses -race
  • Verified make coverage uses -race -covermode=atomic
  • No race conditions detected by race detector

🤖 Generated with Claude Code

@claude
Copy link
Copy Markdown

claude Bot commented Feb 17, 2026

📦 Previous Review (superseded by newer review)

Claude finished @dpup's task in 2m 12s —— View job


Code Review — PR #179

  • Gather context
  • Read changed files
  • Review code
  • Post review feedback

Overall this is good work — the race detector integration, mock infrastructure, and most of the tests are solid. One issue stands out: the race fix in Stop() is incomplete, and a stale comment in the new test file incorrectly says the fix is still pending.


Issues

1. Incomplete race fix — Start() and StartAttached() still have the same bug

The PR fixes Stop() to use r.SetStateWithTime() (which holds r.stateMu), but Start() and StartAttached() still write r.State directly under m.mu — the same class of race that was fixed:

// manager.go:2052
r.State = StateStarting   // ← direct write under m.mu, not r.stateMu
m.mu.Unlock()

// manager.go:2067–2068
m.mu.Lock()
r.State = StateFailed     // ← same issue
r.Error = err.Error()
m.mu.Unlock()

// manager.go:2121–2122
m.mu.Lock()
r.State = StateRunning    // ← same issue
r.StartedAt = time.Now()
m.mu.Unlock()

And StartAttached() at line 2158:

r.State = StateStarting   // ← direct write under m.mu

Since monitorContainerExit() (running concurrently in a goroutine launched at line 2142) reads state via r.GetState() which acquires r.stateMu, these are live data races even after this PR. Fix this →

2. Stale comment in the new test contradicts the fix included in this PR

edge_cases_test.go lines 1081–1085:

// NOTE: There is a known data race in Stop() where r.State and r.StoppedAt
// are written under m.mu instead of r.stateMu, while monitorContainerExit
// reads them via r.GetState() which uses r.stateMu. This test exercises the
// captureLogs and sync.Once paths which ARE safe. A separate fix is needed
// for the state mutation race in Stop().

This comment says "A separate fix is needed for the state mutation race in Stop()" — but this PR includes that fix (manager.go line 2412 now uses r.SetStateWithTime()). The comment should either be removed or updated to reflect the actual situation (which, per finding #1 above, is that Start()/StartAttached() still need the same treatment). Fix this →

3. Double lock acquisition and state window in monitorContainerExit

Lines 2687–2688:

r.SetStateWithError(StateFailed, errMsg)  // acquires/releases r.stateMu
r.SetStateWithTime(StateFailed, time.Now())  // acquires/releases r.stateMu again

Two separate lock acquisitions for what should be a single atomic update. Between the two calls, another goroutine could observe StateFailed with no StoppedAt set. This is pre-existing, but worth noting since you're touching the surrounding code.


Minor observations

  • TestConcurrentStopAndMonitorExit_CaptureLogsIdempotent at line 1086 doesn't actually test concurrent Stop() + monitorContainerExit() — it only tests captureLogs/stopProxyServer directly. The test name implies more than it covers. This is fine as-is (the comment explains the limitation), but a follow-up that exercises the real concurrent path would be more valuable.

  • os.MkdirTemp vs t.TempDir() — the tests using os.MkdirTemp have a good comment explaining why (background goroutines outlive the test). This is correct.

  • The flexibleRuntime mock is well-structured and should be reusable across future test cases. Consider moving it to a _test.go helper file shared across the package if it grows further.


Verdict

The Makefile changes, flexibleRuntime mock, and most of the 28 tests are clean and correct. The core race fix in Stop() is right, but the same pattern in Start() and StartAttached() was missed — those are still data races that the race detector will catch in concurrent test scenarios. The stale comment compounds the confusion. Recommend addressing items 1 and 2 before merging.

@claude
Copy link
Copy Markdown

claude Bot commented Feb 23, 2026

📦 Previous Review (superseded by newer review)

Claude finished @dpup's task in 4m 44s —— View job


Code Review — PR #179

  • Gather context
  • Read changed files
  • Review code
  • Post review feedback

The follow-up commit (eaa3011) correctly addresses all three issues from the previous review. Here's the full picture.

What was fixed

Start() race conditions — all three direct r.State writes now use the proper setters:

  • r.State = StateStarting under m.mur.SetState(StateStarting) after m.mu.Unlock()
  • r.State = StateFailed; r.Error = err.Error() under m.mur.SetStateWithError(StateFailed, err.Error())
  • r.State = StateRunning; r.StartedAt = time.Now() under m.mur.SetStateWithTime(StateRunning, time.Now())

StartAttached() race conditions — fixed consistently:

  • r.State = StateStarting while holding m.mur.SetState(StateStarting) after unlock
  • The conditional if r.State == StateStarting check now correctly uses r.GetState() / r.SetStateWithTime()

Stale comment — updated to accurately reflect the completed fix and its scope.

The lock ordering is also correct: by moving SetState(StateStarting) to after m.mu.Unlock(), the code avoids holding two mutexes simultaneously (m.mu + r.stateMu).

Remaining pre-existing issues (not introduced by this PR)

1. Double lock in monitorContainerExit for failed exits

manager.go:2678–2679:

r.SetStateWithError(StateFailed, errMsg)  // acquires/releases r.stateMu
r.SetStateWithTime(StateFailed, time.Now())  // acquires/releases r.stateMu again

Between these two calls there's a window where r.State == StateFailed but r.StoppedAt is still zero. Any concurrent reader (e.g., moat list) could observe this partial state. A combined SetStateWithErrorAndTime method would make this atomic. Fix this →

2. StartAttached() firewall failure leaves state as StateRunning

manager.go:2259–2268: when SetupFirewall fails in StartAttached(), the code stops the container and returns an error — but the run state is already StateRunning (set at line 2223) and is never updated to StateFailed. The run is permanently stuck in StateRunning with no container. Start() has the same gap (firewall failure returns without updating state from StateStarting to StateFailed), but the impact is smaller there since the state isn't yet StateRunning. Fix this →

Minor observations

  • TestConcurrentStopAndMonitorExit_CaptureLogsIdempotent — the test name still implies concurrent Stop() + monitorContainerExit(), but the comment correctly explains what's actually being tested (idempotency of captureLogs/stopProxyServer). Fine as-is.
  • The flexibleRuntime mock is well-structured. If it grows further, moving it to a shared _test.go helper in the package would help future test authors discover it.
  • os.MkdirTemp vs t.TempDir() usage is correctly explained by the comments about background goroutines outliving tests.

Verdict

All three issues from the previous review are fixed correctly. The race conditions in Start() and StartAttached() are now consistent with the Stop() fix, and the stale comment is updated. The two remaining issues above are pre-existing bugs unrelated to this PR's stated goal. This is good to merge.

@claude
Copy link
Copy Markdown

claude Bot commented Feb 23, 2026

📦 Previous Review (superseded by newer review)

Claude finished @dpup's task in 4m 23s —— View job


Code Review — PR #179

  • Gather context
  • Read changed files
  • Review code
  • Post review feedback

All three issues from the previous reviews are resolved. The latest commit (6ad6b36) is clean and completes the work. Here's the full picture.

What was fixed in 6ad6b36

SetStateFailedAt (run.go:235–241) — new method atomically sets State, Error, and StoppedAt under a single r.stateMu acquisition. Clean solution to the double-lock problem flagged in the previous review.

monitorContainerExit (manager.go:2680) — the two-call sequence

r.SetStateWithError(StateFailed, errMsg)  // was: acquires/releases separately
r.SetStateWithTime(StateFailed, time.Now())

is now a single atomic r.SetStateFailedAt(errMsg, time.Now()). No more observation window.

Firewall failure paths (manager.go:2077, 2264) — both Start() and StartAttached() now call r.SetStateFailedAt(...) before stopping the container, so the run is never stuck in StateStarting/StateRunning with no container underneath it.

Minor observation

TestStartFirewallFailureStopsContainer (edge_cases_test.go:150) verifies the error message and containerStopped, but doesn't assert r.GetState() == StateFailed. Since SetStateFailedAt is the core addition in 6ad6b36, adding that assertion would directly validate it:

if r.GetState() != StateFailed {
    t.Errorf("expected StateFailed after firewall error, got %s", r.GetState())
}

Same gap in TestStartFirewallFailureStopContainerAlsoFails.

Pre-existing issues (not introduced by this PR)

Destroy() reads r.State without r.stateMu

manager.go:2769 reads r.State directly while holding m.mu but not r.stateMu:

if r.State == StateRunning {  // ← direct read, no r.stateMu

This is the same class of race that was fixed everywhere else in this PR. Any concurrent SetState* call (which holds only r.stateMu) can race with this read. Fix this →

SaveMetadata reads protected fields without r.stateMu

run.go:155–157 reads r.State, r.StartedAt, r.StoppedAt, r.Error without holding r.stateMu. Since Stop() and monitorContainerExit() can both call SaveMetadata() concurrently (one after SetStateWithTime, the other after SetStateFailedAt), the race detector can observe a write to these fields from one goroutine while another goroutine reads them in SaveMetadata. Extracting a lockAndSaveMetadata() that reads under stateMu would close this.

Verdict

All previous review items are fully resolved. SetStateFailedAt is the right abstraction — small, precise, and eliminates the only remaining atomicity gap. The Destroy() read is the only bug with the same class as the fixed ones that was missed; the SaveMetadata issue is lower severity. Both are pre-existing. This is ready to merge.

dpup added 4 commits February 25, 2026 01:13
- Enable -race flag in Makefile test-unit and coverage targets
- Add 28 edge case tests: firewall failures, cleanup error paths,
  concurrent access, privileged mode, log capture, monitor exit
- Fix pre-existing data race in Stop() by using SetStateWithTime()
  instead of direct field writes under wrong mutex
- Add flexibleRuntime test helper for configurable mock testing
- Update CLAUDE.md and CONTRIBUTING.md to reference make targets
Add SetStateFailedAt to atomically set state, error, and timestamp in a
single lock acquisition. Use it in monitorContainerExit (fixes double
lock) and both Start/StartAttached firewall failure paths (fixes stale
StateRunning/StateStarting on failure).
Address remaining review feedback from PR #179:

- SaveMetadata: snapshot stateMu-guarded fields under lock before disk
  write, fixing race between Stop() and monitorContainerExit()
- Destroy: use r.GetState() instead of direct r.State read
- StartContainer failure: use SetStateFailedAt for consistency with
  firewall failure path
- CLI layer: migrate all r.State reads to r.GetState() across attach,
  clean, destroy, exec, list, resolve, status, stop, and wt commands
- Tests: add StateFailed assertions to firewall failure tests, fix
  flexibleRuntime field naming consistency (containerLogsAllFn)
@dpup dpup force-pushed the fix/82-race-detector-tests branch from 6ad6b36 to 4a87b72 Compare February 25, 2026 01:17
@dpup
Copy link
Copy Markdown
Collaborator Author

dpup commented Feb 25, 2026

Picked up this PR, rebased onto main, and addressed all review feedback in commit 4a87b72:

Review items addressed:

  1. SaveMetadata race (flagged as pre-existing) — SaveMetadata() now snapshots State, StartedAt, StoppedAt, and Error under stateMu before building the Metadata struct. This closes the race between Stop() and monitorContainerExit() calling SaveMetadata concurrently.

  2. Destroy() unguarded r.State read (flagged as pre-existing) — Replaced r.State == StateRunning with r.GetState() == StateRunning, releasing m.mu first to maintain consistent lock ordering.

  3. StartContainer failure path inconsistency — Changed from SetStateWithError to SetStateFailedAt so the run state includes StoppedAt, consistent with the firewall failure path.

  4. Missing StateFailed assertions in firewall tests — Added r.GetState() == StateFailed checks to both TestStartFirewallFailureStopsContainer and TestStartFirewallFailureStopContainerAlsoFails.

  5. CLI r.State direct reads — Migrated all r.State reads in cmd/moat/cli/ (attach, clean, destroy, exec, list, resolve, status, stop, wt) to use r.GetState(). These were real races with monitorContainerExit.

  6. Minor cleanup — Fixed flexibleRuntime doc comment accuracy, renamed containerLogsAllcontainerLogsAllFn for field naming consistency.

All tests pass with -race, go vet clean, golangci-lint clean.

@claude
Copy link
Copy Markdown

claude Bot commented Feb 25, 2026

Claude encountered an error —— View job


I'll analyze this and get back to you.

@dpup dpup merged commit 01b7f8f into main Feb 25, 2026
3 of 4 checks passed
@dpup dpup deleted the fix/82-race-detector-tests branch February 25, 2026 01:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Missing race detector and edge case tests

1 participant