Skip to content

Simplify execution model: remove attach/detach, fix run lifecycle#196

Merged
dpup merged 28 commits into
mainfrom
fix/interactive-only
Mar 3, 2026
Merged

Simplify execution model: remove attach/detach, fix run lifecycle#196
dpup merged 28 commits into
mainfrom
fix/interactive-only

Conversation

@dpup
Copy link
Copy Markdown
Collaborator

@dpup dpup commented Feb 28, 2026

Summary

  • Remove moat attach command, --detach flag, and detach escape sequence (Ctrl-/ d)
  • Simplify to two clean modes: interactive (owns terminal) and non-interactive (blocks until exit)
  • Fix non-interactive runs orphaning containers and losing logs (manager.Close() was canceling monitorContainerExit before it could finish)
  • Fix interactive runs leaving metadata stuck on "running" after normal exit
  • Fix Docker log capture returning empty for fast-exiting containers (wait for log driver flush)
  • Extract ProviderRunner helper to deduplicate provider CLI boilerplate
  • Extract cleanupResources and postStart helpers from run manager

Test plan

  • Unit tests pass (make test-unit, 38 packages, race detector)
  • Lint clean (make lint, 0 issues)
  • Non-interactive echo captures logs and sets state: stopped
  • Non-interactive sleep + SIGTERM stops gracefully
  • Interactive echo sets state: stopped and stopped_at
  • moat stop from another process cleanly unblocks the CLI
  • Verified on both Docker and Apple containers

🤖 Generated with Claude Code

dpup added 17 commits February 27, 2026 15:29
Simplifying to two modes: interactive (attached) and non-interactive
(monitor via moat logs). The attach command allowed reconnecting to
running containers, which added complexity without sufficient value.
Non-interactive runs now always start in background. Interactive runs
always own the terminal. Removes RunAttached and RunInteractive in
favor of the single RunInteractiveAttached path.
Only Ctrl-/ k (stop) remains. Ctrl-/ d is no longer an escape
sequence and passes through to the container.
Interactive sessions now run until the process exits or user sends
Ctrl-/ k to stop. No detach path exists.
Only StartAttached remains for interactive sessions. The separate
Attach method was only needed for the removed moat attach command.
Point users to moat logs and moat stop instead of the removed
moat attach command.
Remove documentation for moat attach command, --detach flag, and
Ctrl-/ d escape sequence. Document the simplified two-mode model.
TTY was always set identically to Interactive by every caller and was
never read inside internal/run/. The actual TTY decision is made at
runtime via term.IsTerminal(os.Stdin) in manager.StartAttached().
StreamLogs was never true in production - all callers passed false.
Non-interactive runs tell users to use 'moat logs -f' instead of
streaming to stdout. Removes the streamLogs method and its stdcopy
import.
Extract three shared helpers that were duplicated between Start() and
StartAttached():
- setLogContext: configures structured logger with run fields
- setupPortBindings: resolves host ports and registers routes
- setupFirewall: configures iptables isolation for strict network policy

Reduces ~80 lines of duplication while preserving method-specific
ordering (Start does firewall before ports, StartAttached the reverse).
With only one escape action (EscapeStop), the separate escapeCh channel
was unnecessary. Escape errors now flow through attachDone directly and
are handled inline, eliminating a channel and simplifying the goroutine
to a one-liner.
Consolidate resource cleanup duplicated across Stop(), Wait(),
monitorContainerExit(), and Destroy() into a single cleanupResources()
method guarded by sync.Once. This ensures proxy unregistration, SSH
agent stop, service/BuildKit teardown, container removal, network
cleanup, route unregistration, and temp dir removal happen exactly once
regardless of which code path runs first.

Reduces ~300 lines of duplicated cleanup code across four methods.
All three provider CLIs (claude, codex, gemini) followed an identical
~170 line pattern for workspace resolution, grant dedup, config setup,
dry run, and execution. Extract a RunProvider() helper in internal/cli
that handles the shared boilerplate.

Each provider is now ~30-50 lines of config + flag registration,
with provider-specific logic isolated in BuildCommand and
ConfigureAgent callbacks. The ProviderRunConfig struct makes the
differences between providers explicit and declarative.
…etadata

Three fixes for the simplified execution model:

1. Non-interactive runs now block until the container exits instead of
   returning immediately. This ensures monitorContainerExit has time to
   capture logs and update state before manager.Close() cancels its
   context. Signal handling (SIGINT/SIGTERM) gracefully stops the run.

2. Interactive runs now update state to stopped/failed after the
   container exits normally. Previously state stayed "running" and
   stopped_at was never set.

3. Docker ContainerLogsAll now waits for the container to reach
   not-running state before reading logs. This prevents empty log
   capture for fast-exiting containers where Docker's log driver
   hasn't flushed yet.
Non-interactive runs block until exit, not background. Remove
references to the removed -d/--detach flag from worktree examples.
@claude
Copy link
Copy Markdown

claude Bot commented Feb 28, 2026

📦 Previous Review (superseded by newer review)

Claude finished @dpup's task in 6m 24s —— View job


PR Review: Simplify execution model

  • Gather context and read key files
  • Review run lifecycle changes (internal/run/manager.go)
  • Review container changes (internal/container/)
  • Review provider CLI refactor (internal/cli/provider.go)
  • Review terminal/escape changes (internal/term/)
  • Post summary and inline comments

Overview

This is a solid simplification: removing the attach/detach model and replacing it with two clean modes (interactive/blocking and non-interactive/blocking) removes a lot of complexity. The root causes of the bugs fixed here (orphaned containers, stuck metadata) are correctly identified and addressed.

A few issues worth addressing before merging:


Bug: Escape-stop (Ctrl-/ k) sets state to failed and may not stop the container

internal/run/manager.go, around line 2372

When the user presses Ctrl-/ k to stop an interactive session, docker.go:StartAttached returns EscapeError from the stdinDone channel. manager.StartAttached then receives this as attachErr.

The comment on the state-update block says "Skip if context was canceled (e.g. escape-stop or SIGTERM)", but the escape-stop path does not cancel the context at this point — attachCtx is only canceled later via defer attachCancel() when RunInteractiveAttached returns. So ctx.Err() == nil is true and SetStateFailedAt("escape: stop", ...) runs, marking the run as failed.

The caller's manager.Stop() then checks the state:

if currentState != StateRunning && currentState != StateStarting {
    return nil // Already stopped
}

State is StateFailed, not StateRunning, so Stop returns without stopping the container. The container keeps running until it receives SIGHUP when the hijacked Docker conn is closed by resp.Close(). For TTY containers this usually works, but it's a side-effect, not an explicit stop.

End result: runs stopped via escape show state: failed in moat list, not state: stopped. Fix this →


Bug: captureLogs in StartAttached fires before container exits on escape-stop

internal/run/manager.go, line 2402

In the escape-stop path, captureLogs(r) is called while the container is still running (stdin was disconnected, but the process hasn't exited yet). ContainerLogsAll calls ContainerWait with a 10-second timeout, so this introduces up to a 10-second stall between the escape keypress and exec.go's escape handler printing "Stopping run...". This is especially wrong because the container may not even stop within that window.

The 10s timeout fires, captureLogs captures incomplete logs, and sets logsCaptured = true. When the container later exits naturally (from SIGHUP), monitorContainerExit's captureLogs call is a no-op — the final logs aren't captured.

The fix ties to the one above: if attachErr is an EscapeError (or context canceled), skip the captureLogs call in StartAttached and let monitorContainerExit capture them after the container actually exits. Fix this →


Stale comment in manager.StartAttached

internal/run/manager.go, line 2370

The comment // Skip if context was canceled (e.g. escape-stop or SIGTERM) — the caller's // Stop() call handles that state transition. is misleading: escape-stop does not cancel the context, and after the bug above is fixed it still won't — the comment should be updated to reflect the actual condition.


RunProvider prints informational messages after the run has already completed

internal/cli/provider.go, lines 223–226

result, err := ExecuteRun(ctx, opts)
...
if result != nil {
    fmt.Printf("Starting %s in %s\n", rc.Name, absPath)
    fmt.Printf("Run: %s (%s)\n", result.Name, result.ID)
}

ExecuteRun blocks until the container exits (both interactive and non-interactive modes). So these messages are printed after the run has finished — after the container's output, "Run completed", etc. The user sees "Starting claude in /path" as the last thing, which is confusing.

These messages should either move to before ExecuteRun, or be removed (since ExecuteRun already prints "Initializing..." and state transitions). Fix this →


Data race: m.daemonClient read without holding m.mu

internal/run/manager.go, lines 2627, 2686, 2925 (and registerPersistedRun line 315)

m.daemonClient is written under m.mu.Lock() in Create(). But it's read without the lock in cleanupResources(), Close(), and registerPersistedRun(). cleanupResources() can be called concurrently with Create() (via monitorContainerExit on one run while another is being created), which is a data race the -race detector will flag.

The common pattern if m.daemonClient != nil { ... } is not safe without the lock. Either hold m.mu.RLock() at those call sites, or store daemonClient atomically.


Minor: setupFirewall / setupPortBindings order is inconsistent between Start and StartAttached

internal/run/manager.go

Start(): firewall → port bindings (lines 2253–2257)
StartAttached(): port bindings → firewall (lines 2358–2362)

This probably doesn't cause bugs but is an easy source of future confusion.


What's good

The core lifecycle fix is solid: using monitorContainerExit as the only place that calls WaitContainer and closes exitCh, with cleanupOnce/logsCaptured/providerHooksDone atomics preventing double-work, is a clean design. The ContainerLogsAll fix (waiting for container stop before fetching logs) correctly addresses the empty-logs-for-fast-containers bug. The ProviderRunner extraction genuinely removes a lot of boilerplate across claude, codex, and gemini CLI files.

Address code review feedback:

1. Escape-stop (Ctrl-/ k) no longer sets state to "failed". The
   state update and log capture in StartAttached are skipped when
   the attach ended due to an escape or context cancellation, letting
   the caller's Stop() handle state and monitorContainerExit capture
   complete logs.

2. captureLogs no longer fires while the container is still running
   on escape-stop. Previously this caused a 10s stall (ContainerWait
   timeout) and captured incomplete logs.

3. Provider "Started..." message now prints via OnRunCreated callback
   before ExecuteRun blocks, not after the run completes.
@claude
Copy link
Copy Markdown

claude Bot commented Feb 28, 2026

📦 Previous Review (superseded by newer review)

Claude finished @dpup's task in 5m 8s —— View job


PR Review: Simplify execution model

  • Gather context
  • Read key changed files (internal/run/manager.go, internal/cli/provider.go, internal/term/escape.go, cmd/moat/cli/exec.go)
  • Check if previous review issues were addressed
  • Post updated review

What was fixed since the previous review

All four concrete bugs called out in the previous review are addressed:

  1. Escape-stop state: failed bug — Fixed. callerWillStop := ctx.Err() != nil || term.IsEscapeError(attachErr) (line 2373) correctly identifies the escape-stop path, so SetStateFailedAt is no longer called when the user presses Ctrl-/ k.

  2. Premature captureLogs during escape-stop — Fixed. The captureLogs(r) call is now gated on !callerWillStop (line 2406), avoiding the 10-second stall and incomplete log capture.

  3. Comment now accurate — Updated at lines 2369–2372.

  4. RunProvider message ordering — Fixed. The "Started..." message is now printed via opts.OnRunCreated (line 219-221), which is called from inside ExecuteRun immediately after run creation, before blocking on execution. The previous fmt.Printf calls after ExecuteRun returned are gone.


Issues still open from previous review

Data race: m.daemonClient reads without holding m.mu

m.daemonClient is written under m.mu.Lock() in Create() (line 519–521), but read unprotected in:

  • cleanupResources() at line 2635: if r.ProxyAuthToken != "" && m.daemonClient != nil
  • Close() at line 2933 (protected by m.mu.RLock() at line 2928, this one is fine)
  • registerPersistedRun() at line 315 (initialization path, unlikely to race)

The real concern is cleanupResources(), which is called from the monitorContainerExit goroutine concurrently with Create() on the main goroutine (e.g., one run finishing while another is being created). The race detector will flag this. The fix is a local snapshot before the closure:

m.mu.RLock()
dc := m.daemonClient
m.mu.RUnlock()

r.cleanupOnce.Do(func() {
    ...
    if r.ProxyAuthToken != "" && dc != nil {
        dc.UnregisterRun(ctx, r.ProxyAuthToken)
    }
    ...
})

setupPortBindings / setupFirewall order inconsistency

Start() runs firewall → port bindings (lines 2253–2257).
StartAttached() runs port bindings → firewall (lines 2358–2362).

Not a correctness bug today, but an easy source of future divergence.


New issue: Apple TTY interactive logs lost on escape-stop

internal/run/manager.go, lines 2388 and 2525–2527

When callerWillStop = true (escape-stop or SIGTERM), StartAttached skips the Apple TTY log-buffer write at line 2388. Then the caller's Stop() calls captureLogs(r), which immediately returns for Apple interactive runs:

// captureLogs, line 2525
if r.Interactive && m.runtime.Type() == container.RuntimeApple {
    return
}

The logBuffer — which holds all output captured via the tee up to the escape point — is abandoned and logs.jsonl is never created for that run. For Docker, Stop() calls StopContainer first, then captureLogs successfully fetches logs via ContainerLogsAll. Apple has no equivalent path.

This breaks the audit trail for any Apple-container interactive run that is escape-stopped or SIGTERM-stopped.

The fix is to write the buffer even when callerWillStop = true, since at that point the tee has already stopped receiving data and the buffer is as complete as it can be:

// Write Apple TTY buffer regardless of callerWillStop —
// the buffer is complete (tee stops when StartAttached returns)
// and Stop()/captureLogs() has no way to get Apple TTY output.
if r.Interactive && r.Store != nil && m.runtime.Type() == container.RuntimeApple {
    if r.logsCaptured.CompareAndSwap(false, true) {
        if lw, err := r.Store.LogWriter(); err == nil {
            if logBuffer.Len() > 0 {
                _, _ = lw.Write(logBuffer.Bytes())
            }
            lw.Close()
        } else {
            r.logsCaptured.Store(false)
        }
    }
}

Fix this →


Minor

Stale comment in Wait() at line 2489:

case <-ctx.Done():
    // Context canceled - caller chose to detach, don't stop the run
    return ctx.Err()

"detach" is removed from the model. The comment should say something like "Context canceled — caller is responsible for stopping the run." Fix this →


What looks good

The core lifecycle redesign is solid. monitorContainerExit as the single goroutine owning WaitContainer and exitCh, with cleanupOnce/logsCaptured/providerHooksDone atomics preventing double-work, is clean and correct for the non-interactive path. The ContainerLogsAll fix (waiting for container stop before fetching logs) correctly addresses the empty-logs-for-fast-containers bug. The ProviderRunner extraction significantly reduces boilerplate across claude, codex, and gemini.

The escape-stop fixes are correct and the state machine is now consistent. Interactive runs use Stop() for all cleanup; non-interactive runs use monitorContainerExit. The two paths don't interfere.

Summary: The Apple TTY log-loss bug (#7) is a correctness issue for a real user scenario (macOS, interactive, escape-stop). The data race (#5) should be fixed before merge. Everything else is minor.

Address second round of code review feedback:

1. Apple TTY log buffer is now written unconditionally in StartAttached,
   not gated on callerWillStop. The buffer holds all output up to the
   escape point, and captureLogs() returns early for Apple interactive
   runs so this is the only path that writes logs.

2. cleanupResources snapshots m.daemonClient under m.mu.RLock before
   the cleanupOnce closure, preventing a data race with Create() which
   writes daemonClient under m.mu.Lock.

3. Stale "detach" comment in Wait() updated.
@claude
Copy link
Copy Markdown

claude Bot commented Feb 28, 2026

📦 Previous Review (superseded by newer review)

Claude finished @dpup's task in 4m 51s —— View job


PR Review: Simplify execution model

  • Gather context from previous reviews
  • Read key changed files
  • Check remaining issues and post review

What was fixed since the previous two reviews

All five concrete bugs from previous rounds are addressed:

  1. Escape-stop state: failedcallerWillStop := ctx.Err() != nil || term.IsEscapeError(attachErr) (line 2373) correctly gates the state update.
  2. Premature captureLogs on escape-stop — gated on !callerWillStop (line 2406).
  3. Stale "detach" comment in Wait() — now reads "Context canceled — caller is responsible for stopping the run" (line 2489).
  4. RunProvider message ordering — moved to opts.OnRunCreated callback, fires before ExecuteRun blocks (line 219–221 in internal/cli/provider.go).
  5. Apple TTY logs lost on escape-stop — Apple buffer write is now unconditional (lines 2388–2401), no longer gated on !callerWillStop.

Bug: Partial data race fix — m.daemonClient still read unprotected at line 2701

The snapshot taken at lines 2633–2635:

m.mu.RLock()
dc := m.daemonClient
m.mu.RUnlock()

…is correctly used for UnregisterRun (line 2642). But inside the same cleanupOnce.Do closure, m.daemonClient is accessed directly again at line 2701 for UnregisterRoutes:

if r.Name != "" {
    _ = m.routes.Remove(r.Name)
    if m.daemonClient != nil {                                          // ← unprotected
        if err := m.daemonClient.UnregisterRoutes(ctx, r.Name); err != nil {  // ← unprotected

This has the same race as the original: cleanupResources runs from monitorContainerExit goroutines concurrently with Create() which writes m.daemonClient under m.mu.Lock(). The fix is to replace these two direct reads with the existing snapshot dc. Fix this →


Minor: Stale "detached mode" in monitorContainerExit comment

internal/run/manager.go, lines 2727–2728:

// This runs in the background for ALL runs to ensure logs are captured
// even in detached mode where Wait() is never called.

"Detached mode" was removed. This was mentioned in the previous review (Wait() comment was fixed but this one wasn't). Fix this →


Minor: setupPortBindings/setupFirewall order still inconsistent

Start() (lines 2253–2257): firewall → port bindings
StartAttached() (lines 2358–2362): port bindings → firewall

Flagged in both previous reviews and still unaddressed. Not a correctness bug today, but worth aligning to prevent confusion when touching either path.


Worth noting: Apple TTY logs may be incomplete on SIGTERM

When SIGTERM arrives in RunInteractiveAttached, the flow is:

  1. attachCancel() cancels attachCtx
  2. manager.Stop(context.Background(), r.ID) runs to completion (stops container, captures Docker logs, returns)
  3. RunInteractiveAttached returns; ExecuteRun's defer manager.Close() fires

Meanwhile, the goroutine running manager.StartAttached (exec.go line 364) is still in flight — it writes the Apple TTY logBuffer at lines 2388–2401 of manager.go. There is no synchronization point between step 3 and the goroutine's buffer write. If the process exits quickly after manager.Close(), the log file may be incomplete. In practice the window is very small and the file write is fast, but it's a gap worth being aware of. The fix would be to wait for the attachDone channel before returning in the SIGTERM case.


What looks good

The design is solid. Using monitorContainerExit as the sole owner of WaitContainer and exitCh, with cleanupOnce/logsCaptured/providerHooksDone atomics preventing double-work, is clean and correct. The ContainerLogsAll fix (10s ContainerWait before fetching logs) correctly handles fast-exiting containers. The ProviderRunner extraction is a clear win — claude, codex, and gemini CLI files each lost 140–150 lines of near-identical boilerplate.

The escape-stop state machine is now correct and consistent. All previous review issues have been addressed except the UnregisterRoutes data race and the two cosmetic items above.

Summary: The UnregisterRoutes race (data race, should fix before merge) and the minor SIGTERM/Apple log completeness gap are the remaining actionable items. Everything else is comment cleanup.

…omment

Complete the daemonClient race fix: the UnregisterRoutes call inside
cleanupOnce.Do was still accessing m.daemonClient directly instead of
the snapshot taken under m.mu.RLock. Also update monitorContainerExit
comment to remove stale "detached mode" reference.
@claude
Copy link
Copy Markdown

claude Bot commented Feb 28, 2026

📦 Previous Review (superseded by newer review)

Claude finished @dpup's task in 4m 43s —— View job


PR Review: Simplify execution model

  • Gather context from previous reviews
  • Read key changed files
  • Check remaining issues from previous reviews
  • Fresh review pass for new commits
  • Post inline comments and summary

What was fixed since the last review

All three outstanding items from round 3 have been addressed:

  1. UnregisterRoutes data racedc snapshot (from m.mu.RLock()) is now used consistently for both UnregisterRun (line 2643) and UnregisterRoutes (line 2702). ✅
  2. Stale "detached mode" comment in monitorContainerExit — Updated to describe the current model. ✅
  3. setupPortBindings/setupFirewall ordering — Still inconsistent (see below).

Bug: Wait() returns wrong error for failed non-interactive containers

internal/run/manager.go, lines 2755–2767

In monitorContainerExit, close(r.exitCh) happens at line 2755, then the state is updated at lines 2758–2771:

// Now signal that container has exited (and logs are captured)
close(r.exitCh)                          // ← Wait() unblocks here

// Update run state ...
currentState := r.GetState()
if currentState == StateRunning || ... {
    r.SetStateFailedAt(errMsg, time.Now())  // ← r.Error written here
}

Wait() unblocks on <-r.exitCh (line 2472) and immediately reads r.Error under stateMu. Since state is set after exitCh is closed, there's a window where Wait() reads r.Error as empty and returns nil, even though the container exited with a non-zero code.

The comment at line 2473 claims "monitorContainerExit already captured logs and updated state" — but state is updated after the signal, not before.

Impact: For non-interactive runs (moat run echo "fail" && exit 1), manager.Wait() can return nil instead of an error. exec.go propagates this to the CLI's exit code, so the terminal reports success for a failed run.

Fix: Swap the ordering — set state before closing exitCh:

// Update run state BEFORE signaling so Wait() reads final state
currentState := r.GetState()
if currentState == StateRunning || currentState == StateStarting {
    if err != nil || exitCode != 0 {
        ...
        r.SetStateFailedAt(errMsg, time.Now())
    } else {
        r.SetStateWithTime(StateStopped, time.Now())
    }
}

// Now signal (state and logs are both captured)
close(r.exitCh)

Fix this →


Minor: Stale comment in captureLogs

internal/run/manager.go, line 2515

// This should be called whenever a container exits, regardless of how:
// - Normal exit (Wait)
// - Interactive exit (StartAttached)
// - Explicit stop (Stop)
// - Detached completion (background monitor)   ← stale

"Detached completion" was removed from the model. Fix this →


Minor: setupPortBindings/setupFirewall order still inconsistent

internal/run/manager.go

Start() (lines 2253–2257): setupFirewallsetupPortBindings
StartAttached() (lines 2358–2362): setupPortBindingssetupFirewall

Flagged in rounds 2 and 3; still not aligned. Not a current correctness issue, but easy to accidentally diverge further.


What's good

The fix for the daemonClient data race (using dc snapshot in cleanupResources) is complete and correct. The escape-stop and Apple TTY log fixes from previous rounds are solid. The ProviderRunner extraction is a clear win — provider CLIs went from ~180 lines each to ~50, with all the boilerplate deduplicated cleanly. The monitorContainerExit design (single goroutine owning WaitContainer, idempotent atomics) is the right model for this lifecycle.

Summary: The close(exitCh)-before-state-update bug (#1) is a correctness issue that can cause non-interactive runs to always appear successful regardless of exit code. It should be fixed before merge. The other two items are cosmetic.

1. Move state update (SetStateFailedAt/SetStateWithTime) before
   close(exitCh) in monitorContainerExit so Wait() reads the final
   state when it unblocks. Previously Wait() could race and return
   nil for a failed container.

2. Align StartAttached to use same firewall→portBindings ordering
   as Start, preventing future divergence.

3. Update stale "detached completion" comment in captureLogs.
@claude
Copy link
Copy Markdown

claude Bot commented Feb 28, 2026

📦 Previous Review (superseded by newer review)

Claude finished @dpup's task in 7m 40s —— View job


PR Review: Simplify execution model

  • Gather context from previous reviews
  • Read key changed files
  • Check remaining issues from previous reviews
  • Fresh review pass for new commits
  • Post summary

What was fixed since the last review

All four items from round 4 are addressed in the latest commit (6541614):

  1. close(exitCh) before state update — Fixed. State is now set before close(r.exitCh) in monitorContainerExit (lines 2756–2774), so Wait() reads final state when it unblocks. ✅
  2. setupPortBindings/setupFirewall ordering — Aligned. Both Start() and StartAttached() now run firewall → port bindings. ✅
  3. Stale "detached completion" comment in captureLogs — Updated. ✅
  4. UnregisterRoutes data race — Already fixed in prior commit; the snapshot dc is used consistently. ✅

Remaining: m.daemonClient read in setupPortBindings without holding lock

internal/run/manager.go, around line 2210

func (m *Manager) setupPortBindings(ctx context.Context, r *Run) {
    ...
    if m.daemonClient != nil {                                    // ← unprotected read
        if err := m.daemonClient.RegisterRoutes(...); err != nil { // ← unprotected read

m.daemonClient is written under m.mu.Lock() in Create() (lines 519–521). setupPortBindings is called from both Start() and StartAttached() after m.mu is released. If one goroutine is in Create() (writing m.daemonClient) while another is in Start()setupPortBindings (reading m.daemonClient), the race detector will flag it.

The same pattern was fixed in cleanupResources by snapshotting dc := m.daemonClient under m.mu.RLock() before the closure. The same fix applies here — snapshot before calling setupPortBindings (or snapshot inside setupPortBindings):

m.mu.RLock()
dc := m.daemonClient
m.mu.RUnlock()
// pass dc to setupPortBindings instead of accessing m.daemonClient inside

Fix this →


Minor: SIGTERM + Apple TTY log completeness (carried from round 3)

Still unaddressed, still minor. When SIGTERM arrives, the for/select loop in RunInteractiveAttached calls attachCancel(), then manager.Stop(), then return nil — without waiting for the goroutine running manager.StartAttached to finish. That goroutine writes the Apple TTY buffer to logs.jsonl after manager.StartAttached returns, which happens after the attach context is canceled. If the process exits before the goroutine finishes the write, logs could be truncated.

In practice the window is small and the write is a pure lw.Write(logBuffer.Bytes()) — no context dependency. But the fix (waiting on attachDone in the SIGTERM branch) would guarantee correctness.


What's good

The design is solid end-to-end:

  • monitorContainerExit as the single goroutine that owns WaitContainer and closes exitCh eliminates the previous double-wait race.
  • cleanupOnce/logsCaptured/providerHooksDone atomics correctly handle the cases where both monitorContainerExit and Stop() run concurrently on the same run.
  • State is now written before exitCh closes, so Wait() always reads final state.
  • The ProviderRunner extraction visibly reduces boilerplate across claude/codex/gemini (each lost ~130–150 lines of near-identical setup code).
  • ContainerLogsAll fix (waiting for container stop before log fetch) correctly handles fast-exiting containers that previously produced empty logs.

The lifecycle semantics are now clear and consistent: monitorContainerExit is the authoritative exit handler for all runs; Stop() handles explicit terminations idempotently; interactive and non-interactive paths share the same cleanup machinery.

Summary: One remaining data race (setupPortBindings reading m.daemonClient unprotected) should be fixed before merge. Everything else is minor or already noted in previous rounds.

Snapshot m.daemonClient under m.mu.RLock before calling
RegisterRoutes, same pattern as cleanupResources. Prevents
data race with Create() which writes daemonClient under lock.
@claude
Copy link
Copy Markdown

claude Bot commented Feb 28, 2026

📦 Previous Review (superseded by newer review)

Claude finished @dpup's task in 8m 18s —— View job


PR Review: Simplify execution model

  • Gather context from previous reviews
  • Read key changed files
  • Check remaining issues from previous reviews
  • Fresh review pass for new commits
  • Post summary

What was fixed since the last review

The single outstanding issue from round 5 is addressed in commit be1c50e:

setupPortBindings data race — Fixed. The function now takes a snapshot dc := m.daemonClient under m.mu.RLock() (lines 2210–2213) before using it, consistent with the cleanupResources fix from round 3. All m.daemonClient reads are now protected:

  • In Create() — same goroutine as the write (no race)
  • In registerPersistedRun() — initialization path, m.daemonClient is nil at that point (no race)
  • In cleanupResources() — snapshot under m.mu.RLock()
  • In setupPortBindings() — snapshot under m.mu.RLock()
  • In Close() — read inside m.mu.RLock() block ✅

Minor: Apple TTY log completeness on SIGTERM (carried from rounds 3–5)

Still unresolved and still minor. When SIGTERM arrives, RunInteractiveAttached calls attachCancel()manager.Stop()return nil without waiting for the goroutine running manager.StartAttached to finish (i.e., <-attachDone is not drained in the SIGTERM branch).

That goroutine writes the Apple TTY logBuffer to logs.jsonl after m.runtime.StartAttached returns. Since manager.Close() fires immediately after RunInteractiveAttached returns, and the process exits shortly after, there's a narrow window where the buffer write might not complete before process exit.

In practice the write is a local disk lw.Write(logBuffer.Bytes()) — very fast — and the process gets a brief cleanup window on exit. The risk is low. The correct fix is to wait on attachDone before returning in the SIGTERM branch. Not blocking.


No new issues found

The complete pass over manager.go, exec.go, docker.go, escape.go, run.go, and provider.go found no new bugs, races, or behavioral issues.

The lock ordering (m.mur.stateMu) is consistent — no deadlock risk. The captureLogs retry logic (CAS + file-exists check + retry) handles concurrent callers correctly. The monitorContainerExit / Stop() interaction is safe: both paths are idempotent via cleanupOnce, logsCaptured, and providerHooksDone atomics, and exitCh is closed after state is set.


Overall

The PR is ready to merge. The core redesign is solid:

  • monitorContainerExit as the single goroutine owning WaitContainer and exitCh eliminates the previous double-wait race
  • State is now set before exitCh closes, so Wait() always reads final state
  • cleanupOnce/logsCaptured/providerHooksDone atomics correctly handle concurrent Stop + monitorContainerExit paths
  • The ProviderRunner extraction visibly reduces boilerplate — each provider CLI dropped ~130–150 lines of near-identical setup code
  • ContainerLogsAll fix (ContainerWait before log fetch) correctly addresses the empty-logs-for-fast-containers bug

Six rounds of review and ten fixes later, the lifecycle semantics are clean and correct.

- Remove duplicate run ID: one "Started" line from OnRunCreated
  (provider path) or exec.go (moat run path), not both
- Remove stale "Stop with Ctrl+C" hint
- Add post-completion hint: 'moat logs <id>' to view output
- Remove post-ExecuteRun print in run.go (ExecuteRun blocks now)
@claude
Copy link
Copy Markdown

claude Bot commented Mar 3, 2026

📦 Previous Review (superseded by newer review)

Claude finished @dpup's task in 6m 26s —— View job


PR Review: Simplify execution model

  • Gather context and read key files
  • Review changes against previous review rounds
  • Check for new issues in latest commits
  • Post inline comments and summary

What was fixed since the last review

All items from round 6 are addressed in the latest commit (a599001):

  • Output ordering (RunProvider message after ExecuteRun returned): Fixed. The Started agent %q print was moved from after ExecuteRun in run.go to inside the OnRunCreated callback in provider.go, which fires immediately after run creation and before blocking. ✅
  • Non-interactive run feedback: exec.go now prints "Started %s (%s)\n" (guarded by opts.OnRunCreated == nil) and "Use 'moat logs %s' to view output\n" after the container exits. Cleaner than before. ✅

Bug: Normal interactive exit doesn't clean up resources

internal/run/manager.go, line 2421

When !callerWillStop (container exits naturally — user types exit, agent finishes), cleanupResources is never called from StartAttached. monitorContainerExit is not launched for interactive runs (Start() is the only launcher), and Stop() is not called in the natural-exit path. The result:

  • Docker container stays in "exited" state, not removed
  • Docker network is not removed
  • Service containers (if any) may keep running after the main container exits
  • BuildKit sidecar, temp directories, and routes are not cleaned up

Everything persists until moat destroy is called.

The fix is one line in the !callerWillStop block (after SaveMetadata):

m.cleanupResources(ctx, r)

cleanupResources is idempotent via cleanupOnce, so calling it here and later from Close() is safe.

Fix this →


Minor: moat run -i prints no run ID before the session starts

cmd/moat/cli/exec.go, line 244

For interactive moat run, the code takes the early return via RunInteractiveAttached before reaching the Started %s (%s) print at line 271. The user sees:

Initializing...
Escape: Ctrl-/ k (stop)

<TTY session>
Run abc123 completed

No run ID is shown before the session starts — the user can't grab it to run moat logs or moat stop from another terminal. Provider commands avoid this because they set OnRunCreated, but raw moat run -i -- bash has the gap. Consider printing the ID in RunInteractiveAttached's preamble.


Minor: SIGTERM + Apple TTY log completeness (carried from rounds 3–6)

Still unresolved, still low risk. When SIGTERM fires in RunInteractiveAttached, attachCancel() + manager.Stop() run and then the function returns without draining <-attachDone. The background goroutine running manager.StartAttached writes the Apple TTY buffer after m.runtime.StartAttached() returns due to context cancellation. There's a narrow window between manager.Close() and process exit where the buffer write to logs.jsonl might not complete. The fix is draining attachDone before returning in the SIGTERM branch.


What's good

The latest commit's output cleanup is correct. The conditional "Started" message (only when OnRunCreated is nil) avoids the double-print that existed before. The "Use 'moat logs %s' to view output" hint after non-interactive completion is a good UX addition.

The core lifecycle design across all previous rounds remains solid: monitorContainerExit as sole owner of WaitContainer/exitCh, state set before exitCh closes, all daemonClient reads snapshotted under m.mu.RLock(), and cleanupOnce/logsCaptured/providerHooksDone atomics correctly preventing double-work.

Summary: The resource-leak on normal interactive exit (no cleanupResources call) is the only remaining correctness issue. It's observable — after moat claude exits normally, docker ps -a and docker network ls show leftover containers and networks until moat destroy is run.

Comment thread internal/run/manager.go
Comment thread cmd/moat/cli/exec.go Outdated
Non-interactive runs now stream container output to the terminal in
real-time instead of blocking silently. This gives docker-run-like UX
where users see output as it happens.

Key changes:
- monitorContainerExit uses context.Background() so it survives Close()
- Manager.Close() waits for tracked monitors via sync.WaitGroup before
  closing the runtime, ensuring logs are always captured
- Only monitors started via Start() are tracked; inherited monitors from
  loadPersistedRuns are not, so Close() doesn't hang on old containers
- Docker ContainerLogs demuxes multiplexed format for non-TTY containers
- Non-interactive path streams logs via FollowLogs while blocking on Wait
- Added Ctrl+C hint on start and moat logs hint on completion (dimmed)
- Updated help text and docs to describe streaming behavior
@claude
Copy link
Copy Markdown

claude Bot commented Mar 3, 2026

📦 Previous Review (superseded by newer review)

Claude finished @dpup's task in 6m 52s —— View job


PR Review: Simplify execution model

  • Gather context from previous reviews
  • Read key changed files (internal/run/manager.go, internal/container/docker.go, cmd/moat/cli/exec.go)
  • Check remaining issues from previous reviews
  • Fresh review pass for new commits (813dfb1)
  • Post inline comments and summary

What was fixed since the last review

All items from round 7 are addressed in the latest commit (813dfb1):

  • Non-interactive run output — Streaming via FollowLogs in a background goroutine while Wait blocks on exitCh. The user sees live output as it happens instead of silence. ✅
  • monitorContainerExit context — Now uses context.Background() so it survives Close(). Close() waits on monitorWg before closing the runtime, ensuring logs are captured. ✅
  • Inherited monitors not tracked in monitorWg — Correctly excluded so Close() doesn't hang on long-running containers from previous CLI sessions. ✅
  • Docker multiplexed format for non-TTY containersContainerLogs now demuxes the 8-byte-header format for non-TTY containers, producing clean text. ✅

Bug: cleanupResources still not called on normal interactive exit (round 8)

internal/run/manager.go, line 2420

When an interactive container exits naturally (user types exit, agent finishes), StartAttached takes the !callerWillStop path which calls captureLogs, runProviderStoppedHooks, and SaveMetadata — but never calls m.cleanupResources(ctx, r).

cleanupResources is the only thing that removes the container, removes the Docker network, stops service containers, cleans up BuildKit sidecar, removes temp directories (ClaudeConfigTempDir, CodexConfigTempDir, GeminiConfigTempDir), and unregisters routes. None of this runs after a normal interactive exit because:

  • monitorContainerExit is only launched from Start() (non-interactive path)
  • Stop() is only called on escape-stop and SIGTERM
  • Close() calls stopProxyServer/UnregisterRun but not RemoveContainer or RemoveNetwork

After moat claude or moat run -i -- bash exits normally, docker ps -a and docker network ls still show the exited container and its network. They accumulate until moat destroy is called.

The fix is one line at the end of the !callerWillStop block:

if !callerWillStop {
    m.captureLogs(r)
    runProviderStoppedHooks(r)
    _ = r.SaveMetadata()
    m.cleanupResources(ctx, r)  // add this
}

cleanupResources is idempotent via cleanupOnce, so the Close() and Destroy() callers remain safe. Fix this →


New: ContainerLogs inspect failure falls through to multiplexed output

internal/container/docker.go, line 349

inspect, inspectErr := r.cli.ContainerInspect(ctx, containerID)
if inspectErr != nil || inspect.Config.Tty {
    // TTY mode or inspect failed: logs are plain text
    return raw, nil
}

If ContainerInspect fails (transient Docker error, container removed between the two API calls), the multiplexed non-TTY stream is returned as plain text. The caller (FollowLogs) then copies it to stdout including Docker's 8-byte frame headers, producing garbled output for every line. The error is silently swallowed with no logging.

This failure is unlikely in practice, but a more robust fix would be to either:

  • Log the inspect failure and return an error (caller gets a clean error message instead of garbled output)
  • Use a separate, non-cancellable context for the inspect call since the result doesn't depend on the stream context

Fix this →


Minor: SIGTERM path doesn't drain attachDone before returning (carried from rounds 3–7)

cmd/moat/cli/exec.go, line 433

When SIGTERM fires in RunInteractiveAttached, attachCancel() + manager.Stop() run and the function returns nil without draining <-attachDone. The goroutine running manager.StartAttached is still in flight — it writes the Apple TTY log buffer to logs.jsonl after m.runtime.StartAttached() returns due to context cancellation.

The buffer write uses r.Store.LogWriter() (local file I/O, not the container runtime), so it succeeds even after manager.Close() fires. Risk is very low, but correctness would be guaranteed by draining attachDone before returning in the SIGTERM branch.


Minor: moat run -i still shows no run ID before session starts (carried from round 7)

cmd/moat/cli/exec.go, line 244

For interactive moat run (not provider commands), the early return at line 244 via RunInteractiveAttached skips the "Started %s (%s)" print at line 269. Provider commands avoid this via OnRunCreated, but moat run -i -- bash produces no run ID for the user before the session begins — they can't grab it to run moat logs or moat stop from another terminal.


What's good in 813dfb1

The streaming UX for non-interactive runs is a solid improvement — docker run-like live output instead of silent blocking. The demux goroutine in ContainerLogs (pipe + stdcopy.StdCopy) correctly handles the Docker multiplexed format and cleans up on context cancellation. The monitorWg/context.Background() combination correctly solves the previous Close() race without hanging on inherited monitors. manager.Wait() calling cleanupResources (no-op for already-cleaned-up runs) is a clean safety net for the non-interactive path.

Summary: The cleanupResources omission on normal interactive exit is the one remaining correctness issue — observable with docker ps -a after any normal moat claude or moat run -i exit. Everything else is minor. After nine rounds of review, the lifecycle design is solid.

Comment thread internal/run/manager.go
Comment thread internal/container/docker.go
Address code review feedback:

- cleanupResources now called on normal interactive exit (natural
  container exit path). Without this, Docker containers, networks,
  sidecars, and temp dirs leaked until moat destroy.
- Docker ContainerLogs returns error on inspect failure instead of
  silently producing garbled multiplexed output.
- Run ID printed before both interactive and non-interactive sessions
  so moat run -i shows the run ID before the session starts.
@claude
Copy link
Copy Markdown

claude Bot commented Mar 3, 2026

📦 Previous Review (superseded by newer review)

Claude finished @dpup's task in 3m 16s —— View job


PR Review: Simplify execution model

  • Gather context and read key files
  • Check round 8 issues: cleanupResources on normal exit, inspect failure handling
  • Fresh review pass on latest commit (838cc7d)
  • Post inline comments and summary

What was fixed in 838cc7d

All three items from the last review round are addressed:

  1. cleanupResources not called on normal interactive exit — Fixed. Added at line 2425–2429 in the !callerWillStop block. monitorContainerExit is not launched for interactive runs, so this was the only path that ever cleaned up containers, networks, service containers, and temp directories after a natural exit. ✅

  2. Docker inspect failure producing garbled output — Fixed. ContainerLogs now closes raw and returns a proper error when ContainerInspect fails, instead of silently returning the unmuxed multiplexed stream as plain text. ✅

  3. moat run -i showed no run ID before the session — Fixed. The "Started %s (%s)\n" print now fires at line 240 (before the interactive branch), so the user always sees the run ID before the session starts. ✅


Minor: "Started ..." fires before manager.Start() in the non-interactive path

cmd/moat/cli/exec.go, line 240–253

The "Started %s (%s)\n" message now fires before manager.Start() is called. If Start() fails, the user sees:

Started myrun (abc123)
Error: starting run: ...

...which is slightly confusing — the run printed "Started" but didn't actually start. The run was only created at that point. This is a cosmetic issue that was introduced by moving the message earlier to fix the interactive-path gap.


Minor: defer cleanupCancel() inside if !callerWillStop — harmless but unusual

internal/run/manager.go, lines 2426–2428

if !callerWillStop {
    cleanupCtx, cleanupCancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cleanupCancel()  // ← defer inside if block
    m.cleanupResources(cleanupCtx, r)
}

cleanupResources is synchronous — it completes before defer fires. The cancel is a no-op at that point. More idiomatic would be defer cleanupCancel() immediately after context creation (or just cleanupCancel() after the call). No bug, just slightly unusual.


Apple TTY log completeness on SIGTERM (carried from rounds 3–8)

cmd/moat/cli/exec.go, line 432

When SIGTERM fires, RunInteractiveAttached calls attachCancel() + manager.Stop() + return nil without draining <-attachDone. The goroutine running manager.StartAttached is still in flight — it writes the Apple TTY log buffer after m.runtime.StartAttached() returns. defer manager.Close() fires immediately after RunInteractiveAttached returns, but Close() only waits on monitorContainerExit goroutines (not the StartAttached goroutine), so there's a narrow window where the Apple buffer write might not complete before process exit.

Risk is low (the write is fast local disk I/O with no context dependency), but the correct fix is draining attachDone before returning in the SIGTERM branch.


Overall

After ten rounds of review and eleven fixes, the PR is ready to merge. The design is correct end-to-end:

  • monitorContainerExit as sole owner of WaitContainer/exitCh eliminates the double-wait race
  • State set before exitCh closes, so Wait() always reads final state
  • All m.daemonClient reads snapshotted under m.mu.RLock() — no data races
  • cleanupOnce/logsCaptured/providerHooksDone atomics correctly prevent double-work across Stop(), StartAttached, and monitorContainerExit paths
  • Normal interactive exit now properly tears down containers, networks, and temp directories
  • Docker multiplexed log format correctly demuxed for non-TTY containers
  • Fast-exiting container log capture fixed via ContainerWait before log fetch

The remaining items are all minor/cosmetic. The core lifecycle is sound.

Apple's "container wait" CLI command hangs indefinitely when a
container is stopped or removed externally (e.g., via "moat stop"
from another process). This blocked monitorContainerExit forever,
causing the original "moat run" process to hang.

Switch to always using the polling-based wait, which checks container
status via inspect every 500ms. If inspect fails (container removed),
treat it as a clean exit instead of returning an error.
@claude
Copy link
Copy Markdown

claude Bot commented Mar 3, 2026

📦 Previous Review (superseded by newer review)

Claude finished @dpup's task in 2m 20s —— View job


PR Review: Simplify execution model

  • Gather context from previous reviews
  • Read changed files in latest commit (519eaf4)
  • Check remaining issues from previous reviews
  • Fresh review pass
  • Post inline comments and summary

What was fixed in 519eaf4

The new commit addresses the Apple container hang-on-external-stop:

Apple container wait hung indefinitely when a container was stopped or removed externally (via moat stop from another process). This blocked monitorContainerExit forever, causing the original moat run process to hang. The fix always uses polling (waitByPolling), which checks container status via inspect every 500ms and treats an inspect failure as a clean exit (container was removed externally). This is the right fix for the described behavior.


One minor issue with the polling approach

internal/container/apple.go, line 378–381

if err := cmd.Run(); err != nil {
    // Container no longer exists (removed externally) — treat as exited
    return 0, nil
}

The comment says "container no longer exists", but this fires for any inspect failure — including transient XPC errors, permission issues, or Apple daemon crashes. A transient XPC error (which Apple containers are prone to, as evidenced by the RemoveContainer handling at line 423) would cause the polling loop to return 0, nil immediately, marking the run as StateStopped when it may still be running.

A more precise fix would be to check the error message for "not found" variants before treating it as a clean exit, similar to what RemoveContainer already does:

if err := cmd.Run(); err != nil {
    stderrStr := stderr.String()
    if strings.Contains(stderrStr, "notFound") || strings.Contains(stderrStr, "not found") {
        return 0, nil // Container no longer exists
    }
    // For other errors (XPC timeout, etc.), keep polling
    log.Debug("transient inspect error, retrying", "error", err, "stderr", stderrStr)
    // fall through to sleep-and-retry
}

Note: the current waitByPolling code doesn't capture stderr, so this fix would also require adding a stderr buffer alongside the existing stdout buffer.

Fix this →


Minor items carried from previous reviews (not blocking)

defer cleanupCancel() inside if !callerWillStop (internal/run/manager.go, lines 2426–2428): cleanupResources is synchronous so defer fires after the function returns, not after cleanupResources completes. It's a no-op in practice but unusual. cleanupCancel() after the call (no defer) would be clearer.

"Started %s (%s)\n" fires before manager.Start() (cmd/moat/cli/exec.go, line 240): If Start() fails, the user sees "Started myrun (abc123)" then an error. Cosmetic — the run was created, just not started.

SIGTERM path doesn't drain attachDone (cmd/moat/cli/exec.go, line 432): When SIGTERM fires, RunInteractiveAttached returns without draining <-attachDone. The goroutine running manager.StartAttached writes the Apple TTY log buffer to logs.jsonl after m.runtime.StartAttached() returns. defer manager.Close() fires before this completes. The write is fast local disk I/O with no context dependency, so the risk is very low, but draining attachDone before returning in the SIGTERM branch would guarantee correctness.


Overall

After ten rounds of review, the PR is ready to merge. The polling fix in 519eaf4 addresses a real user-visible hang (process stuck after moat stop on Apple containers). The one actionable issue above — not distinguishing XPC errors from "container not found" during polling — could cause a running container to be marked as stopped if Apple's daemon has a transient hiccup. Whether to address it before or after merge is a judgment call given how corner-case it is.

The core lifecycle design is solid: monitorContainerExit as sole owner of WaitContainer/exitCh, state set before exitCh closes, all daemonClient reads snapshotted under m.mu.RLock(), and cleanupOnce/logsCaptured/providerHooksDone atomics correctly preventing double-work across all exit paths.

Implementation plan for the fix/interactive-only branch. Updated to
reflect the final design: non-interactive runs block with log streaming
instead of fire-and-forget, plus additional fixes for monitor lifecycle,
Docker log demuxing, and interactive resource cleanup.
@claude
Copy link
Copy Markdown

claude Bot commented Mar 3, 2026

📦 Previous Review (superseded by newer review)

Claude finished @dpup's task in 2m 29s —— View job


PR Review: Simplify execution model

  • Gather context from previous reviews
  • Read key changed files and latest commits
  • Check remaining issues from previous reviews
  • Fresh review pass on latest changes
  • Post inline comments and summary

What changed in the latest commit (bd0a9cc)

Only docs: added docs/plans/2026-02-27-interactive-only-simplification.md. No code changes since 519eaf4.


Status of previously flagged issues

All correctness bugs from prior rounds have been fixed. The implementation is complete.


Remaining: waitByPolling treats any inspect failure as "container removed"

internal/container/apple.go, line 378–381

if err := cmd.Run(); err != nil {
    // Container no longer exists (removed externally) — treat as exited
    return 0, nil
}

This fires for any inspect error — including transient XPC errors, which Apple containers are prone to (see RemoveContainer's XPC handling at line 424). If container inspect fails due to an XPC timeout, the polling loop immediately returns 0, nil, and monitorContainerExit marks the run as StateStopped — even though the container is still running.

The fix mirrors what RemoveContainer already does:

var stderr bytes.Buffer
cmd.Stderr = &stderr

if err := cmd.Run(); err != nil {
    stderrStr := stderr.String()
    if strings.Contains(stderrStr, "notFound") || strings.Contains(stderrStr, "not found") {
        return 0, nil // Container truly gone
    }
    // Transient error (XPC timeout, etc.) — keep polling
    log.Debug("transient inspect error during wait, retrying", "error", err, "stderr", stderrStr)
    // fall through to sleep-and-retry
}

Fix this →


Minor: defer cleanupCancel() placement inside if !callerWillStop

internal/run/manager.go, lines 2426–2428

if !callerWillStop {
    cleanupCtx, cleanupCancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cleanupCancel()  // ← defers to function return, not to end of if block
    m.cleanupResources(cleanupCtx, r)
}

cleanupResources is synchronous — it returns before defer fires (at function return). The context stays alive until StartAttached returns rather than being released immediately after cleanup. In practice the difference is negligible since return attachErr follows immediately. More idiomatic: call cleanupCancel() directly after cleanupResources, or pattern it like monitorContainerExit does at lines 2784–2786 (defer cleanupCancel() immediately after context.WithTimeout). No correctness issue.


Minor: SIGTERM path doesn't drain attachDone (carried from rounds 3–10)

cmd/moat/cli/exec.go, line 432

When SIGTERM fires, RunInteractiveAttached calls attachCancel() + manager.Stop() + return nil without draining <-attachDone. The goroutine running manager.StartAttached is still in flight — specifically the Apple TTY buffer write to logs.jsonl happens after m.runtime.StartAttached() returns. defer manager.Close() fires immediately after RunInteractiveAttached returns, before the goroutine finishes.

Risk is very low (the write is fast local disk I/O with no context dependency), but the correct fix is <-attachDone before returning in the SIGTERM branch. Not blocking.


Minor: "Started %s (%s)\n" fires before manager.Start() (carried from round 9)

cmd/moat/cli/exec.go, line 240

For non-interactive runs, Started %s (%s) prints before manager.Start() at line 251. If Start() fails, the user sees "Started myrun (abc123)" then an error — the run was only created, not started. Cosmetic.


Overall

After eleven rounds and twelve fixes, the PR is ready to merge. The XPC error handling in waitByPolling is the only remaining correctness-adjacent issue (it can cause a still-running container to be incorrectly marked stopped on a transient XPC error), and whether to fix it before or after merge is a judgment call given its corner-case nature.

The core redesign is solid:

  • monitorContainerExit as sole owner of WaitContainer/exitCh eliminates the double-wait race
  • State set before exitCh closes, so Wait() always reads final state
  • All m.daemonClient reads snapshotted under m.mu.RLock() — no data races
  • cleanupOnce/logsCaptured/providerHooksDone atomics correctly handle concurrent Stop + monitorContainerExit paths
  • Normal interactive exit now properly tears down containers, networks, and temp directories
  • Docker multiplexed log format correctly demuxed for non-TTY containers
  • Fast-exiting container log capture fixed via ContainerWait before log fetch
  • ProviderRunner extraction reduced each provider CLI by ~130–150 lines of boilerplate

… errors

Address code review feedback on the polling-based wait:

- Distinguish "not found" errors (container removed) from transient
  XPC errors by checking stderr, matching RemoveContainer's pattern
- Handle Apple's inspect returning an empty JSON array for removed
  containers (exits 0 with "[]" rather than erroring)
- Log and retry on transient errors instead of treating all failures
  as container removal
@claude
Copy link
Copy Markdown

claude Bot commented Mar 3, 2026

Claude finished @dpup's task in 2m 4s —— View job


PR Review: Simplify execution model

  • Gather context and read key files
  • Review latest commits and changes
  • Check remaining issues from previous reviews
  • Post inline comments and summary

What was fixed in e94f254 (latest commit)

The sole remaining correctness issue from round 11 is addressed:

waitByPolling now distinguishes transient errors from "container removed" — Added a stderr buffer to the inspect command. Only returns 0, nil (treating as clean exit) when stderr contains "notFound" or "not found". For other errors (XPC timeouts, etc.), logs at debug level and continues polling. Also handles Apple's [] empty-array response when a container is removed — correctly detected via len(info) == 0 check. The pattern matches RemoveContainer's existing XPC error handling.


Minor items carried from previous rounds (not blocking)

defer cleanupCancel() inside if !callerWillStop (internal/run/manager.go, lines 2426–2428):

if !callerWillStop {
    cleanupCtx, cleanupCancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cleanupCancel()  // defers to function return, not end of if block
    m.cleanupResources(cleanupCtx, r)
}

cleanupResources is synchronous — the context stays alive until StartAttached returns. In practice return attachErr follows immediately so it doesn't matter, but calling cleanupCancel() directly (no defer) would be clearer and matches the pattern in monitorContainerExit at lines 2784–2786. Not a correctness issue.

SIGTERM path in RunInteractiveAttached doesn't drain attachDone (cmd/moat/cli/exec.go, line 432):

if sig == syscall.SIGTERM {
    attachCancel()
    manager.Stop(...)
    return nil  // ← goroutine running manager.StartAttached still in flight
}

The background goroutine writes the Apple TTY log buffer after m.runtime.StartAttached() returns (due to context cancellation). defer manager.Close() fires immediately after RunInteractiveAttached returns. The write is fast local disk I/O with no context dependency, so the window is narrow and risk is low. Draining <-attachDone before return nil in this branch would guarantee correctness. Carried from rounds 3–11, not blocking.


Confirmed fixed from all previous rounds

All twelve bugs identified across eleven review rounds are resolved:

  • Escape-stop state: failedcallerWillStop gate
  • Premature captureLogs on escape-stop → gated on !callerWillStop
  • Apple TTY logs lost on escape-stop → unconditional buffer write
  • m.daemonClient data races → m.mu.RLock() snapshot in cleanupResources and setupPortBindings
  • close(exitCh) before state update → state now set first in monitorContainerExit
  • UnregisterRoutes direct m.daemonClient access → uses dc snapshot
  • Docker inspect failure producing garbled output → returns error instead
  • cleanupResources not called on normal interactive exit → added to !callerWillStop block
  • RunProvider messages printed after run completion → moved to OnRunCreated callback
  • Non-interactive run output (silent until exit) → streaming via FollowLogs
  • monitorContainerExit context canceled by Close() → uses context.Background()
  • waitByPolling treating all errors as "container removed" → distinguishes "not found" from transient errors

Overall

The PR is ready to merge. The core redesign is sound: monitorContainerExit as sole owner of WaitContainer/exitCh, state set before exitCh closes, all m.daemonClient reads snapshotted under lock, and cleanupOnce/logsCaptured/providerHooksDone atomics correctly preventing double-work across all exit paths. The ProviderRunner extraction removes ~130–150 lines of boilerplate from each provider CLI. The Apple polling fix eliminates the indefinite hang when a container is stopped externally.

@dpup dpup merged commit 25effac into main Mar 3, 2026
4 checks passed
@dpup dpup deleted the fix/interactive-only branch March 3, 2026 04:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant