Skip to content

fix(logs): Ensure logs are retained after every run#72

Merged
dpup merged 7 commits into
mainfrom
mcp
Jan 26, 2026
Merged

fix(logs): Ensure logs are retained after every run#72
dpup merged 7 commits into
mainfrom
mcp

Conversation

@dpup
Copy link
Copy Markdown
Collaborator

@dpup dpup commented Jan 26, 2026

No description provided.

Fixes critical bug where logs.jsonl was not created in several scenarios:
- Interactive mode (moat run -i)
- Detached mode (moat run -d)
- Explicit stops (moat stop)

This violated the audit requirement that logs should always be captured.

Implemented systematic log capture with defense-in-depth:

1. **Extracted captureLogs() method**
   - Single responsibility: capture container logs to logs.jsonl
   - Idempotent with atomic.Bool flag (thread-safe)
   - Creates file even if empty (audit requirement)

2. **Call from ALL exit paths**
   - Wait() - explicit wait for completion
   - StartAttached() - after interactive session ends
   - Stop() - before stopping container
   - monitorContainerExit() - background monitor for all runs

3. **Fixed Docker log format**
   - ContainerLogsAll() now demuxes Docker's multiplexed format
   - Strips 8-byte headers using stdcopy.StdCopy()
   - Handles both TTY and non-TTY containers

4. **Prevented race conditions**
   - Added logsCaptured atomic.Bool to Run struct
   - CompareAndSwap ensures exactly-once log capture
   - No file handle conflicts

- logs.jsonl now ALWAYS exists after container exits
- Audit trail complete for all run types
- Thread-safe with no race conditions
- E2E tests verify all scenarios
@claude
Copy link
Copy Markdown

claude Bot commented Jan 26, 2026

📦 Previous Review (superseded by newer review)

Pull Request Review: fix(logs): Ensure logs are retained after every run

Summary

This PR addresses a critical bug where logs were not being captured in detached and interactive modes. The fix introduces a background monitoring goroutine and makes log capture idempotent across all container exit paths. The implementation is solid with comprehensive test coverage.

Code Quality ✅

Strengths:

  • Well-structured solution with clear separation of concerns
  • Good use of atomic operations for idempotency (logsCaptured flag with CompareAndSwap)
  • Comprehensive test coverage across all execution modes (attached, detached, interactive, stopped, empty)
  • Excellent documentation in comments explaining the "why" behind design decisions
  • Proper use of background contexts where appropriate

Suggestions:

  1. Docker log demultiplexing (internal/container/docker.go:199-241) - The fallback error handling is defensive but creates a potential issue:

    // If we can't inspect, fall back to demuxing (safer default)
    if _, err := stdcopy.StdCopy(&stdout, &stderr, reader); err != nil {
        return nil, fmt.Errorf("demuxing logs: %w", err)
    }

    If ContainerInspect fails, you attempt demuxing, but if the container was created with TTY, this will corrupt the logs. Consider returning the inspect error instead, or adding a comment explaining this tradeoff.

  2. Log ordering in non-TTY mode (internal/container/docker.go:229) - Comment mentions "interleaved order is lost" when combining stdout and stderr:

    // Combine stdout and stderr (interleaved order is lost, but content is preserved)
    combined := append(stdout.Bytes(), stderr.Bytes()...)

    This is acceptable for audit purposes, but could make debugging harder. Consider documenting this limitation in the observability docs.

Potential Bugs 🔍

  1. Race condition in monitorContainerExit (internal/run/manager.go:1609-1856) - There's a subtle race:

    • Start() launches monitorContainerExit in the background (line 1552)
    • Wait() also calls captureLogs() (line 1764)
    • Both could run concurrently for fast-exiting containers

    The idempotency check with CompareAndSwap prevents duplicate writes, but monitorContainerExit continues to update state and stop servers even after Wait() completes. This is probably fine due to the idempotency guards, but worth noting.

  2. Error handling in captureLogs (internal/run/manager.go:1863-1906) - When ContainerLogsAll fails, you log a debug message but continue with empty logs:

    if logErr != nil {
        log.Debug("failed to fetch container logs", "runID", r.ID, "error", logErr)
        // Still create empty logs.jsonl for audit completeness
        allLogs = []byte{}
    }

    While this ensures logs.jsonl always exists (good for audit), it silently swallows real errors. Consider logging at WARN level or adding an error marker to logs.jsonl when fetch fails.

Performance Considerations ⚡

  1. Background goroutine per run - Each Start() call spawns a monitorContainerExit goroutine that lives until container exit. For long-running containers, this is minimal overhead. For short-lived containers in high volume, consider a goroutine pool pattern. However, this is likely premature optimization for current usage.

  2. Multiple ContainerLogsAll calls - In the case where both Wait() and monitorContainerExit() race, you might call ContainerLogsAll twice before the idempotency check stops the second one. The CompareAndSwap prevents double-writes, but not double-fetches. Again, probably acceptable for current use cases.

Security Concerns 🔒

No security issues identified. The changes maintain existing permission models:

  • Log files are created with 0600 permissions (owner-only read/write)
  • No new credential handling or privilege escalation
  • Atomic operations prevent race conditions in concurrent log capture

Test Coverage ✅

Excellent test coverage - The E2E tests comprehensively cover:

  • ✅ Attached mode (TestLogsCapturedInAttachedMode)
  • ✅ Detached mode (TestLogsCapturedInDetachedMode) - the original bug
  • ✅ Interactive mode (TestLogsCapturedInInteractiveMode) - the original bug
  • ✅ Explicit stop (TestLogsCapturedAfterStop)
  • ✅ Empty logs (TestLogsAlwaysExistForAudit) - good for audit completeness

Suggestions:

  1. Consider adding a test for the race condition scenario (container exits between Start() and Wait())
  2. Add a test verifying TTY vs non-TTY log demuxing produces correct output

Documentation 📚

The code comments are excellent, but consider updating:

  • docs/content/concepts/ - Document that logs are always captured in logs.jsonl for all run modes
  • docs/content/guides/ - If there's a troubleshooting guide, add a note about TTY mode affecting log format
  • CLAUDE.md - The observability section could mention that stdout/stderr are combined in non-TTY mode

Minor Issues

  1. registry.yaml - Adding curl dependency is fine, but the PR description doesn't mention why this was needed. Is it used in the new tests?

  2. Exported Runtime() method - The comment says "useful for tests" but tests should generally use public interfaces. Is this exposing internal details unnecessarily?

Verdict: ✅ Approve with minor suggestions

This is a solid fix for a critical observability bug. The implementation is well-thought-out with proper idempotency guards and comprehensive test coverage. The suggestions above are minor improvements, not blockers.

Recommended next steps:

  1. Consider the error handling improvement in captureLogs (WARN level logging)
  2. Update documentation to reflect the guaranteed logs.jsonl creation
  3. Document the TTY vs non-TTY log ordering tradeoff

Great work on the comprehensive test coverage and the defensive programming with atomic operations!

- Add curl to dependency registry (used in tests)
- Fix Docker log inspection: return error instead of fallback that could corrupt TTY logs
- Improve captureLogs error visibility: use WARN level instead of DEBUG
@claude
Copy link
Copy Markdown

claude Bot commented Jan 26, 2026

📦 Previous Review (superseded by newer review)

Code Review: fix(logs): Ensure logs are retained after every run

This PR addresses a critical audit requirement by ensuring logs.jsonl is always created and populated across all execution paths. The implementation is well-designed with defense-in-depth principles.

✅ Strengths

1. Comprehensive Solution with Defense-in-Depth
The systematic approach of calling captureLogs() from all exit paths is excellent:

  • Wait() - normal completion
  • StartAttached() - interactive sessions
  • Stop() - explicit stops
  • monitorContainerExit() - detached/background runs

This ensures no execution path is missed.

2. Thread-Safe Idempotency
The use of atomic.Bool with CompareAndSwap (run.go:45, manager.go:1875) properly handles race conditions:

  • Multiple goroutines can safely call captureLogs()
  • Exactly-once guarantee prevents file corruption
  • No mutex contention since atomic operations are used

3. Docker Log Demuxing
The TTY detection logic (docker.go:215-233) correctly handles Docker's multiplexed log format:

  • Inspects container config to determine format
  • Uses stdcopy.StdCopy for non-TTY containers
  • Direct read for TTY mode (no demuxing needed)

4. Excellent Test Coverage
The E2E tests comprehensively cover all scenarios:

  • Attached mode (standard flow)
  • Detached mode (critical bug case)
  • Interactive mode (user-reported bug)
  • Explicit stop
  • Empty logs (audit requirement)

Each test validates both file existence and content correctness.

5. Graceful Error Handling
Log capture failures create empty logs.jsonl (manager.go:1884-1886), maintaining audit completeness even when container logs are unavailable.

🔍 Issues & Concerns

1. Race Condition: monitorContainerExit vs Wait (CRITICAL)
In Start(), both monitorContainerExit() and potential Wait() calls will invoke WaitContainer() on the same container (manager.go:1552, 1750):

// In Start() - spawns background goroutine
go m.monitorContainerExit(r)  // Calls WaitContainer()

// User may call Wait() later
func Wait() {
    exitCode, err := m.runtime.WaitContainer(ctx, containerID)  // Second call
}

Problem: Docker/container runtimes typically allow only one waiter per container. The second WaitContainer() call may:

  • Return immediately with stale data
  • Block indefinitely
  • Return an error

While captureLogs() is idempotent, the state management logic differs between the two paths:

  • Wait() sets StateStopped, removes container, cleans up credentials (manager.go:1787-1811)
  • monitorContainerExit() sets StateStopped/StateFailed, but doesn't remove container (manager.go:1920-1933)

Recommended fix: Use a channel-based approach where only one goroutine waits:

// In Run struct
exitCh chan struct{}  // Closed when container exits

// In monitorContainerExit - the ONLY waiter
exitCode, err := m.runtime.WaitContainer(...)
close(r.exitCh)
m.captureLogs(r)
// ... state updates

// In Wait - subscribe to exit
<-r.exitCh
m.captureLogs(r)  // Idempotent, safe to call again
// ... cleanup logic

2. Potential Log Ordering Issues with Demuxing
In docker.go:232, stdout and stderr are concatenated:

combined := append(stdout.Bytes(), stderr.Bytes()...)

This loses the interleaved order of stdout/stderr, which could make debugging difficult. The comment acknowledges this but doesn't explain the tradeoff.

Recommendation: Document why preserving content is more important than order for audit purposes, or consider alternatives:

  • Use stdcopy.StdCopy with the same writer for both streams to preserve order
  • Add timestamps to logs so order can be reconstructed

3. Error Handling: Stop() Captures Logs Before Verifying Container State
In Stop() (manager.go:1672), logs are captured before stopping:

m.captureLogs(r)
if err := m.runtime.StopContainer(ctx, r.ContainerID); err != nil {
    // Log but don't fail
}

If the container is still running and producing logs, captureLogs() captures incomplete logs. Subsequent calls won't capture new logs due to idempotency.

Recommendation: Capture logs AFTER StopContainer() returns, or verify the timing is correct given container shutdown behavior.

4. Missing Import in docker.go
The code uses bytes.Buffer and stdcopy (docker.go:226-228) but I don't see these imports in the diff. Verify these are present:

import (
    "bytes"
    "github.com/docker/docker/pkg/stdcopy"
)

5. Context Leak in monitorContainerExit
monitorContainerExit uses context.Background() (manager.go:1914) which never cancels. If the container hangs or the manager is closing, this goroutine leaks.

Recommendation:

func (m *Manager) monitorContainerExit(ctx context.Context, r *Run) {
    exitCode, err := m.runtime.WaitContainer(ctx, r.ContainerID)
    // ...
}

Pass a context that cancels when the manager closes or a reasonable timeout.

6. Commit Message Violates Project Guidelines
The commit includes Co-Authored-By: Claude Sonnet 4.5 which violates CLAUDE.md:

Do not include Co-Authored-By lines for Claude in commit messages

🎯 Recommendations

Priority 1 (Must Fix):

  1. Fix the WaitContainer() race condition between monitorContainerExit and Wait
  2. Verify imports are present in docker.go

Priority 2 (Should Fix):
3. Add context cancellation to monitorContainerExit
4. Review timing of captureLogs in Stop() - ensure logs are complete
5. Remove Co-Authored-By from commit message

Priority 3 (Nice to Have):
6. Document log ordering tradeoff in docker.go
7. Consider preserving stdout/stderr interleaving if possible

📊 Overall Assessment

This is high-quality code that solves a critical problem systematically. The defense-in-depth approach and comprehensive testing are excellent. However, the race condition around WaitContainer() is a significant issue that should be addressed before merge.

The code demonstrates:

  • ✅ Strong understanding of concurrency patterns
  • ✅ Excellent test coverage
  • ✅ Clear documentation
  • ⚠️ One critical race condition that needs fixing
  • ⚠️ Minor resource management issues

Recommendation: Request changes to fix the race condition before merging.

dpup added 2 commits January 26, 2026 18:07
CRITICAL: Fixed race condition where both monitorContainerExit() and Wait()
called WaitContainer() on the same container. Container runtimes typically
allow only one waiter per container, causing the second call to either:
- Return stale data immediately
- Block indefinitely
- Return an error

Solution: Channel-based synchronization where only monitorContainerExit()
waits on the container:

1. **Added exitCh to Run struct**
   - Closed when container exits (signaled by monitorContainerExit)
   - Initialized in both Create() and loadPersistedRuns()
   - Pre-closed for already-stopped containers during recovery

2. **monitorContainerExit is sole waiter**
   - Only goroutine that calls WaitContainer()
   - Closes exitCh after wait completes
   - Takes context parameter for proper cancellation

3. **Wait() subscribes to exitCh**
   - No longer calls WaitContainer() (eliminates race)
   - Waits for exitCh to close
   - captureLogs() still called (idempotent, defense-in-depth)
   - Reads r.Error to determine final exit status

4. **Recovery path handles stopped containers**
   - exitCh immediately closed if container already stopped
   - Wait() won't hang on recovered runs

Thread-safe, no race conditions, proper context propagation.
…eoffs

Address code review feedback:

1. **Remove captureLogs from Stop()**
   - captureLogs in Stop() captured incomplete logs (before container exits)
   - monitorContainerExit already captures complete logs when container exits
   - Redundant call could capture mid-shutdown state
   - Added comment explaining why we don't capture here

2. **Document log ordering tradeoff in docker.go**
   - StdCopy demuxing loses interleaved ordering between stdout/stderr
   - Ordering within each stream is preserved
   - Acceptable tradeoff for audit logs (content > perfect ordering)
   - TTY mode preserves natural ordering (no demuxing needed)

These changes complete the defense-in-depth approach:
- monitorContainerExit: captures logs when container actually exits
- Wait(): idempotent capture for defense
- StartAttached(): captures logs after attach completes
- Stop(): no longer captures (relies on monitorContainerExit)
@claude
Copy link
Copy Markdown

claude Bot commented Jan 26, 2026

📦 Previous Review (superseded by newer review)

Pull Request Review: fix(logs): Ensure logs are retained after every run

Summary

This PR fixes a critical bug where container logs were not being captured to logs.jsonl in detached and interactive modes. The fix introduces a background monitor goroutine that captures logs whenever a container exits, regardless of execution mode. The implementation is generally solid with good test coverage, but there are a few areas that need attention.

Code Quality & Best Practices

Strengths:

  • Well-structured solution with clear separation of concerns (captureLogs, monitorContainerExit)
  • Excellent inline documentation explaining the rationale for design decisions
  • Idempotent log capture using atomic.Bool prevents duplicate writes
  • Comprehensive test coverage across all execution modes (attached, detached, interactive, stop)

Issues:

  1. Potential race condition in Wait() (manager.go:1794-1799)

    The code updates r.State, r.StoppedAt, and r.Error after receiving from r.exitCh, but monitorContainerExit has already set these fields (lines 1932-1945). This creates a race where:

    • monitorContainerExit sets state to StateStopped/StateFailed
    • Wait() immediately overwrites it to StateStopped (line 1795)
    • If the run failed, Wait() might overwrite the failure state

    Recommendation: Remove the state updates in Wait() since monitorContainerExit already handles this. Only keep cleanup logic.

  2. Inconsistent state management in Wait() (manager.go:1795-1799)

    After monitorContainerExit closes exitCh, it has already:

    • Set the final state (StateStopped or StateFailed)
    • Set the error message
    • Set StoppedAt timestamp

    But Wait() unconditionally overwrites State to StateStopped and only sets Error if a local err exists (which comes from r.Error anyway). This is redundant and could lose information.

Potential Bugs

  1. Docker log demuxing loses interleaved ordering (docker.go:226-236)

    The comment acknowledges this: "Interleaved ordering between stdout/stderr is lost during demuxing." While the comment says this is "acceptable for audit/observability," it could make debugging harder when you need to understand the exact sequence of events across streams.

    Impact: Medium - logs will have all content but timing relationships between stdout/stderr may be unclear

    Consideration: Document this limitation in user-facing docs if not already done.

  2. loadPersistedRuns may not properly initialize exitCh for running containers (manager.go:175-182)

    When loading persisted runs, the code initializes exitCh and closes it for stopped/failed runs. However, for running containers, it creates a new exitCh but never starts monitorContainerExit to watch these containers.

    Impact: High - if moat restarts while containers are running, those containers won't have log capture when they eventually exit

    Recommendation: For runs in StateRunning, start monitorContainerExit in loadPersistedRuns or during manager initialization.

  3. Stop() comment claims background monitor will capture logs (manager.go:1679-1680)

    The comment says "monitorContainerExit (running in background) will capture logs when the container actually exits." This is correct, but there's no guarantee when the container will actually stop after StopContainer is called, and if moat crashes before the container stops, logs might not be captured.

    Impact: Low - edge case, but worth considering

    Consideration: Consider capturing logs immediately in Stop() as well (captureLogs is idempotent).

Performance Considerations

  1. Multiple ContainerInspect calls (docker.go:215)

    ContainerLogsAll calls ContainerInspect every time to check TTY mode. For containers created by moat, we already know whether TTY is enabled from r.Interactive and run.Options.TTY.

    Recommendation: Consider passing TTY mode as a parameter to avoid the extra API call, or cache the inspect result.

  2. Background goroutine per run (manager.go:1558)

    Each run spawns a monitorContainerExit goroutine that blocks on WaitContainer. For systems with many concurrent runs, this could create many blocking goroutines.

    Impact: Low - Go handles many goroutines well, and this is simpler than alternatives

    Note: Current approach is acceptable, but worth monitoring if running hundreds of concurrent containers.

Security Concerns

No security issues identified. The changes:

  • Don't expose sensitive data in logs (logs are container stdout/stderr)
  • Use atomic operations correctly for concurrency safety
  • Don't introduce new privilege escalation paths
  • Properly clean up resources

Test Coverage

Excellent test coverage:

  • ✅ Attached mode (standard flow)
  • ✅ Detached mode (the original bug)
  • ✅ Interactive mode
  • ✅ Explicit stop
  • ✅ Empty logs (audit trail completeness)

Suggestions:

  1. Add test for persisted run recovery

    Test that runs loaded from disk (simulating moat restart) still have proper log capture. This would catch the bug mentioned in "Potential Bugs Add Claude Code GitHub Workflow #2".

  2. Add test for concurrent Wait() calls

    Verify that multiple goroutines calling Wait() on the same run don't cause issues (tests idempotency of captureLogs).

  3. Test log demultiplexing

    Add a test that verifies both TTY and non-TTY modes produce complete logs (even if ordering differs).

Additional Notes

  1. curl dependency added without usage (registry.yaml:257-260)

    The PR adds curl to the dependency registry but it's not used anywhere in the changes. Is this intentional or leftover from development?

  2. Clean code comments

    The comments explaining the architecture (e.g., "This is the ONLY place that calls WaitContainer") are excellent and help maintainability.

Recommendations Priority

High Priority:

  • Fix the bug in loadPersistedRuns - start monitorContainerExit for running containers
  • Remove redundant state updates in Wait() to avoid race conditions

Medium Priority:

  • Add test for persisted run recovery
  • Consider optimizing ContainerInspect calls

Low Priority:

  • Remove unused curl dependency or document why it's needed
  • Document log ordering limitations in user docs

Verdict

The core fix is sound and addresses the reported issue effectively. The idempotent log capture design is robust. However, the race condition in Wait() and the missing monitor setup for persisted runs need to be addressed before merging to ensure correctness in all scenarios.

Address remaining code review issues:

1. **Start monitorContainerExit for running containers on recovery**
   - CRITICAL: When moat restarts while containers are running, they now
     get log capture when they eventually exit
   - Added 'go m.monitorContainerExit()' for StateRunning containers in
     loadPersistedRuns()

2. **Remove redundant state updates in Wait()**
   - Wait() no longer overwrites r.State, r.StoppedAt, r.Error
   - monitorContainerExit already sets these when container exits
   - Prevents race that could lose StateFailed state and error details
   - Changed to read-only lock for getting cleanup values

3. **Restore captureLogs in Stop() for defense-in-depth**
   - Moved captureLogs AFTER StopContainer (container fully stopped)
   - Provides safety net if moat crashes before monitorContainerExit runs
   - captureLogs is idempotent so multiple calls are safe

4. **Document curl dependency purpose**
   - Added comment explaining curl is used by MCP E2E tests

Defense-in-depth log capture now covers:
- monitorContainerExit: primary capture when container exits
- Wait(): idempotent capture after exitCh signals
- Stop(): idempotent capture after StopContainer
- StartAttached(): capture after attach completes
@claude
Copy link
Copy Markdown

claude Bot commented Jan 26, 2026

📦 Previous Review (superseded by newer review)

Pull Request Review: fix(logs): Ensure logs are retained after every run

Summary

This PR addresses a critical observability bug where container logs were not being captured to logs.jsonl in certain execution modes (detached and interactive). The fix introduces a background monitor goroutine that ensures logs are captured regardless of how a container exits, using atomic operations for idempotency and a channel for exit signaling.

Code Quality & Architecture

Strengths:

  • Well-architected solution: The new monitorContainerExit goroutine provides a single source of truth for container exit handling, eliminating race conditions by being the only place that calls WaitContainer
  • Defense-in-depth: Multiple capture points (Wait, StartAttached, Stop, and monitorContainerExit) with idempotent captureLogs method ensures logs are never lost
  • Clear documentation: Excellent comments explaining the rationale, especially around TTY demuxing and interleaved ordering trade-offs
  • Proper concurrency patterns: Uses atomic.Bool for logsCaptured flag and CompareAndSwap to prevent duplicate captures

Areas for Improvement:

  1. Resource cleanup on manager restart (minor): In loadPersistedRuns at internal/run/manager.go:175, exitCh is initialized but the atomic logsCaptured field is not explicitly set. While Go's zero value makes this safe, explicitly setting it would improve clarity.

  2. Context usage in monitorContainerExit (design consideration): The monitor uses the passed context for WaitContainer, but since it's called with context.Background(), cancellation won't stop the wait. This appears intentional (letting containers run indefinitely), but worth noting that there's no way to interrupt a stuck WaitContainer call if needed for testing or shutdown.

Docker Log Demuxing Fix

The fix to ContainerLogsAll (internal/container/docker.go:214-237) properly handles Docker's multiplexed log format:

Good:

  • Correctly inspects container config to determine TTY mode
  • Uses stdcopy.StdCopy for non-TTY containers
  • Documents the ordering limitation clearly

Consideration:

  • The note about "interleaved ordering between stdout/stderr is lost" is accurate and acceptable for audit purposes, but worth verifying this doesn't cause issues if logs are used for debugging complex multi-stream interactions.

Test Coverage

The new E2E tests are comprehensive and well-designed:

Strengths:

  • Tests all execution modes: attached, detached, interactive, explicit stop
  • Includes edge case of empty logs (audit trail completeness)
  • Clear test names and documentation explaining what each tests

Potential Issues:

  1. Timing dependencies (flakiness risk): Several tests use time.Sleep for waiting. This could be flaky on slow CI runners. Consider polling the run state instead.

  2. Interactive mode test limitation (minor): TestLogsCapturedInInteractiveMode acknowledges it can't truly test interactive mode but uses StartAttached as a proxy. This is reasonable but the test name might be slightly misleading.

Potential Bugs & Issues

  1. Duplicate cleanup in Wait() (code smell): After monitorContainerExit completes, Wait() performs redundant cleanup (proxy/SSH agent stop, route unregister) at internal/run/manager.go:1783-1800. This is already done in monitorContainerExit at lines 1956-1970. While harmless (stop operations are idempotent), this creates maintenance burden and potential for divergence.

    Recommendation: Consider consolidating cleanup logic or adding a comment explaining why it's duplicated (defense-in-depth vs. DRY principle).

  2. State update race in Stop() (potential bug): Stop() updates state at line 1716 (r.State = StateStopped) after calling captureLogs, but monitorContainerExit also updates state. If the container exits naturally while Stop() is being called, there's a race:

    • monitorContainerExit sets State = StateStopped/Failed and StoppedAt
    • Stop() overwrites with State = StateStopped and new StoppedAt

    This could lose StateFailed state and error messages if the container crashed right when being stopped. Consider checking if state is already terminal before overwriting.

  3. Error handling in captureLogs (minor): When ContainerLogsAll fails (line 1896), an empty byte slice is written but the error is only logged at debug level. For audit completeness, this might warrant a warning-level log or even writing an error marker to the logs file.

Performance Considerations

  • Multiple log fetches: In the common case, logs are fetched multiple times (once by monitorContainerExit, potentially again by Wait/Stop). The idempotency check prevents duplicate writes, but not duplicate fetches. This is acceptable since ContainerLogsAll is fast for stopped containers.

  • Goroutine leak prevention: Each run gets a monitor goroutine that terminates when the container exits. No leaks expected, but worth monitoring in production.

Security Concerns

No security issues identified. The changes are purely internal and don't affect:

  • Credential handling
  • Proxy authentication
  • Container isolation

Dependencies

The addition of curl to internal/deps/registry.yaml is documented as needed for E2E tests. Verified it's only used in tests, not production code. ✓

Recommendations

Must Fix:

  • None (the code is functional and safe)

Should Fix:

  1. Replace time.Sleep with polling in detached mode test to reduce flakiness
  2. Add check in Stop() to avoid overwriting terminal states (StateStopped/StateFailed)

Consider:

  1. Document or refactor the duplicate cleanup logic between Wait() and monitorContainerExit
  2. Add integration test that verifies logs are captured when moat restarts while containers are running (the loadPersistedRuns path)
  3. Consider elevating the error log level in captureLogs when log fetch fails

Verdict

Approval: ✅ Approve with minor suggestions

This PR successfully fixes a critical observability gap with a well-designed, concurrent-safe solution. The comprehensive test coverage demonstrates the fix works across all execution modes. The suggested improvements are minor and don't block merging.

The architecture is sound, following Go best practices for concurrent code. The addition of monitorContainerExit as the single source of truth for container waits is a clean design that should prevent future race conditions.

dpup added 2 commits January 26, 2026 20:16
…y removed

Fixes CI failures where logs.jsonl wasn't created because:
1. Container exits and is marked for removal by Docker
2. First goroutine sets logsCaptured=true but fails to get logs
3. LogWriter fails to create file (or other error)
4. Second goroutine sees logsCaptured=true and skips entirely
5. No file is ever created

The fix changes the idempotency logic:
- Don't early-return based solely on logsCaptured flag
- Check if the actual file exists before deciding to skip
- If file doesn't exist but flag is set, reset flag and retry
- Only keep logsCaptured=true after LogWriter successfully creates file
- If LogWriter fails, reset flag so another goroutine can try

This ensures logs.jsonl is ALWAYS created for audit purposes, even when:
- Container is 'dead or marked for removal'
- Multiple goroutines race to capture logs
- First attempt fails for any reason
…aling

CRITICAL timing fix: Docker may start removing containers immediately after
exit. The previous order was:
1. WaitContainer returns
2. close(exitCh) - signal watchers
3. captureLogs() - too late, container might be 'dead or marked for removal'

New order:
1. WaitContainer returns
2. captureLogs() - immediately while container is still in 'exited' state
3. close(exitCh) - signal watchers that logs are captured

This ensures logs are captured in the brief window between container exit
and Docker's cleanup process.
@claude
Copy link
Copy Markdown

claude Bot commented Jan 26, 2026

Pull Request Review: fix(logs): Ensure logs are retained after every run

Overview

This PR fixes a critical bug where container logs were not being captured to logs.jsonl in detached and interactive modes. The fix implements a comprehensive solution with a background monitor, idempotent log capture, and extensive E2E test coverage.

Code Quality: Excellent

Strengths:

  • Well-structured architecture with clear separation of concerns
  • Comprehensive documentation via comments explaining the why not just the what
  • Proper use of Go concurrency patterns (channels, atomic operations)
  • Defense-in-depth approach with multiple capture points

Key Changes Analysis

1. Docker Log Demultiplexing (internal/container/docker.go)

Correct Implementation:

  • Properly checks TTY mode via container inspect
  • Uses Docker stdcopy.StdCopy for demultiplexing non-TTY logs
  • Correctly handles both TTY (raw) and non-TTY (multiplexed) modes
  • Good documentation about ordering tradeoffs

The comment about stdout/stderr ordering loss is accurate - this is a Docker limitation, not a bug.

2. Background Monitor Pattern (internal/run/manager.go)

Excellent Design:

  • Single goroutine pattern prevents race conditions
  • Captures logs BEFORE signaling exit (critical for Docker cleanup timing)
  • Proper channel synchronization via exitCh
  • Handles both fresh starts and reloaded runs (lines 195-197)

Using context.Background() for log capture is correct - we want to capture logs even if the parent context is canceled.

3. Idempotent Log Capture (internal/run/manager.go:1876-1928)

Robust Implementation:

  • Uses atomic CAS for thread-safe idempotency
  • Fallback retry logic if previous attempt failed
  • Creates empty file even on error (important for audit trail)
  • Multiple capture points provide defense-in-depth

The retry logic (lines 1896-1901) is clever but slightly complex. Works correctly.

4. State Management Updates

Correct Synchronization:

  • Wait() now relies on monitor state updates
  • Prevents overwriting StateFailed with StateStopped
  • Maintains single source of truth (the monitor)

Test Coverage: Comprehensive

The new logs_capture_test.go provides excellent coverage:

  1. TestLogsCapturedInAttachedMode - Standard moat run flow
  2. TestLogsCapturedInDetachedMode - Critical bug case (moat run -d)
  3. TestLogsCapturedInInteractiveMode - Interactive TTY mode
  4. TestLogsCapturedAfterStop - Explicit stop command
  5. TestLogsAlwaysExistForAudit - Empty container audit trail

Strengths:

  • Tests all execution paths
  • Verifies file existence and content
  • Proper cleanup with deferred Destroy
  • Good use of test helpers

Enhancement: Consider adding a test for the race condition scenario (monitor vs Stop both trying to capture logs) to verify idempotency works.

Performance Considerations: Good

  1. Log Fetching: ContainerLogsAll is called once per container exit - acceptable overhead
  2. Atomic Operations: atomic.Bool for logsCaptured - minimal cost
  3. Goroutine Overhead: One monitor goroutine per run - reasonable for moats use case
  4. Demultiplexing: Buffering stdout/stderr separately - necessary for Docker format

Security Concerns: None Identified

  1. Log Contents: No credential leakage risk - logs are already captured by Docker
  2. File Permissions: Uses storage package permissions
  3. Context Handling: Proper use of background context for cleanup operations
  4. Race Conditions: Atomic operations prevent concurrent access issues

Potential Issues and Recommendations

Minor: Cleanup Timing Race (Low Priority)

If Docker auto-cleanup is very aggressive, there is a theoretical window where ContainerLogsAll might fail.

Current Mitigation: The code already handles this with immediate capture before signaling, fallback to empty file on error, and proper error logging.

Recommendation: Consider adding a retry with exponential backoff if ContainerLogsAll returns a container not found error. However, current handling is acceptable for audit purposes.

Enhancement: Add Timing Metrics

Add debug logging for log capture timing to help diagnose issues. This would help identify if large logs cause delays.

Documentation: Mention Ordering Limitation

The stdout/stderr interleaving limitation (mentioned in docker.go:232-235) should be documented for users who rely on log ordering.

Architectural Considerations: Sound

  1. Single Responsibility: Monitor handles waiting, captureLogs handles writing
  2. Fail-Safe Design: Multiple capture points ensure logs are not lost
  3. Graceful Degradation: Empty file created even on failures
  4. Backward Compatibility: No breaking changes to external APIs

The addition of exitCh and logsCaptured to the Run struct is appropriate and follows Go conventions.

Dependency Addition: curl in registry.yaml

Used by E2E tests for verifying MCP proxy functionality. This is a reasonable test dependency.

Overall Assessment

Quality Grade: A

This is a well-executed fix that demonstrates deep understanding of container runtime behavior, proper concurrency patterns in Go, comprehensive testing methodology, and attention to edge cases and failure modes.

The code is production-ready with excellent documentation and defensive programming practices.

Recommendations Summary

  • Optional: Add timing metrics to log capture
  • Optional: Document stdout/stderr ordering in observability docs
  • Optional: Add race condition test for idempotency
  • Consider: Retry on container-not-found with exponential backoff

Approval Status: APPROVED

This PR significantly improves moats observability and fixes a critical gap in audit logging. The implementation is solid and the test coverage is excellent.

@dpup dpup merged commit 6ec755a into main Jan 26, 2026
4 checks passed
@dpup dpup deleted the mcp branch January 26, 2026 21:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant