Conversation
Fixes critical bug where logs.jsonl was not created in several scenarios: - Interactive mode (moat run -i) - Detached mode (moat run -d) - Explicit stops (moat stop) This violated the audit requirement that logs should always be captured. Implemented systematic log capture with defense-in-depth: 1. **Extracted captureLogs() method** - Single responsibility: capture container logs to logs.jsonl - Idempotent with atomic.Bool flag (thread-safe) - Creates file even if empty (audit requirement) 2. **Call from ALL exit paths** - Wait() - explicit wait for completion - StartAttached() - after interactive session ends - Stop() - before stopping container - monitorContainerExit() - background monitor for all runs 3. **Fixed Docker log format** - ContainerLogsAll() now demuxes Docker's multiplexed format - Strips 8-byte headers using stdcopy.StdCopy() - Handles both TTY and non-TTY containers 4. **Prevented race conditions** - Added logsCaptured atomic.Bool to Run struct - CompareAndSwap ensures exactly-once log capture - No file handle conflicts - logs.jsonl now ALWAYS exists after container exits - Audit trail complete for all run types - Thread-safe with no race conditions - E2E tests verify all scenarios
📦 Previous Review (superseded by newer review)Pull Request Review: fix(logs): Ensure logs are retained after every runSummaryThis PR addresses a critical bug where logs were not being captured in detached and interactive modes. The fix introduces a background monitoring goroutine and makes log capture idempotent across all container exit paths. The implementation is solid with comprehensive test coverage. Code Quality ✅Strengths:
Suggestions:
Potential Bugs 🔍
Performance Considerations ⚡
Security Concerns 🔒No security issues identified. The changes maintain existing permission models:
Test Coverage ✅Excellent test coverage - The E2E tests comprehensively cover:
Suggestions:
Documentation 📚The code comments are excellent, but consider updating:
Minor Issues
Verdict: ✅ Approve with minor suggestionsThis is a solid fix for a critical observability bug. The implementation is well-thought-out with proper idempotency guards and comprehensive test coverage. The suggestions above are minor improvements, not blockers. Recommended next steps:
Great work on the comprehensive test coverage and the defensive programming with atomic operations! |
- Add curl to dependency registry (used in tests) - Fix Docker log inspection: return error instead of fallback that could corrupt TTY logs - Improve captureLogs error visibility: use WARN level instead of DEBUG
📦 Previous Review (superseded by newer review)Code Review: fix(logs): Ensure logs are retained after every runThis PR addresses a critical audit requirement by ensuring logs.jsonl is always created and populated across all execution paths. The implementation is well-designed with defense-in-depth principles. ✅ Strengths1. Comprehensive Solution with Defense-in-Depth
This ensures no execution path is missed. 2. Thread-Safe Idempotency
3. Docker Log Demuxing
4. Excellent Test Coverage
Each test validates both file existence and content correctness. 5. Graceful Error Handling 🔍 Issues & Concerns1. Race Condition: monitorContainerExit vs Wait (CRITICAL) // In Start() - spawns background goroutine
go m.monitorContainerExit(r) // Calls WaitContainer()
// User may call Wait() later
func Wait() {
exitCode, err := m.runtime.WaitContainer(ctx, containerID) // Second call
}Problem: Docker/container runtimes typically allow only one waiter per container. The second
While
Recommended fix: Use a channel-based approach where only one goroutine waits: // In Run struct
exitCh chan struct{} // Closed when container exits
// In monitorContainerExit - the ONLY waiter
exitCode, err := m.runtime.WaitContainer(...)
close(r.exitCh)
m.captureLogs(r)
// ... state updates
// In Wait - subscribe to exit
<-r.exitCh
m.captureLogs(r) // Idempotent, safe to call again
// ... cleanup logic2. Potential Log Ordering Issues with Demuxing combined := append(stdout.Bytes(), stderr.Bytes()...)This loses the interleaved order of stdout/stderr, which could make debugging difficult. The comment acknowledges this but doesn't explain the tradeoff. Recommendation: Document why preserving content is more important than order for audit purposes, or consider alternatives:
3. Error Handling: Stop() Captures Logs Before Verifying Container State m.captureLogs(r)
if err := m.runtime.StopContainer(ctx, r.ContainerID); err != nil {
// Log but don't fail
}If the container is still running and producing logs, Recommendation: Capture logs AFTER 4. Missing Import in docker.go import (
"bytes"
"github.com/docker/docker/pkg/stdcopy"
)5. Context Leak in monitorContainerExit Recommendation: func (m *Manager) monitorContainerExit(ctx context.Context, r *Run) {
exitCode, err := m.runtime.WaitContainer(ctx, r.ContainerID)
// ...
}Pass a context that cancels when the manager closes or a reasonable timeout. 6. Commit Message Violates Project Guidelines
🎯 RecommendationsPriority 1 (Must Fix):
Priority 2 (Should Fix): Priority 3 (Nice to Have): 📊 Overall AssessmentThis is high-quality code that solves a critical problem systematically. The defense-in-depth approach and comprehensive testing are excellent. However, the race condition around The code demonstrates:
Recommendation: Request changes to fix the race condition before merging. |
CRITICAL: Fixed race condition where both monitorContainerExit() and Wait() called WaitContainer() on the same container. Container runtimes typically allow only one waiter per container, causing the second call to either: - Return stale data immediately - Block indefinitely - Return an error Solution: Channel-based synchronization where only monitorContainerExit() waits on the container: 1. **Added exitCh to Run struct** - Closed when container exits (signaled by monitorContainerExit) - Initialized in both Create() and loadPersistedRuns() - Pre-closed for already-stopped containers during recovery 2. **monitorContainerExit is sole waiter** - Only goroutine that calls WaitContainer() - Closes exitCh after wait completes - Takes context parameter for proper cancellation 3. **Wait() subscribes to exitCh** - No longer calls WaitContainer() (eliminates race) - Waits for exitCh to close - captureLogs() still called (idempotent, defense-in-depth) - Reads r.Error to determine final exit status 4. **Recovery path handles stopped containers** - exitCh immediately closed if container already stopped - Wait() won't hang on recovered runs Thread-safe, no race conditions, proper context propagation.
…eoffs Address code review feedback: 1. **Remove captureLogs from Stop()** - captureLogs in Stop() captured incomplete logs (before container exits) - monitorContainerExit already captures complete logs when container exits - Redundant call could capture mid-shutdown state - Added comment explaining why we don't capture here 2. **Document log ordering tradeoff in docker.go** - StdCopy demuxing loses interleaved ordering between stdout/stderr - Ordering within each stream is preserved - Acceptable tradeoff for audit logs (content > perfect ordering) - TTY mode preserves natural ordering (no demuxing needed) These changes complete the defense-in-depth approach: - monitorContainerExit: captures logs when container actually exits - Wait(): idempotent capture for defense - StartAttached(): captures logs after attach completes - Stop(): no longer captures (relies on monitorContainerExit)
📦 Previous Review (superseded by newer review)Pull Request Review: fix(logs): Ensure logs are retained after every runSummaryThis PR fixes a critical bug where container logs were not being captured to Code Quality & Best PracticesStrengths:
Issues:
Potential Bugs
Performance Considerations
Security ConcernsNo security issues identified. The changes:
Test CoverageExcellent test coverage:
Suggestions:
Additional Notes
Recommendations PriorityHigh Priority:
Medium Priority:
Low Priority:
VerdictThe core fix is sound and addresses the reported issue effectively. The idempotent log capture design is robust. However, the race condition in |
Address remaining code review issues:
1. **Start monitorContainerExit for running containers on recovery**
- CRITICAL: When moat restarts while containers are running, they now
get log capture when they eventually exit
- Added 'go m.monitorContainerExit()' for StateRunning containers in
loadPersistedRuns()
2. **Remove redundant state updates in Wait()**
- Wait() no longer overwrites r.State, r.StoppedAt, r.Error
- monitorContainerExit already sets these when container exits
- Prevents race that could lose StateFailed state and error details
- Changed to read-only lock for getting cleanup values
3. **Restore captureLogs in Stop() for defense-in-depth**
- Moved captureLogs AFTER StopContainer (container fully stopped)
- Provides safety net if moat crashes before monitorContainerExit runs
- captureLogs is idempotent so multiple calls are safe
4. **Document curl dependency purpose**
- Added comment explaining curl is used by MCP E2E tests
Defense-in-depth log capture now covers:
- monitorContainerExit: primary capture when container exits
- Wait(): idempotent capture after exitCh signals
- Stop(): idempotent capture after StopContainer
- StartAttached(): capture after attach completes
📦 Previous Review (superseded by newer review)Pull Request Review: fix(logs): Ensure logs are retained after every runSummaryThis PR addresses a critical observability bug where container logs were not being captured to Code Quality & ArchitectureStrengths:
Areas for Improvement:
Docker Log Demuxing FixThe fix to Good:
Consideration:
Test CoverageThe new E2E tests are comprehensive and well-designed: Strengths:
Potential Issues:
Potential Bugs & Issues
Performance Considerations
Security ConcernsNo security issues identified. The changes are purely internal and don't affect:
DependenciesThe addition of RecommendationsMust Fix:
Should Fix:
Consider:
VerdictApproval: ✅ Approve with minor suggestions This PR successfully fixes a critical observability gap with a well-designed, concurrent-safe solution. The comprehensive test coverage demonstrates the fix works across all execution modes. The suggested improvements are minor and don't block merging. The architecture is sound, following Go best practices for concurrent code. The addition of |
…y removed Fixes CI failures where logs.jsonl wasn't created because: 1. Container exits and is marked for removal by Docker 2. First goroutine sets logsCaptured=true but fails to get logs 3. LogWriter fails to create file (or other error) 4. Second goroutine sees logsCaptured=true and skips entirely 5. No file is ever created The fix changes the idempotency logic: - Don't early-return based solely on logsCaptured flag - Check if the actual file exists before deciding to skip - If file doesn't exist but flag is set, reset flag and retry - Only keep logsCaptured=true after LogWriter successfully creates file - If LogWriter fails, reset flag so another goroutine can try This ensures logs.jsonl is ALWAYS created for audit purposes, even when: - Container is 'dead or marked for removal' - Multiple goroutines race to capture logs - First attempt fails for any reason
…aling CRITICAL timing fix: Docker may start removing containers immediately after exit. The previous order was: 1. WaitContainer returns 2. close(exitCh) - signal watchers 3. captureLogs() - too late, container might be 'dead or marked for removal' New order: 1. WaitContainer returns 2. captureLogs() - immediately while container is still in 'exited' state 3. close(exitCh) - signal watchers that logs are captured This ensures logs are captured in the brief window between container exit and Docker's cleanup process.
Pull Request Review: fix(logs): Ensure logs are retained after every runOverviewThis PR fixes a critical bug where container logs were not being captured to logs.jsonl in detached and interactive modes. The fix implements a comprehensive solution with a background monitor, idempotent log capture, and extensive E2E test coverage. Code Quality: ExcellentStrengths:
Key Changes Analysis1. Docker Log Demultiplexing (internal/container/docker.go)Correct Implementation:
The comment about stdout/stderr ordering loss is accurate - this is a Docker limitation, not a bug. 2. Background Monitor Pattern (internal/run/manager.go)Excellent Design:
Using context.Background() for log capture is correct - we want to capture logs even if the parent context is canceled. 3. Idempotent Log Capture (internal/run/manager.go:1876-1928)Robust Implementation:
The retry logic (lines 1896-1901) is clever but slightly complex. Works correctly. 4. State Management UpdatesCorrect Synchronization:
Test Coverage: ComprehensiveThe new logs_capture_test.go provides excellent coverage:
Strengths:
Enhancement: Consider adding a test for the race condition scenario (monitor vs Stop both trying to capture logs) to verify idempotency works. Performance Considerations: Good
Security Concerns: None Identified
Potential Issues and RecommendationsMinor: Cleanup Timing Race (Low Priority)If Docker auto-cleanup is very aggressive, there is a theoretical window where ContainerLogsAll might fail. Current Mitigation: The code already handles this with immediate capture before signaling, fallback to empty file on error, and proper error logging. Recommendation: Consider adding a retry with exponential backoff if ContainerLogsAll returns a container not found error. However, current handling is acceptable for audit purposes. Enhancement: Add Timing MetricsAdd debug logging for log capture timing to help diagnose issues. This would help identify if large logs cause delays. Documentation: Mention Ordering LimitationThe stdout/stderr interleaving limitation (mentioned in docker.go:232-235) should be documented for users who rely on log ordering. Architectural Considerations: Sound
The addition of exitCh and logsCaptured to the Run struct is appropriate and follows Go conventions. Dependency Addition: curl in registry.yamlUsed by E2E tests for verifying MCP proxy functionality. This is a reasonable test dependency. Overall AssessmentQuality Grade: A This is a well-executed fix that demonstrates deep understanding of container runtime behavior, proper concurrency patterns in Go, comprehensive testing methodology, and attention to edge cases and failure modes. The code is production-ready with excellent documentation and defensive programming practices. Recommendations Summary
Approval Status: APPROVED This PR significantly improves moats observability and fixes a critical gap in audit logging. The implementation is solid and the test coverage is excellent. |
No description provided.