feat(snapshot): add workspace snapshots and process tracing#36

Merged
andybons merged 39 commits into main from feat/snapshots
Jan 21, 2026

Conversation

@andybons (Collaborator) commented Jan 20, 2026

Summary

  • Workspace Snapshots: Automatic and manual snapshotting of agent workspaces with APFS (macOS) and archive backends
  • Process Tracing: Platform-specific execution tracing via Linux proc connector and macOS sysctl polling
  • Rollback Support: moat rollback command to restore workspace to previous snapshot

Key Components

Snapshots (internal/snapshot/)

  • APFS backend for instant copy-on-write snapshots on macOS
  • Archive backend with gitignore-aware tar.gz fallback
  • Snapshot engine with automatic pre/post-run capture
  • Security fix: prevent symlink path traversal attacks in archive restore

Process Tracing (internal/trace/)

  • Linux: Netlink proc connector for kernel-level exec notifications
  • macOS: sysctl polling with start-time-based exec detection
  • Build-tagged implementations with stub fallback
  • Race condition fixes and periodic PID cleanup

CLI Commands

  • moat snapshot - Manual workspace snapshot
  • moat snapshots - List/manage snapshots
  • moat rollback - Restore to previous snapshot

Configuration

  • New snapshot and tracing fields in agent.yaml

Test Plan

  • Unit tests for all new packages
  • Integration tests for platform-specific tracers
  • Regression tests for race conditions and edge cases
  • go test -race ./... passes

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings January 20, 2026 20:01

Copilot AI left a comment

Pull request overview

This PR adds comprehensive workspace snapshot capabilities and process execution tracing to the moat agent system. The implementation provides automatic pre-run snapshots, event-based snapshot triggers (git commits, builds, idle detection), and platform-specific execution tracing via Linux proc connector and macOS sysctl polling.

Changes:

  • Workspace snapshot system with APFS (macOS) and archive (tar.gz) backends
  • Process execution tracing with platform-specific implementations (Linux/macOS/stub)
  • Configuration support for snapshot triggers, exclusions, and retention policies
  • Storage integration for execution events
  • CLI commands for snapshot management (moat snapshots, moat snapshot, moat rollback)

Reviewed changes

Copilot reviewed 37 out of 38 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
internal/snapshot/snapshot.go Core snapshot types, metadata, and backend interface
internal/snapshot/archive.go Archive backend with tar.gz and gitignore support
internal/snapshot/apfs_darwin.go APFS backend using macOS tmutil commands
internal/snapshot/engine.go Snapshot engine with backend detection and management
internal/trace/event.go Execution event types with git/build detection helpers
internal/trace/tracer.go Tracer interface definition
internal/trace/tracer_stub.go No-op tracer for testing/unsupported platforms
internal/trace/tracer_linux.go Linux proc connector tracer via netlink
internal/trace/tracer_darwin.go macOS sysctl polling tracer
internal/config/config.go Snapshot and tracing configuration schema
internal/storage/storage.go Execution event storage (exec.jsonl)
internal/run/run.go Added SnapEngine field to Run struct
internal/run/manager.go Integrated snapshot engine into run lifecycle
go.mod/go.sum Added go-git dependency for gitignore parsing

Comment thread: internal/snapshot/archive.go (outdated)
Comment on lines +264 to +267:

f.Close()
return fmt.Errorf("write file %s: %w", header.Name, err)
}
f.Close()

Copilot AI, Jan 20, 2026:

The file handle may still be writable (it comes from OpenFile), so closing it can fail and lose buffered data; that failure is not handled explicitly.

Suggested change:

// Best-effort close; preserve the original write error
_ = f.Close()
return fmt.Errorf("write file %s: %w", header.Name, err)
}
if err := f.Close(); err != nil {
return fmt.Errorf("close file %s: %w", header.Name, err)
}
Comment thread: internal/snapshot/archive.go (outdated)
Comment on lines +264 to +267:

f.Close()
return fmt.Errorf("write file %s: %w", header.Name, err)
}
f.Close()

Copilot AI, Jan 20, 2026:

The file handle may still be writable (it comes from OpenFile), so closing it can fail and lose buffered data; that failure is not handled explicitly.

Suggested change:

if cerr := f.Close(); cerr != nil {
return fmt.Errorf("write file %s: %v (also failed to close: %w)", header.Name, err, cerr)
}
return fmt.Errorf("write file %s: %w", header.Name, err)
}
if err := f.Close(); err != nil {
return fmt.Errorf("close file %s: %w", header.Name, err)
}
Comment thread: internal/storage/storage.go (outdated)
Comment on lines +395 to +405:

defer f.Close()

data, err := json.Marshal(event)
if err != nil {
return fmt.Errorf("marshaling exec event: %w", err)
}
if _, writeErr := f.Write(data); writeErr != nil {
return fmt.Errorf("writing exec event: %w", writeErr)
}
_, err = f.Write([]byte("\n"))
return err

Copilot AI, Jan 20, 2026:

The file handle may still be writable (it comes from OpenFile), so closing it can fail and lose buffered data; deferring Close discards that error.

Suggested change:

data, err := json.Marshal(event)
if err != nil {
_ = f.Close()
return fmt.Errorf("marshaling exec event: %w", err)
}
if _, writeErr := f.Write(data); writeErr != nil {
_ = f.Close()
return fmt.Errorf("writing exec event: %w", writeErr)
}
if _, writeErr := f.Write([]byte("\n")); writeErr != nil {
_ = f.Close()
return fmt.Errorf("writing exec event newline: %w", writeErr)
}
if closeErr := f.Close(); closeErr != nil {
return fmt.Errorf("closing exec file: %w", closeErr)
}
return nil
@claude

claude Bot commented Jan 20, 2026

Code Review - PR #36: Workspace Snapshots & Process Tracing

Excellent work on this substantial feature! The architecture is solid with good separation of concerns, strong security practices, and comprehensive test coverage. However, I've identified several issues that should be addressed before merging.

Critical Issues (Must Fix)

1. Race Condition in .git Preservation ⚠️

Location: internal/snapshot/archive.go:156-210

Issue: The .git backup/restore creates a window where the workspace is in an inconsistent state:

// Current code:
os.Rename(gitDir, gitBackup)        // .git renamed away
// ... multiple fs operations happen here ...
os.Rename(gitBackup, gitDir)        // .git restored

If the process crashes between these operations, the workspace loses all git history (it's in .git.backup).

Recommendations:

  • Copy instead of rename for the backup
  • Use a temporary directory in snapshot storage instead of the workspace
  • At minimum: document this risk clearly

2. Unhandled .git Restoration Errors ⚠️

Locations: archive.go:174, 188, 198

// Errors are silently ignored:
os.Rename(gitBackup, gitDir)

If restoration fails, the workspace is corrupted with no feedback to the user.

Fix: Check and wrap these errors:

if err := os.Rename(gitBackup, gitDir); err != nil {
    return fmt.Errorf("restoring .git from backup: %w", err)
}

High Priority Issues

3. APFS Stub Silently Succeeds on Non-Darwin

Location: internal/snapshot/apfs_stub.go:21

Returns empty string (success) when APFS snapshots aren't available. Users won't know snapshots aren't being created.

Fix:

func (a *APFSBackend) Create(name string) (string, error) {
    return "", fmt.Errorf("APFS snapshots not available on this platform")
}

4. Metadata Corruption Not Recoverable

Location: internal/snapshot/engine.go:232

If snapshots.json is corrupted, users get a generic unmarshal error with no recovery path.

Fix: Provide actionable guidance:

if err := json.Unmarshal(data, &e.snapshots); err != nil {
    return fmt.Errorf("corrupted snapshot metadata at %s: %w\n"+
        "To reset: rm %s", metaPath, err, metaPath)
}

Medium Priority Issues

5. Event Emission Callback Race

Locations: tracer_linux.go:376-398, tracer_darwin.go:416-439

Callbacks are invoked outside the lock after event channel send. If Stop() is called concurrently, tracer state can become inconsistent.

Current:

t.mu.Lock()
select {
case t.events <- event:
default:
    t.droppedEvents++
}
t.mu.Unlock()

// Callback outside lock - can race with Stop()
for _, cb := range cbs {
    cb(event)
}

Fix: Either hold lock during callbacks or use defer/channels for cleanup coordination.

6. Darwin Tracer Offset Assumptions Not Validated

Location: tracer_darwin.go:29-40

Hardcoded struct offsets for kinfo_proc are assumed correct but not validated at runtime.

Fix: Add sanity checks after parsing:

pid := binary.LittleEndian.Uint32(b[pidOffset : pidOffset+4])
if pid > 99999 {  // Reasonable max PID
    log.Printf("WARNING: parsed suspicious PID %d, offset may be wrong", pid)
}

Code Quality Issues

7. Error Ignored in Archive Cleanup

Location: archive.go:149

os.Remove(archivePath)  // Error ignored
return nil, fmt.Errorf("creating tar writer: %w", err)

Should log or wrap the cleanup error.

8. .git Symlink Not Handled

The .git preservation logic doesn't handle the case where .git is a symlink. After restore, the symlink would point to the old (backup) location.

Fix: Use filepath.EvalSymlinks() before backup operations.


Test Coverage Gaps

Excellent test coverage overall, but consider adding:

  1. Concurrent snapshot creation - Two engine.Create() calls simultaneously
  2. .git corruption recovery - What happens if .git backup rename fails
  3. Corrupted snapshots.json - Unmarshal error handling
  4. Permission errors - Read-only workspace during archive creation
  5. Linux tracer edge cases - Malformed netlink packets, out-of-order events

The existing path traversal tests are excellent! 👍


Documentation Suggestions

  • Add docstring to Backend interface explaining method semantics
  • Document snapshot ID format (snap_<8-hex-chars>)
  • Document APFS backend limitations (no .git preservation vs archive backend)
  • Add performance characteristics note for large workspaces
  • Document tracer limitations (can't restart, event dropping)

Positive Highlights

Excellent security practices - Path traversal prevention is defense-in-depth with comprehensive tests
Clean architecture - Interface-based design, proper platform abstraction with build tags
Resource management - Proper file handle cleanup (nice comment on archive.go:138)
Structured logging - Consistent use of slog throughout
Error wrapping - Mostly consistent use of %w for error chains
Concurrency design - Good use of RWMutex for PID tracking in Linux tracer


Overall Assessment

This is a well-designed feature with strong fundamentals. The architecture is sound and the code demonstrates good Go practices. The main concerns are around error handling completeness and edge case robustness, particularly around the .git preservation logic which poses a data integrity risk.

Recommendation: Address the critical issues (#1, #2) and high priority issues (#3, #4) before merging. The medium priority issues can be addressed post-merge if time is constrained.

@claude

claude Bot commented Jan 20, 2026

Code Review: PR #36 - Workspace Snapshots and Process Tracing

Overview

This PR adds significant new functionality with 9,098 additions across 37 files, introducing workspace snapshots with APFS/archive backends and platform-specific process tracing. The implementation is generally solid with good security practices and comprehensive testing.

Summary Assessment

Overall Quality: ✅ Strong - Well-architected with good security practices
Test Coverage: ✅ Good - 2,247 lines of tests, though some concurrency edge cases need coverage
Security: ✅ Solid - Excellent symlink path traversal protection
Documentation: ✅ Excellent - Clear error messages and examples following CLAUDE.md


Critical Findings

🟢 No Critical Issues Found

The code is production-ready with no blocking security vulnerabilities or critical bugs.


High-Priority Recommendations

1. Archive Restore Atomicity (Medium Priority)

File: internal/snapshot/archive.go:194-200

Issue: If extraction fails midway, the workspace may contain partial files in an inconsistent state. While .git is restored, other files may be corrupted.

Current Code:

if err := b.RestoreTo(nativeRef, workspacePath); err != nil {
    // Restore .git on error (best effort)
    if gitBackup != "" {
        _ = os.Rename(gitBackup, gitDir)
    }
    return fmt.Errorf("extract archive: %w", err)
}

Recommendation: Extract to a temporary directory first, then swap atomically:

tmpDir := workspacePath + ".restore-tmp"
if err := b.RestoreTo(nativeRef, tmpDir); err != nil {
    os.RemoveAll(tmpDir)
    return fmt.Errorf("extract archive: %w", err)
}
// Atomic swap with proper cleanup

2. Insufficient Concurrency Test Coverage (Medium Priority)

Files: internal/trace/*_test.go

Missing Tests:

  • Concurrent Start()/Stop() calls
  • Channel overflow scenarios (dropped events)
  • Callback panic handling
  • Process tree tracking edge cases (grandchildren)
  • Linux consecutive error recovery
  • Goroutine cleanup verification

Recommendation: Add these test cases:

func TestTracerConcurrentStartStop(t *testing.T)
func TestTracerChannelOverflow(t *testing.T) 
func TestTracerCallbackPanic(t *testing.T)
func TestTracerGrandchildTracking(t *testing.T)
func TestLinuxTracerConsecutiveErrors(t *testing.T)
func TestTracerGoroutineCleanup(t *testing.T)

Security Review

✅ Strengths

  1. Symlink Path Traversal Protection (archive.go:272-301)

    • Correctly rejects absolute symlink targets
    • Validates relative symlinks don't escape destination
    • Comprehensive test coverage in archive_test.go:413-539
  2. Decompression Bomb Protection (archive.go:266-271)

    • 1GB limit per file prevents resource exhaustion
    • Consider making configurable for legitimate large files
  3. Race Condition Prevention (tracer_*.go)

    • Proper mutex usage throughout
    • Defensive channel send patterns prevent deadlocks
    • Lock held during channel send to prevent Stop() race (tracer_darwin.go:416-438)

⚠️ Minor Security Considerations

  1. APFS Backend Privilege Requirements (apfs_darwin.go:60-85)

    • tmutil restore may require elevated privileges
    • Restore operates on entire volume, not just workspace subdirectory
    • Consider documenting these limitations or defaulting to archive backend
  2. Error Cleanup Failures Silently Ignored (archive.go:147-151)

    • Partial archives may remain if cleanup fails
    • Recommendation: Log cleanup failures

Code Quality Observations

✅ Excellent Patterns

  1. Error Messages - Follow CLAUDE.md guidelines perfectly:

    • rollback.go:127: Shows undo command after operations
    • snapshots.go:200-201: Explains state clearly
    • Actionable guidance throughout
  2. Resource Management:

    • Proper defer chains for file handles (archive.go:57-63)
    • Clean shutdown sequence in tracers (tracer_linux.go:119-141)
    • Immediate file handle close to avoid accumulation (archive.go:133-142)
  3. Error Handling:

    • Metadata consistency with rollback (engine.go:124-129)
    • Linux tracer consecutive error limiting (tracer_linux.go:195-235)
    • Non-blocking snapshot creation (manager.go:956-961)

📝 Improvement Opportunities

  1. Enhance CLI Error Messages (rollback.go:68, 83)

    • Suggest next actions when snapshots not found
    • Example: "List available snapshots: moat snapshots "
  2. Add Destination Validation (rollback.go:100-108)

    • Check if --to directory exists before extraction
    • Provide clear guidance on how to proceed
  3. Darwin Tracer Error Rate Limiting (tracer_darwin.go:155-157)

    • Missing consecutive error tracking like Linux tracer has
    • Silent failures in getAllProcesses() could hide persistent issues
  4. Configuration Default Duplication (config.go:159-165, 171-186)

    • Defaults set in both Load() and DefaultConfig()
    • Consider consolidating or documenting why both are needed

Performance Considerations

✅ Good Choices

  1. APFS Copy-on-Write: Instant snapshots on macOS with APFS
  2. Gitignore-Aware Archives: Reduces snapshot size significantly
  3. Separate Mutexes: tracer_linux.go uses RWMutex for PID map (lines 52-53)
  4. Non-Blocking Snapshots: Optional feature doesn't impact run performance

💡 Potential Optimizations

  1. Prune Operation (snapshots.go:176-197)

    • No transaction semantics - snapshots created during prune aren't considered
    • Acceptable for non-critical operation, but worth noting
  2. Archive Compression: Using gzip - consider offering compression level options


Testing Assessment

Test Lines: 2,247 lines across snapshot and trace packages

✅ Well-Tested Areas

  • Archive backend with security tests (730 lines)
  • Engine operations and metadata persistence (604 lines)
  • Symlink path traversal attacks
  • Gitignore handling
  • File permissions preservation

⚠️ Needs More Coverage

  • Concurrent tracer operations
  • Process tree tracking edge cases
  • Error recovery scenarios
  • Resource leak detection

Recommendation: Run go test -race ./... to verify concurrency safety (PR mentions this passes ✓)


Documentation Quality

✅ Excellent

  • All CLI commands have detailed Long descriptions with examples
  • Comments explain "why" not just "what"
  • Design docs included in docs/plans/ (3,839 lines total)
  • Inline security considerations documented

📝 Suggestions

  1. Add godoc examples for exported functions in snapshot/trace packages
  2. Document APFS backend limitations (privilege requirements, volume-level operations)
  3. Clarify when archive vs APFS backend is selected

Specific File Reviews

internal/snapshot/archive.go

  • Security: Excellent symlink protection
  • Resource management: Good defer patterns
  • Minor: Log cleanup failures, consider atomic restore

internal/snapshot/engine.go

  • Clean abstraction over backends
  • Good metadata consistency
  • Proper cleanup on errors

internal/trace/tracer_darwin.go

  • Excellent concurrency patterns
  • Good process tree tracking
  • Add error rate limiting like Linux version

internal/trace/tracer_linux.go

  • Netlink integration looks solid
  • Good consecutive error handling
  • Proper mutex usage

cmd/moat/cli/*.go

  • Excellent UX and error messages
  • Good command structure
  • Enhance with suggested next actions

internal/config/config.go

  • Clean schema design
  • Sensible defaults
  • Consolidate default logic

Dependencies

New dependencies added (go.mod):

  • No new direct dependencies (only transitive updates)
  • Clean implementation using stdlib where possible ✅

Recommendations Summary

Must Address Before Merge

  • None - code is production-ready

Should Address Soon

  1. Add atomic restore for archive backend
  2. Expand tracer concurrency test coverage
  3. Add error rate limiting to Darwin tracer

Nice to Have

  1. Make decompression limit configurable
  2. Enhance CLI error messages with next actions
  3. Add godoc examples
  4. Consider fuzzing for archive extraction

Conclusion

This is high-quality work that follows project conventions and security best practices. The implementation is thoughtful with excellent documentation and good test coverage. The symlink path traversal protection is particularly well-done with comprehensive tests.

Recommendation: ✅ Approve with minor suggestions

The suggestions above are for improvement but don't block merging. The code is secure, well-tested, and ready for production use.


Review generated using Claude Code following CLAUDE.md guidelines

@claude

claude Bot commented Jan 20, 2026

Comprehensive Code Review: PR #36 - Workspace Snapshots and Process Tracing

Overview

This PR adds significant new functionality (~9,130 additions across 38 files) for workspace snapshots and process tracing. Overall, this is a solid implementation with excellent security practices, though there are areas for improvement.


✅ Strengths

Security Implementation (Excellent)

The path traversal protection in internal/snapshot/archive.go is textbook-quality:

File Path Validation (lines 241-246):

relToDestPath, err := filepath.Rel(destPath, targetPath)
if err != nil || strings.HasPrefix(relToDestPath, "..") {
    return fmt.Errorf("invalid path in archive: %s", header.Name)
}

Symlink Validation (lines 280-296):

  • Rejects absolute symlink targets
  • Validates relative symlinks resolve within destination
  • Correctly uses filepath.Rel() to detect escape attempts

Additional Protections:

  • Decompression bomb protection: 1GB limit per file (line 267)
  • Mode masking to prevent setuid/setgid bits (lines 251, 261)

This is security done right! 🎯
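The symlink check can be paraphrased as follows (a sketch of the described logic, not the exact archive.go code; validateSymlink is a hypothetical name):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// validateSymlink rejects absolute targets and relative targets that
// resolve outside destPath, mirroring the two rules described above.
func validateSymlink(destPath, linkName, target string) error {
	if filepath.IsAbs(target) {
		return fmt.Errorf("absolute symlink target in archive: %s -> %s", linkName, target)
	}
	// Resolve the target relative to the symlink's own directory.
	resolved := filepath.Join(destPath, filepath.Dir(linkName), target)
	rel, err := filepath.Rel(destPath, resolved)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return fmt.Errorf("symlink escapes destination: %s -> %s", linkName, target)
	}
	return nil
}

func main() {
	fmt.Println(validateSymlink("/dest", "a/link", "../../etc/passwd") != nil) // escape rejected
	fmt.Println(validateSymlink("/dest", "a/link", "../ok.txt") == nil)        // stays inside, allowed
}
```

filepath.Join cleans the path, so ".." components are collapsed before the filepath.Rel check, which is what makes the prefix test reliable.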

Concurrency Safety

Excellent use of atomic operations in manager.go:301-332:

var storeRef atomic.Value // holds *storage.RunStore

The atomic reference pattern cleverly avoids a race condition where the proxy logger might try to write before the store is initialized.

Proper mutex usage throughout:

  • snapshot/engine.go: Consistent lock/unlock for snapshot map operations
  • trace/tracer_linux.go: Good separation with sync.RWMutex for read-heavy PID tracking

Resource Management

Smart file handle management in archive.go:138:

f.Close() // Close immediately, not deferred, to avoid accumulating file handles

Excellent comment explaining why immediate close is used in loops instead of defer!

Architecture

  • Clean Backend interface abstraction for snapshot strategies
  • Proper platform-specific code using build tags (//go:build darwin, //go:build linux)
  • Well-integrated into run lifecycle with sensible defaults
  • Good CLI design with safety features (rollback creates safety snapshot before restore)

⚠️ Issues & Concerns

Critical: APFS Backend May Not Work as Designed

Location: internal/snapshot/apfs_darwin.go:809-835

The APFS backend uses tmutil localsnapshot which is designed for Time Machine, not programmatic snapshots. The restore logic assumes tmutil restore works like a rollback, but it's actually for restoring files from Time Machine backups to a different location.

Recommendation: Use diskutil apfs createSnapshot and diskutil apfs revertToSnapshot instead, or clearly document that APFS support is experimental/unsupported.

High Priority: Missing Security Tests

Despite excellent security implementation, there are no tests validating the path traversal fixes:

  • No tests for ../../../etc/passwd in tar entries
  • No tests for the complex symlink validation logic (lines 280-296)
  • No tests for decompression bomb protection
  • No tests for concurrent snapshot operations

Recommendation: Add dedicated security tests to prevent regressions:

func TestArchive_PathTraversalProtection(t *testing.T) {
    // Test cases for .., absolute paths, symlinks escaping destination
}

Medium Priority: Resource Management Edge Cases

  1. Git Directory Backup Races (archive.go:158-210):

    • Concurrent restores could overwrite .git.backup
    • Crashes leave .git.backup orphaned
    • Consider using unique temp directories: .git.backup.<timestamp>
  2. No Disk Space Checks: Archive creation doesn't verify available space before starting. Large workspaces could fill the disk.

  3. Synchronous Pre-Run Snapshots (manager.go:957-961): These block container startup. For large workspaces, consider making them asynchronous or adding a timeout.

Low Priority: Code Quality

  1. Long Functions: manager.Create() is 870+ lines and handles too many responsibilities (container, proxy, SSH, snapshots, audit). Consider extracting helpers.

  2. gitignore Performance (archive.go:507-547): buildMatcher() walks entire workspace to find all .gitignore files. For large workspaces, consider caching or only checking root.

  3. Missing Constants Documentation: Some magic numbers lack explanation (4096 buffer size, 1GB file limit).

  4. Minor Race Condition: tracer_linux.go:147-150 - OnExec() appends to callbacks without lock, but callbacks are invoked without holding lock. Likely not an issue since OnExec() is called during init, but worth noting.


📝 Additional Observations

Error Handling

Good:

  • Consistent error wrapping with context (fmt.Errorf("...: %w", err))
  • Cleanup on errors with explicit _ = for best-effort cleanup
  • Helpful error messages with recovery instructions

Could Improve:

  • Silent snapshot failures during run creation (logged but don't fail the run) - users might not notice lost protection
  • APFS backend silently falls back on parsing errors

Test Coverage

Present:

  • Basic archive create/restore/delete operations
  • Engine lifecycle tests
  • Event parsing tests

Missing:

  • Security-focused tests (as noted above)
  • Integration/E2E tests for snapshot creation during runs
  • APFS backend functional tests (only detection is tested)

Documentation

Design docs are comprehensive, but:

  • Implementation diverges slightly (idle triggers and build triggers not fully wired up)
  • APFS implementation may not work as described
  • Missing operational docs (disk space requirements, troubleshooting)

📊 Summary & Recommendation

Verdict: Approve with Minor Concerns

The core functionality is sound and the security implementation is exemplary. The architecture is clean and extensible. However:

Before Merge:

  1. ✅ Mark APFS backend as experimental in code/docs, or fix the implementation
  2. ✅ Add at least basic security tests for path traversal protection

Follow-up Work:

  1. Improve test coverage (especially security and concurrency tests)
  2. Add disk space checks before snapshot creation
  3. Consider async pre-run snapshots for large workspaces
  4. Refactor manager.Create() to reduce complexity
  5. Add confirmation prompt to snapshots prune command
  6. Wire up tracer integration completely (appears incomplete)

🎯 Final Notes

This is well-crafted code that demonstrates strong security awareness and good Go practices. The path traversal fix is thorough and production-ready. With the recommended test additions and APFS clarifications, this will be a solid feature addition to moat.

Great work on the security implementation! The symlink validation logic is particularly well thought out.

@claude

claude Bot commented Jan 21, 2026

Pull Request Review: Workspace Snapshots and Process Tracing

Overall Assessment

This is a well-architected and thoroughly implemented PR that adds two major features: workspace snapshots and process tracing. The code demonstrates strong engineering practices with comprehensive test coverage (39 snapshot tests, 19 trace tests), proper security considerations, and platform-specific optimizations.

Recommendation: Approve with minor suggestions


Code Quality & Best Practices

Strengths

  1. Excellent Separation of Concerns

    • Clean abstraction with Backend interface for snapshot implementations
    • Platform-specific build tags (//go:build linux, //go:build darwin) properly isolate code
    • Factory pattern for tracer selection based on platform
  2. Robust Error Handling

    • Comprehensive error wrapping with context (e.g., fmt.Errorf("backend create: %w", err))
    • Recovery paths with best-effort cleanup (e.g., gitignore backup restoration on errors in archive.go:187-198)
    • Clear, actionable error messages (follows CLAUDE.md guidance)
  3. Thread Safety

    • Proper mutex usage in Engine and all tracers
    • Lock-then-copy pattern for callbacks to prevent deadlock (e.g., tracer_linux.go:383-386)
    • Stopped flags to prevent double-close panics
  4. Test Coverage

    • 58 total tests across snapshot and trace packages
    • Regression tests for race conditions and edge cases
    • Security-focused tests for symlink path traversal attacks
    • Integration tests for platform-specific tracers

Security Review

✅ Security Strengths

  1. Path Traversal Protection (archive.go:238-296)

    • Validates archive paths don't escape destination using filepath.Rel
    • Rejects absolute symlink targets (archive.go:282)
    • Validates symlink targets stay within destination (archive.go:293-296)
    • Comprehensive test coverage for attack vectors (archive_test.go:413-484)
  2. Decompression Bomb Protection (archive.go:267)

    • 1GB per-file limit using io.LimitReader
    • Prevents zip bomb attacks
  3. Safe Integer Conversions

    • Appropriate //nolint:gosec G115 with validation comments where needed
    • PIDs and modes properly range-validated before conversion

⚠️ Security Considerations

  1. APFS Backend Disabled (Good Decision)

    • The APFS backend using tmutil creates volume-level snapshots, not directory-level
    • Correctly disabled by default in engine.go:87-98
    • Warning comment explains the issue
    • Suggestion: Consider documenting the proper approach (using cp -c for COW cloning) in a GitHub issue for future implementation
  2. Privileged Operations

    • Linux tracer requires CAP_NET_ADMIN or root (documented in error message, tracer_linux.go:86)
    • APFS restore may require elevated privileges (noted in comment, apfs_darwin.go:61-62)
    • This is acceptable for the use case but worth noting in documentation

Performance Considerations

✅ Optimizations

  1. Platform-Specific Efficiency

    • Linux: Kernel-level netlink connector (zero polling overhead)
    • macOS: Sysctl polling with 100ms interval (acceptable tradeoff vs ESF complexity)
    • Start-time-based exec detection on macOS avoids false positives
  2. Resource Management

    • Non-blocking event channels with dropped event counters
    • Periodic PID cleanup to handle missed EXIT events (tracer_linux.go:206-208)
    • File handle discipline (immediate close after read, not deferred in loops)
  3. Gitignore-Aware Snapshots

    • Uses go-git library for proper gitignore pattern parsing
    • Avoids snapshotting node_modules, build artifacts, etc.

💡 Performance Suggestions

  1. Archive Backend Buffer Size

    • archive.go:194: Consider a larger read buffer (currently 4KB default from WalkDir)
    • For large workspaces, buffered I/O could improve snapshot creation speed
  2. Darwin Tracer Poll Interval

    • 100ms polling may be aggressive for long-running agents
    • Consider making pollInterval configurable via Config

Potential Issues

🐛 Bugs

  1. APFS IsAPFS() Function Unreliable (apfs_darwin.go:156-184)

    • String matching on diskutil output is fragile
    • Falls back to mount point check, then checks for substring "apfs"
    • Impact: Low (APFS backend disabled by default)
    • Suggestion: If re-enabling APFS backend in future, use a more robust check (e.g., statfs syscall with f_fstypename)
  2. Potential Race in Metadata Corruption Recovery (engine.go:237)

    • Error message suggests deleting snapshots.json, but doesn't handle orphaned .tar.gz files
    • Impact: Low (user recoverable, data not lost)
    • Suggestion: Add a recovery command or auto-rebuild metadata from existing archives

⚠️ Minor Issues

  1. Magic Numbers (tracer_linux.go:238)

    • if n >= 52 uses hardcoded minimum message size
    • Suggestion: Define as constant minProcEventSize = 52 // nlhdr(16) + cnhdr(20) + proc_event(16)
  2. Error Shadowing (archive.go:137)

    • Variable named copyErr to avoid shadowing err, but could use inline assignment
    • Minor style issue, not a bug
  3. Inconsistent Error Handling for Close() (archive.go:57, 138)

    • Line 57: defer file.Close() ignores error
    • Line 138: f.Close() called immediately, error not checked
    • Line 268: Close error properly checked
    • Suggestion: Consistently check close errors in functions that return errors

Test Coverage Analysis

✅ Well-Tested Areas

  • Snapshot creation, restoration, deletion (archive backend)
  • Gitignore pattern matching
  • Symlink handling and path traversal attacks
  • Concurrent tracer operations (race condition tests)
  • PID tracking and cleanup
  • Build/git commit detection

🧪 Suggested Additional Tests

  1. Large File Handling

    • Test snapshot creation with files approaching 1GB limit
    • Verify decompression bomb protection triggers correctly
  2. Metadata Corruption Recovery

    • Test loadMetadata() behavior with corrupted JSON
    • Test recovery when .tar.gz exists but not in metadata
  3. CLI Command Integration Tests

    • End-to-end test of the moat snapshot → moat rollback flow
    • Test error messages match expected format

Architecture & Design

✅ Strong Design Decisions

  1. Backend Abstraction

    • Backend interface allows swapping implementations
    • Engine auto-detects best backend (currently always archive, ready for future APFS cp -c implementation)
  2. Metadata Persistence

    • snapshots.json survives process restarts
    • Clean separation between metadata and native refs
  3. Type Safety

    • Type enum for snapshot types (Auto, Manual, Safety, Rollback)
    • Prevents string-based type confusion

💭 Design Questions

  1. Snapshot Retention Policy

    • No automatic cleanup of old snapshots
    • For long-running agents, disk usage could grow unbounded
    • Suggestion: Consider adding retention config (e.g., keep last N auto snapshots)
  2. Snapshot Compression Level

    • Uses default gzip compression (level 6)
    • Suggestion: Consider exposing compression level in ArchiveOptions for speed vs size tradeoff

Documentation

✅ Good Documentation

  • Package-level docs for trace package (doc.go)
  • Inline comments explain security validations
  • CLI help text clear and includes examples
  • Commit messages follow Conventional Commits format

📝 Documentation Suggestions

  1. agent.yaml Schema

    • Add example agent.yaml showing snapshot and tracing configuration
    • Document what UseGitignore and Additional do
  2. APFS Limitation

    • Create a GitHub issue documenting why APFS tmutil approach is incorrect
    • Link issue from engine.go:92 comment (currently has placeholder URL)
  3. Tracer Limitations

    • Document that Linux tracer requires CAP_NET_ADMIN
    • Document macOS tracer polling frequency and accuracy tradeoffs

Specific File Comments

internal/snapshot/archive.go

Lines 88-94: Git directory exclusion is correctly handled, but the logic is duplicated.

if relPath == ".git" || strings.HasPrefix(relPath, ".git/") || 
   strings.HasPrefix(relPath, ".git"+string(filepath.Separator)) {

Suggestion: On Unix the second and third conditions are identical (filepath.Separator is /), so one can be dropped. Note that collapsing further to strings.HasPrefix(relPath, ".git") would wrongly exclude files like .gitignore, so keep the separator-qualified prefix check.

internal/trace/event.go

Lines 59-69: Build command detection is clever with the word boundary matching!

// Pattern followed by space (e.g., "build ./..." or "run build --production")
if strings.HasPrefix(argsStr, pattern+" ") {

This correctly avoids false positives like npm run build-docker. Nice work!

internal/trace/tracer_darwin.go

Lines 18-42: Architecture-specific struct offsets are well-documented

  • Clean separation for arm64 vs amd64
  • Suggestion: Consider an init-time check that fails fast on unknown architectures instead of silently falling through to amd64 offsets

cmd/moat/cli/rollback.go

Lines 111-117: Safety snapshot feature is excellent UX

fmt.Print("Creating safety snapshot of current state... ")
safetySnap, err := engine.Create(snapshot.TypeSafety, "pre-rollback")
  • Prevents destructive accidents
  • Clear undo message printed at end

Summary of Recommendations

High Priority

  1. ✅ Approve and merge - code is production-ready
  2. 📝 Create GitHub issue for APFS cp -c implementation (replace placeholder URL in engine.go:92)
  3. 🧹 Add snapshot retention policy or document disk usage growth

Medium Priority

  1. 🐛 Make Darwin tracer poll interval configurable
  2. 📚 Add agent.yaml examples to docs
  3. 🧪 Add metadata corruption recovery tests

Low Priority (Polish)

  1. 🎨 Extract magic number 52 to named constant in tracer_linux.go
  2. 🔧 Consistent error handling for Close() operations
  3. 📖 Document tracer privilege requirements in main README

Conclusion

This is excellent work that adds significant value to the moat project. The implementation is secure, performant, and well-tested. The conscious decision to disable the APFS backend until a proper implementation lands shows good engineering judgment. The comprehensive test suite and attention to security details (especially symlink path traversal protection) demonstrate production-quality code.

The code follows the CLAUDE.md guidelines well - good error messages, conventional commits, and practical examples in CLI help text.

Ship it! 🚀

@claude

claude Bot commented Jan 21, 2026

Code Review: Workspace Snapshots and Process Tracing

Overview

This PR adds significant new functionality: workspace snapshots with APFS and archive backends, process tracing for Linux and macOS, and rollback capabilities. The implementation is generally well-designed with good test coverage, but there are several issues that need attention.


Critical Issues

1. Security: Command Injection Risk in APFS Backend

File: internal/snapshot/apfs_darwin.go (lines 52, 121, 157, 209, 228)

While the current implementation uses separate arguments (preventing shell injection), paths come from user-controlled input. If an attacker can control the workspace path or snapshot ID, they could potentially pass malicious arguments starting with -.

Recommendation: Add path validation:

// Before executing commands, validate paths
if strings.HasPrefix(filepath.Base(workspacePath), "-") {
    return "", fmt.Errorf("invalid workspace path: cannot start with -")
}

2. Performance: Potential Resource Leak in Archive Creation

File: internal/snapshot/archive.go (lines 59-63)

Deferred closes happen in reverse order, but errors from Close() are ignored. This could lead to incomplete archives.

Recommendation: Use a named return value so errors from the deferred closes actually propagate (a plain local assigned inside the defer is lost once the return value has been set):

// Assumes the enclosing function declares a named return: (err error)
defer func() {
    if cerr := tw.Close(); cerr != nil && err == nil {
        err = cerr
    }
    if cerr := gw.Close(); cerr != nil && err == nil {
        err = cerr
    }
    if cerr := file.Close(); cerr != nil && err == nil {
        err = cerr
    }
}()

3. Security: Add Total Archive Size Limit

File: internal/snapshot/archive.go (line 267)

The 1GB per-file limit is excellent, but a malicious archive could still contain thousands of 1GB files.

Recommendation: Add a total archive size limit or file count limit in the restore loop.


Major Issues

4. Race Condition: Engine Metadata Concurrency

File: internal/snapshot/engine.go (lines 104-133)

The mutex is held during potentially long-running I/O operations (creating snapshots), which could block other goroutines unnecessarily.

Recommendation: Narrow the critical section:

id := NewID()
nativeRef, err := e.backend.Create(e.workspace, id)  // Outside lock
if err != nil {
    return Metadata{}, fmt.Errorf("backend create: %w", err)
}

e.mu.Lock()
defer e.mu.Unlock()
// Only metadata operations under lock

5. Process Tracing: Missing Privilege Requirements Documentation

File: internal/trace/tracer_linux.go (line 86)

There's no package-level documentation about privilege requirements.

Recommendation: Add godoc comments:

// NewProcConnectorTracer creates a new proc connector tracer.
// Requires CAP_NET_ADMIN capability or root privileges on Linux.
// Returns an error if the netlink socket cannot be created.

6. Darwin Tracer: Hardcoded Structure Offsets

File: internal/trace/tracer_darwin.go (lines 28-41)

These offsets are fragile and could break across macOS versions.

Recommendation:

  1. Add version detection and skip gracefully on unsupported versions
  2. Document the supported macOS versions in documentation
  3. Consider adding a warning when running on untested macOS versions

Minor Issues

7. Error Handling: Quote Path in Recovery Message

File: internal/snapshot/engine.go (lines 232-235)

The error message includes an rm command with an unquoted path.

Recommendation:

return fmt.Errorf("corrupted snapshot metadata at %s: %w\nTo reset, delete the file: rm %q", path, err, path)

8. Configuration: Add Validation

File: internal/config/config.go (lines 160-165)

There's no validation that MaxCount is positive or that IdleThresholdSeconds is reasonable.

Recommendation: Add validation:

if cfg.Snapshots.Retention.MaxCount < 0 {
    return nil, fmt.Errorf("snapshots.retention.max_count must be non-negative")
}

9. Test Coverage: Missing Edge Cases

File: internal/snapshot/archive_test.go

Excellent coverage for symlink attacks and basic operations, but missing tests for:

  • Concurrent snapshot creation/deletion
  • Disk full scenarios during restore
  • Corrupted tar.gz archives
  • Partial restore failures (cleanup verification)

10. Performance: Darwin Tracer Polling Not Configurable

File: internal/trace/tracer_darwin.go (line 51)

100ms polling might be too coarse for short-lived processes, but there's no way to configure it.

Recommendation: Make polling interval configurable via Config struct.


Positive Highlights

  • Excellent test coverage with regression tests for race conditions
  • Security-conscious design with symlink validation and decompression bomb protection
  • Great error messages following CLAUDE.md guidelines with actionable recovery steps
  • Good separation of concerns with Backend interface abstraction
  • Platform-specific implementations properly isolated with build tags
  • Comprehensive design documentation


Summary

Recommendation: Approve with changes

The code is well-designed and follows good practices. The symlink path traversal protection is excellent. Main concerns:

Must fix before merge:

  1. Add path validation for command arguments in APFS backend
  2. Add total archive size limit in restore
  3. Fix error handling for writer close operations

Should fix before merge:
4. Narrow critical sections in Engine.Create()
5. Add configuration validation for snapshot settings
6. Document privilege requirements for Linux tracer

Nice to have:
7. Add more edge case tests
8. Make Darwin tracer polling interval configurable
9. Add version detection for Darwin tracer offsets

The PR adds valuable functionality with good attention to security and testing. The identified issues are addressable and don't represent fundamental design flaws.

@claude

claude Bot commented Jan 21, 2026

Security and Code Quality Review - PR #36

I've completed a thorough security and code quality review of the workspace snapshots and process tracing implementation.

Executive Summary

Overall Assessment: EXCELLENT ✅

This PR demonstrates strong security practices and solid engineering. The implementation shows careful attention to security-critical details like symlink path traversal prevention, race condition mitigation, and proper error handling.

Key Strengths:

  • Symlink path traversal protection is thoroughly implemented with comprehensive test coverage
  • Race condition fixes are well-designed with proper synchronization
  • Excellent test coverage with edge case testing (11 archive tests, 8 regression tests)
  • Resource cleanup is handled carefully (file descriptor management)
  • Error messages are clear and actionable per CLAUDE.md guidelines

1. Security Analysis

✅ Symlink Path Traversal Prevention (EXCELLENT)

Location: internal/snapshot/archive.go:274-296

The symlink handling demonstrates excellent security practice:

// Reject absolute symlink targets - they could point anywhere on the filesystem
if filepath.IsAbs(header.Linkname) {
    return fmt.Errorf("invalid symlink in archive: absolute path not allowed: %s -> %s",
        header.Name, header.Linkname)
}

// Resolve the symlink target relative to its location within destPath
symlinkDir := filepath.Dir(targetPath)
resolvedTarget := filepath.Join(symlinkDir, header.Linkname)
resolvedTarget = filepath.Clean(resolvedTarget)

// Verify the resolved target stays within destPath
relToDestPath, err := filepath.Rel(destPath, resolvedTarget)
if err != nil || strings.HasPrefix(relToDestPath, "..") {
    return fmt.Errorf("invalid symlink in archive: target escapes destination: %s -> %s",
        header.Name, header.Linkname)
}

Validated against:

  • ✅ Absolute symlinks (/etc/passwd)
  • ✅ Relative path traversal (../../../etc/passwd)
  • ✅ Subdirectory escapes
  • ✅ Test coverage: TestArchiveBackendSymlinkPathTraversal with 5 test cases

✅ Regular File Path Traversal Prevention (EXCELLENT)

Location: internal/snapshot/archive.go:238-246

Every extracted file path is validated before filesystem operations:

targetPath := filepath.Join(destPath, header.Name) //nolint:gosec // G305: validated below

relToDestPath, err := filepath.Rel(destPath, targetPath)
if err != nil || strings.HasPrefix(relToDestPath, "..") {
    return fmt.Errorf("invalid path in archive: %s", header.Name)
}

✅ Decompression Bomb Protection (GOOD)

Location: internal/snapshot/archive.go:266-267

// Limit copy size to prevent decompression bombs (1GB max per file)
if _, err := io.Copy(f, io.LimitReader(tr, 1<<30)); err != nil {

The 1GB per-file limit is reasonable for workspace snapshots.

✅ Command Execution Safety (GOOD)

Location: internal/snapshot/apfs_darwin.go

All command executions use exec.Command with separate args (no shell invocation):

cmd := exec.Command("cp", "-c", "-R", "-p", workspacePath, clonePath)
cmd := exec.Command("diskutil", "info", mountPoint)

No injection risk as arguments aren't shell-interpreted.


2. Race Conditions & Concurrency

✅ ProcConnectorTracer Synchronization (EXCELLENT)

Location: internal/trace/tracer_linux.go

The synchronization design is careful and correct:

Mutex Strategy:

  • mu sync.Mutex: Guards callbacks, started/stopped state
  • pidMu sync.RWMutex: Guards shared PID map (allows concurrent reads)

Key Pattern - Event Emission:

func (t *ProcConnectorTracer) emitEvent(event ExecEvent) {
    t.mu.Lock()
    if t.stopped {
        t.mu.Unlock()
        return
    }
    // Copy callbacks under lock
    cbs := make([]func(ExecEvent), len(t.callbacks))
    copy(cbs, t.callbacks)

    // Non-blocking channel send
    select {
    case t.events <- event:
    default:
        t.droppedEvents++
    }
    t.mu.Unlock()

    // Invoke callbacks OUTSIDE lock to prevent deadlock
    for _, cb := range cbs {
        cb(event)
    }
}

Strengths:

  • ✅ Minimal critical section duration
  • ✅ Callback invocation outside lock (prevents deadlocks)
  • ✅ Non-blocking channel send with drop counter
  • ✅ RWMutex for PID tracking (concurrent reads)

Test Coverage: tracer_regression_test.go includes 8 regression tests:

  • Concurrent Stop() calls
  • Double-Stop() safety
  • Concurrent OnExec() and Emit() (100 iterations)
  • Event dropping when channel full

3. Error Handling & Edge Cases

✅ File Descriptor Management (EXCELLENT)

Location: internal/snapshot/archive.go:138

if info.Mode().IsRegular() {
    f, err := os.Open(path)
    if err != nil {
        return fmt.Errorf("open file %s: %w", relPath, err)
    }
    _, copyErr := io.Copy(tw, f)
    f.Close() // Close immediately, not deferred, to avoid accumulating file handles
    if copyErr != nil {
        return fmt.Errorf("copy file content %s: %w", relPath, copyErr)
    }
}

Closes immediately instead of deferred to avoid fd leaks in loops - subtle but important.

✅ Archive Restore Error Recovery (EXCELLENT)

Location: internal/snapshot/archive.go:156-211

The restore operation preserves .git across destructive restore with best-effort recovery on failures:

// Backup .git before cleaning workspace
gitBackup := gitDir + ".backup"
if err := os.Rename(gitDir, gitBackup); err != nil {
    return fmt.Errorf("backup .git directory: %w", err)
}

// Extract archive
if err := b.RestoreTo(nativeRef, workspacePath); err != nil {
    if gitBackup != "" {
        _ = os.Rename(gitBackup, gitDir) // Best-effort recovery
    }
    return fmt.Errorf("extract archive: %w", err)
}

Test: TestArchiveBackendPreservesGitDirOnRestore validates this behavior.

✅ Engine Metadata Consistency (EXCELLENT)

Location: internal/snapshot/engine.go:103-133

Snapshot creation is atomic - if metadata save fails, the snapshot is cleaned up:

nativeRef, err := e.backend.Create(e.workspace, id)
if err != nil {
    return Metadata{}, fmt.Errorf("backend create: %w", err)
}

meta := Metadata{...}
e.snapshots[id] = meta

if err := e.saveMetadata(); err != nil {
    // Clean up the snapshot if we can't save metadata
    _ = e.backend.Delete(nativeRef)
    delete(e.snapshots, id)
    return Metadata{}, fmt.Errorf("save metadata: %w", err)
}

Prevents orphaned snapshots.


4. Platform-Specific Code Correctness

✅ Build Tags (EXCELLENT)

Proper platform separation:

  File              Build Tag            Purpose
  apfs_darwin.go    //go:build darwin    APFS implementation
  apfs_stub.go      //go:build !darwin   Stub for non-macOS
  tracer_linux.go   //go:build linux     Netlink implementation
  tracer_darwin.go  //go:build darwin    sysctl polling
  tracer_stub.go    (no tag)             Universal fallback

✅ macOS Architecture Handling (EXCELLENT)

Location: internal/trace/tracer_darwin.go:27-42

Correctly handles ARM64 and x86_64 with different kinfo_proc offsets:

func init() {
    switch runtime.GOARCH {
    case "arm64":
        procSize = 648
        pidOffset = 72
        ppidOffset = 76
        commOffset = 243
        startOffset = 128
    default: // amd64/x86_64
        procSize = 492
        pidOffset = 68
        ppidOffset = 72
        commOffset = 163
        startOffset = 120
    }
}

5. Test Coverage Quality

✅ Archive Backend Tests (EXCELLENT)

File: internal/snapshot/archive_test.go

11 test functions covering:

  • Basic create/restore/delete
  • .gitignore exclusion (including nested .gitignore files)
  • Custom exclude patterns
  • .git preservation across restore
  • Security: Symlink path traversal (5 attack scenarios)
  • File permissions preservation
  • Empty workspace edge case

Helper function createMaliciousArchive for security testing.

✅ Tracer Regression Tests (EXCELLENT)

File: internal/trace/tracer_regression_test.go

8 tests specifically for concurrency issues:

  • Multiple concurrent Stop() calls
  • Sequential double-stop
  • Race between emit and stop (100 iterations)
  • Concurrent callback registration
  • Channel overflow handling
  • Double-start prevention
  • Stop before start
  • Start after stop

6. Code Quality & Best Practices

✅ Error Messages (EXCELLENT - per CLAUDE.md)

return fmt.Errorf("create netlink socket: %w (requires CAP_NET_ADMIN or root)", err)
return fmt.Errorf("invalid symlink in archive: target escapes destination: %s -> %s",
                  header.Name, header.Linkname)
return fmt.Errorf("corrupted snapshot metadata at %s: %w\nTo reset, delete the file: rm %s",
                 path, err, path)
  • ✅ Specific and actionable
  • ✅ Include context (e.g., CAP_NET_ADMIN requirement)
  • ✅ Chain errors with %w
  • ✅ Provide recovery instructions

✅ Code Style (EXCELLENT)

  • Follows Go conventions
  • Consistent naming
  • Comments explain "why" not "what"
  • Interface compliance checks: var _ Backend = (*ArchiveBackend)(nil)

✅ Documentation (EXCELLENT)

  • Package-level doc.go files
  • CLI commands have Long descriptions with examples
  • Inline comments for non-obvious logic

Summary

Ready for Merge

This PR demonstrates strong security engineering with no critical vulnerabilities found.

  Category                 Status        Notes
  Symlink Path Traversal   ✅ Excellent   Thoroughly validated, comprehensive test coverage
  Command Injection        ✅ Good        Uses exec.Command safely
  Race Conditions          ✅ Excellent   Careful mutex usage, regression tests
  Decompression Bombs      ✅ Good        1GB per-file limit
  File Descriptor Leaks    ✅ Excellent   Immediate close in loops
  Error Handling           ✅ Excellent   Comprehensive recovery, clear messages
  Platform Code            ✅ Excellent   Proper build tags, architecture handling
  Test Coverage            ✅ Excellent   11 archive tests, 8 regression tests
  Code Style               ✅ Excellent   Follows CLAUDE.md guidelines

Optional Improvements (non-blocking):

  1. Consider documenting the path traversal protections in a security doc
  2. Consider making the decompression bomb limit configurable
  3. Consider adding benchmarks for large workspace snapshots

Great work on this implementation! The attention to security details, comprehensive testing, and careful concurrency handling make this a high-quality addition to the codebase.

@claude

claude Bot commented Jan 21, 2026

Code Review: PR #36 - Workspace Snapshots and Process Tracing

I've completed a comprehensive review of this PR, examining security-critical components, race condition fixes, error handling, and test coverage. Overall, this is high-quality, security-conscious code that delivers well on its promises.


🟢 STRENGTHS

Security (Excellent)

  • Archive path traversal protections are robust and well-implemented (internal/snapshot/archive.go:252-307)
    • Correct use of filepath.Rel() to validate paths before extraction
    • Defense-in-depth symlink validation: rejects absolute symlinks AND validates relative targets
    • Excellent test coverage in TestArchiveBackendSymlinkPathTraversal
  • Zip/decompression bomb mitigations: 100k file limit + 1GB per-file limit
  • APFS command injection prevention: rejects paths starting with -

Race Condition Handling (Excellent)

The tracer event emission code is textbook-perfect (internal/trace/tracer_linux.go:385-402, tracer_darwin.go:434-457):

  1. Checks stopped flag while holding lock (prevents send-on-closed-channel)
  2. Non-blocking send with select/default (prevents deadlock)
  3. Copies callbacks under lock, invokes outside lock (prevents callback deadlock)
  4. Tracks dropped events for observability

Comprehensive regression tests in tracer_regression_test.go with 100-iteration concurrent tests are excellent.

Error Messages (Excellent)

Perfect adherence to CLAUDE.md guidance: "Good error messages are documentation"

Example from engine.go:234:

return nil, fmt.Errorf("corrupted snapshot metadata at %s: %w\nTo reset, delete the file: rm %q", path, err, path)

Tells users exactly what's wrong AND how to fix it.

Test Coverage (Excellent)

  • 729 lines of archive tests covering 16 scenarios
  • Security-focused tests for path traversal attacks
  • Race condition regression tests with concurrent execution
  • Integration tests with real process execution
  • go test -race ./... passes ✅

Go Best Practices (Excellent)

  • Interface compliance checks: var _ Backend = (*ArchiveBackend)(nil)
  • Consistent mutex patterns with defer mu.Unlock()
  • Proper build tags for platform-specific code
  • Clean stub implementations for unsupported platforms

🟡 ISSUES FOUND

🔴 Medium Severity: Rollback Safety Snapshot Accumulation

File: cmd/moat/cli/rollback.go:111-117

Each rollback attempt creates a fresh safety snapshot, so repeated failed restore attempts accumulate snapshots and can fill the disk; and if creating the safety snapshot itself fails (e.g., disk full), rollback aborts entirely.

Recommendation: Check for existing safety snapshots and reuse them, or add a --skip-safety flag.


🟡 Low Severity: CLI Prune Returns Success on Partial Failure

File: cmd/moat/cli/snapshots.go:229-248

The prune command continues after deletion failures (reasonable for bulk operations) but returns nil even if some deletions failed.

Fix:

if failed > 0 {
    return fmt.Errorf("failed to delete %d of %d snapshots", failed, len(toDelete))
}
return nil

🟡 Low Severity: Darwin kinfo_proc Structure Offsets

File: internal/trace/tracer_darwin.go:19-42

Hard-coded structure offsets for kinfo_proc are architecture-specific AND macOS version-specific. No runtime validation that offsets are correct.

Concerns:

  • Silent data corruption if offsets are wrong
  • Binary parsing with binary.LittleEndian.Uint32 (lines 297, 300) could read garbage
  • Integration test acknowledges potential for incorrect results (line 53)

Recommendations:

  1. Add runtime validation in Start() to verify offsets (e.g., check PID 1 exists with expected properties)
  2. Document which macOS versions these offsets are valid for
  3. Consider falling back to the stub tracer if validation fails

🟢 Minor: APFS Path Validation Only Checks Base Name

File: internal/snapshot/apfs_darwin.go:35-40

if strings.HasPrefix(filepath.Base(workspacePath), "-") {
    return "", fmt.Errorf("invalid workspace path: name cannot start with -")
}

Only validates the final path component. A relative path such as -malicious/workspace would pass validation (its base is workspace) yet still begin with -. This is likely acceptable since paths are application-constructed, but full path validation would be more robust.


🔵 ENHANCEMENT SUGGESTIONS (Non-Blocking)

1. Atomic Metadata Writes

File: internal/snapshot/engine.go:262

Use write-to-temp-then-rename pattern to prevent metadata corruption if process is killed mid-write:

tmpPath := path + ".tmp"
if err := os.WriteFile(tmpPath, data, 0644); err != nil {
    return err
}
return os.Rename(tmpPath, path)

2. Total Snapshot Size Limit

Archive backend has file count (100k) and per-file size (1GB) limits, but no total snapshot size limit. Consider adding:

const maxSnapshotSize = 10 << 30 // 10GB

3. Tracer Metrics Exposure

The tracer tracks droppedEvents but only logs on Stop(). Consider exposing via Metrics() method for runtime observability.

4. Configurable Darwin Polling Interval

100ms polling = 10 syscalls/sec. Consider making configurable or using longer interval (200-500ms) for low-priority tracing.


📊 PERFORMANCE

  • APFS Copy-on-Write: Using cp -c is correct - instant clones with no space overhead
  • Archive Gitignore: Efficient single-pass tree walk with pattern building
  • Linux Proc Connector: Event-driven netlink is far superior to /proc polling
  • ⚠️ Darwin Polling: 100ms interval is reasonable but consider making configurable


📝 DOCUMENTATION

  • Excellent inline documentation for privilege requirements, security model, and platform differences
  • Good CLI help text with examples and explanations
  • Comprehensive design docs in docs/plans/


✅ RECOMMENDATION: APPROVE WITH MINOR FIXES

This PR maintains the high code quality standards of the moat codebase. The snapshot implementation is well-architected, security-conscious, and thoroughly tested.

Required before merge:

  1. Fix CLI prune command to return error on partial failure
  2. Add validation for Darwin kinfo_proc offsets (or document known-working macOS versions)

Suggested enhancements (can be follow-up PRs):

  1. Atomic metadata file writes
  2. Total snapshot size limits
  3. Tracer metrics exposure
  4. Configurable Darwin polling interval

No critical bugs or security vulnerabilities found. The symlink path traversal fixes are solid, race conditions are properly handled, and test coverage is excellent. Great work! 🎉

@claude

claude Bot commented Jan 21, 2026

PR Review: Workspace Snapshots and Process Tracing

This is a substantial and well-implemented feature addition. Overall, the code quality is excellent with strong attention to security, error handling, and testing.

Summary

This PR adds two major features:

  1. Workspace Snapshots: APFS-backed (macOS) and archive-backed snapshot system
  2. Process Tracing: Platform-specific execution tracing via Linux proc connector and macOS sysctl polling

Strengths

Code Quality

  • Excellent security hardening: Path traversal prevention, symlink validation, command injection protection, zip bomb limits
  • Strong error handling: Detailed error messages with actionable guidance (per CLAUDE.md)
  • Platform-specific implementations: Clean use of build tags
  • Good observability: Dropped event tracking, structured logging
  • Comprehensive testing: Unit, integration, and regression tests

Architecture

  • Clean abstractions via Backend interface
  • Thread-safe operations with proper mutex usage
  • Graceful degradation with stub implementations

Issues & Recommendations

1. Security: Archive File Count Limit (Low)

Location: internal/snapshot/archive.go:215

maxArchiveFiles is 100,000 which seems high. Most projects have fewer files. Consider 25-50k for better zip bomb protection.

2. Prune Command Error Handling (Medium)

Location: cmd/moat/cli/snapshots.go:244

Returns error after partial success. Document this behavior in help text or make it a warning instead.

3. APFS Detection Robustness (Low)

Location: internal/snapshot/apfs_darwin.go:219-242

IsAPFS uses diskutil which could fail in restricted environments. Current fallback is safe but could have better error handling.

4. Documentation: APFS Evolution (Low)

Add comment explaining why cp -c vs tmutil (volume-level vs directory-level snapshots).

5. Cleanup Interval Configuration (Low)

Location: internal/trace/tracer_linux.go:69

60s cleanup interval is hardcoded. Consider making configurable for short-lived process tracking.

Performance

  • APFS cp -c: instant (COW)
  • Archive: streaming tar.gz (good)
  • Archive walks full tree: could be slow (consider progress feedback)
  • Linux tracer: event-driven (efficient)
  • Darwin tracer: 100ms polling (acceptable)

Test Coverage

Excellent coverage with unit, integration, regression, security, and edge case tests.

Security Review

  1. Path Traversal: Validated with filepath.Rel
  2. Command Injection: Checks paths don't start with -
  3. Zip Bomb: 100k files, 1GB per file limits
  4. Symlinks: Rejects absolute, validates relative
  5. Privileges: Documents CAP_NET_ADMIN requirement

Final Verdict

Approve with minor suggestions

High-quality, production-ready code with strong security awareness, excellent error handling, clean architecture, and comprehensive testing. Issues are low-medium priority and don't block merging.

Optional follow-ups:

  • Add TESTING.md
  • Progress feedback for large archives
  • Document APFS backend rationale
  • Review maxArchiveFiles limit
  • Document prune partial failure behavior

Great work! The attention to security and edge cases is commendable.

@claude

claude Bot commented Jan 21, 2026

Code Review: PR #36 - Workspace Snapshots and Process Tracing

Overview

This PR adds comprehensive workspace snapshot functionality with APFS and archive backends, plus platform-specific process tracing. The implementation is substantial (9,364 lines added) with good test coverage. Overall, this is high-quality code with strong security practices, but there are several critical issues that should be addressed before merge.


🔴 Critical Issues

1. Command Injection Vulnerability in APFS Backend

File: internal/snapshot/apfs_darwin.go:35-40, 83-87, 159-163

The argument injection protection only checks filepath.Base(), which validates only the final path component. A path like /tmp/normal/../../-evil would pass validation but still inject arguments.

Recommendation: Use the -- argument separator with cp:

cmd := exec.Command("cp", "-c", "-R", "-p", "--", workspacePath, clonePath)

Or validate the full resolved path:

absPath, err := filepath.Abs(workspacePath)
if err != nil {
    return "", fmt.Errorf("invalid workspace path: %w", err)
}
if strings.Contains(absPath, "/-") {
    return "", fmt.Errorf("invalid workspace path: path components cannot start with -")
}

2. Potential DoS via Compression Bomb

File: internal/snapshot/archive.go:215, 278

While there's a 1GB per-file limit and 100,000 file count limit, there's no check for overall decompressed size. An attacker could create a tar.gz with 100,000 files × 1GB each = 100TB extraction.

Recommendation: Add cumulative size tracking:

const maxTotalSize = 10 << 30 // 10GB total
var totalWritten int64

// In extraction loop:
if totalWritten > maxTotalSize {
    return fmt.Errorf("archive exceeds maximum size limit (%d bytes)", maxTotalSize)
}
written, err := io.Copy(f, io.LimitReader(tr, 1<<30))
totalWritten += written

3. Integer Overflow Risk in Darwin Tracer

File: internal/trace/tracer_darwin.go:308-311, 326

The type conversions have //nolint:gosec comments but no validation. If kernel structures change, this could produce invalid values.

Recommendation: Add bounds checking:

pidRaw := binary.LittleEndian.Uint32(buf[pidOffset:])
if pidRaw > math.MaxInt32 {
    return processInfo{} // Invalid PID
}
info.pid = int(int32(pidRaw))

🟡 High Priority Issues

4. Snapshot ID Collision Risk

File: internal/snapshot/snapshot.go:39-45

4 bytes (32 bits) provide only ~4 billion unique IDs, giving a 50% collision probability after ~65,000 snapshots (birthday paradox). The timestamp fallback is also non-deterministic.

Recommendation: Use 8 bytes for IDs:

func NewID() string {
    b := make([]byte, 8) // 64 bits = better collision resistance
    if _, err := rand.Read(b); err != nil {
        return fmt.Sprintf("snap_%016x", time.Now().UnixNano())
    }
    return "snap_" + hex.EncodeToString(b)
}

5. Missing Metadata Corruption Recovery

File: internal/snapshot/engine.go:232-235

If metadata is corrupted, the entire snapshot system fails, even though actual snapshot data still exists on disk.

Recommendation: Add automatic recovery that rescans the snapshot directory and rebuilds metadata from discovered snapshots.


6. Linux Tracer Memory Leak

File: internal/trace/tracer_linux.go:317-327

In cleanupStalePIDs(), if os.Stat() fails with an error other than IsNotExist (e.g., permission denied), the PID remains tracked indefinitely.

Fix:

if _, err := os.Stat(procPath); err != nil {
    // Remove PID on any error - if we can't check, assume it's gone
    delete(t.trackedPIDs, pid)
}

7. Config Defaults Applied When Snapshots Disabled

File: internal/config/config.go:159-165

Snapshot defaults are applied even when cfg.Snapshots.Disabled == true.

Fix:

if !cfg.Snapshots.Disabled {
    if cfg.Snapshots.Triggers.IdleThresholdSeconds == 0 {
        cfg.Snapshots.Triggers.IdleThresholdSeconds = 30
    }
    // ... other defaults
}

🔵 Code Quality Improvements

8. Error Messages Leak Path Information

File: internal/snapshot/archive.go:234

Per CLAUDE.md, error messages should be actionable without exposing unnecessary internals:

// Instead of: "corrupted snapshot metadata at /full/path: rm /full/path"
// Use: "corrupted snapshot metadata: moat snapshots prune <run-id>"

9. Inconsistent CLI Error Handling

Files: cmd/moat/cli/rollback.go:49-51, others

CLI commands duplicate existence checks that the storage layer already performs. Delegate to storage:

store, err := storage.NewRunStore(baseDir, runID)
if err != nil {
    return fmt.Errorf("run %s not found or inaccessible: %w", runID, err)
}

10. Hardcoded Architecture Offsets

File: internal/trace/tracer_darwin.go:31-45

Document which macOS versions the offsets were tested against:

// kinfo_proc offsets for macOS 13+ (Ventura)
// Validated on: macOS 13.0, 14.0 (Sonoma), 15.0 (Sequoia)

✅ Positive Observations

  1. Excellent security: Comprehensive symlink path traversal prevention with tests
  2. Proper concurrency: All tracers correctly handle locks to prevent deadlocks
  3. Good error messages: Generally follows CLAUDE.md guidance
  4. Clean platform abstraction: Build tags properly separate Darwin/Linux/stub implementations
  5. Strong regression tests: tracer_regression_test.go provides excellent race condition coverage
  6. Defensive programming: PID 1 validation catches offset mismatches early

📊 Test Coverage Gaps

Missing edge case tests for:

  1. Snapshot pruning with pre-run snapshots (preserving them)
  2. Archive backend with deeply nested directories (max path depth)
  3. APFS backend cleanup on cp -c failure
  4. Concurrent snapshot creation (race conditions in Engine.Create())

⚡ Performance Considerations

  1. Darwin tracer: 100ms poll interval may be too aggressive for battery-powered devices - consider making it configurable
  2. Archive creation: No parallelization for large workspaces - consider progress indication for >1GB
  3. Metadata persistence: Saves entire list on every create/delete - could cause I/O pressure with many snapshots

📝 Conventions (CLAUDE.md)

✅ No Co-Authored-By in commit message
✅ Standard Go formatting
✅ Generic placeholder names in examples
⚠️ Some error messages leak path info (see issue #8)


Summary

Overall Assessment: High-quality implementation with strong security practices and comprehensive testing. The critical issues (#1-3) represent security/reliability risks that should be addressed before merge. Medium-priority issues are mostly robustness and UX improvements.

Recommendation: Request changes for critical issues, then approve with optional follow-up items.

Estimated effort:

  • Critical fixes: 2-3 hours
  • High priority: 4-5 hours
  • Code quality improvements: 3-4 hours (optional)

Add ExecEvent struct for capturing command execution events including:
- Process info (PID, PPID, timestamp, working directory)
- Command execution details (command, args, exit code, duration)
- IsGitCommit() method to detect git commit operations
- IsBuildCommand() method to detect common build commands

Build detection supports npm, yarn, go, make, cargo, mvn, gradle with
proper word boundary matching to avoid false positives (e.g., npm run
build-docker is correctly not detected as a build command).
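The word-boundary matching described above can be sketched as exact-token comparison over the argument list; the tool list and keyword handling here are illustrative assumptions, not the PR's actual implementation:

```go
package main

import "fmt"

// Supported build tools, per the commit message above.
var buildTools = map[string]bool{
	"npm": true, "yarn": true, "go": true, "make": true,
	"cargo": true, "mvn": true, "gradle": true,
}

// isBuildCommand (sketch): a command counts as a build only when "build"
// appears as a whole argument token, so "npm run build" matches but
// "npm run build-docker" does not (no substring matching).
func isBuildCommand(command string, args []string) bool {
	if !buildTools[command] {
		return false
	}
	if command == "make" {
		return true // bare make builds the default target
	}
	for _, a := range args {
		if a == "build" { // exact token: "build-docker" does not qualify
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isBuildCommand("npm", []string{"run", "build"}))        // true
	fmt.Println(isBuildCommand("npm", []string{"run", "build-docker"})) // false
}
```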
Implements ArchiveBackend that creates tar.gz snapshots with:
- Gitignore pattern support via go-git library
- Additional exclude patterns configuration
- .git directory preservation during restore
- Symlink and file permission preservation
- Path traversal protection on extract
Add Engine type that provides high-level snapshot management:
- Auto-detects APFS on macOS, falls back to archive backend
- Persists metadata to snapshots.json for durability across restarts
- Thread-safe Create/Restore/RestoreTo/Delete/List/Get operations
- Supports ForceBackend option for testing specific backends
- Passes through UseGitignore and Additional exclude patterns
- Reject absolute symlink targets (could point anywhere on filesystem)
- Validate relative symlinks don't escape destination directory
- Add comprehensive test cases for path traversal attack vectors
- Use filepath.Rel to properly detect escape attempts via ../
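The symlink rules above can be sketched as follows. The function name and shape are illustrative; the key points from the commit are rejecting absolute targets outright and resolving relative targets against the link's own directory before the filepath.Rel check:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// validateSymlink (sketch): reject absolute symlink targets, then resolve
// the relative target against the link's parent directory and confirm it
// stays inside destDir.
func validateSymlink(destDir, linkPath, target string) error {
	if filepath.IsAbs(target) {
		return fmt.Errorf("symlink %q: absolute target %q rejected", linkPath, target)
	}
	resolved := filepath.Join(destDir, filepath.Dir(linkPath), target)
	rel, err := filepath.Rel(destDir, resolved)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return fmt.Errorf("symlink %q: target %q escapes destination", linkPath, target)
	}
	return nil
}

func main() {
	fmt.Println(validateSymlink("/dst", "a/link", "../b/file")) // stays inside: nil
	fmt.Println(validateSymlink("/dst", "link", "/etc/passwd") != nil)
	fmt.Println(validateSymlink("/dst", "link", "../../outside") != nil)
}
```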
…handling

- Add stopped flag to all tracers to prevent double-close panics
- Fix emitEvent race condition by holding lock during channel send
- Add droppedEvents counter and logging for observability
- Fix Darwin grandchild tracking to properly track all descendants
- Improve Linux readLoop error handling with max consecutive errors limit
- Add comprehensive regression tests for concurrent operations
- Add cleanupStalePIDs() to Linux tracer to handle missed EXIT events
- Add started flag to StubTracer for consistency with other tracers
- Fix misleading test comment that referenced old behavior
- Add TestStubTracerDoubleStart to verify Start() returns error when
  called twice
- Add Linux-specific tests for cleanupStalePIDs function
- Fix errcheck: explicitly ignore os.Rename errors in recovery paths
- Fix gosec G115: add nolint for safe integer conversions
- Fix gosec G305/G110: add nolint with validation comments, limit copy size
- Fix govet shadow: rename shadowed err variables
- Fix prealloc: pre-allocate slices with known capacity
- Fix staticcheck: remove unused metas slice in test
- Fix file close error handling in archive.go (check close errors)
- Fix file close error handling in storage.go WriteExecEvent
- APFS stub now returns ErrAPFSNotAvailable instead of silently succeeding
- Add recovery guidance for corrupted snapshot metadata
- Add consecutive error tracking to Darwin tracer (like Linux tracer)
…fault

The APFS backend incorrectly uses tmutil (Time Machine) which creates
volume-level snapshots, not directory-level snapshots. This means:
- Snapshots are of the entire disk, not the workspace
- Restore requires root privileges
- Restore affects the entire volume, not just the workspace

The correct approach for APFS workspace snapshots would be to use
cp -c for copy-on-write directory cloning. Until that's implemented,
default to the archive backend which correctly handles directory-level
snapshots.

The APFS backend is still available via ForceBackend option but with
a warning comment.
Replace tmutil-based Time Machine snapshots with cp -c copy-on-write
directory cloning. This is the correct approach for workspace snapshots:

- tmutil creates volume-level snapshots (entire disk)
- cp -c creates directory-level clones (exactly what we need)

Benefits of cp -c on APFS:
- Instant cloning (metadata-only operation)
- Space-efficient (shared blocks until modified)
- No Time Machine entitlements required
- Works at directory granularity

Changes:
- APFSBackend.Create() now uses cp -c -R -p for COW cloning
- APFSBackend.Restore() copies files back using cp -c
- Engine auto-detects APFS and uses APFS backend when available
- Updated tests for new cp -c based implementation
- NewAPFSBackend() now takes snapshotDir parameter
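The clone invocation described above reduces to a single cp command. A sketch, including the "--" separator added later in this PR to keep paths starting with "-" from being parsed as flags (note cp -c only behaves as a COW clone on macOS/APFS):

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// cloneCmd (sketch): build the cp invocation for an APFS copy-on-write
// directory clone. -c requests cloning, -R recurses, -p preserves
// permissions, and "--" terminates flag parsing before the paths.
func cloneCmd(src, dst string) *exec.Cmd {
	return exec.Command("cp", "-c", "-R", "-p", "--", src, dst)
}

func main() {
	cmd := cloneCmd("/tmp/workspace", "/tmp/snap_ab12")
	fmt.Println(strings.Join(cmd.Args, " "))
}
```

Because the clone is a metadata-only operation, it returns in roughly constant time regardless of workspace size, and blocks are shared until either copy is modified.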
Security improvements:
- Add path validation in APFS backend to prevent argument injection
  (paths starting with "-" could be misinterpreted as flags)
- Add file count limit (100k) in archive restore to prevent zip bombs

Documentation:
- Add privilege requirements documentation for Linux proc connector tracer
- Quote path in metadata corruption error message for safe copy/paste

Addresses feedback from Claude code review.
CLI:
- Return error from prune command on partial deletion failure
  instead of silently succeeding

Darwin tracer:
- Add runtime validation of kinfo_proc structure offsets
- Validates by checking PID 1 (launchd) has expected properties
- Logs warning if validation fails (offsets may be wrong for macOS version)
- Rename ALL_CAPS constants to camelCase (revive var-naming)
- Add nolint comments for syscall.Getpid() uint32 conversion (gosec G115)

Also installed pre-commit hook to catch these issues locally.
- APFS backend: add "--" separator to all cp commands to prevent
  command injection via filenames starting with "-"
- Archive backend: add total extracted size limit (10GB) to prevent
  compression bomb attacks that could exhaust disk space
Add tests verifying:
- File count limit (100k files) triggers extraction failure
- Archive limit constants exist with reasonable values

These tests ensure the compression bomb protections are in place
and catch any accidental removal of security limits.
@claude

claude Bot commented Jan 21, 2026

Code Review: PR #36 - Workspace Snapshots and Process Tracing

Executive Summary

I've completed a comprehensive review of this PR. Overall, this is high-quality code with excellent security considerations, thorough testing, and solid engineering practices.

Recommendation: ✅ APPROVE with minor suggestions

The code is production-ready. All critical security issues have been addressed, comprehensive tests are in place, and the implementation follows Go best practices.


Key Strengths 🎉

1. Security - EXCELLENT ⭐

  • Path traversal prevention: Comprehensive validation for both regular files AND symlinks (archive.go:256-261, 303-319)
  • Zip bomb protection: File count, per-file size, and total size limits with test coverage (archive.go:214-296)
  • Command injection prevention: Validates paths don't start with - and uses proper separators (apfs_darwin.go:34-40)
  • Security test coverage: 5 test cases for symlink attacks, comprehensive path traversal tests

2. Concurrency - EXCELLENT ⭐

The event emission pattern across all tracers is perfect - non-blocking channel sends, callbacks invoked outside locks, double-stop protection, and comprehensive regression tests.

3. Platform-Specific Code - EXCELLENT ⭐

  • Darwin: Brilliant PID 1 sanity check for offset validation (tracer_darwin.go:332-372)
  • Linux: Clear CAP_NET_ADMIN documentation and proper netlink usage
  • Build tags: Clean separation with stub fallback for unsupported platforms

4. User Experience - EXCELLENT ⭐

  • Safety snapshots: Automatic pre-rollback snapshots prevent data loss (rollback.go:111-117)
  • Error messages: Actionable with recovery instructions (engine.go:234)
  • CLI help: Clear examples and usage patterns

Minor Suggestions 💡

1. Timing Issue in Total Size Check (archive.go:282-285)

Check happens AFTER potentially exceeding limit. Suggest checking totalWritten+header.Size > maxArchiveTotalSize before writing.
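The check-before-write pattern could look like this sketch (the constant name follows the review above; the helper is hypothetical):

```go
package main

import (
	"errors"
	"fmt"
)

const maxArchiveTotalSize = 10 << 30 // 10GB total, per the limit discussed above

var errArchiveTooLarge = errors.New("archive exceeds total size limit")

// reserve (sketch): reject an entry *before* writing it if admitting it would
// push the cumulative total past the limit, instead of checking only after
// the bytes are already on disk.
func reserve(totalWritten, entrySize int64) (int64, error) {
	if entrySize < 0 || totalWritten+entrySize > maxArchiveTotalSize {
		return totalWritten, errArchiveTooLarge
	}
	return totalWritten + entrySize, nil
}

func main() {
	total := int64(0)
	total, err := reserve(total, 4<<30) // 4GB entry: admitted
	fmt.Println(total == 4<<30, err)
	_, err = reserve(total, 7<<30) // would reach 11GB: rejected before writing
	fmt.Println(errors.Is(err, errArchiveTooLarge))
}
```

Header sizes can lie for malicious archives, so the post-write io.Copy accounting should stay as a backstop; this check just avoids writing up to one oversized entry first.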

2. Benign Race in Cleanup Timing (tracer_linux.go:211-214)

t.lastCleanup read without lock (updated under lock in cleanupStalePIDs). Benign race - worst case is slightly delayed cleanup. Consider adding comment explaining this is intentionally lockless.

3. Type Conversion Clarity (tracer_darwin.go:308-311)

Add comment explaining why int32 intermediate conversion is necessary.


Security Analysis

All Critical Attack Vectors Mitigated ✅

  1. ✅ Path traversal (files and symlinks)
  2. ✅ Zip bombs (decompression attacks)
  3. ✅ Command injection
  4. ✅ TOCTOU races (proper locking)

Security Score: 9.5/10


What Makes This PR Stand Out 🌟

  1. Security-first mindset: Path traversal, zip bombs, command injection all properly addressed with tests
  2. Concurrency expertise: Non-blocking channel sends, proper callback handling, comprehensive race tests
  3. Regression testing: Tests for subtle issues like double-stop and concurrent operations
  4. Platform awareness: Offset validation for Darwin, clear privilege docs for Linux
  5. User safety: Safety snapshots before rollback with helpful undo instructions

Final Recommendation

APPROVE

This PR is production-ready as-is. The minor suggestions are truly optional - the code already meets high standards for security, correctness, and maintainability.

Suggested Actions Before Merge:

Required: NONE

Optional:

  1. Fix archive.go:282 total size check timing
  2. Add comment for tracer_linux.go:212 lockless read
  3. Consider optimizing double-walk in buildMatcher (low priority)

Excellent work on this PR! The attention to security details, thorough testing, and careful handling of platform-specific code demonstrates strong engineering expertise. 🎉

@andybons andybons merged commit 555965e into main Jan 21, 2026
4 checks passed
@andybons andybons deleted the feat/snapshots branch January 21, 2026 02:29