# feat(snapshot): add workspace snapshots and process tracing (#36)
## Conversation

### Pull request overview
This PR adds comprehensive workspace snapshot capabilities and process execution tracing to the moat agent system. The implementation provides automatic pre-run snapshots, event-based snapshot triggers (git commits, builds, idle detection), and platform-specific execution tracing via Linux proc connector and macOS sysctl polling.
Changes:
- Workspace snapshot system with APFS (macOS) and archive (tar.gz) backends
- Process execution tracing with platform-specific implementations (Linux/macOS/stub)
- Configuration support for snapshot triggers, exclusions, and retention policies
- Storage integration for execution events
- CLI commands for snapshot management (`moat snapshots`, `moat snapshot`, `moat rollback`)
### Reviewed changes

Copilot reviewed 37 out of 38 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| internal/snapshot/snapshot.go | Core snapshot types, metadata, and backend interface |
| internal/snapshot/archive.go | Archive backend with tar.gz and gitignore support |
| internal/snapshot/apfs_darwin.go | APFS backend using macOS tmutil commands |
| internal/snapshot/engine.go | Snapshot engine with backend detection and management |
| internal/trace/event.go | Execution event types with git/build detection helpers |
| internal/trace/tracer.go | Tracer interface definition |
| internal/trace/tracer_stub.go | No-op tracer for testing/unsupported platforms |
| internal/trace/tracer_linux.go | Linux proc connector tracer via netlink |
| internal/trace/tracer_darwin.go | macOS sysctl polling tracer |
| internal/config/config.go | Snapshot and tracing configuration schema |
| internal/storage/storage.go | Execution event storage (exec.jsonl) |
| internal/run/run.go | Added SnapEngine field to Run struct |
| internal/run/manager.go | Integrated snapshot engine into run lifecycle |
| go.mod/go.sum | Added go-git dependency for gitignore parsing |
**Copilot comment 1**

Flagged code:

```go
	f.Close()
	return fmt.Errorf("write file %s: %w", header.Name, err)
}
f.Close()
```

> File handle may be writable as a result of data flow from a call to OpenFile and closing it may result in data loss upon failure, which is not handled explicitly.

Suggested change:

```go
	// Best-effort close; preserve the original write error
	_ = f.Close()
	return fmt.Errorf("write file %s: %w", header.Name, err)
}
if err := f.Close(); err != nil {
	return fmt.Errorf("close file %s: %w", header.Name, err)
}
```
**Copilot comment 2**

Flagged code:

```go
	f.Close()
	return fmt.Errorf("write file %s: %w", header.Name, err)
}
f.Close()
```

> File handle may be writable as a result of data flow from a call to OpenFile and closing it may result in data loss upon failure, which is not handled explicitly.

Suggested change:

```go
	if cerr := f.Close(); cerr != nil {
		return fmt.Errorf("write file %s: %v (also failed to close: %w)", header.Name, err, cerr)
	}
	return fmt.Errorf("write file %s: %w", header.Name, err)
}
if err := f.Close(); err != nil {
	return fmt.Errorf("close file %s: %w", header.Name, err)
}
```
**Copilot comment 3**

Flagged code:

```go
defer f.Close()

data, err := json.Marshal(event)
if err != nil {
	return fmt.Errorf("marshaling exec event: %w", err)
}
if _, writeErr := f.Write(data); writeErr != nil {
	return fmt.Errorf("writing exec event: %w", writeErr)
}
_, err = f.Write([]byte("\n"))
return err
```

> File handle may be writable as a result of data flow from a call to OpenFile and closing it may result in data loss upon failure, which is not handled explicitly.

Suggested change (note that it also drops the `defer f.Close()`):

```go
data, err := json.Marshal(event)
if err != nil {
	_ = f.Close()
	return fmt.Errorf("marshaling exec event: %w", err)
}
if _, writeErr := f.Write(data); writeErr != nil {
	_ = f.Close()
	return fmt.Errorf("writing exec event: %w", writeErr)
}
if _, writeErr := f.Write([]byte("\n")); writeErr != nil {
	_ = f.Close()
	return fmt.Errorf("writing exec event newline: %w", writeErr)
}
if closeErr := f.Close(); closeErr != nil {
	return fmt.Errorf("closing exec file: %w", closeErr)
}
return nil
```
## Code Review - PR #36: Workspace Snapshots & Process Tracing

Excellent work on this substantial feature! The architecture is solid with good separation of concerns, strong security practices, and comprehensive test coverage. However, I've identified several issues that should be addressed before merging.

### Critical Issues (Must Fix)

**1. Race Condition in .git Preservation**
## Code Review: PR #36 - Workspace Snapshots and Process Tracing

### Overview

This PR adds significant new functionality with 9,098 additions across 37 files, introducing workspace snapshots with APFS/archive backends and platform-specific process tracing. The implementation is generally solid with good security practices and comprehensive testing.

### Summary Assessment

Overall quality: ✅ Strong - well-architected with good security practices.

### Critical Findings

🟢 **No critical issues found.** The code is production-ready with no blocking security vulnerabilities or critical bugs.

### High-Priority Recommendations

**1. Archive Restore Atomicity (Medium Priority)**

Issue: If extraction fails midway, the workspace may contain partial files in an inconsistent state.

Current code:

```go
if err := b.RestoreTo(nativeRef, workspacePath); err != nil {
	// Restore .git on error (best effort)
	if gitBackup != "" {
		_ = os.Rename(gitBackup, gitDir)
	}
	return fmt.Errorf("extract archive: %w", err)
}
```

Recommendation: Extract to a temporary directory first, then swap atomically:

```go
tmpDir := workspacePath + ".restore-tmp"
if err := b.RestoreTo(nativeRef, tmpDir); err != nil {
	os.RemoveAll(tmpDir)
	return fmt.Errorf("extract archive: %w", err)
}
// Atomic swap with proper cleanup
```

**2. Insufficient Concurrency Test Coverage (Medium Priority)**

Recommendation: Add these test cases:

```go
func TestTracerConcurrentStartStop(t *testing.T)
func TestTracerChannelOverflow(t *testing.T)
func TestTracerCallbackPanic(t *testing.T)
func TestTracerGrandchildTracking(t *testing.T)
func TestLinuxTracerConsecutiveErrors(t *testing.T)
func TestTracerGoroutineCleanup(t *testing.T)
```

### Security Review

✅ Strengths
## Comprehensive Code Review: PR #36 - Workspace Snapshots and Process Tracing

### Overview

This PR adds significant new functionality (~9,130 additions across 38 files) for workspace snapshots and process tracing. Overall, this is a solid implementation with excellent security practices, though there are areas for improvement.

### ✅ Strengths

#### Security Implementation (Excellent)

The path traversal protection is thorough.

File path validation (lines 241-246):

```go
relToDestPath, err := filepath.Rel(destPath, targetPath)
if err != nil || strings.HasPrefix(relToDestPath, "..") {
	return fmt.Errorf("invalid path in archive: %s", header.Name)
}
```

Symlink validation (lines 280-296) rejects absolute targets and verifies that relative targets stay within the destination, with additional protections (decompression and file-count limits).

This is security done right! 🎯

#### Concurrency Safety

Excellent use of atomic operations:

```go
var storeRef atomic.Value // holds *storage.RunStore
```

The atomic reference pattern cleverly avoids a race condition where the proxy logger might try to write before the store is initialized. Proper mutex usage throughout.

#### Resource Management

Smart file handle management:

```go
f.Close() // Close immediately, not deferred, to avoid accumulating file handles
```

Excellent comment explaining why immediate close is used in loops instead of defer!

#### Architecture
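The `filepath.Rel` escape check quoted above can be exercised in isolation. A runnable sketch, slightly stricter than the quoted code in that it matches only a leading `..` path component (so legitimate entry names like `..foo` are not rejected):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// safeJoin joins an archive entry name onto dest, then verifies the
// result cannot climb out of dest via ".." components.
func safeJoin(dest, name string) (string, error) {
	target := filepath.Join(dest, name) // Join also cleans the path
	rel, err := filepath.Rel(dest, target)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("invalid path in archive: %s", name)
	}
	return target, nil
}

func main() {
	for _, name := range []string{"src/main.go", "../../etc/passwd", "a/../../b"} {
		if p, err := safeJoin("/workspace", name); err != nil {
			fmt.Println("rejected:", name)
		} else {
			fmt.Println("accepted:", p)
		}
	}
}
```

Because `filepath.Join` cleans the path before `filepath.Rel` runs, tricks like `a/../../b` resolve to a location outside `dest` and are caught by the same check.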
## Pull Request Review: Workspace Snapshots and Process Tracing

### Overall Assessment

This is a well-architected and thoroughly implemented PR that adds two major features: workspace snapshots and process tracing. The code demonstrates strong engineering practices with comprehensive test coverage (39 snapshot tests, 19 trace tests), proper security considerations, and platform-specific optimizations.

Recommendation: ✅ Approve with minor suggestions

### Code Quality & Best Practices

#### Strengths

### Security Review

#### ✅ Security Strengths
## Code Review: Workspace Snapshots and Process Tracing

### Overview

This PR adds significant new functionality: workspace snapshots with APFS and archive backends, process tracing for Linux and macOS, and rollback capabilities. The implementation is generally well-designed with good test coverage, but there are several issues that need attention.

### Critical Issues

**1. Security: Command Injection Risk in APFS Backend**

While the current implementation uses separate arguments (preventing shell injection), paths come from user-controlled input. If an attacker can control the workspace path or snapshot ID, they could potentially pass malicious arguments starting with `-`.

Recommendation: Add path validation:

```go
// Before executing commands, validate paths
if strings.HasPrefix(filepath.Base(workspacePath), "-") {
	return "", fmt.Errorf("invalid workspace path: cannot start with -")
}
```

**2. Performance: Potential Resource Leak in Archive Creation**

Deferred closes happen in reverse order, but their errors are not checked.

Recommendation: Implement proper cleanup with error checking:

```go
var writeErr error
defer func() {
	if err := tw.Close(); err != nil && writeErr == nil {
		writeErr = err
	}
	if err := gw.Close(); err != nil && writeErr == nil {
		writeErr = err
	}
	if err := file.Close(); err != nil && writeErr == nil {
		writeErr = err
	}
}()
// Return writeErr at the end of the function
```

**3. Security: Add Total Archive Size Limit**

The 1GB per-file limit is excellent, but a malicious archive could still contain thousands of 1GB files.

Recommendation: Add a total archive size limit or file count limit in the restore loop.

### Major Issues

**4. Race Condition: Engine Metadata Concurrency**

The mutex is held during potentially long-running I/O operations (creating snapshots), which could block other goroutines unnecessarily.

Recommendation: Narrow the critical section:

```go
id := NewID()
nativeRef, err := e.backend.Create(e.workspace, id) // Outside lock
if err != nil {
	return Metadata{}, fmt.Errorf("backend create: %w", err)
}
e.mu.Lock()
defer e.mu.Unlock()
// Only metadata operations under lock
```

**5. Process Tracing: Missing Privilege Requirements Documentation**

There's no package-level documentation about privilege requirements.

Recommendation: Add godoc comments:

```go
// NewProcConnectorTracer creates a new proc connector tracer.
// Requires CAP_NET_ADMIN capability or root privileges on Linux.
// Returns an error if the netlink socket cannot be created.
```

**6. Darwin Tracer: Hardcoded Structure Offsets**

These offsets are fragile and could break across macOS versions.

### Minor Issues

**7. Error Handling: Quote Path in Recovery Message**

Recommendation:

```go
return fmt.Errorf("corrupted snapshot metadata at %s: %w\nTo reset, delete the file: rm %q", path, err, path)
```

**8. Configuration: Add Validation**

There's no validation that MaxCount is positive or that IdleThresholdSeconds is reasonable.

Recommendation: Add validation:

```go
if cfg.Snapshots.Retention.MaxCount < 0 {
	return nil, fmt.Errorf("snapshots.retention.max_count must be non-negative")
}
```

**9. Test Coverage: Missing Edge Cases**

Excellent coverage for symlink attacks and basic operations, but some edge cases remain untested.

**10. Performance: Darwin Tracer Polling Not Configurable**

100ms polling might be too coarse for short-lived processes, but there's no way to configure it.

Recommendation: Make the polling interval configurable.

### Positive Highlights

✅ Excellent test coverage with regression tests for race conditions

### Summary

Recommendation: Approve with changes

The code is well-designed and follows good practices. The symlink path traversal protection is excellent. The PR adds valuable functionality with good attention to security and testing; the identified issues are addressable and don't represent fundamental design flaws.
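Several of the APFS items above come down to keeping attacker-influenced paths from being parsed as flags by external tools. A hedged sketch combining the two defenses discussed here and elsewhere in the thread (per-component validation plus the `--` end-of-options separator; `validatePathArg` is an illustrative name, not the PR's code):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// validatePathArg rejects paths with any component beginning with "-",
// which external tools could misread as flags. Pairing this with a
// literal "--" separator before positional arguments, e.g.
//
//	exec.Command("cp", "-c", "-R", "-p", "--", src, dst)
//
// gives defense in depth.
func validatePathArg(path string) error {
	abs, err := filepath.Abs(path)
	if err != nil {
		return fmt.Errorf("resolve path: %w", err)
	}
	for _, part := range strings.Split(abs, string(filepath.Separator)) {
		if strings.HasPrefix(part, "-") {
			return fmt.Errorf("invalid path %q: component %q starts with -", path, part)
		}
	}
	return nil
}

func main() {
	fmt.Println(validatePathArg("/tmp/workspace"))     // <nil>
	fmt.Println(validatePathArg("/tmp/-rf/workspace")) // error
}
```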
## Security and Code Quality Review - PR #36

I've completed a thorough security and code quality review of the workspace snapshots and process tracing implementation.

### Executive Summary

Overall Assessment: EXCELLENT ✅

This PR demonstrates strong security practices and solid engineering. The implementation shows careful attention to security-critical details like symlink path traversal prevention, race condition mitigation, and proper error handling.

### 1. Security Analysis

#### ✅ Symlink Path Traversal Prevention (EXCELLENT)

The symlink handling demonstrates excellent security practice:

```go
// Reject absolute symlink targets - they could point anywhere on the filesystem
if filepath.IsAbs(header.Linkname) {
	return fmt.Errorf("invalid symlink in archive: absolute path not allowed: %s -> %s",
		header.Name, header.Linkname)
}

// Resolve the symlink target relative to its location within destPath
symlinkDir := filepath.Dir(targetPath)
resolvedTarget := filepath.Join(symlinkDir, header.Linkname)
resolvedTarget = filepath.Clean(resolvedTarget)

// Verify the resolved target stays within destPath
relToDestPath, err := filepath.Rel(destPath, resolvedTarget)
if err != nil || strings.HasPrefix(relToDestPath, "..") {
	return fmt.Errorf("invalid symlink in archive: target escapes destination: %s -> %s",
		header.Name, header.Linkname)
}
```

Validated against the path traversal attack vectors exercised in the tests.
#### ✅ Regular File Path Traversal Prevention (EXCELLENT)

Every extracted file path is validated before filesystem operations:

```go
targetPath := filepath.Join(destPath, header.Name) //nolint:gosec // G305: validated below
relToDestPath, err := filepath.Rel(destPath, targetPath)
if err != nil || strings.HasPrefix(relToDestPath, "..") {
	return fmt.Errorf("invalid path in archive: %s", header.Name)
}
```

#### ✅ Decompression Bomb Protection (GOOD)

```go
// Limit copy size to prevent decompression bombs (1GB max per file)
if _, err := io.Copy(f, io.LimitReader(tr, 1<<30)); err != nil {
```

The 1GB per-file limit is reasonable for workspace snapshots.

#### ✅ Command Execution Safety (GOOD)

All command executions use explicit argument lists:

```go
cmd := exec.Command("cp", "-c", "-R", "-p", workspacePath, clonePath)
cmd := exec.Command("diskutil", "info", mountPoint)
```

No injection risk as arguments aren't shell-interpreted.

### 2. Race Conditions & Concurrency

#### ✅ ProcConnectorTracer Synchronization (EXCELLENT)

The synchronization design is careful and correct.
Key pattern - event emission:

```go
func (t *ProcConnectorTracer) emitEvent(event ExecEvent) {
	t.mu.Lock()
	if t.stopped {
		t.mu.Unlock()
		return
	}
	// Copy callbacks under lock
	cbs := make([]func(ExecEvent), len(t.callbacks))
	copy(cbs, t.callbacks)
	// Non-blocking channel send
	select {
	case t.events <- event:
	default:
		t.droppedEvents++
	}
	t.mu.Unlock()
	// Invoke callbacks OUTSIDE lock to prevent deadlock
	for _, cb := range cbs {
		cb(event)
	}
}
```

Callbacks are copied under the lock and invoked outside it, the channel send never blocks, and dropped events are counted for observability. Regression tests cover concurrent start/stop and emission.
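The non-blocking send at the heart of `emitEvent` is easy to demonstrate in isolation. A toy version with an explicit drop counter (names are mine, not the PR's):

```go
package main

import "fmt"

// dropSend performs the non-blocking send used by emitEvent: if the
// buffered channel is full, the event is counted as dropped rather than
// blocking the producer.
func dropSend(ch chan int, v int, dropped *int) {
	select {
	case ch <- v:
	default:
		*dropped++
	}
}

func main() {
	events := make(chan int, 2)
	dropped := 0
	for i := 0; i < 5; i++ {
		dropSend(events, i, &dropped)
	}
	fmt.Println(len(events), dropped) // 2 3
}
```

The trade-off is intentional: a slow consumer costs events (which are counted), never a stalled tracer goroutine.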
### 3. Error Handling & Edge Cases

#### ✅ File Descriptor Management (EXCELLENT)

```go
if info.Mode().IsRegular() {
	f, err := os.Open(path)
	if err != nil {
		return fmt.Errorf("open file %s: %w", relPath, err)
	}
	_, copyErr := io.Copy(tw, f)
	f.Close() // Close immediately, not deferred, to avoid accumulating file handles
	if copyErr != nil {
		return fmt.Errorf("copy file content %s: %w", relPath, copyErr)
	}
}
```

Closes immediately instead of deferred to avoid fd leaks in loops - subtle but important.

#### ✅ Archive Restore Error Recovery (EXCELLENT)

The restore operation preserves `.git` and recovers it if extraction fails:

```go
// Backup .git before cleaning workspace
gitBackup := gitDir + ".backup"
if err := os.Rename(gitDir, gitBackup); err != nil {
	return fmt.Errorf("backup .git directory: %w", err)
}

// Extract archive
if err := b.RestoreTo(nativeRef, workspacePath); err != nil {
	if gitBackup != "" {
		_ = os.Rename(gitBackup, gitDir) // Best-effort recovery
	}
	return fmt.Errorf("extract archive: %w", err)
}
```

#### ✅ Engine Metadata Consistency (EXCELLENT)

Snapshot creation is atomic - if the metadata save fails, the snapshot is cleaned up:

```go
nativeRef, err := e.backend.Create(e.workspace, id)
if err != nil {
	return Metadata{}, fmt.Errorf("backend create: %w", err)
}
meta := Metadata{...}
e.snapshots[id] = meta
if err := e.saveMetadata(); err != nil {
	// Clean up the snapshot if we can't save metadata
	_ = e.backend.Delete(nativeRef)
	delete(e.snapshots, id)
	return Metadata{}, fmt.Errorf("save metadata: %w", err)
}
```

Prevents orphaned snapshots.

### 4. Platform-Specific Code Correctness

#### ✅ Build Tags (EXCELLENT)

Proper platform separation.
#### ✅ macOS Architecture Handling (EXCELLENT)

Correctly handles ARM64 and x86_64 with different kinfo_proc offsets:

```go
func init() {
	switch runtime.GOARCH {
	case "arm64":
		procSize = 648
		pidOffset = 72
		ppidOffset = 76
		commOffset = 243
		startOffset = 128
	default: // amd64/x86_64
		procSize = 492
		pidOffset = 68
		ppidOffset = 72
		commOffset = 163
		startOffset = 120
	}
}
```

### 5. Test Coverage Quality

#### ✅ Archive Backend Tests (EXCELLENT)

11 test functions cover creation, restore, and attack vectors.

#### ✅ Tracer Regression Tests (EXCELLENT)

8 tests specifically target concurrency issues.
### 6. Code Quality & Best Practices

#### ✅ Error Messages (EXCELLENT - per CLAUDE.md)

```go
return fmt.Errorf("create netlink socket: %w (requires CAP_NET_ADMIN or root)", err)

return fmt.Errorf("invalid symlink in archive: target escapes destination: %s -> %s",
	header.Name, header.Linkname)

return fmt.Errorf("corrupted snapshot metadata at %s: %w\nTo reset, delete the file: rm %s",
	path, err, path)
```

#### ✅ Code Style (EXCELLENT)

#### ✅ Documentation (EXCELLENT)

### Summary

Ready for merge ✅

This PR demonstrates strong security engineering with no critical vulnerabilities found.

Optional improvements (non-blocking):

Great work on this implementation! The attention to security details, comprehensive testing, and careful concurrency handling make this a high-quality addition to the codebase.
## Code Review: PR #36 - Workspace Snapshots and Process Tracing

I've completed a comprehensive review of this PR, examining security-critical components, race condition fixes, error handling, and test coverage. Overall, this is high-quality, security-conscious code that delivers well on its promises.

### 🟢 Strengths

**Security (Excellent)**

**Race Condition Handling (Excellent)**

The tracer event emission code is textbook-perfect, with comprehensive regression tests.

**Error Messages (Excellent)**

Perfect adherence to CLAUDE.md guidance: "Good error messages are documentation." Example:

```go
return nil, fmt.Errorf("corrupted snapshot metadata at %s: %w\nTo reset, delete the file: rm %q", path, err, path)
```

Tells users exactly what's wrong AND how to fix it.

**Test Coverage (Excellent)**

**Go Best Practices (Excellent)**

### 🟡 Issues Found

**🔴 Medium Severity: Rollback Safety Snapshot Accumulation**

If creating the safety snapshot fails (e.g., disk full), rollback aborts. If the restore itself fails on a second attempt, another safety snapshot is created, potentially filling disk space with accumulating safety snapshots.

Recommendation: Check for existing safety snapshots and reuse them.

**🟡 Low Severity: CLI Prune Returns Success on Partial Failure**

The prune command continues after deletion failures (reasonable for bulk operations) but reports success. Fix:

```go
if failed > 0 {
	return fmt.Errorf("failed to delete %d of %d snapshots", failed, len(toDelete))
}
return nil
```

**🟡 Low Severity: Darwin kinfo_proc Structure Offsets**

Hard-coded structure offsets are fragile across macOS versions and architectures.

**🟢 Minor: APFS Path Validation Only Checks Base Name**

```go
if strings.HasPrefix(filepath.Base(workspacePath), "-") {
	return "", fmt.Errorf("invalid workspace path: name cannot start with -")
}
```

Only validates the final path component, not intermediate ones.

### 🔵 Enhancement Suggestions (Non-Blocking)

**1. Atomic Metadata Writes**

Use the write-to-temp-then-rename pattern to prevent metadata corruption if the process is killed mid-write:

```go
tmpPath := path + ".tmp"
if err := os.WriteFile(tmpPath, data, 0644); err != nil {
	return err
}
return os.Rename(tmpPath, path)
```

**2. Total Snapshot Size Limit**

The archive backend has file count (100k) and per-file size (1GB) limits, but no total snapshot size limit. Consider adding:

```go
const maxSnapshotSize = 10 << 30 // 10GB
```

**3. Tracer Metrics Exposure**

The tracer tracks dropped events internally; consider exposing the counter as a metric.

**4. Configurable Darwin Polling Interval**

100ms polling = 10 syscalls/sec. Consider making this configurable or using a longer interval (200-500ms) for low-priority tracing.

### 📊 Performance

✅ APFS copy-on-write cloning keeps snapshot creation fast and space-efficient.

### 📝 Documentation

✅ Excellent inline documentation for privilege requirements, security model, and platform differences.

### ✅ Recommendation: Approve with Minor Fixes

This PR maintains the high code quality standards of the moat codebase. The snapshot implementation is well-architected, security-conscious, and thoroughly tested. No critical bugs or security vulnerabilities found; the symlink path traversal fixes are solid, race conditions are properly handled, and test coverage is excellent. Great work! 🎉
## PR Review: Workspace Snapshots and Process Tracing

This is a substantial and well-implemented feature addition. Overall, the code quality is excellent with strong attention to security, error handling, and testing.

### Summary

This PR adds two major features: workspace snapshots and process tracing.

### Issues & Recommendations

**1. Security: Archive File Count Limit (Low)**

Location: internal/snapshot/archive.go:215

maxArchiveFiles is 100,000, which seems high; most projects have fewer files. Consider 25-50k for better zip bomb protection.

**2. Prune Command Error Handling (Medium)**

Location: cmd/moat/cli/snapshots.go:244

Returns an error after partial success. Document this behavior in the help text or make it a warning instead.

**3. APFS Detection Robustness (Low)**

Location: internal/snapshot/apfs_darwin.go:219-242

IsAPFS uses diskutil, which could fail in restricted environments. The current fallback is safe but could have better error handling.

**4. Documentation: APFS Evolution (Low)**

Add a comment explaining why cp -c is used instead of tmutil (directory-level vs volume-level snapshots).

**5. Cleanup Interval Configuration (Low)**

Location: internal/trace/tracer_linux.go:69

The 60s cleanup interval is hardcoded. Consider making it configurable for short-lived process tracking.

### Test Coverage

Excellent coverage with unit, integration, regression, security, and edge case tests.

### Final Verdict

Approve with minor suggestions. High-quality, production-ready code with strong security awareness, excellent error handling, clean architecture, and comprehensive testing. Issues are low-medium priority and don't block merging.

Great work! The attention to security and edge cases is commendable.
## Code Review: PR #36 - Workspace Snapshots and Process Tracing

### Overview

This PR adds comprehensive workspace snapshot functionality with APFS and archive backends, plus platform-specific process tracing. The implementation is substantial (9,364 lines added) with good test coverage. Overall, this is high-quality code with strong security practices, but there are several critical issues that should be addressed before merge.

### 🔴 Critical Issues

**1. Command Injection Vulnerability in APFS Backend**

The argument injection protection only checks the base name of the path.

Recommendation: Use the `--` end-of-options separator:

```go
cmd := exec.Command("cp", "-c", "-R", "-p", "--", workspacePath, clonePath)
```

Or validate the full resolved path:

```go
absPath, err := filepath.Abs(workspacePath)
if err != nil {
	return "", fmt.Errorf("invalid workspace path: %w", err)
}
if strings.Contains(absPath, "/-") {
	return "", fmt.Errorf("invalid workspace path: path components cannot start with -")
}
```

**2. Potential DoS via Compression Bomb**

While there's a 1GB per-file limit and a 100,000 file count limit, there's no check on overall decompressed size. An attacker could create a tar.gz with 100,000 files × 1GB each = 100TB extraction.

Recommendation: Add cumulative size tracking:

```go
const maxTotalSize = 10 << 30 // 10GB total

var totalWritten int64
// In extraction loop:
if totalWritten > maxTotalSize {
	return fmt.Errorf("archive exceeds maximum size limit (%d bytes)", maxTotalSize)
}
written, err := io.Copy(f, io.LimitReader(tr, 1<<30))
totalWritten += written
```

**3. Integer Overflow Risk in Darwin Tracer**

Recommendation: Add bounds checking:

```go
pidRaw := binary.LittleEndian.Uint32(buf[pidOffset:])
if pidRaw > math.MaxInt32 {
	return processInfo{} // Invalid PID
}
info.pid = int(int32(pidRaw))
```

### 🟡 High Priority Issues

**4. Snapshot ID Collision Risk**

4 bytes (32 bits) provides only ~4B unique IDs, giving 50% collision probability after ~65,000 snapshots (birthday paradox). The timestamp fallback is also non-deterministic.

Recommendation: Use 8 bytes for IDs:

```go
func NewID() string {
	b := make([]byte, 8) // 64 bits = better collision resistance
	if _, err := rand.Read(b); err != nil {
		return fmt.Sprintf("snap_%016x", time.Now().UnixNano())
	}
	return "snap_" + hex.EncodeToString(b)
}
```

**5. Missing Metadata Corruption Recovery**

If metadata is corrupted, the entire snapshot system fails, even though the actual snapshot data still exists on disk.

Recommendation: Add automatic recovery that rescans the snapshot directory and rebuilds metadata from discovered snapshots.

**6. Linux Tracer Memory Leak**

Fix:

```go
if _, err := os.Stat(procPath); err != nil {
	// Remove PID on any error - if we can't check, assume it's gone
	delete(t.trackedPIDs, pid)
}
```

**7. Config Defaults Applied When Snapshots Disabled**

Snapshot defaults are applied even when snapshots are disabled. Fix:

```go
if !cfg.Snapshots.Disabled {
	if cfg.Snapshots.Triggers.IdleThresholdSeconds == 0 {
		cfg.Snapshots.Triggers.IdleThresholdSeconds = 30
	}
	// ... other defaults
}
```

### 🔵 Code Quality Improvements

**8. Error Messages Leak Path Information**

Per CLAUDE.md, error messages should be actionable without exposing unnecessary internals:

```go
// Instead of: "corrupted snapshot metadata at /full/path: rm /full/path"
// Use: "corrupted snapshot metadata: moat snapshots prune <run-id>"
```

**9. Inconsistent CLI Error Handling**

CLI commands duplicate existence checks that the storage layer already performs. Delegate to storage:

```go
store, err := storage.NewRunStore(baseDir, runID)
if err != nil {
	return fmt.Errorf("run %s not found or inaccessible: %w", runID, err)
}
```

**10. Hardcoded Architecture Offsets**

Document which macOS versions the offsets were tested against:

```go
// kinfo_proc offsets for macOS 13+ (Ventura)
// Validated on: macOS 13.0, 14.0 (Sonoma), 15.0 (Sequoia)
```

### ✅ Positive Observations

### 📊 Test Coverage Gaps

Missing edge case tests for:

### ⚡ Performance Considerations

### 📝 Conventions (CLAUDE.md)

### Summary

Overall assessment: High-quality implementation with strong security practices and comprehensive testing. The critical issues (#1-3) represent security/reliability risks that should be addressed before merge. Medium-priority issues are mostly robustness and UX improvements.

Recommendation: Request changes for critical issues, then approve with optional follow-up items.
Add ExecEvent struct for capturing command execution events, including:

- Process info (PID, PPID, timestamp, working directory)
- Command execution details (command, args, exit code, duration)
- IsGitCommit() method to detect git commit operations
- IsBuildCommand() method to detect common build commands

Build detection supports npm, yarn, go, make, cargo, mvn, gradle with proper word boundary matching to avoid false positives (e.g., `npm run build-docker` is correctly not detected as a build command).
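The word-boundary matching this commit describes can be sketched with a whitespace split, which is what makes `npm run build-docker` a non-match. The tool and verb lists below are illustrative, not the PR's exact `IsBuildCommand` logic:

```go
package main

import (
	"fmt"
	"strings"
)

// isBuildCommand splits the command line into whitespace-separated
// words and looks for a known build tool followed by a build-like verb,
// so "npm run build" matches while "npm run build-docker" does not.
func isBuildCommand(cmdline string) bool {
	words := strings.Fields(cmdline)
	for i, w := range words {
		switch w {
		case "make", "cargo", "go", "npm", "yarn", "mvn", "gradle":
			// Look for a whole-word "build"-like verb after the tool name.
			for _, arg := range words[i+1:] {
				if arg == "build" || arg == "install" || arg == "compile" {
					return true
				}
			}
		}
	}
	return false
}

func main() {
	fmt.Println(isBuildCommand("npm run build"))        // true
	fmt.Println(isBuildCommand("npm run build-docker")) // false
	fmt.Println(isBuildCommand("go build ./..."))       // true
}
```

Exact word comparison is the point: a substring match on "build" would incorrectly flag `build-docker`, `rebuild`, and similar names.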
Implements ArchiveBackend that creates tar.gz snapshots with:

- Gitignore pattern support via go-git library
- Additional exclude patterns configuration
- .git directory preservation during restore
- Symlink and file permission preservation
- Path traversal protection on extract

Add Engine type that provides high-level snapshot management:

- Auto-detects APFS on macOS, falls back to archive backend
- Persists metadata to snapshots.json for durability across restarts
- Thread-safe Create/Restore/RestoreTo/Delete/List/Get operations
- Supports ForceBackend option for testing specific backends
- Passes through UseGitignore and Additional exclude patterns

- Reject absolute symlink targets (could point anywhere on filesystem)
- Validate relative symlinks don't escape destination directory
- Add comprehensive test cases for path traversal attack vectors
- Use filepath.Rel to properly detect escape attempts via ../

…handling

- Add stopped flag to all tracers to prevent double-close panics
- Fix emitEvent race condition by holding lock during channel send
- Add droppedEvents counter and logging for observability
- Fix Darwin grandchild tracking to properly track all descendants
- Improve Linux readLoop error handling with max consecutive errors limit
- Add comprehensive regression tests for concurrent operations

- Add cleanupStalePIDs() to Linux tracer to handle missed EXIT events
- Add started flag to StubTracer for consistency with other tracers
- Fix misleading test comment that referenced old behavior

- Add TestStubTracerDoubleStart to verify Start() returns error when called twice
- Add Linux-specific tests for cleanupStalePIDs function

- Fix errcheck: explicitly ignore os.Rename errors in recovery paths
- Fix gosec G115: add nolint for safe integer conversions
- Fix gosec G305/G110: add nolint with validation comments, limit copy size
- Fix govet shadow: rename shadowed err variables
- Fix prealloc: pre-allocate slices with known capacity
- Fix staticcheck: remove unused metas slice in test

- Fix file close error handling in archive.go (check close errors)
- Fix file close error handling in storage.go WriteExecEvent
- APFS stub now returns ErrAPFSNotAvailable instead of silently succeeding
- Add recovery guidance for corrupted snapshot metadata
- Add consecutive error tracking to Darwin tracer (like Linux tracer)

…fault

The APFS backend incorrectly uses tmutil (Time Machine), which creates volume-level snapshots, not directory-level snapshots. This means:

- Snapshots are of the entire disk, not the workspace
- Restore requires root privileges
- Restore affects the entire volume, not just the workspace

The correct approach for APFS workspace snapshots would be to use cp -c for copy-on-write directory cloning. Until that's implemented, default to the archive backend, which correctly handles directory-level snapshots. The APFS backend is still available via the ForceBackend option, but with a warning comment.

Replace tmutil-based Time Machine snapshots with cp -c copy-on-write directory cloning. This is the correct approach for workspace snapshots:

- tmutil creates volume-level snapshots (entire disk)
- cp -c creates directory-level clones (exactly what we need)

Benefits of cp -c on APFS:

- Instant cloning (metadata-only operation)
- Space-efficient (shared blocks until modified)
- No Time Machine entitlements required
- Works at directory granularity

Changes:

- APFSBackend.Create() now uses cp -c -R -p for COW cloning
- APFSBackend.Restore() copies files back using cp -c
- Engine auto-detects APFS and uses the APFS backend when available
- Updated tests for new cp -c based implementation
- NewAPFSBackend() now takes a snapshotDir parameter

Security improvements:

- Add path validation in APFS backend to prevent argument injection (paths starting with "-" could be misinterpreted as flags)
- Add file count limit (100k) in archive restore to prevent zip bombs

Documentation:

- Add privilege requirements documentation for the Linux proc connector tracer
- Quote path in metadata corruption error message for safe copy/paste

Addresses feedback from Claude code review.

CLI:

- Return error from prune command on partial deletion failure instead of silently succeeding

Darwin tracer:

- Add runtime validation of kinfo_proc structure offsets
- Validates by checking PID 1 (launchd) has expected properties
- Logs warning if validation fails (offsets may be wrong for the macOS version)

- Rename ALL_CAPS constants to camelCase (revive var-naming)
- Add nolint comments for syscall.Getpid() uint32 conversion (gosec G115)

Also installed pre-commit hook to catch these issues locally.

- APFS backend: add "--" separator to all cp commands to prevent command injection via filenames starting with "-"
- Archive backend: add total extracted size limit (10GB) to prevent compression bomb attacks that could exhaust disk space

Add tests verifying:

- File count limit (100k files) triggers extraction failure
- Archive limit constants exist with reasonable values

These tests ensure the compression bomb protections are in place and catch any accidental removal of security limits.
Force-pushed from da690d3 to 1615725.
## Code Review: PR #36 - Workspace Snapshots and Process Tracing

### Executive Summary

I've completed a comprehensive review of this PR. Overall, this is high-quality code with excellent security considerations, thorough testing, and solid engineering practices.

Recommendation: ✅ APPROVE with minor suggestions

The code is production-ready. All critical security issues have been addressed, comprehensive tests are in place, and the implementation follows Go best practices.

### Key Strengths 🎉

**1. Security - EXCELLENT ⭐**

**2. Concurrency - EXCELLENT ⭐**

The event emission pattern across all tracers is perfect - non-blocking channel sends, callbacks invoked outside locks, double-stop protection, and comprehensive regression tests.

**3. Platform-Specific Code - EXCELLENT ⭐**

**4. User Experience - EXCELLENT ⭐**

### Minor Suggestions 💡

**1. Timing Issue in Total Size Check (archive.go:282-285)**

The check happens AFTER potentially exceeding the limit; suggest checking before the copy.

**2. Benign Race in Cleanup Timing (tracer_linux.go:211-214)**

**3. Type Conversion Clarity (tracer_darwin.go:308-311)**

Add a comment explaining why the int32 intermediate conversion is necessary.

### Security Analysis

All critical attack vectors mitigated ✅

Security score: 9.5/10

### Final Recommendation

✅ APPROVE

This PR is production-ready as-is. The minor suggestions are truly optional - the code already meets high standards for security, correctness, and maintainability. Required before merge: NONE.

Excellent work on this PR! The attention to security details, thorough testing, and careful handling of platform-specific code demonstrates strong engineering expertise. 🎉
## Summary

- `moat rollback` command to restore the workspace to a previous snapshot

### Key Components

- Snapshots (`internal/snapshot/`)
- Process Tracing (`internal/trace/`)

### CLI Commands

- `moat snapshot` - Manual workspace snapshot
- `moat snapshots` - List/manage snapshots
- `moat rollback` - Restore to previous snapshot

### Configuration

- `snapshot` and `tracing` fields in `agent.yaml`

### Test Plan

- `go test -race ./...` passes

🤖 Generated with Claude Code