feat(deps): add docker dependency for Docker access inside containers #73

Closed
dpup wants to merge 38 commits into main from docker-in-docker

dpup (Collaborator) commented Jan 27, 2026

Add support for running Docker commands inside moat containers via two modes:

docker:host - Mounts the host Docker socket

  • Fast startup, shared image cache with host
  • Full access to host Docker daemon
  • Use when you reasonably trust the agent and want speed

docker:dind - Runs isolated Docker daemon inside container

  • Complete isolation from host Docker
  • Requires privileged mode (set automatically)
  • Slower startup (~5-10s for daemon init)
  • Use when more isolation is required

Key implementation details:

  • DockerMode type with host/dind variants (explicit mode required)
  • moat-init script handles setup for both modes:
    • Host: detects socket GID inside container, adds moatuser to group
    • Dind: starts dockerd, waits for readiness, adds moatuser to docker group
  • Docker CLI installed from official Docker repo (docker-ce-cli)
  • Dind mode also installs docker-ce, containerd.io, docker-buildx-plugin
  • BuildKit disabled in dind mode (DOCKER_BUILDKIT=0) to avoid session issues
  • Image tags include DockerMode to prevent cache collisions

Example usage in agent.yaml:

dependencies:
  - docker:host  # or docker:dind
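
A minimal sketch of how the explicit-mode requirement might be enforced in Go. The `DockerMode` type and the `docker:host` / `docker:dind` strings come from this PR; the helper name `parseDockerDep` and the exact error wording are assumptions for illustration, not the code in internal/deps.

```go
package main

import (
	"fmt"
	"strings"
)

// DockerMode selects how Docker is exposed inside the container.
type DockerMode string

const (
	DockerModeHost DockerMode = "host" // mount the host Docker socket
	DockerModeDind DockerMode = "dind" // run an isolated daemon in the container
)

// parseDockerDep is a hypothetical helper: it accepts "docker:host" or
// "docker:dind" and rejects bare "docker" and unknown modes.
func parseDockerDep(spec string) (DockerMode, error) {
	name, mode, found := strings.Cut(spec, ":")
	if name != "docker" {
		return "", fmt.Errorf("not a docker dependency: %q", spec)
	}
	if !found || mode == "" {
		return "", fmt.Errorf("docker dependency requires explicit mode: use 'docker:host' or 'docker:dind'")
	}
	switch DockerMode(mode) {
	case DockerModeHost, DockerModeDind:
		return DockerMode(mode), nil
	default:
		return "", fmt.Errorf("unknown docker mode %q: use 'docker:host' or 'docker:dind'", mode)
	}
}

func main() {
	for _, spec := range []string{"docker:host", "docker:dind", "docker", "docker:auto"} {
		mode, err := parseDockerDep(spec)
		fmt.Printf("%-12s mode=%q err=%v\n", spec, mode, err)
	}
}
```

Because there is no default branch that silently picks a mode, a config with bare `docker` fails at parse time rather than falling back to a privileged or socket-sharing setup.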

claude Bot commented Jan 27, 2026

📦 Previous Review (superseded by newer review)

Pull Request Review: Docker Dependency Support

This PR adds Docker access inside moat containers with two modes (host and dind). The implementation is well-structured with good test coverage and documentation. Below are my findings:

Code Quality & Best Practices

Strengths:

  • Clean separation of concerns with dedicated modules (internal/deps, internal/run)
  • Comprehensive error messages with actionable guidance (e.g., ErrDockerHostRequiresDockerRuntime)
  • Good use of type safety with DockerMode enum and validation
  • Excellent documentation following the project's style guide
  • Parser enforces explicit mode specification (docker:host or docker:dind), preventing ambiguous configurations

Areas for Improvement:

  1. Magic Numbers in moat-init.sh (lines 31, 177-178)

    • The script uses hardcoded retry counts (20 iterations, 30 second timeout)
    • Consider extracting these as named constants at the top of the script for maintainability
    # At top of file
    SSH_SOCKET_WAIT_ITERS=20
    DIND_TIMEOUT_SECONDS=30
  2. Error Handling in moat-init.sh (line 158)

    • The mutual exclusivity check for MOAT_DOCKER_DIND and MOAT_DOCKER_GID is good
    • However, it only runs when both are non-empty. Consider checking if either is set when starting docker setup sections to provide earlier validation
  3. Image Tag Collision Prevention (internal/deps/builder.go:42-45)

    • Good: DockerMode is included in the hash to prevent cache collisions
    • The hash is only 12 characters. While collision probability is low, consider documenting the collision risk or using a longer hash for production
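
To illustrate the cache-collision point, here is a rough sketch of a tag function that mixes the Docker mode into the hash. The function name `imageTag`, the separator format, and the example dependency names are assumptions; only the idea of hashing the mode and truncating the hex digest is taken from the review.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// imageTag is a hypothetical helper. It hashes the sorted dependency list
// together with the Docker mode, so "host" and "dind" builds of the same
// dependency set never share a tag. hashLen must be between 1 and 64.
func imageTag(deps []string, dockerMode string, hashLen int) string {
	sorted := append([]string(nil), deps...) // copy so the caller's slice is untouched
	sort.Strings(sorted)
	h := sha256.Sum256([]byte(strings.Join(sorted, ",") + "|docker=" + dockerMode))
	return "moat/run:" + hex.EncodeToString(h[:])[:hashLen]
}

func main() {
	deps := []string{"node@20", "docker"} // illustrative dependency names
	fmt.Println(imageTag(deps, "host", 16))
	fmt.Println(imageTag(deps, "dind", 16)) // differs from the host tag
}
```

Sorting before hashing keeps the tag stable when the same dependencies are listed in a different order.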

Potential Bugs

  1. Race Condition in Docker Daemon Startup (moat-init.sh:181-184)

    • The script checks if socket exists AND docker info succeeds
    • There's a potential TOCTOU (time-of-check-time-of-use) race between checking the socket and running docker info
    • Mitigation: The loop continues if either check fails, so this is actually safe. Consider adding a comment explaining this
  2. Group Detection Logic (moat-init.sh:220-234)

    • Line 222: Uses stat -c which is GNU-specific (won't work on BSD/macOS)
    • This is fine since containers are Linux-based, but worth documenting
    • Consider error handling if stat fails (currently silent failure at line 223)
  3. Privileged Mode Auto-Enable (internal/run/docker.go:153)

    • DinD mode automatically sets Privileged: true
    • This is correct but worth highlighting in security documentation since privileged containers have significant security implications

Security Concerns

  1. Privileged Containers (DinD Mode)

    • ✅ Good: Documentation clearly states privileged mode is required
    • ✅ Good: Mode must be explicitly chosen (no default to privileged)
    • ⚠️ Consider: Add audit logging when privileged containers are created
    • ⚠️ Consider: The PR description says "when more isolation is required" but privileged mode actually reduces host isolation (though it isolates the Docker daemon). The docs correctly say "isolation from host Docker" but could be clearer about the privileged trade-off
  2. Host Mode Socket Access

    • ✅ Good: Clear documentation of security implications (full host Docker access)
    • ✅ Good: Runtime validation prevents Apple containers from using host mode
    • The trust model is well-documented ("reasonable trust of the agent")
  3. Shell Injection Prevention

    • Version validation in deps/parser.go:244-256 prevents shell injection in version strings
    • Docker commands in Dockerfile are properly structured
    • No interpolation of untrusted input into shell commands

Performance Considerations

  1. BuildKit Support (internal/deps/dockerfile.go:47-53)

    • ✅ Good: Defaults to BuildKit with cache mounts for faster builds
    • ✅ Good: Gracefully falls back to legacy builder when needed
    • The UseBuildKit *bool pattern allows explicit false (not just nil)
  2. DinD Startup Overhead

    • Documented as ~5-10 seconds
    • Uses vfs storage driver (most compatible but slowest)
    • Consider: Document option to use overlay2 if available (requires kernel support)
  3. Image Layer Caching

    • Docker installation is in its own layer (lines 312-339 of dockerfile.go)
    • Installing Docker CLI from official repo is separate from other packages
    • Could potentially combine apt operations but current approach is clearer
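
The `UseBuildKit *bool` pattern mentioned under BuildKit Support can be sketched as follows. The struct and function names are assumptions; the point is only that a pointer distinguishes "not set" from an explicit `false`.

```go
package main

import "fmt"

// BuildOptions is a hypothetical options struct. UseBuildKit is a pointer so
// that "unset" (nil) can be told apart from an explicit false.
type BuildOptions struct {
	UseBuildKit *bool
}

// buildKitEnabled applies the default (BuildKit on) only when the caller
// did not express a preference.
func buildKitEnabled(opts BuildOptions) bool {
	if opts.UseBuildKit == nil {
		return true
	}
	return *opts.UseBuildKit
}

func main() {
	off := false
	fmt.Println(buildKitEnabled(BuildOptions{}))                  // unset: default applies
	fmt.Println(buildKitEnabled(BuildOptions{UseBuildKit: &off})) // explicit false is honored
}
```

With a plain `bool` field, the zero value would be indistinguishable from an explicit opt-out, so a "default on" setting could not be expressed.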

Test Coverage

Excellent coverage overall:

  • ✅ Unit tests for parser (parser_test.go)
  • ✅ Unit tests for Docker dependency resolution (docker_test.go)
  • ✅ E2E tests for both modes (e2e/docker_test.go)
  • ✅ Tests for edge cases (empty list, missing mode, wrong runtime)

Suggestions:

  1. Add test for Docker daemon failing to start in DinD mode (simulate timeout)
  2. Add test for socket GID detection failing in host mode
  3. Add integration test verifying BuildKit disable in DinD mode (DOCKER_BUILDKIT=0)

Documentation Quality

Strengths:

  • Comprehensive docs in 02-agent-yaml.md with clear examples
  • Tradeoffs section explains security/performance implications
  • Design doc (docs/plans/2026-01-26-docker-dependency-design.md) captures rationale
  • Error messages are actionable (specify exactly what to do)

Minor issues:

  1. Line 204 of docs says "when you trust the agent" but commit message says "when there's reasonable trust" - consider consistent phrasing
  2. Runtime requirements section could link to docs on --runtime flag for better discoverability

Architectural Considerations

  1. Type System Design

    • DockerMode as a distinct type is excellent
    • Registry entry for docker is minimal (type: docker) since behavior is mode-specific
    • Consider: Should TypeDocker even be in the registry? It's parsed differently than other deps
  2. Initialization Script Complexity

    • moat-init.sh is now 260 lines handling multiple concerns (SSH, Claude, Codex, Docker)
    • Consider: Breaking into separate sourced scripts per feature for maintainability
    • Current approach is acceptable but may become unwieldy with more features
  3. Error Messages

    • Following CLAUDE.md guidelines excellently
    • Example: ErrDockerHostRequiresDockerRuntime tells user exactly what went wrong and how to fix it
    • No searching docs required

Specific Code Issues

  1. Dockerfile.go Line 173 - Missing error check

    # Should check if dockerd PID is valid before kill -0
    if DOCKERD_PID is empty, kill -0 will fail

    Actually this is in shell script, not Go. The shell handles this fine.

  2. Parser.go Line 63 - Error message quality

    return Dependency{}, fmt.Errorf("docker dependency requires explicit mode: use 'docker:host' or 'docker:dind'")

    ✅ Perfect - actionable, specific, helpful

  3. docker.go Line 104-114 - Platform-specific code

    stat, ok := info.Sys().(*syscall.Stat_t)
    if !ok {
        return 0, fmt.Errorf("failed to get docker socket stats (unsupported platform)")
    }

    This is Linux-specific but that's fine since containers run on Linux. Good error message.
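
For context, a self-contained sketch of the socket GID lookup built around the quoted type assertion. The function name `socketGID` is an assumption, and the snippet only compiles on Unix-like platforms where `syscall.Stat_t` exists.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// socketGID returns the ID of the group that owns path. It relies on
// syscall.Stat_t, so it is limited to Unix-like platforms.
func socketGID(path string) (uint32, error) {
	info, err := os.Stat(path)
	if err != nil {
		return 0, fmt.Errorf("stat %s: %w", path, err)
	}
	stat, ok := info.Sys().(*syscall.Stat_t)
	if !ok {
		return 0, fmt.Errorf("failed to get stats for %s (unsupported platform)", path)
	}
	return stat.Gid, nil
}

func main() {
	gid, err := socketGID("/var/run/docker.sock")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("docker socket gid:", gid)
}
```

Returning an error instead of a zero GID matters here: GID 0 is the root group, so a silent fallback would be indistinguishable from a real result.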

Breaking Changes

  • None. This is a pure addition with no impact on existing functionality.
  • Backward compatible: existing agent.yaml files work unchanged

Recommendations

Must Fix:

  • None identified - code is production-ready

Should Consider:

  1. Add audit logging when privileged containers are created (for security compliance)
  2. Extract magic numbers from moat-init.sh to named constants
  3. Add more error handling in socket GID detection (line 222 of moat-init.sh)
  4. Consider longer hash for image tags (16 chars instead of 12)

Nice to Have:

  1. Document overlay2 storage driver option for DinD (faster than vfs)
  2. Add test for daemon startup timeout
  3. Consider breaking moat-init.sh into modular scripts as complexity grows

Overall Assessment

This is high-quality work that follows the project's conventions and style guide. The implementation is secure, well-tested, and thoroughly documented. The two-mode approach (host vs dind) gives users appropriate control over the security/performance trade-off.

Recommendation: Approve with minor suggestions

The code is ready to merge. The suggestions above are enhancements rather than blockers. The feature works correctly, is secure, and will serve users well.

Additional Notes

  • Commit message follows Conventional Commits format ✅
  • No Co-Authored-By line (following CLAUDE.md guidelines) ✅
  • All dependencies properly tracked (iptables for DinD, openssh-client reused from SSH grants)
  • Image selection logic correctly handles both modes without conflicts

dpup force-pushed the docker-in-docker branch from 7d66cf4 to afa3b0b on January 27, 2026 01:48

claude Bot commented Jan 27, 2026

📦 Previous Review (superseded by newer review)

Pull Request Review: Docker Dependency Support

This PR adds comprehensive Docker support inside moat containers through two modes: docker:host (socket mounting) and docker:dind (isolated daemon). The implementation is well-architected with strong test coverage and excellent documentation.

Code Quality: Excellent

Strengths:

  • Clean separation of concerns: Docker logic is well-isolated in dedicated files (internal/run/docker.go, mode handling in moat-init.sh)
  • Type safety: The DockerMode type with explicit mode requirements prevents misconfiguration
  • Comprehensive error messages: Clear, actionable guidance for users (e.g., Apple containers incompatibility)
  • Consistent naming: Functions follow Go conventions with clear intent (HasDockerDependency, GetDockerDependency, ResolveDockerDependency)
  • Well-documented code: Inline comments explain the "why" (e.g., BuildKit disabled in dind, security model in proxy)

Minor observations:

  1. In internal/deps/parser.go:62-64, the error for bare docker (without mode) could be caught earlier in the flow, but current placement is acceptable
  2. The containsString helper in internal/run/docker_test.go:225-236 reimplements strings.Contains - consider removing

Test Coverage: Comprehensive

Excellent coverage across layers:

  • Unit tests: 440+ lines in docker_test.go, 160+ in parser_test.go covering all modes and edge cases
  • E2E tests: 400+ lines with 3 test scenarios (host mode, dind mode, isolation verification)
  • Integration tests: Well-structured with proper cleanup and timeout handling

Particularly well-tested:

  • Runtime validation (Apple vs Docker)
  • Socket GID detection and group permissions
  • Mode parsing and validation
  • Error message content verification
  • Isolation between dind and host containers

Security Considerations

Strengths:

  1. Explicit mode requirement: Users must choose between docker:host or docker:dind, forcing security-conscious decisions
  2. Clear security tradeoffs documented: Host mode risks clearly explained in docs and design document
  3. Proper privilege escalation: Dind mode automatically sets privileged flag
  4. Socket permissions: Proper GID detection and group membership management

Recommendations:

  1. Consider audit logging: Docker operations (especially docker:host) provide significant privileges. Consider adding audit events when docker dependency is used, similar to SSH grants

    • Location: internal/run/manager.go where ResolveDockerDependency is called
    • Event type: Could be audit.EventContainerStart with docker mode in metadata
  2. Document security model explicitly: While the design doc mentions trust model, consider adding a security callout in docs/content/reference/02-agent-yaml.md about:

    • Host mode gives agent full Docker daemon access (can see/modify host containers)
    • Dind mode requires privileged containers (kernel capability implications)

Performance Considerations

Well-handled:

  1. BuildKit cache mounts: Used appropriately in Dockerfile generation
  2. Image layer optimization: Docker packages installed in separate layer for caching
  3. Dind startup time documented: Clear expectations set (~5-10s) in docs and comments
  4. BuildKit disabled in dind: Correctly addresses session management issues

Note on dind performance:
The 30-second timeout (internal/deps/scripts/moat-init.sh:15) is reasonable but may be tight for slow systems. Current implementation handles this well with clear error messages showing socket state.

Potential Issues

Critical: None found

Minor issues:

  1. Race condition potential in moat-init.sh:186-201: The dind daemon readiness check uses a polling loop that checks both socket existence AND docker info. While functionally correct, there is a theoretical race where:

    • dockerd crashes between socket check and docker info
    • The error is caught (good!) but could be more granular

    Recommendation: Consider logging dockerd output continuously rather than only on error, or add a health check endpoint.

  2. Inconsistent error type naming (internal/run/docker.go:49-51):

    type ErrDockerRequiresDockerRuntime = ErrDockerHostRequiresDockerRuntime

    This alias is marked deprecated but still exported. Consider making it unexported or removing in a follow-up PR.

  3. Missing test case: TestResolveDockerDependency_DockerRuntime_DindMode (line 353) does not verify the SocketMount is explicitly empty/zero-valued. Add assertion to confirm the struct is zeroed out for dind mode.

  4. Documentation gap: The PR description mentions BuildKit disabled with DOCKER_BUILDKIT=0, but I do not see this env var set in the code. Verify this is set or update docs if the approach changed.

Architecture & Design

Excellent architectural decisions:

  1. Mode-based approach: Explicit host vs dind is cleaner than trying to auto-detect or provide a "safe" default
  2. Runtime validation at Create time: Failing fast prevents wasted image build time
  3. Graceful degradation: Clear error messages guide users to working configurations
  4. Separation from main manager: internal/run/docker.go keeps Docker concerns isolated

Design consistency:

  • Follows existing patterns for dependency handling
  • Image tag includes DockerMode (prevents cache collisions)
  • moat-init script handles all privileged setup before dropping to moatuser

Documentation: Outstanding

Reference docs (02-agent-yaml.md):

  • Clear tradeoff tables for each mode
  • Security considerations highlighted
  • Runtime requirements explicitly stated
  • Example usage provided

Design document:

  • Rationale clearly explained
  • Alternative approaches considered
  • Implementation details thorough

Inline code comments:

  • Explain non-obvious decisions (vfs storage driver choice, BuildKit disabled)
  • Security model documented in proxy.go
  • Known limitations called out

Suggestions for Follow-up Work

  1. Metrics/observability: Consider adding metrics for:

    • Docker dependency usage by mode
    • Dind startup time distribution
    • Socket permission errors
  2. Dockerfile optimization: In dind mode, the daemon logs to /var/log/dockerd.log. Consider adding log rotation or size limits for long-running containers.

  3. Testing enhancement: Add chaos test for dind daemon crash/restart scenarios.

  4. Feature consideration: For very high isolation needs, document alternative of using remote Docker daemon via DOCKER_HOST env var (would not require either mode).

Summary

This is high-quality production-ready code. The implementation is thorough, well-tested, and properly documented. The design decisions are sound and the security tradeoffs are clearly communicated.

Recommendation: Approve with minor suggestions for follow-up

The minor issues noted above are non-blocking and could be addressed in follow-up PRs. The core functionality is solid and ready to merge.

Great work on:

  • Comprehensive test coverage (unit + e2e + integration)
  • Excellent error messages that guide users to solutions
  • Clear documentation of security tradeoffs
  • Proper handling of both Docker and Apple container runtimes
  • Isolated, maintainable code structure

dpup force-pushed the docker-in-docker branch from afa3b0b to 65516e5 on January 27, 2026 01:50

claude Bot commented Jan 27, 2026

📦 Previous Review (superseded by newer review)

Pull Request Review: Docker Dependency for Container Access

This is a well-implemented feature that adds Docker access inside moat containers with two distinct modes. The implementation is thorough, well-tested, and follows the project's conventions.

✅ Strengths

1. Excellent Architecture & Design

  • Clear separation between two modes (docker:host and docker:dind) with explicit mode requirement prevents accidental misconfiguration
  • Smart use of existing infrastructure (moat-init script, image tagging, audit logging)
  • Good runtime validation (Apple containers blocked for both modes with clear error messages)
  • Thoughtful security model with audit logging for privileged containers

2. Comprehensive Testing

  • 2,709 lines of test code across unit tests (parser_test.go:824, dockerfile_test.go:1,039, docker_test.go:440) and e2e tests (docker_test.go:406)
  • Good coverage of edge cases: missing modes, runtime incompatibilities, GID detection failures
  • E2e tests verify actual Docker functionality (running containers, isolation between modes)

3. Security Considerations

  • Privileged mode only enabled when required (dind mode)
  • Audit logging captures privileged container creation with reason (docker:dind)
  • Group-based socket access in host mode (no chmod 777 shortcuts)
  • Docker mode included in image tag hash to prevent cache collisions between modes

4. Documentation Quality

  • Clear tradeoffs documented for each mode
  • Design doc captures rationale and rejected alternatives
  • Good user-facing error messages with actionable steps

5. Code Quality

  • Clean separation of concerns (parser.go, docker.go, dockerfile.go)
  • Consistent error handling patterns
  • Good use of constants and typed enums

🔍 Issues & Recommendations

1. Security: BuildKit Disabled Globally in Dind Mode ⚠️

// internal/run/manager.go:879-880
proxyEnv = append(proxyEnv, "DOCKER_BUILDKIT=0")
proxyEnv = append(proxyEnv, "MOAT_DISABLE_BUILDKIT=1")

Issue: BuildKit is disabled globally for dind mode due to session management issues, but this affects performance and functionality. The vfs storage driver is already slow; disabling BuildKit makes it worse.

Recommendation:

  • Document this limitation prominently in the user-facing docs (currently only mentioned in PR description)
  • Consider if there's a way to scope the BuildKit disable to specific scenarios
  • Add a warning log when BuildKit is disabled so users understand the performance impact

2. Potential Race Condition in moat-init.sh

# internal/deps/scripts/moat-init.sh:186-201
while [ "$DIND_WAITED" -lt "$DIND_TIMEOUT_SECONDS" ]; do
  if [ -S /var/run/docker.sock ] && docker info >/dev/null 2>&1; then
    break
  fi
  if ! kill -0 "$DOCKERD_PID" 2>/dev/null; then
    echo "Error: Docker daemon failed to start" >&2
    exit 1
  fi
  sleep 1
done

Issue: The docker info check runs as root before group permissions are set up. This works, but there's a subtle timing window where if dockerd becomes ready between the loop check and the group setup, the non-root user might not have access yet.

Recommendation: This is likely fine in practice (30s timeout provides plenty of buffer), but consider adding a final verification after group setup or documenting that the readiness check runs as root.
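
The polling logic in the shell loop above can be expressed in Go roughly as follows. This is a sketch only: the project does this in moat-init.sh, and the `ready` and `alive` callbacks are hypothetical stand-ins for "socket exists and `docker info` succeeds" and `kill -0 "$DOCKERD_PID"`.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var (
	errDaemonExited = errors.New("docker daemon exited before becoming ready")
	errTimeout      = errors.New("timed out waiting for docker daemon")
)

// waitReady checks ready() up to maxAttempts times, sleeping interval
// between attempts. It stops early if alive() reports that the daemon
// process has died, mirroring the kill -0 check in the shell loop.
func waitReady(maxAttempts int, interval time.Duration, ready, alive func() bool) error {
	for i := 0; i < maxAttempts; i++ {
		if ready() {
			return nil
		}
		if !alive() {
			return errDaemonExited
		}
		time.Sleep(interval)
	}
	return errTimeout
}

func main() {
	attempts := 0
	// Simulate a daemon that becomes ready on the third poll.
	ready := func() bool { attempts++; return attempts >= 3 }
	alive := func() bool { return true }
	err := waitReady(30, 10*time.Millisecond, ready, alive)
	fmt.Println("attempts:", attempts, "err:", err)
}
```

The final verification suggested above could reuse the same function after group setup, with a `ready` callback that runs the check as the non-root user instead of root.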

3. Image Tag Hash Length Increased

// internal/deps/builder.go:79
hash := hex.EncodeToString(h[:])[:16]  // was [:12]

Good: The comment explains the reasoning (collision probability). This change affects all users, not just Docker dependency users.

Question: Was this tested with existing images? Image tag changes could trigger rebuilds. Consider mentioning this in the PR description or commit message.

4. Error Message Inconsistency

// internal/run/docker.go:41-46
return `'docker:host' dependency requires Docker runtime

Apple containers cannot access the host Docker socket.
Either:
  - Use 'docker:dind' mode (runs isolated Docker daemon), or
  - Use Docker runtime: moat run --runtime docker`

Issue: First bullet suggests dind mode, but dind mode also requires Docker runtime (lines 57-62). This could confuse users on macOS.

Recommendation: Update error message:

Either:
  - Use Docker Desktop and force Docker runtime: moat run --runtime docker, or
  - Use 'docker:dind' mode with Docker runtime: moat run --runtime docker
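
One way to keep the two messages from contradicting each other is a single check shared by both modes. The sketch below assumes runtime names `docker` and `apple` and a hypothetical function name `validateDockerRuntime`; only the `moat run --runtime docker` hint is taken from the PR.

```go
package main

import "fmt"

// validateDockerRuntime is a hypothetical check run before any image build.
// The runtime names ("docker", "apple") are assumptions for illustration.
func validateDockerRuntime(mode, runtime string) error {
	if runtime == "docker" {
		return nil
	}
	// Both modes need the Docker runtime, so the message never suggests
	// switching modes as a workaround.
	return fmt.Errorf("'docker:%s' dependency requires Docker runtime (current runtime: %s)\n"+
		"Use Docker runtime: moat run --runtime docker", mode, runtime)
}

func main() {
	for _, mode := range []string{"host", "dind"} {
		fmt.Printf("%s on docker: %v\n", mode, validateDockerRuntime(mode, "docker"))
		fmt.Printf("%s on apple: %v\n\n", mode, validateDockerRuntime(mode, "apple"))
	}
}
```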

5. Missing Test: Docker Mode Validation in ParseAll

The parser has good tests for individual parsing, but I don't see a test that verifies ParseAll correctly rejects docker without a mode.

Recommendation: Add a test case:

{[]string{"docker"}, true, "requires explicit mode"}

6. Dockerfile Docker CLI Installation

// internal/deps/dockerfile.go:318-338
b.WriteString("RUN install -m 0755 -d /etc/apt/keyrings \\\n")
// ... adds Docker's official repo ...

Good: Using official Docker repo ensures up-to-date CLI. However, this ties images to Debian-based distros.

Question: What happens if a user specifies a non-Debian base image in the future? Consider documenting this constraint or adding validation.

7. Privileged Container Warning

// internal/run/manager.go:1461-1463
containerAuditData.Privileged = true
containerAuditData.Reason = "docker:dind"

Good: Audit logging captures this, but users might not check audit logs.

Recommendation: Add a user-visible warning when starting a dind container:

⚠️  Starting privileged container for docker:dind mode

8. Potential File Descriptor Leak

# internal/deps/scripts/moat-init.sh:179
dockerd --storage-driver=vfs --log-level=warn >/var/log/dockerd.log 2>&1 &

The dockerd process is started in the background but never explicitly reaped. This is probably fine (it should exit when the container exits), but consider whether a trap handler is needed.

📋 Minor Issues

9. Typo in Comment

// internal/deps/builder.go:76
// Use 16 chars (64 bits) for sufficiently low collision probability

16 hex chars = 64 bits ✓ (this is correct, not an issue)
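
As a worked check of the 12-versus-16-character question, the birthday approximation gives the chance of any collision among n random tags. The function name is an assumption, and the model treats truncated SHA-256 output as uniformly random.

```go
package main

import (
	"fmt"
	"math"
)

// collisionProbability approximates the chance that at least two of n
// uniformly random tags collide when each tag has hexChars hex digits,
// using the birthday approximation 1 - exp(-n(n-1) / (2 * 16^hexChars)).
func collisionProbability(n float64, hexChars int) float64 {
	space := math.Pow(16, float64(hexChars))
	// -Expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
	return -math.Expm1(-n * (n - 1) / (2 * space))
}

func main() {
	for _, n := range []float64{1e3, 1e6} {
		fmt.Printf("n=%.0e  12 chars: %.3g  16 chars: %.3g\n",
			n, collisionProbability(n, 12), collisionProbability(n, 16))
	}
}
```

Each extra hex character multiplies the tag space by 16, so moving from 12 to 16 characters lowers the collision probability by a factor of about 65,536 while it remains small.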

10. Magic Numbers

# internal/deps/scripts/moat-init.sh:14-15
SSH_SOCKET_WAIT_ITERS=20     # iterations * 0.1s = 2 second timeout
DIND_TIMEOUT_SECONDS=30      # timeout for Docker daemon startup

Good: These are defined as constants at the top of the file. Consider if 30s is sufficient for slow systems (CI environments, slow disks). The vfs driver is notably slow to initialize.

🎯 Performance Considerations

11. Vfs Storage Driver Performance

The design doc mentions vfs is used for compatibility, but it's significantly slower than overlay2. Users on modern kernels could benefit from overlay2.

Recommendation: Consider detecting kernel support and using overlay2 when available, falling back to vfs. Or document that users can customize the storage driver via DOCKER_OPTS if needed.
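
A rough sketch of what such detection could look like, assuming that an `overlay` entry in /proc/filesystems is a usable signal. That assumption is not verified here: the overlay module may be supported but not yet loaded, and overlay2 can still fail inside a nested container, so a real implementation would need a fallback to vfs if dockerd fails to start. The function name `pickStorageDriver` is hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// pickStorageDriver returns "overlay2" when the given /proc/filesystems
// content lists the overlay filesystem, and "vfs" otherwise.
func pickStorageDriver(procFilesystems string) string {
	sc := bufio.NewScanner(strings.NewReader(procFilesystems))
	for sc.Scan() {
		// Lines look like "nodev\toverlay" or "\text4"; the filesystem
		// name is always the last field.
		fields := strings.Fields(sc.Text())
		if len(fields) > 0 && fields[len(fields)-1] == "overlay" {
			return "overlay2"
		}
	}
	return "vfs"
}

func main() {
	data, err := os.ReadFile("/proc/filesystems")
	if err != nil {
		fmt.Println("storage driver: vfs (could not read /proc/filesystems)")
		return
	}
	fmt.Println("storage driver:", pickStorageDriver(string(data)))
}
```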

🔒 Security Assessment

Overall: The security model is sound:

  • ✅ Privileged mode only when necessary
  • ✅ Audit logging for compliance
  • ✅ Group-based socket access (no world-writable sockets)
  • ✅ Explicit mode requirement prevents accidents
  • ✅ Clear error messages guide users to correct configuration

Concerns:

  • Running privileged containers is inherently risky. The PR correctly documents this and provides dind as an isolation option.
  • Host mode gives containers access to host Docker daemon (can break out). This is acceptable given moat's trust model, but worth documenting more prominently.

📚 Documentation Recommendations

12. Missing from User Docs:

  • BuildKit disabled in dind mode (performance impact)
  • Storage driver choice (vfs) and why
  • Security implications of host mode (can access host containers/network)
  • Startup time expectations ("~5-10s" is vague; CI users need to budget time)

✨ Nice Touches

  • Clear error messages with examples
  • Consistent use of structured logging
  • Good separation between unit and e2e tests
  • Thoughtful image tag collision prevention
  • Clean code organization

🎬 Conclusion

This is a high-quality implementation of a useful feature. The code is well-structured, thoroughly tested, and follows project conventions. The main issues are:

  1. Documentation gaps (BuildKit disabled, security implications)
  2. Error message clarity (dind suggestion on macOS when it also requires Docker runtime)
  3. User visibility (no warning for privileged containers)

These are all addressable with minor changes and don't block merging.

Recommendation: Approve with minor revisions to documentation and error messages.


Suggested Changes

  1. Update error message in internal/run/docker.go:41-46 to clarify both modes require Docker runtime on macOS
  2. Add user-visible warning when starting privileged containers
  3. Document BuildKit limitation in docs/content/reference/02-agent-yaml.md
  4. Consider adding test case for docker without mode in ParseAll
  5. Add note about storage driver customization options

Great work on this feature! The two-mode approach provides good flexibility, and the implementation is solid.

dpup and others added 25 commits January 27, 2026 02:08

Add support for running Docker commands inside moat containers via two modes:

**docker:host** - Mounts the host Docker socket
- Fast startup, shared image cache with host
- Full access to host Docker daemon
- Use when you trust the agent and want speed

**docker:dind** - Runs isolated Docker daemon inside container
- Complete isolation from host Docker
- Requires privileged mode (set automatically)
- Slower startup (~5-10s for daemon init)
- Use for untrusted code or when isolation is required

Key implementation details:
- DockerMode type with host/dind variants (explicit mode required)
- moat-init script handles setup for both modes:
  - Host: detects socket GID inside container, adds moatuser to group
  - Dind: starts dockerd, waits for readiness, adds moatuser to docker group
- Docker CLI installed from official Docker repo (docker-ce-cli)
- Dind mode also installs docker-ce, containerd.io, docker-buildx-plugin
- BuildKit disabled in dind mode (DOCKER_BUILDKIT=0) to avoid session issues
- Image tags include DockerMode to prevent cache collisions

Example usage in agent.yaml:
```yaml
dependencies:
  - docker:host  # or docker:dind
```

Add design document for automatic BuildKit sidecar with docker:dind mode.
Key points:
- BuildKit sidecar automatically deployed with docker:dind
- Shared Docker network for container communication
- Fast builds (buildkit) + full Docker daemon (runtime ops)
- No user configuration required

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Update RemoveNetwork to match RemoveContainer pattern by checking errdefs.IsNotFound
- Add test case TestDockerRuntime_RemoveNetwork_NotFound to verify best-effort behavior
- Add comment explaining bridge driver choice in CreateNetwork

This ensures the documentation matches actual behavior: RemoveNetwork is truly
best-effort and won't fail when the network doesn't exist.

# Conflicts:
#	internal/container/docker.go

Detailed plan to fix image building by using BuildKit Go client
instead of Docker SDK when BuildKit sidecar is active.

Key points:
- Use github.com/moby/buildkit/client for proper BuildKit support
- Automatic fallback to Docker SDK when BuildKit not available
- Full feature support: cache mounts, multi-platform, progress streaming
- Backward compatible with existing docker:host configurations

Enhance error handling and observability for BuildKit integration:

Error Messages:
- BuildKit connection failures now suggest checking docker:dind sidecar status
  and BUILDKIT_HOST configuration
- Build failures include context about Dockerfile syntax and build context path
- Ping failures provide actionable suggestions for network troubleshooting

Debug Logging:
- Log routing decision (BuildKit vs Docker SDK) with relevant parameters
- Log BuildKit connection address during client creation
- Log build options (tag, platform, no_cache, context_dir) when starting build
- Log successful operations (ping, build completion) for debugging
- Log errors with tag context for easier troubleshooting

All logging follows existing patterns from internal/log package using
structured key-value pairs at appropriate levels (Debug/Error).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Add comprehensive integration and E2E tests for BuildKit integration:

**Integration tests (internal/buildkit/client_test.go):**
- TestClient_Build: Basic image build with BuildKit
- TestClient_BuildWithArgs: Build with build arguments
- TestClient_BuildWithInvalidDockerfile: Error handling for invalid Dockerfile
- TestClient_BuildWithUnreachableBuildKit: Error handling for unreachable BuildKit

All integration tests skip gracefully when BUILDKIT_HOST is not set,
following the pattern established by TestClient_Ping.

**E2E tests (internal/e2e/docker_test.go):**
- TestDockerDindBuildKitImageBuild: Full end-to-end test that:
  - Verifies BUILDKIT_HOST is set correctly in docker:dind containers
  - Builds an image using BuildKit
  - Verifies the built image exists and can be run
  - Tests the complete BuildKit integration flow

**Test fixes:**
- Fixed TestDockerRuntime_BuildImage_PathSelection to handle
  BuildKit connection errors correctly

All tests pass locally and skip appropriately when dependencies
(BuildKit, Docker) are not available.

Add BuildKit client usage documentation to agent.yaml reference:
- Explain how BuildKit integration works with docker:dind
- Document fallback behavior when BuildKit is unavailable
- Add troubleshooting section for common error messages
- Clarify BuildKit vs Docker SDK usage patterns

dpup added 9 commits January 27, 2026 18:38

Fixes two critical issues preventing BuildKit from working:

1. **Proxy bypass for BuildKit connections**: Added 'buildkit' hostname
   to NO_PROXY list. Without this, BuildKit gRPC connections were being
   routed through the HTTP proxy, causing "error reading server preface:
   EOF" errors because the proxy doesn't understand gRPC protocol.

2. **Privileged mode for BuildKit sidecar**: BuildKit requires privileged
   mode to perform bind mounts for its snapshot operations. Without this,
   builds fail with "operation not permitted" when BuildKit tries to
   mount its internal filesystem snapshots.

Changes:
- internal/run/manager.go: Add 'buildkit' to NO_PROXY list
- internal/container/runtime.go: Add Privileged field to SidecarConfig
- internal/container/docker.go: Honor Privileged field in StartSidecar
- internal/run/manager.go: Set Privileged=true for BuildKit sidecar

These fixes allow BuildKit to communicate directly without proxy
interference and have the necessary permissions to manage build contexts.
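
The proxy bypass described in this commit can be sketched as follows. This is a minimal illustration, not the code in internal/run/manager.go: the helper names and the example host address are assumptions, and the matching is deliberately simplified to exact hostnames.

```go
package main

import (
	"fmt"
	"strings"
)

// buildNoProxy joins the host address with the names that must bypass the
// HTTP proxy. "buildkit" is the sidecar hostname named in the commit message.
// The function name is hypothetical.
func buildNoProxy(hostAddr string) string {
	return strings.Join([]string{hostAddr, "localhost", "127.0.0.1", "buildkit"}, ",")
}

// bypassesProxy reports whether host appears in a comma-separated NO_PROXY
// value. Simplified: exact, case-insensitive match only; real NO_PROXY
// handling also covers domain suffixes, CIDR ranges, and ports.
func bypassesProxy(noProxy, host string) bool {
	for _, entry := range strings.Split(noProxy, ",") {
		if strings.EqualFold(strings.TrimSpace(entry), host) {
			return true
		}
	}
	return false
}

func main() {
	// "host.docker.internal" is an assumed example value for the host address.
	noProxy := buildNoProxy("host.docker.internal")
	fmt.Println("NO_PROXY =", noProxy)
	fmt.Println("buildkit bypasses proxy:", bypassesProxy(noProxy, "buildkit"))
	fmt.Println("registry.example.com bypasses proxy:", bypassesProxy(noProxy, "registry.example.com"))
}
```

With "buildkit" in the list, gRPC connections to the sidecar go direct instead of through an HTTP proxy that cannot carry them.
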
Fixes image export failure in docker:dind mode. BuildKit sidecar now
mounts the dind Docker socket to export built images to the daemon.

Problem:
- BuildKit would build images successfully but couldn't export them
- Docker daemon would try to pull "moat/run:xxx" from Docker Hub
- Error: "pull access denied for moat/run, repository does not exist"

Root cause:
- BuildKit's "image" exporter requires access to Docker API to export
- Without socket access, images stayed in BuildKit's cache
- Docker daemon had no way to access the built images

Solution:
- Mount /var/run/docker.sock from dind into BuildKit sidecar
- This is the dind container's socket, NOT the host's socket
- Maintains isolation: BuildKit can only affect dind daemon, not host
- Allows BuildKit to export images directly to dind daemon

Security note:
- Socket mounting only affects docker:dind mode (already privileged)
- BuildKit mounts the isolated dind socket, not host socket
- docker:host mode remains unchanged (mounts host socket directly)
- dind provides better isolation: cannot access host filesystem

Changes:
- internal/container/runtime.go: Add Mounts field to SidecarConfig
- internal/container/docker.go: Process mounts in StartSidecar
- internal/run/manager.go: Mount docker socket to BuildKit sidecar
The 'image' exporter writes to containerd image store, not Docker daemon.
Use 'docker' exporter which writes directly to Docker daemon via socket.

This ensures images built by BuildKit are immediately available to the
Docker daemon for running containers, fixing 'pull access denied' errors.
BuildKit sidecar needs access to the same /tmp directory where build
contexts are created. Without this, BuildKit cannot read the Dockerfile
and context files, resulting in 'method not supported' errors.

The build flow:
1. Main container creates temp dir: /tmp/moat-build-*
2. Main container writes Dockerfile to temp dir
3. Main container calls BuildKit with path to temp dir
4. BuildKit reads files from temp dir (needs access!)

Both containers now share /tmp, allowing BuildKit to access build
contexts created by the main container.
Replace deprecated LocalDirs with proper session-based file sync.
This fixes the 'method not supported by the client' error.

The issue:
- LocalDirs is deprecated and doesn't work reliably
- BuildKit couldn't access build context via LocalDirs
- Error: 'method /moby.filesync.v1.FileSend/diffcopy not supported'

The solution:
- Create a BuildKit session for each build
- Attach filesync.NewFSSyncProvider with StaticDirSource
- Upload build context from client to BuildKit via session protocol
- This is the modern, supported way to provide context to BuildKit

The build context is now properly uploaded to BuildKit over the
session protocol, regardless of filesystem boundaries between
containers.
Changes Docker SDK build behavior to use legacy builder (BuilderV1) by
default instead of BuilderBuildKit. This fixes 'no active sessions'
errors on Docker Desktop and other environments where Docker SDK's
BuildKit integration is unreliable.

The problem:
- Docker SDK's BuildKit integration (BuilderBuildKit flag) is buggy
- Causes 'no active sessions' errors with buildx and Docker Desktop
- Users had to set MOAT_DISABLE_BUILDKIT=1 workaround

The solution:
- docker:dind mode: Uses our BuildKit client (fast, reliable)
- docker:host mode: Uses legacy builder (slower, but works everywhere)
- Users can opt-in to Docker SDK's BuildKit via DOCKER_BUILDKIT=1

This ensures moat works out-of-the-box on all platforms while still
providing the fast BuildKit path for docker:dind users.

Future work: Detect and connect to buildx's BuildKit for docker:host
mode to get BuildKit benefits without docker:dind overhead.
Reverts the previous change that made legacy builder the default.
BuildKit should be the default for better performance - the 'no active
sessions' issue was environment-specific and can be worked around with
MOAT_DISABLE_BUILDKIT=1.

This restores the original behavior where:
- Default: Uses BuildKit (BuilderBuildKit) for fast builds
- Opt-out: Set MOAT_DISABLE_BUILDKIT=1 for legacy builder

The previous commit incorrectly assumed all Docker Desktop setups
have broken BuildKit, when it's actually working for most users.
Fixes "method /moby.filesync.v1.FileSend/diffcopy not supported by the
client" error that occurred during BuildKit image export phase.

Root cause: The docker exporter requires an Output function to handle
the tar stream. Manual session management created race conditions and
complexity.

Solution:
- Use LocalMounts for build context (BuildKit auto-manages session)
- Provide Output function that pipes tar stream to `docker load`
- Remove manual session lifecycle management
- Stream image directly without intermediate files

This simplifies the BuildKit integration while fixing nested moat
execution (moat-in-moat) and following BuildKit's recommended patterns.

Tested: All E2E tests pass including TestDockerDindBuildKitImageBuild
and all dependency runtime tests.
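
The "pipe the tar stream to docker load" idea can be sketched with the standard library alone. This is an assumption-laden illustration rather than the code in internal/buildkit/client.go: newLoadWriter, dockerLoad, and countBytes are hypothetical names, and the writer shown is only the kind of io.WriteCloser an exporter output function could hand back.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"os/exec"
	"sync"
)

// loadWriter forwards writes into a pipe whose read end is drained by a
// consumer goroutine (for example `docker load`).
type loadWriter struct {
	pw   *io.PipeWriter
	once sync.Once
	done chan error
	err  error
}

func (w *loadWriter) Write(p []byte) (int, error) { return w.pw.Write(p) }

// Close signals end of stream and waits for the consumer to finish, so the
// caller learns whether the image was actually loaded. Safe to call twice.
func (w *loadWriter) Close() error {
	w.once.Do(func() {
		w.pw.Close()
		w.err = <-w.done
	})
	return w.err
}

// newLoadWriter starts consume in a goroutine and returns the write end.
func newLoadWriter(consume func(io.Reader) error) io.WriteCloser {
	pr, pw := io.Pipe()
	w := &loadWriter{pw: pw, done: make(chan error, 1)}
	go func() {
		err := consume(pr)
		pr.CloseWithError(err) // unblock the writer if the consumer stops early
		w.done <- err
	}()
	return w
}

// dockerLoad feeds the stream to `docker load`, which reads a tar from stdin.
func dockerLoad(r io.Reader) error {
	cmd := exec.Command("docker", "load")
	cmd.Stdin = r
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

// countBytes is a stand-in consumer so the sketch runs without Docker.
func countBytes(n *int64) func(io.Reader) error {
	return func(r io.Reader) error {
		written, err := io.Copy(io.Discard, r)
		*n = written
		return err
	}
}

func main() {
	var n int64
	w := newLoadWriter(countBytes(&n)) // swap in dockerLoad to feed a real daemon
	if _, err := w.Write([]byte("fake tar bytes")); err != nil {
		fmt.Println("write failed:", err)
		os.Exit(1)
	}
	if err := w.Close(); err != nil {
		fmt.Println("load failed:", err)
		os.Exit(1)
	}
	fmt.Println("bytes streamed:", n)
}
```

Because the stream goes straight through the pipe, no intermediate tar file is written, which matches the "stream image directly" point above.
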
@dpup dpup force-pushed the docker-in-docker branch 2 times, most recently from 6a9e84c to cdcbee3 Compare January 27, 2026 23:33
@claude

claude Bot commented Jan 27, 2026

📦 Previous Review (superseded by newer review)

Pull Request Review: Docker Dependency Support

This PR adds comprehensive Docker-in-container support with two distinct modes (docker:host and docker:dind). The implementation is well-designed with excellent test coverage and clear error messages.

Summary

What's Added:

  • Two Docker access modes: docker:host (socket mounting) and docker:dind (isolated daemon)
  • Automatic BuildKit sidecar for docker:dind mode to improve build performance
  • Runtime validation preventing incompatible mode/runtime combinations
  • Comprehensive E2E and unit test coverage
  • Clear, actionable error messages

Scope: 36 files changed, 5,942 additions, 77 deletions

Code Quality: Excellent A+

Strengths:

  1. Clean separation of concerns - Parser, Dockerfile generator, runtime resolver, and init script each have single responsibilities
  2. Explicit mode requirement - Bare docker rejected, forces choice between docker:host or docker:dind
  3. Excellent error messages - internal/run/docker.go:40-63 includes actionable remediation steps
  4. Robust security model - Host mode uses GID-based access control, dind mode uses privileged flag only when necessary
  5. Comprehensive test coverage - Unit tests cover all code paths, E2E tests validate both modes end-to-end
  6. BuildKit integration - Automatic sidecar for dind mode, streams directly to docker load

Security Analysis

Docker Socket Access (Host Mode):

  • Risk: Full access to host Docker daemon
  • Mitigations: Explicit mode selection, non-root user access, clear documentation
  • Recommendation: Acceptable for trusted agents

Privileged Containers (Dind Mode):

  • Risk: A privileged container can escape to the host
  • Mitigations: Explicit mode selection, only on Docker runtime, isolated daemon, audit logging, vfs storage driver
  • Recommendation: Acceptable for isolation scenarios

Init Script Security:

  • Mutual exclusion check prevents mode conflicts (moat-init.sh:160-165)
  • Refuses to run as root without moatuser (line 256-267)
  • Proper permission handling (600 for credentials, 755 for directories)
  • Socket GID detected inside container (handles Docker Desktop UID remapping)

Test Coverage: Excellent A+

Unit Tests:

  • Parser tests: Bare docker rejection, both modes accepted, invalid mode rejection
  • Resolver tests: Runtime compatibility, Apple container rejection, socket mount config, GID retrieval
  • Dockerfile tests: Package selection (docker-ce-cli vs full docker-ce), mode preservation

E2E Tests:

  • TestDockerDependency: Host mode end-to-end
  • TestDockerDependencyWithAppleRuntime: Graceful rejection
  • TestDockerDindDependency: Dind mode end-to-end
  • TestDockerDindIsolation: Isolation verified
  • TestDockerDindBuildKitImageBuild: BuildKit sidecar works

Coverage Gaps: None identified

Minor Recommendations

  1. BuildKit Image Pinning (internal/run/docker.go:202) - Pin moby/buildkit:latest to specific version for reproducibility
  2. DinD Timeout (moat-init.sh:15) - Consider extending from 30s to 60s for slow environments
  3. Documentation - Add user-facing guide for mode selection (when to use host vs dind)
  4. Storage Driver - Document why vfs was chosen, consider making configurable

Final Assessment

Overall Quality: Excellent ⭐⭐⭐⭐⭐

  • Code Quality: A+ (Clean architecture, excellent error handling, follows project conventions)
  • Test Coverage: A+ (Comprehensive unit and E2E tests)
  • Security: A (Appropriate privilege controls, audit logging, minor: BuildKit pinning)
  • Documentation: B+ (Agent YAML updated, design docs comprehensive, could use more user guidance)
  • Performance: A- (Host mode fast, dind mode acceptable startup time)

Approval

Status: ✅ Approved with minor recommendations

This PR is production-ready and can be merged. The recommendations above are minor improvements that can be addressed in follow-up PRs.

Recommended Follow-up Tasks:

  1. Pin BuildKit version
  2. Add user-facing documentation for mode selection
  3. Consider extending dind timeout to 60s

Security Note: The privileged mode and socket mounting features are appropriately gated behind explicit user choices with clear documentation. The security model is sound.

Great work on this implementation! The code quality is excellent, tests are comprehensive, and the design is clean. The explicit mode requirement and actionable error messages demonstrate thoughtful UX design.

@claude

claude Bot commented Jan 27, 2026

📦 Previous Review (superseded by newer review)

Pull Request Review: Docker Dependency Support

This PR adds Docker-in-Docker (dind) and docker:host support to moat containers. I've conducted a thorough review of the implementation across code quality, security, performance, and testing. Overall, this is a well-architected feature with strong attention to security and good test coverage. Below are my findings organized by category.


✅ Strengths

Architecture & Design

  • Excellent separation of concerns: The docker dependency logic is cleanly isolated in internal/run/docker.go with clear interfaces
  • Well-documented trade-offs: The PR description and code comments clearly explain host vs dind mode differences
  • Proper abstraction: DockerDependencyConfig provides a clean contract between dependency resolution and container configuration
  • Thoughtful error handling: Custom error types with actionable messages

Security

  • Privileged mode gating: Dind mode correctly requires privileged containers only when needed
  • Runtime validation: Proper checks prevent incompatible runtime/mode combinations
  • Socket permission handling: Host mode properly detects socket GID and manages group membership
  • Moat-init safety: Script verifies that MOAT_DOCKER_DIND and MOAT_DOCKER_GID are not both set

Code Quality

  • Comprehensive test coverage: E2E tests for both modes, unit tests for docker.go logic
  • Clear documentation: Inline comments explain complex flows
  • Following conventions: Adheres to CLAUDE.md guidelines on error messages and documentation

🔍 Issues & Recommendations

High Priority

1. BuildKit Sidecar Lifecycle Management Gap

The computeBuildKitConfig function in internal/run/docker.go:192-204 enables BuildKit for dind mode, but I don't see corresponding cleanup logic for the BuildKit sidecar container.

Issue: When Destroy is called on a run, the BuildKit sidecar container and network need to be cleaned up.

Recommendation: Add cleanup in Manager.Destroy for BuildKit sidecar and network removal.

2. Potential Socket Permission Race in moat-init.sh

The socket GID detection in host mode runs inside the container and may race with container startup.

Location: internal/deps/scripts/moat-init.sh:226-243

Issue: The socket may not exist yet when moat-init runs, or permissions may not be stable. The code handles this with "|| true" but provides only a warning.

Recommendation: Add retry logic or move GID detection to host-side where it currently exists at line 160, and pass the group name (not just GID) to the container.
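
As a rough illustration of the retry idea, here is how GID detection with a bounded retry could look in Go. The actual detection lives in a shell script (moat-init.sh), so this is a sketch under assumptions: fileGID and waitForGID are hypothetical names, and the code targets Unix-like systems where syscall.Stat_t is available.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
	"time"
)

// fileGID returns the group ID that owns path, read from the stat result.
func fileGID(path string) (uint32, error) {
	info, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	st, ok := info.Sys().(*syscall.Stat_t)
	if !ok {
		return 0, errors.New("stat result does not expose a GID on this platform")
	}
	return st.Gid, nil
}

// waitForGID retries fileGID until the path can be stat'ed or the attempts
// run out, sleeping for delay between attempts.
func waitForGID(path string, attempts int, delay time.Duration) (uint32, error) {
	if attempts < 1 {
		attempts = 1
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		gid, err := fileGID(path)
		if err == nil {
			return gid, nil
		}
		lastErr = err
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return 0, fmt.Errorf("no GID for %s after %d attempts: %w", path, attempts, lastErr)
}

func main() {
	gid, err := waitForGID("/var/run/docker.sock", 3, 200*time.Millisecond)
	if err != nil {
		fmt.Println("socket not available:", err)
		return
	}
	fmt.Println("docker socket GID:", gid)
}
```

A bounded retry turns a silent warning into either a usable GID or an explicit error naming the socket path.
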

3. Missing Image Tag for BuildKit Cache Separation

The PR description mentions "Image tags include DockerMode to prevent cache collisions" but I need to verify this is implemented.

Action needed: Confirm internal/image/image.go includes docker mode in image tag computation. If not, add it to prevent cache collisions between docker:host and docker:dind images.
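
For reference, one way a tag can fold the Docker mode into its hash is sketched below. This does not reproduce internal/image/image.go; imageTag, the 12-character truncation, and the example dependency names are assumptions. The "moat/run:" prefix is taken from the error messages quoted earlier in this PR.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// imageTag derives a tag from the full dependency specs, including the
// docker mode suffix, so docker:host and docker:dind never share a tag.
func imageTag(deps []string) string {
	sorted := append([]string(nil), deps...) // copy so the caller's slice is untouched
	sort.Strings(sorted)                     // make the tag independent of declaration order
	sum := sha256.Sum256([]byte(strings.Join(sorted, "\n")))
	return "moat/run:" + hex.EncodeToString(sum[:])[:12]
}

func main() {
	fmt.Println(imageTag([]string{"node", "docker:host"}))
	fmt.Println(imageTag([]string{"node", "docker:dind"}))
}
```

Hashing the whole spec string rather than only the dependency name is what keeps the two modes in separate cache entries.
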


Medium Priority

4. Dind Daemon Startup Timeout Configuration

The 30-second timeout for dockerd startup is hardcoded at the top of moat-init.sh.

Issue: 30 seconds may be too short for slow CI environments or systems with high I/O latency. No way to override.

Recommendation: Make this configurable via environment variable with 30s default.

5. VFS Storage Driver Performance

The dind mode uses --storage-driver=vfs for maximum compatibility.

Issue: VFS is slow and doesn't support copy-on-write. While the comment mentions overlay2 as an alternative, there's no auto-detection or configuration option.

Recommendation: Try overlay2 first, fallback to vfs if it fails. Add configuration option and document performance implications.

6. BuildKit Client Error Messages Could Be More Actionable

Error messages reference BUILDKIT_HOST but don't guide users on what to check.

Recommendation: Add troubleshooting steps to error messages explaining how to check if the sidecar is running, network connectivity, and logs.


Low Priority

7. Test Coverage Gaps

  • Missing: Unit test for GetDockerSocketGID with mocked filesystem
  • Missing: Test for mutual exclusivity check in moat-init.sh
  • Missing: Test for dockerd startup failure and timeout in dind mode

8. Documentation Consistency

The PR description mentions "BuildKit disabled in dind mode (DOCKER_BUILDKIT=0)" but I don't see this environment variable set anywhere in the code. The internal/buildkit/client.go implements BuildKit support for dind mode, which contradicts the PR description.

Recommendation: Update PR description to reflect that BuildKit IS used in dind mode via sidecar.


🚀 Performance Considerations

Positive

  • BuildKit integration should provide significant build performance improvements
  • Apt cache mounts optimize package installation

Concerns

  1. VFS storage driver will significantly slow down image pulls and builds in dind mode
  2. BuildKit sidecar startup overhead: ~1-2 seconds added to dind container creation
  3. Privileged mode implications should be clearly documented

Recommendation: Add performance comparison section to docs showing startup times and build performance differences between modes.


🔐 Security Analysis

Strengths

  • Docker socket is read-write by necessity, but this is documented
  • Privileged mode is only enabled for dind, not host mode
  • Group-based access control prevents running docker as root
  • moat-init properly drops privileges with gosu

Potential Concerns

  1. Privileged containers in dind mode: Acceptable risk given isolation requirements, but worth highlighting in security docs
  2. Host Docker access in host mode: Container can interact with all host containers
  3. Socket GID from environment: Low risk since moat-init also detects GID inside container

Recommendation: Add security implications section to docker dependency documentation explaining the trust model for each mode.


🎯 Summary & Verdict

This is a high-quality PR with solid architecture, good security practices, and comprehensive testing. The implementation demonstrates deep understanding of Docker's architecture and the challenges of running Docker inside containers.

Must Address Before Merge

  1. ✋ BuildKit sidecar cleanup (Issue #1) - Resource leak
  2. ✋ Verify image tag includes docker mode (Issue #3) - Cache collision risk

Should Address Before Merge

  1. ⚠️ Docker mode configuration in dind (Issue #8) - Unclear intent

Can Address in Follow-up PR

  1. Socket permission race condition hardening (Issue #2)
  2. Dind timeout configurability (Issue #4)
  3. Storage driver auto-detection (Issue #5)
  4. Error message improvements (Issue #6)
  5. Additional test coverage (Issue #7)
  6. Performance documentation
  7. Security implications documentation

Recommendation

Conditional LGTM - This is excellent work that moves moat's capabilities forward significantly. Please address issues #1 and #3 before merging, and clarify #8. The remaining items can be follow-up PRs to keep momentum.

Great job on the thorough design docs and test coverage! 🎉

dpup and others added 3 commits January 27, 2026 23:38
Updates conceptual documentation to explain docker:host and docker:dind
dependency modes, their security implications, and BuildKit sidecar
integration.

Changes:
- concepts/01-sandboxing.md: Add Docker access modes section explaining
  security tradeoffs between docker:host and docker:dind
- concepts/06-dependencies.md: Add comprehensive Docker dependencies
  section with mode comparison table and BuildKit sidecar details
- getting-started/02-installation.md: Fix misleading BuildKit installation
  note (BuildKit sidecar is automatic with docker:dind, no manual install
  needed)

The reference documentation (agent-yaml.md) already has comprehensive
technical details. These updates provide conceptual understanding and
decision-making guidance for users.
Addresses code review issue #1: BuildKit sidecar lifecycle management gap.

When Destroy() is called, the BuildKit sidecar container and network were
not being cleaned up, causing resource leaks.

Changes:
- Remove BuildKit sidecar container in Manager.Destroy() if present
- Remove BuildKit network in Manager.Destroy() if present
- Both operations are best-effort with appropriate error handling
- Network removal uses debug logging (may already be removed)

Prevents resource leaks when runs with docker:dind are destroyed.
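
The best-effort cleanup described in this commit could take roughly the following shape. This is a sketch under assumptions, not the actual Manager.Destroy(): the sidecarRuntime interface, cleanupBuildKit, the fake runtime, and the IDs are all hypothetical. Only RemoveNetwork is named elsewhere in this PR, and its real signature may differ.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
)

// sidecarRuntime is a hypothetical slice of the container runtime interface.
type sidecarRuntime interface {
	RemoveContainer(ctx context.Context, id string) error
	RemoveNetwork(ctx context.Context, id string) error
}

// cleanupBuildKit removes the sidecar and then its network. Both steps are
// attempted even if the first fails, so one error cannot leak the other resource.
func cleanupBuildKit(ctx context.Context, rt sidecarRuntime, containerID, networkID string) error {
	var sidecarErr error
	if containerID != "" {
		if err := rt.RemoveContainer(ctx, containerID); err != nil {
			sidecarErr = fmt.Errorf("remove buildkit sidecar %s: %w", containerID, err)
		}
	}
	if networkID != "" {
		if err := rt.RemoveNetwork(ctx, networkID); err != nil {
			// The network may already be gone, so this is only logged.
			log.Printf("debug: remove buildkit network %s: %v", networkID, err)
		}
	}
	return sidecarErr
}

// fakeRuntime records calls so the sketch runs without Docker.
type fakeRuntime struct {
	failContainer bool
	calls         []string
}

func (f *fakeRuntime) RemoveContainer(_ context.Context, id string) error {
	f.calls = append(f.calls, "container:"+id)
	if f.failContainer {
		return errors.New("container is busy")
	}
	return nil
}

func (f *fakeRuntime) RemoveNetwork(_ context.Context, id string) error {
	f.calls = append(f.calls, "network:"+id)
	return nil
}

// runDemo exercises the cleanup against the fake runtime.
func runDemo(failContainer bool) ([]string, error) {
	rt := &fakeRuntime{failContainer: failContainer}
	err := cleanupBuildKit(context.Background(), rt, "buildkit-1", "moat-net-1")
	return rt.calls, err
}

func main() {
	calls, err := runDemo(true)
	fmt.Println("calls:", calls)
	fmt.Println("error:", err)
}
```
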
- Check error returns for cleanup operations
- Remove unused types import
- Fix variable shadowing in buildImageWithBuildKit
- Pre-allocate mounts slice capacity
- Replace deprecated types.ContainerJSON with container.InspectResponse

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@claude

claude Bot commented Jan 27, 2026

📦 Previous Review (superseded by newer review)

Pull Request Review: Docker Dependency and BuildKit Integration

This PR adds comprehensive Docker support inside moat containers with two modes (docker:host and docker:dind), plus automatic BuildKit sidecar integration for fast builds. The implementation is well-structured and thoroughly documented.


Code Quality ✅

Strengths

1. Excellent Architecture

  • Clean separation between docker:host and docker:dind modes via DockerMode type
  • Runtime abstraction properly extended with new methods (CreateNetwork, RemoveNetwork, StartSidecar)
  • BuildKit client wrapper (internal/buildkit/client.go) provides clean abstraction over BuildKit gRPC API

2. Strong Documentation

  • Inline comments explain complex flows (e.g., session-based file sync, docker load piping)
  • Error messages are actionable and user-friendly
  • Design documents in docs/plans/ capture architectural decisions
  • Reference docs updated to match implementation

3. Comprehensive Testing

  • Unit tests for BuildKit client (client_test.go)
  • E2E tests covering full docker:dind + BuildKit flow
  • Manager tests for BuildKit configuration logic

4. Proper Error Handling

  • Cleanup logic on failures (network, sidecar, proxy)
  • Best-effort cleanup in Stop/Destroy with appropriate logging
  • Connection errors include troubleshooting suggestions

Potential Issues 🔍

1. Resource Leak in Destroy() - FIXED ✅

The BuildKit sidecar container was not being removed in Manager.Destroy(). This has been addressed in commit 2659592 which adds proper cleanup.

2. Privileged Mode Security Implications ⚠️

Both docker:dind and BuildKit sidecar require privileged mode:

  • docker:dind (internal/run/manager.go:1463): Sets Privileged: true for the main container
  • BuildKit sidecar (internal/run/manager.go:1384): Sets Privileged: true for the sidecar

Recommendation: Consider documenting in the security model docs (concepts/01-sandboxing.md) that privileged containers can escape to the host. The current docs mention this but could be more explicit about the risks.

3. Hard-coded BuildKit Image Tag

internal/run/docker.go:202 uses moby/buildkit:latest:

SidecarImage: "moby/buildkit:latest",

Concern: Using latest tag can lead to:

  • Non-reproducible builds across environments
  • Unexpected behavior changes when upstream releases new versions
  • Difficult debugging when issues arise from version changes

Recommendation: Consider pinning to a specific version (e.g., moby/buildkit:v0.12.5) or making it configurable via environment variable.

4. BuildKit Startup Timeout

internal/run/manager.go:1415-1433 waits 10 seconds for BuildKit:

for i := 0; i < 10; i++ {
    time.Sleep(1 * time.Second)
    // ...
}

Concern: Fixed 10-second timeout may be insufficient on slow systems or during first pull of BuildKit image.

Recommendation: Consider making timeout configurable or increasing default to 30 seconds.
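
A configurable wait could look something like the sketch below. It is illustrative only: MOAT_BUILDKIT_TIMEOUT is a hypothetical variable name that moat does not define, and the readiness check is a stand-in for a real BuildKit ping.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"time"
)

const (
	// defaultBuildKitTimeout follows the 30-second figure suggested above.
	defaultBuildKitTimeout = 30 * time.Second
	// timeoutEnvVar is a hypothetical variable name, not one moat defines.
	timeoutEnvVar = "MOAT_BUILDKIT_TIMEOUT"
)

var errNotReady = errors.New("buildkit did not become ready before the timeout")

// parseTimeout accepts values such as "45s" or "2m" and falls back to def
// when raw is empty, malformed, or not positive.
func parseTimeout(raw string, def time.Duration) time.Duration {
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return def
	}
	return d
}

// waitReady polls ready every interval until it returns true or the next
// poll would fall after the deadline.
func waitReady(timeout, interval time.Duration, ready func() bool) error {
	deadline := time.Now().Add(timeout)
	for {
		if ready() {
			return nil
		}
		if time.Now().Add(interval).After(deadline) {
			return errNotReady
		}
		time.Sleep(interval)
	}
}

func main() {
	timeout := parseTimeout(os.Getenv(timeoutEnvVar), defaultBuildKitTimeout)
	attempts := 0
	err := waitReady(timeout, 10*time.Millisecond, func() bool {
		attempts++
		return attempts >= 3 // stand-in for a real BuildKit ping
	})
	fmt.Println("attempts:", attempts, "err:", err)
}
```

Using a deadline instead of a fixed iteration count keeps the total wait bounded even when each readiness check is itself slow.
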

5. No Validation of Mount Paths in BuildKit

internal/run/manager.go:1385-1401 mounts /var/run/docker.sock and /tmp into BuildKit sidecar without verifying paths exist:

Mounts: []container.MountConfig{
    {Source: "/var/run/docker.sock", Target: "/var/run/docker.sock", ReadOnly: false},
    {Source: "/tmp", Target: "/tmp", ReadOnly: false},
},

Concern: If paths don't exist, sidecar start will fail with unclear error.

Recommendation: Add validation or improve error messages if sidecar startup fails due to missing mounts.
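
A pre-flight check along these lines might look as follows. This is a sketch, not moat's code: MountConfig mirrors only the three fields visible in the snippet above, validateMountSources is a hypothetical helper, and the check assumes the process can see the same filesystem the mounts are resolved against.

```go
package main

import (
	"fmt"
	"os"
)

// MountConfig mirrors the fields shown in the review snippet.
type MountConfig struct {
	Source   string
	Target   string
	ReadOnly bool
}

// validateMountSources returns a descriptive error for the first mount whose
// source path cannot be stat'ed, instead of letting sidecar startup fail later.
func validateMountSources(mounts []MountConfig) error {
	for _, m := range mounts {
		if _, err := os.Stat(m.Source); err != nil {
			return fmt.Errorf("buildkit sidecar mount %s -> %s: source unavailable: %w",
				m.Source, m.Target, err)
		}
	}
	return nil
}

func main() {
	mounts := []MountConfig{
		{Source: "/var/run/docker.sock", Target: "/var/run/docker.sock"},
		{Source: "/tmp", Target: "/tmp"},
	}
	if err := validateMountSources(mounts); err != nil {
		fmt.Println("cannot start sidecar:", err)
		return
	}
	fmt.Println("all mount sources present")
}
```
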


Performance Considerations ⚡

Positive Aspects

1. BuildKit Provides Significant Benefits

  • Parallel build stages
  • Efficient layer caching
  • Modern Dockerfile features (cache mounts, etc.)

2. Efficient Shared /tmp Mount
Build contexts are shared via /tmp mount rather than copying over network.

Potential Concerns

1. BuildKit Sidecar Startup Overhead
Each docker:dind run starts a BuildKit sidecar (~1-10 seconds). For short-lived runs, this overhead may be noticeable.

Mitigation: This is documented behavior. Users needing faster startup can use docker:host mode.

2. Storage Driver Performance
The vfs storage driver used in dind mode (moat-init.sh:179) is slower than overlay2 but more compatible.

Recommendation: Consider detecting overlay2 support and using it when available.


Security Concerns 🔒

1. Privileged Containers Can Escape ⚠️

Both docker:dind and BuildKit sidecar run in privileged mode, which effectively disables container isolation. An attacker with code execution in the container can:

  • Mount host filesystem
  • Load kernel modules
  • Access all host devices

Current Mitigation:

  • Documented in PR description
  • Users explicitly choose docker:dind mode
  • docker:host mode avoids privileged containers (but still grants access to the host Docker daemon)

Recommendation: ✅ Already well-documented. No changes needed.

2. Docker Socket Access in docker:host Mode

Mounting /var/run/docker.sock grants full control over the host Docker daemon, which can be used to:

  • Start privileged containers
  • Mount host filesystem into containers
  • Effectively escape to host

Current Mitigation:

  • Documented in agent.yaml reference
  • Users must explicitly use docker:host

Recommendation: ✅ Already well-documented. No changes needed.

3. NO_PROXY Configuration Lets BuildKit Bypass the Proxy

internal/run/manager.go:509 correctly excludes BuildKit from proxy:

noProxy := hostAddr + ",localhost,127.0.0.1,buildkit"

This prevents BuildKit gRPC traffic from being routed through HTTP proxy (which would fail).

4. BuildKit Socket Access

The BuildKit sidecar mounts the dind Docker socket, not the host socket. This maintains isolation - BuildKit can only affect the isolated dind daemon, not the host.

Assessment: ✅ Correct security boundaries.


Test Coverage 🧪

Excellent Coverage

Unit Tests:

  • internal/buildkit/client_test.go: BuildKit client operations
  • internal/run/docker_test.go: Docker dependency resolution
  • internal/run/manager_test.go: BuildKit configuration logic

E2E Tests:

  • internal/e2e/docker_test.go: Full docker:dind + BuildKit integration
  • Tests verify BUILDKIT_HOST is set, images build, and can be run

Suggestions

1. Test BuildKit Cleanup
Consider adding a test that verifies BuildKit sidecar and network are properly cleaned up in Stop() and Destroy().

2. Test Failure Scenarios

  • BuildKit sidecar fails to start
  • BuildKit network creation fails
  • Image build fails mid-build

These are partially covered but could be more explicit.


Best Practices ✅

Followed Well

1. Conventional Commits: All commits follow conventional commit format
2. Error Messages: Clear, actionable error messages with troubleshooting steps
3. Logging: Structured logging with appropriate levels (Debug for trace, Warn for non-fatal errors)
4. Resource Cleanup: Comprehensive cleanup in Stop/Destroy with best-effort semantics
5. Documentation: Reference docs and concept docs updated to match implementation

Minor Suggestions

1. Consider Using Context Timeouts
Several operations use bare ctx without timeouts:

  • BuildKit sidecar startup (manager.go:1404)
  • Network creation (manager.go:1366)

Recommendation: Consider using context.WithTimeout for operations that may hang indefinitely.
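
The context.WithTimeout pattern suggested here can be sketched generically. The helper names below (runWithTimeout, slowOp, timedOut, demo) are hypothetical, and slowOp merely simulates a call such as network creation that honors cancellation.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runWithTimeout bounds op with a deadline derived from parent.
func runWithTimeout(parent context.Context, limit time.Duration, op func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, limit)
	defer cancel() // release the timer even when op returns early
	return op(ctx)
}

// slowOp simulates an operation that needs `work` to finish but stops as
// soon as its context is cancelled.
func slowOp(work time.Duration) func(context.Context) error {
	return func(ctx context.Context) error {
		select {
		case <-time.After(work):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// timedOut reports whether err was caused by the deadline expiring.
func timedOut(err error) bool { return errors.Is(err, context.DeadlineExceeded) }

// demo runs a simulated operation under a limit, for illustration.
func demo(limit, work time.Duration) error {
	return runWithTimeout(context.Background(), limit, slowOp(work))
}

func main() {
	err := demo(50*time.Millisecond, time.Second)
	fmt.Println("timed out:", timedOut(err))
}
```

The deadline only helps if the wrapped call actually watches ctx.Done(), which is why the simulated operation selects on it.
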

2. Magic Numbers

  • 10-second BuildKit wait (manager.go:1418)
  • 30-second dockerd timeout (moat-init.sh:15)

Recommendation: Extract as named constants at package level.


Documentation 📚

Strengths

1. Comprehensive Reference Documentation

  • docs/content/reference/02-agent-yaml.md: Detailed docker dependency documentation
  • Tables comparing docker:host vs docker:dind
  • BuildKit integration explained

2. Concept Documentation

  • docs/content/concepts/01-sandboxing.md: Security implications
  • docs/content/concepts/06-dependencies.md: Detailed dependency system explanation

3. Design Documents

  • docs/plans/: Four detailed design documents capturing decisions and evolution

Suggestions

1. Add Troubleshooting Section
Consider adding common issues to reference docs:

  • "BuildKit sidecar failed to start" → Check Docker Hub access
  • "method not supported" errors → Already documented
  • Performance tips (overlay2 vs vfs)

2. Security Warning in Examples
The agent.yaml example in project root uses docker:dind. Consider adding a comment:

dependencies:
  - docker:dind  # Runs privileged; docker:host avoids that but shares the host daemon

Final Verdict ✅

This is high-quality work that adds significant value to moat:

Strengths

  • Well-architected with clean abstractions
  • Comprehensive testing (unit + E2E)
  • Excellent documentation and error messages
  • Proper resource cleanup and error handling
  • Security boundaries correctly implemented

Minor Improvements Recommended

  1. Pin BuildKit image version instead of using :latest
  2. Increase BuildKit startup timeout or make configurable
  3. Extract magic numbers as constants
  4. Consider context timeouts for long-running operations

Blockers

None. The code is production-ready as-is.


Recommendation

APPROVE

This PR can be merged. The suggested improvements are nice-to-haves that can be addressed in follow-up PRs if desired. The core implementation is solid, secure, and well-tested.

@claude

claude Bot commented Jan 28, 2026

📦 Previous Review (superseded by newer review)

Pull Request Review: Docker Dependency Support

This PR adds comprehensive Docker support inside moat containers with two access modes: docker:host (socket mounting) and docker:dind (Docker-in-Docker). The implementation is well-architected, thoroughly tested, and includes excellent documentation.

Code Quality ✅

Strengths:

  • Clean abstraction: The DockerMode type (internal/deps/types.go:26-34) provides a clear enumeration of host/dind modes with explicit mode requirements preventing ambiguous configurations
  • Type safety: Parser validates modes at parse time (internal/deps/parser.go:57-78), ensuring invalid modes fail fast with clear error messages
  • Good separation of concerns: Docker resolution logic (internal/run/docker.go) is cleanly separated from general run management
  • Security by default: Requires explicit mode specification - docker alone errors with helpful message (parser.go:63)
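
To make the explicit-mode rule concrete, here is a minimal parsing sketch. It is not the code in internal/deps/parser.go: the constant names, the function name, and the error wording are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// DockerMode enumerates the two supported access modes.
type DockerMode string

const (
	DockerModeHost DockerMode = "host" // constant names are assumptions
	DockerModeDind DockerMode = "dind"
)

// parseDockerDependency accepts "docker:host" or "docker:dind" and rejects
// bare "docker" so the user must choose a mode explicitly.
func parseDockerDependency(spec string) (DockerMode, error) {
	name, mode, hasMode := strings.Cut(spec, ":")
	if name != "docker" {
		return "", fmt.Errorf("not a docker dependency: %q", spec)
	}
	if !hasMode || mode == "" {
		return "", errors.New("docker requires an explicit mode: use docker:host or docker:dind")
	}
	switch DockerMode(mode) {
	case DockerModeHost, DockerModeDind:
		return DockerMode(mode), nil
	default:
		return "", fmt.Errorf("invalid docker mode %q: expected host or dind", mode)
	}
}

func main() {
	for _, spec := range []string{"docker:host", "docker:dind", "docker", "docker:swarm"} {
		mode, err := parseDockerDependency(spec)
		fmt.Printf("%-13s mode=%q err=%v\n", spec, mode, err)
	}
}
```
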

Code organization:
The feature follows the existing moat patterns well with parser, Dockerfile generator, run manager, and container runtime each handling their responsibilities.

Security Considerations 🔒

Excellent security practices:

  1. Privileged mode handling (internal/audit/entry.go:67-77):

    • New ContainerData audit entry type tracks privileged containers
    • Includes Privileged boolean and Reason field (e.g., docker:dind)
    • Provides full audit trail for security-sensitive operations
  2. Mode validation (internal/run/docker.go:82-96):

    • ValidateDockerDependency ensures modes match runtime capabilities
    • Apple containers correctly rejected for both modes (no socket access, no privileged support)
    • Clear error messages guide users to --runtime docker
  3. GID detection (internal/run/docker.go:98-115):

    • Socket GID detected inside container (not from host) to handle Docker Desktop ownership translation
    • moatuser added to appropriate group for socket access
    • Uses syscall.Stat_t for platform-correct GID retrieval
  4. Init script security (internal/deps/scripts/moat-init.sh:155-243):

    • Mutual exclusion check prevents conflicting MOAT_DOCKER_DIND + MOAT_DOCKER_GID
    • DinD daemon startup includes health checks and timeout handling
    • Docker socket permissions verified before proceeding
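
The runtime compatibility check from item 2 can be illustrated as follows. This is a sketch, not ValidateDockerDependency itself: the RuntimeType values, function names, and error text are assumptions, with the --runtime docker hint taken from the description above.

```go
package main

import "fmt"

// RuntimeType identifies the container runtime; the values are assumptions.
type RuntimeType string

const (
	RuntimeDocker RuntimeType = "docker"
	RuntimeApple  RuntimeType = "apple"
)

// validateDockerDependency rejects unknown modes and any runtime other than
// Docker, since neither socket mounting nor privileged mode is available elsewhere.
func validateDockerDependency(mode string, rt RuntimeType) error {
	if mode != "host" && mode != "dind" {
		return fmt.Errorf("unknown docker mode %q", mode)
	}
	if rt != RuntimeDocker {
		return fmt.Errorf("docker:%s is not supported on the %s runtime; rerun with --runtime docker", mode, rt)
	}
	return nil
}

// needsPrivileged reports whether the mode requires a privileged container.
func needsPrivileged(mode string) bool { return mode == "dind" }

func main() {
	fmt.Println(validateDockerDependency("dind", RuntimeDocker))
	fmt.Println(validateDockerDependency("host", RuntimeApple))
	fmt.Println("dind privileged:", needsPrivileged("dind"))
	fmt.Println("host privileged:", needsPrivileged("host"))
}
```

Running the check before the container is created lets an incompatible combination fail with a remediation hint rather than a low-level runtime error.
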

Recommendation: Consider adding a warning to the CLI when docker:host is used, similar to how other tools warn about privileged operations. This could be a follow-up enhancement.

Test Coverage 🧪

Comprehensive test suite (17 test files covering docker/deps):

  1. Unit tests (internal/run/docker_test.go, internal/deps/parser_test.go):

    • Parser tests cover valid modes, invalid modes, missing mode requirement
    • Docker dependency detection and retrieval well-tested
    • GID detection, mode validation covered
  2. E2E tests (internal/e2e/docker_test.go):

    • TestDockerDependency: Verifies docker ps works inside container
    • TestDockerDependencyWithAppleRuntime: Validates graceful failure on incompatible runtime
    • Proper runtime detection with skipIfNoDocker helpers
    • Good error diagnostics with log output on failures
  3. Dockerfile generation tests:

    • Verifies correct package installation for each mode
    • Tests BuildKit integration in generated Dockerfiles

Test quality:

  • Good use of table-driven tests
  • Proper cleanup with defer statements
  • Timeout handling in E2E tests
  • Meaningful test names and clear failure messages

Performance Considerations ⚡

Well-optimized:

  1. Host mode:

    • Socket mounting has minimal overhead
    • Shared image cache with host prevents duplicate downloads
    • GID detection is a simple stat syscall
  2. DinD mode:

    • Documentation accurately describes ~5-10s daemon startup time
    • Init script includes timeout handling (30s max)
    • Uses vfs storage driver for maximum compatibility
  3. BuildKit integration:

    • Automatic BuildKit sidecar in dind mode provides faster builds
    • Network isolation via dedicated Docker network per run
    • Environment variable BUILDKIT_HOST routes builds correctly
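
The bounded readiness wait described for dind mode follows a common poll-until-deadline pattern. The sketch below shows that pattern in Go for illustration only; the real wait lives in the moat-init shell script, and the name `waitReady` and its signature are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// waitReady polls ready every interval until it reports true or the timeout
// would be exceeded. Hypothetical helper; the probe is supplied by the caller
// (for a Docker daemon it could be a check that "docker info" succeeds).
func waitReady(timeout, interval time.Duration, ready func() bool) error {
	deadline := time.Now().Add(timeout)
	for {
		if ready() {
			return nil
		}
		// Give up if another sleep would take us past the deadline.
		if time.Now().Add(interval).After(deadline) {
			return fmt.Errorf("daemon not ready after %s", timeout)
		}
		time.Sleep(interval)
	}
}
```

With a 30s timeout and a short interval, a daemon that takes 5-10s to initialize passes comfortably, while a daemon that never starts fails with a clear error instead of hanging the container.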

Documentation Quality 📚

Excellent documentation across multiple locations:

  1. Reference docs (docs/content/reference/02-agent-yaml.md:189-295):

    • Clear explanation of both modes with tradeoffs table
    • BuildKit sidecar architecture explained thoroughly
    • Runtime requirements clearly stated
    • Troubleshooting section with common errors
  2. Concept docs (docs/content/concepts/06-dependencies.md:99-138):

    • High-level overview integrated into dependency documentation
    • Comparison table between modes
  3. Inline code comments well-placed throughout

Documentation accuracy:
✅ Verified claims against code:

  • DinD timeout is 30s (matches DIND_TIMEOUT_SECONDS in init script)
  • BuildKit sidecar uses moby/buildkit:latest (matches internal/run/docker.go:202)
  • vfs storage driver is used (matches moat-init.sh:179)
  • Environment variable names match implementation

Potential Issues & Recommendations

Minor observations:

  1. BuildKit complexity (internal/buildkit/client.go, 202 lines):

    • The BuildKit client implementation is substantial but currently unused
    • Question: Are there plans to integrate this more deeply?
    • Suggestion: Consider adding a comment explaining future integration plans
  2. Dockerfile BuildKit flag (internal/deps/dockerfile.go:33-36):

    • UseBuildKit option defaults to true but is not used in dind flow
    • Suggestion: Document when this flag should be set to false
  3. Error messages (internal/run/docker.go:40-63):

    • Error types for incompatible runtimes are excellent
    • Could enhance with suggestions about when each mode is appropriate
  4. Audit logging:

    • ✅ Container privileged status logged
    • ✅ BuildKit container/network IDs tracked
    • Consider logging Docker mode (host vs dind) explicitly as separate field for easier querying
  5. DinD daemon logs:

    • Logs go to /var/log/dockerd.log
    • Enhancement: Consider streaming dockerd logs to moat audit system for full observability
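
To make suggestion 4 concrete, here is one possible shape for an audit payload that carries the Docker mode as its own field. The type name, field names, and JSON keys are hypothetical; only the Privileged and Reason fields are taken from the description of ContainerData above.

```go
package main

import "encoding/json"

// containerAuditSketch is a hypothetical audit payload, not the real
// ContainerData type. DockerMode is the proposed extra field.
type containerAuditSketch struct {
	Privileged bool   `json:"privileged"`
	Reason     string `json:"reason,omitempty"`
	DockerMode string `json:"docker_mode,omitempty"`
}

// encodeContainerAudit serializes the payload as a single JSON object.
func encodeContainerAudit(e containerAuditSketch) (string, error) {
	b, err := json.Marshal(e)
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```

A dedicated field lets queries filter on the mode directly instead of parsing it out of the free-form reason string.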

Best Practices ✅

Follows moat conventions:

  • ✅ Error messages are actionable (tell users exactly what to do)
  • ✅ Documentation matches actual behavior
  • ✅ CLAUDE.md updated with architecture details
  • ✅ No backwards-compatibility hacks

Code style:

  • ✅ Standard Go formatting
  • ✅ Clear variable names
  • ✅ Good use of constants
  • ✅ Proper error wrapping with context

Summary

This is a high-quality PR that adds significant value to moat. The implementation is secure, well-tested, performant, and thoroughly documented.

Strengths:

  • Explicit mode requirement prevents ambiguous configurations
  • Comprehensive audit logging for privileged operations
  • Excellent runtime validation with clear error messages
  • Strong test coverage including E2E tests
  • Documentation accuracy verified against implementation
  • BuildKit sidecar provides excellent build performance in dind mode

Recommendations for follow-up:

  1. Consider CLI warning when using docker:host mode
  2. Document BuildKit client future plans in code comments
  3. Add Docker mode to audit log as explicit field
  4. Consider streaming dockerd logs to audit system

Approval recommendation: ✅ Approve with optional follow-up enhancements

Great work on this feature!

Add skipIfNestedDind() helper to detect environments where nested dind
is unreliable. Nested dind (dind inside dind) is not a supported
configuration.

Detection strategy (checked in order):
1. GITHUB_ACTIONS=true - GitHub Actions always uses dind
2. /.dockerenv + /var/run/docker.sock - we're in a container with docker
3. CI=true + docker info shows dind indicators - other CI systems

This fixes test failures in CI where dind tests fail with
"dial unix /var/run/docker.sock: connect: no such file or directory"
because the inner dockerd can't start properly in nested dind setups.

The tests will still run in local development environments where
nested dind restrictions don't apply.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@dpup dpup force-pushed the docker-in-docker branch from 39e3b45 to ee4dd3f Compare January 28, 2026 00:47
@claude

claude Bot commented Jan 28, 2026

Pull Request Review: Docker Dependency Support

This PR adds comprehensive support for running Docker commands inside moat containers through two modes: docker:host and docker:dind. I've conducted a thorough review of the implementation across all 38 changed files.


Overall Assessment

Verdict: Excellent implementation with strong attention to security, testing, and documentation.

The PR demonstrates mature engineering practices with comprehensive E2E tests, clear documentation, and thoughtful handling of edge cases. The dual-mode approach provides flexibility for different isolation needs while maintaining security.


Strengths

1. Security Design ✅

  • Clear security model: Host mode provides speed, dind mode provides isolation
  • Proper privilege handling: Dind mode correctly requires --privileged and documents the security implications
  • Apple containers rejection: Both modes correctly rejected for Apple containers with clear error messages
  • Group-based access: Host mode uses GID detection inside container (handles Docker Desktop translation)
  • Non-root execution: Maintains moatuser throughout, with proper group membership

2. Code Quality ✅

  • Type safety: Explicit DockerMode type prevents magic strings
  • Error handling: Comprehensive with actionable error messages
  • Clear separation: internal/run/docker.go cleanly isolates Docker-specific logic
  • Comments: Excellent inline documentation explaining why, not just what
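
As an illustration of the type-safety point, a typed mode with an explicit-mode requirement could look like the sketch below. The constant names and `parseDockerDependency` are assumptions for this example and may differ from the actual parser in internal/deps.

```go
package main

import (
	"fmt"
	"strings"
)

// DockerMode names the two supported ways of exposing Docker in a container.
type DockerMode string

const (
	DockerModeHost DockerMode = "host" // mount the host Docker socket
	DockerModeDind DockerMode = "dind" // run an isolated daemon in the container
)

// parseDockerDependency parses "docker:<mode>". A bare "docker" is rejected
// because the mode must be explicit. Hypothetical helper for illustration.
func parseDockerDependency(spec string) (DockerMode, error) {
	name, mode, found := strings.Cut(spec, ":")
	if name != "docker" {
		return "", fmt.Errorf("not a docker dependency: %q", spec)
	}
	if !found || mode == "" {
		return "", fmt.Errorf("dependency %q needs an explicit mode: use docker:host or docker:dind", spec)
	}
	switch DockerMode(mode) {
	case DockerModeHost, DockerModeDind:
		return DockerMode(mode), nil
	default:
		return "", fmt.Errorf("unknown docker mode %q: use docker:host or docker:dind", mode)
	}
}
```

Because callers receive a DockerMode rather than a raw string, a typo such as "dnd" fails at parse time instead of silently selecting a default.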

3. Testing ✅

  • Unit tests: Comprehensive coverage in internal/run/docker_test.go (440 lines)
  • E2E tests: Real-world validation in internal/e2e/docker_test.go (571 lines)
  • Isolation verification: TestDockerDindIsolation confirms dind containers can't see host containers
  • Edge cases: Tests Apple runtime rejection, missing socket, GID detection
  • CI-aware: Proper skip logic for privileged containers and nested dind

4. Documentation ✅

  • User-facing docs: Clear examples in docs/content/reference/02-agent-yaml.md
  • Concept docs: Security trade-offs explained in docs/content/concepts/01-sandboxing.md
  • Design docs: Three planning documents capture decision rationale
  • CLAUDE.md updated: Project overview reflects new architecture

@claude

claude Bot commented Jan 28, 2026

Issues & Recommendations

Critical Issues: None ✅

High Priority

1. BuildKit Integration Complexity 🟡

File: internal/buildkit/client.go (new file, 202 lines)

The BuildKit client integration adds significant complexity with custom gRPC session protocol implementation, manual containerd snapshot handling, and complex error handling with retries.

Concern: This appears to be solving a problem that may not need solving. The PR description doesn't mention BuildKit, and docker:dind mode could work with just the Docker daemon.

Recommendation:

  • Consider removing BuildKit integration in this PR and adding it separately if needed
  • If keeping it: Add comprehensive integration tests for BuildKit failures
  • Document why BuildKit is necessary (if it is)

2. Docker-in-Docker Reliability in CI 🟡

File: internal/e2e/docker_test.go:172-204

The nested dind detection is complex and may be fragile. The heuristic checking for /.dockerenv and /var/run/docker.sock may have false positives/negatives.

Recommendation:

  • Add a MOAT_E2E_ALLOW_DIND=1 env var to explicitly enable dind tests
  • Default to skipping dind tests in all CI environments
  • Document the CI configuration requirements for dind tests
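
One way to combine the commit's detection order with an explicit opt-in is sketched below. The probes are injected so the decision logic can be unit-tested without touching the real environment. All names are hypothetical, and how "docker info" is inspected for dind indicators is left to the caller because the commit message does not specify it.

```go
package main

// dindProbes bundles the environment lookups the decision depends on.
// Hypothetical type for illustration.
type dindProbes struct {
	Getenv                func(string) string
	FileExists            func(string) bool
	DockerInfoLooksNested func() bool
}

// nestedDindSkipReason returns a non-empty reason when dind tests should be
// skipped, or "" when they may run. MOAT_E2E_ALLOW_DIND=1 overrides detection.
func nestedDindSkipReason(p dindProbes) string {
	if p.Getenv("MOAT_E2E_ALLOW_DIND") == "1" {
		return ""
	}
	if p.Getenv("GITHUB_ACTIONS") == "true" {
		return "GitHub Actions detected"
	}
	if p.FileExists("/.dockerenv") && p.FileExists("/var/run/docker.sock") {
		return "running inside a container with a Docker socket"
	}
	if p.Getenv("CI") == "true" && p.DockerInfoLooksNested() {
		return "CI with dind indicators in docker info"
	}
	return ""
}
```

A test helper would call t.Skip with the returned reason when it is non-empty, which keeps the skip message specific about which heuristic fired.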

Medium Priority

3. Image Tag Cache Collision 🟡

The PR description mentions "Image tags include DockerMode to prevent cache collisions" but I don't see this implemented in the diff. The dockerMode is captured but not used in image tagging logic that I can see.

Recommendation: Verify this claim. If not implemented, either add docker mode to image tags or remove the claim from the PR description.
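
If the tagging turns out to be missing, a minimal approach is to fold the mode into whatever hash already derives the tag. The sketch below assumes a tag built from a sorted dependency list; the `moat-deps:` prefix, the function name, and the hashing scheme are all hypothetical and are not taken from the diff.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
	"strings"
)

// imageTag derives a deterministic tag from the dependency names and the
// Docker mode, so host and dind images never share a cache entry.
// Hypothetical scheme for illustration.
func imageTag(deps []string, dockerMode string) string {
	sorted := append([]string(nil), deps...) // copy so the caller's slice is untouched
	sort.Strings(sorted)
	h := sha256.New()
	h.Write([]byte(strings.Join(sorted, "\n")))
	h.Write([]byte("\x00docker-mode=" + dockerMode))
	return "moat-deps:" + hex.EncodeToString(h.Sum(nil))[:12]
}
```

Sorting first keeps the tag stable when the same dependencies are listed in a different order, while the mode suffix forces a different tag for docker:host and docker:dind.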

4. Error Message Consistency 🟡

File: internal/run/docker.go:40-63

Both error types suggest moat run --runtime docker, but the host mode error also suggests dind mode as an alternative. The dind error doesn't suggest host mode, which could be simpler for some users.

5. moat-init Script Length 🟡

File: internal/deps/scripts/moat-init.sh (269 lines, up from 140)

The init script has grown significantly. Consider splitting into modular functions at the top of the script. Each feature could be a function: setup_ssh(), setup_docker_host(), setup_docker_dind()

6. Privileged Mode Security Documentation 🟡

The sandboxing doc explains the modes well, but could be more explicit about the security implications of --privileged. Consider adding a security note warning that privileged mode grants the container nearly all capabilities of the host.

@claude

claude Bot commented Jan 28, 2026

Code-Specific Observations

Excellent Patterns

1. Clear validation separation (internal/run/docker.go:88-96):
The validation function cleanly separates Docker/Apple runtime checks and returns appropriate error types for each case.

2. Descriptive error types (internal/run/docker.go:38-47):
Error messages are actionable with clear alternatives - exactly what good error messages look like.

3. Comprehensive struct documentation (internal/run/docker.go:18-34):
Each field has a clear comment explaining when it is used.
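
For readers without the diff at hand, the validation pattern praised in items 1 and 2 can be approximated as follows. This is a sketch under assumptions: the error type names and message wording are invented for illustration, and it treats every runtime other than "docker" as unsupported.

```go
package main

import "fmt"

// hostModeRuntimeError reports that docker:host cannot work on a runtime.
type hostModeRuntimeError struct{ Runtime string }

func (e *hostModeRuntimeError) Error() string {
	return fmt.Sprintf("docker:host needs the host Docker socket, which the %s runtime does not expose; rerun with: moat run --runtime docker", e.Runtime)
}

// dindModeRuntimeError reports that docker:dind cannot work on a runtime.
type dindModeRuntimeError struct{ Runtime string }

func (e *dindModeRuntimeError) Error() string {
	return fmt.Sprintf("docker:dind needs privileged mode, which the %s runtime does not support; rerun with: moat run --runtime docker", e.Runtime)
}

// validateDockerDependency checks that mode can be served by runtime and
// returns a mode-specific error type when it cannot.
func validateDockerDependency(mode, runtime string) error {
	if mode != "host" && mode != "dind" {
		return fmt.Errorf("unknown docker mode %q", mode)
	}
	if runtime == "docker" {
		return nil
	}
	if mode == "host" {
		return &hostModeRuntimeError{Runtime: runtime}
	}
	return &dindModeRuntimeError{Runtime: runtime}
}
```

Distinct error types let callers branch on the failure (for example with errors.As) while each message still tells the user which command to run next.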


Performance Considerations

Positive

  • Host mode: Negligible overhead (a socket mount), shares host image cache
  • BuildKit cache mounts: Dockerfile uses --mount=type=cache for apt
  • Lazy pull: Images only pulled if not present

Concerns

  • Dind startup: 5-10s daemon init time (documented, acceptable)
  • VFS storage: Slower than overlay2 but more compatible (good trade-off)
  • Image size: Adding Docker to every image that needs it (unavoidable)

Test Coverage Assessment

What's Tested ✅

  • Docker host mode works
  • Docker dind mode works
  • Dind provides isolation from host
  • Apple runtime correctly rejects both modes
  • Socket GID detection
  • BuildKit integration

What Could Be Tested 🔵

  • Socket permission denied (wrong GID)
  • Dockerd startup failure (corrupt daemon, port conflict)
  • Docker CLI version mismatch warnings
  • Cleanup after dind failure (orphaned containers/networks)

Conclusion

This is a well-engineered PR that adds significant value to moat. The implementation is thorough, well-tested, and properly documented. My main concern is the BuildKit integration complexity, which may not be necessary for the core feature.

Recommendation: ✅ Approve with minor changes

The issues identified are mostly polish items and potential future improvements. The core functionality is solid, secure, and ready to merge after addressing:

  1. BuildKit necessity clarification (consider deferring to separate PR)
  2. Verification of image tag cache collision prevention
  3. Minor documentation enhancements

Great work overall! The test coverage and error handling are exemplary.


Review completed: 2026-01-28
Files reviewed: 38 changed files (6,069 additions, 77 deletions)
Reviewer: Claude Code Assistant

@dpup
Collaborator Author

dpup commented Jan 28, 2026

Integrating this into a larger runtime PR too 😬

@dpup dpup closed this Jan 28, 2026
@dpup dpup deleted the docker-in-docker branch January 28, 2026 20:56