
feat(routing): add hostname-based service routing for agents #1

Merged
dpup merged 31 commits into main from dan on Jan 13, 2026

Conversation

@dpup dpup commented Jan 13, 2026

Add hostname-based service routing for agents, enabling predictable URLs like http://web.myapp.localhost:8080 for local development with OAuth flows, webhooks, and multi-service architectures.

Key features:

  • Agent naming: Agents get names via --name flag, agent.yaml, or auto-generated (e.g., fluffy-chicken)
  • Port exposure: Configure ports: in agent.yaml to expose container services
  • Hostname routing: Access services at http://<service>.<agent-name>.localhost:<proxy-port>
  • Standalone proxy: New agent proxy start/stop/status commands for privileged port binding (e.g., sudo agent proxy start --port=80)
  • Environment injection: Containers receive AGENTOPS_URL_* vars for OAuth callback configuration

CLI improvements:

  • Simplified agent run [path] syntax (removed redundant positional arg)
  • New --name flag to override agent name
  • Enhanced agent list output showing services

Example usage:

```yaml
# agent.yaml
name: myapp
runtime:
  node: 20
ports:
  web: 3000
  api: 8080
```

```
$ agent run ./project
Started agent "myapp" (run run-a1b2c3d4)
Services:
  web: http://web.myapp.localhost:8080 (container :3000)
  api: http://api.myapp.localhost:8080 (container :8080)

$ agent run --name=otherbranch ./project
Started agent "otherbranch" (run run-a1b2c3d4)
Services:
  web: http://web.otherbranch.localhost:8080 (container :3000)
  api: http://api.otherbranch.localhost:8080 (container :8080)
```

dpup added 30 commits January 12, 2026 22:29
Design for exposing container services via hostname routing so multiple
agent instances can run with predictable URLs (e.g., web.myapp.localhost).

Key features:
- Agent naming: --name flag, agent.yaml, or random adjective-animal
- Port config via agent.yaml ports map
- Shared reverse proxy with automatic lifecycle
- Environment variable injection for OAuth-friendly URLs
Detailed step-by-step plan with TDD approach for implementing:
- Random name generation (adjective-animal)
- Global config for proxy port
- Port binding for Docker and Apple containers
- Route table and reverse proxy
- Proxy lifecycle management
- CLI integration with --name flag
- Environment variable injection
Add port binding support to Docker runtime:
- Update CreateContainer to configure exposed ports and port bindings
- Add GetPortBindings method to retrieve actual assigned host ports
- Add stub GetPortBindings to Apple runtime for interface compliance

Port bindings let Docker assign random host ports for container ports.
Hostname routing relies on this: a DNS lookup for *.localhost returns the
host IP, and the proxy forwards HTTP traffic to the container through its
port binding.
Remove agent routes and runsByName index entry when destroying a run.
This ensures the route table stays clean and agent names can be reused
after a run is destroyed.
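The cleanup described above amounts to deleting both index entries under one lock. A minimal sketch, with illustrative names (`Registry`, `DestroyRun`) rather than the repo's actual types:

```go
package main

import (
	"fmt"
	"sync"
)

// Registry is an illustrative stand-in for the manager's route table and
// name index; the real types in this PR may differ.
type Registry struct {
	mu         sync.Mutex
	routes     map[string]string // "<service>.<name>" -> backend address
	runsByName map[string]string // agent name -> run ID
}

// DestroyRun removes every route owned by the agent and frees its name,
// so the name can be reused by a later run.
func (r *Registry) DestroyRun(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for host := range r.routes {
		if hostBelongsTo(host, name) {
			delete(r.routes, host)
		}
	}
	delete(r.runsByName, name)
}

// hostBelongsTo reports whether host has the form "<service>.<name>".
func hostBelongsTo(host, name string) bool {
	n := len(host) - len(name)
	return n > 1 && host[n:] == name && host[n-1] == '.'
}

func main() {
	r := &Registry{
		routes:     map[string]string{"web.myapp": ":3000", "api.myapp": ":8080", "web.other": ":3000"},
		runsByName: map[string]string{"myapp": "run-a1b2c3d4", "other": "run-x"},
	}
	r.DestroyRun("myapp")
	fmt.Println(len(r.routes), len(r.runsByName)) // only "other" entries remain
}
```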
Add Lifecycle type that manages the shared reverse proxy:
- EnsureRunning: starts proxy if not running, reuses if already running
- Port mismatch detection: errors if requested port differs from running proxy
- ShouldStop: checks if proxy can be stopped (no more agents)
- Proper lock file cleanup on stop
- Add proxyLifecycle field to Manager struct
- Start routing proxy automatically when first agent with ports starts
- Stop routing proxy when last agent with ports is destroyed
- Use global config for proxy port setting
- run: Show agent name, run ID, and service URLs after start
- list: Show NAME, RUN ID, STATE, SERVICES columns instead of ID/AGENT/STATE/CREATED
Add integration test that verifies:
- Proxy lifecycle start/stop
- Route registration for multiple services
- Host header routing to correct backend
- Default service routing
- Unknown agent 404 response
- Route removal and cleanup

Also fix lifecycle port handling:
- Port 0 now means "use random available port"
- Port < 0 means "use default (8080)"
- Port mismatch check only triggers when specific port requested
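The port semantics above are small enough to show directly; this is a hedged sketch (function names are illustrative, not the lifecycle's actual API):

```go
package main

import "fmt"

const defaultProxyPort = 8080

// resolveProxyPort implements the lifecycle port semantics described above:
//   port  < 0 -> use the default (8080)
//   port == 0 -> ask the OS for a random free port (net.Listen on ":0")
//   port  > 0 -> use exactly that port
func resolveProxyPort(requested int) int {
	if requested < 0 {
		return defaultProxyPort
	}
	return requested // 0 is passed through; the listener then picks a free port
}

// mismatch reports whether a newly requested port conflicts with an
// already-running proxy; it only triggers for specific port requests.
func mismatch(requested, running int) bool {
	return requested > 0 && requested != running
}

func main() {
	fmt.Println(resolveProxyPort(-1), resolveProxyPort(0), resolveProxyPort(9999))
	fmt.Println(mismatch(0, 8080), mismatch(80, 8080))
}
```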
- Add "Hostname-Based Service Routing" section
- Document agent naming (--name flag, config, random)
- Document ports configuration in agent.yaml
- Document proxy URL structure and environment variables
- Document global proxy configuration
- Add --name flag to `agent run` command documentation
Two simple Python HTTP servers demonstrating hostname-based routing:
- web_server.py: responds "Hello from port 3000"
- api_server.py: responds "Hello from port 8080"
- agent.yaml: exposes both as web.demo.localhost and api.demo.localhost
The proxy runs in a separate process from agents. When a new agent
registers routes by writing to routes.json, the proxy's in-memory
RouteTable was stale. Now Lookup, LookupDefault, AgentExists, and
Agents reload from disk to pick up changes from other processes.
The <agent> positional arg was redundant with the name field in
agent.yaml and the --name flag. Now the CLI is simpler:

  agent run [path] [-- command]

The agent name comes from:
1. --name flag (highest priority)
2. name field in agent.yaml
3. Random generated name (e.g., fluffy-chicken)

Also removed the Agent field from Run struct and Options since
only Name is needed for identification and routing.
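The name precedence above can be sketched in a few lines; the word lists are placeholders, not the actual generator's vocabulary:

```go
package main

import (
	"fmt"
	"math/rand"
)

var adjectives = []string{"fluffy", "brave", "quiet"}
var animals = []string{"chicken", "otter", "lynx"}

// resolveName applies the precedence described above: --name flag, then
// the name field from agent.yaml, then a random adjective-animal pair.
// No rand.Seed needed: Go 1.20+ auto-seeds the global source.
func resolveName(flagName, configName string) string {
	if flagName != "" {
		return flagName
	}
	if configName != "" {
		return configName
	}
	return adjectives[rand.Intn(len(adjectives))] + "-" + animals[rand.Intn(len(animals))]
}

func main() {
	fmt.Println(resolveName("otherbranch", "myapp")) // flag wins
	fmt.Println(resolveName("", "myapp"))            // falls back to agent.yaml
	fmt.Println(resolveName("", ""))                 // random, e.g. "fluffy-chicken"
}
```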
Add `agent proxy start/stop/status` commands to run the routing proxy
as a standalone daemon, enabling privileged port binding:

  sudo agent proxy start --port=80
  agent run examples/multi-service -- bash /workspace/start.sh

Fix port binding issues:
- Use bridge network mode when ports need publishing (host mode ignores
  port bindings)
- Bind to 0.0.0.0 so Docker properly assigns random host ports
- Clean up routes when agents exit via Stop() or Wait()

Fix typo in example start.sh script.
- Remove deprecated rand.Seed() call (Go 1.20+ auto-seeds)
- Remove unused runsByName field from Manager
- Consolidate CLI output (remove redundant fmt.Printf calls)
- Rename Agent field to Name in run.Options struct usage in E2E tests
- Fix gofmt formatting issue in internal/run/run.go
dpup added a commit that referenced this pull request Jan 27, 2026
Addresses code review issue #1: BuildKit sidecar lifecycle management gap.

When Destroy() is called, the BuildKit sidecar container and network were
not being cleaned up, causing resource leaks.

Changes:
- Remove BuildKit sidecar container in Manager.Destroy() if present
- Remove BuildKit network in Manager.Destroy() if present
- Both operations are best-effort with appropriate error handling
- Network removal uses debug logging (may already be removed)

Prevents resource leaks when runs with docker:dind are destroyed.
dpup added a commit that referenced this pull request Jan 29, 2026
* feat(deps): add docker dependency for Docker access inside containers

Add support for running Docker commands inside moat containers via two modes:

**docker:host** - Mounts the host Docker socket
- Fast startup, shared image cache with host
- Full access to host Docker daemon
- Use when you trust the agent and want speed

**docker:dind** - Runs isolated Docker daemon inside container
- Complete isolation from host Docker
- Requires privileged mode (set automatically)
- Slower startup (~5-10s for daemon init)
- Use for untrusted code or when isolation is required

Key implementation details:
- DockerMode type with host/dind variants (explicit mode required)
- moat-init script handles setup for both modes:
  - Host: detects socket GID inside container, adds moatuser to group
  - Dind: starts dockerd, waits for readiness, adds moatuser to docker group
- Docker CLI installed from official Docker repo (docker-ce-cli)
- Dind mode also installs docker-ce, containerd.io, docker-buildx-plugin
- BuildKit disabled in dind mode (DOCKER_BUILDKIT=0) to avoid session issues
- Image tags include DockerMode to prevent cache collisions

Example usage in agent.yaml:
```yaml
dependencies:
  - docker:host  # or docker:dind
```

* docs(plans): add gVisor integration design

* docs(plans): add gVisor implementation plan

* feat(container): add GVisorAvailable detection function

* feat(container): add sandbox parameter to NewDockerRuntime

* feat(container): pass OCI runtime to container creation

* feat(container): add RuntimeOptions with sandbox support

* feat(cli): add --no-sandbox flag

* feat(run): wire NoSandbox through run manager

* feat(config): add sandbox field to agent.yaml

* feat(cli): honor sandbox setting from agent.yaml

* docs: document --no-sandbox flag and sandbox config

* fix(container): make gVisor install instructions copy-pasteable

Consolidate the Linux installation steps into a single command that
users can copy and run directly. Also adds -y flag to apt install
and includes systemctl reload step.

* docs(plans): add buildkit sidecar design

Add design document for automatic BuildKit sidecar with docker:dind mode.
Key points:
- BuildKit sidecar automatically deployed with docker:dind
- Shared Docker network for container communication
- Fast builds (buildkit) + full Docker daemon (runtime ops)
- No user configuration required

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* fix(cli): admin commands don't require gVisor sandbox

Commands like 'moat status', 'moat clean', 'moat system images', and
'moat system containers' only query/manage existing resources - they
don't create new containers that need sandboxing.

These commands now explicitly use Sandbox: false to avoid requiring
gVisor when it's not needed.

* fix(container): address PR review feedback

- Deduplicate gVisor detection: NewDockerRuntime now uses GVisorAvailable()
- Use Docker's default runtime when sandbox disabled (empty string vs 'runc')
- Add config validation for sandbox field (only '' or 'none' allowed)
- Add timeout to TestGVisorAvailable to prevent CI hangs
- Add tests for sandbox config validation
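The sandbox-field validation mentioned above is a simple allowlist; a minimal sketch (not the repo's exact code):

```go
package main

import "fmt"

// validateSandbox enforces the config rule described above: the agent.yaml
// sandbox field may only be empty (platform default) or "none" (explicitly
// disabled).
func validateSandbox(value string) error {
	switch value {
	case "", "none":
		return nil
	default:
		return fmt.Errorf("invalid sandbox value %q: must be empty or %q", value, "none")
	}
}

func main() {
	for _, v := range []string{"", "none", "gvisor"} {
		fmt.Printf("sandbox=%q -> %v\n", v, validateSandbox(v))
	}
}
```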

* fix(e2e): disable sandbox for E2E tests

E2E tests verify moat functionality, not gVisor specifically.
Using NoSandbox: true allows tests to run in CI without gVisor.

* feat(container): add network create/remove methods

* fix(container): ensure RemoveNetwork properly ignores not-found errors

- Update RemoveNetwork to match RemoveContainer pattern by checking errdefs.IsNotFound
- Add test case TestDockerRuntime_RemoveNetwork_NotFound to verify best-effort behavior
- Add comment explaining bridge driver choice in CreateNetwork

This ensures the documentation matches actual behavior: RemoveNetwork is truly
best-effort and won't fail when the network doesn't exist.

* feat(container): add sidecar start method

* feat(storage): add buildkit sidecar metadata fields

* feat(run): add buildkit configuration logic

* feat(container): add network create/remove methods

* wip: integrate buildkit lifecycle (partial)

* feat(container): add sidecar start method

* feat(run): integrate buildkit sidecar in create lifecycle

* feat(container): add inspect container helper

* feat(run): cleanup buildkit sidecar and network on stop

* feat(audit): add buildkit sidecar audit events

* docs(reference): document buildkit sidecar for docker:dind

* fix(container): make RemoveNetwork truly best-effort

* fix: add missing InspectContainer method after merge

* docs(plans): add buildkit client integration plan

Detailed plan to fix image building by using BuildKit Go client
instead of Docker SDK when BuildKit sidecar is active.

Key points:
- Use github.com/moby/buildkit/client for proper BuildKit support
- Automatic fallback to Docker SDK when BuildKit not available
- Full feature support: cache mounts, multi-platform, progress streaming
- Backward compatible with existing docker:host configurations

* feat(deps): add buildkit client library

* feat(buildkit): add client wrapper for image builds

* feat(container): use buildkit client for image builds when available

* feat(buildkit): improve error messages and logging

Enhance error handling and observability for BuildKit integration:

Error Messages:
- BuildKit connection failures now suggest checking docker:dind sidecar status
  and BUILDKIT_HOST configuration
- Build failures include context about Dockerfile syntax and build context path
- Ping failures provide actionable suggestions for network troubleshooting

Debug Logging:
- Log routing decision (BuildKit vs Docker SDK) with relevant parameters
- Log BuildKit connection address during client creation
- Log build options (tag, platform, no_cache, context_dir) when starting build
- Log successful operations (ping, build completion) for debugging
- Log errors with tag context for easier troubleshooting

All logging follows existing patterns from internal/log package using
structured key-value pairs at appropriate levels (Debug/Error).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* test(buildkit): add integration tests for client

Add comprehensive integration and E2E tests for BuildKit integration:

**Integration tests (internal/buildkit/client_test.go):**
- TestClient_Build: Basic image build with BuildKit
- TestClient_BuildWithArgs: Build with build arguments
- TestClient_BuildWithInvalidDockerfile: Error handling for invalid Dockerfile
- TestClient_BuildWithUnreachableBuildKit: Error handling for unreachable BuildKit

All integration tests skip gracefully when BUILDKIT_HOST is not set,
following the pattern established by TestClient_Ping.

**E2E tests (internal/e2e/docker_test.go):**
- TestDockerDindBuildKitImageBuild: Full end-to-end test that:
  - Verifies BUILDKIT_HOST is set correctly in docker:dind containers
  - Builds an image using BuildKit
  - Verifies the built image exists and can be run
  - Tests the complete BuildKit integration flow

**Test fixes:**
- Fixed TestDockerRuntime_BuildImage_PathSelection to handle
  BuildKit connection errors correctly

All tests pass locally and skip appropriately when dependencies
(BuildKit, Docker) are not available.

* docs(buildkit): document client integration and troubleshooting

Add BuildKit client usage documentation to agent.yaml reference:
- Explain how BuildKit integration works with docker:dind
- Document fallback behavior when BuildKit is unavailable
- Add troubleshooting section for common error messages
- Clarify BuildKit vs Docker SDK usage patterns

* fix(buildkit): add privileged mode and proxy bypass for sidecar

Fixes two critical issues preventing BuildKit from working:

1. **Proxy bypass for BuildKit connections**: Added 'buildkit' hostname
   to NO_PROXY list. Without this, BuildKit gRPC connections were being
   routed through the HTTP proxy, causing "error reading server preface:
   EOF" errors because the proxy doesn't understand gRPC protocol.

2. **Privileged mode for BuildKit sidecar**: BuildKit requires privileged
   mode to perform bind mounts for its snapshot operations. Without this,
   builds fail with "operation not permitted" when BuildKit tries to
   mount its internal filesystem snapshots.

Changes:
- internal/run/manager.go: Add 'buildkit' to NO_PROXY list
- internal/container/runtime.go: Add Privileged field to SidecarConfig
- internal/container/docker.go: Honor Privileged field in StartSidecar
- internal/run/manager.go: Set Privileged=true for BuildKit sidecar

These fixes allow BuildKit to communicate directly without proxy
interference and have the necessary permissions to manage build contexts.

* fix(cli): show 'Initializing...' progress indicator on startup

Users reported no visible feedback when running 'moat run'. Now shows
'Initializing...' immediately while the runtime, proxy, and container
are being set up.

* fix(buildkit): mount docker socket for image export

Fixes image export failure in docker:dind mode. BuildKit sidecar now
mounts the dind Docker socket to export built images to the daemon.

Problem:
- BuildKit would build images successfully but couldn't export them
- Docker daemon would try to pull "moat/run:xxx" from Docker Hub
- Error: "pull access denied for moat/run, repository does not exist"

Root cause:
- BuildKit's "image" exporter requires access to Docker API to export
- Without socket access, images stayed in BuildKit's cache
- Docker daemon had no way to access the built images

Solution:
- Mount /var/run/docker.sock from dind into BuildKit sidecar
- This is the dind container's socket, NOT the host's socket
- Maintains isolation: BuildKit can only affect dind daemon, not host
- Allows BuildKit to export images directly to dind daemon

Security note:
- Socket mounting only affects docker:dind mode (already privileged)
- BuildKit mounts the isolated dind socket, not host socket
- docker:host mode remains unchanged (mounts host socket directly)
- dind provides better isolation: cannot access host filesystem

Changes:
- internal/container/runtime.go: Add Mounts field to SidecarConfig
- internal/container/docker.go: Process mounts in StartSidecar
- internal/run/manager.go: Mount docker socket to BuildKit sidecar

* fix(buildkit): use docker exporter instead of image exporter

The 'image' exporter writes to containerd image store, not Docker daemon.
Use 'docker' exporter which writes directly to Docker daemon via socket.

This ensures images built by BuildKit are immediately available to the
Docker daemon for running containers, fixing 'pull access denied' errors.

* fix(buildkit): mount /tmp for build context access

BuildKit sidecar needs access to the same /tmp directory where build
contexts are created. Without this, BuildKit cannot read the Dockerfile
and context files, resulting in 'method not supported' errors.

The build flow:
1. Main container creates temp dir: /tmp/moat-build-*
2. Main container writes Dockerfile to temp dir
3. Main container calls BuildKit with path to temp dir
4. BuildKit reads files from temp dir (needs access!)

Both containers now share /tmp, allowing BuildKit to access build
contexts created by the main container.

* fix(buildkit): use session-based file sync for build context

Replace deprecated LocalDirs with proper session-based file sync.
This fixes the 'method not supported by the client' error.

The issue:
- LocalDirs is deprecated and doesn't work reliably
- BuildKit couldn't access build context via LocalDirs
- Error: 'method /moby.filesync.v1.FileSend/diffcopy not supported'

The solution:
- Create a BuildKit session for each build
- Attach filesync.NewFSSyncProvider with StaticDirSource
- Upload build context from client to BuildKit via session protocol
- This is the modern, supported way to provide context to BuildKit

The build context is now properly uploaded to BuildKit over the
session protocol, regardless of filesystem boundaries between
containers.

* fix(container): default to legacy builder for docker:host reliability

Changes Docker SDK build behavior to use legacy builder (BuilderV1) by
default instead of BuilderBuildKit. This fixes 'no active sessions'
errors on Docker Desktop and other environments where Docker SDK's
BuildKit integration is unreliable.

The problem:
- Docker SDK's BuildKit integration (BuilderBuildKit flag) is buggy
- Causes 'no active sessions' errors with buildx and Docker Desktop
- Users had to set MOAT_DISABLE_BUILDKIT=1 workaround

The solution:
- docker:dind mode: Uses our BuildKit client (fast, reliable)
- docker:host mode: Uses legacy builder (slower, but works everywhere)
- Users can opt-in to Docker SDK's BuildKit via DOCKER_BUILDKIT=1

This ensures moat works out-of-the-box on all platforms while still
providing the fast BuildKit path for docker:dind users.

Future work: Detect and connect to buildx's BuildKit for docker:host
mode to get BuildKit benefits without docker:dind overhead.

* revert: restore BuildKit as default for Docker SDK

Reverts the previous change that made legacy builder the default.
BuildKit should be the default for better performance - the 'no active
sessions' issue was environment-specific and can be worked around with
MOAT_DISABLE_BUILDKIT=1.

This restores the original behavior where:
- Default: Uses BuildKit (BuilderBuildKit) for fast builds
- Opt-out: Set MOAT_DISABLE_BUILDKIT=1 for legacy builder

The previous commit incorrectly assumed all Docker Desktop setups
have broken BuildKit, when it's actually working for most users.

* fix(buildkit): resolve filesync error with docker exporter

Fixes "method /moby.filesync.v1.FileSend/diffcopy not supported by the
client" error that occurred during BuildKit image export phase.

Root cause: The docker exporter requires an Output function to handle
the tar stream. Manual session management created race conditions and
complexity.

Solution:
- Use LocalMounts for build context (BuildKit auto-manages session)
- Provide Output function that pipes tar stream to `docker load`
- Remove manual session lifecycle management
- Stream image directly without intermediate files

This simplifies the BuildKit integration while fixing nested moat
execution (moat-in-moat) and following BuildKit's recommended patterns.

Tested: All E2E tests pass including TestDockerDindBuildKitImageBuild
and all dependency runtime tests.

* chore: Fixups

* docs: add Docker dependency and BuildKit sidecar documentation

Updates conceptual documentation to explain docker:host and docker:dind
dependency modes, their security implications, and BuildKit sidecar
integration.

Changes:
- concepts/01-sandboxing.md: Add Docker access modes section explaining
  security tradeoffs between docker:host and docker:dind
- concepts/06-dependencies.md: Add comprehensive Docker dependencies
  section with mode comparison table and BuildKit sidecar details
- getting-started/02-installation.md: Fix misleading BuildKit installation
  note (BuildKit sidecar is automatic with docker:dind, no manual install
  needed)

The reference documentation (agent-yaml.md) already has comprehensive
technical details. These updates provide conceptual understanding and
decision-making guidance for users.

* fix(run): add BuildKit sidecar cleanup in Destroy

Addresses code review issue #1: BuildKit sidecar lifecycle management gap.

When Destroy() is called, the BuildKit sidecar container and network were
not being cleaned up, causing resource leaks.

Changes:
- Remove BuildKit sidecar container in Manager.Destroy() if present
- Remove BuildKit network in Manager.Destroy() if present
- Both operations are best-effort with appropriate error handling
- Network removal uses debug logging (may already be removed)

Prevents resource leaks when runs with docker:dind are destroyed.

* fix: address linter errors in BuildKit integration

- Check error returns for cleanup operations
- Remove unused types import
- Fix variable shadowing in buildImageWithBuildKit
- Pre-allocate mounts slice capacity
- Replace deprecated types.ContainerJSON with container.InspectResponse

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* test(e2e): skip nested dind tests in CI environments

Add skipIfNestedDind() helper to detect environments where nested dind
is unreliable. Nested dind (dind inside dind) is not a supported
configuration.

Detection strategy (checked in order):
1. GITHUB_ACTIONS=true - GitHub Actions always uses dind
2. /.dockerenv + /var/run/docker.sock - we're in a container with docker
3. CI=true + docker info shows dind indicators - other CI systems

This fixes test failures in CI where dind tests fail with
"dial unix /var/run/docker.sock: connect: no such file or directory"
because the inner dockerd can't start properly in nested dind setups.

The tests will still run in local development environments where
nested dind restrictions don't apply.
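The detection order above can be sketched with the environment and filesystem checks injected, so the logic is testable without a real CI box; the helper in the PR presumably reads the process environment directly:

```go
package main

import (
	"fmt"
	"os"
)

// nestedDind applies the detection strategy described above, in order.
// getenv and exists are injected for testability; dockerInfoDind stands in
// for parsing `docker info` output for dind indicators.
func nestedDind(getenv func(string) string, exists func(string) bool, dockerInfoDind bool) bool {
	if getenv("GITHUB_ACTIONS") == "true" {
		return true // GitHub Actions always runs jobs in dind
	}
	if exists("/.dockerenv") && exists("/var/run/docker.sock") {
		return true // inside a container that already has a docker socket
	}
	if getenv("CI") == "true" && dockerInfoDind {
		return true // other CI systems, confirmed via docker info
	}
	return false
}

func main() {
	statExists := func(p string) bool { _, err := os.Stat(p); return err == nil }
	fmt.Println(nestedDind(os.Getenv, statExists, false))
}
```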

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* docs: add runtime separation design

Design for improving separation between Docker and Apple container
runtimes using the Feature Manager pattern.

Problem: Current Runtime interface mixes universal operations with
Docker-only features, forcing type assertions and stub implementations.

Solution: Split Runtime into minimal core + optional feature managers
(NetworkManager, SidecarManager) that Docker provides and Apple returns
nil for. Manager code checks for nil and fails gracefully.

Benefits:
- No type assertions
- No stub implementations
- Clear feature boundaries
- Type safety
- Easy to extend with new Docker features

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
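The Feature Manager split described above looks roughly like this sketch (interface and type names are illustrative): the core `Runtime` stays minimal, Docker-only features hang off an optional accessor, and Apple simply returns nil for it.

```go
package main

import "fmt"

// NetworkManager is an optional feature: only runtimes that support
// user-defined networks provide a non-nil implementation.
type NetworkManager interface {
	CreateNetwork(name string) error
	RemoveNetwork(name string) error
}

type Runtime interface {
	Name() string
	// NetworkManager returns nil when the runtime has no network support.
	NetworkManager() NetworkManager
}

type dockerRuntime struct{}

func (dockerRuntime) Name() string                   { return "docker" }
func (dockerRuntime) NetworkManager() NetworkManager { return dockerNetworks{} }

type dockerNetworks struct{}

func (dockerNetworks) CreateNetwork(string) error { return nil }
func (dockerNetworks) RemoveNetwork(string) error { return nil }

type appleRuntime struct{}

func (appleRuntime) Name() string                   { return "apple" }
func (appleRuntime) NetworkManager() NetworkManager { return nil }

// ensureNetwork checks for nil and fails gracefully instead of
// type-asserting on the concrete runtime.
func ensureNetwork(rt Runtime, name string) error {
	nm := rt.NetworkManager()
	if nm == nil {
		return fmt.Errorf("%s runtime does not support networks", rt.Name())
	}
	return nm.CreateNetwork(name)
}

func main() {
	fmt.Println(ensureNetwork(dockerRuntime{}, "buildkit"))
	fmt.Println(ensureNetwork(appleRuntime{}, "buildkit"))
}
```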

* docs: add runtime separation implementation plan

Detailed 14-task implementation plan for migrating Runtime interface
to Feature Manager pattern.

Migration strategy:
- Phase 1: Add feature managers alongside existing methods
- Phase 2: Migrate manager.go to use feature managers
- Phase 3: Remove old methods from Runtime interface

Each task is a 2-5 minute action with exact code, file paths, test
commands, and expected output.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* feat(container): add NetworkManager and SidecarManager interfaces

* feat(container): add Docker feature manager implementations

- Add dockerNetworkManager and dockerSidecarManager structs
- Move CreateNetwork, RemoveNetwork to dockerNetworkManager
- Move StartSidecar, InspectContainer to dockerSidecarManager
- Add NetworkManager() and SidecarManager() accessors to DockerRuntime
- Keep old methods on DockerRuntime as delegates for backward compatibility
- Add nil managers to AppleRuntime (returns nil for both)
- InspectContainer on manager returns common ContainerInspectResponse type

* test(container): add feature manager detection tests

* refactor(run): migrate BuildKit network creation to feature manager

* refactor(run): migrate remaining type assertions to feature managers

Migrates all remaining DockerRuntime type assertions in manager.go to use
feature managers (SidecarManager and NetworkManager). This continues the
runtime separation pattern established in previous commits.

Migrated operations:
- StartSidecar (BuildKit sidecar start) → SidecarManager
- InspectContainer (BuildKit ready check) → SidecarManager
- RemoveNetwork (5 locations across Create, Stop, Destroy) → NetworkManager

All cleanup paths preserved with proper nil checks for runtime compatibility.
Error messages now indicate when operations require Docker runtime.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* refactor(container): remove delegating methods from DockerRuntime

Remove old delegate methods that are no longer needed after migration
to feature managers:
- CreateNetwork/RemoveNetwork from DockerRuntime (use NetworkManager)
- StartSidecar from DockerRuntime (use SidecarManager)
- InspectContainer from DockerRuntime (use SidecarManager)
- CreateNetwork/RemoveNetwork/StartSidecar from AppleRuntime stubs
- CreateNetwork/RemoveNetwork/StartSidecar from Runtime interface

All callers now use the feature manager pattern via NetworkManager()
and SidecarManager() accessors.

Also updated docker_test.go to call feature managers directly instead
of the old delegate methods.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* feat(container): add BuildManager interface

* feat(container): add Docker BuildManager implementation

- Add dockerBuildManager struct implementing BuildManager interface
- Move BuildImage, ImageExists, GetImageHomeDir to dockerBuildManager
- Keep temporary delegate methods on DockerRuntime for compatibility
- Add stub BuildManager method to AppleRuntime (implemented in Task 3)
- All build-related helper methods moved to manager

* feat(container): add Apple BuildManager implementation

Adds appleBuildManager struct to isolate build functionality:
- BuildImage (including fixBuilderDNS helper)
- ImageExists
- GetImageHomeDir

The AppleRuntime.BuildManager() accessor now returns the manager.
Temporary delegate methods remain on AppleRuntime for compatibility
(will be removed in Task 5).

* refactor(run): migrate BuildImage to BuildManager pattern

* refactor(container): remove build methods from Runtime interface

All callers now use BuildManager() to access build functionality.
Removed BuildImage, ImageExists, and GetImageHomeDir from Runtime
interface and their delegate implementations in DockerRuntime and
AppleRuntime.

Updated docker_test.go to call BuildManager().BuildImage() instead
of calling the removed Runtime.BuildImage() method.

Fixed ensureImage helper to use buildMgr.ImageExists() directly.

* refactor(run): migrate ImageExists and GetImageHomeDir to BuildManager

* refactor(container): extract user-facing messages to output package

Extract common user-facing messages (image pulling, building) into a shared
output package for consistency across Docker and Apple runtimes.

Benefits:
- Consistent messaging across both runtimes
- Single source of truth for display format
- Easy to add progress bars, colors uniformly later
- Reduces duplication without coupling core logic

Changed messages:
- PullingImage: Used in 5 locations (3 Docker, 2 Apple)
- BuildingImage: Used in 2 locations (1 Docker, 1 Apple)

* docs(concepts): add comprehensive container runtimes guide

Created docs/content/concepts/07-runtimes.md covering:
- Runtime detection and selection (Docker vs Apple containers)
- gVisor sandbox modes (runsc vs runc)
- --no-sandbox flag usage and security implications
- Proxy security models per runtime
- BuildKit and Docker-in-Docker integration
- Platform-specific guidance
- Troubleshooting common issues

Updated docs/content/concepts/01-sandboxing.md:
- Simplified runtime overview section
- Added reference to new runtimes page
- Maintained focus on isolation concepts

Follows STYLE-GUIDE.md:
- Objective tone (avoids marketing language)
- Factual claims (specific mechanisms)
- Practical examples (commands with output)
- Clear cross-references

* docs(container): document BuildKit + gVisor compatibility assumption

Add comment to dockerSidecarManager documenting that BuildKit
sidecars inherit the main container's OCI runtime and noting
that BuildKit + gVisor combination has not been extensively
tested in production.

* feat(apple): add 30-second timeout to DNS lock acquisition

Add timeout to DNS lock in fixBuilderDNS to prevent indefinite
blocking when multiple moat processes configure builder DNS
concurrently. Uses goroutine + select pattern to enforce
30-second timeout with clear error message.

Previously, lock acquisition would block indefinitely, causing all
concurrent builds to serialize. With timeout, builds will fail fast
if unable to acquire lock within 30 seconds.

* feat(sidecar): add container labels for orphan cleanup

Add RunID field to SidecarConfig and use it to label BuildKit
sidecars with moat.run-id and moat.role labels. These labels
enable cleanup of orphaned sidecars if moat crashes before
calling Destroy().

Changes:
- Add RunID field to SidecarConfig struct
- Create labels map in StartSidecar with run-id and role
- Pass r.ID when creating BuildKit sidecar in manager.go

Follow-up: Implement orphan cleanup in loadPersistedRuns() to
scan containers and remove sidecars whose run-id is not in m.runs.

* feat(docker): add gVisor availability check before container creation

Add runtime check in CreateContainer to detect if gVisor became
unavailable after DockerRuntime initialization (e.g., daemon restart,
config change). Provides clearer error message indicating the runtime
was previously available.

This is an edge case - gVisor is checked during NewDockerRuntime(),
but this additional check catches configuration changes during long-
running moat processes.

* feat(firewall): add port range validation in SetupFirewall

Add validation that proxyPort is in valid range (1-65535) before
using in iptables script. While port comes from trusted source
(proxy.Server.Start()), validation adds defense in depth.

Applied to both Docker and Apple container runtimes.
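The check itself is small; a sketch of the defense-in-depth validation described above (the port is interpolated into an iptables script, so reject anything outside 1-65535 before it reaches the shell):

```go
package main

import "fmt"

// validatePort rejects out-of-range proxy ports before they are used in
// the firewall setup script.
func validatePort(port int) error {
	if port < 1 || port > 65535 {
		return fmt.Errorf("invalid proxy port %d: must be in range 1-65535", port)
	}
	return nil
}

func main() {
	for _, p := range []int{8080, 0, 70000} {
		fmt.Printf("port %d -> %v\n", p, validatePort(p))
	}
}
```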

* fix(runtime): auto-disable gVisor on macOS and Windows

Make default behavior platform-aware to avoid requiring --no-sandbox
on platforms where gVisor is unavailable.

Changes:
- Update DefaultRuntimeOptions() to detect platform
- On Linux: sandbox=true (requires gVisor, enhanced isolation)
- On macOS/Windows: sandbox=false (gVisor unavailable in Docker Desktop)
- Log clear reason when gVisor disabled on non-Linux platforms

This eliminates the error "gVisor (runsc) is required but not available"
on macOS and Windows, where gVisor is never available in Docker Desktop.
Users can still override with --no-sandbox on Linux if needed.

Updated documentation to reflect automatic platform detection and
clarify that --no-sandbox only affects Linux.

* fix(docker): suppress gVisor warning on macOS and Windows

Only show "running without gVisor sandbox" warning on Linux where
gVisor is available but explicitly disabled via --no-sandbox.

On macOS and Windows, gVisor is unavailable in Docker Desktop, so
running without it is the expected default behavior, not a security
downgrade. Suppressing the warning provides cleaner output on these
platforms.

User experience on macOS/Windows:
- No warning when running normally (gVisor auto-disabled)
- No --no-sandbox flag needed
- Clean startup without confusing security messages

* fix(manager): use platform-aware defaults when NoSandbox not specified

Change ManagerOptions.NoSandbox from bool to *bool to distinguish:
- nil (default): Use platform-aware defaults from DefaultRuntimeOptions()
- &true: Explicitly disable sandbox (--no-sandbox flag)
- &false: Not used (would mean explicitly enable, but we use nil instead)

This fixes the issue where NewManager() was forcing sandbox=true on all
platforms, causing gVisor requirement errors on macOS/Windows where gVisor
is unavailable in Docker Desktop.

Before: NewManager() always required gVisor
After: NewManager() uses platform defaults (gVisor on Linux, standard on macOS/Windows)

Updated CLI callers:
- exec.go: Only set NoSandbox when --no-sandbox flag provided
- status.go, clean.go: Explicitly set NoSandbox=true (no sandbox needed)
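The three-state nil/set distinction can be sketched like this. The resolver function `resolveNoSandbox` and its `platformDefault` parameter are illustrative; only the `NoSandbox *bool` field comes from the commit:

```go
package main

import "fmt"

// ManagerOptions mirrors the three-state flag: nil means "use the
// platform default", a non-nil pointer is an explicit user choice.
type ManagerOptions struct {
	NoSandbox *bool
}

// resolveNoSandbox picks the effective value, deferring to the
// platform default when the caller did not set the flag.
func resolveNoSandbox(opts ManagerOptions, platformDefault bool) bool {
	if opts.NoSandbox != nil {
		return *opts.NoSandbox
	}
	return platformDefault
}

func main() {
	explicit := true
	fmt.Println(resolveNoSandbox(ManagerOptions{NoSandbox: &explicit}, false)) // explicit wins
	fmt.Println(resolveNoSandbox(ManagerOptions{}, true))                      // falls back to default
}
```

A plain `bool` cannot express "not specified", which is exactly why the commit moves to `*bool`: the zero value `false` would be indistinguishable from an explicit opt-out.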

* fix(e2e): update tests for NoSandbox pointer type

Update all E2E tests to use pointer syntax for NoSandbox field.
Changed from 'NoSandbox: true' to 'NoSandbox: &[]bool{true}[0]'
to match the new *bool type.

All E2E tests passing.

* chore: Delete review doc

* fix(runtime): address code review feedback

- Add warning when creating privileged containers (docker:dind mode)
- Document BuildKit + gVisor experimental status in runtime docs

The privileged container warning helps users understand security
implications of docker:dind mode. The BuildKit + gVisor note clarifies
that while expected to work, this combination hasn't been extensively
tested in production.

Code review verified that resource cleanup already has proper nil checks
for feature managers, and dockerLoadWriter.Close() correctly waits for
the docker load process.

* fix: address second round of code review feedback

Security improvements:
- Audit all privileged containers unconditionally (not just docker:dind)
- Validate Docker socket GID and permissions
- Add BuildKit readiness check with exponential backoff

The audit logging now captures any privileged container creation regardless
of the reason, improving security visibility. Socket validation warns when
permissions are unexpected. BuildKit readiness check handles ~5-10s daemon
initialization time with retry loop (30s timeout, 100ms-2s backoff).

* fix: address critical concurrency and resource leak issues

Critical fixes (MUST FIX before merge):
1. Fix proxy server shutdown race condition - use sync.Once pattern for
   both ProxyServer and SSHAgentServer to prevent concurrent Stop() calls

2. Fix incomplete BuildKit cleanup - add RemoveContainer call in error
   path when main container creation fails (was only stopping, not removing)

These fixes prevent race conditions in proxy shutdown and resource leaks
when container creation fails after BuildKit sidecar has started.

* fix: add thread-safe state access for Run struct

Critical fix for race condition in run state access. Run state fields
(State, Error, StartedAt, StoppedAt) were accessed with inconsistent
locking - Manager lock (m.mu) protected the runs map but not the Run
struct fields themselves.

Solution: Add dedicated stateMu mutex to Run struct with helper methods:
- GetState() / SetState() for thread-safe state access
- SetStateWithError() for atomic state+error updates
- SetStateWithTime() for atomic state+timestamp updates

Updated critical paths:
- monitorContainerExit goroutine (updates state after container exits)
- Stop() method (checks and sets state)
- Wait() method (reads error state)

This prevents data races between concurrent access from multiple
goroutines (monitorContainerExit, Stop, Wait).

---------

Co-authored-by: Andy Bonventre <andybons@meetneptune.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>