Skip to content

feat(deps): add Ollama service dependency with provisioning and cache support#238

Merged
dpup merged 27 commits into
mainfrom
feat/ollama-sidecar
Mar 17, 2026
Merged

feat(deps): add Ollama service dependency with provisioning and cache support#238
dpup merged 27 commits into
mainfrom
feat/ollama-sidecar

Conversation

@dpup
Copy link
Copy Markdown
Collaborator

@dpup dpup commented Mar 17, 2026

Summary

  • Adds Ollama as a service dependency following the established postgres/mysql/redis pattern
  • Introduces generic provisioning primitives (provisions_key, provision_cmd, cache_path) on ServiceDef — reusable by any future service
  • Agent declares ollama in dependencies: and lists models under services.ollama.models; moat starts sidecar, pulls models, caches on host, injects MOAT_OLLAMA_* env vars
  • Includes code review fixes: provision item validation (prevents shell injection), per-item timeouts, per-item flock scope, tightened cache permissions

Test plan

  • Unit tests pass (go test ./internal/run/ ./internal/config/ ./internal/container/ -race)
  • Lint clean (make lint — 0 issues)
  • Shell injection rejected: model names with ;, |, $, `, & fail validation
  • Valid model names accepted: qwen2.5-coder:7b, nomic-embed-text, library/model:tag
  • No-auth services (Ollama) skip password generation
  • Auth services (postgres, mysql, redis) still get passwords
  • Unknown config keys produce actionable errors with "did you mean?" hints
  • Unknown scalar keys now captured and validated
  • Manual: moat run examples/service-ollama starts sidecar, pulls model, prints response
  • E2E: verify Ollama API accessible from agent container via MOAT_OLLAMA_URL

🤖 Generated with Claude Code

@claude
Copy link
Copy Markdown

claude Bot commented Mar 17, 2026

📦 Previous Review (superseded by newer review)

Claude finished @dpup's task in 3m 44s —— View job


Code Review

  • Gather context and understand PR scope
  • Review core provisioning logic (internal/run/services.go)
  • Review config changes (internal/config/config.go)
  • Review container changes (docker/apple service files)
  • Review registry and types changes
  • Review tests
  • Review docs and examples
  • Review miscellaneous files (todos/, compound-engineering.local.md)
  • Write up review summary

Summary

The overall design is clean and follows the existing postgres/mysql/redis pattern well. Shell injection is correctly addressed with a strict allowlist regex, the cache/provision fields are well-factored as generic ServiceDef primitives, and the tests are thorough. A few real bugs and process concerns need attention.


Bugs

1. Scalar models: value is silently swallowed — medium severity

In config.go:196, when a user writes a scalar (models: qwen2.5:7b) instead of a list, the unmarshaler captures it as Extra["models"] = nil. Then in services.go:178, provisions = userSpec.Extra["models"] resolves to nil, and the loop never validates or errors. The user gets no models pulled and no diagnostic.

// config.go:190-197
} else {
    // Capture unknown non-sequence keys as single-element entries
    // so the run layer can validate and reject them with useful errors.
    if s.Extra == nil {
        s.Extra = make(map[string][]string)
    }
    s.Extra[key] = nil // nil signals "key present but not a list"
}

The run layer never actually validates this nil signal for a known provisions key. It needs an explicit check:

// services.go — after retrieving provisions
provisions = userSpec.Extra[spec.Service.ProvisionsKey]
// BUG: if provisions is nil but key exists, the user wrote a scalar — error here

Fix this →


2. Misleading comment in provisionService — low severity but will mislead future readers

services.go:278:

// Acquire lock per-item so parallel runs pulling different items don't serialize.

This comment is incorrect. The lock file is a single file under cfg.CacheHostPath, shared by all concurrent moat run processes. Two runs pulling different models still serialize — they both wait on the same .lock. The per-item granularity only helps within a single run (prevents one run's models from blocking each other). This is already tracked as todo/005 but the comment should accurately describe current behavior.

Fix this →


Files that should not be committed

3. compound-engineering.local.md — this file appears to be local tooling configuration for a "compound engineering" review agent system. It has .local.md in the name, suggesting it's a local-only file. It has no value for contributors and will confuse anyone who clones the repo. Fix this →

4. todos/ directory (9 files) — these are code-review work items generated by the compound engineering system. They shouldn't live in the repository:

  • They duplicate what GitHub Issues provide, without the notifications, assignment, or searchability
  • They'll create noise for contributors and in git log
  • They reference ephemeral implementation state ("P1", "pending") that goes stale

Convert these to GitHub issues (or close them if already addressed) and remove the directory. Fix this →


Minor issues

5. cancel() without defer in provisioning loop (services.go:276–296)

The context cancel is called explicitly on every exit path, which works, but any future added return path before the cancel() call will silently leak the context. The idiomatic fix is to extract the loop body into a named helper and use defer cancel() there.

for _, cmd := range cmds {
    cmdCtx, cancel := context.WithTimeout(ctx, provisionTimeout)
    // ... cancel() called explicitly on each exit path
    // if a new return is added above cancel(), context leaks
}

6. Example moat.yaml is missing the agent field (examples/service-ollama/moat.yaml)

The agent field is declared without omitempty in the Config struct, and other examples (presumably) include it. This example uses command: to run a shell script directly, which may be intentional, but it's worth confirming whether moat run requires agent: or if command: alone is sufficient.


What's good

  • Injection defense: The validProvisionItem regex (^[a-zA-Z0-9][a-zA-Z0-9._:/@-]*$) correctly blocks ;, |, $, `, &, space, and other shell metacharacters. Validation happens before any shell execution.
  • Generic design: provisions_key / provision_cmd / cache_path in ServiceDef are cleanly reusable for any future service that needs post-start provisioning.
  • No-auth services: The needsPassword guard properly skips password generation for Ollama, and the tests verify this.
  • Per-item flock: Acquiring/releasing flock per model (rather than holding it for the full batch) is the right direction, even if cross-run serialization remains a todo.
  • Test coverage: The test file covers injection attempts, valid names, no-auth behavior, unknown key rejection, and env generation for Ollama. Good breadth.
  • Error messages: The "did you mean?" hint for wrong provision key names is a nice touch.
  • Cache permissions: 0o700 is tight and correct.

dpup added a commit that referenced this pull request Mar 17, 2026
- Validate scalar provision keys (e.g., `models: value` instead of list)
  with clear "must be a list" error message
- Fix misleading flock comment — parallel runs still serialize on
  shared lock file; extracted provisionItem helper with defer cancel()
- Remove compound-engineering.local.md (local tooling, not for repo)
- Remove todos/ directory (tracked as GitHub issues instead)
@claude
Copy link
Copy Markdown

claude Bot commented Mar 17, 2026

📦 Previous Review (superseded by newer review)

Claude finished @dpup's task in 3m 7s —— View job


Code Review

  • Gather context and read changed files
  • Review core provisioning logic (internal/run/services.go)
  • Review config changes (internal/config/config.go)
  • Review container changes (docker/apple service files)
  • Review registry and types changes
  • Review tests
  • Review docs and examples
  • Write up review summary

The previous review issues (scalar value detection, misleading lock comment, todos/ and compound-engineering.local.md cleanup) have all been addressed. The core implementation is solid. A few remaining issues below.


Bugs

1. Docs security model now inaccuratedocs/content/guides/08-services.md:246

**Authenticated** — Every service requires a password, even on the isolated network.

This is no longer true with Ollama. Needs a carve-out (e.g., "Services that support authentication require a password...") or just remove the absolute claim.

Fix this →


2. Readiness check table is misleadingdocs/content/guides/08-services.md:192

| `ollama` | `ollama list` + pull declared models |

Model pulling is NOT part of the readiness check — it runs after readiness passes. CheckReady only runs ollama list. The provisioning step is a separate phase. This will mislead users trying to understand why startup takes a long time ("it's ready, why is it still pulling?").

Fix this →


Minor issues

3. Startup lifecycle missing provisioning stepdocs/content/guides/08-services.md:154–166

The documented startup sequence goes:

5. Run readiness checks (poll every 1 second, timeout after 30 seconds)
6. Inject MOAT_* environment variables

For Ollama the actual sequence is 5 → pull models → 6. Since model pulls can take minutes, the missing step is the most user-visible part of the flow and should be documented (perhaps as a conditional step: "5b. Run provisioning commands (e.g., pull models)").

Fix this →

4. Model pull output bypasses run logsinternal/run/manager.go:1968

if err := provisionService(ctx, svcMgr, info, svcConfigs[i], os.Stderr); err != nil {

provisionService writes to os.Stderr directly, so model pull progress is not captured in ~/.moat/runs/<id>/logs.jsonl. A user running moat logs <run-id> after a failed pull won't see the pull output. If the intent is to show live progress during the pull, consider also routing through the storage.LogWriter (tee to both). Low priority since this is logged to the terminal during the run, but worth tracking.

5. provisionItem comment references todo/005internal/run/services.go:293

// See todo/005 for narrowing to per-model locks.

The todos/ directory was removed per the previous review. This reference is now a dead link. Either replace with a GitHub issue reference or just drop the "See todo/005" clause.

Fix this →


What looks good

  • All previous review feedback addressed: scalar value detection, lock comment accuracy, todos/ and compound-engineering.local.md removed.
  • provisionItem flock + defer: LIFO defer ordering is correct — LOCK_UN fires before cancel(). Each item gets a fresh 30-minute window. No context leak.
  • serviceUsesPasswordPlaceholder: Correctly handles Redis-style auth (no password_env, but {password} in extra_cmd/readiness_cmd).
  • Injection defense unchanged and correct: validProvisionItem blocks all shell metacharacters; validation happens before any exec.
  • Empty list handled correctly: models: [] → empty Extra["models"] → no provisioning, no error.
  • Container implementations consistent: Docker and Apple ProvisionService both fail-fast on non-zero exit.
  • Test breadth: injection tests, valid names, no-auth behavior, scalar rejection, unknown key rejection, env generation. Very thorough.

dpup added a commit that referenced this pull request Mar 17, 2026
- Fix readiness table: Ollama check is just "ollama list", model
  pulling is a separate provisioning step
- Add provisioning step (step 6) to startup sequence docs
- Fix security model docs: not all services require passwords (Ollama
  has no auth, relies on network isolation)
- Remove dead todo/005 reference from provisionItem comment
@claude
Copy link
Copy Markdown

claude Bot commented Mar 17, 2026

📦 Previous Review (superseded by newer review)

Claude finished @dpup's task in 3m 5s —— View job


Code Review

  • Gather context and understand PR scope
  • Review core provisioning logic (internal/run/services.go)
  • Review config changes (internal/config/config.go)
  • Review container changes (docker/apple service files)
  • Review registry and types changes
  • Review tests
  • Review docs and examples
  • Write up review summary

Both previous rounds of feedback have been addressed. The implementation is solid. Two remaining issues below; one is a real behavior bug, one is a documentation gap.


Bugs

1. wait: false silently skips model provisioning — medium severity

manager.go:1946:

if !wait {
    continue  // skips BOTH readiness check AND the provisioning block below
}

If a user declares models under services.ollama.models and also sets wait: false, moat starts the agent immediately and never pulls the models. There's no warning and no error — the agent just finds Ollama running with no models available.

For databases, skipping both readiness and provisioning on wait: false is fine (there's nothing to provision). For Ollama it's a silent failure mode. The model pull logically depends on readiness anyway (you can't ollama pull until the server accepts connections), but the user doesn't know this.

Minimum fix: detect the combination and return an error or warning:

if !wait {
    if svcConfigs[i].ProvisionCmd != "" && len(svcConfigs[i].Provisions) > 0 {
        return nil, fmt.Errorf("%s: wait: false is incompatible with declared models — "+
            "models cannot be pulled until the service is ready", dep.Name)
    }
    continue
}

Alternatively, document it in the wait: false section of the guide and reference docs.

Fix this →


Documentation

2. MOAT_OLLAMA_HOST and MOAT_OLLAMA_PORT missing from reference docs — minor

docs/content/reference/02-moat-yaml.md documents only MOAT_OLLAMA_URL in the Ollama env var table, but the code also injects MOAT_OLLAMA_HOST and MOAT_OLLAMA_PORT (from generateServiceEnv — every service gets _HOST and _PORT). Postgres, MySQL, and Redis all document their full variable sets. The Ollama table should list all three:

Variable Description Example
MOAT_OLLAMA_HOST Service hostname ollama
MOAT_OLLAMA_PORT Service port 11434
MOAT_OLLAMA_URL Base URL for the Ollama API http://ollama:11434

Fix this →


Minor nit

demo.sh truncates JSON mid-streamexamples/service-ollama/demo.sh:16

| head -c 500

head -c 500 cuts at a byte boundary, not a JSON boundary. The demo output will show broken JSON for any non-trivial response. Use | python3 -c "import sys,json; d=json.load(sys.stdin); print(d['response'])" or just drop the truncation — the response field value is what users actually want to see.


What's good

  • All round 1 and round 2 feedback addressed: scalar value detection, lock comment, todos/ removal, docs security model, readiness table, provisioning step in startup sequence.
  • provisionItem defer ordering correct: LIFO means LOCK_UN fires before cancel() — correct.
  • Empty models list handled: models: [] → no provisioning, no error.
  • serviceUsesPasswordPlaceholder: correctly handles Redis; Ollama correctly gets no password.
  • Shell injection defense solid: validProvisionItem regex blocks all metacharacters, validated before exec.
  • Tests comprehensive: injection attempts, valid names, scalar rejection, unknown key rejection, no-auth behavior, env generation, provision cmd building.
  • Docs are accurate (modulo issue Add Claude Code GitHub Workflow #2 above): startup sequence, readiness table, security model all updated correctly.

dpup added a commit that referenced this pull request Mar 17, 2026
- Error on wait: false with provisioned items (models can't be pulled
  until service is ready — silent failure before this fix)
- Add MOAT_OLLAMA_HOST and MOAT_OLLAMA_PORT to reference docs
- Remove head -c truncation from demo.sh (was breaking JSON output)
@claude
Copy link
Copy Markdown

claude Bot commented Mar 17, 2026

📦 Previous Review (superseded by newer review)

Claude finished @dpup's task in 4m 18s —— View job


Code Review

  • Gather context and understand PR scope
  • Review previous round feedback resolution
  • Review core provisioning logic (internal/run/services.go, internal/run/manager.go)
  • Review config changes (internal/config/config.go)
  • Review container changes (docker/apple service files)
  • Review registry and types changes
  • Review tests
  • Review docs and examples
  • Write up review summary

All three previous rounds of feedback have been addressed. The implementation is solid. A few remaining minor items below.


Still present from round 2 (low priority)

1. Provision output bypasses run logsinternal/run/manager.go:1982

if err := provisionService(ctx, svcMgr, info, svcConfigs[i], os.Stderr); err != nil {

Model pull progress writes directly to os.Stderr and is not captured in ~/.moat/runs/<id>/logs.jsonl. A user running moat logs <run-id> after a failed pull won't see the pull output. This was flagged as low priority in round 2 and is still present. Fine to merge as-is, but tracking the gap is worth it.


New findings

2. syscall.Flock is not context-awareinternal/run/services.go:299 — low severity

if err := syscall.Flock(int(lockFile.Fd()), syscall.LOCK_EX); err != nil {

If two moat run processes race for the lock and one is cancelled (Ctrl+C, timeout), Flock(LOCK_EX) blocks until the other process releases it — it does not respect Go context cancellation. For typical usage (model pulls that succeed quickly), this is invisible. But if a 30-minute pull is interrupted by the user while a second run is waiting, the second run will not exit cleanly until the first run's lock is released (which happens on process exit, so it's fine eventually). Not a correctness bug — just worth noting for anyone debugging a stuck moat run.

Fix this →

3. No unit test for wait: false + provisions error pathinternal/run/manager.go:1946–1960

The new guard that rejects wait: false when provisions are declared has no unit test. All the other validation in buildServiceConfig is tested in services_test.go, but this new path lives in manager.go and only exercises via integration or E2E. Given that wait: false is an edge case, this is low priority, but worth tracking.


Nit

4. Docker ProvisionService inspects exec before Docker finishes writing exit codeinternal/container/docker_service.go:101–110

_, _ = stdcopy.StdCopy(stdout, stdout, resp.Reader)
resp.Close()

execInspect, err := m.cli.ContainerExecInspect(ctx, execCreateResp.ID)

There's a well-known narrow race in the Docker SDK: after the exec's output stream closes (EOF), Docker may not have committed the exit code to its internal store yet. ContainerExecInspect called immediately after could return exit code 0 on a command that actually failed. In practice this is very rare, and it's the same pattern used in CheckReady — so this is pre-existing and acceptable. Not blocking, just worth being aware of if ollama pull failures are ever reported as passing silently.


What's good

  • All round 1–3 feedback addressed: scalar detection, lock comment, todos/ + compound-engineering.local.md removed, docs security model corrected, readiness table fixed, startup lifecycle updated, wait: false + provisions error added, MOAT_OLLAMA_HOST/MOAT_OLLAMA_PORT in reference docs, demo.sh truncation removed.
  • provisionItem defer ordering correct: LIFO means LOCK_UN fires before cancel() — correct sequencing.
  • Shell injection defense: validProvisionItem regex blocks all shell metacharacters; validated before any exec reaches the container.
  • No-auth path: Ollama correctly gets no password; serviceUsesPasswordPlaceholder correctly handles Redis.
  • wait: false guard: actionable error message with clear fix hint.
  • Empty models: list: no provisioning, no error.
  • Test coverage: injection attempts, scalar rejection, unknown key rejection, no-auth behavior, env generation — thorough.
  • Docs: startup sequence, readiness table, security model, and env var reference all accurate and current.
  • Example: moat.yaml + demo.sh are clean and self-contained.

This is ready to merge. The remaining items are all low-priority follow-ups, not blockers.

@claude
Copy link
Copy Markdown

claude Bot commented Mar 17, 2026

📦 Previous Review (superseded by newer review)

Claude finished @dpup's task in 3m 49s —— View job


Code Review

  • Gather context and understand PR scope
  • Review core provisioning logic (internal/run/services.go, internal/run/manager.go)
  • Review config changes (internal/config/config.go)
  • Review container changes (docker/apple service files)
  • Review registry and types changes
  • Review tests
  • Review docs and examples
  • Write up review summary

All previous round feedback has been addressed. The implementation is solid. One new real bug introduced in this round, plus a minor issue worth noting.


Bug

1. Docker exec retry loop silently returns success if exec is still Running after retriesinternal/container/docker_service.go:77–96 (CheckReady) and :121–140 (ProvisionService)

var exitCode int
for attempt := 0; attempt < 3; attempt++ {
    execInspect, err := m.cli.ContainerExecInspect(ctx, execCreateResp.ID)
    if err != nil {
        return fmt.Errorf(...)
    }
    if !execInspect.Running {
        exitCode = execInspect.ExitCode
        break
    }
    if attempt < 2 {
        time.Sleep(100 * time.Millisecond)
    }
}
if exitCode != 0 {
    return fmt.Errorf(...)
}
return nil

If Docker still reports the exec as Running after all three attempts, the loop exits without hitting the break. exitCode stays at its zero value. if exitCode != 0 is false. Both CheckReady and ProvisionService return nil — a false success.

For ProvisionService this means ollama pull could be silently treated as complete while the pull is still in progress, and the agent starts without the model available.

The fix is to track whether the break was actually hit:

var exitCode int
complete := false
for attempt := 0; attempt < 3; attempt++ {
    execInspect, err := m.cli.ContainerExecInspect(ctx, execCreateResp.ID)
    if err != nil {
        return fmt.Errorf(...)
    }
    if !execInspect.Running {
        exitCode = execInspect.ExitCode
        complete = true
        break
    }
    if attempt < 2 {
        time.Sleep(100 * time.Millisecond)
    }
}
if !complete {
    return fmt.Errorf("exec still running after retries")
}
if exitCode != 0 {
    return fmt.Errorf(...)
}

Fix this →


Minor

2. flockContext goroutine outlives context cancellationinternal/run/services.go:311–324

When ctx.Done() fires, flockContext returns ctx.Err() immediately, but the goroutine doing syscall.Flock(LOCK_EX) stays blocked. The channel is buffered (make(chan error, 1)) so the goroutine won't hang forever — it will eventually complete when the flock becomes available or when the fd is closed by defer lockFile.Close(). But between context cancellation and goroutine completion, there's a window where the goroutine acquires the lock without any corresponding LOCK_UN call. The lock gets released on lockFile.Close(), which is soon after, so this is benign in practice. Worth noting if debugging unexpected lock delays.


What's good

  • All round 1–4 feedback addressed: scalar detection, lock comment, todos/+compound-engineering.local.md removed, docs security model, readiness table, startup lifecycle, wait: false guard, MOAT_OLLAMA_HOST/PORT in reference, demo.sh cleaned up, flock goroutine now context-aware.
  • flockContext design: buffered channel means the goroutine can always drain without blocking; defer ordering in provisionItem is correct (LOCK_UN fires before cancel).
  • Shell injection defense: validProvisionItem regex correctly rejects all shell metacharacters; validation runs before any exec.
  • No-auth path: Ollama gets no password; Redis {password} placeholder detection is correct.
  • wait: false guard: clear error with actionable fix hint.
  • lw.Close() via defer in provisioning loop: safe — deferred to function exit, all writers get closed.
  • Config scalar detection: nil signal from UnmarshalYAML is explicitly checked in buildServiceConfig with a useful error.
  • Test coverage: injection attempts, scalar rejection, unknown key rejection, no-auth behavior, env generation, provision cmd building — comprehensive.
  • Docs: accurate and in sync with implementation.

The Docker retry false-success is the only remaining real bug. Once fixed, this is ready to merge.

@dpup
Copy link
Copy Markdown
Collaborator Author

dpup commented Mar 17, 2026

Fixed the Docker exec retry false-success bug in both CheckReady and ProvisionService. Added a complete bool that's only set when the break fires — if the loop exhausts all retries with the exec still Running, we now return an error instead of silently succeeding.

The two low-priority items (provision output bypassing run logs, and the goroutine outliving context cancellation on lock contention) are noted — tracking separately rather than blocking merge.

@claude
Copy link
Copy Markdown

claude Bot commented Mar 17, 2026

📦 Previous Review (superseded by newer review)

Claude finished @dpup's task in 2m 12s —— View job


Claude finished @dpup's task ——


Code Review

  • Gather context and understand PR scope
  • Review the Docker exec retry fix (internal/container/docker_service.go)
  • Review core provisioning logic (internal/run/services.go, internal/run/manager.go)
  • Review config changes and tests
  • Review docs and examples
  • Write up review summary

The Docker exec retry fix is correct. All five rounds of previous feedback have been addressed. The implementation is solid and ready to merge.

The fix (round 5)

docker_service.go:72–90 (CheckReady) and :124–145 (ProvisionService):

complete := false
for attempt := 0; attempt < 3; attempt++ {
    execInspect, err := m.cli.ContainerExecInspect(ctx, execCreateResp.ID)
    ...
    if !execInspect.Running {
        exitCode = execInspect.ExitCode
        complete = true
        break
    }
    ...
}
if !complete {
    return fmt.Errorf("... exec still running after retries")
}

The complete bool is set only inside the break path. If the loop exhausts all retries with the exec still Running, !complete catches it and returns an error. Previously exitCode stayed at zero, silently returning success.


Minor observations (non-blocking)

defer lw.Close() inside service loopmanager.go:1998–2001

for i, dep := range serviceDeps {
    ...
    if lw, lwErr := store.LogWriter(); lwErr == nil {
        provOut = io.MultiWriter(os.Stderr, lw)
        defer lw.Close()
    }

defer applies to function exit, not loop iteration. If multiple services ever have provisions, each iteration adds another deferred close. Currently only Ollama has provisions so this never fires more than once in practice. Worth knowing if new provision-capable services are added.

demo.sh outputs raw JSONexamples/service-ollama/demo.sh:13–15

The head -c 500 truncation was correctly removed, but the curl /api/generate output is now a raw JSON blob with the model response embedded. Cosmetic — doesn't affect functionality.


What's good

  • Docker retry fix: correct on both CheckReady and ProvisionService paths — no false-success possible.
  • All previous feedback addressed: scalar detection, lock comment, todos/ + compound-engineering.local.md removed, docs security model, readiness table, startup lifecycle, wait: false guard, MOAT_OLLAMA_HOST/PORT in reference, flockContext goroutine, demo.sh truncation.
  • flockContext: buffered channel prevents goroutine leak; defer ordering in provisionItem is LIFO-correct (LOCK_UN before cancel).
  • Shell injection defense: validProvisionItem regex blocks all metacharacters; validated before any exec.
  • No-auth path: Ollama gets no password; serviceUsesPasswordPlaceholder correctly handles Redis.
  • Cache directory creation: MkdirAll with 0o700 at manager.go:1912 runs before StartService — bind mount is safe.
  • Test coverage: injection attempts, scalar rejection, unknown key rejection, no-auth behavior, env generation, provision cmd building — comprehensive.
  • Docs: startup sequence, readiness table, security model, and env var reference all accurate.

This is ready to merge.

@dpup
Copy link
Copy Markdown
Collaborator Author

dpup commented Mar 17, 2026

Both low-priority items addressed:

  1. Provision output — already captured: the io.MultiWriter(os.Stderr, lw) tee is in place at manager.go:1997–2000.

  2. flockContext goroutine — no code change needed. When context cancellation fires, the goroutine stays blocked on Flock until the caller's deferred lockFile.Close() closes the fd, which causes Flock to return EBADF and the goroutine to terminate. The buffered channel prevents any blocking on send. Added a comment to flockContext documenting this so the behavior is obvious to future readers.

@claude
Copy link
Copy Markdown

claude Bot commented Mar 17, 2026

📦 Previous Review (superseded by newer review)

Claude finished @dpup's task in 2m 9s —— View job


Code Review

  • Gather context and understand PR scope
  • Review Docker exec retry fix (internal/container/docker_service.go)
  • Review flockContext comment (internal/run/services.go)
  • Review provision output capture (internal/run/manager.go)
  • Review remaining changed files
  • Write up review summary

All previous round feedback has been addressed. The implementation is solid. This is ready to merge.


The fixes (this round)

Docker exec retry false-success (docker_service.go:72–90, 124–145) — correctly fixed with complete bool on both CheckReady and ProvisionService. The complete flag is only set inside the break path; exhausting all 3 retries with exec still Running now returns an error rather than silently returning nil.

flockContext goroutine lifecycle (services.go:308–315) — the comment accurately documents the shutdown path: context cancellation returns immediately, the blocked goroutine eventually wakes when lockFile.Close() closes the fd (EBADF), the buffered channel prevents any goroutine leak. Correct.

Provision output capture (manager.go:1997–2000) — io.MultiWriter(os.Stderr, lw) correctly tees pull output to both the terminal and ~/.moat/runs/<id>/logs.jsonl. The concern from round 2 was already handled.


Minor observations (non-blocking, tracked for awareness)

CheckReady uses io.Copy instead of stdcopy.StdCopy (docker_service.go:67)

ProvisionService correctly uses stdcopy.StdCopy to demultiplex Docker's stdout/stderr stream. CheckReady uses io.Copy(io.Discard, ...) — functionally fine since we discard the output, but it reads the raw multiplexed wire format. If anyone ever changes CheckReady to capture output for debugging, they'll get garbled bytes. No action needed now.

defer lw.Close() accumulates across service loop iterations (manager.go:2000)

for i, dep := range serviceDeps {
    if lw, lwErr := store.LogWriter(); lwErr == nil {
        provOut = io.MultiWriter(os.Stderr, lw)
        defer lw.Close()   // defers to function exit, not loop iteration
    }

Each provision-capable service in the loop adds another deferred close. Currently only Ollama has provisions so this never fires more than once. Fine to leave as-is, but worth noting if more provision-capable services are added.

demo.sh outputs raw JSON (examples/service-ollama/demo.sh:13–15)

curl /api/generate with stream: false returns a single JSON blob. The demo output will be one line of dense JSON rather than the readable model response. Cosmetic only — doesn't affect functionality or correctness.


What's good

  • All round 1–5 feedback resolved: scalar detection, lock comment, todos/ + compound-engineering.local.md removed, docs security model, readiness table, startup lifecycle, wait: false guard, MOAT_OLLAMA_HOST/PORT in reference, flockContext goroutine docs, provision output tee, Docker retry false-success.
  • Shell injection defense: validProvisionItem regex blocks all metacharacters; validated before any exec.
  • No-auth path: Ollama gets no password; serviceUsesPasswordPlaceholder correctly handles Redis.
  • wait: false guard: clear, actionable error message.
  • Cache directory: MkdirAll with 0o700 runs before StartService; bind mount is safe.
  • LIFO defer ordering: LOCK_UN fires before cancel() in provisionItem — correct.
  • Empty models: list: no provisioning, no error.
  • Test coverage: injection attempts, scalar rejection, unknown key rejection, no-auth behavior, env generation, provision cmd building — comprehensive.
  • Docs: startup sequence, readiness table, security model, env var reference all accurate.

@claude
Copy link
Copy Markdown

claude Bot commented Mar 17, 2026

📦 Previous Review (superseded by newer review)

Claude finished @dpup's task in 3m 2s —— View job


Code Review

  • Gather context and understand PR scope
  • Review core provisioning logic (internal/run/services.go, internal/run/manager.go)
  • Review config changes (internal/config/config.go)
  • Review container changes (docker/apple service files)
  • Review registry and types changes
  • Review tests
  • Review docs and examples
  • Write up review summary

All six prior rounds of feedback are addressed. The two low-priority items from the last round are now fixed:

  • Log writer lifecycle (manager.go:1997–2007): The IIFE correctly scopes defer lw.Close() to each provisioning call rather than accumulating until function exit. Clean.
  • flockContext goroutine shutdown (services.go:308–315): The added comment accurately documents the shutdown path — context cancellation returns early, the goroutine eventually unblocks via EBADF when lockFile.Close() fires, the buffered channel prevents any goroutine leak. Correct.

Bug

demo.sh uses python3 but it's not in the dependenciesexamples/service-ollama/demo.sh:15

| python3 -c "import sys,json; print(json.load(sys.stdin)['response'])"

The agent container image is resolved from declared dependencies. With only ollama@0.9 and curl in dependencies:, the base image is debian:bookworm-slim, which does not include Python. The demo will fail at the JSON parsing step.

Fix: add jq to dependencies and use jq -r .response instead — jq is already defined in the registry.

# moat.yaml
dependencies:
  - ollama@0.9
  - curl
  - jq
# demo.sh
  | jq -r .response

Fix this →


What's good

  • All prior feedback resolved: scalar detection, lock comment, todos/ and compound-engineering.local.md removed, docs security model, readiness table, startup lifecycle, wait: false guard, MOAT_OLLAMA_HOST/PORT in reference, Docker retry false-success, IIFE log writer, flockContext docs.
  • Shell injection defense: validProvisionItem regex blocks all metacharacters; validation runs before any exec reaches the container.
  • No-auth path: Ollama gets no password; serviceUsesPasswordPlaceholder correctly handles Redis.
  • LIFO defer ordering: LOCK_UN fires before cancel() in provisionItem.
  • Cache directory: MkdirAll with 0o700 before StartService; bind mount is always safe.
  • Test coverage: injection attempts, scalar rejection, unknown key rejection, no-auth behavior, env generation, provision cmd building — comprehensive.
  • Docs: startup sequence, readiness table, security model, env var reference all accurate.

Fix the demo.sh dependency and this is ready to merge.

Comment thread examples/service-ollama/demo.sh Outdated
echo "--- Generating response ---"
curl -s "$MOAT_OLLAMA_URL/api/generate" \
-d '{"model":"qwen2.5-coder:1.5b","prompt":"Write hello world in Go","stream":false}' \
| python3 -c "import sys,json; print(json.load(sys.stdin)['response'])"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The base image for this example is debian:bookworm-slim (no Python runtime in dependencies:), so python3 won't be available. The demo will fail here.

Fix: add jq to moat.yaml dependencies and use:

Suggested change
| python3 -c "import sys,json; print(json.load(sys.stdin)['response'])"
| jq -r .response

Then in moat.yaml:

dependencies:
  - ollama@0.9
  - curl
  - jq

@claude
Copy link
Copy Markdown

claude Bot commented Mar 17, 2026

📦 Previous Review (superseded by newer review)

Claude finished @dpup's task in 2m 7s —— View job


Claude finished @dpup's task ——


Code Review

  • Gather context and read changed files
  • Fix outstanding demo.sh python3/jq issue
  • Verify no new issues from latest commits
  • Write up review summary

The latest commit (36ee2ae) correctly updates Ollama version tags to match actual Docker Hub tags — the previous 0.9-style versions don't exist, causing image pull 404s. Good catch and clean fix.

One outstanding issue from the previous round remained unfixed:

Fixed: demo.sh used python3 without it in dependencies

examples/service-ollama/demo.sh:15 used python3 to parse the JSON response, but the base image for this example is debian:bookworm-slim (no Python runtime in dependencies:), so the demo would fail at that step. Fixed directly: added jq to moat.yaml dependencies and replaced the Python pipe with jq -r .response.


Everything else looks good

All prior review rounds have been addressed:

  • Scalar models: value detection and error
  • Lock comment accuracy
  • todos/ and compound-engineering.local.md removed
  • Docs security model, readiness table, startup lifecycle updated
  • wait: false + provisions guard with actionable error
  • MOAT_OLLAMA_HOST/MOAT_OLLAMA_PORT in reference docs
  • Docker exec retry false-success fixed with complete bool
  • flockContext goroutine shutdown documented
  • IIFE scoping for lw.Close() in provisioning loop
  • Provision output teed to logs.jsonl via io.MultiWriter

This is ready to merge.

@claude
Copy link
Copy Markdown

claude Bot commented Mar 17, 2026

📦 Previous Review (superseded by newer review)

Claude finished @dpup's task in 2m 26s —— View job


Code Review

  • Gather context and understand PR scope
  • Review core provisioning logic (internal/run/services.go, internal/run/manager.go)
  • Review config changes (internal/config/config.go)
  • Review container changes (docker_service.go, apple_service.go)
  • Review registry and types changes
  • Review tests
  • Review docs and examples
  • Write up review summary

All previous round feedback has been resolved. One real doc bug and one cosmetic issue remain.


Bug

ollama@0.9 is not a valid version — will fail validationdocs/content/guides/08-services.md:203 and docs/content/reference/02-moat-yaml.md:269

Both docs show:

- ollama@0.9

But the registry only allows ["0.16.3", "0.17.7", "0.18.0", "0.18.1"]. Moat validates against the versions: list and returns an error:

invalid version "0.9" for ollama

  Available versions: 0.16.3, 0.17.7, 0.18.0, 0.18.1
  Or omit the version to use the default (0.18.1)

Any user copying the doc examples will hit this error immediately. The working example in examples/service-ollama/moat.yaml already correctly uses ollama@0.18.1 — the docs just need to match.

Fix this →


Nit

api/tags response in demo.sh is raw JSONexamples/service-ollama/demo.sh:9

curl -s "$MOAT_OLLAMA_URL/api/tags"

jq is already declared as a dependency, so the output could be formatted with | jq . for consistency with the generate step. Not functional — the demo works — but the model list will be one line of dense JSON.


What's good

  • All prior round feedback addressed: scalar detection, lock comment, todos/ + compound-engineering.local.md removed, docs security model, readiness table, startup lifecycle, wait: false guard, MOAT_OLLAMA_HOST/PORT in reference, Docker retry false-success (with complete bool), flockContext goroutine docs, IIFE log writer scoping, provision output tee, demo.sh python3 → jq, status messages for wait/pull.
  • flockContext shutdown: well-documented — context cancellation returns early, goroutine terminates on EBADF when lockFile.Close() fires, buffered channel prevents leak.
  • Injection defense: validProvisionItem regex (^[a-zA-Z0-9][a-zA-Z0-9._:/@-]*$) correctly rejects all shell metacharacters; validated before any exec.
  • No-auth path: Ollama gets no password; serviceUsesPasswordPlaceholder correctly handles Redis.
  • LIFO defer ordering in provisionItem: LOCK_UN fires before cancel() — correct.
  • Cache directory: MkdirAll with 0o700 before StartService; bind mount is always safe.
  • Test coverage: injection attempts, scalar rejection, unknown key rejection, no-auth behavior, env generation, provision cmd building — comprehensive.
  • Status messages: Waiting for %s to be ready... and Pulling N item(s) for %s: ... give clear user feedback during the potentially-long startup.

Fix the version string in docs and this is ready to merge.

@claude
Copy link
Copy Markdown

claude Bot commented Mar 17, 2026

Claude finished @dpup's task in 3m 33s —— View job


📦 Previous Review (superseded by newer review)

Code Review

  • Gather context and understand PR scope
  • Review core provisioning logic (internal/run/services.go, internal/run/manager.go)
  • Review config changes (internal/config/config.go)
  • Review container changes (docker_service.go, apple_service.go)
  • Review registry and types changes
  • Review tests
  • Review docs and examples
  • Write up review summary

All eight previous rounds of feedback have been addressed. The implementation is solid. One doc bug remains unfixed from the previous round, plus a cosmetic nit.


Bug

ollama@0.9 still in docs — will fail validationdocs/content/guides/08-services.md:203 and docs/content/reference/02-moat-yaml.md:269

Both docs still show:

- ollama@0.9

The registry only permits ["0.16.3", "0.17.7", "0.18.0", "0.18.1"]. Any user copying these examples will immediately hit:

invalid version "0.9" for ollama
  Available versions: 0.16.3, 0.17.7, 0.18.0, 0.18.1

This was flagged in the previous round. Change ollama@0.9 to ollama@0.18.1 (or just ollama) in both files. Fix this →


Nit

api/tags output in demo.sh is raw JSONexamples/service-ollama/demo.sh:9

curl -s "$MOAT_OLLAMA_URL/api/tags"

jq is already in dependencies:. This could be | jq . for consistent formatting with the generate step below it. Purely cosmetic.


What's good

  • All prior rounds (1–8) resolved: scalar detection, lock comment, todos/ + compound-engineering.local.md removed, docs security model, readiness table, startup lifecycle, wait: false guard, MOAT_OLLAMA_HOST/PORT in reference, Docker retry false-success (with complete bool), flockContext goroutine shutdown documented, IIFE log writer scoping, provision output tee, demo.sh python3 → jq, status messages for wait/pull.
  • Shell injection defense: validProvisionItem regex (^[a-zA-Z0-9][a-zA-Z0-9._:/@-]*$) blocks all metacharacters; validated before any exec reaches the container.
  • Docker retry fix: complete bool correctly prevents false-success on both CheckReady and ProvisionService when the exec is still Running after all retries.
  • flockContext goroutine shutdown: well-documented — context cancellation returns early, goroutine terminates on EBADF when lockFile.Close() fires, buffered channel prevents leak.
  • No-auth path: Ollama correctly gets no password; serviceUsesPasswordPlaceholder correctly handles Redis.
  • wait: false guard: actionable error message with clear fix hint.
  • LIFO defer ordering in provisionItem: LOCK_UN fires before cancel() — correct.
  • IIFE scoping for lw.Close() in provisioning loop: deferred close fires per provisioning call, not at function exit.
  • Provision output tee: io.MultiWriter(os.Stderr, lw) correctly captures pull progress in logs.jsonl.
  • Test coverage: injection attempts, scalar rejection, unknown key rejection, no-auth behavior, env generation, provision cmd building, wait:false guard conditions — comprehensive.
  • Docs accurate: startup sequence, readiness table, security model, env var reference all in sync with implementation.

Fix the ollama@0.9 version string and this is ready to merge.

@claude
Copy link
Copy Markdown

claude Bot commented Mar 17, 2026

📦 Previous Review (superseded by newer review)

Claude finished @dpup's task in 3m 9s —— View job


Code Review

  • Gather context and understand PR scope
  • Review core provisioning logic (internal/run/services.go, internal/run/manager.go)
  • Review config changes (internal/config/config.go)
  • Review container changes (docker_service.go, apple_service.go)
  • Review registry and types changes
  • Review tests
  • Review docs and examples
  • Write up review summary

All previous round feedback (rounds 1–8) has been addressed. One remaining doc bug is still present.


Bug

ollama@0.9 still in two doc files — will fail immediately for any user who copies the example

Both files still show the invalid version 0.9 (valid versions are 0.16.3, 0.17.7, 0.18.0, 0.18.1):

  • docs/content/guides/08-services.md:202: - ollama@0.9 (Ollama section example)
  • docs/content/reference/02-moat-yaml.md:269: | ollama@0.9 | Ollama | 11434 | (service dependencies table)

I committed a fix locally (18f5a80) changing both to ollama@0.18.1, but couldn't push due to remote divergence. The change is trivial — in each file, replace ollama@0.9 with ollama@0.18.1.

Fix this →


Nit

/api/tags output in demo.sh is raw JSONexamples/service-ollama/demo.sh:9

curl -s "$MOAT_OLLAMA_URL/api/tags"

jq is already a declared dependency, so | jq . would give readable output for the model list display. Cosmetic only — doesn't affect correctness.


What's good

Everything from all prior rounds is resolved:

  • Scalar models: detection: nil signal from UnmarshalYAML caught with a clear error
  • Lock comment accuracy: correctly describes per-lock-file behavior (not per-model)
  • todos/ + compound-engineering.local.md: removed
  • Docs security model, readiness table, startup lifecycle: all accurate
  • wait: false + provisions guard: clear error with actionable fix hint
  • MOAT_OLLAMA_HOST/PORT in reference docs: present alongside MOAT_OLLAMA_URL
  • Docker exec retry false-success: fixed with complete bool on both CheckReady and ProvisionService
  • flockContext goroutine lifetime: documented; buffered channel prevents leak; EBADF on fd close terminates the goroutine
  • IIFE for lw.Close(): correctly scoped per provisioning call, not accumulated to function exit
  • Provision output: teed to logs.jsonl via io.MultiWriter(os.Stderr, lw)
  • demo.sh: uses jq (no python3), qwen2.5-coder:0.5b for constrained environments, error surfacing via // (.error | "Error: " + .)
  • Shell injection defense: validProvisionItem regex blocks all metacharacters; validated before any exec
  • Test coverage: injection attempts, scalar rejection, unknown key rejection, no-auth behavior, env generation, provision cmd building — comprehensive

Fix the ollama@0.9 version strings and this is ready to merge.

@claude
Copy link
Copy Markdown

claude Bot commented Mar 17, 2026

📦 Previous Review (superseded by newer review)

Claude finished @dpup's task in 3m 14s —— View job


Code Review

  • Gather context and understand PR scope
  • Review core provisioning logic (internal/run/services.go, internal/run/manager.go)
  • Review config changes (internal/config/config.go)
  • Review container changes (docker_service.go, apple_service.go)
  • Review registry and types changes
  • Review tests
  • Review docs and examples
  • Fix remaining issues and write summary

The implementation is solid and all substantive feedback from prior rounds has been addressed. One doc bug persisted through multiple review iterations — fixed directly.

Fixed (committed)

ollama@0.9 in both doc filesdocs/content/guides/08-services.md:203 and docs/content/reference/02-moat-yaml.md:269 still referenced ollama@0.9 despite this being flagged in the previous two rounds. The registry only allows ["0.16.3", "0.17.7", "0.18.0", "0.18.1"], so any user copying those examples would immediately hit:

invalid version "0.9" for ollama
  Available versions: 0.16.3, 0.17.7, 0.18.0, 0.18.1

Updated to ollama@0.18.1 in both files. Also piped api/tags through jq . in demo.sh for consistent formatting (jq is already a declared dependency).

What's good

  • All round 1–9 feedback resolved: scalar detection, lock comment, todos/ + compound-engineering.local.md removed, docs security model, readiness table, startup lifecycle, wait: false guard, MOAT_OLLAMA_HOST/PORT in reference, Docker retry false-success (complete bool), flockContext goroutine shutdown documented, IIFE log writer scoping, provision output tee, python3 → jq in demo, version tags.
  • Shell injection defense: validProvisionItem regex (^[a-zA-Z0-9][a-zA-Z0-9._:/@-]*$) blocks all metacharacters; validated before any exec reaches the container.
  • No-auth path: Ollama correctly gets no password; serviceUsesPasswordPlaceholder correctly handles Redis.
  • wait: false guard: clear, actionable error message with explicit fix hint.
  • flockContext goroutine shutdown: well-documented — context cancellation returns early, goroutine terminates on EBADF when lockFile.Close() fires, buffered channel prevents leak.
  • Docker exec retry: complete bool correctly prevents false-success on both CheckReady and ProvisionService when exec is still Running after all retries.
  • LIFO defer ordering in provisionItem: LOCK_UN fires before cancel() — correct.
  • IIFE scoping for lw.Close() in provisioning loop: deferred close fires per-provisioning-call, not at function exit.
  • Test coverage: injection attempts, scalar rejection, unknown key rejection, no-auth behavior, env generation, provision cmd building, wait: false guard — comprehensive.
  • Docs accurate: startup sequence, readiness table, security model, and env var reference all in sync with implementation.

This is ready to merge.

@claude
Copy link
Copy Markdown

claude Bot commented Mar 17, 2026

📦 Previous Review (superseded by newer review)

Claude finished @dpup's task in 3m 34s —— View job


Code Review

  • Gather context and read changed files
  • Review core provisioning logic (internal/run/services.go, internal/run/manager.go)
  • Review config changes (internal/config/config.go)
  • Review container changes (docker_service.go, apple_service.go)
  • Review registry and types changes
  • Review tests
  • Review docs and examples
  • Fix bug and write up review summary

All prior review rounds are resolved. Found and fixed one new bug.

Bug fixed (committed)

memory: leaked into Extra, breaking memory + models togetherinternal/config/config.go:175

The known map in ServiceSpec.UnmarshalYAML was missing "memory":

// Before
known := map[string]bool{"env": true, "image": true, "wait": true}
// After
known := map[string]bool{"env": true, "image": true, "wait": true, "memory": true}

memory: is a struct field decoded by value.Decode((*plain)(s)), but the scan loop also saw it and — since it's a scalar, not a sequence — added Extra["memory"] = nil. The provisioning validation then rejected it: services.ollama.memory is not a valid key (did you mean "models"?).

This broke examples/service-ollama/moat.yaml exactly as written (it uses both memory: 2048 and models:). No test caught it because TestBuildServiceConfigMemory constructs ServiceSpec directly rather than going through YAML parsing. Added TestServiceSpecUnmarshalMemoryNotLeakedToExtra to cover this path.


Everything else looks good

  • All prior round feedback (1–9): scalar detection, lock comment, todos/ + compound-engineering.local.md removed, docs security model, readiness table, startup lifecycle, wait: false guard, MOAT_OLLAMA_HOST/PORT in reference, Docker retry false-success (complete bool), flockContext goroutine shutdown documented, IIFE log writer scoping, provision output tee, demo.sh python3 → jq → proper formatting, Ollama version strings.
  • Shell injection defense: validProvisionItem regex (^[a-zA-Z0-9][a-zA-Z0-9._:/@-]*$) blocks all metacharacters; validated before any exec.
  • No-auth path: Ollama correctly gets no password; serviceUsesPasswordPlaceholder correctly handles Redis.
  • Docker exec retry: complete bool correctly prevents false-success on both CheckReady and ProvisionService.
  • flockContext: well-documented — context cancellation returns early, goroutine terminates on EBADF when lockFile.Close() fires, buffered channel prevents leak.
  • LIFO defer ordering: LOCK_UN fires before cancel() in provisionItem — correct.
  • IIFE scoping: lw.Close() defers per-provisioning-call, not at function exit.
  • Test coverage: injection attempts, scalar rejection, unknown key rejection, no-auth behavior, env generation, provision cmd building — comprehensive.
  • Docs: startup sequence, readiness table, security model, env var reference all accurate and in sync.

Ready to merge after this fix.

@dpup
Copy link
Copy Markdown
Collaborator Author

dpup commented Mar 17, 2026

The missing "memory" in the known map was the right call — TestBuildServiceConfigMemory constructing ServiceSpec directly meant we never exercised the YAML path. The added TestServiceSpecUnmarshalMemoryNotLeakedToExtra closes that gap. Nothing else to address.

dpup added 9 commits March 17, 2026 15:45
Adds design spec for Ollama as a service dependency with generalized
provisions/cache extensions, and a detailed implementation plan with
10 tasks covering registry, types, config, service manager, and docs.
Add CachePath, ProvisionsKey, and ProvisionCmd fields to ServiceDef struct
to support services that need host-side caching and post-start provisioning.
Add ollama as a service dependency using these new fields.
… cache mount support

Adds ProvisionService method to ServiceManager interface for executing
provisioning commands inside service containers (e.g., pulling Ollama models).
Also adds CachePath/CacheHostPath mount support to both Docker and Apple
service managers so model/data caches can be persisted on the host.
…ceConfig

- Guard password generation: only generate for services that have auth
  (password_env set, or extra_cmd/readiness_cmd use {password} placeholder)
- Wire provisions from userSpec.Extra using registry's provisions_key
- Resolve cache host path to ~/.moat/cache/<service-name>
- Validate unknown extra keys: reject keys that don't match provisions_key
dpup and others added 18 commits March 17, 2026 15:45
P1 fixes:
- Validate provision items against regex to prevent shell injection
  via model names interpolated into sh -c commands
- Fix error messages referencing nonexistent --service flag on moat logs

P2 fixes:
- Tighten cache directory permissions from 0755 to 0700
- Tighten lock file permissions from 0644 to 0600
- Change provision timeout from per-batch to per-item (30min each)
- Narrow flock scope to per-item so parallel runs don't serialize

P3 fixes:
- Consolidate duplicated extra_env loop in buildServiceConfig
- Capture unknown scalar keys in ServiceSpec YAML for validation
- Add versions list and quote image in ollama registry entry
- Validate scalar provision keys (e.g., `models: value` instead of list)
  with clear "must be a list" error message
- Fix misleading flock comment — parallel runs still serialize on
  shared lock file; extracted provisionItem helper with defer cancel()
- Remove compound-engineering.local.md (local tooling, not for repo)
- Remove todos/ directory (tracked as GitHub issues instead)
- Fix readiness table: Ollama check is just "ollama list", model
  pulling is a separate provisioning step
- Add provisioning step (step 6) to startup sequence docs
- Fix security model docs: not all services require passwords (Ollama
  has no auth, relies on network isolation)
- Remove dead todo/005 reference from provisionItem comment
- Error on wait: false with provisioned items (models can't be pulled
  until service is ready — silent failure before this fix)
- Add MOAT_OLLAMA_HOST and MOAT_OLLAMA_PORT to reference docs
- Remove head -c truncation from demo.sh (was breaking JSON output)
- Context-aware flock: wrap syscall.Flock in a goroutine with select on
  ctx.Done() so provision lock acquisition respects context cancellation
- Provision output to run logs: create RunStore before service provisioning
  and tee output to both stderr and the run's logs.jsonl via io.MultiWriter
- Docker exec inspect race: retry ContainerExecInspect up to 3 times with
  100ms backoff when the exec is still marked running after stream EOF
- Add test validating wait:false + provisions preconditions
…pect retries

Both CheckReady and ProvisionService had a bug where if ContainerExecInspect
reported the exec as still Running after all 3 retry attempts, the loop would
exit without setting the complete flag, exitCode would stay 0, and the function
would return nil — a false success.

For ProvisionService this meant ollama pull could be silently treated as
complete while the pull was still in progress.
Using defer lw.Close() inside a loop defers all closes until the outer
function returns. Wrap the provision block in an IIFE so defer fires
immediately after provisionService returns for each service.

Also extract the response field from demo.sh JSON output.
Ollama publishes full semver tags (0.18.1, 0.18.0, ...), not short
major.minor tags (0.9). The previous versions 0.6-0.9 do not exist
on Docker Hub, causing a 404 on image pull.

Update default to 0.18.1 and set versions list to recent releases.
Update tests and example moat.yaml accordingly.
python3 is not available in debian:bookworm-slim (the base image when
no Python runtime is declared). Use jq -r .response instead, which is
available as an apt dependency.
Add a 'memory' field to ServiceSpec so users can configure the memory
limit for service sidecar containers (e.g., Ollama) independently of
the main agent container's container.memory setting.

Wire through ServiceConfig.MemoryMB -> SidecarConfig.MemoryMB, applied
as Docker HostConfig.Resources.Memory and Apple --memory flag.

Update the Ollama example to use memory: 2048 with qwen2.5-coder:1.5b.
The known map used to skip processing of parsed struct fields was missing
"memory", causing memory: values to leak into Extra as nil. This caused
buildServiceConfig to reject memory: as an unknown key when combined with
provision-capable services (e.g. ollama). The example moat.yaml, which
uses both memory: 2048 and models: [...], would fail at runtime.

Also adds a test that exercises YAML with both memory: and models: to
prevent regression.
@dpup dpup force-pushed the feat/ollama-sidecar branch from 229fead to db2d1f1 Compare March 17, 2026 22:45
@claude
Copy link
Copy Markdown

claude Bot commented Mar 17, 2026

Claude finished @dpup's task in 3m 49s —— View job


Code Review

  • Gather context and read changed files
  • Review core provisioning logic (internal/run/services.go, internal/run/manager.go)
  • Review config changes (internal/config/config.go)
  • Review container changes (docker_service.go, apple_service.go)
  • Review registry and types changes
  • Review tests
  • Review docs and examples
  • Write up review summary

All ten previous rounds of review are addressed. I went through the current state of every key file looking for anything missed. The implementation is solid.

Final findings

1. No lock acquired on context cancellation — stale LOCK_UN deferred — low severity, cosmetic

In provisionItem (services.go:304–308):

if err := flockContext(cmdCtx, lockFile); err != nil {
    return err
}
defer func() { _ = syscall.Flock(int(lockFile.Fd()), syscall.LOCK_UN) }()

When flockContext returns ctx.Err() (context cancelled before lock was acquired), the deferred LOCK_UN still fires during unwind. On Linux, calling LOCK_UN on an fd that this process doesn't hold a lock on is harmless (it's a no-op). So this is not a correctness bug — just a slightly misleading defer registered on a path where no lock was obtained. The comment in flockContext documents the shutdown path thoroughly and the behavior is correct. Not worth changing.

2. TestBuildServiceConfigOllamaProvisionsIncompatibleWithWaitFalse name is slightly misleading — nit

The test validates the conditions the wait: false guard checks, but doesn't exercise the guard itself (which lives in manager.go). The test name implies end-to-end verification. The test body is correct and useful — just the name might mislead a reader into thinking this is a complete coverage of that error path. Not blocking.

3. flockContext goroutine can acquire lock after caller returns — documented, non-blocking

When context cancels mid-flockContext, there's a race: the goroutine could acquire the lock after the caller returns with an error. In that case, nobody reads from the done channel, but it's buffered so no leak. The lock is then released when lockFile.Close() fires from provisionService's defer. The comment in flockContext accurately documents this. Correct behavior, well documented.


What's good

  • All prior rounds (1–9) resolved: scalar detection, lock comment, todos/ + compound-engineering.local.md removed, docs security model, readiness table, startup lifecycle, wait: false guard, MOAT_OLLAMA_HOST/PORT in reference, Docker retry false-success (complete bool), flockContext goroutine shutdown documented, IIFE log writer scoping, provision output tee, demo.sh jq + version tags, memory field in known map + TestServiceSpecUnmarshalMemoryNotLeakedToExtra.
  • Shell injection defense: validProvisionItem regex (^[a-zA-Z0-9][a-zA-Z0-9._:/@-]*$) correctly blocks all shell metacharacters; validated before any exec reaches a container.
  • Docker exec retry: complete bool correctly prevents false-success on both CheckReady and ProvisionService when exec is still Running after all retries.
  • No-auth path: Ollama gets no password; serviceUsesPasswordPlaceholder correctly handles Redis.
  • wait: false guard: clear, actionable error message.
  • LIFO defer ordering in provisionItem: LOCK_UN fires before cancel() — correct.
  • IIFE scoping for lw.Close() in provisioning loop: deferred close fires per provisioning call, not at function exit.
  • Generic design: provisions_key / provision_cmd / cache_path on ServiceDef are cleanly reusable for any future service needing post-start provisioning.
  • Test coverage: injection attempts, scalar rejection, unknown key rejection, no-auth behavior, env generation, provision cmd building, wait: false guard conditions — comprehensive.
  • Docs: startup sequence, readiness table, security model, and env var reference all accurate and in sync with implementation.
  • memory: 2048 + qwen2.5-coder:1.5b: the model bump back from 0.5b makes sense given the explicit 2GB memory limit — both are consistent.

This is ready to merge.

@dpup dpup merged commit 30a03ca into main Mar 17, 2026
4 checks passed
@dpup dpup deleted the feat/ollama-sidecar branch March 17, 2026 22:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant