Promote dev to main: fix(router): dangling cascade-image cleanup (#1244)
Merged
zbigniewsobiecki merged 1 commit into `main` from `dev` on May 1, 2026
Conversation
…k) (#1243)

`commitContainerToSnapshot` re-tags `cascade-snapshot-<proj>-<workitem>:latest` on every run of the same work item. Each re-commit re-points the tag to a new digest; the previous digest becomes dangling (untagged) and falls out of the in-memory snapshot registry, so the registry-driven `runSnapshotCleanup` never sees it again. Production was measured at 102 GB reclaimable across ~136 dangling images on 2026-05-01 (50% disk used and climbing).

Add a periodic dangling-image cleanup loop (`src/router/dangling-image-cleanup.ts`) that mirrors the existing `orphan-cleanup.ts` lifecycle pattern. It runs on a 30-min interval (slower than the 5-min snapshot loop because dangling accumulation is gradual and a `force: false` rmi is cheap) and is wired into `startWorkerProcessor` / `stopWorkerProcessor` next to the existing snapshot-cleanup hooks.

Safety scope is the load-bearing invariant: the scan filter is `dangling=true AND label=cascade.managed=true`, AND-ed by Docker's filter API. The label clause is the only thing protecting unrelated host workloads (ucho-dev/prod, MySQL, Loki, etc.) from being reaped, and it is pinned by an explicit regression test.

Per-image errors mirror `removeSnapshotImage`: 409 (in use) and 404 (already gone) are silently swallowed; any other error is logged at warn and Sentry-captured under tag `dangling_image_remove`, and the loop continues. A `listImages` failure is logged at error and Sentry-captured under tag `dangling_image_cleanup_scan`; the function never throws.

The original commit-time fix (capture and rmi the prior image inside `commitContainerToSnapshot`) was considered and rejected: `buildSnapshotImageName` is deterministic for a given `(projectId, workItemId)` pair, so for re-commits the new and old `imageName` always match and the surgical edit would be a no-op. The dangling-cleanup loop catches every such case at 30-min latency.
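The scan-and-remove pass described above can be sketched as follows. This is a minimal illustration assuming a thin `DockerClient` interface and a plain `log` callback; the real implementation in `src/router/dangling-image-cleanup.ts` talks to the Docker API and adds Sentry capture, which is stubbed out here.

```typescript
// Hypothetical sketch of the dangling-image cleanup pass. The DockerClient
// interface, ImageInfo shape, and error statusCode field are assumptions for
// illustration; only the filter values and error policy come from the PR text.

interface ImageInfo { Id: string }

interface DockerClient {
  listImages(opts: { filters: Record<string, string[]> }): Promise<ImageInfo[]>;
  removeImage(id: string, opts: { force: boolean }): Promise<void>;
}

// The load-bearing filter: dangling AND cascade-managed, AND-ed by Docker's
// filter API. The label clause is what protects unrelated host workloads.
export const DANGLING_FILTERS: Record<string, string[]> = {
  dangling: ["true"],
  label: ["cascade.managed=true"],
};

export async function runDanglingImageCleanup(
  docker: DockerClient,
  log: (msg: string) => void,
): Promise<number> {
  let images: ImageInfo[];
  try {
    images = await docker.listImages({ filters: DANGLING_FILTERS });
  } catch (err: any) {
    // Real code logs at error and Sentry-captures; the function never throws.
    log(`dangling-image scan failed: ${err}`);
    return 0;
  }
  let removed = 0;
  for (const img of images) {
    try {
      await docker.removeImage(img.Id, { force: false });
      removed++;
    } catch (err: any) {
      // 409 (in use) and 404 (already gone) are expected races: swallow them.
      if (err?.statusCode === 409 || err?.statusCode === 404) continue;
      // Anything else: warn (and Sentry-capture in real code), keep looping.
      log(`remove failed for ${img.Id}: ${err}`);
    }
  }
  if (removed > 0) log(`removed ${removed} dangling image(s)`);
  return removed;
}
```

Because the client is injected, the error policy is easy to pin in tests with a mock that throws 409/404 for selected images.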
Tests (14 new, all green): scan filter shape (×2 — regression guards against scope expansion), happy path + log summary, zero-noise on empty, 409 swallow, 404 swallow, generic error → Sentry + continue, listImages failure → Sentry + no throw, lifecycle (start/stop idempotent + multi-cycle).

A one-time `docker image prune --force --filter label=cascade.managed=true --filter dangling=true` is needed on the prod host after deploy to reclaim the existing 102 GB backlog; the new loop only handles future drift.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
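The start/stop-idempotent lifecycle the tests pin down follows the pattern below. This is an illustrative sketch, not the actual module: function names and the `isRunning` helper are made up here, and the real hooks are wired into `startWorkerProcessor` / `stopWorkerProcessor`.

```typescript
// Hypothetical sketch of the idempotent start/stop lifecycle that mirrors
// orphan-cleanup.ts. The 30-min cadence is from the PR; names are illustrative.

const INTERVAL_MS = 30 * 60 * 1000; // dangling accumulation is gradual

let timer: ReturnType<typeof setInterval> | null = null;

export function startDanglingImageCleanup(tick: () => void): void {
  if (timer !== null) return; // idempotent: a second start is a no-op
  timer = setInterval(tick, INTERVAL_MS);
}

export function stopDanglingImageCleanup(): void {
  if (timer === null) return; // idempotent: stop without start is a no-op
  clearInterval(timer);
  timer = null;
}

// Test-only helper (an assumption, not in the real module).
export function isRunning(): boolean {
  return timer !== null;
}
```

Guarding on the stored timer handle is what makes repeated start/stop calls (and multi-cycle start→stop→start sequences) safe to exercise in tests.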
Promotes 1 commit from `dev`:
What
Periodic 30-min loop that prunes dangling Docker images carrying the `cascade.managed=true` label. Closes the leak class where `commitContainerToSnapshot` re-tags `cascade-snapshot-<proj>-<workitem>:latest` on each work-item run and orphans the prior digest outside the snapshot registry — measured at 102 GB reclaimable on the prod host on 2026-05-01.
Reviewed and approved on #1243 by @nhopeatall. Full CI green on dev (3/3 workflows).
Risk
Low. Purely additive lifecycle hook, scoped strictly to `label=cascade.managed=true` (regression-pinned in tests). It cannot touch unrelated host workloads (`ucho-dev`, `ucho-prod`, MySQL, Loki, etc.).
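The regression pin on the filter scope can be expressed as a plain assertion over the filter object. A minimal sketch, assuming the filter shape stated in this PR — `buildScanFilters` and `assertFilterScope` are hypothetical names for illustration, not the real test helpers:

```typescript
// Sketch of the regression guard: the scan filter must be exactly
// `dangling=true AND label=cascade.managed=true`, so a future edit cannot
// silently widen (or narrow) the blast radius. Names are illustrative.

export function buildScanFilters(): Record<string, string[]> {
  return { dangling: ["true"], label: ["cascade.managed=true"] };
}

export function assertFilterScope(filters: Record<string, string[]>): void {
  const keys = Object.keys(filters).sort().join(",");
  if (keys !== "dangling,label") {
    throw new Error(`unexpected filter keys: ${keys}`);
  }
  if (filters.dangling.join(",") !== "true") {
    throw new Error("dangling filter changed");
  }
  if (filters.label.join(",") !== "cascade.managed=true") {
    throw new Error("label filter changed — unrelated workloads at risk");
  }
}
```

Pinning the exact key set (not just the label clause) means dropping either filter, or adding a broader one, fails the test rather than the production host.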
Follow-up after deploy
One-time backlog cleanup on the prod host:
```bash
docker image prune --force \
--filter "label=cascade.managed=true" \
--filter "dangling=true"
```
Expected reclaim: ~100 GB.
🤖 Generated with Claude Code