
Promote dev to main: fix(router) dangling cascade-image cleanup #1244

Merged
zbigniewsobiecki merged 1 commit into main from dev on May 1, 2026
Conversation

@zbigniewsobiecki (Member)

Promotes 1 commit from `dev`:

What

Adds a periodic 30-minute loop that prunes dangling Docker images carrying the `cascade.managed=true` label. This closes the leak class where `commitContainerToSnapshot` re-tags `cascade-snapshot-<proj>-<workitem>:latest` on each work-item run and orphans the prior digest outside the snapshot registry, measured at 102 GB reclaimable on the prod host on 2026-05-01.

Reviewed and approved on #1243 by @nhopeatall. Full CI green on dev (3/3 workflows).

Risk

Low. Purely additive lifecycle hook, scoped strictly to `label=cascade.managed=true` (regression-pinned in tests). It cannot touch unrelated host workloads (`ucho-dev`, `ucho-prod`, MySQL, Loki, etc.).

Follow-up after deploy

One-time backlog cleanup on the prod host:

```bash
docker image prune --force \
--filter "label=cascade.managed=true" \
--filter "dangling=true"
```

Expected reclaim: ~100 GB.

🤖 Generated with Claude Code

…k) (#1243)

`commitContainerToSnapshot` re-tags `cascade-snapshot-<proj>-<workitem>:latest`
on every run of the same work item. Each re-commit re-points the tag to a new
digest; the previous digest becomes dangling (untagged) and falls out of the
in-memory snapshot registry, so the registry-driven `runSnapshotCleanup` never
sees it again. Production was measured at 102 GB reclaimable across ~136
dangling images on 2026-05-01 (50% disk used and climbing).

Add a periodic dangling-image cleanup loop (`src/router/dangling-image-cleanup.ts`)
that mirrors the existing `orphan-cleanup.ts` lifecycle pattern. 30-min interval
(slower than the 5-min snapshot loop because dangling accumulation is gradual
and `force: false` rmi is cheap). Wired into `startWorkerProcessor` /
`stopWorkerProcessor` next to the existing snapshot-cleanup hooks.
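The lifecycle wiring described above can be sketched roughly as follows. This is a minimal illustration of the start/stop interval pattern; the function names, the interval constant, and the injected `run` callback are assumptions rather than the actual `dangling-image-cleanup.ts` source:

```typescript
// Sketch of the periodic-cleanup lifecycle pattern described above.
// Names and shapes are illustrative assumptions, not the real module.
const CLEANUP_INTERVAL_MS = 30 * 60 * 1000; // 30-min cadence: dangling accumulation is gradual

let timer: ReturnType<typeof setInterval> | null = null;

export function startDanglingImageCleanup(run: () => Promise<void>): void {
  if (timer !== null) return; // idempotent start: a second call is a no-op
  timer = setInterval(() => {
    // The cleanup function is documented as never throwing, but guard anyway
    // so a future regression cannot kill the interval.
    void run().catch(() => {});
  }, CLEANUP_INTERVAL_MS);
}

export function stopDanglingImageCleanup(): void {
  if (timer !== null) {
    clearInterval(timer); // idempotent stop
    timer = null;
  }
}
```

In the real change these hooks sit next to the existing snapshot-cleanup hooks inside `startWorkerProcessor` / `stopWorkerProcessor`, so the loop shares the worker's lifetime.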

Safety scope is the load-bearing invariant: the scan filter is
`dangling=true AND label=cascade.managed=true`, AND-ed by Docker's filter API.
The label clause is the only thing protecting unrelated host workloads
(ucho-dev/prod, MySQL, Loki, etc.) from being reaped — pinned by an explicit
regression test. Per-image errors mirror `removeSnapshotImage`: 409 (in use)
and 404 (already gone) are silently swallowed; any other error is logged at
warn and Sentry-captured under tag `dangling_image_remove`. Loop continues.
`listImages` failure is logged at error and Sentry-captured under tag
`dangling_image_cleanup_scan`; the function never throws.
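The scan-and-remove contract above can be condensed into a sketch written against a minimal, hypothetical Docker-client interface. The `DockerClient` shape, the `warn` callback, and the return value are illustrative assumptions (the real module presumably talks to the Docker Engine API and reports to Sentry); only the filter shape and the 409/404-swallowing behaviour are taken from the description:

```typescript
// Minimal Docker-client surface assumed purely for illustration.
interface DockerClient {
  listImages(filters: Record<string, string[]>): Promise<{ Id: string }[]>;
  removeImage(id: string): Promise<void>; // rejects with { statusCode } on failure
}

// The load-bearing invariant: both clauses are AND-ed by Docker's filter API,
// and the label clause is what protects unrelated host workloads.
const SCAN_FILTERS = {
  dangling: ["true"],
  label: ["cascade.managed=true"],
};

async function runDanglingImageCleanup(
  docker: DockerClient,
  warn: (msg: string, err: unknown) => void,
): Promise<number> {
  let images: { Id: string }[];
  try {
    images = await docker.listImages(SCAN_FILTERS);
  } catch (err) {
    warn("dangling_image_cleanup_scan failed", err); // logged + captured in the real code
    return 0; // the function never throws
  }
  let removed = 0;
  for (const image of images) {
    try {
      await docker.removeImage(image.Id); // force: false in the real code
      removed += 1;
    } catch (err) {
      const status = (err as { statusCode?: number }).statusCode;
      if (status === 409 || status === 404) continue; // in use / already gone: swallow
      warn("dangling_image_remove failed", err); // warn + capture, loop continues
    }
  }
  return removed;
}
```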

The original commit-time fix (capture and rmi the prior image inside
`commitContainerToSnapshot`) was considered and rejected: `buildSnapshotImageName`
is deterministic for a given `(projectId, workItemId)` pair, so for re-commits
the new and old `imageName` always match and the surgical edit would be a
no-op. The dangling-cleanup loop catches every such case at 30-min latency.
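To see why the surgical edit would be a no-op, consider a hypothetical reconstruction of `buildSnapshotImageName` (the real implementation is not shown in this PR; only its determinism matters for the argument):

```typescript
// Hypothetical reconstruction, purely to illustrate the determinism argument;
// the real buildSnapshotImageName may differ in detail.
function buildSnapshotImageName(projectId: string, workItemId: string): string {
  return `cascade-snapshot-${projectId}-${workItemId}:latest`;
}

// On a re-commit of the same work item, the "old" and "new" names are identical,
// so an "rmi the prior imageName" step inside commitContainerToSnapshot would
// delete the tag it is about to re-point, never the orphaned prior digest.
const before = buildSnapshotImageName("p1", "w1");
const after = buildSnapshotImageName("p1", "w1");
```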

Tests (14 new, all green): scan filter shape (×2 — regression guards against
scope expansion), happy path + log summary, zero-noise on empty, 409 swallow,
404 swallow, generic error → Sentry + continue, listImages failure → Sentry +
no throw, lifecycle (start/stop idempotent + multi-cycle).

A one-time `docker image prune --force --filter label=cascade.managed=true
--filter dangling=true` is needed on the prod host after deploy to reclaim the
existing 102 GB backlog; the new loop only handles future drift.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@zbigniewsobiecki merged commit 5c1c912 into main on May 1, 2026
14 checks passed

codecov Bot commented May 1, 2026

Codecov Report

❌ Patch coverage is 88.17204% with 11 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| `src/router/dangling-image-cleanup.ts` | 87.77% | 11 Missing ⚠️ |


