Promote dev to main: fix(router): dangling cascade-image cleanup (#1244)
Merged
zbigniewsobiecki merged 1 commit into `main` from `dev` on May 1, 2026
Conversation
…k) (#1243)

`commitContainerToSnapshot` re-tags `cascade-snapshot-<proj>-<workitem>:latest` on every run of the same work item. Each re-commit re-points the tag to a new digest; the previous digest becomes dangling (untagged) and falls out of the in-memory snapshot registry, so the registry-driven `runSnapshotCleanup` never sees it again. Production was measured at 102 GB reclaimable across ~136 dangling images on 2026-05-01 (50% disk used and climbing).

Add a periodic dangling-image cleanup loop (`src/router/dangling-image-cleanup.ts`) that mirrors the existing `orphan-cleanup.ts` lifecycle pattern. It runs on a 30-min interval (slower than the 5-min snapshot loop because dangling accumulation is gradual and a `force: false` rmi is cheap) and is wired into `startWorkerProcessor` / `stopWorkerProcessor` next to the existing snapshot-cleanup hooks.

Safety scope is the load-bearing invariant: the scan filter is `dangling=true AND label=cascade.managed=true`, AND-ed by Docker's filter API. The label clause is the only thing protecting unrelated host workloads (ucho-dev/prod, MySQL, Loki, etc.) from being reaped, and it is pinned by an explicit regression test.

Per-image errors mirror `removeSnapshotImage`: 409 (in use) and 404 (already gone) are silently swallowed; any other error is logged at warn and Sentry-captured under tag `dangling_image_remove`, and the loop continues. A `listImages` failure is logged at error and Sentry-captured under tag `dangling_image_cleanup_scan`; the function never throws.

The original commit-time fix (capture and rmi the prior image inside `commitContainerToSnapshot`) was considered and rejected: `buildSnapshotImageName` is deterministic for a given `(projectId, workItemId)` pair, so for re-commits the new and old `imageName` always match and the surgical edit would be a no-op. The dangling-cleanup loop catches every such case at 30-min latency.
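The scan-and-remove pass described above can be sketched as follows. This is a minimal illustration assuming a thin `DockerClient` interface and a plain `log` callback; the real implementation in `src/router/dangling-image-cleanup.ts` talks to the Docker API and adds Sentry capture, which is stubbed out here.

```typescript
// Hypothetical sketch of the dangling-image cleanup pass. The DockerClient
// interface, ImageInfo shape, and error statusCode field are assumptions for
// illustration; only the filter values and error policy come from the PR text.

interface ImageInfo { Id: string }

interface DockerClient {
  listImages(opts: { filters: Record<string, string[]> }): Promise<ImageInfo[]>;
  removeImage(id: string, opts: { force: boolean }): Promise<void>;
}

// The load-bearing filter: dangling AND cascade-managed, AND-ed by Docker's
// filter API. The label clause is what protects unrelated host workloads.
export const DANGLING_FILTERS: Record<string, string[]> = {
  dangling: ["true"],
  label: ["cascade.managed=true"],
};

export async function runDanglingImageCleanup(
  docker: DockerClient,
  log: (msg: string) => void,
): Promise<number> {
  let images: ImageInfo[];
  try {
    images = await docker.listImages({ filters: DANGLING_FILTERS });
  } catch (err: any) {
    // Real code logs at error and Sentry-captures; the function never throws.
    log(`dangling-image scan failed: ${err}`);
    return 0;
  }
  let removed = 0;
  for (const img of images) {
    try {
      await docker.removeImage(img.Id, { force: false });
      removed++;
    } catch (err: any) {
      // 409 (in use) and 404 (already gone) are expected races: swallow them.
      if (err?.statusCode === 409 || err?.statusCode === 404) continue;
      // Anything else: warn (and Sentry-capture in real code), keep looping.
      log(`remove failed for ${img.Id}: ${err}`);
    }
  }
  if (removed > 0) log(`removed ${removed} dangling image(s)`);
  return removed;
}
```

Because the client is injected, the error policy is easy to pin in tests with a mock that throws 409/404 for selected images.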
Tests (14 new, all green): scan filter shape (×2 — regression guards against scope expansion), happy path + log summary, zero-noise on empty, 409 swallow, 404 swallow, generic error → Sentry + continue, listImages failure → Sentry + no throw, lifecycle (start/stop idempotent + multi-cycle).

A one-time `docker image prune --force --filter label=cascade.managed=true --filter dangling=true` is needed on the prod host after deploy to reclaim the existing 102 GB backlog; the new loop only handles future drift.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
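The start/stop-idempotent lifecycle the tests pin down follows the pattern below. This is an illustrative sketch, not the actual module: function names and the `isRunning` helper are made up here, and the real hooks are wired into `startWorkerProcessor` / `stopWorkerProcessor`.

```typescript
// Hypothetical sketch of the idempotent start/stop lifecycle that mirrors
// orphan-cleanup.ts. The 30-min cadence is from the PR; names are illustrative.

const INTERVAL_MS = 30 * 60 * 1000; // dangling accumulation is gradual

let timer: ReturnType<typeof setInterval> | null = null;

export function startDanglingImageCleanup(tick: () => void): void {
  if (timer !== null) return; // idempotent: a second start is a no-op
  timer = setInterval(tick, INTERVAL_MS);
}

export function stopDanglingImageCleanup(): void {
  if (timer === null) return; // idempotent: stop without start is a no-op
  clearInterval(timer);
  timer = null;
}

// Test-only helper (an assumption, not in the real module).
export function isRunning(): boolean {
  return timer !== null;
}
```

Guarding on the stored timer handle is what makes repeated start/stop calls (and multi-cycle start→stop→start sequences) safe to exercise in tests.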
Promotes 1 commit from `dev`:
What
Periodic 30-min loop that prunes dangling Docker images carrying the `cascade.managed=true` label. Closes the leak class where `commitContainerToSnapshot` re-tags `cascade-snapshot-<proj>-<workitem>:latest` on each work-item run and orphans the prior digest outside the snapshot registry — measured at 102 GB reclaimable on the prod host on 2026-05-01.
Reviewed and approved on #1243 by @nhopeatall. Full CI green on dev (3/3 workflows).
Risk
Low. Purely additive lifecycle hook, scoped strictly to `label=cascade.managed=true` (regression-pinned in tests). It cannot touch unrelated host workloads (`ucho-dev`, `ucho-prod`, MySQL, Loki, etc.).
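The regression pin on the filter scope can be expressed as a plain assertion over the filter object. A minimal sketch, assuming the filter shape stated in this PR — `buildScanFilters` and `assertFilterScope` are hypothetical names for illustration, not the real test helpers:

```typescript
// Sketch of the regression guard: the scan filter must be exactly
// `dangling=true AND label=cascade.managed=true`, so a future edit cannot
// silently widen (or narrow) the blast radius. Names are illustrative.

export function buildScanFilters(): Record<string, string[]> {
  return { dangling: ["true"], label: ["cascade.managed=true"] };
}

export function assertFilterScope(filters: Record<string, string[]>): void {
  const keys = Object.keys(filters).sort().join(",");
  if (keys !== "dangling,label") {
    throw new Error(`unexpected filter keys: ${keys}`);
  }
  if (filters.dangling.join(",") !== "true") {
    throw new Error("dangling filter changed");
  }
  if (filters.label.join(",") !== "cascade.managed=true") {
    throw new Error("label filter changed — unrelated workloads at risk");
  }
}
```

Pinning the exact key set (not just the label clause) means dropping either filter, or adding a broader one, fails the test rather than the production host.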
Follow-up after deploy
One-time backlog cleanup on the prod host:
```bash
docker image prune --force \
--filter "label=cascade.managed=true" \
--filter "dangling=true"
```
Expected reclaim: ~100 GB.
🤖 Generated with Claude Code