fix(router): label cascade Dockerfiles so dangling-image cleanup actually matches #1256
`scanAndCleanupDanglingImages` (PR #1243) filters images by `dangling=true AND label=cascade.managed=true`. The label clause is the safety belt that keeps the loop from reaping unrelated host workloads (ucho-dev/prod, MySQL, Loki, etc.), but the label was never applied to cascade-built images. `cascade.managed=true` was only set as a CONTAINER label at run time (container-manager.ts), never as an IMAGE label via a `LABEL` directive in any Dockerfile.

Live verification on the dev host: 140 dangling images present, but `docker images --filter dangling=true --filter label=cascade.managed=true` returns zero. Every Loki cleanup-pass log line shows `removedCount=0, reclaimedBytes=0`, so the loop has been a no-op since deploy.

Adds `LABEL cascade.managed=true` to all five cascade Dockerfiles (router, worker, dashboard, frontend, selfhosted) so newly-built images carry the label, dangling rebuilds inherit it, and the existing strict filter starts matching exactly the right set. No code change in the cleanup loop. No filter widening.

Static guard test pins both halves of the contract: the filter shape AND the per-Dockerfile LABEL directive. A new `Dockerfile.<svc>` without the label fails CI loudly.

Pre-label dangling backlog (~130 images on prod) needs a one-off manual prune; documented in PR body.

Out of scope: cascade-worker tag bloat (29 SHA-pinned tags accumulating). That needs a separate retention loop in a separate PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
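For readers who want a feel for the per-Dockerfile half of the guard described above, here is a minimal sketch. It is not the PR's actual test: `hasManagedLabel`, `findUnlabelled`, and the inline Dockerfile contents are hypothetical names and fixtures chosen for illustration, and the real guard in `tests/unit/router/dangling-image-cleanup.test.ts` may be structured differently.

```typescript
// Illustrative only: names below are hypothetical, not taken from the PR's test file.
const MANAGED_LABEL = "cascade.managed=true";

/**
 * Returns true when the Dockerfile text has a non-comment LABEL directive
 * that sets cascade.managed=true. Line continuations (trailing backslash)
 * are not handled in this sketch.
 */
function hasManagedLabel(dockerfile: string): boolean {
  return dockerfile
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => /^LABEL\s/i.test(line)) // comment lines start with "#" and are skipped
    .some((line) =>
      line
        .replace(/["']/g, "") // tolerate quoted keys or values
        .split(/\s+/)
        .slice(1) // drop the LABEL keyword itself
        .includes(MANAGED_LABEL),
    );
}

/** Lists the files (name -> contents) that are missing the label. */
function findUnlabelled(files: Record<string, string>): string[] {
  return Object.entries(files)
    .filter(([, contents]) => !hasManagedLabel(contents))
    .map(([name]) => name);
}

// Assumed fixtures: a labelled file and one where the label is only a comment.
const labelled = ["FROM scratch", "LABEL cascade.managed=true"].join("\n");
const commentedOut = ["FROM scratch", "# LABEL cascade.managed=true"].join("\n");

const missing = findUnlabelled({
  "Dockerfile.router": labelled,
  "Dockerfile.worker": commentedOut,
});
console.log(missing); // [ 'Dockerfile.worker' ]
```

A real guard would presumably read the Dockerfiles from disk and fail the test run when the returned list is non-empty; the sketch keeps the contents inline so it runs without a repository checkout.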
nhopeatall left a comment
test review
🕵️ claude-code · claude-opus-4-6 · run details
nhopeatall left a comment
Summary
Clean, well-scoped fix for a real production issue: the dangling-image cleanup loop was a complete no-op since deploy because no built image carried the `cascade.managed=true` label it filters on. The fix correctly adds `LABEL cascade.managed=true` to the production stage of all five Dockerfiles, and the static guard test pins both halves of the contract going forward.
Code Issues
Should Fix
- All five Dockerfiles: the comment block references `PR #12xx` (a placeholder) instead of the actual `PR #1256`. This should be updated for traceability; when someone reads these comments in 6 months, `#12xx` won't resolve to anything useful.
🕵️ claude-code · claude-opus-4-6 · run details
# `cascade.managed=true` is the contract the router's dangling-image cleanup
# loop filters on (src/router/dangling-image-cleanup.ts). Without this LABEL,
# the loop matches zero images and reclaims nothing — see PR #12xx.
[SHOULD_FIX] Placeholder PR #12xx should be PR #1256. Same placeholder appears in all five Dockerfiles — worth a quick find-and-replace before merge.
Fixed! Replaced the PR #12xx placeholder with the actual PR #1256 across all five Dockerfiles.
Fixed! Replaced the PR #12xx placeholder with PR #1256 across all five Dockerfiles. The commit is at e4192ce.
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
- `scanAndCleanupDanglingImages` (PR #1243, "fix(router): periodic dangling cascade-image cleanup (close 102GB leak)") was a no-op since deploy: it filters images by `dangling=true AND label=cascade.managed=true`, but no cascade-built image carried that label.
- `cascade.managed=true` was only ever applied to running containers (in `container-manager.ts`); never baked into any Dockerfile.
- `docker images --filter dangling=true --filter label=cascade.managed=true | wc -l` returns 0. Every prod cleanup-pass log line shows `removedCount=0, reclaimedBytes=0`.
- Adds `LABEL cascade.managed=true` to all five cascade Dockerfiles. The existing strict label filter starts matching exactly the right set without any code change in the cleanup loop and without any blast-radius widening.

Why the strict filter stays strict
The cascade-router host runs other unrelated workloads (ucho-dev/prod, MySQL, Loki, etc.). Widening the filter would risk reaping their dangling images. The right fix is to make cascade-built images carry the label so the existing safety belt actually matches them.
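The reasoning above can be made concrete with a small sketch of the strict match. The `ImageInfo` shape, `isReapable`, `buildImageFilters`, and the sample images are all hypothetical and are not the router's actual types or data. The filter map follows the JSON `map[string][]string` form that, to my understanding, the Docker Engine API accepts for image listing; treat that as an assumption about how the real loop talks to Docker.

```typescript
// Hypothetical image shape for illustration; not the router's actual types.
interface ImageInfo {
  id: string;
  dangling: boolean;
  labels: Record<string, string>;
}

/** Strict match: the image must be dangling AND carry cascade.managed=true. */
function isReapable(image: ImageInfo): boolean {
  return image.dangling && image.labels["cascade.managed"] === "true";
}

/**
 * JSON-encoded filter map (filter name -> list of values).
 * Assumption: this mirrors the two --filter flags quoted in the PR body.
 */
function buildImageFilters(): string {
  return JSON.stringify({
    dangling: ["true"],
    label: ["cascade.managed=true"],
  });
}

// Made-up host contents: one cascade leftover, one unrelated dangling image,
// and one labelled cascade image that is still tagged.
const hostImages: ImageInfo[] = [
  { id: "cascade-old-build", dangling: true, labels: { "cascade.managed": "true" } },
  { id: "unrelated-dangling", dangling: true, labels: {} },
  { id: "cascade-current", dangling: false, labels: { "cascade.managed": "true" } },
];

const reapable = hostImages.filter(isReapable).map((image) => image.id);
console.log(reapable); // [ 'cascade-old-build' ]
console.log(buildImageFilters());
```

Dropping the label clause from `isReapable` would also select `unrelated-dangling`, which is the widening the PR avoids; labelling the images at build time lets the predicate stay as it is.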
Out of scope
Test plan
- `npx vitest run --project unit-api tests/unit/router/dangling-image-cleanup.test.ts`: green (20 tests, 5 new LABEL guards)
- `npm test`: full unit suite (8781 tests, 23 skipped, 0 failures)
- `npm run typecheck`: clean
- `npm run lint`: clean (13 pre-existing warnings, unrelated)
- `[DanglingImageCleanup] Cleanup pass complete: { removedCount: N>0, reclaimedBytes: M>0 }` within 30 min
- `docker image prune --filter "until=24h"` on prod to clear the pre-label backlog

🤖 Generated with Claude Code