
Fix disk space GC for BuildKit cache and registry blobs#588

Merged
phinze merged 1 commit into main from phinze/fix-disk-space-gc
Feb 10, 2026
Conversation


@phinze phinze commented Feb 10, 2026

One of our internal clusters filled up — uncollected BuildKit cache and
orphaned registry blobs with no cleanup path.

BuildKit cache: The gcpolicy only had filters for source.local and
exec.cachemount, so regular build layers never got collected. Removed
the filters to make it a catch-all — the filtered types were a subset
anyway.
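For context, the relevant buildkitd.toml stanza looks roughly like the sketch below. The numbers are illustrative (taken from the values the unit test passes to generateConfig), and the fix amounts to deleting the filters line so the policy applies to every cache record type:

```toml
[worker.oci]
  gc = true
  [[worker.oci.gcpolicy]]
    keepBytes = 10737418240   # 10 GiB
    keepDuration = 86400      # 1 day, in seconds
    # Removed by this PR -- with filters present, regular build layers
    # never matched the policy and were never collected:
    # filters = ["type==source.local", "type==exec.cachemount"]
```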

Registry blobs: Artifact GC archives artifacts and image GC cleans
up containerd images, but the actual blob files on disk were never
reaped. Added blob GC to the existing ImageWatchdog so it runs on the
same schedule (pressure checks + weekly). It collects digests referenced
by non-archived artifacts and deletes everything else, with a 1-hour
grace period for in-flight uploads.

@phinze phinze requested a review from a team as a code owner February 10, 2026 15:02

coderabbitai bot commented Feb 10, 2026

📝 Walkthrough

Removed the filters line from the generated BuildKit daemon worker.oci.gcpolicy configuration and added a unit test TestGenerateConfig validating generated config contents and registry host handling. Implemented blob garbage collection: added BlobGCResult and ImageWatchdog.RunBlobGC to enumerate registry blobs, collect referenced digests from OCI-like manifests, skip recently modified or referenced blobs, and delete unreferenced old blobs while recording deletions and failures. Modified image GC flow to always run blob GC and log blob GC results with per-blob details.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@components/buildkit/config_test.go`:
- Around line 24-25: The test incorrectly asserts presence of the substring
using idx := strings.Index(config, "[[worker.oci.gcpolicy]]") followed by
r.Greater(idx, 0), which fails if the match is at index 0; change the assertion
to check for non-negative/exists (for example, use r.NotEqual(-1, idx) or
r.GreaterOrEqual(idx, 0)) so the test correctly passes when the substring is
found at the start; update the assertion call in
components/buildkit/config_test.go replacing r.Greater(idx, 0) with the chosen
check.

In `@controllers/sandbox/blob_gc.go`:
- Around line 117-122: The comment is misleading: when json.Unmarshal of
art.Manifest fails, the code logs a warning and continues, which results in that
artifact contributing no entries to the digests map so blobs only referenced by
this malformed artifact may be deleted; update the inline comment near the
json.Unmarshal block (referencing art.Manifest, manifest, digests, and the
surrounding blob cleanup logic) to clearly state this intentional tradeoff
(favoring GC availability over safety), and mention that malformed manifests are
treated as having no digests and therefore could allow deletion of blobs
uniquely referenced by that artifact.
🧹 Nitpick comments (3)
controllers/sandbox/blob_gc.go (1)

61-87: Consider filtering for regular files only.

If the blobs directory ever contains subdirectories (e.g., sharded storage like sha256/<prefix>/<hash>), os.Remove will fail for non-empty directories. This could inflate TotalBlobs and populate FailedBlobs with confusing errors.

If the current registry implementation guarantees a flat file structure, this is fine as-is. Otherwise, consider filtering:

♻️ Suggested defensive check

```diff
 for _, entry := range entries {
     name := entry.Name()
+    if entry.IsDir() {
+        continue
+    }

     // Skip recently modified files to guard against concurrent uploads
```
controllers/sandbox/blob_gc_test.go (2)

57-63: Test isolation: artifacts persist across subtests.

The entity server is shared across all subtests, so artifacts created in earlier subtests remain visible in later ones. This works because each subtest uses unique digest names, but it's fragile—accidental name collisions could cause intermittent failures.

Consider either creating a fresh entity server per subtest, or using t.Cleanup to remove artifacts after each subtest.


243-245: Clarify test comment to avoid confusion.

The comment states "its blobs should NOT be deleted" but the blob sha256:someblob isn't actually referenced by this artifact (we can't know what it references since the manifest is malformed). The test actually verifies that orphan blobs are deleted even when a malformed manifest exists.

📝 Suggested clarification

```diff
-       // Create artifact with invalid manifest - its blobs should NOT be deleted
-       // because we can't verify what it references
+       // Create artifact with invalid manifest - we can't determine what blobs it references,
+       // so unrelated blobs can still be garbage collected
        createArtifact(t, eac, core_v1alpha.ACTIVE, "not-valid-json{{{")
```

@phinze phinze force-pushed the phinze/fix-disk-space-gc branch from a01106c to 64d4662 on February 10, 2026 17:56

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@controllers/sandbox/blob_gc_test.go`:
- Around line 237-261: Update the misleading test comment in the "handles
malformed manifest" test: the lines stating "its blobs should NOT be deleted
because we can't verify what it references" contradict the assertion that the
orphan blob is deleted; change or remove that sentence so the comment aligns
with the actual behavior asserted (the malformed manifest is skipped and the
orphan blob is deleted). Refer to the test setup that calls createArtifact(...,
"not-valid-json{{{") and the assertion against watchdog.RunBlobGC() /
result.DeletedBlobs to ensure the comment reflects that deletion is expected.

In `@controllers/sandbox/blob_gc.go`:
- Around line 65-73: When entry.Info() fails, the code adds the error to
result.FailedBlobs but doesn't account for that blob in the retained count,
breaking the invariant TotalBlobs == RetainedBlobs + len(DeletedBlobs) +
len(FailedBlobs); update the failure branch in the loop that calls entry.Info()
(where name is the blob key) to also increment the retained counter
(result.RetainedBlobs++) for errors, or alternatively add a new counter field
(e.g., result.FailedCount) and ensure TotalBlobs is computed using that field;
change either the error branch around entry.Info() or the result struct usage so
the sums remain consistent.
🧹 Nitpick comments (3)
components/buildkit/config_test.go (1)

12-12: Consider extracting magic numbers to named constants.

The parameters 10*1024*1024*1024 and 86400 could be more readable as named constants (e.g., testKeepBytes and testKeepDurationSeconds).

♻️ Proposed refactor

```diff
+const (
+	testKeepBytes    = 10 * 1024 * 1024 * 1024 // 10 GiB
+	testKeepDuration = 86400                   // 1 day in seconds
+)
+
 func TestGenerateConfig(t *testing.T) {
 	c := &Component{}
-	config := c.generateConfig(10*1024*1024*1024, 86400, "registry.example.com:5000")
+	config := c.generateConfig(testKeepBytes, testKeepDuration, "registry.example.com:5000")
```
controllers/sandbox/image_gc.go (1)

158-161: Consider adding nil-check for defensive coding.

While RunBlobGC currently always returns a non-nil result, adding a nil-check would make this code more robust against future changes.

🛡️ Defensive nil-check

```diff
 	// Run blob GC independently of image GC
 	blobResult, blobErr := w.RunBlobGC(ctx)
 	if blobErr != nil {
 		w.Log.Error("blob GC failed", "trigger", trigger, "error", blobErr)
-	} else if len(blobResult.DeletedBlobs) > 0 || len(blobResult.FailedBlobs) > 0 {
+	}
+	if blobResult == nil {
+		return
+	}
+	if len(blobResult.DeletedBlobs) > 0 || len(blobResult.FailedBlobs) > 0 {
```
controllers/sandbox/blob_gc_test.go (1)

57-63: Shared entity server may cause test pollution between subtests.

The entity server is shared across all subtests, meaning artifacts created in earlier tests persist and affect later tests. While the current tests use unique digest names and separate tmpDir for blob files, this setup is fragile. If a future test uses overlapping digest names, it could fail unexpectedly or produce false positives.

Consider either:

  1. Creating a fresh entity server per subtest
  2. Adding cleanup logic after each subtest
  3. Using table-driven tests with explicit cleanup
♻️ Option 1: Fresh entity server per subtest

```diff
 func TestRunBlobGC(t *testing.T) {
-	entServer, cleanup := entitytestutils.NewInMemEntityServer(t)
-	defer cleanup()
-
-	eac := entServer.EAC
 	log := entitytestutils.TestLogger(t)
 	oldTime := time.Now().Add(-2 * time.Hour)

 	t.Run("deletes unreferenced blobs", func(t *testing.T) {
+		entServer, cleanup := entitytestutils.NewInMemEntityServer(t)
+		defer cleanup()
+		eac := entServer.EAC
+
 		r := require.New(t)
 		tmpDir := t.TempDir()
 		// ... rest of test
```

Comment on lines 65 to 73:

```go
info, err := entry.Info()
if err != nil {
	result.FailedBlobs[name] = err
	continue
}
if now.Sub(info.ModTime()) < 1*time.Hour {
	result.RetainedBlobs++
	continue
}
```


⚠️ Potential issue | 🟡 Minor

Count inconsistency when entry.Info() fails.

When entry.Info() returns an error, the blob is added to FailedBlobs but not counted in RetainedBlobs. This creates an inconsistency where TotalBlobs != RetainedBlobs + len(DeletedBlobs) + len(FailedBlobs). Consider whether failed blobs should be counted as retained (since they weren't deleted) or if a separate counter is needed.

🔧 Proposed fix to maintain count consistency

```diff
 		info, err := entry.Info()
 		if err != nil {
 			result.FailedBlobs[name] = err
+			result.RetainedBlobs++ // Count as retained since we couldn't delete
 			continue
 		}
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```go
info, err := entry.Info()
if err != nil {
	result.FailedBlobs[name] = err
	result.RetainedBlobs++ // Count as retained since we couldn't delete
	continue
}
if now.Sub(info.ModTime()) < 1*time.Hour {
	result.RetainedBlobs++
	continue
}
```

@phinze phinze force-pushed the phinze/fix-disk-space-gc branch from 64d4662 to 0b1a473 on February 10, 2026 18:00
One of our internal clusters filled up with uncollected BuildKit cache
and orphaned registry blobs.

BuildKit's gcpolicy had filters for only source.local and
exec.cachemount, so regular build cache layers grew unbounded. Removed
the filters to make it a catch-all policy.

Registry blobs had no cleanup at all. Added blob GC to the existing
ImageWatchdog — it parses artifact manifests to find referenced digests
and deletes any blob files that aren't referenced by a non-archived
artifact. Skips recently modified blobs (<1h) to avoid racing with
concurrent uploads.
@phinze phinze force-pushed the phinze/fix-disk-space-gc branch from 0b1a473 to 3e1fe41 on February 10, 2026 20:01
@phinze phinze enabled auto-merge February 10, 2026 20:03

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@controllers/sandbox/blob_gc_test.go`:
- Around line 57-63: TestRunBlobGC currently uses a single shared entity
server/eac (created via entitytestutils.NewInMemEntityServer) for all subtests
which can cause test pollution; modify the test so each subtest gets its own
isolated server: inside the TestRunBlobGC subtests create a fresh entity server
and eac by calling entitytestutils.NewInMemEntityServer (or teardown artifacts)
per t.Run, or alternatively ensure each subtest removes any created artifacts
via eac before returning; update references to eac in the subtest closures to
use the per-subtest instance so tests like "keeps blobs referenced by active
artifacts" and "deletes unreferenced blobs" run in isolation.
🧹 Nitpick comments (1)
controllers/sandbox/blob_gc.go (1)

71-74: Consider making the grace period configurable.

The 1-hour grace period is hardcoded. While this is reasonable for the current use case, consider making it configurable via ImageGCConfig for flexibility in different deployment environments.

♻️ Optional: Make grace period configurable

In image_gc.go, add to ImageGCConfig:

```go
// BlobGracePeriod is how long to wait before deleting unreferenced blobs (default: 1h)
BlobGracePeriod time.Duration
```

Then in blob_gc.go:

```diff
-		if now.Sub(info.ModTime()) < 1*time.Hour {
+		if now.Sub(info.ModTime()) < w.Config.BlobGracePeriod {
```

Comment on lines +57 to +63:

```go
func TestRunBlobGC(t *testing.T) {
	entServer, cleanup := entitytestutils.NewInMemEntityServer(t)
	defer cleanup()

	eac := entServer.EAC
	log := entitytestutils.TestLogger(t)
	oldTime := time.Now().Add(-2 * time.Hour)
```


⚠️ Potential issue | 🟡 Minor

Potential test pollution due to shared entity server across subtests.

All subtests share the same eac (entity access client), but each subtest creates artifacts without cleanup. This means artifacts created in earlier subtests (e.g., "keeps blobs referenced by active artifacts") will still exist when later subtests run, potentially affecting results.

For example, the "deletes unreferenced blobs" test may behave differently depending on whether artifacts from other subtests are present.

Consider either:

  1. Creating a fresh InMemEntityServer per subtest, or
  2. Cleaning up artifacts after each subtest
♻️ Option 1: Fresh server per subtest

```diff
 	t.Run("deletes unreferenced blobs", func(t *testing.T) {
+		entServer, cleanup := entitytestutils.NewInMemEntityServer(t)
+		defer cleanup()
+		eac := entServer.EAC
 		r := require.New(t)
 		tmpDir := t.TempDir()
```

@phinze phinze merged commit 63b6ae8 into main Feb 10, 2026
9 checks passed
@phinze phinze deleted the phinze/fix-disk-space-gc branch February 10, 2026 20:09