
Fix disk space GC for BuildKit cache and registry blobs#588

Merged
phinze merged 1 commit into main from phinze/fix-disk-space-gc
Feb 10, 2026
Conversation


@phinze phinze commented Feb 10, 2026

One of our internal clusters filled up — uncollected BuildKit cache and
orphaned registry blobs with no cleanup path.

BuildKit cache: The gcpolicy only had filters for source.local and
exec.cachemount, so regular build layers never got collected. Removed
the filters to make it a catch-all — the filtered types were a subset
anyway.
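For context, the relevant buildkitd.toml stanza looks roughly like the sketch below. The numbers are illustrative (taken from the values the unit test passes to generateConfig), and the fix amounts to deleting the filters line so the policy applies to every cache record type:

```toml
[worker.oci]
  gc = true
  [[worker.oci.gcpolicy]]
    keepBytes = 10737418240   # 10 GiB
    keepDuration = 86400      # 1 day, in seconds
    # Removed by this PR -- with filters present, regular build layers
    # never matched the policy and were never collected:
    # filters = ["type==source.local", "type==exec.cachemount"]
```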

Registry blobs: Artifact GC archives artifacts and image GC cleans
up containerd images, but the actual blob files on disk were never
reaped. Added blob GC to the existing ImageWatchdog so it runs on the
same schedule (pressure checks + weekly). It collects digests referenced
by non-archived artifacts and deletes everything else, with a 1-hour
grace period for in-flight uploads.

@phinze phinze requested a review from a team as a code owner February 10, 2026 15:02

coderabbitai bot commented Feb 10, 2026

📝 Walkthrough

Removed the filters line from the generated BuildKit daemon worker.oci.gcpolicy configuration and added a unit test TestGenerateConfig validating generated config contents and registry host handling. Implemented blob garbage collection: added BlobGCResult and ImageWatchdog.RunBlobGC to enumerate registry blobs, collect referenced digests from OCI-like manifests, skip recently modified or referenced blobs, and delete unreferenced old blobs while recording deletions and failures. Modified image GC flow to always run blob GC and log blob GC results with per-blob details.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@components/buildkit/config_test.go`:
- Around line 24-25: The test incorrectly asserts presence of the substring
using idx := strings.Index(config, "[[worker.oci.gcpolicy]]") followed by
r.Greater(idx, 0), which fails if the match is at index 0; change the assertion
to check for non-negative/exists (for example, use r.NotEqual(-1, idx) or
r.GreaterOrEqual(idx, 0)) so the test correctly passes when the substring is
found at the start; update the assertion call in
components/buildkit/config_test.go replacing r.Greater(idx, 0) with the chosen
check.

In `@controllers/sandbox/blob_gc.go`:
- Around line 117-122: The comment is misleading: when json.Unmarshal of
art.Manifest fails, the code logs a warning and continues, which results in that
artifact contributing no entries to the digests map so blobs only referenced by
this malformed artifact may be deleted; update the inline comment near the
json.Unmarshal block (referencing art.Manifest, manifest, digests, and the
surrounding blob cleanup logic) to clearly state this intentional tradeoff
(favoring GC availability over safety), and mention that malformed manifests are
treated as having no digests and therefore could allow deletion of blobs
uniquely referenced by that artifact.
🧹 Nitpick comments (3)
controllers/sandbox/blob_gc.go (1)

61-87: Consider filtering for regular files only.

If the blobs directory ever contains subdirectories (e.g., sharded storage like sha256/<prefix>/<hash>), os.Remove will fail for non-empty directories. This could inflate TotalBlobs and populate FailedBlobs with confusing errors.

If the current registry implementation guarantees a flat file structure, this is fine as-is. Otherwise, consider filtering:

♻️ Suggested defensive check

```diff
 for _, entry := range entries {
     name := entry.Name()
+    if entry.IsDir() {
+        continue
+    }

     // Skip recently modified files to guard against concurrent uploads
```
controllers/sandbox/blob_gc_test.go (2)

57-63: Test isolation: artifacts persist across subtests.

The entity server is shared across all subtests, so artifacts created in earlier subtests remain visible in later ones. This works because each subtest uses unique digest names, but it's fragile—accidental name collisions could cause intermittent failures.

Consider either creating a fresh entity server per subtest, or using t.Cleanup to remove artifacts after each subtest.


243-245: Clarify test comment to avoid confusion.

The comment states "its blobs should NOT be deleted" but the blob sha256:someblob isn't actually referenced by this artifact (we can't know what it references since the manifest is malformed). The test actually verifies that orphan blobs are deleted even when a malformed manifest exists.

📝 Suggested clarification

```diff
-       // Create artifact with invalid manifest - its blobs should NOT be deleted
-       // because we can't verify what it references
+       // Create artifact with invalid manifest - we can't determine what blobs it references,
+       // so unrelated blobs can still be garbage collected
        createArtifact(t, eac, core_v1alpha.ACTIVE, "not-valid-json{{{")
```

@phinze phinze force-pushed the phinze/fix-disk-space-gc branch from a01106c to 64d4662 on February 10, 2026 17:56

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@controllers/sandbox/blob_gc_test.go`:
- Around line 237-261: Update the misleading test comment in the "handles
malformed manifest" test: the lines stating "its blobs should NOT be deleted
because we can't verify what it references" contradict the assertion that the
orphan blob is deleted; change or remove that sentence so the comment aligns
with the actual behavior asserted (the malformed manifest is skipped and the
orphan blob is deleted). Refer to the test setup that calls createArtifact(...,
"not-valid-json{{{") and the assertion against watchdog.RunBlobGC() /
result.DeletedBlobs to ensure the comment reflects that deletion is expected.

In `@controllers/sandbox/blob_gc.go`:
- Around line 65-73: When entry.Info() fails, the code adds the error to
result.FailedBlobs but doesn't account for that blob in the retained count,
breaking the invariant TotalBlobs == RetainedBlobs + len(DeletedBlobs) +
len(FailedBlobs); update the failure branch in the loop that calls entry.Info()
(where name is the blob key) to also increment the retained counter
(result.RetainedBlobs++) for errors, or alternatively add a new counter field
(e.g., result.FailedCount) and ensure TotalBlobs is computed using that field;
change either the error branch around entry.Info() or the result struct usage so
the sums remain consistent.
🧹 Nitpick comments (3)
components/buildkit/config_test.go (1)

12-12: Consider extracting magic numbers to named constants.

The parameters 10*1024*1024*1024 and 86400 could be more readable as named constants (e.g., testKeepBytes and testKeepDurationSeconds).

♻️ Proposed refactor

```diff
+const (
+	testKeepBytes    = 10 * 1024 * 1024 * 1024 // 10 GiB
+	testKeepDuration = 86400                   // 1 day in seconds
+)
+
 func TestGenerateConfig(t *testing.T) {
 	c := &Component{}
-	config := c.generateConfig(10*1024*1024*1024, 86400, "registry.example.com:5000")
+	config := c.generateConfig(testKeepBytes, testKeepDuration, "registry.example.com:5000")
```
controllers/sandbox/image_gc.go (1)

158-161: Consider adding nil-check for defensive coding.

While RunBlobGC currently always returns a non-nil result, adding a nil-check would make this code more robust against future changes.

🛡️ Defensive nil-check

```diff
 	// Run blob GC independently of image GC
 	blobResult, blobErr := w.RunBlobGC(ctx)
 	if blobErr != nil {
 		w.Log.Error("blob GC failed", "trigger", trigger, "error", blobErr)
-	} else if len(blobResult.DeletedBlobs) > 0 || len(blobResult.FailedBlobs) > 0 {
+	}
+	if blobResult == nil {
+		return
+	}
+	if len(blobResult.DeletedBlobs) > 0 || len(blobResult.FailedBlobs) > 0 {
```
controllers/sandbox/blob_gc_test.go (1)

57-63: Shared entity server may cause test pollution between subtests.

The entity server is shared across all subtests, meaning artifacts created in earlier tests persist and affect later tests. While the current tests use unique digest names and separate tmpDir for blob files, this setup is fragile. If a future test uses overlapping digest names, it could fail unexpectedly or produce false positives.

Consider either:

  1. Creating a fresh entity server per subtest
  2. Adding cleanup logic after each subtest
  3. Using table-driven tests with explicit cleanup
♻️ Option 1: Fresh entity server per subtest

```diff
 func TestRunBlobGC(t *testing.T) {
-	entServer, cleanup := entitytestutils.NewInMemEntityServer(t)
-	defer cleanup()
-
-	eac := entServer.EAC
 	log := entitytestutils.TestLogger(t)
 	oldTime := time.Now().Add(-2 * time.Hour)

 	t.Run("deletes unreferenced blobs", func(t *testing.T) {
+		entServer, cleanup := entitytestutils.NewInMemEntityServer(t)
+		defer cleanup()
+		eac := entServer.EAC
+
 		r := require.New(t)
 		tmpDir := t.TempDir()
 		// ... rest of test
```

Comment on lines 65 to 73:

```go
info, err := entry.Info()
if err != nil {
	result.FailedBlobs[name] = err
	continue
}
if now.Sub(info.ModTime()) < 1*time.Hour {
	result.RetainedBlobs++
	continue
}
```


⚠️ Potential issue | 🟡 Minor

Count inconsistency when entry.Info() fails.

When entry.Info() returns an error, the blob is added to FailedBlobs but not counted in RetainedBlobs. This creates an inconsistency where TotalBlobs != RetainedBlobs + len(DeletedBlobs) + len(FailedBlobs). Consider whether failed blobs should be counted as retained (since they weren't deleted) or if a separate counter is needed.

🔧 Proposed fix to maintain count consistency

```diff
 		info, err := entry.Info()
 		if err != nil {
 			result.FailedBlobs[name] = err
+			result.RetainedBlobs++ // Count as retained since we couldn't delete
 			continue
 		}
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```go
info, err := entry.Info()
if err != nil {
	result.FailedBlobs[name] = err
	result.RetainedBlobs++ // Count as retained since we couldn't delete
	continue
}
if now.Sub(info.ModTime()) < 1*time.Hour {
	result.RetainedBlobs++
	continue
}
```

@phinze phinze force-pushed the phinze/fix-disk-space-gc branch from 64d4662 to 0b1a473 on February 10, 2026 18:00
One of our internal clusters filled up with uncollected BuildKit cache
and orphaned registry blobs.

BuildKit's gcpolicy had filters for only source.local and
exec.cachemount, so regular build cache layers grew unbounded. Removed
the filters to make it a catch-all policy.

Registry blobs had no cleanup at all. Added blob GC to the existing
ImageWatchdog — it parses artifact manifests to find referenced digests
and deletes any blob files that aren't referenced by a non-archived
artifact. Skips recently modified blobs (<1h) to avoid racing with
concurrent uploads.
@phinze phinze force-pushed the phinze/fix-disk-space-gc branch from 0b1a473 to 3e1fe41 on February 10, 2026 20:01
@phinze phinze enabled auto-merge February 10, 2026 20:03

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@controllers/sandbox/blob_gc_test.go`:
- Around line 57-63: TestRunBlobGC currently uses a single shared entity
server/eac (created via entitytestutils.NewInMemEntityServer) for all subtests
which can cause test pollution; modify the test so each subtest gets its own
isolated server: inside the TestRunBlobGC subtests create a fresh entity server
and eac by calling entitytestutils.NewInMemEntityServer (or teardown artifacts)
per t.Run, or alternatively ensure each subtest removes any created artifacts
via eac before returning; update references to eac in the subtest closures to
use the per-subtest instance so tests like "keeps blobs referenced by active
artifacts" and "deletes unreferenced blobs" run in isolation.
🧹 Nitpick comments (1)
controllers/sandbox/blob_gc.go (1)

71-74: Consider making the grace period configurable.

The 1-hour grace period is hardcoded. While this is reasonable for the current use case, consider making it configurable via ImageGCConfig for flexibility in different deployment environments.

♻️ Optional: Make grace period configurable

In image_gc.go, add to ImageGCConfig:

```go
// BlobGracePeriod is how long to wait before deleting unreferenced blobs (default: 1h)
BlobGracePeriod time.Duration
```

Then in blob_gc.go:

```diff
-		if now.Sub(info.ModTime()) < 1*time.Hour {
+		if now.Sub(info.ModTime()) < w.Config.BlobGracePeriod {
```

Comment on lines +57 to +63:

```go
func TestRunBlobGC(t *testing.T) {
	entServer, cleanup := entitytestutils.NewInMemEntityServer(t)
	defer cleanup()

	eac := entServer.EAC
	log := entitytestutils.TestLogger(t)
	oldTime := time.Now().Add(-2 * time.Hour)
```


⚠️ Potential issue | 🟡 Minor

Potential test pollution due to shared entity server across subtests.

All subtests share the same eac (entity access client), but each subtest creates artifacts without cleanup. This means artifacts created in earlier subtests (e.g., "keeps blobs referenced by active artifacts") will still exist when later subtests run, potentially affecting results.

For example, the "deletes unreferenced blobs" test may behave differently depending on whether artifacts from other subtests are present.

Consider either:

  1. Creating a fresh InMemEntityServer per subtest, or
  2. Cleaning up artifacts after each subtest
♻️ Option 1: Fresh server per subtest

```diff
 	t.Run("deletes unreferenced blobs", func(t *testing.T) {
+		entServer, cleanup := entitytestutils.NewInMemEntityServer(t)
+		defer cleanup()
+		eac := entServer.EAC
 		r := require.New(t)
 		tmpDir := t.TempDir()
```

@phinze phinze merged commit 63b6ae8 into main Feb 10, 2026
9 checks passed
@phinze phinze deleted the phinze/fix-disk-space-gc branch February 10, 2026 20:09