Fix disk space GC for BuildKit cache and registry blobs#588
Conversation
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@components/buildkit/config_test.go`:
- Around line 24-25: The test incorrectly asserts presence of the substring
using idx := strings.Index(config, "[[worker.oci.gcpolicy]]") followed by
r.Greater(idx, 0), which fails if the match is at index 0; change the assertion
to check for non-negative/exists (for example, use r.NotEqual(-1, idx) or
r.GreaterOrEqual(idx, 0)) so the test correctly passes when the substring is
found at the start; update the assertion call in
components/buildkit/config_test.go replacing r.Greater(idx, 0) with the chosen
check.
In `@controllers/sandbox/blob_gc.go`:
- Around line 117-122: The comment is misleading: when json.Unmarshal of
art.Manifest fails, the code logs a warning and continues, which results in that
artifact contributing no entries to the digests map so blobs only referenced by
this malformed artifact may be deleted; update the inline comment near the
json.Unmarshal block (referencing art.Manifest, manifest, digests, and the
surrounding blob cleanup logic) to clearly state this intentional tradeoff
(favoring GC availability over safety), and mention that malformed manifests are
treated as having no digests and therefore could allow deletion of blobs
uniquely referenced by that artifact.
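To make the tradeoff concrete, here is a self-contained sketch of the unmarshal-and-continue pattern the comment should document; `collectDigests` and the trimmed-down `manifest` struct are hypothetical stand-ins for the real artifact types, and only the `digests` map and the skip-on-error behavior mirror the code under review:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// manifest is a simplified stand-in for an OCI-style image manifest.
type manifest struct {
	Layers []struct {
		Digest string `json:"digest"`
	} `json:"layers"`
}

// collectDigests gathers every digest referenced by the given raw manifests.
// A manifest that fails to parse contributes no digests: that is the
// intentional tradeoff, favoring GC availability over safety, so blobs
// referenced only by a malformed manifest may be deleted.
func collectDigests(manifests [][]byte) map[string]bool {
	digests := make(map[string]bool)
	for _, raw := range manifests {
		var m manifest
		if err := json.Unmarshal(raw, &m); err != nil {
			fmt.Println("warning: skipping malformed manifest:", err)
			continue
		}
		for _, l := range m.Layers {
			digests[l.Digest] = true
		}
	}
	return digests
}

func main() {
	out := collectDigests([][]byte{
		[]byte(`{"layers":[{"digest":"sha256:aaa"}]}`),
		[]byte(`not-valid-json{{{`), // contributes nothing
	})
	fmt.Println(len(out)) // 1
}
```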
🧹 Nitpick comments (3)
controllers/sandbox/blob_gc.go (1)
61-87: Consider filtering for regular files only.

If the `blobs` directory ever contains subdirectories (e.g., sharded storage like `sha256/<prefix>/<hash>`), `os.Remove` will fail for non-empty directories. This could inflate `TotalBlobs` and populate `FailedBlobs` with confusing errors. If the current registry implementation guarantees a flat file structure, this is fine as-is. Otherwise, consider filtering:

♻️ Suggested defensive check

```diff
 for _, entry := range entries {
 	name := entry.Name()
+	if entry.IsDir() {
+		continue
+	}
 	// Skip recently modified files to guard against concurrent uploads
```

controllers/sandbox/blob_gc_test.go (2)
57-63: Test isolation: artifacts persist across subtests.

The entity server is shared across all subtests, so artifacts created in earlier subtests remain visible in later ones. This works because each subtest uses unique digest names, but it's fragile: accidental name collisions could cause intermittent failures.

Consider either creating a fresh entity server per subtest, or using `t.Cleanup` to remove artifacts after each subtest.
243-245: Clarify test comment to avoid confusion.

The comment states "its blobs should NOT be deleted", but the blob `sha256:someblob` isn't actually referenced by this artifact (we can't know what it references since the manifest is malformed). The test actually verifies that orphan blobs are deleted even when a malformed manifest exists.

📝 Suggested clarification

```diff
-	// Create artifact with invalid manifest - its blobs should NOT be deleted
-	// because we can't verify what it references
+	// Create artifact with invalid manifest - we can't determine what blobs it references,
+	// so unrelated blobs can still be garbage collected
 	createArtifact(t, eac, core_v1alpha.ACTIVE, "not-valid-json{{{")
```
a01106c to 64d4662 (force-pushed)
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@controllers/sandbox/blob_gc_test.go`:
- Around line 237-261: Update the misleading test comment in the "handles
malformed manifest" test: the lines stating "its blobs should NOT be deleted
because we can't verify what it references" contradict the assertion that the
orphan blob is deleted; change or remove that sentence so the comment aligns
with the actual behavior asserted (the malformed manifest is skipped and the
orphan blob is deleted). Refer to the test setup that calls createArtifact(...,
"not-valid-json{{{") and the assertion against watchdog.RunBlobGC() /
result.DeletedBlobs to ensure the comment reflects that deletion is expected.
In `@controllers/sandbox/blob_gc.go`:
- Around line 65-73: When entry.Info() fails, the code adds the error to
result.FailedBlobs but doesn't account for that blob in the retained count,
breaking the invariant TotalBlobs == RetainedBlobs + len(DeletedBlobs) +
len(FailedBlobs); update the failure branch in the loop that calls entry.Info()
(where name is the blob key) to also increment the retained counter
(result.RetainedBlobs++) for errors, or alternatively add a new counter field
(e.g., result.FailedCount) and ensure TotalBlobs is computed using that field;
change either the error branch around entry.Info() or the result struct usage so
the sums remain consistent.
🧹 Nitpick comments (3)
components/buildkit/config_test.go (1)
12-12: Consider extracting magic numbers to named constants.

The parameters `10*1024*1024*1024` and `86400` could be more readable as named constants (e.g., `testKeepBytes` and `testKeepDurationSeconds`).

♻️ Proposed refactor

```diff
+const (
+	testKeepBytes    = 10 * 1024 * 1024 * 1024 // 10 GiB
+	testKeepDuration = 86400                   // 1 day in seconds
+)
+
 func TestGenerateConfig(t *testing.T) {
 	c := &Component{}
-	config := c.generateConfig(10*1024*1024*1024, 86400, "registry.example.com:5000")
+	config := c.generateConfig(testKeepBytes, testKeepDuration, "registry.example.com:5000")
```

controllers/sandbox/image_gc.go (1)
158-161: Consider adding nil-check for defensive coding.

While `RunBlobGC` currently always returns a non-nil result, adding a nil-check would make this code more robust against future changes.

🛡️ Defensive nil-check

```diff
 // Run blob GC independently of image GC
 blobResult, blobErr := w.RunBlobGC(ctx)
 if blobErr != nil {
 	w.Log.Error("blob GC failed", "trigger", trigger, "error", blobErr)
+}
+if blobResult == nil {
+	return
+}
+if len(blobResult.DeletedBlobs) > 0 || len(blobResult.FailedBlobs) > 0 {
-} else if len(blobResult.DeletedBlobs) > 0 || len(blobResult.FailedBlobs) > 0 {
```

controllers/sandbox/blob_gc_test.go (1)
57-63: Shared entity server may cause test pollution between subtests.

The entity server is shared across all subtests, meaning artifacts created in earlier tests persist and affect later tests. While the current tests use unique digest names and a separate `tmpDir` for blob files, this setup is fragile. If a future test uses overlapping digest names, it could fail unexpectedly or produce false positives.

Consider either:

- Creating a fresh entity server per subtest
- Adding cleanup logic after each subtest
- Using table-driven tests with explicit cleanup

♻️ Option 1: Fresh entity server per subtest

```diff
 func TestRunBlobGC(t *testing.T) {
-	entServer, cleanup := entitytestutils.NewInMemEntityServer(t)
-	defer cleanup()
-
-	eac := entServer.EAC
 	log := entitytestutils.TestLogger(t)
 	oldTime := time.Now().Add(-2 * time.Hour)

 	t.Run("deletes unreferenced blobs", func(t *testing.T) {
+		entServer, cleanup := entitytestutils.NewInMemEntityServer(t)
+		defer cleanup()
+		eac := entServer.EAC
+
 		r := require.New(t)
 		tmpDir := t.TempDir()
 		// ... rest of test
```
```go
info, err := entry.Info()
if err != nil {
	result.FailedBlobs[name] = err
	continue
}
if now.Sub(info.ModTime()) < 1*time.Hour {
	result.RetainedBlobs++
	continue
}
```
Count inconsistency when entry.Info() fails.
When entry.Info() returns an error, the blob is added to FailedBlobs but not counted in RetainedBlobs. This creates an inconsistency where TotalBlobs != RetainedBlobs + len(DeletedBlobs) + len(FailedBlobs). Consider whether failed blobs should be counted as retained (since they weren't deleted) or if a separate counter is needed.
🔧 Proposed fix to maintain count consistency
```diff
 info, err := entry.Info()
 if err != nil {
 	result.FailedBlobs[name] = err
+	result.RetainedBlobs++ // Count as retained since we couldn't delete
 	continue
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```go
info, err := entry.Info()
if err != nil {
	result.FailedBlobs[name] = err
	result.RetainedBlobs++ // Count as retained since we couldn't delete
	continue
}
if now.Sub(info.ModTime()) < 1*time.Hour {
	result.RetainedBlobs++
	continue
}
```
64d4662 to 0b1a473 (force-pushed)
One of our internal clusters filled up with uncollected BuildKit cache and orphaned registry blobs.

BuildKit's gcpolicy had filters for only `source.local` and `exec.cachemount`, so regular build cache layers grew unbounded. Removed the filters to make it a catch-all policy.

Registry blobs had no cleanup at all. Added blob GC to the existing ImageWatchdog: it parses artifact manifests to find referenced digests and deletes any blob files that aren't referenced by a non-archived artifact. Skips recently modified blobs (<1h) to avoid racing with concurrent uploads.
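For reference, a catch-all policy of the kind described looks roughly like this in `buildkitd.toml` (the size and duration values here are illustrative, not necessarily the ones shipped):

```toml
[worker.oci]
  gc = true

  # Before this change, the policy carried filters limiting it to
  # source.local and exec.cachemount records, so ordinary build-cache
  # layers were never collected. With no filters, it applies to all records.
  [[worker.oci.gcpolicy]]
    keepBytes = 10737418240  # 10 GiB
    keepDuration = 86400     # 1 day
```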
0b1a473 to 3e1fe41 (force-pushed)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@controllers/sandbox/blob_gc_test.go`:
- Around line 57-63: TestRunBlobGC currently uses a single shared entity
server/eac (created via entitytestutils.NewInMemEntityServer) for all subtests
which can cause test pollution; modify the test so each subtest gets its own
isolated server: inside the TestRunBlobGC subtests create a fresh entity server
and eac by calling entitytestutils.NewInMemEntityServer (or teardown artifacts)
per t.Run, or alternatively ensure each subtest removes any created artifacts
via eac before returning; update references to eac in the subtest closures to
use the per-subtest instance so tests like "keeps blobs referenced by active
artifacts" and "deletes unreferenced blobs" run in isolation.
🧹 Nitpick comments (1)
controllers/sandbox/blob_gc.go (1)
71-74: Consider making the grace period configurable.

The 1-hour grace period is hardcoded. While this is reasonable for the current use case, consider making it configurable via `ImageGCConfig` for flexibility in different deployment environments.

♻️ Optional: Make grace period configurable

In `image_gc.go`, add to `ImageGCConfig`:

```go
// BlobGracePeriod is how long to wait before deleting unreferenced blobs (default: 1h)
BlobGracePeriod time.Duration
```

Then in `blob_gc.go`:

```diff
-	if now.Sub(info.ModTime()) < 1*time.Hour {
+	if now.Sub(info.ModTime()) < w.Config.BlobGracePeriod {
```
```go
func TestRunBlobGC(t *testing.T) {
	entServer, cleanup := entitytestutils.NewInMemEntityServer(t)
	defer cleanup()

	eac := entServer.EAC
	log := entitytestutils.TestLogger(t)
	oldTime := time.Now().Add(-2 * time.Hour)
```
Potential test pollution due to shared entity server across subtests.
All subtests share the same eac (entity access client), but each subtest creates artifacts without cleanup. This means artifacts created in earlier subtests (e.g., "keeps blobs referenced by active artifacts") will still exist when later subtests run, potentially affecting results.
For example, the "deletes unreferenced blobs" test may behave differently depending on whether artifacts from other subtests are present.
Consider either:
- Creating a fresh `InMemEntityServer` per subtest, or
- Cleaning up artifacts after each subtest
♻️ Option 1: Fresh server per subtest
```diff
 t.Run("deletes unreferenced blobs", func(t *testing.T) {
+	entServer, cleanup := entitytestutils.NewInMemEntityServer(t)
+	defer cleanup()
+	eac := entServer.EAC
 	r := require.New(t)
 	tmpDir := t.TempDir()
```
One of our internal clusters filled up: uncollected BuildKit cache and orphaned registry blobs with no cleanup path.

BuildKit cache: The gcpolicy only had filters for `source.local` and `exec.cachemount`, so regular build layers never got collected. Removed the filters to make it a catch-all; the filtered types were a subset anyway.

Registry blobs: Artifact GC archives artifacts and image GC cleans up containerd images, but the actual blob files on disk were never reaped. Added blob GC to the existing ImageWatchdog so it runs on the same schedule (pressure checks + weekly). It collects digests referenced by non-archived artifacts and deletes everything else, with a 1-hour grace period for in-flight uploads.