Add mark-and-sweep GC for shared OCI cache#199

Merged: sjmiller609 merged 7 commits into main from hypeship/oci-cache-gc on May 7, 2026

@sjmiller609 sjmiller609 commented Apr 23, 2026

Summary

The shared OCI cache at data_dir/system/oci-cache currently grows
without bound: neither the pull path (layout.AppendImage) nor the
registry push path (BlobStore.Put) ever removes blobs, and the image
retention controller only touches data_dir/images. Over time this
accumulates dead manifest, config, and layer blobs that are no longer
reachable from index.json.

This change adds a new lib/ocicachegc package that walks index.json
and every referenced manifest to build the set of live blob digests,
then deletes any file under blobs/sha256/ that isn't in that set.
Blobs whose mtime is within the configured min_blob_age are always
kept; that grace period is what lets the sweep run safely alongside
concurrent pulls (which write layer blobs before updating index.json)
and registry pushes (which rename <hex>.tmp to <hex> before the
manifest trigger).

Config

Disabled by default. Opt-in via:

images:
  oci_cache_gc:
    enabled: true
    interval: 1h
    min_blob_age: 1h

How it decides what's live

  1. Read index.json.
  2. For each descriptor, record its digest and read the referenced blob.
    If the blob is a manifest or manifest index, recurse into its
    config, layers, manifests, and subject references.
  3. Anything not in the resulting set is eligible for sweep.

Unparseable or missing referenced blobs are treated as opaque leaves —
they remain "live" but we don't descend into them. The collector never
deletes a blob it cannot prove is dead.

.tmp files and anything whose name is not a 64-hex-char blob digest
are ignored by the sweep entirely.

Metrics

  • hypeman_oci_cache_gc_sweeps_total (counter, status)
  • hypeman_oci_cache_gc_sweep_duration_seconds (histogram)
  • hypeman_oci_cache_gc_deleted_blobs_total (counter)
  • hypeman_oci_cache_gc_deleted_bytes_total (counter)

Test plan

  • go test ./lib/ocicachegc/... passes (live set kept, orphans deleted, grace period honored, tmp/non-blob filenames ignored, manifest index traversal)
  • go test ./cmd/api/config/... passes (new duration validators)
  • go test ./lib/imageretention/... passes (unchanged)
  • go build ./cmd/api/... clean
  • go vet ./... clean
  • Enable on a dev node, run some pulls + a pull failure mid-flight, confirm the GC reclaims disk without touching in-flight blobs
  • Check metrics are scraped in SigNoz

Manual validation

  • On deft-kernel-dev, ran the real hypeman binary from a fresh scratch clone with images.oci_cache_gc.enabled: true and an isolated temp data_dir.
  • Seeded data_dir/system/oci-cache with one live manifest/config/layer set, one old orphan blob, and one recent orphan blob.
  • Observed startup logs for oci cache gc enabled, oci cache gc started, and deleted unreferenced oci blob for the old orphan digest.
  • Verified on disk that the old orphan blob was deleted, while the live blobs and the recent orphan blob remained.
  • Also ran the Deft CI-like Linux flow from a fresh clone: go mod download, make oapi-generate, make build, go run ./cmd/test-prewarm, go test -count=1 -tags containers_image_openpgp -timeout=20m ./... (pass, 300s).

Note

Medium Risk
Introduces a new background garbage collector that deletes files from data_dir/system/oci-cache, which could remove needed blobs if liveness/age rules are wrong or misconfigured. Mitigated by opt-in config defaults, grace period (min_blob_age), conservative parsing behavior, and extensive tests.

Overview
Adds an opt-in mark-and-sweep garbage collector for the shared OCI cache (data_dir/system/oci-cache) via new images.oci_cache_gc config (enabled, interval, min_blob_age) with validation and updated example configs.

Implements lib/ocicachegc to compute a live blob set by walking index.json (plus optional extra roots) and deleting unreferenced blobs/sha256/<digest> files older than the grace period, with OTel metrics/tracing and a single-sweep-at-a-time lock.

Wires the collector into cmd/api/main.go as a background service when enabled, and extends lib/registry to track BuildKit cache/* tag digests in-memory (LiveCacheManifestDigests) so GC treats them as additional roots; adds targeted unit tests for config, wiring, registry cache-tag tracking, and GC behavior.

Reviewed by Cursor Bugbot for commit 5554c9d.

sjmiller609 and others added 2 commits April 23, 2026 20:34
The shared OCI cache at data_dir/system/oci-cache grew without bound
because neither the pull path nor the registry push path had a cleanup
hook. The image retention controller only touches data_dir/images, so
manifests and layer blobs that were no longer referenced lived forever.

This change adds a new lib/ocicachegc package that walks index.json and
every referenced manifest to build the live set of blob digests, then
deletes any file under blobs/sha256/ that is not in that set. Blobs
whose mtime is within the configured min_blob_age are kept; this grace
period is what lets the sweep run safely alongside concurrent pulls
(which write layer blobs before updating index.json) and registry
pushes.

Disabled by default. Enable via:

    images:
      oci_cache_gc:
        enabled: true
        interval: 1h
        min_blob_age: 1h
@sjmiller609 sjmiller609 requested a review from hiroTamada April 23, 2026 21:40
@sjmiller609 sjmiller609 marked this pull request as ready for review April 23, 2026 21:40
@firetiger-agent

Firetiger deploy monitoring skipped

This PR didn't match the auto-monitor filter configured on your GitHub connection:

Any PR that changes the kernel API. Monitor changes to API endpoints (packages/api/cmd/api/) and Temporal workflows (packages/api/lib/temporal) in the kernel repo

Reason: PR adds a new garbage collection package for OCI cache management, but does not modify API endpoints (packages/api/cmd/api/) or Temporal workflows (packages/api/lib/temporal), which are the specific areas the filter requires for monitoring.

To monitor this PR anyway, reply with @firetiger monitor this.

Comment thread lib/ocicachegc/gc.go
sjmiller609 and others added 2 commits April 24, 2026 19:30
Previously walkDescriptor added subject.Digest to the live set as a
leaf without descending, so the subject manifest's own config and
layers could be swept. Recurse like manifests[] so the full referrer
chain stays marked.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment thread lib/ocicachegc/gc.go Outdated
The registry stores manifest and layer blobs for cache/* pushes in the
shared OCI blob dir but skips triggerConversion, so those blobs are
never rooted in index.json. With GC enabled this caused the sweep to
delete cache blobs the registry was still serving from its in-memory
tag map, breaking BuildKit cache exports.

Track cache/* tag -> manifest digest in the registry and expose the
set via LiveCacheManifestDigests. The GC takes a RootsProvider; on
every sweep it walks those manifests' configs and layers as
additional roots alongside index.json.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
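
The cache-tag tracking described in this commit could be sketched as below. RootsProvider and LiveCacheManifestDigests are names from the PR, but the method signature, the cacheTags type, the Put method, and the assumption that cache pushes are identified by a "cache/" repository prefix are illustrative guesses.

```go
package main

import (
	"sort"
	"strings"
	"sync"
)

// RootsProvider supplies extra manifest digests for the GC to walk as roots.
// The method signature is an assumption.
type RootsProvider interface {
	LiveCacheManifestDigests() []string
}

// cacheTags is a hypothetical in-memory map of cache/* tags to manifest digests.
type cacheTags struct {
	mu   sync.RWMutex
	tags map[string]string // "repo:tag" -> manifest digest
}

var _ RootsProvider = (*cacheTags)(nil)

func newCacheTags() *cacheTags {
	return &cacheTags{tags: make(map[string]string)}
}

// Put records the digest for a pushed tag; non-cache repositories are ignored
// because their manifests are expected to be rooted through index.json.
func (c *cacheTags) Put(repo, tag, digest string) {
	if !strings.HasPrefix(repo, "cache/") {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.tags[repo+":"+tag] = digest
}

// LiveCacheManifestDigests returns the distinct digests still referenced by a
// cache tag, sorted for stable output.
func (c *cacheTags) LiveCacheManifestDigests() []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	seen := make(map[string]struct{}, len(c.tags))
	for _, d := range c.tags {
		seen[d] = struct{}{}
	}
	out := make([]string, 0, len(seen))
	for d := range seen {
		out = append(out, d)
	}
	sort.Strings(out)
	return out
}
```

Because a re-pushed tag overwrites its map entry, the previous manifest stops being a root on the next sweep unless another tag still points at it.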

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 8507ca0.

Comment thread lib/registry/registry.go
Comment thread lib/ocicachegc/gc.go
sjmiller609 and others added 2 commits May 7, 2026 22:25
OCI v1.1 lets the index itself carry a subject descriptor. liveBlobs
only iterated index.Manifests, so a blob reachable solely via the
index-level subject was never marked and could be swept.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a tracer to the collector with spans around the sweep, the mark
phase, and the blob sweep loop, plus span attributes capturing live,
scanned, deleted, and skipped-recent counts. Records live blob count
per successful sweep as a histogram metric so cache size is observable
from metrics alone. Demotes the per-blob delete log to DEBUG (an
ongoing maintenance event) and promotes the sweep summary to INFO only
when blobs were actually deleted, leaving idle sweeps at DEBUG.
@sjmiller609 sjmiller609 merged commit 16710cd into main May 7, 2026
11 checks passed
@sjmiller609 sjmiller609 deleted the hypeship/oci-cache-gc branch May 7, 2026 23:58

2 participants