
feat(service): add node-level OCI layer deduplication and pull concurrency control #39

Open
rishi-jat wants to merge 4 commits into modelpack:main from rishi-jat:feat/layer-dedup

Conversation

@rishi-jat

Fixes #24
Fixes #35

Summary

Implement node-level OCI layer deduplication and concurrency control for concurrent model pulls.

This replaces the previous volume-scoped inflight behavior with layer-aware orchestration using:

  • digest-level singleflight
  • node-level weighted semaphore flow control
  • persistent layer metadata rebuild support
  • hardlink-based layer reuse across volumes

Concurrent pods requesting identical model content on the same node now perform exactly one network fetch per layer digest while safely reusing cached content locally.
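
For reference, the singleflight-plus-semaphore combination described above typically composes like the minimal Go sketch below. This is not the PR's actual code: fetchLayer is a stub, and the limit of 4 concurrent fetches is an arbitrary illustration.

```go
package pull

import (
	"context"

	"golang.org/x/sync/semaphore"
	"golang.org/x/sync/singleflight"
)

var (
	group singleflight.Group
	sem   = semaphore.NewWeighted(4) // assumed node-level fetch limit
)

// pullLayer returns the local path for a layer digest, performing at
// most one network fetch per digest across concurrent callers.
func pullLayer(ctx context.Context, digest string) (string, error) {
	// Do coalesces concurrent calls sharing the same key: the first
	// caller runs the function, later callers block and share its result.
	v, err, _ := group.Do(digest, func() (interface{}, error) {
		// Bound node-wide fan-out; Acquire honors ctx cancellation.
		if err := sem.Acquire(ctx, 1); err != nil {
			return nil, err
		}
		defer sem.Release(1)
		return fetchLayer(ctx, digest)
	})
	if err != nil {
		return "", err
	}
	return v.(string), nil
}

// fetchLayer is a stub standing in for the real remote blob fetch.
func fetchLayer(ctx context.Context, digest string) (string, error) {
	return "/var/cache/layers/" + digest, nil
}
```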


Changes

LayerCache

Added a node-scoped LayerCache (sketched after this list) with:

  • digest → file path tracking
  • concurrent-safe lookup/register/remove operations
  • startup rebuild from persisted layer_digests.json
  • stale cache eviction during volume cleanup
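
A minimal sketch of what such a cache could look like. The digest-to-paths map matches the lc.layers field visible in a review snippet below; the method names and JSON shape are assumptions, not the PR's exact API.

```go
package pull

import (
	"encoding/json"
	"os"
	"sync"
)

// LayerCache tracks which on-disk paths hold each layer digest.
type LayerCache struct {
	mu     sync.RWMutex
	layers map[string][]string // digest -> file paths containing that layer
}

// Lookup returns one existing path for a digest, if any.
func (lc *LayerCache) Lookup(digest string) (string, bool) {
	lc.mu.RLock()
	defer lc.mu.RUnlock()
	if paths := lc.layers[digest]; len(paths) > 0 {
		return paths[0], true
	}
	return "", false
}

// Register records a new path that now holds the digest's content.
func (lc *LayerCache) Register(digest, path string) {
	lc.mu.Lock()
	defer lc.mu.Unlock()
	if lc.layers == nil {
		lc.layers = make(map[string][]string)
	}
	lc.layers[digest] = append(lc.layers[digest], path)
}

// Rebuild restores the map from a persisted index such as
// layer_digests.json, so dedup state survives restarts.
func (lc *LayerCache) Rebuild(indexPath string) error {
	data, err := os.ReadFile(indexPath)
	if err != nil {
		return err
	}
	lc.mu.Lock()
	defer lc.mu.Unlock()
	return json.Unmarshal(data, &lc.layers)
}
```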

Layer-aware pull orchestration

Implemented manual layer orchestration in the pull path (the hardlink fallback step is sketched after this list):

  • manifest layer inspection
  • cache lookup before fetch
  • digest-level singleflight coordination
  • semaphore-based node-wide concurrency limiting
  • hardlink reuse for waiting callers
  • fallback handling for hardlink failures (EXDEV)
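
The hardlink-reuse step with its EXDEV fallback could look roughly like this sketch; copyFile is a hypothetical helper, and the PR's actual fallback may differ.

```go
package pull

import (
	"errors"
	"io"
	"os"
	"syscall"
)

// linkOrCopy reuses a cached layer via hardlink, falling back to a
// full copy when source and destination are on different filesystems.
func linkOrCopy(sourcePath, destPath string) error {
	err := os.Link(sourcePath, destPath)
	if err == nil {
		return nil
	}
	// EXDEV: hardlinks cannot cross filesystem boundaries, so copy
	// the bytes instead of failing the pull.
	if errors.Is(err, syscall.EXDEV) {
		return copyFile(sourcePath, destPath)
	}
	return err
}

// copyFile is a hypothetical helper that copies src to dst.
func copyFile(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, in)
	return err
}
```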

Progress and observability accounting

Added:

```go
Hook.LayerCached(...)
```

Cached and waiting callers now:

  • correctly advance CSI progress state
  • avoid duplicate pulled layer logs
  • avoid duplicate metrics.NodePullOpObserve increments

Only the goroutine executing the real network fetch emits pull telemetry.
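
Illustratively, the accounting split might look like the sketch below. The Hook shape and the LayerPulled name are assumptions; only LayerCached is named in the PR.

```go
package pull

// Hook is an assumed shape, not the project's actual status.Hook.
type Hook interface {
	LayerPulled(digest string, size int64) // real fetch: progress + metrics + log
	LayerCached(digest string, size int64) // cache hit / waiter: progress only
}

// finishLayer routes accounting so pull telemetry is emitted exactly
// once per digest, by the goroutine that did the network fetch.
func finishLayer(h Hook, digest string, size int64, fetchedByUs bool) {
	if fetchedByUs {
		h.LayerPulled(digest, size)
		return
	}
	// Cache hits and singleflight waiters still advance per-pod
	// progress but emit no duplicate pull metrics or logs.
	h.LayerCached(digest, size)
}
```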

Worker integration

  • wire LayerCache into worker/service lifecycle
  • rebuild cache during service startup
  • evict layer entries during model deletion

Validation

Verified with:

```bash
go test ./...
go test -race ./pkg/service/...
golangci-lint run
```

Additional coverage includes:

  • concurrent layer dedup correctness
  • semaphore release during cancellation
  • cache rebuild persistence
  • cache eviction lifecycle
  • race detection under concurrent pull workloads

rishi-jat added 4 commits May 10, 2026 05:11
Introduces a thread-safe LayerCache that provides:
- Singleflight concurrency deduplication per layer digest.
- Node-level weighted semaphore for fetch flow control.
- Restart-safe rebuild mechanisms via layer_digests.json.

Signed-off-by: Rishi Jat <rishijat098@gmail.com>
Adds the LayerCached method to status.Hook. This allows wait-paths and
cache hits to increment their individual pod UI progress counters correctly
to 100% without emitting duplicated Prometheus node metrics or generating
superfluous network completion logs.

Signed-off-by: Rishi Jat <rishijat098@gmail.com>
Replaces high-level b.Fetch() wrappers with a robust native pipeline:
- Natively parses and fetches descriptors using oras.
- Hardlinks cached layers efficiently.
- Wraps remote.Fetch within singleflight and node-level semaphores.
- Gracefully handles TOCTOU link failures, cross-device EXDEV errors,
  and transient network failures via exponential backoff retries.
- Verified data-race-free with go test -race against a new httptest
  simulation testbed.

Signed-off-by: Rishi Jat <rishijat098@gmail.com>
Initializes the global LayerCache at the service layer and threads it
down to the Worker, passing it into the NewLayerAwarePuller factory.
Registers lifecycle eviction hooks when volumes are deleted.

Signed-off-by: Rishi Jat <rishijat098@gmail.com>
Copilot AI review requested due to automatic review settings May 9, 2026 23:55
@rishi-jat (Author)

/cc @imeoer
/cc @xujihui1985
/cc @aftersnow
/cc @chlins


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a layer-level deduplication mechanism for model pulls through a new LayerCache and a layer-aware puller. The implementation uses hardlinks to reuse existing layers on disk, singleflight to deduplicate concurrent network requests for the same digest, and a node-level semaphore for flow control. Feedback identifies several critical improvements: ensuring semaphore acquisition respects context cancellation to prevent resource leaks, refining path prefix matching to avoid incorrect cache evictions, supporting digest-based image references, and correcting the logic for removing stale cache entries during hardlink failures.

Comment thread: pkg/service/puller.go

```go
if h.sem != nil {
	// Use a background context so we don't fail the pull if the parent
	// context is cancelled while waiting — the pull itself handles cancellation.
	_ = h.sem.Acquire(context.Background(), 1)
```
Severity: high

Acquiring the semaphore with context.Background() ignores the pull request's cancellation. If many pulls are cancelled while waiting for the semaphore, these goroutines will remain blocked until they acquire a permit, potentially leading to resource exhaustion. It is recommended to pass the pull context into the layerTrackingHook and use it here.
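
A sketch of the context-aware acquire this comment recommends, assuming layerTrackingHook keeps a *semaphore.Weighted in h.sem (as the quoted snippet suggests) and receives the pull's context:

```go
// acquirePermit waits for a node-level permit, giving up when the
// pull's context is cancelled instead of blocking indefinitely.
func (h *layerTrackingHook) acquirePermit(ctx context.Context) error {
	if h.sem == nil {
		return nil
	}
	// Acquire returns ctx.Err() if the pull is cancelled while queued,
	// releasing the goroutine instead of leaking it.
	return h.sem.Acquire(ctx, 1)
}
```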

```go
for d, paths := range lc.layers {
	filtered := paths[:0]
	for _, p := range paths {
		if !strings.HasPrefix(p, prefix) {
```

Severity: medium

The use of strings.HasPrefix for path matching can lead to incorrect results if one volume name is a prefix of another (e.g., /models/vol1 and /models/vol11). It is safer to ensure the prefix ends with a path separator or use a more robust path comparison logic.

Suggested change:

```diff
-if !strings.HasPrefix(p, prefix) {
+if !strings.HasPrefix(p, filepath.Clean(prefix)+string(os.PathSeparator)) && p != prefix {
```

Comment thread: pkg/service/puller.go

```go
	return errors.Wrap(err, "create remote client")
}

_, manifestReader, err := client.Manifests().FetchReference(ctx, ref.Tag())
```

Severity: medium

The ref.Tag() call might return an empty string if the model reference is digest-based (e.g., example.com/model@sha256:...). In such cases, FetchReference will likely fail. It is safer to handle both tags and digests.

Suggested change:

```diff
-_, manifestReader, err := client.Manifests().FetchReference(ctx, ref.Tag())
+target := ref.Tag()
+if target == "" {
+	target = ref.Digest().String()
+}
+_, manifestReader, err := client.Manifests().FetchReference(ctx, target)
```

Comment thread: pkg/service/puller.go

```go
if err := os.Link(sourcePath, destPath); err != nil {
	// Hardlink failed (EXDEV or source was deleted). Remove stale entry and retry fetch.
	logger.WithContext(ctx).WithError(err).Warnf("layer-aware pull: hardlink after singleflight failed for %s, retrying", desc.Digest)
	p.layerCache.RemoveByPrefix(sourcePath)
```

Severity: medium

Using RemoveByPrefix with a full file path (sourcePath) is conceptually incorrect and potentially dangerous due to the prefix matching logic. It should use Remove(sourcePath) instead. Additionally, if the error was EXDEV (cross-device link), the source file in the cache is still valid and should not be removed; removal should only happen if the source is actually missing (e.g., ENOENT).

Suggested change:

```diff
-p.layerCache.RemoveByPrefix(sourcePath)
+if os.IsNotExist(err) {
+	p.layerCache.Remove(sourcePath)
+}
```

@rishi-jat rishi-jat changed the title Feat/layer dedup feat(service): add node-level OCI layer deduplication and pull concurrency control May 9, 2026

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@imeoer (Collaborator)

imeoer commented May 15, 2026

Thanks for the PR! This is a co-review by me and GPT 5.4. :)

I reviewed both PRs (#39 and #40) in detail. Both move the project in the right direction, but they make different tradeoffs.

For PR39, the main strengths are:

  1. It covers the full feature set more explicitly. It introduces digest-level singleflight, node-level layer concurrency control, hardlink-based layer reuse, and cache rebuild support after restart.
  2. It has stronger observability semantics. The separate LayerCached path is a good idea because it avoids counting a cache hit as a real network pull.
  3. It clearly targets the original problem statement end-to-end: deduplicating concurrent layer pulls, limiting node-wide fan-out, and reducing duplicate disk usage.

For PR39, I see the following issues:

  • [Important] The implementation is significantly more complex because it partially re-implements layer orchestration in the driver instead of staying on top of the existing modctl pull flow. It manually resolves manifests, fetches blobs, and decodes layers. That gives more control, but it also creates a long-term maintenance burden and increases the chance of divergence from upstream modctl behavior.
  • [Important] The solution is not fully uniform across all pull modes. For filtered or partial pulls, it falls back to the old path, so the dedup/concurrency behavior is not consistently applied everywhere.
  • [Important] The semaphore-based throttling is count-based, not size-aware. It limits the number of in-flight layers, but it does not account for layer size, so it is not a real bandwidth control mechanism (a size-aware variant is sketched after this list).
  • [Important] One review comment is valid: acquiring the semaphore with context.Background() in the tracking hook means cancellation may not stop goroutines that are blocked waiting for permits.
  • The cache eviction logic based on strings.HasPrefix is risky, because one volume path can be a prefix of another unrelated path.
  • [Important] Manifest fetching appears to assume a tag-based reference. Digest-based references should also be handled correctly.
  • On hardlink retry failure, using RemoveByPrefix(sourcePath) is conceptually wrong and can remove too much state. If the source path is still valid and the error is only EXDEV, the cache entry should not be removed.
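
On the count-based throttling point above: a weighted semaphore can acquire by layer size rather than by count if the descriptor's size is passed through. A sketch, with an arbitrary 512 MiB budget:

```go
package pull

import (
	"context"

	"golang.org/x/sync/semaphore"
)

const maxInFlightBytes = 512 << 20 // illustrative 512 MiB budget

var byteSem = semaphore.NewWeighted(maxInFlightBytes)

// fetchWithByteBudget throttles by in-flight bytes instead of layer
// count: two 400 MiB layers cannot run concurrently, many small ones can.
func fetchWithByteBudget(ctx context.Context, layerSize int64, fetch func() error) error {
	n := layerSize
	if n > maxInFlightBytes {
		n = maxInFlightBytes // clamp so an oversized layer still runs (alone)
	}
	if err := byteSem.Acquire(ctx, n); err != nil {
		return err
	}
	defer byteSem.Release(n)
	return fetch()
}
```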

For PR40, the main strengths are:

  • The design is much simpler and fits the current codebase better. It keeps the existing pull/fetch flow and extends it through hooks instead of replacing the orchestration layer.
  • It is easier to reason about operationally. The LayerCache Acquire/Publish/Fail model is straightforward, and the ownership/hit flow is localized.
  • The persistence model is practical. Writing per-volume layers.json and rebuilding from it is a good way to preserve dedup state across restarts.
  • I especially like that cache-hit volumes also persist their own layers.json. That avoids losing dedup state when the original owner volume is removed.
  • From a maintenance perspective, this approach is less invasive and should be easier to evolve incrementally.

For PR40, I see the following issues:

  • It does not fully implement node-level layer concurrency throttling. It deduplicates same-digest work, but it does not add a global limit for different digests being pulled at the same time.
  • [Important] The target path is derived from OCI annotations and joined directly with the target directory. That introduces a path traversal risk unless the resolved path is validated to stay inside the target directory (a validation sketch follows this list).
  • [Important] Persisting layers.json synchronously on every publish, including cache hits, may create unnecessary disk I/O during large pulls.
  • Waking all waiters with cond.Broadcast on cancellation can cause unnecessary contention and a thundering herd effect.
  • The reused-layer path still deserves careful review from a metrics perspective, because “reused” and “actually pulled from remote” should remain clearly distinguishable.
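
For the path traversal point above, one common mitigation is to clean the joined path and verify it remains under the target directory. A sketch (it does not resolve symlinks, which would need extra care):

```go
package pull

import (
	"fmt"
	"path/filepath"
	"strings"
)

// secureJoin joins an annotation-derived name onto a target directory
// and rejects results that escape it, e.g. via "../" components.
func secureJoin(targetDir, name string) (string, error) {
	p := filepath.Join(targetDir, name) // Join also cleans the result
	base := filepath.Clean(targetDir)
	if p != base && !strings.HasPrefix(p, base+string(filepath.Separator)) {
		return "", fmt.Errorf("path %q escapes target directory %q", name, targetDir)
	}
	return p, nil
}
```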

Overall, I think PR39 is more ambitious and covers more of the requested functionality in one shot, especially the node-level layer concurrency control. However, it does so with a much heavier implementation that is harder to maintain and riskier to merge.

My current preference is PR40. It is simpler, closer to the existing architecture, and easier to harden incrementally. :)
