Skip to content

solver: fix race in walkProvenance#6764

Merged
crazy-max merged 1 commit into
moby:masterfrom
tonistiigi:walkprovenance-fix
May 13, 2026
Merged

solver: fix race in walkProvenance#6764
crazy-max merged 1 commit into
moby:masterfrom
tonistiigi:walkprovenance-fix

Conversation

@tonistiigi
Copy link
Copy Markdown
Member

fix #6731
related to #5606

Walking back the build definition for provenance could cause a race when one of the builds used no-cache and was tracked by a modified vertex digest.

This could have caused vtx.op or vtx.op.op to be nil, or if a parallel build had created sharedOp but had not called CacheKey yet, it could have left empty pin for the source step that caused error when writing out the provenance.

Tree with more debug and analysis https://github.com/moby/buildkit/compare/master...tonistiigi:buildkit:walkprovenance-fix-long?expand=1

@cpuguy83

Walking back the build definition for provenance could
cause a race when one of the builds used no-cache and was
tracked by a modified vertex digest.

This could have caused vtx.op or vtx.op.op to be nil, or
if a parallel build had created sharedOp but had not called
CacheKey yet, it could have left empty pin for the source
step that caused error when writing out the provenance.

Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
Copy link
Copy Markdown
Member

@cpuguy83 cpuguy83 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good fix (and a good find!).
The test seems like it could use some improvements to make it more deterministic but overall seems good.

There's the case where the pin field is written to/read from without synchronization. Not sure if there's a case where that actually happens at the same time.

startedAt := time.Now()
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
t.Logf("http hit: method=%s url=%s t+%s", r.Method, r.URL.Path, time.Since(startedAt).Round(time.Millisecond))
time.Sleep(fetchDelay)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this wait for a signal (e.g. closed channel) to release and signal when it enters?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did it not repro for you? Maybe the timings are too specific to my system, which usually gets it on the first attempt.

I'm not sure exactly how to make it more precise without making this server callback much more complicated. It needs to collide the walkProvenance from one batch with this sleep (that blocks CacheKey) being stuck when when provenance walk happens. But there isn't good indicator for the provenance walk timing. The delay needs to be relatively large to make sure that a significant part of the build duration is spent in the llb.HTTP callback and not in some other place.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, it did repro for me with this test. I was just reading it and had that thought.
Seems OK.

@crazy-max crazy-max added this to the v0.30.0 milestone May 13, 2026
@crazy-max crazy-max merged commit f2cc979 into moby:master May 13, 2026
193 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

provenance: image source pin can be read empty during capture

3 participants