solver: fix race in walkProvenance#6764
Conversation
Walking back the build definition for provenance could cause a race when one of the builds used no-cache and was tracked by a modified vertex digest. This could have caused vtx.op or vtx.op.op to be nil, or if a parallel build had created sharedOp but had not called CacheKey yet, it could have left empty pin for the source step that caused error when writing out the provenance. Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
cpuguy83
left a comment
There was a problem hiding this comment.
This looks like a good fix (and a good find!).
The test seems like it could use some improvements to make it more deterministic but overall seems good.
There's the case where the pin field is written to/read from without synchronization. Not sure if there's a case where that actually happens at the same time.
| startedAt := time.Now() | ||
| server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { | ||
| t.Logf("http hit: method=%s url=%s t+%s", r.Method, r.URL.Path, time.Since(startedAt).Round(time.Millisecond)) | ||
| time.Sleep(fetchDelay) |
There was a problem hiding this comment.
Should this wait for a signal (e.g. closed channel) to release and signal when it enters?
There was a problem hiding this comment.
Did it not repro for you? Maybe the timings are too specific to my system, which usually gets it on the first attempt.
I'm not sure exactly how to make it more precise without making this server callback much more complicated. It needs to collide the walkProvenance from one batch with this sleep (that blocks CacheKey) being stuck when when provenance walk happens. But there isn't good indicator for the provenance walk timing. The delay needs to be relatively large to make sure that a significant part of the build duration is spent in the llb.HTTP callback and not in some other place.
There was a problem hiding this comment.
Oh yes, it did repro for me with this test. I was just reading it and had that thought.
Seems OK.
fix #6731
related to #5606
Walking back the build definition for provenance could cause a race when one of the builds used no-cache and was tracked by a modified vertex digest.
This could have caused vtx.op or vtx.op.op to be nil, or if a parallel build had created sharedOp but had not called CacheKey yet, it could have left empty pin for the source step that caused error when writing out the provenance.
Tree with more debug and analysis https://github.com/moby/buildkit/compare/master...tonistiigi:buildkit:walkprovenance-fix-long?expand=1
@cpuguy83