Fix issue with digest merge (inconsistent graph state) #3953
Conversation
Signed-off-by: Jacob Gillespie <jacobwgillespie@gmail.com>

Force-pushed from 3d74a27 to 4e8b42e.

solver/jobs.go (Outdated)
```diff
@@ -342,6 +342,14 @@ func (jl *Solver) loadUnlocked(v, parent Vertex, j *Job, cache map[Vertex]Vertex
 	// if same vertex is already loaded without cache just use that
 	st, ok := jl.actives[dgstWithoutCache]
 
+	if ok {
+		v = &vertexWithCacheOptions{
```
Is this equivalent to `v = st.vtx`? If yes, then add a comment that we need to set `v` because it is added to the cache map.
Probably? I'm not super familiar with the lifespan of `v` compared to `st.vtx`, but at least from a smoke test BuildKit didn't die with that change. 🙂

I've pushed this change and added a comment 👍
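The reviewer's point about the cache map can be illustrated with a toy model. This is a hedged sketch, not BuildKit's real code: `Vertex`, `vtx`, and the digest strings are simplified stand-ins, and only the `cache map[Vertex]Vertex` shape mirrors `loadUnlocked`'s parameter.

```go
package main

import "fmt"

// Vertex is a simplified stand-in for BuildKit's solver Vertex interface.
type Vertex interface{ Digest() string }

type vtx struct{ dgst string }

func (v vtx) Digest() string { return v.dgst }

func main() {
	// Toy model of loadUnlocked's cache map[Vertex]Vertex parameter:
	// whatever vertex value v holds at insertion time is what later
	// callers get back when they consult the cache.
	cache := map[Vertex]Vertex{}

	original := Vertex(vtx{dgst: "sha256:abc"})
	matched := Vertex(vtx{dgst: "sha256:abc-ignorecache"})

	// loadUnlocked reassigns v to the matched active vertex before the
	// cache insert; skipping the reassignment would record a vertex
	// whose Digest() no longer matches the key actually in actives.
	v := original
	v = matched
	cache[original] = v

	fmt.Println(cache[original].Digest()) // sha256:abc-ignorecache
}
```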
Is it reproducible enough that these steps can be turned into an integration test?
Signed-off-by: Jacob Gillespie <jacobwgillespie@gmail.com>

Force-pushed from 000c9da to 4786f7b.
Probably not - it's very much a race condition with a small target window to trigger the bug. I was able to reproduce using this code from @coryb, which runs 10 goroutines each infinitely running builds that are themselves very fast; within about 20 seconds it will error. I did some experimenting and found that it really needed to be a stress test of 10 goroutines, as even 9 or 8 on my machine took significantly longer to reproduce.
What about something like: run this 1) with cache, 2) with no-cache 3 seconds later, 3) with cache again another 4 seconds later?
That didn't reproduce it for me - it probably also depends on the timing of things that are calling
Related to #2303

This PR attempts to fix the `panic: failed to get edge` error using the reproduction posted by @coryb here: #2303 (comment).

It appears that it's possible for the digest merging logic in `loadUnlocked` to fail to accurately track which digest a vertex is using when it applies the `IgnoreCache` / `!IgnoreCache` merging logic. Specifically, a digest can get removed from the `actives` list too early if a vertex isn't tracking the correct digest.

It seems to be possible to get into this state with the following steps, in order:
1. `loadUnlocked` is called with a vertex where `IgnoreCache` is false - this will add to actives using `v.Digest()`.
2. Call `loadUnlocked` with the same vertex, but where `IgnoreCache` is true - this will add to the actives using the `%s-ignorecache` digest.
3. The original `v.Digest()` entry is removed from `actives`.
4. Call `loadUnlocked` with the same vertex - this will match against the `%s-ignorecache` digest in actives and not add any digests.
5. Something calls `solver.getEdge`, which tries to look up `actives[e.Vertex.Digest()]`, however that digest will be the original one, not the `%s-ignorecache` digest. This errors with the `inconsistent graph state` error.

This PR attempts to solve the issue by returning a new `vertexWithCacheOptions` when matching the `dgstWithoutCache`, so that the vertex properly references that digest.

This appears to solve the reproduction from #2303.
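The failure and the fix can be modeled in a few lines of standalone Go. This is a simplified sketch, not BuildKit's implementation: `Vertex`, `vtx`, and the `actives` map here are stand-ins, and only the `vertexWithCacheOptions` name mirrors the PR.

```go
package main

import "fmt"

// Vertex is a simplified stand-in for BuildKit's solver Vertex interface.
type Vertex interface{ Digest() string }

type vtx struct{ dgst string }

func (v vtx) Digest() string { return v.dgst }

// vertexWithCacheOptions wraps a vertex so that Digest() reports the
// digest actually used as the actives key (a stand-in for the real type).
type vertexWithCacheOptions struct {
	Vertex
	dgst string
}

func (v vertexWithCacheOptions) Digest() string { return v.dgst }

func main() {
	actives := map[string]Vertex{}

	v := vtx{dgst: "sha256:abc"}
	dgstWithoutCache := fmt.Sprintf("%s-ignorecache", v.Digest())

	// After steps 2-3 of the scenario above: the vertex is registered
	// only under the -ignorecache digest; the original entry is gone.
	actives[dgstWithoutCache] = v

	// Before the fix: getEdge looks up actives[v.Digest()] and misses,
	// which surfaces as the "inconsistent graph state" error.
	_, ok := actives[v.Digest()]
	fmt.Println("lookup by original digest:", ok) // false

	// The fix: when loadUnlocked matches dgstWithoutCache, it returns a
	// wrapper whose Digest() is the key actually present in actives.
	fixed := vertexWithCacheOptions{Vertex: v, dgst: dgstWithoutCache}
	_, ok = actives[fixed.Digest()]
	fmt.Println("lookup after fix:", ok) // true
}
```

The key property is that every later lookup keyed on `Digest()` now agrees with the key under which the vertex was registered.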