
Fix issue with digest merge (inconsistent graph state) #3953

Merged: 2 commits into moby:master on Jun 19, 2023

Conversation

jacobwgillespie (Contributor)

Related to #2303


This PR attempts to fix the "panic: failed to get edge" error, using the reproduction posted by @coryb here: #2303 (comment).

It appears that it's possible for the digest merging logic in loadUnlocked to fail to accurately track which digest a vertex is using when it applies the IgnoreCache / !IgnoreCache merging logic. Specifically, a digest can get removed from the actives list too early if a vertex isn't tracking the correct digest.

It seems to be possible to get into this state with the following steps, in order (a simplified sketch follows the list):

  1. loadUnlocked is called with a vertex where IgnoreCache is false - this will add to actives using v.Digest().
  2. While that digest is active, another job calls loadUnlocked with the same vertex, but where IgnoreCache is true - this will add to the actives using the %s-ignorecache digest.
  3. The job owning the first vertex is discarded and the digest is removed from actives.
  4. A third job calls loadUnlocked with the same vertex - this will match against the %s-ignorecache digest in actives and not add any digests.
  5. That third job will eventually call solver.getEdge, which tries to look up actives[e.Vertex.Digest()], however that digest will be the original one, not the %s-ignorecache digest. This errors with the inconsistent graph state error.
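
To make the sequence concrete, here is a deliberately toy model of those steps in Go. The map, digest strings, and structs below are illustrative stand-ins only, not BuildKit's actual solver types:

```go
package main

import "fmt"

// Toy stand-ins for the solver's state: actives maps a digest string to a
// loaded state, and each vertex remembers the digest it reports.
type vertex struct{ digest string }
type state struct{ vtx *vertex }

func main() {
	actives := map[string]*state{}

	// Step 1: a job loads the vertex with IgnoreCache=false, keyed by its plain digest.
	plain := "sha256:abc"
	actives[plain] = &state{vtx: &vertex{digest: plain}}

	// Step 2: another job loads the same vertex with IgnoreCache=true, keyed
	// by the "-ignorecache" digest.
	ignore := plain + "-ignorecache"
	actives[ignore] = &state{vtx: &vertex{digest: ignore}}

	// Step 3: the first job is discarded, so the plain digest is removed.
	delete(actives, plain)

	// Step 4: a third job loads the vertex; it matches the "-ignorecache"
	// entry, so nothing new is added - but its vertex still reports the
	// plain digest.
	third := &vertex{digest: plain}

	// Step 5: a getEdge-style lookup by the vertex's own digest now misses.
	if _, ok := actives[third.digest]; !ok {
		fmt.Println("inconsistent graph state: digest not in actives")
	}
}
```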

This PR attempts to solve the issue by returning a new vertexWithCacheOptions when matching the dgstWithoutCache, so that the vertex properly references that digest.

This appears to solve the reproduction from #2303.
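
Sketching that idea with assumed names (the field names below are illustration only and may not match the real vertexWithCacheOptions struct or the PR's exact diff):

```go
// Hypothetical sketch: if the vertex is already active under the
// -ignorecache digest, wrap it so its Digest() reports that tracked digest
// and later getEdge lookups hit the right entry in actives.
if _, ok := jl.actives[dgstWithoutCache]; ok {
	v = &vertexWithCacheOptions{
		Vertex: v,                // assumed embedded field wrapping the original vertex
		dgst:   dgstWithoutCache, // assumed field consulted by Digest()
	}
}
```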

Signed-off-by: Jacob Gillespie <jacobwgillespie@gmail.com>
solver/jobs.go (Outdated)

```go
@@ -342,6 +342,14 @@ func (jl *Solver) loadUnlocked(v, parent Vertex, j *Job, cache map[Vertex]Vertex
	// if same vertex is already loaded without cache just use that
	st, ok := jl.actives[dgstWithoutCache]

	if ok {
		v = &vertexWithCacheOptions{
```
Member
Is this equivalent to v = st.vtx?

If yes, then add a comment that we need to set v because it is added to the cache map.

Contributor Author

Probably? I'm not super familiar with the lifespan of v compared to st.vtx, but at least from a smoke test BuildKit didn't die with that change. 🙂

I've pushed this change and added a comment 👍
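
For context, the direction the review suggests looks roughly like this - a sketch based on the diff and discussion above, not necessarily the exact merged code:

```go
// if same vertex is already loaded without cache just use that
st, ok := jl.actives[dgstWithoutCache]
if ok {
	// v must be reassigned: the caller adds v to its cache map, so it has
	// to be the vertex that actives actually tracks under the -ignorecache
	// digest, not one still carrying the original digest.
	v = st.vtx
}
```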

@tonistiigi (Member)

It seems to be possible to get into this state with the following steps in order:

Is it reproducible enough that these steps can be turned into an integration test?

Signed-off-by: Jacob Gillespie <jacobwgillespie@gmail.com>
@jacobwgillespie (Contributor Author) commented on Jun 16, 2023

Is it reproducible enough that these steps can be turned into an integration test?

Probably not - it's very much a race condition with a small target window for triggering the bug. I was able to reproduce it using this code from @coryb, which runs 10 goroutines that each loop very fast builds indefinitely; within about 20 seconds it errors. From some experimenting, it really needed to be a stress test of 10 goroutines, as even 9 or 8 on my machine took significantly longer to reproduce.
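
For reference, the shape of that stress test as a hypothetical sketch - doBuild below is a placeholder for whatever issues one fast build against the daemon and is not part of the actual reproduction code:

```go
package main

import "log"

// doBuild is a hypothetical placeholder for running one very fast build
// against the BuildKit daemon; the real reproduction uses @coryb's code.
func doBuild() error { return nil }

func main() {
	errCh := make(chan error, 1)
	// Around 10 goroutines, each looping builds back to back, were needed
	// to hit the race; fewer took much longer to reproduce.
	for i := 0; i < 10; i++ {
		go func() {
			for {
				if err := doBuild(); err != nil {
					select {
					case errCh <- err:
					default:
					}
					return
				}
			}
		}()
	}
	// With the bug present, the "failed to get edge" error surfaced within
	// about 20 seconds.
	log.Fatal(<-errCh)
}
```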

@tonistiigi (Member) commented on Jun 16, 2023

What if something like:

```dockerfile
RUN sleep 6
RUN true
```

If you run this 1) with cache, 2) with no-cache 3 seconds later, and 3) with cache again another 4 seconds later?

@jacobwgillespie (Contributor Author)

That didn't reproduce it for me - it probably also depends on the timing of whatever is calling getEdge (which I believe is related to createInputRequests); there's definitely some specific timing that needs to be hit.

@tonistiigi merged commit a27d16c into moby:master on Jun 19, 2023 (52 of 53 checks passed).
@jacobwgillespie deleted the load-unlocked-bug branch on June 20, 2023 at 03:37.