remotecache/v1.normalizeItem: panic: assignment to entry in nil map #2296
Hi!
The 3 panics have in common that contentCacheExporter.chains.Marshal() was being called from two different goroutines, for two different images that very likely have layers in common. Any hint on what approach should be taken here? Should there be a mutex around that call?
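For context on the panic in the title: in Go, a map that has been declared but never initialized is nil, and writing to it fails with exactly this message; a data race that lets one goroutine see a struct before its maps are initialized shows the same symptom. A minimal, standalone illustration (not buildkit code):

```go
package main

func main() {
	// A declared-but-uninitialized map is nil. Reading from it is fine,
	// but assigning to it panics.
	var byDigest map[string]int
	_ = byDigest["sha256:abc"] // ok: reads on a nil map return the zero value
	byDigest["sha256:abc"] = 1 // panic: assignment to entry in nil map
}
```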
I know more about why those crashes are happening now. Any correction/adjustment to these explanations is more than welcome :)

Where it fails: When/if exporting the build cache, we call Finalize() on the corresponding […]

How it fails: The […] There is no code path where […]

Why it fails: The call […]

@tonistiigi kindly pinging you, as I don't know whether you had seen this issue so far, and I hope these explanations can eventually help you understand what's wrong with the code. For the time being, we added a global mutex around normalization (buildkit/cache/remotecache/v1/chains.go, line 60 in 175e841). No crashes in the past 7 days so far.
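The stopgap described above can be pictured roughly as follows; this is a simplified sketch with illustrative names, not the actual chains.go change:

```go
package cacheexport

import "sync"

// normalizeMu is a single process-wide lock. Holding it for the whole
// Marshal() call means two exporters can never normalize items of
// overlapping chains at the same time, at the cost of serializing all
// cache exports.
var normalizeMu sync.Mutex

// marshaler stands in for the cache-chain type; the real buildkit
// interface and signatures differ.
type marshaler interface {
	Marshal() ([]byte, error)
}

func marshalSerialized(c marshaler) ([]byte, error) {
	normalizeMu.Lock()
	defer normalizeMu.Unlock()
	return c.Marshal()
}
```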
What is the case where this can happen? Every build should have its own […]
I guess it is that […] (see lines 183 to 185 in a0afb69).
Maybe same construction as […]
Thanks @tonistiigi! Will try out something in that direction and let you know.
I spent quite some time trying to make it work, but it doesn't seem reliable and I don't understand why yet. When I store the CacheRecords in the Context using a […] I tried to transform […] Also, I noticed that the amount of […]

I was able to reproduce the problem instead of waiting for it, by executing the following command twice using the input-broadcast feature of my terminal (it works maybe around 1 in 5 times): […]
The cache is composed of 15-16 layers.
Please post your code, and steps for a reproducer if you have them.
Do you have a stacktrace for that?
Here is the code with the original suggestion, and a couple of additional log lines: […]

This is a Dockerfile that seems good enough as a reproducer:

```dockerfile
# BASE
FROM busybox AS base
WORKDIR /app/
COPY ./Dockerfile /app/
RUN cat Dockerfile > /app/done-something-in-base
# SOURCE CODE
FROM base as source-code
COPY ./Dockerfile /app/
# BUILDER
FROM source-code AS builder
RUN cat Dockerfile > /app/done-something-in-builder
# TEST
FROM busybox AS test
WORKDIR /app/
COPY --from=base /app/done-something-in-base ./Dockerfile-based
COPY --from=builder /app/done-something-in-builder ./Dockerfile-builder
CMD uptime
```

Here are the steps to reproduce: […]
The stack trace that is generated has a finite size, approximately the number of items in the CacheChain. Storing CacheRecords by Exporter makes sense, I think, because yesterday I saw a couple of "foreign" items coming from an […]
@maxlaverse That's not quite what I had in mind. See/test #2410. Using the context var itself makes sure that maps are separate per caller. I still haven't been able to repro myself: I ran your instructions on unmodified master and with my patch, and it worked every time. I didn't test your patch, as returning cached records for a different edge does not look right, so I'm not surprised that it causes issues.
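The per-caller-map idea can be sketched like this; the names below are illustrative and not the actual code of #2410, but they show how a map derived from the context cannot be shared between callers:

```go
package cachestate

import "context"

// ctxKey is an unexported key type, so values stored by this package cannot
// collide with values stored under keys from other packages.
type ctxKey struct{}

// recordsByItem is a per-caller scratch map; illustrative only.
type recordsByItem map[string][]string

// withRecords returns a derived context carrying a fresh map. Because every
// caller derives its own context, no two callers ever share the map, so
// there is nothing to race on.
func withRecords(ctx context.Context) context.Context {
	return context.WithValue(ctx, ctxKey{}, recordsByItem{})
}

// recordsFrom retrieves the map placed by withRecords, or nil if absent.
func recordsFrom(ctx context.Context) recordsByItem {
	m, _ := ctx.Value(ctxKey{}).(recordsByItem)
	return m
}
```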
@tonistiigi I had misunderstood your proposition. I tried #2410:

```shell
git fetch --all
git reset --hard origin/master
curl -L -o 2410.patch https://github.com/moby/buildkit/pull/2410.patch
git apply 2410.patch
make images
docker run --privileged -p1234:1234 -ti moby/buildkit:local --addr tcp://0.0.0.0:1234
```

My first two tries were fine; the third one hung. I was a bit more patient and could see that the builds eventually finish, they're just incredibly slow.

```shell
# From an empty folder with the Dockerfile of https://github.com/moby/buildkit/issues/2296#issuecomment-938508525
~/work/github.com/buildkit/bin/buildctl --addr tcp://127.0.0.1:1234 build \
--progress=plain --frontend=dockerfile.v0 --local context="." --local dockerfile="." \
--opt target="test" --import-cache ref=tenant/test:cache2,type=registry \
  --export-cache ref=tenant/test:cache2,type=registry,mode=max
```

One build at a time: […]
If I send SIGQUIT after a minute on one of those builds, I get: […]
I eventually recompiled the daemon with a tiny change:

```diff
diff --git a/cache/remotecache/v1/utils.go b/cache/remotecache/v1/utils.go
index 889064a6..67bb964c 100644
--- a/cache/remotecache/v1/utils.go
+++ b/cache/remotecache/v1/utils.go
@@ -132,6 +132,9 @@ type normalizeState struct {
}
func (s *normalizeState) removeLoops() {
+ defer func() {
+ logrus.Infof("NORMALIZED")
+ }()
roots := []digest.Digest{}
for dgst, it := range s.byKey {
 	if len(it.links) == 0 {
```

This log line never appeared; the build was stuck on […]
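One detail worth spelling out: a deferred statement only runs when the surrounding function returns, so the absence of the NORMALIZED line only tells us that removeLoops had not returned yet; it does not by itself distinguish "never called" from "still running". A trivial standalone illustration:

```go
package main

import (
	"log"
	"time"
)

func removeLoopsLike() {
	// Deferred calls run only when the function returns. While this
	// function is still looping, the message below is never printed.
	defer log.Println("NORMALIZED")
	for {
		time.Sleep(time.Second)
	}
}

func main() {
	removeLoopsLike()
}
```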
Sorry @tonistiigi, what I mentioned in my previous comment is unrelated to your change. I just checked out v0.9.0, applied a small patch and got the same result.

```diff
diff --git a/cache/remotecache/v1/utils.go b/cache/remotecache/v1/utils.go
index 889064a6..3c3b6b4d 100644
--- a/cache/remotecache/v1/utils.go
+++ b/cache/remotecache/v1/utils.go
@@ -140,13 +140,14 @@ func (s *normalizeState) removeLoops() {
}
visited := map[digest.Digest]struct{}{}
-
+ var i int64
for _, d := range roots {
- s.checkLoops(d, visited)
+ s.checkLoops(d, visited, &i)
}
+ logrus.Infof("finally done in %d iterations", i)
}
-func (s *normalizeState) checkLoops(d digest.Digest, visited map[digest.Digest]struct{}) {
+func (s *normalizeState) checkLoops(d digest.Digest, visited map[digest.Digest]struct{}, i *int64) {
it, ok := s.byKey[d]
if !ok {
return
@@ -159,6 +160,10 @@ func (s *normalizeState) checkLoops(d digest.Digest, visited map[digest.Digest]s
defer func() {
delete(visited, d)
}()
+ (*i)++
+ if (*i)%10000 == 0 {
+ logrus.Infof("done %d iterations", *i)
+ }
for l, ids := range links {
for id := range ids {
@@ -172,7 +177,7 @@ func (s *normalizeState) checkLoops(d digest.Digest, visited map[digest.Digest]s
}
delete(links[l], id)
} else {
- s.checkLoops(id, visited)
+ s.checkLoops(id, visited, i)
}
}
 }
```
The amount of work spent removing the loops can be quite different for two consecutive and identical build commands: […]
My first guess was that […]. I tried not importing the build cache, just once, and now the removal of the loops is always done in 19 iterations, still for the same Dockerfile. I restored a log line that prints out the number of Records and Layers resulting from the export, and the number of Records now seems constant (with your change), which wasn't the case so far. I hadn't thought about this, but does the build cache have an effect on the […]?

I tested your PR with the reproducer and it works for me now, even after many tries. I won't be able to test it with the original project until the middle of next week.
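One plausible reason the iteration counts can vary so wildly, assuming checkLoops behaves as the diff above suggests (the visited set is scoped to the current path, since entries are deleted on the way back up): shared sub-graphs are re-walked once per path that reaches them, so a chain with many cross-links, for example one polluted by a corrupted or looping imported cache, can take a combinatorial number of steps. A standalone toy version of that effect:

```go
package main

import "fmt"

// walk mimics a loop check whose visited set is scoped to the current path:
// every node is removed from the set when the recursion unwinds, so shared
// descendants are re-walked once per distinct path that reaches them.
func walk(g map[int][]int, n int, onPath map[int]bool, steps *int) {
	if onPath[n] {
		return // a loop on the current path: would be cut here
	}
	onPath[n] = true
	defer delete(onPath, n)
	*steps++
	for _, next := range g[n] {
		walk(g, next, onPath, steps)
	}
}

func main() {
	// A ladder of diamonds: every node links to both nodes of the next
	// level, so the number of root-to-leaf paths doubles at each level.
	g := map[int][]int{}
	const levels = 20
	for i := 0; i < levels; i++ {
		a, b := 2*i, 2*i+1
		na, nb := 2*(i+1), 2*(i+1)+1
		g[a] = []int{na, nb}
		g[b] = []int{na, nb}
	}
	steps := 0
	walk(g, 0, map[int]bool{}, &steps)
	// Roughly 2^21 visits for only ~40 nodes.
	fmt.Println("steps:", steps)
}
```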
@maxlaverse Looks like you had some corrupted cache in the registry, from testing your previous patches, that caused this extra work on re-export. We don't consider this something we need to protect against, as you always need to trust your cache source anyway. But if you have a better understanding of what caused the loops and there is an easy fix, we could try to do something about it as well. lmk when you have more testing data.
@maxlaverse Can we close this now?
@tonistiigi I deployed a patched version with #2410 24 hours ago. There have been no occurrences of a CacheChain Marshal operation trying to normalize items from a different CacheChain, compared to a couple of them per hour before. It's very likely your patch fixed this issue, meaning the […]. We're still using a mutex around the Marshal operation. I'll try to remove it in the next few days and/or follow the conversation about those CacheKey issues (#2041 (comment)).
I'm not sure which buildctl command triggered the error. We are building a few different images in parallel. The general command format is: […]

Version: […]
Config: […]
Error: […]