remotecache/v1.normalizeItem: panic: assignment to entry in nil map #2296

Open
jamescook opened this issue Aug 4, 2021 · 14 comments

@jamescook

I'm not sure which buildctl command triggered the error. We are building a few different images in parallel. The general command format is:

buildctl build --progress=plain --frontend=dockerfile.v0 --local context="." --local dockerfile="." --opt target="production" --opt build-arg:GIT_BRANCH="$BUILDKITE_BRANCH" --opt build-arg:GIT_COMMIT_SHA="$BUILDKITE_COMMIT" --opt build-arg:GIT_COMMIT_MESSAGE="$BUILDKITE_MESSAGE" --import-cache ref=zzzzz/cache:production-zzzz-master,type=registry --import-cache ref=zzzz/cache:production-zzzz-master,type=registry --export-cache ref=zzzz/cache:production-zzzz-master,mode=max,type=registry --output \"name=zzzz/zzzz:production-bc6e10840eb7d7652c2cf120199d15e3623f65ba-amd64\",\"push=true\",\"type=image\"

Version:

buildkitd --version
buildkitd github.com/moby/buildkit v0.9.0 c8bb937807d405d92be91f06ce2629e6202ac7a9

Config:

cat /etc/buildkit/buildkitd.toml 
[worker.oci]
  max-parallelism = 3

[worker.containerd]
  max-parallelism = 3

Error:

panic: assignment to entry in nil map

goroutine 2556303 [running]:
github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc00fa0c380, 0xc019ca8350, 0xc019ca5498, 0x1edaa00, 0x0)
        /src/cache/remotecache/v1/utils.go:255 +0x8b7
github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc009003f10, 0xc019ca8350, 0xc019ca5828, 0x1edaa00, 0xc00bd8bc00)
        /src/cache/remotecache/v1/utils.go:206 +0x20b
github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc009003ea0, 0xc019ca8350, 0xc019ca5b58, 0x1edaa00, 0xc00a82e900)
        /src/cache/remotecache/v1/utils.go:206 +0x20b
github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc009003e30, 0xc019ca8350, 0xc019ca5e88, 0x1edaa00, 0xc00a82ea00)
        /src/cache/remotecache/v1/utils.go:206 +0x20b
github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc0129f1730, 0xc019ca8350, 0xc019ca61b8, 0x1edaa00, 0xc00a82ea00)
        /src/cache/remotecache/v1/utils.go:206 +0x20b
github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc004af9b90, 0xc019ca8350, 0xc019ca64e8, 0x1edaa00, 0x7f96ed05af00)
        /src/cache/remotecache/v1/utils.go:206 +0x20b
github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc00fe49f80, 0xc019ca8350, 0xc019ca6818, 0x1edaa00, 0xc00bd8bf00)
        /src/cache/remotecache/v1/utils.go:206 +0x20b
github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc0041a35e0, 0xc019ca8350, 0xc019ca6b48, 0x1edaa00, 0xc00d032500)
        /src/cache/remotecache/v1/utils.go:206 +0x20b
github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc0077c3dc0, 0xc019ca8350, 0xc019ca6e78, 0x1edaa00, 0xc008e58c00)
        /src/cache/remotecache/v1/utils.go:206 +0x20b
github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc00ed9be30, 0xc019ca8350, 0xc019ca71a8, 0x1edaa00, 0xc008e59400)
        /src/cache/remotecache/v1/utils.go:206 +0x20b
github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc00c9a2a80, 0xc019ca8350, 0xc019ca74d8, 0x1edaa00, 0xc008e59400)
        /src/cache/remotecache/v1/utils.go:206 +0x20b
github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc00c9a29a0, 0xc019ca8350, 0xc019ca7808, 0x1edaa00, 0xc008e59500)
        /src/cache/remotecache/v1/utils.go:206 +0x20b
github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc00c9a2930, 0xc019ca8350, 0xc019ca7b38, 0x1edaa00, 0xc008e59e00)
        /src/cache/remotecache/v1/utils.go:206 +0x20b
github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc00c9a28c0, 0xc019ca8350, 0xc019ca7e68, 0x1edaa00, 0x0)
        /src/cache/remotecache/v1/utils.go:206 +0x20b
github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc00bbbb5e0, 0xc019ca8350, 0xc019ca8198, 0x1edaa00, 0x40db00)
        /src/cache/remotecache/v1/utils.go:206 +0x20b
github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc00fa0da40, 0xc019ca8350, 0x97, 0xc017430000, 0xc019ca8338)
        /src/cache/remotecache/v1/utils.go:206 +0x20b
github.com/moby/buildkit/cache/remotecache/v1.(*CacheChains).normalize(0xc000156760, 0x148ff00, 0x0)
        /src/cache/remotecache/v1/chains.go:62 +0x3e7
github.com/moby/buildkit/cache/remotecache/v1.(*CacheChains).Marshal(0xc000156760, 0xc00f181a10, 0x1698138, 0xc00c3a4a98, 0xc01c511ea8)
        /src/cache/remotecache/v1/chains.go:79 +0x45
github.com/moby/buildkit/cache/remotecache.(*contentCacheExporter).Finalize(0xc008234c40, 0x16ab6e8, 0xc00c3a4510, 0x0, 0xc00c3a46f0, 0xc01c511dd0)
        /src/cache/remotecache/export.go:71 +0x58
github.com/moby/buildkit/solver/llbsolver.(*Solver).Solve.func4(0x16ab6e8, 0xc00c3a4510, 0x1684cc0, 0xc01c511da0, 0x1, 0x1698a48)
        /src/solver/llbsolver/solver.go:241 +0x1c9
github.com/moby/buildkit/solver/llbsolver.inBuilderContext.func1(0x16ab6e8, 0xc00c3a4510, 0x1684cc0, 0xc01c511da0, 0x0, 0x0)
        /src/solver/llbsolver/solver.go:358 +0x1c7
github.com/moby/buildkit/solver.(*Job).InContext(0xc01684c100, 0x16ab6e8, 0xc01bb6a1b0, 0xc01c511d88, 0x10, 0xc0145a8140)
        /src/solver/jobs.go:542 +0x175
github.com/moby/buildkit/solver/llbsolver.inBuilderContext(0x16ab6e8, 0xc01bb6a1b0, 0x16a4700, 0xc01684c100, 0x1501f53, 0xf, 0x0, 0x0, 0xc00fa0d9d0, 0x0, ...)
        /src/solver/llbsolver/solver.go:354 +0x18e
github.com/moby/buildkit/solver/llbsolver.(*Solver).Solve(0xc002467260, 0x16ab6e8, 0xc01bb6a1b0, 0xc0186d4080, 0x19, 0xc0186d4140, 0x19, 0x0, 0x0, 0xc009f46030, ...)
        /src/solver/llbsolver/solver.go:223 +0x12c5
github.com/moby/buildkit/control.(*Controller).Solve(0xc00053c000, 0x16ab6e8, 0xc01bb6a1b0, 0xc00e6c9320, 0x0, 0x0, 0x0)
        /src/control/control.go:301 +0x4fa
github.com/moby/buildkit/api/services/control._Control_Solve_Handler.func1(0x16ab6e8, 0xc01bb6a1b0, 0x14b8820, 0xc00e6c9320, 0x14feb92, 0xc, 0x4, 0x0)
        /src/api/services/control/control.pb.go:1489 +0x89
github.com/moby/buildkit/util/grpcerrors.UnaryServerInterceptor(0x16ab6e8, 0xc01bb6a1b0, 0x14b8820, 0xc00e6c9320, 0xc000156040, 0xc000ab2078, 0x0, 0x0, 0x0, 0x0)
        /src/util/grpcerrors/intercept.go:14 +0x6f
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x16ab6e8, 0xc01bb6a1b0, 0x14b8820, 0xc00e6c9320, 0x14f9587, 0x8, 0x0, 0x0)
        /src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x63
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1(0x16ab640, 0xc01bb6a1b0, 0x14b8820, 0xc00e6c9320, 0xc000156040, 0xc0001560c0, 0x0, 0x0, 0x0, 0x0)
        /src/vendor/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc/interceptor.go:327 +0x6db
main.unaryInterceptor.func1(0x16ab6e8, 0xc008234000, 0x14b8820, 0xc00e6c9320, 0xc000156040, 0xc0001560c0, 0x0, 0x0, 0x0, 0x0)
        /src/cmd/buildkitd/main.go:572 +0x169
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x16ab6e8, 0xc01bb6a030, 0x14b8820, 0xc00e6c9320, 0xc0021d0c00, 0x0, 0xc00079bb30, 0x40e338)
        /src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x63
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1(0x16ab6e8, 0xc01bb6a030, 0x14b8820, 0xc00e6c9320, 0xc000156040, 0xc000ab2078, 0xc006b4aba0, 0x535766, 0x14266e0, 0xc01bb6a030)
        /src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:34 +0xd7
github.com/moby/buildkit/api/services/control._Control_Solve_Handler(0x143dde0, 0xc00053c000, 0x16ab6e8, 0xc01bb6a030, 0xc01b7f81e0, 0xc000572bd0, 0x16ab6e8, 0xc01bb6a030, 0xc00293e000, 0xcda)
        /src/api/services/control/control.pb.go:1491 +0x150
google.golang.org/grpc.(*Server).processUnaryRPC(0xc00003d880, 0x16b9698, 0xc00e370180, 0xc004240480, 0xc0003cfe60, 0x1e8dfd8, 0x0, 0x0, 0x0)
        /src/vendor/google.golang.org/grpc/server.go:1286 +0x52b
google.golang.org/grpc.(*Server).handleStream(0xc00003d880, 0x16b9698, 0xc00e370180, 0xc004240480, 0x0)
        /src/vendor/google.golang.org/grpc/server.go:1609 +0xd0c
google.golang.org/grpc.(*Server).serveStreams.func1.2(0xc00e650460, 0xc00003d880, 0x16b9698, 0xc00e370180, 0xc004240480)
        /src/vendor/google.golang.org/grpc/server.go:934 +0xab
created by google.golang.org/grpc.(*Server).serveStreams.func1
        /src/vendor/google.golang.org/grpc/server.go:932 +0x1fd
@AkihiroSuda changed the title from "panic: assignment to entry in nil map" to "remotecache/v1.normalizeItem: panic: assignment to entry in nil map" on Aug 4, 2021
@maxlaverse
Contributor

Hi!
We encountered three other panics at different places in v1.normalizeItem():

fatal error: concurrent map iteration and map write

goroutine 5066100 [running]:
runtime.throw(0x1529605, 0x26)
	/usr/local/go/src/runtime/panic.go:1117 +0x72 fp=0xc0212aa8a8 sp=0xc0212aa878 pc=0x438352
runtime.mapiternext(0xc0212aac08)
	/usr/local/go/src/runtime/map.go:858 +0x54c fp=0xc0212aa928 sp=0xc0212aa8a8 pc=0x410d6c
github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc008f69420, 0xc0212ac120, 0xc0212aafe8, 0x1ee3a60, 0x0)
	/src/cache/remotecache/v1/utils.go:204 +0x17d fp=0xc0212aad08 sp=0xc0212aa928 pc=0xdd0a5d
github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc01381abd0, 0xc0212ac120, 0xc0212ab3c8, 0x1ee3a60, 0xc038f6fe00)
	/src/cache/remotecache/v1/utils.go:206 +0x20b fp=0xc0212ab0e8 sp=0xc0212aad08 pc=0xdd0aeb
github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc01fc9d1f0, 0xc0212ac120, 0xc0212ab748, 0x1ee3a60, 0x0)
	/src/cache/remotecache/v1/utils.go:206 +0x20b fp=0xc0212ab4c8 sp=0xc0212ab0e8 pc=0xdd0aeb
github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc009f27ea0, 0xc0212ac120, 0xc0212abb88, 0x1ee3a60, 0xc005700900)
	/src/cache/remotecache/v1/utils.go:206 +0x20b fp=0xc0212ab8a8 sp=0xc0212ab4c8 pc=0xdd0aeb
github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc01b758d90, 0xc0212ac120, 0xc0212abf68, 0x1ee3a60, 0x514700)
	/src/cache/remotecache/v1/utils.go:206 +0x20b fp=0xc0212abc88 sp=0xc0212ab8a8 pc=0xdd0aeb
github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc0104ee000, 0xc0212ac120, 0x85, 0xc00b645b00, 0x0)
	/src/cache/remotecache/v1/utils.go:206 +0x20b fp=0xc0212ac068 sp=0xc0212abc88 pc=0xdd0aeb
github.com/moby/buildkit/cache/remotecache/v1.(*CacheChains).normalize(0xc016587b00, 0xc0006c0000, 0x0)
	/src/cache/remotecache/v1/chains.go:63 +0x3e7 fp=0xc0212ac430 sp=0xc0212ac068 pc=0xdcd5c7
github.com/moby/buildkit/cache/remotecache/v1.(*CacheChains).Marshal(0xc016587b00, 0x4, 0x15249e2, 0x23, 0xc0212ac680)
	/src/cache/remotecache/v1/chains.go:80 +0x45 fp=0xc0212ac4f8 sp=0xc0212ac430 pc=0xdcd825
github.com/moby/buildkit/cache/remotecache.(*contentCacheExporter).Finalize(0xc01df832c0, 0x16b2ee8, 0xc0248acc90, 0x0, 0xc0248aced0, 0x3)
[...]

The three panics have in common that contentCacheExporter.chains.Marshal() was being called from two different goroutines, for two different images that very likely have layers in common.

Any hint on what approach should be taken here? Should there be a mutex around the call to Marshal(), or should its implementation be made thread-safe?

@maxlaverse
Contributor

I know more about why those crashes are happening now, but:

  1. I still don't have a good understanding of all operations done during export (I'd love to!)
  2. I don't know what's normal and what's considered buggy

Any corrections/adjustments to these explanations are more than welcome :)

Where it fails

When/If exporting the build cache, we call Finalize() on the corresponding Exporter in llbsolver.Solve(). Each Exporter instance has a CacheChains that contains items relevant for the export (a graph of digests/layers). Finalize() does a bunch of stuff, but what's relevant for the issue is that it calls Marshal() on its CacheChains. This method goes through all the items of the CacheChains and computes a cleaned/optimized version of the graph of items ("normalization"), which is then returned to llbsolver.Solve() and further up in the code. The errors mentioned in this issue are in the code path described above.
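
For orientation, this is the call path described above, as it appears in the v0.9.0 stack trace earlier in this issue (simplified; the file:line references are taken from that trace):

llbsolver.(*Solver).Solve                        solver/llbsolver/solver.go:241
  -> (*contentCacheExporter).Finalize            cache/remotecache/export.go:71
       -> (*CacheChains).Marshal                 cache/remotecache/v1/chains.go:79
            -> (*CacheChains).normalize          cache/remotecache/v1/chains.go:62
                 -> normalizeItem (recursive)    cache/remotecache/v1/utils.go:206 and :255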

How it fails

The items in a CacheChains are linked together. Normalizing a CacheChains does not just loop over its own items; it also recursively normalizes all the items they are linked to, or linked from. This operation builds a "state", but while doing so it also modifies the items. This is where the problem comes from: it can happen that an item is normalized by two different goroutines, which means we eventually read and modify it at the same time, and that crashes the daemon in a non-recoverable manner.
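
To make that concrete, here is a minimal standalone Go program (my own model of the situation, not buildkit code) in which two goroutines walk and rewrite the same shared map of links at the same time. The runtime usually aborts it almost immediately with fatal error: concurrent map writes or concurrent map iteration and map write, the same non-recoverable class of crash as above:

package main

import "sync"

// item is a stand-in for a cache item whose links are rewritten during
// "normalization"; only the unsynchronized access to the shared map matters.
type item struct {
	links map[string]struct{}
}

// normalize walks the links and writes to the map while doing so, the way
// normalizeItem mutates items while walking the graph.
func normalize(it *item) {
	for l := range it.links {
		it.links[l] = struct{}{} // harmless alone, fatal when done concurrently
	}
}

func main() {
	shared := &item{links: map[string]struct{}{"a": {}, "b": {}, "c": {}}}

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ { // two "builds" exporting at the same time
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100000; j++ {
				normalize(shared)
			}
		}()
	}
	wg.Wait() // usually never reached: the runtime kills the process
}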

There is no code path where Marshal() is called twice on the same CacheChains, but it can happen that two different CacheChains have their Marshal() called at the same time, especially if two nearly identical builds were triggered at exactly the same time.

Why it fails

An Exporter can have a CacheChains whose items reference items initially added to a different CacheChains. This cross-linking of items between CacheChains seems to happen exclusively in exporter.ExportTo(), more specifically when calling ExportTo() on dependencies and on secondary exporters. Those two loops bring in items that belong to a different CacheChains than the one ExportTo() was called for (through CacheExporterTarget).

Calling ExportTo() on the dependencies and on a secondary exporter's CacheKey exporter does not always return "foreign" items; most of the time it happens when an edge has been made exportable. I dug a bit deeper into CacheKey merging, but I couldn't understand what is done there in a timely manner.

@tonistiigi kindly pinging you, as I don't know whether you have seen this issue so far; I hope these explanations can eventually help you understand what's wrong with the code.

For the time being, we added a global mutex around normalization:
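
(The exact patch isn't included here; the idea is roughly the following sketch, with made-up names, wrapping the existing call into CacheChains.Marshal() behind one process-wide mutex.)

import "sync"

// Sketch only: a single package-level mutex taken for the whole
// marshalling/normalization step, so two exports can never mutate
// shared items at the same time.
var marshalMu sync.Mutex

func marshalLocked(marshal func() error) error {
	marshalMu.Lock()
	defer marshalMu.Unlock()
	return marshal() // e.g. the existing CacheChains.Marshal() call
}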

No crashes in the past 7 days so far.

@tonistiigi
Member

It can happen that an item is normalized by two different goroutines, which means we eventually read and modify it at the same time, and that crashes the daemon in a non-recoverable manner.

What is the case where this can happen? Every build should have its own CacheChains.

@tonistiigi
Member

tonistiigi commented Oct 5, 2021

What is the case where this can happen? Every build should have its own CacheChains.

I guess it is that e.res is shared for each ExportTo?

buildkit/solver/exporter.go

Lines 183 to 185 in a0afb69

e.res = allRec
return e.res, nil
That does not look correct. It should be per target.

Maybe same construction as bkm but to store the e.res.
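
Illustrative sketch of that kind of construction (names are hypothetical, only the pattern matters): keep the records in a map stored on the context so that every top-level caller gets its own map, and concurrent exports never share mutable state:

import "context"

// Hypothetical names, for illustration only.
type expRecordsKey struct{}

// withExportRecords returns a ctx carrying a fresh per-call map of records,
// keyed per exporter here for illustration.
func withExportRecords(ctx context.Context) context.Context {
	return context.WithValue(ctx, expRecordsKey{}, map[*exporter][]CacheExporterRecord{})
}

// exportRecords fetches the per-call map (nil if the caller didn't set one).
func exportRecords(ctx context.Context) map[*exporter][]CacheExporterRecord {
	m, _ := ctx.Value(expRecordsKey{}).(map[*exporter][]CacheExporterRecord)
	return m
}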

@maxlaverse
Contributor

Thanks @tonistiigi! Will try out something in that direction and let you know.

@maxlaverse
Contributor

maxlaverse commented Oct 7, 2021

Maybe same construction as bkm but to store the e.res.

I spent quite some time trying to make it work, but it doesn't seem reliable and I don't understand why yet. When I store the CacheRecords in the Context using a map, one of the concurrent builds often gets stuck in the loop removal of the normalized state. Adding some logs there reveals a repeating pattern while going through the graph.

I tried to transform e.res into a map[CacheExporterTarget][]CacheExporterRecord + mutex instead. It never froze when normalizing the CacheChains, but there are failed to remove looping cache key errors from time to time, and it seems to corrupt the cache: a layer sometimes gets rebuilt on the next run, which shouldn't be the case.
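
Roughly what that attempt looked like (simplified from memory, not the exact diff):

// Sketch only: cache the records per CacheExporterTarget instead of in one
// shared slice, guarded by a mutex.
type exporter struct {
	// ... existing fields ...
	resMu sync.Mutex
	res   map[CacheExporterTarget][]CacheExporterRecord
}

func (e *exporter) cachedRecords(t CacheExporterTarget) ([]CacheExporterRecord, bool) {
	e.resMu.Lock()
	defer e.resMu.Unlock()
	recs, ok := e.res[t]
	return recs, ok
}

func (e *exporter) storeRecords(t CacheExporterTarget, recs []CacheExporterRecord) {
	e.resMu.Lock()
	defer e.resMu.Unlock()
	if e.res == nil {
		e.res = map[CacheExporterTarget][]CacheExporterRecord{}
	}
	e.res[t] = recs
}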

Also, I notice that the number of Records still varies at the end of the CacheChains marshalling. This is already the case right now, but I was expecting that to stop once this issue is solved. The number of items in each CacheChains would vary as well without this PR, but now it just keeps growing. Something is not working yet.

Instead of waiting, I was able to reproduce the problem by executing the following command twice, using the input-broadcast feature of my terminal (works maybe around 1 in 5 times):

buildctl --addr tcp://127.0.0.1:1234 build --progress=plain --frontend=dockerfile.v0 --local context="." --local dockerfile="spec_e2e/" --opt target="test" --import-cache ref=tenant/cache:test-maxime,type=registry  --export-cache ref=tenant/cache:test-maxime,mode=max

The cache is composed of 15-16 layers.

@tonistiigi
Member

Please post your code, and steps for a reproducer if you have them.

one of the concurrent builds often get stuck

Do you have a stacktrace for that?

@maxlaverse
Contributor

maxlaverse commented Oct 8, 2021

Here is the code with the original suggestion, and a couple of additional log lines:
https://github.com/moby/buildkit/compare/master...venturehacks:fix_exporter_exportto_cache?expand=1

This is a Dockerfile that seems good enough as a reproducer:

# BASE
FROM busybox AS base
WORKDIR /app/
COPY ./Dockerfile /app/
RUN cat Dockerfile > /app/done-something-in-base

# SOURCE CODE
FROM base as source-code
COPY ./Dockerfile /app/

# BUILDER
FROM source-code AS builder
RUN cat Dockerfile > /app/done-something-in-builder

# TEST
FROM busybox AS test
WORKDIR /app/
COPY --from=base /app/done-something-in-base ./Dockerfile-based
COPY --from=builder /app/done-something-in-builder ./Dockerfile-builder
CMD uptime

Here are the steps to reproduce:

  • build an image of Buildkit with the modification above
  • start a Buildkit Daemon as a container: docker run --privileged -p1234:1234 -ti <buildkit_image> --addr tcp://0.0.0.0:1234
  • store the Dockerfile above in an empty folder
  • go inside the folder and trigger two builds simultaneously - I use the "Broadcasting Input" feature of my terminal to achieve this. With the branch above that stores the CacheRecords in a context value, there is no need to trigger builds concurrently; a single one is enough:
~/work/github.com/buildkit/bin/buildctl \
   --addr tcp://127.0.0.1:1234 build \
   --progress=plain --frontend=dockerfile.v0 \
   --local context="." \
   --local dockerfile="." \
   --opt target="test" \
   --import-cache ref=tenant/test:cache,type=registry \
   --export-cache ref=tenant/test:cache,type=registry,mode=max

The stack trace that gets generated has a finite size, approximately the number of items in the CacheChains.

Storing CacheRecords per Exporter makes sense, I think, because yesterday I saw a couple of "foreign" items coming from an ExportTo() call on an exporter that had been called before with a different target. However, storing them in the context means the CacheRecord cache will be reused for all ExportTo() calls for a given context and target, regardless of whether ExportTo() is called on different exporters (e.g. ones having an edge set), which should produce a different result.

@tonistiigi
Member

@maxlaverse That's not quite what I had in mind. See/test #2410. Using the context var itself makes sure that maps are separate per caller.

I still haven't been able to repro myself. I ran your instructions in unmodified master and with my patch and it worked every time. I didn't test your patch as returning cached records for a different edge does not look right, so I'm not surprised that it causes issues.

@maxlaverse
Contributor

@tonistiigi I had misunderstood your suggestion.

I tried #2410.

git fetch --all
git reset --hard origin/master
curl -L -o 2410.patch https://github.com/moby/buildkit/pull/2410.patch
git apply 2410.patch
make images
docker run --privileged -p1234:1234  -ti moby/buildkit:local --addr tcp://0.0.0.0:1234

My first two tries were fine; the third hung. When I was a bit more patient, I could see that the builds do eventually finish; they're just incredibly slow.

# From an empty folder with the Dockerfile of https://github.com/moby/buildkit/issues/2296#issuecomment-938508525
~/work/github.com/buildkit/bin/buildctl  --addr tcp://127.0.0.1:1234 build  \
   --progress=plain --frontend=dockerfile.v0  --local context="."  --local dockerfile="." \
   --opt target="test"  --import-cache ref=tenant/test:cache2,type=registry  \
   --export-cache ref=tenant/test:cache2,type=registry,mode=max

One build at a time:

  • build 1: 4s in total
  • build 2: 16s in total
  • build 3: 98s in total
  • build 4: 66s in total
  • build 5: 70s in total, of which 60s were spent in preparing build cache for export done

If I send SIGQUIT after a minute on one of those builds, I get:

[...]
SIGQUIT: quit
PC=0x46a841 m=0 sigcode=0

goroutine 0 [idle]:
[...]

goroutine 808 [runnable]:
github.com/moby/buildkit/cache/remotecache/v1.(*normalizeState).checkLoops(0xc0004cc300, {0xc0002614a0, 0x47}, 0xc0004cc118)
	/src/cache/remotecache/v1/utils.go:165 +0x290
github.com/moby/buildkit/cache/remotecache/v1.(*normalizeState).checkLoops(0xc0004cc300, {0xc000261450, 0x47}, 0xc0004cc118)
	/src/cache/remotecache/v1/utils.go:175 +0x3e5
github.com/moby/buildkit/cache/remotecache/v1.(*normalizeState).checkLoops(0xc0004cc300, {0xc000261400, 0x47}, 0xc0004cc118)
	/src/cache/remotecache/v1/utils.go:175 +0x3e5
github.com/moby/buildkit/cache/remotecache/v1.(*normalizeState).checkLoops(0xc0004cc300, {0xc0002613b0, 0x47}, 0xc0004cc118)
	/src/cache/remotecache/v1/utils.go:175 +0x3e5
github.com/moby/buildkit/cache/remotecache/v1.(*normalizeState).checkLoops(0xc0004cc300, {0xc000261360, 0x47}, 0xc0004cc118)
	/src/cache/remotecache/v1/utils.go:175 +0x3e5
github.com/moby/buildkit/cache/remotecache/v1.(*normalizeState).checkLoops(0xc0004cc300, {0xc0002612c0, 0x47}, 0xc0004cc118)
	/src/cache/remotecache/v1/utils.go:175 +0x3e5
github.com/moby/buildkit/cache/remotecache/v1.(*normalizeState).checkLoops(0xc0004cc300, {0xc000261270, 0x47}, 0xc0004cc118)
	/src/cache/remotecache/v1/utils.go:175 +0x3e5
github.com/moby/buildkit/cache/remotecache/v1.(*normalizeState).checkLoops(0xc0004cc300, {0xc000261220, 0x47}, 0xc0004cc118)
	/src/cache/remotecache/v1/utils.go:175 +0x3e5
github.com/moby/buildkit/cache/remotecache/v1.(*normalizeState).checkLoops(0xc0004cc300, {0xc0002611d0, 0x47}, 0xc0004cc118)
	/src/cache/remotecache/v1/utils.go:175 +0x3e5
github.com/moby/buildkit/cache/remotecache/v1.(*normalizeState).checkLoops(0xc0004cc300, {0xc000261180, 0x47}, 0xc0004cc118)
	/src/cache/remotecache/v1/utils.go:175 +0x3e5
github.com/moby/buildkit/cache/remotecache/v1.(*normalizeState).checkLoops(0xc0004cc300, {0xc000261130, 0x47}, 0xc0004cc118)
	/src/cache/remotecache/v1/utils.go:175 +0x3e5
github.com/moby/buildkit/cache/remotecache/v1.(*normalizeState).checkLoops(0xc0004cc300, {0xc0002610e0, 0x47}, 0xc0004cc118)
	/src/cache/remotecache/v1/utils.go:175 +0x3e5
github.com/moby/buildkit/cache/remotecache/v1.(*normalizeState).checkLoops(0xc0004cc300, {0xc000261090, 0x47}, 0xc0004cc118)
	/src/cache/remotecache/v1/utils.go:175 +0x3e5
github.com/moby/buildkit/cache/remotecache/v1.(*normalizeState).checkLoops(0xc0004cc300, {0xc000261040, 0x47}, 0xc0004cc118)
	/src/cache/remotecache/v1/utils.go:175 +0x3e5
github.com/moby/buildkit/cache/remotecache/v1.(*normalizeState).checkLoops(0xc0004cc300, {0xc000260ff0, 0x47}, 0xc0004cc118)
	/src/cache/remotecache/v1/utils.go:175 +0x3e5
github.com/moby/buildkit/cache/remotecache/v1.(*normalizeState).checkLoops(0xc0004cc300, {0xc000260fa0, 0x47}, 0xc0004cc118)
	/src/cache/remotecache/v1/utils.go:175 +0x3e5
github.com/moby/buildkit/cache/remotecache/v1.(*normalizeState).checkLoops(0xc0004cc300, {0xc000260f50, 0x47}, 0xc0004cc118)
	/src/cache/remotecache/v1/utils.go:175 +0x3e5
github.com/moby/buildkit/cache/remotecache/v1.(*normalizeState).checkLoops(0xc0004cc300, {0xc000260f00, 0x47}, 0xc0004cc118)
	/src/cache/remotecache/v1/utils.go:175 +0x3e5
github.com/moby/buildkit/cache/remotecache/v1.(*normalizeState).checkLoops(0xc0004cc300, {0xc000260eb0, 0x47}, 0xc0004cc118)
	/src/cache/remotecache/v1/utils.go:175 +0x3e5
github.com/moby/buildkit/cache/remotecache/v1.(*normalizeState).checkLoops(0xc0004cc300, {0xc0001c5c70, 0x47}, 0xc0004cc118)
	/src/cache/remotecache/v1/utils.go:175 +0x3e5
github.com/moby/buildkit/cache/remotecache/v1.(*normalizeState).removeLoops(0xc0004cc300)
	/src/cache/remotecache/v1/utils.go:145 +0x205
github.com/moby/buildkit/cache/remotecache/v1.(*CacheChains).normalize(0xc0001e2000)
	/src/cache/remotecache/v1/chains.go:68 +0x3f6
github.com/moby/buildkit/cache/remotecache/v1.(*CacheChains).Marshal(0xc0001e2000)
	/src/cache/remotecache/v1/chains.go:79 +0x33
github.com/moby/buildkit/cache/remotecache.(*contentCacheExporter).Finalize(0xc000d20040, {0x14d22a8, 0xc000ad82a0})
	/src/cache/remotecache/export.go:71 +0x65
github.com/moby/buildkit/solver/llbsolver.(*Solver).Solve.func4({0x14d22a8, 0xc000ad82a0}, {0xc000634a20, 0x40ce00})
	/src/solver/llbsolver/solver.go:259 +0x182
github.com/moby/buildkit/solver/llbsolver.inBuilderContext.func1({0x14d22a8, 0xc000ad81b0}, {0x14a83c0, 0xc000a96b88})
	/src/solver/llbsolver/solver.go:379 +0x1d9
github.com/moby/buildkit/solver.(*Job).InContext(0xc00002a180, {0x14d22a8, 0xc000453f50}, 0xc000a96b70)
	/src/solver/jobs.go:569 +0x157
github.com/moby/buildkit/solver/llbsolver.inBuilderContext({0x14d22a8, 0xc000453f50}, {0x14bfba0, 0xc00002a180}, {0x131794f, 0x0}, {0x0, 0x0}, 0xc000189ea0)
	/src/solver/llbsolver/solver.go:375 +0x1b3
github.com/moby/buildkit/solver/llbsolver.(*Solver).Solve(0xc0003f3da0, {0x14d22a8, 0xc000453f50}, {0xc000430b00, 0x19}, {0xc000430b20, 0xc000523038}, {0x0, 0x0, {0xc000732860, ...}, ...}, ...)
	/src/solver/llbsolver/solver.go:241 +0x1385
github.com/moby/buildkit/control.(*Controller).Solve(0xc0005acbe0, {0x14d22a8, 0xc000453f50}, 0xc000d4a360)
	/src/control/control.go:302 +0xb06
github.com/moby/buildkit/api/services/control._Control_Solve_Handler.func1({0x14d22a8, 0xc000453f50}, {0x12dc380, 0xc000d4a360})
	/src/api/services/control/control.pb.go:1488 +0x78
github.com/moby/buildkit/util/grpcerrors.UnaryServerInterceptor({0x14d22a8, 0xc000453f50}, {0x12dc380, 0xc000d4a360}, 0x0, 0x0)
	/src/util/grpcerrors/intercept.go:14 +0x3d
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1({0x14d22a8, 0xc000453f50}, {0x12dc380, 0xc000d4a360})
	/src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x3a
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1({0x14d2200, 0xc0002c4c80}, {0x12dc380, 0xc000d4a360}, 0xc0006a2680, 0xc0006a26a0)
	/src/vendor/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc/interceptor.go:327 +0x61c
main.unaryInterceptor.func1({0x14d22a8, 0xc000453d70}, {0x12dc380, 0xc000d4a360}, 0xc0006a2680, 0xc0006a26a0)
	/src/cmd/buildkitd/main.go:578 +0x1b8
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1({0x14d22a8, 0xc000453d70}, {0x12dc380, 0xc000d4a360})
	/src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x3a
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1({0x14d22a8, 0xc000453d70}, {0x12dc380, 0xc000d4a360}, 0xc0002e0bb8, 0x11684a0)
	/src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:34 +0xbf
github.com/moby/buildkit/api/services/control._Control_Solve_Handler({0x1275740, 0xc0005acbe0}, {0x14d22a8, 0xc000453d70}, 0xc00069ac00, 0xc00059f650)
	/src/api/services/control/control.pb.go:1490 +0x138
google.golang.org/grpc.(*Server).processUnaryRPC(0xc00050b340, {0x14e9e68, 0xc00078c180}, 0xc0002db680, 0xc000125050, 0x1dce8d8, 0x0)
	/src/vendor/google.golang.org/grpc/server.go:1297 +0xccf
google.golang.org/grpc.(*Server).handleStream(0xc00050b340, {0x14e9e68, 0xc00078c180}, 0xc0002db680, 0x0)
	/src/vendor/google.golang.org/grpc/server.go:1626 +0xa2a
google.golang.org/grpc.(*Server).serveStreams.func1.2()
	/src/vendor/google.golang.org/grpc/server.go:941 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
	/src/vendor/google.golang.org/grpc/server.go:939 +0x294

I eventually recompiled the daemon with a tiny change:

diff --git a/cache/remotecache/v1/utils.go b/cache/remotecache/v1/utils.go
index 889064a6..67bb964c 100644
--- a/cache/remotecache/v1/utils.go
+++ b/cache/remotecache/v1/utils.go
@@ -132,6 +132,9 @@ type normalizeState struct {
 }

 func (s *normalizeState) removeLoops() {
+       defer func() {
+               logrus.Infof("NORMALIZED")
+       }()
        roots := []digest.Digest{}
        for dgst, it := range s.byKey {
                if len(it.links) == 0 {

This log line never appeared, the build was stuck on preparing build cache for export done for more than 6 minutes, and dumping the goroutines also showed stack traces with around 20 checkLoops() frames.

@maxlaverse
Contributor

Sorry @tonistiigi, what I mentioned in my previous comment is unrelated to your change. I just checked out v0.9.0, applied a small patch and got the same result.

diff --git a/cache/remotecache/v1/utils.go b/cache/remotecache/v1/utils.go
index 889064a6..3c3b6b4d 100644
--- a/cache/remotecache/v1/utils.go
+++ b/cache/remotecache/v1/utils.go
@@ -140,13 +140,14 @@ func (s *normalizeState) removeLoops() {
 	}
 
 	visited := map[digest.Digest]struct{}{}
-
+	var i int64
 	for _, d := range roots {
-		s.checkLoops(d, visited)
+		s.checkLoops(d, visited, &i)
 	}
+	logrus.Infof("finally done in %d iterations", i)
 }
 
-func (s *normalizeState) checkLoops(d digest.Digest, visited map[digest.Digest]struct{}) {
+func (s *normalizeState) checkLoops(d digest.Digest, visited map[digest.Digest]struct{}, i *int64) {
 	it, ok := s.byKey[d]
 	if !ok {
 		return
@@ -159,6 +160,10 @@ func (s *normalizeState) checkLoops(d digest.Digest, visited map[digest.Digest]s
 	defer func() {
 		delete(visited, d)
 	}()
+	(*i)++
+	if (*i)%10000 == 0 {
+		logrus.Infof("done %d iterations", *i)
+	}
 
 	for l, ids := range links {
 		for id := range ids {
@@ -172,7 +177,7 @@ func (s *normalizeState) checkLoops(d digest.Digest, visited map[digest.Digest]s
 				}
 				delete(links[l], id)
 			} else {
-				s.checkLoops(id, visited)
+				s.checkLoops(id, visited, i)
 			}
 		}
 	}

The amount of work spent removing loops can be quite different for two consecutive, identical build commands:

# First builds
INFO[2021-10-09T10:??:??Z] finally done in 269026285 iterations
INFO[2021-10-09T10:25:04Z] finally done in 269034995 iterations

# Then
INFO[2021-10-09T10:25:28Z] finally done in 31 iterations

My first guess was that the normalizeState is slightly different between those builds, or that the non-deterministic iteration order of the maps comes into play. I had a dozen builds in a row that exited only after 250M+ checkLoops() iterations. After the one build that finished in 31 iterations, the following 20+ did as well, and I was unable to reproduce the issue. I didn't restart the daemon or alter the Dockerfile in between.
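
(On the iteration-order point, for reference: a tiny standalone example showing that Go deliberately varies map iteration order between runs, so the order in which removeLoops() picks roots out of s.byKey can differ from build to build.)

package main

import "fmt"

func main() {
	byKey := map[string]struct{}{"a": {}, "b": {}, "c": {}, "d": {}, "e": {}}
	for k := range byKey {
		fmt.Print(k, " ") // order changes from run to run, by design
	}
	fmt.Println()
}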

I tried skipping the build cache import just once, and now the loop removal is always done in 19 iterations, still for the same Dockerfile. I restored a log line that prints the number of Records and Layers resulting from the export, and the number of Records now seems constant (with your change), which wasn't the case before.

I hadn't thought about this, but does the build cache have an effect on the normalizeState, @tonistiigi? Could one build have corrupted the build cache and made it more likely for the problem to keep appearing?


I tested your PR with the reproducer and it works for me now, even after many tries. I won't be able to test it with the original project until the middle of next week.

@tonistiigi
Member

@maxlaverse Looks like you had some corrupted cache in the registry from testing your previous patches, and that caused this extra work on re-export. We don't consider this something we need to protect against, as you always need to trust your cache source anyway. But if you get a better understanding of what caused the loops and there is an easy fix, we could try to do something about it as well.

lmk when you have more testing data

@tonistiigi
Member

@maxlaverse Can we close this now?

@maxlaverse
Contributor

maxlaverse commented Oct 27, 2021

@tonistiigi I deployed a patched version with #2410 24 hours ago. There has been no occurrence of a CacheChains Marshal() operation trying to normalize items from a different CacheChains, compared to a couple of them per hour before.

It's very likely your patch fixed this issue, meaning the panic: assignment to entry in nil map, but probably also the fatal error: concurrent map iteration and map write errors.

We're still using a mutex around the Marshal() operation. I'll try to remove it in the next few days and/or follow the conversation about those CacheKey issues in #2041 (comment).
