ERROR: failed to solve: DeadlineExceeded: context deadline exceeded #4327
@crazy-max I remember you made some fixes in this area, does this look related? Although I think your fixes were for copying local images while this looks like copying files. @epasveer Is there a reason you are exporting 10+ GB with the local exporter? The size itself shouldn't be limited, but maybe there is a better solution for what you are doing. We have seen issues with grpc streams locking up (grpc/grpc-go#4722) on big local exports, but that looks to be a different case.
This name comes from the Go stdlib: it is the error reported by a context created with a deadline or timeout once that deadline passes.
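A minimal stdlib-only illustration of where that string comes from (no BuildKit code involved):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

func main() {
	// Any context created with a deadline/timeout...
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	<-ctx.Done() // wait for the deadline to pass

	// ...then reports context.DeadlineExceeded, whose Error() string is
	// exactly the "context deadline exceeded" seen in this issue.
	fmt.Println(ctx.Err())                                       // context deadline exceeded
	fmt.Println(errors.Is(ctx.Err(), context.DeadlineExceeded)) // true
}
```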
You recall correctly; the fix was in buildkit's exporter/local/export.go (line 150 in 4c93208).
@crazy-max IIUC that function isn't really called in local export, as that would call
@epasveer could you share your buildkitd logs? There might be something relevant in there (e.g. some errors get full stack traces, which could be quite useful if that's the case here).
First, let me thank everyone for replying. (I'm a "docker build" newbie).
To be honest, I don't think we are running that daemon anywhere. We enable buildkit when we run "docker build". The first build builds the base image. The second build uses that and compiles/builds our code base. I'll include the two Dockerfiles in the following comments.
BASE Dockerfile:
COMPILE code base Dockerfile. Building this image with "docker build" sometimes fails after executing the last COPY:
It's likely the result of the codebase compile/build that we are doing inside the container.
Hey there,
we run dind and build and push images with pretty large layers in parallel. It happens sporadically, but mostly on the servers where the disks are sluggish.
Hi. I'm happy (maybe not) to see some others having the same problem. @mzihlmann, what did you do to enable the stacktrace? We have one of our GitHub runners using a newer version of Docker. Same occasional error.
We are using an older version of CentOS with an older kernel. I'm told we are still using VFS because of this. Not sure if that matters.
I saw @tonistiigi suggesting it before, simple as that:

```diff
- docker buildx bake -f docker-bake.hcl
+ docker --debug buildx bake -f docker-bake.hcl
```
Hi @tonistiigi, does the below suggest that Docker is contacting external sites, or am I misinterpreting it? Thanks.
It's trying to upload the image/cache to the registry; telemetry is probably just a hook, not really used. The gist is that the (filesystem) garbage collector kicks in at a certain point, which makes everything slow to a crawl, leading to timeouts like the above. One option to overcome it is to disable the garbage collector in the builder:

```diff
  before_script:
    - docker context create <name>
-   - docker buildx create --use --name <name> <name>
+   - docker buildx create --buildkitd-flags '--oci-worker-gc=false' --use --name <name> <name>
  script:
    - docker buildx bake -f docker-bake.hcl
  after_script:
    - docker buildx prune --filter="until=1h" --force
```

Still need to verify whether the issue is now gone for good though; I can tell you after a few days.
Thanks for the added info. I'll see if we can try out what you suggest.
I am facing a similar issue; the build fails intermittently with context deadline exceeded. Please let me know if these tweaks solve the issue for y'all. Thanks.
@epasveer Is this the only stacktrace you see with this error? This is the client-side trace, but "failed to solve" errors should usually include the daemon-side stacktrace as well. If this is GC-related like @mzihlmann suggests, then GC invocations are logged to the daemon log when debug logging is enabled.
Hi @tonistiigi, the stacktrace was included by @mzihlmann; I only included it as a reference. We added the '--debug' flag this past Friday. Hopefully we'll see some results tomorrow, which I will post.
So, with the --debug flag, we did encounter a "deadline exceeded".
To confirm, you mean the dockerd log files?
The errors would be in the dockerd log if
The issue still pops up on my side; I will now add the suggested debug flags.
It still seems to run garbage collection somehow? I was not yet able to get a stacktrace from the dockerd sidecar.
I also saw this issue hinting at a kernel bug: #1459 (comment).
Here is another log with a different error message:
I also encountered this error recently when running
This is really annoying:
The docker bake logs say:
While the underlying cause of this error still isn't known, we're making progress on investigating it in #4457; at least with that merged, the error messages in the logs should be more helpful.
Super! :) I'm getting rather annoyed having to restart 50% of the builds manually.
I'm facing a similar issue on my Ubuntu 20.04.6 LTS servers.
I disabled garbage collection and enabled debug at the buildkit level.
I use the command below to build it:
Stack trace/error:
At the buildkit container level I get:
I tried quite a lot of things already:
For us the problem happens when, as a result of the publishresults step, we have > 450 MB of data to copy/send; it happens in 98% of such cases. For other apps, where we have ~100-200 MB of data to transfer, this problem happens very rarely on the same OS/Docker/BuildKit configuration, in ~2% of cases. Any hints will be really appreciated! :)
Does this relate? Just found it when scanning through the other issues.
Any progress on how to track down these problems? 50% of my builds require a manual restart and then succeed on the second try... it is VERY annoying.
0.12.5 gives the exact same useless error message:
We are encountering the same or a very similar error. On the client side we get the following error message:
On the server side we get the following error message:
Is this something that can be solved or at least mitigated by modifying the configuration file or providing the instance that buildkitd is running on with additional resources?
This was quite insightful. Having the same issue: a large number of containers, lots of image pulls and disk I/O, but the underlying disk is EBS, so throughput-limited. With 7 different instances per build, there would always be around 2 or so that failed => 0% build success rate. Switched to instance storage (at least for these big things) and we're back in business. I've taken on @tonistiigi's suggestion, and I am happy to try whatever debug builds are necessary to gather any more information. Granted, if you're underprovisioned it's not going to be fun in any case, even if this bug is solved.
See my comment above, but in any case I would have a look at the throughput of the disks where buildkit is running. If this is running on ephemeral infra, i.e. build nodes, check whether you're running on provisioned network disks, e.g. AWS EBS; they have throughput limits (gp2: min 125 MB/s, scaling with size to a max of 250 MB/s; gp3: min 128 MB/s, which can be independently scaled to 1000 MB/s but costs more). Between pulling images, building images, and transferring context, you reach the low limits quite quickly! We concluded that there is no point provisioning said disk since we'll throw it away, and instead we just use whatever ephemeral storage exists on the host, as that will typically have significantly higher throughput limits, at the expense of not being backed up (it's an ephemeral build, so we don't care).
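To put a number on that: at gp2's 125 MB/s floor, a single 10 GB local export like the one that opened this thread needs 10,000 MB / 125 MB/s = 80 s of dedicated throughput, before concurrent pulls, cache writes, or layer extraction compete for the same budget.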
We still face the same issue from time to time, but it is mostly mitigated on our side through other measures. We realized that because we use buildx with a docker-container builder, it was not using our override to enable 10 Gbps networking on our servers, so that explains why uploading was slow in the first place. Also, we fixed our docker builds to rebuild far fewer layers on changes. All in all, the failure rate went down from 10% to 0.6%, which is OK-ish. I'm still convinced it's related to this issue here: #3966
Hi, I'm having a similar error, but in a different situation. The SSH daemon is running on port 32, so maybe that's the reason, I don't know. I tried creating it like:

```
docker buildx create --name wdsr --driver remote --bootstrap --use ssh://ubuntu@remote_ip_adress:32
```

I also tried adding my remote host to my SSH config, so I can connect freely:

```
docker buildx create --name wdsr --driver remote --bootstrap --use ssh://wdsr
```

It fails with the same error.
Also, I have key-only SSH authorization, and maybe I'm overlooking something, but for example if I do builds that require SSH to clone a git repo, I always have to provide the SSH agent socket like this:

```
docker buildx bake -f ./build-stage.compose.yaml --set *.platform=linux/amd64 --set *.ssh=default=$SSH_AUTH_SOCK --pull --push --progress plain --no-cache
```

And it seems strange to me that I don't have to do that when creating a remote builder over SSH. If this functionality (remote SSH builder) is actually supported, it's quite frustrating that there is no documentation on how to use it. My machine has
Seeing the same
Still not solved in 13.1, and still the same useless error message:
docker buildx build
We have been facing this issue for a while now. Our setup uses pantbuild to build and publish multiple Docker images in our monorepo. With high concurrency in place, we were seeing this issue erratically on our CI runners.
A couple of others on my team and I have experienced this for a while now. After seeing the reply from @rajeshwar-nu, I took a poll of my team to see how everyone had their VM disks configured. It turns out that only those of us who were using thin-provisioned disks were having the issue. I replaced my disk with a preallocated one and the errors are gone.
Same issue -- absolutely no problems locally, but I constantly hit this when running bake in a GHA runner. Based on the reports here and in other issues, I'm guessing that the resource constraints of a GHA runner are to blame. Is there a timeout value somewhere that could be exposed as configuration? I would expect that when the environment is resource-constrained, the result would be a slower build, not a sporadic error.
Can you get a stacktrace for this error with
It keeps going with the
@davhdavh This is a timeout on saving the build history record. The build has already completed by that point, including the export (which may mean some other reports in here are different cases). Normally, saving this record should take maybe 0.1 sec at most, so it is not entirely obvious why it reaches the 20 sec timeout (during that time you would see your build progress complete, but the process does not return). If you can debug this more: during that 20 sec while the process is waiting, can you capture the stacktrace of the process via SIGQUIT or
For the "failed to kill process in container id" error, do you have steps that reproduce this? Is this happening before the first timeout? Not sure if the two are related.
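As an aside, for anyone unfamiliar with the SIGQUIT trick: a Go binary that keeps the runtime's default SIGQUIT handling dumps every goroutine's stack on receiving the signal, which is the kind of trace being asked for here (assuming the buildkitd build in question hasn't installed its own handler). A toy program to see the behavior:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	// Print our pid, then block; from another shell run `kill -QUIT <pid>`
	// and the Go runtime prints all goroutine stacks before exiting.
	fmt.Printf("pid %d - try: kill -QUIT %d\n", os.Getpid(), os.Getpid())
	for {
		time.Sleep(time.Second)
	}
}
```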
Well, there are several factors here:
So I think what ends up happening is that the read/write activity of the other builds causes the final write to stall so much that you end up with a write timeout. This is also consistent with the reports from others saying the error rate drops massively when they provision faster drives. A 20 sec stall on an old HDD that is doubly virtualized is not unexpected. I would recommend the quick fix of just setting the timeout to 3-5 minutes instead of 20 sec.
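A minimal sketch of the kind of change proposed here; `slowWrite` and the fixed 25 s stall are stand-ins, not BuildKit's actual history-store API, and durations are scaled to milliseconds so the demo runs instantly:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// slowWrite stands in for the history-record write stalling behind other
// disk traffic; both the name and the fixed stall are illustrative.
func slowWrite(ctx context.Context, stall time.Duration) error {
	select {
	case <-time.After(stall):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	// Read milliseconds as seconds: a 25s disk stall blows the 20s budget
	// ("context deadline exceeded") but fits comfortably in a 300s one.
	for _, budget := range []time.Duration{20 * time.Millisecond, 300 * time.Millisecond} {
		ctx, cancel := context.WithTimeout(context.Background(), budget)
		fmt.Printf("budget=%v err=%v\n", budget, slowWrite(ctx, 25*time.Millisecond))
		cancel()
	}
}
```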
It highly correlates with the "failed to kill process" error. As for the cause, I think it is dotnet's built-in build server, which somehow manages to resist a kill command when running inside a container in a nonroot buildkitd container.
A 20 sec stall on an old platter HDD is not unexpected. Nor is a 20 s write on a low-tier cloud-environment drive. 300 s should, however, be enough to tell the difference between crappy hardware and actual failure. fixes moby#4327 Signed-off-by: Dennis Haney <davh@davh.dk>
Sorry for the vague error message.
We have a docker image that runs some things. Most times, it runs fine. Occasionally, it errors at the very end with a vague error message that is not from the yml file, but that I think comes from the docker image exiting/ending. (All our commands run fine in the container. It is failing after the last one completes.)
We run the image from a GitHub Action, if that matters.
The question is:
Thanks in advance.