Networking between containers in custom network breaks after time #47211
Comments
Hi @robd003 - thank you for reporting ... No ideas about what could be happening at the moment. Would it be possible to boil it down to a minimal reproducible example, using standard images, so that we can take a look? How long does it run before you see the problem, and then how long does it take to recover, if it does? Do the issues correspond with anything unusual in container or system logs? Is it all running in the background, or as a compose foreground process?
Hi. Our environment: …

During troubleshooting I was able to reproduce the problem from inside the Docker host. To reproduce it, I connected from the host (IP 192.168.1.1) to a locally published port. I also took a packet trace on eth0 during a test connection from outside; there I saw that sometimes retries were sent from the client and the connection proceeded later on, while other times the initial SYN packet and its retries went unanswered for some time, after which an ICMP "Destination unreachable" was sent out of the Docker host. We did not have problems like this before the update, and undoing the update fixed the problem in all cases. Hope this helps.
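A minimal sketch of the kind of loopback test and packet capture described above. The host IP is from the report, but the port number and interface usage are illustrative assumptions:

```bash
# Repeatedly probe a locally published port from the Docker host itself
# (8080 is a hypothetical published port, not from the report).
while true; do
  curl -sS -o /dev/null --max-time 5 http://192.168.1.1:8080/ \
    || echo "$(date -Is) connection failed"
  sleep 1
done

# In a second shell, capture SYNs, retransmissions, and ICMP errors on eth0:
tcpdump -ni eth0 'tcp port 8080 or icmp'
```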
Thank you @remetegabor ... I've had a go at reproducing, with a few alpine containers pinging each other on different networks, and a couple of nginx containers with published ports - but, no luck. It'd be really good to have an example config based on standard images that reproduces the problem for you. Then we'll be able to investigate. (Also, could you confirm you're using 25.0.1? And, how long it takes to see the problem, or how frequently it's happening?)
I experienced the same problem on 4 different hosts when upgrading to 25.0.1 yesterday: 2x Debian 12 (aarch64) and 2x Ubuntu 22.04 (amd64). A few minutes after updating, containers started having trouble reaching each other. They recovered by themselves, but had trouble again a few minutes later, and this repeated. The problems disappeared for me after restarting the host, or after restarting all containers.
@robmry Is there anything I can do to "snapshot" the state when the issue occurs? Would a `docker inspect` of every container be useful? `iptables-save` output? The issue occurs between all containers. This is most easily observed with the Django container, since it connects to redis & cratedb, but I've also seen it with nginx connecting to django. The most annoying part is that it's intermittent. These issues started with the v25.0.0 release; before that, everything was working perfectly. Could it be an issue with …?
@robd003 There are a few things that could help us better understand what's going on: …
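Based on the follow-up request later in the thread (#47211 (comment)), the state to capture amounts to iptables and conntrack dumps plus daemon debug logs. A minimal sketch of collecting such a snapshot, assuming a systemd host with `conntrack-tools` installed:

```bash
# Dump the full iptables ruleset (covers the nat and filter chains Docker manages)
iptables-save > iptables.dump

# Dump the kernel connection-tracking table (requires conntrack-tools)
conntrack -L > conntrack.dump

# Capture per-container network configuration
docker ps -q | xargs docker inspect > containers.json

# Daemon logs (the unit name and log location may differ per distro)
journalctl -u docker.service --since "1 hour ago" > dockerd.log
```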
@robmry my tests were done on 25.0.0. Unfortunately I won't be able to put together a minimal repro environment today; I will try to make one on Monday. In the meantime, based on the other reports and what I saw, I don't think anything on the engine, kernel, or network-stack side is actually going wrong and then recovering: packets belonging to TCP sessions are dropped at random, and that makes the higher-level apps appear to fail and recover. I will try to collect the logs requested by @akerouanton.
I can confirm the behaviour on our side as well. Our Docker environment has experienced random networking issues since upgrading to 25.0.0, and later to 25.0.1. The problem remains even after downgrading to 24.0.7. We restarted the docker service with no effect, and restarted the whole machine and recreated the containers, still with no change. The server is a slightly larger installation: a 72GB VM running Ubuntu 22.04.3 LTS x86_64 with kernel 5.15.0-92-generic; the physical base system runs Proxmox VE 8.1.4 x86_64.
docker info
On this machine there's a public-facing nginx container that proxies requests to ~25 different applications, all running as Docker containers. Mostly everything seems up and running; however, rather randomly, the nginx service is not able to connect to an internal application.
Here's an example of our nginx docker-compose file:
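A minimal hypothetical sketch of such a proxy compose file; the image tag, service name, and the shared network name `proxy-net` are illustrative assumptions, not the reporter's actual configuration:

```yaml
# Hypothetical sketch - names and network are illustrative assumptions.
services:
  nginx:
    image: nginx:1.25
    ports:
      - "80:80"
      - "443:443"
    networks:
      - proxy-net            # shared user-defined network the apps also join

networks:
  proxy-net:
    external: true           # e.g. created with: docker network create proxy-net
```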
And this is an example of one of the internal applications that the nginx service above connects to:
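A matching hypothetical sketch for an internal application on the same shared network; Vaultwarden is used here only because the thread mentions it as one affected application:

```yaml
# Hypothetical sketch - the service and image are illustrative assumptions.
services:
  vaultwarden:
    image: vaultwarden/server:latest
    networks:
      - proxy-net            # same external network as the nginx proxy

networks:
  proxy-net:
    external: true
```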
Btw: Vaultwarden is an unofficial Bitwarden-compatible server, and the same issue is mentioned in its community forum: … I hope this helps to track down the issue fast.
Thanks @dkatheininger for the details. Unfortunately, I fear we can't do much without the iptables and conntrack dumps and docker debug logs I requested here: #47211 (comment). It'd be great if you could provide those as well.
Thank you @akerouanton for your quick response. We'll need some time to provide iptables and conntrack output.
We even found multiple duplicate MAC addresses. When we recreated the affected containers so that they received unique MAC addresses, the problems seemed to disappear.
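A sketch of one way to spot such duplicates, listing each running container's MAC address via `docker inspect` Go templates (GNU `sort`/`uniq` assumed; the template assumes each container is attached to a single network):

```bash
# Print "name: MAC" for each running container, group by MAC, and show
# only the lines whose MAC appears more than once.
docker ps -q \
  | xargs docker inspect -f '{{.Name}}: {{range .NetworkSettings.Networks}}{{.MacAddress}}{{end}}' \
  | sort -t: -k2 \
  | uniq -D -f1
```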
@akerouanton another observation we made: …
Thank you @dkatheininger ... I think the duplicate MAC addresses you found were the clue we needed.

There was a bug in 25.0.0 (#45905) that could result in duplicate MAC addresses when one container was stopped, another started, and then the first was restarted. We thought we'd fixed it in 25.0.1 (#47168), but duplicate MAC addresses created by 25.0.0 are retained when the containers are started with 25.0.1. I think the only workaround for that is to re-create the containers with 25.0.1, which shouldn't generate duplicate addresses. Because of the way the configuration and running state of the container are stored, we may not be able to remove the need for that re-creation once duplicate MAC addresses have been generated.

Another problem is that 25.0.1 doesn't respect a user-configured MAC address.

So, still investigating, but it would be good to know if this description fits with what's been observed, and whether re-creating the containers solves the problem in all of these cases.
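A sketch of the re-creation workaround described above, for compose-managed services (`--force-recreate` is a standard compose flag; the container name below is a placeholder):

```bash
# Re-create compose services so fresh endpoints (and MAC addresses) are generated,
# rather than reusing the stored config from 25.0.0.
docker compose up -d --force-recreate

# For containers started with `docker run`, remove and start them again:
docker rm -f my-app                # placeholder name
docker run -d --name my-app ...    # re-run with the original options and image
```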
@robmry we can confirm that your description of the issue fits and the mitigation works: recreating the affected containers with 25.0.1 prevented the duplicate MAC addresses.
Hi @dkatheininger ... thank you for confirming.
Description
Containers are able to communicate at first, but at some point I start getting "no route to host" errors, seemingly at random.
Sometimes this will resolve itself, but most of the time I have to redeploy the containers or restart the system.
Reproduce
Expected behavior
Networking between containers remains reliable over time.
docker version
```
Client: Docker Engine - Community
 Version:           25.0.1
 API version:       1.44
 Go version:        go1.21.6
 Git commit:        29cf629
 Built:             Tue Jan 23 23:09:55 2024
 OS/Arch:           linux/arm64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          25.0.1
  API version:      1.44 (minimum version 1.24)
  Go version:       go1.21.6
  Git commit:       71fa3ab
  Built:            Tue Jan 23 23:09:55 2024
  OS/Arch:          linux/arm64
  Experimental:     false
 containerd:
  Version:          1.6.27
  GitCommit:        a1496014c916f9e62104b33d1bb5bd03b0858e59
 runc:
  Version:          1.1.11
  GitCommit:        v1.1.11-0-g4bccb38
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
```
docker info
Additional Info
This is running on Ubuntu 22.04 LTS using kernel 6.2.0-1018-aws on ARM64