
Networking between containers in custom network breaks after time #47211

Open
robd003 opened this issue Jan 25, 2024 · 14 comments

Labels
area/networking, kind/bug, status/0-triage, version/25.0

Comments

@robd003

robd003 commented Jan 25, 2024

Description

Containers are able to communicate, but at some point I start getting "no route to host" errors, seemingly at random.

Sometimes this will resolve itself, but most of the time I have to redeploy the containers or restart the system.

Reproduce

  1. Run docker-compose up -d with the following docker-compose.yml:
version: '3.7'

services:
  nginx:
    restart: unless-stopped
    build: ../nginx
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - static_files:/var/www/html

  django:
    restart: unless-stopped
    build: ../django
    env_file:
      - .env
    volumes:
      - static_files:/static
    ports:
      - "8000:8000"
    depends_on:
      - crate
      - redis

  redis:
    restart: unless-stopped
    image: redis:7.2.4
    ports:
      - "6379:6379"
    command: redis-server --save "" --appendonly no

  crate:
    restart: unless-stopped
    image: crate:5.4.7
    environment:
      - CRATE_HEAP_SIZE=8192m
    ports:
      - "4200:4200"
      - "5442:5432"
    volumes:
      - crate_data:/data
    command: crate -Cnetwork.host=_site_ -Cpath.repo=/data/backup

  events:
    restart: unless-stopped
    build: ../events
    ports:
      - "8081:8081"
    env_file:
      - .env
    volumes:
      - /home/ubuntu/sql_logs:/debug

  postgres:
    image: postgres:16-bullseye
    volumes:
      - pg_data:/var/lib/postgresql/data

volumes:
  crate_data:
  static_files:
  pg_data:
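
A quick way to spot-check container-to-container connectivity once the stack is up (a rough sketch; the service names come from the compose file above, and it assumes python is available in the django image):

docker compose exec django python -c "import socket; socket.create_connection(('redis', 6379), timeout=2); socket.create_connection(('crate', 4200), timeout=2); print('ok')"

When the bug described above hits, the connect calls should instead fail with OSError: [Errno 113] No route to host.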

Expected behavior

Network is reliable

docker version

Client: Docker Engine - Community
 Version:           25.0.1
 API version:       1.44
 Go version:        go1.21.6
 Git commit:        29cf629
 Built:             Tue Jan 23 23:09:55 2024
 OS/Arch:           linux/arm64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          25.0.1
  API version:      1.44 (minimum version 1.24)
  Go version:       go1.21.6
  Git commit:       71fa3ab
  Built:            Tue Jan 23 23:09:55 2024
  OS/Arch:          linux/arm64
  Experimental:     false
 containerd:
  Version:          1.6.27
  GitCommit:        a1496014c916f9e62104b33d1bb5bd03b0858e59
 runc:
  Version:          1.1.11
  GitCommit:        v1.1.11-0-g4bccb38
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

docker info

Client: Docker Engine - Community
 Version:    25.0.1
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.12.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.24.2
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 5
  Running: 5
  Paused: 0
  Stopped: 0
 Images: 32
 Server Version: 25.0.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: a1496014c916f9e62104b33d1bb5bd03b0858e59
 runc version: v1.1.11-0-g4bccb38
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.2.0-1018-aws
 Operating System: Ubuntu 22.04.3 LTS
 OSType: linux
 Architecture: aarch64
 CPUs: 2
 Total Memory: 3.746GiB
 Name: ip-172-30-30-160
 ID: GMOE:YHEH:24T3:2O6J:XF44:XZNY:OMFP:7RQC:7ALL:7633:PJ2O:HX2Y
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Additional Info

This is running on Ubuntu 22.04 LTS using kernel 6.2.0-1018-aws on ARM64

robd003 added the kind/bug and status/0-triage labels on Jan 25, 2024
@robmry
Contributor

robmry commented Jan 25, 2024

Hi @robd003 - thank you for reporting ...

No ideas about what could be happening at the moment. Would it be possible to boil it down to a minimal reproducible example, using standard images, so that we can take a look?

How long does it run for before you see the problem, and then how long does it take to recover if it's going to?

Do the issues correspond with anything unusual in container or system logs?

Is it all running in the background, or as a compose foreground process?

@remetegabor

Hi,
We have experienced similar symptoms that might be related to the same root cause. Hopefully this will help get closer to understanding what is happening.

Our environment:
CentOS 7, fully updated to the latest patches. One ethernet connection on the host with a simple IPv4 address.
We have multiple local-scoped bridge networks. The different networks are used for communication between containers on the host (like access to a db container), and we publish the frontend-facing services to fixed ports on the host network.
After the update to v25 there were connection errors to published ports from outside, randomly giving "No route to host" errors on TCP connect.

During troubleshooting I was able to reproduce the problem from inside the docker host itself. Connecting from the host (IP 192.168.1.1) to a locally published port via telnet 192.168.1.1 8080, sometimes the connection was established normally, sometimes there was a random delay, and sometimes I got the "No route to host" error. I tested telnet localhost 8080 as well, and in that case it worked reliably.
The problem appeared on multiple hosts with many different container images and ports.
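
A rough sketch of that repeated connection test (host IP and published port taken from the description above; nc stands in for telnet):

# repeat the connect test against the published port; under the bug some
# attempts succeed, some are delayed, and some fail with "No route to host"
for i in $(seq 1 50); do
  nc -z -v -w 3 192.168.1.1 8080
  sleep 1
done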

I also took a packet trace on eth0 during test connections from outside. Sometimes there were retries sent from the client and the connection proceeded later on; other times the initial SYN packet and its retries were unanswered for some time, after which an ICMP "Destination unreachable" was sent out of the docker host.

We did not have problems like this before the update, and undoing the update fixed the problem in all cases.

Hope this helps.

@robmry
Contributor

robmry commented Jan 25, 2024

Thank you @remetegabor ... I've had a go at reproducing with a few alpine containers pinging each other on different networks, and a couple of nginx containers with published ports, but no luck so far.
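
Roughly the kind of setup that was tried (container and network names here are illustrative, not the exact commands):

docker network create testnet1
docker network create testnet2
docker run -d --name ping1 --network testnet1 alpine sleep 3600
docker run -d --name ping2 --network testnet1 alpine sleep 3600
docker run -d --name web1 --network testnet2 -p 8080:80 nginx
# container-to-container traffic on the custom network
docker exec ping1 ping -c 3 ping2
# published port reached from the host
curl -sS http://localhost:8080/ >/dev/null && echo ok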

It'd be really good to have an example config based on standard images that reproduces the problem for you. Then we'll be able to investigate.

(Also, could you confirm you're using 25.0.1? And how long does it take to see the problem, or how frequently does it happen?)

@LouisMT

LouisMT commented Jan 25, 2024

I experienced the same problem on 4 different hosts when upgrading to 25.0.1 yesterday. 2x Debian 12 (aarch64) and 2x Ubuntu 22.04 (amd64). A few minutes after updating, containers started having trouble reaching each other. They recovered by themselves, but had trouble again a few minutes later and this repeated.

The problems disappeared for me after restarting the host, or after restarting all containers (using docker compose restart). Hope this information is of some use here!

@robd003
Author

robd003 commented Jan 25, 2024

@robmry Is there anything I can do to "snapshot" the state when the issue occurs?

Would a docker inspect of every container be useful? iptables-save output?

The issue occurs between all containers. This is most easily observed with the Django container since it's connecting to redis & cratedb, but I've also seen it with nginx connecting to django.

The most annoying part is that it's intermittent.

These issues started with the v25.0.0 release; before that everything was working perfectly. Could it be an issue with docker-proxy?

@akerouanton
Member

akerouanton commented Jan 25, 2024

@robd003 There are a few things that could help us better understand what's going on (a rough collection sketch follows this list):

  • Turn on debug logs as documented here: https://docs.docker.com/config/daemon/logs/#enable-debugging, and provide any logs emitted around the time the bug starts;
  • The output of both iptables-save and conntrack -L when this occurs;
  • You can also try running conntrack -E to see live conntrack events, until the bug appears;
  • The output of docker network inspect for the network where this bug happens (i.e. captured at the same time as the dumps above), and the IP addresses of the affected containers on that network;
  • And also: did you update any other software around the same time as the Engine (e.g. kernel, systemd, etc.)?
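
A rough collection sketch for the items above (run as root; it assumes debug logging has already been enabled per the link above, the conntrack tool is installed, and <affected-network> is a placeholder for the real network name):

# firewall and connection-tracking state, captured while the failure is happening
iptables-save > iptables-dump.txt
conntrack -L > conntrack-dump.txt 2>&1

# optionally, watch live conntrack events until the bug shows up
conntrack -E | tee conntrack-events.txt

# network state and daemon debug logs from around the same time
docker network inspect <affected-network> > network-inspect.json
journalctl -u docker.service --since "30 minutes ago" > dockerd-debug.log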

@remetegabor

@robmry my tests were done on 25.0.0. Unfortunately I won't be able to put together a minimal repro environment today; I will try to do it on Monday.

In the meantime, based on the other reports and what I saw, I don't think anything on the engine, kernel or network stack side actually goes wrong and then recovers. Packets belonging to TCP sessions are dropped randomly, and this makes higher-level apps appear to fail and recover.
The simple test of making connections and closing them immediately, which randomly results in delays and connection failures, suggests this is the root cause.

I will try to collect logs requested by @akerouanton.

@dkatheininger

I can confirm the behaviour on our side as well. Our Docker environment has experienced random networking issues since upgrading to 25.0.0, and later to 25.0.1. The problem even remains after downgrading to 24.0.7. We restarted the docker service with no effect, and restarted the whole machine and recreated the containers, still with no change.

The server is a slightly larger installation: a 72GB VM running Ubuntu 22.04.3 LTS x86_64 with kernel 5.15.0-92-generic; the physical base system is running Proxmox VE 8.1.4 x86_64.

docker info

Client: Docker Engine - Community
 Version:    24.0.7
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.11.2
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.21.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 65
  Running: 63
  Paused: 0
  Stopped: 2
 Images: 70
 Server Version: 24.0.7
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: a1496014c916f9e62104b33d1bb5bd03b0858e59
 runc version: v1.1.11-0-g4bccb38
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.15.0-92-generic
 Operating System: Ubuntu 22.04.3 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 20
 Total Memory: 70.67GiB
 Name: vm1
 ID: OSMC:FQFW:242B:3UQE:QNZY:U5L2:EQQB:D35Y:IR3M:LFPN:PCEW:KSIN
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

On this machine there's a public-facing nginx container that proxies requests to ~25 different applications, all running as docker containers. Mostly everything seems to be up and running; however, rather randomly, the nginx service is not able to connect to an internal application:

connect() failed (113: Host is unreachable) while connecting to upstream, client: 47.128.113.xxx, server: vaultwarden

Here's the example of our nginx docker-compose file:

version: "3"
services:
  nginx-proxy:
    container_name: nginx-proxy
    image: nginx:latest
    networks:
      - proxy-network
    ports:
      - 443:443/tcp
      - 80:80/tcp
    restart: always
    volumes:
      - <internal volumes>
networks:
  proxy-network:
    external: true

And this an example of one of the internal applications that are connected from the nginx service above:

version: "3.6"
services:
  vaultwarden:
    container_name: "vaultwarden"
    environment:
      - "ADMIN_TOKEN=xxx"
    image: "vaultwarden/server:latest"
    networks:
      - "proxy-network"
    restart: "unless-stopped"
    volumes:
      - "vaultwarden-data:/data"
networks:
  proxy-network:
    external: true
    name: "proxy-network"
volumes:
  vaultwarden-data:
    external: true
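
Both compose files reference an external network and an external volume, so those must already exist before either stack is brought up; a minimal sketch:

docker network create proxy-network
docker volume create vaultwarden-data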

Btw: Vaultwarden is a fork of Bitwarden, and the same issue is mentioned in their community forum:
https://community.bitwarden.com/t/docker-v25-networking-issues-self-hosted-only/62633

I hope this helps to track down the issue fast.

@akerouanton
Member

Thanks @dkatheininger for the details. Unfortunately I fear we can't do much without the iptables and conntrack dumps, and the docker debug logs I requested here: #47211 (comment). It'd be great if you could provide those as well.

@dkatheininger

Thank you @akerouanton for your quick response. We'll need some time to provide the iptables and conntrack output.
As suggested in your comment, we also ran docker network inspect proxy-network and analysed the output.
We found duplicate MAC addresses, which appears to be odd. Here's an example:

[
    {
        "Name": "proxy-network",
        "Id": "521df80402ad53ea66e38c1a19863ed841cc0e46b559f58d1bf5f057e03a1e6f",
        "Created": "2022-12-13T09:38:48.247757267Z",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "172.19.0.0/16",
                    "Gateway": "172.19.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            ...
            "739edae05c7a76db8b1d75b5b1e2837ee435a3af7a3d8d19fa2e66da9bc91507": {
                "Name": "container16",
                "EndpointID": "5bcdbd8b7c3f2edea430646fb7218b2da0bca8865b32a03fde0c31d617109bcb",
                "MacAddress": "02:42:ac:13:00:18",
                "IPv4Address": "172.19.0.24/16",
                "IPv6Address": ""
            },

            ...

            "a390a500c776ad011898bf0e6a662ba3e2caffc8e535ae7e4c70c1a4648d07aa": {
                "Name": "container27",
                "EndpointID": "b7f3d2a4688dbafa35b27bc9a950542c6588688a860890472b803b2486c99bc8",
                "MacAddress": "02:42:ac:13:00:18",
                "IPv4Address": "172.19.0.18/16",
                "IPv6Address": ""
            }
            ...
        },
        "Options": {},
        "Labels": {}
    }
...

We even found multiple duplicates. When we recreated the affected containers so that they received unique MAC addresses, the problems seemed to disappear.
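
A quick way to check a network for duplicate MAC addresses (a sketch over the same docker network inspect data; GNU sort/uniq assumed, since the MAC is the first 17 characters of each output line):

docker network inspect proxy-network \
  --format '{{range .Containers}}{{.MacAddress}} {{.Name}}{{println}}{{end}}' \
  | sort | uniq -D -w 17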

@dkatheininger

@akerouanton another observation we made:
we are running 5 servers, all of them with Docker 25.0.1 and the same nginx / backend application setup as described above, but only two of them showed the networking issue. The affected systems are running Ubuntu 22.04.3 LTS; the ones that weren't affected are running Debian 12 (bookworm).

@robmry
Contributor

robmry commented Jan 26, 2024

Thank you @dkatheininger ... I think the duplicate mac addresses you found were the clue we needed.

There was a bug in 25.0.0 (#45905) that could result in duplicate MAC addresses when one container was stopped, another started, then the first was restarted. We thought we'd fixed it in 25.0.1 (#47168) ... but duplicate MAC addresses created by 25.0.0 are retained when the containers are started with 25.0.1.

I think the only workaround for that is to re-create the containers with 25.0.1, which shouldn't generate duplicate addresses.

Because of the way the configuration and running state of the container are stored, we may not be able to remove the need for that re-creation once duplicate MAC addresses have been generated.
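
For compose-managed containers, that re-creation can be forced with something like the following (a sketch; individual affected services can also be listed at the end of the command):

# force fresh containers, and therefore freshly generated MAC addresses, on 25.0.1
docker compose up -d --force-recreate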

Another problem is that 25.0.1 doesn't respect a --mac-address option over a container restart.

So - still investigating, but it would be good to know if this description fits with what's been observed, and whether re-creating the containers solves the problem in all of these cases.

@dkatheininger

@robmry we can confirm that your description of the issue fits what we observed, and the mitigation worked: recreating the affected containers with 25.0.1 prevented the duplicate MAC addresses.

@robmry
Contributor

robmry commented Jan 29, 2024

Hi @dkatheininger ... thank you for confirming.
