
Intermittent faults of docker-internal DNS with IPv6 #2492

Closed
holgerpieta opened this issue Dec 19, 2019 · 10 comments

@holgerpieta

Background / Observation

I’m running a small Docker environment on a Raspberry Pi and I’ve recently enabled IPv6 (because it's nearly 2020).

For various reasons (isolation, daily changing IPv6 prefix from my ISP…) I went down the road of disabling the userland proxy (via daemon.json) and adding the ipv6nat container. The IPv6 addresses for the containers are chosen from a randomly selected ULA prefix (fd00::/8).
After convincing a number of containers (node-red, influx…) to actually use IPv6 (because IPv6 is such a new technology, only 20 years old), they are able to reach each other and can be reached from outside via either the IPv4 or the IPv6 address of the host machine. DNS resolution in my home network also works as expected.
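For reference, a minimal sketch of what such a setup could look like (hedged; the issue only mentions daemon.json for the userland proxy, everything else here is an assumption, and the ULA subnet is the one visible in the logs below):

/etc/docker/daemon.json:
{
  "userland-proxy": false
}

docker network create --ipv6 --subnet fdc0:655:b8ac:6cee::/64 smart-home-net

With ipv6nat handling NAT for that ULA range, the daily changing ISP prefix never needs to reach the containers.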

However, a somewhat fishy problem surfaced: The node-red container complains roughly every 10 minutes that it couldn’t reach the influx container due to a DNS resolution error (getaddrinfo ENOTFOUND). It tries to access it every minute, with about 10 concurrent requests each time. It doesn’t always fail, but in about 1 % of the trials DNS resolution fails, seemingly at random.

Neither the system itself nor the network is under any significant load.

Debugging

To debug, I did two things: logged into the container to run nslookup manually, and at the same time watched the Docker daemon logs with debug mode enabled. Here are the results:

Normal, expected case
Command line:

~ $ nslookup influx
nslookup: can't resolve '(null)': Name does not resolve
Name: influx
Address 1: 172.18.0.5 smart-home_influx_1.smart-home_smart-home-net
Address 2: fdc0:655:b8ac:6cee::5 smart-home_influx_1.smart-home_smart-home-net

Logs:

Dec 15 19:24:50 raspberrypi dockerd[22914]: time="2019-12-15T19:24:50.547353704+01:00" level=debug msg="Name To resolve: influx."
Dec 15 19:24:50 raspberrypi dockerd[22914]: time="2019-12-15T19:24:50.547496905+01:00" level=debug msg="[resolver] lookup for influx.: IP [172.18.0.5]"
Dec 15 19:24:50 raspberrypi dockerd[22914]: time="2019-12-15T19:24:50.548540662+01:00" level=debug msg="Name To resolve: influx."
Dec 15 19:24:50 raspberrypi dockerd[22914]: time="2019-12-15T19:24:50.548655456+01:00" level=debug msg="[resolver] lookup for influx.: IP [fdc0:655:b8ac:6cee::5]"
Dec 15 19:24:50 raspberrypi dockerd[22914]: time="2019-12-15T19:24:50.550023928+01:00" level=debug msg="IP To resolve 5.0.18.172"
Dec 15 19:24:50 raspberrypi dockerd[22914]: time="2019-12-15T19:24:50.550145093+01:00" level=debug msg="[resolver] lookup for IP 5.0.18.172: name smart-home_influx_1.smart-home_smart-home-net"
Dec 15 19:24:50 raspberrypi dockerd[22914]: time="2019-12-15T19:24:50.551397179+01:00" level=debug msg="IP To resolve 5.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.e.e.c.6.c.a.8.b.5.5.6.0.0.c.d.f"
Dec 15 19:24:50 raspberrypi dockerd[22914]: time="2019-12-15T19:24:50.551562583+01:00" level=debug msg="[resolver] lookup for IP 5.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.e.e.c.6.c.a.8.b.5.5.6.0.0.c.d.f: name smart-home_influx_1.smart-home_smart-home-net"

Half-broken case
Every now and then one of the following happens, i.e. either the IPv4 or the IPv6 address is missing:

~ $ nslookup influx
nslookup: can't resolve '(null)': Name does not resolve
Name: influx
Address 1: 172.18.0.5 smart-home_influx_1.smart-home_smart-home-net

~ $ nslookup influx
nslookup: can't resolve '(null)': Name does not resolve
Name: influx
Address 1: fdc0:655:b8ac:6cee::5

Logs:

Dec 15 19:23:54 raspberrypi dockerd[22914]: time="2019-12-15T19:23:54.352789971+01:00" level=debug msg="Name To resolve: influx."
Dec 15 19:23:54 raspberrypi dockerd[22914]: time="2019-12-15T19:23:54.353013948+01:00" level=debug msg="[resolver] lookup for influx.: IP [172.18.0.5]"
Dec 15 19:23:54 raspberrypi dockerd[22914]: time="2019-12-15T19:23:54.353400015+01:00" level=debug msg="Name To resolve: influx."
Dec 15 19:23:54 raspberrypi dockerd[22914]: time="2019-12-15T19:23:54.353554437+01:00" level=debug msg="[resolver] lookup for influx.: IP [fdc0:655:b8ac:6cee::5]"
Dec 15 19:23:54 raspberrypi dockerd[22914]: time="2019-12-15T19:23:54.355044500+01:00" level=debug msg="IP To resolve 5.0.18.172"
Dec 15 19:23:54 raspberrypi dockerd[22914]: time="2019-12-15T19:23:54.355546509+01:00" level=debug msg="[resolver] lookup for IP 5.0.18.172: name smart-home_influx_1.smart-home_smart-home-net"

Dec 15 19:25:24 raspberrypi dockerd[22914]: time="2019-12-15T19:25:24.075780659+01:00" level=debug msg="Name To resolve: influx."
Dec 15 19:25:24 raspberrypi dockerd[22914]: time="2019-12-15T19:25:24.076029617+01:00" level=debug msg="[resolver] lookup for influx.: IP [fdc0:655:b8ac:6cee::5]"
Dec 15 19:25:24 raspberrypi dockerd[22914]: time="2019-12-15T19:25:24.075807307+01:00" level=debug msg="Name To resolve: influx."
Dec 15 19:25:24 raspberrypi dockerd[22914]: time="2019-12-15T19:25:24.076460497+01:00" level=debug msg="[resolver] lookup for influx.: IP [172.18.0.5]"
Dec 15 19:25:24 raspberrypi dockerd[22914]: time="2019-12-15T19:25:24.077406423+01:00" level=debug msg="IP To resolve 5.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.e.e.c.6.c.a.8.b.5.5.6.0.0.c.d.f"
Dec 15 19:25:24 raspberrypi dockerd[22914]: time="2019-12-15T19:25:24.077777971+01:00" level=debug msg="[resolver] lookup for IP 5.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.e.e.c.6.c.a.8.b.5.5.6.0.0.c.d.f: name smart-home_influx_1.smart-home_smart-home-net"

Error case
The worst case is the following. It happens less frequently, but still in about 1 % of the cases, and it causes a failed connection between containers:

~ $ nslookup influx
nslookup: can't resolve '(null)': Name does not resolve
nslookup: can't resolve 'influx': Name does not resolve

Logs:

Dec 15 19:22:27 raspberrypi dockerd[22914]: time="2019-12-15T19:22:27.002926895+01:00" level=debug msg="Name To resolve: influx."
Dec 15 19:22:27 raspberrypi dockerd[22914]: time="2019-12-15T19:22:27.005502807+01:00" level=debug msg="[resolver] lookup for influx.: IP [172.18.0.5]"
Dec 15 19:22:27 raspberrypi dockerd[22914]: time="2019-12-15T19:22:27.003265666+01:00" level=debug msg="Name To resolve: influx."
Dec 15 19:22:27 raspberrypi dockerd[22914]: time="2019-12-15T19:22:27.005844967+01:00" level=debug msg="[resolver] lookup for influx.: IP [fdc0:655:b8ac:6cee::5]"

Causes / Interpretation

Well, I'm no expert on Docker networking, but if things fail rarely and randomly, it feels like a race condition to me.
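To put a number on the failure rate, a simple loop from inside an affected container can help (a hedged sketch, assuming BusyBox nslookup as in the outputs above, and that it exits non-zero when resolution fails):

#!/bin/sh
# count fully failed lookups out of 1000 attempts
i=0; fails=0
while [ $i -lt 1000 ]; do
  nslookup influx > /dev/null 2>&1 || fails=$((fails + 1))
  i=$((i + 1))
done
echo "$fails of $i lookups failed completely"

This only catches the fully broken case; detecting the half-broken case would additionally require checking the output for both an IPv4 and an IPv6 address.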

@belfo

belfo commented Dec 20, 2019

I noticed the same with only IPv4:
moby/moby#40300

@holgerpieta (Author)

Not quite the same: this issue here is totally random. About 1 out of 10 lookups delivers only one of the two addresses (either IPv4 or IPv6), and every once in a while it fails completely, delivering no address at all.
But the next lookup will work just fine, without any action in between. Restarts aren't needed, but they also don't help.

@holgerpieta (Author)

I've switched to podman, so I do not care anymore. Closing it now.

@kingfisher77

We experience the same issue in a dual-stack NAT environment. From time to time the resolution fails. We work around it by using the v4 address directly (v6 is also working).

Right now I have the impression that nginx is part of the game: the nginx resolver is 127.0.0.11, while the v6 resolver is the v6 resolver of the Docker host, which does not know anything about Docker's internal service names. Can this be the reason?

@kingfisher77

Yes, after removing the IPv6 resolver (the Docker host's resolver) from the nginx configuration, the frequent 502 errors disappeared.
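A hedged sketch of what that change could look like in the nginx configuration (the IPv6 address is only a placeholder for the Docker host's resolver, it is not taken from this issue):

# before: nginx asks both Docker's embedded DNS and the host's IPv6 resolver;
# the host resolver cannot answer for Docker service names
resolver 127.0.0.11 [fd00::1] valid=10s;

# after: only Docker's embedded DNS is used for upstream name resolution
resolver 127.0.0.11 valid=10s;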

@jdannenberg

jdannenberg commented Apr 1, 2023

I am experiencing the same issue with IPv6 enabled. However, I think that @belfo is right and it also affects IPv4. The resolv.conf of my containers looks as follows:

search example.org
nameserver 127.0.0.11
nameserver fd99:8053::1
options ndots:0

The 127.0.0.11 nameserver is Docker's "internal" one, and it responds with the IPv4 and IPv6 addresses of external hosts and of other containers in the same network. The latter is my router's IPv6 DNS (see moby/moby#41651), whose answer for service names is naturally always NXDOMAIN.

However, as I am seeing NXDOMAIN from time to time for other Docker services in the same network, I would also conclude - with my limited understanding - that 127.0.0.11 sometimes fails for BOTH IPv4 and IPv6. After all, either an IPv4 or an IPv6 DNS answer should be enough for my services to connect, right?
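One way to narrow this down is to query each nameserver from resolv.conf separately (a hedged sketch; it assumes dig is installed in the container and uses the hypothetical service name myservice):

dig +short @127.0.0.11 myservice A       # Docker's embedded DNS: container IPv4
dig +short @127.0.0.11 myservice AAAA    # Docker's embedded DNS: container IPv6
dig +short @fd99:8053::1 myservice A     # router's DNS: expected to return nothing (NXDOMAIN)

If the first two queries occasionally come back empty when run in a tight loop, that would point at 127.0.0.11 itself rather than at the second nameserver.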

docker info:

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.10.2
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.16.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose
  scan: Docker Scan (Docker Inc.)
    Version:  v0.23.0
    Path:     /usr/libexec/docker/cli-plugins/docker-scan

Server:
 Containers: 24
  Running: 24
  Paused: 0
  Stopped: 0
 Images: 26
 Server Version: 23.0.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 2456e983eb9e37e47538f59ea18f2043c9a73640
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.10.0-20-amd64
 Operating System: Debian GNU/Linux 11 (bullseye)
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 9.727GiB
 Name: server-prod-docker-1
 ID: <sanitized, since I don't know what it is>
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Experimental: true
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

@Marc-Berg

Marc-Berg commented Apr 10, 2023

I am experiencing the same issue with IPv6 enabled. However, I think that @belfo is right and it also affects IPv4. The resolv.conf of my containers looks as follows:

search example.org
nameserver 127.0.0.11
nameserver fd99:8053::1
options ndots:0

The 127.0.0.11 nameserver is Docker's "internal" one, and it responds with the IPv4 and IPv6 addresses of external hosts and of other containers in the same network. The latter is my router's IPv6 DNS (see moby/moby#41651), whose answer for service names is naturally always NXDOMAIN.

However, as I am seeing NXDOMAIN from time to time for other Docker services in the same network, I would also conclude - with my limited understanding - that 127.0.0.11 sometimes fails for BOTH IPv4 and IPv6.

I don't think so. Both name servers are queried in parallel. If the response from your router arrives first, DNS resolution of the internal Docker names fails. It's a race condition.

@jdannenberg

I don't think so. Both name servers are queried in parallel. If the response from your router arrives first, DNS resolution of the internal Docker names fails. It's a race condition.

Are they really queried in parallel?

From the resolv.conf(5) manpage:

If there are multiple servers, the resolver library queries them in the order listed. If no nameserver entries are present, the default is to use the name server on the local machine. (The algorithm used is to try a name server, and if the query times out, try the next, until out of name servers, then repeat trying all the name servers until a maximum number of retries are made.)

@Marc-Berg

Are they really queried in parallel?

This is how I understand this comment: moby/moby#41651 (comment)

@jdannenberg

jdannenberg commented Apr 10, 2023

Are they really queried in parallel?

This is how I understand this comment: moby/moby#41651 (comment)

Oh wow, that explains why only a subset of services is affected. I will check whether those are Alpine-based.

musl then behaves differently from glibc.
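A quick way to check which libc an image uses (a hedged sketch; mycontainer is a placeholder):

# Alpine (musl-based) images ship a release file and the musl dynamic loader
docker exec mycontainer sh -c 'cat /etc/alpine-release 2>/dev/null; ls /lib/ld-musl-* 2>/dev/null'
# glibc-based images have neither; getconf (where present) reports the glibc version
docker exec mycontainer sh -c 'getconf GNU_LIBC_VERSION 2>/dev/null'

musl's resolver sends the query to all configured nameservers in parallel and takes the first answer, while glibc tries them sequentially by default, which matches the race condition described above.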
