
Resolving a down Swarm service from a service with dns: "127.0.0.11" results in hundreds of errors per second in syslog #47716

Open
elyulka opened this issue Apr 14, 2024 · 11 comments · May be fixed by #47744

Comments


elyulka commented Apr 14, 2024

Description

I've set up HAProxy to load balance services (following manuals, I set dns: "127.0.0.11" so that requests are not forwarded to external DNS servers) and noticed hundreds of errors per second in syslog whenever any backend service goes down:

Apr 14 15:51:08 staging-manager1 dockerd[2653338]: time="2024-04-14T15:51:08.083971294Z" level=error msg="[resolver] failed to query external DNS server" client-addr="udp:127.0.0.1:35393" dns-server="udp:127.0.0.11:53" error="read udp 127.0.0.1:35393->127.0.0.11:53: i/o timeout" question=";tasks.mon_prometheus.\tIN\t A" spanID=0a662e24539c4e08 traceID=3e0e421519bb2e7dcc60adf180880fb7

How can I avoid this log pollution without putting load on external DNS servers with queries for a service that is down?

Reproduce

  1. create a stack file docker-compose.yml:
version: '3.8'
services:
  dnstest:
    image: nicolaka/netshoot:v0.12
    dns: 127.0.0.11
    entrypoint:
      - sh
      - -c
      - 'while :; do dig non-existing; sleep 1; done'
  2. deploy by running docker stack deploy -c docker-compose.yml dnstest
  3. examine syslog, flooded with errors: tail -n10 -f /var/log/syslog

Expected behavior

Logs should not be filled with hundreds of errors querying a down service when DNS resolvers are limited to the single address 127.0.0.11.

docker version

Client: Docker Engine - Community
 Version:           26.0.0
 API version:       1.45
 Go version:        go1.21.8
 Git commit:        2ae903e
 Built:             Wed Mar 20 15:17:48 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          26.0.0
  API version:      1.45 (minimum version 1.24)
  Go version:       go1.21.8
  Git commit:       8b79278
  Built:            Wed Mar 20 15:17:48 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.28
  GitCommit:        ae07eda36dd25f8a1b98dfbf587313b99c0190bb
 runc:
  Version:          1.1.12
  GitCommit:        v1.1.12-0-g51d5e94
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

docker info

Client: Docker Engine - Community
 Version:    26.0.0
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.13.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.17.2
    Path:     /root/.docker/cli-plugins/docker-compose

Server:
 Containers: 51
  Running: 35
  Paused: 0
  Stopped: 16
 Images: 92
 Server Version: 26.0.0
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: fluentd
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: active
  NodeID: ye1tqwlj7lag2sy839fce03ca
  Is Manager: true
  ClusterID: kwpn459kfqifaedft9c6naknp
  Managers: 1
  Nodes: 3
  Default Address Pool: 10.0.0.0/8
  SubnetSize: 24
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 192.168.1.2
  Manager Addresses:
   192.168.1.2:2377
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: ae07eda36dd25f8a1b98dfbf587313b99c0190bb
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
  no-new-privileges
 Kernel Version: 5.15.0-101-generic
 Operating System: Ubuntu 22.04.4 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 3.82GiB
 Name: staging-manager1
 ID: c92b7ce2-fc57-487d-8b93-6b85847c857b
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Registry Mirrors:
  https://*****:5010/
 Live Restore Enabled: false

Additional Info

Initially I was on v25; upgrading to v26 did not help.
I've opened a HAProxy issue, but it seems like it's a Docker edge case.

root@0b11c7683e24:/# dig mon_prometheus
;; communications error to 127.0.0.11#53: timed out
;; communications error to 127.0.0.11#53: timed out
;; communications error to 127.0.0.11#53: timed out

; <<>> DiG 9.18.24-1-Debian <<>> mon_prometheus
;; global options: +cmd
;; no servers could be reached

Here is the output of tcpdump -v -i lo udp:
tcpdump-any-port.txt

I tried running nslookup without overriding DNS and got:

/ # nslookup  mon_prometheus
Server:		127.0.0.11
Address:	127.0.0.11:53

** server can't find mon_prometheus: NXDOMAIN

** server can't find mon_prometheus: NXDOMAIN

OS: DigitalOcean image "Docker 25.0.3 on Ubuntu 22.04"

elyulka added kind/bug and status/0-triage labels Apr 14, 2024
elyulka (Author) commented Apr 15, 2024

It seems like @robmry has experience with this part of the Docker codebase.

akerouanton (Member) commented Apr 15, 2024

following manuals, I set dns: "127.0.0.11" so that requests are not forwarded to external DNS servers

I have no idea which manuals you're talking about, but if that's one of our docs pages, we need to fix it.

The dns field in daemon.json, the docker run --dns flag, and the Compose dns field all have the same purpose: they set the DNS servers the Engine forwards to when A / AAAA queries don't match any container. By putting 127.0.0.11 there, you are telling the Engine to forward to itself -- effectively creating an infinite loop of queries. That's why you see so many log lines.

There are a couple of different ways to ensure DNS queries don't get forwarded to upstream resolvers:

  • Using docker run --dns-option=ndots:0 ... (or the equivalent Compose field) and using unqualified container names (e.g. without the network name); see the Compose sketch after this list.
  • Connecting your containers to internal networks only. That's probably not what you're looking for, as you want your HAProxy to load balance traffic to your backend containers.
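
For example, a minimal Compose sketch of the first option (the service name and image tag are illustrative; note that docker stack deploy rejects this field, as found later in this thread):

services:
  haproxy:
    image: haproxy:2.9
    dns_opt:
      - "ndots:0"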

For the record, we're waiting for the ICANN report Proposed Top-Level Domain String for Private Use to decide whether we want to make the daemon an authoritative NS for the DNS zone dckr.internal. (or something similar). This would provide a better mechanism to ensure queries aren't forwarded.

I'll check if we can slightly improve our config validation to make sure we don't accept 127.0.0.11.
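
A minimal sketch of the kind of check that could be added (hypothetical helper, not the Engine's actual validation code):

package main

import (
    "fmt"
    "net"
)

// validateExternalDNSServer rejects loopback addresses, including the
// embedded resolver at 127.0.0.11, because forwarding to them would loop
// queries straight back into the daemon.
func validateExternalDNSServer(addr string) error {
    ip := net.ParseIP(addr)
    if ip == nil {
        return fmt.Errorf("invalid DNS server address: %q", addr)
    }
    if ip.IsLoopback() {
        return fmt.Errorf("DNS server %s is a loopback address and cannot be used as an external resolver", addr)
    }
    return nil
}

func main() {
    fmt.Println(validateExternalDNSServer("127.0.0.11")) // rejected
    fmt.Println(validateExternalDNSServer("8.8.8.8"))    // <nil>
}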

akerouanton added kind/question, area/networking, and area/networking/dns labels and removed status/0-triage and kind/bug labels Apr 15, 2024
elyulka (Author) commented Apr 15, 2024

@akerouanton Thank you so much for the clarification! I've asked the HAProxy team to update their blog post so it doesn't mislead people like me.

elyulka (Author) commented Apr 15, 2024

@akerouanton Unfortunately, Swarm stacks don't support dns_opt; I'm getting a services.http Additional property dns_opt is not allowed error with the following config:

  dns_opt:
    - "ndots:0"

akerouanton (Member) commented:

Ah, right -- that's not available on Swarm. Well, in that case unfortunately you have no way to disable upstream forwarding.

s4ke (Contributor) commented Apr 15, 2024

docker service create supports --dns-option though, right? dns_opt seems to be missing from the plumbing for docker stack deploy. I guess it's worth creating a separate issue for this.
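
For example (reusing the image from the reproduction above; the service name is illustrative):

docker service create \
  --name dnstest \
  --dns-option ndots:0 \
  nicolaka/netshoot:v0.12 \
  sleep infinity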


bluikko commented Apr 18, 2024

following manuals, I set dns: "127.0.0.11"

What is this supposed to mean?
Is this set in daemon.json? Inside the container? Somewhere else?

Asking because I regularly see this same kind of log flood with no DNS set in daemon.json.

robmry (Contributor) commented Apr 18, 2024

following manuals, I set dns: "127.0.0.11"
What is this supposed to mean? Is this set in daemon.json? Inside the container? Somewhere else?

It looks like part of a Docker Compose service definition, equivalent to --dns in a docker run command.

Asking because I regularly see this same kind of log flood with no DNS set in daemon.json.

If you're seeing something similar, without configuring 127.0.0.11 as an external DNS resolver via docker compose/run/create or any other means ... please could you raise a new issue? Be sure to include examples of log lines and, ideally, a minimal way for us to reproduce the problem (or, at least, a description/examples of the configuration you're using and any ideas you have about what might trigger it).


bluikko commented Apr 18, 2024

@elyulka Did you remove the upstream DNS server 127.0.0.11 from your Docker DNS configuration, and did it help at all? I suspect this may be a red herring.

please could you raise a new issue

TL;DR: I don't have any useful data, except that I note two similarities to the OP: Swarm and a tasks.[...] query.

I would love to open an issue, but I have nothing tangible on this problem, just symptoms that seem to manifest randomly, on average maybe 1 to 6 times a month. I'll probably continue monitoring and try to figure out something of use before opening an issue.

All I know is that in various Swarms, once in a while, machine(s) suddenly start spamming this, triggering journald's default flood limits; I calculated it at about 100 log lines per second.
The flood continues until the machine (or perhaps just dockerd) is restarted.

The log lines are identical except, of course, for things like port numbers and the query; the A record query does have our search domain appended. No DNS settings whatsoever are defined in daemon.json.
It would be very helpful if the log event included the client IP address or other identifying information. Next time I'll capture packets on lo and try to find which container is the client (see the sketch below).
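
For reference, a sketch of such a capture (mycontainer is a placeholder; SandboxKey points at the container's network namespace, where the embedded resolver listens on 127.0.0.11:53):

nsenter --net=$(docker inspect -f '{{.NetworkSettings.SandboxKey}}' mycontainer) \
  tcpdump -ni lo udp port 53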

Interestingly, our "flooding query" also starts with tasks., the same as in the OP.
It seems like this is some well-known name used by Docker, and as such it is strange that it would ever be forwarded to external resolvers. I say "seems" because I could not find a reference to this DNS name on docs.docker.com, but I did see a closed issue, 5854.

This issue did give some great ideas, but since we're on Swarm, we also cannot use dns_opt. Inside at least some containers, libc is already configured with the similar option options ndots:0.
In any case, requiring all DNS queries to be unqualified doesn't seem feasible unless one has very tight control over the contents of all containers.

elyulka (Author) commented Apr 18, 2024

@bluikko I removed the setting from the Swarm stack file, and I no longer get the log flood when the Swarm internal DNS can't resolve a request.

I also noticed that inside the container, options ndots:0 is present in /etc/resolv.conf by default, without explicit configuration.
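
For illustration, a container attached to a user-defined or overlay network typically gets a resolv.conf along these lines (any search line comes from the host or service configuration):

nameserver 127.0.0.11
options ndots:0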


bluikko commented Apr 19, 2024

@bluikko I removed the setting from the Swarm stack file, and I no longer get the log flood when the Swarm internal DNS can't resolve a request.

You are right, the Swarm I am looking at had the same problem with DNSConfig set to 127.0.0.11. And whaddayaknow, it's also a HAProxy container!

There must be some bad advice being given in some documentation or "howto" somewhere: it's too much of a coincidence otherwise.

@elyulka I'll add a comment to the relevant HAProxy issue; hopefully this will get the advice removed from the "howto". Oh, how I despise howtos.

robmry linked a pull request Apr 23, 2024 that will close this issue