DNS not working in a service with an overlay network. It is seemingly dependent on the creation order of unrelated networks. #28188

Closed
domg123 opened this issue Nov 9, 2016 · 12 comments

Comments

@domg123

commented Nov 9, 2016

dns_docker_issue.zip

Description
After repeatedly hitting inconsistent behavior when starting up an application infrastructure with a number of swarm mode networks and services, I narrowed it down to a reproducible use-case. It is still occasionally inconsistent, especially on AWS for docker beta (where I hit it initially), but I am able to reliably produce the issue on a local vbox docker-machine, which I detail below.

Steps to reproduce the issue:
Also see the attached script that I used to run through the issue.

  1. Set up a local docker-machine and put it into swarm mode. I used:
    docker-machine create --driver virtualbox --virtualbox-memory 3072 --virtualbox-disk-size 20000 net-test-machine
    docker swarm init --advertise-addr

  2. Set up some swarm mode networks. The order is important. If the internal network is last, it works.
    docker network create --driver overlay test_nw_1
    docker network create --driver overlay --internal test_internal_nw
    docker network create --driver overlay test_nw_2

  3. Run a service, on which you can check DNS
    docker service create --name dns-test-service --restart-max-attempts=1 --replicas=1 --network test_nw_2 alpine nslookup www.google.com

  4. Check the results
    docker service ps dns-test-service
    ID                         NAME                    IMAGE   NODE              DESIRED STATE  CURRENT STATE          ERROR
    8kuzzunxubqdj4nzyn3v8kuc8  dns-test-service.1      alpine  net-test-machine  Shutdown       Failed 2 seconds ago   "task: non-zero exit (1)"
    7zihw5somacz9g945j8pac79h   \_ dns-test-service.1  alpine  net-test-machine  Shutdown       Failed 12 seconds ago  "task: non-zero exit (1)"

I chose the above service because it is concise, but I could run the same test with an 'alpine sleep' service, exec into the container, and clearly see that Internet access was OK but hostnames would not resolve.

Describe the results you received:
Within the container, DNS doesn't work. You can't ping www.google.com, but you can ping, for instance, 8.8.8.8.

Describe the results you expected:
The containers should have working DNS.

Additional information you deem important (e.g. issue happens only occasionally):
I've hit this on AWS for Docker Beta and in a local docker-machine VirtualBox environment. In the case of VirtualBox, I experimented with switching the VM between 1.12.1, 1.12.2 and 1.12.3, but the issue remained, and I can reliably recreate it from scratch.

Output of docker version:

Client:
 Version:      1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   6b644ec
 Built:        Wed Oct 26 21:44:32 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   6b644ec
 Built:        Wed Oct 26 23:26:11 2016
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 53
 Running: 0
 Paused: 0
 Stopped: 53
Images: 1
Server Version: 1.12.3
Storage Driver: aufs
 Root Dir: /mnt/sda1/var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 107
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null overlay bridge host
Swarm: active
 NodeID: caecow8uwead466cylqpa58m1
 Is Manager: true
 ClusterID: 2i7xeimsxlbbe3cndrwhzr5pv
 Managers: 1
 Nodes: 1
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 192.168.99.100
Runtimes: runc
Default Runtime: runc
Security Options: seccomp
Kernel Version: 4.4.27-boot2docker
Operating System: Boot2Docker 1.12.3 (TCL 7.2); HEAD : 7fc7575 - Thu Oct 27 17:23:17 UTC 2016
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 2.937 GiB
Name: net-test-machine
ID: 2I6J:XEPY:H3MW:5EDI:ZODN:T4LD:MHY3:LCRW:A2WK:BXHN:P4JG:KIOJ
Docker Root Dir: /mnt/sda1/var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 239
 Goroutines: 429
 System Time: 2016-11-09T03:02:31.153116336Z
 EventsListeners: 0
Username: codadoc
Registry: https://index.docker.io/v1/
Labels:
 provider=virtualbox
Insecure Registries:
 127.0.0.0/8

Additional environment details (AWS, VirtualBox, physical, etc.):

I have docker-machine version 0.8.2, build e18a919

I also hit the issue in AWS for Docker Beta 1.12.3, where it was prevalent but more inconsistent.

@domg123

Author

commented Nov 23, 2016

Referenced from docker/libnetwork#1548

@dongluochen

Contributor

commented Nov 29, 2016

What's the error from nslookup in the container? Can you do nslookup www.google.com from the host machine (not inside the container)?

$ docker service ls
ID            NAME              REPLICAS  IMAGE   COMMAND
6dcwb0ak9zmm  dns-test-service  0/1       alpine  nslookup www.google.com

$ docker ps -a
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS                              PORTS               NAMES
0026afe0042b        alpine:latest       "nslookup www.google."   5 seconds ago       Exited (0) Less than a second ago                       dns-test-service.1.24dhti7g5kv83xo3vt0av0mee

$ docker logs 0026afe0042b
nslookup: can't resolve '(null)': Name does not resolve

Name:      www.google.com
Address 1: 216.58.219.100 mia07s25-in-f4.1e100.net
Address 2: 2607:f8b0:4008:807::2004 mia07s25-in-x04.1e100.net
@thaJeztah

Member

commented Dec 9, 2016

ping @domg123 ^^

@domg123

Author

commented Dec 9, 2016

Yep, if I ssh into the host machine, DNS works fine. I can run nslookup www.google.com and get what I would expect. For the record, that is:

"docker@net-test-machine:~$ nslookup www.google.com
Server:    10.0.2.3
Address 1: 10.0.2.3

Name:      www.google.com
Address 1: 216.58.218.100 dfw25s07-in-f100.1e100.net
Address 2: 2607:f8b0:4000:803::2004 dfw25s13-in-x04.1e100.net
docker@net-test-machine:~$ nslookup www.google.com
Server:    10.0.2.3
Address 1: 10.0.2.3"

The result of nslookup in a failing container is ...

nslookup: can't resolve '(null)': Name does not resolve
nslookup: can't resolve 'www.google.com': Try again

The containers that work return the following:

nslookup: can't resolve '(null)': Name does not resolve
Name:      www.google.com
Address 1: 216.58.218.164 dfw06s46-in-f164.1e100.net
Address 2: 2607:f8b0:4000:80a::2004 dfw06s46-in-x04.1e100.net
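
Both outputs above start with the same `can't resolve '(null)'` line, so that line alone does not separate a working container from a failing one. A minimal sketch of a check, assuming busybox-style nslookup output like the samples above; `lookup_succeeded` is a hypothetical helper name, not part of Docker:

```shell
# Hypothetical helper: read nslookup output on stdin and succeed only if
# a "Name:" line for the queried host is followed by an "Address" line.
lookup_succeeded() {
  awk -v host="$1" '
    $1 == "Name:" && $2 == host { named = 1; next }
    named && $1 == "Address"    { ok = 1 }
    END { exit (ok ? 0 : 1) }
  '
}
```

It could be fed from the task's logs, for example by piping the output of docker logs for the task container (with stderr merged in) into `lookup_succeeded www.google.com`.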

To reply to this, I spun up a new docker-machine for the test, since I've been working with 1.13. So I can confirm that the issue persists in 1.12.4rc1. I was using a 1.13rc3 client, fwiw (the script needs to be slightly altered, since it returns the task information differently).

I also ran it in 1.13 rc3, and I get the same issue. Note that I initially did not get the issue when I ran on a host which already had a bunch of things running. On a fresh machine, the issue returned.

@domg123

Author

commented Dec 9, 2016

script updated for 1.13. Using this, I see the problem on a 1.13rc3 host
dns_docker_issue-1-13.zip

@thaJeztah

Member

commented Dec 9, 2016

ping @sanimej any more suggestions for debugging?

@sanimej


commented Dec 9, 2016

@domg123
Short answer: With all the default configs, this behavior is expected. To avoid this issue you can do one of the following:

  • change the resolver IP in docker-machine VM's resolv.conf from 10.0.2.3 to 8.8.8.8 or any other external IP.
  • specify the subnet explicitly for the overlay network; docker network create -d overlay --subnet 192.168.10.0/24 ov1
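
Applied to the three networks from the report, the second workaround might look like the sketch below. The 192.168.10.0/24 through 192.168.12.0/24 ranges are an assumption, chosen only to stay clear of the VM's 10.0.2.3 resolver; pick ranges that are unused in your environment. `print_network_cmds` is a hypothetical helper that prints the commands instead of running them, so they can be reviewed first.

```shell
# Print network-create commands with explicit subnets outside 10.0.x.x.
# The subnet values are illustrative assumptions, not Docker defaults.
print_network_cmds() {
  echo "docker network create --driver overlay --subnet 192.168.10.0/24 test_nw_1"
  echo "docker network create --driver overlay --internal --subnet 192.168.11.0/24 test_internal_nw"
  echo "docker network create --driver overlay --subnet 192.168.12.0/24 test_nw_2"
}

print_network_cmds
```

Piping the output to `sh` on a swarm manager would run the commands.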

Here are the gory details on what exactly is causing the problem with the default configs.

When you create overlay networks the default subnet is a /24 network from 10.0.0.0. Since you have three networks the subnets will be..

   network 1 - 10.0.0.0/24
   network 2 - 10.0.1.0/24
   network 3 - 10.0.2.0/24

The VM created by docker-machine with vbox driver will get the following resolv.conf file

docker@net-test-machine:~$ cat /etc/resolv.conf 
search docker.com
nameserver 10.0.2.3
docker@net-test-machine:~$ 

Hence 10.0.2.3 will be the external DNS server for containers created on this host.

If you create a service on network 2, its IP on the overlay network will be 10.0.1.3. To reach the external DNS server 10.0.2.3, the packet will be routed via the container's eth1 interface, which is connected to the bridge network docker_gwbridge. That bridge exists only to provide external connectivity.

If you create a service on network 3, the task IP will be 10.0.2.3. Since the external DNS server is also on the same subnet (in fact it's the same IP here), the packet will not get routed correctly and the resolution will fail.

The fact that the 2nd network had the internal flag is not relevant here.
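
The collision described above can be checked with plain shell arithmetic. This is a sketch, assuming the default allocation listed above (10.0.0.0/24, 10.0.1.0/24, 10.0.2.0/24) and the 10.0.2.3 resolver from resolv.conf; `ip_to_int` and `in_subnet` are hypothetical helper names.

```shell
# Convert a dotted-quad IPv4 address to an integer.
# The body runs in a subshell so the IFS change does not leak out.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Succeed if address $1 lies inside CIDR block $2 (for example 10.0.2.0/24).
in_subnet() {
  _addr=$(ip_to_int "$1")
  _net=$(ip_to_int "${2%/*}")
  _bits=${2#*/}
  # 4294967295 is 0xFFFFFFFF, the all-ones 32-bit mask.
  _mask=$(( (4294967295 << (32 - _bits)) & 4294967295 ))
  [ $(( _addr & _mask )) -eq $(( _net & _mask )) ]
}

resolver=10.0.2.3
for subnet in 10.0.0.0/24 10.0.1.0/24 10.0.2.0/24; do
  if in_subnet "$resolver" "$subnet"; then
    echo "$subnet contains resolver $resolver"
  else
    echo "$subnet is clear of resolver $resolver"
  fi
done
```

Only the third subnet is reported as containing the resolver, which matches the network on which the service failed.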

@sanimej


commented Dec 9, 2016

@thaJeztah This behavior is caused by the default configs being used in the docker-machine VM and for the overlay networks. It's not a bug, and this can be closed.

@domg123

Author

commented Dec 10, 2016

Thanks @sanimej, I appreciate your detailed explanation. It also explains why the issue didn't appear on a host which already had a bunch of networks on it (presumably the test networks were all on higher subnets for that test).

@thaJeztah

Member

commented Dec 10, 2016

Thanks @sanimej, yes, makes sense now. I added a documentation label to this; we do currently advise users to specify a subnet, but perhaps more in-depth information about this use case could be useful.

/cc @mstanleyjones

@akalipetis

Contributor

commented Nov 27, 2017

Shouldn't this be fixed if --dns 8.8.8.8 --dns 8.8.4.4 is specified in the engine startup args?

@domg123

Author

commented Nov 27, 2017

@akalipetis you are right, and I should have gone back and noted that here. After originally adding some logic to the startup scripts to edit resolv.conf in place, I discovered the dns flag. It does work, and it is a much neater solution.
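
For anyone landing here later, a sketch of what the dns flag setup could look like. The `dns` key in the daemon configuration file is the counterpart of the `--dns` engine flag; the file location varies by host and is an assumption here, and `print_daemon_dns_config` is a hypothetical helper that only prints the fragment.

```shell
# Print a daemon.json fragment equivalent to starting the engine with
#   dockerd --dns 8.8.8.8 --dns 8.8.4.4
# Where the file lives (commonly /etc/docker/daemon.json) depends on the host.
print_daemon_dns_config() {
  cat <<'EOF'
{
  "dns": ["8.8.8.8", "8.8.4.4"]
}
EOF
}

print_daemon_dns_config
```

For a docker-machine VM, the same engine option can be passed at creation time with `--engine-opt dns=8.8.8.8`. Either way, containers get an external resolver instead of inheriting 10.0.2.3 from the VM's resolv.conf.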

domg123 closed this on Nov 27, 2017