
Docker Swarm Mode service discovery is totally unstable #33589

Closed
soar opened this issue Jun 8, 2017 · 41 comments
Labels
area/networking area/swarm kind/bug version/17.05

Comments

@soar

soar commented Jun 8, 2017

Description

I have about 10 nodes (4 of them managers), 7 overlay networks, and about 70 services. Swarm is now totally unstable: it randomly stops resolving the names of my services:

root@balancer:/# nslookup backend 127.0.0.11
Server:		127.0.0.11
Address:	127.0.0.11#53

** server can't find backend: NXDOMAIN

This problem can persist for 10-20 minutes! All this time you'll see 502 Bad Gateway errors, and messages like this in the logs:

Jun  8 15:04:16 node4 dockerd[840]: time="2017-06-08T15:04:16.415822659+02:00" level=warning msg="Neighbor entry already present for IP 10.10.4.50, mac 02:42:0a:0a:04:32"
Jun  8 15:04:16 node4 dockerd[840]: time="2017-06-08T15:04:16.415898743+02:00" level=warning msg="Neighbor entry already present for IP <external-ip-here>, mac 02:42:0a:0a:04:32"
Jun  8 15:04:16 node4 dockerd[840]: time="2017-06-08T15:04:16.415971947+02:00" level=warning msg="Neighbor entry already present for IP 10.10.4.45, mac 02:42:0a:0a:04:2d"
Jun  8 15:04:16 node4 dockerd[840]: time="2017-06-08T15:04:16.415997559+02:00" level=warning msg="Neighbor entry already present for IP <external-ip-here>, mac 02:42:0a:0a:04:2d"
...

... and then it recovers by itself:

root@balancer:/# nslookup backend 127.0.0.11
Server:		127.0.0.11
Address:	127.0.0.11#53

Non-authoritative answer:
Name:	backend
Address: 10.0.0.16

For 1-2 hours. And then the cluster breaks again.

Steps to reproduce the issue:

  1. Create swarm
  2. Create services and networks
  3. Try to use it in production

Describe the results you received:
At this moment Swarm Mode is unusable for production environments.

Describe the results you expected:
I expected it to be usable in production!

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

Client:
 Version:      17.05.0-ce
 API version:  1.29
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:04:27 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.05.0-ce
 API version:  1.29 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:04:27 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

Containers: 3
 Running: 2
 Paused: 0
 Stopped: 1
Images: 3
Server Version: 17.05.0-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 36
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: active
 NodeID: yidccbcki64epay4p4ugq6xm1
 Is Manager: true
 ClusterID: hqvohft3etj4ajnkgubbnjwzp
 Managers: 4
 Nodes: 15
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: <ip1>
 Manager Addresses:
  <ip2>:2377
  <ip3>:2377
  <ip1>:2377
  <ip4>:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9048e5e50717ea4497b757314bad98ea3763c145
runc version: 9c2d8d184e5da67c95d601382adf14862e4f2228
init version: 949e6fa
Kernel Version: 3.16.0-4-amd64
Operating System: Debian GNU/Linux 8 (jessie)
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 63.02GiB
Name: node5
ID: JRVQ:2D26:NA5B:TJ5L:VWSC:TEAE:G4B2:UXYF:3TRR:F7YY:LRW5:3IWA
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: mybot
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No kernel memory limit support
WARNING: No cpu cfs quota support
WARNING: No cpu cfs period support

@soar
Author

soar commented Jun 8, 2017

cluster

root@node5:~# docker service ps balancer | grep node
i4p2b70c58rm        balancer.s42tug46h78l0p7z5unlsr0a2       fx/balancer:latest   node4                           Running             Running 7 hours ago                                       *:443->443/tcp,*:80->80/tcp
tnw1tx627um5        balancer.yidccbcki64epay4p4ugq6xm1       fx/balancer:latest   node5                           Running             Running 7 hours ago                                       *:443->443/tcp,*:80->80/tcp
kz6ubc7pgct4        balancer.jpfvnxy14acbjtrmv3n0x78xt       fx/balancer:latest   node2                           Running             Running 7 hours ago                                       *:443->443/tcp,*:80->80/tcp

root@node5:~# docker service ps jenkins
ID                  NAME                IMAGE               NODE                DESIRED STATE       CURRENT STATE          ERROR                              PORTS
5q4u91busffe        jenkins.1           jenkins:latest      dwrk-jenkins-01     Running             Running 10 days ago                                       
s4a4w8kch6xv         \_ jenkins.1       jenkins:latest      dwrk-jenkins-01     Shutdown            Shutdown 10 days ago                                      

node4 (master node)

root@node4:~# docker exec -it balancer.s42tug46h78l0p7z5unlsr0a2.i4p2b70c58rm1yst9s5cjlscc bash

root@balancer:/# nslookup jenkins 127.0.0.11
Server:		127.0.0.11
Address:	127.0.0.11#53

Non-authoritative answer:
Name:	jenkins
Address: 10.0.0.76

root@balancer:/# ping jenkins
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.086/0.087/0.088/0.000 ms

root@balancer:/# echo > /dev/tcp/jenkins/8080 && echo ok || echo notok
ok

node5 (same cluster, also master node)

root@node5:~# docker exec -it balancer.yidccbcki64epay4p4ugq6xm1.tnw1tx627um5hjm2vnt624ino bash

root@balancer:/# nslookup jenkins 127.0.0.11
Server:		127.0.0.11
Address:	127.0.0.11#53

Non-authoritative answer:
Name:	jenkins
Address: 10.0.0.76

root@balancer:/# ping jenkins
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.080/0.089/0.098/0.000 ms

root@balancer:/# echo > /dev/tcp/jenkins/8080 && echo ok || echo notok
bash: connect: Connection timed out
bash: /dev/tcp/jenkins/8080: Connection timed out
notok

@fcrisciani
Contributor

Thanks @soar, we are aware of this instability and I'm currently working on a patch to stabilize it.
I'll let you know when I have a patch; it would be great if you are available to test it on your side as well and give feedback on whether it solves the problem for you. Let me know.

@soar
Author

soar commented Jun 8, 2017

@fcrisciani thank you! I'm ready to test anything that could help.

@fcrisciani
Contributor

Will let you know as soon as I have something ready to be tested.

@thaJeztah added the area/networking and kind/bug labels on Jun 8, 2017
@Spectrik

We are experiencing the same problem, and it is actually stopping us from deploying Docker to production...

@thaJeztah
Member

Docker 17.06 will carry changes that should fix this (I think moby/libnetwork#1796 is part of that, but @fcrisciani perhaps you can confirm?)

If you have a setup to test, release candidates for Docker 17.06 are available in the "test" channel on download.docker.com; keep in mind they are release candidates, so running them in production is not yet recommended.
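For example, on a Debian/Ubuntu host the test channel can be enabled via apt roughly like this (a sketch; adjust the distro path and codename to your system):

sudo add-apt-repository \
  "deb [arch=amd64] https://download.docker.com/linux/debian $(lsb_release -cs) test"
sudo apt-get update && sudo apt-get install docker-ce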

@fcrisciani
Contributor

Yes, correct; that PR fixes several race conditions found in the service discovery code.

@soar
Author

soar commented Jun 27, 2017

In Russian we have the expression "podgoráet pukán" (roughly, "my butt is on fire"), which means we urgently need a stable version with this bug fixed...

@Crusader4Christ

@soar "fireass" in English.
We have the same problem with Docker.

@Spectrik

Spectrik commented Jun 27, 2017

My team and I would also be very glad to get a genuinely stable Swarm, not just a "stable" Swarm release :)

@sanimej

sanimej commented Jun 27, 2017

@soar the 17.06 GA release should happen pretty soon, but the last RC, RC5, already has all the service discovery fixes. Can you give it a try and let us know if it helps?

@BSWANG
Contributor

BSWANG commented Aug 22, 2017

@sanimej In the Docker 17.06.1 GA release, I got the same error.

Aug 22 11:52:31 c57308e8eac5743ef99f7451fdd4a02d7-node6 dockerd[6040]: time="2017-08-22T11:52:31.105208918+08:00" level=warning msg="Neighbor entry already present for IP 172.19.0.17, mac 02:42:ac:13:00:11"
Aug 22 11:52:31 c57308e8eac5743ef99f7451fdd4a02d7-node6 dockerd[6040]: time="2017-08-22T11:52:31.105301694+08:00" level=warning msg="Neighbor entry already present for IP 10.25.69.135, mac 02:42:ac:13:00:11"

Containers on the affected node6 then cannot access 172.19.0.17, because the FDB table in the overlay namespace was not updated to point to 10.25.69.135.
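To check this by hand, here is a rough sketch (the 1-<network-id> namespace naming is the usual layout for Docker overlay networks; verify it on your host):

# list the overlay network namespaces created by the daemon
ls /var/run/docker/netns/
# dump the VXLAN forwarding database and neighbor (ARP) entries inside one of them
nsenter --net=/var/run/docker/netns/1-<network-id> bridge fdb show
nsenter --net=/var/run/docker/netns/1-<network-id> ip neigh show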

@taiidani

Seeing the same error in 17.06.0-ce.

Sep 16 08:15:25 ip-10-62-129-120 dockerd:  time="2017-09-16T15:15:25.088188954Z" level=warning msg="Neighbor entry already present for IP 10.0.2.9, mac 02:42:0a:00:02:09"
Sep 16 08:15:25 ip-10-62-129-120 dockerd:  time="2017-09-16T15:15:25.088585064Z" level=warning msg="Neighbor entry already present for IP 10.62.129.164, mac 02:42:0a:00:02:09"
Sep 16 08:15:25 ip-10-62-129-120 dockerd:  time="2017-09-16T15:15:25.088618211Z" level=warning msg="Neighbor entry already present for IP 10.0.2.13, mac 02:42:0a:00:02:0d"
Sep 16 08:15:25 ip-10-62-129-120 dockerd:  time="2017-09-16T15:15:25.088632308Z" level=warning msg="Neighbor entry already present for IP 10.62.129.164, mac 02:42:0a:00:02:0d"

The message is being emitted by quite a few, if not all, of our manager and worker nodes.

@thaJeztah
Member

ping @fcrisciani @abhinandanpb ^^

@fcrisciani
Contributor

@taiidani do you actually see a connectivity issue? That is simply a warning message notifying that the specific MAC entry is already present in the DB, so it won't be reconfigured again. We can potentially rate-limit it, but it should not cause any connectivity issue.

@taiidani

@fcrisciani I can't guarantee it was related, but I saw the logs as I was troubleshooting one of our manager nodes having intermittent outbound network connectivity issues.

That problem appears to have been solved by rebooting the node. However, since my cluster has had sporadic, severe connectivity issues between services on overlay networks and on outbound connections, I wanted to make sure my occurrence of these logs was noted in case they are part of the problem.

It sounds like the log entries are unrelated to the instability I've been seeing?

@jgranstrom

@taiidani what cloud provider are you using? Do you ever see sporadic connectivity issues outside of Docker Swarm, on the host itself? For example, when you run a lot of pings (or similar) between the nodes, or to some external source, over a few hours to a day.

@taiidani

@jgranstrom We're using AWS (us-west-2) with a Swarm setup running 17.06-ce, installed on Amazon's vanilla Ubuntu 16.04 AMI.

So far we've seen issues where:

  • Overlay network connections failed at the DNS level. Other services in the network would resolve for some containers but not for others. In one case this behavior even occurred between two services that had containers on the same node. In all cases so far, removing and redeploying the stack seemed to solve the issue (with brief downtime).
  • Outbound connections from a Jenkins runner service (used for one-off crons) failed; for example, Amazon S3 calls failed with fatal error: Could not connect to the endpoint URL: "https://s3-us-west-2.amazonaws.com/..., and Docker Compose failed with An HTTP request took too long to complete. Retry with --verbose to obtain debug information.. Rebooting the host node seemed to fix those.

It's been very hard to nail these down, as they happen intermittently and against production systems, meaning we need to implement quick solutions to prevent downtime.

@zigmund

zigmund commented Sep 22, 2017

On 17.06.1 and 17.06.2 we still sometimes have the same issues. After a service update, some replicas stop responding on some worker nodes.

@fcrisciani
Contributor

@zigmund I believe your description matches this issue: moby/libnetwork#1934, for which PR moby/libnetwork#1935 was merged and will be present in the next release.

@yunghoy

yunghoy commented Oct 16, 2017

Service discovery by namespace and service name is totally broken; it isn't even consistent. Sometimes the namespace and service name can be resolved, and sometimes they can't, with the same docker-compose file. These bugs are hard to reproduce, but I suspect they are related to the networks. Sometimes it works if you delete the relevant networks or the Exited containers with different settings.

So my team is now back to the old way, plain IP addresses, instead of using service discovery. This is the worst part of Docker Swarm Mode. We had hoped we could easily manage our Docker files with service discovery; that is why we moved from "docker swarm" in the first place.

@svscorp

svscorp commented Oct 19, 2017

I got the same error after upgrading from 17.07 to 17.10 :(

If I have, say, [nginx] and [jenkins], and I kill the jenkins container on one of the nodes (it restarts on another node), jenkins is no longer reachable via that nginx and returns a 502 error.

In the logs it looks like nginx is trying to reach jenkins using the old overlay network IP.

@svscorp

svscorp commented Oct 19, 2017

@fcrisciani @thaJeztah

Could it be related to the recently released 17.10 and moby/libnetwork#1935? I hadn't noticed that behavior before.

@fcrisciani
Contributor

@svscorp I guess you meant to refer to libnetwork#1935, but that is actually related to the overlay driver more than to DNS. Did you actually verify that DNS resolution is incorrect inside the container and that no caching is happening?

@zigmund

zigmund commented Oct 20, 2017

@svscorp What endpoint mode do you use with jenkins?

As far as I know, nginx resolves upstream IPs only at startup. If you use DNSRR mode, the IPs of the containers may change after a stop/kill etc. You should use the VIP endpoint mode for jenkins instead, since the VIP of a service stays static for the lifetime of the service.
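For reference, a minimal sketch of pinning the endpoint mode explicitly (the service and network names are placeholders, not from this thread):

# VIP is the default, but it can be set explicitly on the CLI
docker service create --name jenkins --network my-overlay --endpoint-mode vip jenkins:latest
# in a version 3.3+ compose/stack file the equivalent is "endpoint_mode: vip" under the service's "deploy:" section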

@svscorp

svscorp commented Oct 20, 2017

@zigmund I am in favor of using VIP. Jenkins is now in VIP mode. There are no ports published to the ingress network.

But the setup is unstable: at random times I can't use the tools. The browser just hangs, and looking at the logs I can see nginx errors on every HTTP request saying that the connection (to ldap) timed out, or "client closed keepalive connection".

When I had the initial problem, the above behavior was accompanied by "L3 reprogramming failed" errors. On 17.10 I don't see this error message, but it feels the same.

Also, journalctl is continuously spammed with node join events (on 17.07 there were far fewer of these messages). That's probably a separate story.

...
Node join event for 9eb369344416
...

Setup looks like this:
[swarm-manager-1]
[swarm-manager-2]
[swarm-worker-1]
nginx (global)
[swarm-worker-2]
nginx (global)
[swarm-worker-3]
ldap
jenkins

@fcrisciani Thanks, I corrected the link; indeed I meant libnetwork. I haven't yet tried checking from inside a container, and I don't remember whether on 17.07, say, ldapsearch worked from the nginx container (node1) to the ldap container (another node). Will test and write back.

@fcrisciani
Contributor

@svscorp the node join message is not a concern; apart from its frequency, it does not indicate an issue. It is only printed when nodes exchange state through a TCP sync. I will look into hiding it.
Another thing to check, now in 17.10, is a new log line about the NetworkDB stats that is printed every 5 minutes. Verify that the queue length is 0 and that the number of entries per network matches across all the nodes where you have running containers.
The log line example is: Oct 17 21:57:36 ip-10-204-197-84 dockerd[18219]: time="2017-10-17T21:57:36.235789959Z" level=info msg="NetworkDB stats - netID:v0jtipqvlyv4nzl6evcfa19gr leaving:false netPeers:27 entries:56 Queue qLen:0 netMsg/s:0"

After that, to exclude issues caused by caching or by the application layer, I would validate the result from inside a container by simply trying to resolve the serviceName and checking that the VIP matches the docker service inspect result. Further, I would check with docker network inspect -v (on a manager) that the containers being shown actually match what is running.
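A rough sketch of those checks (the container, service, and network names are placeholders):

# 1. resolve the service name from inside a container attached to the overlay network
docker exec -it <container> nslookup <service-name> 127.0.0.11
# 2. compare with the VIP recorded in the service definition
docker service inspect --format '{{json .Endpoint.VirtualIPs}}' <service-name>
# 3. on a manager, check that the tasks listed for the network match what is actually running
docker network inspect -v <network-name>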

@svscorp

svscorp commented Oct 23, 2017

@fcrisciani Thank you very much for helping.

Also another thing to check now in 17.10 is a new log about the networkDb stats that is printed every 5 min.

Confirming this. There are two such messages in my case, as far as I remember. I'll check on qLen tonight.

After that to exclude issues due to caching or application layer, I would validate the return result from inside a container just simply trying to resolve the serviceName and check that the vip matches the docker service inspect result. Further I would check with docker network inspect -v (on a manager) that the containers that are being shown are actually matching what is running
I will check on these cases and report back ASAP. I need to keep a working and stable setup running, hence the switching between versions and re-creating the setup daily.

In return, @fcrisciani @thaJeztah, if you want me to test a specific thing/issue/fix in 17.10, I can help, as long as I am able to recreate the platform for tuning and troubleshooting purposes.

@fcrisciani
Contributor

@svscorp

Confirming this. There are two messages coming in my case as far as I remember. I'l check on qLen tonight.

It will be one message per network; consider that the ingress network is created at the beginning, when you enter swarm mode. You can correlate the network ID with a simple docker network ls.

I'm currently just curious to see whether, on a fresh 17.10, you are still able to see inconsistencies in service discovery. We also need to better understand the use case and add further automated tests, both to fix any new bugs and to strengthen our testing infrastructure against regressions.
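For example, the netID values in the stats lines can be matched against the full network IDs like this (the driver filter just narrows the output to overlay networks):

docker network ls --no-trunc --filter driver=overlay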

@svscorp

svscorp commented Oct 23, 2017

@fcrisciani meanwhile my "rollback" build to 17.07 completed. Within 15 minutes the website stopped responding.

Debugging and logs:

Show Docker logs
led for 110.10.0.13, no peer entry"
Oct 23 21:04:48 node-4 dockerd: time="2017-10-23T21:04:48.471189216+02:00" level=error msg="Reprogramming on L3 miss failed for 16.10.0.48, no peer entry"
Oct 23 21:04:48 node-4 journal: {:timestamp=>"2017-10-23T19:04:48.535000+0000", :message=>"Attempted to send a bulk request to Elasticsearch configured at '[\"http://elasticsearch/\"]', but Elasticsearch appears to be unreachable or down!", :client_config=>{:hosts=>["http://elasticsearch/"], :ssl=>nil, :transport_options=>{:socket_timeout=>0, :request_timeout=>0, :proxy=>nil, :ssl=>{}}, :transport_class=>Elasticsearch::Transport::Transport::HTTP::Manticore, :logger=>nil, :tracer=>nil, :reload_connections=>false, :retry_on_failure=>false, :reload_on_failure=>false, :randomize_hosts=>false}, :error_message=>"No route to host", :class=>"Manticore::SocketException", :level=>:error}
Oct 23 21:01:42 node-4 dockerd: time="2017-10-23T21:01:42.551170714+02:00" level=error msg="Reprogramming on L3 miss failed for 16.10.0.13, no peer entry"
Show Nginx logs
[info] 44#0: *411 client closed connection while waiting for request
epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too while sending request to upstream
[error] 43#0: *416 http_auth_ldap: Authentication timed out

--
UPDATE, 10 minutes later, after writing the debug info

The website suddenly started responding again:

Show logs
Oct 23 21:09:11 node-4 dockerd: time="2017-10-23T21:09:11.082808514+02:00" level=info msg="Node join event for node-12-cf5c2040eedc/x.x.x.12"
Oct 23 21:09:21 node-4 dockerd: time="2017-10-23T21:09:21.314841970+02:00" level=info msg="Node join event for node-7-4aa188c5da02/x.x.x.7"
Oct 23 21:09:35 node-4 dockerd: time="2017-10-23T21:09:35.192260130+02:00" level=info msg="NetworkDB stats - Queue net:xcdmx4he7s2nkjjidyrxry8x5 qLen:0 netPeers:12 netMsg/s:0"
Oct 23 21:09:35 node-4 dockerd: time="2017-10-23T21:09:35.192435706+02:00" level=info msg="NetworkDB stats - Queue net:3iwzsa1unvczh2r97q3cxgonq qLen:0 netPeers:12 netMsg/s:0"
Oct 23 21:09:41 node-4 dockerd: time="2017-10-23T21:09:41.085722250+02:00" level=info msg="Node join event for node-2-31e9966cb1a9/x.x.x.2"
Oct 23 21:09:51 node-4 dockerd: time="2017-10-23T21:09:51.317867926+02:00" level=info msg="Node join event for node-7-4aa188c5da02/172.x.x.7"
Oct 23 21:10:01 node-4 systemd: Created slice user-0.slice.
Oct 23 21:10:01 node-4 systemd: Starting user-0.slice.
Oct 23 21:10:01 node-4 systemd: Started Session 603 of user root.
Oct 23 21:10:01 node-4 systemd: Starting Session 603 of user root.
Oct 23 21:10:01 node-4 systemd: Removed slice user-0.slice.
Oct 23 21:10:01 node-4 systemd: Stopping user-0.slice.

While I was writing up all the debug info, it fixed itself, but I have no idea for how long. Going in circles... It worked on 17.07 before; now it doesn't.

Rebuilding on 17.10 and testing further.

Show setup details

RHEL 7.3
Docker CE 17.07, 17.10 (Swarm Mode)
Multiple interfaces. Nodes communicate over an interface on which the ports required for Swarm (TCP/UDP) are open.
Published services: VIP endpoint mode
Non-published services: DNSRR endpoint mode
Daemon settings:

  • { "bip" : "16.0.0.1/24", "mtu": 1418 }

Network settings:

  • docker-gwbridge { "subnet" : "16.9.0.1/16", "mtu": 1418, "icc": false, "ip_masq": true }
  • ingress [overlay] { "subnet" : "16.8.0.1/16", "mtu": 1418 }
  • my-app-network [overlay] { "subnet" : "16.10.0.1/16", "mtu": 1418 }
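For completeness, these settings roughly correspond to the following daemon.json and docker network create invocations (a sketch; the /16 subnets are written in network form, and a custom docker_gwbridge has to be created before the node joins the swarm or after the existing one is removed):

/etc/docker/daemon.json: { "bip": "16.0.0.1/24", "mtu": 1418 }

docker network create -d bridge --subnet 16.9.0.0/16 \
  --opt com.docker.network.driver.mtu=1418 \
  --opt com.docker.network.bridge.enable_icc=false \
  --opt com.docker.network.bridge.enable_ip_masquerade=true docker_gwbridge

docker network create -d overlay --subnet 16.10.0.0/16 \
  --opt com.docker.network.driver.mtu=1418 my-app-network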

@zigmund

zigmund commented Oct 24, 2017

@fcrisciani here is another problem.

Docker 17.10, 3 hardware worker nodes, 5 VM manager nodes.
The nodes were heavily loaded (CPU utilization close to 100% for ~5 hours), and some tasks were restarted because of an unhealthy state. Now some services don't respond on some nodes.

For example, working node:

root@hw-docker-w5.alahd.kz.prod:/var/run/docker/netns# docker network inspect -v ingress | jq -r '.[0].Services["stack-ms-views_market"]' 
{
  "VIP": "10.255.5.236",
  "Ports": [
    "Target: 8080, Publish: 8185"
  ],
  "LocalLBIndex": 3160,
  "Tasks": [
    {
      "Name": "stack-ms-views_market.1.vo7m84nfhnetbqieonfor6d7k",
      "EndpointID": "345de548190df03de8a9303a5875be165033b6089c026012cd29a1bb7f997b4b",
      "EndpointIP": "10.255.0.66",
      "Info": {
        "Host IP": "10.11.1.5"
      }
    }
  ]
}
root@hw-docker-w5.alahd.kz.prod:/var/run/docker/netns# nsenter --net=ingress_sbox curl 10.255.5.236:8080
{"status":"ok"}

Another node:

root@hw-docker-w6.alahd.kz.prod:/var/run/docker/netns# docker network inspect -v ingress | jq -r '.[0].Services["stack-ms-views_market"]'
{
  "VIP": "10.255.5.236",
  "Ports": [
    "Target: 8080, Publish: 8185"
  ],
  "LocalLBIndex": 2699,
  "Tasks": [
    {
      "Name": "stack-ms-views_market.1.vo7m84nfhnetbqieonfor6d7k",
      "EndpointID": "345de548190df03de8a9303a5875be165033b6089c026012cd29a1bb7f997b4b",
      "EndpointIP": "10.255.0.66",
      "Info": {
        "Host IP": "10.11.1.5"
      }
    }
  ]
}
root@hw-docker-w6.alahd.kz.prod:/var/run/docker/netns# nsenter --net=ingress_sbox curl 10.255.5.236:8080
curl: (7) Failed to connect to 10.255.5.236 port 8080: Connection refused

So at the moment this service responds on only 1 of the 3 nodes.

At the moment the nodes are not loaded at all, but the qLen for the ingress network on all nodes, including managers, is not 0 and continues to rise. For example:

Oct 24 10:08:28 hw-docker-w6 dockerd[1667]: time="2017-10-24T10:08:28.588288969+06:00" level=warning msg="handleTableEvent object &TableEvent{Type:DELETE,LTime:1540431,NodeName:hw-docker-w6-8b2f3bf9f0c1,NetworkID:sudpa9wvueiyk2bw66q1b9s92,TableName:overlay_peer_table,Key:07945e423ef62bf40a2a34a207282eedab53517816bad043eaf309b6a17fe189,Value:[10 14 49 48 46 50 53 53 46 48 46 49 52 47 49 54 18 17 48 50 58 52 50 58 48 97 58 102 102 58 48 48 58 48 101 26 9 49 48 46 49 49 46 49 46 54],ResidualReapTime:0,} has a 0 reapTime, is the cluster running the same docker engine version?"
Oct 24 10:09:47 hw-docker-w6 dockerd[1667]: time="2017-10-24T10:09:47.823351329+06:00" level=warning msg="handleTableEvent object &TableEvent{Type:DELETE,LTime:1606935,NodeName:hw-docker-w6-8b2f3bf9f0c1,NetworkID:sudpa9wvueiyk2bw66q1b9s92,TableName:endpoint_table,Key:e709048df2bf07b972dee144e21fca9dd2a088d679f5f4bb201dd5840df19d4c,Value:[10 38 109 115 45 97 112 112 97 117 116 104 46 50 46 112 54 119 119 109 121 57 111 110 115 105 55 102 118 97 54 99 105 52 103 113 54 110 53 106 18 10 109 115 45 97 112 112 97 117 116 104 26 25 54 113 110 118 100 106 111 118 118 48 101 107 111 56 120 108 97 99 56 120 98 114 56 48 57 34 11 49 48 46 50 53 53 46 48 46 49 55 42 11 49 48 46 50 53 53 46 48 46 49 51 50 6 24 252 63 32 252 63 66 12 51 49 98 101 102 97 101 54 48 98 99 50],ResidualReapTime:0,} has a 0 reapTime, is the cluster running the same docker engine version?"
Oct 24 10:11:09 hw-docker-w6 dockerd[1667]: time="2017-10-24T10:11:09.338321189+06:00" level=info msg="NetworkDB stats hw-docker-w6.alahd.kz.prod(c4fb589ae6f4) - netID:sudpa9wvueiyk2bw66q1b9s92 leaving:false netPeers:14 entries:5639 Queue qLen:31754 netMsg/s:27"
Oct 24 10:16:09 hw-docker-w6 dockerd[1667]: time="2017-10-24T10:16:09.542160185+06:00" level=info msg="NetworkDB stats hw-docker-w6.alahd.kz.prod(c4fb589ae6f4) - netID:sudpa9wvueiyk2bw66q1b9s92 leaving:false netPeers:14 entries:5588 Queue qLen:31758 netMsg/s:27"
Oct 24 10:16:33 hw-docker-w6 dockerd[1667]: time="2017-10-24T10:16:33.425572902+06:00" level=warning msg="handleTableEvent object &TableEvent{Type:DELETE,LTime:1462846,NodeName:hw-docker-w6-8b2f3bf9f0c1,NetworkID:sudpa9wvueiyk2bw66q1b9s92,TableName:endpoint_table,Key:ebc62a76cde8f9f67ee493c0af7d5b6e6896c345c1c49cd71d952bc84c7de5af,Value:[10 38 109 115 45 97 112 112 97 117 116 104 46 50 46 102 117 107 102 50 103 52 113 52 48 51 97 112 120 119 101 100 97 54 112 103 109 49 122 112 18 10 109 115 45 97 112 112 97 117 116 104 26 25 54 113 110 118 100 106 111 118 118 48 101 107 111 56 120 108 97 99 56 120 98 114 56 48 57 34 11 49 48 46 50 53 53 46 48 46 49 55 42 11 49 48 46 50 53 53 46 48 46 49 52 50 6 24 252 63 32 252 63 66 12 49 56 100 52 56 97 100 51 48 102 57 100],ResidualReapTime:0,} has a 0 reapTime, is the cluster running the same docker engine version?"
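A simple way to keep an eye on this counter over time (assuming dockerd logs to the systemd journal under the docker unit, as in the lines above):

journalctl -u docker --since "1 hour ago" | grep "NetworkDB stats"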

The engine version is the same on all nodes in the cluster:

zigmund@bug:~> for i in {1..5}; do ssh docker-m$i.alahd.kz.prod docker info 2>/dev/null | grep "Server Version"; done
Server Version: 17.10.0-ce
Server Version: 17.10.0-ce
Server Version: 17.10.0-ce
Server Version: 17.10.0-ce
Server Version: 17.10.0-ce
zigmund@bug:~> for i in {5..7}; do ssh hw-docker-w$i.alahd.kz.prod docker info 2>/dev/null | grep "Server Version"; done
Server Version: 17.10.0-ce
Server Version: 17.10.0-ce
Server Version: 17.10.0-ce

There is also a strange task error, Unable to complete atomic operation, key modified, which I've seen for the first time since I started using Docker:

zigmund@docker-m1.alahd.kz.prod:~$ docker service ps stack-ms-views_market --no-trunc 
ID                          NAME                          IMAGE                                                           NODE                         DESIRED STATE       CURRENT STATE          ERROR                                                          PORTS
vo7m84nfhnetbqieonfor6d7k   stack-ms-views_market.1       docker-hub-production:5000/ms/views:release-9   hw-docker-w5.alahd.kz.prod   Running             Running 4 hours ago                                                                   
zopf5noz1zognuy6why718sho    \_ stack-ms-views_market.1   docker-hub-production:5000/ms/views:release-9   hw-docker-w6.alahd.kz.prod   Shutdown            Rejected 5 hours ago   "Unable to complete atomic operation, key modified"            
ztzzhm5d4birgplz15o8mnaro    \_ stack-ms-views_market.1   docker-hub-production:5000/ms/views:release-9   hw-docker-w7.alahd.kz.prod   Shutdown            Rejected 7 hours ago   "Unable to complete atomic operation, key modified"            
zjoxpzlcduox6khps5vppk712    \_ stack-ms-views_market.1   docker-hub-production:5000/ms/views:release-9   hw-docker-w7.alahd.kz.prod   Shutdown            Rejected 7 hours ago   "Unable to complete atomic operation, key modified"            
zx9axglcxncb42c7pglfdsk03    \_ stack-ms-views_market.1   docker-hub-production:5000/ms/views:release-9   hw-docker-w5.alahd.kz.prod   Shutdown            Failed 8 hours ago     "task: non-zero exit (137): dockerexec: unhealthy container"

@svscorp

svscorp commented Oct 24, 2017

@fcrisciani Hi, I ran into a circle of issues and cleaned everything out (totally). Now I have just spun up 17.10 and set VIP mode for a service. It has been stable for 15 minutes... Will update later on.

Apparently most of my issues (lately) were connected to the fact that I placed {bip: ....} in Docker's daemon.json: it created an interface with that IP and mask. But when I was cleaning everything up, I noticed that whatever I do (including interface removal, daemon reload, deleting and reinstalling Docker), the interface (in ifconfig) still comes up with the old IP, even though I removed the {bip....} configuration from Docker a few days ago.

I think this may be another bug: Docker doesn't fall back to the default (172.17...) CIDR when a custom bridge IP was configured in the daemon and later removed.
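A rough way to force the bridge back to its defaults, assuming the standard docker0 bridge and that briefly stopping the daemon on that node is acceptable:

sudo systemctl stop docker
sudo ip link set docker0 down
sudo ip link delete docker0
# with "bip" removed from /etc/docker/daemon.json, the restarted daemon recreates docker0 on the default subnet
sudo systemctl start docker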

Update. 1h later

I am observing some instability. The logs show 'connection timeout' against the ldap service, then the web suddenly starts being served again, and after a while it hangs again with timeout messages. All services are running in VIP mode. A few of them publish ports to ingress.

All networks except the custom overlay one {"subnet" : "16.10.0.1/16", "mtu": 1418} are default.

@svscorp

svscorp commented Oct 24, 2017

@zigmund I see this every time. It usually appears for services deployed in mode: global. A server reboot helps, but it's not a solid fix.

@svscorp

svscorp commented Oct 24, 2017

@fcrisciani Okay, here is what I have: a combination of Peer operation failed: Unable to find the peerDB,
Unable to complete atomic operation, key modified, epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too while sending request to upstream, and 'status: down' for that node in docker node ls, even though the node is accessible and docker is runnable there.

Not sure where to proceed from this point.

Show error log
Oct 24 22:23:42 node-5 dockerd[7435]: time="2017-10-24T22:23:42.312716420+02:00" level=error msg="network my_app_network remove failed: No such network: my_app_network" module=node/agent node.id=4go56gp48zv6pkql6kxdjyrtv
Oct 24 22:23:42 node-5 dockerd[7435]: time="2017-10-24T22:23:42.312818402+02:00" level=error msg="remove task failed" error="No such network: my_app_network" module=node/agent node.id=4go56gp48zv6pkql6kxdjyrtv task.id=ae5tcqhljs9vobmzzw075svih
Oct 24 22:23:42 node-5 dockerd[7435]: time="2017-10-24T22:23:42.400193473+02:00" level=info msg="Node join event for 2cc61c91bc6b/x.x.x.5"
Oct 24 22:23:42 node-5 dockerd[7435]: time="2017-10-24T22:23:42.400262990+02:00" level=info msg="Node join event for 618a0eb2b749/x.x.x.7"
Oct 24 22:23:42 node-5 dockerd[7435]: time="2017-10-24T22:23:42.400291770+02:00" level=info msg="Node join event for 53f9cdd72362/x.x.x.9"
Oct 24 22:23:42 node-5 dockerd[7435]: time="2017-10-24T22:23:42.400328593+02:00" level=info msg="Node join event for d81b9c0e9fa3/x.x.x.0"
Oct 24 22:23:42 node-5 dockerd[7435]: time="2017-10-24T22:23:42.428918706+02:00" level=error msg="fatal task error" error="Unable to complete atomic operation, key modified" module=node/agent/taskmanager node.id=4go56gp48zv6pkql6kxdjyrtv service.id=xdsd5cok1ehkewm57p4hukrpv task.id=zpl8itamalx1r69r5dgi102j0
Oct 24 22:23:42 node-5 dockerd[7435]: time="2017-10-24T22:23:42.428983288+02:00" level=warning msg="Peer operation failed:Unable to find the peerDB for nid:m3bnbfjtwtm0ajpvqmhxy7287 op:&{3 m3bnbfjtwtm0ajpvqmhxy7287  [] [] [] [] false false false DeleteNetwork}"
Oct 24 22:23:42 node-5 dockerd[7435]: time="2017-10-24T22:23:42.443128960+02:00" level=error msg="fatal task error" error="Unable to complete atomic operation, key modified" module=node/agent/taskmanager node.id=4go56gp48zv6pkql6kxdjyrtv service.id=3hdz8c69sos988rtr6gkat5vy task.id=xsztpm9rk3lzuqnsvueubicff
Oct 24 22:23:42 node-5 dockerd[7435]: time="2017-10-24T22:23:42.443522349+02:00" level=warning msg="Peer operation failed:Unable to find the peerDB for nid:m3bnbfjtwtm0ajpvqmhxy7287 op:&{3 m3bnbfjtwtm0ajpvqmhxy7287  [] [] [] [] false false false DeleteNetwork}"
Oct 24 22:23:42 node-5 dockerd[7435]: time="2017-10-24T22:23:42.468173110+02:00" level=info msg="Node join event for 315aa4148ed7/x.x.x.5"
Oct 24 22:23:42 node-5 dockerd[7435]: time="2017-10-24T22:23:42.605192262+02:00" level=info msg="Node join event for 1a921a1476e7/x.x.x.4"
Oct 24 22:23:42 node-5 dockerd[7435]: time="2017-10-24T22:23:42.824126376+02:00" level=warning msg="failed to deactivate service binding for container mystack_logstash.4go56gp48zv6pkql6kxdjyrtv.wyl7o0641szx2ccb5dg6lp1zg" error="No such container: mystack_logstash.4go56gp48zv6pkql6kxdjyrtv.wyl7o0641szx2ccb5dg6lp1zg" module=node/agent node.id=4go56gp48zv6pkql6kxdjyrtv
Oct 24 22:23:42 node-5 dockerd[7435]: time="2017-10-24T22:23:42.824312835+02:00" level=warning msg="failed to deactivate service binding for container mystack_sensu-client-system.4go56gp48zv6pkql6kxdjyrtv.tz62rjz82nhb8nmabeuikqha0" error="No such container: mystack_sensu-client-system.4go56gp48zv6pkql6kxdjyrtv.tz62rjz82nhb8nmabeuikqha0" module=node/agent node.id=4go56gp48zv6pkql6kxdjyrtv

@svscorp

svscorp commented Oct 24, 2017

An observation: when the issue happens, I can see that one of the interfaces is gone (eth0, eth2, ...).

Show additional logs (debug: true)
Oct 25 00:10:53 node-4 dockerd: time="2017-10-25T00:10:53.625312402+02:00" level=debug msg="handleEpTableEvent ADD 6a0a8239d7168e72fdb8e5d71b923c294b55eee262844b82a40a0237781f862a R:{mystack_ldap.1.j7tcyir72w4a8ge5fj1h5ix9j mystack_ldap 95wy6f3qy97vs3bhfyp9co76z 10.255.0.55 10.255.0.15 [&PortConfig{Name:,Protocol:TCP,TargetPort:389,PublishedPort:389,}] [] [0d6b298b6af1]}"
Oct 25 00:10:53 node-4 dockerd: time="2017-10-25T00:10:53.625444378+02:00" level=debug msg="addServiceBinding from handleEpTableEvent START for mystack_ldap 6a0a8239d7168e72fdb8e5d71b923c294b55eee262844b82a40a0237781f862a p:0xc420f58200 nid:oz7wmme7u6zdsp3cbemoh0dou skey:{95wy6f3qy97vs3bhfyp9co76z 389:389/TCP}"
Oct 25 00:10:53 node-4 dockerd: time="2017-10-25T00:10:53.625719059+02:00" level=debug msg="/usr/sbin/iptables, [--wait -t nat -nL DOCKER-INGRESS]"
Oct 25 00:10:53 node-4 dockerd: time="2017-10-25T00:10:53.629186322+02:00" level=info msg="Node join event for 2cc61c91bc6b/x.x.x.5"
Oct 25 00:10:53 node-4 dockerd: time="2017-10-25T00:10:53.630873721+02:00" level=debug msg="/usr/sbin/iptables, [--wait -t filter -nL DOCKER-INGRESS]"
Oct 25 00:10:53 node-4 dockerd: time="2017-10-25T00:10:53.634766542+02:00" level=debug msg="/usr/sbin/iptables, [--wait -t nat -C DOCKER-INGRESS -j RETURN]"
Oct 25 00:10:53 node-4 dockerd: time="2017-10-25T00:10:53.638801288+02:00" level=debug msg="/usr/sbin/iptables, [--wait -t filter -C DOCKER-INGRESS -j RETURN]"
Oct 25 00:10:53 node-4 dockerd: time="2017-10-25T00:10:53.642166632+02:00" level=debug msg="/usr/sbin/iptables, [--wait -t nat -C OUTPUT -m addrtype --dst-type LOCAL -j DOCKER-INGRESS]"
Oct 25 00:10:53 node-4 dockerd: time="2017-10-25T00:10:53.645868055+02:00" level=debug msg="/usr/sbin/iptables, [--wait -t nat -C PREROUTING -m addrtype --dst-type LOCAL -j DOCKER-INGRESS]"
Oct 25 00:10:53 node-4 dockerd: time="2017-10-25T00:10:53.649945734+02:00" level=debug msg="/usr/sbin/iptables, [--wait -t filter -C FORWARD -j DOCKER-INGRESS]"
Oct 25 00:10:53 node-4 dockerd: time="2017-10-25T00:10:53.653577073+02:00" level=debug msg="/usr/sbin/iptables, [--wait -t nat -C POSTROUTING -m addrtype --src-type LOCAL -o docker_gwbridge -j MASQUERADE]"
Oct 25 00:10:53 node-4 dockerd: time="2017-10-25T00:10:53.657276831+02:00" level=debug msg="/usr/sbin/iptables, [--wait -t nat -nL DOCKER-INGRESS]"
Oct 25 00:10:53 node-4 dockerd: time="2017-10-25T00:10:53.660672964+02:00" level=debug msg="/usr/sbin/iptables, [--wait -t nat -I DOCKER-INGRESS -p tcp --dport 389 -j DNAT --to-destination 172.19.0.4:389]"
Oct 25 00:10:53 node-4 dockerd: time="2017-10-25T00:10:53.664352138+02:00" level=debug msg="/usr/sbin/iptables, [--wait -I DOCKER-INGRESS -m state -p tcp --sport 389 --state ESTABLISHED,RELATED -j ACCEPT]"
Oct 25 00:10:53 node-4 dockerd: time="2017-10-25T00:10:53.668713458+02:00" level=debug msg="/usr/sbin/iptables, [--wait -I DOCKER-INGRESS -p tcp --dport 389 -j ACCEPT]"
Oct 25 00:10:53 node-4 dockerd: time="2017-10-25T00:10:53.672832101+02:00" level=debug msg="Creating service for vip 10.255.0.55 fwMark 292 ingressPorts []*libnetwork.PortConfig{&libnetwork.PortConfig{Name: \"\",\nProtocol: 0,\nTargetPort: 0x185,\nPublishedPort: 0x185,\n}} in sbox ingress (ingress)"
Oct 25 00:10:53 node-4 dockerd: time="2017-10-25T00:10:53.712335076+02:00" level=debug msg="memberlist: Stream connection from=x.x.x.65:36498"
Oct 25 00:10:53 node-4 dockerd: time="2017-10-25T00:10:53.717545996+02:00" level=info msg="Node join event for 315aa4148ed7/x.x.x.65"
Oct 25 00:10:53 node-4 dockerd: time="2017-10-25T00:10:53.737119696+02:00" level=debug msg="addEndpointNameResolution 6a0a8239d7168e72fdb8e5d71b923c294b55eee262844b82a40a0237781f862a mystack_ldap add_service:true sAliases:[] tAliases:[0d6b298b6af1]"
Oct 25 00:10:53 node-4 dockerd: time="2017-10-25T00:10:53.737188759+02:00" level=debug msg="addContainerNameResolution 6a0a8239d7168e72fdb8e5d71b923c294b55eee262844b82a40a0237781f862a mystack_ldap.1.j7tcyir72w4a8ge5fj1h5ix9j"
Oct 25 00:10:53 node-4 dockerd: time="2017-10-25T00:10:53.737211195+02:00" level=debug msg="addServiceBinding from handleEpTableEvent END for mystack_ldap 6a0a8239d7168e72fdb8e5d71b923c294b55eee262844b82a40a0237781f862a"
Oct 25 00:10:53 node-4 dockerd: time="2017-10-25T00:10:53.796389798+02:00" level=debug msg="handleEpTableEvent ADD 353eb02adf1e0e52dc442bc7d0c287d2dc8d6e30d8ce65871f7421aac7111238 R:{mystack_ldap.1.j7tcyir72w4a8ge5fj1h5ix9j mystack_ldap 95wy6f3qy97vs3bhfyp9co76z 110.10.1.125 110.10.0.19 [] [ldap] [0d6b298b6af1]}"
Oct 25 00:10:53 node-4 dockerd: time="2017-10-25T00:10:53.796526049+02:00" level=debug msg="addServiceBinding from handleEpTableEvent START for mystack_ldap 353eb02adf1e0e52dc442bc7d0c287d2dc8d6e30d8ce65871f7421aac7111238 p:0xc420126780 nid:m3bnbfjtwtm0ajpvqmhxy7287 skey:{95wy6f3qy97vs3bhfyp9co76z }"
Oct 25 00:10:53 node-4 dockerd: time="2017-10-25T00:10:53.796766264+02:00" level=debug msg="Creating service for vip 110.10.1.125 fwMark 293 ingressPorts []*libnetwork.PortConfig(nil) in sbox 2213eec (e82deb0)"
Oct 25 00:10:53 node-4 dockerd: time="2017-10-25T00:10:53.854571313+02:00" level=debug msg="Creating service for vip 110.10.1.125 fwMark 293 ingressPorts []*libnetwork.PortConfig(nil) in sbox a296fcc (c0c979b)"
Oct 25 00:10:53 node-4 dockerd: time="2017-10-25T00:10:53.916581968+02:00" level=debug msg="Creating service for vip 110.10.1.125 fwMark 293 ingressPorts []*libnetwork.PortConfig(nil) in sbox 906d879 (91a3f60)"
Oct 25 00:10:53 node-4 dockerd: time="2017-10-25T00:10:53.973336496+02:00" level=debug msg="Creating service for vip 110.10.1.125 fwMark 293 ingressPorts []*libnetwork.PortConfig(nil) in sbox 2dc6945 (lgadop-)"
Oct 25 00:10:54 node-4 dockerd: time="2017-10-25T00:10:54.033786427+02:00" level=debug msg="addEndpointNameResolution 353eb02adf1e0e52dc442bc7d0c287d2dc8d6e30d8ce65871f7421aac7111238 mystack_ldap add_service:true sAliases:[ldap] tAliases:[0d6b298b6af1]"
Oct 25 00:10:54 node-4 dockerd: time="2017-10-25T00:10:54.033860249+02:00" level=debug msg="addContainerNameResolution 353eb02adf1e0e52dc442bc7d0c287d2dc8d6e30d8ce65871f7421aac7111238 mystack_ldap.1.j7tcyir72w4a8ge5fj1h5ix9j"
Oct 25 00:10:54 node-4 dockerd: time="2017-10-25T00:10:54.033891198+02:00" level=debug msg="353eb02adf1e0e52dc442bc7d0c287d2dc8d6e30d8ce65871f7421aac7111238 (m3bnbfj).addSvcRecords(mystack_ldap.1.j7tcyir72w4a8ge5fj1h5ix9j, 110.10.0.19, <nil>, true) addServiceBinding sid:353eb02adf1e0e52dc442bc7d0c287d2dc8d6e30d8ce65871f7421aac7111238"
Oct 25 00:10:54 node-4 dockerd: time="2017-10-25T00:10:54.033967326+02:00" level=debug msg="353eb02adf1e0e52dc442bc7d0c287d2dc8d6e30d8ce65871f7421aac7111238 (m3bnbfj).addSvcRecords(0d6b298b6af1, 110.10.0.19, <nil>, true) addServiceBinding sid:353eb02adf1e0e52dc442bc7d0c287d2dc8d6e30d8ce65871f7421aac7111238"
Oct 25 00:10:54 node-4 dockerd: time="2017-10-25T00:10:54.034007759+02:00" level=debug msg="353eb02adf1e0e52dc442bc7d0c287d2dc8d6e30d8ce65871f7421aac7111238 (m3bnbfj).addSvcRecords(tasks.mystack_ldap, 110.10.0.19, <nil>, false) addServiceBinding sid:95wy6f3qy97vs3bhfyp9co76z"
Oct 25 00:10:54 node-4 dockerd: time="2017-10-25T00:10:54.034036149+02:00" level=debug msg="353eb02adf1e0e52dc442bc7d0c287d2dc8d6e30d8ce65871f7421aac7111238 (m3bnbfj).addSvcRecords(tasks.ldap, 110.10.0.19, <nil>, false) addServiceBinding sid:95wy6f3qy97vs3bhfyp9co76z"
Oct 25 00:10:54 node-4 dockerd: time="2017-10-25T00:10:54.034068176+02:00" level=debug msg="353eb02adf1e0e52dc442bc7d0c287d2dc8d6e30d8ce65871f7421aac7111238 (m3bnbfj).addSvcRecords(mystack_ldap, 110.10.1.125, <nil>, false) addServiceBinding sid:95wy6f3qy97vs3bhfyp9co76z"
Oct 25 00:10:54 node-4 dockerd: time="2017-10-25T00:10:54.034095194+02:00" level=debug msg="353eb02adf1e0e52dc442bc7d0c287d2dc8d6e30d8ce65871f7421aac7111238 (m3bnbfj).addSvcRecords(ldap, 110.10.1.125, <nil>, false) addServiceBinding sid:95wy6f3qy97vs3bhfyp9co76z"
Oct 25 00:10:54 node-4 dockerd: time="2017-10-25T00:10:54.034119459+02:00" level=debug msg="addServiceBinding from handleEpTableEvent END for mystack_ldap 353eb02adf1e0e52dc442bc7d0c287d2dc8d6e30d8ce65871f7421aac7111238"

@velislav

velislav commented Nov 7, 2017

I have the same problem with 17.10. From service1 on node swarm1 I am unable to resolve any other service in the same external Docker network.

ubuntu@swarm0:~$ docker node ls
ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS
d4xljclgf7bckt7pm63c6ukln *   swarm0              Ready               Active              Leader
bdl84a615adhqy9k7hhyt8zqw     swarm1              Ready               Active              
44h8hh5x1su3zf566yt6rfcat     swarm2              Ready               Active              
pzeunvp8dymxzcvze5t6nrfn3     swarm3              Ready               Active              
on2xq8j4gwuzluo4nkjifffkv     swarm4              Ready               Active              Reachable
1a2pvtd238mddndgxuirlwmad     swarm5              Ready               Active              Reachable
rq4auyi9h1aayauahwjml86lx     swarm6              Ready               Active              
9xjac5wefpxgym7imvktr62ar     swarm7              Ready               Active  

On every node I have 1 service:

ubuntu@swarm0:~$ docker stack ps try | grep Running
qhol03x3e62t        try_service1.1       nginx:latest        swarm1              Running             Running 15 minutes ago                        
nscs45p1zybt        try_service2.1       nginx:latest        swarm2              Running             Running 2 hours ago                           
c8t1dxtbqta2        try_service7.1       nginx:latest        swarm7              Running             Running 4 hours ago                           
s90kfncrdw4a        try_service6.1       nginx:latest        swarm6              Running             Running 4 hours ago                           
moh4674b3dhf        try_service5.1       nginx:latest        swarm5              Running             Running 4 hours ago                           
gympvxcx6h8l        try_service4.1       nginx:latest        swarm4              Running             Running 4 hours ago                           
q6o00q5pr6uo        try_service3.1       nginx:latest        swarm3              Running             Running 4 hours ago                           
m7fd9v0h9n5i        try_service0.1       nginx:latest        swarm0              Running             Running 4 hours ago 

All of the services are connected via one external network, but not all of them are registered inside it:

ubuntu@swarm0:~$ docker network inspect -v proxy | grep try
                "Name": "try_service0.1.m7fd9v0h9n5ijv5i3jgv3r8hh",
            "try_service0": {
                        "Name": "try_service0.1.m7fd9v0h9n5ijv5i3jgv3r8hh",
            "try_service2": {
                        "Name": "try_service2.1.nscs45p1zybt5f4lg8ah7eu6j",
            "try_service3": {
                        "Name": "try_service3.1.q6o00q5pr6uovo9ey9nac04dv",
            "try_service4": {
                        "Name": "try_service4.1.gympvxcx6h8l6gws9eti1dcc8",
            "try_service5": {
                        "Name": "try_service5.1.moh4674b3dhfe9kzdlzuhyoqf",
            "try_service6": {
                        "Name": "try_service6.1.s90kfncrdw4aa18ewvdm086j2",
            "try_service7": {
                        "Name": "try_service7.1.c8t1dxtbqta2k845oo8q5k9ae",

On the problematic node swarm1, the Docker engine logs are full of messages like these:

Nov  7 12:08:24 swarm1 dockerd[1352]: time="2017-11-07T12:08:24.026633722Z" level=warning msg="NetworkDB stats swarm1(20f8e30ededd) - healthscore:7 (connectivity issues)"
Nov  7 12:09:24 swarm1 dockerd[1352]: time="2017-11-07T12:09:24.226791177Z" level=warning msg="NetworkDB stats swarm1(20f8e30ededd) - healthscore:7 (connectivity issues)"
Nov  7 12:09:53 swarm1 dockerd[1352]: time="2017-11-07T12:09:53.626739298Z" level=info msg="NetworkDB stats swarm1(20f8e30ededd) - netID:qq08mhau4o42syygs6ur91niv leaving:false netPeers:1 entries:2 Queue qLen:0 netMsg/s:0"
Nov  7 12:09:53 swarm1 dockerd[1352]: time="2017-11-07T12:09:53.627419202Z" level=info msg="NetworkDB stats swarm1(20f8e30ededd) - netID:fj74h91r1sthsr7qqnehzahb0 leaving:false netPeers:1 entries:4 Queue qLen:0 netMsg/s:0"
Nov  7 12:09:53 swarm1 dockerd[1352]: time="2017-11-07T12:09:53.627765909Z" level=info msg="NetworkDB stats swarm1(20f8e30ededd) - netID:pmcmzfiy3wcqqf8ln45x4x3zu leaving:false netPeers:1 entries:2 Queue qLen:0 netMsg/s:0"
Nov  7 12:10:24 swarm1 dockerd[1352]: time="2017-11-07T12:10:24.426683294Z" level=warning msg="NetworkDB stats swarm1(20f8e30ededd) - healthscore:7 (connectivity issues)"
Nov  7 12:11:24 swarm1 dockerd[1352]: time="2017-11-07T12:11:24.426808011Z" level=warning msg="NetworkDB stats swarm1(20f8e30ededd) - healthscore:7 (connectivity issues)"
Nov  7 12:11:59 swarm1 dhclient[1010]: DHCPREQUEST of 10.4.0.130 on ens3 to 10.4.0.3 port 67 (xid=0x726ec96)
Nov  7 12:11:59 swarm1 dhclient[1010]: DHCPACK of 10.4.0.130 from 10.4.0.3
Nov  7 12:11:59 swarm1 dhclient[1010]: bound to 10.4.0.130 -- renewal in 263 seconds.
Nov  7 12:12:24 swarm1 dockerd[1352]: time="2017-11-07T12:12:24.626729426Z" level=warning msg="NetworkDB stats swarm1(20f8e30ededd) - healthscore:7 (connectivity issues)"
Nov  7 12:13:24 swarm1 dockerd[1352]: time="2017-11-07T12:13:24.826788200Z" level=warning msg="NetworkDB stats swarm1(20f8e30ededd) - healthscore:7 (connectivity issues)"
Nov  7 12:14:25 swarm1 dockerd[1352]: time="2017-11-07T12:14:25.026771522Z" level=warning msg="NetworkDB stats swarm1(20f8e30ededd) - healthscore:7 (connectivity issues)"
Nov  7 12:14:53 swarm1 dockerd[1352]: time="2017-11-07T12:14:53.826740624Z" level=info msg="NetworkDB stats swarm1(20f8e30ededd) - netID:qq08mhau4o42syygs6ur91niv leaving:false netPeers:1 entries:2 Queue qLen:0 netMsg/s:0"
Nov  7 12:14:53 swarm1 dockerd[1352]: time="2017-11-07T12:14:53.826864839Z" level=info msg="NetworkDB stats swarm1(20f8e30ededd) - netID:pmcmzfiy3wcqqf8ln45x4x3zu leaving:false netPeers:1 entries:2 Queue qLen:0 netMsg/s:0"
Nov  7 12:14:53 swarm1 dockerd[1352]: time="2017-11-07T12:14:53.826885409Z" level=info msg="NetworkDB stats swarm1(20f8e30ededd) - netID:fj74h91r1sthsr7qqnehzahb0 leaving:false netPeers:1 entries:4 Queue qLen:0 netMsg/s:0"
Nov  7 12:15:25 swarm1 dockerd[1352]: time="2017-11-07T12:15:25.227334039Z" level=warning msg="NetworkDB stats swarm1(20f8e30ededd) - healthscore:7 (connectivity issues)"
Nov  7 12:16:14 swarm1 systemd[1]: Starting Automatically fetch and run repair assertions...
Nov  7 12:16:14 swarm1 snap-repair[24720]: error: cannot use snap-repair on a classic system
Nov  7 12:16:14 swarm1 systemd[1]: Started Automatically fetch and run repair assertions.
Nov  7 12:16:23 swarm1 dhclient[1010]: DHCPREQUEST of 10.4.0.130 on ens3 to 10.4.0.3 port 67 (xid=0x726ec96)
Nov  7 12:16:23 swarm1 dhclient[1010]: DHCPACK of 10.4.0.130 from 10.4.0.3
Nov  7 12:16:23 swarm1 dhclient[1010]: bound to 10.4.0.130 -- renewal in 230 seconds.
Nov  7 12:16:25 swarm1 dockerd[1352]: time="2017-11-07T12:16:25.426758411Z" level=warning msg="NetworkDB stats swarm1(20f8e30ededd) - healthscore:7 (connectivity issues)"
Nov  7 12:17:01 swarm1 CRON[24769]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Nov  7 12:17:25 swarm1 dockerd[1352]: time="2017-11-07T12:17:25.426836990Z" level=warning msg="NetworkDB stats swarm1(20f8e30ededd) - healthscore:7 (connectivity issues)"
Nov  7 12:18:25 swarm1 dockerd[1352]: time="2017-11-07T12:18:25.626746876Z" level=warning msg="NetworkDB stats swarm1(20f8e30ededd) - healthscore:7 (connectivity issues)"
Nov  7 12:19:25 swarm1 dockerd[1352]: time="2017-11-07T12:19:25.826751473Z" level=warning msg="NetworkDB stats swarm1(20f8e30ededd) - healthscore:7 (connectivity issues)"
Nov  7 12:19:54 swarm1 dockerd[1352]: time="2017-11-07T12:19:54.026762043Z" level=info msg="NetworkDB stats swarm1(20f8e30ededd) - netID:qq08mhau4o42syygs6ur91niv leaving:false netPeers:1 entries:2 Queue qLen:0 netMsg/s:0"
Nov  7 12:19:54 swarm1 dockerd[1352]: time="2017-11-07T12:19:54.026868360Z" level=info msg="NetworkDB stats swarm1(20f8e30ededd) - netID:fj74h91r1sthsr7qqnehzahb0 leaving:false netPeers:1 entries:4 Queue qLen:0 netMsg/s:0"
Nov  7 12:19:54 swarm1 dockerd[1352]: time="2017-11-07T12:19:54.026886243Z" level=info msg="NetworkDB stats swarm1(20f8e30ededd) - netID:pmcmzfiy3wcqqf8ln45x4x3zu leaving:false netPeers:1 entries:2 Queue qLen:0 netMsg/s:0"
Nov  7 12:20:13 swarm1 dhclient[1010]: DHCPREQUEST of 10.4.0.130 on ens3 to 10.4.0.3 port 67 (xid=0x726ec96)
Nov  7 12:20:13 swarm1 dhclient[1010]: DHCPACK of 10.4.0.130 from 10.4.0.3
Nov  7 12:20:13 swarm1 dhclient[1010]: bound to 10.4.0.130 -- renewal in 252 seconds.
Nov  7 12:20:26 swarm1 dockerd[1352]: time="2017-11-07T12:20:26.026781366Z" level=warning msg="NetworkDB stats swarm1(20f8e30ededd) - healthscore:7 (connectivity issues)"
Nov  7 12:21:26 swarm1 dockerd[1352]: time="2017-11-07T12:21:26.227426682Z" level=warning msg="NetworkDB stats swarm1(20f8e30ededd) - healthscore:7 (connectivity issues)"
Nov  7 12:22:26 swarm1 dockerd[1352]: time="2017-11-07T12:22:26.426762006Z" level=warning msg="NetworkDB stats swarm1(20f8e30ededd) - healthscore:7 (connectivity issues)"
Nov  7 12:23:26 swarm1 dockerd[1352]: time="2017-11-07T12:23:26.643349554Z" level=warning msg="NetworkDB stats swarm1(20f8e30ededd) - healthscore:7 (connectivity issues)"

As additional information, nodes 0 to 3 run on OpenStack and are behind NAT; they advertise the external IP provided by OpenStack. The other nodes, 4 to 7, are on VMware and advertise their own IPs. The connection between all nodes looks fine and the following ports are accessible:

    TCP port 2377 for cluster management communications
    TCP and UDP port 7946 for communication among nodes
    UDP port 4789 for overlay network traffic

@fcrisciani
Contributor

@zigmund the log Unable to complete atomic operation, key modified is the reason the number of entries is constantly going up: the containers start and then fail, so they create entries and then delete them. Deleted entries are garbage-collected 30 minutes after being deleted. The issue is tracked here: #35310

@fcrisciani
Contributor

^ @svscorp it looks like the same applies to you

@fcrisciani
Contributor

@velislav from the logs it is clear that swarm1 is not able to communicate properly. The healthscore log should not appear:

NetworkDB stats swarm1(20f8e30ededd) - healthscore:7 (connectivity issues)

Also, the stats show that the node is the only one participating in those networks:

Nov  7 12:19:54 swarm1 dockerd[1352]: time="2017-11-07T12:19:54.026762043Z" level=info msg="NetworkDB stats swarm1(20f8e30ededd) - netID:qq08mhau4o42syygs6ur91niv leaving:false netPeers:1 entries:2 Queue qLen:0 netMsg/s:0"
Nov  7 12:19:54 swarm1 dockerd[1352]: time="2017-11-07T12:19:54.026868360Z" level=info msg="NetworkDB stats swarm1(20f8e30ededd) - netID:fj74h91r1sthsr7qqnehzahb0 leaving:false netPeers:1 entries:4 Queue qLen:0 netMsg/s:0"
Nov  7 12:19:54 swarm1 dockerd[1352]: time="2017-11-07T12:19:54.026886243Z" level=info msg="NetworkDB stats swarm1(20f8e30ededd) - netID:pmcmzfiy3wcqqf8ln45x4x3zu leaving:false netPeers:1 entries:2 Queue qLen:0 netMsg/s:0"

The netPeers field shows the number of nodes that are part of that network.
Check the network connectivity to and from swarm1 again.
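A quick way to probe the Swarm ports listed above from another node (nc flag support varies by variant, and a UDP probe only shows that packets are not actively rejected):

nc -zv swarm1 2377
nc -zv swarm1 7946
nc -zuv swarm1 7946
nc -zuv swarm1 4789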

@svscorp

svscorp commented Nov 11, 2017

@fcrisciani I am tracking that thread. 17.11 RC3 is available; I'm not sure whether it's time to update to it (someone reported in another thread that they can still reproduce an issue).

@thaJeztah
Member

With moby/libnetwork#1944, moby/libnetwork#2013, moby/libnetwork#2047 these problems should be resolved.

If you're still having issues on a current version of Docker, please open a new ticket; that makes it easier to triage (and it could have a different cause).
