Docker Swarm Mode service discovery is totally unstable #33589
cluster

root@node5:~# docker service ps balancer | grep node
i4p2b70c58rm balancer.s42tug46h78l0p7z5unlsr0a2 fx/balancer:latest node4 Running Running 7 hours ago *:443->443/tcp,*:80->80/tcp
tnw1tx627um5 balancer.yidccbcki64epay4p4ugq6xm1 fx/balancer:latest node5 Running Running 7 hours ago *:443->443/tcp,*:80->80/tcp
kz6ubc7pgct4 balancer.jpfvnxy14acbjtrmv3n0x78xt fx/balancer:latest node2 Running Running 7 hours ago *:443->443/tcp,*:80->80/tcp
root@node5:~# docker service ps jenkins
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
5q4u91busffe jenkins.1 jenkins:latest dwrk-jenkins-01 Running Running 10 days ago
s4a4w8kch6xv \_ jenkins.1 jenkins:latest dwrk-jenkins-01 Shutdown Shutdown 10 days ago

node4 (master node)

root@node4:~# docker exec -it balancer.s42tug46h78l0p7z5unlsr0a2.i4p2b70c58rm1yst9s5cjlscc bash
root@balancer:/# nslookup jenkins 127.0.0.11
Server: 127.0.0.11
Address: 127.0.0.11#53
Non-authoritative answer:
Name: jenkins
Address: 10.0.0.76
root@balancer:/# ping jenkins
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.086/0.087/0.088/0.000 ms
root@balancer:/# echo > /dev/tcp/jenkins/8080 && echo ok || echo notok
ok

node5 (same cluster, also master node)

root@node5:~# docker exec -it balancer.yidccbcki64epay4p4ugq6xm1.tnw1tx627um5hjm2vnt624ino bash
root@balancer:/# nslookup jenkins 127.0.0.11
Server: 127.0.0.11
Address: 127.0.0.11#53
Non-authoritative answer:
Name: jenkins
Address: 10.0.0.76
root@balancer:/# ping jenkins
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.080/0.089/0.098/0.000 ms
root@balancer:/# echo > /dev/tcp/jenkins/8080 && echo ok || echo notok
bash: connect: Connection timed out
bash: /dev/tcp/jenkins/8080: Connection timed out
notok
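As an aside for anyone reproducing this: in VIP mode the service name resolves to the virtual IP, while the tasks.<service> name resolves to the individual task IPs, which helps tell a stale VIP apart from a missing task. A minimal sketch, reusing the jenkins service from the transcript above:

root@balancer:/# nslookup jenkins 127.0.0.11         # returns the service VIP
root@balancer:/# nslookup tasks.jenkins 127.0.0.11   # returns one address per running task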
Thanks @soar, we are aware of this instability and I'm currently working on a patch to stabilize it.

@fcrisciani thank you! I'm ready to test anything that could help.

Will let you know when I have something ready to be tested.

We are experiencing the same problem, and it actually stops us from deploying Docker to production....

Docker 17.06 will carry changes that should fix this (I think this PR is part of that: moby/libnetwork#1796, but @fcrisciani perhaps you can confirm?). If you have a setup to test, release candidates for Docker 17.06 are available in the "test" channel on download.docker.com; keep in mind they are release candidates, so running them in production is not yet recommended.
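For anyone who wants to try an RC, one way to install from the test channel is the get.docker.com convenience script with its CHANNEL variable; a sketch, assuming the script's defaults and a non-production node:

$ curl -fsSL https://get.docker.com -o get-docker.sh
$ sudo CHANNEL=test sh get-docker.sh   # installs the latest release candidate from the test channel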
Yes, correct; that PR fixes several race conditions found in the service discovery.

In Russian we have the expression "podgoráet pukán", which means that we urgently need a stable version with this bug fixed...

@soar "fireass" in English

My team and I would also be very glad for a stable swarm version, not a "stable" swarm version :)

@soar The 17.06 GA release should happen pretty soon, but the last RC, RC5, has all the service discovery fixes. Can you give it a try and let us know if it helps?
@sanimej In the Docker 17.06.1 GA release I got the same error. Then the containers end up in the error state:
Seeing the same error in 17.06.0-ce. The message is being emitted from quite a few, if not all, of our manager and worker nodes.

ping @fcrisciani @abhinandanpb ^^
@taiidani do you actually see a connectivity issue? That is simply a warning message notifying that the specific MAC entry is already present in the DB, so it won't be reconfigured again. We can potentially try to rate-limit it, but it should not cause any connectivity issue.
@fcrisciani I can't guarantee it was related, but I saw the logs while troubleshooting one of our manager nodes having intermittent outbound network connectivity issues. That problem appears to have been solved by rebooting the node, but since my cluster has had sporadic severe connectivity issues between services on overlay networks and on outbound connections, I wanted to make sure that if these logs were part of the problem, my occurrence of them was noted. It sounds like the log entries are unrelated to the instability I've been seeing?
@taiidani what cloud provider are you using? Do you ever see any sporadic connectivity issues outside of Docker swarm, on the host itself? For example, when you simply run a lot of pings or similar between the nodes, or to some other external source, over a few hours to a day.
@jgranstrom We're using AWS (us-west-2) with a Swarm setup comprising 17.06-ce installed on Amazon's vanilla Ubuntu 16.04 AMI. So far we've seen issues where:

It's been very hard to nail these down, as they happen intermittently and against production systems, meaning we need to implement quick solutions to prevent downtime.
On 17.06.1 and 17.06.2 we still sometimes have the same issues. After a service update, some replicas do not respond on some worker nodes.

@zigmund I believe your description matches this issue: moby/libnetwork#1934, for which PR moby/libnetwork#1935 got merged and will be present in the next release.
Service discovery by namespace and service name is totally broken; it doesn't even have consistency. Sometimes the namespace and service name can be discovered, sometimes not, with the same docker-compose settings file. These bugs are hard to reproduce, but I suspect they are related to the networks: sometimes it works if you delete the relevant networks or exited containers with different settings. So my team is now using the old way, IP addresses, again instead of service discovery. This is the worst part of Docker swarm mode; we hoped we could easily manage our Docker files with service discovery, and that is why we moved to "docker swarm" in the first place.
I have got the same error after upgrading from 17.07 to 17.10 :( Say I have [nginx] and [jenkins], and I kill the jenkins container on one of the nodes (it restarts on another node): jenkins is not reachable via that nginx anymore and returns a 502 error. In the logs it looks like it's trying to access jenkins using the old overlay network IP.

Could it be related to the changes recently introduced in 17.10 and moby/libnetwork#1935? I hadn't noticed that behavior before.

@svscorp I guess you meant to refer to libnetwork 1935, but that is actually related to the overlay more than to DNS. Did you actually verify that the DNS resolution is incorrect inside the container and that there is no caching happening?
@svscorp What endpoint mode do you use with jenkins? As far as I know, nginx resolves upstream IPs only at startup. If you use DNSRR mode, the IPs of the containers may change after a stop/kill etc. You should use VIP endpoint mode for jenkins instead, since the VIP of a service stays static for the service's lifetime.
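To make the two modes concrete, a sketch using the standard docker CLI (the network name mynet is hypothetical); VIP is the default, and the active mode can be read back from a manager:

$ docker service inspect --format '{{.Spec.EndpointSpec.Mode}}' jenkins   # prints "vip" or "dnsrr"
$ docker service create --name jenkins --endpoint-mode vip --network mynet jenkins:latest

With VIP mode, nginx keeps resolving the same address across task restarts, so upstreams resolved at startup stay valid.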
@zigmund I am in favor of using VIP, and Jenkins is now in VIP mode. There are no ports published to the ingress network. But the setup is unstable: at some random time I can't use the tools. The browser just hangs, and looking at the logs I can see nginx errors on every HTTP request saying that the connection (to ldap) timed out, or "client closed keepalive connection". When I had the initial problem, the above behavior came together with "L3 reprogramming failed" errors. On 17.10 I don't see this error message, but it feels the same. Also, journalctl shows the logs being continuously spammed with node join events (on 17.07 there were far fewer messages). That's a separate story (probably). The setup looks like this:

@fcrisciani Thanks, I corrected the link; indeed I meant libnetwork. I haven't yet tried checking from inside a container, and I don't remember whether on 17.07, say, ldapsearch worked from the nginx container (node1) to the ldap container (node). Will test and write back.
@svscorp the node join message is not a concern; apart from its frequency, it does not indicate an issue. It is only printed when the nodes exchange state through a TCP sync, and I will look into hiding it. After that, to exclude issues due to caching or the application layer, I would validate the result from inside a container: simply try to resolve the service name and check that the VIP matches the docker service inspect result. Further, I would check with docker network inspect -v (on a manager) that the containers being shown actually match what is running.
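Spelled out, that validation might look like this; a sketch reusing the jenkins service from this thread, with mynet standing in for the actual overlay network name:

# inside a container attached to the overlay network:
root@balancer:/# nslookup jenkins 127.0.0.11

# on a manager, the address returned above should match the VIP for that network:
$ docker service inspect --format '{{json .Endpoint.VirtualIPs}}' jenkins

# and the verbose network view should list exactly the tasks that are running:
$ docker network inspect -v mynet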
@fcrisciani Thank you very much for helping.

Confirming this. There are two messages coming in my case, as far as I remember. I'll check on qLen tonight.

In return, @fcrisciani @thaJeztah, if you would like me to test a specific thing/issue/fix in 17.10, I am able to recreate the platform for tuning and troubleshooting purposes and can help.
It will be one per network; consider that the ingress network is created at the beginning, when you enter swarm mode. You can correlate the network ID with a simple docker network ls. I'm currently just curious to see whether on a fresh 17.10 you are still able to see inconsistencies in the service discovery. We also need to better understand the use case, add a further automated test suite aimed at fixing eventual new bugs, and strengthen our testing infra to avoid regressions.
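One way to do that correlation, sketched under the assumption of a systemd host where the daemon emits the NetworkDB stats lines quoted later in this thread:

$ journalctl -u docker.service | grep "NetworkDB stats"   # one stats line per network, keyed by network ID
$ docker network ls --no-trunc                            # full network IDs to match against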
@fcrisciani meanwhile my "rollback" build to 17.07 was completed. Within 15 minutes I got the website not responding. Debugging, Docker logs, and nginx logs:

The website suddenly started to be responsive again. While I was writing all the debug info, it fixed itself, but I have no idea for how long. Moving in circles... It worked on 17.07, now it doesn't. Rebuilding on 17.10 and testing further.

Setup details: RHEL 7.3

Network settings:
@fcrisciani here's another problem. Docker 17.10, 3 hardware worker nodes, 5 VM manager nodes. For example, a working node:

Another node:

So at the moment this service responds on only 1 node out of 3. The nodes are not loaded at all right now, but qLen for the ingress network on all nodes, including managers, is not 0 and continues to rise. For example:

The engine version is the same on all nodes in the cluster:

Also there is a strange task state:
@fcrisciani Hi, I ran into a circle of issues and cleaned everything (totally). Now I have a freshly spun-up 17.10 and I set VIP mode for a service. It has worked stably for 15 minutes... will update later on.

Apparently most of my issues (lately) were connected to the fact that I had placed {bip: ....} in the Docker daemon.json: it created an interface with that IP and mask. But then, when I was cleaning everything, I noticed that whatever I did (including interface removal, daemon reload, and Docker removal and reinstall), the interface (ifconfig) still came up with the old IP, even though a few days back I had removed the {bip....} configuration from Docker. I think this may be another bug: Docker doesn't return to the default (172.17...) CIDR when a custom bridge IP was configured in the daemon and later removed.

Update: 1 hour later I am observing some instability. The logs show 'connection timeout' errors against the ldap service, then it suddenly starts serving the web again, then after a while it hangs again with timeout messages. All services are running in VIP mode. A few of them publish a port to ingress. All networks but the custom overlay one
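If the leftover interface is the default docker0 bridge (an assumption; check ip addr first), one way to clear it is to delete the bridge while the daemon is stopped, so Docker recreates it from the current configuration:

$ sudo systemctl stop docker
$ sudo ip link set docker0 down   # take the stale bridge down
$ sudo ip link delete docker0     # remove it; Docker recreates it on start
$ sudo systemctl start docker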
@zigmund I have this every time. It usually appears with services deployed in
@fcrisciani Okay, that's what I have: a combination of

Not sure where to proceed further from this point. Error log:
An observation: when the issue happens, I can see that one of the interfaces is gone (eth0, eth2...). Additional logs (debug: true):
I have the same problem with 17.10. From service1 on node swarm1 I am unable to resolve any other service in the same external Docker network.

On every node I have 1 service:

All of the services are connected via one external network, but not all services are registered inside:

On the problematic node swarm1 the Docker engine logs are full of messages like these:

As additional information, nodes 0 to 3 run on OpenStack and are behind NAT; they advertise the external IP provided by OpenStack. The other nodes, 4 to 7, are on VMware and advertise their own IPs. The connection between all nodes looks fine and the following ports are accessible:
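For reference, these are the ports swarm mode needs between nodes, probed here with nc; a sketch in which 10.0.0.4 is a placeholder peer address:

$ nc -zv 10.0.0.4 2377    # cluster management traffic (managers), TCP
$ nc -zv 10.0.0.4 7946    # container network discovery/gossip, TCP
$ nc -zvu 10.0.0.4 7946   # gossip, UDP
$ nc -zvu 10.0.0.4 4789   # VXLAN overlay data path, UDP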
@zigmund the log

^ @svscorp it looks like the same for you
@velislav from the logs it is clear that swarm1 is not able to communicate properly. The healthscore log should not appear.

Also the stats are showing that the node is the only one participating in those networks:

The netPeers field shows the number of nodes that are part of that network.
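One way to read those numbers from a manager; a sketch against the ingress network being discussed:

$ docker network inspect -v ingress                           # verbose view with per-network service and peer tables
$ docker network inspect --format '{{json .Peers}}' ingress   # quick check that every expected node is a peer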
@fcrisciani I am tracking that thread. There is a 17.11 RC3 available; I'm not sure whether it is time to update to it (someone reported being able to reproduce an issue in another thread).

With moby/libnetwork#1944, moby/libnetwork#2013, and moby/libnetwork#2047, these problems should be resolved. If you're still having issues on a current version of Docker, please open a new ticket, which makes it easier to triage (and could have a different cause).
Description

I have about 10 nodes, 4 master nodes, 7 overlay networks, and about 70 services, and swarm is totally unstable now: it randomly stops resolving the names of my services. This problem can persist for 10-20 minutes! All this time you'll see a 502 Bad Gateway error. And then you'll see messages like this:

... and then it will recover itself, for 1-2 hours. And then the cluster will become broken again.
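A crude probe for catching the flapping window; a sketch to run inside any container on the overlay network, with jenkins:8080 standing in for whichever service goes dark:

root@balancer:/# while true; do
>   (echo > /dev/tcp/jenkins/8080) 2>/dev/null || echo "$(date -u) jenkins:8080 unreachable"
>   sleep 5
> done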
Steps to reproduce the issue:

Describe the results you received:

At this moment Swarm mode is unusable for production environments.

Describe the results you expected:

I expected that it would be!

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

Output of docker info: