Docker Swarm on Windows 2019 fails to route to VIP on one network #39339
Comments
@drnybble, please share the example service commands you use, and also verify whether the same issue happens on Linux, so it is easier to track down whether this is a Windows issue or a Docker issue.
This is a stack spun up with docker stack deploy, comprising 37 services across a three-node swarm cluster. It is nothing special, but there are a lot of services and the machines are pegged at 100% for quite a while while bringing it up, so I suspect a timeout somewhere. I'm looking for suggestions on debug output or diagnostics I can enable when I try to replicate it (I'm not sure how repeatable it is).
Here is info on how you can get logs from Windows containers: https://docs.microsoft.com/en-us/virtualization/windowscontainers/troubleshooting. Alternatively, you can stop the Docker service and start it from the command line in debug mode.
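The debug-mode approach mentioned above looks roughly like this from an elevated PowerShell prompt (a sketch; it assumes Docker is installed as the Windows service named `docker`):

```shell
# Stop the Docker Windows service, then run the daemon in the foreground
# with debug logging (-D) so log messages print to the console:
Stop-Service docker
dockerd.exe -D
```

Running the daemon in the foreground this way makes it easy to watch for errors while reproducing the issue; stop it with Ctrl+C and restart the service afterwards.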
@mkostersitz, @daschott: FYI. |
I'll try to find a generic testcase that can demonstrate it; if there is specific logging that might explain failure of VIP routing let me know. |
@drnybble there is a script, https://raw.githubusercontent.com/microsoft/SDN/master/Kubernetes/windows/debug/collectlogs.ps1, which you can use to view a lot of container-networking-related information. Can you try it out and share the logs when the issue occurs?
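For reference, a minimal way to fetch and run that script on an affected node (run from an elevated PowerShell prompt; the script itself chooses where it writes its output):

```shell
# Download the SDN debug script and execute it to dump HNS/networking state
Invoke-WebRequest -UseBasicParsing `
  -Uri "https://raw.githubusercontent.com/microsoft/SDN/master/Kubernetes/windows/debug/collectlogs.ps1" `
  -OutFile collectlogs.ps1
.\collectlogs.ps1
```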
Will do, thanks.
Any luck getting a repro or logs? |
Sorry I have parked my Swarm development project for the past few months. I'll be picking it up again soon, if I see it I'll try to get logs. |
@daschott I have a three-node cluster. Communication using the VIP from Node B to Node C is not working. The pairs I tested:

Node A -> Node C
Node B -> Node C
Node B -> Node A

I have several containers running on each node. It seems that no service on Node B can reach any service on Node C via its VIP, but it can reach all of them via container IP. Specifically, from the container with IP 10.0.0.71 on Node B, I can curl the container 10.0.0.73 on Node C. Attached are the log files generated by collectlogs.ps1.

Similar to what I tried with the initial issue, I scaled a service to 0 replicas and back to 1 replica; the task started on Node C again, and now I can successfully curl via the VIP. These logs were captured after the scale down/up.

I tried another variant: I scaled a service to two replicas and then back to one. The one remaining task is still living on Node C, and I can now successfully curl via the VIP.

Finally I tried

It seems that whatever information is used to route to the VIP failed or timed out, perhaps on Node B, and once you "kick" it by refreshing the service it works. Any further help on logs to collect is welcome.

Addendum: I noticed that these VMs had been reduced to 2 cores, so that may be a contributing factor (something timing out). I put them back to 8 cores to see how it goes.
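The check and the "kick" described above can be sketched as follows (the service name is hypothetical; the IPs are the ones from this comment):

```shell
# From the container on Node B (10.0.0.71), the task's container IP on
# Node C responds, but the service VIP does not:
curl http://10.0.0.73/          # container IP on Node C: reachable
# curl http://<service VIP>/    # fails until the service is refreshed

# "Kick" the service by scaling it down and back up; after the task is
# rescheduled, the VIP starts routing again:
docker service scale mystack_myservice=0
docker service scale mystack_myservice=1
```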
I'm getting a similar problem, but it seems to be random. The thing I still don't really understand is whether virtual-IP-based routing is supposed to be supported for Windows containers, or whether we're still only supposed to use DNSRR (which works in my limited testing).
Thank you for sending the logs along. I will create a bug internally to track the issue and report back.
@masaeedu Routing mesh for Windows Docker hosts is supported on Windows Server 2019 (and above), but not on Windows Server 2016. On older releases, such as Windows Server 2016, DNS round-robin is the only supported load-balancing strategy.
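For anyone who needs the DNS round-robin fallback, a service can be created without a VIP via the `--endpoint-mode dnsrr` flag (the service, network, and image names here are hypothetical):

```shell
# Use DNS round-robin instead of a virtual IP for service discovery
# (the only option on Windows Server 2016):
docker service create --name web --endpoint-mode dnsrr --network my_overlay myorg/myimage
```

Note that DNSRR services cannot publish ports through the routing mesh; clients resolve the service name to the individual task IPs instead.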
Description
I have a service that lives on two overlay networks.
On network A its one task has IP 10.0.4.26. Its VIP is 10.0.4.6
On network B its one task has IP 10.0.5.18. Its VIP is 10.0.5.4
Now, my cluster is in a state where I can issue an HTTP request successfully to:
Network A: 10.0.4.26
Network B: 10.0.5.18; 10.0.5.4
But if I try the VIP 10.0.4.6 on network A, it fails with "Unable to connect to the remote server".
If I do: docker network inspect A
I see the endpoint correctly enumerated.
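One low-cost check when this happens is to compare the VIPs the service actually holds against what the network reports (the service name is hypothetical; `A` is the network from the description):

```shell
# Show the VIP assigned to the service on each attached network:
docker service inspect --format '{{json .Endpoint.VirtualIPs}}' myservice
# Verbose network inspect includes swarm service and endpoint details:
docker network inspect -v A
```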
I scaled the service to 0 replicas and then back to 1; now the VIP works.
Primarily I am looking for what further diagnostic information or logs I can collect if it happens again.
Additional information you deem important (e.g. issue happens only occasionally):
Output of docker version:

Output of docker info:

Additional environment details (AWS, VirtualBox, physical, etc.):
VM running under VMware