
Docker Swarm on Windows 2019 fails to route to VIP on one network #39339

Open
drnybble opened this issue Jun 7, 2019 · 13 comments
@drnybble

drnybble commented Jun 7, 2019

Description

I have a service that lives on two overlay networks.

On network A its one task has IP 10.0.4.26. Its VIP is 10.0.4.6

On network B its one task has IP 10.0.5.18. Its VIP is 10.0.5.4

Now, my cluster is in a state where I can issue an HTTP request successfully to:

Network A: 10.0.4.26
Network B: 10.0.5.18; 10.0.5.4

But if I try the VIP 10.0.4.6 on network A it fails: Unable to connect to the remote server

If I do: docker network inspect A
I see the endpoint correctly enumerated.

I scaled the service to 0 replicas and then back to 1; now the VIP works.

Primarily I am looking for what further diagnostic information or logs I can collect if it happens again.
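One quick check when it recurs is to confirm the network's service record still lists the VIP and the task endpoints. A minimal sketch (Python, assuming output from `docker network inspect -v A` piped in on stdin; the field names `Services`, `VIP`, `Tasks`, and `EndpointIP` are assumptions based on verbose inspect output and may differ across Docker versions):

```python
import json
import sys

def summarize_services(inspect_json: str):
    """Map each service on the network to its VIP and task endpoint IPs.

    Field names ("Services", "VIP", "Tasks", "EndpointIP") are assumptions
    based on `docker network inspect -v` output; adjust to your version.
    """
    rows = []
    for net in json.loads(inspect_json):
        for name, svc in (net.get("Services") or {}).items():
            task_ips = [t.get("EndpointIP") for t in svc.get("Tasks") or []]
            rows.append((name, svc.get("VIP"), task_ips))
    return rows

if __name__ == "__main__" and not sys.stdin.isatty():
    raw = sys.stdin.read().strip()
    for name, vip, task_ips in (summarize_services(raw) if raw else []):
        print(f"{name}: VIP={vip} tasks={task_ips}")
```

If the VIP or a task endpoint is missing from that record on the affected node, that would narrow the problem to service discovery rather than the load-balancer plumbing.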

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

Client: Docker Engine - Enterprise
 Version:           18.09.6
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        1578dcadd2
 Built:             05/04/2019 02:34:11
 OS/Arch:           windows/amd64
 Experimental:      false

Server: Docker Engine - Enterprise
 Engine:
  Version:          18.09.6
  API version:      1.39 (minimum version 1.24)
  Go version:       go1.10.8
  Git commit:       1578dcadd2
  Built:            05/04/2019 02:32:24
  OS/Arch:          windows/amd64
  Experimental:     false

Output of docker info:

Containers: 18
 Running: 12
 Paused: 0
 Stopped: 6
Images: 26
Server Version: 18.09.6
Storage Driver: windowsfilter
 Windows:
Logging Driver: json-file
Plugins:
 Volume: local
 Network: ics l2bridge l2tunnel nat null overlay transparent
 Log: awslogs etwlogs fluentd gelf json-file local logentries splunk syslog
Swarm: active
 NodeID: q0cvyv7s4ajs59p5krp290j2d
 Is Manager: true
 ClusterID: 2rju9hq3iy4wn8akx59mhb4al
 Managers: 3
 Nodes: 3
 Default Address Pool: 10.0.0.0/8
 SubnetSize: 24
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 10
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 9.24.206.209
 Manager Addresses:
  9.24.206.193:2377
  9.24.206.209:2377
  9.24.206.227:2377
Default Isolation: process
Kernel Version: 10.0 17763 (17763.1.amd64fre.rs5_release.180914-1434)
Operating System: Windows Server 2019 Standard Version 1809 (OS Build 17763.503)
OSType: windows
Architecture: x86_64
CPUs: 4
Total Memory: 16GiB
Name: mcdowels1vm
ID: ZRD4:WS73:2MUZ:MKOI:OW6L:BQR2:IJDF:3M5G:GT5O:CFEY:VTBR:CIXM
Docker Root Dir: C:\ProgramData\docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):
VM running under VMWare

@olljanat
Contributor

olljanat commented Jun 9, 2019

@drnybble please share the example service commands you use, and also verify whether the same issue happens on Linux, so it is easier to track down whether this is a Windows issue or a Docker issue.

@drnybble
Author

This is a stack spun up with docker stack deploy, comprising 37 services across a three-node swarm cluster. It is nothing special, but there are a lot of services and the machines are pegged at 100% for quite a while bringing it up, so I suspect a timeout somewhere. I'm looking for suggestions on debug output or diagnostics I can enable when I try to replicate it (I'm not sure how repeatable it is).

@olljanat
Contributor

Here is information on how to get logs from Windows containers: https://docs.microsoft.com/en-us/virtualization/windowscontainers/troubleshooting. Alternatively, you can stop the Docker service and start it from the command line in debug mode with: dockerd.exe -D

@pradipd
Contributor

pradipd commented Jun 10, 2019

@mkostersitz, @daschott: FYI.

@drnybble
Author

I'll try to find a generic test case that can demonstrate it; if there is specific logging that might explain the VIP routing failure, let me know.

@daschott

@drnybble there is a script, https://raw.githubusercontent.com/microsoft/SDN/master/Kubernetes/windows/debug/collectlogs.ps1, which you can use to view a lot of container-networking-related information. Can you try it out and share the logs when the issue occurs?

@drnybble
Author

Will do, thanks.

@daschott

daschott commented Nov 5, 2019

Any luck getting a repro or logs?

@drnybble
Author

drnybble commented Nov 5, 2019

Sorry, I have parked my Swarm development project for the past few months. I'll be picking it up again soon; if I see it, I'll try to get logs.

@drnybble
Author

drnybble commented Nov 7, 2019

@daschott
I am spinning up my application again with the latest Docker 19.03.4 and am still seeing VIP problems.

I have a three node cluster. Communication using VIP from Node B to Node C is not working.

Node A -> Node C

  • able to curl service using VIP

Node B -> Node C

  • NOT able to curl service using VIP (Timed out)
  • able to curl service using its container IP

Node B -> Node A

  • able to curl service using VIP

I have several containers running on each node; it seems that no service on Node B can reach any service on Node C via its VIP, though it can reach all of them via container IP.

Specifically, from the container with IP 10.0.0.71 on Node B I can curl container 10.0.0.73 on Node C.
But I cannot curl using service VIP 10.0.0.32.
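The same checks can be scripted so the reachability matrix is easy to re-run from each node. A minimal sketch (Python; the addresses are the ones from this report, and port 80 is an assumption -- substitute whatever port the service actually listens on):

```python
import socket

def reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Addresses taken from this report; port 80 is an assumption.
TARGETS = [
    ("task IP on Node C", "10.0.0.73", 80),
    ("service VIP",       "10.0.0.32", 80),
]

if __name__ == "__main__":
    for label, host, port in TARGETS:
        state = "reachable" if reachable(host, port) else "UNREACHABLE"
        print(f"{label:20} {host}:{port} -> {state}")
```

Run from inside a container on each node, a healthy cluster should show both targets reachable; the failure mode described here would show the task IP reachable but the VIP not.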

Attached are the log files generated by collectlogs.ps1

fxynyvrf.wgc.zip

Now, similar to what I tried in the initial report, I scaled a service to 0 replicas and back to 1 replica; the task started on Node C again, and now I can successfully curl via the VIP.

These logs were captured after the scale down/up:
ezy5fh0l.zol.zip

I tried another variant: I scaled a service to two replicas and then back to one. The one task is still living on Node C and I can now successfully curl via the VIP.

Finally I tried docker service update [service] --force. It now also works.

It seems that whatever information is used to route to the VIP failed or timed out on Node B, and once you "kick" it by refreshing the service, it works. Any further suggestions on logs to collect are welcome.

Addendum: I noticed that these VMs had been reduced to 2 cores, so that may be a contributing factor -- something timing out! I put them back to 8 to see how it goes.

@masaeedu
Contributor

I'm getting a similar problem, but it seems to be random. What I still don't really understand is whether virtual-IP-based routing is supposed to be supported for Windows containers, or whether we're still only supposed to use DNSRR (which works in my limited testing).

@mkostersitz

Thank you for sending the logs along. I will create an internal bug to track the issue and report back.

@daschott

@masaeedu
Please ensure you have KB4541331 (or later) installed and that you are using the latest Docker.

Routing mesh for Windows docker hosts is supported on Windows Server 2019 (and above), but not on Windows Server 2016.

That being said, on older releases such as Windows Server 2016, DNS round-robin is the only load-balancing strategy supported.
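For reference, the endpoint mode can be pinned per service in a compose file; a minimal sketch (the service name and image are hypothetical):

```yaml
version: "3.7"
services:
  web:
    image: myimage          # hypothetical image
    deploy:
      replicas: 2
      endpoint_mode: dnsrr  # DNS round-robin instead of a VIP
```

The same can be done at service creation time with `docker service create --endpoint-mode dnsrr`.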
