
Docker Swarm on Windows 2019 fails to route to VIP on one network #39339

Open
drnybble opened this issue Jun 7, 2019 · 13 comments
@drnybble

drnybble commented Jun 7, 2019

Description

I have a service that lives on two overlay networks.

On network A its one task has IP 10.0.4.26. Its VIP is 10.0.4.6

On network B its one task has IP 10.0.5.18. Its VIP is 10.0.5.4

Now, my cluster is in a state where I can issue an HTTP request successfully to:

Network A: 10.0.4.26
Network B: 10.0.5.18; 10.0.5.4

But if I try the VIP 10.0.4.6 on network A it fails: Unable to connect to the remote server

If I do: docker network inspect A
I see the endpoint correctly enumerated.

I scaled the service to 0 replicas and then back to 1; now the VIP works.

Primarily I am looking for what further diagnostic information or logs I can collect if it happens again.
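One quick check when it recurs is to confirm the network's service record still lists the VIP and the task endpoints. A minimal sketch (Python, assuming output from `docker network inspect -v A` piped in on stdin; the field names `Services`, `VIP`, `Tasks`, and `EndpointIP` are assumptions based on verbose inspect output and may differ across Docker versions):

```python
import json
import sys

def summarize_services(inspect_json: str):
    """Map each service on the network to its VIP and task endpoint IPs.

    Field names ("Services", "VIP", "Tasks", "EndpointIP") are assumptions
    based on `docker network inspect -v` output; adjust to your version.
    """
    rows = []
    for net in json.loads(inspect_json):
        for name, svc in (net.get("Services") or {}).items():
            task_ips = [t.get("EndpointIP") for t in svc.get("Tasks") or []]
            rows.append((name, svc.get("VIP"), task_ips))
    return rows

if __name__ == "__main__" and not sys.stdin.isatty():
    raw = sys.stdin.read().strip()
    for name, vip, task_ips in (summarize_services(raw) if raw else []):
        print(f"{name}: VIP={vip} tasks={task_ips}")
```

If the VIP or a task endpoint is missing from that record on the affected node, that would narrow the problem to service discovery rather than the load-balancer plumbing.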

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

Client: Docker Engine - Enterprise
 Version:           18.09.6
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        1578dcadd2
 Built:             05/04/2019 02:34:11
 OS/Arch:           windows/amd64
 Experimental:      false

Server: Docker Engine - Enterprise
 Engine:
  Version:          18.09.6
  API version:      1.39 (minimum version 1.24)
  Go version:       go1.10.8
  Git commit:       1578dcadd2
  Built:            05/04/2019 02:32:24
  OS/Arch:          windows/amd64
  Experimental:     false

Output of docker info:

Containers: 18
 Running: 12
 Paused: 0
 Stopped: 6
Images: 26
Server Version: 18.09.6
Storage Driver: windowsfilter
 Windows:
Logging Driver: json-file
Plugins:
 Volume: local
 Network: ics l2bridge l2tunnel nat null overlay transparent
 Log: awslogs etwlogs fluentd gelf json-file local logentries splunk syslog
Swarm: active
 NodeID: q0cvyv7s4ajs59p5krp290j2d
 Is Manager: true
 ClusterID: 2rju9hq3iy4wn8akx59mhb4al
 Managers: 3
 Nodes: 3
 Default Address Pool: 10.0.0.0/8
 SubnetSize: 24
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 10
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 9.24.206.209
 Manager Addresses:
  9.24.206.193:2377
  9.24.206.209:2377
  9.24.206.227:2377
Default Isolation: process
Kernel Version: 10.0 17763 (17763.1.amd64fre.rs5_release.180914-1434)
Operating System: Windows Server 2019 Standard Version 1809 (OS Build 17763.503)
OSType: windows
Architecture: x86_64
CPUs: 4
Total Memory: 16GiB
Name: mcdowels1vm
ID: ZRD4:WS73:2MUZ:MKOI:OW6L:BQR2:IJDF:3M5G:GT5O:CFEY:VTBR:CIXM
Docker Root Dir: C:\ProgramData\docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):
VM running under VMWare

@olljanat
Contributor

olljanat commented Jun 9, 2019

@drnybble please share the example service commands you use, and also verify whether the same issue happens on Linux, so it is easier to track down whether this is a Windows issue or a Docker issue.

@drnybble
Author

This is a stack spun up with docker stack deploy, comprising 37 services across a three-node swarm cluster. It is nothing special, but there are a lot of services and the machines are pegged at 100% for quite a while bringing it up, so I suspect a timeout somewhere. I'm looking for suggestions on debug output or diagnostics I can enable when I try to replicate it (I'm not sure how repeatable it is).

@olljanat
Contributor

Here is information on how to get logs from Windows containers: https://docs.microsoft.com/en-us/virtualization/windowscontainers/troubleshooting. Alternatively, you can stop the Docker service and start it from the command line in debug mode with: dockerd.exe -D

@pradipd
Contributor

pradipd commented Jun 10, 2019

@mkostersitz, @daschott: FYI.

@drnybble
Author

I'll try to find a generic test case that can demonstrate it; if there is specific logging that might explain the VIP routing failure, let me know.

@daschott

@drnybble there is a script, https://raw.githubusercontent.com/microsoft/SDN/master/Kubernetes/windows/debug/collectlogs.ps1, which you can use to view a lot of container-networking-related information. Can you try it out and share the logs when the issue occurs?

@drnybble
Author

Will do, thanks.

@daschott

daschott commented Nov 5, 2019

Any luck getting a repro or logs?

@drnybble
Author

drnybble commented Nov 5, 2019

Sorry, I have parked my Swarm development project for the past few months. I'll be picking it up again soon; if I see it, I'll try to get logs.

@drnybble
Author

drnybble commented Nov 7, 2019

@daschott
I am spinning up my application again with the latest Docker 19.03.4 and am still seeing VIP problems.

I have a three node cluster. Communication using VIP from Node B to Node C is not working.

Node A -> Node C

  • able to curl service using VIP

Node B -> Node C

  • NOT able to curl service using VIP (Timed out)
  • able to curl service using its container IP

Node B -> Node A

  • able to curl service using VIP

I have several containers running on each node; it seems that no service on Node B can reach any service on Node C via its VIP, though it can reach all of them via container IP.

Specifically, from the container with IP 10.0.0.71 on Node B I can curl container 10.0.0.73 on Node C.
But I cannot curl using service VIP 10.0.0.32.
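The same checks can be scripted so the reachability matrix is easy to re-run from each node. A minimal sketch (Python; the addresses are the ones from this report, and port 80 is an assumption -- substitute whatever port the service actually listens on):

```python
import socket

def reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Addresses taken from this report; port 80 is an assumption.
TARGETS = [
    ("task IP on Node C", "10.0.0.73", 80),
    ("service VIP",       "10.0.0.32", 80),
]

if __name__ == "__main__":
    for label, host, port in TARGETS:
        state = "reachable" if reachable(host, port) else "UNREACHABLE"
        print(f"{label:20} {host}:{port} -> {state}")
```

Run from inside a container on each node, a healthy cluster should show both targets reachable; the failure mode described here would show the task IP reachable but the VIP not.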

Attached are the log files generated by collectlogs.ps1

fxynyvrf.wgc.zip

Now, similar to what I tried in the initial report, I scaled a service to 0 replicas and back to 1 replica; the task started on Node C again, and now I can successfully curl via the VIP.

These logs were captured after the scale down/up:
ezy5fh0l.zol.zip

I tried another variant: I scaled a service to two replicas and then back to one. The one task is still living on Node C and I can now successfully curl via the VIP.

Finally I tried docker service update [service] --force. It now also works.

It seems that whatever information is used to route to the VIP failed or timed out on Node B, and once you "kick" it by refreshing the service, it works. Any further suggestions on logs to collect are welcome.

Addendum: I noticed that these VMs had been reduced to 2 cores, so that may be a contributing factor -- something timing out! I put them back to 8 to see how it goes.

@masaeedu
Contributor

I'm getting a similar problem, but it seems to be random. What I still don't really understand is whether virtual-IP-based routing is supposed to be supported for Windows containers, or whether we're still only supposed to use DNSRR (which works in my limited testing).

@mkostersitz

Thank you for sending the logs along. I will create an internal bug to track the issue and report back.

@daschott

@masaeedu
Please ensure you have KB4541331 (or later) installed and that you are using the latest Docker.

Routing mesh for Windows docker hosts is supported on Windows Server 2019 (and above), but not on Windows Server 2016.

That being said, on older releases such as Windows Server 2016, DNS round-robin is the only load-balancing strategy supported.
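For reference, the endpoint mode can be pinned per service in a compose file; a minimal sketch (the service name and image are hypothetical):

```yaml
version: "3.7"
services:
  web:
    image: myimage          # hypothetical image
    deploy:
      replicas: 2
      endpoint_mode: dnsrr  # DNS round-robin instead of a VIP
```

The same can be done at service creation time with `docker service create --endpoint-mode dnsrr`.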
