
Windows Server 2019 host in hybrid Swarm cluster breaks routing mesh traffic to Linux services #38484

Closed
pbering opened this Issue Jan 3, 2019 · 16 comments

pbering commented Jan 3, 2019

Description

The Swarm routing mesh stops working (timeouts / not responding) if traffic to a Linux service goes through a Windows Server 2019 machine in a hybrid cluster; it works as expected with Windows Server 2016 version 1803.

Steps to reproduce the issue:

  1. Create a Linux and a Windows Server 2019 host on the same subnet:
    1. Ubuntu 18.04 host: update, disable the firewall, install Docker.
    2. Windows Server 2019 host: update, disable the firewall, install Docker.
  2. On the Linux host, initialize a new swarm: sudo docker swarm init
  3. On the Windows host, join the swarm.
  4. On the Linux host, create a service: sudo docker service create --name test0 -p 9000:80 nginx
  5. From another host, continuously send requests to the service via the Linux host, for example: while($true) { curl -I http://linux:9000; Get-Date; sleep -Seconds 1 }. Verify that you get HTTP 200.
  6. From another host, then send a single request to the service via the Windows host: curl -Isv http://windows:9000
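
For convenience, the steps above condense into the following rough sketch. The host names linux and windows are placeholders for whatever the two machines resolve to, the worker join token is whatever docker swarm join-token prints on the manager, and curl in the PowerShell probes is assumed to resolve to curl.exe rather than the Invoke-WebRequest alias:

# On the Linux host: initialize the swarm and print the worker join command
sudo docker swarm init
sudo docker swarm join-token worker

# On the Windows host: join using the token printed above (manager address from the docker info output below)
docker swarm join --token <worker-token> 172.17.1.10:2377

# On the Linux host: create the test service
sudo docker service create --name test0 -p 9000:80 nginx

# From another host (PowerShell): continuous probe via the Linux host
while($true) { curl -I http://linux:9000; Get-Date; sleep -Seconds 1 }

# From another host (PowerShell): a single probe via the Windows host, which triggers the hang
curl -Isv http://windows:9000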

Log files from both hosts (with debugging enabled) can be found here:

Describe the results you received:

You will see that the continuous requests start to hang and time out, and a while later the single request also times out. The continuous requests recover after 20-30 seconds and once again return HTTP 200.

Describe the results you expected:

Requests through any host in the cluster should complete without hangs or timeouts, as they do when the Windows host runs Windows Server 2016 version 1803.

Additional information you deem important (e.g. issue happens only occasionally):

This used to work with Windows Server 2016 version 1803. I have multiple hybrid Docker Swarm clusters running in production with no issues at all.

Other observations:

  1. It does not matter whether the Linux host runs Ubuntu 16.04, Ubuntu 18.04, or Debian 9.
  2. It does not matter if the Linux host is a worker and the Windows Server 2019 host is a manager.
  3. Doing the same with either two Linux hosts or two Windows Server 2019 hosts works fine.
  4. The same issue occurs when downgrading the Docker engine on the Windows host to 18.03.1-ee-2.
  5. The Windows Server 2019 SKU does not matter; reproduced on Datacenter (full and Core) and Standard (full) editions.

Linux: Output of docker version:

Client:
 Version:           18.09.0
 API version:       1.39
 Go version:        go1.10.4
 Git commit:        4d60db4
 Built:             Wed Nov  7 00:49:01 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.0
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.4
  Git commit:       4d60db4
  Built:            Wed Nov  7 00:16:44 2018
  OS/Arch:          linux/amd64
  Experimental:     false

Linux: Output of docker info:

Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 1
Server Version: 18.09.0
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active
 NodeID: 6m8os6plt3f23t24ngxegaqgx
 Is Manager: true
 ClusterID: jkw54ivjwqbmm6tmkk4h0p04u
 Managers: 1
 Nodes: 2
 Default Address Pool: 10.0.0.0/8
 SubnetSize: 24
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 10
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 172.17.1.10
 Manager Addresses:
  172.17.1.10:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: c4446665cb9c30056f4998ed953e6d4ff22c7c39
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.15.0-1036-azure
Operating System: Ubuntu 18.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.853GiB
Name: dc2node-nvm0
ID: BYOD:7LOK:KOXR:W4QQ:UPDG:YCNZ:H5EN:RS74:AKRI:F5H4:2NTX:FKWC
Docker Root Dir: /datadisk
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 49
 Goroutines: 174
 System Time: 2019-01-03T12:24:38.998562887Z
 EventsListeners: 1
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 registry.valtech.dk
 127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine

WARNING: No swap limit support

Windows: Output of docker version:

Client:
 Version:           18.09.0
 API version:       1.39
 Go version:        go1.10.3
 Git commit:        33a45cd0a2
 Built:             unknown-buildtime
 OS/Arch:           windows/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.09.0
  API version:      1.39 (minimum version 1.24)
  Go version:       go1.10.3
  Git commit:       33a45cd0a2
  Built:            11/07/2018 00:24:12
  OS/Arch:          windows/amd64
  Experimental:     false

Windows: Output of docker info:

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: 18.09.0
Storage Driver: windowsfilter
 Windows:
Logging Driver: json-file
Plugins:
 Volume: local
 Network: ics l2bridge l2tunnel nat null overlay transparent
 Log: awslogs etwlogs fluentd gelf json-file local logentries splunk syslog
Swarm: active
 NodeID: ych43rb08jsey2s7n4p738t12
 Is Manager: false
 Node Address: 172.17.1.20
 Manager Addresses:
  172.17.1.10:2377
Default Isolation: process
Kernel Version: 10.0 17763 (17763.1.amd64fre.rs5_release.180914-1434)
Operating System: Windows Server 2019 Datacenter Version 1809 (OS Build 17763.195)
OSType: windows
Architecture: x86_64
CPUs: 4
Total Memory: 8GiB
Name: dc2node-wvm0
ID: QKH5:ZHTV:NP2P:TTNM:LV22:35MZ:NXPZ:CPQL:FN2L:ZJ2V:S22F:AMRS
Docker Root Dir: F:\docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: -1
 Goroutines: 80
 System Time: 2019-01-03T12:26:18.2309755Z
 EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Product License: this node is not a swarm manager - check license status on a manager node

Additional environment details (AWS, VirtualBox, physical, etc.):

Reproduced on VMs running in Azure and also on VMs running in Hyper-V on my workstation.

thaJeztah commented Jan 4, 2019

ping @jhowardmsft ptal

jhowardmsft commented Jan 4, 2019

You need the networking folks. @madhanrm

olljanat commented Jan 5, 2019

FYI: since there appear to be many issues preventing users from running Windows Server 2019 in production, I created an epic for them in #38498. Please share the Microsoft internal tracking IDs when there are any, so I can add them to the list and people can refer to them when contacting support.

dineshgovindasamy commented Jan 24, 2019

@pradipd Is this the same issue you fixed recently? Can you verify with the patch and comment on this issue?

pradipd commented Jan 25, 2019

Investigating

pradipd commented Jan 30, 2019

@pbering Thank you for the concise repro steps. We were able to repro and fix the issue. It should be in the next patch. Will also add more tests to catch this.

thaJeztah commented Jan 31, 2019

@pradipd Thanks! Is "next patch" a Windows release? Is there an issue number that people can refer to?

pbering commented Feb 4, 2019

@pradipd Thank you! Great news :) ... I too would like to know whether "next patch" means a KB or the next release, i.e. 1903 (or whatever number the next release will be).

olljanat commented Feb 6, 2019

@thaJeztah @pbering Windows Server 2019 is the just-released LTSC version, so bug fixes like this one must be released for it as part of the cumulative updates that ship on the second Tuesday of every month.

Only those who need new features have to move to the next semi-annual release (1903, I guess). See: https://docs.microsoft.com/windows-server/get-started-19/servicing-channels-19

pradipd commented Feb 6, 2019

Sorry for the late response; I was trying to get the KB number. It will be released as a patch, but I haven't found the KB number yet.

egzonzeneli commented Feb 20, 2019

@pradipd Has that patch been released? If yes, can you please post the link? If not, do you have any ETA for when it will be released?

This is the latest one, but I'm not seeing it on the improvements/fixes list. It seems we have to wait for the March 2019 cumulative update (CU 03-2019).

pradipd commented Feb 20, 2019

It has not been released yet. My understanding is next week.

daschott commented Mar 4, 2019

@pbering Can you please apply KB4482887 and confirm whether the issue is resolved?
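
For anyone else verifying this, a quick way to check whether the update is already installed on the Windows node before re-running the repro (the KB number is from the comment above; Get-HotFix is standard PowerShell):

# On the Windows Server 2019 node
Get-HotFix -Id KB4482887

# Or list everything that's installed and look for it
Get-HotFix | Sort-Object InstalledOn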

jmhardison commented Mar 5, 2019

@daschott
Figured I'd drop into this thread and provide additional testing info for the KB.

I've run through and reproduced the problem on a small Swarm cluster consisting of the following:

  • 1 Linux Swarm manager (Ubuntu 18.04)
  • 1 Windows worker node (Windows Server 2019)

Both are running Docker Engine 18.09.3.

ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS      ENGINE VERSION
s2ukludoxfxk11f8s7eo35hml *   linval1             Ready               Active              Leader              18.09.3
sewyfczbrf9ka9mgr6v4z3idu     winval1             Ready               Active                                  18.09.3

Linux node:

Client:
 Version:           18.09.3
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        774a1f4
 Built:             Thu Feb 28 06:53:11 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.3
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       774a1f4
  Built:            Thu Feb 28 05:59:55 2019
  OS/Arch:          linux/amd64
  Experimental:     false

Windows node:

Client:
 Version:           18.09.3
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        142dfcedca
 Built:             02/28/2019 06:33:17
 OS/Arch:           windows/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.09.3
  API version:      1.39 (minimum version 1.24)
  Go version:       go1.10.8
  Git commit:       142dfcedca
  Built:            02/28/2019 06:31:15
  OS/Arch:          windows/amd64
  Experimental:     false

Test image deployed, used for multi-platform testing:
stefanscherer/whoami

Method of deploy:
docker service create --name whoami --replicas 2 --publish published=80,target=8080 stefanscherer/whoami

Resulting Deployment:

ID                  NAME                IMAGE                         NODE                DESIRED STATE       CURRENT STATE           ERROR               PORTS
4q3u31u4cgs9        whoami.1            stefanscherer/whoami:latest   winval1             Running             Running 5 minutes ago
rjfqfjp2jboq        whoami.2            stefanscherer/whoami:latest   linval1             Running             Running 6 minutes ago
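
To exercise the routing mesh from a client, a probe like the one in the original repro can be pointed at both nodes (node names are from the table above, port 80 is the published port of the whoami service, and curl is again assumed to resolve to curl.exe; this is a sketch, not output from my run):

# PowerShell, from a machine outside the swarm
while($true) { curl -I http://linval1:80; curl -I http://winval1:80; Get-Date; sleep -Seconds 1 }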

After verifying the routing issue between the nodes for a service, and scaling up/down so tasks landed on Linux only or Windows only, I applied KB4482887 as mentioned.

Hotfix(s):                 3 Hotfix(s) Installed.
                           [01]: KB4483452
                           [02]: KB4470788
                           [03]: KB4482887

After rebooting the system, the issue does not appear to be resolved. The post-reboot state had the container executing on the Linux node. However, when scaling, I noticed this error pop up briefly before Swarm deployed the additional instance on the Linux node:

2/2: hnsCall failed in Win32: An adapter was not found. (0x803b0006)

So after applying that hotfix, it now appears that scaling a service to include deployment onto Windows is impacted.

ID                  NAME                IMAGE                         NODE                DESIRED STATE       CURRENT STATE             ERROR                              PORTS
wnonp54ulaj5        whoami.1            stefanscherer/whoami:latest   linval1             Running             Running 36 seconds ago                              
r9dzw9f6zdou         \_ whoami.1        stefanscherer/whoami:latest   winval1             Shutdown            Shutdown 6 minutes ago                              
xlnj4lvw4ma9        whoami.2            stefanscherer/whoami:latest   linval1             Running             Running 11 seconds ago                              
i1wyd6pcpcd5         \_ whoami.2        stefanscherer/whoami:latest   winval1             Shutdown            Rejected 21 seconds ago   "hnsCall failed in Win32: An a…"
k2hsdixlov92         \_ whoami.2        stefanscherer/whoami:latest   winval1             Shutdown            Rejected 26 seconds ago   "hnsCall failed in Win32: An a…"
hns7zt9vvvgm         \_ whoami.2        stefanscherer/whoami:latest   winval1             Shutdown            Rejected 31 seconds ago   "hnsCall failed in Win32: An a…"
xs463z5gi1kw         \_ whoami.2        stefanscherer/whoami:latest   winval1             Shutdown            Rejected 36 seconds ago   "hnsCall failed in Win32: An a…"

I'm going to rebuild this cluster from scratch and run through the exact steps mentioned above, rather than using a multi-platform (Linux + Windows) image, and collect more logging for evaluation. But on this first pass, the patch does not appear to resolve the issue for an existing cluster where the Windows node is patched. Leaving and rejoining the cluster also did not resolve either the mesh routing or the new scaling error when deploying to Windows.
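
A leave/rejoin of the worker amounts to roughly the following (node name from the docker node ls output above; the manager address and worker token are whatever docker swarm join-token worker prints on the manager; the exact flags here are a sketch, not a transcript of what was run):

# On the Windows node: leave the swarm
docker swarm leave

# On the Linux manager: remove the stale node entry and print the join command
docker node rm --force winval1
docker swarm join-token worker

# Back on the Windows node: rejoin with the printed token
docker swarm join --token <worker-token> <manager-ip>:2377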

I will break the Windows scaling issue I'm seeing after the KB out into a new issue if I can reproduce it with additional logging.

Otherwise, I'll drop in any additional information as I get it, in parallel to anything @pbering is doing.

jmhardison commented Mar 5, 2019

Quick second run, @daschott, using the testing criteria from @pbering above.

Results vary, with one avenue of success.

  • The service was deployed and left running, then the node was patched. Result: no change to routing.
  • With the service still deployed, the Windows node was replaced with a pre-patched node and added to the swarm. No change to routing.
  • With the service still deployed and the patched node added: the service was removed and re-created. Routing now works.

Regarding upgrading without recreating services:

I've tested both with a cluster established before installing the KB, and with the KB installed before Docker is installed on the Windows node and the node is joined to the swarm.

Test 1: the swarm exists with Linux + Windows nodes before the KB is installed.
Test 2: the swarm exists on Linux; a new Windows node is built with the KB installed, after which Docker is installed and the node joins the swarm. NGINX is still running on the Linux node from the prior test and responds to clients connecting directly to the Linux node.

Here are some logs from the attempts:
Test1:

  • Linux Node Pre KB gist
  • Linux Node Post KB gist
  • Windows Node Pre KB gist
  • Windows Node Post KB gist

Test2:

All of the tests I have performed use Azure, with the latest published images for Ubuntu 18.04 and Windows Server (2019 Datacenter Edition).

Removal and recreation of the service: the path that works.

I deployed my original test image and noticed the error I was tracking (the adapter error) is gone. So I went ahead, removed the existing services, and redeployed the test0 service:
docker service create --name test0 -p 9000:80 nginx

After this, the service responds with HTTP 200 when accessed through either Linux or Windows:
Client -> Windows -> Linux (container) || Client -> Linux (container)
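
The remove-and-recreate step that got routing working was essentially the following (the service being removed is assumed to be the whoami one from earlier; the create command and port are as in the original repro, and curl is again assumed to be curl.exe):

# On the Linux manager: remove the existing service, then recreate the test service
docker service rm whoami
docker service create --name test0 -p 9000:80 nginx

# Verify through both nodes
curl -I http://linval1:9000
curl -I http://winval1:9000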

I plan to do some more testing, but it appears that an existing service won't start working until it is at least removed and recreated. I will, however, check in further testing whether scaling impacts it, or whether the adapter error resurfaces.

Here are the Linux and Windows logs from after removing and recreating the service to get it working.

So this 'looks' like it fixes the issue, but I'm trying to find its edge cases and reproduce upgrade scenarios that do not get resolved.

pbering commented Mar 22, 2019

@daschott I've just tested on fresh VMs with the latest updates as of today and it is working now! Thank you and have a nice weekend :)

pbering closed this Mar 22, 2019
