Join GitHub today
GitHub is home to over 31 million developers working together to host and review code, manage projects, and build software together.Sign up
Windows Server 2019 host in hybrid Swarm cluster breaks routing mesh traffic to Linux services #38484
Swarm routing mesh stops working (timeouts / not responding) if traffic to a Linux service goes through a Windows Server 2019 machine in a hybrid cluster, works as expected with Windows Server 2016 1803.
Steps to reproduce the issue:
Log files from both hosts (with debugging enabled) can be found here:
Describe the results you received:
You will see that the continuous requests starts to hang and timeout, until a while after the single request also times out. The continuous requests will recover after 20-30 seconds and once again return HTTP 200.
Describe the results you expected:
Expecting that requests through any host in the cluster finishes without any hang/timeouts as it would if the
Additional information you deem important (e.g. issue happens only occasionally):
This used to work using Windows Server 2016 1803. I have multiple hybrid Docker Swarm clusters running in production with no issues at all.
Linux: Output of
Linux: Output of
Windows: Output of
Windows: Output of
Additional environment details (AWS, VirtualBox, physical, etc.):
Reproduced on VM's running on Azure and also on my workstation on VM's running in Hyper-V.
This was referenced
Jan 24, 2019
Only those who need new features need to update next semi-annual version (1903 I guess). Look: https://docs.microsoft.com/windows-server/get-started-19/servicing-channels-19
referenced this issue
Feb 15, 2019
I've ran through and reproduced the problem on a small swarm cluster consisting of the following:
Both are running Docker Engine 18.09.3.
Test image deployed, used for multi-plat testing:
Method of deploy:
After verifying the issue with routing between the nodes for a service and scaling up/down and landing on linux only or windows only, I've applied KB4482887 as mentioned.
After reboot of the system the issue does not appear to be resolved. The state after reboot had the container executing on the linux node. However, when scaling, I noticed this error popped briefly before swarm deployed the additional instance on the linux node.
I'm going to be rebuilding this cluster fresh and running through the exact steps mentioned above, rather than using an image that is multi-platform (Linux + Windows). Additionally collect more logging for evaluation, but first pass with my test this doesn't appear to resolve the issue for an existing cluster that has the windows node patched. Leaving and rejoining the cluster also did not resolve either the mesh routing, or the scaling now throwing an error when trying to deploy to windows.
I will break out the KB's windows scale issue I'm seeing to a new issue if I can reproduce it with additional logging.
Otherwise, I'll drop any additional information as I get it, but parallel to anything @pbering is doing.
Results vary, with one avenue of success.
In regards to upgrading without recreating services:
I've tested both with a cluster that is established before installing the KB, and also where the KB is installed prior to installing Docker on the windows node and joining it to the swarm.
Test 1: Swarm exists with linux + windows before kb is installed.
Here are some logs from the attempts:
All of the tests I have performed use Azure, with the latest published images for Ubuntu 18.04 and Windows Server (2019 Datacenter Edition).
Removal of service and recreation. The side that works.
I deployed my original test image, noticed the error I was tracking is gone (the adapter error). So I continued to remove the existing services, and redeployed the test0 service:
After this, the service is now responding with HTTP 200's when accessing through linux or windows.
I plan to do some more testing, but it appears if the service exists it won't start working until at least removing the service and recreating it. I will however see in more testing if scaling impacts it, or if there is a resurfacing of the adapter error.
Here is the linux and windows logs after removing and recreating the service for it to start working.
So this 'looks' like it does fix the issue, but I'm trying to find some edges of it and reproduce upgrade scenarios that do not get resolved.