Windows Swarm: manager restart breaks overlay networking #42385
Comments
Can you test with version 20.10.5? It looks like Mirantis merged the moby/libnetwork#2620 fix into their repo via PR Mirantis/libnetwork#4, from a branch named like "20.10-FIELD-3310", and their release notes refer to FIELD-3310: https://docs.mirantis.com/containers/v3.1/mcr-rn/20-10-5.html
I can confirm that I still see this issue with 20.10.21 (installed from https://download.docker.com/win/static/stable/x86_64/) on Windows Server 2019.
Description
In a single-manager Windows swarm, restarting the manager node or the Docker service on the manager node breaks the routing mesh. Even after the manager VM and/or the Docker service has finished restarting, swarm services won't be reachable at their published ports on the manager node. They will still be reachable at their published ports on the node their containers are running on.
Steps to reproduce the issue:
1. Initialize the swarm on the manager VM with availability drain, to force all tasks to execute on the worker: docker swarm init --availability drain --advertise-addr <ip>
2. Join the worker VM to the swarm: docker swarm join --token <worker token> <ip>:2377
3. Create a service that publishes port 80: docker service create --name iis --publish 80:80 mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019
4. Connect to port 80 on either node: iwr -UseBasicParsing http://<manager or worker ip>
   Get-HnsPolicyList on the manager VM at this point returns 2 policy lists, linked to the IIS container's HNS endpoint.
5. Restart the Docker service or the manager VM: Restart-Service docker / Restart-Computer
6. Once the docker service is back up, try to connect to port 80 on the manager VM: iwr -UseBasicParsing http://<manager ip>
   The connection will time out. Get-HnsPolicyList on the manager VM at this point returns nothing.
7. Force-update the service: docker service update --force iis
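The reachability checks in steps 4 and 6 could be scripted roughly as below. This is only a sketch: it tests whether a TCP connection to the published port succeeds, which is a coarser check than the iwr request used above. The function names, the default 5-second timeout, and the node addresses in the usage comment are illustrative assumptions, not part of the original report.

```python
import socket


def port_reachable(host, port=80, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def check_nodes(nodes, port=80, timeout=5.0, probe=port_reachable):
    """Probe every node and return {node name: reachable?}.

    `nodes` maps a label (e.g. "manager") to an address. `probe` can be
    swapped out, which keeps the function testable without a network.
    """
    return {name: probe(host, port, timeout) for name, host in nodes.items()}


# Example usage (addresses are placeholders, replace with the real node IPs):
#   check_nodes({"manager": "192.0.2.10", "worker": "192.0.2.11"})
```

If the behaviour described in this issue is present, one would expect the manager entry to be False after step 5 while the worker entry stays True.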
Describe the results you received:
When I try to connect to port 80 on the swarm manager after restarting the Docker service (step 6), I get:
Describe the results you expected:
I expected to get the IIS default page, as I do at step 4:
Additional information you deem important (e.g. issue happens only occasionally):
Given that all HNS policy lists vanish from the manager node at step 6, I suspect this has something to do with (re)creating policy lists when joining a swarm. It also seems that force-updating services can recreate the missing policies, but it's impractical as a workaround as it requires stopping and restarting all swarm services.
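For illustration, the force-update workaround could be applied to every service with something like the sketch below. It assumes the docker CLI is on PATH on a manager node, and the helper names and the dry_run switch are hypothetical. It restarts the tasks of every service, so it has exactly the drawback mentioned above.

```python
import subprocess


def force_update_commands(service_names):
    """Build one `docker service update --force` command per service name."""
    return [
        ["docker", "service", "update", "--force", name.strip()]
        for name in service_names
        if name.strip()
    ]


def list_service_names():
    """Return the names of all swarm services (must run on a manager node)."""
    out = subprocess.run(
        ["docker", "service", "ls", "--format", "{{.Name}}"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def force_update_all(dry_run=True):
    """Force-update every service; with dry_run=True only print the commands."""
    commands = force_update_commands(list_service_names())
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands
```

Calling force_update_all() with the default dry_run=True only prints the commands, which makes it easy to review them before running them for real.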
I also noticed a variation on this problem, which I've not been able to reliably replicate. Sometimes, stopping the Docker service (step 5) does not clear out all HNS policy lists: a few remain on the manager VM, while the HNS endpoints are cleared out. When this happens, the remaining policies still reference the now-deleted endpoints. In this case, force-updating the service does not solve the issue and the machine needs to be restarted before the leftover policies are cleared out. The swarm service(s) can then be force-updated to fix the problem. I've tried manually removing the leftover policies with Remove-HNSPolicyList, but I just get Error=The network was not found.
I've been unable to replicate this variation on Azure; it only seems to happen on our on-prem cluster when running our own apps (which are not publicly available). From a quick look at open PRs and issues, the failure to remove HNS policies may be fixed by moby/libnetwork#2620.
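To spot the leftover policy lists described above, one could compare the endpoint references of each policy list against the endpoints that still exist. The sketch below works on already-parsed data and makes assumptions that should be verified against real output: that each policy list has an "ID" field and a "References" list of strings shaped like "/endpoints/<endpoint id>". The function name is hypothetical.

```python
def dangling_policy_lists(policy_lists, endpoint_ids):
    """Map policy list ID -> references that point at endpoints that no longer exist.

    policy_lists: list of dicts, assumed to carry "ID" and "References".
    endpoint_ids: IDs of the HNS endpoints currently present on the node.
    """
    known = {eid.lower() for eid in endpoint_ids}
    dangling = {}
    for policy_list in policy_lists:
        missing = [
            ref
            for ref in policy_list.get("References", [])
            # The endpoint ID is assumed to be the last path segment.
            if ref.rsplit("/", 1)[-1].lower() not in known
        ]
        if missing:
            dangling[policy_list["ID"]] = missing
    return dangling
```

An empty result would mean every policy list still points at a live endpoint; any entry in the result corresponds to the stale state described in this variation.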
Output of docker version:
Both VMs:
Output of docker info:
Manager VM:
Worker VM:
Additional environment details (AWS, VirtualBox, physical, etc.):
This was tested on 2 Azure VMs, using the "Windows Server 2019 Datacenter Server Core with Containers - Gen 2" image. Before testing, Docker was updated to the most recent version (20.10.4) with Install-Package -Name docker -ProviderName DockerMsftProvider -Verbose -Update.
The same problem also happens on our on-prem Hyper-V VMs running Windows Server 2019 Standard.