
Windows Swarm: manager restart breaks overlay networking #42385

Open
mr-field opened this issue May 17, 2021 · 2 comments

Description
In a single-manager Windows swarm, restarting the manager node or the Docker service on the manager node breaks the routing mesh. Even after the manager VM and/or the Docker service has finished restarting, swarm services won't be reachable at their published ports on the manager node. They will still be reachable at their published ports on the node their containers are running on.

Steps to reproduce the issue (condensed into a single PowerShell script after the list):

  1. Spin up 2 fresh Windows Server VMs. I tested this with 2 Windows Server 2019 VMs on Azure
    • Ensure that connections on the swarm ports are allowed between the 2 VMs
  2. Initialise a new swarm on the manager VM, setting its availability as drain to force all tasks to execute on the worker: docker swarm init --availability drain --advertise-addr <ip>
  3. Join the worker VM to the swarm: docker swarm join --token <worker token> <ip>:2377
  4. On the manager VM, create a basic IIS service: docker service create --name iis --publish 80:80 mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019
    • Once the service has started, connecting to port 80 on both VMs will show the default IIS page: iwr -UseBasicParsing http://<manager or worker ip>
    • Note that running Get-HnsPolicyList on the manager VM at this point returns 2 policy lists, linked to the IIS container's HNS endpoint
  5. Restart the Docker service on the manager VM or reboot the machine: Restart-Service docker/Restart-Computer
  6. Once the VM and/or the Docker service is back up, try to connect to port 80 on the manager VM: iwr -UseBasicParsing http://<manager ip>. The connection will time out.
    • Running the same command against the worker VM will still show the default IIS page
    • The manager node seems otherwise fine, it shows as "Available" in the node list and can control services
    • Note that running Get-HnsPolicyList on the manager VM at this point returns nothing
    • Force-updating the IIS service will repopulate the HNS policy lists and make the IIS service reachable on the manager VM again: docker service update --force iis
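
For convenience, the steps above condense into the following PowerShell sketch. $ManagerIp and $WorkerToken are placeholders for your environment; everything else mirrors the commands listed above.

# On the manager VM: initialise the swarm, draining the manager so all tasks land on the worker.
docker swarm init --availability drain --advertise-addr $ManagerIp

# On the worker VM: join the swarm.
docker swarm join --token $WorkerToken "${ManagerIp}:2377"

# On the manager VM: create the IIS service, then check that it responds and that the policy lists exist.
docker service create --name iis --publish 80:80 mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019
iwr -UseBasicParsing "http://$ManagerIp"   # default IIS page
Get-HnsPolicyList                          # 2 policy lists for the IIS endpoint

# Restart the daemon (or the whole VM), then retry.
Restart-Service docker
iwr -UseBasicParsing "http://$ManagerIp"   # times out
Get-HnsPolicyList                          # returns nothing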

Describe the results you received:
When I try to connect to port 80 on the swarm manager after restarting the Docker service (step 6), I get:

$ curl http://<manager ip>/
curl: (7) Failed to connect to <manager ip> port 80: Timed out

Describe the results you expected:
I expected to get the IIS default page, as I do at step 4:

$ curl http://<manager ip>/
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
{snip}
</html>

Additional information you deem important (e.g. issue happens only occasionally):
Given that all HNS policy lists vanish from the manager node at step 6, I suspect this has something to do with (re)creating policy lists when joining a swarm. Force-updating services recreates the missing policies, but it's impractical as a workaround because it stops and restarts all swarm services.
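
As a sketch, the workaround amounts to this one-liner (disruptive, since every service's tasks get stopped and restarted; docker service ls --quiet prints one service ID per line):

# Force-update every swarm service so its HNS policy lists are recreated on the manager.
docker service ls --quiet | ForEach-Object { docker service update --force $_ }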

I also noticed a variation on this problem, which I've not been able to reliably replicate. Sometimes, stopping the Docker service (step 5) does not clear out all HNS policy lists: a few remain on the manager VM, while the HNS endpoints are cleared out. When this happens, the remaining policies still reference the now-deleted endpoints. In this case, force-updating the service does not solve the issue and the machine needs to be restarted before the leftover policies are cleared out. The swarm service(s) can then be force-updated to fix the problem. I've tried manually removing the leftover policies with Remove-HNSPolicyList, but I just get Error=The network was not found.
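
For reference, this is roughly how the leftover state looked when I inspected it. The ID and References properties are how the HNS policy-list objects appeared on my machines, so treat the exact shape as an assumption:

# Endpoints are gone, but some policy lists survive and still reference /endpoints/<guid> entries that no longer exist.
Get-HnsEndpoint                                   # returns nothing
Get-HnsPolicyList | Select-Object ID, References  # stale endpoint references remain

# Attempting to remove the stale lists fails:
Get-HnsPolicyList | Remove-HnsPolicyList          # Error=The network was not found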

I've been unable to replicate this variation on Azure; it only seems to happen on our on-prem cluster when running our own apps (which are not publicly available). From a quick look at open PRs and issues, the failure to remove HNS policies may be fixed by moby/libnetwork#2620.

Output of docker version:
Both VMs:

Docker version 20.10.4, build 110e091

Output of docker info:
Manager VM:

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker Application (Docker Inc., v0.8.0)
  cluster: Manage Mirantis Container Cloud clusters (Mirantis Inc., v1.9.0)
  registry: Manage Docker registries (Docker Inc., 0.1.0)

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 3
 Server Version: 20.10.4
 Storage Driver: windowsfilter
  Windows:
 Logging Driver: json-file
 Plugins:
  Volume: local
  Network: ics internal l2bridge l2tunnel nat null overlay private transparent
  Log: awslogs etwlogs fluentd gcplogs gelf json-file local logentries splunk syslog
 Swarm: active
  NodeID: ia9lf0dsa8hza7obdtpit86vc
  Is Manager: true
  ClusterID: ml14gwdkn8v81ouy7dzufhnka
  Managers: 1
  Nodes: 2
  Default Address Pool: 10.0.0.0/8
  SubnetSize: 24
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 10.0.1.8
  Manager Addresses:
   10.0.1.8:2377
 Default Isolation: process
 Kernel Version: 10.0 17763 (17763.1.amd64fre.rs5_release.180914-1434)
 Operating System: Windows Server 2019 Datacenter Version 1809 (OS Build 17763.1935)
 OSType: windows
 Architecture: x86_64
 CPUs: 1
 Total Memory: 3.499GiB
 Name: manager-core-hn
 ID: VHH2:WGXV:AZ4K:AGGX:HAST:DC3W:MUIG:HYDF:YBIB:6UA6:DCVG:J7HY
 Docker Root Dir: C:\ProgramData\docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Worker VM:

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker Application (Docker Inc., v0.8.0)
  cluster: Manage Mirantis Container Cloud clusters (Mirantis Inc., v1.9.0)
  registry: Manage Docker registries (Docker Inc., 0.1.0)

Server:
 Containers: 3
  Running: 1
  Paused: 0
  Stopped: 2
 Images: 4
 Server Version: 20.10.4
 Storage Driver: windowsfilter
  Windows:
 Logging Driver: json-file
 Plugins:
  Volume: local
  Network: ics internal l2bridge l2tunnel nat null overlay private transparent
  Log: awslogs etwlogs fluentd gcplogs gelf json-file local logentries splunk syslog
 Swarm: active
  NodeID: ntx5oj6frb6391hkklfk2m1z5
  Is Manager: false
  Node Address: 10.0.1.9
  Manager Addresses:
   10.0.1.8:2377
 Default Isolation: process
 Kernel Version: 10.0 17763 (17763.1.amd64fre.rs5_release.180914-1434)
 Operating System: Windows Server 2019 Datacenter Version 1809 (OS Build 17763.1935)
 OSType: windows
 Architecture: x86_64
 CPUs: 1
 Total Memory: 3.499GiB
 Name: worker-core-hns
 ID: VHH2:WGXV:AZ4K:AGGX:HAST:DC3W:MUIG:HYDF:YBIB:6UA6:DCVG:J7HY
 Docker Root Dir: C:\ProgramData\docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):
This was tested on 2 Azure VMs, using the "Windows Server 2019 Datacenter Server Core with Containers - Gen 2" image. Before testing, Docker was updated to the most recent version (20.10.4) with Install-Package -Name docker -ProviderName DockerMsftProvider -Verbose -Update.

The same problem also happens on our on-prem Hyper-V VMs running Windows Server 2019 Standard.

olljanat (Contributor) commented Jun 2, 2021

Can you test with version 20.10.5? It looks like Mirantis merged the moby/libnetwork#2620 fix into their repo via PR Mirantis/libnetwork#4, from a branch named "20.10-FIELD-3310", and their 20.10.5 release notes refer to FIELD-3310: https://docs.mirantis.com/containers/v3.1/mcr-rn/20-10-5.html

@paulswartz

I can confirm that I still see this issue with 20.10.21 (installed from https://download.docker.com/win/static/stable/x86_64/) on Windows Server 2019.

docker version:

Client:
 Version:           20.10.21
 API version:       1.41
 Go version:        go1.18.7
 Git commit:        baeda1f
 Built:             Tue Oct 25 18:08:16 2022
 OS/Arch:           windows/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.21
  API version:      1.41 (minimum version 1.24)
  Go version:       go1.18.7
  Git commit:       3056208
  Built:            Tue Oct 25 18:03:04 2022
  OS/Arch:          windows/amd64
  Experimental:     true

docker info:

Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 6
  Running: 2
  Paused: 0
  Stopped: 4
 Images: 53
 Server Version: 20.10.21
 Storage Driver: lcow (linux) windowsfilter (windows)
  LCOW:
  Windows:
 Logging Driver: json-file
 Plugins:
  Volume: local
  Network: ics internal l2bridge l2tunnel nat null overlay private transparent
  Log: awslogs etwlogs fluentd gcplogs gelf json-file local logentries splunk syslog
 Swarm: active
  NodeID: 3732x3phlg8fv0qiols2g0z8c
  Is Manager: true
  ClusterID: mc28musfi26quk6lp2t6vci81
  Managers: 1
  Nodes: 2
  Default Address Pool: 10.0.0.0/8
  SubnetSize: 24
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 10.108.45.24
  Manager Addresses:
   10.108.45.24:2377
 Default Isolation: process
 Kernel Version: 10.0 17763 (17763.1.amd64fre.rs5_release.180914-1434)
 Operating System: Windows Server 2019 Standard Version 1809 (OS Build 17763.3650)
 OSType: windows
 Architecture: x86_64
 CPUs: 2
 Total Memory: 7.999GiB
 Name: HSCTDTST
 ID: OS4L:ECZY:UETB:Q77X:LYEY:T6OY:VDVU:KTP7:7EEE:3QD2:5FIF:JSOU
 Docker Root Dir: C:\ProgramData\docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: true
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: Community Engine
