Dockerd "Removing stale HNS endpoint <ID>" that is not created by itself on windows node of eks cluster #45462

namtrh · 2023-05-04T04:02:42Z

Description

When the dockerd pod starts on a Windows node, it removes the HNS endpoints of the containers already running on that node, which breaks their networking.

After the existing containers on the Windows node are restarted, they work again.

Reproduce

1. Create an EKS cluster that has a Windows node.
2. Run a PowerShell pod on the Windows node.
3. From that pod, curl the API server at kubernetes.default.svc -> it works.
4. Run a dockerd pod (image based on windows-host-process-containers-base-image) on the Windows node as a HostProcess pod.
5. From the PowerShell pod, curl kubernetes.default.svc again -> network timeout.
6. Keep the dockerd pod running and restart the PowerShell pod.
7. curl kubernetes.default.svc -> it works normally.

Expected behavior

The PowerShell pod can connect to kubernetes.default.svc without needing to be restarted or redeployed.

docker version

Client:
 Version:           23.0.4
 API version:       1.42
 Go version:        go1.19.8
 Git commit:        f480fb1
 Built:             Fri Apr 14 10:31:24 2023
 OS/Arch:           windows/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          23.0.4
  API version:      1.42 (minimum version 1.24)
  Go version:       go1.19.8
  Git commit:       cbce331
  Built:            Fri Apr 14 10:29:12 2023
  OS/Arch:          windows/amd64
  Experimental:     false

docker info

Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 0
 Server Version: 23.0.4
 Storage Driver: windowsfilter
  Windows:
 Logging Driver: json-file
 Plugins:
  Volume: local
  Network: ics internal l2bridge l2tunnel nat null overlay private transparent
  Log: awslogs etwlogs fluentd gcplogs gelf json-file local logentries splunk syslog
 Swarm: inactive
 Default Isolation: process
 Kernel Version: 10.0 17763 (17763.1.amd64fre.rs5_release.180914-1434)
 Operating System: Microsoft Windows Server Version 1809 (OS Build 17763.4252)
 OSType: windows
 Architecture: x86_64
 CPUs: 4
 Total Memory: 15.54GiB
 Name: EC2AMAZ-3IU6LTG
 ID: c211b4eb-904e-4da6-a005-90ff5fa0ff48
 Docker Root Dir: C:\ProgramData\docker
 Debug Mode: true
  File Descriptors: -1
  Goroutines: 21
  System Time: 2023-05-04T03:35:29.7733053Z
  EventsListeners: 0
 Registry: https://index.docker.io/v1/
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: Community Engine

Additional Info

dockerd-log.txt

I see the following in the dockerd logs:

Removing stale HNS endpoint 1F9F7446-BBF1-4D27-AD60-35585A8D05F6

That HNS endpoint was not created by dockerd, but dockerd deleted it.
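To see which endpoints a given daemon start removed, the IDs can be pulled out of the log text. The following is a minimal sketch, assuming only the message format quoted above ("Removing stale HNS endpoint" followed by a GUID); the function name `removedEndpoints` and the first line of the sample input are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// staleEndpoint matches the log message quoted in this report, followed by
// a 36-character GUID. The exact format is an assumption based on that one line.
var staleEndpoint = regexp.MustCompile(`Removing stale HNS endpoint ([0-9A-Fa-f-]{36})`)

// removedEndpoints returns every endpoint ID named in a
// "Removing stale HNS endpoint" message, in order of appearance.
func removedEndpoints(log string) []string {
	var ids []string
	for _, m := range staleEndpoint.FindAllStringSubmatch(log, -1) {
		ids = append(ids, m[1])
	}
	return ids
}

func main() {
	// Only the second line comes from the report; the first is a placeholder.
	sample := "daemon starting (hypothetical line)\n" +
		"Removing stale HNS endpoint 1F9F7446-BBF1-4D27-AD60-35585A8D05F6\n"
	for _, id := range removedEndpoints(sample) {
		fmt.Println(id)
	}
}
```

The extracted IDs can then be compared with the endpoints that belong to the affected pods to confirm which ones dockerd deleted.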


namtrh · 2023-05-04T04:13:33Z

Is this related to #36603?

neersighted · 2023-05-08T14:23:54Z

The daemon should only be removing networks that it knows about (so, HNS endpoint IDs that are stored in the libnetwork kv-store); could you elaborate more about what components are in play here? Are you also creating containers with containerd, and the HNS CNI plugins?

corhere · 2023-05-08T20:05:20Z

The user is running an Amazon EKS Kubernetes cluster with Windows nodes, which implies cri-containerd + HNS CNI plugins. They are starting dockerd in a Windows HostProcess pod, which implies that dockerd is being started in the host network namespace. During startup, dockerd is deleting the HNS endpoint for another pod, which AIUI could only happen if the other pod's networking was managed through HNS. Though it would be nice if my assumptions could be verified. @namtrh in addition to what @neersighted asked, could you please also share the dockerd logs without redacting the Network Response logs? The information you removed is extremely relevant for diagnosing this issue.

The code path which removes stale HNS endpoints for a network is only reachable when the libnetwork network is created from a configuration which specifies an HNS ID for the network. Unfortunately, at startup the daemon "adopts" all the HNS networks by creating a libnetwork network for each using a config which specifies the existing network's HNS ID, even when no corresponding libnetwork network exists in the data store.

moby/daemon/daemon_windows.go

Lines 359 to 362 in 9dbdbd4

    netOption := map[string]string{
    	winlibnetwork.NetworkName: v.Name,
    	winlibnetwork.HNSID:       v.Id,
    }

moby/daemon/daemon_windows.go

Lines 391 to 396 in 9dbdbd4

    _, err := daemon.netController.NewNetwork(strings.ToLower(v.Type), name, nid,
    	libnetwork.NetworkOptionGeneric(options.Generic{
    		netlabel.GenericData: netOption,
    	}),
    	libnetwork.NetworkOptionIpam("default", "", v4Conf, v6Conf, nil),
    )

In other words, dockerd incorrectly assumes it owns every HNS network in the network namespace.
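The distinction can be sketched as a partition of the HNS networks found at startup into those recorded in the daemon's own store and those it has never seen. This is only an illustration of the ownership check described above, not the daemon's code and not a proposed fix; the `hnsNetwork` type, the `splitByOwnership` function, and all IDs and names are hypothetical.

```go
package main

import "fmt"

// hnsNetwork is a hypothetical stand-in for the HNS network records
// that the daemon enumerates at startup.
type hnsNetwork struct {
	ID   string
	Name string
}

// splitByOwnership partitions HNS networks into those whose ID is present
// in the daemon's own store (owned) and those it has no record of (foreign).
func splitByOwnership(all []hnsNetwork, known map[string]bool) (owned, foreign []hnsNetwork) {
	for _, n := range all {
		if known[n.ID] {
			owned = append(owned, n)
		} else {
			foreign = append(foreign, n)
		}
	}
	return owned, foreign
}

func main() {
	// Made-up IDs and names, for illustration only.
	all := []hnsNetwork{
		{ID: "aaaa", Name: "created-by-dockerd"},
		{ID: "bbbb", Name: "created-by-cni"},
	}
	known := map[string]bool{"aaaa": true}

	owned, foreign := splitByOwnership(all, known)
	fmt.Println("adopt:", owned)
	fmt.Println("leave alone:", foreign)
}
```

In the behavior described above, the daemon effectively treats every network as owned, so stale-endpoint cleanup also runs against networks that another component created.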

corhere · 2023-05-08T20:27:47Z

The daemon deleting HNS endpoints for networks it doesn't own looks to have been known to the Windows Containers folks at Microsoft all the way back in 2018.

Docker removes stale endpoints in nat network only libnetwork#2150

Their "fix" was not great and they decided not to follow through before they abandoned maintaining Windows support in Moby. And it sounds like some misconceptions about container lifecycle got codified in WS2019's version of HNS which makes me very afraid to make any modifications to the Windows network startup code without assistance from Microsoft as HNS is an under-documented black box.

namtrh · 2023-05-09T02:04:37Z

@neersighted, that is exactly what @corhere assumed above.

The full dockerd log is below:
dockerd-log-full.txt

tuananh · 2023-05-09T02:27:36Z

@corhere, can you think of a workaround for this issue?

sam-thibault · 2023-08-21T13:45:00Z

I will remove the more-info-needed label, however we are dependent on changes to the Windows networking API to make progress on this issue.

namtrh added kind/bug status/0-triage labels May 4, 2023

thaJeztah added platform/windows area/networking version/23.0 labels May 4, 2023

corhere added the status/more-info-needed label May 8, 2023

sam-thibault removed the status/more-info-needed label Aug 21, 2023

This was referenced Mar 15, 2024

HNS endpoint is not deleted after windows server patching docker/for-win#13965

Closed

HNS endpoint is not deleted after windows server patching #47576

Open
