Dockerd "Removing stale HNS endpoint <ID>" that is not created by itself on windows node of eks cluster #45462
Comments
Is it related to #36603?
The daemon should only be removing networks that it knows about (so, HNS endpoint IDs that are stored in the libnetwork kv-store); could you elaborate more about what components are in play here? Are you also creating containers with containerd, and the HNS CNI plugins?
The user is running an Amazon EKS Kubernetes cluster with Windows nodes, which implies cri-containerd + HNS CNI plugins. They are starting dockerd in a Windows HostProcess pod, which implies that dockerd is being started in the host network namespace. During startup, dockerd is deleting the HNS endpoint for another pod, which AIUI could only happen if the other pod's networking was managed through HNS. Though it would be nice if my assumptions could be verified.
@namtrh in addition to what @neersighted asked, could you please also share the dockerd logs without redacting the
The code path which removes stale HNS endpoints for a network is only reachable when the libnetwork network is created from a configuration which specifies an HNS ID for the network. Unfortunately, at startup the daemon "adopts" all the HNS networks by creating a libnetwork network for each using a config which specifies the existing network's HNS ID, even when no corresponding libnetwork network exists in the data store.
Lines 359 to 362 in 9dbdbd4
Lines 391 to 396 in 9dbdbd4
In other words, dockerd incorrectly assumes it owns every HNS network in the network namespace.
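To make the ownership problem concrete, here is a minimal sketch of what "only clean up what you own" could look like. It is not the daemon's code: `hnsNetwork`, `staleEndpoints`, and all of the IDs are hypothetical names made up for illustration, and the two maps stand in for whatever the libnetwork data store actually records.

```go
package main

import "fmt"

// hnsNetwork is a hypothetical stand-in for an HNS network and the IDs of
// the endpoints attached to it. It is not a real hcsshim or libnetwork type.
type hnsNetwork struct {
	ID        string
	Endpoints []string
}

// staleEndpoints returns the endpoint IDs that may be treated as stale:
// endpoints on networks the daemon has recorded as its own (ownedNetworks)
// that are missing from its own endpoint records (knownEndpoints).
// Networks that are not in ownedNetworks are skipped entirely.
func staleEndpoints(all []hnsNetwork, ownedNetworks, knownEndpoints map[string]bool) []string {
	var stale []string
	for _, nw := range all {
		if !ownedNetworks[nw.ID] {
			// Created by something else (for example a CNI plugin): leave it alone.
			continue
		}
		for _, ep := range nw.Endpoints {
			if !knownEndpoints[ep] {
				stale = append(stale, ep)
			}
		}
	}
	return stale
}

func main() {
	// Illustrative data only: one network owned by the daemon, one owned by a CNI plugin.
	all := []hnsNetwork{
		{ID: "docker-nat", Endpoints: []string{"ep-docker-live", "ep-docker-old"}},
		{ID: "cni-vpc", Endpoints: []string{"ep-pod-a"}},
	}
	owned := map[string]bool{"docker-nat": true}
	known := map[string]bool{"ep-docker-live": true}

	fmt.Println(staleEndpoints(all, owned, known)) // prints [ep-docker-old]
}
```

With the adoption behaviour described above, every network effectively ends up in the "owned" set, which is how an endpoint like `ep-pod-a` in this sketch would be deleted too.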
The daemon deleting HNS endpoints for networks it doesn't own looks to have been known to the Windows Containers folks at Microsoft all the way back in 2018. Their "fix" was not great and they decided not to follow through before they abandoned maintaining Windows support in Moby. And it sounds like some misconceptions about container lifecycle got codified in WS2019's version of HNS, which makes me very afraid to make any modifications to the Windows network startup code without assistance from Microsoft, as HNS is an under-documented black box.
@neersighted, that is exactly what @corhere assumed above. The full dockerd log is below.
@corhere is there a workaround for this issue you can think of?
I will remove the more-info-needed label, however we are dependent on changes to the Windows networking API to make progress on this issue.
Description
When the dockerd pod starts on a Windows node, it removes the HNS endpoints of existing containers on that node, causing them to stop working.
After the existing containers on the Windows node are restarted, they work again.
Reproduce
Expected behavior
The PowerShell pod can connect to kubernetes.default.svc without needing to be restarted or redeployed.
docker version
Client:
 Version:           23.0.4
 API version:       1.42
 Go version:        go1.19.8
 Git commit:        f480fb1
 Built:             Fri Apr 14 10:31:24 2023
 OS/Arch:           windows/amd64
 Context:           default
Server: Docker Engine - Community
 Engine:
  Version:          23.0.4
  API version:      1.42 (minimum version 1.24)
  Go version:       go1.19.8
  Git commit:       cbce331
  Built:            Fri Apr 14 10:29:12 2023
  OS/Arch:          windows/amd64
  Experimental:     false
docker info
Additional Info
dockerd-log.txt
I see the following in the dockerd logs:
Removing stale HNS endpoint 1F9F7446-BBF1-4D27-AD60-35585A8D05F6
That HNS endpoint was not created by dockerd, but dockerd deleted it.
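For anyone trying to confirm which endpoints were affected, a small sketch like the following could pull the endpoint IDs out of a saved dockerd log. It assumes only that each relevant line contains the text "Removing stale HNS endpoint" followed by a GUID, as in the line quoted above; the rest of the log line format is not assumed, and `removedEndpointIDs` is a hypothetical helper, not part of Docker.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// staleEndpointRe matches the message quoted in this issue and captures the
// GUID that follows it (36 characters: hex digits and hyphens).
var staleEndpointRe = regexp.MustCompile(`Removing stale HNS endpoint ([0-9A-Fa-f-]{36})`)

// removedEndpointIDs scans log text line by line and returns every endpoint
// ID that appears in a "Removing stale HNS endpoint" message.
func removedEndpointIDs(log string) []string {
	var ids []string
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		if m := staleEndpointRe.FindStringSubmatch(sc.Text()); m != nil {
			ids = append(ids, m[1])
		}
	}
	return ids
}

func main() {
	// Sample input: the first line is made up, the second is the message from this issue.
	sample := "starting daemon\n" +
		"Removing stale HNS endpoint 1F9F7446-BBF1-4D27-AD60-35585A8D05F6\n"

	fmt.Println(removedEndpointIDs(sample)) // prints [1F9F7446-BBF1-4D27-AD60-35585A8D05F6]
}
```

The resulting IDs can then be compared against the endpoints that the affected pods were using before dockerd started.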