Windows: Network connectivity flakes during Kubernetes prestop exec calls (curl webserver) #7332

claudiubelu opened this issue on Feb 9, 2023

Expected Behavior

In Kubernetes, you can define pods with lifecycle hooks. In particular, you can define PreStop exec lifecycle hooks, which the kubelet runs by exec-ing into the container and executing a given command right before the container is stopped. Kubernetes has a suite of E2E tests, including networking tests that exercise these lifecycle hooks, and the expectation is for them to pass consistently.
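For reference, a PreStop exec hook is declared on the container spec. A minimal sketch (the image name and webserver address below are placeholders for illustration, not the ones used by the E2E test):

    kubectl apply -f - <<'EOF'
    apiVersion: v1
    kind: Pod
    metadata:
      name: prestop-demo
    spec:
      terminationGracePeriodSeconds: 30
      containers:
      - name: main
        # Illustrative image that provides sh and curl.
        image: curlimages/curl
        command: ["sh", "-c", "sleep 3600"]
        lifecycle:
          preStop:
            exec:
              # kubelet execs this command inside the container right before
              # stopping it; the webserver address here is a placeholder.
              command: ["sh", "-c", "curl http://example-webserver.default.svc:8080/echo?msg=prestop"]
    EOF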

Current Behavior

One of Kubernetes' E2E tests, [sig-node] Container Lifecycle Hook when create a pod with lifecycle hook should execute prestop exec hook properly [NodeConformance] [Conformance], flakes several times per week in CI jobs that test Windows with the Calico CNI. The test spawns a webserver pod and a busybox pod whose PreStop exec lifecycle hook curls the webserver pod; in some runs this curl fails. An analysis of the issue can be seen here: kubernetes/kubernetes#108711 (comment)

Possible Solution

Processing the HNS endpoint / applying the endpoint rules may result in temporary network interruptions. Ideally, that wouldn't happen. Alternatively, the endpoint would not be updated right before being deleted, or the pod would be given enough time to execute its PreStop hooks (if any) before the update.

Steps to Reproduce (for bugs)

  1. Deploy a Kubernetes cluster, including Windows nodes.
  2. Deploy the Calico CNI.
  3. Run the E2E tests repeatedly until they fail, including [sig-node] Container Lifecycle Hook when create a pod with lifecycle hook should execute prestop exec hook properly [NodeConformance] [Conformance] and [sig-node] PreStop should call prestop when killing a pod [Conformance] (see the sketch below for one way to run just these tests).
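A minimal sketch of step 3, assuming the e2e.test binary built from kubernetes/kubernetes and standard E2E flags (the provider, kubeconfig path, and focus regex below are examples; adjust them to your environment):

    # Run only the two affected tests in a loop until one of them fails.
    while ./e2e.test \
        --provider=skeleton \
        --kubeconfig="$HOME/.kube/config" \
        --node-os-distro=windows \
        --ginkgo.focus='should execute prestop exec hook properly|PreStop should call prestop when killing a pod'
    do
      echo "Tests passed, running again..."
    done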

Test runs can be seen here: https://testgrid.k8s.io/sig-windows-signal. All the capz jobs use the Calico CNI.

Sample test job failure: [1]

The job above contains the test failure and its stderr. In the stderr, we can see:

Expected
    <*errors.errorString | 0xc0015f2590>: {
        s: "failed to match regexp \"GET /echo\\\\?msg=prestop\" in output \"I0202 05:15:16.488707   14144 log.go:198] Started HTTP server on port 8080\\nI0202 05:15:16.981252   14144 log.go:198] Started UDP server on port  8081\\n\"",
    }
to be nil

The error means that the webserver did not receive the expected call from the busybox pod's PreStop lifecycle hook (which is a curl call).

In the test logs, we can see the pod events, which show that the PreStop exec hook failed:

Feb  2 05:16:01.308: INFO: At 2023-02-02 05:15:27 +0000 UTC - event for pod-with-prestop-exec-hook: {kubelet capz-conf-sl5v7} FailedPreStopHook: PreStopHook failed
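On a live cluster, the same failure surfaces as a FailedPreStopHook event on the pod. Assuming the namespace and pod name from this particular run (they change on every run), something like the following pulls it up:

    kubectl get events -n container-lifecycle-hook-5405 \
      --field-selector involvedObject.name=pod-with-prestop-exec-hook
    kubectl describe pod -n container-lifecycle-hook-5405 pod-with-prestop-exec-hook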

We can see that the pod was spawned on the node capz-conf-sl5v7 and that the pod name is pod-with-prestop-exec-hook. The artifacts from the job can be accessed via [1]; they contain the kubelet logs, various system logs, and logs from pods in the kube-system and calico-system namespaces. For example, the capz-conf-sl5v7 node logs can be seen in [2], and its calico-node logs in [3].

In the kubelet logs [4], the following can be seen:

E0202 05:15:27.488428    4684 handlers.go:78] "Exec lifecycle hook for Container in Pod failed" err=<
	command 'sh -c curl http://192.168.28.222:8080/echo?msg=prestop' exited with 7:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
	                                 Dload  Upload   Total   Spent    Left  Speed
	
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to 192.168.28.222 port 8080: Bad access
 > execCommand=[sh -c curl http://192.168.28.222:8080/echo?msg=prestop] containerName="pod-with-prestop-exec-hook" pod="container-lifecycle-hook-5405/pod-with-prestop-exec-hook" message=<
	  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
	                                 Dload  Upload   Total   Spent    Left  Speed
	
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to 192.168.28.222 port 8080: Bad access
 >
E0202 05:15:27.488428    4684 kuberuntime_container.go:598] "PreStop hook failed" err=<
	command 'sh -c curl http://192.168.28.222:8080/echo?msg=prestop' exited with 7:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
	                                 Dload  Upload   Total   Spent    Left  Speed
	
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to 192.168.28.222 port 8080: Bad access
 > pod="container-lifecycle-hook-5405/pod-with-prestop-exec-hook" podUID=ed44495c-a005-4738-a469-2425ff129a60 containerName="pod-with-prestop-exec-hook" containerID="containerd://25272f179f1a3a7a46bc422480cb0f7c5831f52a36c81061c03014c8f6e7e63a"
I0202 05:15:27.488428    4684 kuberuntime_container.go:711] "Killing container with a grace period" pod="container-lifecycle-hook-5405/pod-with-prestop-exec-hook" podUID=ed44495c-a005-4738-a469-2425ff129a60 containerName="pod-with-prestop-exec-hook" containerID="containerd://25272f179f1a3a7a46bc422480cb0f7c5831f52a36c81061c03014c8f6e7e63a" gracePeriod=15

And in the felix logs from that node's calico-node pod [3], we see the endpoint update happening in almost the same timeframe:

2023-02-02 05:15:27.129 [INFO][5900] felix/calc_graph.go 462: Local endpoint updated id=WorkloadEndpoint(node=capz-conf-sl5v7, orchestrator=k8s, workload=container-lifecycle-hook-5405/pod-with-prestop-exec-hook, name=eth0)
2023-02-02 05:15:27.129 [INFO][5900] felix/win_dataplane.go 255: Received *proto.WorkloadEndpointUpdate update from calculation graph msg=id:<orchestrator_id:"k8s" workload_id:"container-lifecycle-hook-5405/pod-with-prestop-exec-hook" endpoint_id:"eth0" > endpoint:<state:"active" name:"cali7f766805d33" profile_ids:"kns.container-lifecycle-hook-5405" profile_ids:"ksa.container-lifecycle-hook-5405.default" ipv4_nets:"192.168.81.21/32" > 
2023-02-02 05:15:27.129 [INFO][5900] felix/endpoint_mgr.go 139: Processing WorkloadEndpointUpdate workloadEndpointId=orchestrator_id:"k8s" workload_id:"container-lifecycle-hook-5405/pod-with-prestop-exec-hook" endpoint_id:"eth0" 
2023-02-02 05:15:27.333 [INFO][5900] felix/endpoint_mgr.go 356: Processing endpoint add/update id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"container-lifecycle-hook-5405/pod-with-prestop-exec-hook", EndpointId:"eth0"}
2023-02-02 05:15:27.334 [INFO][5900] felix/endpoint_mgr.go 446: Applying endpoint rules endpointId="F5D6E054-F8AB-4EF3-BDD2-AC84C7DA83A7" id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"container-lifecycle-hook-5405/pod-with-prestop-exec-hook", EndpointId:"eth0"} inboundPolicyIds=[]string{"profile-kns.container-lifecycle-hook-5405", "profile-ksa.container-lifecycle-hook-5405.default"} outboundPolicyIds=[]string{"profile-kns.container-lifecycle-hook-5405", "profile-ksa.container-lifecycle-hook-5405.default"}

[1] https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-capz-master-containerd-windows/1621001928051789824
[2] https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-capz-master-containerd-windows/1621001928051789824/artifacts/clusters/capz-conf-s0axd4/machines/capz-conf-s0axd4-md-win-7c54cf6856-bh28x/
[3] https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-capz-master-containerd-windows/1621001928051789824/artifacts/clusters/capz-conf-s0axd4/calico-system/calico-node-windows-hwlkc/
[4] https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-capz-master-containerd-windows/1621001928051789824/artifacts/clusters/capz-conf-s0axd4/machines/capz-conf-s0axd4-md-win-7c54cf6856-bh28x/kubelet.log

Context

This issue affects the sig-windows signal in Kubernetes, as seen here: https://testgrid.k8s.io/sig-windows-signal

Your Environment

  • Calico version: Deployed Calico as Host Process Containers using the image docker.io/sigwindowstools/calico-node:v3.24.5-hostprocess
  • Orchestrator version (e.g. kubernetes, mesos, rkt): Kubernetes v1.25+
  • Operating System and version: Windows Server 2019, Windows Server 2022.