Windows: Network connectivity flakes during Kubernetes prestop exec calls (curl webserver) #7332

claudiubelu opened this issue on Feb 9, 2023

Expected Behavior

In Kubernetes, you can define pods with lifecycle hooks. In particular, you can define PreStop exec lifecycle hooks, which the kubelet runs by exec-ing into the container and executing a given command right before the container is stopped. Kubernetes has a suite of E2E tests, including networking tests that exercise these lifecycle hooks, and the expectation is for them to pass consistently.
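For reference, a PreStop exec hook is declared on the container spec. A minimal sketch (the image name and webserver address below are placeholders for illustration, not the ones used by the E2E test):

    kubectl apply -f - <<'EOF'
    apiVersion: v1
    kind: Pod
    metadata:
      name: prestop-demo
    spec:
      terminationGracePeriodSeconds: 30
      containers:
      - name: main
        # Illustrative image that provides sh and curl.
        image: curlimages/curl
        command: ["sh", "-c", "sleep 3600"]
        lifecycle:
          preStop:
            exec:
              # kubelet execs this command inside the container right before
              # stopping it; the webserver address here is a placeholder.
              command: ["sh", "-c", "curl http://example-webserver.default.svc:8080/echo?msg=prestop"]
    EOF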

Current Behavior

One of Kubernetes' E2E tests, [sig-node] Container Lifecycle Hook when create a pod with lifecycle hook should execute prestop exec hook properly [NodeConformance] [Conformance], flakes several times per week in CI jobs that test Windows with the Calico CNI. The test spawns a webserver pod and a busybox pod whose PreStop exec lifecycle hook curls the webserver pod; in some runs this curl fails. An analysis of the issue can be seen here: kubernetes/kubernetes#108711 (comment)

Possible Solution

Processing the HNS endpoint / applying the endpoint rules may result in temporary network interruptions. Ideally, that wouldn't happen. Alternatively, the endpoint would not be updated right before being deleted, or the pod would be given enough time to execute its PreStop hooks (if any) before the update.

Steps to Reproduce (for bugs)

  1. Deploy a Kubernetes cluster, including Windows nodes.
  2. Deploy the Calico CNI.
  3. Run the E2E tests repeatedly until they fail, including [sig-node] Container Lifecycle Hook when create a pod with lifecycle hook should execute prestop exec hook properly [NodeConformance] [Conformance] and [sig-node] PreStop should call prestop when killing a pod [Conformance] (see the sketch below for one way to run just these tests).
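A minimal sketch of step 3, assuming the e2e.test binary built from kubernetes/kubernetes and standard E2E flags (the provider, kubeconfig path, and focus regex below are examples; adjust them to your environment):

    # Run only the two affected tests in a loop until one of them fails.
    while ./e2e.test \
        --provider=skeleton \
        --kubeconfig="$HOME/.kube/config" \
        --node-os-distro=windows \
        --ginkgo.focus='should execute prestop exec hook properly|PreStop should call prestop when killing a pod'
    do
      echo "Tests passed, running again..."
    done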

Test runs can be seen here: https://testgrid.k8s.io/sig-windows-signal. All the capz jobs use the Calico CNI.

Sample test job failure: [1]

The job above contains the test failure and its stderr. In the stderr, we can see:

Expected
    <*errors.errorString | 0xc0015f2590>: {
        s: "failed to match regexp \"GET /echo\\\\?msg=prestop\" in output \"I0202 05:15:16.488707   14144 log.go:198] Started HTTP server on port 8080\\nI0202 05:15:16.981252   14144 log.go:198] Started UDP server on port  8081\\n\"",
    }
to be nil

The error means that the webserver did not receive the expected call from the busybox pod's PreStop lifecycle hook (which is a curl call).

In the test logs, we can see the pod events, which show that the PreStop exec hook failed:

Feb  2 05:16:01.308: INFO: At 2023-02-02 05:15:27 +0000 UTC - event for pod-with-prestop-exec-hook: {kubelet capz-conf-sl5v7} FailedPreStopHook: PreStopHook failed
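On a live cluster, the same failure surfaces as a FailedPreStopHook event on the pod. Assuming the namespace and pod name from this particular run (they change on every run), something like the following pulls it up:

    kubectl get events -n container-lifecycle-hook-5405 \
      --field-selector involvedObject.name=pod-with-prestop-exec-hook
    kubectl describe pod -n container-lifecycle-hook-5405 pod-with-prestop-exec-hook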

We can see that the pod was spawned on the node capz-conf-sl5v7 and that the pod name is pod-with-prestop-exec-hook. The artifacts from the job can be accessed via [1]; they contain the kubelet logs, various system logs, and logs from pods in the kube-system and calico-system namespaces. For example, the capz-conf-sl5v7 node logs can be seen in [2], and its calico-node logs in [3].

In the kubelet logs [4], the following can be seen:

E0202 05:15:27.488428    4684 handlers.go:78] "Exec lifecycle hook for Container in Pod failed" err=<
	command 'sh -c curl http://192.168.28.222:8080/echo?msg=prestop' exited with 7:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
	                                 Dload  Upload   Total   Spent    Left  Speed
	
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to 192.168.28.222 port 8080: Bad access
 > execCommand=[sh -c curl http://192.168.28.222:8080/echo?msg=prestop] containerName="pod-with-prestop-exec-hook" pod="container-lifecycle-hook-5405/pod-with-prestop-exec-hook" message=<
	  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
	                                 Dload  Upload   Total   Spent    Left  Speed
	
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to 192.168.28.222 port 8080: Bad access
 >
E0202 05:15:27.488428    4684 kuberuntime_container.go:598] "PreStop hook failed" err=<
	command 'sh -c curl http://192.168.28.222:8080/echo?msg=prestop' exited with 7:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
	                                 Dload  Upload   Total   Spent    Left  Speed
	
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to 192.168.28.222 port 8080: Bad access
 > pod="container-lifecycle-hook-5405/pod-with-prestop-exec-hook" podUID=ed44495c-a005-4738-a469-2425ff129a60 containerName="pod-with-prestop-exec-hook" containerID="containerd://25272f179f1a3a7a46bc422480cb0f7c5831f52a36c81061c03014c8f6e7e63a"
I0202 05:15:27.488428    4684 kuberuntime_container.go:711] "Killing container with a grace period" pod="container-lifecycle-hook-5405/pod-with-prestop-exec-hook" podUID=ed44495c-a005-4738-a469-2425ff129a60 containerName="pod-with-prestop-exec-hook" containerID="containerd://25272f179f1a3a7a46bc422480cb0f7c5831f52a36c81061c03014c8f6e7e63a" gracePeriod=15

And in the felix logs from that node's calico-node pod [3], we see the endpoint update happening in almost the same timeframe:

2023-02-02 05:15:27.129 [INFO][5900] felix/calc_graph.go 462: Local endpoint updated id=WorkloadEndpoint(node=capz-conf-sl5v7, orchestrator=k8s, workload=container-lifecycle-hook-5405/pod-with-prestop-exec-hook, name=eth0)
2023-02-02 05:15:27.129 [INFO][5900] felix/win_dataplane.go 255: Received *proto.WorkloadEndpointUpdate update from calculation graph msg=id:<orchestrator_id:"k8s" workload_id:"container-lifecycle-hook-5405/pod-with-prestop-exec-hook" endpoint_id:"eth0" > endpoint:<state:"active" name:"cali7f766805d33" profile_ids:"kns.container-lifecycle-hook-5405" profile_ids:"ksa.container-lifecycle-hook-5405.default" ipv4_nets:"192.168.81.21/32" > 
2023-02-02 05:15:27.129 [INFO][5900] felix/endpoint_mgr.go 139: Processing WorkloadEndpointUpdate workloadEndpointId=orchestrator_id:"k8s" workload_id:"container-lifecycle-hook-5405/pod-with-prestop-exec-hook" endpoint_id:"eth0" 
2023-02-02 05:15:27.333 [INFO][5900] felix/endpoint_mgr.go 356: Processing endpoint add/update id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"container-lifecycle-hook-5405/pod-with-prestop-exec-hook", EndpointId:"eth0"}
2023-02-02 05:15:27.334 [INFO][5900] felix/endpoint_mgr.go 446: Applying endpoint rules endpointId="F5D6E054-F8AB-4EF3-BDD2-AC84C7DA83A7" id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"container-lifecycle-hook-5405/pod-with-prestop-exec-hook", EndpointId:"eth0"} inboundPolicyIds=[]string{"profile-kns.container-lifecycle-hook-5405", "profile-ksa.container-lifecycle-hook-5405.default"} outboundPolicyIds=[]string{"profile-kns.container-lifecycle-hook-5405", "profile-ksa.container-lifecycle-hook-5405.default"}

[1] https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-capz-master-containerd-windows/1621001928051789824
[2] https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-capz-master-containerd-windows/1621001928051789824/artifacts/clusters/capz-conf-s0axd4/machines/capz-conf-s0axd4-md-win-7c54cf6856-bh28x/
[3] https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-capz-master-containerd-windows/1621001928051789824/artifacts/clusters/capz-conf-s0axd4/calico-system/calico-node-windows-hwlkc/
[4] https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-capz-master-containerd-windows/1621001928051789824/artifacts/clusters/capz-conf-s0axd4/machines/capz-conf-s0axd4-md-win-7c54cf6856-bh28x/kubelet.log

Context

This issue affects the sig-windows signal in Kubernetes, as seen here: https://testgrid.k8s.io/sig-windows-signal

Your Environment

  • Calico version: Deployed Calico as Host Process Containers using the image docker.io/sigwindowstools/calico-node:v3.24.5-hostprocess
  • Orchestrator version (e.g. kubernetes, mesos, rkt): Kubernetes v1.25+
  • Operating System and version: Windows Server 2019, Windows Server 2022.