Expected Behavior
In Kubernetes, you can define pods with lifecycle hooks. In particular, you can define PreStop exec lifecycle hooks, which are executed by the kubelet: it execs into the container and runs a particular command. Kubernetes has a suite of E2E tests, including networking tests that exercise the lifecycle hooks, and the expectation is for them to pass consistently.
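For reference, a minimal pod with a PreStop exec hook of the kind the test creates might look like the sketch below (the pod name matches the failing test's pod; the image, sleep command, and `<webserver-ip>` placeholder are assumptions, not taken from the test source):

```shell
# Write a minimal pod manifest with a PreStop exec hook; the hook command
# mirrors the curl the kubelet runs in the failing test. Placeholders:
# the image and <webserver-ip> are illustrative only.
cat <<'EOF' > /tmp/prestop-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-prestop-exec-hook
spec:
  containers:
  - name: main
    image: registry.k8s.io/e2e-test-images/busybox:1.29-2
    command: ["sleep", "3600"]
    lifecycle:
      preStop:
        exec:
          # kubelet execs this inside the container just before stopping it
          command: ["sh", "-c", "curl http://<webserver-ip>:8080/echo?msg=prestop"]
EOF
echo "wrote /tmp/prestop-pod.yaml"
# Apply with: kubectl apply -f /tmp/prestop-pod.yaml
```

When the pod is deleted, the kubelet runs the hook command and only then stops the container, which is why any network interruption in that window surfaces as a FailedPreStopHook event.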
Current Behavior
One of Kubernetes' E2E tests, [sig-node] Container Lifecycle Hook when create a pod with lifecycle hook should execute prestop exec hook properly [NodeConformance] [Conformance], flakes several times per week in CI jobs testing Windows with the Calico CNI. The test spawns a webserver pod and a busybox pod with a PreStop exec lifecycle hook that connects to the webserver pod with curl, and in some cases this curl fails. An analysis of the issue can be seen here: kubernetes/kubernetes#108711 (comment)
Possible Solution
Processing the HNS endpoint / applying the endpoint rules may result in temporary network interruptions. Ideally, that would not be the case. Alternatively, the endpoint should not be updated right before being deleted, or the pod should be given enough time to execute its PreStop hooks, if any.
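Until the underlying interruption is addressed, a workload-side mitigation would be to retry the connection inside the PreStop hook so a short connectivity gap does not fail the whole hook. A minimal sketch (the `retry` helper and the retry count are hypothetical, not part of the actual test):

```shell
# Hypothetical retry wrapper: run a command up to N times with a 1s delay,
# succeeding as soon as the command does. Wrapping the hook's curl in this
# would ride out a transient connect failure during an HNS endpoint update.
retry() {
  n=$1; shift
  i=0
  while [ "$i" -lt "$n" ]; do
    "$@" && return 0
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Example usage in a PreStop hook (IP/port taken from the failing run):
#   retry 5 curl --max-time 5 "http://192.168.28.222:8080/echo?msg=prestop"
```

Note that the retries must fit inside the pod's termination grace period (15s in the log below), otherwise the kubelet kills the container before the hook finishes.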
Steps to Reproduce (for bugs)
1. Deploy a Kubernetes cluster, including Windows nodes.
2. Deploy the Calico CNI.
3. Run the Kubernetes E2E tests, including [sig-node] Container Lifecycle Hook when create a pod with lifecycle hook should execute prestop exec hook properly [NodeConformance] [Conformance] and [sig-node] PreStop should call prestop when killing a pod [Conformance].
Test runs can be seen here: https://testgrid.k8s.io/sig-windows-signal . All the capz jobs use the Calico CNI.
Sample test job failure: [1]
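One way to run just these two tests against an existing cluster is to filter by a ginkgo focus regex; the regex below is an assumption and may need adjusting to the exact test names in your Kubernetes version:

```shell
# Ginkgo focus regex matching the two PreStop tests above (illustrative).
FOCUS='(Container Lifecycle Hook.*prestop exec hook|PreStop should call prestop when killing a pod)'
echo "$FOCUS"

# Invocation sketch -- requires a built e2e.test binary (e.g. from
# kubernetes/kubernetes via `make WHAT=test/e2e/e2e.test`) and a
# KUBECONFIG pointing at the cluster from steps 1-2:
#   ./e2e.test --kubeconfig="$KUBECONFIG" --node-os-distro=windows \
#     --ginkgo.focus="$FOCUS"
```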
The job above contains the test failure and its stderr. In the stderr, we can see:
Expected
<*errors.errorString | 0xc0015f2590>: {
s: "failed to match regexp \"GET /echo\\\\?msg=prestop\" in output \"I0202 05:15:16.488707 14144 log.go:198] Started HTTP server on port 8080\\nI0202 05:15:16.981252 14144 log.go:198] Started UDP server on port 8081\\n\"",
}
to be nil
The error means that the webserver did not receive the expected call from the busybox pod's PreStop lifecycle hook (a curl call).
In the test logs, we can see the Pod events, and it can be seen that the PreStop exec hook failed:
Feb 2 05:16:01.308: INFO: At 2023-02-02 05:15:27 +0000 UTC - event for pod-with-prestop-exec-hook: {kubelet capz-conf-sl5v7} FailedPreStopHook: PreStopHook failed
We can see that the pod was spawned on the node capz-conf-sl5v7 and that the pod name is pod-with-prestop-exec-hook. The artifacts from the job can be accessed in [1]; they contain the kubelet logs, various system logs, and logs from various pods in the kube-system and calico-system namespaces. For example, the capz-conf-sl5v7 node logs can be seen in [2], and its calico-node logs in [3].
In the kubelet logs [4], the following can be seen:
E0202 05:15:27.488428 4684 handlers.go:78] "Exec lifecycle hook for Container in Pod failed" err=<
command 'sh -c curl http://192.168.28.222:8080/echo?msg=prestop' exited with 7: % Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (7) Failed to connect to 192.168.28.222 port 8080: Bad access
> execCommand=[sh -c curl http://192.168.28.222:8080/echo?msg=prestop] containerName="pod-with-prestop-exec-hook" pod="container-lifecycle-hook-5405/pod-with-prestop-exec-hook" message=<
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (7) Failed to connect to 192.168.28.222 port 8080: Bad access
>
E0202 05:15:27.488428 4684 kuberuntime_container.go:598] "PreStop hook failed" err=<
command 'sh -c curl http://192.168.28.222:8080/echo?msg=prestop' exited with 7: % Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (7) Failed to connect to 192.168.28.222 port 8080: Bad access
> pod="container-lifecycle-hook-5405/pod-with-prestop-exec-hook" podUID=ed44495c-a005-4738-a469-2425ff129a60 containerName="pod-with-prestop-exec-hook" containerID="containerd://25272f179f1a3a7a46bc422480cb0f7c5831f52a36c81061c03014c8f6e7e63a"
I0202 05:15:27.488428 4684 kuberuntime_container.go:711] "Killing container with a grace period" pod="container-lifecycle-hook-5405/pod-with-prestop-exec-hook" podUID=ed44495c-a005-4738-a469-2425ff129a60 containerName="pod-with-prestop-exec-hook" containerID="containerd://25272f179f1a3a7a46bc422480cb0f7c5831f52a36c81061c03014c8f6e7e63a" gracePeriod=15
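The `curl: (7)` in the log above is curl's "failed to connect" exit code: the TCP connection to the webserver could not be established at all, as opposed to an HTTP-level error. This can be reproduced locally by connecting to a port with no listener (the port below is an arbitrary choice for illustration):

```shell
# curl exit code 7 ("Failed to connect to host") means the TCP connection
# itself could not be established; here it is reproduced against a local
# port with nothing listening, mirroring the PreStop hook failure above.
curl --silent --max-time 2 http://127.0.0.1:1/
echo "curl exit code: $?"
```

In the failing run, the connection attempt fails the same way even though the webserver pod is up, which points at the node-local datapath (the HNS endpoint) rather than the webserver.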
And in [5], we see the endpoint update happening during almost the same timeframe:
[1] https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-capz-master-containerd-windows/1621001928051789824
[2] https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-capz-master-containerd-windows/1621001928051789824/artifacts/clusters/capz-conf-s0axd4/machines/capz-conf-s0axd4-md-win-7c54cf6856-bh28x/
[3] https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-capz-master-containerd-windows/1621001928051789824/artifacts/clusters/capz-conf-s0axd4/calico-system/calico-node-windows-hwlkc/
[4] https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-capz-master-containerd-windows/1621001928051789824/artifacts/clusters/capz-conf-s0axd4/machines/capz-conf-s0axd4-md-win-7c54cf6856-bh28x/kubelet.log
Context
This issue affects the sig-windows signal in Kubernetes, as seen here: https://testgrid.k8s.io/sig-windows-signal
Your Environment
Calico image: docker.io/sigwindowstools/calico-node:v3.24.5-hostprocess