Failing tests: [sig-storage] CSI Volumes [Driver: csi-hostpath] * #102452
Comments
@marseel: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
/sig storage
I looked at one test case:
I picked that one because we do have the corresponding pod logs. Apparently not all artifacts were uploaded, so some of the corresponding files are missing. Creating the volume and a pod which uses it goes okay, but then the driver log just ends. We don't have pod events because of this if check: kubernetes/test/e2e/storage/utils/pod.go Lines 77 to 82 in a972589

The logic behind not logging pod events during a CI run made sense at the time that code was written, because CSI drivers ran in the test namespace. This was later changed to use a separate driver namespace, so now nothing records the events for the CSI driver pods. A first step to debug this further will therefore be to collect that information. It should tell us why the CSI driver disappears; I suspect pod eviction due to an overloaded cluster. Why that happens now and didn't before is an open question. Do we have information about load in the cluster?

Looking at the driver logs, I see a lot of calls from health monitoring, and for some reason they fail. The health monitor pods themselves run into request throttling. That seems odd. Is it expected that they create enough requests that they need to be throttled?

/cc @xing-yang
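For reference, collecting those events could look roughly like the following client-go sketch. This is not the actual e2e helper code, and the namespace name is a placeholder for whatever per-test driver namespace gets created:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (assumption: running outside the cluster).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Placeholder: the per-test CSI driver namespace.
	namespace := "csi-driver-namespace"

	// List all events in the driver namespace, which would include evictions
	// and other kubelet/scheduler events referencing the driver pods.
	events, err := client.CoreV1().Events(namespace).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, ev := range events.Items {
		fmt.Printf("%s %s/%s: %s %s\n",
			ev.LastTimestamp, ev.InvolvedObject.Kind, ev.InvolvedObject.Name, ev.Reason, ev.Message)
	}
}
```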
/cc
Thanks for helping @pohly
Can we change it? These tests run every day, so if we change it we should have a new run tomorrow.
What is probably worth mentioning is that the cluster has 5k nodes.
Yes: #102526
As seen in kubernetes#102452, we currently don't have pod events for the CSI driver pods because of the different namespace and would need them to determine whether the driver gets evicted. Previously, only changes of the pods were logged. Perhaps even more interesting are the events in the namespace.
Let's look at another instance, this time the simpler
According to that, the health monitor agents again have problems with client-side request throttling. This time, the controller loses leader election because of it. @xing-yang any idea why it gets throttled? What is the controller doing with events? What gets throttled is:
Is the controller really watching all nodes, pods, PVCs, PVs and events in all namespaces in the entire cluster? I can't imagine how that can scale.

My theory that the driver pods get evicted seems to be false, though. There's no indication of that in https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1400422473312243712/artifacts/_sig-storage_CSI_Volumes/_Driver_csi-hostpath_/_Testpattern_Dynamic_PV_default_fs_volumes/should_store_data/pod-event.log

That logging stops at 13:13:52.977, which matches the time when the test itself starts to clean up after the failure. So the driver itself was running. It just never got a

We don't have kubelet logs for this failure in this job, do we? I suspect we don't even know which node to look at.

FWIW, the
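As background on the throttling: client-go applies a client-side rate limiter per rest.Config, and if a controller or sidecar leaves QPS/Burst at the defaults it only takes a modest amount of list/watch and event traffic before requests get delayed. A minimal sketch of where those knobs live (the values shown are illustrative, not what the sidecars actually configure):

```go
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	// client-go throttles requests per rest.Config. If QPS/Burst are left at
	// zero, the client-go defaults (5 QPS, burst 10) apply, which a controller
	// that touches many objects can easily exceed.
	fmt.Printf("configured QPS=%v burst=%v (0 means client-go defaults)\n", config.QPS, config.Burst)

	// Raising the limits avoids the client-side throttling delays at the cost
	// of more apiserver load. Illustrative values only.
	config.QPS = 50
	config.Burst = 100

	_ = kubernetes.NewForConfigOrDie(config)
}
```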
They are not needed for any of the tests and may be causing too much overhead (see kubernetes#102452 (comment)). We already disabled them earlier and then re-enabled them again because it wasn't clear how much overhead they were causing. A recent change in how the sidecars get deployed (kubernetes#102282) seems to have made the situation worse again. There's no logical explanation for that yet, though. (cherry picked from commit 0c2cee5676e64976f9e767f40c4c4750a8eeb11f)
The last log messages in the hostpath driver log are:
I wonder whether we might be missing more recent log output because kubelet started to become unhealthy on the node.
One thing that I just noticed: the csi-external-health-monitor-agent was removed in v0.3.0 of external-health-monitor. We'll see, because it gets removed together with the controller in #102591, which is about to be merged and should be in the next job run.
Besides that, #102282 also updated the csi-external-health-monitor-controller from v0.2.0 to v0.3.0. But a diff between those versions for "cmd pkg/controller" doesn't show many changes, so that shouldn't make a difference.
The health monitor agent was removed in v0.3.0 and the feature was redesigned exactly because of the concern about the watches on all the objects. So if we were deploying the v0.2.0 agent, that could indeed explain the scalability issue.
But it isn't a concern for the controller? The link that I gave above was for that, not the agent.
The agent was also getting deployed before my PR, so that doesn't explain a potentially higher load: https://github.com/pohly/kubernetes/blob/ebd02341c9805086ffc0af1422ed6133b86142f1/test/e2e/testing-manifests/storage-csi/hostpath/hostpath/csi-hostpath-plugin.yaml#L39-L54
One other observation: the PR updated the hostpath driver from a version without health check support (1.4.0) to a version with support (1.7.2), see kubernetes-csi/csi-driver-host-path#210. So whatever code exists in kubelet and the sidecars for this wasn't getting (stress) tested before the PR.
Disabling the health check sidecars didn't make it into today's job run. Does the health check support in kubelet have a feature gate? Is it active in this job?
The kubelet side is feature gated, but there's nothing feature gating the sidecar.
So once we get rid of the sidecar, this whole feature should be unused. We just have to wait another day.
Yes, if the controller has to watch all those objects, that is also a scalability concern. cc @xing-yang
I'm wondering whether we'll still have kubelet code enabled that didn't run before the driver update. For example: kubernetes/pkg/kubelet/server/stats/volume_stat_calculator.go Lines 98 to 102 in 7ed2ed1
Same with kubernetes/pkg/volume/csi/csi_client.go Lines 590 to 620 in 7ed2ed1
The GetMetrics call does not invoke CSI ListVolumes: kubernetes/pkg/volume/csi/csi_metrics.go Line 53 in 7ed2ed1
It is true that we will be calling
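For illustration, the per-volume stats path referenced above boils down to one CSI NodeGetVolumeStats RPC per mounted volume each time kubelet collects volume stats, so a driver that advertises the capability adds steady gRPC traffic. A rough sketch against the CSI spec client; the socket path, volume ID, and volume path are placeholders, and this is not kubelet's actual code:

```go
package main

import (
	"context"
	"fmt"
	"time"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
)

func main() {
	// Placeholder socket; kubelet learns the real one via driver registration.
	conn, err := grpc.Dial("unix:///var/lib/kubelet/plugins/csi-hostpath/csi.sock", grpc.WithInsecure())
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := csi.NewNodeClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// One call like this per mounted volume, each stats collection cycle.
	resp, err := client.NodeGetVolumeStats(ctx, &csi.NodeGetVolumeStatsRequest{
		VolumeId:   "example-volume-id",                       // placeholder
		VolumePath: "/var/lib/kubelet/pods/<uid>/volumes/...", // placeholder
	})
	if err != nil {
		panic(err)
	}
	for _, usage := range resp.Usage {
		fmt.Printf("unit=%v available=%d total=%d used=%d\n",
			usage.Unit, usage.Available, usage.Total, usage.Used)
	}
}
```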
The problem is gone in https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1401147272468631552 after removing the healthcheck controller and agent. We should now bring back the controller. If that still works, we'll know that it was the redundant agent which caused the problem. We won't need to fix anything because the agent is obsolete and wasn't meant to run.
It was disabled together with the agent to avoid test failures in gce-master-scale-correctness (kubernetes#102452). That solved the problem, but we still need to check whether the controller alone works.
See #102627
Yes, the external-health-monitor-controller does watch all PVCs, Pods, and Nodes. This is for the Node Watcher component, which is optional. If a node goes down, the controller reports an event on the PVC saying that the Pods (listed by name) consuming the PVC are running on a failed node. The Node Watcher component is disabled by default, so if it is disabled, I think we should not be watching Pods and Nodes. Currently they are always watched.
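To make the scale concern concrete: watching all PVCs, Pods, and Nodes means the controller lists and then watches every such object cluster-wide, which on a 5k-node cluster is a lot of traffic before it does any useful work. A minimal sketch of that informer pattern (not the actual external-health-monitor code):

```go
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// A cluster-wide shared informer factory: each informer below does an
	// initial LIST of every object of its kind and then keeps a WATCH open.
	// On a large cluster the initial lists alone are expensive and can trip
	// the client-side rate limiter.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	factory.Core().V1().Pods().Informer()
	factory.Core().V1().Nodes().Informer()
	factory.Core().V1().PersistentVolumeClaims().Informer()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)

	// Block until the full caches of pods, nodes, and PVCs are built.
	factory.WaitForCacheSync(stop)
}
```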
We had two good runs (06-06, 06-07) with both sidecars disabled. Today's run (06-08) was done with just the health-check-controller, and in that run the hostpath tests and several others failed again: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1402234386643947520

The "node not ready" events are also back. I'll disable the controller again.
The last 2 runs have succeeded after disabling the health monitor sidecars. We'll investigate scalability of the health monitoring feature separately.
/close
@msau42: Closing this issue. In response to this:
Which jobs are failing:
gce-master-scale-correctness
Which test(s) are failing:
Around 50 tests with the prefix "Kubernetes e2e suite: [sig-storage] CSI Volumes [Driver: csi-hostpath]"
Example tests:
Since when has it been failing:
05-27-2021
Testgrid link:
https://testgrid.k8s.io/sig-release-master-informing#gce-master-scale-correctness
Reason for failure:
I've checked logs for pod "pod-bd3fa2c8-2279-446b-98e5-a268aa308126" in test "Kubernetes e2e suite: [sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (default fs)(allowExpansion)] volume-expand should resize volume when PVC is edited while pod is using it".
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1398972946927587328
kubelet logs:
systemd logs:
Looks like the reason why it fails is the error below (a sketch for inspecting the node's VolumesInUse status follows after this report):
Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-9096038e-4455-4c1c-a596-5b10a03ffb47\"
Anything else we need to know:
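As a small debugging aid for the VolumesInUse error quoted above, this is roughly how one could inspect a node's status.volumesInUse with client-go; the node name is a placeholder for whichever node the test pod landed on:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Placeholder node name; use the node the failing test pod was scheduled to.
	node, err := client.CoreV1().Nodes().Get(context.TODO(), "example-node", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// The e2e failure complains that the PV never shows up in this list, which
	// the attach/detach controller and kubelet reconcile as volumes are
	// attached and mounted.
	for _, v := range node.Status.VolumesInUse {
		fmt.Println(v)
	}
}
```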