
Failing tests: [sig-storage] CSI Volumes [Driver: csi-hostpath] * #102452

Closed
marseel opened this issue May 31, 2021 · 27 comments
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments


marseel commented May 31, 2021

Which jobs are failing:

gce-master-scale-correctness

Which test(s) are failing:

Around 50 tests with the prefix "Kubernetes e2e suite: [sig-storage] CSI Volumes [Driver: csi-hostpath] "
Example tests:

Kubernetes e2e suite: [sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Generic Ephemeral-volume (default fs) (late-binding)] ephemeral should create read-only inline ephemeral volume
Kubernetes e2e suite: [sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (default fs)] subPath should support readOnly directory specified in the volumeMount
Kubernetes e2e suite: [sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (default fs)(allowExpansion)] volume-expand should resize volume when PVC is edited while pod is using it 
Kubernetes e2e suite: [sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (default fs)] subPath should support restarting containers using file as subpath [Slow][LinuxOnly] 
Kubernetes e2e suite: [sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (default fs)] subPath should fail if subpath directory is outside the volume [Slow][LinuxOnly]
...

Since when has it been failing:

05-27-2021

Testgrid link:

https://testgrid.k8s.io/sig-release-master-informing#gce-master-scale-correctness

Reason for failure:

I've checked logs for pod "pod-bd3fa2c8-2279-446b-98e5-a268aa308126" in test "Kubernetes e2e suite: [sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (default fs)(allowExpansion)] volume-expand should resize volume when PVC is edited while pod is using it".
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1398972946927587328

kubelet logs:

./kubelet.log:523:I0530 12:24:09.838798    2113 kubelet.go:1938] "SyncLoop ADD" source="api" pods=[volume-expand-235/pod-bd3fa2c8-2279-446b-98e5-a268aa308126]
./kubelet.log:527:I0530 12:24:10.052174    2113 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pvc-9096038e-4455-4c1c-a596-5b10a03ffb47\" (UniqueName: \"kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2\") pod \"pod-bd3fa2c8-2279-446b-98e5-a268aa308126\" (UID: \"49cbe128-3778-4623-b37a-f3980e709f5b\") "
./kubelet.log:528:E0530 12:24:10.052298    2113 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2 podName: nodeName:}" failed. No retries permitted until 2021-05-30 12:24:10.55116941 +0000 UTC m=+858.587378827 (durationBeforeRetry 500ms). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-9096038e-4455-4c1c-a596-5b10a03ffb47\" (UniqueName: \"kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2\") pod \"pod-bd3fa2c8-2279-446b-98e5-a268aa308126\" (UID: \"49cbe128-3778-4623-b37a-f3980e709f5b\") "
./kubelet.log:529:I0530 12:24:10.052358    2113 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-fxcvv\" (UniqueName: \"kubernetes.io/projected/49cbe128-3778-4623-b37a-f3980e709f5b-kube-api-access-fxcvv\") pod \"pod-bd3fa2c8-2279-446b-98e5-a268aa308126\" (UID: \"49cbe128-3778-4623-b37a-f3980e709f5b\") "
./kubelet.log:530:I0530 12:24:10.153707    2113 reconciler.go:269] "operationExecutor.MountVolume started for volume \"kube-api-access-fxcvv\" (UniqueName: \"kubernetes.io/projected/49cbe128-3778-4623-b37a-f3980e709f5b-kube-api-access-fxcvv\") pod \"pod-bd3fa2c8-2279-446b-98e5-a268aa308126\" (UID: \"49cbe128-3778-4623-b37a-f3980e709f5b\") "
./kubelet.log:531:I0530 12:24:10.198699    2113 operation_generator.go:698] MountVolume.SetUp succeeded for volume "kube-api-access-fxcvv" (UniqueName: "kubernetes.io/projected/49cbe128-3778-4623-b37a-f3980e709f5b-kube-api-access-fxcvv") pod "pod-bd3fa2c8-2279-446b-98e5-a268aa308126" (UID: "49cbe128-3778-4623-b37a-f3980e709f5b")
./kubelet.log:532:I0530 12:24:10.556368    2113 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pvc-9096038e-4455-4c1c-a596-5b10a03ffb47\" (UniqueName: \"kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2\") pod \"pod-bd3fa2c8-2279-446b-98e5-a268aa308126\" (UID: \"49cbe128-3778-4623-b37a-f3980e709f5b\") "
./kubelet.log:533:E0530 12:24:10.556764    2113 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2 podName: nodeName:}" failed. No retries permitted until 2021-05-30 12:24:11.556689246 +0000 UTC m=+859.592898644 (durationBeforeRetry 1s). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-9096038e-4455-4c1c-a596-5b10a03ffb47\" (UniqueName: \"kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2\") pod \"pod-bd3fa2c8-2279-446b-98e5-a268aa308126\" (UID: \"49cbe128-3778-4623-b37a-f3980e709f5b\") "
./kubelet.log:534:I0530 12:24:11.574846    2113 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pvc-9096038e-4455-4c1c-a596-5b10a03ffb47\" (UniqueName: \"kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2\") pod \"pod-bd3fa2c8-2279-446b-98e5-a268aa308126\" (UID: \"49cbe128-3778-4623-b37a-f3980e709f5b\") "
./kubelet.log:535:E0530 12:24:11.575005    2113 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2 podName: nodeName:}" failed. No retries permitted until 2021-05-30 12:24:13.574938703 +0000 UTC m=+861.611148093 (durationBeforeRetry 2s). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-9096038e-4455-4c1c-a596-5b10a03ffb47\" (UniqueName: \"kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2\") pod \"pod-bd3fa2c8-2279-446b-98e5-a268aa308126\" (UID: \"49cbe128-3778-4623-b37a-f3980e709f5b\") "
./kubelet.log:536:I0530 12:24:13.612979    2113 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pvc-9096038e-4455-4c1c-a596-5b10a03ffb47\" (UniqueName: \"kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2\") pod \"pod-bd3fa2c8-2279-446b-98e5-a268aa308126\" (UID: \"49cbe128-3778-4623-b37a-f3980e709f5b\") "
./kubelet.log:537:E0530 12:24:13.651348    2113 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2 podName: nodeName:}" failed. No retries permitted until 2021-05-30 12:24:17.651267361 +0000 UTC m=+865.687476745 (durationBeforeRetry 4s). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-9096038e-4455-4c1c-a596-5b10a03ffb47\" (UniqueName: \"kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2\") pod \"pod-bd3fa2c8-2279-446b-98e5-a268aa308126\" (UID: \"49cbe128-3778-4623-b37a-f3980e709f5b\") "
./kubelet.log:539:I0530 12:24:17.977866    2113 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pvc-9096038e-4455-4c1c-a596-5b10a03ffb47\" (UniqueName: \"kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2\") pod \"pod-bd3fa2c8-2279-446b-98e5-a268aa308126\" (UID: \"49cbe128-3778-4623-b37a-f3980e709f5b\") "
./kubelet.log:540:E0530 12:24:17.978407    2113 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2 podName: nodeName:}" failed. No retries permitted until 2021-05-30 12:24:25.978353055 +0000 UTC m=+874.014562445 (durationBeforeRetry 8s). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-9096038e-4455-4c1c-a596-5b10a03ffb47\" (UniqueName: \"kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2\") pod \"pod-bd3fa2c8-2279-446b-98e5-a268aa308126\" (UID: \"49cbe128-3778-4623-b37a-f3980e709f5b\") "
./kubelet.log:637:I0530 13:38:17.842036    2113 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pvc-9096038e-4455-4c1c-a596-5b10a03ffb47\" (UniqueName: \"kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2\") pod \"pod-bd3fa2c8-2279-446b-98e5-a268aa308126\" (UID: \"49cbe128-3778-4623-b37a-f3980e709f5b\") "
./kubelet.log:640:E0530 13:38:21.236943    2113 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2 podName: nodeName:}" failed. No retries permitted until 2021-05-30 13:38:34.739730301 +0000 UTC m=+5322.775939691 (durationBeforeRetry 16s). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-9096038e-4455-4c1c-a596-5b10a03ffb47\" (UniqueName: \"kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2\") pod \"pod-bd3fa2c8-2279-446b-98e5-a268aa308126\" (UID: \"49cbe128-3778-4623-b37a-f3980e709f5b\") "
./kubelet.log:667:E0530 13:38:22.336502    2113 kubelet.go:1707] "Unable to attach or mount volumes for pod; skipping pod" err="unmounted volumes=[volume1], unattached volumes=[volume1 kube-api-access-fxcvv]: timed out waiting for the condition" pod="volume-expand-235/pod-bd3fa2c8-2279-446b-98e5-a268aa308126"
./kubelet.log:668:E0530 13:38:22.337195    2113 pod_workers.go:190] "Error syncing pod, skipping" err="unmounted volumes=[volume1], unattached volumes=[volume1 kube-api-access-fxcvv]: timed out waiting for the condition" pod="volume-expand-235/pod-bd3fa2c8-2279-446b-98e5-a268aa308126" podUID=49cbe128-3778-4623-b37a-f3980e709f5b
./kubelet.log:1091:I0530 13:38:41.638765    6839 kubelet.go:1938] "SyncLoop ADD" source="api" pods=[kube-system/metadata-proxy-v0.1-wxc2j kube-system/kube-proxy-gce-scale-cluster-minion-group-1-clw2 volume-expand-235-7191/csi-hostpathplugin-0 volume-expand-235/pod-bd3fa2c8-2279-446b-98e5-a268aa308126]
./kubelet.log:1319:I0530 13:38:50.594330    6839 kubelet.go:1954] "SyncLoop DELETE" source="api" pods=[volume-expand-235/pod-bd3fa2c8-2279-446b-98e5-a268aa308126]
./kubelet.log:1320:I0530 13:38:50.596477    6839 kubelet.go:1948] "SyncLoop REMOVE" source="api" pods=[volume-expand-235/pod-bd3fa2c8-2279-446b-98e5-a268aa308126]
./kubelet.log:1322:I0530 13:38:50.597705    6839 kubelet.go:2157] "Failed to delete pod" pod="volume-expand-235/pod-bd3fa2c8-2279-446b-98e5-a268aa308126" err="pod not found"

systemd logs:

./systemd.log:2452:May 30 12:24:09.839101 gce-scale-cluster-minion-group-1-clw2 kubelet[2113]: I0530 12:24:09.838798    2113 kubelet.go:1938] "SyncLoop ADD" source="api" pods=[volume-expand-235/pod-bd3fa2c8-2279-446b-98e5-a268aa308126]
./systemd.log:2456:May 30 12:24:10.052465 gce-scale-cluster-minion-group-1-clw2 kubelet[2113]: I0530 12:24:10.052174    2113 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pvc-9096038e-4455-4c1c-a596-5b10a03ffb47\" (UniqueName: \"kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2\") pod \"pod-bd3fa2c8-2279-446b-98e5-a268aa308126\" (UID: \"49cbe128-3778-4623-b37a-f3980e709f5b\") "
./systemd.log:2457:May 30 12:24:10.052465 gce-scale-cluster-minion-group-1-clw2 kubelet[2113]: E0530 12:24:10.052298    2113 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2 podName: nodeName:}" failed. No retries permitted until 2021-05-30 12:24:10.55116941 +0000 UTC m=+858.587378827 (durationBeforeRetry 500ms). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-9096038e-4455-4c1c-a596-5b10a03ffb47\" (UniqueName: \"kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2\") pod \"pod-bd3fa2c8-2279-446b-98e5-a268aa308126\" (UID: \"49cbe128-3778-4623-b37a-f3980e709f5b\") "
./systemd.log:2458:May 30 12:24:10.052465 gce-scale-cluster-minion-group-1-clw2 kubelet[2113]: I0530 12:24:10.052358    2113 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-fxcvv\" (UniqueName: \"kubernetes.io/projected/49cbe128-3778-4623-b37a-f3980e709f5b-kube-api-access-fxcvv\") pod \"pod-bd3fa2c8-2279-446b-98e5-a268aa308126\" (UID: \"49cbe128-3778-4623-b37a-f3980e709f5b\") "
./systemd.log:2459:May 30 12:24:10.154040 gce-scale-cluster-minion-group-1-clw2 kubelet[2113]: I0530 12:24:10.153707    2113 reconciler.go:269] "operationExecutor.MountVolume started for volume \"kube-api-access-fxcvv\" (UniqueName: \"kubernetes.io/projected/49cbe128-3778-4623-b37a-f3980e709f5b-kube-api-access-fxcvv\") pod \"pod-bd3fa2c8-2279-446b-98e5-a268aa308126\" (UID: \"49cbe128-3778-4623-b37a-f3980e709f5b\") "
./systemd.log:2460:May 30 12:24:10.198744 gce-scale-cluster-minion-group-1-clw2 kubelet[2113]: I0530 12:24:10.198699    2113 operation_generator.go:698] MountVolume.SetUp succeeded for volume "kube-api-access-fxcvv" (UniqueName: "kubernetes.io/projected/49cbe128-3778-4623-b37a-f3980e709f5b-kube-api-access-fxcvv") pod "pod-bd3fa2c8-2279-446b-98e5-a268aa308126" (UID: "49cbe128-3778-4623-b37a-f3980e709f5b")
./systemd.log:2461:May 30 12:24:10.556483 gce-scale-cluster-minion-group-1-clw2 kubelet[2113]: I0530 12:24:10.556368    2113 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pvc-9096038e-4455-4c1c-a596-5b10a03ffb47\" (UniqueName: \"kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2\") pod \"pod-bd3fa2c8-2279-446b-98e5-a268aa308126\" (UID: \"49cbe128-3778-4623-b37a-f3980e709f5b\") "
./systemd.log:2462:May 30 12:24:10.556796 gce-scale-cluster-minion-group-1-clw2 kubelet[2113]: E0530 12:24:10.556764    2113 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2 podName: nodeName:}" failed. No retries permitted until 2021-05-30 12:24:11.556689246 +0000 UTC m=+859.592898644 (durationBeforeRetry 1s). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-9096038e-4455-4c1c-a596-5b10a03ffb47\" (UniqueName: \"kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2\") pod \"pod-bd3fa2c8-2279-446b-98e5-a268aa308126\" (UID: \"49cbe128-3778-4623-b37a-f3980e709f5b\") "
./systemd.log:2463:May 30 12:24:11.574902 gce-scale-cluster-minion-group-1-clw2 kubelet[2113]: I0530 12:24:11.574846    2113 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pvc-9096038e-4455-4c1c-a596-5b10a03ffb47\" (UniqueName: \"kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2\") pod \"pod-bd3fa2c8-2279-446b-98e5-a268aa308126\" (UID: \"49cbe128-3778-4623-b37a-f3980e709f5b\") "
./systemd.log:2464:May 30 12:24:11.575558 gce-scale-cluster-minion-group-1-clw2 kubelet[2113]: E0530 12:24:11.575005    2113 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2 podName: nodeName:}" failed. No retries permitted until 2021-05-30 12:24:13.574938703 +0000 UTC m=+861.611148093 (durationBeforeRetry 2s). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-9096038e-4455-4c1c-a596-5b10a03ffb47\" (UniqueName: \"kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2\") pod \"pod-bd3fa2c8-2279-446b-98e5-a268aa308126\" (UID: \"49cbe128-3778-4623-b37a-f3980e709f5b\") "
./systemd.log:2474:May 30 12:24:13.613019 gce-scale-cluster-minion-group-1-clw2 kubelet[2113]: I0530 12:24:13.612979    2113 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pvc-9096038e-4455-4c1c-a596-5b10a03ffb47\" (UniqueName: \"kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2\") pod \"pod-bd3fa2c8-2279-446b-98e5-a268aa308126\" (UID: \"49cbe128-3778-4623-b37a-f3980e709f5b\") "
./systemd.log:2475:May 30 12:24:13.662561 gce-scale-cluster-minion-group-1-clw2 kubelet[2113]: E0530 12:24:13.651348    2113 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2 podName: nodeName:}" failed. No retries permitted until 2021-05-30 12:24:17.651267361 +0000 UTC m=+865.687476745 (durationBeforeRetry 4s). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-9096038e-4455-4c1c-a596-5b10a03ffb47\" (UniqueName: \"kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2\") pod \"pod-bd3fa2c8-2279-446b-98e5-a268aa308126\" (UID: \"49cbe128-3778-4623-b37a-f3980e709f5b\") "
./systemd.log:2477:May 30 12:24:17.977899 gce-scale-cluster-minion-group-1-clw2 kubelet[2113]: I0530 12:24:17.977866    2113 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pvc-9096038e-4455-4c1c-a596-5b10a03ffb47\" (UniqueName: \"kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2\") pod \"pod-bd3fa2c8-2279-446b-98e5-a268aa308126\" (UID: \"49cbe128-3778-4623-b37a-f3980e709f5b\") "
./systemd.log:2478:May 30 12:24:18.059339 gce-scale-cluster-minion-group-1-clw2 kubelet[2113]: E0530 12:24:17.978407    2113 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2 podName: nodeName:}" failed. No retries permitted until 2021-05-30 12:24:25.978353055 +0000 UTC m=+874.014562445 (durationBeforeRetry 8s). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-9096038e-4455-4c1c-a596-5b10a03ffb47\" (UniqueName: \"kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2\") pod \"pod-bd3fa2c8-2279-446b-98e5-a268aa308126\" (UID: \"49cbe128-3778-4623-b37a-f3980e709f5b\") "
./systemd.log:2777:May 30 13:38:20.689877 gce-scale-cluster-minion-group-1-clw2 kubelet[2113]: I0530 13:38:17.842036    2113 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pvc-9096038e-4455-4c1c-a596-5b10a03ffb47\" (UniqueName: \"kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2\") pod \"pod-bd3fa2c8-2279-446b-98e5-a268aa308126\" (UID: \"49cbe128-3778-4623-b37a-f3980e709f5b\") "
./systemd.log:2780:May 30 13:38:21.236983 gce-scale-cluster-minion-group-1-clw2 kubelet[2113]: E0530 13:38:21.236943    2113 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2 podName: nodeName:}" failed. No retries permitted until 2021-05-30 13:38:34.739730301 +0000 UTC m=+5322.775939691 (durationBeforeRetry 16s). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-9096038e-4455-4c1c-a596-5b10a03ffb47\" (UniqueName: \"kubernetes.io/csi/csi-hostpath-volume-expand-235^eeb4e7d3-c141-11eb-9364-4a26c59411e2\") pod \"pod-bd3fa2c8-2279-446b-98e5-a268aa308126\" (UID: \"49cbe128-3778-4623-b37a-f3980e709f5b\") "
./systemd.log:2807:May 30 13:38:22.337063 gce-scale-cluster-minion-group-1-clw2 kubelet[2113]: E0530 13:38:22.336502    2113 kubelet.go:1707] "Unable to attach or mount volumes for pod; skipping pod" err="unmounted volumes=[volume1], unattached volumes=[volume1 kube-api-access-fxcvv]: timed out waiting for the condition" pod="volume-expand-235/pod-bd3fa2c8-2279-446b-98e5-a268aa308126"
./systemd.log:2808:May 30 13:38:22.337235 gce-scale-cluster-minion-group-1-clw2 kubelet[2113]: E0530 13:38:22.337195    2113 pod_workers.go:190] "Error syncing pod, skipping" err="unmounted volumes=[volume1], unattached volumes=[volume1 kube-api-access-fxcvv]: timed out waiting for the condition" pod="volume-expand-235/pod-bd3fa2c8-2279-446b-98e5-a268aa308126" podUID=49cbe128-3778-4623-b37a-f3980e709f5b
./systemd.log:3341:May 30 13:38:41.638822 gce-scale-cluster-minion-group-1-clw2 kubelet[6839]: I0530 13:38:41.638765    6839 kubelet.go:1938] "SyncLoop ADD" source="api" pods=[kube-system/metadata-proxy-v0.1-wxc2j kube-system/kube-proxy-gce-scale-cluster-minion-group-1-clw2 volume-expand-235-7191/csi-hostpathplugin-0 volume-expand-235/pod-bd3fa2c8-2279-446b-98e5-a268aa308126]
./systemd.log:3696:May 30 13:38:50.594374 gce-scale-cluster-minion-group-1-clw2 kubelet[6839]: I0530 13:38:50.594330    6839 kubelet.go:1954] "SyncLoop DELETE" source="api" pods=[volume-expand-235/pod-bd3fa2c8-2279-446b-98e5-a268aa308126]
./systemd.log:3697:May 30 13:38:50.596510 gce-scale-cluster-minion-group-1-clw2 kubelet[6839]: I0530 13:38:50.596477    6839 kubelet.go:1948] "SyncLoop REMOVE" source="api" pods=[volume-expand-235/pod-bd3fa2c8-2279-446b-98e5-a268aa308126]
./systemd.log:3699:May 30 13:38:50.597755 gce-scale-cluster-minion-group-1-clw2 kubelet[6839]: I0530 13:38:50.597705    6839 kubelet.go:2157] "Failed to delete pod" pod="volume-expand-235/pod-bd3fa2c8-2279-446b-98e5-a268aa308126" err="pod not found"

It looks like the reason for the failure is: Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-9096038e-4455-4c1c-a596-5b10a03ffb47\"
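
For context, a minimal, hedged sketch of the check behind that error: kubelet's volume manager publishes the volumes it uses in node.status.volumesInUse, and the mount path waits until the volume shows up there before proceeding. The type and function names below are simplified stand-ins, not the actual kubelet code (the real check lives in the operation executor).

package main

import "fmt"

// volumeToMount is a reduced stand-in for the per-volume state kubelet's volume
// manager keeps; ReportedInUse becomes true once the volume appears in the
// node's status.volumesInUse list, which kubelet itself publishes.
type volumeToMount struct {
	VolumeName    string
	ReportedInUse bool
}

// verifyControllerAttachedVolume sketches the check that produces the error in the
// logs above; in kubelet it is retried with exponential backoff (500ms, 1s, 2s, 4s, ...),
// which matches the durationBeforeRetry values seen in kubelet.log.
func verifyControllerAttachedVolume(v volumeToMount) error {
	if !v.ReportedInUse {
		return fmt.Errorf("Volume has not been added to the list of VolumesInUse in the node's volume status for volume %q", v.VolumeName)
	}
	return nil
}

func main() {
	err := verifyControllerAttachedVolume(volumeToMount{VolumeName: "pvc-9096038e-4455-4c1c-a596-5b10a03ffb47"})
	fmt.Println(err)
}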

Anything else we need to know:

@marseel marseel added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label May 31, 2021
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 31, 2021
@k8s-ci-robot

@marseel: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


marseel commented May 31, 2021

/sig storage
/assign @msau42

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 31, 2021
@marseel marseel changed the title Failing test: [sig-storage] CSI Volumes [Driver: csi-hostpath] * Failing tests: [sig-storage] CSI Volumes [Driver: csi-hostpath] * May 31, 2021

msau42 commented Jun 1, 2021

This may coincide with #102282

/assign @pohly
to help take a look


pohly commented Jun 2, 2021

I looked at one test case:

Kubernetes e2e suite: [sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (block volmode)] volumes should store data

I picked that one because we do have the corresponding pod logs:
https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1397885771649650688/artifacts/_sig-storage_CSI_Volumes/_Driver_csi-hostpath_/_Testpattern_Dynamic_PV_block_volmode_volumes/should_store_data/

Apparently not all artifacts were uploaded; the corresponding default fs test case has no such logs.

Creating the volume and a pod that uses it succeeds. But then the driver log ends at 13:04:01.592264, which is much too early: the test kept running until at least 13:18:59.677. The sidecars end at the same time as the driver, without a "lost connection" error for the driver. This looks like the pod was evicted, not like a driver failure.

We don't have pod events because of this if check:

// pod events are something that the framework already collects itself
// after a failed test. Logging them live is only useful for interactive
// debugging, not when we collect reports.
if framework.TestContext.ReportDir == "" {
	podlogs.WatchPods(ctx, cs, ns, ginkgo.GinkgoWriter)
}

The logic behind not logging pod events during a CI run made sense when that code was written, because CSI drivers ran in the test namespace. That was later changed to use a separate driver namespace, so now nothing records the events for the CSI driver pods.

A first step to debug this further will therefore be to collect that information. It should tell us why the CSI driver disappears; I suspect pod eviction due to an overloaded cluster. A rough sketch of the kind of change needed is below.
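
A sketch only, extending the snippet quoted above: driverNamespace is an assumed variable name here, and the eventual fix may look different.

// Keep collecting pod logs/events in the test namespace for interactive runs,
// but additionally watch the separate CSI driver namespace so that CI runs
// record what happens to the driver pods (e.g. evictions).
if framework.TestContext.ReportDir == "" {
	podlogs.WatchPods(ctx, cs, ns, ginkgo.GinkgoWriter)
}
podlogs.WatchPods(ctx, cs, driverNamespace, ginkgo.GinkgoWriter)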

Why that happens now and didn't before is an open question. Do we have information about load in the cluster?

Looking at the driver logs, I see a lot of calls from health monitoring. For some reason they fail (Could not get file information from /var/lib/kubelet/pods/0f6e7e47-4e98-4ca6-9bf0-be946370386c/volumes/kubernetes.io~csi/pvc-12817128-4e7e-420c-ab42-131514d05a85/mount, https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1397885771649650688/artifacts/_sig-storage_CSI_Volumes/_Driver_csi-hostpath_/_Testpattern_Dynamic_PV_block_volmode_volumes/should_store_data/csi-hostpathplugin-0-hostpath.log).

The health monitor pods themselves run into request throttling:
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1397885771649650688/artifacts/_sig-storage_CSI_Volumes/_Driver_csi-hostpath_/_Testpattern_Dynamic_PV_block_volmode_volumes/should_store_data/csi-hostpathplugin-0-csi-external-health-monitor-agent.log
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1397885771649650688/artifacts/_sig-storage_CSI_Volumes/_Driver_csi-hostpath_/_Testpattern_Dynamic_PV_block_volmode_volumes/should_store_data/csi-hostpathplugin-0-csi-external-health-monitor-controller.log

That seems odd. Is it expected that they create enough requests that they need to be throttled?

/cc @xing-yang


aojea commented Jun 2, 2021

/cc


marseel commented Jun 2, 2021

Thanks for helping @pohly

The logic behind not logging pod events during a CI run made sense at the time of writing that code because CSI drivers ran in the test namespace. This then was changed later to use a separate driver namespace, so now nothing records the events for the CSI driver pods.

A first step to debug this further therefore will be to collect that information. It should tell us why the CSI driver disappears. I suspect pod eviction due to an overloaded cluster.

Can we change it? These tests run every day, so if we change it, we should have a new run tomorrow.

Why that happens now and didn't before is open. Do we have information about load in the cluster?

It is probably worth mentioning that the cluster has 5k nodes.


pohly commented Jun 2, 2021

A first step to debug this further therefore will be to collect that information. It should tell us why the CSI driver disappears. I suspect pod eviction due to an overloaded cluster.

Can we change it? These tests run every day, so if we change it, we should have a new run tomorrow.

Yes: #102526

pohly added a commit to pohly/kubernetes that referenced this issue Jun 2, 2021
As seen in kubernetes#102452, we
currently don't have pod events for the CSI driver pods because of the
different namespace and would need them to determine whether the
driver gets evicted.

Previously, only changes to the pods were logged. Perhaps even more
interesting are events in the namespace.

pohly commented Jun 4, 2021

Let's look at another instance, this time the simpler CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (default fs)] volumes should store data
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1400422473312243712

STEP: Deleting pod hostpath-injector in namespace volume-6146
Jun  3 12:53:20.654: INFO: Waiting for pod hostpath-injector to disappear
...
Jun  3 13:03:45.446: INFO: Pod hostpath-client still exists
Jun  3 13:03:45.476: FAIL: Failed to create client pod: timed out waiting for the condition

Driver logs: https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1400422473312243712/artifacts/_sig-storage_CSI_Volumes/_Driver_csi-hostpath_/_Testpattern_Dynamic_PV_default_fs_volumes/should_store_data/

According to that, the health monitor agents again have problems with client-side request throttling. This time, the controller loses its leadership election because of it. @xing-yang any idea why it gets throttled?

What is the controller doing with events? What gets throttled is:

I0603 12:51:15.083745       1 request.go:600] Waited for 83.705872ms due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/events?continue=....

This looks scary:
https://github.com/kubernetes-csi/external-health-monitor/blob/3b4d28ec4317769a85319401d045f7caa807fa64/cmd/csi-external-health-monitor-controller/main.go#L210-L211

Is the controller really watching all nodes, pods, PVCs, PVs and events in all namespaces in the entire cluster? I can't imagine how that can scale.
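
For illustration, a minimal sketch of why a controller built this way hits client-side throttling on a 5k-node cluster. The QPS/Burst values are client-go defaults (unless the sidecar overrides them), and this is not the health monitor's actual code.

package main

import (
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	// client-go defaults unless overridden: 5 requests/s with a burst of 10. Requests
	// beyond that budget are delayed, which produces the "Waited for ... due to
	// client-side throttling" messages seen in the sidecar logs.
	cfg.QPS = 5
	cfg.Burst = 10
	client := kubernetes.NewForConfigOrDie(cfg)

	// Cluster-wide informers LIST and then WATCH every object of each type across all
	// namespaces. On a 5k-node cluster the initial paginated LISTs of events, pods and
	// nodes alone can consume the whole rate-limit budget for a long time.
	factory := informers.NewSharedInformerFactory(client, 0)
	factory.Core().V1().Events().Informer()
	factory.Core().V1().Pods().Informer()
	factory.Core().V1().Nodes().Informer()
	factory.Core().V1().PersistentVolumeClaims().Informer()
	factory.Core().V1().PersistentVolumes().Informer()

	stopCh := make(chan struct{})
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
}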

My theory that the driver pods get evicted seems to be false, though. There's no indication of that in https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1400422473312243712/artifacts/_sig-storage_CSI_Volumes/_Driver_csi-hostpath_/_Testpattern_Dynamic_PV_default_fs_volumes/should_store_data/pod-event.log

That logging stops at 13:13:52.977, which matches the time when the test itself starts to clean up after the failure. So the driver itself was running. It just never got a NodeUnpublishVolume call from kubelet, which looks like #101911

We don't have kubelet logs for this failure in this job, do we? I suspect we don't even know which node to look at.

FWIW, the pod-event.log has one interesting event regarding the node:

12:54:38.041 event: v1/Pod csi-hostpathplugin-0: node-controller Warning: Node is not ready (2021-06-03 12:52:17 +0000 UTC - 2021-06-03 12:54:38 +0000 UTC)

pohly added a commit to pohly/kubernetes that referenced this issue Jun 4, 2021
They are not needed for any of the tests and may be causing too much
overhead (see
kubernetes#102452 (comment)).

We already disabled them earlier and then re-enabled them again
because it wasn't clear how much overhead they were causing. A recent
change in how the sidecars get
deployed (kubernetes#102282) seems
to have made the situation worse again. There's no logical explanation
for that yet, though.

(cherry picked from commit 0c2cee5676e64976f9e767f40c4c4750a8eeb11f)

pohly commented Jun 4, 2021

The last log message in the hostpath driver log is:

I0603 12:53:49.055110       1 server.go:159] gRPCCall: {"Method":"/csi.v1.Identity/Probe","Request":{},"Response":{},"Error":"","FullError":null}

I wonder whether we might be missing more recent log output because kubelet started to become unhealthy on the node.


pohly commented Jun 4, 2021

One thing that I just noticed is that csi-external-health-monitor-agent was removed in v0.3.0 of external-health-monitor.
csi-driver-host-path and thus Kubernetes still deploy csi-external-health-monitor-agent v0.2.0, which is obsolete, right? Could it cause harm?

We'll see because it gets removed together with the controller in #102591, which is about to be merged and should be in the next job run.


pohly commented Jun 4, 2021

Besides that, #102282 also updated the csi-external-health-monitor-controller from v0.2.0 to v0.3.0. But a diff between those versions for "cmd pkg/controller" doesn't show many changes, so that shouldn't make a difference.


msau42 commented Jun 4, 2021

The health monitor agent was removed in v0.3.0 and the feature was redesigned exactly because of the concern about the watches on all the objects.

So if we were deploying the v0.2.0 agent, that could indeed explain the scalability issue.


pohly commented Jun 4, 2021

The health monitor agent was removed in v0.3.0 and the feature was redesigned exactly because of the concern about the watches on all the objects.

But it isn't a concern for the controller? The link that I gave above was for that, not the agent.

So if we were deploying the v0.2.0 agent that could indeed explain the scalability issue

The agent was also getting deployed before my PR, so that doesn't explain a potentially higher load: https://github.com/pohly/kubernetes/blob/ebd02341c9805086ffc0af1422ed6133b86142f1/test/e2e/testing-manifests/storage-csi/hostpath/hostpath/csi-hostpath-plugin.yaml#L39-L54


pohly commented Jun 4, 2021

One other observation: the PR updated the hostpath driver from a version without health check support (1.4.0) to a version with support (1.7.2) - kubernetes-csi/csi-driver-host-path#210

So whatever code exists in kubelet and the sidecars for this wasn't getting (stress) tested before the PR.


pohly commented Jun 4, 2021

Disabling the health check sidecars didn't make it into today's job run.

Does the health check support in kubelet have a feature gate? Is it active in this job?


msau42 commented Jun 4, 2021

The kubelet side is feature gated, but there's nothing feature gating the sidecar.
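
For reference, a hedged sketch of what the kubelet-side gating typically looks like. I'm assuming the gate in question is CSIVolumeHealth (alpha, off by default), so unless the job enables it explicitly, the kubelet code path should be inactive; the surrounding program is illustrative, not kubelet code.

package main

import (
	"fmt"

	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/kubernetes/pkg/features"
)

func main() {
	// Assumption: CSIVolumeHealth is the gate covering kubelet-side volume health
	// support; alpha gates default to false unless --feature-gates enables them.
	fmt.Println("CSIVolumeHealth enabled:", utilfeature.DefaultFeatureGate.Enabled(features.CSIVolumeHealth))
}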


pohly commented Jun 4, 2021

So once we get rid of the sidecar, this whole feature should be unused. We just have to wait another day.


msau42 commented Jun 4, 2021

But it isn't a concern for the controller? The link that I gave above was for that, not the agent.

Yes, if the controller has to watch all those objects, that is also a scalability concern. cc @xing-yang


pohly commented Jun 4, 2021

I'm wondering whether we'll still have kubelet code enabled that didn't run before the driver update. For example:

volumes, found := s.statsProvider.ListVolumesForPod(s.pod.UID)
blockVolumes, bvFound := s.statsProvider.ListBlockVolumesForPod(s.pod.UID)
if !found && !bvFound {
	return
}

ListVolumes was not implemented in the hostpath driver before the recent update.

Same with

func (c *csiDriverClient) NodeGetVolumeStats(ctx context.Context, volID string, targetPath string) (*volume.Metrics, error) {
	klog.V(4).Info(log("calling NodeGetVolumeStats rpc: [volid=%s, target_path=%s", volID, targetPath))
	if volID == "" {
		return nil, errors.New("missing volume id")
	}
	if targetPath == "" {
		return nil, errors.New("missing target path")
	}
	if c.nodeV1ClientCreator == nil {
		return nil, errors.New("nodeV1ClientCreate is nil")
	}
	nodeClient, closer, err := c.nodeV1ClientCreator(c.addr, c.metricsManager)
	if err != nil {
		return nil, err
	}
	defer closer.Close()
	req := &csipbv1.NodeGetVolumeStatsRequest{
		VolumeId:   volID,
		VolumePath: targetPath,
	}
	resp, err := nodeClient.NodeGetVolumeStats(ctx, req)
	if err != nil {
		return nil, err
	}
	usages := resp.GetUsage()
	if usages == nil {
		return nil, fmt.Errorf("failed to get usage from response. usage is nil")
	}

NodeGetVolumeStats also was not implemented.
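
To make concrete what got newly exercised: a hedged sketch of a minimal NodeGetVolumeStats implementation of the kind the updated hostpath driver now provides (illustrative only, not the driver's actual code; assumes Linux).

package sketch

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"golang.org/x/sys/unix"
)

// minimalNodeGetVolumeStats returns capacity/usage for a mounted volume path.
// A failure here (for example the path already being unmounted) would surface in
// the driver log as errors like the "Could not get file information from ..."
// messages quoted earlier.
func minimalNodeGetVolumeStats(ctx context.Context, req *csi.NodeGetVolumeStatsRequest) (*csi.NodeGetVolumeStatsResponse, error) {
	var st unix.Statfs_t
	if err := unix.Statfs(req.GetVolumePath(), &st); err != nil {
		return nil, err
	}
	return &csi.NodeGetVolumeStatsResponse{
		Usage: []*csi.VolumeUsage{{
			Unit:      csi.VolumeUsage_BYTES,
			Total:     int64(st.Blocks) * st.Bsize,
			Available: int64(st.Bavail) * st.Bsize,
			Used:      int64(st.Blocks-st.Bfree) * st.Bsize,
		}},
	}, nil
}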


msau42 commented Jun 4, 2021

The GetMetrics call does not invoke CSI ListVolumes:

func (mc *metricsCsi) GetMetrics() (*volume.Metrics, error) {

It is true that we will be calling NodeGetVolumeStats with the updated hostpath driver now.


pohly commented Jun 5, 2021

The problem is gone in https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1401147272468631552 after removing the health check controller and agent.

We should now bring back the controller. If that still works, we'll know that it was the redundant agent that caused the problem. In that case we won't need to fix anything, because the agent is obsolete and wasn't meant to run.

pohly added a commit to pohly/kubernetes that referenced this issue Jun 5, 2021
It was disabled together with the agent to avoid test failures in
gce-master-scale-correctness (kubernetes#102452). That
solved the problem, but we still need to check whether the controller
alone works.

pohly commented Jun 5, 2021

We should now bring back the controller.

See #102627

@xing-yang

Yes, if the controller has to watch all those objects, that is also a scalability concern. cc @xing-yang

Yes, the external-health-monitor-controller does watch all PVCs, Pods, and Nodes. This is for the Node Watcher component, which is optional. If a node goes down, the controller reports an event on the PVC saying that the Pods (with names) consuming the PVC are running on a failed node.

The Node Watcher component is disabled by default, so when it is disabled I think we should not be watching Pods and Nodes. Currently they are always watched. A sketch of that direction follows.
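
A minimal sketch of only setting up the cluster-wide Pod and Node informers when the optional Node Watcher is enabled. The flag name and structure are illustrative assumptions, not the controller's actual code.

package main

import (
	"flag"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := flag.String("kubeconfig", "", "path to kubeconfig")
	enableNodeWatcher := flag.Bool("enable-node-watcher", false, "report events on PVCs when the node running their pods fails")
	flag.Parse()

	cfg, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)

	// PVC and PV informers are needed regardless of the node watcher.
	pvcInformer := factory.Core().V1().PersistentVolumeClaims().Informer()
	pvInformer := factory.Core().V1().PersistentVolumes().Informer()

	// Only start the cluster-wide Pod and Node informers when the optional
	// node-watcher feature is enabled; on a 5k-node cluster these watches
	// are the expensive part.
	var podInformer, nodeInformer cache.SharedIndexInformer
	if *enableNodeWatcher {
		podInformer = factory.Core().V1().Pods().Informer()
		nodeInformer = factory.Core().V1().Nodes().Informer()
	}

	stopCh := make(chan struct{})
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
	_, _, _, _ = pvcInformer, pvInformer, podInformer, nodeInformer
}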


pohly commented Jun 8, 2021

We had two good runs (06-06, 06-07) with both sidecars disabled. Today's run (06-08) was done with just the health-check-controller, and in that run the hostpath tests and several others failed again: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1402234386643947520

The "node not ready" events are also back:
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1402234386643947520/artifacts/_sig-storage_CSI_Volumes/_Driver_csi-hostpath_/_Testpattern_Dynamic_PV_default_fs_volumes/should_store_data/pod-event.log

I'll disable the controller again.


pohly commented Jun 8, 2021

I'll disable the controller again.

#102700


msau42 commented Jun 14, 2021

The last 2 runs have succeeded after disabling the health monitor sidecars. We'll investigate scalability of the health monitoring feature separately.

/close

@k8s-ci-robot

@msau42: Closing this issue.

In response to this:

The last 2 runs have succeeded after disabling the health monitor sidecars. We'll investigate scalability of the health monitoring feature separately.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
