
Persistent local volume e2e fail to clean up and break soak suites #68570

Closed
MaciekPytel opened this issue Sep 12, 2018 · 14 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@MaciekPytel (Contributor)

The soak-gci-gce suites have been blocked for days because failing storage tests do not seem to clean up after themselves. After a failed test, a pd-client pod is left in Terminating state. This in turn leaves the test namespace stuck in Terminating, which causes a check in BeforeSuite to fail.

The 1.10 test cluster is currently in this state, so it can be used for debugging.
Project: k8s-jkns-gci-gce-soak-1-7
Zone: us-central1-f
Master VM: bootstrap-e2e-master

As far as I can tell, this is connected to "[sig-storage] Volumes PD should be mountable with ext4" failing in https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-soak-gci-gce-stable2/648.

This is a release-blocking suite for 1.10; can we clean up the offending pod to unblock the patch release?
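
For reference, unblocking would mean force-deleting the stuck pod so namespace deletion can finish. A minimal sketch of that, assuming a recent client-go (context-based Delete signature) and an already-constructed clientset; the package and function names are illustrative, not part of the e2e framework:

    // Illustrative only: force-delete a pod stuck in Terminating so that
    // namespace deletion can proceed.
    package cleanup

    import (
        "context"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // forceDeleteStuckPod deletes the pod with a zero grace period, which removes
    // the API object even if the kubelet has not finished tearing down volumes.
    func forceDeleteStuckPod(c kubernetes.Interface, namespace, name string) error {
        zero := int64(0)
        return c.CoreV1().Pods(namespace).Delete(context.TODO(), name, metav1.DeleteOptions{
            GracePeriodSeconds: &zero,
        })
    }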

Impacted suites:
https://k8s-testgrid.appspot.com/sig-release-1.10-blocking#soak-gci-gce-1.10
https://k8s-testgrid.appspot.com/sig-release-1.11-all#soak-gci-gce-1.11

/sig storage
/priority important-soon
/kind bug

@k8s-ci-robot added the sig/storage, priority/important-soon, and kind/bug labels on Sep 12, 2018
@MaciekPytel (Contributor, Author)

cc: @msau42

@msau42 (Member) commented Sep 12, 2018

cc @davidz627, who is looking at PD leaks in the e2e master tests, although this failure is on previous releases.
To clarify, the suspect is the PD tests, not the local PV tests.

@msau42 (Member) commented Sep 12, 2018

I think there are a few problems in the "PD should be mountable" test:

  • It force-detaches the PD at the end of the test, while the volume could still be mounted. Force-detaching a volume that is still mounted can cause mount/unmount operations to hang.
  • It does not wait for the Pod to be fully deleted before cleaning up the PD; it only waits for pod termination, which indicates the containers are deleted, while volume unmounting could still be in progress (see the sketch after this list).
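
A minimal sketch of the second point (not the actual e2e framework code; it assumes a recent client-go with the context-based Get signature, and the helper name is illustrative): wait until the pod object is gone entirely, not merely terminated, before detaching the PD.

    // Illustrative only: poll until the pod no longer exists. NotFound is a
    // stronger condition than "terminated": it means the kubelet finished its
    // cleanup, including volume unmount, and the API object was removed.
    package e2eutil

    import (
        "context"
        "time"

        apierrors "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/kubernetes"
    )

    func waitForPodGone(c kubernetes.Interface, ns, name string, timeout time.Duration) error {
        return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
            _, err := c.CoreV1().Pods(ns).Get(context.TODO(), name, metav1.GetOptions{})
            if apierrors.IsNotFound(err) {
                return true, nil // pod fully deleted; safe to detach the PD now
            }
            if err != nil {
                return false, err
            }
            return false, nil // pod still exists (possibly Terminating); keep waiting
        })
    }

Only after such a wait returns should the test detach (and never force-detach) the PD.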

@davidz627 (Contributor)

/assign
Looking into it.

@msau42 (Member) commented Sep 12, 2018

https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-soak-gci-gce-stable2/648/artifacts/bootstrap-e2e-minion-group-p43z/kubelet.log

I0912 03:43:28.665240    1380 reconciler.go:278] operationExecutor.UnmountDevice started for volume "pd-volume" (UniqueName: "kubernetes.io/gce-pd/bootstrap-e2e-f135dc47-b63d-11e8-b87e-0a580a3c52c2") on node "bootstrap-e2e-minion-group-p43z"
E0912 03:43:28.666209    1380 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/gce-pd/bootstrap-e2e-f135dc47-b63d-11e8-b87e-0a580a3c52c2\"" failed. No retries permitted until 2018-09-12 03:43:29.16617148 +0000 UTC m=+25576.736735964 (durationBeforeRetry 500ms). Error: "GetDeviceMountRefs check failed for volume \"pd-volume\" (UniqueName: \"kubernetes.io/gce-pd/bootstrap-e2e-f135dc47-b63d-11e8-b87e-0a580a3c52c2\") on node \"bootstrap-e2e-minion-group-p43z\" : The device mount path \"/var/lib/kubelet/plugins/kubernetes.io/gce-pd/mounts/bootstrap-e2e-f135dc47-b63d-11e8-b87e-0a580a3c52c2\" is still mounted by other references [/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/plugins/kubernetes.io/gce-pd/mounts/bootstrap-e2e-f135dc47-b63d-11e8-b87e-0a580a3c52c2 /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/plugins/kubernetes.io/gce-pd/mounts/bootstrap-e2e-f135dc47-b63d-11e8-b87e-0a580a3c52c2 /var/lib/kubelet/pods/fe8fb1f7-b63d-11e8-8d68-42010a800002/volumes/kubernetes.io~gce-pd/pd-volume-0 /var/lib/kubelet/pods/fe8fb1f7-b63d-11e8-8d68-42010a800002/volumes/kubernetes.io~gce-pd/pd-volume-0 /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/pods/fe8fb1f7-b63d-11e8-8d68-42010a800002/volumes/kubernetes.io~gce-pd/pd-volume-0 /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/pods/fe8fb1f7-b63d-11e8-8d68-42010a800002/volumes/kubernetes.io~gce-pd/pd-volume-0]"

@davidz627 (Contributor)

I can't find the 1.10 soak tests anymore, but these tests look to be passing consistently in the 1.11 soak suite, so I'm going to mark this as closed.
/close

@k8s-ci-robot (Contributor)

@davidz627: Closing this issue.


@msau42 (Member) commented Sep 24, 2018

1.10 soak moved to non-blocking:
https://k8s-testgrid.appspot.com/sig-release-1.10-all#soak-gci-gce-1.10

@msau42 (Member) commented Sep 24, 2018

It still looks to be failing on 1.10, but I would double-check whether the test version used on 1.10 includes your commit. There was an issue where the release build tags were not getting updated.

@davidz627 (Contributor)

/reopen

This is still an issue.

I think @msau42 and I have zeroed in on the problem:
The InnerVolumeSpec name comes from the AttachedVolume object in the Actual State of the World (ASW) cache. This object might not get updated (not yet confirmed) when multiple pods use the same PD on the same node: the first pod is removed, but the volume has not finished detaching (and so has not been removed from the ASW) before a second pod is scheduled to the same node and reuses the AttachedVolume object in the ASW cache. A later pod could have a different InnerVolumeSpec name, but teardown will use the inner spec name of the first pod, because we get the InnerVolumeSpec from the AttachedVolume object in the ASW.

This is a problem because we use the InnerVolumeSpec to generate the paths used to tear down volumes. We are seeing volume teardown fail, which leaves the pod stuck indefinitely, which causes namespace deletion to fail and the soak tests to fail on every run.
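
To make the path mismatch concrete, here is an illustrative sketch (not kubelet code; the pod UID and spec names are made up) of how the per-pod volume directory is derived from the inner volume spec name, mirroring the paths in the kubelet log above:

    // Illustrative only: a stale InnerVolumeSpec name from the ASW cache produces
    // a teardown path that does not match the directory that was actually mounted.
    package main

    import (
        "fmt"
        "path/filepath"
    )

    // podVolumeDir mirrors the layout seen in the kubelet log:
    // /var/lib/kubelet/pods/<podUID>/volumes/kubernetes.io~gce-pd/<innerVolumeSpecName>
    func podVolumeDir(podUID, innerVolumeSpecName string) string {
        return filepath.Join("/var/lib/kubelet/pods", podUID,
            "volumes/kubernetes.io~gce-pd", innerVolumeSpecName)
    }

    func main() {
        staleSpecName := "pd-volume-0"   // cached in the AttachedVolume object in the ASW (first pod)
        currentSpecName := "pd-volume-1" // hypothetical name the second pod actually mounted under

        mounted := podVolumeDir("pod-uid-2", currentSpecName)
        tornDown := podVolumeDir("pod-uid-2", staleSpecName) // teardown reuses the stale name

        fmt.Println("mounted at: ", mounted)
        fmt.Println("teardown at:", tornDown) // mismatch: unmount never targets the real mount
    }

If the teardown path never matches the real mount point, UnmountDevice keeps failing (as in the GetDeviceMountRefs error above), the pod stays in Terminating, and the namespace can never be deleted.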

@k8s-ci-robot (Contributor)

@davidz627: Reopening this issue.


@jingxu97 (Contributor)

This is the same issue as #61248. PR #61549 fixes it; #69050 cherry-picks the fix to release-1.10.

@davidz627 (Contributor)

/close

@k8s-ci-robot (Contributor)

@davidz627: Closing this issue.

