
Persistent local volume e2e fail to clean up and break soak suites #68570

Closed
MaciekPytel opened this issue Sep 12, 2018 · 14 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@MaciekPytel (Contributor)

The soak-gci-gce suites have been blocked for days because failing storage tests do not seem to clean up after themselves. After a failed test, a pd-client pod is left in Terminating state. This in turn leaves the test namespace stuck in Terminating, which causes a check in BeforeSuite to fail.

The 1.10 test cluster is currently in this state, so it can be used for debugging.
Project: k8s-jkns-gci-gce-soak-1-7
Zone: us-central1-f
Master VM: bootstrap-e2e-master

As far as I can tell, this is connected to "[sig-storage] Volumes PD should be mountable with ext4" failing in https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-soak-gci-gce-stable2/648.

This is a release-blocking suite for 1.10; can we clean up the offending pod to unblock the patch release?
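
For reference, unblocking would mean force-deleting the stuck pod so namespace deletion can finish. A minimal sketch of that, assuming a recent client-go (context-based Delete signature) and an already-constructed clientset; the package and function names are illustrative, not part of the e2e framework:

    // Illustrative only: force-delete a pod stuck in Terminating so that
    // namespace deletion can proceed.
    package cleanup

    import (
        "context"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // forceDeleteStuckPod deletes the pod with a zero grace period, which removes
    // the API object even if the kubelet has not finished tearing down volumes.
    func forceDeleteStuckPod(c kubernetes.Interface, namespace, name string) error {
        zero := int64(0)
        return c.CoreV1().Pods(namespace).Delete(context.TODO(), name, metav1.DeleteOptions{
            GracePeriodSeconds: &zero,
        })
    }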

Impacted suites:
https://k8s-testgrid.appspot.com/sig-release-1.10-blocking#soak-gci-gce-1.10
https://k8s-testgrid.appspot.com/sig-release-1.11-all#soak-gci-gce-1.11

/sig storage
/priority important-soon
/kind bug

@k8s-ci-robot added the sig/storage, priority/important-soon, and kind/bug labels on Sep 12, 2018
@MaciekPytel (Contributor, Author)

cc: @msau42

@msau42 (Member) commented Sep 12, 2018

cc @davidz627, who is looking at PD leaks in the e2e master tests, although this failure is on previous releases.
To clarify, the suspect is the PD tests, not the local PV tests.

@msau42 (Member) commented Sep 12, 2018

I think there are a few problems in the "PD should be mountable" test:

  • It force-detaches the PD at the end of the test, while the volume could still be mounted. Force-detaching a volume that is still mounted can cause mount/unmount operations to hang.
  • It does not wait for the Pod to be fully deleted before cleaning up the PD; it only waits for pod termination, which indicates the containers are deleted, while volume unmounting could still be in progress (see the sketch after this list).
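
A minimal sketch of the second point (not the actual e2e framework code; it assumes a recent client-go with the context-based Get signature, and the helper name is illustrative): wait until the pod object is gone entirely, not merely terminated, before detaching the PD.

    // Illustrative only: poll until the pod no longer exists. NotFound is a
    // stronger condition than "terminated": it means the kubelet finished its
    // cleanup, including volume unmount, and the API object was removed.
    package e2eutil

    import (
        "context"
        "time"

        apierrors "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/kubernetes"
    )

    func waitForPodGone(c kubernetes.Interface, ns, name string, timeout time.Duration) error {
        return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
            _, err := c.CoreV1().Pods(ns).Get(context.TODO(), name, metav1.GetOptions{})
            if apierrors.IsNotFound(err) {
                return true, nil // pod fully deleted; safe to detach the PD now
            }
            if err != nil {
                return false, err
            }
            return false, nil // pod still exists (possibly Terminating); keep waiting
        })
    }

Only after such a wait returns should the test detach (and never force-detach) the PD.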

@davidz627 (Contributor)

/assign
Looking into it.

@msau42 (Member) commented Sep 12, 2018

https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-soak-gci-gce-stable2/648/artifacts/bootstrap-e2e-minion-group-p43z/kubelet.log

I0912 03:43:28.665240    1380 reconciler.go:278] operationExecutor.UnmountDevice started for volume "pd-volume" (UniqueName: "kubernetes.io/gce-pd/bootstrap-e2e-f135dc47-b63d-11e8-b87e-0a580a3c52c2") on node "bootstrap-e2e-minion-group-p43z"
E0912 03:43:28.666209    1380 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/gce-pd/bootstrap-e2e-f135dc47-b63d-11e8-b87e-0a580a3c52c2\"" failed. No retries permitted until 2018-09-12 03:43:29.16617148 +0000 UTC m=+25576.736735964 (durationBeforeRetry 500ms). Error: "GetDeviceMountRefs check failed for volume \"pd-volume\" (UniqueName: \"kubernetes.io/gce-pd/bootstrap-e2e-f135dc47-b63d-11e8-b87e-0a580a3c52c2\") on node \"bootstrap-e2e-minion-group-p43z\" : The device mount path \"/var/lib/kubelet/plugins/kubernetes.io/gce-pd/mounts/bootstrap-e2e-f135dc47-b63d-11e8-b87e-0a580a3c52c2\" is still mounted by other references [/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/plugins/kubernetes.io/gce-pd/mounts/bootstrap-e2e-f135dc47-b63d-11e8-b87e-0a580a3c52c2 /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/plugins/kubernetes.io/gce-pd/mounts/bootstrap-e2e-f135dc47-b63d-11e8-b87e-0a580a3c52c2 /var/lib/kubelet/pods/fe8fb1f7-b63d-11e8-8d68-42010a800002/volumes/kubernetes.io~gce-pd/pd-volume-0 /var/lib/kubelet/pods/fe8fb1f7-b63d-11e8-8d68-42010a800002/volumes/kubernetes.io~gce-pd/pd-volume-0 /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/pods/fe8fb1f7-b63d-11e8-8d68-42010a800002/volumes/kubernetes.io~gce-pd/pd-volume-0 /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/pods/fe8fb1f7-b63d-11e8-8d68-42010a800002/volumes/kubernetes.io~gce-pd/pd-volume-0]"

@davidz627 (Contributor)

I can't find the 1.10 soak tests anymore, but these tests look to be passing consistently in the 1.11 soak suite, so I'm going to mark this as closed.
/close

@k8s-ci-robot (Contributor)

@davidz627: Closing this issue.


@msau42 (Member) commented Sep 24, 2018

1.10 soak moved to non-blocking:
https://k8s-testgrid.appspot.com/sig-release-1.10-all#soak-gci-gce-1.10

@msau42 (Member) commented Sep 24, 2018

It still looks to be failing on 1.10, but I would double-check whether the test version used on 1.10 includes your commit. There was an issue where the release build tags were not getting updated.

@davidz627 (Contributor)

/reopen

This is still an issue.

I think @msau42 and I have zeroed in on the problem:
The InnerVolumeSpec name comes from the AttachedVolume object in the Actual State of the World (ASW) cache. This object might not get updated (not yet confirmed) when multiple pods use the same PD on the same node: the first pod is removed, but the volume has not finished detaching (and so has not been removed from the ASW) before a second pod is scheduled to the same node and reuses the AttachedVolume object in the ASW cache. A later pod could have a different InnerVolumeSpec name, but teardown will use the inner spec name of the first pod, because we get the InnerVolumeSpec from the AttachedVolume object in the ASW.

This is a problem because we use the InnerVolumeSpec to generate the paths used to tear down volumes. We are seeing volume teardown fail, which leaves the pod stuck indefinitely, which causes namespace deletion to fail and the soak tests to fail on every run.
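
To make the path mismatch concrete, here is an illustrative sketch (not kubelet code; the pod UID and spec names are made up) of how the per-pod volume directory is derived from the inner volume spec name, mirroring the paths in the kubelet log above:

    // Illustrative only: a stale InnerVolumeSpec name from the ASW cache produces
    // a teardown path that does not match the directory that was actually mounted.
    package main

    import (
        "fmt"
        "path/filepath"
    )

    // podVolumeDir mirrors the layout seen in the kubelet log:
    // /var/lib/kubelet/pods/<podUID>/volumes/kubernetes.io~gce-pd/<innerVolumeSpecName>
    func podVolumeDir(podUID, innerVolumeSpecName string) string {
        return filepath.Join("/var/lib/kubelet/pods", podUID,
            "volumes/kubernetes.io~gce-pd", innerVolumeSpecName)
    }

    func main() {
        staleSpecName := "pd-volume-0"   // cached in the AttachedVolume object in the ASW (first pod)
        currentSpecName := "pd-volume-1" // hypothetical name the second pod actually mounted under

        mounted := podVolumeDir("pod-uid-2", currentSpecName)
        tornDown := podVolumeDir("pod-uid-2", staleSpecName) // teardown reuses the stale name

        fmt.Println("mounted at: ", mounted)
        fmt.Println("teardown at:", tornDown) // mismatch: unmount never targets the real mount
    }

If the teardown path never matches the real mount point, UnmountDevice keeps failing (as in the GetDeviceMountRefs error above), the pod stays in Terminating, and the namespace can never be deleted.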

@k8s-ci-robot (Contributor)

@davidz627: Reopening this issue.


@jingxu97 (Contributor)

This is the same issue as #61248. PR #61549 fixes it; #69050 cherry-picks the fix to release-1.10.

@davidz627 (Contributor)

/close

@k8s-ci-robot (Contributor)

@davidz627: Closing this issue.

