Fix issue with race condition during pod deletion #61071

Merged
merged 1 commit into kubernetes:master on Mar 16, 2018

Conversation

@jingxu97
Contributor

jingxu97 commented Mar 12, 2018

This PR fixes two issues:

  1. When desired_state_populator removes pod volume state, it now checks
    whether the actual state already has the volume before deleting it, to
    make sure the actual state has a chance to record the volume.
  2. When checking whether a pod's volumes still exist, it checks not only
    the actual state but also the volume directories on disk, because the
    actual state might not reflect the real world when kubelet starts.

fixes issue #60645
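
For orientation, here is a simplified sketch of what fix 1 amounts to. This is illustrative only, not the exact diff; the real change is in findAndRemoveDeletedPods in desired_state_of_world_populator.go, whose hunk is quoted below, and the control flow here is condensed.

// Simplified sketch of fix 1; helper names follow the kubelet volume
// manager code, but several intermediate checks are elided.
for _, volumeToMount := range dswp.desiredStateOfWorld.GetVolumesToMount() {
    pod, podExists := dswp.podManager.GetPodByUID(volumeToMount.Pod.UID)
    if podExists && !dswp.isPodTerminated(pod) {
        continue // pod is still running; keep its volumes in the desired state
    }
    // New check: if the pod still exists but the actual state has not yet
    // recorded the volume (e.g. right after a kubelet restart), keep the
    // volume in the desired state so the reconciler has a chance to add it.
    if !dswp.actualStateOfWorld.VolumeExists(volumeToMount.VolumeName) && podExists {
        continue
    }
    dswp.desiredStateOfWorld.DeletePodFromVolume(
        volumeToMount.PodName, volumeToMount.VolumeName)
}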

@jingxu97


Contributor

jingxu97 commented Mar 12, 2018

@@ -245,6 +245,10 @@ func (dswp *desiredStateOfWorldPopulator) findAndRemoveDeletedPods() {
continue
}
if !dswp.actualStateOfWorld.VolumeExists(volumeToMount.VolumeName) && podExists {


@gnufied

gnufied Mar 13, 2018

Member

What happens when we quickly add a pod and then delete it? The code will wait for the volume to be attached before it can be mounted, and hence it will not be present in the ASW. But the entry will not be removed from the DSW even when the pod is deleted, because of this check, right?


@jingxu97

jingxu97 Mar 13, 2018

Contributor

Because we have an extra check on podExists here, the entry should still be removed. When a pod is added and quickly deleted before the reconciler sets up the mount for it, kubelet should be able to delete it without problem, and podManager will no longer have this pod, so podExists will become false.

if len(volumePaths) > 0 {
glog.Errorf("pod %q found, but volume paths are still present on disk", podUID)
return true
}


@gnufied

gnufied Mar 13, 2018

Member

While this check mostly looks okay, I am wondering if it will prevent deletion of pods that somehow have leftover directories on the node. To be clear - stopping pod deletion for mounted volumes is fine, but I have seen several cases where pods just leave an unmounted pod directory structure in place because of a node restart or crash. Will we prevent users from deleting such pods?

cc @jsafrane


@jingxu97

jingxu97 Mar 13, 2018

Contributor

In that case, since the desired state still has the pod at first, the reconciler will try to mount the volume and update the actual state. After the volume exists in the actual state, and since the pod is in a terminated state, desired_state_populator will delete the pod from the desired state. The reconciler will then tear down the volume, which removes all the directories.
Only if something really bad happens and the reconciler fails to mount the volume every time will the pod remain in the desired state. In that case, the user should use --force to delete the pod.


@jingxu97

jingxu97 Mar 13, 2018

Contributor

After checking the code, I think it is better to check whether volumes are still mounted here, instead of just whether volume paths exist. That way a pod is blocked from deletion only if volumes are still mounted on disk; if all mounts are gone, the pod can be deleted. The downside is that empty directories might not be cleaned up.
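
Sketched out, the proposed mounted-only check could look like this. Illustrative only; it matches the shape of the getMountedVolumePathListFromDisk helper whose hunk appears later in this thread.

// Walk the pod's volume directories and keep only the paths that are
// still mount points, so leftover empty directories no longer block
// pod deletion.
func (kl *Kubelet) getMountedVolumePathListFromDisk(podUID types.UID) ([]string, error) {
    mountedVolumes := []string{}
    volumePaths, err := kl.getPodVolumePathListFromDisk(podUID)
    if err != nil {
        return mountedVolumes, err
    }
    for _, volumePath := range volumePaths {
        // IsLikelyNotMountPoint reports true when the path is almost
        // certainly not a mount point (it can miss bind mounts).
        isNotMount, err := kl.mounter.IsLikelyNotMountPoint(volumePath)
        if err != nil {
            return mountedVolumes, err
        }
        if !isNotMount {
            mountedVolumes = append(mountedVolumes, volumePath)
        }
    }
    return mountedVolumes, nil
}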

@dims


Member

dims commented Mar 13, 2018

/milestone v1.10

@k8s-ci-robot k8s-ci-robot added this to the v1.10 milestone Mar 13, 2018

@k8s-ci-robot k8s-ci-robot added size/M and removed size/S labels Mar 13, 2018

@jingxu97


Contributor

jingxu97 commented Mar 14, 2018

This PR is a fix for issue #60846, which currently blocks the 1.10 release. Please keep a close eye on this PR. Thanks!

@jingxu97


Contributor

jingxu97 commented Mar 14, 2018

/test pull-kubernetes-e2e-gce

@@ -274,6 +274,24 @@ func (kl *Kubelet) getPodVolumePathListFromDisk(podUID types.UID) ([]string, err
return volumes, nil
}
func (kl *Kubelet) getMountedVolumePathListFromDisk(podUID types.UID) ([]string, error) {


@jsafrane

jsafrane Mar 14, 2018

Member

this function is not called


@gnufied

gnufied Mar 14, 2018

Member

Yeah, not sure why this function was introduced and then got commented out in the code below.


@jingxu97

jingxu97 Mar 15, 2018

Contributor

Sorry for the confusion. It is being called; it was commented out while I was testing at that time.

@gnufied


Member

gnufied commented Mar 14, 2018

@jingxu97 I think this is causing e2e test failures which are directly related to this PR:

test/e2e/framework/framework.go:142
Mar 14 06:52:28.511: Couldn't delete ns: "e2e-tests-volume-sd4d8": namespace e2e-tests-volume-sd4d8 was not deleted with limit: timed out waiting for the condition, pods remaining: 1 (&errors.errorString{s:"namespace e2e-tests-volume-sd4d8 was not deleted with limit: timed out waiting for the condition, pods remaining: 1"})
test/e2e/framework/framework.go:318

Looking into kubelet logs we can see:

I0314 06:31:37.153173    1567 kubelet_volumes.go:67] pod "3e78734c-2751-11e8-84c9-42010a280002" found, but volumes are still mounted on disk [/var/lib/kubelet/pods/3e78734c-2751-11e8-84c9-42010a280002/volumes/kubernetes.io~gce-pd/pd-volume-0]
I0314 06:31:37.153201    1567 kubelet_pods.go:947] Pod "pd-client_e2e-tests-volume-ftpr2(3e78734c-2751-11e8-84c9-42010a280002)" is terminated, but some volumes have not been cleaned up

I am not sure why @jingxu97 did not use the mounted-volume check.

@jdumars


Member

jdumars commented Mar 14, 2018

/release-note-none

@jdumars


Member

jdumars commented Mar 14, 2018

/sig node

@jdumars


Member

jdumars commented Mar 14, 2018

/test pull-kubernetes-e2e-gce

@liggitt


Member

liggitt commented Mar 14, 2018

fixes issue #60846

That is a PR. Does this address #60645?

@k8s-merge-robot


Contributor

k8s-merge-robot commented Mar 15, 2018

[MILESTONENOTIFIER] Milestone Pull Request: Up-to-date for process

@gnufied @jingxu97 @yujuhong

Pull Request Labels
  • sig/node: Pull Request will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move pull request out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.
@jingxu97


Contributor

jingxu97 commented Mar 15, 2018

/test

}
glog.V(4).Infof("Volume paths exist for pod %s: %v", podUID, volumePaths)
for _, volumePath := range volumePaths {
isNotMount, err := kl.mounter.IsLikelyNotMountPoint(volumePath)


@jsafrane

jsafrane Mar 15, 2018

Member

I do not like the check for mount points here. A Flex volume may not use mount() at all; it may populate the directory in another way, e.g. like a Secret volume. So the directory may be "used" and TearDown needs to be called even though it is not in /proc/mounts.

IMO, the presence of the directory is enough to call the particular volume plugin's TearDown(), and the directory should be removed only when that succeeds.

Of course, if we can't reconstruct the volume, unmount + rmdir is the best we can do, but only as a last resort.


@jdumars

jdumars Mar 15, 2018

Member

@jsafrane given the lateness of this work in 1.10 and our pending release, can you advise a solution, or clarify the risk of implementing this way? In short, is this concern a blocker for approval?


@gnufied

gnufied Mar 15, 2018

Member

I am not sure this is a blocker. If what the flexvolume did was populate a directory (not a mount) and it is in "use", then surely a running pod means the volume gets added to the ASW, and TearDown will be called when the pod is deleted.

The worst that can happen is this: if a pod using a flexvolume (assuming the flexvolume did not create mount points but just populated plain directories) is deleted while kubelet is down, and the race condition in question prevents the volume from being added to the ASW, then the pod may be deleted without TearDown being called on the leftover directories.

Let's review what happens in 1.9 for a flexvolume that does not create mount points. Such volumes will not be processed by the reconstruction process - https://github.com/kubernetes/kubernetes/blob/release-1.9/pkg/kubelet/volumemanager/reconciler/reconciler.go#L490 - and hence, in a similar fashion to this PR, they will leave directories on which TearDown isn't called.


@jsafrane

jsafrane Mar 15, 2018

Member

@jdumars I agree with @gnufied, my comment is not blocking 1.10. Flex volume that does not mount + kubelet restart + forced pod deletion need to meet at the right time to leave an orphan directory. We can fix it in 1.10.1 or 1.11.


@jingxu97

jingxu97 Mar 15, 2018

Contributor

The purpose of this check is to block the pod from being deleted from the API server while there are still directories and mounts left behind. In the normal case, it does not interfere with the reconciler's tear-down process. By normal case, I mean the following workflow:

  1. Pod is created and added to the desired state; the reconciler mounts its volumes and adds them to the actual state.
  2. User tries to delete the pod, so the pod deletion timestamp is set.
  3. Kubelet kills the containers.
  4. The desired state populator removes the pod from the desired state (it checks that the pod is terminating, its containers are killed, and the actual state already has the volume).
  5. The reconciler tears down the volume.

If the pod is deleted during a kubelet restart, the following happens instead:

  1. The pod (in terminating state) is added back to the desired state first.
  2. Reconstruction will not proceed because the volume is in the desired state.
  3. The reconciler tries to set up the volume and add it back to the actual state. In the meantime, kubelet tries to delete the pod. With this PR, the deletion fails while directories exist and are mounted, so the reconciler has a chance to get the volume into the actual state.
  4. The desired state populator can now remove the volume since the actual state has it, so the reconciler will tear down the volume. After that, kubelet can delete the pod.

If the directories are not mounted, a race can still happen: kubelet deletes the pod, so the desired state populator removes the pod, and the reconciler never sets up the volume or adds it to the actual state. But that should be a rare case. Before 1.10, quite a few volume plugins could not reconstruct volumes correctly, so they were not cleaned up. With the new design, we give the reconciler more time to reconstruct the state from the desired state by delaying the deletion of the pod from the desired state. With the new design and this PR we still cannot completely guarantee correct volume cleanup, but overall it should improve the situation, I think.
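
For illustration, the consumer side of this gate looks roughly like the following. This is a simplified sketch based on the kubelet log lines quoted earlier in this thread, not the exact code; the real check lives in kubelet_pods.go and takes additional parameters.

// A terminated pod only counts as reclaimed, and hence deletable from
// the API server, once podVolumesExist reports no remaining volumes.
func (kl *Kubelet) podResourcesAreReclaimed(pod *v1.Pod) bool {
    // ... container and sandbox checks elided ...
    if kl.podVolumesExist(pod.UID) && !kl.keepTerminatedPodVolumes {
        glog.V(3).Infof("Pod %q is terminated, but some volumes have not been cleaned up", format.Pod(pod))
        return false
    }
    return true
}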


@jsafrane

jsafrane Mar 15, 2018

Member

yes, the overall situation is better than before this PR.

@yujuhong

Mostly nits. Please file an issue for the follow-up work as we discussed.

How is this tested?

if err != nil {
return volumePaths, err
}
glog.V(4).Infof("Volume paths exist for pod %s: %v", podUID, volumePaths)


@yujuhong

yujuhong Mar 15, 2018

Contributor

How often does this function run? Will the V(4) log be too noisy?


@jingxu97

jingxu97 Mar 15, 2018

Contributor

I removed this log; it was mainly for debugging.

for _, volumePath := range volumePaths {
isNotMount, err := kl.mounter.IsLikelyNotMountPoint(volumePath)
if err != nil {
glog.Errorf("Cannot check whether volume path %s for pod %sis mounted or not: %v", volumePath, podUID, err)


@yujuhong

yujuhong Mar 15, 2018

Contributor

nit: missing a space: s/%sis/%s is
Also change the volume path printing to use %q.


@yujuhong

yujuhong Mar 15, 2018

Contributor

Add continue to make sure you skip the next if check


@jingxu97

jingxu97 Mar 15, 2018

Contributor

fixed

return volumePaths, err
}
glog.V(4).Infof("Volume paths exist for pod %s: %v", podUID, volumePaths)
for _, volumePath := range volumePaths {


@yujuhong

yujuhong Mar 15, 2018

Contributor

Please add a short comment explaining what the value returned by IsLikelyNotMountPoint means here, and why it's safe to skip when running into errors.


@jingxu97

jingxu97 Mar 15, 2018

Contributor

added

glog.Errorf("pod %q found, but error %v occurred during checking mounted volumes from disk", podUID, err)
}
if len(volumePaths) > 0 {
glog.Infof("pod %q found, but volumes are still mounted on disk %v", podUID, volumePaths)


@yujuhong

yujuhong Mar 15, 2018

Contributor

Lower the verbosity of this log line.


@jingxu97

jingxu97 Mar 15, 2018

Contributor

fixed

@@ -58,6 +58,15 @@ func (kl *Kubelet) podVolumesExist(podUID types.UID) bool {
return true
}
volumePaths, err := kl.getMountedVolumePathListFromDisk(podUID)


@yujuhong

yujuhong Mar 15, 2018

Contributor

Add a comment describing that this is a best-effort check, and link to the github issue for the follow-up work.


@jingxu97

jingxu97 Mar 15, 2018

Contributor

fixed

@jingxu97


Contributor

jingxu97 commented Mar 15, 2018

@yujuhong The code path modified by this PR is tested by some volume e2e tests, including "PD volume should be mountable for ...", and also some disruptive PD tests.

Comments are addressed. PTAL. Thanks!

return volumePaths, err
}
for _, volumePath := range volumePaths {
// Check whether volume path is mounted or not. If one path has error to check, continue to the next check.


@yujuhong

yujuhong Mar 15, 2018

Contributor

I think the comment should be about why it's safe to ignore errors and continue. The alternative is to fail and return the error immediately.


@jingxu97

jingxu97 Mar 15, 2018

Contributor

This function is called by podVolumesExist(), which returns true/false to indicate whether the pod can be deleted. On a check error, I am thinking it is better to return the error here, and have podVolumesExist() return true, since we are not sure whether the paths exist and are mounted. The behavior would then be: if we fail to check the volume paths, the pod cannot be deleted.


@yujuhong

yujuhong Mar 15, 2018

Contributor

Changing this SGTM. Please make the change, and either add a comment for podVolumesExist or let it return both a bool and an error instead.


@jingxu97

jingxu97 Mar 15, 2018

Contributor

Changed the code as discussed. PTAL.
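
For reference, the agreed behavior comes out to roughly this. An illustrative sketch only: on a failed disk check, podVolumesExist conservatively reports that volumes exist, so the pod is not deleted.

func (kl *Kubelet) podVolumesExist(podUID types.UID) bool {
    if mountedVolumes := kl.volumeManager.GetMountedVolumesForPod(
        volumetypes.UniquePodName(podUID)); len(mountedVolumes) > 0 {
        return true
    }
    // Best-effort disk check: if it errors, assume volumes may still
    // exist and block deletion rather than risk leaking a mount.
    volumePaths, err := kl.getMountedVolumePathListFromDisk(podUID)
    if err != nil {
        glog.Errorf("pod %q found, but error %v occurred during checking mounted volumes from disk", podUID, err)
        return true
    }
    if len(volumePaths) > 0 {
        glog.V(4).Infof("pod %q found, but volumes are still mounted on disk %v", podUID, volumePaths)
        return true
    }
    return false
}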

@yujuhong


Contributor

yujuhong commented Mar 15, 2018

@yujuhong The code path modified by this PR is tested by some volume e2e tests, including "PD volume should be mountable for ...", and also some disruptive PD tests.

Are those tests failing now? If not, how do we verify the PR?

@jingxu97


Contributor

jingxu97 commented Mar 15, 2018

@yujuhong The existing tests are not failing, but the old code path did not check the real directories on disk. Now that this code checks the disk directories, we might hit errors if anything is left behind. I did notice the failure mentioned in #61071 (comment); still investigating.

@jingxu97


Contributor

jingxu97 commented Mar 15, 2018

/test pull-kubernetes-e2e-gce

@gnufied


Member

gnufied commented Mar 15, 2018

@jingxu97 @yujuhong The e2e test in https://github.com/kubernetes/kubernetes/pull/61181/files fails 100% of the time without the fixes needed for #60645. It is sadly a timing thing: the busybox pod with the default command "while do; sleep 1; done" does not die fast enough for the pod to miss the ASW. It appears that the nginx image I was using does die quickly enough and reproduces the bug.

Fix issue with race condition during pod deletion
@yujuhong


Contributor

yujuhong commented Mar 15, 2018

@gnufied can you confirm that the test now passes with this PR?

@yujuhong


Contributor

yujuhong commented Mar 15, 2018

This fix adds an extra call to check the directories on the disk, but it's only used in 1) a periodic cleanup routine, and 2) when a pod's deletion timestamp is set, and the status manager is checking whether the resources are reclaimed (hence safe to delete).

/approve

lgtm pending the verification of the fix #61071 (comment)

@gnufied


Member

gnufied commented Mar 15, 2018

@yujuhong I ran the e2e tests before and after merging this PR and Jan's PR (#60888).

Before these 2 PRs:

[sig-storage] GenericPersistentVolume[Disruptive]
/home/shared/go_projects/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/storage/utils/framework.go:22
  When kubelet restarts
  /home/shared/go_projects/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/storage/generic_persistent_volume-disruptive.go:57
    Should test that a volume mounted to a pod that is deleted while the kubelet is down unmounts when the kubelet returns. [It]
    /home/shared/go_projects/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/storage/generic_persistent_volume-disruptive.go:71

    Expected grep stdout to be empty (i.e. no mount found).
    Expected
        <string>: /dev/sdb on /var/lib/kubelet/pods/ac7fc3dc-289f-11e8-8279-42010af00002/volumes/kubernetes.io~gce-pd/pvc-a9e2ed7a-289f-11e8-8279-42010af00002 type ext4 (rw,relatime,data=ordered)
        /dev/sdb on /var/lib/kubelet/pods/ac7fc3dc-289f-11e8-8279-42010af00002/volumes/kubernetes.io~gce-pd/pvc-a9e2ed7a-289f-11e8-8279-42010af00002 type ext4 (rw,relatime,data=ordered)
        /dev/sdb on /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/pods/ac7fc3dc-289f-11e8-8279-42010af00002/volumes/kubernetes.io~gce-pd/pvc-a9e2ed7a-289f-11e8-8279-42010af00002 type ext4 (rw,relatime,data=ordered)
        /dev/sdb on /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/pods/ac7fc3dc-289f-11e8-8279-42010af00002/volumes/kubernetes.io~gce-pd/pvc-a9e2ed7a-289f-11e8-8279-42010af00002 type ext4 (rw,relatime,data=ordered)
        
    to be empty

    /home/shared/go_projects/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/storage/utils/utils.go:200
------------------------------
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS

After:

• [SLOW TEST:109.773 seconds]
[sig-storage] GenericPersistentVolume[Disruptive]
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/utils/framework.go:22
  When kubelet restarts
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/generic_persistent_volume-disruptive.go:57
    Should test that a volume mounted to a pod that is deleted while the kubelet is down unmounts when the kubelet returns.
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/generic_persistent_volume-disruptive.go:71
------------------------------
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSMar 15 19:10:48.807: INFO: Running AfterSuite actions on all node
Mar 15 19:10:48.807: INFO: Running AfterSuite actions on node 1

Ran 3 of 823 Specs in 390.657 seconds
SUCCESS! -- 3 Passed | 0 Failed | 0 Pending | 820 Skipped PASS

I am running the e2e test in a loop just to be sure but I think we should be good.

@yujuhong


Contributor

yujuhong commented Mar 15, 2018

@gnufied thanks!

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Mar 15, 2018

@k8s-ci-robot


Contributor

k8s-ci-robot commented Mar 15, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gnufied, jingxu97, yujuhong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jingxu97


Contributor

jingxu97 commented Mar 15, 2018

A few issues opened are related to this PR:

  1. #61229 Check volumes actually exist in PodVolumesExist before deleting the pod - might not work well for some plugins such as flexvolume.
  • Needs to be addressed in a later release.
  2. #61249 Move volume related functions to volume manager.
  • Small code cleanup in a separate PR.
  3. #61248 Attached Volume data structure issue in volume manager's actual state.
  • An existing issue, but it needs some attention and should be fixed soon.
  4. #61133 kubelet_getters getPodVolumePathListFromDisk() does not check block.
  • Should be fixed for block volumes.
@k8s-merge-robot


Contributor

k8s-merge-robot commented Mar 16, 2018

Automatic merge from submit-queue (batch tested with PRs 61203, 61071). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-merge-robot k8s-merge-robot merged commit 9fe565a into kubernetes:master Mar 16, 2018

14 checks passed

  • Submit Queue: Queued to run github e2e tests a second time.
  • cla/linuxfoundation: jingxu97 authorized
  • pull-kubernetes-bazel-build: Job succeeded.
  • pull-kubernetes-bazel-test: Job succeeded.
  • pull-kubernetes-cross: Skipped
  • pull-kubernetes-e2e-gce: Job succeeded.
  • pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
  • pull-kubernetes-e2e-gke: Skipped
  • pull-kubernetes-e2e-kops-aws: Job succeeded.
  • pull-kubernetes-integration: Job succeeded.
  • pull-kubernetes-kubemark-e2e-gce: Job succeeded.
  • pull-kubernetes-node-e2e: Job succeeded.
  • pull-kubernetes-typecheck: Job succeeded.
  • pull-kubernetes-verify: Job succeeded.

@jbguerraz


jbguerraz commented Mar 16, 2018

Any plan to backport it to 1.9? :)

@saad-ali


Member

saad-ali commented Mar 16, 2018

AFAIK this PR fixes issues introduced in PR #58177, which doesn't exist in 1.9, so it does not make sense to cherry-pick it to 1.9.
