
Make volume manager more robust across restarts #27653

Closed
saad-ali opened this issue Jun 18, 2016 · 15 comments
Assignees
Labels
kind/documentation Categorizes issue or PR as related to documentation. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/storage Categorizes an issue or PR as relevant to SIG Storage.
Milestone

Comments

@saad-ali
Member

When kubelet is restarted, the volume manager loses its internal state. It rebuilds this state and calls attach/mount on volumes that may already be attached or mounted.

This issue is to track two items:

  1. Verify that calling mount on a volume that is already mounted does not fail (a minimal idempotency check is sketched after this list).
  2. If a pod was deleted while kubelet was down, the associated volume will not be unmounted, since kubelet would not know about the pod after restarting. The volume manager should therefore recover pre-attached volumes and populate the actual state of the world on startup so that they get cleaned up appropriately.
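
For item (1), the usual approach is to check whether the target directory is already a mount point before calling mount, and skip the mount if so. Below is a minimal Go sketch of that check, assuming Linux device-ID semantics; it is not the kubelet implementation, though kubelet's IsLikelyNotMountPoint relies on the same heuristic (which misses bind mounts on the same device, hence "likely").

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// isLikelyMountPoint reports whether path is probably a mount point by
// comparing its device ID with its parent directory's. Linux-only: the
// *syscall.Stat_t assertion does not hold on other platforms, and bind
// mounts on the same device are not detected.
func isLikelyMountPoint(path string) (bool, error) {
	st, err := os.Lstat(path)
	if err != nil {
		return false, err
	}
	parent, err := os.Lstat(filepath.Dir(path))
	if err != nil {
		return false, err
	}
	return st.Sys().(*syscall.Stat_t).Dev != parent.Sys().(*syscall.Stat_t).Dev, nil
}

func main() {
	// Hypothetical volume target; a real path would substitute the pod UID.
	target := "/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~secret/token"
	if mounted, err := isLikelyMountPoint(target); err == nil && mounted {
		fmt.Println("already mounted, skipping mount")
	} else {
		fmt.Println("not mounted (or stat failed), proceeding with mount")
	}
}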
@saad-ali saad-ali added sig/storage Categorizes an issue or PR as relevant to SIG Storage. team/cluster labels Jun 18, 2016
@saad-ali saad-ali added this to the next-candidate milestone Jun 18, 2016
@yujuhong
Contributor

/cc @kubernetes/sig-node @dchen1107

Item (2) will lead to irremovable pod directories (see #27651 for an example), and these directories will accumulate over time. Kubelet would also try to clean them up every 2 seconds, fail, and thereby waste cycles and spam the log.
@saad-ali, is it possible to add a temporary fix for v1.3?

@saad-ali
Member Author

Item (2) will lead to irremovable pod directories

Only if kubelet is restarted and a pod with volumes mounted is deleted during that period.

I agree it must be fixed, but I don't think it's a v1.3 blocker. It can wait for v1.3.1.

@matchstick for triage.

@yujuhong
Contributor

yujuhong commented Jun 20, 2016

Only if kubelet is restarted and a pod with volumes mounted is deleted during that period.

Almost every pod has a default secret volume, so this applies to most pods. The difference between this issue and many others is that it is not transient: it haunts kubelet even after restarts, and it only gets worse as time goes by.

I agree it must be fixed, but I don't think it's a v1.3 blocker. It can wait for v1.3.1.

Could this be documented as a known issue in the release notes if the decision is to table this for v1.3.1? Thanks!

@yujuhong
Contributor

/cc @freehan, this is why your kubelet log is full of error messages :)

@saad-ali saad-ali modified the milestones: v1.3, next-candidate Jun 20, 2016
@saad-ali
Member Author

Spoke with @matchstick. We're going to try to get it in for v1.3. If not, we'll punt it to post-1.3.

@saad-ali saad-ali self-assigned this Jun 20, 2016
@dchen1107 dchen1107 added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jun 20, 2016
@dchen1107
Member

Talked to both @yujuhong and @saad-ali. We should fix this for the 1.3 release; otherwise kubelet will quickly become useless due to running out of disk. I marked this as P1.

@yifan-gu
Contributor

FWIW, this is very easy to reproduce using local-up-cluster.sh. Basically: after local-up, create a pod, nuke the cluster, local-up again, and then you will see messages like kubelet.go:2018] Failed to remove orphaned pod "03a5ea59-371b-11e6-98cf-28d244b00276" dir; err: remove /var/lib/kubelet/pods/03a5ea59-371b-11e6-98cf-28d244b00276/volumes/kubernetes.io~secret/default-token-385u2: device or resource busy.

@saad-ali
Member Author

@jingxu97 is working on implementing a cleanup routine on startup.

@yujuhong
Contributor

@jingxu97

A very simple script to reproduce this issue in an e2e cluster:

#!/bin/bash

# List the e2e test minion nodes in the given zone.
NODES=( $(gcloud compute instances list --zones="${ZONE}" | awk '{print $1}' | grep e2e-test- | grep minion) )
KUBECTL="./cluster/kubectl.sh"

RC="
apiVersion: v1
kind: ReplicationController
metadata:
  name: foo
  labels:
    k8s-app: foo
spec:
  replicas: 90
  selector:
    k8s-app: foo
  template:
    metadata:
      labels:
        k8s-app: foo
    spec:
      containers:
      - image: gcr.io/google_containers/pause:2.0
        name: bar
"
# Create N pods.
echo "${RC}" | ${KUBECTL} create -f -

# Sleep a while and assume the pods will be running by then.
sleep 60

# Forcefully delete the pods.
${KUBECTL} delete rc --grace-period=0 foo

# Kill kubelet on each node.
# NOTE: Reproducing this issue is more reliable if killing kubelet
# overlaps with the RC deletion.
for NODE in "${NODES[@]}"; do
    echo "Killing kubelet on node ${NODE}"
    gcloud compute ssh "${NODE}" --command "sudo pkill kubelet"
done

# Give kubelet time to start up.
sleep 120

# Grep the error messages from the logs.
for NODE in "${NODES[@]}"; do
    echo "Checking error messages on node ${NODE}"
    gcloud compute ssh "${NODE}" --command "grep \"Failed to remove orphaned pod\" /var/log/kubelet.log"
done

@goltermann goltermann modified the milestones: next-candidate, v1.3 Jun 24, 2016
@goltermann goltermann added the kind/documentation Categorizes issue or PR as related to documentation. label Jun 24, 2016
@saad-ali
Member Author

The design for this should also handle VolumesInUse() gracefully: if a pod continues to run after kubelet is restarted, there should be no gap in its volumes being reported in the node's VolumesInUse status.
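
As a rough illustration of this requirement (all names here are hypothetical, not the volume manager's actual API), VolumesInUse can be computed as the union of the desired and actual/reconstructed volume sets, so that volumes recovered from disk after a restart are reported without a gap:

package main

import (
	"fmt"
	"sort"
)

// volumesInUse unions desired and actual volume names; the actual set is
// assumed to include volumes reconstructed from disk after a restart.
func volumesInUse(desired, actual map[string]bool) []string {
	seen := map[string]bool{}
	for v := range desired {
		seen[v] = true
	}
	for v := range actual {
		seen[v] = true
	}
	out := make([]string, 0, len(seen))
	for v := range seen {
		out = append(out, v)
	}
	sort.Strings(out)
	return out
}

func main() {
	desired := map[string]bool{"kubernetes.io/gce-pd/disk-1": true}
	actual := map[string]bool{
		"kubernetes.io/gce-pd/disk-1": true,
		"kubernetes.io/gce-pd/disk-2": true, // reconstructed; pod already deleted
	}
	fmt.Println(volumesInUse(desired, actual))
}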

@thockin
Member

thockin commented Jun 28, 2016

status on this?

@saad-ali
Member Author

Punted to post-1.3. Hopefully it will be part of a 1.3.x release.

@matchstick matchstick added priority/backlog Higher priority than priority/awaiting-more-evidence. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Jun 28, 2016
@matchstick
Contributor

OK, we need to make sure we don't lose track of this one, but it is not blocking the 1.3 release. Raising to P0 and removing from the 1.3 milestone.

@goltermann
Contributor

This is an issue marked v1.3 and kind/documentation. Can you get it fixed up this week?

k8s-github-robot pushed a commit that referenced this issue Aug 15, 2016
Automatic merge from submit-queue

Add volume reconstruct/cleanup logic in kubelet volume manager

Currently kubelet volume management works on the concept of desired
and actual worlds of state. The volume manager periodically compares the
two worlds and performs volume mount/unmount and/or attach/detach
operations. When kubelet restarts, the caches of those two worlds are
gone. Although the desired world can be recovered through the apiserver,
the actual world cannot, which may mean some volumes are never cleaned
up if their information has been deleted from the apiserver. This change
adds reconstruction of the actual world by reading the pod directories
from disk. The reconstructed volume information is added to both the
desired world and the actual world if it cannot be found in either. The
rest of the logic is the same as before: the desired world populator may
clean up a volume entry if it is no longer in the apiserver, and the
volume manager will then invoke unmount to clean it up.

Fixes #27653
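
A minimal sketch of the reconstruction idea described above, assuming the standard pod directory layout /var/lib/kubelet/pods/<pod-uid>/volumes/<plugin>/<volume-name> (visible in the error message earlier in this thread); the type and function names are illustrative, not the actual kubelet code:

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// reconstructedVolume is a hypothetical record of a volume found on disk.
type reconstructedVolume struct {
	PodUID, PluginName, VolumeName, Path string
}

// reconstructVolumes walks the pod directories under kubeletRoot and returns
// one record per volume directory it finds, so the volume manager can add
// them to its actual (and, if unknown, desired) state of the world.
func reconstructVolumes(kubeletRoot string) ([]reconstructedVolume, error) {
	var vols []reconstructedVolume
	podsDir := filepath.Join(kubeletRoot, "pods")
	pods, err := os.ReadDir(podsDir)
	if err != nil {
		return nil, err
	}
	for _, pod := range pods {
		pluginsDir := filepath.Join(podsDir, pod.Name(), "volumes")
		plugins, err := os.ReadDir(pluginsDir)
		if err != nil {
			continue // pod has no volumes directory
		}
		for _, plugin := range plugins {
			volumesDir := filepath.Join(pluginsDir, plugin.Name())
			volumes, err := os.ReadDir(volumesDir)
			if err != nil {
				continue
			}
			for _, v := range volumes {
				vols = append(vols, reconstructedVolume{
					PodUID:     pod.Name(),
					PluginName: plugin.Name(), // e.g. "kubernetes.io~secret"
					VolumeName: v.Name(),
					Path:       filepath.Join(volumesDir, v.Name()),
				})
			}
		}
	}
	return vols, nil
}

func main() {
	vols, err := reconstructVolumes("/var/lib/kubelet")
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, v := range vols {
		fmt.Printf("reconstructed %s/%s for pod %s at %s\n", v.PluginName, v.VolumeName, v.PodUID, v.Path)
	}
}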
jsafrane added a commit to jsafrane/kubernetes that referenced this issue Nov 30, 2016
With bug kubernetes#27653, kubelet could remove mounted volumes and delete user data.
The bug itself is fixed; however, our trust in kubelet is significantly lower.
Let's add an extra version of RemoveAll that does not cross mount boundaries
(rm -rf --one-file-system).

It calls lstat(path) three times for each removed directory: once in
RemoveAllOneFilesystem and twice in IsLikelyNotMountPoint. However, this way
it's platform independent, and the directory being removed by kubelet
should be almost empty.
jsafrane added a commit to jsafrane/kubernetes that referenced this issue Dec 1, 2016
jsafrane added a commit to jsafrane/kubernetes that referenced this issue Feb 1, 2017
jsafrane added a commit to jsafrane/kubernetes that referenced this issue Feb 28, 2017
k8s-github-robot pushed a commit that referenced this issue Mar 24, 2017
Automatic merge from submit-queue (batch tested with PRs 41139, 41186, 38882, 37698, 42034)

Make kubelet never delete files on mounted filesystems

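
A rough sketch of the one-filesystem deletion idea, assuming Linux device-ID semantics (the device ID changes at a mount boundary); removeAllOneFilesystem below is illustrative and deliberately simpler than the upstream RemoveAllOneFilesystem:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// deviceOf returns the device ID of path without following symlinks.
func deviceOf(path string) (uint64, error) {
	st, err := os.Lstat(path)
	if err != nil {
		return 0, err
	}
	return uint64(st.Sys().(*syscall.Stat_t).Dev), nil
}

// removeAllOneFilesystem removes path recursively, but returns an error
// instead of descending into anything on a different filesystem, so a
// still-mounted volume directory aborts the deletion rather than being
// emptied.
func removeAllOneFilesystem(path string, dev uint64) error {
	d, err := deviceOf(path)
	if err != nil {
		return err
	}
	if d != dev {
		return fmt.Errorf("refusing to delete %q: crosses a mount boundary", path)
	}
	if entries, err := os.ReadDir(path); err == nil {
		for _, e := range entries { // directory: remove children first
			if err := removeAllOneFilesystem(filepath.Join(path, e.Name()), dev); err != nil {
				return err
			}
		}
	}
	return os.Remove(path)
}

func main() {
	root := "/var/lib/kubelet/pods/example" // hypothetical pod directory
	dev, err := deviceOf(root)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(removeAllOneFilesystem(root, dev))
}

With a guard like this, the orphaned-pod cleanup loop fails loudly at the mount boundary instead of deleting user data.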
dims pushed a commit to dims/openstack-cloud-controller-manager that referenced this issue Sep 6, 2017
dims pushed a commit to dims/openstack-cloud-controller-manager that referenced this issue Jan 13, 2018