
Make volume manager more robust across restarts #27653

Closed
saad-ali opened this issue Jun 18, 2016 · 15 comments
Assignees
Labels
kind/documentation Categorizes issue or PR as related to documentation. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/storage Categorizes an issue or PR as relevant to SIG Storage.
Milestone

Comments

@saad-ali
Member

When kubelet is restarted, the volume manager loses its internal state. It rebuilds this state and calls attach/mount on volumes that may already be attached or mounted.

This issue is to track two items:

  1. Verify that calling mount on a volume that is already mounted does not fail (a minimal idempotency check is sketched after this list).
  2. If a pod was deleted while kubelet was down, the associated volume will not be unmounted, since kubelet would not know about the pod after restarting. The volume manager should therefore recover pre-attached volumes and populate the actual state of the world on startup so that they get cleaned up appropriately.
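
For item (1), the usual approach is to check whether the target directory is already a mount point before calling mount, and skip the mount if so. Below is a minimal Go sketch of that check, assuming Linux device-ID semantics; it is not the kubelet implementation, though kubelet's IsLikelyNotMountPoint relies on the same heuristic (which misses bind mounts on the same device, hence "likely").

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// isLikelyMountPoint reports whether path is probably a mount point by
// comparing its device ID with its parent directory's. Linux-only: the
// *syscall.Stat_t assertion does not hold on other platforms, and bind
// mounts on the same device are not detected.
func isLikelyMountPoint(path string) (bool, error) {
	st, err := os.Lstat(path)
	if err != nil {
		return false, err
	}
	parent, err := os.Lstat(filepath.Dir(path))
	if err != nil {
		return false, err
	}
	return st.Sys().(*syscall.Stat_t).Dev != parent.Sys().(*syscall.Stat_t).Dev, nil
}

func main() {
	// Hypothetical volume target; a real path would substitute the pod UID.
	target := "/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~secret/token"
	if mounted, err := isLikelyMountPoint(target); err == nil && mounted {
		fmt.Println("already mounted, skipping mount")
	} else {
		fmt.Println("not mounted (or stat failed), proceeding with mount")
	}
}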
@saad-ali saad-ali added sig/storage Categorizes an issue or PR as relevant to SIG Storage. team/cluster labels Jun 18, 2016
@saad-ali saad-ali added this to the next-candidate milestone Jun 18, 2016
@yujuhong
Contributor

/cc @kubernetes/sig-node @dchen1107

Item (2) will lead to irremovable pod directories (see #27651 for an example), and these directories will accumulate over time. Kubelet would also try to clean them up every 2 seconds, fail, and thereby waste cycles and spam the log.
@saad-ali, is it possible to add a temporary fix for v1.3?

@saad-ali
Member Author

Item (2) will lead to irremovable pod directories

Only if kubelet is restarted and a pod with volumes mounted is deleted during that period.

I agree it must be fixed, but I don't think it's a v1.3 blocker. It can wait for v1.3.1.

@matchstick for triage.

@yujuhong
Contributor

yujuhong commented Jun 20, 2016

Only if kubelet is restarted and a pod with volumes mounted is deleted during that period.

Almost every pod has a default secret volume, so this applies to most pods. The difference between this issue and many others is that it is not transient: it haunts kubelet even after restarts, and it only gets worse as time goes by.

I agree it must be fixed, but I don't think it's a v1.3 blocker. It can wait for v1.3.1.

Could this be documented as a known issue in the release notes if the decision is to table this for v1.3.1? Thanks!

@yujuhong
Contributor

/cc @freehan, this is why your kubelet log is full of error messages :)

@saad-ali saad-ali modified the milestones: v1.3, next-candidate Jun 20, 2016
@saad-ali
Member Author

Spoke with @matchstick. We're going to try to get it in for v1.3. If not, we'll punt it to post-1.3.

@saad-ali saad-ali self-assigned this Jun 20, 2016
@dchen1107 dchen1107 added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jun 20, 2016
@dchen1107
Member

Talked to both @yujuhong and @saad-ali. We should fix this for the 1.3 release; otherwise kubelet will quickly become useless due to running out of disk. I marked this as P1.

@yifan-gu
Contributor

FWIW, this is very easy to reproduce using local-up-cluster.sh. Basically: after local-up, create a pod, nuke the cluster, local-up again, and then you will see messages like kubelet.go:2018] Failed to remove orphaned pod "03a5ea59-371b-11e6-98cf-28d244b00276" dir; err: remove /var/lib/kubelet/pods/03a5ea59-371b-11e6-98cf-28d244b00276/volumes/kubernetes.io~secret/default-token-385u2: device or resource busy.

@saad-ali
Member Author

@jingxu97 is working on implementing a cleanup routine on startup.

@yujuhong
Contributor

@jingxu97

A very simple script to reproduce this issue in an e2e cluster:

#!/bin/bash

# List the e2e test minion nodes in the given zone.
NODES=( $(gcloud compute instances list --zones="${ZONE}" | awk '{print $1}' | grep e2e-test- | grep minion) )
KUBECTL="./cluster/kubectl.sh"

RC="
apiVersion: v1
kind: ReplicationController
metadata:
  name: foo
  labels:
    k8s-app: foo
spec:
  replicas: 90
  selector:
    k8s-app: foo
  template:
    metadata:
      labels:
        k8s-app: foo
    spec:
      containers:
      - image: gcr.io/google_containers/pause:2.0
        name: bar
"
# Create N pods.
echo "${RC}" | ${KUBECTL} create -f -

# Sleep a while and assume the pods will be running by then.
sleep 60

# Forcefully delete the pods.
${KUBECTL} delete rc --grace-period=0 foo

# Kill kubelet on each node.
# NOTE: Reproducing this issue is more reliable if killing kubelet
# overlaps with the RC deletion.
for NODE in "${NODES[@]}"; do
    echo "Killing kubelet on node ${NODE}"
    gcloud compute ssh "${NODE}" --command "sudo pkill kubelet"
done

# Give kubelet time to start up.
sleep 120

# Grep the error messages from the logs.
for NODE in "${NODES[@]}"; do
    echo "Checking error messages on node ${NODE}"
    gcloud compute ssh "${NODE}" --command "grep \"Failed to remove orphaned pod\" /var/log/kubelet.log"
done

@goltermann goltermann modified the milestones: next-candidate, v1.3 Jun 24, 2016
@goltermann goltermann added the kind/documentation Categorizes issue or PR as related to documentation. label Jun 24, 2016
@saad-ali
Member Author

The design for this should also handle VolumesInUse() gracefully: if a pod continues to run after kubelet is restarted, there should be no gap in its volumes being reported in the node's VolumesInUse status.
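
As a rough illustration of this requirement (all names here are hypothetical, not the volume manager's actual API), VolumesInUse can be computed as the union of the desired and actual/reconstructed volume sets, so that volumes recovered from disk after a restart are reported without a gap:

package main

import (
	"fmt"
	"sort"
)

// volumesInUse unions desired and actual volume names; the actual set is
// assumed to include volumes reconstructed from disk after a restart.
func volumesInUse(desired, actual map[string]bool) []string {
	seen := map[string]bool{}
	for v := range desired {
		seen[v] = true
	}
	for v := range actual {
		seen[v] = true
	}
	out := make([]string, 0, len(seen))
	for v := range seen {
		out = append(out, v)
	}
	sort.Strings(out)
	return out
}

func main() {
	desired := map[string]bool{"kubernetes.io/gce-pd/disk-1": true}
	actual := map[string]bool{
		"kubernetes.io/gce-pd/disk-1": true,
		"kubernetes.io/gce-pd/disk-2": true, // reconstructed; pod already deleted
	}
	fmt.Println(volumesInUse(desired, actual))
}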

@thockin
Member

thockin commented Jun 28, 2016

status on this?

@saad-ali
Member Author

Punted to post-1.3. Hopefully it will be part of a 1.3.x release.

@matchstick matchstick added priority/backlog Higher priority than priority/awaiting-more-evidence. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Jun 28, 2016
@matchstick
Contributor

OK, we need to make sure we don't lose track of this one, but it is not blocking the 1.3 release. Raising to P0 and removing from the 1.3 milestone.

@goltermann
Contributor

This is an issue marked v1.3 and kind/documentation. Can you get it fixed up this week?

k8s-github-robot pushed a commit that referenced this issue Aug 15, 2016
Automatic merge from submit-queue

Add volume reconstruct/cleanup logic in kubelet volume manager

Currently kubelet volume management works on the concept of desired
and actual worlds of state. The volume manager periodically compares the
two worlds and performs volume mount/unmount and/or attach/detach
operations. When kubelet restarts, the caches of those two worlds are
gone. Although the desired world can be recovered through the apiserver,
the actual world cannot, which may mean some volumes are never cleaned
up if their information has been deleted from the apiserver. This change
adds reconstruction of the actual world by reading the pod directories
from disk. The reconstructed volume information is added to both the
desired world and the actual world if it cannot be found in either. The
rest of the logic is the same as before: the desired world populator may
clean up a volume entry if it is no longer in the apiserver, and the
volume manager will then invoke unmount to clean it up.

Fixes #27653
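
A minimal sketch of the reconstruction idea described above, assuming the standard pod directory layout /var/lib/kubelet/pods/<pod-uid>/volumes/<plugin>/<volume-name> (visible in the error message earlier in this thread); the type and function names are illustrative, not the actual kubelet code:

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// reconstructedVolume is a hypothetical record of a volume found on disk.
type reconstructedVolume struct {
	PodUID, PluginName, VolumeName, Path string
}

// reconstructVolumes walks the pod directories under kubeletRoot and returns
// one record per volume directory it finds, so the volume manager can add
// them to its actual (and, if unknown, desired) state of the world.
func reconstructVolumes(kubeletRoot string) ([]reconstructedVolume, error) {
	var vols []reconstructedVolume
	podsDir := filepath.Join(kubeletRoot, "pods")
	pods, err := os.ReadDir(podsDir)
	if err != nil {
		return nil, err
	}
	for _, pod := range pods {
		pluginsDir := filepath.Join(podsDir, pod.Name(), "volumes")
		plugins, err := os.ReadDir(pluginsDir)
		if err != nil {
			continue // pod has no volumes directory
		}
		for _, plugin := range plugins {
			volumesDir := filepath.Join(pluginsDir, plugin.Name())
			volumes, err := os.ReadDir(volumesDir)
			if err != nil {
				continue
			}
			for _, v := range volumes {
				vols = append(vols, reconstructedVolume{
					PodUID:     pod.Name(),
					PluginName: plugin.Name(), // e.g. "kubernetes.io~secret"
					VolumeName: v.Name(),
					Path:       filepath.Join(volumesDir, v.Name()),
				})
			}
		}
	}
	return vols, nil
}

func main() {
	vols, err := reconstructVolumes("/var/lib/kubelet")
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, v := range vols {
		fmt.Printf("reconstructed %s/%s for pod %s at %s\n", v.PluginName, v.VolumeName, v.PodUID, v.Path)
	}
}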
jsafrane added a commit to jsafrane/kubernetes that referenced this issue Nov 30, 2016
With bug kubernetes#27653, kubelet could remove mounted volumes and delete user data.
The bug itself is fixed; however, our trust in kubelet is significantly lower.
Let's add an extra version of RemoveAll that does not cross mount boundaries
(rm -rf --one-file-system).

It calls lstat(path) three times for each removed directory: once in
RemoveAllOneFilesystem and twice in IsLikelyNotMountPoint. However, this way
it's platform independent, and the directory being removed by kubelet
should be almost empty.
jsafrane added a commit to jsafrane/kubernetes that referenced this issue Dec 1, 2016
jsafrane added a commit to jsafrane/kubernetes that referenced this issue Feb 1, 2017
jsafrane added a commit to jsafrane/kubernetes that referenced this issue Feb 28, 2017
k8s-github-robot pushed a commit that referenced this issue Mar 24, 2017
Automatic merge from submit-queue (batch tested with PRs 41139, 41186, 38882, 37698, 42034)

Make kubelet never delete files on mounted filesystems

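
A rough sketch of the one-filesystem deletion idea, assuming Linux device-ID semantics (the device ID changes at a mount boundary); removeAllOneFilesystem below is illustrative and deliberately simpler than the upstream RemoveAllOneFilesystem:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// deviceOf returns the device ID of path without following symlinks.
func deviceOf(path string) (uint64, error) {
	st, err := os.Lstat(path)
	if err != nil {
		return 0, err
	}
	return uint64(st.Sys().(*syscall.Stat_t).Dev), nil
}

// removeAllOneFilesystem removes path recursively, but returns an error
// instead of descending into anything on a different filesystem, so a
// still-mounted volume directory aborts the deletion rather than being
// emptied.
func removeAllOneFilesystem(path string, dev uint64) error {
	d, err := deviceOf(path)
	if err != nil {
		return err
	}
	if d != dev {
		return fmt.Errorf("refusing to delete %q: crosses a mount boundary", path)
	}
	if entries, err := os.ReadDir(path); err == nil {
		for _, e := range entries { // directory: remove children first
			if err := removeAllOneFilesystem(filepath.Join(path, e.Name()), dev); err != nil {
				return err
			}
		}
	}
	return os.Remove(path)
}

func main() {
	root := "/var/lib/kubelet/pods/example" // hypothetical pod directory
	dev, err := deviceOf(root)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(removeAllOneFilesystem(root, dev))
}

With a guard like this, the orphaned-pod cleanup loop fails loudly at the mount boundary instead of deleting user data.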
dims pushed a commit to dims/openstack-cloud-controller-manager that referenced this issue Sep 6, 2017
dims pushed a commit to dims/openstack-cloud-controller-manager that referenced this issue Jan 13, 2018