Volume status out of sync when kubelet restarts #33203
Comments
@kubernetes/sig-storage
Thanks for the detailed write up @jingxu97. We spoke offline. To summarize, P2 is already handled (see PR #28095). P1 is an issue we need to address: the correct way to handle it is to make sure that kubelet does not remove elements from
Automatic merge from submit-queue

Fix volume states out of sync problem after kubelet restarts

When kubelet restarts, all the information about the volumes is gone from the actual/desired states. When node status is then updated with mounted volumes, the volume list might be empty even though volumes are still mounted, in turn causing the master to detach those volumes because they are not in the mounted volumes list. This fix makes sure the mounted volumes list is only updated after the reconciler starts its sync states process. The sync states process scans the existing volume directories and reconstructs actual states if they are missing. This PR also fixes a problem with orphaned pods' directories: if a pod directory is unmounted but has not yet been deleted (e.g., interrupted by a kubelet restart), the cleanup routine will delete the directory so that the pod directory can be cleaned up (it is safe to delete the directory since it is no longer mounted). The third issue this PR fixes is that during volume reconstruction in the actual state, the mounter must not be nil, since it is required for creating container.VolumeMap; if it is nil, it might cause a nil pointer exception in kubelet. Detailed design proposal is #33203
#33616 already fixed this issue. Close it.
Problem
Currently, on the node side, node status is updated periodically by the kubelet. For volumes, the node updater retrieves the list of volumes in use (from the current desired and actual states) and writes it into node status. The master uses this information to determine whether it is safe to detach a volume. This information might be out of sync and cause the following problems:
P1. When kubelet restarts, all the information about the volumes is gone from the actual/desired states, and it takes some time to recover it. When node status is updated shortly after a kubelet restart, the mounted volumes list might be empty. If a pod is deleted during this window, the master will try to detach its volumes (the safe-to-detach test passes because node status shows they are not mounted) while they are still mounted. In turn, the reconstruct procedure may fail to reconstruct the volume for reasons such as the global mount path being gone; see PR #33207.
P2. The node status can be out of date, which might cause a race condition when the master and the node decide to perform detach and mount operations. The following is a sequence of events that results in detaching a volume while it is mounted.
Example
Proposed solution 1
Instead of relying on the periodic updates by kubelet, status updates are triggered when attach/detach/mount/unmount operations happen. Node status can be updated and read by both master and node: the master records which volumes are reported as attached to the node, and the node records which volumes are currently mounted (or about to be mounted). Of the four operations, detach and mount need to be issued carefully, because detaching a volume while it is mounted can cause file system corruption and data loss, whereas attach and unmount are normally safe operations that can be performed regardless of the current state.
Detach operation:
Mount operation:
Attach/Unmount operation:
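The common pattern across the unsafe operations above is update-then-verify: record the intended state in node status, verify no conflicting state is recorded, and only then issue the operation. A minimal sketch of that pattern, with hypothetical names (not the actual Kubernetes API):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// nodeStatus is a toy stand-in for the shared node status that both
// master and node read and update; all identifiers are illustrative.
type nodeStatus struct {
	mu      sync.Mutex
	mounted map[string]bool // volumes the node reports as mounted (or about to be)
}

// MarkMounted records the node's intent to mount before the mount
// operation is actually triggered.
func (ns *nodeStatus) MarkMounted(volume string) {
	ns.mu.Lock()
	defer ns.mu.Unlock()
	ns.mounted[volume] = true
}

// Detach follows the update-then-verify pattern: the master may only
// detach a volume after verifying the node does not report it mounted.
func (ns *nodeStatus) Detach(volume string) error {
	ns.mu.Lock()
	defer ns.mu.Unlock()
	if ns.mounted[volume] {
		// Verification failed: detaching a mounted volume risks file
		// system corruption, so back off and let the master retry later.
		return errors.New("volume still mounted; refusing to detach")
	}
	// ... issue the actual detach operation here ...
	return nil
}

func main() {
	ns := &nodeStatus{mounted: map[string]bool{}}
	ns.MarkMounted("vol-1")
	fmt.Println(ns.Detach("vol-1") != nil) // true: detach refused while mounted
	fmt.Println(ns.Detach("vol-2") == nil) // true: unmounted volume detaches fine
}
```

In the real system the "verify" step happens against the API server's copy of node status rather than a local struct, but the ordering guarantee is the same.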
Proposed solution 2
Kubelet already has an implementation that periodically updates node status to the API server (covering network, machine info, and volumes), so the separate on-demand updater required by proposal 1 might interfere with it without proper locking. Periodic updates are also beneficial because they can pack individual updates into a single one. In addition, the state recovery process (implemented by PR #27970) helps the kubelet volume controller sync its states with the true world after a kubelet restart. Once this sync states process has run at least once, the cached information should reflect the true state instead of just being empty, and from that point on it is safe to let the kubelet update node status with the cached information. Instead of triggering a node status update before each mount operation, this proposal uses the existing kubelet updater, which periodically retrieves the mounted volume list from the cached actual state and updates node status. The following changes are required:
Kubelet reconciler sync states process:
This syncStates process starts to periodically scan existing directories and recover states if needed, after kubelet restarts and the sources are ready. After the process has performed its first sync following kubelet startup, it sets a flag indicating that states have already been recovered.
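The flag mechanism can be sketched as follows; the field and method names here are illustrative, not the actual kubelet identifiers:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// reconciler is a minimal sketch of the kubelet volume reconciler
// described above.
type reconciler struct {
	statesHasBeenSynced int32 // set to 1 after the first syncStates pass
}

// syncStates stands in for the directory scan that reconstructs the
// actual state of the world after a kubelet restart.
func (r *reconciler) syncStates() {
	// ... scan existing volume directories and reconstruct actual state ...
	atomic.StoreInt32(&r.statesHasBeenSynced, 1)
}

// StatesHasBeenSynced is what the node status updater checks before
// trusting the cached mounted-volume list.
func (r *reconciler) StatesHasBeenSynced() bool {
	return atomic.LoadInt32(&r.statesHasBeenSynced) == 1
}

func main() {
	r := &reconciler{}
	fmt.Println(r.StatesHasBeenSynced()) // false: cache may still be empty
	r.syncStates()
	fmt.Println(r.StatesHasBeenSynced()) // true: safe to report volumes in use
}
```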
Kubelet updater:
Check the flag set by the kubelet reconciler and update the VolumesInUse list in node status only after this flag is set to true. This ensures the list reflects the truly mounted volumes even right after kubelet starts with an emptied cache, because by the time the flag is set the cache has already been recovered.
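A sketch of that updater gate, using a hypothetical slice of the volume manager interface (names are simplified stand-ins for the real kubelet types):

```go
package main

import "fmt"

// volumeManager is a hypothetical subset of the kubelet volume manager;
// only what the node status updater needs is shown.
type volumeManager interface {
	ReconcilerStatesHasBeenSynced() bool
	GetVolumesInUse() []string
}

// volumesInUseForNodeStatus returns the list to publish in node status.
// The second return value is false when the reconciler has not yet
// completed its first sync after a restart, in which case the updater
// should leave the previously reported VolumesInUse list untouched
// rather than overwrite it with an empty cache.
func volumesInUseForNodeStatus(vm volumeManager) ([]string, bool) {
	if !vm.ReconcilerStatesHasBeenSynced() {
		return nil, false
	}
	return vm.GetVolumesInUse(), true
}

// fakeVM is a trivial test double.
type fakeVM struct {
	synced  bool
	volumes []string
}

func (f *fakeVM) ReconcilerStatesHasBeenSynced() bool { return f.synced }
func (f *fakeVM) GetVolumesInUse() []string           { return f.volumes }

func main() {
	vm := &fakeVM{synced: false, volumes: []string{"vol-1"}}
	_, ok := volumesInUseForNodeStatus(vm)
	fmt.Println(ok) // false: skip the update, cache not yet recovered
	vm.synced = true
	vols, ok := volumesInUseForNodeStatus(vm)
	fmt.Println(ok, vols) // true [vol-1]
}
```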
Mount operation:
Since we cannot update on demand in this approach, a mount operation has to wait until the volume has been published to node status before it is triggered. The ReportAsInUse in step 4 above serves this purpose: it makes sure node status is updated before the mount operation is issued. Once this condition is guaranteed, the steps listed for the mount operation in proposal 1 are basically the same.
Actual state of world:
Each time a volume is unmounted globally, add it to the list UnmountVolumesSinceLastUpdate. When this list is retrieved by the kubelet, its content is cleared.
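This retrieve-and-clear list might look like the following sketch; the real actual-state-of-world cache holds much more, and the method names here are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// actualStateOfWorld sketches only the retrieve-and-clear list
// described above.
type actualStateOfWorld struct {
	mu                             sync.Mutex
	unmountVolumesSinceLastUpdate []string
}

// MarkVolumeUnmounted records a globally unmounted volume.
func (asw *actualStateOfWorld) MarkVolumeUnmounted(name string) {
	asw.mu.Lock()
	defer asw.mu.Unlock()
	asw.unmountVolumesSinceLastUpdate = append(asw.unmountVolumesSinceLastUpdate, name)
}

// GetAndClearUnmountedVolumes returns the volumes unmounted since the
// last node status update and clears the list, so each unmount is
// reported exactly once.
func (asw *actualStateOfWorld) GetAndClearUnmountedVolumes() []string {
	asw.mu.Lock()
	defer asw.mu.Unlock()
	out := asw.unmountVolumesSinceLastUpdate
	asw.unmountVolumesSinceLastUpdate = nil
	return out
}

func main() {
	asw := &actualStateOfWorld{}
	asw.MarkVolumeUnmounted("vol-1")
	asw.MarkVolumeUnmounted("vol-2")
	fmt.Println(asw.GetAndClearUnmountedVolumes()) // [vol-1 vol-2]
	fmt.Println(asw.GetAndClearUnmountedVolumes()) // []
}
```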
Verify the approach
Problem 1: Kubelet restarts
The periodic-updates approach needs to retrieve the current cache information stored in the actual/desired states of the world; when kubelet restarts, the cache is empty, which causes the problem. Proposed approach 1 no longer uses periodic updates, so an empty cache after a kubelet restart does not affect node status. For the mount operation, since the status is updated before the operation is triggered, if kubelet restarts during this period node status might show the volume as mounted (although in truth it is not yet). This situation might delay the master if it needs to detach, but it is safe. Proposed approach 2 also solves the problem because VolumesInUse is no longer wholesale replaced by the empty cache; instead, the list is only updated after the cache is recovered.
Problem 2: Race condition caused by out-of-date status
The key idea in the proposal is to mark and update status before triggering the detach/mount operations. No matter what the sequence of events, detaching while a volume is mounted cannot happen. This can be proved by contradiction: if detaching while the volume is mounted did happen, then before triggering those operations the master and node must already have marked node status as showing the volume detached and not mounted; but it is not possible for both the master's and the node's verification to pass before issuing the operations. It also won't cause deadlock: if the verification in step 3 fails, the node status is updated back to its previous value and master and node will try again.
Example
Using the event sequence above as an example, event d will fail verification because node status shows that the volume is already being mounted, so detach will not be issued.
Comparison between Proposal 1 and 2
Proposal 1
Pros: the logic is clean and very similar to the master's.
Cons: compared to proposal 2, more updates to the API server are incurred (each volume mount triggers one). More work is needed to synchronize the kubelet periodic updater with the on-demand updater.
Proposal 2
Pros: packs multiple updates into one; the kubelet updater is already implemented.
Cons: the logic is a little more complicated and needs more careful design. Mount operations might wait up to one update period (7 seconds) to start.
Conclusion
Both proposals have pros and cons in different aspects. Because PR #28095 already implemented part of proposal 2, we decided to move forward with proposal 2 considering the amount of code changes.