Volume status out of sync when kubelet restarts #33203
Comments
@kubernetes/sig-storage
Thanks for the detailed write up @jingxu97. We spoke offline. To summarize, P2 is already handled (see PR #28095). P1 is an issue we need to address: the correct way to handle it is to make sure that kubelet does not remove elements from
Automatic merge from submit-queue

Fix volume states out of sync problem after kubelet restarts

When kubelet restarts, all the information about the volumes is gone from the actual/desired states. When node status is then updated with mounted volumes, the volume list might be empty even though volumes are still mounted, in turn causing the master to detach those volumes because they are not in the mounted volumes list. This fix makes sure the mounted volumes list is only updated after the reconciler starts its sync states process. The sync states process scans the existing volume directories and reconstructs actual states if they are missing. This PR also fixes a problem with orphaned pods' directories: if a pod directory is unmounted but has not yet been deleted (e.g., interrupted by a kubelet restart), the cleanup routine will delete the directory so that the pod directory can be cleaned up (it is safe to delete the directory since it is no longer mounted). The third issue this PR fixes is that during volume reconstruction in the actual state, the mounter must not be nil, since it is required for creating container.VolumeMap; if it is nil, it might cause a nil pointer exception in kubelet. Detailed design proposal is #33203
#33616 already fixed this issue. Close it.
Problem
Currently, on the node side, node status is updated periodically by the kubelet. For volumes, the node updater retrieves the list of volumes in use (from the current desired and actual states) and writes it into node status. The master uses this information to determine whether it is safe to detach a volume. This information might be out of sync and cause the following problems:
P1. When kubelet restarts, all the information about the volumes is gone from the actual/desired states, and it takes some time to recover it. When node status is updated shortly after a kubelet restart, the mounted volumes list might be empty. If a pod is deleted during this window, the master will try to detach its volumes (the safe-to-detach test passes because node status shows they are not mounted) while they are still mounted. In turn, the reconstruct procedure may fail to reconstruct the volume for reasons such as the global mount path being gone; see PR #33207.
P2. The node status can be out of date, which might cause a race condition when the master and the node decide to perform detach and mount operations. The following is a sequence of events that results in detaching a volume while it is mounted.
Example
Proposed solution 1
Instead of relying on the periodic updates by kubelet, status updates are triggered when attach/detach/mount/unmount operations happen. Node status can be updated and read by both master and node: the master records which volumes are reported as attached to the node, and the node records which volumes are currently mounted (or about to be mounted). Of the four operations, detach and mount need to be issued carefully, because detaching a volume while it is mounted can cause file system corruption and data loss, whereas attach and unmount are normally safe operations that can be performed regardless of the current state.
Detach operation:
Mount operation:
Attach/Unmount operation:
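The common pattern across the unsafe operations above is update-then-verify: record the intended state in node status, verify no conflicting state is recorded, and only then issue the operation. A minimal sketch of that pattern, with hypothetical names (not the actual Kubernetes API):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// nodeStatus is a toy stand-in for the shared node status that both
// master and node read and update; all identifiers are illustrative.
type nodeStatus struct {
	mu      sync.Mutex
	mounted map[string]bool // volumes the node reports as mounted (or about to be)
}

// MarkMounted records the node's intent to mount before the mount
// operation is actually triggered.
func (ns *nodeStatus) MarkMounted(volume string) {
	ns.mu.Lock()
	defer ns.mu.Unlock()
	ns.mounted[volume] = true
}

// Detach follows the update-then-verify pattern: the master may only
// detach a volume after verifying the node does not report it mounted.
func (ns *nodeStatus) Detach(volume string) error {
	ns.mu.Lock()
	defer ns.mu.Unlock()
	if ns.mounted[volume] {
		// Verification failed: detaching a mounted volume risks file
		// system corruption, so back off and let the master retry later.
		return errors.New("volume still mounted; refusing to detach")
	}
	// ... issue the actual detach operation here ...
	return nil
}

func main() {
	ns := &nodeStatus{mounted: map[string]bool{}}
	ns.MarkMounted("vol-1")
	fmt.Println(ns.Detach("vol-1") != nil) // true: detach refused while mounted
	fmt.Println(ns.Detach("vol-2") == nil) // true: unmounted volume detaches fine
}
```

In the real system the "verify" step happens against the API server's copy of node status rather than a local struct, but the ordering guarantee is the same.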
Proposed solution 2
Kubelet already has an implementation that periodically updates node status to the API server (covering network, machine info, and volumes), so the separate on-demand updater required by proposal 1 might interfere with it without proper locking. Periodic updates are also beneficial because they can pack individual updates into a single one. In addition, the state recovery process (implemented by PR #27970) helps the kubelet volume controller sync its states with the true world after a kubelet restart. Once this sync states process has run at least once, the cached information should reflect the true state instead of just being empty, and from that point on it is safe to let the kubelet update node status with the cached information. Instead of triggering a node status update before each mount operation, this proposal uses the existing kubelet updater, which periodically retrieves the mounted volume list from the cached actual state and updates node status. The following changes are required:
Kubelet reconciler sync states process:
This syncStates process starts to periodically scan existing directories and recover states if needed, after kubelet restarts and the sources are ready. After the process has performed its first sync following kubelet startup, it sets a flag indicating that states have already been recovered.
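The flag mechanism can be sketched as follows; the field and method names here are illustrative, not the actual kubelet identifiers:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// reconciler is a minimal sketch of the kubelet volume reconciler
// described above.
type reconciler struct {
	statesHasBeenSynced int32 // set to 1 after the first syncStates pass
}

// syncStates stands in for the directory scan that reconstructs the
// actual state of the world after a kubelet restart.
func (r *reconciler) syncStates() {
	// ... scan existing volume directories and reconstruct actual state ...
	atomic.StoreInt32(&r.statesHasBeenSynced, 1)
}

// StatesHasBeenSynced is what the node status updater checks before
// trusting the cached mounted-volume list.
func (r *reconciler) StatesHasBeenSynced() bool {
	return atomic.LoadInt32(&r.statesHasBeenSynced) == 1
}

func main() {
	r := &reconciler{}
	fmt.Println(r.StatesHasBeenSynced()) // false: cache may still be empty
	r.syncStates()
	fmt.Println(r.StatesHasBeenSynced()) // true: safe to report volumes in use
}
```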
Kubelet updater:
Check the flag set by the kubelet reconciler and update the VolumesInUse list in node status only after this flag is set to true. This ensures the list reflects the truly mounted volumes even right after kubelet starts with an emptied cache, because by the time the flag is set the cache has already been recovered.
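A sketch of that updater gate, using a hypothetical slice of the volume manager interface (names are simplified stand-ins for the real kubelet types):

```go
package main

import "fmt"

// volumeManager is a hypothetical subset of the kubelet volume manager;
// only what the node status updater needs is shown.
type volumeManager interface {
	ReconcilerStatesHasBeenSynced() bool
	GetVolumesInUse() []string
}

// volumesInUseForNodeStatus returns the list to publish in node status.
// The second return value is false when the reconciler has not yet
// completed its first sync after a restart, in which case the updater
// should leave the previously reported VolumesInUse list untouched
// rather than overwrite it with an empty cache.
func volumesInUseForNodeStatus(vm volumeManager) ([]string, bool) {
	if !vm.ReconcilerStatesHasBeenSynced() {
		return nil, false
	}
	return vm.GetVolumesInUse(), true
}

// fakeVM is a trivial test double.
type fakeVM struct {
	synced  bool
	volumes []string
}

func (f *fakeVM) ReconcilerStatesHasBeenSynced() bool { return f.synced }
func (f *fakeVM) GetVolumesInUse() []string           { return f.volumes }

func main() {
	vm := &fakeVM{synced: false, volumes: []string{"vol-1"}}
	_, ok := volumesInUseForNodeStatus(vm)
	fmt.Println(ok) // false: skip the update, cache not yet recovered
	vm.synced = true
	vols, ok := volumesInUseForNodeStatus(vm)
	fmt.Println(ok, vols) // true [vol-1]
}
```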
Mount operation:
Since we cannot update on demand in this approach, a mount operation has to wait until the volume has been published to node status before it is triggered. The ReportAsInUse in step 4 above serves this purpose: it makes sure node status is updated before the mount operation is issued. Once this condition is guaranteed, the steps listed for the mount operation in proposal 1 are basically the same.
Actual state of world:
Each time a volume is unmounted globally, add it to the list UnmountVolumesSinceLastUpdate. When this list is retrieved by the kubelet, its content is cleared.
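This retrieve-and-clear list might look like the following sketch; the real actual-state-of-world cache holds much more, and the method names here are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// actualStateOfWorld sketches only the retrieve-and-clear list
// described above.
type actualStateOfWorld struct {
	mu                             sync.Mutex
	unmountVolumesSinceLastUpdate []string
}

// MarkVolumeUnmounted records a globally unmounted volume.
func (asw *actualStateOfWorld) MarkVolumeUnmounted(name string) {
	asw.mu.Lock()
	defer asw.mu.Unlock()
	asw.unmountVolumesSinceLastUpdate = append(asw.unmountVolumesSinceLastUpdate, name)
}

// GetAndClearUnmountedVolumes returns the volumes unmounted since the
// last node status update and clears the list, so each unmount is
// reported exactly once.
func (asw *actualStateOfWorld) GetAndClearUnmountedVolumes() []string {
	asw.mu.Lock()
	defer asw.mu.Unlock()
	out := asw.unmountVolumesSinceLastUpdate
	asw.unmountVolumesSinceLastUpdate = nil
	return out
}

func main() {
	asw := &actualStateOfWorld{}
	asw.MarkVolumeUnmounted("vol-1")
	asw.MarkVolumeUnmounted("vol-2")
	fmt.Println(asw.GetAndClearUnmountedVolumes()) // [vol-1 vol-2]
	fmt.Println(asw.GetAndClearUnmountedVolumes()) // []
}
```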
Verify the approach
Problem 1: Kubelet restarts
The periodic-updates approach needs to retrieve the current cache information stored in the actual/desired states of the world; when kubelet restarts, the cache is empty, which causes the problem. Proposed approach 1 no longer uses periodic updates, so an empty cache after a kubelet restart does not affect node status. For the mount operation, since the status is updated before the operation is triggered, if kubelet restarts during this period node status might show the volume as mounted (although in truth it is not yet). This situation might delay the master if it needs to detach, but it is safe. Proposed approach 2 also solves the problem because VolumesInUse is no longer wholesale replaced by the empty cache; instead, the list is only updated after the cache is recovered.
Problem 2: Race condition caused by out-of-date status
The key idea in the proposal is to mark and update status before triggering the detach/mount operations. No matter what the sequence of events, detaching while a volume is mounted cannot happen. This can be proved by contradiction: if detaching while the volume is mounted did happen, then before triggering those operations the master and node must already have marked node status as showing the volume detached and not mounted; but it is not possible for both the master's and the node's verification to pass before issuing the operations. It also won't cause deadlock: if the verification in step 3 fails, the node status is updated back to its previous value and master and node will try again.
Example
Using the event sequence above as an example, event d will fail verification because node status shows that the volume is already being mounted, so detach will not be issued.
Comparison between Proposal 1 and 2
Proposal 1
Pros: the logic is clean and very similar to the master's.
Cons: compared to proposal 2, more updates to the API server are incurred (each volume mount triggers one). More work is needed to synchronize the kubelet periodic updater with the on-demand updater.
Proposal 2
Pros: packs multiple updates into one; the kubelet updater is already implemented.
Cons: the logic is a little more complicated and needs more careful design. Mount operations might wait up to one update period (7 seconds) to start.
Conclusion
Both proposals have pros and cons in different aspects. Because PR #28095 already implemented part of proposal 2, we decided to move forward with proposal 2 considering the amount of code changes.