Volume status out of sync when kubelet restarts #33203

Closed
jingxu97 opened this issue Sep 21, 2016 · 4 comments

jingxu97 commented Sep 21, 2016

Problem

Currently, on the node side, the node status is updated periodically by the kubelet. For volumes, the node updater retrieves the list of volumes in use (from the current desired and actual states) and publishes it. The master uses this information to determine whether it is safe to detach a volume. This information might be out of sync, causing the following problems:

P1. When the kubelet restarts, all information about the volumes is gone from the actual/desired states, and it takes some time to recover it. When the node status is updated right after a kubelet restart, the mounted-volumes list might therefore be empty. If a pod is deleted during this window, the master will try to detach its volumes while they are still mounted (the safe-to-detach check passes because the node status shows them as not mounted). In turn, the reconstruction procedure may fail to reconstruct the volume for reasons such as the global mount path being gone; see PR #33207.

P2. The node status can be out of date, which might cause a race condition when the master and the node decide to perform detach and mount operations. The following sequence of events results in detaching a volume while it is still mounted.

Example

  • a. Initially the volume is attached and mounted; both master and node have updated the node status to show the volume as attached (reportedAsAttached) and mounted (volumeInUse)
  • b. On the node, the volume is unmounted (UnmountDevice finishes), and the kubelet updates the node status to show that the volume is no longer mounted
  • c. On the node, a mount operation is triggered (the "verify volume is attached" check passes because the master has not changed the status so far)
  • d. On the master, a detach operation is triggered (the "verify volume is not mounted" check passes because of step b)

Proposed solution 1

Instead of relying on the kubelet's periodic updates, status updates are triggered when attach/detach/mount/unmount operations happen. The node status can be updated and read by both master and node: the master updates which volumes are reportedAsAttached to the node, and the node updates which volumes are currently mounted (or about to be mounted). Of the four operations, detach and mount need to be issued carefully, because detaching while a volume is mounted might cause file system corruption and data loss, whereas attach and unmount are normally safe regardless of the current state. A runnable sketch of the detach and mount flows follows the operation lists below.

Detach operation:

  1. Check whether it is safe to detach (according to the node status, whether the volume is still mounted -- this information might be out of date). If it is safe, move to step 2. Otherwise, return
  2. Mark the volume as detached (remove it from the reportedAsAttached list), update node status
  3. Verify the node status from the API server again. If the volume is not mounted, go to step 4. Otherwise, add the volume back to the reportedAsAttached list, update the node status, and abort the detach operation
  4. Issue Detach operation.

Mount operation:

  1. Check whether the volume is attached according to the node status. If it is reportedAsAttached, move to step 2. Otherwise, return
  2. Mark the volume as mounted (add it to the VolumeInUse list in the node status), update node status.
  3. Verify whether the device path exists. If it does, go to step 4. Otherwise, remove the volume from the VolumeInUse list, update node status, and abort the mount operation.
  4. Issue Mount operation

Attach/Unmount operation:

  1. Issue the operation.
  2. After the operation finishes successfully, update the node status to indicate the volume is attached or no longer mounted
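
Below is a minimal, runnable Go sketch of the mark-then-verify pattern described in the operation lists above. The NodeStatus struct and every helper here are illustrative stand-ins only, not the real attach/detach controller or kubelet code; in the real system the two sides coordinate through the node status object in the API server, whose serialized updates the mutex loosely models.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// NodeStatus is a toy stand-in for the volume-related fields of the node
// status object. In the real system it lives in the API server and is
// read and written by both the master and the kubelet.
type NodeStatus struct {
	mu                 sync.Mutex      // models the API server serializing updates
	reportedAsAttached map[string]bool // written by the master
	volumesInUse       map[string]bool // written by the node
}

// Detach follows steps 1-4 of the detach operation above.
func (ns *NodeStatus) Detach(vol string) error {
	ns.mu.Lock()
	defer ns.mu.Unlock()
	if ns.volumesInUse[vol] { // step 1: not safe while mounted
		return errors.New(vol + ": still in use, skipping detach")
	}
	delete(ns.reportedAsAttached, vol) // step 2: mark detached, update status
	if ns.volumesInUse[vol] {          // step 3: re-verify the fresh status
		ns.reportedAsAttached[vol] = true // roll back and abort
		return errors.New(vol + ": became in-use, aborting detach")
	}
	fmt.Println("detaching", vol) // step 4: issue the operation
	return nil
}

// Mount follows steps 1-4 of the mount operation above.
func (ns *NodeStatus) Mount(vol string, devicePathExists bool) error {
	ns.mu.Lock()
	defer ns.mu.Unlock()
	if !ns.reportedAsAttached[vol] { // step 1: must be reportedAsAttached
		return errors.New(vol + ": not reported as attached")
	}
	ns.volumesInUse[vol] = true // step 2: mark mounted, update status
	if !devicePathExists {      // step 3: verify the device path exists
		delete(ns.volumesInUse, vol) // roll back and abort
		return errors.New(vol + ": device path missing, aborting mount")
	}
	fmt.Println("mounting", vol) // step 4: issue the operation
	return nil
}

func main() {
	ns := &NodeStatus{
		reportedAsAttached: map[string]bool{"vol-1": true},
		volumesInUse:       map[string]bool{},
	}
	fmt.Println(ns.Mount("vol-1", true)) // succeeds: vol-1 is attached
	fmt.Println(ns.Detach("vol-1"))      // aborts: vol-1 is now in use
}
```

Because each side publishes its mark (step 2) before re-verifying (step 3), at least one of two racing detach and mount attempts is guaranteed to observe the other's mark and abort, which is the property argued by contradiction in the verification section below.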

Proposed solution 2

The kubelet already has an implementation that periodically updates the node status to the API server (it includes updates for network, machine info, and volumes). A separate on-demand updater, as required by proposal 1, might interfere with it without proper locking. Periodic updates are also beneficial because they can pack individual updates into a single one. In addition, the state recovery process (implemented by PR #27970) helps the kubelet volume controller sync its states with the real world after a kubelet restart, so once this sync has run at least once, the cached information reflects the true state instead of being empty. After that point it is safe to let the kubelet update the node status with the cached information. Instead of triggering a node status update before each mount operation, this proposal uses the existing kubelet updater, which periodically retrieves the mounted-volume list from the cached actual state and updates the node status. The following changes are required (a sketch of the gating follows them):

Kubelet reconciler sync states process:
This syncStates process starts to periodically scan existing directories and recover states if needed after the kubelet restarts and sources are ready. After its first sync following the kubelet start, it sets a flag indicating that states have already been recovered.

Kubelet updater:
Check the flag set by the kubelet reconciler and update the VolumeInUse list in the node status only after this flag is set to true. This ensures the list reflects the truly mounted volumes even right after a kubelet restart, when the cache has been emptied, because by then the cache has already been recovered.

Mount operation:
Since we cannot update on demand in this approach, the mount operation has to wait until the volume has been published to the node status before it is triggered. The ReportAsInUse mechanism described above serves this purpose, making sure the node status is updated before the mount operation is issued. Once this condition is guaranteed, the mount operation steps are essentially the same as in proposal 1.

Actual_State_of_world:
Each time a volume is unmounted globally, add it to the UnmountVolumesSinceLastUpdate list. When the kubelet retrieves this list, it is cleared.
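
A small Go sketch of this gating may help; the names here (actualState, statesSynced, unmountedVolumesSinceLastUpdate, and so on) are illustrative, not the kubelet's real types, and are only meant to show the flag-then-publish ordering under the assumptions above.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// actualState is a toy stand-in for the kubelet's actual state of the
// world cache. statesSynced is the flag the reconciler sets after its
// first successful syncStates pass following a kubelet restart.
type actualState struct {
	mu                              sync.Mutex
	statesSynced                    atomic.Bool
	mountedVolumes                  map[string]bool
	unmountedVolumesSinceLastUpdate []string
}

// markStatesSynced is called by the reconciler once it has scanned the
// existing volume directories and reconstructed the cache.
func (as *actualState) markStatesSynced() { as.statesSynced.Store(true) }

// globalUnmount records a global unmount so the next status update can
// still report it, per the UnmountVolumesSinceLastUpdate list above.
func (as *actualState) globalUnmount(vol string) {
	as.mu.Lock()
	defer as.mu.Unlock()
	delete(as.mountedVolumes, vol)
	as.unmountedVolumesSinceLastUpdate = append(as.unmountedVolumesSinceLastUpdate, vol)
}

// volumesInUseForNodeStatus is called by the periodic node status
// updater. Before the first sync it returns ok=false, so the updater
// leaves VolumeInUse untouched instead of publishing an empty, not yet
// recovered list. Retrieval clears the unmounted-since-last-update list.
func (as *actualState) volumesInUseForNodeStatus() (vols []string, ok bool) {
	if !as.statesSynced.Load() {
		return nil, false
	}
	as.mu.Lock()
	defer as.mu.Unlock()
	for v := range as.mountedVolumes {
		vols = append(vols, v)
	}
	as.unmountedVolumesSinceLastUpdate = nil
	return vols, true
}

func main() {
	as := &actualState{mountedVolumes: map[string]bool{"vol-1": true}}
	if _, ok := as.volumesInUseForNodeStatus(); !ok {
		fmt.Println("cache not recovered yet; keeping previous VolumeInUse")
	}
	as.markStatesSynced() // reconciler finished its first sync
	vols, _ := as.volumesInUseForNodeStatus()
	fmt.Println("publish VolumeInUse:", vols)
}
```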

Verify the approach

Problem 1: Kubelet restarts
The periodic-update approach needs to retrieve the cached information stored in the actual/desired states of the world. When the kubelet restarts, the cache is empty, which causes the problem. Proposed approach 1 no longer uses periodic updates, so an empty cache after a kubelet restart does not affect the node status. For the mount operation, since the status is updated before the operation is triggered, if the kubelet restarts during this window the node status might show the volume as mounted although it is not yet; this can delay a detach the master needs to perform, but it is safe. Proposed approach 2 also solves the problem, because VolumeInUse is no longer simply replaced with the empty cache: the list is only updated after the cache has been recovered.

Problem 2: Race condition caused by out-of-date status
The key idea in the proposal is to mark and update the status before triggering the detach/mount operations. No matter what the sequence of events, detaching a volume while it is mounted cannot happen. This can be proved by contradiction: if it happened, then before triggering the operations the master and node must already have marked the node status to show the volume as detached and not mounted, so it is impossible for both the master's and the node's verifications to pass before the operations are issued. Nor can this cause a deadlock: if the verification in step 3 fails, the node status is rolled back to its previous value and the master and node simply retry.

Example
Using the event sequence above as an example, event d fails verification because the node status shows that the volume is already mounted, so the detach is not issued.

  • a. Initially the volume is attached and mounted; both master and node have updated the node status to show the volume as attached (reportedAsAttached) and mounted (volumeInUse)
  • b. On the node, the volume is unmounted (UnmountDevice finishes), and the kubelet updates the node status to show that the volume is no longer mounted
  • c. On the node, the "verify volume is attached" check passes because the master has not changed the status so far. Before the mount operation is issued, the node status is updated to show the volume as mounted (in the VolumeInUse list)
  • d. On the master, the node status is updated to show the volume as not attached. Before the detach is issued, the "verify volume is not mounted" check FAILS because of event c. The node status is updated back to show the volume as attached, and the detach operation is aborted

Comparison between Proposal 1 and 2

Proposal 1
Pros: the logic is clean and very similar to the master's
Cons: More updates to the API server (each volume mount triggers one) compared to Proposal 2, and more work is needed to synchronize the kubelet's periodic updater with the on-demand updater.
Proposal 2
Pros: Packs multiple updates into one; the kubelet updater is already implemented
Cons: The logic is a little more complicated and needs more careful design, and mount operations might wait up to one update period (7 seconds) to start.

Conclusion

Both proposals have their pros and cons in different aspects. Because PR #28095 already implements part of proposal 2, we decided to move forward with proposal 2, considering the amount of code changes required.

jingxu97 added the sig/storage label Sep 21, 2016
k8s-github-robot added the area/kubelet and sig/node labels Sep 21, 2016
jingxu97 self-assigned this Sep 21, 2016

rootfs commented Sep 22, 2016

@kubernetes/sig-storage

jingxu97 changed the title from "Volume status might not be updated correctly after kubelet restart" to "Volume status out of sync problem" Sep 22, 2016
saad-ali self-assigned this Sep 22, 2016
saad-ali commented

Thanks for the detailed write-up @jingxu97. We spoke offline. To summarize, P2 is already handled (see PR #28095). P1 is an issue we need to address: the correct way to handle it is to make sure that the kubelet does not remove elements from NodeStatus.VolumesInUse unless explicitly required to do so for a successful UnmountDevice operation. There are a couple of different ways to do this with various pros and cons. @jingxu97 will write up these proposals and select which design to implement.

saad-ali added the priority/important-soon label Sep 26, 2016
saad-ali added this to the v1.5 milestone Sep 26, 2016
jingxu97 changed the title from "Volume status out of sync problem" to "Volume status out of sync when kubelet restarts" Sep 29, 2016
jingxu97 added a commit to jingxu97/kubernetes that referenced this issue Sep 29, 2016
When the kubelet restarts, all information about the volumes is gone from the actual/desired states. When the node status is then updated with mounted volumes, the volume list might be empty although volumes are still mounted, in turn causing the master to detach those volumes since they are not in the mounted-volumes list. This fix makes sure the mounted-volumes list is only updated after the reconciler starts its sync states process, which scans the existing volume directories and reconstructs the actual states if they are missing.

This PR also fixes a problem with orphaned pods' directories. In case a pod directory has been unmounted but not yet deleted (e.g., interrupted by a kubelet restart), the cleanup routine will delete the directory so that the pod directory can be cleaned up (this is safe since it is no longer mounted).

Details are in proposal PR kubernetes#33203
jingxu97 added a commit to jingxu97/kubernetes that referenced this issue Oct 4, 2016
jingxu97 added a commit to jingxu97/kubernetes that referenced this issue Oct 6, 2016
In addition to the fixes described in the commit message above, this commit ensures that the mounter is not nil when reconstructing a volume in the actual state, since the mounter is required for creating container.VolumeMap; a nil mounter might cause a nil pointer exception in the kubelet.

Details are in proposal PR kubernetes#33203
jingxu97 added a commit to jingxu97/kubernetes that referenced this issue Oct 25, 2016
k8s-github-robot pushed a commit that referenced this issue Oct 25, 2016
Automatic merge from submit-queue

Fix volume states out of sync problem after kubelet restarts

Detailed design proposal is #33203
jingxu97 added a commit to jingxu97/kubernetes that referenced this issue Oct 28, 2016

dims commented Nov 16, 2016

This needs to be triaged as a release-blocker or not for 1.5 @jingxu97 @saad-ali

jingxu97 commented

#33616 already fixed this issue. Closing it.

shyamjvs pushed a commit to shyamjvs/kubernetes that referenced this issue Dec 1, 2016