Cinder volume is detached from running pod. #19602

Closed
jsafrane opened this Issue Jan 13, 2016 · 10 comments

jsafrane commented Jan 13, 2016

There is a race between detaching a volume when an old pod is being destroyed and attaching the same volume to a new pod on the same node:

  1. pod A happily uses a volume
  2. pod A is deleted (and kubelet starts to clean up the stuff)
  3. pod B starts, with the same volume (and kubelet starts to set it up...)

At this point the volume is still attached to the node, and kubelet is really slow at cleaning up the old pod, so the following can happen:

  4. kubelet attaches the volume to the node, because pod B will need it. This does not do anything, because the volume is already attached there.
  5. kubelet mounts the volume to pod B directory (which is OK)
  6. kubelet unmounts the volume from pod A directory (again, OK)
  7. kubelet detaches the volume from the node, because pod A is being destroyed. This detaches the volume, while it's still mounted into pod B directory
  8. kubelet starts pod B containers -> problems.

There should be some usage counter, so kubelet just decreases the "reference count" at step 7 instead of detaching the volume. A similar approach is already present in the AWS code.
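
For illustration, a minimal sketch of the reference-counting idea in Go (hypothetical names, not the actual kubelet or AWS code):

```go
package volumeutil // illustrative only

import "sync"

// volumeRefCounts tracks how many pods on this node currently use each
// attached volume, keyed by the cloud volume ID.
var (
	mu              sync.Mutex
	volumeRefCounts = map[string]int{}
)

// acquireVolume would be called from the setup path before mounting.
func acquireVolume(volumeID string) {
	mu.Lock()
	defer mu.Unlock()
	volumeRefCounts[volumeID]++
}

// releaseVolume would be called from the teardown path; it detaches only
// when the last user of the volume on this node is gone.
func releaseVolume(volumeID string, detach func(string) error) error {
	mu.Lock()
	defer mu.Unlock()
	volumeRefCounts[volumeID]--
	if volumeRefCounts[volumeID] > 0 {
		return nil // other pods still use the volume; just drop the reference
	}
	delete(volumeRefCounts, volumeID)
	return detach(volumeID)
}
```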

@kubernetes/rh-storage

@markturansky markturansky self-assigned this Jan 13, 2016

@jsafrane jsafrane changed the title from Kubelet unmounts/detaches used Cinder volume. to Kubelet unmounts/detaches Cinder volume that is being used Jan 13, 2016

@jsafrane jsafrane changed the title from Kubelet unmounts/detaches Cinder volume that is being used to Cinder volume is detached from running pod. Jan 13, 2016


jsafrane commented Jan 14, 2016

I was wrong; cinderVolumePlugin.SetUpAt() and TearDownAt() have some checks to prevent detaching a volume that is still in use. The race is in the cinder volume plugin.
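
For context, a simplified sketch of what that check boils down to (illustrative names and helpers, not the exact Cinder plugin code): unmount the pod directory, and detach only when this pod held the last mount reference.

```go
package cinderutil // illustrative only

// tearDownAt sketches the check: refs would be the other mount points
// referencing the same device, as returned by something like
// mount.GetMountRefs().
func tearDownAt(dir string, refs []string, unmount func(string) error, detach func() error) error {
	if err := unmount(dir); err != nil {
		return err
	}
	// If this pod held the only reference, detach the device from the node.
	// This is exactly the decision that races with a concurrent SetUpAt(),
	// which may have attached the volume but not mounted it yet.
	if len(refs) == 1 {
		return detach()
	}
	return nil
}
```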


jsafrane commented Jan 14, 2016

These two threads run in parallel:

 dying pod.TearDownAt("myVolume")          starting pod.SetUpAt("myVolume")
   |                                       |
   |                                       AttachDisk("myVolume")
   refs, err := mount.GetMountRefs()       |
   Unmount("myDir")                        |
   if refs == 1 {                          |
   |  |                                    Mount("myVolume", "myDir")
   |  |                                    |
   |  DetachDisk("myVolume")               |
   |                                       start containers - OOPS! The volume is detached!
   |
   finish the pod cleanup

jsafrane commented Jan 14, 2016

I see very similar code in the AWS EBS SetUpAt/TearDownAt implementation, but I am not able to reproduce the bug there. Maybe because EBS is much faster at attaching volumes than Cinder?


jsafrane commented Jan 14, 2016

Actually, most of the volume plugins probably have the same race.

@rootfs, you wrote many volume plugins, I am (randomly) looking at Ceph RBD, which has the same code flow as Cinder - can you confirm that the race is theoretically there? It's hard to reproduce though.


markturansky commented Jan 14, 2016

The fencing provided by #15524 should be able to fix this race, but we also noted during discussion in Raleigh that we'll still need attach logic in Kubelet for a while. This race would still exist for those not using the attach/detach controller.

@saad-ali


jsafrane commented Jan 14, 2016

I forgot to mention that the race is that TearDownAt() sees GetMountRefs == 1 (i.e. the dying pod is the last one using the volume) before SetUpAt() mounts the volume (which would increase GetMountRefs to 2, but by then it's too late).


saad-ali commented Jan 15, 2016

This race exists in most of the plugins because the attach and detach code paths are triggered from two separate async loops: the pod creation loop triggers attach, and the kubelet main sync loop triggers detach (orphaned pod cleanup). The loops run on separate threads and are unaware of what the other is doing.

I worked around this for GCE PDs by introducing locking between attach/detach; see attachDetachMutex in https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/gce_pd/gce_util.go#L61. You could reuse that pattern here.
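
For reference, a minimal sketch of that locking pattern applied to a plugin's attach/detach helpers (names are illustrative, not the actual gce_util.go or Cinder code):

```go
package pdutil // illustrative only

import "sync"

// attachDetachMutex serializes attach and detach for this plugin, so the
// pod creation loop and the orphaned-pod cleanup loop cannot interleave
// the cloud provider calls.
var attachDetachMutex sync.Mutex

// AttachDisk sketches the attach path: real code would call the cloud
// provider and wait for the device to appear on the node.
func AttachDisk(volumeID, nodeName string) error {
	attachDetachMutex.Lock()
	defer attachDetachMutex.Unlock()
	// ... attach volumeID to nodeName via the cloud API ...
	return nil
}

// DetachDisk sketches the detach path, holding the same lock.
func DetachDisk(volumeID, nodeName string) error {
	attachDetachMutex.Lock()
	defer attachDetachMutex.Unlock()
	// ... detach volumeID from nodeName via the cloud API ...
	return nil
}
```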


jsafrane commented Jan 15, 2016

While GCE's AttachAndMountDisk is safe, its TearDownAt is not. It checks how many times a volume is mounted (calls GetMountRefs), while the same volume may already be attached but not yet mounted by SetUpAt running in a separate thread.

This pattern is in all volume plugins. I'm going to add a mutex before SetUp/TearDown calls.


jsafrane commented Jan 15, 2016

> I'm going to add a mutex before SetUp/TearDown calls.

That's not possible. Only the volume plugin knows the volume ID (e.g. the AWS EBS VolumeID or GCE PD name). We would have the same bug if two Kubernetes volumes (with different Volume.Name) pointed to the same AWS VolumeID (or GCE PD).

We can either synchronize all TearDown/SetUp calls (i.e. they will run in sequence) or fix all volume plugins to synchronize TearDown/SetUp on the same volume (i.e. they will run mostly in parallel, unless they act on the same volume).
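
A minimal sketch of the second option: a lock per cloud volume ID (hypothetical helper, not existing kubelet code), so that SetUp/TearDown on the same volume are serialized while different volumes proceed in parallel.

```go
package volumelock // illustrative only

import "sync"

// keyedMutex hands out one mutex per cloud volume ID.
type keyedMutex struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newKeyedMutex() *keyedMutex {
	return &keyedMutex{locks: map[string]*sync.Mutex{}}
}

// lockFor returns the mutex for the given volume ID, creating it on first use.
func (k *keyedMutex) lockFor(volumeID string) *sync.Mutex {
	k.mu.Lock()
	defer k.mu.Unlock()
	l, ok := k.locks[volumeID]
	if !ok {
		l = &sync.Mutex{}
		k.locks[volumeID] = l
	}
	return l
}
```

A plugin would grab lockFor(volumeID) around the whole attach+mount sequence in SetUpAt and the unmount+detach sequence in TearDownAt, keyed by the cloud volume ID that only the plugin knows.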


jsafrane commented Jan 15, 2016

I posted a PR to fix this bug, but it's Cinder only. Similar changes are needed in most of the volume plugins.

Maybe we could fix the plugins when refactoring them for the attach/detach controller.
@saad-ali @markturansky: opinions?

openshift-publish-robot pushed a commit to openshift/kubernetes that referenced this issue May 3, 2018

Merge pull request #19602 from smarterclayton/decorate
UPSTREAM: 63349: Decorate function not called on Create

Origin-commit: 93f12e580ece30e0055f030dcd6d239116ff5cb4