Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UPSTREAM: 40417: Always detach volumes in operator executor #13251

Merged
merged 1 commit into from
Mar 16, 2017

Conversation

gnufied
Copy link
Member

@gnufied gnufied commented Mar 6, 2017

This fixes a important race condition in mounting a volume and if pods gets deleted while mount was in progress.

BZ- https://bugzilla.redhat.com/show_bug.cgi?id=1432949

Instead of marking a volume as detached immediately in Kubelet's
reconciler, delegate the marking asynchronously to the operator
executor. This is necessary to prevent race conditions with other
operations mutating the same volume state.

An example of one such problem:

pod is created, volume is added to desired state of the world
reconciler process starts
reconciler starts MountVolume, which is kicked off asynchronously via
operation_executor.go
MountVolume mounts the volume, but hasn't yet marked it as mounted
pod is deleted, volume is removed from desired state of the world
reconciler reaches detach volume section, detects volume is no longer in desired state of world,
removes it from volumes in use
MountVolume tries to mark mount, throws an error because
volume is no longer in actual state of world list. After this, kubelet isn't aware of the mount
so doesn't try to unmount again.
controller-manager tries to detach the volume, this fails because it
is still mounted to the OS.
EBS gets stuck indefinitely in busy state trying to detach.
@gnufied gnufied added this to the 1.5.0 milestone Mar 6, 2017
@gnufied gnufied requested review from ajsmith and jsafrane and removed request for ajsmith March 6, 2017 16:32
@gnufied
Copy link
Member Author

gnufied commented Mar 6, 2017

cc @jsafrane @childsb

@gnufied
Copy link
Member Author

gnufied commented Mar 6, 2017

[test]

@jsafrane
Copy link
Contributor

jsafrane commented Mar 7, 2017

LGTM for the code, however... shouldn't there be a link to BZ?

@pweil-
Copy link
Contributor

pweil- commented Mar 7, 2017

@childsb or @eparis please sign off on this from a risk perspective

@gnufied I see where this was targeted for 1.5 upstream picks but didn't see the actual 1.5 pick. Since the PR was from an external contributor we may want to follow up on that.

@gnufied
Copy link
Member Author

gnufied commented Mar 8, 2017

@pweil- okay I will follow up on original PR for cherry pick.

I think for now - lets hold off on merging this. I will drop the milestone target.

@gnufied gnufied modified the milestone: 1.5.0 Mar 8, 2017
@gnufied
Copy link
Member Author

gnufied commented Mar 14, 2017

I am going to push for this to get merged. Upstream PR for cherry picking to 1.5 was opened - kubernetes/kubernetes#41555

@eparis
Copy link
Member

eparis commented Mar 15, 2017

[test] but like jan said, where's the bZ?

@openshift-bot
Copy link
Contributor

Evaluated for origin test up to e34c005

@openshift-bot
Copy link
Contributor

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/239/) (Base Commit: 4bdf7ce)

@childsb
Copy link
Contributor

childsb commented Mar 16, 2017

[merge] test flaked on #9309

@openshift-bot
Copy link
Contributor

Evaluated for origin merge up to e34c005

@openshift-bot
Copy link
Contributor

openshift-bot commented Mar 16, 2017

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/101/) (Base Commit: 0eb20f4) (Image: devenv-rhel7_6072)

@openshift-bot openshift-bot merged commit a98a33d into openshift:release-1.5 Mar 16, 2017
@eparis
Copy link
Member

eparis commented Mar 16, 2017

@childsb @gnufied where is the bz?

@gnufied
Copy link
Member Author

gnufied commented Mar 16, 2017

The openshift bug https://bugzilla.redhat.com/show_bug.cgi?id=1429318

Also upstream bug - kubernetes/kubernetes#32881

Having said that, the bug reported in bugzilla doesn't have reproducible steps. This is bit of a problem because - we don't know why BZ 1429318 happened. We do know, we have an existing race condition in volume detach/umount code path so we are fixing that here. But we don't know for sure, if that will fix BZ 1429318. I will keep BZ open for now, until we observe if it is really fixed.

@eparis
Copy link
Member

eparis commented Mar 16, 2017

I need a BZ, in MODIFIED, which explains what we changed and why. For QA to try to pound on what changes you made at the last minute.

@gnufied
Copy link
Member Author

gnufied commented Mar 16, 2017

@eparis updated the issue description with appropriate BZ.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants