Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix race condition issue when detaching azure disk #60183

Merged
merged 1 commit into from Feb 23, 2018

Conversation

@andyzhangx
Copy link
Member

andyzhangx commented Feb 22, 2018

What this PR does / why we need it:
add lock before detaching azure disk, without this PR, there would be lots of Multi-Attach error when scheduling one pod from one node to another.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #60101

Special notes for your reviewer:
@feiskyer @djsly @khenidak
Since we are using getLunMutex.LockKey(instanceid) for both AttachDisk and DetachDisk, there would be only one VM.update operation at a time for both AttachDisk and DetachDisk.

Release note:

fix race condition issue when detaching azure disk

/assign @feiskyer
Could you also mark as v1.10 milestone @feiskyer thanks.
/sig azure

@andyzhangx

This comment has been minimized.

Copy link
Member Author

andyzhangx commented Feb 22, 2018

@khenidak I think lots of Multi-Attach error of azure disk was due to this bug, it's really bad we do not lock before detaching azure disk...

@andyzhangx andyzhangx force-pushed the andyzhangx:addlock-detach-azuredisk branch from 630269a to f3324a6 Feb 22, 2018

@djsly

This comment has been minimized.

Copy link
Contributor

djsly commented Feb 22, 2018

Looking good on my side.
I will apply the patch on my local 1.7 source and test it out tomorrow. We have another issue with PV and I want to confirm it is the same source.

@andyzhangx

This comment has been minimized.

Copy link
Member Author

andyzhangx commented Feb 22, 2018

@djsly just replace the controller manager, thanks.

@feiskyer
Copy link
Member

feiskyer left a comment

LGTM with the changes.

Let's wait a while for @djsly's validation.

@@ -268,7 +268,7 @@ func (d *azureDiskDetacher) Detach(diskURI string, nodeName types.NodeName) erro
return fmt.Errorf("invalid disk to detach: %q", diskURI)
}

_, err := d.cloud.InstanceID(context.TODO(), nodeName)
instanceid, err := d.cloud.InstanceID(context.TODO(), nodeName)

This comment has been minimized.

@brendandburns

brendandburns Feb 22, 2018

Contributor

context.TODO() seems like a mistake here? is there a context we can attach to?

This comment has been minimized.

@andyzhangx

andyzhangx Feb 22, 2018

Author Member

@brendanburns it's related to #59287, there is some future work item for context.TODO()

@djsly

This comment has been minimized.

Copy link
Contributor

djsly commented Feb 22, 2018

@andyzhangx here are the logs for my latest test after I applied the patch on the 1.7 branch.

https://gist.github.com/djsly/d04edbd3239373f345ca27616ddd5132

Things are looking good!

Except the Multi-Attach error. This seems to be caused by the reconciler. It last ~2m30sec until the disks are unmounted and the first detach kicks in.

Could you confirm that those warnings do not involve the ARM api ?

Multi-Attach error for volume "pvc-91d3b0a6-11b9-11e8-82b7-000d3a018ac3" (UniqueName: "kubernetes.io/azure-disk//subscriptions/<sub_id>/resourceGroups/<rg>/providers/Microsoft.Compute/disks/<rg>-dynamic-pvc-91d3b0a6-11b9-11e8-82b7-000d3a018ac3") from node "kn-edge-2" Volume is already exclusively attached to one node and can't be attached to another

@djsly

This comment has been minimized.

Copy link
Contributor

djsly commented Feb 22, 2018

Also, could we have this PR cherry picked all the way to 1.7 ? thanks!

@brendandburns

This comment has been minimized.

Copy link
Contributor

brendandburns commented Feb 22, 2018

/lgtm
/approve

@jdumars can you shepard this into 1.7.x and 1.8.x?

Thanks!

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

k8s-ci-robot commented Feb 22, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx, brendandburns

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-cherrypick-bot

This comment has been minimized.

Copy link

k8s-cherrypick-bot commented Feb 22, 2018

Removing label cherrypick-candidate because no release milestone was set. This is an invalid state and thus this PR is not being considered for cherry-pick to any release branch. Please add an appropriate release milestone and then re-add the label.

@andyzhangx

This comment has been minimized.

Copy link
Member Author

andyzhangx commented Feb 23, 2018

@djsly I checked the code again, current Multi-Attach error from reconciler is expected, it does not come from ARM api, it's a error message generated by reconciler.
When one pod is rescheduled from node#1 to node#2, DisksAreAttached func in azure_controllerCommon.go will still return that the original disk is still in node#1, after around one minute, disk was removed from node#1, and then the Multi-Attach error from reconciler disappears.

I will cherry pick this PR to v1.7-1.9.
@djsly Thanks again for your reporting and verification, finally we resolved this Multi-Attach error issue.

@djsly

This comment has been minimized.

Copy link
Contributor

djsly commented Feb 23, 2018

My pleasure and thanks for the quick PR.

@andyzhangx

This comment has been minimized.

Copy link
Member Author

andyzhangx commented Feb 23, 2018

@k8s-github-robot

This comment has been minimized.

Copy link
Contributor

k8s-github-robot commented Feb 23, 2018

Automatic merge from submit-queue (batch tested with PRs 60208, 60084, 60183, 59713, 60096). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 8f9e8c0 into kubernetes:master Feb 23, 2018

13 checks passed

Submit Queue Queued to run github e2e tests a second time.
Details
cla/linuxfoundation andyzhangx authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-cross Skipped
pull-kubernetes-e2e-gce Job succeeded.
Details
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-e2e-gke Skipped
pull-kubernetes-e2e-kops-aws Job succeeded.
Details
pull-kubernetes-kubemark-e2e-gce Job succeeded.
Details
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-unit Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details
@jdumars

This comment has been minimized.

Copy link
Member

jdumars commented Feb 23, 2018

@mehdy @jpbetz @wojtek-t this is an important cherry pick

@k8s-cherrypick-bot

This comment has been minimized.

Copy link

k8s-cherrypick-bot commented Feb 23, 2018

Removing label cherrypick-candidate because no release milestone was set. This is an invalid state and thus this PR is not being considered for cherry-pick to any release branch. Please add an appropriate release milestone and then re-add the label.

@jdumars jdumars modified the milestones: v1.10, v1.9 Feb 23, 2018

k8s-github-robot pushed a commit that referenced this pull request Feb 28, 2018

Kubernetes Submit Queue
Merge pull request #60272 from andyzhangx/automated-cherry-pick-of-#6…
…0183-upstream-release-1.8

Automatic merge from submit-queue.

Automated cherry pick of #60183

Cherry pick of #60183 on release-1.8.

#60183: add lock before detaching azure disk

k8s-github-robot pushed a commit that referenced this pull request Mar 1, 2018

Kubernetes Submit Queue
Merge pull request #60271 from andyzhangx/automated-cherry-pick-of-#6…
…0183-upstream-release-1.7

Automatic merge from submit-queue.

Automated cherry pick of #60183

Cherry pick of #60183 on release-1.7.

#60183: add lock before detaching azure disk

@oivindoh oivindoh referenced this pull request Mar 13, 2018

Closed

Is production ready? #2422

k8s-github-robot pushed a commit that referenced this pull request Mar 16, 2018

Kubernetes Submit Queue
Merge pull request #60273 from andyzhangx/automated-cherry-pick-of-#6…
…0183-upstream-release-1.9

Automatic merge from submit-queue.

Automated cherry pick of #60183: add lock before detaching azure disk

Cherry pick of #60183 on release-1.9.

#60183: add lock before detaching azure disk
@k8s-cherrypick-bot

This comment has been minimized.

Copy link

k8s-cherrypick-bot commented Mar 16, 2018

Commit found in the "release-1.9" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.