Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix race condition issue when detaching azure disk #60183

Merged

Conversation

@andyzhangx
Copy link
Member

@andyzhangx andyzhangx commented Feb 22, 2018

What this PR does / why we need it:
add lock before detaching azure disk, without this PR, there would be lots of Multi-Attach error when scheduling one pod from one node to another.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #60101

Special notes for your reviewer:
@feiskyer @djsly @khenidak
Since we are using getLunMutex.LockKey(instanceid) for both AttachDisk and DetachDisk, there would be only one VM.update operation at a time for both AttachDisk and DetachDisk.

Release note:

fix race condition issue when detaching azure disk

/assign @feiskyer
Could you also mark as v1.10 milestone @feiskyer thanks.
/sig azure

@andyzhangx
Copy link
Member Author

@andyzhangx andyzhangx commented Feb 22, 2018

@khenidak I think lots of Multi-Attach error of azure disk was due to this bug, it's really bad we do not lock before detaching azure disk...

@andyzhangx andyzhangx force-pushed the andyzhangx:addlock-detach-azuredisk branch from 630269a to f3324a6 Feb 22, 2018
@djsly
Copy link
Contributor

@djsly djsly commented Feb 22, 2018

Looking good on my side.
I will apply the patch on my local 1.7 source and test it out tomorrow. We have another issue with PV and I want to confirm it is the same source.

@andyzhangx
Copy link
Member Author

@andyzhangx andyzhangx commented Feb 22, 2018

@djsly just replace the controller manager, thanks.

Copy link
Member

@feiskyer feiskyer left a comment

LGTM with the changes.

Let's wait a while for @djsly's validation.

@@ -268,7 +268,7 @@ func (d *azureDiskDetacher) Detach(diskURI string, nodeName types.NodeName) erro
return fmt.Errorf("invalid disk to detach: %q", diskURI)
}

_, err := d.cloud.InstanceID(context.TODO(), nodeName)
instanceid, err := d.cloud.InstanceID(context.TODO(), nodeName)

This comment has been minimized.

@brendandburns

brendandburns Feb 22, 2018
Contributor

context.TODO() seems like a mistake here? is there a context we can attach to?

This comment has been minimized.

@andyzhangx

andyzhangx Feb 22, 2018
Author Member

@brendanburns it's related to #59287, there is some future work item for context.TODO()

@djsly
Copy link
Contributor

@djsly djsly commented Feb 22, 2018

@andyzhangx here are the logs for my latest test after I applied the patch on the 1.7 branch.

https://gist.github.com/djsly/d04edbd3239373f345ca27616ddd5132

Things are looking good!

Except the Multi-Attach error. This seems to be caused by the reconciler. It last ~2m30sec until the disks are unmounted and the first detach kicks in.

Could you confirm that those warnings do not involve the ARM api ?

Multi-Attach error for volume "pvc-91d3b0a6-11b9-11e8-82b7-000d3a018ac3" (UniqueName: "kubernetes.io/azure-disk//subscriptions/<sub_id>/resourceGroups/<rg>/providers/Microsoft.Compute/disks/<rg>-dynamic-pvc-91d3b0a6-11b9-11e8-82b7-000d3a018ac3") from node "kn-edge-2" Volume is already exclusively attached to one node and can't be attached to another

@djsly
Copy link
Contributor

@djsly djsly commented Feb 22, 2018

Also, could we have this PR cherry picked all the way to 1.7 ? thanks!

@brendandburns
Copy link
Contributor

@brendandburns brendandburns commented Feb 22, 2018

/lgtm
/approve

@jdumars can you shepard this into 1.7.x and 1.8.x?

Thanks!

@k8s-ci-robot
Copy link
Contributor

@k8s-ci-robot k8s-ci-robot commented Feb 22, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx, brendandburns

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-cherrypick-bot
Copy link

@k8s-cherrypick-bot k8s-cherrypick-bot commented Feb 22, 2018

Removing label cherrypick-candidate because no release milestone was set. This is an invalid state and thus this PR is not being considered for cherry-pick to any release branch. Please add an appropriate release milestone and then re-add the label.

@andyzhangx
Copy link
Member Author

@andyzhangx andyzhangx commented Feb 23, 2018

@djsly I checked the code again, current Multi-Attach error from reconciler is expected, it does not come from ARM api, it's a error message generated by reconciler.
When one pod is rescheduled from node#1 to node#2, DisksAreAttached func in azure_controllerCommon.go will still return that the original disk is still in node#1, after around one minute, disk was removed from node#1, and then the Multi-Attach error from reconciler disappears.

I will cherry pick this PR to v1.7-1.9.
@djsly Thanks again for your reporting and verification, finally we resolved this Multi-Attach error issue.

@djsly
Copy link
Contributor

@djsly djsly commented Feb 23, 2018

My pleasure and thanks for the quick PR.

@andyzhangx
Copy link
Member Author

@andyzhangx andyzhangx commented Feb 23, 2018

@k8s-github-robot
Copy link
Contributor

@k8s-github-robot k8s-github-robot commented Feb 23, 2018

Automatic merge from submit-queue (batch tested with PRs 60208, 60084, 60183, 59713, 60096). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 8f9e8c0 into kubernetes:master Feb 23, 2018
13 checks passed
13 checks passed
@k8s-github-robot
Submit Queue Queued to run github e2e tests a second time.
Details
@thelinuxfoundation
cla/linuxfoundation andyzhangx authorized
Details
@k8s-ci-robot
pull-kubernetes-bazel-build Job succeeded.
Details
@k8s-ci-robot
pull-kubernetes-bazel-test Job succeeded.
Details
@k8s-ci-robot
pull-kubernetes-cross Skipped
@k8s-ci-robot
pull-kubernetes-e2e-gce Job succeeded.
Details
@k8s-ci-robot
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
@k8s-ci-robot
pull-kubernetes-e2e-gke Skipped
@k8s-ci-robot
pull-kubernetes-e2e-kops-aws Job succeeded.
Details
@k8s-ci-robot
pull-kubernetes-kubemark-e2e-gce Job succeeded.
Details
@k8s-ci-robot
pull-kubernetes-node-e2e Job succeeded.
Details
@k8s-ci-robot
pull-kubernetes-unit Job succeeded.
Details
@k8s-ci-robot
pull-kubernetes-verify Job succeeded.
Details
@jdumars
Copy link
Member

@jdumars jdumars commented Feb 23, 2018

@mehdy @jpbetz @wojtek-t this is an important cherry pick

@k8s-cherrypick-bot
Copy link

@k8s-cherrypick-bot k8s-cherrypick-bot commented Feb 23, 2018

Removing label cherrypick-candidate because no release milestone was set. This is an invalid state and thus this PR is not being considered for cherry-pick to any release branch. Please add an appropriate release milestone and then re-add the label.

@jdumars jdumars modified the milestones: v1.10, v1.9 Feb 23, 2018
k8s-github-robot pushed a commit that referenced this pull request Feb 28, 2018
…0183-upstream-release-1.8

Automatic merge from submit-queue.

Automated cherry pick of #60183

Cherry pick of #60183 on release-1.8.

#60183: add lock before detaching azure disk
k8s-github-robot pushed a commit that referenced this pull request Mar 1, 2018
…0183-upstream-release-1.7

Automatic merge from submit-queue.

Automated cherry pick of #60183

Cherry pick of #60183 on release-1.7.

#60183: add lock before detaching azure disk
k8s-github-robot pushed a commit that referenced this pull request Mar 16, 2018
…0183-upstream-release-1.9

Automatic merge from submit-queue.

Automated cherry pick of #60183: add lock before detaching azure disk

Cherry pick of #60183 on release-1.9.

#60183: add lock before detaching azure disk
@k8s-cherrypick-bot
Copy link

@k8s-cherrypick-bot k8s-cherrypick-bot commented Mar 16, 2018

Commit found in the "release-1.9" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment