fix detach azure disk back off issue which has too big lock in failure retry condition #76573

Merged
merged 2 commits into kubernetes:master from andyzhangx:disk-backoff-refactor Apr 19, 2019

Conversation

@andyzhangx
Member

commented Apr 15, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:
In some error conditions, when detaching an Azure disk fails, the Azure cloud provider retries up to 6 times with exponential backoff. During those retries it holds the data disk list under a node-level lock for about 3 minutes. If a customer updates the data disk list manually in that period (e.g. a manual operation is needed to attach/detach another disk because of an attach/detach error), the held data disk list becomes obsolete (dirty data) and weird VM states follow, e.g. attaching a non-existent disk. We should split those retry operations so that every retry gets a fresh data disk list at the beginning.

if as.CloudProviderBackoff && shouldRetryHTTPRequest(resp, err) {
	klog.V(2).Infof("azureDisk - update(%s) backing off: vm(%s) detach disk(%s, %s), err: %v", nodeResourceGroup, vmName, diskName, diskURI, err)
	retryErr := as.CreateOrUpdateVMWithRetry(nodeResourceGroup, vmName, newVM)
	if retryErr != nil {

This PR has two commits:

  1. rename the function DetachDiskByName to DetachDisk
  2. refine the detach azure disk retry operation so that every detach attempt runs in a standalone function; originally the retry was done by as.CreateOrUpdateVMWithRetry(nodeResourceGroup, vmName, newVM), which may lead to an obsolete data disk list

This PR does not change the attach disk logic, since the Azure cloud provider has no retry for attach disk; the k8s attach-detach controller does the attach volume retry.

BTW, I have tested this PR on both VMSS and VMAS k8s clusters with a disk attach/detach stress test, and it worked well over repeated runs.

Which issue(s) this PR fixes:

Fixes #76502

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

fix detach azure disk back off issue which has too big lock in failure retry condition

/kind bug
/assign @feiskyer
/priority important-soon
/sig azure

cc @khenidak @brendandburns

move disk lock process to azure cloud provider
fix comments

fix import keymux check error

add unit test for attach/detach disk funcs

@andyzhangx andyzhangx force-pushed the andyzhangx:disk-backoff-refactor branch from 3772cd8 to 6c70ca6 Apr 16, 2019

@andyzhangx

Member Author

commented Apr 16, 2019

/test pull-kubernetes-integration

@feiskyer
Member

left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm label Apr 16, 2019

@feiskyer

Member

commented Apr 16, 2019

@andrewsykim Could you help approve the cloud-provider changes?

/assign @andrewsykim

@andrewsykim

Member

commented Apr 16, 2019

Can we add a bit more detail to the release note, please?

@andyzhangx andyzhangx changed the title from "fix detach azure disk back off issue" to "fix detach azure disk back off issue which has too big lock in failure retry condition" Apr 17, 2019

@andyzhangx

Member Author

commented Apr 17, 2019

/test pull-kubernetes-e2e-gce-csi-serial

@andyzhangx

Member Author

commented Apr 17, 2019

@andrewsykim thanks. I have changed the release note to:

fix detach azure disk back off issue which has too big lock in failure retry condition

Let me know if you have any questions.

@feiskyer feiskyer added this to In progress in SIG Azure via automation Apr 17, 2019

@andyzhangx

Member Author

commented Apr 19, 2019

@andrewsykim PTAL, thanks.

@andrewsykim

Member

commented Apr 19, 2019

/approve
/lgtm

for pkg/cloudprovider/providers/.import-restrictions changes

@k8s-ci-robot

Contributor

commented Apr 19, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewsykim, andyzhangx, feiskyer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 64a0441 into kubernetes:master Apr 19, 2019

20 checks passed

cla/linuxfoundation andyzhangx authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-dependencies Job succeeded.
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-csi-serial Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-e2e-gce-storage-slow Job succeeded.
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
pull-publishing-bot-validate Skipped.
tide In merge pool.

SIG Azure automation moved this from In progress to Done Apr 19, 2019

k8s-ci-robot added a commit that referenced this pull request Apr 28, 2019

Merge pull request #76981 from andyzhangx/automated-cherry-pick-of-#7…
…6573-upstream-release-1.12

Automated cherry pick of #76573: refactor detach azure disk retry operation

k8s-ci-robot added a commit that referenced this pull request Apr 30, 2019

Merge pull request #76887 from andyzhangx/automated-cherry-pick-of-#7…
…6573-upstream-release-1.13

Automated cherry pick of #76573: refactor detach azure disk retry operation

k8s-ci-robot added a commit that referenced this pull request May 1, 2019

Merge pull request #76886 from andyzhangx/automated-cherry-pick-of-#7…
…6573-upstream-release-1.14

Automated cherry pick of #76573: refactor detach azure disk retry operation