add retry for detach azure disk #74398

Merged
merged 1 commit into from Feb 26, 2019

@andyzhangx
Member

andyzhangx commented Feb 22, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:
The current Azure cloud provider fails to detach an Azure disk when there is a server-side error, so this PR adds a retry mechanism for the detach disk operation. The attach disk operation does not need one, since the k8s pv-controller will retry if the first attempt fails.
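
As an illustration of the idea only (not the code in this PR), a retry with exponential backoff around a detach call could be sketched as follows. The `backoff` type, `retryWithBackoff`, and the simulated `detachDisk` closure are hypothetical names introduced for this example, and the sketch uses only the Go standard library rather than any Kubernetes helper.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errSimulated stands in for a server-side error returned by the cloud API.
var errSimulated = errors.New("simulated server-side error")

// backoff holds hypothetical retry settings; the field names are illustrative.
type backoff struct {
	steps    int           // maximum number of attempts
	duration time.Duration // delay before the second attempt
	factor   float64       // multiplier applied to the delay after each failure
}

// retryWithBackoff calls op until it succeeds or the attempts are used up.
// It returns the number of attempts made and the last error (nil on success).
func retryWithBackoff(b backoff, op func() error) (int, error) {
	if b.steps < 1 {
		b.steps = 1 // always try at least once
	}
	delay := b.duration
	var err error
	for attempt := 1; attempt <= b.steps; attempt++ {
		if err = op(); err == nil {
			return attempt, nil
		}
		if attempt < b.steps {
			time.Sleep(delay)
			delay = time.Duration(float64(delay) * b.factor)
		}
	}
	return b.steps, err
}

func main() {
	calls := 0
	// detachDisk is a stand-in for the real cloud call: it fails twice, then succeeds.
	detachDisk := func() error {
		calls++
		if calls < 3 {
			return errSimulated
		}
		return nil
	}
	attempts, err := retryWithBackoff(backoff{steps: 5, duration: 10 * time.Millisecond, factor: 2}, detachDisk)
	fmt.Println(attempts, err) // prints: 3 <nil>
}
```

Retrying only the detach path matches the reasoning above: a failed attach is already retried by the controller, while a failed detach would otherwise surface immediately.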

Which issue(s) this PR fixes:

Fixes #74396

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

fix issue: fail to detach azure disk when there is server side error

/kind bug
/assign @feiskyer
/priority important-soon
/sig azure

@k8s-ci-robot

Contributor

k8s-ci-robot commented Feb 22, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

add retry for detach azure disk
add more logging info in detach disk

@andyzhangx andyzhangx force-pushed the andyzhangx:detach-azuredisk-retry branch from df2edd2 to 8c53db0 Feb 22, 2019

@feiskyer
Member

feiskyer left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Feb 22, 2019

@andyzhangx

Member Author

andyzhangx commented Feb 22, 2019

BTW, if I move 8 disks from one node to another in parallel, there is always exactly one disk that does not get detached. The error looks like the following:

I0222 08:38:22.503087       1 azure_controller_vmss.go:140] azureDisk - update(andy-vmss1124) backing off: vm(k8s-agentpool-24194600-vmss000001) detach disk(, /subscriptions/xxx/resourceGroups/andy-vmss1124/providers/Microsoft.Compute/disks/andy-vmss1124-dynamic-pvc-66815035-35ac-11e9-921c-000d3a002125), err: compute.VirtualMachineScaleSetVMsClient#Update: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="AttachDiskWhileBeingDetached" Message="Cannot attach data disk '763c0d05-4ae5-4699-8ce8-9c8ced5283da' to VM 'k8s-agentpool-24194600-vmss_1' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again."

The detach action eventually failed. It looks as if attaching a disk to node#1 causes the detach of that disk from node#2 to fail, which is not reasonable. I will contact the Azure disk RP to look into this issue.
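
When triaging log lines like the one above, it can help to pull out the service error code. The following is a minimal sketch, assuming the `Code="..."` field always appears in the form shown in that log line; `errorCode` and `codePattern` are hypothetical helpers for log analysis, not part of the Azure cloud provider.

```go
package main

import (
	"fmt"
	"regexp"
)

// codePattern matches the Code="..." field as it appears in the log line above.
var codePattern = regexp.MustCompile(`Code="([^"]+)"`)

// errorCode returns the service error code embedded in msg, or "" if none is found.
func errorCode(msg string) string {
	m := codePattern.FindStringSubmatch(msg)
	if m == nil {
		return ""
	}
	return m[1]
}

func main() {
	// Shortened excerpt of the error text from the log line above.
	msg := `Service returned an error. Status=<nil> Code="AttachDiskWhileBeingDetached" Message="Cannot attach data disk"`
	fmt.Println(errorCode(msg)) // prints: AttachDiskWhileBeingDetached
}
```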

@feiskyer

Member

feiskyer commented Feb 22, 2019

Then let's hold for a while until the root cause is found.

/hold

@feiskyer

Member

feiskyer commented Feb 22, 2019

/test pull-kubernetes-e2e-aks-engine-azure

@andyzhangx

Member Author

andyzhangx commented Feb 22, 2019

The funny thing is that the message says Cannot attach data disk '763c0d05-4ae5-4699-8ce8-9c8ced5283da' to VM 'k8s-agentpool-24194600-vmss_1', yet I never had data disk '763c0d05-4ae5-4699-8ce8-9c8ced5283da' in my resource group, and I could easily reproduce this in two VMSS k8s clusters.

@andyzhangx

Member Author

andyzhangx commented Feb 25, 2019

@feiskyer The issue I could reproduce is related to VMSS. I still think we should add this retry logic, for detach disk only (it is not necessary for attach disk, since the k8s controller will retry on failure), in case there are other potential issues. It gives the Azure cloud provider more chances to retry, although in this case it does not work perfectly. What's your opinion?

@feiskyer

Member

feiskyer commented Feb 25, 2019

The issue I could reproduce is related to VMSS. I still think we should add this retry logic, for detach disk only (it is not necessary for attach disk, since the k8s controller will retry on failure), in case there are other potential issues. It gives the Azure cloud provider more chances to retry, although in this case it does not work perfectly. What's your opinion?

Agreed. Let's wait a while for VMSS responses, in case there are still other potential issues.

@andyzhangx

Member Author

andyzhangx commented Feb 26, 2019

The failure is most likely due to slow disk attach/detach on VMSS. Let's merge this PR first, since it would mitigate the issue a little.
/hold cancel

@k8s-ci-robot k8s-ci-robot merged commit d500740 into kubernetes:master Feb 26, 2019

17 checks passed

cla/linuxfoundation andyzhangx authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-cross Skipped
pull-kubernetes-e2e-aks-engine-azure Job succeeded.
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-godeps Skipped
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-local-e2e Skipped
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
pull-publishing-bot-validate Skipped
tide In merge pool.

k8s-ci-robot added a commit that referenced this pull request Feb 27, 2019

Merge pull request #74579 from andyzhangx/automated-cherry-pick-of-#74398-upstream-release-1.13

Automated cherry pick of #74398: add retry for detach azure disk

k8s-ci-robot added a commit that referenced this pull request Mar 5, 2019

Merge pull request #74593 from andyzhangx/automated-cherry-pick-of-#74398-upstream-release-1.11

Automated cherry pick of #74398: add retry for detach azure disk

@feiskyer feiskyer added this to Done in Cloud Provider Azure Mar 5, 2019

k8s-ci-robot added a commit that referenced this pull request Mar 7, 2019

Merge pull request #74581 from andyzhangx/automated-cherry-pick-of-#74398-upstream-release-1.12

Automated cherry pick of #74398: add retry for detach azure disk