
fix detach azure disk issue due to dirty cache #71495

Merged (1 commit) on Nov 28, 2018

Conversation

@andyzhangx (Member) commented Nov 28, 2018

What type of PR is this?
/kind bug

What this PR does / why we need it:
Fix the detach Azure disk issue by cleaning up the VM cache right after every update-VM operation in disk attach/detach.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #71453

Special notes for your reviewer:
When detaching lots of disks on one node, detach failures may happen. Previously we used a backoff mechanism that retried up to 6 times, which was not a good solution; now we use the k8s controller to detach disk volumes, and it ultimately depends on the DisksAreAttached func to decide whether a disk is attached or not:

attachedResult, err := diskController.DisksAreAttached(volumeIDList, nodeName)

Unfortunately, the DisksAreAttached func is invoked right after a DetachDiskByName failure, which has already made the VM cache dirty. This PR makes sure the VM cache is cleaned after every update-VM operation in disk attach/detach. It also solves a strange related issue, e.g. a detach disk failure caused by the previous attach disk failure.

The invocation chain for using vmCache in DisksAreAttached func:

DisksAreAttached --> getNodeDataDisks --> GetDataDisks --> getVirtualMachine --> vmCache.Get
I1128 05:04:58.206841       1 attacher.go:145] azureDisk - VolumesAreAttached: check volume "andy-vmss11010-dynamic-pvc-884809b5-f2ca-11e8-a757-000d3a01450e" (specName: "pvc-884809b5-f2ca-11e8-a757-000d3a01450e") is no longer attached
I1128 05:04:58.206945       1 operation_generator.go:193] VerifyVolumesAreAttached determined volume "kubernetes.io/azure-disk//subscriptions/.../resourceGroups/andy-vmss11010/providers/Microsoft.Compute/disks/andy-vmss11010-dynamic-pvc-049158b4-f2ba-11e8-ade9-000d3a01450e" (spec.Name: "pvc-049158b4-f2ba-11e8-ade9-000d3a01450e") is no longer attached to node "k8s-agentpool-39687990-vmss000000", therefore it was marked as detached.

Release note:

fix detach azure disk issue

/sig azure
/assign @feiskyer
cc @khenidak @brendandburns @antoineco

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. sig/azure cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. labels Nov 28, 2018
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 28, 2018
@andyzhangx (Member, Author)

/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 28, 2018
@feiskyer (Member)

/milestone v1.13
/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Nov 28, 2018
@k8s-ci-robot k8s-ci-robot added this to the v1.13 milestone Nov 28, 2018
@feiskyer (Member)

Added to v1.13 since this is a critical bug fix.

@feiskyer (Member) left a comment:


/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 28, 2018
@AishSundar (Contributor)

@andyzhangx @feiskyer is this an issue introduced recently in 1.13 that we are trying to address with this fix? Are there CI tests to verify the goodness of the fix? We are very late in the 1.13 release cycle and I would like to avoid any code churn unless it is a critical urgent issue introduced in 1.13. Can this wait until 1.13.1?

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 28, 2018
@andyzhangx (Member, Author)

andyzhangx commented Nov 28, 2018

Hi @AishSundar, yes, we are going to fix the issue right in v1.13.0, and the code change is confined to the azure directory. It's important that this fix makes it into v1.13.0, thanks.
Update:
About the test: I have done a few e2e tests on this, but since it's not a common scenario, we don't have an automated test for it; it's manual for now.

@andyzhangx (Member, Author)

/test pull-kubernetes-integration

@andyzhangx andyzhangx changed the title fix detach azure disk issue fix detach azure disk issue due to dirty cache Nov 28, 2018
@andyzhangx (Member, Author)

/hold cancel
@AishSundar just let me know if you have any concerns about this PR, thanks.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 28, 2018
@k8s-ci-robot k8s-ci-robot merged commit b75a346 into kubernetes:master Nov 28, 2018
k8s-ci-robot added a commit that referenced this pull request Nov 29, 2018
…1495-upstream-release-1.12

Automated cherry pick of #71495: fix detach azure disk issue by cleaning vm cache
k8s-ci-robot added a commit that referenced this pull request Dec 10, 2018
…1495-upstream-release-1.10

Automated cherry pick of #71495: fix detach azure disk issue by cleaning vm cache

// Invalidate the cache right after updating
key := buildVmssCacheKey(nodeResourceGroup, ss.makeVmssVMName(ssName, instanceID))
defer ss.vmssVMCache.Delete(key)
@andyzhangx (Member, Author) commented:

Here is the PR for this dirty cache issue. Originally, whether the update succeeds or not, this PR cleans the cache anyway.
I could move this cache-cleanup operation (defer ss.vmssVMCache.Delete(key)) so it only runs when an error happens; would that solve your issue, @khenidak? I can build a hotfix image if you want to try this fix.
cc @feiskyer


Successfully merging this pull request may close these issues.

Azure Disks occasionally mounted in a way leading to I/O errors
4 participants