
fix detach azure disk issue due to dirty cache #71495

Merged (1 commit) on Nov 28, 2018

Conversation

@andyzhangx (Member) commented Nov 28, 2018

What type of PR is this?
/kind bug

What this PR does / why we need it:
Fix the detach Azure disk issue by cleaning up the VM cache right after every update-VM operation in disk attach/detach.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #71453

Special notes for your reviewer:
When detaching lots of disks on one node, detach failures may happen. Previously we used a backoff mechanism that retried up to 6 times, which was not a good solution; now we use the k8s controller to detach disk volumes, and it ultimately depends on the DisksAreAttached func to decide whether a disk is attached or not:

attachedResult, err := diskController.DisksAreAttached(volumeIDList, nodeName)

Unfortunately, the DisksAreAttached func is invoked right after a DetachDiskByName failure, which has already made the VM cache dirty. This PR makes sure the VM cache is cleaned after every update-VM operation in disk attach/detach. It also solves a strange related issue, e.g. a detach disk failure caused by the previous attach disk failure.

The invocation chain for using vmCache in DisksAreAttached func:

DisksAreAttached --> getNodeDataDisks --> GetDataDisks --> getVirtualMachine --> vmCache.Get
I1128 05:04:58.206841       1 attacher.go:145] azureDisk - VolumesAreAttached: check volume "andy-vmss11010-dynamic-pvc-884809b5-f2ca-11e8-a757-000d3a01450e" (specName: "pvc-884809b5-f2ca-11e8-a757-000d3a01450e") is no longer attached
I1128 05:04:58.206945       1 operation_generator.go:193] VerifyVolumesAreAttached determined volume "kubernetes.io/azure-disk//subscriptions/.../resourceGroups/andy-vmss11010/providers/Microsoft.Compute/disks/andy-vmss11010-dynamic-pvc-049158b4-f2ba-11e8-ade9-000d3a01450e" (spec.Name: "pvc-049158b4-f2ba-11e8-ade9-000d3a01450e") is no longer attached to node "k8s-agentpool-39687990-vmss000000", therefore it was marked as detached.

Release note:

fix detach azure disk issue

/sig azure
/assign @feiskyer
cc @khenidak @brendandburns @antoineco

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. sig/azure cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. labels Nov 28, 2018
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 28, 2018
@andyzhangx (Member, Author)

/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 28, 2018
@feiskyer (Member)

/milestone v1.13
/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Nov 28, 2018
@k8s-ci-robot k8s-ci-robot added this to the v1.13 milestone Nov 28, 2018
@feiskyer (Member)

Added to v1.13 since this is a critical bug fix.

@feiskyer (Member) left a comment:


/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 28, 2018
@AishSundar (Contributor)

@andyzhangx @feiskyer is this an issue introduced recently in 1.13 that we are trying to address with this fix? Are there CI tests to verify the goodness of the fix? We are very late in the 1.13 release cycle and I would like to avoid any code churn unless it is a critical urgent issue introduced in 1.13. Can this wait until 1.13.1?

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 28, 2018
@andyzhangx (Member, Author)

andyzhangx commented Nov 28, 2018

Hi @AishSundar, yes, we are going to fix the issue right in v1.13.0, and the code change is confined to the azure directory. It's important that this fix makes it into v1.13.0, thanks.
Update:
About the test: I have done a few e2e tests on this, but since it's not a common scenario, we don't have an automated test for it; it's manual for now.

@andyzhangx (Member, Author)

/test pull-kubernetes-integration

@andyzhangx andyzhangx changed the title fix detach azure disk issue fix detach azure disk issue due to dirty cache Nov 28, 2018
@andyzhangx (Member, Author)

/hold cancel
@AishSundar just let me know if you have any concerns about this PR, thanks.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 28, 2018
@k8s-ci-robot k8s-ci-robot merged commit b75a346 into kubernetes:master Nov 28, 2018
k8s-ci-robot added a commit that referenced this pull request Nov 29, 2018
…1495-upstream-release-1.12

Automated cherry pick of #71495: fix detach azure disk issue by cleaning vm cache
k8s-ci-robot added a commit that referenced this pull request Dec 10, 2018
…1495-upstream-release-1.10

Automated cherry pick of #71495: fix detach azure disk issue by cleaning vm cache

// Invalidate the cache right after updating
key := buildVmssCacheKey(nodeResourceGroup, ss.makeVmssVMName(ssName, instanceID))
defer ss.vmssVMCache.Delete(key)
@andyzhangx (Member, Author) commented:

Here is the PR for this dirty cache issue. Originally, whether the update succeeds or not, this PR cleans the cache anyway.
I could move this cache-cleanup operation (defer ss.vmssVMCache.Delete(key)) so it only runs when an error happens; would that solve your issue, @khenidak? I can build a hotfix image if you want to try this fix.
cc @feiskyer


Successfully merging this pull request may close these issues.

Azure Disks occasionally mounted in a way leading to I/O errors
4 participants