
azure: remove disk locks per vm during attach/detach #85115

Merged
merged 2 commits into kubernetes:master on Nov 14, 2019

Conversation

@aramase (Member) commented Nov 11, 2019

What type of PR is this?

/kind bug

What this PR does / why we need it:

  • Changes the per-VM lock logic for attach/detach operations.

Currently, the per-VM lock uses the number of CPUs by default as the size of its mutex pool, which throttles the number of concurrent attach/detach requests. With the updated locking logic, disk attach/detach operations for different nodes run concurrently, while operations for the same node are still executed sequentially.
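For illustration, here is a minimal sketch of the per-node locking pattern described above, using a lock map keyed by node name. The helper names (nodeLockMap, lockFor, attachDisk) are hypothetical and this is not necessarily the exact code added by this PR; it only shows how same-node operations can be serialized without throttling other nodes.

package main

import (
	"fmt"
	"sync"
)

// nodeLockMap serializes operations per node name while letting different
// nodes proceed in parallel. Illustration only; names are hypothetical.
type nodeLockMap struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newNodeLockMap() *nodeLockMap {
	return &nodeLockMap{locks: map[string]*sync.Mutex{}}
}

// lockFor returns the mutex for nodeName, creating it on first use.
func (m *nodeLockMap) lockFor(nodeName string) *sync.Mutex {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.locks[nodeName]; !ok {
		m.locks[nodeName] = &sync.Mutex{}
	}
	return m.locks[nodeName]
}

// attachDisk stands in for the real Azure VM update that attaches a disk.
func attachDisk(locks *nodeLockMap, nodeName, diskURI string) {
	l := locks.lockFor(nodeName)
	l.Lock()         // attach/detach on the same node is serialized here...
	defer l.Unlock() // ...but other nodes are never blocked by this lock
	fmt.Printf("attaching %s to %s\n", diskURI, nodeName)
}

func main() {
	locks := newNodeLockMap()
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			attachDisk(locks, fmt.Sprintf("node-%d", i%2), fmt.Sprintf("disk-%d", i))
		}(i)
	}
	wg.Wait()
}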

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

azure: update disk lock logic per vm during attach/detach to allow concurrent updates for different nodes.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

Test results:

Testing on a cluster with 1 VMSS (100 nodes).

With the changes in this PR:
Time taken for 10 disks (5 pods on the same node + 5 pods on different nodes): 9min
Time taken to scale from 10 disks to 60 disks: 6m36s
Time taken to detach and delete all 60 disks: 14m28s

Before the changes:
Time taken for 10 disks (5 pods on the same node + 5 pods on different nodes): 16min
Time taken to scale from 10 disks to 60 disks: 49m58s

@k8s-ci-robot added the release-note, kind/bug, size/XS, cncf-cla: yes, needs-sig, and needs-priority labels on Nov 11, 2019
@aramase changed the title from "azure: remove disk locks per vm during attach/detach" to "[WIP] azure: remove disk locks per vm during attach/detach" on Nov 11, 2019
@k8s-ci-robot added the do-not-merge/work-in-progress label on Nov 11, 2019
@aramase (Member, Author) commented Nov 11, 2019

/area provider/azure
/hold

(to test the repercussions of removing the lock, to ensure that concurrent updates to the same VM are not a problem, and to run soak tests)

/cc @khenidak

@k8s-ci-robot added the do-not-merge/hold and area/provider/azure labels on Nov 11, 2019
@k8s-ci-robot added the area/cloudprovider and sig/cloud-provider labels and removed the needs-sig label on Nov 11, 2019
@aramase (Member, Author) commented Nov 11, 2019

/assign @andyzhangx

@andyzhangx (Member) commented:

@aramase the disk attach/detach lock cannot simply be removed right now; is there a way to replace it with a better lock? The current lock uses the number of CPUs by default, so we could instead change the default size of the lock pool (the maximum number of concurrent operations).

@aramase (Member, Author) commented Nov 12, 2019

@andyzhangx I'm going to test the behavior with the lock removed. If there are any issues, then yes, we can investigate using the existing lock with a different configuration, or a different locking implementation.

@k8s-ci-robot added the size/M label and removed the size/XS label on Nov 12, 2019
@aramase (Member, Author) commented Nov 12, 2019

/test pull-kubernetes-e2e-gce

@aramase (Member, Author) commented Nov 13, 2019

/test pull-kubernetes-e2e-gce

@@ -58,17 +57,16 @@ var defaultBackOff = kwait.Backoff{
Jitter: 0.0,
}

// acquire lock to attach/detach disk in one node
var diskOpMutex = keymutex.NewHashed(0)
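
For context, a minimal usage sketch of the hashed key mutex being removed above, assuming it is the keymutex package from k8s.io/utils: NewHashed(0) falls back to a pool sized by runtime.NumCPU(), and each key (node name) is hashed onto that small pool, so two different nodes can end up sharing a mutex.

package main

import (
	"fmt"
	"runtime"

	"k8s.io/utils/keymutex"
)

func main() {
	// NewHashed(0) sizes the mutex pool to runtime.NumCPU() by default.
	diskOpMutex := keymutex.NewHashed(0)
	fmt.Println("approximate lock pool size:", runtime.NumCPU())

	diskOpMutex.LockKey("node-a")
	// The attach/detach call for node-a would run here. If "node-b" hashes
	// onto the same mutex, its operation waits even though it targets a
	// different VM.
	_ = diskOpMutex.UnlockKey("node-a")
}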


@k8s-ci-robot added the size/L label and removed the size/M label on Nov 13, 2019
@aramase (Member, Author) commented Nov 13, 2019

/test pull-kubernetes-node-e2e-containerd

@aramase (Member, Author) commented Nov 14, 2019

/test pull-kubernetes-kubemark-e2e-gce-big

@aramase (Member, Author) commented Nov 14, 2019

Test results:

Testing on a cluster with 1 VMSS (100 nodes).

With the changes in this PR:
Time taken for 10 disks (5 pods on the same node + 5 pods on different nodes): 9min
Time taken to scale from 10 disks to 60 disks: 6m36s
Time taken to detach and delete all 60 disks: 14m28s

Before the changes:
Time taken for 10 disks (5 pods on the same node + 5 pods on different nodes): 16min
Time taken to scale from 10 disks to 60 disks: 49m58s

The tests were done on a cluster with 2 vCPUs, which means the keymutex pool holds only 2 mutexes. On AKS, the master has 8 vCPUs according to @andyzhangx, so with the old keymutex there could be at most 8 concurrent attach/detach updates in parallel, provided there were no hash collisions. With the changes in this PR, updates to different VMs all happen in parallel, which reduces the total attach/detach time when many disks are involved.
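
As a rough, self-contained illustration of that cap (using synthetic node names and a generic hash, not the actual keymutex implementation): with a pool of only 2 mutexes, 100 nodes still share 2 locks, so at most 2 attach/detach operations can be in flight at any time.

package main

import (
	"fmt"
	"hash/fnv"
)

func main() {
	const poolSize = 2 // NumCPU on the test cluster described above
	counts := make([]int, poolSize)
	for i := 0; i < 100; i++ {
		h := fnv.New32a()
		fmt.Fprintf(h, "vmss-node-%d", i) // synthetic node name
		counts[h.Sum32()%poolSize]++
	}
	// All 100 nodes map onto just 2 mutexes (roughly half in each bucket),
	// so concurrency is capped at 2 regardless of node count.
	fmt.Println(counts)
}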

@andyzhangx PTAL!

/hold
(adding a hold because I want to get approval from @khenidak too)

@aramase changed the title from "[WIP] azure: remove disk locks per vm during attach/detach" to "azure: remove disk locks per vm during attach/detach" on Nov 14, 2019
@k8s-ci-robot removed the do-not-merge/work-in-progress label on Nov 14, 2019
@andyzhangx (Member) left a review:


/lgtm
/approve

@k8s-ci-robot added the lgtm label on Nov 14, 2019
@andyzhangx (Member) commented:

/hold cancel
since it's quite tight

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx, aramase

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label and removed the do-not-merge/hold label on Nov 14, 2019
@andyzhangx (Member) commented:

/priority important-soon
/sig cloud-provider
/area provider/azure

@k8s-ci-robot added the priority/important-soon label and removed the needs-priority label on Nov 14, 2019
@k8s-ci-robot merged commit 5dd641e into kubernetes:master on Nov 14, 2019
@k8s-ci-robot added this to the v1.17 milestone on Nov 14, 2019
@aramase deleted the azure-disk-lock branch on November 14, 2019 at 07:43
@craiglpeters added this to Done in Provider Azure on Nov 14, 2019
@andyzhangx (Member) commented:

@aramase could you also cherry-pick this PR to the older releases? Thanks.

@aramase (Member, Author) commented Nov 27, 2019

@andyzhangx will do!

k8s-ci-robot added a commit that referenced this pull request Dec 3, 2019
…5-upstream-release-1.16

Automated cherry pick of #85115: remove disk locks per vm
k8s-ci-robot added a commit that referenced this pull request Dec 5, 2019
…5-upstream-release-1.15

Automated cherry pick of #85115: remove disk locks per vm
k8s-ci-robot added a commit that referenced this pull request Dec 5, 2019
…5-upstream-release-1.14

Automated cherry pick of #85115: remove disk locks per vm