
azure: remove disk locks per vm during attach/detach #85115

Merged
merged 2 commits into kubernetes:master on Nov 14, 2019

Conversation

@aramase (Member) commented Nov 11, 2019

What type of PR is this?

/kind bug

What this PR does / why we need it:

  • Changes the per-VM lock logic for attach/detach operations.

Currently, the per-VM lock uses the number of CPUs by default as the size of its mutex pool, which throttles the number of concurrent attach/detach requests. With the updated locking logic, disk attach/detach operations for different nodes run concurrently, while operations for the same node are still executed sequentially.
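For illustration, here is a minimal sketch of the per-node locking pattern described above, using a lock map keyed by node name. The helper names (nodeLockMap, lockFor, attachDisk) are hypothetical and this is not necessarily the exact code added by this PR; it only shows how same-node operations can be serialized without throttling other nodes.

package main

import (
	"fmt"
	"sync"
)

// nodeLockMap serializes operations per node name while letting different
// nodes proceed in parallel. Illustration only; names are hypothetical.
type nodeLockMap struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newNodeLockMap() *nodeLockMap {
	return &nodeLockMap{locks: map[string]*sync.Mutex{}}
}

// lockFor returns the mutex for nodeName, creating it on first use.
func (m *nodeLockMap) lockFor(nodeName string) *sync.Mutex {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.locks[nodeName]; !ok {
		m.locks[nodeName] = &sync.Mutex{}
	}
	return m.locks[nodeName]
}

// attachDisk stands in for the real Azure VM update that attaches a disk.
func attachDisk(locks *nodeLockMap, nodeName, diskURI string) {
	l := locks.lockFor(nodeName)
	l.Lock()         // attach/detach on the same node is serialized here...
	defer l.Unlock() // ...but other nodes are never blocked by this lock
	fmt.Printf("attaching %s to %s\n", diskURI, nodeName)
}

func main() {
	locks := newNodeLockMap()
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			attachDisk(locks, fmt.Sprintf("node-%d", i%2), fmt.Sprintf("disk-%d", i))
		}(i)
	}
	wg.Wait()
}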

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

azure: update disk lock logic per vm during attach/detach to allow concurrent updates for different nodes.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

Test results:

Testing on a cluster with 1 VMSS (100 nodes).

With the changes in this PR:
Time taken for 10 disks (5 pods on the same node + 5 pods on different nodes): 9min
Time taken to scale from 10 disks to 60 disks: 6m36s
Time taken to detach and delete all 60 disks: 14m28s

Before the changes:
Time taken for 10 disks (5 pods on the same node + 5 pods on different nodes): 16min
Time taken to scale from 10 disks to 60 disks: 49m58s

@k8s-ci-robot added the release-note, kind/bug, size/XS, cncf-cla: yes, needs-sig, and needs-priority labels on Nov 11, 2019
@aramase changed the title from "azure: remove disk locks per vm during attach/detach" to "[WIP] azure: remove disk locks per vm during attach/detach" on Nov 11, 2019
@k8s-ci-robot added the do-not-merge/work-in-progress label on Nov 11, 2019
@aramase (Member, Author) commented Nov 11, 2019

/area provider/azure
/hold

(to test the repercussions of removing the lock, to ensure that concurrent updates to the same VM are not a problem, and to run soak tests)

/cc @khenidak

@k8s-ci-robot added the do-not-merge/hold and area/provider/azure labels on Nov 11, 2019
@k8s-ci-robot added the area/cloudprovider and sig/cloud-provider labels and removed the needs-sig label on Nov 11, 2019
@aramase (Member, Author) commented Nov 11, 2019

/assign @andyzhangx

@andyzhangx (Member) commented:

@aramase the disk attach/detach lock cannot simply be removed right now; is there a way to replace it with a better lock? The current lock uses the number of CPUs by default, so we could instead change the default size of the lock pool (the maximum number of concurrent operations).

@aramase (Member, Author) commented Nov 12, 2019

@andyzhangx I'm going to test the behavior with the lock removed. If there are any issues, then yes, we can investigate using the existing lock with a different configuration, or a different locking implementation.

@k8s-ci-robot added the size/M label and removed the size/XS label on Nov 12, 2019
@aramase (Member, Author) commented Nov 12, 2019

/test pull-kubernetes-e2e-gce

@aramase (Member, Author) commented Nov 13, 2019

/test pull-kubernetes-e2e-gce

@@ -58,17 +57,16 @@ var defaultBackOff = kwait.Backoff{
Jitter: 0.0,
}

// acquire lock to attach/detach disk in one node
var diskOpMutex = keymutex.NewHashed(0)
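
For context, a minimal usage sketch of the hashed key mutex being removed above, assuming it is the keymutex package from k8s.io/utils: NewHashed(0) falls back to a pool sized by runtime.NumCPU(), and each key (node name) is hashed onto that small pool, so two different nodes can end up sharing a mutex.

package main

import (
	"fmt"
	"runtime"

	"k8s.io/utils/keymutex"
)

func main() {
	// NewHashed(0) sizes the mutex pool to runtime.NumCPU() by default.
	diskOpMutex := keymutex.NewHashed(0)
	fmt.Println("approximate lock pool size:", runtime.NumCPU())

	diskOpMutex.LockKey("node-a")
	// The attach/detach call for node-a would run here. If "node-b" hashes
	// onto the same mutex, its operation waits even though it targets a
	// different VM.
	_ = diskOpMutex.UnlockKey("node-a")
}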


@k8s-ci-robot added the size/L label and removed the size/M label on Nov 13, 2019
@aramase (Member, Author) commented Nov 13, 2019

/test pull-kubernetes-node-e2e-containerd

@aramase (Member, Author) commented Nov 14, 2019

/test pull-kubernetes-kubemark-e2e-gce-big

@aramase (Member, Author) commented Nov 14, 2019

Test results:

Testing on a cluster with 1 VMSS (100 nodes).

With the changes in this PR:
Time taken for 10 disks (5 pods on the same node + 5 pods on different nodes): 9min
Time taken to scale from 10 disks to 60 disks: 6m36s
Time taken to detach and delete all 60 disks: 14m28s

Before the changes:
Time taken for 10 disks (5 pods on the same node + 5 pods on different nodes): 16min
Time taken to scale from 10 disks to 60 disks: 49m58s

The tests were done on a cluster with 2 vCPUs, which means the keymutex pool holds only 2 mutexes. On AKS, the master has 8 vCPUs according to @andyzhangx, so with the old keymutex there could be at most 8 concurrent attach/detach updates in parallel, provided there were no hash collisions. With the changes in this PR, updates to different VMs all happen in parallel, which reduces the total attach/detach time when many disks are involved.
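
As a rough, self-contained illustration of that cap (using synthetic node names and a generic hash, not the actual keymutex implementation): with a pool of only 2 mutexes, 100 nodes still share 2 locks, so at most 2 attach/detach operations can be in flight at any time.

package main

import (
	"fmt"
	"hash/fnv"
)

func main() {
	const poolSize = 2 // NumCPU on the test cluster described above
	counts := make([]int, poolSize)
	for i := 0; i < 100; i++ {
		h := fnv.New32a()
		fmt.Fprintf(h, "vmss-node-%d", i) // synthetic node name
		counts[h.Sum32()%poolSize]++
	}
	// All 100 nodes map onto just 2 mutexes (roughly half in each bucket),
	// so concurrency is capped at 2 regardless of node count.
	fmt.Println(counts)
}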

@andyzhangx PTAL!

/hold
(adding a hold because I want to get approval from @khenidak too)

@aramase changed the title from "[WIP] azure: remove disk locks per vm during attach/detach" to "azure: remove disk locks per vm during attach/detach" on Nov 14, 2019
@k8s-ci-robot removed the do-not-merge/work-in-progress label on Nov 14, 2019
@andyzhangx (Member) left a review:


/lgtm
/approve

@k8s-ci-robot added the lgtm label on Nov 14, 2019
@andyzhangx (Member) commented:

/hold cancel
since it's quite tight

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx, aramase

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label and removed the do-not-merge/hold label on Nov 14, 2019
@andyzhangx (Member) commented:

/priority important-soon
/sig cloud-provider
/area provider/azure

@k8s-ci-robot added the priority/important-soon label and removed the needs-priority label on Nov 14, 2019
@k8s-ci-robot merged commit 5dd641e into kubernetes:master on Nov 14, 2019
@k8s-ci-robot added this to the v1.17 milestone on Nov 14, 2019
@aramase deleted the azure-disk-lock branch on November 14, 2019 at 07:43
@craiglpeters added this to Done in Provider Azure on Nov 14, 2019
@andyzhangx (Member) commented:

@aramase could you also cherry-pick this PR to the older releases? Thanks.

@aramase (Member, Author) commented Nov 27, 2019

@andyzhangx will do!

k8s-ci-robot added a commit that referenced this pull request Dec 3, 2019
…5-upstream-release-1.16

Automated cherry pick of #85115: remove disk locks per vm
k8s-ci-robot added a commit that referenced this pull request Dec 5, 2019
…5-upstream-release-1.15

Automated cherry pick of #85115: remove disk locks per vm
k8s-ci-robot added a commit that referenced this pull request Dec 5, 2019
…5-upstream-release-1.14

Automated cherry pick of #85115: remove disk locks per vm