fix: avoid recreate vmss cache in race condition #2589

andyzhangx · 2022-10-21T10:53:15Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

fix: avoid recreate vmss cache in race condition
Under a race condition, concurrent getVMSSVMCache calls for the same VMSS can each miss the cache and create their own vmss cache object. The duplicated cache objects then cause a VMSS list storm. Related logs (note that each line appears twice with the same timestamp):

I1018 03:18:45.242671       1 azure_vmss.go:204] Couldn't find VMSS VM with nodeName aks-agentpool-34156162-vmss00000j, refreshing the cache(vmss: aks-agentpool-34156162-vmss, rg: aksdiskscaleai2f68d-nodegroup)
I1018 03:18:45.242671       1 azure_vmss.go:204] Couldn't find VMSS VM with nodeName aks-agentpool-34156162-vmss00000j, refreshing the cache(vmss: aks-agentpool-34156162-vmss, rg: aksdiskscaleai2f68d-nodegroup)

I1018 03:18:47.272599       1 azure_vmss.go:204] Couldn't find VMSS VM with nodeName aks-agentpool-34156162-vmss00000x, refreshing the cache(vmss: aks-agentpool-34156162-vmss, rg: aksdiskscaleai2f68d-nodegroup)
I1018 03:18:47.272599       1 azure_vmss.go:204] Couldn't find VMSS VM with nodeName aks-agentpool-34156162-vmss00000x, refreshing the cache(vmss: aks-agentpool-34156162-vmss, rg: aksdiskscaleai2f68d-nodegroup)

I1018 03:18:51.474445       1 azure_vmss.go:204] Couldn't find VMSS VM with nodeName aks-agentpool-34156162-vmss00000l, refreshing the cache(vmss: aks-agentpool-34156162-vmss, rg: aksdiskscaleai2f68d-nodegroup)
I1018 03:18:51.474445       1 azure_vmss.go:204] Couldn't find VMSS VM with nodeName aks-agentpool-34156162-vmss00000l, refreshing the cache(vmss: aks-agentpool-34156162-vmss, rg: aksdiskscaleai2f68d-nodegroup)
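
The paired log lines above are consistent with several callers missing the cache at the same moment and each building its own cache. Below is a minimal sketch of that check-then-create race. It is an illustration under assumptions, not the provider's code: `vmCache`, `racyGet` and `countRacyCreations` are hypothetical names, and the barrier exists only to force the unlucky interleaving so the effect is reproducible.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// vmCache is a hypothetical stand-in for a per-VMSS cache object.
type vmCache struct{ id int64 }

// racyGet mimics an unguarded check-then-create lookup: every caller that
// misses the map builds a new cache. The barrier forces all callers to
// observe the miss before any of them stores a value.
func racyGet(caches *sync.Map, key string, created *int64, barrier *sync.WaitGroup) *vmCache {
	v, ok := caches.Load(key)
	barrier.Done()
	if ok {
		return v.(*vmCache)
	}
	barrier.Wait() // every caller has now finished its lookup

	// Stands in for the expensive cache construction (a VMSS list call).
	c := &vmCache{id: atomic.AddInt64(created, 1)}
	caches.Store(key, c)
	return c
}

// countRacyCreations runs n concurrent lookups for one key and reports how
// many cache objects were built.
func countRacyCreations(n int) int64 {
	var caches sync.Map
	var created int64
	var barrier, done sync.WaitGroup
	barrier.Add(n)
	done.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer done.Done()
			racyGet(&caches, "rg/vmss", &created, &barrier)
		}()
	}
	done.Wait()
	return atomic.LoadInt64(&created)
}

func main() {
	fmt.Println("caches created:", countRacyCreations(4))
}
```

With the barrier in place, every one of the four callers builds its own cache object, which is the situation this PR is meant to prevent.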

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

fix: avoid recreate vmss cache in race condition

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

fix: avoid recreate vmss cache in race condition

netlify · 2022-10-21T10:53:23Z

✅ Deploy Preview for kubernetes-sigs-cloud-provide-azure canceled.

Name	Link
🔨 Latest commit	`fd6696b`
🔍 Latest deploy log	https://app.netlify.com/sites/kubernetes-sigs-cloud-provide-azure/deploys/6354a7caba23d9000970699a

k8s-ci-robot · 2022-10-21T10:53:27Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [andyzhangx]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

edreed · 2022-10-21T18:11:11Z

pkg/provider/azure_vmss_cache.go

+
+	// lock and try find cacheKey from cache again, refresh cache if still not found
+	lockKey := cacheKey + "/search"
+	ss.lockMap.LockEntry(lockKey)


Have you considered using singleflight to suppress duplicate calls rather than adding a new lock?

@edreed using singleflight would require refactoring the vmss cache, while adding a new lock is a small and safe change. Let's roll out this fix first.
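
For context, the lock-based change in the diff follows a double-checked pattern: take a per-key lock, look the key up again, and build the cache only if it is still missing. Here is a minimal sketch under assumptions. `keyLocks` is a hypothetical stand-in for the provider's lock map (only `LockEntry` is visible in the diff), and `vmCache` and `store` are placeholder types.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// vmCache is a hypothetical stand-in for a per-VMSS cache object.
type vmCache struct{ id int64 }

// keyLocks hands out one mutex per key (hypothetical; not the provider's lockMap).
type keyLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

// lock acquires the mutex for key and returns it so the caller can unlock it.
func (k *keyLocks) lock(key string) *sync.Mutex {
	k.mu.Lock()
	if k.locks == nil {
		k.locks = make(map[string]*sync.Mutex)
	}
	m, ok := k.locks[key]
	if !ok {
		m = &sync.Mutex{}
		k.locks[key] = m
	}
	k.mu.Unlock()
	m.Lock()
	return m
}

type store struct {
	caches  sync.Map
	locks   keyLocks
	created int64
}

// get returns the cache for key, creating it at most once.
func (s *store) get(key string) *vmCache {
	// Fast path: no lock needed when the cache already exists.
	if v, ok := s.caches.Load(key); ok {
		return v.(*vmCache)
	}
	// Slow path: lock, then look again before creating.
	m := s.locks.lock(key + "/search")
	defer m.Unlock()
	if v, ok := s.caches.Load(key); ok {
		return v.(*vmCache)
	}
	c := &vmCache{id: atomic.AddInt64(&s.created, 1)}
	s.caches.Store(key, c)
	return c
}

// countLockedCreations runs n concurrent lookups for one key and reports how
// many cache objects were built.
func countLockedCreations(n int) int64 {
	s := &store{}
	var done sync.WaitGroup
	done.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer done.Done()
			s.get("rg/vmss")
		}()
	}
	done.Wait()
	return atomic.LoadInt64(&s.created)
}

func main() {
	fmt.Println("caches created:", countLockedCreations(16))
}
```

The second lookup under the lock is what prevents callers that queued behind the first creator from building another cache.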

It shouldn't require a refactor for use in this context, e.g.

v, err, _ := ss.group.Do(cacheKey, func() (interface{}, error) {
	cache, err := ss.newVMSSVirtualMachinesCache(resourceGroup, vmssName, cacheKey)
	if err != nil {
		return nil, err
	}
	ss.vmssVMCache.Store(cacheKey, cache)
	return cache, nil
})
if err != nil {
	return "", nil, err
}
cache := v.(*azcache.TimedCache)
return cacheKey, cache, nil

where ss.group is a singleflight.Group.

Thanks, the PR now uses singleflight.Group. I think this PR could fix the vmss list storm: previously there was no lock, so many new vmss caches were created simultaneously, which caused the storm.
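
To experiment with the duplicate-suppression idea outside the provider, here is a standard-library-only sketch. The `flightGroup` type is a simplified stand-in written for this example, not the `golang.org/x/sync/singleflight` package, and `vmCache` and `store` are hypothetical. It also looks the key up again inside the shared function. That extra lookup is an assumption of this sketch and is not part of the snippet in the thread; it covers a caller that missed the cache just before an earlier flight finished.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// vmCache is a hypothetical stand-in for a per-VMSS cache object.
type vmCache struct{ id int64 }

// call tracks one in-flight execution for a key.
type call struct {
	wg  sync.WaitGroup
	val interface{}
	err error
}

// flightGroup is a simplified stand-in for singleflight.Group.
type flightGroup struct {
	mu sync.Mutex
	m  map[string]*call
}

// Do runs fn once per key at a time; concurrent callers share its result.
func (g *flightGroup) Do(key string, fn func() (interface{}, error)) (interface{}, error) {
	g.mu.Lock()
	if g.m == nil {
		g.m = make(map[string]*call)
	}
	if c, ok := g.m[key]; ok {
		g.mu.Unlock()
		c.wg.Wait() // wait for the in-flight call and reuse its result
		return c.val, c.err
	}
	c := &call{}
	c.wg.Add(1)
	g.m[key] = c
	g.mu.Unlock()

	c.val, c.err = fn()
	c.wg.Done()

	g.mu.Lock()
	delete(g.m, key)
	g.mu.Unlock()
	return c.val, c.err
}

type store struct {
	caches  sync.Map
	group   flightGroup
	created int64
}

// get returns the cache for key, creating it at most once.
func (s *store) get(key string) (*vmCache, error) {
	if v, ok := s.caches.Load(key); ok {
		return v.(*vmCache), nil
	}
	v, err := s.group.Do(key, func() (interface{}, error) {
		// Look again: an earlier flight may have stored the cache already.
		if v, ok := s.caches.Load(key); ok {
			return v, nil
		}
		c := &vmCache{id: atomic.AddInt64(&s.created, 1)}
		s.caches.Store(key, c)
		return c, nil
	})
	if err != nil {
		return nil, err
	}
	return v.(*vmCache), nil
}

// countFlightCreations runs n concurrent lookups for one key and reports how
// many cache objects were built.
func countFlightCreations(n int) int64 {
	s := &store{}
	var done sync.WaitGroup
	done.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer done.Done()
			s.get("rg/vmss")
		}()
	}
	done.Wait()
	return atomic.LoadInt64(&s.created)
}

func main() {
	fmt.Println("caches created:", countFlightCreations(16))
}
```

Compared with a per-key lock, callers that arrive during an in-flight creation receive the same result directly instead of queuing to repeat the lookup.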

coveralls · 2022-10-23T02:09:23Z

Coverage decreased (-0.006%) to 79.855% when pulling fd6696b on andyzhangx:avoid-recreate-vmss-cache into c60daf9 on kubernetes-sigs:master.

MartinForReal · 2022-10-24T02:03:40Z

/test pull-cloud-provider-azure-e2e-ccm-capz

MartinForReal · 2022-10-24T12:39:35Z

/lgtm

MartinForReal · 2022-10-24T12:39:56Z

/cherrypick release-1.25

k8s-infra-cherrypick-robot · 2022-10-24T12:39:58Z

@MartinForReal: once the present PR merges, I will cherry-pick it on top of release-1.25 in a new PR and assign it to you.

In response to this:

/cherrypick release-1.25

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

MartinForReal · 2022-10-24T12:40:03Z

/cherrypick release-1.24

k8s-infra-cherrypick-robot · 2022-10-24T12:40:04Z

@MartinForReal: once the present PR merges, I will cherry-pick it on top of release-1.24 in a new PR and assign it to you.

In response to this:

/cherrypick release-1.24

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

MartinForReal · 2022-10-24T12:40:08Z

/cherrypick release-1.23

k8s-infra-cherrypick-robot · 2022-10-24T12:40:09Z

@MartinForReal: once the present PR merges, I will cherry-pick it on top of release-1.23 in a new PR and assign it to you.

In response to this:

/cherrypick release-1.23

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

MartinForReal · 2022-10-24T12:40:14Z

/cherrypick release-1.1

k8s-infra-cherrypick-robot · 2022-10-24T12:40:15Z

@MartinForReal: once the present PR merges, I will cherry-pick it on top of release-1.1 in a new PR and assign it to you.

In response to this:

/cherrypick release-1.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-infra-cherrypick-robot · 2022-10-24T12:41:16Z

@MartinForReal: #2589 failed to apply on top of branch "release-1.25":

Applying: fix: avoid recreate vmss cache in race condition
Applying: feat: use singleflight.Group instead of lock to fix race condition
Using index info to reconstruct a base tree...
M	go.mod
M	pkg/provider/azure_vmss.go
M	pkg/provider/azure_vmss_cache.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/provider/azure_vmss_cache.go
Auto-merging pkg/provider/azure_vmss.go
Auto-merging go.mod
Applying: doc: fix spelling error
Using index info to reconstruct a base tree...
M	pkg/provider/azure_controller_common.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/provider/azure_controller_common.go
CONFLICT (content): Merge conflict in pkg/provider/azure_controller_common.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0003 doc: fix spelling error
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherrypick release-1.25

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-infra-cherrypick-robot · 2022-10-24T12:41:52Z

@MartinForReal: #2589 failed to apply on top of branch "release-1.24":

Applying: fix: avoid recreate vmss cache in race condition
Applying: feat: use singleflight.Group instead of lock to fix race condition
Using index info to reconstruct a base tree...
M	go.mod
M	pkg/provider/azure_vmss.go
M	pkg/provider/azure_vmss_cache.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/provider/azure_vmss_cache.go
Auto-merging pkg/provider/azure_vmss.go
Auto-merging go.mod
CONFLICT (content): Merge conflict in go.mod
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0002 feat: use singleflight.Group instead of lock to fix race condition
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherrypick release-1.24

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-infra-cherrypick-robot · 2022-10-24T12:42:30Z

@MartinForReal: #2589 failed to apply on top of branch "release-1.23":

Applying: fix: avoid recreate vmss cache in race condition
Applying: feat: use singleflight.Group instead of lock to fix race condition
Using index info to reconstruct a base tree...
M	go.mod
M	pkg/provider/azure_vmss.go
M	pkg/provider/azure_vmss_cache.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/provider/azure_vmss_cache.go
Auto-merging pkg/provider/azure_vmss.go
Auto-merging go.mod
Applying: doc: fix spelling error
Using index info to reconstruct a base tree...
M	pkg/provider/azure_controller_common.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/provider/azure_controller_common.go
CONFLICT (content): Merge conflict in pkg/provider/azure_controller_common.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0003 doc: fix spelling error
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherrypick release-1.23

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-infra-cherrypick-robot · 2022-10-24T12:43:07Z

@MartinForReal: #2589 failed to apply on top of branch "release-1.1":

Applying: fix: avoid recreate vmss cache in race condition
Applying: feat: use singleflight.Group instead of lock to fix race condition
Using index info to reconstruct a base tree...
M	go.mod
M	pkg/provider/azure_vmss.go
M	pkg/provider/azure_vmss_cache.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/provider/azure_vmss_cache.go
Auto-merging pkg/provider/azure_vmss.go
CONFLICT (content): Merge conflict in pkg/provider/azure_vmss.go
Auto-merging go.mod
CONFLICT (content): Merge conflict in go.mod
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0002 feat: use singleflight.Group instead of lock to fix race condition
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherrypick release-1.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 21, 2022

k8s-ci-robot requested review from feiskyer and MartinForReal October 21, 2022 10:53

k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 21, 2022

edreed reviewed Oct 21, 2022

andyzhangx added 2 commits October 23, 2022 02:03

fix: avoid recreate vmss cache in race condition

d7c8712

feat: use singleflight.Group instead of lock to fix race condition

cda9e11

andyzhangx force-pushed the avoid-recreate-vmss-cache branch from 44d94d0 to cda9e11 Compare October 23, 2022 02:03

doc: fix spelling error

fd6696b

k8s-ci-robot assigned MartinForReal Oct 24, 2022

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 24, 2022

k8s-ci-robot merged commit 635406f into kubernetes-sigs:master Oct 24, 2022

Review thread on pkg/provider/azure_vmss_cache.go, comments in order: edreed (Oct 21, 2022), andyzhangx (Oct 21, 2022), edreed (Oct 22, 2022), andyzhangx (Oct 24, 2022)
