fix: avoid recreate vmss cache in race condition #2589

Merged

andyzhangx (Member) commented Oct 21, 2022

What type of PR is this?

/kind bug

What this PR does / why we need it:

fix: avoid recreate vmss cache in race condition
Under a race condition, concurrent getVMSSVMCache calls for the same VMSS can each create a new vmss cache object, which ultimately causes a VMSS list storm. Related logs (each message appears twice with an identical timestamp):

I1018 03:18:45.242671       1 azure_vmss.go:204] Couldn't find VMSS VM with nodeName aks-agentpool-34156162-vmss00000j, refreshing the cache(vmss: aks-agentpool-34156162-vmss, rg: aksdiskscaleai2f68d-nodegroup)
I1018 03:18:45.242671       1 azure_vmss.go:204] Couldn't find VMSS VM with nodeName aks-agentpool-34156162-vmss00000j, refreshing the cache(vmss: aks-agentpool-34156162-vmss, rg: aksdiskscaleai2f68d-nodegroup)

I1018 03:18:47.272599       1 azure_vmss.go:204] Couldn't find VMSS VM with nodeName aks-agentpool-34156162-vmss00000x, refreshing the cache(vmss: aks-agentpool-34156162-vmss, rg: aksdiskscaleai2f68d-nodegroup)
I1018 03:18:47.272599       1 azure_vmss.go:204] Couldn't find VMSS VM with nodeName aks-agentpool-34156162-vmss00000x, refreshing the cache(vmss: aks-agentpool-34156162-vmss, rg: aksdiskscaleai2f68d-nodegroup)

I1018 03:18:51.474445       1 azure_vmss.go:204] Couldn't find VMSS VM with nodeName aks-agentpool-34156162-vmss00000l, refreshing the cache(vmss: aks-agentpool-34156162-vmss, rg: aksdiskscaleai2f68d-nodegroup)
I1018 03:18:51.474445       1 azure_vmss.go:204] Couldn't find VMSS VM with nodeName aks-agentpool-34156162-vmss00000l, refreshing the cache(vmss: aks-agentpool-34156162-vmss, rg: aksdiskscaleai2f68d-nodegroup)

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

fix: avoid recreate vmss cache in race condition

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

fix: avoid recreate vmss cache in race condition

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 21, 2022

netlify bot commented Oct 21, 2022

Deploy Preview for kubernetes-sigs-cloud-provide-azure canceled.

Name Link
🔨 Latest commit fd6696b
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-cloud-provide-azure/deploys/6354a7caba23d9000970699a

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 21, 2022

// lock and try find cacheKey from cache again, refresh cache if still not found
lockKey := cacheKey + "/search"
ss.lockMap.LockEntry(lockKey)

Contributor

Have you considered using singleflight to suppress duplicate calls rather than adding a new lock?

Member Author

@edreed Using singleflight would require refactoring the vmss cache. Adding a new lock is a small, safe change, so let's roll out this fix first.

Contributor

It shouldn't require a refactor for use in this context, e.g.

// Collapse concurrent cache creation for the same cacheKey into a single call.
v, err, _ := ss.group.Do(cacheKey, func() (interface{}, error) {
	cache, err := ss.newVMSSVirtualMachinesCache(resourceGroup, vmssName, cacheKey)
	if err != nil {
		return nil, err
	}
	// Publish the new cache so later lookups find it without calling Do again.
	ss.vmssVMCache.Store(cacheKey, cache)
	return cache, nil
})
if err != nil {
	return "", nil, err
}
cache := v.(*azcache.TimedCache)
return cacheKey, cache, nil

where ss.group is a singleflight.Group.

Member Author

Thanks, the PR now uses singleflight.Group. I think this PR should fix the vmss list storm: previously there was no lock, so many new vmss caches were created simultaneously, which caused the storm.

coveralls commented Oct 23, 2022

Coverage decreased (-0.006%) to 79.855% when pulling fd6696b on andyzhangx:avoid-recreate-vmss-cache into c60daf9 on kubernetes-sigs:master.

@MartinForReal
Contributor

/test pull-cloud-provider-azure-e2e-ccm-capz


@MartinForReal
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 24, 2022
@MartinForReal
Contributor

/cherrypick release-1.25

@k8s-infra-cherrypick-robot

@MartinForReal: once the present PR merges, I will cherry-pick it on top of release-1.25 in a new PR and assign it to you.

In response to this:

/cherrypick release-1.25

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@MartinForReal
Contributor

/cherrypick release-1.24

@k8s-infra-cherrypick-robot

@MartinForReal: once the present PR merges, I will cherry-pick it on top of release-1.24 in a new PR and assign it to you.

In response to this:

/cherrypick release-1.24

@MartinForReal
Contributor

/cherrypick release-1.23

@k8s-infra-cherrypick-robot

@MartinForReal: once the present PR merges, I will cherry-pick it on top of release-1.23 in a new PR and assign it to you.

In response to this:

/cherrypick release-1.23

@MartinForReal
Contributor

/cherrypick release-1.1

@k8s-infra-cherrypick-robot

@MartinForReal: once the present PR merges, I will cherry-pick it on top of release-1.1 in a new PR and assign it to you.

In response to this:

/cherrypick release-1.1

@k8s-ci-robot k8s-ci-robot merged commit 635406f into kubernetes-sigs:master Oct 24, 2022
@k8s-infra-cherrypick-robot

@MartinForReal: #2589 failed to apply on top of branch "release-1.25":

Applying: fix: avoid recreate vmss cache in race condition
Applying: feat: use singleflight.Group instead of lock to fix race condition
Using index info to reconstruct a base tree...
M	go.mod
M	pkg/provider/azure_vmss.go
M	pkg/provider/azure_vmss_cache.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/provider/azure_vmss_cache.go
Auto-merging pkg/provider/azure_vmss.go
Auto-merging go.mod
Applying: doc: fix spelling error
Using index info to reconstruct a base tree...
M	pkg/provider/azure_controller_common.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/provider/azure_controller_common.go
CONFLICT (content): Merge conflict in pkg/provider/azure_controller_common.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0003 doc: fix spelling error
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherrypick release-1.25

@k8s-infra-cherrypick-robot

@MartinForReal: #2589 failed to apply on top of branch "release-1.24":

Applying: fix: avoid recreate vmss cache in race condition
Applying: feat: use singleflight.Group instead of lock to fix race condition
Using index info to reconstruct a base tree...
M	go.mod
M	pkg/provider/azure_vmss.go
M	pkg/provider/azure_vmss_cache.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/provider/azure_vmss_cache.go
Auto-merging pkg/provider/azure_vmss.go
Auto-merging go.mod
CONFLICT (content): Merge conflict in go.mod
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0002 feat: use singleflight.Group instead of lock to fix race condition
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherrypick release-1.24

@k8s-infra-cherrypick-robot

@MartinForReal: #2589 failed to apply on top of branch "release-1.23":

Applying: fix: avoid recreate vmss cache in race condition
Applying: feat: use singleflight.Group instead of lock to fix race condition
Using index info to reconstruct a base tree...
M	go.mod
M	pkg/provider/azure_vmss.go
M	pkg/provider/azure_vmss_cache.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/provider/azure_vmss_cache.go
Auto-merging pkg/provider/azure_vmss.go
Auto-merging go.mod
Applying: doc: fix spelling error
Using index info to reconstruct a base tree...
M	pkg/provider/azure_controller_common.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/provider/azure_controller_common.go
CONFLICT (content): Merge conflict in pkg/provider/azure_controller_common.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0003 doc: fix spelling error
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherrypick release-1.23

@k8s-infra-cherrypick-robot

@MartinForReal: #2589 failed to apply on top of branch "release-1.1":

Applying: fix: avoid recreate vmss cache in race condition
Applying: feat: use singleflight.Group instead of lock to fix race condition
Using index info to reconstruct a base tree...
M	go.mod
M	pkg/provider/azure_vmss.go
M	pkg/provider/azure_vmss_cache.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/provider/azure_vmss_cache.go
Auto-merging pkg/provider/azure_vmss.go
CONFLICT (content): Merge conflict in pkg/provider/azure_vmss.go
Auto-merging go.mod
CONFLICT (content): Merge conflict in go.mod
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0002 feat: use singleflight.Group instead of lock to fix race condition
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherrypick release-1.1
