-
Notifications
You must be signed in to change notification settings - Fork 39.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cherry pick #537 from cloud provider azure: Refresh VM cache when node is not found #100110
Cherry pick #537 from cloud provider azure: Refresh VM cache when node is not found #100110
Conversation
/lgtm |
/triage accepted |
@nilo19: The label(s) In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/priority important-soon |
@nilo19 @feiskyer are the PR tests healthy? I see a bunch of failures that seem unrelated to the PR and the job history has a lot of failures too https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-azure-file https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-aks-engine-azure |
/retest |
test failures related to #99909 |
/retest @nilo19 anything else I can do to get this merged? |
@andrewsykim could you help to approve this change? it is a bug fix. |
Opened kubernetes-sigs/azurefile-csi-driver#600. I can take a look |
kubernetes/test-infra#21459 should fix the azure file failure. |
/retest now that kubernetes/test-infra#21459 has merged |
/retest @chewong are the tests expected to pass now? |
I still don't see kubekins-e2e test image getting updated for almost 2 weeks. Let me follow up on slack. |
So the problem is that kubekins-e2e build has been failing for a while (https://testgrid.k8s.io/sig-testing-images#kubekins-e2e). I will take a look once I am free. |
/test pull-kubernetes-e2e-aks-engine-conformance |
Opened kubernetes/test-infra#21543 to fix the kubekins-e2e problem |
/retest |
The following three failed jobs were renamed: |
@CecileRobertMichon could you do a rebase and retrigger tests again? |
b37379d
to
8850c8c
Compare
/test pull-kubernetes-e2e-aks-engine-conformance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
ping @andrewsykim @cheftako for approval
@CecileRobertMichon: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
Thanks @CecileRobertMichon
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andrewsykim, CecileRobertMichon, nilo19 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
return availabilitySetNodes.Has(nodeName), nil | ||
cachedNodes := cached.(availabilitySetEntry).nodeNames | ||
// if the node is not in the cache, assume the node has joined after the last cache refresh and attempt to refresh the cache. | ||
if !cachedNodes.Has(nodeName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unit tests for this specific case would be useful I think
// if the node is not in the cache, assume the node has joined after the last cache refresh and attempt to refresh the cache. | ||
if !cachedNodes.Has(nodeName) { | ||
klog.V(2).Infof("Node %s has joined the cluster since the last VM cache refresh, refreshing the cache", nodeName) | ||
cached, err = ss.availabilitySetNodesCache.Get(availabilitySetNodesKey, azcache.CacheReadTypeForceRefresh) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this mean we refresh the whole node cache for every new VM in the cluster? Is that expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tldr: yes. It's a trade-off we can't get both accuracy and performance but this minimizes the damage by adding a node cache.
See discussion in kubernetes-sigs/cloud-provider-azure#537 (comment)
What type of PR is this?
/kind bug
What this PR does / why we need it: kubernetes-sigs/cloud-provider-azure#537
This fixes a bug affecting clusters with virtual machines when vmType is set to "vmss". What happens is the control manager comes online and queries for azure machines power status. Since the machines are not available yet they are not in the cache. When the request comes in for the load balancer the cache is queried and reports that the node does not exist as a VMAS and attempts to run the VMSS code hence the following error message: failed: not a vmss instance. The same error also occurs when trying to expose a load balancer service. When it is found the cache it goes down the correct code path.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
/sig cloud-provider
/area provider/azure
/assign @feiskyer