Cherry pick #537 from cloud provider azure: Refresh VM cache when node is not found #100110

CecileRobertMichon · 2021-03-11T03:21:07Z

What type of PR is this?

/kind bug

What this PR does / why we need it: kubernetes-sigs/cloud-provider-azure#537

This fixes a bug affecting clusters with virtual machines when vmType is set to "vmss". The controller manager comes online and queries Azure for the machines' power status. Because the machines are not available yet, they are not in the cache. When a load balancer request comes in, the cache is queried and reports that the node does not exist as a VMAS node, so the VMSS code path runs, which produces the following error message: failed: not a vmss instance. The same error also occurs when trying to expose a load balancer service. Once the node is found in the cache, the request goes down the correct code path.
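To make the failure mode above concrete, here is a minimal Go sketch. It is not the provider's code: `vmasCache`, `nodeLister`, `isVMASNodeStale`, and `isVMASNode` are hypothetical names invented for illustration, and the real cache also has TTLs and locking that are left out here. The sketch only shows why a lookup against a stale set misclassifies a newly joined node, and how refreshing once on a miss avoids that.

```go
package main

// nodeLister stands in for the cloud API call that lists the VMs that
// belong to availability sets (hypothetical, for illustration only).
type nodeLister func() map[string]bool

// vmasCache is a toy cache of availability set node names.
type vmasCache struct {
	list  nodeLister
	nodes map[string]bool
}

// refresh reloads the cached node names from the lister.
func (c *vmasCache) refresh() {
	c.nodes = c.list()
}

// isVMASNodeStale answers from whatever is cached. A node that joined
// after the last refresh is reported as "not an availability set node",
// so a caller would fall through to the VMSS code path.
func (c *vmasCache) isVMASNodeStale(name string) bool {
	return c.nodes[name]
}

// isVMASNode refreshes the cache once on a miss before answering.
func (c *vmasCache) isVMASNode(name string) bool {
	if !c.nodes[name] {
		c.refresh()
	}
	return c.nodes[name]
}
```

Note that this naive version refreshes on every lookup for a node that is not an availability set VM, since such a node never appears in the cached set.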

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fix availability set cache in vmss cache

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

/sig cloud-provider
/area provider/azure
/assign @feiskyer

nilo19 · 2021-03-11T03:23:40Z

/lgtm
/approve

nilo19 · 2021-03-11T03:24:09Z

/triage accepted

k8s-ci-robot · 2021-03-11T03:24:11Z

@nilo19: The label(s) priority/import-soon cannot be applied, because the repository doesn't have them.

In response to this:

/priority import-soon
/triage accepted

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

nilo19 · 2021-03-11T03:25:02Z

/priority important-soon

CecileRobertMichon · 2021-03-11T20:03:20Z

@nilo19 @feiskyer are the PR tests healthy? I see a bunch of failures that seem unrelated to the PR and the job history has a lot of failures too https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-azure-file https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-aks-engine-azure

nilo19 · 2021-03-12T01:07:39Z

/retest

nilo19 · 2021-03-12T02:32:52Z

test failures related to #99909

CecileRobertMichon · 2021-03-16T18:02:09Z

/retest

@nilo19 anything else I can do to get this merged?

feiskyer · 2021-03-17T04:59:09Z

@andrewsykim could you help approve this change? It is a bug fix.

CecileRobertMichon · 2021-03-19T17:52:25Z

@feiskyer @nilo19 @chewong do you know why azure-file tests are failing?

chewong · 2021-03-19T18:00:41Z

Opened kubernetes-sigs/azurefile-csi-driver#600. I can take a look

chewong · 2021-03-19T20:44:23Z

kubernetes/test-infra#21459 should fix the azure file failure.
/test pull-kubernetes-e2e-aks-engine-conformance
/test pull-kubernetes-e2e-aks-engine-azure-windows
/test pull-kubernetes-e2e-aks-engine-windows-containerd

CecileRobertMichon · 2021-03-22T19:38:47Z

/retest

now that kubernetes/test-infra#21459 has merged

CecileRobertMichon · 2021-03-24T17:02:18Z

/retest

@chewong are the tests expected to pass now?

chewong · 2021-03-24T17:16:03Z

I still don't see kubekins-e2e test image getting updated for almost 2 weeks. Let me follow up on slack.

chewong · 2021-03-24T22:26:56Z

So the problem is that kubekins-e2e build has been failing for a while (https://testgrid.k8s.io/sig-testing-images#kubekins-e2e). I will take a look once I am free.

CecileRobertMichon · 2021-03-24T22:59:58Z

/test pull-kubernetes-e2e-aks-engine-conformance

chewong · 2021-03-26T17:42:54Z

Opened kubernetes/test-infra#21543 to fix the kubekins-e2e problem

feiskyer · 2021-04-01T03:00:29Z

/retest
ping @andrewsykim @cheftako for approval

chewong · 2021-04-01T05:06:10Z

The following three failed jobs were renamed:
pull-kubernetes-e2e-aks-engine-azure -> pull-kubernetes-e2e-aks-engine-conformance
pull-kubernetes-e2e-azure-file -> pull-kubernetes-e2e-aks-engine-azure-file
pull-kubernetes-e2e-azure-file-windows -> pull-kubernetes-e2e-aks-engine-azure-file-windows-dockershim

feiskyer · 2021-04-06T02:35:26Z

@CecileRobertMichon could you do a rebase and retrigger tests again?

CecileRobertMichon · 2021-04-06T02:57:32Z

/test pull-kubernetes-e2e-aks-engine-conformance
/test pull-kubernetes-e2e-aks-engine-azure-windows
/test pull-kubernetes-e2e-aks-engine-windows-containerd

feiskyer

/lgtm
ping @andrewsykim @cheftako for approval

k8s-ci-robot · 2021-04-06T03:37:14Z

@CecileRobertMichon: The following tests failed, say /retest to rerun all failed tests:

Test name	Commit	Details	Rerun command
pull-kubernetes-e2e-aks-engine-azure	b37379da0987472258f2d65dfdd90402228151d9	link	`/test pull-kubernetes-e2e-aks-engine-azure`
pull-kubernetes-e2e-azure-file	b37379da0987472258f2d65dfdd90402228151d9	link	`/test pull-kubernetes-e2e-azure-file`
pull-kubernetes-e2e-azure-file-windows	b37379da0987472258f2d65dfdd90402228151d9	link	`/test pull-kubernetes-e2e-azure-file-windows`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

CecileRobertMichon · 2021-04-07T16:43:29Z

/retest

andrewsykim

/approve

Thanks @CecileRobertMichon

k8s-ci-robot · 2021-04-07T17:47:16Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewsykim, CecileRobertMichon, nilo19

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~staging/src/k8s.io/legacy-cloud-providers/OWNERS~~ [andrewsykim]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

andrewsykim · 2021-04-07T17:59:18Z

staging/src/k8s.io/legacy-cloud-providers/azure/azure_vmss_cache.go

-	return availabilitySetNodes.Has(nodeName), nil
+	cachedNodes := cached.(availabilitySetEntry).nodeNames
+	// if the node is not in the cache, assume the node has joined after the last cache refresh and attempt to refresh the cache.
+	if !cachedNodes.Has(nodeName) {


Unit tests for this specific case would be useful I think

andrewsykim · 2021-04-07T18:16:13Z

staging/src/k8s.io/legacy-cloud-providers/azure/azure_vmss_cache.go

+	// if the node is not in the cache, assume the node has joined after the last cache refresh and attempt to refresh the cache.
+	if !cachedNodes.Has(nodeName) {
+		klog.V(2).Infof("Node %s has joined the cluster since the last VM cache refresh, refreshing the cache", nodeName)
+		cached, err = ss.availabilitySetNodesCache.Get(availabilitySetNodesKey, azcache.CacheReadTypeForceRefresh)


Would this mean we refresh the whole node cache for every new VM in the cluster? Is that expected?

CecileRobertMichon · Apr 14, 2021

tl;dr: yes. It's a trade-off: we can't get both accuracy and performance, but this minimizes the damage by adding a node cache.

See discussion in kubernetes-sigs/cloud-provider-azure#537 (comment)
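The trade-off discussed above can be sketched as follows. This is an illustration under stated assumptions, not the merged implementation: `snapshot`, `nodeCache`, `load`, and `managedByAvailabilitySet` are hypothetical names, `nodeNames` mirrors the field visible in the diff, and `vmNames` is an assumed name for the set of availability set VMs. The idea is that a full refresh happens only when a node is unknown to the cached node set, so repeated lookups for an already known node do not trigger further refreshes.

```go
package main

// snapshot is what one (hypothetical) refresh returns: every node known
// to the cluster, and the subset of VMs that belong to availability sets.
type snapshot struct {
	nodeNames map[string]bool // all cluster nodes seen at refresh time
	vmNames   map[string]bool // availability set VMs only (assumed name)
}

// nodeCache caches one snapshot and counts how often it reloads.
type nodeCache struct {
	load      func() snapshot
	cached    snapshot
	refreshes int
}

// refresh reloads the snapshot and records that a reload happened.
func (c *nodeCache) refresh() {
	c.cached = c.load()
	c.refreshes++
}

// managedByAvailabilitySet refreshes only when the node is unknown to the
// cache, then answers from the availability set VM names.
func (c *nodeCache) managedByAvailabilitySet(name string) bool {
	if !c.cached.nodeNames[name] {
		c.refresh()
	}
	return c.cached.vmNames[name]
}
```

In this sketch each newly joined node costs one full refresh, which matches the "yes" in the answer above, while lookups for nodes already present in `nodeNames` are served from the cache.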

…35-upstream-release-1.21 Cherry pick of #100110: Cherry pick #537 from cloud provider azure: Refresh VM cache when node is not found and #102935: fix: cleanup outdated routes

k8s-ci-robot assigned feiskyer Mar 11, 2021

k8s-ci-robot requested review from andyzhangx and nilo19 March 11, 2021 03:22

k8s-ci-robot assigned nilo19 Mar 11, 2021

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 11, 2021

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 11, 2021

k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 11, 2021

fix: Refresh VM cache when node is not found

8850c8c

CecileRobertMichon force-pushed the azure-vm-cache branch from b37379d to 8850c8c Compare April 6, 2021 02:57

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 6, 2021

feiskyer reviewed Apr 6, 2021

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 6, 2021

andrewsykim reviewed Apr 7, 2021

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 7, 2021

andrewsykim reviewed Apr 7, 2021

k8s-ci-robot merged commit 8300553 into kubernetes:master Apr 9, 2021

k8s-ci-robot added this to the v1.22 milestone Apr 9, 2021

CecileRobertMichon mentioned this pull request Apr 14, 2021

Add load balancer test to conformance test suite kubernetes-sigs/cluster-api-provider-azure#1171

Closed

3 tasks

CecileRobertMichon mentioned this pull request Jun 10, 2021

When running a workload with a single control plane node the load balancers take 15 mins to provision kubernetes-sigs/cluster-api-provider-azure#857

Closed

nilo19 mentioned this pull request Jul 12, 2021

Cherry pick of #100110: Cherry pick #537 from cloud provider azure: Refresh VM cache when node is not found and #102935: fix: cleanup outdated routes #102983

Merged
