
If nodes are not in a ready state when the controller manager boots, the cache is stale, causing a 15 min delay in LB deployment #363

Closed
jsturtevant opened this issue Aug 3, 2020 · 10 comments · Fixed by #537
@jsturtevant
Contributor

What happened:
When adding functionality to the e2e tests for CAPZ in kubernetes-sigs/cluster-api-provider-azure#857, I found the e2e tests would fail if run against a single control plane node cluster. This would also occasionally happen when running multi-node control plane clusters. It was initially reported, with some of the same messages, in #338.

What happens is that the controller manager comes online and queries the power status of the Azure machines. Since the machines are not available yet, they are not added to the cache. When the request for the load balancer comes in, the cache is queried, it reports that the node does not exist as a VMAS node, and the VMSS code path is attempted instead, hence the following error message. When the node is found in the cache, the correct code path is taken.

E0802 23:30:06.928574       1 azure_vmss.go:1116] EnsureHostInPool(default/ingress-nginx-ilb): backendPoolID(/subscriptions/b9d9436a-0c07-4fe8-b779-2c1030bd7997/resourceGroups/capz-e2e-72fll1/providers/Microsoft.Network/loadBalancers/capz-e2e-72fll1-internal/backendAddressPools/capz-e2e-72fll1) - failed to ensure host in pool: "not a vmss instance"
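
To make the routing concrete, here's a minimal sketch of that decision (hypothetical, simplified names, not the actual cloud-provider code): everything hinges on whether the node name is present in the cached set of availability-set nodes.

```go
package main

import (
	"errors"
	"fmt"
)

var errNotVMSSInstance = errors.New("not a vmss instance")

// ensureHostInPool is a simplified stand-in for the real routing: if the
// node name is missing from the cached set of availability-set (VMAS)
// nodes, the node is assumed to be a VMSS instance and the scale-set code
// path is taken, which then rejects a standalone VM.
func ensureHostInPool(nodeName string, cachedVMASNodes map[string]struct{}) error {
	if _, ok := cachedVMASNodes[nodeName]; ok {
		return nil // correct path: treat the node as an availability-set VM
	}
	return errNotVMSSInstance // stale cache: node wrongly routed to the VMSS path
}

func main() {
	// The cache was populated before the worker node joined, so only the
	// control plane node is present.
	cache := map[string]struct{}{"capz-e2e-72fll1-control-plane-wjqp5": {}}
	fmt.Println(ensureHostInPool("capz-e2e-72fll1-md-0-q2st9", cache))
	// prints: not a vmss instance
}
```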

I was able to confirm the cache was missing the node by adding some custom logs to the controller-manager. The initial request populates the cache with only the control plane node, since the worker nodes don't exist yet:

I0802 23:27:42.217347       1 azure_vmss.go:309] Can not extract scale set name from providerID (azure:///subscriptions/b9d9436a-0c07-4fe8-b779-2c1030bd7997/resourceGroups/capz-e2e-72fll1/providers/Microsoft.Compute/virtualMachines/capz-e2e-72fll1-control-plane-wjqp5), assuming it is mananaged by availability set: not a vmss instance
I0802 23:27:42.217487       1 azure_vmss.go:192] JJS: powerstatus node capz-e2e-72fll1-control-plane-wjqp5
I0802 23:27:42.217638       1 azure_vmss_cache.go:238] JJS: enter cache calls for nodes
I0802 23:27:42.217725       1 azure_vmss_cache.go:240] JJS: listing nodes for rg: capz-e2e-72fll1
I0802 23:27:42.651485       1 azure_backoff.go:100] VirtualMachinesClient.List(capz-e2e-72fll1) success
I0802 23:27:42.651641       1 azure_vmss_cache.go:246] JJS: nodes found: %!s(int=1)
I0802 23:27:42.651750       1 azure_vmss_cache.go:249] JJS: adding node to local cache {{%!s(*http.Response=<nil>)} %!s(*compute.Plan=<nil>) %!s(*compute.VirtualMachineProperties=&{0xc001fd3eb0 0xc002022960 <nil> 0xc001f13ea0 0xc0020280a0 <nil> <nil> <nil> <nil>   <nil> <nil> 0xc001fd3fd0 <nil> <nil> 0xc001fd3ea0}) %!s(*[]compute.VirtualMachineExtension=<nil>) %!s(*compute.VirtualMachineIdentity=<nil>) %!s(*[]string=&[1]) %!s(*string=0xc001fd2a60) %!s(*string=0xc001fd2a50) %!s(*string=0xc001fd2a70) %!s(*string=0xc001fd2a80) map[Name:%!s(*string=0xc001fd2a90) creationTimestamp:%!s(*string=0xc001fd2ab0) jobName:%!s(*string=0xc001fd2ad0) kubernetes.io_cluster_capz-e2e-72fll1-control-plane-wjqp5:%!s(*string=0xc001fd2af0) sigs.k8s.io_cluster-api-provider-azure_cluster_capz-e2e-72fll1:%!s(*string=0xc001fd2b10) sigs.k8s.io_cluster-api-provider-azure_role:%!s(*string=0xc001fd2b30)]}
I0802 23:27:42.651962       1 azure_vmss_cache.go:278] JJS: VMAS Nodes: map[capz-e2e-72fll1-control-plane-wjqp5:{}]
I0802 23:27:42.652194       1 azure_vmss_cache.go:280] JJS: asking for node capz-e2e-72fll1-control-plane-wjqp5
I0802 23:27:42.698224       1 azure_instances.go:232] InstanceShutdownByProviderID gets power status "running" for node "capz-e2e-72fll1-control-plane-wjqp5"

You can see the power status request for the new worker node here; the cached VMAS set still contains only the control plane node:

I0802 23:28:32.709471       1 azure_vmss.go:192] JJS: powerstatus node capz-e2e-72fll1-md-0-q2st9
I0802 23:28:32.709626       1 azure_vmss_cache.go:278] JJS: VMAS Nodes: map[capz-e2e-72fll1-control-plane-wjqp5:{}]
I0802 23:28:32.709652       1 azure_vmss_cache.go:280] JJS: asking for node capz-e2e-72fll1-md-0-q2st9
I0802 23:28:32.709836       1 azure_vmss.go:309] Can not extract scale set name from providerID (azure:///subscriptions/b9d9436a-0c07-4fe8-b779-2c1030bd7997/resourceGroups/capz-e2e-72fll1/providers/Microsoft.Compute/virtualMachines/capz-e2e-72fll1-md-0-q2st9), assuming it is mananaged by availability set: not a vmss instance

And the EnsureHostsInPool request still doesn't find the worker node in the VMAS cache:

I0802 23:30:06.928474       1 azure_vmss.go:1077] JJS: EnsureHostsInPool node capz-e2e-72fll1-md-0-q2st9
I0802 23:30:06.928488       1 azure_vmss_cache.go:278] JJS: VMAS Nodes: map[capz-e2e-72fll1-control-plane-wjqp5:{}]
I0802 23:30:06.928518       1 azure_vmss_cache.go:280] JJS: asking for node capz-e2e-72fll1-md-0-q2st9
I0802 23:30:06.928529       1 azure_vmss.go:1097] JJS: EnsureHostsInPool node false managedby VMAS
I0802 23:30:06.928542       1 azure_vmss.go:1113] JJS: EnsureHostsInPool entered EnsureHostInPool started%!(EXTRA bool=false)
I0802 23:30:06.928549       1 azure_vmss.go:832] ensuring node "capz-e2e-72fll1-md-0-q2st9" of scaleset "" in LB backendpool "/subscriptions/b9d9436a-0c07-4fe8-b779-2c1030bd7997/resourceGroups/capz-e2e-72fll1/providers/Microsoft.Network/loadBalancers/capz-e2e-72fll1-internal/backendAddressPools/capz-e2e-72fll1"
E0802 23:30:06.928574       1 azure_vmss.go:1116] EnsureHostInPool(default/ingress-nginx-ilb): backendPoolID(/subscriptions/b9d9436a-0c07-4fe8-b779-2c1030bd7997/resourceGroups/capz-e2e-72fll1/providers/Microsoft.Network/loadBalancers/capz-e2e-72fll1-internal/backendAddressPools/capz-e2e-72fll1) - failed to ensure host in pool: "not a vmss instance"

What you expected to happen:

How to reproduce it:
Follow the instructions in kubernetes-sigs/cluster-api-provider-azure#857, or create a single-node control plane cluster and then, after the control plane node is up, add a worker node.

Anything else we need to know?:
In multi-node control plane clusters, leader election occurs in the controller-manager and causes the cache to reload. This is why it only happens occasionally in that scenario.

I believe this could also happen if a node is added at a later time and is expected to be in the backend pool but isn't, so it could be important to fix.

Environment:

  • Kubernetes version (use kubectl version): 1.18.6
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools: CAPZ
  • Others:
@alexeldeib

alexeldeib commented Aug 3, 2020

Possibly related to cache read mode? https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/legacy-cloud-providers/azure/azure_vmss.go#L1196 possibly should be CacheReadTypeUnsafe or CacheReadTypeForceRefresh? Still digging a bit.

VMSS does seem to force refresh if the instances aren't found: https://github.com/kubernetes/kubernetes/blob/ec560b9737537be8c688776461bc700e8ddedb9d/staging/src/k8s.io/legacy-cloud-providers/azure/azure_vmss.go#L276-L295

@aramase
Member

aramase commented Aug 3, 2020

Possibly related to cache read mode? https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/legacy-cloud-providers/azure/azure_vmss.go#L1196 possibly should be CacheReadTypeUnsafe or CacheReadTypeForceRefresh? Still digging a bit.

  • CacheReadTypeDefault reads from the cache if the TTL has not expired. In case of a cache miss, it'll fetch the contents and populate the cache. For VMSS, if the VM instance is not found in the cache, the cache is force refreshed, but for AS it seems to be slightly different. We should probably align the logic so it's the same for a VMSS VM instance or AS (see the sketch after this list).
  • CacheReadTypeUnsafe means that even if the cache entry has expired, it's still safe to read the contents because they aren't expected to change.
  • CacheReadTypeForceRefresh force-refreshes the cache even if the cached entry is still valid.
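
To illustrate those semantics and the VMSS-side fallback (a toy sketch with hypothetical types and names, not the actual azcache package):

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical stand-ins for the real cache read types described above.
type cacheReadType int

const (
	cacheReadTypeDefault      cacheReadType = iota // honor the TTL
	cacheReadTypeUnsafe                            // serve even expired entries
	cacheReadTypeForceRefresh                      // always refetch
)

// timedCache is a toy cache keyed by node name.
type timedCache struct {
	data      map[string]string
	fetchedAt time.Time
	ttl       time.Duration
	fetch     func() map[string]string
}

func (c *timedCache) get(key string, crt cacheReadType) (string, bool) {
	expired := time.Since(c.fetchedAt) > c.ttl
	if crt == cacheReadTypeForceRefresh || (crt == cacheReadTypeDefault && expired) {
		c.data, c.fetchedAt = c.fetch(), time.Now()
	}
	v, ok := c.data[key]
	return v, ok
}

// getVMSSVM mirrors the VMSS-side fallback: on a miss with the default read
// type, retry once with a forced refresh before giving up. The
// availability-set lookup has no equivalent retry, which is why a stale set
// of VMAS nodes sticks around until the TTL expires.
func getVMSSVM(c *timedCache, nodeName string) (string, bool) {
	if v, ok := c.get(nodeName, cacheReadTypeDefault); ok {
		return v, true
	}
	return c.get(nodeName, cacheReadTypeForceRefresh)
}

func main() {
	listCalls := 0
	c := &timedCache{ttl: 10 * time.Minute, fetch: func() map[string]string {
		listCalls++
		return map[string]string{"node-1": "vm-1"} // node-2 hasn't been created yet
	}}
	c.get("node-1", cacheReadTypeDefault) // initial populate
	_, found := getVMSSVM(c, "node-2")
	fmt.Println("found:", found, "list calls:", listCalls) // found: false, list calls: 2
}
```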

@jsturtevant
Contributor Author

jsturtevant commented Aug 3, 2020

For VMSS, if the VM instance is not found in the cache, the cache is force refreshed, but for AS it seems to be slightly different.

For AS, I believe it's because the nodes are stored in the cache as a set. The set itself is the cached entry, so a request for a single instance doesn't count as a cache miss and doesn't trigger a refresh:

https://github.com/kubernetes/kubernetes/blob/ec560b9737537be8c688776461bc700e8ddedb9d/staging/src/k8s.io/legacy-cloud-providers/azure/azure_vmss_cache.go#L307
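
In other words, the cached entry is the whole set of availability-set node names, so a lookup for a node that joined after the set was built just returns "not present" rather than registering as a cache miss, and nothing triggers a refresh. A hypothetical sketch (not the actual code):

```go
package main

import "fmt"

// The cache entry is the entire set of VMAS node names, so the entry itself
// is always a "hit" even when the node being asked about is missing from it.
// A per-node lookup therefore never produces a cache miss that would force
// a refresh.
func isNodeManagedByAvailabilitySet(nodeName string, cachedVMASNodes map[string]struct{}) bool {
	_, ok := cachedVMASNodes[nodeName]
	return ok // false for any node that joined after the set was cached
}

func main() {
	cached := map[string]struct{}{"capz-e2e-72fll1-control-plane-wjqp5": {}}
	// The worker joined later, so it is silently misclassified as VMSS.
	fmt.Println(isNodeManagedByAvailabilitySet("capz-e2e-72fll1-md-0-q2st9", cached))
}
```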

@CecileRobertMichon
Contributor

@aramase @jsturtevant @alexeldeib do we have an idea of what is needed to fix this?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 20, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 19, 2021
@alexeldeib

/remove-lifecycle rotten

pretty sure this still exists...need to refresh my memory for a fix

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 19, 2021
@CecileRobertMichon
Contributor

/assign

looking into this

@CecileRobertMichon
Contributor

CecileRobertMichon commented Mar 4, 2021

So this fixes it: fcb1df0

The issue with that solution is that it will cause an unnecessary force refresh of the VM cache on every run of isNodeManagedByAvailabilitySet when the node is indeed VMSS (and so it really should return false).

The previous reasoning for calling it with CacheReadTypeUnsafe was probably that only control plane nodes are expected to be VMAS in a "vmss" cluster (e.g. an AKS Engine vmss cluster with vmas master nodes); those wouldn't change, since scaling control plane nodes up/down isn't supported, and new control plane nodes joining as part of an upgrade should trigger a cache refresh as leader election occurs.

Now in CAPZ we have both VM and VMSS worker nodes joining and leaving the cluster with scale ups/downs, which means that when we check whether a node is a VM or a VMSS instance, we need the latest information. A better solution might be to trigger a cache refresh only when the list of nodes in the cluster has changed since the last time the cache was refreshed (need to look into how feasible that is, though). @aramase @alexeldeib @jsturtevant wdyt?

@CecileRobertMichon
Contributor

Here's an attempt at refreshing the VM cache only when the node is new to the cluster: master...CecileRobertMichon:fix-vm-cache

This should eliminate most of the extra refreshes when the node is a known VMSS node, while still refreshing when the node is new to the cluster and we don't yet know whether it's a VM or a VMSS instance. I haven't tested that it works yet, but that's the general idea.
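
Roughly, the idea looks something like this (a hypothetical sketch with made-up names, not the code in that branch): only pay for a forced refresh when the node wasn't known at the time the caches were built.

```go
package main

import "fmt"

type cacheReadType int

const (
	cacheReadTypeUnsafe       cacheReadType = iota // serve the cached set, even if stale
	cacheReadTypeForceRefresh                      // re-list the VMs before answering
)

// isNodeManagedByAvailabilitySet sketches the proposed fix: keep the cheap
// (unsafe) read for nodes the caches have already seen, and force a refresh
// of the VM cache only when the node is new to the cluster, so a newly
// joined standalone VM isn't misclassified as a VMSS instance.
func isNodeManagedByAvailabilitySet(nodeName string, knownNodes map[string]struct{},
	getVMASNodes func(cacheReadType) map[string]struct{}) bool {

	readType := cacheReadTypeUnsafe
	if _, seen := knownNodes[nodeName]; !seen {
		readType = cacheReadTypeForceRefresh
	}
	_, isVMAS := getVMASNodes(readType)[nodeName]
	return isVMAS
}

func main() {
	// Nodes the caches knew about when they were last built.
	known := map[string]struct{}{"capz-e2e-72fll1-control-plane-wjqp5": {}}
	stale := map[string]struct{}{"capz-e2e-72fll1-control-plane-wjqp5": {}}
	fresh := map[string]struct{}{
		"capz-e2e-72fll1-control-plane-wjqp5": {},
		"capz-e2e-72fll1-md-0-q2st9":          {},
	}
	getVMASNodes := func(crt cacheReadType) map[string]struct{} {
		if crt == cacheReadTypeForceRefresh {
			return fresh // re-listing picks up the new standalone VM
		}
		return stale
	}
	// The new worker is unknown, so a forced refresh runs and it is classified correctly.
	fmt.Println(isNodeManagedByAvailabilitySet("capz-e2e-72fll1-md-0-q2st9", known, getVMASNodes))
}
```

A known VMSS node would already be in knownNodes and keep hitting the cheap path, which is where the savings over always forcing a refresh come from.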
