
Azure LB Availability Set assumptions too restrictive #97375

Closed
wking opened this issue Dec 18, 2020 · 12 comments · Fixed by kubernetes-sigs/cloud-provider-azure#443 or #97635

wking (Contributor) commented Dec 18, 2020

What happened: #96111 introduced a new dependency on Availability Sets that is too restrictive for some existing installations: https://github.com/kubernetes/kubernetes/pull/96111/files#diff-0414c3aba906b2c0cdb2f09da32bd45c6bf1df71cbb2fc55950743c99a4a5fe4R1071

What you expected to happen: The Azure cloud provider to successfully manage LBs, as it had been doing before. Instead, it died with:

azure_loadbalancer.go:193] reconcileLoadBalancer(openshift-ingress/router-default) failed: cannot get the availability set ID from the virtual machine with node name ci-op-bpk9gm56-2dc90-lw6kr-master-1
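
For anyone hitting this cold, the failure mode is easy to see in isolation. Here's a minimal Go sketch of the restrictive pattern this issue describes (the type and function names are illustrative, not the actual cloud-provider code): a standalone VM with no availability set makes the lookup hard-fail instead of falling back.

```go
package main

import (
	"errors"
	"fmt"
)

// virtualMachine is a simplified stand-in for the Azure compute SDK's VM
// type; the real provider reads this from the VM client.
type virtualMachine struct {
	nodeName          string
	availabilitySetID *string // nil for standalone VMs, as in the infra above
}

// availabilitySetIDOf mirrors the restrictive pattern this issue describes:
// it hard-fails whenever the VM is not in an availability set, with no
// fallback for standalone VMs.
func availabilitySetIDOf(vm virtualMachine) (string, error) {
	if vm.availabilitySetID == nil {
		return "", errors.New("cannot get the availability set ID from the virtual machine with node name " + vm.nodeName)
	}
	return *vm.availabilitySetID, nil
}

func main() {
	standalone := virtualMachine{nodeName: "ci-op-bpk9gm56-2dc90-lw6kr-master-1"}
	if _, err := availabilitySetIDOf(standalone); err != nil {
		fmt.Println(err) // matches the reconcileLoadBalancer failure above
	}
}
```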

How to reproduce it (as minimally and precisely as possible): Fails every time in OpenShift CI. Can you point us to documentation on the infrastructure the Azure cloud provider expects for LB management? https://kubernetes-sigs.github.io/cloud-provider-azure/topics/loadbalancer/#load-balancer-selection-modes doesn't say anything about NIC names, gives no specifics about what things should look like without Availability Sets, and doesn't say whether running without Availability Sets is unsupported (previous iterations of the cloud provider had no problem with our lack of Availability Sets).
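
For reference, the selection modes on that page are driven by the `service.beta.kubernetes.io/azure-load-balancer-mode` Service annotation. Here's a minimal sketch of a Service opting in (the annotation key comes from the linked doc; the Service shape and the `__auto__` value shown are illustrative). As I read the doc, the non-auto values name availability sets (agent pools), which is exactly why the no-Availability-Set case needs documenting.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// "__auto__" asks the provider to pick a load balancer automatically;
	// the documented alternative is a comma-separated list of availability
	// set (agent pool) names, which presumes availability sets exist.
	svc := corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name: "router-default",
			Annotations: map[string]string{
				"service.beta.kubernetes.io/azure-load-balancer-mode": "__auto__",
			},
		},
		Spec: corev1.ServiceSpec{
			Type:  corev1.ServiceTypeLoadBalancer,
			Ports: []corev1.ServicePort{{Port: 443}},
		},
	}
	fmt.Printf("%s -> %v\n", svc.Name, svc.Annotations)
}
```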

Documenting cloud-provider assumptions about host hardware would have also helped avoid #97352; CC @nilo19

/sig cloud-provider
/area provider/azure

@wking added the kind/bug label Dec 18, 2020
@k8s-ci-robot added the sig/cloud-provider, area/provider/azure, and needs-triage labels Dec 18, 2020
nilo19 (Member) commented Dec 18, 2020

/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label Dec 18, 2020
nilo19 (Member) commented Dec 18, 2020

/assign

@wking changed the title from "Azure LB Availabillity Set assumptions too restrictive" to "Azure LB Availability Set assumptions too restrictive" Dec 18, 2020
wking (Contributor, Author) commented Dec 18, 2020

Notes on OpenShift's Azure infrastructure in ARM templates are here, in case that helps clarify the infra we use today. Those are OpenShift 4.6 docs, 4.6 is based on Kubernetes 1.19, and things worked fine there. The trouble started when we took the same Azure-infra approach and tried to use it with Kubernetes 1.20, after openshift#471.

feiskyer (Member) commented

@wking could you share an example resource ID (please replace your subscription ID and resource names with fake ones)?

wking (Contributor, Author) commented Dec 18, 2020

Example CI run here from the NIC-renaming openshift/installer#4490. Installer logs here include:

time="2020-12-17T21:57:59Z" level=debug msg="module.master.azurerm_network_interface.master[1]: Creation complete after 34s [id=/subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-op-67x6pb5c-2dc90-n68qc-rg/providers/Microsoft.Network/networkInterfaces/ci-op-67x6pb5c-2dc90-n68qc-master-nic-1]"

I'm hoping the subscription ID is not sensitive, because we've been dumping them in public logs for a long time now 🤞. Those logs should also include IDs for any other resources you're interested in.
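
To make the ID structure concrete, here's a small illustrative Go helper (not the cloud provider's actual parser) that splits an ARM resource ID like the one above into key/value segments, so it's easier to see which pieces (resource group, NIC name) the provider might be keying on:

```go
package main

import (
	"fmt"
	"strings"
)

// parseResourceID splits an ARM resource ID into alternating key/value
// segments ("subscriptions/<id>/resourceGroups/<rg>/...").
func parseResourceID(id string) map[string]string {
	parts := strings.Split(strings.TrimPrefix(id, "/"), "/")
	out := map[string]string{}
	for i := 0; i+1 < len(parts); i += 2 {
		out[parts[i]] = parts[i+1]
	}
	return out
}

func main() {
	id := "/subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-op-67x6pb5c-2dc90-n68qc-rg/providers/Microsoft.Network/networkInterfaces/ci-op-67x6pb5c-2dc90-n68qc-master-nic-1"
	fields := parseResourceID(id)
	fmt.Println(fields["resourceGroups"])    // ci-op-67x6pb5c-2dc90-n68qc-rg
	fmt.Println(fields["networkInterfaces"]) // ci-op-67x6pb5c-2dc90-n68qc-master-nic-1
}
```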

nilo19 (Member) commented Dec 19, 2020

Looks like this is about the master node. Are you using standard LB or a basic one? For standard LB, the master node would be excluded from the LB by default.

wking (Contributor, Author) commented Dec 22, 2020

Are you using standard LB or a basic one?

Standard (and here).

For standard LB, the master node would be excluded from the LB by default.

Is this documented somewhere? We've been using standard LBs for years, with no problems until the bump to the Kube 1.20 Azure cloud-provider.

alvaroaleman (Member) commented

For standard LB, the master node would be excluded from the LB by default.

Are you referring to the built-in logic of Kube to never target master nodes with a service type LoadBalancer?

wking (Contributor, Author) commented Dec 22, 2020

Are you referring to the built-in logic of Kube to never target master nodes with a service type LoadBalancer?

That used to be a problem (#65618), but my understanding is that it has since been fixed.
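
For the record, my understanding of the current mechanism, as a simplified sketch (not the actual service-controller code): exclusion is now driven by an explicit node label rather than by the master role label alone.

```go
package main

import "fmt"

// includeNodeInLB is a simplified sketch of the post-#65618 behavior as I
// understand it: nodes are excluded from external load balancers only when
// they carry the explicit exclusion label, not merely because they have the
// master role label.
func includeNodeInLB(labels map[string]string) bool {
	_, excluded := labels["node.kubernetes.io/exclude-from-external-load-balancers"]
	return !excluded
}

func main() {
	master := map[string]string{"node-role.kubernetes.io/master": ""}
	fmt.Println(includeNodeInLB(master)) // true: the role label alone no longer excludes
}
```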

wking (Contributor, Author) commented Dec 22, 2020

Can we re-open this? kubernetes-sigs/cloud-provider-azure#443 was not sufficient to make the cloud provider happy on our infra, which, again, contains no Availability Sets. There are some more notes about our infrastructure here (with discussion in rhbz#1794839), in case that helps.

wking (Contributor, Author) commented Dec 22, 2020

@staebler, who has a better grasp on this than me, also just opened #97467. Sounds like fixing that broader issue might obsolete this ticket, because "the cloud-provider expects Availability Sets which OpenShift infra doesn't have in order to figure out which nodes to remove" doesn't matter if the cloud-provider isn't removing any OpenShift nodes in the first place.

CecileRobertMichon (Member) commented

@feiskyer @nilo19 can you please confirm that the standalone VM scenario is supported? It's a critical use case for CAPZ (our control plane VMs in zone-enabled regions are not placed in availability sets).
