
Azure LB Availability Set assumptions too restrictive #97375

Closed
wking opened this issue Dec 18, 2020 · 12 comments · Fixed by kubernetes-sigs/cloud-provider-azure#443 or #97635

wking (Contributor) commented Dec 18, 2020

What happened: #96111 introduced a new dependency on Availability Sets that is too restrictive for some existing installations: https://github.com/kubernetes/kubernetes/pull/96111/files#diff-0414c3aba906b2c0cdb2f09da32bd45c6bf1df71cbb2fc55950743c99a4a5fe4R1071

What you expected to happen: The Azure cloud provider to successfully manage LBs, as it had been doing before. Instead, it died with:

azure_loadbalancer.go:193] reconcileLoadBalancer(openshift-ingress/router-default) failed: cannot get the availability set ID from the virtual machine with node name ci-op-bpk9gm56-2dc90-lw6kr-master-1
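
For anyone hitting this cold, the failure mode is easy to see in isolation. Here's a minimal Go sketch of the restrictive pattern this issue describes (the type and function names are illustrative, not the actual cloud-provider code): a standalone VM with no availability set makes the lookup hard-fail instead of falling back.

```go
package main

import (
	"errors"
	"fmt"
)

// virtualMachine is a simplified stand-in for the Azure compute SDK's VM
// type; the real provider reads this from the VM client.
type virtualMachine struct {
	nodeName          string
	availabilitySetID *string // nil for standalone VMs, as in the infra above
}

// availabilitySetIDOf mirrors the restrictive pattern this issue describes:
// it hard-fails whenever the VM is not in an availability set, with no
// fallback for standalone VMs.
func availabilitySetIDOf(vm virtualMachine) (string, error) {
	if vm.availabilitySetID == nil {
		return "", errors.New("cannot get the availability set ID from the virtual machine with node name " + vm.nodeName)
	}
	return *vm.availabilitySetID, nil
}

func main() {
	standalone := virtualMachine{nodeName: "ci-op-bpk9gm56-2dc90-lw6kr-master-1"}
	if _, err := availabilitySetIDOf(standalone); err != nil {
		fmt.Println(err) // matches the reconcileLoadBalancer failure above
	}
}
```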

How to reproduce it (as minimally and precisely as possible): Fails every time in OpenShift CI. Can you point us to documentation on the infrastructure the Azure cloud provider expects for LB management? https://kubernetes-sigs.github.io/cloud-provider-azure/topics/loadbalancer/#load-balancer-selection-modes doesn't say anything about NIC names, gives no specifics about what things should look like without Availability Sets, and doesn't say whether running without Availability Sets is unsupported (previous iterations of the cloud provider had no problem with our lack of Availability Sets).
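
For reference, the selection modes on that page are driven by the `service.beta.kubernetes.io/azure-load-balancer-mode` Service annotation. Here's a minimal sketch of a Service opting in (the annotation key comes from the linked doc; the Service shape and the `__auto__` value shown are illustrative). As I read the doc, the non-auto values name availability sets (agent pools), which is exactly why the no-Availability-Set case needs documenting.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// "__auto__" asks the provider to pick a load balancer automatically;
	// the documented alternative is a comma-separated list of availability
	// set (agent pool) names, which presumes availability sets exist.
	svc := corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name: "router-default",
			Annotations: map[string]string{
				"service.beta.kubernetes.io/azure-load-balancer-mode": "__auto__",
			},
		},
		Spec: corev1.ServiceSpec{
			Type:  corev1.ServiceTypeLoadBalancer,
			Ports: []corev1.ServicePort{{Port: 443}},
		},
	}
	fmt.Printf("%s -> %v\n", svc.Name, svc.Annotations)
}
```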

Documenting cloud-provider assumptions about host hardware would have also helped avoid #97352; CC @nilo19

/sig cloud-provider
/area provider/azure

@wking added the kind/bug label Dec 18, 2020
@k8s-ci-robot added the sig/cloud-provider, area/provider/azure, and needs-triage labels Dec 18, 2020
nilo19 (Member) commented Dec 18, 2020

/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label Dec 18, 2020
nilo19 (Member) commented Dec 18, 2020

/assign

@wking changed the title from "Azure LB Availabillity Set assumptions too restrictive" to "Azure LB Availability Set assumptions too restrictive" Dec 18, 2020
wking (Contributor, Author) commented Dec 18, 2020

Notes on OpenShift's Azure infrastructure in ARM templates are here, in case that helps clarify the infra we use today. Those are OpenShift 4.6 docs, 4.6 is based on Kubernetes 1.19, and things worked fine there. The trouble started when we took the same Azure-infra approach and tried to use it with Kubernetes 1.20, after openshift#471.

feiskyer (Member) commented

@wking could you share an example resource ID (please replace your subscription ID and resource names with fake ones)?

wking (Contributor, Author) commented Dec 18, 2020

Example CI run here from the NIC-renaming openshift/installer#4490. Installer logs here include:

time="2020-12-17T21:57:59Z" level=debug msg="module.master.azurerm_network_interface.master[1]: Creation complete after 34s [id=/subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-op-67x6pb5c-2dc90-n68qc-rg/providers/Microsoft.Network/networkInterfaces/ci-op-67x6pb5c-2dc90-n68qc-master-nic-1]"

I'm hoping the subscription ID is not sensitive, because we've been dumping them in public logs for a long time now 🤞. Those logs should also include IDs for any other resources you're interested in.
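
To make the ID structure concrete, here's a small illustrative Go helper (not the cloud provider's actual parser) that splits an ARM resource ID like the one above into key/value segments, so it's easier to see which pieces (resource group, NIC name) the provider might be keying on:

```go
package main

import (
	"fmt"
	"strings"
)

// parseResourceID splits an ARM resource ID into alternating key/value
// segments ("subscriptions/<id>/resourceGroups/<rg>/...").
func parseResourceID(id string) map[string]string {
	parts := strings.Split(strings.TrimPrefix(id, "/"), "/")
	out := map[string]string{}
	for i := 0; i+1 < len(parts); i += 2 {
		out[parts[i]] = parts[i+1]
	}
	return out
}

func main() {
	id := "/subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-op-67x6pb5c-2dc90-n68qc-rg/providers/Microsoft.Network/networkInterfaces/ci-op-67x6pb5c-2dc90-n68qc-master-nic-1"
	fields := parseResourceID(id)
	fmt.Println(fields["resourceGroups"])    // ci-op-67x6pb5c-2dc90-n68qc-rg
	fmt.Println(fields["networkInterfaces"]) // ci-op-67x6pb5c-2dc90-n68qc-master-nic-1
}
```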

nilo19 (Member) commented Dec 19, 2020

Looks like this is about the master node. Are you using standard LB or a basic one? For standard LB, the master node would be excluded from the LB by default.

wking (Contributor, Author) commented Dec 22, 2020

Are you using standard LB or a basic one?

Standard (and here).

For standard LB, the master node would be excluded from the LB by default.

Is this documented somewhere? We've been using standard LBs for years, with no problems until the bump to the Kube 1.20 Azure cloud-provider.

alvaroaleman (Member) commented

For standard LB, the master node would be excluded from the LB by default.

Are you referring to the built-in logic of Kube to never target master nodes with a service type LoadBalancer?

wking (Contributor, Author) commented Dec 22, 2020

Are you referring to the built-in logic of Kube to never target master nodes with a service type LoadBalancer?

That used to be a problem (#65618), but my understanding is that it has since been fixed.
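
For the record, my understanding of the current mechanism, as a simplified sketch (not the actual service-controller code): exclusion is now driven by an explicit node label rather than by the master role label alone.

```go
package main

import "fmt"

// includeNodeInLB is a simplified sketch of the post-#65618 behavior as I
// understand it: nodes are excluded from external load balancers only when
// they carry the explicit exclusion label, not merely because they have the
// master role label.
func includeNodeInLB(labels map[string]string) bool {
	_, excluded := labels["node.kubernetes.io/exclude-from-external-load-balancers"]
	return !excluded
}

func main() {
	master := map[string]string{"node-role.kubernetes.io/master": ""}
	fmt.Println(includeNodeInLB(master)) // true: the role label alone no longer excludes
}
```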

wking (Contributor, Author) commented Dec 22, 2020

Can we re-open this? kubernetes-sigs/cloud-provider-azure#443 was not sufficient to make the cloud provider happy on our infra, which, again, contains no Availability Sets. There are some more notes about our infrastructure here (with discussion in rhbz#1794839), in case that helps.

wking (Contributor, Author) commented Dec 22, 2020

@staebler, who has a better grasp on this than me, also just opened #97467. Sounds like fixing that broader issue might obsolete this ticket, because "the cloud-provider expects Availability Sets which OpenShift infra doesn't have in order to figure out which nodes to remove" doesn't matter if the cloud-provider isn't removing any OpenShift nodes in the first place.

CecileRobertMichon (Member) commented

@feiskyer @nilo19 can you please confirm that the standalone VM scenario is supported? It's a critical use case for CAPZ (our control plane VMs in zone-enabled regions are not placed in availability sets).
