Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add VMSize validation to reject VM Sizes with < 2 CPU #765

Closed
CecileRobertMichon opened this issue Jul 8, 2020 · 13 comments · Fixed by #884
Closed

Add VMSize validation to reject VM Sizes with < 2 CPU #765

CecileRobertMichon opened this issue Jul 8, 2020 · 13 comments · Fixed by #884
Assignees

Comments

@CecileRobertMichon
Copy link
Contributor

As documented at https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/#before-you-begin, kubeadm requires 2 CPUs. To improve UX, we should fail fast when the user specifies an Azure VM size that has less than 2 CPUs, that is:

  • Standard_A1_v2
  • Standard_B1ls
  • Standard_B1s
  • Standard_B1ms
  • Standard_DC1s_v2
  • Standard_D1_v2
  • Standard_DS1_v2
  • Standard_F1s
  • Standard_F1

(source: https://docs.microsoft.com/en-us/azure/virtual-machines/linux/compute-benchmark-scores)

This validation should be added to azuremachine_validation.go

/help
/good-first-issue

@k8s-ci-robot k8s-ci-robot added good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Jul 8, 2020
@CecileRobertMichon
Copy link
Contributor Author

@rsmitty does the same apply to Talos?

@rsmitty
Copy link
Contributor

rsmitty commented Jul 8, 2020

It’s not strictly required but we do mention 2 CPUs as a minimum for control plane nodes in our docs, so I’m cool with this.

@alexeldeib
Copy link
Contributor

It's more accurate to call the SKU API for this to avoid generating a static list, the API exposes vCPU as a capability. We could cache the known sizes at start up to avoid extra http calls inside the webhook? Maybe with some background refresh or refresh-on-miss logic?

@mboersma
Copy link
Contributor

Agree that we should make an API call for the SKU rather than hard-code a list. We do already have a resourceskus client that is used to check the acceleratedNetworking capability pre-flight, and it should not be too hard to extend that to check the vCPU count.

@alexeldeib
Copy link
Contributor

alexeldeib commented Jul 10, 2020

thoughts on breaking out skus client to a standalone cache? it might make sense not to issue calls for every VM, every time. We can query once for a subscription and cache the result. If we see we're restricted we can use that as a cache "miss" to refresh (to avoid getting hit by shifting point in time restrictions).

@CecileRobertMichon CecileRobertMichon added this to the next milestone Jul 10, 2020
@mboersma
Copy link
Contributor

I think a cache would be an good optimization for something like this. The results for such API calls are basically static for a given subscription. I intended to memoize the resourceskusclient.HasAcceleratedNetworking call in #645 but didn't get there.

@CecileRobertMichon
Copy link
Contributor Author

I'm going to remove good-first-issue, given that requirements aren't well defined at the moment and that adding a cache/making calls to the sku client would require a little more work.

/remove-good-first-issue

My initial thinking was that a static list was acceptable for something like this since it would only be used to improve UX when that SKU is going to fail down the line anyways and the deny list would basically be a superset of each individual subscription's list of VM sizes with 1 CPU. So it be a good incremental value add and a low hanging fruit (very easy to implement).

I agree that dynamic calls is the best option for accuracy but I'm concerned that in this situation it might be overkill since the alternative of not catching the wrong SKU does result in a failure, just not a fast one. Having to fetch the SKUs from Azure every create would definitely add latency + extra API calls so I would prefer staying away from that. If there is a simple-ish way to implement a cache so it doesn't have to be fetched every reconcile and that the added latency is not significant, I coud get behind that.

Another idea: we could look into just doing regex since 1 CPU SKUs are named in a pretty predictable way (something like Standard_[A-Z]*1[^1-9]*.* should work) and that can be easily unit tested + would avoid needing the make calls to the CPU.

@k8s-ci-robot k8s-ci-robot removed the good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. label Jul 10, 2020
@alexeldeib
Copy link
Contributor

I did some work recently with the SKU API for https://github.com/alexeldeib/throttled/blob/ace/dev/src/resource.rs and some other capacity work. I can take a shot at this after I tie up IP and ephemeral OS.

I'll leave it unassigned in case someone beats me to it :)

@alexeldeib
Copy link
Contributor

alexeldeib commented Jul 22, 2020

After #783 merges this will be really simple, but i'll probably tackle a few other pieces before circling back. If anyone wants to work on this feel free.

compute SKUs API exposes a capability called vCPUs (or CPUs for e.g. bare metal dedicated). It should be at least 2 for the user requested vm SKU. Look at ephemeral OS' use of HasCapability.

/good-first-issue

@k8s-ci-robot k8s-ci-robot added the good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. label Jul 22, 2020
@cpanato
Copy link
Member

cpanato commented Aug 6, 2020

can i try this one? @CecileRobertMichon @alexeldeib

@devigned
Copy link
Contributor

devigned commented Aug 6, 2020

/assign @cpanato

Go for it!

@cpanato
Copy link
Member

cpanato commented Aug 10, 2020

/remove-help

@k8s-ci-robot k8s-ci-robot removed help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. labels Aug 10, 2020
@cpanato
Copy link
Member

cpanato commented Aug 15, 2020

should we validate the memory as well?

@cpanato cpanato moved this from Backlog to In progress in Cluster API Azure Aug 16, 2020
Cluster API Azure automation moved this from In progress to Done (2020 - Q2) Aug 21, 2020
k8s-ci-robot added a commit that referenced this issue Aug 21, 2020
cloud/vm/vmss: validate if vCPUs and Memory matched the minimum required
@CecileRobertMichon CecileRobertMichon removed this from the next milestone May 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Cluster API Azure
  
Done (2020 - Q2)
7 participants