
✨ Allow AZs to be Omitted at Runtime #1769

Merged

Conversation

spjmurray (Contributor)

CAPI appears to make implicit control plane scheduling decisions, on an "LRU" basis, from the set of AZs that CAPO reports as available. It also assumes each AZ is infinitely sized, so problems begin when the "next" AZ cannot accommodate the VM. We could explicitly specify a single AZ that aggregates all machines, but that is just another mechanism that disables CAPI scheduling altogether and lets Nova do what it does, along with a soft anti-affinity rule. However, switching from CAPI scheduling to Nova scheduling is currently impossible because the field is immutable, so this PR allows it. Testing shows that existing scheduled clusters undergo no topology changes, which will be due to the KCPM not taking action, but you can force the changes with a rolling upgrade of some variety. Crucially, if a cluster with CAPI scheduling gets stuck, we can switch it to Nova scheduling and it should pick up the new specification and get past the hurdle.

What this PR does / why we need it:

As above
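To illustrate the switch described above, a cluster stuck on CAPI AZ scheduling could be patched to fall back to Nova scheduling. This is a hypothetical sketch: the field name `controlPlaneOmitAvailabilityZone` and the API version are assumptions inferred from the PR title, not confirmed by this thread; check the `OpenStackClusterSpec` for your CAPO release.

```yaml
# Hypothetical sketch: switch an existing cluster from CAPI AZ scheduling
# to Nova scheduling by omitting the control plane AZ at runtime.
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha7
kind: OpenStackCluster
metadata:
  name: example-cluster
spec:
  # Previously immutable; with this PR it can be flipped on a live cluster,
  # letting Nova (plus a soft anti-affinity rule) place control plane VMs.
  controlPlaneOmitAvailabilityZone: true
```

Note that, per the testing described above, flipping the field alone should not change the topology of an already-scheduled cluster until something (e.g. a rolling upgrade) causes the KCPM to act.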

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

Special notes for your reviewer:

  1. Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • if necessary:
    • includes documentation
    • adds unit tests

/hold

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 4, 2023

netlify bot commented Dec 4, 2023

Deploy Preview for kubernetes-sigs-cluster-api-openstack ready!

🔨 Latest commit: 04ee2bb
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-cluster-api-openstack/deploys/65a6959fd1a23600089a59ab
😎 Deploy Preview: https://deploy-preview-1769--kubernetes-sigs-cluster-api-openstack.netlify.app

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Dec 4, 2023
@k8s-ci-robot (Contributor)

Hi @spjmurray. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Dec 4, 2023
@lentzi90 (Contributor) left a comment


/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 5, 2023
@lentzi90 (Contributor) left a comment


This seems reasonable to me
/lgtm

/cc @mdbooth

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 5, 2023
@lentzi90 (Contributor)

lentzi90 commented Dec 5, 2023

/test pull-cluster-api-provider-openstack-e2e-test

@spjmurray (Contributor, Author)

Yes Sir Mr Robot 🖖🏻

@spjmurray (Contributor, Author)

So looking at the first set of test results, I do see one of the tests failing due to quota limits being exceeded, that may affect the other one that's happening in parallel, but the machine is "running" apparently, just no logs from it or anything. Need to dig a bit deeper...

@lentzi90 (Contributor)

lentzi90 commented Dec 5, 2023

So looking at the first set of test results, I do see one of the tests failing due to quota limits being exceeded, that may affect the other one that's happening in parallel, but the machine is "running" apparently, just no logs from it or anything. Need to dig a bit deeper...

Thanks for mentioning! This may be why the tests have been quite flaky recently 🙁
Maybe we need to look at ways to increase the quota or limit which tests run in parallel...

@mdbooth (Contributor)

mdbooth commented Dec 6, 2023

So looking at the first set of test results, I do see one of the tests failing due to quota limits being exceeded, that may affect the other one that's happening in parallel, but the machine is "running" apparently, just no logs from it or anything. Need to dig a bit deeper...

Thanks for mentioning! This may be why the tests have been quite flaky recently 🙁 Maybe we need to look at ways to increase the quota or limit which tests run in parallel...

I think we already maxed out what we can do 😞

@spjmurray (Contributor, Author)

I had exactly this problem at Couchbase, balancing host cluster resources against what the tests required. It may be a cool idea to investigate making a resource-aware t.Parallel() for "fun"...

@spjmurray (Contributor, Author)

Fun has been had... https://github.com/spjmurray/testing not the prettiest thing in the world, but food for thought 😸

@jichenjc (Contributor)

/test pull-cluster-api-provider-openstack-e2e-test

@mdbooth (Contributor)

mdbooth commented Dec 13, 2023

/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mdbooth, spjmurray

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 13, 2023
@lentzi90 (Contributor)

/hold cancel

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Dec 15, 2023
@k8s-ci-robot k8s-ci-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jan 16, 2024
@mdbooth (Contributor)

mdbooth commented Jan 16, 2024

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 16, 2024
@k8s-ci-robot k8s-ci-robot merged commit 010408d into kubernetes-sigs:main Jan 16, 2024
9 checks passed
@spjmurray spjmurray deleted the allow_cp_scheduling_changes branch January 17, 2024 10:04
5 participants