
✨ Allow AZs to be Omitted at Runtime #1769

Merged

Conversation

spjmurray (Contributor)

CAPI appears to make implicit control plane scheduling decisions, on an "LRU" basis, from the set of AZs that CAPO reports as available. It also assumes each AZ is infinitely sized, so problems begin when the "next" AZ cannot accommodate the VM. We could explicitly specify a single AZ that aggregates all machines, but that is just another mechanism that disables CAPI scheduling altogether and lets Nova do what it does, along with a soft anti-affinity rule. However, switching from CAPI scheduling to Nova scheduling is currently impossible because the field is immutable, so this PR allows it. Testing shows that existing scheduled clusters undergo no topology changes, which will be due to the KCPM not taking action, but you can force the changes with a rolling upgrade of some variety. Crucially, if a cluster with CAPI scheduling gets stuck, we can switch it to Nova scheduling and it should pick up the new specification and get past the hurdle.

What this PR does / why we need it:

As above
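To illustrate the switch described above, a cluster stuck on CAPI AZ scheduling could be patched to fall back to Nova scheduling. This is a hypothetical sketch: the field name `controlPlaneOmitAvailabilityZone` and the API version are assumptions inferred from the PR title, not confirmed by this thread; check the `OpenStackClusterSpec` for your CAPO release.

```yaml
# Hypothetical sketch: switch an existing cluster from CAPI AZ scheduling
# to Nova scheduling by omitting the control plane AZ at runtime.
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha7
kind: OpenStackCluster
metadata:
  name: example-cluster
spec:
  # Previously immutable; with this PR it can be flipped on a live cluster,
  # letting Nova (plus a soft anti-affinity rule) place control plane VMs.
  controlPlaneOmitAvailabilityZone: true
```

Note that, per the testing described above, flipping the field alone should not change the topology of an already-scheduled cluster until something (e.g. a rolling upgrade) causes the KCPM to act.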

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

Special notes for your reviewer:

  1. Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • if necessary:
    • includes documentation
    • adds unit tests

/hold

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 4, 2023

netlify bot commented Dec 4, 2023

Deploy Preview for kubernetes-sigs-cluster-api-openstack ready!

🔨 Latest commit: 04ee2bb
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-cluster-api-openstack/deploys/65a6959fd1a23600089a59ab
😎 Deploy Preview: https://deploy-preview-1769--kubernetes-sigs-cluster-api-openstack.netlify.app

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Dec 4, 2023
@k8s-ci-robot (Contributor)

Hi @spjmurray. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Dec 4, 2023
@lentzi90 (Contributor) left a comment


/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 5, 2023
@lentzi90 (Contributor) left a comment


This seems reasonable to me
/lgtm

/cc @mdbooth

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 5, 2023
@lentzi90 (Contributor)

lentzi90 commented Dec 5, 2023

/test pull-cluster-api-provider-openstack-e2e-test

@spjmurray (Contributor, Author)

Yes Sir Mr Robot 🖖🏻

@spjmurray (Contributor, Author)

So looking at the first set of test results, I do see one of the tests failing due to quota limits being exceeded, that may affect the other one that's happening in parallel, but the machine is "running" apparently, just no logs from it or anything. Need to dig a bit deeper...

@lentzi90 (Contributor)

lentzi90 commented Dec 5, 2023

So looking at the first set of test results, I do see one of the tests failing due to quota limits being exceeded, that may affect the other one that's happening in parallel, but the machine is "running" apparently, just no logs from it or anything. Need to dig a bit deeper...

Thanks for mentioning! This may be why the tests have been quite flaky recently 🙁
Maybe we need to look at ways to increase the quota or limit which tests run in parallel...

@mdbooth (Contributor)

mdbooth commented Dec 6, 2023

So looking at the first set of test results, I do see one of the tests failing due to quota limits being exceeded, that may affect the other one that's happening in parallel, but the machine is "running" apparently, just no logs from it or anything. Need to dig a bit deeper...

Thanks for mentioning! This may be why the tests have been quite flaky recently 🙁 Maybe we need to look at ways to increase the quota or limit which tests run in parallel...

I think we already maxed out what we can do 😞

@spjmurray (Contributor, Author)

I had exactly this problem at Couchbase, balancing host cluster resources against what the tests required. It may be a cool idea to investigate making a resource-aware t.Parallel() for "fun"...

@spjmurray (Contributor, Author)

Fun has been had... https://github.com/spjmurray/testing not the prettiest thing in the world, but food for thought 😸

@jichenjc (Contributor)

/test pull-cluster-api-provider-openstack-e2e-test

@mdbooth (Contributor)

mdbooth commented Dec 13, 2023

/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mdbooth, spjmurray

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 13, 2023
@lentzi90 (Contributor)

/hold cancel

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Dec 15, 2023
@k8s-ci-robot k8s-ci-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jan 16, 2024
@mdbooth (Contributor)

mdbooth commented Jan 16, 2024

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 16, 2024
@k8s-ci-robot k8s-ci-robot merged commit 010408d into kubernetes-sigs:main Jan 16, 2024
9 checks passed
@spjmurray spjmurray deleted the allow_cp_scheduling_changes branch January 17, 2024 10:04
5 participants