Exit rolling updates when encountering specific errors #14194

Merged

Conversation

jandersen-plaid
Contributor

@jandersen-plaid jandersen-plaid commented Aug 26, 2022

Fixes #14176

Add a new error to instancegroups that is returned when cluster validation times out, and check for this error while rolling node and apiserver instance groups. If the cluster fails to validate within the timeout, validation is unlikely to succeed on subsequent node rolls, so the preference is to exit the rolling update and let operators fix the cluster validation issue.
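The description above can be illustrated with a minimal Go sketch of a dedicated error type that callers detect with `errors.As`. This is not the code from the PR diff: the type name `ValidationTimeoutError`, its fields, and the helper `isValidationTimeout` are assumptions chosen for illustration only.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ValidationTimeoutError is a hypothetical error type signalling that the
// cluster did not validate within the allowed time.
type ValidationTimeoutError struct {
	Timeout time.Duration
	Err     error
}

func (e *ValidationTimeoutError) Error() string {
	return fmt.Sprintf("cluster did not validate within %s: %v", e.Timeout, e.Err)
}

// Unwrap exposes the underlying cause to errors.Is and errors.As.
func (e *ValidationTimeoutError) Unwrap() error { return e.Err }

// isValidationTimeout reports whether err, or any error it wraps, is a
// *ValidationTimeoutError.
func isValidationTimeout(err error) bool {
	var vte *ValidationTimeoutError
	return errors.As(err, &vte)
}

func main() {
	cause := errors.New("node not ready")
	err := fmt.Errorf("rolling instance group: %w",
		&ValidationTimeoutError{Timeout: 15 * time.Minute, Err: cause})

	fmt.Println(isValidationTimeout(err))   // true: the timeout error is wrapped inside err
	fmt.Println(isValidationTimeout(cause)) // false: an unrelated error
}
```

Using a distinct type rather than matching on the error string lets the caller keep wrapping the error with context while still recognizing it further up the call stack.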

@linux-foundation-easycla

linux-foundation-easycla bot commented Aug 26, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: jandersen-plaid (40caf71)

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Aug 26, 2022
@k8s-ci-robot
Contributor

Welcome @jandersen-plaid!

It looks like this is your first PR to kubernetes/kops 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kops has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 26, 2022
@k8s-ci-robot
Contributor

Hi @jandersen-plaid. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added area/documentation area/rolling-update cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Aug 26, 2022
@olemarkus
Member

I'd rather not add a flag for this. I think it is enough to inspect the returned error and return directly if it does not make sense to continue.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 15, 2022
@johngmyers
Member

I think I'd prefer this be the default or only option.

/hold
/kind office-hours

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/office-hours labels Nov 24, 2022
@johngmyers
Member

I believe the history was that the update on the IG would previously wait forever, not fail.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 25, 2022
@jandersen-plaid jandersen-plaid changed the title Add a flag to rolling update to fail immediately on IG error Exit rolling updates when encountering specific errors Nov 25, 2022
@olemarkus
Member

If a control plane IG fails, we already return an error directly. That is by far the most important behavior. If a node IG fails, it makes sense for kOps to carry on to the next IG and continue to gracefully drain and terminate nodes.

But in the case of a validation error, it doesn't make sense to keep going as kOps won't succeed with the next IG either.
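The behavior described in this comment can be sketched roughly as follows. Everything here is a simplified assumption rather than the kOps implementation: `group`, `rollAll`, and the two sentinel errors are hypothetical names, and reporting the failed groups together at the end is just one possible way to surface the failures that did not stop the update.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors standing in for the failure kinds discussed here.
var (
	errValidationTimeout = errors.New("cluster validation timed out")
	errEviction          = errors.New("eviction failed")
)

// group is a simplified, hypothetical stand-in for an instance group.
type group struct {
	name         string
	controlPlane bool
}

// rollAll rolls the groups in order and returns the names of the groups it
// attempted. It stops immediately on a control plane failure or a validation
// timeout; any other node group failure is recorded and the loop continues.
func rollAll(groups []group, roll func(group) error) (attempted []string, err error) {
	var failed []string
	for _, g := range groups {
		attempted = append(attempted, g.name)
		rollErr := roll(g)
		if rollErr == nil {
			continue
		}
		if g.controlPlane || errors.Is(rollErr, errValidationTimeout) {
			// Continuing would not help: later groups would hit the same problem.
			return attempted, fmt.Errorf("instance group %q: %w", g.name, rollErr)
		}
		failed = append(failed, g.name)
	}
	if len(failed) > 0 {
		return attempted, fmt.Errorf("instance groups failed to roll: %v", failed)
	}
	return attempted, nil
}

func main() {
	groups := []group{
		{name: "control-plane", controlPlane: true},
		{name: "nodes-a"},
		{name: "nodes-b"},
	}
	// Simulate a validation timeout while rolling nodes-a.
	attempted, err := rollAll(groups, func(g group) error {
		if g.name == "nodes-a" {
			return errValidationTimeout
		}
		return nil
	})
	fmt.Println(attempted, err)
}
```

In this sketch `nodes-b` is never attempted once `nodes-a` hits the validation timeout, whereas an `errEviction` failure on `nodes-a` would still let `nodes-b` roll.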

@jandersen-plaid jandersen-plaid requested review from johngmyers and removed request for olemarkus and hakman November 26, 2022 06:02
pkg/instancegroups/rollingupdate.go (two outdated review threads, both resolved)
Member

@olemarkus olemarkus left a comment


This looks good.

I think it is worth adding a brief explanation to the notes about the change in behavior. Something like:

* As of kOps 1.24, kOps no longer hangs indefinitely on eviction errors, but times out after 15 minutes by default. After the timeout, kOps carries on to the next InstanceGroup. As of kOps 1.26, kOps no longer carries on if a cluster validation error is encountered.

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. do-not-merge/contains-merge-commits and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Dec 21, 2022
jandersen-plaid and others added 10 commits December 21, 2022 09:30
Signed-off-by: Jack Andersen <jandersen@plaid.com>
Signed-off-by: Jack Andersen <jandersen@plaid.com>
Signed-off-by: Jack Andersen <jandersen@plaid.com>
…exitable

Signed-off-by: Jack Andersen <jandersen@plaid.com>
Co-authored-by: Ole Markus With <olemarkus@gmail.com>
…rs of err check

Signed-off-by: Jack Andersen <jandersen@plaid.com>
Signed-off-by: Jack Andersen <jandersen@plaid.com>
Signed-off-by: Jack Andersen <jandersen@plaid.com>
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 9, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: olemarkus

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 9, 2023
@olemarkus
Member

@johngmyers you still want to hold this one?

@johngmyers
Member

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 10, 2023
@johngmyers
Member

/retest

@k8s-ci-robot k8s-ci-robot merged commit f6a36bf into kubernetes:master Jan 10, 2023
k8s-ci-robot added a commit that referenced this pull request Jan 10, 2023
…194-origin-release-1.26

Automated cherry pick of #14194: Add a flag to rolling update to fail immediately on IG
@jandersen-plaid jandersen-plaid deleted the jandersen-plaid-exit-first-error branch January 10, 2023 14:01
Shimiazoulai pushed a commit to spotinst/kubernetes-kops that referenced this pull request Jul 13, 2023
…ick-of-#14194-origin-release-1.26

Automated cherry pick of kubernetes#14194: Add a flag to rolling update to fail immediately on IG
Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/documentation
  • area/rolling-update
  • cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
  • kind/office-hours
  • lgtm: "Looks good to me", indicates that a PR is ready to be merged.
  • ok-to-test: Indicates a non-member PR verified by an org member that is safe to test.
  • size/M: Denotes a PR that changes 30-99 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

All instancegroups must fail before kops exits on rolling update
4 participants