Applications: discover if a release is stuck and rollback if necessary #13301

Merged
merged 6 commits into from
Apr 17, 2024


SimonTheLeg
Member

What this PR does / why we need it:

This PR fixes an issue caused by a Helm inconsistency (see below for details) in which a Helm release can get stuck in the "pending-install" state and never be fully rolled out. This has now been observed in multiple clusters, but because the behavior is intermittent, there is no reliable way to reproduce it.

Which issue(s) this PR fixes:

Fixes #12846

What type of PR is this?

/kind bug

Special notes for your reviewer:

During the research we came to the following conclusion:
The issue is not directly related to an AppInstallController being destroyed while a Helm release is still in progress. In all of our testing, the AppController deleted the "broken" release and re-created it. Instead, it seems to be related to a fluke in Helm that many people have encountered (see helm/helm#7476 for more details).

The current approach is to check for a variety of conditions that we have observed on customer clusters and to do a Helm rollback to the previous version if a release appears to be stuck. The drawback is that once the rollback has completed, we start a new release with the current version.
However, of the alternatives considered, I think this is the best option. The alternatives were:

  • always use the atomic flag -> scrapped, because end users would lose the ability to debug applications where there is a legitimate issue in a Helm chart after it has reached the timeout
  • delete the complete Helm secret, thus forcing a re-rollout of the app -> the complete Helm history would be lost, so we decided against this option.

Does this PR introduce a user-facing change? Then add your Release Note here:

Address inconsistencies in Helm that can leave an Application stuck in "pending-install"

Documentation:

NONE

Signed-off-by: Simon Bein <simontheleg@gmail.com>
@kubermatic-bot kubermatic-bot added the following labels Apr 15, 2024:
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • docs/none: Denotes a PR that doesn't need documentation (changes).
  • kind/bug: Categorizes issue or PR as related to a bug.
  • dco-signoff: yes: Denotes that all commits in the pull request have the valid DCO signoff message.
  • sig/app-management: Denotes a PR or issue as being assigned to SIG App Management.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
@SimonTheLeg
Member Author

/hold just realized I should add some integration tests for this as well

@kubermatic-bot kubermatic-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 15, 2024
Signed-off-by: Simon Bein <simontheleg@gmail.com>
Signed-off-by: Simon Bein <simontheleg@gmail.com>
Signed-off-by: Simon Bein <simontheleg@gmail.com>
Signed-off-by: Simon Bein <simontheleg@gmail.com>
Member

@embik embik left a comment


/approve

@kubermatic-bot kubermatic-bot added the lgtm Indicates that a PR is ready to be merged. label Apr 17, 2024
@kubermatic-bot
Contributor

LGTM label has been added.

Git tree hash: 1a65658781369dc3e303f3c90e5239d96d4034fd

…rally via the logger

Signed-off-by: Simon Bein <simontheleg@gmail.com>
@kubermatic-bot kubermatic-bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 17, 2024
return false, nil
}
// currently we observe the stuck error exclusively with this message. If it does not exist, exit early
if applicationInstallation.Status.Conditions[appskubermaticv1.Ready].Message != "another operation (install/upgrade/rollback) is in progress" {
Contributor


I love helm
/s

Contributor

@wurbanski wurbanski left a comment


/lgtm
/approve

@kubermatic-bot kubermatic-bot added the lgtm Indicates that a PR is ready to be merged. label Apr 17, 2024
@kubermatic-bot
Contributor

LGTM label has been added.

Git tree hash: bc675412baafa2b956f59fe435a98c886e2baaba

@kubermatic-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: embik, wurbanski

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubermatic-bot kubermatic-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 17, 2024
@SimonTheLeg
Member Author

/retest

@SimonTheLeg
Member Author

/retest

@SimonTheLeg
Member Author

/unhold

@kubermatic-bot kubermatic-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 17, 2024
@kubermatic-bot kubermatic-bot merged commit 9f73c2e into kubermatic:main Apr 17, 2024
18 checks passed
@kubermatic-bot kubermatic-bot added this to the KKP 2.26 milestone Apr 17, 2024
@embik
Member

embik commented Apr 17, 2024

/cherry-pick release/v2.25

@embik
Member

embik commented Apr 17, 2024

/cherry-pick release/v2.24

@kubermatic-bot
Contributor

@embik: new pull request created: #13310

In response to this:

/cherry-pick release/v2.25

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kubermatic-bot
Contributor

@embik: new pull request created: #13311

In response to this:

/cherry-pick release/v2.24


Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • dco-signoff: yes: Denotes that all commits in the pull request have the valid DCO signoff message.
  • docs/none: Denotes a PR that doesn't need documentation (changes).
  • kind/bug: Categorizes issue or PR as related to a bug.
  • lgtm: Indicates that a PR is ready to be merged.
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • sig/app-management: Denotes a PR or issue as being assigned to SIG App Management.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

System application cilium stops upgrading because another operation appears to be in progress
4 participants