Applications: discover if a release is stuck and rollback if necessary #13301
Signed-off-by: Simon Bein <simontheleg@gmail.com>
/hold
just realized I should add some integration tests for this as well
...controller/user-cluster-controller-manager/application-installation-controller/controller.go
/approve
LGTM label has been added. Git tree hash: 1a65658781369dc3e303f3c90e5239d96d4034fd
…rally via the logger Signed-off-by: Simon Bein <simontheleg@gmail.com>
	return false, nil
}
// currently we observe the stuck error exclusively with this message. If it does not exist, exit early
if applicationInstallation.Status.Conditions[appskubermaticv1.Ready].Message != "another operation (install/upgrade/rollback) is in progress" {
I love helm
/s
/lgtm
/approve
LGTM label has been added. Git tree hash: bc675412baafa2b956f59fe435a98c886e2baaba
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: embik, wurbanski
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest
/retest
/unhold
/cherry-pick release/v2.25
/cherry-pick release/v2.24
@embik: new pull request created: #13310 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@embik: new pull request created: #13311 In response to this:
What this PR does / why we need it:
This PR fixes an issue caused by a helm inconsistency (for more details see below) in which a helm release can get stuck in a "pending-install" state and never be fully rolled out. This has been observed in multiple clusters now, but because it is an inconsistency, there is no reliable way to reproduce it.
Which issue(s) this PR fixes:
Fixes #12846
What type of PR is this?
/kind bug
Special notes for your reviewer:
During the research we came to the following conclusion:
This is not directly related to an AppInstallController being destroyed while a helm release is still in progress. In all testing the AppController would delete the "broken" release and re-create it. Instead, this seems to be related to a fluke in helm, which many people have encountered (see helm/helm#7476 for more details).
The current approach is to check for a variety of conditions that we have observed at customer clusters and do a `helm rollback` to the previous version if a release appears to be stuck. This comes with the drawback that after the rollback has been completed we will start a new release with the current version. However, out of all the following alternatives, I think this is the best option. Alternatives considered were:
- `atomic` flag -> scrapped because then end-users would lose the ability to debug applications where there is a legitimate issue in a helm chart after it has reached the timeout
Does this PR introduce a user-facing change? Then add your Release Note here:
Documentation: