New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 1882394: Delete lock created by CSO 4.6 #91
Bug 1882394: Delete lock created by CSO 4.6 #91
Conversation
@bertinatto: This pull request references Bugzilla bug 1877316, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/hold |
/retest |
/launch aws |
/test launch-aws |
@bertinatto: The specified target(s) for
Use In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/test launch-gcp |
@bertinatto: The specified target(s) for
Use In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
33b489b
to
2d1e41a
Compare
@bertinatto: This pull request references Bugzilla bug 1877316, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
e43afcc
to
1b1fae5
Compare
@bertinatto: This pull request references Bugzilla bug 1877316, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/bugzilla refresh |
@bertinatto: This pull request references Bugzilla bug 1877316, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@bertinatto: This pull request references Bugzilla bug 1882394, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
cmd/manager/main.go
Outdated
// We don't exit if an error happens, but we let the operator | ||
// try to become the leader below. If it doesn't succeed, then we exit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In https://bugzilla.redhat.com/show_bug.cgi?id=1877316 we saw 4.5 CSO trying to become leader for 3 hours without exiting. There should be either timeout for leader.Become
or CSO should exit here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was expecting that on line 109 below it would os.Exit(1)
if CSO couldn't become the leader, but operator-sdk doesn't return an error the ConfigMap already exists:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, PTAL
14dcb73
to
c4c3167
Compare
@bertinatto: This pull request references Bugzilla bug 1882394, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/bugzilla refresh |
@bertinatto: This pull request references Bugzilla bug 1882394, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/bugzilla refresh |
@bertinatto: This pull request references Bugzilla bug 1882394, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
errWait := wait.PollImmediate(time.Second*10, time.Minute*5, func() (bool, error) { | ||
err := tryDeleteIncompatibleLock(cfg, lockName) | ||
if err != nil { | ||
return false, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please log the error.
log.Error(errWait, ...
below logsDeadline exceeded
or some timeout message, not the API error- It would be better to get some error logged before 5 minute timeout, users are not that patient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added log
On a downgrade from 4.6 to 4.5 scenario, the leader election ConfigMap is created by CSO 4.6 (library-go) and it doesn't have any `ownerReference` set. However, CSO 4.5 uses `operator-sdk` with leader-for-life election approach, and it expects that the ConfigMap either doesn't exist or exists but has an `ownerReference` set (in order to know whether the lock belongs to it or not). Since it doesn't find any, it tries to create one, but it gets back an error stating the the object already exists. This goes on forever and CSO 4.5 never become the leader. This patch identifies if the lock was created by CSO 4.6 and, if so, deletes it so that CSO 4.5 can become the leader using the leader-for-life model
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bertinatto, jsafrane The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/bugzilla refresh |
@jsafrane: This pull request references Bugzilla bug 1882394, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Using my team lead superpowers to add valid-bug label: there is nothing to fix in 4.6 |
/retest |
/retest |
/retest
|
/hold cancel |
/retest |
/retest Please review the full test history for this PR and help us cut down flakes. |
1 similar comment
/retest Please review the full test history for this PR and help us cut down flakes. |
/retest |
@bertinatto: All pull requests linked via external trackers have merged: Bugzilla bug 1882394 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Leader-for-life leader election causes problems when downgrading from CSO 4.6, which uses library-go and doesn't set an
ownerReference
in theConfigMap
metadata.This patch identifies if the lock was created by CSO 4.6 and, if so, deletes it so that CSO 4.5 can become the leader using the leader-for-life model.
More on the the problem:
On a downgrade from 4.5 to 4.6 scenario, the leader election
ConfigMap
is created by CSO 4.6 (library-go) and it doesn't have anyownerReference
set.However, CSO 4.5 uses
operator-sdk
with leader-for-life election approach, and it expects that theConfigMap
either doesn't exist or exists but has anownerReference
set (in order to know whether the lock belongs to it or not).Since it doesn't find any, it tries to create one, but it gets back an error stating the the object already exists. Therefore, CSO 4.5 thinks that the lock belongs to somebody else and it never starts. This goes on forever and CSO 4.5 never become the leader.