update leader election defaults so it handles 60s of kube-apiserver communication disruption #1104
Conversation
```diff
@@ -84,14 +84,25 @@ func ToConfigMapLeaderElection(clientConfig *rest.Config, config configv1.Leader
 func LeaderElectionDefaulting(config configv1.LeaderElection, defaultNamespace, defaultName string) configv1.LeaderElection {
 	ret := *(&config).DeepCopy()

+	// 1. clock skew tolerance is leaseDuration-renewDeadline == 22s
```
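For readers without the full file open, here is a minimal, self-contained sketch of what the defaulting does and what the comment's arithmetic refers to. The stand-in struct and the 137s/107s/26s durations reflect the defaults library-go eventually shipped, not necessarily this revision of the PR (the excerpt above still shows a 22s skew):

```go
package main

import (
	"fmt"
	"time"
)

// leaderElection mirrors the three timing knobs on configv1.LeaderElection
// that the defaulting fills in (illustrative stand-in, not the real type).
type leaderElection struct {
	LeaseDuration time.Duration // how long an acquired lease stays valid
	RenewDeadline time.Duration // how long the holder keeps retrying renewal before giving up leadership
	RetryPeriod   time.Duration // how often acquire/renew attempts are made
}

// withDefaults fills zero values the way LeaderElectionDefaulting does.
// The concrete durations here are the 137s/107s/26s values library-go
// settled on; the exact numbers were iterated during review of this PR.
func withDefaults(in leaderElection) leaderElection {
	out := in
	if out.LeaseDuration == 0 {
		out.LeaseDuration = 137 * time.Second
	}
	if out.RenewDeadline == 0 {
		out.RenewDeadline = 107 * time.Second
	}
	if out.RetryPeriod == 0 {
		out.RetryPeriod = 26 * time.Second
	}
	return out
}

func main() {
	le := withDefaults(leaderElection{})
	// Clock-skew tolerance: the slack between the holder giving up on renewal
	// and the lease actually expiring for other candidates.
	fmt.Println("clock skew tolerance:", le.LeaseDuration-le.RenewDeadline)
}
```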
I might suggest we call this method FastLeaderElectionDefaulting or CriticalLeaderElectionDefaulting, since these defaults should only be used for the most critical services. Is there a set of library-go based operators that are not critical and can tolerate 30s / 3 retry leases?
One more question - why are you trying to have 6 retries (i.e. what reason for 6 vs 3)?
Now answered in the code comments.
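To spell out the retry-count question for readers following along: client-go retries lease renewal every retryPeriod until renewDeadline expires, so the number of attempts that fit into the renew window determines how long a kube-apiserver outage the leader can ride out before it gives up the lease. A hedged sketch with purely hypothetical durations (not the values merged in this PR):

```go
package main

import (
	"fmt"
	"time"
)

// attemptsPerWindow returns how many renewal attempts client-go can make
// before the renew deadline expires, given the retry period.
func attemptsPerWindow(renewDeadline, retryPeriod time.Duration) int {
	return int(renewDeadline / retryPeriod)
}

func main() {
	// Hypothetical numbers purely to illustrate the 3-vs-6 trade-off raised above.
	renewDeadline := 90 * time.Second

	for _, retryPeriod := range []time.Duration{30 * time.Second, 15 * time.Second} {
		fmt.Printf("retryPeriod=%s -> %d renewal attempts before the deadline\n",
			retryPeriod, attemptsPerWindow(renewDeadline, retryPeriod))
	}
	// With 30s retries only 3 attempts fit; with 15s retries 6 attempts fit, so
	// a transient kube-apiserver outage has more chances to end before the
	// leader runs out of renewal attempts and releases the lease.
}
```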
I appreciate the commit title. LGTM
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, smarterclayton. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Update the package-server-manager controller and its leader election intervals so that OCP components can withstand 60s of API server disruption on SNO-enabled clusters. For more information, see the following resources:

- openshift/library-go#1104 (comment)
- https://bugzilla.redhat.com/show_bug.cgi?id=1985697

An alternative implementation would disable leader election entirely for SNO-enabled clusters: dynamically query the Infrastructure/cluster singleton resource, check the HA/non-HA expectations it exposes, and set leader election accordingly. That implementation would still need to handle transient errors carefully and provide an escape hatch that users can pass to the PSM deployment for failed upgrades (e.g. preferring enablement of leader election through a CLI flag over the dynamically queried value).
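A rough, hedged sketch of that dynamic check, assuming the openshift/client-go config clientset and the ControlPlaneTopology field on the Infrastructure status (names are my reading of the OpenShift config API, not code from this PR or from PSM):

```go
package main

import (
	"context"
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/rest"
)

// isSingleReplicaControlPlane reports whether the control plane runs in
// single-replica (SNO) topology, by reading the cluster-scoped
// Infrastructure singleton named "cluster".
func isSingleReplicaControlPlane(ctx context.Context, cfg *rest.Config) (bool, error) {
	client, err := configclient.NewForConfig(cfg)
	if err != nil {
		return false, err
	}
	infra, err := client.ConfigV1().Infrastructures().Get(ctx, "cluster", metav1.GetOptions{})
	if err != nil {
		// Transient errors need a deliberate policy: retry, fall back to the
		// conservative HA defaults, or honor an explicit CLI override.
		return false, err
	}
	return infra.Status.ControlPlaneTopology == configv1.SingleReplicaTopologyMode, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	sno, err := isSingleReplicaControlPlane(context.Background(), cfg)
	if err != nil {
		panic(err)
	}
	fmt.Println("single-replica control plane:", sno)
	// A caller would then pick leader election settings (or disable leader
	// election) based on this, with a CLI flag taking precedence as the
	// escape hatch described above.
}
```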
This bump is intended to address the issue that SNO clusters cannot handle 60s of API server communication disruption. The fix was added in openshift/library-go#1104.
To be able to handle an API server disruption on SNO, the leader election timeouts need to be adjusted according to github.com/openshift/library-go/pull/1104. Signed-off-by: Christoph Stäbler <cstabler@redhat.com>
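For downstream components like the ones in these commit messages, adopting the new timeouts is mostly a matter of leaving the intervals unset and letting library-go default them. A minimal sketch, assuming the library-go package path and the configv1.LeaderElection field names (only the LeaderElectionDefaulting signature itself is visible in the diff above); the namespace and lock name are hypothetical:

```go
package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	// Package path assumed from the file this PR touches.
	"github.com/openshift/library-go/pkg/config/leaderelection"
)

func main() {
	// Start from an empty LeaderElection stanza and let library-go fill in
	// the disruption-tolerant defaults introduced by this PR.
	le := leaderelection.LeaderElectionDefaulting(
		configv1.LeaderElection{},
		"openshift-example-operator", // hypothetical namespace
		"example-operator-lock",      // hypothetical lock name
	)
	fmt.Printf("lease=%s renew=%s retry=%s\n",
		le.LeaseDuration.Duration, le.RenewDeadline.Duration, le.RetryPeriod.Duration)
}
```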
found via openshift/origin#26215
We want to handle 60s of communication disruption in all components.