ETCD-612: Added a callback to provide additional pre-conditions for installation #1749
base: master
@jubittajohn: This pull request references ETCD-612 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
pkg/operator/staticpod/controller/installer/installer_controller.go
@@ -442,6 +449,16 @@ func (c *InstallerController) manageInstallationPods(ctx context.Context, operat
		return true, requeueAfter, nil
	}

	// Check if revision should be installed based on if the quorum is about to be violated
	if c.shouldRevisionInstall != nil {
As far as I understand, this blocks the entire controller, which we don't want: revisions that are already rolling out should continue to do so.
We can be more granular here and only block the ensureInstallerPod routine from spawning the installer pod. That way it wouldn't block existing rollouts. Having said that, there are generally very few tests around this, so we need to run a couple of payload tests in cluster-etcd-operator (CEO) as well.
bf42b26 to d1afb2e
	// checks if new revision should be rolled out
	if c.shouldRevisionInstall != nil {
		shouldInstall, err := c.shouldRevisionInstall()
		if !shouldInstall {
Shouldn't that be the following?
if err != nil {
	return err
}
if !shouldInstall {
	return nil
}
maybe also add an info logging statement here, so we know if it was skipped
@tjungblu With the current logic, the requeue only happens if an error is returned: https://github.com/openshift/library-go/blob/master/pkg/operator/staticpod/controller/installer/installer_controller.go#L481.
Then we potentially need to rethink the ensureInstallerPod signature. I don't think we should trigger an event for this, for example:
c.eventRecorder.Warningf("InstallerPodFailed", "Failed to create installer pod for revision %d count %d on node %q: %v",
	currNodeState.TargetRevision, currNodeState.LastFailedCount, currNodeState.NodeName, err)
That's misleading, because there was no actual error that failed the installation. We just skipped it.
	kubeClient:    fake.NewSimpleClientset(),
	eventRecorder: eventstesting.NewTestingEventRecorder(t),
	shouldRevisionInstall: func() (bool, error) {
		return false, fmt.Errorf("revision shouldn't be installed")
ah okay, now I get it. If you want to use an error for control flow, that's fine. But then we don't really need the boolean to denote whether to skip it or not.
I (personally) would use the error for real errors though.
	},
	originalOperatorStatus: &operatorv1.StaticPodOperatorStatus{
		LatestAvailableRevision: 1,
		NodeStatuses: []operatorv1.NodeStatus{
I only looked through your tests very superficially, but I'm generally missing one with multiple NodeStatuses.
A more meaningful test for etcd might be one with three nodes at these revisions:
master-0 = 4
master-1 = 3 (would be next to get to 4)
master-2 = 3
Now our quorum guard says master-0 went away (e.g., the machine was turned off). What happens to the remaining revisions?
I would expect this to block the installation of rev 4 on master-1 and master-2.
Maybe also think of a couple more such tests, possibly with a more structured permutation approach. We can also discuss this next Tuesday.
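The three-node scenario above could be expressed as a table-driven check along these lines. This is a simplified model, not the controller's real test harness: nodeState and nextNodeToInstall are hypothetical, and the model assumes the rollout picks the first node below the target revision.

```go
package main

import "fmt"

// nodeState is a hypothetical, reduced view of a node's status.
type nodeState struct {
	name            string
	currentRevision int32
}

// nextNodeToInstall models a one-node-at-a-time rollout: it returns the
// first node below the target revision, or "" when every node is up to
// date or the precondition blocks the installation.
func nextNodeToInstall(nodes []nodeState, target int32, precondition func() (bool, error)) (string, error) {
	for _, n := range nodes {
		if n.currentRevision >= target {
			continue
		}
		ok, err := precondition()
		if err != nil {
			return "", err
		}
		if !ok {
			return "", nil
		}
		return n.name, nil
	}
	return "", nil
}

func main() {
	nodes := []nodeState{
		{name: "master-0", currentRevision: 4},
		{name: "master-1", currentRevision: 3},
		{name: "master-2", currentRevision: 3},
	}
	cases := []struct {
		name       string
		quorumSafe bool
		want       string
	}{
		{name: "quorum healthy", quorumSafe: true, want: "master-1"},
		{name: "master-0 went away", quorumSafe: false, want: ""},
	}
	for _, tc := range cases {
		quorumSafe := tc.quorumSafe
		got, err := nextNodeToInstall(nodes, 4, func() (bool, error) { return quorumSafe, nil })
		fmt.Printf("%s: got=%q want=%q err=%v\n", tc.name, got, tc.want, err)
	}
}
```

Further permutations (for example, all nodes already at the target revision) can be added as extra rows in the table.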
/assign @dusk125
d1afb2e to 8e7bd43
54dda05 to ddabfa2
5f21a8d to 713d859
/lgtm (non binding). Can you please also squash all commits into one and remove the WIP?
/assign @dgrisonnet
713d859 to 8d7fddd
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: jubittajohn, tjungblu The full list of commands accepted by this bot can be found here.
@@ -98,6 +98,9 @@ type InstallerController struct {
	clock            clock.Clock
	installerBackOff func(count int) time.Duration
	fallbackBackOff  func(count int) time.Duration

	// shouldRevisionInstall is a callback function that determines whether a new revision should be installed
	shouldRevisionInstall func() (bool, error)
I think we should follow the same precondition pattern that we have in the static resource controller: https://github.com/openshift/library-go/blob/master/pkg/operator/staticresourcecontroller/static_resource_controller.go#L55-L59. That would allow any condition to be defined to gate the revision process.
Also, we should pass a context into that method and propagate it all the way down in case we need to cancel it for some reason, especially in https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/operator/ceohelpers/bootstrap.go#L139 and maybe some other functions that don't take it yet.
Good idea. @jubittajohn can we define it slightly differently, like this?
type StaticPodInstallerPreconditionsFuncType func(ctx context.Context) (bool, error)
The struct field could then look like this:
installPrecondition StaticPodInstallerPreconditionsFuncType
That way it is more consistent with the other packages.
// returns whether or not requeue and if an error happened while creating installer pod
func (c *InstallerController) ensureInstallerPod(ctx context.Context, operatorSpec *operatorv1.StaticPodOperatorSpec, ns *operatorv1.NodeStatus) (bool, error) {
	// checks if new revision should be rolled out
	if c.shouldRevisionInstall != nil {
I think this check should be moved to manageInstallationPods because this function is supposed to be only about creating the installer pod.
We could maybe even move it inside the sync function, before the call to manageInstallationPods, since we want to skip the installation process entirely when the preconditions are not fulfilled.
yeah we thought so, but it's better to continue the current rollout. See #1749 (comment)
good point. it made me notice that we might be subject to the following issue: #1749 (comment)
119fa6c to f152bd5
lgtm, but I'll let Damien have the last word
		return true, err
	}
	if !shouldInstall {
		return true, nil
Let's imagine the following scenario:
t0: sync for a new revision occurs, the preconditions are true, and the installer pod is created
t1: the preconditions become false while the installation is still in progress
t2: sync happens again, and ensureInstallerPod returns true for requeuing because the installer pod can't be created while the preconditions are not met
When that happens, we return early in library-go/pkg/operator/staticpod/controller/installer/installer_controller.go, lines 500 to 503 in f152bd5:
if requeue {
	klog.V(4).Infof("Requeuing the creation of installer pod for revision %d on node %q", currNodeState.TargetRevision, currNodeState.NodeName)
	return true, 0, nil
}
To prevent the whole flow from being dependent on the preconditions, we should make sure that the preconditions are only evaluated when the installer pod isn't already present.
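The t0/t1/t2 scenario and the proposed fix can be modeled without a cluster. In this sketch, podStore is a hypothetical in-memory stand-in for the API server, ensureInstallerPodSketch is not the real function, and the pod name is made up for illustration.

```go
package main

import "fmt"

// podStore is a hypothetical stand-in for looking up pods by name.
type podStore map[string]bool

// ensureInstallerPodSketch evaluates the precondition only when the
// installer pod does not exist yet, so an installation that is already
// in progress is never held back by a precondition that turned false.
func ensureInstallerPodSketch(pods podStore, podName string, precondition func() (bool, error)) (bool, error) {
	if pods[podName] {
		// Pod already present: let the ongoing installation proceed.
		return false, nil
	}
	ok, err := precondition()
	if err != nil {
		return true, err
	}
	if !ok {
		// Skipped, not failed: requeue without an error.
		return true, nil
	}
	pods[podName] = true
	return false, nil
}

func main() {
	pods := podStore{}
	allowed := true
	precondition := func() (bool, error) { return allowed, nil }

	// t0: preconditions hold, so the installer pod is created.
	requeue, err := ensureInstallerPodSketch(pods, "installer-4-master-1", precondition)
	fmt.Println("t0 requeue:", requeue, "err:", err)

	// t1: preconditions turn false while the installation is in progress.
	allowed = false

	// t2: the pod exists, so the precondition is not consulted again.
	requeue, err = ensureInstallerPodSketch(pods, "installer-4-master-1", precondition)
	fmt.Println("t2 requeue:", requeue, "err:", err)
}
```

A pod for a different node that does not exist yet would still be gated at t2, which is the intended behavior for new installations.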
Maybe I'm off, but wouldn't t2 skip the ensureInstallerPod branch? The previous if condition should be true:
if operatorStatus.LatestAvailableRevision > currNodeState.TargetRevision {
	// no backoff if new revision is pending
}
@jubittajohn can we add a unit test for the condition that @dgrisonnet mentioned? I think he has a valid point that we definitely must update the node status for ongoing installations, even when the precondition is false.
@dgrisonnet As you pointed out, ensureInstallerPod returns true at t2.
@tjungblu The if condition you mentioned would evaluate to false at t2, so the installer creation process continues. That condition only evaluates to true when a revision later than the one currently being rolled out is pending. Therefore, ensureInstallerPod is invoked.
To address this, I have added a check, as suggested, to ensure that preconditions are only evaluated when the installer pod isn't already present. I have also added a unit test for this scenario to confirm the behavior.
Could you please review the new changes?
looks good, thanks for the test case :)
pkg/operator/staticpod/controller/installer/installer_controller.go
		return true, err
	}
	if !shouldInstall {
		return true, nil
can we leave a log statement here, too, please?
Suggested change:
klog.Infof("Skipping the creation of installer pod [%s] because of precondition check", installerPodName)
return true, nil
just as an example, feel free to find a more fitting log statement
424dac8 to a73116f
Signed-off-by: jubittajohn <jujohn@redhat.com>
Added unit test
Signed-off-by: jubittajohn <jujohn@redhat.com>
Preconditions are only evaluated when the installer pod isn't already present
Signed-off-by: jubittajohn <jujohn@redhat.com>
a73116f to 4b49b84
@jubittajohn: all tests passed! Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Added the callback shouldRevisionInstall to provide additional conditions for installation.
These changes are consumed by: openshift/cluster-etcd-operator#1278