Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ETCD-612: Added a callback to provide additional pre-conditions for installation #1749

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

jubittajohn
Copy link

@jubittajohn jubittajohn commented Jun 14, 2024

Added the callback shouldRevisionInstall to provide additional conditions for installation

These changes are consumed by : openshift/cluster-etcd-operator#1278

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 14, 2024
@jubittajohn jubittajohn changed the title WIP: Added the callback shouldRevisionInstall to provide additional c… WIP: ETCD-612: Added the callback shouldRevisionInstall to provide additional c… Jun 14, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 14, 2024
@openshift-ci-robot
Copy link

openshift-ci-robot commented Jun 14, 2024

@jubittajohn: This pull request references ETCD-612 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.

In response to this:

…onditions for installation

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Jun 14, 2024

@jubittajohn: This pull request references ETCD-612 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.

In response to this:

Added the callback shouldRevisionInstall to provide additional conditions for installation

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@@ -442,6 +449,16 @@ func (c *InstallerController) manageInstallationPods(ctx context.Context, operat
return true, requeueAfter, nil
}

// Check if revision should be installed based on if the quorum is about to be violated
if c.shouldRevisionInstall != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as far as I understand, this blocks the entire controller, which we don't want, we want the current revisions to continue to rollout.

We can be more granular here and just block the ensureInstallerPod routine from spawning just the installer pod. That way it wouldn't block existing rollouts. Having said that, I think there are generally very little tests around this, so we need to run a couple of payload tests in cluster-etcd-operator (CEO) as well

@jubittajohn jubittajohn force-pushed the skip-static-pod-rollouts branch 3 times, most recently from bf42b26 to d1afb2e Compare June 21, 2024 14:07
// checks if new revision should be rolled out
if c.shouldRevisionInstall != nil {
shouldInstall, err := c.shouldRevisionInstall()
if !shouldInstall {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't that be

Suggested change
if !shouldInstall {
if err != nil {
return err
}
if !shouldInstall {
return nil
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe also add an info logging statement here, so we know if it was skipped

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then we need to rethink the ensureInstallerPod signature potentially. I don't think we should trigger an event for this for example:

c.eventRecorder.Warningf("InstallerPodFailed", "Failed to create installer pod for revision %d count %d on node %q: %v", 
       currNodeState.TargetRevision, currNodeState.LastFailedCount, currNodeState.NodeName, err) 

that's misleading, because there was no actual error that failed the installation. We just skipped it.

kubeClient: fake.NewSimpleClientset(),
eventRecorder: eventstesting.NewTestingEventRecorder(t),
shouldRevisionInstall: func() (bool, error) {
return false, fmt.Errorf("revision shouldn't be installed")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah okay, now I get it. If you want to use an error for control flow, that's fine. But then we don't really need the boolean to denote whether to skip it or not.

I (personally) would use the error for real errors though.

},
originalOperatorStatus: &operatorv1.StaticPodOperatorStatus{
LatestAvailableRevision: 1,
NodeStatuses: []operatorv1.NodeStatus{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only very superficially looked through your tests, but I'm generally missing one with multiple NodeStatuses.

Maybe more meaningful for etcd would be a test with three nodes at those revisions:

master-0 = 4
master-1 = 3 (would be next to get to 4)
master-2 = 3

Now our quorum guard says master-0 went away (eg, machine was turned off). What happens to the remaining revisions?

I would expect this to block the installation of rev 4 on master-1 and master-2.
Maybe also think of a couple of such tests, maybe with a more structured permutation approach. We can discuss this also next Tuesday.

@jubittajohn
Copy link
Author

/assign @dusk125

@openshift-ci-robot
Copy link

openshift-ci-robot commented Jun 25, 2024

@jubittajohn: This pull request references ETCD-612 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.

In response to this:

Added the callback shouldRevisionInstall to provide additional conditions for installation

This changes are consumed by : openshift/cluster-etcd-operator#1278

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Jun 26, 2024

@jubittajohn: This pull request references ETCD-612 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.

In response to this:

Added the callback shouldRevisionInstall to provide additional conditions for installation

These changes are consumed by : openshift/cluster-etcd-operator#1278

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@tjungblu
Copy link
Contributor

/lgtm (non binding)

can you please also squash all commits into one and remove the WIP please?

@tjungblu
Copy link
Contributor

/assign @dgrisonnet

Copy link
Contributor

openshift-ci bot commented Jun 26, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jubittajohn, tjungblu
Once this PR has been reviewed and has the lgtm label, please ask for approval from dgrisonnet. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jubittajohn jubittajohn changed the title WIP: ETCD-612: Added the callback shouldRevisionInstall to provide additional c… ETCD-612: Added the callback shouldRevisionInstall to provide additional conditions for installation Jun 26, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 26, 2024
@@ -98,6 +98,9 @@ type InstallerController struct {
clock clock.Clock
installerBackOff func(count int) time.Duration
fallbackBackOff func(count int) time.Duration

// shouldRevisionInstall is a callback function that determines whether a new revision should be installed
shouldRevisionInstall func() (bool, error)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should follow the same precondition pattern that we have in the static resource controller: https://github.com/openshift/library-go/blob/master/pkg/operator/staticresourcecontroller/static_resource_controller.go#L55-L59. Meaning that we would allow for any condition to be defined to gate the revision process.

Also, we should pass on a context in that method and propagate it all the way down in case we need to cancel it for some reasons. Especially in https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/operator/ceohelpers/bootstrap.go#L139 and maybe some other functions that don't take it yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea, @jubittajohn can we create it slightly different like this?

type StaticPodInstallerPreconditionsFuncType func(ctx context.Context) (bool, error)

and then the struct value could look like this

installPrecondition StaticPodInstallerPreconditionsFuncType

that way is more consistent with the other packages.

// returns whether or not requeue and if an error happened while creating installer pod
func (c *InstallerController) ensureInstallerPod(ctx context.Context, operatorSpec *operatorv1.StaticPodOperatorSpec, ns *operatorv1.NodeStatus) (bool, error) {
// checks if new revision should be rolled out
if c.shouldRevisionInstall != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this check should be moved to manageInstallationPods because this function is supposed to be only about creating the installer pod.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could maybe even move it inside the sync function before the call to manageInstallationPods since we want to skip the installation process entirely when the preconditions are not fulfilled.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah we thought so, but it's better to continue the current rollout. See #1749 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. it made me notice that we might be subject to the following issue: #1749 (comment)

@jubittajohn jubittajohn changed the title ETCD-612: Added the callback shouldRevisionInstall to provide additional conditions for installation ETCD-612: Added a callback to provide additional pre-conditions for installation Jun 28, 2024
@dusk125
Copy link
Contributor

dusk125 commented Jun 28, 2024

lgtm, but I'll let Damien have the last word

return true, err
}
if !shouldInstall {
return true, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's imagine the following scenario:
t0: sync for a new revision occurs, preconditions are true, the installer pod is created
t1: preconditions became false while installation is still in progress
t2: sync happens again, ensureInstallerPod returns true for requeuing because the installer pod can't be installed due to the preconditions not being met.

When that happens, we return early in

if requeue {
klog.V(4).Infof("Requeuing the creation of installer pod for revision %d on node %q", currNodeState.TargetRevision, currNodeState.NodeName)
return true, 0, nil
}
which means that the statuses will not be updated unless the preconditions are met again.

To prevent the whole flow from being dependent on the preconditions, we should make sure that the preconditions are only evaluated when the installer pod isn't already present.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe I'm off, but wouldn't t2 not go through the ensureInstallerPod branch? the previous if condition should be true?

if operatorStatus.LatestAvailableRevision > currNodeState.TargetRevision {
				// no backoff if new revision is pending
			}

@jubittajohn can we add a unit test for the condition that @dgrisonnet mentioned? I think he has a valid point that we definitely must update the node status for ongoing installations, even when the precondition is false.

Copy link
Author

@jubittajohn jubittajohn Jul 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dgrisonnet As pointed out , ensureInstallerPod returns true at t2.
@tjungblu The if condition mentioned would evaluate to false at t2 allowing the installer creation process to continue. The mentioned condition only evaluates to true when a later revision is pending compared to the one currently being rolled out. Therefore, the ensureInstallerPod is invoked.

To address this, I have added a check as suggested to ensure that preconditions are only evaluated when the installer pod isn't already present. Additionally, I have added a unit test for this scenario to confirm the behavior.

Could you please review the new changes ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, thanks for the test case :)

return true, err
}
if !shouldInstall {
return true, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we leave a log statement here, too, please?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return true, nil
klog.Infof("Skipping the creation of installer pod [%s] because of precondition check", installerPodName)
return true, nil

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just as an example, feel free to find a more fitting log statement

Signed-off-by: jubittajohn <jujohn@redhat.com>

Added unit test

Signed-off-by: jubittajohn <jujohn@redhat.com>

Preconditions are only evaluated when the installer pod isn't already present

Signed-off-by: jubittajohn <jujohn@redhat.com>
Copy link
Contributor

openshift-ci bot commented Jul 9, 2024

@jubittajohn: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants