Conversation

@joelanford
Member

Somewhat scatter-brained refactor/fixup PR for random things I found. I'll clean this up more if desired.

Signed-off-by: Joe Lanford <joe.lanford@gmail.com>

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 7, 2022
@openshift-ci
Contributor

openshift-ci bot commented Sep 7, 2022

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@joelanford
Member Author

/test all

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 7, 2022
@timflannagan
Contributor

Looks reasonable to me at a glance. This will fail the linting check due to the package import grouping, and I added some unit tests for when the TypeApplied condition is nil, so you'd have to update those as well.

@joelanford
Member Author

/test all

@joelanford
Member Author

/test all

// In the case that the PO resource is expressing failing states, then an error
// will be returned to reflect that.
func inspectPlatformOperator(po platformv1alpha1.PlatformOperator) error {
if equality.Semantic.DeepEqual(po.Status, platformv1alpha1.PlatformOperatorStatus{}) {


@joelanford Hi, when equality.Semantic.DeepEqual(po.Status, platformv1alpha1.PlatformOperatorStatus{}) returns true, does it mean the PO status part is empty?
If my understanding is correct, I don't understand why we return nil here: in InspectPlatformOperators, returning nil means the PO is healthy, so no error is reported to the aggregated CO. But we don't know the PO's final status yet (the PO may ultimately fail to install). So maybe we should return buildPOFailureMessage(po.GetName(), "PO has no status field") here instead.

(I just worry that before the PO installation is done (no final status yet), we report wrong information to the aggregated CO. Thanks for checking my concern.)

Member Author

@joelanford joelanford Sep 7, 2022


does it mean the PO status part is empty?

Yes

I don't understand why we return nil here

@timflannagan and I were chatting about this yesterday. In general, a PO with no status is a PO that has yet to be reconciled by the PO controller, and that typically happens only when the PO has just been created. When the PO controller reconciles the PO, it'll attach a status, and then this controller will reconcile again.

I wanted to avoid the aggregated CO going into a failing state temporarily every time a new PO shows up.

One thought I had to guard against the possibility of the PO controller never actually reconciling the PO: we could check that the status is empty and the creationTimestamp is within the last minute.

Which would result in:

  1. Empty status within a minute of being created: no problem, return nil.
  2. Empty status after a minute of being created: something's wrong, return an error.

WDYT?
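
For concreteness, here's a minimal sketch of that guard in Go. It's not code from this PR: it borrows the platformv1alpha1 types and equality check quoted above plus the buildPOFailureMessage helper mentioned earlier in this thread, and the one-minute grace period is just the value floated here.

import (
    "time"

    "k8s.io/apimachinery/pkg/api/equality"
)

func inspectPlatformOperator(po platformv1alpha1.PlatformOperator) error {
    if equality.Semantic.DeepEqual(po.Status, platformv1alpha1.PlatformOperatorStatus{}) {
        // A brand-new PO hasn't been reconciled yet, so an empty status
        // is expected; give the PO controller a minute to attach one.
        if time.Since(po.CreationTimestamp.Time) < time.Minute {
            return nil
        }
        // Still no status after the grace period: surface it as a failure.
        return buildPOFailureMessage(po.GetName(), "PO has no status field")
    }
    // ... existing inspection of the populated status continues here ...
    return nil
}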

Contributor


@joelanford Should we add a comment that basically explains what you had written down here? I should've done that in the PR that refactored these checks.

In general, I think that option you outlined seems reasonable at a glance but introduces potential complexity.

Contributor


Thinking about this more, in the current implementation I think we'll see flapping in the aggregate ClusterOperator resource we're managing. I'm working on the POM component's implementation, which is responsible for bubbling up the underlying BundleDeployment status. In the case where the BD resource is waiting to be unpacked, we'll effectively see there's no TypeApplied status condition type present yet, and propagate the ApplyPending status condition reason in the PO resource. The aggregate CO controller would see that, and mark the aggregate CO resource with the following:

$ k get co -w
...
platform-operators-aggregated                        False       True          False      0s      encountered the failing cincinnati-operatorrwxr4 platform operator with reason "ApplyPending"
...
platform-operators-aggregated                        False       True          False      0s      encountered the failing cincinnati-operatorrwxr4 platform operator with reason "ApplyPending"
platform-operators-aggregated                        True        True          False      0s      

Which feels like the wrong behavior, UX, etc.


@joelanford Thanks for checking my comment.

Your idea is okay, but it adds some complexity (same concern as timflannagan's).
Maybe we just change return nil to return buildPOFailureMessage(po.GetName(), platformtypes.ReasonApplyPending), since we could treat "not yet reconciled" as ApplyPending.

Anyway, I'm also OK if you want to go with your one-minute idea.

Member Author


My main concern is that it seems misleading and alarmist to say a PO is failing when in fact it has just been created and is waiting for the PO controller to do something with it.

In the most recent commit, I've updated to include the "time since creationTimestamp" check. But if there's another approach that lets us avoid that complexity AND avoid saying the PO is failing, I'm totally willing to get rid of that.

@joelanford joelanford marked this pull request as ready for review September 7, 2022 12:19
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 7, 2022
Contributor

@tylerslaton tylerslaton left a comment


Nice work!

/lgtm

coBuilder.WithProgressing(openshiftconfigv1.ConditionTrue, "No POs are present in the cluster")

// TODO: cleanup condition reasons.
// 1. use constants for condition reasons.
Contributor


We have some condition constants already defined, but it doesn't seem like we're using them here for some reason. Maybe we just need to add to that list?

ReasonSourceFailed = "SourceFailed"
ReasonApplyFailed = "ApplyFailed"
ReasonApplySuccessful = "ApplySuccessful"
ReasonApplyPending = "ApplyPending"

Member Author


Those are reasons for the PO Applied condition type. These are reasons for the ClusterOperator condition types.

// to reflect those failing PO resources.
if statusErrorCheck := util.InspectPlatformOperators(poList); statusErrorCheck != nil {
coBuilder.WithAvailable(openshiftconfigv1.ConditionFalse, statusErrorCheck.Error(), "POError")
coBuilder.WithAvailable(metav1.ConditionFalse, "POError", statusErrorCheck.Error())
Contributor


In general, should we be using the apimachinery APIs for this instead of the OpenShift ones? Mainly asking to understand how you decided to change this.

Member Author


They're both definitions of basically the exact same thing. I changed this mainly because apimachinery has some nice helpers for interacting with conditions that help us avoid maintaining complexity around the "rules" of conditions, like knowing when to bump the last transition time.
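
As a quick illustration of those helpers (not code from this PR), the meta package in apimachinery encapsulates that rule: SetStatusCondition only stamps LastTransitionTime when a condition's Status actually changes. A small self-contained example, with the bare conditions slice standing in for a real status struct:

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/api/meta"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
    // Stand-in conditions slice; a real controller would pass the
    // conditions field of its status struct.
    var conditions []metav1.Condition

    // Adds the condition and sets LastTransitionTime for us.
    meta.SetStatusCondition(&conditions, metav1.Condition{
        Type:    "Available",
        Status:  metav1.ConditionTrue,
        Reason:  "AsExpected",
        Message: "All platform operators are in a successful state",
    })

    // Re-applying the same Status later leaves LastTransitionTime alone;
    // flipping it to ConditionFalse would bump it.
    fmt.Println(meta.IsStatusConditionTrue(conditions, "Available")) // true
}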

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 7, 2022
@timflannagan
Contributor

lgtm

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 10, 2022
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 13, 2022
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 13, 2022
@joelanford
Member Author

/retest

@tylerslaton
Contributor

/lgtm

// will be returned to reflect that.
func inspectPlatformOperator(po platformv1alpha1.PlatformOperator) error {
if equality.Semantic.DeepEqual(po.Status, platformv1alpha1.PlatformOperatorStatus{}) &&
(po.CreationTimestamp.IsZero() || time.Since(po.CreationTimestamp.Time) < time.Minute) {
Member Author


Just want to highlight that my rebase included this addition based on the earlier discussion here: #40 (comment)

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 13, 2022
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 13, 2022
@timflannagan
Contributor

The e2e results are a no-op until #58 merges.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 27, 2022
@joelanford
Member Author

Another potential TODO is updating the aggregate CO implementation to re-create that resource when it no longer exists.

Based on what Trevor said yesterday, I kinda think we leave it up to CVO to recreate the CO.

Contributor

@tylerslaton tylerslaton left a comment


Overall this still LGTM; I just had some very minor nits and a comment.

defer func() {
if err := coWriter.UpdateStatus(ctx, aggregatedCO, coBuilder.GetStatus()); err != nil {
log.Error(err, "error updating CO status")
log.Error(err, "error updating cluster operator status")
Contributor


nit:

Suggested change
log.Error(err, "error updating cluster operator status")
log.Error(err, "error updating ClusterOperator status")

Applies to all other instances if we want to change this.

}
coBuilder.WithAvailable(configv1.ConditionTrue, "All POs in a successful state", "POsHealthy")
coBuilder.WithProgressing(configv1.ConditionFalse, "All POs in a successful state")
coBuilder.WithAvailable(metav1.ConditionTrue, clusteroperator.ReasonAsExpected, "All platform operators are in a successful state")
Contributor


nit:

Suggested change
coBuilder.WithAvailable(metav1.ConditionTrue, clusteroperator.ReasonAsExpected, "All platform operators are in a successful state")
coBuilder.WithAvailable(metav1.ConditionTrue, clusteroperator.ReasonAsExpected, "All PlatformOperators are in a successful state")

Applies to all other instances if we want to fix this.

Comment on lines +96 to +97
// TODO: consider something more fine-grained than a catch-all "PlatformOperatorError" reason.
// There's a non-negligible difference between "PO is explicitly failing installation" and "PO is not yet installed"
Contributor


Maybe we could define these two cases as reasons and then bubble them up via the util.InspectPlatformOperators() return? That way we can set that reason inside this WithAvailable() call and have a less generic "PlatformOperatorError" reason.
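
A sketch of what that could look like, with the error type and reason strings invented here for illustration (coBuilder, util.InspectPlatformOperators, and WithAvailable are the names from this PR's diff):

// Hypothetical error type that carries a condition reason alongside the
// human-readable message.
type poInspectionError struct {
    reason  string // e.g. "PlatformOperatorInstallFailed" vs. "PlatformOperatorInstallPending"
    message string
}

func (e *poInspectionError) Error() string { return e.message }

// Caller side, inside the reconcile shown in the diff above:
if err := util.InspectPlatformOperators(poList); err != nil {
    reason := "PlatformOperatorError" // generic fallback
    var poErr *poInspectionError
    if errors.As(err, &poErr) {
        reason = poErr.reason
    }
    coBuilder.WithAvailable(metav1.ConditionFalse, reason, err.Error())
}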

@tylerslaton
Contributor

/lgtm

@openshift-ci
Contributor

openshift-ci bot commented Sep 27, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: joelanford, timflannagan, tylerslaton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [joelanford,timflannagan]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@timflannagan
Contributor

Looks like some of the e2e tests need to be updated too.

/lgtm cancel

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 27, 2022
@timflannagan
Contributor

{Timed out after 60.001s.
Expected
    <string>: All platform operators are in a successful state
to contain substring
    <string>: All platform operators in a successful state
failed /go/src/github.com/openshift/platform-operators/test/e2e/aggregated_clusteroperator_test.go:111

Looks like we're missing a couple more test cases 🙃.

@timflannagan
Contributor

#68 should also help increase visibility into those test case failures, given our current preference for gomega pattern matcher implementations.

@joelanford
Member Author

🤮 this is what I get for trying to fix these without actually looking too closely at the failures.

Signed-off-by: Joe Lanford <joe.lanford@gmail.com>
@timflannagan
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 28, 2022
@joelanford
Member Author

Meta-question: Is there a process I'm not aware of to merge non-bug/non-feature PRs? The label requirements seem to imply one or the other.

@openshift-ci
Contributor

openshift-ci bot commented Sep 28, 2022

@joelanford: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e b11e258 link false /test e2e
ci/prow/e2e-techpreview ae3ed2b link false /test e2e-techpreview
ci/prow/e2e-aws-techpreview e080b5e link false /test e2e-aws-techpreview

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@timflannagan
Contributor

We're doing the TE slides, so I'm manually adding the px-approved label after a Slack conversation in the #tmp-platform-operators channel.

/label px-approved

Holding, as I think we need QE approval given we're modifying the expected status condition reasons/messages/etc., and it's unclear whether that will break any automated testing.

/hold

@openshift-ci openshift-ci bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. px-approved Signifies that Product Support has signed off on this PR labels Sep 29, 2022
@timflannagan
Contributor

This should also be a no-op for docs. Manually adding that label.

/label docs-approved

@openshift-ci openshift-ci bot added the docs-approved Signifies that Docs has signed off on this PR label Sep 29, 2022
@jianzhangbjz
Member

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Sep 29, 2022
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 29, 2022
@openshift-merge-robot
Contributor

@joelanford: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@timflannagan
Contributor

Superseding this in #75 given the merge conflicts.

/close

@openshift-ci openshift-ci bot closed this Sep 30, 2022
@openshift-ci
Contributor

openshift-ci bot commented Sep 30, 2022

@timflannagan: Closed this PR.

In response to this:

Superseding this in #75 given the merge conflicts.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
