Conversation

@joelanford
Member

Somewhat scatter-brained refactor/fixup PR for random things I found. I'll clean this up more if desired.

Signed-off-by: Joe Lanford <joe.lanford@gmail.com>

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 7, 2022
@openshift-ci
Contributor

openshift-ci bot commented Sep 7, 2022

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@joelanford
Member Author

/test all

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 7, 2022
@timflannagan
Contributor

Looks reasonable to me at a glance. This will fail the linting check due to the package import grouping, and I added some unit tests for when the TypeApplied condition is nil, so you'd have to update those as well.

@joelanford
Member Author

/test all

@joelanford
Member Author

/test all

// In the case that the PO resource is expressing failing states, then an error
// will be returned to reflect that.
func inspectPlatformOperator(po platformv1alpha1.PlatformOperator) error {
if equality.Semantic.DeepEqual(po.Status, platformv1alpha1.PlatformOperatorStatus{}) {


@joelanford Hi, when equality.Semantic.DeepEqual(po.Status, platformv1alpha1.PlatformOperatorStatus{}) returns true, does it mean the PO status part is empty?
If my understanding is correct, I don't understand why we return nil here: in InspectPlatformOperators, returning nil means the PO is healthy, so no error is reported to the aggregated CO. But we don't know the PO's final status yet (the PO may ultimately fail to install). So maybe we should return buildPOFailureMessage(po.GetName(), "PO has no status field") here instead.

(I just worry that before the PO installation is done (no final status yet), we report wrong information to the aggregated CO. Thanks for checking my concern.)

Member Author

@joelanford joelanford Sep 7, 2022


does it mean the PO status part is empty?

Yes

I don't understand why we return nil here

@timflannagan and I were chatting about this yesterday. In general, a PO with no status is a PO that has yet to be reconciled by the PO controller, and that typically happens only when the PO has just been created. When the PO controller reconciles the PO, it'll attach a status, and then this controller will reconcile again.

I wanted to avoid the aggregated CO going into a failing state temporarily every time a new PO shows up.

One thought I had to guard against the possibility of the PO controller never actually reconciling the PO: we could check that the status is empty and the creationTimestamp is within the last minute.

Which would result in:

  1. Empty status within a minute of being created: no problem, return nil.
  2. Empty status after a minute of being created: something's wrong, return an error.

WDYT?
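
For concreteness, here's a minimal sketch of that guard in Go. It's not code from this PR: it borrows the platformv1alpha1 types and equality check quoted above plus the buildPOFailureMessage helper mentioned earlier in this thread, and the one-minute grace period is just the value floated here.

import (
    "time"

    "k8s.io/apimachinery/pkg/api/equality"
)

func inspectPlatformOperator(po platformv1alpha1.PlatformOperator) error {
    if equality.Semantic.DeepEqual(po.Status, platformv1alpha1.PlatformOperatorStatus{}) {
        // A brand-new PO hasn't been reconciled yet, so an empty status
        // is expected; give the PO controller a minute to attach one.
        if time.Since(po.CreationTimestamp.Time) < time.Minute {
            return nil
        }
        // Still no status after the grace period: surface it as a failure.
        return buildPOFailureMessage(po.GetName(), "PO has no status field")
    }
    // ... existing inspection of the populated status continues here ...
    return nil
}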

Contributor


@joelanford Should we add a comment that basically explains what you had written down here? I should've done that in the PR that refactored these checks.

In general, I think that option you outlined seems reasonable at a glance but introduces potential complexity.

Contributor


Thinking about this more, in the current implementation I think we'll see flapping in the aggregate ClusterOperator resource we're managing. I'm working on the POM component's implementation, which is responsible for bubbling up the underlying BundleDeployment status. In the case where the BD resource is waiting to be unpacked, we'll effectively see there's no TypeApplied status condition type present yet, and propagate the ApplyPending status condition reason in the PO resource. The aggregate CO controller would see that, and mark the aggregate CO resource with the following:

$ k get co -w
...
platform-operators-aggregated                        False       True          False      0s      encountered the failing cincinnati-operatorrwxr4 platform operator with reason "ApplyPending"
...
platform-operators-aggregated                        False       True          False      0s      encountered the failing cincinnati-operatorrwxr4 platform operator with reason "ApplyPending"
platform-operators-aggregated                        True        True          False      0s      

Which feels like the wrong behavior, UX, etc.


@joelanford Thanks for checking my comment.

Your idea is okay, but it adds some complexity (same concern as timflannagan's).
Maybe we just change return nil to return buildPOFailureMessage(po.GetName(), platformtypes.ReasonApplyPending), since we could treat "not yet reconciled" as ApplyPending.

Anyway, I'm also OK if you want to go with your one-minute idea.

Member Author


My main concern is that it seems misleading and alarmist to say a PO is failing when in fact it has just been created and is waiting for the PO controller to do something with it.

In the most recent commit, I've updated to include the "time since creationTimestamp" check. But if there's another approach that lets us avoid that complexity AND avoid saying the PO is failing, I'm totally willing to get rid of that.

@joelanford joelanford marked this pull request as ready for review September 7, 2022 12:19
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 7, 2022
Contributor

@tylerslaton tylerslaton left a comment


Nice work!

/lgtm

coBuilder.WithProgressing(openshiftconfigv1.ConditionTrue, "No POs are present in the cluster")

// TODO: cleanup condition reasons.
// 1. use constants for condition reasons.
Contributor


We have some condition constants already defined, but it doesn't seem like we're using them here for some reason. Maybe we just need to add to that list?

ReasonSourceFailed = "SourceFailed"
ReasonApplyFailed = "ApplyFailed"
ReasonApplySuccessful = "ApplySuccessful"
ReasonApplyPending = "ApplyPending"

Member Author


Those are reasons for the PO Applied condition type. These are reasons for the ClusterOperator condition types.

// to reflect those failing PO resources.
if statusErrorCheck := util.InspectPlatformOperators(poList); statusErrorCheck != nil {
coBuilder.WithAvailable(openshiftconfigv1.ConditionFalse, statusErrorCheck.Error(), "POError")
coBuilder.WithAvailable(metav1.ConditionFalse, "POError", statusErrorCheck.Error())
Contributor


In general, should we be using the apimachinery APIs for this instead of the OpenShift ones? Mainly asking to understand how you decided to change this.

Member Author


They're both definitions of basically the exact same thing. I changed this mainly because apimachinery has some nice helpers for interacting with conditions that help us avoid maintaining complexity around the "rules" of conditions, like knowing when to bump the last transition time.
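
As a quick illustration of those helpers (not code from this PR), the meta package in apimachinery encapsulates that rule: SetStatusCondition only stamps LastTransitionTime when a condition's Status actually changes. A small self-contained example, with the bare conditions slice standing in for a real status struct:

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/api/meta"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
    // Stand-in conditions slice; a real controller would pass the
    // conditions field of its status struct.
    var conditions []metav1.Condition

    // Adds the condition and sets LastTransitionTime for us.
    meta.SetStatusCondition(&conditions, metav1.Condition{
        Type:    "Available",
        Status:  metav1.ConditionTrue,
        Reason:  "AsExpected",
        Message: "All platform operators are in a successful state",
    })

    // Re-applying the same Status later leaves LastTransitionTime alone;
    // flipping it to ConditionFalse would bump it.
    fmt.Println(meta.IsStatusConditionTrue(conditions, "Available")) // true
}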

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 7, 2022
@timflannagan
Contributor

lgtm

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 10, 2022
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 13, 2022
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 13, 2022
@joelanford
Member Author

/retest

@tylerslaton
Contributor

/lgtm

// will be returned to reflect that.
func inspectPlatformOperator(po platformv1alpha1.PlatformOperator) error {
if equality.Semantic.DeepEqual(po.Status, platformv1alpha1.PlatformOperatorStatus{}) &&
(po.CreationTimestamp.IsZero() || time.Since(po.CreationTimestamp.Time) < time.Minute) {
Member Author


Just want to highlight that my rebase included this addition based on the earlier discussion here: #40 (comment)

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 13, 2022
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 13, 2022
@timflannagan
Contributor

The e2e results are a no-op until #58 merges.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 27, 2022
@joelanford
Member Author

Another potential TODO is updating the aggregate CO implementation to re-create that resource when it no longer exists.

Based on what Trevor said yesterday, I kinda think we leave it up to CVO to recreate the CO.

Contributor

@tylerslaton tylerslaton left a comment


Overall this still LGTM; I just had some very minor nits and a comment.

defer func() {
if err := coWriter.UpdateStatus(ctx, aggregatedCO, coBuilder.GetStatus()); err != nil {
log.Error(err, "error updating CO status")
log.Error(err, "error updating cluster operator status")
Contributor


nit:

Suggested change
log.Error(err, "error updating cluster operator status")
log.Error(err, "error updating ClusterOperator status")

Applies to all other instances if we want to change this.

}
coBuilder.WithAvailable(configv1.ConditionTrue, "All POs in a successful state", "POsHealthy")
coBuilder.WithProgressing(configv1.ConditionFalse, "All POs in a successful state")
coBuilder.WithAvailable(metav1.ConditionTrue, clusteroperator.ReasonAsExpected, "All platform operators are in a successful state")
Contributor


nit:

Suggested change
coBuilder.WithAvailable(metav1.ConditionTrue, clusteroperator.ReasonAsExpected, "All platform operators are in a successful state")
coBuilder.WithAvailable(metav1.ConditionTrue, clusteroperator.ReasonAsExpected, "All PlatformOperators are in a successful state")

Applies to all other instances if we want to fix this.

Comment on lines +96 to +97
// TODO: consider something more fine-grained than a catch-all "PlatformOperatorError" reason.
// There's a non-negligible difference between "PO is explicitly failing installation" and "PO is not yet installed"
Contributor


Maybe we could define these two cases as reasons and then bubble them up via the util.InspectPlatformOperators() return? That way we can set that reason inside this WithAvailable() call and have a less generic "PlatformOperatorError" reason.
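
A sketch of what that could look like, with the error type and reason strings invented here for illustration (coBuilder, util.InspectPlatformOperators, and WithAvailable are the names from this PR's diff):

// Hypothetical error type that carries a condition reason alongside the
// human-readable message.
type poInspectionError struct {
    reason  string // e.g. "PlatformOperatorInstallFailed" vs. "PlatformOperatorInstallPending"
    message string
}

func (e *poInspectionError) Error() string { return e.message }

// Caller side, inside the reconcile shown in the diff above:
if err := util.InspectPlatformOperators(poList); err != nil {
    reason := "PlatformOperatorError" // generic fallback
    var poErr *poInspectionError
    if errors.As(err, &poErr) {
        reason = poErr.reason
    }
    coBuilder.WithAvailable(metav1.ConditionFalse, reason, err.Error())
}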

@tylerslaton
Contributor

/lgtm

@openshift-ci
Contributor

openshift-ci bot commented Sep 27, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: joelanford, timflannagan, tylerslaton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [joelanford,timflannagan]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@timflannagan
Contributor

Looks like some of the e2e tests need to be updated too.

/lgtm cancel

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 27, 2022
@timflannagan
Contributor

{Timed out after 60.001s.
Expected
    <string>: All platform operators are in a successful state
to contain substring
    <string>: All platform operators in a successful state
failed /go/src/github.com/openshift/platform-operators/test/e2e/aggregated_clusteroperator_test.go:111

Looks like we're missing a couple more test cases 🙃.

@timflannagan
Contributor

#68 should also help increase visibility into those test case failures, given our current preference for gomega pattern matcher implementations.

@joelanford
Member Author

🤮 this is what I get for trying to fix these without actually looking too closely at the failures.

Signed-off-by: Joe Lanford <joe.lanford@gmail.com>
@timflannagan
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 28, 2022
@joelanford
Member Author

Meta-question: Is there a process I'm not aware of to merge non-bug/non-feature PRs? The label requirements seem to imply one or the other.

@openshift-ci
Contributor

openshift-ci bot commented Sep 28, 2022

@joelanford: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e b11e258 link false /test e2e
ci/prow/e2e-techpreview ae3ed2b link false /test e2e-techpreview
ci/prow/e2e-aws-techpreview e080b5e link false /test e2e-aws-techpreview

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@timflannagan
Contributor

We're doing the TE slides, so I'm manually adding the px-approved label after a Slack conversation in the #tmp-platform-operators channel.

/label px-approved

Holding, as I think we need QE approval given we're modifying the expected status condition reasons/messages/etc., and it's unclear whether that will break any automated testing.

/hold

@openshift-ci openshift-ci bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. px-approved Signifies that Product Support has signed off on this PR labels Sep 29, 2022
@timflannagan
Contributor

This should also be a no-op for docs. Manually adding that label.

/label docs-approved

@openshift-ci openshift-ci bot added the docs-approved Signifies that Docs has signed off on this PR label Sep 29, 2022
@jianzhangbjz
Member

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Sep 29, 2022
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 29, 2022
@openshift-merge-robot
Contributor

@joelanford: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@timflannagan
Contributor

Superseding this in #75 given the merge conflicts.

/close

@openshift-ci openshift-ci bot closed this Sep 30, 2022
@openshift-ci
Contributor

openshift-ci bot commented Sep 30, 2022

@timflannagan: Closed this PR.

In response to this:

Superseding this in #75 given the merge conflicts.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
