
status: Report the operators that have not yet deployed #158

Merged

Conversation

smarterclayton
Contributor

@smarterclayton smarterclayton commented Apr 9, 2019

The current single error return strategy from the CVO sync loop predates
parallel payload execution and limited retries. Instead, collect all
errors from the task execution graph and attempt to synthesize better
messages that describe what is actually happening.

1. Filter out cancellation error messages - they aren't useful and are
   a normal part of execution
2. When multiple errors are reported, display a reasonable multi-line
   error that summarizes any blockers
3. Treat ClusterOperatorNotAvailable as a special case - if all errors
   reported are of that type, convert it to ClusterOperatorsNotAvailable
   and synthesize a better message (see the sketch after the example below)
4. In the sync loop, if we are still making progress towards the update
   goal and we haven't waited too long for an update, and if the error
   is the specific cluster operator not available types, display the
   condition Progressing=True instead of Failing=True with a synthetic
   message.

This also passes along the task with the UpdateError so that we can do
more selective error messages for specific error cases.

This still needs a few more tests, but for Progressing it reports:

Working towards 1.0.0-abc: 33% complete, waiting on operator-1, operator-2
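
For illustration only (an editor's sketch, not the PR's actual code), a minimal standalone Go program showing the kind of summarization items 2 and 3 above describe: if every error is a not-yet-available cluster operator, synthesize a single "waiting on ..." message; otherwise emit a multi-line blocker summary. The taskError type and the reason string are assumptions.

// errsummary.go - illustrative sketch of multi-error summarization.
package main

import (
    "fmt"
    "sort"
    "strings"
)

// taskError is a stand-in for the CVO's per-task error; field names are assumed.
type taskError struct {
    Reason string // e.g. "ClusterOperatorNotAvailable"
    Name   string // operator name
    Msg    string
}

// summarize collapses per-task errors into a single error message.
func summarize(errs []taskError) error {
    if len(errs) == 0 {
        return nil
    }
    names := make([]string, 0, len(errs))
    allNotAvailable := true
    for _, e := range errs {
        if e.Reason != "ClusterOperatorNotAvailable" {
            allNotAvailable = false
            break
        }
        names = append(names, e.Name)
    }
    if allNotAvailable {
        sort.Strings(names)
        return fmt.Errorf("some cluster operators have not yet rolled out: waiting on %s",
            strings.Join(names, ", "))
    }
    lines := []string{"Multiple errors are preventing progress:"}
    for _, e := range errs {
        lines = append(lines, "* "+e.Msg)
    }
    return fmt.Errorf("%s", strings.Join(lines, "\n"))
}

func main() {
    fmt.Println(summarize([]taskError{
        {Reason: "ClusterOperatorNotAvailable", Name: "operator-2", Msg: "Cluster operator operator-2 has not yet reported success"},
        {Reason: "ClusterOperatorNotAvailable", Name: "operator-1", Msg: "Cluster operator operator-1 has not yet reported success"},
    }))
}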

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 9, 2019
@wking
Member

wking commented Apr 9, 2019

The additional information looks good, but it's not always there:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/158/pull-ci-openshift-cluster-version-operator-master-e2e-aws/548/artifacts/e2e-aws/installer/.openshift_install.log | grep 'luster .* initialize'
time="2019-04-09T05:33:50Z" level=info msg="Waiting up to 30m0s for the cluster at https://api.ci-op-sh9t14qk-928f7.origin-ci-int-aws.dev.rhcloud.com:6443 to initialize..."
time="2019-04-09T05:33:50Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 83% complete"
time="2019-04-09T05:35:35Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 85% complete"
time="2019-04-09T05:35:48Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 90% complete"
time="2019-04-09T05:36:20Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 91% complete"
time="2019-04-09T05:37:21Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 91% complete"
time="2019-04-09T05:39:35Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 92% complete"
time="2019-04-09T05:40:51Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 92% complete"
time="2019-04-09T05:49:05Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 98% complete"
time="2019-04-09T05:49:20Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 98% complete, waiting on authentication, image-registry, ingress, marketplace, monitoring"
time="2019-04-09T05:50:35Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 98% complete"
time="2019-04-09T05:53:35Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 99% complete"
time="2019-04-09T05:54:05Z" level=debug msg="Cluster is initialized"

If the installer had decided to timeout at 5:53:35, we wouldn't have been able to point at any slow operators.

@smarterclayton
Contributor Author

If the installer had decided to timeout at 5:53:35, we wouldn't have been able to point at any slow operators.

For that to happen we'd have to have been going for 30m. For install with a timeout that's fine, because right now anything taking longer than ~15 minutes is wedged and would always report. We reset because we started making progress. I think if we're making progress we don't want to report.

We can cut the sync interval on install - will add that as a new commit.

This will report incremental operator status much faster to the
progressing condition, including the new operators waiting message.
@wking
Member

wking commented Apr 9, 2019

For that to happen we'd have to have been going for 30m. For install with a timeout that's fine, because right now anything taking longer than ~15 is wedged and would always report. We reset because we started making progress. I think if we're making progress we don't want to report.

So is that "by the time we hit our 30m timeout, we will have stabilized in a wedged state"? Previous experience like this shows that sometimes things stick for a while and then eventually shake loose, but not soon enough to get in under the installer timeout. That particular case predates #141, so maybe we are confident that any cluster that isn't ready in 30 minutes will still be hung and reporting via this PR's code?

@smarterclayton
Contributor Author

The installer failing at 30m but then eventually recovering is only a problem for CI, and since it's almost always a bug in an operator and we already gather cluster operator status into artifacts, I'm not sure the message is the important part. It's 10 seconds per failure to look at the failing operator. I'd say this is addressing the experiential "an admin eventually sees the failing job show up once we settle on the wedge".

@smarterclayton
Contributor Author

other feedback? @abhinavdahiya ?

@sdodson
Member

sdodson commented Apr 10, 2019

The installer failing at 30m but then eventually recovering is only a problem for CI

I don't understand that. We've had multiple users encounter just that and report bugs on that scenario.

@smarterclayton
Contributor Author

Those are caused by bugs. The bugs are debuggable in CI. For a user, they should also be debuggable as I noted with this change, because we won't be making progress at 30 minutes. If they're timing out because of factors beyond our control, such as pull speed, then the installer messaging needs to change to more carefully communicate that we are stopping waiting but that the install may still complete - especially if it isn't marked as failed and is still progressing.

return nil
}

nested := make([]error, 0, len(errs))

ques: why create new nested error list?
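
(Editor's sketch of the pattern the question refers to, as an assumption about intent rather than the PR's actual code: a fresh slice lets each raw error be wrapped with extra context before aggregation, without mutating the caller's errs.)

package main

import (
    "errors"
    "fmt"
)

func main() {
    errs := []error{errors.New("operator-1 not available"), errors.New("operator-2 degraded")}
    // Copy each error into a new slice, wrapping it with task context on the way,
    // so the original errs slice is left untouched.
    nested := make([]error, 0, len(errs))
    for _, err := range errs {
        nested = append(nested, fmt.Errorf("task failed: %w", err))
    }
    fmt.Println(errors.Join(nested...)) // errors.Join (Go 1.20+) is used here only for the demo
}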

@abhinavdahiya
Contributor

c2ac20f is LGTM; I like the idea of summarizing the errors.

And 3c2cf46 makes sense for now as a stopgap.

@abhinavdahiya
Contributor

This makes the messages better for CVO and therefore this is:
/approve

but to @wking and @sdodson 's point:
in the case where the CVO hasn't reached level after 30 minutes and is still progressing, we (as in the installer) would need to inform the user either that the CVO is still progressing and they might have to wait longer using wait-for, or that the CVO is failing, and pass along the failure reported by the CVO.

@wking
Member

wking commented Apr 11, 2019

/lgtm

We can file follow-ups if/when we can agree on them ;).

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 11, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, smarterclayton, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,smarterclayton,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wking
Member

wking commented Apr 11, 2019

integration:

start_integration_test.go:244: cluster version never became available: operator listed as failing (MultipleErrors): Multiple errors are preventing progress:
		* Could not update configmap "e2e-cvo-s5m9/config1" (1 of 2): the server has forbidden updates to this resource
		* Could not update configmap "e2e-cvo-s5m9/config2" (2 of 2): the server has forbidden updates to this resource

/test integration

@smarterclayton
Contributor Author

Heh, that's the actual new error in the integration test. Fixing.

I think we can file something on installer to double check the output at the end of a release.

@smarterclayton
Contributor Author

Hrm... this looks more like the master failed or lost data.

/retest

@wking
Member

wking commented Apr 11, 2019

I think we can file something on installer to double check the output at the end of a release.

I'll work that into openshift/installer#1413 tonight. Should we stop waiting if the CVO ever reports Failing?

@smarterclayton
Contributor Author

smarterclayton commented Apr 11, 2019

We should probably just continue until the timeout. I think we could maybe record the last failing condition we see along the way, and if we are currently not failing but have timed out, just report "we failed earlier".
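
A hypothetical installer-side sketch of that idea (every name here is an assumption for illustration, not openshift/installer code): remember the last Failing=True condition seen while polling, and surface it if the wait times out while the cluster is not currently failing.

package main

import (
    "errors"
    "fmt"
    "time"
)

type condition struct {
    Type    string // e.g. "Available", "Failing"
    Status  string // "True" or "False"
    Message string
}

// waitForCluster polls until Available=True or the timeout expires, remembering
// the most recent Failing=True condition so it can be reported on timeout.
func waitForCluster(timeout time.Duration, poll func() []condition) error {
    var lastFailing *condition
    deadline := time.Now().Add(timeout)
    for time.Now().Before(deadline) {
        for _, c := range poll() {
            c := c
            if c.Type == "Failing" && c.Status == "True" {
                lastFailing = &c
            }
            if c.Type == "Available" && c.Status == "True" {
                return nil
            }
        }
        time.Sleep(time.Second)
    }
    if lastFailing != nil {
        return fmt.Errorf("timed out waiting for the cluster; a failure was reported earlier: %s", lastFailing.Message)
    }
    return errors.New("timed out waiting for the cluster to initialize")
}

func main() {
    calls := 0
    err := waitForCluster(3*time.Second, func() []condition {
        calls++
        if calls == 1 {
            return []condition{{Type: "Failing", Status: "True", Message: "Cluster operator ingress is failing"}}
        }
        return []condition{{Type: "Failing", Status: "False"}}
    })
    fmt.Println(err)
}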

@abhinavdahiya
Contributor

/refresh
/retest

@smarterclayton
Contributor Author

/test integration

Delivered a fix in openshift/release#3450 that should have fixed the issue we were seeing

@abhinavdahiya
Contributor

/retest

wking added a commit to wking/cluster-version-operator that referenced this pull request Jun 17, 2020
Because every Failing=True condition should have a reason.

Also wordsmith the user-facing docs to replace "synchronize" with
"reconcile", because our merge logic is more nuanced than the complete
match "synchronize" implies for me.

The ClusterOperatorNotAvailable special casing landed with
convertErrorToProgressing in c2ac20f (status: Report the operators
that have not yet deployed, 2019-04-09, openshift#158).
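
As an editorial illustration of the convertErrorToProgressing idea referenced above (item 4 of the PR description), a rough Go sketch follows; the signature, the updateError type, and the reason strings are assumptions, not the CVO's actual implementation.

package main

import (
    "fmt"
    "time"
)

// updateError loosely mirrors the PR's UpdateError; field names are assumed.
type updateError struct {
    Reason  string
    Message string
}

// convertErrorToProgressing reports a Progressing-style message instead of a
// failure when the update is still young and the only problem is that some
// cluster operators have not yet become available.
func convertErrorToProgressing(started time.Time, maxWait time.Duration, err *updateError) (string, bool) {
    if err == nil || time.Since(started) > maxWait {
        return "", false
    }
    switch err.Reason {
    case "ClusterOperatorNotAvailable", "ClusterOperatorsNotAvailable":
        return fmt.Sprintf("working towards the update: %s", err.Message), true
    }
    return "", false
}

func main() {
    msg, ok := convertErrorToProgressing(time.Now().Add(-5*time.Minute), 20*time.Minute,
        &updateError{Reason: "ClusterOperatorsNotAvailable", Message: "waiting on operator-1, operator-2"})
    fmt.Println(ok, msg)
}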
wking added a commit to wking/cluster-version-operator that referenced this pull request Jul 31, 2020
Reduce false-positives when operators take a while to level (like the
machine-config operator, which has to roll the control plane
machines).  We may want to raise this further in the future, but baby
steps ;).

The previous 10-minute value is from c2ac20f (status: Report the
operators that have not yet deployed, 2019-04-09, openshift#158), which doesn't
make a case for that specific value.  So the bump is unlikely to break
anything unexpected.
wking added a commit to wking/cluster-version-operator that referenced this pull request May 26, 2021
The MultipleErrors reason landed in c2ac20f (status: Report the
operators that have not yet deployed, 2019-04-09, openshift#158), but for some
reason was left out of SummaryForReason.  I'm adding it in this commit
to get something more useful than:

  $ oc adm upgrade
  info: An upgrade is in progress. Unable to apply 4.8.0-0.ci-2021-05-26-172803: an unknown error has occurred: MultipleErrors

when the Failing=True message is:

  Multiple errors are preventing progress:
  * Cluster operator machine-api is updating versions
  * Cluster operator openshift-apiserver is updating versions

I'm not entirely clear on why these didn't fall back to the "unknown
error" strings at the bottom of SummaryForReason, but they don't seem
to have done so.

"# ------------------------ >8 ------------------------
diff --git a/pkg/payload/task.go b/pkg/payload/task.go
index 91bc3110..4a811218 100644
--- a/pkg/payload/task.go
+++ b/pkg/payload/task.go
@@ -264,6 +264,11 @@ func SummaryForReason(reason, name string) string {
 			return fmt.Sprintf("the workload %s cannot roll out", name)
 		}
 		return "a workload cannot roll out"
+	case "MultipleErrors":
+		if len(name) > 0 {
+			return fmt.Sprintf("the workload %s cannot roll out", name)
+		}
+		return "multiple errors reconciling the payload"
 	}

 	if strings.HasPrefix(reason, "UpdatePayload") {
wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 3, 2022
newClusterOperatorsNotAvailable is from c2ac20f (status: Report the
operators that have not yet deployed, 2019-04-09, openshift#158).  And the
not-available filtering is from bdd4545 (status: Hide generic
operator status in favor of more specific errors, 2019-05-19, openshift#192).
But in ce1eda1 (pkg/cvo/internal/operatorstatus: Change nested
message, 2021-02-04, openshift#514), we moved "waiting on status.versions" from
ClusterOperatorNotAvailable to ClusterOperatorUpdating.  And we want
to avoid uncollapsed errors like:

  Multiple errors are preventing progress:
  * Cluster operator machine-api is updating versions
  * Cluster operator openshift-apiserver is updating versions

where we are waiting on multiple ClusterOperator which are in similar
situations.  This commit drops the filtering, because cluster
operators are important.  It does sort those errors to the end of the
list though, so the first error is the non-ClusterOperator error.

TestCVO_ParallelError no longer tests the consolidated error message,
because the consolidation is now restricted to ClusterOperator
resources.  I tried moving the
pkg/cvo/testdata/paralleltest/release-manifests manifests to
ClusterOperator, but then the test struggled with:

  I0802 16:04:18.133935    2005 sync_worker.go:945] Unable to precreate resource clusteroperator

so now TestCVO_ParallelError is exercising the fact that
non-ClusterOperator failures are not aggregated.
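
To make the "sort those errors to the end of the list" behavior concrete, here is a hedged editor's sketch; the taskError type and reason strings are illustrative assumptions, not the CVO's actual code.

package main

import (
    "fmt"
    "sort"
)

type taskError struct {
    Reason  string
    Message string
}

// sortClusterOperatorsLast stable-sorts errs so ClusterOperator-related errors
// come last, leaving any other error at the front of the list.
func sortClusterOperatorsLast(errs []taskError) {
    isCO := func(e taskError) bool {
        return e.Reason == "ClusterOperatorNotAvailable" || e.Reason == "ClusterOperatorUpdating"
    }
    sort.SliceStable(errs, func(i, j int) bool {
        return !isCO(errs[i]) && isCO(errs[j])
    })
}

func main() {
    errs := []taskError{
        {Reason: "ClusterOperatorUpdating", Message: "Cluster operator machine-api is updating versions"},
        {Reason: "UpdatePayloadFailed", Message: `Could not update configmap "openshift-config/example"`},
    }
    sortClusterOperatorsLast(errs)
    fmt.Println(errs[0].Message) // the non-ClusterOperator error now comes first
}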