status: Report the operators that have not yet deployed #158
Conversation
The current single-error return strategy from the CVO sync loop predates parallel payload execution and limited retries. Instead, collect all errors from the task execution graph and attempt to synthesize better messages that describe what is actually happening.

1. Filter out cancellation error messages - they aren't useful and are a normal part of execution
2. When multiple errors are reported, display a reasonable multi-line error that summarizes any blockers
3. Treat ClusterOperatorNotAvailable as a special case - if all errors reported are of that type, convert it to ClusterOperatorsNotAvailable and synthesize a better message
4. In the sync loop, if we are still making progress towards the update goal, we haven't waited too long for an update, and the error is one of the specific cluster-operator-not-available types, display the condition Progressing=True instead of Failing=True with a synthetic message.

This also passes along the task with the UpdateError so that we can do more selective error messages for specific error cases.
Force-pushed from 7a1f8b3 to c2ac20f.
The additional information looks good, but it's not always there:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/158/pull-ci-openshift-cluster-version-operator-master-e2e-aws/548/artifacts/e2e-aws/installer/.openshift_install.log | grep 'luster .* initialize'
time="2019-04-09T05:33:50Z" level=info msg="Waiting up to 30m0s for the cluster at https://api.ci-op-sh9t14qk-928f7.origin-ci-int-aws.dev.rhcloud.com:6443 to initialize..."
time="2019-04-09T05:33:50Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 83% complete"
time="2019-04-09T05:35:35Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 85% complete"
time="2019-04-09T05:35:48Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 90% complete"
time="2019-04-09T05:36:20Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 91% complete"
time="2019-04-09T05:37:21Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 91% complete"
time="2019-04-09T05:39:35Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 92% complete"
time="2019-04-09T05:40:51Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 92% complete"
time="2019-04-09T05:49:05Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 98% complete"
time="2019-04-09T05:49:20Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 98% complete, waiting on authentication, image-registry, ingress, marketplace, monitoring"
time="2019-04-09T05:50:35Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 98% complete"
time="2019-04-09T05:53:35Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 99% complete"
time="2019-04-09T05:54:05Z" level=debug msg="Cluster is initialized"

If the installer had decided to timeout at 5:53:35, we wouldn't have been able to point at any slow operators.
For that to happen we'd have had to have been going for 30m. For install with a timeout that's fine, because right now anything taking longer than ~15 minutes is wedged and would always report. We reset because we started making progress. I think if we're making progress we don't want to report. We can cut the sync interval on install - will add that as a new commit.
This will report incremental operator status much faster to the progressing condition, including the new operators waiting message.
So is that "by the time we hit our 30m timeout, we will have stabilized in a wedged state"? Previous experience like this shows that sometimes things stick for a while and then eventually shake loose, but not soon enough to get in under the installer timeout. That particular case predates #141, so maybe we are confident that any cluster that isn't ready in 30 minutes will still be hung and reporting via this PR's code?
The installer failing at 30m but then eventually recovering is only a problem for CI, and since it's almost always a bug in an operator and we already gather cluster operator status into artifacts, I'm not sure the message is the important part. It's 10 seconds per failure to look at the failing operator. I'd say this is addressing the experiential "an admin eventually sees the failing job show up once we settle on the wedge".
Other feedback? @abhinavdahiya?
I don't understand that. We've had multiple users encounter just that and report bugs on that scenario.
Those are caused by bugs. The bugs are debuggable in CI. For a user, they should also be debuggable, as I noted, with this change, because we won't be making progress at 30 minutes. If they're timing out because of factors beyond our control, such as pull speed, then the installer messaging needs to change to more carefully communicate that we are stopping waiting, but that the install may still complete. Especially if it isn't marked as failed, but is still progressing.
	return nil
}

nested := make([]error, 0, len(errs))
ques: why create a new nested error list?
This makes the messages better for the CVO, and therefore this is: but to @wking and @sdodson's point:
/lgtm

We can file follow-ups if/when we can agree on them ;).
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, smarterclayton, wking

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
/test integration
Heh, that's the actual new error in the integration test. Fixing. I think we can file something on installer to double check the output at the end of a release. |
Hrm... this looks more like the master failed or lost data. /retest |
I'll work that into openshift/installer#1413 tonight. Should we stop waiting if the CVO ever reports Failing? |
We should probably just continue until the timeout. I think we could maybe record the last failing condition we see along the way, and if we are currently not failing but have timed out just report "we failed earlier" |
/refresh
/test integration

Delivered a fix in openshift/release#3450 that should fix the issue we were seeing.
/retest |
Because every Failing=True condition should have a reason. Also wordsmith the user-facing docs to replace "synchronize" with "reconcile", because our merge logic is more nuanced than the complete match "synchronize" implies for me. The ClusterOperatorNotAvailable special casing landed with convertErrorToProgressing in c2ac20f (status: Report the operators that have not yet deployed, 2019-04-09, openshift#158).
Reduce false-positives when operators take a while to level (like the machine-config operator, which has to roll the control plane machines). We may want to raise this further in the future, but baby steps ;). The previous 10-minute value is from c2ac20f (status: Report the operators that have not yet deployed, 2019-04-09, openshift#158), which doesn't make a case for that specific value. So the bump is unlikely to break anything unexpected.
The MultipleErrors reason landed in c2ac20f (status: Report the operators that have not yet deployed, 2019-04-09, openshift#158), but for some reason was left out of SummaryForReason. I'm adding it in this commit to get something more useful than:

$ oc adm upgrade
info: An upgrade is in progress. Unable to apply 4.8.0-0.ci-2021-05-26-172803: an unknown error has occurred: MultipleErrors

when the Failing=True message is:

Multiple errors are preventing progress:
* Cluster operator machine-api is updating versions
* Cluster operator openshift-apiserver is updating versions

I'm not entirely clear on why these didn't fall back to the "unknown error" strings at the bottom of SummaryForReason, but they don't seem to have done so.

# ------------------------ >8 ------------------------
diff --git a/pkg/payload/task.go b/pkg/payload/task.go
index 91bc3110..4a811218 100644
--- a/pkg/payload/task.go
+++ b/pkg/payload/task.go
@@ -264,6 +264,11 @@ func SummaryForReason(reason, name string) string {
 			return fmt.Sprintf("the workload %s cannot roll out", name)
 		}
 		return "a workload cannot roll out"
+	case "MultipleErrors":
+		if len(name) > 0 {
+			return fmt.Sprintf("the workload %s cannot roll out", name)
+		}
+		return "multiple errors reconciling the payload"
 	}

 	if strings.HasPrefix(reason, "UpdatePayload") {
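The behavior of the new branch can be seen in isolation with a stripped-down sketch. Only the MultipleErrors case and the "unknown error" fallback are reproduced here; the real SummaryForReason in pkg/payload/task.go handles many more reasons.

```go
package main

import "fmt"

// SummaryForReason (reduced sketch): map a machine-readable failure
// reason to a human-readable summary for `oc adm upgrade` output.
func SummaryForReason(reason, name string) string {
	switch reason {
	case "MultipleErrors":
		if len(name) > 0 {
			return fmt.Sprintf("the workload %s cannot roll out", name)
		}
		return "multiple errors reconciling the payload"
	}
	// Fallback for reasons without a dedicated summary, matching the
	// "unknown error" string quoted in the commit message above.
	return fmt.Sprintf("an unknown error has occurred: %s", reason)
}

func main() {
	fmt.Println(SummaryForReason("MultipleErrors", ""))
	// → multiple errors reconciling the payload
}
```

With the new case, an aggregated failure reads "multiple errors reconciling the payload" instead of leaking the raw MultipleErrors reason through the unknown-error fallback.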
newClusterOperatorsNotAvailable is from c2ac20f (status: Report the operators that have not yet deployed, 2019-04-09, openshift#158). And the not-available filtering is from bdd4545 (status: Hide generic operator status in favor of more specific errors, 2019-05-19, openshift#192). But in ce1eda1 (pkg/cvo/internal/operatorstatus: Change nested message, 2021-02-04, openshift#514), we moved "waiting on status.versions" from ClusterOperatorNotAvailable to ClusterOperatorUpdating. And we want to avoid uncollapsed errors like:

Multiple errors are preventing progress:
* Cluster operator machine-api is updating versions
* Cluster operator openshift-apiserver is updating versions

where we are waiting on multiple ClusterOperators which are in similar situations. This commit drops the filtering, because cluster operators are important. It does sort those errors to the end of the list, though, so the first error is the non-ClusterOperator error.

TestCVO_ParallelError no longer tests the consolidated error message, because the consolidation is now restricted to ClusterOperator resources. I tried moving the pkg/cvo/testdata/paralleltest/release-manifests manifests to ClusterOperator, but then the test struggled with:

I0802 16:04:18.133935 2005 sync_worker.go:945] Unable to precreate resource clusteroperator

so now TestCVO_ParallelError is exercising the fact that non-ClusterOperator failures are not aggregated.