status: Hide generic operator status in favor of more specific errors #192

Merged

smarterclayton
Contributor

When upgrading and a sync loop exits, we often see:

  • Operator A not upgraded yet
  • Operator B not upgraded yet
  • Operator C not upgraded yet
  • Some more specific error
  • Operator D not upgraded yet

The "not upgraded yet" messages carry little value and are expected
when we are processing parallel output, so if a sync produces one or
more errors that are not the generic error, report only those. This
focuses the user's attention on the actual problem, rather than
distracting them with noise. In the ideal case, the user immediately
sees "router is totally broken" vs "a, b, c, d, e, router is broken, f".

E.g. https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1724/

Multiple errors are preventing progress:
* Cluster operator authentication is still updating: upgrading oauth-openshift from 4.2.0-0.ci-2019-05-18-031713_openshift to 4.2.0-0.ci-2019-05-18-224753_openshift
* Cluster operator cluster-autoscaler is still updating
* Cluster operator monitoring is still updating
* Cluster operator openshift-controller-manager is still updating
* Cluster operator service-catalog-controller-manager is still updating
* Could not update deployment "openshift-cloud-credential-operator/cloud-credential-operator" (93 of 350)
...

The "still updating" errors are just noise.

/cherrypick release-4.1

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton

@openshift-ci-robot added the approved and size/S labels May 19, 2019
@smarterclayton
Contributor Author

Hrm, this doesn't seem to be working like I expected

@smarterclayton
Contributor Author

Oh wait, nm, that was another PR.

@openshift-ci-robot added the size/M label and removed the size/S label May 20, 2019
    if filtered := filterErrors(errs, isClusterOperatorNotAvailable); len(filtered) > 0 {
        return newMultipleError(filtered)
    }
    // if we're only waiting for operators, condense the error down to a singleton
    if err := newClusterOperatorsNotAvailable(errs); err != nil {
        return err
    }
    return newMultipleError(errs)
Member

This is the only newMultipleError consumer. It feels strange to have some preprocessing locally; can we move this new logic inside newMultipleError (and drop the local collapse, since newMultipleError already does that internally)?

Contributor Author

It’s three different flows, and I expect there to be more in the future. MultipleErrors is for when we don’t know anything.
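
For readers following the ordering in the hunk under review (filter first, then collapse, then fall back to the generic list), here is a minimal, self-contained sketch. It is not the cluster-version-operator code: `notAvailableError`, `taskError`, `multiple`, `summarize`, and the collapsed message text are hypothetical stand-ins chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// notAvailableError is a hypothetical stand-in for the generic
// "operator is still updating" error discussed in this PR.
type notAvailableError struct{ operator string }

func (e notAvailableError) Error() string {
	return fmt.Sprintf("Cluster operator %s is still updating", e.operator)
}

// taskError is a hypothetical stand-in for any more specific failure.
type taskError string

func (e taskError) Error() string { return string(e) }

// isNotAvailable reports whether err is the generic operator error.
func isNotAvailable(err error) bool {
	var target notAvailableError
	return errors.As(err, &target)
}

// filterErrors returns the errors for which skip returns false.
func filterErrors(errs []error, skip func(error) bool) []error {
	var kept []error
	for _, err := range errs {
		if !skip(err) {
			kept = append(kept, err)
		}
	}
	return kept
}

// multiple is the generic fallback: one error is returned as-is,
// several are rendered as a bulleted list.
func multiple(errs []error) error {
	if len(errs) == 1 {
		return errs[0]
	}
	lines := make([]string, 0, len(errs))
	for _, err := range errs {
		lines = append(lines, "* "+err.Error())
	}
	return fmt.Errorf("Multiple errors are preventing progress:\n%s", strings.Join(lines, "\n"))
}

// summarize prefers specific errors; if only generic operator errors
// remain, it collapses them into a single message.
func summarize(errs []error) error {
	if len(errs) == 0 {
		return nil
	}
	if specific := filterErrors(errs, isNotAvailable); len(specific) > 0 {
		return multiple(specific)
	}
	names := make([]string, 0, len(errs))
	for _, err := range errs {
		var na notAvailableError
		if errors.As(err, &na) {
			names = append(names, na.operator)
		}
	}
	// Illustrative wording only; the real message may differ.
	return fmt.Errorf("Some cluster operators are still updating: %s", strings.Join(names, ", "))
}

func main() {
	errs := []error{
		notAvailableError{operator: "monitoring"},
		taskError(`Could not update deployment "openshift-console/downloads"`),
		notAvailableError{operator: "storage"},
	}
	fmt.Println(summarize(errs))
}
```

In this sketch the generic operator errors are reported only when nothing more specific is available, which is the behavior the PR description argues for.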

@smarterclayton
Contributor Author

Reapplying label (here's where we are after fixing a couple of the ugly):

May 21 09:48:03.655: INFO: Cluster version operator acknowledged upgrade request
May 21 09:55:13.785: INFO: cluster upgrade is failing: Multiple errors are preventing progress:
* Cluster operator authentication is still updating
* Cluster operator cluster-autoscaler is still updating
* Cluster operator marketplace is still updating
* Cluster operator monitoring is still updating
* Cluster operator openshift-controller-manager is still updating
* Cluster operator service-catalog-apiserver is still updating
* Cluster operator service-catalog-controller-manager is still updating
* Cluster operator storage is still updating
* Could not update deployment "openshift-cluster-samples-operator/cluster-samples-operator" (185 of 350)
* Could not update deployment "openshift-console/downloads" (237 of 350)
* Could not update deployment "openshift-operator-lifecycle-manager/catalog-operator" (254 of 350)
* Could not update deployment "openshift-service-ca-operator/service-ca-operator" (290 of 350)

@smarterclayton added the lgtm label May 21, 2019
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot merged commit 27771ce into openshift:master May 21, 2019
wking added a commit to wking/cluster-version-operator that referenced this pull request May 26, 2021
newClusterOperatorsNotAvailable is from c2ac20f (status: Report the
operators that have not yet deployed, 2019-04-09, openshift#158).  And the
not-available filtering is from bdd4545 (status: Hide generic
operator status in favor of more specific errors, 2019-05-19, openshift#192).
But in ce1eda1 (pkg/cvo/internal/operatorstatus: Change nested
message, 2021-02-04, openshift#514), we moved "waiting on status.versions" from
ClusterOperatorNotAvailable to ClusterOperatorUpdating.  And we want
to avoid uncollapsed errors like:

  Multiple errors are preventing progress:
  * Cluster operator machine-api is updating versions
  * Cluster operator openshift-apiserver is updating versions

where we are waiting on multiple ClusterOperators that are in similar
situations.  This commit drops the filtering, because cluster
operators are important.  It does sort those errors to the end of the
list though, so the first error is the non-ClusterOperator error.
wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 3, 2022
newClusterOperatorsNotAvailable is from c2ac20f (status: Report the
operators that have not yet deployed, 2019-04-09, openshift#158).  And the
not-available filtering is from bdd4545 (status: Hide generic
operator status in favor of more specific errors, 2019-05-19, openshift#192).
But in ce1eda1 (pkg/cvo/internal/operatorstatus: Change nested
message, 2021-02-04, openshift#514), we moved "waiting on status.versions" from
ClusterOperatorNotAvailable to ClusterOperatorUpdating.  And we want
to avoid uncollapsed errors like:

  Multiple errors are preventing progress:
  * Cluster operator machine-api is updating versions
  * Cluster operator openshift-apiserver is updating versions

where we are waiting on multiple ClusterOperators that are in similar
situations.  This commit drops the filtering, because cluster
operators are important.  It does sort those errors to the end of the
list though, so the first error is the non-ClusterOperator error.

TestCVO_ParallelError no longer tests the consolidated error message,
because the consolidation is now restricted to ClusterOperator
resources.  I tried moving the
pkg/cvo/testdata/paralleltest/release-manifests manifests to
ClusterOperator, but then the test struggled with:

  I0802 16:04:18.133935    2005 sync_worker.go:945] Unable to precreate resource clusteroperator

so now TestCVO_ParallelError is exercising the fact that
non-ClusterOperator failures are not aggregated.
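
The sort-to-the-end behavior described in the commit message above could look roughly like the following sketch. It assumes a hypothetical `syncError` type with a `fromOperator` flag and a hypothetical `operatorErrorsLast` helper; neither name comes from the cluster-version-operator source.

```go
package main

import (
	"fmt"
	"sort"
)

// syncError is a hypothetical error type; fromOperator marks errors
// that come from waiting on a ClusterOperator.
type syncError struct {
	msg          string
	fromOperator bool
}

func (e syncError) Error() string { return e.msg }

// operatorErrorsLast returns a copy of errs in which ClusterOperator
// errors follow all other errors. The sort is stable, so the relative
// order inside each group is preserved and the input is not modified.
func operatorErrorsLast(errs []syncError) []syncError {
	sorted := append([]syncError(nil), errs...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return !sorted[i].fromOperator && sorted[j].fromOperator
	})
	return sorted
}

func main() {
	errs := []syncError{
		{"Cluster operator machine-api is updating versions", true},
		{`Could not update deployment "openshift-console/downloads"`, false},
		{"Cluster operator openshift-apiserver is updating versions", true},
	}
	for _, err := range operatorErrorsLast(errs) {
		fmt.Println("*", err)
	}
}
```

Nothing is hidden in this variant; the non-ClusterOperator error is simply listed first.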