
status: Report the operators that have not yet deployed #158

Merged

Conversation

smarterclayton
Contributor

@smarterclayton smarterclayton commented Apr 9, 2019

The current single error return strategy from the CVO sync loop predates
parallel payload execution and limited retries. Instead, collect all
errors from the task execution graph and attempt to synthesize better
messages that describe what is actually happening.

1. Filter out cancellation error messages - they aren't useful and are
   a normal part of execution
2. When multiple errors are reported, display a reasonable multi-line
   error that summarizes any blockers
3. Treat ClusterOperatorNotAvailable as a special case - if all errors
   reported are of that type, convert it to ClusterOperatorsNotAvailable
   and synthesize a better message (see the sketch after the example below)
4. In the sync loop, if we are still making progress towards the update
   goal and we haven't waited too long for an update, and if the error
   is the specific cluster operator not available types, display the
   condition Progressing=True instead of Failing=True with a synthetic
   message.

This also passes along the task with the UpdateError so that we can do
more selective error messages for specific error cases.

This still needs a few more tests, but for Progressing it reports:

Working towards 1.0.0-abc: 33% complete, waiting on operator-1, operator-2
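
For illustration only (an editor's sketch, not the PR's actual code), a minimal standalone Go program showing the kind of summarization items 2 and 3 above describe: if every error is a not-yet-available cluster operator, synthesize a single "waiting on ..." message; otherwise emit a multi-line blocker summary. The taskError type and the reason string are assumptions.

// errsummary.go - illustrative sketch of multi-error summarization.
package main

import (
    "fmt"
    "sort"
    "strings"
)

// taskError is a stand-in for the CVO's per-task error; field names are assumed.
type taskError struct {
    Reason string // e.g. "ClusterOperatorNotAvailable"
    Name   string // operator name
    Msg    string
}

// summarize collapses per-task errors into a single error message.
func summarize(errs []taskError) error {
    if len(errs) == 0 {
        return nil
    }
    names := make([]string, 0, len(errs))
    allNotAvailable := true
    for _, e := range errs {
        if e.Reason != "ClusterOperatorNotAvailable" {
            allNotAvailable = false
            break
        }
        names = append(names, e.Name)
    }
    if allNotAvailable {
        sort.Strings(names)
        return fmt.Errorf("some cluster operators have not yet rolled out: waiting on %s",
            strings.Join(names, ", "))
    }
    lines := []string{"Multiple errors are preventing progress:"}
    for _, e := range errs {
        lines = append(lines, "* "+e.Msg)
    }
    return fmt.Errorf("%s", strings.Join(lines, "\n"))
}

func main() {
    fmt.Println(summarize([]taskError{
        {Reason: "ClusterOperatorNotAvailable", Name: "operator-2", Msg: "Cluster operator operator-2 has not yet reported success"},
        {Reason: "ClusterOperatorNotAvailable", Name: "operator-1", Msg: "Cluster operator operator-1 has not yet reported success"},
    }))
}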

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 9, 2019
@wking
Member

wking commented Apr 9, 2019

The additional information looks good, but it's not always there:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/158/pull-ci-openshift-cluster-version-operator-master-e2e-aws/548/artifacts/e2e-aws/installer/.openshift_install.log | grep 'luster .* initialize'
time="2019-04-09T05:33:50Z" level=info msg="Waiting up to 30m0s for the cluster at https://api.ci-op-sh9t14qk-928f7.origin-ci-int-aws.dev.rhcloud.com:6443 to initialize..."
time="2019-04-09T05:33:50Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 83% complete"
time="2019-04-09T05:35:35Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 85% complete"
time="2019-04-09T05:35:48Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 90% complete"
time="2019-04-09T05:36:20Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 91% complete"
time="2019-04-09T05:37:21Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 91% complete"
time="2019-04-09T05:39:35Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 92% complete"
time="2019-04-09T05:40:51Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 92% complete"
time="2019-04-09T05:49:05Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 98% complete"
time="2019-04-09T05:49:20Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 98% complete, waiting on authentication, image-registry, ingress, marketplace, monitoring"
time="2019-04-09T05:50:35Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 98% complete"
time="2019-04-09T05:53:35Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-04-09-051341: 99% complete"
time="2019-04-09T05:54:05Z" level=debug msg="Cluster is initialized"

If the installer had decided to timeout at 5:53:35, we wouldn't have been able to point at any slow operators.

@smarterclayton
Contributor Author

If the installer had decided to timeout at 5:53:35, we wouldn't have been able to point at any slow operators.

For that to happen we'd have to have been going for 30m. For install with a timeout that's fine, because right now anything taking longer than ~15 minutes is wedged and would always report. We reset because we started making progress. I think if we're making progress we don't want to report.

We can cut the sync interval on install - will add that as a new commit.

This will report incremental operator status much faster to the
progressing condition, including the new operators waiting message.
@wking
Member

wking commented Apr 9, 2019

For that to happen we'd have to have been going for 30m. For install with a timeout that's fine, because right now anything taking longer than ~15 is wedged and would always report. We reset because we started making progress. I think if we're making progress we don't want to report.

So is that "by the time we hit our 30m timeout, we will have stabilized in a wedged state"? Previous experience like this shows that sometimes things stick for a while and then eventually shake loose, but not soon enough to get in under the installer timeout. That particular case predates #141, so maybe we are confident that any cluster that isn't ready in 30 minutes will still be hung and reporting via this PR's code?

@smarterclayton
Contributor Author

The installer failing at 30m but then eventually recovering is only a problem for CI, and since it's almost always a bug in an operator and we already gather cluster operator status into artifacts, I'm not sure the message is the important part. It's 10 seconds per failure to look at the failing operator. I'd say this is addressing the experiential "an admin eventually sees the failing job show up once we settle on the wedge".

@smarterclayton
Contributor Author

other feedback? @abhinavdahiya ?

@sdodson
Member

sdodson commented Apr 10, 2019

The installer failing at 30m but then eventually recovering is only a problem for CI

I don't understand that. We've had multiple users encounter just that and report bugs on that scenario.

@smarterclayton
Contributor Author

Those are caused by bugs. The bugs are debuggable in CI. For a user, they should also be debuggable as I noted with this change, because we won't be making progress at 30 minutes. If they're timing out because of factors beyond our control, such as pull speed, then the installer messaging needs to change to more carefully communicate that we are stopping waiting but that the install may still complete - especially if it isn't marked as failed and is still progressing.

return nil
}

nested := make([]error, 0, len(errs))

ques: why create new nested error list?
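
(Editor's sketch of the pattern the question refers to, as an assumption about intent rather than the PR's actual code: a fresh slice lets each raw error be wrapped with extra context before aggregation, without mutating the caller's errs.)

package main

import (
    "errors"
    "fmt"
)

func main() {
    errs := []error{errors.New("operator-1 not available"), errors.New("operator-2 degraded")}
    // Copy each error into a new slice, wrapping it with task context on the way,
    // so the original errs slice is left untouched.
    nested := make([]error, 0, len(errs))
    for _, err := range errs {
        nested = append(nested, fmt.Errorf("task failed: %w", err))
    }
    fmt.Println(errors.Join(nested...)) // errors.Join (Go 1.20+) is used here only for the demo
}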

@abhinavdahiya
Contributor

c2ac20f is LGTM; I like the idea of summarizing the errors.

And 3c2cf46 makes sense for now as a stopgap.

@abhinavdahiya
Contributor

This makes the messages better for CVO and therefore this is:
/approve

but to @wking and @sdodson 's point:
in the case where the CVO hasn't reached level after 30 minutes and is still progressing, we (as in the installer) would need to inform the user either that the CVO is still progressing and they might have to wait longer using wait-for, or that the CVO is failing, and pass along the failure reported by the CVO.

@wking
Member

wking commented Apr 11, 2019

/lgtm

We can file follow-ups if/when we can agree on them ;).

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 11, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, smarterclayton, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,smarterclayton,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wking
Member

wking commented Apr 11, 2019

integration:

start_integration_test.go:244: cluster version never became available: operator listed as failing (MultipleErrors): Multiple errors are preventing progress:
		* Could not update configmap "e2e-cvo-s5m9/config1" (1 of 2): the server has forbidden updates to this resource
		* Could not update configmap "e2e-cvo-s5m9/config2" (2 of 2): the server has forbidden updates to this resource

/test integration

@smarterclayton
Contributor Author

Heh, that's the actual new error in the integration test. Fixing.

I think we can file something on installer to double check the output at the end of a release.

@smarterclayton
Contributor Author

Hrm... this looks more like the master failed or lost data.

/retest

@wking
Member

wking commented Apr 11, 2019

I think we can file something on installer to double check the output at the end of a release.

I'll work that into openshift/installer#1413 tonight. Should we stop waiting if the CVO ever reports Failing?

@smarterclayton
Contributor Author

smarterclayton commented Apr 11, 2019

We should probably just continue until the timeout. I think we could maybe record the last failing condition we see along the way, and if we are currently not failing but have timed out, just report "we failed earlier".
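
A hypothetical installer-side sketch of that idea (every name here is an assumption for illustration, not openshift/installer code): remember the last Failing=True condition seen while polling, and surface it if the wait times out while the cluster is not currently failing.

package main

import (
    "errors"
    "fmt"
    "time"
)

type condition struct {
    Type    string // e.g. "Available", "Failing"
    Status  string // "True" or "False"
    Message string
}

// waitForCluster polls until Available=True or the timeout expires, remembering
// the most recent Failing=True condition so it can be reported on timeout.
func waitForCluster(timeout time.Duration, poll func() []condition) error {
    var lastFailing *condition
    deadline := time.Now().Add(timeout)
    for time.Now().Before(deadline) {
        for _, c := range poll() {
            c := c
            if c.Type == "Failing" && c.Status == "True" {
                lastFailing = &c
            }
            if c.Type == "Available" && c.Status == "True" {
                return nil
            }
        }
        time.Sleep(time.Second)
    }
    if lastFailing != nil {
        return fmt.Errorf("timed out waiting for the cluster; a failure was reported earlier: %s", lastFailing.Message)
    }
    return errors.New("timed out waiting for the cluster to initialize")
}

func main() {
    calls := 0
    err := waitForCluster(3*time.Second, func() []condition {
        calls++
        if calls == 1 {
            return []condition{{Type: "Failing", Status: "True", Message: "Cluster operator ingress is failing"}}
        }
        return []condition{{Type: "Failing", Status: "False"}}
    })
    fmt.Println(err)
}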

@abhinavdahiya
Contributor

/refresh
/retest

@smarterclayton
Contributor Author

/test integration

Delivered a fix in openshift/release#3450 that should have fixed the issue we were seeing

@abhinavdahiya
Contributor

/retest

wking added a commit to wking/cluster-version-operator that referenced this pull request Jun 17, 2020
Because every Failing=True condition should have a reason.

Also wordsmith the user-facing docs to replace "synchronize" with
"reconcile", because our merge logic is more nuanced than the complete
match "synchronize" implies for me.

The ClusterOperatorNotAvailable special casing landed with
convertErrorToProgressing in c2ac20f (status: Report the operators
that have not yet deployed, 2019-04-09, openshift#158).
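
As an editorial illustration of the convertErrorToProgressing idea referenced above (item 4 of the PR description), a rough Go sketch follows; the signature, the updateError type, and the reason strings are assumptions, not the CVO's actual implementation.

package main

import (
    "fmt"
    "time"
)

// updateError loosely mirrors the PR's UpdateError; field names are assumed.
type updateError struct {
    Reason  string
    Message string
}

// convertErrorToProgressing reports a Progressing-style message instead of a
// failure when the update is still young and the only problem is that some
// cluster operators have not yet become available.
func convertErrorToProgressing(started time.Time, maxWait time.Duration, err *updateError) (string, bool) {
    if err == nil || time.Since(started) > maxWait {
        return "", false
    }
    switch err.Reason {
    case "ClusterOperatorNotAvailable", "ClusterOperatorsNotAvailable":
        return fmt.Sprintf("working towards the update: %s", err.Message), true
    }
    return "", false
}

func main() {
    msg, ok := convertErrorToProgressing(time.Now().Add(-5*time.Minute), 20*time.Minute,
        &updateError{Reason: "ClusterOperatorsNotAvailable", Message: "waiting on operator-1, operator-2"})
    fmt.Println(ok, msg)
}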
wking added a commit to wking/cluster-version-operator that referenced this pull request Jul 31, 2020
Reduce false-positives when operators take a while to level (like the
machine-config operator, which has to roll the control plane
machines).  We may want to raise this further in the future, but baby
steps ;).

The previous 10-minute value is from c2ac20f (status: Report the
operators that have not yet deployed, 2019-04-09, openshift#158), which doesn't
make a case for that specific value.  So the bump is unlikely to break
anything unexpected.
wking added a commit to wking/cluster-version-operator that referenced this pull request May 26, 2021
The MultipleErrors reason landed in c2ac20f (status: Report the
operators that have not yet deployed, 2019-04-09, openshift#158), but for some
reason was left out of SummaryForReason.  I'm adding it in this commit
to get something more useful than:

  $ oc adm upgrade
  info: An upgrade is in progress. Unable to apply 4.8.0-0.ci-2021-05-26-172803: an unknown error has occurred: MultipleErrors

when the Failing=True message is:

  Multiple errors are preventing progress:
  * Cluster operator machine-api is updating versions
  * Cluster operator openshift-apiserver is updating versions

I'm not entirely clear on why these didn't fall back to the "unknown
error" strings at the bottom of SummaryForReason, but they don't seem
to have done so.

"# ------------------------ >8 ------------------------
diff --git a/pkg/payload/task.go b/pkg/payload/task.go
index 91bc3110..4a811218 100644
--- a/pkg/payload/task.go
+++ b/pkg/payload/task.go
@@ -264,6 +264,11 @@ func SummaryForReason(reason, name string) string {
 			return fmt.Sprintf("the workload %s cannot roll out", name)
 		}
 		return "a workload cannot roll out"
+	case "MultipleErrors":
+		if len(name) > 0 {
+			return fmt.Sprintf("the workload %s cannot roll out", name)
+		}
+		return "multiple errors reconciling the payload"
 	}

 	if strings.HasPrefix(reason, "UpdatePayload") {
wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 3, 2022
newClusterOperatorsNotAvailable is from c2ac20f (status: Report the
operators that have not yet deployed, 2019-04-09, openshift#158).  And the
not-available filtering is from bdd4545 (status: Hide generic
operator status in favor of more specific errors, 2019-05-19, openshift#192).
But in ce1eda1 (pkg/cvo/internal/operatorstatus: Change nested
message, 2021-02-04, openshift#514), we moved "waiting on status.versions" from
ClusterOperatorNotAvailable to ClusterOperatorUpdating.  And we want
to avoid uncollapsed errors like:

  Multiple errors are preventing progress:
  * Cluster operator machine-api is updating versions
  * Cluster operator openshift-apiserver is updating versions

where we are waiting on multiple ClusterOperator which are in similar
situations.  This commit drops the filtering, because cluster
operators are important.  It does sort those errors to the end of the
list though, so the first error is the non-ClusterOperator error.

TestCVO_ParallelError no longer tests the consolidated error message,
because the consolidation is now restricted to ClusterOperator
resources.  I tried moving the
pkg/cvo/testdata/paralleltest/release-manifests manifests to
ClusterOperator, but then the test struggled with:

  I0802 16:04:18.133935    2005 sync_worker.go:945] Unable to precreate resource clusteroperator

so now TestCVO_ParallelError is exercising the fact that
non-ClusterOperator failures are not aggregated.
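
To make the "sort those errors to the end of the list" behavior concrete, here is a hedged editor's sketch; the taskError type and reason strings are illustrative assumptions, not the CVO's actual code.

package main

import (
    "fmt"
    "sort"
)

type taskError struct {
    Reason  string
    Message string
}

// sortClusterOperatorsLast stable-sorts errs so ClusterOperator-related errors
// come last, leaving any other error at the front of the list.
func sortClusterOperatorsLast(errs []taskError) {
    isCO := func(e taskError) bool {
        return e.Reason == "ClusterOperatorNotAvailable" || e.Reason == "ClusterOperatorUpdating"
    }
    sort.SliceStable(errs, func(i, j int) bool {
        return !isCO(errs[i]) && isCO(errs[j])
    })
}

func main() {
    errs := []taskError{
        {Reason: "ClusterOperatorUpdating", Message: "Cluster operator machine-api is updating versions"},
        {Reason: "UpdatePayloadFailed", Message: `Could not update configmap "openshift-config/example"`},
    }
    sortClusterOperatorsLast(errs)
    fmt.Println(errs[0].Message) // the non-ClusterOperator error now comes first
}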