Bug 1763821: pkg/payload/task_graph: Canceling the task graph partway though is an error even if no tasks fail #255

wking · 2019-10-21T16:52:48Z

Avoid situations like:

2019-10-21T10:34:30.63940461Z I1021 10:34:30.639073       1 start.go:19] ClusterVersionOperator v1.0.0-106-g0725bd53-dirty
...
2019-10-21T10:34:31.132673574Z I1021 10:34:31.132635       1 sync_worker.go:453] Running sync quay.io/runcom/origin-release:v4.2-1196 (force=true) on generation 2 in state Updating at attempt 0
...
2019-10-21T10:40:16.168632703Z I1021 10:40:16.168604       1 sync_worker.go:579] Running sync for customresourcedefinition "baremetalhosts.metal3.io" (101 of 432)
2019-10-21T10:40:16.18425522Z I1021 10:40:16.184220       1 task_graph.go:583] Canceled worker 0
2019-10-21T10:40:16.184381244Z I1021 10:40:16.184360       1 task_graph.go:583] Canceled worker 3
...
2019-10-21T10:40:16.21772875Z I1021 10:40:16.217715       1 task_graph.go:603] Workers finished
2019-10-21T10:40:16.217777479Z I1021 10:40:16.217759       1 task_graph.go:611] Result of work: []
2019-10-21T10:40:16.217864206Z I1021 10:40:16.217846       1 task_graph.go:539] Stopped graph walker due to cancel
...
2019-10-21T10:43:08.743798997Z I1021 10:43:08.743740       1 sync_worker.go:453] Running sync quay.io/runcom/origin-release:v4.2-1196 (force=true) on generation 2 in state Reconciling at attempt 0
...

Where the CVO cancels some workers, sees that there are no errors, and decides "upgrade complete" despite never having attempted to push the bulk of its manifests. With this commit, the result of work will include several worker-canceled errors, and we'll take another upgrade round instead of declaring success and moving into reconciling.

openshift-ci-robot · 2019-10-21T16:54:59Z

@wking: This pull request references Bugzilla bug 1763821, which is invalid:

expected the bug to target the "4.3.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1763821: pkg/payload/task_graph: Canceled workers are errors

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

wking · 2019-10-21T16:55:15Z

/cherrypick release-4.2

openshift-cherrypick-robot · 2019-10-21T16:55:16Z

@wking: once the present PR merges, I will cherry-pick it on top of release-4.2 in a new PR and assign it to you.

In response to this:

/cherrypick release-4.2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

wking · 2019-10-21T16:55:23Z

/cherrypick release-4.1

openshift-cherrypick-robot · 2019-10-21T16:55:23Z

@wking: once the present PR merges, I will cherry-pick it on top of release-4.1 in a new PR and assign it to you.

In response to this:

/cherrypick release-4.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

pkg/payload/task_graph.go

smarterclayton · 2019-10-21T19:15:10Z

/hold

smarterclayton · 2019-10-21T19:16:00Z

you're also going to have to fix all the tests, and I'd like to see enough new tests that verify that this has the correct behavior on the sync worker (unit is fine probably, unless we can get the right behavior via one of the integration). then I want to see upgrade soaks to verify this doesn't happen anymore.

wking · 2019-10-22T04:38:43Z

you're also going to have to fix all the tests...

Turned out that the problem was a bit deeper; 40b7e58 -> 5eaad22 brings in a more thorough fix that does not break preexisting tests.

...I'd like to see enough new tests that verify that this has the correct behavior on the sync worker (unit is fine probably...

5eaad22 has a new unit tests which fails without my channel/goroutine refactor. Lots and lots of me trying to explain what was going on before, what was going wrong, and how I've fixed it in the 5eaad22 commit message ;).

wking · 2019-10-22T04:42:29Z

e2e-aws-upgrade:

level=fatal msg="failed to fetch Terraform Variables: failed to fetch dependency of \"Terraform Variables\": failed to fetch dependency of \"Bootstrap Ignition Config\": failed to fetch dependency of \"Master Machines\": failed to generate asset \"Platform Credentials Check\": validate AWS credentials: mint credentials check: error simulating policy: Throttling: Rate exceeded\n\tstatus code: 400, request id: b03a5cbe-8a90-4a84-a3a7-c1300122210f

/test e2e-aws-upgrade

wking · 2019-10-22T05:23:38Z

/bugzilla refresh

Also, now that I have a unit test, I think I can pull your hold:

/hold cancel

openshift-ci-robot · 2019-10-22T05:23:43Z

@wking: This pull request references Bugzilla bug 1763821, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

/bugzilla refresh

Also, now that I have a unit test, I think I can pull your hold:

/hold cancel

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

smarterclayton · 2019-10-22T15:48:40Z

/hold

Hard no on refactoring this entire method. Please only change the minimal function necessary to close the gap. Adding a cancel error at the end is the safest minimal fix. Do not refactor like this when critical changes are required.

The RunGraph implementation was unchanged since it landed in cb4e037 (payload: Create a task graph that can split a payload into chunks, 2019-01-17, openshift#88), with the exception of later logging and c2ac20f (status: Report the operators that have not yet deployed, 2019-04-09, openshift#158) with the adjusted return type. The old code launched a goroutine for the pushing/reaping, which was an unecessary, and made error reporting on any outstanding tasks more complicated. I'd like to drop the goroutine, but Clayton is not comfortable with backporting that large a change to older releases [1]. Instead, this commit adds the pusher/reaper to the WaitGroup and adjusts the push/reap logic to: * Stop pushing work if we've been canceled. * Stick around to collect any in-flight jobs, instead of existing after the pushing completes. * Drop the 'ok' fallback in the completeCh read. Now that we're part of the WaitGroup, with the wg.Wait gorouting blocking on the pusher/reaper exiting, and the main RunGraph gorouting's 'range errCh' blocking on the wg.Wait goroutine, we don't have to worry about completeCh being closed when the pusher/reaper is reading from it. * Drop the 'unreachable' business. The earlier push/reap loop should continue trying to push work until we are canceled or there is nothing left to push. We should never be outside that loop and still have jobs left that we want to push. * Write to errCh with "here's the first node that was incomplete" and possibly "and I was canceled" errors if there are no more-useful errors from the worker nodes. This check happens after we've drained in-flight jobs, so we don't have to worry about new errors coming out of workers at this point. This should avoid situations like [2]: 2019-10-21T10:34:30.63940461Z I1021 10:34:30.639073 1 start.go:19] ClusterVersionOperator v1.0.0-106-g0725bd53-dirty ... 2019-10-21T10:34:31.132673574Z I1021 10:34:31.132635 1 sync_worker.go:453] Running sync quay.io/runcom/origin-release:v4.2-1196 (force=true) on generation 2 in state Updating at attempt 0 ... 2019-10-21T10:40:16.168632703Z I1021 10:40:16.168604 1 sync_worker.go:579] Running sync for customresourcedefinition "baremetalhosts.metal3.io" (101 of 432) 2019-10-21T10:40:16.18425522Z I1021 10:40:16.184220 1 task_graph.go:583] Canceled worker 0 2019-10-21T10:40:16.184381244Z I1021 10:40:16.184360 1 task_graph.go:583] Canceled worker 3 ... 2019-10-21T10:40:16.21772875Z I1021 10:40:16.217715 1 task_graph.go:603] Workers finished 2019-10-21T10:40:16.217777479Z I1021 10:40:16.217759 1 task_graph.go:611] Result of work: [] 2019-10-21T10:40:16.217864206Z I1021 10:40:16.217846 1 task_graph.go:539] Stopped graph walker due to cancel ... 2019-10-21T10:43:08.743798997Z I1021 10:43:08.743740 1 sync_worker.go:453] Running sync quay.io/runcom/origin-release:v4.2-1196 (force=true) on generation 2 in state Reconciling at attempt 0 ... where the CVO canceled some workers, saw that there are worker no errors, and decided "upgrade complete" despite never having attempted to push the bulk of its manifests. Without the task_graph.go changes in this commit, the new test fails with: $ go test -run TestRunGraph ./pkg/payload/ --- FAIL: TestRunGraph (1.03s) --- FAIL: TestRunGraph/cancelation_without_task_errors_is_reported (1.00s) task_graph_test.go:910: unexpected error: [] FAIL FAIL github.com/openshift/cluster-version-operator/pkg/payload 1.042s Also change "cancelled" -> "canceled" to match Go's docs [3] and name the other test cases. [1]: openshift#255 (comment) [2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1/754/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-origin-4-1-sha256-f8c863ea08d64eea7b3a9ffbbde9c01ca90501afe6c0707e9c35f0ed7e92a9df/namespaces/openshift-cluster-version/pods/cluster-version-operator-5f5d465967-t57b2/cluster-version-operator/cluster-version-operator/logs/current.log [3]: https://golang.org/pkg/context/#pkg-overview

wking · 2019-10-22T18:10:25Z

Rerolled again with f69e58d -> 11676c7 to take @smarterclayton's minimal fix from #260. The new commit message explains why I'm unhappy to drop my 1 incomplete task nodes, beginning with b error and the potential (not very important) race against late cancels being reported as errors even if they come after we've successfully pushed the whole graph. But getting this fixed is the most important thing, and we can revisit the quality of error message if we have many clusters stuck with context canceled.

The RunGraph implementation was unchanged since it landed in cb4e037 (payload: Create a task graph that can split a payload into chunks, 2019-01-17, openshift#88), with the exception of later logging and c2ac20f (status: Report the operators that have not yet deployed, 2019-04-09, openshift#158) with the adjusted return type. The old code launched a goroutine for the pushing/reaping, which was an unecessary, and made error reporting on any outstanding tasks more complicated. I'd like to drop the goroutine, but Clayton is not comfortable with backporting that large a change to older releases [1]. And I'd like to be able to return errors like: 1 incomplete task nodes, beginning with b but Clayton thinks these are just "took too long, but we're still making progress" and that they'll resolve on their own in the next attempt or few, and that they're not actual deadlocks where you'd want a better fingerprint to pin down the node(s) that were locking [2]. This commit ensures that when we are canceled we return an error, and it does none of the refactoring we'd need to be able to say whether we had unprocessed nodes (for late cancels, it's possible that we could return "I was canceled" even if we had successfully pushed and reaped all the nodes). This should avoid situations like [3]: 2019-10-21T10:34:30.63940461Z I1021 10:34:30.639073 1 start.go:19] ClusterVersionOperator v1.0.0-106-g0725bd53-dirty ... 2019-10-21T10:34:31.132673574Z I1021 10:34:31.132635 1 sync_worker.go:453] Running sync quay.io/runcom/origin-release:v4.2-1196 (force=true) on generation 2 in state Updating at attempt 0 ... 2019-10-21T10:40:16.168632703Z I1021 10:40:16.168604 1 sync_worker.go:579] Running sync for customresourcedefinition "baremetalhosts.metal3.io" (101 of 432) 2019-10-21T10:40:16.18425522Z I1021 10:40:16.184220 1 task_graph.go:583] Canceled worker 0 2019-10-21T10:40:16.184381244Z I1021 10:40:16.184360 1 task_graph.go:583] Canceled worker 3 ... 2019-10-21T10:40:16.21772875Z I1021 10:40:16.217715 1 task_graph.go:603] Workers finished 2019-10-21T10:40:16.217777479Z I1021 10:40:16.217759 1 task_graph.go:611] Result of work: [] 2019-10-21T10:40:16.217864206Z I1021 10:40:16.217846 1 task_graph.go:539] Stopped graph walker due to cancel ... 2019-10-21T10:43:08.743798997Z I1021 10:43:08.743740 1 sync_worker.go:453] Running sync quay.io/runcom/origin-release:v4.2-1196 (force=true) on generation 2 in state Reconciling at attempt 0 ... where the CVO canceled some workers, saw that there are worker no errors, and decided "upgrade complete" despite never having attempted to push the bulk of its manifests. Without the task_graph.go changes in this commit, the new test fails with: $ go test -run TestRunGraph ./pkg/payload/ --- FAIL: TestRunGraph (1.03s) --- FAIL: TestRunGraph/cancelation_without_task_errors_is_reported (1.00s) task_graph_test.go:910: unexpected error: [] FAIL FAIL github.com/openshift/cluster-version-operator/pkg/payload 1.042s Also change "cancelled" -> "canceled" to match Go's docs [4] and name the other test cases. [1]: openshift#255 (comment) [2]: openshift#260 [3]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1/754/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-origin-4-1-sha256-f8c863ea08d64eea7b3a9ffbbde9c01ca90501afe6c0707e9c35f0ed7e92a9df/namespaces/openshift-cluster-version/pods/cluster-version-operator-5f5d465967-t57b2/cluster-version-operator/cluster-version-operator/logs/current.log [4]: https://golang.org/pkg/context/#pkg-overview

wking · 2019-10-22T18:13:22Z

And 11676c7 -> eaa3d19 to drop the no-longer-needed-because-we-don't-produce-an-error ba6f23e.

smarterclayton · 2019-10-22T18:49:26Z

/hold cancel
/lgtm

We can follow up next week on a slightly better message summarize from syncworker#apply.

openshift-ci-robot · 2019-10-22T18:49:29Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [smarterclayton,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2019-10-22T20:12:52Z

@wking: All pull requests linked via external trackers have merged. Bugzilla bug 1763821 has been moved to the MODIFIED state.

In response to this:

Bug 1763821: pkg/payload/task_graph: Canceling the task graph partway though is an error even if no tasks fail

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-cherrypick-robot · 2019-10-22T20:12:59Z

@wking: new pull request created: #262

In response to this:

/cherrypick release-4.2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

The RunGraph implementation was unchanged since it landed in cb4e037 (payload: Create a task graph that can split a payload into chunks, 2019-01-17, openshift#88), with the exception of later logging and c2ac20f (status: Report the operators that have not yet deployed, 2019-04-09, openshift#158) with the adjusted return type. The old code launched a goroutine for the pushing/reaping, which was an unecessary, and made error reporting on any outstanding tasks more complicated. I'd like to drop the goroutine, but Clayton is not comfortable with backporting that large a change to older releases [1]. And I'd like to be able to return errors like: 1 incomplete task nodes, beginning with b but Clayton thinks these are just "took too long, but we're still making progress" and that they'll resolve on their own in the next attempt or few, and that they're not actual deadlocks where you'd want a better fingerprint to pin down the node(s) that were locking [2]. This commit ensures that when we are canceled we return an error, and it does none of the refactoring we'd need to be able to say whether we had unprocessed nodes (for late cancels, it's possible that we could return "I was canceled" even if we had successfully pushed and reaped all the nodes). This should avoid situations like [3]: 2019-10-21T10:34:30.63940461Z I1021 10:34:30.639073 1 start.go:19] ClusterVersionOperator v1.0.0-106-g0725bd53-dirty ... 2019-10-21T10:34:31.132673574Z I1021 10:34:31.132635 1 sync_worker.go:453] Running sync quay.io/runcom/origin-release:v4.2-1196 (force=true) on generation 2 in state Updating at attempt 0 ... 2019-10-21T10:40:16.168632703Z I1021 10:40:16.168604 1 sync_worker.go:579] Running sync for customresourcedefinition "baremetalhosts.metal3.io" (101 of 432) 2019-10-21T10:40:16.18425522Z I1021 10:40:16.184220 1 task_graph.go:583] Canceled worker 0 2019-10-21T10:40:16.184381244Z I1021 10:40:16.184360 1 task_graph.go:583] Canceled worker 3 ... 2019-10-21T10:40:16.21772875Z I1021 10:40:16.217715 1 task_graph.go:603] Workers finished 2019-10-21T10:40:16.217777479Z I1021 10:40:16.217759 1 task_graph.go:611] Result of work: [] 2019-10-21T10:40:16.217864206Z I1021 10:40:16.217846 1 task_graph.go:539] Stopped graph walker due to cancel ... 2019-10-21T10:43:08.743798997Z I1021 10:43:08.743740 1 sync_worker.go:453] Running sync quay.io/runcom/origin-release:v4.2-1196 (force=true) on generation 2 in state Reconciling at attempt 0 ... where the CVO canceled some workers, saw that there are worker no errors, and decided "upgrade complete" despite never having attempted to push the bulk of its manifests. Without the task_graph.go changes in this commit, the new test fails with: $ go test -run TestRunGraph ./pkg/payload/ --- FAIL: TestRunGraph (1.03s) --- FAIL: TestRunGraph/cancelation_without_task_errors_is_reported (1.00s) task_graph_test.go:910: unexpected error: [] FAIL FAIL github.com/openshift/cluster-version-operator/pkg/payload 1.042s Also change "cancelled" -> "canceled" to match Go's docs [4] and name the other test cases. [1]: openshift#255 (comment) [2]: openshift#260 [3]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1/754/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-origin-4-1-sha256-f8c863ea08d64eea7b3a9ffbbde9c01ca90501afe6c0707e9c35f0ed7e92a9df/namespaces/openshift-cluster-version/pods/cluster-version-operator-5f5d465967-t57b2/cluster-version-operator/cluster-version-operator/logs/current.log [4]: https://golang.org/pkg/context/#pkg-overview

openshift-cherrypick-robot · 2019-10-22T20:13:05Z

@wking: new pull request created: #263

In response to this:

/cherrypick release-4.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

The RunGraph implementation was unchanged since it landed in cb4e037 (payload: Create a task graph that can split a payload into chunks, 2019-01-17, openshift#88), with the exception of later logging and c2ac20f (status: Report the operators that have not yet deployed, 2019-04-09, openshift#158) with the adjusted return type. The old code launched a goroutine for the pushing/reaping, which was an unecessary, and made error reporting on any outstanding tasks more complicated. I'd like to drop the goroutine, but Clayton is not comfortable with backporting that large a change to older releases [1]. And I'd like to be able to return errors like: 1 incomplete task nodes, beginning with b but Clayton thinks these are just "took too long, but we're still making progress" and that they'll resolve on their own in the next attempt or few, and that they're not actual deadlocks where you'd want a better fingerprint to pin down the node(s) that were locking [2]. This commit ensures that when we are canceled we return an error, and it does none of the refactoring we'd need to be able to say whether we had unprocessed nodes (for late cancels, it's possible that we could return "I was canceled" even if we had successfully pushed and reaped all the nodes). This should avoid situations like [3]: 2019-10-21T10:34:30.63940461Z I1021 10:34:30.639073 1 start.go:19] ClusterVersionOperator v1.0.0-106-g0725bd53-dirty ... 2019-10-21T10:34:31.132673574Z I1021 10:34:31.132635 1 sync_worker.go:453] Running sync quay.io/runcom/origin-release:v4.2-1196 (force=true) on generation 2 in state Updating at attempt 0 ... 2019-10-21T10:40:16.168632703Z I1021 10:40:16.168604 1 sync_worker.go:579] Running sync for customresourcedefinition "baremetalhosts.metal3.io" (101 of 432) 2019-10-21T10:40:16.18425522Z I1021 10:40:16.184220 1 task_graph.go:583] Canceled worker 0 2019-10-21T10:40:16.184381244Z I1021 10:40:16.184360 1 task_graph.go:583] Canceled worker 3 ... 2019-10-21T10:40:16.21772875Z I1021 10:40:16.217715 1 task_graph.go:603] Workers finished 2019-10-21T10:40:16.217777479Z I1021 10:40:16.217759 1 task_graph.go:611] Result of work: [] 2019-10-21T10:40:16.217864206Z I1021 10:40:16.217846 1 task_graph.go:539] Stopped graph walker due to cancel ... 2019-10-21T10:43:08.743798997Z I1021 10:43:08.743740 1 sync_worker.go:453] Running sync quay.io/runcom/origin-release:v4.2-1196 (force=true) on generation 2 in state Reconciling at attempt 0 ... where the CVO canceled some workers, saw that there are worker no errors, and decided "upgrade complete" despite never having attempted to push the bulk of its manifests. Without the task_graph.go changes in this commit, the new test fails with: $ go test -run TestRunGraph ./pkg/payload/ --- FAIL: TestRunGraph (1.03s) --- FAIL: TestRunGraph/cancelation_without_task_errors_is_reported (1.00s) task_graph_test.go:910: unexpected error: [] FAIL FAIL github.com/openshift/cluster-version-operator/pkg/payload 1.042s Also change "cancelled" -> "canceled" to match Go's docs [4] and name the other test cases. [1]: openshift#255 (comment) [2]: openshift#260 [3]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1/754/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-origin-4-1-sha256-f8c863ea08d64eea7b3a9ffbbde9c01ca90501afe6c0707e9c35f0ed7e92a9df/namespaces/openshift-cluster-version/pods/cluster-version-operator-5f5d465967-t57b2/cluster-version-operator/cluster-version-operator/logs/current.log [4]: https://golang.org/pkg/context/#pkg-overview

…hout a goroutine The RunGraph implementation was unchanged since it landed in cb4e037 (payload: Create a task graph that can split a payload into chunks, 2019-01-17, openshift#88), with the exception of later logging, c2ac20f (status: Report the operators that have not yet deployed, 2019-04-09, openshift#158) with the adjusted return type, and eaa3d19 (pkg/payload/task_graph: RunGraph error when canceled, 2019-10-21, openshift#255) with its "I was canceled" error backstop. The old code launched a goroutine for the pushing/reaping, which was an unecessary, and made error reporting on any outstanding tasks more complicated. Dropping this goroutine let me get rid of errCh which used to be used to pass errors back to the main goroutine. I also dropped the wg.Wait() goroutine, which used to be part of supporting the 'range errCh' used in the main goroutine to collect failed jobs. But because errCh was filled directly from the worker goroutines, nothing used to be collecting the completeness of graph coverage from the pushing/reaping goroutine. I've also restructured pushing/reaping to be more resistant to locking and spinning. The old implementation had a node-pushing loop which attempted non-blocking pushes, then a result-reaping loop, then some condition checking, then a blocking result reaping attempt (after "we did nothing this round, so we have to wait for more"). The idea with the blocking reap seems to have been that if the pusher/reaper didn't push anything (because workCh was full) and the earlier 'for len(completeCh) > 0' reaping didn't pull anything in (because workers were busy working), that the pusher/reaper should block until it collected a result, in the hopes that worker which returned the result would have cleared a job out of workCh to give the pusher/reaper space to push a new job with the next loop iteration. But if Go decides to give the pusher/reaper more time in the scheduler, it might attempt the next workCh push before the job gets around to being scheduled and popping from workCh. And then the pusher/reaper might trickle down to the blocking reap and wait for another worker (hopefully maxParallelism > 1) to return a result to unblock the pusher/reaper and give it another shot at a workCh push. During this time, the worker that returned earlier is idling with nothing to do. With this commit, when we have a next node to push, we have a single switch statement that blocks until we are either able to push the node or to reap a result. So there's no need for a non-blocking push, and no chance at spinning, although it does mean we need to recalculate the next node after each channel action. When we've been canceled, we stop pushing into workCh, but continue to reap from resultCh until we have no in-flight jobs left. And if we have nothing to push, and there's nothing in-flight to reap, we're obviously done, so that choice is a lot simpler now. I've dropped the "Waiting for workers to complete" log line, because wg.Wait() call should block for much less time now. And because the main RunGraph goroutine is doing the waiting, we no longer need the 'runTask, ok := <-workCh' read to protect against workCh being closed early. With the wg.Wait() now getting called after we have drained all in-flight jobs (previously we launched it immediately after launching workers), there is less call for the "Waiting for..." line. But the most externally noticeable change is that now, if we run without any failing jobs to give us errors, I'm filling in a new "incomplete task nodes" error so folks don't have to look in the CVO logs to see how far the CVO got before hanging. It also allows us to not return the "I was canceled" error in cases where the cancellation happened late enough that we were still able to successfully process the full graph.

…hout a goroutine The RunGraph implementation was unchanged since it landed in cb4e037 (payload: Create a task graph that can split a payload into chunks, 2019-01-17, openshift#88), with the exception of later logging, c2ac20f (status: Report the operators that have not yet deployed, 2019-04-09, openshift#158) with the adjusted return type, and eaa3d19 (pkg/payload/task_graph: RunGraph error when canceled, 2019-10-21, openshift#255) with its "I was canceled" error backstop. The old code launched a goroutine for the pushing/reaping, which was an unecessary, and made error reporting on any outstanding tasks more complicated. Dropping this goroutine let me get rid of errCh which used to be used to pass errors back to the main goroutine. I also dropped the wg.Wait() goroutine, which used to be part of supporting the 'range errCh' used in the main goroutine to collect failed jobs. But because errCh was filled directly from the worker goroutines, nothing used to be collecting the completeness of graph coverage from the pushing/reaping goroutine. I've also restructured pushing/reaping to be more resistant to locking and spinning. The old implementation had a node-pushing loop which attempted non-blocking pushes, then a result-reaping loop, then some condition checking, then a blocking result reaping attempt (after "we did nothing this round, so we have to wait for more"). The idea with the blocking reap seems to have been that if the pusher/reaper didn't push anything (because workCh was full) and the earlier 'for len(completeCh) > 0' reaping didn't pull anything in (because workers were busy working), that the pusher/reaper should block until it collected a result, in the hopes that worker which returned the result would have cleared a job out of workCh to give the pusher/reaper space to push a new job with the next loop iteration. But if Go decides to give the pusher/reaper more time in the scheduler, it might attempt the next workCh push before the job gets around to being scheduled and popping from workCh. And then the pusher/reaper might trickle down to the blocking reap and wait for another worker (hopefully maxParallelism > 1) to return a result to unblock the pusher/reaper and give it another shot at a workCh push. During this time, the worker that returned earlier is idling with nothing to do. With this commit, when we have a next node to push, we have a single switch statement that blocks until we are either able to push the node or to reap a result. So there's no need for a non-blocking push, and no chance at spinning, although it does mean we need to recalculate the next node after each channel action. When we've been canceled, we stop pushing into workCh, but continue to reap from resultCh until we have no in-flight jobs left. And if we have nothing to push, and there's nothing in-flight to reap, we're obviously done, so that choice is a lot simpler now. I've dropped the "Waiting for workers to complete" log line, because wg.Wait() call should block for much less time now. And because the main RunGraph goroutine is doing the waiting, we no longer need the 'runTask, ok := <-workCh' read to protect against workCh being closed early. With the wg.Wait() now getting called after we have drained all in-flight jobs (previously we launched it immediately after launching workers), there is less call for the "Waiting for..." line. But the most externally noticeable change is that now, if we run without any failing jobs to give us errors, I'm filling in a new "incomplete task nodes" error so folks don't have to look in the CVO logs to see how far the CVO got before hanging. It also allows us to not return the "I was canceled" error in cases where the cancellation happened late enough that we were still able to successfully process the full graph. The unit change fills in a less-stubby Kubernetes object, avoiding: --- FAIL: TestRunGraph/cancelation_without_task_errors_is_reported (1.00s) task_graph_test.go:914: error 0 "1 incomplete task nodes, beginning with %!s(PANIC=String method: runtime error: invalid memory address or nil pointer dereference)" doesn't contain "1 incomplete task nodes, beginning with b" when Task.String calls Unstructured.GetName which explodes on the lack of expected Kubernetes-object metadata.

openshift-ci-robot requested review from abhinavdahiya and smarterclayton October 21, 2019 16:52

openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Oct 21, 2019

wking changed the title ~~pkg/payload/task_graph: Canceled workers are errors~~ Bug 1763821: pkg/payload/task_graph: Canceled workers are errors Oct 21, 2019

openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Oct 21, 2019

smarterclayton reviewed Oct 21, 2019

View reviewed changes

pkg/payload/task_graph.go Outdated Show resolved Hide resolved

openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 21, 2019

wking force-pushed the canceled-workers-are-errors branch from 40b7e58 to 5eaad22 Compare October 22, 2019 04:36

openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Oct 22, 2019

wking changed the title ~~Bug 1763821: pkg/payload/task_graph: Canceled workers are errors~~ Bug 1763821: pkg/payload/task_graph: Canceling the task graph partway though is an error even if no tasks fail Oct 22, 2019

openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 22, 2019

openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Oct 22, 2019

openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 22, 2019

wking force-pushed the canceled-workers-are-errors branch from 5eaad22 to f69e58d Compare October 22, 2019 17:31

wking force-pushed the canceled-workers-are-errors branch from f69e58d to 11676c7 Compare October 22, 2019 18:08

openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 22, 2019

wking force-pushed the canceled-workers-are-errors branch from 11676c7 to eaa3d19 Compare October 22, 2019 18:12

openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 22, 2019

openshift-ci-robot assigned smarterclayton Oct 22, 2019

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 22, 2019

abhinavdahiya mentioned this pull request Oct 22, 2019

payload: Task graph might lose some work if cancelled #260

Closed

openshift-merge-robot merged commit 0725460 into openshift:master Oct 22, 2019

openshift-cherrypick-robot mentioned this pull request Oct 22, 2019

Bug 1763822: A timeout in applying an upgrade should not result in the CVO thinking that it is reconciling the payload #262

Merged

openshift-cherrypick-robot mentioned this pull request Oct 22, 2019

Bug 1763823: pkg/payload/task_graph: Canceling the task graph partway though is an error even if no tasks fail #263

Merged

wking deleted the canceled-workers-are-errors branch October 22, 2019 20:15

wking mentioned this pull request Oct 22, 2019

pkg/payload/task_graph: Handle node pushing and result collection without a goroutine #264

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug 1763821: pkg/payload/task_graph: Canceling the task graph partway though is an error even if no tasks fail #255

Bug 1763821: pkg/payload/task_graph: Canceling the task graph partway though is an error even if no tasks fail #255

wking commented Oct 21, 2019

openshift-ci-robot commented Oct 21, 2019

wking commented Oct 21, 2019

openshift-cherrypick-robot commented Oct 21, 2019

wking commented Oct 21, 2019

openshift-cherrypick-robot commented Oct 21, 2019

smarterclayton commented Oct 21, 2019

smarterclayton commented Oct 21, 2019

wking commented Oct 22, 2019

wking commented Oct 22, 2019

wking commented Oct 22, 2019

openshift-ci-robot commented Oct 22, 2019

smarterclayton commented Oct 22, 2019

wking commented Oct 22, 2019

wking commented Oct 22, 2019

smarterclayton commented Oct 22, 2019

openshift-ci-robot commented Oct 22, 2019

openshift-ci-robot commented Oct 22, 2019

openshift-cherrypick-robot commented Oct 22, 2019

openshift-cherrypick-robot commented Oct 22, 2019

Bug 1763821: pkg/payload/task_graph: Canceling the task graph partway though is an error even if no tasks fail #255

Bug 1763821: pkg/payload/task_graph: Canceling the task graph partway though is an error even if no tasks fail #255

Conversation

wking commented Oct 21, 2019

openshift-ci-robot commented Oct 21, 2019

wking commented Oct 21, 2019

openshift-cherrypick-robot commented Oct 21, 2019

wking commented Oct 21, 2019

openshift-cherrypick-robot commented Oct 21, 2019

smarterclayton commented Oct 21, 2019

smarterclayton commented Oct 21, 2019

wking commented Oct 22, 2019

wking commented Oct 22, 2019

wking commented Oct 22, 2019

openshift-ci-robot commented Oct 22, 2019

smarterclayton commented Oct 22, 2019

wking commented Oct 22, 2019

wking commented Oct 22, 2019

smarterclayton commented Oct 22, 2019

openshift-ci-robot commented Oct 22, 2019

openshift-ci-robot commented Oct 22, 2019

openshift-cherrypick-robot commented Oct 22, 2019

openshift-cherrypick-robot commented Oct 22, 2019