
Parallelize the payload execution #88

Merged
merged 2 commits into openshift:master from plan_payload on Feb 5, 2019

Conversation

smarterclayton
Contributor

@smarterclayton smarterclayton commented Jan 18, 2019

Parallelize payload execution using two simple rules:

  1. A Job in the payload acts as a serialization barrier (all previous work must complete before it runs).
  2. Two manifests can run in parallel if:
    a. both of their names match 0000_<NUMBER>_<COMPONENT>_*,
    b. their run-level numbers match, and
    c. their components do not match.

We build a graph, then greedily execute sync tasks, starting a node as soon as all of its prerequisites are satisfied. We never run manifests under the same 0000_<NUMBER>_<COMPONENT> prefix in parallel.

These rules cut the critical path to roughly 120 sequential tasks, down from 227.
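
To make the naming rule concrete, here is a minimal Go sketch of rule 2; the regexp and the helper names (runLevelAndComponent, canRunInParallel) are illustrative assumptions, not the CVO's actual implementation:

  package main

  import (
      "fmt"
      "regexp"
  )

  // Hypothetical pattern for 0000_<NUMBER>_<COMPONENT>_<rest>.
  var manifestName = regexp.MustCompile(`^0000_(\d+)_([^_]+)_`)

  // runLevelAndComponent extracts the run level and component from a
  // manifest filename, reporting whether the name matched at all.
  func runLevelAndComponent(name string) (level, component string, ok bool) {
      m := manifestName.FindStringSubmatch(name)
      if m == nil {
          return "", "", false
      }
      return m[1], m[2], true
  }

  // canRunInParallel applies rule 2: both names parse, the run levels
  // match, and the components differ.
  func canRunInParallel(a, b string) bool {
      la, ca, okA := runLevelAndComponent(a)
      lb, cb, okB := runLevelAndComponent(b)
      return okA && okB && la == lb && ca != cb
  }

  func main() {
      // Same run level (70), different components: parallel.
      fmt.Println(canRunInParallel(
          "0000_70_cluster-network-operator_01_crd.yaml",
          "0000_70_dns-operator_00_namespace.yaml"))
      // Same component: stays serial.
      fmt.Println(canRunInParallel(
          "0000_70_dns-operator_00_namespace.yaml",
          "0000_70_dns-operator_01_crd.yaml"))
  }

With only these two checks, any pair of manifests at the same run level but from different components becomes a candidate for the same parallel chunk.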

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 18, 2019
@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 18, 2019
@openshift-ci-robot openshift-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 23, 2019
@openshift-ci-robot openshift-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Feb 1, 2019
@openshift-ci-robot openshift-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 2, 2019
@smarterclayton
Contributor Author

First run... I'm so excited! Excited = excited to see the flaming pile of rubble after the CVO melts an apiserver.

pkg/cvo/sync_worker.go (resolved review thread)
@smarterclayton
Contributor Author

github/prow/openshift you suck

/test all

@smarterclayton smarterclayton force-pushed the plan_payload branch 6 times, most recently from 471f8ac to a94ff05 on February 4, 2019 05:59
@smarterclayton smarterclayton changed the title wip: Parallelize the payload execution Parallelize the payload execution Feb 4, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 4, 2019
@smarterclayton
Contributor Author

/assign @abhinavdahiya

This is ready for review; please look especially at the tests. I've done a number of long -count runs on the graph traversal, but I'm sure there are tests missing.
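
(For context, a long -count run simply repeats the graph-traversal tests many times to shake out ordering and concurrency flakes; the exact test selector and count below are only an example.)

  $ go test -run TestRunGraph -count=500 ./pkg/payload/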

pkg/start/start.go (resolved review thread)
RateLimiter will gate how fast the payload runs in parallel. Previously
each unique client got its own rate limiter, which prevented us from
reasoning about total write load.
The payload is viewed as a graph of chunks of tasks that must be
executed serially (roughly corresponding to the manifests defined
by each operator). Using two verbs, Split -- which ensures that
certain tasks establish a happens-before/happens-after boundary for
the entire graph -- and Parallelize -- which tries to run manifests
from different operators at the same run level (0000_70_*) in
parallel -- we roughly halve the linear depth of the payload and
ensure that later operators run in parallel while preserving
ordering for the core operators.

Remove requeueing because it can change the order of tasks within
a job, which could cause confusion about the possible orderings of
actions in the future.

Future variants might more aggressively parallelize tasks.

Refactor the sync worker to use the task graph with a default
parallelism of 8.
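
A rough sketch of the "graph of chunks" idea from this message follows; Task, Node, and splitByRunLevelAndComponent are assumptions for illustration, not the real payload types:

  package payloadsketch

  // Task stands in for a single manifest to apply.
  type Task struct{ Manifest string }

  // Node is a chunk of tasks that must be executed in order; nodes with
  // no edge between them may run in parallel, and a Job barrier would be
  // expressed by making later nodes depend on everything before it.
  type Node struct {
      Tasks []Task
      After []*Node // prerequisites
  }

  // splitByRunLevelAndComponent groups an ordered manifest list into one
  // node per (run level, component) pair; tasks whose names do not parse
  // stay in their own single-task node so their original order is kept.
  func splitByRunLevelAndComponent(tasks []Task, key func(Task) (level, component string, ok bool)) []*Node {
      var nodes []*Node
      index := map[string]*Node{}
      for _, t := range tasks {
          level, component, ok := key(t)
          if !ok {
              nodes = append(nodes, &Node{Tasks: []Task{t}})
              continue
          }
          k := level + "/" + component
          n, seen := index[k]
          if !seen {
              n = &Node{}
              index[k] = n
              nodes = append(nodes, n)
          }
          n.Tasks = append(n.Tasks, t)
      }
      return nodes
  }

The sync worker would then walk nodes like these with a bounded worker pool, which is where the "default parallelism of 8" above comes in.
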
@smarterclayton
Contributor Author

Responded to or fixed everything.

@abhinavdahiya
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 5, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,smarterclayton]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 6611416 into openshift:master Feb 5, 2019
wking added a commit to wking/cluster-version-operator that referenced this pull request Oct 22, 2019
…hout a goroutine

The RunGraph implementation was unchanged since it landed in
cb4e037 (payload: Create a task graph that can split a payload into
chunks, 2019-01-17, openshift#88), with the exception of later logging and
c2ac20f (status: Report the operators that have not yet deployed,
2019-04-09, openshift#158) with the adjusted return type.

The old code launched a goroutine for the pushing/reaping, which was
unnecessary and made error reporting on any outstanding tasks more
complicated.  Dropping this goroutine let me get rid of errCh, which
used to be used to pass errors back to the main goroutine.  I also
dropped the wg.Wait() goroutine, which used to be part of supporting
the 'range errCh' used in the main goroutine to collect failed jobs.
But because errCh was filled directly from the worker goroutines,
nothing used to be collecting the completeness of graph coverage from
the pushing/reaping goroutine.

I've also restructured pushing/reaping to be more resistant to locking
and spinning.  The old implementation had a node-pushing loop which
attempted non-blocking pushes, then a result-reaping loop, then some
condition checking, then a blocking result reaping attempt (after "we
did nothing this round, so we have to wait for more").  The idea with
the blocking reap seems to have been that if the pusher/reaper didn't
push anything (because workCh was full) and the earlier 'for
len(completeCh) > 0' reaping didn't pull anything in (because workers
were busy working), that the pusher/reaper should block until it
collected a result, in the hope that the worker which returned the result
would have cleared a job out of workCh to give the pusher/reaper space
to push a new job with the next loop iteration.  But if Go decides to
give the pusher/reaper more time in the scheduler, it might attempt
the next workCh push before the job gets around to being scheduled and
popping from workCh.  And then the pusher/reaper might trickle down to
the blocking reap and wait for another worker (hopefully
maxParallelism > 1) to return a result to unblock the pusher/reaper
and give it another shot at a workCh push.  During this time, the
worker that returned earlier is idling with nothing to do.

With this commit, when we have a next node to push, we have a single
switch statement that blocks until we are either able to push the node
or to reap a result.  So there's no need for a non-blocking push, and
no chance at spinning, although it does mean we need to recalculate
the next node after each channel action.  When we've been canceled, we
stop pushing into workCh, but continue to reap from resultCh until we
have no in-flight jobs left.  And if we have nothing to push, and
there's nothing in-flight to reap, we're obviously done, so that
choice is a lot simpler now.

I've dropped the "Waiting for workers to complete" log line, because
the wg.Wait() call should block for much less time now.  And because the
main RunGraph goroutine is doing the waiting, we no longer need the
'runTask, ok := <-workCh' read to protect against workCh being closed
early.  With the wg.Wait() now getting called after we have drained
all in-flight jobs (previously we launched it immediately after
launching workers), there is less call for the "Waiting for..." line.

But the most externally noticeable change is that now, if we run
without any failing jobs to give us errors, I'm filling in a new
"incomplete task nodes" error and, if we were canceled, the context
error.  This should avoid situations like [1]:

  2019-10-21T10:34:30.63940461Z I1021 10:34:30.639073       1 start.go:19] ClusterVersionOperator v1.0.0-106-g0725bd53-dirty
  ...
  2019-10-21T10:34:31.132673574Z I1021 10:34:31.132635       1 sync_worker.go:453] Running sync quay.io/runcom/origin-release:v4.2-1196 (force=true) on generation 2 in state Updating at attempt 0
  ...
  2019-10-21T10:40:16.168632703Z I1021 10:40:16.168604       1 sync_worker.go:579] Running sync for customresourcedefinition "baremetalhosts.metal3.io" (101 of 432)
  2019-10-21T10:40:16.18425522Z I1021 10:40:16.184220       1 task_graph.go:583] Canceled worker 0
  2019-10-21T10:40:16.184381244Z I1021 10:40:16.184360       1 task_graph.go:583] Canceled worker 3
  ...
  2019-10-21T10:40:16.21772875Z I1021 10:40:16.217715       1 task_graph.go:603] Workers finished
  2019-10-21T10:40:16.217777479Z I1021 10:40:16.217759       1 task_graph.go:611] Result of work: []
  2019-10-21T10:40:16.217864206Z I1021 10:40:16.217846       1 task_graph.go:539] Stopped graph walker due to cancel
  ...
  2019-10-21T10:43:08.743798997Z I1021 10:43:08.743740       1 sync_worker.go:453] Running sync quay.io/runcom/origin-release:v4.2-1196 (force=true) on generation 2 in state Reconciling at attempt 0
  ...

where the CVO canceled some workers, saw that there were no worker
errors, and decided "upgrade complete" despite never having attempted
to push the bulk of its manifests.

Without the task_graph.go changes in this commit, the new test fails
with:

  $ go1.12.9 test -run TestRunGraph  ./pkg/payload/
  --- FAIL: TestRunGraph (1.03s)
      --- FAIL: TestRunGraph/cancelation_without_task_errors_is_reported (1.00s)
          task_graph_test.go:910: unexpected error: []
  FAIL
  FAIL		github.com/openshift/cluster-version-operator/pkg/payload				1.042s

Also change "cancelled" -> "canceled" to match Go's docs [2] and name
the other test cases.

[1]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1/754/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-origin-4-1-sha256-f8c863ea08d64eea7b3a9ffbbde9c01ca90501afe6c0707e9c35f0ed7e92a9df/namespaces/openshift-cluster-version/pods/cluster-version-operator-5f5d465967-t57b2/cluster-version-operator/cluster-version-operator/logs/current.log
[2]: https://golang.org/pkg/context/#pkg-overview
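
The shape of that single blocking select can be sketched roughly like this (result, scheduler, and the method names are assumptions for illustration; the real RunGraph differs in detail):

  package taskgraphsketch

  import "context"

  // result pairs a finished node with its error, if any.
  type result struct {
      node int
      err  error
  }

  // scheduler is whatever tracks the dependency graph.
  type scheduler interface {
      next() (node int, ok bool) // peek a node whose prerequisites are all complete
      pushed(node int)           // record that a node was handed to a worker
      done(node int)             // record that a node finished
  }

  // runGraphSketch pushes runnable nodes and reaps worker results with a
  // single blocking select, so neither side can spin or starve the other.
  func runGraphSketch(ctx context.Context, s scheduler, workCh chan<- int, resultCh <-chan result) []error {
      var errs []error
      inflight := 0
      canceled := false
      reap := func(r result) {
          inflight--
          s.done(r.node)
          if r.err != nil {
              errs = append(errs, r.err)
          }
      }
      for {
          next, ok := 0, false
          if !canceled {
              next, ok = s.next() // recalculate after every channel action
          }
          if !ok {
              if inflight == 0 {
                  // Nothing to push, nothing in flight: done.  (The real
                  // change also reports incomplete nodes or cancellation here.)
                  return errs
              }
              reap(<-resultCh) // only reaping is possible right now
              continue
          }
          // Block until we can either push the next node or reap a result.
          select {
          case workCh <- next:
              s.pushed(next)
              inflight++
          case r := <-resultCh:
              reap(r)
          case <-ctx.Done():
              canceled = true // stop pushing; keep draining in-flight work
          }
      }
  }
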
wking added a commit to wking/cluster-version-operator that referenced this pull request Oct 22, 2019
The RunGraph implementation was unchanged since it landed in
cb4e037 (payload: Create a task graph that can split a payload into
chunks, 2019-01-17, openshift#88), with the exception of later logging and
c2ac20f (status: Report the operators that have not yet deployed,
2019-04-09, openshift#158) with the adjusted return type.

The old code launched a goroutine for the pushing/reaping, which was
unnecessary and made error reporting on any outstanding tasks more
complicated.  I'd like to drop the goroutine, but Clayton is not
comfortable with backporting that large a change to older releases
[1].  Instead, this commit adds the pusher/reaper to the WaitGroup and
adjusts the push/reap logic to:

* Stop pushing work if we've been canceled.
* Stick around to collect any in-flight jobs, instead of exiting
  after the pushing completes.
* Drop the 'ok' fallback in the completeCh read.  Now that we're part
  of the WaitGroup, with the wg.Wait goroutine blocking on the
  pusher/reaper exiting, and the main RunGraph goroutine's 'range
  errCh' blocking on the wg.Wait goroutine, we don't have to worry
  about completeCh being closed when the pusher/reaper is reading from
  it.
* Drop the 'unreachable' business.  The earlier push/reap loop should
  continue trying to push work until we are canceled or there is
  nothing left to push.  We should never be outside that loop and
  still have jobs left that we want to push.
* Write to errCh with "here's the first node that was incomplete" and
  possibly "and I was canceled" errors if there are no more-useful
  errors from the worker nodes.  This check happens after we've
  drained in-flight jobs, so we don't have to worry about new errors
  coming out of workers at this point.

This should avoid situations like [2]:

  2019-10-21T10:34:30.63940461Z I1021 10:34:30.639073       1 start.go:19] ClusterVersionOperator v1.0.0-106-g0725bd53-dirty
  ...
  2019-10-21T10:34:31.132673574Z I1021 10:34:31.132635       1 sync_worker.go:453] Running sync quay.io/runcom/origin-release:v4.2-1196 (force=true) on generation 2 in state Updating at attempt 0
  ...
  2019-10-21T10:40:16.168632703Z I1021 10:40:16.168604       1 sync_worker.go:579] Running sync for customresourcedefinition "baremetalhosts.metal3.io" (101 of 432)
  2019-10-21T10:40:16.18425522Z I1021 10:40:16.184220       1 task_graph.go:583] Canceled worker 0
  2019-10-21T10:40:16.184381244Z I1021 10:40:16.184360       1 task_graph.go:583] Canceled worker 3
  ...
  2019-10-21T10:40:16.21772875Z I1021 10:40:16.217715       1 task_graph.go:603] Workers finished
  2019-10-21T10:40:16.217777479Z I1021 10:40:16.217759       1 task_graph.go:611] Result of work: []
  2019-10-21T10:40:16.217864206Z I1021 10:40:16.217846       1 task_graph.go:539] Stopped graph walker due to cancel
  ...
  2019-10-21T10:43:08.743798997Z I1021 10:43:08.743740       1 sync_worker.go:453] Running sync quay.io/runcom/origin-release:v4.2-1196 (force=true) on generation 2 in state Reconciling at attempt 0
  ...

where the CVO canceled some workers, saw that there were no worker
errors, and decided "upgrade complete" despite never having attempted
to push the bulk of its manifests.

Without the task_graph.go changes in this commit, the new test fails
with:

  $ go test -run TestRunGraph  ./pkg/payload/
  --- FAIL: TestRunGraph (1.03s)
      --- FAIL: TestRunGraph/cancelation_without_task_errors_is_reported (1.00s)
          task_graph_test.go:910: unexpected error: []
  FAIL
  FAIL		github.com/openshift/cluster-version-operator/pkg/payload				1.042s

Also change "cancelled" -> "canceled" to match Go's docs [3] and name
the other test cases.

[1]: openshift#255 (comment)
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1/754/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-origin-4-1-sha256-f8c863ea08d64eea7b3a9ffbbde9c01ca90501afe6c0707e9c35f0ed7e92a9df/namespaces/openshift-cluster-version/pods/cluster-version-operator-5f5d465967-t57b2/cluster-version-operator/cluster-version-operator/logs/current.log
[3]: https://golang.org/pkg/context/#pkg-overview
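
For readers who want the shape of this variant in code, a compact sketch follows; the function and its parameters are assumptions, not the backported patch itself:

  package taskgraphsketch

  import "sync"

  // runWithWaitGroup shows the pattern: workers and the pusher share one
  // WaitGroup, a helper goroutine closes errCh only after every member
  // exits, and the caller simply ranges over errCh to collect failures.
  func runWithWaitGroup(jobs []string, workers int, run func(string) error) []error {
      workCh := make(chan string)
      errCh := make(chan error, len(jobs))
      var wg sync.WaitGroup

      for i := 0; i < workers; i++ {
          wg.Add(1)
          go func() {
              defer wg.Done()
              for job := range workCh {
                  if err := run(job); err != nil {
                      errCh <- err
                  }
              }
          }()
      }

      // The pusher joins the same WaitGroup, mirroring the change
      // described above; errCh stays open until it has exited too.
      wg.Add(1)
      go func() {
          defer wg.Done()
          for _, job := range jobs {
              workCh <- job
          }
          close(workCh)
      }()

      go func() {
          wg.Wait()
          close(errCh)
      }()

      var errs []error
      for err := range errCh {
          errs = append(errs, err)
      }
      return errs
  }
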
wking added a commit to wking/cluster-version-operator that referenced this pull request Oct 22, 2019
The RunGraph implementation was unchanged since it landed in
cb4e037 (payload: Create a task graph that can split a payload into
chunks, 2019-01-17, openshift#88), with the exception of later logging and
c2ac20f (status: Report the operators that have not yet deployed,
2019-04-09, openshift#158) with the adjusted return type.

The old code launched a goroutine for the pushing/reaping, which was
unnecessary and made error reporting on any outstanding tasks more
complicated.  I'd like to drop the goroutine, but Clayton is not
comfortable with backporting that large a change to older releases
[1].  And I'd like to be able to return errors like:

  1 incomplete task nodes, beginning with b

but Clayton thinks these are just "took too long, but we're still
making progress" and that they'll resolve on their own in the next
attempt or few, and that they're not actual deadlocks where you'd want
a better fingerprint to pin down the node(s) that were locking [2].

This commit ensures that when we are canceled we return an error, and
it does none of the refactoring we'd need to be able to say whether we
had unprocessed nodes (for late cancels, it's possible that we could
return "I was canceled" even if we had successfully pushed and reaped
all the nodes).  This should avoid situations like [3]:

  2019-10-21T10:34:30.63940461Z I1021 10:34:30.639073       1 start.go:19] ClusterVersionOperator v1.0.0-106-g0725bd53-dirty
  ...
  2019-10-21T10:34:31.132673574Z I1021 10:34:31.132635       1 sync_worker.go:453] Running sync quay.io/runcom/origin-release:v4.2-1196 (force=true) on generation 2 in state Updating at attempt 0
  ...
  2019-10-21T10:40:16.168632703Z I1021 10:40:16.168604       1 sync_worker.go:579] Running sync for customresourcedefinition "baremetalhosts.metal3.io" (101 of 432)
  2019-10-21T10:40:16.18425522Z I1021 10:40:16.184220       1 task_graph.go:583] Canceled worker 0
  2019-10-21T10:40:16.184381244Z I1021 10:40:16.184360       1 task_graph.go:583] Canceled worker 3
  ...
  2019-10-21T10:40:16.21772875Z I1021 10:40:16.217715       1 task_graph.go:603] Workers finished
  2019-10-21T10:40:16.217777479Z I1021 10:40:16.217759       1 task_graph.go:611] Result of work: []
  2019-10-21T10:40:16.217864206Z I1021 10:40:16.217846       1 task_graph.go:539] Stopped graph walker due to cancel
  ...
  2019-10-21T10:43:08.743798997Z I1021 10:43:08.743740       1 sync_worker.go:453] Running sync quay.io/runcom/origin-release:v4.2-1196 (force=true) on generation 2 in state Reconciling at attempt 0
  ...

where the CVO canceled some workers, saw that there were no worker
errors, and decided "upgrade complete" despite never having attempted
to push the bulk of its manifests.

Without the task_graph.go changes in this commit, the new test fails
with:

  $ go test -run TestRunGraph  ./pkg/payload/
  --- FAIL: TestRunGraph (1.03s)
      --- FAIL: TestRunGraph/cancelation_without_task_errors_is_reported (1.00s)
          task_graph_test.go:910: unexpected error: []
  FAIL
  FAIL		github.com/openshift/cluster-version-operator/pkg/payload				1.042s

Also change "cancelled" -> "canceled" to match Go's docs [4] and name
the other test cases.

[1]: openshift#255 (comment)
[2]: openshift#260
[3]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1/754/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-origin-4-1-sha256-f8c863ea08d64eea7b3a9ffbbde9c01ca90501afe6c0707e9c35f0ed7e92a9df/namespaces/openshift-cluster-version/pods/cluster-version-operator-5f5d465967-t57b2/cluster-version-operator/cluster-version-operator/logs/current.log
[4]: https://golang.org/pkg/context/#pkg-overview
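
In code, the backstop this commit describes amounts to something like the following (finishErrors is a hypothetical helper name, not the actual change):

  package taskgraphsketch

  import "context"

  // finishErrors surfaces cancellation when the workers themselves
  // reported nothing, so an interrupted run cannot be mistaken for a
  // completed one.
  func finishErrors(ctx context.Context, errs []error) []error {
      if len(errs) == 0 && ctx.Err() != nil {
          errs = append(errs, ctx.Err())
      }
      return errs
  }
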
wking added a commit to wking/cluster-version-operator that referenced this pull request Oct 22, 2019
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Oct 22, 2019
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Oct 22, 2019
wking added a commit to wking/cluster-version-operator that referenced this pull request Oct 22, 2019
…hout a goroutine

The RunGraph implementation was unchanged since it landed in
cb4e037 (payload: Create a task graph that can split a payload into
chunks, 2019-01-17, openshift#88), with the exception of later logging,
c2ac20f (status: Report the operators that have not yet deployed,
2019-04-09, openshift#158) with the adjusted return type, and eaa3d19
(pkg/payload/task_graph: RunGraph error when canceled, 2019-10-21, openshift#255)
with its "I was canceled" error backstop.

The old code launched a goroutine for the pushing/reaping, which was
unnecessary and made error reporting on any outstanding tasks more
complicated.  Dropping this goroutine let me get rid of errCh, which
used to be used to pass errors back to the main goroutine.  I also
dropped the wg.Wait() goroutine, which used to be part of supporting
the 'range errCh' used in the main goroutine to collect failed jobs.
But because errCh was filled directly from the worker goroutines,
nothing used to be collecting the completeness of graph coverage from
the pushing/reaping goroutine.

I've also restructured pushing/reaping to be more resistant to locking
and spinning.  The old implementation had a node-pushing loop which
attempted non-blocking pushes, then a result-reaping loop, then some
condition checking, then a blocking result reaping attempt (after "we
did nothing this round, so we have to wait for more").  The idea with
the blocking reap seems to have been that if the pusher/reaper didn't
push anything (because workCh was full) and the earlier 'for
len(completeCh) > 0' reaping didn't pull anything in (because workers
were busy working), that the pusher/reaper should block until it
collected a result, in the hope that the worker which returned the result
would have cleared a job out of workCh to give the pusher/reaper space
to push a new job with the next loop iteration.  But if Go decides to
give the pusher/reaper more time in the scheduler, it might attempt
the next workCh push before the job gets around to being scheduled and
popping from workCh.  And then the pusher/reaper might trickle down to
the blocking reap and wait for another worker (hopefully
maxParallelism > 1) to return a result to unblock the pusher/reaper
and give it another shot at a workCh push.  During this time, the
worker that returned earlier is idling with nothing to do.

With this commit, when we have a next node to push, we have a single
switch statement that blocks until we are either able to push the node
or to reap a result.  So there's no need for a non-blocking push, and
no chance at spinning, although it does mean we need to recalculate
the next node after each channel action.  When we've been canceled, we
stop pushing into workCh, but continue to reap from resultCh until we
have no in-flight jobs left.  And if we have nothing to push, and
there's nothing in-flight to reap, we're obviously done, so that
choice is a lot simpler now.

I've dropped the "Waiting for workers to complete" log line, because
the wg.Wait() call should block for much less time now.  And because the
main RunGraph goroutine is doing the waiting, we no longer need the
'runTask, ok := <-workCh' read to protect against workCh being closed
early.  With the wg.Wait() now getting called after we have drained
all in-flight jobs (previously we launched it immediately after
launching workers), there is less call for the "Waiting for..." line.

But the most externally noticeable change is that now, if we run
without any failing jobs to give us errors, I'm filling in a new
"incomplete task nodes" error so folks don't have to look in the CVO
logs to see how far the CVO got before hanging.  It also allows us to
not return the "I was canceled" error in cases where the cancellation
happened late enough that we were still able to successfully process
the full graph.
wking added a commit to wking/cluster-version-operator that referenced this pull request Dec 16, 2019
wking added a commit to wking/cluster-version-operator that referenced this pull request May 28, 2020
wking added a commit to wking/cluster-version-operator that referenced this pull request Jun 1, 2020
wking added a commit to wking/cluster-version-operator that referenced this pull request Jun 1, 2020
…hout a goroutine

The RunGraph implementation was unchanged since it landed in
cb4e037 (payload: Create a task graph that can split a payload into
chunks, 2019-01-17, openshift#88), with the exception of later logging,
c2ac20f (status: Report the operators that have not yet deployed,
2019-04-09, openshift#158) with the adjusted return type, and eaa3d19
(pkg/payload/task_graph: RunGraph error when canceled, 2019-10-21, openshift#255)
with its "I was canceled" error backstop.

The old code launched a goroutine for the pushing/reaping, which was
unnecessary and made error reporting on any outstanding tasks more
complicated.  Dropping this goroutine let me get rid of errCh, which
used to be used to pass errors back to the main goroutine.  I also
dropped the wg.Wait() goroutine, which used to be part of supporting
the 'range errCh' used in the main goroutine to collect failed jobs.
But because errCh was filled directly from the worker goroutines,
nothing used to be collecting the completeness of graph coverage from
the pushing/reaping goroutine.

I've also restructured pushing/reaping to be more resistant to locking
and spinning.  The old implementation had a node-pushing loop which
attempted non-blocking pushes, then a result-reaping loop, then some
condition checking, then a blocking result reaping attempt (after "we
did nothing this round, so we have to wait for more").  The idea with
the blocking reap seems to have been that if the pusher/reaper didn't
push anything (because workCh was full) and the earlier 'for
len(completeCh) > 0' reaping didn't pull anything in (because workers
were busy working), that the pusher/reaper should block until it
collected a result, in the hope that the worker which returned the result
would have cleared a job out of workCh to give the pusher/reaper space
to push a new job with the next loop iteration.  But if Go decides to
give the pusher/reaper more time in the scheduler, it might attempt
the next workCh push before the job gets around to being scheduled and
popping from workCh.  And then the pusher/reaper might trickle down to
the blocking reap and wait for another worker (hopefully
maxParallelism > 1) to return a result to unblock the pusher/reaper
and give it another shot at a workCh push.  During this time, the
worker that returned earlier is idling with nothing to do.

With this commit, when we have a next node to push, we have a single
switch statement that blocks until we are either able to push the node
or to reap a result.  So there's no need for a non-blocking push, and
no chance at spinning, although it does mean we need to recalculate
the next node after each channel action.  When we've been canceled, we
stop pushing into workCh, but continue to reap from resultCh until we
have no in-flight jobs left.  And if we have nothing to push, and
there's nothing in-flight to reap, we're obviously done, so that
choice is a lot simpler now.

I've dropped the "Waiting for workers to complete" log line, because
the wg.Wait() call should block for much less time now.  And because the
main RunGraph goroutine is doing the waiting, we no longer need the
'runTask, ok := <-workCh' read to protect against workCh being closed
early.  With the wg.Wait() now getting called after we have drained
all in-flight jobs (previously we launched it immediately after
launching workers), there is less call for the "Waiting for..." line.

But the most externally noticeable change is that now, if we run
without any failing jobs to give us errors, I'm filling in a new
"incomplete task nodes" error so folks don't have to look in the CVO
logs to see how far the CVO got before hanging.  It also allows us to
not return the "I was canceled" error in cases where the cancellation
happened late enough that we were still able to successfully process
the full graph.

The unit change fills in a less-stubby Kubernetes object, avoiding:

  --- FAIL: TestRunGraph/cancelation_without_task_errors_is_reported (1.00s)
      task_graph_test.go:914: error 0 "1 incomplete task nodes, beginning with %!s(PANIC=String method: runtime error: invalid memory address or nil pointer dereference)" doesn't contain "1 incomplete task nodes, beginning with b"

when Task.String calls Unstructured.GetName which explodes on the lack
of expected Kubernetes-object metadata.
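
The "less-stubby Kubernetes object" fix can be pictured like this; the object contents are illustrative, the point is simply that the unstructured test manifest carries metadata for Task.String to print:

  package payloadsketch

  import "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"

  // stubManifest builds a test manifest with enough metadata that
  // name/namespace lookups succeed instead of hitting a nil pointer.
  func stubManifest(name string) *unstructured.Unstructured {
      return &unstructured.Unstructured{Object: map[string]interface{}{
          "apiVersion": "v1",
          "kind":       "ConfigMap",
          "metadata": map[string]interface{}{
              "namespace": "default",
              "name":      name,
          },
      }}
  }
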
wking added a commit to wking/cluster-version-operator that referenced this pull request Jun 2, 2020
wking added a commit to wking/cluster-version-operator that referenced this pull request Jun 2, 2020
The old format is from cb4e037 (payload: Create a task graph that
can split a payload into chunks, 2019-01-17, openshift#88), but Abhinav points
out that "on %d" could use a bit more context [1].

[1]: openshift#264 (comment)
wking added a commit to wking/cluster-version-operator that referenced this pull request Jun 9, 2020
wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 17, 2020
…ic discussion

The regular expression that the cluster-version operator actually uses
is unchanged since it landed in cb4e037 (payload: Create a task
graph that can split a payload into chunks, 2019-01-17, openshift#88).  But the
regular expression given in the docs was over-simplified (e.g. it
allowed the component to begin with a dash, while the actual
expression forbids component-leading dashes).  Instead of talking
about the specifics of the regular expression, just explain the
different portions in the docs.
wking added a commit to wking/cluster-version-operator that referenced this pull request Apr 24, 2024
cb4e037 (payload: Create a task graph that can split a payload into
chunks, 2019-01-17, openshift#88) dropped the last consumer.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.