
pkg/payload/task_graph: Handle node pushing and result collection without a goroutine #264

Merged

Conversation

@wking (Member) commented Oct 22, 2019

We want to give #255 time to cook in master before backporting it, so:

/hold

while that happens. Once it does, I think we want this larger refactor in the master branch, even if we don't end up backporting it to earlier z streams.

The old code launched a goroutine for the pushing/reaping, which was unnecessary and made error reporting on any outstanding tasks more complicated. Dropping this goroutine let me get rid of errCh, which used to be used to pass errors back to the main goroutine. I also dropped the wg.Wait() goroutine, which used to be part of supporting the range errCh loop the main goroutine used to collect failed jobs. But because errCh was filled directly from the worker goroutines, nothing was collecting the completeness of graph coverage from the pushing/reaping goroutine.

I've also restructured pushing/reaping to be more resistant to locking and spinning. The old implementation had a node-pushing loop which attempted non-blocking pushes, then a result-reaping loop, then some condition checking, then a blocking result-reaping attempt (after we did nothing this round, so we have to wait for more). The idea with the blocking reap seems to have been that if the pusher/reaper didn't push anything (because workCh was full) and the earlier for len(completeCh) > 0 reaping didn't pull anything in (because workers were busy working), then the pusher/reaper should block until it collected a result, in the hopes that the worker which returned the result would have cleared a job out of workCh, giving the pusher/reaper space to push a new job on the next loop iteration. But if Go decides to give the pusher/reaper more time in the scheduler, it might attempt the next workCh push before the newly freed worker gets around to being scheduled and popping a job from workCh. And then the pusher/reaper might trickle down to the blocking reap and wait for another worker (hopefully maxParallelism > 1) to return a result to unblock it and give it another shot at a workCh push. During this time, the worker that returned earlier is idling with nothing to do.

With this PR, when we have a next node to push, we have a single switch statement that blocks until we are either able to push the node or to reap a result. So there's no need for a non-blocking push, and no chance of spinning, although it does mean we need to recalculate the next node after each channel action. When we've been canceled, we stop pushing into workCh, but continue to reap from resultCh until we have no in-flight jobs left. And if we have nothing to push, and there's nothing in-flight to reap, we're obviously done, so that choice is a lot simpler now.
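
Roughly, the new loop shape looks like the following; this is only a sketch with illustrative names (Node, getNextNode, record), not the exact RunGraph code:

    package taskgraphsketch

    import "context"

    // Node stands in for the real task-graph node type.
    type Node struct{ Name string }

    // pushAndReap sketches the loop described above: while we still have a node
    // to push, block until we can either hand it to a worker or reap a result;
    // once canceled, only reap; when nothing is queued and nothing is in flight,
    // we are done.  getNextNode and record are assumed to share state (e.g. via
    // closures) so that completed results unlock new nodes.
    func pushAndReap(ctx context.Context, getNextNode func() *Node, workCh chan<- *Node, resultCh <-chan error, record func(error)) {
        inflight := 0
        for {
            next := getNextNode() // recalculate after every channel action
            switch {
            case ctx.Err() == nil && next != nil:
                select {
                case workCh <- next: // a worker had room for this node
                    inflight++
                case err := <-resultCh: // or a worker finished something first
                    inflight--
                    record(err)
                }
            case inflight > 0:
                // Canceled, or nothing new to push yet: keep reaping in-flight work.
                record(<-resultCh)
                inflight--
            default:
                // Nothing to push and nothing in flight: done.
                return
            }
        }
    }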

I've dropped the "Waiting for workers to complete" log line, because the wg.Wait() call should block for much less time now: it now runs after we have drained all in-flight jobs (previously we launched the waiter immediately after launching the workers), so there is less call for the "Waiting for..." line. And because the main RunGraph goroutine is doing the waiting, we no longer need the runTask, ok := <-workCh read to protect against workCh being closed early.

But the most externally noticeable change is that now, if the run ends without any failing jobs to give us errors but without having processed the full graph, I fill in a new "incomplete task nodes" error so folks don't have to dig through the CVO logs to see how far the CVO got before hanging. It also lets us skip the "I was canceled" error in cases where the cancellation happened late enough that we were still able to successfully process the full graph.
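
Sketched out, the backstop amounts to something like the following; the names are illustrative, and the error text follows the format visible in the unit-test output quoted in the commit message below:

    package taskgraphsketch

    import "fmt"

    // backstopErrors sketches the "incomplete task nodes" backstop described
    // above: errs holds the collected worker errors, incomplete lists the nodes
    // that were never fully processed.
    func backstopErrors(errs []error, incomplete []string) []error {
        if len(errs) == 0 && len(incomplete) > 0 {
            errs = append(errs, fmt.Errorf("%d incomplete task nodes, beginning with %s",
                len(incomplete), incomplete[0]))
        }
        return errs
    }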

@openshift-ci-robot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) on Oct 22, 2019
@openshift-ci-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files.) and size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.) labels on Oct 22, 2019
@wking (Member, Author) commented Oct 29, 2019

#255 has cooked for a week, and so has the 4.2 backport #262. The upgrade failure was an install-time "error simulating policy: Throttling: Rate exceeded", which is unrelated to this PR.

/hold cancel
/retest

@openshift-ci-robot removed the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) on Oct 29, 2019
@wking force-pushed the run-graph-goroutine-refactor branch from fe98f76 to 437c38c on December 16, 2019 21:56
@wking (Member, Author) commented Dec 16, 2019

Rebased around #201 with fe98f76 -> 437c38c.

@wking force-pushed the run-graph-goroutine-refactor branch 2 times, most recently from c674859 to 5ee7d5b on June 1, 2020 22:41
@wking (Member, Author) commented Jun 1, 2020

Rebased again with c674859 -> 5ee7d5b, because I'd forgotten I'd rebased this last week 🤦

@wking force-pushed the run-graph-goroutine-refactor branch from 5ee7d5b to 7f95e2e on June 1, 2020 23:21
}
}

cancelFn()
Contributor

The runner is:

  • waiting on ctx
  • waiting on work from workCh

It also passes along the ctx to the sync task fn.

Maybe what we should do is: the runner only loops on workCh for work, and passes along the ctx to the task sync fn so that we can terminate in-progress work. Here we should just close workCh, as there is no longer any work left to do.

That simplifies/compartmentalizes what's used for what.

Member Author

From the Go spec:

Sending to or closing a closed channel causes a run-time panic.

So if we manually call close(workCh) here, then this deferred call will panic, right? Maybe Go ignores panics from deferred functions, but still, I personally prefer having the worker watching both nestedCtx and workCh.
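
Roughly, that worker pattern looks like the following sketch (illustrative only; Node is the placeholder type from the earlier sketch, and runTask stands in for the real task sync fn):

    package taskgraphsketch

    import (
        "context"
        "sync"
    )

    // worker sketches the preferred pattern: watch both the context and workCh,
    // so nothing ever has to close(workCh) and risk a double-close panic.
    func worker(ctx context.Context, wg *sync.WaitGroup, workCh <-chan *Node,
        resultCh chan<- error, runTask func(context.Context, *Node) error) {
        defer wg.Done()
        for {
            select {
            case <-ctx.Done():
                // Canceled: stop pulling work; the main goroutine keeps reaping
                // whatever is already in flight.
                return
            case node := <-workCh:
                // ctx is still passed along so in-progress work can stop early.
                resultCh <- runTask(ctx, node)
            }
        }
    }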

@wking force-pushed the run-graph-goroutine-refactor branch from 7f95e2e to f6271d0 on June 2, 2020 02:32
wking added a commit to wking/cluster-version-operator that referenced this pull request Jun 2, 2020
The old format is from cb4e037 (payload: Create a task graph that
can split a payload into chunks, 2019-01-17, openshift#88), but Abhinav points
out that "on %d" could use a bit more context [1].

[1]: openshift#264 (comment)
pkg/payload/task_graph: Handle node pushing and result collection without a goroutine

The RunGraph implementation was unchanged since it landed in
cb4e037 (payload: Create a task graph that can split a payload into
chunks, 2019-01-17, openshift#88), with the exception of later logging,
c2ac20f (status: Report the operators that have not yet deployed,
2019-04-09, openshift#158) with the adjusted return type, and eaa3d19
(pkg/payload/task_graph: RunGraph error when canceled, 2019-10-21, openshift#255)
with its "I was canceled" error backstop.

The old code launched a goroutine for the pushing/reaping, which was
unnecessary and made error reporting on any outstanding tasks more
complicated.  Dropping this goroutine let me get rid of errCh which
used to be used to pass errors back to the main goroutine.  I also
dropped the wg.Wait() goroutine, which used to be part of supporting
the 'range errCh' used in the main goroutine to collect failed jobs.
But because errCh was filled directly from the worker goroutines,
nothing used to be collecting the completeness of graph coverage from
the pushing/reaping goroutine.

I've also restructured pushing/reaping to be more resistant to locking
and spinning.  The old implementation had a node-pushing loop which
attempted non-blocking pushes, then a result-reaping loop, then some
condition checking, then a blocking result reaping attempt (after "we
did nothing this round, so we have to wait for more").  The idea with
the blocking reap seems to have been that if the pusher/reaper didn't
push anything (because workCh was full) and the earlier 'for
len(completeCh) > 0' reaping didn't pull anything in (because workers
were busy working), that the pusher/reaper should block until it
collected a result, in the hopes that the worker which returned the
result would have cleared a job out of workCh to give the pusher/reaper
space to push a new job with the next loop iteration.  But if Go decides to
give the pusher/reaper more time in the scheduler, it might attempt
the next workCh push before the job gets around to being scheduled and
popping from workCh.  And then the pusher/reaper might trickle down to
the blocking reap and wait for another worker (hopefully
maxParallelism > 1) to return a result to unblock the pusher/reaper
and give it another shot at a workCh push.  During this time, the
worker that returned earlier is idling with nothing to do.

With this commit, when we have a next node to push, we have a single
switch statement that blocks until we are either able to push the node
or to reap a result.  So there's no need for a non-blocking push, and
no chance at spinning, although it does mean we need to recalculate
the next node after each channel action.  When we've been canceled, we
stop pushing into workCh, but continue to reap from resultCh until we
have no in-flight jobs left.  And if we have nothing to push, and
there's nothing in-flight to reap, we're obviously done, so that
choice is a lot simpler now.

I've dropped the "Waiting for workers to complete" log line, because
wg.Wait() call should block for much less time now.  And because the
main RunGraph goroutine is doing the waiting, we no longer need the
'runTask, ok := <-workCh' read to protect against workCh being closed
early.  With the wg.Wait() now getting called after we have drained
all in-flight jobs (previously we launched it immediately after
launching workers), there is less call for the "Waiting for..." line.

But the most externally noticeable change is that now, if we run
without any failing jobs to give us errors, I'm filling in a new
"incomplete task nodes" error so folks don't have to look in the CVO
logs to see how far the CVO got before hanging.  It also allows us to
not return the "I was canceled" error in cases where the cancellation
happened late enough that we were still able to successfully process
the full graph.

The unit change fills in a less-stubby Kubernetes object, avoiding:

  --- FAIL: TestRunGraph/cancelation_without_task_errors_is_reported (1.00s)
      task_graph_test.go:914: error 0 "1 incomplete task nodes, beginning with %!s(PANIC=String method: runtime error: invalid memory address or nil pointer dereference)" doesn't contain "1 incomplete task nodes, beginning with b"

when Task.String calls Unstructured.GetName which explodes on the lack
of expected Kubernetes-object metadata.
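
A minimal example of the kind of "less-stubby" object involved, assuming the usual apimachinery unstructured type (the PR's actual test fixture may differ):

    package taskgraphsketch

    import "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"

    // lessStubbyObject builds an Unstructured with real metadata, so a String()
    // helper that calls GetName has something to read instead of blowing up on
    // a metadata-free stub.
    func lessStubbyObject() *unstructured.Unstructured {
        return &unstructured.Unstructured{Object: map[string]interface{}{
            "apiVersion": "v1",
            "kind":       "ConfigMap",
            "metadata": map[string]interface{}{
                "name": "b",
            },
        }}
    }
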
@wking force-pushed the run-graph-goroutine-refactor branch from f6271d0 to 55ef3d3 on June 9, 2020 04:21
@wking (Member, Author) commented Jun 9, 2020

Rebased around 30bdaa8 (#377) with f6271d0 -> 55ef3d3.

@wking (Member, Author) commented Jun 10, 2020

/retest

@abhinavdahiya (Contributor)

/lgtm

@openshift-ci-robot added the lgtm label (Indicates that a PR is ready to be merged.) on Jun 10, 2020
@jottofar (Contributor)

/lgtm

@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, jottofar, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,jottofar,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot merged commit fc25a6f into openshift:master on Jun 12, 2020
@wking deleted the run-graph-goroutine-refactor branch on June 12, 2020 15:57
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 15, 2020
Before 55ef3d3 (pkg/payload/task_graph: Handle node pushing and
result collection without a goroutine, 2019-10-21, openshift#264), RunGraph had
a separate goroutine that managed the work queue, with results fed
into errCh to be collected by the main RunGraph goroutine.  It didn't
matter if that work queue goroutine hung; as long as all the worker
goroutines exited, RunGraph would collect their errors from errCh and
return.  In 55ef3d3, I removed the queue goroutine and moved queue
management into the main RunGraph goroutine.  With that change, we
became exposed to the following race:

1. Main goroutine pushes work into workCh.
2. Context canceled.
3. Workers exit via the "Canceled worker..." case, so they don't pick
   the work out of workCh.
4. Main goroutine deadlocks because there is work in flight, but
   nothing in resultCh, and no longer any workers to feed resultCh.

In logs, this looks like "sync-worker goroutine has gone to sleep, and
is no longer synchronizing manifests" [1].

With this commit, we drain results when they are available, but we
also respect the context to allow the resultCh read to be canceled.
When we have been canceled with work in flight, we also attempt a
non-blocking read from workCh to drain out anything there that has not
yet been picked up by a worker.  Because 'done' will not be set true,
we'll call getNextNode again and come in with a fresh pass through the
for loop.  ctx.Err() will no longer be nil, but if the workCh drain
worked, we may now have inflight == 0, and we'll end up in the case
that sets 'done' true, and break out of the for loop on that round.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1873900
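
Sketched against the earlier push/reap loop, the fix described above looks roughly like this (illustrative names, not the actual RunGraph code; Node is the placeholder type from the earlier sketch):

    package taskgraphsketch

    import "context"

    // reapOrDrain sketches the "canceled, but work still in flight" handling:
    // block on resultCh while respecting cancellation, and once canceled try a
    // non-blocking read from workCh to account for any node no worker will ever
    // pick up.  It returns how many in-flight jobs this pass settled, so the
    // caller's inflight count can reach zero.
    func reapOrDrain(ctx context.Context, workCh <-chan *Node, resultCh <-chan error, record func(error)) int {
        select {
        case err := <-resultCh:
            record(err)
            return 1
        case <-ctx.Done():
            select {
            case <-workCh:
                // Drained a queued node that no worker picked up; it will be
                // reported later as an incomplete task node.
                return 1
            default:
                return 0 // nothing to drain yet; a worker is still finishing up
            }
        }
    }
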
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 16, 2020
Before 55ef3d3 (pkg/payload/task_graph: Handle node pushing and
result collection without a goroutine, 2019-10-21, openshift#264), RunGraph had
a separate goroutine that managed the work queue, with results fed
into errCh to be collected by the main RunGraph goroutine.  It didn't
matter if that work queue goroutine hung; as long as all the worker
goroutines exited, RunGraph would collect their errors from errCh and
return.  In 55ef3d3, I removed the queue goroutine and moved queue
management into the main RunGraph goroutine.  With that change, we
became exposed to the following race:

1. Main goroutine pushes work into workCh.
2. Context canceled.
3. Workers exit via the "Canceled worker..." case, so they don't pick
   the work out of workCh.
4. Main goroutine deadlocks because there is work in flight, but
   nothing in resultCh, and no longer any workers to feed resultCh.

In logs, this looks like "sync-worker goroutine has gone to sleep, and
is no longer synchronizing manifests" [1].

With this commit, we drain results when they are available, but we
also respect the context to allow the resultCh read to be canceled.
When we have been canceled with work in flight, we also attempt a
non-blocking read from workCh to drain out anything there that has not
yet been picked up by a worker.  Because 'done' will not be set true,
we'll call getNextNode again and come in with a fresh pass through the
for loop.  ctx.Err() will no longer be nil, but if the workCh drain
worked, we may now have inflight == 0, and we'll end up in the case
that sets 'done' true, and break out of the for loop on that round.

The unit test sets up three parallel nodes, a, b, and c.  We configure
two workers, so the first picks up node a while the second picks up
node b.  Node c doesn't block on either, so it gets pushed into workCh
while the workers grind through a and b.  When both workers are
halfway through their second task in their respective nodes, the 'a'
worker cancels the run.  Because the sleeps do not have
select-ctx.Done guards, the workers each finish off that second task,
notice the cancel as they enter their third task, and exit with the
"context canceled" error.  This leaves node c stuck in workCh, and we
need the fix from this commit to avoid deadlocking on that in-flight
node.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1873900
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 16, 2020
Before 55ef3d3 (pkg/payload/task_graph: Handle node pushing and
result collection without a goroutine, 2019-10-21, openshift#264), RunGraph had
a separate goroutine that managed the work queue, with results fed
into errCh to be collected by the main RunGraph goroutine.  It didn't
matter if that work queue goroutine hung; as long as all the worker
goroutines exited, RunGraph would collect their errors from errCh and
return.  In 55ef3d3, I removed the queue goroutine and moved queue
management into the main RunGraph goroutine.  With that change, we
became exposed to the following race:

1. Main goroutine pushes work into workCh.
2. Context canceled.
3. Workers exit via the "Canceled worker..." case, so they don't pick
   the work out of workCh.
4. Main goroutine deadlocks because there is work in flight, but
   nothing in resultCh, and no longer any workers to feed resultCh.

In logs, this looks like "sync-worker goroutine has gone to sleep, and
is no longer synchronizing manifests" [1].

With this commit, we drain results when they are available, but we
also respect the context to allow the resultCh read to be canceled.
When we have been canceled with work in flight, we also attempt a
non-blocking read from workCh to drain out anything there that has not
yet been picked up by a worker.  Because 'done' will not be set true,
we'll call getNextNode again and come in with a fresh pass through the
for loop.  ctx.Err() will no longer be nil, but if the workCh drain
worked, we may now have inflight == 0, and we'll end up in the case
that sets 'done' true, and break out of the for loop on that round.

The unit test sets up two parallel nodes: a and b.  We configure one
worker, which picks up node a.  Node b doesn't block on node a, so it
gets pushed into workCh while the worker grinds through node a.  On
its second task in node a, the worker cancels the run.  Because the
sleeps do not have select-ctx.Done guards, the worker finishes off
that second task, notices the cancel as they enter their third task,
and exits with the "context canceled" error.  This leaves node b stuck
in workCh, and we need the fix from this commit to avoid deadlocking
on that in-flight node.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1873900
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 16, 2020
Before 55ef3d3 (pkg/payload/task_graph: Handle node pushing and
result collection without a goroutine, 2019-10-21, openshift#264), RunGraph had
a separate goroutine that managed the work queue, with results fed
into errCh to be collected by the main RunGraph goroutine.  It didn't
matter if that work queue goroutine hung; as long as all the worker
goroutines exited, RunGraph would collect their errors from errCh and
return.  In 55ef3d3, I removed the queue goroutine and moved queue
management into the main RunGraph goroutine.  With that change, we
became exposed to the following race:

1. Main goroutine pushes work into workCh.
2. Context canceled.
3. Workers exit via the "Canceled worker..." case, so they don't pick
   the work out of workCh.
4. Main goroutine deadlocks because there is work in flight, but
   nothing in resultCh, and no longer any workers to feed resultCh.

In logs, this looks like "sync-worker goroutine has gone to sleep, and
is no longer synchronizing manifests" [1].  There are two mitigating
factors:

a. 'select' docs [2]:

    If one or more of the communications can proceed, a single one
    that can proceed is chosen via a uniform pseudo-random selection.

  So the races step 3 will happen in about half of the cases where the
  context has been canceled.  In the other half of cases, the worker
  will randomly decide to pick up the queued work, notice it's been
  canceled while processing that work, and return a "context canceled"
  result.

b. We scale workCh by the number of workers.

So the deadlock risk requires enough parallel work to fill the queue
faster than workers are draining it and enough bad luck in the
worker's select that the canceled workers don't drain the queue on
their own.  E.g. with our eight ClusterOperator-precreate workers,
we'd have an 0.5^8 ~= 0.4% chance of not draining a single in-queue
node post-cancel, and a 1 - 0.5^8 ~= 99.6% chance of not draining
eight in-queue nodes post-cancel.

With this commit, we drain results when they are available, but we
also respect the context to allow the resultCh read to be canceled.
When we have been canceled with work in flight, we also attempt a
non-blocking read from workCh to drain out anything there that has not
yet been picked up by a worker.  Because 'done' will not be set true,
we'll call getNextNode again and come in with a fresh pass through the
for loop.  ctx.Err() will no longer be nil, but if the workCh drain
worked, we may now have inflight == 0, and we'll end up in the case
that sets 'done' true, and break out of the for loop on that round.

The unit test sets up two parallel nodes: a and b.  We configure one
worker, which picks up node a.  Node b doesn't block on node a, so it
gets pushed into workCh while the worker grinds through node a.  On
its second task in node a, the worker cancels the run.  Because the
sleeps do not have select-ctx.Done guards, the worker finishes off
that second task, notices the cancel as they enter their third task,
and exits with the "context canceled" error.  This leaves node b stuck
in workCh, and we need the fix from this commit to avoid deadlocking
on that in-flight node.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1873900
[2]: https://golang.org/ref/spec#Select_statements