
Bug 1873900: pkg/payload/task_graph: Avoid deadlocking on cancel with workCh queue #455

Merged
pkg/payload/task_graph: Avoid deadlocking on cancel with workCh queue
Before 55ef3d3 (pkg/payload/task_graph: Handle node pushing and
result collection without a goroutine, 2019-10-21, #264), RunGraph had
a separate goroutine that managed the work queue, with results fed
into errCh to be collected by the main RunGraph goroutine.  It didn't
matter if that work queue goroutine hung; as long as all the worker
goroutines exited, RunGraph would collect their errors from errCh and
return.  In 55ef3d3, I removed the queue goroutine and moved queue
management into the main RunGraph goroutine.  With that change, we
became exposed to the following race:

1. Main goroutine pushes work into workCh.
2. Context canceled.
3. Workers exit via the "Canceled worker..." case, so they don't pick
   the work out of workCh.
4. Main goroutine deadlocks because there is work in flight, but
   nothing in resultCh, and no longer any workers to feed resultCh.

In logs, this looks like "sync-worker goroutine has gone to sleep, and
is no longer synchronizing manifests" [1].  There are two mitigating
factors:

a. 'select' docs [2]:

    If one or more of the communications can proceed, a single one
    that can proceed is chosen via a uniform pseudo-random selection.

  So the race's step 3 will happen in about half of the cases where
  the context has been canceled.  In the other half of cases, the
  worker will randomly decide to pick up the queued work, notice it's
  been canceled while processing that work, and return a "context
  canceled" result.

b. We scale workCh by the number of workers.

So the deadlock risk requires enough parallel work to fill the queue
faster than workers are draining it and enough bad luck in the
worker's select that the canceled workers don't drain the queue on
their own.  E.g. with our eight ClusterOperator-precreate workers,
we'd have a 0.5^8 ~= 0.4% chance of leaving a single in-queue node
undrained post-cancel, and a 1 - 0.5^8 ~= 99.6% chance of leaving at
least one of eight in-queue nodes undrained post-cancel.

With this commit, we drain results when they are available, but we
also respect the context to allow the resultCh read to be canceled.
When we have been canceled with work in flight, we also attempt a
non-blocking read from workCh to drain out anything there that has not
yet been picked up by a worker.  Because 'done' will not be set true,
we'll call getNextNode again and come in with a fresh pass through the
for loop.  ctx.Err() will no longer be nil, but if the workCh drain
worked, we may now have inflight == 0, and we'll end up in the case
that sets 'done' true, and break out of the for loop on that round.

The unit test sets up two parallel nodes: a and b.  We configure one
worker, which picks up node a.  Node b doesn't block on node a, so it
gets pushed into workCh while the worker grinds through node a.  On
its second task in node a, the worker cancels the run.  Because the
sleeps do not have select-ctx.Done guards, the worker finishes off
that second task, notices the cancel as it enters its third task,
and exits with the "context canceled" error.  This leaves node b stuck
in workCh, and we need the fix from this commit to avoid deadlocking
on that in-flight node.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1873900
[2]: https://golang.org/ref/spec#Select_statements
wking committed Sep 16, 2020
commit 632e763fa209c3e1bd07098c2e33599f8c663e3e
15 changes: 12 additions & 3 deletions pkg/payload/task_graph.go
@@ -507,9 +507,18 @@ func RunGraph(ctx context.Context, graph *TaskGraph, maxParallelism int, fn func
 			case <-ctx.Done():
 			}
 		case inflight > 0: // no work available to push; collect results
-			result := <-resultCh
-			results[result.index] = &result
-			inflight--
+			select {
+			case result := <-resultCh:
+				results[result.index] = &result
+				inflight--
+			case <-ctx.Done():
+				select {
+				case runTask := <-workCh: // workers canceled, so remove any work from the queue ourselves
+					inflight--
+					submitted[runTask.index] = false
+				default:
+				}
+			}
 		default: // no work to push and nothing in flight. We're done
 			done = true
 		}
20 changes: 20 additions & 0 deletions pkg/payload/task_graph_test.go
@@ -831,6 +831,26 @@ func TestRunGraph(t *testing.T) {
 				}
 			},
 		},
+		{
+			name: "mid-task cancellation with work in queue does not deadlock",
+			nodes: []*TaskNode{
+				{Tasks: tasks("a1", "a2", "a3")},
+				{Tasks: tasks("b")},
+			},
+			sleep:    time.Millisecond,
+			parallel: 1,
+			errorOn: func(t *testing.T, name string, ctx context.Context, cancelFn func()) error {
+				if err := ctx.Err(); err != nil {
+					return err
+				}
+				if name == "a2" {
+					cancelFn()
+				}
+				return nil
+			},
+			want:     []string{"a1", "a2"},
+			wantErrs: []string{"context canceled"},
+		},
 		{
 			name: "task errors in parallel nodes both reported",
 			nodes: []*TaskNode{