Bug 1873900: pkg/payload/task_graph: Avoid deadlocking on cancel with workCh queue #455

Conversation

@wking (Member) commented Sep 15, 2020

Before 55ef3d3 (#264, new in 4.6), RunGraph had a separate goroutine that managed the work queue, with results fed into errCh to be collected by the main RunGraph goroutine. It didn't matter if that work queue goroutine hung; as long as all the worker goroutines exited, RunGraph would collect their errors from errCh and return. In 55ef3d3, I removed the queue goroutine and moved queue management into the main RunGraph goroutine. With that change, we became exposed to the following race:

  1. Main goroutine pushes work into workCh.
  2. Context canceled.
  3. Workers exit via the "Canceled worker..." case, so they don't pick the work out of workCh.
  4. Main goroutine deadlocks because there is work in flight, but nothing in resultCh, and no longer any workers to feed resultCh.

In logs, this looks like "sync-worker goroutine has gone to sleep, and is no longer synchronizing manifests".

With this commit, we drain results when they are available, but we also respect the context to allow the resultCh read to be canceled. When we have been canceled with work in flight, we also attempt a non-blocking read from workCh to drain out anything there that has not yet been picked up by a worker. Because done will not be set true, we'll call getNextNode again and come in with a fresh pass through the for loop. ctx.Err() will no longer be nil, but if the workCh drain worked, we may now have inflight == 0, and we'll end up in the case that sets done true, and break out of the for loop on that round.
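
To make the flow concrete, here is a minimal Go sketch of the drain pattern described above. This is not the actual RunGraph code: the taskgraphsketch package, the node and result types, and the collectOne helper are placeholders invented for this illustration, though workCh, resultCh, and inflight follow the names used in this description.

```go
package taskgraphsketch

import "context"

// Placeholder types standing in for the real task-graph nodes and worker results.
type node struct{ name string }

type result struct {
	node node
	err  error
}

// collectOne waits for one unit of in-flight work to resolve. Normally that is
// a worker pushing to resultCh, but once ctx is canceled the workers may have
// exited without claiming queued nodes, so we also try a non-blocking read
// from workCh to drain them ourselves. The caller decrements its inflight
// counter whenever the second return value is true, and simply loops otherwise.
func collectOne(ctx context.Context, workCh <-chan node, resultCh <-chan result) (result, bool) {
	select {
	case res := <-resultCh:
		// A worker finished a node (possibly reporting ctx.Err()).
		return res, true
	case <-ctx.Done():
		select {
		case n := <-workCh:
			// No worker will ever pick this node up now; report it as canceled
			// so inflight can drop to zero and the outer for loop can set done.
			return result{node: n, err: ctx.Err()}, true
		default:
			// workCh is empty: the remaining in-flight node is held by a worker
			// that will still send its own result, so let the caller loop again.
			return result{}, false
		}
	}
}
```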

@openshift-ci-robot added the bugzilla/severity-urgent label (Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting) on Sep 15, 2020
@openshift-ci-robot (Contributor)

@wking: This pull request references Bugzilla bug 1873900, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

Bug 1873900: pkg/payload/task_graph: Avoid deadlocking on cancel with workCh queue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot added the bugzilla/valid-bug label (Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting) on Sep 15, 2020
@openshift-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files) on Sep 15, 2020
@wking (Member, Author) commented Sep 15, 2020

Upgrade job timed out some pods while creating the release.

/retest

@vrutkovs (Member)

Makes sense to me, I wonder if it can be unit-tested?

@wking (Member, Author) commented Sep 16, 2020

Makes sense to me, I wonder if it can be unit-tested?

We scale workCh by worker parallelism, so... set parallelism to one, create two slow, parallel tasks, and cancel halfway through the first? Or something...
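
As a rough sketch of the shape such a test could take (the actual unit test added later in this PR takes its own approach, described in the commit message below): run the graph in a goroutine and race it against a timer, so a pre-fix deadlock surfaces as a test failure rather than a hung test run. The expectReturns helper and its signature are hypothetical, invented here for illustration.

```go
package taskgraphsketch

import (
	"testing"
	"time"
)

// expectReturns is a hypothetical helper (not from this repository): it runs fn
// in a separate goroutine and fails the test if fn does not return within
// timeout, which is how a cancel-with-queued-work deadlock would show up. For
// the scenario sketched above, fn would wrap a RunGraph call with parallelism
// one and two independent slow tasks, with the first task canceling the shared
// context partway through so the second task is still sitting in workCh.
func expectReturns(t *testing.T, timeout time.Duration, fn func() []error) []error {
	t.Helper()
	done := make(chan []error, 1)
	go func() { done <- fn() }()
	select {
	case errs := <-done:
		return errs
	case <-time.After(timeout):
		t.Fatal("graph run did not return; likely deadlocked on queued, canceled work")
		return nil // unreachable: t.Fatal stops the test
	}
}
```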

@wking force-pushed the context-canceled-locked-task-graph branch from cae6291 to 5982498 on September 16, 2020 at 17:10
@wking (Member, Author) commented Sep 16, 2020

I've added a unit test with cae6291 -> 5982498. 5982498's commit message has a paragraph talking through the test.

@wking force-pushed the context-canceled-locked-task-graph branch from 5982498 to 24e9e91 on September 16, 2020 at 17:14
@wking (Member, Author) commented Sep 16, 2020

Fixed gofmt errors with 5982498 -> 24e9e91.

@wking force-pushed the context-canceled-locked-task-graph branch from 24e9e91 to 622e04f on September 16, 2020 at 17:18
@wking (Member, Author) commented Sep 16, 2020

And moved from two workers to one worker in the unit test with 24e9e91 -> 622e04f, which means we don't have to care about keeping the tasks synchronized enough between the two workers to figure out what the test's expected 'want' value should be.

Before 55ef3d3 (pkg/payload/task_graph: Handle node pushing and
result collection without a goroutine, 2019-10-21, openshift#264), RunGraph had
a separate goroutine that managed the work queue, with results fed
into errCh to be collected by the main RunGraph goroutine.  It didn't
matter if that work queue goroutine hung; as long as all the worker
goroutines exited, RunGraph would collect their errors from errCh and
return.  In 55ef3d3, I removed the queue goroutine and moved queue
management into the main RunGraph goroutine.  With that change, we
became exposed to the following race:

1. Main goroutine pushes work into workCh.
2. Context canceled.
3. Workers exit via the "Canceled worker..." case, so they don't pick
   the work out of workCh.
4. Main goroutine deadlocks because there is work in flight, but
   nothing in resultCh, and no longer any workers to feed resultCh.

In logs, this looks like "sync-worker goroutine has gone to sleep, and
is no longer synchronizing manifests" [1].  There are two mitigating
factors:

a. 'select' docs [2]:

    If one or more of the communications can proceed, a single one
    that can proceed is chosen via a uniform pseudo-random selection.

  So the race's step 3 will happen in about half of the cases where the
  context has been canceled.  In the other half of cases, the worker
  will randomly decide to pick up the queued work, notice it's been
  canceled while processing that work, and return a "context canceled"
  result.

b. We scale workCh by the number of workers.

So the deadlock risk requires enough parallel work to fill the queue
faster than workers are draining it and enough bad luck in the
worker's select that the canceled workers don't drain the queue on
their own.  E.g. with our eight ClusterOperator-precreate workers,
we'd have a 0.5^8 ~= 0.4% chance of not draining a single in-queue
node post-cancel, and a 1 - 0.5^8 ~= 99.6% chance of not draining
eight in-queue nodes post-cancel.

With this commit, we drain results when they are available, but we
also respect the context to allow the resultCh read to be canceled.
When we have been canceled with work in flight, we also attempt a
non-blocking read from workCh to drain out anything there that has not
yet been picked up by a worker.  Because 'done' will not be set true,
we'll call getNextNode again and come in with a fresh pass through the
for loop.  ctx.Err() will no longer be nil, but if the workCh drain
worked, we may now have inflight == 0, and we'll end up in the case
that sets 'done' true, and break out of the for loop on that round.

The unit test sets up two parallel nodes: a and b.  We configure one
worker, which picks up node a.  Node b doesn't block on node a, so it
gets pushed into workCh while the worker grinds through node a.  On
its second task in node a, the worker cancels the run.  Because the
sleeps do not have select-ctx.Done guards, the worker finishes off
that second task, notices the cancel as it enters its third task,
and exits with the "context canceled" error.  This leaves node b stuck
in workCh, and we need the fix from this commit to avoid deadlocking
on that in-flight node.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1873900
[2]: https://golang.org/ref/spec#Select_statements
@wking force-pushed the context-canceled-locked-task-graph branch from 622e04f to 632e763 on September 16, 2020 at 17:36
@wking (Member, Author) commented Sep 16, 2020

I've pushed 622e04f -> 632e763 with some discussion of workCh capacity and select randomness to show that, even with the old code, there were some cases where in-flight work would not have deadlocked. That reasoning is why, before the code fix from this PR, the new unit test would sometimes fail with two 'context canceled' errors instead of deadlocking.
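
To illustrate that select randomness (mitigating factor (a) in the commit message above), here is a stripped-down worker loop, continuing the placeholder node and result types from the earlier sketch in this thread rather than showing the real worker code. Once the context is canceled and a node is sitting in workCh, both cases of the select are ready, so each canceled worker exits without draining the queue roughly half the time, which is where the 0.5^8 figure for eight workers comes from.

```go
// worker is a simplified stand-in for the real worker goroutine (same
// placeholder node/result types and context import as the earlier sketch).
func worker(ctx context.Context, workCh <-chan node, resultCh chan<- result) {
	for {
		select {
		case <-ctx.Done():
			// The "Canceled worker..." case: exit without touching workCh,
			// leaving any queued node for the main goroutine to drain.
			return
		case n := <-workCh:
			// Chosen pseudo-randomly even after cancel; taking the node drains
			// one queued entry and reports it (here short-circuited with
			// ctx.Err() instead of doing the real work) on resultCh.
			resultCh <- result{node: n, err: ctx.Err()}
		}
	}
}
```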

@jottofar (Contributor)

I had the same comment as Vadim: "makes sense to me but would feel better with a test". I see that's done, so... LGTM.

@vrutkovs (Member) left a comment

/lgtm

unit-test seems sane

@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vrutkovs, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the lgtm label (Indicates that a PR is ready to be merged) on Sep 17, 2020
@LalatenduMohanty (Member)

/retest

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

9 similar comments from @openshift-bot

@openshift-merge-robot merged commit 4cd5b39 into openshift:master on Sep 17, 2020
@openshift-ci-robot (Contributor)

@wking: All pull requests linked via external trackers have merged:

Bugzilla bug 1873900 has been moved to the MODIFIED state.

In response to this:

Bug 1873900: pkg/payload/task_graph: Avoid deadlocking on cancel with workCh queue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking deleted the context-canceled-locked-task-graph branch on September 18, 2020 at 03:17