Bug 1873900: pkg/payload/task_graph: Avoid deadlocking on cancel with workCh queue #455
Conversation
@wking: This pull request references Bugzilla bug 1873900, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Upgrade job timed out some pods while creating the release. /retest
Makes sense to me, I wonder if it can be unit-tested?
We scale …
Force-pushed from cae6291 to 5982498.
Force-pushed from 5982498 to 24e9e91.
Fixed gofmt errors with 5982498 -> 24e9e91.
Force-pushed from 24e9e91 to 622e04f.
Before 55ef3d3 (pkg/payload/task_graph: Handle node pushing and result collection without a goroutine, 2019-10-21, openshift#264), RunGraph had a separate goroutine that managed the work queue, with results fed into errCh to be collected by the main RunGraph goroutine. It didn't matter if that work queue goroutine hung; as long as all the worker goroutines exited, RunGraph would collect their errors from errCh and return. In 55ef3d3, I removed the queue goroutine and moved queue management into the main RunGraph goroutine. With that change, we became exposed to the following race:

1. Main goroutine pushes work into workCh.
2. Context canceled.
3. Workers exit via the "Canceled worker..." case, so they don't pick the work out of workCh.
4. Main goroutine deadlocks because there is work in flight, but nothing in resultCh, and no longer any workers to feed resultCh.

In logs, this looks like "sync-worker goroutine has gone to sleep, and is no longer synchronizing manifests" [1]. There are two mitigating factors:

a. 'select' docs [2]: If one or more of the communications can proceed, a single one that can proceed is chosen via a uniform pseudo-random selection. So the race's step 3 will happen in about half of the cases where the context has been canceled. In the other half of cases, the worker will randomly decide to pick up the queued work, notice it's been canceled while processing that work, and return a "context canceled" result.

b. We scale workCh by the number of workers. So the deadlock risk requires enough parallel work to fill the queue faster than workers are draining it and enough bad luck in the workers' selects that the canceled workers don't drain the queue on their own. E.g. with our eight ClusterOperator-precreate workers, we'd have a 0.5^8 ~= 0.4% chance of not draining a single in-queue node post-cancel, and a 1 - 0.5^8 ~= 99.6% chance of not draining eight in-queue nodes post-cancel.

With this commit, we drain results when they are available, but we also respect the context to allow the resultCh read to be canceled. When we have been canceled with work in flight, we also attempt a non-blocking read from workCh to drain out anything there that has not yet been picked up by a worker. Because 'done' will not be set true, we'll call getNextNode again and come in with a fresh pass through the for loop. ctx.Err() will no longer be nil, but if the workCh drain worked, we may now have inflight == 0, and we'll end up in the case that sets 'done' true, and break out of the for loop on that round.

The unit test sets up two parallel nodes: a and b. We configure one worker, which picks up node a. Node b doesn't block on node a, so it gets pushed into workCh while the worker grinds through node a. On its second task in node a, the worker cancels the run. Because the sleeps do not have select-ctx.Done guards, the worker finishes off that second task, notices the cancel as it enters its third task, and exits with the "context canceled" error. This leaves node b stuck in workCh, and we need the fix from this commit to avoid deadlocking on that in-flight node.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1873900
[2]: https://golang.org/ref/spec#Select_statements
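For readers who want to see the shape of the fix, here is a minimal, self-contained sketch of the pattern the commit message describes: a queue-managing main loop that, once canceled with work in flight, still drains resultCh and attempts a non-blocking read from workCh so inflight can reach zero. The function name runNodes, the node type, and the channel names are assumptions for illustration; this is not the actual RunGraph code from this PR.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runNodes queues each node for a pool of workers and collects one result per
// queued node. The structure is an assumption modeled on the commit message,
// not the actual RunGraph implementation.
func runNodes(ctx context.Context, nodes []string, workers int) []error {
	workCh := make(chan string, workers)
	resultCh := make(chan error)

	for i := 0; i < workers; i++ {
		go func() {
			for {
				select {
				case <-ctx.Done():
					// "Canceled worker" exit: may leave queued nodes in workCh.
					return
				case <-workCh:
					time.Sleep(10 * time.Millisecond) // simulate processing a node
					resultCh <- ctx.Err()             // nil on success, non-nil if canceled mid-node
				}
			}
		}()
	}

	var results []error
	inflight, next := 0, 0
	for {
		// Push runnable nodes until the queue is full, the graph is exhausted,
		// or the run has been canceled.
		for pushed := true; pushed && next < len(nodes) && ctx.Err() == nil; {
			select {
			case workCh <- nodes[next]:
				next++
				inflight++
			default:
				pushed = false
			}
		}

		if inflight == 0 && (next >= len(nodes) || ctx.Err() != nil) {
			return results // nothing in flight and nothing left to schedule
		}

		select {
		case err := <-resultCh:
			inflight--
			results = append(results, err)
		case <-ctx.Done():
			// Canceled with work in flight: drain a node no worker picked up,
			// so inflight can still reach zero instead of deadlocking.
			select {
			case <-workCh:
				inflight--
				results = append(results, ctx.Err())
			default:
				// Queue is empty, so a worker holds every in-flight node and
				// will still report it; wait for that result.
				err := <-resultCh
				inflight--
				results = append(results, err)
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	time.AfterFunc(15*time.Millisecond, cancel) // cancel mid-run

	errs := runNodes(ctx, []string{"a", "b", "c", "d", "e", "f"}, 2)
	fmt.Printf("collected %d results without deadlocking\n", len(errs))
}
```

Without the inner non-blocking workCh read, a node left in the queue after cancellation would keep inflight above zero forever, which is the deadlock the bug report describes.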
Force-pushed from 622e04f to 632e763.
I've pushed 622e04f -> 632e763 with some discussion of …
I had the same comment as Vadim - "make sense to me but would feel better with a test". I see that's done so...LGTM.
/lgtm
unit-test seems sane
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: vrutkovs, wking. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest
/retest Please review the full test history for this PR and help us cut down flakes.
@wking: All pull requests linked via external trackers have merged:
Bugzilla bug 1873900 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Before 55ef3d3 (#264, new in 4.6), RunGraph had a separate goroutine that managed the work queue, with results fed into errCh to be collected by the main RunGraph goroutine. It didn't matter if that work queue goroutine hung; as long as all the worker goroutines exited, RunGraph would collect their errors from errCh and return. In 55ef3d3, I removed the queue goroutine and moved queue management into the main RunGraph goroutine. With that change, we became exposed to the following race:

1. Main goroutine pushes work into workCh.
2. Context canceled.
3. Workers exit via the "Canceled worker..." case, so they don't pick the work out of workCh.
4. Main goroutine deadlocks because there is work in flight, but nothing in resultCh, and no longer any workers to feed resultCh.

In logs, this looks like "sync-worker goroutine has gone to sleep, and is no longer synchronizing manifests".

With this commit, we drain results when they are available, but we also respect the context to allow the resultCh read to be canceled. When we have been canceled with work in flight, we also attempt a non-blocking read from workCh to drain out anything there that has not yet been picked up by a worker. Because done will not be set true, we'll call getNextNode again and come in with a fresh pass through the for loop. ctx.Err() will no longer be nil, but if the workCh drain worked, we may now have inflight == 0, and we'll end up in the case that sets done true, and break out of the for loop on that round.
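Step 3 above is probabilistic because Go's select chooses uniformly at random among ready cases, as the commit message notes. A tiny standalone program (illustration only, not code from this repository) shows that a canceled worker facing both a ready ctx.Done() and a queued node drains the node only about half the time:

```go
package main

import (
	"context"
	"fmt"
)

func main() {
	const trials = 100000
	drained := 0
	for i := 0; i < trials; i++ {
		ctx, cancel := context.WithCancel(context.Background())
		cancel() // the run has already been canceled (step 2 of the race)

		workCh := make(chan string, 1)
		workCh <- "node-b" // queued work that no worker has picked up yet

		// Both cases are ready, so select picks one pseudo-randomly.
		select {
		case <-ctx.Done():
			// "Canceled worker..." path: the queued node stays stranded in workCh.
		case <-workCh:
			drained++ // this worker happened to drain the queued node anyway
		}
	}
	fmt.Printf("drained the queued node in %d of %d trials (~50%%)\n", drained, trials)
}
```

With several nodes stranded in the queue at cancel time, the odds that the canceled workers drain all of them on their own shrink quickly, which is why the main loop's explicit workCh drain is needed.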