
UPSTREAM: 58107: Fix quota controller worker deadlock #18080

Merged
merged 1 commit into openshift:master on Jan 16, 2018

Conversation

ironcladlou
Contributor

The resource quota controller worker pool can deadlock when:

  • Worker goroutines are idle waiting for work from queues
  • The Sync() method detects discovery updates to apply

The problem is that workers acquire a read lock while idle, making write lock
acquisition dependent on the presence of work in the queues.

The Sync() method blocks on a pending write lock acquisition and won't unblock
until every existing worker processes one item from its queue and releases
its read lock. While Sync()'s write lock is pending, all new read lock
acquisitions block; a worker that does process an item and release its lock
then blocks on its next read lock acquisition, stuck behind Sync(). This can
easily deadlock all the workers processing from one queue while any workers on
the other queue remain blocked waiting for work.

Fix the deadlock by refactoring workers to acquire a read lock after work is
popped from the queue. This allows writers to get locks while workers are idle,
while preserving the worker pause semantics necessary to allow safe sync.

@openshift-ci-robot openshift-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jan 11, 2018
@openshift-merge-robot openshift-merge-robot added the vendor-update Touching vendor dir or related files label Jan 11, 2018
@deads2k
Contributor

deads2k commented Jan 11, 2018

@ironcladlou does cluster resource quota suffer from the same problem?

@ironcladlou
Contributor Author

@deads2k

does cluster resource quota suffer from the same problem?

Just looked at it... in theory yes, but I'm not yet seeing where any goroutine is spun up to call ClusterQuotaReconcilationController.Sync.

That's now three controllers using the same worker pool methodology/code...

@deads2k
Contributor

deads2k commented Jan 11, 2018

That's now three controllers using the same worker pool methodology/code...

The worker pool is very common. The lock is less common.
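For reference, the pause semantics that the less-common lock provides look roughly like the sketch below, extending the controller sketch above; the applyDiscoveryUpdates name and resync callback are illustrative, not the upstream code.

```go
// Illustrative writer side of the same workerLock. With the old ordering,
// where workers held RLock while parked in queue.Get(), this Lock() could
// wait forever on workers that were idle with empty queues.
func (c *controller) applyDiscoveryUpdates(resync func()) {
	c.workerLock.Lock()         // every worker is now paused between items
	defer c.workerLock.Unlock() // workers resume once the resync completes

	resync() // e.g. replace monitors for newly discovered resource types
}
```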

@ironcladlou
Contributor Author

/retest

4 similar comments
@ironcladlou
Contributor Author

/retest

@mfojtik
Contributor

mfojtik commented Jan 12, 2018

/retest

@ironcladlou
Contributor Author

/retest

@ironcladlou
Contributor Author

/retest

@deads2k
Contributor

deads2k commented Jan 16, 2018

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 16, 2018
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, ironcladlou

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 16, 2018
@openshift-merge-robot
Contributor

Automatic merge from submit-queue (batch tested with PRs 17976, 17195, 18093, 18080, 17922).

@openshift-merge-robot openshift-merge-robot merged commit 3a20d59 into openshift:master Jan 16, 2018