UPSTREAM: 58107: Fix quota controller worker deadlock #18080
The resource quota controller worker pool can deadlock when:

* Worker goroutines are idle waiting for work from queues
* The Sync() method detects discovery updates to apply

The problem is that workers acquire a read lock while idle, making write lock acquisition dependent upon the presence of work in the queues.

The Sync() method blocks on a pending write lock acquisition and won't unblock until every existing worker processes one item from its queue and releases its read lock. While the Sync() method's lock is pending, all new read lock acquisitions will block; if a worker does process work and release its lock, it will then block on its next read lock acquisition. In other words, the workers become blocked on Sync(). This can easily deadlock all the workers processing from one queue while any workers on the other queue remain blocked waiting for work.

Fix the deadlock by refactoring workers to acquire a read lock *after* work is popped from the queue. This allows writers to get locks while workers are idle, while preserving the worker pause semantics necessary to allow safe sync.
@ironcladlou does clusterresource quota suffer from the same problem?
Just looked at it... in theory yes, but I'm not yet seeing where any goroutine is spun up to call That's now three controllers using the same worker pool methodology/code...
The worker pool is very common. The lock is less common.
/retest
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: deads2k, ironcladlou
Automatic merge from submit-queue (batch tested with PRs 17976, 17195, 18093, 18080, 17922).