Use Readers-Writer lock for resource group #17131

Draft · wants to merge 1 commit into master

Conversation


@abhiseksaikia (Contributor) commented on Dec 22, 2021

There are endpoints like /v1/queryState that take the resource group lock while fetching pathToRoot. This leads to lock contention when a few thousand queries are queued on a cluster. We can use a readers-writer lock to reduce this contention.
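
To illustrate the intent, here is a minimal sketch, not the PR's actual diff (in the real change the root group appears to own the lock, per the hunk below; the `ResourceGroup` shape, `parent` field, and `enqueueQuery` stub here are assumed for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch only: per-group lock used here for brevity.
public class ResourceGroup
{
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final ResourceGroup parent; // null for the root group

    public ResourceGroup(ResourceGroup parent)
    {
        this.parent = parent;
    }

    // Read-only traversal: many concurrent /v1/queryState requests can hold
    // the read lock at once instead of serializing on a single exclusive lock.
    public List<ResourceGroup> pathToRoot()
    {
        lock.readLock().lock();
        try {
            List<ResourceGroup> path = new ArrayList<>();
            for (ResourceGroup group = this; group != null; group = group.parent) {
                path.add(group);
            }
            return path;
        }
        finally {
            lock.readLock().unlock();
        }
    }

    // Mutations still take the exclusive write lock.
    public void enqueueQuery(Runnable query)
    {
        lock.writeLock().lock();
        try {
            // ... add the query to this group's queue ...
        }
        finally {
            lock.writeLock().unlock();
        }
    }
}
```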

Test plan:

  1. Run existing unit test
  2. Perform verifier run
  3. Shadow testing
== NO RELEASE NOTE ==

@abhiseksaikia abhiseksaikia marked this pull request as draft December 22, 2021 18:27
@abhiseksaikia abhiseksaikia changed the title Use Readers-Writer lock for resource group [WIP] Use Readers-Writer lock for resource group Dec 22, 2021
@abhiseksaikia abhiseksaikia changed the title [WIP] Use Readers-Writer lock for resource group Use Readers-Writer lock for resource group Dec 22, 2021
@abhiseksaikia abhiseksaikia marked this pull request as ready for review December 22, 2021 18:46
@swapsmagic (Contributor) left a comment


Adding some unit tests to make sure the methods fail when the calling thread has not acquired the lock beforehand would be great (i.e., around internalStartNext, startInBackground, enqueueQuery, etc.).

Also, for the shadow run, what exactly did we test? The test should cover a cluster with ~2K queries on it (running + queued); we should check that the adjusted queue size metric is still publishing data and not timing out.
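
One way such a guard-plus-test could look is sketched below (not the PR's code; the guard pattern, the checkState call, and TestNG usage are assumptions for illustration):

```java
import static com.google.common.base.Preconditions.checkState;

import java.util.concurrent.locks.ReentrantReadWriteLock;
import org.testng.annotations.Test;

public class TestResourceGroupLockGuards
{
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Hypothetical guard: methods that mutate state assert the write lock first.
    private void internalStartNext()
    {
        checkState(lock.isWriteLockedByCurrentThread(), "write lock must be held");
        // ... start the next queued query ...
    }

    @Test(expectedExceptions = IllegalStateException.class)
    public void testInternalStartNextFailsWithoutLock()
    {
        // No lock acquired, so the guard should throw IllegalStateException.
        internalStartNext();
    }
}
```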

The following inline review thread is attached to this hunk of the diff:

```java
}
else {
    id = new ResourceGroupId(name);
    root = this;
    this.lock = new ReentrantReadWriteLock();
```
A contributor commented:

Since the default acquisition policy is non-fair, where the order of entry to the read and write locks is unspecified, it makes sense to use the fair policy instead of the default.
https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/locks/ReentrantReadWriteLock.html

> Acquisition order
> This class does not impose a reader or writer preference ordering for lock access. However, it does support an optional fairness policy.
>
> Non-fair mode (default)
> When constructed as non-fair (the default), the order of entry to the read and write lock is unspecified, subject to reentrancy constraints. A nonfair lock that is continuously contended may indefinitely postpone one or more reader or writer threads, but will normally have higher throughput than a fair lock.
>
> Fair mode
> When constructed as fair, threads contend for entry using an approximately arrival-order policy. When the currently held lock is released, either the longest-waiting single writer thread will be assigned the write lock, or if there is a group of reader threads waiting longer than all waiting writer threads, that group will be assigned the read lock.
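
For reference, fairness is chosen at construction time; a minimal sketch of the two modes (the demo class is illustrative only, not from the PR):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockFairnessDemo
{
    public static void main(String[] args)
    {
        // Non-fair (default): higher throughput, but a continuously contended
        // lock may indefinitely postpone some readers or writers.
        ReentrantReadWriteLock nonFair = new ReentrantReadWriteLock();

        // Fair: approximately arrival-order entry; avoids starvation at some
        // cost in throughput.
        ReentrantReadWriteLock fair = new ReentrantReadWriteLock(true);

        System.out.println(nonFair.isFair()); // false
        System.out.println(fair.isFair());    // true
    }
}
```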

@abhiseksaikia (Author) replied, quoting the review comment:

> Adding some unit tests to make sure the methods fail when the calling thread has not acquired the lock beforehand would be great (i.e., around internalStartNext, startInBackground, enqueueQuery, etc.).
>
> Also, for the shadow run, what exactly did we test? The test should cover a cluster with ~2K queries on it (running + queued); we should check that the adjusted queue size metric is still publishing data and not timing out.

My initial shadow run was mainly to make sure no deadlocks were introduced. I will do more testing with both fair and non-fair modes. I was a bit worried about using fair mode because of its impact on throughput, but I will do more testing to verify this.

@abhiseksaikia (Author) commented:

Based on testing, and after digging deeper into the code, a readers-writer lock is not going to help much here: when query throughput is high, we take the write lock far more often than the read lock. Also, based on testing, metrics (not just adjusted queue size, but all other metrics) stop being published when CPU utilization stays at ~83% for a long time, so the issue does not appear to be lock contention. I have converted this PR to a draft for future reference, in case we want to optimize locking further (places where we can downgrade from a write lock to a read lock).
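
For future reference, ReentrantReadWriteLock supports downgrading (write to read, never the reverse). A minimal sketch of the pattern described in the class Javadoc (the demo class and recompute helper are illustrative only):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class DowngradeDemo
{
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private Object cachedState;

    public Object updateThenRead()
    {
        lock.writeLock().lock();
        try {
            cachedState = recompute(); // mutate under the write lock
            lock.readLock().lock();    // downgrade: acquire the read lock...
        }
        finally {
            lock.writeLock().unlock(); // ...then release the write lock
        }
        try {
            return cachedState;        // continue reading under the read lock
        }
        finally {
            lock.readLock().unlock();
        }
    }

    private Object recompute()
    {
        return new Object(); // placeholder
    }
}
```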
