API worker times out waiting on an advisory lock to dispatch a task #5390
When I look at this, here's the situation I see. The DB itself is not fully loaded (it's at roughly 33%), so the DB isn't the rate-limiting component here. Also, the API workers are timing out due to waiting a really long time for an advisory lock. What that means to me is that we're running into the architectural limit of task insertion into the DB (or maybe also task handling?). We have 48 workers running in this system, which is a lot, but we may even need more. This is an interesting problem because we can't increase throughput or capacity by making more hardware resources available; I think it can only be solved algorithmically. The idea would be (somehow?) to make the acquisition of locks less contentious.
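One algorithmic direction (a hypothetical sketch, not Pulp's actual implementation) is to shard the advisory-lock keyspace by resource, so that dispatches touching unrelated resources don't contend on a single global lock. Postgres advisory locks take a signed 64-bit key, which can be derived from a resource identifier by hashing; the helper and the example resource paths below are illustrative assumptions.

```python
import hashlib
import struct

def advisory_lock_key(resource: str) -> int:
    """Map a resource name to a signed 64-bit key suitable for a
    Postgres advisory lock (pg_advisory_xact_lock takes a bigint).

    Hypothetical helper for illustration; not Pulp's code.
    """
    digest = hashlib.sha256(resource.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the digest as a signed big-endian
    # 64-bit integer, matching the bigint range Postgres expects.
    return struct.unpack(">q", digest[:8])[0]

# Unrelated resources map to different keys, so two workers dispatching
# tasks against different repositories would not block each other.
k1 = advisory_lock_key("/repositories/rpm/rpm/foo/")
k2 = advisory_lock_key("/repositories/rpm/rpm/bar/")
assert k1 != k2
assert -(2**63) <= k1 < 2**63  # fits a Postgres bigint
```

The trade-off is that any serialization the global lock was providing (e.g. ordering of all task inserts) would have to be re-established per shard or dropped.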
Can you identify whether this is related to the unblocked_at change?
OTOH, it might be worth rerunning the tests with the new indices we just added on the tasks table.
Thanks for the thoughtful comments. Yes, let's rerun the tests after our installation is upgraded to that released version. Can you let us know what version that is once it's known?
It merged this week. |
I currently have 50 concurrent threads, each creating a remote, creating a repo, and syncing the repo. Here are the top 10 queries. I have 24 workers running right now. The green color represents CPU wait time. AWS is suggesting that the instance be upgraded to one with more CPU resources. I agree with their assessment.
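The load pattern described above can be sketched as a thread-pool driver. The `create_and_sync` function here is a placeholder; in the real test each step would hit the Pulp REST API (create remote, create repository, sync).

```python
from concurrent.futures import ThreadPoolExecutor

def create_and_sync(i):
    """Placeholder for one load-test iteration; in the real run each
    line would be an HTTP call against the Pulp API."""
    remote = f"remote-{i}"   # stand-in for POST /remotes/...
    repo = f"repo-{i}"       # stand-in for POST /repositories/...
    return (remote, repo, "synced")  # stand-in for the sync task result

# 50 concurrent threads, mirroring the load described above.
with ThreadPoolExecutor(max_workers=50) as pool:
    results = list(pool.map(create_and_sync, range(50)))

print(len(results))  # 50 completed iterations
```

Each iteration dispatches at least one task, so this pattern drives exactly the advisory-lock-contended insert path discussed above.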
Tell me, is this reason enough to close this issue?
I opened this issue when I had 48 workers running. Right now I am using 24 workers to get around the advisory lock issue. I believe that if I increase to 48 again, we will see this problem again. Let's keep it open at least until I retry with 48 workers.
I provisioned a larger RDS instance (8 vCPUs and 32 GB RAM) and this problem went away.
Version
3.52.0
Describe the bug
I have 10 API pods, each running 20 gunicorn workers. I am submitting a lot of sync tasks, and eventually some API workers time out and the following traceback is emitted:
Here is a screenshot of the db load:
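For scale context: 10 pods × 20 gunicorn workers means up to 200 API workers can contend for the advisory lock at once, and any worker that blocks past gunicorn's request timeout (30 seconds by default) is killed, producing the worker-timeout traceback mentioned above. A hypothetical `gunicorn.conf.py` showing the relevant knobs (values illustrative, not the reporter's actual settings):

```python
# Hypothetical gunicorn.conf.py for one API pod (illustrative values only).
workers = 20           # 10 pods x 20 workers ~= 200 concurrent API workers
timeout = 90           # a worker blocked on the advisory lock longer than
                       # this many seconds is killed (gunicorn default: 30)
graceful_timeout = 30  # grace period for in-flight requests on restart
```

Raising `timeout` only hides the contention; it does not reduce the time spent waiting on the lock.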