
API worker times out waiting on an advisory lock to dispatch a task #5390

Open
dkliban opened this issue May 15, 2024 · 9 comments

@dkliban (Member) commented May 15, 2024

Version
3.52.0

Describe the bug
I have 10 API pods, each running 20 gunicorn workers. When I submit a large number of sync tasks, some API workers eventually time out and the following traceback is emitted:

[2024-05-15 19:41:55 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:14)
[2024-05-15 19:41:55 +0000] [14] [ERROR] Error handling request /api/pulp/default/api/v3/repositories/rpm/rpm/018f7dc4-d1fd-77e4-a00d-5c3fb16b6416/sync/
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/gunicorn/workers/sync.py", line 135, in handle
    self.handle_request(listener, req, client, addr)
  File "/usr/local/lib/python3.9/site-packages/gunicorn/workers/sync.py", line 178, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/usr/local/lib/python3.9/site-packages/opentelemetry/instrumentation/wsgi/__init__.py", line 589, in __call__
    iterable = self.wsgi(environ, start_response)
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/wsgi.py", line 124, in __call__
    response = self.get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/base.py", line 140, in get_response
    response = self._middleware_chain(request)
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django_guid/middleware.py", line 79, in middleware
    response = get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/utils/deprecation.py", line 134, in __call__
    response = response or self.get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.9/site-packages/whitenoise/middleware.py", line 124, in __call__
    return self.get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/utils/deprecation.py", line 134, in __call__
    response = response or self.get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/utils/deprecation.py", line 134, in __call__
    response = response or self.get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/utils/deprecation.py", line 134, in __call__
    response = response or self.get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/utils/deprecation.py", line 134, in __call__
    response = response or self.get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/utils/deprecation.py", line 134, in __call__
    response = response or self.get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/utils/deprecation.py", line 134, in __call__
    response = response or self.get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.9/site-packages/pulpcore/middleware.py", line 24, in __call__
    response = self.get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/base.py", line 197, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/local/lib/python3.9/site-packages/django/views/decorators/csrf.py", line 56, in wrapper_view
    return view_func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/rest_framework/viewsets.py", line 124, in view
    return self.dispatch(request, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/rest_framework/views.py", line 506, in dispatch
    response = handler(request, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/pulp_rpm/app/viewsets/repository.py", line 237, in sync
    result = dispatch(
  File "/usr/local/lib/python3.9/site-packages/pulpcore/tasking/tasks.py", line 168, in dispatch
    cursor.execute("SELECT pg_advisory_xact_lock(%s, %s)", [0, TASK_DISPATCH_LOCK])
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 67, in execute
    return self._execute_with_wrappers(
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 80, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 89, in _execute
    return self.cursor.execute(sql, params)
  File "/usr/local/lib/python3.9/site-packages/psycopg/cursor.py", line 728, in execute
    self._conn.wait(
  File "/usr/local/lib/python3.9/site-packages/psycopg/connection.py", line 969, in wait
    return waiting.wait(gen, self.pgconn.socket, timeout=timeout)
  File "psycopg_binary/_psycopg/waiting.pyx", line 193, in psycopg_binary._psycopg.wait_c
  File "/usr/local/lib/python3.9/site-packages/gunicorn/workers/base.py", line 203, in handle_abort
    sys.exit(1)
SystemExit: 1
('pulp [-]: ::ffff:10.131.8.12 - - [15/May/2024:19:41:55 +0000] "POST /api/pulp/default/api/v3/repositories/rpm/rpm/018f7dc4-d1fd-77e4-a00d-5c3fb16b6416/sync/ HTTP/1.1" 500 0 "-" "-"',)
[2024-05-15 19:41:56 +0000] [14] [INFO] Worker exiting (pid: 14)
[2024-05-15 19:41:56 +0000] [1] [ERROR] Worker (pid:14) was sent SIGKILL! Perhaps out of memory?
[2024-05-15 19:41:56 +0000] [213] [INFO] Booting worker with pid: 213
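For anyone tracing the code: the frame at pulpcore/tasking/tasks.py line 168 is a cluster-wide serialization point. Every dispatch takes the same transaction-scoped advisory lock, so all gunicorn workers across all pods queue behind it. A simplified sketch of that pattern (not the actual pulpcore code; the lock constant here is a placeholder):

```python
# Simplified sketch of the pattern at that frame; this is NOT the actual
# pulpcore code, and the lock constant is illustrative only.
from django.db import connection, transaction

TASK_DISPATCH_LOCK = 21  # placeholder value; the real constant lives in pulpcore

def dispatch_sketch(task_payload):
    with transaction.atomic():
        with connection.cursor() as cursor:
            # Every dispatch in the whole cluster serializes on this single
            # advisory lock, and the lock is held until the transaction
            # commits. With 10 pods x 20 gunicorn workers queueing behind it,
            # the tail of the queue can wait longer than gunicorn's worker
            # timeout, at which point the master SIGKILLs the worker
            # mid-request (the WORKER TIMEOUT / SystemExit seen above).
            cursor.execute(
                "SELECT pg_advisory_xact_lock(%s, %s)", [0, TASK_DISPATCH_LOCK]
            )
            ...  # insert the Task row and notify the task workers
```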

Here is a screenshot of the db load:

[Screenshot: database load, captured 2024-05-15 16:13]

@bmbouter (Member) commented May 16, 2024

When I look at this, here's the situation I see. The DB itself is not fully loaded (it's at roughly 33%), so the DB isn't the rate-limiting component here. Meanwhile, the API workers are timing out because they wait a very long time for an advisory lock.

What that means to me is that we're running into the architectural limit of task insertion into the DB (or maybe task handling too?). We have 48 workers running in this system, which is a lot, but we may even need more.

This is an interesting problem because we can't increase throughput or capacity by making more hardware resources available; I think it can only be solved algorithmically. The idea would be (somehow?) to make the acquisition of locks less contentious.
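To make "architectural limit" concrete: with one global lock, cluster-wide dispatch throughput is capped at roughly 1 / (mean lock-hold time), no matter how many API workers or how much hardware we add. A back-of-the-envelope sketch, with all numbers invented for illustration:

```python
# Back-of-the-envelope model of the single-lock cap; every number here is
# an illustrative assumption, not a measurement from this deployment.
lock_hold_s = 0.050                    # assumed mean time a dispatch holds the lock
max_dispatch_rate = 1 / lock_hold_s    # hard cap: ~20 dispatches/s cluster-wide

dispatchers = 10 * 20                  # 10 API pods x 20 gunicorn workers
arrival_rate = 30                      # assumed dispatch attempts/s during the burst

# Once arrivals exceed the cap, the lock queue grows without bound and the
# tail eventually waits past gunicorn's worker timeout (30 s by default),
# which is exactly the WORKER TIMEOUT -> SIGKILL pattern in the traceback.
backlog_growth = arrival_rate - max_dispatch_rate
print(f"cap={max_dispatch_rate:.0f}/s, backlog grows by {backlog_growth:.0f}/s")
```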

@mdellweg (Member)

Can you identify whether this is related to the unblocked_at change?
Maybe other table locks are slowing the insertion down, and so the advisory lock (being a turnstile that guarantees monotonic pulp_created values at all cost) is being slowed down externally.
Or maybe we really have just hit the limit of this particular bottleneck. Adding more resources certainly won't improve the situation here.
A first idea (under the assumption that concurrent tasks rarely touch the same resources) could be to hash each task's resources, bloom-filter style, and spread the current single advisory lock across 8 buckets. Then only tasks whose resource identifiers overlap in one of the 8 buckets would need to wait on each other's locks.
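A minimal sketch of that idea, assuming a hypothetical helper inside dispatch (the hash function and bucket count are placeholders, not a pulpcore design):

```python
# Hypothetical sketch of the bucketed-lock idea; not pulpcore code.
import zlib
from django.db import connection

NUM_BUCKETS = 8
LOCK_NAMESPACE = 0  # same first key the current single lock uses

def acquire_dispatch_locks(exclusive_resources):
    # Hash each resource identifier into one of NUM_BUCKETS lock ids.
    buckets = {zlib.crc32(r.encode()) % NUM_BUCKETS for r in exclusive_resources}
    with connection.cursor() as cursor:
        # Take the buckets in sorted order so two dispatchers can never wait
        # on each other's buckets in opposite order (deadlock avoidance).
        for bucket in sorted(buckets):
            cursor.execute(
                "SELECT pg_advisory_xact_lock(%s, %s)",
                [LOCK_NAMESPACE, bucket],
            )
```

Tasks whose resources land in disjoint buckets would no longer contend at all; the trade-off is that pulp_created values would only be monotonic per bucket rather than globally.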

@mdellweg (Member)

OTOH, it might be worth rerunning the tests with the new indices we just added on the tasks table.
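For whoever reruns the tests, one way to confirm the new indices are actually picked up is to EXPLAIN a hot statement against the tasks table. A sketch; the query below is just a plausible example of a hot statement, and core_task is assumed to be the Task model's table:

```python
# Sketch: EXPLAIN ANALYZE a representative hot query against the tasks table
# to see whether the new indices show up in the plan. The exact query is a
# placeholder; substitute whatever pg_stat_statements says is hot.
from django.db import connection

with connection.cursor() as cursor:
    cursor.execute(
        "EXPLAIN (ANALYZE, BUFFERS) "
        "SELECT pulp_id FROM core_task "
        "WHERE state = 'waiting' ORDER BY pulp_created LIMIT 1"
    )
    for (line,) in cursor.fetchall():
        print(line)
```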

@bmbouter (Member)

Thanks for the thoughtful comments.

Yes, let's rerun the tests once our installation is upgraded to a release that includes those indices. Can you let us know which version that is once it's known?

@mdellweg (Member)

It merged this week.

@dkliban (Member, Author) commented May 16, 2024

Here is another screenshot from the RDS management console.
[Screenshot: RDS management console metrics]

@dkliban (Member, Author) commented May 22, 2024

I currently have 50 concurrent threads, each creating a remote and a repo and then syncing the repo. Here are the top 10 queries:

I have 24 workers running right now.

[Screenshot: top 10 queries]

The green color represents CPU wait time. AWS suggests upgrading the instance to one with more CPU resources, and I agree with that assessment.
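For comparing runs outside the RDS console, a similar top-10 view can be pulled straight from PostgreSQL. A sketch; it assumes the pg_stat_statements extension is enabled (and note the column is total_exec_time on PostgreSQL 13+, total_time on older releases):

```python
# Approximate the RDS "top queries" view with pg_stat_statements.
# Assumes the extension is installed in the Pulp database.
from django.db import connection

with connection.cursor() as cursor:
    cursor.execute(
        "SELECT calls, round(total_exec_time::numeric, 1) AS total_ms, "
        "left(query, 80) AS query "
        "FROM pg_stat_statements "
        "ORDER BY total_exec_time DESC LIMIT 10"
    )
    for calls, total_ms, query in cursor.fetchall():
        print(f"{calls:>10} {total_ms:>12} {query}")
```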

@mdellweg (Member)

Tell me, does this give us a reason to close this issue?

@dkliban (Member, Author) commented May 23, 2024

I opened this issue when I had 48 workers running. Right now I am using 24 workers to get around the advisory lock issue.

I believe that if I scale back up to 48, we will see this problem again. Let's keep the issue open at least until I try 48 workers again.
