
API worker times out waiting on an advisory lock to dispatch a task #5390

Open
dkliban opened this issue May 15, 2024 · 9 comments

@dkliban (Member) commented May 15, 2024

Version
3.52.0

Describe the bug
I have 10 API pods, each running 20 gunicorn workers. When I submit a large number of sync tasks, some API workers eventually time out and the following traceback is emitted:

[2024-05-15 19:41:55 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:14)
[2024-05-15 19:41:55 +0000] [14] [ERROR] Error handling request /api/pulp/default/api/v3/repositories/rpm/rpm/018f7dc4-d1fd-77e4-a00d-5c3fb16b6416/sync/
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/gunicorn/workers/sync.py", line 135, in handle
    self.handle_request(listener, req, client, addr)
  File "/usr/local/lib/python3.9/site-packages/gunicorn/workers/sync.py", line 178, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/usr/local/lib/python3.9/site-packages/opentelemetry/instrumentation/wsgi/__init__.py", line 589, in __call__
    iterable = self.wsgi(environ, start_response)
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/wsgi.py", line 124, in __call__
    response = self.get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/base.py", line 140, in get_response
    response = self._middleware_chain(request)
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django_guid/middleware.py", line 79, in middleware
    response = get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/utils/deprecation.py", line 134, in __call__
    response = response or self.get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.9/site-packages/whitenoise/middleware.py", line 124, in __call__
    return self.get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/utils/deprecation.py", line 134, in __call__
    response = response or self.get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/utils/deprecation.py", line 134, in __call__
    response = response or self.get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/utils/deprecation.py", line 134, in __call__
    response = response or self.get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/utils/deprecation.py", line 134, in __call__
    response = response or self.get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/utils/deprecation.py", line 134, in __call__
    response = response or self.get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/utils/deprecation.py", line 134, in __call__
    response = response or self.get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.9/site-packages/pulpcore/middleware.py", line 24, in __call__
    response = self.get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/base.py", line 197, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/local/lib/python3.9/site-packages/django/views/decorators/csrf.py", line 56, in wrapper_view
    return view_func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/rest_framework/viewsets.py", line 124, in view
    return self.dispatch(request, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/rest_framework/views.py", line 506, in dispatch
    response = handler(request, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/pulp_rpm/app/viewsets/repository.py", line 237, in sync
    result = dispatch(
  File "/usr/local/lib/python3.9/site-packages/pulpcore/tasking/tasks.py", line 168, in dispatch
    cursor.execute("SELECT pg_advisory_xact_lock(%s, %s)", [0, TASK_DISPATCH_LOCK])
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 67, in execute
    return self._execute_with_wrappers(
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 80, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 89, in _execute
    return self.cursor.execute(sql, params)
  File "/usr/local/lib/python3.9/site-packages/psycopg/cursor.py", line 728, in execute
    self._conn.wait(
  File "/usr/local/lib/python3.9/site-packages/psycopg/connection.py", line 969, in wait
    return waiting.wait(gen, self.pgconn.socket, timeout=timeout)
  File "psycopg_binary/_psycopg/waiting.pyx", line 193, in psycopg_binary._psycopg.wait_c
  File "/usr/local/lib/python3.9/site-packages/gunicorn/workers/base.py", line 203, in handle_abort
    sys.exit(1)
SystemExit: 1
('pulp [-]: ::ffff:10.131.8.12 - - [15/May/2024:19:41:55 +0000] "POST /api/pulp/default/api/v3/repositories/rpm/rpm/018f7dc4-d1fd-77e4-a00d-5c3fb16b6416/sync/ HTTP/1.1" 500 0 "-" "-"',)
[2024-05-15 19:41:56 +0000] [14] [INFO] Worker exiting (pid: 14)
[2024-05-15 19:41:56 +0000] [1] [ERROR] Worker (pid:14) was sent SIGKILL! Perhaps out of memory?
[2024-05-15 19:41:56 +0000] [213] [INFO] Booting worker with pid: 213
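For anyone tracing the code: the frame at pulpcore/tasking/tasks.py line 168 is a cluster-wide serialization point. Every dispatch takes the same transaction-scoped advisory lock, so all gunicorn workers across all pods queue behind it. A simplified sketch of that pattern (not the actual pulpcore code; the lock constant here is a placeholder):

```python
# Simplified sketch of the pattern at that frame; this is NOT the actual
# pulpcore code, and the lock constant is illustrative only.
from django.db import connection, transaction

TASK_DISPATCH_LOCK = 21  # placeholder value; the real constant lives in pulpcore

def dispatch_sketch(task_payload):
    with transaction.atomic():
        with connection.cursor() as cursor:
            # Every dispatch in the whole cluster serializes on this single
            # advisory lock, and the lock is held until the transaction
            # commits. With 10 pods x 20 gunicorn workers queueing behind it,
            # the tail of the queue can wait longer than gunicorn's worker
            # timeout, at which point the master SIGKILLs the worker
            # mid-request (the WORKER TIMEOUT / SystemExit seen above).
            cursor.execute(
                "SELECT pg_advisory_xact_lock(%s, %s)", [0, TASK_DISPATCH_LOCK]
            )
            ...  # insert the Task row and notify the task workers
```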

Here is a screenshot of the db load:

[Screenshot: database load, captured 2024-05-15 16:13]

@bmbouter (Member) commented May 16, 2024

When I look at this, here's the situation I see. The DB itself is not fully loaded (it's at roughly 33%), so the DB isn't the rate-limiting component here. Meanwhile, the API workers are timing out because they wait a very long time for an advisory lock.

What that means to me is that we're running into the architectural limit of task insertion into the DB (or maybe task handling too?). We have 48 workers running in this system, which is a lot, but we may even need more.

This is an interesting problem because we can't increase throughput or capacity by making more hardware resources available; I think it can only be solved algorithmically. The idea would be (somehow?) to make the acquisition of locks less contentious.
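To make "architectural limit" concrete: with one global lock, cluster-wide dispatch throughput is capped at roughly 1 / (mean lock-hold time), no matter how many API workers or how much hardware we add. A back-of-the-envelope sketch, with all numbers invented for illustration:

```python
# Back-of-the-envelope model of the single-lock cap; every number here is
# an illustrative assumption, not a measurement from this deployment.
lock_hold_s = 0.050                    # assumed mean time a dispatch holds the lock
max_dispatch_rate = 1 / lock_hold_s    # hard cap: ~20 dispatches/s cluster-wide

dispatchers = 10 * 20                  # 10 API pods x 20 gunicorn workers
arrival_rate = 30                      # assumed dispatch attempts/s during the burst

# Once arrivals exceed the cap, the lock queue grows without bound and the
# tail eventually waits past gunicorn's worker timeout (30 s by default),
# which is exactly the WORKER TIMEOUT -> SIGKILL pattern in the traceback.
backlog_growth = arrival_rate - max_dispatch_rate
print(f"cap={max_dispatch_rate:.0f}/s, backlog grows by {backlog_growth:.0f}/s")
```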

@mdellweg (Member)

Can you identify whether this is related to the unblocked_at change?
Maybe other table locks are slowing the insertion down, and so the advisory lock (being a turnstile that guarantees monotonic pulp_created values at all cost) is being slowed down externally.
Or maybe we really have just hit the limit of this particular bottleneck. Adding more resources certainly won't improve the situation here.
A first idea (under the assumption that concurrent tasks rarely touch the same resources) could be to hash each task's resources, bloom-filter style, and spread the current single advisory lock across 8 buckets. Then only tasks whose resource identifiers overlap in one of the 8 buckets would need to wait on each other's locks.
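A minimal sketch of that idea, assuming a hypothetical helper inside dispatch (the hash function and bucket count are placeholders, not a pulpcore design):

```python
# Hypothetical sketch of the bucketed-lock idea; not pulpcore code.
import zlib
from django.db import connection

NUM_BUCKETS = 8
LOCK_NAMESPACE = 0  # same first key the current single lock uses

def acquire_dispatch_locks(exclusive_resources):
    # Hash each resource identifier into one of NUM_BUCKETS lock ids.
    buckets = {zlib.crc32(r.encode()) % NUM_BUCKETS for r in exclusive_resources}
    with connection.cursor() as cursor:
        # Take the buckets in sorted order so two dispatchers can never wait
        # on each other's buckets in opposite order (deadlock avoidance).
        for bucket in sorted(buckets):
            cursor.execute(
                "SELECT pg_advisory_xact_lock(%s, %s)",
                [LOCK_NAMESPACE, bucket],
            )
```

Tasks whose resources land in disjoint buckets would no longer contend at all; the trade-off is that pulp_created values would only be monotonic per bucket rather than globally.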

@mdellweg (Member)

OTOH, it might be worth rerunning the tests with the new indices we just added on the tasks table.
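For whoever reruns the tests, one way to confirm the new indices are actually picked up is to EXPLAIN a hot statement against the tasks table. A sketch; the query below is just a plausible example of a hot statement, and core_task is assumed to be the Task model's table:

```python
# Sketch: EXPLAIN ANALYZE a representative hot query against the tasks table
# to see whether the new indices show up in the plan. The exact query is a
# placeholder; substitute whatever pg_stat_statements says is hot.
from django.db import connection

with connection.cursor() as cursor:
    cursor.execute(
        "EXPLAIN (ANALYZE, BUFFERS) "
        "SELECT pulp_id FROM core_task "
        "WHERE state = 'waiting' ORDER BY pulp_created LIMIT 1"
    )
    for (line,) in cursor.fetchall():
        print(line)
```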

@bmbouter (Member)

Thanks for the thoughtful comments.

Yes, let's rerun the tests once our installation is upgraded to a release that includes those indices. Can you let us know which version that is once it's known?

@mdellweg (Member)

It merged this week.

@dkliban (Member, Author) commented May 16, 2024

Here is another screenshot from the RDS management console.
[Screenshot: RDS management console metrics]

@dkliban (Member, Author) commented May 22, 2024

I currently have 50 concurrent threads, each creating a remote and a repo and then syncing the repo. Here are the top 10 queries:

I have 24 workers running right now.

[Screenshot: top 10 queries]

The green color represents CPU wait time. AWS suggests upgrading the instance to one with more CPU resources, and I agree with that assessment.
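For comparing runs outside the RDS console, a similar top-10 view can be pulled straight from PostgreSQL. A sketch; it assumes the pg_stat_statements extension is enabled (and note the column is total_exec_time on PostgreSQL 13+, total_time on older releases):

```python
# Approximate the RDS "top queries" view with pg_stat_statements.
# Assumes the extension is installed in the Pulp database.
from django.db import connection

with connection.cursor() as cursor:
    cursor.execute(
        "SELECT calls, round(total_exec_time::numeric, 1) AS total_ms, "
        "left(query, 80) AS query "
        "FROM pg_stat_statements "
        "ORDER BY total_exec_time DESC LIMIT 10"
    )
    for calls, total_ms, query in cursor.fetchall():
        print(f"{calls:>10} {total_ms:>12} {query}")
```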

@mdellweg (Member)

Tell me, does this give us a reason to close this issue?

@dkliban (Member, Author) commented May 23, 2024

I opened this issue when I had 48 workers running. Right now I am using 24 workers to get around the advisory lock issue.

I believe that if I scale back up to 48, we will see this problem again. Let's keep the issue open at least until I try 48 workers again.
