
Add periodic malloc_trim to prevent unbounded RSS growth in API workers #7481

Open

amasolov wants to merge 1 commit into pulp:main from amasolov:fix/api-worker-memory-trim

Conversation

@amasolov commented Mar 18, 2026

Summary

PulpApiWorker (gunicorn SyncWorker) exhibits unbounded RSS growth over time due to glibc heap fragmentation. Django's per-request allocation pattern creates and destroys many small C-level objects (ORM compilers, SQL strings, psycopg cursor state), causing glibc's malloc to retain freed pages rather than returning them to the OS.

This PR adds periodic gc.collect() + malloc_trim(0) calls in PulpApiWorker.handle_request() every N requests (default 1024, configurable via PULP_MEMORY_TRIM_INTERVAL env var, set to 0 to disable).

The fix is Linux-only (glibc malloc_trim), graceful no-op on other platforms. No new dependencies.
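The mechanism described above can be sketched as follows. This is a minimal, hypothetical illustration of the approach, not the PR's actual code (which lives in pulpcore/app/entrypoint.py and may differ in names and details):

```python
import ctypes
import ctypes.util
import gc
import sys


def _load_malloc_trim():
    """Return glibc's malloc_trim if available, else None (no-op elsewhere)."""
    if not sys.platform.startswith("linux"):
        return None
    try:
        libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
        trim = libc.malloc_trim
        trim.argtypes = [ctypes.c_size_t]
        trim.restype = ctypes.c_int
        return trim
    except (OSError, AttributeError):
        # musl and other non-glibc libcs have no malloc_trim
        return None


_malloc_trim = _load_malloc_trim()


def trim_memory():
    """Collect cyclic garbage, then ask glibc to return free heap pages to the OS."""
    gc.collect()
    if _malloc_trim is not None:
        _malloc_trim(0)  # returns 1 if memory was released, 0 otherwise
```

Because the function pointer resolves to None off Linux (or on non-glibc systems), the call degrades to a plain gc.collect() there, matching the "graceful no-op" behaviour described above.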

Problem

Observed in Ansible Automation Platform 2.6 deployments running pulpcore 3.49 on OpenShift: hub-api worker RSS grows ~1 kB/request even with zero user activity (liveness/readiness probes alone drive growth). Over hours this leads to OOM kills and pod restarts.

Profiling on a live cluster confirmed:

  • Python object counts are completely stable (gc.get_objects() delta ~0)
  • gc.collect() recovers 0 bytes (no reference cycles)
  • malloc_trim(0) recovers ~2 MB immediately (heap fragmentation confirmed)
  • RSS grows linearly without trimming, stabilizes completely with trimming

The root cause is glibc's default malloc behavior: small allocations spread across many arenas cause heap fragmentation, and freed blocks are not returned to the OS until malloc_trim is explicitly called.

Changes

pulpcore/app/entrypoint.py:

  • At module load: detect Linux, load libc.malloc_trim via ctypes
  • PulpApiWorker.handle_request(): after each request, increment counter; every PULP_MEMORY_TRIM_INTERVAL requests (default 1024), call gc.collect() then malloc_trim(0)
  • Log at worker init when trimming is enabled
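The per-request counting described in these bullets might look roughly like the following. Names here are hypothetical; the real change hooks PulpApiWorker.handle_request() in entrypoint.py:

```python
import ctypes
import gc
import sys

# Resolve glibc's malloc_trim once at import; None on non-Linux/non-glibc.
try:
    _malloc_trim = ctypes.CDLL(None).malloc_trim if sys.platform.startswith("linux") else None
except (OSError, AttributeError):
    _malloc_trim = None


class TrimmingRequestCounter:
    """Counts handled requests and trims the heap every `interval` requests (0 disables)."""

    def __init__(self, interval=1024):
        self.interval = interval
        self.count = 0

    def after_request(self):
        self.count += 1
        if self.interval > 0 and self.count % self.interval == 0:
            gc.collect()           # clear any cyclic garbage first
            if _malloc_trim is not None:
                _malloc_trim(0)    # hand freed heap pages back to the OS
            return True            # trimmed on this request
        return False
```

With the default interval of 1024 the trim cost is amortized to a fraction of a request's work, and interval 0 short-circuits the whole path.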

Configuration

| Env var | Default | Description |
| --- | --- | --- |
| PULP_MEMORY_TRIM_INTERVAL | 1024 | Run trim every N requests. Set to 0 to disable. |

Test plan

  • Verify workers start normally with default settings (trim enabled)
  • Verify PULP_MEMORY_TRIM_INTERVAL=0 disables trimming (no log message)
  • Verify RSS growth stabilizes under sustained probe/request load
  • Verify no functional regression on macOS (trim is a no-op, no errors)
  • Run existing unit/functional test suite

📜 Checklist

  • Commits are cleanly separated with meaningful messages (simple features and bug fixes should be squashed to one commit)
  • A changelog entry or entries has been added for any significant changes
  • Follows the Pulp policy on AI Usage
  • (For new features) - User documentation and test coverage has been added

See: Pull Request Walkthrough

@dralley (Contributor) commented Mar 18, 2026

I think a safer and more practical approach might be to just configure a maximum number of requests for a Gunicorn worker to handle before being rebooted.

We expose the options to do so on the API entrypoint already: https://github.com/pulp/pulpcore/blob/main/pulpcore/app/entrypoint.py#L153-L154

https://gunicorn.org/guides/docker/?h=memory#out-of-memory
https://gunicorn.org/reference/settings/#worker_connections
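For reference, the worker-recycling approach suggested here is driven by Gunicorn's max_requests and max_requests_jitter settings, e.g. in a gunicorn.conf.py (the values below are illustrative, not recommendations):

```python
# gunicorn.conf.py -- recycle workers as a safety net against memory growth
max_requests = 1000        # restart a worker after it has handled this many requests
max_requests_jitter = 50   # randomize the limit so workers don't all recycle at once
```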

@amasolov (Author)

> I think a safer and more practical approach might be to just configure a maximum number of requests for a Gunicorn worker to handle before being rebooted.

I agree that --max-requests is a practical safety net and should be a part of the story. However I think these two approaches are complementary rather than alternatives.

--max-requests masks the symptom by recycling workers periodically, but each worker still grows until it's replaced. Under heavier load in enterprise environments the recycling becomes more frequent and adds brief latency during worker replacement.

malloc_trim addresses the root cause: glibc retains freed pages in the process heap, and calling malloc_trim(0) returns them to the OS, so RSS stabilises and never grows further. No worker restart is needed.

Utilising --max-requests also requires changes in the end products (for example, AAP has no option to set it and keep it persistent), whereas malloc_trim would just work.

@dralley (Contributor) commented Mar 18, 2026

Can you briefly try using https://docs.python.org/3/library/tracemalloc.html (or something similar, like memray) to get a report on what is allocating memory during a standard liveness probe request?

I understand that would measure what is going on with Python's own allocators rather than libc malloc, but still, I wouldn't expect much fragmentation to accumulate on a service where the same endpoint is merely being called over and over, using and then releasing approximately the same amount of memory every time. So there is probably fragmentation, but it may also be triggered by other misbehavior.
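A minimal tracemalloc snapshot-diff along the lines suggested here could look like this (the workload line is a stand-in for real probe requests against the worker):

```python
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames per allocation for useful tracebacks
baseline = tracemalloc.take_snapshot()

# Stand-in workload; in a live worker this would be repeated liveness-probe requests.
workload = [bytearray(256) for _ in range(2000)]

after = tracemalloc.take_snapshot()
stats = after.compare_to(baseline, "lineno")
for stat in stats[:10]:
    print(stat)  # top allocation deltas, grouped by source line
tracemalloc.stop()
```

Positive size_diff entries that keep growing across runs point at a Python-level leak; stable or negative diffs alongside rising RSS point back at the allocator, which is what the data below shows.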

@pedro-psb pedro-psb linked an issue Mar 18, 2026 that may be closed by this pull request

logger = getLogger(__name__)

_MEMORY_TRIM_INTERVAL = int(os.environ.get("PULP_MEMORY_TRIM_INTERVAL", "1024"))
@pedro-psb (Member)

I guess there is some discussion about this, but in any case, this setting should be defined in settings.py as the others and documented in settings.md. Settings defined there are automatically overridable via PULP_{NAME} envvar.

@amasolov (Author)

@pedro-psb Good call, updated in the latest push:

  • MEMORY_TRIM_INTERVAL = 1024 added to settings.py (so it picks up PULP_MEMORY_TRIM_INTERVAL via dynaconf automatically)
  • entrypoint.py now reads from settings.MEMORY_TRIM_INTERVAL in init_process() instead of os.environ.get()
  • Documented in docs/admin/reference/settings.md
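A stdlib-only stand-in for the env-override behaviour described above (pulpcore itself resolves PULP_* variables through settings.py and dynaconf; the get_setting helper here is purely illustrative):

```python
import os


def get_setting(name, default):
    """Illustrative lookup: a PULP_<NAME> env var overrides the coded default."""
    raw = os.environ.get(f"PULP_{name}")
    return type(default)(raw) if raw is not None else default


os.environ["PULP_MEMORY_TRIM_INTERVAL"] = "2048"
MEMORY_TRIM_INTERVAL = get_setting("MEMORY_TRIM_INTERVAL", 1024)
```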

@amasolov (Author)

> Can you briefly try using https://docs.python.org/3/library/tracemalloc.html (or something similar, like memray) to get a report on what is allocating memory during a standard liveness probe request?

@dralley Sure

Here's tracemalloc data from a live AAP 2.6 cluster (pulpcore 3.49.49, Django 4.2.27, Python 3.12, glibc 2.34, OpenShift).

Baseline snapshot taken after lazy init settled, then 200 sequential curl requests to /pulp/api/v3/status/ from inside the pod, then diff snapshot.

RSS vs Python allocations (PID 2, 200 requests):

| Metric | Baseline | After 200 reqs | Delta |
| --- | --- | --- | --- |
| VmRSS | 168,932 kB | 181,028 kB | +12,096 kB |
| tracemalloc traced | 6,837,468 B | ~6,851,690 B | ~ +14 kB |
| gc.get_objects() | 306,151 | ~306,000 | ~0 |

tracemalloc top diffs (all negative = Python freeing, not leaking):
  • rest_framework/fields.py:625 -10,208 B (-81 objs)
  • rest_framework/fields.py:341 -9,968 B (-64 objs)
  • psycopg/_adapters_map.py:181 -9,376 B (-4 objs)
  • psycopg/_adapters_map.py:156 -9,376 B (-4 objs)
  • psycopg/_typeinfo.py:339 -9,304 B (-2 objs)
  • rest_framework/fields.py:381 -7,488 B (-86 objs)
  • django/utils/deconstruct.py:18 +6,664 B (+119 objs) <- largest positive, one-time lazy init

Earlier malloc_trim validation (same cluster, 4000 requests):

  • gc.collect() -> 0 bytes recovered (no reference cycles)
  • malloc_trim(0) -> ~2 MB recovered immediately
  • With periodic malloc_trim(0) every 1024 reqs: RSS stabilised at ~140 MB; one worker decreased 224 kB between req 2000 and 4000

12 MB of RSS growth with only 14 kB of Python allocation growth: the gap is glibc holding freed pages, and malloc_trim returns them to the OS.
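The RSS-versus-traced comparison behind those numbers can be reproduced with a short Linux-only snippet like this (the allocation loop stands in for the 200 status requests):

```python
import tracemalloc


def vmrss_kb():
    """Resident set size of this process in kB, read from /proc (Linux only)."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0


tracemalloc.start()
rss0 = vmrss_kb()
traced0, _ = tracemalloc.get_traced_memory()

# Stand-in workload; on the cluster this was 200 requests to /pulp/api/v3/status/.
blob = [bytes(1024) for _ in range(1024)]

rss1 = vmrss_kb()
traced1, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

rss_growth_kb = rss1 - rss0
traced_growth_kb = (traced1 - traced0) // 1024
fragmentation_gap_kb = rss_growth_kb - traced_growth_kb
```

A large positive fragmentation_gap_kb sustained over many iterations is the signature reported above: memory the allocator holds that Python never asked to keep.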

Gunicorn API workers exhibit unbounded RSS growth over time due to glibc
heap fragmentation. Django's per-request allocation pattern creates and
destroys many small C-level objects (ORM compilers, SQL strings, psycopg
cursor state) which causes glibc's malloc to retain freed pages in the
process heap rather than returning them to the OS.

Profiling on a live Ansible Automation Platform 2.6 deployment
(pulpcore 3.49.49, Django 4.2.27, Python 3.12) confirmed:
- Python object counts are completely stable (no object leak)
- gc.collect() recovers 0 bytes (no reference cycles)
- malloc_trim(0) recovers ~2 MB immediately (fragmentation confirmed)
- RSS grows ~1 kB/request without trimming

This adds periodic gc.collect() + malloc_trim(0) calls in
PulpApiWorker.handle_request() every MEMORY_TRIM_INTERVAL requests
(default 1024, configurable via PULP_MEMORY_TRIM_INTERVAL through
the standard Django/dynaconf settings, set 0 to disable).

The fix is Linux-only (glibc malloc_trim), graceful no-op on other
platforms. Testing shows RSS stabilizes completely after one-time lazy
initialization, eliminating unbounded growth.

closes pulp#7482

Assisted-by: Claude (Anthropic) - investigation, profiling, and code
Made-with: Cursor
@amasolov force-pushed the fix/api-worker-memory-trim branch from 4279638 to 86faabc on March 18, 2026 at 20:03


Development

Successfully merging this pull request may close these issues.

PulpApiWorker RSS grows unbounded due to glibc heap fragmentation

3 participants