
Fix repair_metadata OOM on large repositories #1189

Open
decko wants to merge 10 commits into pulp:main from decko:fix/repair-metadata-memory

Conversation

@decko
Member

@decko decko commented Apr 9, 2026

Summary

  • Reduce BULK_SIZE from 1000 to 250, flushing batches 4x more often to cap peak memory
  • Eliminate double S3 read per wheel by reusing the temp file from metadata extraction for metadata artifact creation
  • Explicitly close artifact file handles after each iteration to release S3 buffer memory
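The batching change in the first bullet can be illustrated with a minimal sketch; the `flush` callback and package iterable below are stand-ins for illustration, not Pulp APIs:

```python
BULK_SIZE = 250  # reduced from 1000, so batches flush 4x more often

def repair_in_batches(packages, flush):
    """Accumulate work and flush every BULK_SIZE items to cap peak memory."""
    batch = []
    for package in packages:
        batch.append(package)
        if len(batch) >= BULK_SIZE:
            flush(batch)
            batch.clear()  # drop model instances / artifact references
    if batch:
        flush(batch)  # final partial batch

sizes = []
repair_in_batches(range(600), lambda batch: sizes.append(len(batch)))
print(sizes)  # [250, 250, 100]
```

With 600 packages this flushes twice at 250 and once with the 100-item remainder, so at most 250 items are ever held in memory.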

Fixes #1188

Test plan

  • Existing test_repair.py tests pass (metadata repair command, endpoint, artifact repair)
  • New test_metadata_repair_batch_boundary passes with reduced BULK_SIZE
  • Deploy to stage and run repair-python-metadata.py --env stage --domain <large-domain> to verify no OOM

JIRA: PULP-1573

🤖 Generated with Claude Code

decko and others added 6 commits April 9, 2026 15:38
Large repositories (1000+ packages) cause workers to OOM during
repair_metadata. Reducing BULK_SIZE flushes batches 4x more often,
capping the number of Django model instances and artifact references
held in memory at any given time.

JIRA: PULP-1573

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a variant of artifact_to_metadata_artifact that accepts an
already-extracted temp file path instead of re-reading the artifact
from storage. Also adds keep_temp_file parameter to
artifact_to_python_content_data to support reuse.

This prepares for eliminating the double S3 read per wheel during
repair_metadata.

JIRA: PULP-1573

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
During repair_metadata, each wheel was read from S3 twice: once for
content metadata extraction and again for metadata artifact creation.
Now the temp file from the first read is reused for the second,
halving S3 reads and temp file I/O for wheels.

Also explicitly closes artifact.file after each iteration to release
S3 buffer memory immediately instead of waiting for GC.

JIRA: PULP-1573

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
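A minimal sketch of the single-read pattern this commit describes; `fetch_from_storage` and the two extraction reads are stand-ins for the S3-backed artifact.file and the wheel-parsing steps:

```python
import os
import tempfile

def fetch_from_storage(stats):
    """Stand-in for reading artifact.file from S3; counts storage hits."""
    stats["s3_reads"] += 1
    return b"wheel-bytes"

def repair_one_wheel(stats):
    fd, temp_path = tempfile.mkstemp(suffix=".whl")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(fetch_from_storage(stats))  # the only storage read
        with open(temp_path, "rb") as f:
            content_data = f.read()    # stand-in: content metadata extraction
        with open(temp_path, "rb") as f:
            metadata_bytes = f.read()  # stand-in: metadata artifact creation, reusing the file
        return content_data, metadata_bytes
    finally:
        os.unlink(temp_path)  # temp file cleanup still happens immediately

stats = {"s3_reads": 0}
repair_one_wheel(stats)
print(stats["s3_reads"])  # 1
```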
JIRA: PULP-1573

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Verifies that repair_metadata correctly processes multiple wheel
packages across batch flush boundaries after BULK_SIZE reduction.

JIRA: PULP-1573
@github-actions github-actions bot added the multi-commit (Add to bypass single commit lint check), no-changelog, and no-issue labels on Apr 9, 2026
decko added 2 commits April 9, 2026 16:30
The explicit close() on main_artifact.file broke the fallback path in
_process_metadata_batch, which calls artifact_to_metadata_artifact()
for the final batch — that function calls artifact.file.seek(0) on
the already-closed handle.

Removing the close is safe: the file handle is released when the
batch flushes and references are cleared. The temp file cleanup
(os.unlink) still happens immediately.

JIRA: PULP-1573
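The regression can be reproduced in miniature; `io.BytesIO` stands in for the S3-backed artifact.file handle:

```python
import io

handle = io.BytesIO(b"wheel-bytes")
handle.close()  # the explicit close added in the earlier commit

try:
    handle.seek(0)  # what the fallback path does on the final batch
    seek_failed = False
except ValueError:
    seek_failed = True  # "I/O operation on closed file"

print(seek_failed)  # True
```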
Comment on lines +123 to +131
    # For wheels, keep the temp file so we can reuse it for metadata artifact
    is_wheel = package.filename.endswith(".whl")
    if is_wheel:
        new_data, temp_path = artifact_to_python_content_data(
            package.filename, main_artifact, domain, keep_temp_file=True
        )
    else:
        new_data = artifact_to_python_content_data(package.filename, main_artifact, domain)
        temp_path = None
Contributor

A better way would be to create a helper function that copies the artifact to a temp file, then use the result for content-data extraction in artifact_to_python_content_data and for metadata extraction via extract_wheel_metadata. update_metadata_artifact_if_needed can then use the temp file / metadata content in bytes instead of main_artifact.
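A sketch of the suggested helper; the name `copy_artifact_to_temp_file` and its `artifact_file` argument are assumptions drawn from this review thread, not existing pulp_python API:

```python
import io
import os
import shutil
import tempfile

def copy_artifact_to_temp_file(artifact_file, suffix=".whl"):
    """Copy an open artifact file object to a named temp file; return its path.

    The caller owns the file and must os.unlink() it when finished.
    """
    fd, temp_path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as out:
            shutil.copyfileobj(artifact_file, out)
    except BaseException:
        os.unlink(temp_path)
        raise
    return temp_path

# Usage: copy once, hand the same path to every extraction step
path = copy_artifact_to_temp_file(io.BytesIO(b"wheel-bytes"))
with open(path, "rb") as f:
    round_tripped = f.read()
os.unlink(path)
```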

Member Author

Agreed on the helper approach — will rework to have a small helper that copies the artifact to a temp file, then pass that path to both artifact_to_python_content_data and artifact_to_metadata_artifact.

One thing I'd like to discuss: passing metadata content in bytes through update_metadata_artifact_if_needed would couple the content extraction and artifact creation paths — artifact_to_python_content_data would need to know about and return the raw metadata bytes alongside the parsed data. Passing the temp file path keeps both functions independent: each opens the wheel for exactly what it needs. Would you be OK with just passing the temp path instead of bytes?



-def artifact_to_python_content_data(filename, artifact, domain=None):
+def artifact_to_python_content_data(filename, artifact, domain=None, keep_temp_file=False):
Contributor

There should be only minimal changes: use either the temp path or the original logic with tempfile.NamedTemporaryFile.

return metadata_artifact


def artifact_to_metadata_artifact_from_path(filename: str, temp_wheel_path: str) -> Artifact | None:
Contributor

This is not needed. artifact_to_metadata_artifact should take a new parameter (a temp path or metadata content in bytes) and use either that or the original logic with tempfile.NamedTemporaryFile.

decko added 2 commits April 10, 2026 12:19
Address review feedback:
- Add copy_artifact_to_temp_file helper that copies artifact to disk
- artifact_to_python_content_data accepts optional temp_path param
- artifact_to_metadata_artifact accepts optional temp_path param
- Remove artifact_to_metadata_artifact_from_path (no longer needed)
- Repair loop creates temp file once via helper, passes to both functions
- Use try/finally for temp file cleanup

JIRA: PULP-1573
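Putting the reviewed design together, a hedged sketch of the loop body: the function name follows the commit message, and the two extraction callables are stand-ins for artifact_to_python_content_data and artifact_to_metadata_artifact:

```python
import io
import os
import shutil
import tempfile

def repair_one_package(artifact_file, extract_content, extract_metadata):
    """Copy the artifact to disk once, feed both extraction steps from that path,
    and clean the temp file up even if extraction raises (the try/finally from
    the commit)."""
    fd, temp_path = tempfile.mkstemp(suffix=".whl")
    try:
        with os.fdopen(fd, "wb") as out:
            shutil.copyfileobj(artifact_file, out)  # the single storage read
        return extract_content(temp_path), extract_metadata(temp_path)
    finally:
        os.unlink(temp_path)

content, metadata = repair_one_package(
    io.BytesIO(b"wheel-bytes"),
    extract_content=lambda p: open(p, "rb").read(),
    extract_metadata=lambda p: os.path.getsize(p),
)
```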
The temp file is deleted after each loop iteration (in the finally
block), but the metadata batch is flushed later — so the temp_path
stored in the batch points to a deleted file.

Fix: use temp_path only for artifact_to_python_content_data (avoids
one S3 read per package). The metadata batch falls back to
artifact_to_metadata_artifact's original behavior for the second read.
Combined with BULK_SIZE=250, this is still a major memory improvement.

JIRA: PULP-1573
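The lifetime bug this commit fixes is easy to reproduce in isolation: a path stored for deferred work outlives its file once the per-iteration cleanup runs (tempfile here stands in for the repair loop's temp wheel):

```python
import os
import tempfile

deferred_batch = []

fd, temp_path = tempfile.mkstemp(suffix=".whl")
os.close(fd)
deferred_batch.append(temp_path)  # the batch flush happens later...
os.unlink(temp_path)              # ...but the finally block deletes the file now

still_there = os.path.exists(deferred_batch[0])
print(still_there)  # False: flushing the batch would hit a missing file
```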

Labels

multi-commit (Add to bypass single commit lint check), no-changelog, no-issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

repair_metadata OOMs on large repositories (1000+ packages)

2 participants