Fix repair_metadata OOM on large repositories #1189
Conversation
Large repositories (1000+ packages) cause workers to OOM during repair_metadata. Reducing BULK_SIZE flushes batches 4x more often, capping the number of Django model instances and artifact references held in memory at any given time. JIRA: PULP-1573 Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
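The batching pattern the commit tunes looks roughly like this (a minimal sketch; `process` and `flush` are stand-ins for the real per-package work and bulk-create logic, not the actual pulp_python functions):

```python
BULK_SIZE = 250  # reduced from 1000; peak memory scales with batch size

def repair_in_batches(items, process, flush):
    # Accumulate processed results and flush every BULK_SIZE items,
    # so at most BULK_SIZE model instances are held in memory at once.
    batch = []
    flushes = 0
    for item in items:
        batch.append(process(item))
        if len(batch) >= BULK_SIZE:
            flush(batch)
            flushes += 1
            batch.clear()
    if batch:  # final partial batch
        flush(batch)
        flushes += 1
    return flushes
```

With 1001 items this flushes five times (four full batches plus the remainder) instead of twice at the old size, trading a few extra bulk writes for a 4x lower memory ceiling.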
Adds a variant of artifact_to_metadata_artifact that accepts an already-extracted temp file path instead of re-reading the artifact from storage. Also adds keep_temp_file parameter to artifact_to_python_content_data to support reuse. This prepares for eliminating the double S3 read per wheel during repair_metadata. JIRA: PULP-1573 Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
During repair_metadata, each wheel was read from S3 twice: once for content metadata extraction and again for metadata artifact creation. Now the temp file from the first read is reused for the second, halving S3 reads and temp file I/O for wheels. Also explicitly closes artifact.file after each iteration to release S3 buffer memory immediately instead of waiting for GC. JIRA: PULP-1573 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
JIRA: PULP-1573 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Verifies that repair_metadata correctly processes multiple wheel packages across batch flush boundaries after BULK_SIZE reduction. JIRA: PULP-1573
JIRA: PULP-1573
JIRA: PULP-1573
The explicit close() on main_artifact.file broke the fallback path in _process_metadata_batch, which calls artifact_to_metadata_artifact() for the final batch — that function calls artifact.file.seek(0) on the already-closed handle. Removing the close is safe: the file handle is released when the batch flushes and references are cleared. The temp file cleanup (os.unlink) still happens immediately. JIRA: PULP-1573
pulp_python/app/tasks/repair.py
Outdated
```python
# For wheels, keep the temp file so we can reuse it for metadata artifact
is_wheel = package.filename.endswith(".whl")
if is_wheel:
    new_data, temp_path = artifact_to_python_content_data(
        package.filename, main_artifact, domain, keep_temp_file=True
    )
else:
    new_data = artifact_to_python_content_data(package.filename, main_artifact, domain)
    temp_path = None
```
A better way would be to create a helper function that copies artifact to a temp file, then use the result for content data extraction in artifact_to_python_content_data and metadata extraction via extract_wheel_metadata. update_metadata_artifact_if_needed can then use temp file / metadata content in bytes instead of main_artifact.
Agreed on the helper approach — will rework to have a small helper that copies the artifact to a temp file, then pass that path to both artifact_to_python_content_data and artifact_to_metadata_artifact.
One thing I'd like to discuss: passing metadata content in bytes through update_metadata_artifact_if_needed would couple the content extraction and artifact creation paths — artifact_to_python_content_data would need to know about and return the raw metadata bytes alongside the parsed data. Passing the temp file path keeps both functions independent: each opens the wheel for exactly what it needs. Would you be OK with just passing the temp path instead of bytes?
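A sketch of the helper under discussion (the name `copy_artifact_to_temp_file` comes from the thread; the stand-in class below only assumes an object with a seekable `.file` attribute, which is not the real pulpcore `Artifact`):

```python
import io
import os
import shutil
import tempfile

def copy_artifact_to_temp_file(artifact, suffix=".whl"):
    # Stream the artifact's file to local disk once; the caller passes the
    # returned path to both extraction steps and unlinks it when done.
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        artifact.file.seek(0)
        shutil.copyfileobj(artifact.file, tmp)
    return tmp.name

# Stand-in for an Artifact: anything with a file-like .file works here.
class FakeArtifact:
    def __init__(self, data):
        self.file = io.BytesIO(data)

path = copy_artifact_to_temp_file(FakeArtifact(b"wheel bytes"))
with open(path, "rb") as f:
    contents = f.read()
os.unlink(path)  # caller-side cleanup
print(contents)  # b'wheel bytes'
```

Both downstream functions then take a plain path and stay decoupled, as argued above.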
pulp_python/app/utils.py
Outdated
```diff
-def artifact_to_python_content_data(filename, artifact, domain=None):
+def artifact_to_python_content_data(filename, artifact, domain=None, keep_temp_file=False):
```
There should be only minimal changes: use either the temp path or the original logic with `tempfile.NamedTemporaryFile`.
pulp_python/app/utils.py
Outdated
```python
    return metadata_artifact


def artifact_to_metadata_artifact_from_path(filename: str, temp_wheel_path: str) -> Artifact | None:
```
This is not needed. `artifact_to_metadata_artifact` should take a new parameter (a temp path or the metadata content in bytes) and use either that or the original logic with `tempfile.NamedTemporaryFile`.
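The shape the reviewer is asking for, sketched with the parameter plumbing only (`_read_size` is a hypothetical stand-in for the real wheel-metadata extraction, and the artifact argument is anything with a file-like `.file`):

```python
import os
import shutil
import tempfile

def _read_size(path):
    # Hypothetical stand-in for the real metadata extraction from a wheel.
    return os.path.getsize(path)

def artifact_to_metadata_artifact(filename, artifact, temp_path=None):
    # Minimal change: reuse an already-extracted temp file when given one,
    # otherwise fall back to the original NamedTemporaryFile logic.
    if temp_path is not None:
        return _read_size(temp_path)
    with tempfile.NamedTemporaryFile(suffix=".whl") as tmp:
        artifact.file.seek(0)
        shutil.copyfileobj(artifact.file, tmp)
        tmp.flush()
        return _read_size(tmp.name)
```

One optional keyword argument covers both call sites, so no parallel `_from_path` variant is needed.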
Address review feedback:
- Add copy_artifact_to_temp_file helper that copies artifact to disk
- artifact_to_python_content_data accepts optional temp_path param
- artifact_to_metadata_artifact accepts optional temp_path param
- Remove artifact_to_metadata_artifact_from_path (no longer needed)
- Repair loop creates temp file once via helper, passes to both functions
- Use try/finally for temp file cleanup

JIRA: PULP-1573
The temp file is deleted after each loop iteration (in the finally block), but the metadata batch is flushed later — so the temp_path stored in the batch points to a deleted file. Fix: use temp_path only for artifact_to_python_content_data (avoids one S3 read per package). The metadata batch falls back to artifact_to_metadata_artifact's original behavior for the second read. Combined with BULK_SIZE=250, this is still a major memory improvement. JIRA: PULP-1573
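The lifetime bug described above is easy to see in isolation: stashing a path does not keep the file alive once the per-iteration cleanup runs (minimal sketch, not the actual repair loop):

```python
import os
import tempfile

tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"wheel contents")
tmp.close()

batch_entry = tmp.name  # path stashed in the metadata batch for later
os.unlink(tmp.name)     # finally-block cleanup at the end of the iteration

# By the time the batch flushes, the stored path is dangling:
print(os.path.exists(batch_entry))  # False
```

Hence the fix: only the first, in-loop read uses the temp path, and the deferred batch work re-reads from storage.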
Summary
- Reduce `BULK_SIZE` from 1000 to 250, flushing batches 4x more often to cap peak memory

Fixes #1188
Test plan
- `test_repair.py` tests pass (metadata repair command, endpoint, artifact repair)
- `test_metadata_repair_batch_boundary` passes with reduced `BULK_SIZE`
- Ran `repair-python-metadata.py --env stage --domain <large-domain>` to verify no OOM

JIRA: PULP-1573
🤖 Generated with Claude Code