
Fix repair_metadata OOM on large repositories #1189

Open
decko wants to merge 10 commits into pulp:main from decko:fix/repair-metadata-memory

Conversation

@decko
Member

@decko decko commented Apr 9, 2026

Summary

  • Reduce BULK_SIZE from 1000 to 250, flushing batches 4x more often to cap peak memory
  • Eliminate double S3 read per wheel by reusing the temp file from metadata extraction for metadata artifact creation
  • Explicitly close artifact file handles after each iteration to release S3 buffer memory
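The batching change in the first bullet can be illustrated with a minimal sketch; the `flush` callback and package iterable below are stand-ins for illustration, not Pulp APIs:

```python
BULK_SIZE = 250  # reduced from 1000, so batches flush 4x more often

def repair_in_batches(packages, flush):
    """Accumulate work and flush every BULK_SIZE items to cap peak memory."""
    batch = []
    for package in packages:
        batch.append(package)
        if len(batch) >= BULK_SIZE:
            flush(batch)
            batch.clear()  # drop model instances / artifact references
    if batch:
        flush(batch)  # final partial batch

sizes = []
repair_in_batches(range(600), lambda batch: sizes.append(len(batch)))
print(sizes)  # [250, 250, 100]
```

With 600 packages this flushes twice at 250 and once with the 100-item remainder, so at most 250 items are ever held in memory.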

Fixes #1188

Test plan

  • Existing test_repair.py tests pass (metadata repair command, endpoint, artifact repair)
  • New test_metadata_repair_batch_boundary passes with reduced BULK_SIZE
  • Deploy to stage and run repair-python-metadata.py --env stage --domain <large-domain> to verify no OOM

JIRA: PULP-1573

🤖 Generated with Claude Code

decko and others added 6 commits April 9, 2026 15:38
Large repositories (1000+ packages) cause workers to OOM during
repair_metadata. Reducing BULK_SIZE flushes batches 4x more often,
capping the number of Django model instances and artifact references
held in memory at any given time.

JIRA: PULP-1573

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a variant of artifact_to_metadata_artifact that accepts an
already-extracted temp file path instead of re-reading the artifact
from storage. Also adds keep_temp_file parameter to
artifact_to_python_content_data to support reuse.

This prepares for eliminating the double S3 read per wheel during
repair_metadata.

JIRA: PULP-1573

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
During repair_metadata, each wheel was read from S3 twice: once for
content metadata extraction and again for metadata artifact creation.
Now the temp file from the first read is reused for the second,
halving S3 reads and temp file I/O for wheels.

Also explicitly closes artifact.file after each iteration to release
S3 buffer memory immediately instead of waiting for GC.

JIRA: PULP-1573

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
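A minimal sketch of the single-read pattern this commit describes; `fetch_from_storage` and the two extraction reads are stand-ins for the S3-backed artifact.file and the wheel-parsing steps:

```python
import os
import tempfile

def fetch_from_storage(stats):
    """Stand-in for reading artifact.file from S3; counts storage hits."""
    stats["s3_reads"] += 1
    return b"wheel-bytes"

def repair_one_wheel(stats):
    fd, temp_path = tempfile.mkstemp(suffix=".whl")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(fetch_from_storage(stats))  # the only storage read
        with open(temp_path, "rb") as f:
            content_data = f.read()    # stand-in: content metadata extraction
        with open(temp_path, "rb") as f:
            metadata_bytes = f.read()  # stand-in: metadata artifact creation, reusing the file
        return content_data, metadata_bytes
    finally:
        os.unlink(temp_path)  # temp file cleanup still happens immediately

stats = {"s3_reads": 0}
repair_one_wheel(stats)
print(stats["s3_reads"])  # 1
```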
JIRA: PULP-1573

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Verifies that repair_metadata correctly processes multiple wheel
packages across batch flush boundaries after BULK_SIZE reduction.

JIRA: PULP-1573
@github-actions github-actions bot added the multi-commit (Add to bypass single commit lint check), no-changelog, and no-issue labels on Apr 9, 2026
decko added 2 commits April 9, 2026 16:30
The explicit close() on main_artifact.file broke the fallback path in
_process_metadata_batch, which calls artifact_to_metadata_artifact()
for the final batch — that function calls artifact.file.seek(0) on
the already-closed handle.

Removing the close is safe: the file handle is released when the
batch flushes and references are cleared. The temp file cleanup
(os.unlink) still happens immediately.

JIRA: PULP-1573
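The regression can be reproduced in miniature; `io.BytesIO` stands in for the S3-backed artifact.file handle:

```python
import io

handle = io.BytesIO(b"wheel-bytes")
handle.close()  # the explicit close added in the earlier commit

try:
    handle.seek(0)  # what the fallback path does on the final batch
    seek_failed = False
except ValueError:
    seek_failed = True  # "I/O operation on closed file"

print(seek_failed)  # True
```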
Comment on lines +123 to +131
    # For wheels, keep the temp file so we can reuse it for metadata artifact
    is_wheel = package.filename.endswith(".whl")
    if is_wheel:
        new_data, temp_path = artifact_to_python_content_data(
            package.filename, main_artifact, domain, keep_temp_file=True
        )
    else:
        new_data = artifact_to_python_content_data(package.filename, main_artifact, domain)
        temp_path = None
Contributor

A better way would be to create a helper function that copies the artifact to a temp file, then use the result for content-data extraction in artifact_to_python_content_data and for metadata extraction via extract_wheel_metadata. update_metadata_artifact_if_needed can then use the temp file / metadata content in bytes instead of main_artifact.
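A sketch of the suggested helper; the name `copy_artifact_to_temp_file` and its `artifact_file` argument are assumptions drawn from this review thread, not existing pulp_python API:

```python
import io
import os
import shutil
import tempfile

def copy_artifact_to_temp_file(artifact_file, suffix=".whl"):
    """Copy an open artifact file object to a named temp file; return its path.

    The caller owns the file and must os.unlink() it when finished.
    """
    fd, temp_path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as out:
            shutil.copyfileobj(artifact_file, out)
    except BaseException:
        os.unlink(temp_path)
        raise
    return temp_path

# Usage: copy once, hand the same path to every extraction step
path = copy_artifact_to_temp_file(io.BytesIO(b"wheel-bytes"))
with open(path, "rb") as f:
    round_tripped = f.read()
os.unlink(path)
```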

Member Author

Agreed on the helper approach — will rework to have a small helper that copies the artifact to a temp file, then pass that path to both artifact_to_python_content_data and artifact_to_metadata_artifact.

One thing I'd like to discuss: passing metadata content in bytes through update_metadata_artifact_if_needed would couple the content extraction and artifact creation paths — artifact_to_python_content_data would need to know about and return the raw metadata bytes alongside the parsed data. Passing the temp file path keeps both functions independent: each opens the wheel for exactly what it needs. Would you be OK with just passing the temp path instead of bytes?



-def artifact_to_python_content_data(filename, artifact, domain=None):
+def artifact_to_python_content_data(filename, artifact, domain=None, keep_temp_file=False):
Contributor

There should be only minimal changes: use either the temp path or the original logic with tempfile.NamedTemporaryFile.

return metadata_artifact


def artifact_to_metadata_artifact_from_path(filename: str, temp_wheel_path: str) -> Artifact | None:
Contributor

This is not needed. artifact_to_metadata_artifact should take a new parameter (a temp path or metadata content in bytes) and use either that or the original logic with tempfile.NamedTemporaryFile.

decko added 2 commits April 10, 2026 12:19
Address review feedback:
- Add copy_artifact_to_temp_file helper that copies artifact to disk
- artifact_to_python_content_data accepts optional temp_path param
- artifact_to_metadata_artifact accepts optional temp_path param
- Remove artifact_to_metadata_artifact_from_path (no longer needed)
- Repair loop creates temp file once via helper, passes to both functions
- Use try/finally for temp file cleanup

JIRA: PULP-1573
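Putting the reviewed design together, a hedged sketch of the loop body: the function name follows the commit message, and the two extraction callables are stand-ins for artifact_to_python_content_data and artifact_to_metadata_artifact:

```python
import io
import os
import shutil
import tempfile

def repair_one_package(artifact_file, extract_content, extract_metadata):
    """Copy the artifact to disk once, feed both extraction steps from that path,
    and clean the temp file up even if extraction raises (the try/finally from
    the commit)."""
    fd, temp_path = tempfile.mkstemp(suffix=".whl")
    try:
        with os.fdopen(fd, "wb") as out:
            shutil.copyfileobj(artifact_file, out)  # the single storage read
        return extract_content(temp_path), extract_metadata(temp_path)
    finally:
        os.unlink(temp_path)

content, metadata = repair_one_package(
    io.BytesIO(b"wheel-bytes"),
    extract_content=lambda p: open(p, "rb").read(),
    extract_metadata=lambda p: os.path.getsize(p),
)
```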
The temp file is deleted after each loop iteration (in the finally
block), but the metadata batch is flushed later — so the temp_path
stored in the batch points to a deleted file.

Fix: use temp_path only for artifact_to_python_content_data (avoids
one S3 read per package). The metadata batch falls back to
artifact_to_metadata_artifact's original behavior for the second read.
Combined with BULK_SIZE=250, this is still a major memory improvement.

JIRA: PULP-1573
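The lifetime bug this commit fixes is easy to reproduce in isolation: a path stored for deferred work outlives its file once the per-iteration cleanup runs (tempfile here stands in for the repair loop's temp wheel):

```python
import os
import tempfile

deferred_batch = []

fd, temp_path = tempfile.mkstemp(suffix=".whl")
os.close(fd)
deferred_batch.append(temp_path)  # the batch flush happens later...
os.unlink(temp_path)              # ...but the finally block deletes the file now

still_there = os.path.exists(deferred_batch[0])
print(still_there)  # False: flushing the batch would hit a missing file
```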

Labels

multi-commit (Add to bypass single commit lint check), no-changelog, no-issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

repair_metadata OOMs on large repositories (1000+ packages)

2 participants