Problem
When repair_metadata processes a repository with 1000+ packages, peak memory reaches ~7.7GB against an 8GB worker limit. The worker becomes unresponsive, misses its heartbeat, and Pulp marks the task as "Worker has gone missing." The task fails consistently on the same large repositories.
Root Cause
Three factors compound to create the memory spike:
- BULK_SIZE = 1000: the batch and metadata_batch lists accumulate up to 1000 items before flushing
- Double wheel read: each wheel is read from S3 twice, once in artifact_to_python_content_data and again in artifact_to_metadata_artifact
- No file handle cleanup: artifact.file handles are never explicitly closed, keeping buffered data in memory
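To illustrate the first factor, here is a minimal sketch of how the batch size bounds the number of objects held in memory at once. It is not the pulp_python code: process_in_batches and the flush callback are hypothetical names, and plain integers stand in for content objects.

```python
from typing import Callable, Iterable, List


def process_in_batches(
    items: Iterable[int],
    bulk_size: int,
    flush: Callable[[List[int]], None],
) -> int:
    """Accumulate items, flush every `bulk_size` items, and return the peak batch length."""
    batch: List[int] = []
    peak = 0
    for item in items:
        batch.append(item)
        peak = max(peak, len(batch))
        if len(batch) >= bulk_size:
            flush(batch)
            batch = []  # start a new list so the flushed one can be garbage collected
    if batch:
        flush(batch)  # flush the final partial batch
    return peak


flushed_sizes: List[int] = []
peak = process_in_batches(range(1042), 250, lambda b: flushed_sizes.append(len(b)))
print(peak, flushed_sizes)  # 250 [250, 250, 250, 250, 42]
```

With a batch size of 1000, the same 1042 items would be held in batches of 1000 and 42, so the peak number of live objects is four times higher.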
Proposed Fix
- Reduce BULK_SIZE from 1000 to 250
- Reuse the temp file from the first wheel read for metadata artifact creation
- Explicitly close artifact file handles after each iteration
Expected peak memory reduction: from ~7.7GB to ~2-3GB for a 1042-package repo.
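A rough sketch of the single-read and handle-cleanup ideas, using only the standard library. It is not the actual pulp_python change: read_wheel_once and its source argument are hypothetical stand-ins for the artifact file handle, and the sketch assumes the metadata artifact is derived from the wheel's .dist-info/METADATA file.

```python
import hashlib
import io
import shutil
import tempfile
import zipfile
from typing import BinaryIO, Tuple


def read_wheel_once(source: BinaryIO) -> Tuple[str, str]:
    """Copy `source` to a temp file once, then derive the METADATA text and its
    SHA-256 digest from that single local copy. Always closes `source`."""
    try:
        with tempfile.TemporaryFile() as tmp:
            shutil.copyfileobj(source, tmp)  # the only read of the (remote) wheel
            tmp.seek(0)
            with zipfile.ZipFile(tmp) as wheel:
                name = next(
                    n for n in wheel.namelist() if n.endswith(".dist-info/METADATA")
                )
                metadata = wheel.read(name)
    finally:
        source.close()  # release the handle and its buffers explicitly
    return metadata.decode("utf-8"), hashlib.sha256(metadata).hexdigest()


# Usage with an in-memory wheel standing in for an S3-backed artifact file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("pkg-1.0.dist-info/METADATA", "Name: pkg\nVersion: 1.0\n")
buf.seek(0)

text, digest = read_wheel_once(buf)
print(text.splitlines()[0], buf.closed)  # Name: pkg True
```

Both the parsed metadata and the digest come from the same temp file, so the remote object is fetched once per wheel, and the handle is closed even if parsing fails.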
Evidence
Task failure from production:
{
  "state": "failed",
  "error": {"reason": "Worker has gone missing."},
  "progress_reports": [{"total": 1042, "done": 833}]
}
Prometheus metrics show memory spiking from 1.5GB to 7.7GB (96.8% of 8GB limit) during the repair task.
Related: PULP-1573