repair_metadata OOMs on large repositories (1000+ packages) #1188

@decko

Description

Problem

When repair_metadata processes a repository with 1000+ packages, peak memory reaches ~7.7GB against an 8GB worker limit. The worker becomes unresponsive, misses its heartbeat, and Pulp marks the task as "Worker has gone missing." This consistently fails on the same large repos.

Root Cause

Three factors compound to create the memory spike:

  1. BULK_SIZE = 1000 — batch and metadata_batch lists accumulate up to 1000 items before flushing
  2. Double wheel read — each wheel is read from S3 twice: once in artifact_to_python_content_data and again in artifact_to_metadata_artifact
  3. No file handle cleanup — artifact.file handles are never explicitly closed, keeping buffered data in memory
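
The failing loop's shape, reduced to a hedged sketch (the function below and its parameters are hypothetical stand-ins, not the actual pulp_python code):

```python
def repair_loop_current(artifacts, download):
    """Sketch of the problematic pattern: each wheel is fetched twice,
    and up to BULK_SIZE items are buffered before a flush.
    `artifacts` and `download` are hypothetical stand-ins."""
    BULK_SIZE = 1000
    batch, metadata_batch = [], []
    for artifact in artifacts:
        # Read 1: wheel pulled from storage for the content data.
        content = download(artifact)
        # Read 2: the SAME wheel pulled again for the metadata artifact.
        metadata = download(artifact)
        batch.append(content)
        metadata_batch.append(metadata)
        # The artifact's file handle is never closed here, so its
        # buffered data stays resident until garbage collection.
        if len(batch) >= BULK_SIZE:
            batch.clear()
            metadata_batch.clear()
    return len(batch), len(metadata_batch)
```

A counting stub makes the double read visible: for 3 artifacts, `download` is invoked 6 times.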

Proposed Fix

  • Reduce BULK_SIZE from 1000 to 250
  • Reuse the temp file from the first wheel read for metadata artifact creation
  • Explicitly close artifact file handles after each iteration

Expected peak memory reduction: from ~7.7GB to ~2-3GB for a 1042-package repo.
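
A minimal sketch of the fixed loop, assuming hypothetical stand-ins `fetch_wheel`, `process`, and `flush` for the pulp_python internals:

```python
import tempfile

BULK_SIZE = 250  # reduced from 1000 so at most 250 items are buffered


def repair_metadata_fixed(artifacts, fetch_wheel, process, flush):
    """Read each wheel from storage once, reuse the local temp copy for
    both content data and metadata, and flush in small batches.
    All four parameters are hypothetical stand-ins."""
    batch = []
    for artifact in artifacts:
        # Single storage read: spool the wheel to a temp file and let
        # `process` derive both content data and metadata from it.
        with tempfile.NamedTemporaryFile(suffix=".whl") as tmp:
            tmp.write(fetch_wheel(artifact))
            tmp.flush()
            batch.append(process(tmp.name))
        # The `with` block closes the handle here, releasing its buffer.
        if len(batch) >= BULK_SIZE:
            flush(list(batch))
            batch.clear()  # drop references so memory can be reclaimed
    if batch:
        flush(list(batch))
```

With 600 artifacts this flushes batches of 250, 250, and 100, so at most 250 items are ever held in memory at once.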

Evidence

Task failure from production:

{
  "state": "failed",
  "error": {"reason": "Worker has gone missing."},
  "progress_reports": [{"total": 1042, "done": 833}]
}

Prometheus metrics show memory spiking from 1.5GB to 7.7GB (96.8% of 8GB limit) during the repair task.

Related: PULP-1573
