Problem
The package provenance lookup query in the PyPI Simple API retrieve method generates an inefficient nested subquery pattern that is a significant contributor to database load in production. During peak load on 2026-05-20 at 16:00 UTC, 4 concurrent instances of this query pattern consumed 2.64 avg active sessions on the RDS instance (db.m7g.4xlarge, 16 vCPUs). Total DB load reached 21.9 avg active sessions, exceeding the CPU count.
Source
pulp_python/app/pypi/views.py lines 387-388:
provenances = PackageProvenance.objects.filter(package__in=local_packages).values_list(
"package__filename", flat=True
)
SQL pattern observed (from RDS Performance Insights)
SELECT "python_pythonpackagecontent"."filename" AS "package__filename"
FROM "python_packageprovenance"
INNER JOIN "python_pythonpackagecontent"
ON ("python_packageprovenance"."package_id" = "python_pythonpackagecontent"."content_ptr_id")
WHERE "python_packageprovenance"."package_id" IN (
SELECT V0."content_ptr_id"
FROM "python_pythonpackagecontent" V0
WHERE V0."content_ptr_id" IN (
SELECT U0."pulp_id"
FROM "core_content" U0
WHERE U0."pulp_id" IN ('...uuid...', '...uuid...', ...)
)
)
The nested IN (SELECT ... IN (SELECT ...)) pattern is inefficient. The wait event breakdown during the spike was primarily CPU (17.9 avg active sessions), indicating compute-bound query execution (likely hash joins or repeated subquery evaluation).
Suggestions
- Combine with the metadata query: Instead of two separate queries (one for package metadata, one for provenance), use a single annotated queryset with
Exists() or Subquery() to check provenance in the same query
- Add an index on
python_packageprovenance.package_id if one doesn't exist
- Simplify the filter: The
package__in=local_packages produces the nested subquery. Consider using package_id__in=local_packages.values_list("pk", flat=True) to flatten the subquery, or passing an explicit list of PKs
- Cache provenance existence: If provenance data changes infrequently, cache which packages have provenance records
Problem
The package provenance lookup query in the PyPI Simple API
retrievemethod generates an inefficient nested subquery pattern that is a significant contributor to database load in production. During peak load on 2026-05-20 at 16:00 UTC, 4 concurrent instances of this query pattern consumed 2.64 avg active sessions on the RDS instance (db.m7g.4xlarge, 16 vCPUs). Total DB load reached 21.9 avg active sessions, exceeding the CPU count.Source
pulp_python/app/pypi/views.pylines 387-388:SQL pattern observed (from RDS Performance Insights)
The nested
IN (SELECT ... IN (SELECT ...))pattern is inefficient. The wait event breakdown during the spike was primarily CPU (17.9 avg active sessions), indicating compute-bound query execution (likely hash joins or repeated subquery evaluation).Suggestions
Exists()orSubquery()to check provenance in the same querypython_packageprovenance.package_idif one doesn't existpackage__in=local_packagesproduces the nested subquery. Consider usingpackage_id__in=local_packages.values_list("pk", flat=True)to flatten the subquery, or passing an explicit list of PKs