Optimize v3 collection API querysets. #1476

newswangerd · 2023-06-02T16:25:52Z

Changes

This PR updates the v3 collection API querysets so that each one executes as either a single queryset or a single queryset plus a set of prefetches.

Content is queried using a new, optimized function that queries the Content table directly instead of querying RepositoryContent and then performing a pk__in filter on the Content table, reducing the number and size of queries.

The CollectionVersion model has been updated with individual fields for the version number components so that CollectionVersions can be efficiently sorted by their semver version using .order_by() on the queryset.
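A minimal sketch of the idea, in plain Python rather than Django: once the version string is split into separate components, ordering becomes a cheap tuple comparison instead of parsing every version at sort time. The names `VersionParts`, `split_version`, and `sort_key` are hypothetical, and pre-release tags are compared as plain strings here, which is a simplification of full semver precedence.

```python
import re
from typing import NamedTuple, Tuple

# Simplified semver pattern: major.minor.patch, optional -prerelease, optional +build.
_SEMVER = re.compile(
    r"^(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?(?:\+[0-9A-Za-z.-]+)?$"
)


class VersionParts(NamedTuple):
    """Hypothetical stand-in for the individual version fields on the model."""

    major: int
    minor: int
    patch: int
    prerelease: str


def split_version(version: str) -> VersionParts:
    """Split a version string into components that can be stored separately."""
    match = _SEMVER.match(version)
    if match is None:
        raise ValueError(f"not a semantic version: {version!r}")
    major, minor, patch, prerelease = match.groups()
    return VersionParts(int(major), int(minor), int(patch), prerelease or "")


def sort_key(parts: VersionParts) -> Tuple[int, int, int, bool, str]:
    """Numeric components first; a final release sorts above its pre-releases."""
    return (parts.major, parts.minor, parts.patch, parts.prerelease == "", parts.prerelease)


versions = ["1.10.0", "1.2.0", "1.10.0-rc1", "2.0.0", "1.9.3"]
newest_first = sorted(versions, key=lambda v: sort_key(split_version(v)), reverse=True)
print(newest_first)
```

Note that a plain string sort would place 1.10.0 below 1.2.0 and 1.9.3, which is why numeric components are needed for database-side ordering.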

The collection version list view and collection version detail view responses are now cached by repository version pk.
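A rough sketch of caching keyed by repository version pk, using a toy in-memory dictionary. This is not the PR's implementation: the class `RepoVersionResponseCache`, its method names, and the placeholder response are all hypothetical. It assumes a new repository version always gets a new pk, so entries never need explicit invalidation.

```python
from typing import Callable, Dict, Tuple

# Key: (repository version pk, request path, limit, offset)
CacheKey = Tuple[str, str, int, int]


class RepoVersionResponseCache:
    """Toy in-memory cache; a real service would likely use a shared backend."""

    def __init__(self) -> None:
        self._store: Dict[CacheKey, dict] = {}
        self.builds = 0  # counts how often a response had to be rendered

    def get_or_build(
        self,
        repo_version_pk: str,
        path: str,
        limit: int,
        offset: int,
        build: Callable[[], dict],
    ) -> dict:
        key = (repo_version_pk, path, limit, offset)
        if key not in self._store:
            self.builds += 1
            self._store[key] = build()
        return self._store[key]


def render() -> dict:
    """Placeholder for the expensive queryset plus serialization step."""
    return {"data": ["placeholder"]}


cache = RepoVersionResponseCache()
path = "index/ansible/posix/versions/"
first = cache.get_or_build("rv-1", path, 100, 0, render)
second = cache.get_or_build("rv-1", path, 100, 0, render)  # served from cache
third = cache.get_or_build("rv-2", path, 100, 0, render)   # new repo version, new key
print(cache.builds)
```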

Results

Summary

~8x improvement in the number of simultaneous clients and ~24x improvement in average client response time (at 128 clients)

Tests

Tested with X ansible-galaxy clients simultaneously pulling ansible.posix

Current pulp ansible main branch: 128 simultaneous downloads before any client timeouts

-------------------------------------------------------------------------------------------
running with 128 total clients against http://localhost:5001/pulp_ansible/galaxy/test/api/
-------------------------------------------------------------------------------------------
failures: []
failure percent: 0.0
avg duration: 41.084137023437506

This PR: 1024 simultaneous downloads before any client timeouts

-------------------------------------------------------------------------------------------
running with 128 total clients against http://localhost:5001/pulp_ansible/galaxy/test/api/
-------------------------------------------------------------------------------------------
failures: []
failure percent: 0.0
avg duration: 1.6661786093750004

--------------------------------------------------------------------------------------------
running with 1024 total clients against http://localhost:5001/pulp_ansible/galaxy/test/api/
--------------------------------------------------------------------------------------------
failures: []
failure percent: 0.0
avg duration: 5.763926267578129
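The summary figures can be reproduced from the logged numbers above. The variable names are illustrative only; the values are copied from the test output.

```python
# Values copied from the test runs reported above.
baseline_clients = 128            # main branch: clients before timeouts
pr_clients = 1024                 # this PR: clients before timeouts
baseline_avg = 41.084137023437506  # main branch, 128 clients
pr_avg_128 = 1.6661786093750004    # this PR, 128 clients
pr_avg_1024 = 5.763926267578129    # this PR, 1024 clients

client_ratio = pr_clients / baseline_clients
speedup_same_load = baseline_avg / pr_avg_128
speedup_8x_load = baseline_avg / pr_avg_1024

print(f"client capacity: {client_ratio:.0f}x")
print(f"avg duration at 128 clients: {speedup_same_load:.1f}x faster")
print(f"avg duration, 1024 clients vs. baseline at 128: {speedup_8x_load:.1f}x faster")
```

By these numbers, the PR at 1024 clients still has a lower average duration than the main branch at 128 clients.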

Fixes: https://issues.redhat.com/browse/AAH-2346

jctanner · 2023-06-05T18:23:18Z

pulp_ansible/app/utils.py

+    """
+    A more efficient way to query content in a repository version.
+
+    Returns a the given queryset, but filtered to only include content


Returns a ... ?

gerrod3 · 2023-06-08T01:37:57Z

pulp_ansible/app/galaxy/v3/views.py

+        latest_cv_version_qs = (
+            filter_content_for_repo_version(CollectionVersion.objects, repo_version)


Have you tried using distinct("namespace", "name") on this query after the order_by to see if it speeds it up?

I don't think that would matter. It's already filtering by .filter(collection=OuterRef("pk"))

gerrod3 · 2023-06-08T01:40:06Z

pulp_ansible/app/galaxy/v3/views.py

+        # to cache all of them, so this will only cache queries that stick to
+        # limit/offset params, which the ansible-galaxy client uses


What are all the endpoints that the ansible-galaxy client uses?

index/<namespace>/<name>/versions/

index/<namespace>/<name>/versions/<version>/

artifacts/collections/<path>/<filename>/

[noissue]

gerrod3

Looking good. Have some questions and suggestions before final approval.

gerrod3 · 2023-06-14T02:21:37Z

pulp_ansible/app/utils.py

+    """
+
+    repo_version_qs = RepositoryVersion.objects.filter(
+        repository=repo_version.repository, number__lte=repo_version.number


Suggested change

-        repository=repo_version.repository, number__lte=repo_version.number
+        repository=repo_version.repository_id, number__lte=repo_version.number

Or maybe it's repository_id=repo_version.repository_id. This should prevent another db request.

gerrod3 · 2023-06-14T02:28:51Z

pulp_ansible/app/utils.py

+    f = Q(version_added__in=repo_version_qs) & Q(
+        Q(version_removed=None) | ~Q(version_removed__in=repo_version_qs)
+    )
+    content_rel = RepositoryContent.objects.filter(f)


Is this Q filter better than using .filter().exclude()? Seems harder to understand (at least to me).

In my testing this was slightly more efficient. The key here is Q(version_removed=None) | ~Q(version_removed__in=repo_version_qs). If version_removed is None, the lookup in repo_version_qs is skipped in Postgres. When you use .exclude(), it always performs the version_removed__in lookup.

I think you should add Q(repository=repo_version.repository_id) to the filter.
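For readers following the filter logic, here is a plain-Python model of what the Q expression selects. It is a sketch, not the PR's code: `RepoContentRow` and `content_in_version` are hypothetical names, and it assumes repository version numbers run from 0 up to the requested number within a single repository. Python's `or` short-circuits when version_removed is None, which mirrors the intent described above; whether Postgres actually skips the lookup is the author's observation and is not demonstrated here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RepoContentRow:
    """Hypothetical stand-in for a RepositoryContent row (version numbers only)."""

    content: str
    version_added: int
    version_removed: Optional[int] = None


def content_in_version(rows: List[RepoContentRow], version_number: int) -> List[str]:
    """Return content present in the given repository version.

    Models: Q(version_added__in=qs) & (Q(version_removed=None) | ~Q(version_removed__in=qs))
    where qs is every version of the repository with number <= version_number.
    """
    versions_up_to = set(range(version_number + 1))
    return [
        row.content
        for row in rows
        if row.version_added in versions_up_to
        and (row.version_removed is None or row.version_removed not in versions_up_to)
    ]


rows = [
    RepoContentRow("a", version_added=1),
    RepoContentRow("b", version_added=1, version_removed=3),
    RepoContentRow("c", version_added=4),
]
print(content_in_version(rows, 2))
print(content_in_version(rows, 3))
print(content_in_version(rows, 4))
```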

gerrod3 · 2023-06-14T02:43:27Z

pulp_ansible/app/galaxy/v3/views.py

+
+        qs = (
+            filter_content_for_repo_version(CollectionVersion.objects, repo_version)
+            .select_related("content_ptr__contentartifact")


I'm not sure this "content_ptr__contentartifact" is doing anything. Can you look at the generated query and see what it produces?

gerrod3 · 2023-06-14T03:07:15Z

pulp_ansible/app/galaxy/v3/views.py

+            RepositoryContent.objects.filter(
+                repository_id=repo_version.repository_id,
+                version_added__number__lte=repo_version.number,
+            )
+            .select_related("content__ansible_collectionversion")
+            .select_related("version_added")
+            .select_related("version_removed")
+            .filter(content__ansible_collectionversion__collection_id=OuterRef("pk"))
+            .annotate(
+                last_updated=Greatest(
+                    "version_added__pulp_created", "version_removed__pulp_created"
+                )
+            )
+            .order_by("-last_updated")
+            .only("last_updated")


This seems overly complicated. RepositoryContent should only get updated when the content is removed (version_removed gets set) so we probably don't need to perform this select_related, just check the pulp_last_updated on the RepositoryContent.

Suggested change

-            RepositoryContent.objects.filter(
-                repository_id=repo_version.repository_id,
-                version_added__number__lte=repo_version.number,
-            )
-            .select_related("content__ansible_collectionversion")
-            .select_related("version_added")
-            .select_related("version_removed")
-            .filter(content__ansible_collectionversion__collection_id=OuterRef("pk"))
-            .annotate(
-                last_updated=Greatest(
-                    "version_added__pulp_created", "version_removed__pulp_created"
-                )
-            )
-            .order_by("-last_updated")
-            .only("last_updated")
+            RepositoryContent.objects.filter(
+                repository_id=repo_version.repository_id,
+                version_added__number__lte=repo_version.number,
+            )
+            .select_related("content__ansible_collectionversion")
+            .filter(content__ansible_collectionversion__collection_id=OuterRef("pk"))
+            .order_by("-pulp_last_updated")
+            .only("pulp_last_updated")

I tried using pulp_last_updated and it doesn't work. That field doesn't seem to get updated when version_removed is set on the RepositoryContent entry.
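As a plain-Python illustration of what the Greatest annotation computes per row: the later of the two creation timestamps, with a missing version_removed ignored. On PostgreSQL, GREATEST skips NULL arguments, which this sketch mirrors. The function name `last_updated` and the sample timestamps are hypothetical.

```python
from datetime import datetime
from typing import Optional


def last_updated(added_created: datetime, removed_created: Optional[datetime]) -> datetime:
    """Later of the two timestamps; a missing removal timestamp is ignored."""
    candidates = [ts for ts in (added_created, removed_created) if ts is not None]
    return max(candidates)


added = datetime(2023, 6, 1, 12, 0)      # made-up version_added.pulp_created
removed = datetime(2023, 6, 10, 9, 30)   # made-up version_removed.pulp_created

print(last_updated(added, None))     # content still present: the add time wins
print(last_updated(added, removed))  # content later removed: the removal time wins
```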

gerrod3 · 2023-06-14T03:12:19Z

pulp_ansible/app/galaxy/v3/views.py

-        # This is -very- slow when the collection has many versions.
-        # queryset = sorted(
-        #    queryset, key=lambda obj: semantic_version.Version(obj.version), reverse=True
-        # )
-


Do we not want to bring back sorting by version now that version has been broken up? Right now I think the returned results are default sorted by pulp_created which will typically be the same as sorted versions.

sorting by versions will still likely be slower, so I'm not sure it's worth bringing this back since it seems to be working fine.

[noissue]

gerrod3

Final changes needed.

gerrod3 · 2023-06-21T13:03:02Z

pulp_ansible/app/galaxy/v3/serializers.py

@@ -111,7 +101,7 @@ class CollectionVersionListSerializer(serializers.ModelSerializer):

    href = serializers.SerializerMethodField()
    created_at = serializers.DateTimeField(source="collection.pulp_created")


Shouldn't this also be just the pulp_created of the CV?

gerrod3 · 2023-06-21T13:06:24Z

pulp_ansible/app/galaxy/v3/serializers.py

+        # the unpaginated viewset uses an iterator, which doesn't seem to evaluate prefetch
+        if hasattr(obj, "artifacts"):


According to the django 4.2 docs, iterator will now evaluate the prefetch if chunk_size is set in the iterator call. Can you test this out?

gerrod3 · 2023-06-21T13:08:39Z

pulp_ansible/app/galaxy/v3/serializers.py

+        if self._get_artifact(obj).artifact:
+            return ArtifactRefSerializer(self._get_artifact(obj)).data


Suggested change

-        if self._get_artifact(obj).artifact:
-            return ArtifactRefSerializer(self._get_artifact(obj)).data
+        if (ca := self._get_artifact(obj)) and ca.artifact:
+            return ArtifactRefSerializer(ca).data

Does the ArtifactRefSerializer really take in a ContentArtifact instead of an Artifact?
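One detail worth noting when combining an assignment expression with `and`: `:=` binds more loosely than `and`, so the assignment needs its own parentheses. The sketch below uses stand-in objects; `lookup`, `artifact_ref`, and `artifact_ref_unparenthesized` are hypothetical names, not functions from pulp_ansible.

```python
from types import SimpleNamespace


def lookup(obj: dict):
    """Hypothetical stand-in for self._get_artifact(obj)."""
    return obj.get("content_artifact")


def artifact_ref(obj: dict):
    # Parenthesized: ca is bound first, then ca.artifact is checked.
    if (ca := lookup(obj)) and ca.artifact:
        return ca.artifact
    return None


def artifact_ref_unparenthesized(obj: dict):
    # Parsed as: ca := (lookup(obj) and ca.artifact)
    # so ca.artifact is read before ca has been assigned.
    if ca := lookup(obj) and ca.artifact:
        return ca.artifact
    return None


with_artifact = {"content_artifact": SimpleNamespace(artifact="sha256:abc")}
on_demand = {"content_artifact": SimpleNamespace(artifact=None)}

print(artifact_ref(with_artifact))
print(artifact_ref(on_demand))

try:
    artifact_ref_unparenthesized(with_artifact)
    unparenthesized_failed = False
except NameError:  # UnboundLocalError is a subclass of NameError
    unparenthesized_failed = True
print(unparenthesized_failed)
```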

gerrod3 · 2023-06-21T13:12:42Z

pulp_ansible/app/galaxy/v3/serializers.py

+        if not self._get_artifact(obj).artifact:
+            return self._get_artifact(obj).remoteartifact_set.all()[0].url[:-47]


Suggested change

-        if not self._get_artifact(obj).artifact:
-            return self._get_artifact(obj).remoteartifact_set.all()[0].url[:-47]
+        ca = self._get_artifact(obj)
+        if not ca.artifact:
+            return ca.remoteartifact_set.first().url[:-47]

I'm starting to think this _get_artifact isn't a well named function.

gerrod3 · 2023-06-21T13:19:50Z

pulp_ansible/app/galaxy/v3/views.py

+                    .values("metadata_sha256"),
+                )
+            )
+            .select_related("collection")


Suggested change

-            .select_related("collection")

get_queryset already select_related collection.

gerrod3 · 2023-06-21T14:12:22Z

pulp_ansible/app/galaxy/v3/views.py

+                    .values("metadata_sha256"),
+                )
+            )
+            .select_related("collection")


Suggested change

-            .select_related("collection")

[noissue]

newswangerd requested review from jctanner and gerrod3 June 2, 2023 16:25

newswangerd force-pushed the fix/AAH-2346-inefficient-ns-queries branch from 5a561e3 to 71c8dff Compare June 2, 2023 16:36

jctanner reviewed Jun 5, 2023

gerrod3 reviewed Jun 7, 2023

gerrod3 reviewed Jun 8, 2023

newswangerd added 8 commits June 9, 2023 14:54

Optimize collection list viewset

e151bbd

[noissue]

Optimize collection versions api

0f4ca10

[noissue]

Add caching

4ac4142

[noissue]

Optimize filter_content function

08356f3

[noissue]

Speed up migration

de46d34

[noissue]

Fix CI

fb65031

[noissue]

Implement repo version filtering

2be3c94

[noissue]

Fix migration

a6ed620

[noissue]

newswangerd force-pushed the fix/AAH-2346-inefficient-ns-queries branch from 7f3e6e8 to a6ed620 Compare June 9, 2023 20:55

newswangerd added 3 commits June 9, 2023 15:50

Fix functional tests

8875596

[noissue]

Fix functional tests

35839e0

[noissue]

Fix more tests

94a4b41

[noissue]

newswangerd force-pushed the fix/AAH-2346-inefficient-ns-queries branch from 4633624 to 94a4b41 Compare June 12, 2023 19:40

Fix more test failures

e4d3a49

[noissue]

newswangerd requested review from jctanner and gerrod3 June 13, 2023 13:49

gerrod3 requested changes Jun 14, 2023

Make review changes

13a9b2f

[noissue]

newswangerd requested a review from gerrod3 June 20, 2023 14:19

gerrod3 requested changes Jun 21, 2023

jctanner mentioned this pull request Jun 21, 2023

Namespace compatibility changes #1437

Closed

8 tasks

newswangerd force-pushed the fix/AAH-2346-inefficient-ns-queries branch from 8d486a6 to 50dbb7c Compare June 23, 2023 13:58

Optimize more queries

9561995

[noissue]

newswangerd force-pushed the fix/AAH-2346-inefficient-ns-queries branch from 50dbb7c to 9561995 Compare June 23, 2023 17:29

newswangerd requested a review from gerrod3 June 23, 2023 18:10

Update content filter

c02e833

[noissue]

gerrod3 approved these changes Jun 26, 2023

newswangerd merged commit adc4e61 into pulp:main Jun 27, 2023
14 checks passed

kdelee mentioned this pull request Nov 17, 2023

refactor version sorting in CollectionVersionViewSet.list() #1410

Closed

This pull request was closed.
