Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize v3 collection API querysets. #1476

Merged
merged 15 commits into from
Jun 27, 2023

Conversation

newswangerd
Copy link
Contributor

@newswangerd newswangerd commented Jun 2, 2023

Changes

This PR updates the v3 collection API querysets so that they are all executed as either one queryset or one queryset plus a set of preteches.

Content is queried using a new, optimized function that queries the Content table directly instead of querying RerpositoryContent and then performing a pk__in filter on the Content table, reducing the number and size of queries.

The CollectionVersion model has been updated with individual fields for the version number components so that CollectionVersions can be efficiently sorted by their semver version using .order_by() on the queryset.

The collection version list view and collection version detail view responses are now cached by repository version pk.

Results

Summary

~8x improvements in number of simultaneous clients and 24x improvement in average client response time

Tests

Tested using X number of ansible-galaxy clients simultaneously pulling ansible.posix

Current pulp ansible main branch: 128 simultaneous downloads before any client timeouts

-------------------------------------------------------------------------------------------
running with 128 total clients against http://localhost:5001/pulp_ansible/galaxy/test/api/
-------------------------------------------------------------------------------------------
failures: []
failure percent: 0.0
avg duration: 41.084137023437506

This pr: 1024 simultaneous downloads before any clients timeout

-------------------------------------------------------------------------------------------
running with 128 total clients against http://localhost:5001/pulp_ansible/galaxy/test/api/
-------------------------------------------------------------------------------------------
failures: []
failure percent: 0.0
avg duration: 1.6661786093750004

--------------------------------------------------------------------------------------------
running with 1024 total clients against http://localhost:5001/pulp_ansible/galaxy/test/api/
--------------------------------------------------------------------------------------------
failures: []
failure percent: 0.0
avg duration: 5.763926267578129

Fixes: https://issues.redhat.com/browse/AAH-2346

@newswangerd newswangerd force-pushed the fix/AAH-2346-inefficient-ns-queries branch from 5a561e3 to 71c8dff Compare June 2, 2023 16:36
"""
A more efficient way to query content in a repository version.

Returns a the given queryset, but filtered to only include content
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returns a ... ?

pulp_ansible/app/utils.py Outdated Show resolved Hide resolved
Comment on lines +304 to +305
latest_cv_version_qs = (
filter_content_for_repo_version(CollectionVersion.objects, repo_version)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you try using distinct("namespace", "name") on this query after the order_by to see if it speeds it up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that would matter. It's already filtering by .filter(collection=OuterRef("pk"))

Comment on lines +855 to +865
# to cache all of them, so this will only cache queries that stick to
# limit/offset params, which the ansible-galaxy client uses
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are all the endpoints that the ansible-client uses?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • index/<namespace>/<name>/versions/
  • index/<namespace>/<name>/versions/<version>/
  • artifacts/collections/<path>/<filename>/

@newswangerd newswangerd force-pushed the fix/AAH-2346-inefficient-ns-queries branch from 7f3e6e8 to a6ed620 Compare June 9, 2023 20:55
@newswangerd newswangerd force-pushed the fix/AAH-2346-inefficient-ns-queries branch from 4633624 to 94a4b41 Compare June 12, 2023 19:40
Copy link
Contributor

@gerrod3 gerrod3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. Have some questions and suggestions before final approval.

"""

repo_version_qs = RepositoryVersion.objects.filter(
repository=repo_version.repository, number__lte=repo_version.number
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
repository=repo_version.repository, number__lte=repo_version.number
repository=repo_version.repository_id, number__lte=repo_version.number

Or maybe it's repository_id=repo_version.repository_id. This should prevent another db request.

Comment on lines 21 to 24
f = Q(version_added__in=repo_version_qs) & Q(
Q(version_removed=None) | ~Q(version_removed__in=repo_version_qs)
)
content_rel = RepositoryContent.objects.filter(f)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this Q filter better than using .filter().exclude()? Seems harder to understand (at least to me).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my testing this was slightly more efficient. The key here is Q(version_removed=None) | ~Q(version_removed__in=repo_version_qs). If the version remove is none, then it prevents the lookup in the repo_version_qs in postgres from happening. When you use .exclude() it will always perform the version_removed__in lookup

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should add Q(repository=repo_version.repository_id) to the filter.


qs = (
filter_content_for_repo_version(CollectionVersion.objects, repo_version)
.select_related("content_ptr__contentartifact")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this "content_ptr__contentartifact" is doing anything. Can you look at the generated query and see what it produces?

Comment on lines +299 to +313
RepositoryContent.objects.filter(
repository_id=repo_version.repository_id,
version_added__number__lte=repo_version.number,
)
.select_related("content__ansible_collectionversion")
.select_related("version_added")
.select_related("version_removed")
.filter(content__ansible_collectionversion__collection_id=OuterRef("pk"))
.annotate(
last_updated=Greatest(
"version_added__pulp_created", "version_removed__pulp_created"
)
)
.order_by("-last_updated")
.only("last_updated")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems overly complicated. RepositoryContent should only get updated when the content is removed (version_removed gets set) so we probably don't need to perform this select_related, just check the pulp_last_updated on the RepositoryContent.

Suggested change
RepositoryContent.objects.filter(
repository_id=repo_version.repository_id,
version_added__number__lte=repo_version.number,
)
.select_related("content__ansible_collectionversion")
.select_related("version_added")
.select_related("version_removed")
.filter(content__ansible_collectionversion__collection_id=OuterRef("pk"))
.annotate(
last_updated=Greatest(
"version_added__pulp_created", "version_removed__pulp_created"
)
)
.order_by("-last_updated")
.only("last_updated")
RepositoryContent.objects.filter(
repository_id=repo_version.repository_id,
version_added__number__lte=repo_version.number,
)
.select_related("content__ansible_collectionversion")
.filter(content__ansible_collectionversion__collection_id=OuterRef("pk"))
.order_by("-pulp_last_updated")
.only("pulp_last_updated")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried using pulp_last_updated and it doesn't work. That field doesn't seem to get updated when version removed is added to the RepositoryContent entry

Comment on lines -888 to -892
# This is -very- slow when the collection has many versions.
# queryset = sorted(
# queryset, key=lambda obj: semantic_version.Version(obj.version), reverse=True
# )

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we not want to bring back sorting by version now that version has been broken up? Right now I think the returned results are default sorted by pulp_created which will typically be the same as sorted versions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorting by versions will still likely be slower, so I'm not sure it's worth bringing this back since it seems to be working fine.

@newswangerd newswangerd requested a review from gerrod3 June 20, 2023 14:19
Copy link
Contributor

@gerrod3 gerrod3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Final changes needed.

@@ -111,7 +101,7 @@ class CollectionVersionListSerializer(serializers.ModelSerializer):

href = serializers.SerializerMethodField()
created_at = serializers.DateTimeField(source="collection.pulp_created")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this also be just the pulp_created of the CV?

Comment on lines 250 to 251
# the unpaginated viewset uses an iterator, which doesn't seem to evaluate prefetch
if hasattr(obj, "artifacts"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the django 4.2 docs, iterator will now evaluate the prefetch if chunk_size is set in the iterator call. Can you test this out?

Comment on lines 261 to 262
if self._get_artifact(obj).artifact:
return ArtifactRefSerializer(self._get_artifact(obj)).data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if self._get_artifact(obj).artifact:
return ArtifactRefSerializer(self._get_artifact(obj)).data
if ca := self._get_artifact(obj) and ca.artifact:
return ArtifactRefSerializer(ca).data

Does the ArtifactRefSerializer really take in a ContentArtifact instead of an Artifact?

Comment on lines 287 to 288
if not self._get_artifact(obj).artifact:
return self._get_artifact(obj).remoteartifact_set.all()[0].url[:-47]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if not self._get_artifact(obj).artifact:
return self._get_artifact(obj).remoteartifact_set.all()[0].url[:-47]
ca = self._get_artifact(obj)
if not ca.artifact:
return ca.remoteartifact_set.first().url[:-47]

I'm starting to think this _get_artifact isn't a well named function.

.values("metadata_sha256"),
)
)
.select_related("collection")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.select_related("collection")

get_queryset already select_related collection.

.values("metadata_sha256"),
)
)
.select_related("collection")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.select_related("collection")

@jctanner jctanner mentioned this pull request Jun 21, 2023
8 tasks
@newswangerd newswangerd force-pushed the fix/AAH-2346-inefficient-ns-queries branch from 8d486a6 to 50dbb7c Compare June 23, 2023 13:58
@newswangerd newswangerd force-pushed the fix/AAH-2346-inefficient-ns-queries branch from 50dbb7c to 9561995 Compare June 23, 2023 17:29
@newswangerd newswangerd requested a review from gerrod3 June 23, 2023 18:10
@newswangerd newswangerd merged commit adc4e61 into pulp:main Jun 27, 2023
14 checks passed
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants