-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize v3 collection API querysets. #1476
Optimize v3 collection API querysets. #1476
Conversation
5a561e3
to
71c8dff
Compare
pulp_ansible/app/utils.py
Outdated
""" | ||
A more efficient way to query content in a repository version. | ||
|
||
Returns a the given queryset, but filtered to only include content |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returns a ... ?
latest_cv_version_qs = ( | ||
filter_content_for_repo_version(CollectionVersion.objects, repo_version) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you try using distinct("namespace", "name")
on this query after the order_by
to see if it speeds it up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that would matter. It's already filtering by .filter(collection=OuterRef("pk"))
# to cache all of them, so this will only cache queries that stick to | ||
# limit/offset params, which the ansible-galaxy client uses |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are all the endpoints that the ansible-client uses?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
index/<namespace>/<name>/versions/
index/<namespace>/<name>/versions/<version>/
artifacts/collections/<path>/<filename>/
[noissue]
[noissue]
[noissue]
[noissue]
[noissue]
[noissue]
[noissue]
7f3e6e8
to
a6ed620
Compare
[noissue]
[noissue]
[noissue]
4633624
to
94a4b41
Compare
[noissue]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good. Have some questions and suggestions before final approval.
pulp_ansible/app/utils.py
Outdated
""" | ||
|
||
repo_version_qs = RepositoryVersion.objects.filter( | ||
repository=repo_version.repository, number__lte=repo_version.number |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
repository=repo_version.repository, number__lte=repo_version.number | |
repository=repo_version.repository_id, number__lte=repo_version.number |
Or maybe it's repository_id=repo_version.repository_id
. This should prevent another db request.
pulp_ansible/app/utils.py
Outdated
f = Q(version_added__in=repo_version_qs) & Q( | ||
Q(version_removed=None) | ~Q(version_removed__in=repo_version_qs) | ||
) | ||
content_rel = RepositoryContent.objects.filter(f) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this Q filter better than using .filter().exclude()
? Seems harder to understand (at least to me).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my testing this was slightly more efficient. The key here is Q(version_removed=None) | ~Q(version_removed__in=repo_version_qs)
. If the version remove is none, then it prevents the lookup in the repo_version_qs
in postgres from happening. When you use .exclude()
it will always perform the version_removed__in
lookup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should add Q(repository=repo_version.repository_id)
to the filter.
pulp_ansible/app/galaxy/v3/views.py
Outdated
|
||
qs = ( | ||
filter_content_for_repo_version(CollectionVersion.objects, repo_version) | ||
.select_related("content_ptr__contentartifact") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this "content_ptr__contentartifact" is doing anything. Can you look at the generated query and see what it produces?
RepositoryContent.objects.filter( | ||
repository_id=repo_version.repository_id, | ||
version_added__number__lte=repo_version.number, | ||
) | ||
.select_related("content__ansible_collectionversion") | ||
.select_related("version_added") | ||
.select_related("version_removed") | ||
.filter(content__ansible_collectionversion__collection_id=OuterRef("pk")) | ||
.annotate( | ||
last_updated=Greatest( | ||
"version_added__pulp_created", "version_removed__pulp_created" | ||
) | ||
) | ||
.order_by("-last_updated") | ||
.only("last_updated") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems overly complicated. RepositoryContent should only get updated when the content is removed (version_removed gets set) so we probably don't need to perform this select_related
, just check the pulp_last_updated
on the RepositoryContent.
RepositoryContent.objects.filter( | |
repository_id=repo_version.repository_id, | |
version_added__number__lte=repo_version.number, | |
) | |
.select_related("content__ansible_collectionversion") | |
.select_related("version_added") | |
.select_related("version_removed") | |
.filter(content__ansible_collectionversion__collection_id=OuterRef("pk")) | |
.annotate( | |
last_updated=Greatest( | |
"version_added__pulp_created", "version_removed__pulp_created" | |
) | |
) | |
.order_by("-last_updated") | |
.only("last_updated") | |
RepositoryContent.objects.filter( | |
repository_id=repo_version.repository_id, | |
version_added__number__lte=repo_version.number, | |
) | |
.select_related("content__ansible_collectionversion") | |
.filter(content__ansible_collectionversion__collection_id=OuterRef("pk")) | |
.order_by("-pulp_last_updated") | |
.only("pulp_last_updated") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried using pulp_last_updated
and it doesn't work. That field doesn't seem to get updated when version removed is added to the RepositoryContent entry
# This is -very- slow when the collection has many versions. | ||
# queryset = sorted( | ||
# queryset, key=lambda obj: semantic_version.Version(obj.version), reverse=True | ||
# ) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we not want to bring back sorting by version now that version has been broken up? Right now I think the returned results are default sorted by pulp_created
which will typically be the same as sorted versions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorting by versions will still likely be slower, so I'm not sure it's worth bringing this back since it seems to be working fine.
[noissue]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Final changes needed.
@@ -111,7 +101,7 @@ class CollectionVersionListSerializer(serializers.ModelSerializer): | |||
|
|||
href = serializers.SerializerMethodField() | |||
created_at = serializers.DateTimeField(source="collection.pulp_created") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this also be just the pulp_created
of the CV?
# the unpaginated viewset uses an iterator, which doesn't seem to evaluate prefetch | ||
if hasattr(obj, "artifacts"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the django 4.2 docs, iterator will now evaluate the prefetch if chunk_size
is set in the iterator call. Can you test this out?
if self._get_artifact(obj).artifact: | ||
return ArtifactRefSerializer(self._get_artifact(obj)).data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if self._get_artifact(obj).artifact: | |
return ArtifactRefSerializer(self._get_artifact(obj)).data | |
if ca := self._get_artifact(obj) and ca.artifact: | |
return ArtifactRefSerializer(ca).data |
Does the ArtifactRefSerializer really take in a ContentArtifact instead of an Artifact?
if not self._get_artifact(obj).artifact: | ||
return self._get_artifact(obj).remoteartifact_set.all()[0].url[:-47] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if not self._get_artifact(obj).artifact: | |
return self._get_artifact(obj).remoteartifact_set.all()[0].url[:-47] | |
ca = self._get_artifact(obj) | |
if not ca.artifact: | |
return ca.remoteartifact_set.first().url[:-47] |
I'm starting to think this _get_artifact
isn't a well named function.
.values("metadata_sha256"), | ||
) | ||
) | ||
.select_related("collection") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.select_related("collection") |
get_queryset
already select_related
collection.
.values("metadata_sha256"), | ||
) | ||
) | ||
.select_related("collection") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.select_related("collection") |
8d486a6
to
50dbb7c
Compare
[noissue]
50dbb7c
to
9561995
Compare
[noissue]
Changes
This PR updates the v3 collection API querysets so that they are all executed as either one queryset or one queryset plus a set of preteches.
Content is queried using a new, optimized function that queries the Content table directly instead of querying RerpositoryContent and then performing a
pk__in
filter on the Content table, reducing the number and size of queries.The CollectionVersion model has been updated with individual fields for the version number components so that CollectionVersions can be efficiently sorted by their semver version using .order_by() on the queryset.
The collection version list view and collection version detail view responses are now cached by repository version pk.
Results
Summary
~8x improvements in number of simultaneous clients and 24x improvement in average client response time
Tests
Tested using X number of ansible-galaxy clients simultaneously pulling
ansible.posix
Current pulp ansible main branch: 128 simultaneous downloads before any client timeouts
This pr: 1024 simultaneous downloads before any clients timeout
Fixes: https://issues.redhat.com/browse/AAH-2346