Optimize unpaginated collection_versions endpoint query #573
Conversation
WARNING!!! This PR is not attached to an issue. In most cases this is not advisable. Please see our PR docs for more information about how to attach this PR to an issue.
def list(self, request, *args, **kwargs):
    """
    Returns paginated CollectionVersions list.
    """
    queryset = self.filter_queryset(self.get_queryset())
    queryset = sorted(
        queryset, key=lambda obj: semantic_version.Version(obj.version), reverse=True
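The diff above sorts in Python with `semantic_version.Version` as the sort key, since lexicographic string ordering would put "1.10.0" before "1.2.0". A minimal sketch of the same idea using only the standard library (the real code relies on the `semantic_version` package, which also handles pre-release and build metadata; `version_key` here is a simplified stand-in that assumes plain "X.Y.Z" strings):

```python
def version_key(version):
    """Turn '1.10.2' into (1, 10, 2) so numeric tuple ordering applies."""
    return tuple(int(part) for part in version.split("."))

versions = ["1.2.0", "1.10.0", "0.9.1"]
ordered = sorted(versions, key=version_key, reverse=True)
print(ordered)  # → ['1.10.0', '1.2.0', '0.9.1']
```

Note that a naive `sorted(versions, reverse=True)` on the raw strings would return `['1.2.0', '1.10.0', '0.9.1']`, which is the bug the semantic key avoids.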
we need to sort by version
Why though? It's just a json dict that we are returning, does galaxy expect the dictionary to be ordered?
I don't know the ansible-galaxy CLI code exactly, but I think it expects the versions to be ordered:
https://galaxy.ansible.com/api/v2/collections/pulp/pulp_installer/versions/
But this endpoint is only for syncing, correct? They would still use the other endpoints that sort by version for installing collections.
Actually I checked the code and they just loop through versions
https://github.com/ansible/ansible/blob/devel/lib/ansible/galaxy/api.py#L767-L847
You are right! UnpaginatedCollectionVersionViewSet is only used on sync.
Dropping the ordering requirement probably would make it go faster.
and I agree it's only used on sync
pulp_ansible/app/galaxy/v3/views.py
Outdated
@staticmethod
def construct_full_collection_version(obj):
    """Constructs a new CollectionVersion object."""
    collection_attrs = {
        "pk": obj["collection_id"],
        "pulp_created": obj.pop("collection__pulp_created"),
        "pulp_last_updated": obj.pop("collection__pulp_last_updated"),
        "name": obj["name"],
        "namespace": obj["namespace"],
    }
    collection = Collection(**collection_attrs)
I don't know much about the Django ORM, but this method seems to be reimplementing something the ORM already provides.
@fao89 I think I agree. It's probably faster than the ORM, but it's unclear how much. My reading shows that it's constructing lighter weight versions of the objects, but I'm not seeing how it's producing meaningfully fewer database queries.
@gerrod3 overall producing some timing info for this PR I think would demonstrate if it's actually much faster or not.
Also pointing out the SQL queries before and after may be helpful.
If all of that is too much to do (which I could understand) maybe just go with the streaming option, which I thought would be a better fit for solving the timeout issue even if it does have a higher queryload.
Please point out anything that I'm missing in this; I've only just read it.
I typically use cProfile to get Python code timings and visualize them (lmk if you want any help).
Also getting the SQL EXPLAIN statements for the SQL produced is then more easily analyzed to show 'oh look these queries are better/faster'.
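For reference, a minimal sketch of the kind of cProfile measurement being suggested; `slow_view` here is a hypothetical stand-in for the endpoint code being profiled, not anything from pulp_ansible:

```python
import cProfile
import io
import pstats

def slow_view():
    # Hypothetical placeholder for the view logic under test.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
slow_view()
profiler.disable()

# Dump the top entries sorted by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

Running this before and after the optimization on the real endpoint would make the speedup claim concrete.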
select_related() might be what you guys are thinking of. Although I'm not sure I would rewrite this if it's already working. values() is probably a little faster by virtue of avoiding the object serialization overhead. There's a code clarity vs. performance tradeoff here, but since it seems like we're pretty focused on performance for these couple of endpoints, it wouldn't be the worst thing to just roll with it. I would just be sure that values() is doing that implicit select_related() though - it most likely is, but I don't know off the top of my head.
pulp_ansible/app/galaxy/v3/views.py
Outdated
queryset = sorted(
    queryset, key=lambda obj: semantic_version.Version(obj.version), reverse=True
)
queryset = map(self.construct_full_collection_version, self.get_queryset().iterator())
A list comprehension would probably be clearer in this case.
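A quick illustration of the suggestion, using a toy `construct` function standing in for `self.construct_full_collection_version` (the names and data below are invented for the example):

```python
def construct(obj):
    # Hypothetical stand-in for construct_full_collection_version.
    return {"version": obj}

rows = ["1.0.0", "2.0.0"]

# map() works, but returns a lazy iterator; the comprehension states the
# intent directly and produces a concrete list:
queryset = [construct(obj) for obj in rows]
print(queryset)  # → [{'version': '1.0.0'}, {'version': '2.0.0'}]
```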
@bmbouter Here is some comparison data. I synced pulp_installer, which has 33 collection versions. The old way does 4 + 5 * n database queries (4 + 5 * 33 = 169) while the new optimized way does 4 + n (4 + 33 = 37). The new way alone is not enough to fix the timeout issues with large collection version counts (2000+), but this optimization combined with the streaming endpoint #562 does fix the timeout issue for repositories up to 7000 collection versions (as far as I tested). These two changes only move the goalpost further and can't be a permanent solution. Do you think it's still worth trying them?
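The query counts quoted above can be checked directly from the two formulas:

```python
def old_query_count(n):
    """Old endpoint: 4 fixed queries plus 5 per collection version."""
    return 4 + 5 * n

def new_query_count(n):
    """Optimized endpoint: 4 fixed queries plus 1 per collection version."""
    return 4 + n

n = 33  # pulp_installer's collection version count from the comment
print(old_query_count(n), new_query_count(n))  # → 169 37

# The gap grows linearly: at the 7000-version scale mentioned above,
# the old scheme would issue 35004 queries versus 7004.
print(old_query_count(7000), new_query_count(7000))  # → 35004 7004
```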
Thank you for posting this and doing the analysis. I think I'm reading that the streaming option and this query optimization PR together resolved the issue as high as you've tested (7K collections). That's great. I suspect that of the two, the streaming resolution will resolve it on its own, because even if the queries take longer, we can just make the batches smaller until it also works up to 7K. If that's the case, that's what I'd recommend (keep the streaming resolution and not merge this query change). My thinking is that this mostly makes the ORM faster and it's not worth carrying that if we don't need to. @gerrod3 would you be willing to test the "largest collection count you tested" (iirc 7K?) against just the streaming PR? And if it doesn't work, maybe adjust the batch sizes to be smaller? Or is this not a great idea? wdyt?
I'm not sure adjusting the batch_sizes will make the streaming PR pass on its own, but maybe it will. I can adjust this PR to be less crazy but still have some optimizations. Like I said in our meeting, the 5 queries per collection version include 2 duplicates and 1 that could be removed with select_related (the current select_related is actually doing nothing on the query). Fewer queries will make the streaming endpoint faster.
Also we could accept as-is and move on to other work? Overall I don't think this code is going to be here that long because I believe we need to switch pulp_ansible to use publications and have this be pre-computed instead. If we are going to keep this code around then I think we do need to be careful, but if not, not really. @gerrod3 @daviddavis @fao89 wdyt should be done? Also, are there concerns about converting pulp_ansible to a publication-based plugin soon? Any feedback is welcome.
Force-pushed from 799218c to 5647440.
@bmbouter @fao89 @daviddavis I just simplified the optimization, so now the number of queries is 4 + 2n. I think this is much more manageable and still provides much better performance. I think we should merge this variation.
This looks good to me. Is the streaming response one already merged?
This PR looked good to me before; I didn't approve so far because I was hoping someone would jump in and say: "here, use this django_orm_magic_method instead"
But we need a changelog entry for this!
Other than that, consider this PR approved!
Thank you!
Agreed. Thank you!
@gerrod3 ready for merging?
[noissue]
This should reduce the number of database queries by 5x. So now the number of queries for this endpoint is n + 4, where n is the number of collection versions. I'm not sure if I can remove the final n queries to make this constant, since they are for the tags, which are a many-to-many relation.
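For what it's worth, the usual way to collapse per-row many-to-many lookups is to fetch all related rows in one bulk query and group them in Python, which is what Django's prefetch_related does under the hood (turning n tag queries into 1 extra query). A sketch of the grouping step, with invented data standing in for the real tag table:

```python
# Pretend result of one bulk query over the m2m table:
# (collection_version_pk, tag_name) pairs for every listed version.
tag_rows = [
    (1, "database"),
    (2, "cloud"),
    (2, "networking"),
]

# Group tags by collection version in Python, one pass, no further queries.
tags_by_version = {}
for pk, tag in tag_rows:
    tags_by_version.setdefault(pk, []).append(tag)

print(tags_by_version)  # → {1: ['database'], 2: ['cloud', 'networking']}
```

Whether this fits the endpoint as written depends on the queryset shape (values() rows rather than model instances), so treat it as a direction, not a drop-in fix.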