Optimize unpaginated collection_versions endpoint query #573
Conversation
WARNING!!! This PR is not attached to an issue. In most cases this is not advisable. Please see our PR docs for more information about how to attach this PR to an issue.
def list(self, request, *args, **kwargs):
    """
    Returns paginated CollectionVersions list.
    """
    queryset = self.filter_queryset(self.get_queryset())
    queryset = sorted(
        queryset, key=lambda obj: semantic_version.Version(obj.version), reverse=True
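The diff above sorts in Python with `semantic_version.Version` as the sort key, since lexicographic string ordering would put "1.10.0" before "1.2.0". A minimal sketch of the same idea using only the standard library (the real code relies on the `semantic_version` package, which also handles pre-release and build metadata; `version_key` here is a simplified stand-in that assumes plain "X.Y.Z" strings):

```python
def version_key(version):
    """Turn '1.10.2' into (1, 10, 2) so numeric tuple ordering applies."""
    return tuple(int(part) for part in version.split("."))

versions = ["1.2.0", "1.10.0", "0.9.1"]
ordered = sorted(versions, key=version_key, reverse=True)
print(ordered)  # → ['1.10.0', '1.2.0', '0.9.1']
```

Note that a naive `sorted(versions, reverse=True)` on the raw strings would return `['1.2.0', '1.10.0', '0.9.1']`, which is the bug the semantic key avoids.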
we need to sort by version
Why though? It's just a json dict that we are returning, does galaxy expect the dictionary to be ordered?
I don't know the ansible-galaxy CLI code exactly, but I think it expects the versions to be ordered:
https://galaxy.ansible.com/api/v2/collections/pulp/pulp_installer/versions/
But this endpoint is only for syncing, correct? They would still use the other endpoints that sort by version for installing collections.
Actually I checked the code and they just loop through versions
https://github.com/ansible/ansible/blob/devel/lib/ansible/galaxy/api.py#L767-L847
You are right! UnpaginatedCollectionVersionViewSet is only used on sync.
Dropping the ordering requirement probably would make it go faster.
and I agree it's only used on sync
pulp_ansible/app/galaxy/v3/views.py
Outdated
@staticmethod
def construct_full_collection_version(obj):
    """Constructs a new CollectionVersion object."""
    collection_attrs = {
        "pk": obj["collection_id"],
        "pulp_created": obj.pop("collection__pulp_created"),
        "pulp_last_updated": obj.pop("collection__pulp_last_updated"),
        "name": obj["name"],
        "namespace": obj["namespace"],
    }
    collection = Collection(**collection_attrs)
I don't know much about the Django ORM, but this method seems to be reimplementing something the ORM already provides.
@fao89 I think I agree. It's probably faster than the ORM, but it's unclear how much. My reading shows that it's constructing lighter weight versions of the objects, but I'm not seeing how it's producing meaningfully fewer database queries.
@gerrod3 overall producing some timing info for this PR I think would demonstrate if it's actually much faster or not.
Also pointing out the SQL queries before and after may be helpful.
If all of that is too much to do (which I could understand) maybe just go with the streaming option, which I thought would be a better fit for solving the timeout issue even if it does have a higher queryload.
Please point out anything that I'm missing in this; I've only just read it.
I typically use cProfile to get Python code timings and visualize them (lmk if you want any help).
Also getting the SQL EXPLAIN statements for the SQL produced is then more easily analyzed to show 'oh look these queries are better/faster'.
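For reference, a minimal sketch of the kind of cProfile measurement being suggested; `slow_view` here is a hypothetical stand-in for the endpoint code being profiled, not anything from pulp_ansible:

```python
import cProfile
import io
import pstats

def slow_view():
    # Hypothetical placeholder for the view logic under test.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
slow_view()
profiler.disable()

# Dump the top entries sorted by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

Running this before and after the optimization on the real endpoint would make the speedup claim concrete.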
select_related() might be what you guys are thinking of. Although I'm not sure I would rewrite this if it's already working. values() is probably a little faster by virtue of avoiding the object serialization overhead. There's a code clarity vs. performance tradeoff here, but since it seems like we're pretty focused on performance for these couple of endpoints, it wouldn't be the worst thing to just roll with it. I would just be sure that values() is doing that implicit select_related() though - it most likely is, but I don't know off the top of my head.
pulp_ansible/app/galaxy/v3/views.py
Outdated
queryset = sorted(
    queryset, key=lambda obj: semantic_version.Version(obj.version), reverse=True
)
queryset = map(self.construct_full_collection_version, self.get_queryset().iterator())
A list comprehension would probably be clearer in this case.
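A quick illustration of the suggestion, using a toy `construct` function standing in for `self.construct_full_collection_version` (the names and data below are invented for the example):

```python
def construct(obj):
    # Hypothetical stand-in for construct_full_collection_version.
    return {"version": obj}

rows = ["1.0.0", "2.0.0"]

# map() works, but returns a lazy iterator; the comprehension states the
# intent directly and produces a concrete list:
queryset = [construct(obj) for obj in rows]
print(queryset)  # → [{'version': '1.0.0'}, {'version': '2.0.0'}]
```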
@bmbouter Here is some comparison data. I synced pulp_installer, which has 33 collection versions. The old way does 4 + 5 * n database queries (4 + 5 * 33 = 169) while the new optimized way does 4 + n (4 + 33 = 37). The new way alone is not enough to fix the timeout issues with large collection version counts (2000+), but this optimization combined with the streaming endpoint #562 does fix the timeout issue for repositories up to 7000 collection versions (as far as I tested). These two changes only move the goalpost further and can't be a permanent solution. Do you think it's still worth trying them?
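The query counts quoted above can be checked directly from the two formulas:

```python
def old_query_count(n):
    """Old endpoint: 4 fixed queries plus 5 per collection version."""
    return 4 + 5 * n

def new_query_count(n):
    """Optimized endpoint: 4 fixed queries plus 1 per collection version."""
    return 4 + n

n = 33  # pulp_installer's collection version count from the comment
print(old_query_count(n), new_query_count(n))  # → 169 37

# The gap grows linearly: at the 7000-version scale mentioned above,
# the old scheme would issue 35004 queries versus 7004.
print(old_query_count(7000), new_query_count(7000))  # → 35004 7004
```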
Thank you for posting this and doing the analysis. I think I'm reading that the streaming option and this query optimization PR together resolved the issue as high as you've tested (7K collections). That's great. I suspect that of the two, the streaming resolution will resolve it on its own, because even if the queries take longer, we can just make the batches smaller until it also works up to 7K. If that's the case, that's what I'd recommend (keep the streaming resolution and not merge this query change). My thinking is that this mostly makes the ORM faster and it's not worth carrying that if we don't need to. @gerrod3 would you be willing to test the "largest collection count you tested" (iirc 7K?) against just the streaming PR? And if it doesn't work, maybe adjust the batch sizes to be smaller? Or is this not a great idea? wdyt?
I'm not sure adjusting the batch_sizes will make the streaming PR pass on its own, but maybe it will. I can adjust this PR to be less crazy but still have some optimizations. Like I said in our meeting, the 5 queries per collection version include 2 duplicates and 1 that could be removed with select_related (the current select_related is actually doing nothing on the query). Fewer queries will make the streaming endpoint faster.
Also we could accept as-is and move on to other work? Overall I don't think this code is going to be here that long because I believe we need to switch pulp_ansible to use publications and have this be pre-computed instead. If we are going to keep this code around then I think we do need to be careful, but if not, not really. @gerrod3 @daviddavis @fao89 wdyt should be done? Also, are there concerns about converting pulp_ansible to a publication-based plugin soon? Any feedback is welcome.
Force-pushed from 799218c to 5647440.
@bmbouter @fao89 @daviddavis I just simplified the optimization, so now the number of queries is 4 + 2n. I think this is much more manageable and still provides much better performance. I think we should merge this variation.
This looks good to me. Is the streaming response one already merged?
This PR looked good to me before; I didn't approve so far because I was hoping someone would jump in and say: "here, use this django_orm_magic_method instead"
But we need a changelog entry for this!
Other than that, consider this PR approved!
Thank you!
Agreed. Thank you!
@gerrod3 ready for merging?
[noissue]
This should reduce the number of database queries by 5x. So now the number of queries for this endpoint is n + 4, where n is the number of collection versions. I'm not sure if I can remove the final n queries to make this constant, since they are for the tags, which are a many-to-many relation.
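For what it's worth, the usual way to collapse per-row many-to-many lookups is to fetch all related rows in one bulk query and group them in Python, which is what Django's prefetch_related does under the hood (turning n tag queries into 1 extra query). A sketch of the grouping step, with invented data standing in for the real tag table:

```python
# Pretend result of one bulk query over the m2m table:
# (collection_version_pk, tag_name) pairs for every listed version.
tag_rows = [
    (1, "database"),
    (2, "cloud"),
    (2, "networking"),
]

# Group tags by collection version in Python, one pass, no further queries.
tags_by_version = {}
for pk, tag in tag_rows:
    tags_by_version.setdefault(pk, []).append(tag)

print(tags_by_version)  # → {1: ['database'], 2: ['cloud', 'networking']}
```

Whether this fits the endpoint as written depends on the queryset shape (values() rows rather than model instances), so treat it as a direction, not a drop-in fix.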