Optimize unpaginated collection_versions endpoint query #573

Merged
merged 1 commit into pulp:master on May 13, 2021

Conversation

gerrod3
Contributor

@gerrod3 gerrod3 commented May 4, 2021

[noissue]

This should reduce the number of database queries 5x. The number of queries for this endpoint is now n + 4, where n is the number of collection versions. I'm not sure if I can remove the final n queries to make this constant, since they are for the tags, which are a many-to-many relation.
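
(For context, a rough sketch, not the PR's code, of how Django normally collapses per-object many-to-many lookups into a constant number of queries; whether it fits this endpoint is exactly what's uncertain above. The model and field names below are assumed from context.)

# Minimal sketch, assuming a CollectionVersion model whose "tags"
# many-to-many relation points at Tag objects with a "name" field.
# prefetch_related fetches all tags in one extra query instead of one
# query per collection version.
versions = CollectionVersion.objects.prefetch_related("tags")
for version in versions:
    tag_names = [tag.name for tag in version.tags.all()]  # served from the prefetch cache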

@gerrod3 gerrod3 requested a review from a team as a code owner May 4, 2021 16:49
@pulpbot
Member

pulpbot commented May 4, 2021

WARNING!!! This PR is not attached to an issue. In most cases this is not advisable. Please see our PR docs for more information about how to attach this PR to an issue.


def list(self, request, *args, **kwargs):
    """
    Returns paginated CollectionVersions list.
    """
    queryset = self.filter_queryset(self.get_queryset())
    queryset = sorted(
        queryset, key=lambda obj: semantic_version.Version(obj.version), reverse=True
Member

we need to sort by version

Contributor Author

Why though? It's just a JSON dict that we are returning; does galaxy expect the dictionary to be ordered?

Member

I don't know the ansible-galaxy CLI code exactly, but I think it expects the versions to be ordered:
https://galaxy.ansible.com/api/v2/collections/pulp/pulp_installer/versions/

Contributor Author

But this endpoint is only for syncing, correct? They would still use the other endpoints that sort by version for installing collections.

Member

Actually, I checked the code, and they just loop through the versions:
https://github.com/ansible/ansible/blob/devel/lib/ansible/galaxy/api.py#L767-L847

Member

You are right! UnpaginatedCollectionVersionViewSet is only used on sync.

Member

Dropping the ordering requirement would probably make it go faster.

Member

And I agree it's only used on sync.

Comment on lines 464 to 474
@staticmethod
def construct_full_collection_version(obj):
    """Constructs a new CollectionVersion object."""
    collection_attrs = {
        "pk": obj["collection_id"],
        "pulp_created": obj.pop("collection__pulp_created"),
        "pulp_last_updated": obj.pop("collection__pulp_last_updated"),
        "name": obj["name"],
        "namespace": obj["namespace"],
    }
    collection = Collection(**collection_attrs)
Member

I don't know much about the Django ORM, but this method seems to be rebuilding something Django already provides.

Member

@fao89 I think I agree. It's probably faster than the ORM, but it's unclear how much. My reading shows that it's constructing lighter-weight versions of the objects, but I'm not seeing how it's producing meaningfully fewer database queries.

@gerrod3 Overall, producing some timing info for this PR would demonstrate whether it's actually much faster or not.

Also, pointing out the SQL queries before and after may be helpful.

If all of that is too much to do (which I could understand), maybe just go with the streaming option, which I thought would be a better fit for solving the timeout issue even if it does have a higher query load.

Please point out anything I'm missing here; I've only just read it.

Member

I typically use cProfile to get Python code timings and visualize them (let me know if you want any help).

Also, getting EXPLAIN output for the SQL produced makes it easier to show "oh look, these queries are better/faster".
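
(As a rough illustration, not from this PR: one way to capture and inspect such timings with the standard library. The profiled call below is a placeholder.)

import cProfile
import pstats

# Profile a hypothetical call that exercises the endpoint and write the
# stats to a file; run_unpaginated_list() is a placeholder, not real code.
cProfile.run("run_unpaginated_list()", "profile.out")

# Show the 20 most expensive calls by cumulative time.
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(20)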

Contributor

@dralley dralley May 11, 2021

select_related() might be what you guys are thinking of. Although I'm not sure I would rewrite this if it's already working. values() is probably a little faster by virtue of avoiding the object serialization overhead. There's a code clarity vs. performance tradeoff here, but since we seem pretty focused on performance for these couple of endpoints, it wouldn't be the worst thing to just roll with it.

I would just be sure that values() is doing that implicit select_related() though - it most likely is, but I don't know off the top of my head.
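
(For reference, a minimal sketch, not the PR's code, contrasting the two approaches; the field names are taken from the snippet above.)

# Option 1: select_related() follows the foreign key in a single JOINed
# query and returns full model instances.
versions = CollectionVersion.objects.select_related("collection")

# Option 2: values() with double-underscore lookups also emits a JOIN,
# but returns plain dicts and skips model instantiation.
versions = CollectionVersion.objects.values(
    "version",
    "name",
    "namespace",
    "collection_id",
    "collection__pulp_created",
    "collection__pulp_last_updated",
)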

queryset = sorted(
    queryset, key=lambda obj: semantic_version.Version(obj.version), reverse=True
)
queryset = map(self.construct_full_collection_version, self.get_queryset().iterator())
Contributor

A list comprehension would probably be clearer in this case.
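
(A sketch of that suggestion, using the same helper as in the diff above:)

queryset = [
    self.construct_full_collection_version(obj)
    for obj in self.get_queryset().iterator()
]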

@gerrod3
Contributor Author

gerrod3 commented May 12, 2021

@bmbouter Here is some comparison data. I synced pulp_installer which has 33 collection versions.

Without optimization:
Screenshot from 2021-05-12 17-35-36

With optimization:
Screenshot from 2021-05-12 17-36-27

The old way does 4 + 5 * n database queries (4 + 5 * 33 = 169), while the new optimized way does 4 + n (4 + 33 = 37). The new way alone is not enough to fix the timeout issues with large collection version counts (2000+), but this optimization combined with the streaming endpoint #562 does fix the timeout issue for repositories with up to 7000 collection versions (as far as I tested).

These two changes only push the limit further out and can't be a permanent solution. Do you think they're still worth trying?

@bmbouter
Member

Thank you for posting this and doing the analysis. I think I'm reading that the streaming option and this query optimization PR together resolved the issue up to the highest count you've tested (7K collections). That's great. I suspect that of the two, the streaming resolution will resolve it on its own, because even if the queries take longer, we can just make the batches smaller until it also works up to 7K.

If that's the case, that's what I'd recommend (keep the streaming resolution and don't merge this query change). My thinking is that this mostly makes the ORM faster, and it's not worth carrying that code if we don't need to.

@gerrod3 would you be willing to test the "largest collection count you tested" (iirc 7K?) against just the streaming PR? And if it doesn't work maybe adjust the batch sizes to be smaller? Or is this not a great idea? wdyt?

@gerrod3
Contributor Author

gerrod3 commented May 13, 2021

> @gerrod3 would you be willing to test the "largest collection count you tested" (iirc 7K?) against just the streaming PR? And if it doesn't work maybe adjust the batch sizes to be smaller? Or is this not a great idea? wdyt?

I'm not sure adjusting the batch sizes will make the streaming PR pass on its own, but maybe it will. I can adjust this PR to be less crazy but still keep some optimizations. Like I said in our meeting, the 5 queries per collection version include 2 duplicates and 1 that could be removed with select_related (the current select_related is actually doing nothing on the query). Fewer queries will make the streaming endpoint faster.

@bmbouter
Member

Also, we could accept this as-is and move on to other work? Overall I don't think this code is going to be here that long, because I believe we need to switch pulp_ansible to use publications and have this be pre-computed instead. If we are going to keep this code around then I think we do need to be careful, but if not, not really.

@gerrod3 @daviddavis @fao89 wdyt should be done? Also are there concerns about converting pulp_ansible to a publication-based plugin soon? Any feedback is welcome.

@gerrod3 gerrod3 force-pushed the better_query branch 2 times, most recently from 799218c to 5647440 on May 13, 2021 at 15:57
@gerrod3
Contributor Author

gerrod3 commented May 13, 2021

@bmbouter @fao89 @daviddavis I just simplified the optimization, so now the number of queries is 4 + 2n. I think this is much more manageable and still provides much better performance. I think we should merge this variation.

@bmbouter
Member

> @bmbouter @fao89 @daviddavis I just simplified the optimization, so now the number of queries is 4 + 2n. I think this is much more manageable and still provides much better performance. I think we should merge this variation.

This looks good to me. Is the streaming response one already merged?

@fao89
Member

fao89 commented May 13, 2021

This PR looked good to me before; I hadn't approved it yet because I was hoping someone would jump in and say: "here, use this django_orm_magic_method instead".

Member

@fao89 fao89 left a comment

But we need a changelog entry for this!
Other than that, consider this PR approved!

@gerrod3
Contributor Author

gerrod3 commented May 13, 2021

> This looks good to me. Is the streaming response one already merged?

The streaming response #562 is dependent on this PR.

@fao89 Added changelog entry.

Member

@fao89 fao89 left a comment

Thank you!

@bmbouter
Member

Agreed. Thank you!

@bmbouter
Member

@gerrod3 ready for merging?

@gerrod3 gerrod3 merged commit 6df9c9e into pulp:master May 13, 2021
@gerrod3 gerrod3 deleted the better_query branch May 13, 2021 18:03