Works around a sync-time performance regression on PG12 #4609
Conversation
With this, we got down to ~25m of sync time on the Satellite 6.14 environment (instead of 2 hours), syncing 36 repositories to a capsule with the on-demand policy. We got the same result as with Pulpcore 3.22.12, where the commit in question (#4275) was not included.
It would be interesting to see if this change improved the performance on PG13 even more.
@@ -850,7 +850,9 @@ def add_content(self, content):
            raise ResourceImmutableError(self)

        repo_content = []
-       to_add = set(content.exclude(pk__in=self.content).values_list("pk", flat=True))
+       to_add = set(content.values_list("pk", flat=True)) - set(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the net here is that we are moving from "let Postgres do the set theory via exclude()" to "do the set theory explicitly, in Python, via set(A) - set(B)". I have nothing to add that clears up any of the mysteries noted; just clarifying what we're doing.
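A runnable illustration of that equivalence, with plain lists standing in for the querysets (the variable names here are invented for the sketch, not taken from the code):

```python
# `content_pks` plays the role of content.values_list("pk", flat=True),
# and `existing_pks` the pks already in self.content.
content_pks = [10, 11, 12, 13]
existing_pks = [11, 13]

# Old approach: let the database filter, as exclude(pk__in=...) would.
excluded = [pk for pk in content_pks if pk not in existing_pks]

# New approach: fetch both pk lists and take the difference in Python.
to_add = set(content_pks) - set(existing_pks)

assert to_add == set(excluded) == {10, 12}
```

Both compute the same result; the difference is only where the filtering work happens (the database versus the Python process).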
Are we using set() just to get the difference between those two sets?
yes
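For reference, a quick demonstration that the `-` operator on sets is exactly set difference (the values here are arbitrary):

```python
a = {3775, 7899, 1725}
b = {7899, 434}

# `-` on sets and set.difference() compute the same thing.
assert a - b == a.difference(b) == {3775, 1725}
```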
Hello folks. I'm just exploring some issues to get familiar with the codebase and had a small idea on this.
We can achieve the same set difference with some performance gains (about 2x on a small sample) by not creating a second set. If this is a meaningful point to optimize, it may be worth trying out.
from timeit import timeit
from typing import Set


def set_diff(a: Set, b: tuple):
    # Removes matches from `a` in place instead of building a second set.
    for obj in b:
        if obj in a:
            a.remove(obj)
    return a


def set_diff_normal(a: Set, b: tuple):
    return a - set(b)


a = {3775, 4642, 1725, 576, 1467, 1152, 1718,
     6721, 7899, 4756, 2213, 7843, 768, 6464, 7427}
b = (3775, 7899, 434, 15048, 7843)

if __name__ == "__main__":
    # Check against a copy, since set_diff mutates its argument.
    assert set_diff(set(a), b) == set_diff_normal(a, b)
    # Caveat: set_diff empties the overlap from `a` on its first timed run,
    # so later runs only pay for membership checks; treat these as rough numbers.
    result_a = timeit(
        "set_diff(a, b)",
        setup="from __main__ import set_diff, a, b")
    result_b = timeit(
        "set_diff_normal(a, b)",
        setup="from __main__ import set_diff_normal, a, b")
    print(result_a)  # output: ~0.10
    print(result_b)  # output: ~0.24
The performance issue has nothing to do with the set creation (which would take milliseconds at most), but rather with the way the query is constructed and how PostgreSQL plans its execution.
Previously we were generating one large query, which different versions of PostgreSQL seemed to handle differently, and PostgreSQL 12 was having severe performance issues. By gathering the results of some smaller queries into a set in Python, we keep the queries smaller and simpler, and avoid causing PostgreSQL to go a little haywire.
This is really just a workaround, since the fix was time-sensitive. We're still trying to understand why this happens.
I appreciate the enthusiasm @pedro-psb :) But you'll have plenty of time to learn this stuff when you're being paid to do so, there's no need to do it in your "free" time!
@dralley Yeah, I thought maybe it could be irrelevant in the problem context, thanks for the clarification.
And I dig your tip, I'm really a little over-enthusiastic!
As noted, I can't explain any of the mysteries here, and that makes me a little nervous. But if this fixes the observed performance problem, I'm good to move ahead with it.
Backport to 3.16: backport PR created, backported as #4610 (@patchback)
Backport to 3.18: backport PR created, backported as #4611 (@patchback)
Backport to 3.21: backport PR created, backported as #4612 (@patchback)
Backport to 3.22: backport PR created, backported as #4613 (@patchback)
Backport to 3.23: backport PR created, backported as #4614 (@patchback)
Backport to 3.28: backport PR created, backported as #4615 (@patchback)
Backport to 3.38: backport PR created, backported as #4620 (@patchback)
Backport to 3.39: backport PR created, backported as #4623 (@patchback)
closes #4591
We still don't really "know":