Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Works around a sync-time performance regression on PG12 #4609

Merged
merged 1 commit into from
Oct 25, 2023

Conversation

dralley
Copy link
Contributor

@dralley dralley commented Oct 25, 2023

closes #4591

We still don't really "know":

Copy link
Member

@lubosmj lubosmj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this, we got down to ~25m of sync time on the Satellite 6.14 environment (instead of 2 hours), syncing 36 repositories to a capsule with the on-demand policy. We got the same result as with Pulpcore 3.22.12, where the commit in question (#4275) was not included.

It would be interesting to see if this change improved the performance on PG13 even more.

@@ -850,7 +850,9 @@ def add_content(self, content):
raise ResourceImmutableError(self)

repo_content = []
to_add = set(content.exclude(pk__in=self.content).values_list("pk", flat=True))
to_add = set(content.values_list("pk", flat=True)) - set(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the net here, is that we are moving from "let postgres do the set-theory via exclude()", to "do the set-theory explicitly, in python, via set(A) - set(B)". I have nothing to add that clears up any of the mysteries noted; just clarifying what we're doing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we using set just to get the difference between those two sets?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello folks. I'm just exploring some issues to get familiarized with the codebase and had a small idea on this.

We can achieve the same set difference with some performance gains (about 2x with a humble sample) by not creating a second set. If this is a meaningful point to optimize, it may be worth trying this out.

def set_diff(a: Set, b: tuple):
    for obj in b:
        if obj in a:
            a.remove(obj)
    return a

from typing import Set

def set_diff_normal(a: Set, b: tuple):
    return a - set(b)


a = {3775, 4642, 1725, 576, 1467, 1152, 1718,
     6721, 7899, 4756, 2213, 7843, 768, 6464, 7427}
b = (3775, 7899, 434, 15048, 7843)

if __name__ == "__main__":
    assert set_diff(a, b) == set_diff_normal(a, b)

    from timeit import timeit
    result_a = timeit(
        "set_diff(a,b)",
        setup="from __main__ import set_diff,a,b")
    result_b = timeit(
        "set_diff_normal(a,b)",
        setup="from __main__ import set_diff_normal,a,b")
    print(result_a) # output: ~ 0.10
    print(result_b) # output: ~ 0.24

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The performance issue has nothing to do with the set creation (which would take milliseconds at most), but rather the way the query is constructed and how Postgresql plans its execution.

Previously we were generating one large query, which different versions of Postgresql seemed to handle differently, and Postgresql 12 was having severe performance issues. By gathering the results of some smaller queries into a set in Python, we keep the queries smaller and simpler, and avoid causing Postgresql to go a little haywire.

This is really just a workaround since it was time-sensitive. We're trying to better understand why this is going on currently.

I appreciate the enthusiasm @pedro-psb :) But you'll have plenty of time to learn this stuff when you're being paid to do so, there's no need to do it in your "free" time!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dralley Yeah, I thought maybe it could be irrelevant in the problem context, thanks for the clarification.
And I dig your tip, I'm really a little over-enthusiastic!

Copy link
Contributor

@ggainey ggainey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted, I can't explain any of the noted mysteries - and this makes me a little nervous. But if this fixes the observed performance problem, I'm good to move ahead with it.

@dkliban dkliban merged commit 4c4d6aa into pulp:main Oct 25, 2023
12 of 13 checks passed
@patchback
Copy link

patchback bot commented Oct 25, 2023

Backport to 3.16: 💚 backport PR created

✅ Backport PR branch: patchback/backports/3.16/4c4d6aa0745b4938e2aeee8ffc28a0b0392cc9ef/pr-4609

Backported as #4610

🤖 @patchback
I'm built with octomachinery and
my source is open — https://github.com/sanitizers/patchback-github-app.

@patchback
Copy link

patchback bot commented Oct 25, 2023

Backport to 3.18: 💚 backport PR created

✅ Backport PR branch: patchback/backports/3.18/4c4d6aa0745b4938e2aeee8ffc28a0b0392cc9ef/pr-4609

Backported as #4611

🤖 @patchback
I'm built with octomachinery and
my source is open — https://github.com/sanitizers/patchback-github-app.

@patchback
Copy link

patchback bot commented Oct 25, 2023

Backport to 3.21: 💚 backport PR created

✅ Backport PR branch: patchback/backports/3.21/4c4d6aa0745b4938e2aeee8ffc28a0b0392cc9ef/pr-4609

Backported as #4612

🤖 @patchback
I'm built with octomachinery and
my source is open — https://github.com/sanitizers/patchback-github-app.

@patchback
Copy link

patchback bot commented Oct 25, 2023

Backport to 3.22: 💚 backport PR created

✅ Backport PR branch: patchback/backports/3.22/4c4d6aa0745b4938e2aeee8ffc28a0b0392cc9ef/pr-4609

Backported as #4613

🤖 @patchback
I'm built with octomachinery and
my source is open — https://github.com/sanitizers/patchback-github-app.

@patchback
Copy link

patchback bot commented Oct 25, 2023

Backport to 3.23: 💚 backport PR created

✅ Backport PR branch: patchback/backports/3.23/4c4d6aa0745b4938e2aeee8ffc28a0b0392cc9ef/pr-4609

Backported as #4614

🤖 @patchback
I'm built with octomachinery and
my source is open — https://github.com/sanitizers/patchback-github-app.

@patchback
Copy link

patchback bot commented Oct 25, 2023

Backport to 3.28: 💚 backport PR created

✅ Backport PR branch: patchback/backports/3.28/4c4d6aa0745b4938e2aeee8ffc28a0b0392cc9ef/pr-4609

Backported as #4615

🤖 @patchback
I'm built with octomachinery and
my source is open — https://github.com/sanitizers/patchback-github-app.

@patchback
Copy link

patchback bot commented Oct 27, 2023

Backport to 3.38: 💚 backport PR created

✅ Backport PR branch: patchback/backports/3.38/4c4d6aa0745b4938e2aeee8ffc28a0b0392cc9ef/pr-4609

Backported as #4620

🤖 @patchback
I'm built with octomachinery and
my source is open — https://github.com/sanitizers/patchback-github-app.

@patchback
Copy link

patchback bot commented Oct 30, 2023

Backport to 3.39: 💚 backport PR created

✅ Backport PR branch: patchback/backports/3.39/4c4d6aa0745b4938e2aeee8ffc28a0b0392cc9ef/pr-4609

Backported as #4623

🤖 @patchback
I'm built with octomachinery and
my source is open — https://github.com/sanitizers/patchback-github-app.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regression on performance improvements in rv.get_content()
6 participants