find_repo_content_units: skip db query if expected result is empty #4008

rbikar · 2020-11-12T07:50:16Z

For performance reasons, this change does server-side counting
of units that are expected to be in result before actual db query
happens. If expected count is zero, skip db query and continue to
the next chunk of unit ids.

The performance problem is noticable in RHSM-pulp for big repos
with 200k and more units associated. During associate action
results of the most of queries in affected code will be empty
so we can skip querying the db this way.

Without this fix, code will do eg. 200 unnecesary queries for repo
with 200k units which will take a lot time to finish that leads
to major performance regression.

For performance reasons, this change does server-side counting of units that are expected to be in result before actual db query happens. If expected count is zero, skip db query and continue to the next chunk of unit ids. The performance problem is noticable in RHSM-pulp for big repos with 200k and more units associated. During associate action results of the most of queries in affected code will be empty so we can skip querying the db this way. Without this fix, code will do eg. 200 unnecesary queries for repo with 200k units which will take a lot time to finish that leads to major performance regression.

pep8speaks · 2020-11-12T07:50:19Z

Hello @rbikar! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file server/pulp/server/controllers/repository.py:

Line 237:1: W293 blank line contains whitespace
Line 238:101: E501 line too long (101 > 100 characters)
Line 240:95: W291 trailing whitespace

rbikar · 2020-11-12T07:59:28Z

@rohanpm @midnightercz @ipanova
Can you please review?

rohanpm · 2020-11-12T22:43:42Z

server/pulp/server/controllers/repository.py

+            # for performace reasons, do server-side counting of units the will be returned in
+            # result before we do actual query to db
+            # if expected count of units in result is zero, continue to next chunk of unit_ids
+            if qs.count() == 0:


The logic here looks right, no concerns from "is it functionally correct" point of view.

Can I ask though whether the performance impact is still just a theory or whether it was verified by testing? It's really surprising to me if the mongo performance of doing a count with no results, vs doing a find with no results, has a notable difference. It seems like the DB should be performing almost exactly the same work in both cases.

I tried some testing myself against one of the actual DBs involved (QA env), using scripts: https://gist.github.com/rohanpm/1652c648a135a22af34547f7faf19e05

It doesn't simulate the queries here exactly, but it deals with same number of ids in "$in" and uses a query which will return no results. Timing it shows up pretty much as I expected, no notable difference between the two.

That test of course is not definitive, but if this hasn't been tested I'm a little skeptical about it.

My theory is that degradation happens due to instantiation of qs for every chunk - which I think creates new mongo cursor every time. If my theory is correct I think this won't help to resolve the problem

I tested the solution with partial prod db on my laptop with pub push-advisory for e2e repos.
When I debugged where exactly execution waits incredible amount of time it was exactly on for unit in qs:
With skipping on count==0, performance was almost back to normal.
It is true that qs is instantiated for every chunk but actual query is happening during the iteration over qs.
I'll try to come up with some reproducer script, it looks it needs more realistic context for replicating the problem.

Based on the testing done elsewhere (RHELDST-2771), although this is unintuitive and might point to some deficiency in mongo or one of the client libraries, it really is giving a major performance benefit. Apparently, doing a search which finds nothing can be a lot slower than doing a count which finds nothing.

This PR gets a 👍 from me as we have evidence that it will significantly improve the performance.

goosemania · 2020-12-04T07:09:08Z

@rbikar, thanks for the PR and clarifications around this change. Please squash the commits and we'll merge it.

ipanova

Based on the provided reproducer and write up, lgtm

In (pulp#4008) 033b2e6, there was accidentally added unicode char to the end the repository.py file, it caused failure during run of coverage tool in CI.

In (pulp#4008) 033b2e6, the function was updated to skip db query, if the expected results is empty. It seems that it was forgotten to update tests, so this change updates tests to reflect the change in pulp#4008. In this commit I've also fixed the repository.py file. There was accidentally added unicode char to the end the repository.py file, it caused failure during run of coverage tool in CI.

In (#4008) 033b2e6, the function was updated to skip db query, if the expected results is empty. It seems that it was forgotten to update tests, so this change updates tests to reflect the change in #4008. In this commit I've also fixed the repository.py file. There was accidentally added unicode char to the end the repository.py file, it caused failure during run of coverage tool in CI.

…ulp#4008) * find_repo_content_units: skip db query if expected result is empty For performance reasons, this change does server-side counting of units that are expected to be in result before actual db query happens. If expected count is zero, skip db query and continue to the next chunk of unit ids. The performance problem is noticable in RHSM-pulp for big repos with 200k and more units associated. During associate action results of the most of queries in affected code will be empty so we can skip querying the db this way. Without this fix, code will do eg. 200 unnecesary queries for repo with 200k units which will take a lot time to finish that leads to major performance regression. (cherry picked from commit 033b2e6)

…4008) * find_repo_content_units: skip db query if expected result is empty For performance reasons, this change does server-side counting of units that are expected to be in result before actual db query happens. If expected count is zero, skip db query and continue to the next chunk of unit ids. The performance problem is noticable in RHSM-pulp for big repos with 200k and more units associated. During associate action results of the most of queries in affected code will be empty so we can skip querying the db this way. Without this fix, code will do eg. 200 unnecesary queries for repo with 200k units which will take a lot time to finish that leads to major performance regression. (cherry picked from commit 033b2e6)

In (pulp#4008) 033b2e6, the function was updated to skip db query, if the expected results is empty. It seems that it was forgotten to update tests, so this change updates tests to reflect the change in pulp#4008. In this commit I've also fixed the repository.py file. There was accidentally added unicode char to the end the repository.py file, it caused failure during run of coverage tool in CI. (cherry picked from commit 0e0b905)

In (#4008) 033b2e6, the function was updated to skip db query, if the expected results is empty. It seems that it was forgotten to update tests, so this change updates tests to reflect the change in #4008. In this commit I've also fixed the repository.py file. There was accidentally added unicode char to the end the repository.py file, it caused failure during run of coverage tool in CI. (cherry picked from commit 0e0b905)

Notable change for rhsm-pulp: - find_repo_content_units: skip db query if expected result is empty (pulp#4008) [RHELDST-4090] this resolves performance regression that happens while querying collection with more that 1M documents. * upstream/2-master: Update tests for find_repo_content_units() func Adds task failure handling for celery in bad state find_repo_content_units: skip db query if expected result is empty (pulp#4008) Revert mkdir() changes for Bundle and Lock. Bump version to 2.21.4 in travis config for docs Added release notes for 2.21.4 release. Ignore force_full in config override when deciding publish to be operational Taught pulp-workers to have a umask of 002 instead of 022. Taught util.misc.mkdir() to set umask to allow group-w for makedirs calls. Change-Id: I47908c98359918c0e5cf833026049ba338e87208

Fix PEP8 issues

a0d23cd

rohanpm reviewed Nov 12, 2020

View reviewed changes

rohanpm approved these changes Dec 3, 2020

View reviewed changes

ipanova approved these changes Dec 4, 2020

View reviewed changes

ipanova merged commit 033b2e6 into pulp:2-master Dec 4, 2020

rbikar added a commit to rbikar/pulp that referenced this pull request Dec 11, 2020

Remove offending unicode char

1de0c57

In (pulp#4008) 033b2e6, there was accidentally added unicode char to the end the repository.py file, it caused failure during run of coverage tool in CI.

rbikar mentioned this pull request Dec 11, 2020

Update tests for find_repo_content_units() func #4010

Merged

pcreech mentioned this pull request Jan 18, 2021

Update tests for find_repo_content_units() func #4015

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

find_repo_content_units: skip db query if expected result is empty #4008

find_repo_content_units: skip db query if expected result is empty #4008

rbikar commented Nov 12, 2020

pep8speaks commented Nov 12, 2020

rbikar commented Nov 12, 2020

rohanpm Nov 12, 2020

midnightercz Nov 13, 2020

rbikar Nov 18, 2020 •

edited

rohanpm Dec 3, 2020

goosemania commented Dec 4, 2020

ipanova left a comment •

edited

find_repo_content_units: skip db query if expected result is empty #4008

find_repo_content_units: skip db query if expected result is empty #4008

Conversation

rbikar commented Nov 12, 2020

pep8speaks commented Nov 12, 2020

rbikar commented Nov 12, 2020

rohanpm Nov 12, 2020

Choose a reason for hiding this comment

midnightercz Nov 13, 2020

Choose a reason for hiding this comment

rbikar Nov 18, 2020 • edited

Choose a reason for hiding this comment

rohanpm Dec 3, 2020

Choose a reason for hiding this comment

goosemania commented Dec 4, 2020

ipanova left a comment • edited

Choose a reason for hiding this comment

rbikar Nov 18, 2020 •

edited

ipanova left a comment •

edited