find_repo_content_units: skip db query if expected result is empty #4008
Conversation
For performance reasons, this change does server-side counting of units that are expected to be in result before actual db query happens. If expected count is zero, skip db query and continue to the next chunk of unit ids. The performance problem is noticable in RHSM-pulp for big repos with 200k and more units associated. During associate action results of the most of queries in affected code will be empty so we can skip querying the db this way. Without this fix, code will do eg. 200 unnecesary queries for repo with 200k units which will take a lot time to finish that leads to major performance regression.
Hello @rbikar! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:
|
@rohanpm @midnightercz @ipanova |
# for performace reasons, do server-side counting of units the will be returned in | ||
# result before we do actual query to db | ||
# if expected count of units in result is zero, continue to next chunk of unit_ids | ||
if qs.count() == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic here looks right, no concerns from "is it functionally correct" point of view.
Can I ask though whether the performance impact is still just a theory or whether it was verified by testing? It's really surprising to me if the mongo performance of doing a count with no results, vs doing a find with no results, has a notable difference. It seems like the DB should be performing almost exactly the same work in both cases.
I tried some testing myself against one of the actual DBs involved (QA env), using scripts: https://gist.github.com/rohanpm/1652c648a135a22af34547f7faf19e05
It doesn't simulate the queries here exactly, but it deals with same number of ids in "$in" and uses a query which will return no results. Timing it shows up pretty much as I expected, no notable difference between the two.
That test of course is not definitive, but if this hasn't been tested I'm a little skeptical about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My theory is that degradation happens due to instantiation of qs for every chunk - which I think creates new mongo cursor every time. If my theory is correct I think this won't help to resolve the problem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested the solution with partial prod db on my laptop with pub push-advisory for e2e repos.
When I debugged where exactly execution waits incredible amount of time it was exactly on for unit in qs:
With skipping on count==0, performance was almost back to normal.
It is true that qs is instantiated for every chunk but actual query is happening during the iteration over qs.
I'll try to come up with some reproducer script, it looks it needs more realistic context for replicating the problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the testing done elsewhere (RHELDST-2771), although this is unintuitive and might point to some deficiency in mongo or one of the client libraries, it really is giving a major performance benefit. Apparently, doing a search which finds nothing can be a lot slower than doing a count which finds nothing.
This PR gets a 👍 from me as we have evidence that it will significantly improve the performance.
@rbikar, thanks for the PR and clarifications around this change. Please squash the commits and we'll merge it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the provided reproducer and write up, lgtm
In (pulp#4008) 033b2e6, the function was updated to skip db query, if the expected results is empty. It seems that it was forgotten to update tests, so this change updates tests to reflect the change in pulp#4008. In this commit I've also fixed the repository.py file. There was accidentally added unicode char to the end the repository.py file, it caused failure during run of coverage tool in CI.
In (pulp#4008) 033b2e6, the function was updated to skip db query, if the expected results is empty. It seems that it was forgotten to update tests, so this change updates tests to reflect the change in pulp#4008. In this commit I've also fixed the repository.py file. There was accidentally added unicode char to the end the repository.py file, it caused failure during run of coverage tool in CI.
In (pulp#4008) 033b2e6, the function was updated to skip db query, if the expected results is empty. It seems that it was forgotten to update tests, so this change updates tests to reflect the change in pulp#4008. In this commit I've also fixed the repository.py file. There was accidentally added unicode char to the end the repository.py file, it caused failure during run of coverage tool in CI.
In (#4008) 033b2e6, the function was updated to skip db query, if the expected results is empty. It seems that it was forgotten to update tests, so this change updates tests to reflect the change in #4008. In this commit I've also fixed the repository.py file. There was accidentally added unicode char to the end the repository.py file, it caused failure during run of coverage tool in CI.
…ulp#4008) * find_repo_content_units: skip db query if expected result is empty For performance reasons, this change does server-side counting of units that are expected to be in result before actual db query happens. If expected count is zero, skip db query and continue to the next chunk of unit ids. The performance problem is noticable in RHSM-pulp for big repos with 200k and more units associated. During associate action results of the most of queries in affected code will be empty so we can skip querying the db this way. Without this fix, code will do eg. 200 unnecesary queries for repo with 200k units which will take a lot time to finish that leads to major performance regression. (cherry picked from commit 033b2e6)
…4008) * find_repo_content_units: skip db query if expected result is empty For performance reasons, this change does server-side counting of units that are expected to be in result before actual db query happens. If expected count is zero, skip db query and continue to the next chunk of unit ids. The performance problem is noticable in RHSM-pulp for big repos with 200k and more units associated. During associate action results of the most of queries in affected code will be empty so we can skip querying the db this way. Without this fix, code will do eg. 200 unnecesary queries for repo with 200k units which will take a lot time to finish that leads to major performance regression. (cherry picked from commit 033b2e6)
In (pulp#4008) 033b2e6, the function was updated to skip db query, if the expected results is empty. It seems that it was forgotten to update tests, so this change updates tests to reflect the change in pulp#4008. In this commit I've also fixed the repository.py file. There was accidentally added unicode char to the end the repository.py file, it caused failure during run of coverage tool in CI. (cherry picked from commit 0e0b905)
In (#4008) 033b2e6, the function was updated to skip db query, if the expected results is empty. It seems that it was forgotten to update tests, so this change updates tests to reflect the change in #4008. In this commit I've also fixed the repository.py file. There was accidentally added unicode char to the end the repository.py file, it caused failure during run of coverage tool in CI. (cherry picked from commit 0e0b905)
Notable change for rhsm-pulp: - find_repo_content_units: skip db query if expected result is empty (pulp#4008) [RHELDST-4090] this resolves performance regression that happens while querying collection with more that 1M documents. * upstream/2-master: Update tests for find_repo_content_units() func Adds task failure handling for celery in bad state find_repo_content_units: skip db query if expected result is empty (pulp#4008) Revert mkdir() changes for Bundle and Lock. Bump version to 2.21.4 in travis config for docs Added release notes for 2.21.4 release. Ignore force_full in config override when deciding publish to be operational Taught pulp-workers to have a umask of 002 instead of 022. Taught util.misc.mkdir() to set umask to allow group-w for makedirs calls. Change-Id: I47908c98359918c0e5cf833026049ba338e87208
For performance reasons, this change does server-side counting
of units that are expected to be in result before actual db query
happens. If expected count is zero, skip db query and continue to
the next chunk of unit ids.
The performance problem is noticable in RHSM-pulp for big repos
with 200k and more units associated. During associate action
results of the most of queries in affected code will be empty
so we can skip querying the db this way.
Without this fix, code will do eg. 200 unnecesary queries for repo
with 200k units which will take a lot time to finish that leads
to major performance regression.