Skip to content
This repository has been archived by the owner on Dec 7, 2022. It is now read-only.

find_repo_content_units: skip db query if expected result is empty #4008

Merged
merged 2 commits into from Dec 4, 2020

Conversation

rbikar
Copy link
Contributor

@rbikar rbikar commented Nov 12, 2020

For performance reasons, this change does server-side counting
of units that are expected to be in result before actual db query
happens. If expected count is zero, skip db query and continue to
the next chunk of unit ids.

The performance problem is noticable in RHSM-pulp for big repos
with 200k and more units associated. During associate action
results of the most of queries in affected code will be empty
so we can skip querying the db this way.

Without this fix, code will do eg. 200 unnecesary queries for repo
with 200k units which will take a lot time to finish that leads
to major performance regression.

For performance reasons, this change does server-side counting
of units that are expected to be in result before actual db query
happens. If expected count is zero, skip db query and continue to
the next chunk of unit ids.

The performance problem is noticable in RHSM-pulp for big repos
with 200k and more units associated. During associate action
results of the most of queries in affected code will be empty
so we can skip querying the db this way.

Without this fix, code will do eg. 200 unnecesary queries for repo
with 200k units which will take a lot time to finish that leads
to major performance regression.
@pep8speaks
Copy link

Hello @rbikar! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 237:1: W293 blank line contains whitespace
Line 238:101: E501 line too long (101 > 100 characters)
Line 240:95: W291 trailing whitespace

@rbikar
Copy link
Contributor Author

rbikar commented Nov 12, 2020

@rohanpm @midnightercz @ipanova
Can you please review?

# for performace reasons, do server-side counting of units the will be returned in
# result before we do actual query to db
# if expected count of units in result is zero, continue to next chunk of unit_ids
if qs.count() == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic here looks right, no concerns from "is it functionally correct" point of view.

Can I ask though whether the performance impact is still just a theory or whether it was verified by testing? It's really surprising to me if the mongo performance of doing a count with no results, vs doing a find with no results, has a notable difference. It seems like the DB should be performing almost exactly the same work in both cases.

I tried some testing myself against one of the actual DBs involved (QA env), using scripts: https://gist.github.com/rohanpm/1652c648a135a22af34547f7faf19e05

It doesn't simulate the queries here exactly, but it deals with same number of ids in "$in" and uses a query which will return no results. Timing it shows up pretty much as I expected, no notable difference between the two.

That test of course is not definitive, but if this hasn't been tested I'm a little skeptical about it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My theory is that degradation happens due to instantiation of qs for every chunk - which I think creates new mongo cursor every time. If my theory is correct I think this won't help to resolve the problem

Copy link
Contributor Author

@rbikar rbikar Nov 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested the solution with partial prod db on my laptop with pub push-advisory for e2e repos.
When I debugged where exactly execution waits incredible amount of time it was exactly on for unit in qs:
With skipping on count==0, performance was almost back to normal.
It is true that qs is instantiated for every chunk but actual query is happening during the iteration over qs.
I'll try to come up with some reproducer script, it looks it needs more realistic context for replicating the problem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the testing done elsewhere (RHELDST-2771), although this is unintuitive and might point to some deficiency in mongo or one of the client libraries, it really is giving a major performance benefit. Apparently, doing a search which finds nothing can be a lot slower than doing a count which finds nothing.

This PR gets a 👍 from me as we have evidence that it will significantly improve the performance.

@goosemania
Copy link
Member

@rbikar, thanks for the PR and clarifications around this change. Please squash the commits and we'll merge it.

Copy link
Member

@ipanova ipanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the provided reproducer and write up, lgtm

@ipanova ipanova merged commit 033b2e6 into pulp:2-master Dec 4, 2020
rbikar added a commit to rbikar/pulp that referenced this pull request Dec 11, 2020
In (pulp#4008) 033b2e6, there was
accidentally added unicode char to the end the repository.py file,
it caused failure during run of coverage tool in CI.
rbikar added a commit to rbikar/pulp that referenced this pull request Dec 11, 2020
In (pulp#4008) 033b2e6, the function
was updated to skip db query, if the expected results is empty.

It seems that it was forgotten to update tests, so this change
updates tests to reflect the change in pulp#4008.

In this commit I've also fixed the repository.py file.
There was accidentally added unicode char to the end the repository.py file,
it caused failure during run of coverage tool in CI.
rbikar added a commit to rbikar/pulp that referenced this pull request Dec 11, 2020
In (pulp#4008) 033b2e6, the function
was updated to skip db query, if the expected results is empty.

It seems that it was forgotten to update tests, so this change
updates tests to reflect the change in pulp#4008.

In this commit I've also fixed the repository.py file.
There was accidentally added unicode char to the end the repository.py file,
it caused failure during run of coverage tool in CI.
rbikar added a commit to rbikar/pulp that referenced this pull request Dec 11, 2020
In (pulp#4008) 033b2e6, the function
was updated to skip db query, if the expected results is empty.

It seems that it was forgotten to update tests, so this change
updates tests to reflect the change in pulp#4008.

In this commit I've also fixed the repository.py file.
There was accidentally added unicode char to the end the repository.py file,
it caused failure during run of coverage tool in CI.
ipanova pushed a commit that referenced this pull request Jan 14, 2021
In (#4008) 033b2e6, the function
was updated to skip db query, if the expected results is empty.

It seems that it was forgotten to update tests, so this change
updates tests to reflect the change in #4008.

In this commit I've also fixed the repository.py file.
There was accidentally added unicode char to the end the repository.py file,
it caused failure during run of coverage tool in CI.
pcreech pushed a commit to pcreech/pulp that referenced this pull request Jan 18, 2021
…ulp#4008)

* find_repo_content_units: skip db query if expected result is empty

For performance reasons, this change does server-side counting
of units that are expected to be in result before actual db query
happens. If expected count is zero, skip db query and continue to
the next chunk of unit ids.

The performance problem is noticable in RHSM-pulp for big repos
with 200k and more units associated. During associate action
results of the most of queries in affected code will be empty
so we can skip querying the db this way.

Without this fix, code will do eg. 200 unnecesary queries for repo
with 200k units which will take a lot time to finish that leads
to major performance regression.

(cherry picked from commit 033b2e6)
pcreech pushed a commit that referenced this pull request Jan 18, 2021
…4008)

* find_repo_content_units: skip db query if expected result is empty

For performance reasons, this change does server-side counting
of units that are expected to be in result before actual db query
happens. If expected count is zero, skip db query and continue to
the next chunk of unit ids.

The performance problem is noticable in RHSM-pulp for big repos
with 200k and more units associated. During associate action
results of the most of queries in affected code will be empty
so we can skip querying the db this way.

Without this fix, code will do eg. 200 unnecesary queries for repo
with 200k units which will take a lot time to finish that leads
to major performance regression.

(cherry picked from commit 033b2e6)
pcreech pushed a commit to pcreech/pulp that referenced this pull request Jan 18, 2021
In (pulp#4008) 033b2e6, the function
was updated to skip db query, if the expected results is empty.

It seems that it was forgotten to update tests, so this change
updates tests to reflect the change in pulp#4008.

In this commit I've also fixed the repository.py file.
There was accidentally added unicode char to the end the repository.py file,
it caused failure during run of coverage tool in CI.

(cherry picked from commit 0e0b905)
pcreech pushed a commit that referenced this pull request Jan 18, 2021
In (#4008) 033b2e6, the function
was updated to skip db query, if the expected results is empty.

It seems that it was forgotten to update tests, so this change
updates tests to reflect the change in #4008.

In this commit I've also fixed the repository.py file.
There was accidentally added unicode char to the end the repository.py file,
it caused failure during run of coverage tool in CI.

(cherry picked from commit 0e0b905)
rbikar added a commit to rbikar/pulp that referenced this pull request Jun 22, 2021
Notable change for rhsm-pulp:
- find_repo_content_units: skip db query if expected result is empty
  (pulp#4008) [RHELDST-4090]
  this resolves performance regression that happens while querying
  collection with more that 1M documents.

* upstream/2-master:
  Update tests for find_repo_content_units() func
  Adds task failure handling for celery in bad state
  find_repo_content_units: skip db query if expected result is empty (pulp#4008)
  Revert mkdir() changes for Bundle and Lock.
  Bump version to 2.21.4 in travis config for docs
  Added release notes for 2.21.4 release.
  Ignore force_full in config override when deciding publish to be operational
  Taught pulp-workers to have a umask of 002 instead of 022.
  Taught util.misc.mkdir() to set umask to allow group-w for makedirs calls.

Change-Id: I47908c98359918c0e5cf833026049ba338e87208
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
6 participants