Improve performance of duplicate removal, and prevent content from contaminating typed repos #441

Merged
merged 4 commits into from Dec 9, 2019

Conversation

Contributor

@dralley dralley commented Dec 5, 2019

No description provided.

dralley added a commit to dralley/pulp_rpm that referenced this pull request Dec 5, 2019
dralley added a commit to dralley/pulp_file that referenced this pull request Dec 5, 2019
Member

@fao89 fao89 left a comment

This is really great work!

        field: getattr(content_unit, field) for field in type_obj.repo_key_fields
    }
    item_query = Q(**unit_q_dict) & ~Q(pk=content_unit.pk)
    find_dup_qs |= item_query
Member

I would batch this queryset. RHEL 7 has about 27,000 packages, so this query would result in 27,000 objects plus 27,000 unit_q_dicts.
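
One way to read the batching suggestion: rather than folding every unit into a single filter, walk the new content in fixed-size chunks and build one query per chunk. A minimal sketch in plain Python follows; the `batched` helper and the chunk size of 1000 are illustrative assumptions, not the code that was merged.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield successive lists of at most batch_size items from iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical usage: 27000 stand-in units, chunk size chosen arbitrarily.
batches = list(batched(range(27000), 1000))
print(len(batches), len(batches[0]))
```

Because the helper only consumes an iterator, it would work the same way over rows streamed from a queryset, so only one chunk needs to be held in memory at a time.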

@dralley dralley force-pushed the performance branch 2 times, most recently from febc7ed to c965be0 Compare December 5, 2019 19:42

if new_content_qs.count():
    find_dup_qs = Q()
    for content_unit in new_content_qs:
Member

If you do:

Suggested change
-    for content_unit in new_content_qs:
+    repo_key_fields_with_pk = type_obj.repo_key_fields + ('pk',)
+    for content_unit in new_content_qs.values(*repo_key_fields_with_pk):

the results already come back as dicts, so you won't need a dict comprehension for each item:

    pk = str(content_unit.pop("pk"))
    item_query = Q(**content_unit) & ~Q(pk=pk)

@dralley dralley force-pushed the performance branch 3 times, most recently from a7a9098 to 2bba500 Compare December 5, 2019 20:29
Member

@fao89 fao89 left a comment

🚀

dralley added a commit to dralley/pulp_file that referenced this pull request Dec 6, 2019
dralley added a commit to dralley/pulp_rpm that referenced this pull request Dec 6, 2019
@@ -1,47 +0,0 @@
-from unittest.mock import patch
Contributor Author

All this is moved to pulp_file. It has to be, because "core.content" is no longer a valid type of content to be in a repository, and you're not supposed to be creating a generic "Repository" anymore either.

dralley added a commit to dralley/pulp_python that referenced this pull request Dec 6, 2019
dralley added a commit to dralley/pulp_maven that referenced this pull request Dec 6, 2019
dralley added a commit to dralley/pulp_rpm that referenced this pull request Dec 7, 2019
Update ProgressReport for batches of content instead of individually.
ProgressReport updating took 30% of the runtime of syncs, now 1%.

Required PR: pulp/pulpcore#441

[noissue]

for content_dict in batch:
    content_pk = content_dict.pop('pk')
    item_query = Q(**content_dict) & ~Q(pk=content_pk)
Contributor

@gmbnomis gmbnomis Dec 7, 2019

Do we need & ~Q(pk=content_pk) here? It looks like these pks are filtered out by .filter(pk__in=existing_content) below anyway?

Contributor Author

Fair point, probably not.
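
The reasoning in this exchange can be sketched without the ORM. Assuming the candidate rows are restricted to content that was already in the repository, and that newly added units are never part of that set, a new unit cannot match itself, so an explicit pk exclusion changes nothing. The function name, the `relative_path` field, and the sample rows below are made up for illustration.

```python
def find_duplicates(existing_units, new_units, repo_key_fields):
    """Return pks of existing units whose repo key matches any new unit.

    Plain-Python stand-in for the query discussed above; units are dicts
    shaped like the rows returned by QuerySet.values().
    """
    new_keys = {
        tuple(unit[field] for field in repo_key_fields) for unit in new_units
    }
    return [
        unit["pk"]
        for unit in existing_units
        if tuple(unit[field] for field in repo_key_fields) in new_keys
    ]


# Hypothetical data: "relative_path" plays the role of a repo key field.
existing = [
    {"pk": 1, "relative_path": "a.txt"},
    {"pk": 2, "relative_path": "b.txt"},
]
new = [{"pk": 3, "relative_path": "a.txt"}]

duplicates = find_duplicates(existing, new, ("relative_path",))
print(duplicates)
```

Only pks drawn from `existing` can appear in the result, which is the same effect the `pk__in=existing_content` filter has in the reviewed code.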

repository = repository_version.repository.cast()
content_types = {ctype.get_pulp_type(): ctype for ctype in repository.CONTENT_TYPES}

for pulp_type in content_types.keys():
Contributor

Suggested change
-for pulp_type in content_types.keys():
+for pulp_type, type_obj in content_types.items():

and remove the next line.

It is responsible for 13% of sync/resync runtime, and provides no value.
The name of the content is external and would have no translation, and
the name of the stage should not be translated, as it should correspond
with the source code.

[noissue]
Reduce code duplication for a common pattern

[noissue]
Contributor Author

dralley commented Dec 8, 2019

Comments addressed


duplicates_qs = type_obj.objects.filter(pk__in=existing_content)\
                                .filter(find_dup_qs)\
                                .only('pk')
Contributor

@gmbnomis gmbnomis Dec 8, 2019

This (.only('pk')) is really neat. I must remember to put it in my toolbox.

Contributor

gmbnomis commented Dec 8, 2019

> Comments addressed

@dralley this looks really great!

no_change = not self.added() and not self.removed()
if no_change:
    self.delete()
else:
    content_types_seen = set(
        self.content.values_list('pulp_type', flat=True).distinct()
Contributor Author

We might be able to get away with just checking the content used in the added content units. But I think I'll let that wait until I can start looking at the query generation. We'll definitely be doing more optimization work post-GA
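
A sketch of the kind of check this hunk appears to feed, assuming the set of `pulp_type` values seen in the version is compared against the types the repository supports. The function name and the type strings are hypothetical, and how pulpcore reports a mismatch may differ.

```python
def unsupported_types(content_types_seen, supported_types):
    """Return the content types present in a version but not allowed in it."""
    return set(content_types_seen) - set(supported_types)


# Hypothetical type labels for a file-typed repository.
seen = {"file.file", "rpm.package"}
supported = {"file.file"}

print(unsupported_types(seen, supported))
```

Checking only the types of the added units, as floated in the comment above, would shrink the first argument but leave the comparison itself unchanged.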

dralley added a commit to dralley/pulp_file that referenced this pull request Dec 9, 2019
dralley added a commit to dralley/pulp_rpm that referenced this pull request Dec 9, 2019
Update ProgressReport for batches of content instead of individually.
ProgressReport updating took 30% of the runtime of syncs, now 1%.

Required PR: pulp/pulpcore#441

[noissue]
@@ -79,7 +78,6 @@ def __init__(self, new_version, *args, **kwargs):
         with ProgressReport(message='Un-Associating Content', code='unassociating.content') as pb:
             async for queryset_to_unassociate in self.items():
                 self.new_version.remove_content(queryset_to_unassociate)
-                pb.done = pb.done + queryset_to_unassociate.count()
-                pb.save()
+                pb.increase_by(queryset_to_unassociate.count())
Member

Should we remove the auto-throttling from pulpcore's ProgressReport context manager? I'm concerned that we already have a mechanism (auto-throttle saving) that handles this, but we're sidestepping it both here and in pulp_rpm.

Contributor Author

We're not actually sidestepping that here. The mechanism is inside .save(), and .increase_by() calls .save(), so it's functionally equivalent.

In pulp_rpm, yes, I agree.
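
To make the "functionally equivalent" point concrete, here is a toy model in which the throttle lives inside `save()` and `increase_by()` simply delegates to it. The class, its attribute names, and the interval are assumptions for illustration; this is not pulpcore's ProgressReport.

```python
import time


class ThrottledProgress:
    """Toy progress counter with a throttled save(); NOT pulpcore's ProgressReport."""

    def __init__(self, min_interval=0.5, clock=time.monotonic):
        self.done = 0
        self.writes = 0  # stands in for the number of database writes
        self._min_interval = min_interval
        self._clock = clock
        self._last_write = None

    def save(self):
        """Record a write unless the previous one happened too recently."""
        now = self._clock()
        if self._last_write is None or now - self._last_write >= self._min_interval:
            self._last_write = now
            self.writes += 1

    def increase_by(self, count):
        """Add count to the counter, then go through the throttled save()."""
        self.done += count
        self.save()


# Fake integer clock so the example is deterministic: ten updates, one tick apart.
ticks = iter(range(10))
progress = ThrottledProgress(min_interval=5, clock=lambda: next(ticks))
for _ in range(10):
    progress.increase_by(100)
print(progress.done, progress.writes)
```

In this model, setting `done` by hand and calling `save()` goes through exactly the same throttle as `increase_by()`, which is the sense in which the two spellings in the hunk above are equivalent.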

Member

That's true. Still, I'm getting the feeling that auto-throttling is an anti-feature. It's here so that plugin writers don't have to think about these things, but in practice plugin writers probably need to think about processing their data in batches. The throttling is here to serve plugin writers, but with plugins avoiding it I don't see who it's serving. I'm +1 to remove it, either in this PR or as a separate issue. It can also wait for 3.1; it doesn't matter to me when. It's important to me that core doesn't offer a feature in its plugin API that isn't a good fit for plugin writers.

So can we remove this? And if so, shall I file a separate story to do that?

Contributor Author

@dralley dralley Dec 9, 2019

Well, pulp_file is using it, pulp_cookbook uses it, pulp_container uses it in one place (but not another), and pulp_rpm is using it now (updated my PR). I'm not sure the statement that plugins are avoiding it is justified.

With that said, yeah, it does seem a bit hacky. I could go either way with removing it.

Member

> Well, pulp_file was using it, pulp_cookbook uses it, pulp_container uses it in one place (but not another), and pulp_rpm is using it now. I'm not sure the statement that plugins are avoiding it is justified.

It's accurate to say that plugins managing their own updates are not relying on pulpcore to provide that feature and are avoiding its use.

> With that said, yeah, it does seem a bit hacky. I could go either way with removing it.

I filed https://pulp.plan.io/issues/5855 and started a discussion on pulp-dev here for feedback: https://www.redhat.com/archives/pulp-dev/2019-December/msg00032.html

@dralley dralley merged commit e00e864 into pulp:master Dec 9, 2019
@dralley dralley deleted the performance branch December 9, 2019 20:19
Member

bmbouter commented Dec 9, 2019

I read through again and it looks good to me. Thank you @dralley and @gmbnomis

dralley added a commit to dralley/pulp_rpm that referenced this pull request Dec 9, 2019
Use auto-throttling behavior of ProgressReport context manager.

Required PR: pulp/pulpcore#441

[noissue]
goosemania pushed a commit to goosemania/pulp_rpm that referenced this pull request Dec 10, 2019
ipanova pushed a commit to ipanova/pulp_container that referenced this pull request Dec 11, 2019