New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve performance of duplicate removal, and prevent content from contaminating typed repos #441
Conversation
Required PR: pulp/pulpcore#441 re: #5701 https://pulp.plan.io/issues/5701
Required PR: pulp/pulpcore#441 re: #5701 https://pulp.plan.io/issues/5701
Required PR: pulp/pulpcore#441 re: #5701 https://pulp.plan.io/issues/5701
Required PR: pulp/pulpcore#441 re: #5701 https://pulp.plan.io/issues/5701
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really great work!
| field: getattr(content_unit, field) for field in type_obj.repo_key_fields | ||
| } | ||
| item_query = Q(**unit_q_dict) & ~Q(pk=content_unit.pk) | ||
| find_dup_qs |= item_query |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would do a batch queryset here, RHEL 7 has about 27000 packages, so this query would result in 27000 objects + 27000 unit_q_dicts
febc7ed
to
c965be0
Compare
|
|
||
| if new_content_qs.count(): | ||
| find_dup_qs = Q() | ||
| for content_unit in new_content_qs: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you do:
| for content_unit in new_content_qs: | |
| repo_key_fields_with_pk = type_obj.repo_key_fields + ('pk',) | |
| for content_unit in new_content_qs.values(*repo_key_fields_with_pk): |
it would already bring the result in dicts, so you won't need to do a dict comprehension for each item
pk = str(content_unit.pop("pk"))
item_query = Q(**content_unit) & ~Q(pk=pk)a7a9098
to
2bba500
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀
Required PR: pulp/pulpcore#441 re: #5701 https://pulp.plan.io/issues/5701
Required PR: pulp/pulpcore#441 re: #5701 https://pulp.plan.io/issues/5701
Required PR: pulp/pulpcore#441 re: #5701 https://pulp.plan.io/issues/5701
| @@ -1,47 +0,0 @@ | |||
| from unittest.mock import patch | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All this is moved to pulp_file. It has to be, because "core.content" is no longer a valid type of content to be in a repository, and you're not supposed to be creating a generic "Repository" anymore either.
Required PR: pulp/pulpcore#441 re: #5701 https://pulp.plan.io/issues/5701
Required PR: pulp/pulpcore#441 re: #5701 https://pulp.plan.io/issues/5701
Required PR: pulp/pulpcore#441 re: #5701 https://pulp.plan.io/issues/5701
Required PR: pulp/pulpcore#441 re: #5701 https://pulp.plan.io/issues/5701
Required PR: pulp/pulpcore#441 re: #5701 https://pulp.plan.io/issues/5701
Required PR: pulp/pulpcore#441 re: #5701 https://pulp.plan.io/issues/5701
Required PR: pulp/pulpcore#441 re: #5701 https://pulp.plan.io/issues/5701
Required PR: pulp/pulpcore#441 re: #5701 https://pulp.plan.io/issues/5701
Required PR: pulp/pulpcore#441 re: #5701 https://pulp.plan.io/issues/5701
Update ProgressReport for batches of content instead of individually. ProgressReport updating took 30% of the runtime of syncs, now 1%. Required PR: pulp/pulpcore#441 [noissue]
Update ProgressReport for batches of content instead of individually. ProgressReport updating took 30% of the runtime of syncs, now 1%. Required PR: pulp/pulpcore#441 [noissue]
|
|
||
| for content_dict in batch: | ||
| content_pk = content_dict.pop('pk') | ||
| item_query = Q(**content_dict) & ~Q(pk=content_pk) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need & ~Q(pk=content_pk) here? It looks like these pks are filtered out by .filter(pk__in=existing_content) below anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point, probably not.
| repository = repository_version.repository.cast() | ||
| content_types = {ctype.get_pulp_type(): ctype for ctype in repository.CONTENT_TYPES} | ||
|
|
||
| for pulp_type in content_types.keys(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| for pulp_type in content_types.keys(): | |
| for pulp_type, type_obj in content_types.items(): |
and remove the next line.
It responsible for 13% of sync/resync runtime, and provides no value. The name of the content is external and would have no translation, and the name of the stage should not be translated, as it should correspond with the source code. [noissue]
Reduce code duplication for a common pattern [noissue]
|
Comments addressed |
|
|
||
| duplicates_qs = type_obj.objects.filter(pk__in=existing_content)\ | ||
| .filter(find_dup_qs)\ | ||
| .only('pk') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This (.only('pk')) is really neat. I must remember to put it my toolbox.
@dralley this looks really great! |
| no_change = not self.added() and not self.removed() | ||
| if no_change: | ||
| self.delete() | ||
| else: | ||
| content_types_seen = set( | ||
| self.content.values_list('pulp_type', flat=True).distinct() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might be able to get away with just checking the content used in the added content units. But I think I'll let that wait until I can start looking at the query generation. We'll definitely be doing more optimization work post-GA
Required PR: pulp/pulpcore#441 [noissue]
Required PR: pulp/pulpcore#441 [noissue]
Required PR: pulp/pulpcore#441 [noissue]
Required PR: pulp/pulpcore#441 [noissue]
Required PR: pulp/pulpcore#441 [noissue]
Update ProgressReport for batches of content instead of individually. ProgressReport updating took 30% of the runtime of syncs, now 1%. Required PR: pulp/pulpcore#441 [noissue]
| @@ -79,7 +78,6 @@ def __init__(self, new_version, *args, **kwargs): | |||
| with ProgressReport(message='Un-Associating Content', code='unassociating.content') as pb: | |||
| async for queryset_to_unassociate in self.items(): | |||
| self.new_version.remove_content(queryset_to_unassociate) | |||
| pb.done = pb.done + queryset_to_unassociate.count() | |||
| pb.save() | |||
| pb.increase_by(queryset_to_unassociate.count()) | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we remove the auto-throttling from pulpcore's ProgressReport context manager? I'm concerned that we already have a mechanism (auto-throttle saving) that handles this, but we're side stepping that both here and in pulp_rpm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're not actually side stepping that here, the mechanism is inside of .save(), and .increase_by() calls .save(), so it's functionally equivalent.
In pulp_rpm, yes, I agree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true. Still I'm getting the feeling that auto-throttling is an anti-feature. It's here to help plugin writers to not think about these things, but in practice plugin writers probably need to think about their data processing in batches. The throttling is here to serve plugin writers, but with plugins avoiding it I don't see who its serving. I'm +1 to remove it either in this PR or as a separate issue. It can be in 3.1 also it doesn't matter to me when. It's important to me that we don't have core offering a feature in its plugin API that isn't a good fit for plugin writers.
So can we remove this? and if so shall I file a separate story to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, pulp_file is using it, pulp_cookbook uses it, pulp_container uses it in one place (but not another), and pulp_rpm is using it now (updated my PR). I'm not sure the statement that plugins are avoiding it is justified.
With that said, yeah, it does seem a bit hacky. I could go either way with removing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well,
pulp_filewas using it,pulp_cookbookuses it,pulp_containeruses it in one place (but not another), andpulp_rpmis using it now. I'm not sure the statement that plugins are avoiding it is justified.
It's accurate to say that plugins managing their own updates are not relying on pulpcore to provide that feature and avoiding its use.
With that said, yeah, it does seem a bit hacky. I could go either way with removing it.
I filed https://pulp.plan.io/issues/5855 and started a discussion on pulp-dev here for feedback: https://www.redhat.com/archives/pulp-dev/2019-December/msg00032.html
Update ProgressReport for batches of content instead of individually. ProgressReport updating took 30% of the runtime of syncs, now 1%. Required PR: pulp/pulpcore#441 [noissue]
Update ProgressReport for batches of content instead of individually. ProgressReport updating took 30% of the runtime of syncs, now 1%. Required PR: pulp/pulpcore#441 [noissue]
Update ProgressReport for batches of content instead of individually. ProgressReport updating took 30% of the runtime of syncs, now 1%. Required PR: pulp/pulpcore#441 [noissue]
Use auto-throttling behavior of ProgressReport context manager. Required PR: pulp/pulpcore#441 [noissue]
Required PR: pulp/pulpcore#441 re: #5701 https://pulp.plan.io/issues/5701 (cherry picked from commit 140d849)
Required PR: pulp/pulpcore#441 re: #5701 https://pulp.plan.io/issues/5701 (cherry picked from commit 3caded8)
No description provided.