New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve memory performance on publishing #1551
Conversation
1c4ef3d
to
1cb42fc
Compare
| @@ -107,10 +107,14 @@ def pkglist_to_lst(cls, value): | |||
| """ | |||
| package_list = [] | |||
| for i in value: | |||
| package_list.append({'name': i.name, | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know why, but sometimes it duplicates some packages
| self.packages = [] | ||
| self.published_artifacts = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it was holding objects in memory
pulp_rpm/app/tasks/publishing.py
Outdated
| packages (pulp_rpm.models.Package): A list of packages. | ||
| """ | ||
| published_artifacts = [] | ||
| for content_artifact in ContentArtifact.objects.filter(content__in=contents).iterator(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of going to each kind of content and prefetch its content_artifact, it was better to use ContentArtifact for all contents, using iterator, so it would not cache all the data in memory
pulp_rpm/app/tasks/publishing.py
Outdated
| sub_repo_packages = self.get_packages(content) | ||
| all_packages = all_packages | sub_repo_packages | ||
| setattr(self, f"{name}_packages", sub_repo_packages) | ||
| setattr(self, f"{name}_contents", content) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exposing contents instead of packages this change is due to kickstarts, I will explain better on the next comments
| @@ -246,14 +241,17 @@ def create_repomd_xml(packages, publication, extra_repomdrecords, sub_folder=Non | |||
| oth_db = cr.OtherSqlite(oth_db_path) | |||
| upd_xml = cr.UpdateInfoXmlFile(upd_xml_path) | |||
|
|
|||
| pri_xml.set_num_of_pkgs(len(packages)) | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it was bringing the entire query to memory, for 27000 packages, it was huge!
|
|
||
| # Process all packages | ||
| for package in packages: | ||
| for package in packages.iterator(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By default, iterator only bring 2000 rows into memory, as we only would use packages on this loop, it was better to use iterator than caching the entire query.
pulp_rpm/app/tasks/publishing.py
Outdated
| @@ -262,40 +260,33 @@ def create_repomd_xml(packages, publication, extra_repomdrecords, sub_folder=Non | |||
| oth_db.add_pkg(pkg) | |||
|
|
|||
| # Process update records | |||
| for update_record in UpdateRecord.objects.filter( | |||
| pk__in=publication.repository_version.content): | |||
| for update_record in UpdateRecord.objects.filter(pk__in=contents).iterator(): | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's the reason why I exposed contents instead of packages. Before, this and the other queries below were only pointing to main repo contents. It was a bug for kickstarts, because sub repo contents were not exposed, so this query would never find the desired sub repo content
pulp_rpm/app/tasks/publishing.py
Outdated
| @@ -95,21 +92,46 @@ def prepare_metadata_files(self, contents, folder=None): | |||
|
|
|||
| return repomdrecords | |||
|
|
|||
| def get_packages(self, contents): | |||
| def publish_artifacts(self, contents): | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it was not added in this PR, however can we change contents to content?
I believe content implies that there can be multiple pieces of content.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually I emphasize contents to bring the attention it is not just one thing, it can be huge
pulp_rpm/app/tasks/publishing.py
Outdated
|
|
||
| PublishedArtifact.objects.bulk_create(published_artifacts, batch_size=2000) | ||
|
|
||
| def handle_sub_repo(self, lst, identifier): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And what does it do? It only publishes artifacts, not metadata, correct?
how about publish_sub_repo_artifacts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be clearer for future code readers if the names are more descriptive.
How about lst->sub_repos and identifier -> sub_repo_type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it not only publish artifacts, it also makes sub repo content available to be used on create_repomd_xml
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
identifier would be addon_id or variant_id, I don't know if sub_repo_type suits well
pulp_rpm/app/tasks/publishing.py
Outdated
| return packages | ||
| """ | ||
| sub_repos_pks = [] | ||
| for item in lst: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, can we make it more descritpive, e.g. item -> sub_repo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
item would be addon/variant, I don't know which generic name could cover it
pulp_rpm/app/tasks/publishing.py
Outdated
| if repository_version and repository.sub_repo: | ||
| item_id = getattr(item, identifier) | ||
| self.sub_repos.append((item_id, repository_version.content)) | ||
| if repo_pk not in sub_repos_pks: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you expect duplicated sub_repos?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as worked with few multi-addons/variants, I'm just being cautious. Answering in another way, I'm not sure what to expect
|
Everything here looks good to me with the exception of the |
|
@daviddavis and @goosemania |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
actually it was not only a DRY nitpick, if repository_version and repository_version.content != main_content:to if repository_version and repository.sub_repo:comparing content was more expensive than using sub_repo attribute and also introduced some checking if the content was not published, if not, then publish it |
https://pulp.plan.io/issues/5689
closes #5689