Improve memory performance on publishing #1551

Merged: 1 commit into pulp:master on Dec 9, 2019

Conversation

fao89 (Member, Author) commented Dec 6, 2019

@fao89 force-pushed the 5689 branch 2 times, most recently from 1c4ef3d to 1cb42fc on December 7, 2019 at 13:50
@@ -107,10 +107,14 @@ def pkglist_to_lst(cls, value):
"""
package_list = []
for i in value:
package_list.append({'name': i.name,

fao89 (Member, Author) commented:
I don't know why, but sometimes some packages end up duplicated here.

Comment on lines -61 to -62
self.packages = []
self.published_artifacts = []

fao89 (Member, Author) commented:
These instance attributes were holding objects in memory.

packages (pulp_rpm.models.Package): A list of packages.
"""
published_artifacts = []
for content_artifact in ContentArtifact.objects.filter(content__in=contents).iterator():

fao89 (Member, Author) commented:
Instead of going to each kind of content and prefetching its content_artifact, it is better to query ContentArtifact once for all contents and iterate with iterator(), so Django does not cache all the rows in memory.
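
A minimal sketch of that pattern, assuming the pulpcore plugin API names (ContentArtifact, PublishedArtifact and their relative_path/publication fields) rather than the exact pulp_rpm code:

from pulpcore.plugin.models import ContentArtifact, PublishedArtifact

def publish_artifacts(publication, contents):
    """Stream ContentArtifacts for all kinds of content and publish them in batches."""
    published_artifacts = []
    # A single query over ContentArtifact covers every content type; iterator()
    # streams the rows in chunks instead of caching the whole result set.
    for content_artifact in ContentArtifact.objects.filter(content__in=contents).iterator():
        published_artifacts.append(
            PublishedArtifact(
                relative_path=content_artifact.relative_path,
                publication=publication,
                content_artifact=content_artifact,
            )
        )
    # Batched INSERTs keep the ORM from building one enormous statement.
    PublishedArtifact.objects.bulk_create(published_artifacts, batch_size=2000)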

sub_repo_packages = self.get_packages(content)
all_packages = all_packages | sub_repo_packages
setattr(self, f"{name}_packages", sub_repo_packages)
setattr(self, f"{name}_contents", content)

fao89 (Member, Author) commented:
Exposing contents instead of packages: this change is due to kickstarts; I will explain it better in the next comments.

@@ -246,14 +241,17 @@ def create_repomd_xml(packages, publication, extra_repomdrecords, sub_folder=Non
oth_db = cr.OtherSqlite(oth_db_path)
upd_xml = cr.UpdateInfoXmlFile(upd_xml_path)

pri_xml.set_num_of_pkgs(len(packages))

fao89 (Member, Author) commented:
len() was bringing the entire query result into memory; for 27000 packages, that was huge!
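
For context, the generic Django behavior at play (an illustration, not the exact diff): len() on a QuerySet evaluates and caches every row just to get a number, while count() delegates the counting to the database.

# len() forces the whole queryset into memory before counting it:
pri_xml.set_num_of_pkgs(len(packages))

# count() issues SELECT COUNT(*) and returns only an integer, so the rows can
# still be streamed later with iterator():
pri_xml.set_num_of_pkgs(packages.count())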


# Process all packages
for package in packages:
for package in packages.iterator():

fao89 (Member, Author) commented:
By default, iterator() only brings 2000 rows into memory at a time. Since we only use the packages inside this loop, it is better to iterate with iterator() than to cache the entire query result.
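
A side-by-side illustration (Package.to_createrepo_c() is assumed here for the loop body; chunk_size is Django's knob for how many rows iterator() fetches per chunk, 2000 by default):

# Plain iteration fills the queryset's result cache with every Package row:
for package in packages:
    pkg = package.to_createrepo_c()  # conversion helper assumed, not shown in the diff
    pri_xml.add_pkg(pkg)

# iterator() streams the rows chunk by chunk and never populates the result
# cache; lowering chunk_size trades extra round trips for an even smaller peak:
for package in packages.iterator(chunk_size=2000):
    pkg = package.to_createrepo_c()
    pri_xml.add_pkg(pkg)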

@@ -262,40 +260,33 @@ def create_repomd_xml(packages, publication, extra_repomdrecords, sub_folder=Non
oth_db.add_pkg(pkg)

# Process update records
for update_record in UpdateRecord.objects.filter(
pk__in=publication.repository_version.content):
for update_record in UpdateRecord.objects.filter(pk__in=contents).iterator():

fao89 (Member, Author) commented:
That is the reason I exposed contents instead of packages. Before, this query and the others below pointed only at the main repo's content. That was a bug for kickstarts: sub-repo content was never exposed, so this query could never find the desired sub-repo content.

@@ -95,21 +92,46 @@ def prepare_metadata_files(self, contents, folder=None):

return repomdrecords

def get_packages(self, contents):
def publish_artifacts(self, contents):

Reviewer (Member) commented:
I know it was not added in this PR; however, can we change contents to content?
I believe content implies that there can be multiple pieces of content.
What do you think?

fao89 (Member, Author) replied:
Actually, I kept contents to emphasize that it is not just one thing; it can be huge.


PublishedArtifact.objects.bulk_create(published_artifacts, batch_size=2000)

def handle_sub_repo(self, lst, identifier):

Reviewer (Member) commented:
And what does it do? It only publishes artifacts, not metadata, correct?
How about publish_sub_repo_artifacts?

Reviewer (Member) commented:
I think it would be clearer for future code readers if the names are more descriptive.
How about lst -> sub_repos and identifier -> sub_repo_type?

fao89 (Member, Author) replied:
It not only publishes artifacts, it also makes sub-repo content available to be used in create_repomd_xml.

fao89 (Member, Author) replied:
identifier would be addon_id or variant_id; I don't know if sub_repo_type suits it well.

return packages
"""
sub_repos_pks = []
for item in lst:

Reviewer (Member) commented:
Same here, can we make it more descriptive, e.g. item -> sub_repo?

fao89 (Member, Author) replied:
item would be an addon/variant; I don't know which generic name could cover both.

if repository_version and repository.sub_repo:
item_id = getattr(item, identifier)
self.sub_repos.append((item_id, repository_version.content))
if repo_pk not in sub_repos_pks:

Reviewer (Member) commented:
Do you expect duplicated sub_repos?

fao89 (Member, Author) replied:
As I have worked with few multi-addon/variant repos, I'm just being cautious. To put it another way, I'm not sure what to expect.

daviddavis (Contributor) commented:

Everything here looks good to me with the exception of the handle_sub_repo() function. It seems to add a lot of complexity over what we currently have and I'm not quite clear on how it improves memory usage?

daviddavis (Contributor) left a review:
LGTM

fao89 (Member, Author) commented Dec 9, 2019:

@daviddavis and @goosemania, handle_sub_repo is this:
https://github.com/pulp/pulp_rpm/blob/master/pulp_rpm/app/tasks/publishing.py#L141-L148
just a DRY nitpick.

Actually, it was not only a DRY nitpick. I changed how we determine whether something is a sub-repo, from:

if repository_version and repository_version.content != main_content:

to:

if repository_version and repository.sub_repo:

Comparing content querysets was more expensive than checking the sub_repo attribute.

I also introduced a check for whether the content has already been published; if not, it gets published.
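
For readers following along, a rough sketch of handle_sub_repo pieced together from the snippets quoted in this review (the repository/version lookups are assumptions, and the real body differs):

def handle_sub_repo(self, lst, identifier):
    """Expose each sub repo's content and publish its artifacts exactly once."""
    sub_repos_pks = []
    for item in lst:  # an Addon or a Variant of the distribution tree
        repository = item.repository.cast()                # lookup assumed
        repository_version = repository.latest_version()   # lookup assumed
        # repository.sub_repo is a plain boolean column, so this check is cheap;
        # the old `repository_version.content != main_content` comparison had to
        # evaluate both content querysets.
        if repository_version and repository.sub_repo:
            item_id = getattr(item, identifier)  # addon_id or variant_id
            self.sub_repos.append((item_id, repository_version.content))
            repo_pk = repository.pk
            if repo_pk not in sub_repos_pks:
                # several addons/variants may point at the same repository;
                # publish its artifacts only once
                sub_repos_pks.append(repo_pk)
                self.publish_artifacts(repository_version.content)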

@daviddavis merged commit 744ae4b into pulp:master on Dec 9, 2019