Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MetadataStep performance #57

Merged
merged 4 commits into from
Jan 22, 2019
Merged

Conversation

quba42
Copy link
Collaborator

@quba42 quba42 commented Nov 5, 2018

This pull request aims to improve the performance of the MetadataStep when dealing with large Debian repositories (e.g. stretch/main). This is achieved by completely removing the debpkgr dependency from this step. The needed functionality is instead implemented directly in pulp_deb.

This pull request is intended as a pure code refactor that does not add or remove any features. (Though I have several ideas for follow on improvements including bug fixes and new features that would be easy to implement on top of these changes). The only "feature changes" I am aware of is that the relative order of fields in both 'Packages' and 'Release' files will have changed (I used the same order I found in upstream Debian Stretch mirrors), and that I no longer write redundant information into 'Release' files (debpkgr would write the codename into various non-codename fields).

I do not (yet) have any systematic benchmarks, but my initial tests suggest that the performance improvement is quite substantial. (I can reliably sync repositories with more than 50000 packages, on a system with 12 GiB of RAM. In the past such repositories have routinely lead to memory allocation failures on systems with 32 GiB of RAM.)

I am happy to answer any questions and would greatly appreciate any feedback so far.

@pulpbot
Copy link
Member

pulpbot commented Nov 5, 2018

Can one of the admins verify this patch?

1 similar comment
@pulpbot
Copy link
Member

pulpbot commented Nov 5, 2018

Can one of the admins verify this patch?

@quba42
Copy link
Collaborator Author

quba42 commented Nov 6, 2018

I was thinking that perhaps I should explain my thinking behind this pull request a bit more. The basic idea is to improve performance by not parsing the same package files multiple times.

In the current implementation of the MetadataStep, the metadata associated with a repository is loaded from the db into memory, and is then processed to gain a list of all packages associated with each architecture for each component for each release in the repository. Particularly packages with architecture = "all" will be appended to every other architecture's list (as well as a list of their own), meaning these packages will appear in multiple lists (and as a result be parsed multiple times).

Each list is then processed by a call to python-debpkgr, which proceeds to access every package file in the list, mostly to regenerate the metadata we already loaded into memory from our db. For repositories containing thousands or tens of thousands of packages (often within a single debpkgr call) this is a real resource bottleneck.

The implementation of this pull request avoids all that by using the metadata from the db directly. This does mean reimplementing some of the functionality previously supplied by debpkgr, but I believe this to be justified both by the performance improvement, as well as the greater level of control it gives us over the MetadataStep.

Once merged, it will become much easier to implement certain features and fix certain bugs I have identified, since we will no longer depend on debpkgr supplying us with the needed interfaces.

Possible New Features post merge:

  • Adding an 'InRelease' file (instead of just a 'Release' file)
  • Compressing 'Packages' files using xz instead of bz2 as is common in official Debian repositories.
  • Improving our pool folder structure. (Currently we just use pool/<component>/<files>, whereas Debian uses pool/<component>/<layer1>/<layer2>/<files>. Here, <layer1> is a single letter or number, or lib concatenated with a single letter or number. <layer2> is a source package name.

Possible further performance improvements:

In it's current form this pull request does not change what is stored in the db. There are several problems with this. Firstly, the db currently only stores a single checksum (along with it's type) for debian packages. However, we want to write SHA1, SHA256, and MD5sum into 'Packages' files. Currently this pull request has to regenerate these hashes for every package file in the repository and monkey patch them into the objects retrieved from the db. It would be better to store all three hashes in the db directly.

Secondly there should be arch_units_deb in additon to release_units_deb, comp_units_deb, and units_deb. Mainly to fix the following bug (https://pulp.plan.io/issues/4094), but also for some further performance gains in the MetadataStep, as well as for reasons of conceptual completeness.

@quba42
Copy link
Collaborator Author

quba42 commented Nov 7, 2018

It is possible this PR also fixes https://pulp.plan.io/issues/3750
I was able to synchronize the repository described in the issue report using the changes in this PR.
However, I was not able to reproduce the issue without the changes in this PR either, since my test setup ran out of memory.
I will have another go at it with a more powerful test system tomorrow.

EDIT: I can't reproduce the above bug either before or after my changes.

@quba42
Copy link
Collaborator Author

quba42 commented Nov 12, 2018

I am about ready to rebase this PR one more time with the following new features:

  1. Using deb822 from python-debian to write the Release and Packages files
  2. A change to the db model to store all needed hashes in the mongo db

@quba42
Copy link
Collaborator Author

quba42 commented Nov 13, 2018

I have now rebased the branch using a version I consider to be finished (though I am of course happy to incorporate any feedback I receive).

I have also opened a pulp issue to go with this PR: https://pulp.plan.io/issues/4151

WARNING: The current version of this PR makes changes to the pulp_deb db models! (Take care when testing on systems you care about...)

Copy link
Member

@mibanescu mibanescu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran out of time (and patience) following the big code change, but I promise to come back to it :-)

As noted, I would prefer if we didn't change the unit key names (checksum -> sha256).

common/pulp_deb/common/ids.py Outdated Show resolved Hide resolved
key_id=key_id)
sign_options = signer.SignOptions(cmd, repository_name=repository_name,
key_id=key_id)
return signer.Signer(sign_options)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the change, I just need to remember to port it into pulp_rpm, where we have almost the same code for signing repository metadata files.

plugins/pulp_deb/plugins/distributors/metadata_files.py Outdated Show resolved Hide resolved
@quba42
Copy link
Collaborator Author

quba42 commented Nov 14, 2018

I ran out of time (and patience) following the big code change, but I promise to come back to it :-)

I do appreciate all the detailed feedback! I understand this is a large potential change.

@quba42
Copy link
Collaborator Author

quba42 commented Nov 15, 2018

If you made CHECKSUM_TYPES a dict, then you had the translation for the different names in one place, and it would be easier to include e.g. SHA512 somewhere in the future.

I think of somthing along the lines of util.calculate_checksums(input_file, list(CHECKSUM_TYPES.values())) and
return { k, checksums[v] for k,v in CHECKSUM_TYPES.items() }

I tried to incorporate these changes in d13f5a4 "REBASE_BEFORE_MERGE: Incorporate calculate_deb_checksums() suggestions".

I also renamed the function to calculate_deb_checksums as suggested above.

@quba42
Copy link
Collaborator Author

quba42 commented Nov 21, 2018

I found another issue during testing:
https://pulp.plan.io/issues/4176

I will need to add an additional fix for this bug to the PR.

'sha1': util.TYPE_SHA1,
'sha256': util.TYPE_SHA256,
}
with open(input_file_path) as input_file:
Copy link
Collaborator Author

@quba42 quba42 Nov 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I be catching IOError and raise it as Error similar to lines 174 to 175 125 to 126?

@quba42
Copy link
Collaborator Author

quba42 commented Nov 28, 2018

Since it was getting rather messy, I have rebased the branch.
The current version still cannot deal with this issue: https://pulp.plan.io/issues/4176
(Which should really be resolved in a separate PR.)

Copy link
Contributor

@m-bucher m-bucher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current implementation uses deb-control-fields from the database, which are not complete and depend on how the package was added (sync or upload). E.g. uploaded packages seem to miss 'Maintainer' field.
Furthermore in a proprietary-repository I saw deb-control-fields like 'License' and 'Vendor'.
These fields are probably not standard, but they should appear in the resulting Packages-file.

Idea: Save the complete control-file in the DB and use that for creating the Packages-file upon publish.

@quba42
Copy link
Collaborator Author

quba42 commented Dec 11, 2018

#61 fixes https://pulp.plan.io/issues/4176.
It also ads a control_fields dict containing all control file fields.
(This is essentially the suggestion by @m-bucher above).

This PR is blocked until #61 is merged at which point I will need to rebase and make minor adjustments.

@quba42 quba42 changed the title MetadataStep performance WIP: MetadataStep performance Dec 12, 2018
@quba42
Copy link
Collaborator Author

quba42 commented Dec 12, 2018

The blocker (#61) is now merged into master, and I have rebased this PR on top of it.

I have also incorporated use of the new control_fields dict to (hopefully) fix the issue mentioned by @m-bucher above. (See a3845be for the relevant changes).

@quba42 quba42 changed the title WIP: MetadataStep performance MetadataStep performance Dec 12, 2018
Copy link
Contributor

@m-bucher m-bucher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!
Looks very good.

@quba42 quba42 force-pushed the metadatastep_performance branch 3 times, most recently from d13dc51 to 4b00f63 Compare December 17, 2018 13:31
@quba42
Copy link
Collaborator Author

quba42 commented Dec 17, 2018

Right, unless additional requests for changes, or newly identified bugs come along, I am pretty much done with this. I have looked it over one more time, and flagged any remaining spots I am not quite certain about.

On the whole, I am pretty pleased with how it has turned out. Some initial test runs suggest that the MetadataStep takes less than half the time it did without these changes. It is also possible to reliably synchronize 50000+ package repositories on my test system with 12GiB of RAM, which would have been completely impossible without these changes. A big thanks to everyone who helped make it so!

* Does not add or remove features.
* Removes the debpkgr dependency from the MetadataStep.
* The order of fields in 'Packages' and 'Release' files has changed.
* This does not add or remove features.

All needed checksums (md5sum, sha1, sha256) are now stored in the
mongodb, thus removing the need to re-calculate any checksums during the
MetadataStep. In other words, publishing repositories no longer requires
reading any files.
@mibanescu mibanescu merged commit e0d9027 into pulp:master Jan 22, 2019
@quba42 quba42 deleted the metadatastep_performance branch January 22, 2019 15:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants