MetadataStep performance #57
Can one of the admins verify this patch? |
I should explain my thinking behind this pull request a bit more. The basic idea is to improve performance by not parsing the same package files multiple times.

In the current implementation of the MetadataStep, the metadata associated with a repository is loaded from the db into memory and then processed to obtain a list of all packages associated with each architecture for each component for each release in the repository. Particularly packages with ... Each list is then processed by a call to python-debpkgr, which proceeds to access every package file in the list, mostly to regenerate the metadata we already loaded into memory from our db. For repositories containing thousands or tens of thousands of packages (often within a single debpkgr call), this is a real resource bottleneck.

The implementation in this pull request avoids all that by using the metadata from the db directly. This does mean reimplementing some of the functionality previously supplied by debpkgr, but I believe this is justified both by the performance improvement and by the greater level of control it gives us over the MetadataStep. Once merged, it will become much easier to implement certain features and fix certain bugs I have identified, since we will no longer depend on debpkgr supplying us with the needed interfaces.

Possible New Features post merge:
Possible further performance improvements: In its current form this pull request does not change what is stored in the db. There are several problems with this. Firstly, the db currently only stores a single checksum (along with its type) for Debian packages. However, we want to write SHA1, SHA256, and MD5sum into 'Packages' files. Currently this pull request has to regenerate these hashes for every package file in the repository and monkey patch them into the objects retrieved from the db. It would be better to store all three hashes in the db directly. Secondly, there should be arch_units_deb in addition to release_units_deb, comp_units_deb, and units_deb. This is mainly to fix the following bug (https://pulp.plan.io/issues/4094), but also for some further performance gains in the MetadataStep, as well as for reasons of conceptual completeness.
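To make the cost of regenerating the hashes concrete, here is a minimal sketch of computing MD5sum, SHA1, and SHA256 for one package file in a single read pass. The function name `compute_checksums` and the result keys are assumptions chosen for illustration; they are not taken from this pull request or from pulp_deb.

```python
import hashlib


def compute_checksums(path, chunk_size=1024 * 1024):
    """Return md5sum, sha1 and sha256 hex digests of a file, reading it once.

    Hypothetical helper: the name and the result keys are illustrative only.
    """
    hashers = {
        "md5sum": hashlib.md5(),
        "sha1": hashlib.sha1(),
        "sha256": hashlib.sha256(),
    }
    with open(path, "rb") as handle:
        # Feed every chunk to all three hashers so the file is read only once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

Even with a single pass per file, this still touches every package file on every publish, which is why storing all three values in the db once (at sync or upload time) looks preferable.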
It is possible this PR also fixes https://pulp.plan.io/issues/3750 EDIT: I can't reproduce the above bug either before or after my changes. |
5224abd to a7f4b32
I am about ready to rebase this PR one more time with the following new features:
7403fba to 7e79363
I have now rebased the branch using a version I consider to be finished (though I am of course happy to incorporate any feedback I receive). I have also opened a pulp issue to go with this PR: https://pulp.plan.io/issues/4151 WARNING: The current version of this PR makes changes to the pulp_deb db models! (Take care when testing on systems you care about...) |
I ran out of time (and patience) following the big code change, but I promise to come back to it :-)
As noted, I would prefer if we didn't change the unit key names (checksum -> sha256).
key_id=key_id)
sign_options = signer.SignOptions(cmd, repository_name=repository_name,
                                  key_id=key_id)
return signer.Signer(sign_options)
I like the change, I just need to remember to port it into pulp_rpm, where we have almost the same code for signing repository metadata files.
I do appreciate all the detailed feedback! I understand this is a large potential change. |
9cfe71d to dfe46ef
plugins/pulp_deb/plugins/migrations/0002_add_multiple_hashes.py
ada574f to f3158f4
I tried to incorporate these changes in d13f5a4 "REBASE_BEFORE_MERGE: Incorporate calculate_deb_checksums() suggestions". I also renamed the function to |
I found another issue during testing: I will need to add an additional fix for this bug to the PR. |
'sha1': util.TYPE_SHA1,
'sha256': util.TYPE_SHA256,
}
with open(input_file_path) as input_file:
Should I be catching IOError and raising it as Error, similar to lines 174 to 175 125 to 126?
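For reference, the pattern being asked about could look roughly like the sketch below. `ChecksumError` and `read_file_or_raise` are hypothetical names standing in for the plugin's own `Error` class and the function under review; neither is taken from the pull request.

```python
class ChecksumError(Exception):
    """Hypothetical plugin-level error; a stand-in for the `Error` class."""


def read_file_or_raise(path):
    """Read a file, translating low-level I/O failures into a plugin error."""
    try:
        with open(path, "rb") as handle:
            return handle.read()
    except IOError as exc:  # on Python 3, IOError is an alias of OSError
        raise ChecksumError("could not read {0}: {1}".format(path, exc))
```

Translating the exception keeps callers from having to know that the failure originated in file access, at the cost of hiding the original exception type.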
82b2627 to 3829def
Since it was getting rather messy, I have rebased the branch. |
The current implementation uses deb-control-fields from the database, which are not complete and depend on how the package was added (sync or upload). For example, uploaded packages seem to lack the 'Maintainer' field.
Furthermore, in a proprietary repository I saw deb-control-fields like 'License' and 'Vendor'.
These fields are probably not standard, but they should appear in the resulting Packages file.
Idea: Save the complete control file in the DB and use that for creating the Packages file upon publish.
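A minimal sketch of what this idea could look like when writing a single Packages paragraph: known fields are emitted in a fixed order, and any other stored field (such as 'License' or 'Vendor') is appended instead of being dropped. The function name, the `preferred_order` tuple, and the assumption that field values are single-line (or already folded with leading spaces) are all illustrative and not taken from pulp_deb.

```python
def render_paragraph(fields,
                     preferred_order=("Package", "Version",
                                      "Architecture", "Maintainer")):
    """Render one Packages paragraph from a dict of control fields.

    Hypothetical helper: assumes values are single-line or already folded.
    """
    lines = []
    seen = set()
    # Emit the well-known fields first, in the given order, when present.
    for name in preferred_order:
        if name in fields:
            lines.append("{0}: {1}".format(name, fields[name]))
            seen.add(name)
    # Append all remaining fields so non-standard ones are preserved.
    for name in sorted(fields):
        if name not in seen:
            lines.append("{0}: {1}".format(name, fields[name]))
    return "\n".join(lines) + "\n"


example = render_paragraph({
    "Package": "foo",
    "Version": "1.0",
    "Architecture": "amd64",
    "License": "proprietary",
    "Vendor": "ACME",
})
```

In this example the missing 'Maintainer' field is simply skipped, while 'License' and 'Vendor' end up after the known fields.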
#61 fixes https://pulp.plan.io/issues/4176. This PR is blocked until #61 is merged at which point I will need to rebase and make minor adjustments. |
3829def to cca55b1
cca55b1 to a3845be
Great work!
Looks very good.
d13dc51 to 4b00f63
4b00f63 to 2a32809
Right, unless additional requests for changes, or newly identified bugs come along, I am pretty much done with this. I have looked it over one more time, and flagged any remaining spots I am not quite certain about. On the whole, I am pretty pleased with how it has turned out. Some initial test runs suggest that the MetadataStep takes less than half the time it did without these changes. It is also possible to reliably synchronize 50000+ package repositories on my test system with 12GiB of RAM, which would have been completely impossible without these changes. A big thanks to everyone who helped make it so! |
* Does not add or remove features.
* Removes the debpkgr dependency from the MetadataStep.
* The order of fields in 'Packages' and 'Release' files has changed.
* This does not add or remove features. All needed checksums (md5sum, sha1, sha256) are now stored in the mongodb, thus removing the need to re-calculate any checksums during the MetadataStep. In other words, publishing repositories no longer requires reading any files.
2a32809 to ae909d3
This pull request aims to improve the performance of the MetadataStep when dealing with large Debian repositories (e.g. stretch/main). This is achieved by completely removing the debpkgr dependency from this step. The needed functionality is instead implemented directly in pulp_deb.
This pull request is intended as a pure code refactor that does not add or remove any features. (Though I have several ideas for follow-on improvements, including bug fixes and new features, that would be easy to implement on top of these changes.) The only "feature changes" I am aware of are that the relative order of fields in both 'Packages' and 'Release' files will have changed (I used the same order I found in upstream Debian Stretch mirrors), and that I no longer write redundant information into 'Release' files (debpkgr would write the codename into various non-codename fields).
I do not (yet) have any systematic benchmarks, but my initial tests suggest that the performance improvement is quite substantial. (I can reliably sync repositories with more than 50000 packages on a system with 12 GiB of RAM. In the past such repositories have routinely led to memory allocation failures on systems with 32 GiB of RAM.)
I am happy to answer any questions and would greatly appreciate any feedback so far.