MetadataStep performance #57

quba42 · 2018-11-05T15:52:31Z

This pull request aims to improve the performance of the MetadataStep when dealing with large Debian repositories (e.g. stretch/main). This is achieved by completely removing the debpkgr dependency from this step. The needed functionality is instead implemented directly in pulp_deb.

This pull request is intended as a pure code refactor that does not add or remove any features. (Though I have several ideas for follow on improvements including bug fixes and new features that would be easy to implement on top of these changes). The only "feature changes" I am aware of is that the relative order of fields in both 'Packages' and 'Release' files will have changed (I used the same order I found in upstream Debian Stretch mirrors), and that I no longer write redundant information into 'Release' files (debpkgr would write the codename into various non-codename fields).

I do not (yet) have any systematic benchmarks, but my initial tests suggest that the performance improvement is quite substantial. (I can reliably sync repositories with more than 50000 packages, on a system with 12 GiB of RAM. In the past such repositories have routinely lead to memory allocation failures on systems with 32 GiB of RAM.)

I am happy to answer any questions and would greatly appreciate any feedback so far.

pulpbot · 2018-11-05T15:53:05Z

Can one of the admins verify this patch?

pulpbot · 2018-11-05T15:53:27Z

Can one of the admins verify this patch?

quba42 · 2018-11-06T09:39:13Z

I was thinking that perhaps I should explain my thinking behind this pull request a bit more. The basic idea is to improve performance by not parsing the same package files multiple times.

In the current implementation of the MetadataStep, the metadata associated with a repository is loaded from the db into memory, and is then processed to gain a list of all packages associated with each architecture for each component for each release in the repository. Particularly packages with architecture = "all" will be appended to every other architecture's list (as well as a list of their own), meaning these packages will appear in multiple lists (and as a result be parsed multiple times).

Each list is then processed by a call to python-debpkgr, which proceeds to access every package file in the list, mostly to regenerate the metadata we already loaded into memory from our db. For repositories containing thousands or tens of thousands of packages (often within a single debpkgr call) this is a real resource bottleneck.

The implementation of this pull request avoids all that by using the metadata from the db directly. This does mean reimplementing some of the functionality previously supplied by debpkgr, but I believe this to be justified both by the performance improvement, as well as the greater level of control it gives us over the MetadataStep.

Once merged, it will become much easier to implement certain features and fix certain bugs I have identified, since we will no longer depend on debpkgr supplying us with the needed interfaces.

Possible New Features post merge:

Adding an 'InRelease' file (instead of just a 'Release' file)
Compressing 'Packages' files using xz instead of bz2 as is common in official Debian repositories.
Improving our pool folder structure. (Currently we just use pool/<component>/<files>, whereas Debian uses pool/<component>/<layer1>/<layer2>/<files>. Here, <layer1> is a single letter or number, or lib concatenated with a single letter or number. <layer2> is a source package name.

Possible further performance improvements:

In it's current form this pull request does not change what is stored in the db. There are several problems with this. Firstly, the db currently only stores a single checksum (along with it's type) for debian packages. However, we want to write SHA1, SHA256, and MD5sum into 'Packages' files. Currently this pull request has to regenerate these hashes for every package file in the repository and monkey patch them into the objects retrieved from the db. It would be better to store all three hashes in the db directly.

Secondly there should be arch_units_deb in additon to release_units_deb, comp_units_deb, and units_deb. Mainly to fix the following bug (https://pulp.plan.io/issues/4094), but also for some further performance gains in the MetadataStep, as well as for reasons of conceptual completeness.

quba42 · 2018-11-07T16:15:39Z

It is possible this PR also fixes https://pulp.plan.io/issues/3750
I was able to synchronize the repository described in the issue report using the changes in this PR.
However, I was not able to reproduce the issue without the changes in this PR either, since my test setup ran out of memory.
I will have another go at it with a more powerful test system tomorrow.

EDIT: I can't reproduce the above bug either before or after my changes.

quba42 · 2018-11-12T17:45:54Z

I am about ready to rebase this PR one more time with the following new features:

Using deb822 from python-debian to write the Release and Packages files
A change to the db model to store all needed hashes in the mongo db

quba42 · 2018-11-13T14:28:20Z

I have now rebased the branch using a version I consider to be finished (though I am of course happy to incorporate any feedback I receive).

I have also opened a pulp issue to go with this PR: https://pulp.plan.io/issues/4151

WARNING: The current version of this PR makes changes to the pulp_deb db models! (Take care when testing on systems you care about...)

mibanescu

I ran out of time (and patience) following the big code change, but I promise to come back to it :-)

As noted, I would prefer if we didn't change the unit key names (checksum -> sha256).

common/pulp_deb/common/ids.py

mibanescu · 2018-11-13T18:29:49Z

plugins/pulp_deb/plugins/distributors/configuration.py

-                              key_id=key_id)
+    sign_options = signer.SignOptions(cmd, repository_name=repository_name,
+                                      key_id=key_id)
+    return signer.Signer(sign_options)



I like the change, I just need to remember to port it into pulp_rpm, where we have almost the same code for signing repository metadata files.

plugins/pulp_deb/plugins/distributors/metadata_files.py

quba42 · 2018-11-14T07:15:46Z

I ran out of time (and patience) following the big code change, but I promise to come back to it :-)

I do appreciate all the detailed feedback! I understand this is a large potential change.

plugins/pulp_deb/plugins/migrations/0002_add_multiple_hashes.py

quba42 · 2018-11-15T14:58:52Z

If you made CHECKSUM_TYPES a dict, then you had the translation for the different names in one place, and it would be easier to include e.g. SHA512 somewhere in the future.

I think of somthing along the lines of util.calculate_checksums(input_file, list(CHECKSUM_TYPES.values())) and
return { k, checksums[v] for k,v in CHECKSUM_TYPES.items() }

I tried to incorporate these changes in d13f5a4 "REBASE_BEFORE_MERGE: Incorporate calculate_deb_checksums() suggestions".

I also renamed the function to calculate_deb_checksums as suggested above.

quba42 · 2018-11-21T15:35:58Z

I found another issue during testing:
https://pulp.plan.io/issues/4176

I will need to add an additional fix for this bug to the PR.

quba42 · 2018-11-23T22:17:06Z

plugins/pulp_deb/plugins/db/models.py

+            'sha1': util.TYPE_SHA1,
+            'sha256': util.TYPE_SHA256,
+        }
+        with open(input_file_path) as input_file:


Should I be catching IOError and raise it as Error similar to lines ~~174 to 175~~ 125 to 126?

quba42 · 2018-11-28T16:20:00Z

Since it was getting rather messy, I have rebased the branch.
The current version still cannot deal with this issue: https://pulp.plan.io/issues/4176
(Which should really be resolved in a separate PR.)

m-bucher

The current implementation uses deb-control-fields from the database, which are not complete and depend on how the package was added (sync or upload). E.g. uploaded packages seem to miss 'Maintainer' field.
Furthermore in a proprietary-repository I saw deb-control-fields like 'License' and 'Vendor'.
These fields are probably not standard, but they should appear in the resulting Packages-file.

Idea: Save the complete control-file in the DB and use that for creating the Packages-file upon publish.

quba42 · 2018-12-11T15:31:18Z

#61 fixes https://pulp.plan.io/issues/4176.
It also ads a control_fields dict containing all control file fields.
(This is essentially the suggestion by @m-bucher above).

This PR is blocked until #61 is merged at which point I will need to rebase and make minor adjustments.

quba42 · 2018-12-12T15:25:01Z

The blocker (#61) is now merged into master, and I have rebased this PR on top of it.

I have also incorporated use of the new control_fields dict to (hopefully) fix the issue mentioned by @m-bucher above. (See a3845be for the relevant changes).

plugins/pulp_deb/plugins/migrations/0003_add_multiple_hashes.py

m-bucher

Great work!
Looks very good.

plugins/pulp_deb/plugins/importers/sync.py

plugins/pulp_deb/plugins/distributors/distributor.py

quba42 · 2018-12-17T17:28:59Z

Right, unless additional requests for changes, or newly identified bugs come along, I am pretty much done with this. I have looked it over one more time, and flagged any remaining spots I am not quite certain about.

On the whole, I am pretty pleased with how it has turned out. Some initial test runs suggest that the MetadataStep takes less than half the time it did without these changes. It is also possible to reliably synchronize 50000+ package repositories on my test system with 12GiB of RAM, which would have been completely impossible without these changes. A big thanks to everyone who helped make it so!

plugins/pulp_deb/plugins/distributors/distributor.py

* Does not add or remove features. * Removes the debpkgr dependency from the MetadataStep. * The order of fields in 'Packages' and 'Release' files has changed.

* This does not add or remove features. All needed checksums (md5sum, sha1, sha256) are now stored in the mongodb, thus removing the need to re-calculate any checksums during the MetadataStep. In other words, publishing repositories no longer requires reading any files.

plugins/pulp_deb/plugins/distributors/distributor.py

quba42 force-pushed the metadatastep_performance branch from 5224abd to a7f4b32 Compare November 8, 2018 13:04

quba42 force-pushed the metadatastep_performance branch from 7403fba to 7e79363 Compare November 13, 2018 14:20

mibanescu requested changes Nov 13, 2018

View reviewed changes

quba42 force-pushed the metadatastep_performance branch from 9cfe71d to dfe46ef Compare November 14, 2018 08:30

goosemania reviewed Nov 14, 2018

View reviewed changes

plugins/pulp_deb/plugins/migrations/0002_add_multiple_hashes.py Outdated Show resolved Hide resolved

quba42 force-pushed the metadatastep_performance branch from ada574f to f3158f4 Compare November 14, 2018 17:22

quba42 commented Nov 23, 2018

View reviewed changes

quba42 force-pushed the metadatastep_performance branch from 82b2627 to 3829def Compare November 28, 2018 16:16

m-bucher suggested changes Nov 30, 2018

View reviewed changes

quba42 force-pushed the metadatastep_performance branch from 3829def to cca55b1 Compare December 12, 2018 10:18

quba42 changed the title ~~MetadataStep performance~~ WIP: MetadataStep performance Dec 12, 2018

quba42 force-pushed the metadatastep_performance branch from cca55b1 to a3845be Compare December 12, 2018 15:13

quba42 changed the title ~~WIP: MetadataStep performance~~ MetadataStep performance Dec 12, 2018

m-bucher reviewed Dec 12, 2018

View reviewed changes

plugins/pulp_deb/plugins/migrations/0003_add_multiple_hashes.py Show resolved Hide resolved

m-bucher approved these changes Dec 12, 2018

View reviewed changes

quba42 force-pushed the metadatastep_performance branch 3 times, most recently from d13dc51 to 4b00f63 Compare December 17, 2018 13:31

quba42 force-pushed the metadatastep_performance branch from 4b00f63 to 2a32809 Compare December 17, 2018 16:24

quba42 commented Dec 17, 2018

View reviewed changes

plugins/pulp_deb/plugins/importers/sync.py Show resolved Hide resolved

quba42 commented Dec 17, 2018

View reviewed changes

plugins/pulp_deb/plugins/distributors/distributor.py Show resolved Hide resolved

m-bucher reviewed Jan 17, 2019

View reviewed changes

plugins/pulp_deb/plugins/distributors/distributor.py Show resolved Hide resolved

quba42 added 4 commits January 22, 2019 13:42

Refactor the MetadataStep for better performance

c014a27

* Does not add or remove features. * Removes the debpkgr dependency from the MetadataStep. * The order of fields in 'Packages' and 'Release' files has changed.

Add three new tests

dc136b7

Use the new control_fields dict

ae909d3

quba42 force-pushed the metadatastep_performance branch from 2a32809 to ae909d3 Compare January 22, 2019 12:56

quba42 commented Jan 22, 2019

View reviewed changes

plugins/pulp_deb/plugins/distributors/distributor.py Show resolved Hide resolved

mibanescu merged commit e0d9027 into pulp:master Jan 22, 2019

quba42 deleted the metadatastep_performance branch January 22, 2019 15:11

quba42 mentioned this pull request Feb 6, 2019

Account for components with an extra path prefix #52

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MetadataStep performance #57

MetadataStep performance #57

quba42 commented Nov 5, 2018

pulpbot commented Nov 5, 2018

pulpbot commented Nov 5, 2018

quba42 commented Nov 6, 2018 •

edited

Loading

quba42 commented Nov 7, 2018 •

edited

Loading

quba42 commented Nov 12, 2018

quba42 commented Nov 13, 2018

mibanescu left a comment

mibanescu Nov 13, 2018

quba42 commented Nov 14, 2018

quba42 commented Nov 15, 2018

quba42 commented Nov 21, 2018

quba42 Nov 23, 2018 •

edited

Loading

quba42 commented Nov 28, 2018

m-bucher left a comment

quba42 commented Dec 11, 2018

quba42 commented Dec 12, 2018

m-bucher left a comment

quba42 commented Dec 17, 2018

MetadataStep performance #57

MetadataStep performance #57

Conversation

quba42 commented Nov 5, 2018

pulpbot commented Nov 5, 2018

pulpbot commented Nov 5, 2018

quba42 commented Nov 6, 2018 • edited Loading

quba42 commented Nov 7, 2018 • edited Loading

quba42 commented Nov 12, 2018

quba42 commented Nov 13, 2018

mibanescu left a comment

Choose a reason for hiding this comment

mibanescu Nov 13, 2018

Choose a reason for hiding this comment

quba42 commented Nov 14, 2018

quba42 commented Nov 15, 2018

quba42 commented Nov 21, 2018

quba42 Nov 23, 2018 • edited Loading

Choose a reason for hiding this comment

quba42 commented Nov 28, 2018

m-bucher left a comment

Choose a reason for hiding this comment

quba42 commented Dec 11, 2018

quba42 commented Dec 12, 2018

m-bucher left a comment

Choose a reason for hiding this comment

quba42 commented Dec 17, 2018

quba42 commented Nov 6, 2018 •

edited

Loading

quba42 commented Nov 7, 2018 •

edited

Loading

quba42 Nov 23, 2018 •

edited

Loading