Use a more appropriate compression level for exports #3884

Merged (1 commit) on Jun 13, 2023

Conversation

Contributor

@dralley dralley commented May 25, 2023

Exports will be larger, but should be much faster. There were reports of large exports taking multiple days to complete due to the overhead incurred by compression.

closes #3869

@dralley
Contributor Author

dralley commented May 25, 2023

Pending discussion on https://bugzilla.redhat.com/show_bug.cgi?id=2188504

Contributor

@ggainey ggainey left a comment

Holy cow - I love this idea. Addresses the problem, without requiring a whole new config/API/workflow change - brilliant!

ipanova previously approved these changes May 25, 2023
Member

@ipanova ipanova left a comment

I like the solution 🦭

@ipanova
Member

ipanova commented May 25, 2023

But it seems like the tests are failing: pulp_smash.pulp3.bindings.PulpTaskError: (PulpTaskError(...), "Pulp task failed (__init__() got an unexpected keyword argument 'compresslevel')")

@ipanova
Member

ipanova commented May 25, 2023

@dralley did you run any tests to come to the conclusion that compression level 1 is the most balanced? There is a whole range of levels, 1-9.

@ggainey
Contributor

ggainey commented May 25, 2023

> But it seems like the tests are failing: pulp_smash.pulp3.bindings.PulpTaskError: (PulpTaskError(...), "Pulp task failed (__init__() got an unexpected keyword argument 'compresslevel')")

Oh bah - looks like compresslevel isn't supported on the streaming options. From the tarfile doc:

"For modes 'w:gz', 'r:gz', 'w:bz2', 'r:bz2', 'x:gz', 'x:bz2', tarfile.open() accepts the keyword argument compresslevel (default 9) to specify the compression level of the file."

@dralley
Contributor Author

dralley commented May 25, 2023

I checked https://github.com/python/cpython/blob/main/Lib/tarfile.py#L1826 first, it looks like it should be supported...

It looks like this was only merged a year ago and therefore might not be in the Python versions we're using: python/cpython@50cd4b6
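A quick way to check this on a given interpreter, as a minimal sketch using only the standard library. The helper names `gzip_tar_bytes` and `stream_mode_accepts_compresslevel` are hypothetical and not part of this PR. The first helper uses the non-streaming `w:gz` mode, which the tarfile doc quoted above says accepts `compresslevel`; the second probes the streaming `w|gz` mode and reports whether the running Python accepts the keyword there.

```python
import io
import tarfile


def gzip_tar_bytes(payload: bytes, compresslevel: int) -> bytes:
    """Build a one-member .tar.gz in memory using the non-streaming 'w:gz' mode."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz", compresslevel=compresslevel) as tar:
        info = tarfile.TarInfo(name="payload.bin")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def stream_mode_accepts_compresslevel() -> bool:
    """Return True if tarfile's streaming 'w|gz' mode accepts compresslevel here."""
    try:
        with tarfile.open(fileobj=io.BytesIO(), mode="w|gz", compresslevel=1):
            pass
    except TypeError:
        # Same failure as in the test log: unexpected keyword argument 'compresslevel'.
        return False
    return True


print("stream mode supports compresslevel:", stream_mode_accepts_compresslevel())
```

On interpreters where the probe returns `False`, the keyword has to be dropped or the behaviour supplied some other way, which is the problem the rest of this thread works through.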

@dralley
Contributor Author

dralley commented May 25, 2023

> @dralley did you run any tests to come to the conclusion that compression level 1 is the most balanced? There is a whole range of levels, 1-9.

I didn't run tests, but I did look over some general compression benchmarks that others have done. Level 3 was 30-40% slower but only about 10% smaller, and anything beyond that was an even worse tradeoff. My feeling is that a lot of the content that people want to export, like RPMs, metadata, etc., is already compressed, so we don't need to try hard to compress it again.
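The point about already-compressed content can be checked with a small, hedged experiment. The sketch below uses `zlib` (the same DEFLATE algorithm gzip uses) on two synthetic inputs: random bytes standing in for already-compressed payloads such as RPMs, and repetitive text standing in for compressible data. The inputs and the `measure` helper are assumptions made for illustration; timings and ratios will vary by machine and are not the benchmarks referred to above.

```python
import os
import time
import zlib


def measure(data: bytes, level: int):
    """Compress data at the given level; return (compressed size, seconds taken)."""
    start = time.perf_counter()
    out = zlib.compress(data, level)
    return len(out), time.perf_counter() - start


# Random bytes approximate content that is already compressed.
incompressible = os.urandom(1 << 20)
# Repetitive text approximates content that compresses well.
compressible = b"pulp export metadata line\n" * 40_000

results = {}
for label, data in (("incompressible", incompressible), ("compressible", compressible)):
    for level in (1, 3, 9):
        size, seconds = measure(data, level)
        results[(label, level)] = size
        print(f"{label:>14} level={level} ratio={size / len(data):.3f} time={seconds * 1000:.1f} ms")
```

For the random input, no level should bring the ratio meaningfully below 1.0, so any extra time spent at higher levels buys nothing for that kind of content.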

@ipanova
Member

ipanova commented May 26, 2023

> > @dralley did you run any tests to come to the conclusion that compression level 1 is the most balanced? There is a whole range of levels, 1-9.
>
> I didn't run tests, but I did look over some general compression benchmarks that others have done. Level 3 was 30-40% slower but only about 10% smaller, and anything beyond that was an even worse tradeoff. My feeling is that a lot of the content that people want to export, like RPMs, metadata, etc., is already compressed, so we don't need to try hard to compress it again.

Gotcha, thanks for the explanation. It makes sense.

Contributor

@ggainey ggainey left a comment

Daniel, might it be possible to subclass "just enough" of Tarfile to support this, and carry just that? As opposed to needing all 3K lines of tarfile.py?

@dralley
Contributor Author

dralley commented May 30, 2023

@ggainey I agree and that is the intention - this is still in draft remember - but until then I just wanted a sense of whether it would work or not.

@dralley dralley force-pushed the compression branch 2 times, most recently from 53bebbf to efbbb80 Compare May 30, 2023 05:50
@dralley dralley force-pushed the compression branch 2 times, most recently from 5efa600 to e372173 Compare May 31, 2023 05:34
@dralley
Contributor Author

dralley commented May 31, 2023

Well, it works. However, is this something we're actually open to doing? We have explicitly tried to avoid doing monkeypatches or this kind of vendoring in the past. I think this solution would yield the best outcome (relative to no compression at all, or something like that) but I still definitely feel discomfort about it at a technical / maintenance level.

Am I missing an easier way to accomplish this?

@bmbouter @mdellweg @ipanova @ggainey

@dralley dralley force-pushed the compression branch 4 times, most recently from ddc442f to 7b58832 Compare May 31, 2023 14:08
Contributor

@ggainey ggainey left a comment

I think this approach is the only reasonable way to address the problem for the actually-affected users. It's definitely "black magic", and needs some Very Explicit Comments to make it clear what's going on and why it works - but I can't think of a better approach.

@dralley
Contributor Author

dralley commented Jun 8, 2023

> I believe tar | gz | split would work - but I haven't tried it to know Absolutely For Sure. It's definitely worth the experiment. The important thing, tho, is "whatever we do in export, needs to be consumable in the current import code", so people don't have to recreate their exports.

Experimenting with this has been successful, although expressing it in code is a bit more challenging. But I want to double-check: is gzip-the-command-line-tool available on all systems we care about? Most distros have it installed by default, but a lot of container images do not.
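For readers trying to picture the tar | gz | split shape, here is a rough in-process analogue. It is an illustrative sketch only: it uses Python's `gzip` module rather than the gzip CLI tool discussed here, it is not the code from this PR, and `ChunkSink` and `export_stream` are hypothetical names. An uncompressed tar stream (`w|`) is written into a level-1 `GzipFile`, which in turn writes into a sink that splits the bytes into fixed-size chunks.

```python
import gzip
import io
import os
import tarfile


class ChunkSink:
    """Hypothetical sink that splits a byte stream into fixed-size in-memory chunks."""

    def __init__(self, chunk_size: int):
        self.chunk_size = chunk_size
        self.chunks = [bytearray()]

    def write(self, data) -> int:
        data = bytes(data)
        offset = 0
        while offset < len(data):
            if len(self.chunks[-1]) == self.chunk_size:
                self.chunks.append(bytearray())
            room = self.chunk_size - len(self.chunks[-1])
            self.chunks[-1] += data[offset:offset + room]
            offset += room
        return len(data)

    def flush(self) -> None:
        # Nothing is buffered outside self.chunks.
        pass


def export_stream(files: dict, sink, compresslevel: int = 1) -> None:
    """Write files as tar -> gzip(compresslevel) -> sink, mirroring tar | gz | split."""
    with gzip.GzipFile(fileobj=sink, mode="wb", compresslevel=compresslevel) as gz:
        with tarfile.open(fileobj=gz, mode="w|") as tar:
            for name, payload in files.items():
                info = tarfile.TarInfo(name=name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


files = {"a.txt": b"hello" * 1000, "b.bin": os.urandom(20_000)}
sink = ChunkSink(chunk_size=4096)
export_stream(files, sink)
print(f"{len(sink.chunks)} chunks written")
```

Concatenating the chunks yields an ordinary gzip-compressed tar archive, so a reader opening it with `tarfile` in `r:gz` mode should not need to know how it was produced. Whether this shape fits the real export code path is not something this sketch establishes.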

@dralley
Contributor Author

dralley commented Jun 9, 2023

I take it back. Something about the data pipeline gets messed up, and it's a real pain to debug. Combined with uncertainty about relying on the gzip CLI tool, I think we should just continue with the original implementation.

@dralley dralley marked this pull request as ready for review June 9, 2023 01:40
@dralley dralley requested a review from mdellweg June 9, 2023 01:40
Comment on lines 52 to 53
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
Member

Exactly. If we do not add any init code, then just don't add these two lines.
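The reasoning behind this review comment can be shown with a toy example. The classes below are hypothetical and unrelated to the PR's code: an `__init__` that only forwards its arguments to `super().__init__` behaves the same as having no `__init__` at all, because Python falls back to the parent's initializer.

```python
class Base:
    def __init__(self, name, level=9):
        self.name = name
        self.level = level


class WithRedundantInit(Base):
    # Only forwards to the parent, so it adds nothing.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)


class WithoutInit(Base):
    # No __init__ defined: Base.__init__ is used directly.
    pass


a = WithRedundantInit("export", level=1)
b = WithoutInit("export", level=1)
print(a.level == b.level, WithoutInit.__init__ is Base.__init__)
```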

@@ -0,0 +1 @@
Exports now use gzip compression level 1 rather than compression level 9. Exported archives will now be slightly larger, but exports should be much faster. This is considered to be a more optimal balance of space/time for the export operation.
Member

To make some of our teammates happy, this ought to be line broken at 98 characters.

(100 is the rule, but towncrier indents by another two.)
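One way to apply that wrapping, sketched with the standard library's `textwrap`. The entry text is the changelog line from the diff above; wrapping it by script rather than by hand is an assumption for illustration, not something the project is known to do.

```python
import textwrap

entry = (
    "Exports now use gzip compression level 1 rather than compression level 9. "
    "Exported archives will now be slightly larger, but exports should be much faster. "
    "This is considered to be a more optimal balance of space/time for the export operation."
)

# 98 rather than 100, leaving room for the two-character indent mentioned above.
wrapped = textwrap.fill(entry, width=98)
print(wrapped)
```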

@@ -0,0 +1,69 @@
import sys
Member

That is not how I learned Python, but maybe Django does some auto lookup?

Exports will be larger, but should be much faster. There were reports of
large exports taking multiple days to complete due to the overhead
incurred by compression.

closes pulp#3869
@ipanova ipanova merged commit 34c5bab into pulp:main Jun 13, 2023
@patchback

patchback bot commented Jun 13, 2023

Backport to 3.18: 💚 backport PR created

✅ Backport PR branch: patchback/backports/3.18/34c5babec5b9503dbd627bebb75cbc0403fe2ae7/pr-3884

Backported as #3917

🤖 @patchback
I'm built with octomachinery and
my source is open — https://github.com/sanitizers/patchback-github-app.

@patchback

patchback bot commented Jun 13, 2023

Backport to 3.21: 💚 backport PR created

✅ Backport PR branch: patchback/backports/3.21/34c5babec5b9503dbd627bebb75cbc0403fe2ae7/pr-3884

Backported as #3918

@patchback

patchback bot commented Jun 13, 2023

Backport to 3.22: 💚 backport PR created

✅ Backport PR branch: patchback/backports/3.22/34c5babec5b9503dbd627bebb75cbc0403fe2ae7/pr-3884

Backported as #3919

@patchback

patchback bot commented Jun 13, 2023

Backport to 3.23: 💚 backport PR created

✅ Backport PR branch: patchback/backports/3.23/34c5babec5b9503dbd627bebb75cbc0403fe2ae7/pr-3884

Backported as #3920

@B-Woody

B-Woody commented Jun 14, 2023

I submitted the original Red Hat BZ relating to this issue and also for #3236. Passing on my thanks to everyone who worked on this. It's going to be a huge help for some of the work we're doing.

@dralley dralley deleted the compression branch June 14, 2023 02:23
@patchback

patchback bot commented Aug 1, 2023

Backport to 3.28: 💚 backport PR created

✅ Backport PR branch: patchback/backports/3.28/34c5babec5b9503dbd627bebb75cbc0403fe2ae7/pr-3884

Backported as #4220

@patchback

patchback bot commented Sep 13, 2023

Backport to 3.16: 💚 backport PR created

✅ Backport PR branch: patchback/backports/3.16/34c5babec5b9503dbd627bebb75cbc0403fe2ae7/pr-3884

Backported as #4425

Successfully merging this pull request may close these issues.

Exports are bottlenecked by gzip compression which cannot be disabled
5 participants