Save space by compressing at least some of the builds #23

AlexDaniel opened this Issue Aug 12, 2016 · 10 comments

AlexDaniel commented Aug 12, 2016

OK, so here is the math!

Each build uncompressed is ≈28 MB. This does not include unnecessary source files or anything, so it does not look like we can go lower than that (unless we use compression).

So how many builds do we want to keep? Well, one year from now back is about 3000 commits. This gives roughly 84 GB. And this is just one year of builds for just MoarVM backend. In about 10 years we will slowly start approaching the 1 TB mark. Multiply it by the number of backends we want to track.
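As a quick sanity check, the arithmetic above (a sketch; the per-build size and commit rate are the estimates from this post):

```python
# Rough storage projection for uncompressed builds.
# Estimates from this post: ~28 MB per build, ~3000 commits per year.
MB_PER_BUILD = 28
COMMITS_PER_YEAR = 3000

per_year_gb = MB_PER_BUILD * COMMITS_PER_YEAR / 1000  # ~84 GB per year
per_decade_gb = per_year_gb * 10                      # ~840 GB, i.e. approaching 1 TB

print(f"{per_year_gb:.0f} GB/year, {per_decade_gb:.0f} GB/decade")  # → 84 GB/year, 840 GB/decade
```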

Is it a lot? Well, no, but for folks who have an SSD (me) this might be a problem.

Given that people commit stuff at a slightly faster pace than storage gets cheaper, I think that we should compress it anyway (even if it is moved to a server with more space). It is a good idea in the long run. And it will make it easier for us to throw in some extra stuff (JVM builds, or maybe even 32-bit builds or something? You never know).

OK, so what can we do?

Filesystem compression

The most obvious one is to use compression in btrfs. The problem is that it is applied for each file individually, so we are not going to save anything across many builds. Also, it is only viable if you already have btrfs, so it looks like it is not the best option.

Compress each build individually

While it may sound like a great idea to compress all builds together, it does not work that well in practice. Well, it does, but keep reading.

The best compression I got is with 7z. Each build is ≈4 MB (≈28 MB uncompressed, so a ≈7× space saving!)

Compressing each build individually is also good for things like local bisect. That is, we can make these archives publicly available and then write a script that pulls them for your local git bisect. How cool is that! That would be about 40 MB to download per git bisect, and you cannot really compress it any further anyway because you don't know ahead of time which builds you will need.

This gives us ≈120 GB per 10 years. Good enough, I like it.
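The same projection with ≈4 MB archives, plus the expected download size for a local bisect (a sketch using the numbers above; a bisect over a year of commits takes about ⌈log₂ 3000⌉ steps):

```python
import math

# Estimates from this post: ~4 MB per compressed build, ~3000 commits per year.
MB_PER_ARCHIVE = 4
COMMITS_PER_YEAR = 3000

per_decade_gb = MB_PER_ARCHIVE * COMMITS_PER_YEAR * 10 / 1000  # ~120 GB per 10 years
bisect_steps = math.ceil(math.log2(COMMITS_PER_YEAR))          # ~12 steps over a year of commits
bisect_download_mb = bisect_steps * MB_PER_ARCHIVE             # ~48 MB per bisect, the ballpark above

print(per_decade_gb, bisect_steps, bisect_download_mb)
```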

Is there anything that performs better than 7z? Well, yes:

27M     c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar
9.0M    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.lz4
7.0M    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.zip
6.7M    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.gz
6.7M    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.lzfse
6.4M    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.bz2
5.7M    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7-L15.tar.zst
4.9M    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7-L22.tar.zst
4.7M    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.lzham
4.5M    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.brotli
4.3M    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.lrz
4.1M    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.7z
4.1M    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tpxz
4.0M    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.xz
3.8M    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7-extremely-slow.tar.lrz

xz is much slower during compression and a tiny bit slower when decompressing, so the win is insignificant. lrz with extreme options is much slower at everything, so forget it.
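To reproduce the size ordering without installing the external tools, Python's stdlib covers gzip, bzip2, and xz (7z, lz4, zstd, and lrzip need their own binaries); a sketch on synthetic data that loosely mimics a build tree:

```python
import bz2
import gzip
import lzma
import os

# Synthetic payload: a repeated random block plus repetitive text, loosely
# mimicking a tarball that mixes binary files with boilerplate. Not a real build.
payload = os.urandom(1024) * 2048 + b"sub foo() { ... }\n" * 50000

sizes = {
    "gz":  len(gzip.compress(payload)),
    "bz2": len(bz2.compress(payload)),
    "xz":  len(lzma.compress(payload)),
}
print(sizes)  # expect xz and bz2 to beat gz, matching the ordering in the table above
```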

Let's compress everything together!

For the tests, I took only 7 builds:

c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7
3878066a953195276ef99739f157d38793395d06
ef04e1e07f1de0d4eb2666985c7290f96c912be6
eea786e2a2febef0ab0bfabca956beae95ab81fd
76be77c9d6e697c26e92dc704109b7b8780845aa
d30806bd00dee201cd891550ac65f084f18e8285
9bfbab9186d710e0603b1eb86be1e5ba2e0c84d1

Now, there are some funny entries here. Obviously, .tar is the size of all builds together, uncompressed. git-repo is a git repo with every build committed one after another in the right order. Look, it performs better than some of the other things! Amazing! And even better if you compress it afterwards. Who would've thought, right?

187M    all.tar
63M     all.tar.lz4
49M     all.zip
47M     all.tar.gz
45M     all.tar.bz2
42M     git-repo.tar
28M     all.tar.xz
28M     all.7z
19M     git-repo.7z
8.3M    all.tar.lrz

However, lrz clearly wins this time. And that's with default settings! Wow.

Conclusion

I think that there are ways to fiddle with options to get even better results. Suggestions are welcome!

However, at this point it looks like the best way is to use 7z to compress each build individually.
Messing around with one huge archive is probably not worth the savings. It would make decompression significantly slower, and we want decompression to be as fast as possible.

@AlexDaniel AlexDaniel self-assigned this Aug 12, 2016

niner commented Aug 12, 2016

On Friday, 12 August 2016 at 07:51:40 CEST, Aleks-Daniel Jakimenko-Aleksejev wrote:

> Now, there are some funny entries here. Obviously, .tar is the size of all builds together, uncompressed. git-repo is a git repo with every build committed one after another in the right order. Look, it performs better than some of the other things! Amazing! And even better if you compress it afterwards. Who would've thought, right?
>
> However, lrz clearly wins this time. And that's with default settings! Wow.
>
> Conclusion
>
> I think that there are ways to fiddle with options to get even better results. Suggestions are welcome!

Actually I like the git repo idea very much. With the build files in a git repo instead of the source files, one can use plain old git bisect to find the offending commit. I'm not surprised that git performs so well on this task, as it stores each distinct file content only once. So unless a file changed, it will not be stored again for the new version.

Have you run git repack after committing the different versions? That should reduce the repository size considerably, since it uses delta compression, which should help especially with the larger build files that change a lot, like CORE.setting.moarvm.
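This deduplication comes from git's content addressing: a blob's object id is a SHA-1 over a small header plus the file bytes, so an unchanged file hashes to the same object in every commit and its content is stored only once. A minimal sketch (the file contents here are made up):

```python
import hashlib

def git_blob_id(data: bytes) -> str:
    """Object id git assigns to file content: SHA-1 over 'blob <size>\\0' + bytes."""
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()

# The empty blob gets git's well-known object id:
assert git_blob_id(b"") == "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"

# Unchanged content across two builds -> same object, stored once;
# a changed file (say, CORE.setting.moarvm) -> a new object.
old, new = b"CORE.setting.moarvm v1", b"CORE.setting.moarvm v2"
assert git_blob_id(old) == git_blob_id(old)
assert git_blob_id(old) != git_blob_id(new)
```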

Stefan


AlexDaniel commented Aug 12, 2016

> Actually I like the git repo idea very much.

It is hard to tell if it is going to perform better when we put all builds into it. Currently, with just 7 builds in, 28 MB repo size is equivalent to storing each build separately (≈4 MB per build).

Also, I'm not sure if the performance is going to be adequate. Bisect has to jump back and forth across a couple of hundred commits, which is definitely slower than just unpacking a 4 MB archive (or am I wrong?).

> Have you run git repack after committing the different versions?

Well, yes, it says there's nothing to repack (perhaps git gc called it automatically?).


MasterDuke17 commented Aug 13, 2016

How about testing LZ4 or LZHAM? I suspect they won't compress as well, but they are supposed to be very fast at decompressing, so the trade-off might be worth it.


AlexDaniel commented Aug 13, 2016

@MasterDuke17 I've added lz4 (and a bunch of other stuff) to the main post.

LZ4 is actually a very good finding, thank you very much. Indeed, we should probably forget about space savings and think about decompression speed instead.

How long does it take to decompress one build compressed with 7z? 0.4s. Why? See this. Basically, LZMA is not a good fit for multithreading (besides just being slow).
7z no longer looks like a good candidate. A typical bisect takes a bit less than 15 steps, so that's a 5-second delay just to decompress the data. We can speed it up a bit by keeping at least the starting points decompressed, but in the end it is still a significant delay.

So let's see how fast things decompress:

0m45.291s   c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7-extremely-slow.tar.lrz
0m0.716s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.bz2
0m0.548s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.lzfse
0m0.421s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.lrz
0m0.415s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.xz
0m0.414s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.7z
0m0.325s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tpxz
0m0.262s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.zip
0m0.225s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.gz
0m0.208s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.lzham
0m0.169s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.brotli
0m0.130s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7-L22.tar.zst
0m0.120s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7-L15.tar.zst
0m0.087s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.lz4
0m0.049s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar

Almost everything is with default options, so feel free to recommend something specific.
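For reference, decompression can be timed the same way with the stdlib codecs (gzip, bz2, and xz only; lz4, zstd, brotli, and the rest need third-party tools), a rough sketch on made-up data:

```python
import bz2
import gzip
import lzma
import time

payload = (b"0123456789abcdef" * 4096) * 64   # ~4 MB repetitive sample, stand-in for a build tarball

def decompress_time(compress, decompress, data):
    """Compress once, then time only the decompression step."""
    blob = compress(data)
    t0 = time.perf_counter()
    out = decompress(blob)
    elapsed = time.perf_counter() - t0
    assert out == data            # sanity check: the round trip must be lossless
    return elapsed

for name, c, d in (("gz", gzip.compress, gzip.decompress),
                   ("bz2", bz2.compress, bz2.decompress),
                   ("xz", lzma.compress, lzma.decompress)):
    print(f"{name}: {decompress_time(c, d, payload):.4f}s")
```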

As stupid as it sounds, brotli is a clear winner right now (UPDATE: nope. See next comment). It is a bit slow during compression, but I don't mind it at all.


AlexDaniel commented Aug 14, 2016

We have a new winner: https://github.com/Cyan4973/zstd

≈0m0.130s decompression, ≈4.9 MB size, and compression faster than brotli. Basically, it wins on every criterion except file size, and there it is only ≈0.4 MB behind. Where is the catch??
In fact, zstd is rather young, and if I understand correctly it is not multithreaded yet. So perhaps it will get even better?

We can tweak it a bit by using a different compression level. The numbers above are with the maximum level (22), but we can make decompression ≈10 ms faster by sacrificing ≈0.8 MB (level 15). I don't care about either of these.
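zstd's Python bindings (the third-party `zstandard` package) are not in the stdlib, so here is the same level-vs-size/speed tradeoff sketched with xz presets as a stand-in; the payload is made up:

```python
import lzma
import time

# Stand-in experiment: like zstd's levels 1-22, xz presets 0-9 trade
# compression time for output size. The payload is invented sample text.
payload = b"sub infix:<~~~>($a, $b) { ... }\n" * 20000

for preset in (1, 6, 9):
    t0 = time.perf_counter()
    blob = lzma.compress(payload, preset=preset)
    dt = time.perf_counter() - t0
    print(f"preset {preset}: {len(blob):>6} bytes, {dt:.3f}s to compress")
```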


xenu commented Aug 14, 2016

@AlexDaniel: were you using "plain" xz or maybe pixz in these benchmarks? pixz claims to be much faster on both compression and decompression, thanks to its parallelization.


AlexDaniel commented Aug 14, 2016

@xenu pixz is .tpxz in these tests. It is faster, but not by much.


AlexDaniel commented Aug 14, 2016

By the way, I found this blog post very interesting: http://fastcompression.blogspot.com/2013/12/finite-state-entropy-new-breed-of.html
It is by the author of lz4 and zstd. Amazing stuff.

AlexDaniel added a commit that referenced this issue Aug 23, 2016

Compress builds to save space
Also, bash scripts were replaced with perl 6 code.

zstd was chosen for its fast decompression speed and good compression ratio. See
issue #23 for more info.

Rakudo is now being installed into “/tmp”. This way it will be easier for
others to extract archives on their system if we ever publish them (because
/tmp is writeable on almost any system). For example, we can let people run
git bisect locally by downloading prepared builds.

This means that all previous builds have to be rebuilt. That's not a big deal,
but it will take up to three days to process.

Another side effect of these changes is that bots will not see new commits
until they are actually built. Now, when running something on HEAD, you can be
sure that the commit HEAD points to is actually ready to roll. No more
“whoops, try using some older commit” error messages.

Currently it is very unstable and more testing is required.

AlexDaniel commented Aug 26, 2016

OK, so this was implemented some time ago along with other major changes. Given that everything is written in 6lang now, some things tend to segfault sometimes… but otherwise everything is fine. At least the compression is definitely there, so I am closing this.

@AlexDaniel AlexDaniel closed this Aug 26, 2016


MasterDuke17 commented Aug 31, 2016
