Save space by compressing at least some of the builds #23
OK, so here is the math!
Each build uncompressed is ≈28 MB. This does not include unnecessary source files or anything, so it does not look like we can go lower than that (unless we use compression).
So how many builds do we want to keep? Well, going one year back is about 3000 commits. This gives roughly 84 GB per year, uncompressed.
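That estimate as a quick shell sanity check (figures from above: ≈28 MB per build, ≈3000 commits per year):

```shell
# Back-of-the-envelope: yearly uncompressed storage for all builds
commits_per_year=3000
mb_per_build=28
total_gb=$(( commits_per_year * mb_per_build / 1000 ))
echo "${total_gb} GB per year uncompressed"   # prints: 84 GB per year uncompressed
```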
Is it a lot? Well, no, but for folks who have an SSD (me) this might be a problem.
Given that people commit stuff slightly faster than storage becomes significantly cheaper, I think that we should compress it anyway (even if it is moved to a server with more space). It is a good idea in the long run. And it will make it easier for us to throw in some extra stuff (JVM builds, or maybe even 32-bit builds? You never know).
OK, so what can we do?
The most obvious one is to use compression in btrfs. The problem is that it is applied to each file individually, so we are not going to save anything across many builds. Also, it is only viable if you already have btrfs, so it does not look like the best option.
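For completeness, btrfs compression is a mount option (the device and mount point below are placeholders), which makes the per-file limitation clear:

```shell
# Placeholder device and mount point; compress= transparently compresses each
# file on write, so identical data across separate builds is not shared.
mount -o compress=zlib /dev/sdb1 /srv/builds
```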
Compress each build individually
While it may sound like a great idea to compress all builds together, it does not work that well in practice. Well, it does, but keep reading.
The best compression I got is with
Compressing each build individually is also good for things like local bisect. That is, we can make these archives publicly available and then write a script that pulls them for your local git bisect. How cool is that! That is about 40 MB to download per git bisect, and you cannot really compress it any further anyway because you don't know ahead of time which files you will need.
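A minimal sketch of such a script, assuming a hypothetical archive host and naming scheme (`builds.example.org` and `build-<sha>.tar.xz` are made up, not the project's real layout; exit code 125 is `git bisect run`'s convention for "skip this commit"):

```shell
# Hypothetical helper for `git bisect run` -- host, URL scheme, and paths
# are assumptions.
fetch_build() {
    sha=$1
    url="https://builds.example.org/build-${sha}.tar.xz"   # made-up URL scheme
    dir="/tmp/build-${sha}"
    mkdir -p "$dir"
    # returning 125 tells `git bisect run` to skip commits with no archive
    curl -fsSL "$url" | tar -xJ -C "$dir" || return 125
    printf '%s\n' "$dir"
}
```

A bisect wrapper would then call `fetch_build "$(git rev-parse HEAD)"` and run the failing test case against the unpacked build.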
This gives us ≈120 GB per 10 years. Good enough, I like it.
Is there anything that performs better than 7z? Well, yes:
Let's compress everything together!
For the tests, I took only 7 builds:
Now, there are some funny entries here. Obviously,
I think that there are ways to fiddle with options to get even better results. Suggestions are welcome!
However, at this point it looks like the best way is to use
On Friday, 12 August 2016 07:51:40 CEST Aleks-Daniel Jakimenko-Aleksejev
Actually I like the git repo idea very much. With the build files in a git repo
Have you run git repack after committing the different versions? That should
It is hard to tell if it is going to perform better when we put all builds into it. Currently, with just 7 builds in, 28 MB repo size is equivalent to storing each build separately (≈4 MB per build).
Also, I'm not sure if performance is going to be adequate. Bisect has to jump a couple hundred commits back and forth, which is definitely slower than just unpacking a 4 MB archive (or am I wrong?).
Well, yes, it says there's nothing to repack (perhaps
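For reference, an aggressive repack that forces git to recompute deltas across all objects looks like this (the 250/250 values are just an illustration); a throwaway demo with two near-identical files:

```shell
cd "$(mktemp -d)"                        # throwaway demo repository
git init -q
git config user.email demo@example.com   # placeholder identity for the demo
git config user.name demo
seq 1 100000 > build; git add build; git commit -qm 'build 1'
seq 2 100001 > build; git add build; git commit -qm 'build 2'
# -f recomputes existing deltas; a wide --window lets git match more objects
git repack -a -d -f --window=250 --depth=250
du -sh .git/objects/pack                 # both "builds" now share one pack
```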
@MasterDuke17 I've added lz4 (and a bunch of other stuff) to the main post.
LZ4 is actually a very good finding, thank you very much. Indeed, we should probably forget about space savings and think about decompression speed instead.
How long does it take to decompress one build compressed with
So let's see how fast things decompress:
Almost everything is with default options, so feel free to recommend something specific.
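The measurements were along these lines; shown here with gzip since the exact invocations are not recorded in the thread, so swap in `zstd -d`, `lz4 -d`, `brotli -d`, and so on:

```shell
cd "$(mktemp -d)"
seq 1 1000000 > build.bin                 # compressible stand-in for a build
gzip -k build.bin                         # compress once, keep the original
time gzip -dkf build.bin.gz               # the decompression time is what matters
gzip -dc build.bin.gz | cmp - build.bin   # sanity check: round-trip is lossless
```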
As stupid as it sounds, brotli is a clear winner right now (UPDATE: nope. See next comment). It is a bit slow during compression, but I don't mind it at all.
We have a new winner: https://github.com/Cyan4973/zstd
≈0m0.130s decompression, ≈4.9M size, compression faster than brotli. Basically, it is a winner on all criteria except for file size, and it is only ≈0.4MB worse. Where is the catch??
We can tweak it a bit by using a different compression level. The numbers above are with the max level (22), but we can make it ≈10 ms faster by sacrificing ≈0.8 MB (level 15). I don't care about either of these.
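For anyone reproducing the level sweep, the zstd CLI takes the level as a numeric flag (levels 20-22 additionally require `--ultra`); a small demo on compressible stand-in data, assuming zstd is installed:

```shell
cd "$(mktemp -d)"
seq 1 500000 > build.bin                  # compressible stand-in data
for level in 3 15 19; do
    zstd -q "-$level" -o "build.$level.zst" build.bin
    wc -c < "build.$level.zst"            # size generally shrinks as level rises
done
time zstd -dqf -o roundtrip.bin build.19.zst
cmp build.bin roundtrip.bin               # round-trip sanity check
```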
By the way, I found this blog post very interesting: http://fastcompression.blogspot.com/2013/12/finite-state-entropy-new-breed-of.html
added a commit on Aug 23, 2016
OK, so this was implemented some time ago along with other major changes. Given that everything is written in 6lang now, some things tend to segfault sometimes… but otherwise everything is fine. At least, compression is definitely there, so I am closing this.
Another article about Zstandard.