Build everything back to 2014.01 #117

Open
AlexDaniel opened this Issue Mar 15, 2017 · 9 comments

AlexDaniel commented Mar 15, 2017

It's kind of annoying that sometimes we cannot bisect because the change is too old. It is surprising how often this happens.

In issue #23 it was noticed that lrz, although relatively slow, does an amazing job compressing several builds at the same time. This means that for long-term storage we can put 50 (or so) builds together and this way store most of them essentially for free (in terms of storage). Yes, all operations with these builds will be slower, but I guess anyone can wait a second or two if they're trying to access something that old.

These changes are required:

  • build-exists should try to find .zst archives first, and if that fails, try to find the build elsewhere. We will need some sort of lookup mechanism for finding the right file (a rough sketch of such a lookup follows after this list).
  • These two lines should be changed accordingly. Make sure that during this process we are not saving builds that are not required.
  • Change build.p6 so that it can figure out that 50 consecutive .zst archives can be recompressed into a single .lrz archive instead.
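
To make the lookup idea a bit more concrete, here is a rough sketch in Raku. This is not whateverable code; the paths, the archive layout and the use of lrzip's lrzcat are all assumptions:

```raku
# Hypothetical sketch only; paths and layout are made up.
# Assumes separately compressed builds live as $sha.zst and long-term
# archives are .tar.lrz files holding ~50 builds each.
my $BUILDS-DIR = '/var/lib/whateverable/builds/rakudo-moar';  # assumption

sub locate-build($full-commit-hash) {
    # Fast path: an individually compressed build
    my $zst = "$BUILDS-DIR/{$full-commit-hash}.zst".IO;
    return %(format => 'zst', path => $zst) if $zst.e;

    # Slow path: search the member lists of the long-term archives.
    # A real implementation would probably keep a prebuilt index instead.
    for dir("$BUILDS-DIR/long-term", test => *.ends-with('.tar.lrz')) -> $archive {
        my $unpack = run 'lrzcat', $archive, :out;
        my $list   = run 'tar', '--list', '--file=-', :in($unpack.out), :out;
        return %(format => 'lrz', path => $archive)
            if $list.out.slurp(:close).contains($full-commit-hash);
    }
    Nil   # not built, or genuinely unavailable
}
```

The important part is only the two-step lookup; whether the slow path scans tar --list output or consults an index file is an open question.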

AlexDaniel commented May 15, 2017

Just documenting some of the initial findings:

  • If you compress 50 builds with lrzip you get a ≈49 MB archive (fyi, each build is ≈28 MB, and if we compress each build separately with zstd we get ≈4.8 MB per build; see the quick arithmetic below)
  • Pessimistically, decompression of that big archive takes only about 5 seconds (I only measured extraction of the whole thing, without skipping builds that are not needed)
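
To make the "free storage" point concrete, here is the per-build arithmetic using nothing but the approximate numbers above:

```raku
my $builds-per-archive = 50;
my $uncompressed       = 28;    # MB per build
my $zstd-per-build     = 4.8;   # MB per build when compressed separately
my $lrz-archive        = 49;    # MB for all 50 builds together

printf "lrzip: %.2f MB per build (ratio ≈ %.1f)\n",
    $lrz-archive / $builds-per-archive,
    ($uncompressed * $builds-per-archive) / $lrz-archive;
printf "zstd:  %.2f MB per build (ratio ≈ %.1f)\n",
    $zstd-per-build, $uncompressed / $zstd-per-build;
# lrzip: 0.98 MB per build (ratio ≈ 28.6)
# zstd:  4.80 MB per build (ratio ≈ 5.8)
```

So packing 50 builds together costs roughly a fifth of what we currently pay per build.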

AlexDaniel commented Jun 20, 2017

ping @MasterDuke17, @timo

OK, so my estimate of 49 MB per archive with 50 builds was a little bit off. I've done some tests and here is what I've found:

[graph: different archiving strategies tested, from 10 to 200 builds per archive]
(We don't care about compression speed. Compression ratio is divided by 6 so that it fits into the graph nicely, and also because 6 is approximately equal to the current ratio we get with zstd)

This was tested with a particular set of 200 builds, so it does not mean that we will see the same picture for other files. But it should give more insight than if I did nothing :)

A sweet spot seems to be around 60-80 builds per archive, but >4.5s delay just to get one build? Meh… Bisectable is not going to like it.

More info about my tests:

  • I cleared the disk cache before decompression so that the measurement is a bit more pessimistic.
  • tar is instructed to extract only one folder out of the whole archive; everything else is thrown away and is not saved to disk.
  • I used lrzip with default settings. Maybe it used a different window size for different archives, I'm not sure (it is supposed to figure it out automatically). We should probably use the --unlimited option, or just force the window size with --window. I don't think it will affect anything, given that 200 uncompressed builds are a bit less than 6 GB and the server has much more available RAM than that.
  • Again, since it is lrzip with default settings, it is using LZMA. Should I try the -l option so that it uses LZO instead? Note that space is not an issue at all, especially given that the compression ratio is easily over 100.
  • I was extracting the last build that was shown by tar --list (and tar --list was executed before the disk cache was cleared). I don't know if this is pessimistic, optimistic, random or something else. A rough sketch of the whole procedure follows below.
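
For reference, a rough sketch of that procedure (the archive path is made up; dropping the page cache requires root):

```raku
my $archive = '/tmp/200-builds.tar.lrz';   # made-up path

# List the members first (before the cache is cleared, as described above)
my $lrzcat  = run 'lrzcat', $archive, :out;
my @members = run('tar', '--list', '--file=-', :in($lrzcat.out), :out)
                  .out.slurp(:close).lines;
my $build   = @members.map({ .split('/')[0] }).unique.tail;  # last build in the listing

# Drop the disk cache to make the measurement more pessimistic
run 'sync';
spurt '/proc/sys/vm/drop_caches', "3\n";

# Time the extraction of that single build; everything else is skipped
my $start = now;
my $pipe  = run 'lrzcat', $archive, :out;
run 'tar', '--extract', '--file=-', $build, :in($pipe.out), :cwd('/tmp');
say "extraction took {now - $start} seconds";
```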

AlexDaniel commented Jun 20, 2017

OK, here it is with LZO:
[graph: the same archiving strategies, but with LZO]

It is indeed faster. However, the difference is usually less than 1 second, and it comes with a significant downgrade in compression ratio… not worth it.

Any other ideas? :)

AlexDaniel commented Jul 4, 2017

More experiments done, now with a varying compression level (the --level option in lrzip):

[table: results for different --level settings and builds-per-archive counts]

It's hard for me to interpret this meaningfully, but it seems that this magical ratio-to-decompression-time figure is what we are after. For example, if two configurations both score 25, they are doing equally well overall, and we should then pick the one with the lower decompression time.
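
In other words, the metric is compression ratio divided by decompression time. A made-up example (the numbers are purely illustrative, not taken from the table):

```raku
# Purely illustrative numbers
my %configs =
    '20 builds, --level=9' => %( ratio => 100, time => 4 ),
    '60 builds, --level=7' => %( ratio => 125, time => 5 );

for %configs.kv -> $name, %c {
    printf "%-20s ratio/time = %.1f\n", $name, %c<ratio> / %c<time>;
}
# Both come out at 25, i.e. "equally good" by this metric;
# we would then prefer the one with the lower decompression time.
```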

I would appreciate any feedback on this, as I'm badly confused by the whole thing…

However, the sweet spot indeed seems to be at 20 builds per archive with level 9. Not exactly the best compression ratio, but it's fast.

@AlexDaniel AlexDaniel self-assigned this Jul 5, 2017

AlexDaniel added a commit that referenced this issue Jul 7, 2017

Long-term storage using lrzip (issue #117)
After all of the experiments discussed in #117, it was decided to use 20
builds per archive with --level=9, all other settings being
default. This has freed a lot of space on the server, and the introduced
delay for using older builds does not seem to be high at all (≈2
seconds). Of course, tagged commits are not stored this way, so using
2014.01 or other builds like that is still as fast, but 2014.02^ is a
bit slower. I wonder if anybody will ever notice this.
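
For the record, here is a sketch of what that strategy amounts to. This is not the code from the commit; the paths, the names, and the assumption that each build is stored as a zstd-compressed tarball are mine:

```raku
my $BUILDS-DIR = '/var/lib/whateverable/builds/rakudo-moar';   # made-up path

sub pack-long-term(@shas where *.elems == 20) {
    my $staging = '/tmp/long-term-staging'.IO;
    mkdir $staging;

    # Unpack every individual build into the staging directory
    for @shas -> $sha {
        my $zstd = run 'zstd', '-dc', "$BUILDS-DIR/{$sha}.zst", :out;
        run 'tar', '--extract', '--file=-', :in($zstd.out), :cwd($staging);
    }

    # One tar file holding all 20 builds, then compress it with lrzip
    my $tar = "$BUILDS-DIR/long-term/{@shas.head}_{@shas.tail}.tar";
    run 'tar', '--create', "--file=$tar", |@shas, :cwd($staging);
    run 'lrzip', '--level=9', $tar;      # writes "$tar.lrz", keeps the original
    unlink $tar;

    # Only now drop the individual per-build archives
    unlink "$BUILDS-DIR/{$_}.zst" for @shas;
}
```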

AlexDaniel commented Jul 22, 2017

Ah, this is now done. We don't have any tests for build.p6 and I'm not sure if we ever will, so maybe this is closeable.

AlexDaniel commented Feb 7, 2018

OK, we might want to revisit this. From the zstd changelog:

Zstandard has a new long range match finder written by our intern Stella Lau (@stellamplau), which specializes on finding long matches in the distant past. It integrates seamlessly with the regular compressor, and the output can be decompressed just like any other Zstandard compressed data.

There are some graphs comparing zstd with lrzip, but we will have to test it out with our data.
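
Trying it out would presumably look something like this; the level and window log are guesses to be tuned against our data, and only the --long flag is the new part:

```raku
# Compress a tarball of several builds with zstd's long-range matcher enabled
run 'zstd', '-19', '--long=27', 'builds.tar';       # produces builds.tar.zst
# Decompression; note that for window logs above 27 the decompressor
# would also need --long=<log> (or a matching --memory=...)
run 'zstd', '-d', '--long=27', 'builds.tar.zst';
```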

AlexDaniel commented Feb 7, 2018

That said, for #122 this should be delayed. Both lrzip and zstd are in Debian stable, and that should cover the majority of those who will attempt to run it.

AlexDaniel commented Feb 8, 2018

Ignore what I said in the previous comment. Archives produced in long-range mode should be decompressible by any zstd version.

AlexDaniel commented Mar 30, 2018

OK, we should start dropping lrzip in favor of zstd, I think (note that we're using both right now, so it will be one dependency less). See this: https://github.com/facebook/zstd/releases/tag/v1.3.4

@MasterDuke17++ for reminding.
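
A possible migration path, sketched under the assumption that the long-term archives live in a long-term directory (made-up path) and that a large window log pays off:

```raku
# Recompress each lrzip long-term archive with zstd --long so that lrzip
# is no longer needed for new archives. Window log 31 is a guess;
# decompression then also needs --long=31 (or a matching --memory=).
for dir('/var/lib/whateverable/builds/rakudo-moar/long-term',
        test => *.ends-with('.tar.lrz')) -> $lrz {
    my $zst = $lrz.Str.subst(/ '.lrz' $ /, '.zst');
    my $cat = run 'lrzcat', $lrz, :out;
    run 'zstd', '-19', '--long=31', '-o', $zst, :in($cat.out);
    # verify $zst before deleting $lrz
}
```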
