Investigate options for Bazel remote caches #12458

Closed
jheidbrink opened this issue Apr 13, 2022 · 10 comments · Fixed by #12743

Labels
  • bazel (changes for the Bazelification effort)
  • component: ci (all updates on CI: Jenkins/CircleCI/GitHub Actions)

Comments

@jheidbrink commented Apr 13, 2022

This ticket tracks the investigation of remote cache options for Bazel.

Motivation

We are currently using the GitHub Actions cache for Bazel builds in CI.

This comes with some challenges:

  • The total cache size is limited to 10 GiB. As long as we are not fully hermetic, we should use different cache keys for different build environments (i.e. VM / devcontainer). That might make us exceed the limit, which could make GitHub evict caches that are still useful, as described here.
  • The GitHub Actions runners are limited in disk space. In order to prevent them from filling up, we remove some unused SDKs and clear the cache once it exceeds 6.5 GiB.

By using remote caches, we are not limited to 10 GiB. Also, downloading cache results is much more fine-grained, which can reduce the amount of data downloaded and hopefully eliminate the need to maintain code that clears the cache in our runners.

@jheidbrink

So far we have experimented with Google Cloud Storage and BuildBuddy as remote caches. We found that the "Bazel build & test" workflow takes ~1h20m without a cache and ~30m with a cache. That is still slower than the current approach using the GitHub Actions cache. We are still experimenting with the parameters that control how much data Bazel downloads from remote caches.
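
For context, wiring Bazel up to these two backends looks roughly like the following (a sketch based on the backends' usual setup, not the exact flags from the workflow; the bucket name and API key are placeholders):

    # Sketch: GCS bucket as an HTTP remote cache. The bucket name is a placeholder;
    # --google_default_credentials picks up the service account available on the runner.
    bazel build //... \
      --remote_cache=https://storage.googleapis.com/<my-bazel-cache-bucket> \
      --google_default_credentials

    # Sketch: BuildBuddy's hosted cache over gRPC. The API key is a placeholder
    # that would be injected from a CI secret.
    bazel build //... \
      --remote_cache=grpcs://remote.buildbuddy.io \
      --remote_header=x-buildbuddy-api-key=<api-key>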

@jheidbrink

As the MCF already has an AWS account, we are also looking into setting up buchgr/bazel-remote on AWS.
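
As a rough idea of what that could look like, bazel-remote can be run from its Docker image on an EC2 host with a local disk as storage (a sketch with placeholder paths and sizes; it assumes the image's defaults of /data as the cache directory, port 8080 for HTTP and 9092 for gRPC):

    # Sketch: run buchgr/bazel-remote on an EC2 instance, caching to local disk.
    # /srv/bazel-cache and the 100 GiB limit are placeholders.
    docker run -d \
      -v /srv/bazel-cache:/data \
      -p 9090:8080 \
      -p 9092:9092 \
      buchgr/bazel-remote-cache \
      --max_size 100   # upper bound on cache size in GiB; older entries get evicted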

@jheidbrink commented Apr 20, 2022

Regarding

we should use different cache keys for different build environments (ie VM / devcontainer)

This is discussed in bazelbuild/bazel#4558

TLDR: Bazel does not track tools outside its workspace. So if a devcontainer upgrade comes with a new version of gcc, Bazel may retrieve outputs from cache that were built with the old gcc version.

To deal with that, we can

  • Move as much as possible inside the workspace
    • for example by downloading a Python interpreter with rules_python instead of using the one from the system.
  • Use different cache keys for different environments
    • That seems to be what most people in the mentioned issue are doing: calculate a hash of the inputs used to build the environment that you execute Bazel in, and use that hash as a cache key.
  • Ignore the issue and enjoy the incorrect but faster builds ;)

Another thread where several people suggest hashing the build-environment inputs and using that as a cache key: https://forum.buildkite.community/t/any-experience-setting-up-a-shared-remote-build-cache-using-bazel/1119/4
Quote from user petemounce in that thread:

You absolutely don’t want to pollute your build cache by allowing differently set up bazel hosts to write to it, if you do, you lose your hermeticity, which is a lot of the point of using bazel in the first place.
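
For illustration, the hash-as-cache-key idea translates to Bazel remote caches by namespacing the cache with a digest of whatever defines the build environment. A minimal sketch, assuming placeholder file paths and a placeholder cache endpoint (and, for option 1, that the cache backend accepts a path prefix):

    # Derive a short hash from the files that define the build environment, so a
    # devcontainer with a different toolchain cannot poison entries written from the VM.
    ENV_HASH=$(cat .devcontainer/Dockerfile .devcontainer/devcontainer.json | sha256sum | cut -c1-16)

    # Option 1: separate namespaces via the HTTP cache URL path.
    bazel build //... --remote_cache=https://cache.example.com/env-${ENV_HASH}

    # Option 2: with a gRPC cache, separate namespaces via --remote_instance_name.
    bazel build //... --remote_cache=grpcs://cache.example.com --remote_instance_name=env-${ENV_HASH}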

@LKreutzer commented Apr 21, 2022

Comparison of the build times for bazel-remote and buildbuddy with different remote_download flags:

This comparison uses a modified GH workflow to compare the two different remote cache backends.

bazel-remote: with this setup bazel build //... needs

  • 26min 47s for the initial build
  • 7 min plain with full cache
  • 5 min 23s with --remote_download_minimal with full cache
  • 3 min 25s with --remote_download_toplevel with full cache

buildbuddy: with this setup bazel build //... needs

  • 43min 59s for the initial build
  • 9 min 35s plain with full cache
  • 8 min 28s with --remote_download_minimal with full cache
  • 8 min 38s with --remote_download_toplevel with full cache

We can also compare the download volumes from the buildbuddy builds:

  • 5.561 GB plain with full cache
  • 0.577 GB with --remote_download_minimal with full cache
  • 3.829 GB with --remote_download_toplevel with full cache

Hence --remote_download_toplevel seems to be somewhat faster than --remote_download_minimal, but it also uses significantly more download volume (and possibly more disk space). In this setup bazel-remote seems to be faster than buildbuddy. The bazel-remote setup needs ca. 11-14 min for the full Bazel build and test workflow, which is comparable to the current GH Actions cache.
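
For anyone reproducing this comparison, the three cached variants differ only in the download flag passed next to --remote_cache (a sketch; the endpoint is a placeholder for the bazel-remote or buildbuddy cache used in the modified workflow):

    REMOTE_CACHE=grpcs://<cache-endpoint>   # placeholder

    # "plain": download all action outputs to the runner
    bazel build //... --remote_cache=${REMOTE_CACHE}

    # download only the metadata required to complete the build
    bazel build //... --remote_cache=${REMOTE_CACHE} --remote_download_minimal

    # download only the outputs of top-level targets
    bazel build //... --remote_cache=${REMOTE_CACHE} --remote_download_toplevel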

@LKreutzer commented Apr 26, 2022

Comparison of https vs grpcs protocols for bazel-remote:

Setup

This comparison was done by running the bazel-remote container locally (due to authentication issues when using grpcs) inside the magma-vm and running the bazel command:

  • bazel build lte/gateway/python/... orc8r/gateway/python/... lte/gateway/c/...

with the remote cache flag set to one of the following options:

  • --remote_cache=https://user:password@localhost:9090 (using https)
  • --remote_cache=grpcs://user:password@localhost:9092 (using grpcs)

Some runs included additional remote download flags.

Results

The initial build time without any caching and without any tmp files, after bazel clean --expunge, is 661.859s. The build times with a filled remote cache (with a bazel clean in between each build) are listed in the table below:

Build time   Protocol   Additional flags
58.810s      https      (none)
56.162s      grpcs      (none)
41.307s      https      --remote_download_toplevel
40.124s      grpcs      --remote_download_toplevel
14.996s      https      --remote_download_minimal
14.826s      grpcs      --remote_download_minimal

A total of 1.2 GB of remote cache was generated in the initial build for this example. This does not include the external dependencies that are only fetched but not built, which unfortunately are not cached when using a remote cache - see bazelbuild/bazel#6359.

Findings

  • At least in this setup I do not see a significant decrease in build times when using grpcs instead of https (as suggested by this blog post). It thus seems reasonable to continue using https, which avoids the authentication issues related to grpcs in the full AWS setup.
  • Contrary to the findings in the investigation above, I find that --remote_download_minimal leads to significantly faster builds than --remote_download_toplevel. This could be affected by the ratio of connection speed to compute power on the magma-vm compared to the GH runners.
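
For reference, the measurement loop behind the table above can be scripted roughly as follows (a sketch using the local bazel-remote ports from the Setup section; user:password is a placeholder):

    # One uncached build to fill the cache, then cached builds with a bazel clean
    # in between, once per protocol / download-flag combination.
    TARGETS="lte/gateway/python/... orc8r/gateway/python/... lte/gateway/c/..."

    bazel clean --expunge
    bazel build ${TARGETS} --remote_cache=https://user:password@localhost:9090   # fills the cache

    for flag in "" "--remote_download_toplevel" "--remote_download_minimal"; do
      for cache in https://user:password@localhost:9090 grpcs://user:password@localhost:9092; do
        bazel clean
        time bazel build ${TARGETS} --remote_cache=${cache} ${flag}
      done
    done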

@LKreutzer

See also #12353

@vktng commented May 12, 2022

Findings regarding --experimental_remote_downloader

To cache external dependencies in the remote cache, we set up the experimental remote downloader feature (see docs). However, the relevant external libraries do not seem to be cached in our setup.
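
For context, the remote downloader is enabled by pointing Bazel at a Remote Asset API endpoint in addition to the cache itself. A minimal sketch, assuming a placeholder endpoint and that the asset API is enabled on the cache server:

    # Ask Bazel to fetch external repository archives through the remote cache's
    # Remote Asset API instead of downloading them on every runner.
    bazel build lte/gateway/python/... orc8r/gateway/python/... \
      --remote_cache=grpcs://<cache-endpoint> \
      --experimental_remote_downloader=grpcs://<cache-endpoint>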

The times below are for an experiment with the following command:
bazel build lte/gateway/python/... orc8r/gateway/python/...

Without experimental remote downloader:

  • 556.144s (without cache, initial build)
  • 345.292s (full cache)

With experimental remote downloader:

  • 621.736s (without cache, initial build)
  • 345.437s (full cache)

In this setup it seems that there is no build time advantage.

@zachgrayio commented Jun 8, 2022

@vktng @LKreutzer

I happened upon this issue randomly, but I have a few comments to offer since I know a bit about this topic.

Re: "Findings regarding --experimental_remote_downloader"

In this setup it seems that there is no build time advantage.

To measure the impact of caching external deps with --experimental_remote_downloader=, you'd want to run bazel clean --expunge && bazel build lte/gateway/python/... orc8r/gateway/python/... twice (the first run populates the cache so the second one gets hits), measure the second run, and compare it against the same command with --experimental_remote_downloader unset.

Depending on the shape of your build graph and external deps required by the targets you're building (the external deps of these targets will be the items that are fetched), you should definitely be seeing some difference in timings (maybe even a slowdown). You also may want to manually unset --repository_cache by setting it to an empty value with --repository_cache= to ensure things are pulled over the remote.
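
A minimal sketch of that measurement, with a placeholder endpoint, could look like this:

    TARGETS="lte/gateway/python/... orc8r/gateway/python/..."
    DOWNLOADER=grpcs://<cache-endpoint>   # placeholder

    # Run 1: populate the remote asset cache (timing not meaningful yet).
    bazel clean --expunge
    bazel build ${TARGETS} --experimental_remote_downloader=${DOWNLOADER} --repository_cache=

    # Run 2: measure with the downloader and a warm asset cache.
    bazel clean --expunge
    time bazel build ${TARGETS} --experimental_remote_downloader=${DOWNLOADER} --repository_cache=

    # Baseline: same clean build without the remote downloader.
    bazel clean --expunge
    time bazel build ${TARGETS} --repository_cache=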

Next, for your comparison of http vs grpc protocols: I am guessing you don't see a big delta in performance because you are running the cache backend locally. Some portion of the overhead of http1.1 comes from dialing the remote backend for each request, since http1.1 doesn't multiplex the way grpc does over http2. In a real-world comparison you might see more impact here; we often do see a difference in performance, in line with what Steeve outlined in the post you referenced, and as such all of our customers tend to use gRPC even if they aren't using the remote execution components.

Minor follow-up: I took a look at the project running in a devcontainer and can confirm I don't see much benefit in the remote_downloader for external deps, but I didn't have much time to look into exactly why that's the case here.

@LKreutzer commented Jun 24, 2022

Cross-posting here: "Bazel test options lead to different cache entries than the build options" #13073

@LKreutzer

Closing this issue as the remote cache has been running for some months and we are now in a phase of optimisation.
