add support for saving and restoring caches#78

Open
emilyalbini wants to merge 4 commits into main from ea-cache

Conversation


@emilyalbini emilyalbini commented Feb 21, 2026

This PR adds caching support in Buildomat, driven by the bmat CLI.

The core building blocks of this implementation are the bmat cache save and bmat cache restore commands. These commands are fairly low level: they require the user to explicitly provide the cache key, and bmat cache save also requires an exhaustive list of files to cache to be provided via stdin.

On top of this building block, the bmat cache rust save and bmat cache rust restore commands wrap the low level primitives and automatically determine the cache key and list of files to cache. This allows us to enforce best practices (like caching only third-party dependencies out of the target/ directory) and reduces the effort to adopt caching.
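
The PR text doesn't spell out how the rust wrappers derive the key, but the idea can be sketched (illustration only; the function name, inputs, and key format here are assumptions, not the PR's actual scheme): combine the toolchain version with a hash of the inputs that determine the contents of target/, such as the lockfile.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical sketch: derive a cache key from the inputs that determine
/// the contents of target/ — the lockfile contents and toolchain version.
fn rust_cache_key(cargo_lock: &str, toolchain: &str) -> String {
    let mut hasher = DefaultHasher::new();
    cargo_lock.hash(&mut hasher);
    toolchain.hash(&mut hasher);
    format!("rust-{}-{:016x}", toolchain, hasher.finish())
}
```

A real implementation would want a hash that is stable across Rust releases (e.g. SHA-256) rather than DefaultHasher, whose algorithm is not guaranteed between versions.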

Caches are stored in S3-compatible object storage: when a worker requests a cache from the Buildomat server, the server generates a pre-signed download URL and returns it to the client. Similarly, when a worker wants to save a cache, the server returns the pre-signed URLs the agent must upload the file parts to. Multipart uploads are used to increase concurrency during upload.

I chose to return pre-signed URLs to the worker rather than using the chunked upload facility in Buildomat for two reasons: (a) Buildomat chunked uploads are tied to the lifetime of a job, while caches will outlast jobs, and (b) streaming large uploads and downloads through the Buildomat server adds unnecessary latency and overhead.

On the agent side, I chose to execute the whole logic in the bmat CLI rather than the agent daemon, with the agent daemon's only responsibility being proxying API requests between bmat and the server. This allowed me to output more useful logs than what would've been practical when implementing the logic in the agent daemon.

There are three things missing in this PR that I'm going to tackle in followups:

  • We need to implement quotas for cache uploads, purging the least recently used cache(s) when a Buildomat user / GitHub repository reaches their quota.
  • We need a user API for listing and purging caches.
  • We need a scheduled task to purge incomplete multipart uploads.
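
The LRU quota policy in the first bullet can be sketched as follows (illustrative only; the function name, signature, and tuple layout are assumptions): sort caches by last use and evict oldest-first until the owner is back under quota.

```rust
/// Given each cache's size and last-used time, pick the least recently
/// used caches to purge until the owner is back under quota.
fn caches_to_purge(
    mut caches: Vec<(String, u64, u64)>, // (key, size_bytes, last_used)
    quota_bytes: u64,
) -> Vec<String> {
    // Oldest caches first, so they are the first candidates for eviction.
    caches.sort_by_key(|&(_, _, last_used)| last_used);
    let mut total: u64 = caches.iter().map(|&(_, sz, _)| sz).sum();
    let mut purge = Vec::new();
    for (key, sz, _) in caches {
        if total <= quota_bytes {
            break;
        }
        total -= sz;
        purge.push(key);
    }
    purge
}
```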

Fixes #32

@emilyalbini emilyalbini requested a review from jclulow February 21, 2026 00:25
@emilyalbini
Member Author

The first commit is best reviewed with whitespace off.

Collaborator

@jclulow jclulow left a comment


With my apologies for the delays, I have taken an architectural level look at the backend parts of this change, and made some comments.

I have run out of time at this exact minute, but I'll come back separately and look at the additions to the bmat command. Those feel like they represent a Committed interface, but the backend storage stuff at least represents cache objects that we can always drop and recreate as needed.

One of the commits, create wrappers for sending control messages in the agent, feels essentially separate, but also seems uncontroversial. Could we break that out into a separate PR so that we can get that in first? The GitHub review UI is even worse than I remember for large multi-commit changes, and it would help to keep separate things in separate PRs, I think.

Comment on lines +1151 to +1166
/*
 * According to the documentation for the request, "The processing of a
 * CompleteMultipartUpload request could take several minutes to finalize".
 * We don't want to block the CI job until the request succeeds. Rather,
 * complete the upload in a background task, and immediately return.
 */
let c = c.clone();
let log = log.clone();
tokio::spawn(async move {
    info!(log, "started async task to complete cache upload {upload_id}");
    if let Err(err) = complete_cache_upload(c, upload, etags).await {
        error!(log, "failed to complete cache upload {upload_id}: {err}");
    } else {
        info!(log, "completed cache upload {upload_id}");
    }
});
Collaborator


I would rather we do wait for the commit to occur. We already have a similar construct in the existing file upload mechanism where the agent waits for commit to occur.

If we don't wait, we are potentially building up an unbounded amount of unobservable work in the background here without any back pressure.

It's also not clear to me what happens if we return success from the complete request before we've actually done the work, and then the buildomat server is restarted for some reason?

Member Author

@emilyalbini emilyalbini Mar 6, 2026


I would rather we do wait for the commit to occur. We already have a similar construct in the existing file upload mechanism where the agent waits for commit to occur.

Hmm, my understanding is that for output uploads we do need that consistency (as job dependencies need to download them as inputs), but for caches there isn't really any case I can see where that consistency is needed. Dependent jobs shouldn't rely on caches at all (but rather on outputs), and for other jobs it doesn't matter whether the acknowledgement happens synchronously or asynchronously: the cache is going to become available to them at the same time either way.

Is there any case I'm missing where we would benefit from consistency? If not, it's kind of wasteful to delay the CI build until the write is acknowledged by S3, and I'd rather avoid it.

If we don't wait, we are potentially building up an unbounded amount of unobservable work in the background here without any back pressure.

It's also not clear to me what happens if we return success from the complete request before we've actually done the work, and then the buildomat server is restarted for some reason?

That's a fair point! I will move this from the one-off background task into a queue of cache uploads to complete (that a persistent background worker processes), so that we can control how many uploads are processed at the same time and can restart them if the server gets killed in the meantime.

Member Author


To address the second and third paragraphs of this, when the agent finishes uploading and the endpoint is called, the server sets the new time_finish column in cache_pending_upload to the current timestamp and returns.

There is then a new persistent background task that looks for finished uploads and completes them one at a time. This both solves the back pressure problem and ensures only one multipart upload is completed at a time, handling the case of two caches with the same key being uploaded simultaneously.

I also added some logic to handle the case where Buildomat crashes between completing the S3 multipart upload and making changes to the database.
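
The shape of that single-worker queue can be illustrated with a toy model (a channel and thread stand in for the cache_pending_upload table and the real S3 completion call; all names here are hypothetical):

```rust
use std::sync::mpsc;
use std::thread;

/// Stand-in for the S3 CompleteMultipartUpload call.
fn complete_upload(upload_id: &str) -> String {
    format!("completed {upload_id}")
}

/// Drain a queue of finished uploads on a single persistent worker,
/// completing them strictly one at a time. The real implementation would
/// poll the cache_pending_upload table instead of a channel.
fn process_finished_uploads(ids: Vec<String>) -> Vec<String> {
    let (tx, rx) = mpsc::channel::<String>();
    let worker = thread::spawn(move || {
        rx.into_iter().map(|id| complete_upload(&id)).collect::<Vec<_>>()
    });
    for id in ids {
        tx.send(id).unwrap();
    }
    drop(tx); // close the queue so the worker exits
    worker.join().unwrap()
}
```

Because a single worker drains the queue, completion is serialized, which is what bounds the in-flight work and avoids racing completions for the same key.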

Comment on lines +1069 to +1085
while remaining_bytes > 0 {
    part_number += 1;
    chunk_upload_urls.push(
        c.s3.upload_part()
            .bucket(&c.config.storage.bucket)
            .key(c.cache_object_key(j.owner, &path.name))
            .upload_id(&s3_upload_id)
            .part_number(part_number as _)
            .content_length(remaining_bytes.min(CHUNK_SIZE))
            .presigned(preconf.clone())
            .await
            .or_500()?
            .uri()
            .to_string(),
    );
    remaining_bytes -= CHUNK_SIZE;
}
Collaborator


It seems like this will create an amount of work to do during this request that scales with the size of the cache object, and potentially with the rate at which the backend object store is willing to complete these requests, etc, which may be variable or delayed.

I reckon we should probably instead have an endpoint that the agent calls for each target chunk (the agent would specify the offset and length) and we'd sign just one UploadPart URL for them at that time. That way, each request is essentially O(1) rather than O(overall request difficulty)?

Member Author


I want to measure the actual impact of this before saying more (since caches could get fairly large), but one thing to note is that the call to presigned() doesn't make any network request to S3 (or any network request at all). It computes the signature locally, using the AWS credentials configured in the SDK. The reason there's an await there is that the SDK might need to refresh credentials, and even then it would only happen once.

Member Author


I checked, and presigning takes around 1ms to do the crypto needed for the signature on my laptop. We have a chunk size of 50MB, so even for a 10GB cache (which I very much don't expect to ever have) it'd take at most ~200ms. I think the load of 200 separate requests would be greater than doing everything in a single one.
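
The sizing above is just a ceiling division over the chunk size (a sketch, not the PR's code; binary units assumed for GB/MB):

```rust
/// Number of UploadPart URLs needed for an object of `size_bytes` split
/// into `chunk_size`-byte parts (the final part may be shorter).
fn part_count(size_bytes: u64, chunk_size: u64) -> u64 {
    size_bytes.div_ceil(chunk_size)
}
```

With 50 MiB chunks, a 10 GiB object needs 205 parts, so at roughly 1ms of signing each the presigning loop finishes in about a fifth of a second.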

Comment on lines +1087 to +1096
let upload_id =
    c.db.record_pending_cache_upload(
        j.owner,
        &path.name,
        w.id,
        size_bytes,
        &s3_upload_id,
        part_number,
    )
    .or_500()?;
Collaborator


We ought to do this registration prior to speaking with S3 at all, to convert the user-provided name into a ULID that represents the upload and use that ULID to create the multi-part upload. That way, we'll always have a record of every multi-part upload that we may start (assuming it doesn't fail and we don't crash) and won't accidentally leak any.

We should also probably sanitise the user-provided path name, and cap the user-provided cache object size, while doing so.

Member Author


We ought to do this registration prior to speaking with S3 at all, to convert the user-provided name into a ULID that represents the upload and use that ULID to create the multi-part upload. That way, we'll always have a record of every multi-part upload that we may start (assuming it doesn't fail and we don't crash) and won't accidentally leak any.

Hmm, what would be the purpose of that database row if it doesn't contain the S3 upload ID? Note that even if we crash somewhere between calling S3 and storing the entry in the database, we can still clean up afterwards, as the ListMultipartUploads S3 API would surface the orphaned upload.

We should also probably sanitise the user-provided path name, and cap the user-provided cache object size, while doing so.

Yes! Quotas were also something I was planning to defer until after the MVP implementation to reduce the review load, but I can implement a basic version of them now.

Collaborator


To be clear, I just mean setting a maximum size on a single object, even if we don't do other quota stuff yet.

Member Author

@emilyalbini emilyalbini Mar 11, 2026


Added a configurable cap for individual cache objects and validation for the cache key names.
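
The exact validation rules aren't shown in the thread; a minimal sketch of what key-name validation could look like (the length cap and character set here are assumptions, not the PR's actual rules):

```rust
/// Hypothetical sketch of cache key validation: nonempty, bounded length,
/// restricted to characters that are safe in an S3 object key, and no
/// `..` sequences that could traverse outside the per-owner prefix.
fn valid_cache_key(name: &str) -> bool {
    !name.is_empty()
        && name.len() <= 128
        && name
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || matches!(c, '-' | '_' | '.' | '/'))
        && !name.contains("..")
}
```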

@emilyalbini
Member Author

One of the commits, create wrappers for sending control messages in the agent, feels essentially separate, but also seems uncontroversial. Could we break that out into a separate PR so that we can get that in first?

Done! #79

@emilyalbini emilyalbini changed the title Add support for saving and restoring caches add support for saving and restoring caches Mar 11, 2026
@emilyalbini
Member Author

Addressed most of the feedback in the force push I did; there are still some open things where I'd love your thoughts on my rationale.

@emilyalbini emilyalbini requested a review from jclulow March 11, 2026 17:00
@emilyalbini
Member Author

I noticed some slow compression and extraction for actual omicron builds that somehow I'm not seeing in my test builds; I'll investigate.



Development

Successfully merging this pull request may close these issues.

support for potentially persisting ephemera between builds
