add support for saving and restoring caches#78

Open
emilyalbini wants to merge 4 commits into main from ea-cache

Conversation


@emilyalbini emilyalbini commented Feb 21, 2026

This PR adds caching support in Buildomat, driven by the bmat CLI.

The core building blocks of this implementation are the bmat cache save and bmat cache restore commands. These commands are fairly low level: they require the user to explicitly provide the cache key, and bmat cache save also requires an exhaustive list of files to cache to be provided via stdin.

On top of this building block, the bmat cache rust save and bmat cache rust restore commands wrap the low level primitives and automatically determine the cache key and list of files to cache. This allows us to enforce best practices (like caching only third-party dependencies out of the target/ directory) and reduces the effort to adopt caching.
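
The PR text doesn't spell out how the rust wrappers derive the key, but the idea can be sketched (illustration only; the function name, inputs, and key format here are assumptions, not the PR's actual scheme): combine the toolchain version with a hash of the inputs that determine the contents of target/, such as the lockfile.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical sketch: derive a cache key from the inputs that determine
/// the contents of target/ — the lockfile contents and toolchain version.
fn rust_cache_key(cargo_lock: &str, toolchain: &str) -> String {
    let mut hasher = DefaultHasher::new();
    cargo_lock.hash(&mut hasher);
    toolchain.hash(&mut hasher);
    format!("rust-{}-{:016x}", toolchain, hasher.finish())
}
```

A real implementation would want a hash that is stable across Rust releases (e.g. SHA-256) rather than DefaultHasher, whose algorithm is not guaranteed between versions.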

Caches are stored in S3-compatible object storage: when a worker requests a cache from the Buildomat server, the server generates a pre-signed download URL and returns it to the client. Similarly, when a worker wants to save a cache, the server returns the pre-signed URLs the agent must upload the file parts to. Multipart uploads are used to increase concurrency during upload.

I chose to return pre-signed URLs to the worker rather than using the chunked upload facility in Buildomat for two reasons: (a) Buildomat chunked uploads are tied to the lifetime of a job, while caches will outlast jobs, and (b) streaming large uploads and downloads through the Buildomat server adds unnecessary latency and overhead.

On the agent side, I chose to execute the whole logic in the bmat CLI rather than the agent daemon, with the agent daemon's only responsibility being proxying API requests between bmat and the server. This allowed me to output more useful logs than what would've been practical when implementing the logic in the agent daemon.

There are three things missing in this PR that I'm going to tackle in followups:

  • We need to implement quotas for cache uploads, purging the least recently used cache(s) when a Buildomat user / GitHub repository reaches their quota.
  • We need a user API for listing and purging caches.
  • We need a scheduled task to purge incomplete multipart uploads.
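
The LRU quota policy in the first bullet can be sketched as follows (illustrative only; the function name, signature, and tuple layout are assumptions): sort caches by last use and evict oldest-first until the owner is back under quota.

```rust
/// Given each cache's size and last-used time, pick the least recently
/// used caches to purge until the owner is back under quota.
fn caches_to_purge(
    mut caches: Vec<(String, u64, u64)>, // (key, size_bytes, last_used)
    quota_bytes: u64,
) -> Vec<String> {
    // Oldest caches first, so they are the first candidates for eviction.
    caches.sort_by_key(|&(_, _, last_used)| last_used);
    let mut total: u64 = caches.iter().map(|&(_, sz, _)| sz).sum();
    let mut purge = Vec::new();
    for (key, sz, _) in caches {
        if total <= quota_bytes {
            break;
        }
        total -= sz;
        purge.push(key);
    }
    purge
}
```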

Fixes #32

@emilyalbini emilyalbini requested a review from jclulow February 21, 2026 00:25
@emilyalbini
Member Author

The first commit is best reviewed with whitespace off.

Collaborator

@jclulow jclulow left a comment


With my apologies for the delays, I have taken an architectural level look at the backend parts of this change, and made some comments.

I have run out of time at this exact minute, but I'll come back separately and look at the additions to the bmat command. Those feel like they represent a Committed interface, but the backend storage stuff at least represents cache objects that we can always drop and recreate as needed.

One of the commits, create wrappers for sending control messages in the agent, feels essentially separate, but also seems uncontroversial. Could we break that out into a separate PR so that we can get that in first? The GitHub review UI is even worse than I remember for large multi-commit changes, and it would help to keep separate things in separate PRs, I think.

Comment on lines +1151 to +1166
/*
 * According to the documentation for the request, "The processing of a
 * CompleteMultipartUpload request could take several minutes to finalize".
 * We don't want to block the CI job until the request succeeds. Rather,
 * complete the upload in a background task, and immediately return.
 */
let c = c.clone();
let log = log.clone();
tokio::spawn(async move {
    info!(log, "started async task to complete cache upload {upload_id}");
    if let Err(err) = complete_cache_upload(c, upload, etags).await {
        error!(log, "failed to complete cache upload {upload_id}: {err}");
    } else {
        info!(log, "completed cache upload {upload_id}");
    }
});
Collaborator


I would rather we do wait for the commit to occur. We already have a similar construct in the existing file upload mechanism where the agent waits for commit to occur.

If we don't wait, we are potentially building up an unbounded amount of unobservable work in the background here without any back pressure.

It's also not clear to me what happens if we return success from the complete request before we've actually done the work, and then the buildomat server is restarted for some reason?

Member Author

@emilyalbini emilyalbini Mar 6, 2026


I would rather we do wait for the commit to occur. We already have a similar construct in the existing file upload mechanism where the agent waits for commit to occur.

Hmm, my understanding is that for output uploads we do need that consistency (as job dependencies need to download them as inputs), but for caches there isn't really any case I can see where that consistency is needed. Dependent jobs shouldn't rely on caches at all (but rather on outputs), and for other jobs it doesn't matter whether the acknowledgement happens synchronously or asynchronously: the cache is going to become available to them at the same time either way.

Is there any case I'm missing where we would benefit from consistency? If not, it's kind of wasteful to delay the CI build until the write is acknowledged by S3, and I'd rather avoid it.

If we don't wait, we are potentially building up an unbounded amount of unobservable work in the background here without any back pressure.

It's also not clear to me what happens if we return success from the complete request before we've actually done the work, and then the buildomat server is restarted for some reason?

That's a fair point! I will move this from the one-off background task into a queue of cache uploads to complete (that a persistent background worker processes), so that we can control how many uploads are processed at the same time and can restart them if the server gets killed in the meantime.

Member Author


To address the second and third paragraphs of this, when the agent finishes uploading and the endpoint is called, the server sets the new time_finish column in cache_pending_upload to the current timestamp and returns.

There is then a new persistent background task that looks for finished uploads and completes them one at a time. This both solves the back pressure problem and ensures only one multipart upload is completed at a time, handling the case of two caches with the same key being uploaded simultaneously.

I also added some logic to handle the case where Buildomat crashes between completing the S3 multipart upload and making changes to the database.
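
The shape of that single-worker queue can be illustrated with a toy model (a channel and thread stand in for the cache_pending_upload table and the real S3 completion call; all names here are hypothetical):

```rust
use std::sync::mpsc;
use std::thread;

/// Stand-in for the S3 CompleteMultipartUpload call.
fn complete_upload(upload_id: &str) -> String {
    format!("completed {upload_id}")
}

/// Drain a queue of finished uploads on a single persistent worker,
/// completing them strictly one at a time. The real implementation would
/// poll the cache_pending_upload table instead of a channel.
fn process_finished_uploads(ids: Vec<String>) -> Vec<String> {
    let (tx, rx) = mpsc::channel::<String>();
    let worker = thread::spawn(move || {
        rx.into_iter().map(|id| complete_upload(&id)).collect::<Vec<_>>()
    });
    for id in ids {
        tx.send(id).unwrap();
    }
    drop(tx); // close the queue so the worker exits
    worker.join().unwrap()
}
```

Because a single worker drains the queue, completion is serialized, which is what bounds the in-flight work and avoids racing completions for the same key.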

Comment on lines +1069 to +1085
while remaining_bytes > 0 {
    part_number += 1;
    chunk_upload_urls.push(
        c.s3.upload_part()
            .bucket(&c.config.storage.bucket)
            .key(c.cache_object_key(j.owner, &path.name))
            .upload_id(&s3_upload_id)
            .part_number(part_number as _)
            .content_length(remaining_bytes.min(CHUNK_SIZE))
            .presigned(preconf.clone())
            .await
            .or_500()?
            .uri()
            .to_string(),
    );
    remaining_bytes -= CHUNK_SIZE;
}
Collaborator


It seems like this will create an amount of work to do during this request that scales with the size of the cache object, and potentially with the rate at which the backend object store is willing to complete these requests, etc, which may be variable or delayed.

I reckon we should probably instead have an endpoint that the agent calls for each target chunk (the agent would specify the offset and length) and we'd sign just one UploadPart URL for them at that time. That way, each request is essentially O(1) rather than O(overall request difficulty)?

Member Author


I want to measure the actual impact of this before saying more (since caches could get fairly large), but one thing to note is that the call to presigned() doesn't make any network request to S3 (or any network request at all). It computes the signature locally, using the AWS credentials configured in the SDK. The reason there's an await there is that the SDK might need to refresh credentials, and even then it would only happen once.

Member Author


I checked, and presigning takes around 1ms to do the crypto needed for the signature on my laptop. We have a chunk size of 50MB, so even for a 10GB cache (which I very much don't expect to ever have) it'd take at most ~200ms. I think the load of 200 separate requests would be greater than doing everything in a single one.
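
The sizing above is just a ceiling division over the chunk size (a sketch, not the PR's code; binary units assumed for GB/MB):

```rust
/// Number of UploadPart URLs needed for an object of `size_bytes` split
/// into `chunk_size`-byte parts (the final part may be shorter).
fn part_count(size_bytes: u64, chunk_size: u64) -> u64 {
    size_bytes.div_ceil(chunk_size)
}
```

With 50 MiB chunks, a 10 GiB object needs 205 parts, so at roughly 1ms of signing each the presigning loop finishes in about a fifth of a second.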

Comment on lines +1087 to +1096
let upload_id =
    c.db.record_pending_cache_upload(
        j.owner,
        &path.name,
        w.id,
        size_bytes,
        &s3_upload_id,
        part_number,
    )
    .or_500()?;
Collaborator


We ought to do this registration prior to speaking with S3 at all, to convert the user-provided name into a ULID that represents the upload and use that ULID to create the multi-part upload. That way, we'll always have a record of every multi-part upload that we may start (assuming it doesn't fail and we don't crash) and won't accidentally leak any.

We should also probably sanitise the user-provided path name, and cap the user-provided cache object size, while doing so.

Member Author


We ought to do this registration prior to speaking with S3 at all, to convert the user-provided name into a ULID that represents the upload and use that ULID to create the multi-part upload. That way, we'll always have a record of every multi-part upload that we may start (assuming it doesn't fail and we don't crash) and won't accidentally leak any.

Hmm, what would be the purpose of that database row if it doesn't contain the S3 upload ID? Note that even if we crash somewhere between calling S3 and storing the entry in the database, we can still clean up afterwards, as the ListMultipartUploads S3 API would surface the orphaned upload.

We should also probably sanitise the user-provided path name, and cap the user-provided cache object size, while doing so.

Yes! Quotas were also something I was planning to defer until after the MVP implementation to reduce the review load, but I can implement a basic version of them now.

Collaborator


To be clear, I just mean setting a maximum size on a single object, even if we don't do other quota stuff yet.

Member Author

@emilyalbini emilyalbini Mar 11, 2026


Added a configurable cap for individual cache objects and validation for the cache key names.
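
The exact validation rules aren't shown in the thread; a minimal sketch of what key-name validation could look like (the length cap and character set here are assumptions, not the PR's actual rules):

```rust
/// Hypothetical sketch of cache key validation: nonempty, bounded length,
/// restricted to characters that are safe in an S3 object key, and no
/// `..` sequences that could traverse outside the per-owner prefix.
fn valid_cache_key(name: &str) -> bool {
    !name.is_empty()
        && name.len() <= 128
        && name
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || matches!(c, '-' | '_' | '.' | '/'))
        && !name.contains("..")
}
```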

@emilyalbini
Member Author

One of the commits, create wrappers for sending control messages in the agent, feels essentially separate, but also seems uncontroversial. Could we break that out into a separate PR so that we can get that in first?

Done! #79

@emilyalbini emilyalbini changed the title Add support for saving and restoring caches add support for saving and restoring caches Mar 11, 2026
@emilyalbini
Member Author

Addressed most of the feedback in the force push I did; there are still some open things where I'd love your thoughts on my rationale.

@emilyalbini emilyalbini requested a review from jclulow March 11, 2026 17:00
@emilyalbini
Member Author

I noticed some slow compression and extraction for actual omicron builds that somehow I'm not seeing in my test builds; I'll investigate.



Development

Successfully merging this pull request may close these issues.

support for potentially persisting ephemera between builds
