Switch to larger Helios instances for CI #10018
Merged
emilyalbini merged 1 commit into main on Mar 10, 2026
davepacheco
approved these changes
Mar 10, 2026
Collaborator
davepacheco
left a comment
Looks great (once you rebase onto #10001).
2a54172 to 1a2568e
sunshowers
approved these changes
Mar 10, 2026
This PR switches the slowest Helios jobs to larger instance sizes, significantly reducing CI times. Alongside this, I have already deployed a Buildomat configuration change that runs all Helios jobs on Zen 4 AWS instances, instead of Zen 3 instances on either AWS or lab Gimlets. Together, these two changes should bring CI times down considerably.
Unfortunately we cannot use Zen 5 AWS instances (like we did on Linux) until oxidecomputer/stlouis#938 is fixed.
build-and-test (helios)
This job was being slowed down by a nondeterministic amount because it ran out of memory and was forced to aggressively page memory to disk. It turns out it was using around 150% of the RAM allocated to the VM. Switching to memory-optimized AWS instances (2x the RAM) fixed the problem.
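The arithmetic behind that fix can be sketched as follows. The 150% and 2x figures come from the description above; the 64 GiB starting size is a purely hypothetical placeholder, since the actual instance sizes are not stated here.

```python
def memory_utilization(peak_gib: float, ram_gib: float) -> float:
    """Return peak memory use as a fraction of the RAM available to the VM."""
    return peak_gib / ram_gib


# Hypothetical VM size, for illustration only.
old_ram_gib = 64.0
peak_gib = 1.5 * old_ram_gib   # the job used around 150% of the allocated RAM
new_ram_gib = 2 * old_ram_gib  # memory-optimized instance with 2x the RAM

before = memory_utilization(peak_gib, old_ram_gib)
after = memory_utilization(peak_gib, new_ram_gib)

# Anything above 100% has to spill to disk; below 100% the job fits in RAM.
print(f"before: {before:.0%}, after: {after:.0%}")  # before: 150%, after: 75%
```

Whatever the real starting size is, the ratio is the same: the same peak usage takes up about 75% of the doubled RAM, so the job no longer needs to page.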
The switch from 16 cores to 32 cores is fairly expensive and has diminishing returns, as it did for the Linux instance, but it is still a 15-minute win. Once we switch to Zen 5, it might be worth going back to 16 cores.
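One common way to reason about diminishing returns like this is Amdahl's law. The sketch below is a generic model, not something measured in this PR, and the 90% parallel fraction is a made-up value chosen only to show the shape of the curve.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup over a single core under Amdahl's law."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical: assume 90% of the job's work parallelizes perfectly.
parallel_fraction = 0.9

for cores in (8, 16, 32):
    speedup = amdahl_speedup(parallel_fraction, cores)
    print(f"{cores:>2} cores: {speedup:.2f}x")
```

Under that assumption each doubling of the core count buys less than the previous one, while the instance cost roughly doubles each time.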
omicron-common
The switch has negligible impact on a job this short, but it's not worth creating a dedicated target just to keep this job on Zen 3, so it gets moved to Zen 4 as a side effect.
helios / package
The gains from going from 8 to 16 cores are not that impressive, but this job is a dependency of the "deploy" job, which we cannot really speed up (it needs to run on a lab Gimlet, and as far as I'm aware we can't shard it), so any time we can shave off is worth it.
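The reasoning is about the critical path: a job cannot start before its dependencies finish, so minutes saved in "package" come straight off the finish time of "deploy". A minimal sketch, where the job names come from this description but every duration is a hypothetical number:

```python
def finish_time(job: str, durations: dict[str, int], deps: dict[str, list[str]]) -> int:
    """Earliest finish time of a job, given that it waits for all its dependencies."""
    start = max((finish_time(d, durations, deps) for d in deps[job]), default=0)
    return start + durations[job]


# "deploy" depends on "package".
deps = {"package": [], "deploy": ["package"]}

# Hypothetical durations in minutes, not measurements from CI.
before = {"package": 20, "deploy": 60}
after = {"package": 15, "deploy": 60}  # only "package" gets faster

print(finish_time("deploy", before, deps))  # 80
print(finish_time("deploy", after, deps))   # 75
```

Even though "deploy" itself is unchanged, the chain finishes earlier by exactly the time saved in "package".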
helios / build TUF repo
As with the build-and-test job, this job was paging memory to disk because the VM did not have enough memory (though to a lesser extent). After the size increase, much of the remaining CPU usage was single-threaded, so I didn't bother testing more cores.
check-features (helios)
This job is mostly single-threaded, so I tried aggressively reducing the VM size, but with mixed results. In the end I decided to keep it on the standard target, which is now Zen 4.
clippy (helios)
Turns out there was zero benefit going from 8 to 16 cores for this job.