Conversation

@ludfjig (Contributor) commented Nov 5, 2025

Many of our benchmarks exhibit different performance characteristics depending on the size of the sandbox. This PR restructures the benchmark suite to run relevant benchmarks across four heap sizes (default, 8MB, 64MB, 256MB), providing better visibility into how performance scales with memory allocation. Because this multiplies the number of benchmarks, CI benchmark execution time will increase proportionally.
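
A minimal sketch of how the size variants might be encoded (the SIZES table, the names, and the exact byte values are illustrative assumptions, not necessarily the PR's actual code):

    // Illustrative mapping from variant name to an optional heap size in
    // bytes; None stands for the library's default configuration.
    const SIZES: &[(&str, Option<u64>)] = &[
        ("default", None),
        ("small", Some(8 * 1024 * 1024)),   // 8MB
        ("medium", Some(64 * 1024 * 1024)), // 64MB
        ("large", Some(256 * 1024 * 1024)), // 256MB
    ];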

Also slightly reorganizes the benchmarks into clearer categories. The multiple consecutive for-loops over sizes might seem odd at first, but they ensure cargo bench runs the same benchmark with all sizes before moving on to the next benchmark (see the sketch after the listing below). cargo bench -- --list now yields the following:

sandboxes/create_uninitialized/default: benchmark
sandboxes/create_uninitialized/small: benchmark
sandboxes/create_uninitialized/medium: benchmark
sandboxes/create_uninitialized/large: benchmark
sandboxes/create_uninitialized_and_drop/default: benchmark
sandboxes/create_uninitialized_and_drop/small: benchmark
sandboxes/create_uninitialized_and_drop/medium: benchmark
sandboxes/create_uninitialized_and_drop/large: benchmark
sandboxes/create_initialized/default: benchmark
sandboxes/create_initialized/small: benchmark
sandboxes/create_initialized/medium: benchmark
sandboxes/create_initialized/large: benchmark
sandboxes/create_initialized_and_drop/default: benchmark
sandboxes/create_initialized_and_drop/small: benchmark
sandboxes/create_initialized_and_drop/medium: benchmark
sandboxes/create_initialized_and_drop/large: benchmark

guest_calls/call/default: benchmark
guest_calls/call/small: benchmark
guest_calls/call/medium: benchmark
guest_calls/call/large: benchmark
guest_calls/call_with_restore/default: benchmark
guest_calls/call_with_restore/small: benchmark
guest_calls/call_with_restore/medium: benchmark
guest_calls/call_with_restore/large: benchmark
guest_calls/call_with_host_function/default: benchmark
guest_calls/call_with_host_function/small: benchmark
guest_calls/call_with_host_function/medium: benchmark
guest_calls/call_with_host_function/large: benchmark
guest_calls/different_thread: benchmark
guest_calls/interrupt_latency: benchmark

snapshots/create/default: benchmark
snapshots/create/small: benchmark
snapshots/create/medium: benchmark
snapshots/create/large: benchmark
snapshots/restore/default: benchmark
snapshots/restore/small: benchmark
snapshots/restore/medium: benchmark
snapshots/restore/large: benchmark

guest_functions_with_large_parameters/guest_call_with_large_parameters: benchmark

function_call_serialization/serialize_function_call: benchmark
function_call_serialization/deserialize_function_call: benchmark

sample_workloads/24K_in_8K_out_c: benchmark
sample_workloads/24K_in_8K_out_rust: benchmark
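
The consecutive-loop structure could look roughly like the following, reusing the illustrative SIZES table from the first snippet (create_uninit_sandbox is a hypothetical helper standing in for the real sandbox construction):

    fn sandbox_benches(c: &mut criterion::Criterion) {
        let mut group = c.benchmark_group("sandboxes");

        // First loop: every size variant of create_uninitialized runs
        // before any create_uninitialized_and_drop variant starts.
        for (name, heap) in SIZES {
            group.bench_function(format!("create_uninitialized/{name}"), |b| {
                // iter_with_large_drop keeps the sandbox's drop time out
                // of the measurement
                b.iter_with_large_drop(|| create_uninit_sandbox(*heap))
            });
        }
        // Second loop: plain iter drops the sandbox inside the timed
        // section, so this variant also measures teardown.
        for (name, heap) in SIZES {
            group.bench_function(format!("create_uninitialized_and_drop/{name}"), |b| {
                b.iter(|| create_uninit_sandbox(*heap))
            });
        }
        group.finish();
    }

The same pattern would repeat for the other grouped benchmarks (guest_calls, snapshots), which is why the consecutive loops appear several times.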

Also adds the new snapshots/create and snapshots/restore benchmarks, which are useful for tracking how snapshot creation and restore perform across sandbox sizes.
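
A hedged sketch of their shape (snapshot() and restore() are assumed method names, and create_sandbox is a hypothetical setup helper):

    // Hypothetical shape of the new benchmarks; actual API names may differ.
    group.bench_function("snapshots/create", |b| {
        let mut sandbox = create_sandbox();
        b.iter(|| sandbox.snapshot().unwrap());
    });
    group.bench_function("snapshots/restore", |b| {
        let mut sandbox = create_sandbox();
        let snapshot = sandbox.snapshot().unwrap();
        b.iter(|| sandbox.restore(&snapshot).unwrap());
    });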

Closes #722

@ludfjig added the kind/enhancement (for PRs adding features, improving functionality, docs, tests, etc.) and area/performance (addresses performance) labels Nov 5, 2025
@ludfjig force-pushed the organize_bench branch 3 times, most recently from 46942d2 to 862c136, on November 5, 2025 19:59
Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>
@andreiltd (Member) left a comment

This looks great!

I know this is not part of the changes, but there are some tests that could benefit from using iter_batched to avoid measuring expensive setup. An example is guest_call_with_large_parameters, which clones huge data inside the measurement loop and could be rewritten as:

    b.iter_batched(
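        // setup closure: the clones run here, outside the timed section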
        || (large_vec.clone(), large_string.clone()),
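        // routine closure: only the guest call itself is measured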
        |(vec, string)| {
            sandbox.call::<()>("LargeParameters", (vec, string)).unwrap()
        },
        criterion::BatchSize::SmallInput,
    );

I think this is important because if the time spent on test setup dominates the total measured time (e.g. 90%), then only a small fraction of the benchmark reflects the code we actually want to measure. That makes it hard to detect meaningful performance changes: if cloning accounts for 90% of each iteration, even a 20% regression in the call itself shifts the total by only ~2%, which is easily drowned out by noise. We should pay extra attention here if we want to keep the measurements meaningful -- sorry for the off-topic :-)

@ludfjig (Contributor, Author) commented Nov 7, 2025

> I know this is not part of the changes, but there are some tests that could benefit from using iter_batched to avoid measuring expensive setup. […]

You are totally right! We should fix this!

@ludfjig ludfjig merged commit bb0d9a7 into hyperlight-dev:main Nov 7, 2025
41 checks passed
@ludfjig ludfjig deleted the organize_bench branch November 7, 2025 18:11