Conversation

@ludfjig (Contributor) commented Nov 5, 2025

Many of our benchmarks exhibit different performance characteristics depending on the size of the sandbox. This PR restructures the benchmark suite to run relevant benchmarks across four heap sizes (default, 8MB, 64MB, 256MB), providing better visibility into how performance scales with memory allocation. Because this multiplies the number of benchmarks, CI benchmark execution time will increase proportionally.
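
A minimal sketch of how the size variants might be encoded (the SIZES table, the names, and the exact byte values are illustrative assumptions, not necessarily the PR's actual code):

    // Illustrative mapping from variant name to an optional heap size in
    // bytes; None stands for the library's default configuration.
    const SIZES: &[(&str, Option<u64>)] = &[
        ("default", None),
        ("small", Some(8 * 1024 * 1024)),   // 8MB
        ("medium", Some(64 * 1024 * 1024)), // 64MB
        ("large", Some(256 * 1024 * 1024)), // 256MB
    ];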

Also slightly reorganizes the benchmarks into clearer categories. The multiple consecutive for-loops over sizes might seem odd at first, but they ensure cargo bench runs the same benchmark with all sizes before moving on to the next benchmark (see the sketch after the listing below). cargo bench -- --list now yields the following:

sandboxes/create_uninitialized/default: benchmark
sandboxes/create_uninitialized/small: benchmark
sandboxes/create_uninitialized/medium: benchmark
sandboxes/create_uninitialized/large: benchmark
sandboxes/create_uninitialized_and_drop/default: benchmark
sandboxes/create_uninitialized_and_drop/small: benchmark
sandboxes/create_uninitialized_and_drop/medium: benchmark
sandboxes/create_uninitialized_and_drop/large: benchmark
sandboxes/create_initialized/default: benchmark
sandboxes/create_initialized/small: benchmark
sandboxes/create_initialized/medium: benchmark
sandboxes/create_initialized/large: benchmark
sandboxes/create_initialized_and_drop/default: benchmark
sandboxes/create_initialized_and_drop/small: benchmark
sandboxes/create_initialized_and_drop/medium: benchmark
sandboxes/create_initialized_and_drop/large: benchmark

guest_calls/call/default: benchmark
guest_calls/call/small: benchmark
guest_calls/call/medium: benchmark
guest_calls/call/large: benchmark
guest_calls/call_with_restore/default: benchmark
guest_calls/call_with_restore/small: benchmark
guest_calls/call_with_restore/medium: benchmark
guest_calls/call_with_restore/large: benchmark
guest_calls/call_with_host_function/default: benchmark
guest_calls/call_with_host_function/small: benchmark
guest_calls/call_with_host_function/medium: benchmark
guest_calls/call_with_host_function/large: benchmark
guest_calls/different_thread: benchmark
guest_calls/interrupt_latency: benchmark

snapshots/create/default: benchmark
snapshots/create/small: benchmark
snapshots/create/medium: benchmark
snapshots/create/large: benchmark
snapshots/restore/default: benchmark
snapshots/restore/small: benchmark
snapshots/restore/medium: benchmark
snapshots/restore/large: benchmark

guest_functions_with_large_parameters/guest_call_with_large_parameters: benchmark

function_call_serialization/serialize_function_call: benchmark
function_call_serialization/deserialize_function_call: benchmark

sample_workloads/24K_in_8K_out_c: benchmark
sample_workloads/24K_in_8K_out_rust: benchmark
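
The consecutive-loop structure could look roughly like the following, reusing the illustrative SIZES table from the first snippet (create_uninit_sandbox is a hypothetical helper standing in for the real sandbox construction):

    fn sandbox_benches(c: &mut criterion::Criterion) {
        let mut group = c.benchmark_group("sandboxes");

        // First loop: every size variant of create_uninitialized runs
        // before any create_uninitialized_and_drop variant starts.
        for (name, heap) in SIZES {
            group.bench_function(format!("create_uninitialized/{name}"), |b| {
                // iter_with_large_drop keeps the sandbox's drop time out
                // of the measurement
                b.iter_with_large_drop(|| create_uninit_sandbox(*heap))
            });
        }
        // Second loop: plain iter drops the sandbox inside the timed
        // section, so this variant also measures teardown.
        for (name, heap) in SIZES {
            group.bench_function(format!("create_uninitialized_and_drop/{name}"), |b| {
                b.iter(|| create_uninit_sandbox(*heap))
            });
        }
        group.finish();
    }

The same pattern would repeat for the other grouped benchmarks (guest_calls, snapshots), which is why the consecutive loops appear several times.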

Also adds the new snapshots/create and snapshots/restore benchmarks, which are useful for tracking how snapshot creation and restore perform across sandbox sizes.
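
A hedged sketch of their shape (snapshot() and restore() are assumed method names, and create_sandbox is a hypothetical setup helper):

    // Hypothetical shape of the new benchmarks; actual API names may differ.
    group.bench_function("snapshots/create", |b| {
        let mut sandbox = create_sandbox();
        b.iter(|| sandbox.snapshot().unwrap());
    });
    group.bench_function("snapshots/restore", |b| {
        let mut sandbox = create_sandbox();
        let snapshot = sandbox.snapshot().unwrap();
        b.iter(|| sandbox.restore(&snapshot).unwrap());
    });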

Closes #722

@ludfjig added the kind/enhancement (for PRs adding features, improving functionality, docs, tests, etc.) and area/performance (addresses performance) labels Nov 5, 2025
@ludfjig force-pushed the organize_bench branch 3 times, most recently from 46942d2 to 862c136, on November 5, 2025 19:59
Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>
@andreiltd (Member) left a comment

This looks great!

I know this is not part of the changes, but there are some tests that could benefit from using iter_batched to avoid measuring expensive setup. An example is guest_call_with_large_parameters, which clones huge data inside the measurement loop and could be rewritten as:

    b.iter_batched(
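        // setup closure: the clones run here, outside the timed section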
        || (large_vec.clone(), large_string.clone()),
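        // routine closure: only the guest call itself is measured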
        |(vec, string)| {
            sandbox.call::<()>("LargeParameters", (vec, string)).unwrap()
        },
        criterion::BatchSize::SmallInput,
    );

I think this is important because if the time spent on test setup dominates the total measured time (e.g. 90%), then only a small fraction of the benchmark reflects the code we actually want to measure. That makes it hard to detect meaningful performance changes: if cloning accounts for 90% of each iteration, even a 20% regression in the call itself shifts the total by only ~2%, which is easily drowned out by noise. We should pay extra attention here if we want to keep the measurements meaningful -- sorry for the off-topic :-)

@ludfjig (Contributor, Author) commented Nov 7, 2025

> I know this is not part of the changes, but there are some tests that could benefit from using iter_batched to avoid measuring expensive setup. […]

You are totally right! We should fix this!

@ludfjig ludfjig merged commit bb0d9a7 into hyperlight-dev:main Nov 7, 2025
41 checks passed
@ludfjig ludfjig deleted the organize_bench branch November 7, 2025 18:11