Native benchmark support for A/B testing #867

@hildebrandmw

Description

This issue describes an approach for supporting automated A/B testing in diskann-benchmark-runner for use in regression tests, both during CI and for local development.

It builds on #865.

The idea is that we can introduce a new benchmark type

trait Checkable: Benchmark<Output: serde::de::DeserializeOwned> {
    /// A tolerance input JSON that defines the accepted deviations (if any).
    type Tolerances: diskann_benchmark_runner::Input;

    /// The "success" type used to format the output JSON of differences/regressions.
    type Ok: serde::Serialize;

    /// Use the input `tolerances` file to check the before and after runs.
    fn check(
        tolerances: &Self::Tolerances,
        input: &Self::Input,
        before: &Self::Output,
        after: &Self::Output,
    ) -> anyhow::Result<Self::Ok>;
}

Here, the `Ok` type plays the same role as a benchmark `Output`: it contains whatever the benchmark author deems important to report, such as the difference in QPS.
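To make the shape concrete, here is a minimal, self-contained sketch of what an implementation could look like. The `Benchmark` trait, the search benchmark, and all field names (`qps`, `max_relative_drop`, etc.) are hypothetical stand-ins, and `anyhow::Result`/serde bounds are replaced with plain std types so the sketch compiles on its own:

```rust
// Stub for the runner's existing `Benchmark` trait (hypothetical shape).
trait Benchmark {
    type Input;
    type Output;
}

// Simplified stand-in for `Checkable`; the real trait carries serde bounds
// and returns `anyhow::Result`.
trait Checkable: Benchmark {
    type Tolerances;
    type Ok;
    fn check(
        tolerances: &Self::Tolerances,
        input: &Self::Input,
        before: &Self::Output,
        after: &Self::Output,
    ) -> Result<Self::Ok, String>;
}

// Hypothetical search benchmark.
struct SearchBench;

#[allow(dead_code)]
struct SearchInput { num_queries: u64 }
struct SearchOutput { qps: f64 }

impl Benchmark for SearchBench {
    type Input = SearchInput;
    type Output = SearchOutput;
}

/// Accepted relative QPS regression, e.g. 0.05 == 5%.
struct QpsTolerance { max_relative_drop: f64 }

/// What gets written to the final report for this entry.
struct QpsReport { before_qps: f64, after_qps: f64, relative_change: f64 }

impl Checkable for SearchBench {
    type Tolerances = QpsTolerance;
    type Ok = QpsReport;

    fn check(
        tolerances: &QpsTolerance,
        _input: &SearchInput,
        before: &SearchOutput,
        after: &SearchOutput,
    ) -> Result<QpsReport, String> {
        let relative_change = (after.qps - before.qps) / before.qps;
        if relative_change < -tolerances.max_relative_drop {
            return Err(format!("QPS regressed by {:.1}%", -relative_change * 100.0));
        }
        Ok(QpsReport {
            before_qps: before.qps,
            after_qps: after.qps,
            relative_change,
        })
    }
}
```

The key point is that `check` is written entirely against concrete Rust types; the runner handles deserializing `before.json`/`after.json` into `Self::Output` before calling it.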

The benchmark registry would then have an additional method

impl Benchmarks {
    fn register_checkable<T>(&mut self, name: impl Into<String>)
    where
        T: Checkable,
    {
        // ...
    }
}

The internally created trait object can be expanded to an interface that reports on whether or not a particular benchmark is checkable.
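One possible shape for that interface, sketched with hypothetical names (`DynBenchmark`, `is_checkable`, and the two entry types are illustrative only):

```rust
// Object-safe view over a registered benchmark. Entries created through
// `register_checkable` would report `is_checkable() == true`, letting the
// CLI reject `check` invocations against non-checkable benchmarks early.
trait DynBenchmark {
    fn name(&self) -> &str;
    /// True when the underlying benchmark also implements `Checkable`.
    fn is_checkable(&self) -> bool;
}

// Illustrative entries; in the real registry these would wrap the
// registered benchmark types.
struct PlainEntry;
struct CheckableEntry;

impl DynBenchmark for PlainEntry {
    fn name(&self) -> &str { "plain" }
    fn is_checkable(&self) -> bool { false }
}

impl DynBenchmark for CheckableEntry {
    fn name(&self) -> &str { "checkable" }
    fn is_checkable(&self) -> bool { true }
}
```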

On the CLI side, we add a new command

check tolerances.json --input input.json [--before before.json] [--after after.json] [--dry-run?]

Here, tolerances.json consists of JSON like

[
  {
    "input": ...,
    "tolerances": ...
  },
  ...
]

Since Checkable is a subtrait of Benchmark, each Tolerances input is associated with an actual Input for the benchmark. At this level, the associated "input" field in the JSON can be a subset of the overall input, allowing tolerances to be associated with multiple actual inputs to reduce boilerplate.
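For example, a tolerances file could look like the following (field names are hypothetical, chosen only to illustrate the subset-matching idea):

```json
[
  {
    "input": { "benchmark": "search", "dataset": "sift-1m" },
    "tolerances": { "max_relative_drop": 0.05 }
  },
  {
    "input": { "benchmark": "build" },
    "tolerances": { "max_relative_drop": 0.10 }
  }
]
```

Here the first entry applies only to the search run on `sift-1m`, while the second applies to every `build` run regardless of its other input fields.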

The command is set up so

check tolerances.json --input input.json

can perform a dry run, ensuring that all the tolerance structs in the input are parseable and match cleanly with the input JSON (each entry in input.json is matched exactly once, and each entry in tolerances.json matches at least one entry in input.json). This works as a pre-flight check before a long-running CI job, catching misconfigurations before we spend 40 minutes running benchmarks.
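The matching rules above could be sketched like this, using flat string maps as a stand-in for the parsed JSON objects (the real implementation would match on deserialized input structs; all names here are hypothetical):

```rust
use std::collections::BTreeMap;

/// Flattened stand-in for a parsed JSON object.
type Fields = BTreeMap<String, String>;

/// A tolerance entry matches an input entry when the tolerance's "input"
/// fields are a subset of the input entry's fields.
fn matches(tolerance_input: &Fields, input: &Fields) -> bool {
    tolerance_input.iter().all(|(k, v)| input.get(k) == Some(v))
}

/// Pre-flight validation: every input entry is matched by exactly one
/// tolerance entry, and every tolerance entry matches at least one input.
fn validate(tolerances: &[Fields], inputs: &[Fields]) -> Result<(), String> {
    for (i, input) in inputs.iter().enumerate() {
        let hits = tolerances.iter().filter(|t| matches(t, input)).count();
        if hits != 1 {
            return Err(format!("input entry {i} matched {hits} tolerance entries (want 1)"));
        }
    }
    for (i, tol) in tolerances.iter().enumerate() {
        if !inputs.iter().any(|input| matches(tol, input)) {
            return Err(format!("tolerance entry {i} matched no input entries"));
        }
    }
    Ok(())
}
```

Requiring exactly one match per input keeps the report unambiguous; a stricter or looser policy (e.g. first-match-wins) would be an easy variation.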

The full

check tolerances.json --input input.json --before before.json --after after.json

is a quick-running process that actually performs the regression test and generates a final report file. Each implementation of Checkable::check can emit auxiliary files if, for example, a CSV is desired. Note, though, that we may need to think through how to generate unique file names when an input tolerance fuzzily matches multiple runs.

The files passed as before.json and after.json need to come from the same associated input.json. At check time, ensuring the right types get deserialized gets us 90% of the way towards gracefully handling situations where this contract is violated, and that's probably good enough.

The advantage for benchmark writers is that the comparison code for regressions can be written against the concrete Rust types for the output, with no need to guess at the JSON structure using brittle parsing tools. Most mismatches become compile-time errors, and the pre-flight checks should catch most of the remaining mismatch errors early. Furthermore, the same system can be used both in CI and locally.
