Native benchmark support for A/B testing #867

@hildebrandmw

Description

This issue describes an approach for supporting automated A/B testing in diskann-benchmark-runner for use in regression tests, both during CI and for local development.

It builds on #865.

The idea is that we can introduce a new benchmark type

trait Checkable: Benchmark<Output: serde::de::DeserializeOwned> {
    /// A tolerance input JSON that defines the accepted deviations (if any).
    type Tolerances: diskann_benchmark_runner::Input;

    /// The "success" type used to format the output JSON of differences/regressions.
    type Ok: serde::Serialize;

    /// Use the input `tolerances` file to check the before and after runs.
    fn check(
        tolerances: &Self::Tolerances,
        input: &Self::Input,
        before: &Self::Output,
        after: &Self::Output,
    ) -> anyhow::Result<Self::Ok>;
}

Here, the `Ok` type plays the same role as a benchmark `Output`: it contains whatever the benchmark author deems important to report, such as the difference in QPS.
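To make the shape concrete, here is a minimal, self-contained sketch of what an implementation could look like. The `Benchmark` trait, the search benchmark, and all field names (`qps`, `max_relative_drop`, etc.) are hypothetical stand-ins, and `anyhow::Result`/serde bounds are replaced with plain std types so the sketch compiles on its own:

```rust
// Stub for the runner's existing `Benchmark` trait (hypothetical shape).
trait Benchmark {
    type Input;
    type Output;
}

// Simplified stand-in for `Checkable`; the real trait carries serde bounds
// and returns `anyhow::Result`.
trait Checkable: Benchmark {
    type Tolerances;
    type Ok;
    fn check(
        tolerances: &Self::Tolerances,
        input: &Self::Input,
        before: &Self::Output,
        after: &Self::Output,
    ) -> Result<Self::Ok, String>;
}

// Hypothetical search benchmark.
struct SearchBench;

#[allow(dead_code)]
struct SearchInput { num_queries: u64 }
struct SearchOutput { qps: f64 }

impl Benchmark for SearchBench {
    type Input = SearchInput;
    type Output = SearchOutput;
}

/// Accepted relative QPS regression, e.g. 0.05 == 5%.
struct QpsTolerance { max_relative_drop: f64 }

/// What gets written to the final report for this entry.
struct QpsReport { before_qps: f64, after_qps: f64, relative_change: f64 }

impl Checkable for SearchBench {
    type Tolerances = QpsTolerance;
    type Ok = QpsReport;

    fn check(
        tolerances: &QpsTolerance,
        _input: &SearchInput,
        before: &SearchOutput,
        after: &SearchOutput,
    ) -> Result<QpsReport, String> {
        let relative_change = (after.qps - before.qps) / before.qps;
        if relative_change < -tolerances.max_relative_drop {
            return Err(format!("QPS regressed by {:.1}%", -relative_change * 100.0));
        }
        Ok(QpsReport {
            before_qps: before.qps,
            after_qps: after.qps,
            relative_change,
        })
    }
}
```

The key point is that `check` is written entirely against concrete Rust types; the runner handles deserializing `before.json`/`after.json` into `Self::Output` before calling it.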

The benchmark registry would then have an additional method

impl Benchmarks {
    fn register_checkable<T>(&mut self, name: impl Into<String>)
    where
        T: Checkable,
    {
        // ...
    }
}

The internally created trait object can be expanded to an interface that reports on whether or not a particular benchmark is checkable.
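One possible shape for that interface, sketched with hypothetical names (`DynBenchmark`, `is_checkable`, and the two entry types are illustrative only):

```rust
// Object-safe view over a registered benchmark. Entries created through
// `register_checkable` would report `is_checkable() == true`, letting the
// CLI reject `check` invocations against non-checkable benchmarks early.
trait DynBenchmark {
    fn name(&self) -> &str;
    /// True when the underlying benchmark also implements `Checkable`.
    fn is_checkable(&self) -> bool;
}

// Illustrative entries; in the real registry these would wrap the
// registered benchmark types.
struct PlainEntry;
struct CheckableEntry;

impl DynBenchmark for PlainEntry {
    fn name(&self) -> &str { "plain" }
    fn is_checkable(&self) -> bool { false }
}

impl DynBenchmark for CheckableEntry {
    fn name(&self) -> &str { "checkable" }
    fn is_checkable(&self) -> bool { true }
}
```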

On the CLI side, we add a new command

check tolerances.json --input input.json [--before before.json] [--after after.json] [--dry-run?]

Here, tolerances.json consists of JSON like

[
  {
    "input": ...,
    "tolerances": ...
  },
  ...
]

Since Checkable is a subtrait of Benchmark, each Tolerances input is associated with an actual Input for the benchmark. At this level, the associated "input" field in the JSON can be a subset of the overall input, allowing tolerances to be associated with multiple actual inputs to reduce boilerplate.
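For example, a tolerances file could look like the following (field names are hypothetical, chosen only to illustrate the subset-matching idea):

```json
[
  {
    "input": { "benchmark": "search", "dataset": "sift-1m" },
    "tolerances": { "max_relative_drop": 0.05 }
  },
  {
    "input": { "benchmark": "build" },
    "tolerances": { "max_relative_drop": 0.10 }
  }
]
```

Here the first entry applies only to the search run on `sift-1m`, while the second applies to every `build` run regardless of its other input fields.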

The command is set up so

check tolerances.json --input input.json

can perform a dry run, ensuring that all the tolerance structs in the input are parseable and match cleanly with the input JSON (each entry in input.json is matched exactly once, and each entry in tolerances.json matches at least one entry in input.json). This works as a pre-flight check before a long-running CI job, catching misconfigurations before we spend 40 minutes running benchmarks.
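The matching rules above could be sketched like this, using flat string maps as a stand-in for the parsed JSON objects (the real implementation would match on deserialized input structs; all names here are hypothetical):

```rust
use std::collections::BTreeMap;

/// Flattened stand-in for a parsed JSON object.
type Fields = BTreeMap<String, String>;

/// A tolerance entry matches an input entry when the tolerance's "input"
/// fields are a subset of the input entry's fields.
fn matches(tolerance_input: &Fields, input: &Fields) -> bool {
    tolerance_input.iter().all(|(k, v)| input.get(k) == Some(v))
}

/// Pre-flight validation: every input entry is matched by exactly one
/// tolerance entry, and every tolerance entry matches at least one input.
fn validate(tolerances: &[Fields], inputs: &[Fields]) -> Result<(), String> {
    for (i, input) in inputs.iter().enumerate() {
        let hits = tolerances.iter().filter(|t| matches(t, input)).count();
        if hits != 1 {
            return Err(format!("input entry {i} matched {hits} tolerance entries (want 1)"));
        }
    }
    for (i, tol) in tolerances.iter().enumerate() {
        if !inputs.iter().any(|input| matches(tol, input)) {
            return Err(format!("tolerance entry {i} matched no input entries"));
        }
    }
    Ok(())
}
```

Requiring exactly one match per input keeps the report unambiguous; a stricter or looser policy (e.g. first-match-wins) would be an easy variation.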

The full

check tolerances.json --input input.json --before before.json --after after.json

is a quick-running process that actually performs the regression test and generates a final report file. Each implementation of Checkable::check can emit auxiliary files if, for example, a CSV is desired. Note, though, that we may need to think through how to generate unique file names when an input tolerance fuzzily matches multiple runs.

The files passed as before.json and after.json need to come from the same associated input.json. At check time, ensuring the right types get deserialized gets us 90% of the way towards gracefully handling situations where this contract is violated, and that's probably good enough.

The advantage for benchmark writers is that the comparison code for regressions can be written against the concrete Rust types for the output, with no need to guess at the JSON structure using brittle parsing tools. Most mismatches become compile-time errors, and the pre-flight checks should catch most of the remaining mismatch errors early. Furthermore, the same system can be used both in CI and locally.
