The latency of Merlin queries depends on many different factors, such as:
- The buffer it's run on as a whole; in particular, its size and typing complexity.
- The location inside the buffer at which it's run.
- The dependency graph of the buffer.
- Whether a PPX is applied, and which one.
- Merlin's cache state at the moment the query is run.
- Which Merlin query is run.
So for meaningful benchmark results, we need to run Merlin on a big variety of input samples. We've written `merl-an` to generate such an input sample set in a random but deterministic way. It has a `merl-an benchmark` command, which persists the telemetry part of the Merlin response in the format expected by `current-bench`.
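As a rough illustration (not `merl-an`'s actual code), the sketch below builds a payload in the `name`/`results`/`metrics` JSON layout that `current-bench` consumes; the function name, benchmark names, and metric names are all hypothetical, and it assumes the `yojson` library is available.

```ocaml
(* Hypothetical sketch: turn one Merlin query timing into a current-bench
   result entry and print the full payload as JSON. *)
let result_of_timing ~query ~file ~ms : Yojson.Safe.t =
  `Assoc
    [ ("name", `String (query ^ " on " ^ file));
      ( "metrics",
        `List
          [ `Assoc
              [ ("name", `String "latency");
                ("value", `Float ms);
                ("units", `String "ms");
              ];
          ] );
    ]

let () =
  let payload =
    `Assoc
      [ ("name", `String "merlin");
        ( "results",
          `List
            [ result_of_timing ~query:"type-enclosing" ~file:"store.ml" ~ms:3.2 ] );
      ]
  in
  print_endline (Yojson.Safe.to_string payload)
```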
The next steps to get a Merlin benchmark CI up and running are:
- Finish the PoC for a `current-bench` CI on Merlin using `merl-an`. We're currently blocked on this by a `current-bench` issue. Done: see the PoC graphs.
- Improve the separation into different benchmarks (in `merl-an`): I think that, with the current `merl-an` output, `current-bench` will create a separate graph for each file that's being benchmarked. That doesn't scale. Instead: one graph per cache workflow and per query, or similar (see the grouping sketch after this list).
- Improve the Docker set-up: The whole benchmark set-up, such as installing `merl-an` and fetching the code base on which we run Merlin, should be done inside the container, etc.
- Filter out spikes (in `merl-an`): Non-reproducible latency spikes (i.e. timings that exceed the expected timing by more than a factor of 10) mess up the scale of the `current-bench` graphs (see the filtering sketch after this list).
- Add a cold-cache workflow to the benchmarks: The reason why the numbers look so good at the moment is that both the cmi-cache and the typer cache are fully warmed on all queries. Additionally, it would be interesting to have benchmarks for when the caches are cold.
- Improve the output UX: When some samples call for attention, we'll want to know which location and query they correspond to.
- Lock the versions of the dependencies of the project on which we run Merlin: Currently, we use Irmin as the code base to run the benchmarks on. We install Irmin's dependencies via `opam` without locking their versions. If a dependency splits or merges modules, or increases the size of a module, the cmi- and cmt-files will vary. That adds Merlin-independent noise to the benchmarks. To avoid that, we could vendor a fixed version of each dependency.
- Find a more significant project input base. For now, we only use Irmin as a code base to run the benchmarks on.
- Our CI will be very resource-heavy. We'll need to decide when to run the benchmarks. `current-bench` supports running the benchmarks only "on demand" (i.e. when tagging the PR with a certain flag).
- Possibly: It might also be interesting to track the number of latency spikes.
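For the benchmark-separation item, a minimal sketch of what grouping could look like, assuming per-sample data of the form (cache workflow, query, latency). This is not `merl-an` code and all names are made up; the point is only that results get bucketed per (workflow, query) pair rather than per file, so `current-bench` draws one graph per pair.

```ocaml
(* Hypothetical sketch: bucket timings by (cache workflow, query). *)
module Key = struct
  type t = { workflow : string; query : string }
  let compare = Stdlib.compare
end

module KeyMap = Map.Make (Key)

(* [samples] is a list of (workflow, query, latency_in_ms) triples. *)
let group_by_workflow_and_query samples =
  List.fold_left
    (fun acc (workflow, query, ms) ->
      let key = { Key.workflow; query } in
      let previous = Option.value ~default:[] (KeyMap.find_opt key acc) in
      KeyMap.add key (ms :: previous) acc)
    KeyMap.empty samples
```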
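For the spike-filtering item, one possible approach (a sketch under my own assumptions, not `merl-an`'s implementation) is to drop samples that exceed the median of their group by more than the chosen factor:

```ocaml
(* Hypothetical sketch: filter out non-reproducible latency spikes. *)
let median = function
  | [] -> invalid_arg "median: empty list"
  | xs ->
      let sorted = List.sort compare xs in
      List.nth sorted (List.length sorted / 2)

let filter_spikes ?(factor = 10.) = function
  | [] -> []
  | timings ->
      let m = median timings in
      List.filter (fun t -> t <= factor *. m) timings
```

For example, `filter_spikes [3.1; 2.9; 48.0; 3.0]` keeps the three samples around 3 ms and drops the 48 ms outlier.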