Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
214 changes: 214 additions & 0 deletions BENCHMARKS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,214 @@
# WebUI Benchmark Suite

WebUI ships a layered benchmark suite for measuring SSR rendering
performance. Each layer answers a different question, so a thorough
performance investigation runs **multiple** benches before & after a
change and compares.

This document is the reference for what to run, when to run it, and
how to compare results.

## Quick reference

| Bench | Layer | Wall time | What it measures | Use when |
|---|---|---|---|---|
| `cargo xtask bench all` | criterion micro | ~5 min | per-fn wall-clock for parser, handler, protocol, expressions, state, webui (incl. streaming + contact-book) | full snapshot of every micro-bench |
| `cargo xtask bench streaming` | criterion micro | ~60 s | writer-path wall-clock + first-chunk TTFB | inner-loop iteration on the streaming module |
| `cargo xtask bench contact-book` | criterion micro | ~90 s | end-to-end render at 10/100/1000 contacts | inner-loop iteration on handler/state/expressions |
| `cargo xtask bench streaming-resource` | example | ~30 s | exact alloc count + bytes + getrusage CPU + RSS | proving zero-alloc claims; allocation regression hunting |
| `cargo xtask bench streaming-e2e-ttfb` | example | ~10 s | HTTP-level TTFB / TTLB through actix | confirming wire-level streaming win |
| `cargo xtask bench streaming-browser` | Playwright | ~30 s | real Chromium TTFB / FCP / LCP / DCL / load | proving user-perceived paint improvement |
| `cargo xtask bench full` (= `streaming-all`) | suite | ~3 min | runs all four streaming-related benches in sequence | full streaming evidence pack for a PR |

## The before/after workflow

All benches support **named baselines**. The flag pattern is
identical across criterion, example, and Playwright benches:

```bash
# 1. Snapshot current numbers as 'before'
cargo xtask bench full --save-baseline before

# 2. Make your change …

# 3. Compare against 'before'
cargo xtask bench full --baseline before
```

Baselines are stored at `target/bench-baselines/`:

* `streaming-resource-<name>.json` — alloc + RSS + CPU table
* `e2e-ttfb-<name>.json` — HTTP TTFB/TTLB table
* `browser-<name>.json` — browser metrics table
* `target/criterion/<bench>/<name>` — criterion's native baseline
directory tree

The compare phase prints a Δ%-table for every row. Negative Δ% =
improvement; positive = regression.

### Threshold guidance

| Source | Treat as noise | Treat as signal |
|---|---|---|
| criterion (well-isolated wall-clock) | < ±2% | > ±5% |
| streaming-resource (alloc count) | exact — any change matters | any non-zero |
| streaming-resource (bytes, CPU) | < ±2% | > ±5% |
| streaming-e2e-ttfb (loopback) | < ±10% | > ±20% |
| streaming-browser (real Chromium) | < ±5% | > ±15% |

## Anatomy of each bench

### Criterion benches (`cargo bench`-driven)

Standard criterion harnesses. Each crate has its own `benches/` dir:

* `crates/webui-parser/benches/parser_bench.rs`
* `crates/webui-protocol/benches/protocol_bench.rs`
* `crates/webui-handler/benches/handler_bench.rs`
* `crates/webui-expressions/benches/expressions_bench.rs`
* `crates/webui-state/benches/state_bench.rs`
* `crates/webui/benches/contact_book_bench.rs` — end-to-end render
* `crates/webui/benches/streaming_bench.rs` — writer-path wall-clock + TTFB

These integrate with criterion's HTML reports
(`target/criterion/report/index.html`) and native baseline support
(`--save-baseline NAME` / `--baseline NAME`). `cargo xtask bench`
passes those flags through so you don't need to remember `cargo
bench` invocation details.

### `streaming-resource` (counting allocator + getrusage)

`crates/webui/examples/streaming_resource_bench.rs` installs a custom
`GlobalAlloc` that exact-counts every `alloc`/`realloc` call. Why an
example, not a criterion bench? Criterion's harness allocates during
its sampling loop, which would pollute a counting allocator. Examples
run a clean process where every `alloc` we observe came from the code
under test (or its dependencies).

Reports per (path × scale):
- **allocs/run** — exact count from the custom allocator
- **bytes/run** — exact bytes requested from the allocator
- **wall µs/run** — `Instant::elapsed()` per iteration
- **user µs/run** — `getrusage(RUSAGE_SELF).ru_utime` delta
- **process RSS** — `ru_maxrss` high-water mark

This is the **only** bench in the suite that gives you exact
allocation numbers. Use it to verify "zero per-write allocation"
claims and to detect allocation-pressure regressions.

### `streaming-e2e-ttfb` (HTTP-level)

`crates/webui/examples/streaming_e2e_ttfb_bench.rs` spawns a real
actix-web server with `/buf` and `/stream` endpoints, then drives
both with the `awc` HTTP client. Reports min/p50/p99 for both TTFB
(time to first byte) and TTLB (time to last byte) at four
render-cost scenarios.

Faster than the browser bench (~10 s vs ~30 s) and doesn't need
Chromium installed. Use it as the smoke check before paying for the
full browser bench.

### `streaming-browser` (Playwright + Chromium)

`examples/integration/streaming-browser-bench/` is a separate package
with its own actix server and a Playwright spec that drives Chromium
through `PerformanceObserver`. Reports the **only** browser-perceived
metrics in the suite:

- **TTFB** — `responseStart - requestStart` from `PerformanceNavigationTiming`
- **FCP** — first-contentful-paint from `PerformanceObserver`
- **LCP** — largest-contentful-paint from `PerformanceObserver`
- **DCL** — `domContentLoadedEventEnd - startTime`
- **load** — `loadEventEnd - startTime`

This is the bench that answers "does streaming actually help users
see the page faster?" The HTTP-level benches prove the bytes get to
the wire faster; only this one proves Chrome paints faster.

The spec also asserts a **hard regression check**: at the 100 ms
render scenario, streaming TTFB must be ≥5× lower than buffered
TTFB. If that ever fails, something is fundamentally wrong with the
implementation.

## Recommended PR workflow

For any change touching `crates/webui/src/streaming.rs` or its
callers:

```bash
# 1. Establish baseline on the unmodified code
cargo xtask bench full --save-baseline before

# 2. Make your change

# 3. Compare
cargo xtask bench full --baseline before

# 4. Paste the four Δ%-tables into the PR description
```

For changes touching the handler / parser / state / protocol /
expressions crates:

```bash
cargo xtask bench all --save-baseline before
# … change …
cargo xtask bench all --baseline before
```

The criterion `--baseline` flag emits the per-bench `change:` lines
inline (e.g. `Performance has improved` / `regressed` / `within
noise threshold`).

## Where the data lives

* **Stdout** — every bench prints a human-readable table.
* **JSON snapshots** — non-criterion benches write to
`target/bench-baselines/`.
* **Criterion HTML** — `target/criterion/report/index.html` for full
PDF/CDF plots and per-baseline violin plots.

## Why so many benches?

Each layer measures a different thing. A change can:

- improve allocation count but regress wall-clock (allocator changes)
- improve micro-bench wall-clock but regress browser FCP (chunk-size
changes that hurt parser progressive rendering)
- improve TTFB but introduce a memory leak (no cleanup of pool
buffers on error paths)

Running the full suite catches all of these. Running just one layer
catches one third of them.

## Reproducibility tips

* **Close other applications** — CPU-intensive background work adds
noise.
* **Plug in to power** (laptops) — battery savers throttle the CPU.
* **Pin to release builds** — `cargo bench` and `cargo xtask bench`
always use release; debug builds are not representative.
* **Run on the same machine** — cross-machine baselines are not
meaningful.
* **Compare medians (P50)**, not means — robust against thermal
spikes.
* **Re-run if Dev% > 15%** in any criterion row.

## Authoring guidance

If you add a new performance-sensitive feature, also add a
benchmark. The bar:

1. **Criterion** if the unit-of-work is a single function call. Add a
`[[bench]]` entry to the relevant crate's `Cargo.toml`.
2. **Example with `--save NAME`/`--compare NAME`** if you need
process-wide measurement (custom allocator, getrusage, an HTTP
server, etc.). Mirror the structure of
`streaming_resource_bench.rs`.
3. **Playwright** if the metric is browser-perceived (paint, layout,
hydration time). Mirror the structure of
`examples/integration/streaming-browser-bench/`.

Wire it into `cargo xtask bench` so the standard before/after
workflow works without users needing to know per-bench invocation
details.
55 changes: 55 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ members = [
"xtask",
"examples/integration/rust",
"examples/integration/ssr-performance-showdown",
"examples/integration/streaming-browser-bench/server",
"examples/app/commerce/server",
"examples/demo/server",
]
Expand Down Expand Up @@ -46,6 +47,10 @@ mime_guess = "2.0.5"
html-escape = "0.2.13"
async-stream = "0.3.6"
futures-util = "0.3.31"
tokio-stream = "0.1.18"
bytes = "1.10.1"
crossbeam-queue = "0.3.12"
memchr = "2.8.0"
notify = "8.2.0"
notify-debouncer-mini = "0.7.0"
percent-encoding = "2.3.2"
Expand Down
Loading
Loading