# Inference Metrics: Latency, Throughput, and UX

This notebook is a hands-on introduction to the key performance aspects of running inference with Large Language Models (LLMs).

You’ll learn:

- What **latency** and **throughput** mean in the context of LLMs.
- Why these metrics often trade off against each other. 
- How different parameters (like batch size, prompt length, sampling strategy) affect performance.
- How to measure and visualize **p50 vs p90 latency**, **first-token vs total latency**, and find the "sweet spot".
- What this means for real-world **user experience**.

By the end of this notebook, you'll not only be able to benchmark an LLM — you'll know what the numbers actually mean and how to tune them for real applications.

## Preliminaries

**Before you begin**, make sure you have:

- An NVIDIA GPU environment
- Your Hugging Face [access token](https://huggingface.co/settings/token)

Let's check which GPUs are available in our system:

In [None]:
!nvidia-smi

### Authenticating with Hugging Face

In [None]:
# ⬇️ Run this cell once
from ipywidgets import Password, Button, HBox, Output
import os, pathlib
import sys

from huggingface_hub import HfFolder, whoami

# ---- UI widgets ----
token_box = Password(
    description="HF Token:",
    placeholder="paste your Hugging Face token here",
    layout={"width": "450px"},
)
save_btn = Button(description="Save", button_style="success")
out = Output()

# ---- Callback ----
def save_token(_):
    out.clear_output()
    token = token_box.value.strip()
    with out:
        if not token:
            print("❌ No token entered.")
            return
        # Persist token
        HfFolder.save_token(token)                 # writes to ~/.cache/huggingface/token
        os.environ["HF_TOKEN"] = token             # current kernel env (optional)
        # Sanity-check who we are
        try:
            user = whoami(token)["name"]
            print(f"✅ Token saved. Logged in as: {user}")
        except Exception as e:
            print("⚠️ Token saved, but user lookup failed:", e)

save_btn.on_click(save_token)

display(HBox([token_box, save_btn]), out)

## Loading the model

In [1]:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
import os

MODEL_ID = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)

tp = int(os.getenv("TP_SIZE", "1"))  # set >1 if you want tensor parallel across GPUs

llm = LLM(
    model=MODEL_ID,
    dtype="float16",
    trust_remote_code=True,
    tensor_parallel_size=tp,
    max_num_seqs=64,
    max_model_len=131072,
    mamba_ssm_cache_dtype="float32",
)

INFO 09-10 00:15:00 [__init__.py:241] Automatically detected platform cuda.
INFO 09-10 00:15:03 [utils.py:326] non-default args: {'model': 'nvidia/NVIDIA-Nemotron-Nano-9B-v2', 'trust_remote_code': True, 'dtype': 'float16', 'max_model_len': 131072, 'max_num_seqs': 64, 'disable_log_stats': True, 'mamba_ssm_cache_dtype': 'float32'}


The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.


INFO 09-10 00:15:15 [__init__.py:711] Resolved architecture: NemotronHForCausalLM


`torch_dtype` is deprecated! Use `dtype` instead!


INFO 09-10 00:15:15 [__init__.py:1750] Using max model len 131072
INFO 09-10 00:15:15 [arg_utils.py:1781] Mamba is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
INFO 09-10 00:15:15 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 09-10 00:15:15 [llm_engine.py:222] Initializing a V0 LLM engine (v0.10.1.1) with config: model='nvidia/NVIDIA-Nemotron-Nano-9B-v2', speculative_config=None, tokenizer='nvidia/NVIDIA-Nemotron-Nano-9B-v2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False,

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


INFO 09-10 00:15:33 [default_loader.py:262] Loading weights took 11.97 seconds
INFO 09-10 00:15:33 [model_runner.py:1112] Model loading took 16.5557 GiB and 12.859213 seconds
INFO 09-10 00:15:42 [worker.py:295] Memory profiling takes 8.32 seconds
INFO 09-10 00:15:42 [worker.py:295] the current vLLM instance can use total_gpu_memory (47.41GiB) x gpu_memory_utilization (0.90) = 42.67GiB
INFO 09-10 00:15:42 [worker.py:295] model weights take 16.56GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 9.44GiB; the rest of the memory reserved for KV Cache is 16.61GiB.
INFO 09-10 00:15:42 [executor_base.py:114] # cuda blocks: 68042, # CPU blocks: 16384
INFO 09-10 00:15:42 [executor_base.py:119] Maximum concurrency for 131072 tokens per request: 8.31x
INFO 09-10 00:15:44 [model_runner.py:1383] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eage

Capturing CUDA graph shapes:   0%|          | 0/11 [00:00<?, ?it/s]

INFO 09-10 00:15:53 [model_runner.py:1535] Graph capturing finished in 9 secs, took 0.14 GiB
INFO 09-10 00:15:53 [llm_engine.py:417] init engine (profile, create kv cache, warmup model) took 19.99 seconds
INFO 09-10 00:15:55 [llm.py:298] Supported_tasks: ['generate']


### Sampling parameters

When generating text, the model can either:

- **Always pick the highest-probability token** (greedy decoding — fast and deterministic)
- **Sample from the probability distribution** over possible next tokens — which adds variety and creativity

We use the following parameters to control that behavior:

- `temperature = 0.3`: Controls randomness. Lower values → more confident, deterministic outputs. Higher values → more diverse, sometimes erratic responses.
- `top_p = 0.9`: Enables **nucleus sampling** — the model samples only from the top tokens that together make up 90% of the probability mass. Balances diversity and coherence.
- `max_tokens=512`: Controls maximum number of tokens to generate.

Together, these settings keep the output focused while still allowing some variety, which is a reasonable balance for most use cases.
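
To make the two knobs concrete, here is a minimal NumPy sketch of temperature scaling followed by nucleus (top-p) filtering on a toy logit vector. The function name `sample_top_p` and the example logits are illustrative assumptions; vLLM's own sampler is implemented differently and is not shown here.

```python
import numpy as np

def sample_top_p(logits, temperature=0.3, top_p=0.9, rng=None):
    """Pick one token index from toy logits (illustrative, not vLLM's sampler)."""
    if rng is None:
        rng = np.random.default_rng(0)
    logits = np.asarray(logits, dtype=float)

    # temperature == 0 is treated as greedy decoding: always take the argmax
    if temperature == 0.0:
        return int(np.argmax(logits))

    # Temperature scaling followed by a numerically stable softmax
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Keep the smallest set of top tokens whose cumulative mass reaches top_p
    order = np.argsort(-probs)
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]

    # Renormalize over the kept tokens and sample among them
    kept_probs = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept_probs))

toy_logits = [1.0, 5.0, 2.0, 0.5]
print(sample_top_p(toy_logits, temperature=0.0))             # greedy
print(sample_top_p(toy_logits, temperature=0.3, top_p=0.9))  # sampled
```

With a low temperature the distribution becomes so peaked that the nucleus often contains a single token, which is why low-temperature outputs look nearly deterministic.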

When running experiments, you can try changing the prompts and sampling values below to see how the model’s behavior changes!

In [None]:
sampling = SamplingParams(
    max_tokens=512, temperature=0.3, top_p=0.9
)

### Converting user prompts to chat

In [None]:
def to_chat(messages):
    """
    messages = [
      {"role": "system", "content": "/no_think"},   # or "/think"
      {"role": "user",   "content": "Write a haiku about GPUs"}
    ]
    """
    # tokenize=False returns the formatted prompt as a plain string,
    # so no return_tensors argument is needed
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

## Executing inference with Nemotron Nano 9B v2

In [None]:
# Chat prompt with reasoning OFF
messages = [
    {"role":"system", "content": "/no_think"},
    {"role":"user",   "content": "Explain KV cache in one paragraph."}
]
chat_prompt = to_chat(messages)

outs = llm.generate([chat_prompt], sampling)
for i, o in enumerate(outs, 1):
    print(f"=== Output {i} ===\n{o.outputs[0].text}\n")

## Performance experiments

In [None]:
# Nemotron Nano-9B-v2 benchmarking helpers (reasoning OFF)
# --- Imports & reproducibility --- 
import os, time, random, math, json, itertools, statistics, gc
import numpy as np
import torch
import matplotlib.pyplot as plt
import pandas as pd
import nest_asyncio; nest_asyncio.apply()

from transformers import AutoTokenizer
from vllm import SamplingParams  # llm must already exist in your session

import logging  # os is already imported above
os.environ["TQDM_DISABLE"] = "1"                 # tqdm
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1" # huggingface-hub
os.environ["TRANSFORMERS_VERBOSITY"] = "error"   # Transformers text logs

# Optional: quiet vLLM info logs too
logging.getLogger("vllm").setLevel(logging.ERROR)
try:
    from transformers.utils.logging import set_verbosity_error
    set_verbosity_error()
except Exception:
    pass


SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.benchmark = False  # stabilizes variance a bit


def to_chat_no_think(user_text: str) -> str:
    """
    Format a single-turn chat with reasoning explicitly OFF
    using the model's chat template.
    """
    messages = [
        {"role": "system", "content": "/no_think"},
        {"role": "user",   "content": user_text},
    ]
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

# --- Timed generation with optional batching --- 
import io, contextlib

def timed_generate(prompts, sampling_params, as_chat=True, return_counts=True, quiet=True):
    """Return (elapsed_s, total_gen_tokens, [counts], outputs)."""
    if isinstance(prompts, str):
        prompts = [prompts]

    wrapped = [to_chat_no_think(p) for p in prompts] if as_chat else prompts

    # define the actual call with tqdm disabled
    def _call():
        return llm.generate(wrapped, sampling_params, use_tqdm=False)

    # time + suppress stdout/stderr if quiet
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    if quiet:
        with contextlib.redirect_stdout(io.StringIO()), contextlib.redirect_stderr(io.StringIO()):
            outputs = _call()
    else:
        outputs = _call()
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0

    # per-sequence counts
    counts = []
    for o in outputs:
        cand = o.outputs[0]
        if getattr(cand, "token_ids", None) is not None:
            cnt = len(cand.token_ids)
        elif getattr(cand, "token_count", None) is not None:
            cnt = int(cand.token_count)
        else:
            cnt = len(tokenizer.encode(cand.text, add_special_tokens=False))
        counts.append(cnt)

    total = sum(counts)
    if return_counts:
        return elapsed, total, counts, outputs
    return elapsed, total, outputs

### Baseline latency and throughput

Let’s establish a **baseline** for how our model performs on a small batch of short prompts with greedy decoding.

This test measures:

- **Latency**: the total time it takes to generate a complete response
- **Throughput**: the number of tokens generated per second
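
As a quick illustration of how the two numbers relate, the sketch below derives them from one timed batch. The helper name `batch_metrics` and the sample numbers are made up for illustration. It assumes every request in the batch finishes at the same wall-clock time, which is how a blocking batch call looks from the outside.

```python
def batch_metrics(elapsed_s, token_counts):
    """Aggregate and per-request throughput for one timed batch."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    total_tokens = sum(token_counts)
    return {
        "latency_s": elapsed_s,                         # what each user waits
        "aggregate_tok_s": total_tokens / elapsed_s,    # what the server delivers
        "per_request_tok_s": [c / elapsed_s for c in token_counts],
    }

m = batch_metrics(2.0, [100, 100, 50, 50])
print(m["aggregate_tok_s"])    # 150.0
print(m["per_request_tok_s"])  # [50.0, 50.0, 25.0, 25.0]
```

The aggregate figure describes server capacity, while the per-request figures are closer to what an individual user experiences.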

What we’re doing here:

- Use a **small batch of 16 requests** (four prompts repeated four times; feel free to change them!)
- Generate up to `256` tokens per request using greedy decoding (`temperature=0.0`, `top_p=1.0`)
- Call `timed_generate()` to:
  - time the entire inference run
  - count how many tokens were actually generated
- Print the first three generated responses

This gives us a **reference point** to compare against later experiments where we vary different factors.

> This is a simple “real-world” case: a handful of users sending short prompts at the same time.

In [None]:
# --- Example usage --- 
# Greedy for latency/throughput measurement
greedy = SamplingParams(max_tokens=256, temperature=0.0, top_p=1.0)

batch = [
    "The future of AI is",
    "Explain KV cache eviction in ~4 lines.",
    "INT8 vs FP16 for LLMs—give 5 concise tradeoffs.",
    "Why does batching help prefill more than decode?"
] * 4  # 16 requests to exercise scheduling

elapsed, gen_toks, counts, outs  = timed_generate(batch, greedy, as_chat=True, return_counts=True)
tok_s = gen_toks / elapsed if elapsed > 0 else float("nan")
print(f"Requests: {len(batch)} | Gen tokens: {gen_toks} | Time: {elapsed:.2f}s | {tok_s:.1f} tok/s")

# Peek at first few outputs
for i, o in enumerate(outs[:3], 1):
    print(f"\n[{i}] {o.outputs[0].text.strip()[:600]}...")

### Batch size sweep

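Now we’re going to benchmark how inference performance changes with **different batch sizes**, that is, how many prompts we submit in a single `llm.generate()` call. Keep in mind that the engine above was created with `max_num_seqs=64`, so for batch sizes of 128 and 256 vLLM runs at most 64 sequences at a time and queues the rest.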

This experiment helps us understand the **tradeoff between throughput and latency**, and find the batch size that balances performance and responsiveness.

For each batch size in:

```python
[1, 8, 32, 64, 128, 256]
```

we run several repetitions and record:

- Sustained throughput (capacity)
- p50 / p90 total latency and TTFT
- p50 / p90 throughput (from sec/token)
- p50 / p90 ITL (ms/token, excludes first token per active sequence)
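
The metrics above can be sketched on made-up numbers. The latencies, token counts, and the helper name `itl_ms_per_token` below are illustrative assumptions. The ITL formula mirrors the aggregate definition used in this notebook (time after the first token divided by all tokens after the first one in each sequence) rather than a per-token timestamp measurement.

```python
import numpy as np

# Hypothetical total latencies (seconds) from five measured runs
total_latencies_s = [1.0, 1.1, 1.2, 1.3, 5.0]

# method="higher" always returns a latency that was actually observed
p50 = float(np.percentile(total_latencies_s, 50, method="higher"))
p90 = float(np.percentile(total_latencies_s, 90, method="higher"))
print(p50, p90)  # 1.2 5.0

def itl_ms_per_token(t_total_s, t_first_s, counts):
    """Mean inter-token latency in ms, excluding each sequence's first token."""
    post_first = sum(c - 1 for c in counts if c > 0)
    return 1000.0 * max(t_total_s - t_first_s, 0.0) / max(post_first, 1)

itl_ms = itl_ms_per_token(t_total_s=2.0, t_first_s=0.2, counts=[10, 10, 10])
print(f"ITL = {itl_ms:.1f} ms/token")

# Sustained throughput: all generated tokens over all measured wall time
tokens_per_run = [300, 310, 290]
elapsed_per_run_s = [2.0, 2.1, 1.9]
sustained_tok_s = sum(tokens_per_run) / sum(elapsed_per_run_s)
print(f"sustained = {sustained_tok_s:.1f} tok/s")
```

Note how a single slow run barely moves p50 but dominates p90, which is why both are reported.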

> These runs can take a few minutes, especially at large batch sizes — start executing the cell and grab a coffee!

We will be saving the results in the file `batch_benchmark.csv` so you can reuse or visualize the data later.

In [None]:
# Nemotron Nano-9B-v2 batch benchmark (reasoning OFF) — meaningful metrics only

import gc, numpy as np
import torch
import pandas as pd
from vllm import SamplingParams

BATCH_SIZES     = [1, 8, 32, 64, 128, 256]
RUNS_PER_SIZE   = 10
WARMUP_RUNS     = 2                      # drop warmups (JIT/capture)
MAX_TOKENS_FULL = 128
USE_GREEDY_SAMPLING = False
TEMPERATURE     = 0.7
TOP_P           = 0.9

PROMPT_POOL = [
    "The future of AI is",
    "Summarize in 1–2 sentences: why batching helps decode less than prefill.",
    "Give 5 bullets on KV cache eviction strategies.",
    "Explain attention vs. Mamba-2 in 3 concise points.",
    "Write a 2-sentence TL;DR on FP16 vs. BF16 tradeoffs.",
    "List 3 pitfalls when benchmarking LLMs."
]

def make_batch(bs):
    import random
    return [random.choice(PROMPT_POOL) for _ in range(bs)]


def pct(vals, p, method="higher"):
    """Nearest-rank percentile so results reflect actual runs."""
    arr = np.asarray(vals, float)
    arr = arr[~np.isnan(arr)]
    if arr.size == 0:
        return float("nan")
    return float(np.percentile(arr, p, method=method))

records = []

if USE_GREEDY_SAMPLING:
    sampling_opts   = dict(temperature=0.0, top_p=1.0)  # greedy, deterministic
else:
    sampling_opts   = dict(temperature=TEMPERATURE, top_p=TOP_P) # random

for bs in BATCH_SIZES:
    print(f"Processing with the batch size (BS) = {bs:>3} ...")

    # Per-run logs
    first_latencies = []            # seconds
    total_latencies = []            # seconds
    sec_per_token_runs = []         # seconds / token
    itl_sec_per_token_runs = []     # seconds / token (beyond first tokens)

    # Aggregates for sustained throughput
    sum_tokens_total = 0
    sum_elapsed_total = 0.0

    for r in range(RUNS_PER_SIZE + WARMUP_RUNS):
        prompts = make_batch(bs)

        # TTFT proxy: a separate run capped at one token (prefill + first decode step)
        t_first, _, _, _ = timed_generate(
            prompts,
            SamplingParams(**sampling_opts, max_tokens=1),
            as_chat=True,
            return_counts=True
        )

        # Full answer
        t_total, gen_tokens, counts, outs = timed_generate(
            prompts,
            SamplingParams(**sampling_opts, max_tokens=MAX_TOKENS_FULL),
            as_chat=True,
            return_counts=True
        )

        # Warmup discard
        if r < WARMUP_RUNS:
            gc.collect(); torch.cuda.empty_cache()
            continue

        # Track latencies
        first_latencies.append(t_first)
        total_latencies.append(t_total)

        # Sustained throughput aggregates
        sum_tokens_total += gen_tokens
        sum_elapsed_total += t_total

        # Throughput percentiles (via sec/token)
        if gen_tokens > 0:
            sec_per_token_runs.append(t_total / gen_tokens)

        # ITL: exclude one first token per *active* sequence
        post_first_tokens = sum((c - 1) for c in counts if c > 0)
        itl_den = max(post_first_tokens, 1)
        itl_sec_per_token_runs.append(max(t_total - t_first, 0.0) / itl_den)

        gc.collect(); torch.cuda.empty_cache()

    # --- Metrics ---

    # Latency percentiles (end-to-end)
    p50_ttft_s    = pct(first_latencies, 50)
    p90_ttft_s    = pct(first_latencies, 90)
    p50_latency_s = pct(total_latencies, 50)
    p90_latency_s = pct(total_latencies, 90)

    # Sustained throughput = total generated tokens / total wall time
    throughput_sustained_tok_s = (sum_tokens_total / sum_elapsed_total) if sum_elapsed_total > 0 else float("nan")

    # Throughput percentiles (invert sec/token percentiles).
    # p90 of sec/token is the slower tail, so p90 throughput <= p50 throughput.
    p50_thru_tok_s = (1.0 / pct(sec_per_token_runs, 50)) if sec_per_token_runs else float("nan")
    p90_thru_tok_s = (1.0 / pct(sec_per_token_runs, 90)) if sec_per_token_runs else float("nan")

    # ITL percentiles (ms/token)
    itl_p50_ms_per_tok = 1000.0 * pct(itl_sec_per_token_runs, 50) if itl_sec_per_token_runs else float("nan")
    itl_p90_ms_per_tok = 1000.0 * pct(itl_sec_per_token_runs, 90) if itl_sec_per_token_runs else float("nan")

    records.append(dict(
        batch_size                 = bs,
        p50_ttft_s                 = p50_ttft_s,
        p90_ttft_s                 = p90_ttft_s,
        p50_latency_s              = p50_latency_s,
        p90_latency_s              = p90_latency_s,
        throughput_sustained_tok_s = throughput_sustained_tok_s,
        throughput_p50_tok_s       = p50_thru_tok_s,
        throughput_p90_tok_s       = p90_thru_tok_s,
        itl_p50_ms_per_tok         = itl_p50_ms_per_tok,
        itl_p90_ms_per_tok         = itl_p90_ms_per_tok,
        runs                       = RUNS_PER_SIZE
    ))

    print(
        f"BS={bs:>3} | p50_lat={p50_latency_s:.3f}s | p90_lat={p90_latency_s:.3f}s | "
        f"sustained={throughput_sustained_tok_s:.1f} tok/s | p50_thru={p50_thru_tok_s:.1f} | p90_thru={p90_thru_tok_s:.1f} | "
        f"ITL p50={itl_p50_ms_per_tok:.1f} ms/tok | ITL p90={itl_p90_ms_per_tok:.1f} ms/tok"
    )

df = pd.DataFrame.from_records(records).sort_values("batch_size")
display(df)

df.to_csv("batch_benchmark.csv", index=False)
print("✅ Benchmark finished – saved to batch_benchmark.csv")

### Interpreting our benchmarks

Now that we've run our performance sweep, let’s dive into what the results actually mean — and what insights we can extract from them.

Let's load the benchmark file if you are returning to the notebook. Ignore this step if you've just executed the benchmark.

In [None]:
# Load df from memory if present, otherwise read the CSV; validate columns.
import os
import pandas as pd

if "df" not in globals():
    candidates = ["batch_benchmark.csv", "/workspace/batch_benchmark.csv"]
    for p in candidates:
        if os.path.exists(p):
            df = pd.read_csv(p)
            break
    else:
        raise FileNotFoundError("Couldn't find batch_benchmark.csv. Run the benchmark first.")

# Ensure types and order
df["batch_size"] = df["batch_size"].astype(int)
df = df.sort_values("batch_size").reset_index(drop=True)

df.head()

#### Helper for consistent plotting

In [None]:
# Small helper to keep plots consistent (one chart per figure).
import matplotlib.pyplot as plt

def plot_lines(x, series_list, title, xlabel, ylabel, ylog=False, xticks=None):
    plt.figure()
    for y, label, marker in series_list:
        plt.plot(x, y, marker=marker, label=label)
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.grid(True, which="both", linestyle="--", linewidth=0.5)
    if xticks is not None:
        plt.xticks(xticks)
    if ylog:
        plt.yscale("log")
    plt.legend()
    plt.tight_layout()
    plt.show()

#### Throughput vs batch size (sustained, p50, p90)

In [None]:
x = df["batch_size"].tolist()

plot_lines(
    x,
    [
        (df["throughput_sustained_tok_s"], "Sustained", "o"),
        (df["throughput_p50_tok_s"],       "p50",       "s"),
        (df["throughput_p90_tok_s"],       "p90",       "^"),
    ],
    title="Throughput vs Batch Size (tokens/sec)",
    xlabel="Batch size",
    ylabel="Tokens / second",
    ylog=False,
    xticks=x
)

#### Total latency vs batch size (p50, p90)

In [None]:
x = df["batch_size"].tolist()

plot_lines(
    x,
    [
        (df["p50_latency_s"], "p50 total latency", "o"),
        (df["p90_latency_s"], "p90 total latency", "s"),
    ],
    title="Total Latency vs Batch Size",
    xlabel="Batch size",
    ylabel="Seconds",
    ylog=True,      # latencies can span orders of magnitude
    xticks=x
)

#### Time to First Token (TTFT) vs batch size (p50, p90)

In [None]:
x = df["batch_size"].tolist()

plot_lines(
    x,
    [
        (df["p50_ttft_s"], "p50 TTFT", "o"),
        (df["p90_ttft_s"], "p90 TTFT", "s"),
    ],
    title="Time to First Token (TTFT) vs Batch Size",
    xlabel="Batch size",
    ylabel="Seconds",
    ylog=True,
    xticks=x
)

#### First-token vs total latency across batch sizes (p50, p90)

In [None]:
# Sort & extract
x = df["batch_size"].tolist()

ttft_p50 = df["p50_ttft_s"].tolist()
ttft_p90 = df["p90_ttft_s"].tolist()
lat_p50  = df["p50_latency_s"].tolist()
lat_p90  = df["p90_latency_s"].tolist()

# Single figure with four series
plt.figure()
plt.plot(x, ttft_p50, marker="o", label="TTFT (p50)")
plt.plot(x, ttft_p90, marker="^", label="TTFT (p90)")
plt.plot(x, lat_p50,  marker="s", label="Total latency (p50)")
plt.plot(x, lat_p90,  marker="x", label="Total latency (p90)")

plt.title("First-Token vs Total Latency — p50 & p90")
plt.xlabel("Batch size")
plt.ylabel("Seconds")
plt.grid(True, which="both", linestyle="--", linewidth=0.5)
plt.xticks(x)
plt.yscale("log")  # latency scales fast with batch size; log makes it readable
plt.legend()
plt.tight_layout()
plt.show()

#### Inter-Token Latency (ITL) vs batch size (p50, p90)

In [None]:
x = df["batch_size"].tolist()

plot_lines(
    x,
    [
        (df["itl_p50_ms_per_tok"], "ITL p50", "o"),
        (df["itl_p90_ms_per_tok"], "ITL p90", "s"),
    ],
    title="Inter-Token Latency (ITL) vs Batch Size",
    xlabel="Batch size",
    ylabel="Milliseconds / token",
    ylog=True,
    xticks=x
)

#### Throughput vs latency trade-off (capacity vs responsiveness)

In [None]:
import matplotlib.pyplot as plt

x = df["p50_latency_s"]
y = df["throughput_sustained_tok_s"]
labels = df["batch_size"].tolist()

plt.figure()
plt.scatter(x, y, marker="o")
for xi, yi, lbl in zip(x, y, labels):
    plt.annotate(str(lbl), (xi, yi), textcoords="offset points", xytext=(5,5))
plt.title("Throughput vs p50 Latency")
plt.xlabel("p50 total latency (s)")
plt.ylabel("Sustained throughput (tok/s)")
plt.grid(True, which="both", linestyle="--", linewidth=0.5)
plt.tight_layout()
plt.show()

#### Latency vs Throughput with knee point (p50 latency vs sustained throughput)

In [None]:
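# Imports repeated here so this cell also works after only reloading the CSV
import numpy as np
import matplotlib.pyplot as plt

L_COL = "p50_latency_s"              # latency column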
T_COL = "throughput_sustained_tok_s" # throughput column

needed = ["batch_size", L_COL, T_COL]
missing = [c for c in needed if c not in df.columns]
if missing:
    raise ValueError(f"Missing columns: {missing}. Re-run the benchmark with those fields.")

# Sort and extract arrays
df = df.sort_values("batch_size").reset_index(drop=True)
bs  = df["batch_size"].to_numpy(dtype=int)
lat = df[L_COL].to_numpy(dtype=float)
thr = df[T_COL].to_numpy(dtype=float)

# Normalize latency to [0,1]; normalize throughput and invert to [0,1]
lat_n   = lat / np.max(lat)                    # higher is worse
thr_inv = 1.0 - (thr / np.max(thr))            # invert so higher is worse

# Gap between the two normalized curves → knee where gap is smallest
gap    = np.abs(lat_n - thr_inv)
best_i = int(np.argmin(gap))

# Plot both normalized curves and mark the knee point
plt.figure(figsize=(7,4))
plt.plot(bs, lat_n,   'o-', label=f'{L_COL} (norm)')
plt.plot(bs, thr_inv, '^--', label=f'{T_COL} (inverted norm)')
plt.scatter(bs[best_i], lat_n[best_i], c='red', s=120,
            label=f'sweet spot ≈ BS {bs[best_i]}')
plt.xlabel("Batch size"); plt.ylabel("Normalised metric (0-1)")
plt.title("Latency vs Throughput – knee point")
plt.grid(True, which="both", linestyle="--", linewidth=0.5)
plt.legend()
plt.tight_layout()
plt.show()

print(
    f"Knee ~ BS={bs[best_i]} | "
    f"latency={lat[best_i]:.3f}s | throughput={thr[best_i]:.1f} tok/s"
)

#### Finding the sweet spot based on UX budget

Choosing the optimal batch size isn’t just about throughput — it’s about respecting your users’ **latency expectations**.

This section helps you find the best-performing batch size that stays within a given **latency budget**, defined in seconds. You can adjust the value of `latency_budget` depending on your application’s needs (e.g. 2.5s for chat, 5s for summarization, etc.).

What this code does:

- Filters all benchmark results to include only configurations where latency is **below your target threshold**
- Then selects the **batch size with the highest throughput** from the remaining options
- Runs this selection twice:
  - Once using **p50 latency** (typical case)
  - Once using **p90 latency** (tail latency)

Try changing the `latency_budget` value and observe how the recommended batch size shifts depending on the metric used.

In [None]:
latency_budget = 7  # seconds

filtered_p50 = df[df["p50_latency_s"] <= latency_budget]
filtered_p90 = df[df["p90_latency_s"] <= latency_budget]

if not filtered_p50.empty:
    best_row_p50 = filtered_p50.loc[filtered_p50["throughput_sustained_tok_s"].idxmax()]
    print(f"✅ Recommended (p50 budget): batch_size={int(best_row_p50.batch_size)}  "
          f"⇒  p50 latency = {best_row_p50.p50_latency_s:.2f}s, "
          f"throughput = {best_row_p50.throughput_sustained_tok_s:.0f} tok/s")
else:
    print(f"❌ No configuration meets the latency budget of {latency_budget:.1f}s based on p50 latency")

if not filtered_p90.empty:
    best_row_p90 = filtered_p90.loc[filtered_p90["throughput_sustained_tok_s"].idxmax()]
    print(f"✅ Recommended (p90 budget): batch_size={int(best_row_p90.batch_size)}  "
          f"⇒  p90 latency = {best_row_p90.p90_latency_s:.2f}s, "
          f"throughput = {best_row_p90.throughput_sustained_tok_s:.0f} tok/s")
else:
    print(f"❌ No configuration meets the latency budget of {latency_budget:.1f}s based on p90 latency")

### Assessing the impact of other parameters

So far, we’ve seen that **batch size** directly affects latency and throughput. But batch size isn’t the only factor that matters.

Let’s now explore how **other inference parameters**, starting with **output sequence length**, influence performance.

#### Sequence Length

The `max_tokens` parameter caps the **output sequence length**: how many tokens the model is allowed to generate per prompt.

In practice:

- Short outputs (e.g. 32 tokens) return quickly
- Long outputs (e.g. 512 tokens) take significantly more time, especially at high batch sizes

This is because the **prefill cost is fixed** for a given prompt, while the **decode phase scales roughly linearly** with the number of tokens generated. We will look into that in future tutorials!
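
A back-of-the-envelope model makes this concrete: total latency is roughly the time to first token plus one inter-token interval for each additional token. The function `estimate_latency_s` and the numbers below are hypothetical, and the model assumes a constant inter-token latency, which real runs only approximate.

```python
def estimate_latency_s(ttft_s, itl_s, n_generated):
    """Rough latency model: fixed first-token cost plus linear decode cost."""
    if n_generated < 1:
        raise ValueError("n_generated must be at least 1")
    # The first token is covered by TTFT; each later token adds one ITL
    return ttft_s + (n_generated - 1) * itl_s

# Hypothetical values: 0.2 s to first token, 20 ms per subsequent token
for n in (32, 64, 128, 256):
    print(f"max_tokens={n:>3} -> ~{estimate_latency_s(0.2, 0.02, n):.2f}s")
```

Doubling the output length roughly doubles the decode part of the latency, while the first-token part stays the same.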

What this experiment does:

- Uses a fixed batch size (`BATCH = 32`)
- Varies the `max_tokens` cap from 32 to 256
- Measures **p50** and **p90 latency** across 20 measured runs (after 2 warmup runs) for each setting

> Note: This experiment can take a few minutes — longer sequence lengths at high batch size are compute-intensive.

Let’s visualize how increasing the output length impacts both median latency and tail latency.

In [None]:
# Latency vs Generated Length (p50 & p90) — batch fixed

import gc, numpy as np
import torch  # needed for torch.cuda.empty_cache() below
import matplotlib.pyplot as plt
from vllm import SamplingParams

seq_lengths = [32, 64, 128, 256]
BATCH       = 32
RUNS        = 20   # effective samples per point (after warmup)
WARMUP      = 2

# Non-greedy (realistic) sampling
NON_GREEDY = dict(temperature=0.7, top_p=0.9)

lat_p50s, lat_p90s = [], []

for mt in seq_lengths:
    sp = SamplingParams(max_tokens=mt, **NON_GREEDY)
    lats = []
    for r in range(RUNS + WARMUP):
        prompts = ["Summer is"] * BATCH
        t_total, _, _, _ = timed_generate(prompts, sp, as_chat=True, return_counts=True)
        if r < WARMUP:
            gc.collect(); torch.cuda.empty_cache()
            continue
        lats.append(t_total)
        gc.collect(); torch.cuda.empty_cache()

    # Percentiles on observed latencies (nearest-rank style)
    p50 = float(np.percentile(lats, 50, method="higher")) if lats else float("nan")
    p90 = float(np.percentile(lats, 90, method="higher")) if lats else float("nan")
    lat_p50s.append(p50)
    lat_p90s.append(p90)

plt.figure()
plt.plot(seq_lengths, lat_p50s, marker='o', label='p50 latency')
plt.plot(seq_lengths, lat_p90s, marker='s', label='p90 latency')
plt.xlabel('max_tokens'); plt.ylabel('Latency (s)'); plt.title('Latency vs Generated Length (Batch = 32)')
plt.grid(True, which='both', linestyle='--', linewidth=0.5)
plt.legend(); plt.tight_layout(); plt.show()