In [None]:
# Bootstrap: add the repo root to sys.path and import notebook helpers
import sys
from pathlib import Path

# Add repo root to Python path
repo_root = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))

from utils.nb_helpers import run_module, run_script
print("✅ Notebook helpers loaded - ready for benchmarking!")


# 02 - Latency & Throughput Benchmarking

## Learning Goals

* Distinguish **latency** (time for one request) vs **throughput** (requests/second).
* Understand **warm-up**: the first inferences are slower due to lazy initialization and caches.
* Interpret percentiles (**P50**, **P95**, **P99**) and why tail latency matters.
* See how **batch size** trades latency for throughput.

## You Should Be Able To...

- Explain why warm-up runs are necessary in latency benchmarking
- Run benchmarks with different batch sizes and interpret results
- Calculate and compare P50/P95 latency percentiles
- Identify performance bottlenecks in model inference
- Make informed decisions about batch size for deployment

---

## Concepts

**Warm-up runs**: prime kernels, JITs, memory. Don't include them in metrics.

**P50 vs P95**: P95 tells you about "slow outliers". SLAs often target a percentile, not the mean.

**Providers/EPs**: same ONNX model, different backends (CPU/GPU/NNAPI).

**Batch size**: larger batches can improve throughput but increase per-request latency.
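
The batch-size trade-off can be made concrete with a little arithmetic. The sketch below uses made-up per-batch latencies (illustrative numbers, not measurements) and assumes every request in a batch waits for the whole batch to finish, so throughput is `batch_size / latency` while each request sees the full batch latency. The helper name `throughput_per_s` is hypothetical.

```python
# Hypothetical per-batch latencies in milliseconds (illustrative, not measured).
batch_latency_ms = {1: 2.0, 8: 10.0, 32: 36.0}


def throughput_per_s(batch_size: int, latency_ms: float) -> float:
    """Samples processed per second when one batch takes latency_ms."""
    return batch_size / (latency_ms / 1000.0)


for batch_size, latency_ms in batch_latency_ms.items():
    tput = throughput_per_s(batch_size, latency_ms)
    print(f"batch={batch_size:>2}  latency={latency_ms:5.1f} ms  throughput={tput:7.1f} samples/s")
```

With these numbers throughput rises with batch size, but so does the time each individual request has to wait.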

## Common Pitfalls

* Measuring latency without warm-up runs (first runs are slower)
* Using mean latency instead of percentiles for SLA planning
* Not considering batch size impact on single-request latency
* Ignoring system variability in benchmark results

## Success Criteria

* ✅ Report shows mean/P50/P95 and a PNG plot
* ✅ You can explain whether latency distribution is tight or spiky
* ✅ You can justify a batch size for your target use case

## Reflection

After completing this notebook, reflect on:
- How did batch size affect latency vs throughput?
- Why is P95 latency more important than mean for user experience?
- What factors contribute to latency variability?

---

## Setup & Environment Check


In [None]:
# ruff: noqa: E401
import os
import sys
from pathlib import Path

if Path.cwd().name == "labs":
    os.chdir(Path.cwd().parent)
    print("→ Working dir set to repo root:", os.getcwd())
if os.getcwd() not in sys.path:
    sys.path.insert(0, os.getcwd())

import time
import numpy as np
import matplotlib.pyplot as plt
import onnxruntime as ort
from piedge_edukit.preprocess import FakeData as PEDFakeData
import piedge_edukit as _pkg  # noqa: F401

# Hints & Solutions helper (pure Jupyter, no extra deps)
from IPython.display import Markdown, display

def hints(*lines, solution: str | None = None, title="Need a nudge?"):
    """Render progressive hints + optional collapsible solution."""
    md = [f"### {title}"]
    for i, txt in enumerate(lines, start=1):
        md.append(f"<details><summary>Hint {i}</summary>\n\n{txt}\n\n</details>")
    if solution:
        # keep code fenced as python for readability
        md.append(
            "<details><summary><b>Show solution</b></summary>\n\n"
            f"```python\n{solution.strip()}\n```\n"
            "</details>"
        )
    display(Markdown("\n\n".join(md)))


In [None]:
# Environment self-heal (Python 3.12 + editable install)
import subprocess
import importlib

print(f"Python: {sys.version.split()[0]} (need 3.12)")

try:
    import piedge_edukit  # noqa: F401
    print("✅ PiEdge EduKit package OK")
except ModuleNotFoundError:
    print("ℹ️ Installing package in editable mode …")
    root = os.getcwd()
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-e", root])
    importlib.invalidate_caches()
    import piedge_edukit  # noqa: F401
    print("✅ Package installed")


In [None]:
# All imports are now in the first cell above
print("✅ All imports successful")


## Concept: Latency vs Throughput

**Latency** measures how long a single inference takes (time per prediction).
**Throughput** measures how many inferences can be processed per second.

Key terms:
- **Mean latency**: Average time per inference
- **P50 latency**: Median time (50% of inferences are faster)
- **P95 latency**: 95th percentile (95% of inferences are faster)
- **Warm-up**: Initial runs that "prime" the system (GPU memory allocation, JIT compilation, etc.)
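
To see how these numbers differ on the same data, here is a minimal sketch with synthetic latencies (made-up values: mostly fast runs plus two slow outliers). It uses `np.percentile` with its default linear interpolation.

```python
import numpy as np

# Synthetic latencies in ms: 18 fast runs plus two slow outliers (made-up values).
latencies_ms = np.array([5.0] * 18 + [50.0, 60.0])

mean_ms = latencies_ms.mean()
p50_ms = np.percentile(latencies_ms, 50)
p95_ms = np.percentile(latencies_ms, 95)

print(f"mean={mean_ms:.2f} ms  p50={p50_ms:.2f} ms  p95={p95_ms:.2f} ms")
```

Here the mean (10 ms) is twice the median (5 ms) because of two slow runs, and P95 (50.5 ms) is the number that actually exposes them.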


### TODO A1 — Why warm-up?
Write 2–3 sentences explaining graph initialization, JIT/caches and memory allocation effects on first iterations.

<details><summary>Solution</summary>
Warm-up absorbs one-time costs (kernel/JIT init, memory allocation, cache fills) so that measured latency reflects steady state. Without warm-up, the slow first runs inflate the mean and P95, which overstates steady-state latency and understates throughput.
</details>
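
The first-call effect can be reproduced without a real model. The sketch below uses a hypothetical `LazyModel` stand-in whose first `predict()` pays a simulated one-time setup cost; the sleep durations are arbitrary assumptions, not measurements of ONNX Runtime.

```python
import time


class LazyModel:
    """Stand-in for a model whose first call pays a one-time setup cost."""

    def __init__(self):
        self._ready = False

    def predict(self):
        if not self._ready:
            time.sleep(0.05)   # simulated one-time init (hypothetical 50 ms)
            self._ready = True
        time.sleep(0.002)      # simulated steady-state work (hypothetical 2 ms)


def time_calls(model, n):
    """Return per-call latencies in milliseconds for n consecutive calls."""
    out = []
    for _ in range(n):
        t0 = time.perf_counter()
        model.predict()
        out.append((time.perf_counter() - t0) * 1000)
    return out


latencies = time_calls(LazyModel(), 6)
print(f"first call: {latencies[0]:.1f} ms")
print("later calls: " + ", ".join(f"{x:.1f}" for x in latencies[1:]))
```

If the first call were kept in the sample, it alone would dominate the mean and the upper percentiles of such a short run.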

hints(
    "The first inferences include one-off costs (graph init, memory alloc).",
    "Warm-up runs stabilize timing so measured latencies reflect steady-state.",
    solution="""\
Warm-up eliminates one-time overhead (graph compilation/init, allocator warm-up).
It makes the reported latencies representative of steady-state performance."""
)

## Task A: Explain Warm-up

**Multiple Choice**: Why are warm-up runs important in latency benchmarking?

A) They improve model accuracy
B) They initialize system resources (GPU memory, JIT compilation, etc.)
C) They reduce model size
D) They increase throughput

**Your answer**: _____

**Short justification** (1-2 sentences): Why does this matter for accurate benchmarking?

*Your answer here:*


## Task B: Batch Size Experiment

Run benchmarks with different batch sizes and analyze the performance trends.


In [None]:
# TODO B1: run latency for batch_sizes = [1, 8, 32], collect p50/p95
# Plot/print a small table and briefly interpret which batch best fits *latency-first* deployments.

def benchmark_batch_size(model_path, batch_sizes, runs=10, warmup=3):
    """
    Benchmark model with different batch sizes.
    Returns list of dicts with 'batch', 'p50', 'p95', 'mean' keys.
    """
    results = []
    
    # Load model once
    session = ort.InferenceSession(model_path, providers=['CPUExecutionProvider'])
    input_name = session.get_inputs()[0].name
    output_name = session.get_outputs()[0].name
    
    for batch_size in batch_sizes:
        print(f"Benchmarking batch size {batch_size}...")
        
        # Generate test data
        fake_data = PEDFakeData(num_samples=batch_size * runs, image_size=64, num_classes=2)
        latencies = []
        
        # Warm-up runs
        for _ in range(warmup):
            dummy_input = np.random.randn(batch_size, 3, 64, 64).astype(np.float32)
            _ = session.run([output_name], {input_name: dummy_input})
        
        # Actual benchmark runs
        for i in range(runs):
            dummy_input = np.random.randn(batch_size, 3, 64, 64).astype(np.float32)
            
            # perf_counter is monotonic and high-resolution, so it suits short intervals
            start_time = time.perf_counter()
            _ = session.run([output_name], {input_name: dummy_input})
            end_time = time.perf_counter()
            
            latency_ms = (end_time - start_time) * 1000
            latencies.append(latency_ms)
        
        # Calculate percentiles
        p50 = np.percentile(latencies, 50)
        p95 = np.percentile(latencies, 95)
        mean_lat = np.mean(latencies)
        
        results.append({
            'batch': batch_size,
            'p50': p50,
            'p95': p95,
            'mean': mean_lat
        })
        
        print(f"  Batch {batch_size}: P50={p50:.2f}ms, P95={p95:.2f}ms, Mean={mean_lat:.2f}ms")
    
    return results

# TODO: Run the experiment
# Hint: benchmark_batch_size("./models/model.onnx", [1, 8, 32], runs=10)

print("✅ Benchmark function ready")


In [None]:
# Run the benchmark experiment
model_path = "./models/model.onnx"
if not os.path.exists(model_path):
    print("❌ Model not found. Please complete Notebook 01 first.")
    print("Expected path:", model_path)
else:
    # TODO: Run the benchmark with batch sizes [1, 8, 32]
    results = benchmark_batch_size(model_path, [1, 8, 32], runs=10)
    
    # Display results in a table
    print("\n📊 Benchmark Results:")
    print("Batch Size | P50 (ms) | P95 (ms) | Mean (ms)")
    print("-" * 45)
    for r in results:
        print(f"{r['batch']:10} | {r['p50']:8.2f} | {r['p95']:8.2f} | {r['mean']:8.2f}")
    
    # Auto-check
    assert len(results) >= 3 and all({'batch','p50','p95'} <= set(r) for r in results)
    print("✅ Results format OK")


In [None]:
hints(
    "Use matplotlib: hist + vertical lines at np.percentile(..., 50/95).",
    "Label axes: milliseconds; title with provider/batch size.",
    solution='''
import numpy as np, matplotlib.pyplot as plt

def plot_latency(lat_ms, title="Latency"):
    p50 = np.percentile(lat_ms, 50)
    p95 = np.percentile(lat_ms, 95)
    plt.figure()
    plt.hist(lat_ms, bins=20)
    plt.axvline(p50, linestyle="--", label=f"P50={p50:.2f} ms")
    plt.axvline(p95, linestyle="--", label=f"P95={p95:.2f} ms")
    plt.xlabel("Latency (ms)"); plt.ylabel("Count"); plt.title(title); plt.legend()
    plt.show()
'''
)

# Visualize the results
if 'results' in locals() and len(results) >= 3:
    batch_sizes = [r['batch'] for r in results]
    p50_values = [r['p50'] for r in results]
    p95_values = [r['p95'] for r in results]
    mean_values = [r['mean'] for r in results]
    
    plt.figure(figsize=(10, 6))
    plt.plot(batch_sizes, p50_values, 'o-', label='P50 Latency', linewidth=2)
    plt.plot(batch_sizes, p95_values, 's-', label='P95 Latency', linewidth=2)
    plt.plot(batch_sizes, mean_values, '^-', label='Mean Latency', linewidth=2)
    
    plt.xlabel('Batch Size')
    plt.ylabel('Latency (ms)')
    plt.title('Latency vs Batch Size')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
    
    print("📈 Chart shows latency trends across batch sizes")
else:
    print("⚠️ No results to visualize. Run the benchmark first.")


hints(
    "Larger batches can improve throughput but may increase single-sample latency.",
    "Edge scenarios often prioritize tail latency (P95) over throughput.",
    solution="""\
Batching amortizes overhead per call (↑ throughput), but single-request latency
often grows with batch size. On-device UX typically targets low P95, so choose
small batches unless you have parallel demand."""
)

## Analysis Questions

Based on your benchmark results, answer these questions:

**1. How does latency change as batch size increases? Explain the trend.**

*Your answer here (2-3 sentences):*

---

**2. Why might P95 latency be higher than P50 latency? What does this tell us about system performance?**

*Your answer here (2-3 sentences):*

---

**3. If you were deploying this model to a real-time application, which batch size would you choose and why?**

*Your answer here (2-3 sentences):*


## Next Steps

Excellent work! You've learned how to measure and analyze model performance.

**Next**: Open `03_quantization.ipynb` to learn about model compression and optimization.

---

### Summary
- ✅ Understood latency vs throughput concepts
- ✅ Implemented warm-up benchmarking
- ✅ Analyzed batch size effects on performance
- ✅ Interpreted P50/P95 latency metrics


# ⚡ Latency Benchmark - Understand model performance

**Goal**: Understand how we measure and interpret model latency (response time).

In this notebook we will:
- Understand what latency is and why it's important
- See how the benchmark works (warm-up, runs, providers)
- Interpret results (p50, p95, histogram)
- Experiment with different settings

> **💡 Tip**: Latency is critical for edge deployment - a model that's too slow is not usable in real life!


## 🤔 What is latency and why is it important?

**Latency** = the time it takes for the model to make a prediction (inference time).

**Why important for edge**:
- **Real-time applications** - robots, autonomous vehicles
- **User experience** - no one wants to wait 5 seconds for image classification
- **Resource constraints** - Raspberry Pi has limited CPU/memory

<details>
<summary>🔍 Click to see typical latency targets</summary>

**Typical latency targets**:
- **< 10ms**: Real-time video, gaming
- **< 100ms**: Interactive applications
- **< 1000ms**: Batch processing, offline analysis

**Our model**: Expect ~1-10ms on CPU (good for edge!)

</details>
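
Targets like these are usually checked against a percentile rather than the mean. A minimal sketch, assuming a hypothetical helper `meets_budget` and made-up measurements:

```python
import numpy as np


def meets_budget(latencies_ms, budget_ms, percentile=95):
    """Return (ok, observed) where observed is the chosen latency percentile in ms."""
    observed = float(np.percentile(latencies_ms, percentile))
    return observed <= budget_ms, observed


# Made-up measurements in milliseconds, for illustration only.
sample_ms = [4.1, 4.3, 4.0, 4.6, 4.2, 14.8, 4.4, 4.1, 4.5, 4.3]

for budget_ms in (10, 100):
    ok, observed = meets_budget(sample_ms, budget_ms)
    print(f"budget {budget_ms} ms: P95={observed:.2f} ms -> {'OK' if ok else 'too slow'}")
```

In this sample the mean is about 5.3 ms, yet P95 is about 10.2 ms, so the 10 ms budget is missed even though the average looks comfortable.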


## 🔧 How does the benchmark work?

**Benchmark process**:
1. **Warmup** - run the model a few times to "warm up" (JIT compilation, cache)
2. **Runs** - measure latency for many runs
3. **Statistics** - calculate p50, p95, mean, std

**Why warmup?**
- First run is often slow (JIT compilation)
- Cache warming affects performance
- We want to measure "steady state" performance
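
The three steps above can be sketched as a small generic harness. This is an illustration of the idea only, not the actual code behind `piedge_edukit.benchmark`; the function name `benchmark` and the stand-in workload are assumptions.

```python
import time

import numpy as np


def benchmark(fn, warmup=3, runs=10):
    """Time fn() after `warmup` untimed calls; return summary stats in milliseconds."""
    for _ in range(warmup):          # 1. warm-up: executed but not recorded
        fn()
    latencies_ms = []
    for _ in range(runs):            # 2. timed runs
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000)
    arr = np.asarray(latencies_ms)   # 3. statistics
    return {
        "runs": runs,
        "mean": float(arr.mean()),
        "std": float(arr.std()),
        "p50": float(np.percentile(arr, 50)),
        "p95": float(np.percentile(arr, 95)),
    }


# Stand-in workload; in this notebook it would be a call such as session.run(...).
stats = benchmark(lambda: sum(i * i for i in range(10_000)), warmup=3, runs=20)
print(stats)
```

Keeping the warm-up loop separate from the timed loop makes it easy to set `warmup=0` and observe the difference.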


In [None]:
# Run benchmark with different settings
print("🚀 Running benchmark...")

# Use the model from the previous notebook (or create a quick one)
!python -m piedge_edukit.train --fakedata --no-pretrained --epochs 1 --batch-size 256 --output-dir ./models_bench


In [None]:
# Benchmark with different numbers of runs to see variance
import os

# Test 1: Few runs (fast)
print("📊 Test 1: 10 runs")
!python -m piedge_edukit.benchmark --fakedata --model-path ./models_bench/model.onnx --warmup 3 --runs 10 --providers CPUExecutionProvider


In [None]:
# Show Benchmark results
if os.path.exists("./reports/latency_summary.txt"):
    with open("./reports/latency_summary.txt", "r") as f:
        print("📈 Benchmark results:")
        print(f.read())
else:
    print("❌ Benchmark report missing")


In [None]:
# Read detailed latency data and visualize
import pandas as pd
import matplotlib.pyplot as plt

if os.path.exists("./reports/latency.csv"):
    df = pd.read_csv("./reports/latency.csv")
    
    print("📊 Latency statistics:")
    print(f"Num measurements: {len(df)}")
    print(f"Mean: {df['latency_ms'].mean():.2f} ms")
    print(f"Std: {df['latency_ms'].std():.2f} ms")
    print(f"Min: {df['latency_ms'].min():.2f} ms")
    print(f"Max: {df['latency_ms'].max():.2f} ms")
    
    # Histogram
    plt.figure(figsize=(10, 6))
    plt.hist(df['latency_ms'], bins=20, alpha=0.7, edgecolor='black')
    plt.xlabel('Latency (ms)')
    plt.ylabel('Count')
    plt.title('Latency distribution')
    plt.grid(True, alpha=0.3)
    plt.show()
    
    # Box plot
    plt.figure(figsize=(8, 6))
    plt.boxplot(df['latency_ms'])
    plt.ylabel('Latency (ms)')
    plt.title('Latency Box Plot')
    plt.grid(True, alpha=0.3)
    plt.show()
else:
    print("❌ Latency CSV missing")


## 🤔 Reflection Questions

<details>
<summary>💭 Why is p95 more important than mean for edge deployment?</summary>

**Answer**: p95 (the 95th percentile) is the latency that 95% of requests stay at or below; only the slowest 5% exceed it. It is more useful than the mean because:

- **User experience**: A user who gets 100ms latency will notice it, even if the mean is 10ms
- **SLA targets**: Many systems have SLA targets at p95 latency
- **Outliers**: The mean blends fast and slow runs into one number and can hide a slow tail; p95 reports that tail directly

</details>

<details>
<summary>💭 What happens to latency variance when you increase the number of runs?</summary>

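**Answer**: More runs do not make the model itself more consistent, but they make the measurement more trustworthy:
- **More stable statistics** - p50/p95 estimates vary less from one benchmark to the next
- **Better understanding of variance** - you can see whether the model is consistent
- **Less impact of outliers** - a single slow run shifts the summary statistics less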

**Experiment**: Run the benchmark with 10, 50, 100 runs and compare standard deviation.

</details>
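
The effect of the run count on the reliability of P95 can be simulated without running the model. The sketch below draws synthetic latencies from a log-normal distribution (an assumption chosen only to mimic a right-skewed tail) and measures how much the P95 estimate varies across many repeated "benchmarks" of 10, 50 and 100 runs.

```python
import numpy as np

rng = np.random.default_rng(0)


def p95_spread(n_runs, repeats=500):
    """Std of the P95 estimate across `repeats` simulated benchmarks of n_runs each."""
    # Log-normal latencies are an assumption, not a model of any real device.
    samples = rng.lognormal(mean=1.5, sigma=0.4, size=(repeats, n_runs))
    p95_estimates = np.percentile(samples, 95, axis=1)
    return float(p95_estimates.std())


spreads = {n: p95_spread(n) for n in (10, 50, 100)}
for n, spread in spreads.items():
    print(f"runs={n:>3}  std of P95 estimate = {spread:.2f} ms")
```

The underlying distribution is identical in all three cases; only the stability of the estimate improves as the number of runs grows.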


## 🎯 Your own experiment

**Task**: Run the benchmark with different settings and compare the results.

**Suggestions**:
- Try different numbers of runs (10, 50, 100)
- Compare the warmup effect (0, 3, 10 warmup)
- Analyze the variance between runs

**Code to modify**:
```python
# Change these values:
WARMUP_RUNS = 5
BENCHMARK_RUNS = 50

!python -m piedge_edukit.benchmark --fakedata --model-path ./models_bench/model.onnx --warmup {WARMUP_RUNS} --runs {BENCHMARK_RUNS} --providers CPUExecutionProvider
```
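
To compare several settings in one go, the command line can be generated in a loop. This sketch only builds and prints the commands, using the flags shown in this notebook; the helper name `build_benchmark_cmd` and the sweep grid are hypothetical. Whether each run overwrites the files under `./reports` is not stated here, so it may be worth copying them aside between runs.

```python
import shlex
import sys


def build_benchmark_cmd(warmup, runs, model_path="./models_bench/model.onnx"):
    """Build the argv list for one benchmark run (flags as used in this notebook)."""
    return [
        sys.executable, "-m", "piedge_edukit.benchmark",
        "--fakedata",
        "--model-path", model_path,
        "--warmup", str(warmup),
        "--runs", str(runs),
        "--providers", "CPUExecutionProvider",
    ]


# Hypothetical sweep grid based on the suggestions in this section.
commands = [build_benchmark_cmd(w, r) for w in (0, 3, 10) for r in (10, 50, 100)]
for cmd in commands:
    print(shlex.join(cmd))
    # To actually run one: import subprocess; subprocess.run(cmd, check=True)
```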


In [None]:
# TODO: Implement your experiment here
# Change the values below and run the benchmark

WARMUP_RUNS = 5
BENCHMARK_RUNS = 50

print(f"🧪 My experiment: warmup={WARMUP_RUNS}, runs={BENCHMARK_RUNS}")

# TODO: Run the benchmark with your settings
# !python -m piedge_edukit.benchmark --fakedata --model-path ./models_bench/model.onnx --warmup {WARMUP_RUNS} --runs {BENCHMARK_RUNS} --providers CPUExecutionProvider


## 🎉 Summary

You have now learned:
- What latency is and why it is critical for edge deployment
- How the benchmark works (warmup, runs, statistics)
- How to interpret latency results (p50, p95, variance)
- Why P95 is more important than mean for user experience

**Next step**: Go to `03_quantization.ipynb` to understand how quantization can improve performance.

**Key concepts**:
- **Latency**: Inference time (critical for edge)
- **Warm-up**: Untimed initial runs that absorb one-time startup costs before measurement
- **p50/p95**: Percentiles for the latency distribution
- **Variance**: Consistency in performance
