# Part 5: A/B Testing for LLM Evaluation

This notebook illustrates how to design and analyse a simple A/B test to compare two versions of a language model. We simulate user interactions and apply a two‑proportion z‑test to determine whether observed differences in satisfaction are statistically significant.

## Simulating User Feedback

We randomly assign 1,000 simulated users to one of two models (A or B) with equal probability, so the two groups end up close to, but not exactly, 500 users each. Each user rates the model’s response as either satisfactory (1) or unsatisfactory (0). In our simulation, Model A has a true satisfaction rate of 60 % and Model B has a true rate of 55 %. The code below generates the synthetic data.

In [None]:

import numpy as np
import math

np.random.seed(42)
N = 1000
assignments = np.random.choice(['A', 'B'], size=N)
# True satisfaction rates
p_A_true = 0.6
p_B_true = 0.55
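# Draw a Bernoulli outcome for every user under each model's rate, then keep
# the draw that matches the model the user was actually assigned to.
outcomes = np.where(assignments == 'A', np.random.binomial(1, p_A_true, N), np.random.binomial(1, p_B_true, N))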

# Count per group
n_A = np.sum(assignments == 'A')
n_B = np.sum(assignments == 'B')
success_A = np.sum((assignments == 'A') & (outcomes == 1))
success_B = np.sum((assignments == 'B') & (outcomes == 1))

# Observed rates
p_A_obs = success_A / n_A
p_B_obs = success_B / n_B


In [1]:

# Difference in observed success rates
diff = p_A_obs - p_B_obs

# Pooled proportion and standard error
p_pool = (success_A + success_B) / (n_A + n_B)
se = math.sqrt(p_pool * (1 - p_pool) * (1/n_A + 1/n_B))

# z statistic for two-sample proportion test
z = diff / se

# Two-sided p-value: 2 * (1 - Phi(|z|)), which simplifies to erfc(|z| / sqrt(2))
p_value = math.erfc(abs(z) / math.sqrt(2))

results = {
    'Users_A': n_A,
    'Users_B': n_B,
    'Success_A': success_A,
    'Success_B': success_B,
    'Observed_rate_A': round(p_A_obs, 3),
    'Observed_rate_B': round(p_B_obs, 3),
    'Difference': round(diff, 3),
    'z_stat': round(z, 3),
    'p_value': round(p_value, 4)
}
results


{'Users_A': 490, 'Users_B': 510, 'Success_A': 296, 'Success_B': 282, 'Observed_rate_A': 0.604, 'Observed_rate_B': 0.553, 'Difference': 0.051, 'z_stat': 1.637, 'p_value': 0.1016}
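
For reference, the quantities computed in the cell above correspond to the standard pooled two-proportion z-test. The symbols $x_A$ and $x_B$ (success counts) and $\hat{p}$ (pooled rate) are notation introduced here for this summary; they map to `success_A`, `success_B` and `p_pool` in the code.

$$
\hat{p}_A = \frac{x_A}{n_A}, \qquad
\hat{p}_B = \frac{x_B}{n_B}, \qquad
\hat{p} = \frac{x_A + x_B}{n_A + n_B}
$$

$$
z = \frac{\hat{p}_A - \hat{p}_B}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}}, \qquad
p\text{-value} = 2\,\bigl(1 - \Phi(|z|)\bigr)
$$

Here $\Phi$ is the standard normal CDF. The pooled rate is used in the denominator because the null hypothesis assumes both models share a single satisfaction rate.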

### Interpreting the A/B Test

* **Observed satisfaction rates:** Model A users were satisfied `60.4 %` of the time, while Model B users were satisfied `55.3 %` of the time.
* **Difference:** The observed gap in satisfaction is `5.1 percentage points`.
* **z‑statistic:** `z = 1.64`.  This measures how many standard errors the observed difference is from zero.
* **p-value:** `p = 0.1016`.  A p‑value below 0.05 would typically suggest the difference is statistically significant.
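
The bullets above report the difference as a point estimate only. A confidence interval is a common companion to the p-value, so the following is a minimal sketch of a 95 % Wald interval for the difference. It is not part of the original notebook. It assumes the counts printed in the output above and uses the unpooled standard error, which is the usual choice for interval estimation but differs from the pooled standard error used in the test.

```python
import math

# Counts copied from the notebook output above.
n_A, n_B = 490, 510
success_A, success_B = 296, 282

p_A = success_A / n_A
p_B = success_B / n_B
diff = p_A - p_B

# Unpooled (Wald) standard error: each group contributes its own variance,
# because an interval does not assume the two rates are equal.
se_unpooled = math.sqrt(p_A * (1 - p_A) / n_A + p_B * (1 - p_B) / n_B)

z_crit = 1.959964  # 97.5th percentile of the standard normal distribution
ci_low = diff - z_crit * se_unpooled
ci_high = diff + z_crit * se_unpooled

print(f"Difference: {diff:.3f}")
print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

With these counts the interval includes zero, which is consistent with a p-value above 0.05. It also shows how wide the range of plausible differences still is at this sample size.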

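In this simulation, the p-value is above 0.05, so the higher observed satisfaction rate for Model A is not statistically significant at the 5 % level. Because the simulated true rates do differ (60 % versus 55 %), this is a false negative: with roughly 500 users per group, the test does not have enough data to reliably detect a 5-percentage-point gap.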
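
One way to see why a real 5-point gap can go undetected is to estimate the test's statistical power, meaning the probability of rejecting the null hypothesis when the true rates are 60 % and 55 %. The sketch below is an addition to the notebook and relies on the usual normal approximation. The function names `power_two_proportions` and `n_per_group` are hypothetical helpers written for this illustration, and the 80 % power target is a common convention rather than something the notebook specifies.

```python
import math
from statistics import NormalDist


def power_two_proportions(p_a, p_b, n_a, n_b, alpha=0.05):
    """Approximate power of the two-sided pooled two-proportion z-test."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Expected pooled rate, weighted by group size (used under the null).
    p_bar = (n_a * p_a + n_b * p_b) / (n_a + n_b)
    se_null = math.sqrt(p_bar * (1 - p_bar) * (1 / n_a + 1 / n_b))
    # Standard error of the observed difference under the true rates.
    se_alt = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    delta = abs(p_a - p_b)
    # Probability that the statistic falls beyond either critical value.
    upper = 1 - norm.cdf((z_crit * se_null - delta) / se_alt)
    lower = norm.cdf((-z_crit * se_null - delta) / se_alt)
    return upper + lower


def n_per_group(p_a, p_b, alpha=0.05, power=0.8):
    """Approximate equal group size needed to reach the target power."""
    norm = NormalDist()
    z_a = norm.inv_cdf(1 - alpha / 2)
    z_b = norm.inv_cdf(power)
    p_bar = (p_a + p_b) / 2
    numerator = (
        z_a * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_b * math.sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))
    ) ** 2
    return math.ceil(numerator / (p_a - p_b) ** 2)


print(f"Power with 490/510 users: {power_two_proportions(0.60, 0.55, 490, 510):.2f}")
print(f"Users per group for 80% power: {n_per_group(0.60, 0.55)}")
```

Under these assumptions the power at the simulated sample size comes out at a little over one third, and reaching 80 % power would take on the order of 1,500 users per group. Running this kind of calculation before an experiment helps choose a sample size that matches the smallest effect worth detecting.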

---

This notebook accompanies an article on A/B testing for language models. It provides a simple template for designing experiments and computing significance. Real evaluations should consider user segmentation, effect sizes, and ethical constraints.