From "did it work?" to "how confident are we, and should we ship it?"
1. Pre-Test Planning → calculate_sample_size() — how long must the test run?
2. SRM Check → check_srm() — is the experiment valid?
3. Conversion Rate Test → test_proportions() — z-test for binary metrics
4. Revenue / Time Test → test_continuous() — Welch's t-test for numeric metrics
5. Decision Report → format_report() — SHIP / DO NOT SHIP recommendation
```python
from src.ab_test import calculate_sample_size, test_proportions, format_report

# Step 1: How long do we need to run this?
plan = calculate_sample_size(
    baseline_rate=0.05,          # Current 5% conversion
    min_detectable_effect=0.10,  # Want to detect a 10% relative lift (to 5.5%)
    alpha=0.05,                  # 5% false positive rate
    power=0.80,                  # 80% chance of detecting a real effect
    daily_traffic=2000,          # 2,000 users/day across both variants
)
print(f"Run for at least {plan.expected_duration_days} days with {plan.n_per_variant:,} users per variant")
# → Run for at least 35.0 days with 35,000 users per variant

# Step 2: After the test, analyse results
result = test_proportions(
    control_conversions=1750, control_n=35000,      # Control: 5.0%
    treatment_conversions=1960, treatment_n=35000,  # Treatment: 5.6%
)
print(format_report(result, "CTA Colour — Green vs Blue"))
```

Output:
```
============================================================
A/B TEST REPORT — CTA Colour — Green vs Blue
============================================================
Test type: proportions
Control: n=35,000 | metric=0.050000
Treatment: n=35,000 | metric=0.056000
Absolute lift: +0.006000
Relative lift: +12.00%
P-value: 0.000396
95% CI: (0.002681, 0.009319)
Effect size: 0.0268
SRM p-value: 1.0000
Significant: YES
------------------------------------------------------------
RECOMMENDATION: SHIP — statistically significant positive lift (+12.00%)
============================================================
```
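
For readers who want to see what a two-proportion z-test involves, here is a minimal standard-library sketch. It is not the framework's implementation: the function name `two_proportion_ztest` and its return keys are hypothetical, and it assumes a pooled standard error for the test statistic, an unpooled (Wald) interval for the difference, and Cohen's h as the effect size.

```python
from math import asin, sqrt
from statistics import NormalDist


def two_proportion_ztest(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Hypothetical stand-in for test_proportions(); illustrative only."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a

    # Pooled standard error: under H0 both arms share one conversion rate.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided

    # Unpooled standard error: used for the interval around the difference.
    se_unpooled = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)

    # Cohen's h: difference of arcsine-transformed proportions.
    cohens_h = 2 * asin(sqrt(p_b)) - 2 * asin(sqrt(p_a))
    return {"z": z, "p_value": p_value, "ci": ci, "cohens_h": cohens_h}


# Same counts as the example above: 1,750 / 35,000 vs 1,960 / 35,000.
stats = two_proportion_ztest(1750, 35000, 1960, 35000)
print(stats)
```

The pooled and unpooled standard errors are nearly identical when the two rates are close, so the choice mostly matters for small samples or large lifts.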
| Concept | Explanation | Framework Function |
|---|---|---|
| Sample size | Users needed to detect a given lift with statistical confidence | calculate_sample_size() |
| SRM | Sample Ratio Mismatch — experiment integrity check | check_srm() |
| Z-test | For conversion rates — tests if two proportions differ | test_proportions() |
| Welch's t-test | For revenue/time — doesn't assume equal variance | test_continuous() |
| P-value | Probability of seeing a result at least this extreme if there's no real effect | Both test functions |
| Cohen's d | Effect size for continuous metrics (0.2=small, 0.5=medium, 0.8=large) | test_continuous() |
| Cohen's h | Effect size for proportions | test_proportions() |
| 95% CI | Range of effect values consistent with the data; across repeated experiments, about 95% of such intervals would contain the true effect | Both test functions |
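
The sample-size row can be made concrete with the textbook formula for comparing two proportions, n = (z_alpha + z_power)^2 * [p1(1-p1) + p2(1-p2)] / (p2 - p1)^2 per variant. The sketch below is an illustration only: `sample_size_per_variant` is a hypothetical name, and the framework's `calculate_sample_size()` may use a different variant of the formula (pooled variance, continuity correction, or a safety margin), so its numbers need not match.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Hypothetical helper: users needed in EACH variant (two-sided test)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)        # rate we want to be able to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for power = 0.80
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance_sum / (p2 - p1) ** 2)


# Assumed inputs: 5% baseline, 10% relative lift, 2,000 users/day across both variants.
n_per_variant = sample_size_per_variant(0.05, 0.10)
days_needed = 2 * n_per_variant / 2000
print(n_per_variant, round(days_needed, 1))
```

Because the required sample scales with 1 / (p2 - p1)^2, halving the detectable lift roughly quadruples the number of users needed.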
| Mistake | How This Framework Handles It |
|---|---|
| Peeking at results early | calculate_sample_size() tells you how long to run — don't stop early |
| Underpowered tests | Required sample size is calculated before the test starts |
| SRM ignored | check_srm() runs automatically — flags invalid experiments |
| Stat-sig ≠ business sig | Reports relative lift — a 0.001% lift isn't worth shipping even if p < 0.05 |
| Wrong test for metric type | test_proportions() for conversion, test_continuous() for revenue |
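
An SRM check is usually a chi-square goodness-of-fit test on the assignment counts. The following is a rough sketch under stated assumptions, not the framework's `check_srm()`: the name `srm_p_value`, the default 50/50 split, and the 0.001 alert threshold are all illustrative choices.

```python
from math import erfc, sqrt


def srm_p_value(control_n, treatment_n, expected_ratio=0.5):
    """Hypothetical helper: chi-square test (1 df) of observed vs expected split."""
    total = control_n + treatment_n
    expected_control = total * expected_ratio
    expected_treatment = total * (1 - expected_ratio)
    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (treatment_n - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))


SRM_THRESHOLD = 0.001  # assumed cutoff; strict so that healthy tests are rarely flagged

for counts in [(35000, 35000), (35000, 34000)]:
    p = srm_p_value(*counts)
    status = "possible SRM, investigate" if p < SRM_THRESHOLD else "OK"
    print(counts, f"p={p:.6f}", status)
```

A perfectly balanced split gives a p-value of 1.0, while a gap of 1,000 users out of 69,000 is already far too large to attribute to chance.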
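```shell
git clone https://github.com/YOUR_USERNAME/ab-testing-framework.git
cd ab-testing-framework
python -m venv venv && venv\Scripts\activate
pip install -r requirements.txt
pytest tests/ -v
```

The activation command above is for Windows; on macOS/Linux use `source venv/bin/activate` instead.

- Type I vs Type II errors: alpha (false positive) and beta (false negative), and why reducing one increases the other at a fixed sample size and effect size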
- Welch's t-test over Student's t: doesn't assume equal variance — almost always safer for real data
- SRM is a real problem: published reports from large experimentation platforms put it at roughly 6-10% of A/B tests, with causes such as bot filtering, cookie deletion, or logging bugs
- Effect size matters: p-value tells you IF there's a difference; effect size tells you HOW LARGE it is
- Statistical vs practical significance: a conversion lift from 10.00% to 10.01% can be stat-sig with roughly 100M users per variant but is rarely worth the engineering time
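
To illustrate the Welch's t-test and effect-size points above, here is a small standard-library sketch. It is not the framework's `test_continuous()`: the function name `welch_t` is hypothetical and the order values are made-up illustrative data. It computes the t statistic, the Welch-Satterthwaite degrees of freedom, and Cohen's d; the p-value is left out because the standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(control, treatment):
    """Hypothetical helper: Welch's t statistic, degrees of freedom, Cohen's d."""
    n_c, n_t = len(control), len(treatment)
    var_c, var_t = variance(control), variance(treatment)  # sample variances (n - 1)
    q_c, q_t = var_c / n_c, var_t / n_t                    # variance of each mean
    diff = mean(treatment) - mean(control)

    # Each group keeps its own variance; nothing is pooled in the test statistic.
    t_stat = diff / sqrt(q_c + q_t)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (q_c + q_t) ** 2 / (q_c ** 2 / (n_c - 1) + q_t ** 2 / (n_t - 1))

    # Cohen's d divides the difference by the pooled standard deviation.
    pooled_sd = sqrt(((n_c - 1) * var_c + (n_t - 1) * var_t) / (n_c + n_t - 2))
    return t_stat, df, diff / pooled_sd


# Made-up order values, for illustration only.
control = [20.1, 22.4, 19.8, 21.0, 23.5, 20.7]
treatment = [24.9, 21.3, 28.0, 19.5, 30.2, 26.1, 23.4, 27.7]
t_stat, df, cohens_d = welch_t(control, treatment)
print(f"t={t_stat:.3f}, df={df:.1f}, d={cohens_d:.2f}")
```

If SciPy is available, `scipy.stats.ttest_ind(treatment, control, equal_var=False)` runs the same test and also returns the p-value.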
MIT
Part of a 10-project Data Analyst portfolio