A/B Testing Framework

Statistical Experiment Analysis — Sample Size, Significance & Business Impact

From "did it work?" to "how confident are we, and should we ship it?"

Framework Capabilities

1. Pre-Test Planning    → calculate_sample_size() — how long must the test run?
2. SRM Check            → check_srm()             — is the experiment valid?
3. Conversion Rate Test → test_proportions()       — z-test for binary metrics
4. Revenue / Time Test  → test_continuous()        — Welch's t-test for numeric metrics
5. Decision Report      → format_report()          — SHIP / DO NOT SHIP recommendation

Example: Homepage CTA Button Colour Test

from src.ab_test import calculate_sample_size, test_proportions, format_report

# Step 1: How long do we need to run this?
plan = calculate_sample_size(
    baseline_rate=0.05,          # Current 5% conversion
    min_detectable_effect=0.10,  # Want to detect a 10% relative lift (to 5.5%)
    alpha=0.05,                  # 5% false positive rate
    power=0.80,                  # 80% chance of detecting a real effect
    daily_traffic=2000,          # 2,000 users/day across both variants
)
print(f"Run for at least {plan.expected_duration_days} days with {plan.n_per_variant:,} users per variant")
# → Run for at least 35.0 days with 35,000 users per variant

# Step 2: After the test, analyse results
result = test_proportions(
    control_conversions=1750, control_n=35000,    # Control: 5.0%
    treatment_conversions=1960, treatment_n=35000, # Treatment: 5.6%
)
print(format_report(result, "CTA Colour — Green vs Blue"))

Output:

============================================================
A/B TEST REPORT — CTA Colour — Green vs Blue
============================================================
Test type:        proportions
Control:          n=35,000 | metric=0.050000
Treatment:        n=35,000 | metric=0.056000
Absolute lift:    +0.006000
Relative lift:    +12.00%
P-value:          0.000396
95% CI:           (0.002681, 0.009319)
Effect size:      0.0268
SRM p-value:      1.0000
Significant:      YES
------------------------------------------------------------
RECOMMENDATION:   SHIP — statistically significant positive lift (+12.00%)
============================================================
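
For readers who want to see the arithmetic behind a report like this, here is a minimal sketch of a two-sided two-proportion z-test using only the standard library. It is not the framework's `test_proportions()` implementation: the function name `two_proportion_z_test` and the returned dictionary keys are hypothetical, and the choice of a pooled standard error for the test statistic with an unpooled one for the confidence interval is an assumption, so small differences from the framework's own figures are possible.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(control_conversions, control_n,
                          treatment_conversions, treatment_n, alpha=0.05):
    """Two-sided z-test for a difference in conversion rates (hypothetical helper)."""
    p_c = control_conversions / control_n
    p_t = treatment_conversions / treatment_n
    diff = p_t - p_c

    # Pooled standard error: used for the test statistic under H0 (no difference).
    p_pool = (control_conversions + treatment_conversions) / (control_n + treatment_n)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / treatment_n))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error: used for the confidence interval on the difference.
    se_unpooled = sqrt(p_c * (1 - p_c) / control_n + p_t * (1 - p_t) / treatment_n)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)

    return {
        "absolute_lift": diff,
        "relative_lift": diff / p_c,
        "z": z,
        "p_value": p_value,
        "ci": ci,
        "significant": p_value < alpha,
    }


# Same counts as the CTA colour example above.
result = two_proportion_z_test(1750, 35000, 1960, 35000)
print(f"z = {result['z']:.3f}, p = {result['p_value']:.6f}")
print(f"95% CI = ({result['ci'][0]:.6f}, {result['ci'][1]:.6f})")
```

Because the interval excludes zero and the p-value is below alpha, the sketch reaches the same qualitative conclusion as the report: a statistically significant positive lift.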

Statistical Concepts

| Concept | Explanation | Framework Function |
|---|---|---|
| Sample size | Users needed to detect a given lift with the chosen alpha and power | `calculate_sample_size()` |
| SRM | Sample Ratio Mismatch: an experiment integrity check | `check_srm()` |
| Z-test | For conversion rates; tests whether two proportions differ | `test_proportions()` |
| Welch's t-test | For revenue/time; does not assume equal variance | `test_continuous()` |
| P-value | Probability of seeing a result at least this extreme if there's no real effect | Both test functions |
| Cohen's d | Effect size for continuous metrics (0.2=small, 0.5=medium, 0.8=large) | `test_continuous()` |
| Cohen's h | Effect size for proportions | `test_proportions()` |
| 95% CI | Range that would contain the true effect in 95% of repeated experiments | Both test functions |
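
To make the sample size and Cohen's h rows concrete, here is a minimal sketch based on the standard normal-approximation formula for comparing two proportions. It is an illustration under stated assumptions, not the framework's `calculate_sample_size()`: the function names `sample_size_per_variant` and `cohens_h` are hypothetical, the test is assumed two-sided, and traffic is assumed to split evenly between variants. The framework may use a different formula or rounding, so its numbers can differ.

```python
from math import asin, ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Users per variant for a two-sided two-proportion test (normal approximation)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)   # rate we want to be able to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)


def cohens_h(p1, p2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * asin(sqrt(p2)) - 2 * asin(sqrt(p1))


# 5% baseline, 10% relative lift, 2,000 users/day split evenly (assumed).
n = sample_size_per_variant(baseline_rate=0.05, relative_mde=0.10)
days = ceil(n / (2000 / 2))
h = cohens_h(0.05, 0.056)
print(f"{n:,} users per variant, about {days} days; Cohen's h = {h:.4f}")
```

The required sample size scales with the inverse square of the detectable difference, so halving the minimum detectable effect roughly quadruples the test duration.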

Common A/B Testing Mistakes This Framework Prevents

| Mistake | How This Framework Handles It |
|---|---|
| Peeking at results early | `calculate_sample_size()` tells you how long to run, so don't stop early |
| Underpowered tests | Required sample size is calculated before the test starts |
| SRM ignored | `check_srm()` runs automatically and flags invalid experiments |
| Stat-sig ≠ business sig | Reports relative lift: a 0.001% lift isn't worth shipping even if p < 0.05 |
| Wrong test for metric type | `test_proportions()` for conversion, `test_continuous()` for revenue |
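
As an illustration of what an SRM check involves, the sketch below runs a chi-square goodness-of-fit test on the two group sizes. It is not the framework's `check_srm()`: the function name `srm_p_value` is hypothetical, an intended 50/50 split is assumed by default, and the 0.001 flagging threshold is a commonly used convention assumed here rather than taken from the source.

```python
from math import erfc, sqrt


def srm_p_value(control_n, treatment_n, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value for the observed split (1 degree of freedom)."""
    total = control_n + treatment_n
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (treatment_n - expected_treatment) ** 2 / expected_treatment)
    # With 1 degree of freedom, P(chi-square > x) equals erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))


SRM_THRESHOLD = 0.001  # assumed convention: flag the experiment below this value

balanced = srm_p_value(35000, 35000)   # perfectly even split
skewed = srm_p_value(35000, 34000)     # 1,000 users missing from treatment
for label, p in [("balanced", balanced), ("skewed", skewed)]:
    status = "SRM detected" if p < SRM_THRESHOLD else "OK"
    print(f"{label}: p = {p:.4f} ({status})")
```

A perfectly even split gives a p-value of 1, while a gap of 1,000 users out of 69,000 is already far too large to attribute to chance, which is why SRM should be checked before reading any lift numbers.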

Quick Start

git clone https://github.com/YOUR_USERNAME/ab-testing-framework.git
cd ab-testing-framework
python -m venv venv
venv\Scripts\activate        # Windows; on macOS/Linux use: source venv/bin/activate
pip install -r requirements.txt
pytest tests/ -v

What I Learned

Type I vs Type II errors: alpha (false positive rate) and beta (false negative rate), and why, at a fixed sample size and effect size, reducing one increases the other
Welch's t-test over Student's t: doesn't assume equal variance — almost always safer for real data
SRM is a real problem: large experimentation platforms have reported SRM in roughly 6-10% of A/B tests, caused by bot filtering, cookie deletion, or logging bugs
Effect size matters: p-value tells you IF there's a difference; effect size tells you HOW LARGE it is
Statistical vs practical significance: a conversion lift from 10.00% to 10.01% can be stat-sig with around 100M users per variant but not worth engineering time
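
The Welch's t-test and effect size points can be sketched with the standard library alone. The code below computes Welch's t statistic, the Welch-Satterthwaite degrees of freedom, and Cohen's d on made-up revenue values. It is not the framework's `test_continuous()`: the function names `welch_t` and `cohens_d` and the sample data are hypothetical. Turning the statistic into a p-value needs the t-distribution, which a library such as SciPy provides (`scipy.stats.ttest_ind` with `equal_var=False`).

```python
from math import sqrt
from statistics import mean, variance


def welch_t(control, treatment):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_c, n_t = len(control), len(treatment)
    # Per-group squared standard errors, using sample variances (n - 1 denominator).
    se_sq_c = variance(control) / n_c
    se_sq_t = variance(treatment) / n_t
    t_stat = (mean(treatment) - mean(control)) / sqrt(se_sq_c + se_sq_t)
    df = (se_sq_c + se_sq_t) ** 2 / (
        se_sq_c ** 2 / (n_c - 1) + se_sq_t ** 2 / (n_t - 1)
    )
    return t_stat, df


def cohens_d(control, treatment):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n_c, n_t = len(control), len(treatment)
    pooled_var = (
        (n_c - 1) * variance(control) + (n_t - 1) * variance(treatment)
    ) / (n_c + n_t - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


# Made-up revenue-per-user values; treatment has a much higher variance.
control_revenue = [12.0, 15.5, 9.0, 20.0, 11.5, 14.0]
treatment_revenue = [18.0, 35.0, 10.0, 42.0, 16.5, 27.0]

t_stat, df = welch_t(control_revenue, treatment_revenue)
d = cohens_d(control_revenue, treatment_revenue)
print(f"t = {t_stat:.3f}, df = {df:.2f}, Cohen's d = {d:.2f}")
```

When the group variances differ, the Welch degrees of freedom fall below the `n_c + n_t - 2` that Student's t-test would use, which makes the test more conservative exactly when the equal-variance assumption is violated.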

License

MIT

Part of a 10-project Data Analyst portfolio
