monroesolisdata/ab-testing-framework

A/B Testing Framework

Statistical Experiment Analysis — Sample Size, Significance & Business Impact

From "did it work?" to "how confident are we, and should we ship it?"


Built with Python, SciPy, and NumPy. MIT licensed.


Framework Capabilities

1. Pre-Test Planning    → calculate_sample_size() — how long must the test run?
2. SRM Check            → check_srm()             — is the experiment valid?
3. Conversion Rate Test → test_proportions()       — z-test for binary metrics
4. Revenue / Time Test  → test_continuous()        — Welch's t-test for numeric metrics
5. Decision Report      → format_report()          — SHIP / DO NOT SHIP recommendation

Example: Homepage CTA Button Colour Test

from src.ab_test import calculate_sample_size, test_proportions, format_report

# Step 1: How long do we need to run this?
plan = calculate_sample_size(
    baseline_rate=0.05,          # Current 5% conversion
    min_detectable_effect=0.10,  # Want to detect a 10% relative lift (to 5.5%)
    alpha=0.05,                  # 5% false positive rate
    power=0.80,                  # 80% chance of detecting a real effect
    daily_traffic=2000,          # 2,000 users/day across both variants
)
print(f"Run for at least {plan.expected_duration_days} days with {plan.n_per_variant:,} users per variant")
# → Run for at least 35.0 days with 35,000 users per variant

# Step 2: After the test, analyse results
result = test_proportions(
    control_conversions=1750, control_n=35000,    # Control: 5.0%
    treatment_conversions=1960, treatment_n=35000, # Treatment: 5.6%
)
print(format_report(result, "CTA Colour — Green vs Blue"))

Output:

============================================================
A/B TEST REPORT — CTA Colour — Green vs Blue
============================================================
Test type:        proportions
Control:          n=35,000 | metric=0.050000
Treatment:        n=35,000 | metric=0.056000
Absolute lift:    +0.006000
Relative lift:    +12.00%
P-value:          0.000396
95% CI:           (0.002681, 0.009319)
Effect size:      0.0268
SRM p-value:      1.0000
Significant:      YES
------------------------------------------------------------
RECOMMENDATION:   SHIP — statistically significant positive lift (+12.00%)
============================================================
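The figures in a report like this can be cross-checked independently of the framework. The sketch below recomputes the test statistic, p-value, confidence interval, and effect size from the same four counts using only scipy. How test_proportions() works internally is not shown in this README, so the pooled z-test, the Wald (unpooled) interval, and the helper name two_proportion_summary are all assumptions made for illustration.

```python
from math import asin, sqrt

from scipy.stats import norm


def two_proportion_summary(x_c, n_c, x_t, n_t, alpha=0.05):
    """Hypothetical helper: pooled z-test, Wald CI, and Cohen's h for two proportions."""
    p_c, p_t = x_c / n_c, x_t / n_t
    diff = p_t - p_c

    # Pooled standard error: used for the test statistic under H0 (no difference).
    p_pool = (x_c + x_t) / (n_c + n_t)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = diff / se_pooled
    p_value = 2 * norm.sf(abs(z))  # two-sided

    # Unpooled standard error: used for the confidence interval on the difference.
    se_unpooled = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    margin = norm.ppf(1 - alpha / 2) * se_unpooled

    # Cohen's h: difference of arcsine-transformed proportions.
    cohens_h = 2 * asin(sqrt(p_t)) - 2 * asin(sqrt(p_c))

    return {
        "z": z,
        "p_value": p_value,
        "ci_low": diff - margin,
        "ci_high": diff + margin,
        "cohens_h": cohens_h,
    }


# Same counts as the CTA colour example: 1,750/35,000 vs 1,960/35,000.
summary = two_proportion_summary(1750, 35000, 1960, 35000)
for name, value in summary.items():
    print(f"{name:>9}: {value:.6f}")
```

Using the pooled error for the test and the unpooled error for the interval is a common textbook pairing; a library may reasonably choose otherwise, in which case small differences in the later decimals are expected.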

Statistical Concepts

| Concept | Explanation | Framework Function |
|---|---|---|
| Sample size | Users needed per variant to detect a given lift at the chosen alpha and power | calculate_sample_size() |
| SRM | Sample Ratio Mismatch: experiment integrity check | check_srm() |
| Z-test | For conversion rates: tests if two proportions differ | test_proportions() |
| Welch's t-test | For revenue/time: doesn't assume equal variance | test_continuous() |
| P-value | Probability of seeing a result at least this extreme if there's no real effect | Both test functions |
| Cohen's d | Effect size for continuous metrics (0.2=small, 0.5=medium, 0.8=large) | test_continuous() |
| Cohen's h | Effect size for proportions | test_proportions() |
| 95% CI | Range of plausible values for the true effect at the 95% confidence level | Both test functions |
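For the continuous-metric concepts in the table, here is a minimal sketch of Welch's t-test and Cohen's d using scipy and numpy directly. It does not call test_continuous(), whose signature is not shown in this README. The revenue samples are synthetic and made up for illustration, and the cohens_d helper is hypothetical.

```python
import numpy as np
from scipy import stats

# Synthetic, made-up revenue-per-user samples (right-skewed, unequal variance).
rng = np.random.default_rng(42)
control = rng.gamma(shape=2.0, scale=25.0, size=5000)
treatment = rng.gamma(shape=2.0, scale=27.0, size=5000)

# equal_var=False selects Welch's t-test instead of Student's t-test.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)


def cohens_d(a, b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


print(f"t = {t_stat:.3f}, p = {p_value:.4f}, d = {cohens_d(treatment, control):.3f}")
```

The p-value says whether the difference in means is distinguishable from noise; d puts that difference on the 0.2 / 0.5 / 0.8 scale from the table, independent of sample size.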

Common A/B Testing Mistakes This Framework Prevents

| Mistake | How This Framework Handles It |
|---|---|
| Peeking at results early | calculate_sample_size() tells you how long to run; don't stop early |
| Underpowered tests | Required sample size is calculated before the test starts |
| SRM ignored | check_srm() runs automatically and flags invalid experiments |
| Stat-sig ≠ business sig | Reports relative lift: a 0.001% lift isn't worth shipping even if p < 0.05 |
| Wrong test for metric type | test_proportions() for conversion, test_continuous() for revenue |
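To make the SRM row concrete, here is a rough sketch of a chi-squared goodness-of-fit check on the traffic split, written against scipy rather than the framework's own check_srm(), whose signature is not shown here. The helper name srm_p_value and the 0.001 cutoff are assumptions; a strict threshold is a common convention for SRM checks, but the framework may use a different one.

```python
from scipy.stats import chisquare


def srm_p_value(control_n, treatment_n, expected_control_share=0.5):
    """Chi-squared goodness-of-fit p-value for the observed traffic split."""
    total = control_n + treatment_n
    expected = [total * expected_control_share, total * (1 - expected_control_share)]
    _, p_value = chisquare([control_n, treatment_n], f_exp=expected)
    return p_value


SRM_THRESHOLD = 0.001  # assumed cutoff, not taken from the framework

for control_n, treatment_n in [(35000, 35000), (35000, 34000)]:
    p = srm_p_value(control_n, treatment_n)
    verdict = "possible SRM" if p < SRM_THRESHOLD else "split looks OK"
    print(f"{control_n:,} vs {treatment_n:,}: p = {p:.4f} ({verdict})")
```

A perfectly balanced split gives a p-value of 1, which matches the "SRM p-value: 1.0000" line in the example report. A gap of 1,000 users on roughly 35,000 per arm is already enough to fall below the assumed cutoff.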

Quick Start

git clone https://github.com/YOUR_USERNAME/ab-testing-framework.git
cd ab-testing-framework
# Windows
python -m venv venv && venv\Scripts\activate
# macOS/Linux
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
pytest tests/ -v

What I Learned

  • Type I vs Type II errors: alpha (false positive rate) and beta (false negative rate), and why, at a fixed sample size and effect size, reducing one increases the other
  • Welch's t-test over Student's t: doesn't assume equal variance — almost always safer for real data
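  • SRM is a real problem: it shows up in a meaningful share of real-world A/B tests (published industry figures are typically in the single-digit to low-double-digit percent range), usually from bot filtering, cookie deletion, or logging bugs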
  • Effect size matters: p-value tells you IF there's a difference; effect size tells you HOW LARGE it is
  • Statistical vs practical significance: a conversion lift from 10.00% to 10.01% can be stat-sig given enough traffic (on the order of 100M+ users per variant) but still not be worth engineering time

License

MIT

Part of a 10-project Data Analyst portfolio

About

Production-grade A/B testing library: sample size calculator, SRM chi-squared check, proportion z-test, Welch's t-test, Cohen's d/h effect sizes, and formatted reports.
