# A/B Testing Statistical Framework: Demo & Scenarios

This notebook demonstrates how to use the `ab_testing_framework` to design, run, and analyze A/B tests correctly. We will cover common scenarios and pitfalls.

## 1. Importing the Framework

First, let's import the classes from our `ab_testing_framework.py` file.

In [1]:
import numpy as np
import pandas as pd
from ab_testing_framework import (
    SampleSizeCalculator,
    HypothesisTester,
    EffectSizeCalculator,
    MultipleTesting,
    Visualizer,
    Utils
)

# Set default test parameters
ALPHA = 0.05
POWER = 0.80

# Instantiate our tools
calculator = SampleSizeCalculator(alpha=ALPHA, power=POWER)
tester = HypothesisTester(alpha=ALPHA)
effect_calc = EffectSizeCalculator()
viz = Visualizer()
utils = Utils()

## 2. Part 1: Designing an Experiment (Sample Size)

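Before running a test, we MUST determine the sample size. Without a pre-planned sample size, it is tempting to "peek": to check the results repeatedly and stop as soon as p < 0.05. Peeking is one of the most common mistakes in A/B testing, and it pushes the false positive rate well above the nominal alpha.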
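To make the peeking problem concrete, here is a minimal simulation sketch. It does not use the framework: it assumes each batch of data contributes one standard-normal increment to the test statistic when there is no true effect, and the function name `false_positive_rates` is hypothetical.

```python
import math
import random

Z_CRIT = 1.959964  # two-sided critical value for alpha = 0.05


def false_positive_rates(n_sims=2000, n_looks=10, seed=0):
    """Return (rate_with_peeking, rate_single_look) when the true effect is zero."""
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _ in range(n_sims):
        running_sum = 0.0
        ever_significant = False
        z = 0.0
        for look in range(1, n_looks + 1):
            # Each batch adds one standard-normal increment, so the z-statistic
            # after `look` batches is running_sum / sqrt(look).
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / math.sqrt(look)
            if abs(z) > Z_CRIT:
                ever_significant = True
        peeking_hits += ever_significant      # would have stopped at some look
        final_hits += abs(z) > Z_CRIT         # only the last, pre-planned look
    return peeking_hits / n_sims, final_hits / n_sims


peek_rate, final_rate = false_positive_rates()
print(f"False positive rate, stopping at first significant look: {peek_rate:.1%}")
print(f"False positive rate, single pre-planned look:            {final_rate:.1%}")
```

With ten looks, the "stop when significant" rate should come out several times higher than the single-look rate, which stays near 5%.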

In [2]:
baseline_rate = 0.10  # 10% CVR
mde_relative = 0.10   # Target: a 10% relative lift (i.e., from 10% to 11%)
mde_absolute = 0.01   # Equivalent absolute MDE: 11% - 10% = 1 percentage point

sample_size_needed = calculator.calculate_sample_size(
    baseline_rate=baseline_rate, 
    mde_absolute=mde_absolute
)

print(f"Baseline Rate: {baseline_rate*100:.0f}%")
print(f"Absolute MDE: {mde_absolute*100:.1f}%")
print(f"Required sample size per variant (Alpha={ALPHA}, Power={POWER}): {sample_size_needed:,}")

Baseline Rate: 10%
Absolute MDE: 1.0%
Required sample size per variant (Alpha=0.05, Power=0.8): 14,745
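For intuition about where a number like this comes from, the sketch below uses one common normal-approximation formula for a two-sided two-proportion test. The formula inside `SampleSizeCalculator` is not shown in this notebook, so this is an assumption and the result may differ slightly from the framework's; the function name `approx_sample_size_per_variant` is hypothetical.

```python
import math
from statistics import NormalDist


def approx_sample_size_per_variant(baseline_rate, mde_absolute, alpha=0.05, power=0.80):
    """Per-variant n for a two-sided two-proportion z-test (unpooled variance)."""
    p1 = baseline_rate
    p2 = baseline_rate + mde_absolute
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)            # quantile for the target power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / mde_absolute ** 2)


n_sketch = approx_sample_size_per_variant(0.10, 0.01)
print(f"Approximate sample size per variant: {n_sketch:,}")
```

The result should land close to the 14,745 reported above. Note that the required n scales with 1 / MDE², so halving the MDE roughly quadruples the sample size.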


In [3]:
# We can also visualize the power curve
fig = viz.plot_power_curve(baseline_rate=0.10, mde=0.01, max_n=50000)
fig.show()

## 3. Part 2: Analyzing 12 Test Scenarios

Let's simulate various scenarios to see how our framework performs.
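The scenarios below rely on `utils.generate_synthetic_data`, whose source is not shown here. Judging from how it is called, it appears to return a `(conversions, n)` pair. A minimal stand-in, assuming independent Bernoulli trials per user, might look like this (`simulate_conversions` is a hypothetical name):

```python
import random


def simulate_conversions(n_users, true_rate, seed=None):
    """Return (conversions, n_users) for n_users independent Bernoulli trials."""
    rng = random.Random(seed)
    conversions = sum(rng.random() < true_rate for _ in range(n_users))
    return conversions, n_users


conversions, n_users = simulate_conversions(30_000, 0.10, seed=42)
print(f"{conversions:,} / {n_users:,} (Rate: {conversions / n_users:.2%})")
```

Because the draw is random, the observed rate will rarely equal the true rate exactly. That sampling noise is what the tests below have to see through.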

In [4]:
def run_and_print_analysis(scenario_name, n_a, n_b, conv_a, conv_b, mde_abs=0.01):
    """Helper function to run and print a full test analysis."""
    print(f"\n--- SCENARIO: {scenario_name} ---")
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    
    print(f"Control:   {conv_a:,} / {n_a:,} (Rate: {rate_a:.2%})")
    print(f"Treatment: {conv_b:,} / {n_b:,} (Rate: {rate_b:.2%})")
    print(f"Observed Lift: {rate_b - rate_a:+.2%}")

    # 1. Hypothesis Test
    test_results = tester.proportion_z_test(conv_a, n_a, conv_b, n_b)
    
    # 2. Confidence Interval
    ci = tester.confidence_interval(conv_a, n_a, conv_b, n_b)
    
    # 3. Effect Size
    effect_size = effect_calc.cohens_h(rate_a, rate_b)
    
    # 4. Interpretation
    stat_msg, prac_msg, color = utils.interpret_results(test_results, ci, mde_abs)
    
    print(f"Z-Test Results: {test_results}")
    print(f"Confidence Interval: [{ci[0]:.2%}, {ci[1]:.2%}]")
    print(f"Effect Size (Cohen's h): {effect_size:.3f}")
    print(f"\nInterpretation (Color: {color}):")
    print(f"- {stat_msg}")
    print(f"- {prac_msg}")
    
    # 5. Visualization
    fig = viz.plot_confidence_interval(ci[0], ci[1], rate_a, rate_b, mde_abs)
    fig.show()
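The helper above delegates the statistics to `tester`. As a rough sketch of what a two-proportion z-test and confidence interval typically compute, the code below uses a pooled standard error for the test and an unpooled (Wald) standard error for the interval. This is an assumption about the method, not the framework's confirmed implementation, and `two_proportion_z_test` is a hypothetical name. It is run on the counts from Scenario 1.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Pooled two-proportion z-test plus an unpooled (Wald) CI for rate_b - rate_a."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    diff = rate_b - rate_a

    # Pooled standard error: assumes a common rate under the null hypothesis.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z_stat = diff / se_pooled
    # Two-sided p-value; erfc keeps precision when |z| is very large.
    p_value = math.erfc(abs(z_stat) / math.sqrt(2))

    # Unpooled standard error for the confidence interval.
    se_unpooled = math.sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se_unpooled
    return {"z_stat": z_stat, "p_value": p_value, "ci": (diff - margin, diff + margin)}


result = two_proportion_z_test(2914, 30000, 3576, 30000)
print(f"z = {result['z_stat']:.3f}, p = {result['p_value']:.3g}")
print(f"95% CI: [{result['ci'][0]:.2%}, {result['ci'][1]:.2%}]")
```

On the Scenario 1 counts this should agree with the z-statistic and interval printed by the framework, which suggests (but does not confirm) that similar formulas are used internally.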

### Scenario 1: Clear True Positive (Landing Page)
- **Setup:** 10% control CVR. True lift is 2 percentage points (to 12%). We have *more* than enough power.
- **Expected:** Statistically & Practically Significant.

In [5]:
n = 30000
c_a, n_a = utils.generate_synthetic_data(n, 0.10)
c_b, n_b = utils.generate_synthetic_data(n, 0.12)
run_and_print_analysis("1. True Positive", n_a, n_b, c_a, c_b, mde_abs=0.01)


--- SCENARIO: 1. True Positive ---
Control:   2,914 / 30,000 (Rate: 9.71%)
Treatment: 3,576 / 30,000 (Rate: 11.92%)
Observed Lift: +2.21%
Z-Test Results: {'z_stat': np.float64(8.701493312423782), 'p_value': np.float64(3.275450922868069e-18), 'is_significant': np.True_, 'alpha': 0.05}
Confidence Interval: [1.71%, 2.70%]
Effect Size (Cohen's h): 0.071

Interpretation (Color: green):
- **Statistically Significant (p=0.0000).** We are >95% confident the observed change is not due to random chance.
- **Practically Significant (Positive).** The entire 95% CI (1.71% to 2.70%) is above the MDE of 1.00%.
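The effect size reported above is Cohen's h, the difference between arcsine-transformed proportions. A minimal sketch of the standard definition follows. The sign convention (treatment minus control) is inferred from the printed outputs, and `cohens_h_sketch` is a hypothetical name.

```python
import math


def cohens_h_sketch(rate_a, rate_b):
    """Cohen's h: difference of arcsine-transformed proportions (b minus a)."""
    phi_a = 2 * math.asin(math.sqrt(rate_a))
    phi_b = 2 * math.asin(math.sqrt(rate_b))
    return phi_b - phi_a


h = cohens_h_sketch(2914 / 30000, 3576 / 30000)
print(f"Cohen's h: {h:.3f}")
```

By Cohen's conventional benchmarks (0.2 small, 0.5 medium, 0.8 large), an h near 0.07 is tiny. In conversion optimization, lifts of this size can still matter commercially, which is why the MDE, not h alone, drives the practical-significance call.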


### Scenario 2: Underpowered (Button Color)
- **Setup:** 10% control CVR. True lift is 0.5 percentage points (to 10.5%). We use a small sample size (500 per variant).
- **Expected:** Not Statistically Significant, even though a real effect exists. This is a **Type II Error (False Negative)**.

In [6]:
n = 500
c_a, n_a = utils.generate_synthetic_data(n, 0.10)
c_b, n_b = utils.generate_synthetic_data(n, 0.105)
run_and_print_analysis("2. Underpowered (False Negative)", n_a, n_b, c_a, c_b, mde_abs=0.01)


--- SCENARIO: 2. Underpowered (False Negative) ---
Control:   41 / 500 (Rate: 8.20%)
Treatment: 51 / 500 (Rate: 10.20%)
Observed Lift: +2.00%
Z-Test Results: {'z_stat': np.float64(1.094115478516522), 'p_value': np.float64(0.27390433465534014), 'is_significant': np.False_, 'alpha': 0.05}
Confidence Interval: [-1.61%, 5.59%]
Effect Size (Cohen's h): 0.069

Interpretation (Color: red):
- **Not Statistically Significant (p=0.2739).** We cannot conclude the observed change is due to the test.
- **Not Practically Significant.** The 95% CI includes 0, meaning no effect is a plausible outcome.


### Scenario 3: Clear True Negative
- **Setup:** 10% control CVR. True lift is 0% (to 10%). We have a large sample size.
- **Expected:** Not Statistically Significant. CI should be tight around 0.

In [7]:
n = 30000
c_a, n_a = utils.generate_synthetic_data(n, 0.10)
c_b, n_b = utils.generate_synthetic_data(n, 0.10)
run_and_print_analysis("3. True Negative", n_a, n_b, c_a, c_b, mde_abs=0.01)


--- SCENARIO: 3. True Negative ---
Control:   3,037 / 30,000 (Rate: 10.12%)
Treatment: 3,014 / 30,000 (Rate: 10.05%)
Observed Lift: -0.08%
Z-Test Results: {'z_stat': np.float64(-0.3118158486236596), 'p_value': np.float64(0.7551804786897183), 'is_significant': np.False_, 'alpha': 0.05}
Confidence Interval: [-0.56%, 0.41%]
Effect Size (Cohen's h): -0.003

Interpretation (Color: red):
- **Not Statistically Significant (p=0.7552).** We cannot conclude the observed change is due to the test.
- **Not Practically Significant.** The 95% CI includes 0, meaning no effect is a plausible outcome.


### Scenario 4: False Positive (By Chance)
- **Setup:** 10% control CVR. True lift is 0%. We run up to 100 A/A tests and stop at the first one where p < 0.05 by luck.
- **Expected:** A **Type I Error (False Positive)**. This is what Alpha=0.05 means: when there is no true effect, about 5% of tests will still come out significant.

In [8]:
print("Running 100 simulations to find a False Positive...")
n = 2000  # Smaller samples give noisier rates; the false positive rate stays near alpha at any n
found_fp = False
for i in range(100):
    c_a, n_a = utils.generate_synthetic_data(n, 0.10)
    c_b, n_b = utils.generate_synthetic_data(n, 0.10)
    
    test_results = tester.proportion_z_test(c_a, n_a, c_b, n_b)
    
    if test_results['is_significant']:
        print(f"Found False Positive on run {i+1}!")
        run_and_print_analysis("4. False Positive (Type I Error)", n_a, n_b, c_a, c_b, mde_abs=0.01)
        found_fp = True
        break

if not found_fp:
    print("Did not find a false positive in 100 runs. This is also normal!")

Running 100 simulations to find a False Positive...
Found False Positive on run 23!

--- SCENARIO: 4. False Positive (Type I Error) ---
Control:   169 / 2,000 (Rate: 8.45%)
Treatment: 211 / 2,000 (Rate: 10.55%)
Observed Lift: +2.10%
Z-Test Results: {'z_stat': np.float64(2.2648174497820897), 'p_value': np.float64(0.02352388415805994), 'is_significant': np.True_, 'alpha': 0.05}
Confidence Interval: [0.28%, 3.92%]
Effect Size (Cohen's h): 0.072

Interpretation (Color: orange):
- **Statistically Significant (p=0.0235).** We are >95% confident the observed change is not due to random chance.
- **Inconclusive (Practicality).** The 95% CI is either fully within the MDE bounds or overlaps 0, but not the MDE.


### Scenario 5: Statistically Significant, NOT Practically Significant
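- **Setup:** 10% control CVR. True lift is 0.1 percentage points (to 10.1%). We use a *massive* sample size (1,000,000 per variant).
- **Expected:** Statistically Significant (p < 0.05), with the whole CI below our MDE of 1%. This is a "who cares?" result. Note that even at this sample size the test has only about 65% power to detect a 0.1-point lift, so a single run can still miss significance, as the run below does (p = 0.0641).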

In [9]:
n = 1000000
c_a, n_a = utils.generate_synthetic_data(n, 0.100)
c_b, n_b = utils.generate_synthetic_data(n, 0.101)
run_and_print_analysis("5. Stat. Sig, Not Practical Sig.", n_a, n_b, c_a, c_b, mde_abs=0.01)


--- SCENARIO: 5. Stat. Sig, Not Practical Sig. ---
Control:   99,748 / 1,000,000 (Rate: 9.97%)
Treatment: 100,534 / 1,000,000 (Rate: 10.05%)
Observed Lift: +0.08%
Z-Test Results: {'z_stat': np.float64(1.851460086070718), 'p_value': np.float64(0.06410339179015863), 'is_significant': np.False_, 'alpha': 0.05}
Confidence Interval: [-0.00%, 0.16%]
Effect Size (Cohen's h): 0.003

Interpretation (Color: red):
- **Not Statistically Significant (p=0.0641).** We cannot conclude the observed change is due to the test.
- **Not Practically Significant.** The 95% CI includes 0, meaning no effect is a plausible outcome.
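Whether a run like this one reaches significance depends on the test's power. The sketch below approximates power with a normal approximation and an unpooled standard error. It is a simplification, not the framework's method, and `approx_power` is a hypothetical name.

```python
import math
from statistics import NormalDist


def approx_power(rate_a, rate_b, n_per_variant, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test with equal n."""
    se = math.sqrt((rate_a * (1 - rate_a) + rate_b * (1 - rate_b)) / n_per_variant)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = abs(rate_b - rate_a) / se   # true effect in standard-error units
    # Probability that the z-statistic lands beyond either critical value.
    return NormalDist().cdf(shift - z_crit) + NormalDist().cdf(-shift - z_crit)


power_s5 = approx_power(0.100, 0.101, 1_000_000)
power_s2 = approx_power(0.100, 0.105, 500)
print(f"Scenario 5 (n=1,000,000, 10.0% vs 10.1%): {power_s5:.0%}")
print(f"Scenario 2 (n=500, 10.0% vs 10.5%):       {power_s2:.0%}")
```

Under these assumptions, Scenario 5 has roughly a two-in-three chance of detecting its true lift, so a non-significant run is unremarkable. Scenario 2's power is barely above alpha, which is what "underpowered" means in practice.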


### Scenario 6: Unequal Sample Sizes
- **Setup:** 10% control CVR. True lift is 2 percentage points (to 12%). Control has twice the sample size of Treatment.
- **Expected:** Still significant. Our formulas handle unequal N.

In [10]:
c_a, n_a = utils.generate_synthetic_data(40000, 0.10)
c_b, n_b = utils.generate_synthetic_data(20000, 0.12)
run_and_print_analysis("6. Unequal Sample Sizes", n_a, n_b, c_a, c_b, mde_abs=0.01)


--- SCENARIO: 6. Unequal Sample Sizes ---
Control:   3,969 / 40,000 (Rate: 9.92%)
Treatment: 2,408 / 20,000 (Rate: 12.04%)
Observed Lift: +2.12%
Z-Test Results: {'z_stat': np.float64(7.9334168722372), 'p_value': np.float64(2.131972837712901e-15), 'is_significant': np.True_, 'alpha': 0.05}
Confidence Interval: [1.58%, 2.66%]
Effect Size (Cohen's h): 0.068

Interpretation (Color: green):
- **Statistically Significant (p=0.0000).** We are >95% confident the observed change is not due to random chance.
- **Practically Significant (Positive).** The entire 95% CI (1.58% to 2.66%) is above the MDE of 1.00%.


### Scenario 7: Very Low Conversion Rate
- **Setup:** 0.5% control CVR. We want to detect a 20% relative lift (to 0.6%).
- **Expected:** Requires *much* larger sample sizes, because the absolute difference to detect (0.1 percentage points) is tiny.

In [11]:
baseline_low = 0.005
mde_abs_low = 0.001 # 0.6% - 0.5%

sample_size_low = calculator.calculate_sample_size(
    baseline_rate=baseline_low, 
    mde_absolute=mde_abs_low
)
print(f"Required sample size for low CVR test: {sample_size_low:,}")

c_a, n_a = utils.generate_synthetic_data(sample_size_low, baseline_low)
c_b, n_b = utils.generate_synthetic_data(sample_size_low, baseline_low + mde_abs_low)
run_and_print_analysis("7. Low CVR Test", n_a, n_b, c_a, c_b, mde_abs=mde_abs_low)

Required sample size for low CVR test: 85,686

--- SCENARIO: 7. Low CVR Test ---
Control:   414 / 85,686 (Rate: 0.48%)
Treatment: 510 / 85,686 (Rate: 0.60%)
Observed Lift: +0.11%
Z-Test Results: {'z_stat': np.float64(3.166716793508096), 'p_value': np.float64(0.001541703819929853), 'is_significant': np.True_, 'alpha': 0.05}
Confidence Interval: [0.04%, 0.18%]
Effect Size (Cohen's h): 0.015

Interpretation (Color: orange):
- **Statistically Significant (p=0.0015).** We are >95% confident the observed change is not due to random chance.
- **Inconclusive (Practicality).** The 95% CI is either fully within the MDE bounds or overlaps 0, but not the MDE.


### Scenario 8: Very High Conversion Rate
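- **Setup:** 50% control CVR. We want to detect a 5% relative lift (to 52.5%).
- **Expected:** Requires a smaller sample than the 10% baseline case, because the absolute MDE is larger (2.5 percentage points vs. 1). At a fixed absolute MDE, a 50% baseline would need *more* samples, since the binomial variance p(1-p) peaks at 50%.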

In [12]:
baseline_high = 0.50
mde_abs_high = 0.025 # 52.5% - 50%

sample_size_high = calculator.calculate_sample_size(
    baseline_rate=baseline_high, 
    mde_absolute=mde_abs_high
)
print(f"Required sample size for high CVR test: {sample_size_high:,}")

c_a, n_a = utils.generate_synthetic_data(sample_size_high, baseline_high)
c_b, n_b = utils.generate_synthetic_data(sample_size_high, baseline_high + mde_abs_high)
run_and_print_analysis("8. High CVR Test", n_a, n_b, c_a, c_b, mde_abs=mde_abs_high)

Required sample size for high CVR test: 6,274

--- SCENARIO: 8. High CVR Test ---
Control:   3,185 / 6,274 (Rate: 50.77%)
Treatment: 3,302 / 6,274 (Rate: 52.63%)
Observed Lift: +1.86%
Z-Test Results: {'z_stat': np.float64(2.090157562457695), 'p_value': np.float64(0.03660364826974656), 'is_significant': np.True_, 'alpha': 0.05}
Confidence Interval: [0.12%, 3.61%]
Effect Size (Cohen's h): 0.037

Interpretation (Color: orange):
- **Statistically Significant (p=0.0366).** We are >95% confident the observed change is not due to random chance.
- **Inconclusive (Practicality).** The 95% CI is either fully within the MDE bounds or overlaps 0, but not the MDE.
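The colors in the interpretations follow a pattern across the scenarios: red when the CI includes 0, green when the whole CI sits above the MDE, and orange otherwise. The rule below is a hypothetical reconstruction that is consistent with the printed outputs. It is not the actual code of `utils.interpret_results`, and it does not try to cover negative lifts, which none of the outputs show.

```python
def classify_result(ci_lower, ci_upper, mde_abs):
    """Map a CI for the lift to a traffic-light label (hypothetical rule)."""
    if ci_lower <= 0.0 <= ci_upper:
        return "red"      # CI includes 0: "no effect" remains plausible
    if ci_lower > mde_abs:
        return "green"    # whole CI clears the MDE: practically significant
    return "orange"       # CI excludes 0 but does not clear the MDE


# CIs and MDEs copied from the scenario outputs above.
print(classify_result(0.0171, 0.0270, 0.01))    # Scenario 1
print(classify_result(0.0004, 0.0018, 0.001))   # Scenario 7
print(classify_result(0.0012, 0.0361, 0.025))   # Scenario 8
```

Read this way, "orange" in Scenarios 7 and 8 means the lift is real but the data cannot yet show it is at least as large as the MDE. That is the expected outcome when the true lift equals the MDE and the sample is sized for 80% power.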


### Scenario 9: Chi-Square Test (Same as Z-Test)
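- **Setup:** Regenerate data with the same parameters as Scenario 1 (10% vs. 12%, n = 30,000 per variant), then run both tests on that data.
- **Expected:** The two p-values should closely agree. For a 2x2 table, the uncorrected chi-square statistic equals z², so the tests are equivalent; small differences can appear if the chi-square test applies a continuity correction.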

In [13]:
print("Re-running Scenario 1 data with Chi-Square Test")
n = 30000
c_a, n_a = utils.generate_synthetic_data(n, 0.10)
c_b, n_b = utils.generate_synthetic_data(n, 0.12)

chi2_results = tester.chi_square_test(c_a, n_a, c_b, n_b)
z_results = tester.proportion_z_test(c_a, n_a, c_b, n_b)

print(f"Z-Test p-value:    {z_results['p_value']}")
print(f"Chi2-Test p-value: {chi2_results['p_value']}")
print("Result: They are effectively identical.")

Re-running Scenario 1 data with Chi-Square Test
Z-Test p-value:    2.0469276762911842e-17
Chi2-Test p-value: 2.2910766007083818e-17
Result: They are effectively identical.
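The two p-values above differ by roughly 12%. One plausible explanation, offered as an assumption since the framework's source is not shown, is a Yates continuity correction in the chi-square test. The sketch below uses the Scenario 1 counts (the Scenario 9 counts are not printed) to check that the uncorrected chi-square statistic equals z² and that the correction nudges the p-value upward. The function names are hypothetical.

```python
import math


def chi_square_2x2(conv_a, n_a, conv_b, n_b, yates=False):
    """Pearson chi-square test (1 df) on a 2x2 conversion table."""
    a, b = conv_a, n_a - conv_a      # control: converted, not converted
    c, d = conv_b, n_b - conv_b      # treatment: converted, not converted
    total = n_a + n_b
    numerator = abs(a * d - b * c)
    if yates:
        # Yates' continuity correction shrinks the statistic slightly.
        numerator = max(0.0, numerator - total / 2)
    stat = total * numerator ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square with 1 df, written via erfc.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def pooled_z(conv_a, n_a, conv_b, n_b):
    """z-statistic of the pooled two-proportion test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (conv_b / n_b - conv_a / n_a) / se


z = pooled_z(2914, 30000, 3576, 30000)
stat_plain, p_plain = chi_square_2x2(2914, 30000, 3576, 30000)
stat_yates, p_yates = chi_square_2x2(2914, 30000, 3576, 30000, yates=True)
print(f"z^2 = {z**2:.4f}, chi2 = {stat_plain:.4f}, chi2 (Yates) = {stat_yates:.4f}")
print(f"p = {p_plain:.3g}, p (Yates) = {p_yates:.3g}")
```

At these sample sizes the correction changes the p-value by a modest factor and never the conclusion, so either variant supports the same decision.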


## 4. Part 3: Multiple Testing Correction

What happens when we run an A/B/C/D test? We are testing:
1. A vs B
2. A vs C
3. A vs D

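Each additional comparison is another chance for a false positive, so the probability of at least one false positive across the whole family of tests exceeds alpha. We must correct for this.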
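To put numbers on that inflation, here is a small sketch of the family-wise error rate. It assumes the tests are independent and all nulls are true. Comparisons that share a control group are positively correlated, so for an A/B/C/D test this is only an approximation. The function name is hypothetical.

```python
def family_wise_error_rate(n_tests, alpha=0.05):
    """P(at least one false positive) across n_tests independent true-null tests."""
    return 1 - (1 - alpha) ** n_tests


for k in (1, 3, 10, 100):
    print(f"{k:>3} tests: {family_wise_error_rate(k):.1%}")
```

The same arithmetic explains Scenario 4: with 100 independent null tests, finding at least one "significant" result is nearly guaranteed.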

### Scenario 10: A/B/C Test with one winner
- **Setup:** A (10%), B (12%), C (10%).
- **Expected:** A vs B is significant, A vs C is not.

In [14]:
n = 30000
c_a, n_a = utils.generate_synthetic_data(n, 0.10)
c_b, n_b = utils.generate_synthetic_data(n, 0.12)
c_c, n_c = utils.generate_synthetic_data(n, 0.10)

p_values = []
p_values.append(tester.proportion_z_test(c_a, n_a, c_b, n_b)['p_value']) # A vs B
p_values.append(tester.proportion_z_test(c_a, n_a, c_c, n_c)['p_value']) # A vs C

print(f"Original p-values: {p_values}")

corrector = MultipleTesting(p_values, alpha=ALPHA)

bonferroni = corrector.bonferroni_correction()
print(f"\nBonferroni Results: {bonferroni}")

fdr = corrector.benjamini_hochberg()
print(f"FDR (B-H) Results: {fdr}")

print("\nInterpretation: Both methods correctly identify A vs B as significant and A vs C as not.")

Original p-values: [np.float64(7.872954635684655e-15), np.float64(0.15185263120928238)]

Bonferroni Results: {'method': 'Bonferroni', 'corrected_alpha': 0.025, 'corrected_p_values': [np.float64(1.574590927136931e-14), np.float64(0.30370526241856477)], 'is_significant': [np.True_, np.False_]}
FDR (B-H) Results: {'method': 'Benjamini-Hochberg (FDR)', 'corrected_p_values': [np.float64(1.574590927136931e-14), np.float64(0.15185263120928238)], 'is_significant': [np.True_, np.False_]}

Interpretation: Both methods correctly identify A vs B as significant and A vs C as not.
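For reference, the two corrections can be sketched in a few lines using their standard textbook definitions. The function names are hypothetical, and the `MultipleTesting` class may be implemented differently.

```python
def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


p_values = [7.872954635684655e-15, 0.15185263120928238]
bonf = bonferroni_adjust(p_values)
bh = benjamini_hochberg_adjust(p_values)
print("Bonferroni:", bonf)
print("B-H:       ", bh)
```

Bonferroni scales every p-value by the number of tests, while Benjamini-Hochberg scales by m / rank, so it is less conservative for the larger p-values. An adjusted p-value below alpha counts as significant under either method.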


### Scenario 11: A/B/C Test with a False Positive
- **Setup:** A (10%), B (10%), C (10%). By chance, one might get p < 0.05.
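- **Expected:** Bonferroni/FDR corrections pull the family-wise false positive rate back toward alpha, so they screen out many, but not all, chance findings. The loop below keeps searching until it hits a case that both corrections catch.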

In [15]:
print("Running simulations to find a multi-test False Positive...")
n = 10000
found_multi_fp = False
for i in range(100):
    c_a, n_a = utils.generate_synthetic_data(n, 0.10)
    c_b, n_b = utils.generate_synthetic_data(n, 0.10)
    c_c, n_c = utils.generate_synthetic_data(n, 0.10)

    p_values = []
    p_values.append(tester.proportion_z_test(c_a, n_a, c_b, n_b)['p_value'])
    p_values.append(tester.proportion_z_test(c_a, n_a, c_c, n_c)['p_value'])
    
    if any(p < ALPHA for p in p_values):
        print(f"\nFound a potential FP on run {i+1}!")
        print(f"Original p-values: {p_values}")
        
        corrector = MultipleTesting(p_values, alpha=ALPHA)
        bonferroni = corrector.bonferroni_correction()
        fdr = corrector.benjamini_hochberg()
        
        print(f"Bonferroni Significant: {any(bonferroni['is_significant'])}")
        print(f"FDR Significant: {any(fdr['is_significant'])}")
        
        if not any(bonferroni['is_significant']) and not any(fdr['is_significant']):
            print("SUCCESS: Correction methods prevented the False Positive.")
            found_multi_fp = True
            break

if not found_multi_fp:
    print("\nDid not find a clear example in 100 runs.")

Running simulations to find a multi-test False Positive...

Found a potential FP on run 8!
Original p-values: [np.float64(0.011538648603128308), np.float64(0.06382076424987304)]
Bonferroni Significant: True
FDR Significant: True

Found a potential FP on run 10!
Original p-values: [np.float64(0.15449575645214605), np.float64(0.017934997211637018)]
Bonferroni Significant: True
FDR Significant: True

Found a potential FP on run 20!
Original p-values: [np.float64(0.2165039545247458), np.float64(0.04860556572805946)]
Bonferroni Significant: False
FDR Significant: False
SUCCESS: Correction methods prevented the False Positive.


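### Scenario 12: Inconclusive Result (Wide CI)
- **Setup:** 10% control CVR. True lift is 0.5 percentage points (to 10.5%) with a small sample (1,000 per variant).
- **Expected:** Not Statistically Significant. The CI is very wide and overlaps 0, so the data cannot distinguish a small real effect from no effect. We are **inconclusive**, which is not the same as a clear negative.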

In [16]:
n = 1000
c_a, n_a = utils.generate_synthetic_data(n, 0.10)
c_b, n_b = utils.generate_synthetic_data(n, 0.105)
run_and_print_analysis("12. Inconclusive (Wide CI)", n_a, n_b, c_a, c_b, mde_abs=0.01)


--- SCENARIO: 12. Inconclusive (Wide CI) ---
Control:   112 / 1,000 (Rate: 11.20%)
Treatment: 110 / 1,000 (Rate: 11.00%)
Observed Lift: -0.20%
Z-Test Results: {'z_stat': np.float64(-0.14236480184079878), 'p_value': np.float64(0.8867918632212457), 'is_significant': np.False_, 'alpha': 0.05}
Confidence Interval: [-2.96%, 2.56%]
Effect Size (Cohen's h): -0.006

Interpretation (Color: red):
- **Not Statistically Significant (p=0.8868).** We cannot conclude the observed change is due to the test.
- **Not Practically Significant.** The 95% CI includes 0, meaning no effect is a plausible outcome.
