# Table of Contents

1. **[Experiment Brief](#1.-Experiment-Brief)**

2. **[Working Simulator (simulate data)](#2.-Working-Simulator)**
   - 2.1 [Setting Defaults](#2.1.-Setting-Defaults-(config.py))
   - 2.2 [Simulating data via config.py and simulate.py](#2.2-Simulating-data-via-config.py-and-simulate.py)
   - 2.3 [Sanity Checks](#2.3-Quick-Sanity-Checks)

3. **[A/A Test](#3.-A/A-Test)**

4. **[A/B Test Analysis](#4.-A/B-Test-Analysis)**

5. **[Checking Secondary and Guardrail Metrics](#5.-Calculating-Metrics-(Primary,-Secondary,-Guardrail))**
   - 5.1 [Secondary Metric: Purchase CVR per Exposure](#5.1-Secondary-Metric:-Purchase-CVR-per-Exposure)
   - 5.2 [Guardrail Metric: Purchase Given Signup](#5.2-Guardrail-Metric:-Purchase-Given-Signup-(purchase-quality-among-signups))
   - 5.3 [Metrics Conclusion](#5.3-Metrics-Conclusion)

6. **[Decision](#6.-Applying-Decision-Criteria)**

7. **[Testing Statistical Power](#7.-Repeated-Simulations-to-test-statistical-power,-estimator-variance,-and-confidence-interval-coverage)**

8. **[Stress-testing](#8.-Stress-testing-experiment-against-realistic-failure-modes)**
   - 8.1 [Non-Compliance](#8.1-Non-Compliance)
   - 8.2 [Novelty Effects](#8.2-Novelty-Effects)
   - 8.3 [Summary of Stress Tests](#8.3-Summary-of-Stress-Tests)

9. **[Limitations & Robustness](#9.-Limitations-&-Robustness)**
   - 9.1 [Limitations](#9.1-Limitations)
   - 9.2 [Robustness Checks](#9.2-Robustness-Checks)
   - 9.3 [Summary](#9.3-Summary)
     
10. **[Conclusion](#10.-Conclusion)**

---

# 1. Experiment Brief

## Experiment Overview

**Experiment Name:** Landing Page Copy Test — "Limited Time Offer"  
**Owner:** Julian Lu  
**Date:** January 4, 2026

---

## Business Question

Does adding the phrase "Limited time offer" to the landing page increase signup
conversion rate for users arriving from paid ads?

---

## Hypothesis

**Primary Hypothesis:** Users exposed to the landing page with urgency-based copy
(Variant B) will have a higher signup conversion rate than users shown the control
page (Variant A).

**Direction:** One-sided test (B > A)

**Null Hypothesis (H₀):**
$$
\text{CVR}_B - \text{CVR}_A \le 0
$$

**Alternative Hypothesis (H₁):**
$$
\text{CVR}_B - \text{CVR}_A > 0
$$

---

## Experiment Design

| **Parameter** | **Value** |
|--------------|-----------|
| **Unit of Randomization** | User (first exposure only) |
| **Randomization Method** | Deterministic hash-based assignment |
| **Traffic Split** | 50% Control / 50% Treatment |
| **Target Population** | Users arriving from paid ad clicks |
| **Exclusion Criteria** | Repeat exposures (analyze first exposure only) |

### Variants

- **Control (A):** Standard landing page copy  
- **Treatment (B):** Landing page copy with "Limited time offer" messaging

---

## User Funnel

```
Ad Click → Landing Page → Signup → Purchase
                 ↑
         Experiment intervention
```

The experiment directly modifies the landing page, so the primary impact is expected at the signup step.

---

## Metrics

### Primary Metric

**Signup Conversion Rate (CVR)**

$$
\text{Signup CVR} =
\frac{\text{\# unique users who signed up}}
{\text{\# unique users exposed to landing page}}
$$


### Secondary Metrics

**Purchase Conversion Rate**

$$
\text{Purchase CVR} =
\frac{\text{\# unique users who purchased}}
{\text{\# unique users exposed to landing page}}
$$

> *Secondary metrics are monitored for downstream effects but are not used for the
primary launch decision.*


### Guardrail Metrics


**Purchase Given Signup**

$$
\text{Purchase | Signup} =
\frac{\text{\# unique users who purchased}}
{\text{\# unique users who signed up}}
$$

---

## Sample Size and Power

| **Parameter** | **Value** |
|--------------|-----------|
| **Baseline Signup CVR** | 12.0% |
| **Minimum Detectable Effect (MDE)** | +1.2 pp (12% → 13.2%) |
| **Significance Level (α)** | 0.05 (one-sided) |
| **Target Statistical Power (1−β)** | 80% |
| **Planned Sample Size** | 200,000 total users (~100,000 per group) |
| **Stopping Rule** | Stop when required sample size per group is reached |

### Power Calculation (Planning Approximation)

For a two-proportion z-test:

$$
n =
\frac{(z_{1-\alpha} + z_{1-\beta})^2
\left[ \hat{p}_A(1-\hat{p}_A) + \hat{p}_B(1-\hat{p}_B) \right]}
{\text{MDE}^2}
$$

Where:
- $z_{1-\alpha} = 1.645$ (one-sided, α = 0.05)
- $z_{1-\beta} = 0.84$ (power = 80%)
- $\hat{p}_A = 0.12$, $\hat{p}_B = 0.132$

**Validation:** Planned power was additionally verified using repeated Monte
Carlo simulations under realistic data-generating assumptions.
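
As a rough check on the planning formula, the sketch below plugs the stated inputs into it using only the Python standard library. The function name `required_n_per_group` is illustrative and is not part of the project code.

```python
import math
from statistics import NormalDist


def required_n_per_group(p_a: float, p_b: float,
                         alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group sample size for a one-sided two-proportion z-test (planning approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # about 1.645 for one-sided alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)       # about 0.84 for 80% power
    variance_sum = p_a * (1 - p_a) + p_b * (1 - p_b)
    mde = p_b - p_a
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / mde ** 2)


# Inputs from the table above: 12% baseline, +1.2 pp MDE
n_per_group = required_n_per_group(0.12, 0.132)
print(n_per_group)
```

With these inputs the formula gives roughly 9,450 users per group, so the planned 100,000 per group sits well above the formula's minimum.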

---

## Analysis Plan

### Estimation

**Effect Size (Lift):**

$$
\text{Lift} = \hat{p}_B - \hat{p}_A
$$

**Unpooled Standard Error (for estimation and CI):**

$$
SE =
\sqrt{
\frac{\hat{p}_A(1-\hat{p}_A)}{n_A} +
\frac{\hat{p}_B(1-\hat{p}_B)}{n_B}
}
$$

**One-Sided 95% Lower Confidence Bound:**

$$
\text{LB}_{95\%} = \text{Lift} - 1.645 \times SE
$$

---

### Hypothesis Test

**Test Statistic (Z-test for proportions):**

$$
z =
\frac{\hat{p}_B - \hat{p}_A}
{\sqrt{
\hat{p}_{\text{pooled}}(1-\hat{p}_{\text{pooled}})
\left(\frac{1}{n_A} + \frac{1}{n_B}\right)
}}
$$

Where:

$$
\hat{p}_{\text{pooled}} =
\frac{\text{conversions}_A + \text{conversions}_B}
{n_A + n_B}
$$

**P-value:** One-sided test using the standard normal distribution.
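
The estimation and test formulas above can be sketched in a few lines. This is a minimal illustration, not the project's `run_ab_analysis`; the function name and the example counts (1,200 and 1,320 signups out of 10,000 users each) are made up for the demonstration.

```python
import math
from statistics import NormalDist


def two_proportion_summary(conv_a: int, n_a: int, conv_b: int, n_b: int,
                           alpha: float = 0.05) -> dict:
    """Lift, unpooled SE, one-sided lower bound, pooled z statistic, one-sided p-value."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    lift = p_b - p_a

    # Unpooled SE: used for estimation and the confidence bound
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    lower_bound = lift - NormalDist().inv_cdf(1 - alpha) * se

    # Pooled SE: used for the test statistic under H0 (no difference)
    p_pooled = (conv_a + conv_b) / (n_a + n_b)
    se0 = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
    z = lift / se0
    p_value = 1 - NormalDist().cdf(z)  # one-sided, H1: B > A

    return {"lift": lift, "se": se, "lower_bound": lower_bound,
            "z": z, "p_value": p_value}


# Illustrative counts only
example = two_proportion_summary(conv_a=1_200, n_a=10_000, conv_b=1_320, n_b=10_000)
print(example)
```

The two standard errors are kept separate on purpose: the pooled version reflects the null hypothesis of equal rates, while the unpooled version describes the uncertainty of the estimated lift itself.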

---

### Validity Checks

1. **Sample Ratio Mismatch (SRM) Test**
   - Chi-square test comparing observed vs. expected 50/50 split
   - Acceptance threshold: p-value ≥ 0.05

2. **A/A Test Validation**
   - Run separately to validate randomization and logging
   - False positive rate should be approximately 5%

---

## Decision Criteria

### ✅ Ship Variant B if:

1. **Statistical Significance**
   - One-sided 95% lower confidence bound for lift > 0

2. **Practical Significance**
   - Lower bound of lift > 0.5 percentage points

3. **Business Value**
   - Expected incremental revenue over 30 days exceeds implementation cost

4. **Guardrail Safety**
   - No statistically or practically significant degradation in guardrail metrics

### ❌ Do Not Ship if:

- Lift is inconclusive or negative
- Guardrail metrics show meaningful degradation
- Business case (ROI) is not met
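
The criteria above can be read as a simple rule. The sketch below is one hypothetical way to encode it; `ship_decision` and its arguments are illustrative names and are not the project's decision code. It assumes the guardrail and revenue checks have already been reduced to booleans.

```python
from typing import Optional


def ship_decision(ci_low: float,
                  practical_threshold: float = 0.005,
                  guardrail_ok: bool = True,
                  roi_ok: Optional[bool] = None) -> str:
    """Apply the ship / do-not-ship criteria to a lower confidence bound on lift.

    ci_low and practical_threshold are proportions (0.005 = 0.5 pp).
    roi_ok=None means the revenue check has not been evaluated yet.
    """
    statistically_significant = ci_low > 0
    practically_significant = ci_low > practical_threshold

    if not (statistically_significant and practically_significant and guardrail_ok):
        return "DO NOT SHIP"
    if roi_ok is False:
        return "DO NOT SHIP"
    if roi_ok is None:
        return "SHIP (pending revenue check)"
    return "SHIP"


print(ship_decision(ci_low=0.003))                 # lower bound below 0.5 pp
print(ship_decision(ci_low=0.008, roi_ok=True))    # all criteria met
```

Keeping the revenue check as a three-state input makes it explicit when a decision rests only on the statistical criteria.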

---

## Risks and Limitations

### Potential Risks

1. **Quality vs. Quantity Trade-off**  
   Urgency messaging may increase low-intent signups, reducing downstream purchase
   quality (monitored via Purchase | Signup).

2. **Novelty Effects**  
   Treatment lift may decay over time due to novelty effects.

3. **Simulation Assumptions**  
   Simulated data assumes stable user behavior aside from modeled novelty decay.

### Mitigation Strategies

- Explicitly model novelty decay in the simulator to stress-test detectability
- Validate power and false-positive rates via repeated simulations
- Monitor secondary and guardrail metrics post-launch


# 2. Working Simulator

## 2.1. Setting Defaults (config.py)

**Traffic per Day:** 10,000

**Baseline Signup CVR:** 12% (for reference user)

**Baseline Lift from Treatment:** +1.2 pp

**Purchase given Signup:** 20%

**Practical Lift Needed:** 0.5 pp

**Compliance Rate:** 100% 

**Experiment Length:** 28 days  

**Novelty Decay:** 0%

## 2.2 Simulating data via config.py and simulate.py

Create realistic data for the results of our A/B test

#### 2.2.1 Generating users (simulate.py)

Added covariates create different baseline conversion rates and realistic variance, and they make it possible to segment behavior by trait (e.g., mobile vs. desktop conversion). Together, these make the generated data more realistic.

#### 2.2.2 Deterministic user-level assignment via hashing

- Prevents users from switching variants
- Reproducible without relying on RNG order
- Mirrors how many real experimentation systems work

Why use a hash instead of user ID to assign A/B split: 

Even though user IDs are unique and deterministic (same user -> same variant every time), they often encode structure like signup time or source (e.g., lower user_id = early adopters). Hashing removes that structure while preserving determinism, which protects the independence assumption required for causal inference.
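
A minimal sketch of hash-based assignment is shown below. The hash function (SHA-256), the salt string, and the function name are assumptions made for illustration; the project's actual implementation in `simulate.py` may differ.

```python
import hashlib


def assign_variant(user_id: int,
                   salt: str = "landing_copy_test",  # hypothetical experiment key
                   treatment_share: float = 0.5) -> str:
    """Map a user ID to 'A' or 'B' deterministically via a salted hash."""
    key = f"{salt}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Use the first 32 bits of the digest as a value in [0, 1)
    bucket = int(digest[:8], 16) / 0x1_0000_0000
    return "B" if bucket < treatment_share else "A"


# Same user always lands in the same variant
assert assign_variant(42) == assign_variant(42)

# Sequential IDs still split close to 50/50
variants = [assign_variant(uid) for uid in range(1, 100_001)]
share_b = variants.count("B") / len(variants)
print(round(share_b, 3))
```

Salting with an experiment-specific string is a common design choice: it keeps a user's bucket in one experiment from determining their bucket in another.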

#### 2.2.3 Simulate signup and purchase via a logistic model

A logistic model keeps probabilities between 0 and 1, lets covariates shift the baseline realistically, and keeps the treatment effect easy to specify. The model works in log odds, with the baseline probability for a reference user set to 0.12. Both signup and purchase are modeled so that downstream metrics can be computed as well.
- The treatment is applied as a small shift on the probability scale of approximately +1.2 pp
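
The idea can be sketched as follows. The covariate coefficients and names below are placeholders chosen for illustration, not the values used in `simulate.py`; only the 0.12 reference baseline and the +1.2 pp shift come from the configuration above.

```python
import math

BASELINE_SIGNUP_CVR = 0.12   # reference-user baseline (from config)
TREATMENT_ABS_LIFT = 0.012   # +1.2 pp shift on the probability scale (from config)

# Hypothetical log-odds shifts relative to the reference user
NEW_USER_SHIFT = 0.3
DEVICE_SHIFTS = {"desktop": 0.1, "mobile": -0.1}


def logit(p: float) -> float:
    return math.log(p / (1 - p))


def sigmoid(x: float) -> float:
    return 1 / (1 + math.exp(-x))


def signup_probability(is_new: int, device: str, exposed_b: int) -> float:
    """Baseline from a logistic model, then an additive treatment shift."""
    log_odds = logit(BASELINE_SIGNUP_CVR)
    log_odds += NEW_USER_SHIFT * is_new
    log_odds += DEVICE_SHIFTS.get(device, 0.0)  # unknown device = reference level
    p_base = sigmoid(log_odds)
    p = p_base + TREATMENT_ABS_LIFT * exposed_b
    return min(max(p, 0.0), 1.0)  # guard against leaving [0, 1]


print(signup_probability(is_new=0, device="other", exposed_b=0))
print(signup_probability(is_new=0, device="other", exposed_b=1))
print(signup_probability(is_new=1, device="desktop", exposed_b=1))
```

Because covariates act on the log-odds scale while the treatment is added on the probability scale, every user gets the same absolute lift regardless of their baseline.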

#### 2.2.4 Running Experiment and Creating Simulated Data

In [1]:
from src.simulate import run_experiment, simulate_signup
from src.config import ExperimentConfig


cfg = ExperimentConfig(compliance_rate=1.0, baseline_signup_cvr=0.12, treatment_abs_lift=0.012)
df = run_experiment(n_users = 200_000, cfg=cfg)

df.head()

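| | user_id | is_new | device | channel | variant | assigned_B | exposed_B | p_signup_base | day | p_signup | signed_up | p_purchase_given_signup | purchased |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | mobile | search | B | 1 | 1 | 0.109835 | 24 | 0.121835 | 0 | 0.23 | 0 |
| 1 | 2 | 1 | mobile | social | B | 1 | 1 | 0.130968 | 25 | 0.142968 | 0 | 0.18 | 0 |
| 2 | 3 | 0 | desktop | search | B | 1 | 1 | 0.130968 | 9 | 0.142968 | 1 | 0.23 | 0 |
| 3 | 4 | 1 | mobile | social | B | 1 | 1 | 0.130968 | 11 | 0.142968 | 0 | 0.18 | 0 |
| 4 | 5 | 1 | desktop | search | B | 1 | 1 | 0.169042 | 4 | 0.181042 | 0 | 0.2 | 0 |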


## 2.3 Quick Sanity Checks

Check whether simulated data has results as expected

In [2]:
# Check the average signup rate per variant: expect a gap of ~1.2 pp
# (12% -> 13.2% holds for the reference user; covariates shift the group averages)
df.groupby("variant")["signed_up"].mean()

variant
A    0.150205
B    0.163963
Name: signed_up, dtype: float64

In [3]:
# Average purchase rate per group, check for positive/negative downstream effect
df.groupby("variant")["purchased"].mean()

variant
A    0.030350
B    0.032425
Name: purchased, dtype: float64

In [4]:
# Check for Sample Ratio Mismatch 
df.groupby("variant").size()

variant
A     99770
B    100230
dtype: int64
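
The group sizes can be tested formally with a chi-square goodness-of-fit test. The sketch below uses only the standard library (for one degree of freedom, the chi-square tail probability can be written with `math.erfc`); the function name `srm_check` is illustrative and is not the project's implementation.

```python
import math


def srm_check(n_a: int, n_b: int,
              expected_share_a: float = 0.5, threshold: float = 0.05) -> dict:
    """Chi-square goodness-of-fit test of observed group sizes vs. the planned split."""
    total = n_a + n_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = (n_a - expected_a) ** 2 / expected_a + (n_b - expected_b) ** 2 / expected_b
    # Tail probability of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return {"chi2": chi2, "p_value": p_value, "passes": p_value >= threshold}


srm = srm_check(n_a=99_770, n_b=100_230)
print(srm)
```

For these counts the sketch gives chi2 = 1.058 and p of about 0.30, which agrees with the SRM values reported in the A/B analysis.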

In [5]:
# Check that no user purchased without signing up (purchased <= signed_up in every row)
(df["purchased"] <= df["signed_up"]).all()

True

# 3. A/A Test

Validate the analysis pipeline using repeated A/A tests (true lift = 0). If the observed false positive rate closely matches the nominal 5% level, the hypothesis tests and confidence intervals are correctly calibrated.

In [6]:
from src.AATest import run_aa_once, run_aa_simulation

# Evaluate A/A test results over 1000 simulations
aa_cfg = ExperimentConfig(
    baseline_signup_cvr=0.12,
    treatment_abs_lift=0.0,  # Set absolute lift to 0 for A/A
    seed=123
)

aa_results = run_aa_simulation(
    n_sims=1000,
    n_users=200_000,
    cfg=aa_cfg
)

aa_results.mean()

p_value             0.501765
ci_low             -0.003152
ci_high             0.003140
reject              0.045000
ci_excludes_zero    0.052000
srm_passes          1.000000
dtype: float64
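
For intuition about what the A/A loop measures, here is a small self-contained sketch. It is not the project's `run_aa_simulation`: it skips covariates and hashing, draws both groups from the same fixed 12% rate, and uses a much smaller sample so it runs quickly in pure Python. All names are illustrative.

```python
import math
import random
from statistics import NormalDist


def one_sided_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """One-sided (B > A) pooled two-proportion z-test."""
    p_pooled = (conv_a + conv_b) / (n_a + n_b)
    se0 = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
    if se0 == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se0
    return 1 - NormalDist().cdf(z)


def aa_false_positive_rate(n_sims: int = 500, n_per_group: int = 2_000,
                           cvr: float = 0.12, alpha: float = 0.05,
                           seed: int = 123) -> float:
    """Share of A/A experiments (no true effect) in which H0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        conv_a = sum(rng.random() < cvr for _ in range(n_per_group))
        conv_b = sum(rng.random() < cvr for _ in range(n_per_group))
        if one_sided_p_value(conv_a, n_per_group, conv_b, n_per_group) < alpha:
            rejections += 1
    return rejections / n_sims


fpr = aa_false_positive_rate()
print(fpr)
```

A well-calibrated test should reject in roughly 5% of these runs; a rate far from that would point to a problem in the test or the data generation.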

# 4. A/B Test Analysis

**Estimator Defined**

What is being estimated (target quantity):

$$
\Delta = \text{Signup CVR}_B - \text{Signup CVR}_A
$$

This is the average treatment effect (ATE) on signup probability

**Point Estimates**
- Sample size per group
- Conversion rate per group
- Lift (difference)

In [7]:
from src.analysis import run_ab_analysis

results = run_ab_analysis(df, outcome_col = "signed_up")
# Results is a dictionary
results["group_stats"], results["lift_ci"], results["test"], results["srm"]

(GroupStats(n_A=99770, n_B=100230, conv_A=14986, conv_B=16434, p_A=0.15020547258695, p_B=0.16396288536366357),
 {'lift': 0.013757412776713579,
  'se': 0.0016269681783505879,
  'ci_low': 0.010568613743153687,
  'ci_high': 0.01694621181027347,
  'alpha': 0.05,
  'conf_level': 0.95},
 {'z': 8.453654651002372,
  'p_value': 0.0,
  'alternative': 'greater',
  'pooled_rate': 0.1571,
  'se0': 0.0016273923344008766,
  'lift': 0.013757412776713579},
 {'n_A': 99770,
  'n_B': 100230,
  'expected_A': 100000.0,
  'expected_B': 100000.0,
  'chi2': 1.058,
  'p_value': 0.30367178216369867,
  'passes': True})

### Result Summary

**Lift:** +1.376 pp

**95% CI (two-sided):** [+1.057 pp, +1.695 pp]

**p-value:** <0.001 (one-sided)

**SRM:** passes (p = 0.30)


Therefore:
- Statistically significant 
- Directionally positive 
- Randomization intact 

# 5. Calculating Metrics (Primary, Secondary, Guardrail)

**Primary Metric:**

Signup CVR Lift (already calculated above)

**Secondary Metric:** 

Purchase CVR per Exposure (purchase per landing exposure to "Limited time offer")

**Guardrail Metric:**

Purchase given signup as a guardrail for “low-intent signups” (purchase quality among signups)

### 5.1 Secondary Metric: Purchase CVR per Exposure

In [8]:
import pandas as pd
from src.analysis import run_ab_analysis

res_purchase = run_ab_analysis(df, outcome_col = "purchased")
res_purchase

{'group_stats': GroupStats(n_A=99770, n_B=100230, conv_A=3028, conv_B=3250, p_A=0.030349804550466073, p_B=0.0324254215304799),
 'lift_ci': {'lift': 0.002075616980013826,
  'se': 0.0007797342808445344,
  'ci_low': 0.0005473658720472987,
  'ci_high': 0.003603868087980353,
  'alpha': 0.05,
  'conf_level': 0.95},
 'test': {'z': 2.6617112259386877,
  'p_value': 0.0038872272421878185,
  'alternative': 'greater',
  'pooled_rate': 0.03139,
  'se0': 0.0007798054724294264,
  'lift': 0.002075616980013826},
 'srm': {'n_A': 99770,
  'n_B': 100230,
  'expected_A': 100000.0,
  'expected_B': 100000.0,
  'chi2': 1.058,
  'p_value': 0.30367178216369867,
  'passes': True}}

**Interpretation:**

- Variant B leads to more purchases per landing exposure
- B increases signups → more users reach the purchase stage
- However, the effect (about +0.2 pp) is too small to realistically matter

### 5.2 Guardrail Metric: Purchase Given Signup (purchase quality among signups)

In [9]:
df_signed = df[df["signed_up"] == 1].copy()
res_p_given_s = run_ab_analysis(df_signed, outcome_col="purchased")
res_p_given_s

{'group_stats': GroupStats(n_A=14986, n_B=16434, conv_A=3028, conv_B=3250, p_A=0.20205525156813026, p_B=0.19776073992941462),
 'lift_ci': {'lift': -0.004294511638715637,
  'se': 0.004518018381523736,
  'ci_low': -0.013149664947992103,
  'ci_high': 0.00456064167056083,
  'alpha': 0.05,
  'conf_level': 0.95},
 'test': {'z': -0.9508692721570479,
  'p_value': 0.8291646289644039,
  'alternative': 'greater',
  'pooled_rate': 0.1998090388287715,
  'se0': 0.004516405950287502,
  'lift': -0.004294511638715637},
 'srm': {'n_A': 14986,
  'n_B': 16434,
  'expected_A': 15710.0,
  'expected_B': 15710.0,
  'chi2': 66.7315085932527,
  'p_value': 3.1112221436754375e-16,
  'passes': False}}

**Interpretation:**

- Point estimate suggests slightly worse purchase quality in B
- However, the effect is statistically inconclusive (i.e., the CI includes 0)

**Conclusion for Guardrail:**
- There is no evidence of meaningful degradation
- Also no evidence of improvement

### 5.3 Metrics Conclusion

Primary metric (signup CVR)

- CI excludes 0 in positive direction ✅

- Lower bound > 0.5 pp ✅

- SRM passes ✅

Secondary metric (purchase per exposure)

- Small improvement ✅

- Directionally consistent with funnel logic ✅

Guardrails (Purchase|signup):

- No significant degradation ✅

- Slight negative point estimate, but inconclusive and small ✅


**Summary:**

Variant B significantly increases signup conversion rate with a practically meaningful lift. Downstream purchase per exposure also increases, driven by higher signup volume. Purchase conversion conditional on signup shows no statistically or practically significant degradation, suggesting the urgency copy does not materially reduce lead quality. Based on primary and available guardrail metrics, Variant B would be recommended for rollout, subject to validation of revenue impact and bounce behavior.

# 6. Applying Decision Criteria

**Ship Variant B if:**
- 95% CI for signup CVR lift excludes 0 in the positive direction
- The lift is practically significant: the lower bound of the 95% CI is greater than 0.5 pp
- The expected incremental revenue (over 30 days) exceeds cost of implementation
- No statistically or practically significant degradation in guardrail

**Do not ship if:**
- Lift is inconclusive or negative
- Guardrail metrics show degradation

In [10]:
from src.decision import decision_clean, print_decision

summary = decision_clean(df, practical_lift=0.005)
print_decision(summary)


DECISION: SHIP

VALIDITY
  SRM passes: True  (p=0.3037, nA=99770, nB=100230)

PRIMARY (Signup CVR)
  A: 15.02%  |  B: 16.40%
  Lift: 1.376 pp  95% CI: [1.057 pp, 1.695 pp]
  p(one-sided): 0.000e+00
  Stat sig: True | Practical: True (threshold 0.500 pp)

SECONDARY (Purchase per exposure)
  A: 3.03%  |  B: 3.24%
  Lift: 0.208 pp  95% CI: [0.055 pp, 0.360 pp]
  p(one-sided): 3.887e-03

GUARDRAIL (Purchase | Signup)
  A: 20.21%  |  B: 19.78%
  Lift: -0.429 pp  95% CI: [-1.315 pp, 0.456 pp]
  Guardrail OK: True

NOTES
  - Revenue check not available (need order_value + cost model).
  - Bounce rate not available (need session/event data).
  - SRM on purchase|signup subset is not applicable (conditioning on post-treatment variable).


# 7. Repeated Simulations to test statistical power, estimator variance, and confidence interval coverage

Simulate many repeated experiments at different total sample sizes (n_users) with a true lift of +1.2pp to quantify:

**Power vs sample size** (how often you detect the effect)

**Estimator behavior** (distribution of estimated lift)

**CI coverage** (how often the 95% CI contains the true lift)

Was the scale/design of our experiment adequate to answer our hypothesis?

**Expected:**

- power should increase with N

- coverage should be near ~0.95

- mean_lift should be near 0.012

In [11]:
from src.power_validation import run_phase5
# Run experiment across different sample sizes for alternative (+1.2pp lift)

cfg_alt = ExperimentConfig(
    compliance_rate = 1.0,
    baseline_signup_cvr=0.12,
    treatment_abs_lift=0.012,  # true effect
    seed=123
)

# Different sample sizes
n_users_grid = [10_000, 25_000, 50_000, 100_000, 200_000]
raw_alt, summary_alt = run_phase5(
    n_sims=300,
    n_users_grid=n_users_grid,
    cfg_base=cfg_alt,
    true_lift=0.012,
    alpha=0.05
)

summary_alt

| | n_users | power | mean_lift | std_lift | mean_ci_width | coverage | srm_pass_rate |
|---|---|---|---|---|---|---|---|
| 0 | 10000 | 0.463333 | 0.012058 | 0.007964 | 0.028601 | 0.936667 | 1.0 |
| 1 | 25000 | 0.83 | 0.01171 | 0.004404 | 0.018074 | 0.963333 | 1.0 |
| 2 | 50000 | 0.98 | 0.011681 | 0.002956 | 0.012787 | 0.97 | 1.0 |
| 3 | 100000 | 1.0 | 0.011925 | 0.002367 | 0.009037 | 0.946667 | 1.0 |
| 4 | 200000 | 1.0 | 0.011964 | 0.001636 | 0.00639 | 0.94 | 1.0 |


**Results & Interpretation:**

- At 25,000 total users or more, estimated power meets the 80% target, and at our planned sample size of 200,000 it is ~100%; the test almost always detects the effect when it is truly there
- Average estimated lift (~1.2 pp) matches the true lift, consistent with an unbiased estimator
- As n_users increases, the confidence intervals narrow and the estimator becomes more precise, giving stronger evidence at larger sample sizes
- Coverage, the fraction of runs in which the 95% CI contains the true lift, stays close to the nominal 95% (between 0.94 and 0.97 across sample sizes)
- The SRM check passes in every simulated run

# 8. Stress-testing experiment against realistic failure modes

Up until this point, we assumed:
- Perfect randomization
- Stable user behavior
- No behavioral adaptation
- No logging issues
  
If these assumptions break, how does our inference break, and how do we detect it?

### 8.1 Non-Compliance

Some users assigned to B may not actually see B and see A instead (cached page, ad blocker, slow JS, etc.).

Randomization is still intact, but exposure is not.

This leads to:

- Attenuation bias (effects shrink)

- Reduced power

- Misleading “small effect” conclusions
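
The attenuation can be made concrete with a short calculation. The sketch below assumes that non-compliant users in B behave exactly like control users, so the lift measured by assignment is the true lift scaled by the compliance rate; the function names are illustrative, and the sample-size factor is an approximation that holds the variance roughly fixed.

```python
def expected_itt_lift(true_lift: float, compliance: float) -> float:
    """Expected lift when comparing by assignment and only a share
    `compliance` of group B actually sees the treatment."""
    return compliance * true_lift


def sample_size_inflation(compliance: float) -> float:
    """Approximate factor by which the sample must grow to keep the same power,
    since required n scales with 1 / (detectable lift)^2."""
    return 1.0 / compliance ** 2


for c in (1.0, 0.8, 0.6, 0.4):
    print(f"compliance={c:.1f}  "
          f"expected lift={expected_itt_lift(0.012, c):.4f}  "
          f"n inflation={sample_size_inflation(c):.2f}x")
```

Under these assumptions, 80% compliance turns a 1.2 pp effect into an expected 0.96 pp, and keeping the same power would take roughly 1.56 times the sample.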

In [12]:
from src.compliance import run_once, power_at_compliance

# Run compliance test on original experiment
run_ab_analysis(df, outcome_col="signed_up")

pd.DataFrame([run_once(c) for c in [1.0, 0.8, 0.6, 0.4]])

| | compliance | exposure_rate_realized | lift_hat | predicted_lift | ci_low | ci_high | p_value |
|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 0.010931 | 0.012 | 0.007749 | 0.014113 | 8.365975e-12 |
| 1 | 0.8 | 0.7998 | 0.008367 | 0.009598 | 0.005196 | 0.011539 | 1.168496e-07 |
| 2 | 0.6 | 0.597815 | 0.005913 | 0.007174 | 0.002752 | 0.009074 | 0.0001233652 |
| 3 | 0.4 | 0.396039 | 0.003429 | 0.004752 | 0.000278 | 0.00658 | 0.0164783 |


**Interpretation:** The estimated lift shrinks as compliance decreases: the fewer users in Group B who actually see the treatment, the smaller the observed effect and the harder it is to detect.

In [13]:
rows = []
for c in [1.0, 0.9, 0.8, 0.7, 0.6]:
    rows.append({"compliance": c, "power": power_at_compliance(c)})

pd.DataFrame(rows)

| | compliance | power |
|---|---|---|
| 0 | 1.0 | 0.98 |
| 1 | 0.9 | 0.926667 |
| 2 | 0.8 | 0.86 |
| 3 | 0.7 | 0.783333 |
| 4 | 0.6 | 0.703333 |


**Interpretation:** As compliance decreases, statistical power (the probability of detecting the effect when it is truly there) goes down.

**Summary**: As compliance decreases, fewer users assigned to treatment actually experience it, which attenuates the observed lift. This weaker signal is harder to detect, so statistical power falls.

### 8.2 Novelty Effects

Although we analyze only first exposure per user, we additionally evaluate novelty effects by allowing treatment impact to vary over calendar time, reflecting adaptation or saturation among incoming users.

The treatment lift is made to shrink a little each day to simulate novelty decay; the rate is controlled by `novelty_decay_k`, set to 0.12 in the cell below.

In [14]:
cfg = ExperimentConfig(
    treatment_abs_lift=0.012,
    compliance_rate=1.0,
    experiment_days=28,
    novelty_decay_k=0.12
)
df = run_experiment(200_000, cfg)

In [15]:
early = df[df["day"] < 7]
late  = df[df["day"] >= 21]

def lift(df):
    g = df.groupby("variant")["signed_up"].mean()
    return g["B"] - g["A"]

# Compare average lift in early vs late days
lift(early), lift(late)

(0.01648074220416923, 0.014379836348535513)

In [16]:
by_day = (
    df.groupby(["day", "variant"])["signed_up"]
      .mean()
      .unstack()
)

# Daily lift, shown for every 5th day (days 0, 5, 10, ...)
by_day["lift"] = by_day["B"] - by_day["A"]
by_day[["lift"]].iloc[::5]

| day | lift |
|---|---|
| 0 | 0.015516 |
| 5 | 0.015305 |
| 10 | 0.007077 |
| 15 | 0.013096 |
| 20 | 0.007132 |
| 25 | 0.009324 |


### If the treatment effect decays over time, what does our experiment actually estimate—and can we still detect a meaningful effect?

Instead of estimating a constant average treatment effect, the A/B test now estimates a time-weighted average effect across the experiment window.
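
One way to write this estimand down, using notation introduced here for illustration: let $\Delta_t$ be the true lift for users first exposed on day $t$, and $n_t$ the number of users first exposed on that day, over an experiment of $T$ days. Assuming both variants receive the same share of traffic each day, the overall difference in signup rates targets

$$
\bar{\Delta} = \sum_{t=0}^{T-1} w_t \, \Delta_t,
\qquad
w_t = \frac{n_t}{\sum_{s=0}^{T-1} n_s}
$$

With roughly constant daily traffic, $w_t \approx 1/T$, so $\bar{\Delta}$ is close to the simple average of the daily lifts. If the lift decays, $\bar{\Delta}$ falls below the day-0 lift $\Delta_0$, and a longer experiment window pulls it further toward the late-period effect.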

#### Redo Power Analysis

Given novelty decay, is our experiment still powerful enough to detect the average effect over the experiment window?

In [17]:
# Redoing Power Analysis w/ Novelty Effect in place

# Different sample sizes
n_users_grid = [10_000, 25_000, 50_000, 100_000, 200_000]
raw_alt, summary_alt = run_phase5(
    n_sims=300,
    n_users_grid=n_users_grid,
    cfg_base=cfg,
    true_lift=0.012,
    alpha=0.05
)

summary_alt

| | n_users | power | mean_lift | std_lift | mean_ci_width | coverage | srm_pass_rate |
|---|---|---|---|---|---|---|---|
| 0 | 10000 | 0.48 | 0.012129 | 0.007654 | 0.028602 | 0.94 | 1.0 |
| 1 | 25000 | 0.82 | 0.011649 | 0.004561 | 0.01808 | 0.956667 | 1.0 |
| 2 | 50000 | 0.986667 | 0.011989 | 0.003053 | 0.012781 | 0.963333 | 1.0 |
| 3 | 100000 | 1.0 | 0.012029 | 0.002368 | 0.009038 | 0.94 | 1.0 |
| 4 | 200000 | 1.0 | 0.0119 | 0.001534 | 0.00639 | 0.97 | 1.0 |


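Even with novelty effects implemented, statistical power is essentially unchanged from the no-decay run: at the smaller sample sizes it moves by only about 1 to 2 percentage points in either direction (for example, from 0.83 to 0.82 at 25,000 users), which is within the simulation noise expected from 300 runs. Power remains at or above our 80% target for 25,000 users or more.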

### 8.3 Summary of Stress Tests

We extended the simulator to include novelty decay, where treatment effects are stronger early in the experiment and diminish over time. The standard A/B analysis then estimates the time-averaged effect, and novelty can reduce the effective signal and therefore statistical power. We re-evaluated power under this setting to assess whether the experiment design remains adequate under more realistic dynamics. The results indicate that our sample size is more than enough to reliably detect the average effect even with novelty decay in place.

# 9. Limitations & Robustness

## 9.1 Limitations

**Simulation assumptions**:
This analysis relies on simulated data and therefore assumes stable user behavior, known baseline conversion rates, and a simplified funnel structure. While the simulator incorporates realistic heterogeneity (user type, device, channel), real-world user behavior may deviate due to unobserved factors.

**Constant treatment lift assumption (baseline analysis)**:
The primary analysis assumes a constant treatment effect across users and time. In practice, treatment effects may vary by cohort, traffic source, or over the experiment duration.

**Absolute lift modeling**:
Treatment impact is modeled as an absolute increase in signup probability. In real applications, effects may operate on a relative or log-odds scale, particularly when baseline rates differ substantially across segments.

**Single-metric primary decision**:
The shipping decision is based on signup conversion rate as the primary metric. While downstream metrics are monitored, longer-term effects such as user lifetime value are not explicitly modeled.

---

## 9.2 Robustness Checks

**A/A test validation**:
Before analyzing treatment effects, repeated A/A simulations were conducted to validate the analysis pipeline. The observed false positive rate closely matched the nominal 5% significance level, confirming correct calibration of hypothesis tests and confidence intervals.

**Sample ratio mismatch (SRM) monitoring**:
Exposure counts were checked using a chi-square goodness-of-fit test to ensure correct randomization. No SRM was detected in the primary analysis population. SRM checks were not applied to post-treatment subsets (e.g., purchase conditional on signup), where conditioning would invalidate the test.

**Power analysis across sample sizes**:
Repeated simulations were used to estimate statistical power, estimator variance, and confidence interval coverage across sample sizes. This analysis demonstrated that the planned experiment scale was sufficient to detect the minimum detectable effect with high probability, and that confidence intervals were well calibrated.

**Non-compliance stress test**:
The simulator was extended to allow partial treatment exposure, where only a fraction of users assigned to treatment actually experienced the modified landing page. As expected, non-compliance attenuated the observed treatment effect and reduced power, illustrating how imperfect experiment delivery can cause real effects to be missed even when randomization remains valid.

**Novelty decay analysis**:
To assess temporal robustness, treatment effects were allowed to decay over the experiment window. While novelty reduced the average measured effect and statistical power, the analysis framework remained valid, estimating a time-weighted average treatment effect. This highlights the importance of experiment duration and timing when interpreting results for features with short-lived impact.

---

## 9.3 Summary

Overall, the experiment design and analysis pipeline were validated under multiple realistic stress scenarios. While simplifying assumptions remain, the robustness checks demonstrate that the core conclusions are not driven by artifacts of randomization, sample size, or idealized behavior, and that observed effects degrade in predictable and interpretable ways under non-compliance and novelty dynamics.

# 10. Conclusion

In this project, we designed, simulated, and analyzed an end-to-end A/B test to evaluate whether adding the phrase “Limited time offer” to a landing page increases signup conversion. Users were randomly assigned at the user level, and realistic behavioral heterogeneity was incorporated through covariates such as device type, acquisition channel, and user status.

Before evaluating treatment effects, we validated the experimental pipeline using repeated A/A tests, confirming correct calibration of hypothesis tests and confidence intervals. We then conducted the primary A/B analysis, estimating the treatment effect as the difference in signup conversion rates between variants and testing it using a one-sided two-sample proportion test. The observed lift was statistically significant, exceeded the pre-specified practical significance threshold, and showed no evidence of harmful downstream effects on purchase quality.

To assess reliability beyond a single experimental outcome, we performed power analysis across sample sizes and demonstrated that the chosen experiment scale was sufficient to detect the hypothesized effect with high probability. We further stress-tested the design by introducing non-compliance and novelty decay into the simulation. These robustness checks showed how imperfect exposure and time-varying effects attenuate observed lift and reduce power in predictable ways, without invalidating the causal interpretation of the experiment.

Overall, this notebook demonstrates a complete experimentation workflow — from design and validation to analysis, decision-making, and robustness evaluation — highlighting both the strengths and limitations of A/B testing in realistic settings. The results illustrate not only how to determine whether a change should ship, but also how to reason about uncertainty, power, and real-world deviations from ideal assumptions.