# Power Sensitivity Analysis

**Notebook:** power_sensitivity  
**Purpose:** Analyze detection rates across different sample sizes and effect sizes  
**Data Source:** `reports/results/sensitivity_summary.csv`

---

This notebook visualizes the relationship between experimental parameters (users per day, uplift) and statistical power (detection rate) to inform sample size planning and MDE selection for future A/B tests.


## Goals

The primary objectives of this analysis are:

1. **Quantify statistical power** across different sample sizes (users per day) and effect sizes (uplift)
2. **Identify minimum viable sample size** needed to detect a 2% uplift with adequate power (≥80%)
3. **Assess false positive rate** by examining detection rates when true uplift is 0 (A/A test)
4. **Inform experiment planning** by providing data-driven recommendations for:
   - Minimum experiment duration
   - Realistic MDE (Minimum Detectable Effect)
   - Trade-offs between sample size and runtime

**Success Criteria:**
- Understand power curves for key uplift values (0%, 1.5%, 2%, 3%)
- Confirm false positive rate ≈ α (5%)
- Provide actionable guidance for future A/B tests


## Data source

**Primary Data:** `reports/results/sensitivity_summary.csv`

This CSV contains results from the sensitivity analysis pipeline (`src/analysis/sensitivity.py`), which:
- Simulates checkout funnel data for various parameter combinations
- Runs statistical tests (two-proportion z-test for CCR)
- Records whether each test detected a significant difference (p < α)
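
The three pipeline steps above can be sketched with the standard library alone. This is a minimal illustration, not the code in `src/analysis/sensitivity.py`: the function names are hypothetical, `uplift` is assumed to be a relative lift on the baseline rate, and each arm is assumed to receive `n_per_arm` users.

```python
import math
import random
from statistics import NormalDist


def two_proportion_z_test(x_a, n_a, x_b, n_b):
    """Return (z, two-sided p-value) for H0: p_a == p_b, using the pooled variance."""
    p_a, p_b = x_a / n_a, x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation in either arm: nothing to detect
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def simulate_detection_rate(n_per_arm, baseline, uplift, repeats, alpha=0.05, seed=0):
    """Fraction of simulated experiments in which the test reports p < alpha."""
    rng = random.Random(seed)
    treated = baseline * (1 + uplift)  # ASSUMPTION: uplift is relative, not absolute
    detections = 0
    for _ in range(repeats):
        # Draw one binomial count per arm by summing Bernoulli trials
        x_a = sum(rng.random() < baseline for _ in range(n_per_arm))
        x_b = sum(rng.random() < treated for _ in range(n_per_arm))
        _, p_value = two_proportion_z_test(x_a, n_per_arm, x_b, n_per_arm)
        if p_value < alpha:
            detections += 1
    return detections / repeats
```

Calling `simulate_detection_rate(2000, 0.35, 0.02, repeats=100)` would return one empirical power estimate of the kind stored in `detection_rate`. The pure-Python sampling loop is slow for large `n_per_arm`; a vectorized binomial draw would be the natural replacement.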

**Schema:**
- `users_per_day`: Sample size (number of users exposed per day)
- `uplift`: True treatment effect (0.0 for A/A, 0.02 for 2% lift, etc.)
- `repeats`: Number of independent simulations per grid point
- `detections`: Count of significant results (p < `alpha`)
- `detection_rate`: detections / repeats (empirical statistical power)
- `alpha`: Significance level (typically 0.05)
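
A quick consistency check against this schema can catch a stale or hand-edited CSV before any plotting. The sketch below is one possible check; `validate_summary` and its tolerance are hypothetical and not part of the pipeline.

```python
import pandas as pd

EXPECTED_COLUMNS = [
    "users_per_day", "uplift", "repeats", "detections", "detection_rate", "alpha",
]


def validate_summary(df: pd.DataFrame, tol: float = 1e-6) -> pd.DataFrame:
    """Raise ValueError if the frame violates the documented schema; else return it."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if (df["repeats"] <= 0).any():
        raise ValueError("repeats must be positive")
    if ((df["detections"] < 0) | (df["detections"] > df["repeats"])).any():
        raise ValueError("detections must lie in [0, repeats]")
    # detection_rate is documented as detections / repeats
    recomputed = df["detections"] / df["repeats"]
    if ((recomputed - df["detection_rate"]).abs() > tol).any():
        raise ValueError("detection_rate does not equal detections / repeats")
    # Each (users_per_day, uplift) pair should appear once in the grid
    if df.duplicated(subset=["users_per_day", "uplift"]).any():
        raise ValueError("duplicate grid points")
    return df
```

If the CSV stores a rounded `detection_rate`, `tol` would need to be loosened accordingly.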

**Generation Command:**
```bash
make sensitivity
# or
python src/analysis/sensitivity.py --users "20000,50000" --uplifts "0.0,0.02" --repeats 10
```

**Note:** Data are synthetic but parameterized to match realistic checkout funnel behavior (baseline CCR ~35%, cart-to-checkout ~67%).


## Plots to create

### 1. Power Curve by Sample Size
**Type:** Line plot  
**X-axis:** Users per day (sample size)  
**Y-axis:** Detection rate (statistical power, 0-100%)  
**Lines:** One per uplift level (0%, 1.5%, 2%, 3%)  
**Reference line:** Horizontal line at 80% (conventional power threshold)

**Purpose:** Visualize how power increases with sample size for different effect sizes
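
One way to draw this plot with matplotlib is sketched below. It assumes `df` has the columns listed under the schema; the function name and styling choices are illustrative only.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_power_curves(df: pd.DataFrame, target_power: float = 0.80):
    """Plot detection rate against users per day, one line per uplift level."""
    fig, ax = plt.subplots(figsize=(8, 5))
    for uplift, group in df.groupby("uplift"):
        group = group.sort_values("users_per_day")
        ax.plot(
            group["users_per_day"],
            group["detection_rate"] * 100,  # stored as a fraction, shown in percent
            marker="o",
            label=f"uplift {uplift:.1%}",
        )
    # Conventional power threshold as a horizontal reference line
    ax.axhline(target_power * 100, linestyle="--", color="grey",
               label=f"{target_power:.0%} power")
    ax.set_xlabel("Users per day")
    ax.set_ylabel("Detection rate (%)")
    ax.set_ylim(0, 105)
    ax.set_title("Power curve by sample size")
    ax.legend()
    return ax
```

In the notebook this would be called as `plot_power_curves(df)` after the CSV has been loaded.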

---

### 2. Heatmap: Power across Parameter Grid
**Type:** Heatmap  
**X-axis:** Uplift (effect size)  
**Y-axis:** Users per day (sample size)  
**Color:** Detection rate (0% = red, 100% = green)  
**Annotations:** Show exact detection rate in each cell

**Purpose:** Quickly identify parameter combinations that achieve target power
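
The heatmap needs the long-format CSV reshaped into a grid first. The sketch below does the pivot with pandas and renders it with matplotlib's `imshow`; the function name is hypothetical, and the `RdYlGn` colormap is one choice that matches the red-to-green scale described above.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_power_heatmap(df: pd.DataFrame):
    """Pivot to a users_per_day x uplift grid and draw an annotated heatmap."""
    grid = df.pivot(
        index="users_per_day", columns="uplift", values="detection_rate"
    ).sort_index()
    fig, ax = plt.subplots(figsize=(8, 5))
    image = ax.imshow(grid.to_numpy(), cmap="RdYlGn", vmin=0.0, vmax=1.0, aspect="auto")
    ax.set_xticks(range(grid.shape[1]))
    ax.set_xticklabels([f"{u:.1%}" for u in grid.columns])
    ax.set_yticks(range(grid.shape[0]))
    ax.set_yticklabels([f"{int(n):,}" for n in grid.index])
    # Annotate each cell with its detection rate; skip missing grid points
    for i in range(grid.shape[0]):
        for j in range(grid.shape[1]):
            value = grid.iat[i, j]
            if not pd.isna(value):
                ax.text(j, i, f"{value:.0%}", ha="center", va="center")
    ax.set_xlabel("Uplift")
    ax.set_ylabel("Users per day")
    fig.colorbar(image, ax=ax, label="Detection rate")
    return grid, ax
```

`DataFrame.pivot` raises if a `(users_per_day, uplift)` pair appears more than once, which doubles as a check that the grid has no duplicate points.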

---

### 3. False Positive Rate Check
**Type:** Bar chart  
**X-axis:** Users per day  
**Y-axis:** Detection rate for uplift = 0.0 (A/A test)  
**Reference line:** Horizontal line at 5% (expected false positive rate = α)

**Purpose:** Validate that the testing framework maintains proper Type I error control
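
Because each A/A detection rate is estimated from a finite number of repeats, comparing it to the 5% line by eye can mislead. A confidence interval around the observed rate makes the check explicit. The sketch below uses the Wilson score interval; the helper names are hypothetical.

```python
import math
from statistics import NormalDist


def wilson_interval(successes, trials, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


def false_positive_rate_consistent(detections, repeats, alpha=0.05, confidence=0.95):
    """True if alpha lies inside the interval around the observed A/A detection rate."""
    low, high = wilson_interval(detections, repeats, confidence)
    return low <= alpha <= high
```

With only 10 repeats per grid point, as in the example generation command, the interval is very wide: 1 detection out of 10 gives roughly 2% to 40%, so the A/A check says little until `repeats` is much larger.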

---

### 4. Sample Size Recommendation Table
**Type:** Formatted dataframe/table  
**Columns:** Uplift | Users for 50% power | Users for 80% power | Users for 95% power  
**Rows:** Different uplift levels (1%, 1.5%, 2%, 2.5%, 3%)

**Purpose:** Provide actionable sample size recommendations for experiment planning
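
A first version of this table can be read straight off the simulated grid: for each uplift, take the smallest `users_per_day` whose detection rate reaches the target. The sketch below assumes the schema columns; `recommend_sample_sizes` is a hypothetical helper, and it can only return values that are actually on the grid.

```python
import pandas as pd


def recommend_sample_sizes(df: pd.DataFrame, targets=(0.50, 0.80, 0.95)) -> pd.DataFrame:
    """Smallest simulated users_per_day reaching each power target, per uplift."""
    treated = df[df["uplift"] > 0]  # A/A rows carry no power information
    columns = {}
    for target in targets:
        reached = treated[treated["detection_rate"] >= target]
        columns[f"Users for {target:.0%} power"] = (
            reached.groupby("uplift")["users_per_day"].min()
        )
    # Keep every uplift as a row, with NaN where no grid point reached the target
    table = pd.DataFrame(columns).reindex(sorted(treated["uplift"].unique()))
    table.index.name = "Uplift"
    return table
```

A `NaN` cell means no simulated sample size reached that power level, so the grid would need to be extended. With few repeats the detection rates are noisy and not necessarily monotone in sample size, so the minimum can be optimistic.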


## Takeaways

This section will be populated with key insights once the analysis has been run. Expected themes include:

### Expected Findings:

**Power Relationships:**
- Power increases with both sample size and effect size
- Diminishing returns: power depends on sample size through √n, so each doubling of the sample adds less power as the curve approaches 100%
- Small effect sizes (< 1.5% uplift) require prohibitively large samples
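
These expectations can be checked against the normal approximation to the power of a two-sided two-proportion z-test. The sketch below is an analytic cross-check, not the simulation itself; it assumes equal arm sizes, and `approx_power` is a hypothetical name.

```python
import math
from statistics import NormalDist


def approx_power(n_per_arm, p_control, p_treatment, alpha=0.05):
    """Normal-approximation power of a two-sided two-proportion z-test."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    pooled = (p_control + p_treatment) / 2
    # Standard error under H0 (pooled) and under the alternative (unpooled)
    se_null = math.sqrt(2 * pooled * (1 - pooled) / n_per_arm)
    se_alt = math.sqrt(
        (p_control * (1 - p_control) + p_treatment * (1 - p_treatment)) / n_per_arm
    )
    delta = abs(p_treatment - p_control)
    upper = norm.cdf((delta - z_crit * se_null) / se_alt)
    lower = norm.cdf((-delta - z_crit * se_null) / se_alt)
    return upper + lower
```

For example, comparing `approx_power(50_000, 0.35, 0.357)` with `approx_power(100_000, 0.35, 0.357)` shows how much a doubling buys for a 2% relative lift on a 35% baseline (the relative interpretation is an assumption). When the two rates are equal the function returns `alpha`, which matches the A/A expectation.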

**Sample Size Guidance:**
- Minimum viable sample for 2% uplift at 80% power: _TBD_
- Trade-off analysis: runtime vs. confidence in results
- If power is insufficient, consider: longer runtime, larger MDE, or sequential testing

**A/A Validation:**
- False positive rate should be ≈5% (matching α)
- If systematically higher: investigate SRM (Sample Ratio Mismatch) or instrumentation issues
- If systematically lower: tests may be too conservative

**Practical Recommendations:**
- Recommended minimum sample size for standard 2% MDE: _TBD_
- When to use larger samples (small expected effects, high business impact)
- When to use smaller samples (rapid iteration, low cost of false negatives)

---

**Action Items:**
1. Update experiment configuration (`configs/experiment.yml`) with validated sample size
2. Adjust MDE expectations if current sample size provides insufficient power
3. Plan experiment duration to accumulate required samples
4. Document assumptions and revisit if baseline metrics shift significantly
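
For item 3, the required runtime follows from the required sample size and daily traffic. The sketch below uses the standard normal-approximation sample size formula for two proportions. It assumes `users_per_day` is total traffic split evenly across two arms, which the schema does not state, and both function names are hypothetical.

```python
import math
from statistics import NormalDist


def required_n_per_arm(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate users per arm for a two-sided two-proportion z-test."""
    if p_control == p_treatment:
        raise ValueError("rates must differ to compute a sample size")
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_beta = norm.inv_cdf(power)
    pooled = (p_control + p_treatment) / 2
    numerator = (
        z_alpha * math.sqrt(2 * pooled * (1 - pooled))
        + z_beta * math.sqrt(
            p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
        )
    ) ** 2
    return math.ceil(numerator / (p_treatment - p_control) ** 2)


def days_needed(n_per_arm, users_per_day, arms=2):
    """Whole days needed to accumulate n_per_arm users in each arm."""
    # ASSUMPTION: users_per_day is total traffic, split evenly across arms
    return math.ceil(arms * n_per_arm / users_per_day)
```

Chaining the two, `days_needed(required_n_per_arm(0.35, 0.357), 20_000)` gives a rough minimum duration for one scenario. In practice the result is often rounded up to whole weeks to cover weekly seasonality.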


In [None]:
import pandas as pd
from pathlib import Path

# Read sensitivity analysis results
sensitivity_path = Path("../reports/results/sensitivity_summary.csv")
df = pd.read_csv(sensitivity_path)

print("Sensitivity Analysis Grid")
print("=" * 50)
print("\nDistinct users_per_day values:")
print(sorted(df['users_per_day'].unique().tolist()))

print("\nDistinct uplift values:")
print(sorted(df['uplift'].unique().tolist()))

print(f"\nTotal grid points: {len(df)}")
# Show every distinct value in case repeats differs between grid points
print(f"Repeats per point: {sorted(df['repeats'].unique().tolist())}")

print("\nData preview:")
print(df.head())

# TODO: Add visualizations below
# - Power curve by sample size
# - Heatmap of power across parameter grid
# - False positive rate validation
# - Sample size recommendation table
