# Statistical Analysis Sanity Check

**Notebook:** stats_sanity  
**Purpose:** Verify that `src/analysis/run_stats.py` produces correct statistical outputs and follows best practices.


This notebook provides a sanity check for the statistical analysis framework used in the checkout flow optimization experiment.


## CLI Output Structure

The `python src/analysis/run_stats.py` command prints a compact statistical report with the following sections:

### 1. Header
- **Analysis Date:** Most recent date in the dataset

### 2. Primary Metric: Conditional Conversion Rate (CCR)
- **Per-Variant Rates:** Orders / Adders for control and treatment
- **Effect Size (Absolute):** Difference in percentage points (pp)
- **Effect Size (Relative):** Percentage change relative to control
- **95% Confidence Interval:** Range for the absolute effect
- **p-value:** From a two-tailed significance test
- **Significance Indicator:** ✓ SIGNIFICANT or ✗ NOT SIGNIFICANT at α=0.05

### 3. Guardrail Metrics
- **Payment Authorization Rate:** Per-variant rates with 95% CIs
- **Average Order Value (AOV):** Per-variant means with sample sizes
- **Guardrail Status:** PASS or FAIL for each guardrail, based on configured thresholds

### 4. Decision
- **Primary Metric Status:** Significant or not
- **Guardrails Status:** All passed or one or more failed
- **Recommendation:** SHIP or DO NOT SHIP

### 5. Exit Code
- **0:** Primary significant AND all guardrails pass
- **1:** Primary not significant OR any guardrail fails
- **2:** Error occurred
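
As a rough illustration of the decision and exit-code rules above, the mapping could be sketched as follows. The function names `decide` and `exit_code_for` are hypothetical and are not taken from `run_stats.py`. The sketch follows the documented rule literally, so it does not check the direction of the effect.

```python
from typing import Callable, Iterable, Tuple


def decide(primary_significant: bool, guardrails_passed: Iterable[bool]) -> Tuple[str, int]:
    """Map results to (recommendation, exit_code) per the documented rules."""
    if primary_significant and all(guardrails_passed):
        return "SHIP", 0
    return "DO NOT SHIP", 1


def exit_code_for(run_analysis: Callable[[], Tuple[bool, Iterable[bool]]]) -> int:
    """Run a (hypothetical) analysis callable; any exception maps to exit code 2."""
    try:
        primary_significant, guardrails_passed = run_analysis()
    except Exception:
        return 2
    return decide(primary_significant, guardrails_passed)[1]
```

Exit code 1 is therefore a normal outcome, not a failure of the script, which matters when the CLI is called from automation that treats any non-zero status as an error.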


## Statistical Methods

### Primary Metric: CCR (Conditional Conversion Rate)

**Test:** Two-sample z-test for proportions

**Method:**
- Uses **pooled variance** for the standard error in hypothesis testing
- Uses **unpooled variance** for confidence interval calculation
- **95% Confidence Interval** for the absolute difference (treatment - control)
- **Two-tailed p-value** using normal approximation

**Formula:**
```
z = (p_treatment - p_control) / SE_pooled

where SE_pooled = sqrt(p_pooled * (1 - p_pooled) * (1/n_control + 1/n_treatment))
and p_pooled = (successes_control + successes_treatment) / (n_control + n_treatment)
```

**Assumptions:**
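- Large sample sizes (n > 30 per variant, with roughly 10 or more successes and 10 or more failures in each variant so the normal approximation holds)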
- Independent observations
- Random assignment to variants
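
The test described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the code in `run_stats.py`: the function name `two_proportion_ztest` and its argument names are hypothetical. It uses the pooled standard error for the z statistic and the unpooled standard error for the confidence interval, as listed under **Method**.

```python
import math
from statistics import NormalDist


def two_proportion_ztest(x_control, n_control, x_treatment, n_treatment, alpha=0.05):
    """Two-sided z-test for a difference in proportions (treatment - control)."""
    p_control = x_control / n_control
    p_treatment = x_treatment / n_treatment
    diff = p_treatment - p_control

    # Pooled SE: assumes both variants share one rate under the null hypothesis
    p_pooled = (x_control + x_treatment) / (n_control + n_treatment)
    se_pooled = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_control + 1 / n_treatment))
    z = diff / se_pooled  # undefined if p_pooled is exactly 0 or 1
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled SE: each variant keeps its own variance for the interval
    se_unpooled = math.sqrt(
        p_control * (1 - p_control) / n_control
        + p_treatment * (1 - p_treatment) / n_treatment
    )
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)

    return {"diff": diff, "z": z, "p_value": p_value, "ci": ci}
```

Because the test and the interval use different standard errors, they can disagree in borderline cases: the p-value may fall just under α while the interval still touches zero, or the reverse.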

### Guardrail Metrics

**Payment Authorization Rate:**
- **Method:** Proportion confidence interval (Wald method)
- **95% CI** for each variant independently
- Comparison against baseline to check for drops
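
A minimal sketch of the Wald interval for a single variant, assuming only success and trial counts are available (the name `wald_ci` is hypothetical):

```python
import math
from statistics import NormalDist


def wald_ci(successes, n, alpha=0.05):
    """Wald confidence interval for a proportion: p ± z * sqrt(p(1-p)/n)."""
    p = successes / n
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z_crit * math.sqrt(p * (1 - p) / n)
    # Not clamped: the Wald interval can extend below 0 or above 1
    return p - half_width, p + half_width
```

The Wald interval is known to have poor coverage when the proportion is close to 0 or 1; the Wilson score interval is a common alternative in that regime.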

**Average Order Value (AOV):**
- **Method:** Mean confidence interval with normal approximation
- Uses sample standard deviation and sample size
- Suitable for large samples (n > 30)
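
For AOV, the normal-approximation interval could look like the sketch below. It assumes the raw order values are available as a list; `mean_ci` is a hypothetical name.

```python
import math
import statistics
from statistics import NormalDist


def mean_ci(values, alpha=0.05):
    """Normal-approximation CI for a mean: mean ± z * s / sqrt(n)."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z_crit * sd / math.sqrt(n)
    return mean, (mean - half_width, mean + half_width)
```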

**Note:** For small samples or non-normal distributions, bootstrap methods would provide more robust confidence intervals. Bootstrap intervals are not implemented yet; they are a planned enhancement.
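
One possible shape for such a bootstrap is the percentile method, sketched below. This is an illustration only: the name `bootstrap_mean_ci`, the resample count, and the fixed seed are assumptions, not part of the current framework.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean (seeded for reproducibility)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]
```

Unlike the normal-approximation interval, the percentile interval need not be symmetric around the sample mean, which is useful for skewed quantities such as order values.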


## Important Considerations

### Multiple Testing Corrections

**Current State:**
- We test **one primary metric** (CCR) at α=0.05
- Guardrails are **not** hypothesis tests; they are threshold checks
- No multiple testing correction is currently applied

**Future Considerations:**
If you expand to test **multiple primary or secondary metrics** simultaneously, consider applying corrections for multiple comparisons to control the family-wise error rate (FWER) or false discovery rate (FDR):

- **Bonferroni correction:** α_adjusted = α / number_of_tests (conservative)
- **Holm-Bonferroni:** Step-down variant of Bonferroni that still controls FWER but is less conservative
- **Benjamini-Hochberg (FDR):** Controls false discovery rate instead of FWER
- **Pre-registration:** Declare the primary metric before the experiment starts to avoid p-hacking (a complement to the corrections above, not a correction itself)
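
To make the three corrections concrete, here is a minimal standard-library sketch. Each function takes a list of p-values and returns, in the original order, whether each hypothesis is rejected. The function names are hypothetical, and none of this is applied by the current framework.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject p_i when p_i <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Step-down: compare sorted p-values to alpha / (m - rank); stop at first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank starts at 0
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up: find the largest rank k with p_(k) <= k/m * alpha; reject ranks 1..k."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            max_k = rank
    reject = [False] * m
    for i in order[:max_k]:
        reject[i] = True
    return reject
```

On the same inputs, Bonferroni never rejects more hypotheses than Holm, and Holm never rejects more than Benjamini-Hochberg, which reflects the trade-off between strictness and power.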

**Current Approach:**
- **Primary metric:** CCR is pre-specified in `configs/experiment.yml` and `references/metrics_spec.md`
- **Guardrails:** Act as constraints, not hypothesis tests. They ensure we don't ship something that degrades critical metrics, even if CCR improves.
- **Secondary metrics:** Reported for context only, not used for decision-making

**Recommendation:** If you add more than one primary metric, revisit the significance threshold and apply appropriate corrections.


## Run Statistical Analysis CLI

The following code cell will call the `run_stats.py` CLI via subprocess and capture the output for inspection and validation.


In [None]:
# Call run_stats.py via subprocess and capture its output

import subprocess
import sys
from pathlib import Path

# Resolve the project root whether the notebook runs from notebooks/ or the root
cwd = Path.cwd()
project_root = cwd.parent if cwd.name == 'notebooks' else cwd

# Run the CLI with the current interpreter. check=True is deliberately not set,
# because exit code 1 is a valid "DO NOT SHIP" result, not a crash.
result = subprocess.run(
    [sys.executable, 'src/analysis/run_stats.py'],
    cwd=project_root,
    capture_output=True,
    text=True,
)

# Display output
print(result.stdout)
if result.stderr:
    print("STDERR:", result.stderr)

print(f"\nExit code: {result.returncode}")
