# Experiment Readout

**Notebook:** experiment_readout  
**Purpose:** Comprehensive analysis and decision framework for A/B test results  

---

This notebook provides a structured readout for evaluating A/B test results, assessing guardrail metrics, and making a ship/no-ship decision based on statistical evidence and business context.


In [None]:
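# Load results from reports/results/ (assumes pandas is installed and the working directory is the repo root)
import json
from pathlib import Path
import pandas as pd
RESULTS_DIR = Path("reports/results")
ccr_summary = json.loads((RESULTS_DIR / "ccr_summary.json").read_text())  # primary metric test results
guardrails_summary = json.loads((RESULTS_DIR / "guardrails_summary.json").read_text())  # guardrail results
variant_funnel = pd.read_csv(RESULTS_DIR / "variant_funnel.csv")  # funnel metrics by variant
run_meta = json.loads((RESULTS_DIR / "_run_meta.json").read_text())  # run metadata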



## 1. Experiment Context

### Experiment Overview

**Experiment Name:** [From configs/experiment.yml]  
**Hypothesis:** [What change was made and why]  
**Start Date:** [YYYY-MM-DD]  
**End Date:** [YYYY-MM-DD]  
**Duration:** [N days]  
**Traffic Allocation:** 50% control / 50% treatment

### Data Sources

**Raw Event Data:**
- `data/raw/add_to_cart/date=YYYY-MM-DD/`
- `data/raw/begin_checkout/date=YYYY-MM-DD/`
- `data/raw/checkout_step_view/date=YYYY-MM-DD/`
- `data/raw/payment_attempt/date=YYYY-MM-DD/`
- `data/raw/order_completed/date=YYYY-MM-DD/`

**Processed Results:**
- `reports/results/ccr_summary.json` - Primary metric statistical test results
- `reports/results/guardrails_summary.json` - Guardrail metric results
- `reports/results/variant_funnel.csv` - Funnel metrics by variant

**Database:** `duckdb/warehouse.duckdb`

### Sample Size and Power

**Pre-experiment Planning:**
- Target MDE: [X.X%]
- Planned sample size: [N users]
- Expected power: [XX%]
- Significance level (α): 0.05
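
The planned sample size follows from the target MDE, α, and power. Below is a minimal sketch of the usual normal-approximation formula for a two-sided two-proportion test. The function name `sample_size_per_arm` and the example inputs are illustrative assumptions, not values from this experiment's plan. Because CCR is measured per adder, the result should be read as adders per arm.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Approximate units per arm needed to detect an absolute lift of mde_abs."""
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / mde_abs ** 2)


# Hypothetical inputs: 37% baseline CCR, 2 pp absolute MDE.
planned_n = sample_size_per_arm(0.37, 0.02)
```

Halving the MDE roughly quadruples the required sample, which is why the MDE should be fixed before the experiment starts.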

**Actual Experiment:**
- Control users: [N]
- Treatment users: [N]
- Total users: [N]
- Sample Ratio Mismatch (SRM): [Pass/Fail]
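
The SRM check can be sketched as a chi-square goodness-of-fit test against the planned 50/50 allocation. This is an illustration under assumptions: the function name `srm_check` and the strict `alpha=0.001` cutoff (a common convention for SRM alarms) are not taken from this project's pipeline.

```python
import math


def srm_check(control_users, treatment_users, expected_control_share=0.5, alpha=0.001):
    """Chi-square test (1 degree of freedom) of observed vs. planned allocation."""
    total = control_users + treatment_users
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = (
        (control_users - expected_control) ** 2 / expected_control
        + (treatment_users - expected_treatment) ** 2 / expected_treatment
    )
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return {"chi2": chi2, "p_value": p_value, "passed": p_value >= alpha}
```

A failed SRM check usually points to an assignment or logging problem, so the metric comparisons further down should not be trusted until it is explained.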


## 2. Primary Metric Results

### Conditional Conversion Rate (CCR)

**Definition:** Orders / Adders (users who added to cart)

**Raw Metrics:**

| Variant | Adders | Orders | CCR |
|---------|--------|--------|-----|
| Control | [N] | [N] | [X.X%] |
| Treatment | [N] | [N] | [X.X%] |

### Statistical Test Results

**Test:** Two-proportion z-test (pooled variance)  
**Null Hypothesis:** Treatment CCR = Control CCR  
**Alternative Hypothesis:** Treatment CCR ≠ Control CCR (two-sided)

**Results:**
- **Effect Size (Treatment - Control):** [+X.XX percentage points]
- **95% Confidence Interval:** [[X.XX, X.XX] pp]
- **p-value:** [0.XXXX]
- **Statistical Significance:** [✓ Significant / ✗ Not Significant]
- **Practical Significance:** [Meets/Does not meet MDE threshold of X.X pp]
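
For reference, the test described above can be sketched in a few lines using only the standard library. The function name and return keys are hypothetical. The pooled standard error is used for the test statistic, as stated above. The unpooled standard error is used for the confidence interval, which is a common choice but an assumption about how `ccr_summary.json` was produced.

```python
import math
from statistics import NormalDist


def two_proportion_ztest(orders_c, adders_c, orders_t, adders_t, alpha=0.05):
    """Two-sided two-proportion z-test on CCR = orders / adders."""
    p_c = orders_c / adders_c
    p_t = orders_t / adders_t
    diff = p_t - p_c

    # Pooled variance under the null hypothesis (equal CCR in both arms).
    pooled = (orders_c + orders_t) / (adders_c + adders_t)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / adders_c + 1 / adders_t))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled variance for the interval around the observed difference.
    se_unpooled = math.sqrt(p_c * (1 - p_c) / adders_c + p_t * (1 - p_t) / adders_t)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci_pp = ((diff - z_crit * se_unpooled) * 100, (diff + z_crit * se_unpooled) * 100)

    return {"diff_pp": diff * 100, "ci_pp": ci_pp, "z": z, "p_value": p_value}
```

Because the test and the interval use different standard errors, they can disagree in rare borderline cases. When that happens, report both rather than choosing the more favorable one.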

### Interpretation

**Statistical:**
- [If p < 0.05: We reject the null hypothesis. There is statistically significant evidence that treatment affects CCR.]
- [If p >= 0.05: We fail to reject the null hypothesis. No statistically significant difference detected.]

**Practical:**
- [If effect ≥ MDE: The observed lift meets or exceeds our minimum detectable effect, suggesting meaningful business impact.]
- [If effect < MDE: The lift is smaller than our target and may not justify launch costs.]
- [If CI includes 0: We cannot rule out no effect or even a negative effect.]

**Business Impact (if launched):**
- Estimated incremental orders per day: [N]
- Estimated incremental revenue per day: [$X,XXX]
- Annualized revenue impact: [$X.X million]
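
The business impact figures are simple arithmetic on the CCR lift. The sketch below assumes stable daily adder volume and an AOV that does not differ between variants. The function name and parameters are hypothetical. It uses a single lift value, so running it at both confidence interval bounds as well as the point estimate gives a more honest range.

```python
def business_impact(daily_adders, lift_pp, avg_order_value, days_per_year=365):
    """Translate an absolute CCR lift (in percentage points) into orders and revenue."""
    incremental_orders_per_day = daily_adders * lift_pp / 100
    incremental_revenue_per_day = incremental_orders_per_day * avg_order_value
    return {
        "incremental_orders_per_day": incremental_orders_per_day,
        "incremental_revenue_per_day": incremental_revenue_per_day,
        "annualized_revenue": incremental_revenue_per_day * days_per_year,
    }
```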


## 3. Guardrails

Guardrail metrics protect against unintended negative consequences. Every guardrail must stay within its acceptable bounds: no statistically significant degradation, and no degradation beyond the threshold listed for that metric.

### Payment Authorization Rate

**Definition:** Percentage of payment attempts that are successfully authorized

| Variant | Attempts | Authorized | Auth Rate | 95% CI |
|---------|----------|------------|-----------|--------|
| Control | [N] | [N] | [XX.X%] | [[XX.X%, XX.X%]] |
| Treatment | [N] | [N] | [XX.X%] | [[XX.X%, XX.X%]] |

**Status:** [✓ Pass / ✗ Fail]  
**Threshold:** No degradation > 0.5pp  
**Decision:** [Treatment does not significantly harm payment authorization]
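
One way to evaluate this guardrail is sketched below. The pass rule (observed drop within 0.5 pp) and the "borderline" flag (the confidence interval still allows a larger drop) are one reading of the threshold above, not a confirmed description of how `guardrails_summary.json` is computed. All names are hypothetical.

```python
import math
from statistics import NormalDist


def auth_rate_guardrail(auth_c, attempts_c, auth_t, attempts_t,
                        max_drop_pp=0.5, alpha=0.05):
    """Compare authorization rates; pp values are treatment minus control."""
    rate_c = auth_c / attempts_c
    rate_t = auth_t / attempts_t
    diff_pp = (rate_t - rate_c) * 100
    se_pp = 100 * math.sqrt(
        rate_c * (1 - rate_c) / attempts_c + rate_t * (1 - rate_t) / attempts_t
    )
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci_pp = (diff_pp - z_crit * se_pp, diff_pp + z_crit * se_pp)
    return {
        "diff_pp": diff_pp,
        "ci_pp": ci_pp,
        # Assumed rule: pass when the observed drop is within the threshold.
        "passed": diff_pp >= -max_drop_pp,
        # Borderline: the interval still allows a drop beyond the threshold.
        "borderline": ci_pp[0] < -max_drop_pp,
    }
```

A guardrail that passes but is flagged borderline is a natural candidate for the follow-up monitoring list.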

---

### Average Order Value (AOV)

**Definition:** Mean order value across completed orders

| Variant | Orders | Mean AOV | 95% CI |
|---------|--------|----------|--------|
| Control | [N] | [$XXX.XX] | [[$XXX, $XXX]] |
| Treatment | [N] | [$XXX.XX] | [[$XXX, $XXX]] |

**Status:** [✓ Pass / ✗ Fail]  
**Threshold:** No degradation > $5  
**Decision:** [Treatment does not significantly harm AOV]
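
The table and threshold above can be reproduced from per-order values with a short sketch like the following. It uses a normal-approximation interval, which assumes a large number of orders. Order values are often skewed, so a bootstrap interval may be more appropriate. Function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def mean_with_ci(values, alpha=0.05):
    """Sample mean with a normal-approximation confidence interval."""
    m = mean(values)
    se = stdev(values) / math.sqrt(len(values))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return m, (m - z_crit * se, m + z_crit * se)


def aov_guardrail(control_values, treatment_values, max_drop=5.0):
    """Pass when treatment mean AOV is at most max_drop (dollars) below control."""
    mean_c, ci_c = mean_with_ci(control_values)
    mean_t, ci_t = mean_with_ci(treatment_values)
    drop = mean_c - mean_t
    return {
        "control": (mean_c, ci_c),
        "treatment": (mean_t, ci_t),
        "drop": drop,
        "passed": drop <= max_drop,
    }
```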

---

### Payment Latency (p95)

**Definition:** 95th percentile of payment step latency (milliseconds)

| Variant | p95 Latency (ms) | Threshold | Status |
|---------|------------------|-----------|--------|
| Control | [XXX ms] | < 3000 ms | ✓ |
| Treatment | [XXX ms] | < 3000 ms | [✓ / ✗] |

**Status:** [✓ Pass / ✗ Fail]  
**Threshold:** p95 < 3000 ms AND treatment p95 no more than 200 ms above control  
**Decision:** [Treatment maintains acceptable latency performance]
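
A sketch of the latency guardrail, assuming raw per-request latencies in milliseconds are available for each variant. The nearest-rank percentile definition is an assumption. Other percentile methods interpolate and can differ slightly on small samples. Function names are hypothetical.

```python
import math


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def latency_guardrail(control_ms, treatment_ms, ceiling_ms=3000, max_increase_ms=200):
    """Pass when treatment p95 is under the ceiling and not much slower than control."""
    p95_c = nearest_rank_percentile(control_ms, 95)
    p95_t = nearest_rank_percentile(treatment_ms, 95)
    passed = p95_t < ceiling_ms and (p95_t - p95_c) <= max_increase_ms
    return {"p95_control": p95_c, "p95_treatment": p95_t, "passed": passed}
```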

---

### Guardrail Summary

**Overall Guardrail Status:** [✓ All Pass / ✗ One or more failures]

- Payment Authorization: [Pass/Fail]
- Average Order Value: [Pass/Fail]
- Payment Latency p95: [Pass/Fail]

**Concerns:** [List any guardrails that are borderline or require follow-up monitoring]


## 4. Funnel Diagnostics

### Full Funnel View

**Checkout Funnel Stages:**

| Stage | Control Count | Control Rate | Treatment Count | Treatment Rate | Δ |
|-------|---------------|--------------|-----------------|----------------|---|
| Add to Cart | [N] | 100% | [N] | 100% | - |
| Begin Checkout | [N] | [XX%] | [N] | [XX%] | [+X.X pp] |
| Payment Step View | [N] | [XX%] | [N] | [XX%] | [+X.X pp] |
| Payment Attempt | [N] | [XX%] | [N] | [XX%] | [+X.X pp] |
| Order Completed | [N] | [XX%] | [N] | [XX%] | [+X.X pp] |
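
The rates in this table are each stage's count divided by the Add to Cart count. A minimal sketch, assuming stage counts keyed by the raw event names listed under Data Sources (with `checkout_step_view` standing in for the Payment Step View row, which is an assumption). The actual column layout of `variant_funnel.csv` is not shown here, so plain dictionaries are used.

```python
STAGES = [
    "add_to_cart",
    "begin_checkout",
    "checkout_step_view",
    "payment_attempt",
    "order_completed",
]


def funnel_rates(counts):
    """Rate of each stage relative to the top of the funnel (add_to_cart)."""
    top = counts[STAGES[0]]
    return {stage: counts[stage] / top for stage in STAGES}


def funnel_delta_pp(control_counts, treatment_counts):
    """Treatment minus control rate at each stage, in percentage points."""
    rates_c = funnel_rates(control_counts)
    rates_t = funnel_rates(treatment_counts)
    return {stage: (rates_t[stage] - rates_c[stage]) * 100 for stage in STAGES}
```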

### Step-Through Rates

**Add to Cart → Begin Checkout:**
- Control: [XX.X%]
- Treatment: [XX.X%]
- Δ: [+X.X pp]
- **Insight:** [Treatment increases/decreases intent to checkout]

**Begin Checkout → Payment Attempt:**
- Control: [XX.X%]
- Treatment: [XX.X%]
- Δ: [+X.X pp]
- **Insight:** [Treatment reduces/increases friction in checkout flow]

**Payment Attempt → Order Completed:**
- Control: [XX.X%]
- Treatment: [XX.X%]
- Δ: [+X.X pp]
- **Insight:** [Treatment improves/worsens payment success rate]

### Drop-off Analysis

**Largest Drop-off Point:**
- Stage: [Stage name]
- Control drop-off: [XX.X%]
- Treatment drop-off: [XX.X%]
- **Action:** [Investigate/monitor this step if treatment worsens drop-off]
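
Step-through rates differ from the full-funnel rates in that each stage is divided by the stage immediately before it. The sketch below computes them and picks the transition with the largest drop-off. Function names and the dictionary input format are hypothetical.

```python
def step_through_rates(counts, stages):
    """Conditional conversion between each pair of adjacent stages."""
    return {(a, b): counts[b] / counts[a] for a, b in zip(stages, stages[1:])}


def largest_drop_off(counts, stages):
    """Return the transition with the lowest step-through rate and its drop-off share."""
    rates = step_through_rates(counts, stages)
    step = min(rates, key=rates.get)
    return step, 1 - rates[step]


# Stages matching the three transitions reported above.
REPORTED_STAGES = ["add_to_cart", "begin_checkout", "payment_attempt", "order_completed"]
```

Running `largest_drop_off` separately for control and treatment shows whether the treatment moved the bottleneck or only changed its size.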

**Form Errors:**
- Control error rate: [X.X%]
- Treatment error rate: [X.X%]
- **Status:** [Treatment reduces/increases form errors]

### Latency by Step

| Step | Control Median (ms) | Treatment Median (ms) | Δ |
|------|---------------------|----------------------|---|
| Shipping | [XXX] | [XXX] | [+XX] |
| Payment | [XXX] | [XXX] | [+XX] |
| Review | [XXX] | [XXX] | [+XX] |

**Insight:** [Treatment increases/decreases latency at certain steps]


## 5. Segment Cuts

Analyzing key segments to understand if treatment effects vary across user populations.

### By Device Type

**Desktop:**
- Control CCR: [XX.X%]
- Treatment CCR: [XX.X%]
- Lift: [+X.X pp]
- Significance: [✓ / ✗]

**Mobile:**
- Control CCR: [XX.X%]
- Treatment CCR: [XX.X%]
- Lift: [+X.X pp]
- Significance: [✓ / ✗]

**Tablet:**
- Control CCR: [XX.X%]
- Treatment CCR: [XX.X%]
- Lift: [+X.X pp]
- Significance: [✓ / ✗]

**Insight:** [Treatment performs better/worse on mobile vs desktop]
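
Testing several segments inflates the chance of a false positive, so per-segment significance is best judged against a corrected threshold. The sketch below applies a Bonferroni correction across the segments passed in. The correction method, function name, and input format are assumptions rather than part of the existing pipeline.

```python
import math
from statistics import NormalDist


def segment_lifts(segments, alpha=0.05):
    """segments maps name -> (orders_c, adders_c, orders_t, adders_t)."""
    adjusted_alpha = alpha / len(segments)  # Bonferroni correction
    results = {}
    for name, (orders_c, adders_c, orders_t, adders_t) in segments.items():
        p_c = orders_c / adders_c
        p_t = orders_t / adders_t
        pooled = (orders_c + orders_t) / (adders_c + adders_t)
        se = math.sqrt(pooled * (1 - pooled) * (1 / adders_c + 1 / adders_t))
        z = (p_t - p_c) / se
        p_value = 2 * (1 - NormalDist().cdf(abs(z)))
        results[name] = {
            "lift_pp": (p_t - p_c) * 100,
            "p_value": p_value,
            "significant": p_value < adjusted_alpha,
        }
    return results
```

Segment results are exploratory unless the segments were pre-registered. A significant segment here is a hypothesis for a follow-up test, not a launch criterion on its own.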

---

### By Country/Region

**United States:**
- Control CCR: [XX.X%]
- Treatment CCR: [XX.X%]
- Lift: [+X.X pp]

**International:**
- Control CCR: [XX.X%]
- Treatment CCR: [XX.X%]
- Lift: [+X.X pp]

**Insight:** [Treatment effect is consistent/varies across regions]

---

### By Traffic Source

**Organic:**
- Control CCR: [XX.X%]
- Treatment CCR: [XX.X%]
- Lift: [+X.X pp]

**Paid:**
- Control CCR: [XX.X%]
- Treatment CCR: [XX.X%]
- Lift: [+X.X pp]

**Direct:**
- Control CCR: [XX.X%]
- Treatment CCR: [XX.X%]
- Lift: [+X.X pp]

**Insight:** [Treatment benefits certain traffic sources more than others]

---

### Heterogeneous Treatment Effects (HTE)

**Key Findings:**
- [Identify any segments where treatment effect is significantly different]
- [Note if effect is positive in aggregate but negative in important segments]
- [Consider targeted rollout if HTE is substantial]

**Recommendation:**
- [Launch globally / Consider segment-specific rollout / Further investigation needed]
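
To check whether the treatment effect truly differs between two segments, compare the lifts directly instead of comparing their individual significance flags. A minimal sketch of a z-test on the difference in lifts, assuming the two segments contain disjoint users so their estimates are independent. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def lift_difference_test(segment_a, segment_b):
    """Each segment is (orders_c, adders_c, orders_t, adders_t)."""

    def lift_and_variance(orders_c, adders_c, orders_t, adders_t):
        p_c = orders_c / adders_c
        p_t = orders_t / adders_t
        variance = p_c * (1 - p_c) / adders_c + p_t * (1 - p_t) / adders_t
        return p_t - p_c, variance

    lift_a, var_a = lift_and_variance(*segment_a)
    lift_b, var_b = lift_and_variance(*segment_b)
    z = (lift_a - lift_b) / math.sqrt(var_a + var_b)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {"lift_gap_pp": (lift_a - lift_b) * 100, "z": z, "p_value": p_value}
```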


## 6. Decision and Rationale

### Decision Framework

**Criteria for Launch:**
1. ✓ / ✗ Primary metric (CCR) shows positive lift
2. ✓ / ✗ Statistical significance (p < 0.05)
3. ✓ / ✗ Practical significance (lift meets or exceeds MDE)
4. ✓ / ✗ All guardrails pass
5. ✓ / ✗ No concerning segment effects
6. ✓ / ✗ No operational/technical risks
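
The checklist can be encoded so the same rule is applied to every readout. The mapping below from criteria to SHIP / DO NOT SHIP / ITERATE is an assumed policy for illustration. This template does not define one, and the final call remains a judgment that includes business context.

```python
def ship_decision(positive_lift, statistically_significant, meets_mde,
                  guardrails_pass, segments_ok, no_operational_risk):
    """Map the six launch criteria to a suggested decision (assumed policy)."""
    criteria = [
        positive_lift,
        statistically_significant,
        meets_mde,
        guardrails_pass,
        segments_ok,
        no_operational_risk,
    ]
    if all(criteria):
        return "SHIP"
    # Assumed hard stops: a failing guardrail or a non-positive primary metric.
    if not guardrails_pass or not positive_lift:
        return "DO NOT SHIP"
    # Anything else is treated as inconclusive and sent back for iteration.
    return "ITERATE"
```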

---

### Final Decision: [SHIP / DO NOT SHIP / ITERATE]

### Rationale

**In Favor of Shipping:**
- [Primary metric shows X.X pp lift with p = 0.XXX]
- [Effect size exceeds MDE of X.X pp]
- [All guardrails remain healthy]
- [Consistent positive effects across key segments]
- [Estimated annualized revenue impact: $X.X million]

**Observations from Current Experiment:**
- Treatment shows a +0.89pp lift in CCR (36.9% → 37.8%), but this does not reach statistical significance (p = 0.36)
- The 95% confidence interval [-1.00pp, +2.79pp] includes zero, meaning we cannot rule out no effect or even a slight negative effect
- All guardrails remain healthy, with no degradation in payment authorization rate or average order value, suggesting the treatment is safe to run even though the primary result is inconclusive

**Concerns/Risks:**
- [List any concerns, e.g., borderline significance, small sample size]
- [Note any guardrails that are trending negative but not failing]
- [Highlight any segments with negative effects]

**Alternatives Considered:**
- [If not shipping: What iteration or follow-up experiment is recommended]
- [If shipping: Any phased rollout or monitoring plan]

---

### Stakeholder Sign-off

**Product Manager:** [Name] - [Approved / Declined]  
**Data Science:** [Name] - [Approved / Declined]  
**Engineering:** [Name] - [Approved / Declined]  
**Business Owner:** [Name] - [Approved / Declined]

**Launch Date:** [YYYY-MM-DD] or [Declined - no launch]

---

### Decision Audit Trail

**Key Assumptions:**
1. [Baseline CCR remains stable post-launch]
2. [Treatment implementation matches experiment experience]
3. [No confounding seasonal or promotional effects]

**Tie to Business Goals:**
- [How this decision supports quarterly/annual OKRs]
- [Alignment with product strategy and roadmap]

**Reversibility:**
- [Can this change be easily rolled back if issues arise?]
- [What metrics will trigger a rollback?]


## 7. Risks and Follow-ups

### Known Risks

**Technical Risks:**
- [Risk 1: e.g., Increased latency on slower networks]
  - Likelihood: [Low/Medium/High]
  - Impact: [Low/Medium/High]
  - Mitigation: [Monitoring plan, rollback criteria]

- [Risk 2: e.g., Browser compatibility issues]
  - Likelihood: [Low/Medium/High]
  - Impact: [Low/Medium/High]
  - Mitigation: [Testing plan, progressive enhancement]

**Business Risks:**
- [Risk 1: e.g., Effect may not persist long-term (novelty effect)]
  - Likelihood: [Low/Medium/High]
  - Impact: [Low/Medium/High]
  - Mitigation: [Post-launch monitoring for 30 days]

- [Risk 2: e.g., Negative impact on customer support volume]
  - Likelihood: [Low/Medium/High]
  - Impact: [Low/Medium/High]
  - Mitigation: [Monitor support tickets, prepare CS team]

**Data Quality Risks:**
- [Risk: Sample ratio mismatch or instrumentation issues]
  - Status: [SRM check passed/failed]
  - Action: [Re-validate instrumentation post-launch]

---

### Post-Launch Monitoring Plan

**Metrics to Monitor:**
1. **Conditional Conversion Rate (CCR)** - Daily for 2 weeks, then weekly
2. **Payment Authorization Rate** - Daily for 1 week
3. **Average Order Value** - Weekly
4. **Payment Latency p95** - Daily for 1 week
5. **Customer Support Contacts** - Daily for 2 weeks

**Rollback Criteria:**
- CCR degrades by > 1pp for 3 consecutive days
- Payment authorization rate drops below 90%
- p95 latency exceeds 3000ms
- Any critical bugs or customer escalations
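
The rollback criteria above are concrete enough to automate. The sketch below assumes CCR is compared against a fixed pre-launch baseline and that the inputs are supplied by the monitoring dashboard. The function name, parameter names, and input format are hypothetical.

```python
def rollback_reasons(daily_ccr_pct, baseline_ccr_pct, auth_rate_pct,
                     p95_latency_ms, critical_incident=False):
    """Return the list of rollback criteria that are currently triggered."""
    reasons = []

    # CCR more than 1 pp below baseline for 3 consecutive days.
    streak = 0
    for ccr in daily_ccr_pct:
        streak = streak + 1 if baseline_ccr_pct - ccr > 1.0 else 0
        if streak >= 3:
            reasons.append("CCR degraded by > 1pp for 3 consecutive days")
            break

    if auth_rate_pct < 90.0:
        reasons.append("Payment authorization rate below 90%")
    if p95_latency_ms > 3000:
        reasons.append("p95 latency exceeds 3000ms")
    if critical_incident:
        reasons.append("Critical bug or customer escalation")
    return reasons
```

An empty list means no criterion is triggered. Any non-empty result should prompt a review by the owners listed under Stakeholder Sign-off.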

**Review Cadence:**
- **Week 1:** Daily review of key metrics
- **Week 2:** Review every 2 days
- **Weeks 3-4:** Weekly review
- **30-day retrospective:** Full analysis comparing pre/post launch

---

### Follow-up Experiments

**Iteration Opportunities:**
1. [Experiment idea 1: e.g., Test different messaging on treatment page]
   - Rationale: [Further optimize conversion]
   - Timeline: [Q3 2025]

2. [Experiment idea 2: e.g., Extend treatment to mobile app]
   - Rationale: [Expand treatment reach]
   - Timeline: [Q4 2025]

**Open Questions:**
- [Question 1: What drives the lift - reduced friction or improved trust?]
  - Proposed investigation: [User survey, session replay analysis]

- [Question 2: Will the effect persist beyond 30 days?]
  - Proposed investigation: [Extended monitoring, cohort analysis]

---

### Lessons Learned

**What Went Well:**
- [e.g., Clean instrumentation, no data quality issues]
- [e.g., Strong cross-functional collaboration]
- [e.g., Clear hypothesis and success criteria upfront]

**What Could Be Improved:**
- [e.g., Should have run sensitivity analysis earlier]
- [e.g., Could have tested more granular segments]
- [e.g., Need better mobile-specific metrics]

**Process Improvements:**
- [Action 1: Add segment analysis to pre-launch checklist]
- [Action 2: Automate guardrail checks]
- [Action 3: Standardize readout template]

---

### Documentation Links

**Related Artifacts:**
- Experiment plan: [Link]
- Design mocks: [Link]
- Technical spec: [Link]
- JIRA ticket: [Link]
- Dashboard: http://localhost:8501
- Final report: `reports/REPORT.md`

**References:**
- Metrics specification: `references/metrics_spec.md`
- Analysis checklist: `references/analysis_checklist.md`
- A/A validation plan: `references/aa_validation_plan.md`


In [None]:
# Save figures to reports/figures/
# - funnel_comparison.png
# - ccr_by_segment.png
# - power_curve.png

