Can moving a game gate from level 30 to level 40 improve retention? The answer depends on who your players are.
This project goes beyond a standard A/B test. Instead of a single global significance test, it examines heterogeneous treatment effects across player engagement segments — and simulates a segment-based rollout policy that outperforms any one-size-fits-all decision.
The Cookie Cats mobile game team moved a progression gate from level 30 → level 40. The hypothesis: a later gate reduces friction, keeping players engaged longer.
But global metrics can hide the real story. This analysis asks:
- Does the gate move actually improve Day 7 retention overall?
- Who benefits — and who doesn't?
- Can a smarter rollout policy beat both options?
| Metric | gate_30 | gate_40 | Δ (gate_40 − gate_30) | Significant? |
|---|---|---|---|---|
| D1 Retention | 44.8% | 44.2% | −0.6pp | No (p = 0.07) |
| D7 Retention | 19.0% | 18.2% | −0.8pp | Yes (p = 0.016) |
Headline: Moving the gate to level 40 slightly hurts Day 7 retention overall. But this global result masks a clear segmentation pattern.
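A global comparison like the D1 and D7 rows above is the kind of result a two-proportion z-test produces. The sketch below is a minimal standard-library version, not the notebook's actual code; the function name `two_proportion_ztest` and the counts passed to it are hypothetical, chosen only to illustrate the calculation.

```python
import math

def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference in proportions (arm b minus arm a)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis that both arms retain equally.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value

# Hypothetical counts for illustration only; NOT the experiment's sample sizes.
diff, z, p_value = two_proportion_ztest(1900, 10_000, 1820, 10_000)
print(f"diff = {diff:+.4f}, z = {z:.2f}, p = {p_value:.3f}")
```

The p-value depends heavily on the number of players per arm, so the same 0.8pp gap can be significant or not depending on sample size.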
| Engagement | Gate 30 D7 | Gate 40 D7 | Winner |
|---|---|---|---|
| Light (Q1) | 8.1% | 7.6% | gate_30 |
| Casual (Q2) | 15.3% | 14.9% | gate_30 |
| Engaged (Q3) | 24.7% | 25.1% | gate_40 |
| Power (Q4) | 38.2% | 39.6% | gate_40 |
Light and casual players retain better with the earlier gate. Engaged and power players benefit from the later one.
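The per-segment pattern can be read straight off the table. As a small sketch, the snippet below recomputes the gate_40 minus gate_30 difference and the winner for each segment from the rates shown above; the dictionary layout and the `summarize` helper are illustrative names, not part of the project.

```python
# D7 retention (%) by engagement segment, copied from the table above.
d7_by_segment = {
    "Light (Q1)": {"gate_30": 8.1, "gate_40": 7.6},
    "Casual (Q2)": {"gate_30": 15.3, "gate_40": 14.9},
    "Engaged (Q3)": {"gate_30": 24.7, "gate_40": 25.1},
    "Power (Q4)": {"gate_30": 38.2, "gate_40": 39.6},
}

def summarize(segment_rates):
    """Return (segment, delta in pp, winning gate) for each segment."""
    rows = []
    for segment, rates in segment_rates.items():
        # Round to one decimal to match the precision of the table.
        delta_pp = round(rates["gate_40"] - rates["gate_30"], 1)
        winner = max(rates, key=rates.get)
        rows.append((segment, delta_pp, winner))
    return rows

rows = summarize(d7_by_segment)
for segment, delta_pp, winner in rows:
    print(f"{segment:<13} {delta_pp:+.1f}pp  {winner}")
```

The sign of the difference flips between Q2 and Q3, which is what the segment-based policy exploits. The table does not report per-segment significance, so this sketch only restates the point estimates.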
| Policy | Expected D7 retained (per 100k installs) |
|---|---|
| Global gate_30 | 18,980 |
| Global gate_40 | 18,190 |
| Segment-based (Q1-Q2 → gate_30, Q3-Q4 → gate_40) | 19,640 |
A segment-aware policy retains 660 more players per 100k installs than the best single gate (gate_30, at 18,980). That is a relative lift of about 3.5%, obtained without running any additional experiment.
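The lift figures follow directly from the policy table. A short check of the arithmetic, using only the three numbers reported there (the dictionary keys are illustrative labels):

```python
# Expected D7-retained players per 100k installs, copied from the policy table.
expected_retained = {
    "global_gate_30": 18_980,
    "global_gate_40": 18_190,
    "segment_based": 19_640,
}

# The benchmark is the better of the two one-size-fits-all options.
best_global = max(
    expected_retained["global_gate_30"],
    expected_retained["global_gate_40"],
)
extra_players = expected_retained["segment_based"] - best_global
relative_lift = extra_players / best_global

print(extra_players)           # 660
print(f"{relative_lift:.1%}")  # 3.5%
```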
notebook/
├── 01_eda_retention.ipynb # Distributions, engagement features, retention patterns
├── 02_abtest_core.ipynb # z-tests + bootstrap CI for D1 and D7 retention
└── 03_advanced_segments.ipynb # Heterogeneous effects, logistic regression, policy simulation
- Two-proportion z-test for global significance
- Bootstrap confidence intervals for robustness check
- Logistic regression with interaction terms (version × log_rounds)
- Policy simulation comparing global vs segment-based rollout
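As a rough illustration of the bootstrap robustness check, the sketch below builds a percentile confidence interval for the difference in retention between two arms using only the standard library. It is an assumption about how such a check might look, not the notebook's implementation; `bootstrap_diff_ci` and the 0/1 retention flags are hypothetical.

```python
import random

def bootstrap_diff_ci(outcomes_a, outcomes_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(outcomes_b) - mean(outcomes_a)."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    diffs = []
    for _ in range(n_boot):
        # Resample each arm with replacement, keeping its original size.
        sample_a = rng.choices(outcomes_a, k=len(outcomes_a))
        sample_b = rng.choices(outcomes_b, k=len(outcomes_b))
        diffs.append(sum(sample_b) / len(sample_b) - sum(sample_a) / len(sample_a))
    diffs.sort()
    # Take the alpha/2 and 1 - alpha/2 empirical quantiles of the resampled differences.
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical 0/1 retention flags (1 = retained); NOT the experiment's data.
gate_30 = [1] * 190 + [0] * 810
gate_40 = [1] * 182 + [0] * 818
lower, upper = bootstrap_diff_ci(gate_30, gate_40)
print(f"95% CI for gate_40 - gate_30: [{lower:+.3f}, {upper:+.3f}]")
```

If the interval excludes zero, the bootstrap agrees with a significant z-test; with only 1,000 hypothetical players per arm, as here, the interval is wide.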
Most A/B test readouts stop at "significant vs not significant." This analysis shows:
- Global metrics can be misleading — the same treatment hurts light users and helps power users
- Segment-based policies can outperform global decisions without requiring more experiments
- The right question isn't "which gate is better?" but "which gate is better for whom?"
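In a logistic regression with a version × log_rounds interaction, "for whom" becomes a function of engagement: the treatment effect on the log-odds is the version coefficient plus the interaction coefficient times log_rounds. The sketch below assumes version is coded 1 for gate_40 and 0 for gate_30, and uses made-up coefficients purely to show the mechanics; none of these values come from the fitted model.

```python
def treatment_log_odds_effect(log_rounds, b_version, b_interaction):
    """Effect of gate_40 vs gate_30 on the log-odds of D7 retention.

    Assumes the model
        logit(p) = b0 + b_version*version + b_rounds*log_rounds
                   + b_interaction*version*log_rounds
    with version = 1 for gate_40 and 0 for gate_30.
    """
    return b_version + b_interaction * log_rounds

def crossover_log_rounds(b_version, b_interaction):
    """Engagement level at which the treatment effect changes sign."""
    return -b_version / b_interaction

# Hypothetical coefficients for illustration only; NOT fitted values.
b_version, b_interaction = -0.2, 0.05

for log_rounds in (2.0, 4.0, 6.0):
    effect = treatment_log_odds_effect(log_rounds, b_version, b_interaction)
    print(f"log_rounds = {log_rounds:.1f}: effect on log-odds = {effect:+.2f}")

crossover = crossover_log_rounds(b_version, b_interaction)
print(f"effect changes sign at log_rounds = {crossover:.1f}")
```

A negative version coefficient with a positive interaction coefficient reproduces the qualitative pattern in the segment table: the later gate hurts below the crossover and helps above it.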
This framework applies directly to pricing decisions, feature rollouts, and recommendation systems.
# Using conda
conda env create -f environment.yml
conda activate abtest-cookiecats
# Or pip
pip install -r requirements.txt

Run notebooks in order: 01 → 02 → 03
Dataset: Cookie Cats A/B Test — Kaggle
Author: Joseph Wang · josephjwang.com · GitHub