# Uplift Modeling, Multi-Armed Bandits, and Reinforcement Learning for Personalization

This notebook introduces **uplift modeling**, **multi-armed bandits**, and **reinforcement learning** for personalization, stating the core mathematical definitions behind each method.  
Applications include **fantasy sports, sports betting, mobile gaming, and consumer entertainment**.


# 1. Uplift Modeling (Causal ML)

We want to model **heterogeneous treatment effects (HTE)**:

- Features: $X \in \mathbb{R}^d$
- Treatment: $T \in \{0,1\}$
- Outcome: $Y$

Potential outcomes and the conditional average treatment effect (CATE) $\tau(x)$:
$$
Y(1), Y(0), \quad \tau(x) = \mathbb{E}[Y(1)-Y(0)\mid X=x]
$$

For each unit we observe only the outcome under the treatment actually received:
$$
Y = T Y(1) + (1-T) Y(0)
$$

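### Estimation Frameworks

Both learners below recover $\tau(x)$ only if treatment is randomized or unconfounded given $X$, i.e. $(Y(0),Y(1)) \perp T \mid X$, with overlap $0 < P(T=1\mid X=x) < 1$.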

**T-Learner** (one outcome model per treatment arm):
$$
\hat{\mu}_1(x) = \hat{\mathbb{E}}[Y\mid X=x,T=1], \quad 
\hat{\mu}_0(x) = \hat{\mathbb{E}}[Y\mid X=x,T=0]
$$
$$
\hat{\tau}(x) = \hat{\mu}_1(x) - \hat{\mu}_0(x)
$$
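
As a rough illustration of the T-learner, the sketch below fits one least-squares line per arm and subtracts the predictions. The synthetic data, the single feature, the linear outcome models, and the helper names `fit_line`, `t_learner`, and `simulate` are all assumptions made for this example; in practice any regressor could stand in for each $\hat{\mu}_t$.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y ~ intercept + slope * x (single feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def t_learner(xs, ts, ys):
    """Fit mu_1 on treated units, mu_0 on controls, return tau_hat(x)."""
    a1, b1 = fit_line([x for x, t in zip(xs, ts) if t == 1],
                      [y for y, t in zip(ys, ts) if t == 1])
    a0, b0 = fit_line([x for x, t in zip(xs, ts) if t == 0],
                      [y for y, t in zip(ys, ts) if t == 0])
    return lambda x: (a1 + b1 * x) - (a0 + b0 * x)

def simulate(n=4000, seed=0):
    """Hypothetical randomized experiment with true effect tau(x) = 0.5 * x."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 2.0) for _ in range(n)]
    ts = [rng.randint(0, 1) for _ in range(n)]  # coin-flip assignment
    ys = [1.0 + 0.3 * x + t * 0.5 * x + rng.gauss(0.0, 0.1)
          for x, t in zip(xs, ts)]
    return xs, ts, ys

xs, ts, ys = simulate()
tau_hat = t_learner(xs, ts, ys)
print(tau_hat(1.0))  # should land near the true value 0.5
```

Because assignment is a coin flip in this simulation, the difference of the two fitted lines is a consistent estimate of $\tau(x)$; with observational data the same code would be biased unless confounders are included in $X$.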

**S-Learner** (a single outcome model with $T$ as an input feature):
$$
\hat{\mu}(x,t) = \hat{\mathbb{E}}[Y\mid X=x,T=t]
$$
$$
\hat{\tau}(x) = \hat{\mu}(x,1)-\hat{\mu}(x,0)
$$
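
A worked case may help show when the S-learner can express heterogeneity. Assume, purely for illustration, a linear model with a treatment-feature interaction (the coefficients $\hat{\beta}_0,\dots,\hat{\beta}_3$ are not part of the original setup):

$$
\hat{\mu}(x,t) = \hat{\beta}_0 + \hat{\beta}_1^\top x + t\,(\hat{\beta}_2 + \hat{\beta}_3^\top x)
\quad\Longrightarrow\quad
\hat{\tau}(x) = \hat{\mu}(x,1)-\hat{\mu}(x,0) = \hat{\beta}_2 + \hat{\beta}_3^\top x
$$

Without the interaction term ($\hat{\beta}_3 = 0$) the estimated effect collapses to the constant $\hat{\beta}_2$, so under this assumed model class the S-learner captures heterogeneity only through terms that combine $t$ and $x$.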

### Decision Rule
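Treat a unit with features $x$ if the expected incremental value exceeds the cost:
$$
\hat{\tau}(x)\, V > c
$$
where $V$ is the value per conversion and $c$ is the cost of treating one unit. This reading assumes $Y$ is a binary conversion indicator, so $\hat{\tau}(x)$ is an uplift in conversion probability.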
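
The rule is a one-line comparison; a minimal sketch with hypothetical numbers follows (the function name `should_treat` and the values used are illustrative assumptions).

```python
def should_treat(tau_hat, value_per_conversion, cost):
    """Treat when expected incremental value tau_hat * V exceeds cost c."""
    return tau_hat * value_per_conversion > cost

# Hypothetical inputs: V = 50 units of value, c = 0.5 units of cost.
print(should_treat(0.02, 50.0, 0.5))   # 0.02 * 50 = 1.0, above the cost
print(should_treat(0.005, 50.0, 0.5))  # 0.005 * 50 = 0.25, below the cost
```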


# 2. Multi-Armed Bandits (MABs)

We have $K$ arms with unknown means $\mu_a$.  
At each round $t$, choose arm $A_t$ and observe reward $R_t$.

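**Regret** (pseudo-regret over a horizon of $T$ rounds; here $T$ is the horizon, not the treatment indicator of Section 1):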
$$
R(T) = T \mu^* - \sum_{t=1}^T \mu_{A_t}, \quad \mu^* = \max_a \mu_a
$$
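
To make the definition concrete, the sketch below evaluates the regret of a fixed sequence of plays. It assumes the true means are known, which only holds in simulation; `pseudo_regret` and the numbers are hypothetical.

```python
def pseudo_regret(true_means, chosen_arms):
    """R(T) = T * mu_star - sum over rounds of the mean of the arm played."""
    mu_star = max(true_means)
    return sum(mu_star - true_means[a] for a in chosen_arms)

means = [0.2, 0.5, 0.8]    # hypothetical arm means, best arm is index 2
history = [0, 1, 2, 2, 2]  # arms played in rounds 1..5
print(pseudo_regret(means, history))  # 0.6 + 0.3 + 0 + 0 + 0, about 0.9
```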

### Algorithms

**ε-Greedy** (with $\hat{\mu}_a$ the empirical mean reward of arm $a$ so far):
$$
A_t = \begin{cases}
\arg\max_a \hat{\mu}_a & \text{w.p. } 1-\epsilon \\
\text{uniformly random arm} & \text{w.p. } \epsilon
\end{cases}
$$
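
A minimal simulation of the rule above, assuming Bernoulli rewards, estimates initialized to zero, and ties broken in favor of the lowest arm index (all choices made for this sketch, not prescribed by the text):

```python
import random

def epsilon_greedy(true_means, horizon, epsilon=0.1, seed=0):
    """Simulate epsilon-greedy on Bernoulli arms; return pull counts and estimates."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    estimates = [0.0] * k  # running mean reward per arm
    for _ in range(horizon):
        if rng.random() < epsilon:
            arm = rng.randrange(k)  # explore uniformly at random
        else:
            arm = max(range(k), key=lambda a: estimates[a])  # exploit
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates

counts, estimates = epsilon_greedy([0.2, 0.5, 0.8], horizon=5000)
print(counts)  # most pulls should go to the arm with mean 0.8
```

With a constant $\epsilon$ the algorithm keeps exploring forever, so its regret grows linearly in the horizon; decaying $\epsilon$ over time is a common remedy.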

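**UCB1** (after playing each arm once; $n_a$ is the number of times arm $a$ has been played before round $t$):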
$$
A_t = \arg\max_a \left( \hat{\mu}_a + \sqrt{\tfrac{2\ln t}{n_a}} \right)
$$
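
The index rule can be sketched as follows, assuming Bernoulli rewards and an initialization phase that plays every arm once so that each $n_a$ is positive (the function name `ucb1` and the arm means are hypothetical):

```python
import math
import random

def ucb1(true_means, horizon, seed=0):
    """Simulate UCB1 on Bernoulli arms; return the number of pulls per arm."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    estimates = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialization: play each arm once
        else:
            arm = max(
                range(k),
                key=lambda a: estimates[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts

ucb_counts = ucb1([0.2, 0.5, 0.8], horizon=5000)
print(ucb_counts)  # pulls of suboptimal arms should grow only logarithmically
```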

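**Thompson Sampling:**  
For Bernoulli rewards with a $\text{Beta}(\alpha_a,\beta_a)$ posterior on each arm's mean $\theta_a$:
- Sample $\tilde{\theta}_a \sim \text{Beta}(\alpha_a,\beta_a)$ for every arm.
- Play $A_t = \arg\max_a \tilde{\theta}_a$.
- Observe $R_t \in \{0,1\}$ and update $\alpha_{A_t} \leftarrow \alpha_{A_t} + R_t$, $\beta_{A_t} \leftarrow \beta_{A_t} + 1 - R_t$.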
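
A short sketch of Beta-Bernoulli Thompson sampling, assuming a uniform $\text{Beta}(1,1)$ prior on every arm and the standard conjugate update (a success increments $\alpha_a$, a failure increments $\beta_a$). The prior and the arm means are choices made for this example.

```python
import random

def thompson_bernoulli(true_means, horizon, seed=0):
    """Simulate Thompson sampling; return pull counts and posterior parameters."""
    rng = random.Random(seed)
    k = len(true_means)
    alpha = [1.0] * k  # Beta(1, 1) prior: an assumption of this sketch
    beta = [1.0] * k
    counts = [0] * k
    for _ in range(horizon):
        # Draw one plausible mean per arm from its current posterior.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(k)]
        arm = max(range(k), key=lambda a: samples[a])
        reward = 1 if rng.random() < true_means[arm] else 0
        alpha[arm] += reward      # successes
        beta[arm] += 1 - reward   # failures
        counts[arm] += 1
    return counts, alpha, beta

ts_counts, ts_alpha, ts_beta = thompson_bernoulli([0.2, 0.5, 0.8], horizon=5000)
print(ts_counts)
```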

### Contextual Bandits
The learner observes features $X_t$ before choosing $A_t$, and the expected reward depends on both:
$$
\mathbb{E}[R_t \mid A_t=a,X_t=x] = f(x,a)
$$
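
One simple way to illustrate the contextual setting is to keep a separate table entry $\hat{f}(x,a)$ per context and arm. The sketch below assumes a small finite set of contexts drawn uniformly at random and uses ε-greedy exploration; with continuous features a parametric model of $f$ would replace the table. The name `contextual_epsilon_greedy` and the reward probabilities are hypothetical.

```python
import random

def contextual_epsilon_greedy(reward_prob, horizon, epsilon=0.1, seed=0):
    """reward_prob[x][a] is the true Bernoulli mean f(x, a)."""
    rng = random.Random(seed)
    n_contexts, k = len(reward_prob), len(reward_prob[0])
    counts = [[0] * k for _ in range(n_contexts)]
    f_hat = [[0.0] * k for _ in range(n_contexts)]  # tabular estimate of f
    for _ in range(horizon):
        x = rng.randrange(n_contexts)  # a context arrives
        if rng.random() < epsilon:
            a = rng.randrange(k)
        else:
            a = max(range(k), key=lambda arm: f_hat[x][arm])
        r = 1.0 if rng.random() < reward_prob[x][a] else 0.0
        counts[x][a] += 1
        f_hat[x][a] += (r - f_hat[x][a]) / counts[x][a]
    return f_hat, counts

# Hypothetical example: the best arm flips between the two contexts.
f_hat, ctx_counts = contextual_epsilon_greedy([[0.8, 0.2], [0.2, 0.8]], horizon=6000)
print(ctx_counts)
```

A non-contextual bandit would see both arms as having mean 0.5 here and could not do better than chance, which is the motivation for conditioning on $X_t$.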


# 3. Reinforcement Learning (RL)

Formulate personalization as an MDP:

- State $s_t \in S$
- Action $a_t \in A$
- Transition $s_{t+1} \sim P(\cdot|s_t,a_t)$
- Reward $r_t = R(s_t,a_t)$

Objective (expected discounted return, with discount factor $\gamma \in [0,1)$):
$$
J(\pi) = \mathbb{E}_\pi \left[ \sum_{t=0}^\infty \gamma^t r_t \right]
$$

### Value Functions

State-value:
$$
V^\pi(s) = \mathbb{E}_\pi\left[ \sum_{t=0}^\infty \gamma^t r_t \mid s_0=s \right]
$$

Action-value:
$$
Q^\pi(s,a) = \mathbb{E}_\pi\left[ \sum_{t=0}^\infty \gamma^t r_t \mid s_0=s,a_0=a \right]
$$

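Bellman expectation equations:
$$
V^\pi(s) = \sum_a \pi(a|s) \Big( R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^\pi(s') \Big)
$$
$$
Q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \sum_{a'} \pi(a'|s') Q^\pi(s',a')
$$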
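
The Bellman equation for $V^\pi$ can be turned into an algorithm by applying its right-hand side repeatedly until the values stop changing (iterative policy evaluation). The sketch below assumes a known, finite MDP; the two-state "lapsed/engaged" example and all of its numbers are hypothetical.

```python
def policy_evaluation(P, R, policy, gamma=0.9, tol=1e-10):
    """Iterate the Bellman expectation equation to a fixed point.

    P[s][a][s2] is a transition probability, R[s][a] a reward,
    policy[s][a] the probability of taking action a in state s.
    """
    n_states = len(P)
    V = [0.0] * n_states
    while True:
        new_V = []
        for s in range(n_states):
            v = 0.0
            for a in range(len(P[s])):
                expected_next = sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
                v += policy[s][a] * (R[s][a] + gamma * expected_next)
            new_V.append(v)
        if max(abs(x - y) for x, y in zip(new_V, V)) < tol:
            return new_V
        V = new_V

# Hypothetical MDP: state 0 = lapsed user, state 1 = engaged user;
# action 0 = no offer, action 1 = send offer (costs 1).
P = [[[1.0, 0.0], [0.0, 1.0]],   # from lapsed: stay lapsed / become engaged
     [[0.0, 1.0], [0.0, 1.0]]]   # from engaged: stay engaged either way
R = [[0.0, -1.0], [2.0, 1.0]]
policy = [[0.0, 1.0], [1.0, 0.0]]  # offer when lapsed, no offer when engaged
V = policy_evaluation(P, R, policy)
print(V)  # approximately [17.0, 20.0]
```

The result can be checked by hand: $V^\pi(1) = 2/(1-0.9) = 20$ and $V^\pi(0) = -1 + 0.9 \cdot 20 = 17$.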

### Learning Algorithms

**Q-learning** (off-policy temporal-difference update with learning rate $\alpha$):
$$
Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha [r_t + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t)]
$$
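
A tabular sketch of the update, assuming a small finite MDP that can be sampled from, ε-greedy behavior, and fixed-length episodes. The two-state example (state 0 = lapsed, state 1 = engaged; action 1 = send an offer at cost 1) and every hyperparameter are hypothetical.

```python
import random

def q_learning(P, R, episodes=500, steps=50, alpha=0.1, gamma=0.9,
               epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    n_states, n_actions = len(P), len(P[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = rng.randrange(n_states)  # random start state
        for _ in range(steps):
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda b: Q[s][b])
            # Sample the next state from P(. | s, a).
            s_next = rng.choices(range(n_states), weights=P[s][a])[0]
            td_target = R[s][a] + gamma * max(Q[s_next])
            Q[s][a] += alpha * (td_target - Q[s][a])
            s = s_next
    return Q

P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
R = [[0.0, -1.0], [2.0, 1.0]]
Q = q_learning(P, R)
greedy = [max(range(2), key=lambda b: Q[s][b]) for s in range(2)]
print(greedy)  # expected greedy policy: offer when lapsed, none when engaged
```

The `max` over next-state actions in `td_target` is what makes the update off-policy: it bootstraps from the greedy action even when the behavior policy explored.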

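**Policy Gradient** (policy gradient theorem, for a policy $\pi_\theta$ with parameters $\theta$):
$$
\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}[ \nabla_\theta \log \pi_\theta(a|s) Q^{\pi_\theta}(s,a) ]
$$
Here the expectation is over states from the discounted state-visitation distribution of $\pi_\theta$ and actions $a \sim \pi_\theta(\cdot|s)$; with a normalized state distribution the equality holds up to a constant factor.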
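
As a minimal illustration, the sketch below applies a REINFORCE-style stochastic gradient step to a single-state problem with a softmax policy. It assumes one-step episodes, so the sampled reward serves as an unbiased stand-in for $Q^{\pi_\theta}(s,a)$, and it uses the softmax identity $\partial \log \pi_\theta(a) / \partial \theta_b = \mathbb{1}[a=b] - \pi_\theta(b)$. No baseline is used; the function names and reward probabilities are hypothetical.

```python
import math
import random

def softmax(prefs):
    """Numerically stable softmax over a list of preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_bandit(true_means, steps=5000, lr=0.1, seed=0):
    """Stochastic gradient ascent on expected reward with a softmax policy."""
    rng = random.Random(seed)
    theta = [0.0] * len(true_means)
    for _ in range(steps):
        probs = softmax(theta)
        a = rng.choices(range(len(theta)), weights=probs)[0]
        reward = 1.0 if rng.random() < true_means[a] else 0.0
        for b in range(len(theta)):
            grad_log = (1.0 if b == a else 0.0) - probs[b]
            theta[b] += lr * reward * grad_log  # sample of grad log pi * return
    return softmax(theta)

final_probs = reinforce_bandit([0.2, 0.8])
print(final_probs)  # probability mass should concentrate on the second action
```

In multi-step problems the reward would be replaced by the sampled discounted return from each time step, and subtracting a baseline is a common way to reduce the variance of this estimator.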


# 4. Summary Table

| Method              | Objective                     | Key Equation(s)                                  | Typical Use |
|---------------------|-------------------------------|--------------------------------------------------|-------------|
| **Uplift Modeling** | Estimate causal effect        | $\tau(x)=\mathbb{E}[Y(1)-Y(0)\mid X=x]$        | Targeting |
| **Bandits**         | Minimize cumulative regret    | $R(T)=T\mu^* - \sum_{t=1}^T \mu_{A_t}$        | Online personalization |
| **Reinforcement Learning** | Maximize discounted reward | $J(\pi)=\mathbb{E}_\pi[\sum_{t=0}^\infty \gamma^t r_t]$ | Sequential personalization |
