# Bayesian Bandits: An Application of Thompson Sampling

This notebook implements the Thompson Sampling algorithm described in [Analysis of Thompson Sampling for the Multi-Armed Bandit Problem](http://proceedings.mlr.press/v23/agrawal12/agrawal12.pdf) by Shipra Agrawal and Navin Goyal and applies it to the Bayesian bandit problem.

## Table of Contents
[Section 1: Thompson Sampling for Bernoulli Bandits](#section1)

[Section 2: Thompson Sampling for General Stochastic Bandits](#section2)

## Section 1: Thompson Sampling for Bernoulli Bandits <a name="section1"></a>

We will first examine the Bernoulli Bandit case where:
* Rewards $r_t \in \{0, 1\}$
* For arm $i$, the probability of success ($r_{t,i} = 1$) is its mean, $\mu_i$

We maintain a Bayesian prior on each arm mean $\mu_i$. For Bernoulli rewards (either 0 or 1), the Beta distribution is a convenient choice of prior because it is conjugate to the Bernoulli likelihood: if the prior is a Beta($\alpha$, $\beta$) distribution, the posterior is again a Beta distribution, so updating it only requires incrementing one parameter. After observing a Bernoulli trial, the posterior distribution is:
* Beta($\alpha+1$, $\beta$) if the trial succeeded (reward = 1)
* Beta($\alpha$, $\beta+1$) if the trial failed (reward = 0)
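
To see why the update takes this form, here is a short derivation sketch. It writes $r \in \{0, 1\}$ for the observed reward and $\mu$ for the arm mean, and drops normalizing constants:

$$
p(\mu \mid r) \;\propto\; \underbrace{\mu^{r}(1-\mu)^{1-r}}_{\text{Bernoulli likelihood}} \cdot \underbrace{\mu^{\alpha-1}(1-\mu)^{\beta-1}}_{\text{Beta}(\alpha,\,\beta)\text{ prior}} \;=\; \mu^{(\alpha+r)-1}(1-\mu)^{(\beta+1-r)-1}
$$

The right-hand side is the kernel of a Beta($\alpha+r$, $\beta+1-r$) distribution, which gives Beta($\alpha+1$, $\beta$) for $r=1$ and Beta($\alpha$, $\beta+1$) for $r=0$.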

The Thompson Sampling algorithm starts from a uniform prior on every arm: arm $i$ has prior Beta(1,1) on $\mu_i$, since Beta(1,1) is the uniform distribution on (0,1).

Define:
* $S_i(t)$ = number of successes (reward = 1) for arm $i$ up to time $t$
* $F_i(t)$ = number of failures (reward = 0) for arm $i$ up to time $t$

At time $t$, the posterior distribution on $\mu_i$ is Beta($S_i(t)+1$, $F_i(t)+1$). The algorithm draws one sample from each arm's posterior and plays the arm with the largest sample, so each arm is played with the posterior probability that its mean is the largest.
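
The idea that an arm is played with the probability that its mean is the largest can be checked numerically. The following is a minimal sketch using only Python's standard library. The helper `prob_arm_is_best` and the success/failure counts are made up for illustration and do not come from the paper.

```python
import random

def prob_arm_is_best(successes, failures, n_draws=20_000, seed=0):
    """Monte Carlo estimate of how often each arm has the largest posterior sample."""
    rng = random.Random(seed)
    n_arms = len(successes)
    wins = [0] * n_arms
    for _ in range(n_draws):
        # One draw from each arm's Beta(S_i + 1, F_i + 1) posterior
        thetas = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
        # Count the arm whose sample is largest in this draw
        wins[thetas.index(max(thetas))] += 1
    return [w / n_draws for w in wins]

# Illustrative counts (assumed): arm 0 looks good, arm 1 mediocre, arm 2 untried
successes = [8, 3, 0]
failures = [2, 3, 0]
play_probs = prob_arm_is_best(successes, failures)
print([round(p, 3) for p in play_probs])
```

The untried arm keeps its uniform Beta(1,1) posterior, so it still receives a non-trivial share of plays. This is how the sampling step explores without a separate exploration parameter.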

---
#### Algorithm 1: Thompson Sampling for Bernoulli Bandits
For each arm $i=1,...,N$, set $S_i=0$, $F_i=0$

For each $t=1,2,...$ do

* For each arm $i=1,...,N$, sample $\theta_i(t)$ from the Beta($S_i+1$, $F_i+1$) distribution
* Play arm $i(t) := \arg\max_i \theta_i(t)$ and observe reward $r_t$
* If $r_t=1$, then $S_{i(t)} = S_{i(t)}+1$, else $F_{i(t)} = F_{i(t)}+1$
---
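
The steps of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the paper's reference implementation. The function name `thompson_sampling_bernoulli`, the simulated arm means in `true_means`, and the horizon of 2000 rounds are all assumptions chosen for the example.

```python
import random

def thompson_sampling_bernoulli(true_means, horizon, seed=0):
    """Run Algorithm 1 against simulated Bernoulli arms.

    true_means is a hypothetical list of success probabilities, used only
    to simulate rewards; the algorithm itself never reads it directly.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    S = [0] * n_arms  # successes per arm
    F = [0] * n_arms  # failures per arm
    pulls = []        # arm played at each time step

    for _ in range(horizon):
        # Sample theta_i from Beta(S_i + 1, F_i + 1) for every arm
        thetas = [rng.betavariate(S[i] + 1, F[i] + 1) for i in range(n_arms)]
        # Play the arm with the largest sample
        arm = max(range(n_arms), key=lambda i: thetas[i])
        # Simulate a Bernoulli reward from the chosen arm
        reward = 1 if rng.random() < true_means[arm] else 0
        # Update only the played arm's counts
        if reward == 1:
            S[arm] += 1
        else:
            F[arm] += 1
        pulls.append(arm)
    return S, F, pulls

# Assumed arm means for the simulation
S, F, pulls = thompson_sampling_bernoulli([0.2, 0.5, 0.8], horizon=2000)
print("pulls per arm:", [pulls.count(i) for i in range(3)])
```

Only the played arm's counts change in each round, which matches the update rule for $S_{i(t)}$ and $F_{i(t)}$ in the last step of the algorithm.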

## Section 2: Thompson Sampling for General Stochastic Bandits <a name="section2"></a>