# Thompson Sampling

Thompson sampling is an ingenious algorithm that implicitly balances exploration and exploitation based on quality and uncertainty. Let's say we sample a 3-armed bandit and model the probability that each arm gives us a positive reward. The goal is of course to maximize our rewards by pulling the most promising arm. Assume that at the current timestep arm-3 has a mean reward of 0.9 over 800 pulls, arm-2 has a mean reward of 0.8 over 800 pulls, and arm-1 has a mean reward of 0.78 over 10 pulls. So far, arm-3 is clearly the best. But if we were to explore, would we choose arm-2 or arm-1? An $\epsilon$-greedy algorithm would, with probability $\epsilon$, be just as likely to choose arm-3, arm-2, or arm-1. However, arm-2 has been examined many times, as many as arm-3, and has a lower mean reward than arm-3. Selecting arm-2 seems like a wasteful exploratory action. Arm-1, however, has a lower mean reward than either arm-2 or arm-3, but has only been pulled a few times. In other words, arm-1 has a higher chance than arm-2 of being a better action than arm-3, since we are more uncertain about its true value. The $\epsilon$-greedy algorithm completely misses this point. Thompson sampling, on the other hand, incorporates uncertainty by modelling each arm's Bernoulli parameter with a beta distribution, starting from a prior that is updated as rewards are observed.
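
To make this intuition concrete, the sketch below estimates by Monte Carlo how often each trailing arm would beat arm-3 when every arm is described by a beta distribution. The uniform Beta(1, 1) prior, the conversion of mean rewards into pseudo-counts (which gives a non-integer count for arm-1), the seed, and the number of samples are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

# (mean reward, pulls) for each arm, taken from the example above
arms = {"arm-1": (0.78, 10), "arm-2": (0.8, 800), "arm-3": (0.9, 800)}

n_samples = 100_000
samples = {}
for name, (mean, pulls) in arms.items():
    successes = mean * pulls      # pseudo-count, may be non-integer
    failures = pulls - successes
    # Assumed Beta(1, 1) prior, so the posterior is Beta(1 + s, 1 + f)
    samples[name] = rng.beta(1 + successes, 1 + failures, size=n_samples)

# Fraction of draws in which the trailing arm looks better than arm-3
p_arm1_beats_arm3 = np.mean(samples["arm-1"] > samples["arm-3"])
p_arm2_beats_arm3 = np.mean(samples["arm-2"] > samples["arm-3"])
print(f"P(arm-1 > arm-3) ~ {p_arm1_beats_arm3:.4f}")
print(f"P(arm-2 > arm-3) ~ {p_arm2_beats_arm3:.4f}")
```

Under these assumptions the rarely pulled arm-1 should come out ahead of arm-3 noticeably more often than the heavily sampled arm-2 does, even though its mean reward is lower.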

The beauty of the algorithm is that it always chooses the action with the highest estimated reward, with the twist that each estimate is a random draw from a distribution that reflects our uncertainty about that action. It is in fact a Bayesian approach to the bandit problem. In our Bernoulli bandit setup, each action $k$ returns a reward of 1 with probability $\theta_k$, and 0 with probability $1-\theta_k$. At the beginning of a simulation, each $\theta_k$ is sampled from a uniform distribution, $\theta_k \sim \text{Uniform}(0,1)$, and is held constant for the rest of that simulation (in the stationary case). The agent begins with a prior belief about the reward probability of each arm $k$, modelled as a beta distribution with $\alpha_k = \beta_k = 1$. The prior probability density of each $\theta_k$ is:

$$
p(\theta_k) = \frac{\Gamma(\alpha_k + \beta_k)}{\Gamma(\alpha_k)\Gamma(\beta_k)} \theta_k^{\alpha_k -1} (1-\theta_k)^{\beta_k-1}
$$
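
As a rough illustration of this density, the sketch below evaluates it directly from the formula using the standard library's `math.gamma`. The helper name `beta_pdf` and the example parameter values are assumptions, not part of the original notebook.

```python
from math import gamma

def beta_pdf(theta, alpha, beta):
    """Density of Beta(alpha, beta) at theta, for 0 < theta < 1."""
    norm = gamma(alpha + beta) / (gamma(alpha) * gamma(beta))
    return norm * theta ** (alpha - 1) * (1 - theta) ** (beta - 1)

# With alpha = beta = 1 the density is flat: every theta is equally plausible.
print(beta_pdf(0.2, 1, 1), beta_pdf(0.9, 1, 1))

# With larger alpha than beta (here 9 and 3, chosen arbitrarily),
# the mass shifts towards high values of theta.
print(beta_pdf(0.2, 9, 3), beta_pdf(0.9, 9, 3))
```
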
An action is chosen by first sampling an estimate $\hat{\theta}_k$ for every arm from its beta distribution, and then choosing the action with the highest sampled value:$$
x_t = \text{argmax}_k (\hat{\theta}_k), \quad \hat{\theta}_k \sim \text{beta}(\alpha_k, \beta_k)
$$
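
The selection rule can be sketched in a few lines. The function name `select_arm` and the three sets of beta parameters are hypothetical: arm 0 and arm 1 share the same mean (0.5) but differ in how often they have been pulled, while arm 2 has a higher mean (0.6).

```python
import numpy as np

def select_arm(alpha, beta, rng):
    """Draw one sample per arm from Beta(alpha_k, beta_k) and return the argmax."""
    theta_hat = rng.beta(alpha, beta)
    return int(np.argmax(theta_hat))

rng = np.random.default_rng(0)          # fixed seed, chosen arbitrarily
alpha = np.array([2.0, 200.0, 300.0])   # hypothetical parameters
beta = np.array([2.0, 200.0, 200.0])

# Repeat the selection many times to see how often each arm is picked
choices = [select_arm(alpha, beta, rng) for _ in range(10_000)]
counts = np.bincount(choices, minlength=3)
print(counts)
```

With these made-up numbers, the uncertain arm 0 should be selected far more often than the well-explored arm 1, although both have the same mean. That is the kind of directed exploration that $\epsilon$-greedy lacks.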

According to Bayes' rule, the posterior distribution of the selected action $x_t$ is updated depending on the reward $r_t$ received:$$
(\alpha_{x_t}, \beta_{x_t}) \leftarrow (\alpha_{x_t}, \beta_{x_t}) + (r_t, 1-r_t)
$$
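
A minimal worked example of this update, assuming a single arm, a uniform prior, and a made-up reward sequence (the helper `update_posterior` is hypothetical):

```python
def update_posterior(alpha, beta, reward):
    """Conjugate Beta-Bernoulli update for a single observed reward in {0, 1}."""
    return alpha + reward, beta + (1 - reward)

alpha, beta = 1, 1                 # uniform prior
for reward in [1, 1, 0, 1]:        # hypothetical reward sequence
    alpha, beta = update_posterior(alpha, beta, reward)

print(alpha, beta)                 # 4 2
print(alpha / (alpha + beta))      # posterior mean of theta
```

Each success increments $\alpha$ and each failure increments $\beta$, so the two parameters act as running counts of successes and failures on top of the prior.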

Thus the actions' posterior distributions are constantly updated throughout the simulation. We will evaluate the Thompson algorithm by comparing it with the $\epsilon$-greedy and Upper Confidence Bound (UCB) algorithms using regret. The per-period regret for the Bernoulli bandit problem is the difference between the mean reward of the optimal action and the mean reward of the selected action:$$
\text{regret}_t(\theta) = \max_k \theta_k - \theta_{x_t}
$$
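
As a small numerical illustration, the sketch below computes the per-period and cumulative regret for a made-up set of true probabilities and a made-up sequence of selected arms. Both arrays are assumptions for the example only.

```python
import numpy as np

thetas = np.array([0.78, 0.8, 0.9])    # hypothetical true success probabilities
actions = np.array([0, 1, 2, 2, 2])    # hypothetical sequence of selected arms

# regret_t = max_k theta_k - theta_{x_t}
per_period_regret = thetas.max() - thetas[actions]
cumulative_regret = np.cumsum(per_period_regret)
print(per_period_regret)
print(cumulative_regret)
```

Once the agent settles on the best arm, the per-period regret drops to zero and the cumulative regret stops growing.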

First we set up the necessary imports and the standard k-armed bandit. The `get_reward_regret` method samples the reward for the given action and returns it together with the regret relative to the true best action.


In [0]:
import numpy as np
import matplotlib.pyplot as plt
from pdb import set_trace

stationary = True
class Bandit:
    def __init__(self, arm_count):
        """
        Multi-armed bandit with rewards 1 or 0.
        At initialization, multiple arms are created. Pulling arm k returns a reward
        sampled from Bernoulli(theta_k), where each theta_k is drawn from Uniform(0, 1)
        at initialization.
        """
        self.arm_count = arm_count
        self.generate_thetas()
        self.timestep = 0
        # Read the module-level flag once, when the bandit is created
        self.stationary = stationary

    def generate_thetas(self):
        self.thetas = np.random.uniform(0, 1, self.arm_count)

    def get_reward_regret(self, arm):
        """
        Returns a random reward and the regret for pulling the given arm.
        Assumes arms are zero-indexed.
        Args:
            arm is an int
        """
        self.timestep += 1
        # In the non-stationary case, redraw the thetas every 100 timesteps
        if (not self.stationary) and (self.timestep % 100 == 0):
            self.generate_thetas()
        # Simulate Bernoulli sampling
        sim = np.random.uniform(0, 1, self.arm_count)
        rewards = (sim < self.thetas).astype(int)
        reward = rewards[arm]
        regret = self.thetas.max() - self.thetas[arm]

        return reward, regret

We implement the two beta algorithms from [1], although we focus only on the Thompson algorithm. For the Bernoulli-greedy algorithm, the estimated Bernoulli parameters are the expected values of the beta distributions, i.e.:$$
\hat{\theta}_k = \mathbb{E}[\theta_k] = \frac{\alpha_k}{\alpha_k + \beta_k}
$$
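
For comparison with the sampling step of Thompson, the greedy estimate can be sketched as follows. The function name `greedy_estimates` and the parameter values are hypothetical.

```python
import numpy as np

def greedy_estimates(alpha, beta):
    """Expected value of each Beta(alpha_k, beta_k): alpha_k / (alpha_k + beta_k)."""
    return alpha / (alpha + beta)

alpha = np.array([2.0, 300.0])   # hypothetical parameters: arm 0 barely explored
beta = np.array([2.0, 200.0])

estimates = greedy_estimates(alpha, beta)
greedy_choice = int(np.argmax(estimates))
print(estimates, greedy_choice)
```

Because the greedy rule uses only the means, it ignores how wide each distribution is, so with these numbers it would pick arm 1 every time until the parameters change.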

The Thompson algorithm follows the pseudocode below, based on [1]:

Algorithm: Thompson($K$,$\alpha$, $\beta$)
<br>for $t$ = 1,2, ..., do<br>
   &emsp;// sample action parameter from beta distribution<br>
   &emsp;for $k = 1, \dots, K$ do<br>
      &emsp;&emsp;Sample $\hat{\theta}_k \sim \text{beta}(\alpha_k, \beta_k)$<br>
   &emsp;end for<br>
   
   &emsp;// select action, get reward<br>
   &emsp;$x_t \leftarrow \text{argmax}_k \hat{\theta}_k$<br>
   &emsp;$r_t \leftarrow \text{observe}(x_t)$<br>

   &emsp;// update beta parameters<br>
   &emsp;$(\alpha_{x_t}, \beta_{x_t}) \leftarrow (\alpha_{x_t}, \beta_{x_t})+(r_t, 1-r_t)$<br>
end for

In [0]:
# Algorithm: Thompson(K, alpha, beta)
# for t = 1, 2, ... do
#    // sample action parameter from beta distribution
#    for k = 1, ..., K do
#       sample theta_hat_k ~ Beta(alpha_k, beta_k)
#    end for
#
#    // select action, get reward
#    x_t <- argmax_k theta_hat_k
#    r_t <- observe(x_t)
#
#    // update beta parameters
#    (alpha_{x_t}, beta_{x_t}) <- (alpha_{x_t}, beta_{x_t}) + (r_t, 1 - r_t)
# end for