<a href="https://colab.research.google.com/github/jmhuer/utaustin_optimization/blob/main/homework10/UCB_fix_horizon_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Upper Confidence Bound (fixed horizon)

In this excercise, we will be playing with the Multi-arm bandit problem with the UCB algorithm.

## Setup

As before, consider unstructural bandit problem. Suppose we have $k$ arms, each with random rewards $p_i = u_i + \epsilon$ where $\epsilon$ is draw from i.i.d. standard gaussian. (Note that we only require $\epsilon$ to be sub-gaussian for the analysis to go through)

The following codes is capturing the setup.

In [3]:
import numpy as np
import pdb
import matplotlib.pyplot as plt

class Gaussian_Arm:
  def __init__(self, num_arms, mu=None):
    '''
    num_arms: (int). the number of arms
    mu: (None or list-type). the mean of the reward of each arm.
        if set to None, a random vector will be generated.
    '''
    if num_arms <= 1 or not isinstance(num_arms, int):
      print('number of arms has an int that is at least two')
      return
    
    self.num_arms = num_arms
    #
    if mu:
      self.mu = np.asarray(mu)
      if len(self.mu) != num_arms:
        print('The lenth of mu does not match the number of arms')
        return
    else:
      self.mu = np.random.rand((num_arms))
    # 
    self.delta = max(self.mu) - min(self.mu)
    #

    # keep track of the rewards for the user
    self.rewards_history = []
    # keep track of how many times the arms have been pulled
    self.total_pull = 0 

  def pull_arm(self, arm_id=-1, pull_time=1):
    if arm_id < 0 or arm_id >= self.num_arms:
      print('please specify arm id in the range of 0-%d' % (self.num_arms))
      return
    assert (isinstance(pull_time, int) and pull_time >= 1)
    self.total_pull += pull_time
    # Generate reward
    reward = self.mu[arm_id] * pull_time + sum(np.random.randn(pull_time))
    self.rewards_history.append(reward)
    return reward


  def genie_reward(self):
    '''
    the best expected reward after pulling self.total_pull times
    '''
    best_mu = max(self.mu)
    return self.total_pull * best_mu

  def my_rewards(self):
    return sum(self.rewards_history)

  def clear_reward_hist(self):
    self.rewards_history = []
    self.total_pull = 0



## Goal of this exercise
1. Basic UCB algorithm implementation
2. Plot the regret VS horizon ($n$).

Please refer to previous assignment for tips.


In [51]:


NUM_ARMS = 2

class Empirical_mean:
    def __init__(self,num_arms):
        self.rewards = np.zeros(num_arms)
        self.count = np.zeros(num_arms)
        self.mean =  np.zeros(num_arms)
    def add_new_reward(self, reward, indx):
        self.rewards[indx] += reward
        self.count[indx] += 1
        self.mean[indx] = self.rewards[indx] / self.count[indx]
    def mean(self, indx):
        return self.mean[indx]


def UCB(arm, N, num_arms=NUM_ARMS):
  '''
  arm: Arm object.
  N: the ultimate time horizon
  delta: separation
  m: int. The exploration time for EACH arm
  num_arms: int. The total number of arms
  '''
  bonus = lambda sig,t: np.sqrt((2*np.log(1/sig))/t) if t > 0 else float('inf')
  delta = lambda t: 1/N**2 if t > 0 else float('inf') 
  mean_vals = Empirical_mean(num_arms)
  for i in range(N):
      UCBbest_arm = np.argmax([u + bonus(delta(t), t) for (u,t) in zip(mean_vals.mean, mean_vals.count)])
      mean_vals.add_new_reward(arm.pull_arm(UCBbest_arm), UCBbest_arm)
  return arm.my_rewards()



In [55]:

def regret_vs_horizon(Ns:list, REPEAT:int, algorithm: type(lambda x: None)):
  regret = []
  mu = [0.1, 0.0]
  my_arm = Gaussian_Arm(NUM_ARMS, mu=mu)
  for NUM_RUNs in Ns:
    print(NUM_RUNs)
    cur_regret = 0
    for repeat in range(REPEAT):
      rewards = algorithm(my_arm, NUM_RUNs, my_arm.num_arms) ## everyrun NUM_RUNs += 10000
      cur_regret += my_arm.genie_reward() - rewards
      my_arm.clear_reward_hist()
      #
    cur_regret /= REPEAT
    regret.append(cur_regret)
    #
  return regret


import plotly.graph_objects as graph
def plot(all_history:list, xname:str, yname:str ,title:str, log = False):
    """
    input:
        all_history: list of dicts to plot
    ret:
        None: show plotly fig
    """
    fig = graph.Figure(layout = graph.Layout(title=graph.layout.Title(text=title))) 
    for i in range(len(all_history)):
        fig.add_trace(graph.Scatter(x = all_history[i][xname], 
                                    y = all_history[i][yname],
                                    name = all_history[i]["legend"])) 
    if log: fig.update_xaxes(type="log")
    fig.show()



In [45]:


Ninit = 2000
Ns  = [Ninit * i for i in range(1, 20)]
avg = 25
mean_UCB_regret = np.zeros((avg , len(Ns))) ##initialize mean arrays
for i in range(avg):
  print("simulation {} completed".format(i))
  UCB_regret = regret_vs_horizon(Ns)
  mean_UCB_regret[i] = UCB_regret


simulation 0 completed
simulation 1 completed
simulation 2 completed
simulation 3 completed
simulation 4 completed
simulation 5 completed
simulation 6 completed
simulation 7 completed
simulation 8 completed
simulation 9 completed
simulation 10 completed
simulation 11 completed
simulation 12 completed
simulation 13 completed
simulation 14 completed
simulation 15 completed
simulation 16 completed
simulation 17 completed
simulation 18 completed
simulation 19 completed
simulation 20 completed
simulation 21 completed
simulation 22 completed
simulation 23 completed
simulation 24 completed


In [None]:

Ninit = 10000
Ns  = [Ninit * i for i in range(1, 10)]
UCB_regret = regret_vs_horizon(Ns, REPEAT=100, algorithm=UCB)


10000
20000
30000
40000
50000
60000


In [53]:

mean_UCB_regret = {"legend": "mean_UBC_regret", 
                   "x": Ns , 
                   "y": UCB_regret}

plot([mean_UCB_regret], xname="x", yname="y" , title="regret VS horizon - linear" , log = False)
plot([mean_UCB_regret], xname="x", yname="y" , title="regret VS horizon - Log" , log = True)
