<a href="https://colab.research.google.com/github/jmhuer/utaustin_optimization/blob/main/homework11/Exp3_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exponential-Weight Algorithm for Exploration and Exploitation (EXP3)

In this exercise, we study the exponential-weight algorithm for exploration and exploitation (EXP3).

Although EXP3 is designed for the adversarial bandit setting, we test it in the Bernoulli bandit setting for ease of implementation.

As an example of a Bernoulli bandit, the following code creates a reward matrix `rewards` such that `rewards[t, a]` is the reward you receive if you pull arm $a$ at time $t$.

In [40]:
import numpy as np
import pdb
import matplotlib.pyplot as plt

class Adverserial_Arm:
  def __init__(self, num_arms):
    '''
    num_arms: (int). the number of arms; must be at least two.
    The reward sequence is not created here; call create_reward_seqeunce(N) before pulling.
    '''
    # Check the type first so that non-numeric input does not fail on the comparison
    if not isinstance(num_arms, int) or num_arms <= 1:
      raise ValueError('number of arms must be an int that is at least two')
    
    self.num_arms = num_arms
    self.reward_sequence = None  # (num_rounds, num_arms) reward matrix, set by create_reward_seqeunce
    # self.create_reward_seqeunce(N=10000)  # optionally create the reward sequence up front
    
    # keep track of the rewards for the user
    self.rewards_history = []
    # keep track of how many times the arms have been pulled
    self.total_pull = 0 
  def create_reward_seqeunce(self, N):
    # Pre-draw an N x num_arms matrix of Bernoulli rewards; arm a has mean 1 / (a + 1)
    num_actions = self.num_arms
    num_rounds = N
    expected_rewards = 1 / (np.arange(num_actions) + 1)
    # The comparison broadcasts the per-arm means over all rounds
    rewards = np.random.rand(num_rounds, num_actions) < expected_rewards
    self.reward_sequence = rewards.astype(float)
  def pull_arm(self, time_step, arm_id=-1, pull_time=1):
    if arm_id < 0 or arm_id >= self.num_arms:
      print('please specify arm id in the range of 0-%d' % (self.num_arms - 1))
      return
    assert (isinstance(pull_time, int) and pull_time >= 1)
    assert self.reward_sequence is not None, "please create the adversarial setting first, i.e. a sequence of rewards"
    # pull_time only advances the pull counter; a single reward is recorded per call
    self.total_pull += pull_time
    # Look up the pre-drawn reward for this round and arm
    reward = self.reward_sequence[time_step, arm_id]
    self.rewards_history.append(reward)
    return reward

  def genie_reward(self):
    '''
    Reward of an oracle that picks the best arm in each of the first
    self.total_pull rounds (the sum of the per-round maxima of the reward sequence).
    '''
    best_reward = np.sum(np.max(self.reward_sequence[0:self.total_pull], axis=1))
    return best_reward

  def my_rewards(self):
    return sum(self.rewards_history)

  def clear_reward_hist(self):
    self.rewards_history = []
    self.total_pull = 0


In [42]:
# Quick test to check that the arm works as expected
NUM_ARMS = 2
my_arm = Adverserial_Arm(num_arms = NUM_ARMS)
my_arm.create_reward_seqeunce(N=5)
print(my_arm.reward_sequence)

my_arm.pull_arm(time_step=0,arm_id=1)
my_arm.pull_arm(time_step=2,arm_id=1)

print("I received: ", my_arm.my_rewards())
print("genie received: ", my_arm.genie_reward())

[[1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 0.]
 [1. 1.]]
I received:  2.0
genie received:  2.0



## Goal of these exercises

Implement the following:

1. Basic EXP3 algorithm implementation under the Bernoulli bandit setting.
2. Plot the expected regret of EXP3 versus horizon (number of rounds).
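
For reference, one common textbook formulation of EXP3 (an assumption here; the assignment does not fix a particular variant) keeps an estimated cumulative reward $\hat{S}_{t,i}$ for each of the $k$ arms, starting from $\hat{S}_{0,i} = 0$. In round $t$ it samples arm $A_t$ from $P_t$, observes the reward $X_t \in [0, 1]$, and updates every arm:

$$
P_{t,i} = \frac{\exp(\eta \hat{S}_{t-1,i})}{\sum_{j=1}^{k} \exp(\eta \hat{S}_{t-1,j})},
\qquad
\hat{S}_{t,i} = \hat{S}_{t-1,i} + 1 - \frac{\mathbb{1}\{A_t = i\}\,(1 - X_t)}{P_{t,i}}
$$

Here $\eta > 0$ is the learning rate. Dividing by $P_{t,i}$ is what makes the estimate unbiased: arms that are rarely pulled receive a proportionally larger correction when they are pulled. In this formulation, choosing $\eta = \sqrt{\log(k) / (nk)}$ for horizon $n$ gives a regret bound of order $\sqrt{n k \log k}$, so the regret grows like the square root of the horizon.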

Optional:

1. Plot the expected regret of EXP3 versus the number of arms.
2. Implement an adversarial bandit, and test the EXP3 algorithm on it.
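
As a starting point for the optional adversarial bandit, here is a minimal sketch of a reward sequence that is fixed in advance and has no stochastic structure. The helper name `switching_reward_sequence` and its `switch_at` parameter are hypothetical and only serve as an illustration; the best arm changes once, which is one simple way to make the environment non-stationary.

```python
import numpy as np

def switching_reward_sequence(num_rounds, num_arms, switch_at):
    """Hypothetical helper: deterministic rewards whose paying arm changes once.

    Arm 0 pays 1 in rounds [0, switch_at); the last arm pays 1 afterwards.
    All other entries are 0. Returns an array of shape (num_rounds, num_arms).
    """
    rewards = np.zeros((num_rounds, num_arms))
    rewards[:switch_at, 0] = 1.0
    rewards[switch_at:, num_arms - 1] = 1.0
    return rewards

def best_fixed_arm_reward(rewards):
    """Total reward of the single best arm in hindsight (the usual EXP3 comparator)."""
    return rewards.sum(axis=0).max()

seq = switching_reward_sequence(num_rounds=10, num_arms=3, switch_at=4)
```

A matrix like `seq` has the same `(rounds, arms)` layout as the `reward_sequence` attribute used above, so it could be assigned to that attribute in place of the Bernoulli draws.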

## Tips:

1. To check that the regret is correct, run your EXP3 algorithm repeatedly with horizons $[50^2, 60^2, 70^2, 80^2, 90^2, 100^2]$. Plot the regret (y-axis) against $[50, 60, 70, 80, 90, 100]$ (x-axis). Because the regret of EXP3 grows on the order of the square root of the horizon, the figure should look like a straight line.
2. See `numpy.random.choice` for drawing from a discrete distribution.
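
The second tip can be sketched as follows. This is only an illustration, not a required interface: the function name `sample_arm` and its arguments are hypothetical. It turns a vector of scores into a probability vector with a softmax and draws a single arm index with `numpy.random.choice`.

```python
import numpy as np

def sample_arm(scores, eta):
    """Hypothetical helper: softmax over eta * scores, then draw one arm index."""
    z = eta * np.asarray(scores, dtype=float)
    z = z - z.max()                    # shifting by the max avoids overflow in exp
    prob = np.exp(z) / np.exp(z).sum()
    # With size omitted, np.random.choice returns a single index drawn with probabilities p
    arm = int(np.random.choice(len(prob), p=prob))
    return arm, prob

arm, prob = sample_arm([0.0, 0.0, 0.0], eta=0.1)   # equal scores give a uniform distribution
```

Subtracting the maximum before exponentiating leaves the probabilities unchanged but keeps the computation finite when the scores grow large over many rounds.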

In [68]:


def exp3(arm, N, num_arms=NUM_ARMS, n_rate=0.01):
  S = np.zeros(num_arms)  # estimated cumulative reward of each arm
  for i in range(N):
      # Sampling distribution: softmax of n_rate * S (shifted by the max for numerical stability)
      weights = np.exp(n_rate * (S - np.max(S)))
      prob = weights / np.sum(weights)
      action = int(np.random.choice(len(prob), p=prob))  # sample one arm
      reward = arm.pull_arm(time_step=i, arm_id=action)
      S += 1  # add 1 to every arm
      # Importance-weighted loss of the pulled arm: divide by its sampling probability
      S[action] -= (1 - reward) / prob[action]
  return arm.my_rewards()



In [67]:

def regret_vs_horizon(Ns: list, REPEAT: int, algorithm):
  '''Average regret of `algorithm` over REPEAT runs, for each horizon in Ns.'''
  regret = []
  my_arm = Adverserial_Arm(NUM_ARMS)
  for NUM_RUNs in Ns:
    print(NUM_RUNs)
    # One reward sequence per horizon, shared by all repeats
    my_arm.create_reward_seqeunce(N=NUM_RUNs)
    cur_regret = 0
    for repeat in range(REPEAT):
        # Total reward collected by the algorithm over NUM_RUNs rounds
        rewards = algorithm(my_arm, NUM_RUNs, my_arm.num_arms)
        cur_regret += my_arm.genie_reward() - rewards
        my_arm.clear_reward_hist()
    cur_regret /= REPEAT
    regret.append(cur_regret)
  return regret


import plotly.graph_objects as graph
def plot(all_history:list, title:str, log = False):
    """
    input:
        all_history: list of dicts to plot
    ret:
        None: show plotly fig
    """
    fig = graph.Figure(layout = graph.Layout(title=graph.layout.Title(text=title))) 
    for i in range(len(all_history)):
        fig.add_trace(graph.Scatter(x = all_history[i]["x"], 
                                    y = all_history[i]["y"],
                                    name = all_history[i]["legend"])) 
    if log: fig.update_xaxes(type="log")
    fig.show()



In [72]:
NUM_ARMS = 10
x_axis  = [50,60,70,80,90,100]
Ns = np.power(x_axis,2)

# Store the result under its own name so that the exp3 function is not overwritten
exp3_regret = regret_vs_horizon(Ns, REPEAT=100, algorithm=exp3)


plot_exp3 = {"legend": "mean_exp3_regret", 
                   "x": x_axis , 
                   "y": exp3_regret}

plot([plot_exp3], title="regret VS horizon - linear" , log = False)
# plot([plot_exp3], title="regret VS horizon - Log" , log = True)


2500
3600
4900
6400
8100
10000
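
To judge the "straight line" tip numerically rather than by eye, one option is a least-squares line fit of the averaged regret against the square-root horizons. The sketch below is an assumption about how such a check could look; `linear_fit_r2` is a hypothetical helper, and the values passed to it here are synthetic, not results from this notebook.

```python
import numpy as np

def linear_fit_r2(x, y):
    """Hypothetical helper: fit y ~ slope * x + intercept and return (slope, intercept, R^2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)        # degree-1 least-squares fit
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Synthetic, perfectly linear data, used only to show the call
xs = [50, 60, 70, 80, 90, 100]
ys = [3.0 * v + 2.0 for v in xs]
slope, intercept, r2 = linear_fit_r2(xs, ys)
```

Passing the square-root horizons and the averaged regret values to this function gives an $R^2$ close to 1 when the points lie near a line; noisy averages from few repeats will lower it.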
