Recall the four actions that Alex explained in the video.
- Action 1: Reward is always $8$
- Action 2: Reward is $\begin{cases}0 & \text{ w.p. } 88\% \\ 100 & \text{ w.p. } 12\% \end{cases}$
- Action 3: Reward is $\sim \mathbb{U}[-10, 35]$
- Action 4: Reward is $\begin{cases}0 & \text{ w.p. } 33.3\% \\ 20 & \text{ w.p. } 33.3\%  \\ \sim \mathbb{U}[8,18] & \text{ w.p. } 33.3\% \end{cases}$

First, recall that in reality we do not know these distributions; instead, we "pull" an arm and observe one sample from the corresponding distribution.

Therefore, let us generate 10,000 samples from each distribution. In other words, these 10,000 values will be what we would have seen if we pulled the arm of each slot machine 10,000 times.

In [1]:
import numpy as np

In [2]:
def action_1(): #first slot machine
    return 8

def action_2(): #second slot machine
    if np.random.uniform(0, 1) < 0.88: #wp 88% return 0
        return 0
    return 100

def action_3():
    return np.random.randint(-10, 35+1) #assuming integer values. Adding 1 because randint has upper bound excluded

def action_4():
    u = np.random.uniform(0, 1)
    if u < 1/3:
        return 0
    elif u < 2/3:
        return 20
    return np.random.randint(8,18+1)

In [3]:
slot_1 = [action_1() for _ in range(10**4)]
slot_2 = [action_2() for _ in range(10**4)]
slot_3 = [action_3() for _ in range(10**4)]
slot_4 = [action_4() for _ in range(10**4)]
slots = np.array([slot_1, slot_2, slot_3, slot_4])

In [4]:
slots[:, 5] #e.g., the 6-th pull, if we pulled one of each. But we will not see them, rather pick only one of them.

array([ 8,  0, 35, 17])

Now let us simulate the $\epsilon$-greedy approach. Initialize the number of times each slot has been pulled (`n`), as well as the estimated value of each slot (`q`). We will illustrate one step of the algorithm. Then, try to implement this in a for-loop and return the final set of decisions and results.

In [5]:
eps = 0.1
k = 4 #number of arms
q = np.zeros(k)
n = np.zeros(k)

In [6]:
pull = 0#first pull
if np.random.uniform(0,1) < eps: #this means we will EXPLORE
    idx = np.random.randint(0, k) #random pull; indices are 0-based and randint excludes the upper bound
else: #then EXPLOIT the best
    idx = np.argmax(q) #now they are all zeros so doesn't make much sense, but in the next steps this will make sense!
n[idx] += 1#we pulled one more time the "idx" slot
r = slots[idx, pull] #first pull from the "idx" slot
# this is a standard form for learning/update rules (*)
q[idx] += (r - q[idx])/n[idx] #update the average value
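
The update rule marked with (*) keeps a running average without storing past rewards: if $q$ is the mean of the first $n-1$ rewards, then $q + (r - q)/n$ is the mean after the $n$-th reward $r$ arrives. The following minimal sketch checks this against `np.mean`; the list `rewards` and the names `q_running` and `count` are illustrative assumptions, not part of the notebook.

```python
import numpy as np

# Hypothetical rewards observed from a single arm (illustrative values only).
rewards = [8.0, 0.0, 100.0, 12.0, 20.0]

q_running = 0.0  # running estimate, plays the role of q[idx]
count = 0        # number of pulls so far, plays the role of n[idx]
for r in rewards:
    count += 1
    q_running += (r - q_running) / count  # same incremental update as (*)

# The running estimate agrees with the plain sample mean.
assert np.isclose(q_running, np.mean(rewards))
print(q_running, np.mean(rewards))
```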

A very simple "for-loop" can be implemented as:

In [7]:
eps = 0.1
k = 4 #number of arms
q = np.zeros(k)
n = np.zeros(k)

In [8]:
for pull in range(10**4):
    if np.random.uniform(0,1) < eps: #this means we will EXPLORE
        idx = np.random.randint(0, k) #random pull
    else: #then EXPLOIT the best
        idx = np.argmax(q) #pick the slot with the highest estimated value (ties go to the lowest index)
    n[idx] += 1 #we pulled the "idx" slot one more time
    r = slots[idx, pull] #reward of the current pull from the "idx" slot
    # this is a standard form for learning/update rules (*)
    q[idx] += (r - q[idx])/n[idx] #update the average value

In [9]:
print(int(np.sum(n)), "many rounds are completed.")
print("number of times each machine is pulled", n)
print("values:", q)

10000 many rounds are completed.
number of times each machine is pulled [ 288.  232. 9211.  269.]
values: [ 8.          9.9137931  12.54782325 11.20446097]


Now, compare these values with the expected rewards from the true distribution of each slot machine. Are they close? 

*Hint: The 4-th expected reward would be computed as:
$$\frac{1}{3}0 + \frac{1}{3} 20 + \frac{1}{3}\mathbb{E}[\mathbb{U}[8,18]] =  \frac{1}{3}0 + \frac{1}{3} 20 + \frac{1}{3}13 = 11.$$*
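
The same calculation can be carried out for all four actions. The sketch below computes the expected rewards directly from the stated distributions, using the fact that a uniform distribution on $[a, b]$ has mean $(a+b)/2$ (the integer version sampled with `randint` above has the same mean). The dictionary name `expected` is an assumption made for illustration.

```python
# Expected reward of each action, computed from the stated distributions.
# For a uniform distribution on [a, b] the mean is (a + b) / 2.
expected = {
    "action_1": 8.0,                          # constant reward
    "action_2": 0.88 * 0 + 0.12 * 100,        # two-point distribution
    "action_3": (-10 + 35) / 2,               # U[-10, 35]
    "action_4": (0 + 20 + (8 + 18) / 2) / 3,  # equal mixture of 0, 20 and U[8, 18]
}

for name, value in expected.items():
    print(name, round(value, 4))
```

These numbers can be placed next to the learned `q` values; arms that were pulled only a few hundred times can be expected to have noisier estimates than the arm that was pulled most often.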

Before you look at the main notebook of this module (where you will see further tricks and visualizations), try to implement the following:
- Compute the total reward amount over all the $10^4$ rounds.
- Currently, we take the argmax in the first step. Since all the values are initialized to $0$, the argmax always starts with the first slot machine. Try starting with different machines and check whether this changes the total reward.
- Compare the total reward of this approach $\epsilon = 0.1$ with different values $\epsilon = 0, 0.05, 0.15, 0.5$.
- Visualize the average reward per round over time.
- Finally: Note that in the beginning we sampled $10^4$ times from each slot machine, so overall we sampled $4 \times 10^4$ times. This was done for simplicity of presentation. It is not very efficient, because in reality each step uses only one reward, namely that of the arm we pull. Try to sample once in every "pull" instead of first sampling everything and then picking.
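
As a starting point for these exercises, here is one possible sketch that samples a single reward per pull and records every reward, so that totals for different $\epsilon$ can be compared. It is not the main notebook's implementation: the helper names `sample_reward` and `run_eps_greedy`, the use of `np.random.default_rng`, and the seed are all assumptions made for illustration, and integer rewards are assumed for the uniform parts, as in the cells above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily, only for reproducibility

def sample_reward(arm):
    """Draw one reward from the given arm (0-based), mirroring action_1..action_4."""
    if arm == 0:
        return 8
    if arm == 1:
        return 0 if rng.uniform(0, 1) < 0.88 else 100
    if arm == 2:
        return rng.integers(-10, 35 + 1)  # upper bound is excluded
    u = rng.uniform(0, 1)
    if u < 1/3:
        return 0
    if u < 2/3:
        return 20
    return rng.integers(8, 18 + 1)

def run_eps_greedy(eps, rounds=10**4, k=4):
    """Run epsilon-greedy, sampling only the pulled arm in each round."""
    q = np.zeros(k)           # estimated value of each arm
    n = np.zeros(k)           # number of pulls of each arm
    rewards = np.zeros(rounds)
    for pull in range(rounds):
        if rng.uniform(0, 1) < eps:  # explore
            idx = rng.integers(0, k)
        else:                        # exploit
            idx = np.argmax(q)
        r = sample_reward(idx)       # one sample per pull
        n[idx] += 1
        q[idx] += (r - q[idx]) / n[idx]
        rewards[pull] = r
    return q, n, rewards

for eps in [0, 0.05, 0.1, 0.15, 0.5]:
    q, n, rewards = run_eps_greedy(eps)
    print("eps =", eps, "total reward =", rewards.sum(), "pulls =", n)
```

The running average `np.cumsum(rewards) / np.arange(1, len(rewards) + 1)` is one quantity that could be plotted for the visualization exercise. Note that with $\epsilon = 0$ this sketch never leaves the first machine, because its estimate becomes $8$ after the first pull while all the others stay at $0$.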

If you find interesting results, please share them with us :)!