In [None]:
import numpy as np
import numpy.random as nprand
import matplotlib.pyplot as plt
import seaborn as sns

# Playing with k-Armed Bandits

In this notebook, we implement the algorithms presented in Chapter 2 of Sutton and Barto's *Reinforcement Learning: An Introduction* and try to reproduce its figures.

## Bandits 

The following classes can be used to generate bandits:

In [None]:
class Bandit():
    # a one-armed bandit that gives normally distributed rewards
    # with mean 'q_star' and standard deviation 'scale'
    def __init__(self, q_star, scale):
        self.mean = q_star
        self.scale = scale
        
    def reward(self):
        return nprand.normal(self.mean, self.scale)

In [None]:
class kBandits():
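    # a collection of k one-armed bandits whose true values q_star are drawn
    # from a normal distribution with mean 'mean' and standard deviation 'scale';
    # each arm then pays out rewards with standard deviation 1
    def __init__(self, k, mean, scale):
        self.bandits = [Bandit(nprand.normal(mean, scale), 1) for _ in range(k)]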

### Example

We create a 10-armed bandit:

In [None]:
K = kBandits(10, 0, 1)

We then generate 100 rewards per arm to plot a figure like Fig. 2.1:

In [None]:
# we add a slight jitter in x to our points, for a clearer plot
points = [[i + nprand.normal(0, 0.05), K.bandits[i].reward()] for i in range(10) for _ in range(100)]

In [None]:
x = [p[0] for p in points]
y = [p[1] for p in points]

In [None]:
plt.scatter(x, y)
plt.xlabel('Action')
plt.ylabel('Reward')
plt.title('Reward distribution for a 10-armed bandit')
plt.xlim([-1, 10])
plt.show()

## Exercise: implementing the first algorithm

To implement the first algorithm of the chapter, you'll need an argmax function that breaks ties randomly:

In [None]:
def randargmax(array):
    # your function here
    raise NotImplementedError

Hint: take a look at `numpy.where` and `numpy.random.choice`
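
If you get stuck, here is one possible sketch built on the two functions from the hint. The name `randargmax_sketch` is only illustrative, chosen so that it does not clash with your own `randargmax`; treat it as a starting point rather than a reference solution.

```python
import numpy as np

def randargmax_sketch(values):
    # illustrative helper: index of a maximal entry, ties broken uniformly at random
    values = np.asarray(values)
    # np.where on a 1-D boolean mask returns a 1-tuple holding the matching indices
    candidates = np.where(values == values.max())[0]
    return int(np.random.choice(candidates))

# with a tie between positions 0 and 2, either index may come back
print(randargmax_sketch([5, 1, 5]))
```

Comparing against `values.max()` with `==` is fine for value estimates that start out exactly equal (for example, all zeros), which is where ties matter in practice.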

You'll also need a function that decides whether the next action should be exploratory (random), with probability epsilon, or greedy:

In [None]:
def choose_at_random(epsilon):
    # your function here
    # should return 1 if exploration, 0 otherwise
    raise NotImplementedError
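
One way to make this decision, shown as a hedged sketch, is to compare a uniform random number with epsilon. The name `choose_at_random_sketch` is illustrative only.

```python
import numpy as np

def choose_at_random_sketch(epsilon):
    # np.random.random() draws uniformly from [0, 1), so the comparison
    # is true with probability epsilon
    return 1 if np.random.random() < epsilon else 0

# epsilon = 0 never explores, epsilon = 1 always explores
print(choose_at_random_sketch(0.0), choose_at_random_sketch(1.0))
```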

You're ready to implement the first algorithm!

In [None]:
def simple_algo(steps, epsilon):
    # your algorithm here
    # it should return an array of rewards, one per step
    raise NotImplementedError
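
For comparison with your own attempt, here is a minimal sketch of an epsilon-greedy agent with incremental sample-average estimates. It rests on several assumptions that are not part of the exercise statement: it takes an extra parameter `k`, it draws a fresh set of true values `q_star` on every call instead of reusing the `K` object above, and the name `simple_algo_sketch` is illustrative.

```python
import numpy as np

def simple_algo_sketch(steps, epsilon, k=10):
    # assumption: a new k-armed problem per call, true values drawn from N(0, 1)
    q_star = np.random.normal(0, 1, size=k)
    Q = np.zeros(k)  # current value estimate of each arm
    N = np.zeros(k)  # number of times each arm has been pulled
    rewards = np.zeros(steps)
    for t in range(steps):
        if np.random.random() < epsilon:
            # explore: pick any arm uniformly at random
            a = np.random.randint(k)
        else:
            # exploit: pick a greedy arm, breaking ties at random
            a = np.random.choice(np.where(Q == Q.max())[0])
        # reward is normally distributed around the arm's true value
        r = np.random.normal(q_star[a], 1)
        N[a] += 1
        # incremental update of the sample average
        Q[a] += (r - Q[a]) / N[a]
        rewards[t] = r
    return rewards

print(simple_algo_sketch(5, 0.1))
```

Drawing a new problem on each call means that averaging over many calls also averages over bandit problems, not only over the agent's randomness.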

### Figures

To reproduce Fig. 2.2, run the algorithm 2000 times for 1000 steps each, then average the 2000 rewards obtained at each step. Such operations are easy to manage with `numpy`:

In [None]:
n_episodes = 2000
n_steps = 1000

# exploration probability (the book's Fig. 2.2 compares 0, 0.01 and 0.1)
epsilon = 0.1

# allocate an uninitialised array of the desired shape
rewards = np.empty(shape=(n_episodes, n_steps))

# fill the array with the results of the algorithm
for i in range(n_episodes):
    # replace the ith row with the rewards of one run
    rewards[i, :] = simple_algo(n_steps, epsilon)

# then take the mean over the runs (axis 0) using array.mean
mean_reward = rewards.mean(axis=0)

You can now plot the result and compare it to the book's figure:

In [None]:
plt.plot(mean_reward)
plt.xlabel('Steps')
plt.ylabel('Average reward')
plt.show()

The lower part of Fig. 2.2 shows the percentage of optimal actions chosen by the algorithm. Modify your `simple_algo` to keep track of optimal actions and try to reproduce the book's bottom figure.
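
As a hint for the bookkeeping, the sketch below assumes you have stored the chosen action of every step in an integer array of shape `(n_runs, n_steps)` and the index of the best arm of each run in a second array. Both arrays and the helper name `percent_optimal` are hypothetical. For the `K` object above, the best arm would be `np.argmax([b.mean for b in K.bandits])`.

```python
import numpy as np

def percent_optimal(actions, best_actions):
    # actions: chosen arm per run and step, shape (n_runs, n_steps)
    # best_actions: index of the optimal arm of each run, shape (n_runs,)
    actions = np.asarray(actions)
    best_actions = np.asarray(best_actions)
    # broadcast the per-run optimum across the steps of that run
    hits = actions == best_actions[:, None]
    # fraction of runs that chose the optimal arm at each step, as a percentage
    return 100.0 * hits.mean(axis=0)

# toy data: 2 runs of 3 steps each
toy_actions = np.array([[0, 2, 2], [1, 1, 0]])
toy_best = np.array([2, 1])
pct = percent_optimal(toy_actions, toy_best)
print(pct)
```

The resulting array has one value per step and can be passed to `plt.plot` in the same way as the average reward.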