# N-Armed Bandit Exercise

Create the most basic reinforcement learning agents, which learn to obtain the most reward from pulling the arms of a slot machine (a one-armed bandit) and receiving some previously unknown reward.

In [1]:
import numpy as np
import matplotlib.pyplot as plt

<h1> Environment </h1>

---

The environment lets us access the reward for pulling a single arm while keeping the range of rewards hidden from our agents.

Upon pulling an arm, an updated state of the environment and the corresponding reward will be returned. 

The environment is kept static throughout the experiment, as the arms return to their initial position after each agent action.

In [3]:
class Environment():
    
    def __init__(self, num_arms, arms=None):
        self.num_arms = num_arms
        
        #set lever rewards
        self.arms = arms if arms is not None else self.init_rewards()
        
    def init_rewards(self):
        return np.random.normal(0, 1, self.num_arms)
    
    def get_state(self):
        return np.full(self.num_arms, 0, dtype=np.int8)
    
    def pull_arm(self, index):
        
        #copy the arms array and pull the chosen arm
        pulled_arms = np.array(self.get_state(), copy=True)
        pulled_arms[index] = 1
        
        #return the new state and the reward for the pulled arm
        return pulled_arms, self.arms[index]
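
To make the returned values concrete, here is a minimal sketch of a single pull. It repeats a compact copy of the class so that it runs on its own; `demo_env` and its fixed reward values are assumptions chosen only for illustration and are not part of the exercise.

```python
import numpy as np

class Environment:
    """Compact copy of the Environment class above, repeated so the sketch runs on its own."""

    def __init__(self, num_arms, arms=None):
        self.num_arms = num_arms
        self.arms = arms if arms is not None else np.random.normal(0, 1, num_arms)

    def get_state(self):
        # every arm starts in its resting position (0)
        return np.full(self.num_arms, 0, dtype=np.int8)

    def pull_arm(self, index):
        pulled_arms = self.get_state()
        pulled_arms[index] = 1
        return pulled_arms, self.arms[index]

# hypothetical fixed rewards, chosen only for illustration
demo_env = Environment(3, arms=np.array([0.5, -1.0, 2.0]))

state, reward = demo_env.pull_arm(2)
print(state)                 # one-hot vector marking the pulled arm
print(reward)                # the reward stored for arm 2
print(demo_env.get_state())  # all zeros again: the arms have reset
```

The agent never sees `demo_env.arms` directly; it only learns a reward by pulling the corresponding arm.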

<h1> Agent </h1>

---

This will be a template agent that serves as the bare minimum for what we think of as an agent that can interact with and learn from its environment.

It also has a fairly unique identifier that can be used to pick a single agent out of the crowd. 

In [2]:
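class Agent():
    
    def __init__(self):
        self.get_name()
    
    def get_name(self):
        #fairly unique name built from the last three digits of the object id
        self.name = "Agent "+str(id(self))[-3:]
        
    #the act phase: choose an action for the given state
    def __call__(self, state):
        pass
        
    #the learn phase: update the agent from the observed state and reward
    def update(self, state, reward):
        pass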
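
As a sketch of how the template might be filled in, the snippet below defines a `RandomAgent`. This class is hypothetical and not part of the exercise: it picks arms uniformly at random and only accumulates reward, without learning anything. The `Agent` stand-in is repeated, with `self` in the method signatures, so the block runs on its own.

```python
import numpy as np

class Agent:
    """Stand-in for the template above, with `self` in the method signatures."""

    def __init__(self):
        self.name = "Agent " + str(id(self))[-3:]

    def __call__(self, state):
        pass

    def update(self, state, reward):
        pass

class RandomAgent(Agent):
    """Hypothetical agent: picks an arm uniformly at random and never learns."""

    def __init__(self, seed=None):
        super().__init__()
        self.rng = np.random.default_rng(seed)
        self.acc_reward = 0.0

    # the act phase: ignore the contents of the state, use only its length
    def __call__(self, state):
        return int(self.rng.integers(len(state)))

    # the learn phase: only keep a running total of the reward
    def update(self, state, reward):
        self.acc_reward += reward

agent = RandomAgent(seed=0)
choice = agent(np.zeros(5, dtype=np.int8))
agent.update(None, 1.5)
print(agent.name, choice, agent.acc_reward)
```

An agent like this can serve as a baseline: any learning agent should eventually collect more reward than it does.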

<h1> Bandit </h1>

---

This is an agent that has access to a number of arms (levers) from its environment that deliver unknown levels of reward.


The Bandit Agent has a set level of exploration it is comfortable with. Depending on whether it is exploring or exploiting, it will pull either some lever at random or its best lever.

It keeps track of the reward it expects from each lever in its knowledge array; its favorite lever is the one with the highest value.


In [4]:
class Bandit(Agent):
    
    def __init__(self, init_expected, exploration):
        super().__init__()
        self.e = exploration
        self.acc_reward = 0
        self.initial = init_expected
        
        #remember to set the agent's knowledge after initialization
        self.knowledge = None
    
    def set_knowledge(self, state):
        self.knowledge = np.full(state.shape, self.initial, dtype=np.float64)
        
    def __call__(self, state):
        is_random = np.random.choice(2, 1, p=[1-self.e, self.e])
        
        #exploration
        if is_random:
            choice = np.random.randint(0,len(self.knowledge))
        
        #exploitation
        else:                             
            choice = np.argmax(self.knowledge)
            
        return choice
    
    def update(self, state, reward):
        choice = np.argmax(state)
        
        #update the agent's knowledge of the environment
        self.knowledge[choice] = reward
        
        self.acc_reward += reward
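
The explore-or-exploit decision can also be looked at in isolation. The sketch below is an assumption-laden simplification: `epsilon_greedy` is a hypothetical helper, not a method of the class above, and it uses NumPy's `default_rng` generator instead of `np.random.choice`. It illustrates how the exploration rate `e` controls how often the best-known lever is pulled.

```python
import numpy as np

def epsilon_greedy(knowledge, e, rng):
    """Return a random arm with probability e, otherwise the best-known arm."""
    if rng.random() < e:
        # exploration: any arm, including the best one
        return int(rng.integers(len(knowledge)))
    # exploitation: the arm with the highest expected reward
    return int(np.argmax(knowledge))

rng = np.random.default_rng(0)
# hypothetical knowledge values; arm 1 is the current favorite
knowledge = np.array([0.2, 1.5, -0.3, 0.9])

greedy_picks = [epsilon_greedy(knowledge, 0.0, rng) for _ in range(100)]
mixed_picks = [epsilon_greedy(knowledge, 0.1, rng) for _ in range(10_000)]
share_best = np.mean(np.array(mixed_picks) == 1)

print(set(greedy_picks))  # with e = 0 only the favorite arm is ever pulled
print(share_best)         # share of pulls that went to the favorite arm
```

With four arms and `e = 0.1`, the favorite arm should receive roughly `1 - e + e/4` of the pulls, since a random pick can also land on it.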

<h1> Trainer </h1>

---

We'll let the trainer run learning for a set number of iterations on our bandits and see how quickly each one gets wise 🧐.

It will also let us display the knowledge the agents obtained and keep track of their average reward.




In [5]:
class Trainer:
    def __init__(self, env, bandits, iters):
        self.iters = iters
        self.env = env
        self.bandits = bandits
        self.tallies = np.zeros((len(self.bandits),2,iters))
        
    def train(self):
        #iterate over all bandits and execute the agent environment loop with an update to the agent knowledge
        for i in range(self.iters):
            for b in range(len(self.bandits)):
                
                #get state
                state = self.env.get_state()
                #make choice
                choice = self.bandits[b](state)
                #submit choice to environment and obtain the new state and reward
                state, reward = self.env.pull_arm(choice)
                #let the agent learn from the result
                self.bandits[b].update(state, reward)
                
                #make note of the average and total reward received by the agent
                average_reward = self.bandits[b].acc_reward/(i+1)
                self.tallies[b][0][i] = average_reward
                self.tallies[b][1][i] = self.bandits[b].acc_reward
                
    def show_knowledge(self):
        #access and plot the knowledge object of each agent 
        figure, axis = plt.subplots(3,3, constrained_layout = True)
        figure.suptitle("Bandit knowledges", y=1.08)
        
        for e in range(3):
            for i in range(3):
                
                b = e*3 + i
                
                axis[e][i].set_ylim([-10, 10])
                axis[e][i].title.set_text("%s \n exploration: %s \n init state: %s" % 
                                          (self.bandits[b].name, self.bandits[b].e, self.bandits[b].initial) )
                axis[e][i].bar(range(self.env.num_arms),self.bandits[b].knowledge, color = "green")
        
        plt.show()
        
    def show_stats(self):
        #plot the average reward for each agent alongside the total reward
        colors = ["blue","orange", "red"]
        
        figure, axis = plt.subplots(1, 2)
        figure.suptitle("Reward", y=1.08)
        
        axis[0].title.set_text("Average reward")
        legend = []
        for b in range(len(self.bandits)):
            axis[0].plot(self.tallies[b][0])
            legend.append(self.bandits[b].name)
        axis[0].legend(legend)
        
        axis[1].title.set_text("Total reward")
        legend = []
        for b in range(len(self.bandits)):
            axis[1].plot(self.tallies[b][1])
            legend.append(self.bandits[b].name)
        axis[1].legend(legend)
        plt.show()

We will initialize our environment and display the reward given for each arm.

In [6]:
%matplotlib notebook

#initialize the environment
num_arms = 20
env = Environment(num_arms)

plt.title("Arm Rewards")
plt.bar(range(env.num_arms), env.arms, color = "blue")
plt.show()

We will initialize our bandits and set their knowledge to the shape of our state, each with its own preprogrammed initial expected reward.

In [7]:
#initialize bandit agents with various explorations and initial expectations

explorations = [0.001, 0.01, 0.1]
init_expected = [-3, 0 , 10]
bandits = []

for e in range(len(explorations)):
    for i in range(len(init_expected)):
        bandit = Bandit(init_expected[i],explorations[e])
        bandit.name = "Agent: "+ str(e*3 + i)
        bandit.set_knowledge(env.get_state())
        bandits.append(bandit)

Initialize the trainer and show that our bandits have no knowledge of the environment rewards beyond their initial expectation.

In [8]:
iters = 1000

#create the trainer and plot the bandits' knowledge before any training
trainer = Trainer(env, bandits, iters)
trainer.show_knowledge()

In [9]:
#train the agents, filling their knowledge objects from interaction with the environment
trainer.train()

In [10]:
#show the results of the training
trainer.show_stats()
trainer.show_knowledge()

# Results

---

We can see that our agents fall into 3 separate categories:

1. Agents 0, 1, 3 perform poorly because of low expectations and low exploration. They are happy to keep pulling arms that look good compared to their pessimistic view of the arms they have not explored.
    
2. Agents 4, 6, 7, 8 have high rates of exploration, so they are able to discover all arms, even the agent that is highly pessimistic about undiscovered states. However, they still lag behind because their exploration leads them to pull arms they know are bad for the sake of "exploration".
    
3. Agents 2, 5 perform the best due to high initial expectations and subsequently low exploration. Their initial optimism causes them to explore all arms off the bat and then pull the highest discovered arm without much unnecessary exploration.
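
The effect of the initial expectation can be isolated with a purely greedy agent (exploration set to 0). The sketch below is a simplified illustration, not a rerun of the experiment: `greedy_pulls` is a hypothetical helper, and the five reward values are assumptions chosen so that every reward lies between the pessimistic start (-3) and the optimistic start (10).

```python
import numpy as np

def greedy_pulls(arms, init_expected, steps):
    """Run a purely greedy agent (exploration = 0) and return the arms it pulled."""
    knowledge = np.full(arms.shape, init_expected, dtype=np.float64)
    pulled = []
    for _ in range(steps):
        choice = int(np.argmax(knowledge))
        # same overwrite rule as Bandit.update: store the observed reward
        knowledge[choice] = arms[choice]
        pulled.append(choice)
    return pulled

# hypothetical fixed rewards, chosen only for illustration
arms = np.array([-0.5, 1.2, 0.3, 2.0, -1.1])

optimistic = greedy_pulls(arms, init_expected=10, steps=10)
pessimistic = greedy_pulls(arms, init_expected=-3, steps=10)

print(optimistic)   # tries every arm once, then settles on the best arm (3)
print(pessimistic)  # never leaves arm 0, the first arm it tried
```

The optimistic agent is disappointed by every arm it tries, so it keeps moving to an untried one until all have been sampled. The pessimistic agent is pleasantly surprised by its first pull and has no reason to look further.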