# Reinforcement Learning: multi-armed bandit with TensorFlow

In [12]:
import tensorflow as tf
import numpy as np

In [13]:
# This simple example uses 4 bandits. Each bandit's value is used to calculate its reward or punishment.
# In a different application or business case, the bandits list could simply be a reference to the available resources or items.
bandits = [0.2,0,-0.2,-5]
num_bandits = len(bandits) #4 bandits in this example

### The pullBandit(banditNumber) function simulates an action on the given bandit (banditNumber) and returns a reward (or punishment) for that action.
In this example, the reward or punishment is decided by drawing a random number from a standard normal distribution and comparing it with the bandit value: the more negative the bandit value, the higher the probability of a reward; the more positive the bandit value, the higher the probability of a punishment. For this 4-bandit example, the last bandit is therefore the best one and the first bandit is the worst, as the bandits list shows:

bandits = [0.2,0,-0.2,-5]
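
As a rough check of the ranking described above, the reward probability of each bandit can be computed in closed form. This is a minimal sketch, assuming the random draw is standard normal (as `np.random.randn` produces); `reward_probability` is a helper name introduced here for illustration and is not part of the notebook.

```python
import math

bandits = [0.2, 0, -0.2, -5]

def reward_probability(bandit_value):
    """P(z > bandit_value) for z ~ N(0, 1): the chance that a pull returns +1."""
    return 0.5 * math.erfc(bandit_value / math.sqrt(2))

for index, value in enumerate(bandits):
    p = reward_probability(value)
    # Expected reward per pull is (+1) * p + (-1) * (1 - p) = 2p - 1
    print(f"bandit {index + 1}: P(reward) = {p:.4f}, expected reward = {2 * p - 1:+.4f}")
```

The probabilities increase from the first bandit to the last, which matches the claim that the last bandit is the best one.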

### For a different application or business case:
The problem is modeled with 1 bandit per resource (or item), and every resource (or item) yields a reward or punishment when it is used (or activated). The pullBandit function logic then needs to change so that it returns the reward (or punishment) associated with the given bandit (resource or item), for example a monetary performance indicator (money earned or lost).
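
One possible shape for such a function is sketched below. Everything in it is hypothetical: the `observed_outcomes` data, the `pull_resource` name, and the choice to map a profit to +1 and a loss to -1 are assumptions made only to illustrate the idea, not part of this notebook.

```python
import random

# Hypothetical data: profit (positive) or loss (negative) recorded each time a resource was used.
observed_outcomes = {
    0: [12.5, -3.0, 8.0],
    1: [-1.0, -4.5, 2.0],
}

def pull_resource(resource_number, outcomes=observed_outcomes, rng=random):
    """Sample one recorded outcome for the resource and map it to +1 (profit) or -1 (loss)."""
    outcome = rng.choice(outcomes[resource_number])
    return 1 if outcome > 0 else -1

print(pull_resource(0))
```

A real application would probably query live data instead of a fixed dictionary, and might return the monetary amount itself rather than only its sign.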

In [14]:
# Return the reward (+1) or punishment (-1) of using the given bandit
def pullBandit(banditNumber):
    bandit = bandits[banditNumber]
    result = np.random.randn(1)
    
    if result > bandit:
        return 1
    else:
        return -1
    

### Model definition

The model is a neural network with a single output layer, with 1 neuron (and its weight) per bandit. The intuition is that each neuron's weight represents how good the agent has learned that bandit to be, so the largest weight belongs to the bandit the agent has learned produces the best results. The best action to take (bandit to pull) at every step is therefore the one with the largest weight.

At every step (or episode), the agent makes an exploration/exploitation tradeoff: with a given probability "e" it explores by pulling a random bandit (action) to acquire new knowledge, and with probability 1-e it exploits the bandit it has learned to give the best rewards. The selected action therefore depends on whether the agent is exploring or exploiting.
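
The selection rule can be sketched on its own with NumPy. This is an illustrative version only: `choose_action`, the example weights, and the seeded generator are assumptions introduced here, not names from the notebook.

```python
import numpy as np

def choose_action(weights, explore_probability, rng):
    """Epsilon-greedy: a random bandit with probability e, the best-known bandit otherwise."""
    if rng.random() < explore_probability:
        return int(rng.integers(len(weights)))  # explore
    return int(np.argmax(weights))              # exploit

rng = np.random.default_rng(0)
weights = np.array([1.0, 1.0, 1.0, 1.2])  # example weights, last bandit currently rated best
actions = [choose_action(weights, 0.1, rng) for _ in range(1000)]
print(np.bincount(actions, minlength=len(weights)))
```

With e = 0.1, most pulls go to the bandit with the largest weight, while the remaining pulls are spread over all bandits.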

The neuron weights are trained using gradient descent with a loss function of:

__Loss = -log(Policy)*Advantage__

Where:

* Policy: the weight of the selected action (pulled bandit) at every training step.
* Advantage: a measure of how good the selected action was compared to some baseline. To keep this example simple, the advantage is the same as the reward of pulling the selected bandit (executing the selected action).
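
To see why minimizing this loss pushes the weights in the right direction, it may help to write out one gradient step. The symbols below are notation introduced here: $w_a$ is the weight of the selected action, $A$ the advantage, and $\alpha$ the learning rate.

$$
\text{Loss} = -\log(w_a)\,A
\qquad\Rightarrow\qquad
\frac{\partial\,\text{Loss}}{\partial w_a} = -\frac{A}{w_a}
$$

$$
w_a \leftarrow w_a - \alpha\,\frac{\partial\,\text{Loss}}{\partial w_a} = w_a + \alpha\,\frac{A}{w_a}
$$

As long as $w_a$ is positive, a reward ($A = 1$) raises the weight of the pulled bandit and a punishment ($A = -1$) lowers it. For example, with $w_a = 1$ and $\alpha = 0.001$, a reward moves the weight to $1.001$ and a punishment moves it to $0.999$.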

In [15]:
# Note: this notebook uses the TensorFlow 1.x graph API (sessions and placeholders)
tf.reset_default_graph()

#one weight per bandit; the largest weight corresponds to the bandit with the best results.
weights = tf.Variable(tf.ones([num_bandits]))
best_action = tf.argmax(weights,0)

#At every episode the agent's action (bandit pulled) is selected depending on its decision
#to exploit the bandit it has learned to be the best action (with probability 1-e) or to
#explore other bandits to learn and acquire new knowledge (with probability e)
selected_action = tf.placeholder(shape = [1], dtype=tf.int32) #the selected action
selected_action_reward = tf.placeholder(shape = [1],dtype=tf.float32) #the reward of executing the selected action
selected_action_weight = tf.slice(weights,selected_action,[1]) #the score the agent gives to the selected action

advantage = selected_action_reward #the advantage measure to calculate the loss(to compare the selected action to a baseline)
loss = -(tf.log(selected_action_weight)*advantage) #the loss function to minimize 

# minimize loss function with gradient descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
update = optimizer.minimize(loss)
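
For readers without a TensorFlow 1.x environment, the single update this graph performs can be approximated in plain NumPy. This is a sketch under the assumption that the optimizer applies ordinary gradient descent to the selected weight only; `update_weights` is a name introduced here for illustration.

```python
import numpy as np

def update_weights(weights, action, reward, learning_rate=0.001):
    """One gradient-descent step on loss = -log(weights[action]) * reward."""
    gradient = -reward / weights[action]             # d(loss) / d(weights[action])
    new_weights = weights.copy()                     # leave the input array untouched
    new_weights[action] -= learning_rate * gradient  # only the selected weight changes
    return new_weights

weights = np.ones(4)
weights = update_weights(weights, action=3, reward=1)   # rewarded pull of the last bandit
weights = update_weights(weights, action=0, reward=-1)  # punished pull of the first bandit
print(weights)
```

Only the weight of the pulled bandit moves on each step, which is why repeated rewards on one bandit gradually make it the `argmax`.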

In [16]:
total_episodes = 1100 #number of steps
total_reward = np.zeros(num_bandits) #total reward of every bandit
explore_probability = 0.1 #probability to explore and get new knowledge
stats_print_steps = 50

In [17]:
initialize = tf.global_variables_initializer()

with tf.Session() as session:
    session.run(initialize)
    
    for step in range(total_episodes):
        # decide whether to explore or exploit
        if np.random.rand(1) < explore_probability:
            action = np.random.randint(num_bandits)
        else:
            action = session.run(best_action)
            
        #get the reward of the given action
        reward = pullBandit(action)
        
        #perform a gradient descent step to update the weights by minimizing the loss for the selected action and its reward
        feed_dict = {selected_action:[action],selected_action_reward:[reward]}
        _,selected_weight,all_weights = session.run([update,selected_action_weight,weights],
                                                   feed_dict=feed_dict)
        #increase the total reward of the selected action
        total_reward[action]+=reward 
        
        if step% stats_print_steps == 0:
            print("Step "+str(step)+" ,bandits rewards:"+str(total_reward))
    print("Final rewards:"+str(total_reward))
        
    print("The agent thinks bandit "+str(np.argmax(all_weights)+1)+ " is the best ")
    if np.argmax(all_weights) == np.argmax(-np.array(bandits)):
        print("...The agent got the right answer")
    else:
        print("...The agent got the wrong answer")

Step 0 ,bandits rewards:[ 1.  0.  0.  0.]
Step 50 ,bandits rewards:[ -1.  -1.   0.  45.]
Step 100 ,bandits rewards:[ -2.  -3.  -1.  91.]
Step 150 ,bandits rewards:[  -2.   -5.   -1.  139.]
Step 200 ,bandits rewards:[  -2.   -5.   -1.  189.]
Step 250 ,bandits rewards:[  -3.   -5.   -2.  233.]
Step 300 ,bandits rewards:[  -2.   -5.   -1.  279.]
Step 350 ,bandits rewards:[  -2.   -2.   -1.  322.]
Step 400 ,bandits rewards:[  -1.   -2.   -1.  367.]
Step 450 ,bandits rewards:[   0.   -2.   -3.  414.]
Step 500 ,bandits rewards:[   0.   -2.   -1.  460.]
Step 550 ,bandits rewards:[   0.   -4.    0.  505.]
Step 600 ,bandits rewards:[   0.   -5.    0.  554.]
Step 650 ,bandits rewards:[   0.   -5.    1.  603.]
Step 700 ,bandits rewards:[   0.   -5.    2.  650.]
Step 750 ,bandits rewards:[  -2.   -6.    2.  697.]
Step 800 ,bandits rewards:[  -3.   -8.    5.  741.]
Step 850 ,bandits rewards:[  -4.  -10.    4.  787.]
Step 900 ,bandits rewards:[  -5.   -9.    4.  835.]
Step 950 ,bandits rewards:[  -7