## The Contextual Bandits

This tutorial contains a simple example of how to build a policy-gradient-based agent that can solve the contextual bandit problem.

In [1]:
import numpy as np
import tensorflow as tf
import tensorflow.contrib.slim as slim  # tf.contrib exists only in TensorFlow 1.x

### The Contextual Bandits

Here we define our contextual bandits. In this example, we are using three four-armed bandits, meaning that each bandit has four arms that can be pulled. Each bandit has different success probabilities for each arm and therefore requires a different action to obtain the best result. Each entry of `self.bandits` is a threshold for one arm: the `pull_arm` method draws a random number from a standard normal distribution (mean 0) and returns a reward of +1 if the draw exceeds the threshold of the chosen arm, and -1 otherwise. The lower an arm's threshold, the more likely a positive reward will be returned. We want our agent to learn to always choose the arm that most often gives a positive reward, depending on the bandit presented.

In [2]:
class ContextualBandit(object):
    
    def __init__(self):
        self.state = 0
        
        # Currently arms 4, 2, and 1 (respectively) are optimal: they have the lowest thresholds
        self.bandits = np.array([[.2, 0., -0., -5.], [.1, -5., 1., .25], [-5., 5., 5., 5.]])
        self.n_bandits = self.bandits.shape[0]
        self.n_actions = self.bandits.shape[1]
        
    def get_bandit(self):
        """ returns a random state for each episode """
        self.state = np.random.randint(0, len(self.bandits))
        return self.state
    
    def pull_arm(self, action):
        bandit = self.bandits[self.state, action]
        result = np.random.randn(1)
        return 1 if result > bandit else -1
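
To make the reward rule concrete, the sketch below computes each arm's probability of paying +1 directly from the thresholds in `self.bandits`. Because `pull_arm` compares a standard normal draw against the threshold, that probability is 1 - Φ(threshold), where Φ is the standard normal CDF. The helper name `success_probability` is ours for illustration and is not part of the notebook.

```python
import math

import numpy as np

# Thresholds copied from ContextualBandit.bandits above.
thresholds = np.array([[.2, 0., -0., -5.],
                       [.1, -5., 1., .25],
                       [-5., 5., 5., 5.]])


def success_probability(threshold):
    """P(z > threshold) for z ~ N(0, 1): the chance that pull_arm returns +1."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))


probs = np.vectorize(success_probability)(thresholds)
best_arms = probs.argmax(axis=1) + 1  # 1-based, matching the comment in the class

print(np.round(probs, 3))
print(best_arms)
```

The last line prints `[4 2 1]`, which agrees with the comment in `__init__`: the best arm for each bandit is the one with the lowest threshold.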

### The Policy-Based Agent

The code below establishes our simple neural agent. It takes as input the current state, and returns an action. This allows the agent to take actions which are conditioned on the state of the environment - a critical step toward being able to solve full RL problems.

The agent uses a single set of weights, in which each value estimates the return from choosing a particular arm given a particular bandit. We use a policy gradient method to update the agent by moving the value for the selected action toward the received reward.

In [3]:
class Agent(object):
    
    def __init__(self, learning_rate, state_dim, n_actions):
        # These lines establish the feed-forward part of the network.
        # The agent takes a state and produces an action.
        self.state_in = tf.placeholder(shape=[1], dtype=tf.int32)
        state_in_one_hot = slim.one_hot_encoding(self.state_in, state_dim)
        output = slim.fully_connected(state_in_one_hot, n_actions,
                                      activation_fn=tf.nn.sigmoid,
                                      weights_initializer=tf.ones_initializer(),
                                      biases_initializer=None)
        self.output = tf.reshape(output, [-1])
        self.chosen_action = tf.argmax(self.output, 0)
        
        self.reward_ph = tf.placeholder(shape=[1], dtype=tf.float32)
        self.action_ph = tf.placeholder(shape=[1], dtype=tf.int32)
        self.responsible_weight = tf.slice(self.output, self.action_ph, [1])
        self.loss = -tf.log(self.responsible_weight) * self.reward_ph
        optimizer = tf.train.GradientDescentOptimizer(learning_rate)
        self.update_op = optimizer.minimize(self.loss)
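
For intuition, here is a minimal NumPy sketch of what one gradient-descent step on this loss does to the single weight that was used. It assumes the layer reduces to `sigmoid(W[state, action])` for the selected entry, which is what a one-hot input with no bias amounts to, and it uses a hand-derived gradient: for loss = -log(σ(w)) · r, the derivative with respect to w is -r · (1 - σ(w)). The function name `policy_gradient_step` is illustrative and not part of the notebook.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def policy_gradient_step(W, state, action, reward, learning_rate):
    """One SGD step on loss = -log(sigmoid(W[state, action])) * reward.

    Only W[state, action] has a nonzero gradient, because the one-hot
    input selects a single row and the loss reads a single output.
    """
    w = W[state, action]
    grad = -reward * (1.0 - sigmoid(w))  # d(loss)/dw, derived by hand
    W = W.copy()
    W[state, action] = w - learning_rate * grad
    return W


W = np.ones((3, 4))  # same all-ones initialization as the agent
W_up = policy_gradient_step(W, state=0, action=3, reward=1, learning_rate=0.001)
W_down = policy_gradient_step(W, state=0, action=3, reward=-1, learning_rate=0.001)
print(W_up[0, 3], W_down[0, 3])
```

A reward of +1 nudges the selected weight up and a reward of -1 nudges it down, and the step size shrinks as the sigmoid output approaches 1.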

### Training the Agent

We train our agent by getting a state from the environment, taking an action, and receiving a reward. With these three pieces of information, we can update the network so that, for each state, it more often chooses the actions that yield the highest rewards over time.

In [4]:
tf.reset_default_graph()

cbandit = ContextualBandit()
agent = Agent(learning_rate=0.001, state_dim=cbandit.n_bandits, n_actions=cbandit.n_actions)
W = tf.trainable_variables()[0]

In [5]:
# hyperparams:
n_episodes = 10000
total_reward = np.zeros([cbandit.n_bandits, cbandit.n_actions])  # set scoreboard for bandits to zeros
epsilon = 0.1  # set the chance of taking a random action

In [6]:
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(n_episodes):
        s = cbandit.get_bandit()  # get a state from the environment
        
        # Explore with probability epsilon; otherwise exploit the current best action
        if np.random.rand(1) < epsilon:
            action = np.random.randint(cbandit.n_actions)
        else:
            action = sess.run(agent.chosen_action, feed_dict={agent.state_in: [s]})
            
        reward = cbandit.pull_arm(action)  # get our reward for taking an action given a bandit
        
        # Update the network
        feed_dict = {agent.reward_ph: [reward], agent.action_ph: [action], agent.state_in: [s]}
        _, W1 = sess.run([agent.update_op, W], feed_dict)
        
        # Update the running tally of rewards
        total_reward[s, action] += reward
        
        if i % 500 == 0:
            print('Mean reward for each of the %i bandits: %s' % 
                  (cbandit.n_bandits, str(np.mean(total_reward, axis=1))))

Mean reward for each of the 3 bandits: [ 0.    0.   -0.25]
Mean reward for each of the 3 bandits: [34.5  42.   34.25]
Mean reward for each of the 3 bandits: [72.   81.25 69.5 ]
Mean reward for each of the 3 bandits: [110.75 117.   106.5 ]
Mean reward for each of the 3 bandits: [152.5  152.   143.25]
Mean reward for each of the 3 bandits: [192.25 188.75 177.75]
Mean reward for each of the 3 bandits: [227.5  229.25 213.  ]
Mean reward for each of the 3 bandits: [264.5  267.5  249.25]
Mean reward for each of the 3 bandits: [301.5  300.25 289.  ]
Mean reward for each of the 3 bandits: [334.5  342.5  321.75]
Mean reward for each of the 3 bandits: [371.   383.   355.25]
Mean reward for each of the 3 bandits: [407.75 422.25 390.75]
Mean reward for each of the 3 bandits: [451.   457.5  419.75]
Mean reward for each of the 3 bandits: [486.   499.5  452.75]
Mean reward for each of the 3 bandits: [523.5  539.75 486.  ]
Mean reward for each of the 3 bandits: [556.5  578.75 523.5 ]
Mean reward for e

In [11]:
W1

array([[0.99623156, 1.00242   , 0.9975794 , 1.6312536 ],
       [0.9997333 , 1.6507342 , 0.98159117, 0.99757993],
       [1.635012  , 0.97640234, 0.9706452 , 0.9769494 ]], dtype=float32)

In [7]:
for i in range(cbandit.n_bandits):
    print('The agent thinks action %s for bandit %i is the most promising' %
          (str(np.argmax(W1[i]) + 1), i + 1))
    
    if np.argmax(W1[i]) == np.argmin(cbandit.bandits[i]):
        print('and it is right!')
    else:
        print('and it is wrong!')
    
    print('')

The agent thinks action 4 for bandit 1 is the most promising
and it is right!

The agent thinks action 2 for bandit 2 is the most promising
and it is right!

The agent thinks action 1 for bandit 3 is the most promising
and it is right!
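
Because `tensorflow.contrib` was removed in TensorFlow 2.x, the cells above need a 1.x installation. As a rough cross-check that does not depend on TensorFlow, the sketch below reruns the same epsilon-greedy loop in plain NumPy. It is not the notebook's implementation: the function name `train_numpy_agent` is hypothetical, the gradient is derived by hand from the loss used in `Agent`, and the random draws differ from the run above, so the learned weights will not match the printed `W1` exactly.

```python
import numpy as np


def train_numpy_agent(n_episodes=10000, learning_rate=0.001, epsilon=0.1, seed=0):
    """Epsilon-greedy contextual bandit training without TensorFlow."""
    rng = np.random.default_rng(seed)
    # Thresholds copied from ContextualBandit.bandits.
    thresholds = np.array([[.2, 0., -0., -5.],
                           [.1, -5., 1., .25],
                           [-5., 5., 5., 5.]])
    n_bandits, n_actions = thresholds.shape
    W = np.ones((n_bandits, n_actions))  # all-ones initialization, as in Agent

    for _ in range(n_episodes):
        s = rng.integers(n_bandits)  # random context for this episode
        if rng.random() < epsilon:
            a = rng.integers(n_actions)  # explore
        else:
            # Exploit; sigmoid is monotonic, so argmax over W[s] is enough.
            a = int(np.argmax(W[s]))
        reward = 1 if rng.standard_normal() > thresholds[s, a] else -1

        # Gradient-descent step on loss = -log(sigmoid(w)) * reward.
        p = 1.0 / (1.0 + np.exp(-W[s, a]))
        W[s, a] += learning_rate * reward * (1.0 - p)
    return W, thresholds


W, thresholds = train_numpy_agent()
for i in range(len(W)):
    print('Bandit %i: learned arm %i, best arm %i' %
          (i + 1, np.argmax(W[i]) + 1, np.argmin(thresholds[i]) + 1))
```

Training is stochastic, so the learned arm for a given bandit can depend on the seed, especially when several arms have similar thresholds, as in the first bandit.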

