# Reinforcement Learning

See the [OpenAI Gym](https://gym.openai.com/) page on [CartPole-v0](https://gym.openai.com/envs/CartPole-v0/).

In [1]:
import gym

env = gym.make('CartPole-v0')

[2017-09-27 11:48:00,623] Making new env: CartPole-v0


In [2]:
import numpy as np
import tensorflow as tf
from tensorflow.contrib.layers import fully_connected

In [3]:
n_inputs  = 4
n_hidden  = 4
n_outputs = 1

initializer = tf.contrib.layers.variance_scaling_initializer()

learning_rate = 0.01

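# Policy network: 4 observations -> 4 ELU hidden units -> 1 logit.
X       = tf.placeholder(tf.float32, shape=[None, n_inputs])
hidden  = fully_connected(X, n_hidden, activation_fn=tf.nn.elu, weights_initializer=initializer)
logits  = fully_connected(hidden, n_outputs, activation_fn=None, weights_initializer=initializer)
outputs = tf.nn.sigmoid(logits)  # probability p of action 0 (left)

# Sample an action (0 = left, 1 = right) from the distribution [p, 1 - p].
p_left_and_right = tf.concat(axis=1, values=[outputs, 1 - outputs])
action           = tf.multinomial(tf.log(p_left_and_right), num_samples=1)

# Target probability of going left, assuming the sampled action was the best one.
y = 1. - tf.to_float(action)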

# Loss that pushes the output toward the sampled action.
cross_entropy  = tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits)
optimizer      = tf.train.AdamOptimizer(learning_rate)
# Compute the gradients without applying them yet: the training loop
# first weights them by the action scores.
grads_and_vars = optimizer.compute_gradients(cross_entropy)

gradients = [grad for (grad, variable) in grads_and_vars]

# One placeholder per variable, used to feed in the reward-weighted mean gradients.
gradient_placeholders = []
grads_and_vars_feed   = []
for grad, variable in grads_and_vars:
    gradient_placeholder = tf.placeholder(tf.float32, shape=grad.get_shape())
    gradient_placeholders.append(gradient_placeholder)
    grads_and_vars_feed.append((gradient_placeholder, variable))

training_op = optimizer.apply_gradients(grads_and_vars_feed)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

The following line had me stumped for a while:

```python
y = 1. - tf.to_float(action)
```

Or, in essence:

```python
y = 1 - action
```

What is going on here?

We need to go back to the definition, on pages 446 and 447 of the book, of what the network produces in its single output neuron, named `outputs`:

> It will output the **probability** _p_ of action 0 (left)

So if the network outputs 0.7, for example, it wants us to choose between the only two possible actions, **left** and **right**, by assigning _left_ a probability of 0.7 and _right_ a probability of 0.3. We then sample the `action` in exactly this way by calling `tf.multinomial(...)`, which expects log-probabilities, hence the `tf.log(...)` around `p_left_and_right`.
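
The sampling step can be sketched outside TensorFlow. This is a minimal NumPy illustration, not the notebook's code: the value `p_left = 0.7` is the hypothetical output from the example above, and `rng.choice` takes probabilities directly rather than log-probabilities.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # seed chosen only for reproducibility

p_left = 0.7                          # hypothetical network output
probabilities = [p_left, 1 - p_left]  # index 0 = left, index 1 = right

# Draw many actions to see the empirical frequencies.
actions = rng.choice(2, size=10_000, p=probabilities)
left_fraction = np.mean(actions == 0)
print(left_fraction)                  # close to 0.7
```

Each individual draw is still either 0 or 1; only the long-run frequency reflects the probability.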

We now have an `action` which is either 0 (left) or 1 (right).

So far so good, but to make the network _learn_, we have to tell it which _output_ (i.e. target) we would have _wanted_ to get. Supplying a target is what supervised learning does: it lets us compute an error term, then gradients, then update the weights using back-propagation, and so tune the network in the direction of getting better at keeping the pole upright.

We don't _know_ at any time whether moving left is better than moving right, or vice versa, so the best we can do is _assume_ that whatever action was chosen, it was the best one. The training loop corrects for this assumption later: it scales each gradient by the action's normalized discounted reward, so gradients for actions that turned out badly are reversed.

That means:

- if the action chosen was to go _left_, we would have wanted the network's output to strongly indicate we should go left;
- if the action chosen was to go _right_, we would have wanted the network's output to strongly indicate we should go right.

As before, we go back to the definition of _output_ in the book and realize that the desired network output represents:

> the **probability** _p_ of action 0 (left)

That means, quite simply, that:

- if the action chosen was to go _left_, we would have wanted the network to output the _highest_ possible probability _p_ of action 0 (left), or 1.0;
- if the action chosen was to go _right_, we would have wanted the network to output the _lowest_ possible probability _p_ of action 0 (left), or 0.0.

In table form:

Chosen action | Desired network output | By definition
:------------:|:----------------------:|:-----------------------------------:
left (0)      | go left                | maximum probability to go left (1.0)
right (1)     | go right               | minimum probability to go left (0.0)

And this last column is exactly what the following simple Python expression computes:

```python
y = 1 - action
```

action | y
:-----:|:-:
0      | 1
1      | 0
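
To see why this target does what we want, here is a small NumPy sketch, under stated assumptions, of one gradient-descent step on the sigmoid cross-entropy. The starting `logit` of 0.0 and the `step_size` of 1.0 are arbitrary illustration values, and the helper names are mine, not the notebook's. It uses the standard result that the derivative of the sigmoid cross-entropy with respect to the logit is `sigmoid(logit) - y`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, logit):
    """Binary cross-entropy between target y and sigmoid(logit)."""
    p = sigmoid(logit)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

logit = 0.0        # hypothetical starting point: p(left) = 0.5
step_size = 1.0    # arbitrary, for illustration only

p_after = {}
for action in (0, 1):
    y = 1.0 - action                      # target probability of going left
    grad = sigmoid(logit) - y             # d(cross_entropy) / d(logit)
    new_logit = logit - step_size * grad  # one plain gradient-descent step
    p_after[action] = sigmoid(new_logit)
    print(action, y, cross_entropy(y, logit), p_after[action])
```

After the step, p(left) rises above 0.5 when the chosen action was left (0) and falls below 0.5 when it was right (1).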

In [4]:
def discount_rewards(rewards, discount_rate):
    """For each step, return its reward plus all later rewards, discounted per step."""
    discounted_rewards = np.empty(len(rewards))

    cumulative_rewards = 0
    for step in reversed(range(len(rewards))):
        cumulative_rewards = rewards[step] + cumulative_rewards * discount_rate
        discounted_rewards[step] = cumulative_rewards

    return discounted_rewards

def discount_and_normalize_rewards(all_rewards, discount_rate):
    """Discount each episode's rewards, then normalize across all episodes together."""
    all_discounted_rewards = [discount_rewards(rewards, discount_rate) for rewards in all_rewards]
    flat_rewards           = np.concatenate(all_discounted_rewards)
    
    reward_mean = flat_rewards.mean()
    reward_std  = flat_rewards.std()
    
    return [(discounted_rewards - reward_mean) / reward_std
            for discounted_rewards in all_discounted_rewards]

In [5]:
discount_rewards([10, 0, -50], discount_rate=0.8)

array([-22., -40., -50.])
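
The result above can be checked by hand. The following sketch just unrolls the backward loop of `discount_rewards` for this one input; the names `r0`, `r1`, `r2` are introduced here for illustration.

```python
rewards = [10, 0, -50]
discount_rate = 0.8

# Work backwards from the last step, as the loop in discount_rewards does.
r2 = rewards[2]                       # -50
r1 = rewards[1] + discount_rate * r2  #  0 + 0.8 * (-50) = -40
r0 = rewards[0] + discount_rate * r1  # 10 + 0.8 * (-40) = -22

# Equivalent closed form for the first step: each reward is discounted once per step of delay.
r0_direct = rewards[0] + discount_rate * rewards[1] + discount_rate ** 2 * rewards[2]

print([r0, r1, r2], r0_direct)
```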

In [6]:
discount_and_normalize_rewards([[10, 0, -50], [10, 20]], discount_rate=0.8)

[array([-0.28435071, -0.86597718, -1.18910299]),
 array([ 1.26665318,  1.0727777 ])]
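
Note that the normalization uses a single mean and standard deviation computed over the steps of _all_ episodes, not per episode. A quick check, with the discounting re-implemented compactly (as a hypothetical `discount` helper) so the snippet is self-contained:

```python
import numpy as np

def discount(rewards, rate):
    """Compact stand-in for discount_rewards from cell 4."""
    out, running = np.empty(len(rewards)), 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + rate * running
        out[i] = running
    return out

episodes = [[10, 0, -50], [10, 20]]
discounted = [discount(r, 0.8) for r in episodes]

# One mean and one standard deviation over every step of every episode.
flat = np.concatenate(discounted)
normalized = [(d - flat.mean()) / flat.std() for d in discounted]

flat_normalized = np.concatenate(normalized)
print(flat_normalized.mean(), flat_normalized.std())  # approximately 0 and 1
```

This is why every score in the first episode is negative and every score in the second is positive: the first episode as a whole did worse than average.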

In [7]:
n_iterations       = 2000
n_max_steps        = 1000
n_games_per_update = 10
save_iterations    = 10
discount_rate      = 0.95

with tf.Session() as sess:
    init.run()

    for iteration in range(n_iterations):
        all_rewards   = []
        all_gradients = []
        
        # Run the game for a certain number of episodes.
        for game in range(n_games_per_update):
            current_rewards   = []
            current_gradients = []
            
            obs = env.reset()
            
            for step in range(n_max_steps):
                action_val, gradients_val = sess.run(
                    [action, gradients],
                    feed_dict={X: obs.reshape(1, n_inputs)})
                
                obs, reward, done, info = env.step(action_val[0][0])
                
                current_rewards.append(reward)
                current_gradients.append(gradients_val)
                
                if done:
                    break
            
            all_rewards.append(current_rewards)
            all_gradients.append(current_gradients)
        
        # It's time for a policy update.
        all_rewards = discount_and_normalize_rewards(all_rewards, discount_rate)
        feed_dict   = {}
        
        for var_index, grad_placeholder in enumerate(gradient_placeholders):
            # Multiply the gradients by the action scores, and compute the mean.
            mean_gradients = np.mean(
                [reward * all_gradients[game_index][step][var_index]
                    for game_index, rewards in enumerate(all_rewards)
                    for step, reward        in enumerate(rewards)],
                axis=0)
            
            feed_dict[grad_placeholder] = mean_gradients
            
        sess.run(training_op, feed_dict=feed_dict)
        
        if iteration % save_iterations == 0:
            saver.save(sess, './my_policy_net_pg.ckpt')
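
The policy-update step in the loop above is easier to follow on toy numbers. This sketch assumes a single variable with a gradient of shape `(2,)`, so the `var_index` dimension is dropped, and all gradient and score values are made up for illustration.

```python
import numpy as np

# Hypothetical per-step gradients for one variable, from two episodes
# of lengths 2 and 1.
all_gradients = [
    [np.array([1.0, 0.0]), np.array([0.0, 1.0])],  # episode 0
    [np.array([2.0, 2.0])],                        # episode 1
]
# Hypothetical normalized action scores, one per step.
all_scores = [np.array([1.0, -1.0]), np.array([0.5])]

# Scale each step's gradient by its score, then average over all steps.
weighted = [score * all_gradients[game_index][step]
            for game_index, scores in enumerate(all_scores)
            for step, score in enumerate(scores)]
mean_gradient = np.mean(weighted, axis=0)
print(mean_gradient)
```

The step with score -1.0 contributes its gradient with the sign flipped, which is how an action that turned out badly gets discouraged.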

Let's play!

In [None]:
obs = env.reset()
env.render()

with tf.Session() as sess:
    saver.restore(sess, './my_policy_net_pg.ckpt')
    
    while True:
        action_val = sess.run(action, feed_dict={X: obs.reshape(1, n_inputs)})
        obs, _reward, done, _info = env.step(action_val[0][0])
        env.render()
        
        if done:
            break