# REINFORCE in TensorFlow (3 pts)

This notebook implements a basic REINFORCE algorithm, a.k.a. policy gradient, for the CartPole environment.

It has been deliberately written to be as simple and human-readable as possible.

Authors: [Practical_RL](https://github.com/yandexdataschool/Practical_RL) course team

In [1]:
%env THEANO_FLAGS = 'floatX=float32'
import os
if type(os.environ.get("DISPLAY")) is not str or len(os.environ.get("DISPLAY")) == 0:
    !bash ../xvfb start
    %env DISPLAY=:1

env: THEANO_FLAGS='floatX=float32'


The notebook assumes that you have [openai gym](https://github.com/openai/gym) installed.

In case you're running on a server, [use xvfb](https://github.com/openai/gym#rendering-on-a-server)

In [2]:
import gym
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

env = gym.make("CartPole-v0")

# gym compatibility: unwrap TimeLimit
if hasattr(env, 'env'):
    env = env.env

s = env.reset()
n_actions = env.action_space.n
state_dim = env.observation_space.shape

# plt.imshow(env.render("rgb_array"))

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.




# Building the network for REINFORCE

For the REINFORCE algorithm, we need a model that predicts action probabilities given states.
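
As a rough illustration of what "action probabilities" means here, the sketch below turns raw network outputs (logits) into a probability distribution with a softmax, and into log-probabilities with a log-softmax. It uses plain NumPy rather than TensorFlow, and the logits are made-up values chosen only for the example.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max keeps exp() from overflowing."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def log_softmax(logits):
    """Row-wise log-softmax, computed without taking the log of a tiny probability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

# Hypothetical logits for a batch of 2 states and 2 actions (CartPole has 2 actions).
example_logits = np.array([[0.0, 0.0],
                           [2.0, -1.0]])
example_policy = softmax(example_logits)          # each row sums to 1
example_log_policy = log_softmax(example_logits)  # log of the rows above
```

Equal logits give a uniform policy, while a larger logit shifts probability mass toward that action.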

In [3]:
import tensorflow as tf

# create input variables. We only need <s,a,R> for REINFORCE
states = tf.placeholder('float32', (None,) + state_dim, name="states")
actions = tf.placeholder('int32', name="action_ids")
cumulative_rewards = tf.placeholder('float32', name="cumulative_returns")

init = tf.random_normal_initializer
hidden_size = 256
w1 = tf.get_variable('w1', initializer=init, shape=(state_dim[0], hidden_size),
                     dtype=tf.float32, trainable=True)
b1 = tf.get_variable('b1', initializer=init, shape=(1, hidden_size),
                     dtype=tf.float32, trainable=True)
hidden_layer = tf.nn.relu(tf.matmul(states, w1) + b1)

w2 = tf.get_variable('w2', initializer=init, shape=(hidden_size, n_actions),
                     dtype=tf.float32, trainable=True)
b2 = tf.get_variable('b2', initializer=init, shape=(1, n_actions),
                     dtype=tf.float32, trainable=True)
logits = tf.matmul(hidden_layer, w2) + b2
    
policy = tf.nn.softmax(logits)
log_policy = tf.nn.log_softmax(logits)

In [4]:
hidden_size = 256
#w1 = tf.Variable(initial_value=np.random.randn(state_dim[0], hidden_size),
#                 name='w1', dtype=tf.float32, trainable=True)
#b1 = tf.Variable(initial_value=np.random.randn(1, hidden_size), name='b1',
#                 dtype=tf.float32, trainable=True)

hidden_layer = tf.layers.dense(states, hidden_size, tf.nn.relu) # tf.nn.relu(tf.matmul(states, w1) + b1)

#w2 = tf.Variable(initial_value=np.random.randn(hidden_size, n_actions), name='w2',
#                 dtype=tf.float32, trainable=True)
#b2 = tf.Variable(initial_value=np.random.randn(1, n_actions), name='b2',
#                 dtype=tf.float32, trainable=True)
logits = tf.layers.dense(hidden_layer, n_actions) # tf.matmul(hidden_layer, w2) + b2
    
policy = tf.nn.softmax(logits)
log_policy = tf.nn.log_softmax(logits)

Instructions for updating:
Use keras.layers.dense instead.
Instructions for updating:
Colocations handled automatically by placer.


In [5]:
# utility function to get action probabilities for a single state
def get_action_proba(s): 
    return policy.eval({states: [s]})[0]

#### Loss function and updates

We now need to define the objective and the update rule for the policy gradient.

Our objective function is

$$ J \approx  { 1 \over N } \sum  _{s_i,a_i} \pi_\theta (a_i | s_i) \cdot G(s_i,a_i) $$


Following the REINFORCE algorithm, we can define our objective as follows: 

$$ \hat J \approx { 1 \over N } \sum  _{s_i,a_i} \log \pi_\theta (a_i | s_i) \cdot G(s_i,a_i) $$

When you compute the gradient of that function with respect to the network weights $ \theta $, it becomes exactly the policy gradient.
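
To make this concrete, here is a small NumPy sketch for a single sample. It assumes, purely for illustration, that the policy is a softmax over logits that are themselves the parameters. In that case the gradient of $\log \pi(a|s) \cdot G$ with respect to the logits is $(\text{onehot}(a) - \pi) \cdot G$, and the sketch compares this with a finite-difference estimate. The logits, action and return are hypothetical values.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def surrogate(logits, action, G):
    """Single-sample surrogate objective: log pi(a|s) * G, with G held constant."""
    return log_softmax(logits)[action] * G

def analytic_grad(logits, action, G):
    """Gradient w.r.t. the logits of a softmax policy: (onehot(a) - pi) * G."""
    pi = np.exp(log_softmax(logits))
    onehot = np.zeros_like(logits)
    onehot[action] = 1.0
    return (onehot - pi) * G

def numeric_grad(logits, action, G, eps=1e-6):
    """Central finite differences, used only to check the analytic formula."""
    grad = np.zeros_like(logits)
    for k in range(len(logits)):
        step = np.zeros_like(logits)
        step[k] = eps
        grad[k] = (surrogate(logits + step, action, G)
                   - surrogate(logits - step, action, G)) / (2 * eps)
    return grad

logits = np.array([0.3, -1.2])  # hypothetical logits for one state
action, G = 1, 2.5              # hypothetical sampled action and its return
grad_a = analytic_grad(logits, action, G)
grad_n = numeric_grad(logits, action, G)
```

With a positive return, the gradient raises the logit of the action that was taken and lowers the other one.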


In [6]:
# select the log-probabilities of the actions that were actually taken
indices = tf.stack([tf.range(tf.shape(log_policy)[0]), actions], axis=-1)
log_policy_for_actions = tf.gather_nd(log_policy, indices)
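
The `tf.stack` + `tf.gather_nd` pair above picks one entry per row: row `i`, column `actions[i]`. The same selection can be sketched in NumPy with integer-array indexing. The arrays below are invented for the example and use a `_np` suffix so they do not clash with the graph tensors.

```python
import numpy as np

# Hypothetical log-probabilities for a batch of 3 states and 2 actions.
log_policy_np = np.log(np.array([[0.9, 0.1],
                                 [0.4, 0.6],
                                 [0.5, 0.5]]))
actions_np = np.array([0, 1, 1])  # hypothetical action taken in each state

# Pair each row index with the chosen action, mirroring tf.stack + tf.gather_nd.
rows = np.arange(log_policy_np.shape[0])
log_policy_for_actions_np = log_policy_np[rows, actions_np]
```

The result has one value per state, namely the log-probability of the action chosen in that state.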

In [7]:
# REINFORCE objective function
# hint: you need to use log_policy_for_actions to get log probabilities for actions taken
# <policy objective as in the last formula. Please use mean, not sum.>
J = tf.math.reduce_mean(log_policy_for_actions*cumulative_rewards)

In [8]:
# regularize with entropy: H = -sum_a pi(a|s) * log pi(a|s), averaged over the batch
entropy = -tf.math.reduce_mean(tf.math.reduce_sum(policy*log_policy, axis=1))
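
For reference, the Shannon entropy of a categorical policy is $H = -\sum_a \pi(a|s) \log \pi(a|s)$. The NumPy sketch below (the helper name `policy_entropy` and the example distributions are hypothetical) shows why it is useful as an exploration bonus: a uniform policy has the highest entropy, and a nearly deterministic one has entropy close to zero.

```python
import numpy as np

def policy_entropy(probs, eps=1e-12):
    """Mean Shannon entropy (in nats) over a batch of categorical distributions."""
    probs = np.asarray(probs, dtype=np.float64)
    # eps guards against log(0) when some action has zero probability
    per_state = -np.sum(probs * np.log(probs + eps), axis=1)
    return per_state.mean()

uniform = [[0.5, 0.5]]      # maximally uncertain two-action policy
confident = [[0.99, 0.01]]  # nearly deterministic policy

h_uniform = policy_entropy(uniform)      # close to log(2)
h_confident = policy_entropy(confident)  # much smaller
```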

In [9]:
# all network weights
# all_weights = [w1, b1, w2, b2]

# weight updates. maximizing J is same as minimizing -J. Adding negative entropy.
loss = -J - 0.1*entropy

optimizer = tf.train.AdamOptimizer().minimize(loss)

Instructions for updating:
Use tf.cast instead.


### Computing cumulative rewards

In [10]:
def get_cumulative_rewards(rewards,  # rewards at each step
                           gamma=0.99  # discount for reward
                           ):
    """
    take a list of immediate rewards r(s,a) for the whole session 
    compute cumulative returns (a.k.a. G(s,a) in Sutton '16)
    G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...

    The simple way to compute cumulative rewards is 
    to iterate from the last to the first time tick
    and compute G_t = r_t + gamma*G_{t+1} recursively

    You must return an array/list of cumulative rewards with as many elements 
    as in the initial rewards.
    """
    G = []
    running_return = 0.0  # plays the role of G_{t+1}; zero after the final step
    # walk the episode backwards so each G_t reuses the already computed G_{t+1}
    for r in reversed(rewards):
        running_return = r + gamma * running_return
        G.append(running_return)

    return G[::-1]  # restore chronological order

In [11]:
assert len(get_cumulative_rewards(range(100))) == 100
assert np.allclose(get_cumulative_rewards([0, 0, 1, 0, 0, 1, 0], gamma=0.9), [
                   1.40049, 1.5561, 1.729, 0.81, 0.9, 1.0, 0.0])
assert np.allclose(get_cumulative_rewards(
    [0, 0, 1, -2, 3, -4, 0], gamma=0.5), [0.0625, 0.125, 0.25, -1.5, 1.0, -4.0, 0.0])
assert np.allclose(get_cumulative_rewards(
    [0, 0, 1, 2, 3, 4, 0], gamma=0), [0, 0, 1, 2, 3, 4, 0])
print("looks good!")

looks good!
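
One consequence worth seeing for CartPole, where every step yields a reward of 1: the discounted return is a geometric sum, $G_t = (1 - \gamma^{T-t}) / (1 - \gamma)$ for an episode of length $T$, so later steps receive smaller returns. The sketch below checks this closed form against the definition. The episode length is a hypothetical value picked for the example.

```python
import numpy as np

gamma = 0.99
T = 5  # hypothetical episode length; each step gives reward 1

# Definition: G_t = sum_k gamma^k * r_{t+k}, with r = 1 at every step.
direct = [sum(gamma ** k for k in range(T - t)) for t in range(T)]

# Closed form of the same geometric sum.
closed_form = [(1 - gamma ** (T - t)) / (1 - gamma) for t in range(T)]
```

The last step always has a return of exactly 1, and the values grow toward the start of the episode.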


In [12]:
def train_step(_states, _actions, _rewards):
    """given full session, trains agent with policy gradient"""
    _cumulative_rewards = get_cumulative_rewards(_rewards)
    optimizer.run({states: _states, actions: _actions,
                   cumulative_rewards: _cumulative_rewards})

### Playing the game

In [13]:
def generate_session(t_max=1000):
    """play env with REINFORCE agent and train at the session end"""

    # arrays to record session
    states, actions, rewards = [], [], []

    s = env.reset()

    for t in range(t_max):

        # action probabilities array aka pi(a|s)
        action_probas = get_action_proba(s)
        
        a = np.random.choice(n_actions, p=action_probas)

        new_s, r, done, info = env.step(a)

        # record session history to train later
        states.append(s)
        actions.append(a)
        rewards.append(r)

        s = new_s
        if done:
            break

    train_step(states, actions, rewards)

    return sum(rewards)
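
Note that `generate_session` samples actions from the policy instead of always taking the most probable one. A small sketch of what that sampling does, using a seeded `np.random.RandomState` and made-up probabilities so the result is reproducible:

```python
import numpy as np

rng = np.random.RandomState(0)          # seeded only to make the example reproducible
action_probas = np.array([0.8, 0.2])    # hypothetical pi(a|s) for one state

# Draw many actions from the same distribution, as np.random.choice does per step.
samples = rng.choice(len(action_probas), size=10000, p=action_probas)
empirical = np.bincount(samples, minlength=len(action_probas)) / len(samples)
```

The empirical frequencies approach `action_probas`, so the less likely action still gets tried about one time in five.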

In [14]:
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())

In [15]:
for i in range(100):
    rewards = [generate_session() for _ in range(100)]  # generate new sessions

    print("mean reward: %.3f" % (np.mean(rewards)))

    if np.mean(rewards) > 300:
        print("You Win!")
        break

mean reward: 24.070
mean reward: 47.940
mean reward: 97.500
mean reward: 230.410
mean reward: 337.020
You Win!


### Results & video

In [17]:
# record sessions
import gym.wrappers
env = gym.wrappers.Monitor(gym.make("CartPole-v0"),
                           directory="videos", force=True)
sessions = [generate_session() for _ in range(100)]
env.close()

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.


In [14]:
# show video
from IPython.display import HTML
import os

video_names = list(
    filter(lambda s: s.endswith(".mp4"), os.listdir("./videos/")))

HTML("""
<video width="640" height="480" controls>
  <source src="{}" type="video/mp4">
</video>
""".format("./videos/"+video_names[-1]))  # this may or may not be _last_ video. Try other indices

In [None]:
# That's all, thank you for your attention!