**Reinforcement Learning**

# Setup

First, let's make sure this notebook works well in both Python 2 and 3, import a few common modules, ensure Matplotlib plots figures inline, and prepare a function to save the figures:

In [1]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import os
import sys

# to make this notebook's output stable across runs
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

# To plot pretty figures and animations
%matplotlib nbagg
import matplotlib
import matplotlib.animation as animation
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "rl"

def save_fig(fig_id, tight_layout=True):
    path = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID, fig_id + ".png")
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format='png', dpi=300)
    
def update_scene(num, frames, patch):
    patch.set_data(frames[num])
    return patch,

def plot_animation(frames, repeat=False, interval=40):
    plt.close()  # or else nbagg sometimes plots in the previous cell
    fig = plt.figure()
    patch = plt.imshow(frames[0])
    plt.axis('off')
    return animation.FuncAnimation(fig, update_scene, fargs=(frames, patch), frames=len(frames), repeat=repeat, interval=interval)

# A simple environment: the Cart-Pole

In [2]:
import gym

In [3]:
import tensorflow as tf



The Cart-Pole is a very simple environment: a 2D simulation of a cart that can be accelerated left or right, with a pole placed vertically on top of it. The agent must move the cart left or right to keep the pole upright.

The `make()` function creates an environment, in this case a CartPole environment.


In [11]:
env = gym.make("CartPole-v0")

[2019-05-26 02:48:15,775] Making new env: CartPole-v0


After the environment is created, we must initialize it using the `reset()` method. This returns the first observation. Observations depend on the type of environment.

For the CartPole environment, each observation is a 1D NumPy array containing four floats. These floats represent:

1. The cart's horizontal position (0.0 = center)
2. The cart's velocity
3. The angle of the pole (0.0 = vertical)
4. The pole's angular velocity

![image.png](attachment:image.png)

In [12]:
obs = env.reset()

In [13]:
obs

array([ 0.03972928,  0.01604397, -0.04001799, -0.03611603])
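
To make the four floats easier to read, they can be unpacked into named fields. The following is only a small sketch: the `CartPoleObs` type and the `unpack_observation` helper are hypothetical names introduced here for illustration, not part of gym.

```python
from collections import namedtuple

import numpy as np

# Hypothetical helper type: the field names are ours, not gym's.
CartPoleObs = namedtuple(
    "CartPoleObs", ["position", "velocity", "angle", "angular_velocity"])

def unpack_observation(obs):
    """Convert a length-4 CartPole observation into named fields."""
    obs = np.asarray(obs, dtype=float)
    if obs.shape != (4,):
        raise ValueError("expected a 1D observation with four floats")
    return CartPoleObs(*obs.tolist())

# The observation printed above
named = unpack_observation([0.03972928, 0.01604397, -0.04001799, -0.03611603])
print(named.position > 0)  # True: the cart is slightly off-center
print(named.angle < 0)     # True: the pole angle is slightly negative
```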

Let's render the environment.

If you want `render()` to return the rendered image as a NumPy array, you can set the `mode` parameter to `rgb_array`.

Unfortunately, we need to fix an annoying rendering issue first.

## Fixing the rendering issue

Some environments (including the Cart-Pole) require access to your display, which opens up a separate window, even if you specify the `rgb_array` mode. In general you can safely ignore that window. However, if Jupyter is running on a headless server (i.e., without a screen), it will raise an exception. One way to avoid this is to install a fake X server like Xvfb. You can then start Jupyter using the `xvfb-run` command:

    $ xvfb-run -s "-screen 0 1400x900x24" jupyter notebook

_Note: on CloudxLab, just execute the cell below. There is no need to run the `xvfb-run` command there._

If Jupyter is running on a headless server but you don't want to worry about Xvfb, then you can just use the following rendering function for the Cart-Pole:


In [29]:
from PIL import Image, ImageDraw

try:
    from pyglet.gl import gl_info
    openai_cart_pole_rendering = True   # no problem, let's use OpenAI gym's rendering function
except Exception:
    openai_cart_pole_rendering = False  # probably no X server available, let's use our own rendering function

def render_cart_pole(env, obs):
    if openai_cart_pole_rendering:
        # use OpenAI gym's rendering function
        return env.render(mode="rgb_array")
    else:
        # rendering for the cart pole environment (in case OpenAI gym can't do it)
        img_w = 600
        img_h = 400
        cart_w = img_w // 12
        cart_h = img_h // 15
        pole_len = img_h // 3.5
        pole_w = img_w // 80 + 1
        x_width = 2
        max_ang = 0.2
        bg_col = (255, 255, 255)
        cart_col = 0x000000 # Blue Green Red
        pole_col = 0x669acc # Blue Green Red

        pos, vel, ang, ang_vel = obs
        img = Image.new('RGB', (img_w, img_h), bg_col)
        draw = ImageDraw.Draw(img)
        cart_x = pos * img_w // x_width + img_w // x_width
        cart_y = img_h * 95 // 100
        top_pole_x = cart_x + pole_len * np.sin(ang)
        top_pole_y = cart_y - cart_h // 2 - pole_len * np.cos(ang)
        draw.line((0, cart_y, img_w, cart_y), fill=0)
        draw.rectangle((cart_x - cart_w // 2, cart_y - cart_h // 2, cart_x + cart_w // 2, cart_y + cart_h // 2), fill=cart_col) # draw cart
        draw.line((cart_x, cart_y - cart_h // 2, top_pole_x, top_pole_y), fill=pole_col, width=pole_w) # draw pole
        return np.array(img)

def plot_cart_pole(env, obs):
    plt.close()  # or else nbagg sometimes plots in the previous cell
    img = render_cart_pole(env, obs)
    plt.imshow(img)
    plt.axis("off")
    plt.show()

In [15]:
# Render the environment

plot_cart_pole(env, obs)

<IPython.core.display.Javascript object>

Now let’s ask the environment what actions are possible:

In [16]:
env.action_space

Discrete(2)

`Discrete(2)` means that the possible actions are the integers 0 and 1, which represent accelerating toward the left (0) or toward the right (1).


The `step()` method executes the given action and returns four values:

`obs` - This is the new observation. For example, after a push to the right you might get an observation in which the cart is moving toward the right (`obs[1] > 0`) and the pole is still tilted toward the right (`obs[2] > 0`) but has a negative angular velocity (`obs[3] < 0`), so it will likely be tilted toward the left after the next step.

`reward` - In this environment, you get a reward of 1.0 at every step, no matter what you do, so the goal is to keep running as long as possible.

`done` - This value will be `True` when the episode is over, for example when the pole tilts too much. After that, the environment must be reset before it can be used again.

`info` - This dictionary may provide extra debug information in other environments. This data should not be used for training (it would be cheating).
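
The four return values suggest a generic episode loop: reset, then step until `done`. Here is a minimal sketch of that loop. To keep it runnable without gym, it uses `CountdownEnv`, a hypothetical stand-in that only mimics the `reset()`/`step()` interface described above; `run_episode` is likewise a name introduced here.

```python
class CountdownEnv(object):
    """Hypothetical stand-in for a gym environment.

    The episode simply ends after `n_steps` steps, with a reward of 1.0 per step.
    """
    def __init__(self, n_steps=5):
        self.n_steps = n_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0, 0.0, 0.0, 0.0]

    def step(self, action):
        self.t += 1
        obs = [0.0, 0.0, 0.0, 0.0]
        done = self.t >= self.n_steps
        return obs, 1.0, done, {}

def run_episode(env, policy, n_max_steps=1000):
    """Play one episode with `policy(obs) -> action` and return the total reward."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(n_max_steps):
        obs, reward, done, info = env.step(policy(obs))
        total_reward += reward
        if done:
            break  # the environment must be reset before it is used again
    return total_reward

total = run_episode(CountdownEnv(n_steps=5), policy=lambda obs: 0)
print(total)  # 5.0
```

With a real CartPole environment, the same function should work unchanged as long as `env` exposes the four-value `step()` shown in this notebook.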


Let's push the cart left until the pole falls:

In [30]:
obs = env.reset()
total_steps = 0
while True:
    obs, reward, done, info = env.step(0)
    print(reward)
    total_steps += 1
    if done:
        print("Finished. Steps: %s" % total_steps)
        break

1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
Finished. Steps: 9


In [31]:
obs = env.reset()
i = 0
while True:
    # Push right on every third step, left otherwise
    action = 1 if i % 3 == 0 else 0
    obs, reward, done, info = env.step(action)
    i += 1
    if done:
        print("Finished. Steps: %s" % i)
        break

Finished. Steps: 19


In [32]:
plt.close()  # or else nbagg sometimes plots in the previous cell
img = render_cart_pole(env, obs)
plt.imshow(img)
plt.axis("off")

<IPython.core.display.Javascript object>

(-0.5, 599.5, 399.5, -0.5)

In [53]:
env.step(0)

[2018-09-22 18:58:05,121] You are calling 'step()' even though this environment has already returned done = True. You should always call 'reset()' once you receive 'done = True' -- any further steps are undefined behavior.


(array([-0.23667963, -2.13630877,  0.29466916,  3.37443758]), 0.0, True, {})

In [54]:
img.shape

(400, 600, 3)

Notice that the game is over when the pole tilts too much, not when it actually falls. Now let's reset the environment and push the cart to the right instead:

In [55]:
obs = env.reset()
steps = 0
while True:
    obs, reward, done, info = env.step(1)
    steps += 1
    if done:
        print(steps)
        break
        
plot_cart_pole(env, obs)

9


<IPython.core.display.Javascript object>

In [33]:
frames = []
obs = env.reset()
steps = 0
while True:
    img = render_cart_pole(env, obs)
    frames.append(img)
    obs, reward, done, info = env.step(steps % 2)
    steps += 1
    if done:
        print(steps)
        break
        
video = plot_animation(frames)
plt.show()
print(steps)

40


<IPython.core.display.Javascript object>

40


In [34]:
frames = []
obs = env.reset()
steps = 0
while True:
    img = render_cart_pole(env, obs)
    frames.append(img)
    r = 0
    if np.random.random() > 0.5:
        r = 1
    obs, reward, done, info = env.step(r)
    steps += 1
    if done:
        print(steps)
        break
        
video = plot_animation(frames)
plt.show()
print(steps)

16


<IPython.core.display.Javascript object>

16


Looks like it's doing what we're telling it to do. Now how can we make the pole remain upright? We will need to define a _policy_ for that. This is the strategy that the agent will use to select an action at each step. It can use all the past actions and observations to decide what to do.

# A simple hard-coded policy

Let's hard code a simple strategy: if the pole is tilting to the left, then push the cart to the left, and _vice versa_. Let's see if that works:

In [36]:
frames = []

n_max_steps = 1000
n_change_steps = 10

num_steps = 0
obs = env.reset()
for step in range(n_max_steps):
    img = render_cart_pole(env, obs)
    frames.append(img)

    # hard-coded policy
    position, velocity, angle, angular_velocity = obs
    if angle < 0:
        action = 0
    else:
        action = 1
    obs, reward, done, info = env.step(action)
    num_steps += 1
    if done:
        print(num_steps)
        break

video = plot_animation(frames)
plt.show()

56


<IPython.core.display.Javascript object>

Nope, the system is unstable and after just a few wobbles, the pole ends up too tilted: game over. We will need to find a better policy.

# Neural Network Policies

Let's create a neural network that will take observations as inputs, and output the action to take for each observation. To choose an action, the network will first estimate a probability for each action, then select an action randomly according to the estimated probabilities. In the case of the Cart-Pole environment, there are just two possible actions (left or right), so we only need one output neuron: it will output the probability `p` of the action 0 (left), and of course the probability of action 1 (right) will be `1 - p`.
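
Before building the network, the sampling step on its own can be sketched in plain NumPy. This is an illustration only: `sample_action` is a hypothetical helper, and the value 0.8 stands in for whatever probability `p` the network would output.

```python
import numpy as np

def sample_action(p_left, rng):
    """Return 0 (left) with probability p_left, otherwise 1 (right)."""
    return 0 if rng.random_sample() < p_left else 1

rng = np.random.RandomState(42)

# Pretend the network estimated p = 0.8 for action 0 (left)
actions = [sample_action(0.8, rng) for _ in range(10000)]
frac_left = actions.count(0) / float(len(actions))
print(frac_left)  # close to 0.8
```

Sampling rather than always picking the most probable action lets the agent keep trying the less likely action from time to time.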

In [37]:
import tensorflow as tf

In [74]:
## Snippet: Understanding Multinomial
p = [[0.4, 0.3, 0.3]] # Estimated probabilities of actions 0, 1 and 2
actions = tf.multinomial(tf.log(p), 1)
# select an action randomly according to the estimated probabilities
with tf.Session() as ses:
    print(ses.run(actions))

[[1]]


In [75]:
import tensorflow as tf

reset_graph()

## 1. Specify the network architecture
The number of inputs is the size of the observation space (four in the case of the CartPole). We use just four hidden units, which is enough for such a simple task, and a single output: the probability of going left.


In [77]:
n_inputs = 4  # == env.observation_space.shape[0]
n_hidden = 4  # it's a simple task, we don't need more than this
n_outputs = 1 # only outputs the probability of accelerating left
initializer = tf.contrib.layers.variance_scaling_initializer()

In [63]:
# tf.contrib.layers.variance_scaling_initializer?

## 2. Build the neural network

In this example, it’s a vanilla Multi-Layer Perceptron, with a single output. Note that the output layer uses the logistic (sigmoid) activation function in order to output a probability from 0.0 to 1.0. If there were more than two possible actions, there would be one output neuron per action, and you would use the softmax activation function instead.
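
The relationship between the two activation functions can be checked numerically: with two actions, a sigmoid applied to a single logit `z` gives the same probability as a softmax over the pair of logits `[z, 0]`. The NumPy sketch below illustrates this; the `sigmoid` and `softmax` helpers are written here for illustration and are not the TensorFlow ops.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

z = 0.7  # arbitrary example logit for "left"
p_left_sigmoid = sigmoid(z)
p_left_softmax = softmax(np.array([z, 0.0]))[0]
print(np.isclose(p_left_sigmoid, p_left_softmax))  # True
```

This is why a single sigmoid output is sufficient for a two-action problem.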


In [79]:
X = tf.placeholder(tf.float32, shape=[None, n_inputs])
hidden = tf.layers.dense(X, n_hidden, activation=tf.nn.elu,
                         kernel_initializer=initializer)
outputs = tf.layers.dense(hidden, n_outputs, activation=tf.nn.sigmoid,
                          kernel_initializer=initializer)

Lastly, we call the `multinomial()` function to pick a random action. This function independently samples one (or more) integers, given the log probability of each integer.

For example, if you call it with the array `[np.log(0.5), np.log(0.2), np.log(0.3)]` and with `num_samples=5`, then it will output five integers, each of which will have a 50% probability of being 0, 20% of being 1, and 30% of being 2.
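
The same example can be reproduced with a NumPy stand-in. This sketch is not how TensorFlow implements `multinomial()`; `sample_from_log_probs` is a hypothetical helper that converts the log probabilities back to probabilities and draws from them.

```python
import numpy as np

def sample_from_log_probs(log_probs, num_samples, rng):
    """Draw integers 0..K-1 with probabilities proportional to exp(log_probs)."""
    log_probs = np.asarray(log_probs, dtype=float)
    probs = np.exp(log_probs - log_probs.max())
    probs /= probs.sum()  # normalize, in case the inputs were unnormalized
    return rng.choice(len(probs), size=num_samples, p=probs)

rng = np.random.RandomState(0)
log_p = [np.log(0.5), np.log(0.2), np.log(0.3)]
print(sample_from_log_probs(log_p, 5, rng))  # five integers, each 0, 1 or 2

# With many samples, the empirical frequencies approach 0.5, 0.2 and 0.3
many = sample_from_log_probs(log_p, 100000, rng)
freqs = np.bincount(many, minlength=3) / float(len(many))
print(freqs)
```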

In [80]:
# 3. Select a random action based on the estimated probabilities
# e.g. outputs [[0.4]] => p_left_and_right [[0.4, 0.6]]
p_left_and_right = tf.concat(axis=1, values=[outputs, 1 - outputs])
action = tf.multinomial(tf.log(p_left_and_right), num_samples=1)

init = tf.global_variables_initializer()

In this particular environment, the **past actions and observations can safely be ignored**, since each observation contains the environment's full state. If there were some hidden state, you might need to consider past actions and observations in order to infer the hidden state of the environment.

For example, if the environment only revealed the position of the cart but not its velocity, you would have to consider not only the current observation but also the previous observation in order to estimate the current velocity. Another example is if the observations are noisy: you may want to use the past few observations to estimate the most likely current state. Our problem is thus as simple as can be: the current observation is noise-free and contains the environment's full state.
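
The velocity example can be made concrete with a finite difference over two consecutive positions. This is only a sketch: `estimate_velocity` is a hypothetical helper, and the default time step of 0.02 seconds is an assumption about the CartPole simulation rather than something stated in this notebook.

```python
def estimate_velocity(prev_position, position, dt=0.02):
    """Finite-difference velocity estimate from two consecutive positions.

    dt is the assumed time between two observations, in seconds.
    """
    return (position - prev_position) / dt

# The cart moved from 0.040 to 0.041 between two steps
v = estimate_velocity(0.040, 0.041)
print(v)  # approximately 0.05
```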

Let's randomly initialize this policy neural network and use it to play one game:


In [83]:
n_max_steps = 1000
frames = []

with tf.Session() as sess:
    init.run()
    print("Started")
    obs = env.reset()
    for step in range(n_max_steps):
        img = render_cart_pole(env, obs)
        frames.append(img)
        action_val = action.eval(feed_dict={X: obs.reshape(1, n_inputs)})
        obs, reward, done, info = env.step(action_val[0][0])
        if done:
            print("Finished")
            break
env.close()

Started
Finished


Now let's look at how well this randomly initialized policy network performed:


In [84]:
len(frames)

11

In [85]:
video = plot_animation(frames)
plt.show()

<IPython.core.display.Javascript object>

Yeah... pretty bad. The neural network will have to learn to do better. First let's see if it is capable of learning the basic policy we used earlier: go left if the pole is tilting left, and go right if it is tilting right. The following code defines the same neural network but we add the target probabilities `y`, and the training operations (`cross_entropy`,  `optimizer` and `training_op`):

In [107]:
import tensorflow as tf

reset_graph()

n_inputs = 4
n_hidden = 4
n_outputs = 1

learning_rate = 0.01

initializer = tf.contrib.layers.variance_scaling_initializer()

X = tf.placeholder(tf.float32, shape=[None, n_inputs])
y = tf.placeholder(tf.float32, shape=[None, n_outputs])

hidden = tf.layers.dense(X, n_hidden, activation=tf.nn.elu, kernel_initializer=initializer)
logits = tf.layers.dense(hidden, n_outputs)
outputs = tf.nn.sigmoid(logits) # probability of action 0 (left)
p_left_and_right = tf.concat(axis=1, values=[outputs, 1 - outputs])
action = tf.multinomial(tf.log(p_left_and_right), num_samples=1)

cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits)
optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(cross_entropy)

init = tf.global_variables_initializer()
saver = tf.train.Saver()
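
To see what the `cross_entropy` op computes, here is a NumPy sketch of the sigmoid cross-entropy with logits, written in the numerically stable form `max(x, 0) - x * z + log(1 + exp(-|x|))` and compared against the textbook definition. The function name and the example values are illustrative assumptions; label 1.0 corresponds to the target "go left", since `outputs` is the probability of action 0.

```python
import numpy as np

def sigmoid_cross_entropy(labels, logits):
    """Numerically stable form: max(x, 0) - x * z + log(1 + exp(-|x|))."""
    x = np.asarray(logits, dtype=float)
    z = np.asarray(labels, dtype=float)
    return np.maximum(x, 0) - x * z + np.log1p(np.exp(-np.abs(x)))

logits = np.array([2.0, -1.0, 0.0])
labels = np.array([1.0, 1.0, 0.0])
loss = sigmoid_cross_entropy(labels, logits)

# Same quantity written directly from the definition
p = 1.0 / (1.0 + np.exp(-logits))
naive = -(labels * np.log(p) + (1 - labels) * np.log(1 - p))
print(np.allclose(loss, naive))  # True
```

The loss is small when the logit agrees with the target (first entry) and larger when it disagrees (second entry).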

We can make the same net play in 10 different environments in parallel, and train for 500 iterations. We also reset environments when they are done.


In [108]:
n_environments = 10
n_iterations = 500

envs = [gym.make("CartPole-v0") for _ in range(n_environments)]
observations = [env.reset() for env in envs]

with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        
        # Defining Targets
        target_probas = np.array([([1.] if obs[2] < 0 else [0.]) for obs in observations]) # if angle<0 we want proba(left)=1., or else proba(left)=0.
        
        # Training Neural Network
        sess.run(training_op, feed_dict={X: np.array(observations), y: target_probas})

        # Inference
        action_val = sess.run(action, feed_dict={X: np.array(observations), y: target_probas})
        
        # Taking action based on inference
        for env_index, env in enumerate(envs):
            obs, reward, done, info = env.step(action_val[env_index][0])
            observations[env_index] = obs if not done else env.reset()
    saver.save(sess, "./my_policy_net_basic.ckpt")

for env in envs:
    env.close()

[2019-05-26 04:14:42,611] Making new env: CartPole-v0
[2019-05-26 04:14:42,612] Making new env: CartPole-v0
[2019-05-26 04:14:42,614] Making new env: CartPole-v0
[2019-05-26 04:14:42,615] Making new env: CartPole-v0
[2019-05-26 04:14:42,616] Making new env: CartPole-v0
[2019-05-26 04:14:42,617] Making new env: CartPole-v0
[2019-05-26 04:14:42,618] Making new env: CartPole-v0
[2019-05-26 04:14:42,619] Making new env: CartPole-v0
[2019-05-26 04:14:42,620] Making new env: CartPole-v0
[2019-05-26 04:14:42,622] Making new env: CartPole-v0


In [33]:
def test_policy_net(model_path, action, X, n_max_steps = 1000, n_games = 10):
    # Relies on the global `saver` and `n_inputs` defined with the graph above.
    # Returns the total number of steps played across all games.
    steps = 0
    envs = [gym.make("CartPole-v0") for i in range(n_games)]
    with tf.Session() as sess:
        # Restore the trained weights
        saver.restore(sess, model_path)
        for env in envs:
            obs = env.reset()
            for step in range(n_max_steps):
                steps += 1
                # Sample an action from the policy network
                action_val = action.eval(feed_dict={X: obs.reshape(1, n_inputs)})
                obs, reward, done, info = env.step(action_val[0][0])
                if done:
                    print("env done!")
                    break
    # Close every environment, not just the last one
    for env in envs:
        env.close()
    return steps

In [110]:
test_policy_net("./my_policy_net_basic.ckpt", action, X, n_games=100)

[2019-05-26 04:14:58,182] Making new env: CartPole-v0
[2019-05-26 04:14:58,184] Making new env: CartPole-v0
[2019-05-26 04:14:58,185] Making new env: CartPole-v0
[2019-05-26 04:14:58,186] Making new env: CartPole-v0
[2019-05-26 04:14:58,187] Making new env: CartPole-v0
[2019-05-26 04:14:58,188] Making new env: CartPole-v0
[2019-05-26 04:14:58,190] Making new env: CartPole-v0
[2019-05-26 04:14:58,191] Making new env: CartPole-v0
[2019-05-26 04:14:58,192] Making new env: CartPole-v0
[2019-05-26 04:14:58,193] Making new env: CartPole-v0
[2019-05-26 04:14:58,194] Making new env: CartPole-v0
[2019-05-26 04:14:58,195] Making new env: CartPole-v0
[2019-05-26 04:14:58,196] Making new env: CartPole-v0
[2019-05-26 04:14:58,197] Making new env: CartPole-v0
[2019-05-26 04:14:58,198] Making new env: CartPole-v0
[2019-05-26 04:14:58,199] Making new env: CartPole-v0
[2019-05-26 04:14:58,200] Making new env: CartPole-v0
[2019-05-26 04:14:58,201] Making new env: CartPole-v0
[2019-05-26 04:14:58,202] Ma

env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!


4866

In [26]:
def render_policy_net(model_path, action, X, n_max_steps = 1000):
    frames = []
    steps = 0
    env = gym.make("CartPole-v0")
    obs = env.reset()
    with tf.Session() as sess:
        #TensorFlow Restore
        saver.restore(sess, model_path)
        for step in range(n_max_steps):
            img = render_cart_pole(env, obs)
            frames.append(img)
            steps += 1
            # Tensorflow evaluate
            action_val = action.eval(feed_dict={X: obs.reshape(1, n_inputs)})
            obs, reward, done, info = env.step(action_val[0][0])
            if done:
                print(steps)
                break
    env.close()
    return frames

In [96]:
frames = render_policy_net("./my_policy_net_basic.ckpt", action, X)

[2019-05-26 04:03:20,802] Making new env: CartPole-v0


39


In [97]:

video = plot_animation(frames)
plt.show()

<IPython.core.display.Javascript object>

Looks like it learned the policy correctly. Now let's see if it can learn a better policy on its own.

# Policy Gradients

To train this neural network we will need to define the target probabilities `y`. If an action is good we should increase its probability, and conversely if it is bad we should reduce it. But how do we know whether an action is good or bad? The problem is that most actions have delayed effects, so when you win or lose points in a game, it is not clear which actions contributed to this result: was it just the last action? Or the last 10? Or just one action 50 steps earlier? This is called the _credit assignment problem_.

The _Policy Gradients_ algorithm tackles this problem by first playing multiple games, then making the actions in good games slightly more likely, while actions in bad games are made slightly less likely. First we play, then we go back and think about what we did.
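
One way to picture "more likely" and "less likely" is to weight each step's gradient by a score for that step and then average. The NumPy sketch below illustrates the idea under stated assumptions: `mean_weighted_gradient` is a hypothetical helper, and the gradients and scores are made-up numbers, not values from the network.

```python
import numpy as np

def mean_weighted_gradient(step_gradients, step_scores):
    """Average per-step gradients, each weighted by that step's score."""
    weighted = [score * np.asarray(grad, dtype=float)
                for grad, score in zip(step_gradients, step_scores)]
    return np.mean(weighted, axis=0)

# Two steps with made-up gradients for a single 2-element variable
grads = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
scores = [1.5, -0.5]  # positive: action judged good; negative: judged bad

update = mean_weighted_gradient(grads, scores)
print(update)  # components 0.75 and -0.25
```

A positive score keeps the direction of that step's gradient, while a negative score reverses it, so the corresponding action is reinforced or discouraged.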

In [35]:
import tensorflow as tf

reset_graph()

n_inputs = 4
n_hidden = 20
n_outputs = 1

learning_rate = 0.01

initializer = tf.contrib.layers.variance_scaling_initializer()

X = tf.placeholder(tf.float32, shape=[None, n_inputs])

hidden = tf.layers.dense(X, n_hidden, activation=tf.nn.elu, kernel_initializer=initializer)
logits = tf.layers.dense(hidden, n_outputs)
outputs = tf.nn.sigmoid(logits)  # probability of action 0 (left)
p_left_and_right = tf.concat(axis=1, values=[outputs, 1 - outputs])
action = tf.multinomial(tf.log(p_left_and_right), num_samples=1)

In [36]:
# Since we are acting as though the chosen action is the best possible
# action, the target probability must be 1.0 if the chosen action
# is action 0 (left) and 0.0 if it is action 1 (right):
y = 1. - tf.to_float(action)

In [37]:
# Let’s start by completing the construction phase.
# Note that we are calling the optimizer’s compute_gradients()
# method instead of the minimize() method.
# This is because we want to tweak the gradients before we apply them.
# The compute_gradients() method returns a list of
# gradient vector/variable pairs (one pair per trainable variable).

cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits)
optimizer = tf.train.AdamOptimizer(learning_rate)


In [38]:
# Let’s put all the gradients in a list, to make 
# it more convenient to obtain their values:
grads_and_vars = optimizer.compute_gradients(cross_entropy)
gradients = [grad for grad, variable in grads_and_vars]

In [39]:
gradient_placeholders = []
grads_and_vars_feed = []
for grad, variable in grads_and_vars:
    gradient_placeholder = tf.placeholder(tf.float32, shape=grad.get_shape())
    gradient_placeholders.append(gradient_placeholder)
    grads_and_vars_feed.append((gradient_placeholder, variable))
training_op = optimizer.apply_gradients(grads_and_vars_feed)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

In [40]:
def discount_rewards(rewards, discount_rate):
    discounted_rewards = np.zeros(len(rewards))
    cumulative_rewards = 0
    for step in reversed(range(len(rewards))):
        cumulative_rewards = rewards[step] + cumulative_rewards * discount_rate
        discounted_rewards[step] = cumulative_rewards
    return discounted_rewards

def discount_and_normalize_rewards(all_rewards, discount_rate):
    all_discounted_rewards = [discount_rewards(rewards, discount_rate) for rewards in all_rewards]
    flat_rewards = np.concatenate(all_discounted_rewards)
    reward_mean = flat_rewards.mean()
    reward_std = flat_rewards.std()
    return [(discounted_rewards - reward_mean)/reward_std for discounted_rewards in all_discounted_rewards]

In [20]:
discount_rewards([10, 0, -50], discount_rate=0.8)

array([-22., -40., -50.])
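
The result above can be checked by hand. Writing $r_t$ for the reward at step $t$, $\gamma = 0.8$ for the discount rate and $R_t$ for the discounted reward (symbols introduced here for illustration), `discount_rewards()` works backward from the last step using $R_t = r_t + \gamma R_{t+1}$:

$$
\begin{aligned}
R_2 &= -50 \\
R_1 &= 0 + 0.8 \times (-50) = -40 \\
R_0 &= 10 + 0.8 \times (-40) = -22
\end{aligned}
$$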

In [21]:
discount_rewards([10, 0, -50, 100], discount_rate=0.8)

array([ 29.2,  24. ,  30. , 100. ])

In [22]:
discount_and_normalize_rewards([[10, 0, -50], [10, 20]], discount_rate=0.8)

[array([-0.28435071, -0.86597718, -1.18910299]),
 array([1.26665318, 1.0727777 ])]

In [23]:
env = gym.make("CartPole-v0")

n_games_per_update = 40
n_max_steps = 1000
n_iterations = 100
save_iterations = 10
discount_rate = 0.95

with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        print("\rIteration: {}".format(iteration), end="")
        all_rewards = []
        all_gradients = []
        for game in range(n_games_per_update):
            current_rewards = []
            current_gradients = []
            obs = env.reset()
            for step in range(n_max_steps):
                action_val, gradients_val = sess.run([action, gradients], feed_dict={X: obs.reshape(1, n_inputs)})
                obs, reward, done, info = env.step(action_val[0][0])
                current_rewards.append(reward)
                current_gradients.append(gradients_val)
                if done:
                    break
            all_rewards.append(current_rewards)
            all_gradients.append(current_gradients)

        all_rewards = discount_and_normalize_rewards(all_rewards, discount_rate=discount_rate)
        feed_dict = {}
        for var_index, gradient_placeholder in enumerate(gradient_placeholders):
            mean_gradients = np.mean([reward * all_gradients[game_index][step][var_index]
                                      for game_index, rewards in enumerate(all_rewards)
                                          for step, reward in enumerate(rewards)], axis=0)
            feed_dict[gradient_placeholder] = mean_gradients
        sess.run(training_op, feed_dict=feed_dict)
        if iteration % save_iterations == 0:
            saver.save(sess, "./my_policy_net_pg_1.ckpt")

[2019-05-26 05:05:49,461] Making new env: CartPole-v0


Iteration: 99

In [24]:
env.close()

In [30]:
frames = render_policy_net("./my_policy_net_pg_1.ckpt", action, X, n_max_steps=1000)
video = plot_animation(frames)
plt.show()

[2019-05-26 05:12:36,646] Making new env: CartPole-v0


196


<IPython.core.display.Javascript object>

In [34]:
test_policy_net("./my_policy_net_pg_1.ckpt", action, X, n_games=100)

[2019-05-26 05:13:52,175] Making new env: CartPole-v0
[2019-05-26 05:13:52,176] Making new env: CartPole-v0
[2019-05-26 05:13:52,178] Making new env: CartPole-v0
[2019-05-26 05:13:52,180] Making new env: CartPole-v0
[2019-05-26 05:13:52,182] Making new env: CartPole-v0
[2019-05-26 05:13:52,183] Making new env: CartPole-v0
[2019-05-26 05:13:52,184] Making new env: CartPole-v0
[2019-05-26 05:13:52,185] Making new env: CartPole-v0
[2019-05-26 05:13:52,187] Making new env: CartPole-v0
[2019-05-26 05:13:52,188] Making new env: CartPole-v0
[2019-05-26 05:13:52,189] Making new env: CartPole-v0
[2019-05-26 05:13:52,190] Making new env: CartPole-v0
[2019-05-26 05:13:52,191] Making new env: CartPole-v0
[2019-05-26 05:13:52,192] Making new env: CartPole-v0
[2019-05-26 05:13:52,193] Making new env: CartPole-v0
[2019-05-26 05:13:52,194] Making new env: CartPole-v0
[2019-05-26 05:13:52,195] Making new env: CartPole-v0
[2019-05-26 05:13:52,196] Making new env: CartPole-v0
[2019-05-26 05:13:52,197] Ma

env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!
env done!


18692