# Policy Gradients

In [chapter 13](http://amzn.to/2Fp48Tr), we're introduced to *policy gradient* methods, which are very powerful tools for reinforcement learning. Rather than learning action values or state values, we attempt to learn a *parameterized policy* which takes input data and maps that to a probability over available actions. If that's not clear, then no worries, we'll break it down step-by-step!

## TL;DR

In this post we'll look at the policy gradient class of algorithms and two algorithms in particular: REINFORCE and REINFORCE with Baseline. We test the two using OpenAI's `CartPole` environment.

## Policies and Action-Values

When we're talking about a reinforcement learning *policy* ($\pi$), all we mean is something that maps our state to an action. A policy can be very simple. Consider a policy for your home: if the temperature of the home (in this case our state) is below $20^{\circ}$C ($68^{\circ}$F), then turn the heat on (action). If it is above $22^{\circ}$C ($71.6^{\circ}$F), then turn the heat off. This is a very basic policy that takes some input (temperature in this case) and turns that into an action (turn the heat on or off). Easy, right?
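
As a rough sketch, the thermostat rule could be written as a plain Python function. The name `thermostat_policy` and the `heater_on` argument are made up for illustration, and leaving the heater unchanged between the two thresholds is an assumption the example above doesn't spell out.

```python
def thermostat_policy(temperature_c, heater_on):
    """Hand-coded policy: map the state (temperature) to an action."""
    if temperature_c < 20:
        return "heat_on"
    if temperature_c > 22:
        return "heat_off"
    # Assumption: between the thresholds, leave the heater as it is
    return "heat_on" if heater_on else "heat_off"


for temp in (18.0, 21.0, 23.5):
    print(temp, thermostat_policy(temp, heater_on=False))
```

Every rule here is written by hand, which is exactly what a learned, parameterized policy lets us avoid.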

Now, when we talk about a *parameterized* policy, we take that same idea except we can represent our policy by a mathematical function that has a series of weights to map our input to an output. We will represent our parameters by the value $\theta$ which could be a vector of linear weights, or all the connections in a neural network (as we'll show in an example). Whatever we choose, the only requirement is that the policy is differentiable with respect to its parameters, $\theta$.
This representation has a big advantage because we don't need to code our policy as a series of if-else statements or explicit rules like the thermostat example. Additionally, we can use the *policy gradient* algorithm to learn our rules. Beyond these obvious reasons, parameterized policies offer a [few benefits versus the action-value methods](https://pdfs.semanticscholar.org/beef/9a0f27e4ca54eb5c5311f6ac90d90fa88f12.pdf) (i.e. [tabular Q-learning](http://mnemstudio.org/path-finding-q-learning-tutorial.htm)) that we've [covered previously](https://www.datahubbs.com/reinforcement-learning/) that make them much more powerful.
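
To make the idea of a parameterized policy concrete, here is a rough sketch of the thermostat as a logistic function of the temperature, $\pi(\text{heat on} \mid s, \theta) = \sigma(\theta_0 + \theta_1 s)$. The function name `heat_on_probability` and the weights in `theta` are invented for illustration; they aren't learned from anything.

```python
import numpy as np


def heat_on_probability(temperature_c, theta):
    """pi(heat_on | temperature, theta) as a logistic function."""
    logit = theta[0] + theta[1] * temperature_c
    return 1.0 / (1.0 + np.exp(-logit))


# Illustrative weights: the probability crosses 0.5 at 21 degrees C
theta = np.array([42.0, -2.0])

for temp in (18.0, 21.0, 24.0):
    print(temp, round(float(heat_on_probability(temp, theta)), 4))
```

The function is smooth in `theta`, so we could nudge the weights with a gradient instead of rewriting if-else rules.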

First, parameterized methods enable learning *stochastic policies* so that actions are taken probabilistically. This is far superior to deterministic methods in situations where the [state may not be fully-observable](https://papers.nips.cc/paper/951-reinforcement-learning-algorithm-for-partially-observable-markov-decision-problems.pdf) - which is the case in many real-world applications. Large problems or continuous problems are also easier to deal with when using parameterized policies because tabular methods would need a clever discretization scheme, often incorporating additional prior knowledge about the environment, or must grow incredibly large in order to handle the problem. The parameterized policy methods also change the policy in a more stable manner than tabular methods. In tabular Q-learning, for example, you are selecting the action that gives the highest expected reward ($\max_a Q(s', a)$, possibly also in an $\epsilon$-greedy fashion), which means if the values change slightly, the actions and trajectories may change radically.

In our examples here, we'll select our actions using a [softmax function](https://en.wikipedia.org/wiki/Softmax_function):

$$\pi(a \mid s, \theta) = \frac{e^{h(s,a,\theta)}}{\sum_b e^{h(s,b,\theta)}}$$

This works well because the output is a probability distribution over the available actions. If we feed it the outputs of a neural network, the actions that we learned produce a better reward get higher preference values $h$ and are therefore more likely to be chosen. In the long run, this will trend towards a deterministic policy, $\pi(a \mid s, \theta) = 1$, but it will continue to explore as long as one of the probabilities doesn't dominate the others (which will likely take some time).
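
Here's a small NumPy sketch of the softmax at work. The preference values are made up, and `softmax_policy` is just a name chosen for this example. It shows how a tiny change flips a greedy choice but barely moves the softmax probabilities, and how the policy drifts toward a deterministic one as one preference pulls ahead.

```python
import numpy as np


def softmax_policy(preferences):
    """Turn action preferences h(s, a, theta) into probabilities."""
    shifted = preferences - np.max(preferences)  # for numerical stability
    exp_prefs = np.exp(shifted)
    return exp_prefs / exp_prefs.sum()


# A tiny change in the preferences flips the greedy (argmax) choice...
h_before = np.array([1.00, 1.01])
h_after = np.array([1.01, 1.00])
print(np.argmax(h_before), np.argmax(h_after))

# ...but barely moves the softmax probabilities
print(softmax_policy(h_before), softmax_policy(h_after))

# As one preference pulls ahead, the policy approaches a deterministic one
for gap in (0.0, 1.0, 5.0, 20.0):
    print(gap, softmax_policy(np.array([0.0, gap])))
```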

### REINFORCE: A First Policy Gradient Algorithm

What we'll call the **REINFORCE** algorithm was part of a family of algorithms first proposed by [Ronald Williams in 1992](http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf). In his original paper, he wasn't able to show that this algorithm converges to a local optimum, although he was quite confident it would. The proof of its convergence came along a few years later in [Richard Sutton's paper on the topic](https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf). With that in place, we know that the algorithm will converge, at least locally, to an optimal policy. 

In order to implement the algorithm, we need to initialize a policy, which we can do with any neural network, select our step-size parameter (often called $\alpha$ or the *learning rate*), and train our agent many times. We update the policy at the end of every episode - like with the [Monte Carlo methods](https://www.datahubbs.com/monte-carlo-simulation-and-reinforcement-learning-1/) - by taking the return from each time step ($G_t$, the discounted sum of the rewards that followed) and multiplying that by our discount factor ($\gamma^t$), the step-size, and the gradient of the log-probability of the action we took ($\nabla_\theta \ln \pi$). The full algorithm looks like this:


#### REINFORCE Algorithm

&nbsp; &nbsp; Input a differentiable policy parameterization $\pi(a \mid s, \theta)$

&nbsp; &nbsp; Define step-size $\alpha > 0$

&nbsp; &nbsp; Initialize policy parameters $\theta \in \rm I\!R^d$

&nbsp; &nbsp; Loop through $n$ episodes (or forever):

&nbsp; &nbsp; &nbsp; &nbsp; Generate an episode $S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T$, following $\pi(a \mid s, \theta)$

&nbsp; &nbsp; &nbsp; &nbsp; For each step $t=0,...,T-1$:

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $G_t \leftarrow$ return from step $t$

&nbsp; &nbsp; &nbsp; &nbsp; At the end of each batch of $N$ episodes:

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Calculate the loss $L(\theta) = -\frac{1}{N} \sum_{t=0}^{T-1} \gamma^t G_t \ln \pi(A_t \mid S_t, \theta)$

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Update policy parameters through backpropagation: $\theta := \theta - \alpha \nabla_\theta L(\theta)$
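
To make the loss and update steps concrete, here's a minimal NumPy sketch for a *linear* softmax policy, where the gradient can be written out by hand. Everything in it is assumed for illustration: the `reinforce_loss_and_grad` name, the random states and the made-up returns. It also leaves out the $\gamma^t$ factor and works on a single episode.

```python
import numpy as np


def softmax(h):
    z = np.exp(h - h.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)


def reinforce_loss_and_grad(theta, states, actions, returns):
    """Loss -mean(G_t * ln pi(A_t | S_t)) and its gradient w.r.t. theta."""
    probs = softmax(states @ theta)            # shape (T, n_actions)
    T = len(actions)
    log_probs = np.log(probs[np.arange(T), actions])
    loss = -np.mean(returns * log_probs)
    one_hot = np.eye(theta.shape[1])[actions]
    # d ln pi(a|s) / d theta = outer(s, one_hot(a) - pi(.|s))
    grad = -(states.T @ ((one_hot - probs) * returns[:, None])) / T
    return loss, grad


rng = np.random.default_rng(0)
theta = np.zeros((4, 2))                       # 4 state features, 2 actions
states = rng.normal(size=(5, 4))               # made-up episode of 5 steps
actions = np.array([0, 1, 1, 0, 1])
returns = np.array([4.9, 3.9, 2.9, 2.0, 1.0])  # made-up returns G_t

alpha = 0.01
loss_before, grad = reinforce_loss_and_grad(theta, states, actions, returns)
theta = theta - alpha * grad                   # gradient descent on the loss
loss_after, _ = reinforce_loss_and_grad(theta, states, actions, returns)
print(loss_before, loss_after)
```

With a small step size the loss should go down after the update, which means the actions followed by larger returns have become a little more likely.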

My formulation differs slightly from Sutton's book, but I think it makes it easier to understand when it comes time to implement (take a look at section [13.3](http://amzn.to/2Fp48Tr) if you want to see the derivation and full write-up he has). Thankfully, we can use some modern tools like TensorFlow when implementing this so we don't need to worry about calculating the derivative with respect to the parameters ($\nabla_\theta$). So, with that, let's get this going with an OpenAI implementation of the classic Cart-Pole problem.

### Cart-Pole

![Image of Cart-Pole](https://www.datahubbs.com/wp-content/uploads/2018/04/cart_pole.jpg)

Just for a quick refresher here, the goal of [Cart-Pole](https://gym.openai.com/envs/CartPole-v0/) is to keep the pole in the air for as long as possible. Your agent needs to determine whether to push the cart to the left or the right to keep it balanced while not going over the edges on the left and right. If you don't have OpenAI's library installed yet, just run `pip install gym` and you should be set.

Go ahead and import some packages:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import gym
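# Note: this post uses the TensorFlow 1.x API (tf.contrib, tf.placeholder,
# tf.Session); tf.contrib was removed in TensorFlow 2.0
import tensorflow as tf
from tensorflow.contrib.layers import fully_connected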
%matplotlib inline
import warnings

To set this up, we'll implement REINFORCE using a shallow, two-layer neural network with [ReLU activation functions](https://medium.com/the-theory-of-everything/understanding-activation-functions-in-neural-networks-9491262884e0) and the aforementioned softmax output. To do this, we'll build a class called `policy_estimator` and a separate function called `reinforce` that we'll use to train the policy estimation network. The network will have methods to allow us to get the current gradients, make predictions (which will return the action probabilities for the current state), and update the network at the end of each batch of episodes.

In [None]:
class policy_estimator(object):
    
    def __init__(self, sess, env):
        # Pass TensorFlow session object
        self.sess = sess
        # Get number of inputs and outputs from environment
        self.n_inputs = env.observation_space.shape[0]
        self.n_outputs = env.action_space.n
        self.learning_rate = 0.01
        
        # Define number of hidden nodes
        self.n_hidden_nodes = 16
        
        # Set graph scope name
        self.scope = "policy_estimator"
        
        # Create network
        with tf.variable_scope(self.scope):
            initializer = tf.contrib.layers.xavier_initializer()
            
            # Define placeholder tensors for state, actions, and rewards
            self.state = tf.placeholder(tf.float32, [None, self.n_inputs], 
                                        name='state')
            self.rewards = tf.placeholder(tf.float32, [None], name='rewards')
            self.actions = tf.placeholder(tf.int32, [None], name='actions')
            
            layer_1 = fully_connected(self.state, self.n_hidden_nodes,
                                      activation_fn=tf.nn.relu,
                                      weights_initializer=initializer)
            output_layer = fully_connected(layer_1, self.n_outputs,
                                           activation_fn=None,
                                           weights_initializer=initializer)
            
            # Get probability of each action
            self.action_probs = tf.squeeze(
                tf.nn.softmax(output_layer - tf.reduce_max(output_layer)))
            
            # Get indices of actions
            indices = tf.range(0, tf.shape(output_layer)[0]) \
                * tf.shape(output_layer)[1] + self.actions
                
            selected_action_prob = tf.gather(tf.reshape(self.action_probs, [-1]),
                                             indices)
    
            # Define loss function
            self.loss = -tf.reduce_mean(tf.log(selected_action_prob) * self.rewards)

            # Get gradients and variables
            self.tvars = tf.trainable_variables(self.scope)
            self.gradient_holder = []
            for j, var in enumerate(self.tvars):
                self.gradient_holder.append(tf.placeholder(tf.float32, 
                    name='grads' + str(j)))
            
            self.gradients = tf.gradients(self.loss, self.tvars)
            
            # Minimize training error
            self.optimizer = tf.train.AdamOptimizer(self.learning_rate)
            self.train_op = self.optimizer.apply_gradients(
                zip(self.gradient_holder, self.tvars))
            
    def predict(self, state):
        probs = self.sess.run([self.action_probs], 
                              feed_dict={
                                  self.state: state
                              })[0]
        return probs
    
    def update(self, gradient_buffer):
        feed = dict(zip(self.gradient_holder, gradient_buffer))
        self.sess.run([self.train_op], feed_dict=feed)

    def get_vars(self):
        net_vars = self.sess.run(tf.trainable_variables(self.scope))
        return net_vars

    def get_grads(self, states, actions, rewards):
        grads = self.sess.run([self.gradients], 
            feed_dict={
            self.state: states,
            self.actions: actions,
            self.rewards: rewards
            })[0]
        return grads   

The `predict` method defined in the network above returns the output of the softmax layer, meaning we will get a probability distribution over the action space when we call it. When training, we can then sample the actions according to the probabilities that are output, so for the Cart-Pole, where we have two actions (left and right), we may get something like `[0.1, 0.9]` out. This would mean that we have a 10% chance of selecting action `0` and a 90% chance of selecting action `1`. You'll also notice that we have two other methods called `get_vars` and `get_grads`. The `get_vars` method returns a list of all of the trainable variables (i.e. the weights and biases in the network), which we'll use to set up a *gradient buffer* to store the gradients as we train. The buffer is filled by the output of the `get_grads` method. We do this because we'll update our policy in batches, and accumulating the gradients over several episodes before each update helps smooth out the training.
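
If the sampling step feels abstract, this short sketch draws many actions from the `[0.1, 0.9]` example and checks how often each one comes up. The seed and the number of draws are arbitrary choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
action_space = np.arange(2)            # 0 = left, 1 = right
action_probs = np.array([0.1, 0.9])    # what predict() might return

# Sample actions in proportion to their probabilities
actions = rng.choice(action_space, size=10_000, p=action_probs)
frequencies = np.bincount(actions, minlength=2) / len(actions)
print(frequencies)
```

The observed frequencies should land close to 10% and 90%, so the less likely action still gets tried from time to time.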

We also want to [discount the rewards](https://cs.stackexchange.com/questions/44905/the-meaning-of-discount-factor-on-reinforcement-learning) that our agent receives, so we can do that with a brief function:

In [None]:
def discount_rewards(rewards, gamma):
    discounted_rewards = np.zeros(len(rewards))
    cumulative_rewards = 0
    for i in reversed(range(0, len(rewards))):
        cumulative_rewards = cumulative_rewards * gamma + rewards[i]
        discounted_rewards[i] = cumulative_rewards
    return discounted_rewards
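
As a sanity check on the recursion above, we can also compute each return straight from its definition, $G_t = \sum_k \gamma^k R_{t+k+1}$. The helper below is slower, but should give the same numbers as `discount_rewards`. The name `returns_by_definition` is made up for this sketch.

```python
import numpy as np


def returns_by_definition(rewards, gamma):
    """G_t = sum_k gamma**k * R_{t+k+1}, computed directly for each t."""
    rewards = np.asarray(rewards, dtype=float)
    T = len(rewards)
    returns = np.zeros(T)
    for t in range(T):
        discounts = gamma ** np.arange(T - t)
        returns[t] = np.sum(discounts * rewards[t:])
    return returns


print(returns_by_definition([1, 1, 1, 1], gamma=0.5))
```

For four rewards of 1 and $\gamma = 0.5$, the returns work out to 1.875, 1.75, 1.5 and 1.0: the earlier the step, the more future reward it gets credit for.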

With the policy estimation network in place, it's just a matter of setting up the REINFORCE algorithm and letting it run. For this, we'll define a function called `reinforce` that takes the environment and the policy estimation class as inputs and runs the algorithm.

In [None]:
def reinforce(env, policy_estimator, num_episodes=2000,
              batch_size=10, gamma=0.99):
    
    total_rewards = []
    
    # Set up gradient buffers and set values to 0
    grad_buffer_pe = policy_estimator.get_vars()
    for i, g in enumerate(grad_buffer_pe):
        grad_buffer_pe[i] = g * 0
        
    # Get possible actions
    action_space = np.arange(env.action_space.n)
        
    for ep in range(num_episodes):
        # Get initial state
        s_0 = env.reset()
        reward = 0
        episode_log = []
        complete = False
        
        # Run through each episode
        while complete == False:
            
            # Get the probabilities over the actions
            action_probs = policy_estimator.predict(
                s_0.reshape(1,-1))
            # Stochastically select the action
            action = np.random.choice(action_space,
                                      p=action_probs)
            # Take a step
            s_1, r, complete, _ = env.step(action)
            
            # Append results to the episode log
            episode_log.append([s_0, action, r, s_1])
            s_0 = s_1
            
            # If complete, store results and calculate the gradients
            if complete:
                # dtype=object because each row mixes arrays and scalars
                episode_log = np.array(episode_log, dtype=object)
                
                # Store raw rewards and discount episode rewards
                total_rewards.append(episode_log[:,2].sum())
                discounted_rewards = discount_rewards(
                    episode_log[:,2], gamma)
                
                # Calculate the gradients for the policy estimator and
                # add to buffer
                pe_grads = policy_estimator.get_grads(
                    states=np.vstack(episode_log[:,0]),
                    actions=episode_log[:,1],
                    rewards=discounted_rewards)
                for i, g in enumerate(pe_grads):
                    grad_buffer_pe[i] += g
                    
        # Update policy gradients once every batch_size episodes
        if (ep + 1) % batch_size == 0:
            policy_estimator.update(grad_buffer_pe)
            # Clear buffer values for next batch
            for i, g in enumerate(grad_buffer_pe):
                grad_buffer_pe[i] = g * 0
                
    return total_rewards

Now that everything is in place, we can train it and check the output.

In [None]:
env = gym.make('CartPole-v0')
tf.reset_default_graph()
sess = tf.Session()

pe = policy_estimator(sess, env)

# Initialize variables
init = tf.global_variables_initializer()
sess.run(init)

rewards = reinforce(env, pe)

![Image of REINFORCE](https://www.datahubbs.com/wp-content/uploads/2018/04/reinforce.png)

We can look at the performance either by viewing the raw rewards, or by taking a look at a moving average (which looks much cleaner).

In [None]:
smoothed_rewards = [np.mean(rewards[max(0,i-10):i+1]) for i in range(len(rewards))]

plt.figure(figsize=(12,8))
plt.plot(smoothed_rewards)
plt.title('REINFORCE with Policy Estimation')
plt.show()

### REINFORCE with Baseline

There's a bit of a tradeoff for the simplicity of the straightforward REINFORCE algorithm implementation we did above. Namely, there's a high variance in the gradient estimation. This can be addressed by introducing a baseline approximation that estimates the value of the state and compares that to the actual returns garnered. Sutton refers to this as **REINFORCE with Baseline**. The algorithm is nearly identical; however, for updating the network parameters, we now have:

$$\theta_p := \theta_p + \alpha_{p}\gamma^t \delta \nabla_{\theta_p} \ln \pi(A_t \mid S_t, \theta_p)$$

Where $\delta$ is the difference between the actual value and the predicted value at that given state:

$$\delta = G_t - v(S_t, \theta_v)$$
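
To see why subtracting a baseline helps, here's a toy calculation on a two-action, single-state problem. The returns, the 50/50 policy and the constant baseline of 10.5 are all made up, and the constant stands in for $v(S_t, \theta_v)$. The sketch computes the exact mean and variance of the gradient estimate for one action preference, with and without the baseline.

```python
import numpy as np

probs = np.array([0.5, 0.5])        # pi(a) for the two actions
returns = np.array([10.0, 11.0])    # return observed after each action


def gradient_mean_and_variance(baseline):
    """Exact mean and variance of (G - b) * d ln pi(a) / d h_1."""
    # For a softmax policy, d ln pi(a) / d h_1 = 1[a == 1] - pi(1)
    grad_log_pi = np.array([0.0, 1.0]) - probs[1]
    estimates = (returns - baseline) * grad_log_pi
    mean = np.sum(probs * estimates)
    variance = np.sum(probs * (estimates - mean) ** 2)
    return mean, variance


for baseline in (0.0, 10.5):
    print(baseline, gradient_mean_and_variance(baseline))
```

Both cases give the same mean gradient (0.25), but the baseline drops the variance from about 27.6 to zero in this toy case. Real problems won't be this clean, but the direction of the effect is the point.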

Note that I introduced the subscripts $p$ and $v$ to differentiate between the policy estimation function and the value estimation function that we'll be using. Looking at the algorithm, we now have:

#### REINFORCE Algorithm with Baseline

&nbsp; &nbsp; Input a differentiable policy parameterization $\pi(a \mid s, \theta_p)$

&nbsp; &nbsp; Input a differentiable state-value function parameterization $v(s, \theta_v)$

&nbsp; &nbsp; Define step-size $\alpha_p > 0$, $\alpha_v > 0$

&nbsp; &nbsp; Initialize policy parameters $\theta_p \in \rm I\!R^d$ and value parameters $\theta_v \in \rm I\!R^{d'}$

&nbsp; &nbsp; Loop through $n$ episodes (or forever):

&nbsp; &nbsp; &nbsp; &nbsp; Generate an episode $S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T$, following $\pi(a \mid s, \theta_p)$

&nbsp; &nbsp; &nbsp; &nbsp; For each step $t=0,...,T-1$:

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $G_t \leftarrow$ return from step $t$

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $\delta \leftarrow G_t - v(S_t, \theta_v)$

&nbsp; &nbsp; &nbsp; &nbsp; At the end of each batch of $N$ episodes:

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Calculate the loss $L(\theta_v) = \frac{1}{N} \sum_{t=0}^{T-1} (G_t - v(S_t, \theta_v))^2$

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Calculate the loss $L(\theta_p) = -\frac{1}{N} \sum_{t=0}^{T-1} \gamma^t \delta \ln \pi(A_t \mid S_t, \theta_p)$

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Update policy parameters through backpropagation: $\theta_p := \theta_p - \alpha_p \nabla_{\theta_p} L(\theta_p)$

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Update value parameters through backpropagation: $\theta_v := \theta_v - \alpha_v \nabla_{\theta_v} L(\theta_v)$

To implement this, we can represent our value estimation function by a second neural network. It will be very similar to the first network except instead of getting a probability over actions, we're trying to estimate the value of being in that given state.
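
Under the hood, the value estimator is just doing regression: it gets pushed toward the observed returns by minimizing a squared error. Here's a stripped-down sketch with a linear value function in NumPy in place of the neural network. The synthetic states, the `true_w` weights used to generate the targets, and the step size are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
states = rng.normal(size=(200, 4))            # synthetic state features
true_w = np.array([1.0, -2.0, 0.5, 3.0])      # hypothetical "true" weights
returns = states @ true_w                     # synthetic regression targets

w = np.zeros(4)                               # value function parameters
alpha = 0.1
for _ in range(500):
    predictions = states @ w                  # v(s, w) for every state
    # Gradient of the mean squared error between predictions and returns
    grad = 2 * states.T @ (predictions - returns) / len(returns)
    w -= alpha * grad

final_loss = np.mean((states @ w - returns) ** 2)
print(w, final_loss)
```

The network version below follows the same recipe, only with a hidden layer and TensorFlow computing the gradient for us.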

In [None]:
class value_estimator(object):
    
    def __init__(self, sess, env):
        # Pass TensorFlow session object
        self.sess = sess
        # Get number of inputs and outputs from environment
        self.n_inputs = env.observation_space.shape[0]
        self.n_outputs = 1
        self.learning_rate = 0.01
        
        # Define number of hidden nodes
        self.n_hidden_nodes = 16
        
        # Set graph scope name
        self.scope = "value_estimator"
        
        # Create network
        with tf.variable_scope(self.scope):
            initializer = tf.contrib.layers.xavier_initializer()
            
            # Define placeholder tensors for state and rewards
            self.state = tf.placeholder(tf.float32, [None, self.n_inputs], 
                                        name='state')
            self.rewards = tf.placeholder(tf.float32, [None], name='rewards')
            
            layer_1 = fully_connected(self.state, self.n_hidden_nodes,
                                      activation_fn=tf.nn.relu,
                                      weights_initializer=initializer)
            output_layer = fully_connected(layer_1, self.n_outputs,
                                           activation_fn=None,
                                           weights_initializer=initializer)
            
            self.state_value_estimation = tf.squeeze(output_layer)
    
            # Define loss function as squared difference between estimate and 
            # actual
            self.loss = tf.reduce_mean(tf.squared_difference(
                self.state_value_estimation, self.rewards))

            # Get gradients and variables
            self.tvars = tf.trainable_variables(self.scope)
            self.gradient_holder = []
            for j, var in enumerate(self.tvars):
                self.gradient_holder.append(tf.placeholder(tf.float32, 
                    name='grads' + str(j)))
            
            self.gradients = tf.gradients(self.loss, self.tvars)
            
            # Minimize training error
            self.optimizer = tf.train.AdamOptimizer(self.learning_rate)
            self.train_op = self.optimizer.apply_gradients(
                zip(self.gradient_holder, self.tvars))
            
    def predict(self, state):
        value_est = self.sess.run([self.state_value_estimation], 
                              feed_dict={
                                  self.state: state
                              })[0]
        return value_est
    
    def update(self, gradient_buffer):
        feed = dict(zip(self.gradient_holder, gradient_buffer))
        self.sess.run([self.train_op], feed_dict=feed)

    def get_vars(self):
        net_vars = self.sess.run(tf.trainable_variables(self.scope))
        return net_vars

    def get_grads(self, states, rewards):
        grads = self.sess.run([self.gradients], 
            feed_dict={
            self.state: states,
            self.rewards: rewards
            })[0]
        return grads   

The `value_estimator` class above is just like the `policy_estimator` class, except I've removed any reference to actions, changed the loss function to the mean squared difference between the prediction and the actual returns, and ensured that the network outputs only a single value. To use it, we need to make a slight change to our `reinforce` function to account for the `value_estimator`.

In [None]:
def reinforce_baseline(env, policy_estimator, value_estimator,
                       num_episodes=2000, batch_size=10, gamma=0.99):
    
    total_rewards = []
    
    # Set up gradient buffers and set values to 0
    # Policy estimation buffer
    grad_buffer_pe = policy_estimator.get_vars()
    for i, g in enumerate(grad_buffer_pe):
        grad_buffer_pe[i] = g * 0
    # Value estimation buffer
    grad_buffer_ve = value_estimator.get_vars()
    for i, g in enumerate(grad_buffer_ve):
        grad_buffer_ve[i] = g * 0
        
    # Get possible actions
    action_space = np.arange(env.action_space.n)
        
    for ep in range(num_episodes):
        # Get initial state
        s_0 = env.reset()
        reward = 0
        episode_log = []
        # Log value estimation
        complete = False
        
        # Run through each episode
        while complete == False:
            
            # Get the probabilities over the actions
            action_probs = policy_estimator.predict(
                s_0.reshape(1,-1))
            
            # Estimate the value
            value_est = value_estimator.predict(
                s_0.reshape(1,-1))
            
            # Stochastically select the action
            action = np.random.choice(action_space,
                                      p=action_probs)
            # Take a step
            s_1, r, complete, _ = env.step(action)
            
            # Append results to the episode log, including the value
            # estimate for the state we acted from
            episode_log.append([s_0, action, float(value_est), r, s_1])
            s_0 = s_1

            # If complete, store results and calculate the gradients
            if complete:
                episode_log = np.array(episode_log, dtype=object)

                # Store raw rewards and compute the discounted returns G_t
                total_rewards.append(episode_log[:,3].sum())
                discounted_rewards = discount_rewards(
                    episode_log[:,3], gamma)
                # delta = G_t - v(S_t): the return minus the baseline
                deltas = discounted_rewards - episode_log[:,2].astype(float)

                # Calculate the gradients for the policy estimator using
                # the baseline-corrected returns and add to buffer
                pe_grads = policy_estimator.get_grads(
                    states=np.vstack(episode_log[:,0]),
                    actions=episode_log[:,1],
                    rewards=deltas)
                for i, g in enumerate(pe_grads):
                    grad_buffer_pe[i] += g

                # Calculate the gradients for the value estimator, which
                # is fit to the observed returns, and add to buffer
                ve_grads = value_estimator.get_grads(
                    states=np.vstack(episode_log[:,0]),
                    rewards=discounted_rewards)
                for i, g in enumerate(ve_grads):
                    grad_buffer_ve[i] += g
                    
        # Update policy gradients once every batch_size episodes
        if (ep + 1) % batch_size == 0:
            policy_estimator.update(grad_buffer_pe)
            value_estimator.update(grad_buffer_ve)
            
            # Clear buffer values for next batch
            for i, g in enumerate(grad_buffer_pe):
                grad_buffer_pe[i] = g * 0
                
            for i, g in enumerate(grad_buffer_ve):
                grad_buffer_ve[i] = g * 0
                
    return total_rewards

The `reinforce_baseline` function is nearly identical to the prior algorithm; the only things we added here are the value estimation commands and the updates to the value estimation network. To run this, we can follow the same pattern as above.

In [None]:
env = gym.make('CartPole-v0')
tf.reset_default_graph()
sess = tf.Session()

pe = policy_estimator(sess, env)
ve = value_estimator(sess, env)

# Initialize variables
init = tf.global_variables_initializer()
sess.run(init)

rewards = reinforce_baseline(env, pe, ve)

smoothed_rewards = [np.mean(rewards[max(0,i-10):i+1]) for i in range(len(rewards))]

plt.figure(figsize=(12,8))
plt.plot(smoothed_rewards)
plt.title('REINFORCE with Policy Baseline')
plt.show()

![Image of REINFORCE with baseline](https://www.datahubbs.com/wp-content/uploads/2018/04/reinforce_baseline.png)

The baseline slows the algorithm a bit, but does it provide any benefits? Let's run these multiple times and take a look to see if we can spot any difference between the training rates for REINFORCE and REINFORCE with Baseline.

**Warning: this may take some time, particularly if you're not utilizing GPUs!**

In [None]:
env = gym.make('CartPole-v0')

N = 50 # Number of training runs
num_episodes = 2000
pe_rewards = np.zeros(num_episodes)
pe_baseline_rewards = np.zeros(num_episodes)

for n in range(N):
    tf.reset_default_graph()
    sess = tf.Session()

    pe = policy_estimator(sess, env)

    # Initialize variables
    init = tf.global_variables_initializer()
    sess.run(init) 
    
    # Train model
    rewards = reinforce(env, pe, num_episodes)
    pe_rewards += rewards

for n in range(N):
    tf.reset_default_graph()
    sess = tf.Session()

    pe = policy_estimator(sess, env)
    ve = value_estimator(sess, env)

    # Initialize variables
    init = tf.global_variables_initializer()
    sess.run(init)    
    
    # Train model
    baseline_rewards = reinforce_baseline(env, pe, ve, num_episodes)
    pe_baseline_rewards += baseline_rewards

pe_rewards /= N
pe_baseline_rewards /= N

plt.figure(figsize=(12,8))
plt.plot(pe_rewards, label='Policy Estimation')
plt.plot(pe_baseline_rewards, label='Policy Estimation with Baseline')
plt.legend(loc='best')
plt.title('Comparison of REINFORCE Algorithms for Cart-Pole')
plt.show()

![Image of reinforce_baseline_comparison](https://www.datahubbs.com/wp-content/uploads/2018/04/reinforce_baseline_comp2.png)

For this example and set-up, the results don't show a significant difference one way or another. In general, however, the REINFORCE with Baseline algorithm learns faster as a result of the reduced variance of its gradient estimates. That being said, there are additional hyperparameters to tune in such a case, such as the learning rate for the value estimation, the number of layers (if we utilize a neural network as we did in this case), activation functions, etc.

This algorithm was implemented in a Monte Carlo fashion, providing updates only at the end of every $N$ episodes. This keeps the convergence guarantees discussed earlier, but does tend to make learning slower than on-line policy gradient algorithms. Regardless, the policy gradient approach is very powerful, and the REINFORCE algorithms shown here are relatively easy to implement and hopefully give you a good taste of the terminology as well as the power of this class of algorithms. This approach can be further modified ([Schulman 2016](https://arxiv.org/pdf/1506.02438.pdf)) and used in a variety of different environments, both continuous and discrete.

Leave any questions or comments below!