<a href="https://colab.research.google.com/github/prasadchelsea33/ReinforcementLearning/blob/master/DFL_TRPO_REINFORCE_CartPole_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# REINFORCE CartPole Exercise

*Contributors and Authors*: [**Ethan Holly**](https://www.linkedin.com/in/ethan-holly-3a181539/) wrote this fantastic guide to REINFORCE and the solution in Colab form, [**Charles Weill**](https://www.linkedin.com/in/charles-weill-9426323a/) implemented environment visualization for the Colab, and **Surya Bhupatiraju** ported it for public use for Depth-First Learning (DFL).

*NOTE: The solution to this exercise (along with very pretty visualizations) is [here](https://drive.google.com/file/d/1UFWxcu42B_x262NbrUwTH-r7A3BCrfm2/view?usp=sharing). Please give this exercise an earnest attempt before peeking at the solutions!*  

---

In this exercise, you will implement and experiment with a policy gradient RL algorithm and environment.

The environment is CartPole-v0 from OpenAI Gym: https://gym.openai.com/envs/CartPole-v0


## Exercise

The following code blocks contain the parts of the REINFORCE algorithm. Fill in the parts marked TODO, and then run the training block below. Training should take up to 2 min, and your performance curve should approach 200 reward per episode. Then investigate the questions next to the training block.

## Extra Challenges:

* Once you finish this exercise, try to **swap out CartPole-v0 for Pendulum-v0**. CartPole is a simple discrete-action environment, while Pendulum is a simple continuous-action environment. You will need to make some modifications to the code to make it work. Try it out!
* Try to implement some more complex policy gradient features:
  * **Actor-Critic**: Uses a neural network (or other function approximator) to estimate the return instead of using the empirical return. This smooths out the return estimates and gives a lower-variance policy gradient estimate. However, the estimate is now biased by the inaccuracy of the "Critic" network. The critic can also do some neat things like sharing hidden layers with the policy, which can speed up training. See: https://arxiv.org/abs/1602.01783
  * **Generalized Advantage Estimation (GAE)**: Uses a weighted sum of advantage estimates over future timesteps. The lambda parameter trades off between short-horizon and long-horizon estimates, which in turn trades off bias against variance in the policy gradient estimate. For an intuitive explanation, see: http://www.breloff.com/DeepRL-OnlineGAE/
  * **Proximal Policy Optimization (PPO)**: Policy gradient methods are on-policy, which means that you can only update a policy with data sampled from that policy; otherwise the state distribution wouldn't match and you would have a poor estimate of the expected return! However, we can use importance sampling to correct for this drift in sampling, and use one batch of data to do multiple updates to the policy. PPO is an algorithm that does this effectively, while maintaining stable training. See: https://blog.openai.com/openai-baselines-ppo/#ppo
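
As a starting point for the GAE challenge, here is a minimal NumPy sketch of the weighted sum described above. It is an illustration under stated assumptions, not part of the exercise code: the function name `gae_advantages`, the default values of `gamma` and `lam`, and the choice to treat the value after the last step as 0 are all assumptions made for this example, and the `values` array is assumed to come from some critic you train separately.

```python
import numpy as np


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Sketch of Generalized Advantage Estimation for one finished episode.

    rewards: r_0 .. r_{T-1}.
    values:  critic estimates V(s_0) .. V(s_{T-1}) (hypothetical critic).
    Assumption: the value after the final step is 0.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)

    # One-step TD errors: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    next_values = np.append(values[1:], 0.0)
    deltas = rewards + gamma * next_values - values

    # Accumulate backwards: A_t = delta_t + gamma * lam * A_{t+1}.
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```

With `lam=0` each advantage is just the one-step TD error (low variance, more bias from the critic); with `lam=1` it becomes the discounted empirical return minus the critic's value (no critic bias in the return, higher variance).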

In [0]:
# @title (Run This) Imports
import os
import gym
import sys
from gym import spaces
import numpy as np
import pyglet
pyglet.options['search_local_libs']=False
pyglet.options['shadow_window']=False
from pyglet.window import xlib
xlib._have_utf8 = False
import pyglet.window
import tensorflow as tf
from tensorflow.contrib import slim

In [0]:
# @title (Run This) Visualization Source Code
# Source: https://github.com/jakevdp/JSAnimation
import os
import sys
import random
import string
import warnings
if sys.version_info < (3, 0):
    from cStringIO import StringIO as InMemory
else:
    from io import BytesIO as InMemory
from matplotlib.animation import writers, FileMovieWriter
from base64 import b64encode


JS_INCLUDE = """
<script language="javascript">
  /* Define the Animation class */
  function Animation(frames, img_id, slider_id, interval, loop_select_id){
    this.img_id = img_id;
    this.slider_id = slider_id;
    this.loop_select_id = loop_select_id;
    this.interval = interval;
    this.current_frame = 0;
    this.direction = 0;
    this.timer = null;
    this.frames = new Array(frames.length);

    for (var i=0; i<frames.length; i++)
    {
     this.frames[i] = new Image();
     this.frames[i].src = frames[i];
    }
    document.getElementById(this.slider_id).max = this.frames.length - 1;
    this.set_frame(this.current_frame);
  }

  Animation.prototype.get_loop_state = function(){
    var button_group = document[this.loop_select_id].state;
    for (var i = 0; i < button_group.length; i++) {
        var button = button_group[i];
        if (button.checked) {
            return button.value;
        }
    }
    return undefined;
  }

  Animation.prototype.set_frame = function(frame){
    this.current_frame = frame;
    document.getElementById(this.img_id).src = this.frames[this.current_frame].src;
    document.getElementById(this.slider_id).value = this.current_frame;
  }

  Animation.prototype.next_frame = function()
  {
    this.set_frame(Math.min(this.frames.length - 1, this.current_frame + 1));
  }

  Animation.prototype.previous_frame = function()
  {
    this.set_frame(Math.max(0, this.current_frame - 1));
  }

  Animation.prototype.first_frame = function()
  {
    this.set_frame(0);
  }

  Animation.prototype.last_frame = function()
  {
    this.set_frame(this.frames.length - 1);
  }

  Animation.prototype.slower = function()
  {
    this.interval /= 0.7;
    if(this.direction > 0){this.play_animation();}
    else if(this.direction < 0){this.reverse_animation();}
  }

  Animation.prototype.faster = function()
  {
    this.interval *= 0.7;
    if(this.direction > 0){this.play_animation();}
    else if(this.direction < 0){this.reverse_animation();}
  }

  Animation.prototype.anim_step_forward = function()
  {
    this.current_frame += 1;
    if(this.current_frame < this.frames.length){
      this.set_frame(this.current_frame);
    }else{
      var loop_state = this.get_loop_state();
      if(loop_state == "loop"){
        this.first_frame();
      }else if(loop_state == "reflect"){
        this.last_frame();
        this.reverse_animation();
      }else{
        this.pause_animation();
        this.last_frame();
      }
    }
  }

  Animation.prototype.anim_step_reverse = function()
  {
    this.current_frame -= 1;
    if(this.current_frame >= 0){
      this.set_frame(this.current_frame);
    }else{
      var loop_state = this.get_loop_state();
      if(loop_state == "loop"){
        this.last_frame();
      }else if(loop_state == "reflect"){
        this.first_frame();
        this.play_animation();
      }else{
        this.pause_animation();
        this.first_frame();
      }
    }
  }

  Animation.prototype.pause_animation = function()
  {
    this.direction = 0;
    if (this.timer){
      clearInterval(this.timer);
      this.timer = null;
    }
  }

  Animation.prototype.play_animation = function()
  {
    this.pause_animation();
    this.direction = 1;
    var t = this;
    if (!this.timer) this.timer = setInterval(function(){t.anim_step_forward();}, this.interval);
  }

  Animation.prototype.reverse_animation = function()
  {
    this.pause_animation();
    this.direction = -1;
    var t = this;
    if (!this.timer) this.timer = setInterval(function(){t.anim_step_reverse();}, this.interval);
  }
</script>
"""


DISPLAY_TEMPLATE = """
<div class="animation" align="center">
    <img id="_anim_img{id}">
    <br>
    <input id="_anim_slider{id}" type="range" style="width:350px" name="points" min="0" max="1" step="1" value="0" onchange="anim{id}.set_frame(parseInt(this.value));"></input>
    <br>
    <button onclick="anim{id}.slower()">&#8211;</button>
    <button onclick="anim{id}.first_frame()"><img class="anim_icon" src="https://github.com/jakevdp/JSAnimation/blob/master/JSAnimation/icons/first.png?raw=true"></button>
    <button onclick="anim{id}.previous_frame()"><img class="anim_icon" src="https://github.com/jakevdp/JSAnimation/blob/master/JSAnimation/icons/prev.png?raw=true"></button>
    <button onclick="anim{id}.reverse_animation()"><img class="anim_icon" src="https://github.com/jakevdp/JSAnimation/blob/master/JSAnimation/icons/reverse.png?raw=true"></button>
    <button onclick="anim{id}.pause_animation()"><img class="anim_icon" src="https://github.com/jakevdp/JSAnimation/blob/master/JSAnimation/icons/pause.png?raw=true"></button>
    <button onclick="anim{id}.play_animation()"><img class="anim_icon" src="https://github.com/jakevdp/JSAnimation/blob/master/JSAnimation/icons/play.png?raw=true"></button>
    <button onclick="anim{id}.next_frame()"><img class="anim_icon" src="https://github.com/jakevdp/JSAnimation/blob/master/JSAnimation/icons/next.png?raw=true"></button>
    <button onclick="anim{id}.last_frame()"><img class="anim_icon" src="https://github.com/jakevdp/JSAnimation/blob/master/JSAnimation/icons/last.png?raw=true"></button>
    <button onclick="anim{id}.faster()">+</button>
  <form action="#n" name="_anim_loop_select{id}" class="anim_control">
    <input type="radio" name="state" value="once" {once_checked}> Once </input>
    <input type="radio" name="state" value="loop" {loop_checked}> Loop </input>
    <input type="radio" name="state" value="reflect" {reflect_checked}> Reflect </input>
  </form>
</div>


<script language="javascript">
  /* Instantiate the Animation class. */
  /* The IDs given should match those used in the template above. */
  (function() {{
    var img_id = "_anim_img{id}";
    var slider_id = "_anim_slider{id}";
    var loop_select_id = "_anim_loop_select{id}";
    var frames = new Array({Nframes});
    {fill_frames}

    /* set a timeout to make sure all the above elements are created before
       the object is initialized. */
    setTimeout(function() {{
        anim{id} = new Animation(frames, img_id, slider_id, {interval}, loop_select_id);
    }}, 0);
  }})()
</script>
"""

INCLUDED_FRAMES = """
  for (var i=0; i<{Nframes}; i++){{
    frames[i] = "{frame_dir}/frame" + ("0000000" + i).slice(-7) + ".{frame_format}";
  }}
"""


def _included_frames(frame_list, frame_format):
    """frame_list should be a list of filenames"""
    return INCLUDED_FRAMES.format(Nframes=len(frame_list),
                                  frame_dir=os.path.dirname(frame_list[0]),
                                  frame_format=frame_format)


def _embedded_frames(frame_list, frame_format):
    """frame_list should be a list of base64-encoded png files"""
    template = '  frames[{0}] = "data:image/{1};base64,{2}"\n'
    embedded = "\n"
    for i, frame_data in enumerate(frame_list):
        embedded += template.format(i, frame_format,
                                    frame_data.replace('\n', '\\\n'))
    return embedded


@writers.register('html')
class HTMLWriter(FileMovieWriter):
    # we start the animation id count at a random number: this way, if two
    # animations are meant to be included on one HTML page, there is a
    # very small chance of conflict.
    rng = random.Random()
    exec_key = 'animation.ffmpeg_path'
    args_key = 'animation.ffmpeg_args'
    supported_formats = ['png', 'jpeg', 'tiff', 'svg']

    @classmethod
    def new_id(cls):
        #return '%16x' % cls.rng.getrandbits(64)
        return ''.join(cls.rng.choice(string.ascii_uppercase)
                       for x in range(16))

    def __init__(self, fps=30, codec=None, bitrate=None, extra_args=None,
                 metadata=None, embed_frames=False, default_mode='loop'):
        self.embed_frames = embed_frames
        self.default_mode = default_mode.lower()

        if self.default_mode not in ['loop', 'once', 'reflect']:
            self.default_mode = 'loop'
            warnings.warn("unrecognized default_mode: using 'loop'")

        self._saved_frames = list()
        super(HTMLWriter, self).__init__(fps, codec, bitrate,
                                         extra_args, metadata)

    def setup(self, fig, outfile, dpi, frame_dir=None):
        if os.path.splitext(outfile)[-1] not in ['.html', '.htm']:
            raise ValueError("outfile must be *.htm or *.html")

        if not self.embed_frames:
            if frame_dir is None:
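                # splitext drops only the extension; rstrip('.html') would also
                # strip trailing 'h', 't', 'm', 'l' and '.' characters of the name.
                frame_dir = os.path.splitext(outfile)[0] + '_frames'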
            if not os.path.exists(frame_dir):
                os.makedirs(frame_dir)
            frame_prefix = os.path.join(frame_dir, 'frame')
        else:
            frame_prefix = None

        super(HTMLWriter, self).setup(fig, outfile, dpi,
                                      frame_prefix, clear_temp=False)

    def grab_frame(self, **savefig_kwargs):
        if self.embed_frames:
            suffix = '.' + self.frame_format
            f = InMemory()
            self.fig.savefig(f, format=self.frame_format,
                             dpi=self.dpi, **savefig_kwargs)
            f.seek(0)
            self._saved_frames.append(b64encode(f.read()).decode('ascii'))
        else:
            return super(HTMLWriter, self).grab_frame(**savefig_kwargs)

    def _run(self):
        # make a ducktyped subprocess standin
        # this is called by the MovieWriter base class, but not used here.
        class ProcessStandin(object):
            returncode = 0
            def communicate(self):
                return ('', '')
        self._proc = ProcessStandin()

        # save the frames to an html file
        if self.embed_frames:
            fill_frames = _embedded_frames(self._saved_frames,
                                           self.frame_format)
        else:
            # temp names is filled by FileMovieWriter
            fill_frames = _included_frames(self._temp_names,
                                           self.frame_format)

        mode_dict = dict(once_checked='',
                         loop_checked='',
                         reflect_checked='')
        mode_dict[self.default_mode + '_checked'] = 'checked'

        interval = int(1000. / self.fps)

        with open(self.outfile, 'w') as of:
            of.write(JS_INCLUDE)
            of.write(DISPLAY_TEMPLATE.format(id=self.new_id(),
                                             Nframes=len(self._temp_names),
                                             fill_frames=fill_frames,
                                             interval=interval,
                                             **mode_dict))
            
  # from .html_writer import HTMLWriter
from matplotlib.animation import Animation
import matplotlib.pyplot as plt
import tempfile
import random
import os


__all__ = ['anim_to_html', 'display_animation']


class _NameOnlyTemporaryFile(object):
    """A context-managed temporary file which is not opened.

    The file should be accessible by name on any system.

    Parameters
    ----------
    suffix : string
        The suffix of the temporary file (default = '')
    prefix : string
        The prefix of the temporary file (default = '_tmp_')
    hash_length : string
        The length of the random hash.  The size of the hash space will
        be 16 ** hash_length (default=8)
    seed : integer
        the seed for the random number generator.  If not specified, the
        system time will be used as a seed.
    absolute : boolean
        If true, return an absolute path to a temporary file in the current
        working directory.

    Example
    -------

    >>> with _NameOnlyTemporaryFile(seed=0, absolute=False) as f:
    ...     print(f)
    ...
    _tmp_d82c07cd
    >>> os.path.exists('_tmp_d82c07cd')  # file removed after context
    False

    """
    def __init__(self, prefix='_tmp_', suffix='', hash_length=8,
                 seed=None, absolute=True):
        rng = random.Random(seed)
        self.name = '%s%0*x%s' % (prefix, hash_length,
                                  rng.getrandbits(4 * hash_length), suffix)
        if absolute:
            self.name = os.path.abspath(self.name)

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        if os.path.exists(self.name):
            os.remove(self.name)


def anim_to_html(anim, fps=None, embed_frames=True, default_mode='loop'):
    """Generate HTML representation of the animation"""
    if fps is None and hasattr(anim, '_interval'):
        # Convert interval in ms to frames per second
        fps = 1000. / anim._interval

    plt.close(anim._fig)
    if hasattr(anim, "_html_representation"):
        return anim._html_representation
    else:
        # tempfile can't be used here: we need a filename, and this
        # fails on windows.  Instead, we use a custom filename generator
        #with tempfile.NamedTemporaryFile(suffix='.html') as f:
        with _NameOnlyTemporaryFile(suffix='.html') as f:
            anim.save(f.name,  writer=HTMLWriter(fps=fps,
                                                 embed_frames=embed_frames,
                                                 default_mode=default_mode))
            html = open(f.name).read()

        anim._html_representation = html
        return html


def display_animation(anim, **kwargs):
    """Display the animation with an IPython HTML object"""
    from IPython.display import HTML
    return HTML(anim_to_html(anim, **kwargs))


# This is the magic that makes animations display automatically in the
# IPython notebook.  The _repr_html_ method is a special method recognized
# by IPython.
Animation._repr_html_ = anim_to_html

%matplotlib inline
# from JSAnimation.IPython_display import display_animation
import matplotlib.pyplot as plt
from IPython.display import display
from matplotlib import animation

def display_frames_as_gif(frames):
    """
    Displays a list of frames as a gif, with controls
    """
    plt.figure(figsize=(frames[0].shape[1] / 72.0, frames[0].shape[0] / 72.0), dpi = 72)
    patch = plt.imshow(frames[0])
    plt.axis('off')

    def animate(i):
        patch.set_data(frames[i])

    anim = animation.FuncAnimation(plt.gcf(), animate, frames = len(frames), interval=50)
    display(display_animation(anim, default_mode='loop'))



## DiscretePolicy

This class implements a discrete policy that uses a neural network to parameterize a categorical distribution on a discrete set of actions.

The neural net takes the observation vector as input, then should produce probabilities that can be fed to a categorical distribution.

The parameters of this neural net will be trained below using the REINFORCE policy gradients algorithm.
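
To make the idea of "probabilities that can be fed to a categorical distribution" concrete, the following sketch uses plain NumPy rather than TensorFlow, so it is not a solution to the TODOs below. The `softmax` helper, the example `logits`, and the seeded generator `rng` are all made up for illustration; in the exercise the unnormalized scores would come from your network.

```python
import numpy as np


def softmax(logits):
    """Turns arbitrary real-valued scores into probabilities that sum to 1."""
    # Subtracting the row maximum does not change the result but avoids overflow.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


rng = np.random.default_rng(0)

# A batch of two observations, each with scores for two actions (made-up values).
logits = np.array([[2.0, 0.5],
                   [-1.0, 1.0]])
probs = softmax(logits)  # shape [batch_dim, action_dim]

# Sample one action index per row from the categorical distribution.
actions = np.array([rng.choice(probs.shape[1], p=p) for p in probs])
```

Each row of `probs` sums to 1, and each entry of `actions` is an integer in `[0, action_dim)`, which is the form of action that CartPole expects.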

In [0]:
class DiscretePolicy:
  """Simple discrete-action policy with feed-forward net."""
  
  def __init__(self,
               obs_dim,
               action_dim):
    """Creates a discrete policy parameterized by a neural network.
    
    Args:
      obs_dim: The dimension of observation vectors. For CartPole, this will
        be 4.
      action_dim: The number of discrete actions the policy has to choose from.
        For CartPole, this will be 2.
    """
    self._obs_dim = obs_dim
    self._action_dim = action_dim
    self._make_net()
  
  def _make_net(self):
    # Make observation placeholder with batch dimension.
    with tf.variable_scope('discrete_policy'):
      self._obs_placeholder = tf.placeholder(dtype=tf.float32, 
                                             shape=[None, self._obs_dim])
      
      # TODO: Create neural network layers from obs_placeholder that produce
      #         values for action_probabilities.
      # Hint: Use the tf.slim library to create basic neural network layers.
      # Hint: You can use tf.nn.softmax to turn arbitrary weights into probabilities that sum to 1.
      self._action_probabilities = ???
      
      # TODO: Use the action_probabilities to parameterize a TF distribution,
      #         which can be used to sample actions.
      # https://www.tensorflow.org/api_guides/python/contrib.distributions
      # Hint: The distribution to use should be one over discrete choices.
      self._action_distribution = ???
      
      self._action_sample = self._action_distribution.sample()
    
    # Set up references to variables for getting & setting params.
    self._variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='discrete_policy')
    self._var_placeholders = []
    self._var_assigns = []
    for var in self._variables:
      var_ph = tf.placeholder(dtype=tf.float32, shape=var.shape)
      self._var_placeholders.append(var_ph)
      self._var_assigns.append(var.assign(var_ph))
    
  @property
  def action_dim(self):
    return self._action_dim
  
  @property
  def obs_placeholder(self):
    """Property returns observation placeholder. shape=[batch_dim, obs_dim]"""
    return self._obs_placeholder
  
  @property
  def action_probabilities(self):
    """Returns tensor of action probabilities. shape=[batch_dim, action_dim]"""
    return self._action_probabilities
  
  def get_params(self, sess=None):
    """Returns the parameter values as a list of arrays, one per variable."""
    sess = sess or tf.get_default_session()
    return sess.run(self._variables)
  
  def set_params(self, param_vals, sess=None):
    """Sets policy variables equal to param_vals"""
    sess = sess or tf.get_default_session()
    return sess.run(self._var_assigns,
                    feed_dict={var_ph: param_val
                               for (var_ph, param_val)
                               in zip(self._var_placeholders, param_vals)})
  
  def sample_action(self, obs, sess=None):
    """Samples an action from the policy distribution, given observation.
    
    Args:
      obs: Vector of observation values.
      sess: The session to use for execution. If not specified, uses
        tf.get_default_session().
    """
    sess = sess or tf.get_default_session()
    sample = sess.run(self._action_sample,
                      feed_dict={self._obs_placeholder: [obs]})
    return sample[0] # remove batch index

## REINFORCE

This class implements the REINFORCE training algorithm for maximizing a policy's expected return.

It is constructed with the policy to train as well as hyperparameters for the learning algorithm.

The train function takes a list of episodes that are used to compute the expected return of the current policy, which is used for the policy gradient update.
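
The training graph in the next cell picks out the probability of the action that was actually taken by multiplying the policy's action probabilities with a one-hot mask and summing over the action axis. The NumPy sketch below mirrors that single step with made-up numbers; the arrays are written out by hand purely for illustration and nothing here fills in a TODO.

```python
import numpy as np

# Made-up policy outputs for three timesteps, shape [batch_dim, action_dim].
action_probabilities = np.array([[0.7, 0.3],
                                 [0.2, 0.8],
                                 [0.5, 0.5]])

# One-hot mask of the actions taken: action 0, then action 1, then action 1.
one_hot_actions = np.array([[1.0, 0.0],
                            [0.0, 1.0],
                            [0.0, 1.0]])

# The mask zeroes out the actions that were not taken; summing over axis 1
# leaves pi(a_t | s_t) for each timestep, with the batch axis preserved.
taken_prob = np.sum(one_hot_actions * action_probabilities,
                    axis=1, keepdims=True)
```

The result has shape `[batch_dim, 1]`, which lines up with the shape of the return placeholder used in the training graph.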

In [0]:
class REINFORCE:
  """Implements REINFORCE training of a policy."""
  
  def __init__(self,
               policy,
               value_function=None,
               learning_rate=1e-3,
               discount_rate=0.99,
               baseline_update_rate=0.01,
               estimation_algorithm='constant_baseline'):
    """Constructs training ops for REINFORCE training of a policy.
    
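    Args:
      policy: The discrete policy to train.
      value_function: Optional value function approximator to use as a critic.
        If provided, it is trained alongside the policy in train().
      learning_rate: The learning rate to use for the optimizer.
      discount_rate: The discount factor applied to future rewards when
        computing returns.
      baseline_update_rate: Update rate for a moving-average baseline. Stored
        on the instance; not read by the code in this cell.
      estimation_algorithm: Which return estimate to use, 'no_baseline' or
        'constant_baseline'. See _estimate_returns.
    """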
    self._policy = policy
    self._value_function = value_function
    self._learning_rate = learning_rate
    self._discount_rate = discount_rate
    self._baseline_update_rate = baseline_update_rate
    self._estimation_algorithm = estimation_algorithm
    
    with tf.variable_scope('REINFORCE'):
      # A one-hot mask of actions taken.
      self._action_placeholder = tf.placeholder(dtype=tf.float32,
                                                shape=[None, self._policy.action_dim],
                                                name='action_placeholder')
      # The per-timestep discounted return.
      self._return_placeholder = tf.placeholder(dtype=tf.float32,
                                                shape=[None, 1],
                                                name='return_placeholder')

      # Use the action_placeholder mask, to select which actions were actually
      #   taken, then reduce along the action index, preserving the batch index.
      self._action_prob = tf.reduce_sum(self._action_placeholder * 
                                        self._policy.action_probabilities,
                                        axis=1, keep_dims=True)
      
      # TODO: Define the policy gradient loss.
      # Hint: See slide 15 of ICML 2017 Deep RL lecture slides.
      #   https://sites.google.com/corp/view/icml17deeprl
      #
      # The policy gradient is shown in these slides to maximize the
      #   quantity log(pi(a_t|s_t)) * R_t.
      self._loss = ???
      
      self._optimizer = tf.train.AdamOptimizer(learning_rate=self._learning_rate)

      # TODO: Create an op that uses the optimizer to minimize self._loss.
      self._train_op = ???
  
  def _estimate_returns(self,
                        episode,
                        algorithm,
                        sess=None):
    """Estimates the return to use for each timestep for policy gradient.
    
    Changes behavior based on algorithm specified.
      - no_baseline: Does not subtract a baseline from empirical return.
      - constant_baseline: Subtracts the mean of the episode's empirical
          returns (a constant baseline) and divides by their standard
          deviation.
    
    Args:
      episode: Tuple of (observations, actions, rewards).
      algorithm: A string value indicating which algorithm to use for computing
        return / critic values.
      sess: Optional session. Accepted because train() passes it; the two
        algorithms above do not use it.
    """
    observations, actions, rewards = episode
    
    # TODO: Compute the return at each point in the episode, indices aligned on
    #   timestep with the other arrays. The return for timestep t is r_t plus
    #   the discounted future return.
    returns = ???
    
    if algorithm == 'no_baseline':
      returns = [[R] for R in returns]
    elif algorithm == 'constant_baseline':
      # Subtract the mean return (the baseline) and scale by the standard
      #   deviation of the returns.
      r_mean = np.mean(returns)
      r_std = np.std(returns)
      returns = [[(R - r_mean) / (r_std + 1e-6)] for R in returns]
      
    return returns
  
  def train(self, episodes, sess=None):
    """Trains the DiscretePolicy with episodes of experience.
    
    Prepares observed actions as one-hot vectors and computes return at each
    timestep, used by the policy gradient update rule.
    
    Args:
      episodes: A list of tuples, each tuple of form
        (observations, actions, rewards).
    """
    sess = sess or tf.get_default_session()
    
    # Observations, actions, returns. Timesteps aligned on indexes.
    observations_batch = []
    actions_batch = []
    returns_batch = []
    
    # For each episode, prepare the data. Computes returns, one-hot actions,
    #   and aggregates all episodes into a batch for tf graph execution.
    for episode in episodes:
      observations, actions, rewards = episode
      
      # Back up timesteps from end of episode to compute return from each t.
      returns = self._estimate_returns(episode,
                                       algorithm=self._estimation_algorithm,
                                       sess=sess)
      
      # TODO: Create one-hot representation of actions taken. Use the array
      #   'actions', and self._policy._action_dim.
      # Hint: Use numpy for simpler quick-and-dirty array manipulation. No need
      #   to use TensorFlow for this.
      one_hot_actions = ???
      
      # Add this episode of data to the minibatch.
      observations_batch.extend(observations)
      actions_batch.extend(one_hot_actions)
      returns_batch.extend(returns)
    
    sess.run(self._train_op,
             feed_dict={
                 self._policy.obs_placeholder: observations_batch,
                 self._action_placeholder: actions_batch,
                 self._return_placeholder: returns_batch,
             })
    
    # If value function is not None, train it, too.
    if self._value_function is not None:
      returns_batch_no_baseline = []
      for episode in episodes:
        returns_batch_no_baseline.extend(self._estimate_returns(episode,
                                                                algorithm='no_baseline',
                                                                sess=sess))
      # Fit the value function to the raw (un-baselined) returns computed above.
      self._value_function.train(observations_batch, returns_batch_no_baseline,
                                 sess=sess)

## Collect Episode

This function collects a single episode of data. It alternates between querying the environment for observations and rewards and querying the policy for actions given those observations.

In [0]:
def collect_episode(env, policy, frames=None):
  """Collect an episode of environment interaction.
  
  Args:
    env: The environment to interact with.
    policy: The policy to use for action selection.
    frames: Optional array to store the frames. If not provided, env does not
      bother rendering.
  Returns:
    Tuple of arrays of observations, actions, and rewards.
  """
  obs = env.reset()
  if frames is not None:
    frames.append(env.render(mode='rgb_array'))
  done = False
  observations, actions, rewards = ([], [], [])
  while not done:
    observations.append(obs)
    action = policy.sample_action(obs)
    actions.append(action)
    obs, reward, done, _ = env.step(action)
    rewards.append(reward)
    if frames is not None:
      frames.append(env.render(mode='rgb_array'))
  return observations, actions, rewards

## Running Training

Before you can run training, you will need to fill in the TODOs in the above code blocks. Once you have filled them in correctly, the following training loop should successfully train your policy, and the output curve should approach an episode performance of 200 reward.

### TODO: Questions
* What happens when you vary the learning rate parameter? Does the policy learn faster? Does it change stability?
* If you train for longer, does the policy get more stable? Does it reach a point where it never falls over? (score=200 is a perfect episode)
* What happens when you restart training and run again? How much variance is there across runs? (may need to reduce SLIDING_WINDOW_SIZE to see a difference)
* What happens when you change the estimation_algorithm parameter from 'no_baseline' to 'constant_baseline'? Compare the training curves qualitatively. Take a look at the REINFORCE implementation to see what's going on. What is happening?
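
For the last question, it can help to look at actual numbers before reading the training curves. The sketch below reproduces the arithmetic of the `constant_baseline` branch in NumPy on a tiny, made-up list of returns; the helper name `normalize_returns` and the example values are assumptions for illustration only.

```python
import numpy as np


def normalize_returns(returns, eps=1e-6):
    """Subtracts the mean and divides by the standard deviation (plus eps)."""
    returns = np.asarray(returns, dtype=np.float64)
    return (returns - returns.mean()) / (returns.std() + eps)


# Made-up returns for a three-step episode with reward 1 per step, no discount.
raw = np.array([3.0, 2.0, 1.0])
normalized = normalize_returns(raw)
```

Every entry of `raw` is positive, while `normalized` is centered on zero, so some timesteps now carry a positive weight and others a negative one. Consider how that changes the direction each timestep pushes the policy when you compare the two training curves.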

In [0]:
EXPERIMENT_NAME = 'reinforce' # @param

tf.reset_default_graph()
sess = tf.Session()

env = gym.make('CartPole-v0')
policy = DiscretePolicy(obs_dim=4, action_dim=2)

LEARNING_RATE = 1e-3 # @param
reinforce = REINFORCE(policy,
                    learning_rate=LEARNING_RATE,
                    estimation_algorithm='no_baseline')

sess.run(tf.global_variables_initializer())

NUM_ITERATIONS = 1000 # @param
EPISODES_PER_ITERATION = 1 # @param
all_actions = []

POLICY_SAVE_FREQUENCY = 20

# Make dictionary for saving policy parameters.
try:
  policy_params[EXPERIMENT_NAME] = {}
except NameError:
  policy_params = {EXPERIMENT_NAME: {}}

training_scores = []
with sess.as_default():
  for i in range(NUM_ITERATIONS):
    # Collect episodes.
    episodes = []
    for _ in range(EPISODES_PER_ITERATION):
      episodes.append(collect_episode(env, policy))
      
    # Train
    reinforce.train(episodes)
    
    # Average the total reward per episode over this iteration's episodes.
    avg_reward = sum(map(lambda episode: sum(episode[2]),
                           episodes)) / EPISODES_PER_ITERATION
    training_scores.append(avg_reward)
    sys.stdout.write('\r{}/{} Iterations, {} Total reward this iteration.'.format(i+1, NUM_ITERATIONS, avg_reward))
    
    # Save intermittent policy parameters for running episodes later.
    if i % POLICY_SAVE_FREQUENCY == 0:
      policy_params[EXPERIMENT_NAME][i] = policy.get_params()        
  print('')
  policy_params[EXPERIMENT_NAME][NUM_ITERATIONS] = policy.get_params()
  
# Avg training_scores by sliding window.
SLIDING_WINDOW_SIZE = 30 # @param
training_scores = [np.mean(training_scores[max(i-SLIDING_WINDOW_SIZE, 0):i+1])
                   for i in range(len(training_scores))]

# Save training_scores list by EXPERIMENT_NAME.
try:
  all_scores[EXPERIMENT_NAME] = training_scores
except NameError:
  all_scores = {EXPERIMENT_NAME: training_scores}

# Plot all experiments so far.
experiment_names = list(all_scores.keys())
for name in experiment_names:
  plt.plot(all_scores[name])
plt.legend(experiment_names, loc='upper left')


In [0]:
# @title Run this to clear scores.
try:
  del all_scores
  del policy_params
except NameError:
  pass

## Observing Episodes

Load the policy parameters from different points in training, and rerun this code block to generate & visualize a new episode.

NOTE: This will only work if you have started the colab kernel manually on your desktop machine. It will not work if you started it through an ssh connection.

In [0]:
experiment_name = 'reinforce' # @param
policy_version = 0 # @param

# Round down to the last saved policy version.
policy_version = int(policy_version/POLICY_SAVE_FREQUENCY) * POLICY_SAVE_FREQUENCY

render_env = gym.make('CartPole-v0')

try:
  with sess.as_default():
    policy.set_params(policy_params[experiment_name][policy_version])
    frames = []
    collect_episode(render_env, policy, frames=frames)
    display_frames_as_gif(frames)
except Exception as e:
  print('DISPLAY ERROR: Cannot render episodes unless the Colab kernel was '
        'started manually on your desktop machine.')
  raise e
  
del render_env