# Reinforcement Learning Algorithms

This notebook is adapted from https://spinningup.openai.com/en/latest/ and relies on PyTorch and the OpenAI Gym.

The purpose of this notebook is to provide template code and pedagogical commentary on a selection of RL algorithms. Ideally, people will be able to copy this notebook and insert their own neural network architecture into these templates without having to change the learning algorithms. 

In [1]:
import numpy as np
import gym

import matplotlib.pyplot as plt
%matplotlib inline

The following variables are used for defining the actions and states of the game "LunarLander-v2".

Because of its simplicity, we'll use the LunarLander game for all of the following examples. The nice thing about OpenAI Gym is that all the different environments have essentially the same API, so that all that should be needed to modify this code for another environment is to change the input and output dimensions expected by your RL model.

In [2]:
STATE_SPACE = 8
ACTION_SPACE = 4
ENV = gym.make("LunarLander-v2")

# actions
DO_NOTHING = 0
LEFT_ENGINE = 1
MAIN_ENGINE = 2
RIGHT_ENGINE = 3

# state
X_POS = 0
Y_POS = 1
X_SPEED = 2
Y_SPEED = 3
ANGLE = 4
ANGLE_SPEED = 5
FIRST_LEG = 6
SECOND_LEG = 7
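
As a small illustration of how index constants like these can be used, the sketch below reads named fields out of a state vector and picks an action with a hand-written rule. The observation values and the `naive_action` helper are made up for illustration; they are not part of Gym or Spinning Up, and the rule is not meant to be a good controller.

```python
import numpy as np

# Index constants copied from the cell above so this snippet runs on its own.
DO_NOTHING, MAIN_ENGINE = 0, 2
Y_SPEED, FIRST_LEG, SECOND_LEG = 3, 6, 7

def naive_action(obs, max_fall_speed=0.5):
    """Hypothetical hand-written rule: fire the main engine when falling fast."""
    landed = obs[FIRST_LEG] > 0.5 and obs[SECOND_LEG] > 0.5
    if landed:
        return DO_NOTHING
    # Assumption: a negative vertical speed means the lander is moving down.
    return MAIN_ENGINE if obs[Y_SPEED] < -max_fall_speed else DO_NOTHING

# Two made-up 8-dimensional observations: one falling quickly, one resting on both legs.
falling = np.array([0.0, 1.2, 0.1, -0.9, 0.05, 0.0, 0.0, 0.0], dtype=np.float32)
resting = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0], dtype=np.float32)
print(naive_action(falling), naive_action(resting))
```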

OpenAI provides pedagogical code for several RL algorithms, and has made the learning process simpler by ensuring that all algorithms follow the same basic set of steps. These are:

1. Logger setup
2. Random seed setting
3. Environment instantiation
4. Constructing the actor-critic PyTorch module via the actor_critic function passed to the algorithm function as an argument
5. Instantiating the experience buffer
6. Setting up callable loss functions that also provide diagnostics specific to the algorithm
7. Making PyTorch optimizers
8. Setting up model saving through the logger
9. Setting up an update function that runs one epoch of optimization or one step of descent
10. Running the main loop of the algorithm:

    a) Run the agent in the environment
    
    b) Periodically update the parameters of the agent according to the main equations of the algorithm
    
    c) Log key performance metrics and save agent


We'll go through each of these steps with additional commentary as we see the code.
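
To make the shape of that main loop concrete before we reach the real code, here is a stripped-down sketch of steps 2, 3, 5, 9 and 10. Everything in it is a stand-in: `CoinFlipEnv` is a hypothetical toy environment that only mimics the classic Gym `reset`/`step` signature, the policy is random, and `update` just empties the buffer instead of learning anything.

```python
import random

class CoinFlipEnv:
    """Hypothetical toy environment with the classic Gym-style reset/step API."""
    def __init__(self, horizon=5):
        self.horizon, self.t = horizon, 0

    def reset(self):
        self.t = 0
        return 0.0                                   # a dummy observation

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= self.horizon
        return float(self.t), reward, done, {}

def train(env_fn, steps_per_epoch=20, epochs=3, seed=0):
    random.seed(seed)                                # step 2: random seed setting
    env = env_fn()                                   # step 3: environment instantiation
    buffer, log = [], []                             # step 5: experience buffer (and a toy logger)

    def update():                                    # step 9: placeholder for the learning update
        n = len(buffer)
        buffer.clear()
        return n

    obs = env.reset()
    for epoch in range(epochs):                      # step 10: main loop
        for t in range(steps_per_epoch):
            action = random.choice([0, 1])           # 10a: run the (random stand-in) agent
            next_obs, reward, done, _ = env.step(action)
            buffer.append((obs, action, reward))
            obs = env.reset() if done else next_obs
        used = update()                              # 10b: periodic parameter update
        log.append({"epoch": epoch, "transitions": used})   # 10c: log metrics
    return log

history = train(CoinFlipEnv)
print(history)
```

The real algorithm below follows the same skeleton, with the random choice replaced by the actor-critic's `step` method and the placeholder `update` replaced by gradient steps on the policy and value losses.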

## A Quick Review of RL Algorithms
![image.png](attachment:image.png)

### The problem statement
The general statement of an RL problem can be formulated as follows:

The probability $p_\theta$ of a play-out for a game composed of a sequence of state vectors $\textbf{s}_t$ and agent actions $\textbf{a}_t$ is factored into the policy vector $\pi_\theta(\textbf{a}_t|\textbf{s}_t)$ and the model $p(\textbf{s}_{t+1}|\textbf{s}_t, \textbf{a}_t)$:

$$
p_{\theta}(\textbf{s}_1,\textbf{a}_1,...,\textbf{s}_T,\textbf{a}_T)=p(\textbf{s}_1)\Pi_{t=1}^{T}\pi_{\theta}(\textbf{a}_t|\textbf{s}_t)p(\textbf{s}_{t+1}|\textbf{s}_t,\textbf{a}_t)
$$

There is additionally a reward, $r(\textbf{s}_t,\textbf{a}_t)$, given to the agent for each step in the game. The goal is to teach an agent to maximize this reward, i.e.

$$
max_{\theta}E_{p_{\theta}}[\Sigma_{t}r(\textbf{s}_t,\textbf{a}_t)]
$$

In **model-free** RL we ignore the model and teach our agent to maximize the reward based purely on the current state and possibly the agent's history (past states and actions).

In **model-based** RL our agent has access to (or tries to learn) a model which predicts state transitions and rewards, so that it can plan ahead and choose the policy that maximizes the cumulative reward.

**On-policy** learning means that the policy vector $\pi$ is being updated using data (i.e., state-action pairs $(s,a)$) collected according to *the most recent version of the policy*. Conversely, **off-policy** learning is done by using data collected at any time according to any policy.

### Additional formalisms
There are several *value functions* which are commonly used to analyze RL approaches. These are:

1. The on-policy value function $V^\pi(s) = E_{\tau \sim \pi}[R(\tau)|s_0=s]$ which gives the expected return given a starting state $s$ and actions chosen according to the policy $\pi$.

2. The on-policy action-value function $Q^\pi(s,a) = E_{\tau \sim \pi}[R(\tau)|s_0=s, a_0=a]$ which gives the expected return given a starting state $s$ and initial action $a$, with all future actions (but not necessarily this first action $a$) chosen according to the policy $\pi$.

3. The optimal value function $V^*(s) = max_\pi E_{\tau \sim \pi}[R(\tau)|s_0=s]$ which gives the expected return given a starting state $s$ and actions chosen according to the *optimal* policy.

4. The optimal action-value function $Q^*(s,a) = max_\pi E_{\tau \sim \pi}[R(\tau)|s_0=s, a_0=a]$ which gives the expected return given a starting state $s$ and initial action $a$, with all future actions (but not necessarily this first action $a$) chosen according to the *optimal* policy.

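There is a set of relations called the Bellman equations which express each value function recursively: the value of your starting point is the reward you expect for being there, plus the discounted value of wherever you land next. Here $\gamma$ is the discount factor and $P$ is the environment's transition distribution: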

$V^\pi(s)=E_{a\sim\pi,\; s^\prime\sim P}[r(s,a)+\gamma V^\pi(s^\prime)]$

$Q^\pi(s,a)=E_{s^\prime\sim P}\big[r(s,a)+\gamma E_{a^\prime\sim\pi}[Q^\pi(s^\prime,a^\prime)]\big]$

$V^*(s)=max_{a}E_{s^\prime\sim P}[r(s,a)+\gamma V^*(s^\prime)]$

$Q^*(s,a)=E_{s^\prime\sim P}\big[r(s,a)+\gamma max_{a^\prime}Q^*(s^\prime,a^\prime)\big]$
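
As a quick sanity check of the optimal Bellman equations, here is a minimal sketch of value iteration, assuming a made-up MDP with two states, two actions and deterministic transitions (so the expectation over $s^\prime$ disappears). The arrays `next_state` and `reward` are invented for illustration.

```python
import numpy as np

gamma = 0.9
# Hypothetical deterministic MDP: next_state[s, a] and reward[s, a].
next_state = np.array([[0, 1],
                       [0, 1]])
reward = np.array([[0.0, 1.0],
                   [0.0, 2.0]])

V = np.zeros(2)
for _ in range(500):
    # Q*(s, a) = r(s, a) + gamma * V*(s')
    Q = reward + gamma * V[next_state]
    # V*(s) = max_a Q*(s, a)
    V = Q.max(axis=1)

print(V)   # converges to roughly [19, 20]
```

In state 1 the best choice is to keep taking action 1 for a reward of 2 per step, which gives $2/(1-\gamma)=20$; from state 0 it is best to move to state 1, which gives $1+0.9\cdot 20=19$.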

### Model-Free RL

**Policy Optimization**: representing $\pi_\theta(a|s)$ explicitly and optimizing $\theta$ either by gradient ascent directly on the performance objective, or by maximizing some local representation of the performance objective. Typically, this is done on-policy. This approach usually also requires learning an approximation of the on-policy value function $V^\pi(s)$.

Policy optimization is typically stable and sensible, because you are directly optimizing for the thing that you want.

**Q-Learning**: learning an approximation for the optimal action-value function $Q^*(s,a)$. The objective function will usually be based on the Bellman equations, and optimization performed off-policy.

Q-learning only indirectly optimizes for the agent performance, so it tends to be less stable than policy optimization. However, because it can be done off-policy, it can reuse previously collected data and is therefore substantially more sample efficient (and so potentially trains faster, if data collection is expensive).

### Model-Based RL

**Pure Planning**: does not represent a policy at all, but simply computes an optimal trajectory through the environment over some fixed time window, based on the current state and the model for the environment's evolution. At each step, a new 'optimal' trajectory is computed and only its first action is executed, as in model-predictive control. The model here plays the role of a physics engine.

**Expert Iteration**: builds on pure planning by using a planning algorithm which relies on a policy $\pi_\theta(a|s)$, such as Monte Carlo Tree Search, to generate candidate actions for the plan. This allows for a more efficient search through action space than pure planning.

**Data Augmentation**: uses a model-free RL method but adds simulated data from a trained model to real data (or maybe even uses *only* simulated data).

## Vanilla Policy Gradient

Our first algorithm implementation will be the Vanilla Policy Gradient (VPG), a form of policy optimization. The theory behind this algorithm can actually be derived in just a few short lines. 

We aim to maximize the expected return $J(\pi_\theta)=E_{\tau\sim\pi_\theta}[R(\tau)]$.

We wish to perform this maximization via gradient ascent: $\theta_{k+1} = \theta_k + \alpha\nabla_\theta J(\pi_\theta)|_{\theta_k}$

To find an expression for the gradient term, we use the following reasoning:

$\nabla_\theta J(\pi_\theta) = \nabla_\theta E_{\tau\sim\pi_\theta}[R(\tau)]$

$\;\;\;\;\;\;=\int_\tau \nabla_\theta P(\tau|\theta)R(\tau)$

$\;\;\;\;\;\;=\int_\tau P(\tau|\theta)\nabla_\theta \text{log}P(\tau|\theta) R(\tau)$

$\;\;\;\;\;\;=E_{\tau\sim\pi_\theta}\big[\Sigma_{t=0}^T \nabla_\theta \text{log}\pi_\theta(a_t|s_t) R(\tau)\big]$

$\;\;\;\;\;\;\approx \frac{1}{|D|}\Sigma_{\tau\in D}\big[\Sigma_{t=0}^T \nabla_\theta \text{log}\pi_\theta(a_t|s_t) R(\tau)\big]$

where $D=\{\tau_i\}$ for $i\in\{1,...,N\}$ is a set of trajectories collected by running the policy $\pi_\theta$, and $|D|$ is the number of trajectories in $D$.
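
The log-derivative step is easy to check numerically. The sketch below assumes the simplest possible setting, a softmax policy over three actions where each "trajectory" is a single action with a made-up return, so the expectation can be computed exactly by enumeration and compared against finite differences of $J$.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

rewards = np.array([1.0, 0.0, 3.0])       # made-up return R for each one-step trajectory
theta = np.array([0.2, -0.1, 0.4])        # policy logits

def J(theta):
    # Expected return: sum over actions of pi(a) * R(a)
    return softmax(theta) @ rewards

# Likelihood-ratio form: E_a[ grad log pi(a) * R(a) ], computed exactly by enumeration.
pi = softmax(theta)
grad_log_pi = np.eye(3) - pi              # row a holds d log pi(a) / d theta
pg = (pi[:, None] * grad_log_pi * rewards[:, None]).sum(axis=0)

# Central finite differences of J as an independent check.
eps = 1e-6
fd = np.array([(J(theta + eps * e) - J(theta - eps * e)) / (2 * eps) for e in np.eye(3)])
print(pg, fd)
```

The two vectors agree up to finite-difference error, which is the content of the fourth line of the derivation.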


Now we use a convenient identity, that is:
$E_{x\sim P_\theta}\big[\nabla_\theta \text{log}P_\theta(x)\big] = 0$

This also means that we can multiply by any function that is independent of the choice of action without changing the result, e.g. $E_{a_t\sim\pi_\theta}\big[\nabla_\theta \text{log}\pi_\theta(a_t|s_t) b(s_t)\big] = 0$. A common choice for a *baseline* function $b(s_t)$ is the *on-policy value function* $V^\pi(s_t)$. 
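
Both claims can be verified by brute force for a small discrete policy. This is a minimal sketch assuming a softmax policy over four actions in a single state; the action values `q` and the baseline `b` are arbitrary made-up numbers.

```python
import numpy as np

theta = np.array([0.5, -1.0, 0.3, 0.0])
pi = np.exp(theta) / np.exp(theta).sum()       # softmax policy over 4 actions
score = np.eye(4) - pi                         # row a: grad_theta log pi(a)

expected_score = pi @ score                    # E_a[grad log pi(a)]
q = np.array([2.0, -1.0, 0.5, 4.0])            # made-up action values for one state
b = 1.7                                        # any baseline that does not depend on the action

grad_plain = pi @ (score * q[:, None])
grad_baseline = pi @ (score * (q - b)[:, None])
print(expected_score, grad_plain - grad_baseline)
```

Both printed vectors are zero up to rounding: the expected score vanishes, so subtracting a baseline leaves the expected gradient unchanged. What the baseline changes is the variance of a sampled estimate.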

Returning to our statement of the policy gradient, we can add as many terms as we want involving our baseline function because it is of the same form as our identity. One additional modification we will make is to condition the reward function in the gradient to only depend on *future* rewards, since past rewards are not relevant to judging the current choice of action $a$. This gives us a new expression for the policy gradient:

$\nabla_\theta J(\pi_\theta) \approx E_{\tau\sim\pi_\theta}\bigg[\Sigma_{t=0}^T \nabla_\theta \text{log}\pi_\theta(a_t|s_t) \big(\Sigma_{t^\prime=t}^T R(s_{t^\prime}, a_{t^\prime},s_{t^\prime+1}) - V^\pi(s_t)\big)\bigg]$

Recall that $V^\pi(s_t)$ is the expected total reward for starting at state $s_t$ and acting according to the policy $\pi$. This form of the policy gradient then has an intuitive explanation: our RL agent will feel 'neutral' (i.e. zero policy gradient) when it gets the reward it expects.

In practice, $V^\pi(s_t)$ is usually trained simultaneously with the policy by minimizing the mean-square-error of a predictive neural net.

### VPG Theory

While the term $\Sigma_{t^\prime=t}^T R(s_{t^\prime}, a_{t^\prime},s_{t^\prime+1}) - V^\pi(s_t)$ has an intuitive explanation, the vanilla policy gradient method uses a related function called the advantage function:

$A^{\pi_\theta}(s_t, a_t) = Q^\pi(s_t,a_t)-V^\pi(s_t)$

The advantage function $A^{\pi}(s,a)$ corresponding to a policy $\pi_\theta$ describes how much better it is to take a specific action $a$ in state $s$, over randomly selecting an action according to $\pi_\theta$, assuming you act according to $\pi_\theta$ forever after. The gradient is then taken to be:

$\nabla_\theta J(\pi_\theta) = E_{\tau\sim\pi_\theta}\big[\Sigma_{t=0}^T \nabla_\theta \text{log}\pi_\theta(a_t|s_t) A^{\pi_\theta}(s_t, a_t)\big]$
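
To get a feel for the advantage function, here is a small tabular sketch. The tables `Q` and `pi` are made-up numbers for two states and three actions; the only relations used are $V^\pi(s)=E_{a\sim\pi}[Q^\pi(s,a)]$ and $A^\pi(s,a)=Q^\pi(s,a)-V^\pi(s)$.

```python
import numpy as np

# Made-up action-value table Q[s, a] and policy pi[s, a] for 2 states and 3 actions.
Q = np.array([[1.0, 2.0, 4.0],
              [0.0, 0.5, -1.0]])
pi = np.array([[0.2, 0.3, 0.5],
               [0.6, 0.3, 0.1]])

V = (pi * Q).sum(axis=1)          # V(s) = E_{a~pi}[Q(s, a)]
A = Q - V[:, None]                # A(s, a) = Q(s, a) - V(s)
print(V)
print(A)
print((pi * A).sum(axis=1))       # the advantage averages to zero under the policy
```

Actions with positive advantage are better than the policy's average behaviour in that state and get pushed up by the gradient; actions with negative advantage get pushed down.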

### VPG Training Algorithm

![image-2.png](attachment:image-2.png)


### VPG Implementation

The OpenAI implementation of VPG is located at https://github.com/openai/spinningup/tree/master/spinup/algos/pytorch/vpg

What I provide here is the exact same code with additional comments and annotations.

First, let's step through the 'core' code. We import a variety of packages, and define a few convenience functions. The first, `combined_shape`, builds the shape tuple for a buffer holding `length` entries of a given per-entry shape. The second, `count_vars`, counts the total number of parameters in a module. The third, `mlp`, builds a multi-layer perceptron with layer sizes specified by the input list `sizes`, hidden-layer activations specified by `activation`, and an optional final `output_activation` function which may differ from the previous layers.

In [3]:
import numpy as np
import scipy.signal
from gym.spaces import Box, Discrete

import torch
import torch.nn as nn
from torch.distributions.normal import Normal
from torch.distributions.categorical import Categorical


def combined_shape(length, shape=None):
    if shape is None:
        return (length,)
    return (length, shape) if np.isscalar(shape) else (length, *shape)


def count_vars(module):
    return sum([np.prod(p.shape) for p in module.parameters()])


def mlp(sizes, activation, output_activation=nn.Identity):
    layers = []
    for j in range(len(sizes)-1):
        act = activation if j < len(sizes)-2 else output_activation
        layers += [nn.Linear(sizes[j], sizes[j+1]), act()]
    return nn.Sequential(*layers)
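
To see how the loop in `mlp` pairs up layer sizes and activations without needing PyTorch, here is a plain-Python sketch. `describe_mlp` is a hypothetical helper written for this illustration; it mirrors the loop above but records tuples instead of building `nn.Linear` layers, and it counts weights plus biases the way `count_vars` would.

```python
def describe_mlp(sizes, activation="Tanh", output_activation="Identity"):
    """Hypothetical helper mirroring the loop in `mlp`, using plain tuples."""
    layers = []
    for j in range(len(sizes) - 1):
        act = activation if j < len(sizes) - 2 else output_activation
        n_params = sizes[j] * sizes[j + 1] + sizes[j + 1]   # weights plus biases
        layers.append((sizes[j], sizes[j + 1], act, n_params))
    return layers

# 8 inputs (LunarLander state), two hidden layers of 64 units, 4 outputs (actions)
layers = describe_mlp([8, 64, 64, 4])
for layer in layers:
    print(layer)
total = sum(n for *_, n in layers)
print(total)   # 576 + 4160 + 260 = 4996
```

A list of four sizes therefore gives three linear layers, with the hidden activation after the first two and the output activation after the last.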

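One more convenience function: `discount_cumsum` computes discounted cumulative sums. Given a sequence of rewards, entry $t$ of the output is the discounted sum of the rewards from step $t$ onward, so a single call gives the discounted future reward starting at every step of a trajectory.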

In [4]:
def discount_cumsum(x, discount):
    """
    magic from rllab for computing discounted cumulative sums of vectors.
    input: 
        vector x, 
        [x0, 
         x1, 
         x2]
    output:
        [x0 + discount * x1 + discount^2 * x2,  
         x1 + discount * x2,
         x2]
    """
    return scipy.signal.lfilter([1], [1, float(-discount)], x[::-1], axis=0)[::-1]
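
The `lfilter` one-liner is compact but hard to read. A plain loop that should compute the same thing is sketched below; `discount_cumsum_naive` is a name introduced here for illustration and is not part of Spinning Up.

```python
import numpy as np

def discount_cumsum_naive(x, discount):
    """Reference loop: out[t] = x[t] + discount * out[t + 1]."""
    out = np.zeros(len(x), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(x))):
        running = x[t] + discount * running
        out[t] = running
    return out

# With x = [1, 2, 3] and discount 0.5 the entries are
# 1 + 0.5*2 + 0.25*3 = 2.75,  2 + 0.5*3 = 3.5,  and 3.
print(discount_cumsum_naive(np.array([1.0, 2.0, 3.0]), 0.5))
```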

OK, now we get into the meat of the core functions. The `Actor` class template needs three methods. First, it needs a `_distribution` method implementing $\pi_\theta(\cdot|s_t)$, i.e. returning a distribution over actions for a given observation. Second, it needs a `_log_prob_from_distribution` method to compute $\text{log}\pi_\theta(a_t|s_t)$. Note that this is the log-probability of a particular action under that distribution, which for discrete actions may be as simple as $\text{log}\frac{\pi^{(i)}}{\Sigma_j \pi^{(j)}}$ where $\pi^{(i)}$ is the (unnormalized) weight on action $i$. Finally, it needs a `forward` method, which applies `_distribution` to the observations and returns the resulting distribution object along with (optionally) the log-likelihood of the given actions.

In [5]:
class Actor(nn.Module):

    def _distribution(self, obs):
        raise NotImplementedError

    def _log_prob_from_distribution(self, pi, act):
        raise NotImplementedError

    def forward(self, obs, act=None):
        # Produce action distributions for given observations, and 
        # optionally compute the log likelihood of given actions under
        # those distributions.
        pi = self._distribution(obs)
        logp_a = None
        if act is not None:
            logp_a = self._log_prob_from_distribution(pi, act)
        return pi, logp_a

Next we'll define two types of `Actor`s which perform different functions. The `MLPCategoricalActor` implements $\pi_\theta(s_t)$ as a simple multi-layer perceptron with a final `Categorical` layer which selects one of $K$ possible actions according to probabilities output from the neural network. This is useful for actors where the control parameters are discrete variables, such as turning an engine thruster on or off. 

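The `MLPGaussianActor` implements $\pi_\theta(s_t)$ as a multidimensional Gaussian distribution with mean determined by a multi-layer perceptron and a diagonal covariance. The log standard deviations are a learnable parameter vector `log_std` that does not depend on the state, initialized to $-0.5$ (a standard deviation of about $0.61$). This is useful for actors where the control parameters are continuous variables, such as the force to exert on a robotic leg. 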

In [6]:
class MLPCategoricalActor(Actor):
    
    def __init__(self, obs_dim, act_dim, hidden_sizes, activation):
        super().__init__()
        self.logits_net = mlp([obs_dim] + list(hidden_sizes) + [act_dim], activation)

    def _distribution(self, obs):
        logits = self.logits_net(obs)
        return Categorical(logits=logits)

    def _log_prob_from_distribution(self, pi, act):
        return pi.log_prob(act)


class MLPGaussianActor(Actor):

    def __init__(self, obs_dim, act_dim, hidden_sizes, activation):
        super().__init__()
        log_std = -0.5 * np.ones(act_dim, dtype=np.float32)
        self.log_std = torch.nn.Parameter(torch.as_tensor(log_std))
        self.mu_net = mlp([obs_dim] + list(hidden_sizes) + [act_dim], activation)

    def _distribution(self, obs):
        mu = self.mu_net(obs)
        std = torch.exp(self.log_std)
        return Normal(mu, std)

    def _log_prob_from_distribution(self, pi, act):
        return pi.log_prob(act).sum(axis=-1)    # Last axis sum needed for Torch Normal distribution
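
For intuition about what the two `log_prob` calls compute, here is a NumPy sketch of the same quantities written out by hand. The function names are made up for this illustration; the formulas are the log-softmax of the logits for the categorical case, and the sum of independent per-dimension Normal log-densities for the Gaussian case.

```python
import numpy as np

def categorical_log_prob(logits, action):
    # log-softmax: subtract the log of the normalizing constant from the chosen logit
    shifted = logits - logits.max()
    return shifted[action] - np.log(np.exp(shifted).sum())

def diag_gaussian_log_prob(mu, log_std, action):
    # Per-dimension Normal log-density, summed over the action dimensions
    std = np.exp(log_std)
    per_dim = -0.5 * ((action - mu) / std) ** 2 - log_std - 0.5 * np.log(2 * np.pi)
    return per_dim.sum(axis=-1)

logits = np.array([1.0, 0.0, -1.0, 0.5])
probs = np.exp([categorical_log_prob(logits, a) for a in range(4)])
print(probs.sum())                       # probabilities over the 4 actions sum to 1

mu, log_std = np.zeros(2), -0.5 * np.ones(2)
print(diag_gaussian_log_prob(mu, log_std, np.zeros(2)))
```

The sum over the last axis in the Gaussian case is what turns one log-density per action dimension into a single log-probability per action vector.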

Next we define some classes which are responsible for implementing the value function $V^\pi(s_t)$. The `MLPCritic` class uses a multi-layer perceptron to take a state vector `obs` and output a value reflecting $V$ (the expected future reward).

The `MLPActorCritic` class puts together several of the previously defined classes into a coherent solution to an RL problem. The `MLPActorCritic` class takes an OpenAI environment's `observation_space` and `action_space` and chooses the appropriate MLP Actor class based on the type of the action space (`MLPCategoricalActor` for `Discrete` actions, `MLPGaussianActor` for continuous `Box` actions). It also uses the `MLPCritic` class to define a value function. It then implements a `step` method which can be used in running a simulation. The `step` method takes a state vector `obs`, samples an action from the actor's implementation of $\pi_\theta(s_t)$, and applies the critic's implementation of $V^\pi(s_t)$ to get a value for the state. The action, value, and log-probability of the selected action are returned for use in training. If only the action is needed (i.e. when using the actor to run a simulation without training) then the method `act` can be called, which calls `step` but only returns the action.

In [7]:
class MLPCritic(nn.Module):

    def __init__(self, obs_dim, hidden_sizes, activation):
        super().__init__()
        self.v_net = mlp([obs_dim] + list(hidden_sizes) + [1], activation)

    def forward(self, obs):
        return torch.squeeze(self.v_net(obs), -1) # Critical to ensure v has right shape.


class MLPActorCritic(nn.Module):


    def __init__(self, observation_space, action_space, 
                 hidden_sizes=(64,64), activation=nn.Tanh):
        super().__init__()

        obs_dim = observation_space.shape[0]

        # policy builder depends on action space
        if isinstance(action_space, Box):
            self.pi = MLPGaussianActor(obs_dim, action_space.shape[0], hidden_sizes, activation)
        elif isinstance(action_space, Discrete):
            self.pi = MLPCategoricalActor(obs_dim, action_space.n, hidden_sizes, activation)

        # build value function
        self.v  = MLPCritic(obs_dim, hidden_sizes, activation)

    def step(self, obs):
        with torch.no_grad():
            pi = self.pi._distribution(obs)
            a = pi.sample()
            logp_a = self.pi._log_prob_from_distribution(pi, a)
            v = self.v(obs)
        return a.numpy(), v.numpy(), logp_a.numpy()

    def act(self, obs):
        return self.step(obs)[0]

Now we will define the vanilla policy gradient training algorithm itself. With the actor itself encapsulated in the previously defined classes this should be easier to follow.

The algorithm relies on some logging and MPI (multi-process) utility functions included in the OpenAI Spinning Up library.

To install, follow the instructions at https://spinningup.openai.com/en/latest/user/installation.html

Alternatively, you can try running the following commands (but if anything fails please see the Open AI instructions):

`sudo apt-get update && sudo apt-get install libopenmpi-dev`

`git clone https://github.com/openai/spinningup.git`

`cd spinningup`

`pip install -e .`

I'm fairly sure this code is slightly out of date and version-specific, but I was able to hack out a working version.

In [8]:
import numpy as np
import torch
from torch.optim import Adam
import gym
import time
import spinup.algos.pytorch.vpg.core as core
from spinup.utils.logx import EpochLogger
from spinup.utils.mpi_pytorch import setup_pytorch_for_mpi, sync_params, mpi_avg_grads
from spinup.utils.mpi_tools import mpi_fork, mpi_avg, proc_id, mpi_statistics_scalar, num_procs

First we're going to define a buffer class `VPGBuffer` which will accumulate data during a training run to form a batch of data to train on. The `store` method for this class takes the output from a timestep of the game or simulation and puts all the various data (e.g. the action chosen, the state vector, the reward, the value function estimation, etc.) into the appropriate storage buffers.

The heart of the VPG algorithm is the estimation of the advantage function $A^{\pi_\theta}(s_t,a_t)$. When a simulation trajectory is completed (either because the game reached a termination state or because the epoch ended), the `finish_path` method of the `VPGBuffer` instance performs the advantage function estimation. This is done using a method called *Generalized Advantage Estimation*. The full details of this method are given in https://arxiv.org/pdf/1506.02438.pdf

The main idea is that we need to find an empirical estimate for $A^{\pi_\theta}(s_t, a_t)$. It turns out that a reasonable estimator can be obtained from 

$A^{\pi_\theta}(s_t, a_t) = Q^\pi(s_t,a_t)-V^\pi(s_t)$

$\;\;\;\;\;\;=E_{s_{t+1}}\big[r_t + \gamma V^\pi(s_{t+1})-V^\pi(s_t)\big] := E_{s_{t+1}}\big[\delta_t^{V^{\pi,\gamma}}\big]$

so the one-step residual $\delta_t$, computed with a learned approximation of $V^\pi$, serves as a sample-based estimate of the advantage.

Consider these $k$-step estimators:
![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)

The `finish_path` method implements this estimator function for the advantage, which then feeds into our VPG algorithm when computing the policy update. The `finish_path` method also computes the cumulative future reward for each time step in the trajectory, to be used as target data for training the value function estimator.
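
The effect of $\lambda$ is easiest to see on a tiny trajectory. The sketch below repeats the three lines of arithmetic from `finish_path` in a standalone function (`gae` is a name chosen for this illustration), using a plain-loop version of the discounted cumulative sum and made-up rewards and value estimates.

```python
import numpy as np

def discount_cumsum(x, discount):
    # Plain-loop stand-in for the lfilter-based helper defined earlier.
    out, running = np.zeros(len(x)), 0.0
    for t in reversed(range(len(x))):
        running = x[t] + discount * running
        out[t] = running
    return out

def gae(rews, vals, last_val, gamma, lam):
    rews = np.append(rews, last_val)
    vals = np.append(vals, last_val)
    deltas = rews[:-1] + gamma * vals[1:] - vals[:-1]      # one-step TD residuals
    adv = discount_cumsum(deltas, gamma * lam)             # GAE-Lambda advantages
    ret = discount_cumsum(rews, gamma)[:-1]                # rewards-to-go
    return deltas, adv, ret

rews = np.array([1.0, 0.0, 2.0])        # made-up rewards
vals = np.array([0.5, 0.4, 1.0])        # made-up value estimates
gamma = 0.9

deltas, adv0, ret = gae(rews, vals, 0.0, gamma, lam=0.0)
_, adv1, _ = gae(rews, vals, 0.0, gamma, lam=1.0)
print(adv0, adv1, ret - vals)
```

With $\lambda=0$ the advantage is just the one-step residual $\delta_t$ (low variance, but biased if the value estimate is poor). With $\lambda=1$ the residuals telescope and the advantage equals the reward-to-go minus the value estimate (less bias, more variance). Values in between trade one off against the other.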

Finally, the `get` method is called at the end of an epoch of training. The method performs some basic data pre-processing before feeding in the batch to the updating step of the algorithm. In particular, it returns a zero-mean and unit-variance transformed version of the advantage buffer. All the buffers are then packaged into a dictionary for future use.

In [22]:
class VPGBuffer:
    """
    A buffer for storing trajectories experienced by a VPG agent interacting
    with the environment, and using Generalized Advantage Estimation (GAE-Lambda)
    for calculating the advantages of state-action pairs.
    """

    def __init__(self, obs_dim, act_dim, size, gamma=0.99, lam=0.95):
        self.obs_buf = np.zeros(core.combined_shape(size, obs_dim), dtype=np.float32)
        self.act_buf = np.zeros(core.combined_shape(size, act_dim), dtype=np.float32)
        self.adv_buf = np.zeros(size, dtype=np.float32)
        self.rew_buf = np.zeros(size, dtype=np.float32)
        self.ret_buf = np.zeros(size, dtype=np.float32)
        self.val_buf = np.zeros(size, dtype=np.float32)
        self.logp_buf = np.zeros(size, dtype=np.float32)
        self.gamma, self.lam = gamma, lam
        self.ptr, self.path_start_idx, self.max_size = 0, 0, size

    def store(self, obs, act, rew, val, logp):
        """
        Append one timestep of agent-environment interaction to the buffer.
        """
        assert self.ptr < self.max_size     # buffer has to have room so you can store
        self.obs_buf[self.ptr] = obs
        self.act_buf[self.ptr] = act
        self.rew_buf[self.ptr] = rew
        self.val_buf[self.ptr] = val
        self.logp_buf[self.ptr] = logp
        self.ptr += 1

    def finish_path(self, last_val=0):
        """
        Call this at the end of a trajectory, or when one gets cut off
        by an epoch ending. This looks back in the buffer to where the
        trajectory started, and uses rewards and value estimates from
        the whole trajectory to compute advantage estimates with GAE-Lambda,
        as well as compute the rewards-to-go (reward at current state plus
        future rewards obtained along this trajectory during training) for
        each state, to use as the targets for the value function.
        The "last_val" argument should be 0 if the trajectory ended
        because the agent reached a terminal state (died), and otherwise
        should be V(s_T), the value function estimated for the last state.
        This allows us to bootstrap the reward-to-go calculation to account
        for timesteps beyond the arbitrary episode horizon (or epoch cutoff).
        """

        path_slice = slice(self.path_start_idx, self.ptr)
        rews = np.append(self.rew_buf[path_slice], last_val)
        vals = np.append(self.val_buf[path_slice], last_val)
        
        # the next two lines implement GAE-Lambda advantage calculation
        deltas = rews[:-1] + self.gamma * vals[1:] - vals[:-1]
        self.adv_buf[path_slice] = core.discount_cumsum(deltas, self.gamma * self.lam)
        
        # the next line computes rewards-to-go, to be targets for the value function
        self.ret_buf[path_slice] = core.discount_cumsum(rews, self.gamma)[:-1]
        
        self.path_start_idx = self.ptr

    def get(self):
        """
        Call this at the end of an epoch to get all of the data from
        the buffer, with advantages appropriately normalized (shifted to have
        mean zero and std one). Also, resets some pointers in the buffer.
        """
        assert self.ptr == self.max_size    # buffer has to be full before you can get
        self.ptr, self.path_start_idx = 0, 0
        # the next two lines implement the advantage normalization trick
        adv_mean, adv_std = mpi_statistics_scalar(self.adv_buf)  # Get mean/std in parallelized code
        self.adv_buf = (self.adv_buf - adv_mean) / adv_std
        data = dict(obs=self.obs_buf, act=self.act_buf, ret=self.ret_buf,
                    adv=self.adv_buf, logp=self.logp_buf)
        return {k: torch.as_tensor(v, dtype=torch.float32) for k,v in data.items()}

Finally, we are ready to see the VPG training algorithm in action! I have tried to make the comments within the code as clear as I could, so I suggest reading through the code itself to see how the algorithm is implemented.

I will highlight that the policy update step we derived in the VPG theory section (step 6 in the VPG algorithm pseudocode) is implemented in the third line of code of the `compute_loss_pi` function defined inside `vpg()`: the loss is the negative mean of $\text{log}\pi_\theta(a_t|s_t)\cdot\hat{A}_t$ over the batch, so descending this loss ascends the policy gradient. As a reminder, our derivation told us that the policy gradient should be

$\hat{g}_k=\frac{1}{|D_k|}\Sigma_{\tau\in D_k}\Sigma_t \nabla_\theta \text{log}\pi_{\theta_k}(a_t|s_t)\cdot\hat{A}_t$

where $D_k$ is the batch collected in that epoch, $\tau$ enumerates trajectories in the batch, and $t$ enumerates time steps in the trajectory. 
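
The link between this gradient and the `loss_pi` line can be checked without autograd. The sketch below assumes a state-independent softmax policy (just a vector of logits, no neural network) and a made-up batch of actions and advantages; it compares the analytic $\hat{g}$ with finite differences of the loss. Note that, like the code, it averages over all time steps in the batch rather than over trajectories, which only changes the estimate by a constant factor.

```python
import numpy as np

def log_softmax(theta):
    shifted = theta - theta.max()
    return shifted - np.log(np.exp(shifted).sum())

theta = np.array([0.1, -0.3, 0.2])            # logits of a state-independent policy
acts = np.array([0, 2, 2, 1, 0])              # made-up batch of sampled actions
adv = np.array([1.0, -0.5, 2.0, 0.3, -1.2])   # made-up advantage estimates

def loss_pi(theta):
    # Same form as in `compute_loss_pi`: minus the mean of logp * adv.
    # The advantages are treated as constants.
    return -(log_softmax(theta)[acts] * adv).mean()

# Analytic policy-gradient estimate: mean over the batch of grad log pi(a_t) * A_t
pi = np.exp(log_softmax(theta))
g_hat = ((np.eye(3)[acts] - pi) * adv[:, None]).mean(axis=0)

eps = 1e-6
grad_loss = np.array([(loss_pi(theta + eps * e) - loss_pi(theta - eps * e)) / (2 * eps)
                      for e in np.eye(3)])
print(g_hat, grad_loss)
```

The gradient of the loss comes out as the negative of $\hat{g}$, which is why an optimizer that minimizes `loss_pi` performs gradient ascent on the policy objective.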

In [23]:
def vpg(env_fn, actor_critic=MLPActorCritic, ac_kwargs=dict(),  seed=0, 
        steps_per_epoch=4000, epochs=50, gamma=0.99, pi_lr=3e-4,
        vf_lr=1e-3, train_v_iters=80, lam=0.97, max_ep_len=1000,
        logger_kwargs=dict(), save_freq=10):
    """
    Vanilla Policy Gradient 
    (with GAE-Lambda for advantage estimation)
    Args:
        env_fn : A function which creates a copy of the environment.
            The environment must satisfy the OpenAI Gym API.
        actor_critic: The constructor method for a PyTorch Module with a 
            ``step`` method, an ``act`` method, a ``pi`` module, and a ``v`` 
            module. The ``step`` method should accept a batch of observations 
            and return:
            ===========  ================  ======================================
            Symbol       Shape             Description
            ===========  ================  ======================================
            ``a``        (batch, act_dim)  | Numpy array of actions for each 
                                           | observation.
            ``v``        (batch,)          | Numpy array of value estimates
                                           | for the provided observations.
            ``logp_a``   (batch,)          | Numpy array of log probs for the
                                           | actions in ``a``.
            ===========  ================  ======================================
            The ``act`` method behaves the same as ``step`` but only returns ``a``.
            The ``pi`` module's forward call should accept a batch of 
            observations and optionally a batch of actions, and return:
            ===========  ================  ======================================
            Symbol       Shape             Description
            ===========  ================  ======================================
            ``pi``       N/A               | Torch Distribution object, containing
                                           | a batch of distributions describing
                                           | the policy for the provided observations.
            ``logp_a``   (batch,)          | Optional (only returned if batch of
                                           | actions is given). Tensor containing 
                                           | the log probability, according to 
                                           | the policy, of the provided actions.
                                           | If actions not given, will contain
                                           | ``None``.
            ===========  ================  ======================================
            The ``v`` module's forward call should accept a batch of observations
            and return:
            ===========  ================  ======================================
            Symbol       Shape             Description
            ===========  ================  ======================================
            ``v``        (batch,)          | Tensor containing the value estimates
                                           | for the provided observations. (Critical: 
                                           | make sure to flatten this!)
            ===========  ================  ======================================
        ac_kwargs (dict): Any kwargs appropriate for the ActorCritic object 
            you provided to VPG.
        seed (int): Seed for random number generators.
        steps_per_epoch (int): Number of steps of interaction (state-action pairs) 
            for the agent and the environment in each epoch.
        epochs (int): Number of epochs of interaction (equivalent to
            number of policy updates) to perform.
        gamma (float): Discount factor. (Always between 0 and 1.)
        pi_lr (float): Learning rate for policy optimizer.
        vf_lr (float): Learning rate for value function optimizer.
        train_v_iters (int): Number of gradient descent steps to take on 
            value function per epoch.
        lam (float): Lambda for GAE-Lambda. (Always between 0 and 1,
            close to 1.)
        max_ep_len (int): Maximum length of trajectory / episode / rollout.
        logger_kwargs (dict): Keyword args for EpochLogger.
        save_freq (int): How often (in terms of gap between epochs) to save
            the current policy and value function.
    """

    # Special function to avoid certain slowdowns from PyTorch + MPI combo.
    setup_pytorch_for_mpi()

    # Set up logger and save configuration
    logger = EpochLogger(**logger_kwargs)
    logger.save_config(locals())

    # Random seed
    seed += 10000 * proc_id()
    torch.manual_seed(seed)
    np.random.seed(seed)

    # Instantiate environment
    env = env_fn()
    obs_dim = env.observation_space.shape
    act_dim = env.action_space.shape

    # Create actor-critic module
    ac = actor_critic(env.observation_space, env.action_space, **ac_kwargs)

    # Sync params across processes (MPI setup)
    sync_params(ac)

    # Count variables
    var_counts = tuple(core.count_vars(module) for module in [ac.pi, ac.v])
    logger.log('\nNumber of parameters: \t pi: %d, \t v: %d\n'%var_counts)

    # Set up experience buffer
    local_steps_per_epoch = int(steps_per_epoch / num_procs())
    buf = VPGBuffer(obs_dim, act_dim, local_steps_per_epoch, gamma, lam)

    # Set up function for computing VPG policy loss
    def compute_loss_pi(data):
        obs, act, adv, logp_old = data['obs'], data['act'], data['adv'], data['logp']

        # Policy loss
        pi, logp = ac.pi(obs, act)
        loss_pi = -(logp * adv).mean()

        # Useful extra info
        approx_kl = (logp_old - logp).mean().item()
        ent = pi.entropy().mean().item()
        pi_info = dict(kl=approx_kl, ent=ent)

        return loss_pi, pi_info

    # Set up function for computing value loss (i.e. mean square error)
    def compute_loss_v(data):
        obs, ret = data['obs'], data['ret']
        return ((ac.v(obs) - ret)**2).mean()

    # Set up optimizers for policy and value function
    pi_optimizer = Adam(ac.pi.parameters(), lr=pi_lr)
    vf_optimizer = Adam(ac.v.parameters(), lr=vf_lr)

    # Set up model saving
    logger.setup_pytorch_saver(ac)

    def update():
        data = buf.get()

        # Get loss and info values before update
        pi_l_old, pi_info_old = compute_loss_pi(data)
        pi_l_old = pi_l_old.item()
        v_l_old = compute_loss_v(data).item()

        # Train policy with a single step of gradient descent
        pi_optimizer.zero_grad()
        loss_pi, pi_info = compute_loss_pi(data)
        loss_pi.backward()      # backpropagate: autograd method of the loss tensor
        mpi_avg_grads(ac.pi)    # average grads across MPI processes
        pi_optimizer.step()

        # Value function learning
        for i in range(train_v_iters):
            vf_optimizer.zero_grad()
            loss_v = compute_loss_v(data)
            loss_v.backward()
            mpi_avg_grads(ac.v)    # average grads across MPI processes
            vf_optimizer.step()

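        # Log changes from update.
        # Note: loss_pi and pi_info were computed before pi_optimizer.step(),
        # so KL and DeltaLossPi compare the pre-update policy with itself
        # and are logged as (approximately) zero.
        kl, ent = pi_info['kl'], pi_info_old['ent']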
        logger.store(LossPi=pi_l_old, LossV=v_l_old,
                     KL=kl, Entropy=ent,
                     DeltaLossPi=(loss_pi.item() - pi_l_old),
                     DeltaLossV=(loss_v.item() - v_l_old))

    # Prepare for interaction with environment
    start_time = time.time()
    o, ep_ret, ep_len = env.reset(), 0, 0

    # Main loop: collect experience in env and update/log each epoch
    for epoch in range(epochs):
        for t in range(local_steps_per_epoch):
            a, v, logp = ac.step(torch.as_tensor(o, dtype=torch.float32))

            next_o, r, d, _ = env.step(a)
            ep_ret += r
            ep_len += 1

            # save and log
            buf.store(o, a, r, v, logp)
            logger.store(VVals=v)
            
            # Update obs (critical!)
            o = next_o

            timeout = ep_len == max_ep_len  # trajectory has reached the maximum episode length
            terminal = d or timeout  # episode is over: env sent a done flag or the length cap was hit
            epoch_ended = t == local_steps_per_epoch - 1  # enough data has been collected for the epoch

            if terminal or epoch_ended:
                if epoch_ended and not terminal:
                    print('Warning: trajectory cut off by epoch at %d steps.'%ep_len, flush=True)
                # if trajectory didn't reach terminal state, bootstrap value target
                if timeout or epoch_ended:
                    _, v, _ = ac.step(torch.as_tensor(o, dtype=torch.float32))
                else:
                    v = 0
                buf.finish_path(v)
                if terminal:
                    # only save EpRet / EpLen if trajectory finished
                    logger.store(EpRet=ep_ret, EpLen=ep_len)
                o, ep_ret, ep_len = env.reset(), 0, 0


        # Save model
        if (epoch % save_freq == 0) or (epoch == epochs-1):
            logger.save_state({'env': env}, None)

        # Perform VPG update!
        update()

        # Log info about epoch
        logger.log_tabular('Epoch', epoch)
        logger.log_tabular('EpRet', with_min_and_max=True)
        logger.log_tabular('EpLen', average_only=True)
        logger.log_tabular('VVals', with_min_and_max=True)
        logger.log_tabular('TotalEnvInteracts', (epoch+1)*steps_per_epoch)
        logger.log_tabular('LossPi', average_only=True)
        logger.log_tabular('LossV', average_only=True)
        logger.log_tabular('DeltaLossPi', average_only=True)
        logger.log_tabular('DeltaLossV', average_only=True)
        logger.log_tabular('Entropy', average_only=True)
        logger.log_tabular('KL', average_only=True)
        logger.log_tabular('Time', time.time()-start_time)
        logger.dump_tabular()
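
The call `buf.finish_path(v)` in the main loop passes the buffer a bootstrap value: `0` when the environment reported a true terminal state, and the critic's estimate for the next observation when the trajectory was cut off by a timeout or by the end of the epoch. The following is a minimal pure-Python sketch of what such a buffer is assumed to compute from that value, using the GAE-lambda scheme suggested by the `gamma` and `lam` arguments. The helper names (`discount_cumsum`, `rewards_to_go`, `gae_advantages`) and the toy numbers are illustrative assumptions, not code taken from this notebook's `VPGBuffer`.

```python
def discount_cumsum(xs, discount):
    """Discounted cumulative sums: out[i] = xs[i] + discount * out[i + 1]."""
    out = [0.0] * len(xs)
    running = 0.0
    for i in reversed(range(len(xs))):
        running = xs[i] + discount * running
        out[i] = running
    return out


def rewards_to_go(rewards, last_val, gamma):
    """Value-function targets; last_val stands in for the unseen future rewards."""
    return discount_cumsum(list(rewards) + [last_val], gamma)[:-1]


def gae_advantages(rewards, values, last_val, gamma, lam):
    """GAE-lambda advantages built from one-step TD residuals."""
    vals = list(values) + [last_val]
    deltas = [rewards[t] + gamma * vals[t + 1] - vals[t]
              for t in range(len(rewards))]
    return discount_cumsum(deltas, gamma * lam)


# Toy trajectory (assumed values, chosen so the arithmetic is exact).
rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]

# True terminal state: nothing to bootstrap from.
print(rewards_to_go(rewards, last_val=0.0, gamma=0.5))   # [1.75, 1.5, 1.0]

# Trajectory cut off early: the critic's estimate (here 4.0) is folded in.
print(rewards_to_go(rewards, last_val=4.0, gamma=0.5))   # [2.25, 2.5, 3.0]

# With lam=1 the advantages equal rewards-to-go minus the value baseline.
print(gae_advantages(rewards, values, last_val=0.0, gamma=0.5, lam=1.0))  # [1.25, 1.0, 0.5]
```

Setting `last_val` to zero for a cut-off trajectory would treat the cut as if the episode had ended with no further reward, which biases the targets for the final steps; bootstrapping from the critic avoids that.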

In [24]:
args = {'env': 'LunarLander-v2',
        'hid': 64,
        'l': 2,
        'gamma': 0.99,
        'seed': 0,
        'cpu': 1,
        'steps': 4000,
        'epochs': 50,
        'exp_name': 'vpg'}


mpi_fork(args['cpu'])  # relaunch under MPI with args['cpu'] processes (returns immediately when cpu <= 1)

from spinup.utils.run_utils import setup_logger_kwargs
logger_kwargs = setup_logger_kwargs(args['exp_name'], args['seed'])

vpg(lambda : gym.make(args['env']), actor_critic=MLPActorCritic,
    ac_kwargs=dict(hidden_sizes=[args['hid']]*args['l']), gamma=args['gamma'], 
    seed=args['seed'], steps_per_epoch=args['steps'], epochs=args['epochs'],
    logger_kwargs=logger_kwargs)

Logging data to /mnt/c/Users/Duncan/OneDrive - University of Toronto/Documents/University/Grad Studies Year 4/HLML/AI_Gym/spinningup/data/vpg/vpg_s0/progress.txt
Saving config:

{
    "ac_kwargs":	{
        "hidden_sizes":	[
            64,
            64
        ]
    },
    "actor_critic":	"MLPActorCritic",
    "env_fn":	"<function <lambda> at 0x7f3363d5a8b0>",
    "epochs":	50,
    "exp_name":	"vpg",
    "gamma":	0.99,
    "lam":	0.97,
    "logger":	{
        "<spinup.utils.logx.EpochLogger object at 0x7f3363ecceb0>":	{
            "epoch_dict":	{},
            "exp_name":	"vpg",
            "first_row":	true,
            "log_current_row":	{},
            "log_headers":	[],
            "output_dir":	"/mnt/c/Users/Duncan/OneDrive - University of Toronto/Documents/University/Grad Studies Year 4/HLML/AI_Gym/spinningup/data/vpg/vpg_s0",
            "output_file":	{
                "<_io.TextIOWrapper name='/mnt/c/Users/Duncan/OneDrive - University of Toronto/Docum

---------------------------------------
|             Epoch |               8 |
|      AverageEpRet |            -207 |
|          StdEpRet |             118 |
|          MaxEpRet |           -49.3 |
|          MinEpRet |            -485 |
|             EpLen |            94.4 |
|      AverageVVals |           -42.2 |
|          StdVVals |            1.25 |
|          MaxVVals |           -24.3 |
|          MinVVals |           -42.3 |
| TotalEnvInteracts |         3.6e+04 |
|            LossPi |         0.00158 |
|             LossV |        8.59e+03 |
|       DeltaLossPi |               0 |
|        DeltaLossV |            -713 |
|           Entropy |            1.38 |
|                KL |       -3.28e-10 |
|              Time |            15.8 |
---------------------------------------
---------------------------------------
|             Epoch |               9 |
|      AverageEpRet |            -159 |
|          StdEpRet |            96.6 |
|          MaxEpRet |            14.5 |


---------------------------------------
|             Epoch |              18 |
|      AverageEpRet |            -196 |
|          StdEpRet |             122 |
|          MaxEpRet |           -8.02 |
|          MinEpRet |            -512 |
|             EpLen |            95.4 |
|      AverageVVals |           -74.1 |
|          StdVVals |          0.0568 |
|          MaxVVals |           -73.1 |
|          MinVVals |           -74.1 |
| TotalEnvInteracts |         7.6e+04 |
|            LossPi |        0.000144 |
|             LossV |        5.03e+03 |
|       DeltaLossPi |               0 |
|        DeltaLossV |            -160 |
|           Entropy |            1.38 |
|                KL |       -6.26e-10 |
|              Time |              33 |
---------------------------------------
---------------------------------------
|             Epoch |              19 |
|      AverageEpRet |            -171 |
|          StdEpRet |             104 |
|          MaxEpRet |           -34.9 |


---------------------------------------
|             Epoch |              28 |
|      AverageEpRet |            -160 |
|          StdEpRet |             101 |
|          MaxEpRet |            71.6 |
|          MinEpRet |            -409 |
|             EpLen |            98.4 |
|      AverageVVals |           -84.5 |
|          StdVVals |          0.0132 |
|          MaxVVals |           -84.1 |
|          MinVVals |           -84.5 |
| TotalEnvInteracts |        1.16e+05 |
|            LossPi |        -0.00172 |
|             LossV |         3.1e+03 |
|       DeltaLossPi |               0 |
|        DeltaLossV |          -0.302 |
|           Entropy |            1.38 |
|                KL |        8.34e-10 |
|              Time |            50.8 |
---------------------------------------
---------------------------------------
|             Epoch |              29 |
|      AverageEpRet |            -178 |
|          StdEpRet |             115 |
|          MaxEpRet |            17.8 |


---------------------------------------
|             Epoch |              38 |
|      AverageEpRet |            -166 |
|          StdEpRet |              83 |
|          MaxEpRet |           -10.7 |
|          MinEpRet |            -344 |
|             EpLen |             103 |
|      AverageVVals |           -75.6 |
|          StdVVals |           0.458 |
|          MaxVVals |           -67.3 |
|          MinVVals |           -75.6 |
| TotalEnvInteracts |        1.56e+05 |
|            LossPi |        -0.00872 |
|             LossV |         2.2e+03 |
|       DeltaLossPi |               0 |
|        DeltaLossV |           -66.1 |
|           Entropy |            1.37 |
|                KL |       -2.68e-10 |
|              Time |            68.9 |
---------------------------------------
---------------------------------------
|             Epoch |              39 |
|      AverageEpRet |            -140 |
|          StdEpRet |            76.2 |
|          MaxEpRet |           -30.8 |


---------------------------------------
|             Epoch |              48 |
|      AverageEpRet |            -155 |
|          StdEpRet |            87.3 |
|          MaxEpRet |           -49.7 |
|          MinEpRet |            -458 |
|             EpLen |             105 |
|      AverageVVals |           -84.4 |
|          StdVVals |           0.206 |
|          MaxVVals |           -74.9 |
|          MinVVals |           -84.4 |
| TotalEnvInteracts |        1.96e+05 |
|            LossPi |         -0.0105 |
|             LossV |        1.92e+03 |
|       DeltaLossPi |               0 |
|        DeltaLossV |           -7.15 |
|           Entropy |            1.37 |
|                KL |        2.98e-11 |
|              Time |            87.9 |
---------------------------------------
---------------------------------------
|             Epoch |              49 |
|      AverageEpRet |            -186 |
|          StdEpRet |            89.2 |
|          MaxEpRet |           -58.8 |
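
Two columns in the tables above can be checked by hand. The `Entropy` value stays near 1.38 throughout, which is close to the maximum possible entropy for a policy over four discrete actions, so the policy has moved very little from a uniform one in these 50 epochs. `TotalEnvInteracts` follows directly from the `(epoch+1)*steps_per_epoch` expression in the logging code. The sketch below reproduces both numbers; the helper names `categorical_entropy` and `total_env_interacts` are hypothetical and exist only for this illustration.

```python
import math


def categorical_entropy(probs):
    """Shannon entropy in nats of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def total_env_interacts(epoch, steps_per_epoch):
    """Same expression the logging code uses for TotalEnvInteracts."""
    return (epoch + 1) * steps_per_epoch


n_actions = 4  # LunarLander-v2 action count, as in ACTION_SPACE
uniform = [1.0 / n_actions] * n_actions
max_entropy = categorical_entropy(uniform)
print(round(max_entropy, 3))  # 1.386, i.e. ln(4)

# Any non-uniform policy has strictly lower entropy than the uniform one.
peaked = [0.4, 0.2, 0.2, 0.2]  # assumed example distribution
print(categorical_entropy(peaked) < max_entropy)  # True

# Epochs 8 and 48 with steps_per_epoch=4000, matching the logged values.
print(total_env_interacts(8, 4000))   # 36000
print(total_env_interacts(48, 4000))  # 196000
```

A falling entropy alongside a rising `AverageEpRet` would indicate the policy is committing to actions that pay off; here the entropy only drops from 1.38 to 1.37.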
