# Policy Gradients
The goal in policy gradient algorithms is to maximize the expected returns of a policy $\pi_\theta$ with parameters $\theta$. Letting $\tau=((s_0, a_0, r_0), \ldots, (s_T, a_T, r_T) )$ denote a trajectory and $R(\tau)$ the return of $\tau$, this objective can be written as
$$\max_{\theta} \mathbb E_{\tau \sim \pi_{\theta}}[R(\tau)].$$

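Using the REINFORCE (log-derivative) trick, we can write the policy gradient (the gradient of the expected return with respect to $\theta$) as
$$\nabla_{\theta} \mathbb E_{\tau \sim \pi_{\theta}}[R(\tau)] = \mathbb E_{\tau \sim \pi_{\theta}} \left[ \sum_{t=0}^T \nabla_{\theta} \log \pi_{\theta}(a_t \vert s_t) R(\tau) \right].$$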
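
For completeness, here is a short sketch of where this expression comes from. It introduces notation that is not used elsewhere in this notebook: $p_\theta(\tau)$ for the density of a trajectory under the policy, $p(s_0)$ for the initial-state distribution, and $p(s_{t+1} \vert s_t, a_t)$ for the environment dynamics. It also assumes that the gradient and the integral can be exchanged.

$$
\begin{aligned}
\nabla_\theta \mathbb E_{\tau \sim \pi_\theta}[R(\tau)]
&= \nabla_\theta \int p_\theta(\tau) R(\tau) \, d\tau
 = \int p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) R(\tau) \, d\tau \\
&= \mathbb E_{\tau \sim \pi_\theta}\left[\nabla_\theta \log p_\theta(\tau) R(\tau)\right], \\
\log p_\theta(\tau) &= \log p(s_0) + \sum_{t=0}^T \log \pi_\theta(a_t \vert s_t) + \sum_{t=0}^{T-1} \log p(s_{t+1} \vert s_t, a_t).
\end{aligned}
$$

Only the policy terms in $\log p_\theta(\tau)$ depend on $\theta$, so $\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t \vert s_t)$, and the dynamics never need to be known or differentiated.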

We can estimate this gradient with a very simple scheme.
We first sample a trajectory $\tau = ((s_t, a_t, r_t))_{t=0}^T$ from our current policy, compute its return $R(\tau)$, then take a stochastic estimate of the policy gradient as

$$\sum_{t=0}^T \nabla_{\theta} \log \pi_{\theta}(a_t \vert s_t) R(\tau).$$
We can then sample more trajectories and average the estimate over them.
In practice, we will often use _discounted_ returns $\tilde R(\tau) = \sum_{t=0}^T \gamma^t r_t$, where $\gamma$ is the discount factor, and our policy gradient estimate will simply replace the undiscounted return $R(\tau)$ with $\tilde R(\tau)$.
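
To make the sampling scheme concrete, the sketch below checks the estimator on a toy problem where the exact gradient is known. Everything in it is an illustrative assumption and not part of the assignment code: a two-armed bandit (a one-step "trajectory"), a softmax policy over the logits `theta`, and the helper names `softmax`, `exact_gradient` and `reinforce_estimate`.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - logits.max()
    exps = np.exp(shifted)
    return exps / exps.sum()

def exact_gradient(theta, rewards):
    """Exact gradient of E[r(a)] under a softmax policy: pi_k * (r_k - E[r])."""
    pi = softmax(theta)
    return pi * (rewards - pi @ rewards)

def reinforce_estimate(theta, rewards, num_samples, rng):
    """Average of grad log pi(a) * r(a) over sampled actions."""
    pi = softmax(theta)
    actions = rng.choice(len(theta), size=num_samples, p=pi)
    # For a softmax policy, grad_theta log pi(a) = onehot(a) - pi.
    grad_log_pi = np.eye(len(theta))[actions] - pi
    return (grad_log_pi * rewards[actions][:, None]).mean(axis=0)

theta = np.zeros(2)                 # uniform policy over two arms
rewards = np.array([1.0, 0.0])      # hypothetical arm rewards
rng = np.random.default_rng(0)

exact = exact_gradient(theta, rewards)
estimate = reinforce_estimate(theta, rewards, 100_000, rng)
print("exact   :", exact)
print("estimate:", estimate)
```

With many samples the estimate should land close to the exact gradient; with only a handful of samples it fluctuates a lot, which is the variance problem the rest of this notebook tries to reduce.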



In [1]:
#@title imports
# As usual, a bit of setup
import os
import shutil
import time
import numpy as np
import gym
import torch

import deeprl.infrastructure.pytorch_util as ptu

from deeprl.infrastructure.rl_trainer import RL_Trainer
from deeprl.infrastructure.trainers import PG_Trainer
from deeprl.infrastructure.trainers import BC_Trainer

from deeprl.agents.pg_agent import PGAgent
from deeprl.policies.MLP_policy import MLPPolicyPG

%load_ext autoreload
%autoreload 2

def rel_error(x, y):
    """ returns relative error """
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))

def remove_folder(path):
    # check if folder exists
    if os.path.exists(path): 
        print("Clearing old results at {}".format(path))
        # remove if exists
        shutil.rmtree(path)
    else:
        print("Folder {} does not exist yet. No old results to delete".format(path))

In [2]:
pg_base_args_dict = dict(
    env_name = 'Hopper-v2', #@param ['Ant-v2', 'Humanoid-v2', 'Walker2d-v2', 'HalfCheetah-v2', 'Hopper-v2']
    exp_name = 'test_pg', #@param
    save_params = False, #@param {type: "boolean"}
    
    ep_len = 200, #@param {type: "integer"}
    discount = 0.95, #@param {type: "number"}

    reward_to_go = True, #@param {type: "boolean"}
    nn_baseline = False, #@param {type: "boolean"}
    dont_standardize_advantages = True, #@param {type: "boolean"}

    # Training
    num_agent_train_steps_per_iter = 1, #@param {type: "integer"}
    n_iter = 100, #@param {type: "integer"}

    # batches & buffers
    batch_size = 1000, #@param {type: "integer"}
    eval_batch_size = 1000, #@param {type: "integer"}
    train_batch_size = 1000, #@param {type: "integer"}
    max_replay_buffer_size = 1000000, #@param {type: "integer"}

    #@markdown network
    n_layers = 2, #@param {type: "integer"}
    size = 64, #@param {type: "integer"}
    learning_rate = 5e-3, #@param {type: "number"}

    #@markdown logging
    video_log_freq = -1, #@param {type: "integer"}
    scalar_log_freq = 1, #@param {type: "integer"}

    #@markdown gpu & run-time settings
    no_gpu = False, #@param {type: "boolean"}
    which_gpu = 0, #@param {type: "integer"}
    seed = 2, #@param {type: "integer"}
    logdir = 'test',
)

## Implementing policy gradients
We will first implement a very naive policy gradient estimator that weights every timestep by the whole discounted return of the trajectory. Fill out the method <code>_discounted_return</code> in <code>pg_agent.py</code>. Your error should be 1e-6 or lower.
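
As a reference point, the discounted return of a single trajectory can be computed in a couple of lines. This is a standalone sketch and not the <code>_discounted_return</code> method itself: the function name, the explicit `gamma` argument and the scalar return type are assumptions made for illustration, and the method in <code>pg_agent.py</code> may be expected to return a different shape.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one trajectory, returned as a float."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))  # [1, gamma, gamma**2, ...]
    return float(np.sum(discounts * rewards))

# 1 + 0.5 * 1 + 0.25 * 1 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```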

In [3]:
### Test return computation
pg_args = dict(pg_base_args_dict)

env_str = 'CartPole'
pg_args['env_name'] = '{}-v0'.format(env_str)
pgtrainer = PG_Trainer(pg_args)
pgagent = pgtrainer.rl_trainer.agent

T = 10
np.random.seed(0)
rewards = np.random.normal(size=T)
discounted_returns = pgagent._discounted_return(rewards)

expected_return = 6.49674307
return_error = rel_error(discounted_returns, expected_return)
print("Error in return estimate is", return_error)

########################
logging outputs to  test
########################
Using CPU for this assignment. There may be some bugs with using GPU that cause test cases to not match. You can uncomment the code below if you want to try using it.
CartPole-v0
Error in return estimate is 1.3988385848769778e-10




Next, we'll consider a return estimate with lower variance: the discounted reward-to-go at each timestep instead of the entire discounted return. The motivation is that an action cannot influence rewards that were received before it was taken. More precisely, instead of taking $\sum_{t'=0}^T \gamma^{t'} r_{t'}$ as the return estimate for all timesteps $t$, we will use $\sum_{t'=t}^T \gamma^{t' - t} r_{t'}$ as the return estimate at timestep $t$. Fill out the method <code>_discounted_cumsum</code> in <code>pg_agent.py</code>. Your error should be 1e-6 or lower.
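
The reward-to-go satisfies the backward recursion $G_t = r_t + \gamma G_{t+1}$ (with $G_{T+1} = 0$), which gives a single-pass computation. The sketch below is one possible standalone version, not the <code>_discounted_cumsum</code> method itself; the function name, the symbol $G_t$ and the explicit `gamma` argument are assumptions for illustration.

```python
import numpy as np

def discounted_reward_to_go(rewards, gamma):
    """Return an array whose entry t is sum_{t' >= t} gamma**(t' - t) * r_{t'}."""
    rewards = np.asarray(rewards, dtype=float)
    out = np.zeros_like(rewards)
    running = 0.0
    # Walk backwards so each step reuses the reward-to-go of the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

# Entries: 1 + 0.5 * 1.5 = 1.75, 1 + 0.5 * 1 = 1.5, 1.0
print(discounted_reward_to_go([1.0, 1.0, 1.0], gamma=0.5))
```

The first entry equals the full discounted return of the trajectory, which is a quick sanity check when comparing the two estimators.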

In [4]:
### Test reward to go computations
pg_args = dict(pg_base_args_dict)

env_str = 'CartPole'
pg_args['env_name'] = '{}-v0'.format(env_str)
pgtrainer = PG_Trainer(pg_args)
pgagent = pgtrainer.rl_trainer.agent

T = 10
np.random.seed(0)
rewards = np.random.normal(size=T)
discounted_cumsum = pgagent._discounted_cumsum(rewards)
expected_cumsum = np.array([6.49674307, 4.98177971, 4.82276053, 4.04633952, 1.90046981, 0.03464402,
 1.06518095, 0.12115003, 0.28684973, 0.4105985])

return_error = rel_error(discounted_cumsum, expected_cumsum)
print("Error in return estimate is", return_error)

########################
logging outputs to  test
########################
Using CPU for this assignment. There may be some bugs with using GPU that cause test cases to not match. You can uncomment the code below if you want to try using it.
CartPole-v0
Error in return estimate is 1.0143655677664199e-08




Finally, we'll use our return estimates to compute a policy gradient. Fill out the surrogate loss computation in the <code>update</code> method of the <code>MLPPolicyPG</code> class in <code>policies/MLP_policy.py</code>.
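
The surrogate loss is a scalar whose gradient with respect to the policy parameters is the negative of the policy gradient estimate, provided the advantages are treated as constants. The NumPy sketch below only evaluates that scalar for a diagonal Gaussian policy with a fixed standard deviation. The function names, the Gaussian form and the choice of averaging over the batch (rather than summing) are all assumptions for illustration; the actual <code>update</code> method would compute the same kind of quantity with PyTorch so that autograd can differentiate it.

```python
import numpy as np

def gaussian_log_prob(actions, means, std):
    """Log-density of a diagonal Gaussian policy, summed over action dimensions."""
    var = std ** 2
    per_dim = -0.5 * ((actions - means) ** 2 / var + np.log(2.0 * np.pi * var))
    return per_dim.sum(axis=-1)

def surrogate_loss(actions, means, std, advantages):
    """Negative advantage-weighted log-likelihood, averaged over the batch."""
    log_probs = gaussian_log_prob(actions, means, std)
    return -np.mean(log_probs * advantages)

# Tiny batch: two 1-D actions that sit exactly on the policy mean.
actions = np.zeros((2, 1))
means = np.zeros((2, 1))
advantages = np.array([1.0, 3.0])
loss = surrogate_loss(actions, means, 1.0, advantages)
print(loss)
```

Minimizing this loss raises the log-probability of actions with positive advantage and lowers it for actions with negative advantage.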

In [13]:
### Test policy gradient (check gradients match what we expect)
torch.manual_seed(0)
ac_dim = 2
ob_dim = 3
batch_size = 5

policy = MLPPolicyPG(
            ac_dim=ac_dim,
            ob_dim=ob_dim,
            n_layers=1,
            size=2,
            learning_rate=0.25)

np.random.seed(0)
obs = np.random.normal(size=(batch_size, ob_dim))
acts = np.random.normal(size=(batch_size, ac_dim))
advs = 1000 * np.random.normal(size=(batch_size,))

first_weight_before = np.array(ptu.to_numpy(next(policy.mean_net.parameters())))
print("Weight before update", first_weight_before)

for i in range(5):
    loss = policy.update(obs, acts, advs)['Training Loss']

print(loss)
expected_loss = -6142.9116
loss_error = rel_error(loss, expected_loss)
print("Loss Error", loss_error, "should be on the order of 1e-6 or lower")

first_weight_after = ptu.to_numpy(next(policy.mean_net.parameters()))
print('Weight after update', first_weight_after)

weight_change = first_weight_after - first_weight_before
print("Change in weights", weight_change)

expected_change = np.array([[ 1.035012, 1.0455959, 0.11085394],
                            [-1.1532364, -0.5915445, 0.557522]])
updated_weight_error = rel_error(weight_change, expected_change)
print("Weight Update Error", updated_weight_error, "should be on the order of 1e-6 or lower")

Weight before update [[-0.00432253  0.30971587 -0.47518533]
 [-0.42489457 -0.22236899  0.15482074]]
-6142.911
Loss Error 3.8026553900586275e-08 should be on the order of 1e-6 or lower
Weight after update [[ 1.0306894   1.3553118  -0.36433142]
 [-1.5781314  -0.8139109   0.71234274]]
Change in weights [[ 1.0350119   1.0455959   0.11085391]
 [-1.1532369  -0.5915419   0.557522  ]]
Weight Update Error 2.2091965382665285e-06 should be on the order of 1e-6 or lower


We can now compare how well the two return estimators perform on a simple environment.

In [9]:
pg_args = dict(pg_base_args_dict)

env_str = 'CartPole'
pg_args['env_name'] = '{}-v0'.format(env_str)
pg_args['reward_to_go'] = False
pg_args['n_iter'] = 100

# Delete all previous logs
remove_folder('logs/policy_gradient/{}/full_returns/'.format(env_str))

for seed in range(3):
    print("Running policy gradient experiment with seed", seed)
    pg_args['seed'] = seed
    pg_args['logdir'] = 'logs/policy_gradient/{}/full_returns/seed{}'.format(env_str, seed)
    pgtrainer = PG_Trainer(pg_args)
    pgtrainer.run_training_loop()

Folder logs/policy_gradient/CartPole/full_returns/ does not exist yet. No old results to delete
Running policy gradient experiment with seed 0
########################
logging outputs to  logs/policy_gradient/CartPole/full_returns/seed0
########################
Using CPU for this assignment. There may be some bugs with using GPU that cause test cases to not match. You can uncomment the code below if you want to try using it.
CartPole-v0


********** Iteration 0 ************

Collecting data to be used for training...
At timestep:     1001 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 26.0000
Eval_StdReturn : 11.006815910339355
Eval_MaxReturn : 52.0
Eval_MinReturn : 11.0
Eval_AverageEpLen : 26.0
Train_AverageReturn : 24.414634704589844
Train_StdReturn : 10.264729499816895
Train_MaxReturn : 57.0
Train_MinReturn : 10.0
Train_AverageEpLen : 24.414634146341463
Train_EnvstepsSoFar : 1001
TimeSinceStart : 0.21847009658813477
Trainin

In [10]:
pg_args = dict(pg_base_args_dict)

env_str = 'CartPole'
pg_args['env_name'] = '{}-v0'.format(env_str)
pg_args['reward_to_go'] = True
pg_args['n_iter'] = 100

# Delete all previous logs
remove_folder('logs/policy_gradient/{}/return_to_go/'.format(env_str))

for seed in range(3):
    print("Running policy gradient experiment with seed", seed)
    pg_args['seed'] = seed
    pg_args['logdir'] = 'logs/policy_gradient/{}/return_to_go/seed{}'.format(env_str, seed)
    pgtrainer = PG_Trainer(pg_args)
    pgtrainer.run_training_loop()

Folder logs/policy_gradient/CartPole/return_to_go/ does not exist yet. No old results to delete
Running policy gradient experiment with seed 0
########################
logging outputs to  logs/policy_gradient/CartPole/return_to_go/seed0
########################
Using CPU for this assignment. There may be some bugs with using GPU that cause test cases to not match. You can uncomment the code below if you want to try using it.
CartPole-v0


********** Iteration 0 ************

Collecting data to be used for training...
At timestep:     1001 / 1000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 26.33333396911621
Eval_StdReturn : 13.417044639587402
Eval_MaxReturn : 75.0
Eval_MinReturn : 11.0
Eval_AverageEpLen : 26.333333333333332
Train_AverageReturn : 24.414634704589844
Train_StdReturn : 10.264729499816895
Train_MaxReturn : 57.0
Train_MinReturn : 10.0
Train_AverageEpLen : 24.414634146341463
Train_EnvstepsSoFar : 1001
TimeSinceStart : 0.1

We should see the reward-to-go estimator outperform the full-returns estimator, with some runs reaching the maximum return of 200. However, there will likely be high variance between runs.

In [14]:
### Visualize Policy Gradient results on CartPole
%load_ext tensorboard
%tensorboard --logdir logs/policy_gradient/CartPole

We can also compare our estimators on a more complex task, though you will probably see that neither performs well (returns do not get much above 200). Note that on this more complex task, we use a much larger batch size to reduce the variance of the policy gradient estimates.

In [15]:
pg_args = dict(pg_base_args_dict)

env_str = 'Hopper'
pg_args['env_name'] = '{}-v2'.format(env_str)
pg_args['learning_rate'] = 0.01
pg_args['reward_to_go'] = False
pg_args['batch_size'] = 10000
pg_args['train_batch_size'] = 10000
pg_args['n_iter'] = 100

# Delete all previous logs
remove_folder('logs/policy_gradient/{}/full_returns/'.format(env_str))

for seed in range(3):
    print("Running policy gradient experiment with seed", seed)
    pg_args['seed'] = seed
    pg_args['logdir'] = 'logs/policy_gradient/{}/full_returns/seed{}'.format(env_str, seed)
    pgtrainer = PG_Trainer(pg_args)
    pgtrainer.run_training_loop()

Folder logs/policy_gradient/Hopper/full_returns/ does not exist yet. No old results to delete
Running policy gradient experiment with seed 0
########################
logging outputs to  logs/policy_gradient/Hopper/full_returns/seed0
########################
Using CPU for this assignment. There may be some bugs with using GPU that cause test cases to not match. You can uncomment the code below if you want to try using it.
Hopper-v2






********** Iteration 0 ************

Collecting data to be used for training...
At timestep:     10006 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 49.4118537902832
Eval_StdReturn : 46.61728286743164
Eval_MaxReturn : 162.16844177246094
Eval_MinReturn : 8.17390251159668
Eval_AverageEpLen : 35.7
Train_AverageReturn : 9.695383071899414
Train_StdReturn : 5.122409820556641
Train_MaxReturn : 75.24513244628906
Train_MinReturn : 2.3481526374816895
Train_AverageEpLen : 13.217965653896961
Train_EnvstepsSoFar : 10006
TimeSinceStart : 2.694154977798462
Training Loss : 34.156253814697266
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 9.695383071899414
Done logging...




********** Iteration 1 ************

Collecting data to be used for training...
At timestep:     10046 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 75.4079360961914
Eval_StdReturn : 34.9294319152





********** Iteration 0 ************

Collecting data to be used for training...
At timestep:     10010 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 167.0282440185547
Eval_StdReturn : 42.58283615112305
Eval_MaxReturn : 222.31344604492188
Eval_MinReturn : 89.20071411132812
Eval_AverageEpLen : 83.3076923076923
Train_AverageReturn : 13.428664207458496
Train_StdReturn : 9.308354377746582
Train_MaxReturn : 73.26396179199219
Train_MinReturn : 2.7757763862609863
Train_AverageEpLen : 19.25
Train_EnvstepsSoFar : 10010
TimeSinceStart : 2.4159579277038574
Training Loss : 43.249969482421875
Baseline Loss : 0
Initial_DataCollection_AverageReturn : 13.428664207458496
Done logging...




********** Iteration 1 ************

Collecting data to be used for training...
At timestep:     10004 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 157.66111755371094
Eval_StdReturn : 49.8609

In [16]:
pg_args = dict(pg_base_args_dict)

env_str = 'Hopper'
pg_args['env_name'] = '{}-v2'.format(env_str)
pg_args['learning_rate'] = 0.01
pg_args['reward_to_go'] = True
pg_args['batch_size'] = 10000
pg_args['train_batch_size'] = 10000
pg_args['n_iter'] = 100

# Delete all previous logs
remove_folder('logs/policy_gradient/{}/return_to_go/'.format(env_str))

for seed in range(3):
    print("Running policy gradient experiment with seed", seed)
    pg_args['seed'] = seed
    pg_args['logdir'] = 'logs/policy_gradient/{}/return_to_go/seed{}'.format(env_str, seed)
    pgtrainer = PG_Trainer(pg_args)
    pgtrainer.run_training_loop()

Folder logs/policy_gradient/Hopper/return_to_go/ does not exist yet. No old results to delete
Running policy gradient experiment with seed 0
########################
logging outputs to  logs/policy_gradient/Hopper/return_to_go/seed0
########################
Using CPU for this assignment. There may be some bugs with using GPU that cause test cases to not match. You can uncomment the code below if you want to try using it.
Hopper-v2


********** Iteration 0 ************

Collecting data to be used for training...
At timestep:     10006 / 10000
Training agent...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 99.02117919921875
Eval_StdReturn : 61.76866912841797
Eval_MaxReturn : 264.6962890625
Eval_MinReturn : 13.24337387084961
Eval_AverageEpLen : 63.76470588235294
Train_AverageReturn : 9.695383071899414
Train_StdReturn : 5.122409820556641
Train_MaxReturn : 75.24513244628906
Train_MinReturn : 2.3481526374816895
Train_AverageEpLen : 13.217965653896961
Train

In [17]:
### Visualize Policy Gradient results on Hopper
%load_ext tensorboard
%tensorboard --logdir logs/policy_gradient/Hopper

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


ERROR: Failed to launch TensorBoard (exited with 1).
Contents of stderr:
Address already in use
Port 6006 is in use by another program. Either identify and stop that program, or start the server with a different port.

## Variance Reduction with a Value Function Baseline
We can further reduce the policy gradient variance by including state-dependent baselines. In this section, we will train a value function network to predict the value of the policy at a state, then use the value function as a baseline by subtracting it from our reward-to-go estimate.
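
A short derivation shows why subtracting a state-dependent baseline does not bias the gradient. Here $b(s)$ denotes an arbitrary function of the state only (notation introduced for this sketch), and the gradient is assumed to commute with the integral over actions:

$$
\mathbb E_{a \sim \pi_\theta(\cdot \vert s)}\left[\nabla_\theta \log \pi_\theta(a \vert s) \, b(s)\right]
= b(s) \int \nabla_\theta \pi_\theta(a \vert s) \, da
= b(s) \, \nabla_\theta \int \pi_\theta(a \vert s) \, da
= b(s) \, \nabla_\theta 1 = 0.
$$

The baseline term therefore vanishes in expectation, so it can only change the variance of the estimate, not its mean.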

Implement the value function baseline loss in the update method of the MLPPolicyPG class in <code>policies/MLP_policy.py</code>.
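
A common choice for the baseline loss is the mean squared error between the value network's predictions and the observed reward-to-go values. The sketch below shows that computation in NumPy on plain arrays. The function name, the `normalize_targets` flag and the idea of standardizing the targets are assumptions for illustration; check the starter code in <code>policies/MLP_policy.py</code> for what the assignment actually expects.

```python
import numpy as np

def baseline_mse_loss(value_predictions, q_values, normalize_targets=False):
    """Mean squared error between predicted values and reward-to-go targets."""
    predictions = np.asarray(value_predictions, dtype=float)
    targets = np.asarray(q_values, dtype=float)
    if normalize_targets:
        # Optional (assumed) step: standardize targets to zero mean, unit std.
        targets = (targets - targets.mean()) / (targets.std() + 1e-8)
    return float(np.mean((predictions - targets) ** 2))

predictions = np.array([0.0, 0.0])
q_values = np.array([1.0, 3.0])
print(baseline_mse_loss(predictions, q_values))                          # (1 + 9) / 2 = 5.0
print(baseline_mse_loss(predictions, q_values, normalize_targets=True))  # close to 1.0
```

Standardizing the targets keeps the loss scale independent of the reward scale, at the cost of having to undo the normalization when the predictions are later used as a baseline.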

In [19]:
# Test value function gradient
torch.manual_seed(0)
ac_dim = 2
ob_dim = 3
batch_size = 5

policy = MLPPolicyPG(
            ac_dim=ac_dim,
            ob_dim=ob_dim,
            n_layers=1,
            size=2,
            learning_rate=0.25,
            nn_baseline=True)

np.random.seed(0)
obs = np.random.normal(size=(batch_size, ob_dim))
acts = np.random.normal(size=(batch_size, ac_dim))
advs = 1000 * np.random.normal(size=(batch_size,))
qvals = advs

first_weight_before = np.array(ptu.to_numpy(next(policy.baseline.parameters())))
print("Weight before update", first_weight_before)

for i in range(5):
    loss = policy.update(obs, acts, advs, qvals=qvals)['Baseline Loss']

print(loss)
expected_loss = 0.925361
loss_error = rel_error(loss, expected_loss)
print("Loss Error", loss_error, "should be on the order of 1e-6 or lower")

first_weight_after = ptu.to_numpy(next(policy.baseline.parameters()))
print('Weight after update', first_weight_after)

weight_change = first_weight_after - first_weight_before
print("Change in weights", weight_change)

expected_change = np.array([[ 0.38988823,  0.70297027,  0.2609921 ],
                            [-1.0340402,  -0.84166795,  0.7254925 ]])
updated_weight_error = rel_error(weight_change, expected_change)
print("Weight Update Error", updated_weight_error, "should be on the order of 1e-6 or lower")

Weight before update [[-0.23799711  0.0213871   0.22824687]
 [ 0.34642333 -0.39140946 -0.25141457]]
0.925361
Loss Error 1.2076536367060315e-08 should be on the order of 1e-6 or lower
Weight after update [[ 0.15189107  0.72435737  0.48923904]
 [-0.68761694 -1.2330772   0.474078  ]]
Change in weights [[ 0.38988817  0.70297027  0.26099217]
 [-1.0340402  -0.8416677   0.7254926 ]]
Weight Update Error 1.4154350421085664e-07 should be on the order of 1e-6 or lower


In the <code>estimate_advantage</code> function in <code>agents/pg_agent.py</code>, fill out the advantage estimate using the baseline and test your implementation below.
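
At its core, the advantage estimate is the reward-to-go minus the baseline's prediction for the same state. The sketch below shows only that arithmetic on plain arrays. The function name, its arguments and the optional standardization step are assumptions for illustration; the real <code>estimate_advantage</code> takes observations and queries the baseline network, and it may rescale the baseline's outputs before subtracting them.

```python
import numpy as np

def advantages_from_baseline(q_values, baseline_values=None, standardize=False):
    """Subtract baseline predictions from reward-to-go values, if a baseline is given."""
    q_values = np.asarray(q_values, dtype=float)
    if baseline_values is None:
        advantages = q_values.copy()
    else:
        advantages = q_values - np.asarray(baseline_values, dtype=float)
    if standardize:
        # Optional: rescale to zero mean and unit standard deviation.
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return advantages

q = np.array([1.0, 2.0, 3.0])
b = np.array([0.5, 0.5, 0.5])
print(advantages_from_baseline(q, b))                  # [0.5 1.5 2.5]
print(advantages_from_baseline(q, standardize=True))   # zero mean, unit std
```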

In [23]:
### Test advantage estimation with the baseline
pg_args = dict(pg_base_args_dict)

env_str = 'CartPole'
pg_args['env_name'] = '{}-v0'.format(env_str)
pg_args['nn_baseline'] = True
pgtrainer = PG_Trainer(pg_args)
pgagent = pgtrainer.rl_trainer.agent

obs_dim = 4
N = 10
np.random.seed(0)
obs = np.random.normal(size=(N, obs_dim))
qs = np.random.normal(size=N)

baseline_advantages = pgagent.estimate_advantage(obs, qs)
expected_advantages = np.array([-0.44662586, -0.89629588, -1.14574752,  2.43957172, -0.06601728,
       -0.00501807, -0.74720337,  1.27468092, -1.20184486,  0.25312274])

advantage_error = rel_error(expected_advantages, baseline_advantages)
print("Advantage error", advantage_error, "should be on the order of 1e-6 or lower")

########################
logging outputs to  test
########################
Using CPU for this assignment. There may be some bugs with using GPU that cause test cases to not match. You can uncomment the code below if you want to try using it.
CartPole-v0


TypeError: super(type, obj): obj must be an instance or subtype of type

## Train your policies!
In this section, we will train our policies using the reward-to-go estimator together with a learned value function baseline. On Hopper, you should see returns consistently above 300, and sometimes above 400. Returns will tend to oscillate during training. You should also see that the value function baseline greatly improves performance over our earlier experiments without it.

In [None]:
pg_args = dict(pg_base_args_dict)

env_str = 'Hopper'
pg_args['env_name'] = '{}-v2'.format(env_str)
pg_args['learning_rate'] = 0.01
pg_args['reward_to_go'] = True
pg_args['nn_baseline'] = True
pg_args['batch_size'] = 10000
pg_args['train_batch_size'] = 10000
pg_args['n_iter'] = 100

# Delete all previous logs
remove_folder('logs/policy_gradient/{}/with_baseline/'.format(env_str))

for seed in range(3):
    print("Running policy gradient experiment with seed", seed)
    pg_args['seed'] = seed
    pg_args['logdir'] = 'logs/policy_gradient/{}/with_baseline/seed{}'.format(env_str, seed)
    pgtrainer = PG_Trainer(pg_args)
    pgtrainer.run_training_loop()

In [None]:
# Plot learning curves
### Visualize Policy Gradient results on Hopper
%load_ext tensorboard
%tensorboard --logdir logs/policy_gradient/Hopper