# Actor Critic Algorithms
In the DQN algorithm, we learned a Q-function by minimizing Bellman errors for an implicit policy that always took the action maximizing the Q-function. However, this scheme requires a discrete action space so that we can easily compute the optimal action at each state, unlike generic policy gradient algorithms, which also work with continuous action spaces.

In this section, we will explore actor-critic algorithms, which maintain an explicit policy (actor) like the policy gradient algorithms, learn a Q-function (critic) capturing the values of the _current policy_, and use this learned Q-function to update the policy. Using a learned critic can provide much lower-variance updates for the policy than Monte Carlo return estimates, and it also allows us to reuse our data by training the actor and critic on _off-policy_ data for better sample efficiency. We can thus take many more policy updates with an actor-critic algorithm using our learned critic, instead of needing to wait and gather fresh samples every time.

In [4]:
# As usual, a bit of setup
import os
import shutil
import time
import numpy as np
import torch

import deeprl.infrastructure.pytorch_util as ptu
from deeprl.infrastructure.rl_trainer import RL_Trainer
from deeprl.infrastructure.trainers import AC_Trainer

from deeprl.policies.MLP_policy import MLPPolicyAC

%load_ext autoreload
%autoreload 2

def rel_error(x, y):
    """ returns relative error """
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))

def remove_folder(path):
    # check if folder exists
    if os.path.exists(path): 
        print("Clearing old results at {}".format(path))
        # remove if exists
        shutil.rmtree(path)
    else:
        print("Folder {} does not exist yet. No old results to delete".format(path))



In [5]:
ac_base_args_dict = dict(
    env_name = 'Hopper-v2', #@param ['Ant-v2', 'Humanoid-v2', 'Walker2d-v2', 'HalfCheetah-v2', 'Hopper-v2']
    exp_name = 'test_ac', #@param
    save_params = False, #@param {type: "boolean"}
    
    ## PDF will tell you how to set ep_len
    ## and discount for each environment
    ep_len = 200, #@param {type: "integer"}
    discount = 0.99, #@param {type: "number"}

    # Training
    num_agent_train_steps_per_iter = 1000, #@param {type: "integer"}
    n_iter = 100, #@param {type: "integer"}

    # batches & buffers
    batch_size = 1000, #@param {type: "integer"}
    eval_batch_size = 1000, #@param {type: "integer"}
    train_batch_size = 256, #@param {type: "integer"}
    max_replay_buffer_size = 1000000, #@param {type: "integer"}

    #@markdown actor network
    n_layers = 2, #@param {type: "integer"}
    size = 256, #@param {type: "integer"}
    entropy_weight=0, #@param {type: "number"}
    learning_rate = 3e-4, #@param {type: "number"}
    
    # critic network
    critic_n_layers = 2, #@param {type: "integer"}
    critic_size = 256, #@param {type: "integer"}
    target_update_rate = 5e-3,

    #@markdown logging
    video_log_freq = -1, #@param {type: "integer"}
    scalar_log_freq = 1, #@param {type: "integer"}

    #@markdown gpu & run-time settings
    no_gpu = False, #@param {type: "boolean"}
    which_gpu = 0, #@param {type: "integer"}
    seed = 2, #@param {type: "integer"}
    logdir = 'test',
)

First fill out the target value calculation in the compute_target_value method of <code>critics/bootstrapped_continuous_critic.py</code>. Compared to the DQN critic, the key difference is that we are now estimating the value of the current policy, instead of the optimal policy as in DQN or Q-learning.

To train our critic to evaluate the current policy $\pi$, we simply sample actions from the current policy in our target value. For each sample $(s,a,s')$, our loss will be
$$L(Q_{\theta}(s, a), r(s,a) + \gamma \mathbb{E}_{a'\sim \pi(s')}[Q_{\bar \theta} (s', a')]),$$
where $L$ is our loss function (for example squared error or the smooth L1 loss).
In this assignment, we will simply sample a single action from the policy to estimate the target value.
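
As a rough illustration of the single-sample target above, here is a minimal sketch in PyTorch. It is not the assignment's solution: the function name `sketch_target_value`, the toy actor and target critic, and the choice to drop the bootstrap term on terminal transitions are all assumptions made for this example.

```python
import torch
from torch import nn
from torch.distributions import Normal


def sketch_target_value(target_q, next_obs, rewards, terminals, actor, discount=0.99):
    """Single-sample estimate of r + gamma * E_{a'~pi(s')}[Q_target(s', a')]."""
    with torch.no_grad():  # targets are treated as constants
        next_actions = actor(next_obs).sample()    # one a' ~ pi(.|s') per row
        next_q = target_q(next_obs, next_actions)  # shape (N,)
        # Assumption: no bootstrapping past terminal transitions.
        return rewards + discount * (1.0 - terminals) * next_q


# Hypothetical stand-ins for the actor and the target critic.
ob_dim, ac_dim, n = 4, 2, 3
mean_net = nn.Linear(ob_dim, ac_dim)
q_net = nn.Linear(ob_dim + ac_dim, 1)


def toy_actor(obs):
    return Normal(mean_net(obs), torch.ones(ac_dim))


def toy_target_q(obs, acts):
    return q_net(torch.cat([obs, acts], dim=-1)).squeeze(-1)


next_obs = torch.randn(n, ob_dim)
rewards = torch.randn(n)
terminals = torch.tensor([1.0, 0.0, 0.0])
targets = sketch_target_value(toy_target_q, next_obs, rewards, terminals, toy_actor)
```

With the terminal mask assumed here, the target for the first transition reduces to its reward alone.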

In [6]:
# Test bellman error for policy evaluation
ac_dim = 3
ob_dim = 11
N = 5

np.random.seed(0)
obs = np.random.normal(size=(N, ob_dim))
acts = np.random.choice(ac_dim, size=(N,))
next_obs = np.random.normal(size=(N, ob_dim))
rewards = np.random.normal(size=N)
terminals = np.zeros(N)
terminals[0] = 1

ac_args = dict(ac_base_args_dict)

env_str = 'Hopper'
ac_args['env_name'] = '{}-v2'.format(env_str)
ac_args['entropy_weight'] = 0.1
actrainer = AC_Trainer(ac_args)
critic = actrainer.rl_trainer.agent.critic

class DummyDist:
    def sample(self):
        return ptu.from_numpy(1 + np.zeros(shape=(N, ac_dim)))

def dummy_actor(next_obs):
    return DummyDist()

# assumes you call actor(next_obs) to get the distribution, then call distribution.sample()
target_vals = critic.compute_target_value(ptu.from_numpy(next_obs), 
                                          ptu.from_numpy(rewards), 
                                          ptu.from_numpy(terminals), 
                                          dummy_actor)
target_vals = ptu.to_numpy(target_vals)
expected_targets = np.array([-0.9167948, -0.11123351, -0.36787638, -2.1131861,  -0.13868617])

target_error = rel_error(target_vals, expected_targets)
print("Target value error", target_error, "should be on the order of 1e-6 or lower")

########################
logging outputs to  test
########################
Using CPU for this assignment. There may be some bugs with using GPU that cause test cases to not match. You can uncomment the code below if you want to try using it.
Hopper-v2
Import error. Trying to rebuild mujoco_py.




DependencyNotInstalled: numpy.core.multiarray failed to import (auto-generated because you didn't call 'numpy.import_array()' after cimporting numpy; use '<void>numpy._import_array' to disable if you are certain you don't need it).. (HINT: you need to install mujoco_py, and also perform the setup instructions here: https://github.com/openai/mujoco-py/.)

For this section, we will also update our target network parameters as an exponential moving average of the critic parameters, instead of simply copying the current parameters periodically as in DQN. Generally, either method for target networks tends to work with appropriately chosen update rates.

Fill out the update_target_network_ema method in <code>critics/bootstrapped_continuous_critic.py</code>.
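
The exponential moving average update can be sketched as follows. This is only an illustration under assumptions: the helper name `sketch_ema_update` is hypothetical, and it assumes the update rate weights the current critic parameters (so a small rate means a slowly moving target).

```python
import torch
from torch import nn


def sketch_ema_update(critic_net, target_net, rate):
    """target <- (1 - rate) * target + rate * critic, applied parameter-wise."""
    with torch.no_grad():
        for p, target_p in zip(critic_net.parameters(), target_net.parameters()):
            target_p.mul_(1.0 - rate).add_(rate * p)


critic_net = nn.Linear(3, 1)
target_net = nn.Linear(3, 1)
target_net.load_state_dict(critic_net.state_dict())  # start identical

with torch.no_grad():
    for p in critic_net.parameters():
        p.add_(1.0)  # shift the critic parameters by 1

sketch_ema_update(critic_net, target_net, rate=0.5)
gap = critic_net.weight - target_net.weight  # roughly 0.5 everywhere
```

With a rate of 0.5, the target moves halfway toward the shifted critic, so the remaining gap is about 0.5 up to floating point error.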

In [None]:
# Test target network update
ac_args = dict(ac_base_args_dict)

env_str = 'Hopper'
ac_args['env_name'] = '{}-v2'.format(env_str)
ac_args['entropy_weight'] = 0.1
actrainer = AC_Trainer(ac_args)
critic = actrainer.rl_trainer.agent.critic

critic.target_update_rate = 0.5

# at initialization, target and critic networks are the same
for p in critic.critic_network.parameters():
    p.data += 1.
    
critic.update_target_network_ema()

for p, target_p in zip(critic.critic_network.parameters(), critic.target_network.parameters()):
    # compare with a tolerance, since floating point arithmetic is not exact
    assert np.allclose(ptu.to_numpy(p - target_p), 0.5)

Next, we will implement the actor update using the learned critic instead of Monte Carlo returns. 

To update our policy at a particular state $s$, our previous policy gradient (using the reward to go estimator) took a step on the objective (treating $Q^{\pi}$ as a function that didn't depend on $\pi$ and using the results of a single trajectory to estimate $Q^\pi$)
$$\mathbb E_{a \sim \pi_{\theta}(s)}[Q^{\pi_\theta}(s, a)],$$
using the REINFORCE gradient estimator 
$$\mathbb E_{a \sim \pi_{\theta}(s)}[\nabla_{\theta} \log \pi_{\theta}(a\vert s) Q^{\pi_\theta}(s, a)].$$
This estimator relies only on the estimated value $Q^{\pi_{\theta}}(s,a)$, so it is very general and can be applied with Monte Carlo estimates of $Q^{\pi_\theta}$.

One way to estimate policy gradients with an actor critic algorithm would be to directly replace the Monte Carlo estimate of $Q$ with the learned critic $Q_{\phi}$, and continue using the REINFORCE gradient estimator.
However, we can explicitly compute derivatives of our learned critic $Q_{\phi}(s, a)$ with respect to the action $a$, which can enable potentially better gradient estimates. 

In order to take advantage of this, we also need to differentiate sampled actions $a$ with respect to our policy parameters, which we can do through a technique known as the _reparameterization trick_ or the _pathwise_ estimator. 
The idea is that if our policy samples actions according to $a \sim \mathcal N(\mu_{\theta}(s), \sigma^2_{\theta}(s))$, we can rewrite $a = f_{\theta}(z)$, where $z \sim \mathcal N(0, 1)$ and $f_{\theta}(z) = \mu_{\theta}(s) + z \cdot \sigma_{\theta}(s)$. Now all the randomness comes from sampling $z$, which doesn't depend on our policy, so we can differentiate the sampled action $a$ with respect to our policy parameters $\theta$ by simply differentiating through the function $f_{\theta}$ applied at the random noise $z$. 
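
To make the reparameterization concrete, the following sketch builds an action by hand as $\mu + \sigma z$ and contrasts `rsample()` with `sample()` from `torch.distributions`. The parameter values are arbitrary, and parameterizing the standard deviation through `log_std` is an assumption of this example.

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)
mu = torch.tensor([0.5, -1.0], requires_grad=True)
log_std = torch.zeros(2, requires_grad=True)

# Manual reparameterization: a = f_theta(z) = mu + sigma * z, with z ~ N(0, 1).
z = torch.randn(2)
a_manual = mu + log_std.exp() * z

# rsample() uses the same construction internally, so gradients can flow to mu and log_std.
a_rsample = Normal(mu, log_std.exp()).rsample()

# sample() draws without tracking gradients, so it cannot be differentiated.
a_sample = Normal(mu, log_std.exp()).sample()

# Differentiating the manual sample: da/dmu = 1 and da/dlog_std = sigma * z.
grad_mu, grad_log_std = torch.autograd.grad(a_manual.sum(), [mu, log_std])
```

Here `a_manual` and `a_rsample` carry gradient information back to the policy parameters, while `a_sample` does not.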

Using the chain rule then allows us to directly estimate gradients of 
$$\mathbb{E}_{a \sim \pi_{\theta}(s)}[Q_{\phi}(s,a)]$$
by drawing samples from $\pi_{\theta}$ and differentiating $Q_{\phi}(s,a)$ on the samples. 
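
As a sanity check on both estimators, the sketch below compares them on a one-dimensional toy problem where the true gradient is known in closed form. The quadratic critic `toy_q` and the fixed standard deviation are assumptions chosen only to keep the example analytic.

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)
mu = torch.tensor(0.0, requires_grad=True)  # policy parameter theta
sigma = torch.tensor(1.0)                   # held fixed for simplicity
n = 100_000


def toy_q(a):
    # Hypothetical critic: Q(a) = -(a - 2)^2, so E[Q] = -((mu - 2)^2 + sigma^2).
    return -(a - 2.0) ** 2


# Pathwise estimator: differentiate Q through reparameterized samples.
a = Normal(mu, sigma).rsample((n,))
pathwise_grad = torch.autograd.grad(toy_q(a).mean(), mu)[0]

# REINFORCE estimator: samples carry no gradient, so differentiate the log-probabilities.
a = Normal(mu, sigma).sample((n,))
surrogate = (Normal(mu, sigma).log_prob(a) * toy_q(a)).mean()
score_grad = torch.autograd.grad(surrogate, mu)[0]

analytic_grad = -2.0 * (mu.item() - 2.0)    # d/dmu of E[Q]
print(pathwise_grad.item(), score_grad.item(), analytic_grad)
```

Both estimates should land near the analytic gradient; on this toy problem the pathwise estimate typically fluctuates less from seed to seed, since it uses the slope of the critic directly.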

Implement the actor update using this pathwise estimator in the update method of the MLPPolicyAC class in <code>policies/MLP_policy.py</code> (Hint: see the rsample method of the distributions in torch.distributions). Note that our implementation samples states uniformly from the entire replay buffer, not necessarily from the state distribution of the current policy. While this means we are no longer taking unbiased policy gradients (our estimates were already biased anyway due to using a learned critic), it works well in practice.
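
A pathwise actor update might look like the sketch below. It is not the reference implementation of MLPPolicyAC: the class `ToyGaussianActor`, the function `sketch_actor_update`, the toy critic, the mean reduction over the batch, and the omission of the entropy bonus are all assumptions for illustration, and the notebook's tests may expect a different loss reduction.

```python
import torch
from torch import nn
from torch.distributions import Normal


class ToyGaussianActor(nn.Module):
    """Hypothetical Gaussian policy with a state-independent log standard deviation."""

    def __init__(self, ob_dim, ac_dim):
        super().__init__()
        self.mean_net = nn.Linear(ob_dim, ac_dim)
        self.log_std = nn.Parameter(torch.zeros(ac_dim))

    def forward(self, obs):
        return Normal(self.mean_net(obs), self.log_std.exp())


def sketch_actor_update(actor, optimizer, obs, critic):
    dist = actor(obs)
    actions = dist.rsample()             # one reparameterized draw per state
    loss = -critic(obs, actions).mean()  # ascend E[Q] by descending -Q
    optimizer.zero_grad()
    loss.backward()                      # gradients flow through Q into the actor
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
actor = ToyGaussianActor(ob_dim=3, ac_dim=2)
optimizer = torch.optim.Adam(actor.parameters(), lr=0.05)
obs = torch.randn(64, 3)


def toy_critic(obs, acts):
    # Hypothetical critic that prefers actions near 1 in every dimension.
    return -((acts - 1.0) ** 2).sum(dim=-1)


losses = [sketch_actor_update(actor, optimizer, obs, toy_critic) for _ in range(200)]
```

Calling `rsample()` exactly once per update matters here, both for correctness of the gradient and because the notebook's test depends on the sequence of random draws.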

In [None]:
# Compute actor update using the policy gradient. 
# For this test to pass, make sure you only call sample once per actor update so as not to throw off
# the actor samples expected for the updates in this test.
torch.manual_seed(0)
ac_dim = 2
ob_dim = 3
batch_size = 5

np.random.seed(0)
# use batch_size defined above rather than relying on N from an earlier cell
obs = np.random.normal(size=(batch_size, ob_dim))

policy = MLPPolicyAC(
            ac_dim=ac_dim,
            ob_dim=ob_dim,
            n_layers=1,
            size=2,
            learning_rate=0.25,
            entropy_weight=0.)

def dummy_critic(obs, acts):
    return torch.sum(acts + 1) + torch.sum(obs)

initial_loss = policy.update(obs, dummy_critic)['Actor Training Loss']
expected_initial_loss = -17.083496

print("Initial loss error", rel_error(expected_initial_loss, initial_loss), "should be on the order of 1e-6 or less.")
for i in range(5):
    loss = policy.update(obs, dummy_critic)['Actor Training Loss']
    print(loss)

expected_final_loss = -30.103575

print("Final loss error", rel_error(expected_final_loss, loss), "should be on the order of 1e-6 or less.")
    


Now we'll train our actor-critic agent on the HalfCheetah task. Your policies should generally reach returns above 600. 

We note that these actor-critic algorithms, since they make use of off-policy updates, can be much more sample efficient than the basic policy gradient algorithm we saw earlier. In our actor-critic algorithms here, we take only 1000 new samples from the environment per iteration, while the policy gradient algorithms often needed many more samples per iteration to estimate the Monte Carlo returns (for example, we used 10000 in the Hopper experiments with policy gradient).



In [None]:
ac_args = dict(ac_base_args_dict)

env_str = 'HalfCheetah'
ac_args['env_name'] = '{}-v2'.format(env_str)
ac_args['n_iter'] = 50

# Delete all previous logs
remove_folder('logs/actor_critic/{}'.format(env_str))

for seed in range(3):
    print("Running actor critic experiment with seed", seed)
    ac_args['seed'] = seed
    ac_args['logdir'] = 'logs/actor_critic/{}/seed{}'.format(env_str, seed)
    actrainer = AC_Trainer(ac_args)
    actrainer.run_training_loop()

In [None]:
### Visualize Actor Critic results on HalfCheetah
%load_ext tensorboard
%tensorboard --logdir logs/actor_critic/HalfCheetah