# Imitation Learning with Neural Network Policies
In this notebook, you will implement the supervised losses for behavior cloning and use them to train policies for locomotion tasks.

In [2]:
#@title imports
# As usual, a bit of setup
import os
import shutil
import time
import numpy as np
import torch
import deeprl.infrastructure.pytorch_util as ptu
from deeprl.infrastructure.rl_trainer import RL_Trainer
from deeprl.infrastructure.trainers import BC_Trainer
from deeprl.agents.bc_agent import BCAgent
from deeprl.policies.loaded_gaussian_policy import LoadedGaussianPolicy
from deeprl.policies.MLP_policy import MLPPolicySL

%load_ext autoreload
%autoreload 2

def rel_error(x, y):
    """ returns relative error """
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))

def remove_folder(path):
    # check if folder exists
    if os.path.exists(path): 
        print("Clearing old results at {}".format(path))
        # remove if exists
        shutil.rmtree(path)
    else:
        print("Folder {} does not exist yet. No old results to delete".format(path))



In [3]:
bc_base_args_dict = dict(
    expert_policy_file = 'deeprl/policies/experts/Hopper.pkl', #@param
    expert_data = 'deeprl/expert_data/expert_data_Hopper-v2.pkl', #@param
    env_name = 'Hopper-v2', #@param ['Ant-v2', 'Humanoid-v2', 'Walker2d-v2', 'HalfCheetah-v2', 'Hopper-v2']
    exp_name = 'test_bc', #@param
    do_dagger = True, #@param {type: "boolean"}
    ep_len = 1000, #@param {type: "integer"}
    save_params = False, #@param {type: "boolean"}

    # Training
    num_agent_train_steps_per_iter = 1000, #@param {type: "integer"}
    n_iter = 1, #@param {type: "integer"}

    # batches & buffers
    batch_size = 10000, #@param {type: "integer"}
    eval_batch_size = 1000, #@param {type: "integer"}
    train_batch_size = 100, #@param {type: "integer"}
    max_replay_buffer_size = 1000000, #@param {type: "integer"}

    #@markdown network
    n_layers = 2, #@param {type: "integer"}
    size = 64, #@param {type: "integer"}
    learning_rate = 5e-3, #@param {type: "number"}

    #@markdown logging
    video_log_freq = -1, #@param {type: "integer"}
    scalar_log_freq = 1, #@param {type: "integer"}

    #@markdown gpu & run-time settings
    no_gpu = False, #@param {type: "boolean"}
    which_gpu = 0, #@param {type: "integer"}
    seed = 2, #@param {type: "integer"}
    logdir = 'test',
)

# Infrastructure
**Policies**: We have provided implementations of simple neural network policies for your convenience. In environments with discrete action spaces, the neural network takes in the current state and outputs the logits of the policy's action distribution at this state. The policy then returns a categorical distribution built from those logits. In environments with continuous action spaces, the network outputs the mean of a diagonal Gaussian distribution, and a separate learnable parameter holds the log standard deviations of the Gaussian.

Calling <code>forward</code> on the policy returns a torch distribution object; see the documentation at https://pytorch.org/docs/stable/distributions.html.
Look at <code>policies/MLP_policy</code> to make sure you understand the implementation.
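
To make the continuous-action case concrete, here is a minimal sketch of a policy whose `forward` returns a torch distribution. The class name `TinyGaussianPolicy`, the layer sizes, and the `Tanh` activation are assumptions for illustration and are not taken from <code>policies/MLP_policy</code>; only the general shape (a mean network plus a separate log-standard-deviation parameter) follows the description above.

```python
import torch
from torch import nn
from torch import distributions


class TinyGaussianPolicy(nn.Module):
    """Hypothetical stand-in for a continuous-action MLP policy."""

    def __init__(self, ob_dim: int, ac_dim: int, size: int = 64):
        super().__init__()
        # Network that maps an observation to the mean of the action distribution.
        self.mean_net = nn.Sequential(
            nn.Linear(ob_dim, size),
            nn.Tanh(),
            nn.Linear(size, ac_dim),
        )
        # Log standard deviation stored as its own parameter (not a network output),
        # with one entry per action dimension.
        self.logstd = nn.Parameter(torch.zeros(ac_dim))

    def forward(self, obs: torch.Tensor) -> distributions.Distribution:
        mean = self.mean_net(obs)
        std = torch.exp(self.logstd)
        # Independent(..., 1) treats the last dimension as a single event, so
        # log_prob returns one value per observation (a diagonal Gaussian).
        return distributions.Independent(distributions.Normal(mean, std), 1)


torch.manual_seed(0)
policy = TinyGaussianPolicy(ob_dim=3, ac_dim=2)
obs = torch.randn(5, 3)
dist = policy(obs)
actions = dist.sample()             # one action per observation
log_probs = dist.log_prob(actions)  # one log-density per observation
```

Wrapping `Normal` in `Independent` is one way to get a per-sample log-probability; the provided policy may construct its distribution differently.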

**RL Training Loop**: The reinforcement learning training loop, which alternates between gathering samples from the environment and updating the policy (and other learned functions), can be found in <code>infrastructure/rl_trainer.py</code>. You won't need to understand it for the basic behavior cloning part, which only uses a fixed set of expert data, but you should read through and understand the <code>run_training_loop</code> function before starting the Dagger implementation.

# Basic Behavior Cloning
The first part of the assignment will be a familiar exercise in supervised learning. Given a dataset of expert trajectories, we will simply train our policy to imitate the expert via maximum likelihood. Fill out the update method in the MLPPolicySL class in <code>policies/MLP_policy.py</code>.
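
As a rough illustration of what "imitating the expert via maximum likelihood" means, the sketch below minimizes the mean negative log-likelihood of expert actions under a Gaussian policy. Everything here is an assumption for illustration: the linear `mean_net`, the Adam optimizer, the learning rate, and the random stand-in data. It is not the reference solution, and it is not expected to reproduce the `expected_loss` value checked in the test cell.

```python
import torch
from torch import nn, optim
from torch import distributions

torch.manual_seed(0)
ob_dim, ac_dim, batch_size = 3, 2, 5

mean_net = nn.Linear(ob_dim, ac_dim)           # hypothetical tiny policy network
logstd = nn.Parameter(torch.zeros(ac_dim))     # separate log-std parameter
optimizer = optim.Adam(list(mean_net.parameters()) + [logstd], lr=1e-2)

obs = torch.randn(batch_size, ob_dim)          # stand-in for expert observations
expert_acts = torch.randn(batch_size, ac_dim)  # stand-in for expert actions

losses = []
for _ in range(50):
    dist = distributions.Independent(
        distributions.Normal(mean_net(obs), torch.exp(logstd)), 1
    )
    # Maximizing likelihood is the same as minimizing the mean
    # negative log-likelihood of the expert actions.
    loss = -dist.log_prob(expert_acts).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

Note that `loss` must be a tensor attached to the computation graph before `backward()` is called; an update method that leaves the loss as `None` fails at that call.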

In [5]:
### Basic test for correctness of loss and gradients
torch.manual_seed(0)
ac_dim = 2
ob_dim = 3
batch_size = 5

policy = MLPPolicySL(
            ac_dim=ac_dim,
            ob_dim=ob_dim,
            n_layers=1,
            size=2,
            learning_rate=0.25)

np.random.seed(0)
obs = np.random.normal(size=(batch_size, ob_dim))
acts = np.random.normal(size=(batch_size, ac_dim))

first_weight_before = np.array(ptu.to_numpy(next(policy.mean_net.parameters())))
print("Weight before update", first_weight_before)

for i in range(5):
    loss = policy.update(obs, acts)['Training Loss']

print(loss)
expected_loss = 2.628419
loss_error = rel_error(loss, expected_loss)
print("Loss Error", loss_error, "should be on the order of 1e-6 or lower")

first_weight_after = ptu.to_numpy(next(policy.mean_net.parameters()))
print('Weight after update', first_weight_after)

weight_change = first_weight_after - first_weight_before
print("Change in weights", weight_change)

expected_change = np.array([[ 0.04385546, -0.4614172,  -1.0613215 ],
                            [ 0.20986436, -1.2060736,  -1.0026767 ]])
updated_weight_error = rel_error(weight_change, expected_change)
print("Weight Update Error", updated_weight_error, "should be on the order of 1e-6 or lower")


Weight before update [[-0.00432253  0.30971587 -0.47518533]
 [-0.42489457 -0.22236899  0.15482074]]


AttributeError: 'NoneType' object has no attribute 'backward'

Having implemented our behavior cloning loss, we can now start training some policies to imitate the expert policies provided. 

Run the following cell to train policies with simple behavior cloning on the HalfCheetah environment.

In [4]:
bc_args = dict(bc_base_args_dict)

env_str = 'HalfCheetah'
bc_args['expert_policy_file'] = 'deeprl/policies/experts/{}.pkl'.format(env_str)
bc_args['expert_data'] = 'deeprl/expert_data/expert_data_{}-v2.pkl'.format(env_str)
bc_args['env_name'] = '{}-v2'.format(env_str)

# Delete all previous logs
remove_folder('logs/behavior_cloning/{}'.format(env_str))

for seed in range(3):
    print("Running behavior cloning experiment with seed", seed)
    bc_args['seed'] = seed
    bc_args['logdir'] = 'logs/behavior_cloning/{}/seed{}'.format(env_str, seed)
    bctrainer = BC_Trainer(bc_args)
    bctrainer.run_training_loop()

Clearing old results at logs/behavior_cloning/HalfCheetah
Running behavior cloning experiment with seed 0
########################
logging outputs to  logs/behavior_cloning/HalfCheetah/seed0
########################
Using CPU for this assignment. There may be some bugs with using GPU that cause test cases to not match. You can uncomment the code below if you want to try using it.
HalfCheetah-v2
Import error. Trying to rebuild mujoco_py.
Compiling /Users/jangdong-eon/miniforge3/envs/hw4/lib/python3.8/site-packages/mujoco_py/cymj.pyx because it changed.
[1/1] Cythonizing /Users/jangdong-eon/miniforge3/envs/hw4/lib/python3.8/site-packages/mujoco_py/cymj.pyx




DependencyNotInstalled: dlopen(/Users/jangdong-eon/miniforge3/envs/hw4/lib/python3.8/site-packages/mujoco_py/generated/cymj_2.1.2.14_38_macextensionbuilder_38.so, 0x0002): Library not loaded: @rpath/MuJoCo.framework/Versions/A/libmujoco.2.1.1.dylib
  Referenced from: <404DF1FE-8863-3ABA-9D0E-14A74C69A063> /Users/jangdong-eon/miniforge3/envs/hw4/lib/python3.8/site-packages/mujoco_py/generated/cymj_2.1.2.14_38_macextensionbuilder_38.so
  Reason: tried: '/Users/jangdong-eon/miniforge3/envs/hw4/lib/MuJoCo.framework/Versions/A/libmujoco.2.1.1.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Users/jangdong-eon/miniforge3/envs/hw4/lib/MuJoCo.framework/Versions/A/libmujoco.2.1.1.dylib' (no such file), '/Users/jangdong-eon/miniforge3/envs/hw4/lib/MuJoCo.framework/Versions/A/libmujoco.2.1.1.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Users/jangdong-eon/miniforge3/envs/hw4/lib/MuJoCo.framework/Versions/A/libmujoco.2.1.1.dylib' (no such file), '/Users/jangdong-eon/.mujoco/mujoco210/bin/MuJoCo.framework/Versions/A/libmujoco.2.1.1.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Users/jangdong-eon/.mujoco/mujoco210/bin/MuJoCo.framework/Versions/A/libmujoco.2.1.1.dylib' (no such file), '/Users/jangdong-eon/miniforge3/envs/hw4/lib/python3.8/site-packages/mujoco_py/generated/MuJoCo.framework/Versions/A/libmujoco.2.1.1.dylib' (no such file), '/opt/homebrew/Cellar/gcc@11/11.4.0/lib/gcc/11/gcc/aarch64-apple-darwin23/11/MuJoCo.framework/Versions/A/libmujoco.2.1.1.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/homebrew/Cellar/gcc@11/11.4.0/lib/gcc/11/gcc/aarch64-apple-darwin23/11/MuJoCo.framework/Versions/A/libmujoco.2.1.1.dylib' (no such file), '/opt/homebrew/Cellar/gcc@11/11.4.0/lib/gcc/11/gcc/MuJoCo.framework/Versions/A/libmujoco.2.1.1.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/homebrew/Cellar/gcc@11/11.4.0/lib/gcc/11/gcc/MuJoCo.framework/Versions/A/libmujoco.2.1.1.dylib' (no such file), '/opt/homebrew/Cellar/gcc@11/11.4.0/lib/gcc/11/MuJoCo.framework/Versions/A/libmujoco.2.1.1.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/homebrew/Cellar/gcc@11/11.4.0/lib/gcc/11/MuJoCo.framework/Versions/A/libmujoco.2.1.1.dylib' (no such file), '/Users/jangdong-eon/miniforge3/envs/hw4/lib/MuJoCo.framework/Versions/A/libmujoco.2.1.1.dylib' (no such file), 
'/System/Volumes/Preboot/Cryptexes/OS/Users/jangdong-eon/miniforge3/envs/hw4/lib/MuJoCo.framework/Versions/A/libmujoco.2.1.1.dylib' (no such file), '/Users/jangdong-eon/miniforge3/envs/hw4/lib/MuJoCo.framework/Versions/A/libmujoco.2.1.1.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Users/jangdong-eon/miniforge3/envs/hw4/lib/MuJoCo.framework/Versions/A/libmujoco.2.1.1.dylib' (no such file), '/Users/jangdong-eon/.mujoco/mujoco210/bin/MuJoCo.framework/Versions/A/libmujoco.2.1.1.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Users/jangdong-eon/.mujoco/mujoco210/bin/MuJoCo.framework/Versions/A/libmujoco.2.1.1.dylib' (no such file), '/Users/jangdong-eon/miniforge3/envs/hw4/lib/python3.8/site-packages/mujoco_py/generated/MuJoCo.framework/Versions/A/libmujoco.2.1.1.dylib' (no such file), '/opt/homebrew/Cellar/gcc@11/11.4.0/lib/gcc/11/gcc/aarch64-apple-darwin23/11/MuJoCo.framework/Versions/A/libmujoco.2.1.1.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/homebrew/Cellar/gcc@11/11.4.0/lib/gcc/11/gcc/aarch64-apple-darwin23/11/MuJoCo.framework/Versions/A/libmujoco.2.1.1.dylib' (no such file), '/opt/homebrew/Cellar/gcc@11/11.4.0/lib/gcc/11/gcc/MuJoCo.framework/Versions/A/libmujoco.2.1.1.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/homebrew/Cellar/gcc@11/11.4.0/lib/gcc/11/gcc/MuJoCo.framework/Versions/A/libmujoco.2.1.1.dylib' (no such file), '/opt/homebrew/Cellar/gcc@11/11.4.0/lib/gcc/11/MuJoCo.framework/Versions/A/libmujoco.2.1.1.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/homebrew/Cellar/gcc@11/11.4.0/lib/gcc/11/MuJoCo.framework/Versions/A/libmujoco.2.1.1.dylib' (no such file), '/Users/jangdong-eon/miniforge3/envs/hw4/bin/../lib/MuJoCo.framework/Versions/A/libmujoco.2.1.1.dylib' (no such file), '/Users/jangdong-eon/miniforge3/envs/hw4/bin/../lib/MuJoCo.framework/Versions/A/libmujoco.2.1.1.dylib' (no such file). 
(HINT: you need to install mujoco_py, and also perform the setup instructions here: https://github.com/openai/mujoco-py/.)

Visualize your results using Tensorboard. You should see that on HalfCheetah, the returns of your learned policies (Eval_AverageReturn) are fairly similar (though a bit lower) to those of the expert (Initial_DataCollection_Average_Return).

In [None]:
### Visualize behavior cloning results on HalfCheetah
%load_ext tensorboard
%tensorboard --logdir logs/behavior_cloning/HalfCheetah

Now run the following cell to train policies with simple behavior cloning on Hopper.

In [None]:
bc_args = dict(bc_base_args_dict)

env_str = 'Hopper'
bc_args['expert_policy_file'] = 'deeprl/policies/experts/{}.pkl'.format(env_str)
bc_args['expert_data'] = 'deeprl/expert_data/expert_data_{}-v2.pkl'.format(env_str)
bc_args['env_name'] = '{}-v2'.format(env_str)

# Delete all previous logs
remove_folder('logs/behavior_cloning/{}'.format(env_str))

for seed in range(3):
    print("Running behavior cloning experiment on Hopper with seed", seed)
    bc_args['seed'] = seed
    bc_args['logdir'] = 'logs/behavior_cloning/{}/seed{}'.format(env_str, seed)
    bctrainer = BC_Trainer(bc_args)
    bctrainer.run_training_loop()

Visualize your results using Tensorboard. You should see that on Hopper, the returns of your learned policies (Eval_AverageReturn) are substantially lower than those of the expert (Initial_DataCollection_Average_Return), due to the distribution shift issues that arise when doing naive behavior cloning.

In [None]:
### Visualize behavior cloning results on Hopper
%load_ext tensorboard
%tensorboard --logdir logs/behavior_cloning/Hopper

# Dataset Aggregation
As discussed in lecture, behavior cloning can suffer from distribution shift: a small mismatch between the learned and expert policies can take the learned policy to states that never appeared in the training data, so it has not learned how to act there. Dagger addresses this issue iteratively: we use our expert policy to provide labels for the new states we encounter with our learned policy, and then retrain our policy on these newly labeled states.
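
The iteration described above can be sketched on a toy problem. Everything in this block is hypothetical: the 1-D dynamics, the linear `expert_action`, and the least-squares fit stand in for the environment, the loaded expert, and the neural network update, and the sketch assumes newly labeled states are appended to all earlier data. It only shows the order of the steps (fit, roll out the learned policy, relabel with the expert, aggregate).

```python
import numpy as np

rng = np.random.default_rng(0)


def expert_action(obs):
    # Hypothetical expert: a fixed linear controller.
    return -0.5 * obs


def rollout(policy_weight, horizon=20):
    # Toy 1-D dynamics: next_obs = obs + action + noise, with action = w * obs.
    obs, visited = rng.normal(), []
    for _ in range(horizon):
        visited.append(obs)
        obs = obs + policy_weight * obs + 0.1 * rng.normal()
    return np.array(visited)


# Iteration 0: start from expert demonstrations only.
data_obs = rollout(policy_weight=-0.5)
data_acs = expert_action(data_obs)

dataset_sizes = []
for itr in range(5):
    # Supervised step: least-squares fit of action = w * obs on all data so far.
    w = float(data_obs @ data_acs / (data_obs @ data_obs))
    # Run the learned policy, then ask the expert to label the states it visited.
    new_obs = rollout(policy_weight=w)
    new_acs = expert_action(new_obs)
    # Aggregate: keep the old data and add the newly labeled states.
    data_obs = np.concatenate([data_obs, new_obs])
    data_acs = np.concatenate([data_acs, new_acs])
    dataset_sizes.append(len(data_obs))
```

Because the toy expert is linear, the fit recovers it immediately; the point of the sketch is the data flow, not the learning curve.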

Implement the <code>do_relabel_with_expert</code> function in <code>infrastructure/rl_trainer.py</code>. The errors in the expert actions should be on the order of 1e-6 or less.
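
For orientation, here is a minimal sketch of what relabeling looks like on paths stored as dictionaries with `observation` and `action` keys, as in the test cell. The `LinearExpert` class, its `get_action` method, and the standalone `relabel_with_expert` function are assumptions for illustration; the real expert is loaded from a `.pkl` file and its interface may differ, so check the provided code before relying on these names.

```python
import numpy as np


class LinearExpert:
    """Hypothetical expert used only for this sketch."""

    def __init__(self, weight):
        self.weight = weight

    def get_action(self, obs):
        # Map a batch of observations (T, ob_dim) to actions (T, ac_dim).
        return obs @ self.weight


def relabel_with_expert(expert, paths):
    # Build shallow copies so the input paths are left untouched.
    relabeled = []
    for path in paths:
        new_path = dict(path)
        # Keep the visited observations, replace the actions with the expert's.
        new_path["action"] = expert.get_action(path["observation"])
        relabeled.append(new_path)
    return relabeled


rng = np.random.default_rng(0)
expert = LinearExpert(weight=rng.normal(size=(11, 3)))
paths = [
    dict(observation=rng.normal(size=(2, 11)), action=rng.normal(size=(2, 3)))
    for _ in range(3)
]
relabeled_paths = relabel_with_expert(expert, paths)
```

The observations are unchanged by relabeling; only the action entries differ, which is also what the assertion in the test cell checks.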

In [None]:
### Test do relabel function
bc_args = dict(bc_base_args_dict)

env_str = 'Hopper'
bc_args['expert_policy_file'] = 'deeprl/policies/experts/{}.pkl'.format(env_str)
bc_args['expert_data'] = 'deeprl/expert_data/expert_data_{}-v2.pkl'.format(env_str)
bc_args['env_name'] = '{}-v2'.format(env_str)
bctrainer = BC_Trainer(bc_args)

np.random.seed(0)
T = 2
ob_dim = 11
ac_dim = 3

paths = []
for i in range(3):
    obs = np.random.normal(size=(T, ob_dim))
    acs = np.random.normal(size=(T, ac_dim))
    paths.append(dict(observation=obs,
                      action=acs))
    
rl_trainer = bctrainer.rl_trainer
relabeled_paths = rl_trainer.do_relabel_with_expert(bctrainer.loaded_expert_policy, paths)

expert_actions = np.array([[[-1.7814021, -0.11137983,  1.763353  ],
                            [-2.589222,   -5.463195,    2.4301376 ]],
                           [[-2.8287444, -5.298558,   3.0320463],
                            [ 3.9611065,  2.626403,  -2.8639293]],
                           [[-0.3055225,  -0.9865407,   0.80830705],
                            [ 2.8788857,   3.5550566,  -0.92875874]]])

for i, (path, relabeled_path) in enumerate(zip(paths, relabeled_paths)):
    assert np.all(path['observation'] == relabeled_path['observation'])
    print("Path {} expert action error".format(i), rel_error(expert_actions[i], relabeled_path['action']))

We can now run Dagger on the Hopper environment, where naive behavior cloning fell short of the expert.

In [None]:
dagger_args = dict(bc_base_args_dict)

dagger_args['do_dagger'] = True
dagger_args['n_iter'] = 10

env_str = 'Hopper'
dagger_args['expert_policy_file'] = 'deeprl/policies/experts/{}.pkl'.format(env_str)
dagger_args['expert_data'] = 'deeprl/expert_data/expert_data_{}-v2.pkl'.format(env_str)
dagger_args['env_name'] = '{}-v2'.format(env_str)


In [None]:
# Delete all previous logs
remove_folder('logs/dagger/{}'.format(env_str))

for seed in range(3):
    print("Running Dagger experiment with seed", seed)
    dagger_args['seed'] = seed
    dagger_args['logdir'] = 'logs/dagger/{}/seed{}'.format(env_str, seed)
    bctrainer = BC_Trainer(dagger_args)
    bctrainer.run_training_loop()

Visualizing the Dagger results on Hopper, we see that Dagger is able to recover the performance of the expert policy after a few iterations of online interaction and expert relabeling.

In [None]:
### Visualize Dagger results on Hopper
%load_ext tensorboard
%tensorboard --logdir logs/dagger/Hopper