### **Due Date**
2/29/2024 at 11:59PM EST

# **Introduction**

Welcome to Assignment 2 of CS 4756. In this assignment, you will train an agent using demonstrations from an expert. Concretely, you will:
* Implement behavior cloning (BC) and dataset aggregation (DAgger) methods
* **Extra Credit:** Get imitation learning working under causal confounds

You will use the Hopper agent for this assignment, which is part of Gym’s MuJoCo environments. Refer to the Gym website for more details about the [Hopper environment](https://gymnasium.farama.org/environments/mujoco/hopper/).


Please read through the following paragraphs carefully, as they will apply to this and all future assignments.

**Getting Started:** This assignment should be completed in [Google Colab](https://colab.research.google.com/). To access the Python files bc.py and dagger.py, which you will be editing, first upload the folder A2_FILES to your Google Drive and then mount your Google Drive in Colab. To do so, carefully follow the directions below in the section **Mounting Google Drive in Colab**, or reference the instructions [here](https://saturncloud.io/blog/how-to-import-python-files-in-google-colaboratory/). Additionally, make sure to switch your runtime type to GPU; this will help speed up the training process.

**Evaluation:**
Your code will be tested for correctness and, for certain assignments, speed. For this assignment, performance will not be graded harshly: we provide approximate expected rewards as lower bounds, but you are not expected to replicate them exactly. You should, however, make an effort to justify the approach that led to your results. Please remember that all assignments should be completed individually.

**Academic Integrity:** We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else’s code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don’t try. We trust you all to submit your own work only; please don’t let us down. If you do, we will pursue the strongest consequences available to us.

**Getting Help:** The [Resources](https://www.cs.cornell.edu/courses/cs4756/2024sp/#resources) section on the course website is your friend! If you ever feel stuck in these projects, please feel free to avail yourself of office hours and Edstem! If you are unable to make any of the office hours listed, please let the TAs know and we will be happy to assist. If you need a refresher for PyTorch, please see this [60 minute blitz](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html)! For NumPy, please see the quickstart [here](https://numpy.org/doc/stable/user/quickstart.html) and full API [here](https://numpy.org/doc/stable/reference/).


### **Imports**

In [None]:
!apt-get install -y \
    libgl1-mesa-dev \
    libgl1-mesa-glx \
    libglew-dev \
    libosmesa6-dev \
    software-properties-common

!apt-get install -y patchelf
!pip install gym

!pip install free-mujoco-py
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1
!pip install imageio==2.4.1
!pip install -U colabgymrender
!pip install mujoco

In [None]:
import gym
import torch.nn as nn
import torch
import numpy as np
import random
from tqdm import tqdm
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader
from torch.optim import optimizer
import matplotlib.pyplot as plt
from IPython import display as ipythondisplay

In [None]:
# Setting the seed to ensure reproducibility
def reseed(seed):
  torch.manual_seed(seed)
  random.seed(seed)
  np.random.seed(seed)

reseed(42)

### **Mounting Google Drive in Colab**

Before you complete this step, make sure that you have uploaded the folder A2_FILES to your Google Drive. Once you have done that, you need to mount your Google Drive in Colab. In order to do so, run the cell below. Running this cell will prompt you to authorize Colab to access your drive. Follow the instructions to complete the authorization process.

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount = True)

Next, locate A2_FILES on the left panel in Colab. To do so, navigate to Files/drive/MyDrive. At this point, you should see the contents of your Google Drive. Locate A2_FILES in your drive, and if necessary, modify the cell below such that you are correctly indicating the file path to A2_FILES. You will append the path to A2_FILES to the system path. If you have completed this step correctly, you should be able to successfully import the BC and DAgger modules into this notebook.

In [None]:
import sys
sys.path.append('/content/drive/MyDrive/A2_FILES')
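
If the imports later in the notebook fail, a quick check of the folder contents can narrow down the problem. The following is a minimal sketch, not part of the assignment code: the helper name `missing_files` is hypothetical, and the path in the usage comment is simply the default from the cell above.

```python
import os

def missing_files(folder, required=("bc.py", "dagger.py")):
    """Return the required file names that are not present in `folder`."""
    return [name for name in required
            if not os.path.isfile(os.path.join(folder, name))]

# Hypothetical usage in Colab, assuming the default path from the cell above:
# missing_files('/content/drive/MyDrive/A2_FILES')
# An empty list means both files were found.
```

A non-empty result usually means the path appended to `sys.path` does not match where A2_FILES actually lives in your Drive.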

### **Setting Up the Environment**

In [None]:
def make_env(env_id, seed=42, p_tremble=0.0):
    env = gym.make(env_id, render_mode=None) # Set render_mode='rgb_array' to enable rendering
    env = gym.wrappers.RecordEpisodeStatistics(env)
    env.seed(seed)
    env.action_space.seed(seed)
    env.observation_space.seed(seed)
    return env
env = make_env('Hopper-v3')

### **Visualizing the Hopper environment with random actions**

We have provided functions to visualize the environment and compute rewards on the Hopper environment with random actions. Reading through this code will familiarize you with the environment and set you up for the next parts of this assignment.

In [None]:
plt.axis('off')
done = False
visualize = False # set to false in order to disable rendering code
obs = env.reset()
total_random_reward = 0
i = 0
while not done:
    i += 1
    if i%5==0 and visualize:
        ipythondisplay.clear_output(wait=True)
        screen = env.render()
        plt.imshow(screen[0])
        ipythondisplay.display(plt.gcf())
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    total_random_reward += reward
    if done:
        break
print("Total Reward using Random Actions = ", total_random_reward)

**Approximate expected reward for total reward using random actions: 27**

In [None]:
# Download Hopper expert policy
!wget https://github.com/portal-cornell/cs4756-robot-learning-sp24/raw/main/assignments/A2/experts/hopper.pt

### **Neural Networks in PyTorch**

We have provided some code for implementing simple neural networks (fully connected, multilayer perceptrons) in PyTorch, including the ExpertActor and Learner classes. We have also provided checkpointing code for saving your best-performing model. If you wish to learn more about how to construct and train neural networks in PyTorch, check out the tutorials on [pytorch.org](https://pytorch.org/).

### ExpertActor Class

In [None]:
LOG_STD_MAX = 2
LOG_STD_MIN = -5

class ExpertActor(nn.Module):
    def __init__(self, env):
        super().__init__()
        self.fc1 = nn.Linear(np.array(env.observation_space.shape).prod(), 256)
        self.fc2 = nn.Linear(256, 256)
        self.fc_mean = nn.Linear(256, np.prod(env.action_space.shape))
        self.fc_logstd = nn.Linear(256, np.prod(env.action_space.shape))
        # action rescaling
        self.register_buffer(
            "action_scale",
            torch.tensor(
                (env.action_space.high - env.action_space.low) / 2.0,
                dtype=torch.float32,
            ).reshape(1, -1),
        )
        self.register_buffer(
            "action_bias",
            torch.tensor(
                (env.action_space.high + env.action_space.low) / 2.0,
                dtype=torch.float32,
            ).reshape(1, -1),
        )

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        mean = self.fc_mean(x)
        log_std = self.fc_logstd(x)
        log_std = torch.tanh(log_std)
        log_std = LOG_STD_MIN + 0.5 * (LOG_STD_MAX - LOG_STD_MIN) * (
            log_std + 1
        )

        return mean, log_std

    def get_action(self, x):
        mean, log_std = self(x)
        std = log_std.exp()
        normal = torch.distributions.Normal(mean, std)
        x_t = normal.rsample()  # for reparameterization trick (mean + std * N(0,1))
        y_t = torch.tanh(x_t)
        action = y_t * self.action_scale + self.action_bias
        log_prob = normal.log_prob(x_t)
        # Enforcing Action Bound
        log_prob -= torch.log(self.action_scale * (1 - y_t.pow(2)) + 1e-6)
        log_prob = log_prob.sum(1, keepdim=True)
        mean = torch.tanh(mean) * self.action_scale + self.action_bias
        return action, log_prob, mean

    def get_expert_action(self, obs, random_prob=0.0):
        if np.random.random() < random_prob:
            return env.action_space.sample()
        else:
            action = self.get_action(torch.tensor([obs]).float())
            return np.array(action[0][0].detach().cpu())

ckpt_path = "hopper.pt"
expert = ExpertActor(env).to('cpu')
expert.load_state_dict(torch.load(str(ckpt_path), map_location='cpu'))
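
The two buffers registered in `ExpertActor` define an affine map from a tanh-squashed value in [-1, 1] onto the environment's action bounds. The NumPy sketch below reproduces that map in isolation; the function name `rescale_action` and the example bounds are made up for illustration and are not Hopper's.

```python
import numpy as np

def rescale_action(y, low, high):
    """Map y in [-1, 1] (e.g. a tanh output) onto the box [low, high]."""
    scale = (high - low) / 2.0   # same formula as the action_scale buffer
    bias = (high + low) / 2.0    # same formula as the action_bias buffer
    return y * scale + bias

low = np.array([-1.0, 0.0, -2.0])    # illustrative bounds, not Hopper's
high = np.array([1.0, 4.0, 2.0])
squashed = np.tanh(np.array([-50.0, 0.0, 50.0]))
print(rescale_action(squashed, low, high))
```

For Hopper, whose action bounds are [-1, 1] in every dimension, the scale works out to 1 and the bias to 0, so the map leaves the tanh output unchanged.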

### Learner Class

In [None]:
class Learner(nn.Module):
    def __init__(self, env, hidden_dim = 256, random_prob=0.0):
        super().__init__()
        self.fc1 = nn.Linear(np.array(env.observation_space.shape).prod(), hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc_out = nn.Linear(hidden_dim, np.prod(env.action_space.shape))

        self.env = env
        self.random_prob = random_prob

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        out = F.tanh(self.fc_out(x))
        return out

    def get_action(self, obs):
        if np.random.random() < self.random_prob:
            return self.env.action_space.sample()
        action = self.forward(torch.tensor([obs]).float())
        return np.array(action[0].detach().cpu())

### Checkpointing Functions

In [None]:
def get_checkpoint_path(algo):
    """Return the path to save the best performing model checkpoint.

    Parameters:
        algo (str)
          Indicates which algorithm will be used to train the model

    Returns:
        checkpoint_path (str)
            The path to save the best performing model checkpoint
    """
    if algo == "bc":
      return 'best_bc_checkpoint.pth'
    elif algo == "dagger":
      return 'best_dagger_checkpoint.pth'
    return 'best_model_checkpoint.pth'

def load_model_checkpoint(checkpoint_path):
    """Load a model checkpoint from disk.

    Parameters:
        checkpoint_path (str)
            The path to load the checkpoint from

    Returns:
        model (torch.nn.Module)
            The model loaded from the checkpoint
    """
    model = Learner(env)
    model.load_state_dict(torch.load(checkpoint_path))
    return model

### **Visualizing the Hopper environment with the expert policy**

We have provided a visualization for computing rewards using the expert policy on the Hopper environment.

In [None]:
plt.axis('off')
done = False
visualize = False # set to false in order to disable rendering code
reseed(1)
obs = env.reset(seed=1)
total_expert_reward = 0
i = 0
while not done:
    i += 1
    if i%20==0 and visualize:
        ipythondisplay.clear_output(wait=True)
        screen = env.render()
        plt.imshow(screen[0])
        ipythondisplay.display(plt.gcf())
    with torch.no_grad():
        action = expert.get_expert_action(obs)
    obs, reward, done, info = env.step(action)
    total_expert_reward += reward
    if done:
        break
print(f"Total Reward using Expert Policy = {total_expert_reward}\nTotal Reward using Random Actions = {total_random_reward}\n")

**Approximate expected reward for total reward using expert policy: 2238**

### **Data collection**

We have provided some code to collect 50 demonstrations using the expert policy. To collect a different number of trajectories, change the value of the NUM_TRAJS variable.

### Collecting and processing offline data

In [None]:
### Collecting trajectories (i.e. demonstrations) using the expert policy
NUM_TRAJS = 50
observations, actions = [], []
reseed(1)
for traj_num in tqdm(range(NUM_TRAJS)):
    print("Collecting trajectory ", traj_num+1)
    done = False
    obs = env.reset(seed = 1)
    while not done:
        with torch.no_grad():
            action = expert.get_expert_action(obs)
            observations.append(obs)
            actions.append(action)
            obs, reward, done, info = env.step(action)
        if done:
            break
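
The collection loop leaves `observations` and `actions` as Python lists holding one array per timestep. Supervised training generally wants two aligned 2-D arrays instead. The sketch below shows the stacking step on stand-in random data; the dimensions 11 and 3 match Hopper's observation and action sizes, while the names `demo_obs` and `demo_act` are illustrative only. How `bc.train` actually consumes the lists is defined in bc.py, which is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for collected data: 5 timesteps of 11-D observations and 3-D actions.
demo_obs = [rng.normal(size=11) for _ in range(5)]
demo_act = [rng.uniform(-1, 1, size=3) for _ in range(5)]

obs_array = np.stack(demo_obs).astype(np.float32)   # shape (5, 11)
act_array = np.stack(demo_act).astype(np.float32)   # shape (5, 3)
assert len(obs_array) == len(act_array)             # one action per observation
print(obs_array.shape, act_array.shape)
```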

# **Q1: Behavior Cloning (BC) with Shaky Hands**

To begin, fill in the implementation for the training loop function in **bc.py** found in **A2_FILES**. We already provide the loss function and optimizer; you only need to iterate through your dataloader and return the updated policy!

Once you finish the training loop implementation, it is now time to build up your agents! **Behavior cloning (BC)** is the simplest imitation learning algorithm, where we perform supervised learning on the given (offline) expert dataset. We either do this via log-likelihood maximization (cross-entropy minimization) in the discrete action case, or mean-squared error minimization (can also do MLE) in the continuous control setting.
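
As a toy illustration of the continuous-control case, the sketch below fits a linear policy to synthetic "expert" data by gradient descent on the squared error. It is deliberately not the PyTorch training loop you are asked to write in bc.py: the linear expert, the data, and every name here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
obs = rng.normal(size=(256, 4))       # synthetic observations
W_expert = rng.normal(size=(4, 2))    # hypothetical linear expert
act = obs @ W_expert                  # expert actions (the regression targets)

W = np.zeros((4, 2))                  # learner parameters
lr = 0.1
for _ in range(500):
    pred = obs @ W
    # Gradient of the squared error, summed over action dims and averaged over samples.
    grad = 2.0 * obs.T @ (pred - act) / len(obs)
    W -= lr * grad

mse = np.mean((obs @ W - act) ** 2)
print(f"final MSE: {mse:.2e}")
```

The learner only ever sees states drawn from the expert's data, which is exactly the property that becomes a weakness once the learner drifts away from those states.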

If implemented correctly, training your BC model should take roughly 15 minutes.

### Train Behavior Cloning (BC) Model

In [None]:
import bc

bc_learner = Learner(env)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
bc_learner.to(device)
checkpoint_path = get_checkpoint_path("bc")
reseed(2)
bc.train(bc_learner, observations, actions, checkpoint_path, num_epochs = 1500)

### Visualize the learner policy and compare rewards with expert policy

In [None]:
done = False
visualize = False # set to false in order to disable rendering code
reseed(2)
obs = env.reset(seed = 2)
total_learner_reward = 0
i = 0
while not done:
    i += 1
    if i%20==0 and visualize:
        ipythondisplay.clear_output(wait=True)
        screen = env.render()
        plt.imshow(screen[0])
        ipythondisplay.display(plt.gcf())
    with torch.no_grad():
        action = bc_learner.get_action(obs)
    obs, reward, done, info = env.step(action)
    total_learner_reward += reward
    if done:
        break
print(f"Total Reward using Expert Policy = {total_expert_reward}\nTotal Reward using Learned Policy = {total_learner_reward}\n")

**Approximate expected reward for total reward using learned policy: 1000**

Most likely, the performance of your BC agent will be very close to the expert. However, what happens if your learner has SHAKY HANDS, i.e., it executes random actions every few timesteps?

Concretely, set the probability of a random action by the learner to be just 5% (code already provided). You will probably see that the performance of the learner tanks!
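
The `random_prob` attribute of `Learner` implements this perturbation: at each step, with probability `random_prob`, the policy's action is replaced by one sampled uniformly from the action space. The sketch below isolates that mechanism with a hypothetical helper `shaky_action` and a made-up constant policy, and counts how often the intended action is overridden.

```python
import numpy as np

def shaky_action(policy_action, rng, random_prob=0.05, low=-1.0, high=1.0):
    """Return a uniform random action with probability random_prob, else the policy's action."""
    if rng.random() < random_prob:
        return rng.uniform(low, high, size=policy_action.shape)
    return policy_action

rng = np.random.default_rng(0)
intended = np.zeros(3)     # stand-in for the learner's output
steps = 10_000
n_random = sum(
    not np.array_equal(shaky_action(intended, rng), intended) for _ in range(steps)
)
print(f"fraction of perturbed steps: {n_random / steps:.3f}")
```

At 5%, a random action occurs on average once every 20 steps, so an episode lasting several hundred steps will contain many of them, each one a chance to land in a state the expert demonstrations never covered.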

### Add 5% random actions to learner and check rewards

In [None]:
bc_learner.random_prob = 0.05
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
bc_learner.to(device)
reseed(2)
bc.train(bc_learner, observations, actions, checkpoint_path, num_epochs = 1500)

### Visualize learner policy with random actions and compare rewards with expert policy

In [None]:
done = False
visualize = False # set to false in order to disable rendering code
reseed(2)
obs = env.reset(seed=2)
total_learner_reward = 0
i = 0
while not done:
    i += 1
    if i%5==0 and visualize:
        screen = env.render()
        plt.imshow(screen[0])
        ipythondisplay.display(plt.gcf())
        ipythondisplay.clear_output(wait=True)
    with torch.no_grad():
        action = bc_learner.get_action(obs)
    obs, reward, done, info = env.step(action)
    total_learner_reward += reward
    if done:
        break
print(f"Total Reward using Expert Policy = {total_expert_reward}\nTotal Reward using Learned Policy (Random Actions)= {total_learner_reward}\n")

**Approximate expected reward for total reward using learned policy with 5% random actions: 111**

# **Q2: DAgger**

**Dataset aggregation (DAgger)** is a fundamentally interactive algorithm, where we can query the expert any time we want to get information about how to proceed. This allows for significantly more freedom for the learner, as it can ask the expert anywhere and not be limited by the dataset that it is given to learn from.

**Can we overcome shaky hands with DAgger?** Fundamentally, this algorithm allows the learner to recover from bad states and should lead to much better performance than simply behavior cloning a fixed set of expert demonstrations. For this portion of the assignment, you will interact with the environment using the learner policy with random actions. You will do so in **dagger.py** found in **A2_FILES**.

Remember to initialize the DAgger policy with the already learned BC policy and your dataset with the already collected expert demonstrations for BC.
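
The overall shape of the loop (roll out the learner, have the expert relabel the visited states, aggregate, retrain) can be seen on a toy 1-D problem. The sketch below is not the implementation expected in dagger.py: the dynamics, the linear expert, the least-squares learner, and all names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def expert_action(s):
    """Hypothetical expert: push the 1-D state back toward zero."""
    return -0.5 * s

def rollout(w, random_prob, horizon=20):
    """Run the linear policy a = w * s with shaky hands; return the visited states."""
    s, states = rng.normal(), []
    for _ in range(horizon):
        states.append(s)
        a = rng.uniform(-1, 1) if rng.random() < random_prob else w * s
        s = s + a                                   # toy dynamics: the action shifts the state
    return states

def fit(states, actions):
    """Least-squares fit of the scalar weight w in a = w * s."""
    states, actions = np.asarray(states), np.asarray(actions)
    return float(states @ actions / (states @ states))

# Start from expert demonstrations (w = -0.5 is the expert itself), then aggregate.
data_s = rollout(w=-0.5, random_prob=0.0)
data_a = [expert_action(s) for s in data_s]
w = fit(data_s, data_a)
for _ in range(5):
    visited = rollout(w, random_prob=0.05)          # learner acts in the environment
    data_s += visited                               # states the learner actually reached
    data_a += [expert_action(s) for s in visited]   # expert relabels those states
    w = fit(data_s, data_a)                         # retrain on the aggregated dataset
print(len(data_s), round(w, 3))
```

The key difference from BC is which states end up in the dataset: they come from the learner's own (perturbed) rollouts, while the action labels still come from the expert.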



### Initialize DAgger with BC

In [None]:
dagger_learner = bc_learner
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
dagger_learner.to(device)

### Interact with the environment using the learner policy with random actions

In [None]:
import dagger

observations, actions = [], []
checkpoint_path = get_checkpoint_path("dagger")
seed = 2
reseed(seed)
dagger.interact(env, dagger_learner, expert, observations, actions, checkpoint_path, seed, num_epochs = 500)

**Approximate expected reward for 100th interaction with the environment: 1328**

# **Extra Credit: Causal Confounds**

Congratulations, you made it! You have implemented your first few (“deep” :') ) imitation learning algorithms in PyTorch.

With that in mind, let’s dig a little deeper. A common problem in the real world is hidden information. What if parts of the robot's state are hidden from the learner? How well does imitation learning do when the expert has full state knowledge, but the learner does not?

You will need to:
* Create a “partially observable” Hopper environment where the last observation index (refer to Gym documentation) is hidden from the learner (note that it’s still available to the expert!)
* Obtain rewards for both BC and DAgger. How well do BC and DAgger work for the partially observable Hopper environment? Explain the performance of each.

**Note:** For this part, BC and DAgger should just work if you did things right.
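
One possible reading of "hidden from the learner" is to drop the final entry of each observation before it reaches the learner, while passing the full observation to the expert. The sketch below shows only that masking step; the helper name `hide_last_index` is hypothetical, and zeroing the entry instead of dropping it would be an equally valid assumption.

```python
import numpy as np

def hide_last_index(obs):
    """Return the learner's view: the observation with its final entry dropped."""
    return np.asarray(obs)[..., :-1]

full_obs = np.arange(11, dtype=np.float32)   # stand-in for an 11-D Hopper observation
learner_obs = hide_last_index(full_obs)      # what the learner would see (10-D)
expert_obs = full_obs                        # the expert still gets the full state
print(learner_obs.shape, expert_obs.shape)
```

Dropping the entry changes the learner's input dimension, so the first layer of the learner network would need to be sized for the shorter observation.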
