# ENN585 - Advanced Machine Learning - Week 2

Welcome to Week 2 of ENN585!

In this notebook we will explore Imitation Learning -- the topic of this week.

You will train an imitation learning agent for a simple Cart Pole balancing task, and then for a more complex task that involves opening a door with a robotic hand.

At the end, you can build on your code from Week 1 and train a policy network on demonstrations generated by your hand-written controller in the Fetch-Slide environment.

## Install and Setup

You can run this notebook on [Google Colab](https://colab.research.google.com/github/nikosuenderhauf/enn585/blob/main/Week%202/imitation_learning.ipynb) or locally on your computer.

This first code cell takes care of some installation and setup. If you run this locally on your own machine, we recommend setting up a conda environment and following the setup instructions on Canvas.

In [None]:
#@title Install packages - (Run this once at the start)

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

try:
  import minari
except:
  !pip install minari
  import minari


from matplotlib import pyplot as plt
from torch.utils.data import DataLoader

### Install gym-robotics and renderlab
try:
  import gymnasium as gym
  gym.spec('FetchSlide-v2')
except:
  !pip install gymnasium-robotics
  import gymnasium as gym

try:
  import renderlab as rl
except:
  !pip install renderlab
  import renderlab as rl

## Are we on Google Colab?
try:
  import google.colab
  IN_COLAB = True
except:
  IN_COLAB = False


### If on Colab, we have to set up gym's rendering. Otherwise we are ok to proceed.
if IN_COLAB:
  
  # download the expert demonstrations into the folder layout that Minari
  # expects: <dataset folder>/data/main_data.hdf5
  !wget -P CartPole-v1-expert/data https://github.com/nikosuenderhauf/enn585/raw/main/Week%202/CartPole-v1-expert/data/main_data.hdf5


  import os
  import subprocess

  # Warn (rather than fail) if no GPU is visible; the EGL rendering below needs one.
  try:
    gpu_available = subprocess.run('nvidia-smi').returncode == 0
  except OSError:
    # nvidia-smi is not installed on this runtime
    gpu_available = False
  if not gpu_available:
    print('Warning: cannot communicate with GPU. '
          'Make sure you are using a GPU Colab runtime. '
          'Go to the Runtime menu and select Choose runtime type.')
  # Add an ICD config so that glvnd can pick up the Nvidia EGL driver.
  # This is usually installed as part of an Nvidia driver package, but the Colab
  # kernel doesn't install its driver via APT, and as a result the ICD is missing.
  # (https://github.com/NVIDIA/libglvnd/blob/master/src/EGL/icd_enumeration.md)
  NVIDIA_ICD_CONFIG_PATH = '/usr/share/glvnd/egl_vendor.d/10_nvidia.json'
  if not os.path.exists(NVIDIA_ICD_CONFIG_PATH):
    with open(NVIDIA_ICD_CONFIG_PATH, 'w') as f:
      f.write("""{
      "file_format_version" : "1.0.0",
      "ICD" : {
          "library_path" : "libEGL_nvidia.so.0"
      }
  }
  """)

  # Configure MuJoCo to use the EGL rendering backend (requires GPU)
  print('Setting environment variable to use GPU rendering:')
  %env MUJOCO_GL=egl

  try:
    print('Checking that the installation succeeded:')
    import mujoco
    mujoco.MjModel.from_xml_string('<mujoco/>')
  except Exception as e:
    raise e from RuntimeError(
        'Something went wrong during installation. Check the shell output above '
        'for more information.\n'
        'If using a hosted Colab runtime, make sure you enable GPU acceleration '
        'by going to the Runtime menu and selecting "Choose runtime type".')

  print('Installation successful.')


## Let's explore the CartPole environment.

Our first imitation learning example will use the simple Cart Pole environment, a classical task in control and learning.

The task here is to balance an inverted pendulum by moving the cart to either the left or the right. The agent receives a reward of +1 for every time step in which the pendulum has not tipped beyond a certain angle. Once it does, the episode ends.

The maximum reward the agent can get is 500, since the episode is truncated after 500 time steps.

Let's have a look by executing some random actions. 

In [None]:
# Create the CartPole environment
env = gym.make('CartPole-v1', render_mode='rgb_array')

# this wraps the environment so we can record a video of its outputs and watch it later
env = rl.RenderFrame(env, "./output")

# reset the environment
observation, info = env.reset()

# we will keep track of the accumulated reward 
accumulated_reward = 0

while True:

    # sample a random action to be executed
    action = env.action_space.sample()

    # this executes the action and returns observation and reward etc
    observation, reward, terminated, truncated, info = env.step(action)
    
    # increment the accumulated reward
    accumulated_reward += reward

    # we stop the loop if we terminate (e.g. the pole falls over) or run out of time (truncated after 500 steps)
    if terminated or truncated:
      break

# show the recorded video    
env.play()

print(f'Episode ended because it was terminated: {terminated} or truncated: {truncated}')
print(f'Total reward received: {accumulated_reward}')



## Towards an Imitation Learning Agent

Instead of executing randomly sampled actions, let's introduce a policy network that acts as the agent.

The policy network $\pi(a|s)$ returns actions, given the state of the environment. 
We will define a simple neural network for the policy network, but for more complex tasks the network architecture will of course be much more involved.
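
The notation $\pi(a|s)$ describes a probability distribution over actions. As a minimal sketch (the small `toy_net` below is a hypothetical stand-in, not the network used in this notebook, and the state values are made up), a network's raw outputs can be turned into such a distribution with a softmax:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in network: 4 state dimensions in, 2 action scores out.
toy_net = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))

# A made-up CartPole-like state (position, velocity, angle, angular velocity).
state = torch.tensor([0.01, -0.02, 0.03, 0.04])

with torch.no_grad():
    logits = toy_net(state)                # unnormalized scores, one per action
    probs = torch.softmax(logits, dim=-1)  # pi(a|s): non-negative, sums to 1

print(logits)
print(probs, probs.sum())
```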

In [None]:
# Let's define a policy network \pi(a|s) that takes in a state and outputs one unnormalized score (logit) per action.
# Applying a softmax to these scores yields a probability distribution over actions.
# We can change the architecture of the network as we like, but for this example, we will use a simple 3-layer fully connected network.
class PolicyNetwork(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, output_dim)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

We can now use this network to decide on the actions. In the beginning, the network is not trained at all, so we can expect it to perform just as badly as random actions.

In [None]:
# Create the CartPole environment
env = gym.make('CartPole-v1', render_mode='rgb_array')

# this wraps the environment so we can record a video of its outputs and watch it later
env = rl.RenderFrame(env, "./output")

# here we create an instance of the policy network
# notice how we define the input and output dimensions depending on the size of the observation and action space
observation_space = env.observation_space
action_space = env.action_space
policy_net = PolicyNetwork(np.prod(observation_space.shape), action_space.n)

# reset the environment
observation, info = env.reset()

accumulated_reward = 0
while True:

    # get the action from the policy network
    # Question: why do we call .argmax() here? Inspect the output of the policy network and see if you can figure it out.
    action = policy_net(torch.tensor(observation)).argmax()

    # this executes the action and returns observation and reward etc
    observation, reward, terminated, truncated, info = env.step(action.numpy())
    accumulated_reward += reward

    # we stop the loop if we terminate (e.g. the pole falls over) or run out of time (truncated after 500 steps)
    if terminated or truncated:
      break

# show the recorded video    
env.play()

print(f'Episode ended because it was terminated: {terminated} or truncated: {truncated}')
print(f'Total reward received: {accumulated_reward}')




## Let's inspect the training dataset.

We are using [Minari](https://minari.farama.org/), a Python API that hosts a number of popular datasets for offline reinforcement learning and imitation learning.

We load a dataset for Cart Pole that was collected by an expert and replay one of the episodes.

See how we can work with the dataset in the code cell below, and also check out the documentation on Minari's website, e.g. https://minari.farama.org/content/basic_usage/


In [None]:
# load the dataset containing the expert demonstrations
%env MINARI_DATASETS_PATH=.
dataset = minari.load_dataset('CartPole-v1-expert/')

# recreate the environment used to generate the dataset
env = dataset.recover_environment(render_mode='rgb_array')
env = rl.RenderFrame(env, "./output")

# get one random episode from the dataset
episode = dataset.sample_episodes(n_episodes=1)[0]

# now use the episode data to visualize the expert's behavior
# we reset the environment using the random seed of the episode, so we get the same initial state
observation, info = env.reset(seed = episode.seed)


accumulated_reward = 0

# for all the actions in the episode ...
for action in episode.actions:
    observation, reward, terminated, truncated, info = env.step(action)
    accumulated_reward += reward

    if terminated or truncated:
        break

env.play()

# this time we get much higher rewards, as the expert has learned to solve the task
print(f'Episode ended because it was terminated: {terminated} or truncated: {truncated}')
print(f'Total reward received: {accumulated_reward}')



## Implement Behavioral Cloning

Our next step is to set up a training loop and train the policy network, using the expert demonstrations. 
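
Behavioral cloning treats imitation as supervised learning. For a dataset $\mathcal{D}$ of expert state-action pairs, one common formulation (stated here as a sketch; the symbols $\theta$ and $\mathcal{D}$ are notation introduced for this note) is to minimize the negative log-likelihood of the expert's actions under the policy:

$$\min_\theta \; \mathbb{E}_{(s,a) \sim \mathcal{D}} \left[ -\log \pi_\theta(a|s) \right]$$

For discrete actions, this is the cross-entropy loss known from classification.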

The ``MinariDataset`` is compatible with the PyTorch Dataset API, allowing us to load it directly using [PyTorch DataLoader](https://pytorch.org/docs/stable/data.html).

However, since episodes can differ in length, we need to pad them to a common length before they can be stacked into a batch.
To achieve this, we can use the [collate_fn](https://pytorch.org/docs/stable/data.html#working-with-collate-fn) argument of the PyTorch DataLoader. Let's create the ``collate_fn`` function:



In [None]:
def collate_fn(batch):
    return {        
        "observations": torch.nn.utils.rnn.pad_sequence(
            [torch.as_tensor(x.observations) for x in batch],
            batch_first=True
        ),
        "actions": torch.nn.utils.rnn.pad_sequence(
            [torch.as_tensor(x.actions) for x in batch],
            batch_first=True
        )      
    }
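
To see what the padding does, here is a small sketch on two made-up episodes of different lengths. The `SimpleNamespace` objects are hypothetical stand-ins for Minari episodes (only the `observations` and `actions` fields are mimicked), and the function repeats the `collate_fn` logic from the cell above under the illustrative name `pad_batch` so that the snippet runs on its own.

```python
from types import SimpleNamespace

import numpy as np
import torch
from torch.nn.utils.rnn import pad_sequence


def pad_batch(batch):
    # Same logic as collate_fn: zero-pad every episode to the longest one in the batch.
    return {
        "observations": pad_sequence(
            [torch.as_tensor(x.observations) for x in batch], batch_first=True
        ),
        "actions": pad_sequence(
            [torch.as_tensor(x.actions) for x in batch], batch_first=True
        ),
    }


def fake_episode(num_steps):
    # An episode with T actions stores T + 1 observations (initial state included).
    return SimpleNamespace(
        observations=np.ones((num_steps + 1, 4), dtype=np.float32),
        actions=np.ones(num_steps, dtype=np.int64),
    )


batch = pad_batch([fake_episode(3), fake_episode(5)])
print(batch["observations"].shape)  # torch.Size([2, 6, 4])
print(batch["actions"].shape)       # torch.Size([2, 5])
print(batch["actions"][0])          # tensor([1, 1, 1, 0, 0])
```

Note that the padded entries are zeros, which for Cart Pole are indistinguishable from a valid action 0, so padded steps contribute to the loss unless they are masked out.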

We can now proceed to instantiate the data loader, create the training loop and train the network.

You can experiment here by changing the number of epochs or the network architecture above. How does that influence performance?



In [None]:

# create a DataLoader that will iterate over the dataset in batches
dataloader = DataLoader(dataset, batch_size=256, shuffle=True, collate_fn=collate_fn)

# the optimizer will be used to update the policy network
optimizer = torch.optim.Adam(policy_net.parameters())

# we use a cross-entropy loss like in a classification task, as the action space is discrete
loss_fn = nn.CrossEntropyLoss()

# we train the policy network for 32 epochs, i.e. we iterate over the dataset 32 times
num_epochs = 32
for epoch in range(num_epochs):
    for batch in dataloader:
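        # each episode stores one more observation than actions, so we drop the last observation
        logits = policy_net(batch['observations'][:, :-1])
        # nn.CrossEntropyLoss expects the class dimension at index 1, so we flatten the
        # (batch, time) dimensions and pass the expert actions as class indices
        a_pred = logits.reshape(-1, logits.shape[-1])
        a_hat = batch["actions"].reshape(-1).long()
        loss = loss_fn(a_pred, a_hat)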

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"Epoch: {epoch}/{num_epochs}, Loss: {loss.item()}")
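
One detail in a loop like this deserves a closer look: `nn.CrossEntropyLoss` expects the class dimension at index 1, and padded time steps should ideally not count towards the loss. The following sketch shows one possible way to handle both, on made-up tensors. Padding the actions with `-100` relies on the default `ignore_index` of `nn.CrossEntropyLoss`; the shapes and values are illustrative assumptions, not taken from the dataset.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

torch.manual_seed(0)
num_actions = 2

# Two made-up action sequences of different lengths, padded with -100.
actions = pad_sequence(
    [torch.tensor([0, 1, 1]), torch.tensor([1, 0, 0, 1, 1])],
    batch_first=True,
    padding_value=-100,
)  # shape (2, 5)

# Made-up network outputs of shape (batch, time, num_actions).
logits = torch.randn(2, 5, num_actions)

loss_fn = nn.CrossEntropyLoss()  # ignore_index defaults to -100

# Flatten (batch, time) so that the class dimension is at index 1.
loss = loss_fn(logits.reshape(-1, num_actions), actions.reshape(-1))

# Cross-check: average the per-step loss over the real (unpadded) steps only.
valid = actions.reshape(-1) != -100
manual = nn.functional.cross_entropy(
    logits.reshape(-1, num_actions)[valid], actions.reshape(-1)[valid]
)
print(loss.item(), manual.item(), int(valid.sum()))
```

Both values agree, because the loss is averaged only over targets that are not equal to `ignore_index`.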

Now we can see the trained policy network in action. We will use the same environment as before, but this time we will use the policy network to select the actions.

**YOUR TURN**

Instead of just running a single episode, can you modify the code so that it runs 100 episodes and logs the received reward for each episode? Then you can report the average reward and standard deviation. This would be beneficial if you want to experiment with network architectures or details of the training loop, as you can very easily judge the change in performance.

Hint: turn off the rendering by setting render_mode to `None` and skip the wrapping with the `RenderFrame` class.
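
For the bookkeeping part of this exercise, a minimal sketch could look as follows. The list `episode_returns` holds made-up placeholder values; in your solution it would be filled with the accumulated reward of each evaluation episode.

```python
import numpy as np

# Placeholder returns; replace with the accumulated reward of each episode.
episode_returns = [500.0, 431.0, 500.0, 287.0, 500.0]

mean_return = np.mean(episode_returns)
std_return = np.std(episode_returns)

print(f'Average reward over {len(episode_returns)} episodes: '
      f'{mean_return:.1f} +/- {std_return:.1f}')
```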

In [None]:
# Create the CartPole environment
env = gym.make('CartPole-v1', render_mode='rgb_array')

# this wraps the environment so we can record a video of its outputs and watch it later
env = rl.RenderFrame(env, "./output")

observation, info = env.reset()
accumulated_reward = 0

# run the trained policy until the episode ends
while True:

    action = policy_net(torch.tensor(observation)).argmax()
    observation, reward, terminated, truncated, info = env.step(action.numpy())
    accumulated_reward += reward

    if terminated or truncated:
        break

env.play()

# this time we get much higher rewards, as the network has learned to solve the task from the expert demonstrations
print(f'Episode ended because it was terminated: {terminated} or truncated: {truncated}')
print(f'Total reward received: {accumulated_reward}')


## Try a Different Environment -- Opening a Door with a Robot Hand

Now that we have demonstrated behavioral cloning for the simple Cart Pole environment, we can look at a more complex task: We will learn how to open a door with a robot hand.

The environment we will work with is https://robotics.farama.org/envs/adroit_hand/adroit_door/

Have a look at the documentation, especially the parts about the action and state spaces.


**YOUR TURN**
Follow the instructions to edit the code.




In [None]:
# we load the dataset containing the expert demonstrations
dataset = minari.load_dataset('door-expert-v1', download=True)

# we recreate the environment used to generate the dataset
env = dataset.recover_environment(render_mode='rgb_array')

# as before, we generate a policy network and make sure it has the correct input and output dimensions
observation_space = env.observation_space
action_space = env.action_space
policy_net = PolicyNetwork(np.prod(observation_space.shape), np.prod(action_space.shape))


## YOUR TURN!
# Using the code blocks above as a reference, start the environment and use the untrained policy network to control the agent.
# Visualise the result using the video player as before.


## YOUR TURN!
# Now replay some of the recorded episodes from the dataset to see how the expert behaves in the environment.





We are now ready to train the policy network from the expert demonstrations. 

We can re-use the `collate_fn` function from above to create the `DataLoader` but have to choose a different loss function. 

**YOUR TURN**

Choose an appropriate loss function.

In [None]:
# create a DataLoader that will iterate over the dataset in batches
dataloader = DataLoader(dataset, batch_size=256, shuffle=True, collate_fn=collate_fn)

# check if we have cuda available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# a new policy net instance
policy_net = PolicyNetwork(np.prod(observation_space.shape), np.prod(action_space.shape)).to(device)


# the optimizer will be used to update the policy network
optimizer = torch.optim.Adam(policy_net.parameters())


# YOUR TURN! Choose a loss function that is appropriate for the task and implement the training loop.
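# The actions are continuous here, so we treat the problem as regression; the mean squared error is one reasonable choice.
loss_fn = nn.MSELoss()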

# we train the policy network for 16 epochs, i.e. we iterate over the dataset 16 times
# YOUR TURN! You can experiment and change the number of epochs, batch size, etc. and see how that influences the performance of the trained network.
num_epochs = 16
losses = []

for epoch in range(num_epochs):
    for batch in dataloader:
        
        # the predicted action according to the policy network
        a_pred = policy_net(batch['observations'][:, :-1,:].float().to(device))
        
        # the true action by the expert
        a_hat = batch["actions"].float().to(device)
        
        # The loss should measure the difference between both. Make sure you choose an appropriate loss function.
        loss = loss_fn(a_pred, a_hat)
        losses.append(loss.item())
        
        # update the policy network using the optimizer
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"Epoch: {epoch}/{num_epochs}, Loss: {loss.item()}")

plt.plot(losses)

Let's test the policy network we just trained!
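
One thing to keep in mind when testing: the last layer of the policy network is linear, so its outputs are unbounded, whereas the door environment documents its actions as lying in [-1, 1]. A hedged sketch of how predictions could be clipped to the valid range before they are passed to `env.step` is shown below. The bounds and the 28-dimensional action are assumptions based on the environment documentation; in practice you would read them from `env.action_space.low` and `env.action_space.high`.

```python
import numpy as np
import torch

# Assumed action bounds for illustration (read them from env.action_space in practice).
low = -np.ones(28, dtype=np.float32)
high = np.ones(28, dtype=np.float32)

# A made-up raw network output with some values outside the valid range.
raw_action = torch.linspace(-2.0, 2.0, steps=28)

# Convert to numpy and clip element-wise to the action bounds.
action = np.clip(raw_action.detach().cpu().numpy(), low, high)

print(action.min(), action.max())
```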

In [None]:
# Create the AdroitHandDoor environment
env = gym.make('AdroitHandDoor-v1', render_mode='rgb_array')

# this wraps the environment so we can record a video of its outputs and watch it later
env = rl.RenderFrame(env, "./output")

# as always, we reset the environment and start the episode
observation, info = env.reset()
accumulated_reward = 0

while True:
    action = policy_net(torch.tensor(observation).float().to(device))
  
    observation, reward, terminated, truncated, info = env.step(action.detach().cpu().numpy())
    accumulated_reward += reward

    if terminated or truncated:
        break

env.play()

print(f'Episode ended because it was terminated: {terminated} or truncated: {truncated}')
print(f'Total reward received: {accumulated_reward}')

## Collect your own Dataset for Imitation Learning and Reinforcement Learning

Last week you wrote a rudimentary controller that could do somewhat better than random actions at the Fetch-Slide task.

Let's use Minari's `DataCollector` class to collect a dataset from this controller.
Then, adapt the code blocks above to train a policy network that can imitate your hand-written controller. This policy network can serve as a starting point for your reinforcement learning experiments in Assessment 1.

First, check the documentation at https://minari.farama.org/tutorials/using_datasets/behavioral_cloning/#dataset-generation to see how a dataset can be collected. Notice how the `DataCollector` class wraps the environment while you execute actions from your 'expert' (hand-written) controller. The dataset is stored automatically in Minari's dataset directory, which defaults to `~/.minari/datasets` and can be changed through the `MINARI_DATASETS_PATH` environment variable (an earlier cell in this notebook sets it to the current directory). From there you can load it later using `minari.load_dataset()`.



In [None]:
# YOUR TURN!
# Follow the instructions above to:
# - copy the code from last week that runs your hand-written controller on the FetchSlide environment
# - use the DataCollector class to collect a dataset of expert demonstrations from this controller
# - train a policy network using the expert demonstrations
# - visualize the trained policy network controlling the FetchSlide environment