# Learning from Demonstrations

<img src="https://raw.githubusercontent.com/jeremiedecock/polytechnique-csc-53439-ep-2025-students/refs/heads/main/assets/logo.jpg?raw=true" style="float: left; width: 15%" />

[CSC_53439_EP-2025](https://moodle.ip-paris.fr/course/view.php?id=10716) Lab session #4

2019-2025 Jérémie Decock

[![Open in Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jeremiedecock/polytechnique-csc-53439-ep-2025-students/blob/main/lab4_lfd_and_pbrl.ipynb)

[![My Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/jeremiedecock/polytechnique-csc-53439-ep-2025-students/main?filepath=lab4_lfd_and_pbrl.ipynb)

[![NbViewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.jupyter.org/github/jeremiedecock/polytechnique-csc-53439-ep-2025-students/blob/main/lab4_lfd_and_pbrl.ipynb)

[![Local](https://img.shields.io/badge/Local-Save%20As...-blue)](https://github.com/jeremiedecock/polytechnique-csc-53439-ep-2025-students/raw/main/lab4_lfd_and_pbrl.ipynb)

## Introduction

The purpose of this lab is to introduce some classic algorithms of *Learning from Demonstrations*. We will see how they work, their caveats and benefits.

*Learning from Demonstrations* (LfD) is an approach in reinforcement learning where an agent learns behaviors by observing examples provided by a demonstrator. This field is divided into two main branches: *Imitation Learning* and *Inverse Reinforcement Learning* (IRL).

- **Imitation Learning** involves directly mimicking the demonstrator's actions. The agent learns to replicate the demonstrated behavior without attempting to infer the underlying objectives or rewards driving the actions. It focuses on reproducing successful behaviors in a given task, often through supervised learning.

- **Inverse Reinforcement Learning (IRL)**, on the other hand, seeks to infer the demonstrator's underlying reward function. Instead of copying actions, the agent tries to discover the goals or preferences that motivated the demonstrator’s behavior. Once the reward function is learned, the agent can optimize its own policy to achieve similar outcomes.

The main objective of *Learning from Demonstrations* is to enable agents to learn effectively in settings where the reward function is unknown or poorly defined, by leveraging expert demonstrations as a source of supervisory signal.

Both parts of this lab will focus on *Imitation Learning*. In the first part, we will implement the *Behavioral Cloning* algorithm, while the second part will introduce the *GAIL* algorithm.

You can either:
- open, edit and execute the notebook in *Google Colab* following this link: https://colab.research.google.com/github/jeremiedecock/polytechnique-csc-53439-ep-2025-students/blob/main/lab4_lfd_and_pbrl.ipynb ; this is the **recommended** choice as you have nothing to install on your computer
- open, edit and execute the notebook in *MyBinder* (if for any reason the Google Colab solution doesn't work): https://mybinder.org/v2/gh/jeremiedecock/polytechnique-csc-53439-ep-2025-students/main?filepath=lab4_lfd_and_pbrl.ipynb
- download, edit and execute the notebook on your computer if Python 3 and JupyterLab are already installed: https://github.com/jeremiedecock/polytechnique-csc-53439-ep-2025-students/raw/main/lab4_lfd_and_pbrl.ipynb

If you work with Google Colab or MyBinder, **remember to save or download your work regularly or you may lose it!**

## Lab Submission

Please submit your completed notebook in [Moodle : "Lab 4 - Submission"](https://moodle.ip-paris.fr/mod/assign/view.php?id=185691).

### Submission Guidelines

1. **File Naming:** Rename your notebook as follows: **`firstname_lastname-04.ipynb`** where `firstname` and `lastname` match your email address. *Example: `jesse_read-04.ipynb`*
2. **Clear Output Cells:** To reduce file size (**must be under 500 KB**), clear all output cells before submitting. This includes rendered images, videos, plots, and dataframes...
   - **JupyterLab:**
     - Click **"Kernel" → "Restart Kernel and Clear Outputs of All Cells..."**
     - Then go to **"File" → "Save Notebook As..."**
   - **Google Colab:**
     - Click **"Edit" → "Clear all outputs"**
     - Then go to **"File" → "Download" → "Download .ipynb"**
   - **VSCode:**
     - Click **"Clear All Outputs"**
     - Then **save your file**
3. **Upload Your File:** Only **`.ipynb`** files are accepted.

**Note:** Bonus parts (if any) are optional, as their name suggests.

## Set up the Python environment

This notebook relies on several libraries including `torch`, `gymnasium`, `numpy`, `pandas`, `seaborn`, `imageio`, `pygame`, and `tqdm`.
A complete list of dependencies can be found in the following [requirements_lab4.txt](https://raw.githubusercontent.com/jeremiedecock/polytechnique-csc-53439-ep-2025-students/main/requirements_lab4.txt) file.

### If you use Google Colab

If you use Google Colab, execute the next cell to install required libraries.

In [None]:
import sys, subprocess

def is_colab():
    return "google.colab" in sys.modules

def run_subprocess_command(cmd):
    # run the command
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE)
    # print the output
    for line in process.stdout:
        print(line.decode().strip())

if is_colab():
    run_subprocess_command("apt install swig")
    run_subprocess_command("pip install -r https://raw.githubusercontent.com/jeremiedecock/polytechnique-csc-53439-ep-2025-students/main/requirements_lab4_google_colab.txt")

### If you have downloaded the notebook on your computer and execute it in your own Python environment

To set up the necessary dependencies, run the following commands to establish a [Python virtual environment (venv)](https://docs.python.org/3/library/venv.html) that includes all the essential libraries for this lab.

#### On Posix systems (Linux, MacOSX, WSL, ...)

```bash
python3 -m venv env-lab4
source env-lab4/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install -r https://raw.githubusercontent.com/jeremiedecock/polytechnique-csc-53439-ep-2025-students/main/requirements_lab4.txt
```

#### On Windows

```bash
python3 -m venv env-lab4
env-lab4\Scripts\activate.bat
python3 -m pip install --upgrade pip
python3 -m pip install -r https://raw.githubusercontent.com/jeremiedecock/polytechnique-csc-53439-ep-2025-students/main/requirements_lab4.txt
```

### Run CSC-53439-EP notebooks locally in a dedicated Docker container

If you are familiar with Docker (or Podman), an image is available on Docker Hub for this lab:

```bash
docker run -it --rm --user root -p 8888:8888 -e NB_UID=$(id -u) -e NB_GID=$(id -g) -v "${PWD}":/home/jovyan/work jdhp/csc-53439-ep-lab4:latest
```

### Import required packages

In [None]:
import gymnasium as gym
from IPython.display import Video
import json
import lzma
import numpy as np
from numpy.typing import NDArray
import pandas as pd
from pathlib import Path
import torch
from typing import List, Tuple, Deque, Optional, Callable

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt

import seaborn as sns
from tqdm.notebook import tqdm

In [None]:
gym.__version__

In [None]:
sns.set_context("talk")

In [None]:
FIGS_DIR = Path("figs/")       # Where to save figures (.gif files)
PLOTS_DIR = Path("figs/")      # Where to save plots (.png or .svg files)
MODELS_DIR = Path("models/")   # Where to save models (.pth files)

In [None]:
for directory in (FIGS_DIR, PLOTS_DIR, MODELS_DIR):
    directory.mkdir(parents=True, exist_ok=True)  # Create the directory if it does not exist yet

## PyTorch setup

PyTorch can run on both CPUs and GPUs. The following cell will determine the device PyTorch will use. If a GPU is available, PyTorch will use it; otherwise, it will use the CPU.

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # Set the device to CUDA if available, otherwise use CPU


To use a GPU on Google Colab, you must also enable it by following the steps outlined [here](https://colab.research.google.com/notebooks/gpu.ipynb).

In [None]:
print("Available GPUs:")
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"- Device {i}: {torch.cuda.get_device_name(i)}")
else:
    print("- No GPU available.")

If you have a very recent GPU and want to use it, you might need to install a specific version of PyTorch compatible with your CUDA version.
To do so, edit the [requirements_lab4.txt](https://raw.githubusercontent.com/jeremiedecock/polytechnique-csc-53439-ep-2025-students/main/requirements_lab4.txt) file and replace the current version of PyTorch with the one compatible with your CUDA version.
Check the [official PyTorch website](https://pytorch.org/get-started/locally/) for more information.

Note that the GPU is not very useful for CartPole (but useful for MuJoCo) because CartPole is a simple and quick problem to solve, and CUDA spends more time transferring data between the CPU and GPU than processing it directly on the CPU.

You can uncomment the next cell to explicitly instruct PyTorch to train neural networks using the CPU.

In [None]:
# device = "cpu"

In [None]:
print(f"PyTorch will train and test neural networks on {device}")

## Part 1: Behavioral Cloning

### Exercise 1: Hands on MountainCar environment

MountainCar is a classic reinforcement learning environment. In this simple 2D scenario, an underpowered car must climb a hill, but it lacks the power to ascend directly. Instead, the car must learn to use the hill's slopes to build momentum and ultimately reach the flag at the top. While the environment is straightforward, it becomes interesting due to its sparse reward signal, making it an excellent candidate for learning from demonstrations.

**Task 1:** Refer to the [MountainCar Environment](https://gymnasium.farama.org/environments/classic_control/mountain_car/) documentation to familiarize yourself with the MountainCar environment if you are not already familiar with it.

Print some information about the environment:

In [None]:
env = gym.make('MountainCar-v0', render_mode="rgb_array")

mountain_car_state_dim = env.observation_space.shape[0]
mountain_car_action_dim = env.action_space.n.item()

print(f"State space is: { env.observation_space }")
print(f"Action space is: { env.action_space }")
print("Actions are: {" + ", ".join([str(a) for a in range(env.action_space.n)]) + "}")

env.close()

**Task 2:** Run the following cells and check different basic policies (for instance constant actions or randomly drawn actions) to discover the MountainCar environment.

#### Test the MountainCar environment with a constant policy

In [None]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

In [None]:
VIDEO_DIRNAME = "lab4_mountain-car_action0"

env = gym.make('MountainCar-v0', render_mode='rgb_array')
env = gym.wrappers.RecordVideo(env, FIGS_DIR / VIDEO_DIRNAME)

observation, info = env.reset()
done = False

for t in range(200):
    action = 0
    observation, reward, done, truncated, info = env.step(action)

env.close()

Video(FIGS_DIR / VIDEO_DIRNAME / "rl-video-episode-0.mp4", embed=True, html_attributes="controls autoplay loop")

In [None]:
VIDEO_DIRNAME = "lab4_mountain-car_action1"

env = gym.make('MountainCar-v0', render_mode='rgb_array')
env = gym.wrappers.RecordVideo(env, FIGS_DIR / VIDEO_DIRNAME)

observation, info = env.reset()
done = False

for t in range(200):
    action = 1
    observation, reward, done, truncated, info = env.step(action)

env.close()

Video(FIGS_DIR / VIDEO_DIRNAME / "rl-video-episode-0.mp4", embed=True, html_attributes="controls autoplay loop")

In [None]:
VIDEO_DIRNAME = "lab4_mountain-car_action2"

env = gym.make('MountainCar-v0', render_mode='rgb_array')
env = gym.wrappers.RecordVideo(env, FIGS_DIR / VIDEO_DIRNAME)

observation, info = env.reset()
done = False

for t in range(200):
    action = 2
    observation, reward, done, truncated, info = env.step(action)

env.close()

Video(FIGS_DIR / VIDEO_DIRNAME / "rl-video-episode-0.mp4", embed=True, html_attributes="controls autoplay loop")

#### Test the MountainCar environment with a random policy

In [None]:
VIDEO_DIRNAME = "lab4_mountain-car_random_action"

env = gym.make('MountainCar-v0', render_mode='rgb_array')
env = gym.wrappers.RecordVideo(env, FIGS_DIR / VIDEO_DIRNAME)

observation, info = env.reset()
done = False

for t in range(200):
    action = env.action_space.sample()
    observation, reward, done, truncated, info = env.step(action)

env.close()

Video(FIGS_DIR / VIDEO_DIRNAME / "rl-video-episode-0.mp4", embed=True, html_attributes="controls autoplay loop")

#### Test the MountainCar environment with a good handcrafted policy

**Task 3:** The MountainCar environment is simple in design but poses a significant challenge for many algorithms, such as PPO, due to its uninformative reward structure. The agent receives a reward of -1 at every time step until it reaches the flag at the top of the hill, so it must explore extensively before it ever experiences an episode that ends in success. Despite this, the task can be solved with a surprisingly simple policy. Can you discover it?

In [None]:
VIDEO_DIRNAME = "lab4_mountain-car_handcrafted_policy"  # separate directory so the random-policy video is not overwritten

env = gym.make('MountainCar-v0', render_mode='rgb_array')
env = gym.wrappers.RecordVideo(env, FIGS_DIR / VIDEO_DIRNAME)

observation, info = env.reset()
done = False

for t in range(200):

    # TODO...

    observation, reward, done, truncated, info = env.step(action)
    if done or truncated:  # stop stepping once the episode has ended
        break

env.close()

Video(FIGS_DIR / VIDEO_DIRNAME / "rl-video-episode-0.mp4", embed=True, html_attributes="controls autoplay loop")

### Behavioral Cloning

*Behavioral Cloning* ([D. A. Pomerleau, *Efficient Training of Artificial Neural Networks for Autonomous Navigation*, Neural Computation, vol. 3, no. 1, pp. 88–97, 1991](https://cours.etsmtl.ca/sys843/REFS/ORG/pomerleau_alvinn.pdf)) is one of the most fundamental approaches to *Imitation Learning*. The concept is straightforward: an *expert* provides high-quality traces, or demonstrations, and the learning agent's task is to mimic the expert’s behavior.
In *Imitation Learning*, *traces* or *demonstrations* refer to sequences of state-action pairs generated by an expert while performing a task. These demonstrations serve as examples for the agent to learn from. Each demonstration consists of a series of observations (states) encountered by the expert, along with the corresponding actions taken in those states.
For example, in a driving task, a demonstration might be a series of snapshots of the environment (such as the car’s position and speed) and the actions the expert driver took at each moment (such as steering or braking). These state-action pairs are recorded and used to train the agent, enabling it to learn how to behave similarly in similar situations.
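
To make the idea of a demonstration concrete, here is a minimal sketch of how state-action pairs can be stored and read back. It follows the layout that the dataset class used later in this lab expects: a JSON list of transitions, each with an `"observation"` and an `"action"` key, compressed with `lzma`. The numeric values and the file name `toy_demonstrations.json.xz` are invented for illustration.

```python
import json
import lzma
import tempfile
from pathlib import Path

# Hypothetical demonstration: three (state, action) pairs for a task with
# 2-dimensional observations and discrete actions. Values are invented.
demonstrations = [
    {"observation": [-0.50, 0.00], "action": 0},
    {"observation": [-0.51, -0.01], "action": 0},
    {"observation": [-0.49, 0.02], "action": 2},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = Path(tmp_dir) / "toy_demonstrations.json.xz"

    # Write the demonstrations as xz-compressed JSON text
    with lzma.open(file_path, "wt") as f:
        json.dump(demonstrations, f)

    # Read them back, as a dataset loader would
    with lzma.open(file_path, "rt") as f:
        loaded = json.load(f)

# Split the transitions into inputs (states) and targets (expert actions)
observations = [transition["observation"] for transition in loaded]
actions = [transition["action"] for transition in loaded]

print(f"{len(loaded)} transitions")
print("first observation:", observations[0], "-> action:", actions[0])
```

Seen this way, a set of demonstrations is simply a supervised dataset: the observations are the inputs and the expert's actions are the labels.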

The goal of *Behavioral Cloning* is to map states to the actions the expert would take, essentially allowing the agent to "clone" the expert's behavior. The quality and variety of these demonstrations are critical for successful learning, as they provide the agent with the knowledge it needs to act appropriately across different scenarios.
Typically, expert demonstrations are obtained by recording human behavior, which is then used to train the agent.

**Note**: In the original publication, the algorithm learns a stochastic policy by maximizing the likelihood of the expert's actions. However, in this lab, the policy is deterministic at test time: the network is trained by minimizing the Categorical Cross Entropy between the expert's actions and the model's predictions, and the agent then selects the action with the highest score (`argmax`).
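
The following sketch illustrates what the categorical cross-entropy measures for a single state. With a one-hot target, it reduces to the negative log-probability that the model assigns to the expert's action. The logits below are hypothetical values, and the helper functions `softmax` and `cross_entropy` are written from scratch for illustration; they are not the PyTorch implementation used in the lab.

```python
import math

def softmax(logits):
    """Turn raw scores (logits) into a probability distribution."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, expert_action):
    """Categorical cross-entropy for one sample with a one-hot target."""
    probs = softmax(logits)
    return -math.log(probs[expert_action])

# Hypothetical network outputs for one state with 3 possible actions
logits = [2.0, 0.5, -1.0]

loss_if_expert_chose_0 = cross_entropy(logits, expert_action=0)
loss_if_expert_chose_2 = cross_entropy(logits, expert_action=2)

# Deterministic (greedy) action: index of the largest logit
greedy_action = max(range(len(logits)), key=lambda a: logits[a])

print(f"loss when the expert chose action 0: {loss_if_expert_chose_0:.3f}")
print(f"loss when the expert chose action 2: {loss_if_expert_chose_2:.3f}")
print(f"greedy (argmax) action: {greedy_action}")
```

The loss is small when the model already favors the expert's action and large when it does not, which is what drives the network toward the expert's choices.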

### Exercise 2: Behavioral Cloning on MountainCar

#### Make a PyTorch dataset from the demonstrations

##### Download expert demonstrations

The expert demonstrations are available at the following URL: https://github.com/jeremiedecock/polytechnique-csc-53439-ep-2025-students/raw/refs/heads/main/models/lab4_expert_mountaincar-v0_handcrafted/demonstrations.json.xz

In [None]:
!mkdir -p models/lab4_expert_mountaincar-v0_handcrafted

In [None]:
!wget https://github.com/jeremiedecock/polytechnique-csc-53439-ep-2025-students/raw/refs/heads/main/models/lab4_expert_mountaincar-v0_handcrafted/demonstrations.json.xz -O models/lab4_expert_mountaincar-v0_handcrafted/demonstrations.json.xz

##### Make the dataset

In [None]:
class ExpertDemonstrationsDataset(torch.utils.data.Dataset):
    def __init__(self, json_demonstrations_file_path, transform=None, target_transform=None):
        super().__init__()
        self.json_demonstrations_file_path = json_demonstrations_file_path

        with lzma.open(self.json_demonstrations_file_path, "rt") as f:
            demonstrations_list = json.load(f)

        self._observations_tensor = torch.tensor([transition["observation"] for transition in demonstrations_list], dtype=torch.float32)
        self._actions_tensor = torch.tensor([transition["action"] for transition in demonstrations_list], dtype=torch.long)

        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self._observations_tensor)

    def __getitem__(self, idx):
        observation = self._observations_tensor[idx]
        action = self._actions_tensor[idx]

        if self.transform:
            observation = self.transform(observation)
        if self.target_transform:
            action = self.target_transform(action)

        return observation, action

In [None]:
mountain_car_expert_dataset = ExpertDemonstrationsDataset(Path("models") / "lab4_expert_mountaincar-v0_handcrafted" / "demonstrations.json.xz")

**Task 1:** Take time to check the definition and the content of this dataset.

##### Plot the dataset

In [None]:
expert_policy_df = pd.DataFrame([{"s_1": observation_tensor[0].item(), "s_2": observation_tensor[1].item(), "a": action_tensor.item()} for observation_tensor, action_tensor in mountain_car_expert_dataset])  # .item() converts 0-d tensors to plain Python numbers
expert_policy_df.plot(kind="scatter", x="s_1", y="s_2", c="a", colormap="viridis", colorbar=True, figsize=(10, 7), s=2);

#### Define the neural network

**Task 2:** Implement a neural network that takes the state as input and outputs one score (logit) per action. The neural network should have the following architecture:
- A first fully connected layer with `hidden_units` units and ReLU activation function
- A second fully connected layer with `n_actions` units and no activation function

In [None]:
class DiscretePolicyNetwork(torch.nn.Module):

    def __init__(self, n_observations: int, n_actions: int, hidden_units: int):

        super(DiscretePolicyNetwork, self).__init__()

        # TODO...

    def forward(self, x: torch.Tensor) -> torch.Tensor:

        # TODO...

        return logits

In [None]:
mountain_car_model = DiscretePolicyNetwork(n_observations=mountain_car_state_dim, n_actions=mountain_car_action_dim, hidden_units=8).to(device)
print(mountain_car_model)

#### Define the training loop

**Task 3:** Implement a classical supervised learning training function that trains the neural network for one epoch on the dataset. This function will be called at each epoch in the *Train the model* cell defined below.

In [None]:
def train(dataloader, model, loss_fn, optimizer, verbose=True):

    # TODO...

#### Define the testing loop

**Task 4:** Implement a classical supervised learning testing function that assesses the performance of the neural network on the test dataset. This function will be called at each epoch in the *Train the model* cell defined below.

In [None]:
def test(dataloader, model, loss_fn):

    # TODO...

#### Split the dataset

In the following cell, we will split the dataset (expert demonstrations) into a training and a test set.

In [None]:
train_size = int(0.8 * len(mountain_car_expert_dataset))
test_size = len(mountain_car_expert_dataset) - train_size

train_subset, test_subset = torch.utils.data.random_split(mountain_car_expert_dataset, [train_size, test_size])

mountain_car_train_dataloader = torch.utils.data.DataLoader(train_subset, batch_size=32, shuffle=True)
mountain_car_test_dataloader = torch.utils.data.DataLoader(test_subset, batch_size=32, shuffle=False)

#### Define the loss function and the optimizer

**Task 5:** Define the loss function used to train the neural network.

In [None]:
mountain_car_loss_fn = # TODO...

mountain_car_optimizer = torch.optim.Adam(mountain_car_model.parameters())

#### Train the model

In [None]:
epochs = 50
for epoch in range(epochs):
    print(f"Epoch {epoch+1}/{epochs}\n-------------------------------")

    train(mountain_car_train_dataloader, mountain_car_model, mountain_car_loss_fn, mountain_car_optimizer, verbose=False)
    test(mountain_car_test_dataloader, mountain_car_model, mountain_car_loss_fn)

##### Plot the learned policy

You can now compare the learned policy with the expert's policy.

In [None]:
# expert_policy_df = pd.DataFrame([{"s_1": test_subset[idx][0][0], "s_2": test_subset[idx][0][1], "a": test_subset[idx][1]} for idx in range(len(test_subset))])
# expert_policy_df.plot(kind="scatter", x="s_1", y="s_2", c="a", colormap="viridis", colorbar=True, figsize=(10, 7), s=2);

In [None]:
with torch.no_grad():  # no gradients are needed to query the trained model
    learned_policy_df = pd.DataFrame([
        {"s_1": obs[0].item(), "s_2": obs[1].item(), "a": mountain_car_model(obs.to(device)).argmax().item()}
        for obs, _ in (test_subset[idx] for idx in range(len(test_subset)))
    ])
learned_policy_df.plot(kind="scatter", x="s_1", y="s_2", c="a", colormap="viridis", colorbar=True, figsize=(10, 7), s=2);

#### Test the learned policy

In [None]:
VIDEO_DIRNAME = "lab4_mountain-car_trained_policy"

env = gym.make('MountainCar-v0', render_mode='rgb_array')
env = gym.wrappers.RecordVideo(env, FIGS_DIR / VIDEO_DIRNAME)

observation, info = env.reset()
done = False

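with torch.no_grad():  # no gradients are needed at inference time
    for t in range(200):
        observation_tensor = torch.tensor(observation, dtype=torch.float32).to(device)
        action = mountain_car_model(observation_tensor).argmax().item()
        observation, reward, done, truncated, info = env.step(action)
        if done or truncated:  # stop stepping once the episode has ended
            break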

env.close()

Video(FIGS_DIR / VIDEO_DIRNAME / "rl-video-episode-0.mp4", embed=True, html_attributes="controls autoplay loop")

### Exercise 3: Hands on LunarLander environment

In this exercise, we will apply the *Behavioral Cloning* algorithm to the LunarLander environment. LunarLander is a classic reinforcement learning task where an agent must learn to land a spacecraft safely on the moon. The agent chooses among four discrete actions: do nothing, fire the left orientation engine, fire the main engine, or fire the right orientation engine. The agent receives a reward for landing successfully, a penalty for crashing, and a small penalty each time an engine fires.

The reward signal is less sparse than in the MountainCar environment, but the dynamics are more complex, which also makes it an interesting testbed for *Imitation Learning*.

**Task 1:** Refer to the [LunarLander-v3 Environment](https://gymnasium.farama.org/environments/box2d/lunar_lander/) documentation to familiarize yourself with the LunarLander-v3 environment if you are not already familiar with it.

Print some information about the environment:

In [None]:
env = gym.make('LunarLander-v3', render_mode="rgb_array")

lunar_lander_state_dim = env.observation_space.shape[0]
lunar_lander_action_dim = env.action_space.n.item()

print(f"State space is: { env.observation_space }")
print(f"Action space is: { env.action_space }")
print("Actions are: {" + ", ".join([str(a) for a in range(env.action_space.n)]) + "}")

env.close()

**Task 2:** Run the following cells and check different basic policies (for instance constant actions or randomly drawn actions) to discover the LunarLander-v3 environment.

#### Test the LunarLander environment with a constant policy

In [None]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

In [None]:
VIDEO_DIRNAME = "lab4_lunar-lander_action0"

env = gym.make('LunarLander-v3', render_mode='rgb_array')
env = gym.wrappers.RecordVideo(env, FIGS_DIR / VIDEO_DIRNAME)

observation, info = env.reset()
done = False

for t in range(200):
    action = 0
    observation, reward, done, truncated, info = env.step(action)
    if done or truncated:  # stop stepping once the episode has ended
        break

env.close()

Video(FIGS_DIR / VIDEO_DIRNAME / "rl-video-episode-0.mp4", embed=True, html_attributes="controls autoplay loop")

In [None]:
VIDEO_DIRNAME = "lab4_lunar-lander_action1"

env = gym.make('LunarLander-v3', render_mode='rgb_array')
env = gym.wrappers.RecordVideo(env, FIGS_DIR / VIDEO_DIRNAME)

observation, info = env.reset()
done = False

for t in range(200):
    action = 1
    observation, reward, done, truncated, info = env.step(action)
    if done or truncated:  # stop stepping once the episode has ended
        break

env.close()

Video(FIGS_DIR / VIDEO_DIRNAME / "rl-video-episode-0.mp4", embed=True, html_attributes="controls autoplay loop")

In [None]:
VIDEO_DIRNAME = "lab4_lunar-lander_action2"

env = gym.make('LunarLander-v3', render_mode='rgb_array')
env = gym.wrappers.RecordVideo(env, FIGS_DIR / VIDEO_DIRNAME)

observation, info = env.reset()
done = False

for t in range(200):
    action = 2
    observation, reward, done, truncated, info = env.step(action)
    if done or truncated:  # stop stepping once the episode has ended
        break

env.close()

Video(FIGS_DIR / VIDEO_DIRNAME / "rl-video-episode-0.mp4", embed=True, html_attributes="controls autoplay loop")

In [None]:
VIDEO_DIRNAME = "lab4_lunar-lander_action3"

env = gym.make('LunarLander-v3', render_mode='rgb_array')
env = gym.wrappers.RecordVideo(env, FIGS_DIR / VIDEO_DIRNAME)

observation, info = env.reset()
done = False

for t in range(200):
    action = 3
    observation, reward, done, truncated, info = env.step(action)
    if done or truncated:  # stop stepping once the episode has ended
        break

env.close()

Video(FIGS_DIR / VIDEO_DIRNAME / "rl-video-episode-0.mp4", embed=True, html_attributes="controls autoplay loop")

#### Test the LunarLander environment with a random policy

In [None]:
VIDEO_DIRNAME = "lab4_lunar-lander_random_action"

env = gym.make('LunarLander-v3', render_mode='rgb_array')
env = gym.wrappers.RecordVideo(env, FIGS_DIR / VIDEO_DIRNAME)

observation, info = env.reset()
done = False

for t in range(200):
    action = env.action_space.sample()
    observation, reward, done, truncated, info = env.step(action)
    if done or truncated:  # stop stepping once the episode has ended
        break

env.close()

Video(FIGS_DIR / VIDEO_DIRNAME / "rl-video-episode-0.mp4", embed=True, html_attributes="controls autoplay loop")

### Exercise 4: Behavioral Cloning on LunarLander

In this exercise, we will reuse most of the code from the previous exercise to apply the *Behavioral Cloning* algorithm to the LunarLander environment.
There is very little to do except check that the code from the previous exercise still works on the LunarLander environment.

#### Make a PyTorch dataset from the demonstrations

##### Download expert demonstrations

The expert demonstrations are available at the following URL: https://github.com/jeremiedecock/polytechnique-csc-53439-ep-2025-students/raw/refs/heads/main/models/lab4_expert_lunar-lander-v2-discrete-nowind_ppo/demonstrations.json.xz

In [None]:
!mkdir -p models/lab4_expert_lunar-lander-v2-discrete-nowind_ppo

In [None]:
!wget https://github.com/jeremiedecock/polytechnique-csc-53439-ep-2025-students/raw/refs/heads/main/models/lab4_expert_lunar-lander-v2-discrete-nowind_ppo/demonstrations.json.xz -O models/lab4_expert_lunar-lander-v2-discrete-nowind_ppo/demonstrations.json.xz

In [None]:
lunar_lander_expert_dataset = ExpertDemonstrationsDataset(Path("models") / "lab4_expert_lunar-lander-v2-discrete-nowind_ppo" / "demonstrations.json.xz")

#### Define the neural network

In [None]:
lunar_lander_model = DiscretePolicyNetwork(n_observations=lunar_lander_state_dim, n_actions=lunar_lander_action_dim, hidden_units=64).to(device)
print(lunar_lander_model)

#### Split the dataset

In [None]:
train_size = int(0.8 * len(lunar_lander_expert_dataset))
test_size = len(lunar_lander_expert_dataset) - train_size

train_subset, test_subset = torch.utils.data.random_split(lunar_lander_expert_dataset, [train_size, test_size])

lunar_lander_train_dataloader = torch.utils.data.DataLoader(train_subset, batch_size=32, shuffle=True)
lunar_lander_test_dataloader = torch.utils.data.DataLoader(test_subset, batch_size=32, shuffle=False)

#### Define the loss function and the optimizer

**Task 1:** Define the loss function used to train the neural network.

In [None]:
lunar_lander_loss_fn = # TODO...

lunar_lander_optimizer = torch.optim.Adam(lunar_lander_model.parameters())

#### Train the model

In [None]:
epochs = 10
for epoch in range(epochs):
    print(f"Epoch {epoch+1}/{epochs}\n-------------------------------")

    train(lunar_lander_train_dataloader, lunar_lander_model, lunar_lander_loss_fn, lunar_lander_optimizer, verbose=False)
    test(lunar_lander_test_dataloader, lunar_lander_model, lunar_lander_loss_fn)

#### Test the learned policy

In [None]:
VIDEO_DIRNAME = "lab4_lunar-lander_trained_policy"

env = gym.make('LunarLander-v3', render_mode='rgb_array')
env = gym.wrappers.RecordVideo(env, FIGS_DIR / VIDEO_DIRNAME)

observation, info = env.reset()
done = False

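with torch.no_grad():  # no gradients are needed at inference time
    for t in range(200):
        observation_tensor = torch.tensor(observation, dtype=torch.float32).to(device)
        action = lunar_lander_model(observation_tensor).argmax().item()
        observation, reward, done, truncated, info = env.step(action)
        if done or truncated:  # stop stepping once the episode has ended
            break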

env.close()

Video(FIGS_DIR / VIDEO_DIRNAME / "rl-video-episode-0.mp4", embed=True, html_attributes="controls autoplay loop")

## Part 2: GAIL (bonus part)

### Introduction to GAIL

**Generative Adversarial Imitation Learning (GAIL)** ([Ho & Ermon, 2016](https://proceedings.neurips.cc/paper_files/paper/2016/file/cc7e2b878868cbae992d1fb743995d8f-Paper.pdf)) is an advanced imitation learning algorithm that combines ideas from Generative Adversarial Networks (GANs) and Inverse Reinforcement Learning (IRL).

Unlike Behavioral Cloning, which directly learns to map states to actions through supervised learning, GAIL learns both a policy and a reward function simultaneously:

1. **Discriminator**: A neural network that tries to distinguish between state-action pairs from expert demonstrations and those generated by the current policy
2. **Policy (Generator)**: A neural network that tries to generate trajectories that fool the discriminator into thinking they come from the expert

The key insight is that the discriminator provides a learned reward signal to train the policy using reinforcement learning (typically PPO). This approach addresses some limitations of behavioral cloning:
- It can handle distribution mismatch between training and test states
- It doesn't require explicit reward functions
- It's more robust to imperfect demonstrations
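
The adversarial signal described above can be sketched numerically. The example below assumes that the discriminator output D(s, a) is the probability that a pair comes from the expert, that the discriminator is trained with a binary cross-entropy loss (expert pairs labeled 1, policy pairs labeled 0), and that the policy reward is -log(1 - D(s, a)). This reward is one common choice among several, and the probabilities are hypothetical values rather than outputs of a trained network.

```python
import math

EPS = 1e-8  # small constant to avoid log(0)

def discriminator_loss(expert_probs, policy_probs):
    """Binary cross-entropy: expert pairs are labeled 1, policy pairs 0."""
    expert_term = -sum(math.log(p + EPS) for p in expert_probs) / len(expert_probs)
    policy_term = -sum(math.log(1.0 - p + EPS) for p in policy_probs) / len(policy_probs)
    return expert_term + policy_term

def surrogate_reward(prob_expert):
    """Assumed reward choice: large when the discriminator is fooled."""
    return -math.log(1.0 - prob_expert + EPS)

# Hypothetical discriminator outputs D(s, a) in [0, 1]
expert_probs = [0.9, 0.8, 0.95]  # on expert state-action pairs
policy_probs = [0.2, 0.6, 0.1]   # on pairs generated by the current policy

# One learned reward per state-action pair visited by the policy
rewards = [surrogate_reward(p) for p in policy_probs]

print(f"discriminator loss: {discriminator_loss(expert_probs, policy_probs):.3f}")
print("policy rewards:", [round(r, 3) for r in rewards])
```

State-action pairs that look more expert-like to the discriminator receive a higher reward, so the reinforcement learning step pushes the policy toward the expert's behavior.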

### Exercise 5: GAIL on LunarLander

**Note**: This implementation of GAIL is intentionally kept simple to focus on understanding the core algorithm without the complexity of advanced policy training procedures. We use the **REINFORCE** algorithm (vanilla policy gradient) to train the policy, which makes the learning process transparent and easier to follow.
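
As a reminder of how REINFORCE uses rewards, the sketch below computes discounted returns-to-go for one episode and the corresponding surrogate loss. The discount factor `gamma`, the rewards, and the action probabilities are hypothetical values chosen for illustration; in GAIL the rewards would come from the discriminator rather than from the environment.

```python
import math

def discounted_returns(rewards, gamma=0.99):
    """Return-to-go G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    returns = [0.0] * len(rewards)
    running_return = 0.0
    # Accumulate from the last time step backwards
    for t in reversed(range(len(rewards))):
        running_return = rewards[t] + gamma * running_return
        returns[t] = running_return
    return returns

def reinforce_loss(log_probs, returns):
    """REINFORCE surrogate loss: -sum_t log pi(a_t | s_t) * G_t."""
    return -sum(lp * g for lp, g in zip(log_probs, returns))

# Hypothetical 3-step episode
episode_rewards = [1.0, 0.0, 2.0]
returns = discounted_returns(episode_rewards, gamma=0.5)

# Hypothetical probabilities the policy assigned to the actions it took
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5)]
loss = reinforce_loss(log_probs, returns)

print("returns-to-go:", returns)
print(f"surrogate loss: {loss:.3f}")
```

Minimizing this loss by gradient descent increases the probability of actions that were followed by high returns.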

#### Define the policy network

**Task 1:** Implement a neural network that takes the state as input and outputs a probability distribution over actions. The neural network should have the following architecture:
- Input Layer:
  - The network takes an input with a dimension of obs_dim.
- Hidden Layer:
  - The first hidden layer is a fully connected (Linear) layer with 128 units.
  - This is followed by a ReLU activation function.
- Output Layer:
  - The output layer is a fully connected (Linear) layer with act_dim units.
  - This is followed by a Softmax activation function, which ensures that the output is a probability distribution over actions.
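
Because the output is a probability distribution, an action can be drawn from it at each step, which is what a stochastic policy gradient method such as REINFORCE relies on. The sketch below shows this sampling step in plain Python; the scores are hypothetical and the helper names `softmax` and `sample_action` are introduced only for this illustration.

```python
import math
import random

def softmax(scores):
    """Map raw scores to probabilities that are positive and sum to 1."""
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(action_probs, rng):
    """Draw one action index according to the given probabilities."""
    return rng.choices(range(len(action_probs)), weights=action_probs, k=1)[0]

rng = random.Random(0)  # fixed seed for reproducibility

# Hypothetical output-layer scores for an environment with 4 actions
action_probs = softmax([1.0, 2.0, 0.5, 0.0])

# Sample many actions and count how often each one is drawn
samples = [sample_action(action_probs, rng) for _ in range(1000)]
counts = [samples.count(a) for a in range(len(action_probs))]

print("action probabilities:", [round(p, 3) for p in action_probs])
print("empirical counts over 1000 samples:", counts)
```

The empirical counts roughly follow the probabilities, so less likely actions are still tried occasionally, which provides exploration during training.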

In [None]:
class PolicyNetwork(torch.nn.Module):
    """Policy network for REINFORCE training in GAIL"""

    def __init__(self, obs_dim, act_dim):
        super(PolicyNetwork, self).__init__()

        # TODO...

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass of the policy network.
        
        Parameters
        ----------
        x : torch.Tensor
            State tensor of shape (n_observations)

        Returns
        -------
        torch.Tensor
            Action probabilities, shape (n_actions)
        """

        # TODO...

        return action_probs

#### Define the discriminator neural network

**Task 2**: Implement a neural network that takes the state-action pair as input and outputs the probability that the pair comes from the expert policy. The neural network should have the following architecture:
- Input Layer:
  - The network takes a concatenated input of observations and actions with a combined dimension of `obs_dim + act_dim`.
- Hidden Layer:
  - The first hidden layer is a fully connected (Linear) layer with 128 units.
  - This is followed by a ReLU activation function.
- Output Layer:
  - The output layer is a fully connected (Linear) layer with 1 unit.
  - This is followed by a Sigmoid activation function, which outputs a probability indicating whether the input is from the expert or the generated policy.
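
As a quick illustration of the input format, the sketch below builds a toy batch of observations and discrete actions, one-hot encodes the actions, and concatenates both along the feature dimension. The sizes 8 and 4 match LunarLander's observation and action spaces; the tensors themselves are placeholders, not real data.

```python
import torch

obs_dim, act_dim = 8, 4  # LunarLander sizes, used here only for illustration

# Toy batch: 3 observations and 3 discrete actions
observations = torch.zeros(3, obs_dim)
actions = torch.tensor([0, 2, 3])

# Discrete actions are one-hot encoded before being fed to the discriminator
actions_onehot = torch.nn.functional.one_hot(actions, num_classes=act_dim).float()

# Concatenate along the last (feature) dimension
discriminator_input = torch.cat([observations, actions_onehot], dim=-1)
print(discriminator_input.shape)
```

The resulting tensor has `obs_dim + act_dim` features per sample, which is the input size expected by the first Linear layer.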

In [None]:
class Discriminator(torch.nn.Module):
    """Discriminator network that distinguishes expert from policy trajectories"""

    def __init__(self, observations_dim: int, actions_dim: int):
        super(Discriminator, self).__init__()

        # TODO...

    def forward(self, observations: torch.Tensor, action: torch.Tensor) -> torch.Tensor:

        # TODO...

        return prob

#### Environment initialization

In [None]:
env = gym.make('LunarLander-v3')

obs_dim = env.observation_space.shape[0]  # Dimension of the observation
act_dim = env.action_space.n              # Number of possible actions

#### Initialize the policy and the discriminator networks and their optimizers

In [None]:
# Initialization of the networks
policy = PolicyNetwork(obs_dim, act_dim)
discriminator = Discriminator(obs_dim, act_dim)

# Optimizers
policy_optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
discriminator_optimizer = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

#### Load expert demonstrations

In [None]:
expert_dataset = ExpertDemonstrationsDataset(Path("models") / "lab4_expert_lunar-lander-v2-discrete-nowind_ppo" / "demonstrations.json.xz")
expert_loader = torch.utils.data.DataLoader(expert_dataset, batch_size=64, shuffle=True)

#### Function to collect agent trajectories

The following function collects the agent's trajectories using the given policy.

In [None]:
def collect_agent_trajectories(
    policy: PolicyNetwork,
    env: gym.Env,
    num_episodes: int=10
):
    """
    Collect trajectories from the agent using the given policy.

    Parameters
    ----------
    policy : PolicyNetwork
        The neural network policy that outputs action probabilities
    env : gym.Env
        The gymnasium environment
    num_episodes : int, optional
        Number of episodes to collect (default: 10)

    Returns
    -------
    trajectories : list of tuple
        List of tuples (obs_list, action_list) for each episode, where:
        - obs_list : list of numpy.ndarray
            Observations collected during the episode
        - action_list : list of int
            Actions taken during the episode
    """
    trajectories = []

    # Collect multiple episodes
    for _ in range(num_episodes):
        obs_list = []
        action_list = []

        # Reset the environment to start a new episode
        obs, _ = env.reset()
        done = False

        # Run one episode until termination
        while not done:
            # Convert observation to tensor for the policy network
            obs_tensor = torch.tensor(obs, dtype=torch.float32)

            # Get action probabilities from the policy (no gradient needed for inference)
            with torch.no_grad():
                action_probs = policy(obs_tensor)

            # Sample an action from the probability distribution
            action = torch.multinomial(action_probs, 1).item()

            # Store the observation and action
            obs_list.append(obs)
            action_list.append(action)

            # Take a step in the environment
            obs, reward, done, truncated, info = env.step(action)

            # Handle truncated episodes (time limit reached)
            if truncated:
                done = True

        # Store the complete trajectory for this episode
        trajectories.append((obs_list, action_list))

    return trajectories

#### The main training loop

**Task 3**: Implement the main training loop for the Generative Adversarial Imitation Learning (GAIL) algorithm. Here's a step-by-step explanation:

1. **Initialization**:
   - `num_iterations`, `num_agent_episodes`, and `gamma` (discount factor) are defined.
   - The loop runs for `num_iterations` iterations.

2. **Collecting Agent Trajectories**:
   - For each iteration, agent trajectories are collected using the current policy by running `num_agent_episodes` episodes in the environment.

3. **Processing Trajectories for Policy Gradient**:
   - For each trajectory, observations and actions are converted to tensors.
   - Actions are converted to one-hot encoding.
   - The discriminator's output is used to compute rewards for the agent.
   - Cumulative returns are calculated using the discount factor `gamma`.

4. **Concatenation of Data**:
   - All observations, actions, and returns from the trajectories are concatenated into single tensors.

5. **Policy Update**:
   - The policy network is updated using the policy gradient method.
   - Log probabilities of the selected actions are computed.
   - The policy loss is calculated and backpropagated to update the policy network.

6. **Data Preparation for the Discriminator**:
   - Agent actions are converted to one-hot encoding.
   - A batch of expert data is retrieved.
   - Expert actions are also converted to one-hot encoding.

7. **Combining and Shuffling Agent and Expert Data**:
   - Observations and actions from both agent and expert are concatenated.
   - Labels are created (0 for agent data, 1 for expert data).
   - The combined data is shuffled.

8. **Training the Discriminator**:
   - The discriminator is trained to distinguish between agent and expert data.
   - The discriminator loss is calculated using binary cross-entropy and backpropagated to update the discriminator network.

9. **Logging**:
   - Every 10 iterations, the discriminator and policy losses are printed.

This loop iteratively improves the policy by making it more similar to the expert's behavior while simultaneously training the discriminator to better distinguish between agent and expert actions.
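
The "cumulative returns" mentioned in step 3 follow the usual recursion $G_t = r_t + \gamma G_{t+1}$. Here is a minimal, framework-free sketch on a toy reward sequence; the helper name `discounted_returns` is hypothetical and the rewards are made up, whereas in the lab they would come from the discriminator.

```python
def discounted_returns(rewards, gamma):
    """Hypothetical helper: compute G_t = r_t + gamma * G_{t+1} for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so that each return reuses the one computed for the next step
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Iterating backwards keeps the computation linear in the episode length, instead of re-summing the future rewards for every time step.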

In [None]:
num_iterations = 500
num_agent_episodes = 5
gamma = 0.99  # Discount factor

discriminator_loss_fn = torch.nn.BCELoss()

# Track training metrics
training_history = {
    'iteration': [],
    'discriminator_loss': [],
    'policy_loss': [],
    # 'mean_reward': []
}

for iteration in range(num_iterations):

    # Collecting agent trajectories
    trajectories = collect_agent_trajectories(policy, env, num_episodes=num_agent_episodes)

    all_observations = []
    all_actions = []
    all_returns = []

    # ==================================
    # STEP 1: Process agent trajectories
    # ==================================

    # Convert raw trajectories into tensors and compute returns for policy gradient
    for trajectory in trajectories:

        # Extract observations and actions from the trajectory
        observation_list, action_list = trajectory

        # --------------------
        # Process observations
        # --------------------

        observation_tensor = torch.tensor(observation_list, dtype=torch.float32)
        all_observations.append(observation_tensor)

        # ---------------
        # Process actions
        # ---------------

        action_tensor = torch.tensor(action_list, dtype=torch.long)
        
        # Convert discrete actions to one-hot encoding for the discriminator
        action_tensor_onehot = torch.nn.functional.one_hot(action_tensor, num_classes=act_dim).float()

        all_actions.append(action_tensor_onehot)

        # ---------------------------------------
        # Compute returns using the discriminator
        # ---------------------------------------

        # TODO...

        all_returns.append(returns)

    # ======================================
    # STEP 2: Prepare data for policy update
    # ======================================

    # Concatenate all trajectory data into single tensors
    all_observations = torch.cat(all_observations, dim=0)  # shape: (total_steps, obs_dim), e.g. torch.Size([5000, 8])
    all_actions = torch.cat(all_actions, dim=0)            # shape: (total_steps, act_dim), e.g. torch.Size([5000, 4])
    all_returns = torch.cat(all_returns, dim=0)            # shape: (total_steps,), e.g. torch.Size([5000])

    # Normalize returns to reduce variance and stabilize training
    # This is a common technique in policy gradient methods
    all_returns = (all_returns - all_returns.mean()) / (all_returns.std() + 1e-8)

    # =================================
    # STEP 3: Update the policy network
    # =================================

    # Compute log probabilities of all actions in the visited states
    # The policy outputs action probabilities; the small constant avoids log(0)
    log_prob_actions_tensor = torch.log(policy(all_observations) + 1e-8)

    # Extract log probabilities of the specific actions that were taken
    # This multiplies by one-hot encoded actions and sums to get the log prob of the taken action
    log_prob_actions_tensor = # TODO...

    # Compute policy gradient loss: -E[G_t * log π(a_t|s_t)]
    # We maximize expected returns, which is equivalent to minimizing the negative
    policy_loss = # TODO...

    # Perform gradient descent on the policy
    policy_optimizer.zero_grad()
    policy_loss.backward()
    policy_optimizer.step()

    # =============================================
    # STEP 4: Prepare data for discriminator update
    # =============================================

    # Detach agent actions from the computation graph to avoid coupling with policy gradients
    agent_actions_onehot = all_actions.detach().float()

    # Sample a batch of expert demonstrations
    expert_batch = next(iter(expert_loader))
    expert_observations, expert_actions = expert_batch
    
    # Convert expert actions to one-hot encoding (same format as agent actions)
    expert_actions_onehot = torch.nn.functional.one_hot(expert_actions, num_classes=act_dim).float()

    # Combine agent and expert data for discriminator training
    # Agent observations and actions (detached from policy gradient)
    discriminator_input_observations = torch.cat([all_observations.detach(), expert_observations], dim=0)
    discriminator_input_actions = torch.cat([agent_actions_onehot, expert_actions_onehot], dim=0)
    
    # Create binary labels: 0 for agent (fake), 1 for expert (real)
    # This follows the standard GAN convention
    discriminator_labels = torch.cat([
        torch.zeros(len(all_observations)),  # Agent data labeled as 0 (fake)
        torch.ones(len(expert_observations))  # Expert data labeled as 1 (real)
    ], dim=0)

    # Shuffle the combined dataset to prevent the discriminator from learning
    # patterns based on the order of agent vs. expert data
    perm = torch.randperm(discriminator_input_observations.size(0))
    discriminator_input_observations = discriminator_input_observations[perm]
    discriminator_input_actions = discriminator_input_actions[perm]
    discriminator_labels = discriminator_labels[perm]

    # ========================================
    # STEP 5: Update the discriminator network
    # ========================================

    # Train the discriminator to distinguish between agent and expert trajectories
    # This is a binary classification task
    predictions = # TODO...
    discriminator_loss = # TODO...

    # Perform gradient descent on the discriminator
    discriminator_optimizer.zero_grad()
    discriminator_loss.backward()
    discriminator_optimizer.step()

    # Track metrics
    training_history['iteration'].append(iteration)
    training_history['discriminator_loss'].append(discriminator_loss.item())
    training_history['policy_loss'].append(policy_loss.item())

    if iteration % 10 == 0:
        print(f"Iteration {iteration}, Discriminator loss: {discriminator_loss.item()}, Policy loss: {policy_loss.item()}")
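
The binary cross-entropy objective used for the discriminator can be checked on a toy batch. The sketch below uses made-up probabilities and assumes that the discriminator returns a column of shape `(N, 1)` while the labels have shape `(N,)`; since `torch.nn.BCELoss` expects both tensors to have the same shape, the predictions are squeezed first.

```python
import torch

# Toy discriminator outputs (probability of "expert") and labels
probs = torch.tensor([[0.9], [0.1]])  # shape (2, 1), illustrative values
labels = torch.tensor([1.0, 0.0])     # shape (2,): 1 = expert, 0 = agent

loss_fn = torch.nn.BCELoss()

# Remove the trailing dimension so predictions and labels have the same shape
loss = loss_fn(probs.squeeze(-1), labels)
print(loss.item())
```

Both samples are classified confidently and correctly here, so the loss is small; it equals $-\log(0.9)$.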

#### Plot Training Metrics

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(18, 10))

# Discriminator Loss
axes[0].plot(training_history['iteration'], training_history['discriminator_loss'])
axes[0].set_xlabel('Iteration')
axes[0].set_ylabel('Discriminator Loss')
axes[0].set_title('Discriminator Loss over Training')
axes[0].grid(True)

# Policy Loss
axes[1].plot(training_history['iteration'], training_history['policy_loss'])
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('Policy Loss')
axes[1].set_title('Policy Loss over Training')
axes[1].grid(True)

plt.tight_layout()
plt.savefig(PLOTS_DIR / "gail_training_metrics.png", dpi=150, bbox_inches='tight')
plt.show()

#### Test the learned policy

In [None]:
VIDEO_DIRNAME = "lab4_lunar-lander_trained_policy"

env = gym.make('LunarLander-v3', render_mode='rgb_array')
env = gym.wrappers.RecordVideo(env, FIGS_DIR / VIDEO_DIRNAME)

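observation, info = env.reset()

# Roll out the greedy policy for at most 200 steps, stopping early if the episode ends
for t in range(200):
    # The policy was created and trained on the CPU, so the observation stays there too
    observation_tensor = torch.tensor(observation, dtype=torch.float32)
    with torch.no_grad():
        action = policy(observation_tensor).argmax().item()
    observation, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break

env.close()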

Video(FIGS_DIR / VIDEO_DIRNAME / "rl-video-episode-0.mp4", embed=True, html_attributes="controls autoplay loop")

### Conclusion

As noted at the start of Exercise 3, this implementation of GAIL is intentionally kept simple: the policy is trained with the **REINFORCE** algorithm (vanilla policy gradient), which keeps the learning process transparent and easy to follow.

However, this simplicity comes at a cost: the training can be **less efficient and less stable** compared to more sophisticated approaches. In the original GAIL paper ([Ho & Ermon, 2016](https://proceedings.neurips.cc/paper_files/paper/2016/file/cc7e2b878868cbae992d1fb743995d8f-Paper.pdf)), the authors used **Trust Region Policy Optimization (TRPO)** to train the policy, which provides better stability and sample efficiency through constrained optimization.

In modern practice, **Proximal Policy Optimization (PPO)** has become the most widely used algorithm for training the policy in GAIL implementations. PPO offers a good balance between:
- Implementation simplicity (compared to TRPO)
- Training stability (compared to REINFORCE)
- Sample efficiency
- Robustness across different environments

For production-level implementations or more challenging environments, consider replacing the REINFORCE-based policy training with PPO or TRPO to achieve significantly better performance and stability.