<center><img src='https://i.postimg.cc/TPR1n1rp/AI-Tech-PL-RGB.png' height="60"></center>

AI TECH - Academy of Innovative Applications of Digital Technologies. Operational Programme Digital Poland 2014-2020
<hr>

<center><img src='https://i.postimg.cc/Gpq2KRQz/logotypy-aitech.jpg'></center>

<center>
Project co-financed by the European Union under the European Regional Development Fund
Operational Programme Digital Poland 2014-2020,
Priority Axis 3 "Digital competences of society", Measure 3.2 "Innovative solutions for digital activation"
Project title: "Akademia Innowacyjnych Zastosowań Technologii Cyfrowych (AI Tech)" (Academy of Innovative Applications of Digital Technologies)
</center>

# Lab 08: Imitation Learning

In this lab, we look into the problem of learning from expert demonstrations.

- Find a policy $\pi(a | s)$ that best imitates the expert policy $\pi^*(a | s)$ in the given environment.
- Note that we don't need access to the environment rewards.
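
One common way to write this objective down is sketched below. The symbols $d^{\pi^*}$ (the distribution of states visited by the expert) and $\ell$ (a per-state loss, e.g. squared error between actions) are notation assumed for this illustration, not definitions used elsewhere in the lab.

$$\hat{\pi} = \arg\min_{\pi} \; \mathbb{E}_{s \sim d^{\pi^*}} \left[ \ell\big(\pi(\cdot \mid s),\, \pi^*(\cdot \mid s)\big) \right]$$

The expectation involves only states and expert actions, which is why no reward signal is required.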

Major Imitation Learning techniques are:

1. Behavioural Cloning,
1. Imitation Learning via Interactive Demonstrator e.g. SMILe (Ross and Bagnell, 2010) or DAgger (Ross et al., 2011),
1. Inverse Reinforcement Learning -- out of scope of this lab.

We will solve the MuJoCo Ant problem, examining the first two approaches.

## Install dependencies

In [1]:
!pip -q install gymnasium[mujoco]
!pip -q install pyglet
!pip -q install pyopengl
!pip -q install pyvirtualdisplay

In [2]:
!git clone https://github.com/alex-petrenko/sample-factory.git

fatal: destination path 'sample-factory' already exists and is not an empty directory.


In [3]:
!pip install -q sample-factory[mujoco]

In [4]:
%cd sample-factory

/home/bartek/expert_checkpoint/sample-factory




## Download Expert

In [9]:
!python -m sample_factory.huggingface.load_from_hub -r LLParallax/sf_Ant

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
Cloning https://huggingface.co/LLParallax/sf_Ant into local empty directory.
[2024-04-25 17:19:15,774][150423] The repository LLParallax/sf_Ant has been cloned to ./train_dir/sf_Ant


In [10]:
import functools

import torch

from sample_factory.algo.learning.learner import Learner
from sample_factory.algo.utils.env_info import extract_env_info
from sample_factory.algo.utils.make_env import make_env_func_batched
from sample_factory.algo.utils.rl_utils import prepare_and_normalize_obs
from sample_factory.cfg.arguments import load_from_checkpoint
from sample_factory.model.actor_critic import create_actor_critic
from sample_factory.model.model_utils import get_rnn_size
from sample_factory.utils.attr_dict import AttrDict
from sample_factory.utils.typing import Config


def create_expert(cfg):
    cfg = load_from_checkpoint(cfg)

    cfg.num_envs = 1

    env = make_env_func_batched(
        cfg, env_config=AttrDict(worker_index=0, vector_index=0, env_id=0), render_mode=None
    )

    if hasattr(env.unwrapped, "reset_on_init"):
        # reset call ruins the demo recording for VizDoom
        env.unwrapped.reset_on_init = False

    actor_critic = create_actor_critic(cfg, env.observation_space, env.action_space)
    actor_critic.eval()

    device = torch.device("cpu" if cfg.device == "cpu" else "cuda")
    actor_critic.model_to_device(device)

    policy_id = cfg.policy_index
    name_prefix = dict(latest="checkpoint", best="best")[cfg.load_checkpoint_kind]
    checkpoints = Learner.get_checkpoints(Learner.checkpoint_dir(cfg, policy_id), f"{name_prefix}_*")
    checkpoint_dict = Learner.load_checkpoint(checkpoints, device)
    actor_critic.load_state_dict(checkpoint_dict["model"])
    return actor_critic


def get_expert_actions(obs, cfg: Config, actor_critic, env, env_info, device):
    rnn_states = torch.zeros([env.num_agents, get_rnn_size(cfg)], dtype=torch.float32, device=device)

    obs = {"obs": obs}
    with torch.no_grad():
        normalized_obs = prepare_and_normalize_obs(actor_critic, obs)
        policy_outputs = actor_critic(normalized_obs, rnn_states)

        # sample actions from the distribution by default
        actions = policy_outputs["actions"]
    return actions

## Load expert model

In [12]:
from sample_factory.cfg.arguments import parse_full_cfg, parse_sf_args
from sample_factory.envs.env_utils import register_env
from sf_examples.mujoco.mujoco_params import add_mujoco_env_args, mujoco_override_defaults
from sf_examples.mujoco.mujoco_utils import MUJOCO_ENVS, make_mujoco_env


def register_mujoco_components():
    for env in MUJOCO_ENVS:
        register_env(env.name, make_mujoco_env)


register_mujoco_components()
argv = ["--algo=APPO", "--env=mujoco_ant", "--experiment=sf_Ant", "--train_dir=train_dir", "--no_render"]
parser, partial_cfg = parse_sf_args(argv=argv, evaluation=True)
add_mujoco_env_args(partial_cfg.env, parser)
mujoco_override_defaults(partial_cfg.env, parser)
cfg = parse_full_cfg(parser, argv=argv)
expert = create_expert(cfg)

[2024-04-25 17:19:35,658][147072] Environment mujoco_hopper already registered, overwriting...
[2024-04-25 17:19:35,660][147072] Environment mujoco_halfcheetah already registered, overwriting...
[2024-04-25 17:19:35,660][147072] Environment mujoco_humanoid already registered, overwriting...
[2024-04-25 17:19:35,661][147072] Environment mujoco_ant already registered, overwriting...
[2024-04-25 17:19:35,661][147072] Environment mujoco_standup already registered, overwriting...
[2024-04-25 17:19:35,662][147072] Environment mujoco_doublependulum already registered, overwriting...
[2024-04-25 17:19:35,662][147072] Environment mujoco_pendulum already registered, overwriting...
[2024-04-25 17:19:35,662][147072] Environment mujoco_reacher already registered, overwriting...
[2024-04-25 17:19:35,663][147072] Environment mujoco_walker already registered, overwriting...
[2024-04-25 17:19:35,663][147072] Environme

## Helpers

Utilities for collecting data and evaluating agents.

In [13]:
import time

from IPython import display as ipydisplay

import torch
import gymnasium as gym
import matplotlib.pyplot as plt
import numpy as np

from matplotlib import animation


@torch.no_grad()
def run_policy(env, model, total_steps=10000, verbose=True):
    obs_array = np.empty([total_steps, *env.observation_space.shape])
    act_array = np.empty([total_steps, env.action_space.shape[0]])
    rew_array = np.empty([total_steps, 1])
    done_array = np.empty([total_steps, 1])

    iter_time = time.time()
    done = True
    for i in range(total_steps):
        if verbose and (i + 1) % 1000 == 0:
            steps_per_second = 1000 / (time.time() - iter_time)
            print(f'Step {i + 1}/{total_steps}, Steps per second: {steps_per_second}')
            iter_time = time.time()

        if done:
            obs, info = env.reset()

        act = model(torch.from_numpy(obs).unsqueeze(0).float())[0].detach().cpu().numpy()
        obs_, rew, terminated, truncated, _ = env.step(act)
        done = terminated or truncated

        obs_array[i] = obs
        act_array[i] = act
        rew_array[i] = rew
        done_array[i] = float(done)

        obs = obs_

    return obs_array, act_array, rew_array, done_array

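def calculate_returns(rew, done):
    """Returns of all episodes that finished within the rollout."""
    rew_cumsum = np.cumsum(rew)
    # Cumulative reward at the last step of every finished episode.
    done_mask = np.asarray(done).reshape(-1).astype(bool)
    ends_cumsum = rew_cumsum[done_mask]
    # Differences between consecutive episode ends are the per-episode returns.
    return np.diff(ends_cumsum, prepend=0.0)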

def evaluate_agent(env, model, verbose=False):
    _, _, rew, done = run_policy(env, model, total_steps=50000, verbose=verbose)
    rets = calculate_returns(rew, done)

    print(f'Num. episodes: {len(rets)}')
    print(f'Avg. return: {np.mean(rets)}')
    print(f'Max. return: {np.max(rets)}')
    print(f'Min. return: {np.min(rets)}')

@torch.no_grad()
def collect_frames(eval_env, model, num_frames=2000):
    state, _ = eval_env.reset()
    state = torch.from_numpy(np.array(state)).float()
    frames = []

    for _ in range(num_frames):
        frames.append(eval_env.render())

        action = model(state.unsqueeze(0))[0]
        next_state, reward, terminal, truncate, info = eval_env.step(action.detach().cpu().numpy())

        if terminal or truncate:
            # Start a new episode; use its first observation as the next state.
            next_state, _ = eval_env.reset()
        state = torch.from_numpy(np.array(next_state)).float()

    return frames

def display_frames_as_video(frames):
    """
    Displays a list of frames as a video.
    """
    plt.figure(figsize=(frames[0].shape[1] / 72.0, frames[0].shape[0] / 72.0), dpi=72)
    patch = plt.imshow(frames[0])
    plt.axis('off')

    def animate(i):
        patch.set_data(frames[i])

    anim = animation.FuncAnimation(plt.gcf(), animate, frames=len(frames), interval=50)
    ipydisplay.display(ipydisplay.HTML(anim.to_jshtml()))
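
To see how `calculate_returns` turns a flat rollout into per-episode returns, here is a small worked example on made-up numbers (the arrays are illustrative, not Ant data). The cumulative reward is read off at every step where `done` is set, and consecutive values are differenced. Rewards collected after the last `done` flag belong to an unfinished episode and are not counted.

```python
import numpy as np

# Toy rollout of 6 steps; episodes end at steps 2 and 4 (0-indexed).
# The final step belongs to an episode that never finishes.
rew = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
done = np.array([[0.0], [0.0], [1.0], [0.0], [1.0], [0.0]])

rew_cumsum = np.cumsum(rew)                        # [1, 3, 6, 10, 15, 21]
ends = rew_cumsum[done.reshape(-1).astype(bool)]   # [6, 15]
returns = np.diff(ends, prepend=0.0)               # [6, 9]
print(returns)
```

The first episode collects 1 + 2 + 3 = 6 and the second 4 + 5 = 9.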

## 1. Behavioural Cloning

Algorithm

1. Collect the expert data.
2. Fit the model (classifier/regressor) to the expert data.
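
The two steps above can be sketched on a toy problem. This is a minimal illustration, not the lab's solution: the linear expert `W_true`, the randomly sampled states and the least-squares regressor are all assumptions made for the example, standing in for the Ant expert, environment rollouts and the MLP.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear expert: a = W_true @ s (stands in for the Ant expert).
W_true = np.array([[1.0, -2.0, 0.5],
                   [0.0, 3.0, -1.0]])

def expert_policy(states):
    return states @ W_true.T

# 1. Collect the expert data. States are sampled at random here; in the lab
#    they come from rolling out the expert in the environment.
states = rng.normal(size=(500, 3))
actions = expert_policy(states)

# 2. Fit a regressor to the (state, action) pairs via least squares.
W_fit, *_ = np.linalg.lstsq(states, actions, rcond=None)

def cloned_policy(states):
    return states @ W_fit

# Compare the clone with the expert on held-out states.
test_states = rng.normal(size=(10, 3))
max_err = np.max(np.abs(cloned_policy(test_states) - expert_policy(test_states)))
print(f"max action error on held-out states: {max_err:.2e}")
```

Because the toy expert is exactly linear and noise-free, the clone recovers it almost perfectly; with a neural expert and a finite dataset, the fit is only approximate.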

### Create model

In [14]:
import torch
import torch.nn as nn


class MLP(nn.Module):
    def __init__(self, input_shape, output_size, hidden_sizes=(256, 256), hidden_activation=nn.Tanh(), output_activation=None, l2_weight=0.0001):
        super(MLP, self).__init__()
        self.layers = nn.Sequential()

        # Input layer
        self.layers.add_module("input", nn.Linear(input_shape, hidden_sizes[0]))
        self.layers.add_module("input_activation", hidden_activation)

        # Hidden layers
        layer_sizes = zip(hidden_sizes[:-1], hidden_sizes[1:])
        for i, (h1, h2) in enumerate(layer_sizes):
            self.layers.add_module(f"hidden_{i}", nn.Linear(h1, h2))
            self.layers.add_module(f"activation_{i}", hidden_activation)

        # Output layer
        self.layers.add_module("output", nn.Linear(hidden_sizes[-1], output_size))
        if output_activation is not None:
            self.layers.add_module("output_activation", output_activation)

        # Regularization
        self.l2_weight = l2_weight

    def forward(self, x):
        # Forward pass through the network
        x = self.layers(x)
        return x

    def l2_regularization(self):
        l2_reg = None
        for name, param in self.named_parameters():
            if 'weight' in name:
                if l2_reg is None:
                    l2_reg = param.norm(2)
                else:
                    l2_reg = l2_reg + param.norm(2)
        return self.l2_weight * l2_reg

### Function for training the model

In [15]:
from torch.utils.data import DataLoader, TensorDataset


def train(obs, act, model, num_epochs=10, batch_size=32):
    obs_tensor = torch.tensor(obs, dtype=torch.float32)
    act_tensor = torch.tensor(act, dtype=torch.float32)

    dataset = TensorDataset(obs_tensor, act_tensor)
    data_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)


    # Define the loss function and optimizer
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters())

    # Training loop
    for epoch in range(num_epochs):
        for batch_idx, (x_batch, y_batch) in enumerate(data_loader):
            # Forward pass
            y_pred = model(x_batch)

            # Compute loss
            loss = loss_fn(y_pred, y_batch) + model.l2_regularization()

            # Zero gradients, perform a backward pass, and update the weights.
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Print loss every epoch
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {loss.item()}")
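
Read off the code above, the per-batch objective minimized by `train` can be written as follows. The notation is introduced here for illustration: $B$ is a mini-batch of expert pairs $(s, a)$, $d$ is the action dimension, $\pi_\theta$ is the MLP, $\lambda$ is `l2_weight`, and $W_l$ are the weight matrices of the linear layers.

$$\mathcal{L}(\theta) = \frac{1}{|B|\, d} \sum_{(s, a) \in B} \left\lVert \pi_\theta(s) - a \right\rVert_2^2 \;+\; \lambda \sum_{l} \left\lVert W_l \right\rVert_F$$

Note that `l2_regularization` sums plain (non-squared) norms of the weights and skips the biases, which differs from the more common squared-norm weight decay.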

In [16]:
env = gym.make('Ant-v4')
env.num_agents = 1
env_info = extract_env_info(env, cfg)
device = torch.device("cpu" if cfg.device == "cpu" else "cuda")
collected_data = run_policy(env, functools.partial(get_expert_actions, cfg=cfg, actor_critic=expert, env=env, env_info=env_info, device=device), total_steps=10000)

Step 1000/10000, Steps per second: 735.2488033790245
Step 2000/10000, Steps per second: 1089.0234187975602
Step 3000/10000, Steps per second: 1090.9194947281248
Step 4000/10000, Steps per second: 1073.1327653527585
Step 5000/10000, Steps per second: 1118.6519966917613
Step 6000/10000, Steps per second: 1128.1218455424262
Step 7000/10000, Steps per second: 1124.9511859140787
Step 8000/10000, Steps per second: 1112.3872609329744
Step 9000/10000, Steps per second: 1122.014965000432
Step 10000/10000, Steps per second: 1125.2357820505977


In [17]:
obs, act, rewards, dones = collected_data

# EXERCISE: Create model
model = ...

train(...)



Epoch 1/10, Loss: 0.017476055771112442
Epoch 2/10, Loss: 0.014620059169828892
Epoch 3/10, Loss: 0.009299281984567642
Epoch 4/10, Loss: 0.01285579800605774
Epoch 5/10, Loss: 0.013694003224372864
Epoch 6/10, Loss: 0.012380083091557026
Epoch 7/10, Loss: 0.014809663407504559
Epoch 8/10, Loss: 0.012385744601488113
Epoch 9/10, Loss: 0.01006702333688736
Epoch 10/10, Loss: 0.01109317410737276


In [18]:
evaluate_agent(env, model)

Num. episodes: 82
Avg. return: 2547.0173836750105
Max. return: 5538.0024825983855
Min. return: 14.634580395818375


### Exercise

Discuss the questions

1. In principle, do we need the expert policy for BC?

2. What are the problems with BC?

3. How can we help BC do better?


In [19]:
# Collect the exploratory data
def exploratory(obs, **kwargs):
    """Adds the Gaussian noise to the expert actions."""
    ...
    return action

expl_data = run_policy(env, functools.partial(exploratory, cfg=cfg, actor_critic=expert, env=env, env_info=env_info, device=device), total_steps=10000)

Step 1000/10000, Steps per second: 803.912627892673
Step 2000/10000, Steps per second: 769.5650107784502
Step 3000/10000, Steps per second: 769.4930061751282
Step 4000/10000, Steps per second: 788.1560420823024
Step 5000/10000, Steps per second: 773.1042623453627
Step 6000/10000, Steps per second: 775.3428526851833
Step 7000/10000, Steps per second: 823.1821436197606
Step 8000/10000, Steps per second: 803.5847166735766
Step 9000/10000, Steps per second: 810.7443481548592
Step 10000/10000, Steps per second: 762.390307547844


In [20]:
obs_expl, act_expl, rewards, dones = expl_data
# Exercise: Run BC on the exploratory data

# ANSWER
...
# END ANSWER

Epoch 1/10, Loss: 0.016396241262555122
Epoch 2/10, Loss: 0.027394423261284828
Epoch 3/10, Loss: 0.012382512912154198
Epoch 4/10, Loss: 0.011884797364473343
Epoch 5/10, Loss: 0.011303437873721123
Epoch 6/10, Loss: 0.009989911690354347
Epoch 7/10, Loss: 0.00975151639431715
Epoch 8/10, Loss: 0.009488686919212341
Epoch 9/10, Loss: 0.01032942347228527
Epoch 10/10, Loss: 0.008678617887198925


In [21]:
evaluate_agent(env, model_expl)

Num. episodes: 54
Avg. return: 4446.122508227501
Max. return: 5606.291213523451
Min. return: 124.37031972278783


### Exercise

Answer the questions

1. Why does it perform better?

2. How can we use the expert to further improve the data?


In [22]:
# Exercise: Infer the expert actions on the exploratory observations
#           and run BC on them.

# ANSWER
...
# END ANSWER

Epoch 1/10, Loss: 0.016244133934378624
Epoch 2/10, Loss: 0.011613224633038044
Epoch 3/10, Loss: 0.012513170950114727
Epoch 4/10, Loss: 0.011456134729087353
Epoch 5/10, Loss: 0.007960868999361992
Epoch 6/10, Loss: 0.010249370709061623
Epoch 7/10, Loss: 0.011152423918247223
Epoch 8/10, Loss: 0.0086353225633502
Epoch 9/10, Loss: 0.008648401126265526
Epoch 10/10, Loss: 0.011051847599446774


In [24]:
evaluate_agent(env, model_expl2)

Num. episodes: 102
Avg. return: 1534.5132216604345
Max. return: 5468.810280809093
Min. return: 115.43894478452421


### Exercise

Answer the questions

1. Did it help? Why?


1. How can you extend this idea?


## 2. Imitation Learning via Interactive Demonstrator

[DAgger](https://www.ri.cmu.edu/pub_files/2011/4/Ross-AISTATS11-NoRegret.pdf)

1. Collect the expert data.
2. Fit the model (classifier/regressor) to the expert data.
3. Collect the imitator data.
4. Infer the expert actions on the imitator data.
5. Fit the model to the extended dataset.
6. Repeat from 3.
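
The loop above can be sketched on a toy 1-D task. Everything in this example is an assumption made for illustration: the expert `-tanh(s)`, the noisy dynamics, the polynomial learner and the number of iterations are hypothetical stand-ins, and the code is not the solution to the Ant exercise below.

```python
import numpy as np

rng = np.random.default_rng(0)

def expert_action(s):
    # Hypothetical expert for a toy 1-D task: push the state toward zero.
    return -np.tanh(s)

def features(s):
    # Polynomial features (s, s^3) used by the learner.
    s = np.asarray(s, dtype=float)
    return np.stack([s, s**3], axis=-1)

def fit(states, actions):
    # Least-squares regression from features to expert actions.
    w, *_ = np.linalg.lstsq(features(states), actions, rcond=None)
    return w

def rollout(policy, n_steps=200):
    # Toy dynamics s' = s + a + noise, clipped to [-3, 3]; returns visited states.
    s, visited = rng.uniform(-2, 2), []
    for _ in range(n_steps):
        visited.append(s)
        s = float(np.clip(s + policy(s) + 0.3 * rng.normal(), -3, 3))
    return np.array(visited)

# Steps 1-2: collect expert data and fit the initial model.
data_s = rollout(expert_action)
data_a = expert_action(data_s)
w = fit(data_s, data_a)

for it in range(4):
    learner = lambda s, w=w: float(features(s) @ w)
    # Step 3: collect the states visited by the current imitator.
    new_s = rollout(learner)
    # Step 4: label those states with expert actions.
    new_a = expert_action(new_s)
    # Step 5: aggregate and refit on the extended dataset.
    data_s = np.concatenate([data_s, new_s])
    data_a = np.concatenate([data_a, new_a])
    w = fit(data_s, data_a)
    mse = np.mean((features(new_s) @ w - new_a) ** 2)
    print(f"iter {it + 1}: dataset size {len(data_s)}, MSE on imitator states {mse:.4f}")
```

The key difference from plain BC is that the training states in steps 3 to 5 come from the imitator's own rollouts, so the expert labels cover the states the imitator actually reaches.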

In [28]:
# We will pre-train on less expert data to keep the same dataset size
obs_ = obs[:2000,:]
act_ = act[:2000,:]

# EXERCISE: pretrain on first 2000 samples
# ANSWER
...
# END ANSWER

evaluate_agent(env, model_dagger)

Epoch 1/10, Loss: 0.02567135915160179
Epoch 2/10, Loss: 0.021934062242507935
Epoch 3/10, Loss: 0.01393201481550932
Epoch 4/10, Loss: 0.010927984490990639
Epoch 5/10, Loss: 0.01473385002464056
Epoch 6/10, Loss: 0.023166796192526817
Epoch 7/10, Loss: 0.011995591223239899
Epoch 8/10, Loss: 0.010540708899497986
Epoch 9/10, Loss: 0.014523297548294067
Epoch 10/10, Loss: 0.009346766397356987
Num. episodes: 56
Avg. return: 3017.7322707465974
Max. return: 4360.509766970099
Min. return: 179.2149749492237


In [29]:
# Exercise: Implement DAgger

for i in range(4):
    print(f'\n### Iter. {i+1} ###')

    # ANSWER
    print('\n1. Data collection')
    obs_extra, _, _, _ = ...  # Collect 2k steps


    print('\n2. Training')
    # reset model for fair comparison
    model_dagger = ...

    # END ANSWER

    print('\n3. Evaluation')
    evaluate_agent(env, model_dagger)


### Iter. 1 ###

1. Data collection
Step 1000/2000, Steps per second: 3294.7693632223873
Step 2000/2000, Steps per second: 3707.6622117245215

2. Training
Epoch 1/10, Loss: 0.01619561016559601
Epoch 2/10, Loss: 0.01852767914533615
Epoch 3/10, Loss: 0.013723315671086311
Epoch 4/10, Loss: 0.014866928569972515
Epoch 5/10, Loss: 0.014802966266870499
Epoch 6/10, Loss: 0.011287961155176163
Epoch 7/10, Loss: 0.015429697930812836
Epoch 8/10, Loss: 0.01600058376789093
Epoch 9/10, Loss: 0.012179029174149036
Epoch 10/10, Loss: 0.011647832579910755

3. Evaluation
Num. episodes: 53
Avg. return: 4466.662482786521
Max. return: 5459.026620695091
Min. return: 384.4503018264944

### Iter. 2 ###

1. Data collection
Step 1000/2000, Steps per second: 3451.289655274561
Step 2000/2000, Steps per second: 3548.693703259671

2. Training
Epoch 1/10, Loss: 0.01004914939403534
Epoch 2/10, Loss: 0.010697782039642334
Epoch 3/10, Loss: 0.009375996887683868
Epoch 4/10, Loss: 0.00921697448939085
Epoch 5/10, Loss: 0.00

### Note

Training the expert with the PPO algorithm took 10M data samples (environment interactions). Here, we nearly match its performance with only 10k samples, about 1000 times fewer. Learning from an expert can be much more sample-efficient than reinforcement learning.