# Setup Code

To begin, prepare the Colab environment by clicking the play button below, and make sure you are using a GPU runtime. This installs all dependencies needed by the later code and can take up to 1.5 minutes.

In [1]:
# below fixes some bugs introduced by some recent Colab changes
!mkdir -p /usr/share/vulkan/icd.d
!gdown https://drive.google.com/uc?id=1wPc9yjRLwcr3B3aTyfHNQcw4l23xzOTY
!gdown https://drive.google.com/uc?id=1IG__shIYJOWiKt09T5UfEHoD0iUylXF8
!mv nvidia_icd.json /usr/share/vulkan/icd.d
!mv 10_nvidia.json /usr/share/glvnd/egl_vendor.d/10_nvidia.json
# dependencies
!apt-get install -y --no-install-recommends libvulkan-dev
!pip install mani_skill2
!pip install --upgrade --no-cache-dir gdown

Downloading...
From: https://drive.google.com/uc?id=1wPc9yjRLwcr3B3aTyfHNQcw4l23xzOTY
To: /content/10_nvidia.json
100% 106/106 [00:00<00:00, 359kB/s]
Downloading...
From: https://drive.google.com/uc?id=1IG__shIYJOWiKt09T5UfEHoD0iUylXF8
To: /content/nvidia_icd.json
100% 139/139 [00:00<00:00, 552kB/s]
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  libvulkan1
Recommended packages:
  mesa-vulkan-drivers | vulkan-icd
The following NEW packages will be installed:
  libvulkan-dev libvulkan1
0 upgraded, 2 newly installed, 0 to remove and 30 not upgraded.
Need to get 1,020 kB of archives.
After this operation, 17.2 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 libvulkan1 amd64 1.3.204.1-2 [128 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/main amd64 libvulkan-dev amd64 1.3.204.1-2 [892 kB]
Fetched 1,020 kB in 1s (1,144 kB/s)
Selecting pr



In [1]:
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    import site
    site.main() # run this so local pip installs are recognized

# Robotic Learning Tutorial Part 2: Imitation Learning

This notebook will go over an Imitation Learning (IL) baseline for solving [ManiSkill2](https://github.com/haosulab/ManiSkill2) environments. Our environments come with expert demonstrations, which makes Learning from Demonstrations (LfD) and IL approaches such as behavior cloning (BC) feasible.


We will use the LiftCube environment with state and visual observations and train policies via supervised learning using the BC algorithm.
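To make "supervised learning using the BC algorithm" concrete, here is a toy sketch of behavior cloning as plain regression from observations to expert actions. Everything in it is an assumption made for illustration: the linear "expert", the dimensions, and the least-squares fit, which stands in for the gradient-based MLP training used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "expert": a fixed linear map from 4-D observations to 2-D actions.
W_expert = rng.normal(size=(4, 2))

# Demonstrations are (observation, expert action) pairs.
demo_obs = rng.normal(size=(500, 4))
demo_actions = demo_obs @ W_expert

# Behavior cloning = regress actions on observations.
# A least-squares fit stands in for the gradient-based training used later.
W_cloned, *_ = np.linalg.lstsq(demo_obs, demo_actions, rcond=None)

# The cloned policy should reproduce the expert on unseen observations.
test_obs = rng.normal(size=(10, 4))
max_error = np.abs(test_obs @ W_cloned - test_obs @ W_expert).max()
print(f"max action error on new observations: {max_error:.2e}")
```

Real demonstrations are not generated by a linear map, which is why the rest of the tutorial uses a neural network policy instead.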

A single-file code version of this tutorial can be found here: https://github.com/haosulab/ManiSkill2/tree/main/examples/tutorials/imitation-learning/

First, we will import all required packages.

In [2]:
# Import required packages
import argparse
import os.path as osp
from pathlib import Path

import gymnasium as gym
import numpy as np
import h5py
import torch as th
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from gymnasium.wrappers import TimeLimit
from tqdm.notebook import tqdm

import mani_skill2.envs
from mani_skill2.utils.wrappers import RecordEpisode
from mani_skill2.utils.io_utils import load_json

## 1 State Based IL

State based observations are flat/dense vectors and are generally easier to learn for machine learning algorithms.

State based observations are much faster and easier to work with than visual observations as generating visuals can be slow (especially without a GPU) and distilling information from images is difficult.

However, state can provide privileged information about the environment that is unavailable in the real world or at test time. While training is fast, state-based policies generalize less well across tasks and objects without additional techniques.

### 1.1 Download Demonstrations

To get started, we first need to download the demonstrations dataset for our desired environment. The code here is agnostic to environment choice, but the training code is tuned for the LiftCube-v0 environment.

Using the `mani_skill2.utils.download_demo` tool you can download datasets by `env_id`. Note that these datasets don't come with observations in order to conserve space. As a result, we further need to convert the trajectories to add observations back in. To convert trajectories you can use the `mani_skill2.trajectory.replay_trajectory` tool.

If you want to skip the trajectory conversion, you can directly download the already converted trajectory dataset, as the download cell below does. For this section we will use `state` observations and the recommended `pd_ee_delta_pose` controller.


In [3]:
env_id = "LiftCube-v0"

In [4]:
# Directly download the converted demonstrations dataset files
import urllib.request
!mkdir -p "demos/v0/rigid_body/LiftCube-v0"
urllib.request.urlretrieve("https://huggingface.co/datasets/haosulab/ManiSkill2/resolve/main/processed_demos/LiftCube-v0.tar.gz", "demos/v0/rigid_body/LiftCube-v0.tar.gz")
!tar -xvzf "demos/v0/rigid_body/LiftCube-v0.tar.gz" -C "demos/v0/rigid_body/"


LiftCube-v0/
LiftCube-v0/trajectory.json
LiftCube-v0/trajectory.state.pd_ee_delta_pose.h5
LiftCube-v0/trajectory.state.pd_ee_delta_pose.json
LiftCube-v0/trajectory.rgbd.pd_ee_delta_pose.json
LiftCube-v0/trajectory.h5
LiftCube-v0/trajectory.rgbd.pd_ee_delta_pose.h5


### 1.2 Setting up the Dataset

Using PyTorch, we can use the [Dataset and Dataloader](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) classes to manage the demonstrations dataset.

Importantly, as our datasets are stored with the h5py package, to load demonstrations into memory for faster access, you will need to use the `load_h5_data` function we provide below. By default, h5py will give you references to data instead of loading into memory.

Note that each trajectory in the dataset has `N+1` observations and `N` actions with the extra observation being the terminal observation.
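The `N+1` versus `N` bookkeeping can be illustrated with a toy trajectory. The array contents below are made up; only the slicing pattern mirrors what the dataset class does.

```python
import numpy as np

N = 3  # number of actions in a toy trajectory
obs = np.arange((N + 1) * 2, dtype=np.float32).reshape(N + 1, 2)  # N+1 observations
actions = np.ones((N, 1), dtype=np.float32)                        # N actions

# Drop the terminal observation so that obs[t] pairs with actions[t].
paired_obs = obs[:-1]
assert len(paired_obs) == len(actions)
for t in range(N):
    print(t, paired_obs[t], actions[t])
```

The last row of `obs` is the state reached after the final action, so it has no action to be paired with.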

In [5]:
# loads h5 data into memory for faster access
def load_h5_data(data):
    out = dict()
    for k in data.keys():
        if isinstance(data[k], h5py.Dataset):
            out[k] = data[k][:]
        else:
            out[k] = load_h5_data(data[k])
    return out

class ManiSkill2Dataset(Dataset):
    def __init__(self, dataset_file: str, load_count=-1) -> None:
        self.dataset_file = dataset_file
        # for details on how the code below works, see the
        # quick start tutorial
        self.data = h5py.File(dataset_file, "r")
        json_path = dataset_file.replace(".h5", ".json")
        self.json_data = load_json(json_path)
        self.episodes = self.json_data["episodes"]
        self.env_info = self.json_data["env_info"]
        self.env_id = self.env_info["env_id"]
        self.env_kwargs = self.env_info["env_kwargs"]

        self.observations = []
        self.actions = []
        self.total_frames = 0
        if load_count == -1:
            load_count = len(self.episodes)
        for eps_id in tqdm(range(load_count)):
            eps = self.episodes[eps_id]
            trajectory = self.data[f"traj_{eps['episode_id']}"]
            trajectory = load_h5_data(trajectory)
            # we use :-1 here to ignore the last observation as that
            # is the terminal observation which has no actions
            self.observations.append(trajectory["obs"][:-1])
            self.actions.append(trajectory["actions"])
        self.observations = np.vstack(self.observations)
        self.actions = np.vstack(self.actions)

    def __len__(self):
        return len(self.observations)

    def __getitem__(self, idx):
        action = th.from_numpy(self.actions[idx]).float()
        obs = th.from_numpy(self.observations[idx]).float()
        return obs, action

In [6]:
dataset = ManiSkill2Dataset(f"demos/v0/rigid_body/{env_id}/trajectory.state.pd_ee_delta_pose.h5")
dataloader = DataLoader(dataset, batch_size=256, num_workers=0, pin_memory=True, drop_last=True, shuffle=True)
obs, action = dataset[0]
print("Observation:", obs.shape)
print("Action:", action.shape)

  0%|          | 0/100 [00:00<?, ?it/s]

Observation: torch.Size([42])
Action: torch.Size([7])


### 1.3 Policy Definition

With our dataset, we know what our inputs and outputs look like. We can now easily define a policy/model with PyTorch. For state observations, we can simply build an MLP to process them and predict actions.

In [7]:
class Policy(nn.Module):
    def __init__(
        self,
        obs_dims,
        act_dims,
        hidden_units=(128, 128),
        activation=nn.ReLU,
    ):
        super().__init__()
        mlp_layers = []
        prev_units = obs_dims
        for h in hidden_units:
            mlp_layers += [nn.Linear(prev_units, h), activation()]
            prev_units = h
        # attach a tanh regression head since we know all actions are constrained to [-1, 1]
        mlp_layers += [nn.Linear(prev_units, act_dims), nn.Tanh()]
        self.mlp = nn.Sequential(*mlp_layers)

    def forward(self, observations) -> th.Tensor:
        return self.mlp(observations)

# create our policy
obs, action = dataset[0]
policy = Policy(obs.shape[0], action.shape[0], hidden_units=[256, 256])
# move model to gpu if possible
device = "cuda" if th.cuda.is_available() else "cpu"
policy = policy.to(device)
policy.train()
print(policy)

Policy(
  (mlp): Sequential(
    (0): Linear(in_features=42, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=256, bias=True)
    (3): ReLU()
    (4): Linear(in_features=256, out_features=7, bias=True)
    (5): Tanh()
  )
)
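As a quick sanity check on the printed model, the number of trainable parameters can be worked out by hand from the layer sizes shown above (42 inputs, two hidden layers of 256 units, 7 outputs). The helper name `linear_params` is just for this sketch.

```python
# Layer sizes as printed above: 42 -> 256 -> 256 -> 7
layer_sizes = [42, 256, 256, 7]

def linear_params(n_in, n_out):
    # a fully connected layer has an (n_in x n_out) weight matrix plus n_out biases
    return n_in * n_out + n_out

per_layer = [linear_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(per_layer)
print(per_layer, total_params)
```

At under 80 thousand parameters, this is a small network, which is part of why state-based training is fast.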


### 1.4 Setting up Training, Dataloader, and Logging

With a policy and dataset, we can now write some utility functions to perform a training step, load data in batches, and log results to tensorboard.
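The training step below minimizes a mean squared error between demonstrated and predicted actions. As a rough illustration of what that loss computes, here is the same quantity written out in NumPy with made-up numbers; the `mse` helper is hypothetical and only mirrors the default "mean" reduction of `nn.MSELoss`.

```python
import numpy as np

def mse(pred, target):
    # mean over every element of the batch, like nn.MSELoss's default "mean" reduction
    return np.mean((pred - target) ** 2)

demo_actions = np.array([[0.5, -0.2], [0.0, 1.0]])   # actions from demonstrations
pred_actions = np.array([[0.4, -0.2], [0.2, 0.6]])   # hypothetical policy outputs

loss = mse(pred_actions, demo_actions)
print(loss)
```

Because the error is squared, the value is the same whichever argument is treated as the prediction, so the argument order only matters for readability.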

In [8]:
loss_fn = nn.MSELoss()

# a short save function to save our model
def save_model(policy, path):
    save_data = dict(
        policy=policy.state_dict(),
    )
    th.save(save_data, path)

def train_step(policy, obs, actions, optim, loss_fn):
    optim.zero_grad()
    # move data to appropriate device first
    obs = obs.to(device)
    actions = actions.to(device)

    pred_actions = policy(obs)

    # compute loss and optimize
    loss = loss_fn(actions, pred_actions)
    loss.backward()
    optim.step()
    return loss.item()

def evaluate_policy(env, policy, num_episodes=10):
    policy.eval()
    obs, _ = env.reset()
    successes = []
    i = 0
    while i < num_episodes:
        obs = th.from_numpy(obs[None]).float().to(device)
        with th.no_grad():
            action = policy(obs).cpu().numpy()[0]
        obs, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:
            successes.append(info["success"])
            i += 1
            obs, _ = env.reset(seed=i)
    policy.train()
    print(successes)
    return np.mean(successes)

The code below also sets up the logging tools, whose output can be viewed with `tensorboard --logdir logs`. You can also open TensorBoard directly in this notebook.

### 1.5 Training and Evaluation

We can now create an optimizer and a training loop and begin training. The code below runs `iterations = 70000` gradient steps at a learning rate of `1e-3`. These parameters are tuned for the LiftCube environment and will train a successful policy that doesn't overfit too much to the dataset. Training takes around 2-10 minutes depending on hardware. If you wish to skip training, you can also download pretrained weights in a later cell.

Note that this is a simple tutorial with a barebones training setup. It doesn't include a validation dataset, regularization, or normalization, and it only periodically evaluates the success rate on a handful of episodes.

With a trained policy in hand, we can now create an evaluation environment to compute the success rate and watch the videos. The default settings should train a policy that achieves around a 30% success rate.

In [9]:
!rm -rf logs/state_*

In [10]:
obs_mode = "state"
control_mode = "pd_ee_delta_pose"
env = gym.make(env_id, obs_mode=obs_mode, control_mode=control_mode, render_mode="cameras")
# RecordEpisode wrapper auto records a new video once an episode is completed
env = RecordEpisode(env, output_dir=f"logs/state_{env_id}/videos", save_trajectory=False)

In [11]:
iterations = 70000
optim = th.optim.Adam(policy.parameters(), lr=1e-3)
best_epoch_loss = np.inf
pbar = tqdm(dataloader, total=iterations)
ckpt_dir = f"logs/state_{env_id}/ckpts"
Path(ckpt_dir).mkdir(parents=True, exist_ok=True)
epoch = 0
steps = 0
while steps < iterations:
    epoch_loss = 0
    for batch in dataloader:
        steps += 1
        obs, actions = batch
        loss_val = train_step(policy, obs, actions, optim, loss_fn)

        # track the loss and print it
        epoch_loss += loss_val
        pbar.set_postfix(dict(loss=loss_val))
        pbar.update(1)

        # periodically save the policy
        if steps % 10000 == 0: save_model(policy, osp.join(ckpt_dir, f"ckpt_{steps}.pt"))
        if steps >= iterations: break

    epoch_loss = epoch_loss / len(dataloader)

    # save a new model if the average MSE loss in an epoch has improved
    if epoch_loss < best_epoch_loss:
        best_epoch_loss = epoch_loss
        save_model(policy, osp.join(ckpt_dir, f"ckpt_best.pt"))

    if epoch % 200 == 0:
        print(f"Evaluating policy at step {steps}...")
        print("Success rate:", evaluate_policy(env, policy))

    epoch += 1
save_model(policy, osp.join(ckpt_dir, f"ckpt_latest.pt"))
print("Training complete. Final evaluation:")
print("Success rate:", evaluate_policy(env, policy))

  0%|          | 0/70000 [00:00<?, ?it/s]

Evaluating policy at step 35...
[False, False, False, True, False, False, False, False, False, False]
Success rate: 0.1
Evaluating policy at step 7035...
[True, False, True, True, False, False, False, True, True, False]
Success rate: 0.5
Evaluating policy at step 14035...
[False, False, False, True, False, False, False, False, False, False]
Success rate: 0.1
Evaluating policy at step 21035...
[False, False, False, False, False, False, False, False, False, False]
Success rate: 0.0
Evaluating policy at step 28035...
[False, False, False, False, False, False, False, False, False, False]
Success rate: 0.0
Evaluating policy at step 35035...
[False, False, False, False, False, False, False, False, False, False]
Success rate: 0.0
Evaluating policy at step 42035...
[False, False, False, True, False, False, True, False, False, False]
Success rate: 0.2
Evaluating policy at step 49035...
[False, False, False, False, False, False, False, False, True, False]
Success rate: 0.1
Evaluating policy at s

KeyboardInterrupt: 
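The step numbers in the log above (35, 7035, 14035, ...) follow from the loop structure, and the arithmetic can be reproduced as a small sketch. It assumes the settings shown earlier (`batch_size=256`, `drop_last=True`, evaluation every 200 epochs) and reads the number of batches per epoch off the first log line.

```python
batch_size = 256
steps_per_epoch = 35   # read off the first log line ("step 35" after epoch 0)
eval_every_epochs = 200
iterations = 70000

# drop_last=True keeps only full batches, so the dataset size must fall in this range
min_samples = steps_per_epoch * batch_size
max_samples = (steps_per_epoch + 1) * batch_size - 1

# evaluation happens after epochs 0, 200, 400, ... have finished
eval_steps = [(e + 1) * steps_per_epoch for e in range(0, 1000, eval_every_epochs)]
total_epochs = iterations // steps_per_epoch

print(min_samples, max_samples)
print(eval_steps)
print(total_epochs)
```

Under these assumptions the 100 demonstrations contain roughly nine thousand observation-action pairs, and the full run makes about 2000 passes over them.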

In [None]:
!wget https://huggingface.co/datasets/haosulab/ManiSkill2/resolve/main/pretrained_models/tutorials/LiftCube-v0_il.state.pd_ee_delta_pose.pt
policy.load_state_dict(th.load("LiftCube-v0_il.state.pd_ee_delta_pose.pt", map_location=device)["policy"])

--2025-04-08 04:31:07--  https://huggingface.co/datasets/haosulab/ManiSkill2/resolve/main/pretrained_models/tutorials/LiftCube-v0_il.state.pd_ee_delta_pose.pt
Resolving huggingface.co (huggingface.co)... 3.166.152.65, 3.166.152.110, 3.166.152.44, ...
Connecting to huggingface.co (huggingface.co)|3.166.152.65|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.hf.co/repos/2a/1b/2a1bc925ce74825552424e1032229b5b8ac70df951fd407aa55890151fb2e4ba/9eda3bf4e11d83605e6667b7a010d2b0b3b61c301f338db73e39be035a441e76?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27LiftCube-v0_il.state.pd_ee_delta_pose.pt%3B+filename%3D%22LiftCube-v0_il.state.pd_ee_delta_pose.pt%22%3B&Expires=1744090267&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc0NDA5MDI2N319LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5oZi5jby9yZXBvcy8yYS8xYi8yYTFiYzkyNWNlNzQ4MjU1NTI0MjRlMTAzMjIyOWI1YjhhYzcwZGY5NTFmZDQwN2FhNTU4OTAxNTFmYjJlNGJhLzllZGEzYm

<All keys matched successfully>

In [None]:
obs_mode = "state"
control_mode = "pd_ee_delta_pose"
env = gym.make(env_id, obs_mode=obs_mode, control_mode=control_mode, render_mode="cameras")
# RecordEpisode wrapper auto records a new video once an episode is completed
env = RecordEpisode(env, output_dir=f"logs/state_{env_id}/eval_videos", save_trajectory=False)
obs, _ = env.reset(seed=42)

successes = []
num_episodes = 10
i = 0
pbar = tqdm(total=num_episodes)
while i < num_episodes:
    # batch observation and move to appropriate device
    obs = th.from_numpy(obs[None]).float().to(device)
    with th.no_grad():
        action = policy(obs).cpu().numpy()[0]
    obs, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        print(f"Test Episode {i}: {info['success']}")
        successes.append(info['success'])
        obs, _ = env.reset()
        i += 1
        pbar.update(1)
print("Success Rate:", np.mean(successes))
print(successes)

  0%|          | 0/10 [00:00<?, ?it/s]

Test Episode 0: False
Test Episode 1: False
Test Episode 2: False
Test Episode 3: False
Test Episode 4: False
Test Episode 5: False
Test Episode 6: True
Test Episode 7: False
Test Episode 8: False
Test Episode 9: False
Success Rate: 0.1
[False, False, False, False, False, False, True, False, False, False]
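Ten episodes give only a coarse estimate of the success rate, which helps explain why the numbers fluctuate between evaluations. A rough way to quantify this is the standard error of a binomial proportion, sketched here on the result list printed above. The normal approximation behind it is crude for so few episodes, so treat the number as indicative only.

```python
import math

successes = [False, False, False, False, False, False, True, False, False, False]
n = len(successes)
p_hat = sum(successes) / n

# standard error of a binomial proportion
std_err = math.sqrt(p_hat * (1 - p_hat) / n)
print(f"success rate {p_hat:.1f} +/- {std_err:.2f} (1 standard error, n={n})")
```

Increasing `num_episodes` shrinks this uncertainty in proportion to the square root of the number of episodes.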


In [12]:
from IPython.display import Video
Video(f"logs/state_{env_id}/videos/3.mp4", embed=True) # Watch one of the replays

## 2 Visual IL

Visual observations, while slower and harder to work with than states, have the potential to train more generalizable policies that can solve multiple tasks with different objects. Visual data such as RGBD images or point clouds captures geometry that is crucial for manipulating objects with complex shapes.

This section will go over how to create a demonstration dataset with visual observation data, specifically RGBD, and cover some subtler points necessary to properly use our demonstrations (e.g. depth data being compressed into a `uint16` type to conserve space).

### 2.1 Download Demonstrations

To get started, we first need to download the demonstrations dataset for our desired environment. The code here is agnostic to environment choice, but the training code is tuned for the LiftCube-v0 environment.

Using the `mani_skill2.utils.download_demo` tool you can download datasets by `env_id`. Note that these datasets don't come with observations in order to conserve space. As a result, we further need to convert the trajectories to add observations back in.

If you want to skip the trajectory conversion, you can directly download the already converted trajectory dataset, as the download cell below does. For this section we will use `rgbd` observations and the recommended `pd_ee_delta_pose` controller.


In [None]:
env_id = "LiftCube-v0"

In [None]:
# Directly download the converted demonstrations dataset files
import urllib.request
!mkdir -p "demos/v0/rigid_body/LiftCube-v0"
urllib.request.urlretrieve("https://huggingface.co/datasets/haosulab/ManiSkill2/resolve/main/processed_demos/LiftCube-v0.tar.gz", "demos/v0/rigid_body/LiftCube-v0.tar.gz")
!tar -xvzf "demos/v0/rigid_body/LiftCube-v0.tar.gz" -C "demos/v0/rigid_body/"

LiftCube-v0/
LiftCube-v0/trajectory.json
LiftCube-v0/trajectory.state.pd_ee_delta_pose.h5
LiftCube-v0/trajectory.state.pd_ee_delta_pose.json
LiftCube-v0/trajectory.rgbd.pd_ee_delta_pose.json
LiftCube-v0/trajectory.h5
LiftCube-v0/trajectory.rgbd.pd_ee_delta_pose.h5


### 2.2 Setting up the Dataset

Using PyTorch, we can use the [Dataset and Dataloader](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) classes to manage the demonstrations dataset.

Importantly, as our datasets are stored with the h5py package, to load demonstrations into memory for faster access, you will need to use the `load_h5_data` function we provide below. By default, h5py will give you references to data instead of loading into memory.

Note that each trajectory in the dataset has `N+1` observations and `N` actions with the extra observation being the terminal observation.

#### 2.2.1 Processing RGBD Observations

As we are working with RGBD data, we will need to process it appropriately. Specifically, we will need to scale the RGB data, and un-scale the depth information which was previously scaled and converted to `uint16` to save space.
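To illustrate what scaling depth into `uint16` and un-scaling it again involves, here is a small round-trip sketch. It assumes the factor of `2**10` that the `rescale_rgbd` function in this section divides by. The depth values are made up, and the truncating cast is an assumption about how the compression was done.

```python
import numpy as np

DEPTH_SCALE = 2**10  # same factor that rescale_rgbd divides by

depth = np.array([0.0, 0.25, 1.2345, 3.0], dtype=np.float32)  # hypothetical depths

# compress: scale up and store as 16-bit unsigned integers
stored = (depth * DEPTH_SCALE).astype(np.uint16)

# un-scale: convert back to floats
restored = stored.astype(np.float32) / DEPTH_SCALE

print(stored)
print(restored)
print("largest representable value:", np.iinfo(np.uint16).max / DEPTH_SCALE)
```

The round trip loses less than `1 / 2**10` per value, which is the price paid for halving the storage relative to `float32`.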

Moreover, we want to simplify and remove the nested hierarchy in the observations by building a `convert_observation` function to convert observations into a desired input shape that's easier to work with. This includes flattening all dictionaries with state data (`agent` and `extra` keys).
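The flattening idea can be sketched with a small stand-in helper. Note that `flatten_nested` below is a hypothetical illustration, not the library's `flatten_state_dict`, and the key names and sizes in the toy observation are made up.

```python
import numpy as np

def flatten_nested(d):
    # hypothetical helper: walk the dictionary depth-first in insertion order
    # and concatenate every leaf array along the last axis
    parts = []
    for value in d.values():
        if isinstance(value, dict):
            parts.append(flatten_nested(value))
        else:
            parts.append(np.asarray(value, dtype=np.float32))
    return np.concatenate(parts, axis=-1)

# toy batch of 5 timesteps; key names and sizes are made up
obs = {
    "agent": {"qpos": np.zeros((5, 9)), "qvel": np.ones((5, 9))},
    "extra": {"tcp_pose": np.full((5, 7), 2.0)},
}
state = np.hstack([flatten_nested(obs["agent"]), flatten_nested(obs["extra"])])
print(state.shape)
```

The result is one flat vector per timestep, which is the form the `state` branch of the model consumes.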

For more details on how the code below works, you can reference https://haosulab.github.io/ManiSkill2/concepts/observation.html for information on what is stored in the raw RGBD observations.

In [None]:
def convert_observation(observation):
    # flattens the original observation by flattening the state dictionaries
    # and combining the rgb and depth images

    # image data is not scaled here and is kept as uint16 to save space
    image_obs = observation["image"]
    rgb = image_obs["base_camera"]["rgb"]
    depth = image_obs["base_camera"]["depth"]
    rgb2 = image_obs["hand_camera"]["rgb"]
    depth2 = image_obs["hand_camera"]["depth"]

    # we provide a simple tool to flatten dictionaries with state data
    from mani_skill2.utils.common import flatten_state_dict
    state = np.hstack(
        [
            flatten_state_dict(observation["agent"]),
            flatten_state_dict(observation["extra"]),
        ]
    )

    # combine the RGB and depth images
    rgbd = np.concatenate([rgb, depth, rgb2, depth2], axis=-1)
    obs = dict(rgbd=rgbd, state=state)
    return obs

def rescale_rgbd(rgbd, scale_rgb_only=False):
    # rescales rgbd data and changes them to floats
    # channel layout: [rgb1 (0:3), depth1 (3:4), rgb2 (4:7), depth2 (7:8)]
    rgb1 = rgbd[..., 0:3] / 255.0
    rgb2 = rgbd[..., 4:7] / 255.0
    depth1 = rgbd[..., 3:4]
    depth2 = rgbd[..., 7:8]
    if not scale_rgb_only:
        # demonstration depth was scaled and stored as uint16; undo that scaling
        depth1 = depth1 / (2**10)
        depth2 = depth2 / (2**10)
    return np.concatenate([rgb1, depth1, rgb2, depth2], axis=-1)

#### 2.2.2 Dataset Class Definition

Now we can define our PyTorch Dataset and Dataloader. The Dataset will go over each trajectory, load into memory, then convert all the observations into our desired shape (single-level dictionary with rgbd and state keys).

In [None]:
# loads h5 data into memory for faster access
def load_h5_data(data):
    out = dict()
    for k in data.keys():
        if isinstance(data[k], h5py.Dataset):
            out[k] = data[k][:]
        else:
            out[k] = load_h5_data(data[k])
    return out

class ManiSkill2Dataset(Dataset):
    def __init__(self, dataset_file: str, load_count=-1) -> None:
        self.dataset_file = dataset_file
        # for details on how the code below works, see the
        # quick start tutorial
        import h5py
        from mani_skill2.utils.io_utils import load_json
        self.data = h5py.File(dataset_file, "r")
        json_path = dataset_file.replace(".h5", ".json")
        self.json_data = load_json(json_path)
        self.episodes = self.json_data["episodes"]
        self.env_info = self.json_data["env_info"]
        self.env_id = self.env_info["env_id"]
        self.env_kwargs = self.env_info["env_kwargs"]

        self.obs_state = []
        self.obs_rgbd = []
        self.actions = []
        self.total_frames = 0
        if load_count == -1:
            load_count = len(self.episodes)
        for eps_id in tqdm(range(load_count)):
            eps = self.episodes[eps_id]
            trajectory = self.data[f"traj_{eps['episode_id']}"]
            trajectory = load_h5_data(trajectory)

            # convert the original raw observation with our batch-aware function
            obs = convert_observation(trajectory["obs"])
            # we use :-1 to ignore the last obs as terminal observations are included
            # and they don't have actions
            self.obs_rgbd.append(obs['rgbd'][:-1])
            self.obs_state.append(obs['state'][:-1])
            self.actions.append(trajectory["actions"])
        self.obs_rgbd = np.vstack(self.obs_rgbd)
        self.obs_state = np.vstack(self.obs_state)
        self.actions = np.vstack(self.actions)

    def __len__(self):
        return len(self.obs_rgbd)

    def __getitem__(self, idx):
        action = th.from_numpy(self.actions[idx]).float()
        rgbd = self.obs_rgbd[idx]
        rgbd = rescale_rgbd(rgbd)
        # permute data so that channels are the first dimension as PyTorch expects this
        rgbd = th.from_numpy(rgbd).float().permute((2, 0, 1))
        state = th.from_numpy(self.obs_state[idx]).float()
        return dict(rgbd=rgbd, state=state), action


For this tutorial, the LiftCube environment comes with just 100 demonstrations, which is sufficient for training a decent policy. Other environments have many more demonstrations, which may require more memory to hold.
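To get a feel for the memory involved, the per-frame cost of the stacked RGBD array can be estimated as below. This sketch assumes RGB is stored as `uint8` and depth as `uint16` at 128x128 resolution with two cameras. The frame count is a hypothetical round number, not a measured dataset size.

```python
import numpy as np

# uint8 RGB concatenated with uint16 depth promotes to uint16
rgb = np.zeros((128, 128, 3), dtype=np.uint8)
depth = np.zeros((128, 128, 1), dtype=np.uint16)
frame = np.concatenate([rgb, depth, rgb, depth], axis=-1)

bytes_per_frame = frame.nbytes
num_frames = 10_000  # hypothetical dataset size, not a measured value
total_gib = bytes_per_frame * num_frames / 2**30

print(frame.dtype, bytes_per_frame)
print(f"{total_gib:.2f} GiB for {num_frames} frames")
```

Keeping the data as integers and rescaling only inside `__getitem__` keeps this footprint at half of what `float32` storage would need.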

In [None]:
dataset = ManiSkill2Dataset(f"demos/v0/rigid_body/{env_id}/trajectory.rgbd.pd_ee_delta_pose.h5")
dataloader = DataLoader(dataset, batch_size=128, num_workers=1, pin_memory=True, drop_last=True, shuffle=True)
obs, action = dataset[0]
print("RGBD:", obs['rgbd'].shape)
print("State:", obs['state'].shape)
print("Action:", action.shape)

  0%|          | 0/100 [00:00<?, ?it/s]

RGBD: torch.Size([8, 128, 128])
State: torch.Size([32])
Action: torch.Size([7])
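The `[8, 128, 128]` shape printed above comes from moving the channel axis to the front. The same reordering can be sketched in NumPy on a made-up frame, using the axis order from the `permute((2, 0, 1))` call in `__getitem__`.

```python
import numpy as np

# hypothetical frame in the stored layout: height x width x channels
frame_hwc = np.zeros((128, 128, 8), dtype=np.float32)
frame_hwc[..., 3] = 1.0  # mark the first depth channel

# same axis order as the .permute((2, 0, 1)) call in __getitem__
frame_chw = np.transpose(frame_hwc, (2, 0, 1))

print(frame_chw.shape)
```

After the transpose, `frame_chw[3]` is the full 128x128 depth image of the first camera, which is the layout `nn.Conv2d` expects.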


### 2.3 Building a Model to Process RGBD data

Now that we have our data and we know its shape, we build a PyTorch model to perform predictions from observations. Here we will use NatureCNN as our backbone architecture to process RGBD data. The outputs of NatureCNN are then fed into an MLP that predicts the desired actions.

In [None]:
class NatureCNN(nn.Module):
    def __init__(self, image_size=(128, 128), in_channels=8, state_size=42):
        super().__init__()

        extractors = {}

        self.out_features = 0
        feature_size = 256

        # here we use a NatureCNN architecture to process images, but any architecture is permissible here
        cnn = nn.Sequential(
            nn.Conv2d(
                in_channels=in_channels,
                out_channels=32,
                kernel_size=8,
                stride=4,
                padding=0,
            ),
            nn.ReLU(),
            nn.Conv2d(
                in_channels=32, out_channels=64, kernel_size=4, stride=2, padding=0
            ),
            nn.ReLU(),
            nn.Conv2d(
                in_channels=64, out_channels=64, kernel_size=3, stride=1, padding=0
            ),
            nn.ReLU(),
            nn.Flatten(),
        )

        # to easily figure out the dimensions after flattening, we pass a test tensor
        test_tensor = th.zeros([in_channels, image_size[0], image_size[1]])
        with th.no_grad():
            n_flatten = cnn(test_tensor[None]).shape[1]
            fc = nn.Sequential(nn.Linear(n_flatten, feature_size), nn.ReLU())
        extractors["rgbd"] = nn.Sequential(cnn, fc)
        self.out_features += feature_size

        # for state data we simply pass it through a single linear layer
        extractors["state"] = nn.Linear(state_size, 64)
        self.out_features += 64

        self.extractors = nn.ModuleDict(extractors)

    def forward(self, observations) -> th.Tensor:
        encoded_tensor_list = []
        # self.extractors contain nn.Modules that do all the processing.
        for key, extractor in self.extractors.items():
            encoded_tensor_list.append(extractor(observations[key]))
        return th.cat(encoded_tensor_list, dim=1)

In [None]:
class Policy(nn.Module):
    def __init__(
        self,
        image_size=(128, 128),
        in_channels=8,
        state_size=42,
        hidden_units=(128, 128),
        act_dims=8,
        activation=nn.ReLU,
    ):
        super().__init__()
        self.feature_extractor = NatureCNN(image_size, in_channels, state_size)
        mlp_layers = []
        prev_units = self.feature_extractor.out_features
        for h in hidden_units:
            mlp_layers += [nn.Linear(prev_units, h), activation()]
            prev_units = h
        mlp_layers += [nn.Linear(prev_units, act_dims), nn.Tanh()]
        self.mlp = nn.Sequential(*mlp_layers)

    def forward(self, observations) -> th.Tensor:
        features = self.feature_extractor(observations)
        return self.mlp(features)

# create our policy
obs, action = dataset[0]
rgbd_shape = obs['rgbd'].shape
th.manual_seed(0)
policy = Policy(image_size=rgbd_shape[1:], in_channels=rgbd_shape[0], state_size=obs['state'].shape[0],
                act_dims=action.shape[0], hidden_units=[256, 256, 256])
# move model to gpu if possible
device = "cuda" if th.cuda.is_available() else "cpu"
policy = policy.to(device)
policy.train()
print(policy)

Policy(
  (feature_extractor): NatureCNN(
    (extractors): ModuleDict(
      (rgbd): Sequential(
        (0): Sequential(
          (0): Conv2d(8, 32, kernel_size=(8, 8), stride=(4, 4))
          (1): ReLU()
          (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
          (3): ReLU()
          (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
          (5): ReLU()
          (6): Flatten(start_dim=1, end_dim=-1)
        )
        (1): Sequential(
          (0): Linear(in_features=9216, out_features=256, bias=True)
          (1): ReLU()
        )
      )
      (state): Linear(in_features=32, out_features=64, bias=True)
    )
  )
  (mlp): Sequential(
    (0): Linear(in_features=320, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=256, bias=True)
    (3): ReLU()
    (4): Linear(in_features=256, out_features=256, bias=True)
    (5): ReLU()
    (6): Linear(in_features=256, out_features=7, bias=True)
    (7): Tanh()
  )
)
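The `in_features=9216` of the first linear layer in the printout above can be checked by hand with the usual convolution output-size formula. The `conv_out` helper is just for this sketch. It assumes no dilation, which matches the layers defined above.

```python
def conv_out(size, kernel, stride, padding=0):
    # standard output-size formula for a convolution without dilation
    return (size + 2 * padding - kernel) // stride + 1

size = 128
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_out(size, kernel, stride)
    print(size)

n_flatten = 64 * size * size  # 64 output channels in the last conv layer
print(n_flatten)
```

The spatial size shrinks from 128 to 31, 14, and finally 12, so flattening 64 channels of 12x12 gives the 9216 features that the test-tensor trick in the code discovers automatically.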


### 2.4 Setting up Training, Dataloader, and Logging

With a policy and dataset, we can now write some utility functions to perform a training step, load data in batches, and log results to tensorboard.

In [None]:
loss_fn = nn.MSELoss()

# a short save function to save our model
def save_model(policy, path):
    save_data = dict(
        policy=policy.state_dict(),
    )
    th.save(save_data, path)

def train_step(policy, obs, actions, optim, loss_fn):
    optim.zero_grad()
    # move data to appropriate device first
    obs_device = dict()
    for k in obs:
        obs_device[k] = obs[k].to(device)
    actions = actions.to(device)

    pred_actions = policy(obs_device)

    # compute loss and optimize
    loss = loss_fn(actions, pred_actions)
    loss.backward()
    optim.step()
    return loss.item()

def evaluate_policy(env, policy, num_episodes=10):
    policy.eval()
    obs, _ = env.reset()
    successes = []
    i = 0
    while i < num_episodes:
        obs = convert_observation(obs)
        obs_device = dict()
        # note that depth information from the env is already scaled, but rgb is in [0, 255] so we scale rgb only
        obs['rgbd'] = rescale_rgbd(obs['rgbd'], scale_rgb_only=True)
        # unsqueeze adds an extra batch dimension and we permute rgbd since PyTorch expects the channel dimension to be first
        obs_device['rgbd'] = th.from_numpy(obs['rgbd']).float().permute(2,0,1).unsqueeze(0).to(device)
        obs_device['state'] = th.from_numpy(obs['state']).float().unsqueeze(0).to(device)
        with th.no_grad():
            action = policy(obs_device).cpu().numpy()[0]
        obs, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:
            successes.append(info["success"])
            i += 1
            obs, _ = env.reset(seed=i)
    policy.train()
    print(successes)
    return np.mean(successes)

### 2.5 Training and Evaluation

We can now create a optimizer and training loop and begin training. The code below will optimize for `iterations = 8000` number of gradient steps at a learning rate of `1e-3`. These parameters are tuned for training on the LiftCube environment and will train a succesful policy that doesn't overfit too much to the dataset. Training time takes around 5-25 minutes depending on hardware.

Note that this is a simple tutorial with a barebones training setup. It omits a validation dataset, regularization, normalization, and similar refinements; the only check during training is a short rollout evaluation every 10 epochs.

With a trained policy in hand, we can create an evaluation environment to compute the success rate and watch the videos. Behavior cloning as implemented in this tutorial will usually do well enough to at least partially solve the task.
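
One of the omissions mentioned above, a validation set, can be sketched in a few lines. The example below is a minimal illustration on a toy `TensorDataset`, not the tutorial's dataset: `toy_dataset`, `toy_model` and `validation_loss` are hypothetical names, and the tutorial's real batches carry dictionary observations that would also need moving to `device`.

```python
import torch as th
from torch.utils.data import DataLoader, TensorDataset, random_split

# Toy stand-in for the demonstration dataset: 100 (observation, action) pairs.
toy_dataset = TensorDataset(th.randn(100, 4), th.randn(100, 2))

# Hold out 10% of the samples; the seeded generator makes the split reproducible.
train_set, val_set = random_split(
    toy_dataset, [90, 10], generator=th.Generator().manual_seed(0)
)
val_loader = DataLoader(val_set, batch_size=32, shuffle=False)

@th.no_grad()
def validation_loss(model, loader, loss_fn):
    """Average per-sample loss over a held-out loader, without gradient tracking."""
    was_training = model.training
    model.eval()
    total, count = 0.0, 0
    for obs, actions in loader:
        total += loss_fn(model(obs), actions).item() * len(actions)
        count += len(actions)
    model.train(was_training)  # restore the previous train/eval mode
    return total / count

toy_model = th.nn.Linear(4, 2)
val_loss = validation_loss(toy_model, val_loader, th.nn.MSELoss())
print(f"validation loss: {val_loss:.4f}")
```

Splitting by sample, as done here, puts frames from the same demonstration in both sets; splitting by trajectory would give a stricter estimate of generalization.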

In [None]:
!rm -rf logs/rgbd_*

In [None]:
obs_mode = "rgbd"
control_mode = "pd_ee_delta_pose"
env = gym.make(env_id, obs_mode=obs_mode, control_mode=control_mode, render_mode="cameras")
# RecordEpisode wrapper auto records a new video once an episode is completed
env = RecordEpisode(env, output_dir=f"logs/rgbd_{env_id}/videos", save_trajectory=False)
obs, _ = env.reset(seed=42)

In [None]:
iterations = 8000
optim = th.optim.Adam(policy.parameters(), lr=1e-3)  # optimize the whole policy, not only the MLP head
best_epoch_loss = np.inf
pbar = tqdm(dataloader, total=iterations)
ckpt_dir = f"logs/rgbd_{env_id}/ckpts"
Path(ckpt_dir).mkdir(parents=True, exist_ok=True)
epoch = 0
steps = 0
while steps < iterations:
    epoch_loss = 0
    for batch in dataloader:
        steps += 1
        obs, actions = batch
        loss_val = train_step(policy, obs, actions, optim, loss_fn)

        epoch_loss += loss_val
        pbar.set_postfix(dict(loss=loss_val))
        pbar.update(1)

        # periodically save the policy
        if steps % 1000 == 0: save_model(policy, osp.join(ckpt_dir, f"ckpt_{steps}.pt"))
        if steps >= iterations: break

    epoch_loss = epoch_loss / len(dataloader)

    # save a new model if the average MSE loss in an epoch has improved
    if epoch_loss < best_epoch_loss:
        best_epoch_loss = epoch_loss
        save_model(policy, osp.join(ckpt_dir, f"ckpt_best.pt"))

    if epoch % 10 == 0:
        print(f"Evaluating policy at step {steps}...")
        print("Success rate:", evaluate_policy(env, policy))

    epoch += 1
save_model(policy, osp.join(ckpt_dir, f"ckpt_latest.pt"))
print("Training complete. Final evaluation:")
print("Success rate:", evaluate_policy(env, policy))

In [None]:
!wget https://huggingface.co/datasets/haosulab/ManiSkill2/resolve/main/pretrained_models/tutorials/LiftCube-v0_il.rgbd.pd_ee_delta_pose.pt
# map_location lets the checkpoint load on whichever device is in use
policy.load_state_dict(th.load("LiftCube-v0_il.rgbd.pd_ee_delta_pose.pt", map_location=device)["policy"])

In [None]:
from mani_skill2.utils.wrappers import RecordEpisode

obs_mode = "rgbd"
control_mode = "pd_ee_delta_pose"
env = gym.make(env_id, obs_mode=obs_mode, control_mode=control_mode, render_mode="cameras")
# RecordEpisode wrapper auto records a new video once an episode is completed
env = RecordEpisode(env, output_dir=f"logs/rgbd_{env_id}/eval_videos", save_trajectory=False)
obs, _ = env.reset(seed=42)

successes = []
num_episodes = 10
i = 0
pbar = tqdm(total=num_episodes)
while i < num_episodes:
    # convert observation to our desired shape and move to appropriate device
    obs = convert_observation(obs)
    obs_device = dict()
    # note that depth information from the env is already scaled, but rgb is in [0, 255] so we scale rgb only
    obs['rgbd'] = rescale_rgbd(obs['rgbd'], scale_rgb_only=True)
    # unsqueeze adds an extra batch dimension and we permute rgbd since PyTorch expects the channel dimension to be first
    obs_device['rgbd'] = th.from_numpy(obs['rgbd']).float().permute(2,0,1).unsqueeze(0).to(device)
    obs_device['state'] = th.from_numpy(obs['state']).float().unsqueeze(0).to(device)
    with th.no_grad():
        action = policy(obs_device).cpu().numpy()[0]
    obs, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        print(f"Test Episode {i}: {info['success']}")
        successes.append(info['success'])
        obs, _ = env.reset()
        i += 1
        pbar.update(1)
print("Success Rate:", np.mean(successes))
print(successes)

In [None]:
from IPython.display import Video
Video(f"logs/rgbd_{env_id}/videos/0.mp4", embed=True) # Watch one of the replays

# 3 Visual IL w/ BC-RNN


In [None]:
env_id = "LiftCube-v0"

In [None]:
# Directly download the converted demonstrations dataset files
import urllib.request
!mkdir -p "demos/v0/rigid_body/LiftCube-v0"
urllib.request.urlretrieve("https://huggingface.co/datasets/haosulab/ManiSkill2/resolve/main/processed_demos/LiftCube-v0.tar.gz", "demos/v0/rigid_body/LiftCube-v0.tar.gz")
!tar -xvzf "demos/v0/rigid_body/LiftCube-v0.tar.gz" -C "demos/v0/rigid_body/"

In [None]:
from mani_skill2.utils.common import flatten_state_dict

# --- Utility Functions ---
def tensor_to_numpy(x):
    if th.is_tensor(x):
        return x.cpu().numpy()
    return x

def convert_observation(observation):
    image_obs = observation["image"]
    rgb = image_obs["base_camera"]["rgb"]
    depth = image_obs["base_camera"]["depth"]
    rgb2 = image_obs["hand_camera"]["rgb"]
    depth2 = image_obs["hand_camera"]["depth"]
    state = np.hstack([
        flatten_state_dict(observation["agent"]),
        flatten_state_dict(observation["extra"]),
    ])
    rgbd = np.concatenate([rgb, depth, rgb2, depth2], axis=-1)
    return dict(rgbd=rgbd, state=state)

def rescale_rgbd(rgbd, scale_rgb_only=False):
    rgb1 = rgbd[..., 0:3] / 255.0
    rgb2 = rgbd[..., 4:7] / 255.0
    depth1 = rgbd[..., 3:4]
    depth2 = rgbd[..., 7:8]
    if not scale_rgb_only:
        depth1 = rgbd[..., 3:4] / (2**10)
        depth2 = rgbd[..., 7:8] / (2**10)
    return np.concatenate([rgb1, depth1, rgb2, depth2], axis=-1)

def load_h5_data(data):
    out = dict()
    for k in data.keys():
        if isinstance(data[k], h5py.Dataset):
            out[k] = data[k][:]
        else:
            out[k] = load_h5_data(data[k])
    return out

In [None]:
class SequentialManiSkill2Dataset(ManiSkill2Dataset):
    def __init__(self, dataset_file: str, seq_len: int = 10, load_count=-1):
        super().__init__(dataset_file, load_count=load_count)
        self.seq_len = seq_len

    def __getitem__(self, idx):
        start = max(0, idx - self.seq_len + 1)
        end = idx + 1

        rgbd_seq = [rescale_rgbd(self.obs_rgbd[i]) for i in range(start, end)]
        state_seq = [self.obs_state[i] for i in range(start, end)]
        action_seq = [self.actions[i] for i in range(start, end)]

        # Pad if needed
        pad_len = self.seq_len - len(rgbd_seq)
        if pad_len > 0:
            rgbd_seq = [rgbd_seq[0]] * pad_len + rgbd_seq
            state_seq = [state_seq[0]] * pad_len + state_seq
            action_seq = [action_seq[0]] * pad_len + action_seq

        rgbd_seq = np.stack(rgbd_seq, axis=0)
        state_seq = np.stack(state_seq, axis=0)
        action_seq = np.stack(action_seq, axis=0)

        rgbd_seq = th.from_numpy(rgbd_seq).float().permute(0, 3, 1, 2)
        state_seq = th.from_numpy(state_seq).float()
        action_seq = th.from_numpy(action_seq).float()
        return {"rgbd": rgbd_seq, "state": state_seq}, action_seq

dataset = SequentialManiSkill2Dataset(f"demos/v0/rigid_body/{env_id}/trajectory.rgbd.pd_ee_delta_pose.h5", seq_len=10)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=1)
sample_obs, sample_act = dataset[0]
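
To make the windowing in `__getitem__` concrete, the sketch below reproduces only its index arithmetic on plain integers. `window_indices` is a hypothetical helper written for illustration, not part of the dataset class.

```python
def window_indices(idx, seq_len):
    """Indices of the frames returned for sample `idx`, oldest first."""
    start = max(0, idx - seq_len + 1)
    indices = list(range(start, idx + 1))
    # Left-pad by repeating the earliest available frame.
    pad_len = seq_len - len(indices)
    return [indices[0]] * pad_len + indices

print(window_indices(2, seq_len=5))  # [0, 0, 0, 1, 2]
print(window_indices(7, seq_len=5))  # [3, 4, 5, 6, 7]
```

The window is computed over flat sample indices. If the parent dataset stores all trajectories back to back (an assumption, since its definition is outside this section), a window near the start of one trajectory can include frames from the end of the previous one.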

In [None]:
class RecurrentPolicy(nn.Module):
    def __init__(self, image_size=(128, 128), in_channels=8, state_size=42,
                 hidden_units=256, rnn_hidden_size=256, act_dims=8, rnn_type='LSTM'):
        super().__init__()
        self.feature_extractor = NatureCNN(image_size, in_channels, state_size)
        self.rnn = nn.LSTM(self.feature_extractor.out_features, rnn_hidden_size, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(rnn_hidden_size, hidden_units), nn.ReLU(),
            nn.Linear(hidden_units, act_dims), nn.Tanh()
        )

    def forward(self, observations, hidden_state=None):
        B, T = observations["rgbd"].shape[:2]
        obs_seq = {k: v.view(B * T, *v.shape[2:]) for k, v in observations.items()}
        features = self.feature_extractor(obs_seq).view(B, T, -1)
        rnn_out, hidden = self.rnn(features, hidden_state)
        return self.mlp(rnn_out), hidden

device = th.device("cuda" if th.cuda.is_available() else "cpu")
policy = RecurrentPolicy(
    image_size=sample_obs["rgbd"].shape[2:],
    in_channels=sample_obs["rgbd"].shape[1],
    state_size=sample_obs["state"].shape[-1],
    act_dims=sample_act.shape[-1]
).to(device)
policy.train()

In [None]:
# --- Training, Evaluation, Main ---
def train_step(policy, obs, actions, optim, loss_fn):
    optim.zero_grad()
    obs = {k: v.to(device) for k, v in obs.items()}
    actions = actions.to(device)
    pred_actions, _ = policy(obs)
    loss = loss_fn(pred_actions, actions)
    loss.backward()
    optim.step()
    return loss.item()

def evaluate_policy(env, policy, num_episodes=10):
    policy.eval()
    obs, _ = env.reset()
    hidden = None  # LSTM state; None makes the LSTM start from zeros
    successes = []
    i = 0
    while i < num_episodes:
        obs = convert_observation(obs)
        # depth from the env is already scaled, so only rgb is rescaled
        obs["rgbd"] = rescale_rgbd(obs["rgbd"], scale_rgb_only=True)
        # the two unsqueeze calls add batch and time dimensions of size 1
        obs_device = {
            "rgbd": th.from_numpy(obs["rgbd"]).float().permute(2, 0, 1).unsqueeze(0).unsqueeze(0).to(device),
            "state": th.from_numpy(obs["state"]).float().unsqueeze(0).unsqueeze(0).to(device)
        }
        with th.no_grad():
            # carry the hidden state so the policy keeps its memory within an episode
            action, hidden = policy(obs_device, hidden)
        obs, _, terminated, truncated, info = env.step(action[0, 0].cpu().numpy())
        if terminated or truncated:
            successes.append(info["success"])
            i += 1
            obs, _ = env.reset(seed=i)
            hidden = None  # clear the memory for the next episode
    policy.train()
    print(successes)
    return np.mean(successes)
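
Because the policy returns its LSTM state alongside the actions, a rollout can feed one frame at a time and pass that state back in on the next call. The sketch below uses a stand-alone `nn.LSTM` with made-up sizes (not the tutorial's policy) to check that stepping with a carried state reproduces processing the whole sequence in one call.

```python
import torch as th
import torch.nn as nn

th.manual_seed(0)
lstm = nn.LSTM(input_size=6, hidden_size=8, batch_first=True)
features = th.randn(1, 5, 6)  # (batch, time, feature)

with th.no_grad():
    # Whole sequence in one call, starting from a zero state.
    full_out, _ = lstm(features)

    # Same sequence one frame at a time, carrying the (h, c) state forward.
    hidden = None
    step_outs = []
    for t in range(features.shape[1]):
        out, hidden = lstm(features[:, t:t + 1], hidden)
        step_outs.append(out)
    stepped_out = th.cat(step_outs, dim=1)

print(th.allclose(full_out, stepped_out, atol=1e-5))
```

If the state is dropped between calls instead, every step is treated as the first frame of a new sequence and the recurrent layer contributes no memory.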

In [None]:
!rm -rf logs/bcrnn_rgbd_*

In [None]:
env_id = "LiftCube-v0"
env = gym.make(env_id, obs_mode="rgbd", control_mode="pd_ee_delta_pose", render_mode="cameras")
env = RecordEpisode(env, output_dir=f"logs/bcrnn_rgbd_{env_id}/videos", save_trajectory=False)

In [None]:
optim = th.optim.Adam(policy.parameters(), lr=1e-3)
best_epoch_loss = np.inf
loss_fn = nn.MSELoss()
steps = 0
epoch = 0
iterations = 8000
pbar = tqdm(total=iterations)
ckpt_dir = f"logs/bcrnn_rgbd_{env_id}/ckpts"
Path(ckpt_dir).mkdir(parents=True, exist_ok=True)
while steps < iterations:
    epoch_loss = 0
    for batch in dataloader:
        loss_val = train_step(policy, *batch, optim, loss_fn)
        epoch_loss += loss_val  # accumulate so the per-epoch average below is meaningful
        steps += 1
        pbar.set_postfix(dict(loss=loss_val))
        pbar.update(1)
        if steps % 2000 == 0:
            th.save(policy.state_dict(), osp.join(ckpt_dir, f"ckpt_{steps}.pt"))
        if steps >= iterations:
            break

    epoch_loss = epoch_loss / len(dataloader)

    # save a new model if the average MSE loss in an epoch has improved
    if epoch_loss < best_epoch_loss:
        best_epoch_loss = epoch_loss
        th.save(policy.state_dict(), osp.join(ckpt_dir, f"ckpt_best.pt"))
    if epoch % 4 == 0:
        print(f"Evaluating policy at step {steps}...")
        print("Success rate:", evaluate_policy(env, policy))
    epoch += 1


th.save(policy.state_dict(), osp.join(ckpt_dir, "ckpt_latest.pt"))
print("Training complete. Final evaluation:")
print("Success rate:", evaluate_policy(env, policy))

In [None]:
from IPython.display import Video
Video(f"logs/bcrnn_rgbd_{env_id}/videos/0.mp4", embed=True) # Watch one of the replays

# 4 Visual IL w/ BC-RNN + ResNet

In [None]:
import torchvision.models as models

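class ResNet18Encoder(nn.Module):
    def __init__(self, in_channels=8):
        super().__init__()
        # ImageNet weights; `weights=` replaces the deprecated `pretrained=True` (torchvision >= 0.13)
        resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        # freeze the pretrained layers first
        for param in resnet.parameters():
            param.requires_grad = False
        # the 8-channel input conv is randomly initialized, so it is created after the freeze and stays trainable
        resnet.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
        resnet.fc = nn.Identity()
        self.resnet = resnet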

    def forward(self, x):
        return self.resnet(x)

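# NOTE: this redefines NatureCNN with a ResNet-18 backbone so that RecurrentPolicy picks it up unchanged
class NatureCNN(nn.Module):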
    def __init__(self, image_size=(128, 128), in_channels=8, state_size=42):
        super().__init__()
        self.rgbd_encoder = ResNet18Encoder(in_channels)
        self.state_encoder = nn.Linear(state_size, 64)
        self.out_features = 512 + 64

    def forward(self, observations):
        rgbd_feat = self.rgbd_encoder(observations["rgbd"])
        state_feat = self.state_encoder(observations["state"])
        return th.cat([rgbd_feat, state_feat], dim=-1)
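
Since the encoder above freezes ResNet parameters with `requires_grad = False`, it can help to check how many parameters remain trainable before building the optimizer. The sketch below does this on a toy module; `count_params`, `backbone` and `head` are hypothetical names used only for illustration.

```python
import torch.nn as nn

def count_params(module):
    """Return (trainable, frozen) parameter counts for a module."""
    trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in module.parameters() if not p.requires_grad)
    return trainable, frozen

# Toy stand-in for a frozen backbone plus a trainable head.
backbone = nn.Sequential(nn.Conv2d(8, 4, kernel_size=3), nn.ReLU())
head = nn.Linear(4, 2)
for p in backbone.parameters():
    p.requires_grad = False

model = nn.ModuleDict({"backbone": backbone, "head": head})
print(count_params(model))  # (10, 292)
```

Calling the same helper on `policy.feature_extractor` would show at a glance whether any part of the image encoder is still being trained.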

In [None]:
class RecurrentPolicy(nn.Module):
    def __init__(self, image_size=(128, 128), in_channels=8, state_size=42,
                 hidden_units=256, rnn_hidden_size=256, act_dims=8, rnn_type='LSTM'):
        super().__init__()
        self.feature_extractor = NatureCNN(image_size, in_channels, state_size)
        self.rnn = nn.LSTM(self.feature_extractor.out_features, rnn_hidden_size, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(rnn_hidden_size, hidden_units), nn.ReLU(),
            nn.Linear(hidden_units, act_dims), nn.Tanh()
        )

    def forward(self, observations, hidden_state=None):
        B, T = observations["rgbd"].shape[:2]
        obs_seq = {k: v.view(B * T, *v.shape[2:]) for k, v in observations.items()}
        features = self.feature_extractor(obs_seq).view(B, T, -1)
        rnn_out, hidden = self.rnn(features, hidden_state)
        return self.mlp(rnn_out), hidden

device = th.device("cuda" if th.cuda.is_available() else "cpu")
policy = RecurrentPolicy(
    image_size=sample_obs["rgbd"].shape[2:],
    in_channels=sample_obs["rgbd"].shape[1],
    state_size=sample_obs["state"].shape[-1],
    act_dims=sample_act.shape[-1]
).to(device)
policy.train()

In [None]:
!rm -rf logs/bcrnn_resnet_rgbd_*

In [None]:
env_id = "LiftCube-v0"
env = gym.make(env_id, obs_mode="rgbd", control_mode="pd_ee_delta_pose", render_mode="cameras")
env = RecordEpisode(env, output_dir=f"logs/bcrnn_resnet_rgbd_{env_id}/videos", save_trajectory=False)

In [None]:
optim = th.optim.Adam(policy.parameters(), lr=1e-3)
best_epoch_loss = np.inf
loss_fn = nn.MSELoss()
steps = 0
epoch = 0
iterations = 8000
pbar = tqdm(total=iterations)
ckpt_dir = f"logs/bcrnn_resnet_rgbd_{env_id}/ckpts"
Path(ckpt_dir).mkdir(parents=True, exist_ok=True)
while steps < iterations:
    epoch_loss = 0
    for batch in dataloader:
        loss_val = train_step(policy, *batch, optim, loss_fn)
        epoch_loss += loss_val  # accumulate so the per-epoch average below is meaningful
        steps += 1
        pbar.set_postfix(dict(loss=loss_val))
        pbar.update(1)
        if steps % 2000 == 0:
            th.save(policy.state_dict(), osp.join(ckpt_dir, f"ckpt_{steps}.pt"))
        if steps >= iterations:
            break

    epoch_loss = epoch_loss / len(dataloader)

    # save a new model if the average MSE loss in an epoch has improved
    if epoch_loss < best_epoch_loss:
        best_epoch_loss = epoch_loss
        th.save(policy.state_dict(), osp.join(ckpt_dir, f"ckpt_best.pt"))
    if epoch % 4 == 0:
        print(f"Evaluating policy at step {steps}...")
        print("Success rate:", evaluate_policy(env, policy))
    epoch += 1


th.save(policy.state_dict(), osp.join(ckpt_dir, "ckpt_latest.pt"))
print("Training complete. Final evaluation:")
print("Success rate:", evaluate_policy(env, policy))

In [None]:
!gdown https://drive.google.com/uc?id=117y51RgTpnNMivPmR66AeL82DTPj_33a
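ckpt = th.load("bcrnn_resnet_ckpt_8000.pt", map_location=device)
# the checkpoint may be a bare state_dict (as saved by the loop above) or wrapped under "policy"
policy.load_state_dict(ckpt["policy"] if "policy" in ckpt else ckpt)
print("Success Rate: ", evaluate_policy(env, policy))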

In [None]:
from IPython.display import Video
Video(f"logs/bcrnn_resnet_rgbd_{env_id}/videos/2.mp4", embed=True) # Watch one of the replays