<a href="https://colab.research.google.com/github/leonmkim/lerobot_tutorial/blob/main/lerobot_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Adapted from https://colab.research.google.com/github/TrossenRobotics/aloha_docs/blob/main/docs/files/LeRobot_Notebook.ipynb
# LeRobot Training Instruction Manual

# 🎉 Welcome to the LeRobot Training Notebook!

This guide will help you set up and train a model on a cloud-based platform, such as **Google Colab**, using **LeRobot** with **Hugging Face**.

---

## ⚠️ **Disclaimers:**

- **GPU Subscription**: 🔑 Make sure you have the appropriate subscription plan that provides access to the necessary GPU (e.g., **A100**, **T4**). Review pricing and benefits on the cloud provider's website before proceeding.
  
- **Checkpoint Requirement**: ⏳ If resuming training, ensure that you have the previous training checkpoint available in your session. Without the checkpoint, the training **cannot be resumed**.

---

## 📝 **Important Instructions:**

- **Run All Cells Together**: 🔄 It is recommended to run all the cells in one go if you plan to leave the session **unmonitored**. This helps avoid session timeouts or disruptions.

- **GPU & Compute Units**: 🎛️ Ensure you select a suitable GPU (e.g., **A100**, **T4**) and have enough compute units for your session. A typical 5-hour training session requires approximately **70 compute units**.

- **Monitor Training**: 👀 It’s advisable to monitor the **first few epochs** to ensure that the training is running smoothly before leaving the session unattended.

- **Local Storage**: 💾 You will be prompted to choose whether you want to store the training outputs **locally** at the end of the process.

---

Now, let’s begin the setup process! 🚀


---



## (For Colab) GPU Setup & Compute Units

Make sure the runtime type is set to a GPU (e.g. **A100**, **T4**). In Colab, go to **Runtime > Change runtime type > Hardware accelerator** and select **GPU**.
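
As a quick sanity check before installing anything, the sketch below lists the GPUs visible to the runtime. It assumes an NVIDIA GPU and relies on the `nvidia-smi` command-line tool; the helper name `gpu_names` is illustrative and not part of LeRobot.

```python
import shutil
import subprocess


def gpu_names():
    """Return the GPU names reported by nvidia-smi, or an empty list if none are visible."""
    if shutil.which("nvidia-smi") is None:
        return []  # no NVIDIA driver tools on this machine or runtime
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []  # the tool exists but could not talk to a GPU
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


names = gpu_names()
print("GPUs found:", names if names else "none (switch the runtime type to GPU)")
```

An empty list on Colab usually means the runtime is still set to CPU.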

---



## Installing Dependencies

In this step, we'll install all the necessary dependencies for running LeRobot and performing model training.

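Installing `rerun` requires a kernel restart: the cell below stops the runtime right after the pip installation. Once the runtime has restarted, run the cell again so that the variables it defines are available.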

Ensure that these packages are successfully installed before proceeding to the next steps.

---


In [2]:
import os
import subprocess
import sys
try:
    import google.colab
    IS_COLAB = True
except ImportError:
    IS_COLAB = False
    home_env_var = os.environ.get("HOME", os.path.expanduser("~"))
    ld_library_path_env_var = os.environ.get("LD_LIBRARY_PATH", "")  # may be unset on a local machine

root_dir = "/content" if IS_COLAB else os.path.expanduser("~/lerobot_tutorial")
custom_scripts_dirpath = os.path.join(root_dir, "lerobot_tutorial") if IS_COLAB else root_dir
lerobot_root_dir = os.path.join(root_dir, "lerobot")
train_output_dir = os.path.join(lerobot_root_dir, "outputs", "train")
visualize_dataset_path = os.path.join(root_dir, "lerobot_tutorial", "visualize_dataset.py") if IS_COLAB else os.path.join(root_dir, "visualize_dataset.py")
sys.path.append(custom_scripts_dirpath)

if IS_COLAB:
    print("Installing required dependencies...")

    !pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 # specifically 12.1 as 12.4 has issues locally
    !pip install --upgrade blinker wandb datasets huggingface-hub hydra-core gitpython flask diffusers InquirerPy

    # Install blinker if needed
    !pip install --ignore-installed blinker

    # clone custom visualize dataset from tutorial repo
    if not os.path.exists(visualize_dataset_path):
        !git clone https://github.com/leonmkim/lerobot_tutorial.git
    
    # Install LeRobot repository
    if not os.path.exists(lerobot_root_dir):
        !git clone https://github.com/huggingface/lerobot.git
    %cd {lerobot_root_dir}
    !pip install -e .[pusht]
    
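    # install custom version of gym-aloha that supports varying initialization distribution
    %cd {root_dir}
    # skip the clone when re-running this cell after the runtime restart
    if not os.path.exists(os.path.join(root_dir, "gym-aloha")):
        !git clone https://github.com/leonmkim/gym-aloha.git
    %cd gym-aloha
    !pip install -e .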

    try: 
        import rerun as rr
    except ImportError:
        print("The `rerun` module is missing. Installing now via pip...")
        !pip install --upgrade rerun-sdk[notebook]
        print('Installation completed. The runtime needs to be restarted. Stopping now.')
        os.kill(os.getpid(), 9)

    print("Dependencies installed successfully.")


## Hugging Face Login & Dataset Setup

We will now log in to Hugging Face using your access token. The dataset repo, job name, and output directory for the training session are configured in the steps that follow.

---


In [None]:
# Hugging Face login token
print("Generate a Hugging Face token from: https://huggingface.co/settings/tokens")
hf_token = input("Please enter your Hugging Face token: ")

# Log in to Hugging Face and verify login
print("Logging into Hugging Face...")
!huggingface-cli login --token {hf_token}

# Verify the login by checking the user information
user_info = !huggingface-cli whoami
print(f"Logged in as: {user_info[0]}")


Logging into Hugging Face...
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
The token `repo+collection read+write` has been saved to /home/serialexperimentsleon/.cache/huggingface/stored_tokens
Your token has been saved to /home/serialexperimentsleon/.cache/huggingface/token
Login successful.
The current active token is: `repo+collection read+write`
Logged in as: serialexperimentsleon



## WandB login

Weights & Biases (WandB) is a tool that helps track and visualize various metrics during model training. We will log in now to enable tracking.

---


In [None]:
import wandb
wandb.login()

[34m[1mwandb[0m: Currently logged in as: [33mserialexperimentsleon[0m. Use [1m`wandb login --relogin`[0m to force relogin


True


# ACT/ALOHA Transfer Cube Example

In this example, we will train an [Action Chunking Transformer (ACT) policy](https://tonyzhaozh.github.io/aloha/) on a cube transfer task using the [ALOHA bimanual robot](https://www.trossenrobotics.com/aloha-kits).

## What is ACT?

ACT is a state-of-the-art imitation learning algorithm that leverages the transformer architecture to generate actions given sensor observations (e.g. joint encoders, camera images, etc.). Unlike more naive imitation learning methods, ACT predicts action sequences multiple timesteps into the future (coined "action chunks") at each inference step, with the aim of reducing the effective horizon of a task and thus mitigating compounding error. Furthermore, action chunks are aggregated across multiple inference steps through "temporal ensembling" to smooth out potential discontinuities in the predicted action sequences. Details on the ACT algorithm can be found in the [original paper](https://arxiv.org/abs/2304.13705).

![image](ACT.png "ACT figure")
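
To make temporal ensembling concrete, here is a minimal NumPy sketch. It assumes the exponential weighting scheme `w_i = exp(-m * i)`, with `w_0` assigned to the oldest prediction, as described in the ACT paper; the function name, the decay rate `m`, and the sample numbers are illustrative, and this is not the LeRobot implementation.

```python
import numpy as np


def temporal_ensemble(predictions, m=0.01):
    """Blend several predictions of the action for one timestep.

    predictions: array-like of shape (k, action_dim), ordered oldest first.
    m: decay rate (assumed value); a larger m puts more weight on older predictions.
    """
    predictions = np.asarray(predictions, dtype=float)
    weights = np.exp(-m * np.arange(len(predictions)))  # w_i = exp(-m * i)
    weights /= weights.sum()  # normalize so the weights sum to 1
    return weights @ predictions


# Three overlapping action chunks each predicted an action for the same timestep.
preds = [[0.10, 0.50], [0.12, 0.48], [0.20, 0.40]]
blended = temporal_ensemble(preds, m=0.5)
print(blended)
```

With `m=0` this reduces to a plain average of the overlapping predictions.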

## What is ALOHA?

Where do we get the data for training? ALOHA (A Low-cost Open-source Hardware System for Bimanual Teleoperation) pairs "leader" arms, which a human demonstrator operates directly, with "follower" arms that track the joint angles of the leader arms through PID control. The demonstrations recorded this way serve as training data.
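
The leader/follower idea can be illustrated with a toy single-joint PID loop. This is only a sketch under stated assumptions: the gains, the 50 Hz loop rate, and the velocity-controlled joint model are made up for illustration and do not describe the actual ALOHA controllers.

```python
class PID:
    """Discrete PID controller for a single joint."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def step(self, target, measured):
        error = target - measured
        self.integral += error * self.dt
        # Skip the derivative term on the first call to avoid a startup spike.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Toy follower joint: the controller output is treated as a joint velocity.
dt = 0.02  # 50 Hz control loop (assumed)
pid = PID(kp=5.0, ki=0.0, kd=0.1, dt=dt)
leader_angle, follower_angle = 1.0, 0.0
for _ in range(500):
    velocity_cmd = pid.step(leader_angle, follower_angle)
    follower_angle += velocity_cmd * dt

print(f"tracking error: {abs(leader_angle - follower_angle):.4f}")
```

The follower angle converges to the leader angle; on the real robot the same loop runs per joint while the human moves the leader arms.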

<!-- link video -->

<iframe width="700" height="394" src="https://www.youtube.com/embed/PHXQFE-Rteo?si=QOFaoEHHAez-ByqK" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

## Transfer Cube Task
In this simple simulated example, we will train on pre-recorded demonstrations of transferring a red cube from one arm to the other. The red cube is randomly initialized in the workspace and the task is considered successful if 1) the cube is grasped off the table by the right arm and 2) the cube is then grasped by the left arm.

<!-- link video -->
<video src="ALOHA_transfer_cube_success.mp4" width="640" height="480" controls></video>

---



## Configure Settings

---


In [8]:
# Dataset and job details
dataset_repo_id = "lerobot/aloha_sim_transfer_cube_human"
task_id = "AlohaTransferCube-v0"
training_offline_steps = 2000 # short training in class 
training_eval_freq = 1000
training_save_freq = 1000
training_log_freq = 250
# training_offline_steps = 200000
# training_eval_freq = 20000
# training_save_freq = 20000
eval_n_episodes = 50
eval_batch_size = 50

print("**Important**: Use a valid directory/jobs name. Avoid numbers or special characters other than '_'.")
print("Example: 'training_results_aloha' or 'aloha_training_output'")

job_name = "train_aloha_sim_transfer_cube_human_200k"

# Output directory with naming format instructions

output_dir = job_name

# Resume flag with disclaimer
# resume_flag = input("Do you want to resume training from a previous checkpoint? (yes/no): ")
resume_flag = "no"
resume_cmd = "--resume" if resume_flag.lower() == 'yes' else ""

# Model upload flag
# upload_choice = input("Do you want to upload the model to Hugging Face after training? (yes/no): ")
upload_choice = "yes"
model_repo_id = ""
if upload_choice.lower() == 'yes':
    # model_repo_id = input("Please enter the model repo ID to store your trained model to Hugging Face (e.g., TrossenRoboticsCommunity/aloha_static_logo_assembly): ")
    model_repo_id = job_name

# Local storage flag
# store_locally = input("Do you want to store the training outputs locally? (yes/no): ")
store_locally = "yes"


**Important**: Use a valid directory/jobs name. Avoid numbers or special characters other than '_'.
Example: 'training_results_aloha' or 'aloha_training_output'
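
To see what the frequency settings above imply, the sketch below lists the steps at which evaluation, checkpointing, and logging would fire. It assumes each one triggers whenever the step count is a multiple of its frequency; the training script may differ in details (for example at step 0 or at the final step), and `trigger_steps` is a hypothetical helper.

```python
def trigger_steps(total_steps, freq):
    """Steps in 1..total_steps that are multiples of freq (empty if freq <= 0)."""
    if freq <= 0:
        return []
    return list(range(freq, total_steps + 1, freq))


short_run_steps = 2000  # the in-class setting
print("eval at:", trigger_steps(short_run_steps, 1000))
print("save at:", trigger_steps(short_run_steps, 1000))
print("number of log lines:", len(trigger_steps(short_run_steps, 250)))
```

Under this assumption the short run evaluates and saves twice, while the commented-out 200k-step settings would do so ten times.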



## Visualize Dataset

Let's visualize the demonstrations to get a better understanding of the data we will be training on. The dataset consists of 50 human demonstrations of the cube transfer task where the red cube has been randomly initialized in the workspace. 

We will make use of "[rerun](https://rerun.io/)", a logging and visualization toolkit for multimodal (e.g. video, audio, text) data.

---


In [None]:
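rerun_mode = "notebook" if IS_COLAB else "local"

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
from visualize_dataset import visualize_dataset

dataset = LeRobotDataset(dataset_repo_id, root=None, local_files_only=False)
visualize_dataset(
    dataset=dataset,
    episode_index=0, # change this to visualize a different episode
    mode=rerun_mode,
)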

## Model Training or Resumption

Now that everything is set up, we can either begin training the model or resume training from the last checkpoint, depending on your input.

If resuming, make sure the checkpoint is available in your session. The training will continue from the last checkpoint if found.

> **⚠️ Important: GPU Usage**
>
> By default, the training is configured to use a **GPU** for faster computation. If the runtime does not have access to a GPU, the training will fail.
>
> To avoid this issue:
>
> - **Ensure GPU is enabled** in your Colab runtime. You can check this by navigating to **Runtime > Change runtime type > Hardware accelerator** and selecting **GPU**.
> - If you prefer to use a **CPU** instead, update the `device` argument to `device=cpu` in the training command in the next cell.

---

In [9]:
# Start or resume training depending on user choice
%cd {lerobot_root_dir}
if resume_flag.lower() == "no":
    print(f"Starting new training on {dataset_repo_id}...")
    # for sim
    !python lerobot/scripts/train.py \
        dataset_repo_id={dataset_repo_id} \
        env=aloha \
        env.task={task_id} \
        policy=act \
        training.eval_freq={training_eval_freq} \
        training.log_freq={training_log_freq} \
        training.offline_steps={training_offline_steps} \
        training.save_freq={training_save_freq} \
        eval.n_episodes={eval_n_episodes} \
        eval.batch_size={eval_batch_size} \
        hydra.run.dir=outputs/train/{output_dir} \
        hydra.job.name={job_name} \
        device=cuda wandb.enable=true
else:
    print(f"Resuming training from {output_dir}... (ensure checkpoint is available)")
    !python lerobot/scripts/train.py hydra.run.dir=outputs/train/{output_dir} resume=true

# Log legend: step: optimization steps, smpl: num samples seen, ep: num episodes seen, epch: num epochs seen, Sigma rwrd: return, success: success rate, eval_s: evaluation time in seconds


/home/serialexperimentsleon/lerobot_tutorial/lerobot
Starting new training on lerobot/aloha_sim_transfer_cube_human...
INFO 2025-01-28 18:15:47 ts/train.py:244 {'dataset_repo_id': 'lerobot/aloha_sim_transfer_cube_human',
 'device': 'cuda',
 'env': {'action_dim': 14,
         'episode_length': 400,
         'fps': '${fps}',
         'gym': {'obs_type': 'pixels_agent_pos', 'render_mode': 'rgb_array'},
         'name': 'aloha',
         'state_dim': 14,
         'task': 'AlohaTransferCube-v0'},
 'eval': {'batch_size': 50, 'n_episodes': 50, 'use_async_envs': False},
 'fps': 50,
 'override_dataset_stats': {'observation.images.top': {'mean': [[[0.485]],
                                                                [[0.456]],
                                                                [[0.406]]],
                                                       'std': [[[0.229]],
                                                               [[0.224]],
                                             


## Eval policy

Now let's evaluate the trained policy on the cube transfer task.

---


In [None]:
%cd {lerobot_root_dir}
!python lerobot/scripts/eval.py -p outputs/train/{output_dir}/checkpoints/last/pretrained_model


## Uploading the Model (Recommended)

Once the model is trained, you can choose to upload it to Hugging Face for safekeeping. This is **highly recommended** if you are running long sessions or training a valuable model.

Uploading the model will help protect against potential session interruptions or failures.

---


In [None]:
%cd {lerobot_root_dir}

print(model_repo_id)
# Model upload step if chosen
if upload_choice.lower() == "yes":
    print("Uploading the model to Hugging Face...")
    !huggingface-cli upload {model_repo_id}  outputs/train/{output_dir}/checkpoints/last/pretrained_model
    print("Model uploaded to Hugging Face successfully.")
else:
    print("Model upload skipped.")



/home/serialexperimentsleon/lerobot_tutorial/lerobot
train_aloha_sim_transfer_cube_human_200k
Uploading the model to Hugging Face...
Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
Start hashing 4 files.
Finished hashing 4 files.
model.safetensors: 100%|█████████████████████| 207M/207M [00:03<00:00, 59.2MB/s]
https://huggingface.co/serialexperimentsleon/train_aloha_sim_transfer_cube_human_200k/tree/main/.
Model uploaded to Hugging Face successfully.



## Eval policy: Skip to pre-trained results

---


In [None]:
# view training results for 200k steps
%wandb serialexperimentsleon/lerobot/runs/onwbxyfw

In [None]:
# skip to pretrained model
%cd {lerobot_root_dir}
!python lerobot/scripts/eval.py -p lerobot/act_aloha_sim_transfer_cube_human

## Limitations of Imitation Learning: Distribution Shift

With enough demonstrations, imitation learning can be a powerful tool for learning complex behaviors. However, a learned policy is only reliable in situations similar to those covered by its demonstrations. In this simple example, we will see what happens when we test the trained policy on a cube that is initialized outside the nominal initialization range (described below). What do you think will happen in this case? How would you address this?

![image](ALOHA_transfer_cube_init.png "ALOHA_transfer_cube_init")

```python
# nominal cube initialization range
x_range = [0.0, 0.2]
y_range = [0.4, 0.6]
z_range = [0.05, 0.05]

'''
      ^
      |
      y
      |
Where o---x--->
'''
```
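
A small sketch for checking whether a test range is actually out of distribution: it computes the overlap between the nominal range and a test range along one axis. The helper `overlap` is hypothetical; the test range `[0.6, 0.7]` is the y range used in the evaluation cell that follows.

```python
def overlap(a, b):
    """Return the positive-length overlap of intervals a and b, or None if there is none."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None


nominal_y = (0.4, 0.6)
test_y = (0.6, 0.7)  # range used in the out-of-distribution evaluation
print(overlap(nominal_y, test_y))      # None: the ranges only touch at y = 0.6
print(overlap(nominal_y, (0.5, 0.7)))  # (0.5, 0.6)
```

A result of `None` means every sampled cube position lies outside what the demonstrations covered along that axis.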

            

In [None]:
%cd {lerobot_root_dir}
xrange="'[0.0,0.2]'"
# yrange="'[0.4,0.6]'"
yrange="'[0.6,0.7]'" # disjoint from the nominal range
zrange="'[0.05,0.05]'"
!python lerobot/scripts/eval.py -p lerobot/act_aloha_sim_transfer_cube_human +env.gym.cube_init_xrange={xrange} +env.gym.cube_init_yrange={yrange} +env.gym.cube_init_zrange={zrange}



## Safeguarding Session Data and Local Storage

To prevent data loss in case of session termination, you can zip the output directory and download it locally. If you selected local storage, the outputs will be saved to your local machine.

Make sure to run this step to save all training outputs before closing your session.

---


In [None]:
# Zip the output directory and download it if local storage is chosen
%cd {train_output_dir}
!ls
if IS_COLAB:
    if store_locally.lower() == "yes":
        print("Zipping outputs for download...")
        !zip -r trained.zip {output_dir}

        # Download the zipped file
        from google.colab import files
        files.download(os.path.join(train_output_dir, "trained.zip"))
    else:
        print("Local storage not selected, skipping download.")
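
Outside Colab the `zip` command may not be installed. As an alternative, here is a minimal sketch that archives a run directory with Python's standard library. The helper `archive_run` and the throwaway demo directory are illustrative; in the notebook you would pass `train_output_dir` and `output_dir` instead.

```python
import os
import shutil
import tempfile


def archive_run(train_dir, run_name):
    """Zip train_dir/run_name into train_dir/run_name.zip and return the archive path."""
    base = os.path.join(train_dir, run_name)
    return shutil.make_archive(base, "zip", root_dir=train_dir, base_dir=run_name)


# Demo on a throwaway directory standing in for outputs/train.
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "my_run", "checkpoints"))
with open(os.path.join(demo_root, "my_run", "checkpoints", "note.txt"), "w") as f:
    f.write("placeholder checkpoint file\n")

archive_path = archive_run(demo_root, "my_run")
print(archive_path)
```

The archive keeps the run name as its top-level folder, so unpacking it recreates the same layout as the original output directory.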


# Troubleshooting and Recommendations

1. **GPU Availability**: Ensure the selected GPU is available on your cloud platform (e.g., Colab).
2. **Compute Units**: Ensure you have sufficient compute units. Each 5-hour session requires ~70 units.
3. **Hugging Face Token**: You can generate a token [here](https://huggingface.co/settings/tokens).
4. **Session Safeguards**: Always download your results (output files) to prevent data loss if the session terminates.
5. **Checkpoint Reminder**: If resuming training, ensure that the checkpoint file from the previous session is present in the session.
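
For budgeting, the compute-unit figure above can be turned into a rough per-hour estimate. This sketch simply scales the ~70 units per 5 hours linearly; the real rate depends on the GPU type and the provider's current pricing, and `estimate_compute_units` is a hypothetical helper.

```python
UNITS_PER_HOUR = 70 / 5  # from the ~70 units per 5-hour session figure above


def estimate_compute_units(hours, units_per_hour=UNITS_PER_HOUR):
    """Rough compute-unit cost of a session lasting `hours` hours."""
    return hours * units_per_hour


print(estimate_compute_units(5))    # 70.0
print(estimate_compute_units(1.5))  # 21.0
```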

---



# EXTRA: Diffusion Policy/Push T Example

TODO

---



## Configure Settings
TODO

---


In [None]:
# Collect all necessary inputs from the user

# Dataset and job details
# dataset_repo_id = input("Please enter the dataset repo ID from Hugging Face (e.g., TrossenRoboticsCommunity/aloha_static_logo_assembly): ")
dataset_repo_id = "lerobot/pusht_keypoints"
task_id = "PushT-v0"
training_offline_steps = 200000
# training_offline_steps = 1000
training_eval_freq = 20000
training_save_freq = 20000
training_log_freq = 50
eval_n_episodes = 50
eval_batch_size = 50

print("**Important**: Use a valid directory/jobs name. Avoid numbers or special characters other than '_'.")
print("Example: 'training_results_aloha' or 'aloha_training_output'")

job_name = "train_diffusion_pusht_keypoints"

# Output directory with naming format instructions

output_dir = job_name

# Resume flag with disclaimer
# resume_flag = input("Do you want to resume training from a previous checkpoint? (yes/no): ")
resume_flag = "no"
resume_cmd = "--resume" if resume_flag.lower() == 'yes' else ""

# Model upload flag
# upload_choice = input("Do you want to upload the model to Hugging Face after training? (yes/no): ")
upload_choice = "yes"
model_repo_id = ""
if upload_choice.lower() == 'yes':
    # model_repo_id = input("Please enter the model repo ID to store your trained model to Hugging Face (e.g., TrossenRoboticsCommunity/aloha_static_logo_assembly): ")
    model_repo_id = job_name

# Local storage flag
# store_locally = input("Do you want to store the training outputs locally? (yes/no): ")
store_locally = "yes"

**Important**: Use a valid directory/jobs name. Avoid numbers or special characters other than '_'.
Example: 'training_results_aloha' or 'aloha_training_output'



## Visualize Dataset

TODO

---


In [None]:
rerun_mode = "notebook" if IS_COLAB else "local"

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
from visualize_dataset import visualize_dataset

dataset = LeRobotDataset('lerobot/pusht', root=None, local_files_only=False)
visualize_dataset(
    dataset=dataset,
    episode_index=0,
    mode=rerun_mode,
)

> **⚠️ Important Notice:**
>
> Before you start the training, review the policy configuration file selected by the `policy=` argument of the training command. For this example, `policy=diffusion_pusht_keypoints` corresponds to `diffusion_pusht_keypoints.yaml`, located (on Colab) in:
>
> `/content/lerobot/lerobot/configs/policy/`
>
> This file contains crucial parameters such as `batch_size`, `offline_steps`, and `lr`. Update them based on your training needs. For example, you can modify:
>
> - **Batch Size** (`training.batch_size`): Adjust the number of samples processed in each training step.
> - **Offline Training Steps** (`training.offline_steps`): Define how many steps to run during offline training.
> - **Learning Rate** (`training.lr`): Set the learning rate to control how quickly the model learns.
>
> Once the file is updated, you can proceed with training. Values passed on the command line in the training cell (such as `training.offline_steps`) take precedence over the file.
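
Instead of editing the YAML file, the same parameters can usually be passed as `key=value` overrides, the way the training cells in this notebook already pass `training.offline_steps`. The sketch below builds such override strings; it assumes `training.batch_size` and `training.lr` can be overridden in the same way, and the helper name and the example values are illustrative only.

```python
def build_overrides(params):
    """Turn {"training.lr": 1e-4} style settings into key=value override strings."""
    return [f"{key}={value}" for key, value in params.items()]


overrides = build_overrides({
    "training.batch_size": 64,       # example value, not a recommendation
    "training.offline_steps": 2000,
    "training.lr": 1e-4,             # example value, not a recommendation
})
print(" ".join(overrides))
```

The printed string can be appended to the `python lerobot/scripts/train.py ...` command line.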


## Model Training or Resumption

Now that everything is set up, we can either begin training the model or resume training from the last checkpoint, depending on your input.

If resuming, make sure the checkpoint is available in your session. The training will continue from the last checkpoint if found.

> **⚠️ Important: GPU Usage**
>
> By default, the training is configured to use a **GPU** for faster computation. If the runtime does not have access to a GPU, the training will fail.
>
> To avoid this issue:
>
> - **Ensure GPU is enabled** in your Colab runtime. You can check this by navigating to **Runtime > Change runtime type > Hardware accelerator** and selecting **GPU**.
> - If you prefer to use a **CPU** instead, update the `device` argument to `device=cpu` in the training command in the next cell.

---

In [None]:
# Start or resume training depending on user choice
%cd {lerobot_root_dir}
if resume_flag.lower() == "no":
    print(f"Starting new training on {dataset_repo_id}...")
    # for sim
    !python lerobot/scripts/train.py \
        dataset_repo_id={dataset_repo_id} \
        env=pusht \
        env.task={task_id} \
        env.gym.obs_type=environment_state_agent_pos \
        policy=diffusion_pusht_keypoints \
        training.eval_freq={training_eval_freq} \
        training.log_freq={training_log_freq} \
        training.offline_steps={training_offline_steps} \
        training.save_freq={training_save_freq} \
        eval.n_episodes={eval_n_episodes} \
        eval.batch_size={eval_batch_size} \
        hydra.run.dir=outputs/train/{output_dir} \
        hydra.job.name={job_name} \
        device=cuda wandb.enable=true \
        use_amp=true
else:
    print(f"Resuming training from {output_dir}... (ensure checkpoint is available)")
    !python lerobot/scripts/train.py hydra.run.dir=outputs/train/{output_dir} resume=true

# Log legend: step: optimization steps, smpl: num samples seen, ep: num episodes seen, epch: num epochs seen, Sigma rwrd: return, success: success rate, eval_s: evaluation time in seconds


In [None]:
# view training results for 200k steps
%wandb serialexperimentsleon/lerobot/runs/7j3hkdkq


## Eval policy

---


In [None]:
%cd {lerobot_root_dir}
!python lerobot/scripts/eval.py -p outputs/train/{output_dir}/checkpoints/last/pretrained_model

In [None]:
# skip to pretrained model
%cd {lerobot_root_dir}
!python lerobot/scripts/eval.py -p lerobot/diffusion_pusht_keypoints


## Uploading the Model (Recommended)

Once the model is trained, you can choose to upload it to Hugging Face for safekeeping. This is **highly recommended** if you are running long sessions or training a valuable model.

Uploading the model will help protect against potential session interruptions or failures.

---


In [None]:
%cd {lerobot_root_dir}

print(model_repo_id)
# Model upload step if chosen
if upload_choice.lower() == "yes":
    print("Uploading the model to Hugging Face...")
    !huggingface-cli upload {model_repo_id}  outputs/train/{output_dir}/checkpoints/last/pretrained_model
    print("Model uploaded to Hugging Face successfully.")
else:
    print("Model upload skipped.")

