<a href="https://colab.research.google.com/github/leonmkim/lerobot_tutorial/blob/main/lerobot_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# LeRobot Training Instruction Manual

# 🎉 Welcome to the LeRobot Training Notebook!

This guide will help you set up and train a model on a cloud-based platform, such as **Google Colab**, using **LeRobot** with **Hugging Face**.

---

## ⚠️ **Disclaimers:**

- **GPU Subscription**: 🔑 Make sure you have the appropriate subscription plan that provides access to the necessary GPU (e.g., **A100**, **T4**). Review pricing and benefits on the cloud provider's website before proceeding.
  
- **Checkpoint Requirement**: ⏳ If resuming training, ensure that you have the previous training checkpoint available in your session. Without the checkpoint, the training **cannot be resumed**.

---

## 📝 **Important Instructions:**

- **Run All Cells Together**: 🔄 It is recommended to run all the cells in one go if you plan to leave the session **unmonitored**. This helps avoid session timeouts or disruptions.

- **GPU & Compute Units**: 🎛️ Ensure you select a suitable GPU (e.g., **A100**, **T4**) and have enough compute units for your session. A typical 5-hour training session requires approximately **70 compute units**.

- **Monitor Training**: 👀 It’s advisable to monitor the **first few epochs** to ensure that the training is running smoothly before leaving the session unattended.

- **Local Storage**: 💾 The `store_locally` setting in the configuration cell controls whether the training outputs are zipped and downloaded to your machine at the end of the process.

---
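The compute-unit figure quoted above works out to roughly 70 / 5 = 14 units per hour. The sketch below budgets a session under the assumption that usage scales linearly with time; real consumption depends on the GPU type and on Colab's current pricing. `estimate_compute_units` is a hypothetical helper, not part of LeRobot or Colab.

```python
def estimate_compute_units(hours: float, units_per_5h: float = 70.0) -> float:
    """Linear estimate based on the ~70 units per 5-hour session quoted above."""
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return hours * units_per_5h / 5.0


# Rough budgets for a few planned session lengths.
for planned_hours in (1, 5, 8):
    print(f"{planned_hours} h -> ~{estimate_compute_units(planned_hours):.0f} units")
```

Treat the result as a lower bound and leave some headroom, since an interrupted run that has to be restarted consumes units twice.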

Now, let’s begin the setup process! 🚀


---



## Installing Dependencies

In this step, we'll install all the necessary dependencies for running LeRobot and performing model training.

Ensure that these packages are successfully installed before proceeding to the next steps.

---
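One way to confirm that the installation succeeded is to check that the key modules can be located before moving on. This is a minimal sketch: `missing_modules` is a hypothetical helper, and the import names in `required` are assumptions about how the packages installed by the cell below are exposed in Python.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Assumed import names for the packages installed by the setup cell.
required = ["torch", "wandb", "datasets", "huggingface_hub", "hydra", "lerobot", "gym_aloha"]
missing = missing_modules(required)
print("All dependencies found." if not missing else f"Missing modules: {missing}")
```

Run it after the installation cell has finished. An empty list means every module was found; it does not guarantee that the installed versions are compatible with each other.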


In [8]:
import os
import subprocess
import sys
try:
    import google.colab
    IS_COLAB = True
except ImportError:
    IS_COLAB = False
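    home_env_var = os.environ.get("HOME", os.path.expanduser("~"))
    # LD_LIBRARY_PATH is often unset; default to "" instead of raising KeyError
    ld_library_path_env_var = os.environ.get("LD_LIBRARY_PATH", "")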

root_dir = "/content" if IS_COLAB else os.path.expanduser("~/lerobot_tutorial")
custom_scripts_dirpath = os.path.join(root_dir, "lerobot_tutorial") if IS_COLAB else root_dir
lerobot_root_dir = os.path.join(root_dir, "lerobot")
train_output_dir = os.path.join(lerobot_root_dir, "outputs", "train")
visualize_dataset_path = os.path.join(root_dir, "lerobot_tutorial", "visualize_dataset.py") if IS_COLAB else os.path.join(root_dir, "visualize_dataset.py")
sys.path.append(custom_scripts_dirpath)

if IS_COLAB:
    # Install required dependencies
    print("Installing required dependencies...")

    # !sudo apt-get install libusb-1.0-0-dev
    # !pip install --upgrade  pyrealsense2 dynamixel-sdk rerun-sdk blinker wandb datasets huggingface-hub hydra-core gitpython flask diffusers InquirerPy
    !pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 # specifically 12.1 as 12.4 has issues locally
    !pip install --upgrade rerun-sdk[notebook] blinker wandb datasets huggingface-hub hydra-core gitpython flask diffusers InquirerPy

    # Install blinker if needed
    !pip install --ignore-installed blinker

    # clone custom visualize dataset from tutorial repo
    if not os.path.exists(visualize_dataset_path):
      !git clone https://github.com/leonmkim/lerobot_tutorial.git
    
    # Install LeRobot repository
    if not os.path.exists(lerobot_root_dir):
        !git clone https://github.com/huggingface/lerobot.git
    %cd {lerobot_root_dir}
    # to avoid updates to dataset version v2.0
    # !git reset --hard 96c7052777aca85d4e55dfba8f81586103ba8f61
    !ls
    !pip install -e .[pusht]
    # !pip install .[intelrealsense,dynamixel]
    # !pip install .[aloha,pusht]
    
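    # install custom version of gym-aloha that supports varying initialization distribution
    %cd {root_dir}
    # skip the clone when re-running the cell, mirroring the checks above
    if not os.path.exists(os.path.join(root_dir, "gym-aloha")):
        !git clone https://github.com/leonmkim/gym-aloha.git
    %cd gym-aloha
    !pip install -e .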

    print("Dependencies installed successfully.")
# else:
#     %env LD_LIBRARY_PATH={home_env_var}/.pyenv/versions/lerobot_tutorial/lib64/python3.11/site-packages/nvidia/nvjitlink/lib:{ld_library_path_env_var}
#     # %env LD_LIBRARY_PATH=/home/serialexperimentsleon/.pyenv/versions/lerobot_tutorial/lib64/python3.11/site-packages/nvidia/nvjitlink/lib:$LD_LIBRARY_PATH
#     !echo $LD_LIBRARY_PATH



# PushT Example

TODO

---



## Configure Settings
TODO

---
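The settings cell below asks for a job name made only of letters and underscores. A minimal sketch of that rule as a check follows. `is_valid_job_name` is a hypothetical helper, and the regular expression is one reading of the advice printed by the cell, not a constraint enforced by LeRobot.

```python
import re

# Letters and underscores only, per the naming advice printed by the settings cell.
_JOB_NAME_PATTERN = re.compile(r"[A-Za-z_]+")


def is_valid_job_name(name: str) -> bool:
    """Return True when name contains only ASCII letters and underscores."""
    return _JOB_NAME_PATTERN.fullmatch(name) is not None


for candidate in ("train_diffusion_pusht_keypoints", "run-1", "my job"):
    print(candidate, "->", is_valid_job_name(candidate))
```

Checking the name before training starts is cheaper than discovering a problematic output directory after a long run.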


In [9]:
# Collect all necessary inputs from the user

# GPU selection (Reminder: Ensure enough compute units for smooth training)
print("Please select a suitable GPU type (e.g., A100, T4) for cloud-based training.")

# Hugging Face login token
print("Generate a Hugging Face token from: https://huggingface.co/settings/tokens")
hf_token = input("Please enter your Hugging Face token: ")

# Link to Trossen Robotics Community datasets
print("You can explore datasets from Trossen Robotics Community here: https://huggingface.co/TrossenRoboticsCommunity")

# Dataset and job details
# dataset_repo_id = input("Please enter the dataset repo ID from Hugging Face (e.g., TrossenRoboticsCommunity/aloha_static_logo_assembly): ")
dataset_repo_id = "lerobot/pusht_keypoints"
task_id = "PushT-v0"
training_offline_steps = 200000
# training_offline_steps = 1000
training_eval_freq = 20000
training_save_freq = 20000
training_log_freq = 50
eval_n_episodes = 50
eval_batch_size = 50

print("**Important**: Use a valid directory/jobs name. Avoid numbers or special characters other than '_'.")
print("Example: 'training_results_aloha' or 'aloha_training_output'")

job_name = "train_diffusion_pusht_keypoints"

# Output directory with naming format instructions

output_dir = job_name

# Resume flag with disclaimer
# resume_flag = input("Do you want to resume training from a previous checkpoint? (yes/no): ")
resume_flag = "no"
resume_cmd = "--resume" if resume_flag.lower() == 'yes' else ""

# Model upload flag
# upload_choice = input("Do you want to upload the model to Hugging Face after training? (yes/no): ")
upload_choice = "yes"
model_repo_id = ""
if upload_choice.lower() == 'yes':
    # model_repo_id = input("Please enter the model repo ID to store your trained model to Hugging Face (e.g., TrossenRoboticsCommunity/aloha_static_logo_assembly): ")
    model_repo_id = job_name

# Local storage flag
# store_locally = input("Do you want to store the training outputs locally? (yes/no): ")
store_locally = "yes"

# 

Please select a suitable GPU type (e.g., A100, T4) for cloud-based training.
Generate a Hugging Face token from: https://huggingface.co/settings/tokens
You can explore datasets from Trossen Robotics Community here: https://huggingface.co/TrossenRoboticsCommunity
**Important**: Use a valid directory/jobs name. Avoid numbers or special characters other than '_'.
Example: 'training_results_aloha' or 'aloha_training_output'



## (For Colab) GPU Setup & Compute Units

Training runs on whichever GPU is attached to the Colab runtime, so select it under **Runtime > Change runtime type** before running the cells below. Make sure you have enough compute units to support long training sessions, and monitor the first few epochs to ensure smooth execution.

---
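Before launching a long run, it can help to confirm which device PyTorch will actually see. This is a minimal sketch, assuming PyTorch is installed by the setup cell; `pick_device` is a hypothetical helper that falls back to `"cpu"` when PyTorch or a GPU is unavailable.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed; fall back to CPU.
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Training would use device={device}")
```

The resulting value could be passed as `device={device}` in the training command in place of a hard-coded `device=cuda`, at the cost of much slower training when it resolves to `cpu`.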



## Hugging Face Login & Dataset Setup

We will now log into Hugging Face using the token provided. After login, the dataset repo, job name, and output directory that you specified will be configured for the training session.

---
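Because `input(...)` echoes the token into the notebook output, a safer pattern is to read it from an environment variable or prompt for it without echo. The sketch below assumes the token may be stored in `HF_TOKEN`; `read_hf_token` is a hypothetical helper, not a Hugging Face API.

```python
import os
from getpass import getpass


def read_hf_token(env_var: str = "HF_TOKEN", prompt=getpass) -> str:
    """Return the token from env_var if set, otherwise prompt without echoing it."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        token = prompt("Hugging Face token: ").strip()
    if not token:
        raise ValueError("No Hugging Face token provided")
    return token
```

In the settings cell, `hf_token = read_hf_token()` could stand in for the `input(...)` call. `HF_TOKEN` is also the environment variable that `huggingface_hub` itself reads.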


In [10]:
# Log in to Hugging Face and verify login
print("Logging into Hugging Face...")
!huggingface-cli login --token {hf_token}

# Verify the login by checking the user information
user_info = !huggingface-cli whoami
print(f"Logged in as: {user_info[0]}")


Logging into Hugging Face...
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
The token `repo+collection read+write` has been saved to /home/serialexperimentsleon/.cache/huggingface/stored_tokens
Your token has been saved to /home/serialexperimentsleon/.cache/huggingface/token
Login successful.
The current active token is: `repo+collection read+write`
Logged in as: serialexperimentsleon



## WandB login

TODO

---


In [54]:
import wandb
wandb.login()

wandb: Currently logged in as: serialexperimentsleon. Use `wandb login --relogin` to force relogin


True


## Visualize Dataset

TODO

---


In [6]:
rerun_mode = "notebook" if IS_COLAB else "local"

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
from visualize_dataset import visualize_dataset

dataset = LeRobotDataset('lerobot/pusht', root=None, local_files_only=False)
visualize_dataset(
    dataset=dataset,
    episode_index=0,
    mode=rerun_mode,
)

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

Fetching 418 files:   0%|          | 0/418 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/206 [00:00<?, ?it/s]

[2025-01-28T00:25:09Z INFO  re_sdk_comms::server] Hosting a SDK server over TCP at 0.0.0.0:9876. Connect with the Rerun logging SDK.
[2025-01-28T00:25:09Z INFO  winit::platform_impl::linux::x11::window] Guessed window scale factor: 1
[2025-01-28T00:25:09Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_EXT_swapchain_colorspace
[2025-01-28T00:25:09Z INFO  re_sdk_comms::server] New SDK client connected from: 127.0.0.1:53100
[2025-01-28T00:25:09Z WARN  wgpu_hal::gles::egl] No config found!
[2025-01-28T00:25:09Z WARN  wgpu_hal::gles::egl] EGL says it can present to the window but not natively
[2025-01-28T00:25:09Z WARN  wgpu_hal::gles::adapter] Max vertex attribute stride unknown. Assuming it is 2048
[2025-01-28T00:25:09Z WARN  wgpu_hal::gles::adapter] Max vertex attribute stride unknown. Assuming it is 2048
[2025-01-28T00:25:09Z INFO  egui_wgpu] There were 3 available wgpu adapters: {backend: Vulkan, device_type: DiscreteGpu, name: "NVIDIA GeForce RTX 3060", driver: "NVIDIA

> **⚠️ Important Notice:**
>
> Before you start training, review the policy config that the training command selects. This example trains with `policy=diffusion_pusht_keypoints`, which corresponds to `diffusion_pusht_keypoints.yaml` in the policy config directory (`act_aloha_real.yaml`, in the same directory, applies only when training with `policy=act_aloha_real`):
>
> `/content/lerobot/lerobot/configs/policy/`
>
> The config file contains crucial parameters such as the batch size, the number of offline steps, and the learning rate. Update them to suit your training needs. For example, you can modify:
>
> - **Batch Size** (`training.batch_size`): Adjust the number of samples processed in each training step.
> - **Offline Training Steps** (`training.offline_steps`): Define how many steps to run during offline training.
> - **Learning Rate** (`training.lr`): Set the learning rate to control how quickly the model learns.
>
> Once the file is updated, you can proceed with training. Values passed on the training command line (for example `training.offline_steps`) take precedence over the file.


## Model Training or Resumption

Now that everything is set up, we can either begin training the model or resume training from the last checkpoint, depending on your input.

If resuming, make sure the checkpoint is available in your session. The training will continue from the last checkpoint if found.

> **⚠️ Important: GPU Usage**
>
> By default, the training is configured to use a **GPU** for faster computation. If the runtime does not have access to a GPU, the training will fail.
>
> To avoid this issue:
>
> - **Ensure GPU is enabled** in your Colab runtime. You can check this by navigating to **Runtime > Change runtime type > Hardware accelerator** and selecting **GPU**.
> - If you prefer to use a **CPU** instead, update the `device` argument to `device=cpu` in the training command in the next cell.

---
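The parameters named in the notice above (`training.batch_size`, `training.offline_steps`, `training.lr`) can also be set as `key=value` overrides on the training command, the same way the cell below already passes `training.offline_steps`. The sketch shows one way to assemble such overrides. `build_overrides` is a hypothetical helper and the values are illustrative only.

```python
def build_overrides(params: dict) -> list:
    """Turn {"training.batch_size": 64} into ["training.batch_size=64"]."""
    overrides = []
    for key, value in params.items():
        if isinstance(value, bool):
            # Match the lowercase booleans used in the training command.
            value = str(value).lower()
        overrides.append(f"{key}={value}")
    return overrides


# Example values only; tune them for your own run.
extra_overrides = " ".join(build_overrides({
    "training.batch_size": 64,
    "training.lr": 1e-4,
    "use_amp": True,
}))
print(extra_overrides)
```

Appending `{extra_overrides}` to the `!python lerobot/scripts/train.py ...` line would apply these values without editing the YAML file, which keeps each run's settings visible in the notebook.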

In [25]:
# Start or resume training depending on user choice
%cd {lerobot_root_dir}
if resume_flag.lower() == "no":
    print(f"Starting new training on {dataset_repo_id}...")
    # for sim
    !python lerobot/scripts/train.py \
        dataset_repo_id={dataset_repo_id} \
        env=pusht \
        env.task={task_id} \
        env.gym.obs_type=environment_state_agent_pos \
        policy=diffusion_pusht_keypoints \
        training.eval_freq={training_eval_freq} \
        training.log_freq={training_log_freq} \
        training.offline_steps={training_offline_steps} \
        training.save_freq={training_save_freq} \
        eval.n_episodes={eval_n_episodes} \
        eval.batch_size={eval_batch_size} \
        hydra.run.dir=outputs/train/{output_dir} \
        hydra.job.name={job_name} \
        device=cuda wandb.enable=true \
        use_amp=true
else:
    print(f"Resuming training from {output_dir}... (ensure checkpoint is available)")
    # point at the same run directory that the fresh-training branch writes to
    !python lerobot/scripts/train.py hydra.run.dir=outputs/train/{output_dir} resume=true

# step: optimization steps, smpl: num samples seen, ep: num episodes seen, epch: num epochs seen, Sigma rwrd: return, success: success rate, eval_s: 


/home/serialexperimentsleon/lerobot_tutorial/lerobot
Starting new training on lerobot/pusht_keypoints...




INFO 2025-01-26 21:06:26 ts/train.py:244 {'dataset_repo_id': 'lerobot/pusht_keypoints',
 'device': 'cuda',
 'env': {'action_dim': 2,
         'episode_length': 300,
         'fps': '${fps}',
         'gym': {'obs_type': 'environment_state_agent_pos',
                 'render_mode': 'rgb_array',
                 'visualization_height': 384,
                 'visualization_width': 384},
         'image_size': 96,
         'name': 'pusht',
         'state_dim': 2,
         'task': 'PushT-v0'},
 'eval': {'batch_size': 50, 'n_episodes': 50, 'use_async_envs': False},
 'fps': 10,
 'policy': {'beta_end': 0.02,
            'beta_schedule': 'squaredcos_cap_v2',
            'beta_start': 0.0001,
            'clip_sample': True,
            'clip_sample_range': 1.0,
            'crop_is_random': True,
            'crop_shape': [84, 84],
            'diffusion_step_embed_dim': 128,
            'do_mask_loss_for_padding': False,
            'down_dims': [256, 512, 1024],
            'horizon': 16,
 

In [56]:
# view training results for 200k steps
%wandb serialexperimentsleon/lerobot/runs/7j3hkdkq


## Eval policy

---


In [3]:
%cd {lerobot_root_dir}
!python lerobot/scripts/eval.py -p outputs/train/{output_dir}/checkpoints/last/pretrained_model

/home/serialexperimentsleon/lerobot_tutorial/lerobot




outputs/train/train_diffusion_pusht_keypoints/checkpoints/last/pretrained_model
INFO 2025-01-27 12:22:22 on/logger.py:39 Output dir: outputs/eval/2025-01-27/12-22-22_pusht_diffusion
INFO 2025-01-27 12:22:22 pts/eval.py:480 Making environment.
INFO 2025-01-27 12:22:22 pts/eval.py:483 Making policy.
Loading weights from local directory
Stepping through eval batches:   0%|                      | 0/1 [00:00<?, ?it/s]
Running rollout with at most 300 steps:   0%|           | 0/300 [00:00<?, ?it/s]
Running rollout with at most 300 steps:   0%| | 0/300 [00:00<?, ?it/s, running_s
Running rollout with at most 300 steps:   0%| | 1/300 [00:00<03:22,  1.48it/s, r
Running rollout with at most 300 steps:   0%| | 1/300 [00:00<03:22,  1.48it/s, r
Running rollout with at most 300 steps:   1%| | 2/300 [00:00<03:21,  1.48it/s, r
Running rollout with at most 300 steps:   1%| | 3/300 [00:00<01:06,  4.44it/s, r
Running rollout with at most 300 steps:   1%| | 3/300 [00:00<01:06

In [None]:
# skip to pretrained model
%cd {lerobot_root_dir}
!python lerobot/scripts/eval.py -p lerobot/diffusion_pusht_keypoints

Fetching 7 files:   0%|                                   | 0/7 [00:00<?, ?it/s]
README.md: 100%|███████████████████████████| 2.79k/2.79k [00:00<00:00, 23.1MB/s]

replay.mp4: 100%|██████████████████████████| 56.7k/56.7k [00:00<00:00, 11.1MB/s]

eval_info.json:   0%|                               | 0.00/76.2k [00:00<?, ?B/s]

config.json: 100%|█████████████████████████| 1.08k/1.08k [00:00<00:00, 13.9MB/s]


eval_info.json: 100%|██████████████████████| 76.2k/76.2k [00:00<00:00, 5.45MB/s]
config.yaml: 100%|█████████████████████████| 2.66k/2.66k [00:00<00:00, 14.0MB/s]

.gitattributes: 100%|██████████████████████| 1.52k/1.52k [00:00<00:00, 18.3MB/s]
Fetching 7 files:  14%|███▊                       | 1/7 [00:00<00:00,  7.67it/s]
model.safetensors:   0%|                             | 0.00/995M [00:00<?, ?B/s]
model.safetensors:   1%|▏                   | 10.5M/995M [00:00<00:24, 40.6MB/s]
model.safetensors:   2%|▍                   | 21.0M/995M [00:00<00:23, 41


## Uploading the Model (Recommended)

Once the model is trained, you can choose to upload it to Hugging Face for safekeeping. This is **highly recommended** if you are running long sessions or training a valuable model.

Uploading the model will help protect against potential session interruptions or failures.

---
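Before uploading, it may be worth confirming that the checkpoint folder actually contains the model files. The sketch below assumes the folder should hold `config.json` and `model.safetensors`, which are the names visible in the Hub listings in this notebook's outputs; other LeRobot versions may differ. `missing_model_files` is a hypothetical helper.

```python
from pathlib import Path

# File names taken from the Hub listings shown in this notebook's outputs;
# treat the exact set as an assumption for other LeRobot versions.
EXPECTED_FILES = ("config.json", "model.safetensors")


def missing_model_files(pretrained_dir) -> list:
    """Return the expected files that are absent from pretrained_dir."""
    root = Path(pretrained_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


pretrained_dir = (
    Path("outputs/train") / "train_diffusion_pusht_keypoints"
    / "checkpoints" / "last" / "pretrained_model"
)
print(missing_model_files(pretrained_dir) or "Checkpoint looks complete.")
```

The path is relative, so run the check from the LeRobot root directory, as the upload cell does. An empty result only says the files exist, not that training finished.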


In [41]:
%cd {lerobot_root_dir}

print(model_repo_id)
# Model upload step if chosen
if upload_choice.lower() == "yes":
    print("Uploading the model to Hugging Face...")
    !huggingface-cli upload {model_repo_id}  outputs/train/{output_dir}/checkpoints/last/pretrained_model
    print("Model uploaded to Hugging Face successfully.")
else:
    print("Model upload skipped.")



train_diffusion_pusht_keypoints
Uploading the model to Hugging Face...
Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
Start hashing 4 files.
Finished hashing 4 files.
model.safetensors: 100%|█████████████████████| 254M/254M [00:04<00:00, 60.9MB/s]
https://huggingface.co/serialexperimentsleon/train_diffusion_pusht_keypoints/tree/main/.
Model uploaded to Hugging Face successfully.



# ALOHA Example

TODO

---


In [41]:
# Dataset and job details
dataset_repo_id = "lerobot/aloha_sim_transfer_cube_human"
task_id = "AlohaTransferCube-v0"
training_eval_freq = 20000
training_log_freq = 250
training_offline_steps = 100000
training_save_freq = 20000
eval_n_episodes = 50
eval_batch_size = 50

print("**Important**: Use a valid directory/jobs name. Avoid numbers or special characters other than '_'.")
print("Example: 'training_results_aloha' or 'aloha_training_output'")

job_name = "train_aloha_sim_transfer_cube_human"

# Output directory with naming format instructions

output_dir = job_name

# Resume flag with disclaimer
# resume_flag = input("Do you want to resume training from a previous checkpoint? (yes/no): ")
resume_flag = "no"
resume_cmd = "--resume" if resume_flag.lower() == 'yes' else ""

# Model upload flag
# upload_choice = input("Do you want to upload the model to Hugging Face after training? (yes/no): ")
upload_choice = "yes"
model_repo_id = ""
if upload_choice.lower() == 'yes':
    # model_repo_id = input("Please enter the model repo ID to store your trained model to Hugging Face (e.g., TrossenRoboticsCommunity/aloha_static_logo_assembly): ")
    model_repo_id = job_name

# Local storage flag
# store_locally = input("Do you want to store the training outputs locally? (yes/no): ")
store_locally = "yes"

# 

**Important**: Use a valid directory/jobs name. Avoid numbers or special characters other than '_'.
Example: 'training_results_aloha' or 'aloha_training_output'



## Visualize Dataset

TODO

---


In [None]:
dataset = LeRobotDataset(dataset_repo_id, root=None, local_files_only=False)
visualize_dataset(
    dataset=dataset,
    episode_index=0,
    mode=rerun_mode,
)

Fetching 56 files: 100%|██████████████████████| 56/56 [00:00<00:00, 8932.88it/s]
[2025-01-26T01:52:32Z INFO  re_sdk_comms::server] Hosting a SDK server over TCP at 0.0.0.0:9876. Connect with the Rerun logging SDK.
[2025-01-26T01:52:32Z INFO  winit::platform_impl::linux::x11::window] Guessed window scale factor: 1
[2025-01-26T01:52:32Z WARN  wgpu_hal::vulkan::instance] Unable to find extension: VK_EXT_swapchain_colorspace
[2025-01-26T01:52:32Z INFO  re_sdk_comms::server] New SDK client connected from: 127.0.0.1:54078
[2025-01-26T01:52:32Z WARN  wgpu_hal::gles::egl] No config found!
[2025-01-26T01:52:32Z WARN  wgpu_hal::gles::egl] EGL says it can present to the window but not natively
  0%|                     

> **⚠️ Important Notice:**
>
> Before you start training, review the policy config that the training command selects. This example trains with `policy=act`, which corresponds to `act.yaml` in the policy config directory (`act_aloha_real.yaml`, in the same directory, applies only when training with `policy=act_aloha_real`):
>
> `/content/lerobot/lerobot/configs/policy/`
>
> The config file contains crucial parameters such as the batch size, the number of offline steps, and the learning rate. Update them to suit your training needs. For example, you can modify:
>
> - **Batch Size** (`training.batch_size`): Adjust the number of samples processed in each training step.
> - **Offline Training Steps** (`training.offline_steps`): Define how many steps to run during offline training.
> - **Learning Rate** (`training.lr`): Set the learning rate to control how quickly the model learns.
>
> Once the file is updated, you can proceed with training. Values passed on the training command line (for example `training.offline_steps`) take precedence over the file.


## Model Training or Resumption

Now that everything is set up, we can either begin training the model or resume training from the last checkpoint, depending on your input.

If resuming, make sure the checkpoint is available in your session. The training will continue from the last checkpoint if found.

> **⚠️ Important: GPU Usage**
>
> By default, the training is configured to use a **GPU** for faster computation. If the runtime does not have access to a GPU, the training will fail.
>
> To avoid this issue:
>
> - **Ensure GPU is enabled** in your Colab runtime. You can check this by navigating to **Runtime > Change runtime type > Hardware accelerator** and selecting **GPU**.
> - If you prefer to use a **CPU** instead, update the `device` argument to `device=cpu` in the training command in the next cell.

---

In [None]:
# Start or resume training depending on user choice
%cd {lerobot_root_dir}
if resume_flag.lower() == "no":
    print(f"Starting new training on {dataset_repo_id}...")
    # for sim
    !python lerobot/scripts/train.py \
        dataset_repo_id={dataset_repo_id} \
        env=aloha \
        env.task={task_id} \
        policy=act \
        training.eval_freq={training_eval_freq} \
        training.log_freq={training_log_freq} \
        training.offline_steps={training_offline_steps} \
        training.save_freq={training_save_freq} \
        eval.n_episodes={eval_n_episodes} \
        eval.batch_size={eval_batch_size} \
        hydra.run.dir=outputs/train/{output_dir} \
        hydra.job.name={job_name} \
        device=cuda wandb.enable=true
else:
    print(f"Resuming training from {output_dir}... (ensure checkpoint is available)")
    # point at the same run directory that the fresh-training branch writes to
    !python lerobot/scripts/train.py hydra.run.dir=outputs/train/{output_dir} resume=true

# step: optimization steps, smpl: num samples seen, ep: num episodes seen, epch: num epochs seen, Sigma rwrd: return, success: success rate, eval_s: 


Starting new training on lerobot/aloha_sim_transfer_cube_human...
INFO 2025-01-28 12:52:07 ts/train.py:244 {'dataset_repo_id': 'lerobot/aloha_sim_transfer_cube_human',
 'device': 'cuda',
 'env': {'action_dim': 14,
         'episode_length': 400,
         'fps': '${fps}',
         'gym': {'cube_init_xrange': [0.0, 0.2],
                 'cube_init_yrange': [0.2, 0.4],
                 'cube_init_zrange': [0.05, 0.05],
                 'obs_type': 'pixels_agent_pos',
                 'render_mode': 'rgb_array'},
         'name': 'aloha',
         'state_dim': 14,
         'task': 'AlohaTransferCube-v0'},
 'eval': {'batch_size': 50, 'n_episodes': 50, 'use_async_envs': False},
 'fps': 50,
 'override_dataset_stats': {'observation.images.top': {'mean': [[[0.485]],
                                                                [[0.456]],
                                                                [[0.406]]],
                                                       'std': [[[0.229]],
      

In [57]:
# view training results for 100k steps
%wandb serialexperimentsleon/lerobot/runs/1cq1s97q


## Eval policy

---


In [42]:
%cd {lerobot_root_dir}
!python lerobot/scripts/eval.py -p outputs/train/{output_dir}/checkpoints/last/pretrained_model

/home/serialexperimentsleon/lerobot_tutorial/lerobot
[]
outputs/train/train_aloha_sim_transfer_cube_human/checkpoints/last/pretrained_model
INFO 2025-01-28 12:54:20 on/logger.py:39 Output dir: outputs/eval/2025-01-28/12-54-20_aloha_act
INFO 2025-01-28 12:54:20 pts/eval.py:481 Making environment.
INFO 2025-01-28 12:54:20 /__init__.py:88 MUJOCO_GL is not set, so an OpenGL backend will be chosen automatically.
INFO 2025-01-28 12:54:20 /__init__.py:96 Successfully imported OpenGL backend: %s
INFO 2025-01-28 12:54:20 /__init__.py:31 MuJoCo library version is: %s
^C
Traceback (most recent call last):
  File "/home/serialexperimentsleon/lerobot_tutorial/lerobot/lerobot/scripts/eval.py", line 582, in <module>
    main(
  File "/home/serialexperimentsleon/lerobot_tutorial/lerobot/lerobot/scripts/eval.py", line 482, in main
    env = make_env(hydra_cfg)
          ^^^^^^^^^^^^^^^^^^^
  File "/home/serialexperimentsleon/lerobot_tutorial/lerobot/lerobot/common/envs/factory.py", line 51

In [14]:
# skip to pretrained model
%cd {lerobot_root_dir}
!python lerobot/scripts/eval.py -p lerobot/act_aloha_sim_transfer_cube_human

/home/serialexperimentsleon/lerobot_tutorial/lerobot
Fetching 10 files:   0%|                                 | 0/10 [00:00<?, ?it/s]
eval_pc_success.csv: 100%|█████████████████████| 112/112 [00:00<00:00, 1.30MB/s]

.gitattributes: 100%|██████████████████████| 1.56k/1.56k [00:00<00:00, 23.1MB/s]

eval_info.json:   0%|                               | 0.00/66.8k [00:00<?, ?B/s]

README.md: 100%|███████████████████████████| 3.63k/3.63k [00:00<00:00, 38.8MB/s]


config.yaml:   0%|                                  | 0.00/2.83k [00:00<?, ?B/s]


config.yaml: 100%|█████████████████████████| 2.83k/2.83k [00:00<00:00, 10.3MB/s]
config.json: 100%|█████████████████████████████| 903/903 [00:00<00:00, 5.05MB/s]
eval_info.json: 100%|██████████████████████| 66.8k/66.8k [00:00<00:00, 11.5MB/s]

demo.gif:   0%|                                     | 0.00/5.83M [00:00<?, ?B/s]

model.safetensors:   0%|                             | 0.00/207M [00:00<?, ?B/s]


traini

## Sensitivity to distribution shift
![image](ALOHA_transfer_cube_init.png "ALOHA_transfer_cube_init")

```python
x_range = [0.0, 0.2]
y_range = [0.4, 0.6]
z_range = [0.05, 0.05]

# Axis convention, with the origin at o:
#       ^
#       |
#       y
#       |
#       o---x--->
```
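To make the shift concrete, the sketch below draws a cube start position from a set of ranges and measures how much of an evaluation range falls inside a reference range. It assumes positions are sampled uniformly and that the ranges in the block above are the ones the demonstrations were collected with. Both helper functions are hypothetical and are not part of gym-aloha.

```python
import random


def sample_cube_position(x_range, y_range, z_range, rng=random):
    """Draw one (x, y, z) start position uniformly from the given ranges."""
    return tuple(rng.uniform(low, high) for low, high in (x_range, y_range, z_range))


def overlap_fraction(eval_range, reference_range):
    """Fraction of eval_range that lies inside reference_range (1.0 = fully inside)."""
    low = max(eval_range[0], reference_range[0])
    high = min(eval_range[1], reference_range[1])
    width = eval_range[1] - eval_range[0]
    if width == 0:
        # Degenerate range: a single value is either inside or outside.
        return 1.0 if reference_range[0] <= eval_range[0] <= reference_range[1] else 0.0
    return max(0.0, high - low) / width


reference_y = [0.4, 0.6]  # y_range from the block above
shifted_y = [0.3, 0.4]    # y range used in the evaluation cell below
print(sample_cube_position([0.0, 0.2], shifted_y, [0.05, 0.05], random.Random(0)))
print("overlap with reference y range:", overlap_fraction(shifted_y, reference_y))
```

An overlap of zero means every evaluation episode starts the cube outside the reference range, which is the regime where a drop in success rate would be expected.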

            

In [None]:
%cd {lerobot_root_dir}
xrange="'[0.0,0.2]'"
# yrange="'[0.4,0.6]'"
yrange="'[0.3,0.4]'"
zrange="'[0.05,0.05]'"
!python lerobot/scripts/eval.py -p lerobot/act_aloha_sim_transfer_cube_human +env.gym.cube_init_xrange={xrange} +env.gym.cube_init_yrange={yrange} +env.gym.cube_init_zrange={zrange}


/home/serialexperimentsleon/lerobot_tutorial/lerobot
Fetching 10 files: 100%|████████████████████| 10/10 [00:00<00:00, 133576.56it/s]
['+env.gym.cube_init_xrange=[0.0,0.2]', '+env.gym.cube_init_yrange=[0.4,0.6]', '+env.gym.cube_init_zrange=[0.05,0.05]']
/home/serialexperimentsleon/.cache/huggingface/hub/models--lerobot--act_aloha_sim_transfer_cube_human/snapshots/bf09d3a9c6fd382bbcae1167297f4ec88a38f650
INFO 2025-01-28 13:22:35 on/logger.py:39 Output dir: outputs/eval/2025-01-28/13-22-35_aloha_act
INFO 2025-01-28 13:22:35 pts/eval.py:481 Making environment.
INFO 2025-01-28 13:22:35 /__init__.py:88 MUJOCO_GL is not set, so an OpenGL backend will be chosen automatically.
INFO 2025-01-28 13:22:35 /__init__.py:96 Successfully imported OpenGL backend: %s
INFO 2025-01-28 13:22:35 /__init__.py:31 MuJoCo library version is: %s
INFO 2025-01-28 13:22:38 pts/eval.py:484 Making policy.
Loading weights from local directory
Stepping through eval batches:   0%|                      | 0/1


## Safeguarding Session Data and Local Storage

To prevent data loss in case of session termination, you can zip the output directory and download it locally. If you selected local storage, the outputs will be saved to your local machine.

Make sure to run this step to save all training outputs before closing your session.

---
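Outside Colab, or when the `zip` command is unavailable, the same archive can be produced with the Python standard library. This is a minimal sketch: `archive_run` is a hypothetical helper, and the default archive name `trained` mirrors the file name used in the cell below.

```python
import shutil
from pathlib import Path


def archive_run(train_output_dir, run_name, archive_name="trained"):
    """Zip train_output_dir/run_name into train_output_dir/<archive_name>.zip."""
    train_output_dir = Path(train_output_dir)
    if not (train_output_dir / run_name).is_dir():
        raise FileNotFoundError(f"No run directory at {train_output_dir / run_name}")
    archive_path = shutil.make_archive(
        base_name=str(train_output_dir / archive_name),
        format="zip",
        root_dir=str(train_output_dir),  # paths inside the zip are relative to this
        base_dir=run_name,               # only this run is included
    )
    return Path(archive_path)
```

With the notebook's variables, the call would be `archive_run(train_output_dir, output_dir)`. The archive is written next to the run directory rather than inside it, so it does not end up containing itself.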


In [48]:
# Zip the output directory and download it if local storage is chosen
%cd {train_output_dir}
!ls
if IS_COLAB:
    if store_locally.lower() == "yes":
        print("Zipping outputs for download...")
        !zip -r trained.zip {output_dir}

        # Download the zipped file
        from google.colab import files
        files.download('/content/lerobot/outputs/train/trained.zip')
    else:
        print("Local storage not selected, skipping download.")

[Errno 2] No such file or directory: 'outputs/train/'
/home/serialexperimentsleon/lerobot_tutorial/lerobot/outputs/train
aloha_train  train_aloha_sim_transfer_cube_human




Zipping outputs for download...
  adding: train_aloha_sim_transfer_cube_human/ (stored 0%)
  adding: train_aloha_sim_transfer_cube_human/wandb/ (stored 0%)
  adding: train_aloha_sim_transfer_cube_human/wandb/latest-run/ (stored 0%)
  adding: train_aloha_sim_transfer_cube_human/wandb/latest-run/logs/ (stored 0%)
  adding: train_aloha_sim_transfer_cube_human/wandb/latest-run/logs/debug.log (deflated 67%)
  adding: train_aloha_sim_transfer_cube_human/wandb/latest-run/logs/debug-internal.log (deflated 74%)
  adding: train_aloha_sim_transfer_cube_human/wandb/latest-run/logs/debug-core.log (deflated 70%)
  adding: train_aloha_sim_transfer_cube_human/wandb/latest-run/tmp/ (stored 0%)
  adding: train_aloha_sim_transfer_cube_human/wandb/latest-run/tmp/code/ (stored 0%)
  adding: train_aloha_sim_transfer_cube_human/wandb/latest-run/run-1cq1s97q.wandb (deflated 77%)
  adding: train_aloha_sim_transfer_cube_human/wandb/latest-run/files/ (stored 0%)
  adding: train_aloha_sim_transfer_cube_human/wand


# Troubleshooting and Recommendations

1. **GPU Availability**: Ensure the selected GPU is available on your cloud platform (e.g., Colab).
2. **Compute Units**: Ensure you have sufficient compute units. Each 5-hour session requires ~70 units.
3. **Hugging Face Token**: You can generate a token [here](https://huggingface.co/settings/tokens).
4. **Session Safeguards**: Always download your results (output files) to prevent data loss if the session terminates.
5. **Checkpoint Reminder**: If resuming training, ensure that the checkpoint file from the previous session is present in the session.

---
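For the checkpoint reminder above, a quick existence check can decide whether resuming is possible at all. The sketch assumes checkpoints live under `checkpoints/last` inside the run directory, as in the evaluation paths used earlier in this notebook. `find_last_checkpoint` is a hypothetical helper.

```python
from pathlib import Path


def find_last_checkpoint(run_dir):
    """Return run_dir/checkpoints/last if it exists, otherwise None."""
    candidate = Path(run_dir) / "checkpoints" / "last"
    return candidate if candidate.exists() else None


run_dir = Path("outputs/train") / "train_diffusion_pusht_keypoints"
checkpoint = find_last_checkpoint(run_dir)
resume_flag = "yes" if checkpoint is not None else "no"
print(f"checkpoint={checkpoint}, resume_flag={resume_flag}")
```

Setting `resume_flag` this way avoids launching a resume command that would fail immediately because the previous session's files were never restored.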
