<a href="https://colab.research.google.com/github/leonmkim/lerobot_tutorial/blob/main/lerobot_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# LeRobot Training Instruction Manual

# 🎉 Welcome to the LeRobot Training Notebook!

This guide will help you set up and train a model on a cloud-based platform, such as **Google Colab**, using **LeRobot** with **Hugging Face**.

---

## ⚠️ **Disclaimers:**

- **GPU Subscription**: 🔑 Make sure you have the appropriate subscription plan that provides access to the necessary GPU (e.g., **A100**, **T4**). Review pricing and benefits on the cloud provider's website before proceeding.
  
- **Checkpoint Requirement**: ⏳ If resuming training, ensure that you have the previous training checkpoint available in your session. Without the checkpoint, the training **cannot be resumed**.

---

## 📝 **Important Instructions:**

- **Run All Cells Together**: 🔄 It is recommended to run all the cells in one go if you plan to leave the session **unmonitored**. This helps avoid session timeouts or disruptions.

- **GPU & Compute Units**: 🎛️ Ensure you select a suitable GPU (e.g., **A100**, **T4**) and have enough compute units for your session. A typical 5-hour training session requires approximately **70 compute units**.

- **Monitor Training**: 👀 It’s advisable to monitor the **first few epochs** to ensure that the training is running smoothly before leaving the session unattended.

- **Local Storage**: 💾 You will be prompted to choose whether you want to store the training outputs **locally** at the end of the process.

---

Now, let’s begin the setup process! 🚀


---



## Installing Dependencies

In this step, we'll install all the necessary dependencies for running LeRobot and performing model training.

Ensure that these packages are successfully installed before proceeding to the next steps.

---


In [6]:
try:
    import google.colab
    IS_COLAB = True
except ImportError:
    IS_COLAB = False

if IS_COLAB:
    # Install required dependencies
    print("Installing required dependencies...")

    # !sudo apt-get install libusb-1.0-0-dev
    # !pip install --upgrade  pyrealsense2 dynamixel-sdk rerun-sdk blinker wandb datasets huggingface-hub hydra-core gitpython flask diffusers InquirerPy
    !pip install --upgrade rerun-sdk[notebook] blinker wandb datasets huggingface-hub hydra-core gitpython flask diffusers InquirerPy

    # Install blinker if needed
    !pip install --ignore-installed blinker


    # Install LeRobot repository
    !git clone https://github.com/huggingface/lerobot.git
    %cd /content/lerobot
    # to avoid updates to dataset version v2.0
    # !git reset --hard 96c7052777aca85d4e55dfba8f81586103ba8f61
    !ls
    !pip install -e .
    # !pip install .[intelrealsense,dynamixel]
    !pip install .[aloha,pusht]

    print("Dependencies installed successfully.")
else:
    %env LD_LIBRARY_PATH=$HOME/.pyenv/versions/lerobot_tutorial/lib64/python3.11/site-packages/nvidia/nvjitlink/lib:$LD_LIBRARY_PATH
    !echo $LD_LIBRARY_PATH


env: LD_LIBRARY_PATH=$HOME/.pyenv/versions/lerobot_tutorial/lib64/python3.11/site-packages/nvidia/nvjitlink/lib:$LD_LIBRARY_PATH
$HOME/.pyenv/versions/lerobot_tutorial/lib64/python3.11/site-packages/nvidia/nvjitlink/lib:$LD_LIBRARY_PATH



# PushT Example

TODO

---



## Configure Settings
TODO
---


In [8]:
# Collect all necessary inputs from the user
import os
import subprocess

root_dir = "/contact" if IS_COLAB else os.path.expanduser("~/lerobot_tutorial")
lerobot_root_dir = "/content/lerobot" if IS_COLAB else os.path.expanduser("~/lerobot_tutorial/lerobot")
train_output_dir = os.path.join(lerobot_root_dir, "outputs", "train")
# GPU selection (Reminder: Ensure enough compute units for smooth training)
print("Please select a suitable GPU type (e.g., A100, T4) for cloud-based training.")

# Hugging Face login token
print("Generate a Hugging Face token from: https://huggingface.co/settings/tokens")
hf_token = input("Please enter your Hugging Face token: ")

# Link to Trossen Robotics Community datasets
print("You can explore datasets from Trossen Robotics Community here: https://huggingface.co/TrossenRoboticsCommunity")

# Dataset and job details
# dataset_repo_id = input("Please enter the dataset repo ID from Hugging Face (e.g., TrossenRoboticsCommunity/aloha_static_logo_assembly): ")
dataset_repo_id = "lerobot/pusht_keypoints"
task_id = "PushT-v0"
# training_offline_steps = 200000
training_offline_steps = 1000
training_eval_freq = 20000
training_save_freq = 20000
training_log_freq = 50
eval_n_episodes = 50
eval_batch_size = 50

print("**Important**: Use a valid directory/jobs name. Avoid numbers or special characters other than '_'.")
print("Example: 'training_results_aloha' or 'aloha_training_output'")

job_name = "train_diffusion_pusht_keypoints"

# Output directory with naming format instructions

output_dir = job_name

# Resume flag with disclaimer
# resume_flag = input("Do you want to resume training from a previous checkpoint? (yes/no): ")
resume_flag = "no"
resume_cmd = "--resume" if resume_flag.lower() == 'yes' else ""

# Model upload flag
# upload_choice = input("Do you want to upload the model to Hugging Face after training? (yes/no): ")
upload_choice = "yes"
model_repo_id = ""
if upload_choice.lower() == 'yes':
    # model_repo_id = input("Please enter the model repo ID to store your trained model to Hugging Face (e.g., TrossenRoboticsCommunity/aloha_static_logo_assembly): ")
    model_repo_id = job_name

# Local storage flag
# store_locally = input("Do you want to store the training outputs locally? (yes/no): ")
store_locally = "yes"

# 

Please select a suitable GPU type (e.g., A100, T4) for cloud-based training.
Generate a Hugging Face token from: https://huggingface.co/settings/tokens
You can explore datasets from Trossen Robotics Community here: https://huggingface.co/TrossenRoboticsCommunity
**Important**: Use a valid directory/jobs name. Avoid numbers or special characters other than '_'.
Example: 'training_results_aloha' or 'aloha_training_output'



## (For Colab) GPU Setup & Compute Units

The GPU type you selected earlier will now be configured for this cloud-based training session. Make sure to have enough compute units to support long training sessions, and monitor the first few epochs to ensure smooth execution.

---



## Hugging Face Login & Dataset Setup

We will now log into Hugging Face using the token provided. After login, the dataset repo, job name, and output directory that you specified will be configured for the training session.

---


In [3]:
# Log in to Hugging Face and verify login
print("Logging into Hugging Face...")
!huggingface-cli login --token {hf_token}

# Verify the login by checking the user information
user_info = !huggingface-cli whoami
print(f"Logged in as: {user_info[0]}")


Logging into Hugging Face...
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
The token `repo+collection read+write` has been saved to /home/serialexperimentsleon/.cache/huggingface/stored_tokens
Your token has been saved to /home/serialexperimentsleon/.cache/huggingface/token
Login successful.
The current active token is: `repo+collection read+write`
Logged in as: serialexperimentsleon



## WandB login

TODO

---


In [4]:
!wandb login

[34m[1mwandb[0m: Currently logged in as: [33mserialexperimentsleon[0m. Use [1m`wandb login --relogin`[0m to force relogin



## Visualize Dataset

TODO

---


In [15]:
%cd {root_dir}
rerun_mode = "notebook" if IS_COLAB else "local"
!python visualize_dataset.py \
    --mode {rerun_mode} \
    --repo-id lerobot/pusht \
    --episode-index 0

/home/serialexperimentsleon/lerobot_tutorial


  self.shell.db['dhist'] = compress_dhist(dhist)[-100:]


Fetching 4 files: 100%|█████████████████████████| 4/4 [00:00<00:00, 4120.14it/s]
Fetching 418 files: 100%|███████████████████| 418/418 [00:00<00:00, 2284.12it/s]
Resolving data files: 100%|███████████████| 206/206 [00:00<00:00, 321080.13it/s]
[0m[38;5;8m[[0m2025-01-27T01:16:59Z [0m[32mINFO [0m re_sdk_comms::server[0m[38;5;8m][0m Hosting a SDK server over TCP at 0.0.0.0:9876. Connect with the Rerun logging SDK.
[0m[38;5;8m[[0m2025-01-27T01:16:59Z [0m[32mINFO [0m winit::platform_impl::linux::x11::window[0m[38;5;8m][0m Guessed window scale factor: 1
[0m[38;5;8m[[0m2025-01-27T01:16:59Z [0m[33mWARN [0m wgpu_hal::vulkan::instance[0m[38;5;8m][0m Unable to find extension: VK_EXT_swapchain_colorspace
[0m[38;5;8m[[0m2025-01-27T01:16:59Z [0m[32mINFO [0m re_sdk_comms::server[0m[38;5;8m][0m New SDK client connected from: 127.0.0.1:50992
[0m[38;5;8m[[0m2025-01-27T01:16:59Z [0m[33mWARN [0m wgpu_hal::gles::egl[0m[38;5;8m][0m No config found!
[0m[38;5;8m[

> **⚠️ Important Notice:**
>
> Before you start the training, make sure to edit the `act_aloha_real.yaml` file located at:
>
> **Click Here** >> `/content/lerobot/lerobot/configs/policy/act_aloha_real.yaml`
>
> This file contains crucial parameters such as `batch_size`, `offline_steps`, and `learning_rate`. You should update these parameters based on your training needs. For example, you can modify:
>
> - **Batch Size** (`training.batch_size`): Adjust the number of samples processed in each training step.
> - **Offline Training Steps** (`training.offline_steps`): Define how many steps to run during offline training.
> - **Learning Rate** (`training.lr`): Set the learning rate to control how quickly the model learns.
>
> Once the file is updated, you can proceed with training.


## Model Training or Resumption

Now that everything is set up, we can either begin training the model or resume training from the last checkpoint, depending on your input.

If resuming, make sure the checkpoint is available in your session. The training will continue from the last checkpoint if found.

> **⚠️ Important: GPU Usage**
>
> By default, the training is configured to use a **GPU** for faster computation. If the runtime does not have access to a GPU, the training will fail.
>
> To avoid this issue:
>
> - **Ensure GPU is enabled** in your Colab runtime. You can check this by navigating to **Runtime > Change runtime type > Hardware accelerator** and selecting **GPU**.
> - If you prefer to use a **CPU** instead, update the `device` argument to `device=cpu` in the training command in the next cell.

---

In [77]:
# Start or resume training depending on user choice
# !export LD_LIBRARY_PATH=$HOME/.pyenv/versions/lerobot_tutorial/lib64/python3.11/site-packages/nvidia/nvjitlink/lib:$LD_LIBRARY_PATH
if resume_flag.lower() == "no":
    print(f"Starting new training on {dataset_repo_id}...")
    # !python lerobot/scripts/train.py dataset_repo_id={dataset_repo_id} policy=act_aloha_real env=aloha_real hydra.run.dir=outputs/train/{output_dir} hydra.job.name={job_name} device=cuda wandb.enable=false
    # for sim
    !python lerobot/scripts/train.py \
        dataset_repo_id={dataset_repo_id} \
        env=pusht \
        env.task={task_id} \
        env.gym.obs_type=environment_state_agent_pos \
        policy=diffusion_pusht_keypoints \
        training.eval_freq={training_eval_freq} \
        training.log_freq={training_log_freq} \
        training.offline_steps={training_offline_steps} \
        training.save_freq={training_save_freq} \
        eval.n_episodes={eval_n_episodes} \
        eval.batch_size={eval_batch_size} \
        hydra.run.dir=outputs/train/{output_dir} \
        hydra.job.name={job_name} \
        device=cuda wandb.enable=true \
        use_amp=true
else:
    print(f"Resuming training from {output_dir}... (ensure checkpoint is available)")
    !python lerobot/scripts/train.py hydra.run.dir={output_dir} resume=true

# step: optimization steps, smpl: num samples seen, ep: num episodes seen, epch: num epochs seen, Sigma rwrd: return, success: success rate, eval_s: 


Starting new training on lerobot/pusht_keypoints...
INFO 2025-01-26 19:15:29 ts/train.py:244 {'dataset_repo_id': 'lerobot/pusht_keypoints',
 'device': 'cuda',
 'env': {'action_dim': 2,
         'episode_length': 300,
         'fps': '${fps}',
         'gym': {'obs_type': 'environment_state_agent_pos',
                 'render_mode': 'rgb_array',
                 'visualization_height': 384,
                 'visualization_width': 384},
         'image_size': 96,
         'name': 'pusht',
         'state_dim': 2,
         'task': 'PushT-v0'},
 'eval': {'batch_size': 50, 'n_episodes': 50, 'use_async_envs': False},
 'fps': 10,
 'policy': {'beta_end': 0.02,
            'beta_schedule': 'squaredcos_cap_v2',
            'beta_start': 0.0001,
            'clip_sample': True,
            'clip_sample_range': 1.0,
            'crop_is_random': True,
            'crop_shape': [84, 84],
            'diffusion_step_embed_dim': 128,
            'do_mask_loss_for_padding': False,
            'down_d


## Eval policy

---


In [78]:
# output_dir
!python lerobot/scripts/eval.py -p outputs/train/{output_dir}/checkpoints/last/pretrained_model

INFO 2025-01-26 19:17:37 on/logger.py:39 [1m[33mOutput dir:[0m outputs/eval/2025-01-26/19-17-37_pusht_diffusion
INFO 2025-01-26 19:17:37 pts/eval.py:479 Making environment.
INFO 2025-01-26 19:17:37 pts/eval.py:482 Making policy.
Loading weights from local directory
Stepping through eval batches:   0%|                      | 0/1 [00:00<?, ?it/s]
Running rollout with at most 300 steps:   0%|           | 0/300 [00:00<?, ?it/s][A
Running rollout with at most 300 steps:   0%| | 0/300 [00:00<?, ?it/s, running_s[A
Running rollout with at most 300 steps:   0%| | 1/300 [00:00<03:18,  1.50it/s, r[A
Running rollout with at most 300 steps:   0%| | 1/300 [00:00<03:18,  1.50it/s, r[A
Running rollout with at most 300 steps:   1%| | 2/300 [00:00<03:18,  1.50it/s, r[A
Running rollout with at most 300 steps:   1%| | 3/300 [00:00<01:05,  4.55it/s, r[A
Running rollout with at most 300 steps:   1%| | 3/300 [00:00<01:05,  4.55it/s, r[A
Running rollout with at most 300 steps:   1%| | 4/300 [00:00<0

In [None]:
# skip to pretrained model
!python lerobot/scripts/eval.py -p lerobot/diffusion_pusht_keypoints

Fetching 7 files:   0%|                                   | 0/7 [00:00<?, ?it/s]
README.md: 100%|███████████████████████████| 2.79k/2.79k [00:00<00:00, 23.1MB/s][A

replay.mp4: 100%|██████████████████████████| 56.7k/56.7k [00:00<00:00, 11.1MB/s][A

eval_info.json:   0%|                               | 0.00/76.2k [00:00<?, ?B/s][A

config.json: 100%|█████████████████████████| 1.08k/1.08k [00:00<00:00, 13.9MB/s][A[A


eval_info.json: 100%|██████████████████████| 76.2k/76.2k [00:00<00:00, 5.45MB/s][A[A
config.yaml: 100%|█████████████████████████| 2.66k/2.66k [00:00<00:00, 14.0MB/s]

.gitattributes: 100%|██████████████████████| 1.52k/1.52k [00:00<00:00, 18.3MB/s][A
Fetching 7 files:  14%|███▊                       | 1/7 [00:00<00:00,  7.67it/s]
model.safetensors:   0%|                             | 0.00/995M [00:00<?, ?B/s][A
model.safetensors:   1%|▏                   | 10.5M/995M [00:00<00:24, 40.6MB/s][A
model.safetensors:   2%|▍                   | 21.0M/995M [00:00<00:23, 41


## Uploading the Model (Recommended)

Once the model is trained, you can choose to upload it to Hugging Face for safekeeping. This is **highly recommended** if you are running long sessions or training a valuable model.

Uploading the model will help protect against potential session interruptions or failures.

---


In [43]:
print(model_repo_id)
# Model upload step if chosen
if upload_choice.lower() == "yes":
    print("Uploading the model to Hugging Face...")
    !huggingface-cli upload {model_repo_id}  outputs/train/{output_dir}/checkpoints/last/pretrained_model
    print("Model uploaded to Hugging Face successfully.")
else:
    print("Model upload skipped.")



train_aloha_sim_transfer_cube_human
Uploading the model to Hugging Face...
Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
Start hashing 4 files.
Finished hashing 4 files.
model.safetensors: 100%|█████████████████████| 207M/207M [00:04<00:00, 43.3MB/s]
https://huggingface.co/serialexperimentsleon/train_aloha_sim_transfer_cube_human/tree/main/.
Model uploaded to Hugging Face successfully.



# ALOHA Example

TODO

---


In [None]:
# Collect all necessary inputs from the user
import os
import subprocess

# GPU selection (Reminder: Ensure enough compute units for smooth training)
print("Please select a suitable GPU type (e.g., A100, T4) for cloud-based training.")

# Hugging Face login token
print("Generate a Hugging Face token from: https://huggingface.co/settings/tokens")
hf_token = input("Please enter your Hugging Face token: ")

# Link to Trossen Robotics Community datasets
print("You can explore datasets from Trossen Robotics Community here: https://huggingface.co/TrossenRoboticsCommunity")

# Dataset and job details
# dataset_repo_id = input("Please enter the dataset repo ID from Hugging Face (e.g., TrossenRoboticsCommunity/aloha_static_logo_assembly): ")
dataset_repo_id = "lerobot/aloha_sim_transfer_cube_human"
task_id = "AlohaTransferCube-v0"
training_eval_freq = 20000
training_log_freq = 250
training_offline_steps = 100000
training_save_freq = 20000
eval_n_episodes = 50
eval_batch_size = 50

print("**Important**: Use a valid directory/jobs name. Avoid numbers or special characters other than '_'.")
print("Example: 'training_results_aloha' or 'aloha_training_output'")

job_name = "train_aloha_sim_transfer_cube_human"

# Output directory with naming format instructions

output_dir = job_name

# Resume flag with disclaimer
# resume_flag = input("Do you want to resume training from a previous checkpoint? (yes/no): ")
resume_flag = "no"
resume_cmd = "--resume" if resume_flag.lower() == 'yes' else ""

# Model upload flag
# upload_choice = input("Do you want to upload the model to Hugging Face after training? (yes/no): ")
upload_choice = "yes"
model_repo_id = ""
if upload_choice.lower() == 'yes':
    # model_repo_id = input("Please enter the model repo ID to store your trained model to Hugging Face (e.g., TrossenRoboticsCommunity/aloha_static_logo_assembly): ")
    model_repo_id = job_name

# Local storage flag
# store_locally = input("Do you want to store the training outputs locally? (yes/no): ")
store_locally = "yes"

# 

Please select a suitable GPU type (e.g., A100, T4) for cloud-based training.
Generate a Hugging Face token from: https://huggingface.co/settings/tokens
You can explore datasets from Trossen Robotics Community here: https://huggingface.co/TrossenRoboticsCommunity
**Important**: Use a valid directory/jobs name. Avoid numbers or special characters other than '_'.
Example: 'training_results_aloha' or 'aloha_training_output'



## Visualize Dataset

TODO

---


In [None]:
!python lerobot/scripts/visualize_dataset.py \
    --repo-id {dataset_repo_id} \
    --episode-index 0

Fetching 56 files: 100%|██████████████████████| 56/56 [00:00<00:00, 8932.88it/s]
[0m[38;5;8m[[0m2025-01-26T01:52:32Z [0m[32mINFO [0m re_sdk_comms::server[0m[38;5;8m][0m Hosting a SDK server over TCP at 0.0.0.0:9876. Connect with the Rerun logging SDK.
[0m[38;5;8m[[0m2025-01-26T01:52:32Z [0m[32mINFO [0m winit::platform_impl::linux::x11::window[0m[38;5;8m][0m Guessed window scale factor: 1
[0m[38;5;8m[[0m2025-01-26T01:52:32Z [0m[33mWARN [0m wgpu_hal::vulkan::instance[0m[38;5;8m][0m Unable to find extension: VK_EXT_swapchain_colorspace
[0m[38;5;8m[[0m2025-01-26T01:52:32Z [0m[32mINFO [0m re_sdk_comms::server[0m[38;5;8m][0m New SDK client connected from: 127.0.0.1:54078
[0m[38;5;8m[[0m2025-01-26T01:52:32Z [0m[33mWARN [0m wgpu_hal::gles::egl[0m[38;5;8m][0m No config found!
[0m[38;5;8m[[0m2025-01-26T01:52:32Z [0m[33mWARN [0m wgpu_hal::gles::egl[0m[38;5;8m][0m EGL says it can present to the window but not natively
  0%|                     

> **⚠️ Important Notice:**
>
> Before you start the training, make sure to edit the `act_aloha_real.yaml` file located at:
>
> **Click Here** >> `/content/lerobot/lerobot/configs/policy/act_aloha_real.yaml`
>
> This file contains crucial parameters such as `batch_size`, `offline_steps`, and `learning_rate`. You should update these parameters based on your training needs. For example, you can modify:
>
> - **Batch Size** (`training.batch_size`): Adjust the number of samples processed in each training step.
> - **Offline Training Steps** (`training.offline_steps`): Define how many steps to run during offline training.
> - **Learning Rate** (`training.lr`): Set the learning rate to control how quickly the model learns.
>
> Once the file is updated, you can proceed with training.


## Model Training or Resumption

Now that everything is set up, we can either begin training the model or resume training from the last checkpoint, depending on your input.

If resuming, make sure the checkpoint is available in your session. The training will continue from the last checkpoint if found.

> **⚠️ Important: GPU Usage**
>
> By default, the training is configured to use a **GPU** for faster computation. If the runtime does not have access to a GPU, the training will fail.
>
> To avoid this issue:
>
> - **Ensure GPU is enabled** in your Colab runtime. You can check this by navigating to **Runtime > Change runtime type > Hardware accelerator** and selecting **GPU**.
> - If you prefer to use a **CPU** instead, update the `device` argument to `device=cpu` in the training command in the next cell.

---

In [None]:
if not IS_COLAB:
    %env LD_LIBRARY_PATH=$HOME/.pyenv/versions/lerobot_tutorial/lib64/python3.11/site-packages/nvidia/nvjitlink/lib:$LD_LIBRARY_PATH
    !echo $LD_LIBRARY_PATH
    %cd lerobot

env: LD_LIBRARY_PATH=$HOME/.pyenv/versions/lerobot_tutorial/lib64/python3.11/site-packages/nvidia/nvjitlink/lib:$LD_LIBRARY_PATH
$HOME/.pyenv/versions/lerobot_tutorial/lib64/python3.11/site-packages/nvidia/nvjitlink/lib:$LD_LIBRARY_PATH
/home/serialexperimentsleon/lerobot_tutorial/lerobot


  self.shell.db['dhist'] = compress_dhist(dhist)[-100:]


In [None]:
# Start or resume training depending on user choice
# !export LD_LIBRARY_PATH=$HOME/.pyenv/versions/lerobot_tutorial/lib64/python3.11/site-packages/nvidia/nvjitlink/lib:$LD_LIBRARY_PATH
if resume_flag.lower() == "no":
    print(f"Starting new training on {dataset_repo_id}...")
    # !python lerobot/scripts/train.py dataset_repo_id={dataset_repo_id} policy=act_aloha_real env=aloha_real hydra.run.dir=outputs/train/{output_dir} hydra.job.name={job_name} device=cuda wandb.enable=false
    # for sim
    !python lerobot/scripts/train.py \
        dataset_repo_id={dataset_repo_id} \
        env=aloha \
        env.task={task_id} \
        policy=act \
        training.eval_freq={training_eval_freq} \
        training.log_freq={training_log_freq} \
        training.offline_steps={training_offline_steps} \
        training.save_freq={training_save_freq} \
        eval.n_episodes={eval_n_episodes} \
        eval.batch_size={eval_batch_size} \
        hydra.run.dir=outputs/train/{output_dir} \
        hydra.job.name={job_name} \
        device=cuda wandb.enable=true
else:
    print(f"Resuming training from {output_dir}... (ensure checkpoint is available)")
    !python lerobot/scripts/train.py hydra.run.dir={output_dir} resume=true

# step: optimization steps, smpl: num samples seen, ep: num episodes seen, epch: num epochs seen, Sigma rwrd: return, success: success rate, eval_s: 


Starting new training on lerobot/aloha_sim_transfer_cube_human...
INFO 2025-01-25 20:56:20 ts/train.py:244 {'dataset_repo_id': 'lerobot/aloha_sim_transfer_cube_human',
 'device': 'cuda',
 'env': {'action_dim': 14,
         'episode_length': 400,
         'fps': '${fps}',
         'gym': {'obs_type': 'pixels_agent_pos', 'render_mode': 'rgb_array'},
         'name': 'aloha',
         'state_dim': 14,
         'task': 'AlohaTransferCube-v0'},
 'eval': {'batch_size': 50, 'n_episodes': 50, 'use_async_envs': False},
 'fps': 50,
 'override_dataset_stats': {'observation.images.top': {'mean': [[[0.485]],
                                                                [[0.456]],
                                                                [[0.406]]],
                                                       'std': [[[0.229]],
                                                               [[0.224]],
                                                               [[0.225]]]}},
 'policy': {'chunk_si


## Eval policy

---


In [None]:
!python lerobot/scripts/eval.py -p outputs/train/{output_dir}/checkpoints/last/pretrained_model

INFO 2025-01-26 13:17:09 on/logger.py:39 [1m[33mOutput dir:[0m outputs/eval/2025-01-26/13-17-09_aloha_act
INFO 2025-01-26 13:17:09 pts/eval.py:479 Making environment.
INFO 2025-01-26 13:17:09 /__init__.py:88 MUJOCO_GL is not set, so an OpenGL backend will be chosen automatically.
INFO 2025-01-26 13:17:09 /__init__.py:96 Successfully imported OpenGL backend: %s
INFO 2025-01-26 13:17:09 /__init__.py:31 MuJoCo library version is: %s
INFO 2025-01-26 13:17:13 pts/eval.py:482 Making policy.
Loading weights from local directory
Stepping through eval batches:   0%|                      | 0/1 [00:00<?, ?it/s]
Running rollout with at most 400 steps:   0%|           | 0/400 [00:00<?, ?it/s][A
Running rollout with at most 400 steps:   0%| | 0/400 [00:02<?, ?it/s, running_s[A
Running rollout with at most 400 steps:   0%| | 1/400 [00:02<15:29,  2.33s/it, r[A
Running rollout with at most 400 steps:   0%| | 1/400 [00:02<15:29,  2.33s/it, r[A
Running rollout with at most 400 steps:   0%| | 2/400


## Safeguarding Session Data and Local Storage

To prevent data loss in case of session termination, you can zip the output directory and download it locally. If you selected local storage, the outputs will be saved to your local machine.

Make sure to run this step to save all training outputs before closing your session.

---


In [48]:
# Zip the output directory and download it if local storage is chosen
if IS_COLAB:
    train_output_dir = "/content/lerobot/outputs/train/"
else:
    train_output_dir = "outputs/train/"
%cd {train_output_dir}
!ls
if IS_COLAB:
    if store_locally.lower() == "yes":
        print("Zipping outputs for download...")
        !zip -r trained.zip {output_dir}

        # Download the zipped file
        from google.colab import files
        files.download('/content/lerobot/outputs/train/trained.zip')
    else:
        print("Local storage not selected, skipping download.")

[Errno 2] No such file or directory: 'outputs/train/'
/home/serialexperimentsleon/lerobot_tutorial/lerobot/outputs/train
aloha_train  train_aloha_sim_transfer_cube_human


  bkms = self.shell.db.get('bookmarks', {})


Zipping outputs for download...
  adding: train_aloha_sim_transfer_cube_human/ (stored 0%)
  adding: train_aloha_sim_transfer_cube_human/wandb/ (stored 0%)
  adding: train_aloha_sim_transfer_cube_human/wandb/latest-run/ (stored 0%)
  adding: train_aloha_sim_transfer_cube_human/wandb/latest-run/logs/ (stored 0%)
  adding: train_aloha_sim_transfer_cube_human/wandb/latest-run/logs/debug.log (deflated 67%)
  adding: train_aloha_sim_transfer_cube_human/wandb/latest-run/logs/debug-internal.log (deflated 74%)
  adding: train_aloha_sim_transfer_cube_human/wandb/latest-run/logs/debug-core.log (deflated 70%)
  adding: train_aloha_sim_transfer_cube_human/wandb/latest-run/tmp/ (stored 0%)
  adding: train_aloha_sim_transfer_cube_human/wandb/latest-run/tmp/code/ (stored 0%)
  adding: train_aloha_sim_transfer_cube_human/wandb/latest-run/run-1cq1s97q.wandb (deflated 77%)
  adding: train_aloha_sim_transfer_cube_human/wandb/latest-run/files/ (stored 0%)
  adding: train_aloha_sim_transfer_cube_human/wand


# Troubleshooting and Recommendations

1. **GPU Availability**: Ensure the selected GPU is available on your cloud platform (e.g., Colab).
2. **Compute Units**: Ensure you have sufficient compute units. Each 5-hour session requires ~70 units.
3. **Hugging Face Token**: You can generate a token [here](https://huggingface.co/settings/tokens).
4. **Session Safeguards**: Always download your results (output files) to prevent data loss if the session terminates.
5. **Checkpoint Reminder**: If resuming training, ensure that the checkpoint file from the previous session is present in the session.

---
