# Train SmolVLA with Fine-Tuning

---

- Conda env : [lerobot](../README.md#setup-a-conda-environment)

----

- Ref: 
    - ...


    


### Device Setup

In [4]:
import torch

if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

print(f"Available device : {device}")

Available device : cuda


In [28]:
if device == "cuda":
    !nvidia-smi

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Wed Sep 10 07:51:08 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 2080 Ti     On  |   00000000:01:00.0  On |                  N/A |
| 43%   46C    P2             65W /  250W |    5318MiB /  11264MiB |     17%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
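
The `huggingface/tokenizers` warning in the output above appears because the `!nvidia-smi` shell call forks the notebook process after the tokenizers library has already used parallelism. As the message itself suggests, setting `TOKENIZERS_PARALLELISM` explicitly silences it. A minimal sketch, assuming the cell runs before any tokenizer is used:

```python
import os

# Set this before tokenizers is used, so forked subprocesses
# (e.g. `!` shell commands in a notebook) do not trigger the warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```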
                                                

## Dataset (svla_so101_pickplace) Visualization

In [34]:
!python -m lerobot.scripts.visualize_dataset \
    --repo-id lerobot/svla_so101_pickplace \
    --episode-index 40

Resolving data files: 100%|██████████████████| 50/50 [00:00<00:00, 89088.87it/s]
[0m[38;5;8m[[0m2025-09-10T15:24:09Z [0m[32mINFO [0m winit::platform_impl::linux::x11::window[0m[38;5;8m][0m Guessed window scale factor: 1
[0m[38;5;8m[[0m2025-09-10T15:24:10Z [0m[33mWARN [0m wgpu_hal::gles::egl[0m[38;5;8m][0m No config found!
[0m[38;5;8m[[0m2025-09-10T15:24:10Z [0m[33mWARN [0m wgpu_hal::gles::egl[0m[38;5;8m][0m EGL says it can present to the window but not natively
  0%|                                                    | 0/10 [00:00<?, ?it/s][0m[38;5;8m[[0m2025-09-10T15:24:10Z [0m[33mWARN [0m wgpu_hal::gles::adapter[0m[38;5;8m][0m Max vertex attribute stride unknown. Assuming it is 2048
[0m[38;5;8m[[0m2025-09-10T15:24:10Z [0m[33mWARN [0m wgpu_hal::vulkan::conv[0m[38;5;8m][0m Unrecognized present mode 1000361000
[0m[38;5;8m[[0m2025-09-10T15:24:10Z [0m[33mWARN [0m wgpu_hal::gles::adapter[0m[38;5;8m][0m Max vertex attribute stride unknow

## Fine-tuning SmolVLA with the svla_so101_pickplace dataset

In [None]:
import os

output_dir = "./temp/outputs/svla_so101_pickplace"
print(output_dir)

In [6]:
!python -m lerobot.scripts.train \
    --policy.path=lerobot/smolvla_base \
    --dataset.repo_id=lerobot/svla_so101_pickplace \
    --batch_size=32  \
    --steps=2000 \
    --save_freq=1000 \
    --eval_freq=10 \
    --policy.device=$device \
    --wandb.enable=false \
    --output_dir=$output_dir \
    --policy.push_to_hub=false

INFO 2025-09-10 05:15:45 ils/utils.py:48 Cuda backend detected, using cuda.
INFO 2025-09-10 05:15:46 ts/train.py:111 {'batch_size': 32,
 'dataset': {'episodes': None,
             'image_transforms': {'enable': False,
                                  'max_num_transforms': 3,
                                  'random_order': False,
                                  'tfs': {'brightness': {'kwargs': {'brightness': [0.8,
                                                                                   1.2]},
                                                         'type': 'ColorJitter',
                                                         'weight': 1.0},
                                          'contrast': {'kwargs': {'contrast': [0.8,
                                                                               1.2]},
                                                       'type': 'ColorJitter',
                                                       'weight': 1.0},
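
With `--steps=2000` and `--save_freq=1000`, the run should leave checkpoints under `output_dir`; the benchmark cell later in this notebook loads `checkpoints/last/pretrained_model` from there. A small sketch for checking that the directory exists before loading it. The helper name `find_pretrained_dir` is hypothetical and not part of lerobot, and the directory layout is assumed from the path used in that cell:

```python
import os


def find_pretrained_dir(output_dir: str) -> str:
    """Return the path of the last saved pretrained model, or raise if it is missing."""
    # Layout assumed from the path the benchmark cell loads;
    # adjust it if your lerobot version writes checkpoints elsewhere.
    path = os.path.join(output_dir, "checkpoints", "last", "pretrained_model")
    if not os.path.isdir(path):
        raise FileNotFoundError(f"No checkpoint found at {path}; did training finish?")
    return path
```

Calling `find_pretrained_dir(output_dir)` fails early with a readable message if training wrote its outputs somewhere other than expected.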
                

## Performance Test with the fine-tuned model

In [30]:
import os
import time

import torch

from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy
from lerobot.policies.smolvla.configuration_smolvla import SmolVLAConfig

from transformers import AutoProcessor
 
local_path = os.path.join(output_dir, "checkpoints/last/pretrained_model")
policy = SmolVLAPolicy.from_pretrained(local_path).to(device)
policy.eval()
 
# patch: The loaded policy is missing the language_tokenizer attribute.
policy.language_tokenizer = AutoProcessor.from_pretrained(policy.config.vlm_model_name).tokenizer
 
# Dummy batch config for a single observation
batch_size = 1
img_shape = (3, 480, 640)  # (C, H, W)
# Infer state_dim from the loaded normalization stats
state_dim = policy.normalize_inputs.buffer_observation_state.mean.shape[-1]
 
dummy_batch = {
    # a single image observation
    "observation.images.top": torch.rand(batch_size, *img_shape, device=device),
    "observation.images.side": torch.rand(batch_size, *img_shape, device=device),
    # a single state observation
    "observation.state": torch.rand(batch_size, state_dim, device=device),
    "task": ["stack the blocks"] * batch_size,
}
 
# --- Prepare inputs for the model ---
# The policy expects normalized inputs and specific data preparation.
normalized_batch = policy.normalize_inputs(dummy_batch)
images, img_masks = policy.prepare_images(normalized_batch)
state = policy.prepare_state(normalized_batch)
lang_tokens, lang_masks = policy.prepare_language(normalized_batch)
# ---
 
# Warmup
for _ in range(3):
    with torch.no_grad():
        _ = policy.model.sample_actions(images, img_masks, lang_tokens, lang_masks, state)
 
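# Benchmark
# CUDA kernels launch asynchronously, so synchronize before reading the clock.
if device == "cuda":
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(100):
    with torch.no_grad():
        _ = policy.model.sample_actions(images, img_masks, lang_tokens, lang_masks, state)
if device == "cuda":
    torch.cuda.synchronize()
end = time.perf_counter()
 
print(f"Avg inference time: {(end - start)/100:.6f} s")
if device == "cuda":
    print(f"Max GPU memory used: {torch.cuda.max_memory_allocated() / 1024**2:.2f} MB")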
 

Loading  HuggingFaceTB/SmolVLM2-500M-Video-Instruct weights ...
Reducing the number of VLM layers to 16 ...
Loading weights from local directory
Avg inference time: 0.282433 s
Max GPU memory used: 1913.13 MB
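
A single average hides run-to-run variation. The sketch below times each call separately and reports mean, median, and 95th-percentile latency. The helper `benchmark` and its `sync` parameter are illustrative names, not lerobot APIs. On CUDA, pass `torch.cuda.synchronize` as `sync` so that asynchronous kernels have finished before each clock read.

```python
import statistics
import time
from typing import Callable, Dict, Optional


def benchmark(
    fn: Callable[[], object],
    n_warmup: int = 3,
    n_runs: int = 100,
    sync: Optional[Callable[[], None]] = None,
) -> Dict[str, float]:
    """Call fn() n_runs times and return latency statistics in seconds."""
    # Warmup calls are executed but not timed.
    for _ in range(n_warmup):
        fn()

    times = []
    for _ in range(n_runs):
        if sync is not None:
            sync()  # drain pending device work before starting the clock
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for this call's device work to finish
        times.append(time.perf_counter() - start)

    times.sort()
    p95_index = min(len(times) - 1, int(0.95 * len(times)))
    return {
        "mean": statistics.fmean(times),
        "median": statistics.median(times),
        "p95": times[p95_index],
    }
```

In this notebook, `fn` would be a small function that calls `policy.model.sample_actions(images, img_masks, lang_tokens, lang_masks, state)` inside `torch.no_grad()`.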
