VITRA:
Scalable Vision-Language-Action Model Pretraining
for Robotic Manipulation with Real-Life Human Activity Videos
For video demonstrations, please refer to our project page.
- [2025-12-05] 🚀 Release the code for performing zero-shot inference using a single image.
- [2025-11-30] 🚀 Our code, pretrained models, and datasets are now open-sourced.
- [2025-10-24] 🚀 VITRA paper is released on arXiv.
Our pretrained model and datasets are available on the huggingface hub:
| Hugging Face Model | #Params | Description |
|---|---|---|
| VITRA-VLA-3B | 3B | Base VLA model pretrained on Human Hand Data. |
Note: Our base VLA model is finetuned from Paligemma2. If you do not have access to Paligemma2, please request permission on the official website.
| Hugging Face Dataset | Sub Dataset | Number of Episodes |
|---|---|---|
| VITRA-1M | ego4d_cooking_and_cleaning | 454,244 |
| | ego4d_other | 494,439 |
| | epic | 154,464 |
| | egoexo4d | 67,053 |
| | ssv2 | 52,718 |
| | Total | 1,222,918 |
Note: See data/data.md for detailed information about our datasets.
- 1. Installation
- 2. Inference with Human Hand Image
- 3. Fine-tuning with a Custom Robot Dataset
- 4. Deployment in the Real World
- 5. Human Hand VLA Dataset Utilization
- 6. Human Data Pretraining from Scratch
- Citation
We recommend using conda to manage the environment. PyTorch >= 2.3.0 and CUDA >= 12.1 are required (lower versions may work, but we have not tested them). If the environment is used solely for training, we recommend a newer PyTorch version for improved training speed.
# Clone the repository
git clone https://github.com/microsoft/VITRA.git
cd VITRA
# Create environment
conda create -n vitra python=3.10 -y
conda activate vitra
# Install dependencies
pip install -e .

Detailed system requirements:
- OS: Linux (Ubuntu 20.04/22.04 recommended)
- Python: 3.10+
- CUDA: 11.8+
- GPU: Minimum 16GB VRAM for inference, A100/H100 recommended for training.
If you want to visualize the results after inference, run dataset visualization, or perform zero-shot human hand action prediction from a single image, please follow the instructions below.
Submodules Installation
Please clone the submodules to perform hand pose estimation.
git submodule update --init --recursive

Libraries Installation
Please install the following additional modules for visualization using the commands below:
pip install -e .[visulization] --no-build-isolation

If you encounter issues when installing PyTorch3D, please refer to the official PyTorch3D installation instructions.
If FFmpeg is not installed on your system, please install it first.
sudo apt install ffmpeg

MANO Hand Model
Our reconstructed hand labels are based on the MANO hand model. We only require the right hand model. The model parameters can be downloaded from the official website and organized in the following structure:
weights/
└── mano/
├── MANO_RIGHT.pkl
└── mano_mean_params.npz
Please download the model weights of HaWoR for hand pose estimation:
wget https://huggingface.co/spaces/rolpotamias/WiLoR/resolve/main/pretrained_models/detector.pt -P ./weights/hawor/external/
wget https://huggingface.co/ThunderVVV/HaWoR/resolve/main/hawor/checkpoints/hawor.ckpt -P ./weights/hawor/checkpoints/

You can use our pretrained model to perform zero-shot 3D human hand action prediction directly from an egocentric human hand image (landscape view) based on instructions. To predict human actions from pre-captured images, please run scripts/run_human_inference.sh. Here is a simple example:
python scripts/inference_human_prediction.py \
--config VITRA-VLA/VITRA-VLA-3B \
--image_path ./examples/0002.jpg \
--sample_times 4 \
--save_state_local \
--use_right \
--video_path ./example_human_inf.mp4 \
--mano_path ./weights/mano \
--instruction "Left hand: None. Right hand: Pick up the picture of Michael Jackson." \All example images are captured on mobile phones in rooms that do not appear anywhere in the V-L-A dataset. They also include entirely unseen concepts, such as photos of celebrities.
Users can capture images with their own devices and directly test the model using the recorded images.
Note:
For best inference quality, it is recommended to capture landscape view images from a camera height close to that of a human head to match natural egocentric viewpoints. Highly unusual or distorted hand poses/positions may cause inference failures.
Below is a minimal usage example for predicting human actions.
import json
import torch
import numpy as np
from PIL import Image
from vitra.models import VITRA_Paligemma, load_model
from vitra.utils.data_utils import resize_short_side_to_target, load_normalizer
from vitra.datasets.human_dataset import pad_state_human, pad_action
from vitra.utils.config_utils import load_config
from vitra.datasets.dataset_utils import (
ActionFeature,
StateFeature,
)
# Load configs
configs = load_config('VITRA-VLA/VITRA-VLA-3B')
# Override config if provided
pretrained_path = 'VITRA-VLA/VITRA-VLA-3B'
statistics_path = 'VITRA-VLA/VITRA-VLA-3B'
configs['model_load_path'] = pretrained_path
configs['statistics_path'] = statistics_path
# Load model and normalizer
model = load_model(configs).cuda()
model.eval()
normalizer = load_normalizer(configs)
image_path = "your_image.jpg"
image = Image.open(image_path)
image = resize_short_side_to_target(image, target=224)
fov = torch.tensor([[np.deg2rad(60.0), np.deg2rad(60.0)]]) # input your camera FOV here [fov_x, fov_y]
image = np.array(image)
# Input your prompt here. Only predict the right hand action as an example.
instruction = "Left hand: None. Right hand: Pick up the phone on the table."
# Initialize state
# State vector structure (total dimension: 122):
# - state_left [51]: Left hand state vector
# * [0:3] transl: Translation in camera space (x, y, z in meters)
# * [3:6] global_orient: Global rotation as Euler angles (xyz, in radians)
# * [6:51] hand_pose: 45 joint angles as Euler angles (15 joints × 3 axes, in radians)
# - beta_left [10]: Left hand MANO shape parameters
# - state_right [51]: Right hand state vector (same structure as state_left)
# * [0:3] transl: Translation in camera space (x, y, z in meters)
# * [3:6] global_orient: Global rotation as Euler angles (xyz, in radians)
# * [6:51] hand_pose: 45 joint angles as Euler angles (15 joints × 3 axes, in radians)
# - beta_right [10]: Right hand MANO shape parameters
state = np.zeros((normalizer.state_mean.shape[0],)) # Input your hand state here
# Only use right hand state as an example.
# state_mask[0] indicates whether to use left hand state,
# state_mask[1] indicates whether to use right hand state.
state_mask = np.array([False, True], dtype=bool) # Input your hand state mask here.
# Input your action_mask here. Shape: (W, 2) where W is chunk_size.
# action_mask[:, 0] indicates whether to predict left hand actions,
# action_mask[:, 1] indicates whether to predict right hand actions.
# All left hand False, all right hand True as an example.
action_mask = np.tile(np.array([[False, True]], dtype=bool), (model.chunk_size, 1))
# Normalize state
norm_state = normalizer.normalize_state(state)
unified_action_dim = ActionFeature.ALL_FEATURES[1] # 192
unified_state_dim = StateFeature.ALL_FEATURES[1] # 212
unified_state, unified_state_mask = pad_state_human(
state = norm_state,
state_mask = state_mask,
action_dim = normalizer.action_mean.shape[0],
state_dim = normalizer.state_mean.shape[0],
unified_state_dim = unified_state_dim,
)
_, unified_action_mask = pad_action(
actions=None,
action_mask=action_mask,
action_dim=normalizer.action_mean.shape[0],
unified_action_dim=unified_action_dim
)
# Model inference
norm_action = model.predict_action(
image = image,
instruction = instruction,
current_state = unified_state.unsqueeze(0),
current_state_mask = unified_state_mask.unsqueeze(0),
action_mask_torch = unified_action_mask.unsqueeze(0),
num_ddim_steps = 10,
cfg_scale = 5.0,
fov = fov,
sample_times = 1
)
norm_action = norm_action[0, :,:102]
# Denormalize predicted action
unnorm_action = normalizer.unnormalize_action(norm_action)
print("Predicted Action:", unnorm_action)Our VITRA model serve as starting points for robot-specific fine-tuning (e.g., on Xhand or your custom robot).
We use the camera-space end-effector (EEF) pose to represent the wrist pose of the dexterous hand. The teleoperation command sent to the XHand is treated as the dexterous-hand joint action, and the retrieved XHand joint angles are used as the hand joint state. Before fine-tuning with robot data, we strongly recommend aligning both hands’ EEF translation and rotation coordinate frames to the coordinate system used in the Hand V-L-A dataset.
Align all coordinates to the camera coordinate system:
- X-axis: pointing to the right of the screen (positive direction)
- Y-axis: pointing downward on the screen (positive direction)
- Z-axis: pointing away from the camera, perpendicular into the screen (positive direction)
For the EEF rotation origin, pay attention to the mirror relationship between the left and right hands. After aligning the rotation origin, the EEF pose should match the case shown in the following image. The left side is an example from the Human V-L-A dataset; the right side is an aligned example from the XHand dataset.
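As a concrete illustration of this alignment, the sketch below expresses an EEF pose recorded in the robot base frame in the camera coordinate system defined above. It is a minimal example, not part of the VITRA codebase, and assumes you have calibrated the camera extrinsics `T_base_cam` (the camera pose in the robot base frame); the function name and arguments are illustrative.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def eef_pose_to_camera_frame(eef_transl_base, eef_euler_base, T_base_cam):
    """Express a base-frame EEF pose in the camera frame (x right, y down, z forward).

    eef_transl_base: (3,) EEF translation in the robot base frame, in meters
    eef_euler_base:  (3,) EEF rotation as xyz Euler angles in the base frame, in radians
    T_base_cam:      (4, 4) homogeneous pose of the camera in the robot base frame
    """
    # Build the homogeneous EEF pose in the base frame
    T_base_eef = np.eye(4)
    T_base_eef[:3, :3] = R.from_euler("xyz", eef_euler_base).as_matrix()
    T_base_eef[:3, 3] = eef_transl_base

    # Re-express it in the camera frame: T_cam_eef = inv(T_base_cam) @ T_base_eef
    T_cam_eef = np.linalg.inv(T_base_cam) @ T_base_eef

    transl = T_cam_eef[:3, 3]
    euler = R.from_matrix(T_cam_eef[:3, :3]).as_euler("xyz")  # radians
    return transl, euler
```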
In addition, we provide the transfer_xhand_to_human function in vitra/datasets/robot_dataset.py, which maps XHand joint angles to the closest degrees of freedom used in the human-hand representation. (Note that this XHand-to-human-hand alignment does not need to be preprocessed offline; it is automatically applied during trajectory transformation inside the transform_trajectory function of the RoboDatasetCore class.)
If you are using a different dexterous hand model, we recommend implementing a similar function to align its joint configuration with the human-hand DoF.
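As a rough sketch of what such an alignment could look like for a hypothetical 6-joint hand, the example below scatters robot joint angles into the 45-dimensional human hand_pose vector. The joint-to-DoF mapping here is an illustrative assumption only, not a value from the VITRA codebase; a real mapping must be derived from your hand's kinematics, analogous to transfer_xhand_to_human.

```python
import numpy as np

# Hypothetical mapping: robot joint index -> index into the 45-dim human
# hand_pose vector (15 joints x 3 Euler axes). These indices are placeholders,
# NOT values used by VITRA; derive them from your hand's kinematic structure.
ROBOT_TO_HUMAN_DOF = {
    0: 38,  # thumb flexion (assumed)
    1: 41,  # thumb abduction (assumed)
    2: 2,   # index flexion (assumed)
    3: 11,  # middle flexion (assumed)
    4: 20,  # ring flexion (assumed)
    5: 29,  # little flexion (assumed)
}

def robot_joints_to_human_pose(robot_joints: np.ndarray) -> np.ndarray:
    """Scatter robot joint angles (radians) into a zero-initialized human hand_pose."""
    human_pose = np.zeros(45, dtype=np.float32)
    for robot_idx, human_idx in ROBOT_TO_HUMAN_DOF.items():
        human_pose[human_idx] = robot_joints[robot_idx]
    return human_pose
```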
To adapt the dataset loader to your own robot data, you should create a custom implementation of the RoboDatasetCore class in vitra/datasets/robot_dataset.py.
The following methods typically need to be overridden to match the human-data pretraining format:
def __init__(...):
    # dataset initialization
    ...

def __len__(self):
    # return the number of frames
    ...

def __getitem__(self, idx):
    """
    Returns a sample dictionary.
    """
    ...
    return {
        "instruction": instruction,
        "image_list": image_list,
        "image_mask": image_mask,
        "action_list": action_list,
        "action_mask": action_mask,
        "current_state": current_state,
        "current_state_mask": current_state_mask,
        "fov": fov,
        "intrinsics": self.intrinsics,
    }

The dictionary returned by __getitem__ contains multiple fields representing the robot state, actions, and observations. Here, we use the XHand configuration as an example for illustration:
- Left Hand: 6-DoF EEF pose + 12 joint angles = 18-DoF
- Right Hand: 6-DoF EEF pose + 12 joint angles = 18-DoF
- The ordering of dimensions in the state or action: [left_eef_trans, left_eef_euler_rotation, left_hand_joint, right_eef_trans, right_eef_euler_rotation, right_hand_joint]
| Key | Type | Shape | Description |
|---|---|---|---|
| instruction | str | — | Natural language description of the task, e.g., "Left hand: None. Right hand: {Right hand prompt}" |
| image_list | np.ndarray (uint8) | (1, H, W, C) | RGB image sequence. C=3 (channels); H, W = height and width. |
| image_mask | np.ndarray (bool) | (1,) | Indicates which frames are valid (1 = valid, 0 = padded). |
| action_list | np.ndarray (float32) | (T, 36) | Sequence of robot actions (an action chunk). T = length of the action chunk. |
| action_mask | np.ndarray (bool) | (T, 2) | Indicates which time steps contain valid left-hand or right-hand actions. action_mask[:, 0] corresponds to the left hand, action_mask[:, 1] to the right hand. |
| current_state | np.ndarray (float32) | (36,) | Robot state at the current time step. |
| current_state_mask | np.ndarray (float32) | (2,) | Indicates which hand's state is valid. current_state_mask[0] corresponds to the left hand, current_state_mask[1] to the right hand. |
| fov | np.ndarray (float32) | (2,) | Camera field of view, [fov_x, fov_y]. |
| intrinsics | np.ndarray (float32) | (3, 3) | Camera intrinsic matrix. Not used during training or model inference. |
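For concreteness, here is a minimal sketch of a single sample matching the table above for the XHand configuration (36-dim state and action, one input frame). All values are dummy placeholders; in a real __getitem__ they would come from your recorded trajectories, and the chunk length T below is an assumption.

```python
import numpy as np

T = 16  # assumed action chunk length
sample = {
    "instruction": "Left hand: None. Right hand: Pick up the cup.",
    "image_list": np.zeros((1, 224, 224, 3), dtype=np.uint8),      # single RGB frame
    "image_mask": np.ones((1,), dtype=bool),                        # frame is valid
    "action_list": np.zeros((T, 36), dtype=np.float32),             # action chunk
    "action_mask": np.tile(np.array([[False, True]]), (T, 1)),      # right hand only
    "current_state": np.zeros((36,), dtype=np.float32),
    "current_state_mask": np.array([0.0, 1.0], dtype=np.float32),   # right hand valid
    "fov": np.array([np.deg2rad(60.0), np.deg2rad(60.0)], dtype=np.float32),
    "intrinsics": np.eye(3, dtype=np.float32),
}
```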
Before training, we need to compute the dataset statistics for the robot dataset, specifically the mean and standard deviation of the state and action variables. These statistics are used to normalize the states and actions with Gaussian normalization.
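A minimal sketch of this statistics computation is shown below. The JSON key names are illustrative assumptions, not necessarily the exact format expected by the released code.

```python
import json
import numpy as np

def compute_statistics(states: np.ndarray, actions: np.ndarray, out_path: str) -> dict:
    """Per-dimension mean/std over the whole robot dataset.

    states:  (N, 36) all recorded states stacked along the first axis
    actions: (M, 36) all recorded actions stacked along the first axis
    """
    stats = {
        "state_mean": states.mean(axis=0).tolist(),
        "state_std": (states.std(axis=0) + 1e-8).tolist(),    # avoid division by zero
        "action_mean": actions.mean(axis=0).tolist(),
        "action_std": (actions.std(axis=0) + 1e-8).tolist(),
    }
    with open(out_path, "w") as f:
        json.dump(stats, f, indent=2)
    return stats

# Gaussian normalization then uses x_norm = (x - mean) / std.
```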
After preparing the dataset and computing statistics, you can edit the vitra/configs/robot_finetune.json file to update paths and other relevant settings. pretrain_path should be replaced with the path to the pretrained model or the folder containing it, so that fine-tuning can be performed correctly.
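Assuming the config is a plain JSON file, updating the path can be as simple as the snippet below; only pretrain_path is named in the text above, and the checkpoint path used here is an example.

```python
import json

config_path = "vitra/configs/robot_finetune.json"
with open(config_path) as f:
    cfg = json.load(f)

# Point fine-tuning at the pretrained VITRA checkpoint (example path)
cfg["pretrain_path"] = "/path/to/VITRA-VLA-3B"

with open(config_path, "w") as f:
    json.dump(cfg, f, indent=2)
```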
In scripts/run_robot_finetune.sh, make sure to input your Hugging Face token and WANDB API key for proper authentication and logging.
Once the configuration is ready, you can start pretraining by running:
bash scripts/run_robot_finetune.sh

Below is an example of running inference with VITRA on the XHand platform.
import json
import torch
import numpy as np
from PIL import Image
from vitra.models import VITRA_Paligemma, load_model
from vitra.utils.data_utils import resize_short_side_to_target
from vitra.datasets.human_dataset import pad_state_human, pad_action
from vitra.utils.data_utils import load_normalizer
from vitra.datasets.dataset_utils import (
ActionFeature,
StateFeature,
)
from vitra.datasets.robot_dataset import (
transfer_xhand_to_human,
transfer_human_to_xhand,
pad_state_robot, pad_action
)
# Load configs
configs = json.load(open('configs/robot_finetune.json'))
pretrained_path = 'checkpoints/finetuned_model.pt'
statistics_path = 'statistics/RoboData_statistics.json'
configs['model_load_path'] = pretrained_path
configs['statistics_path'] = statistics_path
# Load model and normalizer
model = load_model(configs).cuda()
model.eval()
normalizer = load_normalizer(configs)
image_path = "your_image.jpg"
image = Image.open(image_path)
image = resize_short_side_to_target(image, target=224)
fov = torch.tensor([[np.deg2rad(60.0), np.deg2rad(60.0)]]) # input your camera FOV here [fov_x, fov_y]
image = np.array(image)
# Input your prompt here. Only predict the right hand action as an example.
instruction = "Left hand: None. Right hand: Pour the contents of the bottle into the pot."
# Initialize state
# State vector structure (total dimension: 36):
# - Left hand [0:18]:
# * [0:3] transl: Wrist translation in camera space (x, y, z in meters)
# * [3:6] global_orient: Wrist rotation as Euler angles (xyz, in radians)
# * [6:18] hand_pose: XHand joint angles (12 joints, in radians)
# - Right hand [18:36]:
# * [18:21] transl: Wrist translation in camera space (x, y, z in meters)
# * [21:24] global_orient: Wrist rotation as Euler angles (xyz, in radians)
# * [24:36] hand_pose: XHand joint angles (12 joints, in radians)
state = np.zeros((normalizer.state_mean.shape[0],)) # Input your hand state here
# Only use right hand state as an example.
# state_mask[0] indicates whether to use left hand state,
# state_mask[1] indicates whether to use right hand state.
state_mask = np.array([False, True], dtype=bool) # Input your hand state mask here.
# Input your action_mask here. Shape: (W, 2) where W is chunk_size.
# action_mask[:, 0] indicates whether to predict left hand actions,
# action_mask[:, 1] indicates whether to predict right hand actions.
# All left hand False, all right hand True as an example.
action_mask = np.tile(np.array([[False, True]], dtype=bool), (model.chunk_size, 1))
# Normalize state
norm_state = normalizer.normalize_state(state)
unified_action_dim = ActionFeature.ALL_FEATURES[1] # 192
unified_state_dim = StateFeature.ALL_FEATURES[1] # 212
unified_state, unified_state_mask = pad_state_robot(
state = norm_state,
state_mask = state_mask,
state_dim = normalizer.state_mean.shape[0],
unified_state_dim = unified_state_dim,
)
_, unified_action_mask = pad_action(
actions=None,
action_mask=action_mask,
action_dim=normalizer.action_mean.shape[0],
unified_action_dim=unified_action_dim
)
human_state, human_state_mask, _, human_action_mask = transfer_xhand_to_human(
unified_state, unified_state_mask,
None, unified_action_mask
)
# Model inference
norm_action = model.predict_action(
image = image,
instruction = instruction,
current_state = human_state.unsqueeze(0),
current_state_mask = human_state_mask.unsqueeze(0),
action_mask_torch = human_action_mask.unsqueeze(0),
num_ddim_steps = 10,
cfg_scale = 5.0,
fov = fov,
sample_times = 1
)
norm_action = norm_action[0, :,:102]
norm_robot_action = transfer_human_to_xhand(norm_action)
# Denormalize predicted action
unnorm_action = normalizer.unnormalize_action(norm_robot_action)
print("Predicted Action:", unnorm_action.shape)
# results in 2x 18-DoF actions of 16 steps with shape [16, 36]

We release the Human Hand V-L-A dataset, which transforms "in-the-wild" videos into robot-aligned (image, instruction, action) tuples.
You can download the dataset annotations from VITRA-1M.
After downloading, please extract all .gz files using the following commands:
tar -xzvf ego4d_cooking_and_cleaning.tar.gz
tar -xzvf ego4d_other.tar.gz
tar -xzvf egoexo4d.tar.gz
tar -xzvf ssv2.tar.gz
tar -xzvf epic.tar.gz

For detailed information on the dataset, its structure, and usage instructions, please refer to the data/data.md file.
To reproduce our pretraining results on the Human-VLA dataset:
First, please follow the instructions in data/data.md to download the video dataset and the corresponding annotation files. After downloading, perform undistortion on the videos to correct for fisheye and lens distortions before further preprocessing.
The undistorted EgoExo4D videos contain black borders due to fisheye correction. We apply an additional preprocessing step to remove these borders: first resize all EgoExo4D frames to 448×448, then center-crop them to 256×256. The corresponding change in the camera intrinsics is already handled in vitra/datasets/human_dataset.py, where the computation of the camera intrinsics incorporates the crop transformation and adjusts the camera FOV accordingly.
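The sketch below illustrates this border-removal step on a single frame; it is an assumed helper for clarity, not the released preprocessing script. The intrinsics/FOV adjustment itself is handled inside vitra/datasets/human_dataset.py as noted above.

```python
from PIL import Image

def remove_fisheye_borders(frame: Image.Image) -> Image.Image:
    """Resize an EgoExo4D frame to 448x448, then center-crop it to 256x256."""
    frame = frame.resize((448, 448), Image.BILINEAR)
    offset = (448 - 256) // 2  # 96-pixel border on each side
    return frame.crop((offset, offset, offset + 256, offset + 256))
```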
Once prepared, organize the data in the following directory structure:
Data_root/
├── Video/
│ ├── Ego4D_root/
│ │ ├── {video_name1}.mp4
│ │ ├── {video_name2}.mp4
│ │ └── ...
│ ├── Epic-Kitchen_root/
│ │ ├── {video_name1}.MP4
│ │ ├── {video_name2}.MP4
│ │ └── ...
│ ├── EgoExo4D_root/
│ │ ├── {video_name1}.mp4
│ │ ├── {video_name2}.mp4
│ │ └── ...
│ └── Somethingsomething-v2_root/
│ ├── {video_name1}.webm
│ ├── {video_name2}.webm
│ └── ...
└── Annotation/
├── ego4d_cooking_and_cleaning/
├── ego4d_other/
├── egoexo4d/
├── epic/
├── ssv2/
└── statistics/
Video/ contains all raw video files from different sources such as Ego4D, Epic-Kitchen, EgoExo4D, and Something-Something v2. Annotation/ contains the annotation files aligned with the corresponding videos.
To further accelerate training, we recommend the following optional preprocessing steps:
- Resize all processed videos so that the shorter side is 224 for faster decoding and reduced memory usage (preserve the aspect ratio during resizing).
- Split long videos into shorter clips of at most 2000 frames each. Name the clips using the format {video_name}_part{part_index}.mp4, where part_index starts from 1.
- In configs/human_pretrain.json, set "clip_len": 2000.

This significantly improves data-loading speed during training, especially for large-scale datasets.
You can directly use the precomputed statistics provided in the downloaded dataset. Alternatively, you can compute the dataset statistics (mean and standard deviation of states and actions) by running the following command.
python vitra/datasets/calculate_statistics.py --save_folder data_root/statistics

Edit the vitra/configs/human_pretrain.json file to update paths and other relevant settings.
In scripts/run_human_pretrain.sh, make sure to input your Hugging Face token and WANDB API key for proper authentication and logging.
Once the configuration is ready, you can start pretraining by running:
# Distributed training
bash scripts/run_human_pretrain.sh

If you find our work useful in your research, please cite:
@article{li2025vitra,
title={Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos},
author={Qixiu Li and Yu Deng and Yaobo Liang and Lin Luo and Lei Zhou and Chengtang Yao and Lingqi Zeng and Zhiyuan Feng and Huizhi Liang and Sicheng Xu and Yizhong Zhang and Xi Chen and Hao Chen and Lily Sun and Dong Chen and Jiaolong Yang and Baining Guo},
journal={arXiv preprint arXiv:2510.21571},
year={2025}
}
