Zimu Zhang1*
Yucheng Zhang1*
Xiyan Xu1
Ziyin Wang1
Sirui Xu1†
Kai Zhou2,3
Bing Zhou3
Chuan Guo3
Jian Wang3
Yu-Xiong Wang1‡
Liang-Yan Gui1‡
1University of Illinois Urbana-Champaign
2Specs Inc.
3Snap Inc.
*Equal Contribution
†Project Lead
‡Equal Advising
Installation steps

- Create a conda environment:

```shell
conda create -n HandX python=3.11 -y
conda activate HandX
```

- Install PyTorch 2.6.0 (CUDA 12.4):

```shell
pip install torch==2.6.0+cu124 torchvision==0.21.0+cu124 torchaudio==2.6.0+cu124 --index-url https://download.pytorch.org/whl/cu124
```

- Install the required dependencies:

```shell
pip install -r requirements.txt
```

- Install PyTorch3D:

```shell
pip install --no-build-isolation "git+https://github.com/facebookresearch/pytorch3d.git"
```

Place the MANO model files `MANO_LEFT.pkl` and `MANO_RIGHT.pkl` into the `diffusion/body_models/mano/` directory:
```
diffusion/
└── body_models/
    └── mano/
        ├── MANO_LEFT.pkl
        └── MANO_RIGHT.pkl
```
Download the HandX sample data archive from this link; the full base data will be released here shortly after legal review. This archive contains data from all datasets except ARCTIC and H2O, which must be obtained separately due to their redistribution policies.
Extract the archive and place the four included files under `data/handx/`:
```
data/
└── handx/
    ├── train_can_pos_all_wotextfeat.npz
    ├── train_mano.npz
    ├── test_can_pos_all_wotextfeat.npz
    └── test_mano.npz
```
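The `.npz` archives above aren't documented key-by-key here, but they can be inspected directly with NumPy. A small sketch (the `describe_npz` helper and the example key are illustrative, not the actual archive contents):

```python
import numpy as np

# Inspect the contents of an .npz archive without loading everything.
# The keys and shapes depend on the archive; this helper just lists them.
def describe_npz(path):
    with np.load(path, allow_pickle=True) as data:
        return {key: data[key].shape for key in data.files}

# Example with a small archive written on the fly (hypothetical key name):
np.savez("example.npz", joints=np.zeros((60, 2, 21, 3)))
print(describe_npz("example.npz"))  # {'joints': (60, 2, 21, 3)}
```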
The data/processing/ directory contains scripts to process ARCTIC and H2O data and merge them into the base dataset.
Directory structure
```
data/processing/
├── H2O/
│   ├── raw/
│   │   ├── subject1/
│   │   ├── subject2/
│   │   ├── subject3/
│   │   └── subject4/
│   ├── skeleton/                 # step 1 output
│   ├── skeleton_canonicalized/   # step 2 output
│   ├── skeleton_split/           # step 3 output
│   ├── text/
│   └── mano/                     # step 4 output
└── ARCTIC/
    ├── raw_seqs/
    │   ├── s01/
    │   ├── s02/
    │   ├── s04/
    │   └── s05/ ... s10/
    ├── temp/                     # step 1 output (intermediate)
    ├── skeleton/                 # step 2 output
    ├── skeleton_canonicalized/   # step 3 output
    ├── skeleton_split/           # step 4 output
    ├── text/
    └── mano/                     # step 5 output
```
H2O

1. Download the H2O dataset from https://h2odataset.ethz.ch. You need the following files:

   - subject1_pose_v1_1.tar
   - subject2_pose_v1_1.tar
   - subject3_pose_v1_1.tar
   - subject4_pose_v1_1.tar

   Unzip them and place the contents under `data/processing/H2O/raw/`.

2. Download the annotation texts from this link and place the JSON files under `data/processing/H2O/text/`.

3. Process the data by running the following commands in order:

   ```shell
   cd data/processing/H2O
   # Step 1: Extract skeleton from raw H2O data
   python extract_h2o_data.py
   # Step 2: Canonicalize skeleton orientation
   python canonicalize_pose.py
   # Step 3: Split full sequences into 60-frame windows
   python split_skeleton.py
   # Step 4: Fit MANO parameters to skeleton windows (requires GPU)
   cd ..
   python skeleton2mano.py --input_dir H2O/skeleton_split --output_dir H2O/mano
   ```
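The 60-frame windowing can be pictured as follows. This is a generic sketch, not the repository's `split_skeleton.py`; it assumes non-overlapping windows and drops any trailing remainder shorter than 60 frames:

```python
import numpy as np

WINDOW = 60  # frames per window, matching the 60-frame windows above

def split_into_windows(seq, window=WINDOW):
    """Split a (T, ...) motion sequence into non-overlapping (window, ...) chunks.

    Trailing frames that do not fill a whole window are dropped.
    """
    n = seq.shape[0] // window
    return [seq[i * window:(i + 1) * window] for i in range(n)]

# A 150-frame clip of 2 hands x 21 joints x 3 coords yields two 60-frame windows.
seq = np.zeros((150, 2, 21, 3))
windows = split_into_windows(seq)
print(len(windows), windows[0].shape)  # 2 (60, 2, 21, 3)
```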
ARCTIC

1. Download the ARCTIC dataset from https://arctic.is.tue.mpg.de. You need the raw_seqs files. Place them under `data/processing/ARCTIC/raw_seqs/`.

2. Download the annotation texts from this link and place the JSON files under `data/processing/ARCTIC/text/`.

3. Process the data by running the following commands in order:

   ```shell
   cd data/processing/ARCTIC
   # Step 1: Extract MANO parameters from raw ARCTIC data
   python mano_extract.py
   # Step 2: Convert MANO parameters to skeleton
   python mano_to_skeleton.py
   # Step 3: Canonicalize skeleton orientation
   python canonicalize_pose.py
   # Step 4: Split full sequences into 60-frame windows
   python split_skeleton.py
   # Step 5: Fit MANO parameters to skeleton windows (requires GPU)
   cd ..
   python skeleton2mano.py --input_dir ARCTIC/skeleton_split --output_dir ARCTIC/mano
   ```
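The canonicalization steps above remove the global placement of each sequence. As background only (the actual `canonicalize_pose.py` may use a different convention), one common choice is to translate the whole clip so the first frame's wrist midpoint sits at the origin:

```python
import numpy as np

def canonicalize_translation(seq, wrist_joint=0):
    """Remove global translation from a (T, 2, 21, 3) bimanual sequence.

    Centers the clip on the midpoint of the two wrists in the first frame.
    Illustration only: joint index 0 as the wrist and a translation-only
    canonicalization are assumptions, not the repository's exact convention.
    """
    wrists = seq[0, :, wrist_joint, :]   # (2, 3): left and right wrist
    origin = wrists.mean(axis=0)         # midpoint between the wrists
    return seq - origin                  # broadcast over all frames and joints

seq = np.random.default_rng(0).normal(size=(60, 2, 21, 3)) + 5.0
can = canonicalize_translation(seq)
# After canonicalization the first-frame wrist midpoint is at the origin.
print(np.allclose(can[0, :, 0, :].mean(axis=0), 0.0))  # True
```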
Merging

After processing both datasets, merge them into the base dataset:

```shell
cd data/processing
python merge_arctic_h2o.py
```

This appends the ARCTIC and H2O data to the existing `.npz` files in `data/handx/`, producing the complete dataset.
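Conceptually, the merge concatenates per-sample arrays across files. A minimal sketch, assuming samples are stacked along the first axis (the real `merge_arctic_h2o.py` handles the actual key layout):

```python
import numpy as np

def append_npz(base_path, extra_path, out_path):
    """Concatenate every shared array key of two .npz files along axis 0.

    Generic sketch; assumes both files store samples stacked on the first axis.
    """
    with np.load(base_path) as base, np.load(extra_path) as extra:
        merged = {k: np.concatenate([base[k], extra[k]], axis=0)
                  for k in base.files if k in extra.files}
    np.savez(out_path, **merged)

# Two toy files with 4 and 2 samples merge into 6 (hypothetical key name).
np.savez("base.npz", joints=np.zeros((4, 2, 21, 3)))
np.savez("extra.npz", joints=np.ones((2, 2, 21, 3)))
append_npz("base.npz", "extra.npz", "merged.npz")
print(np.load("merged.npz")["joints"].shape)  # (6, 2, 21, 3)
```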
(Optional) Dataset Contact Quality Evaluation

The script `compute_contact_metric.py` evaluates contact quality metrics for bimanual skeleton data; it is used to assess data quality in our paper. It computes three core metrics:

- Contact Ratio: Proportion of frames with hand-hand contact
- Avg Contact Duration: Mean length of contact segments (seconds)
- Contact Frequency: Number of contact events per second

```shell
cd data
python scripts/evaluation/compute_contact_metric.py /path/to/motion.npy [fps]
```

The input `.npy` file should have shape `(T, 2, 21, 3)`, where `T` is the number of frames, 2 represents the left/right hands, 21 is the number of joints per hand, and 3 is the xyz coordinates. It can be extracted from the dataset `.npz` files.
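As an illustration of how such metrics can be computed from a `(T, 2, 21, 3)` array, here is a self-contained sketch. The contact definition used here (minimum left-right joint distance under a 1 cm threshold) is an assumption, not necessarily what `compute_contact_metric.py` uses:

```python
import numpy as np

def contact_metrics(motion, fps=30, threshold=0.01):
    """Sketch of the three contact metrics for a (T, 2, 21, 3) sequence.

    Contact is assumed to mean the minimum left-right joint distance falls
    below `threshold` (meters); the repository script may differ.
    """
    left, right = motion[:, 0], motion[:, 1]                  # (T, 21, 3) each
    # Pairwise distances between all left and right joints, per frame.
    dists = np.linalg.norm(left[:, :, None] - right[:, None, :], axis=-1)
    in_contact = dists.min(axis=(1, 2)) < threshold           # (T,) bool
    # Count contact segments: transitions from no-contact to contact.
    padded = np.concatenate([[False], in_contact])
    starts = np.sum(~padded[:-1] & padded[1:])
    T = len(in_contact)
    contact_ratio = in_contact.mean()
    avg_duration = (in_contact.sum() / fps / starts) if starts else 0.0
    frequency = starts / (T / fps)
    return contact_ratio, avg_duration, frequency

# Toy clip: the hands touch for frames 10-19 of 60 at 30 fps.
motion = np.zeros((60, 2, 21, 3))
motion[:, 1, :, 0] = 1.0          # right hand 1 m away ...
motion[10:20, 1, :, 0] = 0.0      # ... except during one 10-frame contact
ratio, dur, freq = contact_metrics(motion, fps=30)
print(ratio, dur, freq)  # contact in 1/6 of frames, 1/3 s segments, 0.5 events/s
```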
The autoregressive model uses a 288-dim motion representation. To convert the HandX dataset (`data/handx/`) into this format:

```shell
cd data/processing
python convert_to_autoregressive.py
```

This reads `data/handx/{train,test}_can_pos_all_wotextfeat.npz` and `{train,test}_mano.npz`, and outputs to `autoregressive/data/`:
```
autoregressive/data/
├── train_full_correct_duet_scalar_rot.npz
├── test_full_correct_duet_scalar_rot.npz
├── texts_all.pkl
├── mean_correct_duet_scalar_rot.npy
└── std_correct_duet_scalar_rot.npy
```
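The `mean_*` / `std_*` files are presumably per-dimension statistics for z-score normalization of the 288-dim representation; whether the repository normalizes exactly this way is an assumption. A minimal sketch:

```python
import numpy as np

# Z-score (de)normalization with per-dimension statistics, as is conventional
# for mean_*.npy / std_*.npy files. The eps guard avoids division by zero
# for constant dimensions.
def normalize(x, mean, std, eps=1e-8):
    return (x - mean) / (std + eps)

def denormalize(x, mean, std, eps=1e-8):
    return x * (std + eps) + mean

rng = np.random.default_rng(0)
motion = rng.normal(size=(60, 288))          # one clip in the 288-dim format
mean, std = motion.mean(axis=0), motion.std(axis=0)
z = normalize(motion, mean, std)
print(np.allclose(denormalize(z, mean, std), motion))  # True
```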
To write to a custom directory, use `--output_dir`:

```shell
python convert_to_autoregressive.py --output_dir /path/to/output
```

All training and evaluation commands below should be run from the `diffusion/` directory:

```shell
cd diffusion
sh scripts/diffusion/train/train.sh
```

This repository uses Hydra to manage experiment configurations. You can modify the configuration files located in the `conf/` folder.
The following scripts are available for versatile generation tasks:
| Script | Task |
|---|---|
| `run_text2motion.py` | Text-to-motion generation (unconstrained) |
| `run_fix_lefthand.py` | Fix left hand, generate right hand |
| `run_wrist_traj.py` | Fix wrist trajectory, generate hand motion |
| `run_inbetweening.py` | Motion in-betweening (fix first/last frames) |
| `run_contact_keyframe.py` | Generation conditioned on contact keyframes |
| `run_two_stage.py` | Long-horizon generation |
Option 1: Edit Script Configuration (for the first 4 scripts)
Open the corresponding script and update the CHECKPOINTS configuration:
```python
CHECKPOINTS = [
    {
        'name': 'your_checkpoint_name',
        'checkpoint_dir': '/path/to/your/checkpoint',
        'model_name': 'model000220000.pt',
        'num_val_samples': 256,
        'data_dir': '/path/to/your/data',
        'data_loader': 'src.diffusion.data_loader.handx.HandXDataset',
        'data_file_name': 'can_pos_all_wotextfeat.npz',
        'eval_folder_name': 'generate_xxx',
        'num_generated': 4,
        'description': 'Your description'
    }
]
```

Then run:

```shell
python scripts/evaluation/run_text2motion.py
```

Option 2: Command Line Arguments (for the last 2 scripts)
```shell
python scripts/evaluation/run_contact_keyframe.py \
    --checkpoint_dir /path/to/your/checkpoint \
    --model_name model000070000.pt \
    --data_dir /path/to/your/data \
    --num_val_samples 256
```

```shell
python scripts/evaluation/run_two_stage.py \
    --checkpoint_dir /path/to/your/checkpoint \
    --model_name model000070000.pt \
    --data_dir /path/to/your/data \
    --num_val_samples 256
```

All training and evaluation commands below should be run from the `autoregressive/` directory:

```shell
cd autoregressive
```

Run the script to download dependency materials:

```shell
bash prepare/download_glove.sh
```

Train the tokenizer:

```shell
bash scripts/train/train_tokenizer.sh
```

If you don't want to use the wavelet transformation, simply delete the `--use_patcher`, `--patch_size`, and `--patch_method` arguments.
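As background for the `--use_patcher` option: a wavelet transform splits a time signal into coarse and fine coefficient bands before patching. The repository's actual `--patch_method` may differ; a single-level Haar decomposition looks like this:

```python
import numpy as np

def haar_1d(x):
    """Single-level Haar wavelet transform along the first (time) axis.

    Background illustration only; assumes an even number of frames.
    """
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2)     # coarse (approximation) coefficients
    high = (even - odd) / np.sqrt(2)    # fine (detail) coefficients
    return low, high

x = np.arange(8, dtype=float)
low, high = haar_1d(x)
# The transform is invertible: interleaving the bands reconstructs the signal.
recon = np.empty_like(x)
recon[0::2] = (low + high) / np.sqrt(2)
recon[1::2] = (low - high) / np.sqrt(2)
print(np.allclose(recon, x))  # True
```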
Codebook size is controlled by `--nb-code`. Modify it in `train_tokenizer.sh`:

```shell
--nb-code 4096  # default; can be changed to 1024, 8192, 65536, etc.
```

First, run the following command to infer all of the motion codes with the trained tokenizer. Change the `--resume-pth` argument to the path of your tokenizer checkpoint.

```shell
bash scripts/train/train_t2m_get_codes.sh
```

Then train the text-to-motion model:

```shell
bash scripts/train/train_t2m_4096.sh
```

Model size is controlled by `--pretrained_llama` in `train_t2m_4096.sh`. Available sizes:
| Name | Layers | Heads | Embed Dim |
|---|---|---|---|
| 44M | 8 | 8 | 512 |
| 111M | 12 | 12 | 768 |
| 222M | 16 | 16 | 1024 |
| 343M | 24 | 16 | 1024 |
| 775M | 36 | 20 | 1280 |
| 1B | 48 | 24 | 1536 |
| 3B | 24 | 32 | 3200 |
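The listed sizes are roughly consistent with the standard back-of-the-envelope estimate of about 12·L·d² weights for a decoder-only transformer with a 4x MLP expansion; the estimate ignores embeddings, norms, and the output head, so it undershoots the smaller models where embedding tables dominate. A quick check:

```python
# Rough decoder-only transformer parameter estimate: each layer has about
# 4*d^2 attention weights and 8*d^2 MLP weights (4x hidden expansion),
# i.e. ~12*L*d^2 in total. This is a sanity check, not the exact count
# for these models.
def approx_params(layers, embed_dim):
    return 12 * layers * embed_dim ** 2

for name, layers, dim in [("111M", 12, 768), ("775M", 36, 1280), ("3B", 24, 3200)]:
    print(name, f"{approx_params(layers, dim) / 1e6:.0f}M")
# prints: 111M 85M / 775M 708M / 3B 2949M
```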
Example: to use the smaller 44M model, change the flag in the script:

```shell
--pretrained_llama 44M
```

Note: when changing `--nb-code` (codebook size), make sure the same value is used consistently across the tokenizer training, get_codes, and train_t2m scripts.
To generate text-to-motion samples for evaluation:

```shell
bash scripts/eval/generate_for_eval.sh
```

Both diffusion and autoregressive models generate per-sample PKL files in the same format. A unified evaluation script computes all metrics on these files.
Download the evaluation encoder checkpoints from this link and place them under evaluation/checkpoints/:
```
evaluation/
└── checkpoints/
    ├── epoch=269.ckpt
    ├── mean_can_pos.npy
    └── std_can_pos.npy
```
Optionally, you can train your own evaluation encoder:

```shell
cd evaluation
sh train_tma.sh
```

To run the evaluation:

```shell
cd evaluation
python run_evaluation.py --output_dir /path/to/pkl/files
```

For diffusion, point `--output_dir` to the directory containing the generated `val_sample_*.pkl` files. For autoregressive, point it to the output of `generate_for_eval.sh`.
Metrics and options
Evaluation metrics:
- FID: Frechet Inception Distance (distribution similarity)
- R-precision: Text-motion matching accuracy (Top-1, Top-2, Top-3)
- Matching Score: Text-motion embedding distance
- MPJPE: Mean Per-Joint Position Error (mm)
- Diversity: Variation across generated samples
- Multimodality: Variation across different generations of the same text (diffusion only, since the autoregressive model is deterministic)
- Interaction: Intra/Inter hand contact precision, recall, F1
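R-precision can be sketched as a retrieval test over a batch of text/motion embeddings; a generic illustration only (the actual encoder and batching protocol are defined by `run_evaluation.py`):

```python
import numpy as np

def r_precision(text_emb, motion_emb, top_k=3):
    """Generic R-precision sketch: for each text, rank all motions in the
    batch by Euclidean distance and check whether the matching motion
    (same index) appears in the top-k. Returns top-1..top-k accuracies.
    """
    # (N, N) distance matrix between every text and every motion embedding.
    dists = np.linalg.norm(text_emb[:, None] - motion_emb[None, :], axis=-1)
    ranks = np.argsort(dists, axis=1)                 # per-text motion ranking
    match = ranks == np.arange(len(text_emb))[:, None]
    return [float(match[:, :k].any(axis=1).mean()) for k in range(1, top_k + 1)]

# Perfectly aligned embeddings give top-1 = top-2 = top-3 = 1.0.
emb = np.random.default_rng(0).normal(size=(32, 16))
print(r_precision(emb, emb))  # [1.0, 1.0, 1.0]
```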
Options:

```shell
python run_evaluation.py \
    --output_dir /path/to/pkl/files \
    --batch_size 32 \
    --delete_pkl \
    --results_file eval.json
```

`--delete_pkl` deletes the PKL files after loading to save disk space; `--results_file` sets a custom output filename. By default, results are saved as `evaluation_results.json` in the current directory.
If you find this repository useful for your work, please cite:

```bibtex
@inproceedings{zhang2026handx,
  title     = {HandX: Scaling Bimanual Motion and Interaction Generation},
  author    = {Zhang, Zimu and Zhang, Yucheng and Xu, Xiyan and Wang, Ziyin and Xu, Sirui and Zhou, Kai and Zhou, Bing and Guo, Chuan and Wang, Jian and Wang, Yu-Xiong and Gui, Liang-Yan},
  booktitle = {CVPR},
  year      = {2026},
}
```

Please also consider citing the datasets used in this work:
```bibtex
@inproceedings{fu2025gigahands,
  title     = {{GigaHands}: A Massive Annotated Dataset of Bimanual Hand Activities},
  author    = {Fu, Rao and Zhang, Dingxi and Jiang, Alex and Fu, Wanjia and Funk, Austin and Ritchie, Daniel and Sridhar, Srinath},
  booktitle = {CVPR},
  year      = {2025},
}
@inproceedings{banerjee2025hot3d,
  title     = {{HOT3D}: Hand and Object Tracking in {3D} from Egocentric Multi-View Videos},
  author    = {Banerjee, Prithviraj and Shkodrani, Sindi and Moulon, Pierre and Hampali, Shreyas and Han, Shangchen and Zhang, Fan and Zhang, Linguang and Fountain, Jade and Miller, Edward and Basol, Selen and others},
  booktitle = {CVPR},
  year      = {2025},
}
@inproceedings{fan2023arctic,
  title     = {{ARCTIC}: A Dataset for Dexterous Bimanual Hand-Object Manipulation},
  author    = {Fan, Zicong and Taheri, Omid and Tzionas, Dimitrios and Kocabas, Muhammed and Kaufmann, Manuel and Black, Michael J. and Hilliges, Otmar},
  booktitle = {CVPR},
  year      = {2023},
}
@inproceedings{kwon2021h2o,
  title     = {{H2O}: Two Hands Manipulating Objects for First Person Interaction Recognition},
  author    = {Kwon, Taein and Tekin, Bugra and St{\"u}hmer, Jan and Bogo, Federica and Pollefeys, Marc},
  booktitle = {ICCV},
  year      = {2021},
}
@inproceedings{wang2023holoassist,
  title     = {{HoloAssist}: An Egocentric Human Interaction Dataset for Interactive {AI} Assistants in the Real World},
  author    = {Wang, Xin and Kwon, Taein and Rad, Mahdi and Pan, Bowen and Chakraborty, Ishani and Andrist, Sean and Bohus, Dan and Feniello, Ashley and Tekin, Bugra and Frujeri, Felipe Vieira and others},
  booktitle = {ICCV},
  year      = {2023},
}
```