PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation

* Equal contribution.

Generating realistic human motion is a central yet unsolved challenge in video generation. While reinforcement learning (RL)-based post-training has driven recent gains in general video quality, extending it to human motion remains bottlenecked by a reward signal that cannot reliably score motion realism. Existing video rewards primarily rely on 2D perceptual signals, without explicitly modeling the 3D body state, contact, and dynamics underlying articulated human motion, and often assign high scores to videos with floating bodies or physically implausible movements. To address this, we propose PhyMotion, a structured, fine-grained motion reward that grounds recovered 3D human trajectories in a physics simulator and evaluates motion quality along multiple dimensions of physical feasibility. Concretely, we recover SMPL body meshes from generated videos, retarget them onto a humanoid in the MuJoCo physics simulator, and evaluate the resulting motion along three axes: kinematic plausibility, contact and balance consistency, and dynamic feasibility. Each component provides a continuous and interpretable signal tied to a specific aspect of motion quality, allowing the reward to capture which aspects of motion are physically correct or violated. Experiments show that PhyMotion achieves stronger correlation with human judgments than existing reward formulations. These gains carry over to RL-based post-training, where optimizing PhyMotion leads to larger and more consistent improvements than optimizing existing rewards, improving motion realism across both autoregressive and bidirectional video generators under both automatic metrics and blind human evaluation (+68 Elo gain). Ablations show that the three axes provide complementary supervision signals, while the reward preserves overall video generation quality with only modest training overhead.

Pretrained Checkpoints and Data

| Asset | Hugging Face | Notes |
| --- | --- | --- |
| PhyMotion-CausalForcing-1.3B LoRA | 6kplus/PhyMotion-CausalForcing-1.3B (model) | LoRA adapter for the Causal Forcing 1.3B base, post-trained with the PhyMotion reward. |
| MotionX prompts (train 21,348 / test 1,123) | 6kplus/PhyMotion-MotionX-Prompts (dataset) | train.txt is used for RL rollout during post-training; test.txt is used for evaluation. |

Download both:

# LoRA adapter
huggingface-cli download 6kplus/PhyMotion-CausalForcing-1.3B \
  --local-dir checkpoints/phymotion-causalforcing

# Train + test prompt splits
huggingface-cli download 6kplus/PhyMotion-MotionX-Prompts \
  --repo-type dataset --local-dir dataset/motionx

Environment Setup

  1. Create the Python environment and install dependencies. requirements.txt covers the full stack, including MuJoCo 3.3.6 and the smplx Python package, so neither needs a separate install step.
conda create -n phymotion python=3.10 -y
conda activate phymotion
pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
pip install flash-attn==2.7.4.post1 --no-build-isolation

Quickly sanity-check the environment:

python -c "import torch, flash_attn, mujoco, smplx; \
print(f'torch={torch.__version__} cuda={torch.cuda.is_available()}, flash_attn={flash_attn.__version__}, mujoco={mujoco.__version__}')"
# Expected output:
# torch=2.6.0+cu124 cuda=True, flash_attn=2.7.4.post1, mujoco=3.3.6
  2. Install GVHMR. The reward calls GVHMR in-process to recover SMPL-X meshes from generated frames.
git clone https://github.com/zju3dv/GVHMR.git ~/GVHMR
export GVHMR_ROOT=~/GVHMR

Download the GVHMR checkpoint bundle (~9 GB) from Hugging Face:

for ckpt in \
  gvhmr/gvhmr_siga24_release.ckpt \
  hmr2/epoch=10-step=25000.ckpt \
  vitpose/vitpose-h-multi-coco.pth \
  yolo/yolov8x.pt; do
  huggingface-cli download camenduru/GVHMR "$ckpt" \
    --local-dir "$GVHMR_ROOT/inputs/checkpoints"
done

SMPL-X body models (required): The GVHMR bundle does not include the SMPL-X body model files — these must be obtained separately.

  1. Register (free academic license) at https://smpl-x.is.tue.mpg.de/ and download the SMPL-X model zip.
  2. Extract and place the following three files:
$GVHMR_ROOT/inputs/checkpoints/body_models/smplx/SMPLX_NEUTRAL.npz
$GVHMR_ROOT/inputs/checkpoints/body_models/smplx/SMPLX_MALE.npz
$GVHMR_ROOT/inputs/checkpoints/body_models/smplx/SMPLX_FEMALE.npz

The training script and reward module read GVHMR_ROOT from the environment.

After GVHMR's pip dependencies are resolved, pin scipy to avoid a numpy/ufunc incompatibility:

pip install --force-reinstall scipy==1.15.2

The humanoid MJCF model used to retarget SMPL is bundled inside this repo (astrolabe/scorers/video/), so no additional asset is required.

  3. Download the Wan2.1 T2V-1.3B base components (transformer config, VAE, and UMT5-XXL text encoder; ~17 GB).
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir wan_models/Wan2.1-T2V-1.3B
  4. Download the Causal Forcing 1.3B sampler weights (the autoregressive distilled version of Wan2.1 T2V-1.3B; ~5.3 GB). PhyMotion's RL post-training starts from this checkpoint.
huggingface-cli download zhuhz22/Causal-Forcing \
  chunkwise/causal_forcing.pt \
  --local-dir checkpoints/casualforcing
# Result: checkpoints/casualforcing/chunkwise/causal_forcing.pt
  5. (Optional) Download our pretrained PhyMotion-CausalForcing-1.3B LoRA and the MotionX prompt splits from Hugging Face:
# LoRA adapter (700 MB)
huggingface-cli download 6kplus/PhyMotion-CausalForcing-1.3B \
  --local-dir checkpoints/phymotion-causalforcing

# Prompt splits: train.txt (21,348) and test.txt (1,123)
huggingface-cli download 6kplus/PhyMotion-MotionX-Prompts \
  --repo-type dataset --local-dir dataset/motionx

To train on your own prompt list instead, drop your one-prompt-per-line files at dataset/motionx/train.txt and dataset/motionx/test.txt.
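
Before launching training, it can help to confirm that the files from the steps above landed where the scripts expect them. The following is a minimal sketch and not part of the repository: the helper name missing_assets is hypothetical, and it assumes that huggingface-cli kept the repository-relative file names under each --local-dir.

```python
import os
from pathlib import Path

# Relative paths copied from the download commands and the SMPL-X step above.
GVHMR_FILES = [
    "gvhmr/gvhmr_siga24_release.ckpt",
    "hmr2/epoch=10-step=25000.ckpt",
    "vitpose/vitpose-h-multi-coco.pth",
    "yolo/yolov8x.pt",
    "body_models/smplx/SMPLX_NEUTRAL.npz",
    "body_models/smplx/SMPLX_MALE.npz",
    "body_models/smplx/SMPLX_FEMALE.npz",
]
PROJECT_FILES = [
    "checkpoints/casualforcing/chunkwise/causal_forcing.pt",
    "dataset/motionx/train.txt",
    "dataset/motionx/test.txt",
]


def missing_assets(gvhmr_root, project_root="."):
    """Return the expected files that are not present on disk (hypothetical helper)."""
    ckpt_dir = Path(gvhmr_root).expanduser() / "inputs" / "checkpoints"
    expected = [ckpt_dir / rel for rel in GVHMR_FILES]
    expected += [Path(project_root) / rel for rel in PROJECT_FILES]
    return [str(path) for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Falls back to the clone location used in step 2 when GVHMR_ROOT is unset.
    root = os.environ.get("GVHMR_ROOT", "~/GVHMR")
    missing = missing_assets(root)
    for path in missing:
        print(f"missing: {path}")
    print("all assets found" if not missing else f"{len(missing)} file(s) missing")
```

Run it from the repository root with GVHMR_ROOT exported; an empty list means every file named in the setup steps is in place.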

Stage 1: PhyMotion Reward

The reward grounds each generated video in a 3D body and scores it along three feasibility axes (kinematic, contact, dynamic). It is implemented as a single function in astrolabe/rewards.py.

| Axis | Sub-scores |
| --- | --- |
| Kinematic | joint velocity, joint acceleration, self-penetration |
| Contact | foot slip, ground penetration, foot float, balance |
| Dynamic | joint torque, ground reaction force, metabolic effort |

The final reward is the mean of the three axes. All feasibility code (joint-based kinematics and MuJoCo-based contact / dynamics) lives in a single file: astrolabe/scorers/video/smpl_feasibility.py.
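
The aggregation can be illustrated with a small sketch. Only the last step (the mean over the three axes) is stated above; how sub-scores are folded into each axis is an assumption here (an unweighted mean), the function names are hypothetical, and the numbers are placeholders rather than real measurements.

```python
from statistics import mean

# Sub-score names follow the table above; values are made-up placeholders in [0, 1].
SUB_SCORES = {
    "kinematic": {"joint_velocity": 0.9, "joint_acceleration": 0.8, "self_penetration": 1.0},
    "contact": {"foot_slip": 0.7, "ground_penetration": 0.9, "foot_float": 0.6, "balance": 0.8},
    "dynamic": {"joint_torque": 0.5, "ground_reaction_force": 0.7, "metabolic_effort": 0.6},
}


def axis_scores(sub_scores):
    # Assumption: each axis is the unweighted mean of its sub-scores.
    return {axis: mean(values.values()) for axis, values in sub_scores.items()}


def phymotion_reward(sub_scores):
    # Stated in the text: the final reward is the mean of the three axes.
    return mean(axis_scores(sub_scores).values())


per_axis = axis_scores(SUB_SCORES)
reward = phymotion_reward(SUB_SCORES)
print({axis: round(score, 3) for axis, score in per_axis.items()}, round(reward, 3))
```

Keeping the per-axis values alongside the scalar makes it easy to see which axis pulled a low-scoring video down.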

To wire the reward into a config:

config.reward_fn = {"phymotion_score": 1.0}

To combine with a perceptual reward (e.g. HPSv3) for balanced training:

config.reward_fn = {
    "phymotion_score":   1.0,
    "video_hpsv3_local": 1.0,
}
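
One plausible reading of this dictionary is a weighted sum over named rewards. The sketch below assumes exactly that; the text does not say whether the trainer normalises or rescales the individual rewards first. combine_rewards is a hypothetical helper and the scores are placeholders.

```python
def combine_rewards(scores, weights):
    """Weighted sum of per-reward scores (assumed semantics of config.reward_fn)."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"no score computed for: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)


reward_fn = {"phymotion_score": 1.0, "video_hpsv3_local": 1.0}
scores = {"phymotion_score": 0.75, "video_hpsv3_local": 0.5}  # placeholder values
total = combine_rewards(scores, reward_fn)
print(total)
```

If the two rewards live on different scales, equal weights do not imply equal influence, so the weights may need tuning.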

Stage 2: RL Post-Training

Launch RL post-training of Causal Forcing 1.3B with the PhyMotion reward.

export GVHMR_ROOT=/path/to/GVHMR
torchrun --nproc_per_node=8 scripts/train_nft_wan.py \
  --config configs/nft_casual_forcing.py:casual_forcing_video_phymotion
  • --nproc_per_node: number of GPUs on a single node.

  • --config: a <file>:<entry> selector. The entry casual_forcing_video_phymotion uses the PhyMotion reward (see configs/nft_casual_forcing.py for other entries that mix in perceptual rewards).
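
To make the selector format concrete, here is a minimal sketch of how a <file>:<entry> string could be resolved. It assumes that the entry is a zero-argument function defined in the config file; the loader in scripts/train_nft_wan.py may work differently, and both helper names are hypothetical.

```python
import importlib.util
from pathlib import Path


def split_selector(selector):
    """Split '<file>:<entry>' into (file, entry); rpartition tolerates ':' inside the path."""
    path, sep, entry = selector.rpartition(":")
    if not sep or not path or not entry:
        raise ValueError(f"expected <file>:<entry>, got {selector!r}")
    return path, entry


def load_entry(selector):
    """Import the config file and call the named entry (assumed to take no arguments)."""
    path, entry = split_selector(selector)
    spec = importlib.util.spec_from_file_location(Path(path).stem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, entry)()


print(split_selector("configs/nft_casual_forcing.py:casual_forcing_video_phymotion"))
```

Splitting on the last colon keeps the file part intact even when the path itself contains a colon.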

Outputs are written to logs/nft/<base_model>/<run_name>_<timestamp>/:

  • checkpoints/checkpoint-<step>/lora/ — PEFT LoRA adapter (rank 256 on CausalWanAttentionBlock).

  • optimizer.pt, scaler.pt, and W&B / TensorBoard logs.
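
When a run has written several checkpoints, the newest one can be located by comparing step numbers numerically rather than alphabetically. This is a small sketch based only on the directory layout described above; latest_checkpoint is a hypothetical helper, not a function from the repository.

```python
import re
from pathlib import Path

_STEP = re.compile(r"checkpoint-(\d+)")


def latest_checkpoint(run_dir):
    """Return the checkpoint-<step> directory with the largest step, or None."""
    ckpt_root = Path(run_dir) / "checkpoints"
    if not ckpt_root.is_dir():
        return None
    found = []
    for child in ckpt_root.iterdir():
        match = _STEP.fullmatch(child.name)
        if match and child.is_dir():
            found.append((int(match.group(1)), child))
    if not found:
        return None
    # Compare on the integer step so that checkpoint-120 beats checkpoint-30.
    return max(found, key=lambda item: item[0])[1]
```

The returned path can be passed to the inference script as --lora_path.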

Stage 3: Inference

Roll out a trained LoRA on a list of prompts.

# Using the released PhyMotion-CausalForcing-1.3B LoRA 
torchrun --nproc_per_node=1 scripts/inference_wan.py \
  --base_model checkpoints/casualforcing/chunkwise/causal_forcing.pt \
  --lora_path  checkpoints/phymotion-causalforcing \
  --prompt_file prompts/sample.txt \
  --output_dir outputs/test \
  --num_frames 45 --height 480 --width 832 \
  --guidance_scale 3.0 \
  --denoising_steps "1000,750,500,250" \
  --num_frame_per_block 3 \
  --mixed_precision bf16 --seed 42

To use your own freshly trained LoRA, point --lora_path at your checkpoint dir:

--lora_path  logs/nft/wan_casual_chunk/casual_forcing_video_phymotion_<TS>/checkpoints/checkpoint-<step>
  • --base_model: path to the Causal Forcing 1.3B checkpoint.

  • --lora_path: a checkpoint-<step>/ folder or its lora/ subdir.

  • --prompt_file: a one-prompt-per-line text file.

  • --output_dir: directory for the generated mp4s. Expect ~5 seconds per video on a single A100.
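
A prompt file in the one-prompt-per-line format can be read as sketched below. Skipping blank lines and trimming whitespace are assumptions made for this example; scripts/inference_wan.py may treat such lines differently. read_prompts is a hypothetical helper.

```python
from pathlib import Path


def read_prompts(path):
    """Return the non-empty, whitespace-trimmed lines of a one-prompt-per-line file."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Checking len(read_prompts("dataset/motionx/test.txt")) against the 1,123 test prompts listed above is a quick way to catch a truncated download.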

Hardware and Reference Runtimes

Our reported numbers were produced on:

  • Hardware: 1 node with 8× NVIDIA A100 80 GB
  • CUDA: 12.4; Python: 3.10; PyTorch: 2.6.0; flash-attn: 2.7.4.post1.

Approximate per-stage compute / wall-clock:

| Stage | Hardware | Wall clock |
| --- | --- | --- |
| Stage 2 (RL post-training) | 8× A100 80 GB | ~60 hours for 330 steps (≈ 10 min/step at batch 8) |
| Stage 3 (inference, 45 frames @ 480×832) | 1× A100 / RTX 4090 | ~5 seconds per video |
| Stage 1 (reward, 1 video) | 1× A100 (GVHMR + MuJoCo) | ~3 seconds per video |
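
These figures can be turned into rough budgets with simple arithmetic. The sketch below uses only numbers quoted in this README and assumes strictly sequential processing on one GPU with no model-loading or I/O overhead, so real runs will take longer.

```python
STEPS = 330               # RL post-training steps
MINUTES_PER_STEP = 10     # approximate per-step time from the table
TEST_PROMPTS = 1_123      # size of test.txt
GEN_SECONDS = 5           # Stage 3, per video
REWARD_SECONDS = 3        # Stage 1, per video

train_hours = STEPS * MINUTES_PER_STEP / 60
eval_hours = TEST_PROMPTS * (GEN_SECONDS + REWARD_SECONDS) / 3600

print(f"training: ~{train_hours:.0f} h")
print(f"generate + score the test set on one GPU: ~{eval_hours:.1f} h")
# prints:
# training: ~55 h
# generate + score the test set on one GPU: ~2.5 h
```

The 55-hour result sits slightly below the quoted ~60 hours because the per-step time is itself rounded.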

Acknowledgements

This codebase builds on several excellent open-source projects. We thank the authors and maintainers of Astrolabe for the RL / reward-training infrastructure, FastVideo for efficient video generation and training utilities, and GVHMR for human mesh recovery used in our 3D motion reward pipeline. Their publicly released code made this work possible.

Citation

If you find this work useful, please consider citing:

@article{huang2026phymotion,
  title={PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation},
  author={Huang, Yidong and Wang, Zun and Lin, Han and Kim, Dong-Ki and Omidshafiei, Shayegan and Yoon, Jaehong and Cho, Jaemin and Zhang, Yue and Bansal, Mohit},
  journal={arXiv preprint arXiv:2605.14269},
  year={2026}
}

About

Official implementation of the paper "PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation"
