Xiangyang Luo1,2*, Xiaozhe Xin2*✉, Tao Feng1, Xu Guo1, Meiguang Jin2, Junfeng Ma2
1 Tsinghua University 2 Alibaba Group
* Equal contribution ✉ Corresponding author
- [May 6, 2026] We release the training guideline and pose-driven inference of CoInteract.
- [April 27, 2026] We release the inference code and checkpoint of CoInteract.
- [April 22, 2026] We release the Paper and Project page of CoInteract.
| Stage | Status | Description | Date |
|---|---|---|---|
| 1 | ✅ | Release inference code and model weights | - |
| 2 | ✅ | Release pose-driven checkpoint and inference | - |
| 3 | ✅ | Release training code | - |
git clone https://github.com/luoxyhappy/CoInteract.git
cd CoInteract
conda create -n cointeract python=3.10
conda activate cointeract
pip install -e .
We rely on two base models plus our CoInteract checkpoint. The easiest way to fetch everything is the HuggingFace CLI:
# Wan2.2-S2V-14B base model
hf download Wan-AI/Wan2.2-S2V-14B \
--local-dir ./models/Wan2.2-S2V-14B
# Chinese Wav2Vec2 audio encoder
hf download jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn \
--local-dir ./models/chinese-wav2vec2-large
# CoInteract checkpoint
hf download georgexin/cointeract \
--local-dir ./models/CoInteract
Note: alternative endpoint for restricted networks
If huggingface.co is unreachable from your environment, configure the community mirror before running the download commands:
export HF_ENDPOINT=https://hf-mirror.com
To persist this setting, append the line to ~/.bashrc or ~/.zshrc.
| Model | Link |
|---|---|
| Wan2.2-S2V-14B | Wan-AI/Wan2.2-S2V-14B |
| Chinese Wav2Vec2 Large | jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn |
| CoInteract Checkpoint | georgexin/cointeract |
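After the downloads finish, a quick sanity check can confirm that the three model directories landed where the inference script expects them. The following is a minimal sketch, not part of the repository: the helper name `missing_model_dirs` is hypothetical, and the directory names are simply copied from the `--local-dir` arguments above.

```python
from pathlib import Path

# Directory names taken from the --local-dir arguments in the download commands.
EXPECTED_MODEL_DIRS = (
    "Wan2.2-S2V-14B",
    "chinese-wav2vec2-large",
    "CoInteract",
)


def missing_model_dirs(models_root="./models"):
    """Return the expected model directories that are absent or empty.

    Hypothetical helper for illustration; it only inspects the filesystem
    and does not validate the contents of any checkpoint.
    """
    root = Path(models_root)
    missing = []
    for name in EXPECTED_MODEL_DIRS:
        path = root / name
        # An empty directory may indicate an interrupted download.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_model_dirs()
    if missing:
        print("Missing or empty model directories:", ", ".join(missing))
    else:
        print("All expected model directories are present.")
```

An empty result means every directory exists and contains at least one file; it does not guarantee that the files themselves are complete.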
Run batch inference with the default demos CSV (paths resolve under ./models/):
python batch_infer.py \
--csv_path ./examples/demos/demos.csv \
--output_dir ./output_videos \
--height 1280 \
--width 720 \
--cfg_scale 7.0 \
--num_clips 3
We recommend running at 720p (--height 1280 --width 720) for the best visual quality. A lower-resolution 480p setting (--height 832 --width 480) is available for memory-constrained GPUs.
| Resolution | Height × Width | Peak GPU Memory |
|---|---|---|
| 720p (recommended) | 1280 × 720 | ~59 GB |
| 480p | 832 × 480 | ~45 GB |
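The table above can be turned into a small lookup that picks `--height` and `--width` from the memory you have available. This is an illustrative sketch only: `pick_preset` is a hypothetical helper, and the thresholds are the approximate peak figures from the table, so leaving some headroom above them is advisable.

```python
# (label, height, width, approximate peak GPU memory in GB), copied from the table.
PRESETS = [
    ("720p", 1280, 720, 59),
    ("480p", 832, 480, 45),
]


def pick_preset(available_gb):
    """Return the highest-quality preset whose approximate peak memory fits.

    Returns None when even the 480p setting exceeds the available memory.
    """
    for label, height, width, peak_gb in PRESETS:
        if available_gb >= peak_gb:
            return {"label": label, "height": height, "width": width}
    return None


if __name__ == "__main__":
    for gb in (80, 48, 24):
        print(gb, "GB ->", pick_preset(gb))
```

The presets are ordered from highest to lowest resolution, so the first match is the best quality that fits under the stated figures.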
Input CSV must contain columns audio, person_image, prompt. Optional columns: product_image, prompt2, prompt3.
- person_image: path to the reference image of the speaker (identity / first frame).
- product_image: path to the product reference image (object appearance). Leave empty for pure speech-driven generation.
- prompt2, prompt3: optional per-clip prompts used for interactive generation, allowing different textual instructions across sequential clips to drive multi-turn interactions.
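Before launching a long batch run, it can help to check that a custom CSV has the required columns and that none of the required cells are blank. The sketch below uses only the standard library; `check_csv` is a hypothetical helper and not part of the repository, and it deliberately ignores any extra columns rather than guessing what the inference script accepts.

```python
import csv

# Required column names as stated in the input format description.
REQUIRED_COLUMNS = ("audio", "person_image", "prompt")


def check_csv(csv_path):
    """Return a list of human-readable problems found in the input CSV.

    An empty list means the header has all required columns and every
    data row has a non-blank value for each of them.
    """
    problems = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []
        missing = [c for c in REQUIRED_COLUMNS if c not in header]
        if missing:
            return ["missing required columns: " + ", ".join(missing)]
        # Data rows are numbered from 1, not counting the header row.
        for row_no, row in enumerate(reader, start=1):
            for col in REQUIRED_COLUMNS:
                if not (row.get(col) or "").strip():
                    problems.append(f"data row {row_no}: empty '{col}'")
    return problems


if __name__ == "__main__":
    for problem in check_csv("./examples/demos/demos.csv"):
        print(problem)
```

The check does not verify that the referenced audio or image files exist on disk; that could be added with `pathlib.Path.exists` if needed.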
We provide our generated results for the demos in ./output_videos for reference.
Note: To try your own cases, we recommend using product images with a clean white background for best results, and keeping your prompts in a format consistent with the examples in ./examples/demos/demos.csv.
Beyond audio-only control, CoInteract also supports pose-driven generation, where a pre-extracted pose skeleton video guides the full-body motion of the speaker while the audio still drives lip-sync.
The pose-driven checkpoint lives in the same HuggingFace repository as the default one.
hf download georgexin/cointeract \
--local-dir ./models/CoInteract
After downloading, you should see an additional checkpoint_pose.safetensors under ./models/CoInteract/.
A ready-to-use CSV is shipped at ./examples/demos/posedriven/posedriven.csv.
python batch_infer.py \
--csv_path ./examples/demos/posedriven/posedriven.csv \
--lora_path ./models/CoInteract/checkpoint_pose.safetensors \
--output_dir ./output_videos/posedriven \
--height 1280 \
--width 720 \
--cfg_scale 7.0 \
--num_clips 3
We provide our generated results in ./output_videos/posedriven for reference.
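If you switch between the default and pose-driven modes often, the two command lines can be assembled from one function. The sketch below only builds the argument list using the flags shown in this README; `build_infer_command` is a hypothetical wrapper, and its defaults mirror the 720p examples rather than anything defined by `batch_infer.py` itself.

```python
import shlex
import sys


def build_infer_command(csv_path, output_dir, height=1280, width=720,
                        cfg_scale=7.0, num_clips=3, lora_path=None):
    """Assemble the batch_infer.py argument list shown in this README.

    Passing lora_path adds --lora_path, as in the pose-driven example;
    leaving it as None reproduces the default command.
    """
    cmd = [
        sys.executable, "batch_infer.py",
        "--csv_path", csv_path,
        "--output_dir", output_dir,
        "--height", str(height),
        "--width", str(width),
        "--cfg_scale", str(cfg_scale),
        "--num_clips", str(num_clips),
    ]
    if lora_path is not None:
        cmd += ["--lora_path", lora_path]
    return cmd


if __name__ == "__main__":
    cmd = build_infer_command(
        "./examples/demos/posedriven/posedriven.csv",
        "./output_videos/posedriven",
        lora_path="./models/CoInteract/checkpoint_pose.safetensors",
    )
    # Print a shell-quoted form for inspection before running it.
    print(shlex.join(cmd))
```

To actually launch the run, the returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.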
Please refer to ./examples/wanvideo/model_training/README.md for the end-to-end walkthrough.
CoInteract enables high-quality speech-driven human-object interaction video synthesis with fine-grained spatial control. It supports diverse generation modes including video generation, unified generation, and interactive generation.
Key contributions:
- Human-Aware Mixture-of-Experts (MoE): A spatial routing mechanism that dynamically dispatches tokens to specialized expert networks, supervised by ground-truth bounding boxes during training and fully automatic at inference.
- Spatially-Structured Co-Generation: Joint training of RGB video and HOI depth maps provides structural guidance for realistic interactions, without requiring depth input at inference time.
@article{luo2026cointeract,
title={CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation},
author={Luo, Xiangyang and Xin, Xiaozhe and Feng, Tao and Guo, Xu and Jin, Meiguang and Ma, Junfeng},
journal={arXiv preprint arXiv:2604.19636},
year={2026}
}
This project is released under the Apache License 2.0. Note that the underlying base models (e.g., Wan2.2-S2V-14B and jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn) are governed by their own licenses; please comply with them when using the corresponding weights.

