CoInteract: Spatially-Structured Co-Generation for Interactive Human-Object Video Synthesis

Xiangyang Luo¹,²*, Xiaozhe Xin²*✉, Tao Feng¹, Xu Guo¹, Meiguang Jin², Junfeng Ma²
¹ Tsinghua University ² Alibaba Group
* Equal contribution ✉ Corresponding author

Demo

demo.mp4

🔥News

[May 6, 2026] We release the training guideline and pose-driven inference of CoInteract.
[April 27, 2026] We release the inference code and checkpoint of CoInteract.
[April 22, 2026] We release the Paper and Project page of CoInteract.

🗺️Roadmap

Stage	Status	Description	Date
1	✅	Release inference code and model weights	-
2	✅	Release pose-driven checkpoint and inference	-
3	✅	Release training code	-

Installation

git clone https://github.com/luoxyhappy/CoInteract.git
cd CoInteract
conda create -n cointeract python=3.10
conda activate cointeract
pip install -e .

Model Weights

We rely on two base models plus our CoInteract checkpoint. The easiest way to fetch everything is the HuggingFace CLI:

# Wan2.2-S2V-14B base model
hf download Wan-AI/Wan2.2-S2V-14B \
    --local-dir ./models/Wan2.2-S2V-14B

# Chinese Wav2Vec2 audio encoder
hf download jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn \
    --local-dir ./models/chinese-wav2vec2-large

# CoInteract checkpoint 
hf download georgexin/cointeract \
    --local-dir ./models/CoInteract

Note: alternative endpoint for restricted networks

If huggingface.co is unreachable from your environment, configure the community mirror before running the download commands:

export HF_ENDPOINT=https://hf-mirror.com

To persist this setting, append the line to ~/.bashrc or ~/.zshrc.

Model	Link
Wan2.2-S2V-14B	Wan-AI/Wan2.2-S2V-14B
Chinese Wav2Vec2 Large	jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn
CoInteract Checkpoint	georgexin/cointeract
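As a quick sanity check after the downloads, the sketch below reports which of the three target directories are missing or empty. The directory names come from the --local-dir arguments above; the helper name missing_model_dirs is hypothetical and not part of the repository, and the check does not verify individual weight files.

```python
from pathlib import Path

# Directory names taken from the --local-dir arguments in the download commands.
EXPECTED_DIRS = (
    "Wan2.2-S2V-14B",
    "chinese-wav2vec2-large",
    "CoInteract",
)


def missing_model_dirs(models_root="./models"):
    """Return the expected model directories that are absent or empty."""
    root = Path(models_root)
    missing = []
    for name in EXPECTED_DIRS:
        path = root / name
        # A directory with no entries usually means the download never ran.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_model_dirs()
    if missing:
        print("Missing or empty:", ", ".join(missing))
    else:
        print("All model directories are present.")
```

Run it from the repository root so that ./models resolves to the same location the download commands used.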

Inference

Run batch inference with the default demos CSV (paths resolve under ./models/):

python batch_infer.py \
    --csv_path ./examples/demos/demos.csv \
    --output_dir ./output_videos \
    --height 1280 \
    --width 720 \
    --cfg_scale 7.0 \
    --num_clips 3

We recommend running at 720p (--height 1280 --width 720) for the best visual quality. A lower-resolution 480p setting (--height 832 --width 480) is available for memory-constrained GPUs.

Resolution	Height × Width	Peak GPU Memory
720p (recommended)	1280 × 720	~59 GB
480p	832 × 480	~45 GB
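To make the trade-off in the table concrete, here is a minimal sketch that picks a resolution preset from the amount of free GPU memory. The peak-memory figures are the approximate values from the table above; pick_preset and its headroom_gb margin are hypothetical and not part of batch_infer.py.

```python
# Peak memory figures (GB) copied from the table above; they are approximate.
PRESETS = [
    {"name": "720p", "height": 1280, "width": 720, "peak_gb": 59},
    {"name": "480p", "height": 832, "width": 480, "peak_gb": 45},
]


def pick_preset(available_gb, headroom_gb=2.0):
    """Return the highest-resolution preset that fits, or None if none does.

    headroom_gb is an assumed safety margin, not a value from the README.
    """
    for preset in PRESETS:  # ordered from highest to lowest resolution
        if available_gb >= preset["peak_gb"] + headroom_gb:
            return preset
    return None


if __name__ == "__main__":
    choice = pick_preset(80)
    print(choice["name"], choice["height"], choice["width"])
```

The returned height and width map directly onto the --height and --width flags.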

The input CSV must contain the columns audio, person_image, and prompt. Optional columns are product_image, prompt2, and prompt3.

person_image: path to the reference image of the speaker (identity / first frame).
product_image: path to the product reference image (object appearance). Leave empty for pure speech-driven generation.
prompt2, prompt3: optional per-clip prompts used for interactive generation, allowing different textual instructions across sequential clips to drive multi-turn interactions.
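Before launching a long run, it can help to check that a custom CSV has the required columns filled in. The sketch below is one way to do that with the standard library; check_rows is a hypothetical helper, and whether batch_infer.py performs similar validation itself is not stated here.

```python
import csv

# Column names as listed in the README.
REQUIRED = ("audio", "person_image", "prompt")


def check_rows(csv_path):
    """Return a list of human-readable problems found in the CSV."""
    problems = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []
        missing = [c for c in REQUIRED if c not in header]
        if missing:
            return ["missing required column(s): " + ", ".join(missing)]
        for line_no, row in enumerate(reader, start=2):  # line 1 is the header
            for col in REQUIRED:
                if not (row.get(col) or "").strip():
                    problems.append(f"row {line_no}: empty '{col}'")
    return problems


if __name__ == "__main__":
    for problem in check_rows("./examples/demos/demos.csv"):
        print(problem)
```

An empty product_image is deliberately not flagged, since leaving it blank selects pure speech-driven generation.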

We provide our generated results for the demos in ./output_videos for reference.

Note: To try your own cases, we recommend product images with a clean white background and prompts that follow the format of the examples in ./examples/demos/demos.csv.

Pose-Driven Inference

Beyond audio-only control, CoInteract also supports pose-driven generation, where a pre-extracted pose skeleton video guides the full-body motion of the speaker while the audio still drives lip-sync.

1. Download the pose checkpoint

The pose-driven checkpoint lives in the same HuggingFace repository as the default one, so the same download command fetches it:

hf download georgexin/cointeract \
    --local-dir ./models/CoInteract

After downloading, you should see an additional checkpoint_pose.safetensors under ./models/CoInteract/.

2. Prepare the CSV

A ready-to-use CSV is shipped at ./examples/demos/posedriven/posedriven.csv.

3. Run batch inference

python batch_infer.py \
    --csv_path ./examples/demos/posedriven/posedriven.csv \
    --lora_path ./models/CoInteract/checkpoint_pose.safetensors \
    --output_dir ./output_videos/posedriven \
    --height 1280 \
    --width 720 \
    --cfg_scale 7.0 \
    --num_clips 3

We provide our generated results in ./output_videos/posedriven for reference.
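The default and pose-driven commands differ only in the CSV, the output directory, and the extra --lora_path flag. As a rough sketch, a small wrapper can assemble either variant; build_command is a hypothetical helper, and it uses only the flags that appear in the commands above.

```python
import shlex
import sys


def build_command(csv_path, output_dir, height=1280, width=720,
                  cfg_scale=7.0, num_clips=3, lora_path=None):
    """Assemble the batch_infer.py argument list from the flags shown above."""
    cmd = [
        sys.executable, "batch_infer.py",
        "--csv_path", csv_path,
        "--output_dir", output_dir,
        "--height", str(height),
        "--width", str(width),
        "--cfg_scale", str(cfg_scale),
        "--num_clips", str(num_clips),
    ]
    if lora_path is not None:
        # Only the pose-driven command in this README passes --lora_path.
        cmd += ["--lora_path", lora_path]
    return cmd


if __name__ == "__main__":
    cmd = build_command(
        "./examples/demos/posedriven/posedriven.csv",
        "./output_videos/posedriven",
        lora_path="./models/CoInteract/checkpoint_pose.safetensors",
    )
    # Print rather than execute; pass cmd to subprocess.run to launch it.
    print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) from the repository root would start the run.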

Training

Please refer to ./examples/wanvideo/model_training/README.md for the end-to-end walkthrough.

✨Highlights

CoInteract enables high-quality speech-driven human-object interaction video synthesis with fine-grained spatial control. It supports diverse generation modes including video generation, unified generation, and interactive generation.

Key contributions:

Human-Aware Mixture-of-Experts (MoE): A spatial routing mechanism that dynamically dispatches tokens to specialized expert networks. Routing is supervised by ground-truth (GT) bounding boxes during training and is fully automatic at inference.
Spatially-Structured Co-Generation: Joint training of RGB video and HOI depth maps provides structural guidance for realistic interactions, without requiring depth input at inference time.
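The README does not spell out how bounding boxes supervise the router. Purely as an illustration, and not the authors' implementation, one plausible way to turn boxes into per-token routing targets is to label each token on the latent grid by the box that contains its centre. The label set (0 background, 1 human, 2 object), the normalized box format, and the rule that the object wins on overlap are all assumptions made for this sketch.

```python
def routing_labels(grid_h, grid_w, human_box, object_box):
    """Label each token on a grid_h x grid_w grid from two normalized boxes.

    Boxes are (x0, y0, x1, y1) in [0, 1]. Labels: 0 background, 1 human,
    2 object. The label set and the "object wins on overlap" rule are
    illustrative assumptions.
    """
    def inside(cx, cy, box):
        x0, y0, x1, y1 = box
        return x0 <= cx < x1 and y0 <= cy < y1

    labels = []
    for row in range(grid_h):
        cy = (row + 0.5) / grid_h  # token centre, normalized
        line = []
        for col in range(grid_w):
            cx = (col + 0.5) / grid_w
            if inside(cx, cy, object_box):
                line.append(2)
            elif inside(cx, cy, human_box):
                line.append(1)
            else:
                line.append(0)
        labels.append(line)
    return labels


if __name__ == "__main__":
    # Human on the left half, object in the bottom-right quadrant.
    for line in routing_labels(2, 2, (0.0, 0.0, 0.5, 1.0), (0.5, 0.5, 1.0, 1.0)):
        print(line)
```

Labels of this kind could serve as classification targets for a router during training. At inference no boxes are needed, which matches the "fully automatic" behaviour described above.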

Citation

@article{luo2026cointeract,
  title={CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation},
  author={Luo, Xiangyang and Xin, Xiaozhe and Feng, Tao and Guo, Xu and Jin, Meiguang and Ma, Junfeng},
  journal={arXiv preprint arXiv:2604.19636},
  year={2026}
}

Acknowledgments

License

This project is released under the Apache License 2.0. Note that the underlying base models (e.g., Wan2.2-S2V-14B and jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn) are governed by their own licenses; please comply with them when using the corresponding weights.
