CoInteract: Spatially-Structured Co-Generation for Interactive Human-Object Video Synthesis

Xiangyang Luo1,2*, Xiaozhe Xin2*✉, Tao Feng1, Xu Guo1, Meiguang Jin2, Junfeng Ma2
1 Tsinghua University   2 Alibaba Group
* Equal contribution   ✉ Corresponding author

Demo

demo.mp4

🔥News

  • [May 6, 2026] We release the training guideline and pose-driven inference of CoInteract.
  • [April 27, 2026] We release the inference code and checkpoint of CoInteract.
  • [April 22, 2026] We release the paper and project page of CoInteract.

🗺️Roadmap

| Stage | Status | Description | Date |
|-------|--------|-------------|------|
| 1 | | Release inference code and model weights | - |
| 2 | | Release pose-driven checkpoint and inference | - |
| 3 | | Release training code | - |

Installation

git clone https://github.com/luoxyhappy/CoInteract.git
cd CoInteract
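conda create -n cointeract python=3.10
# Activate the new environment so the package installs into it rather than the base environment
conda activate cointeract
pip install -e .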

Model Weights

CoInteract relies on two base models plus the CoInteract checkpoint. The easiest way to fetch all three is the Hugging Face CLI:

# Wan2.2-S2V-14B base model
hf download Wan-AI/Wan2.2-S2V-14B \
    --local-dir ./models/Wan2.2-S2V-14B

# Chinese Wav2Vec2 audio encoder
hf download jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn \
    --local-dir ./models/chinese-wav2vec2-large

# CoInteract checkpoint 
hf download georgexin/cointeract \
    --local-dir ./models/CoInteract

Note: alternative endpoint for restricted networks

If huggingface.co is unreachable from your environment, configure the community mirror before running the download commands:

export HF_ENDPOINT=https://hf-mirror.com

To persist this setting, append the line to ~/.bashrc or ~/.zshrc.

| Model | Link |
|-------|------|
| Wan2.2-S2V-14B | Wan-AI/Wan2.2-S2V-14B |
| Chinese Wav2Vec2 Large | jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn |
| CoInteract Checkpoint | georgexin/cointeract |
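
Before running inference, it can help to confirm that all three downloads landed where the commands above put them. The following is a minimal sketch, not part of the repository: the function name `missing_model_dirs` is hypothetical, and the directory names are simply the `--local-dir` values used above.

```python
from pathlib import Path

# Directory names taken from the --local-dir values in the download commands.
EXPECTED_MODEL_DIRS = (
    "Wan2.2-S2V-14B",
    "chinese-wav2vec2-large",
    "CoInteract",
)


def missing_model_dirs(models_root="./models"):
    """Return the expected sub-directories that are absent or empty."""
    root = Path(models_root)
    missing = []
    for name in EXPECTED_MODEL_DIRS:
        path = root / name
        # An existing but empty directory usually points to an interrupted download.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_model_dirs()
    if missing:
        print("Missing or empty model directories:", ", ".join(missing))
    else:
        print("All expected model directories are present.")
```

This only checks that each directory exists and is non-empty; it does not verify individual weight files.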

Inference

Run batch inference with the default demos CSV (paths resolve under ./models/):

python batch_infer.py \
    --csv_path ./examples/demos/demos.csv \
    --output_dir ./output_videos \
    --height 1280 \
    --width 720 \
    --cfg_scale 7.0 \
    --num_clips 3

We recommend running at 720p (--height 1280 --width 720) for the best visual quality. A lower-resolution 480p setting (--height 832 --width 480) is available for memory-constrained GPUs.

| Resolution | Height × Width | Peak GPU Memory |
|------------|----------------|-----------------|
| 720p (recommended) | 1280 × 720 | ~59 GB |
| 480p | 832 × 480 | ~45 GB |
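
The table can be turned into a small helper that picks the --height and --width values for a given memory budget. This is only a sketch: the function name `pick_resolution` is hypothetical, and the thresholds are the approximate peak figures from the table, so actual headroom on a given GPU may differ.

```python
# Approximate peak-memory figures copied from the table above (in GB).
# Entries are ordered from the recommended setting down to the fallback.
RESOLUTIONS = [
    # (label, height, width, approx_peak_gb)
    ("720p", 1280, 720, 59),
    ("480p", 832, 480, 45),
]


def pick_resolution(available_gb):
    """Return (label, height, width) for the first setting that fits, or None."""
    for label, height, width, peak_gb in RESOLUTIONS:
        if available_gb >= peak_gb:
            return label, height, width
    return None  # neither documented setting fits in the given budget


if __name__ == "__main__":
    for gpu_gb in (80, 48, 24):
        print(gpu_gb, "GB ->", pick_resolution(gpu_gb))
```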

The input CSV must contain the columns audio, person_image, and prompt. The optional columns are product_image, prompt2, and prompt3.

  • person_image: path to the reference image of the speaker (identity / first frame).
  • product_image: path to the product reference image (object appearance). Leave empty for pure speech-driven generation.
  • prompt2, prompt3: optional per-clip prompts used for interactive generation, allowing different textual instructions across sequential clips to drive multi-turn interactions.
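
When preparing your own CSV, a quick pre-flight check can catch a missing column before a long inference run starts. The sketch below uses only the Python standard library and the column names listed above; the function name `check_csv` is hypothetical and is not part of the repository.

```python
import csv

# Required column names as listed in the text above.
REQUIRED_COLUMNS = ("audio", "person_image", "prompt")


def check_csv(csv_path):
    """Return a list of human-readable problems found in the input CSV."""
    problems = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []
        for column in REQUIRED_COLUMNS:
            if column not in header:
                problems.append(f"missing required column: {column}")
        if problems:
            # Row checks are meaningless without the required columns.
            return problems
        for row_no, row in enumerate(reader, start=1):
            for column in REQUIRED_COLUMNS:
                # DictReader yields None for cells missing from a short row.
                if not (row[column] or "").strip():
                    problems.append(f"row {row_no}: empty value for {column}")
    return problems
```

An empty list means the header and every data row passed. Optional columns such as product_image may be left empty, so the sketch does not check them.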

We provide our generated results for the demos in ./output_videos for reference.

Note: For your own cases, we recommend product images with a clean white background for best results, and a prompt format consistent with the examples in ./examples/demos/demos.csv.

Pose-Driven Inference

Beyond audio-only control, CoInteract also supports pose-driven generation, where a pre-extracted pose skeleton video guides the full-body motion of the speaker while the audio still drives lip-sync.

1. Download the pose checkpoint

The pose-driven checkpoint lives in the same Hugging Face repository as the default one, so the same download command fetches it:

hf download georgexin/cointeract \
    --local-dir ./models/CoInteract

After downloading, you should see an additional checkpoint_pose.safetensors under ./models/CoInteract/.

2. Prepare the CSV

A ready-to-use CSV is shipped at ./examples/demos/posedriven/posedriven.csv.

3. Run batch inference

python batch_infer.py \
    --csv_path ./examples/demos/posedriven/posedriven.csv \
    --lora_path ./models/CoInteract/checkpoint_pose.safetensors \
    --output_dir ./output_videos/posedriven \
    --height 1280 \
    --width 720 \
    --cfg_scale 7.0 \
    --num_clips 3

We provide our generated results in ./output_videos/posedriven for reference.
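
The default and pose-driven runs differ only in the CSV, the output directory, and the extra --lora_path flag. To script both, the command line can be assembled in one place. This is a sketch rather than a repository utility: `build_infer_command` is a hypothetical helper, and it uses only the flags shown in the commands above.

```python
import shlex
import sys


def build_infer_command(csv_path, output_dir, height=1280, width=720,
                        cfg_scale=7.0, num_clips=3, lora_path=None):
    """Assemble the argv list for batch_infer.py from the flags shown above."""
    cmd = [sys.executable, "batch_infer.py", "--csv_path", csv_path]
    if lora_path is not None:
        # Only the pose-driven run passes a checkpoint through --lora_path.
        cmd += ["--lora_path", lora_path]
    cmd += [
        "--output_dir", output_dir,
        "--height", str(height),
        "--width", str(width),
        "--cfg_scale", str(cfg_scale),
        "--num_clips", str(num_clips),
    ]
    return cmd


if __name__ == "__main__":
    pose_cmd = build_infer_command(
        "./examples/demos/posedriven/posedriven.csv",
        "./output_videos/posedriven",
        lora_path="./models/CoInteract/checkpoint_pose.safetensors",
    )
    # Print the command; pass the list to subprocess.run(..., check=True) to execute it.
    print(shlex.join(pose_cmd))
```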

Training

Please refer to ./examples/wanvideo/model_training/README.md for the end-to-end walkthrough.

✨Highlights

CoInteract enables high-quality speech-driven human-object interaction video synthesis with fine-grained spatial control. It supports diverse generation modes including video generation, unified generation, and interactive generation.

Key contributions:

  • Human-Aware Mixture-of-Experts (MoE): A spatial routing mechanism that dynamically dispatches tokens to specialized expert networks. Routing is supervised by ground-truth (GT) bounding boxes during training and is fully automatic at inference.
  • Spatially-Structured Co-Generation: Joint training of RGB video and HOI depth maps provides structural guidance for realistic interactions, without requiring depth input at inference time.
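
This README does not describe how the bounding-box supervision is encoded. Purely as a toy illustration of the general idea, the sketch below converts one box into a per-token mask on a spatial grid, which is the kind of target a spatial router could be trained against. Everything in it is an assumption for illustration and not the authors' implementation: the function name, the normalized (x0, y0, x1, y1) box convention, and the grid size.

```python
def bbox_to_token_mask(grid_h, grid_w, bbox):
    """Mark grid cells whose centre lies inside a normalised (x0, y0, x1, y1) box.

    Toy illustration only; coordinates are assumed to lie in [0, 1].
    """
    x0, y0, x1, y1 = bbox
    mask = []
    for row in range(grid_h):
        cy = (row + 0.5) / grid_h  # vertical centre of this row of cells
        mask_row = []
        for col in range(grid_w):
            cx = (col + 0.5) / grid_w  # horizontal centre of this cell
            mask_row.append(x0 <= cx <= x1 and y0 <= cy <= y1)
        mask.append(mask_row)
    return mask


if __name__ == "__main__":
    # A box covering the top-left quadrant of a 4 x 4 token grid.
    for mask_row in bbox_to_token_mask(4, 4, (0.0, 0.0, 0.5, 0.5)):
        print("".join("#" if inside else "." for inside in mask_row))
```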

Citation

@article{luo2026cointeract,
  title={CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation},
  author={Luo, Xiangyang and Xin, Xiaozhe and Feng, Tao and Guo, Xu and Jin, Meiguang and Ma, Junfeng},
  journal={arXiv preprint arXiv:2604.19636},
  year={2026}
}

Acknowledgments

License

This project is released under the Apache License 2.0. Note that the underlying base models (e.g., Wan2.2-S2V-14B and jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn) are governed by their own licenses; please comply with them when using the corresponding weights.
