Xiangyang Luo<sup>1,2</sup>\*, Xiaozhe Xin<sup>2</sup>\*✉, Tao Feng<sup>1</sup>, Xu Guo<sup>1</sup>, Meiguang Jin<sup>2</sup>, Junfeng Ma<sup>2</sup>

<sup>1</sup> Tsinghua University  <sup>2</sup> Alibaba Group

\* Equal contribution  ✉ Corresponding author
[Demo video](demo.mp4)
| Stage | Status | Description |
|---|---|---|
| 1 | 🔜 | Release inference code and model weights (within one week) |
| 2 | 🔜 | Release training code |
| 3 | 📋 | Add pose control support |
CoInteract enables high-quality, speech-driven human-object interaction (HOI) video synthesis with fine-grained spatial control. It supports multiple generation modes, including standard video generation, unified generation, and interactive generation.
Key contributions:
- Human-Aware Mixture-of-Experts (MoE): a spatial routing mechanism that dynamically dispatches tokens to specialized expert networks (a hand expert and a face expert). Routing is supervised by ground-truth bounding boxes during training and runs fully automatically at inference; see the routing sketch after this list.
- Spatially-Structured Co-Generation: joint training on RGB video and HOI depth maps provides structural guidance for physically realistic interactions, with no depth input required at inference time; a minimal co-generation sketch follows the routing example below.
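To make the routing idea concrete, here is a minimal PyTorch sketch of box-driven spatial token dispatch: tokens whose grid cells fall inside hand/face boxes go to dedicated experts, everything else to a shared base expert. All names (`boxes_to_token_mask`, `HumanAwareMoE`, the single-`Linear` experts) are illustrative assumptions, not the released implementation.

```python
# Sketch: bounding-box-driven spatial token routing on a patchified grid.
# At training time the boxes are ground truth; at inference they would come
# from an automatic detector/router instead.
import torch
import torch.nn as nn


def boxes_to_token_mask(boxes, grid_h, grid_w):
    """Rasterize normalized (x0, y0, x1, y1) boxes onto a (grid_h, grid_w)
    token grid; returns a flat boolean mask of shape (grid_h * grid_w,)."""
    mask = torch.zeros(grid_h, grid_w, dtype=torch.bool)
    for x0, y0, x1, y1 in boxes:
        r0, r1 = int(y0 * grid_h), max(int(y1 * grid_h), int(y0 * grid_h) + 1)
        c0, c1 = int(x0 * grid_w), max(int(x1 * grid_w), int(x0 * grid_w) + 1)
        mask[r0:r1, c0:c1] = True
    return mask.flatten()


class HumanAwareMoE(nn.Module):
    """Dispatches hand/face-region tokens to specialized experts and the
    remaining tokens to a base expert, then scatters the outputs back."""

    def __init__(self, dim):
        super().__init__()
        self.base_expert = nn.Linear(dim, dim)
        self.hand_expert = nn.Linear(dim, dim)
        self.face_expert = nn.Linear(dim, dim)

    def forward(self, tokens, hand_mask, face_mask):
        # tokens: (num_tokens, dim); masks: (num_tokens,) boolean
        out = self.base_expert(tokens)
        out[hand_mask] = self.hand_expert(tokens[hand_mask])
        out[face_mask] = self.face_expert(tokens[face_mask])
        return out


if __name__ == "__main__":
    dim, gh, gw = 64, 16, 16
    tokens = torch.randn(gh * gw, dim)
    hand_mask = boxes_to_token_mask([(0.1, 0.6, 0.3, 0.8)], gh, gw)
    face_mask = boxes_to_token_mask([(0.4, 0.1, 0.6, 0.3)], gh, gw)
    moe = HumanAwareMoE(dim)
    print(moe(tokens, hand_mask, face_mask).shape)  # torch.Size([256, 64])
```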
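And a minimal sketch of the co-generation training setup: one shared backbone with an RGB head and an HOI-depth head, where the depth branch is only supervised during training and simply skipped at inference, so no depth input is ever needed. `CoGenHead`, `training_step`, and `depth_weight` are hypothetical names for illustration.

```python
# Sketch: joint RGB + HOI-depth co-generation from a shared backbone.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoGenHead(nn.Module):
    def __init__(self, dim, channels=4):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(dim, dim), nn.GELU())
        self.rgb_head = nn.Linear(dim, channels)  # RGB (latent) prediction
        self.depth_head = nn.Linear(dim, 1)       # HOI depth prediction

    def forward(self, feats, with_depth=True):
        h = self.backbone(feats)
        rgb = self.rgb_head(h)
        depth = self.depth_head(h) if with_depth else None
        return rgb, depth


def training_step(model, feats, rgb_gt, depth_gt, depth_weight=0.5):
    rgb, depth = model(feats, with_depth=True)
    # The depth term supervises scene structure (contacts, occlusion),
    # steering the shared backbone toward physically plausible interactions.
    return F.mse_loss(rgb, rgb_gt) + depth_weight * F.mse_loss(depth, depth_gt)


if __name__ == "__main__":
    model = CoGenHead(dim=64)
    feats = torch.randn(256, 64)
    loss = training_step(model, feats, torch.randn(256, 4), torch.randn(256, 1))
    loss.backward()
    rgb, _ = model(feats, with_depth=False)  # inference: depth branch skipped
```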
```bibtex
@misc{luo2026cointeractphysicallyconsistenthumanobjectinteraction,
      title={CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation},
      author={Xiangyang Luo and Xiaozhe Xin and Tao Feng and Xu Guo and Meiguang Jin and Junfeng Ma},
      year={2026},
      eprint={2604.19636},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.19636},
}
```
