Create and activate a conda environment:
conda create -n stereoworld python=3.11 -y
conda activate stereoworld
Install the Python dependencies:
pip install -r requirements.txt
pip install -e .
FFmpeg is required for video IO. If it is not already available on your system, install it with conda:
conda install -c conda-forge ffmpeg -y
If your machine requires a specific PyTorch/CUDA build, install the matching PyTorch package for your GPU driver and CUDA runtime before running inference.
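Before running inference, it can help to confirm that FFmpeg is on the PATH and that the installed PyTorch build can see the GPU. The following is a minimal sketch, not part of the repository; the function name `check_environment` is hypothetical, and PyTorch is imported lazily so the check also runs before PyTorch is installed.

```python
import shutil
import sys


def check_environment() -> dict:
    """Collect basic facts about the current environment (hypothetical helper)."""
    report = {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "ffmpeg": shutil.which("ffmpeg"),  # None if ffmpeg is not on PATH
        "torch": None,
        "cuda_available": False,
    }
    try:
        import torch  # lazy import: PyTorch may not be installed yet
    except ImportError:
        return report
    report["torch"] = torch.__version__
    report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `ffmpeg` is reported as `None`, install FFmpeg as described above. If `cuda_available` is `False` on a GPU machine, the installed PyTorch build likely does not match the driver or CUDA runtime.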
Download all required model weights into the models directory.
mkdir -p models
pip install -U "huggingface_hub[cli]"
Download the StereoWorld weights:
huggingface-cli download KXingLab/stereoworld --local-dir ./models
Download Wan2.1-T2V-1.3B:
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./models/Wan-AI/Wan2.1-T2V-1.3B
Download VideoLLaMA3-7B:
huggingface-cli download DAMO-NLP-SG/VideoLLaMA3-7B --local-dir ./models/VideoLLaMA3-7B
The inference code expects the following files and directories:
models/
+-- stereo.safetensors
+-- VideoLLaMA3-7B/
+-- Wan-AI/
    +-- Wan2.1-T2V-1.3B/
        +-- diffusion_pytorch_model.safetensors
        +-- models_t5_umt5-xxl-enc-bf16.pth
        +-- Wan2.1_VAE.pth
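A quick way to confirm the downloads landed in the expected places is to check each path before running inference. This is a minimal sketch, not part of the repository: the helper name `missing_model_files` is hypothetical, and the nesting of the Wan2.1 files is inferred from the `--local-dir` arguments of the download commands.

```python
from pathlib import Path

# Paths relative to the models directory. The nesting under Wan-AI/ is an
# assumption inferred from the --local-dir arguments of the download commands.
EXPECTED = [
    "stereo.safetensors",
    "VideoLLaMA3-7B",
    "Wan-AI/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors",
    "Wan-AI/Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth",
    "Wan-AI/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth",
]


def missing_model_files(models_dir="models"):
    """Return the expected entries that do not exist under models_dir."""
    root = Path(models_dir)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_model_files()
    if missing:
        print("Missing model files:")
        for rel in missing:
            print(f"  models/{rel}")
    else:
        print("All expected model files are present.")
```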
Please follow the license terms and usage conditions of the corresponding model repositories (KXingLab/stereoworld, Wan-AI/Wan2.1-T2V-1.3B, and DAMO-NLP-SG/VideoLLaMA3-7B).
The inference pipeline in main.py performs the following steps:
- Loads the input video and splits long videos into overlapping segments.
- Uses VideoLLaMA3 to generate captions for each video segment.
- Uses Wan2.1-T2V-1.3B with the StereoWorld LoRA weights to generate stereo video segments.
- Blends and concatenates the generated segments into the final output video.
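The splitting and blending steps can be illustrated with a small NumPy sketch. It is not the code in main.py: the function names, the segment length of 81 frames, the overlap of 16 frames, and the linear cross-fade are all assumptions chosen for illustration.

```python
import numpy as np


def split_indices(num_frames, seg_len=81, overlap=16):
    """Return (start, end) frame ranges; consecutive ranges share `overlap` frames."""
    if not 0 <= overlap < seg_len:
        raise ValueError("overlap must satisfy 0 <= overlap < seg_len")
    if num_frames <= seg_len:
        return [(0, num_frames)]
    stride = seg_len - overlap
    ranges = []
    start = 0
    while start + seg_len < num_frames:
        ranges.append((start, start + seg_len))
        start += stride
    ranges.append((start, num_frames))  # the last segment may be shorter
    return ranges


def blend_segments(segments, overlap=16):
    """Concatenate (T, H, W, C) segments, cross-fading the shared frames linearly."""
    out = segments[0].astype(np.float32)
    for seg in segments[1:]:
        seg = seg.astype(np.float32)
        if overlap > 0:
            # Weights rise across the overlap so later frames favor the new segment.
            w = np.linspace(0.0, 1.0, overlap + 2, dtype=np.float32)[1:-1]
            w = w.reshape(-1, 1, 1, 1)
            out[-overlap:] = (1.0 - w) * out[-overlap:] + w * seg[:overlap]
        out = np.concatenate([out, seg[overlap:]], axis=0)
    return out
```

With these assumed defaults, a 200-frame video is split into the ranges (0, 81), (65, 146), and (130, 200). Cross-fading the shared frames avoids a visible jump where two independently generated segments meet.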
Run inference with:
python main.py --video_path test/1.mp4 --output_path output/1.mp4
The script also supports dashed argument names:
python main.py --video-path test/1.mp4 --output-path output/1.mp4
Arguments:
- --video_path, --video-path: input video path. Default: test/1.mp4.
- --output_path, --output-path: output video path. Default: output/1.mp4.
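Accepting both underscore and dashed spellings is straightforward with `argparse`, which lets one argument carry several option strings. The sketch below shows one way this could be done; it is an illustration built from the documented names and defaults, not the parser in main.py, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser that accepts both spellings of each option (illustrative)."""
    parser = argparse.ArgumentParser(description="Illustrative CLI sketch")
    parser.add_argument(
        "--video_path", "--video-path", dest="video_path",
        default="test/1.mp4", help="input video path",
    )
    parser.add_argument(
        "--output_path", "--output-path", dest="output_path",
        default="output/1.mp4", help="output video path",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.video_path, args.output_path)
```

Both spellings write to the same attribute, so the rest of the script only needs to read `args.video_path` and `args.output_path`.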
For dataset preparation, movie source handling, caption generation, depth annotation, and disparity annotation, see the dataset processing guide.
@article{xing2025stereoworld,
title={StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation},
author={Xing, Ke and Jin, Xiaojie and Li, Longfei and Yin, Yuyang and Liang, Hanwen and Luo, Guixun and Fang, Chen and Wang, Jue and Plataniotis, Konstantinos N and Zhao, Yao and others},
journal={arXiv preprint arXiv:2512.09363},
year={2025}
}