StereoWorld

Environment Setup

Create and activate a conda environment:

conda create -n stereoworld python=3.11 -y
conda activate stereoworld

Install the Python dependencies:

pip install -r requirements.txt
pip install -e .

FFmpeg is required for video I/O. If it is not already available on your system, install it with conda:

conda install -c conda-forge ffmpeg -y
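
To confirm that FFmpeg is visible from the active environment before running inference, a small check like the one below can help. It is only a sketch: it looks for an `ffmpeg` executable on `PATH` and does not verify codecs or versions. The helper name `ffmpeg_available` is illustrative and is not part of this repository.

```python
import shutil


def ffmpeg_available(executable: str = "ffmpeg") -> bool:
    """Return True if the named executable can be found on PATH."""
    return shutil.which(executable) is not None


if ffmpeg_available():
    print("ffmpeg found at", shutil.which("ffmpeg"))
else:
    print("ffmpeg not found; install it with the conda command above")
```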

If your machine requires a specific PyTorch/CUDA build, install the matching PyTorch package for your GPU driver and CUDA runtime before running inference.
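
One way to see which PyTorch build is active is to query it from Python. The sketch below assumes nothing about how this repository loads PyTorch; `describe_torch` is a hypothetical helper, and it returns a short report instead of raising when PyTorch is missing.

```python
def describe_torch() -> dict:
    """Report whether PyTorch is importable and which CUDA build it targets."""
    try:
        import torch
    except ImportError:
        return {"installed": False}
    return {
        "installed": True,
        "torch_version": torch.__version__,
        # None for builds compiled without CUDA support.
        "cuda_build": torch.version.cuda,
        # True only if a usable GPU and driver are detected at runtime.
        "cuda_available": torch.cuda.is_available(),
    }


print(describe_torch())
```

If `cuda_available` is false on a GPU machine, the installed build most likely does not match the driver or CUDA runtime.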

Model Weights

Download all required model weights into the models directory.

mkdir -p models
pip install -U "huggingface_hub[cli]"

Download the StereoWorld weights:

huggingface-cli download KXingLab/stereoworld --local-dir ./models

Download Wan2.1-T2V-1.3B:

huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./models/Wan-AI/Wan2.1-T2V-1.3B

Download VideoLLaMA3-7B:

huggingface-cli download DAMO-NLP-SG/VideoLLaMA3-7B --local-dir ./models/VideoLLaMA3-7B
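
The same three downloads can also be scripted from Python with `snapshot_download` from `huggingface_hub`. This is a sketch of an alternative to the CLI commands above, not something the repository ships; the repository IDs and target directories are copied from those commands, and `download_all` is an illustrative name.

```python
# Repository IDs and local directories, copied from the CLI commands above.
DOWNLOADS = {
    "KXingLab/stereoworld": "./models",
    "Wan-AI/Wan2.1-T2V-1.3B": "./models/Wan-AI/Wan2.1-T2V-1.3B",
    "DAMO-NLP-SG/VideoLLaMA3-7B": "./models/VideoLLaMA3-7B",
}


def download_all(downloads=None) -> None:
    """Fetch each repository snapshot into its local directory."""
    # Imported lazily so the mapping can be inspected without the package.
    from huggingface_hub import snapshot_download

    for repo_id, local_dir in (downloads or DOWNLOADS).items():
        print(f"downloading {repo_id} -> {local_dir}")
        snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling `download_all()` starts the downloads; nothing is fetched when the module is only imported.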

The inference code expects the following files and directories:

models/
+-- stereo.safetensors
+-- VideoLLaMA3-7B/
+-- Wan-AI/
    +-- Wan2.1-T2V-1.3B/
        +-- diffusion_pytorch_model.safetensors
        +-- models_t5_umt5-xxl-enc-bf16.pth
        +-- Wan2.1_VAE.pth
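
Before running inference, it can be useful to verify that the layout above is in place. The following sketch checks only that each listed path exists, not that the files are complete or uncorrupted. The paths come from the tree above; `missing_weights` is a hypothetical helper name.

```python
from pathlib import Path

# Paths relative to the models directory, taken from the tree above.
EXPECTED = [
    "stereo.safetensors",
    "VideoLLaMA3-7B",
    "Wan-AI/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors",
    "Wan-AI/Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth",
    "Wan-AI/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth",
]


def missing_weights(models_dir: str = "models") -> list:
    """Return the expected paths that do not exist under models_dir."""
    root = Path(models_dir)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


missing = missing_weights()
print("all weights present" if not missing else f"missing: {missing}")
```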

Please follow the license terms and usage conditions of the model repositories listed above.

Inference Pipeline

The inference pipeline in main.py performs the following steps:

1. Loads the input video and splits long videos into overlapping segments.
2. Uses VideoLLaMA3 to generate captions for each video segment.
3. Uses Wan2.1-T2V-1.3B with the StereoWorld LoRA weights to generate stereo video segments.
4. Blends and concatenates the generated segments into the final output video.
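
The splitting and blending steps can be illustrated with a small, self-contained sketch. It is not the code in main.py: the segment length, the overlap, and the linear crossfade are assumptions chosen for illustration, frames are represented as plain numbers, and the names `split_segments`, `crossfade`, and `stitch` are hypothetical.

```python
def split_segments(num_frames: int, seg_len: int, overlap: int) -> list:
    """Return (start, end) frame ranges that cover num_frames with overlap."""
    if not 0 <= overlap < seg_len:
        raise ValueError("overlap must satisfy 0 <= overlap < seg_len")
    if num_frames <= seg_len:
        return [(0, num_frames)]
    step = seg_len - overlap
    segments, start = [], 0
    while True:
        end = min(start + seg_len, num_frames)
        segments.append((start, end))
        if end == num_frames:
            break
        start += step
    return segments


def crossfade(prev_tail: list, next_head: list) -> list:
    """Linearly blend two equal-length runs of overlapping frames."""
    if len(prev_tail) != len(next_head):
        raise ValueError("overlapping runs must have the same length")
    n = len(prev_tail)
    blended = []
    for i, (a, b) in enumerate(zip(prev_tail, next_head)):
        w = (i + 1) / (n + 1)  # weight of the later segment ramps up
        blended.append((1 - w) * a + w * b)
    return blended


def stitch(chunks: list, overlap: int) -> list:
    """Concatenate chunks, blending the shared overlap between neighbours."""
    out = list(chunks[0])
    for chunk in chunks[1:]:
        if overlap:
            out[-overlap:] = crossfade(out[-overlap:], chunk[:overlap])
        out.extend(chunk[overlap:])
    return out


ranges = split_segments(num_frames=10, seg_len=4, overlap=1)
print(ranges)  # [(0, 4), (3, 7), (6, 10)]

frames = [float(i) for i in range(10)]
chunks = [frames[s:e] for s, e in ranges]
print(len(stitch(chunks, overlap=1)))  # 10
```

Because neighbouring segments share `overlap` frames, the stitched result has the same number of frames as the input; the crossfade only changes how the shared frames are weighted.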

Run inference with:

python main.py --video_path test/1.mp4 --output_path output/1.mp4

The script also supports dashed argument names:

python main.py --video-path test/1.mp4 --output-path output/1.mp4

Arguments:

--video_path, --video-path: input video path. Default: test/1.mp4.
--output_path, --output-path: output video path. Default: output/1.mp4.
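
A parser that accepts both spellings can be built with `argparse` by registering two option strings for the same destination. The sketch below shows one way to do this using the defaults listed above; it is not necessarily how main.py defines its arguments, and `build_parser` is an illustrative name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser where underscore and dashed option names are aliases."""
    parser = argparse.ArgumentParser(description="StereoWorld inference (sketch)")
    parser.add_argument(
        "--video_path", "--video-path",
        dest="video_path",
        default="test/1.mp4",
        help="input video path",
    )
    parser.add_argument(
        "--output_path", "--output-path",
        dest="output_path",
        default="output/1.mp4",
        help="output video path",
    )
    return parser


# Parse an explicit argument list so the example does not read sys.argv.
args = build_parser().parse_args(["--video-path", "in.mp4"])
print(args.video_path, args.output_path)  # in.mp4 output/1.mp4
```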

Dataset Construction

For dataset preparation, movie source handling, caption generation, depth annotation, and disparity annotation, see the dataset processing guide:

datasets/data.md

Citation

@article{xing2025stereoworld,
  title={StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation},
  author={Xing, Ke and Jin, Xiaojie and Li, Longfei and Yin, Yuyang and Liang, Hanwen and Luo, Guixun and Fang, Chen and Wang, Jue and Plataniotis, Konstantinos N and Zhao, Yao and others},
  journal={arXiv preprint arXiv:2512.09363},
  year={2025}
}
