ke-xing/StereoWorldCode

StereoWorld

Environment Setup

Create and activate a conda environment:

conda create -n stereoworld python=3.11 -y
conda activate stereoworld

Install the Python dependencies:

pip install -r requirements.txt
pip install -e .

FFmpeg is required for video IO. If it is not already available on your system, install it with conda:

conda install -c conda-forge ffmpeg -y

If your machine requires a specific PyTorch/CUDA build, install the matching PyTorch package for your GPU driver and CUDA runtime before running inference.
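Before running inference, it can help to confirm that the prerequisites above are visible from the activated environment. The following is a minimal sketch, not part of the repository: the function name `check_environment` and the report keys are hypothetical, and it only uses standard-library calls plus `torch.cuda.is_available()` and `torch.version.cuda` when PyTorch is installed.

```python
import importlib.util
import shutil
import sys


def check_environment():
    """Return a dict describing which prerequisites were found (illustrative helper)."""
    report = {
        # The conda command above creates the environment with Python 3.11.
        "python_version": f"{sys.version_info.major}.{sys.version_info.minor}",
        # shutil.which returns the executable path, or None if ffmpeg is not on PATH.
        "ffmpeg_path": shutil.which("ffmpeg"),
        # find_spec locates the package without importing it.
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": None,
    }
    if report["torch_installed"]:
        try:
            import torch

            report["cuda_available"] = torch.cuda.is_available()
            # CUDA version the installed PyTorch build targets (None for CPU-only builds).
            report["torch_cuda_build"] = torch.version.cuda
        except Exception as exc:  # a broken install should not crash the check
            report["torch_error"] = repr(exc)
    return report


for key, value in check_environment().items():
    print(f"{key}: {value}")
```

If `ffmpeg_path` is `None` or `cuda_available` is `False`, revisit the corresponding installation step before running `main.py`.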

Model Weights

Download all required model weights into the models directory.

mkdir -p models
pip install -U "huggingface_hub[cli]"

Download the StereoWorld weights:

huggingface-cli download KXingLab/stereoworld --local-dir ./models

Download Wan2.1-T2V-1.3B:

huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./models/Wan-AI/Wan2.1-T2V-1.3B

Download VideoLLaMA3-7B:

huggingface-cli download DAMO-NLP-SG/VideoLLaMA3-7B --local-dir ./models/VideoLLaMA3-7B
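The three downloads can also be scripted through the `huggingface_hub` Python API instead of the CLI. This is a sketch under the assumption that `snapshot_download` with `local_dir` yields the same layout as the commands above; `MODEL_REPOS` and `download_all` are hypothetical names introduced here for illustration.

```python
from pathlib import Path

# (repo_id, local_dir) pairs copied from the huggingface-cli commands above.
MODEL_REPOS = [
    ("KXingLab/stereoworld", "models"),
    ("Wan-AI/Wan2.1-T2V-1.3B", "models/Wan-AI/Wan2.1-T2V-1.3B"),
    ("DAMO-NLP-SG/VideoLLaMA3-7B", "models/VideoLLaMA3-7B"),
]


def download_all(repos=MODEL_REPOS, root="."):
    """Download every repository in `repos` into its directory under `root`."""
    # Imported lazily so this module can be loaded without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    for repo_id, local_dir in repos:
        target = Path(root) / local_dir
        target.mkdir(parents=True, exist_ok=True)
        snapshot_download(repo_id=repo_id, local_dir=str(target))
```

Calling `download_all()` from the repository root fetches all three repositories; this requires network access and enough disk space for the weights.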

The inference code expects the following files and directories:

models/
+-- stereo.safetensors
+-- VideoLLaMA3-7B/
+-- Wan-AI/
    +-- Wan2.1-T2V-1.3B/
        +-- diffusion_pytorch_model.safetensors
        +-- models_t5_umt5-xxl-enc-bf16.pth
        +-- Wan2.1_VAE.pth
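A quick way to confirm the layout above after downloading is to check each expected path. The snippet below is an illustrative sketch: the path list is copied from the tree above, while the helper name `missing_model_paths` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Files and directories listed in the expected layout, relative to models/.
EXPECTED_FILES = [
    "stereo.safetensors",
    "Wan-AI/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors",
    "Wan-AI/Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth",
    "Wan-AI/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth",
]
EXPECTED_DIRS = ["VideoLLaMA3-7B"]


def missing_model_paths(models_dir="models"):
    """Return the expected paths that are absent under `models_dir`."""
    root = Path(models_dir)
    missing = [p for p in EXPECTED_FILES if not (root / p).is_file()]
    missing += [d + "/" for d in EXPECTED_DIRS if not (root / d).is_dir()]
    return missing
```

An empty list from `missing_model_paths()` means every path in the tree was found; any returned entry points to a download that should be repeated.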

Please follow the license terms and usage conditions of the corresponding model repositories.

Inference Pipeline

The inference pipeline in main.py performs the following steps:

  1. Loads the input video and splits long videos into overlapping segments.
  2. Uses VideoLLaMA3 to generate captions for each video segment.
  3. Uses Wan2.1-T2V-1.3B with the StereoWorld LoRA weights to generate stereo video segments.
  4. Blends and concatenates the generated segments into the final output video.
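Steps 1 and 4 can be illustrated with a small sketch of overlapping segmentation followed by a linear cross-fade at each seam. This is an assumption about one reasonable way to do it, not the code in `main.py`: the function names, the `segment_len=81` and `overlap=8` defaults, and the linear weighting are all hypothetical placeholders. Frames are modelled as floats here, but the same arithmetic applies to NumPy arrays.

```python
def split_into_segments(num_frames, segment_len=81, overlap=8):
    """Return (start, end) frame ranges (end exclusive) that share `overlap` frames."""
    if not 0 <= overlap < segment_len:
        raise ValueError("overlap must satisfy 0 <= overlap < segment_len")
    if num_frames <= segment_len:
        return [(0, num_frames)]
    stride = segment_len - overlap
    segments = []
    start = 0
    while True:
        end = min(start + segment_len, num_frames)
        segments.append((start, end))
        if end == num_frames:
            break
        start += stride
    return segments


def blend_segments(segments, overlap):
    """Concatenate frame lists, cross-fading the `overlap` frames at each seam."""
    output = list(segments[0])
    for seg in segments[1:]:
        for i in range(overlap):
            # Weight of the incoming segment ramps up linearly across the seam.
            w = (i + 1) / (overlap + 1)
            tail = len(output) - overlap + i
            output[tail] = (1 - w) * output[tail] + w * seg[i]
        # Frames after the overlap are appended unchanged.
        output.extend(seg[overlap:])
    return output


# Example: 20 frames, segments of 10 frames sharing 2 frames.
ranges = split_into_segments(20, segment_len=10, overlap=2)
print(ranges)  # [(0, 10), (8, 18), (16, 20)]
```

The cross-fade keeps the output length equal to the input length, because each seam contributes its overlapping frames only once.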

Run inference with:

python main.py --video_path test/1.mp4 --output_path output/1.mp4

The script also supports dashed argument names:

python main.py --video-path test/1.mp4 --output-path output/1.mp4

Arguments:

  • --video_path, --video-path: input video path. Default: test/1.mp4.
  • --output_path, --output-path: output video path. Default: output/1.mp4.
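For readers wondering how both spellings can map to the same option, `argparse` accepts several option strings for one argument. The sketch below shows one way to obtain the behavior described above; it is an assumption about the approach, not a copy of `main.py`, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser that accepts underscore and dashed spellings (illustrative)."""
    parser = argparse.ArgumentParser(description="StereoWorld inference (sketch)")
    # Both option strings write to the same destination attribute.
    parser.add_argument(
        "--video_path", "--video-path",
        dest="video_path", default="test/1.mp4", help="input video path",
    )
    parser.add_argument(
        "--output_path", "--output-path",
        dest="output_path", default="output/1.mp4", help="output video path",
    )
    return parser


args = build_parser().parse_args(["--video-path", "clips/a.mp4"])
print(args.video_path, args.output_path)  # clips/a.mp4 output/1.mp4
```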

Dataset Construction

For dataset preparation, movie source handling, caption generation, depth annotation, and disparity annotation, see the dataset processing guide:

datasets/data.md

Citation

@article{xing2025stereoworld,
  title={StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation},
  author={Xing, Ke and Jin, Xiaojie and Li, Longfei and Yin, Yuyang and Liang, Hanwen and Luo, Guixun and Fang, Chen and Wang, Jue and Plataniotis, Konstantinos N and Zhao, Yao and others},
  journal={arXiv preprint arXiv:2512.09363},
  year={2025}
}

About

[CVPR 2026] Official implementation of "StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation"
