hetima/m2svid

 
 


About this fork

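  • Works in a plain venv without conda (a minimal requirements.txt is provided)
  • The final stage appears to support at most 25 frames at a time, so the video is now processed in chunks that are concatenated afterwards (inpaint_and_refine.py)
  • A safetensors file that merges m2svid_weights.pt and open_clip_pytorch_model.bin
  • The submodule repositories are included directly in this repository
  • Gradio interface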

Installation

git clone https://github.com/hetima/m2svid.git

Tested on Windows 11 with Python 3.12.

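The torch-related packages are not listed in requirements.txt, so install them separately. The commands below are an example for a Python 3.12 venv. flash_attn can be downloaded from ussoewwin/Flash-Attention-2_for_Windows · Hugging Face; builds are probably available on GitHub and elsewhere as well. I could not find a working combination for torch 2.10, so 2.9.1 is the safe choice. The xformers release for 2.9.1 appears to be 0.0.33.post2, but that did not work, so I installed 0.0.33 instead.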

uv pip install torch==2.9.1 torchvision==0.24.1 torchaudio==2.9.1 --index-url https://download.pytorch.org/whl/cu130
uv pip install xformers==0.0.33 --no-deps # without --no-deps, torch 2.9 gets installed over 2.9.1
uv pip install "path/to/flash_attn-2.8.3+cu130torch2.9.1cxx11abiTRUE-cp312-cp312-win_amd64.whl"

uv pip install -r requirements.txt
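
Because the range of working version combinations is narrow, it can help to print what actually ended up in the venv after the installs above. The following is a minimal sketch that uses only the standard library. The helper name `installed_versions` is hypothetical, and the distribution names (in particular `flash_attn`) are assumed to match the names used in the install commands.

```python
from importlib import metadata


def installed_versions(names):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distribution names are assumptions based on the install commands above.
    packages = ["torch", "torchvision", "torchaudio", "xformers", "flash_attn"]
    for name, version in installed_versions(packages).items():
        print(f"{name}: {version or 'not installed'}")
```

Run it with the venv's Python; any entry reported as "not installed" or with an unexpected version points at the install step to repeat.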

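Usage is the same as the original. Download and place the models described under "Get started", then run the scripts from the "Inference" section. The scripts now add the required PYTHONPATH entries themselves, so setting PYTHONPATH is not necessary.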
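
For illustration, adding the paths from inside a script could look roughly like the sketch below. This is an assumption about the general technique, not the fork's actual code: the function name `add_repo_paths` is hypothetical, and the directory list is taken from the PYTHONPATH values in the original README's commands.

```python
import sys
from pathlib import Path

# Directories taken from the PYTHONPATH values in the original README commands.
THIRD_PARTY_DIRS = (
    "third_party/Hi3D_Official",
    "third_party/pytorch_msssim",
    "third_party/DepthCrafter",
)


def add_repo_paths(repo_root):
    """Prepend the repo root and the bundled third-party folders to sys.path."""
    root = Path(repo_root).resolve()
    added = []
    for path in [root, *(root / sub for sub in THIRD_PARTY_DIRS)]:
        entry = str(path)
        if entry not in sys.path:  # avoid duplicate entries on repeated calls
            sys.path.insert(0, entry)
            added.append(entry)
    return added


if __name__ == "__main__":
    # A real script would more likely pass Path(__file__).parent here.
    add_repo_paths(Path.cwd())
```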

The required models are collected on Hugging Face. Create a folder named ckpts and place the download in it as-is.

inpaint_and_refine.py now accepts the --save_sbs and --save_anaglyph flags to choose which files are generated. The sbs output is generated by default; pass --no-save_sbs to turn it off.
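
A flag pair such as --save_sbs / --no-save_sbs can be declared with `argparse.BooleanOptionalAction` (Python 3.9+). The sketch below only illustrates that pattern; it is not taken from inpaint_and_refine.py, the function name `build_parser` is hypothetical, and the default of `False` for --save_anaglyph is an assumption (only the sbs default is stated above).

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Output selection flags (sketch)")
    # BooleanOptionalAction creates both --save_sbs and --no-save_sbs.
    parser.add_argument(
        "--save_sbs",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="write the side-by-side video (on by default)",
    )
    # Assumption: the anaglyph output is off unless requested.
    parser.add_argument(
        "--save_anaglyph",
        action=argparse.BooleanOptionalAction,
        default=False,
        help="write the anaglyph video",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--no-save_sbs", "--save_anaglyph"])
    print(args.save_sbs, args.save_anaglyph)
```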

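A --chunk_size parameter was also added. It sets the number of frames processed at once (default 10, maximum 25). With 12 GB of VRAM, about 12 is the practical limit for processing a 512x512 video at a reasonable speed.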
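
The split-and-concatenate idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the fork's implementation: the names `chunk_ranges` and `process_in_chunks` are hypothetical, and the chunks do not overlap or blend at their boundaries, which the real script may handle differently.

```python
def chunk_ranges(num_frames, chunk_size=10, max_chunk=25):
    """Return (start, end) index pairs covering num_frames; end is exclusive."""
    if not 1 <= chunk_size <= max_chunk:
        raise ValueError(f"chunk_size must be between 1 and {max_chunk}")
    return [
        (start, min(start + chunk_size, num_frames))
        for start in range(0, num_frames, chunk_size)
    ]


def process_in_chunks(frames, process, chunk_size=10):
    """Apply `process` to each chunk of frames and join the results in order."""
    output = []
    for start, end in chunk_ranges(len(frames), chunk_size):
        output.extend(process(frames[start:end]))
    return output


if __name__ == "__main__":
    # A 23-frame clip with the default chunk size yields three chunks.
    print(chunk_ranges(23))
```

Only one chunk is in flight at a time, so peak memory scales with chunk_size rather than with the length of the video.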

A --use_prores flag was also added. When it is passed, the output is written as a ProRes LT .mov instead of an mp4.
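
How the fork writes ProRes internally is not described here. As a point of reference, an existing mp4 can be converted to ProRes LT with FFmpeg's `prores_ks` encoder, where profile 1 selects LT. The sketch below only builds the argument list; it assumes an `ffmpeg` executable on PATH, and the function name `prores_lt_command` is hypothetical.

```python
def prores_lt_command(src, dst, ffmpeg="ffmpeg"):
    """Build an ffmpeg argument list that re-encodes src to ProRes LT in a .mov."""
    return [
        ffmpeg, "-y",
        "-i", str(src),
        "-c:v", "prores_ks",          # FFmpeg's ProRes encoder
        "-profile:v", "1",            # prores_ks profile 1 is "lt"
        "-pix_fmt", "yuv422p10le",    # 10-bit 4:2:2, the usual ProRes 422 format
        str(dst),
    ]


if __name__ == "__main__":
    # Pass the list to subprocess.run(cmd, check=True) to actually encode.
    print(" ".join(prores_lt_command("outputs/m2svid/input_sbs.mp4", "input_sbs.mov")))
```

The file names in the usage line are placeholders.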

m2svid_combined_quanto_int8.safetensors

This file merges m2svid_weights.pt and open_clip_pytorch_model.bin. It can be downloaded from Hugging Face.

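Some parameters are quantized to int8 with optimum-quanto. With 12 GB of VRAM, memory consumption does not seem to change much, but processing is faster. Specify m2svid_combined.yaml for --model_config and add the --quanto_int8 flag when running (the flag is also detected automatically when the file name contains quanto_int8).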

python inpaint_and_refine.py \
        --mask_antialias 0 \
        --model_config configs/m2svid_combined.yaml \
        --ckpt ckpts/m2svid_combined_quanto_int8.safetensors \
        --video_path demo/input.mp4 \
        --reprojected_path outputs/reprojected/input_reprojected.mp4 \
        --reprojected_mask_path outputs/reprojected/input_reprojected_mask.mp4 \
        --output_folder outputs/m2svid \
        --quanto_int8
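
The automatic detection from the file name mentioned above could be as simple as the following sketch. The function name `use_quanto_int8` is hypothetical and the real script may implement the check differently.

```python
from pathlib import Path


def use_quanto_int8(ckpt_path, flag=False):
    """True if --quanto_int8 was passed or the checkpoint file name implies it."""
    return bool(flag) or "quanto_int8" in Path(ckpt_path).name


if __name__ == "__main__":
    print(use_quanto_int8("ckpts/m2svid_combined_quanto_int8.safetensors"))
    print(use_quanto_int8("ckpts/m2svid_combined_fp16.safetensors"))
```

Only the file name is checked, not the directory, so a folder that happens to contain quanto_int8 in its name does not switch the fp16 file to the int8 path.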

m2svid_combined_fp16.safetensors

Converted to fp16. Processing speed and memory consumption probably do not change, so the only real benefit is a smaller file size.

Specify m2svid_combined.yaml for --model_config. Do not add the --quanto_int8 flag.

Gradio

Running app.py starts a Gradio server. All required models are downloaded automatically.

CLI usage example

Example PowerShell script for one-shot conversion (run it with the repository as the current directory)

# conv filepath_to_convert [project_name]
function global:conv() {
    $sw = [System.Diagnostics.Stopwatch]::StartNew()
    # Set-Location "path/to/m2svid"

    # setting
    $outRootPath = "outputs"
    $cnfg = "configs/m2svid_combined.yaml"
    $ckpt = "ckpts/m2svid_combined/m2svid_combined_fp16.safetensors"
    # $ckpt = "ckpts/m2svid_combined/m2svid_combined_quanto_int8.safetensors"

    if ($null -eq $args[0]){
        Write-Output "no file path"
        return
    }
    if (!(Test-Path $args[0])) {
        Write-Output "file path does not exist"
        return
    }
    
    $path = $args[0]
    $fullPath = (Resolve-Path "$path").Path
    $baseName = [System.IO.Path]::GetFileNameWithoutExtension($path)

    if ($null -eq $args[1]){
        # $timestamp = Get-Date -Format "yyyy-MM-dd_HHmmss"
        # $projectName = "${timestamp}_${baseName}"
        $projectName = "$baseName"
    }else{
        $projectName = $args[1]
    }
    $outPath = Join-Path -Path $outRootPath -ChildPath "$projectName"
    [void](New-Item -Path $outPath -ItemType Directory -Force)

    $npz = Join-Path -Path $outPath -ChildPath "${baseName}.npz"
    if (!(Test-Path "$npz")) {
        # --num_inference_steps 25
        python third_party\DepthCrafter\run.py --video-path "$fullPath" --save_folder "$outPath" --save_npz True --num_inference_steps 5 --max_res 1024
    }else{
        Write-Host "npz exists. skip step 1."
    }

    $reprojected = Join-Path -Path $outPath -ChildPath "${baseName}_reprojected.mp4"
    $reprojectedMask = Join-Path -Path $outPath -ChildPath "${baseName}_reprojected_mask.mp4"
    if (!(Test-Path "$reprojected") -or !(Test-Path "$reprojectedMask")) {
        # --disparity_perc 0.05
        python warping.py --video_path "$fullPath" --depth_path "$npz" --output_path_reprojected "$reprojected" --output_path_mask "$reprojectedMask" --disparity_perc 0.1
    }else{
        Write-Host "reprojected exists. skip step 2."
    }

    python inpaint_and_refine.py --mask_antialias 0 --model_config "$cnfg" --ckpt "$ckpt" --video_path "$fullPath" --reprojected_path "$reprojected" --reprojected_mask_path "$reprojectedMask" --output_folder "$outRootPath" --use_prores --save_sbs --chunk_size 8

    $sw.Stop()
    Write-Host "Elapsed Time: $($sw.Elapsed.ToString("hh\:mm\:ss"))" -ForegroundColor Cyan
}
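
The file layout and skip logic of the script above can be restated in Python as a minimal sketch. The function names `project_paths` and `pending_steps` and the step labels are hypothetical; the file names mirror the ones the PowerShell function builds.

```python
from pathlib import Path


def project_paths(video_path, project_name=None, out_root="outputs"):
    """Mirror the file layout that the PowerShell `conv` function uses."""
    base = Path(video_path).stem
    out_dir = Path(out_root) / (project_name or base)
    return {
        "out_dir": out_dir,
        "npz": out_dir / f"{base}.npz",
        "reprojected": out_dir / f"{base}_reprojected.mp4",
        "reprojected_mask": out_dir / f"{base}_reprojected_mask.mp4",
    }


def pending_steps(paths):
    """List the pipeline steps whose outputs are not on disk yet."""
    steps = []
    if not paths["npz"].exists():
        steps.append("depth")
    if not (paths["reprojected"].exists() and paths["reprojected_mask"].exists()):
        steps.append("warping")
    steps.append("inpaint_and_refine")  # always run, as in the script
    return steps


if __name__ == "__main__":
    paths = project_paths("demo/input.mp4")
    print(paths["npz"], pending_steps(paths))
```

Because the depth and warping outputs are reused when present, rerunning a conversion with a different checkpoint or chunk size only repeats the final step.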

The original README follows.

M2SVid: End-to-End Inpainting and Refinement for Monocular-to-Stereo Video Conversion

Project Page arXiv

by Nina Shvetsova, Goutam Bhat, Prune Truong, Hilde Kuehne, Federico Tombari

Accepted to 3DV 2026!

Update: [March 20, 2026] We have released the pre-trained M2SVid weights!


This is not an officially supported Google product. This project is not eligible for the Google Open Source Software Vulnerability Rewards Program.


📄 Abstract

We tackle the problem of monocular-to-stereo video conversion and propose a novel architecture for inpainting and refinement of the warped right view obtained by depth-based reprojection of the input left view. We extend the Stable Video Diffusion (SVD) model to utilize the input left video, the warped right video, and the disocclusion masks as conditioning input to generate a high-quality right camera view. In order to effectively exploit information from neighboring frames for inpainting, we modify the attention layers in SVD to compute full attention for disoccluded pixels. Our model is trained to generate the right view video in an end-to-end manner without iterative diffusion steps by minimizing image space losses to ensure high-quality generation. Our approach outperforms previous state-of-the-art methods, being ranked best 2.6× more often than the second-place method in a user study, while being 6× faster.

🛠️ Get started

Weights

  1. Download ckpts.zip from the Hi3D repo and unzip it (follow step "2. Download checkpoints here and unzip."). Our model follows the Hi3D implementation and uses the same openclip model.

  2. Download the M2SVid weights (8.5 GB) and extract them into the ckpts folder: unzip m2svid_weights.zip -d ckpts/. We provide two model variants: one with full attention for disoccluded tokens (m2svid_weights.pt, 4.64 GB) and one without full attention (m2svid_no_full_atten_weights.pt, 4.6 GB).

  3. Optional (for training only): download stable-video-diffusion-img2vid-xt checkpoint and put it in ckpts/.

Environment

  1. Create conda env depthcrafter following DepthCrafter instructions
  2. Create conda env sgm. We used cuda 11.8, python=3.10.6, torch==2.0.1 torchvision==0.15.2. We tested our model training/inference on GPUs A100 and H100.
conda env create -f environment.yml -n sgm

⚙️ Inference

Run inference on demo video:

bash inference.sh

See example outputs in the demo folder.

Note 1: The width and height of the video should be divisible by 64.
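
A quick way to find a compliant size for an arbitrary input is to round each side down to the nearest multiple of 64. The helper below is a hypothetical sketch and not part of the repository; whether to crop or resize to the resulting size is left to the user.

```python
def floor_to_multiple(value, multiple=64):
    """Round value down to the nearest multiple."""
    return (value // multiple) * multiple


def fit_resolution(width, height, multiple=64):
    """Largest size not exceeding (width, height) with both sides divisible by `multiple`."""
    new_w = floor_to_multiple(width, multiple)
    new_h = floor_to_multiple(height, multiple)
    if new_w == 0 or new_h == 0:
        raise ValueError("input is smaller than one multiple")
    return new_w, new_h


if __name__ == "__main__":
    print(fit_resolution(1920, 1080))  # (1920, 1024)
```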

Note 2: The model was trained on a resolution of 512x512. For inference of higher resolution videos, please follow the tiling approach described in the StereoCrafter paper. Our released models support temporal and spatial stitching.
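
To give a rough idea of what spatial tiling involves, the sketch below computes start offsets for overlapping 512-pixel tiles along one axis. It is a generic illustration only and does not reproduce the StereoCrafter scheme: the function name `tile_starts` is hypothetical, the overlap of 128 pixels is an assumed value, and blending of the overlapping regions is not shown.

```python
def tile_starts(length, tile=512, overlap=128):
    """Start offsets of overlapping tiles of size `tile` that cover `length`."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be non-negative and smaller than tile")
    if length <= tile:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # the last tile is aligned to the far edge
    return starts


if __name__ == "__main__":
    # A 1024-pixel axis is covered by tiles starting at 0, 384 and 512.
    print(tile_starts(1024))
```

Applying the same function to the width and to the height gives a grid of tiles, each matching the 512x512 training resolution.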

Inference Steps:

  1. Depth prediction and depth-based warping
source /opt/conda/bin/activate ""
conda activate depthcrafter
PYTHONPATH="third_party/DepthCrafter/::${PYTHONPATH}" python third_party/DepthCrafter/run.py  \
        --video-path demo/input.mp4 --save_folder outputs/depthcrafter --save_npz True --num_inference_steps 25 --max_res 1024

PYTHONPATH="./:./third_party/Hi3D_Official/:./third_party/pytorch_msssim/:${PYTHONPATH}" python warping.py  \
        --video_path demo/input.mp4 \
        --depth_path outputs/depthcrafter/input.npz \
        --output_path_reprojected outputs/reprojected/input_reprojected.mp4  \
        --output_path_mask outputs/reprojected/input_reprojected_mask.mp4 \
        --disparity_perc 0.05
  2. Inpainting and refinement with M2SVid
source /opt/conda/bin/activate ""
conda activate sgm
PYTHONPATH="./:./third_party/Hi3D_Official/:./third_party/pytorch_msssim/:${PYTHONPATH}" python inpaint_and_refine.py  \
        --mask_antialias 0 \
        --model_config configs/m2svid.yaml \
        --ckpt ckpts/m2svid_weights.pt \
        --video_path demo/input.mp4  \
        --reprojected_path outputs/reprojected/input_reprojected.mp4 \
        --reprojected_mask_path outputs/reprojected/input_reprojected_mask.mp4 \
        --output_folder outputs/m2svid

Note: If you are using the version without full attention, ensure you use the m2svid_no_full_atten.yaml config instead:

source /opt/conda/bin/activate ""
conda activate sgm
PYTHONPATH="./:./third_party/Hi3D_Official/:./third_party/pytorch_msssim/:${PYTHONPATH}" python inpaint_and_refine.py  \
        --mask_antialias 0 \
        --model_config configs/m2svid_no_fullatten.yaml \
        --ckpt ckpts/m2svid_no_full_atten_weights.pt \
        --video_path demo/input.mp4  \
        --reprojected_path outputs/reprojected/input_reprojected.mp4 \
        --reprojected_mask_path outputs/reprojected/input_reprojected_mask.mp4 \
        --output_folder outputs/m2svid_no_full_atten

🏋️ Training and Quantitative Evaluation

Datasets

We used the Ego4D and Stereo4D datasets for model training and evaluation.

  1. Download and preprocess the Stereo4D dataset into the folder datasets/stereo4d by following the official instructions. You only need to perform the rectification and stereo matching steps. Then, you can warp all videos using our warping.py script. At the end, you should have the following folders: left_rectified, right_rectified, reprojected, and reprojected_mask. We provide the train/val split in datasets/stereo4d/subsets.

  2. For Ego4D, we use only videos with the attribute is_stereo=True, resulting in 263 videos in total. Download the videos into datasets/ego4d by following the official instructions. We rectify the videos, split them into 150-frame clips, and apply the BiDAStereo model to estimate disparities. Check the ego4d preprocessing README for more details. At the end, you should have the following folders: cropped_videos (side-by-side rectified and cropped left and right videos), reprojected, and reprojected_mask. We provide the train/val split in datasets/ego4d/subsets.

Training

  1. Download the stable-video-diffusion-img2vid-xt checkpoint and put it in ckpts/.

  2. Run make_m2svid_init.py to adapt the SVD model weights to our M2SVid model configuration with left view, warped view, and mask conditioning.

source /opt/conda/bin/activate ""
conda activate sgm
PYTHONPATH="./:./third_party/Hi3D_Official/:./third_party/pytorch_msssim/:${PYTHONPATH}" python make_m2svid_init.py
  3. Run training
source /opt/conda/bin/activate ""
conda activate sgm
PYTHONPATH="./:./third_party/Hi3D_Official/:./third_party/pytorch_msssim/:${PYTHONPATH}" python third_party/Hi3D_Official/train_test_updated.py \
    --base configs/training/m2svid_train.yaml \
    --no-test True \
    --train True \
    --logdir outputs/training/m2svid

Evaluation

Evaluation on stereo4d:

source /opt/conda/bin/activate ""
conda activate sgm
PYTHONPATH="./:./third_party/Hi3D_Official/:./third_party/pytorch_msssim/:${PYTHONPATH}" python third_party/Hi3D_Official/train_test_updated.py \
    --base configs/training/m2svid_train.yaml \
    --dataset_base configs/testing/stereo4d.yaml \
    --no-test False \
    --train False \
    --logdir outputs/training/m2svid \
    --resume /home/jupyter/outputs_m2svid/training/m2svid/checkpoints/epoch=000120.ckpt

Evaluation on ego4d:

source /opt/conda/bin/activate ""
conda activate sgm
PYTHONPATH="./:./third_party/Hi3D_Official/:./third_party/pytorch_msssim/:${PYTHONPATH}" python third_party/Hi3D_Official/train_test_updated.py \
    --base configs/training/m2svid_train.yaml \
    --dataset_base configs/testing/ego4d.yaml \
    --no-test False \
    --train False \
    --logdir outputs/training/m2svid \
    --resume /home/jupyter/outputs_m2svid/training/m2svid/checkpoints/epoch=000000.ckpt

Evaluation of Released Models

To reproduce the paper's results on Stereo4D and Ego4D using our released weights:

source /opt/conda/bin/activate ""
conda activate sgm

# Evaluate on Stereo4D
PYTHONPATH="./:./third_party/Hi3D_Official/:./third_party/pytorch_msssim/:${PYTHONPATH}" python third_party/Hi3D_Official/train_test_updated.py \
    --base configs/testing/pretrained_m2svid.yaml \
    --dataset_base configs/testing/stereo4d.yaml \
    --no-test False \
    --train False \
    --logdir outputs/training/m2svid 

# Evaluate on Ego4D
PYTHONPATH="./:./third_party/Hi3D_Official/:./third_party/pytorch_msssim/:${PYTHONPATH}" python third_party/Hi3D_Official/train_test_updated.py \
    --base configs/testing/pretrained_m2svid.yaml \
    --dataset_base configs/testing/ego4d.yaml \
    --no-test False \
    --train False \
    --logdir outputs/training/m2svid 

🎓 Citation

@article{shvetsova2026m2svid,
  title={M2SVid: End-to-End Inpainting and Refinement for Monocular-to-Stereo Video Conversion},
  author={Shvetsova, Nina and Bhat, Goutam and Truong, Prune and Kuehne, Hilde and Tombari, Federico},
  journal={3DV},
  year={2026}
}
