This repository contains code to compute depth from a single image. It accompanies our paper:
Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer
René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, Vladlen Koltun
and our preprint:
Vision Transformers for Dense Prediction
René Ranftl, Alexey Bochkovskiy, Vladlen Koltun
MiDaS was trained on up to 12 datasets (ReDWeb, DIML, Movies, MegaDepth, WSVD, TartanAir, HRWSI, ApolloScape, BlendedMVS, IRS, KITTI, NYU Depth V2) with
multi-objective optimization.
The original model that was trained on 5 datasets (MIX 5 in the paper) can be found here.
The figure below shows an overview of the different MiDaS models; the bubble size scales with number of parameters.
- Pick one or more models and download the corresponding weights to the weights folder:
MiDaS 3.1
- For highest quality: dpt_beit_large_512
- For moderately less quality, but better speed-performance trade-off: dpt_swin2_large_384
- For embedded devices: dpt_swin2_tiny_256, dpt_levit_224
- For inference on Intel CPUs, OpenVINO may be used for the small legacy model: openvino_midas_v21_small .xml, .bin
MiDaS 3.0: Legacy transformer models dpt_large_384 and dpt_hybrid_384
MiDaS 2.1: Legacy convolutional models midas_v21_384 and midas_v21_small_256
- Set up dependencies:
conda env create -f environment.yaml
conda activate midas-py310
For the Next-ViT model, also run the install script:
./install_next_vit.sh
For the OpenVINO model, also install:
pip install openvino
- Place one or more input images in the input folder.
- Run the model with
python run.py --model_type <model_type> --input_path input --output_path output
where <model_type> is chosen from dpt_beit_large_512, dpt_beit_large_384, dpt_beit_base_384, dpt_swin2_large_384, dpt_swin2_base_384, dpt_swin2_tiny_256, dpt_swin_large_384, dpt_next_vit_large_384, dpt_levit_224, dpt_large_384, dpt_hybrid_384, midas_v21_384, midas_v21_small_256, openvino_midas_v21_small_256.
- The resulting depth maps are written to the output folder.
- By default, inference resizes the height of each input image to the size the model expects so that it fits the encoder. This size is the number in the model name, as listed in the accuracy table. Some models support a range of inference heights rather than a single one. To try a different height, append the command line argument --height. Unsupported height values raise an error. Note that using this argument may reduce the model accuracy.
- By default, inference keeps the aspect ratio of the input images when feeding them into the encoder, if the model supports this (all models except Swin, Swin2, and LeViT). To resize to a square resolution instead, disregarding the aspect ratio while preserving the height, use the command line argument --square.
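The two resizing rules above can be sketched in a few lines of Python. This is an illustration, not the code used by run.py: the helper name `network_input_size` is hypothetical, and snapping sizes to a multiple of 32 is an assumption about what the encoders accept.

```python
def network_input_size(img_w, img_h, height, square=False, multiple=32):
    """Return the (width, height) fed to the encoder for an img_w x img_h image.

    Hypothetical helper; `multiple` is an assumed alignment constraint.
    """
    if height % multiple != 0:
        # Stand-in for the "unsupported height" error mentioned above.
        raise ValueError(f"height {height} is not a multiple of {multiple}")
    if square:
        # --square: disregard the aspect ratio, keep the requested height.
        return height, height
    # Default: scale the width by the same factor as the height, then snap it
    # to the nearest multiple so the aspect ratio is approximately preserved.
    scale = height / img_h
    width = max(multiple, round(img_w * scale / multiple) * multiple)
    return width, height


print(network_input_size(640, 480, 384))               # (512, 384)
print(network_input_size(640, 480, 384, square=True))  # (384, 384)
```

With a 640x480 input and a 384-pixel model, the default path yields a 512x384 network input, while the square path yields 384x384.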
To grab the input images from a camera and show the results in a window, omit the input and output paths and choose a model type as shown above:
python run.py --model_type <model_type> --side
The argument --side is optional and shows the input RGB image and the output depth map side by side for comparison.
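For scripted use, both invocations (folder-based and camera-based) can be assembled programmatically. The sketch below only builds the argument list from the flags described in this README. `build_command` and `MODEL_TYPES` are illustrative names that are not part of the repository, and `--height` is assumed to take an integer value.

```python
# Model types listed in this README for run.py.
MODEL_TYPES = {
    "dpt_beit_large_512", "dpt_beit_large_384", "dpt_beit_base_384",
    "dpt_swin2_large_384", "dpt_swin2_base_384", "dpt_swin2_tiny_256",
    "dpt_swin_large_384", "dpt_next_vit_large_384", "dpt_levit_224",
    "dpt_large_384", "dpt_hybrid_384", "midas_v21_384",
    "midas_v21_small_256", "openvino_midas_v21_small_256",
}


def build_command(model_type, input_path=None, output_path=None,
                  height=None, square=False, side=False):
    """Build the run.py argument list (hypothetical helper).

    Leaving input_path and output_path unset corresponds to camera mode.
    """
    if model_type not in MODEL_TYPES:
        raise ValueError(f"unknown model type: {model_type}")
    cmd = ["python", "run.py", "--model_type", model_type]
    if input_path is not None:
        cmd += ["--input_path", input_path]
    if output_path is not None:
        cmd += ["--output_path", output_path]
    if height is not None:
        cmd += ["--height", str(height)]  # assumed to accept an integer
    if square:
        cmd.append("--square")
    if side:
        cmd.append("--side")
    return cmd


# Folder mode and camera mode, respectively.
print(build_command("dpt_beit_large_512", "input", "output"))
print(build_command("dpt_swin2_tiny_256", side=True))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.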
- Make sure you have installed Docker and the NVIDIA Docker runtime.
- Build the Docker image:
docker build -t midas .
- Run inference:
docker run --rm --gpus all -v $PWD/input:/opt/MiDaS/input -v $PWD/output:/opt/MiDaS/output -v $PWD/weights:/opt/MiDaS/weights midas
This command passes all of your NVIDIA GPUs through to the container, mounts the input, output, and weights directories, and then runs the inference.
The pretrained model is also available on PyTorch Hub.
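A minimal sketch of loading a model through PyTorch Hub follows. It assumes the hub repository name `intel-isl/MiDaS`, the entry points `MiDaS`, `MiDaS_small`, `DPT_Hybrid`, `DPT_Large`, and `transforms`, and the transform attribute names shown in the comments; none of these are stated in this README, so check the hub page for the current list. The function needs `torch`, `opencv-python`, and network access on first use, which is why the heavy imports are kept inside it.

```python
HUB_REPO = "intel-isl/MiDaS"  # assumed hub repository name


def transform_name(model_type):
    """Attribute name of the matching preprocessing transform (assumed mapping)."""
    if model_type in ("DPT_Large", "DPT_Hybrid"):
        return "dpt_transform"
    if model_type == "MiDaS_small":
        return "small_transform"
    return "default_transform"


def estimate_depth(image_path, model_type="MiDaS_small"):
    """Return the model's relative depth prediction as an (H, W) NumPy array."""
    import cv2
    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.hub.load(HUB_REPO, model_type).to(device).eval()
    transforms = torch.hub.load(HUB_REPO, "transforms")
    transform = getattr(transforms, transform_name(model_type))

    # OpenCV loads BGR; the transforms expect an RGB array.
    img = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
    batch = transform(img).to(device)

    with torch.no_grad():
        pred = model(batch)
        # Resize the prediction back to the original image resolution.
        pred = torch.nn.functional.interpolate(
            pred.unsqueeze(1),
            size=img.shape[:2],
            mode="bicubic",
            align_corners=False,
        ).squeeze()
    return pred.cpu().numpy()
```

Calling `estimate_depth("input/example.jpg")` (a placeholder path) would download the weights on first use and return one prediction per pixel.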
For TensorFlow or ONNX, see the README in the tf subdirectory. Currently only MiDaS v2.1 is supported.
For mobile (iOS and Android), see the README in the mobile subdirectory.
For ROS, see the README in the ros subdirectory. Currently only MiDaS v2.1 is supported; DPT-based models are to be added.
We provide the zero-shot error, parameter count, and speed of various MiDaS models:
MiDaS Model | DIW WHDR | Eth3d AbsRel | Sintel AbsRel | TUM δ1 | KITTI δ1 | NYUv2 δ1 | Imp. % | Par. M | FPS |
---|---|---|---|---|---|---|---|---|---|
Inference height 512 | | | | | | | | | |
v3.1 BEiTL-512 | 0.1137 | 0.0659 | 0.2366 | 6.13 | 11.56* | 1.86* | | 345 | 5.7 |
v3.1 BEiTL-512 | 0.1121 | 0.0614 | 0.2090 | 6.46 | 5.00* | 1.90* | | 345 | 5.7 |
Inference height 384 | | | | | | | | | |
v3.1 BEiTL-512 | 0.1245 | 0.0681 | 0.2176 | 6.13 | 6.28* | 2.16* | | 345 | 12 |
v3.1 Swin2L-384 | 0.1106 | 0.0732 | 0.2442 | 8.87 | 5.84* | 2.92* | | 213 | 41 |
v3.1 Swin2B-384 | 0.1095 | 0.0790 | 0.2404 | 8.93 | 5.97* | 3.28* | | 102 | 39 |
v3.1 SwinL-384 | 0.1126 | 0.0853 | 0.2428 | 8.74 | 6.60* | 3.34* | | 213 | 49 |
v3.1 BEiTL-384 | 0.1239 | 0.0667 | 0.2545 | 7.17 | 9.84* | 2.21* | | 344 | 13 |
v3.1 Next-ViTL-384 | 0.1031 | 0.0954 | 0.2295 | 9.21 | 6.89* | 3.47* | | 72 | 30 |
v3.1 BEiTB-384 | 0.1159 | 0.0967 | 0.2901 | 9.88 | 26.60* | 3.91* | | 112 | 31 |
v3.0 DPTL-384 | 0.1082 | 0.0888 | 0.2697 | 9.97 | 8.46 | 8.32 | | 344 | 61 |
v3.0 DPTH-384 | 0.1106 | 0.0934 | 0.2741 | 10.89 | 11.56 | 8.69 | | 123 | 50 |
v2.1 Large384 | 0.1295 | 0.1155 | 0.3285 | 12.51 | 16.08 | 8.71 | | 105 | 47 |
Inference height 256 | | | | | | | | | |
v3.1 Swin2T-256 | 0.1211 | 0.1106 | 0.2868 | 13.43 | 10.13* | 5.55* | | 42 | 64 |
v2.1 Small256 | 0.1344 | 0.1344 | 0.3370 | 14.53 | 29.27 | 13.43 | | 21 | 90 |
Inference height 224 | | | | | | | | | |
v3.1 LeViT224 | 0.1314 | 0.1206 | 0.3148 | 18.21 | 15.27* | 8.64* | | 51 | 73 |
* No zero-shot error, because models are also trained on KITTI and NYU Depth V2
Best values per column and same validation height in bold
The improvement in the above table is defined as the relative zero-shot error with respect to MiDaS v3.0 DPTL-384, averaged over the datasets.
Note that the improvements of 10% for MiDaS v2.0 → v2.1 and 21% for MiDaS v2.1 → v3.0 are not visible from the improvement column (Imp.) in the table but would require an evaluation with respect to MiDaS v2.1 Large384 and v2.0 Large384 respectively instead of v3.0 DPTL-384.
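One way to read the improvement definition: for each of the six datasets, divide a model's error by the error of v3.0 DPTL-384, average the ratios, and report 100 * (1 - mean) as a percentage. The sketch below applies that reading to rows of the table. The formula and the function name `improvement` are an interpretation of the description, not taken from the evaluation code.

```python
# Errors in table order: DIW, Eth3d, Sintel, TUM, KITTI, NYUv2.
REFERENCE = [0.1082, 0.0888, 0.2697, 9.97, 8.46, 8.32]  # v3.0 DPTL-384


def improvement(errors, reference=REFERENCE):
    """Average relative improvement in percent (assumed formula).

    Positive values mean lower error than the reference on average.
    """
    ratios = [e / r for e, r in zip(errors, reference)]
    return 100.0 * (1.0 - sum(ratios) / len(ratios))


# v3.1 BEiTL-512 evaluated at inference height 384.
beit_l_512_at_384 = [0.1245, 0.0681, 0.2176, 6.13, 6.28, 2.16]
# v2.1 Large384.
v21_large_384 = [0.1295, 0.1155, 0.3285, 12.51, 16.08, 8.71]

print(round(improvement(beit_l_512_at_384)))  # 28
print(round(improvement(v21_large_384)))      # -32
print(improvement(REFERENCE))                 # 0.0
```

Under this reading the reference model scores 0 by construction, and a negative value means the model is on average less accurate than v3.0 DPTL-384.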
Test configuration
- Windows 10
- 11th Gen Intel Core i7-1185G7 3.00GHz
- 16GB RAM
- Camera resolution 640x480
- openvino_midas_v21_small_256
Speed: 22 FPS
- [Dec 2022] Released MiDaS v3.1:
- New models based on 5 different types of transformers (BEiT, Swin2, Swin, Next-ViT, LeViT)
- Training datasets extended from 10 to 12, including also KITTI and NYU Depth V2 using BTS split
- Best model, BEiTLarge 512, with resolution 512x512, is on average about 28% more accurate than MiDaS v3.0
- Integrated live depth estimation from camera feed
- [Sep 2021] Integrated to Huggingface Spaces with Gradio. See Gradio Web Demo.
- [Apr 2021] Released MiDaS v3.0:
- New models based on Dense Prediction Transformers are on average 21% more accurate than MiDaS v2.1
- Additional models can be found here
- [Nov 2020] Released MiDaS v2.1:
- New model that was trained on 10 datasets and is on average about 10% more accurate than MiDaS v2.0
- New light-weight model that achieves real-time performance on mobile platforms.
- Sample applications for iOS and Android
- ROS package for easy deployment on robots
- [Jul 2020] Added TensorFlow and ONNX code. Added online demo.
- [Dec 2019] Released new version of MiDaS - the new model is significantly more accurate and robust
- [Jul 2019] Initial release of MiDaS (Link)
Please cite our paper if you use this code or any of the models:
@ARTICLE {Ranftl2022,
author = "Ren\'{e} Ranftl and Katrin Lasinger and David Hafner and Konrad Schindler and Vladlen Koltun",
title = "Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer",
journal = "IEEE Transactions on Pattern Analysis and Machine Intelligence",
year = "2022",
volume = "44",
number = "3"
}
If you use a DPT-based model, please also cite:
@article{Ranftl2021,
author = {Ren\'{e} Ranftl and Alexey Bochkovskiy and Vladlen Koltun},
title = {Vision Transformers for Dense Prediction},
journal = {ICCV},
year = {2021},
}
Our work builds on and uses code from timm and Next-ViT. We'd like to thank the authors for making these libraries available.
MIT License