Depth map scale for KITTI data #11
Hi, thanks a lot for your interest in the project. Manydepth is all about training a depth model from monocular video sequences alone. In this setting (similar to Monodepth2 "M" models), depths and poses are estimated up to some arbitrary scale factor, in the same way that monocular SLAM or SfM cannot resolve absolute scale. We do not know this scale in advance, and it will change with each model that is trained (the network effectively gets to decide its own scale). Note that this is why we need the "adaptive cost volume" we introduce as one of our contributions in the paper.

At evaluation time we need to apply median scaling to our estimated depths to allow for comparison to ground truth lidar (see here). This is identical to Monodepth2 for a "mono" model. If you want to get to a rough real-world scale, you could try scaling Manydepth's outputs by the average median scaling of the test set.

In your comment you are comparing Manydepth to a model from Monodepth2 which was trained using both monocular sequences and stereo pairs (hence the "mono+stereo" in the name). I hope this makes sense, but if not please let me know and I can try to clarify.
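For concreteness, median scaling at evaluation time boils down to a one-line rescale. A minimal sketch, assuming `pred_depth` and `gt_depth` are numpy arrays already masked to valid lidar points (hypothetical names, not the repo's exact variables):

```python
import numpy as np

def median_scale(pred_depth, gt_depth):
    # rescale predictions so their median matches the ground truth median
    ratio = np.median(gt_depth) / np.median(pred_depth)
    return pred_depth * ratio, ratio
```

The per-image `ratio` here is also what you would average over a test set to get the rough real-world scale factor mentioned above.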
Hi, @JamieWatson683
@ChauChorHim - the median scaling is purely an evaluation step, so we can compare to the GT and obtain scores. This is the same as in Monodepth2, and indeed in (almost) all depth estimation works trained from monocular video. The adaptive cost volume is a training-time technique. A cost volume uses hypothesised depth planes to warp features from previous frames, building evidence for how likely each hypothesised depth is to be correct. When training from monocular video we do not know the scale, so we cannot fix the hypothesised depth planes in advance (imagine we define our min/max depth planes to be 0.5m and 100m, but the network can pick any scale, and perhaps it compresses everything such that its max depth is only 10m - this would mean almost all of our cost volume is unhelpful). Instead we need to estimate the depth planes as we train - hence the adaptive cost volume. Does that help at all?

@JinraeKim - the median scaling approach is the standard for evaluating depth estimation works trained on monocular video (going back to SfMLearner). It can be thought of in the same way as monocular SLAM techniques, or structure from motion - both of these give outputs only up to scale, and are (highly) unlikely to be in real-world scale. There are a few monocular depth papers which try to address the scale issue, but that was not the goal of ManyDepth.
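To make the adaptive part concrete, here is a rough sketch of the idea - an illustration under stated assumptions, not ManyDepth's actual implementation. The plane range is nudged towards whatever scale the network has chosen, using a running average of its own depth predictions:

```python
import torch

def make_depth_planes(min_depth: float, max_depth: float, num_planes: int = 96):
    # hypothesised depth planes for the cost volume (linear spacing shown;
    # inverse-depth spacing is another common choice)
    return torch.linspace(min_depth, max_depth, num_planes)

def update_depth_range(min_depth, max_depth, pred_depth, momentum=0.99):
    # nudge the cost-volume range towards the scale the network is using,
    # via an exponential moving average of the current batch's depth range
    min_depth = momentum * min_depth + (1 - momentum) * pred_depth.min().item()
    max_depth = momentum * max_depth + (1 - momentum) * pred_depth.max().item()
    return min_depth, max_depth
```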
Thank you so much! This helped me a lot. If you don't mind, could you answer this as well? Thank you in advance!
What is the scaling factor needed to get metric depth maps from output disparity maps with the KITTI dataset?
I see that a lot of the code is from monodepth2, including the same disparity-to-depth transformation when predicting for KITTI images; that is, `disp_to_depth` with default values `0.1` and `100`, followed by scaling with the KITTI stereo factor of `5.4`. Using these default values, the transformation can be summarised by the formula `depth = 5.4 / (0.01 + 9.99 * disparity)`.
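Spelled out as code, that pipeline looks like the following sketch (the `disp_to_depth` body mirrors monodepth2's helper; the wrapper applying the stereo factor is my own illustration):

```python
def disp_to_depth(disp, min_depth=0.1, max_depth=100):
    # convert the network's sigmoid output into depth, as in monodepth2
    min_disp = 1 / max_depth                               # 0.01
    max_disp = 1 / min_depth                               # 10
    scaled_disp = min_disp + (max_disp - min_disp) * disp  # 0.01 + 9.99 * disp
    depth = 1 / scaled_disp
    return scaled_disp, depth

STEREO_SCALE_FACTOR = 5.4  # KITTI stereo baseline factor

def disp_to_metric_depth(disp):
    # combined transformation: depth = 5.4 / (0.01 + 9.99 * disp)
    _, depth = disp_to_depth(disp)
    return STEREO_SCALE_FACTOR * depth
```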
However, using this same transformation on the output of manydepth results in depth maps with completely different scales to those of the monodepth2 depth maps. For example, the output of `test_sequence_target.jpg` on the manydepth `KITTI_HR` model using multi mode has the following statistics:

|        | Disparity | Depth    |
|--------|-----------|----------|
| max    | 0.651358  | 18.6921  |
| mean   | 0.247255  | 3.23547  |
| median | 0.187170  | 2.87261  |
| min    | 0.027917  | 0.828594 |
Compare this with the output of running the same image on the monodepth2 `mono+stereo_1024x320` model:

|        | Disparity | Depth    |
|--------|-----------|----------|
| max    | 0.114764  | 76.2298  |
| mean   | 0.037749  | 20.6049  |
| median | 0.026548  | 19.6213  |
| min    | 0.006090  | 4.66927  |
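(For reference, a minimal sketch of how statistics like these can be gathered, assuming a hypothetical numpy array `disp` holding the raw sigmoid output:)

```python
import numpy as np

def print_disp_and_depth_stats(disp):
    # depth via the combined formula above
    depth = 5.4 / (0.01 + 9.99 * disp)
    for name, arr in (("disparity", disp), ("depth", depth)):
        print(f"{name}: max={arr.max():.6f} mean={arr.mean():.6f} "
              f"median={np.median(arr):.6f} min={arr.min():.6f}")
```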
The same can be seen for any image in the KITTI dataset.
Clearly, because the scale of the raw output disparities is very different, a different scale factor needs to be applied when transforming to depth, but I can't find anywhere in the code what it should be. Is there a known value for scaling the depth maps for KITTI images so that depth is in metric scale, or at least so that it more closely matches the scale monodepth2 uses for KITTI images?