Depth map scale for KITTI data #11
Hi, thanks a lot for your interest in the project. Manydepth is all about training a depth model from monocular video sequences alone. In this setting (similar to Monodepth2 "M" models), depths and poses are estimated up to some arbitrary scale factor, in the same way that monocular SLAM or SfM cannot resolve absolute scale. We do not know this scale in advance, and it will change with each model that is trained (the network effectively gets to decide its own scale). Note that this is why we need the "adaptive cost volume" we introduce as one of our contributions in the paper.

At evaluation time we need to apply median scaling to our estimated depths to allow for comparison to ground truth lidar (see here). This is identical to Monodepth2 for a "mono" model. If you want to get to a rough real-world scale, you could try scaling Manydepth's outputs by the average median scaling of the test set.

In your comment you are comparing Manydepth to a model from Monodepth2 which was trained using both monocular sequences and stereo pairs (hence the "mono+stereo" in the name). I hope this makes sense, but if not please let me know and I can try to clarify.
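For concreteness, median scaling at evaluation time boils down to a one-line rescale. A minimal sketch, assuming `pred_depth` and `gt_depth` are numpy arrays already masked to valid lidar points (hypothetical names, not the repo's exact variables):

```python
import numpy as np

def median_scale(pred_depth, gt_depth):
    # rescale predictions so their median matches the ground truth median
    ratio = np.median(gt_depth) / np.median(pred_depth)
    return pred_depth * ratio, ratio
```

The per-image `ratio` here is also what you would average over a test set to get the rough real-world scale factor mentioned above.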
Hi, @JamieWatson683
@ChauChorHim - the median scaling is purely an evaluation step, so we can compare to the GT and obtain scores. This is the same as in Monodepth2, and indeed in (almost) all depth estimation works trained from monocular video. The adaptive cost volume is a training-time technique. A cost volume uses hypothesised depth planes to warp features from previous frames, building evidence for how likely each hypothesised depth is to be correct. When training from monocular video we do not know the scale, so we cannot fix the hypothesised depth planes in advance (imagine we define our min/max depth planes to be 0.5m and 100m, but the network can pick any scale, and perhaps it compresses everything such that its max depth is only 10m - this would mean almost all of our cost volume is unhelpful). Instead we need to estimate the depth planes as we train - hence the adaptive cost volume. Does that help at all?

@JinraeKim - the median scaling approach is the standard for evaluating depth estimation works trained on monocular video (going back to SfMLearner). It can be thought of in the same way as monocular SLAM techniques, or structure from motion - both of these give outputs only up to scale, and are (highly) unlikely to be in real-world scale. There are a few monocular depth papers which try to address the scale issue, but that was not the goal of ManyDepth.
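To make the adaptive part concrete, here is a rough sketch of the idea - an illustration under stated assumptions, not ManyDepth's actual implementation. The plane range is nudged towards whatever scale the network has chosen, using a running average of its own depth predictions:

```python
import torch

def make_depth_planes(min_depth: float, max_depth: float, num_planes: int = 96):
    # hypothesised depth planes for the cost volume (linear spacing shown;
    # inverse-depth spacing is another common choice)
    return torch.linspace(min_depth, max_depth, num_planes)

def update_depth_range(min_depth, max_depth, pred_depth, momentum=0.99):
    # nudge the cost-volume range towards the scale the network is using,
    # via an exponential moving average of the current batch's depth range
    min_depth = momentum * min_depth + (1 - momentum) * pred_depth.min().item()
    max_depth = momentum * max_depth + (1 - momentum) * pred_depth.max().item()
    return min_depth, max_depth
```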
Thank you so much! This helped me a lot. If you don't mind, could you answer this as well? Thank you in advance!
What is the scaling factor needed to get metric depth maps from output disparity maps with the KITTI dataset?
I see that a lot of the code is from monodepth2, including the same disparity-to-depth transformation when predicting for KITTI images; that is, `disp_to_depth` with default values `0.1` and `100`, followed by scaling with the KITTI stereo factor of `5.4`. Using these default values, the transformation can be summarised by the formula `depth = 5.4 / (0.01 + 9.99 * disparity)`.
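Spelled out as code, that pipeline looks like the following sketch (the `disp_to_depth` body mirrors monodepth2's helper; the wrapper applying the stereo factor is my own illustration):

```python
def disp_to_depth(disp, min_depth=0.1, max_depth=100):
    # convert the network's sigmoid output into depth, as in monodepth2
    min_disp = 1 / max_depth                               # 0.01
    max_disp = 1 / min_depth                               # 10
    scaled_disp = min_disp + (max_disp - min_disp) * disp  # 0.01 + 9.99 * disp
    depth = 1 / scaled_disp
    return scaled_disp, depth

STEREO_SCALE_FACTOR = 5.4  # KITTI stereo baseline factor

def disp_to_metric_depth(disp):
    # combined transformation: depth = 5.4 / (0.01 + 9.99 * disp)
    _, depth = disp_to_depth(disp)
    return STEREO_SCALE_FACTOR * depth
```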
However, using this same transformation on the output of manydepth results in depth maps with completely different scales to those of the monodepth2 depth maps. For example, the output of `test_sequence_target.jpg` on the manydepth `KITTI_HR` model using multi mode has the following statistics:

|        | Disparity | Depth    |
|--------|-----------|----------|
| max    | 0.651358  | 18.6921  |
| mean   | 0.247255  | 3.23547  |
| median | 0.187170  | 2.87261  |
| min    | 0.027917  | 0.828594 |
Compare this with the output of running the same image on the monodepth2 `mono+stereo_1024x320` model:

|        | Disparity | Depth    |
|--------|-----------|----------|
| max    | 0.114764  | 76.2298  |
| mean   | 0.037749  | 20.6049  |
| median | 0.026548  | 19.6213  |
| min    | 0.006090  | 4.66927  |
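(For reference, a minimal sketch of how statistics like these can be gathered, assuming a hypothetical numpy array `disp` holding the raw sigmoid output:)

```python
import numpy as np

def print_disp_and_depth_stats(disp):
    # depth via the combined formula above
    depth = 5.4 / (0.01 + 9.99 * disp)
    for name, arr in (("disparity", disp), ("depth", depth)):
        print(f"{name}: max={arr.max():.6f} mean={arr.mean():.6f} "
              f"median={np.median(arr):.6f} min={arr.min():.6f}")
```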
The same can be seen for any image in the KITTI dataset.
Clearly, because the scale of the raw output disparities is very different, a different scale factor needs to be applied when transforming to depth, but I can't find anywhere in the code what it should be. Is there a known value for scaling the depth maps for KITTI images so that depth is in metric scale, or at least so that it more closely matches the scale monodepth2 uses for KITTI images?