# Introduction

Monocular depth estimation - inferring the depth of a scene from a single image

Disparity = shift of a pixel between the 2 images of a stereo pair. It depends on the depth of the corresponding point.

<img src="depth_binocular.png" width = 500>
Note that here $x_l$ and $x_r$ are coordinates in the left and right coordinate systems respectively. That's why, when we calculate disparity, we subtract $x_r$ and add $x_l$ to T.

T is also called B - Baseline. Thus, for disparity $d$ and focal length $f$, the depth $Z$ will be:

$$ Z = \frac{Bf}{d} $$
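
The relation above can be checked numerically. This is a minimal sketch, assuming disparity is given in pixels, focal length in pixels and baseline in metres; the function name `depth_from_disparity`, the `eps` clamp and the rig parameters are illustrative assumptions, not taken from any of the papers below.

```python
import numpy as np

def depth_from_disparity(disparity, focal_px, baseline_m, eps=1e-6):
    """Convert disparity (pixels) to depth (metres) via Z = B * f / d."""
    disparity = np.asarray(disparity, dtype=np.float64)
    # Zero disparity corresponds to a point at infinity; clamp to avoid division by zero.
    return baseline_m * focal_px / np.maximum(disparity, eps)

# Hypothetical rig: 0.5 m baseline, 700 px focal length.
depths = depth_from_disparity([70.0, 35.0, 7.0], focal_px=700.0, baseline_m=0.5)
print(depths)  # larger disparity -> smaller depth
```

Note the inverse relation: halving the disparity doubles the depth, so small disparity errors matter most for far-away points.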

# Pre Deep Learning Models

## Learning Depth from Single Monocular Images

Saxena, 2005, Stanford

[[arxiv]](http://www.cs.cornell.edu/~asaxena/learningdepth/NIPS_LearningDepth.pdf)

They modeled image depth with MRFs:
- Y (depths) as a 2D field
- X (local features) as a 2D field

Each image is divided into patches. Each patch is described by a number of features (650 features).

The feature vector was constructed manually:
- 17 filters
- neighboring information (at 3 different scales)
- for each pair of patches they estimate their differences in histograms (for each of 17 filters)

Their first model (Gaussian) maximizes the probability:
<img src="predeeplearning_mrf_1.png" width=500>

First term represents error of depth prediction. Second term regularizes big differences in neighboring patches.

It's called Gaussian since model probability is represented by multivariate gaussian distribution.

Their second model (Laplacian) maximizes the probability:
<img src="predeeplearning_mrf_2.png" width=500>

It's pretty much the same as the Gaussian model, but here they use the Laplacian distribution <img src = "laplacian.svg">

The solutions to the problem were obtained with standard likelihood maximization techniques.

### Dataset

They collected their own dataset of ~400 images using 3D-scanners.

<img src="predeeplearning_results.png" width=400>


# Deep Learning Models

## Depth Map Prediction from a Single Image using a Multi-Scale Deep Network

Eigen, 2014, NYU

[[arxiv]](https://arxiv.org/pdf/1406.2283.pdf)

One of the first works on depth estimation using deep learning.

The network consists of two branches:
1. coarse estimator - regular CNN that regresses the depth
2. fine estimator - another regular CNN, which additionally uses global information from the coarse prediction of the first network

<img src="eigen_architecture.png" width=750>

Also the authors extended the standard MSE error in log space, $\sum_i \big( \log y_i - \log y^*_i \big) ^ 2$, with an additional summand:

$$\sum_i \big( \log y_i - \log y^*_i + \frac{1}{n} \sum_j (\log y^*_j - \log y_j) \big) ^ 2$$

Their motivation is that a large part of the error can be attributed to a wrong perception of the overall scale rather than to wrong relative depths. So they added the mean shift between prediction and ground truth, so that the model is judged on relative predictions.
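
The effect of the mean-shift term can be sketched in NumPy. This is an illustrative sketch of the idea only, assuming strictly positive depths; the function names are hypothetical, and the sum (rather than a mean) over pixels follows the notation used here.

```python
import numpy as np

def log_mse(pred, target):
    """Plain squared error between log-depths."""
    d = np.log(pred) - np.log(target)
    return np.sum(d ** 2)

def scale_invariant_log_error(pred, target):
    """Squared log error after adding the mean shift between ground truth and prediction."""
    d = np.log(pred) - np.log(target)
    alpha = -np.mean(d)  # (1/n) * sum_j (log y*_j - log y_j)
    return np.sum((d + alpha) ** 2)

target = np.array([1.0, 2.0, 4.0])
pred = 2.0 * target  # correct relative depths, wrong global scale

print(log_mse(pred, target))                    # penalizes the scale mistake
print(scale_invariant_log_error(pred, target))  # close to zero
```

A prediction that is off only by a global factor gets (nearly) zero error under the shifted version, while the plain log error still penalizes it.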

### Datasets

#### NYUDepth
Dataset with indoor images

https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html

#### KITTI
Dataset with images taken from autonomous vehicles with corresponding depths

http://www.cvlibs.net/datasets/kitti/

### Results

Here are some examples for NYUDepth:
1. image
2. coarse prediction
3. fine-grained prediction
4. ground-truth

<img src="dl_results.png" width=700>



## Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue

[[arxiv]](https://arxiv.org/pdf/1603.04992.pdf), 2016, Adelaide

The authors propose an (almost) unsupervised method that requires only a set of stereopairs or images with known movement.

Getting a set of stereopaired images is easier than getting a set of labelled images.

Scheme of the algorithm:
1. Encoder network computes a depth map for the left image
2. Right image is warped (transformed) taking into account the estimated depth map and known movement
3. Model compares the original and reconstructed images to assess the reconstruction error

<img src="unsupervised_scheme.png" width=750>

During the warping step

$$I_w(x) = I_2\big(x + \frac{fB}{d(x)}\big) $$

each pixel of the right image $I_2$ is moved according to the depth $d(x)$ estimated from the first image (close pixels move more, far pixels move less), which gives the warped image $I_w$.
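
The warping step can be sketched for a rectified grayscale pair as follows. This is a simplified illustration, not the authors' implementation: it assumes a purely horizontal shift of $fB/d(x)$ pixels, clamps samples at the image border, and uses linear interpolation along each row; the function name and the sign convention of the shift are assumptions that depend on the rig.

```python
import numpy as np

def warp_right_to_left(right, depth, focal_px, baseline_m):
    """Reconstruct the left view by sampling the right image at x + f*B/d(x)."""
    h, w = depth.shape
    # Sampling position for every pixel: its own column plus the disparity.
    xs = np.arange(w, dtype=np.float64)[None, :] + focal_px * baseline_m / depth
    xs = np.clip(xs, 0, w - 1)  # border clamp
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = xs - x0
    rows = np.arange(h)[:, None]
    # Linear interpolation between the two neighboring columns.
    return (1 - frac) * right[rows, x0] + frac * right[rows, x1]

right = np.tile(np.arange(8.0), (2, 1))  # horizontal intensity ramp
depth = np.full((2, 8), 5.0)             # constant depth -> constant shift of 2 px
warped = warp_right_to_left(right, depth, focal_px=20.0, baseline_m=0.5)
reconstruction_error = np.abs(warped - right).mean()
```

Because the sampling position depends smoothly on the depth, the reconstruction error can be used as a training signal for the depth network.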

The encoder part of the trained network is then used as a depth estimator.


## Unsupervised Learning of Depth and Ego-Motion from Video

[[arxiv]](https://arxiv.org/abs/1704.07813), 2017, Google

This approach generalizes the previous one. It implements the same idea, but instead of a stereo image pair the model takes a short video sequence (several consecutive frames) as input.

The movement of the camera is not fixed anymore, now it is modeled as an arbitrary transformation T(p).

Ego-motion = visual odometry - determining the current location of a robot based on sequence of images from its viewpoint. Motion modeling is a well studied problem itself.

The model consists of 2 branches:
1. Depth CNN (network that calculates image depth)
2. Pose CNN (network that estimates the image transformation)

The view transformation $T$ learnt by the Pose CNN is then applied to the source images $I_{t-1}, I_{t+1}$ using the output of the first network, $D(p)$.

<img src= "google_architecture.png" width=500>

The model uses a standard photometric loss between 2 images - the source image $I_s$ and each of its reconstructed versions $I_r$:

$L = \sum_s \sum_p | I_s(p) - I_r(p) |$, where 

- $p$ - pixel
- $r$ - index of the reference image (for example $t-1$, $t+1$); the outer sum runs over these images
- $I_s$ - source image
- $I_r$ - version of the source image reconstructed from reference image $r$.
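
As a minimal sketch, the photometric loss is an absolute difference summed over pixels and over all reconstructions. The function name and the toy arrays below are illustrative assumptions.

```python
import numpy as np

def photometric_loss(image, reconstructions):
    """Sum of per-pixel absolute differences between an image and each reconstruction."""
    return sum(np.abs(image - rec).sum() for rec in reconstructions)

image = np.zeros((2, 2))
# Two hypothetical reconstructions, e.g. obtained from frames t-1 and t+1.
reconstructions = [np.ones((2, 2)), 2.0 * np.ones((2, 2))]
loss = photometric_loss(image, reconstructions)
```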

#### Network structure
The architecture of both networks is quite straightforward:

<img src="google_dispnet.png" width=750>

#### What does the view transformation look like
$$p_s := K \cdot T_{t \rightarrow s} \cdot D_t(p_t) \cdot K^{-1} \cdot p_t$$

where
- $K$ = camera intrinsics matrix
- $T_{t \rightarrow s}$ = view transformation matrix
- $D_t$ = predicted depth
- $p_t$ = (homogeneous) pixel coordinates in frame $t$

Since the new coordinates after the transformation are continuous, they use bilinear interpolation, which is differentiable and can therefore be used when fitting the network.
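
The projection and the sampling can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' code: $T$ is taken to be a 4x4 rigid transformation, the image is grayscale, samples are clamped at the border, and the function names and the toy intrinsics are hypothetical.

```python
import numpy as np

def project_pixels(depth, K, T):
    """Map every pixel p_t to continuous coordinates p_s = K T D_t(p_t) K^-1 p_t."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    p_t = np.stack([u, v, np.ones_like(u)]).reshape(3, -1).astype(np.float64)
    cam = (np.linalg.inv(K) @ p_t) * depth.reshape(1, -1)  # back-project to 3D
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])   # homogeneous 3D points
    proj = K @ (T @ cam_h)[:3]                             # move to the other view, project
    return (proj[:2] / proj[2:]).reshape(2, h, w)          # perspective divide

def bilinear_sample(img, coords):
    """Sample img (H x W) at continuous (x, y) coordinates of shape (2, H, W)."""
    h, w = img.shape
    x = np.clip(coords[0], 0, w - 1)
    y = np.clip(coords[1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * img[y0, x0] + wx * img[y0, x1]
    bottom = (1 - wx) * img[y1, x0] + wx * img[y1, x1]
    return (1 - wy) * top + wy * bottom

# Toy example: sideways camera translation of 0.5 with constant depth 50.
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 1.5], [0.0, 0.0, 1.0]])
T = np.eye(4)
T[0, 3] = 0.5
depth = np.full((3, 4), 50.0)
coords = project_pixels(depth, K, T)  # x shifts by f * t_x / Z = 1 pixel
image = np.arange(12.0).reshape(3, 4)
reconstructed = bilinear_sample(image, coords)
```

With an identity transformation the coordinates map onto themselves, which is a convenient sanity check for the chain of matrix products.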

#### What regularizers are used

To smooth the predicted depth map they add a regularization term, the $L_1$ norm of its gradients:

$$L = \sum_s \sum_p | I_s(p) - I_r(p) | +  \lambda_s \cdot L_{smooth}$$
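
A smoothness term of this kind can be sketched with finite differences. This is a first-order sketch matching the description above; the exact form of $L_{smooth}$ is an assumption here (some formulations penalize second-order gradients instead), and the function name is hypothetical.

```python
import numpy as np

def smoothness_l1(depth):
    """L1 norm of horizontal and vertical finite differences of a depth map."""
    dx = np.abs(np.diff(depth, axis=1))
    dy = np.abs(np.diff(depth, axis=0))
    return dx.sum() + dy.sum()

flat = np.full((3, 4), 2.0)            # constant depth: no penalty
ramp = np.tile(np.arange(4.0), (3, 1))  # depth grows by 1 per column
print(smoothness_l1(flat), smoothness_l1(ramp))
```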

The model performs poorly when there is motion in the scene. To make it more robust, they introduce a third network that predicts an explainability mask. It highlights the most difficult parts of the scene, such as those containing motion or occlusions. This mask $E(p)$ is then added as a weight to the view synthesis loss:

$$L = \sum_s \sum_p E(p) \cdot | I_s(p) - I_r(p) | + \lambda_s \cdot L_{smooth}$$

To prevent the model from always setting the explainability mask to zero, they had to add a regularizer for the mask too:

$$L = \sum_s \sum_p E(p) \cdot | I_s(p) - I_r(p) | + \lambda_s \cdot L_{smooth} + \lambda_e \cdot \sum L_{reg}(E_s^l)$$
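
The interplay between the mask weight and its regularizer can be sketched as below. The concrete form of $L_{reg}$ used here (cross-entropy against a constant label of 1) and the value of `lambda_e` are assumptions made for this sketch; the smoothness term is omitted for brevity and the function name is hypothetical.

```python
import numpy as np

def masked_loss(image, reconstruction, mask, lambda_e=0.2, eps=1e-7):
    """Mask-weighted L1 reconstruction error plus a penalty on small mask values."""
    weighted = (mask * np.abs(image - reconstruction)).sum()
    # Cross-entropy against label 1: zero for mask == 1, grows as the mask approaches 0.
    reg = -np.log(np.clip(mask, eps, 1.0)).sum()
    return weighted + lambda_e * reg

image = np.zeros((2, 2))
reconstruction = np.ones((2, 2))
full = masked_loss(image, reconstruction, np.ones((2, 2)))
half = masked_loss(image, reconstruction, np.full((2, 2), 0.5))
zero = masked_loss(image, reconstruction, np.zeros((2, 2)))
```

Without the regularizer, an all-zero mask would give zero loss; with it, switching the mask off everywhere becomes more expensive than explaining the pixels.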

Here are some examples of explainability masks

<img src = "google_exp.png" width=700>



## Unsupervised Monocular Depth Estimation with Left-Right Consistency

[[arxiv]](https://arxiv.org/abs/1609.03677), 2017, UCL

The authors emphasize the importance of tackling moving objects in the scene. That's why they 