# End-to-End Learning of Geometry and Context for Deep Stereo Regression

[arXiv](https://arxiv.org/abs/1703.04309) | [TensorFlow 1](https://github.com/Jiankai-Sun/GC-Net) | [TensorFlow 2](https://github.com/MaidouPP/gc_net_stereo) | [Keras](https://github.com/LinHungShi/GCNetwork)

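* A siamese residual FCN extracts features + a cost volume is built from them + **a 3D CNN learns the regularization + soft argmin**
* The paper proposes soft argmin, which fixes two problems of the plain argmin: it is **discrete** (we want subpixel estimates) and **non-differentiable** (the loss cannot be back-propagated). The disparity can therefore be regressed directly from the cost volume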
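
The cost volume step in the first bullet can be sketched on a single scanline. This is a minimal illustration in plain Python, not the paper's implementation: the function name `build_cost_volume`, the nested-list layout, and the zero padding for out-of-range matches are all assumptions made for the example.

```python
def build_cost_volume(left_feats, right_feats, max_disp):
    """Concatenation cost volume for one scanline (illustrative sketch).

    left_feats, right_feats: lists of length W, each item a list of F floats.
    Returns a nested list of shape [max_disp][W][2F].
    """
    width = len(left_feats)
    feat_dim = len(left_feats[0])
    volume = []
    for d in range(max_disp):
        plane = []
        for x in range(width):
            if x - d >= 0:
                # left pixel x is paired with right pixel x - d
                plane.append(left_feats[x] + right_feats[x - d])
            else:
                # no valid match inside the image: zero padding (assumption)
                plane.append([0.0] * (2 * feat_dim))
        volume.append(plane)
    return volume


# toy features: the right scanline is the left one shifted by one pixel
left = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
right = [[3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
volume = build_cost_volume(left, right, max_disp=2)
```

At `d = 1` and `x = 1`, the entry holds the left feature next to an identical right feature, which is the kind of agreement the 3D CNN is expected to turn into a low cost.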

![](imgs/parameters.png)

![](imgs/architecture.png)

# Unary Features
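Matching is done on learned feature representations rather than raw pixels, which makes it more robust to changes in photometric appearance.

The unary network is just a simple residual FCN; the point worth discussing here is the network depth and the parameter count.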
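
As a rough aid for reasoning about parameter counts, the number of parameters in a convolution layer is `kernel_size ** dims * c_in * c_out` plus one bias per output channel. The sketch below assumes 3x3 kernels and 32 channels purely for illustration; the helper `conv_params` is hypothetical, and the exact layer configuration should be taken from the parameter table above.

```python
def conv_params(kernel_size, dims, c_in, c_out, bias=True):
    """Parameter count of a dense convolution with a cubic/square kernel."""
    weights = (kernel_size ** dims) * c_in * c_out
    return weights + (c_out if bias else 0)


# assumed layer shapes, for illustration only
params_2d = conv_params(kernel_size=3, dims=2, c_in=32, c_out=32)
params_3d = conv_params(kernel_size=3, dims=3, c_in=32, c_out=32)
```

With these assumed shapes a 2D layer has 9248 parameters and the 3D equivalent has 27680, so each 3D layer costs roughly three times as many parameters at the same channel width.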


# 3D CNN to learn regularization

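Regressing the result directly from the cost volume with soft argmin is a good idea, and in principle it should allow a very compact model. Judging from the paper, however, the 3D CNN that learns the regularization had to be made very heavy, which in the end leads to very high GPU memory usage.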
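
A back-of-the-envelope estimate shows where the memory goes. The sketch below assumes a 256x512 training crop, a maximum disparity of 192, 32 unary feature channels, features at half resolution, and float32 storage; these numbers and the helper `cost_volume_bytes` are assumptions for illustration and should be checked against the parameter table above.

```python
def cost_volume_bytes(height, width, max_disp, feat_channels,
                      downsample=2, bytes_per_elem=4):
    """Size of one concatenation cost volume for a single image pair."""
    d = max_disp // downsample
    h = height // downsample
    w = width // downsample
    # left and right features are concatenated, hence 2 * feat_channels
    return d * h * w * (2 * feat_channels) * bytes_per_elem


size_bytes = cost_volume_bytes(256, 512, 192, 32)
size_mib = size_bytes / 2 ** 20
```

Under these assumptions a single cost volume already takes 768 MiB, before counting the activations of the 3D convolutions that process it or the values kept for back-propagation.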

# Soft ArgMin

***

![](imgs/differentiable_argmin.png)


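The paper proposes soft argmin:
$$\text{soft argmin} := \sum_{d=0}^{D_{max}} d \times \sigma(-c_d)$$

where $\sigma(\cdot)$ is the softmax over the disparity dimension and $c_d$ is the predicted cost at disparity $d$.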

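It fixes two problems of the plain argmin:
* argmin only yields integer solutions, whereas we want subpixel disparities
* argmin is not differentiable, so it cannot be back-propagated through when it is part of the loss function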
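
Both points can be checked on a toy 1D cost vector. The following is a minimal plain-Python sketch of the formula above, not the paper's code; the helper names are made up for the example. The gradient expression $\partial\,\text{soft argmin} / \partial c_j = -p_j (j - \text{soft argmin})$, with $p = \sigma(-c)$, follows from differentiating the softmax.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def soft_argmin(costs):
    """Expected disparity under softmax(-cost)."""
    probs = softmax([-c for c in costs])
    return sum(d * p for d, p in enumerate(probs))


def soft_argmin_grad(costs):
    """Analytic gradient of soft_argmin with respect to each cost."""
    probs = softmax([-c for c in costs])
    expected = sum(d * p for d, p in enumerate(probs))
    return [-p * (j - expected) for j, p in enumerate(probs)]


# two equally good neighbouring disparities: the true minimum lies between them
costs = [3.0, 1.0, 1.0, 3.0]
hard = min(range(len(costs)), key=costs.__getitem__)  # integer index only
soft = soft_argmin(costs)                             # subpixel estimate
grad = soft_argmin_grad(costs)                        # defined everywhere
```

Here the hard argmin has to pick index 1, while the soft argmin returns 1.5 by symmetry, and every entry of `grad` is a finite number that a loss can propagate through.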


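At the same time, as the figure shows, the estimate can be biased when the input distribution is multi-modal; the only remedy is to hope that the preceding network learns to produce a uni-modal distribution.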


> However, compared to the argmin operation, its output is influenced by all values. This leaves it susceptible to multi-modal distributions, as the output will not take the most likely. Rather, it will estimate a weighted average of all modes. To overcome this limitation, we rely on the network’s regularization to produce a disparity probability
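
The weighted-average behaviour described in the quote is easy to reproduce in 1D. This is an illustrative plain-Python sketch with a made-up cost vector that has two equally strong minima; the helper `soft_argmin` is defined here only for the example.

```python
import math


def soft_argmin(costs):
    """Expected disparity under softmax(-cost), computed stably."""
    lowest = min(costs)
    weights = [math.exp(-(c - lowest)) for c in costs]
    total = sum(weights)
    return sum(d * w for d, w in enumerate(weights)) / total


# two equally likely modes at d = 1 and d = 7, high cost everywhere else
costs = [10.0, 0.0, 10.0, 10.0, 10.0, 10.0, 10.0, 0.0, 10.0]
hard = min(range(len(costs)), key=costs.__getitem__)
soft = soft_argmin(costs)
```

The hard argmin returns one of the modes (index 1), whereas the soft argmin lands at 4.0, midway between the two modes, at a disparity whose own cost is high.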

## Multi-modal test

In [4]:
import torch
import torch.nn.functional as F

In [1]:
def get_indices(cv, S):
    """
    Build displacement index grids for a 2D correlation map.

    cv: (C_h, C_w) correlation map with C_h == C_w == 2*S + 1 (only used for context)
    returns: indices_y, indices_x, each of shape (2*S+1, 2*S+1) with values in [-S, S]
    """
    offsets = torch.linspace(-S, S, 2*S+1)
    # indices_y varies along rows, indices_x varies along columns
    indices_y = offsets.view(2*S+1, 1).expand(2*S+1, 2*S+1)
    indices_x = offsets.view(1, 2*S+1).expand(2*S+1, 2*S+1)

    return indices_y, indices_x

In [41]:
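def gen_flow_soft(corr):
    """
    Soft argmax over a 2D correlation map (higher score = better match).

    corr: (C_h, C_w); returns tensor([flow_x, flow_y]) with subpixel values
    """
    C_h, C_w = corr.size()
    # softmax over all C_h*C_w candidates, then back to the 2D layout
    softmax = F.softmax(corr.view(C_h*C_w), dim = 0).view(C_h, C_w)
    indices_y, indices_x = get_indices(corr, C_h // 2)
    # expected displacement under the softmax distribution
    soft_argmax_y = (softmax * indices_y).sum()
    soft_argmax_x = (softmax * indices_x).sum()
    # debug output: the y and x index grids side by side
    print(torch.cat([indices_y, indices_x], dim = 1))
    return torch.stack([soft_argmax_x, soft_argmax_y], dim = 0)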

In [42]:
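def gen_flow_hard(corr):
    """Hard argmax over a 2D correlation map; returns integer-valued tensor([flow_x, flow_y])."""
    # row index of the global maximum
    row_max, _ = corr.max(1)
    _, flow_y = row_max.max(0)

    # column index of the global maximum
    col_max, _ = corr.max(0)
    _, flow_x = col_max.max(0)
    # shift the indices so that the center of the map means zero displacement
    flow_hard = torch.stack([flow_x, flow_y], dim = 0).float() - (corr.size(0) // 2)
    return flow_hard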

Uni-modal case: OK, the soft and hard results agree

In [58]:
corr_uni_modal = torch.Tensor(
    [
        [1  ,   2,    3],
        [0  ,   1,    0],
        [0  , 100,    0],
    ])

In [59]:
gen_flow_soft(corr_uni_modal)

tensor([[-1., -1., -1., -1.,  0.,  1.],
        [ 0.,  0.,  0., -1.,  0.,  1.],
        [ 1.,  1.,  1., -1.,  0.,  1.]])


tensor([ 6.4600e-43,  1.0000e+00])

In [60]:
gen_flow_hard(corr_uni_modal)

tensor([ 0.,  1.])

Multi-modal case: the soft estimate is biased

In [105]:
corr_multi_modal = torch.Tensor(
    [
        [100  ,   100,    3],
        [0  ,   1,    0],
        [0  ,  4.5,    101],
    ])

In [106]:
gen_flow_soft(corr_multi_modal)

tensor([[-1., -1., -1., -1.,  0.,  1.],
        [ 0.,  0.,  0., -1.,  0.,  1.],
        [ 1.,  1.,  1., -1.,  0.,  1.]])


tensor([ 0.3642,  0.1522])

In [107]:
gen_flow_hard(corr_multi_modal)

tensor([ 1.,  1.])