# **Convolutional neural network architecture for geometric matching**

**Authors: Ignacio Rocco (DI ENS, INRIA), Relja Arandjelović (DI ENS, INRIA), Josef Sivic (DI ENS, INRIA, CIIRC)**

**Official Github**: https://github.com/ignacio-rocco/cnngeometric_pytorch

---

**Edited By Su Hyung Choi (Key Summary & Code Practice)**

If you find any issues with this script, please open a PR on the repository below.

**[Github: @JonyChoi - Computer Vision Paper Reviews]** https://github.com/jonychoi/Computer-Vision-Paper-Reviews

Edited Jan 10 2022

---

### **Abstract**

<table>
    <thead>
        <tr>
            <th>
                Abstract
            </th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>
                <i>
                    We address the problem of determining correspondences
                    between two images in agreement with a geometric model
                    such as an affine or thin-plate spline transformation, and
                    estimating its parameters. The contributions of this work
                    are three-fold. First, we propose a convolutional neural network architecture for geometric matching. The architecture
                    is based on three main components that mimic the standard
                    steps of feature extraction, matching and simultaneous inlier detection and model parameter estimation, while being
                    trainable end-to-end. Second, we demonstrate that the network parameters can be trained from synthetically generated imagery without the need for manual annotation and
                    that our matching layer significantly increases generalization capabilities to never seen before images. Finally, we
                    show that the same model can perform both instance-level
                    and category-level matching giving state-of-the-art results
                    on the challenging Proposal Flow dataset.
                </i>
            </td>
        </tr>
    </tbody>
</table>

### **1. Introduction**


<table>
    <thead>
        <tr>
            <th>
                Introduction
            </th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>
                <p>
                    Estimating correspondences between images is one of
                    the fundamental problems in computer vision [19, 25] with
                    applications ranging from large-scale 3D reconstruction [3]
                    to image manipulation [21] and semantic segmentation
                    [42]. Traditionally, correspondences consistent with a geometric model such as epipolar geometry or planar affine
                    transformation, are computed by detecting and matching
                    local features (such as SIFT [38] or HOG [12, 22]), followed by pruning incorrect matches using local geometric
                    constraints [43, 47] and robust estimation of a global geometric transformation using algorithms such as RANSAC
                    [18] or Hough transform [32, 34, 38]. This approach works
                    well in many cases but fails in situations that exhibit (i) large
                    changes of depicted appearance due to e.g. intra-class variation [22], or (ii) large changes of scene layout or non-rigid deformations that require complex geometric models with
                    many parameters which are hard to estimate in a manner
                    robust to outliers.
                </p>
                <table>
                    <tbody>
                        <tr>
                            <td>
                                <img src="./imgs/figure1.png" width="550px" />
                            </td>
                            <td>
                                Figure 1: Our trained geometry estimation network automatically
                                aligns two images with substantial appearance differences. It is
                                able to estimate large deformable transformations robustly in the
                                presence of clutter.
                            </td>
                        </tr>
                    </tbody>
                </table>
                <p>
                    In this work we build on the traditional approach and
                    develop a convolutional neural network (CNN) architecture
                    that mimics the standard matching process. First, we replace the standard local features with powerful trainable
                    convolutional neural network features [31, 46], which allows us to handle large changes of appearance between
                    the matched images. Second, we develop trainable matching and transformation estimation layers that can cope with
                    noisy and incorrect matches in a robust way, mimicking the
                    good practices in feature matching such as the second nearest neighbor test [38], neighborhood consensus [43, 47] and
                    Hough transform-like estimation [32, 34, 38].
                </p>
                <p>
                    The outcome is a convolutional neural network architecture trainable for the end task of geometric matching,
                    which can handle large appearance changes, and is therefore
                    suitable for both instance-level and category-level matching
                    problems.
                </p>
            </td>
        </tr>
    </tbody>
</table>


### **2. Related Work**


<table>
    <thead>
        <tr>
            <th>
                Related Work
            </th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>
                <p>            
                    The classical approach for finding correspondences involves identifying interest points and computing local descriptors around these points [10, 11, 24, 37, 38, 39, 43].
                    While this approach performs relatively well for instance-level matching, the feature detectors and descriptors lack
                    the generalization ability for category-level matching.
                </p>
                <p>
                    Recently, convolutional neural networks have been used
                    to learn powerful feature descriptors which are more robust
                    to appearance changes than the classical descriptors [9, 23,
                    28, 45, 52]. However, these works still divide the image into
                    a set of local patches and extract a descriptor individually
                    from each patch. Extracted descriptors are then compared
                    with an appropriate distance measure [9, 28, 45], by directly
                    outputting a similarity score [23, 52], or even by directly
                    outputting a binary matching/non-matching decision [4].
                </p>
                <p>
                    In this work, we take a different approach, treating the
                    image as a whole, instead of a set of patches. Our approach
                    has the advantage of capturing the interaction of the different parts of the image to a greater extent, which is not possible when the image is divided into a set of local regions.
                </p>
                <p>
                    Related are also network architectures for estimating
                    inter-frame motion in video [17, 48, 50] or instance-level
                    homography estimation [14]; however, their goal is very different from ours, targeting high-precision correspondence
                    with very limited appearance variation and background
                    clutter. Closer to us is the network architecture of [29]
                    which, however, tackles a different problem of fine-grained
                    category-level matching (different species of birds) with
                    limited background clutter and small translations and scale
                    changes, as their objects are largely centered in the image.
                    In addition, their architecture is based on a different matching layer, which we show not to perform as well as the
                    matching layer used in our work.
                </p>
                <p>
                    Some works, such as [11, 15, 22, 30, 35, 36], have addressed the hard problem of category-level matching, but
                    rely on traditional non-trainable optimization for matching
                    [11, 15, 30, 35, 36], or guide the matching using object proposals [22]. On the contrary, our approach is fully trainable
                    in an end-to-end manner and does not require any optimization procedure at evaluation time, or guidance by object proposals.
                </p>
                <p>
                    Others [33, 44, 53] have addressed the problems of instance and category-level correspondence by performing
                    joint image alignment. However, these methods differ from
                    ours as they: (i) require class labels; (ii) don’t use CNN features; (iii) jointly align a large set of images, while we align
                    image pairs; and (iv) don’t use a trainable CNN architecture
                    for alignment as we do.
                </p>
            </td>
        </tr>
    </tbody>
</table>


### **3. Architecture for geometric matching**

<table>
    <thead>
        <tr>
            <th>
                Architecture for geometric matching
            </th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>
                <p>
                    In this section, we introduce a new convolutional neural network architecture for estimating parameters of a geometric transformation between two input images. The architecture is designed to mimic the classical computer vision pipeline (e.g. [40]), while using differentiable modules
                    so that it is trainable end-to-end for the geometry estimation task. The classical approach consists of the following
                    stages: (i) local descriptors (e.g. SIFT) are extracted from
                    both input images, (ii) the descriptors are matched across
                    images to form a set of tentative correspondences, which
                    are then used to (iii) robustly estimate the parameters of the
                    geometric model using RANSAC or Hough voting.
                </p>     
                <table>
                    <tbody>
                        <tr>
                            <td>
                                <img src="./imgs/figure2.png" width="500px" />
                            </td>
                            <td>
                                <img src="./imgs/figure2_description.png" width="500px" />
                            </td>
                        </tr>
                    </tbody>
                </table>
                <p>
                    Our architecture, illustrated in Fig. 2, mimics this process by: (i) passing input images I<sub>A</sub> and I<sub>B</sub> through a
                    siamese architecture consisting of convolutional layers, thus
                    extracting feature maps f<sub>A</sub> and f<sub>B</sub> which are analogous to
                    dense local descriptors, (ii) matching the feature maps (“descriptors”) across images into a tentative correspondence
                    map f<sub>AB</sub>, followed by a (iii) regression network which directly outputs the parameters of the geometric model, θ̂, in
                    a robust manner. The inputs to the network are the two images, and the outputs are the parameters of the chosen geometric model, e.g. a 6-D vector for an affine transformation.
                </p>
                <p>
                    In the following, we describe each of the three stages in detail.
                </p>
            </td>
        </tr>
    </tbody>
</table>
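
As a small illustration of the network output described above, the sketch below shows one way a 6-D affine parameter vector could be interpreted and applied to point coordinates. It uses NumPy rather than the official PyTorch code. The row-major layout `[a11, a12, tx, a21, a22, ty]` and the helper name `apply_affine` are assumptions made for this example and are not taken from the paper text.

```python
import numpy as np

def apply_affine(theta, points):
    """Apply a 6-D affine parameter vector to an (N, 2) array of (x, y) points.

    Assumed layout (hypothetical, for illustration only):
    theta = [a11, a12, tx, a21, a22, ty], i.e. a row-major 2x3 matrix [A | t].
    """
    theta = np.asarray(theta, dtype=float).reshape(2, 3)
    A, t = theta[:, :2], theta[:, 2]
    # Each point p is mapped to A @ p + t.
    return np.asarray(points, dtype=float) @ A.T + t

# Corners of a square in normalized coordinates.
corners = np.array([[-1.0, -1.0], [1.0, -1.0], [1.0, 1.0], [-1.0, 1.0]])

# The identity parameters leave the points unchanged.
unchanged = apply_affine([1, 0, 0, 0, 1, 0], corners)

# Scale by 2 and shift x by 0.5.
moved = apply_affine([2, 0, 0.5, 0, 2, 0], corners)
```

A thin-plate spline model would need more parameters than the six used here; only the affine case mentioned in the text is sketched.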

### **3.1. Feature extraction**

<table>
    <thead>
        <tr>
            <th>
                Feature extraction
            </th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>
                <p>
                    The first stage of the pipeline is feature extraction, for
                    which we use a standard CNN architecture. A CNN without fully connected layers takes an input image and produces a feature map f ∈ ℝ<sup>h×w×d</sup>, which can be interpreted
                    as an h × w dense spatial grid of d-dimensional local descriptors. A similar interpretation has been used previously
                    in instance retrieval [5, 7, 8, 20] demonstrating high discriminative power of CNN-based descriptors. Thus, for feature extraction we use the VGG-16 network [46], cropped
                    at the pool4 layer (before the ReLU unit), followed by
                    per-feature L2-normalization. We use a pre-trained model,
                    originally trained on ImageNet [13] for the task of image
                    classification. As shown in Fig. 2, the feature extraction network is duplicated and arranged in a siamese configuration
                    such that the two input images are passed through two identical networks which share parameters.
                </p>
            </td>
        </tr>
    </tbody>
</table>
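
The per-feature L2-normalization step can be sketched as follows. This is a NumPy illustration, not the official PyTorch implementation: the random arrays stand in for the output of the truncated VGG-16, and the sizes `h = w = 15`, `d = 512`, the `eps` constant and the function name are assumptions chosen for the example.

```python
import numpy as np

def l2_normalize_features(f, eps=1e-6):
    """Scale every d-dimensional descriptor of an (h, w, d) feature map to unit L2 norm."""
    norm = np.sqrt(np.sum(f ** 2, axis=-1, keepdims=True) + eps)
    return f / norm

rng = np.random.default_rng(0)
h, w, d = 15, 15, 512  # illustrative sizes only

# Stand-ins for the raw feature maps of images A and B. In a siamese
# configuration, both would come from the same network with shared weights.
raw_A = rng.standard_normal((h, w, d))
raw_B = rng.standard_normal((h, w, d))

f_A = l2_normalize_features(raw_A)
f_B = l2_normalize_features(raw_B)
```

After this step, the scalar product of two descriptors equals their cosine similarity, which is what the matching stage relies on.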

### **3.2. Matching network**


<table>
    <thead>
        <tr>
            <th>
                Matching network
            </th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>
                <p>
                    The image features produced by the feature extraction
                    networks should be combined into a single tensor as input to
                    the regressor network to estimate the geometric transformation. We first describe the classical approach for generating
                    tentative correspondences, and then present our matching
                    layer which mimics this process.
                </p>       
                <img src="./imgs/figure3.png" width="500px" />
                <img src="./imgs/figure3_description.png" width="500px" />
                <p>
                    <strong>Tentative matches in classical geometry estimation.</strong>
                    Classical methods start by computing similarities between
                    all pairs of descriptors across the two images. From this
                    point on, the original descriptors are discarded as all the
                    necessary information for geometry estimation is contained
                    in the pairwise descriptor similarities and their spatial locations. Secondly, the pairs are pruned by either thresholding
                    the similarity values, or, more commonly, only keeping the
                    matches which involve the nearest (most similar) neighbors.
                    Furthermore, the second nearest neighbor test [38] prunes
                    the matches further by requiring that the match strength is
                    significantly stronger than the second best match involving
                    the same descriptor, which is very effective at discarding
                    ambiguous matches.
                </p>
                <p>
                    <strong>Matching layer.</strong> Our matching layer applies a similar procedure. Analogously to the classical approach, only descriptor similarities and their spatial locations should be
                    considered for geometry estimation, and not the original descriptors themselves.
                </p>
                <p>
                    To achieve this, we propose to use a correlation layer
                    followed by normalization. Firstly, all pairs of similarities
                    between descriptors are computed in the correlation layer.
                    Secondly, similarity scores are processed and normalized
                    such that ambiguous matches are strongly down-weighted.
                </p>
                <p>
                    In more detail, given L2-normalized dense feature
                    maps f<sub>A</sub>, f<sub>B</sub> ∈ ℝ<sup>h×w×d</sup>, the correlation map c<sub>AB</sub> ∈ ℝ<sup>h×w×(h×w)</sup> outputted by the correlation layer contains at
                    each position the scalar product of a pair of individual descriptors <b>f</b><sub>A</sub> ∈ f<sub>A</sub> and <b>f</b><sub>B</sub> ∈ f<sub>B</sub>, as detailed in Eq. (1).
                </p>
                <img src="./imgs/equation1.png" width="500px" />
                <p>
                    where (i, j) and (i<sub>k</sub>, j<sub>k</sub>) indicate the individual feature positions in the h×w dense feature maps, and k = h(j<sub>k</sub> − 1) + i<sub>k</sub>
                    is an auxiliary indexing variable for (i<sub>k</sub>, j<sub>k</sub>).
                </p>
                <p>
                    A diagram of the correlation layer is presented in Fig. 3.
                    Note that at a particular position (i, j), the correlation map
                    c<sub>AB</sub> contains the similarities between <b>f</b><sub>B</sub> at that position and
                    all the features of f<sub>A</sub>.
                </p>
                <p>
                    As is done in the classical methods for tentative correspondence estimation, it is important to postprocess the
                    pairwise similarity scores to remove ambiguous matches.
                    To this end, we apply a channel-wise normalization of the
                    correlation map at each spatial location to produce the final tentative correspondence map f<sub>AB</sub>. The normalization
                    is performed by ReLU, to zero out negative correlations,
                    followed by L2-normalization, which has two desirable effects. First, let us consider the case when descriptor <b>f</b><sub>B</sub> correlates well with only a single feature in f<sub>A</sub>. In this case,
                    the normalization will amplify the score of the match, akin
                    to the nearest neighbor matching in classical geometry estimation. Second, in the case of the descriptor <b>f</b><sub>B</sub> matching
                    multiple features in f<sub>A</sub> due to the existence of clutter or
                    repetitive patterns, matching scores will be down-weighted
                    similarly to the second nearest neighbor test [38]. However,
                    note that both the correlation and the normalization operations are differentiable with respect to the input descriptors,
                    which facilitates backpropagation thus enabling end-to-end
                    learning.
                </p>
                <p>
                    <strong>Discussion.</strong> The first step of our matching layer, namely
                    the correlation layer, is somewhat similar to layers used in
                    DeepMatching [50] and FlowNet [17]. However, DeepMatching [50] only uses deep RGB patches and no part
                    of their architecture is trainable. FlowNet [17] uses a spatially constrained correlation layer such that similarities are
                    only computed in a restricted spatial neighborhood thus
                    limiting the range of geometric transformations that can be
                    captured. This is acceptable for their task of learning to estimate optical flow, but is inappropriate for larger transformations that we consider in this work. Furthermore, neither
                    of these methods performs score normalization, which we
                    find to be crucial in dealing with cluttered scenes.
                </p>
                <p>
                    Previous works have used other matching layers to combine descriptors across images, namely simple concatenation of descriptors along the channel dimension [14] or subtraction [29]. However, these approaches suffer from two
                    problems. First, as following layers are typically convolutional, these methods also struggle to handle large transformations as they are unable to detect long-range matches.
                    Second, when concatenating or subtracting descriptors, instead of computing pairwise descriptor similarities as is
                    commonly done in classical geometry estimation and mimicked by the correlation layer, image content information
                    is directly outputted. To further illustrate why this can be
                    problematic, consider two pairs of images that are related
                    with the same geometric transformation – the concatenation
                    and subtraction strategies will produce different outputs for
                    the two cases, making it hard for the regressor to deduce the geometric transformation. In contrast, the correlation layer
                    output is likely to produce similar correlation maps for the
                    two cases, regardless of the image content, thus simplifying the problem for the regressor. In line with this intuition,
                    in Sec. 5.5 we show that the concatenation and subtraction
                    methods indeed have difficulties generalizing beyond the
                    training set, while our correlation layer achieves generalization yielding superior results.
                </p>
            </td>
        </tr>
    </tbody>
</table>
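
The correlation layer and the normalization described above can be sketched in a few lines. This is a NumPy illustration under stated assumptions, not the official PyTorch implementation: the function names, the small sizes `h = 4`, `w = 5`, `d = 16`, the `eps` constant and the random inputs are all made up for the example. Indices are 0-based, so the auxiliary index from Eq. (1) becomes `k = h * j_k + i_k`.

```python
import numpy as np

def correlation_layer(f_A, f_B):
    """Return c_AB with c_AB[i, j, k] = <f_B[i, j], f_A[i_k, j_k]>, k = h * j_k + i_k."""
    h, w, d = f_A.shape
    # Flatten f_A's spatial grid column by column so that row k of f_A_flat
    # holds the descriptor at (i_k, j_k) with k = h * j_k + i_k.
    f_A_flat = f_A.transpose(1, 0, 2).reshape(h * w, d)
    return np.einsum("ijd,kd->ijk", f_B, f_A_flat)

def normalize_matches(c_AB, eps=1e-6):
    """ReLU, then L2-normalization over the channel (k) axis at each location."""
    c = np.maximum(c_AB, 0.0)
    return c / np.sqrt(np.sum(c ** 2, axis=-1, keepdims=True) + eps)

def unit(v):
    """L2-normalize along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
h, w, d = 4, 5, 16  # illustrative sizes only
f_A = unit(rng.standard_normal((h, w, d)))
f_B = unit(rng.standard_normal((h, w, d)))

c_AB = correlation_layer(f_A, f_B)   # shape (h, w, h * w)
f_AB = normalize_matches(c_AB)       # tentative correspondence map

# Spot-check one entry against the definition.
i, j, i_k, j_k = 2, 3, 1, 4
k = h * j_k + i_k
assert np.isclose(c_AB[i, j, k], f_B[i, j] @ f_A[i_k, j_k])

# Effect of the normalization on a single location with four candidate matches.
unique = normalize_matches(np.array([[[0.9, 0.1, -0.2, 0.0]]]))
ambiguous = normalize_matches(np.array([[[0.9, 0.9, -0.2, 0.0]]]))
```

In the toy example at the end, the best score of the unambiguous location rises above its raw value of 0.9, while the same raw score drops when a second, equally strong candidate is present. This mirrors the two effects attributed to the normalization in the text.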