# **Convolutional neural network architecture for geometric matching**

**Authors: Ignacio Rocco, Relja Arandjelovic, Josef Sivic (DI ENS, INRIA, CIIRC)**

**Official Github**: https://github.com/ignacio-rocco/cnngeometric_pytorch

---

**Edited By Su Hyung Choi (Key Summary & Code Practice)**

If you find any issues with this script, please open a PR on the repository below.

**[Github: @JonyChoi - Computer Vision Paper Reviews]** https://github.com/jonychoi/Computer-Vision-Paper-Reviews

Edited Jan 10 2022

---

### **Abstract**

<table>
    <thead>
        <tr>
            <th>
                Abstract
            </th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>
                <i>
                    We address the problem of determining correspondences
                    between two images in agreement with a geometric model
                    such as an affine or thin-plate spline transformation, and
                    estimating its parameters. The contributions of this work
                    are three-fold. First, we propose a convolutional neural network architecture for geometric matching. The architecture
                    is based on three main components that mimic the standard
                    steps of feature extraction, matching and simultaneous inlier detection and model parameter estimation, while being
                    trainable end-to-end. Second, we demonstrate that the network parameters can be trained from synthetically generated imagery without the need for manual annotation and
                    that our matching layer significantly increases generalization capabilities to never seen before images. Finally, we
                    show that the same model can perform both instance-level
                    and category-level matching giving state-of-the-art results
                    on the challenging Proposal Flow dataset.
                </i>
            </td>
        </tr>
    </tbody>
</table>
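
The abstract notes that the network can be trained from synthetically generated imagery without manual annotation. The following is a minimal sketch of why no annotation is needed, not the authors' data pipeline: if the transformation is sampled by the training code, its parameters are known by construction. The function names, the perturbation range, and the use of a point grid in place of image pixels are all assumptions made for illustration.

```python
import random


def random_affine(rng, max_perturbation=0.3):
    """Sample a 2x3 affine matrix as a random perturbation of the identity.

    The perturbation range is a hypothetical choice, not a value from the paper.
    """
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    return [
        [value + rng.uniform(-max_perturbation, max_perturbation) for value in row]
        for row in identity
    ]


def apply_affine(theta, point):
    """Map a 2D point (x, y) through the affine matrix theta."""
    x, y = point
    return (
        theta[0][0] * x + theta[0][1] * y + theta[0][2],
        theta[1][0] * x + theta[1][1] * y + theta[1][2],
    )


def make_training_pair(rng, grid_size=5):
    """Return (theta, source_points, target_points) for one synthetic example.

    theta is the ground-truth label; it comes for free because we sampled it.
    """
    theta = random_affine(rng)
    # Regular grid in normalized coordinates [-1, 1], standing in for pixels.
    steps = [-1.0 + 2.0 * i / (grid_size - 1) for i in range(grid_size)]
    source = [(x, y) for y in steps for x in steps]
    target = [apply_affine(theta, p) for p in source]
    return theta, source, target


rng = random.Random(0)
theta, source, target = make_training_pair(rng)
```

In a real setup the same sampled `theta` would warp an image rather than a grid of points, and the network would be trained to recover `theta` from the image pair.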

### **1. Introduction**


<table>
    <thead>
        <tr>
            <th>
                Introduction
            </th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>
                <p>
                    Estimating correspondences between images is one of
                    the fundamental problems in computer vision [19, 25] with
                    applications ranging from large-scale 3D reconstruction [3]
                    to image manipulation [21] and semantic segmentation
                    [42]. Traditionally, correspondences consistent with a geometric model such as epipolar geometry or planar affine
                    transformation, are computed by detecting and matching
                    local features (such as SIFT [38] or HOG [12, 22]), followed by pruning incorrect matches using local geometric
                    constraints [43, 47] and robust estimation of a global geometric transformation using algorithms such as RANSAC
                    [18] or Hough transform [32, 34, 38]. This approach works
                    well in many cases but fails in situations that exhibit (i) large
                    changes of depicted appearance due to e.g. intra-class variation [22], or (ii) large changes of scene layout or non-rigid deformations that require complex geometric models with
                    many parameters which are hard to estimate in a manner
                    robust to outliers.
                </p>
                <table>
                    <tbody>
                        <tr>
                            <td>
                                <img src="./imgs/figure1.png" width="550px" />
                            </td>
                            <td>
                                Figure 1: Our trained geometry estimation network automatically
                                aligns two images with substantial appearance differences. It is
                                able to estimate large deformable transformations robustly in the
                                presence of clutter.
                            </td>
                        </tr>
                    </tbody>
                </table>
                <p>
                    In this work we build on the traditional approach and
                    develop a convolutional neural network (CNN) architecture
                    that mimics the standard matching process. First, we replace the standard local features with powerful trainable
                    convolutional neural network features [31, 46], which allows us to handle large changes of appearance between
                    the matched images. Second, we develop trainable matching and transformation estimation layers that can cope with
                    noisy and incorrect matches in a robust way, mimicking the
                    good practices in feature matching such as the second nearest neighbor test [38], neighborhood consensus [43, 47] and
                    Hough transform-like estimation [32, 34, 38].
                </p>
                <p>
                    The outcome is a convolutional neural network architecture trainable for the end task of geometric matching,
                    which can handle large appearance changes, and is therefore
                    suitable for both instance-level and category-level matching
                    problems.
                </p>
            </td>
        </tr>
    </tbody>
</table>
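
The introduction lists the second nearest neighbor test as one of the good practices of classical feature matching that the network is meant to mimic. As a rough sketch of that classical test (not of the paper's trainable layer), a match is kept only when the best candidate is clearly closer than the runner-up. The function names and the `max_ratio=0.8` threshold are assumptions chosen for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ratio_test_matches(desc_a, desc_b, max_ratio=0.8):
    """Return (index_a, index_b) pairs passing the second nearest neighbor test.

    max_ratio is a hypothetical threshold; lower values are stricter.
    """
    matches = []
    for i, descriptor in enumerate(desc_a):
        # Rank every candidate in desc_b by distance to this descriptor.
        ranked = sorted((euclidean(descriptor, other), j) for j, other in enumerate(desc_b))
        if len(ranked) < 2:
            continue  # the test needs a runner-up to compare against
        (best, j), (second, _) = ranked[0], ranked[1]
        # Keep the match only if the best distance is clearly below the second best.
        if best < max_ratio * second:
            matches.append((i, j))
    return matches


desc_a = [(1.0, 0.0), (0.5, 0.6)]
desc_b = [(0.9, 0.1), (0.0, 1.0), (1.0, 1.0)]
matches = ratio_test_matches(desc_a, desc_b)
```

Here the first descriptor has one distinctly closest candidate and is matched, while the second is roughly equidistant from all candidates and is rejected as ambiguous.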


### **2. Related Work**

The classical approach for finding correspondences involves identifying interest points and computing local descriptors around these points [10, 11, 24, 37, 38, 39, 43].
While this approach performs relatively well for instance-level matching, the feature detectors and descriptors lack
the generalization ability for category-level matching.

Recently, convolutional neural networks have been used
to learn powerful feature descriptors which are more robust
to appearance changes than the classical descriptors [9, 23,
28, 45, 52]. However, these works still divide the image into
a set of local patches and extract a descriptor individually
from each patch. Extracted descriptors are then compared
with an appropriate distance measure [9, 28, 45], by directly
outputting a similarity score [23, 52], or even by directly
outputting a binary matching/non-matching decision [4].

In this work, we take a different approach, treating the
image as a whole, instead of a set of patches. Our approach
has the advantage of capturing the interaction of the different parts of the image to a greater extent, which is not possible when the image is divided into a set of local regions.
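
One way to picture "treating the image as a whole" is to compare every location of one dense feature map with every location of the other, rather than matching a few isolated patches. The sketch below does this with cosine similarity on tiny hand-written feature maps. It is an illustration under assumptions, not the paper's matching layer: the function names, the nested-list layout, and the choice of cosine similarity are all hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a descriptor to unit length (all-zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def flatten(feature_map):
    """Turn an h x w grid of descriptors into a list of ((row, col), unit_vector)."""
    return [
        ((i, j), l2_normalize(vector))
        for i, row in enumerate(feature_map)
        for j, vector in enumerate(row)
    ]


def dense_correlation(feat_a, feat_b):
    """Return scores[b_index][a_index], the cosine similarity of all location pairs."""
    flat_a, flat_b = flatten(feat_a), flatten(feat_b)
    return [
        [sum(x * y for x, y in zip(vec_b, vec_a)) for _, vec_a in flat_a]
        for _, vec_b in flat_b
    ]


# Two hypothetical 2 x 2 feature maps with 2-dimensional descriptors.
feat_a = [[[1.0, 0.0], [0.0, 1.0]],
          [[1.0, 1.0], [0.0, 2.0]]]
feat_b = [[[0.0, 3.0], [2.0, 0.0]],
          [[1.0, 1.0], [1.0, 0.0]]]
corr = dense_correlation(feat_a, feat_b)
```

Each row of `corr` holds the similarities between one location in the second map and all locations in the first, so later stages can reason about the full pattern of matches instead of individual patch decisions.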

Related are also network architectures for estimating
inter-frame motion in video [17, 48, 50] or instance-level
homography estimation [14]; however, their goal is very different from ours, targeting high-precision correspondence
with very limited appearance variation and background
clutter. Closer to us is the network architecture of [29]
which, however, tackles a different problem of fine-grained
category-level matching (different species of birds) with
limited background clutter and small translations and scale
changes, as their objects are largely centered in the image.
In addition, their architecture is based on a different matching layer, which we show not to perform as well as the
matching layer used in our work.

Some works, such as [11, 15, 22, 30, 35, 36], have addressed the hard problem of category-level matching, but
rely on traditional non-trainable optimization for matching
[11, 15, 30, 35, 36], or guide the matching using object proposals [22]. In contrast, our approach is fully trainable
in an end-to-end manner and does not require any optimization procedure at evaluation time, or guidance by object proposals.

Others [33, 44, 53] have addressed the problems of instance and category-level correspondence by performing
joint image alignment. However, these methods differ from
ours as they: (i) require class labels; (ii) don’t use CNN features; (iii) jointly align a large set of images, while we align
image pairs; and (iv) don’t use a trainable CNN architecture
for alignment as we do.
