# **Neighbourhood Consensus Networks**

**Authors: Ignacio Rocco†, Mircea Cimpoi‡, Relja Arandjelović§, Akihiko Torii∗, Tomas Pajdla‡, Josef Sivic†,‡**

**†Inria / ‡CIIRC, CTU in Prague / §DeepMind / ∗Tokyo Institute of Technology**

**[Official Github Code](https://github.com/ignacio-rocco/ncnet)** / **[Project Page](https://www.di.ens.fr/willow/research/ncnet/)** / **[Pdf](https://arxiv.org/abs/1810.10510)**

---

**Edited By Su Hyung Choi (Key Summary & Code Practice)**

If you find any issues with these notes, please open a PR against the repository below.

**[Github: @JonyChoi - Computer Vision Paper Reviews](https://github.com/jonychoi/Computer-Vision-Paper-Reviews)**

Edited March 21 2022

---

***"Let's resolve uncertainty by extending unique, invariant features from the global context to the ambiguous points in their neighbourhood!***

***+ Build a CNN-based, fully differentiable (i.e. end-to-end trainable) architecture"***

***=> Improves on the situation where even trainable models performed barely better than hand-crafted ones***

## **Abstract**

***"We address the problem of finding reliable dense correspondences between a pair
of images."***

#### **Contribution**

---

***First***, *Inspired by the classic idea of disambiguating feature matches using semi-local constraints, we develop an end-to-end trainable convolutional neural network architecture that identifies sets of spatially consistent matches by analyzing neighbourhood consensus patterns in the 4D space of all possible correspondences between a pair of images without the need for a global geometric model.*

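> **Inspired by the classic idea of disambiguating feature matches with *semi-local constraints*, the authors develop an end-to-end trainable CNN architecture that finds spatially consistent matches by analysing neighbourhood consensus patterns in the 4D space of all possible correspondences, without needing a global geometric model.**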

---

***Second***, *We demonstrate that the model can be trained effectively from weak supervision in the form of matching and non-matching image pairs without the need for costly manual annotation of point to point correspondences.*

> **The model is trained effectively from weak supervision in the form of matching and non-matching image pairs, instead of costly manual point-to-point annotation.**

---

***Third***, *We show the proposed neighbourhood consensus network can be applied to a range of matching tasks including both category- and instance-level matching, obtaining the state-of-the-art results on the PF Pascal dataset and the InLoc indoor visual localization benchmark.*

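> **This contribution concerns the evaluation. The Neighbourhood Consensus Network applies to a broad range of matching tasks, including category- and instance-level matching, and achieves state-of-the-art results on the PF Pascal dataset and the InLoc indoor visual localization benchmark.**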

---

## **Introduction**

> **Existing visual correspondence methods give reasonably good results under variations such as *viewpoint* and *illumination* changes, but trainable models still only marginally outperform hand-crafted ones (unlike in object detection or classification).**

*"One of the reasons for this plateauing performance could be the currently dominant approach for
finding image correspondence based on matching individual image features. While we have now better
local patch descriptors, the matching is still performed by variants of the nearest neighbour assignment
in a feature space followed by separate disambiguation stages based on geometric constraints."*

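> **The dominant approach to image correspondence matches individual image features: each feature is assigned its nearest neighbour in feature space, relying mainly on the invariant parts, and ambiguous matches are separated out in later disambiguation stages.**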

*"This
approach has, however, fundamental limitations. Imagine a scene with textureless regions or repetitive
patterns, such as a corridor with almost textureless walls and only few distinguishing features. A
small patch of an image, depicting a repetitive pattern or a textureless area, is indistinguishable from other portions of the image depicting the same repetitive or textureless pattern. Such matches will be
either discarded [23] or incorrect. As a result, matching individual patch descriptors will often fail in
such challenging situations."*

> **This approach has a fundamental limitation, however. Scenes such as textureless regions or corridors with repetitive patterns contain only a few distinguishing features, so matching individual patch descriptors fails in these challenging situations.**


*"In this work we take a different direction and develop a trainable neural network architecture that
disambiguates such challenging situations by analyzing local neighbourhood patterns in a full set of
dense correspondences. **The intuition is the following: in order to disambiguate a match on a repetitive
pattern, it is necessary to analyze a larger context of the scene that contains a unique non-repetitive
feature.** The information from this unique match can then be propagated to the neighbouring uncertain
matches. **In other words, the certain unique matches will support the close-by uncertain ambiguous
matches in the image."***

> **So the idea is to resolve uncertain matches by letting the unique matches found in a larger context support the nearby ambiguous ones, all within a trainable CNN.**


## **Related Works**

#### **Matching with hand-crafted image descriptors**

- **What is it**: *"Traditionally, correspondences between images
have been obtained by hand crafted local invariant feature detectors and descriptors [23, 25, 42]
that were extracted from the image with a controlled degree of invariance to local geometric and
photometric transformations. Candidate (tentative) correspondences were then obtained by variants
of nearest neighbour matching. Strategies for removing ambiguous and non-distinctive matches
include the widely used second nearest neighbour ratio test [23], or enforcing matches to be mutual
nearest neighbours."*

- **Limitation**: "*Both approaches work well for many applications, but have the disadvantage
of discarding many correct matches, which can be problematic for challenging scenes, such as
indoor spaces considered in this work that include repetitive and textureless areas. While successful,
handcrafted descriptors have only limited tolerance to large appearance changes beyond the built-in
invariance.*"

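> **Traditionally, correspondences were built around local invariant features (local features that change relatively little), with candidate (tentative) correspondences obtained by nearest-neighbour matching. In other words, using nearest neighbours as evidence already existed in hand-crafted form.**

> **However, these methods handle indoor, repetitive, and textureless scenes poorly, and hand-crafted descriptors tolerate large appearance changes only within their built-in invariance.**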
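
The two filtering strategies quoted above, the second nearest neighbour ratio test and mutual nearest neighbours, can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the function names, the toy 2-D descriptors, and the `max_ratio=0.8` threshold are made up here and are not taken from the paper or the official repository.

```python
import numpy as np

def pairwise_dist(desc_a, desc_b):
    """Euclidean distance between every descriptor in A (rows) and B (columns)."""
    diff = desc_a[:, None, :] - desc_b[None, :, :]
    return np.linalg.norm(diff, axis=2)

def ratio_test_matches(desc_a, desc_b, max_ratio=0.8):
    """Keep A->B nearest-neighbour matches whose best distance is clearly
    smaller than the second-best one (second nearest neighbour ratio test)."""
    dist = pairwise_dist(desc_a, desc_b)
    order = np.argsort(dist, axis=1)      # per row: indices into B, nearest first
    rows = np.arange(len(desc_a))
    best = dist[rows, order[:, 0]]
    second = dist[rows, order[:, 1]]
    keep = best < max_ratio * second
    return [(int(i), int(order[i, 0])) for i in rows[keep]]

def mutual_nn_matches(desc_a, desc_b):
    """Keep (i, j) only if j is i's nearest neighbour in B and i is j's in A."""
    dist = pairwise_dist(desc_a, desc_b)
    nn_ab = dist.argmin(axis=1)           # nearest B index for each A descriptor
    nn_ba = dist.argmin(axis=0)           # nearest A index for each B descriptor
    return [(i, int(j)) for i, j in enumerate(nn_ab) if nn_ba[j] == i]

# Toy descriptors (hypothetical): B contains two near-duplicates of A[1].
desc_a = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
desc_b = np.array([[0.1, 0.0], [5.0, 5.1], [0.9, 0.0], [1.11, 0.0]])

ratio_matches = ratio_test_matches(desc_a, desc_b)
mutual_matches = mutual_nn_matches(desc_a, desc_b)
print(ratio_matches)
print(mutual_matches)
```

In this toy case the ratio test drops descriptor 1 of image A because B holds two almost equally close candidates (a tiny repetitive pattern), even though its nearest neighbour is a plausible match. That is the kind of discarded correct match the limitation above refers to.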

---

#### **Matching with trainable descriptors.**

- **What is it**: *"The majority of trainable image descriptors are based on
convolutional neural networks (CNNs) and typically operate on patches extracted using a feature
detector such as DoG [23], yielding a sparse set of descriptors [3, 4, 10, 17, 36, 37] or use a pre-trained
image-level CNN feature extractor [26, 32]. Others have recently developed trainable methods that
comprise both feature detection and description [7, 26, 43]. The extracted descriptors are typically
compared using the Euclidean distance,"*

- **Limitation**: *"but an appropriate similarity score can be also learnt in
a discriminative manner [13, 44], where a trainable model is used to both extract descriptors and produce a similarity score. Finding matches consistent with a geometric model is typically performed
in a separate post-processing stage [3, 4, 7, 10, 17, 22, 26, 36, 37, 43]."*

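> **Trainable descriptors are mostly CNN-based. They either operate on patches extracted by a feature detector such as DoG, yielding a sparse set of descriptors, or use a pre-trained image-level CNN feature extractor. Some methods make both feature detection and description trainable. The extracted descriptors are usually compared using the Euclidean distance.**

> **A trainable model can also be used both to extract descriptors and to produce the similarity score. Even so, finding geometrically consistent matches is still left to a separate post-processing stage.**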

---

#### **Trainable image alignment.**

- **What is it**: *"Recently, end-to-end trainable methods have been developed to
produce correspondences between images according to a parametric geometric model, such as an
affine, perspective or thin-plate spline transformation [28, 29]. In these works, all pairwise feature
matches are computed and used to estimate the geometric transformation parameters using a CNN.*
***Unlike previous methods that capture only a sparse set of correspondences, this geometric estimation
CNN captures interactions between a full set of dense correspondences."***

- **Limitation**:  "*However, these methods
currently only estimate a low complexity parametric transformation, and therefore their application
is limited to only coarse image alignment tasks.*"

- **Solution**: *"In contrast, we target a more general problem of
identifying reliable correspondences between images of a general 3D scene. Our approach is not
limited to a low dimensional parametric model, but outputs a generic set of locally consistent image
correspondences, applicable to a wide range of computer vision problems ranging from category-level
image alignment to camera pose estimation. The proposed method builds on the classical ideas of
neighbourhood consensus, which we review next."*

> **Recent end-to-end trainable methods estimate a parametric geometric model and, unlike earlier methods that capture only a sparse set of correspondences, capture interactions across the full set of dense correspondences. However, they estimate only a low-complexity parametric transformation, so they remain limited to coarse image alignment tasks.**

> **In contrast, the method in this paper is not limited to a low-dimensional parametric model: it finds reliable correspondences between images of a general 3D scene and applies to a wide range of vision problems, from category-level image alignment to camera pose estimation. The underlying idea is neighbourhood consensus, reviewed next.**


---


#### **Match filtering by neighbourhood consensus**

- **What is it**: *"Several strategies have been introduced to decide
whether a match is correct or not, given the supporting evidence from the neighbouring matches. The
early examples analyzed the patterns of* ***distances [46] or angles [34] between neighbouring matches.***
*Later work simply counts the number of consistent matches in a certain image neighbourhood [33, 38],
which can be built in a scale invariant manner [30] or using a regular image grid [5]. While
simple, these techniques have been remarkably effective in removing random incorrect matches and
disambiguating local repetitive patterns [30]."*

- **Solution**: *"Inspired by this simple yet powerful idea we develop a
neighbourhood consensus network – a convolutional neural architecture that (i) analyzes the full set of
dense matches between a pair of images and (ii) learns patterns of locally consistent correspondences
directly from data."*
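
The counting idea described above, accepting a match when enough neighbouring matches agree with it, might look like the sketch below. It assumes a fixed-radius neighbourhood in both images and a hypothetical `consensus_counts` helper; it illustrates the classical hand-crafted filter, not the learned 4D convolutions of the neighbourhood consensus network.

```python
import numpy as np

def neighbours_within(points, radius):
    """Boolean matrix: entry (m, n) is True if point n lies within `radius`
    of point m (a point is never its own neighbour)."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    return dist <= radius

def consensus_counts(pts_a, pts_b, radius):
    """Match m pairs pts_a[m] with pts_b[m]. For each match, count the other
    matches that are its spatial neighbours in BOTH images."""
    near_a = neighbours_within(pts_a, radius)
    near_b = neighbours_within(pts_b, radius)
    return (near_a & near_b).sum(axis=1)

# Four tentative matches (toy data): the first three move together,
# the last one lands far away in image B and is likely wrong.
pts_a = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
pts_b = np.array([[10.0, 10.0], [11.0, 10.0], [10.0, 11.0], [30.0, 40.0]])

counts = consensus_counts(pts_a, pts_b, radius=1.5)
supported = counts >= 1                   # keep matches with some local support
print(counts.tolist(), supported.tolist())
```

The outlier receives no support because none of its neighbours in image A stay close to it in image B, while the three consistent matches support each other.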

---

#### **Flow and disparity estimation**

- **What is it**: *"Related are also methods that estimate optical flow or stereo
disparity such as [6, 15, 16, 24, 39], or their trainable counterparts [8, 19, 40]. These works also aim
at establishing reliable point to point correspondences between images."*

- **Solution**: *"However, we address a more
general matching problem where images can have large viewpoint changes (indoor localization) or
major changes in appearance (category-level matching). This is different from optical flow where
image pairs are usually consecutive video frames with small viewpoint or appearance changes, and
stereo where matching is often reduced to a local search around epipolar lines. The optical flow
and stereo problems are well addressed by specialized methods that explicitly exploit the problem
constraints (such as epipolar line constraint, small motion, smoothness, etc.)."*


## **Proposed approach**

***"In this work, we combine the robustness of neighbourhood consensus filtering with the power of
trainable neural architectures."***

>

- *We design a model which learns to discriminate a reliable match by
recognizing patterns of supporting matches in its neighbourhood.*

>

- *Furthermore, we do this in a fully
differentiable way, such that this trainable matching module can be directly combined with strong
CNN image descriptors. The resulting pipeline can then be trained in an end-to-end manner for the
task of feature matching. An overview of our proposed approach is presented in Fig. 1.*

>

- *There are five main components: (i) dense feature extraction and matching, (ii) the neighbourhood consensus
network, (iii) a soft mutual nearest neighbour filtering, (iv) extraction of correspondences from the
output 4D filtered match tensor, and (v) weakly supervised training loss. These components are
described next.*

>
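To make components (i), (iii) and (iv) more concrete, here is a rough NumPy sketch. It assumes that matching scores are cosine similarities between dense descriptors and that the soft mutual nearest neighbour filter rescales each score by its ratio to the best score in both matching directions. Those formulas are not spelled out in this summary, so treat them as assumptions about the paper. All function names are hypothetical, the learned neighbourhood consensus network (ii) and the training loss (v) are left out, and this is not the code from the official repository.

```python
import numpy as np

def correlation_4d(feat_a, feat_b):
    """feat_a: (h_a, w_a, d), feat_b: (h_b, w_b, d) dense descriptors.
    Returns c[i, j, k, l] = cosine similarity of feat_a[i, j] and feat_b[k, l]."""
    fa = feat_a / np.linalg.norm(feat_a, axis=-1, keepdims=True)
    fb = feat_b / np.linalg.norm(feat_b, axis=-1, keepdims=True)
    return np.einsum("ijd,kld->ijkl", fa, fb)

def soft_mutual_nn(c, eps=1e-8):
    """Down-weight scores that are not close to the maximum in both directions
    (assumes non-negative scores, e.g. from post-ReLU features)."""
    max_over_a = c.max(axis=(0, 1), keepdims=True)   # best score per B position
    max_over_b = c.max(axis=(2, 3), keepdims=True)   # best score per A position
    return c * (c / (max_over_a + eps)) * (c / (max_over_b + eps))

def extract_matches(c):
    """For each position (i, j) in A, return the arg-max position (k, l) in B."""
    h_a, w_a, h_b, w_b = c.shape
    flat = c.reshape(h_a, w_a, h_b * w_b).argmax(axis=2)
    return np.stack(np.unravel_index(flat, (h_b, w_b)), axis=-1)

# Toy input: image B is image A mirrored left-right, so the true match of
# position (i, j) in A is position (i, 3 - j) in B.
rng = np.random.default_rng(0)
feat_a = rng.random((3, 4, 8))            # non-negative stand-in for CNN features
feat_b = np.flip(feat_a, axis=1)

c = correlation_4d(feat_a, feat_b)        # 4D tensor of all pairwise scores
matches = extract_matches(soft_mutual_nn(c))
print(c.shape, matches[0, 0])
```

Note how quickly the 4D tensor grows: two feature maps of size h×w produce (h·w)² scores, which is why the full set of correspondences is only tractable at coarse feature resolution.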

#### **Dense feature extraction and matching**

![](https://www.di.ens.fr/willow/research/ncnet/images/teaser.png)

---

#### **Neighbourhood consensus network**
---

#### **Soft mutual nearest neighbour filtering**
---

#### **Extracting correspondences from the correlation map**
---

#### **Weakly-supervised training**



## **Experimental results**

#### **Implementation details**

---

#### **Category-level matching**

---

#### **Instance-level matching**

---

#### **Limitations**

---

## **Conclusion**