# **Neighbourhood Consensus Networks**

**Authors: Ignacio Rocco†, Mircea Cimpoi‡, Relja Arandjelović§, Akihiko Torii∗, Tomas Pajdla‡, Josef Sivic†,‡**

**†Inria / ‡CIIRC, CTU in Prague / §DeepMind / ∗Tokyo Institute of Technology**

**[Official Github Code](https://github.com/ignacio-rocco/ncnet)** / **[Project Page](https://www.di.ens.fr/willow/research/ncnet/)** / **[Pdf](https://arxiv.org/abs/1810.10510)**

---

**Edited By Su Hyung Choi (Key Summary & Code Practice)**

If you find any issues with this write-up, please open a PR on the repository below.

**[Github: @JonyChoi - Computer Vision Paper Reviews](https://github.com/jonychoi/Computer-Vision-Paper-Reviews)**

Edited March 21 2022

---

## **Summary**

***"Let's resolve uncertainty by propagating unique, invariant features found in the global context to the ambiguous points in their neighbourhood!"***
> ***For general matching, the paper targets larger viewpoint changes (indoor localization) and major changes in appearance (category-level matching).***

> ***The key ideas are the 4D CNN, the Neighbourhood Consensus Network, and soft mutual nearest neighbour filtering.***

***+ A CNN-based, fully differentiable (= end-to-end trainable) architecture <--> hand-crafted methods***

> ***Improves on earlier trainable models, which performed little better than hand-crafted methods.***

#### **Architectures**

> **1. For the image pair Ia and Ib, a CNN extracts the dense image descriptors fA and fB => 4D tensors.**

> **2. Every individual feature match (fAij and fBkl) is represented in a 4-D space.**

> **3. The matching scores of all matches are stored in the 4-D correlation tensor c.**

> **4. These matches go through soft nearest neighbour filtering and the neighbourhood consensus network to produce the final correspondences.**

---

**Five main components.**

> **1. Dense feature extraction and matching**

> **2. Neighbourhood Consensus Network => 4D CNN**

> **3. Soft mutual nearest neighbour filtering**

> **4. Extraction of correspondences from the output 4D filtered match tensor**

> **5. Weakly supervised training loss**

## **Abstract**

***"We address the problem of finding reliable dense correspondences between a pair
of images."***

#### **Contribution**

---

***First***, *Inspired by the classic idea of disambiguating feature matches using semi-local constraints, we develop an end-to-end trainable convolutional neural network architecture that identifies sets of spatially consistent matches by analyzing neighbourhood consensus patterns in the 4D space of all possible correspondences between a pair of images without the need for a global geometric model.*

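> **Inspired by the classic idea of disambiguating feature matches with *semi-local constraints*, the authors develop an end-to-end trainable CNN architecture that finds spatially consistent matches by analyzing neighbourhood consensus patterns in the 4D space of all possible correspondences, without needing a global geometric model.**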

---

***Second***, *We demonstrate that the model can be trained effectively from weak supervision in the form of matching and non-matching image pairs without the need for costly manual annotation of point to point correspondences.*

> **The model trains effectively from weak supervision in the form of matching and non-matching image pairs, rather than from costly manual point-to-point annotation.**

---

***Third***, *We show the proposed neighbourhood consensus network can be applied to a range of matching tasks including both category- and instance-level matching, obtaining the state-of-the-art results on the PF Pascal dataset and the InLoc indoor visual localization benchmark.*

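> **This is the experimental evaluation. The Neighbourhood Consensus Network applies to a broad range of matching tasks, including category-level and instance-level matching, and achieves state-of-the-art results on the PF Pascal dataset and the InLoc indoor visual localization benchmark.**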

---

## **Introduction**

> **Existing visual correspondence methods cope reasonably well with variations such as *viewpoint* and *illumination* changes, but trainable models still perform only marginally better than hand-crafted ones (unlike in object detection or classification).**

*"One of the reasons for this plateauing performance could be the currently dominant approach for
finding image correspondence based on matching individual image features. While we have now better
local patch descriptors, the matching is still performed by variants of the nearest neighbour assignment
in a feature space followed by separate disambiguation stages based on geometric constraints."*

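> **The dominant approach finds image correspondences by matching individual image features: each feature is assigned to its nearest neighbour in feature space, and ambiguous matches are then handled in separate disambiguation stages based on geometric constraints.**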

*"This
approach has, however, fundamental limitations. Imagine a scene with textureless regions or repetitive
patterns, such as a corridor with almost textureless walls and only few distinguishing features. A
small patch of an image, depicting a repetitive pattern or a textureless area, is indistinguishable from other portions of the image depicting the same repetitive or textureless pattern. Such matches will be
either discarded [23] or incorrect. As a result, matching individual patch descriptors will often fail in
such challenging situations."*

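> **However, this approach has a fundamental limitation. Scenes with textureless regions or repetitive patterns, such as a corridor, contain only a few distinguishing features, so matching individual patch descriptors often fails in these challenging situations.**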


*"In this work we take a different direction and develop a trainable neural network architecture that
disambiguates such challenging situations by analyzing local neighbourhood patterns in a full set of
dense correspondences. **The intuition is the following: in order to disambiguate a match on a repetitive
pattern, it is necessary to analyze a larger context of the scene that contains a unique non-repetitive
feature.** The information from this unique match can then be propagated to the neighbouring uncertain
matches. **In other words, the certain unique matches will support the close-by uncertain ambiguous
matches in the image."***

> **So the idea is to handle uncertain matches by letting unique matches from the larger context support the nearby ambiguous ones, all within a trainable CNN.**


## **Related Works**

#### **Matching with hand-crafted image descriptors**

- **What is it**: *"Traditionally, correspondences between images
have been obtained by hand crafted local invariant feature detectors and descriptors [23, 25, 42]
that were extracted from the image with a controlled degree of invariance to local geometric and
photometric transformations. Candidate (tentative) correspondences were then obtained by variants
of nearest neighbour matching. Strategies for removing ambiguous and non-distinctive matches
include the widely used second nearest neighbour ratio test [23], or enforcing matches to be mutual
nearest neighbours."*

- **Limitation**: "*Both approaches work well for many applications, but have the disadvantage
of discarding many correct matches, which can be problematic for challenging scenes, such as
indoor spaces considered in this work that include repetitive and textureless areas. While successful,
handcrafted descriptors have only limited tolerance to large appearance changes beyond the built-in
invariance.*"

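> **Traditionally, correspondences were obtained with hand-crafted local invariant feature detectors and descriptors, and candidate (tentative) correspondences were found by nearest neighbour matching. In other words, using nearest neighbours as the source of information already existed in hand-crafted form.**


> **However, these methods do not handle indoor scenes or repetitive, textureless scenes well, because the filtering strategies discard many correct matches. Hand-crafted descriptors also have only limited tolerance to large appearance changes beyond their built-in invariance.**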
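
The two classic filtering strategies quoted above, the second nearest neighbour ratio test and the mutual nearest neighbour check, can be sketched as follows. This is a minimal NumPy illustration rather than code from the paper or its repository: the function name `nearest_neighbour_matches`, the Euclidean distance, and the ratio threshold of 0.8 are all assumptions chosen for the example.

```python
import numpy as np

def nearest_neighbour_matches(desc_a, desc_b, ratio=0.8):
    """Match descriptors of image A to image B (hypothetical helper).

    desc_a: (Na, d) array, desc_b: (Nb, d) array with Nb >= 2.
    A match (i, j) is kept only if it passes the ratio test and
    i and j are mutual nearest neighbours.
    """
    # Pairwise Euclidean distances, shape (Na, Nb).
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)

    order = np.argsort(dists, axis=1)
    best_b = order[:, 0]               # nearest neighbour in B for each A
    second_b = order[:, 1]             # second nearest neighbour in B
    best_a = np.argmin(dists, axis=0)  # nearest neighbour in A for each B

    matches = []
    for i in range(len(desc_a)):
        j = int(best_b[i])
        passes_ratio = dists[i, j] < ratio * dists[i, second_b[i]]
        is_mutual = best_a[j] == i
        if passes_ratio and is_mutual:
            matches.append((i, j))
    return matches

# The third descriptor of A is equally close to two descriptors of B
# (a "repetitive pattern"), so the ratio test discards it.
desc_a = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 5.0]])
desc_b = np.array([[0.1, 0.0], [10.0, 0.1], [5.0, 4.0], [5.0, 6.0]])
matches = nearest_neighbour_matches(desc_a, desc_b)
```

In this toy case the ambiguous descriptor is dropped even though one of its two candidates could be correct, which is the kind of lost match the limitation above refers to.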

---

#### **Matching with trainable descriptors.**

- **What is it**: *"The majority of trainable image descriptors are based on
convolutional neural networks (CNNs) and typically operate on patches extracted using a feature
detector such as DoG [23], yielding a sparse set of descriptors [3, 4, 10, 17, 36, 37] or use a pre-trained
image-level CNN feature extractor [26, 32]. Others have recently developed trainable methods that
comprise both feature detection and description [7, 26, 43]. The extracted descriptors are typically
compared using the Euclidean distance,"*

- **Limitation**: *"but an appropriate similarity score can be also learnt in
a discriminative manner [13, 44], where a trainable model is used to both extract descriptors and produce a similarity score. Finding matches consistent with a geometric model is typically performed
in a separate post-processing stage [3, 4, 7, 10, 17, 22, 26, 36, 37, 43]."*

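> **Trainable descriptors are CNN-based. They typically operate on patches extracted with a feature detector such as DoG, yielding a sparse set of descriptors, or they use a pre-trained image-level CNN feature extractor. Some methods make both feature detection and description trainable. The descriptors are usually compared with the Euclidean distance.**

> **A trainable model can also be used both to extract descriptors and to produce the similarity score. Even so, finding matches consistent with a geometric model is still done in a separate post-processing stage.**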

---

#### **Trainable image alignment.**

- **What is it**: *"Recently, end-to-end trainable methods have been developed to
produce correspondences between images according to a parametric geometric model, such as an
affine, perspective or thin-plate spline transformation [28, 29]. In these works, all pairwise feature
matches are computed and used to estimate the geometric transformation parameters using a CNN.*
***Unlike previous methods that capture only a sparse set of correspondences, this geometric estimation
CNN captures interactions between a full set of dense correspondences."***

- **Limitation**:  "*However, these methods
currently only estimate a low complexity parametric transformation, and therefore their application
is limited to only coarse image alignment tasks.*"

- **Solution**: *"In contrast, we target a more general problem of
identifying reliable correspondences between images of a general 3D scene. Our approach is not
limited to a low dimensional parametric model, but outputs a generic set of locally consistent image
correspondences, applicable to a wide range of computer vision problems ranging from category-level
image alignment to camera pose estimation. The proposed method builds on the classical ideas of
neighbourhood consensus, which we review next."*

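> **Recent end-to-end trainable methods produce correspondences according to a parametric geometric model. Whereas earlier methods capture only a sparse set of correspondences, these capture interactions between the full set of dense correspondences. However, they currently estimate only low-complexity parametric transformations, so they are limited to coarse image alignment tasks.**

> **In contrast, the method in this paper is not limited to a low-dimensional parametric model. It identifies reliable correspondences between images of a general 3D scene, so it applies to a wide range of vision problems, from category-level image alignment to camera pose estimation. The underlying idea is neighbourhood consensus, reviewed next.**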


---


#### **Match filtering by neighbourhood consensus**

- **What is it**: *"Several strategies have been introduced to decide
whether a match is correct or not, given the supporting evidence from the neighbouring matches. The
early examples analyzed the patterns of* ***distances [46] or angles [34] between neighbouring matches.***
*Later work simply counts the number of consistent matches in a certain image neighbourhood [33, 38],
which can be built in a scale invariant manner [30] or using a regular image grid [5]. While
simple, these techniques have been remarkably effective in removing random incorrect matches and
disambiguating local repetitive patterns [30]."*

- **Solution**: *"Inspired by this simple yet powerful idea we develop a
neighbourhood consensus network – a convolutional neural architecture that (i) analyzes the full set of
dense matches between a pair of images and (ii) learns patterns of locally consistent correspondences
directly from data."*

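> **Several heuristic methods decide whether a match is correct from the evidence of neighbouring matches. Early methods analyzed the patterns of distances or angles between neighbouring matches, and later work counts the number of consistent matches in a given image neighbourhood. These techniques are effective at removing incorrect matches and disambiguating local repetitive patterns.**

> **Inspired by this, the authors develop a CNN-based neighbourhood consensus network that (i) analyzes the full set of dense matches between a pair of images and (ii) learns patterns of locally consistent correspondences directly from data.**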
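
The "count the consistent matches in a neighbourhood" heuristic quoted above can be illustrated with a few lines of NumPy. This is a simplified sketch and not the method of any cited paper: the function name `consensus_counts`, the fixed circular neighbourhood, and the `radius` parameter are assumptions made for the example.

```python
import numpy as np

def consensus_counts(pts_a, pts_b, radius):
    """Count supporting matches for each tentative match (hypothetical helper).

    pts_a, pts_b: (N, 2) arrays; row n is the match pts_a[n] <-> pts_b[n].
    A match supports another one if their endpoints are within `radius`
    of each other in BOTH images.
    """
    da = np.linalg.norm(pts_a[:, None] - pts_a[None], axis=2)
    db = np.linalg.norm(pts_b[:, None] - pts_b[None], axis=2)
    support = (da <= radius) & (db <= radius)
    return support.sum(axis=1) - 1  # do not count the match itself

# Three matches move together; the fourth lands far away in image B.
pts_a = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
pts_b = np.array([[10.0, 10.0], [11.0, 10.0], [10.0, 11.0], [50.0, 50.0]])
counts = consensus_counts(pts_a, pts_b, radius=1.5)
```

The isolated match receives no support and could be rejected by thresholding `counts`. The neighbourhood consensus network replaces this fixed counting rule with patterns learned from data.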

---

#### **Flow and disparity estimation**

- **What is it**: *"Related are also methods that estimate optical flow or stereo
disparity such as [6, 15, 16, 24, 39], or their trainable counterparts [8, 19, 40]. These works also aim
at establishing reliable point to point correspondences between images."*

- **Limitation**: *"This (our method) is different from optical flow where
image pairs are usually consecutive video frames with small viewpoint or appearance changes, and
stereo where matching is often reduced to a local search around epipolar lines. The optical flow
and stereo problems are well addressed by specialized methods that explicitly exploit the problem
constraints (such as epipolar line constraint, small motion, smoothness, etc.)."*

- **Solution**: *"However, we address a more
general matching problem where images can have large viewpoint changes (indoor localization) or
major changes in appearance (category-level matching)."*

> **There are also methods for estimating optical flow or stereo disparity, along with their trainable counterparts. However, optical flow usually deals with consecutive video frames with small viewpoint or appearance changes, and stereo matching is often reduced to a local search around epipolar lines. These problems are well addressed by specialized methods that exploit constraints such as the epipolar line constraint, small motion, and smoothness.**

> **This paper instead addresses a more general matching problem with large viewpoint changes (indoor localization) and major changes in appearance (category-level matching).**

## **Proposed approach**

***"In this work, we combine the robustness of neighbourhood consensus filtering with the power of
trainable neural architectures."***

> **The paper combines the robustness of neighbourhood consensus filtering with the power of trainable neural architectures.**

- *We design a model which learns to discriminate a reliable match by
recognizing patterns of supporting matches in its neighbourhood.*

> **The model learns to identify reliable matches by recognizing patterns of supporting matches in their neighbourhood.**

- *Furthermore, we do this in a fully
differentiable way, such that this trainable matching module can be directly combined with strong
CNN image descriptors. The resulting pipeline can then be trained in an end-to-end manner for the
task of feature matching. An overview of our proposed approach is presented in Fig. 1.*

> **The whole pipeline is fully differentiable and can be trained end-to-end (overview in Fig. 1).**

- *There are five main components: (i) dense feature extraction and matching, (ii) the neighbourhood consensus
network, (iii) a soft mutual nearest neighbour filtering, (iv) extraction of correspondences from the
output 4D filtered match tensor, and (v) weakly supervised training loss. These components are
described next.*

> **Five main components:**
>
> **1. Dense feature extraction and matching**
>
> **2. Neighbourhood Consensus Network**
>
> **3. Soft mutual nearest neighbour filtering**
>
> **4. Extraction of correspondences from the output 4D filtered match tensor**
>
> **5. Weakly supervised training loss**

---

### **Dense feature extraction and matching**

![](https://www.di.ens.fr/willow/research/ncnet/images/teaser.png)

> ***"Figure 1: Overview of the proposed method.*** *A fully convolutional neural network is used to extract dense
image descriptors fA and fB for images IA and IB, respectively. All pairs of individual feature matches fAij and fBkl are represented in the 4-D space of matches (i, j, k, l) (here shown as a 3-D perspective for illustration),
and their matching scores stored in the 4-D correlation tensor c. These matches are further processed by the
proposed soft-nearest neighbour filtering and neighbourhood consensus network (see Figure 2) to produce the
final set of output correspondences."*

**1. For the image pair Ia and Ib, a CNN extracts the dense image descriptors fA and fB.**

**2. Every individual feature match (fAij and fBkl) is represented in a 4-D space.**

**3. The matching scores of all matches are stored in the 4-D correlation tensor c.**

**4. These matches go through soft nearest neighbour filtering and the neighbourhood consensus network to produce the final correspondences.**

*"While classic hand-crafted neighbourhood consensus approaches are applied after a hard assignment
of matches is done, this is not well suited for developing a matching method that is differentiable
and amenable for end-to-end training. The reason is that the step of selecting a particular match is
not differentiable with respect to the set of all the possible features.* ***In addition, in case of repetitive
features, assigning the match to the first nearest neighbour might result in an incorrect match, in which
case the hard assignment would lose valuable information about the subsequent closest neighbours."***

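> **Hand-crafted neighbourhood consensus approaches are applied after a hard assignment of matches. Selecting one particular match is not differentiable with respect to the set of all possible features, so this does not suit end-to-end training. In addition, with repetitive features the first nearest neighbour may be an incorrect match, and the hard assignment then loses the information about the next closest neighbours.**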

*"Therefore, in order to have an approach that is amenable to end-to-end training, all pairwise feature
matches need to be computed and stored. For this we use an approach similar to [28]. Given two sets
of dense feature descriptors fA = {fAij} and fB = {fBij} corresponding to the images to be matched, the exhaustive pairwise cosine similarities between them are computed and stored in a 4-D tensor c ∈ ℝ^(h×w×h×w) referred to as correlation map, where:"*

![](https://raw.githubusercontent.com/jonychoi/Computer-Vision-Paper-Reviews/main/Correspondence/Ncnet%20-%20Neighbourhood%20consensus%20networks%20for%20estimating%20image%20correspondences/imgs/img1.png)

*"Note that, by construction, elements of c in the vicinity of index ijkl correspond to matches between features that are in the local neighbourhoods NA and NB of descriptors fAij in image A and fBkl in image B, respectively, as illustrated in Fig. 1; this structure of the 4-D correlation map tensor c will be exploited in the next section."*

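> **So, to allow end-to-end training, all pairwise matches are computed and stored: the cosine similarity of every pair of descriptors from the two sets goes into the correlation map.**

> **In the equation above, the elements of c in the vicinity of index ijkl are matches between features in the local neighbourhoods NA and NB of the descriptors fAij and fBkl, respectively.**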
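
The exhaustive pairwise cosine similarity described above can be sketched in NumPy as follows. This is an illustrative sketch, not the code of the official repository: the function name `correlation_tensor`, the `(h, w, d)` descriptor layout, and the small `eps` added for numerical safety are assumptions.

```python
import numpy as np

def correlation_tensor(f_a, f_b, eps=1e-8):
    """Cosine similarity between all pairs of dense descriptors.

    f_a, f_b: (h, w, d) arrays of dense descriptors (assumed layout).
    Returns c with shape (h, w, h, w), where c[i, j, k, l] is the cosine
    similarity between f_a[i, j] and f_b[k, l].
    """
    # L2-normalise each descriptor so the dot product becomes a cosine.
    f_a = f_a / (np.linalg.norm(f_a, axis=-1, keepdims=True) + eps)
    f_b = f_b / (np.linalg.norm(f_b, axis=-1, keepdims=True) + eps)
    # Contract over the descriptor dimension d for every (ij, kl) pair.
    return np.einsum("ijd,kld->ijkl", f_a, f_b)

rng = np.random.default_rng(0)
f_a = rng.standard_normal((3, 4, 8))
f_b = rng.standard_normal((3, 4, 8))
c = correlation_tensor(f_a, f_b)  # shape (3, 4, 3, 4)
```

Keeping the result as a 4-D array, rather than flattening it to `(hw, hw)`, preserves the neighbourhood structure around each index `ijkl` that the next section relies on.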

---

### **Neighbourhood consensus network**

*"The correlation map contains the scores of all pairwise matches. In order to further process and filter
the matches,* ***we propose to use 4-D convolutional neural network (CNN) for the neighbourhood
consensus task (denoted by N(·)),*** *which is illustrated in Fig. 2."*

> **The correlation map contains the scores of all pairwise matches. To process and filter them further, the paper proposes a 4D CNN for the neighbourhood consensus task.**

*"Determining the correct matches from the correlation map is, a priori, a significant challenge. Note
that* ***the number of correct matches are of order of hw, while the size of the correlation map is of
the order of (hw)^2.*** *This means that* ***the great majority of the information in the correlation map
corresponds to matching noise due to incorrectly matched features."***

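> **If the feature map resolution is h x w, the number of correct matches is on the order of hw. The correlation map, however, has on the order of (hw)^2 entries (the number of all possible pairs of positions across the two images). So the great majority of the information in the correlation map is matching noise from incorrectly matched features.**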
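
To get a feel for the gap between hw and (hw)^2, here is a small worked example. The 25 x 25 feature grid is an assumed, illustrative size and is not taken from the text above.

```python
# Illustrative (assumed) feature map size.
h, w = 25, 25

num_correct = h * w            # order of the number of correct matches
num_entries = (h * w) ** 2     # number of entries in the correlation map
fraction = num_correct / num_entries

print(f"correct matches ~ {num_correct}")
print(f"correlation map entries = {num_entries}")
print(f"fraction that can be correct ~ {fraction:.4%}")
```

With these numbers, at most roughly one entry in 625 corresponds to a correct match, and the rest is noise that the network has to filter out.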

***"However, supported by the idea of neighbourhood consensus presented in Sec. 1, we can expect
correct matches to have a coherent set of supporting matches surrounding them in the 4-D space.***
*These geometric patterns are equivariant with translations in the input images; that is, if the images are
translated,* ***the matching pattern is also translated in the 4-D space by an equal amount.*** *This property
motivates the use of 4-D convolutions for processing the correlation map as the same operations
should be performed regardless of the location in the 4-D space. This is analogous to the motivation
for using 2-D convolutions to process individual images –* ***it makes sense to use convolutions, instead of for example a fully connected layer, in order to profit from weight sharing and keep the number of trainable parameters low.*** *Furthermore, it facilitates sample-efficient training as a single training
example* ***provides many error signals to the convolutional weights, since the same weights are applied
at all positions of the correlation map.*** *Finally, by processing matches with a 4D convolutional
network we* ***establish a strong locality prior on the relationships between the matches.*** *That is, by design, the network will determine the quality of a match by examining only the information in a
local 2D neighbourhood in each of the two images."*

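> **However, by the idea of neighbourhood consensus, correct matches can be expected to have a coherent set of supporting matches around them in the 4D space. These geometric patterns are equivariant to translations of the input images: if the images are translated, the matching pattern is translated in the 4D space by the same amount.**

> **This means the same operation should be applied regardless of the location in the 4D space, which is the same motivation as for ordinary 2D or 3D convolutions. Convolving the whole 4D tensor gives weight sharing (few trainable parameters), sample-efficient training, and above all a strong locality prior on the relationships between matches.**

> **As a result, the network determines the quality of a match by examining only the local 2D neighbourhoods in each of the two images.**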
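
The translation-equivariance argument can be checked numerically with a naive 4-D convolution. The sketch below is an assumption-laden illustration, not the paper's implementation: `conv4d` is a hypothetical helper that applies a single `k x k x k x k` filter with zero padding, using plain loops for clarity rather than speed.

```python
import numpy as np

def conv4d(c, kernel):
    """Naive 'same' 4-D cross-correlation with zero padding.

    c: (h, w, h2, w2) tensor; kernel: (k, k, k, k) with odd k.
    The same kernel weights are applied at every 4-D position.
    """
    k = kernel.shape[0]
    p = k // 2
    padded = np.pad(c, p)  # zero-pad all four axes by p on each side
    out = np.zeros_like(c, dtype=float)
    h, w, h2, w2 = c.shape
    for a in range(k):
        for b in range(k):
            for m in range(k):
                for n in range(k):
                    out += kernel[a, b, m, n] * padded[a:a + h, b:b + w,
                                                       m:m + h2, n:n + w2]
    return out

rng = np.random.default_rng(0)
kernel = rng.standard_normal((3, 3, 3, 3))

# A single match at (2, 2, 2, 2), then the same match shifted by one
# position along every axis.
c = np.zeros((6, 6, 6, 6))
c[2, 2, 2, 2] = 1.0
c_shifted = np.roll(c, 1, axis=(0, 1, 2, 3))

out = conv4d(c, kernel)
out_shifted = conv4d(c_shifted, kernel)
# Shifting the input shifts the output by the same amount.
equivariant = np.allclose(out_shifted, np.roll(out, 1, axis=(0, 1, 2, 3)))
```

A 3 x 3 x 3 x 3 filter has only 81 weights regardless of the size of the correlation map, which is the weight-sharing benefit mentioned in the quote.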


*"The proposed neighbourhood consensus network has several convolutional layers, as illustrated in
Fig. 2, each followed by ReLU non-linearities. The convolutional filters of the first layer of the
proposed CNN span a local 4-D region of the matches space, which corresponds to the Cartesian
product of local neighbourhoods NA and NB in each image, respectively. Therefore, each 4-D filter
of the first layer can process and detect patterns in all pairwise matches of these two neighbourhoods.
This first layer has N1 filters that can specialize in learning different local geometric deformations,
producing N1 output channels, that correspond to the agreement with these local deformations at each
4-D point of the correlation tensor. These output channels are further processed by subsequent 4-D
convolutional layers. The aim is that these layers capture more complex patterns by combining the
outputs from the previous layer, analogously to what has been observed for 2-D CNNs [45]. Finally,
the neighbourhood consensus CNN produces a single channel output, which has the same dimensions
as the 4D input matches."*

![](https://raw.githubusercontent.com/jonychoi/Computer-Vision-Paper-Reviews/main/Correspondence/Ncnet%20-%20Neighbourhood%20consensus%20networks%20for%20estimating%20image%20correspondences/imgs/additional.png)

Reference from [Neighbourhood Consensus Networks](https://www.youtube.com/watch?v=sRBviaVN4GE)

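> **To make the result independent of the order of the input images, there is no need to rerun the dense feature extractor (ResNet) and the feature matching (i.e. recompute the correlation map). As shown above, it is enough to swap the ij and kl coordinates.**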
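
The coordinate-swap trick can be sketched as below. The sketch assumes the symmetric form N(c) + (N(cᵀ))ᵀ with (cᵀ)ijkl = cklij, and the names `swap_images`, `symmetric_apply`, and `toy_network` are hypothetical. `toy_network` is only a stand-in for the 4D CNN, chosen to be deliberately order-dependent.

```python
import numpy as np

def swap_images(c):
    """Swap the roles of the two images: result[i, j, k, l] = c[k, l, i, j]."""
    return np.transpose(c, (2, 3, 0, 1))

def symmetric_apply(network, c):
    """Apply `network` in both matching directions and sum the results."""
    return network(c) + swap_images(network(swap_images(c)))

def toy_network(c):
    """Stand-in for the 4D CNN; treats the first image axis specially."""
    return np.cumsum(c, axis=0)

rng = np.random.default_rng(0)
c = rng.standard_normal((3, 3, 3, 3))

out_ab = symmetric_apply(toy_network, c)               # match A to B
out_ba = symmetric_apply(toy_network, swap_images(c))  # match B to A
# Swapping the inputs only swaps the coordinates of the output.
order_invariant = np.allclose(out_ba, swap_images(out_ab))
```

Only the correlation tensor is transposed here, so the feature extraction and correlation steps run once per image pair.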

---

### **Soft mutual nearest neighbour filtering**
---

### **Extracting correspondences from the correlation map**
---

### **Weakly-supervised training**



## **Experimental results**

#### **Implementation details**

---

#### **Category-level matching**

---

#### **Instance-level matching**

---

#### **Limitations**

---

## **Conclusion**