# **End-to-end weakly-supervised semantic alignment**

**Authors: Ignacio Rocco<sup>1,2</sup>, Relja Arandjelović<sup>3</sup>, Josef Sivic<sup>1,2,4</sup>**

**<sup>1</sup> DI ENS / <sup>2</sup> Inria / <sup>3</sup> DeepMind / <sup>4</sup> CIIRC, CTU in Prague**

**[Official Github Code](https://github.com/ignacio-rocco/weakalign)** / **[Project Page](https://www.di.ens.fr/willow/research/weakalign/)** / **[Pdf](https://arxiv.org/abs/1712.06861)**

---

**Edited By Su Hyung Choi (Key Summary & Code Practice)**

If you find any issues with these notes, please open a PR on the repository below.

**[Github: @JonyChoi - Computer Vision Paper Reviews](https://github.com/jonychoi/Computer-Vision-Paper-Reviews)**

Edited March 21 2022

---

## **Summary**

Semantic alignment is a challenging task due to large intra-class variation, changes in viewpoint, and background clutter. The paper presents three principal contributions, listed under **Contribution** below.

## **Abstract**

***"We tackle the task of semantic alignment where the goal
is to compute dense semantic correspondence aligning two
images depicting objects of the same category."***

#### **Contribution**

---

***First***, *we develop a convolutional neural network architecture for semantic alignment
that is trainable in an end-to-end manner from weak image-level supervision in the form of matching image pairs.*

> **A CNN for semantic alignment that can be trained end-to-end from weak image-level supervision.**

---

***Second***, *the main component of this architecture is a differentiable soft inlier scoring module, inspired by the RANSAC inlier scoring procedure, that computes the quality of the alignment based on only geometrically consistent correspondences thereby reducing the effect of background clutter.*

> **Inspired by the RANSAC inlier scoring procedure, a differentiable soft inlier scoring module that computes alignment quality from geometrically consistent correspondences only, which reduces the effect of background clutter.**
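
The inlier scoring idea can be illustrated with a minimal sketch. It assumes a sparse dictionary of match scores and a hard distance threshold; the paper's module works on dense correlation scores with a transformation predicted by the network, so the function names, the `threshold` parameter, and the toy data below are hypothetical and only convey the principle that matches disagreeing with the transformation do not contribute to the score.

```python
import math


def soft_inlier_count(scores, transform, threshold=1.0):
    """Sum the scores of matches that agree with a geometric transformation.

    scores: dict mapping ((i, j), (k, l)) -> non-negative match score between
            source position (i, j) and target position (k, l).
    transform: function mapping a source position to a target position.
    threshold: hypothetical inlier distance, in grid units.
    """
    total = 0.0
    for (src, tgt), score in scores.items():
        ti, tj = transform(src)
        # Inlier mask: 1 if the match lands close to where the
        # transformation sends the source position, 0 otherwise.
        if math.hypot(ti - tgt[0], tj - tgt[1]) < threshold:
            total += score
    return total


def shift_right(p):
    """Toy transformation: move every position one column to the right."""
    return (p[0], p[1] + 1)


scores = {
    ((0, 0), (0, 1)): 0.9,  # consistent with shift_right
    ((1, 0), (1, 1)): 0.8,  # consistent with shift_right
    ((0, 1), (2, 0)): 0.7,  # strong score, but geometrically inconsistent
}
count = soft_inlier_count(scores, shift_right)
```

Here the third match is ignored despite its high score, which is how clutter matches are prevented from inflating the alignment quality. For a fixed mask the sum is linear in the scores, so gradients can flow back to the features that produced them.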

---

***Third***, *we demonstrate that the proposed approach achieves state-of-the-art
performance on multiple standard benchmarks for semantic
alignment.*

> **Evaluation: state-of-the-art results on multiple standard semantic alignment benchmarks.**

---

## **Introduction**

![](https://www.di.ens.fr/willow/research/weakalign/images/teaser.jpg)

<table>
    <tr>
        <td>
        </td>
        <td>
            <span>
                <i>"Figure 1: We describe a CNN architecture that, given an input image pair (top), outputs dense semantic correspondence between the
                two images together with the aligning geometric transformation
                (middle) and discards geometrically inconsistent matches (bottom). The alignment model is learnt from weak supervision in
                the form of matching image pairs without correspondences."
                </i>
            </span>
        </td>
    </tr>
</table>


---

*"Finding correspondence is one of the fundamental problems in computer vision. Initial work has focused on finding correspondence between images depicting the same object or scene with applications in image stitching [30], multiview 3D reconstruction [11], motion estimation [6, 33] or tracking [4, 22]."*

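> **Correspondence is one of the fundamental problems in vision. Initial work focused on finding correspondence between images of the same object or scene, for applications such as image stitching, multiview 3D reconstruction, motion estimation, and tracking.**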

***"In this work we study the problem of finding category-level correspondence, or semantic alignment [1, 20], where the goal is to establish dense correspondence between different objects belonging to the same category, such as the two different motorcycles illustrated in Fig. 1. This is an important problem with applications in object recognition [19], image editing [3], or robotics [23]."***

> **In this paper, the goal is to establish dense correspondence between different objects of the same category, as in the figure above.**


## **Related Works**

#### **Matching with hand-crafted image descriptors**

- **What is it**: *"Traditionally, correspondences between images
have been obtained by hand crafted local invariant feature detectors and descriptors [23, 25, 42]
that were extracted from the image with a controlled degree of invariance to local geometric and
photometric transformations. Candidate (tentative) correspondences were then obtained by variants
of nearest neighbour matching. Strategies for removing ambiguous and non-distinctive matches
include the widely used second nearest neighbour ratio test [23], or enforcing matches to be mutual
nearest neighbours."*

- **Limitation**: "*Both approaches work well for many applications, but have the disadvantage
of discarding many correct matches, which can be problematic for challenging scenes, such as
indoor spaces considered in this work that include repetitive and textureless areas. While successful,
handcrafted descriptors have only limited tolerance to large appearance changes beyond the built-in
invariance.*"

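> **Traditionally, correspondences were obtained from hand-crafted local invariant features (local features that stay relatively stable under transformation), and candidate (tentative, possibly ambiguous) correspondences were resolved by nearest neighbour matching. In other words, using nearest neighbours as evidence already existed in hand-crafted form.**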


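> **However, these methods do not handle indoor scenes with repetitive and textureless areas well. That is, they have only limited tolerance to large appearance changes beyond the built-in invariance.**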
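
The two filtering strategies named in the quote above, the second nearest neighbour ratio test and mutual nearest neighbours, can be sketched as follows. This is a minimal illustration under assumptions: descriptors are plain tuples compared with Euclidean distance, and the function names, the `ratio=0.8` value, and the toy descriptors are hypothetical choices, not taken from the paper.

```python
import math


def nearest_two(query, candidates):
    """Return (index of nearest, nearest distance, second-nearest distance)."""
    dists = sorted((math.dist(query, c), idx) for idx, c in enumerate(candidates))
    return dists[0][1], dists[0][0], dists[1][0]


def ratio_test_matches(desc_a, desc_b, ratio=0.8):
    """Keep a match only if it is clearly better than the runner-up."""
    matches = []
    for i, d in enumerate(desc_a):
        j, best, second = nearest_two(d, desc_b)
        if best < ratio * second:
            matches.append((i, j))
    return matches


def mutual_nn_matches(desc_a, desc_b):
    """Keep a match only if each descriptor is the other's nearest neighbour."""
    a_to_b = [nearest_two(d, desc_b)[0] for d in desc_a]
    b_to_a = [nearest_two(d, desc_a)[0] for d in desc_b]
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]


# Toy descriptors: the third entry of desc_a is ambiguous because it lies
# roughly halfway between the first two entries of desc_b.
desc_a = [(0.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
desc_b = [(0.1, 0.0), (1.0, 0.9), (5.0, 5.0)]

ratio_matches = ratio_test_matches(desc_a, desc_b)
mutual_matches = mutual_nn_matches(desc_a, desc_b)
```

Both filters reject the ambiguous third descriptor. This also shows the limitation discussed above: in repetitive or textureless regions many descriptors look like that third entry, so correct matches get discarded together with the wrong ones.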

---

#### **Matching with trainable descriptors.**

- **What is it**: *"The majority of trainable image descriptors are based on
convolutional neural networks (CNNs) and typically operate on patches extracted using a feature
detector such as DoG [23], yielding a sparse set of descriptors [3, 4, 10, 17, 36, 37] or use a pre-trained
image-level CNN feature extractor [26, 32]. Others have recently developed trainable methods that
comprise both feature detection and description [7, 26, 43]. The extracted descriptors are typically
compared using the Euclidean distance,"*

- **Limitation**: *"but an appropriate similarity score can be also learnt in
a discriminative manner [13, 44], where a trainable model is used to both extract descriptors and produce a similarity score. Finding matches consistent with a geometric model is typically performed
in a separate post-processing stage [3, 4, 7, 10, 17, 22, 26, 36, 37, 43]."*

> **Trainable descriptors are CNN-based. They either operate on patches extracted with a feature detector such as DoG, yielding a sparse set of descriptors, or use a pre-trained image-level CNN feature extractor. Some methods make both feature detection and description trainable. The resulting descriptors are typically compared using the Euclidean distance.**

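> **A trainable model can also be used both to extract descriptors and to produce the similarity score. However, finding matches consistent with a geometric model still had to be done in a separate post-processing stage.**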

---

#### **Trainable image alignment.**

- **What is it**: *"Recently, end-to-end trainable methods have been developed to
produce correspondences between images according to a parametric geometric model, such as an
affine, perspective or thin-plate spline transformation [28, 29]. In these works, all pairwise feature
matches are computed and used to estimate the geometric transformation parameters using a CNN.*
***Unlike previous methods that capture only a sparse set of correspondences, this geometric estimation
CNN captures interactions between a full set of dense correspondences."***

- **Limitation**:  "*However, these methods
currently only estimate a low complexity parametric transformation, and therefore their application
is limited to only coarse image alignment tasks.*"

- **Solution**: *"In contrast, we target a more general problem of
identifying reliable correspondences between images of a general 3D scene. Our approach is not
limited to a low dimensional parametric model, but outputs a generic set of locally consistent image
correspondences, applicable to a wide range of computer vision problems ranging from category-level
image alignment to camera pose estimation. The proposed method builds on the classical ideas of
neighbourhood consensus, which we review next."*

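> **Recent end-to-end trainable methods produce correspondences according to a parametric geometric model. Unlike earlier methods that capture only a sparse set of correspondences, they capture interactions across the full set of dense correspondences. However, they currently estimate only a low-complexity parametric transformation, so they remain limited to coarse image alignment tasks.**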

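> **In contrast, the method in this paper is not limited to a low-dimensional parametric model. It targets reliable correspondences between images of a general 3D scene and applies to a wide range of vision problems, from category-level image alignment to camera pose estimation. The underlying idea is neighbourhood consensus, reviewed next.**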


---


#### **Match filtering by neighbourhood consensus**

- **What is it**: *"Several strategies have been introduced to decide
whether a match is correct or not, given the supporting evidence from the neighbouring matches. The
early examples analyzed the patterns of* ***distances [46] or angles [34] between neighbouring matches.***
*Later work simply counts the number of consistent matches in a certain image neighbourhood [33, 38],
which can be built in a scale invariant manner [30] or using a regular image grid [5]. While
simple, these techniques have been remarkably effective in removing random incorrect matches and
disambiguating local repetitive patterns [30]."*

- **Solution**: *"Inspired by this simple yet powerful idea we develop a
neighbourhood consensus network – a convolutional neural architecture that (i) analyzes the full set of
dense matches between a pair of images and (ii) learns patterns of locally consistent correspondences
directly from data."*

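> **Several heuristics use neighbouring matches to decide whether a match is correct. Early methods analysed the patterns of distances or angles between neighbouring matches; later work counted the number of consistent matches within a certain image neighbourhood. These techniques are effective at removing incorrect matches and disambiguating repetitive local patterns.**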

> **Inspired by this, the authors develop a CNN-based neighbourhood consensus network that (i) analyses the full set of dense matches between a pair of images and (ii) learns patterns of locally consistent correspondences directly from data.**
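
The classical counting heuristic described in the quote above (counting consistent matches in an image neighbourhood) can be sketched as below. This illustrates only the hand-crafted idea, not the learned network; the function names, the `radius` and `min_support` parameters, and the toy matches are hypothetical.

```python
import math


def consensus_support(matches, radius=2.0):
    """For each match, count the other matches that are nearby in BOTH images.

    matches: list of (point_in_a, point_in_b) pairs, each point an (x, y) tuple.
    """
    support = []
    for idx, (pa, pb) in enumerate(matches):
        count = 0
        for other, (qa, qb) in enumerate(matches):
            if other == idx:
                continue
            # A neighbour supports the match if it is close in image A
            # and maps to a close location in image B.
            if math.dist(pa, qa) <= radius and math.dist(pb, qb) <= radius:
                count += 1
        support.append(count)
    return support


def filter_by_consensus(matches, radius=2.0, min_support=1):
    """Keep only matches with at least `min_support` supporting neighbours."""
    support = consensus_support(matches, radius)
    return [m for m, s in zip(matches, support) if s >= min_support]


matches = [
    ((0, 0), (10, 10)),
    ((1, 0), (11, 10)),
    ((0, 1), (10, 11)),
    ((1, 1), (30, 5)),  # neighbours in image A, but lands far away in image B
]
support = consensus_support(matches)
kept = filter_by_consensus(matches)
```

The last match has neighbours in the first image but none that agree in the second, so it receives no support and is dropped. A learned consensus network replaces this fixed counting rule with patterns learned from data.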

---

#### **Flow and disparity estimation**

- **What is it**: *"Related are also methods that estimate optical flow or stereo
disparity such as [6, 15, 16, 24, 39], or their trainable counterparts [8, 19, 40]. These works also aim
at establishing reliable point to point correspondences between images."*

- **Limitation**: *"This (our method) is different from optical flow where
image pairs are usually consecutive video frames with small viewpoint or appearance changes, and
stereo where matching is often reduced to a local search around epipolar lines. The optical flow
and stereo problems are well addressed by specialized methods that explicitly exploit the problem
constraints (such as epipolar line constraint, small motion, smoothness, etc.)."*

- **Solution**: *"However, we address a more
general matching problem where images can have large viewpoint changes (indoor localization) or
major changes in appearance (category-level matching)."*

> **Methods for estimating optical flow or stereo disparity, and their trainable counterparts, also exist. However, optical flow image pairs are usually consecutive video frames with small viewpoint or appearance changes, and in stereo, matching is often reduced to a local search around epipolar lines. Both problems are well handled by specialized methods that exploit constraints such as the epipolar line constraint, small motion, and smoothness.**

> **This paper instead addresses a more general matching problem, with large viewpoint changes (indoor localization) or major changes in appearance (category-level matching).**

## **Proposed approach**

***"In this work, we combine the robustness of neighbourhood consensus filtering with the power of
trainable neural architectures."***

>

- *We design a model which learns to discriminate a reliable match by
recognizing patterns of supporting matches in its neighbourhood.*

>

- *Furthermore, we do this in a fully
differentiable way, such that this trainable matching module can be directly combined with strong
CNN image descriptors. The resulting pipeline can then be trained in an end-to-end manner for the
task of feature matching. An overview of our proposed approach is presented in Fig. 1.*

>

- *There are five main components: (i) dense feature extraction and matching, (ii) the neighbourhood consensus
network, (iii) a soft mutual nearest neighbour filtering, (iv) extraction of correspondences from the
output 4D filtered match tensor, and (v) weakly supervised training loss. These components are
described next.*

>
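
Components (i), (iii) and (iv) from the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions: feature maps are dictionaries from grid positions to non-negative descriptors, matching uses cosine similarity, and the soft mutual nearest neighbour step rescales each score by its ratio to the best score for the same source position and for the same target position. That formulation, the function names, and the toy features are assumptions made for illustration; the neighbourhood consensus network and the training loss are omitted.

```python
import math


def l2_normalize(v):
    """Scale a descriptor to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def correlation_4d(feat_a, feat_b):
    """Cosine similarity between every position in A and every position in B.

    feat_a, feat_b: dict mapping (row, col) -> descriptor (list of floats).
    Returns a dict mapping (i, j, k, l) -> score, i.e. a sparse 4D tensor.
    """
    a = {p: l2_normalize(v) for p, v in feat_a.items()}
    b = {p: l2_normalize(v) for p, v in feat_b.items()}
    return {pa + pb: sum(x * y for x, y in zip(va, vb))
            for pa, va in a.items() for pb, vb in b.items()}


def soft_mutual_nn(corr, eps=1e-8):
    """Down-weight scores that are not the best in both matching directions."""
    best_for_target = {}  # max over source positions, per target (k, l)
    best_for_source = {}  # max over target positions, per source (i, j)
    for (i, j, k, l), c in corr.items():
        best_for_target[(k, l)] = max(best_for_target.get((k, l), c), c)
        best_for_source[(i, j)] = max(best_for_source.get((i, j), c), c)
    filtered = {}
    for (i, j, k, l), c in corr.items():
        ratio_a = c / (best_for_target[(k, l)] + eps)
        ratio_b = c / (best_for_source[(i, j)] + eps)
        filtered[(i, j, k, l)] = ratio_a * ratio_b * c
    return filtered


def best_matches(scores):
    """For each source position, pick the target position with the top score."""
    best = {}
    for (i, j, k, l), s in scores.items():
        if (i, j) not in best or s > best[(i, j)][1]:
            best[(i, j)] = ((k, l), s)
    return {src: tgt for src, (tgt, _) in best.items()}


# Toy 1x2 feature maps with 2-dimensional, non-negative descriptors.
feat_a = {(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0]}
feat_b = {(0, 0): [0.1, 1.0], (0, 1): [1.0, 0.2]}

corr = correlation_4d(feat_a, feat_b)
filtered = soft_mutual_nn(corr)
matched = best_matches(filtered)
```

Unlike a hard mutual nearest neighbour check, the soft version keeps every score and only rescales it, so the whole pipeline stays differentiable. Scores that are already the best in both directions are left almost unchanged, while weak alternatives are suppressed further.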

#### **Dense feature extraction and matching**

![](https://www.di.ens.fr/willow/research/ncnet/images/teaser.png)

---

#### **Neighbourhood consensus network**
---

#### **Soft mutual nearest neighbour filtering**
---

#### **Extracting correspondences from the correlation map**
---

#### **Weakly-supervised training**



## **Experimental results**

#### **Implementation details**

---

#### **Category-level matching**

---

#### **Instance-level matching**

---

#### **Limitations**

---

## **Conclusion**