<a href="https://colab.research.google.com/github/leeds1219/Article_Review/blob/main/AudioVisual_Grouping_Network.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Audio-Visual Grouping Network for Sound Localization from Mixtures**
Shentong Mo, Yapeng Tian


<img src = "https://drive.google.com/uc?id=1tPSL_zpnv99tELSvOgUkbmE69Xr6uvNF">


# Baseline Single-source Localization & Audio-Visual Encoder

During training, no bounding-box or mask-level annotations are available. Only the video-level label for the mixture spectrogram and image can be used, so the model is trained with **weakly-supervised learning**.

$L_{baseline} = -\frac{1}{B}\sum_{b=1}^{B}\log\frac{\exp(\frac{1}{\tau}\mathrm{sim}(F_{b}^{a},F_{b}^{v}))}{\sum_{m=1}^{B}\exp(\frac{1}{\tau}\mathrm{sim}(F_{b}^{a},F_{m}^{v}))}$

The Multiple-Instance Learning (MIL) framework deals with scenarios where training labels are attached to sets (bags) of instances rather than to individual instances.

**Multiple-instance contrastive objective** is a technique used within this framework to train models effectively.


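Positive and Negative Pairs : a positive pair consists of similar items and a negative pair of dissimilar items. In the loss above, the audio and image of the same sample $b$ form the positive pair, and the audio of sample $b$ with the image of another sample $m$ forms a negative pair.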

Contrastive Loss : Define a contrastive loss function that encourages the model to make the representations of positive pairs similar and those of negative pairs dissimilar.
This loss function penalizes the model when the representations of positive pairs are not close enough and when the representations of negative pairs are too close.

Training Objective : The objective is to learn representations of bags in such a way that bags belonging to the same class are closer in the learned embedding space compared to bags from different classes.

Learning Representations : The model learns to create embeddings/representations for bags that capture the bag-level characteristics important for classification.
These representations are used to make predictions on new unseen bags.


$B$ is the batch size, sim() denotes the **max-pooled audio-visual cosine similarity** of $F^{a}$ and $F^{v}=\{f_{p}^{v}\}_{p=1}^{P}$ across all $P$ spatial locations, $D$ is the feature dimension, and $\tau$ is the temperature hyper-parameter.
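To make the symbols concrete, here is a minimal NumPy sketch of the baseline loss. It is not the authors' code: the function names, the random inputs, and the value `tau=0.03` are assumptions made for illustration only.

```python
import numpy as np

def max_pooled_sim(audio, visual):
    """audio: (B, D), visual: (B, P, D) -> S of shape (B, B),
    where S[b, m] = max over p of cos(audio[b], visual[m, p])."""
    a = audio / np.linalg.norm(audio, axis=-1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=-1, keepdims=True)
    cos = np.einsum("bd,mpd->bmp", a, v)  # cosine similarity at every location
    return cos.max(axis=-1)               # max-pool over the P locations

def baseline_loss(audio, visual, tau=0.03):  # tau value is an assumption
    logits = max_pooled_sim(audio, visual) / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))    # positives sit on the diagonal (m == b)

rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 8))       # B=4 audio features, D=8
visual = rng.normal(size=(4, 5, 8))   # B=4 images, P=5 locations, D=8
loss = baseline_loss(audio, visual)
```

Each row of the similarity matrix is treated as a softmax classification over the batch, with the matching image as the correct class.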

**Max pooled cosine similarity** is a method for computing the similarity between two sets of embeddings. It is used, for example, in natural language processing (NLP), but it applies to any modality.

When you have multiple embeddings representing different elements within two sets, you can calculate cosine similarity between each pair of embeddings. The max pooled cosine similarity then takes the maximum similarity score among all the calculated pairwise similarities.

For instance, suppose you have two sets of embeddings: Set A (embedding representations of elements from the first set) and Set B (embedding representations of elements from the second set).

To compute max pooled cosine similarity:

1. Calculate cosine similarity between each element in Set A and each element in Set B.

2. For each element in Set A, find the highest cosine similarity among its similarities with elements in Set B.

3. Take the maximum similarity score among all the highest similarities calculated in step 2.
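The three steps above can be sketched in NumPy as follows. The helper name `max_pooled_cosine` and the toy vectors are hypothetical and only illustrate the procedure.

```python
import numpy as np

def max_pooled_cosine(set_a, set_b):
    """set_a: (Na, D), set_b: (Nb, D) -> scalar similarity."""
    a = set_a / np.linalg.norm(set_a, axis=1, keepdims=True)
    b = set_b / np.linalg.norm(set_b, axis=1, keepdims=True)
    pairwise = a @ b.T                 # step 1: all pairwise cosine similarities
    best_per_a = pairwise.max(axis=1)  # step 2: best match in B for each element of A
    return best_per_a.max()            # step 3: maximum over those best matches

set_a = np.array([[1.0, 0.0], [0.0, 1.0]])
set_b = np.array([[1.0, 1.0], [0.0, 2.0], [-1.0, 0.0]])
score = max_pooled_cosine(set_a, set_b)  # the pair [0, 1] and [0, 2] is parallel
```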

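This approach can be useful in various NLP tasks, such as semantic similarity measurements, information retrieval, or text classification, where you want to determine the overall similarity between two sets based on their constituent elements' embeddings. In this paper, Set A is the single audio feature $F^{a}$ and Set B is the set of $P$ spatial visual features, so the score is the similarity at the best-matching image location.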

# Audio-Visual Class Tokens (Transformer Layers)

$C$ is the number of source event categories.

Aggregated audio-visual class tokens: $\{ \hat{c}_{i}^{a} \}_{i=1}^{C}$ and $\{ \hat{c}_{i}^{v} \}_{i=1}^{C}$.

Global audio feature $\hat{f}^{a}$ and spatial visual features $\{\hat{f}_{p}^{v}\}_{p=1}^{P}$.


Features and class tokens from the raw image and audio-mixture inputs are aligned with **self-attention transformers**: $\hat{f}^{a}, \{ \hat{c}_{i}^{a} \}_{i=1}^{C} = \{\phi^{a}(x_{j}^{a}, X^{a},X^{a})\}_{j=1}^{1+C}$

$\text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$

$X^{a} = \{x_{j}^{a}\}_{j=1}^{1+C} = [f^{a};\{c_{i}\}_{i=1}^{C}]$ is the entire set of input features, and $x_{j}^{a}$ is an individual element of that set.

$\phi^{a}(x_{j}^{a}, X^{a},X^{a}) = \text{Softmax}\left(\frac{x_{j}^{a}{X^{a}}^T}{\sqrt{D}}\right) X^{a}$
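As a rough illustration of $\phi^{a}$, the sketch below applies the formula to a stacked input $[f^{a}; c_1, \dots, c_C]$. It follows the equation literally, so it has no learned projections, multiple heads, or residual connections, which a real transformer layer would typically add. All names and sizes are assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """X: (1 + C, D), rows are [f^a; c_1 .. c_C]. Returns updated rows."""
    D = X.shape[-1]
    weights = softmax(X @ X.T / np.sqrt(D), axis=-1)  # (1+C, 1+C), rows sum to 1
    return weights @ X                                # each row attends to all rows

rng = np.random.default_rng(0)
C, D = 3, 8
f_a = rng.normal(size=(1, D))           # global audio feature
class_tokens = rng.normal(size=(C, D))  # learnable class tokens c_i
X = np.concatenate([f_a, class_tokens], axis=0)

out = self_attention(X)
f_a_hat, c_hat = out[:1], out[1:]       # updated audio feature and class tokens
```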

Individual source class probability with **Softmax and Fully connected layer**

$e_{i}$ = Softmax(FC($c_{i}$))

**Class-Constraint Loss**

**Cross entropy loss** : $\sum_{i=1}^{C}\mathrm{CE}(h_{i},e_{i})$

$h_{i}$ is a **one-hot encoding** whose entry for the target category $i$ is 1 and whose other entries are 0.

**One-hot encoding** is a technique used in machine learning and data processing to convert categorical data into a numerical format.
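A small sketch of the class-constraint loss may help here. It assumes the fully connected layer maps each $D$-dimensional token to $C$ logits (`W` and `b` are hypothetical random parameters), so that token $i$ is pushed towards class $i$.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def class_constraint_loss(tokens, W, b):
    """tokens: (C, D); W: (D, C); b: (C,). Returns (loss, e)."""
    C = tokens.shape[0]
    e = softmax(tokens @ W + b, axis=-1)       # (C, C): row i is e_i
    h = np.eye(C)                              # row i is the one-hot target h_i
    ce = -(h * np.log(e + 1e-12)).sum(axis=1)  # CE(h_i, e_i) for each token
    return ce.sum(), e

rng = np.random.default_rng(0)
C, D = 3, 8
tokens = rng.normal(size=(C, D))
W = rng.normal(size=(D, C)) * 0.1   # assumed FC weights
b = np.zeros(C)
loss, e = class_constraint_loss(tokens, W, b)
```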

# Audio-Visual Grouping Module

Grouping blocks $g^{a}(\cdot)$, $g^{v}(\cdot)$

**Similarity matrix**

$A_{i}^{a}$ = Softmax($W_{q}^{a}\hat{f}^{a}\cdot W_{k}^{a}\hat{c}_{i}^{a}$)

$A_{p,i}^{v}$ = Softmax($W_{q}^{v}\hat{f}_{p}^{v}\cdot W_{k}^{v}\hat{c}_{i}^{v}$)

A weighted sum with the similarity matrix yields the category-aware representations:

$g_{i}^{a} = g^{a}(\hat{f}^{a},\hat{c}_{i}^{a}) = \hat{c}_{i}^{a} + W_{o}^{a}\frac{A_{i}^{a}W_{v}^{a}\hat{f}^{a}}{A_{i}^{a}}$

$g_{i}^{v} = g^{v}(\{\hat{f}_{p}^{v}\}_{p=1}^{P},\hat{c}_{i}^{v}) = \hat{c}_{i}^{v} + W_{o}^{v}\frac{\sum_{p=1}^{P}A_{p,i}^{v}W_{v}^{v}\hat{f}_{p}^{v}}{\sum_{p=1}^{P}A_{p,i}^{v}}$
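The visual grouping block can be sketched as below. Two things are assumptions rather than statements from these notes: the softmax is taken over the $C$ class tokens, and features are stored as row vectors so the projection matrices multiply from the right. The weights are random placeholders.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def visual_grouping(f_v, c_v, W_q, W_k, W_v, W_o):
    """f_v: (P, D) spatial features, c_v: (C, D) class tokens -> g_v: (C, D)."""
    q = f_v @ W_q                      # (P, D) queries from spatial features
    k = c_v @ W_k                      # (C, D) keys from class tokens
    A = softmax(q @ k.T, axis=1)       # (P, C); softmax over classes (assumption)
    v = f_v @ W_v                      # (P, D) values
    pooled = (A.T @ v) / A.sum(axis=0)[:, None]  # weighted average over P locations
    return c_v + pooled @ W_o          # residual update of each class token

rng = np.random.default_rng(0)
P, C, D = 6, 3, 4
f_v = rng.normal(size=(P, D))
c_v = rng.normal(size=(C, D))
W_q, W_k, W_v, W_o = (rng.normal(size=(D, D)) for _ in range(4))
g_v = visual_grouping(f_v, c_v, W_q, W_k, W_v, W_o)
```

For the audio branch there is only one global feature, so the similarity weight appears in both numerator and denominator and the update reduces to the class token plus a projection of the audio feature.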

A **binary probability** for each category is predicted from its grouped representation:

$p_{i}^{a}$ = Sigmoid(FC($g_{i}^{a}$))

$p_{i}^{v}$ = Sigmoid(FC($g_{i}^{v}$))

The audio-visual source classes $\{y_{i}\}_{i=1}^{C}$ serve as the weak supervision. A **binary cross-entropy** term for each category handles the multi-label classification problem, and it is combined with the class-constraint loss:

**Audio-Visual Grouping Loss**

$L_{group} = \sum_{i=1}^{C}\{\mathrm{CE}(h_{i},e_{i})+\mathrm{BCE}(y_{i},p_{i}^{a})+\mathrm{BCE}(y_{i},p_{i}^{v})\}$
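The grouping loss can be evaluated on toy numbers as in the sketch below. The probabilities are made-up values, and the small `eps` inside the logarithms is an implementation detail added here to avoid `log(0)`.

```python
import numpy as np

def bce(y, p, eps=1e-12):
    """Element-wise binary cross-entropy between labels y and probabilities p."""
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def grouping_loss(e, y, p_a, p_v):
    """e: (C, C) with row i = e_i; y, p_a, p_v: (C,)."""
    # With a one-hot target h_i, CE(h_i, e_i) reduces to -log of entry i of e_i.
    ce = -np.log(np.diag(e) + 1e-12)
    return np.sum(ce + bce(y, p_a) + bce(y, p_v))

e = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1],
              [0.1, 0.1, 0.8]])    # class probabilities of the 3 tokens
y = np.array([1.0, 0.0, 1.0])      # classes 0 and 2 are present in the video
p_a = np.array([0.9, 0.2, 0.6])    # audio branch predictions
p_v = np.array([0.7, 0.1, 0.8])    # visual branch predictions
loss = grouping_loss(e, y, p_a, p_v)
```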


With the class-constraint loss, category-aware representations are generated for audio-visual alignment

$\{g_{i}^{a}\}_{i=1}^{C}$

$\{g_{i}^{v}\}_{i=1}^{C}$

The global audio and visual representations for the $N$ source embeddings

$\{g_{n}^{a}\}_{n=1}^{N}$

$\{g_{n}^{v}\}_{n=1}^{N}$

are chosen from the $C$ categories according to the **ground-truth class** ($y$).
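In array terms, this selection is just indexing with the positions of the positive labels. The snippet below is a toy sketch with random representations and an assumed multi-hot label vector `y`.

```python
import numpy as np

rng = np.random.default_rng(0)
C, D = 5, 8
g_a = rng.normal(size=(C, D))   # category-aware audio representations
g_v = rng.normal(size=(C, D))   # category-aware visual representations
y = np.array([0, 1, 0, 1, 0])   # multi-hot ground truth: classes 1 and 3 present

src = np.flatnonzero(y)                 # indices of the N present classes
g_a_src, g_v_src = g_a[src], g_v[src]   # (N, D) source embeddings
```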

# Localization

The audio-visual similarity is calculated by **max pooling audio-visual cosine similarities** between the class-aware audio features $F_{b,n}^{a} = g_{n}^{a}$ and the spatial-level visual features $F_{b,n}^{v} = \{f_{p}^{v}\odot g_{n}^{v}\}_{p=1}^{P}$,

where the subscript $b,n$ refers to the $n$-th source in the $b$-th sample.

$L_{loc} = -\frac{1}{BN}\sum_{b=1}^{B}\sum_{n=1}^{N}\log\frac{\exp(\frac{1}{\tau}\mathrm{sim}(F_{b,n}^{a},F_{b,n}^{v}))}{\sum_{m=1}^{B}\exp(\frac{1}{\tau}\mathrm{sim}(F_{b,n}^{a},F_{m,n}^{v}))}$

Representations for N source embeddings are chosen from C categories according to the corresponding GT class

$B$ is the mini-batch size and $N$ is the number of sources per sample.
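A possible NumPy sketch of the localization loss is given below. It assumes, for simplicity, that every sample in the batch has the same number of sources $N$ and that the inputs are already selected by ground-truth class. The function name, the shapes, and `tau=0.03` are illustrative choices, not taken from the paper.

```python
import numpy as np

def localization_loss(g_a, g_v, f_v, tau=0.03):
    """g_a, g_v: (B, N, D) class-aware embeddings; f_v: (B, P, D) spatial features."""
    # F^v_{b,n} = {f_p ⊙ g^v_n}: element-wise product at every location
    F_v = f_v[:, None, :, :] * g_v[:, :, None, :]          # (B, N, P, D)
    a = g_a / np.linalg.norm(g_a, axis=-1, keepdims=True)
    v = F_v / np.linalg.norm(F_v, axis=-1, keepdims=True)
    cos = np.einsum("bnd,mnpd->bnmp", a, v)   # audio (b, n) vs visual (m, n) at p
    logits = cos.max(axis=-1) / tau           # max-pool over P -> (B, N, B)
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    b = np.arange(g_a.shape[0])
    positives = log_prob[b, :, b]             # (B, N): entries with m == b
    return -positives.mean()                  # average over B * N terms

rng = np.random.default_rng(0)
B, N, P, D = 4, 2, 5, 8
g_a = rng.normal(size=(B, N, D))
g_v = rng.normal(size=(B, N, D))
f_v = rng.normal(size=(B, P, D))
loss = localization_loss(g_a, g_v, f_v)
```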

End-to-end loss $L$ = $L_{loc}+L_{group}$

# Inference

The audio-visual **cosine similarity map** between the class-aware audio-visual representations is used to generate the $n$-th source localization map with $P$ locations.

The final localization map is generated through **bilinear interpolation** of the similarity map.

**Bilinear interpolation** estimates the value at a fractional coordinate as a distance-weighted average of the four nearest grid values. RoI Align uses the same operation to sample feature maps at non-integer positions, which preserves spatial information more accurately than quantizing coordinates before max pooling or average pooling.
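The inference step can be sketched as a cosine similarity map followed by bilinear upsampling. The hand-written `bilinear_resize` below uses an align-corners sampling grid, which is an assumption. The grid size, output size, and random features are also illustrative.

```python
import numpy as np

def cosine_similarity_map(g_a, F_v, h, w):
    """g_a: (D,) audio embedding; F_v: (P, D) visual features with P == h * w."""
    a = g_a / np.linalg.norm(g_a)
    v = F_v / np.linalg.norm(F_v, axis=1, keepdims=True)
    return (v @ a).reshape(h, w)

def bilinear_resize(m, out_h, out_w):
    """Upsample a 2-D map with bilinear interpolation (align-corners grid)."""
    in_h, in_w = m.shape
    ys = np.linspace(0, in_h - 1, out_h)   # fractional row coordinates
    xs = np.linspace(0, in_w - 1, out_w)   # fractional column coordinates
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]                # vertical interpolation weights
    wx = (xs - x0)[None, :]                # horizontal interpolation weights
    top = m[y0][:, x0] * (1 - wx) + m[y0][:, x1] * wx
    bottom = m[y1][:, x0] * (1 - wx) + m[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

rng = np.random.default_rng(0)
h, w, D = 7, 7, 8
g_a = rng.normal(size=D)
F_v = rng.normal(size=(h * w, D))
sim_map = cosine_similarity_map(g_a, F_v, h, w)   # (7, 7) similarity map
loc_map = bilinear_resize(sim_map, 224, 224)      # upsampled localization map
```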