# FINCH Clustering test

In this notebook, we test the [FINCH clustering algorithm](https://github.com/ssarfraz/FINCH-Clustering) on a small subset of highly interpretable neurons (the FC neurons of the classification pseudo-layer), and use the [h-NNE dimensionality reduction method](https://github.com/koulakis/h-nne) to plot them in two dimensions.

## Index

- [Initialization](#initialization)
- [Storing classification pseudo-layer activations](#storing-classification-pseudo-layer-activations)
- [Clustering and plotting with FINCH and h-NNE](#clustering-and-plotting-with-finch-and-h-nne)

## Initialization

This section contains necessary setup for the experiment. It can usually be collapsed and the experiment can still be understood without poring over these details.

- [Imports and hyperparameters](#imports-and-hyperparameters)
- [Seeding randomness](#seeding-randomness)
- [Loading the model and dataset](#loading-the-model-and-dataset)

### Imports and hyperparameters

In [1]:
from CLAPWrapper import CLAPWrapper
from esc50_dataset import ESC50
import torch
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange, reduce, repeat
import os
import re
import logging
from collections import Counter
from datetime import datetime
import numpy as np
from scipy import stats
from tqdm.notebook import tqdm
from icecream import ic
import pandas as pd
from sklearn.metrics import accuracy_score
from IPython.display import Audio, clear_output
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.figure_factory as ff
from typing import Callable, List
from finch import FINCH
from hnne import HNNE

The keys of `module_activation_dict` are the layers of interest, whose activations are stored.
Each value is the activation function applied to that layer's output before it is stored; the classification pseudo-layer uses `nn.Identity()`.

In [8]:
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
TOP_K = 100
NUM_CLASSES = 50
NUM_INSTANCES = 2000

module_activation_dict = {
    # Classification pseudo-layer
    'classification_layer': nn.Identity()
}

module_list = list(module_activation_dict.keys())

### Seeding randomness
This seeds every source of randomness that might be present in the model.
That **should** only be the dropout layers between the projection matrices, and the random sample
of the audio that is taken if the model's input audio duration does **not** match
the dataset's audio duration.

The dropout layers are deactivated once the model is switched to evaluation mode, and
CLAP's expected audio duration matches ESC-50's audio duration (5 seconds), so this shouldn't matter,
but better safe than sorry.


In [3]:
def seed_everything(seed: int):
    import random, os
    import numpy as np
    import torch

    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # seeds every visible GPU
    torch.backends.cudnn.deterministic = True
    # Autotuning may pick different kernels between runs, so keep it off
    torch.backends.cudnn.benchmark = False

In [4]:
seed_everything(42)

### Loading the model and dataset


#### Loading weights
Microsoft has made the pre-trained weights for CLAP available for download on request [here](https://zenodo.org/record/7312125#.Y22vecvMIQ9).

In [5]:
weights_path = "/scratch/pratyaksh.g/clap/CLAP_weights_2022_microsoft.pth"
# DEVICE is a torch.device, so compare its `.type` rather than the object itself
clap_model = CLAPWrapper(weights_path, use_cuda=DEVICE.type == "cuda")

In [6]:
clap_model.clap.eval()

CLAP(
  (audio_encoder): AudioEncoder(
    (base): Cnn14(
      (spectrogram_extractor): Spectrogram(
        (stft): STFT(
          (conv_real): Conv1d(1, 513, kernel_size=(1024,), stride=(320,), bias=False)
          (conv_imag): Conv1d(1, 513, kernel_size=(1024,), stride=(320,), bias=False)
        )
      )
      (logmel_extractor): LogmelFilterBank()
      (bn0): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv_block1): ConvBlock(
        (conv1): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (conv_block2): ConvBlock(
        (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
 

#### Loading dataset
We use the [ESC-50 dataset](https://github.com/karolpiczak/ESC-50), which consists of 2000 recordings across 50 classes of environmental sounds, each 5 seconds long.

In [7]:
dataset = ESC50(root="/scratch/pratyaksh.g/esc50/", download=False)

Using downloaded and verified file: /scratch/pratyaksh.g/esc50/ESC-50-master.zip


2000it [00:00, 11737.70it/s]

Loading audio files





## Storing classification pseudo-layer activations

We need a 2000-dimensional embedding for each neuron of the final classification pseudo-layer: its activation on each of the 2000 dataset instances. We collect these activations using hooks.

In [9]:
module2name = {module: name for name, module in clap_model.clap.named_modules()}
raw_outputs = {}        # Without activation function
activated_outputs = {}  # With activation function

def save_activation(module, input, output):
    # `module` is an nn.Module when this runs as a forward hook, or a layer
    # name (str) when called manually, as for the classification pseudo-layer.
    name = module if isinstance(module, str) else module2name[module]

    activation = module_activation_dict[name](output).squeeze().detach().cpu().numpy()
    activated_outputs.setdefault(name, []).append(activation)

    raw_output = output.squeeze().detach().cpu().numpy()
    raw_outputs.setdefault(name, []).append(raw_output)

# Register the hook on every model module listed in `module_list`
for name, module in clap_model.clap.named_modules():
    if name in module_list:
        module.register_forward_hook(save_activation)


### Classification on ESC-50
Because CLAP embeds audio and text in a joint space, it can classify the dataset zero-shot. The class names are converted to text prompts, and text embeddings are computed
for these prompts. Since these text embeddings live in the same joint space as the audio embeddings, the logit of each class for a given audio clip is simply the similarity
between that class's text embedding and the clip's audio embedding.

We also explicitly store these as the outputs/activations of the 'classification_layer', layer index 3. Note that the 'classification_layer' outputs are indexed in the order of the class list passed in as prompts
(`dataset.classes`), not by ESC-50's own class indices, so we need to keep that in mind.
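The similarity-as-logits idea above can be sketched on toy data. This is a minimal illustration, not CLAP's implementation: the function name `zero_shot_probs` and the toy embeddings are hypothetical, and it assumes plain cosine similarity (CLAP's `compute_similarity` may additionally apply a learned scale, which is omitted here).

```python
import numpy as np

def zero_shot_probs(audio_emb, text_emb):
    """Softmax over cosine similarities between audio and class-prompt embeddings."""
    # L2-normalise so that the dot product is a cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T  # shape: (n_audio, n_classes)
    # Numerically stable softmax over the class axis.
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

# Toy data (hypothetical): 3 class prompts and 2 audio clips in a 4-d joint space.
text_emb = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0, 0.0]])
audio_emb = np.array([[0.9, 0.1, 0.0, 0.0],
                      [0.0, 0.2, 0.8, 0.1]])

probs = zero_shot_probs(audio_emb, text_emb)
predicted = probs.argmax(axis=1)  # index into the prompt list, one per clip
```

Each row of `probs` plays the role of the 50 classification pseudo-layer activations for one clip, and column `j` always refers to the `j`-th prompt that was passed in.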

In [10]:
# Computing text embeddings
prompt = 'this is a sound of '
y = [prompt + x for x in dataset.classes]
text_embeddings = clap_model.get_text_embeddings(y)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


In [11]:
# Computing audio embeddings
y_preds, y_labels = [], []
for idx in tqdm(range(len(dataset))):
    x, _, one_hot_target = dataset[idx]
    audio_embeddings = clap_model.get_audio_embeddings([x], resample=True)
    similarity = clap_model.compute_similarity(audio_embeddings, text_embeddings)
    y_pred = F.softmax(similarity.detach().cpu(), dim=1).numpy()
    # Manually store the softmax probabilities as the pseudo-layer's output
    save_activation('classification_layer', similarity, F.softmax(similarity.detach(), dim=1))
    y_preds.append(y_pred)
    y_labels.append(one_hot_target.detach().cpu().numpy())

clear_output()
print("Done!")
print("Computed audio embeddings for {} samples".format(len(dataset)))

Done!
Computed audio embeddings for 2000 samples


In [12]:
y_labels, y_preds = np.concatenate(y_labels, axis=0), np.concatenate(y_preds, axis=0)
acc = accuracy_score(np.argmax(y_labels, axis=1), np.argmax(y_preds, axis=1))
print('ESC50 Accuracy {}'.format(acc))

ESC50 Accuracy 0.827


### Transforming stored activations into neuron embeddings
`activated_outputs['classification_layer']` contains the activations we care about. It is a list of length 2000, with each entry storing the activations of the 50 neurons in the classification pseudo-layer for that instance. We need to transform this into an array of shape `(50, 2000)`, so that each of the 50 neurons has a 2000-dimensional embedding.

In [29]:
embeddings = np.stack(activated_outputs['classification_layer']).T
embeddings.shape

(50, 2000)

## Clustering and plotting with FINCH and h-NNE

In [36]:
# Cluster the neurons using FINCH
c, n_clust, _ = FINCH(embeddings)


Partition 0: 15 clusters
Partition 1: 6 clusters
Partition 2: 2 clusters


In [37]:
print(c.shape)
print(n_clust)


(50, 3)
[15, 6, 2]
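Judging by the shapes printed above, `c` holds one column of cluster labels per partition, ordered from finest (15 clusters) to coarsest (2 clusters). The sketch below shows one way to read such columns: grouping names by label, and checking that a finer partition is nested inside a coarser one, which is what we would expect if partitions are built by successively merging clusters. The helper names and toy labels are hypothetical and are not part of the FINCH package.

```python
from collections import defaultdict

def group_by_cluster(labels, names):
    """Map each cluster id to the list of names assigned to it."""
    groups = defaultdict(list)
    for label, name in zip(labels, names):
        groups[int(label)].append(name)
    return dict(sorted(groups.items()))

def is_nested(fine, coarse):
    """True if every fine cluster falls inside exactly one coarse cluster."""
    parent = {}
    for f, co in zip(fine, coarse):
        # Record the first coarse label seen for this fine cluster and
        # fail as soon as a different one shows up.
        if parent.setdefault(int(f), int(co)) != int(co):
            return False
    return True

# Toy labels (hypothetical), laid out like two columns of `c`.
names = ['dog', 'cat', 'rain', 'wind', 'siren']
p0 = [0, 0, 1, 1, 2]  # finer partition
p1 = [0, 0, 1, 1, 1]  # coarser partition

fine_groups = group_by_cluster(p0, names)
coarse_groups = group_by_cluster(p1, names)
nested = is_nested(p0, p1)
```

With the real results, the same helpers could be called as `group_by_cluster(c[:, k], dataset.classes)` for each partition `k`.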


In [38]:
hnne = HNNE()
projection = hnne.fit_transform(embeddings)

In [51]:
df = pd.DataFrame({
    'x': projection[:, 0],
    'y': projection[:, 1],
    'p0': c[:, 0],
    'p1': c[:, 1],
    'p2': c[:, 2],
    'class': dataset.classes,
})

#### Partition 2
Partition 2 seems to roughly separate the neurons into 'living things' and 'non-living things', with 'door wood creaks' being an exception.

In [58]:
px.scatter(df, x='x', y='y', color='p2', hover_data=['class'], title='Clustering partition 2')

In [93]:
# For each cluster in partition 2, find the dataset classes
for i in range(n_clust[2]):
    idx = np.where(c[:, 2] == i)[0]
    cluster = [dataset.classes[j] for j in idx]
    print(f"Cluster {i}: {cluster}")

Cluster 0: ['dog', 'rooster', 'pig', 'cow', 'frog', 'cat', 'hen', 'insects', 'sheep', 'crow', 'crickets', 'chirping birds', 'crying baby', 'sneezing', 'breathing', 'coughing', 'laughing', 'snoring', 'door wood creaks']
Cluster 1: ['rain', 'sea waves', 'crackling fire', 'water drops', 'wind', 'pouring water', 'toilet flush', 'thunderstorm', 'clapping', 'footsteps', 'brushing teeth', 'drinking sipping', 'door wood knock', 'mouse click', 'keyboard typing', 'can opening', 'washing machine', 'vacuum cleaner', 'clock alarm', 'clock tick', 'glass breaking', 'helicopter', 'chainsaw', 'siren', 'car horn', 'engine', 'train', 'church bells', 'airplane', 'fireworks', 'hand saw']


#### Partition 1
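Partition 1 splits the neurons into 6 clusters. The animal classes stay together (again with 'door wood creaks'), while the human sounds separate into a vocal group ('crying baby', 'sneezing', 'coughing', 'laughing') and a breathing group ('breathing', 'snoring').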

In [59]:
px.scatter(df, x='x', y='y', color='p1', hover_data=['class'], title='Clustering partition 1')

In [92]:
# For each cluster in partition 1, find the dataset classes
for i in range(n_clust[1]):
    idx = np.where(c[:, 1] == i)[0]
    cluster = [dataset.classes[j] for j in idx]
    print(f"Cluster {i}: {cluster}")

Cluster 0: ['dog', 'rooster', 'pig', 'cow', 'frog', 'cat', 'hen', 'insects', 'sheep', 'crow', 'crickets', 'chirping birds', 'door wood creaks']
Cluster 1: ['rain', 'sea waves', 'crackling fire', 'wind', 'thunderstorm', 'clapping', 'footsteps', 'glass breaking', 'fireworks']
Cluster 2: ['water drops', 'pouring water', 'brushing teeth', 'drinking sipping', 'door wood knock', 'mouse click', 'keyboard typing', 'can opening', 'clock tick', 'hand saw']
Cluster 3: ['toilet flush', 'washing machine', 'vacuum cleaner', 'clock alarm', 'helicopter', 'chainsaw', 'siren', 'car horn', 'engine', 'train', 'church bells', 'airplane']
Cluster 4: ['crying baby', 'sneezing', 'coughing', 'laughing']
Cluster 5: ['breathing', 'snoring']


#### Partition 0

In [95]:
px.scatter(df, x='x', y='y', color='p0', hover_data=['class'], title='Clustering partition 0')

In [94]:
# For each cluster in partition 0, find the dataset classes
for i in range(n_clust[0]):
    idx = np.where(c[:, 0] == i)[0]
    cluster = [dataset.classes[j] for j in idx]
    print(f"Cluster {i}: {cluster}")

Cluster 0: ['dog', 'rooster', 'hen', 'crow', 'chirping birds']
Cluster 1: ['pig', 'cow', 'sheep']
Cluster 2: ['frog', 'insects', 'crickets']
Cluster 3: ['cat', 'door wood creaks']
Cluster 4: ['rain', 'crackling fire']
Cluster 5: ['sea waves', 'wind', 'thunderstorm']
Cluster 6: ['water drops', 'pouring water', 'drinking sipping', 'door wood knock', 'can opening']
Cluster 7: ['toilet flush', 'washing machine', 'vacuum cleaner']
Cluster 8: ['crying baby', 'sneezing', 'coughing', 'laughing']
Cluster 9: ['clapping', 'footsteps', 'glass breaking', 'fireworks']
Cluster 10: ['breathing', 'snoring']
Cluster 11: ['brushing teeth', 'hand saw']
Cluster 12: ['mouse click', 'keyboard typing', 'clock tick']
Cluster 13: ['clock alarm', 'car horn']
Cluster 14: ['helicopter', 'chainsaw', 'siren', 'engine', 'train', 'church bells', 'airplane']
