# Storing activations
We store the activations of every neuron in the layers of interest of Microsoft's [CLAP](https://github.com/microsoft/CLAP) (Contrastive Language-Audio Pretraining) model, specifically its audio encoder, for each
dataset instance in [ESC50](https://github.com/karolpiczak/ESC-50), an environmental sound classification dataset.

In this notebook, we compute the activations for the dataset and store these.

> IMPORTANT: The model needs to be in evaluation mode and fully seeded.

## Imports and hyperparameters
We also list the layers of interest in this section.

In [11]:
from CLAPWrapper import CLAPWrapper
from esc50_dataset import ESC50
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from tqdm.notebook import tqdm
from sklearn.metrics import accuracy_score
from IPython.display import clear_output
from icecream import ic
import pandas as pd

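The keys of `module_activation_dict` are exactly the layers of interest, i.e. the layers whose activations are stored.
Each value is the activation function that follows that layer. We need these because the hooks capture each layer's raw output, and we apply the activation function to it ourselves.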

In [12]:
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
TOP_K = 100
NUM_CLASSES = 50

module_activation_dict = {
    # Conv blocks
    'audio_encoder.base.conv_block1': nn.Identity(),
    'audio_encoder.base.conv_block2': nn.Identity(),
    'audio_encoder.base.conv_block3': nn.Identity(),
    'audio_encoder.base.conv_block4': nn.Identity(),
    'audio_encoder.base.conv_block5': nn.Identity(),
    'audio_encoder.base.conv_block6': nn.Identity(),
    # FC layers
    'audio_encoder.base.fc1': F.relu,
    'audio_encoder.projection.linear1': F.gelu,
    'audio_encoder.projection.linear2': nn.Identity(),
    # Classification pseudo-layer
    'classification_layer': nn.Identity()
}

module_list = list(module_activation_dict.keys())

## Seeding randomness
This seeds any and all randomness that might be present in the model.  \
This **should** only be the dropout layers in between the projection matrices, and the random sample
of the audio that is taken if the model's input audio duration does **not** match
the dataset's input audio duration.

The dropout layers should be deactivated once the model is switched to evaluation mode, and
CLAP's expected audio duration matches ESC-50's audio duration (5 seconds), so this shouldn't matter,
but better to be safe than sorry.


In [13]:
def seed_everything(seed: int):
    import random, os
    import numpy as np
    import torch
    
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
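    torch.cuda.manual_seed_all(seed)  # seeds every visible GPU, not only the current one
    torch.backends.cudnn.deterministic = True
    # Benchmark mode lets cuDNN pick kernels by timing them, which can vary between runs
    torch.backends.cudnn.benchmark = False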

In [14]:
seed_everything(42)
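To illustrate why the duration mismatch is the case where seeding matters, here is a minimal sketch using only Python's `random` module. The `random_crop` helper is hypothetical and is not CLAP's actual preprocessing code; it only mimics the idea of taking a random window when a clip is longer than the expected length.

```python
import random

def random_crop(samples, target_len, rng):
    """Hypothetical stand-in for a duration-matching step: crop only if the clip is too long."""
    if len(samples) <= target_len:
        return list(samples)
    start = rng.randint(0, len(samples) - target_len)
    return list(samples[start:start + target_len])

clip = list(range(10))

# Clip length equals the target length: the result is the whole clip, whatever the seed.
same_a = random_crop(clip, 10, random.Random(0))
same_b = random_crop(clip, 10, random.Random(1))

# Clip longer than the target: the window depends on the RNG state,
# so only a fixed seed makes the crop repeatable.
crop_a = random_crop(clip, 4, random.Random(42))
crop_b = random_crop(clip, 4, random.Random(42))
```

With matching durations the random number generator is never consulted, which is why seeding is only a safeguard in this notebook.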

## Loading the model and dataset
### Loading weights
Microsoft has made the pre-trained weights for CLAP available for download on request [here](https://zenodo.org/record/7312125#.Y22vecvMIQ9).

In [15]:
weights_path = "/scratch/pratyaksh.g/clap/CLAP_weights_2022_microsoft.pth"
# DEVICE is a torch.device, so compare its type rather than the object itself to a string
clap_model = CLAPWrapper(weights_path, use_cuda=(DEVICE.type == "cuda"))

In [16]:
clap_model.clap.eval()

CLAP(
  (audio_encoder): AudioEncoder(
    (base): Cnn14(
      (spectrogram_extractor): Spectrogram(
        (stft): STFT(
          (conv_real): Conv1d(1, 513, kernel_size=(1024,), stride=(320,), bias=False)
          (conv_imag): Conv1d(1, 513, kernel_size=(1024,), stride=(320,), bias=False)
        )
      )
      (logmel_extractor): LogmelFilterBank()
      (bn0): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv_block1): ConvBlock(
        (conv1): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (conv_block2): ConvBlock(
        (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
 

### Loading dataset
We use the [ESC-50 dataset](https://github.com/karolpiczak/ESC-50), which consists of 2000 recordings across 50 classes of environmental sounds, each 5 seconds long.

In [17]:
dataset = ESC50(root="/scratch/pratyaksh.g/esc50/", download=False)

Using downloaded and verified file: /scratch/pratyaksh.g/esc50/ESC-50-master.zip


848it [00:00, 8472.47it/s]

Loading audio files


2000it [00:00, 9285.85it/s]


## Storing activations
The audio encoder consists of a CNN section, followed by a fully connected layer that brings the representation to a dimensionality of 2048, and then to 527 for the AudioSet classification task,
which was used to pre-train it. The projection module then has two fully connected layers, both of which have an output dimensionality of 1024.
The output of the final layer is the embedding of the audio in the joint space.

We collect the activations on these three fully connected linear layers:
1. In the audio encoder, with output dimension 2048, followed by a ReLU
2. In the projection module, with output dimension 1024, followed by a GELU
3. In the projection module again, with output dimension 1024 and no activation
> TODO: Add graph showing how outputs and inputs of the three layers are linked

### Hooks for activation storage
We use PyTorch's forward-hook mechanism to hook onto the modules of interest using their names (which are in `module_list`). We then register a function on each hook which
stores the output of that module. Note that the hook sees the output of the layer itself, before it has passed through the activation function. We store this raw output, and additionally apply the corresponding function from `module_activation_dict` to store the activated output.

In [18]:
module2name = dict((module, name) for name, module in clap_model.clap.named_modules())
raw_outputs = {}        # Without activation function
activated_outputs = {}  # With activation function

def save_activation(module, input, output):
    # Forward hooks pass the module object; the classification pseudo-layer passes its name directly
    name = module if isinstance(module, str) else module2name[module]

    # Output after applying the activation function that follows this layer
    activation = module_activation_dict[name](output).detach().squeeze().cpu().numpy()
    activated_outputs.setdefault(name, []).append(activation)

    # Output exactly as produced by the layer
    raw_output = output.detach().squeeze().cpu().numpy()
    raw_outputs.setdefault(name, []).append(raw_output)

for name, module in clap_model.clap.named_modules():
    if name in module_list:
        module.register_forward_hook(save_activation)
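As a minimal sketch of the hook mechanism on its own, the example below registers a forward hook on a toy model. The `toy` network and the `capture` function are hypothetical and unrelated to CLAP; the point is only that the hook receives the linear layer's pre-activation output, and that applying the activation by hand reproduces what the next module computes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-in for "linear layer followed by an activation"
toy = nn.Sequential(nn.Linear(4, 3), nn.ReLU())
toy.eval()

captured = {}

def capture(module, inputs, output):
    # A forward hook receives the hooked module's own output (pre-activation here)
    captured["raw"] = output.detach().clone()

handle = toy[0].register_forward_hook(capture)
with torch.no_grad():
    final = toy(torch.randn(2, 4))
handle.remove()  # remove the hook so later forward passes are not recorded

# Applying ReLU to the captured raw output matches the model's final output
activated = F.relu(captured["raw"])
```

Keeping the handle returned by `register_forward_hook` and calling `remove()` is one way to avoid accumulating outputs from later, unrelated forward passes.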


### Classification on ESC-50
Due to the nature of CLAP, zero-shot classification can be done on the dataset. This is done by converting the class names to text prompts, and computing the text embeddings
for these prompts. Since these text embeddings exist in the same joint space as the audio embeddings, the logits for each class against an audio then just become the similarity
of the text embedding corresponding to that class to the audio embedding under consideration.

We also explicitly store these (after a softmax) as the outputs/activations of the 'classification_layer', which has layer index 3 among the stored FC layers. Note that the class indices of the 'classification_layer' outputs follow the order of
`dataset.classes`, i.e. the order in which the prompts were embedded, rather than ESC50's own class numbering, so we need to keep that in mind.

In [19]:
# Computing text embeddings
prompt = 'this is a sound of '
y = [prompt + x for x in dataset.classes]
text_embeddings = clap_model.get_text_embeddings(y)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.




In [20]:
# Computing audio embeddings
y_preds, y_labels = [], []
for sample_idx in tqdm(range(len(dataset))):
    x, _, one_hot_target = dataset[sample_idx]
    audio_embeddings = clap_model.get_audio_embeddings([x], resample=True)
    similarity = clap_model.compute_similarity(audio_embeddings, text_embeddings)
    probs = F.softmax(similarity.detach(), dim=1)
    # The pseudo-layer is not a real module, so we call the hook function ourselves, by name
    save_activation('classification_layer', similarity, probs)
    y_preds.append(probs.cpu().numpy())
    y_labels.append(one_hot_target.detach().cpu().numpy())

clear_output()
print("Done!")
print("Computed audio embeddings for {} samples".format(len(dataset)))

  0%|          | 0/2000 [00:00<?, ?it/s]

In [None]:
y_labels, y_preds = np.concatenate(y_labels, axis=0), np.concatenate(y_preds, axis=0)
acc = accuracy_score(np.argmax(y_labels, axis=1), np.argmax(y_preds, axis=1))
print('ESC50 Accuracy {}'.format(acc))

ESC50 Accuracy 0.827
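The zero-shot procedure used here can be sketched in plain Python on toy data. The class names and 2-D embeddings below are made up, and plain cosine similarity is an assumption: CLAP's `compute_similarity` may apply additional scaling that is not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical embeddings in a shared 2-D "joint space"
class_names = ["dog", "rain", "siren"]
text_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
audio_embedding = [0.9, 0.1]

# One logit per class: similarity of the audio to that class's text prompt
logits = [cosine(audio_embedding, t) for t in text_embeddings]
probs = softmax(logits)
predicted = class_names[probs.index(max(probs))]
```

The index of the largest probability refers to the position in the list of prompts, which is why the prediction indices follow the prompt order.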


### Saving values to `csv` files
The activation values recorded from running the ESC-50 dataset through the CLAP model now need to be prepared for further analysis.
Storing them in a pandas dataframe is convenient, since it lets us group them arbitrarily. We also write the dataframe
to csv, so the analysis can be performed without having to run the entire model every time.

In [None]:
# Format the dataframe properly
records = []
# This is only for the FC layers
start_idx = module_list.index('audio_encoder.base.fc1')
for layer_idx, layer_name in enumerate(tqdm(module_list[start_idx:])):
    # Unpack each item directly rather than fetching it a second time by index
    for patch_idx, (path, class_name, _) in enumerate(tqdm(dataset)):
        patch_name = path.split('/')[-1].split('.')[0]
        class_idx = patch_name.split('-')[-1]
        ndim = activated_outputs[layer_name][patch_idx].shape[0]

        current_record = pd.DataFrame({
            'layer_idx': [layer_idx] * ndim,
            'layer_name': [layer_name] * ndim,
            'neuron_idx': list(range(ndim)),
            'patch_idx': [patch_idx] * ndim,
            'patch_name': [patch_name] * ndim,
            'raw_output': raw_outputs[layer_name][patch_idx],
            'activation': activated_outputs[layer_name][patch_idx],
            'class_idx': [class_idx] * ndim,
            'class_name': [class_name] * ndim,
        })

        records.append(current_record)

act_values = pd.concat(records, ignore_index=True)
act_values.to_csv('/home2/pratyaksh.g/MS-CLAP/data/fc-activations.csv')

clear_output()
print("Done!")
print("Stored for {} layers and {} patches".format(len(module_list[start_idx:]), len(dataset)))
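As a small illustration of why the long format (one row per layer, neuron and patch) is convenient, the sketch below round-trips a toy dataframe through CSV and groups it by neuron. The column values are made up, and writing with `index=False` is a choice made for this example so that no extra index column appears when reading the file back.

```python
import io
import pandas as pd

# Toy long-format records: one row per (neuron, patch) pair
records = pd.DataFrame({
    "layer_name": ["fc1", "fc1", "fc1", "fc1"],
    "neuron_idx": [0, 1, 0, 1],
    "patch_idx": [0, 0, 1, 1],
    "activation": [0.0, 0.5, 0.0, 1.5],
})

# Round-trip through CSV text (an in-memory buffer stands in for a file on disk)
buffer = io.StringIO()
records.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Grouping by neuron gives per-neuron statistics across all patches
mean_per_neuron = restored.groupby("neuron_idx")["activation"].mean()
```

The same `groupby` pattern works for any of the stored columns, for example grouping by `class_name` to compare activations across classes.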

We do the same for the CNN activations, but since each conv block outputs a 2D feature map for every channel, the number of values per layer is far too large for the per-neuron dataframe format.
Instead, we simply save the raw output of each layer for each patch as a single file.

In [None]:
import os

# This is only for the CNN blocks
end_idx = module_list.index('audio_encoder.base.fc1')
for layer_idx, layer_name in enumerate(tqdm(module_list[:end_idx])):
    layer_dir = f'/scratch/pratyaksh.g/clap/data/conv-activations/{layer_name}'
    os.makedirs(layer_dir, exist_ok=True)  # torch.save does not create missing directories
    for patch_idx in tqdm(range(len(dataset))):
        # raw_outputs holds NumPy arrays; torch.save pickles them as they are
        torch.save(raw_outputs[layer_name][patch_idx], f'{layer_dir}/act-{patch_idx}.pt')

clear_output()
print("Done!")
print("Stored for {} layers and {} patches".format(len(module_list[:end_idx]), len(dataset)))

## Computing zero neurons
Neurons whose output after activation is zero for **all** datapoints in the dataset are what we call 'zero' or 'null' neurons. With respect to this
specific dataset, they do not contribute any information to the final classification (or do they?).

We compute the number of zero neurons in each layer.

In [13]:
all_neurons = []
zero_neurons = []
for layer_idx, layer_name in enumerate(tqdm(module_list[start_idx:])):
    # Compute the number of records in the dataframe for this layer that have a unique neuronal index
    num_neurons = len(act_values[act_values['layer_name'] == layer_name]['neuron_idx'].unique())
    all_neurons.append(num_neurons)

    # A neuron is 'zero' if its activation is exactly zero for every patch.
    # Summing absolute values keeps positive and negative activations from cancelling out.
    layer_acts = act_values[act_values['layer_name'] == layer_name]
    num_zero_neurons = int((layer_acts['activation'].abs().groupby(layer_acts['neuron_idx']).sum() == 0).sum())
    zero_neurons.append(num_zero_neurons)
    
clear_output()
print("Done!")
# Avoid shadowing the built-in `all`
for layer_idx, (num_all, num_zero) in enumerate(zip(all_neurons, zero_neurons)):
    print("Layer {} has {} neurons, of which {} are zero".format(layer_idx, num_all, num_zero))

Done!
Layer 0 has 2048 neurons, of which 525 are zero
Layer 1 has 1024 neurons, of which 0 are zero
Layer 2 has 1024 neurons, of which 0 are zero
Layer 3 has 50 neurons, of which 0 are zero
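The definition of a zero neuron can also be checked directly on an activation matrix. The sketch below uses a made-up NumPy array with patches as rows and neurons as columns; it is an illustration of the definition, not the dataframe-based computation used in this notebook.

```python
import numpy as np

# Rows are patches, columns are neurons (toy values)
activations = np.array([
    [0.0, 1.0, -2.0],
    [0.0, 0.0,  2.0],
    [0.0, 3.0,  0.0],
])

# A zero neuron is exactly zero for every patch
is_zero = np.all(activations == 0, axis=0)
zero_indices = np.flatnonzero(is_zero)

# A zero column sum alone is not sufficient once activations can be negative:
# the last neuron sums to zero but is clearly active.
column_sums = activations.sum(axis=0)
```

For layers followed by a ReLU the activations are non-negative, so a zero sum and an all-zero column coincide; the distinction only matters for layers whose stored activations can be negative.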


## Comparing against old activations csv
Luckily, the old data is still valid: the activations computed after seeding match the ones stored previously.

In [15]:
old_activations = pd.read_csv('/home2/pratyaksh.g/MS-CLAP/src/activations.csv')
new_activations = pd.read_csv('/home2/pratyaksh.g/MS-CLAP/data/activations.csv')

In [17]:
# Check if the new activations are the same as the old ones
for layer_idx, layer_name in enumerate(tqdm(module_list[start_idx:])):
    old_layer_activations = old_activations[old_activations['layer_name'] == layer_name]
    new_layer_activations = new_activations[new_activations['layer_name'] == layer_name]
    assert np.allclose(old_layer_activations['activation'].values, new_layer_activations['activation'].values)
    assert np.allclose(old_layer_activations['raw_output'].values, new_layer_activations['raw_output'].values)

  0%|          | 0/4 [00:00<?, ?it/s]