# CS109b Final Project - Milestone 5 Notebook B
### Mads Groeholdt, Sean McCabe, Josie Mobley, Bridget Sands

## About:
This notebook is the CONTINUED development of our final project for CS109b, in which we attempt to classify a cat's emotion based on its meow (more on this in the problem statement).

Unfortunately, because of the complexity of the code, **we had to split the notebook into two, in addition to an appendix notebook that contains early EDA and our baseline model.** 

Therefore, this notebook is the **SECOND of the two main notebooks,** containing the second of the two advanced approaches we derived. 

### Note:
- For problem discussion, data description, and implementation of our first advanced method, please see **Notebook A**.
- For early EDA and our baseline model, please see the **Appendix Notebook**.


The table of contents, which outlines the entire notebook, is found below:

## Table of Contents:
1. [Data Recap from Notebook A](#recap) 
2. [Transfer Learning](#tl)
    1. [Idea, Intro](#intro)
    2. [Code](#code)
    3. [Augmentation Extension](#aug)
    4. [Results Comment](#res)
3. [Overarching Results and Project Conclusions](#o_res)

In [1]:
import pandas as pd
import numpy as np
import re
import torchaudio
from sklearn.model_selection import train_test_split
import librosa
import os
import torch
import IPython.display as ipd
from transformers import AutoConfig, Wav2Vec2Processor, EvalPrediction
from sklearn.metrics import classification_report
import torch.nn.functional as F
from audiomentations import Compose, AddGaussianNoise, PitchShift, TimeStretch, Shift, Gain

2024-05-08 14:47:45.650487: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-05-08 14:47:45.695549: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-08 14:47:45.695585: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-08 14:47:45.696738: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-05-08 14:47:45.704097: I tensorflow/core/platform/cpu_feature_guar

## 1: Data Recap from Notebook A <a class="anchor" id="recap"></a>
As we know from Notebook A, the data for this project are audio files that the original authors of the base paper gave us access to. They obtained the files by scraping publicly accessible sources, such as YouTube.

We also know that we read in these audio files using the librosa or torchaudio libraries, which convert the raw audio into a numeric array representation. The files can also be listened to as audio, visualized graphically in their numeric array form, or translated into a mel spectrogram.

There are some specifics regarding data filtering and related choices; however, those were all covered in **Notebook A**. The code below is therefore essentially repeated from Notebook A, but it is needed for importing the data in this notebook as well. We perform all of the same steps.

### Important:
Please note that if you already ran the data preparation in the first notebook, it makes more sense to simply import the CSV files rather than re-run all of this code. In fact, if you run it after having run the code from Notebook A, without first deleting the newly created augmented files, it will break and run improperly. **Therefore the code is commented out.**
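
To make this failure mode concrete, here is a minimal sketch of the file-name filter used in the loading cell, applied to a few example paths. The paths are illustrative assumptions patterned on file names that appear in this notebook's output. The sketch suggests one likely reason a re-run misbehaves: augmented files still end in `.mp3`, so the filter keeps them.

```python
import re

def is_raw_mp3(file_name):
    # Same rule as in the data-loading cell: keep .mp3 files unless
    # they end in the "_aug1(1).mp3" suffix.
    return bool(re.search(r'\.mp3$', file_name)
                and not re.search(r'_aug1\(1\)\.mp3$', file_name))

# Illustrative paths (assumed for this example, not read from disk)
example_paths = [
    "data/Happy/cat_youtube01203.mp3",
    "data/Happy/desktop.ini",
    "data/Happy/cat_youtube01203_aug1(1).mp3",
    "data/augmented/cat_youtube01203.mp3_augmented0.mp3",
]
kept = [p for p in example_paths if is_raw_mp3(p)]
print(kept)
```

The last path passes the filter, so a walk over `data/` after augmentation would pick up the augmented clips (with `augmented` as their class label) unless they are deleted first.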

In [2]:
# Read in the data
data = []
def is_raw_mp3(file_name):
    return bool(re.search(r'\.mp3$', file_name) and not re.search(r'_aug1\(1\)\.mp3$', file_name))
for subdir, dirs, files in os.walk("data/"):
    for file in files:
        filepath = os.path.join(subdir, file)
        class_label = os.path.basename(subdir)
        name = file
        
        try:
            s = torchaudio.load(filepath)
            if is_raw_mp3(filepath):
                data.append({
                    "name": name,
                    "path": filepath,
                    "emotion": class_label
                })
        except Exception as e:
            print(str(filepath), e)
            pass

# Inspect data
df = pd.DataFrame(data)
df.head()
print(f'The dataframe has {len(df.index)} total samples')

# Filter problematic data: keep only two-channel clips (drop 1d arrays instead of 2d)
sr = 16000
def speech_to_array_fn(path):
    speech_array, sampling_rate = torchaudio.load(path)
    resampler = torchaudio.transforms.Resample(sampling_rate, sr)
    speech = resampler(speech_array).squeeze().numpy()
    return speech

drop_indices = []
for index, row in df.iterrows():
    audio_arr = speech_to_array_fn(row['path'])

    if audio_arr.shape[0] != 2:
        drop_indices.append(index)

df = df.drop(drop_indices)
print(f'The trimmed dataframe has {len(df.index)} total samples')

# Train test split:
save_path = "data/"
train_df, test_df = train_test_split(df, test_size=0.2, random_state=101, stratify=df["emotion"])
train_df, validation_df = train_test_split(train_df, test_size=0.2, random_state=101, stratify=train_df["emotion"])
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)
validation_df = validation_df.reset_index(drop=True)
train_df.to_csv(f"{save_path}/train.csv", sep="\t", encoding="utf-8", index=False)
test_df.to_csv(f"{save_path}/test.csv", sep="\t", encoding="utf-8", index=False)
validation_df.to_csv(f"{save_path}/validation.csv", sep="\t", encoding="utf-8", index=False)
print(f'Train data size: {train_df.shape}')
print(f'Test data size: {test_df.shape}')
print(f'Validation data size: {validation_df.shape}')

# Create data augmenter:
augmenter = Compose([
    # speed change between 0.9 and 1
    TimeStretch(min_rate=0.9, max_rate=1.0, p=0.5),
    # pitch shift between -4 and 4
    PitchShift(min_semitones=-4, max_semitones=4, p=0.5),
    # gaussian noise
    AddGaussianNoise(min_amplitude=0.1, max_amplitude=0.5, p=0.5),
    # time shift within 20% of original clip
    Shift(min_shift=-0.2, max_shift=0.2, shift_unit='fraction', p=0.5),

    # gain change between roughly -0.4 dB and +3.0 dB
    Gain(min_gain_db=-10 * np.log10(1.1), max_gain_db=-10 * np.log10(0.5), p=0.5)
])

data/Happy/desktop.ini Error opening 'data/Happy/desktop.ini': Format not recognised.


Note: Illegal Audio-MPEG-Header 0xbf082800 at offset 7536.
Note: Trying to resync...
Note: Hit end of (available) data during resync.


The dataframe has 2961 total samples


Note: Illegal Audio-MPEG-Header 0xbf082800 at offset 7536.
Note: Trying to resync...
Note: Hit end of (available) data during resync.


The trimmed dataframe has 2942 total samples
Train data size: (1882, 3)
Test data size: (589, 3)
Validation data size: (471, 3)
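
The split sizes printed above can be reproduced by hand. The sketch below assumes that the held-out fraction is rounded up to a whole number of samples (our understanding of how `train_test_split` treats a float `test_size`) and that the remainder goes to the training side; `split_sizes` is a helper name introduced only for this illustration.

```python
import math

def split_sizes(n_samples, test_fraction):
    # Assumption: the held-out size is rounded up; the rest is kept for training
    n_test = math.ceil(test_fraction * n_samples)
    return n_samples - n_test, n_test

n_total = 2942                                  # trimmed dataframe size from the output above
n_train_val, n_test = split_sizes(n_total, 0.2)
n_train, n_val = split_sizes(n_train_val, 0.2)
print(n_train, n_val, n_test)
```

Applying the 20% hold-out twice therefore leaves about 64% of the data for training, 16% for validation, and 20% for testing.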


In [3]:
%%time
# Call augmenter on data
save_augmented_path = "data/augmented/"
# Ensure the directory exists
os.makedirs(save_augmented_path, exist_ok=True)
augmented_data = []
sample_rate = sr
n_aug = 1
for index, row in train_df.iterrows():
    for i in range(n_aug):
        arr = speech_to_array_fn(row['path'])
        aug_arr = augmenter(samples = arr, sample_rate = sample_rate)

        # Convert the augmented numpy array back to tensor for saving
        augmented_audio_tensor = torch.tensor(aug_arr)

        # Construct a new file path
        augmented_file_path = os.path.join(save_augmented_path, f"{row['name']}_augmented{i}.mp3")
        if index%396 == 0:
            print(augmented_file_path)
        # Save the augmented audio file
        torchaudio.save(augmented_file_path, augmented_audio_tensor, sample_rate)

        # Add to augmented data list
        augmented_data.append({
            'name': f"{row['name']}_augmented{i}",
            'path': augmented_file_path,
            'emotion': row['emotion']
        })

# Save augmented data
aug_df = pd.DataFrame(augmented_data)
print(f'{len(aug_df)} augmented sounds')
aug_df.head()
train_aug_df = pd.concat([train_df, aug_df])
train_aug_df.to_csv(f"data/train_aug.csv", sep="\t", encoding="utf-8", index=False)
print(f'Augmented train data size: {train_aug_df.shape}')

data/augmented/Edit9113Grl.mp3_augmented0.mp3
data/augmented/LastEntry_cat1046Hiss.mp3_augmented0.mp3
data/augmented/cat_youtube01203.mp3_augmented0.mp3
data/augmented/LastEntry_cat1201Fit.mp3_augmented0.mp3
data/augmented/cat_youtube01265.mp3_augmented0.mp3




1882 augmented sounds
Augmented train data size: (3764, 3)
CPU times: user 3min 41s, sys: 1.28 s, total: 3min 42s
Wall time: 3min 58s


## 2: Transfer Learning <a class="anchor" id="tl"></a>
In the following section, we introduce, implement, and discuss the last of our three overarching methods: transfer learning. (The first was the baseline model defined and derived in the Appendix Notebook; the second was the remake of the paper's CNN structure outlined in Notebook A.)

### 2a: Idea, Intro <a class="anchor" id="intro"></a>
Our second approach, remaking the CNN from the paper, drastically improved upon our initial baseline model, which struggled to predict even as well as random guessing. However, upon completing that model, we were still curious about how we could extend and improve on this project. Again, we expanded upon an idea first proposed in the paper: transfer learning.

In the paper, the authors first use a CNN trained on the Million Song Dataset and then further train a convolutional restricted Boltzmann machine (CRBM) based model on the cat audio in order to extract features. As in their CNN-only methodology, they then fit classification algorithms on those features to classify the emotions.

In our own adaptation of transfer learning, rather than using the CNN trained on the Million Song Dataset, we make use of what we learned about HuggingFace in class and use a Facebook wav2vec 2.0 base model. We also deviate from the paper's authors in that we make predictions directly within our pipeline rather than using the framework for feature extraction.

### 2b: Code <a class="anchor" id="code"></a>
The following is the code to implement this method.

The model requires slightly different preprocessing from the previous models, namely loading the data into HuggingFace Dataset objects, so we begin with this. Some aspects are similar, however, including ensuring that all files have the same length by truncating or padding them as needed.


### Note:
If the preprocessing/augmentation was done in this notebook, uncomment and run the first cell. However, if the augmentation was instead run from Notebook A, **as we recommend**, the second cell is better suited.

In [4]:
# Change dataframes to Dataset objects
# Import function necessary
from datasets import Dataset
train_dataset = Dataset.from_pandas(train_df)
train_aug_dataset = Dataset.from_pandas(train_aug_df)
eval_dataset = Dataset.from_pandas(validation_df)
test_dataset = Dataset.from_pandas(test_df)

In [5]:
from datasets import load_dataset

data_files = {
    "train": "data/train.csv",
    "train_aug": "data/train_aug.csv",
    "validation": "data/validation.csv",
    "test": "data/test.csv",
}

dataset = load_dataset("csv", data_files=data_files, delimiter="\t", )
train_dataset = dataset["train"]
train_aug_dataset = dataset["train_aug"]
eval_dataset = dataset["validation"]
test_dataset = dataset["test"]

Generating train split: 0 examples [00:00, ? examples/s]

Generating train_aug split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

In [6]:
# Print datasets to visualize
print(train_dataset)
print(eval_dataset)
print(train_aug_dataset)
print(test_dataset)

Dataset({
    features: ['name', 'path', 'emotion'],
    num_rows: 1882
})
Dataset({
    features: ['name', 'path', 'emotion'],
    num_rows: 471
})
Dataset({
    features: ['name', 'path', 'emotion'],
    num_rows: 3764
})
Dataset({
    features: ['name', 'path', 'emotion'],
    num_rows: 589
})


In [7]:
input_column = "path"
output_column = "emotion"

label_list = train_dataset.unique(output_column)
label_list.sort()
num_labels = len(label_list)
print(f"Our data has {num_labels} classes: {label_list}")



#### Model Configuration
Now that we have our data specified and loaded, we can begin specifying our base model and the parameters for fine-tuning. We have decided to use the wav2vec 2.0 model, initially released by Facebook AI researchers. It is a large pre-trained model for automatic speech recognition that, although initially trained on human speech, has been used by other HuggingFace users for animal sound classification. The following cells import the model and define a set of configurations for it.

In [8]:
model_name_or_path = "facebook/wav2vec2-base-960h"
pooling_mode = "mean"

In [9]:
config = AutoConfig.from_pretrained(
    model_name_or_path,
    num_labels=10,
    label2id={label: i for i, label in enumerate(label_list)},
    id2label={i: label for i, label in enumerate(label_list)},
    finetuning_task="wav2vec2_clf",
)
setattr(config, 'pooling_mode', pooling_mode)
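
As a small illustration of the `label2id` and `id2label` dictionaries built above, the sketch below constructs them in isolation. It assumes the ten class names that appear in this notebook's prediction tables; because the list is sorted alphabetically, each class receives a stable integer id.

```python
# Class names as they appear in this notebook's prediction tables
label_list = sorted([
    "Warning", "Resting", "Paining", "MotherCall", "Mating",
    "HuntingMind", "Happy", "Fighting", "Defence", "Angry",
])

# Map each class name to an integer id and back
label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for label, i in label2id.items()}
print(label2id["Warning"], id2label[0])
```

Under this ordering, `Warning` is class 9, which matches how the class is referred to in the training comments later in the notebook.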

In [10]:
processor = Wav2Vec2Processor.from_pretrained(model_name_or_path, num_labels=10)
target_sampling_rate = processor.feature_extractor.sampling_rate
print(f"The target sampling rate: {target_sampling_rate}")

The target sampling rate: 16000


#### Padding
To be able to feed our data to the model, we must pad all clips to equal length. As seen in earlier preprocessing, there are some right-tail outliers that would make the padding unnecessarily long for every clip, so once again we set the maximum length for each audio file to 150,000 samples.

#### Translation, Numerical Encoding, and Padding
The code below translates each audio file into its numerical encoding, adds a numerically encoded label to each observation, and sends each sample through the pre-defined Wav2Vec2 processor.
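
Before the full preprocessing code, here is a minimal sketch of the pad-or-truncate idea on toy arrays. It is a simplified stand-in rather than the notebook's exact helper: it uses `np.pad` instead of stacking zero arrays, the function name `pad_or_truncate` is introduced for this illustration, and the toy maximum length of 8 replaces the real 150,000.

```python
import numpy as np

def pad_or_truncate(audio, max_length):
    """Center-pad with zeros, or truncate, a (channels, samples) array."""
    n_channels, cur_length = audio.shape
    if cur_length >= max_length:
        return audio[:, :max_length]          # keep the first max_length samples
    total_padding = max_length - cur_length
    front = total_padding // 2                # any odd sample goes to the back
    back = total_padding - front
    return np.pad(audio, ((0, 0), (front, back)), mode="constant")

short_clip = np.ones((2, 5))                  # toy stereo clip, 5 samples
long_clip = np.ones((2, 12))                  # toy stereo clip, 12 samples
padded = pad_or_truncate(short_clip, 8)
truncated = pad_or_truncate(long_clip, 8)
print(padded.shape, truncated.shape)
print(padded[0])
```

Short clips end up centered in a block of zeros, while long clips lose everything after the maximum length.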

In [11]:
# Define the maximum file length (in samples)
MAX_LENGTH = 150000

# Helper function to read in data
def speech_file_to_array_fn(path):
    speech_array, sampling_rate = torchaudio.load(os.getcwd() + "/"+path)
    resampler = torchaudio.transforms.Resample(sampling_rate, target_sampling_rate)
    speech = resampler(speech_array).squeeze().numpy()
    return speech

# Helper function to id data
def label_to_id(label, label_list):
    return config.label2id[label]
    #label_list.index(label)

# Helper function to pad 
def padding(audio_lst):
    padded = []
    for file in audio_lst:
        cur_length = file.shape[1]
        if cur_length == MAX_LENGTH:
            padded.append(file)
            continue
        total_padding = MAX_LENGTH - cur_length
        if total_padding < 0:
            # truncate if file is too long
            padded.append(file[:,:MAX_LENGTH])
            continue
        # handling parity
        front_padding_len = int(total_padding / 2)
        after_padding_len = int(total_padding - front_padding_len)

        pad_before = np.zeros((2, front_padding_len))
        pad_after = np.zeros((2, after_padding_len))
        padded_signal = np.hstack([pad_before, file, pad_after])
        padded.append(padded_signal)
    return padded

# Helper function that combines all preprocessing
def preprocess_function(examples):
    speech_list = [speech_file_to_array_fn(path) for path in examples[input_column]]
    target_list = [label_to_id(label, label_list) for label in examples[output_column]]

    speech_list = padding(speech_list)
    
    reshaped_list = [x.reshape(-1) for x in speech_list]

    result = processor(reshaped_list, sampling_rate=target_sampling_rate, padding=True)
    result["labels"] = list(target_list)
    return result

In [12]:
# Create augmented training dataset
train_aug_dataset = train_aug_dataset.map(
    preprocess_function,
    batch_size=16,
    batched=True
)

# Create training dataset
train_dataset = train_dataset.map(
    preprocess_function,
    batch_size=16,
    batched=True
)

# Create evaluation dataset
eval_dataset = eval_dataset.map(
    preprocess_function,
    batch_size=16,
    batched=True
)

print(f"Training input_values shape: {np.array(train_dataset[0]['input_values']).shape}")
print(f"Augmented training input_values shape: {np.array(train_aug_dataset[0]['input_values']).shape}")
print(f"Training labels: {train_dataset[0]['labels']} - {train_dataset[0]['emotion']}")

Map:   0%|          | 0/3764 [00:00<?, ? examples/s]



Map:   0%|          | 0/1882 [00:00<?, ? examples/s]



Map:   0%|          | 0/471 [00:00<?, ? examples/s]

Training input_values shape: (300000,)
Augmented training input_values shape: (300000,)
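
The reported shape of 300,000 follows from flattening a two-channel clip of 150,000 samples per channel. The sketch below illustrates this with a zero-filled stand-in array (not real audio) and a tiny example showing the order in which `reshape(-1)` lays out the channels.

```python
import numpy as np

MAX_LENGTH = 150000
stereo_clip = np.zeros((2, MAX_LENGTH))       # stand-in for one padded two-channel clip
flat = stereo_clip.reshape(-1)                # channel 0 followed by channel 1
print(flat.shape)

# Small example showing the ordering after flattening
tiny = np.array([[1, 2, 3],
                 [4, 5, 6]])
print(tiny.reshape(-1))
```

In other words, the model receives the first channel followed by the second channel as one long sequence, rather than an average of the two.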


#### GPU Check-In
Before model-training, we add a quick manual check to ensure the GPU has sufficient memory available to fine-tune the model.

In [13]:
print(torch.cuda.memory_summary(device=None, abbreviated=False))

|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |      0 B   |      0 B   |      0 B   |      0 B   |
|       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
|       from small pool |      0 B   |      0 B   |      0 B   |      0 B   |
|---------------------------------------------------------------------------|
| Active memory         |      0 B   |      0 B   |      0 B   |      0 B   |
|       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
|       from small pool |      0 B   |      0 B   |      0 B   |      0 B   |
|---------------------------------------------------------------

#### GPU Check-In Interpretation:
As shown above, the summary reports zeros across the board, meaning PyTorch has not yet allocated any GPU memory, which is a good sign. We should therefore have ample space to train and run our models.

#### Intra-Training Model Metrics
In addition to the standard loss reporting for the fine-tuning, we would like the model to also report the accuracy after each completed epoch. To enable this, we build the below custom function to pass to the Trainer so we can keep track of the accuracy throughout training.

In [14]:
def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.argmax(preds, axis=1)
    
    return {"accuracy": (preds == p.label_ids).astype(np.float32).mean().item()}
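
To see what this metric computes, here is a minimal NumPy-only sketch of the same argmax-and-compare logic on made-up logits. The numbers are toy values chosen for illustration, and no HuggingFace objects are involved.

```python
import numpy as np

# Toy logits for 4 samples and 3 classes (made-up numbers)
logits = np.array([
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [0.3, 0.2, 0.9],
    [1.2, 0.4, 0.8],
])
label_ids = np.array([0, 1, 2, 2])

preds = np.argmax(logits, axis=1)             # predicted class per sample
accuracy = (preds == label_ids).astype(np.float32).mean().item()
print(preds, accuracy)
```

Three of the four toy samples are classified correctly, so the reported accuracy is the fraction of matching predictions.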

#### Model Training
Now we are ready to train the model. We import the model from the HuggingFace library and then freeze its convolutional layers so that the weights of the "general" layers of the base model do not change too much. Our initial 3 training passes, as reported in the training history below, left all of the model's weights trainable, which led to worse performance than the frozen version across a number of learning rates and epochs.

We output the weights of the model to a directory, "Hubert-results", for reproducibility and so that we can recover a previous best performance as we attempt different model specifications.

In [15]:
from transformers import (Wav2Vec2ForSequenceClassification,
                          TrainingArguments, Trainer)

model_id = "facebook/wav2vec2-base-960h"
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id, num_labels=10)

#Freeze the convolutional layers of the model
model.freeze_feature_extractor()

training_args = TrainingArguments(
    output_dir="Hubert-results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    num_train_epochs=3
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    compute_metrics=compute_metrics,
    eval_dataset=eval_dataset,
    tokenizer=processor.feature_extractor
)

trainer.train()

Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['classifier.bias', 'classifier.weight', 'projector.bias', 'projector.weight', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
Detected kernel version 4.14.336, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | No log        | 1.808734        | 0.309979 |
| 2     | 2.109200      | 1.436614        | 0.562633 |
| 3     | 1.696600      | 1.274571        | 0.728238 |


Checkpoint destination directory Hubert-results/checkpoint-471 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory Hubert-results/checkpoint-942 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory Hubert-results/checkpoint-1413 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=1413, training_loss=1.7743613640636058, metrics={'train_runtime': 2108.3843, 'train_samples_per_second': 2.678, 'train_steps_per_second': 0.67, 'total_flos': 9.611076425688e+17, 'train_loss': 1.7743613640636058, 'epoch': 3.0})
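
The step counts in the training output can be checked with a little arithmetic. The sketch below assumes a single device (so the effective batch size equals `per_device_train_batch_size`) and that the last, partial batch of each epoch still counts as a step.

```python
import math

n_train = 1882              # training samples
batch_size = 4              # per_device_train_batch_size (one device assumed)
n_epochs = 3
train_runtime = 2108.3843   # seconds, from the TrainOutput above

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * n_epochs
samples_per_second = n_train * n_epochs / train_runtime
print(steps_per_epoch, total_steps, round(samples_per_second, 3))
```

These values line up with the checkpoint names (471, 942, 1413), the reported `global_step`, and `train_samples_per_second`.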

#### Transfer Learning Training Comments
In fine-tuning our transfer learning model, we varied a number of hyperparameters to optimize the training process for the amount of data we have available. At first, we attempted to use a batch size of 16, but this overloaded the available GPU, which led us to decrease the batch size to 1. With this setup, we went through the following iterations and hyperparameter changes:

1. Learning rate = 5e-5. epochs = 3. training_batch_size = 1. The model predicted only class 9 (Warning) for all test samples.
 
Results:

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
|   1   |    2.312300   |     2.305967    |
|   2   |    2.313600   |     2.303125    |
|   3   |    2.305600   |     2.302968    |


2. per_device_train_batch_size = 4. Learning rate = 2e-5. epochs = 3. Results:

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 2.598000        | 2.094482      |
| 2     | 2.219000      | 1.896535        |
| 3     | 2.000000      | 1.813706        |

3. Same settings as above, but with the convolutional (feature-encoding) layers of the base model frozen during fine-tuning. Results:

| Epoch | Training Loss | Validation Loss | Validation Accuracy |
|-------|---------------|-----------------|----------|
| 1     | No log        | 1.702815        | 0.471338 |
| 2     | 2.070200      | 1.322905        | 0.649682 |
| 3     | 1.532900      | 1.173278        | 0.717622 |

4. Unfroze the convolutional (feature-encoding) layers. Learning rate = 2e-6. Results:

| Epoch | Training Loss | Validation Loss | Validation Accuracy |
|-------|---------------|-----------------|---------------------|
| 1     | 0.905300      | 0.935630        | 0.749469            |
| 2     | 0.884900      | 0.892472        | 0.762208            |
| 3     | 0.792500      | 0.886109        | 0.768578            |
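
A note on iteration 1: its losses hover around 2.30, which is what a 10-class classifier scores when it assigns equal probability to every class. The short calculation below, assuming the standard cross-entropy loss with the natural logarithm, makes this reference point explicit.

```python
import math

num_classes = 10
# Cross-entropy of a uniform prediction: -log(1 / num_classes)
uniform_loss = -math.log(1.0 / num_classes)
print(round(uniform_loss, 6))
```

A loss stuck near this value is consistent with a model that has not learned to separate the classes, which fits the observation that every test sample was assigned to a single class.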


#### Model Evaluation
We can now evaluate our model beyond just its accuracy and loss. We first check that our GPU is still available after training (it usually is). Following that, we define a predict function that generates predicted values for each of the samples in the dataset passed to it.

In [16]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")

Device: cuda


In [17]:
def predict(batch):
    features = processor(batch["input_values"], sampling_rate=processor.feature_extractor.sampling_rate, return_tensors="pt", padding=True)

    input_values = features.input_values.to(device)

    with torch.no_grad():
        logits = model(input_values).logits 

    pred_ids = torch.argmax(logits, dim=-1).detach().cpu().numpy()
    batch["predicted"] = pred_ids
    return batch


#### Classification Report
Now that we have predictions with both numerical and written labels, we can use sklearn's classification_report() function to generate a report of the accuracy of the model across our different classes, as we did previously with our CNN remake model.

We first call our `predict` function on the training and validation sets, in order to understand the fit and performance of the model.

#### Validation Results
As we can see below, the model performs relatively well on the validation set, with 73% overall accuracy. The per-class performance varies considerably: among the classes shown in the report, precision ranges from 0.53 (Happy) to 0.90 (HuntingMind), and recall ranges from 0.37 (Paining) to 0.98 (Resting). We know that this performance imbalance does not come from an imbalance in the number of samples. Rather, as can be heard in the audio samples, some of the noises have a "longer" sound profile than others, meaning that on average they need less padding, which likely leads to better results since there is more signal on which to classify the sound. We will investigate this later on by looking at the average padding length by category.
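
One way such a padding investigation could be set up is sketched below. The class names and clip lengths are entirely made up for illustration and do not reflect the real data; the only value taken from this notebook is the maximum length of 150,000 samples.

```python
from collections import defaultdict

MAX_LENGTH = 150000

# Hypothetical (class, clip length in samples) pairs, made up for illustration
clips = [
    ("class_a", 140000), ("class_a", 160000),
    ("class_b", 40000), ("class_b", 60000),
]

padding_by_class = defaultdict(list)
for emotion, n_samples in clips:
    # Clips longer than MAX_LENGTH are truncated, so they need no padding
    padding_by_class[emotion].append(max(MAX_LENGTH - n_samples, 0))

avg_padding = {emotion: sum(p) / len(p) for emotion, p in padding_by_class.items()}
print(avg_padding)
```

Comparing these per-class averages against per-class recall would indicate whether heavier padding tends to go together with weaker performance.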

In [18]:
result = eval_dataset.map(predict, batched=True, batch_size=4)
y_true = result['labels']
y_pred = result['predicted']
label_names = [config.id2label[i] for i in range(config.num_labels)]
label_names



Map:   0%|          | 0/471 [00:00<?, ? examples/s]

['Angry',
 'Defence',
 'Fighting',
 'Happy',
 'HuntingMind',
 'Mating',
 'MotherCall',
 'Paining',
 'Resting',

In [19]:
print(classification_report(y_true, y_pred, target_names=label_names))

              precision    recall  f1-score   support

       Angry       0.68      0.75      0.71        48
     Defence       0.86      0.93      0.90        46
    Fighting       0.59      0.54      0.57        48
       Happy       0.53      0.83      0.64        47
 HuntingMind       0.90      0.78      0.84        46
      Mating       0.82      0.58      0.68        48
  MotherCall       0.83      0.83      0.83        47
     Paining       0.63      0.37      0.47        46
     Resting       0.85      0.98      0.91        47

    accuracy                           0.73       471
   macro avg       0.74      0.73      0.72       471
weighted avg       0.74      0.73      0.72       471
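
As a reminder of how the f1-score column relates to the other two, the sketch below recomputes it as the harmonic mean of precision and recall for two rows of the report above. Because the report's inputs are already rounded to two decimals, small discrepancies are possible for other rows.

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Precision and recall taken from the validation report above
f1_angry = round(f1_score(0.68, 0.75), 2)
f1_paining = round(f1_score(0.63, 0.37), 2)
print(f1_angry, f1_paining)
```

The harmonic mean is pulled toward the smaller of the two values, which is why Paining's low recall drags its f1-score well below its precision.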


#### Train Results
To understand the nature of our model's performance, it is useful to look at its accuracy on the train set as well, to check whether the model is overfitting. As we can see, the model performs similarly on the train set as on both the test and validation sets, suggesting that overfitting is not the issue. We believe we could improve model performance by increasing the number of epochs we run the fine-tuning for, but with the consideration of saving penguin lives we have decided not to increase the number of epochs by too much.

In [20]:
train_result = train_dataset.map(predict, batched=True, batch_size=4)
y_true = train_result['labels']
y_pred = train_result['predicted']
id2label={i: label for i, label in enumerate(label_list)}
label_names = [config.id2label[i] for i in range(config.num_labels)]
print(classification_report(y_true, y_pred, target_names=label_names))

Map:   0%|          | 0/1882 [00:00<?, ? examples/s]

              precision    recall  f1-score   support

       Angry       0.73      0.82      0.78       192
     Defence       0.89      0.98      0.93       186
    Fighting       0.69      0.65      0.67       190
       Happy       0.58      0.88      0.70       186
 HuntingMind       0.92      0.66      0.77       185
      Mating       0.83      0.64      0.72       193
  MotherCall       0.84      0.89      0.87       187
     Paining       0.49      0.28      0.35       184
     Resting       0.85      0.99      0.92       187

    accuracy                           0.75      1882
   macro avg       0.75      0.75      0.74      1882
weighted avg       0.75      0.75      0.74      1882


#### Test Data Evaluation
We finally bring our test data into play and run the same evaluation on it as we did on the validation set above. As we can see, the model performs very similarly on the test set and the validation set, with similar discrepancies in accuracy between the different categories.

In [21]:
# Map test dataset
test_dataset = test_dataset.map(
    preprocess_function,
    batch_size=16,
    batched=True
)

Map:   0%|          | 0/589 [00:00<?, ? examples/s]

In [22]:
test_result = test_dataset.map(predict, batched=True, batch_size=4)

Map:   0%|          | 0/589 [00:00<?, ? examples/s]

In [23]:
y_true = test_result['labels']
y_pred = test_result['predicted']

id2label={i: label for i, label in enumerate(label_list)}

label_names = [config.id2label[i] for i in range(config.num_labels)]

print(classification_report(y_true, y_pred, target_names=label_names))

              precision    recall  f1-score   support

       Angry       0.73      0.78      0.76        60
     Defence       0.88      1.00      0.94        58
    Fighting       0.57      0.52      0.54        60
       Happy       0.52      0.83      0.64        58
 HuntingMind       0.90      0.79      0.84        58
      Mating       0.69      0.55      0.61        60
  MotherCall       0.79      0.81      0.80        59
     Paining       0.53      0.32      0.40        57
     Resting       0.94      0.98      0.96        59

    accuracy                           0.71       589
   macro avg       0.72      0.71      0.71       589
weighted avg       0.72      0.71      0.71       589


#### Evaluation of Uncertainty
Another important element of evaluating the model's classification is understanding how certain it is of each prediction it makes. Below, we explore this by listing the probabilities for each class of each prediction.

In [24]:
sampling_rate = sr

def speech_file_to_array_fn(path, sampling_rate):
    speech_array, _sampling_rate = torchaudio.load(path)
    # Resample from the file's native rate to the requested target rate
    resampler = torchaudio.transforms.Resample(_sampling_rate, sampling_rate)
    speech = resampler(speech_array).squeeze().numpy()
    return speech


def predict(path, sampling_rate):
    speech = speech_file_to_array_fn(path, sampling_rate)
    features = processor(speech, sampling_rate=sampling_rate, return_tensors="pt", padding=True)

    input_values = features.input_values.to(device)

    with torch.no_grad():
        logits = model(input_values).logits

    scores = F.softmax(logits, dim=1).detach().cpu().numpy()[0]
    outputs = [{"Emotion": config.id2label[i], "Score": f"{round(score * 100, 3):.1f}%"} for i, score in enumerate(scores)]
    return outputs

STYLES = """
<style>
div.display_data {
    margin: 0 auto;
    max-width: 500px;
}
table.xxx {
    margin: 50px !important;
    float: right !important;
    clear: both !important;
}
table.xxx td {
    min-width: 300px !important;
    text-align: center !important;
}
</style>
""".strip()

def prediction(df_row):
    path, emotion = df_row["path"], df_row["emotion"]
    df = pd.DataFrame([{"Emotion": emotion, "Sentence": "    "}])
    setup = {
        'border': 2,
        'show_dimensions': True,
        'justify': 'center',
        'classes': 'xxx',
        'escape': False,
    }
    ipd.display(ipd.HTML(STYLES + df.to_html(**setup) + "<br />"))
    speech, sr = torchaudio.load(path)
    speech = speech[0].numpy().squeeze()
    speech = librosa.resample(np.asarray(speech), orig_sr=sr, target_sr=sampling_rate)
    ipd.display(ipd.Audio(data=np.asarray(speech), autoplay=True, rate=sampling_rate))

    outputs = predict(path, sampling_rate)
    r = pd.DataFrame(outputs)
    ipd.display(ipd.HTML(STYLES + r.to_html(**setup) + "<br />"))

In [25]:
test = pd.DataFrame(dataset['test'])
prediction(test.iloc[3])

Unnamed: 0,Emotion,Sentence
0,MotherCall,


Unnamed: 0,Emotion,Score
0,Angry,1.7%
1,Defence,13.5%
2,Fighting,4.8%
3,Happy,4.5%
4,HuntingMind,21.5%
5,Mating,4.8%
6,MotherCall,35.8%
7,Paining,3.9%
8,Resting,7.1%
9,Warning,2.4%


In [26]:
prediction(test.iloc[5])

Unnamed: 0,Emotion,Sentence
0,Resting,


Unnamed: 0,Emotion,Score
0,Angry,1.5%
1,Defence,3.2%
2,Fighting,1.5%
3,Happy,1.6%
4,HuntingMind,15.7%
5,Mating,1.7%
6,MotherCall,4.5%
7,Paining,1.4%
8,Resting,67.6%
9,Warning,1.5%


#### Interpretation: 
As the two examples above show, the model's certainty varies considerably between predictions: MotherCall received only a 36% probability in its softmax distribution, while Resting received 68% in its own. Although the model classified both clips correctly, the difference in certainty, and thereby in robustness, is substantial. This is in line with what we saw in the classification report above, where Resting has substantially higher precision, recall, and F1-score than MotherCall.
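
One way to put numbers on this difference is to summarize each softmax distribution by its top probability, its margin over the runner-up, and its normalized entropy. The sketch below is illustrative only: `summarize_confidence` is a hypothetical helper that is not part of the notebook, and the two dictionaries simply copy the percentages displayed above.

```python
import math

def summarize_confidence(scores):
    """Summarize a softmax distribution given as {label: probability}."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (top_label, top_p), (_, second_p) = ranked[0], ranked[1]
    # Shannon entropy, normalized so that 1.0 means a uniform distribution
    entropy = -sum(p * math.log(p) for p in scores.values() if p > 0)
    return {
        "predicted": top_label,
        "top_prob": top_p,
        "margin": top_p - second_p,
        "normalized_entropy": entropy / math.log(len(scores)),
    }

# Probabilities copied from the two displayed examples (test rows 3 and 5)
mother_call = {"Angry": 0.017, "Defence": 0.135, "Fighting": 0.048, "Happy": 0.045,
               "HuntingMind": 0.215, "Mating": 0.048, "MotherCall": 0.358,
               "Paining": 0.039, "Resting": 0.071, "Warning": 0.024}
resting = {"Angry": 0.015, "Defence": 0.032, "Fighting": 0.015, "Happy": 0.016,
           "HuntingMind": 0.157, "Mating": 0.017, "MotherCall": 0.045,
           "Paining": 0.014, "Resting": 0.676, "Warning": 0.015}

for name, dist in [("MotherCall clip", mother_call), ("Resting clip", resting)]:
    print(name, summarize_confidence(dist))
```

A smaller margin and a normalized entropy closer to 1 both indicate a flatter, less certain distribution, which is the pattern the MotherCall example shows relative to the Resting example.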

### 2c: Augmentation Extension <a class="anchor" id="aug"></a>
Now that the transfer learning approach has given us much better performance than our previous models, we train the same model on the augmented data to see whether performance improves further.

In [27]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_aug_dataset,
    compute_metrics=compute_metrics,
    eval_dataset=eval_dataset,
    tokenizer=processor.feature_extractor
)

trainer.train()



Epoch,Training Loss,Validation Loss,Accuracy
1,1.6944,0.931758,0.730361
2,1.2199,0.844043,0.768578
3,1.0072,0.65073,0.842887


TrainOutput(global_step=2823, training_loss=1.2709018708626683, metrics={'train_runtime': 3906.8821, 'train_samples_per_second': 2.89, 'train_steps_per_second': 0.723, 'total_flos': 1.9222152851376e+18, 'train_loss': 1.2709018708626683, 'epoch': 3.0})

#### Augmented Transfer Learning Training Comments
Although we experimented with hyperparameter tuning in the un-augmented case, it seemed prudent to experiment at least a little with different hyperparameters for the augmented case as well. We ran the following training procedures for the model:

1. Frozen convolutional layers. learning_rate = 2e-5. Results:

| Epoch | Training Loss | Validation Loss | Validation Accuracy |
|-------|---------------|-----------------|---------------------|
| 1     | 1.980300      | 1.758419        | 0.363057            |
| 2     | 1.574300      | 1.199370        | 0.622081            |
| 3     | 1.309900      | 0.915809        | 0.723992            |
| 4     | 1.231000      | 0.866098        | 0.747346            |
| 5     | 1.003800      | 0.792015        | 0.774947            |

2. Unfrozen convolutional layers. learning_rate = 2e-6. Results:

| Epoch | Training Loss | Validation Loss | Validation Accuracy |
|-------|---------------|-----------------|---------------------|
| 1     | 0.977600      | 0.782855        | 0.772824            |
| 2     | 0.921000      | 0.755599        | 0.781316            |
| 3     | 0.870900      | 0.761300        | 0.783440            |


In [29]:
# Have to redefine predict function after the second training pass
def predict(batch):
    features = processor(batch["input_values"], sampling_rate=processor.feature_extractor.sampling_rate, return_tensors="pt", padding=True)

    input_values = features.input_values.to(device)

    with torch.no_grad():
        logits = model(input_values).logits 

    pred_ids = torch.argmax(logits, dim=-1).detach().cpu().numpy()
    batch["predicted"] = pred_ids
    return batch

In [31]:
# Validation
result = eval_dataset.map(predict, batched=True, batch_size=4)
y_true = result['labels']
y_pred = result['predicted']
label_names = [config.id2label[i] for i in range(config.num_labels)]
print('Validation Dataset')
print(classification_report(y_true, y_pred, target_names=label_names))

# Testing
result = test_dataset.map(predict, batched=True, batch_size=4)
y_true = result['labels']
y_pred = result['predicted']
label_names = [config.id2label[i] for i in range(config.num_labels)]
print('Testing Dataset')
print(classification_report(y_true, y_pred, target_names=label_names))


Validation Dataset
              precision    recall  f1-score   support

       Angry       0.83      0.90      0.86        48
     Defence       0.98      0.91      0.94        46
    Fighting       0.89      0.81      0.85        48
       Happy       0.72      0.89      0.80        47
 HuntingMind       0.87      0.85      0.86        46
      Mating       0.84      0.77      0.80        48
  MotherCall       0.91      0.89      0.90        47
     Paining       0.76      0.61      0.67        46
     Resting       0.85      0.98      0.91        47

    accuracy                           0.84       471
   macro avg       0.85      0.84      0.84       471
weighted avg       0.85      0.84      0.84       471



Testing Dataset
              precision    recall  f1-score   support

       Angry       0.85      0.87      0.86        60
     Defence       0.92      0.97      0.94        58
    Fighting       0.81      0.72      0.76        60
       Happy       0.63      0.86      0.73        58
 HuntingMind       0.92      0.84      0.88        58
      Mating       0.81      0.73      0.77        60
  MotherCall       0.93      0.92      0.92        59
     Paining       0.64      0.49      0.55        57
     Resting       0.91      0.98      0.94        59

    accuracy                           0.82       589
   macro avg       0.82      0.82      0.82       589
weighted avg       0.82      0.82      0.82       589


#### Interpretation:
Training our model with augmented data under identical hyperparameters improves both overall performance and generalization, reaching 82% accuracy on our unseen test set. Recall that our previous model, trained only on raw data, yielded 74% train accuracy and 71% test accuracy. This suggests that the augmented data helped our model generalize better to unseen data and allowed it to learn features of our dataset in fewer epochs of training. Interestingly, the model trained with augmented data is still weakest on 'Paining' sounds (test F1-score of 0.55), a flaw it shares with our earlier model (0.40). This could be due to the inherent nature of these sounds or perhaps to an underlying error in our dataset itself.
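
The per-class comparison behind this interpretation can be tabulated directly from the two test-set classification reports. The sketch below copies the F1-scores for six of the classes (Happy through Resting) from those reports; the dictionary and variable names are illustrative.

```python
# Test-set F1-scores copied from the two classification reports
f1_raw = {"Happy": 0.64, "HuntingMind": 0.84, "Mating": 0.61,
          "MotherCall": 0.80, "Paining": 0.40, "Resting": 0.96}
f1_aug = {"Happy": 0.73, "HuntingMind": 0.88, "Mating": 0.77,
          "MotherCall": 0.92, "Paining": 0.55, "Resting": 0.94}

# Change in F1 per class when training with augmented data
f1_change = {label: round(f1_aug[label] - f1_raw[label], 2) for label in f1_raw}

# Class with the lowest F1 under each training regime
weakest_raw = min(f1_raw, key=f1_raw.get)
weakest_aug = min(f1_aug, key=f1_aug.get)

for label, delta in sorted(f1_change.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:<12} {f1_raw[label]:.2f} -> {f1_aug[label]:.2f} ({delta:+.2f})")
print("Weakest class (raw, augmented):", weakest_raw, weakest_aug)
```

Among these six classes, every class except Resting improves, and Paining remains the weakest in both runs.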

In [33]:
# Redefine the single-file predict function, which the batched version above overwrote.
sampling_rate = sr

def speech_file_to_array_fn(path, sampling_rate):
    # Load the waveform and resample from the file's native rate to the target rate.
    speech_array, _sampling_rate = torchaudio.load(path)
    resampler = torchaudio.transforms.Resample(orig_freq=_sampling_rate, new_freq=sampling_rate)
    speech = resampler(speech_array).squeeze().numpy()
    return speech


def predict(path, sampling_rate):
    speech = speech_file_to_array_fn(path, sampling_rate)
    features = processor(speech, sampling_rate=sampling_rate, return_tensors="pt", padding=True)

    input_values = features.input_values.to(device)

    with torch.no_grad():
        logits = model(input_values).logits

    scores = F.softmax(logits, dim=1).detach().cpu().numpy()[0]
    outputs = [{"Emotion": config.id2label[i], "Score": f"{round(score * 100, 3):.1f}%"} for i, score in enumerate(scores)]
    return outputs

STYLES = """
<style>
div.display_data {
    margin: 0 auto;
    max-width: 500px;
}
table.xxx {
    margin: 50px !important;
    float: right !important;
    clear: both !important;
}
table.xxx td {
    min-width: 300px !important;
    text-align: center !important;
}
</style>
""".strip()

def prediction(df_row):
    path, emotion = df_row["path"], df_row["emotion"]
    df = pd.DataFrame([{"Emotion": emotion, "Sentence": "    "}])
    setup = {
        'border': 2,
        'show_dimensions': True,
        'justify': 'center',
        'classes': 'xxx',
        'escape': False,
    }
    ipd.display(ipd.HTML(STYLES + df.to_html(**setup) + "<br />"))
    speech, sr = torchaudio.load(path)
    speech = speech[0].numpy().squeeze()
    speech = librosa.resample(np.asarray(speech), orig_sr=sr, target_sr=sampling_rate)
    ipd.display(ipd.Audio(data=np.asarray(speech), autoplay=True, rate=sampling_rate))

    outputs = predict(path, sampling_rate)
    r = pd.DataFrame(outputs)
    ipd.display(ipd.HTML(STYLES + r.to_html(**setup) + "<br />"))

#### We Finally Finish with an Evaluation of the Augmented Model's Uncertainty:

In [34]:
prediction(test.iloc[3])

Unnamed: 0,Emotion,Sentence
0,MotherCall,


Unnamed: 0,Emotion,Score
0,Angry,0.2%
1,Defence,1.2%
2,Fighting,1.1%
3,Happy,2.2%
4,HuntingMind,2.6%
5,Mating,1.2%
6,MotherCall,88.7%
7,Paining,2.0%
8,Resting,0.6%
9,Warning,0.3%


In [35]:
prediction(test.iloc[5])

Unnamed: 0,Emotion,Sentence
0,Resting,


Unnamed: 0,Emotion,Score
0,Angry,0.4%
1,Defence,0.3%
2,Fighting,0.1%
3,Happy,0.2%
4,HuntingMind,1.2%
5,Mating,0.3%
6,MotherCall,0.2%
7,Paining,0.1%
8,Resting,96.8%
9,Warning,0.3%


#### Interpretation:
While we caution against generalizing from very few examples, the softmax distributions above suggest that our augmented model is not only more accurate but also more confident: it assigns 88.7% and 96.8% to the correct classes for these two clips, whereas the earlier model assigned the same classes only 35.8% and 67.6%.
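
To move beyond a handful of hand-picked clips, the same comparison could be run over an entire split by averaging the top softmax probability separately for correct and incorrect predictions. The following is a minimal sketch, assuming the per-clip probability vectors and true label ids have already been collected; `confidence_by_correctness` and the toy inputs are hypothetical and are not results from our models.

```python
from statistics import mean

def confidence_by_correctness(prob_rows, true_ids):
    """Average top-class probability, split by whether the prediction was right."""
    correct, wrong = [], []
    for probs, true_id in zip(prob_rows, true_ids):
        pred_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
        (correct if pred_id == true_id else wrong).append(probs[pred_id])
    return {
        "accuracy": len(correct) / len(true_ids),
        "mean_conf_correct": mean(correct) if correct else None,
        "mean_conf_wrong": mean(wrong) if wrong else None,
    }

# Toy three-class inputs for illustration only (not model output)
toy_probs = [[0.7, 0.2, 0.1], [0.4, 0.35, 0.25], [0.1, 0.8, 0.1], [0.5, 0.3, 0.2]]
toy_true = [0, 1, 1, 0]
summary = confidence_by_correctness(toy_probs, toy_true)
print(summary)
```

A large gap between the two averages would indicate that the model's confidence is informative about when it is likely to be wrong.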

### 2d: Results Comment <a class="anchor" id="res"></a>
Because we discussed results at each step of our development, there are no further transfer learning results to present here. Final considerations regarding the transfer learning models are covered in the last section of our project, Overarching Results, which follows.

## 3: Overarching Results and Project Conclusions <a class="anchor" id="o_res"></a>

#### Revisit Problem:
We set out to determine a cat's emotion from an audio file of its meow.

Our project was inspired by, and extends, a 2018 paper by Yagya Raj Pandeya, Dongwhoon Kim, and Joonwhoan Lee. The authors approached the question by extracting features with a CNN, as well as with a CDBN added on top of a pretrained CNN, and then passing those features to separate classification algorithms. Rather than separating feature extraction from classification, we attempted to predict the class directly within the model pipelines.

In our project, we considered 10 different emotions, [Angry, Defence, Fighting, Happy, HuntingMind, Mating, MotherCall, Paining, Resting, Warning]. 

#### Recap Data:
We obtained our data by requesting it from the authors of the base paper, Pandeya et al., who sourced it by scraping public sources such as YouTube. The original dataset had about 3,000 files.

To investigate the data, we played examples in raw audio form, plotted examples of each emotion's raw numeric form and mel spectrogram form, and overlaid files. Our general conclusion from this exploration was that there were no obvious consistencies within an emotion, as files varied in length, amplitude, and more. We found this reasonable: just as each human has a different voice, each cat has its own meow.

A lot of preprocessing went into getting the data ready for modeling. First, the raw audio files needed to be converted into numeric arrays, which we did using the `torch` and `librosa` libraries. The files also needed to be equal in length. After analyzing the distribution of file lengths, we determined 150,000 sequence steps to be a good cutoff value, and therefore truncated or padded every file to length 150,000.
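
The truncate-or-pad step can be sketched as follows. This is an illustration rather than our exact preprocessing code: `fix_length` is a hypothetical helper, and zero-padding at the end of the clip (and truncating from the end) is an assumption, since the description above only fixes the target length of 150,000.

```python
import numpy as np

TARGET_LEN = 150_000  # cutoff chosen from the distribution of file lengths

def fix_length(signal, target_len=TARGET_LEN):
    """Return a 1-D float32 array of exactly target_len samples."""
    signal = np.asarray(signal, dtype=np.float32)
    if len(signal) >= target_len:
        # Assumption: keep the start of the clip and drop the tail
        return signal[:target_len]
    # Assumption: pad with zeros (silence) at the end
    return np.pad(signal, (0, target_len - len(signal)), mode="constant")

short_clip = fix_length(np.ones(100_000))   # padded up to 150,000
long_clip = fix_length(np.ones(200_000))    # truncated down to 150,000
print(short_clip.shape, long_clip.shape)
```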

We also recognized that the amount of data might be cause for concern, as the dataset was quite small. To address this, we created an additional augmented set of data by applying randomly selected speed changes (range 0.9 to 1), pitch shifting (range −4 to 4), dynamic range compression (range 0.5 to 1.1), insertion of noise (range 0.1 to 0.5), and time shifting (20% in the forward or backward direction) to the files. The only downside to such augmentation was that it took a significant amount of time to run and a lot of memory to hold. We therefore also included a workaround in case either Notebook A or B had to be run multiple times: rather than rerunning the augmentation, the data is imported from a saved csv.
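
The random selection of augmentation settings can be sketched with the standard library alone. This covers only the parameter sampling, not the signal processing itself. The key names are hypothetical, uniform sampling within each range is an assumption, the units of each range are not specified beyond what is quoted above, and the time shift is interpreted here as a fraction between −0.2 and 0.2 of the clip length.

```python
import random

# Ranges quoted in the description above; key names are illustrative
AUGMENTATION_RANGES = {
    "speed_change": (0.9, 1.0),
    "pitch_shift": (-4.0, 4.0),
    "dynamic_range_compression": (0.5, 1.1),
    "noise": (0.1, 0.5),
    "time_shift_fraction": (-0.2, 0.2),  # negative = backward, positive = forward
}

def sample_augmentation(rng):
    """Draw one setting per augmentation, uniformly within its range (assumption)."""
    return {name: rng.uniform(low, high)
            for name, (low, high) in AUGMENTATION_RANGES.items()}

rng = random.Random(109)  # seeded so that a rerun draws the same settings
params = sample_augmentation(rng)
for name, value in params.items():
    print(f"{name}: {value:.3f}")
```

Seeding the generator makes the sampled settings reproducible, which complements the saved-csv workaround when the augmentation itself is too slow to rerun.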

#### General Project Plan:
Following the preparation of our data, we moved on to the modeling aspect of the project. As briefly mentioned above, we deviated from the base paper's plan in that **rather than using the NN structures for feature extraction, we wanted to use them to predict the emotion classifications directly.** Overall, we followed a Goldilocks structure: we started with a simple baseline CNN model, built on the concepts of the baseline to create an advanced CNN (in practice, a version of the CNN used in the paper), and finally pivoted to transfer learning.

#### Baseline Model Conclusions:
Our baseline model was incredibly simple: a seven-layer CNN that took in mel spectrogram versions of the data, preprocessed with the `librosa` library. It also failed badly. As we are considering 10 different emotions, a random guess would be right 10% of the time on average. Our simple baseline model performed at just about .1 accuracy, sometimes even worse, so it was no better than random guessing. The only positive of such poor performance was that there was a lot of room for improvement.

#### Remake Models Conclusions:
Our second modeling attempt was a remake of the base paper's CNN, which extracted features to be fed into a further classification algorithm. The architecture the paper outlines is that of a model pretrained on the Million Song Dataset, not one the authors derived themselves; we nevertheless decided to implement it directly because of the model's general success. Rather than use an additional classification algorithm, however, we adapted the model to make the predictions directly. In addition, we added batch normalization and dropout to the convolutional layers (the paper did not specify whether these were included); the model's architecture is discussed in further detail in Notebook A. Rather than taking in data already converted to mel spectrogram form, the architecture converted the input to mel spectrogram form in its first layer, using the `kapre` library, which is built as an extension of `keras`.

We first trained the CNN on just the original data, then fit a second model that included the augmented data. Both models performed much better than the original baseline model. The model trained without augmentation had an average accuracy of about .5 (varying between about .45 and .55 across kernel restarts), while the model trained with augmentation had an average accuracy of about .4 (again varying between about .35 and .45). They were therefore roughly 5x and 4x as accurate as the baseline, respectively. The lower accuracy of the model trained with augmented data is a bit surprising, as we hypothesized that more data would allow for a better understanding of patterns; the decline in accuracy might suggest overfitting instead. Before landing on the versions of the models published in Notebook A, we adjusted parameters in hopes of further improvement, but were unable to gain any and therefore moved forward with the ones we published. On reflection, the oscillation in the training metric graphs suggests that the model could benefit from a smaller learning rate, which limited resources prevented us from exploring in depth. The fact that this model simply adapted the Million Song Dataset model's architecture and produced such results is quite impressive, given that the original model the paper references was quite literally trained on a million files.

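#### Transfer Learning Models Conclusions:
As detailed in Section 2 of this notebook, our transfer learning models gave the best results of the project. The model fine-tuned on only the original data reached 71% accuracy on the test set, and training with the augmented data raised this to 82%. Both models were weakest on the 'Paining' class, and the two examples we inspected suggest that the augmented model is also more confident in its predictions.
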
#### Strengths:
Given the scope of this project, we consider our results a relative success. Strengths include the immense improvement over our simple baseline in both our CNN and transfer learning models, our data preprocessing, the conversion of audio files into forms that models could be trained on, and our adaptation to the problem of limited data (augmentation). We also did all of this using what we learned in class while expanding on many of those ideas: we were able to build the CNN structures and drew on our familiarity with HuggingFace from class. Using this knowledge, we were able to properly evaluate the models and process and, although only the final iterations of the models are published in our submitted notebooks, to debug and adjust as problems came up.

#### Weaknesses:
Though we consider our project relatively successful, there are some weaknesses that could use improvement. First, as mentioned briefly before, the data. We would benefit immensely from more data. Although we attempted to augment our given data further, doing so led into the next weakness: memory and resources. Beginning with our augmentation attempts, we realized the magnitude of resources we would need to complete this project. As mentioned in lecture, one of the big downsides of NNs is how many resources they need to run, which increases with the complexity of the problem. Though we made use of the class's provided JupyterHub, we were still unable to put all of our work in a single notebook, as the kernel would crash when we attempted to run it. We acknowledge that this comes from both the scale of our models and the objects that housed the training data; as mentioned, the models took in arrays with 150,000 entries, after all. For the scale of this project and the resources available, most of these issues were unavoidable and would require further GPU allocation to resolve.

#### Future Improvement:
If we were to revisit this project in the future, most of what we would do differently depends on whether our resources change. If we had more GPUs or memory access (though hopefully we don't kill too many more trees!), we would decrease learning rates and increase batch sizes as well as epochs, hoping to get as much accuracy as possible (without overfitting, of course!). Again dependent on resources, we would also iterate on our models more continuously, as we would often attempt adjustments only for the kernel to fail and force a hard reset on our work.

#### Wrap-Up:
In conclusion, we consider our project a relative success given its goals and scope. We tried three distinct methods to answer the problem question (baseline, remake, transfer learning) and attempted to optimize results using both only the data we were given and the addition of augmented data. To derive these strategies, we used methods we learned in class and took concepts beyond the class scope. Our baseline model was no better than a random guess, our remake of the paper's CNN produced better results, and our transfer learning approach had the best results. Given that we took a fundamentally different approach from the authors of the base paper, making the classification predictions within the pipeline rather than feeding model features to separate classification algorithms, we are happy with our results. We believe that with access to more resources (i.e., if our kernel wouldn't die when we adjusted and intensified our training parameters), we could produce even better results.