# Fine-tuning Whisper on Speech Pathology Dataset

## Goal

The goal of the Cleft Palate project (name TBD) at Vanderbilt DSI is to classify audio clips of patients' voices as containing hypernasality (a speech impediment) or not. The patients with hypernasality can then be recommended for speech pathology intervention. This is currently evaluated by human speech pathologists, which requires access to these medical providers. Our hope is to train a model that can classify this speech impediment for expedited patient access to a speech pathologist.

Tutorial created with guidance from ["Fine Tuning OpenAI Whisper Model for Audio Classification in PyTorch"](https://www.daniweb.com/programming/computer-science/tutorials/540802/fine-tuning-openai-whisper-model-for-audio-classification-in-pytorch).

## Model

We plan to use the Whisper embeddings from OpenAI and train a classification model, either using Whisper with a sequence classification head or another classification LLM.

## Data

The data in this notebook is publicly available voice recordings featuring hypernasality and control groups. In the future we hope to train our model on private patient data from Vanderbilt University Medical Center (VUMC).

### Split Data

The data has already been divided into train and test catalogs (`train.csv` and `test.csv`). Here we load both and further split the training catalog into training and validation sets.

In [None]:
!pip install torch
!pip install datasets
!pip install librosa
!pip install transformers


In [None]:
# Import libraries
import glob
import io
import os

import datasets
import librosa
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.utils.data
from datasets import load_dataset, DatasetDict, Audio
from sklearn.metrics import f1_score, classification_report, accuracy_score
from sklearn.model_selection import train_test_split
from torch.optim import AdamW  # transformers' own AdamW is deprecated in favor of this one
from torch.utils.data import Dataset, DataLoader
from transformers import WhisperModel, WhisperFeatureExtractor


In [None]:
pwd

'/workspace/cleft_palate_choja'

In [None]:
data_path = "/workspace/cleft_palate_choja/WAV_PUBLIC_SAMPLES"

train_catalog = "/workspace/cleft_palate_choja/train.csv"
test_catalog = "/workspace/cleft_palate_choja/test.csv"

In [None]:
train_metadata = pd.read_csv(train_catalog)
train_metadata

Unnamed: 0,File_Name,Sampling_Rate_(Hz),Channels,Duration_(seconds),folder,hypernasality,original_text,OPENAI_Whisper_text,WAV_filename,WAV_folder
0,ACPA ted had a dog with white feet-3.mp3,44100.0,1.0,4.13,CASES,1.0,ted had a dog with white feet,Ted and a dog with white feet.,ACPA ted had a dog with white feet-3.wav,CASES_WAV
1,cdc 4 (and then go to school).mp3,44100.0,2.0,1.41,CONTROLS,0.0,and then go to school,and then go to school.,cdc 4 (and then go to school).wav,CONTROLS_WAV
2,Video 1_4 (and can I have some more material).mp3,44100.0,2.0,3.60,CONTROLS,0.0,and can I have some more material,And can I have some more material?,Video 1_4 (and can I have some more material).wav,CONTROLS_WAV
3,NEW - video 2 (three times).mp3,44100.0,2.0,1.28,CONTROLS,0.0,three times,Three times.,NEW - video 2 (three times).wav,CONTROLS_WAV
4,cdc 4 (and then he brushed his teeth).mp3,44100.0,2.0,1.52,CONTROLS,0.0,and then he brushed his teeth,And then he brushed his teeth.,cdc 4 (and then he brushed his teeth).wav,CONTROLS_WAV
...,...,...,...,...,...,...,...,...,...,...
142,video 1 (pizza bundt).mp3,44100.0,2.0,1.80,CONTROLS,0.0,pizza bundt,Pizza Funt!,video 1 (pizza bundt).wav,CONTROLS_WAV
143,ACPA most boys like to play football-3.mp3,48000.0,1.0,3.31,CASES,1.0,most boys like to play football,Most boys like to play football.,ACPA most boys like to play football-3.wav,CASES_WAV
144,Facebook (take a tire).mp3,44100.0,1.0,1.75,CASES,1.0,take a tire,See you next time!,Facebook (take a tire).wav,CASES_WAV
145,Video 5_1 (feet).mp3,44100.0,2.0,1.04,CASES,1.0,feet,Peace.,Video 5_1 (feet).wav,CASES_WAV


In [None]:
# Splitting the dataset into training and validation sets: 70% for training and 30% for validation.
# The 'random_state' ensures the split is reproducible.
train_df, val_df = train_test_split(train_metadata, test_size = 0.3, random_state = 42)
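
The split above is purely random, so with a catalog this small the class ratio can drift between the training and validation sets. scikit-learn's `train_test_split` accepts a `stratify=` argument for this purpose. The sketch below is a plain-Python stand-in that illustrates the idea; `stratified_split` is a hypothetical helper and the toy labels are made up.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.3, seed=42):
    """Return (train_idx, val_idx) with roughly the same class ratio in both.

    Hypothetical helper for illustration only; scikit-learn's
    train_test_split(..., stratify=labels) serves the same purpose.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_val = round(len(indices) * test_size)   # per-class validation quota
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy labels: 10 cases (1.0) and 10 controls (0.0).
toy_labels = [1.0] * 10 + [0.0] * 10
train_idx, val_idx = stratified_split(toy_labels, test_size=0.3)
val_positive_rate = sum(toy_labels[i] for i in val_idx) / len(val_idx)
print(len(train_idx), len(val_idx), val_positive_rate)
```

With real data, the same effect should be available by passing `stratify=train_metadata["hypernasality"]` to the call above.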

In [None]:
# Extracting the filenames of the WAV audio files from the training dataframe and converting them to a list.
train_files = train_df["WAV_filename"].tolist()

# Extracting the folder names where the WAV audio files are stored from the training dataframe and converting them to a list.
train_folder = train_df["WAV_folder"].tolist()

# Creating full file paths for each WAV file in the training set by joining the base data path, folder name, and file name.
# This is done for all files in the 'train_files' list.
train_full_paths = [os.path.join(data_path, train_folder[i], train_files[i]) for i in range(0, len(train_files))]

# 'train_full_paths' now contains the complete paths for each audio file in the training dataset.
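
The same path-building pattern is repeated for the validation and test sets further down. A minimal sketch of how it could be factored into a helper, with an added check that every file exists before training starts, is shown here. `build_paths` and `find_missing` are hypothetical names, and the temporary directory is only a stand-in for `data_path`.

```python
import os
import tempfile

def build_paths(base_dir, folders, filenames):
    """Join base_dir/folder/filename for each (folder, filename) pair."""
    return [os.path.join(base_dir, folder, name)
            for folder, name in zip(folders, filenames)]

def find_missing(paths):
    """Return the paths that do not exist on disk."""
    return [p for p in paths if not os.path.isfile(p)]

# Self-contained demo in a temporary directory (stand-in for data_path).
with tempfile.TemporaryDirectory() as demo_root:
    os.makedirs(os.path.join(demo_root, "CASES_WAV"))
    open(os.path.join(demo_root, "CASES_WAV", "a.wav"), "w").close()

    demo_paths = build_paths(demo_root,
                             ["CASES_WAV", "CONTROLS_WAV"],
                             ["a.wav", "b.wav"])
    missing = find_missing(demo_paths)
    missing_names = [os.path.basename(p) for p in missing]
    print(missing_names)
```

Catching a missing file here is cheaper than discovering it partway through an epoch, when the audio is first decoded.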


In [None]:
# Extracting the 'hypernasality' labels from the training dataframe and converting them into a list.
train_labels = train_df["hypernasality"].tolist()

# Displaying the first 10 labels from the 'train_labels' list for a quick check or overview.
train_labels[0:10]

[0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0]

In [None]:
# Preparing the validation set:

# Extracting the filenames of WAV audio files from the validation dataframe and converting them into a list.
val_files = val_df["WAV_filename"].tolist()

# Extracting the folder names where the WAV audio files are stored from the validation dataframe and converting them into a list.
val_folder = val_df["WAV_folder"].tolist()

# Creating full file paths for each WAV file in the validation set by joining the base data path, folder name, and file name.
# This is performed for all files in the 'val_files' list.
val_full_paths = [os.path.join(data_path, val_folder[i], val_files[i]) for i in range(0, len(val_files))]

# Extracting the 'hypernasality' labels from the validation dataframe and converting them into a list.
val_labels = val_df["hypernasality"].tolist()

In [None]:
# Determining the total number of labels in the validation set.
# This represents the count of samples in the validation dataset.
len(val_labels)

45

In [None]:
test_metadata = pd.read_csv(test_catalog)

In [None]:
# Adding columns to the test dataset for WAV audio file data.

# In the 'WAV_filename' column, change the file extension from ".mp3" to ".wav" in each filename.
# This is done by replacing '.mp3' with '.wav' in the 'File_Name' column of the test_metadata dataframe.
test_metadata['WAV_filename'] = test_metadata['File_Name'].str.replace('.mp3', '.wav', regex=False)

# Create a new column 'WAV_folder' in the test_metadata dataframe.
# This column is generated by appending "_WAV" to each value in the existing 'folder' column.
# This helps in categorizing or identifying the folder as containing WAV files.
test_metadata['WAV_folder'] = test_metadata['folder'] + "_WAV"
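
Whether `'.mp3'` is read as a literal string or as a regular expression matters, because in a regular expression `.` matches any character. pandas' `Series.str.replace` exposes this choice through its `regex` argument. The sketch below shows the difference with plain Python strings; the filename is made up and `swap_extension` is a hypothetical helper.

```python
import os
import re

def swap_extension(filename, new_ext=".wav"):
    """Replace only the final extension of a filename."""
    stem, _old_ext = os.path.splitext(filename)
    return stem + new_ext

name = "clip_mp3 take.mp3"  # made-up filename for illustration

# Regex interpretation: '.' matches any character, so "_mp3" is replaced too.
regex_result = re.sub(".mp3", ".wav", name)

# Literal interpretation: only the exact text ".mp3" is replaced.
literal_result = name.replace(".mp3", ".wav")

print(regex_result)
print(literal_result)
print(swap_extension(name))
```

Splitting off the extension, as `swap_extension` does, avoids touching any `.mp3` that happens to appear earlier in the name.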



In [None]:
# Extracting and preparing file path information for the test dataset:

# Retrieving the filenames of WAV audio files from the test metadata dataframe and converting them into a list.
test_files = test_metadata["WAV_filename"].tolist()

# Extracting the folder names where the WAV audio files are stored in the test dataset and converting them into a list.
test_folder = test_metadata["WAV_folder"].tolist()

# Generating full file paths for each WAV file in the test dataset.
# This is achieved by joining the base data path, folder name, and file name for each file in 'test_files'.
# The os.path.join function ensures that the paths are correctly formed irrespective of the operating system.
test_full_paths = [os.path.join(data_path, test_folder[i], test_files[i]) for i in range(0, len(test_files))]

# 'test_full_paths' now contains the complete paths for each audio file in the test dataset.


In [None]:
# Extracting the 'hypernasality' labels from the test metadata dataframe and converting them into a list.
# This list, 'test_labels', holds the hypernasality status of each test sample (1.0 for cases, 0.0 for controls).
test_labels = test_metadata["hypernasality"].tolist()


### Create PyTorch datasets

In [None]:
# Creating a training dataset for audio processing:
# 'datasets.Dataset.from_dict' creates a dataset from a dictionary.
# Here, the dictionary has two keys: 'audio' and 'labels'.
# 'audio' key contains the list of paths to the training audio files ('train_full_paths').
# 'labels' key contains the corresponding labels from 'train_labels'.
train_audio_dataset = datasets.Dataset.from_dict({"audio": train_full_paths, "labels": train_labels})

# Casting the 'audio' column of the dataset to a specific data type.
# 'Audio(sampling_rate=16_000)' specifies that the data in the 'audio' column
# should be treated as audio data with a sampling rate of 16,000 Hz.
# This is important for ensuring consistency in audio data processing.
train_audio_dataset = train_audio_dataset.cast_column("audio", Audio(sampling_rate=16_000))

# Creating a test dataset in a similar manner to the training dataset.
# 'test_full_paths' provides the paths to the test audio files, and 'test_labels' are their corresponding labels.
test_audio_dataset = datasets.Dataset.from_dict({"audio": test_full_paths, "labels": test_labels})

# Casting the 'audio' column in the test dataset to the Audio data type with a sampling rate of 16,000 Hz.
test_audio_dataset = test_audio_dataset.cast_column("audio", Audio(sampling_rate=16_000))

# Creating a validation dataset, following the same procedure as for the training and test datasets.
# 'val_full_paths' and 'val_labels' provide the audio file paths and labels for the validation data, respectively.
val_audio_dataset = datasets.Dataset.from_dict({"audio": val_full_paths, "labels": val_labels})

# Casting the 'audio' column in the validation dataset to the Audio data type with the specified sampling rate.
val_audio_dataset = val_audio_dataset.cast_column("audio", Audio(sampling_rate=16_000))


In [None]:
# Define the model checkpoint to be used: "openai/whisper-base",
# the base-size checkpoint of OpenAI's Whisper model.
model_checkpoint = "openai/whisper-base"

# Initialize a feature extractor for the Whisper model.
# The feature extractor is loaded with the weights from the specified model checkpoint.
# Feature extractors are used to preprocess the audio data before feeding it to the model.
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_checkpoint)

# Load the full Whisper model (encoder and decoder) from the specified checkpoint.
# It is stored in a variable named 'encoder' because it acts as the feature-encoding
# backbone of the classifier defined below.
encoder = WhisperModel.from_pretrained(model_checkpoint)

# Setting up the device for model computations.
# This line checks if CUDA (an interface for working with Nvidia GPUs) is available.
# If CUDA is available, it sets the device to 'cuda' (GPU) for faster computation.
# If not, it uses 'cpu'. Using a GPU can significantly speed up model training and inference.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
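
It helps to know what shapes to expect downstream. The arithmetic below is a sketch that assumes the published Whisper feature-extractor defaults (16 kHz audio, 30-second window, hop length of 160 samples, 80 mel bins for `whisper-base`). These constants are assumptions stated here and are not read from the loaded `feature_extractor`.

```python
# Values assumed from the published Whisper feature-extractor defaults.
SAMPLING_RATE = 16_000   # Hz, matches Audio(sampling_rate=16_000)
CHUNK_SECONDS = 30       # every clip is padded or truncated to this length
HOP_LENGTH = 160         # samples between successive spectrogram frames
N_MELS = 80              # mel bins assumed for whisper-base

def resampled_length(duration_s, sampling_rate=SAMPLING_RATE):
    """Approximate number of samples in a clip after resampling."""
    return round(duration_s * sampling_rate)

def feature_shape(batch_size=1):
    """Expected shape of input_features for a batch of padded clips."""
    n_frames = CHUNK_SECONDS * SAMPLING_RATE // HOP_LENGTH
    return (batch_size, N_MELS, n_frames)

clip_samples = resampled_length(4.13)   # duration of the first clip in the catalog
padded_samples = CHUNK_SECONDS * SAMPLING_RATE
print(clip_samples, padded_samples, feature_shape())
```

Under these assumptions a 4.13-second clip fills under 14% of the 30-second window, so most of each input is padding.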


In [None]:
class SpeechClassificationDataset(torch.utils.data.Dataset):
    def __init__(self, audio_data, feature_extractor):
        # Constructor for the dataset class.
        # 'audio_data' is a Hugging Face dataset with an 'audio' column and a 'labels' column.
        # 'feature_extractor' is the Whisper feature extractor that converts raw audio
        # into the input features the model expects.

        self.audio_data = audio_data
        self.feature_extractor = feature_extractor

    def __len__(self):
        # This method returns the total number of samples in the dataset.
        return len(self.audio_data)

    def __getitem__(self, index):
        # This method retrieves a single data sample from the dataset.

        # Look up the sample once so the audio file is decoded only once.
        sample = self.audio_data[index]

        # Convert the raw waveform into model input features.
        # 'return_tensors="pt"' indicates that the processed output should be PyTorch tensors.
        # The sampling rate stored with the audio is passed along to the feature extractor.
        inputs = self.feature_extractor(sample["audio"]["array"],
                                        return_tensors="pt",
                                        sampling_rate=sample["audio"]["sampling_rate"])

        # Extract the input features from the processed inputs.
        input_features = inputs.input_features

        # Build 'decoder_input_ids': a 1 x 2 tensor filled with the decoder start token ID.
        # WhisperModel is an encoder-decoder model, so its forward pass needs decoder inputs.
        # Note that this reads the module-level 'encoder' defined above.
        decoder_input_ids = torch.tensor([[1, 1]]) * encoder.config.decoder_start_token_id

        # Extract the label for the current audio sample (0.0 = control, 1.0 = hypernasality).
        labels = np.array(sample['labels'])

        # The method returns the processed input features, decoder input IDs, and the labels for the current sample.
        return input_features, decoder_input_ids, torch.tensor(labels)


In [None]:
# Create instances of the SpeechClassificationDataset for each data set (training, testing, and validation).
# The datasets are initialized with their respective audio datasets and the feature extractor.
train_dataset = SpeechClassificationDataset(train_audio_dataset, feature_extractor)
test_dataset = SpeechClassificationDataset(test_audio_dataset, feature_extractor)
val_dataset = SpeechClassificationDataset(val_audio_dataset, feature_extractor)

# Define the batch size. This is the number of samples that will be processed together in one pass (batch) during training.
batch_size = 8

# Initialize a DataLoader for the training dataset.
# 'batch_size=batch_size' configures the loader to provide data in batches of the specified size.
# 'shuffle=True' ensures that the data is shuffled at each epoch, which helps in reducing overfitting and improving model generalization.
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

# Initialize a DataLoader for the validation dataset.
# Data shuffling is turned off ('shuffle=False') as it is not necessary for validation and testing datasets.
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

# Initialize a DataLoader for the test dataset, also with shuffling turned off.
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)


## Fine Tune Whisper Model

We wrap the pretrained Whisper model from Hugging Face with a feed-forward classification head and fine-tune both together.

In [None]:
class SpeechClassifier(nn.Module):
    def __init__(self, num_labels, encoder):
        # This is the constructor for the SpeechClassifier class.
        # It initializes the class with a specific number of labels and a pre-trained encoder.
        super(SpeechClassifier, self).__init__()

        self.encoder = encoder  # This assigns the pre-trained encoder (Whisper model) to the class.

        # A classifier is defined as a sequential neural network.
        # It consists of several linear layers (fully connected layers) with ReLU activation functions in between.
        self.classifier = nn.Sequential(
            nn.Linear(self.encoder.config.hidden_size, 4096),  # Linear layer from hidden size of encoder to 4096 nodes.
            nn.ReLU(),  # ReLU activation function.
            nn.Linear(4096, 2048),  # Another linear layer reducing nodes from 4096 to 2048.
            nn.ReLU(),  # ReLU activation function.
            nn.Linear(2048, 1024),  # Linear layer from 2048 to 1024 nodes.
            nn.ReLU(),  # ReLU activation function.
            nn.Linear(1024, 512),  # Linear layer from 1024 to 512 nodes.
            nn.ReLU(),  # ReLU activation function.
            nn.Linear(512, num_labels)  # Final linear layer outputting to the number of labels.
        )

    def forward(self, input_features, decoder_input_ids):
        # The forward method defines how the input data passes through the network.

        # The full Whisper model (encoder and decoder) takes the input features and
        # decoder input IDs and returns the model's outputs.
        outputs = self.encoder(input_features, decoder_input_ids=decoder_input_ids)

        # 'last_hidden_state' is the decoder's final hidden state; we take the
        # representation at the first decoder position as a pooled summary of the clip.
        # Using a single token's representation is a common practice for classification with transformers.
        pooled_output = outputs['last_hidden_state'][:, 0, :]

        # Passes this pooled output through the classifier to get the final logits.
        logits = self.classifier(pooled_output)

        return logits
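
To get a sense of how large the classification head is, its parameters can be counted by hand. The sketch below assumes a hidden size of 512 for `whisper-base`. That value is an assumption stated here rather than read from `encoder.config.hidden_size`.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def mlp_params(layer_sizes):
    """Total parameters of a stack of linear layers (ReLU adds none)."""
    return sum(linear_params(a, b)
               for a, b in zip(layer_sizes, layer_sizes[1:]))

HIDDEN_SIZE = 512  # assumed hidden size of whisper-base
NUM_LABELS = 2
head_sizes = [HIDDEN_SIZE, 4096, 2048, 1024, 512, NUM_LABELS]

head_param_count = mlp_params(head_sizes)
print(f"{head_param_count:,}")
```

Under that assumption the head alone has about 13 million parameters, which is a lot of capacity relative to a training set of roughly a hundred clips and is worth keeping in mind when judging overfitting.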


In [None]:
# Setting the number of labels (classes) for the classification task.
# 'num_labels = 2' implies that the task is binary classification.
num_labels = 2

# Instantiating the SpeechClassifier model.
# The model is initialized with the number of labels and the encoder (from the Whisper model).
# The '.to(device)' method moves the model to a GPU if available, otherwise to the CPU.
# This helps in leveraging GPU acceleration for faster computation during training.
model = SpeechClassifier(num_labels, encoder).to(device)

# Initializing the optimizer for the model training.
# 'AdamW' is a variant of the Adam optimizer, commonly used in deep learning.
# It takes model parameters as its first argument.
# 'lr=2e-5' sets the learning rate. This value is a common default for fine-tuning models in NLP.
# 'betas=(0.9, 0.999)' sets the coefficients used for computing running averages of the gradient and its square.
# 'eps=1e-08' is a very small number to prevent any division by zero in the implementation.
optimizer = AdamW(model.parameters(), lr=2e-5, betas=(0.9, 0.999), eps=1e-08)

# Defining the loss function.
# 'nn.CrossEntropyLoss' is commonly used for classification tasks.
# It combines a SoftMax activation with a cross-entropy loss function.
criterion = nn.CrossEntropyLoss()
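
As a worked illustration of what this loss computes for a single sample, the sketch below applies softmax to a pair of logits and takes the negative log-probability of the target class. It is plain Python written for illustration and is not PyTorch's implementation; the logit values are made up.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability of the target class for one sample."""
    return -math.log(softmax(logits)[target])

uninformed = cross_entropy([0.0, 0.0], target=1)        # equal logits
confident_right = cross_entropy([-3.0, 3.0], target=1)  # strongly favors the target
confident_wrong = cross_entropy([3.0, -3.0], target=1)  # strongly favors the other class
print(round(uninformed, 4), round(confident_right, 4), round(confident_wrong, 4))
```

For two classes, equal logits give a loss of ln 2 (about 0.693), which is a useful reference point when reading the first-epoch losses in the training log.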




In [None]:
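# Define the training function for the model without a separate validation phase.
# Note: the next cell redefines 'train' with a validation phase, and that later version is the one called below.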
def train(model, train_loader, optimizer, criterion, device, num_epochs):
    # 'num_epochs' is the number of times the entire training dataset is passed through the model.

    for epoch in range(num_epochs):
        # Iterating over each epoch.

        model.train()
        # Setting the model to training mode. This is crucial as it enables
        # the training-specific operations like dropout.

        for i, batch in enumerate(train_loader):
            # Looping over each batch in the training data loader.

            # Unpacking the batch to get input features, decoder input IDs, and labels.
            input_features, decoder_input_ids, labels = batch

            # Removing unnecessary dimensions and moving the data to the specified device (CPU or GPU).
            input_features = input_features.squeeze().to(device)
            decoder_input_ids = decoder_input_ids.squeeze().to(device)

            # Reshaping the labels, converting them to long datatype, and moving to the specified device.
            labels = labels.view(-1).type(torch.LongTensor).to(device)

            optimizer.zero_grad()
            # Clearing the gradients of all optimized tensors. This is important as gradients are accumulated.

            # Forward pass: computing logits by passing the input features and decoder input IDs through the model.
            logits = model(input_features, decoder_input_ids)

            # Calculating the loss between the model outputs (logits) and the labels.
            loss = criterion(logits, labels)
            loss.backward()
            # Backward pass: computing gradient of the loss with respect to model parameters.

            optimizer.step()
            # Adjusting the model parameters based on the computed gradients.

            # Print training progress information every 8 batches.
            if (i+1) % 8 == 0:
                print(f'Epoch {epoch+1}/{num_epochs}, Batch {i+1}/{len(train_loader)}, Train Loss: {loss.item():.4f}')

        # Saving the model's state after each epoch. This saves the trained weights.
        torch.save(model.state_dict(), 'best_model.pt')
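
One detail of the loop above deserves care: `squeeze()` with no argument removes every dimension of size one, including the batch dimension if a final batch happens to contain a single sample. The sketch below uses NumPy arrays as a stand-in for the tensors (PyTorch's `squeeze` also accepts a dimension argument); the shapes are assumed for `whisper-base`.

```python
import numpy as np

# Each dataset item carries a leading singleton dimension, so a batch of
# B items has shape (B, 1, 80, 3000) (shapes assumed for whisper-base).
full_batch = np.zeros((8, 1, 80, 3000), dtype=np.float32)
last_batch = np.zeros((1, 1, 80, 3000), dtype=np.float32)  # a batch of one

# squeeze() with no argument drops every size-1 dimension.
full_all = full_batch.squeeze().shape
last_all = last_batch.squeeze().shape

# Squeezing only axis 1 keeps the batch dimension in both cases.
full_axis1 = full_batch.squeeze(axis=1).shape
last_axis1 = last_batch.squeeze(axis=1).shape

print(full_all, last_all)
print(full_axis1, last_axis1)
```

If a split ever leaves one sample in its last batch, squeezing only the singleton axis (or passing `drop_last=True` to the `DataLoader`) would avoid a shape error in the forward pass.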


In [None]:
# Define the training function with an additional validation phase for evaluating model performance.
def train(model, train_loader, val_loader, optimizer, criterion, device, num_epochs):

    best_accuracy = 0.0  # Initialize the best accuracy variable to keep track of the highest accuracy reached.

    for epoch in range(num_epochs):  # Iterate over each epoch.

        model.train()  # Set the model to training mode.

        for i, batch in enumerate(train_loader):  # Iterate over each batch in the training data loader.

            # Extract input features, decoder input IDs, and labels from the current batch.
            input_features, decoder_input_ids, labels = batch

            # Preprocess and move the data to the appropriate device (CPU/GPU).
            input_features = input_features.squeeze().to(device)
            decoder_input_ids = decoder_input_ids.squeeze().to(device)
            labels = labels.view(-1).type(torch.LongTensor).to(device)

            optimizer.zero_grad()  # Reset gradients for the optimizer.

            # Forward pass: compute logits by passing inputs through the model.
            logits = model(input_features, decoder_input_ids)

            # Compute the loss between model predictions and actual labels.
            loss = criterion(logits, labels)
            loss.backward()  # Backward pass to compute gradients.

            optimizer.step()  # Update model parameters based on gradients.

            # Print training loss every 8 batches for monitoring.
            if (i + 1) % 8 == 0:
                print(f'Epoch {epoch + 1}/{num_epochs}, Batch {i + 1}/{len(train_loader)}, Train Loss: {loss.item():.4f}')

        # Perform evaluation on the validation set after each training epoch.
        val_loss, val_accuracy, val_f1, _, _ = evaluate(model, val_loader, device)

        # Check if the current validation accuracy is the best and save the model if it is.
        if val_accuracy > best_accuracy:
            best_accuracy = val_accuracy
            torch.save(model.state_dict(), 'best_model.pt')  # Save the model.

        # Print validation performance metrics for the current epoch.
        print("========================================================================================")
        print(f'Epoch {epoch + 1}/{num_epochs}, Val Loss: {val_loss:.4f}, Val Accuracy: {val_accuracy:.4f}, Val F1: {val_f1:.4f}, Best Accuracy: {best_accuracy:.4f}')
        print("========================================================================================")


In [None]:
# Define the evaluation function for the model.
def evaluate(model, data_loader, device):

    all_labels = []  # Initialize a list to store all actual labels.
    all_preds = []   # Initialize a list to store all model predictions.
    total_loss = 0.0  # Initialize the total loss to zero.

    model.eval()  # Switch to evaluation mode so layers such as dropout behave deterministically.

    with torch.no_grad():  # Disable gradient computation for evaluation, which reduces memory usage and speeds up computation.

        for i, batch in enumerate(data_loader):  # Iterate over each batch in the data loader.

            # Unpack the batch to get input features, decoder input IDs, and labels.
            input_features, decoder_input_ids, labels = batch

            # Preprocess and move the data to the appropriate device (CPU/GPU).
            input_features = input_features.squeeze().to(device)
            decoder_input_ids = decoder_input_ids.squeeze().to(device)
            labels = labels.view(-1).type(torch.LongTensor).to(device)

            # Perform a forward pass through the model to get logits.
            logits = model(input_features, decoder_input_ids)

            # Compute the loss between the model's logits and the actual labels (uses the module-level 'criterion').
            loss = criterion(logits, labels)
            total_loss += loss.item()  # Accumulate the total loss.

            # Get the class predictions from the logits.
            _, preds = torch.max(logits, 1)

            # Store the labels and predictions to calculate metrics later.
            all_labels.append(labels.cpu().numpy())
            all_preds.append(preds.cpu().numpy())

    # Concatenate all batches' labels and predictions into single arrays.
    all_labels = np.concatenate(all_labels, axis=0)
    all_preds = np.concatenate(all_preds, axis=0)

    # Calculate the average loss, accuracy, and F1 score.
    loss = total_loss / len(data_loader)
    accuracy = accuracy_score(all_labels, all_preds)
    f1 = f1_score(all_labels, all_preds, average='macro')

    # Return the calculated metrics along with all labels and predictions.
    return loss, accuracy, f1, all_labels, all_preds
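
Macro F1 is reported alongside accuracy because it penalizes a model that ignores one class. The sketch below computes it by hand for a classifier that always predicts class 1, using the validation class counts from the classification report further down (22 controls, 23 cases). The helper names are hypothetical, and the constant predictor is an illustration, not a record of what the model did.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Validation class counts from the classification report: 22 zeros, 23 ones.
y_true = [0] * 22 + [1] * 23
y_const = [1] * 45  # a classifier that always predicts class 1

const_accuracy = sum(t == p for t, p in zip(y_true, y_const)) / len(y_true)
const_macro_f1 = macro_f1(y_true, y_const)
print(round(const_accuracy, 4), round(const_macro_f1, 4))
```

These two values coincide with the epoch-1 validation accuracy and F1 in the training log, which suggests, without proving it, that the model was predicting a single class at that point.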


In [None]:
# Importing the librosa library, commonly used for audio processing tasks.
import librosa

# Setting the number of epochs for training the model.
# An epoch is a full iteration over the entire training dataset.
num_epochs = 5

# Initiating the training process of the model.
# The 'train' function takes the model, train and validation data loaders, optimizer, loss criterion,
# computation device, and number of epochs as parameters.
# This function will train the model on the training dataset for the specified number of epochs
# and evaluate its performance on the validation dataset after each epoch.
train(model, train_loader, val_loader, optimizer, criterion, device, num_epochs)


Epoch 1/5, Batch 8/13, Train Loss: 0.6367
Epoch 1/5, Val Loss: 0.6483, Val Accuracy: 0.5111, Val F1: 0.3382, Best Accuracy: 0.5111
Epoch 2/5, Batch 8/13, Train Loss: 0.3895
Epoch 2/5, Val Loss: 0.3521, Val Accuracy: 0.8444, Val F1: 0.8432, Best Accuracy: 0.8444
Epoch 3/5, Batch 8/13, Train Loss: 0.0461
Epoch 3/5, Val Loss: 0.5403, Val Accuracy: 0.8222, Val F1: 0.8148, Best Accuracy: 0.8444
Epoch 4/5, Batch 8/13, Train Loss: 0.0011
Epoch 4/5, Val Loss: 0.4466, Val Accuracy: 0.8889, Val F1: 0.8889, Best Accuracy: 0.8889
Epoch 5/5, Batch 8/13, Train Loss: 0.0007
Epoch 5/5, Val Loss: 0.9317, Val Accuracy: 0.8000, Val F1: 0.7935, Best Accuracy: 0.8889
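
The log shows the training loss falling to nearly zero while the validation loss moves up and down, so which epoch counts as "best" depends on the criterion. The `train` function keeps the checkpoint with the highest validation accuracy. The sketch below compares that choice with selecting on validation loss, using the numbers copied from the log above; selecting on loss is an alternative mentioned for comparison and not something the notebook does.

```python
# (epoch, val_loss, val_accuracy) copied from the training log above.
history = [
    (1, 0.6483, 0.5111),
    (2, 0.3521, 0.8444),
    (3, 0.5403, 0.8222),
    (4, 0.4466, 0.8889),
    (5, 0.9317, 0.8000),
]

best_by_accuracy = max(history, key=lambda row: row[2])
best_by_loss = min(history, key=lambda row: row[1])

print("best by accuracy: epoch", best_by_accuracy[0])
print("best by loss:     epoch", best_by_loss[0])
```

The two criteria pick different epochs here. With only 45 validation clips, a single clip changes accuracy by about two percentage points, so the ranking between nearby epochs is fragile.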


### Validation

Before running the model on the test set, let's examine the validation set and see how our model is doing.

In [None]:
# Loading the state dictionary of the best-performing model from the saved file 'best_model.pt'.
# This file contains the trained weights of the model that achieved the highest accuracy during training.
state_dict = torch.load('best_model.pt')

# Creating a new instance of the SpeechClassifier model.
# 'num_labels' specifies the number of output labels for the classification task.
# The model is again initialized with the same encoder as used during training.
num_labels = 2
model = SpeechClassifier(num_labels, encoder).to(device)

# Loading the saved state dictionary into this new model instance.
# This effectively transfers the learned weights to the new model.
model.load_state_dict(state_dict)

# Evaluating the model on the validation dataset.
# The 'evaluate' function returns multiple metrics but here we're only interested in the actual and predicted labels.
# These labels are used to assess the model's classification performance on the validation set.
_, _, _, all_labels, all_preds = evaluate(model, val_loader, device)

In [None]:
# VALIDATION PHASE CONTINUED

# Printing a detailed classification report.
# The 'classification_report' from scikit-learn provides metrics such as precision, recall, and F1-score for each class.
# These metrics give a comprehensive view of the model's performance across different classes.
# 'all_labels' are the true labels, and 'all_preds' are the predictions made by the model on the validation set.
print(classification_report(all_labels, all_preds))

# Printing the overall accuracy of the model on the validation set.
# 'accuracy_score' computes the proportion of correctly predicted observations to the total observations.
# This gives a quick and clear indication of how often the model is correct across all classes.
print(accuracy_score(all_labels, all_preds))


              precision    recall  f1-score   support

           0       0.87      0.91      0.89        22
           1       0.91      0.87      0.89        23

    accuracy                           0.89        45
   macro avg       0.89      0.89      0.89        45
weighted avg       0.89      0.89      0.89        45

0.8888888888888888


These results look too good to be true, so let's check the contents of the labels, the predictions, and the class balance.

In [None]:
all_labels

array([1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0,
       1])

In [None]:
all_preds

array([1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       1])
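
Comparing the two arrays by eye is error-prone, so here is a small sketch that tallies the confusion counts from the values printed above. For a screening task, sensitivity (cases caught) and specificity (controls correctly cleared) are the quantities of most interest.

```python
# Validation labels and predictions copied from the two cells above.
labels = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
          0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0,
          1]
preds = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
         0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
         1]

pairs = list(zip(labels, preds))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # hypernasal, flagged
tn = sum(t == 0 and p == 0 for t, p in pairs)  # control, not flagged
fp = sum(t == 0 and p == 1 for t, p in pairs)  # control, flagged
fn = sum(t == 1 and p == 0 for t, p in pairs)  # hypernasal, missed

sensitivity = tp / (tp + fn)   # recall for class 1
specificity = tn / (tn + fp)   # recall for class 0
print(tp, tn, fp, fn)
print(round(sensitivity, 2), round(specificity, 2))
```

The counts agree with the per-class recall values in the classification report: five of the 45 clips are misclassified, three of them missed cases.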

In [None]:
# Calculating the proportion of positive labels in the training dataset.
# This operation sums up all the values in 'train_labels' and then divides by the total number of labels.
# 'train_labels' is a list containing the labels for each training sample.
# Assuming a binary classification task, this line effectively computes the fraction of samples
# that belong to the positive class (often denoted as '1').
# This metric is useful for understanding the class distribution in the training set,
# particularly for identifying class imbalances.
sum(train_labels) / len(train_labels)


0.5294117647058824

In [None]:
# Calculating the proportion of positive labels in the validation dataset.
# This line sums all the values in 'val_labels' and divides by the total number of labels.
# 'val_labels' contains the labels for each sample in the validation dataset.
# In a binary classification context, this calculation reveals the fraction of samples
# that are categorized as the positive class (usually represented as '1').
# It's a useful metric to assess the class balance in the validation set,
# which can influence how well the model generalizes from training to validation data.
sum(val_labels) / len(val_labels)


0.5111111111111111
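
With classes this balanced, always predicting the larger class would score about 0.51 on the validation set, so the reported accuracy is well above that baseline. The validation set is small, though, and the sketch below puts a rough uncertainty range on the accuracy using a Wilson score interval. The choice of interval and the 95% level are assumptions made for illustration.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# 40 of the 45 validation clips were classified correctly (accuracy 0.8889).
low, high = wilson_interval(40, 45)
majority_baseline = max(23, 22) / 45  # always predicting the larger class

print(f"accuracy interval: {low:.3f} to {high:.3f}")
print(f"majority baseline: {majority_baseline:.3f}")
```

The interval is wide, which is expected for 45 samples, but its lower end still sits clearly above the majority baseline.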

In [None]:
# TESTING PHASE
# This section is dedicated to evaluating the model's performance on the test dataset.

# Loading the state dictionary of the best model from 'best_model.pt'.
# This file contains the trained model weights that achieved the highest accuracy during the training phase.
state_dict = torch.load('best_model.pt')

# Creating a new instance of the SpeechClassifier model.
# 'num_labels = 2' indicates that the classification task is binary.
# The model is initialized with the specified encoder and set to the appropriate device (CPU or GPU).
model = SpeechClassifier(num_labels, encoder).to(device)

# Loading the saved state dictionary into the newly created model instance.
# This ensures that the model uses the previously trained and optimized weights.
model.load_state_dict(state_dict)

# Evaluating the model's performance on the test dataset.
# The 'evaluate' function is called with the test data loader and the computation device.
# It returns the loss, accuracy, F1 score, and the actual and predicted labels, but here
# we're only interested in the actual and predicted labels.
_, _, _, all_labels, all_preds = evaluate(model, test_loader, device)

# Printing a detailed classification report.
# This includes metrics such as precision, recall, and F1-score for each class,
# offering a comprehensive view of the model's performance on the test data.
print(classification_report(all_labels, all_preds))

# Printing the overall accuracy of the model on the test set.
# This is the proportion of correctly predicted observations to the total number of observations,
# providing a quick overview of the model's effectiveness in making correct predictions.
print(accuracy_score(all_labels, all_preds))


              precision    recall  f1-score   support

           0       0.84      0.89      0.86        18
           1       0.89      0.84      0.86        19

    accuracy                           0.86        37
   macro avg       0.87      0.87      0.86        37
weighted avg       0.87      0.86      0.86        37

0.8648648648648649
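
The numbers in the report can be cross-checked by hand. The confusion matrix below is not printed by the notebook: it is inferred from the supports (18 and 19) and the rounded recalls, so treat it as an assumption that happens to be consistent with the report.

```python
import numpy as np

# Assumed confusion matrix inferred from the report above
# (rows = true class, columns = predicted class).
cm = np.array([[16, 2],
               [3, 16]])

recall = cm.diagonal() / cm.sum(axis=1)      # correct / actual, per class
precision = cm.diagonal() / cm.sum(axis=0)   # correct / predicted, per class
f1 = 2 * precision * recall / (precision + recall)
accuracy = cm.diagonal().sum() / cm.sum()    # 32 correct out of 37

print("precision:", np.round(precision, 2))
print("recall:   ", np.round(recall, 2))
print("f1:       ", np.round(f1, 2))
print("accuracy: ", accuracy)
```

With only 37 test clips, each misclassified clip moves accuracy by about 2.7 percentage points, so small differences between models on this test set should be read with caution.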


I don't want to rely on the test set any further yet, since we still want to explore more models.

### Model Troubleshooting

So far our results look too good to be true (98% validation accuracy). In the cells below I run through some troubleshooting checks to make sure the model is not overfitting or learning the wrong representations.

First, check that the labels are correct: every clip in the `CONTROLS_WAV` folder should have a `hypernasality` label of 0.

In [None]:
train_df[train_df["WAV_folder"] == "CONTROLS_WAV"]["hypernasality"]

93     0.0
140    0.0
108    0.0
65     0.0
28     0.0
117    0.0
84     0.0
142    0.0
44     0.0
15     0.0
114    0.0
47     0.0
110    0.0
78     0.0
5      0.0
120    0.0
77     0.0
34     0.0
111    0.0
43     0.0
95     0.0
131    0.0
8      0.0
13     0.0
3      0.0
38     0.0
72     0.0
6      0.0
109    0.0
2      0.0
123    0.0
112    0.0
46     0.0
79     0.0
41     0.0
90     0.0
75     0.0
32     0.0
141    0.0
37     0.0
1      0.0
52     0.0
103    0.0
74     0.0
121    0.0
146    0.0
20     0.0
14     0.0
Name: hypernasality, dtype: float64
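
Scanning the printed column by eye works for 48 rows, but the same check can be written as an assertion. This is a sketch on a tiny invented frame that reuses the `WAV_folder` and `hypernasality` column names; `toy_df`, `expected`, and the rows themselves are hypothetical.

```python
import pandas as pd

# Invented rows that reuse the notebook's column names.
toy_df = pd.DataFrame({
    "WAV_folder": ["CONTROLS_WAV", "CONTROLS_WAV", "CASES_WAV", "CASES_WAV"],
    "hypernasality": [0.0, 0.0, 1.0, 1.0],
})

# Label each folder is expected to carry.
expected = {"CONTROLS_WAV": 0.0, "CASES_WAV": 1.0}

# Rows whose label disagrees with the folder they came from.
mismatches = toy_df[toy_df["WAV_folder"].map(expected) != toy_df["hypernasality"]]
assert mismatches.empty, f"{len(mismatches)} mislabelled rows"

# A folder-by-label count table shows the same information at a glance.
print(pd.crosstab(toy_df["WAV_folder"], toy_df["hypernasality"]))
```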

In [None]:
train_df

Unnamed: 0,File_Name,Sampling_Rate_(Hz),Channels,Duration_(seconds),folder,hypernasality,original_text,OPENAI_Whisper_text,WAV_filename,WAV_folder
93,ACPA Santa came home since the snow fell.mp3,44100.0,1.0,3.19,CONTROLS,0.0,Santa came home since the snow fell,Santa came home since the snow fell.,ACPA Santa came home since the snow fell.wav,CONTROLS_WAV
140,cdc 5 (can I play with Jack).mp3,44100.0,2.0,1.57,CONTROLS,0.0,can I play with Jack,Can I play with Jack?,cdc 5 (can I play with Jack).wav,CONTROLS_WAV
108,cdc 6 (the polar bears are dancing).mp3,44100.0,2.0,2.32,CONTROLS,0.0,the polar bears are dancing,"Um, the polar bears are dancing.",cdc 6 (the polar bears are dancing).wav,CONTROLS_WAV
0,ACPA ted had a dog with white feet-3.mp3,44100.0,1.0,4.13,CASES,1.0,ted had a dog with white feet,Ted and a dog with white feet.,ACPA ted had a dog with white feet-3.wav,CASES_WAV
73,Video 1_4 (seesaw).mp3,44100.0,2.0,1.15,CASES,1.0,seesaw,P.S.A.,Video 1_4 (seesaw).wav,CASES_WAV
...,...,...,...,...,...,...,...,...,...,...
71,Video 4_4 (well it will help me).mp3,44100.0,2.0,2.32,CASES,1.0,well it will help me,"Wow, em vừa học đĩa",Video 4_4 (well it will help me).wav,CASES_WAV
106,ACPA buy baby a bib.mp3,48000.0,1.0,1.92,CASES,1.0,buy baby a bib,"Hi, I'm Hayley Mim.",ACPA buy baby a bib.wav,CASES_WAV
14,Video 1_18 (pretend it stops running when the ...,44100.0,2.0,5.80,CONTROLS,0.0,pretend it stops running when the car is going,"When it stops running, when the car is going.",Video 1_18 (pretend it stops running when the ...,CONTROLS_WAV
92,Video 2_4 (daddy).mp3,44100.0,2.0,0.57,CASES,1.0,daddy,Fanny,Video 2_4 (daddy).wav,CASES_WAV


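Next, I train on a set of random dummy labels as a baseline. With labels that carry no information about the audio, validation accuracy should stay near chance (about 50%); the real model has to clearly beat that to show it isn't just guessing.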

In [None]:
# Dummy labels: random 0/1 values that carry no information about the audio.
import random

# One dummy label per training sample.
length = len(train_labels)

# Generate a list of random 1s and 0s of that length.
dummy_list = [random.choice([0, 1]) for _ in range(length)]



In [None]:
# Copy the frame so that adding the DUMMY column does not modify train_df itself.
dummy_df = train_df.copy()
dummy_df["DUMMY"] = dummy_list

In [None]:
dummy_audio_dataset = datasets.Dataset.from_dict(
    {"audio": train_full_paths, "labels": dummy_list}
).cast_column("audio", Audio(sampling_rate=16_000))

dummy_dataset = SpeechClassificationDataset(dummy_audio_dataset, feature_extractor)

batch_size = 8

dummy_loader = DataLoader(dummy_dataset, batch_size=batch_size, shuffle=True)


In [None]:
model_checkpoint = "openai/whisper-base"

feature_extractor = WhisperFeatureExtractor.from_pretrained(model_checkpoint)
encoder = WhisperModel.from_pretrained(model_checkpoint)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
num_labels = 2

model = SpeechClassifier(num_labels, encoder).to(device)
optimizer = AdamW(model.parameters(), lr=2e-5, betas=(0.9, 0.999), eps=1e-08)
criterion = nn.CrossEntropyLoss()



In [None]:
num_epochs = 5
train(model, dummy_loader, val_loader, optimizer, criterion, device, num_epochs)

Epoch 1/5, Batch 8/13, Train Loss: 0.7530
Epoch 1/5, Val Loss: 0.7072, Val Accuracy: 0.4889, Val F1: 0.3284, Best Accuracy: 0.4889
Epoch 2/5, Batch 8/13, Train Loss: 0.6759
Epoch 2/5, Val Loss: 0.6932, Val Accuracy: 0.5333, Val F1: 0.5249, Best Accuracy: 0.5333
Epoch 3/5, Batch 8/13, Train Loss: 0.2588
Epoch 3/5, Val Loss: 1.1008, Val Accuracy: 0.4889, Val F1: 0.3631, Best Accuracy: 0.5333
Epoch 4/5, Batch 8/13, Train Loss: 0.2797
Epoch 4/5, Val Loss: 1.1925, Val Accuracy: 0.5333, Val F1: 0.4658, Best Accuracy: 0.5333
Epoch 5/5, Batch 8/13, Train Loss: 0.0114
Epoch 5/5, Val Loss: 1.7554, Val Accuracy: 0.5333, Val F1: 0.5181, Best Accuracy: 0.5333


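With the dummy labels, the model does not learn anything that generalizes: the training loss drops to about 0.01 by epoch 5 (it memorizes the random labels), while validation accuracy stays near chance (0.49 to 0.53), as expected.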
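
The check above trains on random labels and validates on the real ones. A closely related, more systematic version is a permutation test, which repeats the shuffled-label experiment many times and compares the real score against that distribution. The sketch below uses scikit-learn's `permutation_test_score` with a logistic regression on synthetic features; it is not the Whisper classifier, and every name and number in it is an illustrative assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

rng = np.random.default_rng(0)

# Synthetic two-class data standing in for audio features (not the project's data).
n_samples = 200
y = np.array([0, 1] * (n_samples // 2))
X = rng.normal(size=(n_samples, 5)) + 1.0 * y[:, None]

# Cross-validated accuracy on the true labels, plus the same score
# recomputed on 50 label permutations that break the feature-label link.
score, perm_scores, p_value = permutation_test_score(
    LogisticRegression(), X, y, cv=5, n_permutations=50,
    scoring="accuracy", random_state=0,
)

print(f"true labels:     {score:.2f}")
print(f"shuffled labels: {perm_scores.mean():.2f} +/- {perm_scores.std():.2f}")
print(f"p-value:         {p_value:.3f}")
```

If the score on the true labels sits well above the shuffled-label scores, the model is picking up a real relationship between features and labels rather than guessing.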

## Simpler Model

Let's train a simpler model, such as an SVM or a Random Forest, to see how our Whisper-based model compares. The code below was generated with help from ChatGPT4.

### SVM

Support Vector Machine
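
The feature extraction in the SVM cell reduces every clip to one fixed-length vector by averaging its MFCC matrix over time, which is what lets clips of different durations share one feature table. A minimal NumPy sketch of that pooling step follows; random matrices stand in for real MFCCs, and the frame counts and the `mean_pool` helper are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_mfcc = 13  # same default as extract_mfcc_features

# Stand-ins for librosa.feature.mfcc output, which has shape (n_mfcc, n_frames).
short_clip = rng.normal(size=(n_mfcc, 50))
long_clip = rng.normal(size=(n_mfcc, 500))

def mean_pool(mfccs):
    """Average over the time axis so every clip yields n_mfcc numbers."""
    return mfccs.mean(axis=1)

# One row per clip, one column per coefficient, regardless of clip length.
pooled = np.stack([mean_pool(short_clip), mean_pool(long_clip)])
print(pooled.shape)
```

Mean pooling discards the order of frames, so it is a deliberately coarse summary; that is acceptable for a baseline but is one reason a sequence model may do better.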

In [None]:
import librosa
import numpy as np
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Function to extract Mel-Frequency Cepstral Coefficients (MFCCs) from an audio file.
def extract_mfcc_features(file_path, n_mfcc=13):
    # Load the audio file at its native sampling rate.
    audio, sample_rate = librosa.load(file_path, sr=None)
    # Extract MFCC features; the result has shape (n_mfcc, n_frames).
    mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=n_mfcc)
    # Average each coefficient across time, giving a fixed-size vector per audio file.
    return np.mean(mfccs.T, axis=0)

# Combine audio file paths and labels from both the training and testing datasets.
audio_files = list(train_full_paths) + list(test_full_paths)
labels = list(train_labels) + list(test_labels)

# Extract MFCC features from each audio file.
features = [extract_mfcc_features(file) for file in audio_files]

# Split the dataset into testing (30%), then split the remainder into training and validation sets.
x_train_val, x_test, y_train_val, y_test = train_test_split(features, labels, test_size=0.3, random_state=42)
x_train, x_val, y_train, y_val = train_test_split(x_train_val, y_train_val, test_size=0.3, random_state=42)

# Standardize the features: fit the scaler on the training set only, then apply it to the other splits.
# Lowercase x_* hold the raw features; uppercase X_* hold the standardized copies.
scaler = StandardScaler()
X_train = scaler.fit_transform(x_train)
X_val = scaler.transform(x_val)
X_test = scaler.transform(x_test)

# Initialize and train the Support Vector Machine (SVM) classifier with a linear kernel
# on the standardized training features.
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# Make predictions using the standardized validation set.
y_pred = svm_model.predict(X_val)

# Evaluate the model's performance on the validation set.
print("Accuracy:", accuracy_score(y_val, y_pred))
print("Classification Report:", classification_report(y_val, y_pred))


Accuracy: 0.8333333333333334
Classification Report:               precision    recall  f1-score   support

         0.0       0.88      0.83      0.86        18
         1.0       0.77      0.83      0.80        12

    accuracy                           0.83        30
   macro avg       0.83      0.83      0.83        30
weighted avg       0.84      0.83      0.83        30
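
One way to make sure the scaler and the classifier always see the same data is to wrap them in a scikit-learn `Pipeline`, so the scaler is fit on the training split and reapplied automatically at prediction time. The sketch below uses synthetic 13-dimensional features in place of the MFCC vectors; the data and the separation between classes are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic 13-dimensional features standing in for mean MFCC vectors.
y = np.array([0, 1] * 75)
X = rng.normal(size=(150, 13)) + 1.0 * y[:, None]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# The pipeline fits the scaler on the training split only and applies the
# same transform automatically whenever the model predicts.
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
clf.fit(X_train, y_train)

val_accuracy = clf.score(X_val, y_val)
print(f"validation accuracy: {val_accuracy:.2f}")
```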



### Random Forest


In [None]:
from sklearn.ensemble import RandomForestClassifier

# Initialize and train the Random Forest classifier.
# 'RandomForestClassifier' is a type of ensemble learning method, where multiple decision trees are used.
# 'n_estimators=100' specifies that 100 trees should be used in the forest.
# This parameter can be adjusted to optimize performance.
rf_model = RandomForestClassifier(n_estimators=100)

# Training the Random Forest model with the training data.
# 'x_train' contains the feature vectors, and 'y_train' contains the corresponding labels.
rf_model.fit(x_train, y_train)

# Making predictions using the trained Random Forest model on the validation dataset.
# 'x_val' contains the feature vectors of the validation set.
y_pred = rf_model.predict(x_val)

# Evaluating the performance of the Random Forest classifier.
# 'accuracy_score' measures the overall accuracy of the model on the validation set.
# 'classification_report' provides per-class precision, recall, and F1-score,
# which shows whether the errors are concentrated in one of the two classes.
print("Accuracy:", accuracy_score(y_val, y_pred))
print("Classification Report:", classification_report(y_val, y_pred))


Accuracy: 0.8333333333333334
Classification Report:               precision    recall  f1-score   support

         0.0       0.88      0.83      0.86        18
         1.0       0.77      0.83      0.80        12

    accuracy                           0.83        30
   macro avg       0.83      0.83      0.83        30
weighted avg       0.84      0.83      0.83        30
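
With only 30 validation clips, an accuracy of 0.8333 corresponds to 25 correct predictions, and a single clip shifts the score by about 3.3 percentage points. Stratified k-fold cross-validation gives a steadier comparison between the two baselines than one split. The sketch below runs it on synthetic features; the data, fold count, and `random_state` values are assumptions rather than settings used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for roughly 150 clips with 13 mean-MFCC features each.
y = np.array([0, 1] * 75)
X = rng.normal(size=(150, 13)) + 1.0 * y[:, None]

models = {
    "svm": make_pipeline(StandardScaler(), SVC(kernel="linear")),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Every model is scored on the same five stratified folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the spread across folds alongside the mean makes it easier to judge whether a gap between two models is larger than the noise from the split.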

