# Introduction to Machine Learning - Course Project Report

Group members:
   - Grzegorz Prasek
   - Jakub Kindracki
   - Mykhailo Shamrai
   - Mateusz Mikiciuk
   - Ernest Mołczan

In this report we describe our implementation of a CNN that classifies speakers as either allowed or not allowed to access the system (binary classification).

## Table of contents:
1. Dataset
2. Exploratory Data Analysis
3. Preparing audio files for generating spectrograms
4. Generating spectrograms
5. Classifying spectrograms for train, test and validation datasets
6. Model
7. Training loop
8. **Interpretability** - visualizing the behavior and function of individual CNN layers and using it for data exploration
9. **Uncertainty** - using Monte Carlo dropout to estimate classification confidence and comparing dropout to an ensemble of CNN networks
10. **Parameter space** - examining how much individual layers of the network change during training and investigating their re-initialization robustness

# 1. Dataset

## 1.1 Introduction to the Dataset
The project is based on the DAPS (Device and Produced Speech) dataset, which was specifically designed for speech processing and analysis research. The primary goal of this dataset is to provide high-quality speech recordings that can be utilized in applications such as speech recognition, speaker classification, and acoustic analysis.

The DAPS dataset was chosen as the primary data source due to its following characteristics:

- **Data Quality**: The recordings are clean and diverse, enabling precise testing of models under both laboratory and simulated conditions.
- **Speaker Diversity**: The dataset includes recordings from 20 distinct speakers, divided into two classes:
  - **Class 1 (Acceptable individuals)**: Includes recordings from speakers F1, F7, F8, M3, M6, and M8.
  - **Class 0 (Unacceptable individuals)**: Includes recordings from the remaining 14 speakers.
- **Alignment with Project Requirements**: The dataset provides recordings that can be easily transformed into spectrograms, which are essential for the CNN-based approach employed in this project.

---

## 1.2 Dataset Characteristics
Each recording in the DAPS dataset is available as a `.wav` file and exhibits the following features:

- **Standard Sampling Format**: All recordings are sampled at 16 kHz, which is sufficient for most speech processing applications.
- **Variety in Recording Lengths**: The recordings vary in duration, necessitating preprocessing to standardize the samples for comparability.
- **Natural and Artificial Noise**: The dataset includes samples with varying levels of noise, allowing for robustness testing of the model against disturbances.

Additionally, the DAPS dataset was selected due to its accessibility and clear licensing terms, which permit its legal use for educational and research purposes.

---

## 1.3 Data Preparation
To effectively utilize the DAPS dataset in the project, several key data preparation steps were undertaken:

### a) Data Cleaning
The data cleaning process aimed to remove samples that could negatively impact model performance. The following tasks were performed:

- **Duplicate Elimination**: Redundant recordings were removed to prevent overrepresentation of certain samples in the training set.
- **Silence Removal**: Segments containing silence were identified and eliminated to improve model efficiency.
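
As an illustration of the silence-removal step, the sketch below drops low-energy frames from a waveform. It is a minimal example under assumptions, not necessarily the procedure used in the project: the function name `remove_silence`, the frame length, and the relative threshold are all hypothetical choices.

```python
import numpy as np


def remove_silence(signal, frame_length=400, threshold_ratio=0.1):
    """Drop frames whose RMS energy is below a fraction of the loudest frame.

    frame_length=400 corresponds to 25 ms at 16 kHz (an assumed value).
    """
    signal = np.asarray(signal, dtype=np.float64)
    n_frames = len(signal) // frame_length
    if n_frames == 0:
        return signal
    # Split into non-overlapping frames (the trailing remainder is discarded)
    frames = signal[: n_frames * frame_length].reshape(n_frames, frame_length)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    if rms.max() == 0:
        return signal[:0]  # completely silent input
    keep = rms >= threshold_ratio * rms.max()
    return frames[keep].reshape(-1)


# Usage: one second of tone with half a second of silence on each side
sr = 16000
tone = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
padded = np.concatenate([np.zeros(sr // 2), tone, np.zeros(sr // 2)])
cleaned = remove_silence(padded)
print(len(padded), len(cleaned))
```

A relative threshold adapts to the loudness of each recording, but it may need tuning for recordings with strong background noise.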

### b) Data Splitting
The dataset was split into three subsets:

- **Training Set**: 60% of the data, used for model training.
- **Validation Set**: 20% of the data, used for model evaluation during training.
- **Test Set**: 20% of the data, used for final model evaluation.

The split was performed to ensure no overlap between subsets, preventing data leakage, i.e., no fragments of the same recording were included in both the training and test sets.
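
One way to obtain such a leakage-free split is to assign whole recordings, rather than individual fragments, to a subset. The sketch below assumes that each fragment is stored as a `(recording_id, fragment)` pair; the function name and the data layout are hypothetical.

```python
import random


def split_by_recording(fragments, seed=0):
    """Split (recording_id, fragment) pairs 60/20/20 without sharing recordings."""
    recording_ids = sorted({rec_id for rec_id, _ in fragments})
    random.Random(seed).shuffle(recording_ids)

    n_train = int(0.6 * len(recording_ids))
    n_val = int(0.2 * len(recording_ids))
    assignment = {}
    for i, rec_id in enumerate(recording_ids):
        if i < n_train:
            assignment[rec_id] = "train"
        elif i < n_train + n_val:
            assignment[rec_id] = "val"
        else:
            assignment[rec_id] = "test"

    # Every fragment follows the subset of the recording it was cut from
    subsets = {"train": [], "val": [], "test": []}
    for rec_id, fragment in fragments:
        subsets[assignment[rec_id]].append((rec_id, fragment))
    return subsets


# Usage with hypothetical data: 10 recordings, 5 fragments each
fragments = [(f"rec{r}", f"rec{r}_part{p}.png") for r in range(10) for p in range(5)]
subsets = split_by_recording(fragments)
print({name: len(items) for name, items in subsets.items()})
```

Because the proportions are applied to recordings, the fragment-level proportions match 60/20/20 only approximately when recordings differ in length.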

### c) Data Augmentation
To increase data diversity and enhance the model's robustness against noise, the following augmentation techniques were applied:

- **Adding Background Noise**: Artificial noise of varying intensities was introduced to simulate real-world acoustic conditions.
- **Pitch Shifting**: The pitch of recordings was altered to increase speaker diversity.
- **Trimming Recordings**: Samples were cropped to a fixed length to ensure consistency across input data.
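
Two of these augmentations can be sketched with NumPy alone. The example below is illustrative and makes its own assumptions: the noise is white Gaussian noise scaled to a target signal-to-noise ratio, and short signals are zero-padded at the end. The function names `add_noise` and `fix_length` are hypothetical. Pitch shifting is not shown because it usually relies on a dedicated audio library.

```python
import numpy as np


def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise so that the result has the requested SNR (in dB)."""
    signal = np.asarray(signal, dtype=np.float64)
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def fix_length(signal, target_length):
    """Crop the signal, or pad it with zeros at the end, to target_length samples."""
    signal = np.asarray(signal)
    if len(signal) >= target_length:
        return signal[:target_length]
    return np.pad(signal, (0, target_length - len(signal)))


rng = np.random.default_rng(0)
sr = 16000
clean = np.sin(2 * np.pi * 220 * np.arange(2 * sr) / sr)  # 2-second test tone
noisy = add_noise(clean, snr_db=10, rng=rng)
fixed = fix_length(noisy, 3 * sr)  # padded from 2 s to 3 s
print(len(noisy), len(fixed))
```

Lower `snr_db` values produce noisier samples, so sweeping this parameter gives the "varying intensities" described above.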

---

## 1.4 Exceptional Cases in the Data
During data analysis, certain samples were identified as particularly challenging for classification:

- **Low-Volume Recordings**: Required signal amplification to enhance quality.
- **Samples with Significant Background Noise**: These were leveraged to evaluate the model's noise resistance.
- **Acoustically Similar Speakers**: These samples demanded special attention during model training.

---

## 1.5 Challenges and Solutions
Several challenges were encountered while working with the data, which were addressed as follows:

### Class Imbalance
- **Problem**: Class 1 was underrepresented, with only six speakers compared to 14 in Class 0.
- **Solution**: Data augmentation techniques were used to increase the number of samples for Class 1.

### Impact of Noise
- **Problem**: High levels of noise in some recordings negatively affected classification performance.
- **Solution**: A noise reduction process was applied to the audio files, and augmentation with various noise levels was employed to improve robustness.

---

## 1.6 Conclusion
The prepared and processed dataset provided a solid foundation for training and testing the speaker classification model. The preprocessing steps enabled the identification and resolution of potential issues, such as the heterogeneity in recording quality. The final dataset is diverse, well-balanced, and optimized for use in spectrogram-based models.

## 2. Exploratory Data Analysis
Exploratory Data Analysis (EDA) is a critical step in any data-driven project, as it allows us to understand the structure, patterns, and nuances of the dataset before applying machine learning models. The primary goal of EDA is to identify anomalies, relationships, or trends within the data and ensure its quality for further processing. By performing EDA, we can detect potential issues such as missing values, inconsistencies, or outliers and gain insights that guide decisions about data preprocessing and feature engineering. In this project, EDA is particularly important for analyzing the spectrogram representations of audio data and understanding class distributions, background noise patterns, and other factors that might affect the model's performance. Ultimately, EDA lays the foundation for building an effective and reliable recognition system.

We had many interesting ideas for obtaining more information about the data. Some of them were more successful than others. In this section we discuss every EDA step that we took.

## 2.1 Number of records for each speaker
The first step in our exploratory data analysis was to verify whether each speaker had approximately the same number of recordings. This procedure is crucial for identifying any imbalances in the dataset that could skew the model’s training.

The problem setup inherently introduces some imbalance, since Class 0 contains more speakers than Class 1 even with equal recordings per speaker. Ensuring consistent recording counts per speaker within each class nevertheless remains important.

As shown in the screenshot below, the number of recordings per speaker is consistent across the dataset, with the exception of m2. However, since not all of m2's recordings are included in the task, we can conclude that the recordings are uniformly distributed among the speakers relevant to the classification problem.
![Distribution of number of records according to speaker name](./EDA_screenshots/speaker_number_of_records.png)

## 2.2 Distribution of recording duration
The next idea was to check the distribution of recording durations for each speaker. Even though the number of recordings is the same for everyone, this doesn’t guarantee there’s no imbalance—some speakers might have much longer recordings than others, which could affect the model.

To analyze this, we used boxplots, as they clearly show how durations are spread out. In the first screenshot (shown below), we noticed that the durations varied a lot between speakers, with some having much longer recordings than others.

![Boxplots of duration for recording for each speaker before cleaning from silence](./EDA_screenshots/boxplots_before_cleaning.png)

To fix this, we applied a silence removal method, which we plan to use in all further steps of the project. After cleaning, the second boxplot shows that the durations are much more even across speakers, and the imbalance has been reduced to some extent.

![Boxplots of duration for recording for each speaker after cleaning from silence](./EDA_screenshots/boxplots_after_cleaning.png)

## 2.3 Average Intensity for each speaker
The next idea (though not very successful) was to analyze the average intensity for each frequency across all spectrograms for each speaker.

The plan was to take all spectrograms generated for a person and calculate the average "brightness" for each frequency. In a spectrogram, we can imagine the X-axis as time, the Y-axis as frequency, and the color as the intensity of the sound at a given moment for that frequency. By converting the spectrograms to grayscale (a step we would do later for training the model anyway), we calculated the average brightness for each frequency across all spectrograms for each speaker. The X-axis of the resulting plot shows frequency on a logarithmic scale, and the Y-axis shows the average intensity computed over all of the speaker's spectrograms.
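
The computation can be sketched as follows. This is an assumed reading of the procedure (average each spectrogram over time, then average the resulting profiles over spectrograms); the function name and the toy data are hypothetical.

```python
import numpy as np


def mean_intensity_per_frequency(spectrograms):
    """Average grayscale intensity per frequency bin over a speaker's spectrograms.

    Each spectrogram is a 2D array of shape (n_freq_bins, n_time_frames);
    the number of time frames may differ between spectrograms.
    """
    per_spectrogram = [np.mean(s, axis=1) for s in spectrograms]  # average over time
    return np.mean(per_spectrogram, axis=0)  # average over spectrograms


# Hypothetical data: two spectrograms with 4 frequency bins and different widths
a = np.array([[0, 0], [10, 10], [20, 20], [30, 30]], dtype=float)
b = np.array([[0, 0, 0], [30, 30, 30], [20, 20, 20], [10, 10, 10]], dtype=float)
profile = mean_intensity_per_frequency([a, b])
print(profile)
```

Averaging per spectrogram first gives every recording the same weight regardless of its duration.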

The goal was to find patterns in how different speakers’ voices are represented in the spectrograms. Frequencies where a speaker's voice is more prominent should have a lighter color, while less prominent frequencies would appear darker. The plots below show the full comparison and a close-up for two speakers, one female (f1) and one male (m5).
![Average intensity for each frequency for each speaker](./EDA_screenshots/plot_avarage_intencities.png)

![Average intensity for each frequency for f1 and m5 speakers](./EDA_screenshots/avarage_intencities_f1_m5.png)

Unfortunately, due to the high amount of noise and some uncleaned silent parts in the recordings, the results were not very informative and didn’t reveal any clear patterns.

## 2.4 Fundamental Frequency Analysis
Another idea was to explore the so-called fundamental frequency (F0) of each speaker's voice, an approach that could potentially help us find patterns in the voices. The fundamental frequency is the lowest frequency of a periodic voice signal and determines the perceived pitch of a tone. In human speech, F0 corresponds to the rate of vibration of the vocal cords. The higher harmonics also carry important information about a speaker's voice characteristics. The idea was to find these frequencies and train the model on spectrograms with lines overlaid for each of them, or even on new images containing only a line for each harmonic.

To extract the fundamental frequency, we used the pYIN algorithm, which estimates F0 for each time frame. The F0 tracks overlaid on the spectrograms are shown in the screenshot below.

![Fundamental frequency overlaid on spectrograms](./EDA_screenshots/fundamental_frequencies.png)

While this approach had potential, it turned out to be more complex than we initially expected, so we did not pursue it further.

## 2.5 MEL scale spectrograms
Another idea we explored was to use spectrograms converted to the MEL scale. The MEL scale is a perceptual frequency scale designed to mimic the way humans perceive pitch: it is roughly linear at low frequencies and logarithmic at higher ones, so it devotes more resolution to the lower frequencies, where most speech information lies, and compresses the higher ones. This makes MEL spectrograms particularly suitable for voice analysis, as they emphasize the frequency ranges most relevant to speech.
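
To make the compression concrete, the sketch below converts between Hz and mels with one commonly used formula, m = 2595 · log10(1 + f / 700). This is only one variant of the scale; audio libraries may default to a different one, and we do not claim it is the exact mapping behind the spectrograms in this report.

```python
import math


def hz_to_mel(frequency_hz):
    """Convert Hz to mels using m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# The mapping is compressive: high frequencies are squeezed into fewer mels
for f in (1000, 2000, 4000, 8000):
    print(f"{f:5d} Hz -> {hz_to_mel(f):7.1f} mel")
```

With this formula 1000 Hz maps to roughly 1000 mel, while the step from 7 kHz to 8 kHz spans far fewer mels than the step from 1 kHz to 2 kHz.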

The main advantage of using MEL spectrograms is that they are more focused on the frequencies relevant to human voice, which can make the data more expressive and potentially simplify the learning process for the model. However, as shown in the examples below, MEL spectrograms also tend to have a lot of empty black space, representing frequency ranges that are cut off.

![Examples of MEL scale spectrogram](./EDA_screenshots/MEL_freq_example_1.png)
![Examples of MEL scale spectrogram](./EDA_screenshots/MEL_freq_example_2.png)

We trained a separate model specifically on MEL spectrograms. While this approach seemed promising at first, the model quickly began to overfit to the training data after just a few epochs. This was evident from the learning curves, as shown in the plot below. The high training accuracy combined with a significant gap in validation accuracy suggested that the model struggled to generalize to unseen data when trained on MEL spectrograms alone.

![Train and Validation loss and accuracy functions](./EDA_screenshots/train_valid_los_ac_mel.png)

## 6. Model

In this chapter, we will describe the Convolutional Neural Network (CNN) model used for our project. The model is designed to classify spectrogram images into two classes. Below, we provide an overview of the model architecture and the code implementation.

#### Model Architecture

The CNN model consists of the following layers:
1. **Convolutional Layers**: Three convolutional layers (32, 64, and 128 filters), each followed by ReLU activation and 2x2 max pooling.
2. **Fully Connected Layer**: One hidden fully connected layer with 512 units, followed by dropout (`p=0.5`) for regularization.
3. **Output Layer**: A final fully connected layer with two outputs for binary classification.

The input to the model is a grayscale image with a size of 224x224 pixels.

#### Code Implementation

Here is the implementation of the `SpectrogramCNN` model in PyTorch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpectrogramCNN(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()

        # Input is grayscale (1 channel), so input channels = 1
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1, padding=1)
        self.conv3 = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=1, padding=1)

        # Max Pooling
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2, padding=0)

        # Fully connected layers
        self.fc1 = nn.Linear(128 * 28 * 28, 512)  # Based on 224x224 input size after 3 pooling layers
        self.fc2 = nn.Linear(512, num_classes)  # Output layer (for binary classification, num_classes=2)

        # Dropout (optional, helps prevent overfitting)
        self.dropout = nn.Dropout(0.5)

    def forward(self, x):
        # Convolutional layers with ReLU activation and Max Pooling
        x = self.pool(F.relu(self.conv1(x)))  # Output: (32, 112, 112)
        x = self.pool(F.relu(self.conv2(x)))  # Output: (64, 56, 56)
        x = self.pool(F.relu(self.conv3(x)))  # Output: (128, 28, 28)

        # Flatten the tensor for fully connected layers
        x = x.view(-1, 128 * 28 * 28)  # Flattening the output of conv layers

        # Fully connected layers with dropout
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)  # Output: (batch_size, num_classes)

        return x
```

This model is designed to process grayscale spectrogram images and classify them into one of two classes. The use of convolutional layers helps in extracting spatial features from the images, while the fully connected layers perform the final classification. Dropout is used to prevent overfitting during training.
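
The `128 * 28 * 28` input size of `fc1` follows from the layer settings above. The short check below recomputes it with the standard output-size formula for convolution and pooling layers; the helper name `conv_output_size` is ours and is not part of the model code.

```python
def conv_output_size(size, kernel_size, stride, padding):
    """Spatial output size of a convolution or pooling layer along one dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1


size = 224
channels = [32, 64, 128]
for out_channels in channels:
    size = conv_output_size(size, kernel_size=3, stride=1, padding=1)  # 3x3 conv keeps the size
    size = conv_output_size(size, kernel_size=2, stride=2, padding=0)  # 2x2 pooling halves it
    print(f"after block with {out_channels} channels: {out_channels} x {size} x {size}")

flattened = channels[-1] * size * size
fc1_parameters = flattened * 512 + 512  # weights + biases of fc1
print(flattened, fc1_parameters)
```

The check also shows that almost all trainable parameters sit in `fc1`, which is one reason dropout is applied right after it.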


## 7. Training Loop

In this chapter, we will explain how the training and evaluation of our CNN model are performed. The training loop is responsible for optimizing the model's parameters, while the evaluation loop assesses the model's performance on the validation set.

#### Training Loop

The training loop involves the following steps:
1. **Model Initialization**: The model, loss function, and optimizer are initialized.
2. **Epoch Loop**: The training process runs for a specified number of epochs.
3. **Batch Loop**: For each epoch, the model processes the training data in batches.
4. **Forward Pass**: The model makes predictions on the input data.
5. **Loss Calculation**: The loss between the predictions and the true labels is computed.
6. **Backward Pass**: Gradients are calculated, and the model's parameters are updated.
7. **Validation**: After each epoch, the model is evaluated on the validation set to monitor its performance.

Here is the code implementation of the training loop:

```python
import time
import torch.optim as optim
import torch.nn as nn
import torch
import numpy as np
from sklearn.metrics import f1_score


MODEL_PATH = "./model.pth"
LAST_MODEL_PATH = "./last_model.pth"


def train_model(model, train_loader, val_loader, num_epochs=25, learning_rate=0.001, model_path=MODEL_PATH):
    # Define the loss function and optimizer
    criterion = nn.CrossEntropyLoss()  # For classification tasks
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    # Pick the best available device: CUDA GPU, Apple Silicon (MPS), or CPU
    if torch.cuda.is_available():
        device = torch.device("cuda")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
    else:
        device = torch.device("cpu")
    print(f"Using device: {device}")

    model.to(device)

    # Training loop
    best_val_acc = 0.0
    for epoch in range(num_epochs):
        print(f"Starting epoch {epoch + 1}/{num_epochs} at {time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())}")
        model.train()  # Set the model to training mode
        running_loss = 0.0
        correct = 0
        total = 0

        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)  # Move data to the selected device

            # Zero the parameter gradients
            optimizer.zero_grad()

            # Forward pass
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()  # Backpropagation
            optimizer.step()  # Update the weights

            # Track loss and accuracy
            running_loss += loss.item() * inputs.size(0)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

        epoch_loss = running_loss / len(train_loader.dataset)
        epoch_acc = correct / total

        # Validation phase (evaluate_model is defined in the Evaluation Loop section)
        val_loss, val_acc = evaluate_model(model, val_loader, criterion, device)

        print(f"Ending epoch {epoch + 1}/{num_epochs} at {time.strftime('%H:%M:%S', time.localtime())}, "
              f"Train Loss: {epoch_loss:.4f}, Train Acc: {epoch_acc:.4f}, "
              f"Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}")

        # Save the best model based on validation accuracy
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(model.state_dict(), model_path)  # Save the best model

    torch.save(model.state_dict(), LAST_MODEL_PATH)  # Save the last model
    print(f"Training complete. Best validation accuracy: {best_val_acc:.4f}")
```

#### Evaluation Loop

The evaluation loop involves the following steps:
1. **Model Evaluation Mode**: The model is set to evaluation mode, which disables dropout (and would make batch normalization layers, if present, use their running statistics).
2. **Batch Loop**: The model processes the validation data in batches.
3. **Forward Pass**: The model makes predictions on the input data.
4. **Loss Calculation**: The loss between the predictions and the true labels is computed.
5. **Accuracy Calculation**: The accuracy of the model's predictions is calculated.
6. **F1 Score Calculation**: The F1 score is computed to evaluate the model's performance.
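
To clarify what the weighted F1 score in step 6 measures, the sketch below recomputes it by hand for a tiny example. It is an illustrative re-implementation with hypothetical helper names, not the code used in the project, which relies on `sklearn.metrics.f1_score`.

```python
def f1_for_class(labels, predictions, positive):
    """F1 score treating `positive` as the positive class."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == positive and p == positive)
    fp = sum(1 for y, p in zip(labels, predictions) if y != positive and p == positive)
    fn = sum(1 for y, p in zip(labels, predictions) if y == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(labels, predictions):
    """Per-class F1 averaged with weights proportional to each class's support."""
    classes = sorted(set(labels))
    return sum(
        labels.count(c) / len(labels) * f1_for_class(labels, predictions, c)
        for c in classes
    )


# Toy example: three Class 0 samples and one Class 1 sample
labels = [0, 0, 0, 1]
predictions = [0, 0, 1, 1]
print(f1_for_class(labels, predictions, positive=1))
print(weighted_f1(labels, predictions))
```

Because Class 0 has more samples, the weighted average is dominated by its per-class score; the F1 score of Class 1 alone can be noticeably lower.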

Here is the code implementation of the evaluation loop:

```python
import torch
from sklearn.metrics import f1_score


def evaluate_model(model, data_loader, criterion, device):
    model.eval()  # Set the model to evaluation mode
    running_loss = 0.0
    correct = 0
    total = 0
    all_labels = []
    all_predictions = []

    with torch.no_grad():  # Disable gradient computation during evaluation
        for inputs, labels in data_loader:
            inputs, labels = inputs.to(device), labels.to(device)

            outputs = model(inputs)
            loss = criterion(outputs, labels)

            running_loss += loss.item() * inputs.size(0)
            _, predicted = torch.max(outputs, 1)

            total += labels.size(0)
            correct += (predicted == labels).sum().item()

            # Collect predictions and labels for F1 score calculation
            all_labels.extend(labels.cpu().numpy())
            all_predictions.extend(predicted.cpu().numpy())

    epoch_loss = running_loss / len(data_loader.dataset)
    epoch_acc = correct / total

    # Calculate F1 score
    f1 = f1_score(all_labels, all_predictions, average="weighted")  # Change "weighted" if you need macro or micro F1 score
    print(f"F1 Score: {f1:.4f}")

    return epoch_loss, epoch_acc
```

These functions together form the core of the training and evaluation process for our CNN model.

# Monte Carlo Dropout for Estimating Classification Confidence: Comparison with CNN Ensembles

---

## 1. Introduction

In deep learning classification tasks, the confidence of a model is a critical indicator of the reliability of its predictions. Monte Carlo Dropout (MC Dropout) is an effective technique for estimating model uncertainty by leveraging dropout layers during the inference phase. This method involves multiple passes over the same input, generating probabilistic outputs. The mean and standard deviation of these outputs provide insights into the model's confidence.

This section presents a detailed analysis of MC Dropout, comparing it with an ensemble of CNN models, which requires training multiple independent networks. Additionally, we explore how the number of Monte Carlo samples affects the stability and variance of predictions, using visualizations.

---

## 2. Experiment with Monte Carlo Dropout

### 2.1 Experiment Setup

- **Model**: A trained CNN model with dropout (`p=0.5`) applied after the first fully connected layer (FC1) and kept active at inference time.
- **Input Data**: Test spectrograms representing two classes.
- **Procedure**:
  1. Dropout was activated during the test phase (`model.train()`).
  2. `n` predictions were performed for each sample using different numbers of Monte Carlo samples: 2, 20, and 50.
  3. The mean and standard deviation of predictions were calculated for each class.
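
Calling `model.train()` switches every layer to training mode. For the model in this report that only affects dropout, but in networks that also contain batch normalization a narrower switch may be preferable. The sketch below shows one such alternative; the helper `enable_dropout` and the stand-in network are hypothetical and not part of the project code.

```python
import torch.nn as nn


def enable_dropout(model):
    """Put the model in eval mode, then re-activate only its nn.Dropout layers."""
    model.eval()
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()
    return model


# Usage with a small stand-in network (not the project model)
net = nn.Sequential(
    nn.Linear(4, 8), nn.BatchNorm1d(8), nn.ReLU(), nn.Dropout(0.5), nn.Linear(8, 2)
)
enable_dropout(net)
print([type(m).__name__ for m in net if m.training])
```

With this helper, batch normalization keeps using its running statistics while dropout remains stochastic across the Monte Carlo passes.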

---

## 3. Python Function for Monte Carlo Dropout Predictions

The following function implements MC Dropout, enabling multiple forward passes through the model to generate predictions with uncertainty estimates:

```python
import numpy as np
import torch

def mc_dropout_predictions(model, data_loader, num_samples, device):
    model.train()  # Activate dropout
    all_predictions = []

    with torch.no_grad():
        for batch_idx, (inputs, _) in enumerate(data_loader):
            # Only the first 254 batches are processed
            if batch_idx == 254:
                break
            inputs = inputs.to(device)
            print(f"Processing batch {batch_idx + 1}...")  # Batch info

            # Perform multiple predictions with active dropout
            predictions = []
            for sample_idx in range(num_samples):
                outputs = torch.softmax(model(inputs), dim=1)
                print(f"Sample {sample_idx + 1}: outputs shape = {outputs.shape}")
                predictions.append(outputs.cpu().numpy())

            predictions = np.array(predictions)
            print(f"Batch {batch_idx + 1}: predictions shape = {predictions.shape}")  # Batch results shape
            all_predictions.append(predictions)

    return all_predictions
```

This function:
- Activates dropout layers during inference to introduce variability.
- Processes each batch of data to generate `num_samples` predictions per sample.
- Computes predictions as probability distributions using `torch.softmax`.
- Logs the batch and prediction details for debugging purposes.
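
The function returns raw per-batch arrays, so the mean and standard deviation still have to be computed from them. A minimal sketch of that step is shown below, assuming the return format described above (one array of shape `(num_samples, batch_size, num_classes)` per batch); the helper name `summarize_mc_predictions` and the toy values are hypothetical.

```python
import numpy as np


def summarize_mc_predictions(all_predictions):
    """Return per-input mean and standard deviation over the MC samples.

    all_predictions is a list with one array per batch, each of shape
    (num_samples, batch_size, num_classes).
    """
    stacked = np.concatenate(all_predictions, axis=1)  # (num_samples, n_inputs, num_classes)
    return stacked.mean(axis=0), stacked.std(axis=0)


# Hypothetical output: 3 MC samples, two batches (sizes 2 and 1), 2 classes
batch_a = np.array([[[0.9, 0.1], [0.2, 0.8]],
                    [[0.7, 0.3], [0.2, 0.8]],
                    [[0.8, 0.2], [0.2, 0.8]]])
batch_b = np.array([[[0.5, 0.5]], [[0.6, 0.4]], [[0.4, 0.6]]])
mean, std = summarize_mc_predictions([batch_a, batch_b])
print(mean.shape, std.shape)
```

Concatenating along the batch axis also handles a smaller final batch, since only the number of MC samples and classes must match.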

---

## 4. Results and Visualizations

### 4.1 Triangular Plots (2 Monte Carlo Samples)

The triangular plots below illustrate the relationship between the mean predicted probability (X-axis) and the standard deviation (Y-axis) for two classes (Class 0 and Class 1). With only 2 Monte Carlo samples, the results exhibit significant variance, resulting in triangular-shaped distributions:

- **Class 0**: Variance is highest for mean probabilities near 0.5.
- **Class 1**: Similar to Class 0, the highest uncertainty is observed around the midpoint of the probability range.

![Title of Image](chapter_9/0416c84c-c4a0-4ecf-a146-3fc8217526ba.jpg)
![Title of Image](chapter_9/55badd5f-e8c5-4226-ac4c-4ce84c1093ba.jpg)
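
One possible explanation for the triangular shape is purely arithmetic. With two probabilities p1 and p2 in [0, 1], the population standard deviation equals |p1 - p2| / 2, which cannot exceed min(mean, 1 - mean). With more samples the bound relaxes to sqrt(mean · (1 - mean)), a curved envelope. The sketch below checks both bounds on random numbers; it assumes the population standard deviation (the NumPy default) and uses no model output.

```python
import random
import statistics

rng = random.Random(0)


def mean_and_std(samples):
    """Mean and population standard deviation of a list of probabilities."""
    return statistics.fmean(samples), statistics.pstdev(samples)


# Two samples: std = |p1 - p2| / 2, which cannot exceed min(mean, 1 - mean)
for _ in range(1000):
    m, s = mean_and_std([rng.random(), rng.random()])
    assert s <= min(m, 1 - m) + 1e-12

# Many samples: the bound relaxes to sqrt(mean * (1 - mean)), a curved envelope
for _ in range(1000):
    m, s = mean_and_std([rng.random() for _ in range(50)])
    assert s <= (m * (1 - m)) ** 0.5 + 1e-12

print("both bounds hold")
```

Under this reading, the triangle reflects the small number of samples as much as the model's actual uncertainty.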
### 4.2 Plots for 20 Monte Carlo Samples

As the number of Monte Carlo samples increases to 20, the plots become more compact:

- **Class 0**: Standard deviation significantly decreases, particularly for extreme mean probabilities (close to 0 or 1).
- **Class 1**: Results stabilize, providing better confidence estimation.

![Title of Image](chapter_9/ff198697-8f3a-4ee4-8ab9-fe73633bf368.jpg)
![Title of Image](chapter_9/1170b8d3-81f7-49c9-8349-88eeaf214fee.jpg)

### 4.3 Plots for 50 Monte Carlo Samples

With 50 Monte Carlo samples:

- **Class 0 and Class 1**: Variance is minimized, and results become highly stable, allowing clear differentiation between confident and uncertain predictions. The point clouds take on a more parabolic shape.

![Title of Image](chapter_9/5ae9ade2-5d78-461f-8018-2f9d1440a369.jpg)
![Title of Image](chapter_9/download.jpg)

### 4.4 Histograms

The histograms below present the distribution of mean probabilities for both classes:

- **With fewer samples (e.g., 2)**: The distributions are less concentrated, indicating greater spread in predictions.
- **With more samples (e.g., 20, 50)**: The distributions converge near values close to 0 or 1, signifying higher confidence for most samples.

![Title of Image](chapter_9/1d2ac6ee-f0a5-410b-9a60-b82e68539b95.jpg)
![Title of Image](chapter_9/7db013c0-d410-40ae-b2a9-364aa612315b.jpg)


---

## 5. Comparison with CNN Ensembles

### 5.1 Ensemble Setup

To compare MC Dropout with ensembles, CNN models were trained, and their predictions were averaged to compute mean probabilities and variance. The results were then compared with MC Dropout at 50 samples.

### 5.2 Observations

1. **Computational Complexity**:
   - MC Dropout is significantly more efficient computationally, as it requires only one trained model.
2. **Stability of Results**:
   - With sufficient Monte Carlo samples (e.g., 50), MC Dropout achieves results comparable to ensembles.
3. **Practical Applicability**:
   - MC Dropout is more practical in environments with limited computational resources.

---

## 6. Conclusions

Monte Carlo Dropout is a practical and efficient method for estimating model confidence in classification tasks. The experiments demonstrate that increasing the number of Monte Carlo samples significantly improves the stability and precision of results. Comparisons with CNN ensembles show that MC Dropout delivers comparable performance while being far more computationally efficient. For applications requiring interpretability and uncertainty estimation, MC Dropout offers an excellent solution.

---

## Python Implementation for Visualizations
