# Introduction to Machine Learning - Course Project Report

Group members:
   - Grzegorz Prasek
   - Jakub Kindracki
   - Mykhailo Shamrai
   - Mateusz Mikiciuk
   - Ernest Mołczan

In this report we describe our implementation of a CNN that performs binary classification: it distinguishes users who are allowed into the system from users who are not.

## Table of contents:
1. Dataset
2. Exploratory Data Analysis
3. Preparing audio files for generating spectrograms
3. Generating spectrograms
4. Classifying spectrograms for train, test and validation datasets
5. Model
6. Training loop
7. [EXTRA] **interpretability** - visualizing the behavior and function of individual CNN layers and using it for data exploration
8. [EXTRA] **uncertainty** - using Monte Carlo dropout to estimate classification confidence. Comparing dropout to an ensemble of CNNs.
9. [EXTRA] **parameter space** - examining how much individual layers of the network change during training. Investigating their re-initialization robustness.

# 2. Exploratory Data Analysis
Exploratory Data Analysis (EDA) is a critical step in any data-driven project, as it allows us to understand the structure, patterns, and nuances of the dataset before applying machine learning models. Its primary goal is to identify anomalies, relationships, and trends in the data and to verify that the data quality is sufficient for further processing. EDA lets us detect issues such as missing values, inconsistencies, and outliers, and it guides decisions about data preprocessing and feature engineering. In this project, EDA is particularly important for analyzing the spectrogram representations of the audio data and for understanding class distributions, background noise patterns, and other factors that might affect the model's performance. Ultimately, EDA lays the foundation for building an effective and reliable recognition system.

We had many ideas for extracting more information about the data. Some of them were more successful than others. In this section we discuss every EDA step we carried out.

# 2.1 Number of records for each speaker
The first step in our exploratory data analysis was to verify whether each speaker had approximately the same number of recordings. This procedure is crucial for identifying any imbalances in the dataset that could skew the model’s training.

Although the problem setup inherently introduces some imbalance (Class 0 contains more speakers than Class 1, even with equal recordings per speaker), it is still important to ensure that the number of recordings per speaker is consistent within each class.

As shown in the screenshot below, the number of recordings per speaker is consistent across the dataset, with the exception of m2. However, since not all of m2's recordings are included in the task, we can conclude that the recordings are uniformly distributed among the speakers relevant to the classification problem.
![Distribution of number of records according to speaker name](./EDA_screenshots/speaker_number_of_records.png)
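
A count like the one plotted above can be obtained with a few lines of Python. The sketch below is only an illustration: it assumes that the speaker ID (for example `f1` or `m2`) is the part of the file name before the first underscore, which is a hypothetical naming convention and may not match the actual dataset layout.

```python
from collections import Counter
from pathlib import Path


def speaker_of(path):
    """Return the speaker ID, assumed to be the file-name prefix before the first '_'."""
    return Path(path).name.split("_", 1)[0]


def recordings_per_speaker(paths):
    """Count how many recordings each speaker has."""
    return Counter(speaker_of(p) for p in paths)


# Hypothetical file names, used only to illustrate the call.
example_paths = ["data/f1_001.wav", "data/f1_002.wav", "data/m2_001.wav"]
counts = recordings_per_speaker(example_paths)
print(counts)
```

The resulting `Counter` can be passed directly to a bar plot to compare speakers at a glance.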

# 2.2 Distribution of duration for a record
The next idea was to check the distribution of recording durations for each speaker. Even though the number of recordings is the same for everyone, this doesn't guarantee that there is no imbalance: some speakers might have much longer recordings than others, which could affect the model.

To analyze this, we used boxplots, as they clearly show how durations are spread out. In the first screenshot (shown below), we noticed that the durations varied a lot between speakers, with some having much longer recordings than others.

![Boxplots of duration for recording for each speaker before cleaning from silence](./EDA_screenshots/boxplots_before_cleaning.png)
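
The statistics that a boxplot visualizes can also be computed directly. The following is a minimal sketch, assuming the durations are already available as `(speaker, seconds)` pairs; the function name and the input format are our own choices for illustration. It returns the five-number summary per speaker and ignores the outlier handling that plotting libraries apply to the whiskers.

```python
import statistics
from collections import defaultdict


def duration_summary(records):
    """Return {speaker: (min, q1, median, q3, max)} from (speaker, seconds) pairs."""
    by_speaker = defaultdict(list)
    for speaker, seconds in records:
        by_speaker[speaker].append(seconds)

    summary = {}
    for speaker, durations in by_speaker.items():
        # quantiles() with n=4 returns the three quartile cut points.
        q1, median, q3 = statistics.quantiles(durations, n=4, method="inclusive")
        summary[speaker] = (min(durations), q1, median, q3, max(durations))
    return summary
```

Comparing the medians and interquartile ranges across speakers gives a numeric counterpart to the visual comparison in the plot.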

To fix this, we applied a silence removal method, which we plan to use in all further steps of the project. After cleaning, the second boxplot shows that the durations are much more even across speakers, and the imbalance has been reduced to some extent.

![Boxplots of duration for recording for each speaker after cleaning from silence](./EDA_screenshots/boxplots_after_cleaning.png)
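
The report does not depend on one particular silence removal technique, so the snippet below is only a sketch of a common energy-based approach, not necessarily the method used in the project. It splits the signal into non-overlapping frames and drops every frame whose RMS energy is below a threshold. The frame length and threshold are illustrative assumptions.

```python
import math


def remove_silence(samples, frame_length=400, threshold=0.01):
    """Drop every non-overlapping frame whose RMS energy is below `threshold`.

    `samples` is a sequence of floats in [-1, 1]. The frame length and the
    threshold are illustrative assumptions, not values from the report.
    """
    kept = []
    for start in range(0, len(samples), frame_length):
        frame = samples[start:start + frame_length]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms >= threshold:
            kept.extend(frame)
    return kept
```

In practice the threshold has to be tuned to the noise floor of the recordings: if it is too low, background noise is kept as "speech", and if it is too high, quiet speech is cut.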

# 2.3 Average Intensity for each speaker
The next idea (though not very successful) was to analyze the average intensity for each frequency across all spectrograms for each speaker.

The plan was to take all spectrograms generated for a person and calculate the average "brightness" for each frequency. In a spectrogram, we can imagine the X-axis as time, the Y-axis as frequency, and the color as the intensity of the sound at a given moment for that frequency. By converting the spectrograms to grayscale (a step we would do later for training the model anyway), we calculated the average brightness for each frequency across all spectrograms for each speaker. In the resulting plot, the X-axis shows frequency on a logarithmic scale and the Y-axis shows the brightness at that frequency, averaged over all of the speaker's spectrograms.
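
The averaging step can be sketched as follows. This is a simplified illustration in plain Python, assuming each grayscale spectrogram is stored as a list of rows (one row per frequency bin) of pixel values; the function name is hypothetical. With NumPy arrays, the per-spectrogram step reduces to `spec.mean(axis=1)`.

```python
def mean_intensity_per_frequency(spectrograms):
    """Average grayscale brightness of each frequency row over all spectrograms.

    Each spectrogram is a list of rows (one row per frequency bin), and each
    row is a list of pixel values over time. All spectrograms must have the
    same number of rows; their widths (durations) may differ.
    """
    n_rows = len(spectrograms[0])
    profile = []
    for row_index in range(n_rows):
        # Mean brightness of this frequency row in every spectrogram.
        row_means = [
            sum(spec[row_index]) / len(spec[row_index]) for spec in spectrograms
        ]
        # Average of those means across all spectrograms of the speaker.
        profile.append(sum(row_means) / len(row_means))
    return profile
```

Averaging per spectrogram first gives every recording the same weight, regardless of its duration.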

The goal was to find patterns in how different speakers’ voices are represented in the spectrograms. Frequencies where a speaker's voice is more prominent should have a lighter color, while less prominent frequencies would appear darker. The screenshots below show the plot for all speakers and a comparison of two speakers, one woman and one man (f1 and m5).
![Average intensity for each frequency for each speaker](./EDA_screenshots/plot_avarage_intencities.png)

![Average intensity for each frequency for f1 and m5 speakers](./EDA_screenshots/avarage_intencities_f1_m5.png)

Unfortunately, due to the high amount of noise and some uncleaned silent parts in the recordings, the results were not very informative and didn’t reveal any clear patterns.

# 2.4 Fundamental Frequency Analysis
Another idea was to explore the so-called fundamental frequency (F0) in the voice of each speaker, an approach that could potentially help us find patterns in the voices. The fundamental frequency is the lowest frequency component of a periodic signal, and it determines the perceived pitch of a tone. In human speech, F0 corresponds to the vibration rate of the vocal cords. The higher harmonics also carry important information about the characteristics of a speaker's voice. The idea was to extract these frequencies and either train the model on spectrograms with a line drawn for each of them, or train it on new images containing only the lines for each harmonic.

To extract the fundamental frequency, we used the pYIN algorithm, which estimates F0 for each moment in time. The estimated F0 overlaid on the spectrograms is shown in the screenshots below.

![Fundamental frequency overlaid on spectrograms](./EDA_screenshots/fundamental_frequencies.png)
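
To make the idea of F0 estimation concrete, the sketch below uses a plain autocorrelation estimator on a single frame. Note that this is a deliberately simpler technique than pYIN and is shown only for illustration: it picks the lag at which the frame is most similar to a shifted copy of itself and converts that lag to a frequency. The search range of 50 to 500 Hz is an assumption for speech.

```python
import math


def estimate_f0(frame, sample_rate, fmin=50.0, fmax=500.0):
    """Estimate F0 of one frame as the lag with the highest autocorrelation.

    This is a plain autocorrelation estimator, not pYIN. The search range
    fmin..fmax is an illustrative assumption for speech.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation: similarity of the frame to itself
        # shifted by `lag` samples.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None


# Synthetic 200 Hz tone sampled at 8 kHz, used only to exercise the function.
tone = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(1024)]
print(estimate_f0(tone, 8000))
```

Unlike this sketch, pYIN also decides for every frame whether it is voiced at all and smooths the pitch track over time, which matters for real speech with pauses and noise.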

While this approach had potential, it turned out to be more complex than we initially expected.


# 2.5 MEL scale spectrograms
Another idea we explored was to use spectrograms converted to the MEL scale. The MEL scale is a perceptual frequency scale designed to mimic the way humans perceive pitch: it is approximately linear at low frequencies and logarithmic at higher ones. As a result, it gives finer resolution to the lower frequencies, where most of the information in speech is concentrated, and compresses the higher part of the audible range (roughly 20 Hz to 20 kHz). This makes MEL spectrograms particularly suitable for voice analysis, as they highlight the critical frequency ranges associated with speech.
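
The compression of high frequencies can be seen from the conversion formula itself. The sketch below uses one commonly used variant, m = 2595 · log10(1 + f / 700); other variants exist, and which one a given spectrogram library applies is not stated in this report, so treat the choice as an assumption.

```python
import math


def hz_to_mel(frequency_hz):
    """Convert Hz to mel using the common 2595 * log10(1 + f / 700) formula."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# Equal 1 kHz steps in Hz shrink on the mel scale as frequency grows.
low_step = hz_to_mel(1000.0) - hz_to_mel(0.0)
high_step = hz_to_mel(8000.0) - hz_to_mel(7000.0)
print(round(low_step), round(high_step))
```

The step from 7 kHz to 8 kHz covers far fewer mels than the step from 0 to 1 kHz, which is why the upper part of the spectrum occupies relatively little space on a MEL axis.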

The main advantage of using MEL spectrograms is that they are more focused on the frequencies relevant to human voice, which can make the data more expressive and potentially simplify the learning process for the model. However, as shown in the examples below, MEL spectrograms also tend to have a lot of empty black space, representing frequency ranges that are cut off.

![Examples of MEL scale spectrogram](./EDA_screenshots/MEL_freq_example_1.png)
![Examples of MEL scale spectrogram](./EDA_screenshots/MEL_freq_example_2.png)

We trained a separate model specifically on MEL spectrograms. While this approach seemed promising at first, the model began to overfit the training data after just a few epochs. This is evident from the learning curves in the plot below: the high training accuracy combined with a significantly lower validation accuracy suggests that the model struggled to generalize to unseen data when trained on MEL spectrograms alone.

![Training and validation loss and accuracy curves](./EDA_screenshots/train_valid_los_ac_mel.png)
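
The kind of gap visible in the learning curves can also be flagged programmatically. The following is a minimal sketch, assuming per-epoch training and validation accuracies are stored as lists; the function name, the gap threshold, and the patience value are hypothetical choices for illustration, not values used in the project.

```python
def first_overfitting_epoch(train_acc, val_acc, max_gap=0.10, patience=2):
    """Return the 1-based epoch at which the train/validation accuracy gap
    first exceeds `max_gap` for `patience` consecutive epochs, or None.

    Both thresholds are illustrative assumptions, not values from the report.
    """
    streak = 0
    for epoch, (train, val) in enumerate(zip(train_acc, val_acc), start=1):
        # Extend the streak while the gap stays too large, reset otherwise.
        streak = streak + 1 if train - val > max_gap else 0
        if streak == patience:
            return epoch - patience + 1
    return None
```

A check like this can be used to decide where to stop training or which checkpoint to keep, instead of reading the epoch off the plot by eye.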