## Import Modules and Prerequisites
---
Please use this section to import any necessary modules that will be required later in this notebook like the example given.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# add any needed libraries
import os
import sys
import librosa
import librosa.display
from pathlib import Path
import seaborn as sns
sys.path.append(os.path.abspath(os.path.join('audio')))
from audioblock import audio_load, extract_features, split_to_signals, get_duration

sys.path.append(os.path.abspath(os.path.join('utils')))
from utils import load_pickle, plot_confusion_matrix

%matplotlib inline

## Automatic Speech Recognition
---
#### Note: There is no expectation of coding a highly sophisticated solution in this short time period. Each question can be answered either with a short code example, along with a possible written explanation of a more elaborate approach, or with models that are not highly tuned, given the limited resources and time available.

A common task in Acoustics is to predict the speaker from corresponding audio signals (speaker identification). In the provided corpus (see the project description), you can find transcripts under various speech settings and speaking conditions. 

### 1. Train a classifier on the Solo Speech condition dataset that will reach an acceptable accuracy score.
---
Feel free to follow any design choices you feel fit the problem best. Briefly describe your approach in markdown cells, along with any necessary comments on your choices. Explain your choices with the appropriate evaluation plots - analysis

In [None]:
#Path initialization
ROOT_PATH = Path.cwd()
ROOT_DATASET_PATH = ROOT_PATH.joinpath('dataset')
SOLO_DATASET_PATH = ROOT_DATASET_PATH.joinpath('data').joinpath('solo')
OUTPUT_PATH = ROOT_PATH.joinpath('output')

#### Audio Feature Set

I am going to use one kind of feature, known as MFCCs. The mel-frequency cepstral coefficients (MFCCs) of an audio signal are a small set of features (usually about 10–20) that describe the overall shape of the spectral envelope. MFCCs are widely used in voice recognition and are also used to describe “timbre”.

Furthermore, I will enrich the MFCC features with delta and delta-delta MFCCs. The idea behind using delta (differential) and delta-delta (acceleration) coefficients is that, in order to recognize speech better, we need to understand the dynamics of the power spectrum, i.e., the trajectories of the MFCCs over time.
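
To make the delta idea concrete, here is a minimal sketch of the textbook regression formula for delta coefficients, applied twice to get delta-deltas. It is an illustration only: the `delta` function and the synthetic `mfccs` array are hypothetical, and the `extract_features` helper used in this notebook may compute its deltas differently.

```python
import numpy as np

def delta(features, n=2):
    """Textbook regression delta along time; features has shape (n_coeffs, n_frames)."""
    features = np.asarray(features, dtype=float)
    n_frames = features.shape[1]
    # Repeat the edge frames so every frame has n neighbours on each side
    padded = np.pad(features, ((0, 0), (n, n)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, n + 1))
    out = np.zeros_like(features)
    for k in range(1, n + 1):
        ahead = padded[:, n + k : n + k + n_frames]    # c[t + k]
        behind = padded[:, n - k : n - k + n_frames]   # c[t - k]
        out += k * (ahead - behind)
    return out / denom

# Hypothetical stand-in for real MFCCs: 13 coefficients rising linearly over 12 frames
mfccs = np.tile(np.arange(12.0), (13, 1))
delta_mfccs = delta(mfccs)          # slope of each coefficient (1.0 away from the edges)
delta2_mfccs = delta(delta_mfccs)   # change of the slope (0.0 away from the edges)
features = np.vstack((mfccs, delta_mfccs, delta2_mfccs))
print(features.shape)  # (39, 12)
```

For a linear ramp the delta is constant and the delta-delta vanishes away from the padded edges, which is the "velocity" and "acceleration" reading of these features.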

It is common to use 13 MFCCs for this kind of task, so I will use 13 MFCCs as well.

Let's extract the above-mentioned features for an audio file of the SOLO dataset.

In [None]:
#Create a list which contains all the audio files' paths of solo dataset
audiofiles_solo = list((SOLO_DATASET_PATH).glob('**/*.wav'))

SAMPLE_RATE = 16000 #Hz
NUM_MFCC = 13
N_FFT = 128 #milliseconds
HOP_LENGTH = 32 #milliseconds
sig = audio_load(filepath=audiofiles_solo[0], sr=SAMPLE_RATE)

mfccs, delta_mfccs, delta2_mfccs = extract_features(
                    sig = sig,
                    sr = SAMPLE_RATE,
                    num_mfcc=NUM_MFCC,
                    n_fft=N_FFT,
                    hop_length=HOP_LENGTH
                )

Let's plot the waveform and the extracted features (MFCCs, deltas and delta-deltas).

In [None]:
fig, ax = plt.subplots(nrows=4, ncols=1, sharex=True, figsize = (20, 15))

librosa.display.waveshow(sig, sr=SAMPLE_RATE, ax=ax[0])  # put signal in row 0, 
librosa.display.specshow(mfccs, sr=SAMPLE_RATE ,x_axis='time', ax=ax[1]) # mfccs in row 1
librosa.display.specshow(delta_mfccs, sr=SAMPLE_RATE ,x_axis='time', ax=ax[2]) # deltas mfccs in row 2
librosa.display.specshow(delta2_mfccs, sr=SAMPLE_RATE ,x_axis='time', ax=ax[3]) # delta-deltas mfccs in row 3

After selecting MFCCs as my feature set, I had to decide what kind of classifier to use. I could have used a traditional machine learning algorithm such as KNN. However, deep neural networks often achieve higher accuracy on this kind of task, so I followed that path.

I chose a DNN with LSTM layers as the classifier. LSTM networks are well suited to classifying, processing and making predictions based on time-series data. Long short-term memory (LSTM) networks have proved effective with temporal features because they can handle the sequence of features using a type of memory.

The LSTM model expects all input feature sequences to have the same shape. This means that, in order to use an LSTM architecture, the features must be extracted over fixed-size time windows of the audio files.

However, our audio files do not all have the same length (duration). So I need to split each audio file into fixed time windows and extract the features for each segment. I chose a 2-second time window without overlap.
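
As a rough sketch of fixed-window splitting, the snippet below cuts a signal into windows given in milliseconds. The function `split_fixed_windows` is hypothetical and is not the `split_to_signals` helper from `audioblock`; in particular, dropping a trailing partial window is an assumption made here for simplicity.

```python
import numpy as np

def split_fixed_windows(sig, sr, size_ms, step_ms):
    """Return consecutive windows of size_ms, advancing by step_ms; partial tails are dropped."""
    size = int(sr * size_ms / 1000)   # window length in samples
    step = int(sr * step_ms / 1000)   # hop between window starts in samples
    starts = range(0, len(sig) - size + 1, step)
    return [sig[s : s + size] for s in starts]

sr = 16000
sig = np.zeros(int(3.5 * sr))  # hypothetical 3.5-second signal

no_overlap = split_fixed_windows(sig, sr, size_ms=2000, step_ms=2000)
half_overlap = split_fixed_windows(sig, sr, size_ms=2000, step_ms=1000)
print(len(no_overlap), len(half_overlap))  # 1 2
```

With a step equal to the window size there is no overlap, so a 3.5-second signal yields a single 2-second segment under this assumption.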

So let's split the above waveform into 2-second time windows and extract the features.

In [None]:
#Get the audio file's duration
duration = get_duration(sig, sr=SAMPLE_RATE)
print(f"The audio file's duration is {round(duration, 2)} seconds")

So we expect only one segment after the signal's splitting

In [None]:
WINDOW_SIZE = 2000 #milliseconds
WINDOW_STEP = 2000 #milliseconds, equal to the window size, so there is no overlap

segs = split_to_signals(sig, sr=SAMPLE_RATE, size=WINDOW_SIZE, slide=WINDOW_STEP)
print(f"The number of audio segments after signal splitting is: {len(segs)}")

Let's extract the MFCC, delta and delta-delta features for the 2-second segment and concatenate them in order to see the shape of the feature matrix that describes the audio segment.

In [None]:
N_FFT = 128 #milliseconds
HOP_LENGTH = 32 #milliseconds
NUM_MFCC = 13

mfccs, delta_mfccs, delta2_mfccs = extract_features(
    sig = segs[0],
    sr = SAMPLE_RATE,
    num_mfcc=NUM_MFCC,
    n_fft=N_FFT,
    hop_length=HOP_LENGTH
)

#vertical feature stack: rows are features, columns are timesteps
mfccs_conc = np.vstack((mfccs, delta_mfccs, delta2_mfccs))
print(mfccs_conc.shape)

Finally, as we can see, the 2-second time window is represented by 63 timesteps of 39 features each (13 MFCCs, 13 deltas and 13 delta-deltas).
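
The 63 x 39 shape can be checked with a little arithmetic. The sketch below assumes that `extract_features` converts the millisecond settings into samples and uses centered framing, which gives `1 + n_samples // hop` frames; both points are assumptions about the helper, not something shown in this notebook.

```python
SAMPLE_RATE = 16000  # Hz
WINDOW_MS = 2000     # window length in milliseconds
HOP_MS = 32          # hop length in milliseconds
NUM_MFCC = 13

n_samples = SAMPLE_RATE * WINDOW_MS // 1000   # 32000 samples per window
hop_samples = SAMPLE_RATE * HOP_MS // 1000    # 512 samples per hop
n_frames = 1 + n_samples // hop_samples       # centered framing (assumed)
n_features = 3 * NUM_MFCC                     # MFCCs + deltas + delta-deltas

print(n_frames, n_features)  # 63 39
```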

You can use flows/audio_feature_extraction.py to extract the above-mentioned audio features over fixed-size time windows for a set of audio files.

#### Setting up the experiment

Our target is to create a speaker identification system using the features extracted from all the audio files of the SOLO dataset.
This means that we need to use some data from every speaker for training and the rest for testing, in order to evaluate our system.
However, for our experiment to be valid, we must ensure that segments of the same audio file never appear in both the training and the test set.

Furthermore, we want to evaluate our model on every audio file and not only on a small part of the dataset.
For this reason I used a GroupKFold setup with 5 non-overlapping folds.
Each fold represents roughly 20% of the initial dataset.
The groups are created using the filename as the key.
So, audio segments of the same audio file are always in the same fold, avoiding any
data leakage between the train and test sets.

Furthermore, GroupKFold ensures that every audio file ends up in the test set exactly once,
because in each iteration 4 folds are used for training and the remaining one for testing, in a rotating manner.

Finally, I wanted a validation set for monitoring the model training. I followed the same method
and ran a second GroupKFold on the 4 training folds, using only its first iteration (5 splits, 4 for training and 1 for validation).
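
The nested, file-grouped splitting described above can be sketched in a few lines. This is a pure-Python illustration of the idea with a simple round-robin fold assignment; it is neither scikit-learn's `GroupKFold` implementation nor the code in flows/train_speaker_identification.py, and the file names and the helper `assign_group_folds` are hypothetical.

```python
def assign_group_folds(groups, n_splits=5):
    """Give every unique group one fold id (round-robin) and return the fold id per sample."""
    unique_groups = sorted(set(groups))
    fold_of_group = {g: i % n_splits for i, g in enumerate(unique_groups)}
    return [fold_of_group[g] for g in groups]

# Hypothetical data: 10 audio files, 3 two-second segments each
groups = [f"file_{i:02d}.wav" for i in range(10) for _ in range(3)]

splits = []
outer = assign_group_folds(groups, n_splits=5)
for test_fold in range(5):
    test_idx = [i for i, f in enumerate(outer) if f == test_fold]
    rest_idx = [i for i, f in enumerate(outer) if f != test_fold]
    # Second grouped split on the 4 remaining folds; only its first fold is used for validation
    inner = assign_group_folds([groups[i] for i in rest_idx], n_splits=5)
    val_idx = [i for i, f in zip(rest_idx, inner) if f == 0]
    train_idx = [i for i, f in zip(rest_idx, inner) if f != 0]
    splits.append((train_idx, val_idx, test_idx))

for train_idx, val_idx, test_idx in splits:
    print(len(train_idx), len(val_idx), len(test_idx))
```

Because folds are assigned per file rather than per segment, no file can contribute segments to more than one of the train, validation and test sets within an iteration.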

The code for all these can be found in flows/train_speaker_identification.py

The cell below shows the results of the method.

In [None]:
results_path = OUTPUT_PATH.joinpath('step1_results_groupkfold.pickle')
results = load_pickle(results_path)
results

Plot the confusion matrix (raw counts and normalized values)

In [None]:
plot_confusion_matrix(results['label'].values, results['prediction'].values, norm=False, fullpath=OUTPUT_PATH.joinpath('step1_cm.jpg'))
plot_confusion_matrix(results['label'].values, results['prediction'].values, norm=True, fullpath=OUTPUT_PATH.joinpath('step1_normalized_cm.jpg'))

### 2. Assuming that you needed to apply the learned rules / models on the Fast Speech condition dataset, without having that (test) dataset beforehand, what you would do?
---
The goal is to approach the classification accuracy obtained on the train dataset to the test dataset, without using the latter for training. Describe any challenges (if they exist) and code your solution below following the same guidelines 

Regarding this task, it is easy to guess that the model pretrained on the SOLO dataset will perform poorly on the FAST dataset. Let's confirm it.
Steps:
 1. Extract features for the FAST dataset using flows/audio_feature_extraction.py
 2. Use the model pretrained on the SOLO dataset to predict on the FAST dataset using flows/predict_speaker.py

The results are the following:

In [None]:
results_path = OUTPUT_PATH.joinpath('step2_results.pickle')
results = load_pickle(results_path)
results

Plot the confusion matrix (raw counts and normalized values)

In [None]:
plot_confusion_matrix(results['label'].values, results['prediction'].values, norm=False, fullpath=OUTPUT_PATH.joinpath('step2_cm.jpg'))
plot_confusion_matrix(results['label'].values, results['prediction'].values, norm=True, fullpath=OUTPUT_PATH.joinpath('step2_normalized_cm.jpg'))

The reason the model did not perform well is fairly clear. The pretrained model is an LSTM architecture that was trained to recognize speakers talking at a normal tempo. Talking very fast changes the sequence of MFCC features, and the model fails because it expects the kind of sequences it saw during training, or in other words fewer phonemes in a fixed-size time window.

So, to solve this problem we need a different solution, one in which speaking rate plays a much smaller role.
One such solution would be to use wav2vec 2.0.

Wav2vec 2.0 is a framework that achieves state-of-the-art results in ASR tasks, largely regardless of how fast a user speaks or how often they pause.
The idea is that, since the model achieves very good results on the speech-to-text problem, some intermediate layer of it is likely to capture the speaker's 'voice' well.
The LARGE wav2vec 2.0 model consists of a feature encoder with 7 blocks of CNN layers followed by a transformer encoder with 24 blocks (the BASE model has 12).
So, we can exploit its potential by extracting audio representations from an intermediate layer, as I did during my thesis (on speech emotion recognition), where this approach achieved state-of-the-art results.
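
One common way to turn frame-level representations into an input whose size does not depend on duration or speaking rate is to average them over time. The sketch below shows only that pooling step on random stand-in arrays; it does not load wav2vec 2.0, the `mean_pool` helper and all shapes are hypothetical, and the text above does not say which pooling (if any) was used in the thesis.

```python
import numpy as np

def mean_pool(frame_embeddings):
    """Average frame-level vectors over time: (n_frames, hidden_size) -> (hidden_size,)."""
    return np.asarray(frame_embeddings, dtype=float).mean(axis=0)

rng = np.random.default_rng(0)
HIDDEN_SIZE = 8  # hypothetical; real wav2vec 2.0 layers are much wider

# Random stand-ins for intermediate-layer outputs of two utterances of different length
slow_utterance = rng.normal(size=(120, HIDDEN_SIZE))
fast_utterance = rng.normal(size=(70, HIDDEN_SIZE))

slow_vec = mean_pool(slow_utterance)
fast_vec = mean_pool(fast_utterance)
print(slow_vec.shape, fast_vec.shape)  # (8,) (8,)
```

Both utterances end up as vectors of the same size, which a simple classifier could consume without any fixed-window assumption.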

### 3. Another important task is to perform gender classification on the same datasets, but there are no available labels. You can use the entirety of data you have at your disposal. Describe possible approaches to this problem and code the most robust solution of your choice. 

In general, women speak at a higher pitch—about an octave higher than men. An adult woman's average range is from 165 to 255 Hz, while a man's is 85 to 155 Hz.

Due to the lack of gender labels, I will try to classify the gender using the fundamental frequency of the corresponding audio file and a statistical rule. To extract the fundamental frequency I used an algorithm called Probabilistic YIN (PYIN), which is implemented in the librosa library.

PYIN is a modification of the YIN algorithm for the estimation of the fundamental frequency (F0) but with better results. The first stage of PYIN follows the same steps as the original YIN algorithm, differing only in the thresholding stage, where it assumes a threshold distribution, in contrast to YIN, which relies on a single threshold. 

The PYIN method is divided into two stages: 
- frame-wise extraction of multiple pitch candidates with associated probabilities 
- these probabilities are used as emission probabilities in a Hidden Markov Model, which tries to find the most likely pitch sequence given the emissions
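
To illustrate the second stage, here is a heavily simplified Viterbi decoder over pitch candidates. It is a toy sketch, not PYIN's actual HMM (which also models voiced/unvoiced states and uses its own transition distribution); the function `viterbi_pitch`, the jump penalty `jump_scale_hz` and the numbers are hypothetical.

```python
import numpy as np

def viterbi_pitch(candidates_hz, emission_probs, jump_scale_hz=20.0):
    """Choose one pitch candidate per frame, trading emission probability against pitch jumps."""
    candidates_hz = np.asarray(candidates_hz, dtype=float)    # (n_states,)
    log_emit = np.log(np.asarray(emission_probs) + 1e-12)     # (n_frames, n_states)
    n_frames, n_states = log_emit.shape
    # Transition log-scores: the larger the pitch jump, the larger the penalty
    log_trans = -np.abs(candidates_hz[:, None] - candidates_hz[None, :]) / jump_scale_hz

    score = log_emit[0].copy()
    backptr = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        total = score[:, None] + log_trans        # total[i, j]: previous state i -> state j
        backptr[t] = np.argmax(total, axis=0)
        score = total.max(axis=0) + log_emit[t]

    # Trace the best path backwards from the best final state
    path = np.zeros(n_frames, dtype=int)
    path[-1] = int(np.argmax(score))
    for t in range(n_frames - 1, 0, -1):
        path[t - 1] = backptr[t, path[t]]
    return candidates_hz[path]

candidates = [100.0, 200.0]
probs = np.array([[0.1, 0.9], [0.1, 0.9], [0.55, 0.45], [0.1, 0.9]])

framewise = np.array(candidates)[probs.argmax(axis=1)]
track = viterbi_pitch(candidates, probs)
print(framewise)
print(track)
```

Picking the most probable candidate frame by frame produces an octave-like jump in the third frame, while the decoded track stays at 200 Hz because the jump penalty outweighs the small difference in emission probability.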


Let's extract the fundamental frequency of an audio file from the SOLO dataset.
I will use librosa's default parameters for the PYIN algorithm, but I will search for F0 between 50 and 300 Hz.

In [None]:
SAMPLE_RATE = 22050

#Load audio file
sig = audio_load(filepath=audiofiles_solo[0], sr=SAMPLE_RATE)

#Apply the PYIN algorithm for extracting the F0 of the signal
f0, voiced_flag, voiced_probs = librosa.pyin(
    sig,
    fmin=50,
    fmax=300,
    sr=SAMPLE_RATE
)

times = librosa.times_like(f0, sr=SAMPLE_RATE)  # timestamp (in seconds) of each F0 frame

Overlay F0 over a spectrogram

In [None]:
D = librosa.amplitude_to_db(np.abs(librosa.stft(sig)), ref=np.max)
fig, ax = plt.subplots()
img = librosa.display.specshow(D, x_axis='time', y_axis='log', ax=ax)
ax.set(title='pYIN fundamental frequency estimation')
fig.colorbar(img, ax=ax, format="%+2.f dB")
ax.plot(times, f0, label='f0', color='cyan', linewidth=3)
ax.legend(loc='upper right')

As we can see from the plot, the F0 values are about 200 Hz, which suggests that the voice belongs to a woman.

In general, to predict gender from the fundamental frequency we need to set a statistical rule. As mentioned above, an adult woman's average range is from 165 to 255 Hz, while a man's is 85 to 155 Hz, so we can set a threshold at 160 Hz. I compare this threshold with the most representative value of F0 for a recording, which I take to be the median of the F0 vector. So:
- If median F0 > 160 Hz, the speaker is predicted to be female
- If median F0 <= 160 Hz, the speaker is predicted to be male
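
The rule above can be sketched as a small function. With its default settings, `librosa.pyin` marks unvoiced frames as NaN, so the sketch takes the median over voiced frames only; the function name `classify_gender` and the toy F0 tracks are hypothetical, and flows/gender_classification.py may compute the median differently.

```python
import numpy as np

def classify_gender(f0, threshold_hz=160.0):
    """Apply the median-F0 threshold rule; returns None if no voiced frame is available."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0[~np.isnan(f0)]          # drop unvoiced (NaN) frames
    if voiced.size == 0:
        return None
    return "female" if np.median(voiced) > threshold_hz else "male"

# Toy F0 tracks in Hz (hypothetical values), NaN marks unvoiced frames
print(classify_gender([np.nan, 200.0, 210.0, np.nan, 190.0]))  # female
print(classify_gender([110.0, np.nan, 120.0, 115.0]))          # male
```

Using the median rather than the mean makes the rule less sensitive to occasional octave errors in the F0 track.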

Using flows/gender_classification.py, a dataset is created by calculating the median and mean value of F0 for each audio file.
Also, I created a mapping dictionary between each speaker's name and gender from the dataset's description PDF file, in order to evaluate my method.

For this step, I used both the SOLO and FAST datasets.

Let's load the extracted dataset.

In [None]:
dataset_path = OUTPUT_PATH.joinpath('step3_dataset.pickle')
dataset = load_pickle(dataset_path)
dataset

Let's plot a boxplot of the median F0 distribution for male and female speakers.

In [None]:
sns.boxplot(x='gender', y='median_f0', data=dataset)

The above boxplot suggests that the rule I set should recognize the gender with high accuracy.

Let's apply the rule and see the results.

In [None]:
#Predict gender based on rule
threshold = 160 #Hz
dataset["prediction"] = dataset["median_f0"].apply(lambda x: 'female' if x>threshold else 'male')

Plot confusion matrix

In [None]:
plot_confusion_matrix(dataset['gender'].values, dataset['prediction'].values, norm=False, fullpath=OUTPUT_PATH.joinpath('step3_cm.jpg'), figsize=(8,5))
plot_confusion_matrix(dataset['gender'].values, dataset['prediction'].values, norm=True, fullpath=OUTPUT_PATH.joinpath('step3_normalized_cm.jpg'), figsize=(8,5))

## Thank you in advance. Good luck!