# Introduction

Do not spend too much time chasing tiny metric improvements. Once you have a model with reasonable predictive power, your time is better spent explaining your data cleaning & preparation pipeline and providing explanations & visualizations of the results.

The goal is to see your fit with our company culture & engineering needs. Spending 50h on an over-complicated approach will not earn you bonus points compared to a simple yet effective, to-the-point solution.

## About the data

The dataset you will be working with is called Emo-DB and can be found [here](http://emodb.bilderbar.info/index-1280.html).

It is a database containing samples of emotional speech in German. It contains samples labeled with one of 7 different emotions: Anger, Boredom, Disgust, Fear, Happiness, Sadness and Neutral. 

Please download the full database and refer to the documentation to understand how the samples are labeled (see "Additional information").
   
The goal of this project is to develop a model which is able to **classify samples of emotional speech**. Feel free to use any available library you need, but beware of re-using someone else's code without mentioning it!

## Deliverable

The end-goal is to deliver us a zip file containing:
* This report filled with your approach, in the form of an **iPython Notebook**.
* A **5-10 slide PDF file** containing a technical presentation covering the important aspects of your work.
* A Dockerfile which defines a container for the project. The container should handle everything (download the data, run the code, etc.). When run, the container should expose the Jupyter notebook on one port and a Flask API on another. The Flask app contains two endpoints:
  - One for training the model
  - One for querying the last trained model with an audio file of our choice in the dataset
* A README.md which should contain the commands to build and run the docker container, as well as how to perform the queries to the API. 
* Any necessary .py, .sh or other files needed to run your code.

# Solution

My solution follows the approach proposed in this reference paper. 

Source: Dias Issa, M. Fatih Demirci, Adnan Yazici,
Speech emotion recognition with deep convolutional neural networks,
Biomedical Signal Processing and Control,
Volume 59,
2020,
101894,
ISSN 1746-8094,
https://doi.org/10.1016/j.bspc.2020.101894.

The authors propose a CNN-based architecture for emotion recognition. In the paper, they claim that their method outperforms all existing methods with the exception of one:

Jianfeng Zhao, Xia Mao, Lijiang Chen,
Speech emotion recognition using deep 1D & 2D CNN LSTM networks,
Biomedical Signal Processing and Control,
Volume 47,
2019,
Pages 312-323,
ISSN 1746-8094,
https://doi.org/10.1016/j.bspc.2018.08.035.

Since the latter paper is more demanding to implement and the difference in performance between the two approaches is small (less than 1\%), I decided that the first approach was the better fit for the scope of this project.

# Libraries Loading

In [None]:
import pandas as pd
import librosa, librosa.display
import cv2
import numpy as np
from matplotlib import pyplot as plt
from glob import glob
import os
import random

In [None]:
def filename(x):
    return os.path.split(x)[1]

I kept the dataset's original sampling rate (16 kHz) to avoid upsampling.

In [None]:
sr = 16000 # Expected sampling rate

In [None]:
def load_sample(x):
    signal, _sr = librosa.load(x, sr=None)
    assert _sr == sr
    return signal

In [None]:
# Paths
test_dir = "/data/test"
raw_data_dir = "/raw_data"
flask_dir = "/src/flask_app"
models_dir = "/models"

# Data Preparation & Cleaning

Data preparation and augmentation are handled automatically at container startup. This way the user does not need to run any command other than querying the API, and the notebook is entirely optional as it only contains data visualization.

First, the container translates the German character code for the emotion in the filename into an easily readable English string; for example, 'W' is mapped to 'anger'. The filename map is provided as a JSON file so that the Python source code does not need to be inspected or modified to change the mapping. This is a general practice I applied several times in this project: parameters are kept separate from the Python code, and I chose JSON as the format for config objects.

Note that what the reference paper calls fear is called anxiety here, as the German word 'Angst' can have both meanings.
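
The filename translation described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper `parse_emodb_name` and the inlined JSON layout are hypothetical, and the letter codes other than 'W' are taken from the Emo-DB documentation rather than from this report.

```python
import json

# Emotion letter codes used in Emo-DB file names (German initials).
# In the project this mapping lives in a JSON config file; it is inlined
# here as a string (an assumption) so that the example is self-contained.
EMOTION_MAP_JSON = """
{
    "W": "anger",
    "L": "boredom",
    "E": "disgust",
    "A": "anxiety",
    "F": "happiness",
    "T": "sadness",
    "N": "neutral"
}
"""


def parse_emodb_name(name, emotion_map):
    """Split an Emo-DB file name such as '16a07Fb.wav' into its fields."""
    stem = name.rsplit(".", 1)[0]
    return {
        "speaker": int(stem[:2]),          # two-digit speaker id
        "text": stem[2:5],                 # sentence code, e.g. 'a07'
        "emotion": emotion_map[stem[5]],   # German letter -> English label
        "version": stem[6:],               # version letter of the recording
    }


emotion_map = json.loads(EMOTION_MAP_JSON)
info = parse_emodb_name("16a07Fb.wav", emotion_map)
print(info)
```

Keeping the mapping in JSON means a label can be renamed (for instance 'anxiety' to 'fear') without touching the parsing code.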

Next, the container performs a random train/test split with a test size of 20\%, as this roughly corresponds to the test size used in the reference paper.
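
A random 80/20 split of this kind could look like the sketch below. The function name, the fixed seed and the placeholder file names are assumptions made for illustration; the container's actual split logic may differ.

```python
import random


def split_train_test(files, test_fraction=0.2, seed=0):
    """Randomly split a list of file names into (train, test) lists."""
    rng = random.Random(seed)   # fixed seed (an assumption) for reproducibility
    shuffled = sorted(files)    # sort first so the result does not depend on input order
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


files = [f"sample_{i:03d}.wav" for i in range(50)]  # placeholder names
train, test = split_train_test(files)
print(len(train), len(test))
```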

In order to avoid bias, we never inspect the test set and only show some analysis of train samples.

In [None]:
# Filter out the test samples
test_set = {filename(x) for x in glob(os.path.join(test_dir, "*/*.wav"))}
x_train = [x for x in glob(os.path.join(raw_data_dir, "*.wav")) if filename(x) not in test_set]

In [None]:
# Shuffle for unbiased data exploration
random.shuffle(x_train)

The audio samples appear to be amplitude-normalized. There are several peaks and the amplitude is not stable, as we would expect from a signal as complex as human speech.

In [None]:
for i in range(10):
    signal = load_sample(x_train[i])
    librosa.display.waveplot(signal, sr=sr)
    plt.xlabel("Time")
    plt.ylabel("Amplitude")
    plt.show()

Next we show the frequency domain. We can observe that most of the energy is concentrated at low frequencies.

In [None]:
for i in range(10):
    signal = load_sample(x_train[i])
    fft = np.fft.fft(signal)
    magnitude = np.abs(fft)
    frequency = np.linspace(0, sr//2, len(magnitude)//2)
    plt.plot(frequency, magnitude[:(len(magnitude)//2)])
    plt.xlabel("Frequency")
    plt.ylabel("Magnitude")
    plt.show()
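
The observation that most energy sits at low frequencies can also be quantified. The sketch below is only an illustration: it uses a synthetic two-tone signal instead of an Emo-DB sample, and the helper name and the 1 kHz cutoff are assumptions.

```python
import numpy as np


def low_band_energy_fraction(signal, sr, cutoff_hz):
    """Fraction of the spectral energy located at or below cutoff_hz."""
    spectrum = np.fft.rfft(signal)                     # one-sided FFT of a real signal
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)   # frequency of each bin in Hz
    energy = np.abs(spectrum) ** 2
    return energy[freqs <= cutoff_hz].sum() / energy.sum()


# Synthetic stand-in for a speech sample: a strong 200 Hz tone plus a weak 5 kHz tone.
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 200 * t) + 0.1 * np.sin(2 * np.pi * 5000 * t)
fraction = low_band_energy_fraction(signal, sr, cutoff_hz=1000)
print(round(fraction, 3))
```

Applied to a real sample, `signal` would be the output of `load_sample`.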

Moreover, the spectrogram doesn't show noisy patterns such as speckles or high amplitude concentrated at specific frequencies.

In [None]:
for i in range(10):
    signal = load_sample(x_train[i])
    stft = librosa.core.stft(signal)
    spectrogram = librosa.amplitude_to_db(np.abs(stft))
    librosa.display.specshow(spectrogram, sr=sr)
    plt.xlabel("Time")
    plt.ylabel("Frequency")
    plt.colorbar()
    plt.show()

If we sort the samples by speaker and text index, we observe that the distribution of sequence lengths depends on the speaker. Each speaker is assigned a random color to tell them apart, and we can observe that some speakers tend to speak quickly (such as the 4th) while others take more time (the second speaker being the slowest).

In [None]:
plt.figure(figsize=(12,7))
plt.title("Sequence Lengths per Speaker")
plt.grid(True)
plt.xlabel("Samples Sorted by Speaker and Text Index")
plt.ylabel("Time (sec)")
color_assign = {}
colors = []
lengths = []
sorted_x_train = sorted(x_train)

# assign a random color index to each speaker so that
# neighbouring speakers don't get similar colors
speakers = sorted({filename(file)[:2] for file in sorted_x_train})
assignments = np.random.permutation(len(speakers))
color_assign = dict(zip(speakers, assignments))

# plot sequence length distribution across speakers
for file in sorted_x_train:
    signal = load_sample(file)
    speaker = filename(file)[:2]
    lengths.append(len(signal)/sr)
    colors.append(color_assign[speaker])
plt.scatter(np.arange(0, len(lengths)), lengths, marker='o', cmap='jet', s=15, c=colors)
pass

Nonetheless, disregarding outliers, most sequences last between 1 and 6 seconds.

In [None]:
plt.title("Distribution of the sequence lengths")
plt.ylabel("Time (sec)")
plt.boxplot(lengths)
pass
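
For reference, the points a box plot draws as outliers are, with Matplotlib's default settings, those beyond 1.5 times the interquartile range from the quartiles. The sketch below computes these bounds on placeholder durations; the helper name is hypothetical, and in the notebook `lengths` would take the place of `durations`.

```python
import numpy as np


def whisker_bounds(values, k=1.5):
    """Return the (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Placeholder durations in seconds, not measured values.
durations = np.array([1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 9.0])
low, high = whisker_bounds(durations)
outliers = durations[(durations < low) | (durations > high)]
print(low, high, outliers)
```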

## Data Augmentation

Concerning data augmentation, I followed an approach similar to the one suggested in the reference paper. It consists of two transformations, namely shifting the audio sample by a small margin and stretching it to speed it up or slow it down.

For the shift offset, I manually inspected several training samples picked uniformly at random and measured the maximum difference between the onsets of the utterances. I found the longest offset to be around 100 ms. I then doubled that value and picked 200 ms as the maximum shift. The random shift is drawn uniformly at random from the range [0, 200] ms.

For slowing down or speeding up the audio I used the same parameters as in the reference paper, namely 0.81x and 1.23x.

However, I did not add noise to the samples as in the paper because the network already performed well without it.
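
The two transformations can be sketched with NumPy as below. This is an illustration under stated assumptions, not the project's implementation: the shift is modelled as prepending silence, and `change_speed` is a plain linear-interpolation resampling, which also shifts the pitch. A pitch-preserving alternative would be `librosa.effects.time_stretch`. Both function names are hypothetical.

```python
import numpy as np


def random_shift(signal, sr, max_shift_ms=200, rng=None):
    """Delay the signal by prepending a random amount of silence in [0, max_shift_ms]."""
    rng = rng or np.random.default_rng()
    max_shift = int(sr * max_shift_ms / 1000)
    shift = int(rng.integers(0, max_shift + 1))   # upper bound is exclusive, hence +1
    return np.concatenate([np.zeros(shift, dtype=signal.dtype), signal])


def change_speed(signal, rate):
    """Resample by linear interpolation so that playback is `rate` times faster.

    Unlike a phase-vocoder time stretch, this also changes the pitch.
    """
    n_out = int(round(len(signal) / rate))
    positions = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(positions, np.arange(len(signal)), signal)


rng = np.random.default_rng(0)
sr = 16000
# One second placeholder tone standing in for a speech sample.
signal = np.sin(2 * np.pi * 220 * np.arange(sr) / sr).astype(np.float32)
augmented = [
    random_shift(signal, sr, rng=rng),
    change_speed(signal, 0.81),   # slower, so the result is longer
    change_speed(signal, 1.23),   # faster, so the result is shorter
]
print([len(a) for a in augmented])
```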

# Feature Engineering & Modeling

In the reference paper the authors leverage multiple speech features at once, namely 
1. MFCC, 
2. Mel-scaled spectrogram, 
3. Chromagram, 
4. Spectral contrast,
5. Tonnetz representation. 

This approach is recent (the paper was published in 2020) and has shown competitive performance against RNN-based models. Thus, I extracted and concatenated multiple features from each sample, for a total of 193 values fed to the neural network classifier. However, the paper does not specify how many values each feature contributes. Fortunately, the tonnetz representation always produces 6 values, and a natural number of bands for the spectral contrast is also 6, which results in 7 values. This leaves 180 values to be generated by the remaining 3 features. Hence, I guessed that the paper uses 60 values per feature.

Below I show one of these features, namely the MFCC coefficients.

In [None]:
for i in range(10):
    signal = load_sample(x_train[i])
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=60)
    librosa.display.specshow(mfcc, sr=sr)
    plt.xlabel("Time")
    plt.ylabel("MFCC")
    plt.colorbar()
    plt.show()

In order to batch the sequences for neural network training we have to handle the varying sizes of each sequence. In the reference paper the authors reduce the sequences to the mean of each feature along time. This way, there is no need to handle padding / truncation of the signal.
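
The time-averaging and concatenation step can be sketched as follows. Random matrices stand in for the real feature extractors so the example stays self-contained, and the 60/60/60/7/6 split is the guess described earlier in this section, not a figure confirmed by the paper.

```python
import numpy as np

# Stand-in feature matrices of shape (n_values, n_frames). In practice these
# would come from the feature extractors; n_frames varies from sample to sample.
n_frames = 120
rng = np.random.default_rng(0)
features = {
    "mfcc": rng.normal(size=(60, n_frames)),
    "mel": rng.normal(size=(60, n_frames)),
    "chroma": rng.normal(size=(60, n_frames)),
    "contrast": rng.normal(size=(7, n_frames)),
    "tonnetz": rng.normal(size=(6, n_frames)),
}


def pool_and_concatenate(feature_matrices):
    """Average each feature over time, then join the results into one vector."""
    return np.concatenate([m.mean(axis=1) for m in feature_matrices.values()])


vector = pool_and_concatenate(features)
print(vector.shape)  # (193,)
```

Because the mean removes the time axis, every sample yields a vector of the same size regardless of its duration.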


## Modeling

The selected model corresponds to 'Model A' in the reference paper. The authors propose various ensembles in order to boost performance by a few points. However, for the scope of this project the gain in performance is not worth the amount of work and resources required as a model with reasonable performance is sufficient.

The neural network is composed of two units, each of which has several 1-d convolutions of size 5 and ReLUs (plus batchnorm and dropout). Before running the second unit the signal is pooled with a relatively large window (8) so that the receptive field of the second unit is large enough to blend multiple features. 

In [None]:
# Diagram depicting the neural network's architecture
model = cv2.imread(os.path.join(flask_dir, "model.jpg"), cv2.IMREAD_GRAYSCALE)
plt.figure(figsize=(10,10))
plt.imshow(model, cmap='gray')

All hyperparameters were set according to the values reported in the paper. 

To train the neural network we use the Flask API, which I designed in a modular fashion: one can extend the API with more architectures and reuse the same framework for training and testing. For the time being, only the model described above is implemented. To start its training, query the following endpoint:

In [None]:
!curl http://localhost:5000/train

Successive queries to this same endpoint return information about the progress of the training.

For the scope of this demo, I provided pre-trained model weights so that training is not required for following the analysis of results in this report. Moreover, queries are automatically configured to rely on the pre-computed weights too.

We can observe that with the given setup the loss decreases smoothly without oscillation and that the model converges after roughly the same number of epochs as reported in the paper.

In [None]:
with open(os.path.join(models_dir, "best_cnn/log_loss.txt")) as loss_file:
    loss = np.array([float(x) for x in loss_file.read().split('\n') if x])
plt.title("Loss per test iteration")
plt.grid(True)
plt.xlabel('Test Iteration')
plt.ylabel('Cross-Entropy Loss')
plt.plot(loss)

# Results & Visualizations

For the analysis of results, we can query the Flask API for single items with the following command:

In [None]:
! curl "http://localhost:5000/predict?sample=<some_sample>"

You can obtain the list of test samples with this other command:

In [None]:
! curl "http://localhost:5000/predict"

In [None]:
! curl "http://localhost:5000/predict?sample=16a07Fb.wav"
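
The same queries can be issued from Python instead of `curl`. The sketch below uses only the standard library; the helper names are hypothetical, and since the response format is not documented here the body is returned as raw text.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:5000"  # host and port used in the curl examples


def predict_url(sample=None):
    """Build the /predict URL; without a sample the endpoint lists the test samples."""
    if sample is None:
        return f"{BASE_URL}/predict"
    return f"{BASE_URL}/predict?{urlencode({'sample': sample})}"


def query_prediction(sample=None, timeout=10):
    """Send the request and return the raw response body as text."""
    with urlopen(predict_url(sample), timeout=timeout) as response:
        return response.read().decode("utf-8")


print(predict_url("16a07Fb.wav"))
```

Calling `query_prediction("16a07Fb.wav")` requires the container to be running.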

Alternatively, I prepared a pickle containing the predictions for all test samples. This is equivalent to querying every utterance in the test set.

In [None]:
predictions = pd.read_pickle(os.path.join(models_dir, "best_cnn/test/predictions.pkl"))

In [None]:
predictions.head()

Although the loss decreases monotonically, I observed that the accuracy reached a plateau after a number of test iterations.

In [None]:
with open(os.path.join(models_dir, "best_cnn/log_accuracy.txt")) as acc_file:
    acc = []
    for line in acc_file.read().split('\n'):
        if not line:
            continue
        acc.append(float(line.split(",")[0]))
plt.figure(figsize=(10,6))
plt.title("Accuracy per test iteration")
plt.grid(True)
plt.yticks(np.arange(0,1,0.025))
plt.xlabel('Test Iteration')
plt.ylabel('Accuracy')
plt.plot(acc)

I observed that after 300 epochs the max accuracy is around 0.90. This is surprising as the accuracy reported in the paper is 0.82.

In [None]:
# Max accuracy
(predictions.Prediction == predictions.Truth).sum() / len(predictions)

After further inspection of the reference paper, I found that the accuracy reported for Emo-DB across publications is inconsistent, as some researchers discarded 15 samples from the dataset. The authors show that on that subset they obtained around 0.95 accuracy with their best model (an ensemble of 7 binary classifiers). The paper does not report the accuracy of the simple CNN that I implemented on this subset. However, according to the authors, on the full dataset it is only slightly less accurate than the ensemble of 7 models. Thus, it is plausible that the CNN would reach around 0.90 accuracy on the subset of Emo-DB.

As the test set is relatively small (around 100 samples), these 15 difficult examples can have a large impact. Indeed, in the worst case all of them could end up in the test set, which could lower the accuracy by up to 0.15 compared to the case where they are all in the training set. In conclusion, the measured accuracy has high variance depending on the split, and this could reasonably explain the 0.08 difference that I observed. Arguably, implementing 5-fold cross-validation as in the reference paper would mitigate the issue, as the reported accuracy would be an average over 5 different splits and variance decreases with averaging.
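
The effect of the split on the measured accuracy can be illustrated with a small simulation. It rests on simplifying assumptions: a dataset of 535 samples with a 20\% test set of 107 (consistent with the roughly 100 test samples mentioned above), every difficult sample in the test set counted as misclassified, and independent random splits rather than true cross-validation folds.

```python
import random
import statistics


def hard_samples_in_test(n_total, n_hard, n_test, rng):
    """Count how many 'hard' samples land in one random test split."""
    test_indices = rng.sample(range(n_total), n_test)
    # Indices 0 .. n_hard-1 play the role of the hard samples.
    return sum(1 for i in test_indices if i < n_hard)


rng = random.Random(0)
n_total, n_hard, n_test = 535, 15, 107   # assumed dataset size and 20% test split
counts = [hard_samples_in_test(n_total, n_hard, n_test, rng) for _ in range(2000)]

# Accuracy lost if every hard sample in the test set were misclassified.
drops = [c / n_test for c in counts]
print(f"mean drop {statistics.mean(drops):.3f}, "
      f"std {statistics.pstdev(drops):.3f}, max {max(drops):.3f}")

# Averaging groups of 5 independent splits shrinks the spread.
averaged = [statistics.mean(drops[i:i + 5]) for i in range(0, len(drops), 5)]
print(f"std of 5-split averages {statistics.pstdev(averaged):.3f}")
```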

The confusion matrix below shows that anxiety and disgust are the hardest emotions to recognize.

In [None]:
matrix = np.zeros((7, 7))
for i in range(7):
    i_truth = predictions[predictions.Truth == i]
    i_predictions = i_truth.groupby('Prediction').size()
    for j in i_predictions.index:
        matrix[i, j] = i_predictions[j] / len(i_truth)
    
classes = ['anger', 'anxiety', 'boredom', 'disgust', 'happiness', 'neutral', 'sadness']
confusion_matrix = pd.DataFrame(matrix, index = classes, columns = classes)

In [None]:
confusion_matrix

93% of the predictions have a confidence above 50%.

In [None]:
high_confidence = predictions[predictions.Confidence > 0.5]

In [None]:
len(high_confidence) / len(predictions)

However, removing low-confidence predictions does not improve accuracy significantly.

In [None]:
# Max accuracy for high confidence
(high_confidence.Prediction == high_confidence.Truth).sum() / len(high_confidence)
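
The trade-off between accuracy and the share of predictions kept can be examined at several thresholds. The sketch below runs on hypothetical records, not on the model's actual output, and the helper name is an assumption.

```python
def accuracy_and_coverage(records, threshold):
    """Accuracy and kept fraction for predictions with confidence above threshold.

    records: list of (prediction, truth, confidence) tuples.
    """
    kept = [(p, t) for p, t, c in records if c > threshold]
    if not kept:
        return float("nan"), 0.0
    accuracy = sum(p == t for p, t in kept) / len(kept)
    coverage = len(kept) / len(records)
    return accuracy, coverage


# Hypothetical predictions for illustration only.
records = [(0, 0, 0.95), (1, 1, 0.80), (2, 3, 0.45), (4, 4, 0.60), (5, 6, 0.55)]
for threshold in (0.0, 0.5, 0.7):
    acc, cov = accuracy_and_coverage(records, threshold)
    print(f"threshold {threshold:.1f}: accuracy {acc:.2f}, coverage {cov:.2f}")
```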

The trained model has 3\% better accuracy for female speakers than for male speakers.

In [None]:
# Add fields inferring them from the file name
predictions['Speaker'] = [int(x[:2]) for x in predictions.index]
predictions['Text'] = [x[2:5] for x in predictions.index]

In [None]:
female_speakers = {8, 9, 13, 14, 16}
predictions_on_female = predictions[predictions.apply(lambda x: x.Speaker in female_speakers, axis=1)]
# Max accuracy for female speakers
(predictions_on_female.Prediction == predictions_on_female.Truth).sum() / len(predictions_on_female)

In [None]:
predictions_on_male = predictions[predictions.apply(lambda x: x.Speaker not in female_speakers, axis=1)]
# Max accuracy for male speakers
(predictions_on_male.Prediction == predictions_on_male.Truth).sum() / len(predictions_on_male)

Here is the accuracy per sentence (text index):

In [None]:
performance_by_text = predictions.groupby('Text').apply(lambda df: len(df[df.Prediction == df.Truth]) / len(df))
plt.bar(performance_by_text.index, performance_by_text.values)

There is a weak correlation between sequence length and prediction accuracy. It seems that the model performs slightly better on longer sequences.

In [None]:
seq_lengths = []
# Sort by file name (not full path) so the order matches predictions.sort_index()
for file in sorted(glob(os.path.join(test_dir, "*/*.wav")), key=filename):
    signal = load_sample(file)
    seq_lengths.append(len(signal) / sr)
equals = predictions.Prediction == predictions.Truth
equals = equals.sort_index()
equals = equals.reset_index()
equals = equals[0] # convert equals to series

In [None]:
equals.corr(pd.Series(seq_lengths))

Lastly, we can correlate sentence complexity with accuracy. We estimate complexity as the time required to utter the sentence, computed as the median of the sequence lengths for a given text across speakers.

In [None]:
def getTextLength(df):
    """Returns the length of the sentence estimated by the median of the sequence lengths along all speakers"""
    seq = []
    for i, row in df.iterrows():
        signal = load_sample(os.path.join(raw_data_dir, i))
        time = len(signal) / sr
        seq.append(time)
    return np.median(seq)

In [None]:
sentence_lengths = predictions.groupby('Text').apply(getTextLength)

As expected, performance is negatively affected by sentence length.

In [None]:
performance_by_text.corr(sentence_lengths)