# Introduction

Do not spend too much time chasing tiny metric improvements. Once you have a model with reasonable predictive power, your time is better spent explaining your data cleaning & preparation pipeline and providing explanations & visualizations of the results.

The goal is to assess your fit with our company culture & engineering needs. Spending 50h on an over-complicated approach will not earn you bonus points compared to a simple yet effective, to-the-point solution.

## About the data

The dataset you will be working with is called Emo-DB and can be found [here](http://emodb.bilderbar.info/index-1280.html).

It is a database containing samples of emotional speech in German. It contains samples labeled with one of 7 different emotions: Anger, Boredom, Disgust, Fear, Happiness, Sadness and Neutral. 

Please download the full database and refer to the documentation to understand how the samples are labeled (see "Additional information").
   
The goal of this project is to develop a model which is able to **classify samples of emotional speech**. Feel free to use any available library you need, but beware of re-using someone else's code without mentioning it!

## Deliverable

The end-goal is to deliver us a zip file containing:
* This report filled with your approach, in the form of an **iPython Notebook**.
* A **5-10 slide PDF file** containing a technical presentation that covers the important aspects of your work.
* A Dockerfile which defines a container for the project. The container should handle everything (download the data, run the code, etc.). When running, the container should expose the Jupyter notebook on one port and a Flask API on another. The Flask app contains two endpoints:
  - One for training the model
  - One for querying the last trained model with an audio file of our choice in the dataset
* A README.md which should contain the commands to build and run the docker container, as well as how to perform the queries to the API. 
* Any necessary .py, .sh or other files needed to run your code.

```
    AUTHOR: Niclas Simmler
    DATE: April 22, 2021
```

# Libraries Loading

First, we need to do some basic setup. We activate autoreload and make sure the source code in `src` is importable from Jupyter.

In [1]:
%load_ext autoreload
%autoreload 2
%config Application.log_level="DEBUG"
import os
import sys

PROJECT_BASE_PATH = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
DATA_BASE_PATH = os.path.join(PROJECT_BASE_PATH, 'data')
SRC_BASE_PATH = os.path.join(PROJECT_BASE_PATH, 'src')
sys.path.insert(0, SRC_BASE_PATH)

Let's load our required modules.

We start off with our very own `ser` module. It holds all relevant functionality.

In [2]:
import ser

Next, we will load some other common modules which we need.

In [3]:
import pandas as pd

# Data Preparation & Cleaning

## Download

We first need to download the data, if that has not been done already. A wrapper class handles all of the logic and allows for easy operation on the dataset.

The code for the dataset wrapper can be found in `src/ser/dataset.py`. First, instantiate a dataset object.

In [4]:
dataset = ser.Dataset(data_path=DATA_BASE_PATH, remote_url='http://emodb.bilderbar.info/download/download.zip')

2021-04-22 20:51:06,026 - ser.dataset - INFO - Creating Dataset Wrapper object.
2021-04-22 20:51:06,027 - ser.dataset - INFO - > Base Path at "/Users/nik/Code/visium/data"
2021-04-22 20:51:06,029 - ser.dataset - INFO - > Pristine Path at "/Users/nik/Code/visium/data/pristine"
2021-04-22 20:51:06,030 - ser.dataset - INFO - > Working Path at "/Users/nik/Code/visium/data/working"
2021-04-22 20:51:06,031 - ser.dataset - INFO - Make sure that the http://emodb.bilderbar.info/download/download.zip points to a ZIP file.


In [7]:
# Run this line below if you want to clean the data directory
#dataset.clean()

Then we download the data.

In [5]:
dataset.download()

2021-04-22 20:52:55,749 - ser.dataset - INFO - Successfully downloaded file to /Users/nik/Code/visium/data/pristine/download.zip
2021-04-22 20:52:55,750 - ser.dataset - INFO - Dataset downloaded.


True

The return value of the download is `True`, so everything went fine. If we were to rerun this function, it would not download anything again. However, using the `force=True` argument, we can initiate the download again.
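
The skip-unless-forced behaviour described above can be sketched as follows. This is only an illustration under assumptions: `download_once` and its `fetch` parameter are hypothetical names, and the actual implementation in `src/ser/dataset.py` is not shown here and may differ.

```python
import os
import urllib.request


def download_once(url, target_path, force=False, fetch=urllib.request.urlretrieve):
    """Hypothetical helper, not the actual `ser.Dataset.download` code.

    Downloads `url` to `target_path` unless the file already exists.
    `fetch` is injectable so the logic can be exercised without network access.
    """
    if os.path.exists(target_path) and not force:
        # The file is already present: skip the network round trip.
        return True
    # Create the parent directory if needed ('.' covers a bare filename).
    os.makedirs(os.path.dirname(target_path) or ".", exist_ok=True)
    fetch(url, target_path)  # urlretrieve(url, filename) saves the response to disk
    return os.path.exists(target_path)
```

Calling `download_once(url, path)` twice triggers only one fetch; passing `force=True` triggers a new one.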

Since it is a ZIP file, we need to extract it.

In [6]:
dataset.extract()

2021-04-22 20:53:32,451 - ser.dataset - INFO - Dataset extracted.


True

The extraction went well too. If we were to rerun this function, it would not extract anything again. However, using the `force=True` argument, we can initiate the extraction again.
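
One possible way to get this behaviour with the standard `zipfile` module is sketched below. The helper name `extract_once` is hypothetical, and the rule used to decide that the archive was already extracted (every archive member exists in the target directory) is an assumption, not necessarily what `ser.Dataset.extract` does.

```python
import os
import zipfile


def extract_once(zip_path, target_dir, force=False):
    """Hypothetical helper, not the actual `ser.Dataset.extract` code."""
    with zipfile.ZipFile(zip_path) as archive:
        # Treat the archive as extracted if every member is already on disk.
        already_there = all(
            os.path.exists(os.path.join(target_dir, name))
            for name in archive.namelist()
        )
        if already_there and not force:
            return True
        # Overwrites existing files, which is what `force=True` is for.
        archive.extractall(target_dir)
    return True
```

Checking archive members rather than "is the directory empty" keeps the check valid when the working directory holds unrelated files such as `.gitkeep`.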

Now we need to parse the data. For this, let's have a look at the documentation (http://www.emodb.bilderbar.info/index-1280.html).

In our data folder, we have multiple files and folders. Not all of them are relevant.

In [9]:
print(os.listdir(os.path.join(DATA_BASE_PATH, 'working')))
print(os.listdir(os.path.join(DATA_BASE_PATH, 'working', 'wav'))[:10])

['wav', '.gitkeep', 'silb', 'erklaerung.txt', 'lablaut', 'erkennung.txt', 'labsilb']
['16a02Lb.wav', '14a07Wc.wav', '10a07Ad.wav', '13a05Ea.wav', '14a05Wa.wav', '14a07Na.wav', '15a05Wa.wav', '16b10Wb.wav', '09a01Nb.wav', '16a01Fc.wav']


For the task at hand, only the contents of the `wav` folder are relevant. It contains WAV files that are named according to the following schema (taken from the documentation):

```
Positions 1-2: number of speaker
Positions 3-5: code for text
Position 6: emotion (sorry, letter stands for german emotion word)
Position 7: if there are more than two versions these are numbered a, b, c ....
```

The documentation further states information about the speakers:

```
03 - male, 31 years old
08 - female, 34 years
09 - female, 21 years
10 - male, 32 years
11 - male, 26 years
12 - male, 30 years
13 - female, 32 years
14 - female, 35 years
15 - male, 25 years
16 - female, 31 years
```
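
For later use (for example, to check how balanced the speakers are), the listing above can be transcribed into a small lookup table. The name `SPEAKERS` and the tuple layout are illustrative choices, not part of the `ser` module.

```python
from collections import Counter

# Speaker ID -> (gender, age in years), transcribed from the documentation excerpt above.
SPEAKERS = {
    "03": ("male", 31),
    "08": ("female", 34),
    "09": ("female", 21),
    "10": ("male", 32),
    "11": ("male", 26),
    "12": ("male", 30),
    "13": ("female", 32),
    "14": ("female", 35),
    "15": ("male", 25),
    "16": ("female", 31),
}

# Count speakers per gender and collect the age range.
gender_counts = Counter(gender for gender, _age in SPEAKERS.values())
ages = [age for _gender, age in SPEAKERS.values()]
print(gender_counts["male"], gender_counts["female"])  # 5 5
print(min(ages), max(ages))  # 21 35
```

The ten speakers are split evenly between male and female, with ages between 21 and 35.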

And about the spoken sample:

|code|text (German)|attempted English translation|
|--- |--- |--- |
|a01|Der Lappen liegt auf dem Eisschrank.|The tablecloth is lying on the fridge.|
|a02|Das will sie am Mittwoch abgeben.|She will hand it in on Wednesday.|
|a04|Heute abend könnte ich es ihm sagen.|Tonight I could tell him.|
|a05|Das schwarze Stück Papier befindet sich da oben neben dem Holzstück.|The black sheet of paper is located up there besides the piece of timber.|
|a07|In sieben Stunden wird es soweit sein.|In seven hours it will be.|
|b01|Was sind denn das für Tüten, die da unter dem Tisch stehen?|What about the bags standing there under the table?|
|b02|Sie haben es gerade hochgetragen und jetzt gehen sie wieder runter.|They just carried it upstairs and now they are going down again.|
|b03|An den Wochenenden bin ich jetzt immer nach Hause gefahren und habe Agnes besucht.|Currently at the weekends I always went home and saw Agnes.|
|b09|Ich will das eben wegbringen und dann mit Karl was trinken gehen.|I will just discard this and then go for a drink with Karl.|
|b10|Die wird auf dem Platz sein, wo wir sie immer hinlegen.|It will be in the place where we always store it.|

And lastly, some information about the emotions:

|letter (English)|emotion (English)|letter (German)|emotion (German)|
|--- |--- |--- |--- |
|A|anger|W|Ärger (Wut)|
|B|boredom|L|Langeweile|
|D|disgust|E|Ekel|
|F|anxiety/fear|A|Angst|
|H|happiness|F|Freude|
|S|sadness|T|Trauer|

Note: `N` is also an option which stands for `Neutral`.

So, the sample `16a02Lb.wav` can be parsed as follows:

* Speaker = 16 - female, 31 years
* Text = a02 for "Das will sie am Mittwoch abgeben."
* Emotion = L for "Langeweile"
* Version = b (i.e., there is at least one other version, a, in the dataset)
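
The same breakdown can be expressed compactly with a regular expression. This is a minimal sketch: `parse_filename`, `FILENAME_PATTERN` and `EMOTIONS` are hypothetical names, and the English labels are taken from the emotion table above (with "anxiety/fear" shortened to "fear").

```python
import re

# German emotion letter (as used in the filenames) -> English label.
EMOTIONS = {
    "W": "anger",
    "L": "boredom",
    "E": "disgust",
    "A": "fear",
    "F": "happiness",
    "T": "sadness",
    "N": "neutral",
}

# Positions 1-2: speaker, 3-5: text code, 6: emotion letter, 7: version.
FILENAME_PATTERN = re.compile(
    r"^(?P<speaker>\d{2})(?P<text>[ab]\d{2})(?P<emotion>[WLEAFTN])(?P<version>[a-z])\.wav$"
)


def parse_filename(filename):
    """Split an Emo-DB filename into its labelled parts."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unexpected filename: {filename!r}")
    fields = match.groupdict()
    fields["emotion_label"] = EMOTIONS[fields["emotion"]]
    return fields


print(parse_filename("16a02Lb.wav"))
# {'speaker': '16', 'text': 'a02', 'emotion': 'L', 'version': 'b', 'emotion_label': 'boredom'}
```

Filenames that do not follow the schema raise a `ValueError` instead of being silently mislabelled.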

The dataset method `.prepare()` is meant to parse this information into a pandas DataFrame.


In [22]:
speakers = ['03', '08', '09', '10', '11', '12', '13', '14', '15', '16']
texts = ['a01', 'a02', 'a04', 'a05', 'a07', 'b01', 'b02', 'b03', 'b09', 'b10']
emotions = {
    'W': 'Ärger (Wut)',
    'L': 'Langeweile',
    'E': 'Ekel',
    'A': 'Angst',
    'F': 'Freude',
    'T': 'Trauer',
    'N': 'Neutral'
}
files = []
for filename in os.listdir(os.path.join(dataset.working_path, 'wav')):
    # Expected length: 2 (speaker) + 3 (text) + 1 (emotion) + 1 (version) + 4 ('.wav') = 11
    assert len(filename) == 11, 'Encountered unknown filename.'
    # Slice each field by position and validate it against the documented values.
    _speaker = filename[0:2]
    assert _speaker in speakers, 'Encountered unknown speaker.'
    _text = filename[2:5]
    assert _text in texts, 'Encountered unknown text.'
    _emotion = filename[5:6]
    assert _emotion in emotions, 'Encountered unknown emotion.'
    _version = filename[6:7]
    assert _version in list('abcdefghijklmnopqrstuvwxyz'), 'Encountered unknown version.'
    files.append({
        'filename': filename,
        'full_path': os.path.join(dataset.working_path, 'wav', filename),
        'speaker': _speaker, 
        'text': _text,
        'emotion': _emotion,
        'version': _version
    })
pd.DataFrame(files).head()

| |filename|full_path|speaker|text|emotion|version|
|--- |--- |--- |--- |--- |--- |--- |
|0|16a02Lb.wav|/Users/nik/Code/visium/data/working/wav/16a02L...|16|a02|L|b|
|1|14a07Wc.wav|/Users/nik/Code/visium/data/working/wav/14a07W...|14|a07|W|c|
|2|10a07Ad.wav|/Users/nik/Code/visium/data/working/wav/10a07A...|10|a07|A|d|
|3|13a05Ea.wav|/Users/nik/Code/visium/data/working/wav/13a05E...|13|a05|E|a|
|4|14a05Wa.wav|/Users/nik/Code/visium/data/working/wav/14a05W...|14|a05|W|a|


In [14]:
dataset.prepare()



# Feature Engineering & Modeling

# Results & Visualizations