# ISAT 449: Emerging Topics in Applied Data Science 

### Mini-Project: How to Make a Speech Emotion Recognizer Using Python And Scikit-learn

#### Speech Emotion Recognition – Objective
To build a model to recognize emotion from speech using the librosa and sklearn libraries and the RAVDESS dataset.

#### Speech Emotion Recognition – About the Python Mini Project
In this Python mini project, we will use the libraries librosa, soundfile, and scikit-learn (among others) to build a model using an MLPClassifier. Our model will be able to recognize emotion from sound files. We will load the data, extract features from it, then split the dataset into training and testing sets. Then, we’ll initialize an MLPClassifier and train the model. Finally, once we are satisfied with the accuracy of our model, we will test it on a sound file recorded with our own voice!
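As a rough preview of that workflow, here is a minimal sketch that runs the split, train, and score steps on synthetic data instead of audio features. The `make_blobs` data, the `hidden_layer_sizes` and `max_iter` values, and the variable names are illustrative assumptions, not the settings used later in this project.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Stand-in labels; the real project derives labels from the RAVDESS filenames.
labels = ["calm", "happy", "fearful", "disgust"]

# Synthetic, well-separated feature vectors (assumption: 20 features per sample).
features, class_ids = make_blobs(n_samples=400, centers=4, n_features=20, random_state=0)
targets = [labels[i] for i in class_ids]

# Split into training and testing sets.
x_train, x_test, y_train, y_test = train_test_split(
    features, targets, test_size=0.25, random_state=9
)

# Initialize and train a multi-layer perceptron (hyperparameters are illustrative).
model = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=0)
model.fit(x_train, y_train)

# Score the model on the held-out test set.
y_pred = model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2%}")
```

Real speech features are far less cleanly separated than these synthetic clusters, so expect a lower accuracy on RAVDESS.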

#### The Dataset
For this Python mini project, we’ll use the RAVDESS dataset, the Ryerson Audio-Visual Database of Emotional Speech and Song, which is free to download. The dataset has 7356 files rated by 247 individuals 10 times on emotional validity, intensity, and genuineness. The entire dataset is 24.8GB from 24 actors, but we’ve lowered the sample rate on all the files, and you can download the reduced version from Canvas.

#### File Summary
In total, the RAVDESS collection includes 7356 files (2880+2024+1440+1012 files).

#### File naming convention
Each of the 7356 RAVDESS files has a unique filename. The filename consists of a 7-part numerical identifier (e.g., 02-01-06-01-02-01-12.mp4). These identifiers define the stimulus characteristics:
#### Filename identifiers
* Modality (01 = full-AV, 02 = video-only, 03 = audio-only).
* Vocal channel (01 = speech, 02 = song).
* Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
* Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.
* Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").
* Repetition (01 = 1st repetition, 02 = 2nd repetition).
* Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).

<i>Filename example: 02-01-06-01-02-01-12.mp4</i>

* Video-only (02)
* Speech (01)
* Fearful (06)
* Normal intensity (01)
* Statement "dogs" (02)
* 1st Repetition (01)
* 12th Actor (12)
* Female, as the actor ID number is even.
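The decoding above can be sketched in a few lines of Python. The function name `parse_ravdess_filename` and the lookup tables are hypothetical helpers written for illustration; only the code-to-label mappings come from the list of identifiers above.

```python
import os

# Lookup tables copied from the filename identifiers listed above.
MODALITIES = {"01": "full-AV", "02": "video-only", "03": "audio-only"}
VOCAL_CHANNELS = {"01": "speech", "02": "song"}
EMOTIONS = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
            "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}
INTENSITIES = {"01": "normal", "02": "strong"}
STATEMENTS = {"01": "Kids are talking by the door",
              "02": "Dogs are sitting by the door"}


def parse_ravdess_filename(file_name):
    """Decode a 7-part RAVDESS filename into a dictionary of labels."""
    stem = os.path.splitext(os.path.basename(file_name))[0]
    parts = stem.split("-")
    if len(parts) != 7:
        raise ValueError(f"Expected 7 identifiers, got {len(parts)}: {file_name}")
    modality, channel, emotion, intensity, statement, repetition, actor = parts
    actor_id = int(actor)
    return {
        "modality": MODALITIES[modality],
        "vocal_channel": VOCAL_CHANNELS[channel],
        "emotion": EMOTIONS[emotion],
        "intensity": INTENSITIES[intensity],
        "statement": STATEMENTS[statement],
        "repetition": int(repetition),
        "actor": actor_id,
        # Odd-numbered actors are male, even-numbered actors are female.
        "actor_sex": "male" if actor_id % 2 == 1 else "female",
    }


info = parse_ravdess_filename("02-01-06-01-02-01-12.mp4")
print(info["emotion"], info["actor"], info["actor_sex"])
```

The project code below only needs the third identifier (the emotion), which it reads with `file_name.split("-")[2]`.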

You can find more information on the file structure and filenames from Zenodo: Filename References (https://zenodo.org/record/1188976#.X3KzGGhKhPa)

### Let's import the dependencies
1. Import the libraries

In [1]:
import soundfile  # to read audio files
import numpy as np
import matplotlib.pyplot as plt
import librosa  # to extract speech features
import glob
import os
import pickle  # to save the model after training
from sklearn.model_selection import train_test_split  # to split into training and testing sets
from sklearn.neural_network import MLPClassifier  # multi-layer perceptron model
from sklearn.metrics import accuracy_score  # to measure how well the model performs

2. Define a function extract_feature to extract the mfcc, chroma, and mel features from a sound file. This function takes four parameters: the file name and three Boolean flags, one for each of the three features:

* **mfcc**: Mel Frequency Cepstral Coefficient, represents the short-term power spectrum of a sound
* **chroma**: Pertains to the 12 different pitch classes
* **mel**: Mel Spectrogram Frequency

Open the sound file with soundfile.SoundFile using with-as so it’s automatically closed once we’re done. Read from it and call the result X. Also, get the sample rate. If chroma is True, get the Short-Time Fourier Transform of X.

Let result be an empty numpy array. Now, for each of the three features, if it is requested, call the corresponding function from librosa.feature (e.g., librosa.feature.mfcc for mfcc) and take the mean value over time. Call the function hstack() from numpy with result and the feature value, and store this in result. hstack() stacks arrays in sequence horizontally (in a columnar fashion). Then, return the result.
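To see what the mean-then-hstack step does to array shapes, here is a small sketch that uses random arrays in place of real librosa output. The row counts assume 40 MFCCs (the `n_mfcc=40` used in this project), 12 chroma bins, and 128 mel bands (assumed to be librosa's default values); the frame count of 50 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames = 50  # arbitrary number of time frames for illustration

# Stand-ins for librosa feature matrices, shaped (n_features, n_frames).
mfcc_frames = rng.normal(size=(40, n_frames))
chroma_frames = rng.random(size=(12, n_frames))
mel_frames = rng.random(size=(128, n_frames))

result = np.array([])
for frames in (mfcc_frames, chroma_frames, mel_frames):
    # Transpose to (n_frames, n_features), then average over the frames
    # so each feature collapses to a single number.
    per_feature_mean = np.mean(frames.T, axis=0)
    # Append the averaged values to the end of the running feature vector.
    result = np.hstack((result, per_feature_mean))

print(result.shape)  # (180,)
```

Under those assumptions, each sound file is summarized by one vector of 40 + 12 + 128 = 180 values, regardless of how long the recording is.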

In [2]:
# Extract features (mfcc, chroma, mel) from a sound file
def extract_feature(file_name, mfcc, chroma, mel):
    with soundfile.SoundFile(file_name) as sound_file:
        X=sound_file.read(dtype="float32")
        sample_rate=sound_file.samplerate
        if chroma:
            stft=np.abs(librosa.stft(X))
        result=np.array([])
        if mfcc:
            mfccs=np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
            result=np.hstack((result, mfccs))
        if chroma:
            chroma=np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T,axis=0)
            result=np.hstack((result, chroma))
        if mel:
            # Pass the audio as y=...; recent librosa versions require keyword arguments here
            mel=np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T,axis=0)
            result=np.hstack((result, mel))
    return result

In [3]:
### TO DO Comments

In [16]:
# Emotions in the RAVDESS dataset
emotions={'01':'neutral',
          '02':'calm',
          '03':'happy',
          '04':'sad',
          '05':'angry',
          '06':'fearful',
          '07':'disgust',
          '08':'surprised'
}

# Emotions to observe
observed_emotions=['calm', 'happy', 'fearful', 'disgust']

In [26]:
# Load the data and extract features for each sound file
def load_data(test_size=0.2):
    x,y=[],[]
    # Build the pattern with os.path.join so it works on Windows, macOS, and Linux
    for file in glob.glob(os.path.join("ravdess-data", "Actor_*", "*.wav")):
        file_name=os.path.basename(file)
        emotion=emotions[file_name.split("-")[2]]
        if emotion not in observed_emotions:
            continue
        feature=extract_feature(file, mfcc=True, chroma=True, mel=True)
        x.append(feature)
        y.append(emotion)
    return train_test_split(np.array(x), y, test_size=test_size, random_state=9)

In [27]:
# TO DO Comment

In [28]:
# Split the dataset
x_train,x_test,y_train,y_test= load_data(test_size=0.25)

ValueError: With n_samples=0, test_size=0.25 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.
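The `n_samples=0` in this error means that no sound files were collected before the split, so `train_test_split` received empty lists. That typically happens when the glob pattern matches nothing, for example because the data folder is not in the working directory or the path separators do not suit the operating system. A minimal sketch of a check that fails early with a clearer message follows; `find_wav_files` is a hypothetical helper, and the temporary folder exists only to demonstrate it.

```python
import glob
import os
import tempfile


def find_wav_files(data_root):
    """Return the sorted .wav paths under data_root/Actor_*/, or raise if none exist."""
    pattern = os.path.join(data_root, "Actor_*", "*.wav")
    files = sorted(glob.glob(pattern))
    if not files:
        raise FileNotFoundError(
            f"No .wav files matched {pattern!r}; working directory is {os.getcwd()!r}"
        )
    return files


# Demonstration with a throwaway folder that mimics the expected layout.
with tempfile.TemporaryDirectory() as tmp:
    actor_dir = os.path.join(tmp, "Actor_01")
    os.makedirs(actor_dir)
    # Create an empty placeholder file with a RAVDESS-style name.
    open(os.path.join(actor_dir, "03-01-03-01-01-01-01.wav"), "wb").close()

    found_names = [os.path.basename(path) for path in find_wav_files(tmp)]
    print(found_names)

    # A folder that does not exist yields a clear error instead of an empty split.
    try:
        find_wav_files(os.path.join(tmp, "missing"))
        missing_raised = False
    except FileNotFoundError as error:
        missing_raised = True
        print(error)
```

Running the same check against the real data folder before calling `load_data` shows right away whether the files are where the notebook expects them.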

In [29]:
# Get the shape of the training and testing datasets
print((x_train.shape[0], x_test.shape[0]))

NameError: name 'x_train' is not defined

In [30]:
# Get the number of features extracted
print(f'Features extracted: {x_train.shape[1]}')

NameError: name 'x_train' is not defined