# <center> Speech Emotion Recognition </center>

### I am going to build a speech emotion recognition classifier.
### But first, what is speech emotion recognition (SER), and why build this project? A few of the reasons are given below.

#### First, let's define SER, i.e. Speech Emotion Recognition.
* Speech Emotion Recognition, abbreviated as SER, is the act of attempting to recognize human emotion and affective states from speech. It capitalizes on the fact that voice often reflects underlying emotion through tone and pitch. Animals such as dogs and horses rely on the same cues to understand human emotion.

#### Why do we need it?

1. Emotion recognition is a part of speech recognition that is growing in popularity, and demand for it is increasing. Although there are methods to recognize emotion using traditional machine learning techniques, this project attempts to use deep learning to recognize emotions from audio data.

2. SER is used in call centers to classify calls according to emotion. It can serve as a performance parameter for conversational analysis, for example to identify unsatisfied customers and measure customer satisfaction, which helps companies improve their services.

3. It can also be used in in-car systems: information about the driver's mental state can be provided to the system so that it can take safety measures and help prevent accidents.

#### Datasets used in this project

* Crowd-sourced Emotional Multimodal Actors Dataset (Crema-D)
* Ryerson Audio-Visual Database of Emotional Speech and Song (Ravdess)
* Surrey Audio-Visual Expressed Emotion (Savee)
* Toronto emotional speech set (Tess)

# Importing Libraries

In [2]:
# pip install --upgrade keras


In [1]:
import pandas as pd
import numpy as np

import os
import sys

# librosa is a Python library for analyzing audio and music. We will use it later to extract features from the audio files.
import librosa
import librosa.display
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

# to play the audio files
from IPython.display import Audio

import keras
from keras.callbacks import ReduceLROnPlateau
from keras.models import Sequential
from keras.layers import Dense, Conv1D, MaxPooling1D, Flatten, Dropout, BatchNormalization
from keras.utils import to_categorical
from keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import load_model



import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [4]:
# !pip install kaggle


In [None]:
# from google.colab import files
# files.upload()  # Upload the kaggle.json file that contains your username and key


In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [6]:
!chmod 600 /root/.kaggle/kaggle.json


chmod: cannot access '/root/.kaggle/kaggle.json': No such file or directory


In [None]:
!kaggle datasets download -d ejlok1/surrey-audiovisual-expressed-emotion-savee

In [None]:
!kaggle datasets download -d uwrfkaggler/ravdess-emotional-speech-audio
!kaggle datasets download -d ejlok1/toronto-emotional-speech-set-tess
!kaggle datasets download -d ejlok1/cremad


In [None]:
import zipfile

# Define dataset names
datasets = [
    'surrey-audiovisual-expressed-emotion-savee.zip',
    'ravdess-emotional-speech-audio.zip',
    'toronto-emotional-speech-set-tess.zip',
    'cremad.zip'
]

# Unzip each dataset
for dataset in datasets:
    with zipfile.ZipFile(dataset, 'r') as zip_ref:
        zip_ref.extractall('/content/' + dataset.replace('.zip', ''))


## Data Preparation
* As we are working with four different datasets, I will create a dataframe that stores the emotion label of every audio file together with its path.
* We will use this dataframe to extract features for our model training.

## <center> 1. Ravdess Dataframe </center>
Here are the filename identifiers as per the official RAVDESS website:

* Modality (01 = full-AV, 02 = video-only, 03 = audio-only).
* Vocal channel (01 = speech, 02 = song).
* Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
* Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.
* Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").
* Repetition (01 = 1st repetition, 02 = 2nd repetition).
* Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).

So, here's an example filename: 02-01-06-01-02-01-12.mp4
This means the metadata for the file is:

* Video-only (02)
* Speech (01)
* Fearful (06)
* Normal intensity (01)
* Statement "dogs" (02)
* 1st Repetition (01)
* 12th Actor (12) - Female (as the actor ID number is even)
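
The naming scheme above can be decoded mechanically. The following is a minimal sketch, not part of the original notebook: `parse_ravdess_filename` and the lookup tables are hypothetical helpers, and the emotion names follow the RAVDESS list above rather than the shortened labels used later in this notebook.

```python
# Hypothetical helper: decode a RAVDESS filename into its seven metadata fields.
MODALITY = {1: "full-AV", 2: "video-only", 3: "audio-only"}
VOCAL_CHANNEL = {1: "speech", 2: "song"}
EMOTION = {1: "neutral", 2: "calm", 3: "happy", 4: "sad",
           5: "angry", 6: "fearful", 7: "disgust", 8: "surprised"}
INTENSITY = {1: "normal", 2: "strong"}

def parse_ravdess_filename(filename):
    stem = filename.rsplit(".", 1)[0]            # drop the extension
    fields = [int(p) for p in stem.split("-")]   # seven two-digit codes
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {filename}")
    modality, channel, emotion, intensity, statement, repetition, actor = fields
    return {
        "modality": MODALITY[modality],
        "vocal_channel": VOCAL_CHANNEL[channel],
        "emotion": EMOTION[emotion],
        "intensity": INTENSITY[intensity],
        "statement": statement,
        "repetition": repetition,
        "actor": actor,
        # even actor IDs are female, odd IDs are male
        "gender": "female" if actor % 2 == 0 else "male",
    }

info = parse_ravdess_filename("02-01-06-01-02-01-12.mp4")
print(info["emotion"], info["gender"])
```

Only the third field (the emotion code) is needed for training, but decoding the remaining fields makes it easy to filter by actor or intensity if that becomes useful.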

In [None]:
Savee = "/content/surrey-audiovisual-expressed-emotion-savee/ALL/"
Ravdess = "/content/ravdess-emotional-speech-audio/audio_speech_actors_01-24/"
Tess = "/content/toronto-emotional-speech-set-tess/tess toronto emotional speech set data/TESS Toronto emotional speech set data/"
Crema = "/content/cremad/AudioWAV/"


In [None]:
ravdess_directory_list = os.listdir(Ravdess)

file_emotion = []
file_path = []
for dir in ravdess_directory_list:
    # there are 24 actors, each with their own sub-directory, so we need to list the files for each actor.
    actor = os.listdir(Ravdess + dir)
    for file in actor:
        part = file.split('.')[0]
        part = part.split('-')
        # third part in each file represents the emotion associated to that file.
        file_emotion.append(int(part[2]))
        file_path.append(Ravdess + dir + '/' + file)

# dataframe for emotion of files
emotion_df = pd.DataFrame(file_emotion, columns=['Emotions'])

# dataframe for path of files.
path_df = pd.DataFrame(file_path, columns=['Path'])
Ravdess_df = pd.concat([emotion_df, path_df], axis=1)

# changing integers to actual emotions.
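# assign the result back instead of using inplace=True on a column, which newer pandas versions warn about
Ravdess_df['Emotions'] = Ravdess_df['Emotions'].replace({1:'neutral', 2:'calm', 3:'happy', 4:'sad', 5:'angry', 6:'fear', 7:'disgust', 8:'surprise'})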
Ravdess_df.head()

## <center>2. Crema DataFrame</center>

In [None]:
crema_directory_list = os.listdir(Crema)

file_emotion = []
file_path = []

for file in crema_directory_list:
    # storing file paths
    file_path.append(Crema + file)
    # storing file emotions
    part=file.split('_')
    if part[2] == 'SAD':
        file_emotion.append('sad')
    elif part[2] == 'ANG':
        file_emotion.append('angry')
    elif part[2] == 'DIS':
        file_emotion.append('disgust')
    elif part[2] == 'FEA':
        file_emotion.append('fear')
    elif part[2] == 'HAP':
        file_emotion.append('happy')
    elif part[2] == 'NEU':
        file_emotion.append('neutral')
    else:
        file_emotion.append('Unknown')

# dataframe for emotion of files
emotion_df = pd.DataFrame(file_emotion, columns=['Emotions'])

# dataframe for path of files.
path_df = pd.DataFrame(file_path, columns=['Path'])
Crema_df = pd.concat([emotion_df, path_df], axis=1)
Crema_df.head()

## <center> 3. TESS dataset </center>

In [None]:
tess_directory_list = os.listdir(Tess)

file_emotion = []
file_path = []

for dir in tess_directory_list:
    directories = os.listdir(Tess + dir)
    for file in directories:
        part = file.split('.')[0]
        part = part.split('_')[2]
        if part=='ps':
            file_emotion.append('surprise')
        else:
            file_emotion.append(part)
        file_path.append(Tess + dir + '/' + file)

# dataframe for emotion of files
emotion_df = pd.DataFrame(file_emotion, columns=['Emotions'])

# dataframe for path of files.
path_df = pd.DataFrame(file_path, columns=['Path'])
Tess_df = pd.concat([emotion_df, path_df], axis=1)
Tess_df.head()

## <center> 4. SAVEE dataset </center>
The audio files in this dataset are named in such a way that the prefix letters describe the emotion classes as follows:

* 'a' = 'anger'
* 'd' = 'disgust'
* 'f' = 'fear'
* 'h' = 'happiness'
* 'n' = 'neutral'
* 'sa' = 'sadness'
* 'su' = 'surprise'
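
As an illustration of the prefix scheme above, here is a small sketch that maps a SAVEE filename to the label names used in this notebook. It assumes filenames of the form `<speaker>_<prefix><digits>.wav` (for example `DC_sa03.wav`); `savee_emotion` and `SAVEE_PREFIXES` are hypothetical names introduced only for this example.

```python
import re

# Prefix -> label, using the shortened label names from this notebook
SAVEE_PREFIXES = {"a": "angry", "d": "disgust", "f": "fear", "h": "happy",
                  "n": "neutral", "sa": "sad", "su": "surprise"}

def savee_emotion(filename):
    # Assumed layout: "<speaker>_<prefix><digits>.wav", e.g. "DC_sa03.wav"
    match = re.fullmatch(r"[A-Za-z]+_([a-z]+)\d+\.wav", filename)
    if match is None or match.group(1) not in SAVEE_PREFIXES:
        return "unknown"
    return SAVEE_PREFIXES[match.group(1)]

print(savee_emotion("DC_a01.wav"), savee_emotion("JK_sa15.wav"))
```

Returning `"unknown"` for unexpected names, rather than falling through to a default emotion, makes stray files easy to spot before training.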

In [None]:
savee_directory_list = os.listdir(Savee)

file_emotion = []
file_path = []

for file in savee_directory_list:
    file_path.append(Savee + file)
    part = file.split('_')[1]
    ele = part[:-6]
    if ele=='a':
        file_emotion.append('angry')
    elif ele=='d':
        file_emotion.append('disgust')
    elif ele=='f':
        file_emotion.append('fear')
    elif ele=='h':
        file_emotion.append('happy')
    elif ele=='n':
        file_emotion.append('neutral')
    elif ele=='sa':
        file_emotion.append('sad')
    else:
        file_emotion.append('surprise')

# dataframe for emotion of files
emotion_df = pd.DataFrame(file_emotion, columns=['Emotions'])

# dataframe for path of files.
path_df = pd.DataFrame(file_path, columns=['Path'])
Savee_df = pd.concat([emotion_df, path_df], axis=1)
Savee_df.head()

In [None]:
# creating Dataframe using all the 4 dataframes we created so far.
data_path = pd.concat([Ravdess_df, Crema_df, Tess_df, Savee_df], axis = 0)
data_path.to_csv("data_path.csv",index=False)
data_path.head()

## Data Visualisation and Exploration

First, let's plot the count of each emotion in our dataset.

In [None]:
plt.title('Count of Emotions', size=16)
sns.countplot(x = data_path.Emotions, data = data_path)
plt.ylabel('Count', size=12)
plt.xlabel('Emotions', size=12)
sns.despine(top=True, right=True, left=False, bottom=False)
plt.show()

We can also plot waveplots and spectrograms for audio signals.

* Waveplots - A waveplot shows the amplitude (loudness) of the audio at each point in time.
* Spectrograms - A spectrogram is a visual representation of the spectrum of frequencies of a sound or other signal as it varies with time.
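
To make the idea of a spectrogram concrete, here is a simplified NumPy-only sketch of what a short-time Fourier transform computes. It is not librosa's implementation (librosa pads and centers frames and uses different defaults); `stft_db`, `n_fft` and `hop` are names chosen for this example.

```python
import numpy as np

def stft_db(signal, n_fft=512, hop=128):
    """Magnitude spectrogram in dB, shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the signal into overlapping, windowed frames
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    # One FFT per frame; transpose so rows are frequencies and columns are time
    magnitude = np.abs(np.fft.rfft(frames, axis=1)).T
    return 20 * np.log10(np.maximum(magnitude, 1e-10))

sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)            # 1 second of a 1000 Hz tone
spec = stft_db(tone)
freqs = np.fft.rfftfreq(512, d=1 / sr)
peak_hz = freqs[spec.mean(axis=1).argmax()]    # strongest frequency on average
print(spec.shape, peak_hz)
```

Each column of `spec` is the spectrum of one short frame, which is why the plot shows frequency on the vertical axis and time on the horizontal axis.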

In [None]:
def create_waveplot(data, sr, e):
    plt.figure(figsize=(10, 3))
    plt.title('Waveplot for audio with {} emotion'.format(e), size=15)
    librosa.display.waveshow(data, sr=sr)
    plt.show()

def create_spectrogram(data, sr, e):
    # stft computes the short-time Fourier transform of the signal
    X = librosa.stft(data)
    Xdb = librosa.amplitude_to_db(abs(X))
    plt.figure(figsize=(12, 3))
    plt.title('Spectrogram for audio with {} emotion'.format(e), size=15)
    librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='hz')
    #librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='log')
    plt.colorbar()

In [None]:
emotion='fear'
path = np.array(data_path.Path[data_path.Emotions==emotion])[1]
data, sampling_rate = librosa.load(path)
create_waveplot(data, sampling_rate, emotion)
create_spectrogram(data, sampling_rate, emotion)
Audio(path)

In [None]:
emotion='angry'
path = np.array(data_path.Path[data_path.Emotions==emotion])[1]
data, sampling_rate = librosa.load(path)
create_waveplot(data, sampling_rate, emotion)
create_spectrogram(data, sampling_rate, emotion)
Audio(path)

In [None]:
emotion='sad'
path = np.array(data_path.Path[data_path.Emotions==emotion])[1]
data, sampling_rate = librosa.load(path)
create_waveplot(data, sampling_rate, emotion)
create_spectrogram(data, sampling_rate, emotion)
Audio(path)

In [None]:
emotion='happy'
path = np.array(data_path.Path[data_path.Emotions==emotion])[1]
data, sampling_rate = librosa.load(path)
create_waveplot(data, sampling_rate, emotion)
create_spectrogram(data, sampling_rate, emotion)
Audio(path)

## Data Augmentation

- Data augmentation is the process of creating new synthetic data samples by adding small perturbations to our initial training set.
- To generate synthetic data for audio, we can apply noise injection, time shifting, and changes in pitch and speed.
- The objective is to make our model invariant to those perturbations and enhance its ability to generalize.
- For this to work, the perturbed sample must keep the same label as the original training sample.
- For images, data augmentation can be performed by shifting, zooming or rotating the image.

First, let's check which augmentation techniques work better for our dataset.
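
The label-preservation point can be sketched as follows. This is an illustrative NumPy-only example; `augment_with_labels` and its noise and shift magnitudes are assumptions for the sketch, not the functions used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_with_labels(signal, label, rng):
    """Return (signal, label) pairs: the original plus two perturbed copies."""
    # Small Gaussian noise, scaled relative to the peak amplitude
    noisy = signal + 0.01 * np.abs(signal).max() * rng.normal(size=signal.shape)
    # Circular shift by a few samples
    shifted = np.roll(signal, int(rng.integers(-100, 100)))
    # Every copy keeps the label of the original sample
    return [(signal, label), (noisy, label), (shifted, label)]

clip = np.sin(np.linspace(0, 20 * np.pi, 2000))
pairs = augment_with_labels(clip, "happy", rng)
print(len(pairs), {label for _, label in pairs})
```

The perturbations are deliberately small: a change large enough to alter how the emotion sounds would break the assumption that the label still applies.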

In [None]:
def noise(data):
    noise_amp = 0.035*np.random.uniform()*np.amax(data)
    data = data + noise_amp*np.random.normal(size=data.shape[0])
    return data

def stretch(data, rate=0.8):
    return librosa.effects.time_stretch(data, rate=rate)

def shift(data):
    shift_range = int(np.random.uniform(low=-5, high = 5)*2000)
    return np.roll(data, shift_range)

def pitch(data, sampling_rate, pitch_factor=0.7):
    return librosa.effects.pitch_shift(data, sr = sampling_rate, n_steps = pitch_factor)

# shift time from left or right
def time_shift(data, sample_rate):
    shift_max = 0.2  # maximum shift in seconds (sample_rate * 0.2 samples)
    shift = np.random.randint(int(sample_rate * shift_max))
    direction = np.random.choice(['left', 'right'])
    if direction == 'left':
        shift = -shift
    augmented_data = np.roll(data, shift)
    return augmented_data

# to reduce volume of loud sound
def dynamic_range_compression(data):
    compressor = np.random.uniform(0.5, 1.0)
    data = np.sign(data) * (1 - np.exp(-compressor * np.abs(data)))
    return data

# adjust difference of frequency components
def equalize(data, sr):
    eq = np.random.uniform(0.8, 1.2)
    return librosa.effects.preemphasis(data, coef=eq)

# simulate effect of sound
def reverb(data):
    reverb_effect = np.convolve(data, np.random.rand(1000), mode='same')
    return reverb_effect


# taking any example and checking for techniques.
path = np.array(data_path.Path)[1]
data, sample_rate = librosa.load(path)

#### 1. Simple Audio

In [None]:
plt.figure(figsize=(14,4))
librosa.display.waveshow(y=data, sr=sample_rate)
Audio(path)

#### 2. Noise Injection

In [None]:
x = noise(data)
plt.figure(figsize=(14,4))
librosa.display.waveshow(y=x, sr=sample_rate)
Audio(x, rate=sample_rate)

Noise injection is a useful augmentation technique: training on noisy copies makes the model less sensitive to background noise and helps reduce overfitting.

#### 3. Stretching

In [None]:
x = stretch(data)
plt.figure(figsize=(14,4))
librosa.display.waveshow(y=x, sr=sample_rate)
Audio(x, rate=sample_rate)

#### 4. Shifting

In [None]:
x = shift(data)
plt.figure(figsize=(14,4))
librosa.display.waveshow(y=x, sr=sample_rate)
Audio(x, rate=sample_rate)

#### 5. Pitch

In [None]:
x = pitch(data, sample_rate)
plt.figure(figsize=(14,4))
librosa.display.waveshow(y=x, sr=sample_rate)
Audio(x, rate=sample_rate)

#### 6. Time Shift

In [None]:
x = time_shift(data, sample_rate)
plt.figure(figsize=(14,4))
librosa.display.waveshow(y=x, sr=sample_rate)
Audio(x, rate=sample_rate)

#### 7. Dynamic Range Compression

In [None]:
x = dynamic_range_compression(data)
plt.figure(figsize=(14,4))
librosa.display.waveshow(y=x, sr=sample_rate)
Audio(x, rate=sample_rate)

#### 8. Equalization

In [None]:
x = equalize(data, sample_rate)
plt.figure(figsize=(14,4))
librosa.display.waveshow(y=x, sr=sample_rate)
Audio(x, rate=sample_rate)

#### 9. Reverb Audio

In [None]:
x = reverb(data)
plt.figure(figsize=(14,4))
librosa.display.waveshow(y=x, sr=sample_rate)
Audio(x, rate=sample_rate)

- Of the augmentation techniques above, I am using noise injection, stretching (i.e. changing speed) combined with pitch shifting, time shifting, dynamic range compression and equalization. Reverb is left out.

## Feature Extraction
- Feature extraction is a very important part of analyzing data and finding relations within it. Models cannot work with raw audio directly, so we need to convert it into a format they can understand; this is what feature extraction does.


The audio signal is a three-dimensional signal in which three axes represent time, amplitude and frequency.

![image.png](https://miro.medium.com/max/633/1*7sKM9aECRmuoqTadCYVw9A.jpeg)

I am no expert on audio signals or feature extraction, so I searched and found a very good blog post by [Askash Mallik](https://medium.com/heuristics/audio-signal-feature-extraction-and-clustering-935319d2225) on feature extraction.

As stated there, with the help of the sample rate and the sample data, one can perform several transformations to extract valuable features:
1. Zero Crossing Rate : The rate of sign-changes of the signal during the duration of a particular frame.
2. Energy : The sum of squares of the signal values, normalized by the respective frame length.
3. Entropy of Energy : The entropy of sub-frames’ normalized energies. It can be interpreted as a measure of abrupt changes.
4. Spectral Centroid : The center of gravity of the spectrum.
5. Spectral Spread : The second central moment of the spectrum.
6. Spectral Entropy :  Entropy of the normalized spectral energies for a set of sub-frames.
7. Spectral Flux : The squared difference between the normalized magnitudes of the spectra of the two successive frames.
8. Spectral Rolloff : The frequency below which 90% of the magnitude distribution of the spectrum is concentrated.
9. MFCCs : Mel Frequency Cepstral Coefficients form a cepstral representation where the frequency bands are not linear but distributed according to the mel-scale.
10. Chroma Vector : A 12-element representation of the spectral energy where the bins represent the 12 equal-tempered pitch classes of western-type music (semitone spacing).
11. Chroma Deviation : The standard deviation of the 12 chroma coefficients.
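
A few of the definitions above translate almost directly into NumPy. The sketch below computes the zero crossing rate, energy and spectral centroid for a single frame; `frame_features` is a hypothetical helper written for illustration, and librosa's own implementations differ in framing and normalization details.

```python
import numpy as np

def frame_features(frame, sr):
    # Zero crossing rate: fraction of adjacent sample pairs whose sign differs
    signs = np.signbit(frame)
    zcr = np.mean(signs[1:] != signs[:-1])
    # Energy: sum of squares normalised by the frame length
    energy = np.sum(frame ** 2) / len(frame)
    # Spectral centroid: magnitude-weighted mean frequency ("center of gravity")
    magnitude = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1 / sr)
    centroid = np.sum(freqs * magnitude) / np.sum(magnitude)
    return zcr, energy, centroid

sr = 8000
t = np.arange(1024) / sr
frame = np.sin(2 * np.pi * 500 * t)    # pure 500 Hz tone
zcr, energy, centroid = frame_features(frame, sr)
print(zcr, energy, centroid)
```

For a pure tone the centroid sits at the tone's frequency, and the zero crossing rate is roughly twice the frequency divided by the sample rate, which is a handy sanity check.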


In this project I am not going deep into the feature selection process to check which features work best for our dataset. Instead, I extract only the following 6 features to train our model:
- Zero Crossing Rate
- Chroma_stft
- MFCC
- RMS (root mean square) value
- MelSpectrogram
- Spectral centroid

Spectral bandwidth, spectral contrast and spectral roll-off are present in the code below but commented out.
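
As a quick check on the size of the resulting feature vector, the sketch below adds up the per-feature lengths for the calls that are active in `extract_features`. It assumes librosa's default parameters (`n_chroma=12`, `n_mfcc=20`, `n_mels=128`), since the notebook does not override them; each feature is averaged over time, so only these dimensions remain.

```python
# Length of each feature after averaging over time, assuming librosa defaults
feature_sizes = {
    "zero_crossing_rate": 1,
    "chroma_stft": 12,       # n_chroma default
    "mfcc": 20,              # n_mfcc default
    "rms": 1,
    "melspectrogram": 128,   # n_mels default
    "spectral_centroid": 1,
}
total = sum(feature_sizes.values())
print(total)  # 163
```

This total matches the 163 feature columns that appear in the array shapes printed later in the notebook.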

In [None]:
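def extract_features(data, sample_rate=22050):
    # sample_rate defaults to 22050 Hz, the rate librosa.load resamples to by default
    result = np.array([])

    # ZCR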
    zcr = np.mean(librosa.feature.zero_crossing_rate(y=data).T, axis=0)
    result=np.hstack((result, zcr)) # stacking horizontally

    # Chroma_stft
    stft = np.abs(librosa.stft(data))
    chroma_stft = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
    result = np.hstack((result, chroma_stft)) # stacking horizontally

    # MFCC
    mfcc = np.mean(librosa.feature.mfcc(y=data, sr=sample_rate).T, axis=0)
    result = np.hstack((result, mfcc)) # stacking horizontally

    # Root Mean Square Value
    rms = np.mean(librosa.feature.rms(y=data).T, axis=0)
    result = np.hstack((result, rms)) # stacking horizontally

    # MelSpectogram
    mel = np.mean(librosa.feature.melspectrogram(y=data, sr=sample_rate).T, axis=0)
    result = np.hstack((result, mel)) # stacking horizontally

    # Spectral Centroid
    spectral_centroid = np.mean(librosa.feature.spectral_centroid(y=data, sr=sample_rate).T, axis=0)
    result = np.hstack((result, spectral_centroid))

    # Spectral Bandwidth

    # spectral_bandwidth = np.mean(librosa.feature.spectral_bandwidth(y=data, sr=sample_rate).T, axis=0)
    # result = np.hstack((result, spectral_bandwidth))

    # Spectral Contrast

    # spectral_contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sample_rate).T, axis=0)
    # result = np.hstack((result, spectral_contrast))

    # Spectral Roll-off
    # spectral_rolloff = np.mean(librosa.feature.spectral_rolloff(y=data, sr=sample_rate, roll_percent=0.85).T, axis=0)
    # result = np.hstack((result, spectral_rolloff))


    return result


In [None]:
def get_features(path):
    data, sample_rate = librosa.load(path, duration=2.5, offset=0.6)

    # without augmentation
    res1 = extract_features(data)
    result = np.array(res1)

    # data with noise
    noise_data = noise(data)
    res2 = extract_features(noise_data)
    result = np.vstack((result, res2))

    # data with stretching and pitching
    new_data = stretch(data)
    data_stretch_pitch = pitch(new_data, sample_rate)
    res3 = extract_features(data_stretch_pitch)
    result = np.vstack((result, res3))

    # data with time shifting
    shifted_data = time_shift(data, sample_rate)
    res4 = extract_features(shifted_data)
    result = np.vstack((result, res4))

    # data with dynamic range compression
    compressed_data = dynamic_range_compression(data)
    res5 = extract_features(compressed_data)
    result = np.vstack((result, res5))

    # data with equalization
    equalized_data = equalize(data, sample_rate)
    res6 = extract_features(equalized_data)
    result = np.vstack((result, res6))

    # # data with reverb
    # reverb_data = reverb(data)
    # res7 = extract_features(reverb_data)
    # result = np.vstack((result, res7))

    return result


* Here I extracted features from the audio files.
* Feature extraction takes a long time.
* To avoid repeating it, I extracted the features once and saved them in .npy files.
* After that, I loaded X and Y from the .npy files and used them.
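
The extract-once-then-reload pattern described above can be sketched like this. The helper name `load_or_compute` and the stand-in `fake_extraction` function are hypothetical; the file names mirror the ones used in this notebook, and a temporary directory stands in for the real storage location.

```python
import os
import tempfile
import numpy as np

def load_or_compute(cache_dir, compute_fn):
    """Load cached X/Y arrays if both files exist, else compute and save them."""
    x_path = os.path.join(cache_dir, "X_features.npy")
    y_path = os.path.join(cache_dir, "Y_labels.npy")
    if os.path.exists(x_path) and os.path.exists(y_path):
        return np.load(x_path), np.load(y_path)
    X, Y = compute_fn()
    X, Y = np.asarray(X), np.asarray(Y)
    np.save(x_path, X)
    np.save(y_path, Y)
    return X, Y

calls = []

def fake_extraction():
    # Stand-in for the slow feature-extraction loop
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]], ["happy", "sad"]

with tempfile.TemporaryDirectory() as tmp:
    X1, Y1 = load_or_compute(tmp, fake_extraction)   # computes and saves
    X2, Y2 = load_or_compute(tmp, fake_extraction)   # loads from disk
print(len(calls), X2.shape, list(Y2))
```

With this pattern the expensive loop only runs when the cache files are missing, so re-running the notebook does not repeat the extraction.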

In [None]:
# X, Y = [], []
# ind = 1
# for path, emotion in zip(data_path.Path, data_path.Emotions):
#     feature = get_features(path)
#     print(ind)
#     ind += 1
#     for ele in feature:
#         X.append(ele)
#         # appending the emotion once per row: get_features returns 6 rows per file (the original plus 5 augmented versions).
#         Y.append(emotion)

In [None]:
############################# FOR SAVING IN numpy ##################################



# # Convert lists to numpy arrays
# X_array = np.array(X)
# Y_array = np.array(Y)
# np.save('/content/drive/MyDrive/My Folder/X_features.npy', X_array)
# np.save('/content/drive/MyDrive/My Folder/Y_labels.npy', Y_array)
# # Save arrays to .npy files
# np.save('/content/X_features.npy', X_array)
# np.save('/content/Y_labels.npy', Y_array)



In [None]:
# ############################# FOR LOADING Numpy X AND Y ##################################
# X_array = np.load('/content/X_features.npy')
# Y_array = np.load('/content/Y_labels.npy')

# # If you want them as lists again
# X = X_array.tolist()
# Y = Y_array.tolist()


* Here I mounted Google Drive.
* I loaded the saved feature files from Google Drive.
* The extracted feature files will be in the project folder.

In [None]:
import os
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

In [7]:
############################# FOR LOADING Numpy X AND Y ##################################


# X_array1 = np.load('/content/drive/MyDrive/My Folder/X_features.npy')
# Y_array1 = np.load('/content/drive/MyDrive/My Folder/Y_labels.npy')

# # If you want them as lists again
# X = X_array1.tolist()
# Y = Y_array1.tolist()


In [9]:
Features = pd.DataFrame(X)
Features['labels'] = Y
Features.to_csv('features.csv', index=False)
Features.sample(5)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | labels |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 67897 | 0.141594 | 0.546109 | 0.468488 | 0.488112 | 0.553716 | 0.645137 | 0.618634 | 0.491209 | 0.481875 | 0.582355 | ... | 0.0001710261 | 0.0001648473 | 0.0001461626 | 0.0001455819 | 0.0001280692 | 0.0001382304 | 0.0001362704 | 0.0001207512 | 3090.549694 | happy |
| 69513 | 0.025866 | 0.332503 | 0.334339 | 0.232307 | 0.243735 | 0.249915 | 0.419978 | 0.50164 | 0.418838 | 0.51371 | ... | 5.010732e-05 | 3.217035e-05 | 4.106615e-05 | 5.03261e-05 | 3.796028e-05 | 4.278753e-05 | 2.89958e-05 | 2.281984e-06 | 1169.795486 | sad |
| 31830 | 0.055913 | 0.633165 | 0.558486 | 0.488095 | 0.501623 | 0.568332 | 0.503402 | 0.483069 | 0.558058 | 0.71039 | ... | 9.005706e-10 | 8.505473e-10 | 8.131426e-10 | 7.85189e-10 | 7.650596e-10 | 7.503808e-10 | 7.408779e-10 | 7.349818e-10 | 1428.732976 | sad |
| 29937 | 0.05657 | 0.56036 | 0.585672 | 0.535939 | 0.548877 | 0.666266 | 0.688347 | 0.669565 | 0.608932 | 0.600284 | ... | 9.219977e-05 | 9.007699e-05 | 8.818944e-05 | 8.656411e-05 | 8.526849e-05 | 8.421528e-05 | 8.349225e-05 | 8.300414e-05 | 1280.026057 | happy |
| 26384 | 0.117983 | 0.551488 | 0.726139 | 0.717381 | 0.669638 | 0.476331 | 0.354386 | 0.373672 | 0.463651 | 0.513924 | ... | 3.170504e-07 | 2.889315e-07 | 2.687297e-07 | 2.539808e-07 | 2.439559e-07 | 2.3838e-07 | 1.907924e-07 | 7.317811e-08 | 2025.644292 | angry |


In [10]:
Features['labels'].unique()

array(['disgust', 'happy', 'sad', 'angry', 'fear', 'calm', 'surprise',
       'neutral'], dtype=object)

* We have applied data augmentation, extracted the features for each audio file and saved them.

## Data Preparation

- Now that we have extracted the features, we need to normalize the data and split it into training and test sets.
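
One detail worth spelling out is that the scaling statistics should come from the training split only and then be reused for the test split. The following NumPy sketch illustrates that idea on synthetic data; it assumes a 75/25 split (the default test size of `train_test_split`) and is not the scikit-learn code used in the cells below.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(200, 4))   # synthetic feature matrix

# Shuffle, then hold out 25% of the rows for testing
idx = rng.permutation(len(X))
split = int(0.75 * len(X))
train, test = X[idx[:split]], X[idx[split:]]

# Statistics come from the training rows only, so no test information leaks in
mean, std = train.mean(axis=0), train.std(axis=0)
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std
print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

The scaled test set will generally not have exactly zero mean and unit variance, which is expected: it is transformed with the training statistics, just as unseen audio would be.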

In [11]:
X = Features.iloc[: ,:-1].values
Y = Features['labels'].values


In [12]:
# As this is a multiclass classification problem, we one-hot encode Y.
# The encoder returns a sparse matrix by default, so .toarray() converts it to a dense array.
encoder = OneHotEncoder()
Y = encoder.fit_transform(np.array(Y).reshape(-1,1)).toarray()
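
For readers unfamiliar with one-hot encoding, here is a NumPy-only sketch of what the encoder does conceptually, including the inverse mapping used later to turn predictions back into labels. It is an illustration with made-up labels, not scikit-learn's implementation.

```python
import numpy as np

labels = np.array(["sad", "happy", "angry", "happy"])

# np.unique returns the sorted categories and, for each label, its category index
categories, indices = np.unique(labels, return_inverse=True)
one_hot = np.eye(len(categories))[indices]

# Inverse mapping: the position of the largest entry in each row picks the category
decoded = categories[one_hot.argmax(axis=1)]
print(categories, one_hot[0], decoded)
```

Because the categories are sorted alphabetically, column 0 corresponds to the alphabetically first label, which matters when mapping a predicted index back to an emotion.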

In [14]:
# splitting data
x_train, x_test, y_train, y_test = train_test_split(X, Y, random_state=0, shuffle=True)
x_train.shape, y_train.shape, x_test.shape, y_test.shape

((54729, 163), (54729, 8), (18243, 163), (18243, 8))

In [15]:
# scaling our data with sklearn's Standard scaler
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
x_train.shape, y_train.shape, x_test.shape, y_test.shape

((54729, 163), (54729, 8), (18243, 163), (18243, 8))

In [16]:
# making our data compatible to model.
x_train = np.expand_dims(x_train, axis=2)
x_test = np.expand_dims(x_test, axis=2)
x_train.shape, y_train.shape, x_test.shape, y_test.shape

((54729, 163, 1), (54729, 8), (18243, 163, 1), (18243, 8))

## Modelling

In [18]:
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Dropout, Dense, GRU, LSTM, GlobalAveragePooling1D
def build_cnn(input_shape):
  # input_shape is (n_features, 1), i.e. (x_train.shape[1], 1)
  model=Sequential()
  model.add(Conv1D(256, kernel_size=5, strides=1, padding='same', activation='relu', input_shape=input_shape))
  model.add(MaxPooling1D(pool_size=5, strides = 2, padding = 'same'))

  model.add(Conv1D(512, kernel_size=5, strides=1, padding='same', activation='relu'))  # Increased filters
  model.add(MaxPooling1D(pool_size=5, strides=2, padding='same'))

  model.add(Conv1D(256, kernel_size=3, strides=1, padding='same', activation='relu'))
  model.add(MaxPooling1D(pool_size=5, strides = 2, padding = 'same'))

  model.add(Conv1D(128, kernel_size=5, strides=1, padding='same', activation='relu'))
  model.add(MaxPooling1D(pool_size=5, strides = 2, padding = 'same'))
  model.add(Dropout(0.3))

  model.add(Conv1D(64, kernel_size=5, strides=1, padding='same', activation='relu'))
  model.add(MaxPooling1D(pool_size=5, strides = 2, padding = 'same'))

  return model
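
It can help to trace how the sequence length shrinks through the network. The convolutions use `padding='same'` with stride 1, so they keep the length unchanged; each `MaxPooling1D` with `padding='same'` and `strides=2` reduces it to the ceiling of half its input. A short sketch of that arithmetic, with `pooled_lengths` as a hypothetical helper:

```python
import math

def pooled_lengths(n_steps, n_pool_layers=5, stride=2):
    lengths = []
    for _ in range(n_pool_layers):
        n_steps = math.ceil(n_steps / stride)   # 'same'-padded pooling
        lengths.append(n_steps)
    return lengths

print(pooled_lengths(163))  # [82, 41, 21, 11, 6]
```

Starting from 163 features, the CNN therefore hands a sequence of 6 time steps with 64 channels to the recurrent layers that follow.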


In [19]:
def add_lstm_layers(model):
    model.add(Dense(100, activation='relu'))  # Adding Dense layer before LSTM
    model.add(Dropout(0.3))

    # Adding LSTM Layer
    model.add(LSTM(100, return_sequences=True))
    model.add(Dropout(0.3))

    # Adding GRU Layer
    model.add(GRU(50, return_sequences=True))
    model.add(Dropout(0.3))

    model.add(LSTM(50, return_sequences=False))
    model.add(Dropout(0.3))

    # Adding Dense Layers
    model.add(Dense(units=32, activation='relu'))
    model.add(Dropout(0.3))

    # Output Layer
    model.add(Dense(units=8, activation='softmax'))

    return model

In [20]:
# Define the input shape
input_shape = (x_train.shape[1], 1)  # Adjust as per your data

# Build the CNN model
cnn_model = build_cnn(input_shape)

# Add LSTM layers to the CNN model
model = add_lstm_layers(cnn_model)

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Display the model summary
model.summary()



In [None]:
# multiply the learning rate by 0.4 whenever the training loss has not improved for 2 epochs
rlrp = ReduceLROnPlateau(monitor='loss', factor=0.4, verbose=0, patience=2, min_lr=0.0000001)
history=model.fit(x_train, y_train, batch_size=64, epochs=50, validation_data=(x_test, y_test), callbacks=[rlrp])

Epoch 1/50
856/856 ━━━━━━━━━━━━━━━━━━━━ 954s 1s/step - accuracy: 0.2496 - loss: 1.8319 - val_accuracy: 0.4056 - val_loss: 1.4690 - learning_rate: 0.0010
Epoch 2/50
856/856 ━━━━━━━━━━━━━━━━━━━━ 965s 1s/step - accuracy: 0.4048 - loss: 1.4888 - val_accuracy: 0.4403 - val_loss: 1.3849 - learning_rate: 0.0010
Epoch 3/50
856/856 ━━━━━━━━━━━━━━━━━━━━ 968s 1s/step - accuracy: 0.4430 - loss: 1.3896 - val_accuracy: 0.4702 - val_loss: 1.2919 - learning_rate: 0.0010
Epoch 4/50
856/856 ━━━━━━━━━━━━━━━━━━━━ 928s 1s/step - accuracy: 0.4732 - loss: 1.3282 - val_accuracy: 0.5052 - val_loss: 1.2171 - learning_rate: 0.0010
Epoch 5/50
856/856 ━━━━━━━━━━━━━━━━━━━━ 933s 1s/step - accuracy: 0.4948 - loss: 1.2737 - val_accuracy: 0.5155 - val_loss: 1.2002 - learning_rate: 0.0010
Epoch 6/50
271/856 ━━━━━━━━━━━━━━━━━━━━

In [None]:
print("Accuracy of our model on test data : " , model.evaluate(x_test,y_test)[1]*100 , "%")

epochs = range(len(history.history['loss']))  # number of epochs actually run
fig , ax = plt.subplots(1,2)
train_acc = history.history['accuracy']
train_loss = history.history['loss']
test_acc = history.history['val_accuracy']
test_loss = history.history['val_loss']

fig.set_size_inches(20,6)
ax[0].plot(epochs , train_loss , label = 'Training Loss')
ax[0].plot(epochs , test_loss , label = 'Testing Loss')
ax[0].set_title('Training & Testing Loss')
ax[0].legend()
ax[0].set_xlabel("Epochs")

ax[1].plot(epochs , train_acc , label = 'Training Accuracy')
ax[1].plot(epochs , test_acc , label = 'Testing Accuracy')
ax[1].set_title('Training & Testing Accuracy')
ax[1].legend()
ax[1].set_xlabel("Epochs")
plt.show()

In [None]:
# predicting on test data.
pred_test = model.predict(x_test)
y_pred = encoder.inverse_transform(pred_test)

y_test = encoder.inverse_transform(y_test)



In [None]:
df = pd.DataFrame(columns=['Predicted Labels', 'Actual Labels'])
df['Predicted Labels'] = y_pred.flatten()
df['Actual Labels'] = y_test.flatten()

df.head(10)

In [None]:
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize = (12, 10))
cm = pd.DataFrame(cm, index=encoder.categories_[0], columns=encoder.categories_[0])
sns.heatmap(cm, linecolor='white', cmap='Blues', linewidth=1, annot=True, fmt='')
plt.title('Confusion Matrix', size=20)
plt.xlabel('Predicted Labels', size=14)
plt.ylabel('Actual Labels', size=14)
plt.show()

In [None]:
print(classification_report(y_test, y_pred))

- We can see that our model is more accurate at predicting the surprise and angry emotions. This makes sense, because audio files of these emotions differ from the others in many ways, such as pitch and speed.
- Overall we achieved 68% accuracy on our test data. This is decent, but it could be improved further by applying more augmentation techniques and using other feature extraction methods.
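
The per-class observation above comes from reading the confusion matrix row by row. The sketch below shows how per-class recall and overall accuracy are derived from such a matrix. The labels and predictions are toy values made up for illustration, not results from this model, and `confusion` is a hypothetical helper rather than scikit-learn's function.

```python
import numpy as np

def confusion(y_true, y_pred, labels):
    index = {label: i for i, label in enumerate(labels)}
    matrix = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[index[t], index[p]] += 1     # rows: actual, columns: predicted
    return matrix

# Toy data for illustration only
labels = ["angry", "happy", "sad"]
y_true = ["angry", "angry", "happy", "sad", "sad", "sad"]
y_pred = ["angry", "happy", "happy", "sad", "sad", "angry"]

cm = confusion(y_true, y_pred, labels)
recall = cm.diagonal() / cm.sum(axis=1)     # correct predictions per actual class
accuracy = cm.diagonal().sum() / cm.sum()
print(cm, recall, accuracy)
```

A class with high recall has most of its row mass on the diagonal, which is the pattern to look for when comparing emotions in the heatmap.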

## Working on a New Audio File
* Load the saved model
* Extract features from the audio file
* Take the audio file path, process it and run the prediction
* Convert the prediction into an emotion

In [None]:
# You can load the model from wherever you saved it
model = load_model('/content/drive/MyDrive/My Folder/my_model1.keras')

In [None]:
def process_new_audio_file(file_path):
    features = get_features(file_path)
    return features

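def prepare_input_for_model(features):
    # features[0] is the feature vector of the un-augmented audio
    feature_vector = features[0].reshape(1, -1)
    # the model was trained on standardized features, so apply the same
    # StandardScaler (`scaler`) that was fitted on the training data
    feature_vector = scaler.transform(feature_vector)[0]
    # Reshape to match the input shape of the model: (n_features, 1)
    feature_vector = np.expand_dims(feature_vector, axis=1)
    return feature_vector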


In [None]:
filename = 'angry_anime'
file_path = f'/content/{filename}.wav'

def predict_emotion(file_path, model):
    features = process_new_audio_file(file_path)
    input_for_model = prepare_input_for_model(features)

    # Predict emotion
    prediction = model.predict(np.array([input_for_model]))
    return prediction

In [None]:


predicted_array = predict_emotion(file_path, model)
prediction = np.array(predicted_array)
predicted_index = np.argmax(prediction)
# Labels in the order used by the encoder (see encoder.categories_[0])
emotion_labels = ['angry', 'calm', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']

predicted_emotion = emotion_labels[predicted_index]
print("Predicted emotion:", predicted_emotion)
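
Since the model outputs a probability for every emotion, it can be informative to look beyond the single best guess. The sketch below ranks the labels by probability; `top_k_emotions` is a hypothetical helper and the probability vector is made up for illustration, not an actual model output.

```python
emotion_labels = ['angry', 'calm', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']

def top_k_emotions(probabilities, labels, k=3):
    # Pair each label with its probability and sort from most to least likely
    ranked = sorted(zip(labels, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical softmax output for one clip (illustrative values only)
example_probs = [0.55, 0.02, 0.08, 0.20, 0.05, 0.03, 0.04, 0.03]
top3 = top_k_emotions(example_probs, emotion_labels)
for label, prob in top3:
    print(f"{label}: {prob:.2f}")
```

A small gap between the first and second entries suggests the model is unsure, which is worth knowing before acting on a prediction.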
