#### <font color = "teal"> Datasets we're using in this project </font>

- RAVDESS https://www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio
- CREMA-D https://www.kaggle.com/datasets/ejlok1/cremad
- TESS https://www.kaggle.com/datasets/ejlok1/toronto-emotional-speech-set-tess
- NOR https://www.kaggle.com/datasets/omagarwal2411/nor-smart-speech


#### <font color = "teal"> Useful Research Papers for further development </font>

- AKÇAY, M.B. and OĞUZ, K., 2020. Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Communication, 116, 56–76.
https://www.sciencedirect.com/science/article/pii/S0167639319302262

- ISSA, D., FATIH DEMIRCI, M. and YAZICI, A., 2020. Speech emotion recognition with deep convolutional neural networks. Biomedical Signal Processing and Control, 59, 101894.
https://www.sciencedirect.com/science/article/abs/pii/S1746809420300501

- KHALIL, R.A., et al., 2019. Speech emotion recognition using deep learning techniques: A review. IEEE Access, 7, 117327–117345.
https://ieeexplore.ieee.org/abstract/document/8805181



## USING THE RAVDESS EMOTIONAL SPEECH DATASET

## File naming convention

- Each of the 1440 files has a unique filename. The filename consists of a 7-part numerical identifier (e.g., 03-01-06-01-02-01-12.wav). These identifiers define the stimulus characteristics:

- **Filename identifiers**

    - Modality (01 = full-AV, 02 = video-only, 03 = audio-only).

    - Vocal channel (01 = speech, 02 = song).

    - Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).

    - Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.

    - Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").

    - Repetition (01 = 1st repetition, 02 = 2nd repetition).

    - Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).

    - Filename example: 03-01-06-01-02-01-12.wav

        - Audio-only (03)
        - Speech (01)
        - Fearful (06)
        - Normal intensity (01)
        - Statement "dogs" (02)
        - 1st Repetition (01)
        - 12th Actor (12)
        - Female, as the actor ID number is even.

https://www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio
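The naming convention above can be expressed as a set of lookup tables. The sketch below is one possible way to decode a filename; `decode_ravdess_name` and the table names are hypothetical helpers, not part of the dataset or of any library. Unlike a chain of conditionals, a dictionary lookup fails loudly on a code that the convention does not define.

```python
# Lookup tables copied from the identifier list above.
MODALITY = {"01": "full-AV", "02": "video-only", "03": "audio-only"}
VOCAL_CHANNEL = {"01": "speech", "02": "song"}
EMOTION = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
           "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}
INTENSITY = {"01": "normal", "02": "strong"}
STATEMENT = {"01": "Kids are talking by the door",
             "02": "Dogs are sitting by the door"}
REPETITION = {"01": 1, "02": 2}


def decode_ravdess_name(filename):
    """Decode a 7-part RAVDESS filename (hypothetical helper).

    Raises ValueError for a wrong number of parts and KeyError for an unknown code.
    """
    stem = filename.rsplit(".", 1)[0]
    parts = stem.split("-")
    if len(parts) != 7:
        raise ValueError(f"expected 7 identifiers, got {len(parts)}: {filename}")
    actor = int(parts[6])
    return {
        "modality": MODALITY[parts[0]],
        "vocal_channel": VOCAL_CHANNEL[parts[1]],
        "emotion": EMOTION[parts[2]],
        "intensity": INTENSITY[parts[3]],
        "statement": STATEMENT[parts[4]],
        "repetition": REPETITION[parts[5]],
        "actor": actor,
        # Odd-numbered actors are male, even-numbered actors are female.
        "gender": "male" if actor % 2 == 1 else "female",
    }


info = decode_ravdess_name("03-01-06-01-02-01-12.wav")
print(info)
```

For the example filename this yields audio-only, speech, fearful, normal intensity, the "dogs" statement, first repetition, and actor 12 (female), matching the breakdown listed above.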

In [None]:
!pip install kagglehub
!pip install librosa



In [None]:
# Importing the necessary libraries

import os
import warnings

import kagglehub
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import Audio

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

### <font color = "teal"> Training CNN model on RAVDESS audio dataset</font>

In [None]:
pathRAVDESS = kagglehub.dataset_download("uwrfkaggler/ravdess-emotional-speech-audio")

print("Path to dataset files:", pathRAVDESS)

Path to dataset files: C:\Users\prakh\.cache\kagglehub\datasets\uwrfkaggler\ravdess-emotional-speech-audio\versions\1


In [None]:
audioDr = pathRAVDESS

audioFiles = []

for rt, dr, files in os.walk(audioDr):
    for audioName in files:
        if audioName.endswith(".wav"):

            filePath = os.path.join(rt, audioName)
            components = audioName.split("-")

            decodeInfo = {
                "Modality": "Full-AV" if components[0] == "01" else "Video-only" if components[0] == "02" else "Audio-only",
                "Vocal_Channel": "Speech" if components[1] == "01" else "Song",
                "Emotion": ["Neutral", "Calm", "Happy", "Sad", "Angry", "Fearful", "Disgust", "Surprised"][int(components[2]) - 1],
                "Emotional_Intensity": "Normal" if components[3] == "01" else "Strong",
                "Statement": "Kids are talking by the door" if components[4] == "01" else "Dogs are sitting by the door",
                "Repetition": "1st" if components[5] == "01" else "2nd",
                "Actor": int(os.path.splitext(components[6])[0]),
                "Gender": "Male" if int(os.path.splitext(components[6])[0])%2 != 0 else "Female",
                "File_Path": filePath
            }

            decodeInfo["AudioName"] = audioName

            audioFiles.append(decodeInfo)

AudioDf = pd.DataFrame(audioFiles)

print(AudioDf.head())

AudioDf.to_csv("Ravdess_Decoded_with_paths.csv", index = False)



     Modality Vocal_Channel  Emotion Emotional_Intensity  \
0  Audio-only        Speech  Neutral              Normal   
1  Audio-only        Speech  Neutral              Normal   
2  Audio-only        Speech  Neutral              Normal   
3  Audio-only        Speech  Neutral              Normal   
4  Audio-only        Speech     Calm              Normal   

                      Statement Repetition  Actor Gender  \
0  Kids are talking by the door        1st      1   Male   
1  Kids are talking by the door        2nd      1   Male   
2  Dogs are sitting by the door        1st      1   Male   
3  Dogs are sitting by the door        2nd      1   Male   
4  Kids are talking by the door        1st      1   Male   

                                           File_Path                 AudioName  
0  C:\Users\prakh\.cache\kagglehub\datasets\uwrfk...  03-01-01-01-01-01-01.wav  
1  C:\Users\prakh\.cache\kagglehub\datasets\uwrfk...  03-01-01-01-01-02-01.wav  
2  C:\Users\prakh\.cache\kagglehub\

In [None]:
extractDf = pd.read_csv("Ravdess_Decoded_with_paths.csv")

In [None]:
extractDf.head()

Unnamed: 0,Modality,Vocal_Channel,Emotion,Emotional_Intensity,Statement,Repetition,Actor,Gender,File_Path,AudioName
0,Audio-only,Speech,Neutral,Normal,Kids are talking by the door,1st,1,Male,C:\Users\prakh\.cache\kagglehub\datasets\uwrfk...,03-01-01-01-01-01-01.wav
1,Audio-only,Speech,Neutral,Normal,Kids are talking by the door,2nd,1,Male,C:\Users\prakh\.cache\kagglehub\datasets\uwrfk...,03-01-01-01-01-02-01.wav
2,Audio-only,Speech,Neutral,Normal,Dogs are sitting by the door,1st,1,Male,C:\Users\prakh\.cache\kagglehub\datasets\uwrfk...,03-01-01-01-02-01-01.wav
3,Audio-only,Speech,Neutral,Normal,Dogs are sitting by the door,2nd,1,Male,C:\Users\prakh\.cache\kagglehub\datasets\uwrfk...,03-01-01-01-02-02-01.wav
4,Audio-only,Speech,Calm,Normal,Kids are talking by the door,1st,1,Male,C:\Users\prakh\.cache\kagglehub\datasets\uwrfk...,03-01-02-01-01-01-01.wav


In [None]:
# Checking the unique values in the Vocal channel

extractDf.Vocal_Channel.unique()

array(['Speech'], dtype=object)

In [None]:
# Checking for the unique values in the Modality

extractDf.Modality.unique()

array(['Audio-only'], dtype=object)

In [None]:
emotionDf = extractDf[["Emotion", "File_Path"]]

In [None]:
emotionDf.head()

Unnamed: 0,Emotion,File_Path
0,Neutral,C:\Users\prakh\.cache\kagglehub\datasets\uwrfk...
1,Neutral,C:\Users\prakh\.cache\kagglehub\datasets\uwrfk...
2,Neutral,C:\Users\prakh\.cache\kagglehub\datasets\uwrfk...
3,Neutral,C:\Users\prakh\.cache\kagglehub\datasets\uwrfk...
4,Calm,C:\Users\prakh\.cache\kagglehub\datasets\uwrfk...


In [None]:
# Listening to one clip from emotionDf

AudioPath = emotionDf.loc[10, "File_Path"]
print(f"Listening to the audio with emotion : {emotionDf.loc[10, 'Emotion']}")
Audio(AudioPath)

Listening to the audio with emotion : Calm


In [None]:
# Checking for all the unique emotions in the dataset

print(f"The different emotions in the dataset are : {emotionDf['Emotion'].unique()}")
print("*"*100)
print(f"There are {emotionDf['Emotion'].nunique()} number of unique emotions in the dataset")


The different emotions in the dataset are : ['Neutral' 'Calm' 'Happy' 'Sad' 'Angry' 'Fearful' 'Disgust' 'Surprised']
****************************************************************************************************
There are 8 number of unique emotions in the dataset


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import layers, models
from tqdm import tqdm

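# Extracting features from the audio files

def extract_features(file_path, n_mfcc=40):
    # sr=None keeps each file's native sampling rate instead of resampling
    audio, sample_rate = librosa.load(file_path, sr=None)
    # mfccs has shape (n_mfcc, n_frames)
    mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=n_mfcc)
    # Averaging over frames gives one fixed-length vector of n_mfcc values per clip
    mfccs = np.mean(mfccs.T, axis=0)
    return mfccs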

In [None]:
# Preparing the Dataset
FeatureList = []
EmotionList = []

for index, row in tqdm(emotionDf.iterrows(), total=emotionDf.shape[0]):
    try:
        features = extract_features(row['File_Path'])
        FeatureList.append(features)
        EmotionList.append(row['Emotion'])
    except Exception as e:
        print(f"Error processing file {row['File_Path']}: {e}")

# Converting to NumPy arrays
Features = np.array(FeatureList)
Labels = np.array(EmotionList)

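# Encoding Labels to Integers
# NOTE: the next line rebinds the name LabelEncoder from the class to an instance.
# Later cells reuse that instance, and re-running this cell without re-importing
# the class raises a TypeError.
LabelEncoder = LabelEncoder()
EncodedLabels = LabelEncoder.fit_transform(Labels)
CategoricalLabels = to_categorical(EncodedLabels)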

# Splitting the Dataset
XTrain, XTest, YTrain, YTest = train_test_split(Features, CategoricalLabels, test_size=0.2, random_state=42)

# Reshaping the Features for CNN Input
XTrain = XTrain[..., np.newaxis]
XTest = XTest[..., np.newaxis]

100%|██████████| 2880/2880 [01:59<00:00, 24.03it/s]
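The progress bar reports 2880 files, twice the 1440 described in the naming convention. This suggests that every clip was found twice, for example because the download contains a second copy of the actor folders. If so, a random train/test split can place the same recording on both sides and inflate the test score. The sketch below shows one way to keep a single path per filename; `drop_duplicate_clips` and the example paths are made up for illustration.

```python
import os


def drop_duplicate_clips(paths):
    """Keep the first path seen for each base filename (hypothetical helper)."""
    first_seen = {}
    for path in paths:
        name = os.path.basename(path)
        # setdefault only stores the path if this filename has not been seen yet
        first_seen.setdefault(name, path)
    return list(first_seen.values())


# Made-up paths: the same recording appears under two different folders
paths = [
    "data/Actor_01/03-01-01-01-01-01-01.wav",
    "data/copy/Actor_01/03-01-01-01-01-01-01.wav",
    "data/Actor_02/03-01-01-01-01-01-02.wav",
]
unique_paths = drop_duplicate_clips(paths)
print(len(paths), "->", len(unique_paths))
```

With the DataFrame built earlier, the equivalent pandas call would be `AudioDf.drop_duplicates(subset="AudioName")`, applied before feature extraction.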


In [None]:
# Defining the CNN Model
Model = models.Sequential([
    layers.Conv1D(64, kernel_size=3, activation='relu', input_shape=(Features.shape[1], 1)),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(128, kernel_size=3, activation='relu'),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(CategoricalLabels.shape[1], activation='softmax')
])

# Compiling the Model
Model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
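To see what the dense layer receives, it helps to trace the sequence length through the network. The sketch below assumes the Keras defaults used above (unpadded convolutions with stride 1, pooling stride equal to the pool size) and 40 MFCC coefficients per clip; the two helper functions are hypothetical.

```python
def conv1d_out(length, kernel_size):
    """Output length of an unpadded ('valid') 1-D convolution with stride 1."""
    return length - kernel_size + 1


def maxpool1d_out(length, pool_size):
    """Output length of max pooling whose stride equals the pool size."""
    return length // pool_size


length = 40                        # one MFCC vector per clip
length = conv1d_out(length, 3)     # after Conv1D(64, kernel_size=3)
length = maxpool1d_out(length, 2)  # after MaxPooling1D(pool_size=2)
length = conv1d_out(length, 3)     # after Conv1D(128, kernel_size=3)
length = maxpool1d_out(length, 2)  # after MaxPooling1D(pool_size=2)
flat_size = length * 128           # Flatten: positions x filters of the last conv
print(length, flat_size)
```

Under these assumptions the sequence shrinks from 40 to 38, 19, 17, and finally 8 positions, so `Flatten` hands 1024 values to `Dense(128)`.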

In [None]:
# Training the Model

History = Model.fit(XTrain, YTrain, epochs=30, batch_size=32, validation_data=(XTest, YTest))

Epoch 1/30
72/72 ━━━━━━━━━━━━━━━━━━━━ 4s 13ms/step - accuracy: 0.1334 - loss: 5.2944 - val_accuracy: 0.1389 - val_loss: 2.0205
Epoch 2/30
72/72 ━━━━━━━━━━━━━━━━━━━━ 1s 8ms/step - accuracy: 0.1725 - loss: 2.0515 - val_accuracy: 0.2222 - val_loss: 2.0186
Epoch 3/30
72/72 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.1930 - loss: 2.0274 - val_accuracy: 0.1944 - val_loss: 1.9549
Epoch 4/30
72/72 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.1626 - loss: 2.0042 - val_accuracy: 0.2257 - val_loss: 1.9453
Epoch 5/30
72/72 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.1736 - loss: 1.9914 - val_accuracy: 0.2188 - val_loss: 1.8963
Epoch 6/30
72/72 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.1970 - loss: 1.9833 - val_accuracy: 0.2569 - val_loss: 1.9364
Epoch 7/30
72/72 ━━━━━━━━━

In [None]:
# Evaluating the Model

TestLoss, TestAccuracy = Model.evaluate(XTest, YTest)
print(f"Test Loss: {TestLoss:.4f}")
print(f"Test Accuracy: {TestAccuracy:.4f}")

18/18 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.6273 - loss: 1.0951
Test Loss: 1.1014
Test Accuracy: 0.6215
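The accuracy above comes from a random split, so clips from the same actor can appear in both the training and the test set. A stricter check holds out whole actors. The sketch below is one way to do that with the standard library only; `split_by_actor` is a hypothetical helper and the toy actor list simply mimics 24 actors with 60 clips each.

```python
import random


def split_by_actor(actor_ids, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) so that no actor appears in both (hypothetical helper)."""
    actors = sorted(set(actor_ids))
    rng = random.Random(seed)
    rng.shuffle(actors)
    n_test = max(1, round(len(actors) * test_fraction))
    test_actors = set(actors[:n_test])
    train_idx = [i for i, actor in enumerate(actor_ids) if actor not in test_actors]
    test_idx = [i for i, actor in enumerate(actor_ids) if actor in test_actors]
    return train_idx, test_idx


# Toy data: 24 actors, 60 clips each (1440 clips in total)
actor_ids = [actor for actor in range(1, 25) for _ in range(60)]
train_idx, test_idx = split_by_actor(actor_ids)
print(len(train_idx), len(test_idx))
```

In the notebook, the `Actor` column of `extractDf` could supply `actor_ids`. scikit-learn's `GroupShuffleSplit` offers the same idea as a ready-made splitter.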


In [None]:
# Creating a new dataframe containing intensity and gender as well

intensityDf = extractDf[["Emotion", "File_Path", "Emotional_Intensity", "Gender"]]

In [None]:
intensityDf.head()

Unnamed: 0,Emotion,File_Path,Emotional_Intensity,Gender
0,Neutral,C:\Users\prakh\.cache\kagglehub\datasets\uwrfk...,Normal,Male
1,Neutral,C:\Users\prakh\.cache\kagglehub\datasets\uwrfk...,Normal,Male
2,Neutral,C:\Users\prakh\.cache\kagglehub\datasets\uwrfk...,Normal,Male
3,Neutral,C:\Users\prakh\.cache\kagglehub\datasets\uwrfk...,Normal,Male
4,Calm,C:\Users\prakh\.cache\kagglehub\datasets\uwrfk...,Normal,Male


In [None]:
# Checking for the types of Emotional Intensity

intensityDf.Emotional_Intensity.unique()

array(['Normal', 'Strong'], dtype=object)

In [None]:
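# Preparing Additional Features
# Gender and intensity are read from the filename annotations, so this model
# also needs them as side information at prediction time.
GenderMapping = {"Male": 0, "Female": 1}
IntensityMapping = {"Normal": 0, "Strong": 1}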

FeatureList = []
AdditionalFeatures = []
EmotionList = []

for index, row in tqdm(intensityDf.iterrows(), total=intensityDf.shape[0]):
    try:
        # Extracting MFCC Features
        features = extract_features(row['File_Path'])
        FeatureList.append(features)

        # Additional Features
        gender = GenderMapping[row['Gender']]
        intensity = IntensityMapping[row['Emotional_Intensity']]
        AdditionalFeatures.append([gender, intensity])

        # Emotion Labels
        EmotionList.append(row['Emotion'])
    except Exception as e:
        print(f"Error processing file {row['File_Path']}: {e}")

# Converting to NumPy Arrays
AudioFeatures = np.array(FeatureList)
AdditionalFeatures = np.array(AdditionalFeatures)
Labels = np.array(EmotionList)

# Encoding Labels
EncodedLabels = LabelEncoder.fit_transform(Labels)
CategoricalLabels = to_categorical(EncodedLabels)

# Splitting Data
XTrainAudio, XTestAudio, XTrainAdditional, XTestAdditional, YTrain, YTest = train_test_split(
    AudioFeatures, AdditionalFeatures, CategoricalLabels, test_size=0.2, random_state=42)

# Reshaping Audio Features
XTrainAudio = XTrainAudio[..., np.newaxis]
XTestAudio = XTestAudio[..., np.newaxis]

100%|██████████| 2880/2880 [01:20<00:00, 35.85it/s]


In [None]:
# Audio Branch
AudioInput = layers.Input(shape=(AudioFeatures.shape[1], 1))
AudioBranch = layers.Conv1D(64, kernel_size=3, activation='relu')(AudioInput)
AudioBranch = layers.MaxPooling1D(pool_size=2)(AudioBranch)
AudioBranch = layers.Conv1D(128, kernel_size=3, activation='relu')(AudioBranch)
AudioBranch = layers.MaxPooling1D(pool_size=2)(AudioBranch)
AudioBranch = layers.Flatten()(AudioBranch)

# Additional Features Branch
AdditionalInput = layers.Input(shape=(AdditionalFeatures.shape[1],))
AdditionalBranch = layers.Dense(16, activation='relu')(AdditionalInput)
AdditionalBranch = layers.Dense(8, activation='relu')(AdditionalBranch)

# Merging Branches
Merged = layers.concatenate([AudioBranch, AdditionalBranch])
DenseLayer = layers.Dense(128, activation='relu')(Merged)
DenseLayer = layers.Dropout(0.5)(DenseLayer)
OutputLayer = layers.Dense(CategoricalLabels.shape[1], activation='softmax')(DenseLayer)

# Model
Model = models.Model(inputs=[AudioInput, AdditionalInput], outputs=OutputLayer)

# Compile
Model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train
History = Model.fit(
    [XTrainAudio, XTrainAdditional], YTrain,
    epochs=30,
    batch_size=32,
    validation_data=([XTestAudio, XTestAdditional], YTest)
)

Epoch 1/30
72/72 ━━━━━━━━━━━━━━━━━━━━ 6s 12ms/step - accuracy: 0.1566 - loss: 4.1731 - val_accuracy: 0.2274 - val_loss: 2.0026
Epoch 2/30
72/72 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.1874 - loss: 1.9980 - val_accuracy: 0.2535 - val_loss: 1.9103
Epoch 3/30
72/72 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.2338 - loss: 1.9506 - val_accuracy: 0.2170 - val_loss: 1.9041
Epoch 4/30
72/72 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.2359 - loss: 1.9153 - val_accuracy: 0.3403 - val_loss: 1.8263
Epoch 5/30
72/72 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.2724 - loss: 1.8749 - val_accuracy: 0.3542 - val_loss: 1.7888
Epoch 6/30
72/72 ━━━━━━━━━━━━━━━━━━━━ 1s 8ms/step - accuracy: 0.2882 - loss: 1.8352 - val_accuracy: 0.3993 - val_loss: 1.7730
Epoch 7/30
72/72 ━━━━━━━━━

In [None]:
# Evaluating the model on the test data
TestLoss, TestAccuracy = Model.evaluate([XTestAudio, XTestAdditional], YTest, verbose=0)

print(f"Test Loss: {TestLoss:.4f}")
print(f"Test Accuracy: {TestAccuracy:.4f}")

Test Loss: 0.5540
Test Accuracy: 0.8177
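A single accuracy figure does not show which emotions are confused with each other. The sketch below builds a confusion matrix and per-class recall from integer class ids using plain Python; both helper functions and the toy labels are made up for illustration.

```python
def confusion_matrix(true_ids, pred_ids, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_id, pred_id in zip(true_ids, pred_ids):
        matrix[true_id][pred_id] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    recalls = []
    for class_id, row in enumerate(matrix):
        total = sum(row)
        recalls.append(row[class_id] / total if total else 0.0)
    return recalls


# Toy example with three classes
true_ids = [0, 0, 1, 1, 2, 2]
pred_ids = [0, 1, 1, 1, 2, 0]
matrix = confusion_matrix(true_ids, pred_ids, num_classes=3)
recalls = per_class_recall(matrix)
print(matrix)
print(recalls)
```

In the notebook, the ids could be obtained with `YTest.argmax(axis=1)` and `Model.predict([XTestAudio, XTestAdditional]).argmax(axis=1)`.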


### <font color = "teal"> Training CNN model on CREMA-D audio dataset</font>

### CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset)

##### <font color = "red"> *Summary*</font>

- CREMA-D is a dataset of 7,442 original clips from 91 actors: 48 male and 43 female, between the ages of 20 and 74, from a variety of races and ethnicities (African American, Asian, Caucasian, Hispanic, and Unspecified).

- Actors spoke from a selection of 12 sentences. The sentences were presented using one of six different emotions (Anger, Disgust, Fear, Happy, Neutral and Sad) and four different emotion levels (Low, Medium, High and Unspecified).

- Participants rated the emotion and emotion levels based on the combined audiovisual presentation, the video alone, and the audio alone. Due to the large number of ratings needed, this effort was crowd-sourced and a total of 2443 participants each rated 90 unique clips, 30 audio, 30 visual, and 30 audio-visual. 95% of the clips have more than 7 ratings.

https://github.com/CheyneyComputerScience/CREMA-D

In [None]:
pathCREMA_D = kagglehub.dataset_download("ejlok1/cremad")

print("Path to dataset files:", pathCREMA_D)

Path to dataset files: C:\Users\prakh\.cache\kagglehub\datasets\ejlok1\cremad\versions\1


In [None]:
# Examining the filename encoding for the CREMA-D dataset (first .wav file found)

firstWav = next(
    (file for root, dirs, files in os.walk(pathCREMA_D) for file in files if file.endswith(".wav")),
    None,
)
print(firstWav)

1001_DFA_ANG_XX.wav


#### <font color = "red"> Filename labeling conventions</font>

- The actor ID is a 4-digit number at the start of the filename. Each subsequent identifier is separated by an underscore (_).

- Actors spoke from a selection of 12 sentences (the three-letter acronym in parentheses is the second part of the filename):

    - It's eleven o'clock (IEO).
    - That is exactly what happened (TIE).
    - I'm on my way to the meeting (IOM).
    - I wonder what this is about (IWW).
    - The airplane is almost full (TAI).
    - Maybe tomorrow it will be cold (MTI).
    - I would like a new alarm clock (IWL).
    - I think I have a doctor's appointment (ITH).
    - Don't forget a jacket (DFA).
    - I think I've seen this before (ITS).
    - The surface is slick (TSI).
    - We'll stop in a couple of minutes (WSI).

- The sentences were presented using one of six emotions (the three-letter code in parentheses is the third part of the filename):

    - Anger (ANG)
    - Disgust (DIS)
    - Fear (FEA)
    - Happy/Joy (HAP)
    - Neutral (NEU)
    - Sad (SAD)

- Each clip also has one of four emotion levels (the two-letter code in parentheses is the fourth part of the filename):

    - Low (LO)
    - Medium (MD)
    - High (HI)
    - Unspecified (XX)


- The filename suffix indicates the file type: flv (Flash video) for the video-only and audio-visual clips, mp3 for the audio-only presentation of the clips, and wav for computational audio processing.

https://github.com/CheyneyComputerScience/CREMA-D

In [None]:
# Initializing lists
FilePaths = []
SpeakerIDs = []
Sentences = []
Emotions = []
Intensities = []

# Sentence mapping
SentenceMapping = {
    "IEO": "It's eleven o'clock",
    "TIE": "That is exactly what happened",
    "IOM": "I'm on my way to the meeting",
    "IWW": "I wonder what this is about",
    "TAI": "The airplane is almost full",
    "MTI": "Maybe tomorrow it will be cold",
    "IWL": "I would like a new alarm clock",
    "ITH": "I think I have a doctor's appointment",
    "DFA": "Don't forget a jacket",
    "ITS": "I think I've seen this before",
    "TSI": "The surface is slick",
    "WSI": "We'll stop in a couple of minutes"
}

# Emotion mapping
EmotionMapping = {
    "ANG": "Angry",
    "DIS": "Disgust",
    "FEA": "Fearful",
    "HAP": "Happy",
    "NEU": "Neutral",
    "SAD": "Sad"
}

# Intensity mapping
IntensityMapping = {
    "LO": "Low",
    "MD": "Medium",
    "HI": "High",
    "XX": "Unspecified"
}



# Extracting the metadata

for root, dirs, files in os.walk(pathCREMA_D):
    for file in files:
        if file.endswith(".wav"):
            FilePaths.append(os.path.join(root, file))
            parts = file.split("_")

            SpeakerIDs.append(parts[0])
            Sentences.append(SentenceMapping.get(parts[1], "Unknown"))
            Emotions.append(EmotionMapping.get(parts[2], "Unknown"))
            Intensities.append(IntensityMapping.get(parts[3].split(".")[0], "Unknown"))

# Creating a DataFrame
CremaDDf = pd.DataFrame({
    "FilePath": FilePaths,
    "SpeakerID": SpeakerIDs,
    "Sentence": Sentences,
    "Emotion": Emotions,
    "Intensity": Intensities
})

# Checking the first five rows
print(CremaDDf.head())


                                            FilePath SpeakerID  \
0  C:\Users\prakh\.cache\kagglehub\datasets\ejlok...      1001   
1  C:\Users\prakh\.cache\kagglehub\datasets\ejlok...      1001   
2  C:\Users\prakh\.cache\kagglehub\datasets\ejlok...      1001   
3  C:\Users\prakh\.cache\kagglehub\datasets\ejlok...      1001   
4  C:\Users\prakh\.cache\kagglehub\datasets\ejlok...      1001   

                Sentence  Emotion    Intensity  
0  Don't forget a jacket    Angry  Unspecified  
1  Don't forget a jacket  Disgust  Unspecified  
2  Don't forget a jacket  Fearful  Unspecified  
3  Don't forget a jacket    Happy  Unspecified  
4  Don't forget a jacket  Neutral  Unspecified  


In [None]:
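FixedTimeSteps = 200

def ExtractMelSpectrogram(filePath, n_mels=128, duration=3, sr=22050, fixed_time_steps=FixedTimeSteps):
    try:
        # Load at most `duration` seconds, resampled to `sr`
        y, sr = librosa.load(filePath, sr=sr, duration=duration)
        mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
        # With ref=np.max the loudest bin becomes 0 dB and everything else is negative
        mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)

        # Pad or truncate the time axis so every clip has fixed_time_steps frames.
        # NOTE: mode='constant' pads with 0, which on this dB scale equals the
        # clip's peak level rather than silence.
        if mel_spec_db.shape[1] < fixed_time_steps:
            pad_width = fixed_time_steps - mel_spec_db.shape[1]
            mel_spec_db = np.pad(mel_spec_db, ((0, 0), (0, pad_width)), mode='constant')
        else:
            mel_spec_db = mel_spec_db[:, :fixed_time_steps]

        return mel_spec_db
    except Exception as e:
        print(f"Error loading {filePath}: {e}")
        return None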

MelSpecs = []
Labels = []

for index, row in tqdm(CremaDDf.iterrows(), total=CremaDDf.shape[0]):
    mel_spec = ExtractMelSpectrogram(row["FilePath"])
    if mel_spec is not None:
        MelSpecs.append(mel_spec)
        Labels.append(row["Emotion"])

MelSpecs = np.array(MelSpecs)
Labels = np.array(Labels)

MelSpecs = MelSpecs.reshape(MelSpecs.shape[0], MelSpecs.shape[1], MelSpecs.shape[2], 1)

print("Final Shape of MelSpecs:", MelSpecs.shape)

100%|██████████| 7442/7442 [01:57<00:00, 63.34it/s] 


Final Shape of MelSpecs: (7442, 128, 200, 1)
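The width of 200 frames deserves a quick check. Assuming librosa's default hop length of 512 samples and a centred STFT, a 3-second clip at 22050 Hz gives about 130 frames, so every spectrogram above is padded on the right. Because the spectrogram is in dB relative to its own maximum, padding with 0 fills those columns with the peak level. The sketch below computes the expected frame count and shows an alternative that pads with the quietest value instead; `expected_frames` and `pad_or_truncate` are hypothetical helpers, and the toy array stands in for a real spectrogram.

```python
import numpy as np


def expected_frames(duration_s, sr=22050, hop_length=512):
    """Frame count of a centred STFT: 1 + n_samples // hop_length."""
    return 1 + int(duration_s * sr) // hop_length


def pad_or_truncate(spec_db, fixed_time_steps, pad_value=None):
    """Fix the time axis of a (n_mels, n_frames) array (hypothetical helper).

    By default the padding uses the array's own minimum, i.e. its quietest level.
    """
    if pad_value is None:
        pad_value = spec_db.min()
    n_frames = spec_db.shape[1]
    if n_frames < fixed_time_steps:
        pad_width = fixed_time_steps - n_frames
        return np.pad(spec_db, ((0, 0), (0, pad_width)),
                      mode="constant", constant_values=pad_value)
    return spec_db[:, :fixed_time_steps]


n_frames = expected_frames(3)
print(n_frames)

# Toy dB spectrogram: 128 mel bands, values between -80 and 0
rng = np.random.default_rng(0)
toy = rng.uniform(-80.0, 0.0, size=(128, n_frames))
padded = pad_or_truncate(toy, 200)
print(padded.shape)
```

An alternative under the same assumptions would be to lower the fixed width to the expected frame count, which avoids most of the padding altogether.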


In [None]:
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical



LabelEncoder = LabelEncoder()
EncodedLabels = LabelEncoder.fit_transform(Labels)
CategoricalLabels = to_categorical(EncodedLabels)
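scikit-learn's `LabelEncoder` assigns integer codes in sorted order of the class names, and `to_categorical` turns each code into a one-hot row. The plain-Python sketch below mirrors those two steps on a toy label list so that the mapping from column index back to emotion name is explicit; the variable names are illustrative only.

```python
labels = ["Happy", "Angry", "Sad", "Neutral", "Disgust", "Fearful", "Happy"]

# Step 1: sorted unique class names define the integer codes
classes = sorted(set(labels))
class_to_id = {name: index for index, name in enumerate(classes)}
encoded = [class_to_id[name] for name in labels]

# Step 2: one-hot rows with one column per class
one_hot = [[1.0 if column == code else 0.0 for column in range(len(classes))]
           for code in encoded]

print(classes)
print(encoded)
```

With the fitted encoder in the notebook, `classes_` holds the same sorted list, and `inverse_transform` maps predicted ids back to emotion names.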

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

def BuildCNNModel(inputShape, numClasses):
    Model = Sequential([
        Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=inputShape),
        MaxPooling2D(pool_size=(2, 2)),

        Conv2D(64, kernel_size=(3, 3), activation='relu'),
        MaxPooling2D(pool_size=(2, 2)),

        Flatten(),
        Dense(128, activation='relu'),
        Dropout(0.5),
        Dense(numClasses, activation='softmax')
    ])

    Model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return Model

InputShape = (MelSpecs.shape[1], MelSpecs.shape[2], 1)
NumClasses = len(LabelEncoder.classes_)
Model = BuildCNNModel(InputShape, NumClasses)

Model.fit(MelSpecs, CategoricalLabels, epochs=20, batch_size=32, validation_split=0.2)

Epoch 1/20
187/187 ━━━━━━━━━━━━━━━━━━━━ 109s 559ms/step - accuracy: 0.1997 - loss: 10.7209 - val_accuracy: 0.2216 - val_loss: 1.7495
Epoch 2/20
187/187 ━━━━━━━━━━━━━━━━━━━━ 99s 529ms/step - accuracy: 0.2063 - loss: 1.7879 - val_accuracy: 0.2283 - val_loss: 1.7634
Epoch 3/20
187/187 ━━━━━━━━━━━━━━━━━━━━ 100s 535ms/step - accuracy: 0.2238 - loss: 1.7656 - val_accuracy: 0.2290 - val_loss: 1.7599
Epoch 4/20
187/187 ━━━━━━━━━━━━━━━━━━━━ 102s 543ms/step - accuracy: 0.2149 - loss: 1.7688 - val_accuracy: 0.2357 - val_loss: 1.7524
Epoch 5/20
187/187 ━━━━━━━━━━━━━━━━━━━━ 102s 547ms/step - accuracy: 0.2134 - loss: 1.7716 - val_accuracy: 0.2250 - val_loss: 1.7500
Epoch 6/20
187/187 ━━━━━━━━━━━━━━━━━━━━ 99s 528ms/step - accuracy: 0.2307 - loss: 1.7641 - val_accuracy: 0.2270 - val_loss: 1.7510
Epoch

<keras.src.callbacks.history.History at 0x255ab548410>

In [None]:
# NOTE: this evaluates on the full dataset, including the samples the model was
# trained on, so the figures below are not a held-out test score.
TestLoss, TestAccuracy = Model.evaluate(MelSpecs, CategoricalLabels)
print(f"Test Loss: {TestLoss:.4f}")
print(f"Test Accuracy: {TestAccuracy:.4f}")

233/233 ━━━━━━━━━━━━━━━━━━━━ 22s 95ms/step - accuracy: 0.7260 - loss: 0.7860
Test Loss: 0.9476
Test Accuracy: 0.6755
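The figures above cover all 7442 clips, most of which were used for training. Keras takes the `validation_split` samples from the end of the arrays, before shuffling, so the held-out portion can be recovered by slicing. The sketch below computes where that slice starts, assuming the split index is the floor of `num_samples * (1 - validation_split)`; `validation_start` is a hypothetical helper.

```python
import math


def validation_start(num_samples, validation_split):
    """Index of the first held-out sample when the tail of the data is used for validation."""
    return int(math.floor(num_samples * (1.0 - validation_split)))


start = validation_start(7442, 0.2)
num_validation = 7442 - start
train_batches = math.ceil(start / 32)
print(start, num_validation, train_batches)
```

The resulting 187 training batches agree with the `187/187` shown in the training log. Under that assumption, `Model.evaluate(MelSpecs[start:], CategoricalLabels[start:])` would report the score on clips the model did not train on. If the files were listed speaker by speaker, that tail also consists of the last speakers in the listing.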
