<a href="https://colab.research.google.com/github/mohripan/Job-Interview-Analysis/blob/main/RAVDESS_emotions_speech_audio.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Download the Dataset

To download the dataset, I keep my Kaggle API key in Google Drive so I don't need to upload it every time.

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
!cp '/content/gdrive/MyDrive/KaggleAPI/kaggle.json' kaggle.json

In [3]:
!pip install -q kaggle

In [4]:
!mkdir -p ~/.kaggle

In [5]:
!cp kaggle.json ~/.kaggle/

In [6]:
!chmod 600 ~/.kaggle/kaggle.json

In [7]:
!kaggle datasets download -d uwrfkaggler/ravdess-emotional-speech-audio

Downloading ravdess-emotional-speech-audio.zip to /content
 96% 411M/429M [00:02<00:00, 184MB/s]
100% 429M/429M [00:02<00:00, 165MB/s]


In [8]:
!mkdir ravdess

In [9]:
!unzip -q /content/ravdess-emotional-speech-audio.zip -d ravdess

# EDA

The first thing I want to do is listen to one of the audio clips in the dataset and print its label. You can find the details of the dataset [here](https://www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio), but here is the big picture.

The RAVDESS dataset is designed for emotion classification. The dataset includes 24 actors who vocalize different emotions across various statements. The file naming convention provided in the dataset description contains the information you need to identify the emotion associated with each audio file.

Here's a quick summary of the filename identifiers, focusing on the emotion part:
- Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised)

For example, in the filename "03-01-06-01-02-01-12.wav":
- The "06" in the filename represents the emotion, which is "fearful" in this case.
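
As a small illustration of the naming convention, the sketch below splits a filename into named fields. The field names other than `emotion` follow the order given in the dataset description and are an assumption here, since this notebook only uses the emotion field; `parse_ravdess_filename` is a hypothetical helper, not part of any library.

```python
from typing import Dict

# Field order assumed from the RAVDESS dataset description;
# only "emotion" is used elsewhere in this notebook.
FIELDS = ("modality", "vocal_channel", "emotion", "intensity",
          "statement", "repetition", "actor")

EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def parse_ravdess_filename(filename: str) -> Dict[str, str]:
    """Split a RAVDESS filename into its seven identifier fields."""
    stem = filename.rsplit(".", 1)[0]          # drop the ".wav" extension
    parts = stem.split("-")
    if len(parts) != len(FIELDS):
        raise ValueError(f"Unexpected filename format: {filename!r}")
    info = dict(zip(FIELDS, parts))
    info["emotion_label"] = EMOTIONS[info["emotion"]]
    return info

info = parse_ravdess_filename("03-01-06-01-02-01-12.wav")
print(info["emotion"], info["emotion_label"], info["actor"])  # 06 fearful 12
```

Note that the emotion identifier is the third field, which is index 2 after splitting on `-`.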

In [64]:
import os
import random
import pandas as pd
from IPython.display import Audio

In [65]:
# Define the path to the dataset folder
dataset_path = '/content/ravdess'

# Define a mapping for emotion identifiers to emotion labels
emotion_mapping = {
    '01': 'neutral',
    '02': 'calm',
    '03': 'happy',
    '04': 'sad',
    '05': 'angry',
    '06': 'fearful',
    '07': 'disgust',
    '08': 'surprised'
}

In [66]:
# Initialize an empty list to store the audio file paths
audio_files = []

# Iterate through the folders (actor_01 to actor_24)
for actor_folder in os.listdir(dataset_path):
    actor_path = os.path.join(dataset_path, actor_folder)
    
    # Check if the path is a directory (to avoid processing non-directory files)
    if os.path.isdir(actor_path):
        
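        # Iterate through the audio files in the actor folder
        for audio_file in os.listdir(actor_path):
            # Keep only .wav files so nested folders or stray files are never selected
            if audio_file.endswith('.wav'):
                audio_path = os.path.join(actor_path, audio_file)
                audio_files.append(audio_path)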
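
The same folder walk is repeated several times in this notebook. As an optional alternative, here is a minimal sketch using `pathlib`, assuming the layout used above (actor folders directly under the dataset root, `.wav` files inside them). `collect_wav_files` is a hypothetical helper name.

```python
from pathlib import Path
from typing import List

def collect_wav_files(root: str) -> List[str]:
    """Return the sorted paths of all .wav files exactly one folder below `root`."""
    # "*/*.wav" matches <root>/<any folder>/<name>.wav and nothing deeper
    return sorted(str(p) for p in Path(root).glob("*/*.wav"))

# Usage in this notebook would look like:
# audio_files = collect_wav_files('/content/ravdess')
```

Sorting the result makes the file order reproducible, which `os.listdir` does not guarantee.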

In [67]:
# Select a random audio file
random_audio_file = random.choice(audio_files)

# Extract the emotion identifier from the filename (3rd part of the filename, index 2)
emotion_id = os.path.basename(random_audio_file).split('-')[2]

# Map the emotion identifier to the corresponding emotion label
emotion_label = emotion_mapping[emotion_id]

# Print the emotion label and play the audio
print(f"Emotion: {emotion_label}")
Audio(filename=random_audio_file)

Emotion: disgust


You can run the cell above several times and check that each emotion label matches what you hear.
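
Besides listening to single clips, it can help to see how many clips each emotion has. The sketch below counts labels from filenames alone; `count_emotions` and the three example paths are made up for illustration and are not taken from the dataset listing.

```python
import os
from collections import Counter

EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def count_emotions(paths):
    """Count clips per emotion label, skipping names that do not fit the scheme."""
    counts = Counter()
    for path in paths:
        parts = os.path.basename(path).split("-")
        if len(parts) > 2 and parts[2] in EMOTIONS:
            counts[EMOTIONS[parts[2]]] += 1
    return counts

# Hypothetical example paths (in the notebook, pass `audio_files` instead)
example_paths = [
    "Actor_01/03-01-06-01-02-01-01.wav",
    "Actor_01/03-01-06-02-01-01-01.wav",
    "Actor_02/03-01-04-01-01-01-02.wav",
]
print(count_emotions(example_paths))
```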

# Pre-processing The Dataset

After downloading the dataset, the next step is to preprocess it before feeding it into the model. The usual preprocessing steps are:
- Resampling
- Noise Reduction (listed for completeness; the code below does not apply it)
- Trimming Silence
- Normalization

The goal now is to preprocess every file and save the results in a separate folder.
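
To make two of these steps concrete, here is a simplified NumPy sketch of silence trimming and peak normalization on a synthetic signal. It is a stand-in for illustration only: it thresholds individual samples, whereas `librosa.effects.trim` works on frame-wise energy, and the helper names are hypothetical.

```python
import numpy as np

def peak_normalize(signal: np.ndarray) -> np.ndarray:
    """Scale the signal so that its largest absolute sample equals 1."""
    peak = np.max(np.abs(signal))
    return signal if peak == 0 else signal / peak

def trim_silence(signal: np.ndarray, top_db: float = 30.0) -> np.ndarray:
    """Drop leading/trailing samples that are more than `top_db` dB below the peak."""
    peak = np.max(np.abs(signal))
    if peak == 0:
        return signal
    threshold = peak * 10 ** (-top_db / 20)        # amplitude threshold
    loud = np.flatnonzero(np.abs(signal) >= threshold)
    return signal[loud[0]: loud[-1] + 1]

# Synthetic clip: 100 silent samples, a 440 Hz tone at amplitude 0.5, 100 silent samples
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(1000) / 22050)
clip = np.concatenate([np.zeros(100), tone, np.zeros(100)])

processed = peak_normalize(trim_silence(clip))
print(len(clip), len(processed), np.max(np.abs(processed)))
```

The processed clip is shorter than the input because the silent padding is removed, and its peak amplitude is 1.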

In [29]:
import os
import librosa
import soundfile as sf

Let's make a folder to save our preprocessed dataset.

In [30]:
!mkdir ravdess_preprocess

In [31]:
# Define the path to the dataset folder and the output folder for preprocessed audio files
dataset_path = '/content/ravdess'
output_path = '/content/ravdess_preprocess'

# Define the target sampling rate
target_sampling_rate = 22050

In [33]:
# Iterate through the folders (actor_01 to actor_24)
for actor_folder in os.listdir(dataset_path):
    actor_path = os.path.join(dataset_path, actor_folder)
    output_actor_path = os.path.join(output_path, actor_folder)
    
    # Check if the path is a directory (to avoid processing non-directory files)
    if os.path.isdir(actor_path):

        # Create the output folder for the current actor if it doesn't exist
        # (done after the directory check so no folders are created for stray files)
        os.makedirs(output_actor_path, exist_ok=True)
        
        # Iterate through the audio files in the actor folder
        for audio_file in os.listdir(actor_path):
            
            if audio_file.endswith('.wav'):
                audio_path = os.path.join(actor_path, audio_file)
                output_audio_path = os.path.join(output_actor_path, audio_file)

                # Load the audio file using librosa
                audio, sampling_rate = librosa.load(audio_path, sr=None)

                # Resample the audio to the target sampling rate
                if sampling_rate != target_sampling_rate:
                    audio = librosa.resample(audio, orig_sr=sampling_rate, target_sr=target_sampling_rate)

                # Trim leading and trailing silence
                audio, _ = librosa.effects.trim(audio, top_db=30)

                # Normalize the amplitude
                audio = librosa.util.normalize(audio)

                # Save the preprocessed audio file
                sf.write(output_audio_path, audio, target_sampling_rate, format='wav')

Now, let's listen to the preprocessed data.

In [68]:
# Define the path to the dataset folder
dataset_path = '/content/ravdess_preprocess'

# Define a mapping for emotion identifiers to emotion labels
emotion_mapping = {
    '01': 'neutral',
    '02': 'calm',
    '03': 'happy',
    '04': 'sad',
    '05': 'angry',
    '06': 'fearful',
    '07': 'disgust',
    '08': 'surprised'
}

In [69]:
# Initialize an empty list to store the audio file paths
audio_files = []

# Iterate through the folders (actor_01 to actor_24)
for actor_folder in os.listdir(dataset_path):
    actor_path = os.path.join(dataset_path, actor_folder)
    
    # Check if the path is a directory (to avoid processing non-directory files)
    if os.path.isdir(actor_path):
        
        # Iterate through the audio files in the actor folder
        for audio_file in os.listdir(actor_path):
            # Keep only .wav files so stray files are never selected
            if audio_file.endswith('.wav'):
                audio_path = os.path.join(actor_path, audio_file)
                audio_files.append(audio_path)

In [70]:
# Select a random audio file
random_audio_file = random.choice(audio_files)

# Extract the emotion identifier from the filename (3rd part of the filename, index 2)
emotion_id = os.path.basename(random_audio_file).split('-')[2]

# Map the emotion identifier to the corresponding emotion label
emotion_label = emotion_mapping[emotion_id]

# Print the emotion label and play the audio
print(f"Emotion: {emotion_label}")
Audio(filename=random_audio_file)

Emotion: sad


# Save the Dataset Into .csv File

After preprocessing the dataset, we save the file paths and their labels to a .csv file. You don't have to do this, but it makes the later steps easier.

In [71]:
# Define the path to the dataset folder
dataset_path = '/content/ravdess_preprocess'

# Define a mapping for emotion identifiers to emotion labels
emotion_mapping = {
    '01': 'neutral',
    '02': 'calm',
    '03': 'happy',
    '04': 'sad',
    '05': 'angry',
    '06': 'fearful',
    '07': 'disgust',
    '08': 'surprised'
}

In [72]:
# Initialize an empty list to store the data
data = []

# Iterate through the folders (actor_01 to actor_24)
for actor_folder in os.listdir(dataset_path):
    actor_path = os.path.join(dataset_path, actor_folder)
    
    # Check if the path is a directory (to avoid processing non-directory files)
    if os.path.isdir(actor_path):
        
        # Iterate through the audio files in the actor folder
        for audio_file in os.listdir(actor_path):
            
            # Check if the file is an audio file (has a .wav extension)
            if audio_file.endswith('.wav'):
                audio_path = os.path.join(actor_path, audio_file)

                # Extract the emotion identifier from the filename (3rd part of the filename, index 2)
                filename_parts = audio_file.split('-')
                if len(filename_parts) > 2:
                    emotion_id = filename_parts[2]

                    # Map the emotion identifier to the corresponding emotion label
                    emotion_label = emotion_mapping.get(emotion_id)

                    if emotion_label is not None:
                        # Append the audio path and emotion label to the data list
                        data.append([audio_path, emotion_label])

# Create a pandas DataFrame from the data list
df = pd.DataFrame(data, columns=['audio_path', 'emotion'])

# Display the first few rows of the DataFrame
print(df.head())

# Save the DataFrame to a CSV file
df.to_csv('RAVDESS_emotion_labels.csv', index=False)

                                          audio_path    emotion
0  /content/ravdess_preprocess/Actor_11/03-01-04-...        sad
1  /content/ravdess_preprocess/Actor_11/03-01-08-...  surprised
2  /content/ravdess_preprocess/Actor_11/03-01-04-...        sad
3  /content/ravdess_preprocess/Actor_11/03-01-05-...      angry
4  /content/ravdess_preprocess/Actor_11/03-01-02-...       calm


# Feature Extraction
Feature extraction is an essential step in machine learning, especially for tasks involving complex data like audio. Raw data often contains a lot of irrelevant or redundant information that can negatively impact the performance of machine learning models. By extracting meaningful features from the raw data, we can represent the data more compactly and efficiently, which helps the model learn better and make more accurate predictions.

For speech and audio analysis, Mel-frequency cepstral coefficients (MFCCs) are a widely used set of features. MFCCs represent the spectral envelope of the audio signal, which captures the characteristics of the speaker's voice and the spoken content. By using MFCCs as input features, we can simplify the audio data while retaining the essential information required for the machine learning model to recognize patterns related to emotions in the audio signal.
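
To show what the feature vector looks like in terms of shapes, here is a small NumPy sketch. It assumes librosa's default centered framing (so a clip of `N` samples gives `1 + N // hop_length` frames) and uses random numbers as a stand-in for real MFCC values. The 3-second clip length and the helper names are hypothetical.

```python
import numpy as np

def num_frames(num_samples: int, hop_length: int = 512) -> int:
    """Frame count for centered short-time analysis (librosa's default framing)."""
    return 1 + num_samples // hop_length

def summarize(mfcc: np.ndarray, target_dim: int = 16) -> np.ndarray:
    """Average an (n_mfcc, n_frames) matrix over time, then zero-pad to `target_dim`."""
    mean_mfcc = mfcc.mean(axis=1)
    return np.pad(mean_mfcc, (0, target_dim - mean_mfcc.shape[0]), mode="constant")

frames = num_frames(3 * 22050)          # hypothetical 3-second clip at 22,050 Hz
fake_mfcc = np.random.default_rng(0).normal(size=(13, frames))  # stand-in values
vector = summarize(fake_mfcc)
print(frames, fake_mfcc.shape, vector.shape)  # 130 (13, 130) (16,)
```

Averaging over time gives every clip a fixed-length vector regardless of its duration, at the cost of discarding how the coefficients change within the clip.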

In [73]:
import pandas as pd
import numpy as np
import librosa

In [74]:
# Load the DataFrame from the preprocessed .csv file
df = pd.read_csv('RAVDESS_emotion_labels.csv')

# Define the parameters for MFCC feature extraction
n_mfcc = 13
n_fft = 2048
hop_length = 512

In [75]:
# Initialize an empty list to store the features
features = []

# Iterate through the DataFrame rows
for index, row in df.iterrows():
    audio_path = row['audio_path']
    emotion = row['emotion']

    # Load the preprocessed audio file
    audio, _ = librosa.load(audio_path, sr=None)

    # Extract the MFCC features
    mfcc = librosa.feature.mfcc(y=audio, n_mfcc=n_mfcc, n_fft=n_fft, hop_length=hop_length)

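    # Calculate the mean of each MFCC coefficient over time
    mean_mfcc = np.mean(mfcc, axis=1)
    # Zero-pad from n_mfcc (13) to 16 values, a size divisible by the 4 attention heads used in the model
    mean_mfcc = np.pad(mean_mfcc, (0, 16 - n_mfcc), 'constant')

    # Append the mean MFCC features and the emotion label to the features list
    # (a plain list keeps the feature values numeric; np.append would cast them all to strings)
    features.append(list(mean_mfcc) + [emotion])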

In [76]:
# Create a pandas DataFrame from the features list
# Each row holds the padded feature values plus the label, so derive the number of
# feature columns from the data instead of n_mfcc (the rows are padded to 16 values)
n_features = len(features[0]) - 1
feature_columns = [f'mfcc_{i + 1}' for i in range(n_features)] + ['emotion']
features_df = pd.DataFrame(features, columns=feature_columns)

# Display the first few rows of the features DataFrame
print(features_df.head())

# Save the features DataFrame to a CSV file
features_df.to_csv('RAVDESS_MFCC_features.csv', index=False)

# Data Preparation

Next, we split the dataset into training and test sets, encode the labels, and scale the features. I use PyTorch to wrap the prepared data.

In [59]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
import torch
from torch.utils.data import DataLoader, TensorDataset

In [60]:
# Load the features DataFrame from the .csv file
features_df = pd.read_csv('RAVDESS_MFCC_features.csv')

# Separate the features and labels
X = features_df.iloc[:, :-1].values
y = features_df['emotion'].values

In [61]:
# Encode the emotion labels
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
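
For reference, `LabelEncoder` assigns integer ids according to the sorted order of the unique labels (stored in `classes_`). The plain-Python sketch below mimics that behavior on a few labels taken from the DataFrame preview above. It is an illustration, not scikit-learn's implementation.

```python
labels = ["sad", "surprised", "sad", "angry", "calm"]

# Sorted unique labels play the role of LabelEncoder.classes_
classes = sorted(set(labels))
class_to_index = {name: i for i, name in enumerate(classes)}

encoded = [class_to_index[name] for name in labels]   # like fit_transform
decoded = [classes[i] for i in encoded]               # like inverse_transform

print(classes)   # ['angry', 'calm', 'sad', 'surprised']
print(encoded)   # [2, 3, 2, 0, 1]
```

Keeping the fitted `label_encoder` around lets you turn model predictions back into emotion names with `label_encoder.inverse_transform`.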

In [62]:
# Normalize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Convert the numpy arrays to PyTorch tensors
X_train = torch.tensor(X_train, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.long)
X_test = torch.tensor(X_test, dtype=torch.float32)
y_test = torch.tensor(y_test, dtype=torch.long)
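
The scaler above is fitted on the training set only and then reused on the test set, so no test-set statistics leak into training. A minimal NumPy sketch of the same idea, using random stand-in data and hypothetical `*_demo` array names:

```python
import numpy as np

rng = np.random.default_rng(42)
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(80, 3))  # stand-in training features
X_test_demo = rng.normal(loc=5.0, scale=2.0, size=(20, 3))   # stand-in test features

# Statistics come from the training data only
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)
std[std == 0] = 1.0  # guard for constant columns, such as zero-padded features

X_train_scaled = (X_train_demo - mean) / std
X_test_scaled = (X_test_demo - mean) / std   # reuse the training statistics

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```

The scaled training columns have mean 0 and standard deviation 1. The test columns will be close to that but not exact, which is expected.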

In [63]:
# Create PyTorch DataLoaders for training and testing sets
train_dataset = TensorDataset(X_train, y_train)
test_dataset = TensorDataset(X_test, y_test)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
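
As a quick sanity check on the loaders, the number of batches per epoch follows from the split sizes. The sketch below assumes 1,440 clips in total (24 actors with 60 speech clips each), which is an assumption about the dataset rather than a value printed in this notebook. `num_batches` is a hypothetical helper.

```python
import math

def num_batches(num_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Batches per epoch; with drop_last=False the final batch may be smaller."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)

n_total = 1440                    # assumed total number of clips
n_test = n_total * 20 // 100      # 20% test split
n_train = n_total - n_test

print(n_train, n_test)                                    # 1152 288
print(num_batches(n_train, 32), num_batches(n_test, 32))  # 36 9
```

In the notebook itself, `len(train_loader)` and `len(test_loader)` give the actual values.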

# Let's Train a Machine Learning Model
Finally, now that the dataset is prepared, we can train the model. For this case, I use a Transformer architecture. I keep the model small to avoid overkill and might update it later.

In [52]:
import torch
import torch.nn as nn
from torch.optim import Adam
import os

In [53]:
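# Define the Transformer model (encoder only, since this is a classification task)
class EmotionClassifier(nn.Module):
    def __init__(self, input_dim, nhead, num_layers, dim_feedforward, num_classes):
        super().__init__()
        # Use keyword arguments: in nn.Transformer the 4th positional argument is
        # num_decoder_layers, not dim_feedforward
        encoder_layer = nn.TransformerEncoderLayer(d_model=input_dim, nhead=nhead, dim_feedforward=dim_feedforward)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.fc = nn.Linear(input_dim, num_classes)

    def forward(self, x):
        # x has shape (batch, input_dim); treat each sample as a sequence of length 1
        x = x.unsqueeze(0)      # (seq_len=1, batch, input_dim)
        x = self.encoder(x)
        x = self.fc(x[0])       # (batch, num_classes)
        return x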

In [54]:
# Instantiate the model
# input_dim must be divisible by nhead (here 16 padded features and 4 heads);
# otherwise PyTorch's multi-head attention raises an AssertionError
input_dim = X_train.shape[1]
nhead = 4
num_layers = 2
dim_feedforward = 64
num_classes = len(label_encoder.classes_)
model = EmotionClassifier(input_dim, nhead, num_layers, dim_feedforward, num_classes)
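
PyTorch's multi-head attention requires the embedding size to be divisible by the number of heads, which is why 13 MFCC features cannot be used directly with `nhead = 4`. The sketch below shows the arithmetic behind padding to 16. `pad_to_multiple` is a hypothetical helper, not a PyTorch function.

```python
def pad_to_multiple(dim: int, multiple: int) -> int:
    """Smallest value >= dim that is divisible by `multiple`."""
    return -(-dim // multiple) * multiple   # ceiling division, then scale back up

n_mfcc, nhead = 13, 4
print(n_mfcc % nhead)                  # 1, so 13 is not divisible by 4
print(pad_to_multiple(n_mfcc, nhead))  # 16
```

An alternative to padding would be to pick a head count that divides the feature size, but 13 is prime, so the only options would be 1 or 13 heads.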