<a href="https://www.kaggle.com/code/ananyamkhrj/ser-project-feature-extraction-data-augmentation?scriptVersionId=292057016" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Speech Emotion Recognition (SER) Project

In [1]:
#!pip3 install librosa
#!pip3 install numpy
#!pip3 install kagglehub
#!pip3 install IPython
#!pip3 install matplotlib
#!pip3 install pandas

In [2]:
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt
import IPython.display as ipd
import os
import kagglehub
import pandas as pd

## Feature Extraction

### 1. MFCCs

MFCCs form a 2D array with one axis along time, so they can be treated as an image.
Alternatively, we can take the mean of each MFCC across time, which gives a much smaller feature vector and may be faster to process in a real-time engine.<br>

Here, we extract 15 MFCCs and return them as a 2D array (frames x coefficients); the step that averages them across time is left commented out.
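To make the "mean across time" option concrete, here is a minimal sketch using NumPy only. The array `fake_mfccs` is a hypothetical stand-in for the output of `extract_mfccs` (frames along axis 0, coefficients along axis 1), and `summarize_over_time` is an illustrative helper name, not part of librosa.

```python
import numpy as np

def summarize_over_time(features):
    """Collapse a (frames, n_features) matrix into one vector by averaging over frames."""
    features = np.asarray(features, dtype=float)
    return features.mean(axis=0)

# Hypothetical stand-in for an MFCC matrix: 200 frames x 15 coefficients.
rng = np.random.default_rng(0)
fake_mfccs = rng.normal(size=(200, 15))

mfcc_means = summarize_over_time(fake_mfccs)
print(fake_mfccs.shape, "->", mfcc_means.shape)
```

The averaged vector has the same length for every clip regardless of duration, which is convenient for classical classifiers, at the cost of discarding how the coefficients evolve over time.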

In [3]:
def extract_mfccs(data,sample_rate):
    mfccs = librosa.feature.mfcc(y=data, sr=sample_rate, n_mfcc=15).T     # Extract 15 MFCCs; transposed to shape (frames, 15)
    # mfccs = np.mean(mfccs, axis=0)  # Take the mean of the MFCCs across time axis
    # print("MFCCs shape:", mfccs.shape)
    return mfccs

def extract_chroma(data,sample_rate):
    chroma = librosa.feature.chroma_stft(y=data, sr=sample_rate, n_chroma=12).T
    # print("Chroma shape:", chroma.shape)
    return chroma

def extract_mel(data,sample_rate):
    mel = librosa.feature.melspectrogram(y=data, sr=sample_rate, n_mels=128).T
    # print("Mel Spectrogram shape:", mel.shape)
    return mel

# Example usage with librosa's bundled 'nutcracker' example recording
#filename = librosa.example('nutcracker')
#y, sr = librosa.load(filename)
#extract_mfccs(y, sr)
#extract_chroma(y, sr)
#extract_mel(y, sr)

## Data Augmentation

1. Noise Injection
2. Shifting
3. Pitch Stretching

### 1. Noise Injection

We implement **Gaussian noise injection**: random noise whose sample values follow a normal distribution.

It simulates sensor noise (for example, from a microphone) and is easy to implement.

Environmental noise would be ideal, but we are ignoring it for now because it is somewhat difficult to randomise.

_Pink/brown noise_ is more closely aligned with the frequency distribution of human speech and may provide better robustness when combined with white noise; however, this will have to be tested.
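As a rough illustration of what pink noise looks like in code, the sketch below shapes white Gaussian noise in the frequency domain so that its power falls off roughly as 1/f. This is one common approach, not something the notebook uses yet; `pink_noise` is a hypothetical helper name, and `n_samples` is assumed to be greater than 1.

```python
import numpy as np

def pink_noise(n_samples, rng=None):
    """Return unit-variance noise whose power spectrum falls off roughly as 1/f."""
    rng = np.random.default_rng() if rng is None else rng
    white = rng.normal(size=n_samples)
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(n_samples)
    scale = np.zeros_like(freqs)              # the DC bin stays at zero
    scale[1:] = 1.0 / np.sqrt(freqs[1:])      # power ~ 1/f means amplitude ~ 1/sqrt(f)
    pink = np.fft.irfft(spectrum * scale, n=n_samples)
    return pink / np.std(pink)                # normalise to unit variance

noise = pink_noise(22050, rng=np.random.default_rng(0))
print(noise.shape)
```

Because the result has unit variance, it can be multiplied by the square root of a target noise power before being added to a signal.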

The noise level is set from a target **SNR (signal-to-noise ratio)** of 25 dB, so that the noise does not overpower the signal.

<center><pre>noise_power = signal_power / (10 ** (25/10)) ≈ signal_power * 0.0032</pre></center>

The factor <code>10 ** (25/10)</code> converts the SNR from dB to a linear power ratio.

The target SNR can be lowered later on to increase the background noise.
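The dB-to-linear conversion can be sketched with the SNR as a parameter rather than a hard-coded factor. The helper names `snr_db_to_noise_power` and `add_gaussian_noise` are illustrative, and the sine wave is just a hypothetical test signal used to check that the measured SNR lands near the target.

```python
import numpy as np

def snr_db_to_noise_power(signal_power, snr_db):
    """Noise power that yields the requested SNR (in dB) for a given signal power."""
    return signal_power / (10 ** (snr_db / 10))

def add_gaussian_noise(data, snr_db=25.0, rng=None):
    """Add white Gaussian noise scaled to an approximate target SNR."""
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(data ** 2)
    noise_power = snr_db_to_noise_power(signal_power, snr_db)
    noise = rng.normal(0.0, np.sqrt(noise_power), size=len(data))
    return data + noise

# Hypothetical test signal: one second of a 220 Hz sine at 22050 Hz.
t = np.linspace(0, 1, 22050, endpoint=False)
signal = 0.5 * np.sin(2 * np.pi * 220 * t)
noisy = add_gaussian_noise(signal, snr_db=25.0, rng=np.random.default_rng(0))

measured_snr = 10 * np.log10(np.mean(signal ** 2) / np.mean((noisy - signal) ** 2))
print("measured SNR (dB):", round(measured_snr, 2))
```

The measured value differs slightly from 25 dB because the noise is a finite random sample.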

In [4]:
def noiseInjection(data):
    signal_power = np.mean(data**2)
    noise_power = signal_power * 0.0032     # calculated for an approximate SNR of 25 dB
    noise = np.random.normal(0,np.sqrt(noise_power),len(data))
    augmented_data = data + noise
    return augmented_data

### 2. Shifting

Temporal translation (shifting) can be done by rolling the NumPy arrays.

Should the arrays only be rolled forward?<br>
Also, once rolled, should the wrapped-around portion of the array be set to silence (0)? This is the choice between a circular shift and zero-padding.<br>

In this case, we roll the arrays in either direction and apply a circular shift of up to 10% of the signal length.
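The difference between the two options can be seen on a tiny array. `np.roll` gives the circular shift; `shift_zero_pad` is a hypothetical helper showing what the zero-padding alternative might look like.

```python
import numpy as np

def shift_zero_pad(data, shift):
    """Shift by `shift` samples, filling the vacated positions with zeros (silence)."""
    shifted = np.zeros_like(data)
    if shift > 0:
        shifted[shift:] = data[:-shift]
    elif shift < 0:
        shifted[:shift] = data[-shift:]
    else:
        shifted[:] = data
    return shifted

x = np.arange(1, 7)
print(np.roll(x, 2))          # [5 6 1 2 3 4]  circular: the tail wraps to the front
print(shift_zero_pad(x, 2))   # [0 0 1 2 3 4]  zero-padded: the tail is dropped
print(shift_zero_pad(x, -2))  # [3 4 5 6 0 0]
```

A circular shift keeps every sample but can place the end of an utterance before its start; zero-padding preserves the order at the cost of losing the samples pushed past the edge.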


In [5]:
def shifting(data):
    shift = int(np.random.uniform(-0.1,0.1)*len(data))
    augmented_data = np.roll(data,shift)
    return augmented_data

### 3. Pitch Stretching
Randomly change the pitch using the librosa library, by up to ±5 semitones (half-steps).<br>
PROBLEM -- if implementing pitch stretching, should we handle male and female voices separately?
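To get a feel for how large such a shift is, the sketch below converts semitones to a frequency ratio, assuming the usual 12 steps per octave (which is also librosa's default for `n_steps`). The 120 Hz fundamental is a purely illustrative value, not a measurement from the datasets.

```python
def semitones_to_ratio(n_steps):
    """Frequency ratio for a shift of n_steps semitones (12 steps per octave)."""
    return 2.0 ** (n_steps / 12.0)

example_f0 = 120.0  # hypothetical fundamental frequency in Hz, for illustration only
for n_steps in (-5, 0, 5):
    ratio = semitones_to_ratio(n_steps)
    print(n_steps, round(ratio, 3), round(example_f0 * ratio, 1))
```

A shift of 5 semitones changes the fundamental by roughly a third, which may be relevant when deciding whether augmented clips should keep their original gender label.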

In [6]:
def pitchStretching(data,sr):
    pitch_stretch = np.random.uniform(-5,5)
    augmented_data = librosa.effects.pitch_shift(data,sr=sr,n_steps=pitch_stretch)
    return augmented_data


### Example of Data Augmentation
Augmenting "The Nutcracker" with randomised noise injection, temporal shifting and pitch stretching.

In [7]:
#y2 = pitchStretching(y,sr)
#y2 = noiseInjection(y2)
#y2 = shifting(y2)
#ipd.display(ipd.Audio(y,rate=sr))
#ipd.display(ipd.Audio(y2,rate=sr))

In [8]:
# downloads the RAVDESS and CREMA-D datasets from Kaggle
RAVpath = kagglehub.dataset_download("uwrfkaggler/ravdess-emotional-speech-audio")
CREMApath = kagglehub.dataset_download("ejlok1/cremad")

print("Path to dataset files:", RAVpath)
print("Path to dataset files:", CREMApath)

Path to dataset files: /kaggle/input/ravdess-emotional-speech-audio
Path to dataset files: /kaggle/input/cremad


In [9]:
dirlist = os.listdir(RAVpath)
dirlist.sort()
dirlist2 = os.listdir(CREMApath)
dirlist2.sort()
pd.set_option('display.max_columns', None)

# first extracting filepaths of each subfolder in dirlist, and then extracting filepaths of each audio file in each subfolder
# then, we extract emotion and gender from filenames and append to respective lists

filepaths = []
emotions = []
genders = []
for subfolder in dirlist:
    subfolder_path = os.path.join(RAVpath, subfolder)
    if os.path.isdir(subfolder_path):
        for file in os.listdir(subfolder_path):
            codes = file.split('.')[0].split('-')
            if file.endswith('.wav'):
                filepaths.append(os.path.join(subfolder_path, file))
                emotions.append(codes[2])
                genders.append('female' if int(codes[6]) % 2 == 0 else 'male')
            # print(filepaths[-3:])
            # print(emotions[-3:])
            # print(genders[-3:])


female = [1002,1003,1004,1006,1007,1008,1009,1010,1012,1013,1018,1020,1021,1024,1025,1028,1029,
          1030,1037,1043,1046,1047,1049,1052,1053,1054,1055,1056,1058,1060,1061,1063,1072,1073,
          1074,1075,1076,1078,1079,1082,1084,1089,1091]

for subfolder in dirlist2:
    subfolder_path = os.path.join(CREMApath, subfolder)
    if os.path.isdir(subfolder_path):
        for file in os.listdir(subfolder_path):
            codes = file.split('.')[0].split('_')
            if file.endswith('.wav'):
                filepaths.append(os.path.join(subfolder_path, file))
                emotions.append(codes[2])
                genders.append('female' if int(codes[0]) in female else 'male')

# checking number of filepaths extracted
print("Total number of audio files:", len(filepaths),len(emotions),len(genders))



Total number of audio files: 8882 8882 8882
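The filename parsing used in the loops above can be sketched as two standalone helpers, which makes the rules easier to test on single filenames. `parse_ravdess` and `parse_crema` are hypothetical names; the RAVDESS filename is an illustrative example of the hyphen-separated format, and the CREMA-D filename is one that appears in this notebook's output.

```python
import os

def parse_ravdess(filename):
    """Return (emotion_code, gender) from a RAVDESS filename.

    Mirrors the loop above: the 3rd field is the emotion code and the 7th is the
    actor ID, with even actor IDs treated as female.
    """
    codes = os.path.splitext(os.path.basename(filename))[0].split('-')
    gender = 'female' if int(codes[6]) % 2 == 0 else 'male'
    return codes[2], gender

def parse_crema(filename, female_ids):
    """Return (emotion_code, gender) from a CREMA-D filename and a set of female actor IDs."""
    codes = os.path.splitext(os.path.basename(filename))[0].split('_')
    gender = 'female' if int(codes[0]) in female_ids else 'male'
    return codes[2], gender

print(parse_ravdess("03-01-05-01-02-01-12.wav"))    # ('05', 'female')
print(parse_crema("1060_IEO_ANG_MD.wav", {1060}))   # ('ANG', 'female')
```

Passing the female actor IDs as a set keeps the membership check fast and keeps the helper independent of the global `female` list.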


In [10]:
# creating pandas dataframe to store filepaths, emotions and genders

df = pd.DataFrame([filepaths, emotions, genders]).T
df.columns = ['filepath', 'emotion', 'gender']
emotion_dict = {'01':'neutral','02':'calm','03':'happy','04':'sad','05':'angry','06':'fearful','07':'disgust','08':'surprised'}
emotion_dict_crema = {'NEU':'neutral','HAP':'happy','SAD':'sad','ANG':'angry','FEA':'fearful','DIS':'disgust','SUR':'surprised'}
emotion_dict.update(emotion_dict_crema)
df['emotion'] = df['emotion'].replace(emotion_dict)
df['label'] = df['gender'] + '-' + df['emotion']
df.drop(columns=['emotion','gender'], inplace=True)
print(df.head())
print(df.tail())



                                            filepath         label
0  /kaggle/input/ravdess-emotional-speech-audio/A...     male-calm
1  /kaggle/input/ravdess-emotional-speech-audio/A...  male-neutral
2  /kaggle/input/ravdess-emotional-speech-audio/A...      male-sad
3  /kaggle/input/ravdess-emotional-speech-audio/A...     male-calm
4  /kaggle/input/ravdess-emotional-speech-audio/A...     male-calm
                                               filepath         label
8877  /kaggle/input/cremad/AudioWAV/1060_IEO_ANG_MD.wav  female-angry
8878  /kaggle/input/cremad/AudioWAV/1088_IWL_ANG_XX.wav    male-angry
8879  /kaggle/input/cremad/AudioWAV/1050_IOM_ANG_XX.wav    male-angry
8880  /kaggle/input/cremad/AudioWAV/1044_IWL_SAD_XX.wav      male-sad
8881  /kaggle/input/cremad/AudioWAV/1009_ITH_SAD_XX.wav    female-sad


In [11]:
# adding feature columns to the dataframe
mfccs = []
chroma = []
mel = []
for filepath in df['filepath']:
    data, sample_rate = librosa.load(filepath)
    mfccs.append(extract_mfccs(data, sample_rate))
    chroma.append(extract_chroma(data, sample_rate))
    mel.append(extract_mel(data, sample_rate))

print("MFCCs feature extraction completed:", len(mfccs))
print("Chroma feature extraction completed:", len(chroma))
print("Mel Spectrogram feature extraction completed:", len(mel))



MFCCs feature extraction completed: 8882
Chroma feature extraction completed: 8882
Mel Spectrogram feature extraction completed: 8882


In [12]:
df['mfccs'] = mfccs
df['chroma'] = chroma
df['mel'] = mel

print(df.head())
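df[['filepath', 'label']].to_csv('labels.csv', index=False)   # CSV with the filepath and label of each audio file; array columns do not round-trip through CSV
numpy_array = df.to_numpy()                                   # object array holding filepath, label and the per-file feature matrices
np.save('labels.npy', numpy_array)                            # stored via pickle; reload with np.load('labels.npy', allow_pickle=True)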

                                            filepath         label  \
0  /kaggle/input/ravdess-emotional-speech-audio/A...     male-calm   
1  /kaggle/input/ravdess-emotional-speech-audio/A...  male-neutral   
2  /kaggle/input/ravdess-emotional-speech-audio/A...      male-sad   
3  /kaggle/input/ravdess-emotional-speech-audio/A...     male-calm   
4  /kaggle/input/ravdess-emotional-speech-audio/A...     male-calm   

                                               mfccs  \
0  [[-887.14105, 2.89413, 2.8835135, 2.8661366, 2...   
1  [[-864.93823, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0....   
2  [[-798.6087, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...   
3  [[-902.4752, 44.3454, 20.483978, 14.49259, 19....   
4  [[-892.03625, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0....   

                                              chroma  \
0  [[0.85496235, 0.6903129, 0.70005983, 0.601727,...   
1  [[0.727733, 0.9636807, 1.0, 0.80765074, 0.5434...   
2  [[1.0, 0.66551745, 0.6343575, 0.6218832, 0.673...   
3  [[0.8004069, 0.