# **Merging Datasets**

This notebook shows how to combine the following four audio datasets, each of which is labelled by emotion:

- **[RAVDESS](https://www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio)**: Contains 1440 `.wav` files of emotional speech, classified into **8 emotions** (neutral, calm, happy, sad, angry, fearful, disgust, and surprised).
- **[CREMA-D](https://www.kaggle.com/datasets/ejlok1/cremad)**: Contains 7442 `.wav` files classified into **6 emotions** (anger, disgust, fear, happiness, neutral, and sadness).
- **[TESS](https://www.kaggle.com/datasets/ejlok1/toronto-emotional-speech-set-tess)**: Contains 2800 `.wav` files classified into **7 emotions** (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral).
- **[SAVEE](https://www.kaggle.com/datasets/ejlok1/surrey-audiovisual-expressed-emotion-savee)**: Contains 480 `.wav` files spoken by four actors, classified into **7 emotions** (neutral, anger, disgust, fear, happiness, sadness, and surprise).



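In this section, we will merge the four datasets into a single table that uses one shared set of seven emotion labels: RAVDESS's calm is folded into neutral, and TESS's pleasant surprise is treated as surprise.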

### **Load packages**

In [1]:
import pandas as pd
import numpy as np
import os

### **Useful functions**

In [2]:
def generate_paths_csv(directories=None, type='ravdess', directory_path=''):
    """Collect emotion labels and file paths for one dataset.

    directories: RAVDESS actor folders (only used when type == 'ravdess').
    type: 'ravdess', 'cremad', 'tess' or 'savee'.
    directory_path: dataset root, ending in '/', for the other three datasets.
    Returns (emotion_id, file_path); RAVDESS labels are still integer codes.
    """
    emotion_id = []
    file_path = []

    if type == 'ravdess':
        for directory in (directories or []):
            for audio in os.listdir(directory):
                # The third dash-separated field of the file name is the emotion code.
                part = audio.split('.')[0].split('-')
                emotion_id.append(int(part[2]))
                file_path.append(directory + '/' + audio)

    elif type == 'cremad':
        # CREMA-D names look like 1001_DFA_ANG_XX.wav; the third field is the emotion.
        labels = {'SAD': 'sad', 'ANG': 'angry', 'DIS': 'disgust',
                  'FEA': 'fear', 'HAP': 'happy', 'NEU': 'neutral'}
        for file in os.listdir(directory_path):
            file_path.append(directory_path + file)
            emotion_id.append(labels.get(file.split('_')[2], 'unknown'))

    elif type == 'tess':
        # TESS files sit one folder below the dataset root.
        for directory in os.listdir(directory_path):
            for file in os.listdir(directory_path + directory):
                emotion = file.split('.')[0].split('_')[2]
                # 'ps' (pleasant surprise) is folded into 'surprise'.
                emotion_id.append('surprise' if emotion == 'ps' else emotion)
                file_path.append(directory_path + directory + '/' + file)

    elif type == 'savee':
        labels = {'a': 'angry', 'd': 'disgust', 'f': 'fear',
                  'h': 'happy', 'n': 'neutral', 'sa': 'sad'}
        for file in os.listdir(directory_path):
            file_path.append(directory_path + file)
            # e.g. 'DC_sa01.wav' -> 'sa': drop the speaker prefix and the trailing '##.wav'.
            code = file.split('_')[1][:-6]
            # Any code outside the table ('su' in SAVEE) is labelled surprise.
            emotion_id.append(labels.get(code, 'surprise'))

    return emotion_id, file_path
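
To make the file-name convention that the RAVDESS branch relies on more concrete, here is a minimal sketch that decodes a single file name. The helper `parse_ravdess_name` and the `RAVDESS_FIELDS` tuple are hypothetical names introduced for illustration and are not part of the notebook; the field order follows the RAVDESS naming scheme, in which the third field is the emotion code.

```python
# Hypothetical helper for illustration; not part of the notebook.
RAVDESS_FIELDS = ("modality", "vocal_channel", "emotion", "intensity",
                  "statement", "repetition", "actor")


def parse_ravdess_name(file_name):
    """Split a RAVDESS file name into its seven integer fields."""
    stem = file_name.rsplit(".", 1)[0]          # drop the '.wav' extension
    codes = [int(code) for code in stem.split("-")]
    if len(codes) != len(RAVDESS_FIELDS):
        raise ValueError(f"unexpected RAVDESS file name: {file_name!r}")
    return dict(zip(RAVDESS_FIELDS, codes))


fields = parse_ravdess_name("03-01-02-01-01-01-01.wav")
print(fields["emotion"])  # 2
```

Here the emotion code is 2 (calm), which the notebook later relabels as `neutral`.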
    

### **RAVDESS**

In [3]:
ravdess = [f'./archive/Actor_{i:02d}' for i in range(1, 25)]

print(ravdess)

['./archive/Actor_01', './archive/Actor_02', './archive/Actor_03', './archive/Actor_04', './archive/Actor_05', './archive/Actor_06', './archive/Actor_07', './archive/Actor_08', './archive/Actor_09', './archive/Actor_10', './archive/Actor_11', './archive/Actor_12', './archive/Actor_13', './archive/Actor_14', './archive/Actor_15', './archive/Actor_16', './archive/Actor_17', './archive/Actor_18', './archive/Actor_19', './archive/Actor_20', './archive/Actor_21', './archive/Actor_22', './archive/Actor_23', './archive/Actor_24']


In [4]:
emotion_id_ravdess, file_path_ravdess = generate_paths_csv(directories=ravdess, type='ravdess')

In [5]:
emotion_df = pd.DataFrame(emotion_id_ravdess, columns=['emotion_id'])
path_df = pd.DataFrame(file_path_ravdess, columns=['file_path'])
ravdess_df = pd.concat([emotion_df, path_df], axis=1)

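# Map RAVDESS integer codes to labels. Code 2 (calm) is merged into
# 'neutral' so the labels line up with the other three datasets.
ravdess_df.replace({"emotion_id": {
    1: 'neutral',
    2: 'neutral',
    3: 'happy',
    4: 'sad',
    5: 'angry',
    6: 'fear',
    7: 'disgust',
    8: 'surprise'}},
    inplace=True)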
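
As an alternative to building two single-column frames and relabelling with `replace`, the sketch below builds the frame in one constructor call and relabels with `Series.map`. It is only an illustration on toy data: the lists `emotion_codes` and `paths` are made-up stand-ins for the values returned by `generate_paths_csv`, and `toy_df` is a hypothetical name.

```python
import pandas as pd

# Toy stand-ins for the lists returned by generate_paths_csv (made-up paths).
emotion_codes = [1, 2, 3, 8]
paths = ["a.wav", "b.wav", "c.wav", "d.wav"]

ravdess_labels = {1: "neutral", 2: "neutral", 3: "happy", 4: "sad",
                  5: "angry", 6: "fear", 7: "disgust", 8: "surprise"}

toy_df = pd.DataFrame({"emotion_id": emotion_codes, "file_path": paths})

# Series.map yields NaN for any code missing from the dictionary,
# so an unexpected code is easy to detect afterwards.
toy_df["emotion_id"] = toy_df["emotion_id"].map(ravdess_labels)
assert toy_df["emotion_id"].notna().all()

print(toy_df["emotion_id"].tolist())  # ['neutral', 'neutral', 'happy', 'surprise']
```

Unlike `replace`, which leaves unmapped values untouched, `map` turns them into `NaN`, which can make a mislabelled file easier to spot.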


In [6]:
ravdess_df.head()

|   | emotion_id | file_path |
|---|---|---|
| 0 | neutral | ./archive/Actor_01/03-01-01-01-01-01-01.wav |
| 1 | neutral | ./archive/Actor_01/03-01-01-01-01-02-01.wav |
| 2 | neutral | ./archive/Actor_01/03-01-01-01-02-01-01.wav |
| 3 | neutral | ./archive/Actor_01/03-01-01-01-02-02-01.wav |
| 4 | neutral | ./archive/Actor_01/03-01-02-01-01-01-01.wav |


In [7]:
ravdess_df.emotion_id.value_counts()

emotion_id
neutral     288
happy       192
sad         192
angry       192
fear        192
disgust     192
surprise    192
Name: count, dtype: int64

### **CREMA-D**

In [8]:
cremad = "./archive/AudioWAV/"

In [9]:
emotion_id_cremad, file_path_cremad = generate_paths_csv(type='cremad', directory_path=cremad)

In [10]:
emotion_df = pd.DataFrame(emotion_id_cremad, columns=['emotion_id'])
path_df = pd.DataFrame(file_path_cremad, columns=['file_path'])
cremad_df = pd.concat([emotion_df, path_df], axis=1)

In [11]:
cremad_df.head()

|   | emotion_id | file_path |
|---|---|---|
| 0 | angry | ./archive/AudioWAV/1001_DFA_ANG_XX.wav |
| 1 | disgust | ./archive/AudioWAV/1001_DFA_DIS_XX.wav |
| 2 | fear | ./archive/AudioWAV/1001_DFA_FEA_XX.wav |
| 3 | happy | ./archive/AudioWAV/1001_DFA_HAP_XX.wav |
| 4 | neutral | ./archive/AudioWAV/1001_DFA_NEU_XX.wav |


In [12]:
cremad_df.emotion_id.value_counts()

emotion_id
angry      1271
disgust    1271
fear       1271
happy      1271
sad        1271
neutral    1087
Name: count, dtype: int64

### **TESS**

In [13]:
tess = "./archive/TESS Toronto emotional speech set data/"

In [14]:
emotion_id_tess, file_path_tess = generate_paths_csv(type='tess', directory_path=tess)

In [15]:
emotion_df = pd.DataFrame(emotion_id_tess, columns=['emotion_id'])
path_df = pd.DataFrame(file_path_tess, columns=['file_path'])
tess_df = pd.concat([emotion_df, path_df], axis=1)

In [16]:
tess_df.head()

|   | emotion_id | file_path |
|---|---|---|
| 0 | angry | ./archive/TESS Toronto emotional speech set da... |
| 1 | angry | ./archive/TESS Toronto emotional speech set da... |
| 2 | angry | ./archive/TESS Toronto emotional speech set da... |
| 3 | angry | ./archive/TESS Toronto emotional speech set da... |
| 4 | angry | ./archive/TESS Toronto emotional speech set da... |


In [17]:
tess_df.emotion_id.value_counts()

emotion_id
angry       400
disgust     400
fear        400
happy       400
neutral     400
surprise    400
sad         400
Name: count, dtype: int64

### **SAVEE**

In [18]:
savee = "./archive/ALL/"

In [19]:
emotion_id_savee, file_path_savee = generate_paths_csv(type='savee', directory_path=savee)

In [20]:
emotion_df = pd.DataFrame(emotion_id_savee, columns=['emotion_id'])
path_df = pd.DataFrame(file_path_savee, columns=['file_path'])
savee_df = pd.concat([emotion_df, path_df], axis=1)

In [21]:
savee_df.head()

|   | emotion_id | file_path |
|---|---|---|
| 0 | angry | ./archive/ALL/DC_a01.wav |
| 1 | angry | ./archive/ALL/DC_a02.wav |
| 2 | angry | ./archive/ALL/DC_a03.wav |
| 3 | angry | ./archive/ALL/DC_a04.wav |
| 4 | angry | ./archive/ALL/DC_a05.wav |


In [22]:
savee_df.emotion_id.value_counts()

emotion_id
neutral     120
angry        60
disgust      60
fear         60
happy        60
sad          60
surprise     60
Name: count, dtype: int64

### **Merge datasets**

Now that the four datasets are loaded and share the same label names, we can concatenate them into a single DataFrame and save it as a CSV file.

In [23]:
df = pd.concat([ravdess_df, cremad_df, tess_df, savee_df], axis=0, ignore_index=True)
os.makedirs("./data", exist_ok=True)  # to_csv does not create missing folders
df.to_csv("./data/dataset.csv", index=False)
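
Once the four frames are concatenated, nothing in the merged table records which corpus a row came from. The sketch below shows one way to keep that information by tagging each frame before concatenating. It uses toy frames in place of `ravdess_df` and `savee_df`, and the `source` column, `frames` dictionary, and `merged` name are assumptions made for illustration, not part of the notebook.

```python
import pandas as pd

# Toy frames standing in for two of the per-dataset frames (made-up paths).
toy_ravdess = pd.DataFrame({"emotion_id": ["happy", "sad"],
                            "file_path": ["r1.wav", "r2.wav"]})
toy_savee = pd.DataFrame({"emotion_id": ["happy"],
                          "file_path": ["s1.wav"]})

frames = {"ravdess": toy_ravdess, "savee": toy_savee}

# assign() returns a copy with the extra column, so the inputs stay unchanged;
# ignore_index=True renumbers the rows 0..n-1 in the merged frame.
merged = pd.concat(
    [frame.assign(source=name) for name, frame in frames.items()],
    ignore_index=True,
)

print(merged["source"].tolist())  # ['ravdess', 'ravdess', 'savee']
```

With a `source` column in place, per-dataset label counts can be obtained with `merged.groupby("source")["emotion_id"].value_counts()`.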

In [24]:
df.head()

|   | emotion_id | file_path |
|---|---|---|
| 0 | neutral | ./archive/Actor_01/03-01-01-01-01-01-01.wav |
| 1 | neutral | ./archive/Actor_01/03-01-01-01-01-02-01.wav |
| 2 | neutral | ./archive/Actor_01/03-01-01-01-02-01-01.wav |
| 3 | neutral | ./archive/Actor_01/03-01-01-01-02-02-01.wav |
| 4 | neutral | ./archive/Actor_01/03-01-02-01-01-01-01.wav |


In [25]:
df.emotion_id.value_counts()

emotion_id
happy       1923
sad         1923
angry       1923
fear        1923
disgust     1923
neutral     1895
surprise     652
Name: count, dtype: int64
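
As a quick sanity check on the counts above, the following sketch re-enters them by hand and confirms that they add up to the dataset sizes quoted in the introduction, then computes each label's share of the merged data. The names `counts`, `total`, and `shares` are illustrative only.

```python
import pandas as pd

# Label counts copied from the value_counts() output of the merged frame.
counts = pd.Series({"happy": 1923, "sad": 1923, "angry": 1923, "fear": 1923,
                    "disgust": 1923, "neutral": 1895, "surprise": 652})

total = counts.sum()
# Sizes quoted in the introduction: RAVDESS, CREMA-D, TESS, SAVEE.
assert total == 1440 + 7442 + 2800 + 480

shares = (counts / total).round(3)
print(total)               # 12162
print(shares["surprise"])  # 0.054
```

The `surprise` class makes up only about 5% of the merged data, because CREMA-D has no surprise recordings; this imbalance may be worth keeping in mind when splitting or evaluating.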