# **Feature extraction**

To extract features from the combined dataset, which was built from the four datasets merged earlier (`merging-datasets.ipynb`), we first apply **data augmentation**. We then extract three features from every clip, original and augmented: **ZCR, RMSE, and MFCC**.

## **Load packages**

In [6]:
import pandas as pd
import numpy as np
import librosa

## **Load dataset**

In [7]:
df = pd.read_csv("./data/dataset.csv")

In [8]:
df.head()

| Unnamed: 0 | emotion_id | file_path |
| --- | --- | --- |
| 0 | neutral | ./archive/Actor_01/03-01-01-01-01-01-01.wav |
| 1 | neutral | ./archive/Actor_01/03-01-01-01-01-02-01.wav |
| 2 | neutral | ./archive/Actor_01/03-01-01-01-02-01-01.wav |
| 3 | neutral | ./archive/Actor_01/03-01-01-01-02-02-01.wav |
| 4 | calm | ./archive/Actor_01/03-01-02-01-01-01-01.wav |


## **Data Augmentation**

Data augmentation increases the quantity and diversity of the training set, which helps an ML model generalize better. For audio, this means applying transformations that simulate natural variations in sound signals.

### **1. Noise**

- Random noise is added to the audio signal to simulate recordings in noisy environments.
- This helps the model become more robust to variations in recording conditions.  

In [9]:
def noise(data):
    noise_amp = 0.035*np.random.uniform()*np.amax(data)
    data = data + noise_amp*np.random.normal(size=data.shape[0])
    return data
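As a quick sanity check, the sketch below re-declares `noise` and applies it to a synthetic tone. The 440 Hz sine wave and the fixed seed are assumptions made for illustration and are not part of the dataset. It checks that the output keeps the input length and that the standard deviation of the added noise stays below roughly 3.5% of the peak amplitude.

```python
import numpy as np

def noise(data):
    # Same logic as the cell above: Gaussian noise whose standard deviation
    # is a random fraction (at most 3.5%) of the signal's peak value.
    noise_amp = 0.035 * np.random.uniform() * np.amax(data)
    return data + noise_amp * np.random.normal(size=data.shape[0])

np.random.seed(0)  # assumption: fixed seed so the check is repeatable
sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)  # assumption: synthetic 440 Hz test tone

noisy = noise(tone)
residual = noisy - tone  # the noise that was actually added

print(noisy.shape == tone.shape)           # length is preserved
print(residual.std() / np.amax(tone))      # noise level relative to the peak
```

Because the output has the same number of samples as the input, the noised clip yields a feature vector of the same length as the original.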

### **2. Stretch**

- Changes the speed of the audio without altering its pitch.
- Useful for simulating variations in the duration of words or phrases without affecting the frequency of the sound.

In [10]:
def stretch(data, rate=0.8):
    return librosa.effects.time_stretch(y=data, rate=rate)
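A `rate` below 1 slows the audio down, so the stretched clip is longer than the input. The sketch below estimates the new length, assuming the output has roughly `n_samples / rate` samples; `stretched_length` is a hypothetical helper introduced here for illustration only.

```python
def stretched_length(n_samples, rate):
    """Approximate sample count after time stretching (rate < 1 slows down)."""
    return int(round(n_samples / rate))

n_in = int(2.5 * 22050)                    # a 2.5 s clip at 22050 Hz
n_out = stretched_length(n_in, rate=0.8)   # the default rate used by stretch()

print(n_in, n_out)
```

Since the length changes, a stretched clip would produce a feature vector of a different size than the unstretched variants, so it would likely need trimming or padding before being stacked with them.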

### **3. Shift (Time Shifting)**
- Shifts the audio signal in time to generate modified versions of the same recording.
- Helps make the model more robust to slight changes in the temporal alignment of the audio.

In [11]:
def shift(data):
    shift_range = int(np.random.uniform(low=-5, high = 5)*1000)
    return np.roll(data, shift_range)
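Note that `np.roll` performs a circular shift: samples pushed past one end of the array reappear at the other end. The small example below shows this on a toy array, and also converts the maximum shift of 5000 samples into seconds, assuming a 22050 Hz sampling rate.

```python
import numpy as np

samples = np.arange(6)

print(np.roll(samples, 2))    # [4 5 0 1 2 3]  (shifted right, tail wraps to the front)
print(np.roll(samples, -2))   # [2 3 4 5 0 1]  (shifted left, head wraps to the end)

# shift() draws an offset between -5000 and 5000 samples.
max_shift_samples = 5 * 1000
max_shift_seconds = max_shift_samples / 22050  # assumption: 22050 Hz audio
print(max_shift_seconds)      # about 0.23 s
```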

### **4. Pitch (Pitch Shifting)**
- Alters the frequency of the audio signal without changing its duration.
- Simulates variations in speakers' voices.

In [12]:
def pitch(data, sampling_rate, pitch_factor=0.7):
    return librosa.effects.pitch_shift(y=data, sr=sampling_rate, n_steps=pitch_factor)
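Although the parameter is named `pitch_factor`, it is passed as `n_steps`, so it is measured in semitones rather than as a frequency multiplier. The sketch below converts semitones into a frequency ratio, assuming the usual 12 steps per octave; `semitone_ratio` is a hypothetical helper, not a librosa function.

```python
def semitone_ratio(n_steps, bins_per_octave=12):
    """Frequency ratio produced by shifting n_steps steps (12 per octave by default)."""
    return 2.0 ** (n_steps / bins_per_octave)

print(semitone_ratio(0.7))   # roughly 1.04, i.e. about 4% higher in frequency
print(semitone_ratio(12))    # 2.0, one full octave
```

With the default of 0.7 semitones, the shift is subtle: the voice sounds slightly higher while the clip keeps its duration.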

## **Feature Extraction**

### **Load packages**  

In [17]:
import timeit
from tqdm import tqdm

Zero Crossing Rate (**ZCR**), Root Mean Square Energy (**RMSE**), and Mel-Frequency Cepstral Coefficients (**MFCC**) are the three features extracted from each audio signal. ZCR measures how often the waveform changes sign within a frame, RMSE measures the energy of each frame, and MFCCs summarize the shape of the spectrum on the mel scale. Together they capture signal energy, spectral properties, and frequency variations, which makes them useful for tasks like speech recognition, emotion detection, and audio classification.
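To make the first two features concrete, here is a minimal NumPy sketch that computes them for a single frame. It is a simplified illustration, not librosa's implementation (which also pads and slides over overlapping frames); `frame_zcr` and `frame_rms` are hypothetical names.

```python
import numpy as np

def frame_zcr(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    negative = np.signbit(frame)
    return float(np.mean(negative[1:] != negative[:-1]))

def frame_rms(frame):
    """Root mean square amplitude of the frame."""
    return float(np.sqrt(np.mean(np.square(frame))))

alternating = np.array([1.0, -1.0, 1.0, -1.0])   # sign flips at every step
stepped = np.array([0.5, 0.5, -0.5, -0.5])       # a single sign flip

print(frame_zcr(alternating), frame_rms(alternating))  # 1.0 1.0
print(frame_zcr(stepped), frame_rms(stepped))          # one crossing in three pairs, RMS 0.5
```

A noisy or fricative-like frame tends to have a high ZCR, while a loud frame has a high RMS regardless of how often it crosses zero.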

In [18]:
def zcr(data, frame_length, hop_length):
    # Zero crossing rate per frame, squeezed to a 1-D array.
    zcr_frames = librosa.feature.zero_crossing_rate(y=data, frame_length=frame_length, hop_length=hop_length)
    return np.squeeze(zcr_frames)

def rmse(data, frame_length=2048, hop_length=512):
    # Root mean square energy per frame, squeezed to a 1-D array.
    rms_frames = librosa.feature.rms(y=data, frame_length=frame_length, hop_length=hop_length)
    return np.squeeze(rms_frames)

def mfcc(data, sr, flatten: bool = True):
    # MFCC matrix transposed to (frames, coefficients); flattened to 1-D by default.
    coeffs = librosa.feature.mfcc(y=data, sr=sr)
    return np.ravel(coeffs.T) if flatten else np.squeeze(coeffs.T)

In [19]:
def extract_features(data, sr=22050, frame_length=2048, hop_length=512):
    # Concatenate ZCR, RMSE, and flattened MFCC into one 1-D feature vector.
    return np.hstack((
        zcr(data, frame_length, hop_length),
        rmse(data, frame_length, hop_length),
        mfcc(data, sr),
    ))

def get_features(path, duration=2.5, offset=0.6):
    # Load a fixed window of the clip; librosa resamples to 22050 Hz by default.
    normal_audio, sr = librosa.load(path, duration=duration, offset=offset)

    # Pitch-shift once and reuse the result for both pitched variants.
    pitched_audio = pitch(normal_audio, sr)

    # One feature row per variant: original, noised, pitched, pitched + noised.
    variants = [
        normal_audio,
        noise(normal_audio),
        pitched_audio,
        noise(pitched_audio),
    ]

    # Pass the actual sampling rate instead of relying on the default.
    return np.vstack([extract_features(audio, sr=sr) for audio in variants])
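The length of each feature vector depends on the number of samples in the clip, so clips shorter than the requested window would produce shorter vectors. The sketch below estimates the expected length and shows one possible way to force a fixed clip length before extraction. It assumes librosa's default centered framing (one frame per `hop_length` samples, plus one) and 20 MFCCs per frame; `expected_feature_length` and `pad_or_trim` are hypothetical helpers, not part of the notebook.

```python
import numpy as np

def expected_feature_length(n_samples, hop_length=512, n_mfcc=20):
    """Length of the ZCR + RMSE + MFCC vector under the assumptions stated above."""
    n_frames = 1 + n_samples // hop_length
    # One ZCR value, one RMSE value, and n_mfcc coefficients per frame.
    return n_frames * (1 + 1 + n_mfcc)

def pad_or_trim(audio, target_len):
    """Zero-pad at the end or cut the clip so it has exactly target_len samples."""
    if len(audio) >= target_len:
        return audio[:target_len]
    return np.pad(audio, (0, target_len - len(audio)))

target_len = int(2.5 * 22050)                   # samples in a 2.5 s window
print(expected_feature_length(target_len))      # vector length for a full window

short_clip = np.ones(40000)                     # assumption: a clip shorter than the window
print(pad_or_trim(short_clip, target_len).shape)
```

Applying something like `pad_or_trim` right after loading would keep every row the same length, which matters if the features are later stacked into a single array or table.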

In [None]:
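X, Y = [], []

print('Starting to extract features...\n')

# total lets tqdm draw a full progress bar, since zip() has no length.
for index, (path, emotion) in enumerate(tqdm(zip(df.file_path, df.emotion_id), total=len(df))):

    # Each file yields one feature row per variant (original plus augmentations).
    features = get_features(path)

    if index % 1000 == 0:
        print(f'{index} audio files have been processed')

    for row in features:
        X.append(row)
        Y.append(emotion)

print(f'Done! {len(df)} audio files processed, {len(X)} feature vectors extracted')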


### **Saving features**