# **Feature extraction**

Our dataset combines the four previously mentioned datasets (see `merging-datasets.ipynb`). We first apply **data augmentation** to it and then extract features using **ZCR, RMSE, and MFCC**.

## **Load packages**

In [1]:
import pandas as pd
import numpy as np
import librosa

## **Load dataset**

In [2]:
df = pd.read_csv("./data/dataset.csv")

In [3]:
df.head()

Unnamed: 0,emotion_id,file_path
0,neutral,./archive/Actor_01/03-01-01-01-01-01-01.wav
1,neutral,./archive/Actor_01/03-01-01-01-01-02-01.wav
2,neutral,./archive/Actor_01/03-01-01-01-02-01-01.wav
3,neutral,./archive/Actor_01/03-01-01-01-02-02-01.wav
4,neutral,./archive/Actor_01/03-01-02-01-01-01-01.wav


## **Data Augmentation**

Data augmentation increases the quantity and diversity of the training data, which helps an ML model generalize better. For audio, this means applying transformations that simulate natural variations in sound signals.

### **1. Noise**

- Random noise is added to the audio signal to simulate recordings in noisy environments.
- This helps the model become more robust to variations in recording conditions.  

In [26]:
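def noise(data):
    # Noise standard deviation: a random fraction (at most 3.5%) of the peak value
    noise_amp = 0.035*np.random.uniform()*np.amax(data)
    # Add zero-mean Gaussian noise to every sample (returns a new array)
    data = data + noise_amp*np.random.normal(size=data.shape[0])
    return data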
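
To get a feel for how strong this noise is, the sketch below applies the same scaling to a synthetic sine wave and measures the signal-to-noise ratio. The names `add_noise` and `snr_db`, the seeded generator, and the 220 Hz test tone are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def add_noise(data, rng, max_fraction=0.035):
    """Add Gaussian noise whose standard deviation is at most
    max_fraction of the signal's peak value (mirrors noise() above)."""
    noise_amp = max_fraction * rng.uniform() * np.amax(data)
    return data + noise_amp * rng.normal(size=data.shape[0])

def snr_db(clean, noisy):
    """Signal-to-noise ratio in decibels."""
    noise_power = np.mean((noisy - clean) ** 2)
    return 10 * np.log10(np.mean(clean ** 2) / noise_power)

rng = np.random.default_rng(0)               # fixed seed, for reproducibility only
t = np.linspace(0, 1, 22050, endpoint=False)
clean = np.sin(2 * np.pi * 220 * t)          # synthetic 220 Hz tone (assumption)
noisy = add_noise(clean, rng)

print(f"SNR: {snr_db(clean, noisy):.1f} dB")
```

For a full-scale sine, the 3.5% cap keeps the SNR above roughly 26 dB, so the augmentation is fairly mild.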

### **2. Stretch**

- Changes the speed of the audio without altering its pitch.
- Useful for simulating variations in the duration of words or phrases without affecting the frequency of the sound.

In [27]:
def stretch(data, rate=0.8):
    # rate < 1 slows the audio down (longer output); rate > 1 speeds it up
    return librosa.effects.time_stretch(y=data, rate=rate)
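
Because stretching changes the duration, the output no longer has the same number of samples as the input. The sketch below estimates the new length as roughly `len(data) / rate`; the helper `stretched_length` is hypothetical, and the exact rounding used by librosa is an assumption here.

```python
def stretched_length(n_samples, rate):
    """Approximate number of samples after time-stretching by `rate`."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return int(round(n_samples / rate))

sr = 22050
n_samples = int(2.5 * sr)                    # a 2.5 s clip: 55125 samples

new_length = stretched_length(n_samples, rate=0.8)
print(new_length, "samples, about", round(new_length / sr, 3), "seconds")
```

A stretched clip therefore yields more frames, and a longer feature vector, than the original.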

### **3. Shift (Time Shifting)**

- Shifts the audio signal in time to generate modified versions of the same recording. The shift is circular: samples pushed past one end wrap around to the other.
- Helps make the model more robust to slight changes in the temporal alignment of the audio.

In [28]:
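def shift(data):
    # Random offset between -5000 and 5000 samples
    shift_range = int(np.random.uniform(low=-5, high=5)*1000)
    # np.roll wraps around: samples shifted past one end reappear at the other
    return np.roll(data, shift_range)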
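
The wrap-around behaviour of `np.roll` is easy to overlook, so here is a small sketch on a toy array. It also shows a zero-filled variant for comparison; `shift_with_silence` is a hypothetical alternative and is not used in this notebook.

```python
import numpy as np

signal = np.arange(1, 9)                     # [1 2 3 4 5 6 7 8]

# np.roll is circular: the last three samples re-enter at the start.
rolled = np.roll(signal, 3)

def shift_with_silence(data, n):
    """Shift by n samples and fill the vacated positions with zeros
    (hypothetical alternative to the circular shift)."""
    out = np.zeros_like(data)
    if n > 0:
        out[n:] = data[:-n]
    elif n < 0:
        out[:n] = data[-n:]
    else:
        out[:] = data
    return out

padded = shift_with_silence(signal, 3)

print(rolled)   # [6 7 8 1 2 3 4 5]
print(padded)   # [0 0 0 1 2 3 4 5]
```

At 22050 Hz, the maximum shift of 5000 samples corresponds to about 0.23 s.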

### **4. Pitch (Pitch Shifting)**

- Alters the frequency of the audio signal without changing its duration.
- Simulates variations in speakers' voices.

In [29]:
def pitch(data, sampling_rate, pitch_factor=0.7):
    # pitch_factor is passed as n_steps, i.e. the shift in semitones (fractions allowed)
    return librosa.effects.pitch_shift(y=data, sr=sampling_rate, n_steps=pitch_factor)
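
Since `n_steps` is measured in semitones, the default of 0.7 is a small shift. The sketch below converts semitones to a frequency ratio, assuming the usual 12 steps per octave; `semitones_to_ratio` is a hypothetical helper, not a librosa function.

```python
def semitones_to_ratio(n_steps, bins_per_octave=12):
    """Frequency multiplier for a pitch shift of n_steps semitones."""
    return 2.0 ** (n_steps / bins_per_octave)

ratio = semitones_to_ratio(0.7)
print(f"frequency ratio: {ratio:.4f}")       # about 1.0413, i.e. roughly +4%
print(f"220 Hz becomes about {220 * ratio:.1f} Hz")
```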

## **Feature Extraction**

### **Load packages**  

In [4]:
from tqdm import tqdm

Zero Crossing Rate (**ZCR**), Root Mean Square Energy (**RMSE**), and Mel-Frequency Cepstral Coefficients (**MFCC**) are some of the feature extraction techniques used to analyze audio signals. These features help capture important characteristics of the sound, such as signal energy, spectral properties, and frequency variations, making them useful for tasks like speech recognition, emotion detection, and audio classification.

In [5]:
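def zcr(data, frame_length=2048, hop_length=512):
    # Zero crossing rate per frame; result has shape (n_frames,)
    values = librosa.feature.zero_crossing_rate(y=data, frame_length=frame_length, hop_length=hop_length)
    return np.squeeze(values)

def rmse(data, frame_length=2048, hop_length=512):
    # Root mean square energy per frame; result has shape (n_frames,)
    values = librosa.feature.rms(y=data, frame_length=frame_length, hop_length=hop_length)
    return np.squeeze(values)

def mfcc(data, sr, flatten: bool = True):
    # MFCCs with librosa's default settings; shape (n_mfcc, n_frames) before transposing
    coeffs = librosa.feature.mfcc(y=data, sr=sr)
    return np.ravel(coeffs.T) if flatten else np.squeeze(coeffs.T)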
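
To make concrete what the first two features measure, here is a NumPy-only sketch that computes them for a single frame. It approximates the idea behind the librosa functions rather than reproducing their exact framing and padding; `frame_zcr`, `frame_rms`, and the two test tones are illustrative assumptions.

```python
import numpy as np

def frame_zcr(frame):
    """Fraction of adjacent sample pairs whose sign differs."""
    signs = np.signbit(frame)
    return np.mean(signs[1:] != signs[:-1])

def frame_rms(frame):
    """Root mean square energy of one frame."""
    return np.sqrt(np.mean(frame ** 2))

sr = 22050
t = np.arange(2048) / sr                     # one 2048-sample frame
low = 0.5 * np.sin(2 * np.pi * 100 * t)      # 100 Hz tone
high = 0.5 * np.sin(2 * np.pi * 2000 * t)    # 2000 Hz tone, same amplitude

print("ZCR low/high:", frame_zcr(low), frame_zcr(high))
print("RMS low/high:", frame_rms(low), frame_rms(high))
```

The two tones have nearly the same RMS energy, but the higher tone crosses zero far more often, which is why ZCR acts as a rough indicator of frequency content.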

In [6]:
def extract_features(data, sr=22050, frame_length=2048, hop_length=512):
    # Concatenate ZCR, RMSE and flattened MFCCs into a single 1-D feature vector
    return np.hstack((zcr(data, frame_length, hop_length),
                      rmse(data, frame_length, hop_length),
                      mfcc(data, sr)
                     ))

def get_features(path, duration=2.5, offset=0.6):
    # Load a 2.5 s excerpt starting 0.6 s into the file
    normal_audio, sr = librosa.load(path, duration=duration, offset=offset)

    # 1. Original audio
    features = extract_features(normal_audio, sr)

    # 2. Original audio with added noise
    noised_audio = noise(normal_audio)
    noised_features = extract_features(noised_audio, sr)
    features = np.vstack((features, noised_features))

    # 3. Pitch-shifted audio
    pitched_audio = pitch(normal_audio, sr)
    pitched_features = extract_features(pitched_audio, sr)
    features = np.vstack((features, pitched_features))

    # 4. Pitch-shifted audio with added noise (pitch() is deterministic, so reuse it)
    pitched_noised_audio = noise(pitched_audio)
    pitched_noised_features = extract_features(pitched_noised_audio, sr)
    features = np.vstack((features, pitched_noised_features))

    # One row per variant: shape (4, n_features)
    return features
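
The length of each feature vector follows from the number of frames. The sketch below works it out for a full 2.5 s clip, assuming librosa's defaults of centered framing (`1 + n_samples // hop_length` frames) and 20 MFCCs; `n_frames` and `feature_vector_length` are hypothetical helpers.

```python
def n_frames(n_samples, hop_length=512):
    """Frame count under centered framing (assumed librosa default)."""
    return 1 + n_samples // hop_length

def feature_vector_length(n_samples, hop_length=512, n_mfcc=20):
    """ZCR and RMSE give one value per frame, MFCC gives n_mfcc per frame."""
    return n_frames(n_samples, hop_length) * (1 + 1 + n_mfcc)

sr = 22050
full_clip = int(2.5 * sr)                    # 55125 samples

print(n_frames(full_clip))                   # 108
print(feature_vector_length(full_clip))      # 2376
```

Clips shorter than 2.5 s produce fewer frames and therefore shorter vectors, which is why the vectors are padded to a common length before saving.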

In [38]:
X, Y = [], []

print('Starting to extract features...\n')

# Each file yields four feature vectors: the original plus three augmented variants
for index, (path, emotion) in enumerate(tqdm(zip(df.file_path, df.emotion_id))):

    features = get_features(path)

    if index % 1000 == 0:
        print(f'{index} audios has been processed')

    for i in features:
        X.append(i)
        Y.append(emotion)

# Note: len(X) counts feature vectors (four per audio file), not files
print(f'Done! {len(X)} audios has been processed')


Starting to extract features...

1it [00:01,  1.89s/it]
0 audios has been processed
1002it [02:35,  5.14it/s]
1000 audios has been processed
2002it [05:26,  7.66it/s]
2000 audios has been processed
3002it [07:20,  8.28it/s]
3000 audios has been processed
4002it [09:13,  7.83it/s]
4000 audios has been processed
5002it [11:09,  8.32it/s]
5000 audios has been processed
6002it [13:14,  7.89it/s]
6000 audios has been processed
7002it [15:59,  8.52it/s]
7000 audios has been processed
8002it [18:11,  7.38it/s]
8000 audios has been processed
9002it [20:25, 11.78it/s]
9000 audios has been processed
10002it [22:08,  8.49it/s]
10000 audios has been processed
11002it [24:02,  9.03it/s]
11000 audios has been processed
12002it [26:07,  6.44it/s]
12000 audios has been processed
12162it [26:30,  7.65it/s]
Done! 48648 audios has been processed


### **Saving features**

In [40]:
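# Feature vectors differ in length when a clip is shorter than 2.5 s,
# so right-pad each one with zeros to the length of the longest vector
max_length = max(len(x) for x in X)
X_padded = np.array([np.pad(x, (0, max_length - len(x))) for x in X])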

In [41]:
X_array = np.array(X_padded)
Y_array = np.array(Y)

In [42]:
np.savez('./features/features.npz', X=X_array, Y=Y_array)
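
As a quick check of the padding and saving steps, the sketch below runs them on three toy vectors and reads the archive back with `np.load`. The toy data and the temporary directory are assumptions for illustration; the real notebook writes to the path shown above.

```python
import os
import tempfile
import numpy as np

# Toy feature vectors of unequal length (illustrative data only)
X = [np.array([1.0, 2.0, 3.0]), np.array([4.0]), np.array([5.0, 6.0])]
Y = ["neutral", "happy", "sad"]

# Right-pad every vector with zeros to the longest length
max_length = max(len(x) for x in X)
X_padded = np.array([np.pad(x, (0, max_length - len(x))) for x in X])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features.npz")
    np.savez(path, X=X_padded, Y=np.array(Y))

    # Arrays are stored under the keyword names passed to np.savez
    with np.load(path) as data:
        X_loaded, Y_loaded = data["X"], data["Y"]

print(X_loaded.shape)   # (3, 3)
print(X_loaded[1])      # [4. 0. 0.]
```

A downstream training notebook can load the features the same way, using the keys `X` and `Y`.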

## **Feature extraction without Data Augmentation**

In [36]:
def get_features(path, duration=2.5, offset=0.6):
    # Redefines get_features so that only the original (non-augmented) audio is used
    normal_audio, sr = librosa.load(path, duration=duration, offset=offset)

    # Single 1-D feature vector per file
    return extract_features(normal_audio, sr)

In [39]:
X_, Y_ = [], []

print('Starting to extract features without data augmentation ...\n')

# One feature vector per file this time
for index, (path, emotion) in enumerate(tqdm(zip(df.file_path, df.emotion_id))):

    features = get_features(path)

    if index % 1000 == 0:
        print(f'{index} audios has been processed')

    X_.append(features)
    Y_.append(emotion)

print(f'Done! {len(X_)} audios has been processed')


Starting to extract features without data augmentation ...

6it [00:00, 54.13it/s]
0 audios has been processed
1006it [00:16, 57.71it/s]
1000 audios has been processed
2005it [00:31, 69.67it/s]
2000 audios has been processed
3013it [00:48, 64.39it/s]
3000 audios has been processed
4014it [01:01, 72.54it/s]
4000 audios has been processed
5007it [01:14, 72.34it/s]
5000 audios has been processed
6014it [01:27, 80.65it/s]
6000 audios has been processed
7011it [01:41, 81.48it/s]
7000 audios has been processed
8013it [01:54, 80.40it/s]
8000 audios has been processed
9019it [02:07, 94.80it/s]
9000 audios has been processed
10019it [02:18, 87.55it/s]
10000 audios has been processed
11009it [02:31, 83.43it/s]
11000 audios has been processed
12008it [02:45, 65.93it/s]
12000 audios has been processed
12162it [02:47, 72.73it/s]
Done! 12162 audios has been processed

### **Saving features**

In [41]:
# Right-pad shorter feature vectors with zeros so every row has the same length
max_length_ = max(len(x) for x in X_)
X_padded_ = np.array([np.pad(x, (0, max_length_ - len(x))) for x in X_])

In [42]:
X_array_ = np.array(X_padded_)
Y_array_ = np.array(Y_)

In [43]:
np.savez('./features_npz/features_no.npz', X=X_array_, Y=Y_array_)