# Recurrent Neural Network

In this file, we present our approach to the problem using a **recurrent neural network**.

We begin by importing the necessary modules:

In [None]:
import pandas as pd
import numpy as np
from tqdm import tqdm # progress bar on long runs
from scipy.io import wavfile as wav
import librosa
import os
import matplotlib.pyplot as plt
# render matplotlib figures inline in the notebook
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('../UrbanSound8K/metadata/UrbanSound8K.csv')

df.head()

Unnamed: 0,slice_file_name,fsID,start,end,salience,fold,classID,class
0,100032-3-0-0.wav,100032,0.0,0.317551,1,5,3,dog_bark
1,100263-2-0-117.wav,100263,58.5,62.5,1,5,2,children_playing
2,100263-2-0-121.wav,100263,60.5,64.5,1,5,2,children_playing
3,100263-2-0-126.wav,100263,63.0,67.0,1,5,2,children_playing
4,100263-2-0-137.wav,100263,68.5,72.5,1,5,2,children_playing


As mentioned in the **project statement**, the target variable is the label of each sound. The dataset contains 10 possible sound classes:

 - air conditioner
 - car horn
 - children playing
 - dog bark
 - drilling
 - engine idling
 - gun shot
 - jackhammer
 - siren
 - street music


The dataset already includes the `classID` column, which encodes each label as an integer from 0 to 9:


In [3]:
class_id_pairs = df[['classID', 'class']].drop_duplicates().sort_values(by="classID")

for index, row in class_id_pairs.iterrows():
    print(f'classID: {row["classID"]}, class: {row["class"]}')

classID: 0, class: air_conditioner
classID: 1, class: car_horn
classID: 2, class: children_playing
classID: 3, class: dog_bark
classID: 4, class: drilling
classID: 5, class: engine_idling
classID: 6, class: gun_shot
classID: 7, class: jackhammer
classID: 8, class: siren
classID: 9, class: street_music


This means that we can drop the redundant `class` column and begin working with our dataset, which we already determined is slightly unbalanced for the `car_horn` and `gun_shot` classes:

In [4]:
df.drop(columns=['class'],inplace=True)
df.head()

Unnamed: 0,slice_file_name,fsID,start,end,salience,fold,classID
0,100032-3-0-0.wav,100032,0.0,0.317551,1,5,3
1,100263-2-0-117.wav,100263,58.5,62.5,1,5,2
2,100263-2-0-121.wav,100263,60.5,64.5,1,5,2
3,100263-2-0-126.wav,100263,63.0,67.0,1,5,2
4,100263-2-0-137.wav,100263,68.5,72.5,1,5,2


## Feature extraction
The **librosa** library has a built-in method for extracting [Mel-Frequency Cepstral Coefficients](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum) (MFCCs), which summarise the frequency distribution within each time window.

In order to build the new dataset, we developed the following functions, which are capable of extracting **1D or 2D** features.

These feature extractor functions represent the frequency content of the wav files as **NumPy arrays**, using MFCCs to obtain features that approximate the way humans perceive sound.

In [5]:
# 1D features: each of the 40 MFCCs averaged over the time axis
def features_extractor_1D(file):
    audio, sample_rate = librosa.load(file)  # mono, resampled to librosa's default rate
    mfccs_features = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)  # shape (40, n_frames)
    mfccs_mean_features = np.mean(mfccs_features.T,axis=0)  # shape (40,)
    return mfccs_mean_features

# 2D features: keeps the full matrix, one column of 40 MFCCs per time frame
def features_extractor_2D(file):
    audio, sample_rate = librosa.load(file)
    mfccs_features = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)  # shape (40, n_frames)
    return mfccs_features
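
To make the difference between the two extractors concrete, the sketch below uses a random matrix as a stand-in for an MFCC matrix, so it runs without any audio file. The names `fake_mfccs` and `n_frames` are hypothetical, and the frame count of 173 is only an assumed example value.

```python
import numpy as np

# Stand-in for the output of librosa.feature.mfcc: shape (n_mfcc, n_frames).
# Random values are used only so the sketch runs without audio files.
rng = np.random.default_rng(0)
n_mfcc, n_frames = 40, 173  # assumed example sizes
fake_mfccs = rng.normal(size=(n_mfcc, n_frames))

# 1D variant: average every coefficient over time -> one vector per clip
features_1d = np.mean(fake_mfccs.T, axis=0)

# 2D variant: keep the whole matrix, so the temporal order is preserved
features_2d = fake_mfccs

print(features_1d.shape)  # (40,)
print(features_2d.shape)  # (40, 173)
```

The 1D vector is compact but discards the order of the frames, whereas the 2D matrix keeps the sequence that a recurrent model can read frame by frame.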

Now that we know how to transform audio files into usable data types, we must associate each NumPy array with its respective entry in the `df` dataframe.

This will allow for important pre-processing steps to be applied accordingly, as well as proper Neural Network training and testing.
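
As a rough illustration of that association, the sketch below pairs each metadata row with the path of its clip and its label. The rows are copied from the table shown earlier and stored as plain dicts; the helper `clip_path` is a hypothetical name, and the folder layout `fold<N>/<file>` is assumed from the dataset path used in this notebook.

```python
import os

# Two metadata rows copied from the table above, as plain dicts
rows = [
    {"slice_file_name": "100032-3-0-0.wav", "fold": 5, "classID": 3},
    {"slice_file_name": "100263-2-0-117.wav", "fold": 5, "classID": 2},
]
audio_root = "../UrbanSound8K/audio"

def clip_path(root, row):
    # os.path.join picks the right separator for the current OS,
    # so no hard-coded '/' or '\\' is needed
    return os.path.join(root, f"fold{row['fold']}", row["slice_file_name"])

# Each entry couples the file to load with the label it should be given
pairs = [(clip_path(audio_root, row), row["classID"]) for row in rows]
for path, label in pairs:
    print(label, path)
```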

In [None]:
# Identify path containing all folds
audio_dataset_path='../UrbanSound8K/audio/'
extracted_features1=[]
extracted_features2=[]

'''

# Iterates over all original dataframe rows (predicts approximate runtime)
for index_num,row in tqdm(df.iterrows(), total=len(df), desc="Processing", unit="row"):
    # Identifies wav file name, joins it to its fold folder: accesses original .wav file
    # (os.path.join inserts the separator of the current OS, so none is hard-coded)
    file_name = os.path.join(os.path.abspath(audio_dataset_path),'fold'+str(row["fold"]),str(row["slice_file_name"]))
    
    # Adds associated sound label
    final_class_labels=row["classID"]
    

    data1=features_extractor_1D(file_name) 
    extracted_features1.append([data1,final_class_labels])

    data2=features_extractor_2D(file_name) 
    extracted_features2.append([data2,final_class_labels])
    


    
# Convert extracted_features to Pandas dataframe
df_1d =pd.DataFrame(extracted_features1,columns=['feature','class'])
df_2d =pd.DataFrame(extracted_features2,columns=['feature','class'])

df_1d.to_csv("rnn_1d.csv", index=False)
df_2d.to_csv("rnn_2d.csv", index=False)

'''

Processing: 100%|██████████| 8732/8732 [02:03<00:00, 70.86row/s] 
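
One caveat when saving these features: a CSV cell can only hold text, so a NumPy array written with `to_csv` is stored in its printed form, which NumPy abbreviates with `...` for large arrays. The sketch below shows one possible alternative, assuming a binary format is acceptable: pickling the list of (feature, label) pairs. The file name `rnn_2d.pkl` and the random stand-in data are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical stand-ins for two (feature, label) entries of different lengths
rng = np.random.default_rng(0)
extracted = [
    (rng.normal(size=(40, 14)).astype(np.float32), 3),
    (rng.normal(size=(40, 173)).astype(np.float32), 2),
]

# Written to a temporary folder so the sketch leaves the project untouched
out_path = os.path.join(tempfile.mkdtemp(), "rnn_2d.pkl")
with open(out_path, "wb") as fh:
    pickle.dump(extracted, fh)

with open(out_path, "rb") as fh:
    restored = pickle.load(fh)

# Arrays come back with their original shape, dtype and values
print(restored[1][0].shape)  # (40, 173)
```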


## Data Preprocessing

The MFCCs that librosa returns do not share a common scale: value ranges differ between coefficients and between `.wav` files. The first coefficient, which tracks the overall energy of the signal, is typically much larger in magnitude than the others, and the mel scale gives more weight to lower frequencies. Left unscaled, these heterogeneous distributions could introduce bias into the model.

To address this issue, we can apply feature scaling to the new dataframes, in order to improve data quality for our modeling purposes:

In [None]:
# Uses sklearn's MinMax scaler, rescales values to be in a range of [0,1]
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# /////////////////// NEEDS REVISION ///////////////////

# example = df_2d.iloc[0]["feature"][0]
# print("First array of the first entry in the 2D dataset: \n", example)

'''
# Iterates over the rows of the 1D feature dataframe (predicts approximate runtime)
for index_num,row in tqdm(df_1d.iterrows(), total=len(df_1d), desc="Processing", unit="row"):
    # Get the "feature" array for the current row
    features_array = row['feature']

    # MinMaxScaler expects a 2D input, so a 1D array of shape (40,)
    # is reshaped into a single column of shape (40, 1) before scaling
    if isinstance(features_array, np.ndarray):  # Check if the element is a numpy array
        if features_array.ndim == 1:
            # Reshape the 1D array to 2D for scaling
            features_array = features_array.reshape(-1, 1)
        
        # Apply Min-Max scaling to the array (the scaler is refitted for every clip)
        scaled_features = scaler.fit_transform(features_array).flatten()  # Flatten to restore the 1D structure after scaling
        
        # Update the "feature" column with the scaled array (in-place)
        df_1d.at[index_num, 'feature'] = scaled_features
'''
# /////////////////// NEEDS REVISION ///////////////////

First array of the first entry in the 2D dataset: 
 [-332.03876  -169.58775   -90.24683   -56.92349   -40.27587   -50.544167
  -99.22394  -159.67033  -215.6434   -267.75623  -316.28625  -355.39624
 -390.08435  -423.43994 ]
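
Since the scaling cell is still marked as needing revision, here is one possible direction for the 2D features, offered as a sketch rather than as the final method. It applies the min-max formula per MFCC coefficient, with the minimum and maximum taken over all frames of all training clips, so clips of different lengths share one scale. The function names `fit_minmax` and `apply_minmax`, the `eps` term, and the random stand-in clips are all assumptions.

```python
import numpy as np

def fit_minmax(train_clips):
    """Per-coefficient min and max over all frames of all training clips."""
    stacked = np.concatenate(train_clips, axis=1)  # (n_mfcc, total_frames)
    lo = stacked.min(axis=1, keepdims=True)        # (n_mfcc, 1)
    hi = stacked.max(axis=1, keepdims=True)        # (n_mfcc, 1)
    return lo, hi

def apply_minmax(clip, lo, hi, eps=1e-8):
    """Rescale each coefficient row using the training statistics."""
    # eps guards against division by zero for a constant coefficient
    return (clip - lo) / (hi - lo + eps)

# Random stand-ins for three MFCC matrices with different numbers of frames
rng = np.random.default_rng(0)
train_clips = [rng.normal(-50, 80, size=(40, t)) for t in (14, 173, 90)]

lo, hi = fit_minmax(train_clips)
scaled = [apply_minmax(clip, lo, hi) for clip in train_clips]

print([s.shape for s in scaled])  # shapes are unchanged by the scaling
```

Computing the statistics on the training folds only, and reusing them for the test fold, avoids leaking information from the test data into the preprocessing.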



Since not every `.wav` file is 4 seconds long, we will apply **zero-padding** to ensure that all files meet this requirement.
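
A minimal sketch of this step is shown below, assuming the padding is applied to the 2D MFCC matrix rather than to the raw audio. It also assumes librosa's defaults (a 22050 Hz sample rate and a hop length of 512 samples), under which a 4-second clip should yield 173 frames. The helper `pad_or_truncate` and the constant `TARGET_FRAMES` are hypothetical names.

```python
import numpy as np

# Assumption: librosa defaults (sr=22050, hop_length=512), so a 4 s clip
# yields 1 + (4 * 22050) // 512 = 173 frames.
TARGET_FRAMES = 1 + (4 * 22050) // 512

def pad_or_truncate(mfccs, target_frames=TARGET_FRAMES):
    """Force an (n_mfcc, n_frames) matrix to exactly target_frames columns."""
    n_frames = mfccs.shape[1]
    if n_frames >= target_frames:
        # Longer clips are cut at the target length
        return mfccs[:, :target_frames]
    # Shorter clips get zero-valued frames appended at the end
    pad_width = target_frames - n_frames
    return np.pad(mfccs, ((0, 0), (0, pad_width)), mode="constant")

short_clip = np.ones((40, 14), dtype=np.float32)  # stand-in for a short clip
padded = pad_or_truncate(short_clip)
print(padded.shape)  # (40, 173)
```

Padding at the end keeps the real frames at the start of every sequence; whether the model should also be told to ignore the padded frames (for example through masking) is a separate design choice.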

In [20]:
# /////////////////// NEEDS REVISION ///////////////////

## Model Development

To develop an effective **Recurrent Neural Network**, the group decided to explore **Long Short-Term Memory** (LSTM) networks. LSTMs are a type of RNN designed to recognize patterns in sequential data, including dependencies that span many time steps.

We believe this approach could be the most effective for classifying the sounds, since continuous sounds and repetitive rhythms are sequential in nature. These time-dependent characteristics are ones that LSTMs are capable of recognizing and "remembering" across the time steps of a clip.
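
To illustrate what "remembering" means here, the sketch below implements a single LSTM time step in plain NumPy and runs it over the frames of an MFCC-shaped matrix. It follows the standard LSTM gate equations and is not the model built for this project: the weights are random and untrained, and the hidden size `H = 8`, the gate ordering inside `W`, `U`, `b`, and all names are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. Shapes: W (4H, D), U (4H, H), b (4H,)."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate: how much new information to write
    f = sigmoid(z[H:2 * H])        # forget gate: how much of the old cell state to keep
    o = sigmoid(z[2 * H:3 * H])    # output gate: how much of the cell state to expose
    g = np.tanh(z[3 * H:4 * H])    # candidate values for the cell state
    c_t = f * c_prev + i * g       # cell state carries memory across time steps
    h_t = o * np.tanh(c_t)         # hidden state passed on to the next step
    return h_t, c_t

rng = np.random.default_rng(0)
D, H, T = 40, 8, 173  # MFCCs per frame, hidden units, frames (assumed sizes)
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)

mfccs = rng.normal(size=(D, T))  # random stand-in for one clip
h, c = np.zeros(H), np.zeros(H)
for t in range(T):
    # Each frame updates the state, which summarises everything seen so far
    h, c = lstm_step(mfccs[:, t], h, c, W, U, b)

print(h.shape)  # (8,)
```

In a classifier, the final hidden state `h` would typically be fed to a dense layer with one output per class; in practice a deep learning framework would provide the LSTM layer and learn the weights.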