### Household Sounds Capstone
## 4. Pre-Processing: Convolutional Neural Network
- In this notebook we will create Validation and Test datasets for hyperparameter tuning of the Neural Networks.
- We will also perform some minor data cleaning to prepare the data for modeling in the next notebook.

### Convolutional Neural Network Validation-Test Split
- For the Convolutional Neural Network models, we will split the Testing data into two roughly equal-sized datasets (5,115 and 5,116 rows).
- The Validation set will be used to evaluate model performance during hyperparameter tuning.
- The Test set will only be used to evaluate the final models.
- The Training dataset has already been separated by the creators of this dataset.

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

In [None]:
# loading metadata
dev_info = pd.read_json('data/labelled_dev_info.json')
eval_info = pd.read_json('data/labelled_eval_info.json')
dev_info_resampled = pd.read_json('data/dev_info_resamp.json')

In [19]:
# shuffling eval_info indices before splitting into validation/test datasets
shuffled_eval_info = eval_info.sample(frac=1).reset_index(drop=True)

# splitting into validation and test datasets (50-50 split)
split_point = len(shuffled_eval_info) // 2  # 5115 of the 10231 rows
validation_info = shuffled_eval_info.iloc[:split_point]
test_info = shuffled_eval_info.iloc[split_point:]

print(validation_info.shape)
print(test_info.shape)

(5115, 12)
(5116, 12)
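
The shuffle above is unseeded, so the split changes each time the cell is re-run. It also does not balance the classes between the two halves. As a hedged alternative, the sketch below uses scikit-learn's `train_test_split` with a fixed `random_state` and stratification on the label column. The toy DataFrame and the names `toy_info`, `validation_toy`, and `test_toy` are hypothetical stand-ins, not part of this project's data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for eval_info: 12 clips across 3 classes
toy_info = pd.DataFrame({
    "wav_name": [f"clip_{i}.wav" for i in range(12)],
    "labels": [0, 1, 2] * 4,
})

# 50-50 split; stratify keeps class proportions equal in both halves,
# and random_state makes the shuffle repeatable across runs
validation_toy, test_toy = train_test_split(
    toy_info,
    test_size=0.5,
    stratify=toy_info["labels"],
    random_state=42,
)
validation_toy = validation_toy.reset_index(drop=True)
test_toy = test_toy.reset_index(drop=True)

print(validation_toy.shape)
print(test_toy.shape)
```

Stratification matters most when some classes are rare; with a plain shuffle a rare class can end up under-represented in one of the halves.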


In [20]:
shuffled_eval_info.to_json('data/shuffled_eval_info.json')
validation_info.to_json('data/validation_info.json')
test_info.to_json('data/test_info.json')

### Pre-Processing Mean MFCC Value Features
- Scaling Mean MFCC Values to zero mean and unit variance

In [21]:
# loading Training and Testing Data features: Numpy Arrays of Mean MFCC values
X_train = np.load('data/train_resamp_mean_mfcc_values.npz')['arr_0']
X_test = np.load('data/test_augmented_mean_mfcc_values.npz')['arr_0']

# declaring target variable:
y_train = dev_info_resampled['labels'].to_numpy()
y_test = eval_info['labels'].to_numpy()

# Standardizing features: fit the scaler on the training data only,
# then apply the same training mean and variance to the test data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


np.savez('data/scaled_train_mean_mfcc_values.npz', X_train)    
np.savez('data/scaled_test__mean_mfcc_values.npz', X_test)  

print(X_train.shape)
print(X_test.shape)

print(y_train.shape)
print(y_test.shape)

(40000, 32)
(10231, 32)
(40000,)
(10231,)
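
To make the scaling step concrete, here is a minimal sketch on toy data. It assumes the usual convention of learning the mean and standard deviation from the training features only and reusing them for held-out features. The arrays `toy_train` and `toy_test` are randomly generated placeholders, not the MFCC data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
toy_train = rng.normal(loc=5.0, scale=2.0, size=(100, 4))  # placeholder training features
toy_test = rng.normal(loc=5.0, scale=2.0, size=(20, 4))    # placeholder held-out features

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns per-column mean and std
toy_test_scaled = toy_scaler.transform(toy_test)        # reuses the training statistics

# The same transform written out by hand: z = (x - train_mean) / train_std
manual_test_scaled = (toy_test - toy_train.mean(axis=0)) / toy_train.std(axis=0)

print(np.allclose(toy_test_scaled, manual_test_scaled))
```

The scaled training columns have zero mean and unit variance by construction. The scaled test columns will only be close to that, which is expected: they are measured against the training distribution.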


### Validation-Test Split
- The Neural Network model requires the same 50-50 validation-test split, applied to a different set of features (the scaled Mean MFCC values).

In [26]:
# making validation-test split for Neural Network hyperparameter tuning
val_test_info = pd.DataFrame(X_test)
val_test_info['labels'] = y_test

# shuffling before splitting into validation/test datasets
val_test_info = val_test_info.sample(frac=1).reset_index(drop=True)
validation_info = val_test_info.iloc[:5115]
test_info = val_test_info.iloc[5115:]

validation_info.to_json('data/nn_validation_info.json')
test_info.to_json('data/nn_test_info.json')

print(validation_info.shape)
print(test_info.shape)

(5115, 33)
(5116, 33)
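
Because each call to `sample(frac=1)` draws a new random order, two separate shuffles will generally not place the same clips in the validation half. If the CNN and the Mean MFCC models should be compared on identical clips, one option is to draw a single permutation and apply it to both tables. This is a sketch under the assumption that the feature rows are in the same order as the metadata rows; all names and the toy data below are hypothetical.

```python
import numpy as np
import pandas as pd

n_clips = 10

# Hypothetical metadata table (stand-in for eval_info)
toy_meta = pd.DataFrame({
    "wav_name": [f"clip_{i}.wav" for i in range(n_clips)],
    "labels": np.arange(n_clips) % 2,
})

# Hypothetical feature table in the same row order (stand-in for X_test plus labels)
toy_features = pd.DataFrame(np.arange(n_clips * 3).reshape(n_clips, 3))
toy_features["labels"] = toy_meta["labels"]

# One seeded permutation, reused for both tables
rng = np.random.default_rng(42)
perm = rng.permutation(n_clips)
half = n_clips // 2
val_idx, test_idx = perm[:half], perm[half:]

meta_val, meta_test = toy_meta.iloc[val_idx], toy_meta.iloc[test_idx]
feat_val, feat_test = toy_features.iloc[val_idx], toy_features.iloc[test_idx]

# Both validation sets now refer to the same original rows
print(list(meta_val.index) == list(feat_val.index))
```

Saving `val_idx` and `test_idx` alongside the data would let later notebooks rebuild the same split for any additional feature set.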


### Cleaning Dataframes Before Modeling

In [None]:
# Loading data
eval_info = pd.read_json('data/labelled_eval_info.json')
validation_info = pd.read_json('data/validation_info.json')
test_info = pd.read_json('data/test_info.json')

# Removing unnecessary columns used during EDA
eda_columns = ['title', 'license', 'uploader', 'wav_name', 'labels_15', 'labels_2', 'labels_4']

eval_info = eval_info.drop(columns=eda_columns)
eval_info.to_json('data/labelled_eval_info.json')
test_info = test_info.drop(columns=eda_columns)
test_info.to_json('data/test_info.json')
validation_info = validation_info.drop(columns=eda_columns)
validation_info.to_json('data/validation_info.json')
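
Since the cleaned DataFrames are written back to the same JSON files, running the cell above a second time would raise a `KeyError`, because the columns are already gone. A small sketch of one way to make the cleanup safe to re-run uses the `errors='ignore'` option of `DataFrame.drop`. The helper name `drop_eda_columns` and the toy DataFrame are hypothetical.

```python
import pandas as pd

# Columns that were only needed during EDA
EDA_COLUMNS = ['title', 'license', 'uploader', 'wav_name',
               'labels_15', 'labels_2', 'labels_4']

def drop_eda_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy without the EDA-only columns; missing columns are skipped."""
    return frame.drop(columns=EDA_COLUMNS, errors='ignore')

# Hypothetical toy frame containing only some of the EDA columns
toy = pd.DataFrame({
    "title": ["door slam"],
    "license": ["cc0"],
    "labels": [3],
    "duration": [1.5],
})

cleaned_once = drop_eda_columns(toy)
cleaned_twice = drop_eda_columns(cleaned_once)  # no error on the second pass

print(list(cleaned_once.columns))
```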

### Next Step: Modeling 
- In the next notebook, we will begin training one CNN with the Mel-Frequency Spectrograms, one CNN with the Mel-Frequency Cepstrum Spectrograms, and a series of classifiers with the Mean Mel-Frequency Cepstral Coefficients.