In [4]:
%load_ext autoreload
%autoreload 2

# Classifying Music Note sounds using Deep Learning

source: https://medium.com/@mikesmales/sound-classification-using-deep-learning-8bc2aa1990b7

### Load Preprocessed data 

#### Feature Extraction refinement 

In the previous feature extraction stage, the MFCC vectors varied in size between audio files, depending on each sample's duration. 

However, CNNs require a fixed size for all inputs. To overcome this we will zero pad the output vectors to make them all the same size. 
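
As a minimal sketch of the padding idea, the helper below brings any MFCC matrix to a fixed number of frames. The name `pad_or_truncate` and the truncation branch are assumptions added for illustration; the notebook's own `extract_features` only pads.

```python
import numpy as np

def pad_or_truncate(mfccs, max_len):
    """Return `mfccs` with exactly `max_len` columns (frames)."""
    n_frames = mfccs.shape[1]
    if n_frames >= max_len:
        # Longer clips are cut off at max_len frames (assumed behaviour).
        return mfccs[:, :max_len]
    # Shorter clips get zero columns appended on the right.
    return np.pad(mfccs, ((0, 0), (0, max_len - n_frames)), mode="constant")

short_clip = np.ones((40, 91))   # stand-in for a short clip's MFCCs
long_clip = np.ones((40, 400))   # stand-in for a clip longer than the limit

padded = pad_or_truncate(short_clip, 365)
cut = pad_or_truncate(long_clip, 365)
print(padded.shape, cut.shape)
```

Both results have the same shape, so they can be stacked into a single input array for the network.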

In [15]:
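import numpy as np
import librosa

# Number of MFCC frames every clip is padded to
max_pad_len = 365

def extract_features(file_name):
    """Return a (40, max_pad_len) zero-padded MFCC array, or None on failure."""
    try:
        audio, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
        mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
        # Zero-pad along the time axis. A clip longer than max_pad_len frames
        # gives a negative pad width, which np.pad rejects (caught below).
        pad_width = max_pad_len - mfccs.shape[1]
        mfccs = np.pad(mfccs, pad_width=((0, 0), (0, pad_width)), mode='constant')

    except Exception as e:
        print("Error encountered while parsing file: ", file_name, e)
        return None

    return mfccs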

In [16]:
# Load various imports 
import pandas as pd
import os
import librosa

# Set the path to the guitar sample dataset 
DATA_DIR = os.path.join("data", "guitar_sample")

# feature list
features = []

# Iterate through each class folder and extract features from every sound file
for folder in os.listdir(DATA_DIR):
    for file in os.listdir(os.path.join(DATA_DIR, folder)):
        class_label = folder
        file_name = os.path.join(DATA_DIR, folder, file)

        data = extract_features(file_name)
        # Skip files that could not be parsed (extract_features returned None)
        if data is not None:
            features.append([data, class_label])

# Convert into a pandas DataFrame 
featuresdf = pd.DataFrame(features, columns=['feature','class_label'])

print('Finished feature extraction from ', len(featuresdf), ' files') 

274 365 (40, 91)
113 365 (40, 252)
284 365 (40, 81)
254 365 (40, 111)
249 365 (40, 116)
236 365 (40, 129)
214 365 (40, 151)
226 365 (40, 139)
265 365 (40, 100)
209 365 (40, 156)
245 365 (40, 120)
129 365 (40, 236)
244 365 (40, 121)
245 365 (40, 120)
220 365 (40, 145)
235 365 (40, 130)
248 365 (40, 117)
230 365 (40, 135)
250 365 (40, 115)
246 365 (40, 119)
284 365 (40, 81)
112 365 (40, 253)
276 365 (40, 89)
276 365 (40, 89)
231 365 (40, 134)
234 365 (40, 131)
222 365 (40, 143)
189 365 (40, 176)
186 365 (40, 179)
198 365 (40, 167)
240 365 (40, 125)
0 365 (40, 365)
270 365 (40, 95)
270 365 (40, 95)
242 365 (40, 123)
245 365 (40, 120)
242 365 (40, 123)
238 365 (40, 127)
248 365 (40, 117)
242 365 (40, 123)
278 365 (40, 87)
272 365 (40, 93)
303 365 (40, 62)
278 365 (40, 87)
287 365 (40, 78)
271 365 (40, 94)
57 365 (40, 308)
267 365 (40, 98)
207 365 (40, 158)
249 365 (40, 116)
243 365 (40, 122)
238 365 (40, 127)
243 365 (40, 122)
242 365 (40, 123)
244 365 (40, 121)
236 365 (40, 129)
199 365 (

In [17]:
from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical

# Convert features and corresponding classification labels into numpy arrays
X = np.array(featuresdf.feature.tolist())
y = np.array(featuresdf.class_label.tolist())

# Encode the classification labels
le = LabelEncoder()
yy = to_categorical(le.fit_transform(y)) 

# split the dataset 
from sklearn.model_selection import train_test_split 

x_train, x_test, y_train, y_test = train_test_split(X, yy, test_size=0.2, random_state = 42)
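
To make the label encoding step concrete, here is a small NumPy-only sketch of what the `LabelEncoder` plus `to_categorical` combination produces. The label list is made up for illustration. Like `LabelEncoder`, `np.unique` assigns integer codes in sorted order of the class names.

```python
import numpy as np

labels = np.array(["E", "A", "G", "A", "EH"])  # made-up labels for illustration

# np.unique returns the sorted classes and, for each label, its index into them.
classes, indices = np.unique(labels, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(len(classes))[indices]

print(classes)
print(one_hot)
```

Each row of `one_hot` contains a single 1 in the column of its class, which is the target format expected by a `softmax` output trained with categorical cross-entropy.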

### Convolutional Neural Network (CNN) model architecture 


We will modify our model to be a Convolutional Neural Network (CNN), again using Keras with a TensorFlow backend. 

Again we will use a `sequential` model, starting with a simple model architecture consisting of four `Conv2D` convolution layers, with our final output layer being a `dense` layer. 

The convolution layers are designed for feature detection. Each one works by sliding a filter window over the input, multiplying the filter weights element-wise with the values under the window, summing the products, and storing the result in a feature map. This operation is known as a convolution. 
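
The sliding-window operation can be sketched in plain NumPy. This is an illustrative single-channel version with stride 1 and no padding, not how Keras implements `Conv2D` internally. The function name `conv2d_valid` and the example values are assumptions.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            # Element-wise product of window and kernel, summed to one number.
            out[i, j] = np.sum(window * kernel)
    return out

image = np.arange(12, dtype=float).reshape(3, 4)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])  # 2x2 filter, matching kernel_size=2
feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)
```

A 2x2 filter over a 3x4 input gives a 2x3 feature map: each dimension shrinks by one, which is the same reduction visible in the model summary.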


The `filters` parameter specifies the number of filters (output feature maps) in each layer. The layers increase in size from 16 to 32, 64 and 128 filters, while the `kernel_size` parameter specifies the size of the kernel window, which in this case is 2, resulting in a 2x2 filter matrix. 

The first layer will receive an input shape of (40, 365, 1), where 40 is the number of MFCCs, 365 is the number of frames after padding, and 1 is the single channel, signifying that the audio is mono. 
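
The layer shapes and parameter counts can be checked by hand. The sketch below assumes the settings used in the model code (`num_rows = 40`, `num_columns = 365`, 2x2 kernels, 2x2 pooling, no padding); the helper names are hypothetical.

```python
def conv_out(size, kernel=2):
    # 'valid' convolution with stride 1 shrinks each dimension by kernel - 1
    return size - kernel + 1

def pool_out(size, pool=2):
    # non-overlapping pooling divides each dimension, rounding down
    return size // pool

def conv_params(in_channels, filters, kernel=2):
    # one kernel x kernel x in_channels weight block per filter, plus one bias each
    return kernel * kernel * in_channels * filters + filters

rows, cols, channels = 40, 365, 1
shapes, params = [], []
for filters in (16, 32, 64, 128):
    params.append(conv_params(channels, filters))
    rows, cols = conv_out(rows), conv_out(cols)
    shapes.append(("conv", rows, cols, filters))
    rows, cols = pool_out(rows), pool_out(cols)
    shapes.append(("pool", rows, cols, filters))
    channels = filters

for shape in shapes:
    print(shape)
print(params)
```

The first values, (39, 364, 16) with 80 parameters and (18, 181, 32) with 2080 parameters, agree with the `model.summary()` output.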

The activation function we will be using for our convolutional layers is `ReLU` which is the same as our previous model. We will use a smaller `Dropout` value of 20% on our convolutional layers. 

Each convolutional layer has an associated pooling layer of `MaxPooling2D` type, and the final convolutional layer is additionally followed by a `GlobalAveragePooling2D` layer. The pooling layers reduce the dimensionality of the model (by reducing the parameters and subsequent computation requirements), which serves to shorten the training time and reduce overfitting. Max pooling takes the maximum value in each window, and global average pooling takes the average of each feature map, which is suitable for feeding into our `dense` output layer.  
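
Both pooling types are easy to reproduce on a toy feature map. The following is a rough NumPy sketch under simplifying assumptions (a single 2D map for max pooling, a height x width x channels array for global average pooling); the function names and values are made up.

```python
import numpy as np

def max_pool_2x2(fmap):
    """Keep the largest value in each non-overlapping 2x2 window."""
    h, w = fmap.shape[0] // 2, fmap.shape[1] // 2
    trimmed = fmap[:h * 2, :w * 2]  # drop an odd trailing row/column
    return trimmed.reshape(h, 2, w, 2).max(axis=(1, 3))

def global_average_pool(fmaps):
    """Average each channel of a (height, width, channels) array to one number."""
    return fmaps.mean(axis=(0, 1))

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 5],
                 [0, 1, 7, 2],
                 [3, 2, 1, 1]], dtype=float)

pooled = max_pool_2x2(fmap)
stacked = np.stack([fmap, np.ones_like(fmap)], axis=-1)  # two channels
averaged = global_average_pool(stacked)
print(pooled)
print(averaged)
```

Max pooling halves each spatial dimension, while global average pooling collapses every feature map to a single value, giving one number per filter for the output layer.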

Our output layer will have one node per class (`num_labels`), which is 6 for the note classes in this dataset (A, B, D, E, EH and G). The activation for our output layer is `softmax`. Softmax makes the output sum up to 1 so the output can be interpreted as probabilities. The model will then make its prediction based on which option has the highest probability.
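
For reference, softmax itself is only a few lines. This is a standalone sketch with made-up raw scores, not the Keras implementation.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtracting the max avoids overflow in exp
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1])  # made-up raw scores for three classes
probs = softmax(logits)
predicted_index = int(np.argmax(probs))
print(probs, predicted_index)
```

The largest raw score always receives the largest probability, so taking the `argmax` of the softmax output gives the predicted class.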

In [19]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D

num_rows = 40
num_columns = 365
num_channels = 1

x_train = x_train.reshape(x_train.shape[0], num_rows, num_columns, num_channels)
x_test = x_test.reshape(x_test.shape[0], num_rows, num_columns, num_channels)

num_labels = yy.shape[1]
filter_size = 2

# Construct model 
model = Sequential()
model.add(Conv2D(filters=16, kernel_size=2, input_shape=(num_rows, num_columns, num_channels), activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))

model.add(Conv2D(filters=32, kernel_size=2, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))

model.add(Conv2D(filters=64, kernel_size=2, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))

model.add(Conv2D(filters=128, kernel_size=2, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))
model.add(GlobalAveragePooling2D())

model.add(Dense(num_labels, activation='softmax')) 

### Compiling the model 

For compiling our model, we will use the same three parameters as the previous model: the `categorical_crossentropy` loss, the `accuracy` metric and the `adam` optimizer. 

In [20]:
# Compile the model
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam') 
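
To illustrate what the `categorical_crossentropy` loss measures, here is a plain-Python sketch for a single sample. The `eps` guard against `log(0)` and the example probabilities are assumptions for illustration.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    # Only the term for the true class is non-zero because y_true is one-hot.
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

y_true = [1, 0, 0]               # the sample belongs to the first class
confident = [0.7, 0.2, 0.1]      # made-up prediction favouring the true class
unsure = [0.3, 0.4, 0.3]         # made-up prediction favouring a wrong class

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print(loss_confident, loss_unsure)
```

The loss falls as more probability is assigned to the true class, which is what the optimizer drives towards during training.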

In [21]:
# Display model architecture summary 
model.summary()

# Calculate pre-training accuracy 
score = model.evaluate(x_test, y_test, verbose=1)
accuracy = 100*score[1]

print("Pre-training accuracy: %.4f%%" % accuracy)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 39, 364, 16)       80        
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 19, 182, 16)       0         
_________________________________________________________________
dropout (Dropout)            (None, 19, 182, 16)       0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 18, 181, 32)       2080      
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 9, 90, 32)         0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 9, 90, 32)         0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 8, 89, 64)         8

### Training 

Here we will train the model. As training a CNN can take a significant amount of time, we will start with a low number of epochs and a low batch size. If we can see from the output that the model is converging, we will increase both numbers.  

In [22]:
from keras.callbacks import ModelCheckpoint 
from datetime import datetime 

#num_epochs = 12
#num_batch_size = 128

num_epochs = 72
num_batch_size = 256

checkpointer = ModelCheckpoint(
    filepath='saved_models/weights.best.basic_cnn.hdf5', 
    verbose=1, 
    save_best_only=True
)

start = datetime.now()
model.fit(
    x_train, 
    y_train, 
    batch_size=num_batch_size, 
    epochs=num_epochs, 
    validation_data=(x_test, y_test), 
    callbacks=[checkpointer], 
    verbose=1
)


duration = datetime.now() - start
print("Training completed in time: ", duration)

Epoch 1/72
Epoch 00001: val_loss improved from inf to 3.96389, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 2/72
Epoch 00002: val_loss did not improve from 3.96389
Epoch 3/72
Epoch 00003: val_loss did not improve from 3.96389
Epoch 4/72
Epoch 00004: val_loss did not improve from 3.96389
Epoch 5/72
Epoch 00005: val_loss did not improve from 3.96389
Epoch 6/72
Epoch 00006: val_loss did not improve from 3.96389
Epoch 7/72
Epoch 00007: val_loss did not improve from 3.96389
Epoch 8/72
Epoch 00008: val_loss improved from 3.96389 to 3.87149, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 9/72
Epoch 00009: val_loss improved from 3.87149 to 3.68278, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 10/72
Epoch 00010: val_loss did not improve from 3.68278
Epoch 11/72
Epoch 00011: val_loss did not improve from 3.68278
Epoch 12/72
Epoch 00012: val_loss did not improve from 3.68278
Epoch 13/72
Epoch 00013: val_loss improved from 3.68278 to 3.55641, sav

Epoch 00027: val_loss improved from 1.34377 to 1.22836, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 28/72
Epoch 00028: val_loss improved from 1.22836 to 1.13828, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 29/72
Epoch 00029: val_loss improved from 1.13828 to 1.07306, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 30/72
Epoch 00030: val_loss improved from 1.07306 to 1.02947, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 31/72
Epoch 00031: val_loss improved from 1.02947 to 1.00270, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 32/72
Epoch 00032: val_loss improved from 1.00270 to 0.98459, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 33/72
Epoch 00033: val_loss improved from 0.98459 to 0.97434, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 34/72
Epoch 00034: val_loss improved from 0.97434 to 0.97145, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 35/72


Epoch 54/72
Epoch 00054: val_loss improved from 0.67282 to 0.64866, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 55/72
Epoch 00055: val_loss improved from 0.64866 to 0.62656, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 56/72
Epoch 00056: val_loss improved from 0.62656 to 0.60666, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 57/72
Epoch 00057: val_loss improved from 0.60666 to 0.58901, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 58/72
Epoch 00058: val_loss improved from 0.58901 to 0.57191, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 59/72
Epoch 00059: val_loss improved from 0.57191 to 0.55643, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 60/72
Epoch 00060: val_loss improved from 0.55643 to 0.54071, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 61/72
Epoch 00061: val_loss improved from 0.54071 to 0.52424, saving model to saved_models\weights.best.basic_cnn.hdf5
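
The "improved" and "did not improve" messages in the log come from the `save_best_only=True` setting, which by default tracks `val_loss`. The underlying bookkeeping can be sketched as follows; the function name and the loss values are hypothetical, not taken from the run above.

```python
def epochs_that_save(val_losses):
    """Return the 1-based epochs at which a save-best-only checkpoint would be written."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # A new best validation loss: remember it and write the checkpoint.
            best = loss
            saved.append(epoch)
    return saved

val_losses = [3.96, 4.10, 4.02, 3.87, 3.68, 3.70]  # hypothetical values
print(epochs_that_save(val_losses))  # [1, 4, 5]
```

Only epochs that lower the best validation loss seen so far overwrite the saved weights, so the file on disk always holds the best model by that measure.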


### Test the model 

Here we will review the accuracy of the model on both the training and test data sets. 

In [23]:
# Evaluating the model on the training and testing set
score = model.evaluate(x_train, y_train, verbose=0)
print("Training Accuracy: ", score[1])

score = model.evaluate(x_test, y_test, verbose=0)
print("Testing Accuracy: ", score[1])

Training Accuracy:  1.0
Testing Accuracy:  1.0


The Training and Testing accuracy scores are both 1.0 in this run. 

There is therefore no gap between the Training and Test scores, so the model shows no sign of overfitting on this split. 

### Predictions  

Here we will modify our previous method for testing the model's predictions on a specified audio .wav file. 

In [30]:
def print_prediction(file_name):
    prediction_feature = extract_features(file_name) 
    prediction_feature = prediction_feature.reshape(1, num_rows, num_columns, num_channels)

    # model.predict returns the softmax probabilities for each class
    predicted_proba = model.predict(prediction_feature)[0]

    # The predicted class is the one with the highest probability
    predicted_class = le.inverse_transform([np.argmax(predicted_proba)]) 
    print("The predicted class is:", predicted_class[0], '\n') 

    # Print the probability assigned to every class
    for i, proba in enumerate(predicted_proba): 
        category = le.inverse_transform(np.array([i]))
        print(category[0], "\t\t : ", format(proba, '.32f') )

### Validation 

#### Test with sample data 

As before we will verify the predictions using a subsection of the sample audio files we explored in the first notebook. We expect the bulk of these to be classified correctly. 

In [35]:
# Class: A

filename = os.path.join(DATA_DIR, "A", "A1.wav")
print_prediction(filename) 

274 365 (40, 91)
The predicted class is: A 

A 		 :  0.44279378652572631835937500000000
B 		 :  0.03873376920819282531738281250000
D 		 :  0.09698215126991271972656250000000
E 		 :  0.13421411812305450439453125000000
EH 		 :  0.24673040211200714111328125000000
G 		 :  0.04054575040936470031738281250000
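
A ranked view of the per-class probabilities makes close decisions easier to spot. The sketch below uses the probabilities from the `A1.wav` output above, rounded to four decimals; the variable names are assumptions.

```python
import numpy as np

classes = np.array(["A", "B", "D", "E", "EH", "G"])
# Rounded from the A1.wav output above.
probs = np.array([0.4428, 0.0387, 0.0970, 0.1342, 0.2467, 0.0405])

# argsort is ascending, so reverse it to list the most likely class first.
order = np.argsort(probs)[::-1]
ranked = [(str(classes[i]), float(probs[i])) for i in order]

for name, p in ranked:
    print(f"{name:>2}: {p:.2%}")
```

The top two entries show how close the runner-up was to the predicted class.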


In [52]:
# Class: B

filename = os.path.join(DATA_DIR, "B", "B1.wav")
print_prediction(filename) 

The predicted class is: drilling 

air_conditioner 		 :  0.00070991273969411849975585937500
car_horn 		 :  0.00000001777174851724794280016795
children_playing 		 :  0.00001405069633619859814643859863
dog_bark 		 :  0.00000047111242906794359441846609
drilling 		 :  0.99598699808120727539062500000000
engine_idling 		 :  0.00000354658413925790227949619293
gun_shot 		 :  0.00000003223207656333215709310025
jackhammer 		 :  0.00052903906907886266708374023438
siren 		 :  0.00000098340262866258854046463966
street_music 		 :  0.00275487988255918025970458984375


In [53]:
# Class: D

filename = os.path.join(DATA_DIR, "D", "D1.wav")
print_prediction(filename) 

The predicted class is: street_music 

air_conditioner 		 :  0.00011496015213197097182273864746
car_horn 		 :  0.00079288281267508864402770996094
children_playing 		 :  0.01791538484394550323486328125000
dog_bark 		 :  0.00257923710159957408905029296875
drilling 		 :  0.00007904539961600676178932189941
engine_idling 		 :  0.00006061193562345579266548156738
gun_shot 		 :  0.00000000007482268277181347571059
jackhammer 		 :  0.00000457825990451965481042861938
siren 		 :  0.00922307930886745452880859375000
street_music 		 :  0.96923023462295532226562500000000


In [34]:
# Class: E

filename = os.path.join(DATA_DIR, "E", "E1.wav")
print_prediction(filename) 

240 365 (40, 125)
The predicted class is: E 

A 		 :  0.00500741647556424140930175781250
B 		 :  0.00262584001757204532623291015625
D 		 :  0.00001245124531124019995331764221
E 		 :  0.99197739362716674804687500000000
EH 		 :  0.00026903860270977020263671875000
G 		 :  0.00010790056694531813263893127441


#### Observations 

For the A and E samples, whose outputs list the guitar note classes, the model predicts the correct class. 

The E sample is classified with about 99% confidence. The A sample is also classified correctly, but with only about 44% confidence, and EH is the closest alternative at about 25%.  

### Other audio

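Again we will further validate our model, this time on samples from the remaining classes, EH and G. These files are read from `DATA_DIR`, so they come from the same dataset that was split into training and test sets rather than from new recordings. 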

In [33]:
# Class: EH

filename = os.path.join(DATA_DIR, "EH", "E1.wav")
print_prediction(filename) 

278 365 (40, 87)
The predicted class is: EH 

A 		 :  0.21721151471138000488281250000000
B 		 :  0.03729132190346717834472656250000
D 		 :  0.06721183657646179199218750000000
E 		 :  0.05496610701084136962890625000000
EH 		 :  0.56571435928344726562500000000000
G 		 :  0.05760479718446731567382812500000


In [32]:
# Class: G

filename = os.path.join(DATA_DIR, "G", "G1.wav")
print_prediction(filename) 

249 365 (40, 116)
The predicted class is: G 

A 		 :  0.04788042232394218444824218750000
B 		 :  0.06000442057847976684570312500000
D 		 :  0.13914313912391662597656250000000
E 		 :  0.04142617434263229370117187500000
EH 		 :  0.10056187212467193603515625000000
G 		 :  0.61098396778106689453125000000000


#### Observations 

The performance of our final model is good on these samples: it predicts the correct class for both, with about 57% confidence for EH and about 61% for G. 