In [1]:
%load_ext autoreload
%autoreload 2

# Classifying Music Note sounds using Few Shot Deep Learning

credit: 

### Load Preprocessed data 

#### Feature Extraction refinement 

In the previous feature extraction stage, the MFCC vectors varied in size across the audio files, depending on each sample's duration. 

However, CNNs require a fixed size for all inputs. To overcome this we will zero pad the output vectors to make them all the same size. 
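
As a minimal sketch of the zero padding idea, the snippet below pads a toy matrix along its time axis. The helper name `pad_to_width` and the toy shapes are assumptions made for illustration. Truncating inputs that are longer than the target width is also an assumption: the notebook's own `extract_features` only pads.

```python
import numpy as np

def pad_to_width(mat, width):
    """Zero pad (or truncate) a 2-D array along axis 1 to exactly `width` columns."""
    n_cols = mat.shape[1]
    if n_cols >= width:
        # Hypothetical choice: drop the frames beyond the target width
        return mat[:, :width]
    # Pad only on the right-hand side of the time axis
    return np.pad(mat, pad_width=((0, 0), (0, width - n_cols)), mode='constant')

short_clip = np.ones((40, 100))  # stand-in for the MFCCs of a short clip
long_clip = np.ones((40, 400))   # stand-in for the MFCCs of a long clip

print(pad_to_width(short_clip, 365).shape)  # (40, 365)
print(pad_to_width(long_clip, 365).shape)   # (40, 365)
```

After padding, every sample has the same shape, so the samples can be stacked into a single array and fed to the network.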

In [2]:
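import numpy as np
import librosa

# Number of MFCC frames that every sample is padded to
max_pad_len = 365

def extract_features(file_name):
    """Return a (40, max_pad_len) MFCC matrix for file_name, or None on failure."""
    try:
        audio, sample_rate = librosa.load(file_name, res_type='kaiser_fast') 
        mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
        # Zero pad along the time axis. A clip longer than max_pad_len frames
        # gives a negative pad_width, which np.pad rejects; the except block reports it.
        pad_width = max_pad_len - mfccs.shape[1]
        mfccs = np.pad(mfccs, pad_width=((0, 0), (0, pad_width)), mode='constant')
        
    except Exception as e:
        print("Error encountered while parsing file: ", file_name, e)
        return None 
     
    return mfccs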

In [3]:
# Load various imports 
import pandas as pd
import os
import librosa

# Set the path to the guitar sample dataset 
DATA_DIR = os.path.join("data", "guitar_sample")

# feature list
features = []

# Iterate through each sound file and extract the features 
for folder in os.listdir(DATA_DIR):
    for file in os.listdir(os.path.join(DATA_DIR, folder)):
        class_label = folder
        file_name = os.path.join(DATA_DIR, folder, file)
        
        data = extract_features(file_name)
        features.append([data, class_label])

# Convert into a pandas DataFrame 
featuresdf = pd.DataFrame(features, columns=['feature','class_label'])

print('Finished feature extraction from ', len(featuresdf), ' files') 

Finished feature extraction from  58  files


In [4]:
from itertools import combinations

def prepare_train_pair(X, y, label):
    indices = np.array(list(range(len(y))))
    
    similar_indices = indices[y == label]
    dissimilar_indices = indices[y != label]
    
    np.random.shuffle(dissimilar_indices)
    
    similar_indices_pair = []
    dissimilar_indices_pair = []
    
    it = iter(dissimilar_indices)
    size = 0
    
    for i, j in combinations(similar_indices, 2):
        size += 1
        similar_indices_pair.append([i, j])
        dissimilar_indices_pair.append([i, next(it)])
    
    # get the dimension of data based on combination
    dim = tuple([2, 2*size] + list(X.shape[1:]))
    
    # build the sim and dis-sim matrix
    new_X = np.empty(dim, dtype=float)
    new_y = np.concatenate([np.ones(size, dtype=float), np.zeros(size, dtype=float)])
    
    similar_indices_pair = np.array(similar_indices_pair)
    dissimilar_indices_pair = np.array(dissimilar_indices_pair)
    
    new_X[0, :size], new_X[1, :size] = X[similar_indices_pair[:, 0]], X[similar_indices_pair[:, 1]]
    new_X[0, size:], new_X[1, size:] = X[dissimilar_indices_pair[:, 0]], X[dissimilar_indices_pair[:, 1]]
    
    all_indices = np.array(list(range(2*size)))
    np.random.shuffle(all_indices)
    
    return new_X[:, all_indices], new_y[all_indices]

In [5]:
# Convert features and corresponding classification labels into numpy arrays
X = np.array(featuresdf.feature.tolist())
y = np.array(featuresdf.class_label.tolist())

# prepare data set pairs (similar and dissimilar)
X, y = prepare_train_pair(X, y, "G")

n, datasize, num_rows, num_columns = X.shape

# reshape for training
X = X.reshape(n, datasize, num_rows, num_columns, 1)

# split the dataset 
from sklearn.model_selection import train_test_split 

x_train_indices, x_test_indices, y_train, y_test = train_test_split(np.array(list(range(datasize))), y, test_size=0.2, random_state = 42)
x_train, x_test = X[:, x_train_indices], X[:, x_test_indices]

In [6]:
X.shape, y.shape

((2, 72, 40, 365, 1), (72,))

In [7]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((2, 57, 40, 365, 1), (2, 15, 40, 365, 1), (57,), (15,))

### Convolutional Neural Network (CNN) model architecture 


We will build our model as a Convolutional Neural Network (CNN), again using Keras with a TensorFlow backend. 

Again we will use a `Sequential` model, starting with a simple architecture that consists of four `Conv2D` convolution layers, with a `Dense` layer as the final output layer. 

The convolution layers are designed for feature detection. Each one works by sliding a filter window over the input, multiplying the filter element-wise with the values under the window, summing the products, and storing the result in a feature map. This operation is known as a convolution. 
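
To make the sliding window concrete, here is a small NumPy sketch of a single-channel convolution with a 2x2 filter. The function name `conv2d_valid` and the toy values are assumptions for illustration. Like Keras' `Conv2D`, it does not flip the kernel (strictly speaking a cross-correlation), and it uses no padding and a stride of 1.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` and sum the element-wise products at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            window = image[r:r + kh, c:c + kw]
            feature_map[r, c] = np.sum(window * kernel)
    return feature_map

image = np.arange(12, dtype=float).reshape(3, 4)   # toy 3x4 input
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])                   # toy 2x2 filter

result = conv2d_valid(image, kernel)
print(result.shape)  # (2, 3)
print(result)        # every entry is -5.0 for this input
```

A real layer learns the filter values during training and applies many such filters in parallel, one per output channel.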


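The `filters` parameter specifies the number of filters (output channels) in each layer. In `build_base_network2` the layers increase in size from 16 to 32, 64 and 128, while the `kernel_size` parameter specifies the size of the kernel window, which in this case is 2, resulting in a 2x2 filter matrix. (`build_base_network`, the variant used when the model is built below, keeps 32 filters with a 3x3 kernel in every convolution layer.) 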

The first layer receives an input of shape (40, 365, 1), where 40 is the number of MFCCs, 365 is the number of frames after padding, and 1 is the single channel, since the audio is mono. 
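
The following sketch traces how the two spatial dimensions shrink through four blocks of a 2x2 convolution followed by 2x2 max pooling, starting from the 40 x 365 input that the notebook's arrays actually have (see the `X.shape` output above). It assumes the Keras defaults of no padding and stride 1 for the convolution, and non-overlapping pooling windows. The helper names are hypothetical.

```python
def conv_out(size, kernel):
    # 'valid' convolution with stride 1 loses kernel - 1 positions
    return size - kernel + 1

def pool_out(size, pool):
    # non-overlapping pooling; an incomplete trailing window is dropped
    return size // pool

rows, cols = 40, 365
shapes = []
for _ in range(4):
    rows, cols = conv_out(rows, 2), conv_out(cols, 2)
    rows, cols = pool_out(rows, 2), pool_out(cols, 2)
    shapes.append((rows, cols))

print(shapes)  # [(19, 182), (9, 90), (4, 44), (1, 21)]
```

The MFCC axis collapses to a single row after the fourth block, so this layout could not take a fifth convolution of the same kind without padding.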

The activation function we will be using for our convolutional layers is `ReLU` which is the same as our previous model. We will use a smaller `Dropout` value of 20% on our convolutional layers. 

Each convolutional layer has an associated pooling layer of `MaxPooling2D` type, and the output of the final convolutional block is passed through a `GlobalAveragePooling2D` layer. The pooling layers reduce the dimensionality of the model (by reducing the parameters and subsequent computation requirements), which serves to shorten the training time and reduce overfitting. Max pooling takes the maximum value in each window, while global average pooling takes the average of each feature map, which is suitable for feeding into our `Dense` layers.  
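
Both pooling types can be illustrated on a single toy feature map with NumPy. This is only a sketch: the helper name `max_pool_2x2` and the values are made up for illustration, and a real layer applies the same operation to every channel.

```python
import numpy as np

def max_pool_2x2(fmap):
    """Take the maximum of each non-overlapping 2x2 window of a 2-D array."""
    h, w = fmap.shape[0] // 2, fmap.shape[1] // 2
    trimmed = fmap[:h * 2, :w * 2]               # drop an odd trailing row/column
    return trimmed.reshape(h, 2, w, 2).max(axis=(1, 3))

fmap = np.array([[1.0, 2.0, 5.0, 6.0],
                 [3.0, 4.0, 7.0, 8.0],
                 [9.0, 10.0, 13.0, 14.0],
                 [11.0, 12.0, 15.0, 16.0]])

pooled = max_pool_2x2(fmap)
global_avg = fmap.mean()   # global average pooling reduces the whole map to one number

print(pooled)      # [[ 4.  8.]
                   #  [12. 16.]]
print(global_avg)  # 8.5
```

Max pooling keeps a smaller map, while global average pooling yields one value per feature map, which is why it produces a flat vector for the `Dense` layers.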

Unlike the classifier in the previous model, the base network here does not end in a `softmax` layer over class labels. Its final `Dense` layer has 128 nodes and outputs an embedding vector for each input. Both inputs of a pair pass through the same base network, and the model output is the Euclidean distance between the two embeddings: a small distance indicates that the pair belongs to the same class, a large distance that it does not.

In [8]:
from keras import backend as K
from keras.optimizers import Adam
from keras.models import Sequential, Model
from keras.layers import Input, Conv2D, MaxPooling2D, Dense, Dropout, Flatten, Lambda, GlobalAveragePooling2D

def build_base_network2(input_shape):
    filter_size = 2

    # Construct model 
    model = Sequential()
    model.add(Conv2D(filters=16, kernel_size=2, input_shape=input_shape, activation='relu'))
    model.add(MaxPooling2D(pool_size=2))
    model.add(Dropout(0.2))

    model.add(Conv2D(filters=32, kernel_size=2, activation='relu'))
    model.add(MaxPooling2D(pool_size=2))
    model.add(Dropout(0.2))

    model.add(Conv2D(filters=64, kernel_size=2, activation='relu'))
    model.add(MaxPooling2D(pool_size=2))
    model.add(Dropout(0.2))

    model.add(Conv2D(filters=128, kernel_size=2, activation='relu'))
    model.add(MaxPooling2D(pool_size=2))
    model.add(Dropout(0.2))
    
    model.add(GlobalAveragePooling2D())
    model.add(Dense(256))
    
    model.add(Dropout(0.1))
    model.add(Dense(128))
    
    return model
    
def build_base_network(input_shape):
    model = Sequential()
    
    model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
    model.add(Conv2D(32, kernel_size=(3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    
    model.add(Conv2D(32, kernel_size=(3, 3), activation='relu'))
    model.add(Conv2D(32, kernel_size=(3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    
    model.add(Flatten())
    model.add(Dense(1024))
    model.add(Dropout(0.1))
    
    model.add(Dense(256))
    model.add(Dropout(0.1))
    
    model.add(Dense(128))
    return model


def euclidean_distance(vects):
    x, y = vects
    return K.sqrt(K.sum(K.square(x - y), axis=1, keepdims=True))

def eucl_dist_output_shape(shapes):
    shape1, shape2 = shapes
    return (shape1[0], 1)

def distance(emb1, emb2):
    return np.sum(np.square(emb1 - emb2))


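def predict(afs, y, threshold=0.5):
    """Print per-pair predictions and the accuracy; relies on the global `model`."""
    print(afs.shape)
    acc = 0
    preds = model.predict([afs[0], afs[1]])
    for i in range(len(preds)):
        p = preds[i][0]
        # A distance below the threshold is predicted as a similar pair (label 1)
        z = int(p < threshold)
        if z == y[i]:
            acc += 1
        print(z, y[i], p)
    print('acc = {}%'.format(acc*100/len(preds)))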

def contrastive_loss(y_true, y_pred):
    margin = 1
    return K.mean(y_true * K.square(y_pred) + (1 - y_true) * K.square(K.maximum(margin - y_pred, 0)))
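
The contrastive loss above can be checked by hand with a NumPy version of the same formula. This is a sketch for illustration only: the function name `contrastive_loss_np` and the toy distances are assumptions, and the labels follow the notebook's convention of 1 for a similar pair and 0 for a dissimilar pair.

```python
import numpy as np

def contrastive_loss_np(y_true, dist, margin=1.0):
    """NumPy version of the contrastive loss formula, for inspection only."""
    # Similar pairs (y_true == 1) are penalised by their squared distance
    similar_term = y_true * np.square(dist)
    # Dissimilar pairs (y_true == 0) are penalised only while closer than the margin
    dissimilar_term = (1 - y_true) * np.square(np.maximum(margin - dist, 0))
    return np.mean(similar_term + dissimilar_term)

y_true = np.array([1.0, 1.0, 0.0, 0.0])
dist = np.array([0.1, 2.0, 0.5, 3.0])

# Per-pair terms: 0.01, 4.0, 0.25, 0.0
print(contrastive_loss_np(y_true, dist))  # 1.065
```

The similar pair at distance 2.0 dominates the loss, while the dissimilar pair at distance 3.0 contributes nothing because it already lies beyond the margin.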


### Compiling the model 

We build the two-input network by passing both inputs through the same base network, and compile it with `contrastive_loss` and the `Adam` optimizer: 

In [9]:
input_dim = x_train.shape[2:]

audio_a = Input(shape=input_dim)
audio_b = Input(shape=input_dim)

base_network = build_base_network(input_dim)

feat_vecs_a = base_network(audio_a)
feat_vecs_b = base_network(audio_b)

difference = Lambda(euclidean_distance, output_shape=eucl_dist_output_shape)([feat_vecs_a, feat_vecs_b])

# initialize training params
epochs = 64
batch_size = 24
optimizer = Adam() #RMSprop()

# initialize the network
model = Model(inputs=[audio_a, audio_b], outputs=difference)
model.compile(loss=contrastive_loss, optimizer=optimizer)

In [10]:
# Display model architecture summary 
model.summary()

Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 40, 365, 1)] 0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, 40, 365, 1)] 0                                            
__________________________________________________________________________________________________
sequential (Sequential)         (None, 128)          20509472    input_1[0][0]                    
                                                                 input_2[0][0]                    
__________________________________________________________________________________________________
lambda (Lambda)                 (None, 1)            0           sequential[0][0]      

In [11]:
predict(x_train, y_train)

(2, 57, 40, 365, 1)
0 0.0 16.871254
0 1.0 5.0769644
0 0.0 15.167051
0 0.0 17.08545
0 0.0 18.611961
0 0.0 16.556719
0 1.0 6.293456
0 0.0 17.742388
0 0.0 16.729364
0 1.0 14.330191
0 1.0 14.707417
0 1.0 31.302975
0 1.0 28.512344
0 1.0 13.650093
0 0.0 16.04457
0 1.0 7.4985375
0 0.0 16.458845
0 1.0 5.940324
0 0.0 17.554096
0 1.0 31.205189
0 1.0 6.166374
0 1.0 5.940142
0 1.0 14.611991
0 0.0 17.171213
0 1.0 7.0209956
0 0.0 19.273155
0 0.0 15.485283
0 1.0 32.113766
0 1.0 6.4819913
0 0.0 41.84114
0 0.0 17.051504
0 1.0 31.418772
0 1.0 31.099882
0 1.0 30.089296
0 0.0 17.512714
0 0.0 17.983402
0 1.0 4.478962
0 0.0 15.453634
0 1.0 3.98404
0 0.0 18.87347
0 0.0 15.574558
0 1.0 15.25096
0 0.0 17.033222
0 1.0 9.094498
0 1.0 6.2206836
0 0.0 16.283693
0 1.0 14.579174
0 0.0 15.614336
0 0.0 16.355484
0 0.0 19.669628
0 1.0 6.331181
0 1.0 4.206341
0 0.0 18.323875
0 1.0 5.301945
0 0.0 17.164194
0 0.0 18.651783
0 1.0 4.5669675
acc = 49.12280701754386%


In [12]:
predict(x_test, y_test)

(2, 15, 40, 365, 1)
0 1.0 2.9108527
0 1.0 30.424143
0 0.0 25.027956
0 1.0 4.3869987
0 0.0 16.149446
0 0.0 16.139362
0 0.0 15.685213
0 1.0 5.8893046
0 0.0 14.78619
0 0.0 26.74393
0 1.0 8.817082
0 1.0 6.2021866
0 0.0 22.398136
0 0.0 16.2329
0 1.0 16.078587
acc = 53.333333333333336%


### Training 

Here we will train the model. As training a CNN can take a significant amount of time, we will start with a low number of epochs and a low batch size. If we can see from the output that the model is converging, we will increase both numbers.  

In [13]:
from keras.callbacks import ModelCheckpoint 
from time import time

checkpointer = ModelCheckpoint(
    filepath='saved_models/weights.best.basic_cnn.hdf5', 
    verbose=1, 
    save_best_only=True
)

start = time()
model.fit(
    [x_train[0], x_train[1]], 
    y_train, 
    batch_size=batch_size, 
    epochs=epochs, 
    validation_split=0.2,
    callbacks=[checkpointer], 
    verbose=1
)


duration = (time() - start)/60
print("Training completed in time: ", duration, "min")

Epoch 1/64
Epoch 00001: val_loss improved from inf to 101.66930, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 2/64
Epoch 00002: val_loss improved from 101.66930 to 35.64974, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 3/64
Epoch 00003: val_loss improved from 35.64974 to 10.92859, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 4/64
Epoch 00004: val_loss improved from 10.92859 to 4.82101, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 5/64
Epoch 00005: val_loss improved from 4.82101 to 2.57766, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 6/64
Epoch 00006: val_loss improved from 2.57766 to 1.55828, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 7/64
Epoch 00007: val_loss improved from 1.55828 to 1.18983, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 8/64
Epoch 00008: val_loss improved from 1.18983 to 0.96244, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoc

Epoch 31/64
Epoch 00031: val_loss improved from 0.03131 to 0.02643, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 32/64
Epoch 00032: val_loss improved from 0.02643 to 0.02384, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 33/64
Epoch 00033: val_loss improved from 0.02384 to 0.02274, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 34/64
Epoch 00034: val_loss improved from 0.02274 to 0.02265, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 35/64
Epoch 00035: val_loss improved from 0.02265 to 0.02264, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 36/64
Epoch 00036: val_loss improved from 0.02264 to 0.02023, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 37/64
Epoch 00037: val_loss improved from 0.02023 to 0.01845, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 38/64
Epoch 00038: val_loss improved from 0.01845 to 0.01766, saving model to saved_models\weights.best.basic_cnn.hdf5


Epoch 63/64
Epoch 00063: val_loss improved from 0.00513 to 0.00498, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 64/64
Epoch 00064: val_loss improved from 0.00498 to 0.00490, saving model to saved_models\weights.best.basic_cnn.hdf5
Training completed in time:  3.821002121766408 min


### Test the model 

Here we will review the accuracy of the model on both the training and test data sets. 
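
Because the model outputs a distance rather than a class probability, accuracy depends on the threshold that turns a distance into a similar or dissimilar decision. The sketch below applies the same rule as `predict` to the first six distances and labels taken from the test output further down. The function name `threshold_accuracy` is hypothetical.

```python
# First six (distance, label) pairs from the printed test output after training
distances = [0.107383825, 0.069897026, 0.99296767, 0.10461344, 7.5714693, 0.56158644]
labels = [1, 1, 0, 1, 0, 0]

def threshold_accuracy(distances, labels, threshold=0.5):
    """Percentage of pairs whose thresholded distance matches the label."""
    # A distance below the threshold is predicted as a similar pair (label 1)
    preds = [int(d < threshold) for d in distances]
    correct = sum(int(p == t) for p, t in zip(preds, labels))
    return 100.0 * correct / len(labels)

print(threshold_accuracy(distances, labels))       # 100.0
print(threshold_accuracy(distances, labels, 0.6))  # lower: the pair at 0.56 flips to "similar"
```

The dissimilar pair at 0.56 sits close to the default threshold of 0.5, so the reported accuracy is sensitive to that choice.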

In [14]:
predict(x_train, y_train)

(2, 57, 40, 365, 1)
0 0.0 1.0872798
1 1.0 0.08608175
0 0.0 0.689456
0 0.0 5.764269
0 0.0 1.3669777
0 0.0 1.44807
1 1.0 0.06697221
0 0.0 1.1745834
0 0.0 1.1764838
1 1.0 0.072153926
1 1.0 0.090201944
1 1.0 0.0875443
1 1.0 0.06289607
1 1.0 0.044211827
0 0.0 1.4250283
1 1.0 0.086301595
0 0.0 1.5424693
1 1.0 0.090231456
0 0.0 8.15467
1 1.0 0.07098756
1 1.0 0.09667267
1 1.0 0.052663837
1 1.0 0.054785863
0 0.0 1.8447478
1 1.0 0.04301049
0 0.0 1.3944316
0 0.0 1.0955516
1 1.0 0.06272768
1 1.0 0.061319906
0 0.0 2.3040001
0 0.0 1.0832397
1 1.0 0.063796826
1 1.0 0.08047636
1 1.0 0.063587815
0 0.0 6.123921
0 0.0 6.2126627
1 1.0 0.05811749
0 0.0 0.9400404
1 1.0 0.05757321
0 0.0 1.3812172
0 0.0 7.6590877
1 1.0 0.06554505
0 0.0 1.6887981
1 1.0 0.067737326
1 1.0 0.059289437
0 0.0 7.082174
1 1.0 0.10153466
0 0.0 1.0210713
0 0.0 1.325704
0 0.0 1.594081
1 1.0 0.119093716
1 1.0 0.1112123
0 0.0 1.1813018
1 1.0 0.109268576
0 0.0 1.3959684
0 0.0 1.2058371
1 1.0 0.10000919
acc = 100.0%


In [15]:
predict(x_test, y_test)

(2, 15, 40, 365, 1)
1 1.0 0.107383825
1 1.0 0.069897026
0 0.0 0.99296767
1 1.0 0.10461344
0 0.0 7.5714693
0 0.0 0.56158644
0 0.0 0.8223759
1 1.0 0.06732982
0 0.0 0.8940802
0 0.0 14.450246
1 1.0 0.06707755
1 1.0 0.10241078
0 0.0 5.596364
0 0.0 1.5377754
1 1.0 0.07242619
acc = 100.0%
