# Audio Recognition Using Siamese Network

In the last tutorial, we saw how to use Siamese networks to recognize a face. Now we will see how to use them to recognize audio. We will train our network to differentiate between the sound of a dog and the sound of a cat. The cats and dogs audio dataset can be downloaded from https://www.kaggle.com/mmoreaux/audio-cats-and-dogs#cats_dogs.zip

Once we have downloaded the data, we split it into three folders: Dogs, Sub_dogs, and Cats. In Dogs and Sub_dogs we place the dog barking audio, and in Cats we place the cat audio. The objective of our network is to recognize whether an audio clip is a dog's bark or some other sound. Since a Siamese network takes its input as a pair, we select one clip from Dogs and one from Sub_dogs and mark them as a genuine pair, and we select one clip from Dogs and one from Cats and mark them as an impostor pair. That is, (Dogs, Sub_dogs) is a genuine pair and (Dogs, Cats) is an impostor pair.

Now we will see, step by step, how to train our Siamese network to recognize whether the audio is a dog's bark or some other sound. First, we load all the necessary libraries:

In [1]:
#basic imports
import glob
import IPython
from random import randint

#data processing
import librosa
import numpy as np

#modelling
from sklearn.model_selection import train_test_split

from keras import backend as K
from keras.layers import Activation
from keras.layers import Input, Lambda, Dense, Dropout, Flatten
from keras.models import Model
from keras.optimizers import RMSprop

Using TensorFlow backend.


Before going ahead, we load and listen to the audio clips:

In [2]:
IPython.display.Audio("data/audio/Dogs/dog_barking_0.wav")

In [3]:
IPython.display.Audio("data/audio/Cats/cat_13.wav")

So, how can we feed this raw audio to our network? How can we extract meaningful features from it? Since neural networks accept only numerical input, we need to convert our audio to a feature vector. There are several mechanisms for generating embeddings from audio. One popular mechanism is Mel-Frequency Cepstral Coefficients (MFCC).

We use MFCC for vectorizing our audio. MFCCs represent the short-term power spectrum of a sound. They are obtained by taking a linear cosine transform of the log power spectrum on a nonlinear mel scale of frequency. To learn more about MFCC, check this tutorial (http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/).
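
The MFCC steps described above can be sketched in plain NumPy. This is only an illustration of the idea, not what librosa does internally: the function names (`toy_mfcc`, `mel_filterbank`, `dct_ii`), the HTK-style mel formula, and the frame, hop, and filter counts are all assumptions chosen for the example, and the resulting numbers will differ from librosa's output.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel formula (an assumption for this sketch)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    # inverse of hz_to_mel
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # triangular filters whose centres are equally spaced on the mel scale
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, centre, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (centre - left)
        falling = (right - fft_freqs) / (right - centre)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def dct_ii(x, n_out):
    # unnormalised DCT-II along axis 0, keeping the first n_out coefficients
    n = x.shape[0]
    k = np.arange(n_out)[:, None]
    t = np.arange(n)[None, :]
    return np.cos(np.pi * k * (2 * t + 1) / (2 * n)) @ x

def toy_mfcc(signal, sr, n_fft=512, hop=256, n_mels=40, n_mfcc=20):
    # 1. cut the signal into overlapping windowed frames
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([signal[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    # 2. short-term power spectrum of every frame
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # 3. energy in each mel band
    mel_energy = mel_filterbank(n_mels, n_fft, sr) @ power.T
    # 4. log of the mel energies (small constant avoids log(0))
    log_mel = np.log(mel_energy + 1e-10)
    # 5. cosine transform, giving an array of shape (n_mfcc, n_frames)
    return dct_ii(log_mel, n_mfcc)

# one second of a 440 Hz tone as a stand-in for a real clip
sr_demo = 22050
demo_signal = np.sin(2 * np.pi * 440 * np.arange(sr_demo) / sr_demo)
coeffs = toy_mfcc(demo_signal, sr_demo)
print(coeffs.shape)  # (20, 85)
```

Each column of the result describes one short frame of audio, which is why the number of columns grows with the length of the clip.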

We will use the mfcc function from the librosa library to generate the audio embeddings. So, we define a function called audio2vector, which returns the audio embeddings for a given audio file:

In [4]:
def audio2vector(file_path, max_pad_len=400):
    
    #read the audio file as a mono signal
    audio, sr = librosa.load(file_path, mono=True)
    #reduce the size by keeping every third sample
    audio = audio[::3]
    
    #extract the audio embeddings using MFCC (the signal is passed by keyword,
    #which newer librosa versions require)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr) 
    
    #the number of frames varies with the clip length, so we fix it at max_pad_len:
    #longer clips are truncated and shorter ones are padded with zeros
    mfcc = mfcc[:, :max_pad_len]
    pad_width = max_pad_len - mfcc.shape[1]
    mfcc = np.pad(mfcc, pad_width=((0, 0), (0, pad_width)), mode='constant')
    return mfcc
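
To see what fixing the length does to the array shape, here is a small sketch on dummy arrays instead of real MFCCs. The helper name `pad_or_truncate` and the frame counts 130 and 450 are made up for the illustration. It also covers the case of a clip with more than 400 frames, where the padding width would otherwise be negative.

```python
import numpy as np

def pad_or_truncate(features, max_len=400):
    # features has shape (n_coefficients, n_frames)
    n_frames = features.shape[1]
    if n_frames >= max_len:
        # too long: keep only the first max_len frames
        return features[:, :max_len]
    # too short: append zero-valued frames on the right
    return np.pad(features, pad_width=((0, 0), (0, max_len - n_frames)),
                  mode='constant')

short_clip = np.ones((20, 130))   # stand-in for a short clip
long_clip = np.ones((20, 450))    # stand-in for a long clip

print(pad_or_truncate(short_clip).shape)  # (20, 400)
print(pad_or_truncate(long_clip).shape)   # (20, 400)
```

Every clip ends up with the same shape, which is what lets us stack the embeddings into a single array later.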

We will load one audio file and look at its embeddings:

In [5]:
audio_file = 'data/audio/Dogs/dog_barking_0.wav'

In [6]:
audio2vector(audio_file)

array([[-297.54905127, -288.37618855, -314.92037769, ...,    0.        ,
           0.        ,    0.        ],
       [  23.05969394,    9.55913148,   37.2173831 , ...,    0.        ,
           0.        ,    0.        ],
       [-122.06299523, -115.02627567, -108.18703056, ...,    0.        ,
           0.        ,    0.        ],
       ...,
       [  -6.40930836,   -2.8602708 ,   -2.12551478, ...,    0.        ,
           0.        ,    0.        ],
       [   0.70572914,    4.21777791,    4.62429301, ...,    0.        ,
           0.        ,    0.        ],
       [  -6.08997702,  -11.40687886,  -18.2415214 , ...,    0.        ,
           0.        ,    0.        ]])

Now that we have understood how to generate audio embeddings, we need to create the data for our Siamese network. Since a Siamese network accepts its data as pairs, we define a function that builds them. We create each genuine pair as (Dogs, Sub_dogs) and assign it the label 1, and each impostor pair as (Dogs, Cats) and assign it the label 0.

In [7]:
def get_training_data():
    
    pairs = []
    labels = []
    
    Dogs = glob.glob('data/audio/Dogs/*.wav')
    Sub_dogs = glob.glob('data/audio/Sub_dogs/*.wav')
    Cats = glob.glob('data/audio/Cats/*.wav')
    
    
    np.random.shuffle(Sub_dogs)
    np.random.shuffle(Cats)
    
    for i in range(min(len(Cats),len(Sub_dogs))):
        #impostor pair: one of the first four clips in Dogs with a cat clip
        if (i % 2) == 0:
            pairs.append([audio2vector(Dogs[randint(0,3)]),audio2vector(Cats[i])])
            labels.append(0)

        #genuine pair: one of the first four clips in Dogs with a Sub_dogs clip
        else:
            pairs.append([audio2vector(Dogs[randint(0,3)]),audio2vector(Sub_dogs[i])])
            labels.append(1)
            
            
    return np.array(pairs), np.array(labels)

In [8]:
X, Y = get_training_data()
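
It helps to know the layout of the arrays this function returns. The sketch below builds pairs in the same way but from random arrays instead of real audio. The names `X_demo` and `Y_demo`, the pair count, and the (20, 400) embedding shape are assumptions for the illustration; 20 is librosa's default number of MFCC coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)

pairs = []
labels = []
for i in range(6):
    # random stand-ins for two audio embeddings
    first = rng.normal(size=(20, 400))
    second = rng.normal(size=(20, 400))
    pairs.append([first, second])
    # even index -> impostor pair (0), odd index -> genuine pair (1)
    labels.append(i % 2)

X_demo = np.array(pairs)
Y_demo = np.array(labels)

print(X_demo.shape)        # (6, 2, 20, 400)
print(Y_demo)              # [0 1 0 1 0 1]
print(X_demo[:, 0].shape)  # (6, 20, 400), the first clip of every pair
print(X_demo[:, 1].shape)  # (6, 20, 400), the second clip of every pair
```

Axis 1 indexes the two members of a pair, so slicing it gives one array per branch of the network.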

Next, we split our data for training and testing, with 80% for training and 20% for testing:

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

Now that we have successfully generated our data, we build our Siamese network. We first define the base network, which is used for feature extraction. It uses three dense layers with dropout layers in between.

In [10]:
def build_base_network(input_shape):
    #named inputs rather than input to avoid shadowing the Python built-in
    inputs = Input(shape=input_shape)
    x = Flatten()(inputs)
    x = Dense(128, activation='relu')(x)
    x = Dropout(0.1)(x)
    x = Dense(128, activation='relu')(x)
    x = Dropout(0.1)(x)
    x = Dense(128, activation='relu')(x)
    return Model(inputs, x)
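
As a rough size check, we can count the trainable weights of these three dense layers by hand. The sketch assumes an input of shape (20, 400), that is, librosa's default of 20 MFCC coefficients and the 400 frames we pad to. The helper `dense_params` is introduced only for this illustration.

```python
def dense_params(n_in, n_out):
    # a dense layer has one weight per input/output pair plus one bias per output
    return n_in * n_out + n_out

flattened = 20 * 400                     # size after the Flatten layer
layer_sizes = [flattened, 128, 128, 128]

counts = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
print(counts)       # [1024128, 16512, 16512]
print(sum(counts))  # 1057152
```

Almost all of the weights sit in the first dense layer, because it is connected to every value of the flattened embedding. Dropout and Flatten add no weights.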

Next, we feed the audio pair to the base network, which will return the features:

In [11]:
input_dim = X_train.shape[2:]

audio_a = Input(shape=input_dim)
audio_b = Input(shape=input_dim)

In [12]:
base_network = build_base_network(input_dim)

feat_vecs_a = base_network(audio_a)
feat_vecs_b = base_network(audio_b)

These feat_vecs_a and feat_vecs_b are the feature vectors of our audio pair. Next, we feed these feature vectors to the energy function to compute the distance between them. We use Euclidean distance as our energy function:

In [13]:
def euclidean_distance(vects):
    x, y = vects
    return K.sqrt(K.sum(K.square(x - y), axis=1, keepdims=True))


def eucl_dist_output_shape(shapes):
    shape1, shape2 = shapes
    return (shape1[0], 1)

In [14]:
distance = Lambda(euclidean_distance, output_shape=eucl_dist_output_shape)([feat_vecs_a, feat_vecs_b])
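
The same computation can be checked on small arrays with NumPy. This is a sketch that mirrors the backend calls above. The name `euclidean_distance_np` and the toy vectors are made up for the example.

```python
import numpy as np

def euclidean_distance_np(x, y):
    # sum the squared differences over the feature axis and keep that axis,
    # so a batch of N pairs gives an (N, 1) array of distances
    return np.sqrt(np.sum(np.square(x - y), axis=1, keepdims=True))

feats_a = np.array([[0.0, 0.0], [1.0, 2.0]])
feats_b = np.array([[3.0, 4.0], [1.0, 2.0]])

distances = euclidean_distance_np(feats_a, feats_b)
print(distances.shape)  # (2, 1)
print(distances)        # 5.0 for the first pair, 0.0 for the second
```

One distance comes out per pair, which matches the (shape1[0], 1) output shape declared in eucl_dist_output_shape.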


Next, we set the number of epochs to 13, use RMSprop as the optimizer, and define our model:

In [15]:
epochs = 13
rms = RMSprop()

In [16]:
model = Model(inputs=[audio_a, audio_b], outputs=distance)

Lastly, we define our loss function as contrastive_loss and compile the model:

In [17]:
def contrastive_loss(y_true, y_pred):
    margin = 1
    return K.mean(y_true * K.square(y_pred) + (1 - y_true) * K.square(K.maximum(margin - y_pred, 0)))

In [18]:
model.compile(loss=contrastive_loss, optimizer=rms)
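
To see how this loss treats the two kinds of pairs, here is a NumPy sketch of the same formula applied to three made-up distances. The name `contrastive_loss_np` and the sample values are assumptions for the illustration.

```python
import numpy as np

def contrastive_loss_np(y_true, distance, margin=1.0):
    # genuine pairs (label 1) are penalised by their squared distance
    genuine_term = y_true * np.square(distance)
    # impostor pairs (label 0) are penalised only while closer than the margin
    impostor_term = (1 - y_true) * np.square(np.maximum(margin - distance, 0))
    return np.mean(genuine_term + impostor_term)

y_demo = np.array([1.0, 0.0, 0.0])   # one genuine pair, two impostor pairs
d_demo = np.array([0.2, 0.3, 1.5])   # predicted distances

# per-pair terms: 0.2^2 = 0.04, (1 - 0.3)^2 = 0.49, max(1 - 1.5, 0)^2 = 0
loss_demo = contrastive_loss_np(y_demo, d_demo)
print(loss_demo)  # about 0.1767, the mean of 0.04, 0.49 and 0
```

The third pair contributes nothing: an impostor pair that is already farther apart than the margin is not pushed any further.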

Now, we train our model:

In [19]:
audio_1 = X_train[:, 0]
audio_2 = X_train[:, 1]

In [20]:
model.fit([audio_1, audio_2], y_train, validation_split=.25,
          batch_size=128, verbose=2, epochs=epochs)

Train on 8 samples, validate on 3 samples
Epoch 1/13
 - 1s - loss: 6794.6025 - val_loss: 13322.8232
Epoch 2/13
 - 0s - loss: 11778.2275 - val_loss: 13736.1221
Epoch 3/13
 - 0s - loss: 10468.7461 - val_loss: 8607.5186
Epoch 4/13
 - 0s - loss: 9080.3896 - val_loss: 3346.3660
Epoch 5/13
 - 0s - loss: 5028.8110 - val_loss: 2159.2551
Epoch 6/13
 - 0s - loss: 1999.7991 - val_loss: 1449.4528
Epoch 7/13
 - 0s - loss: 1754.0571 - val_loss: 1395.3939
Epoch 8/13
 - 0s - loss: 905.6921 - val_loss: 1398.4510
Epoch 9/13
 - 0s - loss: 434.2379 - val_loss: 1244.3544
Epoch 10/13
 - 0s - loss: 328.7910 - val_loss: 1355.8146
Epoch 11/13
 - 0s - loss: 413.8195 - val_loss: 1199.9091
Epoch 12/13
 - 0s - loss: 600.8795 - val_loss: 941.8326
Epoch 13/13
 - 0s - loss: 213.6848 - val_loss: 855.2000


<keras.callbacks.History at 0x7f4d0e8d3b50>