# Audio Classification: Convolutional Neural Networks


## 1. The Problem

The goal is to classify audio spectrograms into one of two classes.

So although it's audio, we have the images of the audio.


## 2. The Dataset

The dataset contains spectrograms of Stephen Colbert and Conan O'Brien speaking. This dataset has been put together by Sean M. Tracey.

The source videos from which the audio samples have been extracted, and the spectrograms generated from are:

Stephen Colbert:
- https://www.youtube.com/watch?v=U2_52Dj6DsI
- https://www.youtube.com/watch?v=m6tiaooiIo0

Conan O'Brien:
- https://www.youtube.com/watch?v=KmDYXaaT9sA
- https://www.youtube.com/watch?v=_q471WB5Tgw
- https://www.youtube.com/watch?v=DtJ28qOEG1g

The audio content is from different time periods across 2 decades. Once extracted from the videos, each audio file is divided into 250ms clips. These clips are then analysed to generate spectrograms that can be classified by a CNN.

## Dependencies

We'll be using tensorflow 2.0 to construct our neural network. Make sure you have the following packages. The version may differ but be careful about it.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
%load_ext watermark
%watermark -v -m -p numpy,scipy,tensorflow,matplotlib,tqdm  -g

CPython 3.7.5
IPython 7.10.0

numpy 1.17.4
scipy 1.3.1
tensorflow 2.0.0
matplotlib 3.1.2
tqdm 4.39.0

compiler   : Clang 4.0.1 (tags/RELEASE_401/final)
system     : Darwin
release    : 19.0.0
machine    : x86_64
processor  : i386
CPU cores  : 16
interpreter: 64bit
Git hash   : e002afdb781c2538f12af576a44b04aefe95ff13


## Getting the training data

The training data consists of 2394 JPGs (spectrograms) of the raw audio data from the processed video files.

Once we have the data, we divide it into 3 different categories:

- training (75%)
- test (20%)
- validation (5%)

The `training` data will be used to train our model on the different patterns in Stephen Colbert and Conan O'Brien's speech. 

The `test` data is used by the model to track how well it's doing in the current epoch. 

The `validation` data is not used by the model, but is reserved by us to run some code later on in this notebook.

You can retrieve these files from a DropBox folder and extract them to a working directory for our script.

In [2]:
# !mkdir -p ./data
# !curl -L --output ./data/audio_data.zip https://www.dropbox.com/s/rbywvpnd7h3d5ra/audio_data.zip?dl=1
# !unzip -o ./data/audio_data.zip -d ./data
# !ls -la && ls -la ./data

## Load Libraries

Now we import the required dependencies we'll need:

In [3]:
import sys, os
from os import listdir
from os.path import isfile, join
from tqdm.notebook import tqdm
from scipy import misc
import numpy as np

# And all the required tensorflow libraries (using tensorflow 2.0)
import tensorflow as tf2
from tensorflow.keras.preprocessing.image import ImageDataGenerator, load_img
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Activation, Dropout, Flatten, Dense
from tensorflow.keras import backend as K

## Tracking Available Files

This is not strictly neccesary in our notebook but it is handy little bit of code that we can use to define our training parameters.

In [4]:
labelled_class_names = [dI for dI in os.listdir('data/train') if os.path.isdir(os.path.join('data/train', dI))]
labelled_classes = {}

for class_name in labelled_class_names:
    trainingPath = 'data/train/' + class_name
    testPath = 'data/test/' + class_name

    labelled_classes[class_name] = {}
    labelled_classes[class_name]['training'] = len([f for f in listdir(trainingPath) 
                                                    if isfile(join (trainingPath, f))])
    labelled_classes[class_name]['test'] = len([f for f in listdir(testPath) 
                                                if isfile(join(testPath, f))])

In [5]:
labelled_classes

{'colbert': {'training': 957, 'test': 180},
 'conan': {'training': 957, 'test': 180}}

## Preparing the data

We set the parameters that our generators will use to decide how to train our model.

1. Create variables that we can use to point our code to the locations of our training and test data on our filesystem (where we unzipped our data files in the early steps of this notebook)

2. Create a variable which lists the number of training + test files in our dataset. This is used later on in our dataset generators (some natty code which handles all the messy business of passing our training/test data to our model) to determine the number of training steps are needed for each epoch.

3. Set the number of epochs we want our training cycle to have. We don't want to have too many epochs as we have quite a small dataset, if we have too many epochs, our model will be trainined to recognise our dataset, not the speech patterns of the people we're trying to identify. This would negate our models efficacy in any environment outside of this notebook / dataset. 3 epochs on a dataset this size should give us a confidence of ~90% - 95%


In [6]:
# dimensions of our images.
img_width, img_height = 110, 110

train_data_dir = 'data/train'           # Folder used to train our model
test_data_dir = 'data/test' # Folder used to validate the model

nb_train_samples = labelled_classes[labelled_class_names[0]]['training']    # Number of files to train our model
nb_test_samples = labelled_classes[labelled_class_names[0]]['test']         # Number of files to validate our model

epochs = 3      # Iterations (epochs) that our data will pass through the model to train it.
batch_size = 16 # How many files from our training / test dataset will include in our batch at training time

# Depending on the backend Keras is running with, we either include the dimensions of our imnage before we include the number of channels
# This code sets the shape according to the backend included in the script.
if K.image_data_format() == 'channels_first': 
    input_shape = (3, img_width, img_height)
else:
    input_shape = (img_width, img_height, 3)

## Constructing Our Model

It's time to start building our neural network!

In this next cell, we set the `model` variable. Here, we're telling our script that we want to create a sequential neural network, that is, a network which layers are triggered in tandem with each other, one layer after the other.

## Breaking down our neural network

Let's break down our layers line by line.

First, we have `model.add(Conv2D(32, (3, 3), input_shape=input_shape))` which, as the code suggests, creates a 2D convolutional layer. A convolutional layers job is to take an input (in this case, our image) and to essentially move around areas of it (convolve around) and tally up all of the values in that section of the image. Imagine you have a photo and a small square, if you were to place the square over a part of the image and tally up all of the values of the image inside the grid you would get a single number. If you repeat this process for all of the remaining parts of the image, you would end up with a matrix of numbers which you would then pass on to the next layer of our network. You can think of it as a filter. Our image goes in, the **_key features_** of our image are what is output, and that's what's important - we're hoping (and hope is the right word here) that our convolutional layer will be able to pick out the key features of our spectrograms as we show it more and more of them. This layer has `32` outputs, so we're greatly reducing the amount of information (A 110x110 image has 12,100 data points) passing from our source to the next layer of our network.

Next up, we have `model.add(Activation('relu'))`. This creates a layer of densely connected tensors (that is, a series of tensors which connects to every input and output before it) that will be activated with the **ReLu** function. The **ReLu** activation function is generally the go-to choice of activation functions when building convolutional neural networks. It's uncomplicated (it ignores negative input values), so it's computationally cheap to process, meaning we can usually train our networks faster with results comparative to networks built with other activation functions.

Finally, we have `model.add(MaxPooling2D(pool_size=(2, 2)))`. A pooling layer take only the highest values of a section of the inputs. Essentialluy this is a downsampling function which only allows the most prominent features of the inputs to remain. In this pooling layer, we're passing a pool size of `2x2`, that is, we're dividing the images into quadrants and passing the maximum values found in those quandrants to the next layer of our neural network.

In [7]:
model = Sequential()

# First Group
model.add(Conv2D(32, (3, 3), input_shape=input_shape))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

## Finding Higher Level Features

Here, we repeat the stucture of our network. In our second group of layers, we have the same number of inputs as the output of our original layer of convolutional layer. Afterwards we have a second layer of activation nodes, followed by another pooling layer.

By adding more neurons to our convolution layer in `model.add(Conv2D(64, (3, 3)))` we create in our network the possibility of exploring more higher level features. If we're looking for images of cats, we can imagine the first layers in the first group of layers of our neural network would have identified the edges of objects. The second and third groups of convolutional layer gives our network the opportunity to explore connections between different features identified in the first group of layers. So, as opposed to finding the edge of features in an image, the network can start to look at the relationships between those featurs identified in the first group of layers in our network. Going back to thinking about cats, it's a bit like being able to say "This thing has ears (identification made in our first layer) but it also has four paws (exposed by the greater feature exploration enabled in our second layer)" 

In [8]:
# Second Group
model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

# Third Group
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

## Flattening Our Output

At this point, after a number of epochs, our network will hopefully have built up enough information on the features in our spectrograns to adjust the weights which connect our neurons in a way that corresponds to an output that we expect - that is, it should be able to tell the difference between Stephen Colbert speaking and Conan O'Brien.

In our fourth group of layers, we flatten our input (essentially create a one-dimensional array of numbers which we can feed into our next layer) and create a densely connected layer of neurons which have **ReLu** activation. This ensures that none of the neurons in the previous layers aren't ultimately connected to the subsequent series of layers in our network.

In [9]:
# Fourth Group
model.add(Flatten())
model.add(Dense(64))
model.add(Activation('relu'))

## Classifying our data

Here we arrive at our final group of layers. First, we have a dropout layer `model.add(Dropout(0.5))`. When we speak of "dropout" in a neural network, we're talking about the process whereby neurons are randomly selected be ignored during the forward and backpropogation of data in our training stages. This has the effect of making our networks more robust to overfitting to our dataset, thus (hopefully) being able to better to classify inputs from more varied sources than that of our dataset. In this instance, we're dropping about half of the neurons from our network at random during training.

Next we have a final densely connected layer `model.add(Dense(len(labelled_classes)))`. The number of neurons in this layers corresponds to the number of classes our dataset has. This is the place where a decision is effectively made on what label the input data should have in our neural network.

Finally, we add a layer of activation neurons using a sigmoid function. The sigmoid function gives us a nice, curvy decider function that skirts around linearity. It inherently allows for nuance in deciding whether a thing is thing A or thing B based on the input.

In [10]:
# Fifth Group
model.add(Dropout(0.5))
model.add(Dense( len(labelled_classes)))
model.add(Activation('sigmoid'))

## Compiling Our Model

And that's it! We've constructed a very simple neural network to classify our spectrograms. Now it's time to compile our model to ready it for training.

In our `model.compile` code, we pass an optimizer, a loss function, and the metric that we use to measure the accuracy of our network.

The purpose of our optimiser is to best figure out how to update the weights of the connection between the neurons to minimise the loss function. Here, we're using the **ADAM** (**Ada**ptive **M**oment Estimation) optimiser.

Next, we pass the loss function in this case we're using the `categorical_crossentropy` loss function to calculate the difference between the expected output and the actual output that our model produces. This loss function is generally used for networks where there are more than two classes to identify (in that case, we could use a `binary_crossentropy` loss function), but by using the `categorical_crossentropy` function, we can use the same code that can manage classification of more than two types of classes. For the loss function, smaller number is better.

Finally, we have the metric that we use to measure the performance of our model. There can be a combination of metrics to measure the performance of our model, but here, we're only using `accuracy`.

In [11]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

## Feeding our data into our model

Now that we have our model compiled, we can start funelling our data into it to train it 🎉

Keras has "generators" which can look at a directory structure, identify the classes therein, and then feed the data into our model.

We need two generators, one for the feeding the training data into our model, the other for feeding the test data into our model.

For this, we create the `train_datagen` and `test_datagen` variables. Our model needs to have values between 0 and 1 to function correctly, so when we call the `ImageDataGenerator` function to create the generator, we tell it to divide all of the values it finds in the file by `255` the maxiumum number a pixel in our images.

In our `train_generator` and `test_generator`, we pass through the directories that Keras can find our divvied-up dataset in (remember when we unzipped that file _ages_ ago? Well, now we get to use it). 

In both of our generators, we pass a target size tuple. If we had large images in our dataset, this would rescale them to a more manageable size for our network to handle. We don't necessarily need a high resolution image for our model to work. In this case, our images are `110 x 110` pixels. This is small enough that we should have enough data for our model to figure out any patterns in our spectrograms, but small enough that the model should train within a few minutes.

In [12]:
train_datagen = ImageDataGenerator( rescale = 1.0 / 255 )
test_datagen = ImageDataGenerator(rescale = 1.0 / 255 )

train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size
)

test_generator = test_datagen.flow_from_directory(
    test_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size
)

Found 1914 images belonging to 2 classes.
Found 360 images belonging to 2 classes.


## Training our model

And here we are, at the point where we get to train our model.

If we hadn't used a generator to load our dataset, we could have called `model.fit` with a set of numpy arrays describing our data, but the generator is a nice way of quickly getting our model up and running. It's job is to break our training job into a smaller series of batches (a step) for training so that we don't have to load all of our data into memory at once. Though not a problem in this instance, large datasets may cause _out of memory_ errors when we try to train our model by overwhelming the systems capabilities. By breaking the training process up into smaller jobs, we can train our models on machines that don't neccesarily have an abundance of resources.

In [13]:
model.fit_generator(
    train_generator,
    steps_per_epoch=nb_train_samples // batch_size,
    epochs=epochs,
    validation_data=test_generator,
    validation_steps=nb_test_samples // batch_size)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x136a46d50>

## Saving Our Model

After our model has gone through however many epochs of training we've asked it to do, we can save it!

By calling `model.save` we can pass a filename through to the function which will write a file. This will write the structure of the network as well as the weights between each neuron to a file, which makes it perfect for loading somewhere else. If you run the cell after `model.save` (`!ls -la`) you'll see that we now have the file `model.h5`.

In [14]:
model_reference = 'models/model_03.h5'

In [15]:
model.save(model_reference)

In [16]:
!ls -a models

[34m.[m[m           [34m..[m[m          model_01.h5 model_02.h5


## Loading + Predicting w. Our Model

So, we've trained a model, but what good is that to us if we can't predict things with it? In this next cell, we'll load the model we just wrote to a file and use it to classify the data that we seperated from our dataset earlier on (the validation data).

This is data that the model has never seen - not during training, and not during test - so if our model is any good, it should be able to pick out which of our two speakers are talking in each file.

This code will get all of the files in the `validation` folder in our `data` folder and classify each one of them adding to a tally for each speaker.

First, we identify all of the classes in our data structure with `labelled_class_names`, then we create a `tally` variable which will maintain a count of each correct and incorrectly identified speakers. 

In [17]:
stored_model = load_model(model_reference)

labelled_class_names = [dI for dI in os.listdir('data/validation') 
                        if os.path.isdir(os.path.join('data/validation', dI ))]
labelled_classes = {}

tally = {}

for class_name in tqdm(labelled_class_names):
    
    if class_name not in tally:
        tally[class_name] = {"correct": 0, "incorrect": 0}

    validationPath = 'data/validation/' + class_name

    for f in tqdm(os.listdir(validationPath)):

        if os.path.isfile(os.path.join(validationPath, f)) and f != ".DS_Store":

            filePath = validationPath + "/" + f

            spectrogramFile = load_img(filePath, target_size=(110,110))
            spectrogramFile = np.reshape(spectrogramFile, [1,110,110,3])
            prediction = stored_model.predict_classes(spectrogramFile.astype(np.float16))

            if class_name == "colbert" and prediction == 0:
                tally[class_name]["correct"] += 1
            elif class_name == "colbert" and prediction == 1:
                tally[class_name]["incorrect"] += 1

            if class_name == "conan" and prediction == 1:
                tally[class_name]["correct"] += 1
            elif class_name == "conan" and prediction == 0:
                tally[class_name]["incorrect"] += 1
print(tally)

HBox(children=(IntProgress(value=0, max=2), HTML(value='')))

HBox(children=(IntProgress(value=0, max=60), HTML(value='')))




HBox(children=(IntProgress(value=0, max=60), HTML(value='')))



{'colbert': {'correct': 60, 'incorrect': 0}, 'conan': {'correct': 58, 'incorrect': 2}}
