# Neural networks for audio classification

## Part 3: Model Training

We will train a neural network classifier that predicts which keyword (class) is present from the MFCC features of a one-second-long audio clip.

### Load libraries and set up GPU usage

In [1]:
import tensorflow as tf

In [2]:
## Activate gpu usage if available
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:    
    try:  
        tf.config.experimental.set_virtual_device_configuration(gpus[0], [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)])
    except RuntimeError as e:
        print(e)
else:
    print('no gpus found!')

no gpus found!


In [3]:
%matplotlib inline

In [4]:
from config import *
import pandas as pd
from utility import keep_only_n_unknowns, pad_signal, augment_audio, get_callbacks
import matplotlib.pyplot as plt
from tensorflow.keras import layers

### Set up our first neural network!

In [5]:
## Infer model size
n_max_frames     =  49  # leave this at 49 
n_output_neurons = 12


print('features have the dimension:', n_max_frames, 'x', n_mfccs, 'and output neurons:', n_output_neurons)

features have the dimension: 49 x 40 and output neurons: 12


## Exercise
We will use `tf.keras.models.Sequential(list_of_layers)` and feed a list of layers to it to create our model. 

1. Create a feed-forward network with $2$ hidden layers, **ReLU** activation functions, and a softmax output layer. You can use `tf.keras.Input()` as the input layer before the Dense hidden layers. Use $64$ and $128$ neurons for your **"Dense"** layers. The input dimension is the dimension of the spectrogram image, i.e. $(49 \times 40)$. The hidden layers in this network should receive a $1$-dimensional feature vector per example, so you can use a **Reshape** layer to flatten the input from `(49, 40)` to `(49 * 40,)`, i.e. $1960$ values.

2. Check the dimensions and parameters of your model using `model.summary()` and try out the `model.predict` function on a training batch. 

## Hints:
1. Here is a blueprint:

```
model = tf.keras.models.Sequential(
    [
        tf.keras.Input(name='input_layer', shape=("xxx", "yyy")),
        layers.Reshape(("xxx" * "yyy", ), input_shape=("xxx", "yyy")),
        layers.Dense("", activation='relu'),
        layers.Dense("", activation='relu'),
        layers.Dense("", activation='softmax'),
    ]
)
```

2. You can use `np.random.random()` and pass it a tuple `(batch_size, 49, 40)` to create a random batch for testing the model with the `model.predict()` function.

3. The predictions should sum up to $1$ because we have used a **Softmax** layer. You can check it with `np.sum(prediction_vector)`.
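To see what the reshape step does to a single example, the following sketch flattens a $49 \times 40$ grid in row-major order using plain Python lists. It is only an illustration: `flatten_row_major` is a hypothetical helper and not part of Keras. Row-major order is the default order used by `np.reshape`, and we assume the Keras reshape layer behaves the same way.

```python
n_frames, n_coeffs = 49, 40  # same shape as the MFCC features above

# Build a dummy feature grid whose entries encode their own position.
grid = [[(i, j) for j in range(n_coeffs)] for i in range(n_frames)]


def flatten_row_major(rows):
    """Concatenate the rows of a 2D list into one flat list."""
    return [value for row in rows for value in row]


flat = flatten_row_major(grid)
print(len(flat))  # 1960 = 49 * 40

# Element (i, j) of the grid ends up at flat index i * n_coeffs + j.
print(flat[3 * n_coeffs + 7])  # (3, 7)
```

The flattened vector keeps every value but discards the explicit time/frequency layout, which is all a fully connected layer needs.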

## Solution:

### E1
Model architecture

In [6]:
model = tf.keras.models.Sequential(
    [
        tf.keras.Input(name='input_layer', shape=(n_max_frames, n_mfccs)),
        layers.Reshape((n_max_frames * n_mfccs, ), input_shape=(n_max_frames, n_mfccs)),
        layers.Dense(64, activation='relu'),
        layers.Dense(128, activation='relu'),
        layers.Dense(n_output_neurons, activation='softmax'),
    ]
)

### E2
Model summary

In [7]:
model.summary()
model.input_shape

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 reshape (Reshape)           (None, 1960)              0         
                                                                 
 dense (Dense)               (None, 64)                125504    
                                                                 
 dense_1 (Dense)             (None, 128)               8320      
                                                                 
 dense_2 (Dense)             (None, 12)                1548      
                                                                 
Total params: 135,372
Trainable params: 135,372
Non-trainable params: 0
_________________________________________________________________


(None, 49, 40)
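The parameter counts in the summary can be checked by hand: a Dense layer with $n_{in}$ inputs and $n_{out}$ units has $n_{in} \cdot n_{out}$ weights plus $n_{out}$ biases. A small sketch of that arithmetic (the helper `dense_params` is our own and not a Keras function):

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out


n_inputs = 49 * 40  # flattened MFCC features
sizes = [n_inputs, 64, 128, 12]

# One entry per Dense layer: input size -> output size.
per_layer = [dense_params(a, b) for a, b in zip(sizes[:-1], sizes[1:])]
print(per_layer)       # [125504, 8320, 1548]
print(sum(per_layer))  # 135372
```

Note that the first Dense layer alone holds more than $90\%$ of all parameters, because it is connected to all $1960$ input values.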

Model predictions

In [8]:
import numpy as np
prediction = model.predict(np.random.random((10,49,40)))[0]
print(prediction, '\n sum:', np.sum(prediction))

[0.08838195 0.08457235 0.0447917  0.05025874 0.02020248 0.10175817
 0.1082058  0.1439389  0.11919267 0.05784237 0.1036185  0.07723636] 
 sum: 0.99999994
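The sum printed above is $0.99999994$ rather than exactly $1$ because the network computes in 32-bit floating point, so a check for exact equality would fail. The sketch below shows a tolerance-based check on a hand-written softmax. It uses only the standard library; the `softmax` helper and the example scores are our own illustration, not the Keras implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.3, -1.2, 2.0, 0.0]  # arbitrary example scores
probs = softmax(logits)

# Compare with a tolerance instead of testing for exact equality with 1.
print(math.isclose(sum(probs), 1.0, rel_tol=1e-6))  # True
print(all(0.0 < p < 1.0 for p in probs))            # True
```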


## Part 4: Train the model

In this part, we will compile the model by providing a loss, metrics, and an optimizer. We will use one set of hyperparameters for all of the following training runs.

In [9]:
## Number of epochs to run the training for
n_epochs= 30

## Early stopping setting
patience= 25    

## Logging/debugging 
debugging_mode = False

## size of the batches used in training
batch_size = 32
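To get a feeling for the `patience` setting, here is a minimal sketch of a patience-based stopping rule. We have not shown the contents of `get_callbacks`, so this assumes it applies the usual rule (stop once the monitored value has not improved for `patience` consecutive epochs); `stopping_epoch` and the loss values are made up for illustration.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop,
    or None if the patience limit is never reached."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up validation losses: best value at epoch 3, no improvement afterwards.
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.9, 0.71]
print(stopping_epoch(losses, patience=3))   # 6
print(stopping_epoch(losses, patience=25))  # None
```

Under this assumption, a patience of $25$ with only $30$ epochs means that training stops early only if the best value is reached within the first five epochs.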

In [10]:
## Compile the model
from tensorflow import keras

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=[keras.metrics.CategoricalAccuracy()],
              run_eagerly=debugging_mode)
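The `'categorical_crossentropy'` loss expects one-hot encoded labels, so we assume the label arrays are stored in that form. For a single example, the loss is $-\sum_k y_k \log p_k$, which reduces to $-\log$ of the probability assigned to the true class. The following standard-library sketch reproduces this by hand; the function and its `eps` clipping value are our own illustration and may differ in detail from the Keras implementation.

```python
import math


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy for one sample: -sum_k y_true[k] * log(y_pred[k])."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


n_classes = 12
y_true = [0.0] * n_classes
y_true[3] = 1.0  # one-hot label for class 3

uniform = [1.0 / n_classes] * n_classes  # a model that guesses uniformly
loss = categorical_crossentropy(y_true, uniform)
print(round(loss, 4))  # 2.4849, i.e. ln(12)
```

A loss of about $\ln(12) \approx 2.48$ is therefore what a model with no knowledge of the $12$ classes would produce; values below that indicate that the network has started to learn.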

## Fit on the unaugmented training set (single fold)
Now we finally get to train a model. Keras implements the training loop inside the model class. While it is possible to write a custom training loop, we can simply pass all the options to the `model.fit()` function and it will do everything for us. We need to pass:
- The training set as `x` and `y`.
- **steps_per_epoch**, which is the number of batches processed in each epoch.
- **epochs**, which is the total number of epochs to train for (we pass our `n_epochs` setting here).
- **shuffle**, which controls whether the training data is shuffled before each epoch (we set it to **False** for now for all our training runs).
- **validation_data**, which is the validation set. It is never used to update the weights; Keras only evaluates the loss and accuracy on it.
- **callbacks**, which is a collection of functions that are called at certain points during training. We have provided a callback function for you that writes out metrics such as the **confusion matrix**, the **ROC curve**, and so on. Check whether you can find them in your `output_dir`, in a folder named after the date and time at which the training started.
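The `steps_per_epoch` value used in the training cell is simply the number of complete batches that fit into the training set. A short sketch of that calculation follows; the sample count is hypothetical (we have not printed the real size of the training set) and was chosen so that the result matches the $1162$ steps that appear in the training log further down.

```python
import math


def steps_per_epoch(n_samples, batch_size):
    """Number of complete batches that fit into n_samples."""
    return math.floor(n_samples / batch_size)


batch_size = 32
n_samples = 37_200  # hypothetical training-set size, not taken from the data

steps = steps_per_epoch(n_samples, batch_size)
print(steps)                           # 1162
print(n_samples - steps * batch_size)  # 16 samples do not fill a complete batch
```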

In [11]:
import sys, importlib

importlib.reload(sys.modules['utility'])
from utility import get_callbacks

In [12]:
train_data = np.load("./data/X_train_data.npy"), np.load("./data/Y_train_data.npy")
val_data = np.load("./data/X_val_data.npy"), np.load("./data/Y_val_data.npy")

In [13]:
history = model.fit(x=train_data[0], y=train_data[1], 
                    steps_per_epoch=int(np.floor(len(train_data[0]) / batch_size)),
                    epochs=n_epochs, 
                    callbacks=get_callbacks(output_dir, val_data, model, patience=patience), 
                    validation_data=val_data, 
                    shuffle=False)

print('max val_categorical_accuracy:', np.max(history.history['val_categorical_accuracy']))

## Plot training and validation accuracy per epoch
plt.plot(history.history['categorical_accuracy'], label='training')
plt.plot(history.history['val_categorical_accuracy'], label='validation')
plt.legend()

logging to:  ./output/2023_07_26_22_08_11/
Epoch 1/30

2023-07-26 22:08:21.715866: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.


INFO:tensorflow:Assets written to: ./output/2023_07_26_22_08_11/saved_models/assets
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30

KeyboardInterrupt: 

## Exercise
1. We should see a significant difference between the training and validation accuracies in the plots above. Why is this the case?

2. How do you judge the overall accuracy? Play with the number of hidden layers and the number of neurons in the hidden layers and retrain. Do you get a better result?

## Hints
1. What's the difference between validation and training data?
2. Look at the confusion matrices we have written to your output folder. What can you see? Also, just copy the code you already have from above and change some parameters!
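When reading a confusion matrix, it helps to compute the overall accuracy and the per-class recall from it. The sketch below does this for a tiny made-up matrix. The class names, the counts, and the layout (rows are true classes, columns are predicted classes) are assumptions for illustration; the matrices written by the provided callback may be organized differently.

```python
# Hypothetical 3-class confusion matrix: rows = true class, columns = predicted.
labels = ["yes", "no", "unknown"]
confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(labels)))  # diagonal entries
print("accuracy:", correct / total)

# Per-class recall: how often each true class is recognized correctly.
recall = {name: confusion[i][i] / sum(confusion[i]) for i, name in enumerate(labels)}
for name, value in recall.items():
    print(name, value)
```

Large off-diagonal entries point to pairs of keywords that the model confuses, which a single accuracy number hides.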

## Solution

### E1
The validation data are not used for parameter optimization, so the model can fit peculiarities of the training set that do not carry over to unseen data. This phenomenon is called overfitting, and it can have different causes such as **data sparsity**, **outliers**, or too many **degrees of freedom**. We will learn more about it in the next lecture.

### E2
The confusion matrices show that some keywords get mixed up a lot. The accuracy tells us what fraction of instances is classified correctly, i.e. how many keywords are recognized correctly. <br>
<br>
**Let's try some deeper architectures:**

In [14]:
model = tf.keras.models.Sequential(
    [
        tf.keras.Input(name='input_layer', shape=(n_max_frames, n_mfccs)),
        layers.Reshape((n_max_frames * n_mfccs, ), input_shape=(n_max_frames, n_mfccs)),
        layers.Dense(64, activation='relu'),
        layers.Dense(128, activation='relu'),
        layers.Dense(128, activation='relu'),
        layers.Dense(128, activation='relu'),    
        layers.Dense(n_output_neurons, activation='softmax'),
    ]
)

In [15]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 reshape_1 (Reshape)         (None, 1960)              0         
                                                                 
 dense_3 (Dense)             (None, 64)                125504    
                                                                 
 dense_4 (Dense)             (None, 128)               8320      
                                                                 
 dense_5 (Dense)             (None, 128)               16512     
                                                                 
 dense_6 (Dense)             (None, 128)               16512     
                                                                 
 dense_7 (Dense)             (None, 12)                1548      
                                                                 
Total params: 168,396
Trainable params: 168,396
Non-tr

In [16]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=[keras.metrics.CategoricalAccuracy()],
              run_eagerly=debugging_mode)

In [17]:
history = model.fit(x=train_data[0], y=train_data[1], 
                    steps_per_epoch=int(np.floor(len(train_data[0]) / batch_size)),
                    epochs=n_epochs, 
                    callbacks=get_callbacks(output_dir, val_data, model, patience=patience), 
                    validation_data=val_data, 
                    shuffle=False)

print('max val_categorical_accuracy:', np.max(history.history['val_categorical_accuracy']))

## Plot training and validation accuracy per epoch
plt.plot(history.history['categorical_accuracy'], label='training')
plt.plot(history.history['val_categorical_accuracy'], label='validation')
plt.legend()

logging to:  ./output/2023_07_26_22_11_48/
Epoch 1/30


KeyboardInterrupt



We can see that increasing the network's depth helped to raise the accuracy to $74\%$, so our previous model was not powerful enough. <br>
<br>
**Let's try an even deeper model:**

In [18]:
model = tf.keras.models.Sequential(
    [
        tf.keras.Input(name='input_layer', shape=(n_max_frames, n_mfccs)),
        layers.Reshape((n_max_frames * n_mfccs, ), input_shape=(n_max_frames, n_mfccs)),
        layers.Dense(64, activation='relu'),
        layers.Dense(128, activation='relu'),
        layers.Dense(128, activation='relu'),
        layers.Dense(128, activation='relu'),    
        layers.Dense(128, activation='relu'),
        layers.Dense(128, activation='relu'),
        layers.Dense(256, activation='relu'),    
        layers.Dense(256, activation='relu'),
        layers.Dense(512, activation='relu'),
        layers.Dense(256, activation='relu'),
        layers.Dense(n_output_neurons, activation='softmax'),
    ]
)

In [19]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=[keras.metrics.CategoricalAccuracy()],
              run_eagerly=debugging_mode)

In [20]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 reshape_2 (Reshape)         (None, 1960)              0         
                                                                 
 dense_8 (Dense)             (None, 64)                125504    
                                                                 
 dense_9 (Dense)             (None, 128)               8320      
                                                                 
 dense_10 (Dense)            (None, 128)               16512     
                                                                 
 dense_11 (Dense)            (None, 128)               16512     
                                                                 
 dense_12 (Dense)            (None, 128)               16512     
                                                                 
 dense_13 (Dense)            (None, 128)              

In [21]:
history = model.fit(x=train_data[0], y=train_data[1], 
                    steps_per_epoch=int(np.floor(len(train_data[0]) / batch_size)),
                    epochs=n_epochs, 
                    callbacks=get_callbacks(output_dir, val_data, model, patience=patience), 
                    validation_data=val_data, 
                    shuffle=False)

print('max val_categorical_accuracy:', np.max(history.history['val_categorical_accuracy']))

## Plot training and validation accuracy per epoch
plt.plot(history.history['categorical_accuracy'], label='training')
plt.plot(history.history['val_categorical_accuracy'], label='validation')
plt.legend()

logging to:  ./output/2023_07_26_22_12_01/
Epoch 1/30
 246/1162 [=====>........................] - ETA: 10s - loss: 1.9919 - categorical_accuracy: 0.2942

KeyboardInterrupt: 

We see that a deeper model can help to increase the accuracy, but only up to a point. Beyond that, adding parameters might not help anymore.
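To put "adding parameters" into numbers, the sketch below counts the weights and biases of the three fully connected architectures used in this part. The helper `mlp_params` and the dictionary keys are our own naming. The first two totals agree with the `model.summary()` outputs above; the third is computed here because that summary is cut off.

```python
def mlp_params(layer_sizes):
    """Total weights and biases of a fully connected network."""
    return sum(a * b + b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))


n_inputs, n_outputs = 49 * 40, 12
architectures = {
    "2 hidden layers": [64, 128],
    "4 hidden layers": [64, 128, 128, 128],
    "10 hidden layers": [64, 128, 128, 128, 128, 128, 256, 256, 512, 256],
}

totals = {
    name: mlp_params([n_inputs, *hidden, n_outputs])
    for name, hidden in architectures.items()
}
for name, count in totals.items():
    print(name, count)
# 2 hidden layers 135372
# 4 hidden layers 168396
# 10 hidden layers 564684
```

The deepest model has roughly four times as many parameters as the first one, which illustrates that model size alone does not determine accuracy.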

Ideally, we would like an accuracy of over $90\%$, so we have some room for improvement :-)