# **Before you start**

*   Go to "*File*" --> "*Save a copy in Drive*"
*   Open that copy (might open automatically)
*   Then continue below

---

# AST MIR-Seminar 3: Sound classification with simple Neural Networks

What we are going to do:
* Revisit previous seminar:
 * Curate a small "ESC-5" dataset from the ESC-50 dataset
 * Create a train and test set
 * Extract features (mel spectrogram) and prepare the labels
 * Use scikit-learn to classify the test set with a nearest neighbor algorithm
 * Plot a confusion matrix
* First look at creating simple neural networks
 * Keras / TensorFlow
 * Dense layers
 * Activations and Dropout
 * Fitting and Evaluation

---

# 1. Fetch the Dataset

*   We use ESC-50, a dataset for Environmental Sound Classification (https://github.com/karolpiczak/ESC-50)
*   Properties:
 * 50 classes
 * 40 files per class
 * each audio file has a length of 5 s
*   ***Tasks:***
 * Download, then
 * unzip the dataset

In [None]:
!wget https://github.com/karolpiczak/ESC-50/archive/master.zip

In [None]:
!unzip master.zip

---

# 2. Import libraries

* We need several libraries, so we import them once here and use them throughout the notebook.

In [None]:
import librosa
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf

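# Note: plot_confusion_matrix was removed in scikit-learn 1.2. On newer versions, drop it from this import and use ConfusionMatrixDisplay.from_estimator instead.
from sklearn.metrics import plot_confusion_matrix, confusion_matrix, ConfusionMatrixDisplay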
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

from tqdm import tqdm


### TensorFlow check
all_devices = tf.config.list_physical_devices()
print('Found {} devices: {}'.format( len(all_devices), all_devices ))

Expected output:
```
Found 1 devices: [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
```



---

# 3. Curate an ESC-5

***Tasks:***
* a) We choose 5 classes for our ESC-5 (*our_classes*). Find all files that belong to them. Put the files and their classes in two separate lists, and keep the lists index-aligned (meaning: the value at index 3 of list *a* belongs to the value at index 3 of list *b*).
 * Idea: Use *df.values* to iterate over the rows of the CSV.

* b) Print the first 5 elements of each list as (file, class)-tuples. Also, print the overall lengths of the lists.
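
The idea of two index-aligned lists can be sketched on a tiny, made-up table. The DataFrame below is only a stand-in for the real metadata (its rows and the names `toy_df`, `wanted`, `files`, `labels` are hypothetical), so treat it as an illustration of the pattern rather than the solution for ESC-50:

```python
import pandas as pd

# Toy stand-in for the metadata table (made-up rows, two columns only)
toy_df = pd.DataFrame({
    'filename': ['a.wav', 'b.wav', 'c.wav', 'd.wav'],
    'category': ['dog', 'cat', 'rain', 'dog'],
})
wanted = ['dog', 'rain']  # hypothetical subset of classes

files, labels = [], []
for filename, category in toy_df[['filename', 'category']].values:
    if category in wanted:
        files.append(filename)   # appended in the same iteration ...
        labels.append(category)  # ... so files[i] belongs to labels[i]

print(list(zip(files, labels)))
print('Lengths: files: {}, labels: {}'.format(len(files), len(labels)))
```

Because both lists grow in the same loop iteration, their indices cannot drift apart.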

In [None]:
our_classes = ['crying_baby', 'dog', 'rain', 'rooster', 'sneezing']  # Note: This is also our class map for later.
esc5_X = []  # File list
esc5_y = []  # Class list
fn_csv = 'ESC-50-master/meta/esc50.csv'  # Have a look at the metadata


### START CODING ###
df = ...  # pandas dataframe df

for row in ...:
  ...:
    esc5_X.append( ... )  # filename column
    esc5_y ...  # class column

... zip(esc5_X[:5], esc5_y[:5])
print(...)
### END CODING ###

Expected output:
```
[('1-100032-A-0.wav', 'dog'), ('1-110389-A-0.wav', 'dog'), ('1-17367-A-10.wav', 'rain'), ('1-187207-A-20.wav', 'crying_baby'), ('1-211527-A-20.wav', 'crying_baby')]
Lengths: esc5_X: 200, esc5_y: 200
```



---

# 4. Splitting the dataset into *train* and *test* subsets

***Tasks:***
* ESC-5 is almost ready. Using sklearn, split the dataset into *train* and *test* subsets with a split ratio of 80%/20% and a random state of 1337.
* Print the first 3 elements of the resulting *X_train*.
* Print the overall lengths of the resulting lists. Are they aligned with the ratio?
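
To see what an 80%/20% split means in numbers, here is a minimal NumPy sketch. It is not scikit-learn's implementation and will not reproduce the expected output below; the file names and the variables `items`, `toy_train`, `toy_test` are hypothetical:

```python
import numpy as np

items = ['file_{}.wav'.format(i) for i in range(200)]  # hypothetical file names

rng = np.random.default_rng(1337)         # fixed seed makes the shuffle repeatable
order = rng.permutation(len(items))       # shuffled indices 0..199
n_test = int(round(0.2 * len(items)))     # 20% of 200 = 40

test_idx, train_idx = order[:n_test], order[n_test:]
toy_train = [items[i] for i in train_idx]
toy_test = [items[i] for i in test_idx]

print(len(toy_train), len(toy_test))
```

The same index arrays could be used to pick the matching labels, which keeps files and classes aligned after shuffling.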

In [None]:
### START CODING HERE ###
X_train, X_test, y_train, y_test = train_...

...
...
### END CODING HERE ###

Expected output:
```
['5-203128-A-0.wav', '4-181286-A-10.wav', '3-157615-A-10.wav']
X: 160, 40; y: 160, 40
```

---
# 5. Create mel spectrograms

We now have a train set and a test set. They consist of file lists and their respective classes.

We need to compute features and corresponding labels for each file in our ESC-5.

***Tasks:***
* Define a function that does the following (in this order!):
  * input parameters: an *X*-list and a *y*-list (*our_classes* is read from the notebook's global scope)
  * loops over the *X*-list (hint: *enumerate* it), and loads each file (.wav) using librosa
  * creates the mel spectrogram from the wave data
  * normalizes each mel spec by dividing it by the number of mel bands
  * transposes the mel spec
  * appends the mel spec features to a large list
  * creates a target vector consisting of as many values as there are frames 
    * hint: use .shape to see which value you need
  * each value inside the vector must correspond to the index of the class in *our_classes*
    * hint: remember *numpy.ones(...)* ?
    * hint: use *.index(...)* here. Not the best idea, but works here.
  * appends the targets to a large list
  * stacks the large feature and target lists appropriately
  * returns the lists
* Finally, print the shapes of all 4 arrays.
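
The frame-wise targets and the final stacking can be tried out on random arrays first. This sketch uses made-up sizes and random numbers instead of real mel spectrograms; `toy_classes`, `toy_labels`, `X_toy` and `y_toy` are hypothetical names:

```python
import numpy as np

toy_classes = ['a', 'b', 'c']   # hypothetical class map
toy_labels = ['c', 'a']         # one label per (fake) file
n_mels, n_frames = 3, 4         # tiny sizes for readability

feats, targs = [], []
for label in toy_labels:
    fake_mel = np.random.rand(n_mels, n_frames)  # stand-in with shape (n_mels, frames)
    fake_mel = fake_mel.T                        # transposed: (frames, n_mels)
    feats.append(fake_mel)
    # one target per frame, all set to the class index of this file
    targs.append(np.ones(fake_mel.shape[0]) * toy_classes.index(label))

X_toy = np.vstack(feats)  # frames of all files stacked row-wise: (8, 3)
y_toy = np.hstack(targs)  # targets concatenated: (8,)

print(X_toy.shape, y_toy.shape)
print(y_toy)
```

After stacking, row *i* of the feature array and entry *i* of the target array describe the same frame.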

In [None]:
### START CODING HERE ###
def ...(data_X, data_y):
  X = []  # feature tensor
  y = []  # target tensor

  mel_bands = 128
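  for i, filename in enumerate(tqdm(data_X)):  # wrapping the list (not enumerate) lets tqdm show the total
    wav_data, sr = librosa...  # with default arguments, librosa resamples to 22050 Hz and returns that sr; it will be used for the mel_spec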

    # Features
    mel_spec = ...  # Create mel spectrogram. Output shape: (128, 216) (n_mels, frames)
    mel_spec = ...  # Normalization
    mel_spec = ...  # Transposition. Output shape: (216, 128)
    mel_spec = mel_spec.astype(np.float16)  # Reduce precision to save memory (float32 -> float16)
    ...  # Append to feature tensor

    # Targets == class_name
    targets = np.ones( ... )  # Create placeholder target vector. Output shape: (216,) (Note: silent frames are not going to be labeled as "silent")
    targets = targets * ... ( data_y[...] )  # Replace placeholders with the actual class index (from our_classes)
    ...  # Append to target tensor

  # Stack tensors
  X = np.vstack(X)
  y = np.hstack(y)

  return ..., ...


# Call the function on our data lists
X_train_ready, y_train_ready = ...(X_train, y_train)
X_test_ready, y_test_ready = ...
### END CODING HERE ###


print('\nShapes: X_train_ready: {}, y_train_ready: {}'.format(X_train_ready.shape, y_train_ready.shape))
print('Shapes: X_test_ready: {}, y_test_ready: {}'.format(X_test_ready.shape, y_test_ready.shape))

Expected output:
```
Shapes: X_train_ready: (34560, 128), y_train_ready: (34560,)
Shapes: X_test_ready: (8640, 128), y_test_ready: (8640,)
```
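
Where do 216 frames per file and 34560 rows come from? A short calculation, assuming librosa's defaults are in effect (sampling rate 22050 Hz, hop length 512, centered frames) and 5 s clips:

```python
sr = 22050         # librosa.load default sampling rate (assumed to be in effect)
hop_length = 512   # default hop length of librosa's mel spectrogram (assumed)
clip_seconds = 5   # ESC-50 clip length

n_samples = sr * clip_seconds
frames_per_file = 1 + n_samples // hop_length  # centered framing adds one frame

print(frames_per_file)
print(160 * frames_per_file, 40 * frames_per_file)  # train rows, test rows
```

If you load the files at a different sampling rate, the frame count (and therefore the shapes above) will change accordingly.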



---

# 6. Train a nearest neighbor classifier

***Tasks:***
* Use the features and targets from above to train (*fit*) a kNN-classifier from scikit-learn, with 5 neighbors and uniform weighting.
* Print the scores on the train set and test set, rounded to 4 decimals. (This will take some time!)


In [None]:
# Feature scaling
print('Scaling...')
scaler = StandardScaler()  # zero mean, unit variance normalization (ZMUV)
scaler.fit(X_train_ready)
X_train_ready = scaler.transform(X_train_ready)
X_test_ready = scaler.transform(X_test_ready)


### START CODING HERE ###
print('Fitting...')
model = ...  # Call the kNN classifier. Look at your imports again for a hint.
model...  # Fit/Train the classifier using our generated tensors.

print('Evaluating...')
print('Train score: {}'.format( np.round(model...
...
### END CODING HERE ###

Expected output (might differ slightly):
```
Train score: 0.694
Test score: 0.5138
```
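
The feature scaling step in the cell above can be written out by hand to see what happens. This is a NumPy sketch of zero mean, unit variance scaling on random toy data (`toy_train`, `toy_test` are hypothetical), not a replacement for `StandardScaler`:

```python
import numpy as np

rng = np.random.default_rng(0)
toy_train = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))
toy_test = rng.normal(loc=5.0, scale=3.0, size=(200, 4))

mean = toy_train.mean(axis=0)  # statistics are estimated on the train set only
std = toy_train.std(axis=0)

toy_train_scaled = (toy_train - mean) / std
toy_test_scaled = (toy_test - mean) / std  # the test set reuses the train statistics

print(toy_train_scaled.mean(axis=0).round(3))
print(toy_train_scaled.std(axis=0).round(3))
```

Reusing the train statistics for the test set is the reason why the scaler is fitted on `X_train_ready` only.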



---

# 7. Plot the confusion matrix
***Tasks:***
* Using scikit-learn, create a confusion matrix of our classifier over the test set
* Normalize the rows, use our_classes as tick values
* Display the plot

In [None]:
### START CODING HERE ###
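plot_...  # Note: removed in scikit-learn 1.2; on newer versions use ConfusionMatrixDisplay.from_estimator(...) instead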
plt.xticks(ticks=np.arange(5)...)
plt.y...
plt....
### END CODING HERE ###

Expected output:

```
(a coloured confusion matrix with an emphasis right in the center)
(each row should add up to 1)
(labels from our_classes on x-axis and y-axis)
```
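
What "each row should add up to 1" means can be checked with a hand-made example. The sketch below builds a row-normalized confusion matrix in plain NumPy from made-up labels; scikit-learn does this for you, so this is only meant to show the arithmetic:

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 1, 2])  # made-up true class indices
y_pred = np.array([0, 1, 1, 1, 0, 2])  # made-up predictions
n_classes = 3

counts = np.zeros((n_classes, n_classes))
for t, p in zip(y_true, y_pred):
    counts[t, p] += 1  # rows: true class, columns: predicted class

# divide each row by its sum, so a row shows how one true class was distributed
row_normalized = counts / counts.sum(axis=1, keepdims=True)
print(row_normalized)
```

The diagonal then holds the per-class recall, which is why a good classifier shows a strong diagonal.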

---

# **Getting to know Keras for Neural Networks**

In [None]:
# https://machinelearningmastery.com/keras-functional-api-deep-learning/

from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Input, Dense, Flatten, Conv2D, MaxPooling2D, Dropout
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.utils import to_categorical

# One-hot encoding (OHE) of targets (binary class matrix)

With 5 classes, we have a multi-class problem, so we convert ('categorize') our targets into one-hot vectors. For example, a target with class index 3 becomes [0 0 0 1 0].
(https://www.tensorflow.org/api_docs/python/tf/keras/utils/to_categorical)
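
One-hot encoding itself is simple enough to sketch in NumPy. This is an illustration of the idea with made-up targets, not the Keras implementation:

```python
import numpy as np

toy_targets = np.array([3.0, 0.0, 4.0])  # float class indices, like our frame targets
n_classes = 5

# row k of the identity matrix is the one-hot vector for class k
one_hot = np.eye(n_classes)[toy_targets.astype(int)]

print(one_hot[0])              # [0. 0. 0. 1. 0.]
print(one_hot.argmax(axis=1))  # [3 0 4]
```

Taking the *argmax* of each row recovers the original class index, which will be useful for the confusion matrix later.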

In [None]:
# Shapes so far
print('y_train_ready.shape: {}, y_test_ready.shape: {}'.format( y_train_ready.shape, y_test_ready.shape ))

# Now we OHE
### START CODING HERE ###
y_train_ready_OHE = to_categorical(y_train_ready, num_classes=...)  # How many classes do we have again? (Targets are passed positionally, since the keyword name differs between Keras versions.)
... # Do the same for the test set

# New shapes
print(...)
### END CODING HERE ###

Expected output:
```
y_train_ready.shape: (34560,), y_test_ready.shape: (8640,)
y_train_ready_OHE.shape: (34560, 5), y_test_ready_OHE.shape: (8640, 5)
```

# Building a simple DNN (Sequential API & Functional API)

* Building an NN can be straightforward. We use a very simple, extendable architecture.
* Training and evaluating an NN is very similar to what you saw with the kNN classifier.
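
To get a feeling for what a single dense (fully connected) layer with a ReLU activation computes, here is a NumPy sketch with random, untrained weights. The sizes (128 inputs, 10 units) and the names `W`, `b`, `h` are chosen for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 128))         # a batch of 2 feature vectors
W = rng.normal(size=(128, 10)) * 0.1  # hypothetical weight matrix (inputs x units)
b = np.zeros(10)                      # one bias per unit

h = np.maximum(0.0, x @ W + b)        # affine map followed by ReLU

n_params = W.size + b.size            # trainable parameters of this layer
print(h.shape, n_params)
```

The parameter count (weights plus biases) is the kind of number that a model summary reports per layer.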

In [None]:
# Using the Sequential API
model_s = Sequential()
model_s.add(Dense(128, input_dim=128))       # input layer, 128 input features due to our feature extraction process (Dense == Fully Connected)
model_s.add(Dense(256, activation='relu'))   # hidden layer 1
model_s.add(Dense(64, activation='relu'))    # hidden layer 2
model_s.add(Dense(5, activation='sigmoid'))  # output layer. Usually sigmoid (each output independently between 0 and 1, typical for multi-label tasks) or softmax (all outputs sum up to 1, typical for single-label multi-class tasks)

# Using the Functional API
model_f_in = Input(shape=(128,))                         # input layer
model_f_x = Dense(10, activation='relu')(model_f_in)     # hidden layer 1
model_f_x = Dense(10, activation='relu')(model_f_x)      # hidden layer 2
model_f_x = Dense(10, activation='relu')(model_f_x)      # hidden layer 3
model_f_out = Dense(5, activation='sigmoid')(model_f_x)  # output layer
model_f = Model(inputs=model_f_in, outputs=model_f_out)  # model instance, specified with input and output layers

# We use model_f in the following
model = model_f

# Compile the model and show a summary of it
model.compile(optimizer=SGD(learning_rate=0.001), loss=CategoricalCrossentropy(), metrics=['accuracy'])  # Given this loss function, 'accuracy' means 'Categorical Accuracy'. Notice how 'Categorical' relates to OHE
model.summary()

*   Now we want to train this DNN on the train set with the OHE'd targets.

In [None]:
model.fit(x=X_train_ready, y=y_train_ready_OHE)

Expected output (may differ):
```
1080/1080 [==============================] - 3s 2ms/step - loss: 1.7089 - accuracy: 0.2835
<tensorflow.python.keras.callbacks.History at 0x7f9859532d50>
```
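
A useful reference point for reading this loss value: a model that assigns the same probability to all 5 classes has a categorical cross-entropy of ln(5). The following NumPy sketch computes that baseline by hand (it is not the Keras loss implementation):

```python
import numpy as np

n_classes = 5
y_true = np.eye(n_classes)[[3]]                       # one OHE target for class 3
y_uniform = np.full((1, n_classes), 1.0 / n_classes)  # "no idea" prediction

# categorical cross-entropy: negative log-probability of the true class
loss = float(-np.sum(y_true * np.log(y_uniform), axis=1).mean())

print(round(loss, 4), round(float(np.log(n_classes)), 4))
```

A loss in this range after the first epoch therefore suggests that the network is still close to guessing.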

* Now it's time to evaluate the model.

In [None]:
# Predicting over test set
predictions = model.predict(X_test_ready)
print(predictions.shape)

# Evaluating quality of model
score = model.evaluate(X_test_ready, y_test_ready_OHE, verbose=1)

In [None]:
# Plot confusion matrix
confmat = confusion_matrix(y_test_ready_OHE.argmax(axis=1), predictions.argmax(axis=1), normalize='true')
disp = ConfusionMatrixDisplay(confusion_matrix=confmat, display_labels=our_classes)
disp.plot(xticks_rotation=45)
plt.show()

# Building a simple CNN (Functional API)

* Using the functional API, we are now building a CNN with a few layers.
* Feel free to adjust the number of filters and the kernel sizes, but take care not to shrink the output too much: every convolution and pooling step here reduces the width.
* In the summary, note what happens with the model after each layer, especially after pooling.

**NOTE:** First, have a look at how the model is built.
Then, go on to the next cell. Input shapes are explained there.
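
You can predict the widths shown in the summary with a bit of arithmetic. The sketch assumes the Keras defaults (convolutions with 'valid' padding and stride 1, pooling with a stride equal to the pool size) and the layer order conv, pool, conv, pool, conv; `conv_out` and `pool_out` are hypothetical helper names:

```python
def conv_out(width, kernel):
    """Output width of a 'valid' convolution with stride 1."""
    return width - kernel + 1

def pool_out(width, pool):
    """Output width of non-overlapping pooling (remainder is dropped)."""
    return width // pool

w = 128  # feature dimension of one frame
trace = [w]
for op, size in [('conv', 3), ('pool', 2), ('conv', 3), ('pool', 2), ('conv', 3)]:
    w = conv_out(w, size) if op == 'conv' else pool_out(w, size)
    trace.append(w)

print(trace)    # [128, 126, 63, 61, 30, 28]
print(w * 128)  # flattened size with 128 filters in the last conv layer: 3584
```

Larger kernels or more pooling steps shrink the width faster, which is what the warning above is about.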

In [None]:
model_cnn_in = Input(shape=(1,128,1))  # Only 3 dims (height, width, channels): 'num_examples' is left unspecified. Note the (None) in the summary. (Also, see next cell for more info.)
model_cnn_conv1 = Conv2D(filters=32, kernel_size=(1, 3), activation='relu')(model_cnn_in)  # Since input is a vector (1, 128), we span the kernel over the feature dimension only
model_cnn_pool1 = MaxPooling2D(pool_size=(1, 2))(model_cnn_conv1)  # Same as above: We pool only in frequency dimension
model_cnn_conv2 = Conv2D(filters=64, kernel_size=(1, 3), activation='relu')(model_cnn_pool1)
model_cnn_pool2 = MaxPooling2D(pool_size=(1, 2))(model_cnn_conv2)
model_cnn_conv3 = Conv2D(filters=128, kernel_size=(1, 3), activation='relu')(model_cnn_pool2)
model_cnn_flat = Flatten()(model_cnn_conv3)
model_cnn_drop1 = Dropout(0.2)(model_cnn_flat)
model_cnn_dense1 = Dense(64, activation='relu')(model_cnn_drop1)
model_cnn_out = Dense(5, activation='sigmoid')(model_cnn_dense1)
model_cnn = Model(inputs=model_cnn_in, outputs=model_cnn_out)

model_cnn.compile(optimizer=SGD(learning_rate=0.1), loss=CategoricalCrossentropy(), metrics=['accuracy'])
model_cnn.summary()

**CNN: Preparation of tensors**
* For 2D convolutional layers, we need to reshape our tensors.
* 2D conv layers take a 4D-input of the form NHWC (*num_examples, height, width, channels*).
  * *num_examples*: (in this case) total number of frames (here: 34560 for the train set)
  * *height*: number of time frames per example (here: 1)
  * *width*: feature dimension (here: 128)
  * *channels*: audio is mono, so 1

**NOTE:** Above numbers are valid for our frame-based approach.
We could also translate *num_examples* into number of files, then each example would have a *height* of 216 (each file has 216 frames).
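
Both layouts can be compared on a small dummy array. The sketch uses zeros for 3 hypothetical files; note that the file-based reshape only works because the frames of one file sit next to each other after stacking, and it would need one label per file instead of one per frame:

```python
import numpy as np

n_files, frames_per_file, n_mels = 3, 216, 128
frames = np.zeros((n_files * frames_per_file, n_mels), dtype=np.float32)

# frame-based: every frame is its own example with height 1
frame_based = frames[:, np.newaxis, :, np.newaxis]

# file-based: every file is one example with height 216
file_based = frames.reshape(n_files, frames_per_file, n_mels, 1)

print(frame_based.shape, file_based.shape)
```

Indexing with `np.newaxis` has the same effect as calling `np.expand_dims` on the respective axis.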

In [None]:
# Convert tensors to correct input shape (for both training and test sets)
print('Shapes: X_train_ready {}, y_train_ready_OHE {}'.format(X_train_ready.shape, y_train_ready_OHE.shape))
# X_train_ready (34560, 128)    <-- needs format adjustment
# y_train_ready_OHE (34560, 5)  <-- already in correct format


### START CODING HERE ###
# Add channel dimension
X_train_cnn = np.expand_dims(X_train_ready, axis=-1)  # (34560, 128, 1)
... # Do the same for the test set

# Add height dimension
X_train_cnn = np.expand_dims(X_train_cnn, axis=1)  # (34560, 1, 128, 1)
... # Do the same for the test set

# New shapes
print(...)  # Print shapes of the CNN-ready train set and test set 
### END CODING HERE ###

Expected output:
```
Shapes: X_train_ready (34560, 128), y_train_ready_OHE (34560, 5)
Final shapes: X_train_cnn (34560, 1, 128, 1), X_test_cnn (8640, 1, 128, 1)
```

In [None]:
# Start fitting
model_cnn.fit(x=X_train_cnn, y=y_train_ready_OHE, batch_size=16)

Expected output (or similar):
```
2160/2160 [==============================] - 20s 9ms/step - loss: 0.8811 - accuracy: 0.6036
<tensorflow.python.keras.callbacks.History at 0x7ffa6e4c93d0>
```

In [None]:
# Predicting over test set
predictions = model_cnn.predict(X_test_cnn)
print(predictions.shape)


### START CODING HERE ###
# Evaluating quality of model
model_cnn....
### END CODING HERE ###

Expected output (or similar):
```
(8640, 5)
270/270 [==============================] - 1s 5ms/step - loss: 1.0113 - accuracy: 0.5602
```

* For the confusion matrix, we consider the *argmax* again.
* This means: We retrieve the actual index of the predicted class and compare it with the actual index of the true class.
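
The comparison of predicted and true indices can be sketched with three made-up predictions (the arrays `probs` and `ohe` are hypothetical):

```python
import numpy as np

probs = np.array([[0.1, 0.7, 0.2],   # model outputs for 3 examples
                  [0.6, 0.3, 0.1],
                  [0.2, 0.2, 0.6]])
ohe = np.array([[0, 1, 0],           # OHE'd true classes
                [0, 0, 1],
                [0, 0, 1]])

pred_idx = probs.argmax(axis=1)  # index of the highest output per example
true_idx = ohe.argmax(axis=1)    # index of the 1 in each OHE vector

accuracy = float((pred_idx == true_idx).mean())
print(pred_idx, true_idx, accuracy)
```

The fraction of matching indices is the categorical accuracy; the confusion matrix shows the same comparison broken down by class.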

In [None]:
### START CODING HERE ###
# Plot confusion matrix
confmat = confusion_matrix(y_test_ready_OHE.argmax(axis=1), predictions.argmax(axis=1), normalize='true')
disp = ConfusionMatrixDisplay(confusion_matrix=..., display_labels=...)
disp.plot(...)  # rotate labels on x-axis for readability
...  # display plot
### END CODING HERE ###

Expected output:

```
(a coloured confusion matrix)
(each row should add up to 1)
(labels from our_classes on x-axis and y-axis)
```

# Hyper-parameter tuning

Try playing around with a different

* learning rate (go in steps of an order of magnitude, e.g. 0.1, 0.01, 0.001, ...)
* batch size (e.g. 1, 16, 32, 216; a batch size of 216 results in 160 batches per epoch. Where have you seen this number before?)

and see how it affects the results.
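
The batch size directly determines the number of steps per epoch that the progress bar shows. A quick calculation for our 34560 training frames (the variable names are hypothetical):

```python
import math

n_train_frames = 34560
steps_per_epoch = {}
for batch_size in (1, 16, 32, 216):
    # the last batch may be smaller, hence the rounding up
    steps_per_epoch[batch_size] = math.ceil(n_train_frames / batch_size)

print(steps_per_epoch)
```

Smaller batches mean more weight updates per epoch, but also longer training time per epoch.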