# Training

First, import the required libraries for this notebook.

In [1]:
import os
from pathlib import Path

import boto3
import librosa
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

from sklearn.metrics import classification_report

Load the *ESC-10* dataset.

In [2]:
esc50 = pd.read_csv("data/ESC-50-master/meta/esc50.csv")
esc10 = esc50[esc50.esc10].reset_index(drop=True)
esc10.head()

Unnamed: 0,filename,fold,target,category,esc10,src_file,take
0,1-100032-A-0.wav,1,0,dog,True,100032,A
1,1-110389-A-0.wav,1,0,dog,True,110389,A
2,1-116765-A-41.wav,1,41,chainsaw,True,116765,A
3,1-17150-A-12.wav,1,12,crackling_fire,True,17150,A
4,1-172649-A-40.wav,1,40,helicopter,True,172649,A


## Prepare the Input

Modify the dataframe to contain the full relative path to each audio file.

In [3]:
def filename_to_relative_path(filename: str):
    return str(Path("data/ESC-50-master/audio", filename))

paths = esc10.filename.apply(filename_to_relative_path)
data_without_signals = esc10.assign(filename=paths)

data_without_signals.head()

Unnamed: 0,filename,fold,target,category,esc10,src_file,take
0,data/ESC-50-master/audio/1-100032-A-0.wav,1,0,dog,True,100032,A
1,data/ESC-50-master/audio/1-110389-A-0.wav,1,0,dog,True,110389,A
2,data/ESC-50-master/audio/1-116765-A-41.wav,1,41,chainsaw,True,116765,A
3,data/ESC-50-master/audio/1-17150-A-12.wav,1,12,crackling_fire,True,17150,A
4,data/ESC-50-master/audio/1-172649-A-40.wav,1,40,helicopter,True,172649,A


Preprocess each audio file.
Convert the original signal into a MEL spectrogram.

The MEL spectrogram is 2-dimensional: `(mels, time)`.
The time dimension depends on the length of each sample.
To easily deal with audio samples of different durations, calculate the mean value of each MEL across the time axis.
The input for the model will be an array of 128 MELs.
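
As a minimal sketch of the time-averaging step, the following example uses random arrays in place of real spectrograms. The frame counts (216 and 431) are arbitrary assumptions chosen for illustration. Clips of different durations reduce to feature vectors of the same length:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-ins for the MEL spectrograms of two clips with different durations.
# The shape is (mels, time); the frame counts are arbitrary.
short_clip = rng.random((128, 216))
long_clip = rng.random((128, 431))

# Averaging across the time axis (axis=1) removes the time dimension.
short_features = np.mean(short_clip, axis=1)
long_features = np.mean(long_clip, axis=1)

print(short_features.shape, long_features.shape)
```

Both vectors have 128 elements, so the model can use a fixed input size regardless of clip length. The trade-off is that averaging discards how the sound changes over time.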


This process will take several seconds.

```{note}
Typically, you should perform preprocessing only once and store the results.
```
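
One possible way to follow this advice is to cache the computed features in a file. The sketch below is an assumption, not part of the original notebook: the `load_or_compute` helper, the `features.npy` file name, and the stand-in preprocessing function are all hypothetical.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_or_compute(cache_file: Path, compute):
    """Return the cached array if it exists; otherwise compute and store it."""
    if cache_file.exists():
        return np.load(cache_file)
    features = compute()
    np.save(cache_file, features)
    return features


calls = []


def expensive_preprocessing():
    # Hypothetical stand-in for extracting MEL features from every clip.
    calls.append(1)
    return np.ones((4, 128), dtype=np.float32)


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp, "features.npy")
    first = load_or_compute(cache_file, expensive_preprocessing)
    second = load_or_compute(cache_file, expensive_preprocessing)

# The second call reads the cached file instead of recomputing.
print(len(calls))
```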

In [4]:
def wav_to_mels(signal, sample_rate):
    """
    Takes a wav signal and produces the MEL spectrogram data, required by the model as input
    """
    mels = np.mean(
        librosa.feature.melspectrogram(y=signal, sr=sample_rate, n_mels=128),
        axis=1
    )
    return mels


def file_to_mels(filename: str):
    signal, sr = librosa.load(filename)
    return wav_to_mels(signal, sr)


signals = data_without_signals.filename.apply(file_to_mels)
data = data_without_signals.assign(signal=signals)
data.head()

Unnamed: 0,filename,fold,target,category,esc10,src_file,take,signal
0,data/ESC-50-master/audio/1-100032-A-0.wav,1,0,dog,True,100032,A,"[1.8513412e-06, 1.125981e-05, 1.77284e-05, 3.1..."
1,data/ESC-50-master/audio/1-110389-A-0.wav,1,0,dog,True,110389,A,"[0.13045727, 0.0018769688, 0.0002673243, 0.001..."
2,data/ESC-50-master/audio/1-116765-A-41.wav,1,41,chainsaw,True,116765,A,"[0.005726964, 0.027042735, 0.3444246, 4.481706..."
3,data/ESC-50-master/audio/1-17150-A-12.wav,1,12,crackling_fire,True,17150,A,"[5.335389, 2.6672785, 0.7507066, 0.56450623, 0..."
4,data/ESC-50-master/audio/1-172649-A-40.wav,1,40,helicopter,True,172649,A,"[49.48378, 37.91428, 40.788185, 70.7473, 90.09..."


The `signal` values are arrays of length 128.
These values will be the input of the model.

## Prepare the Output

The output value of the model should be the probability of each class, given a particular input.
Machine learning models treat categorical outputs as numbers, so you must map each category (or class) to a number.

Add the `category_id` column to the dataframe.
This column maps classes in the `category` column to numerical indexes.

In [5]:
classes = list(data.category.unique())
class_ids = {c: idx for idx, c in enumerate(classes)}
class_ids_column = data.category.apply(lambda category: class_ids[category])
data = data.assign(category_id=class_ids_column)
data

Unnamed: 0,filename,fold,target,category,esc10,src_file,take,signal,category_id
0,data/ESC-50-master/audio/1-100032-A-0.wav,1,0,dog,True,100032,A,"[1.8513412e-06, 1.125981e-05, 1.77284e-05, 3.1...",0
1,data/ESC-50-master/audio/1-110389-A-0.wav,1,0,dog,True,110389,A,"[0.13045727, 0.0018769688, 0.0002673243, 0.001...",0
2,data/ESC-50-master/audio/1-116765-A-41.wav,1,41,chainsaw,True,116765,A,"[0.005726964, 0.027042735, 0.3444246, 4.481706...",1
3,data/ESC-50-master/audio/1-17150-A-12.wav,1,12,crackling_fire,True,17150,A,"[5.335389, 2.6672785, 0.7507066, 0.56450623, 0...",2
4,data/ESC-50-master/audio/1-172649-A-40.wav,1,40,helicopter,True,172649,A,"[49.48378, 37.91428, 40.788185, 70.7473, 90.09...",3
...,...,...,...,...,...,...,...,...,...
395,data/ESC-50-master/audio/5-233160-A-1.wav,5,1,rooster,True,233160,A,"[6.2960116e-06, 6.199518e-05, 0.0006980941, 0....",8
396,data/ESC-50-master/audio/5-234879-A-1.wav,5,1,rooster,True,234879,A,"[0.0020287747, 0.0066939634, 0.024266712, 0.03...",8
397,data/ESC-50-master/audio/5-234879-B-1.wav,5,1,rooster,True,234879,B,"[0.007926809, 0.025507677, 0.034778807, 0.0476...",8
398,data/ESC-50-master/audio/5-235671-A-38.wav,5,38,clock_tick,True,235671,A,"[9.167025, 1.9049535, 0.17857115, 0.045147046,...",6


## Create the Training and Validation Subsets

Split the data into training and validation subsets.

* Folds 1 to 4 for training
* Fold 5 for validation

Typically, in machine learning projects, the data is split into three subsets: training, validation, and test.
The test subset is used to ensure that the final model is not biased towards the validation subset.

Given the small amount of data used in this demo, you only use the training and validation subsets.
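
For comparison, a three-way split by fold could look like the following sketch. The fold assignment (folds 1 to 3 for training, fold 4 for validation, fold 5 for test) and the toy records are assumptions for illustration; this notebook does not use a test subset.

```python
# Toy records of (clip name, fold). Real code would filter the dataframe instead.
clips = [(f"clip_{fold}_{i}", fold) for fold in range(1, 6) for i in range(2)]

# Hypothetical assignment: folds 1-3 train, fold 4 validation, fold 5 test.
train = [clip for clip in clips if clip[1] <= 3]
validation = [clip for clip in clips if clip[1] == 4]
test = [clip for clip in clips if clip[1] == 5]

print(len(train), len(validation), len(test))
```

Splitting by fold, rather than at random, keeps all clips from one fold in the same subset.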

In [6]:
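# Reset the index so that positional lookups in the Dataset match the row labels
training_data = data[data.fold < 5].reset_index(drop=True)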
print("Number of clips: ", training_data.shape[0])

Number of clips:  320


In [7]:
validation_data = data[data.fold == 5]
validation_data = validation_data.reset_index(drop=True)

print("Number of clips: ", validation_data.shape[0])

Number of clips:  80


Extract the columns that the model requires for training and validation.

In [8]:
X_train, y_train = training_data.signal, training_data.category_id
X_validation, y_validation = validation_data.signal, validation_data.category_id

Define the training and validation data loaders that PyTorch requires to feed data into the model.

In [9]:
class Esc10(Dataset):

    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return self.X.shape[0]

    def __getitem__(self, index):
        return self.X[index], self.y[index]


training_data = Esc10(X_train, y_train)
validation_data = Esc10(X_validation, y_validation)

# Provide training data in batches of 20
train_dataloader = DataLoader(
    training_data, batch_size=20, shuffle=True
)

# Provide a single batch of validation data (80 samples)
validation_dataloader = DataLoader(
    validation_data, batch_size=len(validation_data)
)
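
The batch arithmetic for these loaders can be checked with a short sketch. The clip counts and batch size come from this notebook; the variable names are illustrative only.

```python
import math

num_training_clips = 320  # folds 1 to 4
num_validation_clips = 80  # fold 5
batch_size = 20

# Number of batches that the training loader yields in each epoch.
batches_per_epoch = math.ceil(num_training_clips / batch_size)

# The validation loader uses the whole subset as its batch size.
validation_batches = math.ceil(num_validation_clips / num_validation_clips)

# The training loop in this notebook logs when i % 10 == 9 (i starts at 0).
logged_batches = [i + 1 for i in range(batches_per_epoch) if i % 10 == 9]

print(batches_per_epoch, validation_batches, logged_batches)
```

With 16 batches per epoch, the logging condition is met only at batch 10, which is why the training log shows a single line per epoch.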

## Train the Model

Create the model class.
The following class defines a feedforward neural network with three layers.

The input size of the first layer must equal the size of the input data.
In this case, the expected input is an array of 128 elements.

The output layer must return an array of length `NUM_CLASSES`, which is ten in this case.
The array contains one score per class; applying softmax to these scores gives the class probabilities.
For example, `output[0]` corresponds to category id `0`, which maps to the `dog` class.
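
The following sketch shows how an output array could be decoded into a class name. The scores are made up for illustration, and the `softmax` helper is written by hand here rather than taken from the notebook. The class order matches the category-to-id mapping created earlier.

```python
import math

# Class order produced by the category-to-id mapping in this notebook.
classes = [
    "dog", "chainsaw", "crackling_fire", "helicopter", "rain",
    "crying_baby", "clock_tick", "sneezing", "rooster", "sea_waves",
]


def softmax(scores):
    # Subtract the maximum score for numerical stability.
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Made-up raw scores for a single clip, one per class.
scores = [2.0, 0.5, 0.1, -1.0, 0.0, 0.3, -0.5, 0.2, 1.0, -2.0]

probabilities = softmax(scores)
predicted_id = probabilities.index(max(probabilities))

print(classes[predicted_id], round(sum(probabilities), 6))
```

The highest score is at index `0`, so the predicted class is `dog`, and the probabilities add up to 1.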

In [10]:
NUM_CLASSES = len(classes)


class MyNet(nn.Module):

    def __init__(self):
        super().__init__()
        self.input_layer = nn.Linear(128, 64)
        self.hidden_layer = nn.Linear(64, 16)
        self.output_layer = nn.Linear(16, NUM_CLASSES)

    def forward(self, x):
        x = F.relu(self.input_layer(x))
        x = F.relu(self.hidden_layer(x))
        x = self.output_layer(x)
        return x
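
A quick way to confirm the layer sizes is to pass a random batch through the network and inspect the output shape. This sketch assumes that PyTorch is installed, repeats the class definition so that it runs on its own, and hard-codes `NUM_CLASSES` to ten. The batch size of 4 is arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 10  # hard-coded for this standalone sketch


class MyNet(nn.Module):

    def __init__(self):
        super().__init__()
        self.input_layer = nn.Linear(128, 64)
        self.hidden_layer = nn.Linear(64, 16)
        self.output_layer = nn.Linear(16, NUM_CLASSES)

    def forward(self, x):
        x = F.relu(self.input_layer(x))
        x = F.relu(self.hidden_layer(x))
        return self.output_layer(x)


model = MyNet()

# A random batch of 4 inputs, each with 128 values.
batch = torch.randn(4, 128)

with torch.no_grad():
    outputs = model(batch)

# One row per input and one column per class.
print(tuple(outputs.shape))
```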

Create functions for training and evaluating the model.

In [11]:
def train(model: nn.Module, epochs: int):

    # Loss and Optimizer functions
    # CrossEntropyLoss applies softmax to the output for you.
    # Softmax is required for multi-class classification, where the output
    # probabilities of predicted classes must add up to 1.
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

    for epoch in range(epochs):
        running_loss = 0.0
        for i, data in enumerate(train_dataloader, 0):
            # Each iteration gets a batch of data.
            # Each batch contains:
            # [
            #   [clip_1_signal, clip_2_signal, ..., clip_20_signal],
            #   [clip_1_category_id, clip2_category_id, ..., clip_20_category_id],
            # ]
            signals, category_ids = data

            # Set training gradients to zero
            # to avoid tracking info from previous batches
            optimizer.zero_grad()

            # Pass input through the network
            outputs = model(signals)
            # Calculate loss function
            loss = criterion(outputs, category_ids)
            # Backpropagate to compute gradients, then update the weights
            loss.backward()
            optimizer.step()

            # Print statistics every 10 batches
            running_loss += loss.item()
            if i % 10 == 9:
                print(
                    f"[Epoch {epoch + 1}, Batch {i + 1:5d}]"
                    f"training loss: {running_loss / 10:.3f}"
                )
                running_loss = 0.0


def evaluate(model: nn.Module):
    # Inference-only: disable gradient calculation
    with torch.no_grad():
        signals, category_ids = next(iter(validation_dataloader))

        outputs = model(signals)

        _, predictions = torch.max(outputs.data, 1)

        report = classification_report(
            y_validation,
            predictions,
            target_names=classes
        )
        print(report)

Create and train the model.

In [12]:
model = MyNet()

train(model, epochs=200)
print("Finished Training")

[Epoch 1, Batch    10]training loss: 2.490
[Epoch 2, Batch    10]training loss: 2.093
[Epoch 3, Batch    10]training loss: 1.899
[Epoch 4, Batch    10]training loss: 1.844
[Epoch 5, Batch    10]training loss: 1.690
[Epoch 6, Batch    10]training loss: 1.703
[Epoch 7, Batch    10]training loss: 1.664
[Epoch 8, Batch    10]training loss: 1.619
[Epoch 9, Batch    10]training loss: 1.695
[Epoch 10, Batch    10]training loss: 1.595
[Epoch 11, Batch    10]training loss: 1.545
[Epoch 12, Batch    10]training loss: 1.524
[Epoch 13, Batch    10]training loss: 1.459
[Epoch 14, Batch    10]training loss: 1.398
[Epoch 15, Batch    10]training loss: 1.465
[Epoch 16, Batch    10]training loss: 1.370
[Epoch 17, Batch    10]training loss: 1.452
[Epoch 18, Batch    10]training loss: 1.363
[Epoch 19, Batch    10]training loss: 1.372
[Epoch 20, Batch    10]training loss: 1.291
[Epoch 21, Batch    10]training loss: 1.277
[Epoch 22, Batch    10]training loss: 1.291
[Epoch 23, Batch    10]training loss: 1.2

Evaluate the model.
The following report provides multiple metrics: accuracy, precision, recall, and f1-score, by class and aggregated.

Typically, you should use larger datasets to achieve better evaluation results.

In [13]:
evaluate(model)

                precision    recall  f1-score   support

           dog       0.25      0.25      0.25         8
      chainsaw       0.86      0.75      0.80         8
crackling_fire       0.36      0.50      0.42         8
    helicopter       0.25      0.12      0.17         8
          rain       0.50      0.62      0.56         8
   crying_baby       0.50      0.38      0.43         8
    clock_tick       0.64      0.88      0.74         8
      sneezing       0.60      0.75      0.67         8
       rooster       0.57      0.50      0.53         8
     sea_waves       0.83      0.62      0.71         8

      accuracy                           0.54        80
     macro avg       0.54      0.54      0.53        80
  weighted avg       0.54      0.54      0.53        80
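
To see how the per-class metrics relate to each other, the following sketch recomputes one row of the report from prediction counts. The counts are inferred to be consistent with the `chainsaw` row (8 clips, 6 of them found, 7 clips predicted as `chainsaw`); the notebook does not print them, so treat them as an assumption.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute the metrics from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp)  # share of predictions for the class that are correct
    recall = tp / (tp + fn)  # share of the class's clips that the model found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Inferred counts for the chainsaw class.
precision, recall, f1 = precision_recall_f1(tp=6, fp=1, fn=2)

print(f"{precision:.2f} {recall:.2f} {f1:.2f}")  # 0.86 0.75 0.80
```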



## Export the Model for Serving and Upload

To serve the trained model with RHODS Model Serving, convert the PyTorch model to ONNX format.
ONNX is an open source format for AI models.

In [14]:
onnx_filename = "sound_classifier.onnx"
input_names = ["mels"]
output_names = ["output"]
# The ONNX export infers the input shape from an example input
example_input = torch.randn(1, 128)

torch.onnx.export(model, example_input, onnx_filename, input_names=input_names, output_names=output_names)

Upload the model to the storage layer.

In [18]:
key_id = os.getenv("AWS_ACCESS_KEY_ID")
secret_key = os.getenv("AWS_SECRET_ACCESS_KEY")
region = os.getenv("AWS_DEFAULT_REGION")
endpoint = os.getenv("AWS_S3_ENDPOINT")
bucket_name = os.getenv("AWS_S3_BUCKET")

s3_client = boto3.client(
    "s3",
    region,
    aws_access_key_id=key_id,
    aws_secret_access_key=secret_key,
    endpoint_url=endpoint,
    use_ssl=True
)

s3_client.upload_file(onnx_filename, bucket_name, Key=onnx_filename)

print(f"File {onnx_filename} uploaded to S3!")

File sound_classifier.onnx uploaded to S3!
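
Because `os.getenv` returns `None` for a variable that is not set, a missing setting can surface later as a less obvious error. As a hedged sketch, a small check before creating the client could report missing settings early. The `missing_settings` helper and the example values are hypothetical and not part of the original notebook.

```python
import os

REQUIRED_VARS = [
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
    "AWS_S3_ENDPOINT",
    "AWS_S3_BUCKET",
]


def missing_settings(environ=os.environ):
    """Return the names of the required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


# Example with an incomplete, made-up environment.
example_env = {"AWS_ACCESS_KEY_ID": "example", "AWS_S3_BUCKET": "example-bucket"}

print(missing_settings(example_env))
```

In the notebook, calling `missing_settings()` without arguments would check the real environment.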
