
Audio Classifier Tutorial (Fast.ai)
===================================
**Author**: `Bruno Lima <https://github.com/limazix>`_

This is a recreation of the PyTorch audio classification tutorial using the Fast.ai framework. It relies on two libraries:
- [torchaudio](https://github.com/pytorch/audio)
- [fastai](https://docs.fast.ai/)



First, let's import ``os`` and the common packages ``pandas`` and ``numpy``.

In [11]:
%matplotlib inline
import os

import pandas as pd
import numpy as np

Importing the Dataset
---------------------

We will use the UrbanSound8K dataset to train our network. It is
available for free `here <https://urbansounddataset.weebly.com/>`_ and contains
10 audio classes with over 8000 audio samples! Once you have downloaded
the compressed dataset, extract it to your current working directory.
First, we will look at the csv file that provides information about the
individual sound files. ``pandas`` allows us to open the csv file and
use ``.iloc[]`` to access the data within it.




Once the dataset has been extracted into the `data` folder, let's map its paths to global variables.

In [12]:
DATA_ROOT_DIR=os.path.normpath(os.path.join(os.getcwd(), 'data/UrbanSound8k'))
DATA_META_FILE=os.path.join(DATA_ROOT_DIR, 'metadata/UrbanSound8k.csv')
DATA_AUDIO_DIR=os.path.join(DATA_ROOT_DIR, 'audio')

In [13]:
data_meta = pd.read_csv(DATA_META_FILE)
print(data_meta.iloc[0, :])

slice_file_name    100032-3-0-0.wav
fsID                         100032
start                             0
end                        0.317551
salience                          1
fold                              5
classID                           3
class                      dog_bark
Name: 0, dtype: object


The 10 audio classes in the UrbanSound8K dataset are: `air_conditioner`, `car_horn`, `children_playing`, `dog_bark`, `drilling`, `engine_idling`, `gun_shot`, `jackhammer`, `siren`, and `street_music`.

Each class has its own id, and every file is stored in one of the dataset's `fold` folders. For instance, the output above shows that the audio file has:
- **class** - dog_bark
- **id** - 3
- **fold** - 5

We only need the class label and the file path relative to the audio directory.

In [14]:
def build_file_path(item):
    item['path'] = 'fold{}/{}'.format(item['fold'], item['slice_file_name'])
    item['label'] = item['class']
    return item[['path', 'label']]

data = data_meta.apply(build_file_path, axis=1)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8732 entries, 0 to 8731
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   path    8732 non-null   object
 1   label   8732 non-null   object
dtypes: object(2)
memory usage: 136.6+ KB
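
As a quick sanity check of the path format, the relative path built above can be joined with the audio directory by hand. This is a minimal sketch with plain strings: `relative_audio_path` is a hypothetical helper that is not part of the notebook, and the audio directory is written as a short relative path purely for illustration.

```python
import posixpath


def relative_audio_path(fold, slice_file_name):
    """Build the 'fold<N>/<file>' string stored in the `path` column."""
    return 'fold{}/{}'.format(fold, slice_file_name)


# Values taken from the first metadata row printed earlier
rel_path = relative_audio_path(5, '100032-3-0-0.wav')

# The item list later joins this with the audio directory; posixpath keeps
# the separator fixed to '/' for the sake of the illustration
full_path = posixpath.join('data/UrbanSound8k/audio', rel_path)
print(full_path)
```

Keeping the paths relative means the dataset can be moved without rebuilding the DataFrame; only the root directory changes.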


---

## Fast.ai Data Structure

To train a deep neural network (DNN) with Fast.ai, it's necessary to create a DataBunch, the framework's basic data structure.

Fast.ai has a few built-in DataBunch classes, such as ImageDataBunch for image processing and TextDataBunch for NLP, but none for audio processing yet. Therefore, we'll have to create a custom one.

The simplest way to do that is to use the ItemList and ItemBase classes. The first one creates a structure that provides operations to manipulate the dataset, while the second one defines the behavior of a single item.

In [15]:
import torchaudio
from fastai.data_block import ItemList, ItemBase

class AudioItem(ItemBase):

    def apply_tfms(self, tfms):
        if not tfms: return self
        
        for t in tfms:
            self.data = t(self.data)
        return self
        
class AudioItemList(ItemList):
    
    def get(self, i):
        fn = super().get(i)
        file_path = os.path.join(self.path, fn)
        return self.open(file_path)

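    def open(self, fn):
        # torchaudio.load returns the waveform (channels x samples) and its sample rate
        audio, _sample_rate = torchaudio.load(fn)
        # Keep only the first channel, reshaped to (1, samples)
        return AudioItem(audio[0, :].view(1, -1))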
    

To keep it simple, we create two methods for the custom ItemList:
- **get** - Method to retrieve one single item from the dataset given its position;
- **open** - Method used to load the audio file and build the AudioItem object.

And one for the custom ItemBase:
- **apply_tfms** - Method used to apply the transformations to each item of the dataset.

### Data Transformation

Like any machine learning system, it's necessary to perform a few preprocessing steps on the data before running the intended algorithm. The original tutorial performs three transformations:
- Resample - It downsamples the audio from 44.1 kHz to 8 kHz
- MaxClips - It fixes the length at 160000 samples, padding with zeros or truncating when necessary
- ClipFrequency - It keeps every 5th sample in the final output and ignores the rest

The `torchaudio` library already has an implementation of the first transformation, but not of the last two, so we need to create them.
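
Before the tensor versions, the logic of the last two transformations can be sketched on plain Python lists. This is only an illustration with tiny made-up sequences; `pad_or_truncate` and `keep_every` are hypothetical names, not functions from the notebook or from `torchaudio`.

```python
def pad_or_truncate(samples, length):
    """Return exactly `length` samples, zero-padding or cutting as needed."""
    padding = [0.0] * (length - len(samples))  # empty when samples is long enough
    return (samples + padding)[:length]


def keep_every(samples, step):
    """Keep one sample out of every `step` samples, starting with the first."""
    return samples[::step]


short_clip = [0.1, 0.2, 0.3]
long_clip = [float(i) for i in range(12)]

fixed_short = pad_or_truncate(short_clip, 10)  # padded with seven zeros
fixed_long = pad_or_truncate(long_clip, 10)    # last two samples dropped
thinned = keep_every(fixed_long, 5)            # samples at positions 0 and 5

print(fixed_short)
print(fixed_long)
print(thinned)
```

With the real parameters, a fixed length of 160000 and a step of 5 produce 32000 samples per item.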

In [19]:
import torch

class MaxClips(object):
    
    def __init__(self, max_clips):
        self.max_clips = max_clips

    def __call__(self, sound_base):

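        # NOTE: this cast truncates every sample toward zero. If the loaded
        # waveform is normalized to [-1, 1], most samples become 0, so
        # dropping this line may be preferable.
        sound_base = sound_base.long()

        # `sound` stays zero-padded when the audio clip is too short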
        sound = torch.zeros([self.max_clips, 1])
        if sound_base.numel() < self.max_clips:
            sound[:sound_base.numel()] = sound_base.view(-1, 1)[:]
        else:
            sound[:] = sound_base.view(-1, 1)[:self.max_clips]

        return sound

class ClipFrequency(object):
    
    def __init__(self, size, frequency):
        self.frequency = frequency
        self.size = size
        
    def __call__(self, sound_base):
        sound = torch.zeros([self.size, 1])
        # keep one sample out of every `frequency` samples of sound_base
        sound[:self.size] = sound_base[::self.frequency]
        sound = sound.permute(1, 0)
        
        return sound

Now, we can create the transformation pipeline.

In [17]:
from torchaudio.transforms import Resample

transforms = [
    Resample(orig_freq=44100, new_freq=8000),
    MaxClips(max_clips=160000),
    ClipFrequency(size=32000, frequency=5)
]
print(transforms)

[Resample(), <__main__.MaxClips object at 0x12e1e2ef0>, <__main__.ClipFrequency object at 0x12e1e28d0>]
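
To see what the pipeline does to the length of one item, the sample counts can be traced with integer arithmetic. This is a rough sketch under two assumptions: the input is a hypothetical 4-second clip sampled at 44.1 kHz, and the resampled length is approximated as the ceiling of `n * new_freq / orig_freq`. The function `pipeline_lengths` is illustrative only.

```python
ORIG_FREQ, NEW_FREQ = 44100, 8000
MAX_CLIPS, STEP = 160000, 5


def ceil_div(a, b):
    """Integer ceiling division, avoiding floating point rounding."""
    return -(-a // b)


def pipeline_lengths(n_samples):
    """Return the sample count after each of the three transformations."""
    resampled = ceil_div(n_samples * NEW_FREQ, ORIG_FREQ)  # Resample
    fixed = MAX_CLIPS                                      # MaxClips pads or cuts
    thinned = ceil_div(fixed, STEP)                        # ClipFrequency
    return resampled, fixed, thinned


four_seconds = 4 * ORIG_FREQ
resampled, fixed, thinned = pipeline_lengths(four_seconds)
print(resampled, fixed, thinned)
```

Under these assumptions, the resampled clip is shorter than the fixed length, so the remaining positions are filled with zeros before every 5th sample is kept.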


### DataBunch Creation

We already have the minimum necessary to create the DataBunch. Here, we perform five operations:
- **from_df** - It transforms the DataFrame with two columns into an AudioItemList instance;
- **split_subsets** - It splits the dataset in two, a training and a validation set, based on the given proportions;
- **label_from_df** - It tells the ItemList to use the `label` column of the dataset as the item class;
- **transform** - It applies the transformation pipeline to both datasets, training and validation;
- **databunch** - It creates the DataBunch with a batch size of 128 items per batch.

In [20]:
data_bunch = (AudioItemList.from_df(data, path=DATA_AUDIO_DIR)
              .split_subsets(train_size=0.8, valid_size=0.2)
              .label_from_df(cols='label')
              .transform((transforms, transforms))
              .databunch(bs=128))
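
The idea behind a proportional random split can be sketched without the framework. The snippet below is an assumption about the general technique, not Fast.ai's actual implementation: `split_indices` and its `seed` parameter are hypothetical, and the item count comes from the `data.info()` output shown earlier.

```python
import random


def split_indices(n_items, train_size, valid_size, seed=0):
    """Shuffle the item indices and cut them into two disjoint subsets."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)
    n_train = int(n_items * train_size)
    n_valid = int(n_items * valid_size)
    return indices[:n_train], indices[n_train:n_train + n_valid]


train_idx, valid_idx = split_indices(8732, train_size=0.8, valid_size=0.2)
print(len(train_idx), len(valid_idx))
```

Because the sizes are truncated to integers, one item may end up in neither subset when the proportions do not divide the dataset evenly.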

---

## Model

The tutorial modeled its model class on the M5 CNN architecture for audio processing and classification presented by [Wei Dai et al.](https://arxiv.org/pdf/1610.00087.pdf)

In [21]:
from torch import nn
import torch.nn.functional as F

class M5Model(nn.Module):

    def __init__(self, n_classes):
        super(M5Model, self).__init__()
        self.conv1 = nn.Conv1d(1, 128, 80, 4)
        self.bn1 = nn.BatchNorm1d(128)
        self.pool1 = nn.MaxPool1d(4)
        self.conv2 = nn.Conv1d(128, 128, 3)
        self.bn2 = nn.BatchNorm1d(128)
        self.pool2 = nn.MaxPool1d(4)
        self.conv3 = nn.Conv1d(128, 256, 3)
        self.bn3 = nn.BatchNorm1d(256)
        self.pool3 = nn.MaxPool1d(4)
        self.conv4 = nn.Conv1d(256, 512, 3)
        self.bn4 = nn.BatchNorm1d(512)
        self.pool4 = nn.MaxPool1d(4)
        #input should be 512x30 so this outputs a 512x1
        self.avgPool = nn.AvgPool1d(30)
        self.fc1 = nn.Linear(512, n_classes)
        self.softmax = nn.LogSoftmax(dim=2)
        
    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(self.bn1(x))
        x = self.pool1(x)
        x = self.conv2(x)
        x = F.relu(self.bn2(x))
        x = self.pool2(x)
        x = self.conv3(x)
        x = F.relu(self.bn3(x))
        x = self.pool3(x)
        x = self.conv4(x)
        x = F.relu(self.bn4(x))
        x = self.pool4(x)
        x = self.avgPool(x)
        #change the 512x1 to 1x512
        x = x.permute(0, 2, 1)
        x = self.fc1(x)
        return self.softmax(x)
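
The comment about the `512x30` input to the average pooling layer can be checked by hand. The sketch below applies the standard output length formula for unpadded 1-D convolution and pooling layers, `(length - kernel) // stride + 1`, to an input of 32000 samples; the helper `out_length` and the `layers` table are written here for illustration and mirror the kernel sizes and strides of the model above.

```python
def out_length(length, kernel, stride):
    """Output length of an unpadded 1-D convolution or pooling layer."""
    return (length - kernel) // stride + 1


# (name, kernel size, stride); pooling layers use stride == kernel size
layers = [
    ('conv1', 80, 4), ('pool1', 4, 4),
    ('conv2', 3, 1), ('pool2', 4, 4),
    ('conv3', 3, 1), ('pool3', 4, 4),
    ('conv4', 3, 1), ('pool4', 4, 4),
    ('avgPool', 30, 30),
]

length = 32000
lengths = {}
for name, kernel, stride in layers:
    length = out_length(length, kernel, stride)
    lengths[name] = length
    print(name, length)
```

The length reaches 30 after `pool4`, so `AvgPool1d(30)` collapses each of the 512 channels to a single value before the linear layer.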

## Wrap-up

With the `databunch` and the `model` defined, it is time to create the object responsible for the learning operations, the `Learner`.

Usually, with `pytorch`, this is the moment to define the `optimizer`, the `learning rate` (lr), the `loss function`, and the methods for training and testing on each epoch. All those things are defined manually.

With time, it becomes clear that not everything has to be custom-made for each experiment. The [Learner](https://docs.fast.ai/basic_train.html#Learner) class provides all of this setup built in, and it uses the concept of partials to allow a few enhancements.

The original tutorial uses the `Adam` optimizer with lr `0.01`. The `Learner` uses the same optimizer by default but, if an experiment requires a different one, it can be changed through the constructor's `opt_func` parameter. Likewise, the `loss function` can be changed through the `loss_func` parameter.
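
For readers unfamiliar with partials, here is a minimal sketch of the idea using only the standard library. `PatienceCallback` is a hypothetical stand-in, not a Fast.ai class: it only shows how `functools.partial` fixes some constructor arguments in advance, so that whoever creates the object later only has to supply the remaining one.

```python
from functools import partial


class PatienceCallback:
    """Hypothetical callback that needs a learner when it is created."""

    def __init__(self, learn, monitor, patience):
        self.learn = learn
        self.monitor = monitor
        self.patience = patience


# Fix the configuration now; the learner argument is still missing
make_callback = partial(PatienceCallback, monitor='accuracy', patience=10)

# Later, the owner of the learner completes the call with its own instance
callback = make_callback('learner-placeholder')
print(callback.learn, callback.monitor, callback.patience)
```

This is why a list of partials can be handed over before the object they depend on exists.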

In [27]:
from functools import partial

from fastai.callbacks import EarlyStoppingCallback, ReduceLROnPlateauCallback
from fastai.metrics import accuracy, error_rate
from fastai.basic_train import Learner

model = M5Model(n_classes=data_bunch.c)

learn = Learner(data_bunch, model, metrics=[error_rate, accuracy],
               callback_fns=[
                   partial(EarlyStoppingCallback, monitor='accuracy', patience=10, min_delta=5e-4),
                   partial(ReduceLROnPlateauCallback, monitor='accuracy', patience=5, factor=0.2, min_delta=0)
               ])

It's important to notice that the above setup uses `error_rate` and `accuracy` as metrics to evaluate the performance of the model during the training and validation phases.

In addition, there are two partials:
- **EarlyStoppingCallback** - It stops the training if a given metric, in this case `accuracy`, does not increase by at least `5e-4` in `10` epochs
- **ReduceLROnPlateauCallback** - It multiplies the lr by `0.2` if the `accuracy` does not increase in `5` epochs

These two partials help to avoid overtraining and to escape from a possible plateau.
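
The patience mechanism shared by both callbacks can be sketched in a few lines. This is a simplified illustration, not Fast.ai's code: `plateau_epochs` is a hypothetical function, the accuracy values are made up, and the convention that the trigger fires once the wait exceeds `patience` is an assumption of this sketch.

```python
def plateau_epochs(values, patience, min_delta=0.0):
    """Return the epochs at which the metric has stalled for too long."""
    best = float('-inf')
    wait = 0
    triggered = []
    for epoch, value in enumerate(values):
        if value - min_delta > best:
            # Real improvement: remember it and reset the counter
            best, wait = value, 0
        else:
            wait += 1
            if wait > patience:
                triggered.append(epoch)
                wait = 0
    return triggered


accuracy_history = [0.10, 0.20, 0.20, 0.20, 0.20]  # made-up values
stalled = plateau_epochs(accuracy_history, patience=2)

# A plateau scheduler would scale the lr at each trigger
lr = 0.01
for _ in stalled:
    lr *= 0.2
print(stalled, lr)
```

An early stopping rule would end the training at the first triggered epoch instead of scaling the lr.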

To run the training and validation phases, the `Learner` has a method called `fit`. In the following example, it receives two parameters: the number of epochs and the lr.

In [251]:
learn.fit(40, 0.01)

epoch,train_loss,valid_loss,error_rate,accuracy,time
0,2.266597,2.250529,0.883161,0.116838,16:50
1,2.236449,2.237231,0.881443,0.118557,17:42
2,2.216365,2.20277,0.852806,0.147194,17:20
3,2.204523,2.20627,0.856243,0.143757,15:38
4,2.18813,2.197274,0.853952,0.146048,15:59
5,2.190277,2.196049,0.859679,0.140321,15:31
6,2.181231,2.193279,0.845361,0.154639,17:55
7,2.175509,2.184804,0.865407,0.134593,17:20
8,2.169506,2.1766,0.857961,0.142039,13:44
9,2.169297,2.192725,0.870561,0.129439,14:02


Epoch 12: reducing lr to 0.002
Epoch 17: early stopping


## Final Considerations

This shows how much simpler and cleaner it is to write with fast.ai, even when no ready-made module, such as `vision` or `text`, exists for the task.

Now, it is possible to dig a little deeper into the audio processing universe using fast.ai. The next article will try to improve the accuracy of the model used here by playing with the input data in the preprocessing phase and with the parameters in the training phase.

The notebook used in this tutorial is available [here](https://github.com/limazix/audio-processing).