In this practical we will:

*   Create a custom PyTorch dataset.
*   Apply data augmentation to audio clips.
*   Introduce two types of audio features that can be used for classification.

To put these ideas together in a self-contained project, we will build a Keyword-Spotting (KWS) application. A KWS system aims to detect a particular keyword (often just one word) in a continuous audio stream. Familiar examples are the wake words "Hey Siri" and "Alexa". Devices that respond to them often perform this detection on-device (i.e. the data is processed locally instead of in the cloud). In this practical we are going to build a simpler KWS system that classifies 1-second audio clips using audio-specific features.

We will use the popular [librosa](https://librosa.github.io) Python package to load, visualize and extract features from our dataset.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

import torch
import torch.nn as nn

import IPython.display as ipd
import librosa
import librosa.display
import cv2
import warnings
warnings.filterwarnings('ignore')

To train our KWS classifier we'll use the Google Speech Commands dataset. This dataset comprises 65K 1-second audio clips (`.wav` files at 16 kHz), each containing a single spoken word (e.g. "yes", "no", "up", "down", etc.). There are a total of 30 short words. You can read more about the dataset in [this blogpost](https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html) or in [this paper](https://arxiv.org/pdf/1804.03209.pdf).

Below we download the dataset and uncompress it.

In [None]:
# Get Google Speech Commands dataset (~1.5GB)
!wget http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz -P ./datasets/

# uncompress dataset
!mkdir ./datasets/speech_commands_v0.01 
!tar -xzf ./datasets/speech_commands_v0.01.tar.gz -C ./datasets/speech_commands_v0.01

## Data exploration

Let's start by inspecting the dataset.

In [None]:
def inspectAudio(filename, augment=None, ratio=0.15):
    
    plt.figure(figsize=(18, 5))    
    
    # load audio clip
    y, sr = librosa.load(filename)
    
    if augment is not None:
        # load background audio file
        y_bkg, s_bkg = librosa.load(augment)
        
        # select a random point in y_bkg to add to audio file y
        num_samples = len(y)
        start = np.random.randint(0, len(y_bkg)-num_samples)
        
        # add y_bkg to y following ratio
        y_corrupted = (1.0-ratio)*y + ratio*y_bkg[start:start+num_samples]
    
    # compute spectrogram
    D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)

    # Visualization
    plt.subplot(1, 2, 1)
    librosa.display.waveshow(y, sr=sr) # plot waveform (waveshow replaces waveplot, which was removed in librosa 0.10)
    plt.title('Audio Waveform')
    
    plt.subplot(1, 2, 2)
    librosa.display.specshow(D, y_axis='log') # plot spectrogram in log scale
    plt.colorbar(format='%+2.0f dB')
    plt.title('Log-frequency power spectrogram')
    plt.xlabel('Time')
    plt.show()
    
    # audio player
    ipd.display(ipd.Audio(y, rate=sr))
    
    if augment is not None:
        plt.figure(figsize=(18, 5))
        plt.subplot(1, 2, 1)
        librosa.display.waveshow(y_corrupted, sr=sr) # plot waveform (waveshow replaces waveplot, removed in librosa 0.10)
        plt.title('Audio Waveform + background')

        # compute spectrogram
        D_corrupted = librosa.amplitude_to_db(np.abs(librosa.stft(y_corrupted)), ref=np.max)
    
        plt.subplot(1, 2, 2)
        librosa.display.specshow(D_corrupted, y_axis='log') # plot spectrogram in log scale
        plt.colorbar(format='%+2.0f dB')
        plt.title('Log-frequency power spectrogram')
        plt.xlabel('Time')
        plt.show()
        
        # audio player
        ipd.display(ipd.Audio(y_corrupted, rate=sr))
        
    
    return D
    

In [None]:
# one example in the dataset of a person saying the word "yes"
filename = "./datasets/speech_commands_v0.01/yes/1c3f4fac_nohash_1.wav"
# one example of background noise from the dataset
bkg = "./datasets/speech_commands_v0.01/_background_noise_/running_tap.wav"

# let's visualize the audio clip and see the effect of augmenting it with a 0.85-0.15 clip/noise ratio.
spec = inspectAudio(filename, bkg, ratio=0.15)
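
The mixing step inside `inspectAudio` can also be isolated into a small reusable function. The sketch below is one possible way to do it, using synthetic signals in place of the dataset files; `mix_background` is a hypothetical helper name, not something defined elsewhere in this notebook. Unlike the `np.random.randint` call above, it also handles a background clip that is exactly as long as the speech clip.

```python
import numpy as np

def mix_background(clip, background, ratio=0.15, rng=None):
    """Blend a random excerpt of `background` into `clip`.

    The result is (1 - ratio) * clip + ratio * excerpt and has the
    same length as `clip`.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(clip)
    if len(background) < n:
        raise ValueError("background must be at least as long as the clip")
    # the upper bound is exclusive, so +1 lets the last valid excerpt be picked
    start = rng.integers(0, len(background) - n + 1)
    return (1.0 - ratio) * clip + ratio * background[start:start + n]

# synthetic stand-ins: a 1-second 440 Hz tone and 5 seconds of white noise
sr = 22050
t = np.arange(sr) / sr
clip = 0.5 * np.sin(2 * np.pi * 440 * t)
noise = np.random.default_rng(0).normal(scale=0.1, size=5 * sr)

mixed = mix_background(clip, noise, ratio=0.15, rng=np.random.default_rng(1))
print(mixed.shape)  # (22050,)
```

Passing an explicit `rng` makes the augmentation reproducible, which is handy when debugging.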

## Extracting audio features

### 1. Using spectrograms as features

In [None]:
# the previously computed spectrogram isn't square. A square input is not required for training CNNs
# but such high dimensions (1025, 44) could make our training slow.
print(spec.shape)

# Let's then resize our spectrogram by treating it like an image
spec_ = cv2.resize(np.flipud(spec), dsize=(64, 64), interpolation=cv2.INTER_CUBIC)
print(spec_.shape)
plt.figure(figsize=(6, 6))
plt.imshow(spec_, cmap='inferno')
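
The `(1025, 44)` shape mentioned in the cell above follows from the STFT parameters: with a 2048-sample window, a 512-sample hop and centred frames (librosa's defaults, to the best of my knowledge), a 22050-sample clip gives 1025 frequency bins and 44 frames. The sketch below reproduces that arithmetic and a naive NumPy STFT on synthetic noise; `stft_shape` and `naive_stft_magnitude` are hypothetical helpers for illustration, not librosa functions, and the zero padding used here is an assumption that does not affect the shape.

```python
import numpy as np

def stft_shape(num_samples, n_fft=2048, hop_length=512):
    """Shape (freq_bins, frames) of a centred STFT of a real signal."""
    freq_bins = 1 + n_fft // 2
    frames = 1 + num_samples // hop_length
    return freq_bins, frames

def naive_stft_magnitude(y, n_fft=2048, hop_length=512):
    """Magnitude STFT: pad, slice into overlapping frames, window, rFFT."""
    padded = np.pad(y, n_fft // 2, mode="constant")  # centre the frames
    window = np.hanning(n_fft + 1)[:-1]              # periodic Hann window
    n_frames = 1 + (len(padded) - n_fft) // hop_length
    frames = np.stack(
        [padded[i * hop_length:i * hop_length + n_fft] * window
         for i in range(n_frames)],
        axis=1,
    )
    return np.abs(np.fft.rfft(frames, axis=0))

y = np.random.default_rng(0).normal(size=22050)  # 1 second at 22050 Hz
print(stft_shape(len(y)))             # (1025, 44)
print(naive_stft_magnitude(y).shape)  # (1025, 44)
```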

Now that we know how to (1) load 1-second .wav files, (2) compute the spectrogram of each clip and (3) resize it so that it can be treated as an image of manageable dimensions, there's nothing stopping us from building a dataset by applying steps (1)-(3) to all audio clips in the Google Speech Commands dataset. We'll do that shortly.

### 2. Using MFCC features

Using the spectrogram as a first-order descriptor for our one-second audio clips is a valid strategy and we'll later see how that performs. However, it's commonly accepted that [Mel-frequency cepstral coefficients](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum) (MFCCs) are a more compact, higher-quality descriptor than raw spectrograms.

Extracting MFCC features is supported in several frameworks, including Librosa and TorchAudio. The cell below shows how to extract and visualize MFCC features using Librosa.

In [None]:
# load audio file
y, sr = librosa.load(filename)

# extract 40 MFCC features
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
print('MFCC features shape:', mfccs.shape)

# Visualization
plt.figure(figsize=(6, 6))
librosa.display.specshow(mfccs, x_axis='time', cmap='inferno')
plt.colorbar()
plt.title('MFCC')

Unlike with the spectrogram, we don't really need to resize the MFCC array since it's already of a quite manageable size. We'll use the MFCC features of each audio clip as inputs to the CNN classifier for our Keyword Spotting System.
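
To get a feel for why MFCCs are so compact: roughly speaking, they are obtained by taking a discrete cosine transform (DCT) of a log-mel spectrogram along the frequency axis and keeping only the first few coefficients. The sketch below illustrates just that last step in NumPy on a synthetic input. It assumes 128 mel bands and an orthonormal DCT-II, which I believe match librosa's defaults, and `dct2_matrix` is a hypothetical helper rather than a library function.

```python
import numpy as np

def dct2_matrix(n_out, n_in):
    """First `n_out` rows of the orthonormal DCT-II basis for length-`n_in` inputs."""
    n = np.arange(n_in)
    k = np.arange(n_out)[:, None]
    basis = np.sqrt(2.0 / n_in) * np.cos(np.pi * k * (2 * n + 1) / (2 * n_in))
    basis[0] /= np.sqrt(2.0)  # extra scaling of the constant row for orthonormality
    return basis

# synthetic stand-in for a log-mel spectrogram: 128 mel bands x 44 frames
log_mel = np.random.default_rng(0).normal(size=(128, 44))

# keep 40 coefficients per frame: 128 values per frame shrink to 40
mfcc_like = dct2_matrix(40, 128) @ log_mel
print(mfcc_like.shape)  # (40, 44)
```

The low-order coefficients capture the smooth spectral envelope, which is why truncating to 40 values per frame tends to discard little information that is useful for recognising words.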

## Building a PyTorch dataset for KWS applications

In order to train our KWS classifier efficiently we need to build our Google Speech Commands dataset as a PyTorch Dataset object (i.e. [torch.utils.data.Dataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset)). Let's do that now, starting by loading a subset of the `.wav` files (you can try all of them if you want).

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader
import os
from tqdm.notebook import tqdm

# To make things simpler we'll build our dataset for this practical as a subset of the Google Speech 
# Commands dataset, considering only a few keywords (see below).

keywords = ["on", "off", "left", "right", "go", "stop", "yes", "no"]
path = "./datasets/speech_commands_v0.01" # path to where the dataset folder lives


def loadDataset(path):
    """Get the filenames of the .wav files that belong to the chosen keywords."""
    labels = []
    audios = []

    for i, kword in enumerate(keywords):
        pathKword = os.path.join(path, kword)
        for file in os.listdir(pathKword): # for all files in directory
            if file.endswith(".wav"): # if it's a .wav file
                filename = os.path.join(pathKword, file)
                audios.append(filename)
                labels.append(i)

    return labels, audios
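
One detail worth knowing: `os.listdir` returns files in arbitrary order, so the order of `audios` may differ between machines. If you want reproducible train/validation splits, a sorted listing helps. Below is a sketch of an alternative using `pathlib`, exercised on a tiny fake directory tree so it runs without the dataset; `list_keyword_files` is a hypothetical name, not part of this notebook.

```python
import tempfile
from pathlib import Path

def list_keyword_files(root, keywords):
    """Return (labels, files) for every .wav file under root/<keyword>/."""
    labels, files = [], []
    for label, kword in enumerate(keywords):
        # sorted() makes the order deterministic across machines
        for wav in sorted(Path(root, kword).glob("*.wav")):
            files.append(str(wav))
            labels.append(label)
    return labels, files

# build a tiny fake dataset tree to exercise the function
with tempfile.TemporaryDirectory() as root:
    fake_tree = {"yes": ["b.wav", "a.wav"], "no": ["c.wav", "notes.txt"]}
    for kword, names in fake_tree.items():
        folder = Path(root, kword)
        folder.mkdir()
        for name in names:
            (folder / name).touch()
    labels, files = list_keyword_files(root, ["yes", "no"])

names = [Path(f).name for f in files]
print(labels)  # [0, 0, 1]
print(names)   # ['a.wav', 'b.wav', 'c.wav']
```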
        

The cell below shows how to create a custom PyTorch Dataset class. Upon initialization, we load the entire set of `.wav` files into memory. Then, each time we ask for an element of our dataset, the `__getitem__()` method is executed. There, we extract the MFCC features using Librosa. We use 40 ms windows with a 20 ms stride, i.e. a 20 ms overlap (as described in the seminal [Hello Edge](https://arxiv.org/abs/1711.07128) paper).
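
To make the window arithmetic concrete, here is a small sketch that assumes the window is taken as 4% of the clip length, the hop as half a window, and centred frames (so the frame count is `1 + num_samples // hop_length`). `frame_params` is a hypothetical helper for illustration only.

```python
def frame_params(num_samples, window_frac=0.04):
    """Window length, hop length and frame count for a clip of `num_samples` samples."""
    n_fft = int(num_samples * window_frac)    # 40 ms when the clip lasts 1 second
    hop_length = n_fft // 2                   # half a window, i.e. 20 ms
    n_frames = 1 + num_samples // hop_length  # centred framing
    return n_fft, hop_length, n_frames

print(frame_params(22050))  # (882, 441, 51)  1-second clip at 22050 Hz
print(frame_params(16000))  # (640, 320, 51)  1-second clip at 16 kHz
print(frame_params(11025))  # (441, 220, 51)  half-second clip at 22050 Hz
```

Because the window is defined relative to the clip length, it only equals 40 ms for clips that last exactly one second. In the examples above, shorter clips get proportionally shorter windows but still produce 51 frames, which keeps the feature shape at `(40, 51)` when 40 MFCCs are extracted.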

In [None]:
class keywordsDataset(Dataset):

    def __init__(self, root_dir, useMFCC=True, transform=None):

        # get labels - fileNames pairs
        self.labels, self.files = loadDataset(root_dir) 
        self.root_dir = root_dir
        self.useMFCC = useMFCC

        self.loadAudios()

    def __len__(self):
        return len(self.audios)

    def loadAudios(self):
        """Here we pre-load the audio files so that reading from disk doesn't slow down training."""
        self.audios = []
        with tqdm(total=len(self.files), desc='Reading .wav Files') as t:
            for i , filename in enumerate(self.files):
                waveform, _ = librosa.load(filename) # load audio file
                self.audios.append(waveform)
                t.update(1)

    def __getitem__(self, idx):
        """We reach this point every time the DataLoader asks for a sample (a batch is assembled from several
        such calls). Here we extract the desired audio features from the pre-loaded audio clips."""
        
        waveform = self.audios[idx]
        n_fft = int(len(waveform)*0.04) # 4% of the clip, i.e. a 40ms window for a 1-second clip
        hop_length = int(n_fft / 2) # half a window, i.e. a 20ms stride for a 1-second clip
        
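        # extract features
        if self.useMFCC:
            features = librosa.feature.mfcc(y=waveform, n_mfcc=40, n_fft=n_fft, hop_length=hop_length)
        else:
            # only MFCC features are implemented in this class
            raise NotImplementedError("keywordsDataset currently supports useMFCC=True only")
        
        features = torch.unsqueeze(torch.from_numpy(features), 0)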
        
        # return features and label
        return features, self.labels[idx]

Let's now construct the dataset and hold out 10% of the audio clips for validation. While this runs, consider reading the paper that introduced the [Speech Commands dataset](https://arxiv.org/pdf/1804.03209.pdf) or the [Hello Edge](https://arxiv.org/pdf/1711.07128.pdf) paper, which shows how a KWS system can be implemented on microcontrollers.

In [None]:
# create dataset (this might take ~10min depending on your hard drive speed)
myDataset = keywordsDataset(path)

# Use 10% of the audio clips for validation
val_size = 0.1
num_train = len(myDataset)
split = int(np.floor(val_size * num_train))

# randomly split `myDataset` for train/val
train_dataset, val_dataset = torch.utils.data.random_split(myDataset,[num_train - split,split])
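
Note that `random_split` draws a different partition on every run. If you want the same train/validation split each time (for example to compare models fairly), you can pass a seeded generator. The sketch below demonstrates this on a toy dataset of 100 integers; the seed value and the `make_split` helper are arbitrary choices for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# toy stand-in for the real dataset: 100 integer "samples"
dataset = TensorDataset(torch.arange(100))
split = int(0.1 * len(dataset))

def make_split(seed):
    """Split `dataset` 90/10 using a generator seeded with `seed`."""
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, [len(dataset) - split, split], generator=generator)

train_a, val_a = make_split(0)
train_b, val_b = make_split(0)

print(len(train_a), len(val_a))        # 90 10
print(val_a.indices == val_b.indices)  # True: same seed, same split
```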

Here we verify that everything works.

In [None]:
# get a single element from the train set
dataPoint = train_dataset[12]
print(dataPoint[0].shape) # shape of the MFCC features
print(dataPoint) # returns a tuple as (MFCC, label)

Now it's time to define the [DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) objects that will be feeding our network during training/evaluation. Here is where we define the batch size, whether to shuffle the data, and how many worker processes load it.

In [None]:
train_DataLoader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=8 if torch.cuda.is_available() else 0)
val_DataLoader = DataLoader(val_dataset, batch_size=32, shuffle=False, num_workers=8 if torch.cuda.is_available() else 0)
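
To see what a DataLoader actually yields, here is a sketch using random tensors shaped like our MFCC features (`1 x 40 x 51`) instead of the real dataset. The sizes are made up for illustration. Note how the last batch is smaller when the dataset size is not a multiple of `batch_size`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 70 fake samples shaped like our MFCC features, with labels in [0, 8)
features = torch.randn(70, 1, 40, 51)
labels = torch.randint(0, 8, (70,))

loader = DataLoader(TensorDataset(features, labels), batch_size=32, shuffle=True)

batch_shapes = [tuple(x.shape) for x, _ in loader]
print(batch_shapes)  # [(32, 1, 40, 51), (32, 1, 40, 51), (6, 1, 40, 51)]
```

This is why accuracy computations should divide by the actual batch size rather than by a fixed `batch_size`.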

## Training a CNN for KWS

Now that our dataset is ready, the following cells of code should look very familiar.

Below we define a simple CNN.

In [None]:
class basicCNN(nn.Module):
    def __init__(self, numClasses):
        super(basicCNN, self).__init__()
        self.name = "basicCNN"
        self.layer1 = nn.Sequential(nn.Conv2d(in_channels=1, out_channels=32, kernel_size=5, stride=2),
                                    nn.BatchNorm2d(32))
        self.layer2 = nn.Sequential(nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=2),
                                    nn.BatchNorm2d(64))
        self.layer3 = nn.Sequential(nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=1),
                                    nn.BatchNorm2d(128))
        self.relu = torch.nn.ReLU()
        self.pool = torch.nn.MaxPool2d(2)
        self.fc = nn.Linear(128 * 2*  3, numClasses)
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, x):
        x = self.relu(self.layer1(x))
        x = self.pool(self.relu(self.layer2(x)))
        x = self.relu(self.layer3(x))
        x = self.fc(x.view(-1, 128 * 2 * 3))
        return x
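
The `128 * 2 * 3` in the fully connected layer is not arbitrary: it is the number of values left after the convolution and pooling layers have shrunk an input of size `40 x 51`. The sketch below traces that arithmetic using the standard output-size formula for unpadded convolutions; `conv_out` and `basic_cnn_feature_size` are hypothetical helpers, and the `40 x 51` input size is an assumption based on the MFCC settings used earlier.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def basic_cnn_feature_size(height=40, width=51):
    """Spatial size of the feature map that reaches the fully connected layer."""
    dims = []
    for size in (height, width):
        size = conv_out(size, kernel=5, stride=2)  # layer1
        size = conv_out(size, kernel=3, stride=2)  # layer2
        size = size // 2                           # MaxPool2d(2)
        size = conv_out(size, kernel=3, stride=1)  # layer3
        dims.append(size)
    return tuple(dims)

h, w = basic_cnn_feature_size()
print((h, w))       # (2, 3)
print(128 * h * w)  # 768 inputs to the fully connected layer
```

If you change the feature type (for instance to 64x64 spectrograms), rerun this calculation and update the `nn.Linear` input size accordingly.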

Below we define the training and testing loops as well as other helper functions (everything is borrowed from the `utils.py` file we used in the previous practicals).

In [None]:
def getAccuracy(outputs, labels, num):
    """ Computes the accuracy"""
    _, predicted = torch.max(outputs, 1)
    correct = (predicted == labels).sum()
    return correct.item()/float(num)

class RunningAverage():
    """A simple class that maintains the running average of a quantity """
    def __init__(self):
        self.steps = 0
        self.total = 0
    
    def update(self, val):
        self.total += val
        self.steps += 1
    
    def __call__(self):
        return self.total/float(self.steps)
    
def train(model, device, train_loader, writer, optimiser, epoch):
    model.train() 
        
    loss_avg = RunningAverage()
    acc_avg = RunningAverage()
    # for every mini-batch containing batch_size images...
    with tqdm(total=len(train_loader.dataset), desc='Train Epoch #' + str(epoch)) as t:
        for i , (data, target) in enumerate(train_loader):

            # print(data.shape)
            # send the data (images, labels) to the device (either CPU or GPU)
            inputs, labels = data.to(device), target.to(device)

            # zero gradients from previous step
            optimiser.zero_grad()

            # this executes the forward() method in the model
            outputs = model(inputs)

            # compute loss
            loss = model.criterion(outputs, labels)

            # backward pass
            loss.backward()

            # update trainable parameters
            optimiser.step()

            # Monitoring progress, accuracy and loss
            acc_avg.update(getAccuracy(outputs, labels, inputs.shape[0]))
            loss_avg.update(loss.item())
            t.set_postfix({'avgAcc':'{:05.3f}'.format(acc_avg()), 'avgLoss':'{:05.3f}'.format(loss_avg())})
            t.update(data.shape[0])
            
def test(model, device, test_loader):
    
    model.eval() # evaluation mode: layers such as batch norm use their running statistics instead of batch statistics
    
    with torch.no_grad():
        correct = 0
        total = 0
        
        # now we evaluate every test image and compute the predicted labels
        for data in test_loader:
            
            # send data to device
            images, labels = data[0].to(device), data[1].to(device)
            
            # pass the images through the network
            outputs = model(images)

            # obtain predicted labels
            _, predicted = torch.max(outputs.data, 1)
            
            # count total images in batch
            total += labels.size(0)
            # count number of correct images
            correct += (predicted == labels).sum().item()
    
    # compute accuracy
    test_acc = correct/float(total)
    
    print("Accuracy on Test Set: %.4f" % test_acc)

Below is a simplified version of the `main()` function we used in the previous practicals.

In [None]:
def main(net, numEpoch, trainLoader, valLoader, use_cuda=False, lr=0.1):
        
    # 1. Define optimiser
    optim = torch.optim.SGD(net.parameters(), lr=lr)
    
    # 2. Define where training/testing will take place
    device = torch.device("cuda" if use_cuda else "cpu")
    print("Launching training on: %s" % device)
    
    # 2.1 Send model to device
    net = net.to(device)
    
    # 3. Train for `numEpoch` epochs
    for epoch in range(1, numEpoch + 1):
        train(net, device, trainLoader, None, optim, epoch)
              
    test(net, device, valLoader)
    

Let's train our KWS system!

In [None]:
main(basicCNN(len(keywords)), 5, train_DataLoader, val_DataLoader, use_cuda=torch.cuda.is_available())

## (Optional) Building a custom dataset with data augmentation

We have seen how to create a dataset using PyTorch's API. This was done by first loading all audio clips in memory and then extracting MFCC features every time we need to feed another batch to the CNN (i.e. each time we need to perform a new training step).

However, the dataset could be so big that it cannot possibly fit in memory. When this is the case, we'd normally load the files during batch creation (i.e. in `__getitem__`).

**Exercise**



*   Create a new PyTorch dataset that loads the files each time a new batch needs to be generated (i.e. in `__getitem__`)
*   After loading each file, apply some data augmentation, for instance by superimposing another audio clip with background noise (as we saw at the top of this notebook)
*   Then extract the MFCC features.
*   To verify you've done it correctly perform a 5 epoch training.



In [None]:
class keywordsDatasetv2(Dataset):

    def __init__(self, root_dir, useMFCC=True, corrupt=0.10):

        # get labels - fileNames pairs
        self.labels, self.files = loadDataset(root_dir) 
        self.root_dir = root_dir
        self.useMFCC = useMFCC

        # load background audio file (you could randomize this as well)
        self.y_bkg, _ = librosa.load("./datasets/speech_commands_v0.01/_background_noise_/running_tap.wav")
        self.corrupt = corrupt

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        """We reach this point every time the DataLoader asks for a sample. Here we (1) load the audio clip, 
        (2) apply data augmentation and (3) extract the desired audio features."""
        
        return None, None

Now let's create an instance of our new dataset and the associated dataloaders. We'll follow the same strategy as before to perform the train/validation split. Then train the model using the dataset you just created.


In [None]:
# create dataset
# myDataset = keywordsDatasetv2(path, useMFCC=False, corrupt=0.0) # this will generate a dataset using Spectrogram as features and no augmentation
myDataset = keywordsDatasetv2(path, useMFCC=True, corrupt=0.10) # this will generate a dataset using MFCC as features and adding 10% of background noise

# Use 10% of the audio clips for validation
val_size = 0.1
num_train = len(myDataset)
split = int(np.floor(val_size * num_train))

# randomly split `myDataset` for train/val
train_dataset, val_dataset = torch.utils.data.random_split(myDataset,[num_train - split,split])

Now we create the dataloader objects that will consume the dataset.

In [None]:
train_DataLoader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=2 if torch.cuda.is_available() else 0)
val_DataLoader = DataLoader(val_dataset, batch_size=32, shuffle=False, num_workers=2 if torch.cuda.is_available() else 0)

Now let's train for a few epochs.

In [None]:
main(basicCNN(len(keywords)), 5, train_DataLoader, val_DataLoader, use_cuda=torch.cuda.is_available())

**Observations**



*   Is training faster without a pre-loaded dataset?
*   How do you think this could be solved?

