# Part 1: Loading the dataset & simple linear model

#### Welcome to the part 1 tutorial!

In this tutorial, we work with spectrograms and compare the audio processing time of nnAudio on GPU with that of librosa. We use the `Google SPEECHCOMMANDS dataset v2 (12 classes)` with a linear model on a keyword spotting (KWS) task.

The original dataset contains 35 single-word commands. For this KWS task, 10 of the 35 words are chosen (‘down’, ‘go’, ‘left’, ‘no’, ‘off’, ‘on’, ‘right’, ‘stop’, ‘up’, ‘yes’). The remaining 25 words are grouped into `class ‘unknown’`, and a `class ‘silence’` is created from the background noise.

[Step 1: import related libraries](#Step-1:-import-related-libraries)\
[Step 2: setting up configuration](#Step-2:-setting-up-configuration)\
[Step 3: loading the dataset](#Step-3:-loading-the-dataset)\
[Step 4: data rebalancing](#Step-4:-data-rebalancing)\
[Step 5: data processing and loading](#Step-5:-data-processing-and-loading)\
[Step 6: setting up the Lightning Module](#Step-6:-setting-up-the-Lightning-Module)
* [Step 6(i): Lightning Module for Linearmodel_nnAudio](#Step-6(i):-Lightning-Module-for-Linearmodel_nnAudio)
* [Step 6(ii): Lightning Module for Linearmodel_librosa](#Step-6(ii):-Lightning-Module-for-Linearmodel_librosa)

[Step 7: setting up nnAudio MelSpectrogram](#Step-7:-setting-up-nnAudio-MelSpectrogram)\
[Step 8: defining the model](#Step-8:-defining-the-model)
* [Step 8(i): defining the model with nnAudio](#Step-8(i):-defining-the-model-with-nnAudio)
* [Step 8(ii): defining the model with librosa](#Step-8(ii):-defining-the-model-with-librosa)

[Step 9: training the model for 1 epoch](#Step-9:-training-the-model-for-1-epoch)

* [Step 9(i): training the Linearmodel_nnAudio](#Step-9(i):-training-the-Linearmodel_nnAudio)
* [Step 9(ii): training the Linearmodel_librosa](#Step-9(ii):-training-the-Linearmodel_librosa)

[Conclusion](#Conclusion:)

## Step 1: import related libraries

In [None]:
# Libraries related to PyTorch
import torch
from torch import Tensor
import torchaudio 
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import WeightedRandomSampler,DataLoader
import torch.optim as optim

#Libraries related to dataset
from AudioLoader.Speech import SPEECHCOMMANDS_12C #for 12 classes KWS task

# Libraries related to PyTorch Lightning
from pytorch_lightning import Trainer
from pytorch_lightning.core.lightning import LightningModule

# Libraries used in lightning module
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Front-end tool
from nnAudio.features.mel import MelSpectrogram, STFT
import librosa

## Step 2: setting up configuration

Note: If you don't have the SPEECHCOMMANDS dataset, set `download_option= True` so that AudioLoader downloads it in Step 3.

In [None]:
device = 'cuda:0'
gpus = 1
batch_size= 100
max_epochs = 1
check_val_every_n_epoch = 2 # validate every 2 epochs; with max_epochs = 1, validation only runs in the sanity check
num_sanity_val_steps = 5

data_root= './' # Download the data here
download_option= False

n_mels= 40 # number of Mel bins

input_dim= (n_mels*101) # Mel bins x time frames of a 1-second clip
output_dim= 12
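
The factor 101 in `input_dim` is the number of spectrogram time frames for a one-second clip. As a rough check, assuming 16,000 samples per clip, a hop length of 160, and centered frames (the settings used in Step 7), the frame count works out as follows. `num_frames` is a hypothetical helper written for this illustration, not part of nnAudio or librosa.

```python
def num_frames(num_samples, hop_length, n_fft=480, center=True):
    """Number of STFT frames for a signal with num_samples samples."""
    if center:
        # Centered frames: the signal is padded by n_fft // 2 on both sides,
        # so the frame count depends only on the hop length.
        return 1 + num_samples // hop_length
    # Without centering, only windows that fit entirely in the signal count.
    return 1 + (num_samples - n_fft) // hop_length


n_mels = 40
frames = num_frames(num_samples=16000, hop_length=160)
print(frames)           # 101
print(n_mels * frames)  # 4040, the input size of the linear layer
```

If you change `hop_length`, `n_mels`, or the clip length, `input_dim` needs to change accordingly.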

## Step 3: loading the dataset

In [None]:
trainset = SPEECHCOMMANDS_12C(root=data_root,
                              url='speech_commands_v0.02',
                              folder_in_archive='SpeechCommands',
                              download= download_option,subset= 'training') 

validset = SPEECHCOMMANDS_12C(root=data_root,
                              url='speech_commands_v0.02',
                              folder_in_archive='SpeechCommands',
                              download= download_option,subset= 'validation')
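
The 12-class grouping described in the introduction can be sketched as a simple lookup. This is only an illustration: `word_to_class` is a hypothetical helper, and the index assignment (keywords 0 to 9 in the order listed above, ‘silence’ as 10, ‘unknown’ as 11) is an assumption based on the class numbering mentioned in Step 4, not a statement about how AudioLoader implements it.

```python
# Assumed keyword order; the real loader may order its labels differently.
KEYWORDS = ['down', 'go', 'left', 'no', 'off', 'on', 'right', 'stop', 'up', 'yes']
SILENCE_IDX = 10  # assumed index for the 'silence' class
UNKNOWN_IDX = 11  # assumed index for the 'unknown' class


def word_to_class(word):
    """Map a spoken word (or 'silence') to one of the 12 class indices."""
    if word == 'silence':
        return SILENCE_IDX
    if word in KEYWORDS:
        return KEYWORDS.index(word)
    # Any of the remaining 25 words falls into the 'unknown' class.
    return UNKNOWN_IDX


for w in ['yes', 'stop', 'bed', 'silence']:
    print(w, '->', word_to_class(w))
```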


## Step 4: data rebalancing

Because of the class imbalance involving the ‘silence’ (label 10) and ‘unknown’ (label 11) classes, we re-balance the training set by adjusting the sampling weight of each class during training.

In [None]:
class_weights = [1,1,1,1,1,1,1,1,1,1,4.6,1/17]

#create one weight slot per training example
sample_weights = [0] * len(trainset)

#give each example the weight of its label's class in class_weights
for idx, (data,rate,label,speaker_id, _) in enumerate(trainset):
    class_weight = class_weights[label]
    sample_weights[idx] = class_weight
    
sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights),replacement=True)
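
To see why such weights even out the classes, note that with replacement each example is drawn with probability proportional to its weight, so a class's expected share of a batch is proportional to (class weight x class size). The sketch below works through that arithmetic. The class sizes are made-up numbers chosen to keep the arithmetic easy to follow; they are not the real dataset statistics.

```python
class_weights = [1] * 10 + [4.6, 1 / 17]

# Hypothetical class sizes for illustration only (NOT the real counts):
# ten keyword classes, then 'silence', then 'unknown'.
class_counts = [3000] * 10 + [650, 51000]

# Total sampling mass of each class = weight per example * number of examples.
mass = [w * n for w, n in zip(class_weights, class_counts)]
total = sum(mass)
proportions = [m / total for m in mass]

for idx, p in enumerate(proportions):
    print(f'class {idx:2d}: expected share {p:.3f}')
```

With these illustrative sizes, every class ends up with an expected share close to 1/12, even though the raw counts differ by almost two orders of magnitude.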

## Step 5: data processing and loading

In [None]:
#data padding
def data_processing(data):
    waveforms = []
    labels = []
    
    for batch in data:
        waveforms.append(batch[0].squeeze(0)) #batch[0] is one waveform; squeeze drops its leading dim => (audio_len) tensor
        labels.append(batch[2])      
        
    waveform_padded = nn.utils.rnn.pad_sequence(waveforms, batch_first=True)  
    
    output_batch = {'waveforms': waveform_padded, 
             'labels': torch.tensor(labels),
             }
    return output_batch

#data loading
trainloader = DataLoader(trainset,
                         collate_fn=data_processing,
                         batch_size=batch_size,
                         sampler=sampler)

validloader = DataLoader(validset,
                         collate_fn=data_processing,
                         batch_size=batch_size)
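
The padding step matters because not every clip has the same number of samples, while a batch must be a single rectangular tensor. A minimal sketch of what `pad_sequence` does, using dummy waveforms with made-up lengths instead of real audio:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three dummy mono waveforms of different lengths, each shaped (1, audio_len).
samples = [torch.ones(1, 16000), torch.ones(1, 12000), torch.ones(1, 14000)]

# Drop the leading dimension, then zero-pad to the longest clip in the batch.
waveforms = [w.squeeze(0) for w in samples]
padded = pad_sequence(waveforms, batch_first=True)

print(padded.shape)                    # torch.Size([3, 16000])
print(padded[1, 12000:].abs().sum())   # the padded tail of clip 1 is all zeros
```

Note that padding is relative to the longest clip in each batch, so the fixed `input_dim` used later assumes every batch contains at least one full one-second clip.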


## Step 6: setting up the Lightning Module

### Step 6(i): Lightning Module for Linearmodel_nnAudio

In [None]:
class SpeechCommand(LightningModule):
    def training_step(self, batch, batch_idx):
        outputs, spec = self(batch['waveforms']) 
        #outputs [2D] is used to calculate the loss; spec [3D] is returned for visualization
        loss = self.criterion(outputs, batch['labels'].long())

        acc = sum(outputs.argmax(-1) == batch['labels'])/outputs.shape[0] #batch wise
        
        self.log('Train/acc', acc, on_step=False, on_epoch=True)
        self.log('Train/Loss', loss, on_step=False, on_epoch=True)
        #log(graph title, value to log, on_step: log every step, on_epoch: log every epoch)
        return loss
 

    def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx,
                       optimizer_closure, on_tpu, using_native_amp, using_lbfgs):
        
        optimizer.step(closure=optimizer_closure)
        #after the optimizer step, clamp the values of mel_basis to [0, 1]
        with torch.no_grad():
            torch.clamp_(self.mel_layer.mel_basis, 0, 1)

        
    def validation_step(self, batch, batch_idx):               
        outputs, spec = self(batch['waveforms'])
        loss = self.criterion(outputs, batch['labels'].long())        
       
        self.log('Validation/Loss', loss, on_step=False, on_epoch=True)                     
        output_dict = {'outputs': outputs,
                       'labels': batch['labels']}        
        return output_dict

    
    def validation_epoch_end(self, outputs):
        pred = []
        label = []
        for output in outputs:
            pred.append(output['outputs'])
            label.append(output['labels'])
        label = torch.cat(label, 0)
        pred = torch.cat(pred, 0)
        acc = sum(pred.argmax(-1) == label)/label.shape[0]
        
        self.log('Validation/acc', acc, on_step=False, on_epoch=True)    

    
    def configure_optimizers(self):
        model_param = []
        for name, params in self.named_parameters():
            if 'mel_layer.' in name:
                pass
            else:
                model_param.append(params)          
        optimizer = optim.SGD(model_param, lr=1e-3, momentum= 0.9, weight_decay= 0.001)
        return [optimizer]
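
The name-based filter in `configure_optimizers` keeps the front-end out of the optimizer. A small self-contained sketch of the same idea, where `ToyModel` and its layer sizes are hypothetical stand-ins (an `nn.Linear` plays the role of the Mel front-end):

```python
import torch.nn as nn
import torch.optim as optim


class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.mel_layer = nn.Linear(4, 4)    # stand-in for the spectrogram layer
        self.linearlayer = nn.Linear(4, 2)  # the classifier we want to train


toy = ToyModel()

# Keep every parameter whose name does not belong to the mel_layer submodule.
model_param = [p for name, p in toy.named_parameters()
               if 'mel_layer.' not in name]

optimizer = optim.SGD(model_param, lr=1e-3, momentum=0.9, weight_decay=0.001)
n_optimized = sum(len(group['params']) for group in optimizer.param_groups)
print(n_optimized)  # 2: only linearlayer.weight and linearlayer.bias
```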

### Step 6(ii): Lightning Module for Linearmodel_librosa

In [None]:
class SpeechCommand_librosa(LightningModule):
    def training_step(self, batch, batch_idx):
        outputs, spec = self(batch['waveforms']) 
        loss = self.criterion(outputs, batch['labels'].long())

        acc = sum(outputs.argmax(-1) == batch['labels'])/outputs.shape[0] #batch wise
        
        self.log('Train/acc', acc, on_step=False, on_epoch=True)
        self.log('Train/Loss', loss, on_step=False, on_epoch=True)
        return loss

     
    def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx,
                       optimizer_closure, on_tpu, using_native_amp, using_lbfgs):       
        optimizer.step(closure=optimizer_closure)       

        
    def validation_step(self, batch, batch_idx):               
        outputs, spec = self(batch['waveforms'])
        loss = self.criterion(outputs, batch['labels'].long())        
       
        self.log('Validation/Loss', loss, on_step=False, on_epoch=True)                     
        output_dict = {'outputs': outputs,
                       'labels': batch['labels']}        
        return output_dict

    
    def validation_epoch_end(self, outputs):
        pred = []
        label = []
        for output in outputs:
            pred.append(output['outputs'])
            label.append(output['labels'])
        label = torch.cat(label, 0)
        pred = torch.cat(pred, 0)
        acc = sum(pred.argmax(-1) == label)/label.shape[0]
        
        self.log('Validation/acc', acc, on_step=False, on_epoch=True)    

    
    def configure_optimizers(self):
        model_param = []
        for name, params in self.named_parameters():
            if 'mel_layer.' in name:
                pass
            else:
                model_param.append(params)          
  
        optimizer = optim.SGD(model_param, lr=1e-3, momentum= 0.9, weight_decay= 0.001)
        return [optimizer]

## Step 7: setting up nnAudio MelSpectrogram 

nnAudio supports the calculation of linear-frequency spectrograms, log-frequency spectrograms, Mel-spectrograms, and the Constant Q Transform (CQT).

In this tutorial, we use the Mel-spectrogram as an example. You can modify the Mel-spectrogram arguments in the call below:

In [None]:
mel_layer = MelSpectrogram(sr=16000, 
                           n_fft=480,
                           win_length=None,
                           n_mels=n_mels, 
                           hop_length=160,
                           window='hann',
                           center=True,
                           pad_mode='reflect',
                           power=2.0,
                           htk=False,
                           fmin=0.0,
                           fmax=None,
                           norm=1,
                           trainable_mel=False,
                           trainable_STFT=False,
                           verbose=True)

## Step 8: defining the model
Both models take raw waveforms (x) as input. Linearmodel_nnAudio then applies `nnAudio.features.mel.MelSpectrogram()`, while Linearmodel_librosa applies `librosa.feature.melspectrogram`.

For demonstration purposes, we build only a simple model with one linear layer here. This KWS classification task has 12 output classes, hence the output size of the layer should be 12.


### Step 8(i): defining the model with nnAudio


In [None]:
class Linearmodel_nnAudio(SpeechCommand):
    def __init__(self): 
        super().__init__()
        self.mel_layer = mel_layer       
        self.criterion = nn.CrossEntropyLoss()
        self.linearlayer = nn.Linear(input_dim, output_dim)
    
    def forward(self, x): 
        #x: 2D [B, 16000]
        spec = self.mel_layer(x)  
        #spec: 3D [B, F40, T101]
        
        spec = torch.log(spec+1e-10)
        flatten_spec = torch.flatten(spec, start_dim=1) 
        #flatten_spec: 2D [B, F*T(40*101)] 
        #start_dim=1: flatten every dimension after the batch dimension
        
        out = self.linearlayer(flatten_spec) 
        #out: 2D [B,number of class(12)] 
                               
        return out, spec 

model_nnAudo = Linearmodel_nnAudio()
model_nnAudo = model_nnAudo.to(device)
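
The tensor shapes noted in the comments can be checked without the audio front-end. In this sketch a random tensor stands in for the Mel-spectrogram output, and the batch size of 8 is arbitrary; only the log, flatten, and linear steps mirror the `forward` method above.

```python
import torch
import torch.nn as nn

batch, n_mels, frames, n_classes = 8, 40, 101, 12

# Random non-negative values standing in for a power Mel-spectrogram.
spec = torch.rand(batch, n_mels, frames)

log_spec = torch.log(spec + 1e-10)           # small offset avoids log(0)
flat = torch.flatten(log_spec, start_dim=1)  # [B, 40 * 101]

linear = nn.Linear(n_mels * frames, n_classes)
out = linear(flat)                           # [B, 12] class scores

print(flat.shape, out.shape)
```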

### Step 8(ii): defining the model with librosa

In [None]:
class Linearmodel_librosa(SpeechCommand_librosa):
    def __init__(self): 
        super().__init__()       
        self.criterion = nn.CrossEntropyLoss()
        self.linearlayer = nn.Linear(input_dim, output_dim)   
    
    def forward(self, x): 
        #x: 2D [B, 16000]
        spec_list =[]
        for i in x:
            spec = i.cpu().detach().numpy()
            spec = librosa.feature.melspectrogram(y=spec,
                                                  sr=16000,  
                                                  n_fft=480,
                                                  win_length=None,
                                                  n_mels=n_mels,
                                                  hop_length=160,                                 
                                                  window='hann', 
                                                  center=True, 
                                                  pad_mode='reflect', 
                                                  power=2.0, 
                                                  htk=False, 
                                                  fmin=0.0, 
                                                  fmax=None,                                 
                                                  norm=1,)  
            
            #collect the spectrogram of each example as a tensor
            spec_list.append(torch.from_numpy(spec))
        spec_batch = torch.stack(spec_list)  #spec_batch: [B, 40, 101]
        
        #move the batch to the device of the input (the result must be re-assigned)
        spec_batch = spec_batch.to(x.device)
        spec_batch = torch.log(spec_batch+1e-10)
        flatten_spec = torch.flatten(spec_batch, start_dim=1)
        #flatten_spec: 2D [B, F*T(40*101)] 
        #start_dim=1: flatten every dimension after the batch dimension

        out = self.linearlayer(flatten_spec) #out: [B,12]
        return out, spec_batch

model_librosa = Linearmodel_librosa()
model_librosa = model_librosa.to(device)


## Step 9: training the model for 1 epoch

### Step 9(i): training the Linearmodel_nnAudio

In [None]:
trainer = Trainer(gpus=gpus, max_epochs=max_epochs,
    check_val_every_n_epoch= check_val_every_n_epoch,
    num_sanity_val_steps=num_sanity_val_steps)

trainer.fit(model_nnAudo, trainloader, validloader)

### Step 9(ii): training the Linearmodel_librosa

In [None]:
trainer = Trainer(gpus=gpus, max_epochs=max_epochs,
    check_val_every_n_epoch= check_val_every_n_epoch,
    num_sanity_val_steps=num_sanity_val_steps)

trainer.fit(model_librosa, trainloader, validloader)
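
If you prefer an explicit wall-clock measurement over reading the duration from the training output, one simple option is to wrap the call in a timer. `timed` below is a hypothetical helper written for this tutorial, not part of PyTorch Lightning, and the `sum` call is only a placeholder workload.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Placeholder workload; replace with the call you want to measure.
result, seconds = timed(sum, range(1_000_000))
print(f'took {seconds:.4f} s')
```

For example, `_, seconds = timed(trainer.fit, model_librosa, trainloader, validloader)` would time one full training run. GPU work is asynchronous, so when timing individual GPU operations it may help to call `torch.cuda.synchronize()` before reading the clock.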

# Conclusion:

The results above show the computation time of nnAudio on GPU and of librosa. librosa took 27 minutes for one epoch, whereas **nnAudio on GPU took only around 17 s to finish one epoch, which is about 95x faster than librosa!**
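
The speed-up figure follows directly from the two durations reported above, as this short calculation shows:

```python
librosa_seconds = 27 * 60  # 27 minutes for one epoch with librosa
nnaudio_seconds = 17       # around 17 seconds for one epoch with nnAudio on GPU

speedup = librosa_seconds / nnaudio_seconds
print(round(speedup))  # 95
```

Exact timings depend on your hardware, so expect a different ratio on your own machine.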

As a next step, let's explore nnAudio's trainable basis functions in the Part 2 tutorial: **Part 2_Training a Linear model with Trainable Basis Functions**
