In [1]:


## Install dependencies
!pip install wget
!apt-get install sox libsndfile1 ffmpeg
!pip install text-unidecode

## Install NeMo
BRANCH = 'r1.20.0'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[asr]

## Install TorchAudio
# Quote the requirement so the shell does not treat `>=0.13.0` as an output redirect
!pip install "torchaudio>=0.13.0" -f https://download.pytorch.org/whl/torch_stable.html

## Grab the config we'll use in this example
!mkdir configs

Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9655 sha256=2ee0e568d4fa6e6111b825f8203202905c996918246174a96a75d1b6feff4676
  Stored in directory: /root/.cache/pip/wheels/8b/f1/7f/5c94f0a7a505ca1c81cd1d9208ae2064675d97582078e6c769
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libsndfile1 is already the newest version (1.0.31-2build1).
ffmpeg is already the newest version (7:4.4.2-0ubuntu0.22.04.1).
The following additional packages will be installed:
  libopencore-amrnb0 libopencore-amrwb0 libsox-fmt-alsa libsox-fmt-base
  libsox3 libwavpack1
Suggested packages:
  libsox-fmt-all
The following NEW packages will be installed:

# Introduction

This VAD tutorial is based on the MarbleNet model from the paper "[MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection](https://arxiv.org/abs/2010.13886)", which is a modification and extension of [MatchboxNet](https://arxiv.org/abs/2004.08531).

The notebook will follow the steps below:

 - Dataset preparation: instructions for downloading the datasets and converting them to a format suitable for use with nemo_asr.

 - Audio preprocessing (feature extraction): signal normalization, windowing, and computing a (log) spectrogram (or mel-scale spectrogram, or MFCC).

 - Data augmentation using SpecAugment "[SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition](https://arxiv.org/abs/1904.08779)" to increase the effective number of data samples.

 - Developing a small neural classification model that can be trained efficiently.

 - Model training on the Google Speech Commands dataset and the Freesound dataset in NeMo.

 - Evaluation of the model's error cases by listening to the samples.

 - Additional evaluation metrics and transfer learning/fine-tuning.


In [2]:
# Some utility imports
import os
from omegaconf import OmegaConf

# Data Preparation

## Download the background data
We suggest using the background categories of the [freesound](https://freesound.org/) dataset as non-speech/background data.
We provide scripts for downloading and resampling it; please have a look at the Data Preparation part of the NeMo docs. Note that downloading this dataset may take hours.

**NOTE:** This tutorial is a demonstration of how to train and evaluate VAD models using NeMo. To keep it quick, we avoid the freesound dataset and use the `_background_noise_` category of the Google Speech Commands Dataset as non-speech/background data.

## Download the speech data
   
We will use the open-source Google Speech Commands Dataset as our speech data. This tutorial uses V2 of the dataset; only very minor changes are required to support V1. Google Speech Commands Dataset V2 takes roughly 6 GB of disk space. The scripts below download the dataset and convert it to a format suitable for use with nemo_asr.


**NOTE**: You may additionally pass the `--test_size` or `--val_size` flag to control how the data is split into train, validation, and test sets.
You may additionally pass the `--window_length_in_sec` flag to set the segment/window length. The default is 0.63s.

**NOTE**: You may additionally pass a `--rebalance_method='fixed|over|under'` at the end of the script to rebalance the class samples in the manifest.
* 'fixed': Fixed number of samples for each class. For example, train 500, val 100, and test 200. (Change number in script if you want)
* 'over': Oversampling rebalance method
* 'under': Undersampling rebalance method
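
To make the three options concrete, here is a minimal sketch of what each one does to the class sizes. It is not the logic of `process_vad_data.py`: `rebalance` is a hypothetical helper, and we assume that 'over' tops up smaller classes with repeated samples until they match the largest class, while 'under' randomly trims larger classes down to the smallest one.

```python
import random

def rebalance(samples_by_class, method, fixed_count=500, seed=0):
    """Return a new {label: samples} dict with equal class sizes (illustrative only)."""
    rng = random.Random(seed)
    sizes = [len(samples) for samples in samples_by_class.values()]
    if method == "fixed":
        target = fixed_count
    elif method == "over":
        target = max(sizes)
    elif method == "under":
        target = min(sizes)
    else:
        raise ValueError(f"unknown rebalance method: {method}")

    balanced = {}
    for label, samples in samples_by_class.items():
        if len(samples) >= target:
            # Enough data: draw `target` distinct samples.
            balanced[label] = rng.sample(samples, target)
        else:
            # Too little data: keep everything and top up with repeats.
            extra = rng.choices(samples, k=target - len(samples))
            balanced[label] = list(samples) + extra
    return balanced

data = {"speech": list(range(10)), "background": list(range(4))}
for method in ("fixed", "over", "under"):
    out = rebalance(data, method, fixed_count=6)
    print(method, {label: len(items) for label, items in out.items()})
```

With the toy data above, every method returns the same number of samples for both classes; only the target size differs.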

**NOTE**: We only take a small subset of the speech data for demonstration. If you want to use the entire speech data, don't forget to **delete `--demo`** and change the rebalance method/number. The `_background_noise_` category only has **6** audio files, so we generate more samples from these files to enlarge our background training data. If you want to use your own background noise data, just change `background_data_root` and **delete `--demo`**.
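
One way to picture how a handful of long recordings can yield many short training samples is a sliding window. The sketch below is only an illustration under stated assumptions: `sliding_segments` is a hypothetical helper, `noise.wav` is a made-up file name, and the 0.15s stride is an arbitrary choice; only the 0.63s window length comes from the default mentioned above.

```python
def sliding_segments(filepath, file_duration, window=0.63, stride=0.15, label="background"):
    """List fixed-length windows that fit entirely inside one audio file."""
    segments = []
    index = 0
    # Use an integer counter so offsets do not accumulate floating-point drift.
    while index * stride + window <= file_duration + 1e-9:
        segments.append({
            "audio_filepath": filepath,
            "offset": round(index * stride, 2),
            "duration": window,
            "label": label,
        })
        index += 1
    return segments

segments = sliding_segments("noise.wav", file_duration=1.0)
print(len(segments))
print(segments[-1])
```

Each returned dictionary describes one segment by its start offset and duration, so the same audio file can back many samples without being copied.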


In [3]:
tmp = 'src'
data_folder = 'data'
if not os.path.exists(tmp):
    os.makedirs(tmp)
if not os.path.exists(data_folder):
    os.makedirs(data_folder)

In [4]:
script = os.path.join(tmp, 'process_vad_data.py')
if not os.path.exists(script):
    !wget -P $tmp https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/dataset_processing/process_vad_data.py

--2023-09-29 04:33:39--  https://raw.githubusercontent.com/NVIDIA/NeMo/r1.20.0/scripts/dataset_processing/process_vad_data.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19135 (19K) [text/plain]
Saving to: ‘src/process_vad_data.py’


2023-09-29 04:33:39 (80.6 MB/s) - ‘src/process_vad_data.py’ saved [19135/19135]



In [5]:
speech_data_root = os.path.join(data_folder, 'google_dataset_v2')
background_data_root = os.path.join(data_folder, 'google_dataset_v2/google_speech_recognition_v2/_background_noise_')# your <resampled freesound data directory>
out_dir = os.path.join(data_folder, 'manifest')
if not os.path.exists(speech_data_root):
    os.mkdir(speech_data_root)

In [6]:
# This may take a few minutes
!python $script \
    --out_dir={out_dir} \
    --speech_data_root={speech_data_root} \
    --background_data_root={background_data_root}\
    --log \
    --demo \
    --rebalance_method='fixed'

Streaming output truncated to the last 5000 lines.
            {'$phi80.0': {('$78get_iter.36',
                           State(pc_initial=0 nstack_initial=0))},
             '$phi82.0': {('$78get_iter.36',
                           State(pc_initial=0 nstack_initial=0))},
             '$phi82.1': {('$80for_iter.2',
                           State(pc_initial=80 nstack_initial=1))}})
DEBUG:numba.core.byteflow:keep phismap: {'$phi80.0': {('$78get_iter.36', State(pc_initial=0 nstack_initial=0))},
 '$phi82.1': {('$80for_iter.2', State(pc_initial=80 nstack_initial=1))}}
DEBUG:numba.core.byteflow:new_out: defaultdict(<class 'dict'>,
            {State(pc_initial=0 nstack_initial=0): {'$phi80.0': '$78get_iter.36'},
             State(pc_initial=80 nstack_initial=1): {'$phi82.1': '$80for_iter.2'}})
DEBUG:numba.core.byteflow:----------------------DONE Prune PHIs-----------------------
DEBUG:numba.core.byteflow:block_infos State(pc_initial=0 nstack_initial=0):
AdaptBlockInfo(inst

## Preparing the manifest file

Manifest files are the data structure used by NeMo to declare a few important details about the data:

1) `audio_filepath`: Refers to the path to the raw audio file <br>
2) `label`: The class label (speech or background) of this sample <br>
3) `duration`: The length of the audio file, in seconds.<br>
4) `offset`: The start of the segment, in seconds.
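
As a quick illustration of this layout, the sketch below writes a tiny manifest with one JSON object per line and reads it back. The file names `a.wav` and `b.wav` and the temporary path are made up for the example; only the four field names come from the list above.

```python
import json
import os
import tempfile

entries = [
    {"audio_filepath": "a.wav", "duration": 0.63, "label": "speech", "offset": 0.0},
    {"audio_filepath": "b.wav", "duration": 0.63, "label": "background", "offset": 0.15},
]

# Write one JSON object per line, then read the file back.
manifest_path = os.path.join(tempfile.mkdtemp(), "toy_manifest.json")
with open(manifest_path, "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")

with open(manifest_path) as f:
    loaded = [json.loads(line) for line in f if line.strip()]

total_duration = sum(item["duration"] for item in loaded)
labels = sorted({item["label"] for item in loaded})
print(len(loaded), labels, round(total_duration, 2))
```

Because every line is an independent JSON object, manifests can be concatenated, filtered, or sampled with ordinary line-based tools.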

In [7]:
# change below if you don't have or don't want to use rebalanced data
train_dataset = 'data/manifest/balanced_background_training_manifest.json,data/manifest/balanced_speech_training_manifest.json'
val_dataset = 'data/manifest/background_validation_manifest.json,data/manifest/speech_validation_manifest.json'
test_dataset = 'data/manifest/balanced_background_testing_manifest.json,data/manifest/balanced_speech_testing_manifest.json'

## Read a few rows of the manifest file

Each line of a manifest file is a JSON object describing one audio segment. Besides the `audio_filepath`, `label`, `duration` and `offset` fields described earlier, the entries printed below also carry a `text` field, which is set to `_` in these samples.

In [8]:
sample_test_dataset =  test_dataset.split(',')[0]

In [9]:
!head -n 5 {sample_test_dataset}

{"audio_filepath": "data/google_dataset_v2/google_speech_recognition_v2/_background_noise_more/dude_miaowing.wav_500000.wav", "duration": 0.63, "label": "background", "text": "_", "offset": 0.35000000000000014}
{"audio_filepath": "data/google_dataset_v2/google_speech_recognition_v2/_background_noise_more/dude_miaowing.wav_100000.wav", "duration": 0.63, "label": "background", "text": "_", "offset": 0.21000000000000005}
{"audio_filepath": "data/google_dataset_v2/google_speech_recognition_v2/_background_noise_more/dude_miaowing.wav_100000.wav", "duration": 0.63, "label": "background", "text": "_", "offset": 0.12999999999999998}
{"audio_filepath": "data/google_dataset_v2/google_speech_recognition_v2/_background_noise_more/dude_miaowing.wav_600000.wav", "duration": 0.63, "label": "background", "text": "_", "offset": 0.05}
{"audio_filepath": "data/google_dataset_v2/google_speech_recognition_v2/_background_noise_more/dude_miaowing.wav_400000.wav", "duration": 0.63, "label": "background", "tex

# Training - Preparation

We will be training a MarbleNet model from the paper "[MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection](https://arxiv.org/abs/2010.13886)", which evolved from the [QuartzNet](https://arxiv.org/pdf/1910.10261.pdf) and [MatchboxNet](https://arxiv.org/abs/2004.08531) models. The benefit of QuartzNet over Jasper models is that it uses separable convolutions, which greatly reduce the number of parameters required to reach good model accuracy.

MarbleNet models generally follow the naming pattern MarbleNet-[BxRxC], where B is the number of blocks, R is the number of convolutional sub-blocks per block, and C is the number of channels in these blocks. Each sub-block contains a 1-D masked convolution, batch normalization, ReLU, and dropout. For example, the `marblenet_3x2x64.yaml` config used in this tutorial has B=3, R=2 and C=64.
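
To get a feel for how much separable convolutions save, we can count the weights of a single layer both ways. This is a back-of-the-envelope sketch, not NeMo code: biases and batch normalization are ignored, 64 channels mirrors the `3x2x64` config name, and the kernel size of 13 is just an example value.

```python
def conv1d_params(c_in, c_out, kernel_size):
    """Weights in a standard 1D convolution (bias ignored)."""
    return c_in * c_out * kernel_size

def separable_conv1d_params(c_in, c_out, kernel_size):
    """Weights in a depthwise conv followed by a 1x1 pointwise conv (bias ignored)."""
    depthwise = c_in * kernel_size   # one filter per input channel
    pointwise = c_in * c_out         # mixes channels with kernel size 1
    return depthwise + pointwise

standard = conv1d_params(64, 64, 13)
separable = separable_conv1d_params(64, 64, 13)
print(standard, separable, round(standard / separable, 1))
```

For these example values the separable layer needs roughly a tenth of the weights, and the gap widens as the kernel gets longer.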


In [10]:
# NeMo's "core" package
import nemo
# NeMo's ASR collection - this collection contains complete ASR models and
# building blocks (modules) for ASR
import nemo.collections.asr as nemo_asr

## Model Configuration
The MarbleNet Model is defined in a config file which declares multiple important sections.

They are:

1) `model`: All arguments that will relate to the Model - preprocessors, encoder, decoder, optimizer and schedulers, datasets and any other related information

2) `trainer`: Any argument to be passed to PyTorch Lightning

In [11]:
MODEL_CONFIG = "marblenet_3x2x64.yaml"

if not os.path.exists(f"configs/{MODEL_CONFIG}"):
  !wget -P configs/ "https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/marblenet/{MODEL_CONFIG}"

--2023-09-29 04:35:18--  https://raw.githubusercontent.com/NVIDIA/NeMo/r1.20.0/examples/asr/conf/marblenet/marblenet_3x2x64.yaml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4395 (4.3K) [text/plain]
Saving to: ‘configs/marblenet_3x2x64.yaml’


2023-09-29 04:35:19 (58.7 MB/s) - ‘configs/marblenet_3x2x64.yaml’ saved [4395/4395]



In [12]:
# This line will print the entire config of the MarbleNet model
config_path = f"configs/{MODEL_CONFIG}"
config = OmegaConf.load(config_path)
config = OmegaConf.to_container(config, resolve=True)
config = OmegaConf.create(config)

print(OmegaConf.to_yaml(config))

name: MarbleNet-3x2x64
model:
  sample_rate: 16000
  repeat: 2
  dropout: 0.0
  kernel_size_factor: 1.0
  labels:
  - background
  - speech
  train_ds:
    manifest_filepath: ???
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 128
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: scatter
    shuffle_n: 2048
    num_workers: 8
    pin_memory: true
    bucketing_strategy: synced_randomized
    bucketing_batch_size: null
    bucketing_weights: null
    augmentor:
      shift:
        prob: 1.0
        min_shift_ms: -5.0
        max_shift_ms: 5.0
      white_noise:
        prob: 1.0
        min_level: -90
        max_level: -46
  validation_ds:
    manifest_filepath: ???
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 128
    shuffle: false
    num_workers: 8
    pin_memory: true
    val_loss_idx: 0
  test_ds:
    manifest_filepath: null
    sample_rate: 16000
    labels:
    

In [13]:
# Preserve some useful parameters
labels = config.model.labels
sample_rate = config.model.sample_rate

### Setting up the datasets within the config

As you may notice, there are a few config dictionaries called `train_ds`, `validation_ds` and `test_ds`. These configurations are used to set up the Dataset and DataLoader for the corresponding split.



In [14]:
print(OmegaConf.to_yaml(config.model.train_ds))

manifest_filepath: ???
sample_rate: 16000
labels:
- background
- speech
batch_size: 128
shuffle: true
is_tarred: false
tarred_audio_filepaths: null
tarred_shard_strategy: scatter
shuffle_n: 2048
num_workers: 8
pin_memory: true
bucketing_strategy: synced_randomized
bucketing_batch_size: null
bucketing_weights: null
augmentor:
  shift:
    prob: 1.0
    min_shift_ms: -5.0
    max_shift_ms: 5.0
  white_noise:
    prob: 1.0
    min_level: -90
    max_level: -46



### `???` inside configs

You will often notice that some configs have `???` in place of paths. This is used as a placeholder so that the user can change the value at a later time.
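
A simple way to catch placeholders that were never filled in is to walk the config and list every key that still holds `???`. The sketch below does this on a plain nested dictionary and does not use OmegaConf itself; `find_missing` and `toy_config` are hypothetical names, and we assume the placeholder is stored as the literal string `???`.

```python
def find_missing(node, prefix=""):
    """Return dotted paths of all values that are still the '???' placeholder."""
    missing = []
    if isinstance(node, dict):
        for key, value in node.items():
            missing.extend(find_missing(value, f"{prefix}{key}."))
    elif isinstance(node, list):
        for i, value in enumerate(node):
            missing.extend(find_missing(value, f"{prefix}{i}."))
    elif node == "???":
        missing.append(prefix.rstrip("."))
    return missing

toy_config = {
    "model": {
        "train_ds": {"manifest_filepath": "???", "batch_size": 128},
        "validation_ds": {"manifest_filepath": "???"},
        "test_ds": {"manifest_filepath": None},
    }
}
print(find_missing(toy_config))
```

Note that `None` (shown as `null` in YAML) is a real value, not a placeholder, so `test_ds` is not reported here.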

Let's add the paths to the manifests to the config above.

In [15]:
config.model.train_ds.manifest_filepath = train_dataset
config.model.validation_ds.manifest_filepath = val_dataset
config.model.test_ds.manifest_filepath = test_dataset

## Building the PyTorch Lightning Trainer

NeMo models are primarily PyTorch Lightning modules - and therefore are entirely compatible with the PyTorch Lightning ecosystem!

Let's first instantiate a Trainer object!

In [16]:
import torch
import pytorch_lightning as pl

In [17]:
print("Trainer config - \n")
print(OmegaConf.to_yaml(config.trainer))

Trainer config - 

devices: 1
max_epochs: 150
max_steps: -1
num_nodes: 1
accelerator: gpu
strategy: ddp
accumulate_grad_batches: 1
enable_checkpointing: false
logger: false
log_every_n_steps: 1
val_check_interval: 1.0
benchmark: false



In [18]:
# Let's modify some trainer configs for this demo
# Checks if we have GPU available and uses it
accelerator = 'gpu' if torch.cuda.is_available() else 'cpu'
config.trainer.devices = 1
config.trainer.accelerator = accelerator

# Reduces maximum number of epochs to 5 for quick demonstration
config.trainer.max_epochs = 5

# Remove distributed training flags
config.trainer.strategy = None

In [19]:
trainer = pl.Trainer(**config.trainer)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.utilities.rank_zero:`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..


## Setting up a NeMo Experiment

NeMo has an experiment manager that handles logging and checkpointing for us, so let's use it !

In [20]:
from nemo.utils.exp_manager import exp_manager

In [21]:
exp_dir = exp_manager(trainer, config.get("exp_manager", None))

[NeMo I 2023-09-29 04:35:19 exp_manager:374] Experiments will be logged at /content/nemo_experiments/MarbleNet-3x2x64/2023-09-29_04-35-19
[NeMo I 2023-09-29 04:35:19 exp_manager:797] TensorboardLogger has been set up


In [22]:
# The exp_dir provides a path to the current experiment for easy access
exp_dir = str(exp_dir)
exp_dir

'/content/nemo_experiments/MarbleNet-3x2x64/2023-09-29_04-35-19'

## Building the MarbleNet Model

MarbleNet is an ASR model with a classification task - it generates one label for the entire provided audio stream. Therefore we encapsulate it inside the `EncDecClassificationModel` as follows.

In [23]:
vad_model = nemo_asr.models.EncDecClassificationModel(cfg=config.model, trainer=trainer)

[NeMo I 2023-09-29 04:35:19 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-09-29 04:35:19 collections:302] Dataset loaded with 1000 items, total duration of  0.17 hours.
[NeMo I 2023-09-29 04:35:19 collections:304] # 1000 files loaded accounting to # 2 labels


    


[NeMo I 2023-09-29 04:35:19 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-09-29 04:35:19 collections:302] Dataset loaded with 148 items, total duration of  0.03 hours.
[NeMo I 2023-09-29 04:35:19 collections:304] # 148 files loaded accounting to # 2 labels
[NeMo I 2023-09-29 04:35:19 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-09-29 04:35:19 collections:302] Dataset loaded with 400 items, total duration of  0.07 hours.
[NeMo I 2023-09-29 04:35:19 collections:304] # 400 files loaded accounting to # 2 labels


# Training a MarbleNet Model

As MarbleNet is inherently a PyTorch Lightning Model, it can easily be trained in a single line - `trainer.fit(model)` !


# Training the model

Even with such a small model (about 89k parameters, according to the model summary printed during training) and just 5 epochs (which should take only a few minutes), you should be able to get a test set accuracy of around 98.83% given enough training data (this result is for the [freesound](https://freesound.org/) dataset).

**NOTE:** If you follow our tutorial and use the generated background data, you may notice that the results below are acceptable, but please remember that this tutorial is only a **demonstration** and the dataset is not good enough. Please change the background dataset and train with enough data for improvement!

Experiment with increasing the number of epochs or with batch size to see how much you can improve the score!

**NOTE:** Noise robustness is quite important for the VAD task. Below we list the augmentations used in this demo.
Please refer to [Online_Noise_Augmentation.ipynb](https://github.com/NVIDIA/NeMo/blob/stable/tutorials/asr/Online_Noise_Augmentation.ipynb) for details on noise augmentation in NeMo.




In [24]:
# Noise augmentation
print(OmegaConf.to_yaml(config.model.train_ds.augmentor)) # noise augmentation
print(OmegaConf.to_yaml(config.model.spec_augment)) # SpecAug data augmentation

shift:
  prob: 1.0
  min_shift_ms: -5.0
  max_shift_ms: 5.0
white_noise:
  prob: 1.0
  min_level: -90
  max_level: -46

_target_: nemo.collections.asr.modules.SpectrogramAugmentation
freq_masks: 2
time_masks: 2
freq_width: 15
time_width: 25
rect_masks: 5
rect_time: 25
rect_freq: 15
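
To show what the `freq_masks`, `time_masks`, `freq_width` and `time_width` settings control, here is a minimal pure-Python sketch of SpecAugment-style masking. It is not NeMo's `SpectrogramAugmentation` implementation: `spec_augment` is a hypothetical helper, the rectangular (`rect_*`) masks are left out, and we assume each mask width is drawn uniformly between 0 and the configured maximum.

```python
import random

def spec_augment(spec, freq_masks=2, time_masks=2, freq_width=15, time_width=25, seed=0):
    """Zero out random frequency bands and time spans of a [freq][time] spectrogram."""
    rng = random.Random(seed)
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # copy so the input stays untouched

    for _ in range(freq_masks):
        width = rng.randint(0, min(freq_width, n_freq))
        start = rng.randint(0, n_freq - width)
        for f in range(start, start + width):
            out[f] = [0.0] * n_time      # mask a band of frequency bins

    for _ in range(time_masks):
        width = rng.randint(0, min(time_width, n_time))
        start = rng.randint(0, n_time - width)
        for row in out:
            row[start:start + width] = [0.0] * width  # mask a span of frames

    return out

spec = [[1.0] * 100 for _ in range(64)]   # toy 64-bin, 100-frame "spectrogram"
augmented = spec_augment(spec)
masked = sum(value == 0.0 for row in augmented for value in row)
print(f"masked {masked} of {64 * 100} cells")
```

Because the masks are redrawn every time a sample is loaded, the model rarely sees exactly the same input twice.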



If you are interested in a **pretrained** model, please have a look at [Transfer Learning & Fine-tuning on a new dataset](#Transfer-Leaning-&-Fine-tuning-on-a-new-dataset) and the upcoming tutorial 07 Offline_and_Online_VAD_Demo.


### Monitoring training progress

Before we begin training, let's first create a Tensorboard visualization to monitor progress


In [25]:
try:
    from google import colab
    COLAB_ENV = True
except (ImportError, ModuleNotFoundError):
    COLAB_ENV = False

# Load the TensorBoard notebook extension
if COLAB_ENV:
    %load_ext tensorboard
    %tensorboard --logdir {exp_dir}
else:
    print("To use tensorboard, please use this notebook in a Google Colab environment.")


### Training for 5 epochs
We see below that the model begins to get modest scores on the validation set after just 5 epochs of training
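
It helps to know how many optimizer steps this short run amounts to. The sketch below redoes the arithmetic with numbers taken from this notebook (1000 training items in the dataset log, `batch_size: 128`, `max_epochs = 5`), assuming a single device, no gradient accumulation, and that the last partial batch is kept.

```python
import math

num_train_samples = 1000   # "Dataset loaded with 1000 items" in the training log
batch_size = 128           # config.model.train_ds.batch_size
max_epochs = 5             # config.trainer.max_epochs

# With one device and no gradient accumulation, each batch is one optimizer step.
steps_per_epoch = math.ceil(num_train_samples / batch_size)
total_steps = steps_per_epoch * max_epochs
print(steps_per_epoch, total_steps)
```

These values agree with the training log, which reports `global step 8` after the first epoch and `max_steps: 40` for the learning-rate scheduler.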

In [26]:
trainer.fit(vad_model)

INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


[NeMo I 2023-09-29 04:35:27 modelPT:721] Optimizer config = SGD (
    Parameter Group 0
        dampening: 0
        differentiable: False
        foreach: None
        lr: 0.01
        maximize: False
        momentum: 0.9
        nesterov: False
        weight_decay: 0.001
    )
[NeMo I 2023-09-29 04:35:27 lr_scheduler:910] Scheduler "<nemo.core.optim.lr_scheduler.PolynomialHoldDecayAnnealing object at 0x7d5befb2d4e0>" 
    will be used during training (effective maximum steps = 40) - 
    Parameters : 
    (power: 2.0
    warmup_ratio: 0.05
    hold_ratio: 0.45
    min_lr: 0.001
    last_epoch: -1
    max_steps: 40
    )


INFO:pytorch_lightning.callbacks.model_summary:
  | Name              | Type                         | Params
-------------------------------------------------------------------
0 | spec_augmentation | SpectrogramAugmentation      | 0     
1 | preprocessor      | AudioToMFCCPreprocessor      | 0     
2 | encoder           | ConvASREncoder               | 88.9 K
3 | decoder           | ConvASRDecoderClassification | 258   
4 | loss              | CrossEntropyLoss             | 0     
5 | _accuracy         | TopKClassificationAccuracy   | 0     
-------------------------------------------------------------------
89.2 K    Trainable params
0         Non-trainable params
89.2 K    Total params
0.357     Total estimated model params size (MB)


Sanity Checking: 0it [00:00, ?it/s]

    


Training: 0it [00:00, ?it/s]

[NeMo I 2023-09-29 04:35:29 preemption:56] Preemption requires torch distributed to be initialized, disabling preemption


    


Validation: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:Epoch 0, global step 8: 'val_loss' reached 0.66182 (best 0.66182), saving model to '/content/nemo_experiments/MarbleNet-3x2x64/2023-09-29_04-35-19/checkpoints/MarbleNet-3x2x64--val_loss=0.6618-epoch=0.ckpt' as top 3


Validation: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:Epoch 1, global step 16: 'val_loss' reached 0.73714 (best 0.66182), saving model to '/content/nemo_experiments/MarbleNet-3x2x64/2023-09-29_04-35-19/checkpoints/MarbleNet-3x2x64--val_loss=0.7371-epoch=1.ckpt' as top 3


Validation: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:Epoch 2, global step 24: 'val_loss' reached 0.79537 (best 0.66182), saving model to '/content/nemo_experiments/MarbleNet-3x2x64/2023-09-29_04-35-19/checkpoints/MarbleNet-3x2x64--val_loss=0.7954-epoch=2.ckpt' as top 3


Validation: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:Epoch 3, global step 32: 'val_loss' reached 0.52898 (best 0.52898), saving model to '/content/nemo_experiments/MarbleNet-3x2x64/2023-09-29_04-35-19/checkpoints/MarbleNet-3x2x64--val_loss=0.5290-epoch=3.ckpt' as top 3


Validation: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:Epoch 4, global step 40: 'val_loss' reached 0.29014 (best 0.29014), saving model to '/content/nemo_experiments/MarbleNet-3x2x64/2023-09-29_04-35-19/checkpoints/MarbleNet-3x2x64--val_loss=0.2901-epoch=4.ckpt' as top 3
INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=5` reached.


# Fast Training

We can dramatically improve the time taken to train this model by using Multi GPU training along with Mixed Precision.

```python
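# Trainer with a distributed backend ('ddp' runs one process per GPU;
# the 'dp' strategy is not available in PyTorch Lightning 2.x):
trainer = pl.Trainer(devices=2, num_nodes=2, accelerator='gpu', strategy='ddp')

# Mixed precision (`amp_level` is no longer accepted in PyTorch Lightning 2.x):
trainer = pl.Trainer(precision=16)

# Of course, you can combine these flags as well.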
```

# Evaluation

## Evaluation on the Test set

Let's compute the final score on the test set via `trainer.test(model)`

In [27]:
trainer.test(vad_model, ckpt_path=None)

INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Testing: 0it [00:00, ?it/s]

[{'test_loss': 0.2693271040916443, 'test_epoch_top@1': 0.9200000166893005}]

## Evaluation of incorrectly predicted samples

Given that we have a trained model, which performs reasonably well, let's try to listen to the samples where the model is least confident in its predictions.

### Extract the predictions from the model

We want to possess the actual logits of the model instead of just the final evaluation score, so we can define a function to perform the forward step for us without computing the final loss. Instead, we extract the logits per batch of samples provided.

### Accessing the data loaders

We can utilize the `setup_test_data` method in order to instantiate a data loader for the dataset we want to analyze.

For convenience, we can access these instantiated data loaders using the following accessors - `vad_model._train_dl`, `vad_model._validation_dl` and `vad_model._test_dl`.

In [28]:
vad_model.setup_test_data(config.model.test_ds)
test_dl = vad_model._test_dl

[NeMo I 2023-09-29 04:35:41 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-09-29 04:35:41 collections:302] Dataset loaded with 400 items, total duration of  0.07 hours.
[NeMo I 2023-09-29 04:35:41 collections:304] # 400 files loaded accounting to # 2 labels


### Partial Test Step

Below we define a utility function to perform most of the test step. For reference, the test step is defined as follows:

```python
    def test_step(self, batch, batch_idx, dataloader_idx=0):
        audio_signal, audio_signal_len, labels, labels_len = batch
        logits = self.forward(input_signal=audio_signal, input_signal_length=audio_signal_len)
        loss_value = self.loss(logits=logits, labels=labels)
        correct_counts, total_counts = self._accuracy(logits=logits, labels=labels)
        return {'test_loss': loss_value, 'test_correct_counts': correct_counts, 'test_total_counts': total_counts}
```

In [29]:
@torch.no_grad()
def extract_logits(model, dataloader):
    logits_buffer = []
    label_buffer = []

    # Follow the above definition of the test_step
    for batch in dataloader:
        audio_signal, audio_signal_len, labels, labels_len = batch
        logits = model(input_signal=audio_signal, input_signal_length=audio_signal_len)

        logits_buffer.append(logits)
        label_buffer.append(labels)
        print(".", end='')
    print()

    print("Finished extracting logits !")
    logits = torch.cat(logits_buffer, 0)
    labels = torch.cat(label_buffer, 0)
    return logits, labels


In [30]:
cpu_model = vad_model.cpu()
cpu_model.eval()
logits, labels = extract_logits(cpu_model, test_dl)
print("Logits:", logits.shape, "Labels :", labels.shape)

....
Finished extracting logits !
Logits: torch.Size([400, 2]) Labels : torch.Size([400])


In [31]:
# Compute accuracy - `_accuracy` is a PyTorch Lightning Metric !
acc = cpu_model._accuracy(logits=logits, labels=labels)
print(f"Accuracy : {float(acc[0]*100)} %")

Accuracy : 92.0 %


### Filtering out incorrect samples
Let us now filter out the incorrectly labeled samples from the total set of samples in the test set
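
Before doing this on the real tensors, here is the idea in miniature without torch: turn each row of logits into probabilities with a softmax, take the most likely class as the prediction, and keep the samples whose prediction differs from the label. The logits and labels below are invented for the example, and `id2label` is assumed to follow the label order of the config (`background`, `speech`).

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: "background", 1: "speech"}
toy_logits = [[2.0, -1.0], [0.1, 0.0], [-0.5, 1.5], [1.0, 0.0]]
toy_labels = [0, 1, 1, 1]

errors = []
for idx, (logit, label) in enumerate(zip(toy_logits, toy_labels)):
    probs = softmax(logit)
    pred = probs.index(max(probs))       # argmax
    if pred != label:
        errors.append((idx, id2label[pred], id2label[label], probs[pred]))

# Least confident mistakes first.
errors.sort(key=lambda item: item[-1])
for item in errors:
    print(item)
```

Each tuple has the same layout as the ones built from the real model output: (sample id, predicted label, ground truth label, confidence).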

In [32]:
import librosa
import json
import IPython.display as ipd

In [33]:
# First let's create a utility class to remap the integer class labels to actual string label
class ReverseMapLabel:
    def __init__(self, data_loader):
        self.label2id = dict(data_loader.dataset.label2id)
        self.id2label = dict(data_loader.dataset.id2label)

    def __call__(self, pred_idx, label_idx):
        return self.id2label[pred_idx], self.id2label[label_idx]

In [34]:
# Next, let's get the indices of all the incorrectly labeled samples
sample_idx = 0
incorrect_preds = []
rev_map = ReverseMapLabel(test_dl)

# Convert the logits to probabilities, then take the top class and its confidence per sample
probs = torch.softmax(logits, dim=-1)
probas, preds = torch.max(probs, dim=-1)

total_count = cpu_model._accuracy.total_counts_k[0]
incorrect_ids = (preds != labels).nonzero()
for idx in incorrect_ids:
    proba = float(probas[idx][0])
    pred = int(preds[idx][0])
    label = int(labels[idx][0])
    idx = int(idx[0]) + sample_idx

    incorrect_preds.append((idx, *rev_map(pred, label), proba))


print(f"Num test samples : {total_count.item()}")
print(f"Num errors : {len(incorrect_preds)}")

# First let's sort by confidence of prediction
incorrect_preds = sorted(incorrect_preds, key=lambda x: x[-1], reverse=False)

Num test samples : 400
Num errors : 32


### Examine a subset of incorrect samples
Let's print out the (test id, predicted label, ground truth label, confidence) tuple of the first 20 incorrectly labeled samples, ordered from least to most confident

In [35]:
for incorrect_sample in incorrect_preds[:20]:
    print(str(incorrect_sample))

(394, 'background', 'speech', 0.5016511082649231)
(375, 'background', 'speech', 0.5155142545700073)
(236, 'background', 'speech', 0.5315167307853699)
(213, 'background', 'speech', 0.5355279445648193)
(351, 'background', 'speech', 0.5440881252288818)
(245, 'background', 'speech', 0.5528361797332764)
(305, 'background', 'speech', 0.5604750514030457)
(243, 'background', 'speech', 0.566459596157074)
(322, 'background', 'speech', 0.5721189975738525)
(249, 'background', 'speech', 0.5873631238937378)
(328, 'background', 'speech', 0.5878866314888)
(353, 'background', 'speech', 0.6014326214790344)
(241, 'background', 'speech', 0.6043530702590942)
(342, 'background', 'speech', 0.6078547239303589)
(254, 'background', 'speech', 0.6560388207435608)
(393, 'background', 'speech', 0.6735473275184631)
(311, 'background', 'speech', 0.6832898855209351)
(396, 'background', 'speech', 0.7180747389793396)
(214, 'background', 'speech', 0.7260203957557678)
(354, 'background', 'speech', 0.8190340995788574)


###  Define a threshold below which we designate a model's prediction as "low confidence"

In [36]:
# Filter out how many such samples exist
low_confidence_threshold = 0.8
count_low_confidence = len(list(filter(lambda x: x[-1] <= low_confidence_threshold, incorrect_preds)))
print(f"Number of low confidence predictions : {count_low_confidence}")

Number of low confidence predictions : 19


### Let's hear the samples which the model has least confidence in !

In [37]:
# First let's create a helper function to parse the manifest files
def parse_manifest(manifest):
    data = []
    for line in manifest:
        line = json.loads(line)
        data.append(line)

    return data

In [38]:
# Next, let's create a helper function to actually listen to certain samples
def listen_to_file(sample_id, pred=None, label=None, proba=None):
    # Load the audio waveform using librosa
    filepath = test_samples[sample_id]['audio_filepath']
    audio, sample_rate = librosa.load(filepath,
                                      offset = test_samples[sample_id]['offset'],
                                      duration = test_samples[sample_id]['duration'])


    if pred is not None and label is not None and proba is not None:
        print(f"filepath: {filepath}, Sample : {sample_id} Prediction : {pred} Label : {label} Confidence = {proba: 0.4f}")
    else:

        print(f"Sample : {sample_id}")

    return ipd.Audio(audio, rate=sample_rate)


In [39]:
# Now let's load the test manifests into memory
all_test_samples = []
for manifest_path in test_dataset.split(','):
    print(manifest_path)
    with open(manifest_path, 'r') as test_f:
        all_test_samples.extend(test_f.readlines())
print(len(all_test_samples))
test_samples = parse_manifest(all_test_samples)

data/manifest/balanced_background_testing_manifest.json
data/manifest/balanced_speech_testing_manifest.json
400


In [40]:
# Finally, let's listen to all the audio samples where the model made a mistake
# Note: This list of incorrect samples may be quite large, so you may choose to subsample `incorrect_preds`
count = min(count_low_confidence, 20)  # replace this line with just `count_low_confidence` to listen to all samples with low confidence

for sample_id, pred, label, proba in incorrect_preds[:count]:
    ipd.display(listen_to_file(sample_id, pred=pred, label=label, proba=proba))

filepath: data/google_dataset_v2/google_speech_recognition_v2/eight/735845ab_nohash_0.wav, Sample : 394 Prediction : background Label : speech Confidence =  0.5017


filepath: data/google_dataset_v2/google_speech_recognition_v2/one/d394ef8e_nohash_0.wav, Sample : 375 Prediction : background Label : speech Confidence =  0.5155


filepath: data/google_dataset_v2/google_speech_recognition_v2/up/c7aaad67_nohash_0.wav, Sample : 236 Prediction : background Label : speech Confidence =  0.5315


filepath: data/google_dataset_v2/google_speech_recognition_v2/eight/4e6902d0_nohash_3.wav, Sample : 213 Prediction : background Label : speech Confidence =  0.5355


filepath: data/google_dataset_v2/google_speech_recognition_v2/one/d394ef8e_nohash_0.wav, Sample : 351 Prediction : background Label : speech Confidence =  0.5441


filepath: data/google_dataset_v2/google_speech_recognition_v2/up/c7aaad67_nohash_0.wav, Sample : 245 Prediction : background Label : speech Confidence =  0.5528


filepath: data/google_dataset_v2/google_speech_recognition_v2/four/53578f4e_nohash_1.wav, Sample : 305 Prediction : background Label : speech Confidence =  0.5605


filepath: data/google_dataset_v2/google_speech_recognition_v2/four/53578f4e_nohash_1.wav, Sample : 243 Prediction : background Label : speech Confidence =  0.5665


filepath: data/google_dataset_v2/google_speech_recognition_v2/dog/2a89ad5c_nohash_0.wav, Sample : 322 Prediction : background Label : speech Confidence =  0.5721


filepath: data/google_dataset_v2/google_speech_recognition_v2/right/74551073_nohash_1.wav, Sample : 249 Prediction : background Label : speech Confidence =  0.5874


filepath: data/google_dataset_v2/google_speech_recognition_v2/dog/2a89ad5c_nohash_0.wav, Sample : 328 Prediction : background Label : speech Confidence =  0.5879


filepath: data/google_dataset_v2/google_speech_recognition_v2/eight/4e6902d0_nohash_3.wav, Sample : 353 Prediction : background Label : speech Confidence =  0.6014


filepath: data/google_dataset_v2/google_speech_recognition_v2/yes/b5cf6ea8_nohash_6.wav, Sample : 241 Prediction : background Label : speech Confidence =  0.6044


filepath: data/google_dataset_v2/google_speech_recognition_v2/on/7213ed54_nohash_0.wav, Sample : 342 Prediction : background Label : speech Confidence =  0.6079


filepath: data/google_dataset_v2/google_speech_recognition_v2/off/017c4098_nohash_0.wav, Sample : 254 Prediction : background Label : speech Confidence =  0.6560


filepath: data/google_dataset_v2/google_speech_recognition_v2/right/74551073_nohash_1.wav, Sample : 393 Prediction : background Label : speech Confidence =  0.6735


filepath: data/google_dataset_v2/google_speech_recognition_v2/tree/5ef35194_nohash_1.wav, Sample : 311 Prediction : background Label : speech Confidence =  0.6833


filepath: data/google_dataset_v2/google_speech_recognition_v2/right/31e686d2_nohash_0.wav, Sample : 396 Prediction : background Label : speech Confidence =  0.7181


filepath: data/google_dataset_v2/google_speech_recognition_v2/tree/5ef35194_nohash_1.wav, Sample : 214 Prediction : background Label : speech Confidence =  0.7260
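
If the list of mistakes is long, one option is to listen to a random subset instead of the first `count` entries. The sketch below assumes each entry is a `(sample_id, pred, label, proba)` tuple, as in the loop above. `example_incorrect_preds` and `subsample_preds` are hypothetical names introduced here, and the six entries are copied from the output above.

```python
import random

# Stand-in for `incorrect_preds`: (sample_id, pred, label, proba) tuples
example_incorrect_preds = [
    (394, "background", "speech", 0.5017),
    (375, "background", "speech", 0.5155),
    (236, "background", "speech", 0.5315),
    (213, "background", "speech", 0.5355),
    (351, "background", "speech", 0.5441),
    (245, "background", "speech", 0.5528),
]

def subsample_preds(preds, k, seed=0):
    """Pick at most k entries at random, then re-sort them by confidence."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    chosen = rng.sample(preds, min(k, len(preds)))
    return sorted(chosen, key=lambda item: item[-1])

subset = subsample_preds(example_incorrect_preds, k=3)
for sample_id, pred, label, proba in subset:
    print(f"Sample : {sample_id} Prediction : {pred} Label : {label} Confidence = {proba:0.4f}")
```

In the notebook, the resulting subset could be passed to `listen_to_file` in place of `incorrect_preds[:count]`.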


## Adding evaluation metrics

Here is an example of how to use additional metrics (e.g. from torchmetrics) to evaluate your results.

**Note:** If you would like to add metrics for training and testing, have a look at `NeMo/nemo/collections/common/metrics`.


In [41]:
from torchmetrics import ConfusionMatrix

In [42]:
# Take the highest-scoring class for each sample as the prediction
_, pred = logits.topk(1, dim=1, largest=True, sorted=True)
pred = pred.squeeze()
# Binary confusion matrix: rows are true labels, columns are predicted labels
metric = ConfusionMatrix(num_classes=2, task='binary')
metric(pred, labels)

tensor([[200,   0],
        [ 32, 168]])
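
The usual summary metrics can be read off this matrix by hand. The sketch below copies the four counts printed above into plain Python and assumes that rows are true labels, columns are predictions, and class 1 is "speech" (so class 0 is "background"). If your label order differs, swap the roles accordingly.

```python
# Confusion matrix printed above, copied as plain Python lists.
# Assumption: rows are true labels, columns are predictions, class 1 is "speech".
confusion = [[200, 0],
             [32, 168]]

tn, fp = confusion[0]  # true background: predicted background / predicted speech
fn, tp = confusion[1]  # true speech: predicted background / predicted speech

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
# prints: accuracy=0.9200 precision=1.0000 recall=0.8400 f1=0.9130
```

Under that assumption, all 32 errors are speech segments predicted as background, which matches the misclassified samples listed earlier.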

# Transfer Learning & Fine-tuning on a new dataset
For transfer learning, please refer to the [**Transfer learning** part of the ASR tutorial](https://github.com/NVIDIA/NeMo/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb).

For more details on saving and restoring checkpoints and on exporting a model in its entirety, please refer to the [**Fine-tuning on a new dataset** and **Advanced Usage** parts of the Speech Commands tutorial](https://github.com/NVIDIA/NeMo/blob/stable/tutorials/asr/Speech_Commands.ipynb).



