# Speech Commands recognition using ConvNets in PyTorch

This is a tutorial post on speech command recognition. The dataset consists of short spoken words such as Yes, No, Up, and Down; the version used here (v0.02) contains 35 distinct words. First we need to install torchaudio, which provides a convenient dataset class for the [Speech Commands dataset](https://arxiv.org/abs/1804.03209).

Run the command below in a terminal to install torchaudio. The -c option tells conda to search the pytorch channel for the package.
conda install -c pytorch torchaudio

Here's the basic plan:
1. Data loading and preprocessing: set up the features (log Mel) and labels
2. Model specification and loss function: CNN architecture and cross-entropy loss
3. Training: minimize the loss function on the training data and use the validation data to assess training quality
4. Testing: evaluate performance on the test set
5. Real-world evaluation: record audio from a real microphone and test
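
As a small illustration of the cross-entropy loss named in step 2, here is a minimal sketch in plain Python (no PyTorch needed). The helper `cross_entropy` and the example logits are hypothetical and only for illustration; the class count of 35 is taken from the set of labels printed later in this post.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one example: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the max before exponentiating for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

num_classes = 35  # number of distinct labels printed later in this post

# An untrained model that scores every class equally
uniform_loss = cross_entropy([0.0] * num_classes, target=3)

# A model that is confident about the correct class (index 3)
confident_loss = cross_entropy([0.0] * 3 + [10.0] + [0.0] * 31, target=3)

print(f"uniform: {uniform_loss:.3f}  confident: {confident_loss:.4f}")
```

The uniform case evaluates to ln(35), roughly 3.555, which is a handy sanity check: a freshly initialized classifier should start with a loss near that value, and training should push it toward zero.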

This notebook is available on GitHub at this [link](https://github.com/jumpml/pytorch-tutorials/blob/master/SpeechCommands_CNN.ipynb).

In [1]:
# CUSTOMARY IMPORTS
import torch
import torchaudio
import matplotlib.pyplot as plt
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

%matplotlib inline

random_seed = 1        
torch.manual_seed(random_seed)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

## Data Setup

### Step 1: Download the data
Looking at the source code of the SPEECHCOMMANDS dataset class in torchaudio, we need to specify which version of Speech Commands to use (v0.02) and whether to download it. The download is around 2.3GB. Each item in the dataset is a tuple with the following fields:
waveform, sample_rate, label, speaker_id, utterance_number

In [70]:
# Set up an audio processing pipeline, as done in
# https://www.assemblyai.com/blog/end-to-end-speech-recognition-pytorch


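def process_data(data, sr=8000, nfft=256, hoplen=128, nMel=40):
    """Collate function: turn a list of dataset items into (features, labels)."""
    batchSize = len(data)
    # Resample the 16 kHz clips to `sr`, then compute a Mel spectrogram
    audio_transforms = nn.Sequential(
                    torchaudio.transforms.Resample(16000, sr),
                    torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_fft=nfft, hop_length=hoplen, n_mels=nMel)
    )
    nFrames = int(np.ceil(1. * sr / hoplen))   # number of frames kept per clip
    features = torch.zeros([batchSize, nMel, nFrames])
    labels = []
    # The loop variable is named orig_sr so it does not shadow the `sr` argument
    for idx, (waveform, orig_sr, label, sid, utnum) in enumerate(data):
        spec = audio_transforms(waveform)[0]   # [nMel, frames] for a mono clip
        n = min(spec.shape[1], nFrames)        # shorter clips stay zero-padded
        features[idx, :, :n] = spec[:, :n]
        labels.append(label)

    return features, labels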
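
The width of the `features` buffer in `process_data` comes from `ceil(sr / hoplen)`. The sketch below checks that arithmetic in plain Python. It assumes the spectrogram uses centered frames (the torchaudio default, `center=True`), in which case the frame count is `num_samples // hop_length + 1`; the helper `num_frames_centered` is hypothetical and only illustrates that formula.

```python
import math

def num_frames_centered(num_samples, hop_length):
    """Frame count of a centered STFT (assumption: center=True)."""
    return num_samples // hop_length + 1

sr, hoplen = 8000, 128

# Buffer width used in process_data vs. frames produced for a 1-second clip
buffer_frames = math.ceil(sr / hoplen)
print(buffer_frames, num_frames_centered(sr, hoplen))

# The two formulas differ when the clip length is an exact multiple of the hop
print(math.ceil(8192 / hoplen), num_frames_centered(8192, hoplen))
```

For the defaults used here both formulas give 63 frames, but they are not equivalent in general: at 8192 samples the buffer would be one frame narrower than the spectrogram, so changing `sr` or `hoplen` calls for truncating or recomputing the width.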
    
    

In [77]:
speechcommands_data = torchaudio.datasets.SPEECHCOMMANDS('.',
                                                         url='speech_commands_v0.02',
                                                         folder_in_archive='SpeechCommands',
                                                         download=False)
data_loader = torch.utils.data.DataLoader(speechcommands_data,
                                          batch_size=4000,
                                          shuffle=True,
                                          collate_fn=process_data,
                                          num_workers=0)

In [78]:
examples = enumerate(data_loader)
batch_idx, (features, labels) = next(examples)

In [79]:
print(set(labels))

{'backward', 'left', 'two', 'forward', 'on', 'one', 'visual', 'down', 'four', 'bird', 'wow', 'seven', 'three', 'tree', 'happy', 'six', 'nine', 'house', 'right', 'eight', 'up', 'sheila', 'cat', 'yes', 'learn', 'zero', 'five', 'stop', 'dog', 'go', 'off', 'bed', 'follow', 'no', 'marvin'}
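
The labels come back as strings, while a cross-entropy loss expects integer class indices. A minimal sketch of one way to build that mapping is shown below; `labels_seen`, `classes`, and `label_to_index` are hypothetical names, and the short list stands in for a real batch of labels. In practice the class list should be built from the full label set rather than a single batch.

```python
# Stand-in for the labels of one batch
labels_seen = ['yes', 'no', 'up', 'down', 'yes']

# Sort the unique labels so the mapping is identical on every run
# (the iteration order of a set is arbitrary)
classes = sorted(set(labels_seen))
label_to_index = {name: i for i, name in enumerate(classes)}

# Integer targets, in the same order as the batch
targets = [label_to_index[name] for name in labels_seen]

print(classes)
print(targets)
```

Here `classes` is `['down', 'no', 'up', 'yes']` and `targets` is `[3, 1, 2, 0, 3]`; the list can be wrapped in a tensor before it is passed to the loss.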


In [69]:
print("Shape of spectrogram: {}".format(features[0].size()))

plt.figure()
# features[0] is already 2-D ([nMel, frames]), so no extra index is needed;
# the small offset avoids log2(0) on all-zero (padded) frames
p = plt.imshow((features[0] + 1e-6).log2().detach().numpy(), cmap='gray')

Shape of spectrogram: torch.Size([40, 63])



105829
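
Assuming the number above is the dataset length (`len(speechcommands_data)`), the short calculation below sketches how many batches one pass over the data takes with the `batch_size=4000` used in the DataLoader. The variable names are hypothetical.

```python
import math

num_items = 105829   # assumed to be len(speechcommands_data)
batch_size = 4000    # value passed to the DataLoader above

# DataLoader keeps the final partial batch by default (drop_last=False)
num_batches = math.ceil(num_items / batch_size)
last_batch = num_items - (num_batches - 1) * batch_size

print(num_batches, last_batch)
```

Under that assumption one epoch is 27 batches, and the last one holds only 1829 items, so any code that relies on a fixed batch dimension should read the size from the batch itself.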