In [1]:
!apt install sox

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  libmagic-mgc libmagic1 libopencore-amrnb0 libopencore-amrwb0 libsox-fmt-alsa
  libsox-fmt-base libsox3
Suggested packages:
  file libsox-fmt-all
The following NEW packages will be installed:
  libmagic-mgc libmagic1 libopencore-amrnb0 libopencore-amrwb0 libsox-fmt-alsa
  libsox-fmt-base libsox3 sox
0 upgraded, 8 newly installed, 0 to remove and 37 not upgraded.
Need to get 760 kB of archives.
After this operation, 6,717 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libopencore-amrnb0 amd64 0.1.3-2.1 [92.0 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libopencore-amrwb0 amd64 0.1.3-2.1 [45.8 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libmagic-mgc amd64 1:5.32-2ubuntu0.4 [184 kB]
Get:4 http://archive.ubuntu.com/ubuntu bionic-updates/main a

### Baseline commands recognition (2-5 points)

We're now going to train a classifier to recognize voice. More specifically, we'll use the [Speech Commands Dataset] that contains around 30 different words with a few thousand voice records each.

In [3]:
import os
import time
from IPython.display import display, Audio, clear_output
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import librosa
import torch
from torch.utils.data import TensorDataset, DataLoader

datadir = "speech_commands"

!wget http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz -O speech_commands_v0.01.tar.gz
# alternative url: https://www.dropbox.com/s/j95n278g48bcbta/speech_commands_v0.01.tar.gz?dl=1
!mkdir {datadir} && tar -C {datadir} -xvzf speech_commands_v0.01.tar.gz 1> log

samples_by_target = {
    cls: [os.path.join(datadir, cls, name) for name in os.listdir("./speech_commands/{}".format(cls))]
    for cls in os.listdir(datadir)
    if os.path.isdir(os.path.join(datadir, cls))
}
print('Classes:', ', '.join(sorted(samples_by_target.keys())[1:]))

--2021-12-13 19:10:39--  http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 108.177.15.128, 2a00:1450:400c:c0c::80
Connecting to download.tensorflow.org (download.tensorflow.org)|108.177.15.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1489096277 (1.4G) [application/gzip]
Saving to: ‘speech_commands_v0.01.tar.gz’


2021-12-13 19:10:56 (80.5 MB/s) - ‘speech_commands_v0.01.tar.gz’ saved [1489096277/1489096277]

mkdir: cannot create directory ‘speech_commands’: File exists
Classes: bed, bird, cat, dog, down, eight, five, four, go, happy, house, left, marvin, nine, no, off, on, one, right, seven, sheila, six, stop, three, tree, two, up, wow, yes, zero


In [4]:
!sox --info speech_commands/bed/00176480_nohash_0.wav


Input File     : 'speech_commands/bed/00176480_nohash_0.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:00:01.00 = 16000 samples ~ 75 CDDA sectors
File Size      : 32.0k
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM



In [5]:
from sklearn.model_selection import train_test_split
from itertools import chain
from tqdm import tqdm
import joblib as jl

classes = ("left", "right", "up", "down", "stop")

def preprocess_sample(filepath, max_length=150):
    amplitudes, sr = librosa.core.load(filepath)
    spectrogram = librosa.feature.melspectrogram(amplitudes, sr=sr)[:, :max_length]
    spectrogram = np.pad(spectrogram, [[0, 0], [0, max(0, max_length - spectrogram.shape[1])]], mode='constant')
    target = classes.index(filepath.split(os.sep)[-2])
    return np.float32(spectrogram), np.int64(target)

all_files = chain(*(samples_by_target[cls] for cls in classes))
spectrograms_and_targets = jl.Parallel(n_jobs=-1)(tqdm(list(map(jl.delayed(preprocess_sample), all_files))))
X, y = map(np.stack, zip(*spectrograms_and_targets))
X = X.transpose([0, 2, 1])  # to [batch, time, channels]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

100%|██████████| 11834/11834 [07:36<00:00, 25.94it/s]


In [6]:
X_train.shape

(8875, 150, 128)

In [7]:
X_test.shape

(2959, 150, 128)

In [8]:
X_train = np.expand_dims(X_train, axis=1)
X_test = np.expand_dims(X_test, axis=1)

In [9]:
X_train.shape

(8875, 1, 150, 128)

In [10]:
X_test.shape

(2959, 1, 150, 128)

In [23]:
device = 'cuda'

batch_size = 16

tensor_x = torch.Tensor(X_train)
tensor_y = torch.LongTensor(y_train)

train_dataset = TensorDataset(tensor_x, tensor_y)

tensor_x = torch.Tensor(X_test) # transform to torch tensor
tensor_y = torch.LongTensor(y_test)

test_dataset = TensorDataset(tensor_x, tensor_y)


trainloader = DataLoader(train_dataset, batch_size=batch_size,
                         shuffle=True, num_workers=2)
testloader = DataLoader(test_dataset, batch_size=batch_size,
                        shuffle=False, num_workers=2)

In [31]:
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # TODO: define your layers here
        self.conv1 = nn.Conv2d(1, 64, 5)
        self.bn1 = nn.BatchNorm2d(64)
        self.dropout1 = nn.Dropout(0.2)

        self.conv2 = nn.Conv2d(64, 128, 5)
        self.bn2 = nn.BatchNorm2d(128)
        self.max_pool1 = nn.MaxPool2d(2)

        self.conv3 = nn.Conv2d(128, 256, 5)
        self.bn3 = nn.BatchNorm2d(256)
        self.max_pool2 = nn.MaxPool2d(4)

        self.conv4 = nn.Conv2d(256, 256, 5)
        self.bn4 = nn.BatchNorm2d(256)
        self.max_pool3 = nn.MaxPool2d(6)

        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(512, 256)
        self.fc2 = nn.Linear(256, 5)

    def forward(self, x):
        # TODO: apply your layers here
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.dropout1(x)
        x = F.relu(self.bn2(self.conv2(x)))
        x = self.max_pool1(x)
        x = F.relu(self.bn3(self.conv3(x)))
        x = self.max_pool2(x)
        x = F.relu(self.bn4(self.conv4(x)))
        x = self.max_pool3(x)
        x = F.relu(self.fc1(self.flatten(x)))
        x = self.fc2(x)
        x = F.softmax(x)
        return x

net = Net().to(device)

In [32]:
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

In [33]:
def accuracy(loader, model):
    correct_count = 0
    sample_count = 0
    model.eval()
    
    with torch.no_grad():
        for x, y in loader:
            x = x.to(device)
            y = y.to(device)   
            scores = model(x)
            _, preds = scores.max(1)
            correct_count += (preds == y).sum()
            sample_count += preds.size(0)
    model.train()
    return float(correct_count)/float(sample_count)*100

In [34]:
import seaborn as sn
sn.set()
from tqdm.auto import trange
from ipywidgets import Output

def plot_progress(losses, scores, disp):
  with disp:
    fig, ax = plt.subplots(1, 2, figsize=(16, 5))
    ax[0].plot([i*100 for i in range(len(losses))], losses)
    ax[1].plot([i*100 for i in range(len(scores))], scores)
    ax[0].scatter([i*100 for i in range(len(losses))], losses, color='blue')
    ax[1].scatter([i*100 for i in range(len(scores))], scores, color='blue')


    ax[0].set(xlabel='samples', ylabel='loss', title=f'Training loss')
    ax[1].set(xlabel='samples', ylabel='accuracy', title=f'Test accuracy')
    clear_output(wait=True)
    plt.show()
  time.sleep(0.5)

In [36]:
losses = []
scores = []
out = Output()
display.display(out)

for epoch in range(5):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data
        inputs = inputs.to(device)
        labels = labels.to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 100 == 99:    # print every 100 mini-batches
            acc = accuracy(testloader, net)
            print('[%d, %5d] loss: %.3f, accuracy: %.3f' %
                  (epoch + 1, i + 1, running_loss / 100, acc))
            losses.append(running_loss / 100)
            scores.append(acc)
            plot_progress(losses, scores, out)
            running_loss = 0.0

print('Finished Training')

Output()



[1,   100] loss: 1.528, accuracy: 43.765
[1,   200] loss: 1.517, accuracy: 44.779
[1,   300] loss: 1.509, accuracy: 43.325
[1,   400] loss: 1.507, accuracy: 39.743
[1,   500] loss: 1.494, accuracy: 41.872
[2,   100] loss: 1.459, accuracy: 44.306
[2,   200] loss: 1.464, accuracy: 51.909
[2,   300] loss: 1.482, accuracy: 50.558
[2,   400] loss: 1.437, accuracy: 50.727
[2,   500] loss: 1.435, accuracy: 47.989
[3,   100] loss: 1.422, accuracy: 51.200
[3,   200] loss: 1.386, accuracy: 58.770
[3,   300] loss: 1.389, accuracy: 54.309
[3,   400] loss: 1.376, accuracy: 59.142
[3,   500] loss: 1.349, accuracy: 54.714
[4,   100] loss: 1.358, accuracy: 53.802
[4,   200] loss: 1.331, accuracy: 62.014
[4,   300] loss: 1.299, accuracy: 49.645
[4,   400] loss: 1.298, accuracy: 70.091
[4,   500] loss: 1.269, accuracy: 61.642
[5,   100] loss: 1.245, accuracy: 73.302
[5,   200] loss: 1.236, accuracy: 67.590
[5,   300] loss: 1.216, accuracy: 71.071
[5,   400] loss: 1.218, accuracy: 74.282
[5,   500] loss:

Train a model: finally, lets' build and train a classifier neural network. You can use any library you like. If in doubt, consult the model & training tips below.

__Training tips:__ here's what you can try:
* __Layers:__ 1d or 2d convolutions, perhaps with some batch normalization in between;
* __Architecture:__ VGG-like, residual, highway, densely-connected, MatchboxNet, Dilated convs - you name it :)
* __Batch size matters:__ smaller batches usually train slower but better. Try to find the one that suits you best.
* __Data augmentation:__ add background noise, faster/slower, change pitch;
* __Average checkpoints:__ you can make model more stable with [this simple technique (arxiv)](https://arxiv.org/abs/1803.05407)
* __For full scale stage:__ make sure you're not losing too much data due to max_length in the pre-processing stage!

These are just recommendations. As long as your model works, you're not required to follow them.

### Full scale commands recognition (3+ points)

Your final task is to train a full-scale voice command spotter and apply it to a video:
1. Build the dataset with all 30+ classes (directions, digits, names, etc.)
  * __Optional:__ include a special "noise" class that contains random unrelated sounds
  * You can download youtube videos with [`youtube-dl`](https://ytdl-org.github.io/youtube-dl/index.html) library.
2. Train a model on this full dataset. Kudos for tuning its accuracy :)
3. Apply it to a audio/video of your choice to spot the occurences of each keyword
 * Here's one [video about primes](https://www.youtube.com/watch?v=EK32jo7i5LQ) that you can try. It should be full of numbers :)
 * There are multiple ways you can analyze the performance of your network, e.g. plot probabilities predicted for every time-step. Chances are you'll discover something useful about how to improve your model :)


Please briefly describe what you did in a short informal report.