# Speech recognition

A simple speech recognition system can be implemented using dynamic time warping (DTW) on Mel-frequency cepstral coefficients (MFCCs).

In [1]:
%matplotlib inline

import os
import glob
import librosa
import numpy as np

We will use the [Speech Commands dataset (v0.02)](https://storage.cloud.google.com/download.tensorflow.org/data/speech_commands_v0.02.tar.gz), composed of about 105,000 WAVE audio files of people saying 35 different words. We will use only a subset of this database.

We assume that you have previously downloaded and extracted the database. You need to specify the path to the folder where you extracted it.

In [2]:
DATABASE_PATH = '/Users/pierrerouanet/Downloads/data-speech_commands_v0.02'

In [3]:
labels = {'cat', 'dog', 'house', 'happy', 'zero'}
labels

{'cat', 'dog', 'happy', 'house', 'zero'}
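
If `DATABASE_PATH` is wrong, `glob` simply returns no files and the problem only shows up later. The following is a minimal sketch of a sanity check, assuming the extracted archive contains one sub-folder per word; `missing_word_folders` is a hypothetical helper, not part of the original notebook.

```python
import os


def missing_word_folders(database_path, words):
    """Return, in sorted order, the words that have no sub-folder under database_path."""
    return sorted(
        w for w in words
        if not os.path.isdir(os.path.join(database_path, w))
    )
```

Calling `missing_word_folders(DATABASE_PATH, labels)` should return an empty list when every selected word has its own folder.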

In [4]:
# We will use only N occurrences per word
N = 25

## Precompute all MFCCs

In [5]:
mfccs = []
true_labels = []

for l in labels:
    sounds = glob.glob(os.path.join(DATABASE_PATH, l, '*.wav'))
    np.random.shuffle(sounds)
    sounds = sounds[:N]
    
    for s in sounds:
        y, sr = librosa.load(s)
        # Keyword arguments: recent librosa releases no longer accept y and sr positionally
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        mfccs.append(mfcc.T)
        true_labels.append(l)
        
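# Keep mfccs as a plain list: clips can differ in length, so the
# (n_frames, 13) matrices cannot always be stacked into one regular array.
true_labels = np.array(true_labels)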

## Prepare train/val dataset

In [6]:
val_percent = 0.2
n_val = int(val_percent * len(true_labels))

I = np.random.permutation(len(true_labels))
I_val, I_train = I[:n_val], I[n_val:]
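
Both the file shuffle and the permutation above are unseeded, so the recognition rate changes from run to run. The following is a minimal sketch of a reproducible split, assuming NumPy's `default_rng` generator; `split_indices` and its `seed` parameter are hypothetical names introduced here for illustration.

```python
import numpy as np


def split_indices(n_samples, val_percent=0.2, seed=0):
    """Return (train_indices, val_indices) from a seeded random permutation."""
    rng = np.random.default_rng(seed)      # fixed seed -> same split on every run
    order = rng.permutation(n_samples)
    n_val = int(val_percent * n_samples)   # same rounding as the cell above
    return order[n_val:], order[:n_val]
```

With 5 words and 25 clips each, `split_indices(125)` gives 100 training indices and 25 validation indices, and the same call always returns the same split.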

## Leave P Out Cross Validation

In [7]:
from dtw import dtw


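def cross_validation(train_indices, val_indices):
    """Label each validation sample with its nearest training sample (smallest DTW distance) and return the accuracy."""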
    score = 0.0

    for i in val_indices:
        x = mfccs[i]

        dmin, jmin = np.inf, -1
        for j in train_indices:
            y = mfccs[j]
            d, _, _, _ = dtw(x, y, dist=lambda x, y: np.linalg.norm(x - y, ord=1))

            if d < dmin:
                dmin = d
                jmin = j

        score += 1.0 if (true_labels[i] == true_labels[jmin]) else 0.0
        
    return score / len(val_indices)
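
To make the distance used above less of a black box, here is a minimal sketch of dynamic time warping with an L1 frame distance. It assumes each input is an array of shape `(n_frames, n_features)`, like the transposed MFCC matrices. `dtw_distance` is a hypothetical name, and this is the textbook recurrence, not necessarily what the `dtw` package does internally.

```python
import numpy as np


def dtw_distance(a, b):
    """Accumulated L1 cost of the cheapest monotonic alignment of a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)

    # cost[i, j] = L1 distance between frame i of a and frame j of b
    cost = np.abs(a[:, None, :] - b[None, :, :]).sum(axis=2)

    # acc[i, j] = cheapest cost of aligning a[:i] with b[:j];
    # the extra border row and column of +inf forbid empty alignments.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # advance in both sequences
                acc[i - 1, j],      # advance in a only
                acc[i, j - 1],      # advance in b only
            )
    return acc[n, m]
```

Because frames may be repeated along the alignment path, a sequence and a time-stretched copy of it, for example frames `0, 1, 2` against `0, 1, 1, 2`, are at distance 0. This is the property that makes DTW suitable for comparing words spoken at different speeds.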

In [8]:
rec_rate = cross_validation(I_train, I_val)
print('Recognition rate {}%'.format(100. * rec_rate))

Recognition rate 52.0%
