## Speech Recognition

A survey of statistical and deep learning models

## Conventional (Statistical) Pipeline

<img src='assets/speech/asr-pipeline.jpg'/>


Source: https://www.techrepublic.com/article/how-we-learned-to-talk-to-computers/

## Speech Feature Extraction

<img src='assets/speech/Spectrogram-19thC.png'/>

A spectrogram of "nineteenth century": frequency vs. time, with power shown as intensity

Common method: Mel-frequency cepstral coefficients (MFCC)
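
As a rough illustration of what a spectrogram contains, the sketch below computes one with `scipy.signal.spectrogram`. The input is an assumption: a synthetic 440 Hz tone stands in for recorded speech, and the 25 ms window with a 10 ms step is just a common choice for speech framing. MFCCs are derived from short-time spectra like these (mel filterbank, log, then a discrete cosine transform).

```python
import numpy as np
from scipy.signal import spectrogram

# Assumption: a synthetic 1-second, 440 Hz tone stands in for real speech.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
audio = np.sin(2 * np.pi * 440.0 * t)

# 25 ms windows with a 10 ms step, i.e. 15 ms of overlap between frames
nperseg = int(0.025 * sample_rate)             # 400 samples per window
noverlap = nperseg - int(0.010 * sample_rate)  # 240 samples of overlap
freqs, times, power = spectrogram(audio, fs=sample_rate,
                                  nperseg=nperseg, noverlap=noverlap)

# power has shape (n_frequencies, n_frames): one spectrum per time frame
peak_freq = freqs[power.mean(axis=1).argmax()]
print(power.shape, peak_freq)
```

The frequency bin with the most power should sit at or next to 440 Hz, since the bins are 40 Hz apart with this window length.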

## Statistical Speech Recognition

$$W^* = \underset{W}{\operatorname{argmax}}P(W|X)$$

word sequence: $W$

most likely word sequence: $W^*$

acoustic input feature vector (e.g. MFCC): $X$

## Statistical Speech Recognition

After Bayes' Theorem:

$$W^* = \underset{W}{\operatorname{argmax}}p(X|W)P(W)$$

acoustic model: $p(X|W)$

language model (e.g. N-gram): $P(W)$
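
The step from the posterior to this product can be spelled out as follows. The only assumption is that $p(X)$ does not depend on $W$, so it can be dropped from the argmax:

$$W^* = \underset{W}{\operatorname{argmax}}P(W|X) = \underset{W}{\operatorname{argmax}}\frac{p(X|W)P(W)}{p(X)} = \underset{W}{\operatorname{argmax}}p(X|W)P(W)$$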

## Statistical Acoustic Model: $p(X|W)$

<img src='assets/speech/acoustic-statistical.png' width='50%'/>

Credits: https://www.inf.ed.ac.uk/teaching/courses/asr/2016-17/asr03-hmmgmm-handout.pdf

## Hidden Markov Model: $p(S_i|S_{i-1})$, Gaussian Mixture Model: $p(X|S_i)$

<img src='assets/speech/acoustic-hmm-gmm.png' width='50%'/>

Credits: https://www.inf.ed.ac.uk/teaching/courses/asr/2016-17/asr03-hmmgmm-handout.pdf
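
To make the two factors concrete, here is a minimal sketch of the forward algorithm, which combines transition probabilities $p(S_i|S_{i-1})$ with emission densities $p(X|S_i)$ to score an observation sequence. Everything in it is an assumption for illustration: the 3-state left-to-right topology, all the numbers, and the simplification of using a single 1-D Gaussian per state instead of a full mixture.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 3-state left-to-right HMM; every number here is made up.
start = np.array([0.8, 0.1, 0.1])        # p(S_1)
trans = np.array([[0.6, 0.4, 0.0],       # trans[i, j] = p(S_t = j | S_{t-1} = i)
                  [0.0, 0.6, 0.4],
                  [0.0, 0.0, 1.0]])
means = np.array([0.0, 5.0, 10.0])       # one 1-D Gaussian per state
stds = np.array([1.0, 1.0, 1.0])


def forward_log_likelihood(x, start, trans, means, stds):
    """Log-likelihood log p(x) of a 1-D observation sequence under the HMM."""
    # emission[t, s] = p(x_t | S_t = s)
    emission = norm.pdf(x[:, None], loc=means[None, :], scale=stds[None, :])
    alpha = start * emission[0]
    log_likelihood = 0.0
    for t in range(len(x)):
        if t > 0:
            # propagate through the transitions, then weight by the emission
            alpha = (alpha @ trans) * emission[t]
        # rescale at every step to avoid numerical underflow
        scale = alpha.sum()
        log_likelihood += np.log(scale)
        alpha = alpha / scale
    return log_likelihood


in_order = np.array([0.1, -0.2, 5.1, 4.8, 10.2, 9.9])
print(forward_log_likelihood(in_order, start, trans, means, stds))
print(forward_log_likelihood(in_order[::-1], start, trans, means, stds))
```

A sequence that visits the states in their left-to-right order scores higher than the same values reversed, because the reversed sequence needs transitions the model does not allow.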

## Gaussian Mixture Model

Mixture distribution: combine multiple probability distributions to make an improved model

$$P(x) = \sum_iP(c=i)P(x \mid c=i)$$

$i^{th}$ Gaussian component: $P(x \mid c=i)$
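
The formula can be evaluated directly. The sketch below assumes a toy mixture of two 1-D Gaussians (the weights, means and standard deviations are made up) and checks numerically that the resulting density integrates to about 1.

```python
import numpy as np
from scipy.stats import norm

# Assumed toy parameters: two 1-D Gaussian components
weights = np.array([0.3, 0.7])   # P(c=i), must sum to 1
means = np.array([-2.0, 3.0])
stds = np.array([1.0, 0.5])


def mixture_pdf(x):
    """P(x) = sum_i P(c=i) P(x | c=i), evaluated for an array of x values."""
    x = np.atleast_1d(x)
    # components[n, i] = P(x_n | c=i)
    components = norm.pdf(x[:, None], loc=means, scale=stds)
    return components @ weights


grid = np.linspace(-10, 10, 2001)
density = mixture_pdf(grid)
area = density.sum() * (grid[1] - grid[0])   # simple Riemann sum
print(area)
```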

Applications
- Clustering
- Classification
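
As a small illustration of the clustering use case, the sketch below fits scikit-learn's `GaussianMixture` to synthetic data. The two well-separated 2-D blobs are an assumption made purely so the result is easy to inspect.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Assumed synthetic data: two well-separated 2-D blobs
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(200, 2))
points = np.vstack([blob_a, blob_b])

gmm = GaussianMixture(n_components=2, covariance_type='diag', random_state=0)
labels = gmm.fit_predict(points)               # hard cluster assignment per point
responsibilities = gmm.predict_proba(points)   # soft assignment P(c=i | x)

print(np.bincount(labels))
print(responsibilities[:3].round(3))
```

Unlike k-means, the model also gives a soft assignment: each row of `responsibilities` is the posterior probability of each component for that point.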

Nice intro:
https://yulearning.blogspot.sg/2014/11/einsteins-most-famous-equation-is-emc2.html

## Workshop: GMM gender detector

Credits: https://github.com/abhijeet3922/PyGender-Voice

<img src='assets/speech/workshop1_pygender.png' style='float:right'/>

1. Download data from [here](https://www.dropbox.com/s/hcku4t7alrhacqv/pygender.zip?dl=0)

2. Extract the .zip file to a folder of your choice. Note down the path as you will need to enter it in the workshop code.

In [None]:
!pip3 install python_speech_features

import os
from os.path import basename, join
import numpy as np

import python_speech_features as mfcc
from scipy.io.wavfile import read
from sklearn import preprocessing
from sklearn.mixture import GaussianMixture

TRAIN_PATH = 'C:\\mldds\\pygender\\train_data\\youtube\\' # modify to your actual path

In [None]:
def get_MFCC(audio_file, scale=True):
    '''Computes the Mel-frequency cepstral coefficients for an audio file,
    with optional scaling
    See: https://github.com/jameslyons/python_speech_features
    '''
    sample_rate, audio = read(audio_file)
    features = mfcc.mfcc(audio, sample_rate, winlen=0.025, winstep=0.01, numcep=13, appendEnergy=False)
    if scale:
        features = preprocessing.scale(features) # standardize each coefficient to zero mean, unit variance
    return features
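
For reference, `preprocessing.scale` standardizes each column to zero mean and unit variance; it does not squeeze values into a fixed range. The quick check below uses random numbers as a stand-in for a (frames x 13) MFCC matrix, which is an assumption made to keep the example self-contained.

```python
import numpy as np
from sklearn import preprocessing

# Assumption: random numbers stand in for a (frames x 13) MFCC matrix
rng = np.random.default_rng(0)
fake_mfcc = rng.normal(loc=5.0, scale=3.0, size=(100, 13))

scaled = preprocessing.scale(fake_mfcc)   # standardizes each column (axis=0)

print(scaled.mean(axis=0).round(6))       # approximately 0 for every coefficient
print(scaled.std(axis=0).round(6))        # approximately 1 for every coefficient
```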

In [None]:
# Playback a sample file
from IPython import display

sample_file = join(TRAIN_PATH, 'male', 'male1.wav')
sample_rate, audio = read(sample_file)
display.Audio(data=audio, rate=sample_rate)

In [None]:
# Plot the MFCC
import matplotlib.pyplot as plt

mfcc_vector = get_MFCC(sample_file, scale=False)
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(20,4))
cax = ax.matshow(np.transpose(mfcc_vector), interpolation='nearest', aspect='auto', cmap='coolwarm', origin='lower')
fig.colorbar(cax)
plt.title("MFCC of {}".format(sample_file))
plt.show()

In [None]:
def train_GMM(data_path, n_components=8, covariance_type='diag'):
    '''Trains a Gaussian mixture model for a given label and data path'''
    files = [join(data_path, f) for f in os.listdir(data_path) if f.endswith('.wav')]
    features = np.asarray(())

    for f in files:
        mfcc_vector = get_MFCC(f)

        if features.size:
            features = np.vstack((features, mfcc_vector))
        else:
            features = mfcc_vector

    # http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html
    gmm = GaussianMixture(n_components=n_components, covariance_type=covariance_type,
                          max_iter=200, n_init=3)
    gmm.fit(features)
    
    # print some metrics applicable to GMMs
    print('BIC: ', gmm.bic(features), ', AIC: ', gmm.aic(features))
    return gmm

In [None]:
models = dict()
%time models['male'] = train_GMM(join(TRAIN_PATH, 'male'), n_components=8, covariance_type='diag')

# ==================================================================
# Exercise:
# Add code below to train the female model, using the above as an example

## === ANSWER === ##
%time models['female'] = train_GMM(join(TRAIN_PATH, 'female'), n_components=8, covariance_type='diag')
## === ANSWER === ##

# ==================================================================
# Optional Exercises:
# a. Try different values of n_components (e.g. 2, 16)
# b. Try different values of covariance_type (e.g. full)
#
# See http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html
# on how to interpret the BIC and AIC metrics for selecting models
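
One possible way to approach the optional exercises is to fit several candidate models and compare their BIC (or AIC); lower values indicate a better trade-off between fit and model complexity. The sketch below is run on assumed synthetic data (three 2-D blobs) rather than the workshop audio, so that it is self-contained.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Assumed synthetic data: three well-separated 2-D blobs
centers = [[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]]
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(150, 2)) for c in centers])

bic_by_n = {}
for n in (1, 2, 3, 4, 5):
    gmm = GaussianMixture(n_components=n, covariance_type='diag',
                          n_init=3, random_state=0)
    gmm.fit(data)
    bic_by_n[n] = gmm.bic(data)   # lower is better

best_n = min(bic_by_n, key=bic_by_n.get)
print(bic_by_n)
print('lowest BIC at n_components =', best_n)
```

The same loop can be applied to the stacked MFCC features, varying `covariance_type` as well as `n_components`.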

In [None]:
def test_GMM(models, test_data_path):
    '''Tests multiple Gaussian mixture models with test data'''
    files = [os.path.join(test_data_path,f) for f in os.listdir(test_data_path)
             if f.endswith(".wav")]
    
    predictions = []
    for f in files:
        features = get_MFCC(f)
        keys = []
        log_likelihood = np.zeros(len(models))

        for i, (key, gmm) in enumerate(models.items()):
            # score_samples() returns one log-likelihood per frame
            scores = gmm.score_samples(features)
            keys.append(key)
            log_likelihood[i] = scores.sum()

        # find the model with the maximum score
        winner = np.argmax(log_likelihood)
        # print('prediction:', keys[winner], "\tscores:", log_likelihood[winner])
        predictions.append(keys[winner])
    return predictions
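
A note on scoring: in current scikit-learn, `GaussianMixture.score` returns the average per-sample log-likelihood, while `score_samples` returns one log-likelihood per frame. For a given file every model sees the same number of frames, so ranking models by the mean or by the sum picks the same winner. The sketch below checks the relationship on random data, which is an assumption standing in for MFCC frames.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
frames = rng.normal(size=(300, 13))   # stand-in for the MFCC frames of one file

gmm = GaussianMixture(n_components=2, covariance_type='diag', random_state=0)
gmm.fit(frames)

per_frame = gmm.score_samples(frames)   # array with one log-likelihood per frame
average = gmm.score(frames)             # scalar: mean of the per-frame values
total = per_frame.sum()                 # log-likelihood of the whole file

print(per_frame.shape, average, total)
```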

In [None]:
# ==================================================================
# Exercise:
# 1. Complete the code below to test the GMM models using test_GMM().
#    Be sure to run against both male and female models.
# 2. Plot the confusion matrix

from sklearn.metrics import confusion_matrix

TEST_PATH = 'C:\\mldds\\pygender\\test_data\\AudioSet' # modify to your actual path

## === ANSWER === ##
male_predictions = test_GMM(models, join(TEST_PATH, 'male_clips'))
female_predictions = test_GMM(models, join(TEST_PATH, 'female_clips'))

truth = ['male' for p in male_predictions] + ['female' for p in female_predictions]
cm = confusion_matrix(truth, male_predictions + female_predictions)
print(cm)
plt.matshow(cm)
plt.colorbar()
## === ANSWER === ##
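
When reading the confusion matrix, note that `confusion_matrix` sorts string labels alphabetically unless `labels` is passed, so by default row and column 0 correspond to 'female'. The sketch below fixes the order explicitly and derives the accuracy; the truth and prediction lists are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only
truth = ['male', 'male', 'male', 'female', 'female']
predicted = ['male', 'male', 'female', 'female', 'male']

labels = ['male', 'female']   # explicit row/column order
cm = confusion_matrix(truth, predicted, labels=labels)

# rows are true labels, columns are predicted labels
accuracy = np.trace(cm) / cm.sum()
print(cm)
print('accuracy:', accuracy)
```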

## Shortcomings of Statistical Approaches

Lots of hand-tuning

GMMs are inefficient at modelling data that lie on or near a non-linear manifold: many components or full covariance matrices are needed, and these get very large

Solution: deep learning

## HMM-DNN

Paper:
https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/38131.pdf

## Recurrent Neural Networks

Paper: http://proceedings.mlr.press/v32/graves14.pdf

## Workshop: Deep Speech