## Speech Recognition

A survey of statistical and deep learning models

## Automatic Speech Recognition Pipeline

<img src='assets/speech/asr-pipeline.jpg'/>


Source: https://www.techrepublic.com/article/how-we-learned-to-talk-to-computers/

## Speech Feature Extraction

<img src='assets/speech/Spectrogram-19thC.png'/>

A spectrogram of the phrase "nineteenth century": frequency against time, with power shown as intensity

Common method: Mel-frequency cepstral coefficients (MFCC)

## Statistical Speech Recognition

$$W^* = \underset{W}{\operatorname{argmax}}P(W|X)$$

- word sequence: $W$
- most likely word sequence: $W^*$
- acoustic input feature vector (e.g. MFCC): $X$

## Statistical Speech Recognition

After Bayes' Theorem:

$$W^* = \underset{W}{\operatorname{argmax}}p(X|W)P(W)$$

- acoustic model: $p(X|W)$
- language model (e.g. N-gram): $P(W)$
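
The step from the posterior to the product form can be spelled out. The only symbol not on the slides is the evidence $p(X)$, which does not depend on $W$ and therefore drops out of the argmax:

$$W^* = \underset{W}{\operatorname{argmax}}P(W|X) = \underset{W}{\operatorname{argmax}}\frac{p(X|W)P(W)}{p(X)} = \underset{W}{\operatorname{argmax}}p(X|W)P(W)$$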

## Statistical Acoustic Model: $p(X|W)$

<img src='assets/speech/acoustic-statistical.png' width='50%'/>

Credits: https://www.inf.ed.ac.uk/teaching/courses/asr/2016-17/asr03-hmmgmm-handout.pdf

## Hidden Markov Model: $p(S_i|S_{i-1})$, Gaussian Mixture Model: $p(X|S_i)$

<img src='assets/speech/acoustic-hmm-gmm.png' width='50%'/>

Credits: https://www.inf.ed.ac.uk/teaching/courses/asr/2016-17/asr03-hmmgmm-handout.pdf

## Gaussian Mixture Model

Mixture distribution: combine multiple probability distributions into a single, more flexible model

$$P(x) = \sum_iP(c=i)P(x \mid c=i)$$

$i^{th}$ Gaussian component: $P(x \mid c=i)$
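
As a rough illustration of the mixture formula, the sketch below evaluates a one-dimensional mixture density with NumPy. The function names (`gaussian_pdf`, `mixture_pdf`) and the example weights, means, and variances are made up for this illustration; they are not taken from the workshop data.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian, i.e. P(x | c=i) for one component."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def mixture_pdf(x, weights, means, variances):
    """P(x) = sum_i P(c=i) * P(x | c=i); the weights should sum to 1."""
    x = np.asarray(x, dtype=float)
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Hypothetical two-component mixture (values chosen only for illustration)
weights = [0.3, 0.7]
means = [-2.0, 3.0]
variances = [1.0, 0.5]

grid = np.linspace(-10.0, 10.0, 20001)
density = mixture_pdf(grid, weights, means, variances)

# A valid density integrates to about 1 (simple Riemann sum over the grid)
area = density.sum() * (grid[1] - grid[0])
print('area under mixture density:', area)
```

Because the component weights sum to 1 and each component is itself a density, the mixture is also a density, which the area check confirms numerically.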

Applications
- Clustering
- Classification

Nice intro:
https://yulearning.blogspot.sg/2014/11/einsteins-most-famous-equation-is-emc2.html

## Workshop: GMM gender detector
---

Credits: https://github.com/abhijeet3922/PyGender-Voice

<img src='assets/speech/workshop1_pygender.png' style='float:right'/>

1. Download data from [here](https://www.dropbox.com/s/hcku4t7alrhacqv/pygender.zip?dl=0)

2. Extract the .zip file to a folder of your choice. Note down the path as you will need to enter it in the workshop code.

In [None]:
!pip3 install python_speech_features

import os
from os.path import basename, join
import numpy as np

import python_speech_features as mfcc
from scipy.io.wavfile import read
from sklearn import preprocessing
from sklearn.mixture import GaussianMixture

TRAIN_PATH = 'C:\\mldds\\pygender\\train_data\\youtube\\' # modify to your actual path

In [None]:
def get_MFCC(audio_file, scale=True):
    '''Computes the Mel-frequency cepstral coefficients for an audio file,
    with optional scaling
    See: https://github.com/jameslyons/python_speech_features
    '''
    sample_rate, audio = read(audio_file)
    features = mfcc.mfcc(audio, sample_rate, winlen=0.025, winstep=0.01, numcep=13, appendEnergy=False)
    if scale:
        features = preprocessing.scale(features) # standardize each coefficient to zero mean, unit variance
    return features

In [None]:
# Playback a sample file
from IPython import display

sample_file = join(TRAIN_PATH, 'male', 'male1.wav')
sample_rate, audio = read(sample_file)
display.Audio(data=audio, rate=sample_rate)

In [None]:
# Plot the MFCC
import matplotlib.pyplot as plt

mfcc_vector = get_MFCC(sample_file, scale=False)
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(20,4))
cax = ax.matshow(np.transpose(mfcc_vector), interpolation='nearest', aspect='auto', cmap='coolwarm', origin='lower')
fig.colorbar(cax)
plt.title("MFCC of {}".format(sample_file))
plt.show()

In [None]:
def train_GMM(data_path, n_components=8, covariance_type='diag'):
    '''Trains a Gaussian mixture model for a given label and data path'''
    files = [join(data_path, f) for f in os.listdir(data_path) if f.endswith('.wav')]
    features = np.asarray(())

    for f in files:
        mfcc_vector = get_MFCC(f)

        if features.size:
            features = np.vstack((features, mfcc_vector))
        else:
            features = mfcc_vector

    # http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html
    gmm = GaussianMixture(n_components=n_components, covariance_type=covariance_type,
                          max_iter=200, n_init=3)
    gmm.fit(features)
    
    # print some metrics applicable to GMMs
    print('BIC: ', gmm.bic(features), ', AIC: ', gmm.aic(features))
    return gmm

In [None]:
models = dict()
%time models['male'] = train_GMM(join(TRAIN_PATH, 'male'), n_components=8, covariance_type='diag')

# ==================================================================
# Exercise:
# Add code below to train the female model, using the above as an example








# ==================================================================
# Optional Exercises:
# a. Try different values of n_components (e.g. 2, 16)
# b. Try different values of covariance_type (e.g. full)
#
# See http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html
# on how to interpret the BIC and AIC metrics for selecting models









In [None]:
def test_GMM(models, test_data_path):
    '''Tests multiple Gaussian mixture models with test data'''
    files = [os.path.join(test_data_path,f) for f in os.listdir(test_data_path)
             if f.endswith(".wav")]
    
    predictions = []
    for f in files:
        features = get_MFCC(f)
        keys = []
        log_likelihood = np.zeros(len(models))

        for i, (key, gmm) in enumerate(models.items()):
            # GaussianMixture.score returns the mean per-frame log-likelihood (a scalar)
            scores = np.array(gmm.score(features))
            keys.append(key)
            log_likelihood[i] = scores.sum()

        # find the model with the maximum score
        winner = np.argmax(log_likelihood)
        # print('prediction:', keys[winner], "\tscores:", log_likelihood[winner])
        predictions.append(keys[winner])
    return predictions

In [None]:
# ==================================================================
# Exercise:
# 1. Complete the code below to test the GMM models using test_GMM().
#    Be sure to run against both male and female models.
# 2. Plot the confusion matrix

from sklearn.metrics import confusion_matrix

TEST_PATH = 'C:\\mldds\\pygender\\test_data\\AudioSet' # modify to your actual path










## Shortcomings of Statistical Approaches

Lots of hand-tuning

Inefficient at modelling non-linear structure in the data: many mixture components are needed, and their covariance matrices become large and complicated

Solution: deep learning

## Deep Belief Nets

[Paper](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/38131.pdf)

- Replaces the Gaussian Mixture Model in an ASR system
- Generative pre-training
  - Speeds up training with less overfitting
  - Train a hidden layer (using Restricted Boltzmann Machines or Gaussian RBM)
  - Use the hidden-layer activations as inputs to train the next layer
  - Stack up into a forward-only "Deep Belief Net" (DBN)
  - Add softmax to create the DBN-DNN

Python: https://pypi.org/project/nolearn/

![DBN-DNN](assets/speech/dbn-dnn.png)

![DBN-DNN performance](assets/speech/dbn-dnn-compare.png)

## Word Error Rate

$$WER = \frac{S + D + I}{N}$$

- substitution word count: $S$
- deletion word count: $D$
- insertion word count: $I$
- correct word count: $C$
- number of reference words: $N = S + D + C$

## Word Accuracy
$$WAcc = 1 - WER = \frac{(N - S - D) - I}{N} = \frac{C - I}{N} $$
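
A minimal sketch of how WER might be computed in practice: the smallest possible $S + D + I$ is the word-level edit (Levenshtein) distance between the reference and the hypothesis, which is then divided by $N$. The function name `wer` and the example sentences are illustrative only.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError('reference must contain at least one word')

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (r != h)  # free if the words match
            deletion = prev[j] + 1                 # reference word missing from hypothesis
            insertion = curr[j - 1] + 1            # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)

# One substitution (sat -> sit) and one deletion (the) over 6 reference words
print(wer('the cat sat on the mat', 'the cat sit on mat'))
# Three insertions against a single reference word
print(wer('hello', 'hello there big world'))
```

Since insertions are not bounded by $N$, WER can exceed 1, and word accuracy can correspondingly be negative.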

## Recurrent Neural Networks

[Paper](http://proceedings.mlr.press/v32/graves14.pdf)

End-to-End Speech Recognition
- Phonetic transcription is hard. Skip it and train a model that converts speech directly to text
- Bidirectional LSTM to learn long sequences
- Connectionist Temporal Classification (CTC) to align audio with text

## LSTM

- Learns long sequences (like speech)
- Forget gate learns what to forget

![lstm](assets/speech/lstm.png)
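
To make the role of the gates concrete, here is a minimal NumPy sketch of a single LSTM time step. It assumes the standard formulation without peephole connections; the function name `lstm_step` and the layout of the stacked weight matrix `W` are choices made for this illustration, not any library's API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step (no peepholes).

    W has shape (4 * n_hidden, n_input + n_hidden) and stacks the weights of
    the forget gate, input gate, output gate and candidate, in that order.
    """
    n_hidden = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    f = sigmoid(z[0:n_hidden])                 # forget gate: how much old cell state to keep
    i = sigmoid(z[n_hidden:2 * n_hidden])      # input gate: how much new candidate to write
    o = sigmoid(z[2 * n_hidden:3 * n_hidden])  # output gate: how much cell state to expose
    g = np.tanh(z[3 * n_hidden:])              # candidate cell values
    c = f * c_prev + i * g                     # new cell state
    h = o * np.tanh(c)                         # new hidden state
    return h, c

# Tiny example with random weights (sizes chosen only for illustration)
rng = np.random.default_rng(0)
n_input, n_hidden = 3, 2
W = rng.normal(size=(4 * n_hidden, n_input + n_hidden))
b = np.zeros(4 * n_hidden)
h, c = lstm_step(rng.normal(size=n_input), np.zeros(n_hidden), np.zeros(n_hidden), W, b)
print('h:', h, 'c:', c)
```

When the forget gate is close to 0, the previous cell state is discarded; when it is close to 1, the cell state carries over, which is what lets the network retain information across long sequences.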

## Bidirectional RNN

- Forward ($\overrightarrow{h}$) and backward ($\overleftarrow{h}$) hidden sequences
- Learns context in both directions (like words in speech)
- Bidirectional LSTM: a bidirectional RNN that uses LSTM cells as its hidden units

![BRNN](assets/speech/brnn.png)

## Connectionist Temporal Classification

[Paper](https://www.cs.toronto.edu/~graves/icml_2006.pdf)

- To train speech to text, we need to know how audio "lines up" with transcripts
- Input sequences: $X$ (such as audio)
- Output sequences: $Y$ (such as transcripts)

## Connectionist Temporal Classification

Problem: $X$ and $Y$ can vary in length, in different ways
![CTC](assets/speech/naive_alignment.svg)

Solution: introduce the blank token ($\epsilon$)
![CTC](assets/speech/ctc_alignment_steps.svg)

Process: train an RNN to estimate probabilities of each character per time step

## Connectionist Temporal Classification

![CTC](assets/speech/full_collapse_from_audio.svg)

[Visual Guide and Explanation](https://distill.pub/2017/ctc/)
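
The collapsing step can be sketched in a few lines of Python: merge repeated symbols, then drop the blank token. The greedy decoder that follows simply takes the most likely symbol at each time step before collapsing. The names `ctc_collapse` and `greedy_decode`, the use of `_` to stand in for $\epsilon$, and the toy probabilities are assumptions made for this illustration. A full CTC decoder would typically use beam search instead.

```python
from itertools import groupby

import numpy as np

BLANK = '_'  # stands in for the blank token (epsilon)

def ctc_collapse(alignment, blank=BLANK):
    """Merge consecutive repeats, then remove blanks."""
    merged = [symbol for symbol, _ in groupby(alignment)]
    return ''.join(s for s in merged if s != blank)

def greedy_decode(probs, alphabet, blank=BLANK):
    """Pick the most likely symbol per time step, then collapse.

    probs has shape (n_time_steps, n_symbols), one distribution per time step.
    """
    best = np.argmax(probs, axis=1)
    return ctc_collapse([alphabet[k] for k in best], blank)

# The blank between the two runs of 'l' is what keeps the double letter
print(ctc_collapse('hhe_l_lloo'))
# Without it, the repeated 'l' frames merge into one
print(ctc_collapse('hheelllloo'))

# Toy per-time-step probabilities over the alphabet ['_', 'a', 'b']
alphabet = ['_', 'a', 'b']
probs = np.array([
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.10, 0.10, 0.80],
])
print(greedy_decode(probs, alphabet))
```

The first call yields `hello` while the second yields `helo`, which shows why the blank token is needed to represent repeated characters.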

## Walkthrough: CTC + BLSTM
---
Credits: https://github.com/jonrein/tensorflow_CTC_example

In [None]:
# https://github.com/lspvic/jupyter_tensorboard
!pip3 install jupyter-tensorboard

'''
Example of a single-layer bidirectional long short-term memory network trained with
connectionist temporal classification to predict character sequences from nFeatures x nFrames
arrays of Mel-Frequency Cepstral Coefficients.  This is test code to run on the
8-item data set in the "sample_data" directory, for those without access to TIMIT.
Author: Jon Rein
'''
import os
import sys
import numpy as np
import tensorflow as tf  # requires TensorFlow 1.x (uses tf.placeholder and tf.contrib)
from tensorflow.python.ops import ctc_ops as ctc

# Learning Parameters
learningRate = 0.001
momentum = 0.9
nEpochs = 300
batchSize = 4

# Network Parameters
nFeatures = 26 # 12 MFCC coefficients + energy, and derivatives
nHidden = 128
nClasses = 28 # 27 characters, plus the "blank" for CTC

In [2]:
# Load data
sys.path += ['demos/tensorflow_CTC_example']
from utils import load_batched_data

# directory of MFCC nFeatures x nFrames 2-D array .npy files
INPUT_PATH = 'demos/tensorflow_CTC_example/sample_data/mfcc' 

# directory of nCharacters 1-D array .npy files
TARGET_PATH = 'demos/tensorflow_CTC_example/sample_data/char_y/'

batchedData, maxTimeSteps, totalN = load_batched_data(INPUT_PATH, TARGET_PATH, batchSize)

In [9]:
# Define graph
graph = tf.Graph()
with graph.as_default():

    ####NOTE: try variable-steps inputs and dynamic bidirectional rnn, when it's implemented in tensorflow
        
    ####Graph input
    inputX = tf.placeholder(tf.float32, shape=(maxTimeSteps, batchSize, nFeatures))
    #Prep input data to fit requirements of rnn.bidirectional_rnn
    #  Reshape to 2-D tensor (nTimeSteps*batchSize, nfeatures)
    inputXrs = tf.reshape(inputX, [-1, nFeatures])
    #  Split to get a list of 'n_steps' tensors of shape (batch_size, nFeatures)
    inputList = tf.split(inputXrs, maxTimeSteps, 0)
    targetIxs = tf.placeholder(tf.int64)
    targetVals = tf.placeholder(tf.int32)
    targetShape = tf.placeholder(tf.int64)
    targetY = tf.SparseTensor(targetIxs, targetVals, targetShape)
    seqLengths = tf.placeholder(tf.int32, shape=(batchSize))

    ####Weights & biases
    weightsOutH1 = tf.Variable(tf.truncated_normal([2, nHidden],
                                                   stddev=np.sqrt(2.0 / (2*nHidden))))
    biasesOutH1 = tf.Variable(tf.zeros([nHidden]))
    weightsOutH2 = tf.Variable(tf.truncated_normal([2, nHidden],
                                                   stddev=np.sqrt(2.0 / (2*nHidden))))
    biasesOutH2 = tf.Variable(tf.zeros([nHidden]))
    weightsClasses = tf.Variable(tf.truncated_normal([nHidden, nClasses],
                                                     stddev=np.sqrt(2.0 / nHidden)))
    biasesClasses = tf.Variable(tf.zeros([nClasses]))

    ####Network
    forwardH1 = tf.contrib.rnn.LSTMCell(nHidden, use_peepholes=True, state_is_tuple=True)
    backwardH1 = tf.contrib.rnn.LSTMCell(nHidden, use_peepholes=True, state_is_tuple=True)
    fbH1, _, _ = tf.contrib.rnn.static_bidirectional_rnn(forwardH1, backwardH1, inputList, dtype=tf.float32,
                                                         scope='BDLSTM_H1')
    fbH1rs = [tf.reshape(t, [batchSize, 2, nHidden]) for t in fbH1]
    outH1 = [tf.reduce_sum(tf.multiply(t, weightsOutH1), reduction_indices=1) + biasesOutH1 for t in fbH1rs]

    logits = [tf.matmul(t, weightsClasses) + biasesClasses for t in outH1]

    ####Optimizing
    logits3d = tf.stack(logits)
    loss = tf.reduce_mean(ctc.ctc_loss(targetY, logits3d, seqLengths))
    optimizer = tf.train.MomentumOptimizer(learningRate, momentum).minimize(loss)

    ####Evaluating
    logitsMaxTest = tf.slice(tf.argmax(logits3d, 2), [0, 0], [seqLengths[0], 1])
    predictions = tf.to_int32(ctc.ctc_beam_search_decoder(logits3d, seqLengths)[0][0])
    errorRate = tf.reduce_sum(tf.edit_distance(predictions, targetY, normalize=False)) / \
                tf.to_float(tf.size(targetY.values))

    # TensorBoard
    tf.summary.scalar('loss', loss)
    tf.summary.scalar('error_rate', errorRate)
    merged_summary_op = tf.summary.merge_all()

In [None]:
# Train

with tf.Session(graph=graph) as session:
    print('Initializing')

    tf.global_variables_initializer().run()

    # TensorBoard
    summary_writer = tf.summary.FileWriter('./logs/train', session.graph)
    
    for epoch in range(nEpochs):
        print('Epoch', epoch+1, '...')
        batchErrors = np.zeros(len(batchedData))
        batchRandIxs = np.random.permutation(len(batchedData)) #randomize batch order
        for batch, batchOrigI in enumerate(batchRandIxs):
            batchInputs, batchTargetSparse, batchSeqLengths = batchedData[batchOrigI]
            batchTargetIxs, batchTargetVals, batchTargetShape = batchTargetSparse
            feedDict = {inputX: batchInputs, targetIxs: batchTargetIxs, targetVals: batchTargetVals,
                        targetShape: batchTargetShape, seqLengths: batchSeqLengths}

            _, l, er, lmt, summary = session.run([optimizer, loss, errorRate, logitsMaxTest, merged_summary_op], feed_dict=feedDict)
 
            # global step = number of minibatches processed so far
            summary_writer.add_summary(summary, epoch * len(batchedData) + batch)
            
            print(np.unique(lmt)) #print unique argmax values of first sample in batch; should be blank for a while, then spit out target values
            if (batch % 1) == 0:
                print('Minibatch', batch, '/', batchOrigI, 'loss:', l)
                print('Minibatch', batch, '/', batchOrigI, 'error rate:', er)
            batchErrors[batch] = er*len(batchSeqLengths)
        epochErrorRate = batchErrors.sum() / totalN
        print('Epoch', epoch+1, 'error rate:', epochErrorRate)

Initializing
Epoch 1 ...
[ 0  1  2  3  4  5  6  7  8  9 11 12 14 16 17 18 19 20 21 22 23 24 25 27]
Minibatch 0 / 1 loss: 952.65106
Minibatch 0 / 1 error rate: 1.5361446
[27]
Minibatch 1 / 0 loss: 796.5039
Minibatch 1 / 0 error rate: 1.2878228
Epoch 1 error rate: 1.4119837284088135
Epoch 2 ...
[27]
Minibatch 0 / 1 loss: 729.9696
Minibatch 0 / 1 error rate: 0.92771083
[27]
Minibatch 1 / 0 loss: 375.2119
Minibatch 1 / 0 error rate: 0.7527675
Epoch 2 error rate: 0.8402391672134399
Epoch 3 ...
[27]
Minibatch 0 / 1 loss: 268.21786
Minibatch 0 / 1 error rate: 0.97590363
[27]
Minibatch 1 / 0 loss: 312.0193
Minibatch 1 / 0 error rate: 1.0
Epoch 3 error rate: 0.9879518151283264
Epoch 4 ...
[27]
Minibatch 0 / 1 loss: 504.63135
Minibatch 0 / 1 error rate: 1.0
[27]
Minibatch 1 / 0 loss: 404.34372
Minibatch 1 / 0 error rate: 1.0
Epoch 4 error rate: 1.0
Epoch 5 ...
[27]
Minibatch 0 / 1 loss: 411.62268
Minibatch 0 / 1 error rate: 0.9879518
[27]
Minibatch 1 / 0 loss: 205.49542
Minibatch 1 / 0 error rat

[27]
Minibatch 1 / 1 loss: 234.89272
Minibatch 1 / 1 error rate: 0.9608434
Epoch 41 error rate: 0.9453663527965546
Epoch 42 ...
[27]
Minibatch 0 / 1 loss: 234.77744
Minibatch 0 / 1 error rate: 0.95481926
[27]
Minibatch 1 / 0 loss: 182.19875
Minibatch 1 / 0 error rate: 0.9298893
Epoch 42 error rate: 0.942354291677475
Epoch 43 ...
[27]
Minibatch 0 / 1 loss: 234.37598
Minibatch 0 / 1 error rate: 0.9518072
[27]
Minibatch 1 / 0 loss: 182.28412
Minibatch 1 / 0 error rate: 0.92619926
Epoch 43 error rate: 0.9390032291412354
Epoch 44 ...
[27]
Minibatch 0 / 0 loss: 182.27621
Minibatch 0 / 0 error rate: 0.9335793
[27]
Minibatch 1 / 1 loss: 234.0835
Minibatch 1 / 1 error rate: 0.9518072
Epoch 44 error rate: 0.9426932632923126
Epoch 45 ...
[27]
Minibatch 0 / 1 loss: 234.00851
Minibatch 0 / 1 error rate: 0.94578314
[27]
Minibatch 1 / 0 loss: 182.07428
Minibatch 1 / 0 error rate: 0.92619926
Epoch 45 error rate: 0.9359911978244781
Epoch 46 ...
[27]
Minibatch 0 / 1 loss: 233.83617
Minibatch 0 / 1 error

[27]
Minibatch 0 / 1 loss: 220.4983
Minibatch 0 / 1 error rate: 0.9126506
[27]
Minibatch 1 / 0 loss: 170.90096
Minibatch 1 / 0 error rate: 0.9188192
Epoch 82 error rate: 0.9157348871231079
Epoch 83 ...
[27]
Minibatch 0 / 0 loss: 170.64674
Minibatch 0 / 0 error rate: 0.9188192
[27]
Minibatch 1 / 1 loss: 219.54288
Minibatch 1 / 1 error rate: 0.9126506
Epoch 83 error rate: 0.9157348871231079
Epoch 84 ...
[27]
Minibatch 0 / 1 loss: 219.23558
Minibatch 0 / 1 error rate: 0.9126506
[27]
Minibatch 1 / 0 loss: 169.73485
Minibatch 1 / 0 error rate: 0.9188192
Epoch 84 error rate: 0.9157348871231079
Epoch 85 ...
[27]
Minibatch 0 / 0 loss: 169.40288
Minibatch 0 / 0 error rate: 0.9114391
[27]
Minibatch 1 / 1 loss: 218.35645
Minibatch 1 / 1 error rate: 0.9066265
Epoch 85 error rate: 0.9090328216552734
Epoch 86 ...
[27]
Minibatch 0 / 1 loss: 218.01537
Minibatch 0 / 1 error rate: 0.9066265
[27]
Minibatch 1 / 0 loss: 168.50656
Minibatch 1 / 0 error rate: 0.9114391
Epoch 86 error rate: 0.9090328216552734

## Deep Speech

Deep Speech 1: https://arxiv.org/abs/1412.5567

Deep Speech 2: https://arxiv.org/abs/1512.02595

## Workshop: Deep Speech