##### This script associates spoken phrases with their corresponding translations by applying the Dynamic Time Warping (DTW) algorithm to the MFCCs of audio recordings.

##### Use the methods discovered so far to maximize accuracy and speed (i.e. shift / vote to get maximum accuracy, and a +/- 30% length threshold for speed). Also check the impact of reducing the number of MFCCs and of allowing a wider search around the warping path via the radius parameter.

## Dynamic Time Warping (DTW)

One of the challenges inherent in speech recognition is that the tenor of speech can vary, and this is particularly true for dysarthric speech, where imprecision is the result of disruption of muscular control. Although a dysarthric speaker will typically make consistent errors or distortions (which is why we are focusing on dysarthria rather than apraxia), they have difficulty with the speed of articulation or with phoneme transitions, and the same phrase spoken by the same speaker may vary in length.

We could use Euclidean distance to compare the similarity of two signals of the same duration, but if we want to compare signals of different duration, we need to transform one of them. DTW is an algorithm that allows for a non-linear transformation of the time series, either compressing or stretching it, so that the distance between the two signals is minimized (i.e. the optimal alignment is achieved, taking into account temporal distortions such as pauses or changes in speed).


![](img/DTW.png)

<sup>_image from E Keogh, Dept of CS, UC Riverside_</sup>

In this example, the two signals are actually fairly similar, but the peaks occur at different times. Using simple Euclidean distance, we would compare periods of silence in one signal with periods of noise in the other, whereas DTW allows us to effectively warp the signal and compress offset periods of silence so that the signals are much better aligned and we can make a more accurate comparison. The mechanics are described in more detail at the bottom of this workbook. 

By applying DTW, we can compare an input phrase to a dictionary of labeled phrases of different durations, and compare the minimized distances to determine which labeled phrase most closely matches the input.
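The idea can be illustrated with a minimal sketch of the classic dynamic-programming form of DTW on one-dimensional toy signals. This is not the `fastdtw` approximation used later in this notebook; the function `dtw_distance`, the toy signals and the `templates` dictionary are made up for illustration.

```python
import math

def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Classic O(len(a) * len(b)) DTW between two sequences."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(a[i - 1], b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # b[j-1] is reused for another element of a
                                    cost[i][j - 1],      # a[i-1] is reused for another element of b
                                    cost[i - 1][j - 1])  # both sequences advance together
    return cost[n][m]

# Two toy signals with the same shape, but the peak occurs later in the second one
early = [0, 1, 3, 1, 0, 0, 0, 0]
late = [0, 0, 0, 0, 1, 3, 1, 0]

euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(early, late)))
print(euclidean)                   # large: the peaks are compared against silence
print(dtw_distance(early, late))   # 0.0: warping lines the peaks up

# Matching an input against a small dictionary of labeled templates of different lengths
templates = {"peak": [0, 1, 3, 1, 0], "ramp": [0, 1, 2, 3, 4]}
best = min(templates, key=lambda name: dtw_distance(late, templates[name]))
print(best)                        # peak
```

The notebook applies the same principle to MFCC matrices, where each time step is a vector and the local distance is a vector norm.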

## 1) Load Packages

In [1]:
# Load packages

import pandas as pd
import cv2
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import datetime
import time

from scipy.spatial.distance import euclidean
from numpy.linalg import norm
from fastdtw import fastdtw, dtw

# import matplotlib.pylab as plt
from skimage import data, img_as_float
from skimage import exposure
import sklearn
import random
import itertools
import librosa
import librosa.display
import IPython.display as ipd

%matplotlib inline

## 2) Load Data

In [2]:
df = pd.read_csv('../index_TORGO.txt', sep="|", converters={'prompt_id': lambda x: str(x)})

# Try M05, whose speech is more severely impaired than F03
df = df.loc[(df['speaker'] == 'M05') & (df['mic'] == 'wav_headMic')]

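# Remove instances where prompt is missing or 'None', ends with "]" or contains "jpg"
# (isna() is used because an elementwise comparison with None never matches missing values)
df['remove'] = (df['prompt'].isna()) | (df['prompt']=='None') | (df['prompt'].str.contains('jpg', na=False)) | (df['prompt'].str.endswith(']', na=False))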
df = df.loc[df['remove'] == False]

# Now remove all instances of a single word, since we are trying to match phrases at this stage
df['remove'] = (df['prompt'].str.contains(' '))
df = df.loc[df['remove'] == True]

# Identify phrases that were recorded twice
df_filter = df.groupby(['prompt']).size().reset_index(name='counts')
df_filter = df_filter[df_filter['counts']==2]
df_filter


Unnamed: 0,prompt,counts
25,He further proposed grants of an unspecified s...,2
50,Mother sews yellow gingham aprons.,2
53,Nobody really expects to evacuate.,2
106,Why yell or worry over silly items?,2


So, at this stage, we have two recordings of four different phrases. This means that we can use one as a test file and compare it to the remaining recordings. Only one of those remaining recordings should be the correct match.

In [3]:
# Only keep examples that are recorded twice
df = df.merge(df_filter, on='prompt', how='inner')
df

Unnamed: 0,speaker,session,mic,prompt_id,has_spect,spect_width,spect_height,prompt,remove,counts
0,M05,Session2,wav_headMic,42,yes,434,513,Mother sews yellow gingham aprons.,True,2
1,M05,Session2,wav_headMic,43,yes,262,513,Mother sews yellow gingham aprons.,True,2
2,M05,Session2,wav_headMic,55,yes,284,513,Why yell or worry over silly items?,True,2
3,M05,Session2,wav_headMic,56,yes,288,513,Why yell or worry over silly items?,True,2
4,M05,Session2,wav_headMic,145,yes,543,513,He further proposed grants of an unspecified s...,True,2
5,M05,Session2,wav_headMic,146,yes,602,513,He further proposed grants of an unspecified s...,True,2
6,M05,Session2,wav_headMic,313,yes,299,513,Nobody really expects to evacuate.,True,2
7,M05,Session2,wav_headMic,314,yes,267,513,Nobody really expects to evacuate.,True,2


## 3) Compute MFCCs and set up required lists

In [15]:
def getaudio(file_path, top_db=15):
    data, rate = librosa.core.load(file_path)
    data, index = librosa.effects.trim(data, top_db=top_db)
    return ipd.Audio(data, rate=rate)

In [14]:
def generate_mfcc_lists(df=df, number_mfcc=20):
    
    """ 
    Takes a fixed-format dataframe containing information about the recordings, and outputs
    lists that are then used in the DTW calculation. Assumes that the dataframe has been reindexed
    so that the indexing begins at zero and ends at df.shape[0]-1, and that the recordings
    appear as sequential pairs sharing the same prompt
    
    Intermediate lists built for every recording:
    mfcc: the MFCCs corresponding to each recording
    prompts: a numerical code representing each prompt
    prompt_text: the actual text of each prompt
    train: flag used to split each pair (1 = test example, -1 = training example)
    pretrim_len: the length in seconds of each recording before trimming leading and trailing silence
    trim_len: the length in seconds of each recording after trimming
    aud_locs: path to the audio recording
    
    Returns x_train, x_test (lists of MFCCs), train_aud, test_aud (audio paths),
    train_len, test_len (trimmed lengths, for use when applying a threshold)
    and y_test (the correct labels of the test data)
    """

    # Load data
    mfcc = []
    prompts = []
    prompt_text = []
    train = []
    pretrim_len = []
    trim_len = []
    aud_locs = []

    # Loop through each row of the dataframe
    for p in range(df.shape[0]):
        aud_loc = '/'.join(['../data/TORGO', df['speaker'][p], df['session'][p], df['mic'][p], df['prompt_id'][p]+'.wav'])
        aud_locs.append(aud_loc)
        data, rate = librosa.load(aud_loc)

        # Record the duration, trim leading and trailing silence, then record the trimmed duration
        # (keyword arguments are used; newer librosa releases no longer accept these positionally)
        pretrim_len.append(round(librosa.get_duration(y=data, sr=rate),1))
        data, index = librosa.effects.trim(data, top_db=15)
        trim_len.append(round(librosa.get_duration(y=data, sr=rate),1))

        mfcc.append(librosa.feature.mfcc(y=data, sr=rate, n_mfcc=number_mfcc))
        prompts.append(p//2) # Each pair of audio files has the same prompt
        prompt_text.append(df['prompt'][p])
        train.append(1 if p % 2 == 0 else -1) # Even rows are flagged 1 (test), odd rows -1 (train)
    
    # Scale features
    for i,x in enumerate(mfcc): 
        mfcc[i] = sklearn.preprocessing.scale(mfcc[i], axis=1)
        
    # Assign data to train or test
    x_train = [mfcc[i] for i,x in enumerate(train) if x==-1]
    x_test = [mfcc[i] for i,x in enumerate(train) if x==1]
    y_test = [prompts[i] for i,x in enumerate(train) if x==1]
    
    # Store length of recordings for use in threshold testing
    train_len = [trim_len[i] for i,x in enumerate(train) if x==-1]
    test_len = [trim_len[i] for i,x in enumerate(train) if x==1]
    
    # Store audio location
    train_aud = [aud_locs[i] for i,x in enumerate(train) if x==-1]
    test_aud = [aud_locs[i] for i,x in enumerate(train) if x==1]
    
    return x_train, x_test, train_aud, test_aud, train_len, test_len, y_test
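
The even/odd assignment inside `generate_mfcc_lists` is easy to misread, because the flag value 1 ends up selecting the test examples. The sketch below mirrors only that bookkeeping on row positions, without loading any audio; `split_pairs` is a hypothetical helper, not part of the notebook.

```python
def split_pairs(n_recordings):
    """Mirror the even/odd assignment used in generate_mfcc_lists."""
    flags = [1 if p % 2 == 0 else -1 for p in range(n_recordings)]
    labels = [p // 2 for p in range(n_recordings)]   # both recordings of a pair share a label
    test_idx = [p for p, f in enumerate(flags) if f == 1]
    train_idx = [p for p, f in enumerate(flags) if f == -1]
    y_test = [labels[p] for p in test_idx]
    return train_idx, test_idx, y_test

train_idx, test_idx, y_test = split_pairs(8)
print(train_idx)  # [1, 3, 5, 7]
print(test_idx)   # [0, 2, 4, 6]
print(y_test)     # [0, 1, 2, 3]
```

Because the k-th training example and the k-th test example come from the same pair, a prediction is correct when the index of the closest training example equals the test label. This only holds if the dataframe rows really are ordered as sequential pairs, as the docstring assumes.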

## 4) Set up DTW calculations

In [6]:
# (1) NEW METHOD to speed up the algorithm AND potentially increase accuracy
#   Only run the DTW distance calculation on stored training phrases whose length (in seconds)
#   is within -/+ 30% of the requested test phrase. This window should be at least -/+ 5 seconds,
#   so that most short phrases are still compared to each other.

# (2) ALSO - the shift steps should not be a hard-coded range
# The range of the steps should be a % of the width of the requested test MFCC vector. Let's try -/+30% for this also.
# (It seems like 1 syllable takes a width of around 15 - very rough estimate)
# Run this commented code below to see how the shifts are determined for an MFCC of width 100 if we want -/+ 3 shifts = 7 total
# max_shift = 100*.3    # at -/+30%
# total_shifts = 7
# shift = int(max_shift/int(total_shifts/2))
# for d in range(shift * int(total_shifts/2) * -1, shift * int(total_shifts/2) + 1, shift):
#     print(d)

# Calculate the DTW distance

def calc_dtw(x_train, x_test, train_len, test_len, radius=1, total_shifts = 7):
    """
    Calculates the DTW distance between the test cases and the training data
    after applying a series of time shifts to the test data
    
    Returns an array of the DTW dist of each shifted MFCC against the training
    prompt, and prints out the time taken to run the calculation"""
    
    start = time.time()
    master_dist = []
    for i,x in enumerate(x_test):
        mfcc_dist = []
        # Default: For 7 total vectors - 3 shifts left, no shift, and 3 shifts right @ 15% range
        max_shift = x.shape[1]*0.15   # Indicate % range here
        # Total shifts will always be an odd number so there is the same number of shifts in each direction
        total_shifts = total_shifts + 1 if total_shifts % 2 == 0 else total_shifts
        shift = int(max_shift/int(total_shifts/2))
        for d in range(shift * int(total_shifts/2) * -1, shift * int(total_shifts/2) + 1, shift):
            dist = []
            for i2,x2 in enumerate(x_train):
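                # Length window: -/+ 30% of the reference length, with a minimum of -/+ 5 seconds.
                # NOTE: i indexes x_test and i2 indexes x_train, so train_len[i] and test_len[i2]
                # are not the lengths of the two MFCCs compared below (x and x2). The outputs
                # recorded in this notebook were produced with the indexing as written.
                len_threshold = max(train_len[i]*0.3, 5)
                min_thres = train_len[i] - len_threshold
                max_thres = train_len[i] + len_threshold

                # Run DTW dist only if the candidate length falls inside the window
                if min_thres <= test_len[i2] <= max_thres: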
                    distance, path = fastdtw(np.roll(x,d).T, x2.T, radius=radius, dist=lambda x, y: norm(x - y))
                # else assume they are not the same by assuming a very large distance
                else:
                    distance = 1000000

                dist.append(distance)

            mfcc_dist.append(dist)
        master_dist.append(mfcc_dist)
    
    end = time.time()
    calc_time = end - start
    
    print('MFCCs:{0}, Radius:{1}, Time:{2:.2f} sec'.format(x_train[0].shape[0], radius, calc_time))
    return master_dist


def prediction(master_dist, y_test):
    
    """
    Given an array of DTW distances and the correct labels associated with the test cases,
    check what the predicted label would be for each shifted MFCC vector by recording
    the minimum DTW distance between the test and training examples.
    The overall prediction is then the label with the minimum DTW distance across the
    entire array of shifted vectors
    
    Prints a table showing the correct label, the overall prediction, and the intermediate
    predictions for each shift of the test MFCC. Nothing is returned"""
    
    prediction_overalldist = []
    dtw_distance = []
    votes = []

    # Loop through each test example
    for i,x in enumerate(master_dist):
        vote = []
        # For each of the shifted vectors, get the prediction with min distance - the votes
        min_dist = 1000000
        for i2,x2 in enumerate(x):
            vote.append(x2.index(min(x2)))

            # Save the overall min distance from all shifted vectors = overall closest prediction
            if min(x2) < min_dist:
                min_dist = min(x2)
                min_overall = x2.index(min(x2))

        # Overall closest prediction out of the shifted MFCC vectors - the final vote
        prediction_overalldist.append(min_overall)
        dtw_distance.append(min_dist)

        # Track votes - determine if some vectors perform worse
        votes.append(vote)
    
    num_correct_overall = 0

    print('----------------------------------------------------------------------')
    print('Correct|Prediction|MFCC Predictions|DTW Distance')
    for i,x in enumerate(votes):
        print(y_test[i], '     |', prediction_overalldist[i], '       |', x, '       |', dtw_distance[i])
        if y_test[i] == prediction_overalldist[i]: num_correct_overall += 1
    print('----------------------------------------------------------------------')    
    print('% Correct (Overall):', num_correct_overall / len(y_test) * 100, '\n')
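
The shift logic in `calc_dtw` can be tried in isolation. The sketch below reproduces the offset calculation in a hypothetical helper, `shift_offsets`, and adds a guard for very narrow MFCCs, where the integer step would be zero and `range()` would raise an error. It also shows a NumPy detail worth knowing: `np.roll` without an `axis` argument flattens the array before shifting, so values move from the end of one coefficient row into the start of the next. The notebook does not say whether that or a shift along the time axis only (`axis=1`) is intended, so both are shown side by side.

```python
import numpy as np

def shift_offsets(width, pct=0.15, total_shifts=7):
    """Offsets used to shift a test MFCC of the given width (number of columns)."""
    if total_shifts % 2 == 0:          # force an odd count: same number of shifts each way
        total_shifts += 1
    half = total_shifts // 2
    step = int(width * pct / half)
    if step == 0:                      # range() rejects a zero step, so fall back to no shift
        return [0]
    return list(range(-step * half, step * half + 1, step))

print(shift_offsets(100, pct=0.30))    # [-30, -20, -10, 0, 10, 20, 30]
print(shift_offsets(10))               # [0]

# Toy "MFCC": 2 coefficients (rows) x 3 time steps (columns)
mfcc = np.arange(6).reshape(2, 3)
print(np.roll(mfcc, 1))                # [[5 0 1] [2 3 4]]  flattened roll, rows mix
print(np.roll(mfcc, 1, axis=1))        # [[2 0 1] [5 3 4]]  each row rotated along time only
```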
    


## 5) Check impact of different MFCC and radius parameters

### a) 20 MFCC's, radius=1

In [25]:
x_train, x_test, train_aud, test_aud, train_len, test_len, y_test = generate_mfcc_lists(df, 20)
master_dist = calc_dtw(x_train, x_test, train_len, test_len, radius=1)
prediction(master_dist, y_test)

MFCCs:20, Radius:1, Time:8.10 sec
----------------------------------------------------------------------
Correct|Prediction| MFCC Predictions
0      | 0        | [3, 3, 1, 1, 0, 0, 0]
1      | 1        | [0, 0, 3, 1, 3, 1, 1]
2      | 2        | [2, 2, 2, 2, 2, 2, 2]
3      | 3        | [3, 3, 3, 3, 0, 3, 0]
----------------------------------------------------------------------
% Correct (Overall): 100.0 
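
The `prediction` function takes the single smallest distance across all shifts as the overall prediction. A majority vote over the per-shift predictions is another way to combine them. The sketch below applies it to the votes printed above; `majority_vote` is a hypothetical helper, and this is an alternative to compare against, not what the notebook's code does.

```python
from collections import Counter

def majority_vote(votes):
    """Most common predicted label across the shifted copies of one test example."""
    return Counter(votes).most_common(1)[0][0]

# Per-shift predictions copied from the output above, keyed by the correct label
shift_votes = {
    0: [3, 3, 1, 1, 0, 0, 0],
    1: [0, 0, 3, 1, 3, 1, 1],
    2: [2, 2, 2, 2, 2, 2, 2],
    3: [3, 3, 3, 3, 0, 3, 0],
}
voted = {label: majority_vote(votes) for label, votes in shift_votes.items()}
print(voted)  # {0: 0, 1: 1, 2: 2, 3: 3}
```

On this run both rules agree, but prompts 0 and 1 win the vote by a single shift, so the two rules could diverge on other data.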



#### NOTES from checking different MFCC and radius parameters:

1. Reducing the number of MFCCs seems to give better results. For example, with 20 MFCCs, prompt zero is predicted incorrectly six times out of seven, but with 13 MFCCs it is incorrect only four times out of seven.

2. Reducing the number of MFCCs also results in a gain in speed, but the gain is fairly marginal (0.08 sec on the fastest run, increasing to 1.17 sec on the slowest run).

3. Increasing the radius (i.e. the width of the band in which warping paths can be calculated) increases the accuracy, but there is a time penalty. For instance, with 20 MFCCs and radius 1, six of the seven predictions for prompt zero are incorrect, but with radius 10, only three of the seven are incorrect. Unfortunately, the calculation time increases from roughly 5 seconds to roughly 37 seconds.


##### Overall: We should probably reduce the number of MFCCs to 13, but before adjusting the radius we need to know how much time the shifts add, so that we can work out whether it is more efficient to increase accuracy by shifting or by increasing the radius.


Different papers suggest either 12 or 13 MFCCs for dysarthric speech, and both seem to work well.

##### Check the other speakers

In [7]:
def create_df(speaker = 'ALL'):
    """
    Create a dataframe containing information to allow the creation of path names and 
    easy identification of prompts
    
    The various actions are
    - filter on the speaker fed in as an argument
    - select only recordings made using the head mic
    - remove instances without text prompts
    - remove single words to just leave multi-word phrases
    - select only phrases that were recorded twice
    """
    
    df = pd.read_csv('../index_TORGO.txt', sep="|", converters={'prompt_id': lambda x: str(x)})
    df = df.loc[(df['mic'] == 'wav_headMic')]
    if speaker != 'ALL':
        df = df.loc[(df['speaker'] == speaker)]
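    # isna() is used because an elementwise comparison with None never matches missing values
    df['remove'] = (df['prompt'].isna()) | (df['prompt']=='None') | (df['prompt'].str.contains('jpg', na=False)) | (df['prompt'].str.endswith(']', na=False))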
    df = df.loc[df['remove'] == False]
    df['remove'] = (df['prompt'].str.contains(' '))
    df = df.loc[df['remove'] == True]
    df_filter = df.groupby(['speaker', 'prompt']).size().reset_index(name='counts')
    df_filter = df_filter[df_filter['counts']==2]
    df = df.merge(df_filter, on=['speaker', 'prompt'], how='inner')
    df['audloc'] = '../data/TORGO/' + df['speaker'] + '/' + df['session'] + '/' + df['mic'] + '/' +  df['prompt_id'] + '.wav'
    return df

## F01

In [101]:
df = create_df('F01')
df.drop('audloc', axis=1)

Unnamed: 0,speaker,session,mic,prompt_id,has_spect,spect_width,spect_height,prompt,remove,counts
0,F01,Session1,wav_headMic,27,yes,165,513,"but he always answers, Banana oil!",True,2
1,F01,Session1,wav_headMic,28,yes,113,513,"but he always answers, Banana oil!",True,2
2,F01,Session1,wav_headMic,30,yes,151,513,The quick brown fox jumps over the lazy dog.,True,2
3,F01,Session1,wav_headMic,31,yes,145,513,The quick brown fox jumps over the lazy dog.,True,2


In [30]:
x_train, x_test, train_aud, test_aud, train_len, test_len, y_test = generate_mfcc_lists(df, 13)
master_dist = calc_dtw(x_train, x_test, train_len, test_len, radius=1)
prediction(master_dist, y_test)

MFCCs:13, Radius:1, Time:0.85 sec
----------------------------------------------------------------------
Correct|Prediction| MFCC Predictions
0      | 0        | [0, 0, 0, 0, 0, 0, 0]
1      | 1        | [0, 0, 1, 0, 1, 1, 1]
----------------------------------------------------------------------
% Correct (Overall): 100.0 



## F03

In [31]:
df = create_df('F03')
df.drop('audloc', axis=1)

Unnamed: 0,speaker,session,mic,prompt_id,has_spect,spect_width,spect_height,prompt,remove,counts
0,F03,Session2,wav_headMic,6,yes,441,513,"If you destroy confidence in banks, you do som...",True,2
1,F03,Session2,wav_headMic,7,yes,290,513,"If you destroy confidence in banks, you do som...",True,2
2,F03,Session2,wav_headMic,31,yes,236,513,Two other cases also were under advisement.,True,2
3,F03,Session2,wav_headMic,32,yes,183,513,Two other cases also were under advisement.,True,2
4,F03,Session2,wav_headMic,130,yes,179,513,The dolphins swam around our boat.,True,2
5,F03,Session2,wav_headMic,131,yes,200,513,The dolphins swam around our boat.,True,2
6,F03,Session3,wav_headMic,38,yes,214,513,Some hotels are available nearby.,True,2
7,F03,Session3,wav_headMic,63,yes,88,513,Some hotels are available nearby.,True,2
8,F03,Session3,wav_headMic,55,yes,116,513,The results were very disappointing.,True,2
9,F03,Session3,wav_headMic,130,yes,97,513,The results were very disappointing.,True,2


In [32]:
x_train, x_test, train_aud, test_aud, train_len, test_len, y_test = generate_mfcc_lists(df, 13)
master_dist = calc_dtw(x_train, x_test, train_len, test_len, radius=1)
prediction(master_dist, y_test)

MFCCs:13, Radius:1, Time:10.54 sec
----------------------------------------------------------------------
Correct|Prediction| MFCC Predictions
0      | 1        | [7, 1, 1, 1, 1, 1, 1]
1      | 1        | [3, 1, 5, 4, 4, 4, 4]
2      | 4        | [3, 3, 4, 3, 3, 4, 3]
3      | 3        | [3, 4, 3, 4, 3, 3, 4]
4      | 4        | [4, 4, 4, 4, 3, 3, 3]
5      | 5        | [5, 5, 4, 5, 5, 5, 4]
6      | 6        | [4, 6, 3, 4, 3, 6, 6]
7      | 7        | [4, 6, 4, 7, 4, 4, 3]
----------------------------------------------------------------------
% Correct (Overall): 75.0 



##### Review incorrect predictions

In [14]:
# Prompts 0 and 2 are predicted incorrectly. The lengths are close enough, so length is not the issue.
print(train_len)
print(test_len)

[8.4, 3.8, 3.7, 2.0, 2.0, 2.1, 2.6, 5.4]
[9.5, 4.6, 3.2, 2.6, 2.1, 2.2, 2.3, 5.9]


##### 1) Prompt 0: Issue: The speaker stumbles and repeats part of the phrase

In [35]:
# Review prompt 0
getaudio(train_aud[0])

Prompt: 0


In [36]:
getaudio(test_aud[0])

In [37]:
print(len(master_dist))
pd.DataFrame(master_dist[0])

8


Unnamed: 0,0,1,2,3,4,5,6,7
0,1907.089362,1887.76406,1000000,1000000,1000000,1000000,1000000,1721.005013
1,1899.009629,1784.330929,1000000,1000000,1000000,1000000,1000000,1786.811553
2,1968.821767,1674.358631,1000000,1000000,1000000,1000000,1000000,1848.425752
3,1798.78168,1700.996255,1000000,1000000,1000000,1000000,1000000,1781.210843
4,1881.903432,1621.89989,1000000,1000000,1000000,1000000,1000000,1812.156407
5,1807.946033,1802.876457,1000000,1000000,1000000,1000000,1000000,1828.62952
6,1874.3349,1773.859964,1000000,1000000,1000000,1000000,1000000,1899.396516
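
The columns holding 1000000 in the table above are the candidates that `calc_dtw` skipped because of the length window. The sketch below reproduces that gate for test example 0 using the lengths printed earlier. `passes_length_gate` is a hypothetical helper. The first list uses the pairing exactly as written in `calc_dtw` (`train_len[i]` against `test_len[i2]`); the second uses the test phrase's own length against each stored training length, which is an assumption about what may have been intended, not something the notebook confirms.

```python
def passes_length_gate(ref_len, cand_len, pct=0.3, min_window=5):
    """True if cand_len lies within -/+ max(pct * ref_len, min_window) seconds of ref_len."""
    window = max(ref_len * pct, min_window)
    return ref_len - window <= cand_len <= ref_len + window

train_len = [8.4, 3.8, 3.7, 2.0, 2.0, 2.1, 2.6, 5.4]
test_len = [9.5, 4.6, 3.2, 2.6, 2.1, 2.2, 2.3, 5.9]

# Pairing as written in calc_dtw for test example i = 0
as_written = [i2 for i2, t in enumerate(test_len) if passes_length_gate(train_len[0], t)]
# Alternative pairing: length of test example 0 against each training length
alternative = [i2 for i2, t in enumerate(train_len) if passes_length_gate(test_len[0], t)]

print(as_written)    # [0, 1, 7]
print(alternative)   # [0, 7]
```

The first list matches the columns with real distances in the table (0, 1 and 7). Under the alternative pairing, training example 1 would not be compared to test example 0 at all.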


In [38]:
# It is mistaking it for Prompt 1  -- Why?
getaudio(train_aud[1])

##### Checking other parameters to see whether we can improve the results

It looks like we can get 100% accuracy on F03 by increasing the radius, so let's move on.

In [39]:
x_train, x_test, train_aud, test_aud, train_len, test_len, y_test = generate_mfcc_lists(df, 13)
master_dist = calc_dtw(x_train, x_test, train_len, test_len, radius=5)
prediction(master_dist, y_test)

MFCCs:13, Radius:5, Time:36.23 sec
----------------------------------------------------------------------
Correct|Prediction| MFCC Predictions
0      | 0        | [1, 1, 1, 0, 7, 1, 1]
1      | 1        | [1, 1, 4, 1, 3, 3, 6]
2      | 2        | [3, 3, 2, 2, 2, 4, 4]
3      | 3        | [3, 3, 3, 3, 3, 3, 3]
4      | 4        | [4, 4, 4, 4, 4, 4, 4]
5      | 5        | [5, 5, 5, 5, 5, 5, 5]
6      | 6        | [6, 5, 6, 6, 6, 6, 6]
7      | 7        | [3, 4, 7, 7, 7, 5, 6]
----------------------------------------------------------------------
% Correct (Overall): 100.0 



In [40]:
x_train, x_test, train_aud, test_aud, train_len, test_len, y_test = generate_mfcc_lists(df, 20)
master_dist = calc_dtw(x_train, x_test, train_len, test_len, radius=1)
prediction(master_dist, y_test)

MFCCs:20, Radius:1, Time:10.64 sec
----------------------------------------------------------------------
Correct|Prediction| MFCC Predictions
0      | 1        | [7, 1, 1, 1, 1, 7, 1]
1      | 5        | [5, 4, 6, 5, 4, 6, 3]
2      | 5        | [4, 3, 4, 5, 3, 4, 4]
3      | 3        | [3, 4, 3, 3, 3, 3, 5]
4      | 4        | [4, 4, 4, 4, 3, 3, 3]
5      | 5        | [5, 5, 5, 5, 5, 4, 3]
6      | 6        | [3, 4, 4, 3, 3, 6, 6]
7      | 7        | [5, 4, 2, 7, 5, 4, 1]
----------------------------------------------------------------------
% Correct (Overall): 62.5 



The results are worse with 20 MFCCs and radius 1, so 13 does look like the best choice. We need to adjust the radius to get better results.

## M01

In [76]:
df = create_df('M01')
df.drop('audloc', axis=1)

Unnamed: 0,speaker,session,mic,prompt_id,has_spect,spect_width,spect_height,prompt,remove,counts
0,M01,Session1,wav_headMic,12,no,,,"A long, flowing beard clings to his chin,",True,2
1,M01,Session1,wav_headMic,13,no,,,"A long, flowing beard clings to his chin,",True,2
2,M01,Session1,wav_headMic,27,no,,,I can read,True,2
3,M01,Session1,wav_headMic,58,no,,,I can read,True,2
4,M01,Session1,wav_headMic,40,no,,,"but he always answers, ""Banana oil!""",True,2
5,M01,Session1,wav_headMic,41,no,,,"but he always answers, ""Banana oil!""",True,2
6,M01,Session2_3,wav_headMic,159,yes,378.0,513.0,This is not a program of socialized medicine.,True,2
7,M01,Session2_3,wav_headMic,160,yes,213.0,513.0,This is not a program of socialized medicine.,True,2


In [77]:
x_train, x_test, train_aud, test_aud, train_len, test_len, y_test = generate_mfcc_lists(df, 13)
master_dist = calc_dtw(x_train, x_test, train_len, test_len, radius=1)
prediction(master_dist, y_test)

MFCCs:13, Radius:1, Time:5.41 sec
----------------------------------------------------------------------
Correct|Prediction| MFCC Predictions
0      | 0        | [2, 0, 0, 0, 0, 0, 0]
1      | 1        | [1, 1, 1, 1, 1, 1, 1]
2      | 1        | [1, 1, 0, 1, 0, 0, 0]
3      | 3        | [0, 0, 3, 3, 3, 0, 2]
----------------------------------------------------------------------
% Correct (Overall): 75.0 



In [78]:
x_train, x_test, train_aud, test_aud, train_len, test_len, y_test = generate_mfcc_lists(df, 13)
master_dist = calc_dtw(x_train, x_test, train_len, test_len, radius=5)
prediction(master_dist, y_test)

MFCCs:13, Radius:5, Time:19.13 sec
----------------------------------------------------------------------
Correct|Prediction| MFCC Predictions
0      | 0        | [0, 0, 0, 0, 0, 0, 2]
1      | 1        | [1, 1, 1, 1, 1, 1, 1]
2      | 1        | [1, 1, 1, 1, 1, 1, 1]
3      | 3        | [3, 0, 3, 3, 3, 1, 3]
----------------------------------------------------------------------
% Correct (Overall): 75.0 



##### Review incorrect predictions

In [79]:
# Prompt 2 is predicted incorrectly: it is always matched to prompt 1. There is no obvious length issue
print(train_len)
print(test_len)

[5.8, 1.7, 5.8, 6.1]
[8.8, 2.1, 5.5, 10.7]


##### 1) Prompt 2: Issue: The speaker does not appear to finish saying the phrase correctly in the test audio. He says it right in the train audio.

In [82]:
# Review prompt 2
getaudio(train_aud[2])

In [83]:
getaudio(test_aud[2])

In [47]:
pd.DataFrame(master_dist[2])

Unnamed: 0,0,1,2,3
0,1130.941742,916.448344,1102.324568,1248.162762
1,1108.997901,917.516898,1262.979734,1200.356812
2,1174.376379,953.580497,1041.487322,1206.680658
3,1104.390698,973.65474,1089.514538,1215.802912
4,1131.709951,968.273015,1130.562558,1209.949959
5,1089.742148,979.016002,1060.911975,1208.316583
6,1072.437185,999.085663,1035.963209,1130.974747


In [48]:
# Mistakes it for prompt 1. Is there a trend of choosing the shortest training audio if it does not match well?
getaudio(train_aud[1])

##### NOTES from reviewing incorrect predictions:

1) We need a way to determine if a test phrase does not match any stored train phrases

2) We need to allow the user to delete train phrases, in case they messed up

3) Keep watching out for phrases it gets incorrect. Is there a pattern of choosing shorter recordings when the DTW path is unsure?

## M02

In [51]:
df = create_df('M02')
df.drop('audloc', axis=1)

Unnamed: 0,speaker,session,mic,prompt_id,has_spect,spect_width,spect_height,prompt,remove,counts
0,M02,Session1,wav_headMic,36,yes,98,513,I can read,True,2
1,M02,Session1,wav_headMic,103,yes,164,513,I can read,True,2
2,M02,Session2,wav_headMic,212,yes,204,513,The job provides many benefits.,True,2
3,M02,Session2,wav_headMic,213,yes,247,513,The job provides many benefits.,True,2


In [52]:
x_train, x_test, train_aud, test_aud, train_len, test_len, y_test = generate_mfcc_lists(df, 13)
master_dist = calc_dtw(x_train, x_test, train_len, test_len, radius=1)
prediction(master_dist, y_test)

MFCCs:13, Radius:1, Time:0.65 sec
----------------------------------------------------------------------
Correct|Prediction| MFCC Predictions
0      | 0        | [0, 0, 0, 0, 0, 0, 0]
1      | 1        | [1, 1, 1, 1, 1, 1, 1]
----------------------------------------------------------------------
% Correct (Overall): 100.0 



## M03

In [53]:
df = create_df('M03')
df.drop('audloc', axis=1)

Unnamed: 0,speaker,session,mic,prompt_id,has_spect,spect_width,spect_height,prompt,remove,counts
0,M03,Session2,wav_headMic,13,yes,120,513,"Well, he is nearly ninety-three years old;",True,2
1,M03,Session2,wav_headMic,14,yes,147,513,"Well, he is nearly ninety-three years old;",True,2
2,M03,Session2,wav_headMic,99,yes,183,513,he dresses himself in an ancient black frock c...,True,2
3,M03,Session2,wav_headMic,100,yes,173,513,he dresses himself in an ancient black frock c...,True,2
4,M03,Session2,wav_headMic,101,yes,194,513,"When he speaks, his voice is just a bit cracke...",True,2
5,M03,Session2,wav_headMic,102,yes,224,513,"When he speaks, his voice is just a bit cracke...",True,2
6,M03,Session2,wav_headMic,244,yes,203,513,"If you destroy confidence in banks, you do som...",True,2
7,M03,Session2,wav_headMic,245,yes,194,513,"If you destroy confidence in banks, you do som...",True,2


In [54]:
x_train, x_test, train_aud, test_aud, train_len, test_len, y_test = generate_mfcc_lists(df, 13)
master_dist = calc_dtw(x_train, x_test, train_len, test_len, radius=1)
prediction(master_dist, y_test)

MFCCs:13, Radius:1, Time:4.59 sec
----------------------------------------------------------------------
Correct|Prediction| MFCC Predictions
0      | 0        | [0, 0, 0, 0, 0, 0, 0]
1      | 1        | [1, 0, 1, 1, 0, 1, 1]
2      | 0        | [1, 1, 1, 0, 0, 0, 0]
3      | 3        | [0, 0, 3, 3, 1, 0, 0]
----------------------------------------------------------------------
% Correct (Overall): 75.0 



In [55]:
x_train, x_test, train_aud, test_aud, train_len, test_len, y_test = generate_mfcc_lists(df, 13)
master_dist = calc_dtw(x_train, x_test, train_len, test_len, radius=5)
prediction(master_dist, y_test)

MFCCs:13, Radius:5, Time:14.05 sec
----------------------------------------------------------------------
Correct|Prediction| MFCC Predictions
0      | 0        | [0, 0, 0, 0, 0, 0, 0]
1      | 1        | [1, 1, 1, 1, 1, 1, 1]
2      | 0        | [0, 0, 0, 0, 0, 0, 0]
3      | 3        | [0, 3, 3, 3, 3, 3, 0]
----------------------------------------------------------------------
% Correct (Overall): 75.0 



##### Review incorrect predictions

In [56]:
# It is getting prompt 2 incorrect. It always predicts it matches to prompt 0
print(train_len)
print(test_len)

[2.2, 3.2, 6.3, 4.1]
[3.6, 3.7, 4.2, 5.0]


##### 1) Prompt 2: Issue: The speaker messes up the train phrase at the beginning and then quickly starts over. The train phrase is 2 seconds longer because of this. They are still compared, but do not match

In [57]:
# Review prompt 2
getaudio(train_aud[2])

In [58]:
getaudio(test_aud[2])

In [61]:
# The distances are correlated with the length of the training audio. We need to look into this.
print(train_len)
pd.DataFrame(master_dist[2])

[2.2, 3.2, 6.3, 4.1]


Unnamed: 0,0,1,2,3
0,720.359934,748.014258,1114.82921,835.915127
1,699.19576,718.892759,1158.124528,826.767283
2,706.161019,732.478629,777.660156,843.661679
3,690.318477,754.248511,1015.316379,838.190195
4,711.925112,770.780942,878.541136,838.577421
5,711.676596,761.622342,916.501974,857.207668
6,705.63003,748.375692,1034.724757,842.236349


In [63]:
# Let's take a look at all 4 distance matrices
for i, x in enumerate(master_dist):
    print(pd.DataFrame(x))

            0           1            2           3
0  584.150260  730.519354  1081.414194  819.099201
1  604.021955  694.929562  1223.893989  791.067317
2  578.759856  704.076280  1105.781842  789.611441
3  503.630000  688.888150  1140.587804  787.444192
4  529.127214  670.381228  1144.263710  817.034964
5  508.073362  684.053169  1138.146051  818.568667
6  498.602790  692.475959  1187.075183  798.045215
            0           1            2           3
0  706.217920  597.931072  1097.473989  751.735630
1  689.768671  535.922417  1133.905670  747.697209
2  675.410857  469.664015  1145.978386  776.931415
3  651.251615  379.661746  1138.673998  745.042205
4  642.738586  383.704826  1141.291707  774.693638
5  634.985561  443.062659  1159.250711  793.489860
6  658.717277  499.969060  1089.746950  805.038001
            0           1            2           3
0  720.359934  748.014258  1114.829210  835.915127
1  699.195760  718.892759  1158.124528  826.767283
2  706.161019  732.478629   777

##### Note: One thing that stands out for prompt 2 is the large % difference between the maximum and minimum distance (it drops from 1158 to 777 in a single shift) when the shifted copies of test prompt 2 are compared to training prompt 2. The other 3 prompts have much closer distances across all of the shifts. It is as if one of the shifts came close to the correct match, but was still beaten by prompt 0.

We need to look into the specifics of the distance calculation. Does it calculate in a way that favors matching to shorter phrases? If so, how can we adjust for this?
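
One possible reason for a bias toward shorter phrases is that a DTW distance is a sum of one local distance per step of the warping path, and the path has at least as many steps as the longer of the two sequences. The sketch below uses a plain dynamic-programming DTW on toy data (not `fastdtw`) to show the effect and one common adjustment, dividing by the path length. The helpers `dtw_with_path_length` and `normalized` are hypothetical, and whether this explains the results above has not been verified here.

```python
import math

def dtw_with_path_length(a, b):
    """Return (accumulated DTW cost, number of steps on the optimal path)."""
    n, m = len(a), len(b)
    # Each cell holds (cost, steps); tuples compare by cost first, then by steps
    table = [[(math.inf, 0)] * (m + 1) for _ in range(n + 1)]
    table[0][0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev_cost, prev_steps = min(table[i - 1][j], table[i][j - 1], table[i - 1][j - 1])
            table[i][j] = (prev_cost + abs(a[i - 1] - b[j - 1]), prev_steps + 1)
    return table[n][m]

def normalized(a, b):
    """Accumulated cost divided by the number of path steps."""
    cost, steps = dtw_with_path_length(a, b)
    return cost / steps

query = [0, 0, 0, 0]
short = [1, 1, 1, 1]       # same per-step mismatch as `long`, but half the length
long = [1] * 8

print(dtw_with_path_length(query, short))               # (4.0, 4)
print(dtw_with_path_length(query, long))                # (8.0, 8)
print(normalized(query, short), normalized(query, long))  # 1.0 1.0
```

With `fastdtw`, the second return value is the warping path, so `len(path)` could serve as the divisor if this normalization is tried on the real data.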

## M04

In [64]:
df = create_df('M04')
df.drop('audloc', axis=1)

| | speaker | session | mic | prompt_id | has_spect | spect_width | spect_height | prompt | remove | counts |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M04 | Session2 | wav_headMic | 45 | yes | 466 | 513 | We have often urged him to walk more and smoke... | True | 2 |
| 1 | M04 | Session2 | wav_headMic | 46 | yes | 349 | 513 | We have often urged him to walk more and smoke... | True | 2 |
| 2 | M04 | Session2 | wav_headMic | 155 | yes | 254 | 513 | I was conscious all the time. | True | 2 |
| 3 | M04 | Session2 | wav_headMic | 156 | yes | 235 | 513 | I was conscious all the time. | True | 2 |
| 4 | M04 | Session2 | wav_headMic | 182 | yes | 204 | 513 | He will allow a rare lie. | True | 2 |
| 5 | M04 | Session2 | wav_headMic | 183 | yes | 178 | 513 | He will allow a rare lie. | True | 2 |
| 6 | M04 | Session2 | wav_headMic | 246 | yes | 328 | 513 | Nothing is as offensive as innocence. | True | 2 |
| 7 | M04 | Session2 | wav_headMic | 248 | yes | 273 | 513 | Nothing is as offensive as innocence. | True | 2 |


In [65]:
x_train, x_test, train_aud, test_aud, train_len, test_len, y_test = generate_mfcc_lists(df, 13)
master_dist = calc_dtw(x_train, x_test, train_len, test_len, radius=1)
prediction(master_dist, y_test)

MFCCs:13, Radius:1, Time:5.79 sec
----------------------------------------------------------------------
Correct|Prediction| MFCC Predictions
0      | 0        | [1, 1, 0, 3, 0, 2, 2]
1      | 1        | [1, 1, 1, 1, 1, 2, 2]
2      | 2        | [2, 2, 2, 2, 2, 2, 2]
3      | 1        | [3, 2, 2, 1, 1, 2, 2]
----------------------------------------------------------------------
% Correct (Overall): 75.0 



In [66]:
x_train, x_test, train_aud, test_aud, train_len, test_len, y_test= generate_mfcc_lists(df, 13)
master_dist = calc_dtw(x_train, x_test, train_len, test_len, radius=5)
prediction(master_dist, y_test)

MFCCs:13, Radius:5, Time:24.66 sec
----------------------------------------------------------------------
Correct|Prediction| MFCC Predictions
0      | 0        | [0, 1, 1, 0, 0, 0, 2]
1      | 1        | [1, 1, 1, 1, 2, 2, 2]
2      | 2        | [2, 2, 2, 2, 2, 2, 2]
3      | 2        | [2, 2, 2, 1, 2, 2, 2]
----------------------------------------------------------------------
% Correct (Overall): 75.0 



##### Review incorrect predictions

In [71]:
# It is getting prompt 3 incorrect. It mainly predicts it matches to prompt 2 (which is the shortest)
print(train_len)
print(test_len)

[8.9, 5.8, 3.5, 8.5]
[13.4, 6.5, 4.9, 8.0]


##### 1) Prompt 3 issue: This speaker is essentially impossible to understand, and this may be the worst case of dysarthric speech. However, it is a perfect example of where our system is useful, since ASR may never be possible for this person.

In [69]:
# Review prompt 3
getaudio(train_aud[3])

In [70]:
getaudio(test_aud[3])

In [72]:
pd.DataFrame(master_dist[3])

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1701.05648 | 1481.41376 | 1470.113646 | 1506.729353 |
| 1 | 1713.203852 | 1484.073776 | 1461.882264 | 1643.568283 |
| 2 | 1602.928019 | 1458.615735 | 1415.242917 | 1689.947306 |
| 3 | 1684.426223 | 1414.781036 | 1423.549735 | 1698.915485 |
| 4 | 1751.786144 | 1511.109233 | 1379.061539 | 1728.518761 |
| 5 | 1747.636075 | 1599.481036 | 1384.468975 | 1683.774996 |
| 6 | 1691.302229 | 1532.9309 | 1484.897007 | 1651.685677 |
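
The per-shift predictions printed for prompt 3 can be recovered from this table by taking the column with the smallest distance in each row. The sketch below assumes that rows are the seven shifts of the test clip and columns are the four training prompts, which is consistent with the radius=5 output above. The source of `prediction` is not shown here, so the two ways of combining the shifts (majority vote and global minimum) are illustrative assumptions, not its actual logic.

```python
from collections import Counter

# Distances copied from the master_dist[3] table above.
# Assumed layout: one row per shift, one column per training prompt.
dist = [
    [1701.05648, 1481.41376, 1470.113646, 1506.729353],
    [1713.203852, 1484.073776, 1461.882264, 1643.568283],
    [1602.928019, 1458.615735, 1415.242917, 1689.947306],
    [1684.426223, 1414.781036, 1423.549735, 1698.915485],
    [1751.786144, 1511.109233, 1379.061539, 1728.518761],
    [1747.636075, 1599.481036, 1384.468975, 1683.774996],
    [1691.302229, 1532.9309, 1484.897007, 1651.685677],
]

# Closest training prompt for each shift (argmin along each row).
per_shift = [row.index(min(row)) for row in dist]

# Two possible ways to combine the shifts (both are assumptions).
vote = Counter(per_shift).most_common(1)[0][0]  # majority vote
global_min = min((d, col) for row in dist for col, d in enumerate(row))[1]  # smallest distance overall

print(per_shift)         # [2, 2, 2, 1, 2, 2, 2]
print(vote, global_min)  # 2 2
```

Both aggregations pick prompt 2 here, so the wrong answer for prompt 3 does not depend on how the shifts are combined.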


In [73]:
getaudio(train_aud[2])

## Test using all speakers

In [16]:
df = create_df('ALL')
df = df.sort_values(by=['prompt','speaker'])
df['prompt_instance'] = df.groupby(['prompt']).cumcount()+1
df = df[df['prompt_instance'] < 3]  # Keep only the first 2 instances of each prompt
df = df[df['prompt'] != 'but he always answers, Banana oil!']  # This prompt occurs for two speakers with slightly different labels, so drop it
df = df.reset_index()
df.drop('audloc', axis=1)

| | index | speaker | session | mic | prompt_id | has_spect | spect_width | spect_height | prompt | remove | counts | prompt_instance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | M01 | Session1 | wav_headMic | 12 | no | | | A long, flowing beard clings to his chin, | True | 2 | 1 |
| 1 | 23 | M01 | Session1 | wav_headMic | 13 | no | | | A long, flowing beard clings to his chin, | True | 2 | 2 |
| 2 | 14 | F03 | Session3 | wav_headMic | 58 | yes | 86.0 | 513.0 | Each one volunteered to jump first. | True | 2 | 1 |
| 3 | 15 | F03 | Session3 | wav_headMic | 67 | yes | 85.0 | 513.0 | Each one volunteered to jump first. | True | 2 | 2 |
| 4 | 18 | F03 | Session3 | wav_headMic | 153 | yes | 221.0 | 513.0 | He further proposed grants of an unspecified s... | True | 2 | 1 |
| 5 | 19 | F03 | Session3 | wav_headMic | 209 | yes | 221.0 | 513.0 | He further proposed grants of an unspecified s... | True | 2 | 2 |
| 6 | 46 | M04 | Session2 | wav_headMic | 182 | yes | 204.0 | 513.0 | He will allow a rare lie. | True | 2 | 1 |
| 7 | 47 | M04 | Session2 | wav_headMic | 183 | yes | 178.0 | 513.0 | He will allow a rare lie. | True | 2 | 2 |
| 8 | 24 | M01 | Session1 | wav_headMic | 27 | no | | | I can read | True | 2 | 1 |
| 9 | 25 | M01 | Session1 | wav_headMic | 58 | no | | | I can read | True | 2 | 2 |


In [9]:
print('Number of phrases:', len(df)/2)

Number of phrases: 23.0


In [17]:
x_train, x_test, train_aud, test_aud, train_len, test_len, y_test = generate_mfcc_lists(df, 13)
master_dist = calc_dtw(x_train, x_test, train_len, test_len, radius=1)
prediction(master_dist, y_test)

MFCCs:13, Radius:1, Time:115.12 sec
----------------------------------------------------------------------
Correct|Prediction|MFCC Predictions|DTW Distance
0      | 0        | [8, 8, 5, 0, 0, 8, 12]        | 1218.3885753701718
1      | 1        | [14, 1, 10, 1, 1, 14, 10]        | 274.95999429743654
2      | 2        | [14, 10, 14, 2, 2, 22, 14]        | 716.1555284994479
3      | 11        | [8, 11, 18, 3, 22, 10, 22]        | 793.444391674735
4      | 4        | [4, 4, 4, 4, 4, 14, 14]        | 234.51225827346803
5      | 5        | [18, 11, 5, 10, 10, 11, 18]        | 1066.6719382509475
6      | 15        | [15, 13, 15, 15, 15, 22, 15]        | 1600.617662214682
7      | 7        | [1, 5, 11, 11, 22, 7, 14]        | 855.8166506849623
8      | 8        | [8, 8, 8, 8, 8, 8, 10]        | 284.62621868310015
9      | 14        | [14, 14, 15, 8, 22, 14, 8]        | 1272.318236950194
10      | 10        | [10, 10, 10, 14, 10, 4, 1]        | 393.14917468115505
11      | 11        | [8, 10, 

In [13]:
# Increase radius
print('Start:', datetime.datetime.now())
x_train, x_test, train_aud, test_aud, train_len, test_len, y_test= generate_mfcc_lists(df, 13)
master_dist = calc_dtw(x_train, x_test, train_len, test_len, radius=5)
prediction(master_dist, y_test)
print('End:', datetime.datetime.now())

Start: 2019-03-30 10:01:23.424427
MFCCs:13, Radius:5, Time:503.76 sec
----------------------------------------------------------------------
Correct|Prediction|MFCC Predictions|DTW Distance
0      | 0        | [14, 11, 14, 0, 0, 14, 21]        | 1101.4231416255661
1      | 1        | [1, 1, 1, 1, 1, 1, 1]        | 275.01457361093065
2      | 2        | [10, 14, 2, 2, 2, 1, 8]        | 721.1772639826287
3      | 14        | [14, 14, 14, 14, 14, 14, 10]        | 745.938213187226
4      | 4        | [4, 4, 4, 4, 4, 4, 4]        | 250.74746547928083
5      | 1        | [1, 5, 1, 14, 14, 14, 1]        | 1002.9308016504688
6      | 6        | [16, 16, 13, 6, 0, 16, 16]        | 1405.2421284831273
7      | 7        | [18, 11, 14, 7, 14, 7, 14]        | 962.6623678772252
8      | 8        | [8, 1, 8, 8, 8, 8, 8]        | 333.8755762503739
9      | 16        | [16, 0, 16, 16, 16, 16, 18]        | 1293.795056863409
10      | 10        | [10, 10, 10, 10, 10, 10, 10]        | 360.0159625247511
11 

In [11]:
print('Prompt | Train Len | Test Len | Diff')
for i in range(len(train_len)):
    print('{:>10} {:>10} {:>10} {:>10}'.format(i, train_len[i], test_len[i], round(train_len[i]-test_len[i],1) ))

Prompt | Train Len | Test Len | Diff
         0        5.8        8.8       -3.0
         1        2.1        2.2       -0.1
         2        5.4        5.9       -0.5
         3        3.5        4.9       -1.4
         4        1.7        2.1       -0.4
         5        5.8        6.5       -0.7
         6        8.4        9.5       -1.1
         7        6.2        7.5       -1.3
         8        2.6        2.3        0.3
         9        8.5        8.0        0.5
        10        2.0        2.6       -0.6
        11        3.7        3.2        0.5
        12        6.4        3.7        2.7
        13        4.1        4.1        0.0
        14        2.0        2.1       -0.1
        15        3.4        3.3        0.1
        16        3.8        4.6       -0.8
        17        8.9       13.4       -4.5
        18        2.2        3.6       -1.4
        19        6.3        4.2        2.1
        20        6.0        6.9       -0.9
        21        5.8        5.5       

In [12]:
# Prompt 3 - test audio repeats part of phrase at the end
# Prompt 5 - these are sort of close. It guesses correctly in one of the shifts
# Prompt 9 - these are sort of close.
# Prompt 12 - issue with silence at beginning. Could try more trim.
# Prompt 17 - issue with silence at beginning. Could try more trim.
# Prompt 19 - Repeats part of phrase at beginning
# Prompt 21 - He does not finish test phrase well. Does not sound the same as train.

getaudio(train_aud[8])

In [13]:
getaudio(test_aud[8])

## Further background - Mechanics of the DTW algorithm

To calculate the DTW distance between two vectors X and Y, the first step is to create a matrix of size |Y| by |X| in which each element holds the accumulated distance for one pair of points from the two vectors, calculated using the following methodology:

_To calculate the element [i,j] (ie the accumulated distance for the vector elements $X_{i}$ and $Y_{j}$, with X along the horizontal axis and Y along the vertical axis), take the absolute value of $Y_{j}$ - $X_{i}$ and then add the minimum value of the three adjacent cells to the left, diagonal below left and below, ie cells [i-1, j], [i-1, j-1], [i, j-1]._

So, for example, if the two vectors we wish to compare are X = [1,6,2,3,0,9,4,3,6,3] and Y = [1,3,4,9,8,2,1,5,7,3] we would build the following cost matrix

![](img/DTW_matrix.png)

Cell [3,3] in bold with a value of 11 is calculated as |$Y_{3}$ - $X_{3}$| = |9-3| = 6, **plus** the minimum of the three adjacent cells to the left and below, ie left = 11, left diagonal below = 5, below = 5, so min{11, 5, 5} = 5, giving 6 + 5 = 11. Along the edges, where there are fewer adjacent cells, just add whichever cell is present, which will be either the one to the left or the one below (note that the diagram above incorrectly highlights cell [0,8], but the calculation provided is for the bold cell [0,4] with a value of 20).
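
The fill rule can be checked with a few lines of Python. This is a minimal sketch, not the code used elsewhere in this notebook: `dtw_cost_matrix` is a hypothetical helper, and it assumes cells are indexed as `cost[i][j]` with `i` the position along X and `j` the position along Y, which matches the cell references [3,3] and [0,4] above.

```python
def dtw_cost_matrix(x, y):
    """Accumulated cost matrix; cost[i][j] pairs x[i] with y[j]."""
    cost = [[0] * len(y) for _ in x]
    for i in range(len(x)):
        for j in range(len(y)):
            local = abs(x[i] - y[j])
            if i == 0 and j == 0:
                best_prev = 0                    # origin has no neighbours
            elif i == 0:
                best_prev = cost[0][j - 1]       # edge: only the cell below exists
            elif j == 0:
                best_prev = cost[i - 1][0]       # edge: only the cell to the left exists
            else:
                best_prev = min(cost[i - 1][j],      # left
                                cost[i - 1][j - 1],  # diagonal below left
                                cost[i][j - 1])      # below
            cost[i][j] = local + best_prev
    return cost

X = [1, 6, 2, 3, 0, 9, 4, 3, 6, 3]
Y = [1, 3, 4, 9, 8, 2, 1, 5, 7, 3]
cost = dtw_cost_matrix(X, Y)
print(cost[3][3])  # 11
print(cost[0][4])  # 20
```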

Once the matrix is populated, the minimum distance between the start and end points is the value in the top right corner. The alignment that produces it is found by starting at that cell and tracing a path back to the origin, successively choosing whichever of the three adjacent cells to the left and below contains the lowest value.

![](img/DTW_path.png)

Starting with the top right cell [9,9], the cell to the left = 18, left diagonal below = 15, below = 18, so we select left diagonal below. From there, the adjacent cells are left = 19, left diagonal below = 15, below = 14, so we select below. In this way, a warping path is traced back to the origin, and that path reveals which points in Y should map to which points in X.

![](img/warping_path.png)
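
The backtracking step can be sketched as follows. This is an illustrative version only: `dtw_cost_matrix` and `trace_path` are hypothetical helpers, cells are indexed as `cost[i][j]` with `i` along X and `j` along Y, and ties are broken in the order left, diagonal, below, which is an arbitrary choice.

```python
def dtw_cost_matrix(x, y):
    """Accumulated cost matrix; cost[i][j] pairs x[i] with y[j]."""
    cost = [[0] * len(y) for _ in x]
    for i in range(len(x)):
        for j in range(len(y)):
            prev = []
            if i > 0:
                prev.append(cost[i - 1][j])      # left
            if i > 0 and j > 0:
                prev.append(cost[i - 1][j - 1])  # diagonal below left
            if j > 0:
                prev.append(cost[i][j - 1])      # below
            cost[i][j] = abs(x[i] - y[j]) + (min(prev) if prev else 0)
    return cost

def trace_path(cost):
    """Walk from the top right cell back to the origin via the cheapest neighbour."""
    i, j = len(cost) - 1, len(cost[0]) - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0:
            candidates.append((cost[i - 1][j], (i - 1, j)))          # left
        if i > 0 and j > 0:
            candidates.append((cost[i - 1][j - 1], (i - 1, j - 1)))  # diagonal
        if j > 0:
            candidates.append((cost[i][j - 1], (i, j - 1)))          # below
        _, (i, j) = min(candidates, key=lambda c: c[0])
        path.append((i, j))
    return path

X = [1, 6, 2, 3, 0, 9, 4, 3, 6, 3]
Y = [1, 3, 4, 9, 8, 2, 1, 5, 7, 3]
path = trace_path(dtw_cost_matrix(X, Y))
print(path[:3])  # [(9, 9), (8, 8), (8, 7)]
```

Each `(i, j)` pair in `path` says that point `X[i]` is aligned with point `Y[j]`; the first two steps are the diagonal move and the move below described above.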

It is common practice to apply a weighting function to the overall distance (ie the sum of the point-to-point distances along the warping path) to normalize for the path length. This weighting can be based upon the distance travelled along the X axis alone, or along both X and Y.
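
A minimal sketch of the two weighting options is shown below. `normalize_dtw` is a hypothetical helper, and it assumes the distance travelled along an axis is simply the number of points on that axis; other conventions exist. Because a raw DTW distance is a sum that tends to grow with the number of aligned points, dividing by a length-based weight may also help with the question raised earlier about shorter phrases being favored.

```python
def normalize_dtw(distance, len_x, len_y, mode="both"):
    """Divide a raw DTW distance by a length-based weight (hypothetical helper)."""
    if mode == "x":
        weight = len_x            # distance travelled along the X axis only
    elif mode == "both":
        weight = len_x + len_y    # distance travelled along both axes
    else:
        raise ValueError("mode must be 'x' or 'both'")
    return distance / weight

# For the example vectors X and Y above (10 points each), the accumulated
# cost in the top right cell works out to 15.
print(normalize_dtw(15, 10, 10, mode="x"))     # 1.5
print(normalize_dtw(15, 10, 10, mode="both"))  # 0.75
```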