## Summary
The MLEnd dataset is a collection of spoken numerals recorded in 2021. It consists of 20,000 audio files covering 32 different numerals: 0-20, 30, 40, 50, 60, 70, 80, 90, 100, 1000, 1 million and 1 billion. This solution aims to build a machine learning model that predicts a numeral from 0 to 9 from an audio file, which could then be used in a voice recognition scenario. Therefore, only the subset of data for digits 0-9 will be used. <br><br>
In particular, the use case of recognising a telephone number will be explored. The model will be fed a series of telephone numbers (11-digit sequences) and will try to predict which telephone number the speaker said.


In [1]:
#library import
from google.colab import drive

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import os, sys, re, pickle, glob
import urllib.request
import zipfile

from random import randrange

#from IPython.display import Audio
import IPython.display as ipd
from tqdm import tqdm
import librosa
drive.mount('/content/drive')

Mounted at /content/drive


## Dataset preparation

Check that the dataset has been downloaded and is available for use. There should be 20,000 files.

In [3]:
files = glob.glob('/content/drive/MyDrive/Data/MLEnd/training/*/*.wav')
len(files)

20000

Load the dataset information from the 'trainingMLEnd.csv' file and save it to a variable.

In [4]:
labels = pd.read_csv('/content/drive/MyDrive/Data/MLEnd/trainingMLEnd.csv')
labels.head()

Unnamed: 0,File ID,digit_label,participant,intonation
0,0000000.wav,4,S73,question
1,0000001.wav,2,S88,excited
2,0000002.wav,70,S5,neutral
3,0000003.wav,2,S85,bored
4,0000004.wav,4,S30,excited


Let's explore how many files per numeral there are

In [5]:
numeral_count = labels.groupby(['digit_label']).count()
numeral_count

digit_label,File ID,participant,intonation
0,655,655,655
1,663,663,663
2,652,652,652
3,650,650,650
4,641,641,641
5,650,650,650
6,668,668,668
7,638,638,638
8,653,653,653
9,663,663,663


The data is fairly evenly split across numerals, so there is little skew. In particular, numerals 0 to 9 are well represented: each has between 638 and 668 examples in the table above.

## Feature extraction
In order to predict the digit spoken from the audio files, features contained in the audio files can be used.

The helper function below will be later used to extract some of the features of an audio file.

In [6]:
def getPitch(x,fs,winLen=0.02):
  #winLen = 0.02 
  p = winLen*fs
  frame_length = int(2**int(p-1).bit_length())
  hop_length = frame_length//2
  f0, voiced_flag, voiced_probs = librosa.pyin(y=x, fmin=80, fmax=450, sr=fs,
                                                 frame_length=frame_length,hop_length=hop_length)
  return f0,voiced_flag
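
As a worked example of the frame sizing inside `getPitch`: the window length in samples is rounded up to a power of two, and the hop is half of that. The sampling rates below are illustrative assumptions, not values taken from the dataset, and `frame_and_hop` is a hypothetical helper that mirrors the corresponding lines of the function.

```python
def frame_and_hop(fs, win_len=0.02):
    """Mirror the frame sizing used in getPitch (hypothetical helper)."""
    p = win_len * fs                              # window length in samples
    frame_length = 2 ** int(p - 1).bit_length()   # power of two at or above p (for whole p)
    hop_length = frame_length // 2                # 50% overlap between frames
    return frame_length, hop_length

for fs in (22050, 44100):  # assumed sampling rates, for illustration only
    print(fs, frame_and_hop(fs))
```

With a 20 ms window, a 22050 Hz signal gives 441 samples, which becomes a 512-sample frame with a hop of 256.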

I will extract 10 different features from each audio file and then check which ones are useful in predicting the numeral. The features I will look at are:<br>
- power
- pitch mean
- pitch standard deviation
- fraction of voiced region
- zero crossing rate
- spectral centroid
- spectral rolloff
- spectral bandwidth
- chroma stft
- rmse
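
To make a few of these quantities concrete, here is a minimal NumPy sketch that computes power, RMS and a whole-signal zero-crossing count for a synthetic sine wave. The signal, its 100 Hz frequency and the 8000 Hz sampling rate are assumptions chosen for illustration. librosa computes zero-crossing rate and RMS frame by frame, so its values are not directly comparable with this simplified version.

```python
import numpy as np

fs = 8000                               # assumed sampling rate in Hz
t = np.arange(fs) / fs                  # one second of sample times
x = np.sin(2 * np.pi * 100 * t + 0.1)   # 100 Hz tone with a small phase offset

power = np.sum(x**2) / len(x)           # mean squared amplitude
rms = np.sqrt(power)                    # root-mean-square amplitude
# count sign changes between consecutive samples
zero_crossings = int(np.sum(np.signbit(x[:-1]) != np.signbit(x[1:])))

print(power, rms, zero_crossings)
```

A 100 Hz tone crosses zero twice per period, so one second of it yields 200 crossings, and a unit-amplitude sine has a power of 0.5.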

References:<br>
https://towardsdatascience.com/extract-features-of-music-75a3f9bc265d<br>
https://medium.com/@alexandro.ramr777/audio-files-to-dataset-by-feature-extraction-with-librosa-d87adafe5b64<br>

The following function takes a list of files and creates NumPy arrays containing the audio features used as predictors (`X`), the intonation labels (`y`) and the numeral labels (`z`).

In [7]:
def getXy(files,labels_file,scale_audio=False, onlySingleDigit=False):
  X,y,z =[],[],[]
  for file in tqdm(files):
    fileID = file.split('/')[-1]
    # yi is the intonation label and zi is the numeral label
    yi = list(labels_file[labels_file['File ID']==fileID]['intonation'])[0]
    zi = list(labels_file[labels_file['File ID']==fileID]['digit_label'])[0]

    if onlySingleDigit and zi>9:
      continue
    else:
      fs = None # sr=None keeps the file's native sampling rate; librosa's default would resample to 22050 Hz
      x, fs = librosa.load(file,sr=fs)
      if scale_audio: x = x/np.max(np.abs(x))
      f0, voiced_flag = getPitch(x,fs,winLen=0.02)
          
      power = np.sum(x**2)/len(x)
      pitch_mean = np.nanmean(f0) if np.mean(np.isnan(f0))<1 else 0
      pitch_std  = np.nanstd(f0) if np.mean(np.isnan(f0))<1 else 0
      voiced_fr = np.mean(voiced_flag)
      # pass the signal and sampling rate to librosa by keyword (y=, sr=)
      crossing_rate = np.sum(librosa.feature.zero_crossing_rate(y=x))
      spectral_centroid = np.mean(librosa.feature.spectral_centroid(y=x, sr=fs))
      spectral_rolloff = np.mean(librosa.feature.spectral_rolloff(y=x, sr=fs))
      spectral_bandwidth = np.mean(librosa.feature.spectral_bandwidth(y=x, sr=fs))
      chroma_stft = np.mean(librosa.feature.chroma_stft(y=x, sr=fs))
      rmse = np.mean(librosa.feature.rms(y=x))
      

      xi = [power,pitch_mean,pitch_std,voiced_fr, crossing_rate, spectral_centroid, 
            spectral_rolloff, spectral_bandwidth, chroma_stft,rmse]
      X.append(xi)
      y.append(yi)
      z.append(zi)
  return np.array(X),np.array(y), np.array(z)

## Preprocessing
In this section, I will use the above function to extract the features from a number of audio files to aid selecting the features most useful in predicting the numeral.

Apply the getXy function to 5000 files. As only a proportion of them contain numerals 0-9, I have selected a relatively large sample in order to have enough data for training. I am not using the whole dataset here, because extracting 10 features takes a long time. Once I have selected the most useful features, I will use all audio files for training the model.

In [8]:
X,y,z = getXy(files[5000:10000],labels_file=labels,scale_audio=True, onlySingleDigit=True)

100%|██████████| 5000/5000 [18:33<00:00,  4.49it/s]


Check the shape of `X` and `z` (the numeral):

In [9]:
print('The shape of X is', X.shape) 
print('The shape of z is', z.shape)

The shape of X is (1625, 10)
The shape of z is (1625,)


### Feature selection
To select the most useful features, I will use logistic regression. Once I have selected the features, I will try out different models.
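
For a sense of scale, the search in the next cell fits one model for every non-empty subset of the 10 features. The number of subsets can be checked with the standard library; this is a small sketch and the variable names are illustrative.

```python
from itertools import combinations
from math import comb

n_features = 10
# number of subsets of each size k = 1..10
sizes = {k: comb(n_features, k) for k in range(1, n_features + 1)}
total = sum(sizes.values())

# cross-check by enumerating the subsets explicitly
enumerated = sum(1 for k in range(1, n_features + 1)
                 for _ in combinations(range(n_features), k))
print(sizes)
print(total, enumerated)
```

The total is 2^10 - 1 = 1023 model fits, with the largest group being the 252 subsets of size 5.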

In [10]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [11]:
from google.colab import widgets
import operator
import itertools

# set up a dataframe for storing results
columns = ['Features', 'No of features', 'Testing', 'Validation']
features_results = pd.DataFrame(columns=columns)
data = []

# I want to check the combination of any of the features, from using each
# individual feature, to using all 10 at the same time.
no_features = range(1,11)
tb = widgets.TabBar([str(number) for number in no_features])

for number in no_features:
    with tb.output_to(str(number), select= (number < 2)):

      # set up combinations
      for n in (itertools.combinations(range(10), number)):
        # convert the combination (a tuple) to a list of column indices
        tmp = list(n)
        # only select the features from current combination  
        XX = X[:,tmp]
        X_train, X_val, z_train, z_val = train_test_split(XX,z,test_size=0.3)
        # normalise the predictors
        X_train = (X_train-X_train.mean(0))/X_train.std(0)
        X_val = (X_val-X_val.mean(0))/X_val.std(0)
        model = LogisticRegression()
        model.fit(X_train,z_train)
          

        zt_p = model.predict(X_train)
        zv_p = model.predict(X_val)
        # set up variables for appending to a data frame
        values = [n,number,np.mean(zt_p==z_train),np.mean(zv_p==z_val)]
        zipped = zip(columns, values)
        a_dictionary = dict(zipped)
        data.append(a_dictionary) 

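      # append results to a data frame; `data` accumulates across iterations,
      # so earlier rows are added again and duplicates are dropped later
      features_results = pd.concat([features_results, pd.DataFrame(data)], ignore_index=True)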

      # select the part of the dataframe containing current no of features only
      result = features_results.loc[features_results['No of features'] == number]
      # sort them by validation accuracy
      result = result.sort_values(by=['Validation'], ascending=False)
      print(result.head().to_string())

  Features No of features   Testing  Validation
3     (3,)              1  0.188215    0.178279
5     (5,)              1  0.191733    0.172131
4     (4,)              1  0.158311    0.151639
6     (6,)              1  0.162709    0.151639
0     (0,)              1  0.144239    0.141393


   Features No of features   Testing  Validation
50   (4, 5)              2  0.221636    0.245902
44   (3, 4)              2  0.233949    0.237705
55   (5, 6)              2  0.217238    0.219262
49   (3, 9)              2  0.215479    0.215164
57   (5, 8)              2  0.216359    0.215164


      Features No of features   Testing  Validation
209  (3, 4, 9)              3  0.240106    0.272541
221  (4, 5, 7)              3  0.252419    0.272541
205  (3, 4, 5)              3  0.263852    0.262295
208  (3, 4, 8)              3  0.229551    0.258197
224  (4, 6, 7)              3  0.245383    0.258197


         Features No of features   Testing  Validation
593  (3, 4, 5, 9)              4  0.277924    0.327869
592  (3, 4, 5, 8)              4  0.259455    0.303279
591  (3, 4, 5, 7)              4  0.278804    0.290984
464  (0, 3, 4, 5)              4  0.278804    0.288934
604  (3, 5, 7, 9)              4  0.284081    0.286885


             Features No of features   Testing  Validation
1243  (3, 4, 5, 6, 9)              5  0.319261    0.346311
1102  (0, 3, 4, 5, 7)              5  0.328056    0.321721
1113  (0, 3, 5, 6, 9)              5  0.288478    0.319672
1253  (3, 5, 6, 8, 9)              5  0.284960    0.315574
1245  (3, 4, 5, 7, 9)              5  0.321900    0.311475


                Features No of features   Testing  Validation
1969  (0, 2, 3, 4, 5, 6)              6  0.313105    0.329918
2062  (1, 3, 4, 5, 6, 9)              6  0.318382    0.329918
2102  (3, 4, 5, 6, 7, 8)              6  0.279683    0.323770
1956  (0, 1, 4, 5, 6, 9)              6  0.280563    0.319672
2081  (2, 3, 4, 5, 6, 7)              6  0.279683    0.319672


                   Features No of features   Testing  Validation
2956  (0, 1, 2, 3, 4, 5, 6)              7  0.333333    0.342213
3064  (1, 3, 4, 5, 7, 8, 9)              7  0.330695    0.334016
2993  (0, 1, 3, 4, 5, 6, 9)              7  0.330695    0.331967
2998  (0, 1, 3, 4, 6, 7, 9)              7  0.302551    0.329918
3013  (0, 2, 3, 4, 5, 6, 8)              7  0.318382    0.327869


                      Features No of features   Testing  Validation
4082  (1, 2, 3, 4, 5, 7, 8, 9)              8  0.306948    0.334016
4086  (1, 3, 4, 5, 6, 7, 8, 9)              8  0.339490    0.329918
4072  (0, 2, 3, 4, 5, 6, 7, 9)              8  0.323659    0.329918
4071  (0, 2, 3, 4, 5, 6, 7, 8)              8  0.335972    0.327869
4043  (0, 1, 2, 3, 4, 5, 6, 7)              8  0.319261    0.325820


                         Features No of features   Testing  Validation
5103  (0, 1, 2, 3, 4, 5, 7, 8, 9)              9  0.327177    0.315574
5108  (0, 2, 3, 4, 5, 6, 7, 8, 9)              9  0.353562    0.309426
5107  (0, 1, 3, 4, 5, 6, 7, 8, 9)              9  0.335972    0.305328
5105  (0, 1, 2, 3, 5, 6, 7, 8, 9)              9  0.306069    0.293033
5106  (0, 1, 2, 4, 5, 6, 7, 8, 9)              9  0.295515    0.288934


                            Features No of features   Testing  Validation
6132  (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)             10  0.345646    0.311475


The validation accuracies are not great: the best are around 0.35, which is nevertheless better than random guessing (0.1).
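
The 0.1 chance level can be checked against the class counts listed earlier. Besides uniform guessing, another simple reference point is always predicting the most frequent numeral. The sketch below computes both from the per-digit file counts in the count table. The feature selection step used a 1625-file subset, so these figures are only an approximate reference for it.

```python
# files per digit, copied from the count table in "Dataset preparation"
counts = {0: 655, 1: 663, 2: 652, 3: 650, 4: 641,
          5: 650, 6: 668, 7: 638, 8: 653, 9: 663}

total = sum(counts.values())
uniform_guess = 1 / len(counts)                 # pick one of 10 digits at random
majority_guess = max(counts.values()) / total   # always predict the most common digit

print(total, uniform_guess, majority_guess)
```

Both baselines come out at roughly 0.10, because the ten digits are almost equally frequent.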

Next, I want to see the top 10 validation results in order to select the most relevant features.

In [12]:
# remove duplicated entries from my data frame
features_results = features_results.drop_duplicates()
# sort by validation
features_results = features_results.sort_values(by=['Validation'], ascending=False)
# only display the top 10 results
features_results.head(10)

Unnamed: 0,Features,No of features,Testing,Validation
1243,"(3, 4, 5, 6, 9)",5,0.319261,0.346311
2956,"(0, 1, 2, 3, 4, 5, 6)",7,0.333333,0.342213
3064,"(1, 3, 4, 5, 7, 8, 9)",7,0.330695,0.334016
4082,"(1, 2, 3, 4, 5, 7, 8, 9)",8,0.306948,0.334016
2993,"(0, 1, 3, 4, 5, 6, 9)",7,0.330695,0.331967
4086,"(1, 3, 4, 5, 6, 7, 8, 9)",8,0.33949,0.329918
1969,"(0, 2, 3, 4, 5, 6)",6,0.313105,0.329918
2062,"(1, 3, 4, 5, 6, 9)",6,0.318382,0.329918
2998,"(0, 1, 3, 4, 6, 7, 9)",7,0.302551,0.329918
4072,"(0, 2, 3, 4, 5, 6, 7, 9)",8,0.323659,0.329918


I am going to select the features with indices 3, 4, 5, 6 and 9. This combination showed a validation accuracy of almost 0.35.

### Choosing a model
Now that I have selected the features to use, I will evaluate a few different models in order to pick the one that performs best at predicting the numeral spoken from an audio file.

In [13]:
# import some widely used prediction models
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import svm
# decision tree classifier
from sklearn import tree

In [15]:
# helper function that fits a model and reports its accuracy on both splits
# (the 'Testing' column holds the accuracy on the training split)
def runModel(model_name, model):
  model.fit(X_train,z_train)
  zt_p = model.predict(X_train)
  zv_p = model.predict(X_val)

  data = {'Model': model_name, 'Testing': np.mean(zt_p==z_train), 
        'Validation':np.mean(zv_p==z_val)}
  return data
  

# select the features extracted
XX = X[:,[3, 4, 5, 6, 9]]
X_train, X_val, z_train, z_val = train_test_split(XX,z,test_size=0.4)
# normalise the predictors
X_train = (X_train-X_train.mean(0))/X_train.std(0)
X_val = (X_val-X_val.mean(0))/X_val.std(0)

# models to compare, as (label, estimator) pairs
models = [
  ('Logistic Regression', LogisticRegression()),
  # Support Vector Machine with C = 1, 2, 3
  ('SVM_c1', svm.SVC(C=1)),
  ('SVM_c2', svm.SVC(C=2)),
  ('SVM_c3', svm.SVC(C=3)),
  # Support Vector Machine with C and gamma:
  # (C=1, gamma=2), (C=2, gamma=2), (C=2, gamma=3)
  ('SVM_g2', svm.SVC(C=1,gamma=2)),
  ('SVM_g2', svm.SVC(C=2,gamma=2)),
  ('SVM_g2', svm.SVC(C=2,gamma=3)),
  # Decision trees: unrestricted depth, then depth 4, 5 and 6
  ('Decision Tree', tree.DecisionTreeClassifier()),
  ('Decision Tree depth=4', tree.DecisionTreeClassifier(max_depth=4)),
  ('Decision Tree depth=5', tree.DecisionTreeClassifier(max_depth=5)),
  ('Decision Tree depth=6', tree.DecisionTreeClassifier(max_depth=6)),
]

# run every model and collect the results in a data frame
columns = ['Model', 'Testing', 'Validation']
models_results = pd.DataFrame([runModel(name, model) for name, model in models],
                              columns=columns)

models_results

Unnamed: 0,Model,Testing,Validation
0,Logistic Regression,0.323077,0.316923
1,SVM_c1,0.357949,0.286154
2,SVM_c2,0.381538,0.281538
3,SVM_c3,0.398974,0.283077
4,SVM_g2,0.636923,0.258462
5,SVM_g2,0.726154,0.224615
6,SVM_g2,0.849231,0.201538
7,Decision Tree,1.0,0.163077
8,Decision Tree depth=4,0.267692,0.206154
9,Decision Tree depth=5,0.308718,0.216923


Logistic regression performs best, with a validation score of 0.32. This is slightly worse than in the feature selection step, but the data is split randomly between training and validation, so each run of the code produces a different result. Some of the S
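
Because the split is random, scores from a single run are noisy. One way to make runs repeatable is to seed the random number generator that shuffles the data; scikit-learn's `train_test_split` accepts a `random_state` argument for this purpose. The NumPy sketch below illustrates the idea with a hypothetical helper, `seeded_split`, and is not the code used in this notebook.

```python
import numpy as np

def seeded_split(n_samples, test_size=0.4, seed=0):
    """Return (train_idx, val_idx) from a reproducible shuffle (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)            # shuffled sample indices
    n_val = int(round(test_size * n_samples))     # validation size, rounded to nearest
    return order[n_val:], order[:n_val]

train_a, val_a = seeded_split(1625, seed=42)
train_b, val_b = seeded_split(1625, seed=42)
print(len(train_a), len(val_a), np.array_equal(val_a, val_b))
```

Calling the helper twice with the same seed returns the same split, so differences between models are no longer mixed with differences between splits.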

### The model
My final model will use the following features extracted from the audio files (index, name):
- 3, fraction of voiced region
- 4, zero crossing rate
- 5, spectral centroid
- 6, spectral rolloff
- 9, rmse

<br><br>
The model used will be **Logistic Regression**<br>
On the smaller dataset, this model predicted correctly for 32% of the training data and 32% of the validation data. In the next section, I will run this model on the whole dataset available.

## Training and validation
To train the model I will use all audio files with numerals 0-9. I am going to create a dataset of all files with all ten features and save it as csv for future use. This task takes a long time, so I only want to do it once. One audio file caused errors, so I end up with 2 csv files of length 19999 each:
- `features.csv` - all 10 features I identified earlier - I extracted all 10 in case they could be useful for any further exploration
- `numerals.csv` - corresponding numerals<br><br>

These files can also be downloaded from my GitHub:<br>
https://github.com/maciejtarsa/MLEnd-Spoken-Numerals

In [17]:
# the next line is commented out because I only needed to run it once
#X,y,z = getXy(files[:],labels_file=labels,scale_audio=True, onlySingleDigit=False)

Instead of the above line, I will import the csv files I have created. I will import the features into a vector `X` and the numerals into a vector `z`.

In [16]:
import csv
# import the features from a csv file into an array
data_x = []
with open('/content/drive/MyDrive/Data/MLEnd/features.csv', 'r') as rf:
    reader = csv.reader(rf, delimiter=',')
    for row in reader:
      data_x.append(row[1:11])

X = np.array(data_x).astype(float)
X = np.delete(X, 0, axis=0)

# import the numerals from a csv file into an array
data_z=[]
with open('/content/drive/MyDrive/Data/MLEnd/numerals.csv', 'r') as rf:
    reader = csv.reader(rf, delimiter=',')
    for row in reader:
      data_z.append(int(row[1]))

z = np.array(data_z)
z = np.delete(z, [0])

I only want to keep the data for numerals 0-9.

In [18]:
# boolean mask selecting the entries of z where the numeral is less than 10
is_digit = z < 10
# amend X to only keep features for numerals 0-9
X = X[is_digit]
# amend z to only keep numerals 0-9
z = z[is_digit]

With this, I end up with 6533 audio files for numerals 0-9. Next, I will run them on the model chosen earlier.

### Run the model

In [19]:
# select the features extracted
XX = X[:,[3, 4, 5, 6, 9]]
X_train, X_val, z_train, z_val = train_test_split(XX,z,test_size=0.4)
# normalise the predictors
X_train = (X_train-X_train.mean(0))/X_train.std(0)
X_val = (X_val-X_val.mean(0))/X_val.std(0)

# run the model
model_numerals = LogisticRegression()
model_numerals.fit(X_train,z_train)
zt_p = model_numerals.predict(X_train)
zv_p = model_numerals.predict(X_val)

print("The final model produced the following testing and validation values:")
print(f"Testing: {np.mean(zt_p==z_train)}")
print(f"Validation: {np.mean(zv_p==z_val)}")

The final model produced the following testing and validation values:
Testing: 0.30237305435059963
Validation: 0.3091048201989288


My final model achieved a training accuracy of 30% (reported as 'Testing' above) and a validation accuracy of 31% on a dataset of 6533 audio files.
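
The cells above standardise the training and validation splits separately, each with its own mean and standard deviation. A common alternative is to compute the statistics on the training split only and reuse them for validation, so that both splits pass through the same transformation. Below is a minimal NumPy sketch with made-up data; the arrays and the helper name `standardise_with_train_stats` are illustrative assumptions.

```python
import numpy as np

def standardise_with_train_stats(X_train, X_val):
    """Scale both splits using the training mean and standard deviation."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    return (X_train - mu) / sigma, (X_val - mu) / sigma

rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))  # synthetic features
X_val_demo = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

train_s, val_s = standardise_with_train_stats(X_train_demo, X_val_demo)
print(train_s.mean(axis=0), train_s.std(axis=0))
print(val_s.mean(axis=0), val_s.std(axis=0))
```

The training split ends up with zero mean and unit standard deviation per feature, while the validation split is only approximately standardised, which is expected.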

### Confusion matrix
In order to explore the results further, I will create a confusion matrix and scores for sensitivity and specificity.

In [20]:
# generate the confusion matrix
# rows hold the predictions (zv_p) and columns hold the actual numerals (z_val)
numeral_confusion = pd.crosstab(zv_p, z_val, rownames=['Predicted'], colnames=['Actual'], margins=True)
# and print it
print('Confusion Matrix for numerals 0-9 : \n\n', numeral_confusion)

Confusion Matrix for numerals 0-9 : 

 Actual       0    1    2    3    4    5    6    7    8    9   All
Predicted                                                        
0           58   19   22   20   25   22   12   26   13   25   242
1           41  146   49   50   43   54    1   24   39   54   501
2           28   22   49   32   33   17    4   11   14   17   227
3           15    5   11   28    6   13    0    3    7   13   101
4           15    5   22   13   37   14    5   17   11    7   146
5            5   11    3    8    4   14    2   11    6   10    74
6           15    3   13   21   21   16  189   27   40    9   354
7           29   14   10   19   19   28   12   88   19   10   248
8           17   24   58   59   37   45   16   22  101   18   397
9           33   21   15   25   26   44   15   35   12   98   324
All        256  270  252  275  251  267  256  264  262  261  2614
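
As a consistency check, the overall validation accuracy can be recovered from the matrix: the diagonal holds the correctly classified files, and the bottom-right cell holds the total. The values below are copied from the table above.

```python
# diagonal of the confusion matrix above (correct predictions for digits 0-9)
correct = [58, 146, 49, 28, 37, 14, 189, 88, 101, 98]
total = 2614  # bottom-right 'All' cell

accuracy = sum(correct) / total
print(sum(correct), accuracy)
```

This gives 808 correct predictions out of 2614, which matches the validation accuracy of about 0.309 reported for the final model.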


In [21]:
from sklearn.metrics import confusion_matrix

# confusion matrix without margins: rows are actual numerals, columns are predictions
cm = confusion_matrix(z_val, zv_p, labels=np.arange(10))
TP = np.diag(cm)
FN = cm.sum(axis=1) - TP  # actual numeral k predicted as something else
FP = cm.sum(axis=0) - TP  # other numerals predicted as k
TN = cm.sum() - (FP + FN + TP)

# Sensitivity, hit rate, recall, or true positive rate
TPR = TP/(TP+FN)
# Specificity or true negative rate
TNR = TN/(TN+FP) 

# combine into one array
sen_spec = np.stack(( TPR, TNR), axis=1)
# convert to a data frame and display
sen_spec_df = pd.DataFrame(sen_spec, columns=['Sensitivity', 'Specificity'])
sen_spec_df.round(3)

Unnamed: 0,Sensitivity,Specificity
0,0.227,0.922
1,0.541,0.849
2,0.194,0.925
3,0.102,0.969
4,0.147,0.954
5,0.052,0.974
6,0.738,0.93
7,0.333,0.932
8,0.385,0.874
9,0.375,0.904


**Specificity** is the proportion of negative cases that were predicted as negative, i.e. how many files that are not a given numeral were correctly labelled as not that numeral. With ten classes this is high almost by construction; values range from 0.85 for '1' to 0.97 for '5'.<br>**Sensitivity** is the proportion of actual cases that were predicted correctly, for example numeral '1' predicted as numeral '1'. Performance varies widely between numerals: '6' does best (correctly identified 74% of the time), followed by '1' (54%), while '5' is correctly identified only 5% of the time.<br><br>

## Conclusions
Overall, these results show that the prediction is not very effective and is unlikely to be of practical use. A larger training set might improve the results. Furthermore, as participants came from many different areas of the world, with many different accents and pronunciations, training on specific pronunciations or accents only might yield better results.
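
To put the per-digit accuracy in the context of the telephone number use case: if each of the 11 digits were recognised independently with the validation accuracy of about 0.31, the chance of getting a whole number right would be tiny. The independence assumption is a simplification made here for illustration.

```python
digit_accuracy = 0.3091   # validation accuracy of the final model, rounded
n_digits = 11             # length of a telephone number in this use case

# probability that all digits are right, assuming independent errors
p_all_correct = digit_accuracy ** n_digits
# expected number of correctly recognised digits per number
expected_correct = digit_accuracy * n_digits

print(p_all_correct, expected_correct)
```

Under these assumptions only a few numbers in a million would be recognised in full, with between three and four digits correct on average.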