# Summary
MLEnd dataset is a dataset of spoken numerals collected in 2021. It consists of 20,000 audio files. 32 different numerals have been included: 0-20, 30, 40, 50, 60, 70, 80, 90, 100, 1000, 1 million, 1 billion. This solution is going to first train a model developed in the other section and then attempt to use that model to predict the telephone number from audio files of recorded spoken telephone numbers.


# Prepare and train the model

In [1]:
#library import
from google.colab import drive

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import os, sys, re, pickle, glob
import urllib.request
import zipfile

from random import randrange

#from IPython.display import Audio
import IPython.display as ipd
from tqdm import tqdm
import librosa
drive.mount('/content/drive')

Mounted at /content/drive


## Dataset preparation

Check that the dataset has been donwloaded and is available for use - there should be 20,000 files.

In [2]:
files = glob.glob('/content/drive/MyDrive/Data/MLEnd/training/*/*.wav')
len(files)

20000

Check the dataset information from the 'trainingMLEnd.csv' file and save them to a variable.

In [None]:
labels = pd.read_csv('/content/drive/MyDrive/Data/MLEnd/trainingMLEnd.csv')
labels.head()

Unnamed: 0,File ID,digit_label,participant,intonation
0,0000000.wav,4,S73,question
1,0000001.wav,2,S88,excited
2,0000002.wav,70,S5,neutral
3,0000003.wav,2,S85,bored
4,0000004.wav,4,S30,excited


The following function will be used for feature extraction.

In [3]:
def getPitch(x,fs,winLen=0.02):
  #winLen = 0.02 
  p = winLen*fs
  frame_length = int(2**int(p-1).bit_length())
  hop_length = frame_length//2
  f0, voiced_flag, voiced_probs = librosa.pyin(y=x, fmin=80, fmax=450, sr=fs,
                                                 frame_length=frame_length,hop_length=hop_length)
  return f0,voiced_flag

The following function takes a number of files and creates a NumPy array containing the specified audio features used as predictors (`X`) and their labels (`z`).

In [4]:
def getXy(files,labels_file,scale_audio=False, onlySingleDigit=False):
  X,y,z =[],[],[]
  for file in tqdm(files):
    fileID = file.split('/')[-1]
    # yi is our label - in this case intonation
    yi = list(labels_file[labels_file['File ID']==fileID]['intonation'])[0]
    zi = list(labels_file[labels_file['File ID']==fileID]['digit_label'])[0]

    if onlySingleDigit and zi>9:
      continue
    else:
      fs = None # if None, fs would be 22050
      x, fs = librosa.load(file,sr=fs)
      if scale_audio: x = x/np.max(np.abs(x))
      f0, voiced_flag = getPitch(x,fs,winLen=0.02)
          
      power = np.sum(x**2)/len(x)
      pitch_mean = np.nanmean(f0) if np.mean(np.isnan(f0))<1 else 0
      pitch_std  = np.nanstd(f0) if np.mean(np.isnan(f0))<1 else 0
      voiced_fr = np.mean(voiced_flag)
      crossing_rate = np.sum(librosa.feature.zero_crossing_rate(x))
      spectral_centroid = np.mean(librosa.feature.spectral_centroid(x, fs))
      spectral_rolloff = np.mean(librosa.feature.spectral_rolloff(x, fs))
      spectral_bandwidth = np.mean(librosa.feature.spectral_bandwidth(x, fs))
      chroma_stft = np.mean(librosa.feature.chroma_stft(x, fs))
      rmse = np.mean(librosa.feature.rms(x))
      

      xi = [power,pitch_mean,pitch_std,voiced_fr, crossing_rate, spectral_centroid, 
            spectral_rolloff, spectral_bandwidth, chroma_stft,rmse]
      X.append(xi)
      y.append(yi)
      z.append(zi)
  return np.array(X),np.array(y), np.array(z)

## The model
The model will use the following features extracted from the audio files (index, name):
- 3, fraction of voiced region
- 4, zero crossing rate
- 5, spectral centroid
- 6, spectral rolloff
- 9, rmse

<br><br>
The model used will be **Logistic Regression**<br>
Using this model on a smaller dataset produced correct training prediction 32% of the time and validation prediction 31% of the time. In the next section, I will run this model on the whole dataset available.

## Training and validation
For training of the model I will use all audio files with numerals 0-9. I am going to create a dataset of all files with all ten features and save it as csv for future use. This task will take a long time and therefore I only want to do it once. One audio file was causing errors, therefore I will end up with 2 csv files of length 19999 each:
- `feeatures.csv` - all 10 features I identified earlier - I extracted all 10 in case they could be useful for any further exploration
- `numerals.csv` - corresponding numerals<br><br>

These files can also be downloaded from my GitHub:<br>
https://github.com/maciejtarsa/MLEnd-Spoken-Numerals

In [5]:
# the next line is commented out because I only needed to run it once
#X,y,z = getXy(files[:],labels_file=labels,scale_audio=True, onlySingleDigit=False)

Instead of the above line, I will import the csv files I have created. I will import the features into a vector `X` and the numerals into a vector `z`.

In [6]:
import csv
# import the features from a csv file into an array
data_x = []
with open('/content/drive/MyDrive/Data/MLEnd/features.csv', 'r') as rf:
    reader = csv.reader(rf, delimiter=',')
    for row in reader:
      data_x.append(row[1:11])

X = np.array(data_x).astype(np.float)
X = np.delete(X, 0, axis=0)

# import the numerals from a csv file into an array
data_z=[]
with open('/content/drive/MyDrive/Data/MLEnd/numerals.csv', 'r') as rf:
    reader = csv.reader(rf, delimiter=',')
    for row in reader:
      data_z.append(int(row[1]))

z = np.array(data_z)
z = np.delete(z, [0])

I only want to keep the data for numerals 0-9.

In [7]:
# Extract a list of indeces from z where numeral is less than 10
num_index = np.where(z < 10)
# amend X to only keep features for numerals 0-9
X = np.take(X, num_index, axis=0)
X = np.squeeze(X)
# amend z to only keep numerals 0-9
z = np.take(z, num_index)
z = np.squeeze(z)

With this, I end up with 6533 audio files for numerals 0-9. Next, I will run them on the model chosen earlier.

### Run the model

In [9]:
# import the model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [10]:
# select the feautures extracted
XX = X[:,[3, 4, 5, 6, 9]]
X_train, X_val, z_train, z_val = train_test_split(XX,z,test_size=0.4)
# normalise the predictors
X_train = (X_train-X_train.mean(0))/X_train.std(0)
X_val = (X_val-X_val.mean(0))/X_val.std(0)

# run the model
model_numerals = LogisticRegression()
model_numerals.fit(X_train,z_train)
zt_p = model_numerals.predict(X_train)
zv_p = model_numerals.predict(X_val)

print("The final model produced the following testing and validation values:")
print(f"Testing: {np.mean(zt_p==z_train)}")
print(f"Validation: {np.mean(zv_p==z_val)}")

The final model produced the following testing and validation values:
Testing: 0.303648890022965
Validation: 0.30642693190512627


My final model achieved the testing accuracy of 30% and validation accuracy of 31% using a dataset of 6533 audio files.

# Telephone number use case
In this section, I will create a product that will attempt to identify all digits of a telephone number from an audio file recording of person reciting a telephone number. The model trained in previous section is going to be used.<br><br>
For the purpose of this exercise, UK mobile numbers are going to be used (11 numerals). It is assumed that the speaker is pronouncing each number as a digit, and that zero is pronouncaed as 'zero' rather than for example 'o'.

### Exploration of the problem

#### Load an audio file:
Upload data to collaborate panel on the left<br><br>
Recordings used for this use cases can be found on my github at:<br>
https://github.com/maciejtarsa/MLEnd-Spoken-Numerals/tree/main/telephone_numbers

In [11]:
# set a variable for the audio file
tel_file = './V_2.wav'

In [12]:
# display s sound wave and play
fs = None # Sampling frequency. If None, fs would be 22050
x, fs = librosa.load(tel_file,sr=fs)
t = np.arange(len(x))/fs
plt.plot(t,x)
plt.xlabel('time (sec)')
plt.ylabel('amplitude')
plt.show()
display(ipd.Audio(tel_file))

Output hidden; open in https://colab.research.google.com to view.

I now have the file in the working environment, I then want to split it on silence.

#### Split the audio file

In order to split the sound file on silence I can use functionality from pydub library. I tried using librosa for this, however it was not splitting the files into 11 chunks effectively, hence I decided to attempt using another library

In [13]:
# need to install pydub
!pip install pydub

Collecting pydub
  Downloading https://files.pythonhosted.org/packages/a6/53/d78dc063216e62fc55f6b2eebb447f6a4b0a59f55c8406376f76bf959b08/pydub-0.25.1-py2.py3-none-any.whl
Installing collected packages: pydub
Successfully installed pydub-0.25.1


In [14]:
from pydub import AudioSegment
from pydub.silence import split_on_silence

# a function for splitting a file into chunks
def split(filepath):
  sound = AudioSegment.from_wav(filepath)
  dBFS = sound.dBFS
  chunks = split_on_silence(sound, 
    min_silence_len = 250,
    silence_thresh = dBFS-16)
  return chunks

In [15]:
# chunk the file
chunks = split(tel_file)

In [28]:
# save chunks as .wav files in the local collab environment
for i, chunk in enumerate(chunks):
  chunk.export("./chunk{0}.wav".format(i), format="wav")

In [29]:
# create a list of .wav files to import
numeral_files = glob.glob("./chunk*.wav".format(tel_file))
len(numeral_files)

11

#### Get features for each numeral
At this stage I have 11 files with numerals for each telephone number. I now need to extract the features for each numeral.

In [30]:
# a function for extracting the features
def getX(numerals):
  X = []
  for file in tqdm(numerals):

    fs = None # if None, fs would be 22050
    x, fs = librosa.load(file,sr=fs)
    x = x/np.max(np.abs(x))
    f0, voiced_flag = getPitch(x,fs,winLen=0.02)
          
    power = np.sum(x**2)/len(x)
    pitch_mean = np.nanmean(f0) if np.mean(np.isnan(f0))<1 else 0
    pitch_std  = np.nanstd(f0) if np.mean(np.isnan(f0))<1 else 0
    voiced_fr = np.mean(voiced_flag)
    crossing_rate = np.sum(librosa.feature.zero_crossing_rate(x))
    spectral_centroid = np.mean(librosa.feature.spectral_centroid(x, fs))
    spectral_rolloff = np.mean(librosa.feature.spectral_rolloff(x, fs))
    spectral_bandwidth = np.mean(librosa.feature.spectral_bandwidth(x, fs))
    chroma_stft = np.mean(librosa.feature.chroma_stft(x, fs))
    rmse = np.mean(librosa.feature.rms(x))
      

    xi = [power,pitch_mean,pitch_std,voiced_fr, crossing_rate, spectral_centroid, 
            spectral_rolloff, spectral_bandwidth, chroma_stft,rmse]
    X.append(xi)

  return np.array(X)

Run the above function for the numerals

In [31]:
X = getX(numeral_files)
print('\n\nThe shape of X is', X.shape) 

100%|██████████| 11/11 [00:02<00:00,  4.22it/s]



The shape of X is (11, 10)





I now have an array of 11 numerals with 10 features for each.

#### Run the model
Run the model to predict the number given.

In [32]:
def runModel():
  # select the relevant features
  XX = X[:,[3, 4, 5, 7, 9]]
  # normalise the predictor
  XX = (XX-XX.mean(0))/XX.std(0)
  # run the prediction
  number_prediction = model_numerals.predict(XX)
  #print('The predicted number is:')
  # print the number as string
  string = ''
  for no in number_prediction:
    #print(no, end='')
    string+=str(no)

  # and add it to the dataframe
  telephone_results.loc[telephone_results['File'] == tel_file[2:], 'Predicted'] = string
  # get the value of the actual number
  actual_string = telephone_results.loc[telephone_results['File'] == tel_file[2:], 'Actual'].iloc[0]
  # calculate the accuracy by comparing the strings
  sum = 0
  for i in range(11):
    if string[i] == actual_string[i]:
      sum += 1
  telephone_results.loc[telephone_results['File'] == tel_file[2:], 'Accuracy'] = sum/11

A helper function adding silence at the beginning and at the end of each numeral file.

In [33]:
from pydub import AudioSegment

def addSilence(silence):
  # read all chunk files
  numeral_files = glob.glob('./chunk*.wav')
  # create 1 sec of silence audio segment
  one_sec_segment = AudioSegment.silent(duration=silence)  #duration in milliseconds
  # iterate over files
  for file in numeral_files:
    # read wav file to an audio segment
    sound = AudioSegment.from_wav(file)
    # add silence at the beginning and at the end
    final_sound = one_sec_segment + sound + one_sec_segment
    # save the amended file
    final_sound.export(file, format="wav")

### Putting it all together - the model

The code below uses some of the functions developed earlier. It requires the relevant files to be uploaded to Collab panel on the left. The files can be found on by GitHub at:<br>
https://github.com/maciejtarsa/MLEnd-Spoken-Numerals/tree/main/telephone_numbers <br><br>
Initially, I was aiming to test this on 5 telephone numbers spoken by two speakers. After initial dissapointing results, I decided to include 5 telephone numbers put together from the training data (it should therefore contain 11 speakers pronuncing a numeral each). Hence, I ended up with 15 files, 5 by speaker 1, 5 by speaker 2 and 5 put together by randomly selected speakers.

In [34]:
# Set up a dataframe to compare results
# only need to do this once

telephone_results = pd.DataFrame(
      [['M_1.wav', '07486124390','',''],
      ['M_2.wav', '84566510972','',''],
      ['M_3.wav', '61237890121','',''],
      ['M_4.wav', '94459718231','',''],
      ['M_5.wav', '07578414321','',''],
      ['V_1.wav', '07486124390','',''],
      ['V_2.wav', '84566510972','',''],
      ['V_3.wav', '61237890121','',''],
      ['V_4.wav', '94459718231','',''],
      ['V_5.wav', '07578414321','',''],
      ['R_1.wav', '22459577439','',''],
      ['R_2.wav', '32996314224','',''],
      ['R_3.wav', '75093654553','',''],
      ['R_4.wav', '47295281336','',''],
      ['R_5.wav', '21404158635','','']],
      columns = ['File', 'Actual', 'Predicted', 'Accuracy']) 

In [35]:
# create a list of .wav files to import
telephone_files = glob.glob('./*_*.wav')
# iterate over them
for telephone in telephone_files:
  tel_file = telephone
  # chunk the file
  chunks = split(tel_file)
  # save chunks as .wav files in the local collab environment
  for i, chunk in enumerate(chunks):
    chunk.export("./chunk{0}.wav".format(i), format="wav")
  # add silence to all chink files
  # addSilence(500)
  # create a list of .wav files to import
  numeral_files = glob.glob('./chunk*.wav')
  # extract features
  X = getX(numeral_files)
  # run the model
  runModel()
# print all results
telephone_results

100%|██████████| 11/11 [00:02<00:00,  4.25it/s]
100%|██████████| 11/11 [00:01<00:00,  5.55it/s]
100%|██████████| 11/11 [00:02<00:00,  4.63it/s]
100%|██████████| 11/11 [00:02<00:00,  4.50it/s]
100%|██████████| 11/11 [00:02<00:00,  4.64it/s]
100%|██████████| 11/11 [00:02<00:00,  4.68it/s]
100%|██████████| 11/11 [00:02<00:00,  5.25it/s]
100%|██████████| 11/11 [00:02<00:00,  4.23it/s]
100%|██████████| 11/11 [00:02<00:00,  4.31it/s]
100%|██████████| 11/11 [00:02<00:00,  4.63it/s]
100%|██████████| 11/11 [00:02<00:00,  5.40it/s]
100%|██████████| 11/11 [00:02<00:00,  4.31it/s]
100%|██████████| 11/11 [00:02<00:00,  5.10it/s]
100%|██████████| 11/11 [00:02<00:00,  4.19it/s]
100%|██████████| 11/11 [00:01<00:00,  5.60it/s]


Unnamed: 0,File,Actual,Predicted,Accuracy
0,M_1.wav,7486124390,86739763397,0.181818
1,M_2.wav,84566510972,36277971693,0.0
2,M_3.wav,61237890121,87333131760,0.0909091
3,M_4.wav,94459718231,89777379736,0.0909091
4,M_5.wav,7578414321,98673661607,0.0909091
5,V_1.wav,7486124390,16799173837,0.0909091
6,V_2.wav,84566510972,46779971693,0.0
7,V_3.wav,61237890121,77182191260,0.0909091
8,V_4.wav,94459718231,19762379796,0.0
9,V_5.wav,7578414321,96773199607,0.0909091


The results are very dissapointing, the model is capable of predicting at most 2 characters, performance is essentially as bad as or even worse than random.<br><br>
I noticed that the way pydub was splitting the sound files was that individual chunks did not have any leading or following silence. I tried adding it, hoping the performance would improve. I experimented with a few different durations of the silnce. Unfortunately, it made no difference to the performance of the model. It did make the processing of each telephone number take longer. Silence of 1000 miliseconds took about 6s to process each file, silcence of 500 miliseconds 4s to process each file. Without adding the silnce, it took about 2s per file. Hence I did not include the additional silence in the end.

Let's create a random set of numbers and compare that to the performance of my model.

In [36]:
from random import randint
# add new column to the data frame
telephone_results['Random_number'] = ''
telephone_results['Random_accuracy'] = ''

for i, row in telephone_results.iterrows():
  string = ''
  for _ in range(11):
    string += str(randint(0,9))
  telephone_results.loc[i,'Random_number'] = string
  # calculate the accuracy by comparing the strings
  actual_string = str(telephone_results.loc[i, 'Actual'])
  sum = 0
  if len(actual_string) == 11:
    for j in range(11):
      if string[j] == actual_string[j]:
        sum += 1
  telephone_results.loc[i,'Random_accuracy'] =  sum/11

# convert the accuracy columns to numeric
telephone_results['Accuracy'] = pd.to_numeric(telephone_results['Accuracy'])
telephone_results['Random_accuracy'] = pd.to_numeric(telephone_results['Random_accuracy'])
# and random numbers as string
telephone_results['Random_number'] = telephone_results['Random_number'].astype(str)
# add an average row
telephone_results.loc['mean'] = telephone_results.mean()
# print the results
telephone_results

Unnamed: 0,File,Actual,Predicted,Accuracy,Random_number,Random_accuracy
0,M_1.wav,7486124390.0,86739763397.0,0.181818,93431037153.0,0.090909
1,M_2.wav,84566510972.0,36277971693.0,0.0,53160768749.0,0.090909
2,M_3.wav,61237890121.0,87333131760.0,0.090909,53752461528.0,0.090909
3,M_4.wav,94459718231.0,89777379736.0,0.090909,64509819925.0,0.272727
4,M_5.wav,7578414321.0,98673661607.0,0.090909,7319137258.0,0.181818
5,V_1.wav,7486124390.0,16799173837.0,0.090909,76287357380.0,0.272727
6,V_2.wav,84566510972.0,46779971693.0,0.0,66218607982.0,0.181818
7,V_3.wav,61237890121.0,77182191260.0,0.090909,6981556177.0,0.090909
8,V_4.wav,94459718231.0,19762379796.0,0.0,74176990509.0,0.090909
9,V_5.wav,7578414321.0,96773199607.0,0.090909,36826355958.0,0.0


Randomly generated numbers are performing better than the model trained. The randomly generated number achieved the accuracy of around 0.12, my model achieved the accuracy of 0.08. I was hoping for at least similar accuracy to the validation accuracy, which was around 0.3. I would have expected at least the files put together from the training data to perform better, but unfortunately, none of the files performed satisfactory.

### Conclusions

Unfortunately, the model I developed was not capable of performing well in a deployment scenario. The model was not great to begin with, as it only achieved validation accuracy of 30%, which is better that random (1 in 10 or 10%), but still not very accurate. When deployed to recognise digits from a spoken telephone number (11 digits), it only achieved accuracy of 6%. Number generated by random number generator achieved accuracy of 8% on the same data. <br><br>

For testing of this product, I used a series of numbers spoken by myself (Polish native speaker), my wife (English native speaker) and numerals randomly selected from the testing/validation dataset. Each subset achieved the following results:
 - out of a set of 5, 11-long numbers spoken by muself (files beginning with M), the model correctly recognised 5 digits 
 - out of a set of 5, 11-long numbers spoken by my wife (files beginning with V), the model correctly recognised 3 digits
 - out of a set of 5, 11-long numbers spoken by random participants from training/validation data (files starting with R), 5 were recognised correctly<br><br>

The number spoken by an English native speaker achieved the worst performance, perhaps indicating that accent of the speaker plays a heavy role in digit recognition. However, I would have expected for the ones taken from training/validation to perform slightly better. Potentially, providing the model with mode test data would provide more insight on this.<br><br>
Perhaps exploring a more complicated model, such as neural network, would produce better results in this case.<br><br>
Furthermore, it might be possible to improve the model by expanding the training dataset, e.g. additionally using spoken digits dataset:<br>
https://www.tensorflow.org/datasets/catalog/spoken_digit<br><br>


