# AI Framework
## Speech to Text
#### Nouha Kacimi Alaoui, Nour Amine, Scarlett Gatt, Hà Nhi Ngo and Khoa Thi Nguyen

"Hey Siri" and "OK Google" are phrases that have become part of our daily life with the appearance of voice search on our smartphones and laptops. Nowadays, more and more people use voice recognition to ask, for example, what time it is, what the weather is like, or even sports results.
Automatic speech recognition is a computing technique that analyzes a human voice in order to transcribe it into text. Several algorithms implement it, among them the Google Speech-to-Text algorithm.
This algorithm lets developers convert sound to text by applying powerful neural network models through an easy-to-use API, which recognizes 120 languages.
The Cloud Speech-to-Text API is built on deep learning, and this neural-network-based technology gives it very high recognition accuracy. However, we noticed some transcription errors while testing it.
Although Google has one of the most powerful voice recognition technologies on the market, it struggles to correctly transcribe the words of people who have a pronounced accent or a speech difficulty. To keep things simple, we therefore decided to work on another algorithm, which we found in the "Challenges Speech-To-Text Algorithms" section on Kaggle. It transcribes an audio clip containing a single English word into text.

We start by importing the libraries that we need.

In [1]:
#path
import os
from os.path import isdir, join
from pathlib import Path

# Scientific Math 
import numpy as np
from scipy.fftpack import fft
from scipy import signal
from scipy.io import wavfile
from sklearn.model_selection import train_test_split

# Visualization
import matplotlib.pyplot as plt
import plotly.offline as py
import plotly.graph_objs as go

# Deep learning
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras import Input, layers
from tensorflow.keras import backend as K

import random
import copy
import librosa

%matplotlib inline

### Dataset

In this section, we will import our data and gather it into groups for later use.

For that, we first extract the names of the folders that contain the sound samples. Each folder is named after the word spoken in the 'wav' files it holds.

Note: you have to change `train_audio_path` so that it points to your data.

We use the `os.listdir` function with the condition `isdir(join(train_audio_path, f))` to return the names of the folders in our path. The condition keeps plain file names out of the list.
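The effect of this filter can be tried on a throwaway directory. The sketch below is only an illustration: the folder and file names are hypothetical, and the tree is created in a temporary directory rather than in the real dataset.

```python
import os
import tempfile
from os.path import isdir, join

# Build a small temporary tree (hypothetical names) to show the filter.
with tempfile.TemporaryDirectory() as root:
    for folder in ("yes", "no", "_background_noise_"):
        os.mkdir(join(root, folder))
    # A stray file at the top level must not be treated as a label.
    with open(join(root, "README.md"), "w") as handle:
        handle.write("not a label")

    entries = sorted(os.listdir(root))
    labels = sorted(f for f in os.listdir(root) if isdir(join(root, f)))

print(entries)  # includes README.md
print(labels)   # folders only
```

Because the underscore sorts before lowercase letters, `_background_noise_` ends up first in the sorted list, which is why the notebook can skip it with `dirs[1:]`.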

In [2]:
train_audio_path = 'train/audio/'
dirs = [f for f in os.listdir(train_audio_path) if isdir(join(train_audio_path, f))]
dirs.sort()
print('Number of labels: ' + str(len(dirs[1:])))
print(dirs)

Number of labels: 30
['_background_noise_', 'bed', 'bird', 'cat', 'dog', 'down', 'eight', 'five', 'four', 'go', 'happy', 'house', 'left', 'marvin', 'nine', 'no', 'off', 'on', 'one', 'right', 'seven', 'sheila', 'six', 'stop', 'three', 'tree', 'two', 'up', 'wow', 'yes', 'zero']


Because the dataset is large, we organize the sounds into three groups:

- The target list is the first group. It contains the folders of the following words: `'yes', 'no', 'up', 'down', 'left', 'right', 'on', 'off', 'stop', 'go'`.
- The unknown list contains every folder except those in the target list and the background noise folder. Its folders are: `'bed', 'bird', 'cat', 'dog', 'eight', 'five', 'four', 'happy', 'house', 'marvin', 'nine', 'one', 'seven', 'sheila', 'six', 'three', 'tree', 'two', 'wow', 'zero'`.
- The background noise group is the list of the files ending in '.wav' in the '_background_noise_' folder, so that only sound files are picked up.


In [3]:
all_wav = []
unknown_wav = []
label_all = []
label_value = {}
target_list = ['yes', 'no', 'up', 'down', 'left', 'right', 'on', 'off', 'stop', 'go']
unknown_list = [d for d in dirs if d not in target_list and d != '_background_noise_' ]
print('target_list : ',end='')
print(target_list)
print('unknowns_list : ', end='')
print(unknown_list)
print('silence : _background_noise_')

target_list : ['yes', 'no', 'up', 'down', 'left', 'right', 'on', 'off', 'stop', 'go']
unknowns_list : ['bed', 'bird', 'cat', 'dog', 'eight', 'five', 'four', 'happy', 'house', 'marvin', 'nine', 'one', 'seven', 'sheila', 'six', 'three', 'tree', 'two', 'wow', 'zero']
silence : _background_noise_


The background noises are processed as follows:

1- Using `librosa.load`, we load each background noise recording as a floating-point time series. The function returns the samples (the audio time series) and their sampling rate. By default, the audio is resampled to sr=22050 while loading.

2- We use `librosa.resample` to resample the time series from `sample_rate` (the default value, 22050) to the new sampling rate of 8000 Hz.

3- We append the resampled background noise to the `background_noise` list.

In [4]:
background = [f for f in os.listdir(join(train_audio_path, '_background_noise_')) if f.endswith('.wav')]
background_noise = []
for wav in background : 
    samples, sample_rate = librosa.load(join(train_audio_path, '_background_noise_', wav))
    # Recent librosa releases require keyword arguments for the sampling rates
    samples = librosa.resample(samples, orig_sr=sample_rate, target_sr=8000)
    background_noise.append(samples)

We now process the rest of the data:

1- For each folder (except the background noise), we select all of its 'wav' files and put them in the `waves` list.

2- We number the folders from 1 to 30 as follows: `1:bed 2:bird 3:cat 4:dog 5:down 6:eight 7:five 8:four 9:go 10:happy 11:house 12:left 13:marvin 14:nine 15:no 16:off 17:on 18:one 19:right 20:seven 21:sheila 22:six 23:stop 24:three 25:tree 26:two 27:up 28:wow 29:yes 30:zero`. The index stored in `label_value` starts at 0, so it is one less than the number printed.

3- As for the background noise, we load the audio files, this time at a rate of 16000 instead of the default 22050, and then resample them to 8000. Clips that do not contain exactly 8000 samples after resampling (that is, clips that are not one second long) are skipped.

4- If the folder is in the unknown list, we add the samples to the `unknown_wav` list. Otherwise, we add the samples and their folder name to `all_wav`.

The data sampling rate is reduced to 8000Hz to lessen the computation cost.
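To make the saving concrete, the sketch below halves the number of samples of a one-second clip by going from 16000 Hz to 8000 Hz. It is a deliberately naive illustration that keeps every second sample: it is not what `librosa.resample` does internally (proper resampling also filters the signal to avoid aliasing), and `naive_downsample` is a hypothetical helper, not part of the notebook.

```python
import math

def naive_downsample(samples, orig_sr, target_sr):
    """Keep every (orig_sr // target_sr)-th sample. Illustration only: no anti-aliasing filter."""
    if orig_sr % target_sr != 0:
        raise ValueError("this sketch only handles integer ratios")
    step = orig_sr // target_sr
    return samples[::step]

orig_sr, target_sr = 16000, 8000
# One second of a synthetic 440 Hz tone sampled at 16000 Hz
one_second = [math.sin(2 * math.pi * 440 * n / orig_sr) for n in range(orig_sr)]
reduced = naive_downsample(one_second, orig_sr, target_sr)

print(len(one_second), len(reduced))
```

Each clip then feeds 8000 values into the network instead of 16000, which is the reason given above for the reduction.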

In [5]:
i = 0
for direct in dirs[1:]:  # dirs[0] is '_background_noise_'
    waves = [f for f in os.listdir(join(train_audio_path, direct)) if f.endswith('.wav')]
    label_value[direct] = i
    i = i + 1
    print(str(i) + ":" + str(direct) + " ", end="")
    for wav in waves:
        samples, sample_rate = librosa.load(join(train_audio_path, direct, wav), sr=16000)
        # Recent librosa releases require keyword arguments for the sampling rates
        samples = librosa.resample(samples, orig_sr=sample_rate, target_sr=8000)
        if len(samples) != 8000 : 
            continue
            
        if direct in unknown_list:
            unknown_wav.append(samples)
        else:
            label_all.append(direct)
            all_wav.append([samples, direct])

1:bed 2:bird 3:cat 4:dog 5:down 6:eight 7:five 8:four 9:go 10:happy 11:house 12:left 13:marvin 14:nine 15:no 16:off 17:on 18:one 19:right 20:seven 21:sheila 22:six 23:stop 24:three 25:tree 26:two 27:up 28:wow 29:yes 30:zero 

As a reminder:

- `dirs` contains all the folder names.
- `target_list`: `['yes', 'no', 'up', 'down', 'left', 'right', 'on', 'off', 'stop', 'go']` (folders; these are also the values found in `label_all`)
- `unknown_list`: `'bed', 'bird', 'cat', 'dog', 'eight', 'five', 'four', 'happy', 'house', 'marvin', 'nine', 'one', 'seven', 'sheila', 'six', 'three', 'tree', 'two', 'wow', 'zero'` (folders)
- `background_noise` is the list of the resampled background noises (samples).
- `unknown_wav` is the list of the resampled clips from the folders of `unknown_list` (samples).
- `all_wav` is the list of the resampled clips from the folders of `target_list` ([samples, folder name]).

In the next step, we split `all_wav` into two lists: `wav_all` holds the samples and `label_all` holds the folder names (the values of `direct`).

In [6]:
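# all_wav holds [samples, folder name] pairs. Unpack them directly, because
# building a NumPy array from such mixed pairs can fail on recent NumPy versions.
wav_all = np.array([samples for samples, _ in all_wav])
label_all = [label for _, label in all_wav]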

### Data Augmentation

Unfortunately, we may not have enough data to train our model correctly. Moreover, the audio files are very clean, whereas people who want to transcribe their speech may be surrounded by background noise. To make sure the model can still understand them, we duplicate the audio files and add randomly chosen background noise to the copies.

In [7]:
# Pick a random start point and return an excerpt of 8000 samples (one second)
def get_one_noise(noise_num = 0):
    selected_noise = background_noise[noise_num]
    start_idx = random.randint(0, len(selected_noise)- 1 - 8000)
    return selected_noise[start_idx:(start_idx + 8000)]

A one-second excerpt of background noise is selected and stored in `noise`. Then each audio clip in `wav_all` is either recorded in `delete_index` (to be deleted later) if its length is not 8000, or mixed with the noise, scaled by `max_ratio`, and added to `noised_wav`.
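The mixing step can be illustrated on synthetic data. This is a minimal sketch under stated assumptions: the clip and the noise recording are generated with NumPy rather than read from the dataset, and `random_window` is a hypothetical helper that plays the role of `get_one_noise`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (hypothetical data, not the Kaggle files)
clip = np.sin(2 * np.pi * 200 * np.arange(8000) / 8000)  # one second at 8000 Hz
recording = rng.normal(0.0, 1.0, size=60 * 8000)         # one minute of "noise"

def random_window(recording, length, rng):
    """Return `length` consecutive samples starting at a random position."""
    start = rng.integers(0, len(recording) - length + 1)
    return recording[start:start + length]

max_ratio = 0.1
noise = random_window(recording, 8000, rng)
noisy_clip = clip + max_ratio * noise  # the word stays dominant, the noise is attenuated

print(noisy_clip.shape)
print(float(np.max(np.abs(noisy_clip - clip))))
```

Scaling the noise by `max_ratio` keeps the spoken word dominant while still exposing the model to imperfect recordings.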

In [8]:
max_ratio = 0.1
noised_wav = []
augment = 1
delete_index = []
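for i in range(augment):
    noise = get_one_noise(i)
    # Use a separate index name so that the outer loop variable is not shadowed
    for j, s in enumerate(wav_all):
        if len(s) != 8000:
            delete_index.append(j)
            continue
        s = s + (max_ratio * noise)
        noised_wav.append(s)
# np.delete returns a new array. The results are not assigned here, which is
# harmless because clips of the wrong length were already skipped at load time.
np.delete(wav_all, delete_index)
np.delete(label_all, delete_index)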

array(['down', 'down', 'down', ..., 'yes', 'yes', 'yes'], dtype='<U5')

For later use, `wav_all` is converted to a NumPy array, `wav_vals`, and the labels are copied into the list `label_vals`.

In [9]:
wav_vals = np.array([x for x in wav_all])
label_vals = [x for x in label_all]
wav_vals.shape

(21312, 8000)

`label_vals` must now be extended so that it lines up with the concatenation of `wav_vals` and `noised_wav` that happens later. To do so, `label_vals` is first copied into `labels` (with `copy.deepcopy`, which creates a true copy: if `label_vals` changes, `labels` does not), and `labels` is then appended once for each noise pass applied earlier.
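The alignment between clips and labels can be checked on a toy example. The sketch below uses hypothetical labels and placeholder arrays, and it uses `np.tile` as a compact alternative to the loop in the next cell. It only illustrates that the labels must be repeated `augment + 1` times.

```python
import numpy as np

augment = 1
labels = np.array(['yes', 'no', 'up'])  # hypothetical labels of three clean clips
clean = np.zeros((3, 8000))             # placeholder clean clips
noisy = [np.ones((3, 8000)) for _ in range(augment)]  # one noisy copy per pass

# Clean clips first, then each noisy copy, in the same order as the labels
all_clips = np.concatenate([clean] + noisy, axis=0)
all_labels = np.tile(labels, augment + 1).reshape(-1, 1)

print(all_clips.shape, all_labels.shape)
```

Row `k` of the noisy block carries the same label as row `k` of the clean block, because the noisy copies are produced in the original order.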

In [10]:
labels = copy.deepcopy(label_vals)
for _ in range(augment):
    label_vals = np.concatenate((label_vals, labels), axis = 0)
label_vals = label_vals.reshape(-1,1)

For the moment, only the words of the target list can be detected. However, people may say words that are not in this list, or even not in this language. As things stand, such words would be matched to a word in the list. It is better if the model can recognize that a word is not in its dictionary. To this end, the `unknown` list is built from a number of randomly chosen audio files from the folders set aside earlier (the audio files corresponding to the words of `unknown_list`).

The following cell creates the `unknown` list.
It first shuffles the files so that, when the first 2000*(augment+1) audio files are taken, every word of `unknown_list` should appear.

In [11]:
unknown = unknown_wav
np.random.shuffle(unknown_wav)
unknown = np.array(unknown)
unknown = unknown[:2000*(augment+1)]
unknown_label = np.array(['unknown' for _ in range(2000*(augment+1))])
unknown_label = unknown_label.reshape(2000*(augment+1),1)

Some audio files in `unknown` may not have the same length as the rest of the data, so every file whose length is not the expected 8000 is deleted.

In [12]:
delete_index = []
for i,w in enumerate(unknown):
    if len(w) != 8000:
        delete_index.append(i)
unknown = np.delete(unknown, delete_index, axis=0)

It would be a problem if the algorithm started writing random words from its dictionary in response to background sound. To prevent that, a third category is created: `silence_wav`. As before, excerpts of length 8000 are taken from the background noise and put in the `silence_wav` list. Each one is associated with the label `silence`, stored in the `silence_label` list.

In [13]:
# Silence audio: take the same number of excerpts from each background recording
silence_wav = []
num_wav = (2000*(augment+1))//len(background_noise)
for i, _ in enumerate(background_noise):
    for _ in range(num_wav):
        silence_wav.append(get_one_noise(i))
silence_wav = np.array(silence_wav)
silence_label = np.array(['silence' for _ in range(num_wav*len(background_noise))])
silence_label = silence_label.reshape(-1,1)
silence_wav.shape

(3996, 8000)
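The 3996 rows above, rather than 4000, follow from the integer division in the cell. Here is a small worked check, assuming six background recordings. That count is not printed anywhere in this notebook. Six is simply the value consistent with the output above.

```python
augment = 1
n_background_files = 6  # assumption: not shown in the notebook output

requested = 2000 * (augment + 1)
per_file = requested // n_background_files  # integer division drops the remainder
total = per_file * n_background_files

print(requested, per_file, total)
```

The four excerpts lost to rounding explain why the silence class ends up slightly smaller than the unknown class.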

In [14]:
wav_vals    = np.reshape(wav_vals,    (-1, 8000))
noised_wav  = np.reshape(noised_wav,  (-1, 8000))
unknown       = np.reshape(unknown,   (-1, 8000))
silence_wav = np.reshape(silence_wav, (-1, 8000))

As a sanity check, we print the dimensions of the arrays.

In [15]:
print(wav_vals.shape)
print(noised_wav.shape)
print(unknown.shape)
print(silence_wav.shape)

(21312, 8000)
(21312, 8000)
(4000, 8000)
(3996, 8000)


In [16]:
print(label_vals.shape)
print(unknown_label.shape)
print(silence_label.shape)

(42624, 1)
(4000, 1)
(3996, 1)


All the new kinds of audio files are appended to `wav_vals`, and `label_vals` is updated with the corresponding labels. Reminder: the labels corresponding to `noised_wav` were already added to `label_vals` a little earlier.

In [17]:
wav_vals = np.concatenate((wav_vals, noised_wav), axis = 0)
wav_vals = np.concatenate((wav_vals, unknown), axis = 0)
wav_vals = np.concatenate((wav_vals, silence_wav), axis = 0)

In [18]:
label_vals = np.concatenate((label_vals, unknown_label), axis = 0)
label_vals = np.concatenate((label_vals, silence_label), axis = 0)

Before continuing, let's check that `wav_vals` and `label_vals` have the same length.

In [19]:
print(len(wav_vals))
print(len(label_vals))

50620
50620


### Prepare Train

The data is now ready. As usual, we split it into two groups: one to train the model and one to test its accuracy.
The test group, which needs fewer samples than the training group, contains 20% of the audio files and their corresponding labels.
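The idea of a seeded 80/20 split can be sketched with the standard library alone. This is an illustration, not scikit-learn's implementation: `split_indices` is a hypothetical helper, and its shuffle order will not match the one produced by `train_test_split`, even with the same seed value.

```python
import random

def split_indices(n_items, test_percent, seed):
    """Shuffle the indices with a fixed seed, then cut off the test share."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_test = (n_items * test_percent + 99) // 100  # test share, rounded up
    return indices[n_test:], indices[:n_test]

# Toy example: 10 items, 20% held out for testing
train_idx, test_idx = split_indices(10, 20, seed=1993)
print(len(train_idx), len(test_idx))

# Same proportions applied to the 50620 clips of this notebook
train_full, test_full = split_indices(50620, 20, seed=1993)
print(len(train_full), len(test_full))
```

Fixing the seed makes the split reproducible, so the accuracy measured on the test group can be compared between runs.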

In [20]:
train_wav, test_wav, train_label, test_label = train_test_split(
    wav_vals, label_vals,
    test_size=0.2,
    random_state=1993,
    shuffle=True)

In [21]:
# Parameters
lr = 0.001
generations = 20000
num_gens_to_wait = 250
batch_size = 512
drop_out_rate = 0.5
input_shape = (8000,1)

In [22]:
# Conv1D expects a channel axis: (samples, 8000) -> (samples, 8000, 1)
train_wav = train_wav.reshape(-1,8000,1)
test_wav = test_wav.reshape(-1,8000,1)

A list containing all the labels in use is created and called `label_value`.

In [23]:
label_value = list(target_list)  # copy, so that target_list itself is not modified
label_value.append('unknown')
label_value.append('silence')

A dictionary mapping each word to its index is created (`new_label_value`) and then assigned to `label_value`.

In [24]:
new_label_value = dict()
for i, l in enumerate(label_value):
    new_label_value[l] = i
label_value = new_label_value

Now that we have this dictionary, we can convert the labels from strings to integers.

In [25]:
temp = []
for v in train_label:
    temp.append(label_value[v[0]])
train_label = np.array(temp)

temp = []
for v in test_label:
    temp.append(label_value[v[0]])
test_label = np.array(temp)

`train_label` and `test_label` are converted to binary class matrices (one-hot encoding).
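The two conversions (string to integer, then integer to one-hot row) can be sketched with NumPy alone. This is an illustration with a shortened, hypothetical label set. It uses `np.eye` instead of `keras.utils.to_categorical`, so that it runs without TensorFlow.

```python
import numpy as np

label_names = ['yes', 'no', 'unknown', 'silence']  # shortened, hypothetical label set
name_to_index = {name: i for i, name in enumerate(label_names)}

labels = np.array([['no'], ['silence'], ['yes']])  # shape (3, 1), like label_vals
indices = np.array([name_to_index[row[0]] for row in labels])

# Row i of the identity matrix is the one-hot vector of class i
one_hot = np.eye(len(label_names), dtype=np.float32)[indices]

print(indices)
print(one_hot)
```

Each row contains a single 1 in the column of its class, which is the target format expected by the categorical cross-entropy loss used below.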

In [27]:
train_label = keras.utils.to_categorical(train_label, len(label_value))
test_label = keras.utils.to_categorical(test_label, len(label_value))

Before using the arrays created earlier, let's check the dimensions again, verifying in particular that the wav array and the label array have matching lengths for both training and testing.

In [29]:
print('Train_Wav Dimension : ' + str(np.shape(train_wav)))

Train_Wav Dimension : (40496, 8000, 1)


In [30]:
print('Train_Label Dimension : ' + str(np.shape(train_label)))

Train_Label Dimension : (40496, 12)


In [31]:
print('Test_Wav Dimension : ' + str(np.shape(test_wav)))

Test_Wav Dimension : (10124, 8000, 1)


In [32]:
print('Test_Label Dimension : ' + str(np.shape(test_label)))

Test_Label Dimension : (10124, 12)


In [33]:
print('Number Of Labels : ' + str(len(label_value)))

Number Of Labels : 12


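Everything is ready; all we need now is a model. The one below stacks five 1D convolution layers, each followed by max pooling, then two dense layers and a softmax output over the labels.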

In [34]:
input_tensor = Input(shape=(input_shape))

x = layers.Conv1D(8, 11, padding='valid', activation='relu', strides=1)(input_tensor)
x = layers.MaxPooling1D(2)(x)
x = layers.Dropout(drop_out_rate)(x)
x = layers.Conv1D(16, 7, padding='valid', activation='relu', strides=1)(x)
x = layers.MaxPooling1D(2)(x)
x = layers.Dropout(drop_out_rate)(x)
x = layers.Conv1D(32, 5, padding='valid', activation='relu', strides=1)(x)
x = layers.MaxPooling1D(2)(x)
x = layers.Dropout(drop_out_rate)(x)
x = layers.Conv1D(64, 5, padding='valid', activation='relu', strides=1)(x)
x = layers.MaxPooling1D(2)(x)
x = layers.Dropout(drop_out_rate)(x)
x = layers.Conv1D(128, 3, padding='valid', activation='relu', strides=1)(x)
x = layers.MaxPooling1D(2)(x)
x = layers.Flatten()(x)
x = layers.Dense(256, activation='relu')(x)
x = layers.Dropout(drop_out_rate)(x)
x = layers.Dense(128, activation='relu')(x)
x = layers.Dropout(drop_out_rate)(x)
output_tensor = layers.Dense(len(label_value), activation='softmax')(x)

model = tf.keras.Model(input_tensor, output_tensor)

# Recent Keras versions name the optimizer argument `learning_rate` (formerly `lr`)
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adam(learning_rate=lr),
              metrics=['accuracy'])




In [35]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 8000, 1)           0         
_________________________________________________________________
conv1d (Conv1D)              (None, 7990, 8)           96        
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 3995, 8)           0         
_________________________________________________________________
dropout (Dropout)            (None, 3995, 8)           0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 3989, 16)          912       
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 1994, 16)          0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 1994, 16)          0         
__________
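The shapes and parameter counts visible in the summary above can be checked by hand. The sketch below applies the usual formulas for a 1D convolution with 'valid' padding and stride 1 (output length = input length - kernel size + 1, parameters = (kernel size × input channels + 1) × filters) and for max pooling with size 2. The helper names are hypothetical, and only the first two blocks are computed, since the rest of the summary is cut off.

```python
def conv1d_valid(length, in_channels, filters, kernel_size):
    """Output length and parameter count of a Conv1D with 'valid' padding, stride 1."""
    out_length = length - kernel_size + 1
    params = (kernel_size * in_channels + 1) * filters  # weights plus one bias per filter
    return out_length, params

def max_pool(length, pool_size=2):
    """Output length of MaxPooling1D with its default 'valid' padding."""
    return length // pool_size

length, channels = 8000, 1
rows = []
for filters, kernel_size in [(8, 11), (16, 7)]:
    length, params = conv1d_valid(length, channels, filters, kernel_size)
    rows.append(("conv1d", length, filters, params))
    length = max_pool(length)
    rows.append(("max_pooling1d", length, filters, 0))
    channels = filters

for row in rows:
    print(row)
```

The values agree with the summary: 7990 and 96 for the first convolution, 3995 after pooling, then 3989 and 912 for the second convolution and 1994 after pooling.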

### Train

The model is trained on the training data for 30 epochs. The test data is passed as validation data so that its accuracy can be tracked after each epoch.

In [None]:
# Keep the returned History object: the plots below read from it
history = model.fit(train_wav, train_label, validation_data=(test_wav, test_label),
                    batch_size=batch_size,
                    epochs=30,
                    verbose=1)

Train on 40496 samples, validate on 10124 samples
Instructions for updating:
Use tf.cast instead.
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30

In [None]:
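# Summarize history for accuracy (the key is 'accuracy' in recent Keras versions, 'acc' in older ones)
acc_key = 'accuracy' if 'accuracy' in history.history else 'acc'
plt.plot(history.history[acc_key])
plt.plot(history.history['val_' + acc_key])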
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

To conclude, this algorithm is able to distinguish the words of its dictionary even when the audio contains background noise. However, it cannot recognize whole sentences, nor can it tell homophones apart. One way to improve it would be to train it on a larger dictionary.