In this lab we will use Python to work with a simple speech recognition system.
Unzip the Lab11.rar file. It contains:

* Vowel recordings, in the folder /Lab11/DB/DBvocals/*.wav
* Part of these recordings is used to train the vowel
recognition system. The rest is used for testing: we classify
those recordings and evaluate the accuracy of the system as the
percentage of correct classifications.
* Guide files:
* /Lab11/DBvocals_train_list_times.txt
* /Lab11/DBvocals_test_list_times.txt
* These guide files specify which recordings are used for training and
which for testing.
* Each line of a guide file contains the path of one .wav file followed by 10 time
stamps: the start and end times of each of the 5 vowels.
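
As an illustration of this layout, the sketch below splits one guide-file line into the path and the five (start, end) pairs. The line itself is hypothetical: the file name and all time values are made up, and whitespace-separated fields are assumed because the lab code reads the files with np.loadtxt.

```python
# Hypothetical guide-file line: a path followed by 10 time stamps in seconds
# (start and end of each of the 5 vowels). File name and values are made up.
line = ("./Lab11/DB/DBvocals/example.wav "
        "0.50 0.90 1.40 1.80 2.30 2.70 3.20 3.60 4.10 4.50")

fields = line.split()
wav_path = fields[0]
times = [float(t) for t in fields[1:]]

# Pair the stamps as (start, end), one pair per vowel, numbered 1 to 5.
vowel_intervals = {
    vowel_ind: (times[2 * (vowel_ind - 1)], times[2 * (vowel_ind - 1) + 1])
    for vowel_ind in range(1, 6)
}

for vowel_ind, (start, end) in vowel_intervals.items():
    print(f"vowel {vowel_ind}: {start:.2f} s to {end:.2f} s")
```

The indexing (vowel_ind-1)*2 and (vowel_ind-1)*2 + 1 is the same one that process_data uses to look up the start and end time of each vowel.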


* Install the required packages using
    pip install librosa matplotlib scikit-learn

Exercises

A. Set up
Read through the following code to see how it works. You can run it and use a debugger to inspect what the variables contain:

accuracy = lab11('DBvocals_train_list_times.txt', 'DBvocals_test_list_times.txt', 13)

In [4]:
import numpy as np
from scipy.io.wavfile import read
import matplotlib.pyplot as plt
import librosa
from sklearn.neighbors import KNeighborsClassifier

In [2]:
def process_data(dataset, nc, plot_data):
    data_mfcc = np.array([])
    data_labels = np.array([])
    for audio in range(len(dataset)):
        vowel_times = dataset[audio][1:]
        y, sr = librosa.load(dataset[audio][0],
                            sr=44100)
        # Window of 0.025 s (n_fft = 1103 samples) and hop of 0.010 s
        # (hop_length = 441 samples) at sr = 44100 Hz
        mfcc_win = 0.025
        mfcc_hop = 0.010
        mfcc = librosa.feature.mfcc(y=y, 
                                    sr=sr,
                                    hop_length = 441,
                                    n_fft = 1103,
                                    n_mfcc=nc)

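        # One label per MFCC frame: 0 = silence, 1 to 5 = vowel index
        labels = np.zeros(np.shape(mfcc)[1])
        # Time stamp (in seconds) assigned to each MFCC frame
        tt_mfcc = mfcc_win/2 + mfcc_hop*np.arange(np.shape(mfcc)[1])
        for vowel_ind in np.arange(1,6):
            # Frames closest to the start and end time of this vowel
            start_sample = np.argmin(abs(tt_mfcc-float(vowel_times[(vowel_ind-1)*2])))
            end_sample = np.argmin(abs(tt_mfcc-float(vowel_times[(vowel_ind-1)*2 + 1])))
            labels[start_sample:end_sample] = vowel_ind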
        
        data_mfcc  = np.concatenate((data_mfcc, mfcc), axis = 1) if data_mfcc.size else mfcc
        
        data_labels = np.append(data_labels, labels)
        if plot_data:
            tt = np.arange(len(y))/sr
            plt.plot(tt,y)
            plt.plot(tt_mfcc, max(abs(y))/max(labels)*labels, 'r')
            plt.show()
        
    return (data_mfcc, data_labels)
    


In [3]:
train_data = np.loadtxt('./Lab11/DBvocals_train_list_times.txt', dtype = 'unicode')
test_data = np.loadtxt('./Lab11/DBvocals_test_list_times.txt', dtype = 'unicode')

In [92]:
# Write code to run the above function with the train and test data. Test the function using
# process_data(train_data, 13, True)
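
To see where the values n_fft = 1103 and hop_length = 441 in process_data come from, and how the vowel time stamps become one label per frame, the following sketch redoes the arithmetic with NumPy only. It assumes that 1103 was obtained by rounding 1102.5 samples up. The frame count (100) and the interval 0.30 s to 0.60 s are made-up values, not taken from the lab data.

```python
import numpy as np

sr = 44100          # sampling rate used in process_data
mfcc_win = 0.025    # analysis window in seconds
mfcc_hop = 0.010    # hop between frames in seconds

# Convert seconds to samples (assumption: the window length is rounded up).
n_fft = int(np.ceil(mfcc_win * sr))       # 1102.5 samples rounded up
hop_length = int(round(mfcc_hop * sr))

# Time axis for a hypothetical recording of 100 frames, built as in process_data.
n_frames = 100
tt_mfcc = mfcc_win / 2 + mfcc_hop * np.arange(n_frames)

# Made-up vowel interval: frames between 0.30 s and 0.60 s get label 1,
# every other frame keeps label 0 (silence).
labels = np.zeros(n_frames)
start_frame = np.argmin(np.abs(tt_mfcc - 0.30))
end_frame = np.argmin(np.abs(tt_mfcc - 0.60))
labels[start_frame:end_frame] = 1

print(n_fft, hop_length, start_frame, end_frame, int(labels.sum()))
```

Because the slice start_frame:end_frame excludes its end index, the frame closest to the end time keeps the silence label.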

P1. Open the input file DBvocals_test_list_times.txt and make sure you
understand its content. To which vowel do the times 4.28 and 4.74
correspond?

P2. By debugging the code, plot the data in the
train_labels variable. What does it correspond to?

P3. By debugging the code with breakpoints, check the dimensions of
train_mfcc.

B. Classification with “nearest neighbours”

A simple method for classification is called nearest neighbours. Starting from two
data sets, one of them serves as a reference (training set) used to classify
the other one (test set). The method assigns to each unknown data point the class
of the closest element of the training set, based on a measure of distance between vectors.
For each data point in the test set to be classified (example: vector "a" of N = 13
MFCC coefficients), we calculate the Euclidean distance to each element "b"
of the training set. Remember not to compute the distance to silence vectors,
only to vowel MFCC vectors. Thus, in our case of vectors of 13 components, the
distance between two vectors "a" and "b" is:
\begin{equation}
d = \sqrt{(a_{1}-b_{1})^2 + (a_{2}-b_{2})^2 + \dots + (a_{N}-b_{N})^2}, \quad \text{with } N = 13
\end{equation}
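
As a quick numerical check of this formula, the sketch below computes the distance between two made-up 13-component vectors, once term by term and once with np.linalg.norm. The vectors are illustrative values chosen so that the result is easy to verify by hand, not real MFCCs.

```python
import numpy as np

N = 13

# Made-up vectors: they differ by 3 in the first component and by 4 in the
# second, so the distance is sqrt(3**2 + 4**2) = 5.
a = np.zeros(N)
b = np.zeros(N)
a[0] = 3.0
a[1] = 4.0

# Direct translation of the formula: sum of squared differences, then root.
d_formula = np.sqrt(np.sum((a - b) ** 2))

# Same quantity using the Euclidean (L2) norm of the difference vector.
d_norm = np.linalg.norm(a - b)

print(d_formula, d_norm)  # 5.0 5.0
```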

The "nearest neighbours" criterion assigns to the observation being
classified the class of the closest (minimum distance d)
element of the training set. In general, we will use the training recordings
as the reference and classify the test set.
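
The criterion can be sketched for a single test vector as follows. This is only an illustration on toy data with 2 coefficients and 4 frames. The helper name nearest_neighbour_label is hypothetical, and the (n_coefficients, n_frames) layout is assumed to match what process_data returns.

```python
import numpy as np

def nearest_neighbour_label(test_vec, train_mfcc, train_labels):
    """Return the label of the training vowel frame closest to test_vec.

    train_mfcc is assumed to have shape (n_coefficients, n_frames).
    Frames labelled 0 (silence) are excluded from the search.
    """
    vowel_frames = train_labels > 0
    candidates = train_mfcc[:, vowel_frames]
    candidate_labels = train_labels[vowel_frames]
    # Euclidean distance from test_vec to every remaining training frame.
    diffs = candidates - test_vec[:, np.newaxis]
    distances = np.sqrt(np.sum(diffs ** 2, axis=0))
    return candidate_labels[np.argmin(distances)]

# Toy training set: one silence frame and one frame for each of three vowels.
train_mfcc = np.array([[0.0, 1.0, 5.0, 9.0],
                       [0.0, 1.0, 5.0, 9.0]])
train_labels = np.array([0, 1, 2, 3])

# This vector is closest to the silence frame, but silence is skipped,
# so the nearest vowel frame (label 1) wins.
test_vec = np.array([0.2, 0.1])
print(nearest_neighbour_label(test_vec, train_mfcc, train_labels))
```

The distance computation here is vectorized over the training frames, so classifying a whole test set this way still needs one pass over the test vectors.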

class_labels = nearest_neighbours(train_labels, train_mfcc, test_labels, test_mfcc);

P4. Implement the nearest_neighbours function. How many ‘for’ loops does
your implementation need in order to process all the data?

* In this case, just try to understand how the algorithm works and how many for loops you would need to implement it. For this exercise, you can use KNeighborsClassifier from the sklearn library with K=5. To use it:
* knn = KNeighborsClassifier(n_neighbors=(K), metric='euclidean')
* knn.fit((training data), (training labels))
* (prediction labels) = knn.predict((test data))

* Note: You might have to use numpy.transpose on the training data and test data to match their dimensions with the corresponding labels
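
The calls above, including the transposition, can be tried on synthetic data before using the real MFCCs. The sketch below assumes the (n_coefficients, n_frames) layout returned by process_data. The data are random clusters invented for the example, so the result says nothing about the accuracy to expect in the lab.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_coeff = 13

# Synthetic "MFCCs": three well-separated classes, stored as
# (n_coefficients, n_frames) like the output of process_data.
train_labels = np.repeat([1, 2, 3], 20)                              # 60 frames
train_mfcc = 10.0 * train_labels + rng.normal(size=(n_coeff, 60))
test_labels = np.repeat([1, 2, 3], 5)                                # 15 frames
test_mfcc = 10.0 * test_labels + rng.normal(size=(n_coeff, 15))

# scikit-learn expects one row per sample, hence the transposes:
# (13, 60) becomes (60, 13), matching the 60 training labels.
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(train_mfcc.T, train_labels)
class_labels = knn.predict(test_mfcc.T)

print(train_mfcc.T.shape, class_labels.shape)
```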

P5. Which set should you first iterate to build the nearest neighbours
classifier?

To evaluate the classification accuracy, we calculate the percentage of correctly
classified frames. That is, only for those frames that correspond to a vowel, we
compare the reference (test_labels) to our classification (class_labels).
Implement the code to compute the accuracy of the test frame
classification (only taking into account frames labelled as a vowel, not silence).
accuracy = ... (it is a percentage)

For this part, use the numpy.where function to filter out the labels that are set to 0 (silence). The vowel labels are 1 to 5. Once you have the filtered labels, use the resulting indices to select the corresponding data points. Remember to filter the test labels and test data in the same way before predicting.
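
One possible shape for this computation is sketched below on small made-up label arrays. The values are illustrative only, and here the silence frames are dropped after prediction rather than before, purely to keep the example short.

```python
import numpy as np

# Made-up reference labels (0 = silence) and classifier output for 10 frames.
test_labels = np.array([0, 0, 1, 1, 2, 2, 0, 3, 3, 0])
class_labels = np.array([1, 2, 1, 1, 2, 3, 5, 3, 1, 4])

# Indices of the frames whose reference label is a vowel.
vowel_idx = np.where(test_labels != 0)[0]

# Percentage of vowel frames whose predicted label matches the reference.
matches = class_labels[vowel_idx] == test_labels[vowel_idx]
accuracy = 100.0 * np.mean(matches)

print(vowel_idx, accuracy)
```

In this toy case 4 of the 6 vowel frames match, so the accuracy is about 66.7 %.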

C. Classification with different configurations

Run the code with the training file that accompanies the lab. As a test, use this
same training file. Use nc = 13.

P6. Which accuracy do you get when classifying the train set (percentage)?

Run the code with the training and test files that accompany the lab. Use nc = 13
MFCCs.

P7. Which accuracy do you get when classifying the test set?

P8. Which accuracy do you get when classifying the test set with 3
coefficients?

P9. Which accuracy do you get when classifying the test set using only the
first 2 training files as reference and 13 coefficients?