# Introduction
Spoken Language Identification (LID) systems are classification models that predict the language spoken in a given audio recording. LID can support other speech processing systems such as automatic speech recognition (ASR) or speech translation. In voice assistants, LID acts as a first step, selecting the appropriate grammar from a list of available languages for further semantic analysis. These models can also be used in call centers to redirect an international caller to an operator who is fluent in the identified language.

## Objective

The objective of this project is to use machine learning methods to build a LID model that can discriminate between 4 languages: English, French, Arabic, and Japanese. There are 2 expected phases.
The first phase is building a classifier. You are expected to compare the performance of different models, optimize their hyperparameters, and practice finding the best model.
The second phase is evaluating the model in a simulated real-life deployment. The objective is to understand the challenge of generalization. You are also expected to analyze the model's performance and form hypotheses about the models' weak and strong points. Comparing the accuracy of competing models can provide a better understanding of performance.


## Data

The provided dataset was collected from TEDx talks on YouTube for the task of language identification from audio. The samples are audio recordings of the speakers' speech taken from available TEDx talk videos.
To keep the samples uniform, they follow the convention below:
- The length of each audio file is around 5 seconds (5.00 - 5.99 seconds).
- The format of the audio files is `*.wav`.
- The sample rate of the recordings is 16 kHz (in mono format).
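The convention above can be checked programmatically before training. The sketch below is only an illustration: the helper name `check_wav_convention` is hypothetical, and it assumes the files are uncompressed PCM `.wav` files, which is what Python's standard `wave` module can read.

```python
import wave

def check_wav_convention(path, expected_rate=16000, min_sec=5.0, max_sec=6.0):
    """Return (ok, info) for a PCM .wav file checked against the convention."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()          # samples per second
        channels = wav.getnchannels()      # 1 means mono
        duration = wav.getnframes() / rate # length in seconds
    ok = rate == expected_rate and channels == 1 and min_sec <= duration < max_sec
    return ok, {"rate": rate, "channels": channels, "duration": duration}
```

Calling `check_wav_convention("[your_wav_files_directory]/0000.wav")` would return `True` together with the measured properties if the file is mono, 16 kHz, and between 5 and 6 seconds long.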
### Dataset
The dataset repository contains the recordings in the standard format (`*.wav`, 16 kHz, mono, 5-6 seconds) and a `*.txt` file with one line per recording, holding 4 comma-separated fields:
- The 1st column is the name of the `*.wav` file
- The 2nd column is the URL of the YouTube video
- The 3rd column is the start time of the recording within the YouTube video
- The 4th column is the label (language) of the recorded speech (EN, FR, AR, JP)
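To make the four-column layout concrete, here is a small sketch that parses one line with Python's `csv` module. The sample line is hypothetical: the URL is a placeholder and the start-time format is an assumption, since the real file may encode it differently.

```python
import csv
import io

# Hypothetical line mimicking the described layout (not taken from the real file)
sample_info = "0000.wav,https://www.youtube.com/watch?v=EXAMPLE_ID,00:01:15,FR\n"

FIELDS = ["file_name", "url", "start_time", "label"]
records = [dict(zip(FIELDS, row)) for row in csv.reader(io.StringIO(sample_info))]

print(records[0]["file_name"], records[0]["label"])
```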
### Evaluation set
The evaluation repository contains recordings in the standard format (`*.wav`, 16 kHz, mono, 5-6 seconds) and a `*.txt` file with 2 columns: the file names and the labels to be predicted by your classification model.

### 1. As a first step, download the provided dataset and save it in a directory accessible to your code.

# Feature extractor

Using [librosa](https://librosa.org/doc/latest/index.html) (more information [here](http://conference.scipy.org/proceedings/scipy2015/pdfs/brian_mcfee.pdf)), different time-domain and frequency-domain features can be extracted.
librosa is a Python package for audio and music analysis.

In [None]:
import librosa

# sr should be set to your recording sample rate (16k)
# x,freq = librosa.load("[your_wav_files_directory]/FR_01.wav",sr=16000)
x,freq = librosa.load(r"H:\Home\Documents\ProjetIA\Dataset\Dataset\0000.wav",sr=16000)
# The load function returns the audio time series (x) and
#   the sample rate (freq), which is 16000 here
print("The duration of FR_001.wav in seconds:",len(x)/freq)

The duration of FR_001.wav in seconds: 5.494


We suggest using Mel-Frequency Cepstral Coefficients (MFCCs), which are short-term spectral features.

They are commonly used to extract important information from a voice signal.

MFCCs can be extracted with *librosa.feature.mfcc()* as follows.

In [None]:
# This function returns n_mfcc MFCCs for each
#     time window (frame) of the audio time series
x_mfcc=librosa.feature.mfcc(y=x,sr=freq, n_mfcc=40)
print(x_mfcc.shape)
# x_mfcc holds 40 values (rows) for each time window (column)
# The number of columns, x_mfcc.shape[1], is proportional to the wav file duration (5-6 seconds)

(40, 172)


The extracted MFCC features (x_mfcc in the previous cell) form a matrix of size n_mfcc (40 here, which can be changed) × the number of frames, which is proportional to the input duration.

### For example

If your input audio file has a fixed length of 5.00 seconds, the x_mfcc computed by the code above has size 40 × 157.

If your input audio file has a fixed length of 6.00 seconds, the x_mfcc computed by the code above has size 40 × 188.
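These frame counts can be reproduced by hand. The sketch below assumes librosa's default settings for `librosa.feature.mfcc` (a hop of 512 samples between frames and centered framing), in which case the number of frames is `1 + n_samples // hop_length`. The helper name `expected_mfcc_frames` is made up for this illustration.

```python
def expected_mfcc_frames(duration_sec, sr=16000, hop_length=512):
    """Number of MFCC frames for a clip, assuming centered framing."""
    n_samples = int(round(duration_sec * sr))
    return 1 + n_samples // hop_length

for seconds in (5.00, 5.494, 6.00):
    print(seconds, "s ->", expected_mfcc_frames(seconds), "frames")
```

Under these assumptions the three durations give 157, 172 and 188 frames, matching the two sizes quoted above and the `(40, 172)` shape printed for the 5.494-second file.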

You can find more information about MFCCs and other types of features, such as Root Mean Squared Energy, Spectral Centroid, and Zero Crossing Rate, in [this link](https://www.kaggle.com/volkandl/audio-processing-features-cnn-training); they can also be used in this case. It is also possible to extract several types of features and concatenate them.

The length of the extracted x_mfcc is proportional to the wav file duration. This means that, with audio files of different durations (5.00 - 5.99 seconds), the length of the extracted array varies.
For the LID problem, several methods can be used to feed such features to a model.

## Computing statistics from time series: 

By computing the mean, variance, median, etc., it is possible to summarize a variable-length list of values as one value (or a fixed number of values). With this method, the information related to time is lost.

In [None]:
import numpy as np

def feature_extractor_1(audio_file_dir):

    # load the audio file
    x,freq = librosa.load(audio_file_dir,sr=16000)
    # extract 20 MFCCs
    mfcc=librosa.feature.mfcc(y=x,sr=freq,n_mfcc=20)
    # calculate the mean and variance of each MFCC over time
    mean_mfccs=np.mean(mfcc,axis=1)
    var_mfccs=np.var(mfcc,axis=1)
    # return the means and variances as the audio file feature (40 values)
    return list(mean_mfccs)+list(var_mfccs)
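
The effect of summarizing over time can be seen without any audio. In this sketch, random matrices stand in for MFCC matrices of a 5-second and a 6-second clip (157 and 188 frames); the data and the helper name `summarize` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two stand-in "MFCC" matrices: 20 coefficients, different numbers of frames
short_clip = rng.normal(size=(20, 157))
long_clip = rng.normal(size=(20, 188))

def summarize(mfcc):
    """Concatenate per-coefficient mean and variance into one fixed-size vector."""
    return np.concatenate([mfcc.mean(axis=1), mfcc.var(axis=1)])

print(summarize(short_clip).shape, summarize(long_clip).shape)
```

Both clips end up as vectors of 40 values, so they can be stacked into a regular feature matrix even though the clips differ in length.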

## Converting sample to a fixed length:

The audio time series can be cut to a fixed length (such as the minimum length of 5.00 seconds), which is called *Sequence Truncation*. Another way to obtain a fixed-length sequence is padding to a maximum length (for example 6.00 seconds) by concatenating a constant value such as 0 to the end (or beginning) of the sequence. Both methods keep the time-related information in the time series data.

In [None]:
def feature_extractor_2(audio_file_dir):

    #load the audio files
    x,freq = librosa.load(audio_file_dir,sr=16000)
    # trim the first 5 seconds (Sequence Truncation)
    length_of_5seconds=5*16000
    x_5sec=x[:length_of_5seconds]
    # extract 20 MFCCs
    mfccs_5sec=librosa.feature.mfcc(y=x_5sec,sr=freq,n_mfcc=20)
    # return mfcc of the first 5 sec as the audio file feature
    return mfccs_5sec
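
The cell above covers truncation only. As a complement, here is a minimal sketch of the padding alternative using NumPy; the helper `pad_or_truncate` is hypothetical, and an array of ones stands in for a real 5.5-second recording.

```python
import numpy as np

def pad_or_truncate(x, target_len, pad_value=0.0):
    """Return a copy of x with exactly target_len samples."""
    if len(x) >= target_len:
        return x[:target_len].copy()  # sequence truncation
    # append pad_value at the end until target_len is reached
    return np.pad(x, (0, target_len - len(x)), constant_values=pad_value)

sr = 16000
clip = np.ones(int(5.5 * sr), dtype=np.float32)  # stand-in for a 5.5 s recording
padded = pad_or_truncate(clip, 6 * sr)
truncated = pad_or_truncate(clip, 5 * sr)
print(len(padded), len(truncated))
```

The padded signal could then be passed to `librosa.feature.mfcc` in the same way as the truncated one, so that every file yields a matrix with the same number of frames.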

### 2. Read/Load your data

The code below helps you read your data from your directory [your_TP_data] and extract features with [your_feature_extractor].

At the end, your data will be available in x_data (features) and y_data (labels).

In [None]:
import csv
import numpy as np
# set data_dir to the directory of your data files
# (raw string, so the backslashes are not treated as escape sequences)
data_dir= r"H:\Home\Documents\ProjetIA\Dataset\Dataset/"

# Read the info file to get the list of audio files and their labels
file_list=[]
label_list=[]
with open(data_dir+"info.txt", 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        # The first column contains the file name
        file_list.append(row[0])
        # The last column contains the label (language)
        label_list.append(row[-1])
        
        
# create a dictionary for labels
lang_dic={'EN':0,'FR':1,'AR':2,'JP':3}

# create a list of extracted feature (MFCC) for files
x_data=[]

for audio_file in file_list:
    file_feature = feature_extractor_1(data_dir+audio_file)
    #add extracted feature to dataset 
    x_data.append(file_feature)

# create a list of labels for files
y_data=[]
for lang_label in label_list:
    #convert the label to a value in {0,1,2,3} as the class label
    y_data.append(lang_dic[lang_label])

In [None]:
# Random forest takes an array with at most 2 dimensions, so I cannot use feature_extractor_2
# because the resulting data has 3 dimensions
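
One possible workaround, sketched here with placeholder data rather than real MFCCs, is to flatten each fixed-size MFCC matrix into a single row so that the dataset becomes 2-dimensional. This keeps the frame order but produces many more features per file than the mean/variance summary.

```python
import numpy as np

# Stand-in for the output of feature_extractor_2 on three files:
# each is a (20, 157) MFCC matrix for a 5-second clip
mfcc_per_file = [np.zeros((20, 157)) for _ in range(3)]

x_3d = np.stack(mfcc_per_file)      # shape (3, 20, 157)
x_2d = x_3d.reshape(len(x_3d), -1)  # shape (3, 3140): one row per file
print(x_3d.shape, x_2d.shape)
```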

### 3. Shuffle your data

With the code below, your data (features and corresponding labels) will be shuffled.

In [None]:
import random

# shuffle two lists
temp_list = list(zip(x_data, y_data))
random.shuffle(temp_list)
x_data, y_data = zip(*temp_list)

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x_data,
                                                    y_data, 
                                                    test_size=0.20,
                                                    shuffle=True)

# Train model
#clf.fit(X_train, y_train)

# Predict the test data
#y_pred = clf.predict(X_test)
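
A variation worth considering is a stratified, seeded split, so that each language keeps the same proportion in both subsets and the split is repeatable between runs. The sketch below uses synthetic stand-in data; the value 42 for `random_state` is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 40))   # stand-in features
y_demo = np.repeat([0, 1, 2, 3], 50)  # 50 samples per language

X_tr, X_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=42)

# per-class counts in the training and test subsets
print(np.bincount(y_tr), np.bincount(y_te))
```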

### 4. Build your classifier

Now (almost) everything is ready to build your classifier.

The code below is an example of creating a Random Forest classifier, training it, and calculating its accuracy.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

clf = RandomForestClassifier(max_depth=10)
# with max_depth set to 9 we get 90%
# Train model
clf.fit(X_train, y_train)
# Predict the test data
y_pred = clf.predict(X_test)
# Accuracy is measured on the held-out test set.
# clf.score(x_data, y_data) would instead score data the model was trained on.
print("Accuracy:::",accuracy_score(y_test,y_pred))

Accuracy::: 0.6779661016949152


In [None]:

from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
# Train model
clf.fit(X_train, y_train)
# Predict the test data
y_pred = clf.predict(X_test)
# Accuracy on the held-out test set
print("Accuracy:::",accuracy_score(y_test,y_pred))

Accuracy::: 0.4830508474576271


In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
#clf.fit(x_data, y_data)
# Train model
clf.fit(X_train, y_train)
# Predict the test data
y_pred = clf.predict(X_test)
#print("Accuracy",clf.score(x_data, y_data))
print("Accuracy:::",accuracy_score(y_test,y_pred))

Accuracy::: 0.711864406779661


In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
# Create and train the model (a single fit on the training split is enough)
clf = MLPClassifier(random_state=1, max_iter=300)
clf.fit(X_train, y_train)
# Predict the test data
y_pred = clf.predict(X_test)

print("Accuracy:::",accuracy_score(y_test,y_pred))
# The dataset has to be split so that testing uses data not seen in training
# (here 80% for training and 20% for testing).
# Use y_pred = clf.predict(X_test) with accuracy_score(y_test, y_pred)
# rather than clf.score(x_data, y_data), which scores the training data.


Accuracy::: 0.4491525423728814


### 5. Have you used different data for train and test?

### 6. Find a model with the best accuracy

To find the model with the highest accuracy, the performance of the combinations below should be tested.

1. Compare the two feature extractors
2. Find the best hyperparameters for the models: for example, you can search for "sklearn RandomForestClassifier" and go to [this link](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) to find the RandomForestClassifier hyperparameters (some of them: n_estimators, criterion, max_depth)
3. Compare different classification algorithms


Below is a list of algorithms with hyperparameters that can be tested:

[Gaussian Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB) 

[C-Support Vector Classification](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) kernel: {'linear', 'sigmoid', 'rbf'}

[Random Forest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) n_estimators: {10, 100, 1000} , criterion: {'gini', 'entropy'} , max_features: {'auto', 'sqrt', 'log2'} ('auto' is accepted only by older scikit-learn releases) , bootstrap : {True, False}

[Multi-layer Perceptron classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier) activation: {'tanh', 'relu'} , solver: {'sgd', 'adam'} , hidden_layer_sizes: {(100,10,),(1000,100,)}

#### Be careful 1: Not all arguments (inputs) of an algorithm are hyperparameters to optimize. Read their descriptions! For example, it does not make any sense to optimize random_state or n_jobs.

#### Be careful 2: To find the best value for a hyperparameter, candidate values should be compared while all other variables are kept the same.

#### Be careful 3: The performance (accuracy) of models can only be compared on the test set, not the training set. It is suggested to use k-fold cross-validation.
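
One way to follow these three points together is scikit-learn's `GridSearchCV`, which scores every hyperparameter combination with k-fold cross-validation on the training data only. The sketch below is an illustration, not the required solution: the data is synthetic (`make_classification` stands in for x_data and y_data), the grid is deliberately small, and `random_state` is fixed rather than tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for (x_data, y_data): 4 classes, 40 features
X, y = make_classification(n_samples=400, n_features=40, n_informative=10,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)

# Only the listed hyperparameters vary; everything else stays the same
param_grid = {"n_estimators": [10, 100], "criterion": ["gini", "entropy"]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X_train, y_train)  # 5-fold cross-validation on the training split

print("Best hyperparameters:", search.best_params_)
print("Mean cross-validated accuracy:", search.best_score_)
print("Held-out test accuracy:", search.score(X_test, y_test))
```

The test split is touched only once, at the end, so the reported test accuracy is not influenced by the hyperparameter search.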

### 6+. Supplementary hints

1. The impact of PCA on the results
2. Use an ensemble model (aggregate the results of several algorithms by majority vote)
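
Both hints can be tried with standard scikit-learn components. The following sketch uses synthetic data in place of the real features; the choice of 10 PCA components and of the three base classifiers is arbitrary and only meant as a starting point.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for (x_data, y_data): 4 classes, 40 features
X, y = make_classification(n_samples=400, n_features=40, n_informative=10,
                           n_classes=4, random_state=0)

# Hint 1: reduce the 40 features to 10 principal components before the SVC
svc_pca = make_pipeline(StandardScaler(), PCA(n_components=10), SVC())

# Hint 2: hard voting, i.e. each classifier casts one vote per sample
ensemble = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("nb", GaussianNB()),
                ("svc", make_pipeline(StandardScaler(), SVC()))],
    voting="hard")

for name, model in [("SVC + PCA", svc_pca), ("Voting ensemble", ensemble)]:
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold accuracy
    print(name, "mean accuracy:", scores.mean())
```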

### 7. Impact of dataset size

For your best model, compare the impact of dataset size on accuracy. You have used the whole provided dataset so far; now train and test your model with 50% and 25% of it.
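
A simple way to build the smaller datasets is to draw a random subset of the (feature, label) pairs. The helper `subsample` below is a hypothetical sketch on toy data; with the real data you would pass x_data and y_data, then repeat the split, training and evaluation for each fraction.

```python
import random

def subsample(x_data, y_data, fraction, seed=0):
    """Return a random subset containing `fraction` of the paired samples."""
    pairs = list(zip(x_data, y_data))
    random.Random(seed).shuffle(pairs)  # seeded so the subset is repeatable
    keep = max(1, int(len(pairs) * fraction))
    x_sub, y_sub = zip(*pairs[:keep])
    return list(x_sub), list(y_sub)

# Toy stand-in data: 100 feature vectors with labels in {0, 1, 2, 3}
x_demo = [[i, i + 1] for i in range(100)]
y_demo = [i % 4 for i in range(100)]

for fraction in (1.0, 0.5, 0.25):
    x_sub, y_sub = subsample(x_demo, y_demo, fraction)
    print(fraction, len(x_sub))
```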

### 8. Predict label of Evaluation set

The code below can help you predict a label for each file in the evaluation set and save the results in a file called [YourName_YourModelName_Version].csv



In [None]:
import csv

# set data_dir to the directory of your evaluation files
data_dir= r"H:\Home\Documents\ProjetIA\Dataset\Dataset/"

# Read the info file to get the list of audio files
file_list=[]
with open(data_dir+"1Info.txt", 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        # The first column contains the file name
        file_list.append(row[0])

class2lang_dic={0:"EN",1:"FR",2:"AR",3:"JP"}

# Open the output file once; 'w' avoids duplicated rows when the cell is re-run
with open(data_dir+"[SanogoKassoum_SVC_Version].csv",'w') as out_file:
    for test_sample in file_list:
        # use the same feature extractor the classifier was trained with
        test_sample_feature=feature_extractor_1(data_dir+test_sample)
        predicted=class2lang_dic[clf.predict([test_sample_feature])[0]]
        print(f'Predicted class: "{predicted}"')
        # save the predicted label next to the file name
        out_file.write(f"{test_sample},{predicted}\n")

# Congratulations! You are ready to submit your results