In [5]:
from Utils import *
from classifier.CNNClassifier import CNNClassifier
from classifier.Classifier import Classifier
from classifier.ConstantClassifier import ConstantClassifier
from classifier.LinearClassifier import LinearClassifier
from classifier.RFClassifier import RFClassifier
from classifier.SNNClassifier import SNNClassifier
from Settings import *
from sklearn.model_selection import train_test_split
from main import run_for_classifier

SAVE=True
LOAD=False

### Preliminary information
##### Disclaimer : The neural network results may vary slightly when the models are retrained, due to some randomness in tensorflow-gpu / cuDNN.

The models were trained on a computer with an i7-6700K, 16GB RAM and a GTX 1070.    
The main libraries used are Keras with tensorflow backend, librosa, scipy.    
###### If you want to avoid retraining every classifier, set LOAD=True above. The RF with 100 estimators couldn't be pushed to the repository due to its file size though.

Let's begin by explaining how the performance of the classifiers is measured. There are two main metrics :    
- The sample accuracy : the accuracy when predicting the label of a single sample of a file. For the CNN, the unit is a window of 10 samples.
- The file accuracy : the accuracy when predicting the label of a whole file. The file label is simply the label predicted most often across all the samples of the file.
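
To make the two metrics concrete, here is a minimal sketch of how they could be computed from per-sample predictions. The function names and the 0/1 label encoding are assumptions for illustration; they are not taken from the project's `run_for_classifier`.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def file_label(sample_predictions: Sequence[int]) -> int:
    """Return the label predicted most often among a file's samples."""
    # most_common(1) returns [(label, count)]; on a tie, the label seen first wins.
    return Counter(sample_predictions).most_common(1)[0][0]


def accuracies(files: List[Tuple[int, Sequence[int]]]) -> Dict[str, float]:
    """Compute both metrics from (true_label, per-sample predictions) pairs."""
    n_samples = sum(len(preds) for _, preds in files)
    correct_samples = sum(p == label for label, preds in files for p in preds)
    correct_files = sum(file_label(preds) == label for label, preds in files)
    return {
        "sample": correct_samples / n_samples,
        "file": correct_files / len(files),
    }


# Toy data with a hypothetical encoding (0 and 1 stand for the two classes).
toy = [
    (1, [1, 1, 0, 1]),  # 3 of 4 samples correct, file correct
    (0, [0, 1, 1]),     # 1 of 3 samples correct, file wrong
    (0, [0, 0, 0]),     # 3 of 3 samples correct, file correct
]
result = accuracies(toy)
print(result)
```

Note that the two metrics can diverge: a file is counted as correct as soon as more than half of its samples are, so the file accuracy can be much higher than the sample accuracy.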

The files were cut into samples of 20 MFCC features each using the librosa library. This was sufficient to train most of the classifiers. No preprocessing was done on the audio files or the MFCC features except normalizing them.    
Some more preprocessing was done to allow the files to be fed to a CNN (cutting the files into windows).
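
The exact normalization is not detailed here. As one common possibility, the sketch below standardizes each of the 20 features to zero mean and unit variance, using statistics computed on the training data only. Both the choice of standardization and the `standardize` helper are assumptions, and the random arrays merely stand in for real MFCC features.

```python
import numpy as np


def standardize(train: np.ndarray, test: np.ndarray):
    """Scale each feature column using the mean and std of the training data."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    # Guard against constant columns to avoid dividing by zero.
    std = np.where(std == 0, 1.0, std)
    return (train - mean) / std, (test - mean) / std


# Random stand-ins for (n_samples, 20) MFCC matrices.
rng = np.random.default_rng(0)
train_features = rng.normal(loc=5.0, scale=3.0, size=(1000, 20))
test_features = rng.normal(loc=5.0, scale=3.0, size=(200, 20))

train_norm, test_norm = standardize(train_features, test_features)
print(train_norm.mean(axis=0).round(6))
print(train_norm.std(axis=0).round(6))
```

Computing the statistics on the training set alone keeps information from the test set out of the preprocessing.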

The files were split into a train set and a test set using 80/20 proportions. There are 2703 files in total.

In [2]:
features_with_label = files_to_features_with_labels(list_files(AUDIO_FILES_DIR))
train_set, test_set = train_test_split(features_with_label, random_state=SEED, train_size=TRAIN_PERCENT, test_size=1-TRAIN_PERCENT)

def test_classifier(classifier : Classifier) -> None:
    """
    Wrapper for method run_for_classifier.
    :param classifier: The classifier to test
    """
    one_d = not isinstance(classifier, CNNClassifier)
    run_for_classifier(classifier, test_set=test_set, train_set=train_set, save=SAVE, load=LOAD, one_d=one_d)

### Constant Classifier
This classifier always predicts male.

In [3]:
test_classifier(ConstantClassifier())

Finished loading/creating features
Using classifier ConstantClassifier
Training ConstantClassifier
Saved ConstantClassifier
Predicting on files...
Predicting on samples...
Test accuracy - files : 0.4824399260628466
Test accuracy - samples : 0.5065437497537723


This gives us some information on the files as well as a baseline :     
- The files are distributed nearly uniformly between male and female
- When taking duration into account, the files are distributed even more closely half male / half female

This means that there is no real need to balance the classes between the train and test sets, assuming the files are distributed randomly between them.
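
The reasoning above relies on the fact that the accuracy of a constant classifier is exactly the proportion of the predicted class, counted per file for the file accuracy and per sample (that is, weighted by duration) for the sample accuracy. A small sketch with made-up numbers, where `constant_baseline` is a hypothetical helper:

```python
from typing import List, Tuple


def constant_baseline(files: List[Tuple[str, int]], constant: str) -> Tuple[float, float]:
    """Return (file accuracy, sample accuracy) of always predicting `constant`.

    Each entry of `files` is (true_label, number_of_samples_in_file).
    """
    total_files = len(files)
    total_samples = sum(n for _, n in files)
    matching_files = sum(1 for label, _ in files if label == constant)
    matching_samples = sum(n for label, n in files if label == constant)
    return matching_files / total_files, matching_samples / total_samples


# Made-up corpus: one long male file and two short female files.
toy_files = [("male", 300), ("female", 100), ("female", 100)]
file_acc, sample_acc = constant_baseline(toy_files, constant="male")
print(file_acc, sample_acc)
```

In this toy corpus the two proportions differ a lot; in the real data set they are both close to 50%, which is what the output of the constant classifier shows.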

### Linear SVC
This classifier is an SVM with a linear kernel. It can serve as another simple baseline.

In [4]:
test_classifier(LinearClassifier(c=1))

Finished loading/creating features
Using classifier LinearClassifier - C 1
Training LinearClassifier - C 1
Saved LinearClassifier - C 1
Predicting on files...
Predicting on samples...
Test accuracy - files : 0.822550831792976
Test accuracy - samples : 0.7052279084426585


Here, the linear SVC gives mediocre results: a file accuracy of only 82% and an even lower sample accuracy of 70%.    
Increasing the C parameter to 1000 keeps the classifier from converging, or at least makes convergence very slow: when tested earlier, it had not converged after 10000 iterations.    
It isn't surprising that a linear classifier struggles on this problem, as it is probably non-linear. Better preprocessing and feature selection could improve the result.

In [5]:
test_classifier(LinearClassifier(c=1000))

Finished loading/creating features
Using classifier LinearClassifier - C 1000
Training LinearClassifier - C 1000
Saved LinearClassifier - C 1000
Predicting on files...
Predicting on samples...
Test accuracy - files : 0.5508317929759704
Test accuracy - samples : 0.5846669030453453


### RandomForestClassifier
Let's test random forests, which tend to be quite accurate when the features are well defined and the label depends strongly on them. This should be the case for the MFCC features and the speaker's gender.

In [6]:
test_classifier(RFClassifier(n_estimators=10))

Finished loading/creating features
Using classifier RFClassifier - n_est 10 - max_depth None
Training RFClassifier - n_est 10 - max_depth None
Saved RFClassifier - n_est 10 - max_depth None
Predicting on files...
Predicting on samples...
Test accuracy - files : 0.9981515711645101
Test accuracy - samples : 0.8465665996927078


Here, we can already see that the file accuracy is nearly perfect (in fact, only 1 file is mispredicted). The sample accuracy could be better though.    
Let's try to improve it by increasing the number of trees. We may also reach 100% file accuracy by doing this.

In [7]:
test_classifier(RFClassifier(n_estimators=100))

Finished loading/creating features
Using classifier RFClassifier - n_est 100 - max_depth None
Training RFClassifier - n_est 100 - max_depth None
Saved RFClassifier - n_est 100 - max_depth None
Predicting on files...
Predicting on samples...
Test accuracy - files : 0.9981515711645101
Test accuracy - samples : 0.8814324547925777


It doesn't seem to improve much and isn't really worth it given the increase in memory and time consumption.    
Let's try with 1 and 5 trees to see if we can go lower than 10 while preserving the accuracy :

In [8]:
test_classifier(RFClassifier(n_estimators=5))
print("\n")
test_classifier(RFClassifier(n_estimators=1))

Finished loading/creating features
Using classifier RFClassifier - n_est 5 - max_depth None
Training RFClassifier - n_est 5 - max_depth None
Saved RFClassifier - n_est 5 - max_depth None
Predicting on files...
Predicting on samples...
Test accuracy - files : 0.9981515711645101
Test accuracy - samples : 0.8261513611472245


Finished loading/creating features
Using classifier RFClassifier - n_est 1 - max_depth None
Training RFClassifier - n_est 1 - max_depth None
Saved RFClassifier - n_est 1 - max_depth None
Predicting on files...
Predicting on samples...
Test accuracy - files : 0.9926062846580407
Test accuracy - samples : 0.746507505023047


Surprisingly, we get a really good result even with a single tree. The sample accuracy drops quite a bit, as expected, but the file accuracy is still very good. This suggests that some features are highly relevant to finding the speaker's gender.   
With 5 trees, the file accuracy is the same as with 10 trees.    
The RandomForest seems very well suited to this problem, as expected. What was less expected is that such performance can be obtained with a very small number of estimators.    
RandomForest is therefore a very efficient and easy way to solve this problem.

### Shallow Neural Network
Let's try using a shallow neural network with only a single fully-connected hidden layer.    
In theory, such a simple neural net with enough hidden units can approximate any continuous function arbitrarily well (universal approximation theorem), but in practice a deep neural network usually yields better results than a wide one.
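
The parameter counts reported by the Keras model summaries in this section can be checked by hand. The sketch below assumes the layout shown in those summaries (20 input features, one dense hidden layer, a PReLU with one learnable slope per unit, and a single output unit); `snn_param_count` is a hypothetical helper, not part of the project.

```python
def snn_param_count(num_units: int, num_features: int = 20) -> int:
    """Count trainable parameters of Dense -> PReLU -> Dense(1)."""
    dense_hidden = num_features * num_units + num_units  # weights + biases
    prelu = num_units                                    # one slope per hidden unit
    dense_out = num_units + 1                            # weights + bias of the output unit
    return dense_hidden + prelu + dense_out


for units in (64, 128, 256):
    print(units, snn_param_count(units))
# 64 1473
# 128 2945
# 256 5889
```

The count grows linearly with the number of hidden units (23 parameters per unit, plus 1), which is why even the widest network here stays very small.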

In [9]:
test_classifier(SNNClassifier(num_units=64, verbose=1))

Finished loading/creating features
Using classifier SNNClassifier - units 64
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 64)                1344      
_________________________________________________________________
p_re_lu_1 (PReLU)            (None, 64)                64        
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 65        
_________________________________________________________________
activation_1 (Activation)    (None, 1)                 0         
Total params: 1,473
Trainable params: 1,473
Non-trainable params: 0
_________________________________________________________________
None
Training SNNClassifier - units 64
Train on 384480 samples, validate on 96121 samples
Epoch 1/300
Epoch 2/300
Epoch 3/300
Epoch 4/300
Epoch 5/300
Epoch 6/300
Epoch 7/300
Epoch 8/300
Epoch 9/3

Epoch 52/300
Epoch 53/300
Epoch 54/300
Epoch 55/300
Epoch 56/300
Epoch 57/300
Epoch 58/300
Epoch 59/300
Epoch 60/300
Epoch 61/300
Epoch 62/300
Epoch 63/300
Epoch 64/300
Epoch 65/300
Epoch 66/300
Epoch 67/300
Epoch 68/300
Epoch 69/300
Epoch 70/300
Epoch 71/300

Epoch 00071: ReduceLROnPlateau reducing learning rate to 0.00010000000474974513.
Epoch 72/300
Epoch 73/300
Epoch 74/300
Epoch 75/300
Epoch 76/300
Epoch 77/300
Epoch 78/300
Epoch 79/300
Epoch 80/300
Epoch 81/300
Epoch 82/300
Epoch 83/300
Epoch 84/300
Epoch 85/300
Epoch 86/300
Epoch 87/300
Epoch 88/300
Epoch 89/300

Epoch 00089: ReduceLROnPlateau reducing learning rate to 1.0000000474974514e-05.
Epoch 90/300
Epoch 91/300
Epoch 92/300
Epoch 93/300
Epoch 94/300
Epoch 00094: early stopping
Saved SNNClassifier - units 64
Predicting on files...
Predicting on samples...
Test accuracy - files : 0.9981515711645101
Test accuracy - samples : 0.8652956703305362


We already get a nearly perfect accuracy for files, which is very impressive with only 64 units. This probably means that the function to approximate / the problem is not really complex, but still non-linear.    
Let's try to improve the accuracy by increasing the number of units in the dense layer.

In [10]:
test_classifier(SNNClassifier(num_units=128, verbose=1))

Finished loading/creating features
Using classifier SNNClassifier - units 128
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_3 (Dense)              (None, 128)               2688      
_________________________________________________________________
p_re_lu_2 (PReLU)            (None, 128)               128       
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 129       
_________________________________________________________________
activation_2 (Activation)    (None, 1)                 0         
Total params: 2,945
Trainable params: 2,945
Non-trainable params: 0
_________________________________________________________________
None
Training SNNClassifier - units 128
Train on 384480 samples, validate on 96121 samples
Epoch 1/300
Epoch 2/300
Epoch 3/300
Epoch 4/300
Epoch 5/300
Epoch 6/300
Epoch 7/300
Epoch 8/300
Epoch 9

Epoch 52/300
Epoch 53/300
Epoch 54/300
Epoch 55/300
Epoch 56/300
Epoch 57/300
Epoch 58/300
Epoch 59/300
Epoch 60/300
Epoch 61/300
Epoch 62/300
Epoch 63/300
Epoch 64/300
Epoch 65/300
Epoch 66/300
Epoch 67/300
Epoch 68/300
Epoch 69/300
Epoch 70/300
Epoch 71/300
Epoch 72/300
Epoch 73/300
Epoch 74/300
Epoch 75/300
Epoch 76/300
Epoch 77/300
Epoch 78/300
Epoch 79/300

Epoch 00079: ReduceLROnPlateau reducing learning rate to 0.00010000000474974513.
Epoch 80/300
Epoch 81/300
Epoch 82/300
Epoch 83/300
Epoch 84/300
Epoch 85/300
Epoch 86/300
Epoch 87/300
Epoch 88/300
Epoch 89/300
Epoch 90/300

Epoch 00090: ReduceLROnPlateau reducing learning rate to 1.0000000474974514e-05.
Epoch 91/300
Epoch 92/300
Epoch 93/300
Epoch 94/300
Epoch 95/300
Epoch 00095: early stopping
Saved SNNClassifier - units 128
Predicting on files...
Predicting on samples...
Test accuracy - files : 1.0
Test accuracy - samples : 0.87788677461293


As expected, the sample accuracy is about 1 percentage point better, and the file accuracy is now perfect.

In [11]:
test_classifier(SNNClassifier(num_units=256, verbose=1))

Finished loading/creating features
Using classifier SNNClassifier - units 256
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_5 (Dense)              (None, 256)               5376      
_________________________________________________________________
p_re_lu_3 (PReLU)            (None, 256)               256       
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 257       
_________________________________________________________________
activation_3 (Activation)    (None, 1)                 0         
Total params: 5,889
Trainable params: 5,889
Non-trainable params: 0
_________________________________________________________________
None
Training SNNClassifier - units 256
Train on 384480 samples, validate on 96121 samples
Epoch 1/300
Epoch 2/300
Epoch 3/300
Epoch 4/300
Epoch 5/300
Epoch 6/300
Epoch 7/300
Epoch 8/300
Epoch 9

Epoch 52/300
Epoch 53/300
Epoch 54/300
Epoch 55/300
Epoch 56/300
Epoch 57/300
Epoch 58/300
Epoch 59/300
Epoch 60/300
Epoch 61/300
Epoch 62/300
Epoch 63/300
Epoch 64/300
Epoch 65/300
Epoch 66/300
Epoch 67/300
Epoch 68/300
Epoch 69/300
Epoch 70/300
Epoch 71/300
Epoch 72/300
Epoch 73/300
Epoch 74/300
Epoch 75/300
Epoch 76/300
Epoch 77/300
Epoch 78/300
Epoch 79/300
Epoch 80/300
Epoch 81/300

Epoch 00081: ReduceLROnPlateau reducing learning rate to 1.0000000474974514e-05.
Epoch 82/300
Epoch 83/300
Epoch 84/300
Epoch 85/300
Epoch 86/300
Epoch 00086: early stopping
Saved SNNClassifier - units 256
Predicting on files...
Predicting on samples...
Test accuracy - files : 1.0
Test accuracy - samples : 0.8819524878855928


We gain a very marginal increase in sample accuracy. File accuracy is still perfect.

I decided not to add any Dropout or regularization to the network because, as we can see in the results, there is little to no overfitting.

### Convolutional Neural Network
The CNN classifier was more difficult to implement.    
First, its input differs from that of the other classifiers. The network needs 2D inputs of identical shape, but the number of samples per file varies a lot.    
- The first attempt was to pad the smaller files to the size of the biggest one with all-zero features. Because of the difference in size (the smallest file has 46 samples, the biggest around 1200), most of the data was empty and the network couldn't learn anything.   
- The next attempt was to cut the files into smaller windows of 46 samples (padding the last window of a file when needed). The next cell is the (reduced) output of a run using this strategy.    
- The current approach is to cut the files into windows of 10 samples. We'll see the result later.
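
The windowing strategy can be sketched as follows. This is an illustration only: the helper name `cut_into_windows` and the choice to zero-pad the last window are assumptions, and the project's actual preprocessing may handle the remainder differently.

```python
import numpy as np


def cut_into_windows(features: np.ndarray, window: int = 10) -> np.ndarray:
    """Split a (n_samples, n_features) array into (n_windows, window, n_features).

    The last window is zero-padded when n_samples is not a multiple of `window`.
    """
    n_samples, n_features = features.shape
    n_windows = -(-n_samples // window)  # ceiling division
    padded = np.zeros((n_windows * window, n_features), dtype=features.dtype)
    padded[:n_samples] = features
    return padded.reshape(n_windows, window, n_features)


# A stand-in for the smallest file: 46 samples of 20 MFCC features.
mfcc = np.ones((46, 20), dtype=np.float32)
windows = cut_into_windows(mfcc)
print(windows.shape)  # (5, 10, 20)

# Add a trailing channel axis to match a (10, 20, 1) image-like CNN input.
cnn_input = windows[..., np.newaxis]
print(cnn_input.shape)  # (5, 10, 20, 1)
```

With small windows, even the shortest file yields several inputs and only a small fraction of the data is padding.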

As the purpose of this classifier is to be a deep neural network, I decided to use two convolutional layers along with two max pooling layers, followed by two dense layers and an output layer. The whole network can be seen in the following cell.    
Because this network has so many hyperparameters (parameters of each layer, layout of the network, etc.), most of them were chosen heuristically, based on a previous project.    
About design decisions :    
- Batch normalization is used for faster and better training
- PReLU activation is used instead of ReLU for potentially better results
- Glorot normal initialization is used; the choice between the uniform and normal variants seems to be a matter of preference, with no real impact on performance
- Dropout is applied after the max pooling and dense layers to reduce overfitting
- Kernel regularizers are used on the convolutional and dense layers to reduce overfitting
- (Both dropout and regularizers were added after checking that the network could learn the training set perfectly)
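
The output shapes and parameter counts of the first layers in the model summary printed further down can be reproduced with simple arithmetic. The 3x3 kernel, the "valid" padding and the 2x2 pooling with stride 1 are not stated in the text; they are inferred here from the shapes and counts in that summary, so treat them as assumptions.

```python
from typing import Tuple

Shape = Tuple[int, int, int]  # (height, width, channels)


def conv2d_valid(shape: Shape, kernel: int, filters: int) -> Tuple[Shape, int]:
    """Output shape and parameter count of a square 'valid' convolution."""
    h, w, c = shape
    out = (h - kernel + 1, w - kernel + 1, filters)
    params = (kernel * kernel * c + 1) * filters  # weights + one bias per filter
    return out, params


def max_pool_valid(shape: Shape, pool: int, stride: int) -> Shape:
    """Output shape of a square 'valid' max pooling."""
    h, w, c = shape
    return ((h - pool) // stride + 1, (w - pool) // stride + 1, c)


conv_shape, conv_params = conv2d_valid((10, 20, 1), kernel=3, filters=32)
prelu_params = conv_shape[0] * conv_shape[1] * conv_shape[2]  # one slope per activation
batch_norm_params = 4 * conv_shape[2]  # gamma, beta, moving mean, moving variance
pool_shape = max_pool_valid(conv_shape, pool=2, stride=1)

print(conv_shape, conv_params)  # (8, 18, 32) 320
print(prelu_params)             # 4608
print(batch_norm_params)        # 128
print(pool_shape)               # (7, 17, 32)
```

One thing this makes visible is that the unshared PReLU slopes (4608) far outnumber the convolution weights (320) in the first block.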

As expected, the CNN trained on windows of 46 samples got a really good score for both file and sample accuracy, but surprisingly it was not perfect like the SNN and was even a bit worse than the random forest. Moreover, its sample accuracy cannot really be compared with that of the previous classifiers, as each CNN input carries more context (46 times more information than a single sample).  
The next run improves the file score by reducing the window size.

In [4]:
test_classifier(CNNClassifier(verbose=1))

Finished loading/creating features
Using classifier CNNClassifier
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
batch_normalization_4 (Batch (None, 10, 20, 1)         4         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 8, 18, 32)         320       
_________________________________________________________________
p_re_lu_5 (PReLU)            (None, 8, 18, 32)         4608      
_________________________________________________________________
batch_normalization_5 (Batch (None, 8, 18, 32)         128       
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 7, 17, 32)         0         
_________________________________________________________________
dropout_5 (Dropout)          (None, 7, 17, 32)         0         
_________________________________________________________________
conv2d_4 (

Epoch 38/300
Epoch 39/300
Epoch 40/300
Epoch 41/300
Epoch 42/300
Epoch 43/300
Epoch 44/300
Epoch 45/300
Epoch 46/300
Epoch 47/300
Epoch 48/300
Epoch 49/300
Epoch 50/300
Epoch 51/300
Epoch 52/300
Epoch 53/300
Epoch 54/300
Epoch 55/300
Epoch 56/300
Epoch 57/300
Epoch 58/300
Epoch 59/300
Epoch 60/300
Epoch 61/300
Epoch 62/300
Epoch 63/300
Epoch 64/300

Epoch 00064: ReduceLROnPlateau reducing learning rate to 0.00010000000474974513.
Epoch 65/300
Epoch 66/300
Epoch 67/300
Epoch 68/300
Epoch 69/300
Epoch 70/300
Epoch 71/300
Epoch 72/300
Epoch 73/300
Epoch 74/300
Epoch 75/300
Epoch 76/300
Epoch 77/300
Epoch 78/300
Epoch 79/300
Epoch 80/300
Epoch 81/300
Epoch 82/300
Epoch 83/300
Epoch 84/300
Epoch 85/300
Epoch 86/300
Epoch 87/300
Epoch 88/300
Epoch 89/300
Epoch 90/300
Epoch 91/300
Epoch 92/300
Epoch 93/300
Epoch 94/300
Epoch 95/300


Epoch 96/300
Epoch 97/300
Epoch 98/300
Epoch 99/300
Epoch 100/300
Epoch 101/300
Epoch 102/300
Epoch 103/300
Epoch 104/300
Epoch 105/300
Epoch 106/300
Epoch 107/300
Epoch 108/300
Epoch 109/300
Epoch 110/300
Epoch 111/300
Epoch 112/300
Epoch 113/300
Epoch 114/300
Epoch 115/300
Epoch 116/300
Epoch 117/300
Epoch 118/300
Epoch 119/300
Epoch 120/300
Epoch 121/300
Epoch 122/300
Epoch 123/300
Epoch 124/300
Epoch 125/300
Epoch 126/300
Epoch 127/300
Epoch 128/300
Epoch 129/300
Epoch 130/300

Epoch 00130: ReduceLROnPlateau reducing learning rate to 1.0000000474974514e-05.
Epoch 131/300
Epoch 132/300
Epoch 133/300
Epoch 134/300
Epoch 135/300
Epoch 136/300
Epoch 137/300
Epoch 138/300
Epoch 139/300
Epoch 140/300
Epoch 141/300
Epoch 142/300
Epoch 143/300
Epoch 144/300
Epoch 145/300
Epoch 146/300
Epoch 147/300
Epoch 148/300
Epoch 149/300
Epoch 150/300
Epoch 151/300
Epoch 152/300
Epoch 153/300


Epoch 154/300
Epoch 155/300
Epoch 156/300
Epoch 157/300
Epoch 158/300
Epoch 159/300
Epoch 160/300
Epoch 161/300
Epoch 162/300
Epoch 163/300

Epoch 00163: ReduceLROnPlateau reducing learning rate to 1.0000000656873453e-06.
Epoch 164/300
Epoch 165/300
Epoch 166/300
Epoch 167/300
Epoch 168/300
Epoch 169/300
Epoch 170/300
Epoch 171/300
Epoch 172/300
Epoch 173/300
Epoch 174/300
Epoch 175/300
Epoch 176/300
Epoch 177/300
Epoch 178/300
Epoch 179/300
Epoch 00179: early stopping
Saved CNNClassifier
Predicting on files...
Predicting on samples...
Test accuracy - files : 0.9981515711645101
Test accuracy - samples : 0.9319990727146279


Now, using windows of size 10, we get the same file accuracy as the RandomForest (1 mispredicted file), which is better than with a window size of 46. This is probably mainly due to averaging over more samples : as the per-sample accuracy is still very high, taking the majority over more predictions gives a very good score. In theory, as long as the per-sample accuracy is above 50%, the errors are independent and the number of samples per file is very high, the file predictions should be perfect. This is not always the case in practice, because the number of samples per file is limited. 
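
The majority-vote argument can be quantified under a simplifying assumption: if every prediction in a file is correct independently with probability `p`, the probability that the majority is correct follows a binomial distribution. Real samples from the same file are correlated, so this is an optimistic sketch rather than a model of the actual classifiers; `majority_vote_accuracy` is a hypothetical helper.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that strictly more than half of n independent votes are correct."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Per-vote accuracy of 70%, with an increasing (odd) number of votes per file.
for n in (1, 3, 11, 101):
    print(n, round(majority_vote_accuracy(0.7, n), 4))
```

With `p` above 0.5 the file-level accuracy climbs toward 1 as the number of votes grows, while with `p` exactly 0.5 and an odd number of votes it stays at 0.5.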


100% file accuracy should be obtainable by tuning the parameters or changing the layout.      

Compared to the other classifiers, however, it is slower to train and more memory hungry. It therefore seems a bit overkill for this task, given that a simple neural net with a single hidden layer obtains better file predictions.    

### Dumps size
If we compare the classifiers on performance relative to dump size, the SNN is easily the best one: the 256-unit network weighs only 96 KB, compared to 12 MB for the 10-estimator RandomForest and 23 MB for the CNN. The linear SVC weighs only 1 KB, but has bad results.

### Other considerations
- I've also added a cross-validation method to test the classifiers on samples, but due to the time it takes to execute it, I decided to leave it out.
- I have considered data augmentation, but given the results we've already obtained, this isn't at all necessary for this problem and would only be wasted time.

### Conclusion

In conclusion, we can rate the classifiers like this : 
- The winner is clearly the SNN. It is very simple to design, its file size is very small, and it gets a perfect file accuracy.
- The second one is the RandomForest. There is essentially no design work, the file size is moderately high but still very manageable for a low number of trees, and it gets a nearly perfect result.
- The third is the CNN. The design complexity is high due to the number of parameters and layouts to choose from, and the file size is higher than the others, but it gets a nearly perfect score. It is totally unnecessary to implement such a complex network for this problem though.
- The fourth is the linear SVC: even though the design complexity and file size are negligible, the results are bad.
- For obvious reasons, the constant classifier is the last one.

In short, a more complex model is not always better than a simple one.