# System for human activity detection using smartphone sensor data

**Author: Miguel Zabaleta (100463947)**

We will begin by describing the strategies that will be implemented and the reasoning behind them.

<br>

**Part I: Original variables**

First, I will try to get the best results just by using the features that we are initially given.

In this section, we will make the following implementations (in principle, using default hyperparameters):

**1.** Models **without an id column**, merging every sequence into a single dataset.

  - 1.1. Gaussian Mixture Model

  - 1.2. Classifiers
    - Single models
    - Ensemble of best models
    - Ensemble of best models with tweaked parameters

**2.** Models **with an id column**, indicating the person that is performing the actions (merging every sequence into a single dataset).  
This could be a useful variable for the model to have since there is a lot of variability in the actions performed by different people (way of walking, for instance).

- 2.1 Gaussian Mixture Model
- 2.2 Classifiers (single models)

**3.** Make **one model per sequence** (per person), and predict the test observations as the **majority class predicted among all the models**.  
The reasoning is that if we do one model per sequence, we are avoiding the variability between the people, so maybe this way the individual model will be very good at predicting for their participant, and thus by selecting the majority class, the predictions for the test set (considered as different people than in the training) will also be very accurate.

<br>
<br>

**Part II: MFCC variables**

Secondly, let us advance that these implementations did not provide excellent results (around 0.8 accuracy).

This can be due to the lack of relevant variables, as we are trying to differentiate between tasks with only 6 variables, where there is already a lot of variability in the movements and also overlap.

To try to get better features, we will extract the **MFCC coefficients** and use them as our new variables, developing a similar implementation as before.

For this implementation, we will try the following models:

- 1. Gaussian Mixture Model
- 2. Classifiers: ensemble of best models, trying **number of MFCC coefficients** from 5, 10, ..., 45

<br>

The third implementation regards selecting the **best combination of features** from which to construct the MFCC coefficients out of the 3 accelerometer variables and the 3 gyroscope variables.

The thinking behind this is because getting 45 MFCC for every 6 variables didn't provide excellent results.  
This may be because there are too many variables with high **multicolliniearity**, which would make it difficult for the models to differentiate between the signals.

Maybe, we can find a group of 2 or 3 variables from which to construct the MFCC, and the right amount of MFCC too, which will provide excellent results after all.

- 3. Selecting best combination of features (63 features combinations)
  - 3.1. Gaussian Mixture Model
  - 3.2. Classifiers: ensemble of best models, trying **number of MFCC coefficients** for 5, 10, 15, 20


Now, let's begin by loading a set of packages and models.

In [2]:
import time
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats


from sklearn import metrics
from sklearn.metrics import accuracy_score,confusion_matrix
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, HalvingGridSearchCV, HalvingRandomSearchCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, KFold, train_test_split

from sklearn.preprocessing import StandardScaler


from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier,
                            GradientBoostingClassifier, RandomForestClassifier, HistGradientBoostingClassifier,
                             VotingRegressor, VotingClassifier)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.mixture import GaussianMixture
from sklearn.svm import LinearSVC, NuSVC

In [3]:
import scipy.io as sio
ar_data = sio.loadmat('AR_database.mat', verify_compressed_data_integrity=False)
data_train = ar_data['data_train'][:,0]
label_train = ar_data['label_train'][:,0]
data_test = ar_data['data_test'][:,0]
label_test = ar_data['label_test'][:,0]

The models we will implement are the following:

- Gaussian Mixture: as it has worked well in the past (using MFCC, but maybe it also works with raw signals)
-  Linear Discriminant Analysis
-  Quadratic Discriminant Analysis
-    AdaBoost 
-    Bagging  
-    Extra Trees
-    Random Forest 
-    Hist Gradient Boosting
-    LinearSVC

# Part 1. Original variables

## 1. All observations together without id column

First, we need to merge the sequences into a single dataset.

In [None]:
X = data_train[0].T
Y = label_train[0].T

for k in range(1,8):
  X = np.concatenate((X,data_train[k].T))
  Y = np.concatenate((Y,label_train[k].T))

X_real_test = data_test[0].T
Y_real_test = label_test[0].T
for k in range(3):
  X_real_test = np.concatenate((X_real_test,data_test[k].T))
  Y_real_test = np.concatenate((Y_real_test,label_test[k].T))

### 1.1. Gaussian Mixture model

In [None]:
# convert datasets to dataframes for ease in implementation
merge_train = np.append(X,Y,axis=1)
merge_test = np.append(X_real_test,Y_real_test,axis=1)
merge_train = pd.DataFrame(merge_train, columns = ['x1','y1','z1','x2','y2','z2','label'])
merge_test = pd.DataFrame(merge_test, columns = ['x1','y1','z1','x2','y2','z2','label'])

In [None]:
merge_train.head()

Unnamed: 0,x1,y1,z1,x2,y2,z2,label
0,1.012817,-0.123217,0.102934,0.030191,0.066014,0.022859,4.0
1,1.022028,-0.124004,0.102102,0.035688,0.07485,0.01325,4.0
2,1.02368,-0.125767,0.102814,0.047097,0.052343,0.002553,4.0
3,1.017746,-0.127361,0.109386,0.050545,0.049867,0.004325,4.0
4,1.016417,-0.125868,0.102473,0.047686,0.058189,0.017189,4.0


In [None]:
# train one model per label
classes = np.unique(Y)
nclasses = len(classes)

models = []
for i in range(nclasses):
  data_train_class = merge_train.loc[merge_train['label'] == classes[i], ['x1','y1','z1','x2','y2','z2']]
  gm = GaussianMixture(n_components=8, 
                        covariance_type='diag',
                        random_state=100463947).fit(data_train_class)
  models.append(gm)

In [None]:
# predict based on highest loglike score
scores = []
for gmm in models:
  loglike = gmm.score_samples(merge_test.iloc[:,[0,1,2,3,4,5]])
  scores.append(loglike)
pred_level = np.argmax(scores, axis=0)+1
print('accuracy score:',np.round(accuracy_score(merge_test['label'],pred_level),2)) # 0.73
print('confusion matrix:','\n',confusion_matrix(merge_test['label'],pred_level))

accuracy score: 0.73
confusion matrix: 
 [[15192     0   870     0     0]
 [    0 15069   992  3807   193]
 [    0   192 20103   306 12359]
 [    0  4004   257 15403   421]
 [    0   206  4061   430 11527]]


As we can see, in the whole dataset we achieve an **accuracy of 0.73**.  

We can see that the easiest class to predict is laying (which seems reasonable), whereas climbing stairs is the hardest for this model to get right.

Now, we will repeat the same process but for every test sequence.

In [None]:
for k in range(3):
  test_seq = pd.DataFrame(data_test[k].T, columns = ['x1','y1','z1','x2','y2','z2'])
  scores = []
  for gmm in models:
    loglike = gmm.score_samples(test_seq)
    scores.append(loglike)
  pred_level = np.argmax(scores, axis=0)+1
  print('accuracy score:',np.round(accuracy_score(label_test[k].T,pred_level),2)) # 0.72, 0.72, 0.79
  print('confusion matrix:','\n',confusion_matrix(label_test[k].T,pred_level))

accuracy score: 0.72
confusion matrix: 
 [[3326    0    1    0    0]
 [   0 3950  223 1191   75]
 [   0   51 4827   57 2713]
 [   0 1844    9 3751  124]
 [   0   86 1031  184 3339]]
accuracy score: 0.72
confusion matrix: 
 [[3997    0  867    0    0]
 [   0 3764  545  782   29]
 [   0   61 6301  126 4456]
 [   0   22   28 4058   84]
 [   0   19 1016   33 2388]]
accuracy score: 0.79
confusion matrix: 
 [[4543    0    1    0    0]
 [   0 3405    1  643   14]
 [   0   29 4148   66 2477]
 [   0  294  211 3843   89]
 [   0   15  983   29 2461]]


We can see that we get **0.72, 0.72, 0.79 respectives accuracies**, with similar difficulties.

### 1.2. Classification algorithms

Now we will try each of the mentioned classifiers, evaluating the performance on the whole test set, and on the single sequences.

In [None]:
classifiers = [
    LinearDiscriminantAnalysis(),
    QuadraticDiscriminantAnalysis(),
    AdaBoostClassifier(random_state=100463947), 
    BaggingClassifier(random_state=100463947), 
    ExtraTreesClassifier(random_state=100463947),
    RandomForestClassifier(random_state=100463947), 
    HistGradientBoostingClassifier(random_state=100463947),
    LinearSVC(random_state=100463947)
    ]

i=0
for clf in classifiers:
    print('model',str(clf))
    
    clf.fit(X,Y.ravel())
    pred = clf.predict(X_real_test)
    print("accuracy over all sequences:",round(metrics.accuracy_score(Y_real_test,pred),3))
    print(metrics.confusion_matrix(Y_real_test, pred))

    for k in range(3):
      y_test_pred = clf.predict(data_test[k].T)
      print('accuracy over sequence ',k+1,': ',round(metrics.accuracy_score(label_test[k].T, y_test_pred),3),sep='')
      print(metrics.confusion_matrix(label_test[k].T, y_test_pred))
    
    i=i+1
    print('\n')

model LinearDiscriminantAnalysis()
accuracy over all sequences: 0.539
[[15198     0   864     0     0]
 [    0 12642  6711   708     0]
 [   10  3342 26644  2964     0]
 [    0  1654 16142  2289     0]
 [    9  2790 12101  1324     0]]
accuracy over sequence 1: 0.506
[[3327    0    0    0    0]
 [   0 3290 1819  330    0]
 [   3  976 5800  869    0]
 [   0  827 3760 1141    0]
 [   0  985 3253  402    0]]
accuracy over sequence 2: 0.555
[[4000    0  864    0    0]
 [   0 3038 2078    4    0]
 [   4 1231 8828  881    0]
 [   0    0 4192    0    0]
 [   9  612 2494  341    0]]
accuracy over sequence 3: 0.593
[[4544    0    0    0    0]
 [   0 3024  995   44    0]
 [   0  159 6216  345    0]
 [   0    0 4430    7    0]
 [   0  208 3101  179    0]]


model QuadraticDiscriminantAnalysis()
accuracy over all sequences: 0.74
[[15900     0   162     0     0]
 [    0 11854  1021  6403   783]
 [    0   119 23318   103  9420]
 [    0   762   153 18193   977]
 [    0   181  7134   153  8756]]
accur

We have that the best models are the following: **QDA, Bagging, GB, RF, HGB**

Let's try an ensemble of these models.

In [None]:
# ensemble with soft voting
classifiers = [
    QuadraticDiscriminantAnalysis(),
    BaggingClassifier(random_state=100463947), 
    GradientBoostingClassifier(random_state=100463947), 
    RandomForestClassifier(random_state=100463947), 
    HistGradientBoostingClassifier(random_state=100463947)
    ]
initials = ['QDA','Bagg','GB','RF','HGB']

e_list = [(i,c) for i,c in zip(initials,classifiers)]

eclf = VotingClassifier(estimators=e_list, voting='soft')
eclf.fit(X,Y.ravel())
pred = eclf.predict(X_real_test)
score = round(metrics.accuracy_score(Y_real_test, pred),3)
print(score)
print(metrics.confusion_matrix(Y_real_test, pred))

for k in range(3):
      y_test_pred = eclf.predict(data_test[k].T)
      print('accuracy over sequence ',k+1,': ',round(metrics.accuracy_score(label_test[k].T, y_test_pred),3),sep='')
      print(metrics.confusion_matrix(label_test[k].T, y_test_pred))


0.766
[[16060     0     2     0     0]
 [    0 14288  1218  4376   179]
 [    0    64 27163   175  5558]
 [    0  3651   635 15425   374]
 [    0    62  8185   230  7747]]
accuracy over sequence 1: 0.704
[[3326    0    1    0    0]
 [   0 3309  333 1737   60]
 [   0   18 6307   29 1294]
 [   0 1773   70 3761  124]
 [   0   25 2348  103 2164]]
accuracy over sequence 2: 0.813
[[4864    0    0    0    0]
 [   0 3779  546  747   48]
 [   0   22 8796   78 2048]
 [   0    0   77 4062   53]
 [   0    6 1700   15 1735]]
accuracy over sequence 3: 0.848
[[4544    0    0    0    0]
 [   0 3891    6  155   11]
 [   0    6 5753   39  922]
 [   0  105  418 3841   73]
 [   0    6 1789    9 1684]]


We have **improved** our score to **0.766 accuracy overall**, and **0.704, 0.813, 0.848 accuracies** for the respective sequences.

In this case, it is clear that the laying activity is almost perfectly classified, whereas the hardest for this ensemble is walking.

Let's do a brief **manual hyperparameter tuning**.

In [None]:
# now, manually tweak the parameters try to achieve better results
classifiers = [
    QuadraticDiscriminantAnalysis(),
    BaggingClassifier(n_estimators=200,max_features=6,max_samples=1500,random_state=100463947), 
    GradientBoostingClassifier(learning_rate=0.0001,n_estimators=2000,min_samples_split=100,random_state=100463947), 
    RandomForestClassifier(n_estimators=2000,max_depth=200,random_state=100463947), 
    HistGradientBoostingClassifier(learning_rate=0.0001,max_iter=2000,max_depth=200,random_state=100463947)
    ]
initials = ['QDA','Bagg','GB','RF','HGB']

e_list = [(i,c) for i,c in zip(initials,classifiers)]

eclf = VotingClassifier(estimators=e_list, voting='soft')
eclf.fit(X,Y.ravel())
pred = eclf.predict(X_real_test)
score = round(metrics.accuracy_score(Y_real_test, pred),3)
print(score)
print(metrics.confusion_matrix(Y_real_test, pred))

for k in range(3):
      y_test_pred = eclf.predict(data_test[k].T)
      print('accuracy over sequence ',k+1,': ',round(metrics.accuracy_score(label_test[k].T, y_test_pred),3),sep='')
      print(metrics.confusion_matrix(label_test[k].T, y_test_pred))

0.771
[[16062     0     0     0     0]
 [    0 14080  1321  4470   190]
 [    0    44 29330   150  3436]
 [    0  2983   670 16101   331]
 [    0    80 10224   191  5729]]
accuracy over sequence 1: 0.712
[[3327    0    0    0    0]
 [   0 3331  375 1676   57]
 [   0   14 6788   25  821]
 [   0 1479  103 4023  123]
 [   0   33 2917   85 1605]]
accuracy over sequence 2: 0.817
[[4864    0    0    0    0]
 [   0 3619  557  888   56]
 [   0   10 9563   63 1308]
 [   0    0  116 4034   42]
 [   0    9 2169   11 1267]]
accuracy over sequence 3: 0.852
[[4544    0    0    0    0]
 [   0 3799   14  230   20]
 [   0    6 6191   37  486]
 [   0   25  348 4021   43]
 [   0    5 2221   10 1252]]


Yet again, we have managed to **improve our scores** to **0.771 overall accuracy**, and **0.712, 0.817, 0.852** for the respective sequence accuracies.

## 2. Adding 'id' column per sequence

In [None]:
# re-load data
import scipy.io as sio
ar_data = sio.loadmat('AR_database.mat', verify_compressed_data_integrity=False)
data_train = ar_data['data_train'][:,0]
label_train = ar_data['label_train'][:,0]
data_test = ar_data['data_test'][:,0]
label_test = ar_data['label_test'][:,0]

In [None]:
for k in range(data_train.shape[0]):
  m = data_train[k].shape[1] # numer of observations
  a = data_train[k].T # change dimension to match append
  b = np.array([np.repeat(str(k+1),m)]).T # add 'id' column of 1's, 2's,..., 8's for the observations of each sequence
  data_train[k] = np.append(a,b,axis=1)

for k in range(data_test.shape[0]):
  m = data_test[k].shape[1] 
  a = data_test[k].T 
  b = np.array([np.repeat(str(k+9),m)]).T 
  data_test[k] = np.append(a,b,axis=1)

In [None]:
X = data_train[0]
Y = label_train[0].T

for k in range(1,8):
  X = np.concatenate((X,data_train[k]))
  Y = np.concatenate((Y,label_train[k].T))

X_real_test = data_test[0]
Y_real_test = label_test[0].T
for k in range(3):
  X_real_test = np.concatenate((X_real_test,data_test[k]))
  Y_real_test = np.concatenate((Y_real_test,label_test[k].T))

### 2.1 Gaussian Mixture Model

In [None]:
# Gaussian Mixture implementation
merge_train = np.append(X,Y,axis=1)
merge_test = np.append(X_real_test,Y_real_test,axis=1)
merge_train = pd.DataFrame(merge_train, columns = ['x1','y1','z1','x2','y2','z2','id','label'])
merge_test = pd.DataFrame(merge_test, columns = ['x1','y1','z1','x2','y2','z2','id','label'])

In [None]:
# train one model per label
classes = np.unique(Y)
nclasses = len(classes)

models = []
for i in range(nclasses):
  data_train_class = merge_train.loc[merge_train['label'] == str(classes[i]), ['x1','y1','z1','x2','y2','z2','id']]
  gm = GaussianMixture(n_components=8, 
                        covariance_type='diag',
                        random_state=100463947).fit(data_train_class)
  models.append(gm)

In [None]:
# predict based on highest loglike score
scores = []
for gmm in models:
  loglike = gmm.score_samples(merge_test.iloc[:,[0,1,2,3,4,5,6]])
  scores.append(loglike)
pred_level = np.argmax(scores, axis=0)+1
print('accuracy score:',np.round(accuracy_score(merge_test['label'].astype('int'),pred_level),2)) # 0

accuracy score: 0.31


We get an accuracy score of 0.31, predicting the label 3 (climbing stairs) every time.  
This is probably due to the way the 'id' variable is implemented in the Gaussian Mixture model.

As it's codified as a string variable, in the test set it has values that it has never seen before.

This probably confuses the model a lot and therefore tries to predict the most frequent class as the most likely to be statistically similar to the test observations.

We will now implement the classifiers.

### 2.2 Classifiers

In [None]:
# implement the classifiers
classifiers = [
    LinearDiscriminantAnalysis(),
    QuadraticDiscriminantAnalysis(),
    AdaBoostClassifier(random_state=100463947), 
    BaggingClassifier(random_state=100463947), 
    ExtraTreesClassifier(random_state=100463947),
    RandomForestClassifier(random_state=100463947), 
    HistGradientBoostingClassifier(random_state=100463947),
    LinearSVC(random_state=100463947)
    ]

i=0
for clf in classifiers:
    print('model',str(clf))
    
    clf.fit(X,Y.ravel())
    pred = clf.predict(X_real_test)
    print("accuracy over all sequences:",round(metrics.accuracy_score(Y_real_test,pred),3))
    print(metrics.confusion_matrix(Y_real_test, pred))

    for k in range(3):
      y_test_pred = clf.predict(data_test[k])
      print('accuracy over sequence ',k+1,': ',round(metrics.accuracy_score(label_test[k].T, y_test_pred),3),sep='')
      print(metrics.confusion_matrix(label_test[k].T, y_test_pred))
    
    i=i+1
    print('\n')

model LinearDiscriminantAnalysis()


  X = check_array(X, **check_params)


accuracy over all sequences: 0.528
[[15198     0   864     0     0]
 [    0 12074  7928    59     0]
 [    4  2884 27681  2391     0]
 [    0   846 18523   716     0]
 [    6  2403 12798  1017     0]]


  X = check_array(X, **check_params)
  X = check_array(X, **check_params)


accuracy over sequence 1: 0.481
[[3327    0    0    0    0]
 [   0 3124 2292   23    0]
 [   1  841 6068  738    0]
 [   0  423 4948  357    0]
 [   0  860 3435  345    0]]
accuracy over sequence 2: 0.562
[[4000    0  864    0    0]
 [   0 2871 2249    0    0]
 [   2 1095 9188  659    0]
 [   0    0 4192    0    0]
 [   6  526 2697  227    0]]
accuracy over sequence 3: 0.596
[[4544    0    0    0    0]
 [   0 2955 1095   13    0]
 [   0  107 6357  256    0]
 [   0    0 4435    2    0]
 [   0  157 3231  100    0]]


model QuadraticDiscriminantAnalysis()


  X = check_array(X, **check_params)
  estimator=estimator,
  X = check_array(X, **check_params)


accuracy over all sequences: 0.737
[[15894     0   168     0     0]
 [    0 10142  1085  8031   803]
 [    0   105 27879    95  4881]
 [    0   864   457 18028   736]
 [    0   188 10173   136  5727]]


  X = check_array(X, **check_params)
  X = check_array(X, **check_params)


accuracy over sequence 1: 0.72
[[3270    0   57    0    0]
 [   0 2605  262 2291  281]
 [   0   27 6205   22 1394]
 [   0  410   26 5044  248]
 [   0   70 2359   62 2149]]
accuracy over sequence 2: 0.743
[[4812    0   52    0    0]
 [   0 2442  545 1984  149]
 [   0   36 9057   27 1824]
 [   0   17  107 3965  103]
 [   0   39 2444    7  966]]
accuracy over sequence 3: 0.769
[[4542    0    2    0    0]
 [   0 2490   16 1465   92]
 [   0   15 6412   24  269]
 [   0   27  298 3975  137]
 [   0    9 3011    5  463]]


model AdaBoostClassifier(random_state=100463947)


  X = check_array(X, **check_params)


accuracy over all sequences: 0.556
[[15198     0   864     0     0]
 [    0  5322  2214 10707  1818]
 [    2   581 14078 10525  7774]
 [    0    10     3 19086   986]
 [    1   641  4926  5791  4865]]
accuracy over sequence 1: 0.526
[[3327    0    0    0    0]
 [   0  778  834 3281  546]
 [   1   84 3279 2381 1903]
 [   0    5    0 5304  419]
 [   0  244 1391 1619 1386]]
accuracy over sequence 2: 0.531
[[4000    0  864    0    0]
 [   0 1733  545 2500  342]
 [   0  368 4389 3606 2581]
 [   0    0    0 4132   60]
 [   1  147 1121 1263  924]]
accuracy over sequence 3: 0.655
[[4544    0    0    0    0]
 [   0 2033    1 1645  384]
 [   0   45 3131 2157 1387]
 [   0    0    3 4346   88]
 [   0    6 1023 1290 1169]]


model BaggingClassifier(random_state=100463947)
accuracy over all sequences: 0.697
[[16062     0     0     0     0]
 [    0 10980  2727  6061   293]
 [    4    76 28818   229  3833]
 [    0  2835  3179 13754   317]
 [    1    84 11973   277  3889]]
accuracy over sequence 1: 0.6

  X = check_array(X, **check_params)


accuracy over all sequences: 0.534
[[15198     0   864     0     0]
 [    0 12942  7119     0     0]
 [    3  4869 28088     0     0]
 [    0  2561 17524     0     0]
 [    1  3527 12696     0     0]]


  X = check_array(X, **check_params)
  X = check_array(X, **check_params)


accuracy over sequence 1: 0.485
[[3327    0    0    0    0]
 [   0 3364 2075    0    0]
 [   1 1357 6290    0    0]
 [   0 1279 4449    0    0]
 [   0 1163 3477    0    0]]
accuracy over sequence 2: 0.577
[[4000    0  864    0    0]
 [   0 3184 1936    0    0]
 [   1 1651 9292    0    0]
 [   0    0 4192    0    0]
 [   1  746 2709    0    0]]
accuracy over sequence 3: 0.593
[[4544    0    0    0    0]
 [   0 3030 1033    0    0]
 [   0  504 6216    0    0]
 [   0    3 4434    0    0]
 [   0  455 3033    0    0]]




  X = check_array(X, **check_params)


We have obtained that the **results do not improve**.

Possibly because as it's converted to numeric, it confuses the patterns for each sequence.
(i.e. sequence 2 is "twice" sequence 1)

Let's try the next implementation: selecting the best individual model per sequence (with default values),
and then predict each observation on test by the majority vote among the 8 individual models.



## 3. One model per sequence

First, we need to select one model per sequence, among the ones that we have been trying.

In [7]:
# re-load data
ar_data = sio.loadmat('AR_database.mat', verify_compressed_data_integrity=False)
data_train = ar_data['data_train'][:,0]
label_train = ar_data['label_train'][:,0]
data_test = ar_data['data_test'][:,0]
label_test = ar_data['label_test'][:,0]

In [None]:
# try every model on its own for each sequence
classifiers = [
    LinearDiscriminantAnalysis(),
    QuadraticDiscriminantAnalysis(),
    AdaBoostClassifier(random_state=100463947), 
    BaggingClassifier(random_state=100463947), 
    ExtraTreesClassifier(random_state=100463947),
    RandomForestClassifier(random_state=100463947), 
    HistGradientBoostingClassifier(random_state=100463947),
    LinearSVC(random_state=100463947)
    ]

X_real_test = data_test[0].T
Y_real_test = label_test[0].T
for k in range(3):
  X_real_test = np.concatenate((X_real_test,data_test[k].T))
  Y_real_test = np.concatenate((Y_real_test,label_test[k].T))   

for j in range(8):
  print('sequence',j+1)
  X = data_train[j].T
  Y = label_train[j].T

  for clf in classifiers:
      print('model',str(clf))
      
      clf.fit(X,Y.ravel())
      pred = clf.predict(X_real_test)
      print("accuracy over all sequences:",round(metrics.accuracy_score(Y_real_test,pred),3))
      print(metrics.confusion_matrix(Y_real_test, pred))
      print('\n')

sequence 1
model LinearDiscriminantAnalysis()
accuracy over all sequences: 0.549
[[13534     0   864  1664     0]
 [    0 10912  2309  6452   388]
 [    3  2663 21302  7615  1377]
 [    0   174  7894 11091   926]
 [    4  2017  8043  5135  1025]]


model QuadraticDiscriminantAnalysis()
accuracy over all sequences: 0.717
[[15075     0   987     0     0]
 [    0 12809  1179  5617   456]
 [    0    66 24511    92  8291]
 [    0  1402   595 17112   976]
 [    0   119  9857   224  6024]]


model AdaBoostClassifier(random_state=100463947)
accuracy over all sequences: 0.433
[[11118     0  4944     0     0]
 [    0  7438   577   437 11609]
 [    0  1094 16397  6580  8889]
 [    0     8    47  4389 15641]
 [    0   968  6172  2819  6265]]


model BaggingClassifier(random_state=100463947)
accuracy over all sequences: 0.672
[[13514     0    20     0  2528]
 [    0 11924  2590  4801   746]
 [   12   293 26422   560  5673]
 [    0  1717  3195 14515   658]
 [    7    84 11065   671  4397]]


model E

Best performing model per sequence:

- Seq 1: Random Forest

- Seq 2: Random Forest

- Seq 3: Extra Trees

- Seq 4: Random Forest

- Seq 5: Random Forest

- Seq 6: Hist Gradient Boosting

- Seq 7: Random Forest

- Seq 8: Extra Trees

Now, we will re-train the best model for each sequence, and predict based on the majority class among all models.

In [8]:
# train one model per sequence
X = data_train[0].T
Y = label_train[0].T
mod1 = RandomForestClassifier(random_state=100463947).fit(X,Y.ravel())

X = data_train[1].T
Y = label_train[1].T
mod2 = RandomForestClassifier(random_state=100463947).fit(X,Y.ravel())

X = data_train[2].T
Y = label_train[2].T
mod3 = ExtraTreesClassifier(random_state=100463947).fit(X,Y.ravel())

X = data_train[3].T
Y = label_train[3].T
mod4 = RandomForestClassifier(random_state=100463947).fit(X,Y.ravel())

X = data_train[4].T
Y = label_train[4].T
mod5 = RandomForestClassifier(random_state=100463947).fit(X,Y.ravel())

X = data_train[5].T
Y = label_train[5].T
mod6 = HistGradientBoostingClassifier(random_state=100463947).fit(X,Y.ravel())

X = data_train[6].T
Y = label_train[6].T
mod7 = RandomForestClassifier(random_state=100463947).fit(X,Y.ravel())

X = data_train[7].T
Y = label_train[7].T
mod8 = ExtraTreesClassifier(random_state=100463947).fit(X,Y.ravel())

In [9]:
X_real_test = data_test[0].T
Y_real_test = label_test[0].T
for k in range(3):
  X_real_test = np.concatenate((X_real_test,data_test[k].T))
  Y_real_test = np.concatenate((Y_real_test,label_test[k].T))   

In [None]:
# this takes quite a long time (a couple of hours), counter printed for reference
final_preds = list()
for j in range(X_real_test.shape[0]):
  if j%20000 == 0:
    print(j)
  obs = X_real_test[j,:]
  p1 = mod1.predict(obs.reshape(1,-1))
  p2 = mod2.predict(obs.reshape(1,-1))
  p3 = mod3.predict(obs.reshape(1,-1))
  p4 = mod4.predict(obs.reshape(1,-1))
  p5 = mod5.predict(obs.reshape(1,-1))
  p6 = mod6.predict(obs.reshape(1,-1))
  p7 = mod7.predict(obs.reshape(1,-1))
  p8 = mod8.predict(obs.reshape(1,-1))
  lst = [p1[0],p2[0],p3[0],p4[0],p5[0],p6[0],p7[0],p8[0]]
  most_repeated = max(set(lst), key=lst.count)
  final_preds.append(most_repeated)

0
20000
40000
60000
80000
100000


In [None]:
print("accuracy over all sequences:",round(metrics.accuracy_score(Y_real_test,final_preds),3))
print(metrics.confusion_matrix(Y_real_test, final_preds))

accuracy over all sequences: 0.744
[[16062     0     0     0     0]
 [    0 15093  1383  3447   138]
 [    0    43 30911    92  1914]
 [    0  4156  2950 12658   321]
 [    0    63 12374   149  3638]]


This model achieves a **0.744 accuracy** over the entire test set.

In [12]:
# now, predict for every sequence
for k in range(3):
  print(k)
  final_preds = list()
  for j in range(data_test[k].shape[1]):
    if j%10000 == 0:
      print(j)
    obs = data_test[k][:,j].T
    p1 = mod1.predict(obs.reshape(1,-1))
    p2 = mod2.predict(obs.reshape(1,-1))
    p3 = mod3.predict(obs.reshape(1,-1))
    p4 = mod4.predict(obs.reshape(1,-1))
    p5 = mod5.predict(obs.reshape(1,-1))
    p6 = mod6.predict(obs.reshape(1,-1))
    p7 = mod7.predict(obs.reshape(1,-1))
    p8 = mod8.predict(obs.reshape(1,-1))
    lst = [p1[0],p2[0],p3[0],p4[0],p5[0],p6[0],p7[0],p8[0]]
    most_repeated = max(set(lst), key=lst.count)
    final_preds.append(most_repeated)
  print('accuracy over sequence ',k+1,': ',round(metrics.accuracy_score(label_test[k].T, final_preds),3),sep='')
  print(metrics.confusion_matrix(label_test[k].T, final_preds))

0
0
10000
20000
accuracy over sequence 1: 0.669
[[3327    0    0    0    0]
 [   0 3831  394 1173   41]
 [   0   11 7170   14  453]
 [   0 2073  874 2651  130]
 [   0   29 3602   67  942]]
1
0
10000
20000
accuracy over sequence 2: 0.835
[[ 4864     0     0     0     0]
 [    0  3977   553   554    36]
 [    0    18 10172    44   710]
 [    0     0   127  4035    30]
 [    0     2  2646     9   799]]
2
0
10000
20000
accuracy over sequence 3: 0.803
[[4544    0    0    0    0]
 [   0 3454   42  547   20]
 [   0    3 6399   20  298]
 [   0   10 1075 3321   31]
 [   0    3 2524    6  955]]


The accuracies evaluated on every test sequence are **0.669, 0.835, 0.803**

# Part 2. MFCC variables

Now we will try to predict using the variables provided by the MFCC.

We will not be including 'id' variable and not doing one model per sequence.

In [4]:
# re-load data
import librosa

data_train = ar_data['data_train'][:,0]
label_train = ar_data['label_train'][:,0]
data_test = ar_data['data_test'][:,0]
label_test = ar_data['label_test'][:,0]

sr = 25

In [5]:
X = data_train[0].T
Y = label_train[0].T

for k in range(1,8):
  X = np.concatenate((X,data_train[k].T))
  Y = np.concatenate((Y,label_train[k].T))

X_real_test = data_test[0].T
Y_real_test = label_test[0].T
for k in range(3):
  X_real_test = np.concatenate((X_real_test,data_test[k].T))
  Y_real_test = np.concatenate((Y_real_test,label_test[k].T))

## 1. Gaussian Mixture model

In [None]:
# convert datasets to dataframes for ease in implementation
merge_train = np.append(X,Y,axis=1)
merge_test = np.append(X_real_test,Y_real_test,axis=1)
merge_train = pd.DataFrame(merge_train, columns = ['x1','y1','z1','x2','y2','z2','label'])
merge_test = pd.DataFrame(merge_test, columns = ['x1','y1','z1','x2','y2','z2','label'])

In [None]:
# train one model per label
classes = np.unique(Y)
nclasses = len(classes)

models = []
for i in range(nclasses):
  mfcc_train = []
  data_train_class = merge_train.loc[merge_train['label'] == classes[i], ['x1','y1','z1','x2','y2','z2']]
  for j in range(6):
    y = data_train_class.iloc[:,j]
    y = y.to_numpy()
    mfcc = librosa.feature.mfcc(y, sr, n_mfcc = 100, hop_length = sr, n_fft = sr, n_mels=20)
    mfcc_train.append(mfcc.T)

  mfcc_train = np.vstack(mfcc_train)
  gm = GaussianMixture(n_components=8, 
                        covariance_type='diag',
                        random_state=100463947).fit(mfcc_train)
  models.append(gm)

In [None]:
# predict based on highest loglike score
scores = []
mfcc_test = []
for j in range(6):
  y = X_real_test[:,j]
  mfcc = librosa.feature.mfcc(y, sr, n_mfcc = 100, hop_length = sr, n_fft = sr, n_mels=20)
  mfcc_test.append(mfcc.T)

mfcc_test = np.vstack(mfcc_test)
print(mfcc_test.shape)

for gmm in models:
  loglike = gmm.score_samples(mfcc_test)
  scores.append(loglike)
pred_level = np.argmax(scores, axis=0)+1

# need to select the appropriate labels to compare (the most repeated one for every 25 samples in the original signal)
# pred_level has each feature stacked on top of each other, so get labels and repeat sequence 6 times
j=0
new_labels = []
while j <= len(Y_real_test):
  m = stats.mode(Y_real_test[j:j+24].ravel())[0]
  new_labels.append(m)
  j=j+25

new_labels = new_labels*6

print('accuracy score:',np.round(accuracy_score(new_labels,pred_level),2)) # 0.45
print('confusion matrix:','\n',confusion_matrix(new_labels,pred_level))

(25296, 20)
accuracy score: 0.45
confusion matrix: 
 [[1666 1228   62  827   57]
 [1657 1671   77 1362   57]
 [  57  107 4285   72 3381]
 [1480 1397  168 1673  118]
 [  24   53 1645   33 2139]]


The accuracy score is much worse than before using this model.

## 2. Classifiers

In [None]:
# ensemble of best models trying number of MFCC = 5,10,...,45

# ensemble with soft voting
classifiers = [
    QuadraticDiscriminantAnalysis(),
    BaggingClassifier(random_state=100463947), 
    GradientBoostingClassifier(random_state=100463947), 
    RandomForestClassifier(random_state=100463947), 
    HistGradientBoostingClassifier(random_state=100463947)
    ]
initials = ['QDA','Bagg','GB','RF','HGB']

e_list = [(i,c) for i,c in zip(initials,classifiers)]

j=0
new_labels_train = []
while j <= Y.shape[0]:
  m = stats.mode(Y[j:j+24].ravel())[0]
  new_labels_train.append(m)
  j=j+25

new_labels_train = new_labels_train*6
for k in range(5,50,5):
  print(k)
  mfcc_train = []
  for j in range(6):
    y = X[:,j]
    mfcc = librosa.feature.mfcc(y, sr, n_mfcc = k, hop_length = sr, n_fft = sr, n_mels=20)
    mfcc_train.append(mfcc.T)

  mfcc_train = np.vstack(mfcc_train)
  mfcc_test = []
  for j in range(6):
    y = X_real_test[:,j]
    mfcc = librosa.feature.mfcc(y, sr, n_mfcc = k, hop_length = sr, n_fft = sr, n_mels=20)
    mfcc_test.append(mfcc.T)
  
  mfcc_test = np.vstack(mfcc_test)


  j=0
  new_labels_test = []
  while j <= len(Y_real_test):
    m = stats.mode(Y_real_test[j:j+24].ravel())[0]
    new_labels_test.append(m)
    j=j+25
  new_labels_test = new_labels_test*6
  
  eclf = VotingClassifier(estimators=e_list, voting='soft')
  eclf.fit(mfcc_train,np.array(new_labels_train).ravel())
  pred = eclf.predict(mfcc_test)
  score = round(metrics.accuracy_score(new_labels_test, pred),3)
  print(score) # 0.502
  print(metrics.confusion_matrix(new_labels_test, pred))


5
0.508
[[1687  906  138 1104    5]
 [1601 1246  166 1805    6]
 [  35    2 6937   53  875]
 [1364  953  254 2253   12]
 [  20    5 3128   20  721]]
10
0.531
[[1909  778  114 1025   14]
 [1662 1205  147 1803    7]
 [  26    3 6984   55  834]
 [1315  932  246 2323   20]
 [  16    1 2852   23 1002]]
15




0.497
[[1822  872   57  998   91]
 [1589 1293   68 1781   93]
 [  12    2 4489   32 3367]
 [1261  931  146 2347  151]
 [  11    1 1246   17 2619]]
20




0.504
[[1853  820   50 1020   97]
 [1534 1278   67 1853   92]
 [  11    1 4616   32 3242]
 [1241  931  144 2370  150]
 [  12    1 1238   14 2629]]
25




0.504
[[1853  820   50 1020   97]
 [1534 1278   67 1853   92]
 [  11    1 4616   32 3242]
 [1241  931  144 2370  150]
 [  12    1 1238   14 2629]]
30




0.504
[[1853  820   50 1020   97]
 [1534 1278   67 1853   92]
 [  11    1 4616   32 3242]
 [1241  931  144 2370  150]
 [  12    1 1238   14 2629]]
35




0.504
[[1853  820   50 1020   97]
 [1534 1278   67 1853   92]
 [  11    1 4616   32 3242]
 [1241  931  144 2370  150]
 [  12    1 1238   14 2629]]
40




0.504
[[1853  820   50 1020   97]
 [1534 1278   67 1853   92]
 [  11    1 4616   32 3242]
 [1241  931  144 2370  150]
 [  12    1 1238   14 2629]]
45




0.504
[[1853  820   50 1020   97]
 [1534 1278   67 1853   92]
 [  11    1 4616   32 3242]
 [1241  931  144 2370  150]
 [  12    1 1238   14 2629]]


The best accuracy is obtained using **10 MFCC, 0.531 accuracy**.

Overall, the rest of the models achieve around 0.5 accuracy.

## 3. Selecting features

The final implementation we will try is to predict not using the MFCC related to every signal, but to a few of them.

In particular, we will try all possible combinations of grouping the numbers 1,...,6 (6 variables) in groups of 1,2,...,6, which is a total of 63 models.

For instance, for the combination (1,5), we will select the first and fifth variables in the original dataset, get their respective MFCC, and train the classifier and GMM with these coefficients.

We will only evaluate the performance on the entire test set for matters of extensiveness.

### 3.1. Gaussian Mixture model

In [None]:
# convert datasets to dataframes for ease in implementation
merge_train = np.append(X,Y,axis=1)
merge_test = np.append(X_real_test,Y_real_test,axis=1)
merge_train = pd.DataFrame(merge_train, columns = ['x1','y1','z1','x2','y2','z2','label'])
merge_test = pd.DataFrame(merge_test, columns = ['x1','y1','z1','x2','y2','z2','label'])

In [None]:
from itertools import combinations
[com for sub in range(6) for com in combinations([0,1,2,3,4,5], sub + 1)] # all possible combinations of idx variables

[(0,),
 (1,),
 (2,),
 (3,),
 (4,),
 (5,),
 (0, 1),
 (0, 2),
 (0, 3),
 (0, 4),
 (0, 5),
 (1, 2),
 (1, 3),
 (1, 4),
 (1, 5),
 (2, 3),
 (2, 4),
 (2, 5),
 (3, 4),
 (3, 5),
 (4, 5),
 (0, 1, 2),
 (0, 1, 3),
 (0, 1, 4),
 (0, 1, 5),
 (0, 2, 3),
 (0, 2, 4),
 (0, 2, 5),
 (0, 3, 4),
 (0, 3, 5),
 (0, 4, 5),
 (1, 2, 3),
 (1, 2, 4),
 (1, 2, 5),
 (1, 3, 4),
 (1, 3, 5),
 (1, 4, 5),
 (2, 3, 4),
 (2, 3, 5),
 (2, 4, 5),
 (3, 4, 5),
 (0, 1, 2, 3),
 (0, 1, 2, 4),
 (0, 1, 2, 5),
 (0, 1, 3, 4),
 (0, 1, 3, 5),
 (0, 1, 4, 5),
 (0, 2, 3, 4),
 (0, 2, 3, 5),
 (0, 2, 4, 5),
 (0, 3, 4, 5),
 (1, 2, 3, 4),
 (1, 2, 3, 5),
 (1, 2, 4, 5),
 (1, 3, 4, 5),
 (2, 3, 4, 5),
 (0, 1, 2, 3, 4),
 (0, 1, 2, 3, 5),
 (0, 1, 2, 4, 5),
 (0, 1, 3, 4, 5),
 (0, 2, 3, 4, 5),
 (1, 2, 3, 4, 5),
 (0, 1, 2, 3, 4, 5)]

In [None]:
from itertools import combinations
comb_lst = [com for sub in range(6) for com in combinations([0,1,2,3,4,5], sub + 1)]

for comb in comb_lst:
  print('model using variables with idx', comb)
  # train one model per label
  classes = np.unique(Y)
  nclasses = len(classes)

  models = []
  for i in range(nclasses):
    mfcc_train = []
    data_train_class = merge_train.loc[merge_train['label'] == classes[i], ['x1','y1','z1','x2','y2','z2']]
    for j in comb:
      y = data_train_class.iloc[:,j]
      y = y.to_numpy()
      mfcc = librosa.feature.mfcc(y, sr, n_mfcc = 100, hop_length = sr, n_fft = sr, n_mels=20)
      mfcc_train.append(mfcc.T)

    mfcc_train = np.vstack(mfcc_train)
    gm = GaussianMixture(n_components=8, 
                          covariance_type='diag',
                          random_state=100463947).fit(mfcc_train)
    models.append(gm)

  # predict based on highest loglike score
  scores = []
  mfcc_test = []
  for j in comb:
    y = X_real_test[:,j]
    mfcc = librosa.feature.mfcc(y, sr, n_mfcc = 100, hop_length = sr, n_fft = sr, n_mels=20)
    mfcc_test.append(mfcc.T)

  mfcc_test = np.vstack(mfcc_test)

  for gmm in models:
    loglike = gmm.score_samples(mfcc_test)
    scores.append(loglike)
  pred_level = np.argmax(scores, axis=0)+1

  # need to select the appropriate labels to compare (the most repeated one for every 25 samples in the original signal)
  # pred_level has each feature stacked on top of each other, so get labels and repeat sequence 6 times
  j=0
  new_labels = []
  while j <= len(Y_real_test):
    m = stats.mode(Y_real_test[j:j+24].ravel())[0]
    new_labels.append(m)
    j=j+25

  new_labels = new_labels*len(comb)

  print('accuracy score:',np.round(accuracy_score(new_labels,pred_level),2),'\n') # 0.45


model using variables with idx (0,)
accuracy score: 0.64 

model using variables with idx (1,)
accuracy score: 0.58 

model using variables with idx (2,)
accuracy score: 0.55 

model using variables with idx (3,)
accuracy score: 0.49 

model using variables with idx (4,)
accuracy score: 0.49 

model using variables with idx (5,)
accuracy score: 0.5 

model using variables with idx (0, 1)
accuracy score: 0.47 

model using variables with idx (0, 2)
accuracy score: 0.47 

model using variables with idx (0, 3)
accuracy score: 0.55 

model using variables with idx (0, 4)
accuracy score: 0.53 

model using variables with idx (0, 5)
accuracy score: 0.55 

model using variables with idx (1, 2)
accuracy score: 0.56 

model using variables with idx (1, 3)
accuracy score: 0.5 

model using variables with idx (1, 4)
accuracy score: 0.5 

model using variables with idx (1, 5)
accuracy score: 0.52 

model using variables with idx (2, 3)
accuracy score: 0.49 

model using variables with idx (2, 4)
a

Best result is **0.64 accuracy**, using idx 0 (**only the first variable**).

This validates our hypothesis that we may find a selection of features that will construct the best combination of MFCC.

Now, we will finally search for the best model, in terms of the combination of features, and the number of MFCC (for the ensemble of best models).

When trying all possible combinations, it was quickly obvious that we were not going to improve our results.

Due to lack of computational resources , we will only the results that have the **first variable in them**, as it provides the best results, and groups of maximum 3 variables.

### 3.2. Classification algorithms

In [6]:
from itertools import combinations
comb_lst = [com for sub in range(3) for com in combinations([0,1,2,3,4,5], sub + 1)]
comb_lst2 = []
for elem in comb_lst:
    if 0 in elem:
        comb_lst2.append(elem)

# ensemble with soft voting
classifiers = [
    QuadraticDiscriminantAnalysis(),
    BaggingClassifier(random_state=100463947), 
    GradientBoostingClassifier(random_state=100463947), 
    RandomForestClassifier(random_state=100463947), 
    HistGradientBoostingClassifier(random_state=100463947)
    ]
initials = ['QDA','Bagg','GB','RF','HGB']

e_list = [(i,c) for i,c in zip(initials,classifiers)]


for comb in comb_lst2:
  print('model using variables with idx', comb)
  j=0
  new_labels_train = []
  while j <= Y.shape[0]:
    m = stats.mode(Y[j:j+24].ravel())[0]
    new_labels_train.append(m)
    j=j+25

  new_labels_train = new_labels_train*len(comb)


  for k in range(5,25,5):
    print('n_mfcc:',k)
    mfcc_train = []
    for j in comb:
      y = X[:,j]
      mfcc = librosa.feature.mfcc(y, sr, n_mfcc = k, hop_length = sr, n_fft = sr, n_mels=20)
      mfcc_train.append(mfcc.T)

    mfcc_train = np.vstack(mfcc_train)
    mfcc_test = []
    for j in comb:
      y = X_real_test[:,j]
      mfcc = librosa.feature.mfcc(y, sr, n_mfcc = k, hop_length = sr, n_fft = sr, n_mels=20)
      mfcc_test.append(mfcc.T)
    
    mfcc_test = np.vstack(mfcc_test)


    j=0
    new_labels_test = []
    while j <= len(Y_real_test):
      m = stats.mode(Y_real_test[j:j+24].ravel())[0]
      new_labels_test.append(m)
      j=j+25
    new_labels_test = new_labels_test*len(comb)
    
    eclf = VotingClassifier(estimators=e_list, voting='soft')
    eclf.fit(mfcc_train,np.array(new_labels_train).ravel())
    pred = eclf.predict(mfcc_test)
    score = round(metrics.accuracy_score(new_labels_test, pred),3)
    print('accuracy score:',score)
  print('\n')

model using variables with idx (0,)
n_mfcc: 5
accuracy score: 0.672
n_mfcc: 10
accuracy score: 0.728
n_mfcc: 15




accuracy score: 0.699
n_mfcc: 20




accuracy score: 0.705


model using variables with idx (0, 1)
n_mfcc: 5
accuracy score: 0.527
n_mfcc: 10
accuracy score: 0.549
n_mfcc: 15




accuracy score: 0.525
n_mfcc: 20




accuracy score: 0.525


model using variables with idx (0, 2)
n_mfcc: 5
accuracy score: 0.528
n_mfcc: 10
accuracy score: 0.575
n_mfcc: 15




accuracy score: 0.538
n_mfcc: 20




accuracy score: 0.542


model using variables with idx (0, 3)
n_mfcc: 5
accuracy score: 0.591
n_mfcc: 10
accuracy score: 0.625
n_mfcc: 15




accuracy score: 0.628
n_mfcc: 20




accuracy score: 0.62


model using variables with idx (0, 4)
n_mfcc: 5
accuracy score: 0.581
n_mfcc: 10
accuracy score: 0.616
n_mfcc: 15




accuracy score: 0.61
n_mfcc: 20




accuracy score: 0.615


model using variables with idx (0, 5)
n_mfcc: 5
accuracy score: 0.587
n_mfcc: 10
accuracy score: 0.623
n_mfcc: 15




accuracy score: 0.602
n_mfcc: 20




accuracy score: 0.6


model using variables with idx (0, 1, 2)
n_mfcc: 5
accuracy score: 0.517
n_mfcc: 10
accuracy score: 0.55
n_mfcc: 15




accuracy score: 0.516
n_mfcc: 20




accuracy score: 0.519


model using variables with idx (0, 1, 3)
n_mfcc: 5
accuracy score: 0.524
n_mfcc: 10
accuracy score: 0.552
n_mfcc: 15




accuracy score: 0.534
n_mfcc: 20




accuracy score: 0.542


model using variables with idx (0, 1, 4)
n_mfcc: 5
accuracy score: 0.514
n_mfcc: 10
accuracy score: 0.541
n_mfcc: 15




accuracy score: 0.505
n_mfcc: 20




accuracy score: 0.516


model using variables with idx (0, 1, 5)
n_mfcc: 5
accuracy score: 0.518
n_mfcc: 10
accuracy score: 0.546
n_mfcc: 15




accuracy score: 0.52
n_mfcc: 20




accuracy score: 0.526


model using variables with idx (0, 2, 3)
n_mfcc: 5
accuracy score: 0.517
n_mfcc: 10
accuracy score: 0.543
n_mfcc: 15




accuracy score: 0.512
n_mfcc: 20




accuracy score: 0.517


model using variables with idx (0, 2, 4)
n_mfcc: 5
accuracy score: 0.515
n_mfcc: 10
accuracy score: 0.546
n_mfcc: 15




accuracy score: 0.517
n_mfcc: 20




accuracy score: 0.518


model using variables with idx (0, 2, 5)
n_mfcc: 5
accuracy score: 0.525
n_mfcc: 10
accuracy score: 0.558
n_mfcc: 15




accuracy score: 0.527
n_mfcc: 20




accuracy score: 0.528


model using variables with idx (0, 3, 4)
n_mfcc: 5
accuracy score: 0.558
n_mfcc: 10
accuracy score: 0.596
n_mfcc: 15




accuracy score: 0.588
n_mfcc: 20




accuracy score: 0.586


model using variables with idx (0, 3, 5)
n_mfcc: 5
accuracy score: 0.559
n_mfcc: 10
accuracy score: 0.598
n_mfcc: 15




accuracy score: 0.591
n_mfcc: 20




accuracy score: 0.591


model using variables with idx (0, 4, 5)
n_mfcc: 5
accuracy score: 0.553
n_mfcc: 10
accuracy score: 0.595
n_mfcc: 15




accuracy score: 0.579
n_mfcc: 20




accuracy score: 0.576




The best results out of all these combinations is using **only the first variable and 10 MFCC**, with **0.728 accuracy**.

# Summary of leading results


| model description | overall accuracy | sequence 1 accuracy | sequence 2 accuracy | sequence 3 accuracy |  
| ----------------- | ---- | ------------- | -------- | --------- |  
| GMM without id column | 0.73 | 0.72 | 0.72 | 0.79 | 
| QDA without id column | 0.74 | 0.732 | 0.725 | 0.779 |
| Bagging without id column  | 0.742 | 0.678 | 0.801 | 0.818 |
| RF without id column  | 0.744 | 0.675 | 0.816 | 0.815 |
| Ensemble without id column | 0.766 | 0.704 | 0.813 | 0.848 |
| **Ensemble with tuned parameters, without id column** | **0.771** | **0.712** | **0.817** | **0.852** |
| QDA with id column | 0.737 | 0.72 | 0.743 | 0.769 |
| One model per sequence | 0.744 | 0.669 | 0.835 | 0.803 | 
