# SDTSIA 210 Data Challenge
### TANG Joël
### TORRES PÉREZ Claudia

The goal of this challenge is to decide whether or not two faces belong to the same person, using a dataset composed of 13 features per face and 11 metrics computed from those features. If the two faces belong to the same person, the pair is labelled $1$ in $y_{dataset}$; otherwise it is labelled $0$.

To perform this binary classification task, we explore different classification algorithms and choose the one that gives the best accuracy.

### SVM
An SVM can be used with a linear or a non-linear kernel. We observed that the dataset was better represented by a linear model. We performed cross-validation; the main hyperparameter was the margin tolerance. Results were good, but the model showed its limits: it was unable to fit the training set.
Another problem we had to cope with was the execution time of SVM training. We used bagging, with SVMs trained on smaller sub-datasets. This also improved the stability of the algorithm by reducing its variance.
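
As an illustration of the bagging idea described above, here is a minimal sketch on synthetic data. The dataset, the value of `C` (taken here as the margin tolerance), the number of estimators and the 10% sub-sample size are all assumptions chosen for the example, not the settings we used in the challenge.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the challenge data (assumption: 37 columns = 13 + 13 + 11).
X, y = make_classification(n_samples=2000, n_features=37, n_informative=10,
                           random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

# Each linear SVM is trained on a random 10% sub-sample of the training set;
# the ensemble combines the predictions of the individual SVMs.
bagged_svm = BaggingClassifier(
    LinearSVC(C=1.0, dual=False),  # C is a hypothetical value
    n_estimators=10,
    max_samples=0.1,
    random_state=0,
)
bagged_svm.fit(X_tr, y_tr)

bagging_accuracy = bagged_svm.score(X_va, y_va)
print(bagging_accuracy)
```

Because each SVM only sees a fraction of the rows, every individual fit is cheaper, and combining several of them smooths out the dependence on any single sub-sample.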

### Random Forests
A random forest is simply a collection of decision trees whose results are aggregated into one final result. Decision trees are a type of model used for both classification and regression; they split the data using decision thresholds, which seems an appropriate approach for our data. Using Random Forests as a starting model, with the gini criterion and different numbers of estimators, gave us acceptable results, with accuracies of around 0.98. However, the model had a high variance: the predictions varied a lot from one set of training data to another. Still, we liked the idea of using decision trees as the base of our model.

Since decision trees are known for having a high variance, we had to look for a model that would let us keep them while reducing that variance. The solution: gradient boosting.

### XGBoost
XGBoost is a gradient boosting algorithm. It is an iterative algorithm based on weak learners, each of which has a certain base accuracy. Taken individually, these learners can have a high bias (with an accuracy only slightly better than 0.5, for instance). The algorithm iteratively sums these learners, forcing each new learner to fit the error of the combination of its predecessors.

We use decision trees as the learners.
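
The "fit the error of the predecessors" step can be sketched in a few lines. This is a simplified illustration with a squared loss on a toy regression problem, not XGBoost's actual implementation (which also uses second-order information and regularization); the data, the learning rate and the number of rounds are assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem (synthetic, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)

learning_rate = 0.3
n_rounds = 50

# Start from a constant prediction, then add one weak learner per round.
prediction = np.full_like(y, y.mean())
mse_per_round = [np.mean((y - prediction) ** 2)]

for _ in range(n_rounds):
    # For the squared loss, the residual is the negative gradient of the loss.
    residual = y - prediction
    # A depth-1 tree (a "stump") is a deliberately weak, high-bias learner.
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    prediction += learning_rate * stump.predict(X)
    mse_per_round.append(np.mean((y - prediction) ** 2))

print(mse_per_round[0], mse_per_round[-1])
```

Each stump on its own is a poor model, but the running sum keeps reducing the training error, which is the behaviour the paragraph above describes.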

### First, the imports

In [None]:
import tensorflow as tf

import numpy as np
import sys
import os
import matplotlib.pyplot as plt
import math
from sklearn.svm import LinearSVC
from sklearn import svm
from sklearn.ensemble import BaggingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

### Loading the datasets
We load them from Google Drive.

In [None]:
# Load training data

nrows_train = 1068504
nrows_test = 0

from google.colab import drive
drive.mount('/content/gdrive')
root_dir = "/content/gdrive/My Drive/SD-TSIA210_CHALLENGE/"

xtrain = np.loadtxt(root_dir + 'xtrain_challenge.csv', delimiter=',', skiprows = 1, max_rows = nrows_train + nrows_test)
ytrain = np.loadtxt(root_dir + 'ytrain_challenge.csv', delimiter=',', skiprows = 1, max_rows = nrows_train + nrows_test)
ytrain = np.array(ytrain).reshape(nrows_train + nrows_test)

In the cell below, we apply a data augmentation technique.

Our algorithm (decision trees) doesn't require scaled data and tolerates features with heterogeneous ranges. However, we can still mirror the data: each sample is composed of two faces, and the first 26 features split into one block of 13 per face, so we can simply exchange the two blocks. This doubles the size of the training set.

In [None]:
#%%
# Pre-processing: cast the training features to float32 and load the test features
xtrain = xtrain.astype('float32')
xtest = np.loadtxt(root_dir + 'xtest_challenge.csv', delimiter=',', skiprows = 1).astype('float32')

# Swap the two 13-feature face blocks to build the mirrored samples
x_train_permuter = np.copy(xtrain)
x_train_permuter[:, :13] = xtrain[:, 13:26]
x_train_permuter[:, 13:26] = xtrain[:, :13]

new_x_train = np.concatenate((xtrain, x_train_permuter), axis=0)
new_y_train = np.concatenate((ytrain, ytrain), axis=0)

print(new_x_train.shape)
print(new_y_train.shape)
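
To make the effect of the swap easy to check by eye, here is the same operation on a toy sample. The helper name `mirror_pairs` and the toy layout (2 features per face and a single pairwise metric) are hypothetical; the sketch also assumes that the pairwise metrics do not depend on the order of the two faces, which is what makes the mirrored sample a valid one.

```python
import numpy as np

def mirror_pairs(x, y, n_face_features):
    """Return x and y extended with a copy in which the two face blocks are swapped."""
    swapped = x.copy()
    swapped[:, :n_face_features] = x[:, n_face_features:2 * n_face_features]
    swapped[:, n_face_features:2 * n_face_features] = x[:, :n_face_features]
    # Labels are unchanged: the pair (A, B) has the same label as (B, A).
    return np.concatenate((x, swapped), axis=0), np.concatenate((y, y), axis=0)

# Toy sample: face 1 = (1, 2), face 2 = (3, 4), pairwise metric = 9.
x_toy = np.array([[1.0, 2.0, 3.0, 4.0, 9.0]])
y_toy = np.array([1.0])

x_aug, y_aug = mirror_pairs(x_toy, y_toy, n_face_features=2)
print(x_aug)
print(y_aug)
```

The second row of `x_aug` holds the two face blocks in reverse order, while the metric column and the label are left untouched.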

Now we move on to the train/test split. We hold out 20% of our data to validate the model.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(new_x_train, new_y_train, test_size=0.2)

In the code cell below, we apply cross-validation to the XGBClassifier model. We boost the learning of many decision trees, which are considered weak learners.
The most important parameters are:
- learning_rate: as in other gradient descent methods, the classifier performs something like a gradient correction at each round. A small learning rate leads to very slow convergence; a large one can overshoot the optimum.
- n_estimators: to be considered as a trade-off with the learning rate. It is the number of boosting rounds performed by the classifier during learning. Too large a number leads to over-fitting.
- max_depth: the maximum depth of the decision trees. The smaller it is, the simpler the trees (hence a higher bias, which is not a problem for boosting).

In [None]:
from sklearn.model_selection import RandomizedSearchCV

with tf.device('/device:GPU:0'):
  boost = XGBClassifier(colsample_bylevel=1, max_depth=6,
                        max_delta_step=0, reg_alpha=0,
                        min_child_weight=1, subsample=1, missing=None, nthread=-1,
                        objective='binary:logistic', sampling_method='gradient_based',
                         reg_lambda=1, tree_method='gpu_hist',
                        scale_pos_weight=1.002, silent=True)
  
  parameters = {"gamma" : [0, 0.01, 0.1, 0.3],
               "colsample_bytree": [0.7,0.85,1],
               "reg_lambda": [1.5,2,2.5,3],
                "base_score": [0.45, 0.5],
                "n_estimators":[655,660,665,670,675,680,685,690,695,700],
                "learning_rate":[0.15,0.2,0.25,0.35]}

  xgb_rscv = RandomizedSearchCV(boost, param_distributions = parameters, scoring = "accuracy",
                             cv = 5, verbose = 3, random_state = 40)
 
  boost_fit = xgb_rscv.fit(X_train, y_train)

  print("Best Score: {}".format(boost_fit.best_score_))
  print("Best params: {}".format(boost_fit.best_params_))
  

In [None]:
# Accuracy of the best estimator found by the search on the held-out split
y_pred = boost_fit.predict(X_test)
print(np.mean(y_pred == y_test))
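
Note that the randomized search above only evaluates a sample of the parameter grid: `RandomizedSearchCV` draws `n_iter` combinations, and its default is 10. The sketch below counts the combinations in the grid and draws a sample of the same size with scikit-learn's `ParameterGrid` and `ParameterSampler`. It only illustrates the sampling and trains no model.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as in the randomized search above.
parameters = {"gamma": [0, 0.01, 0.1, 0.3],
              "colsample_bytree": [0.7, 0.85, 1],
              "reg_lambda": [1.5, 2, 2.5, 3],
              "base_score": [0.45, 0.5],
              "n_estimators": [655, 660, 665, 670, 675, 680, 685, 690, 695, 700],
              "learning_rate": [0.15, 0.2, 0.25, 0.35]}

# Total number of distinct combinations in the grid.
grid_size = len(ParameterGrid(parameters))

# 10 combinations drawn at random, as RandomizedSearchCV does with its default n_iter.
sampled = list(ParameterSampler(parameters, n_iter=10, random_state=40))

print(grid_size, len(sampled))
```

With 5-fold cross-validation, 10 sampled combinations mean 50 model fits plus one final refit, a small fraction of what an exhaustive grid search would require.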

We observed that the training accuracy was very high (> 0.999), compared with a slightly lower accuracy on the validation set (0.9985). This is a sign of overfitting. We used early stopping, which halts training once the validation error has stopped improving for a given number of rounds (15 in the cell below).
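
The stopping rule itself can be sketched independently of XGBoost. The example below uses scikit-learn's `GradientBoostingClassifier` on synthetic data as a stand-in (an assumption made to keep the example self-contained) and applies the rule after the fact: it trains all rounds, then walks through the per-round validation predictions and stops once the error has not improved for `patience` rounds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.25,
                                   max_depth=3, random_state=0)
model.fit(X_tr, y_tr)

patience = 15  # same role as early_stopping_rounds
best_error, best_round = np.inf, 0

# staged_predict yields the validation predictions after 1, 2, ... boosting rounds.
for round_idx, y_hat in enumerate(model.staged_predict(X_va), start=1):
    error = np.mean(y_hat != y_va)
    if error < best_error:
        best_error, best_round = error, round_idx
    elif round_idx - best_round >= patience:
        break  # no improvement for `patience` rounds

print(best_round, best_error)
```

A library-level early stopping saves the training time of the discarded rounds, whereas this post-hoc version only selects the number of rounds to keep; the selection criterion is the same.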

In [None]:
boost_2 = XGBClassifier(base_score=0.45, colsample_bylevel=1, colsample_bytree=1,max_depth=6,
                        gamma=0, learning_rate=0.25, max_delta_step=0,
                        min_child_weight=1, missing=None, n_estimators=167, nthread=-1,
                        objective='binary:logistic', sampling_method='gradient_based',
                        reg_alpha=0, reg_lambda=1, tree_method='gpu_hist',
                        scale_pos_weight=1.005, seed=0, silent=True, subsample=1)

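early_stopping_rounds = 15
# Early stopping monitors the last entry of eval_set, i.e. the held-out split
eval_set = [(X_train, y_train), (X_test, y_test)]
boost_2.fit(X_train, y_train, early_stopping_rounds=early_stopping_rounds, eval_metric=["error"], eval_set=eval_set, verbose=True)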


In [None]:
y_pred = boost_2.predict(xtest)
np.savetxt(root_dir + 'ytest_challenge_student.csv', y_pred, fmt = '%1.0d', delimiter=',')

We tested early stopping in an attempt to improve on our best accuracy of $0.998650512191$, but the results were inconclusive.
Even after repeatedly tweaking the parameters, we found it hard to increase our accuracy by even $0.00001$. We think it would be worthwhile to explore other data augmentation techniques: a larger training set would allow the decision trees to learn better classification thresholds.
