# Performance comparison of trained models

In this notebook we will find which one of all trained models has fitted better on the data. Than, we gonna use that model to make predictions in another notebook. 

First things first. At this moment we already hiperparametrized some classification models. All of those models are available on the folder "modelling/hiperparameters/". Each .joblib file was generated throught the script "tunning_hiperparameters.py". Those files contains the hiperparameters from each models which has achieve the best results of performance metrics, such as $F_\beta$ score. The process of tunning the hiperparameters was made by cross validation, through the GridSerachCV sklearn method. 

In [1]:
# packages
import pandas as pd

from modelling.pre_process import *
from modelling.fit import *

from sklearn.model_selection import train_test_split

%load_ext autoreload
%autoreload 2

## Pre processing train dataset

From the pre_process class tha we have imported, we gonna use some methods to pre process the train dataset. Some of this methods includes:
- to subselect the important columns,
- to create the design (incidence) matrix for the categorical variables,
- tranform the numerical variables through standartization,
- to remove the outliers,
- fill missing values on Age column through ols regression.

In [3]:
train = pd.read_csv('train.csv') #read train dataset
train.index = train['PassengerId'] # save Id information before drop it
train = pre_processing(train) # create the class
train.select_columns(drop=['PassengerId', 'Name', 'Ticket', 'Cabin', 'Embarked']) ## selecting features
train.Create_I_Matrix() # create incidence matrix
train.Standardize() # standardizing data
# train.FactorCategorical() # labeling categorical featuers
train.RemoveOutliers() # removing outliers
train.fill_nan_ols('Age') # fill nan values with regression

Detecting and removing outliers...
Total rows deleted containing outliers: 162


## Performances

Our dataset is ready to be modelled. To evaluate the models performances we have to split the train dataset into two new datasets: train and test data set. It is common to use 20% of the data as test set. 

Once we have this new data sets, let's train the models, save them and evaluate it's performances! (All the trained models will be saved in the folder "modelling/trainded/").

In [15]:
## Splitting train dataset into sub train and sub test datasets
train_data, test_data = train_test_split(train.df, 
                                         test_size=.2, 
                                         random_state=123, 
                                         stratify=train.df['Survived'])


xtrain = train_data.loc[:, ~train_data.columns.isin(['Survived'])]
ytrain = train_data['Survived']
xtest = train_data.loc[:, ~train_data.columns.isin(['Survived'])]
ytest = train_data['Survived']

# Create the class for fitting
compare_models = ml_fitting(X_train=xtrain, y_train=ytrain)

# train the models
compare_models.train_models(chosen_models=['Logistic', 
                                           'Naive_Bayes', 
                                           'Random_Forest', 
                                           'MLP',
                                           'XGBoost',
                                           'SVC'])


['Naive_Bayes', 'SVC', 'XGBoost', 'Random_Forest', 'MLP', 'Logistic']
All chosen models are available.
The hiperparameters of each chosen model was loaded successfully.
Training models... 

Hiperparameters of Logistic model are settled. Fitting...
Logistic: trained and saved. 

Hiperparameters of Naive_Bayes model are settled. Fitting...
Naive_Bayes: trained and saved. 

Hiperparameters of Random_Forest model are settled. Fitting...
Random_Forest: trained and saved. 

Hiperparameters of XGBoost model are settled. Fitting...
XGBoost: trained and saved. 

Hiperparameters of MLP model are settled. Fitting...
Iteration 1, loss = 0.64605693
Iteration 2, loss = 0.64020536
Iteration 3, loss = 0.63539347
Iteration 4, loss = 0.63021215
Iteration 5, loss = 0.62472309
Iteration 6, loss = 0.62005187
Iteration 7, loss = 0.61512848
Iteration 8, loss = 0.61048142
Iteration 9, loss = 0.60598594
Iteration 10, loss = 0.60147545
Iteration 11, loss = 0.59692386
Iteration 12, loss = 0.59234358
Iteration 13



Iteration 48, loss = 0.46114350
Iteration 49, loss = 0.45951340
Iteration 50, loss = 0.45794972
Iteration 51, loss = 0.45631027
Iteration 52, loss = 0.45484244
Iteration 53, loss = 0.45373986
Iteration 54, loss = 0.45230468
Iteration 55, loss = 0.45095146
Iteration 56, loss = 0.44992343
Iteration 57, loss = 0.44876506
Iteration 58, loss = 0.44779208
Iteration 59, loss = 0.44681158
Iteration 60, loss = 0.44580826
Iteration 61, loss = 0.44486162
Iteration 62, loss = 0.44414568
Iteration 63, loss = 0.44341988
Iteration 64, loss = 0.44252858
Iteration 65, loss = 0.44188325
Iteration 66, loss = 0.44107395
Iteration 67, loss = 0.44034017
Iteration 68, loss = 0.43967911
Iteration 69, loss = 0.43907192
Iteration 70, loss = 0.43834707
Iteration 71, loss = 0.43777944
Iteration 72, loss = 0.43726698
Iteration 73, loss = 0.43667979
Iteration 74, loss = 0.43619918
Iteration 75, loss = 0.43565397
Iteration 76, loss = 0.43511176
Iteration 77, loss = 0.43473444
Iteration 78, loss = 0.43419113
Iteratio



In [16]:
# Comparing all available models 
compare_models.compare_models(X_test=xtest, 
                              y_test=ytest,
                              chosen_models=['Logistic', 
                                           'Naive_Bayes', 
                                           'Random_Forest', 
                                           'MLP',
                                           'XGBoost',
                                           'SVC']);

Classificador: Naive_Bayes, tempo de predição:0:00:00.007714 

Classificador: SVC, tempo de predição:0:00:00.016109 

Classificador: XGBoost, tempo de predição:0:00:00.009188 

Classificador: Random_Forest, tempo de predição:0:00:00.034908 

Classificador: MLP, tempo de predição:0:00:00.013601 

Classificador: Logistic, tempo de predição:0:00:00.015712 



Now, we have some performance metrics to choose what was the best model for this proposition.

In [17]:
compare_models.comparison

Unnamed: 0,MODEL,AUC,RECALL,FHALF,F1,F2,TN,FN,FP,TP,N_NEG,N_POS
0,Naive_Bayes,0.73528,0.699507,0.634495,0.657407,0.682037,293,61,87,142,380,203
1,SVC,0.802048,0.70936,0.766773,0.744186,0.722892,340,59,40,144,380,203
2,XGBoost,0.959917,0.940887,0.955956,0.950249,0.944609,372,12,8,191,380,203
3,Random_Forest,0.989979,0.985222,0.98912,0.987654,0.986193,378,3,2,200,380,203
4,MLP,0.801069,0.699507,0.772579,0.743455,0.716448,343,61,37,142,380,203
5,Logistic,0.775194,0.763547,0.675676,0.70615,0.739504,299,48,81,155,380,203


As we can see in the table above, the Random Forest and XGBoost classifiers were the best models according to the choosen metrics. To see what are the parameters used in each model:

In [20]:
for model in compare_models.hiperparameters.keys():
    print(f'{model}: {compare_models.hiperparameters[model]} \n')

Logistic: {'C': 4, 'class_weight': 'balanced', 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'auto', 'n_jobs': None, 'penalty': 'l2', 'random_state': None, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0, 'warm_start': False} 

Naive_Bayes: {'alpha': 1.5, 'binarize': 0.0, 'class_prior': None, 'fit_prior': False} 

Random_Forest: {'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': 'balanced', 'criterion': 'entropy', 'max_depth': None, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 160, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False} 

XGBoost: {'objective': 'binary:logistic', 'use_label_encoder': True, 'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 1, 'colsample_bynode': 1, 'colsample_bytre