# Models

**Authors:**

- José Antonio Nazar Alaez (jose.nazar@cunef.edu)

- Francisco Martínez García (f.martinezgarcia@cunef.edu)

Below we build five predictive models: a baseline model, a random forest, an XGBoost, an AdaBoost and finally a LightGBM. We also compute their scores for a first analysis.

# Libraries

In [43]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import pickle

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, roc_curve, auc, \
                            silhouette_score, recall_score, precision_score, make_scorer, \
                            roc_auc_score, f1_score, precision_recall_curve, fbeta_score,mean_squared_error
import warnings
warnings.filterwarnings('ignore')

from xgboost import XGBClassifier
from sklearn.ensemble import AdaBoostClassifier
import lightgbm as lgb

import aux_func as fx

# Reading the data

In [44]:
# Read the training data
pd_punctuation = pd.read_parquet('./data/training_data.parquet')

In [45]:
pd_punctuation.head()

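| Unnamed: 0 | review_count | is_open | Health & Medical | Shipping Centers | Shopping | Restaurants | Automotive | Active Life | Arts & Entertainment | Event Planning & Services | Hotels & Travel | Beauty & Spas | useful | funny | cool | punctuation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 51735 | 31 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1.0 | 0.32 | 0.52 | 1 |
| 1649 | 112 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.53 | 0.29 | 0.33 | 0 |
| 67144 | 20 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0.73 | 0.05 | 0.32 | 0 |
| 56745 | 22 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1.87 | 0.78 | 1.26 | 1 |
| 9045 | 324 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.79 | 0.16 | 0.33 | 1 |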


# Data processing

In [46]:
#Defining the steps in the numerical pipeline 
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

#Defining the steps in the categorical pipeline 
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

#Numerical features to pass down the numerical pipeline 
numeric_features = pd_punctuation.select_dtypes(include=['int64', 'float64']).drop(['punctuation'], axis=1).columns
# Categorical features to pass down the categorical pipeline
categorical_features = pd_punctuation.select_dtypes(include=['object']).columns

In [47]:
#Create the preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
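To make the two branches concrete, here is a minimal sketch of the same preprocessing applied to a four-row toy frame. The `city` column and all values are hypothetical and used only for illustration, and `sparse_threshold=0` is added solely so that the result prints as a dense array. In the sample shown by `head()` every column appears to be numeric, so on the real data the categorical branch may receive no columns at all.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame; "city" is a hypothetical categorical column (not in the real data)
toy = pd.DataFrame({
    "review_count": [10.0, 20.0, np.nan, 30.0],
    "city": pd.Series(["A", "B", np.nan, "A"], dtype=object),
})

# Numeric branch: fill gaps with the median, then standardize
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Categorical branch: fill gaps with the label "missing", then one-hot encode
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, ["review_count"]),
        ("cat", categorical_transformer, ["city"]),
    ],
    sparse_threshold=0,  # always return a dense array (for readable printing)
)

transformed = toy_preprocessor.fit_transform(toy)
print(transformed.shape)
print(transformed.round(2))
```

The output has one scaled numeric column followed by one indicator column per category, including the `missing` placeholder introduced by the imputer.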

In [48]:
#Save the preprocessor
with open ('./models/preprocessor.pickle','wb') as f:
    pickle.dump(preprocessor,f)

In [49]:
# Load the preprocessor
with open('./models/preprocessor.pickle', 'rb') as f:
    preprocessor = pickle.load(f)

In [50]:
# Split the training data into training and validation sets
X_train, X_validation, y_train, y_validation = train_test_split(pd_punctuation, pd_punctuation['punctuation'], 
                                                                test_size=0.15, 
                                                                random_state=12345)

In [51]:
# Drop target from X_train and X_validation
X_train = X_train.drop(['punctuation'], axis=1)
X_validation = X_validation.drop(['punctuation'], axis=1)
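The split above passes the full frame as features and drops the target afterwards. An equivalent and slightly more direct variant separates the target first. The sketch below uses a small synthetic frame (all values are made up), and the `stratify` argument is an optional extra that the notebook does not use.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training data (values are illustrative only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "review_count": rng.integers(1, 500, size=200),
    "useful": rng.random(200).round(2),
    "punctuation": np.tile([0, 1], 100),
})

# Separate features and target before splitting
X = toy.drop(columns=["punctuation"])
y = toy["punctuation"]

# Same test_size and random_state as the notebook; stratify keeps the
# class proportions equal in both subsets (an addition, not in the original)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.15, random_state=12345, stratify=y
)

print(X_tr.shape, X_val.shape)
print(y_val.value_counts().to_dict())
```

Separating the target up front removes the need for the later `drop` calls and makes it harder to leave the label among the features by accident.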

## Models

- Base Model
- Random Forest
- XGBoost
- AdaBoost
- LightGBM

## Base model

In [52]:
model_base = Pipeline(steps=[
    ('preprocesador', preprocessor), 
    ('clasificador', DummyClassifier(strategy='most_frequent',random_state=1))])

In [53]:
#Train the model
model_base.fit(X_train, y_train)

In [54]:
#Save the model
with open('./models/base.pickle', 'wb') as f:
    pickle.dump(model_base, f)

In [55]:
#Load the model
with open('./models/base.pickle', 'rb') as f:
    model_base = pickle.load(f)

In [56]:
# Predictions
y_pred_base = model_base.predict(X_validation)
y_pred_proba_base = model_base.predict_proba(X_validation)
fx.evaluate_model(y_validation, y_pred_base, y_pred_proba_base)

ROC-AUC score of the model: 0.5
Accuracy of the model: 0.501411735821262

Classification report: 
              precision    recall  f1-score   support

           0       0.00      0.00      0.00      8123
           1       0.50      1.00      0.67      8169

    accuracy                           0.50     16292
   macro avg       0.25      0.50      0.33     16292
weighted avg       0.25      0.50      0.33     16292


Confusion matrix: 
[[   0 8123]
 [   0 8169]]

F2 Score: 
0.41705807874530304



The baseline model, which always predicts the majority class, reaches an accuracy of 50%.
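`fx.evaluate_model` lives in the `aux_func` module, which is not shown in this notebook. The following is a minimal sketch of what such a helper could look like. The function name `evaluate_model_sketch` and its body are assumptions reconstructed from the printed output, not the authors' code. In particular, macro averaging for the F2 score is a guess, chosen because it reproduces the baseline's printed value when the baseline case is rebuilt from its confusion matrix.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, fbeta_score, roc_auc_score)


def evaluate_model_sketch(y_true, y_pred, y_pred_proba):
    """Hypothetical stand-in for fx.evaluate_model (the real helper is not shown)."""
    metrics = {
        # Positive-class probability is taken from the second column
        "roc_auc": roc_auc_score(y_true, y_pred_proba[:, 1]),
        "accuracy": accuracy_score(y_true, y_pred),
        # beta=2 weights recall above precision; macro averaging is assumed
        "f2": fbeta_score(y_true, y_pred, beta=2, average="macro",
                          zero_division=0),
    }
    print("ROC-AUC score of the model:", metrics["roc_auc"])
    print("Accuracy of the model:", metrics["accuracy"])
    print("\nClassification report:")
    print(classification_report(y_true, y_pred, zero_division=0))
    print("Confusion matrix:")
    print(confusion_matrix(y_true, y_pred))
    print("\nF2 Score:")
    print(metrics["f2"])
    return metrics


# Rebuild the baseline case from its confusion matrix: 8123 zeros, 8169 ones,
# and a classifier that always predicts class 1 with probability 1
y_true = np.array([0] * 8123 + [1] * 8169)
y_pred = np.ones_like(y_true)
y_proba = np.column_stack([np.zeros(len(y_true)), np.ones(len(y_true))])

result = evaluate_model_sketch(y_true, y_pred, y_proba)
```

Returning the metrics as a dictionary, in addition to printing them, makes it easier to collect the results of several models for a later comparison.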

## Random Forest

In [57]:
model_rf = Pipeline(steps=[
    ('preprocesador', preprocessor), 
    
    ('clasificador', RandomForestClassifier(n_jobs=-1, random_state=0))])

In [58]:
#Train the model
model_rf.fit(X_train, y_train)

In [59]:
#Save the model
with open ('./models/random_forest.pickle','wb') as f:
    pickle.dump(model_rf,f)

In [60]:
#Load the model
with open('./models/random_forest.pickle', 'rb') as f:
    model_rf = pickle.load(f)

In [61]:
# Predictions
y_pred_rf = model_rf.predict(X_validation)
y_pred_proba_rf = model_rf.predict_proba(X_validation)

In [62]:
fx.evaluate_model(y_validation, y_pred_rf,y_pred_proba_rf)

ROC-AUC score of the model: 0.8022571074756829
Accuracy of the model: 0.7284556837711761

Classification report: 
              precision    recall  f1-score   support

           0       0.73      0.73      0.73      8123
           1       0.73      0.73      0.73      8169

    accuracy                           0.73     16292
   macro avg       0.73      0.73      0.73     16292
weighted avg       0.73      0.73      0.73     16292


Confusion matrix: 
[[5945 2178]
 [2246 5923]]

F2 Score: 
0.7284582120013355



With the random forest, accuracy rises to 73% compared with the baseline. Other metrics such as the F2 score also increase, so this appears to be a better model.
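For reference, the F-beta score has the standard definition below, with $\beta = 2$ so that recall weighs more than precision. The per-class values are worked out from the random forest confusion matrix printed above. Averaging the two classes with equal weight (macro averaging, which is an assumption about how `fx.evaluate_model` aggregates) gives a value that agrees with the printed F2 score.

$$
F_\beta = \frac{(1+\beta^2)\,TP}{(1+\beta^2)\,TP + \beta^2\,FN + FP}
$$

$$
\begin{aligned}
F_2^{(1)} &= \frac{5 \cdot 5923}{5 \cdot 5923 + 4 \cdot 2246 + 2178} = \frac{29615}{40777} \approx 0.7263 \\
F_2^{(0)} &= \frac{5 \cdot 5945}{5 \cdot 5945 + 4 \cdot 2178 + 2246} = \frac{29725}{40683} \approx 0.7306 \\
F_2^{\text{macro}} &= \tfrac{1}{2}\left(F_2^{(0)} + F_2^{(1)}\right) \approx 0.7285
\end{aligned}
$$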

## XGBoost

In [63]:
model_XGB = Pipeline(steps=[
    ('preprocesador', preprocessor),
    ('clasificador', XGBClassifier(n_jobs=-1, random_state=0))])

In [64]:
#Training
model_XGB.fit(X_train, y_train)

In [65]:
#Save model
with open('./models/XGBoost.pickle', 'wb') as f:
    pickle.dump(model_XGB, f)

In [66]:
#Load the model
with open('./models/XGBoost.pickle', 'rb') as f:
    model_XGB = pickle.load(f)

In [67]:
# Predictions
y_pred_xgb = model_XGB.predict(X_validation)
y_pred_proba_xgb = model_XGB.predict_proba(X_validation)

In [68]:
fx.evaluate_model(y_validation, y_pred_xgb,y_pred_proba_xgb)

ROC-AUC score of the model: 0.7740890769771598
Accuracy of the model: 0.7016327031671986

Classification report: 
              precision    recall  f1-score   support

           0       0.70      0.71      0.70      8123
           1       0.70      0.70      0.70      8169

    accuracy                           0.70     16292
   macro avg       0.70      0.70      0.70     16292
weighted avg       0.70      0.70      0.70     16292


Confusion matrix: 
[[5728 2395]
 [2466 5703]]

F2 Score: 
0.7016351796782787



XGBoost reaches an accuracy of 70%. Since this is close to the random forest's value, we will also need to look at computational time to decide which model is better.
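This notebook does not show how computational time is measured. One possible approach is sketched below with a hypothetical `time_fit_predict` helper and synthetic data from `make_classification`. Only scikit-learn estimators are used so that the block runs without extra dependencies. The same helper could be applied to the pipelines defined in this notebook.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split


def time_fit_predict(model, X_tr, y_tr, X_val):
    """Return (fit_seconds, predict_seconds) for one estimator."""
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    model.predict(X_val)
    predict_seconds = time.perf_counter() - start
    return fit_seconds, predict_seconds


# Synthetic data standing in for the real training set
X, y = make_classification(n_samples=2000, n_features=15, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.15, random_state=12345
)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

timings = {
    name: time_fit_predict(model, X_tr, y_tr, X_val)
    for name, model in candidates.items()
}
for name, (fit_s, pred_s) in timings.items():
    print(f"{name}: fit {fit_s:.3f}s, predict {pred_s:.3f}s")
```

A single run is noisy. Repeating each measurement a few times and reporting the median would give a fairer comparison.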

## AdaBoost

In [69]:
model_ADA = Pipeline(steps=[
    ('preprocesador', preprocessor),
    ('clasificador', AdaBoostClassifier(n_estimators=100, random_state=0))])

In [70]:
#Training
model_ADA.fit(X_train, y_train)

In [71]:
#Save model
with open('./models/ADA.pickle', 'wb') as f:
    pickle.dump(model_ADA, f)

In [72]:
#Load the model
with open('./models/ADA.pickle', 'rb') as f:
    model_ADA = pickle.load(f)

In [73]:
# Predictions
y_pred_ada = model_ADA.predict(X_validation)
y_pred_proba_ada = model_ADA.predict_proba(X_validation)

In [74]:
fx.evaluate_model(y_validation, y_pred_ada,y_pred_proba_ada)

ROC-AUC score of the model: 0.7424719795429516
Accuracy of the model: 0.6818070218512153

Classification report: 
              precision    recall  f1-score   support

           0       0.68      0.67      0.68      8123
           1       0.68      0.69      0.69      8169

    accuracy                           0.68     16292
   macro avg       0.68      0.68      0.68     16292
weighted avg       0.68      0.68      0.68     16292


Confusion matrix: 
[[5449 2674]
 [2510 5659]]

F2 Score: 
0.681750716474139



AdaBoost reaches an accuracy of 68%, in line with the previous models although slightly lower. As noted for the previous model, we will need to look at other metrics and at the model's computational time.

## LightGBM

In [75]:
model_LightGBM = Pipeline(steps=[
    ('preprocesador', preprocessor),
    ('clasificador', lgb.LGBMClassifier(n_jobs=-1, random_state=0))])

In [76]:
#Training
model_LightGBM.fit(X_train, y_train)

In [77]:
#Save model
with open('./models/GBM.pickle', 'wb') as f:
    pickle.dump(model_LightGBM, f)

In [78]:
#Load the model
with open('./models/GBM.pickle', 'rb') as f:
    model_LightGBM = pickle.load(f)

In [79]:
# Predictions
y_pred_gbm = model_LightGBM.predict(X_validation)
y_pred_proba_gbm = model_LightGBM.predict_proba(X_validation)

In [80]:
fx.evaluate_model(y_validation, y_pred_gbm,y_pred_proba_gbm)

ROC-AUC score of the model: 0.7693555973407814
Accuracy of the model: 0.6975202553400442

Classification report: 
              precision    recall  f1-score   support

           0       0.70      0.70      0.70      8123
           1       0.70      0.70      0.70      8169

    accuracy                           0.70     16292
   macro avg       0.70      0.70      0.70     16292
weighted avg       0.70      0.70      0.70     16292


Confusion matrix: 
[[5650 2473]
 [2455 5714]]

F2 Score: 
0.6975148653332831



Finally, LightGBM gives an accuracy of 70%, very close to the previous models.

To conclude, all the models reach broadly similar accuracy. Nevertheless, we choose the random forest as the best model: although its computational time is higher than that of the others (all of them have a low load time in general), its metrics are above the rest, notably an accuracy of 73%, an F2 score of 73% and a ROC-AUC of 80%.
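The comparison can be summarized in a single table. The sketch below collects the validation metrics printed in the outputs above, rounded to four decimals, and ranks the models by ROC-AUC. The ranking criterion is a choice made here for illustration.

```python
import pandas as pd

# Validation metrics copied from the cell outputs above (rounded to 4 decimals)
results = pd.DataFrame(
    {
        "roc_auc": [0.5000, 0.8023, 0.7741, 0.7425, 0.7694],
        "accuracy": [0.5014, 0.7285, 0.7016, 0.6818, 0.6975],
        "f2": [0.4171, 0.7285, 0.7016, 0.6818, 0.6975],
    },
    index=["Base", "Random Forest", "XGBoost", "AdaBoost", "LightGBM"],
)

# Rank the models from best to worst ROC-AUC
ranking = results.sort_values("roc_auc", ascending=False)
best = ranking.index[0]

print(ranking)
print("Best model by ROC-AUC:", best)
```

The random forest leads on all three metrics, while XGBoost and LightGBM are within about one point of each other.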