# Base model

We create a baseline model that predicts 0 (not fraud) for every observation, regardless of the input features.

In [99]:
#Libraries
import pandas as pd
import pickle
from statistics import mode
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score, accuracy_score, classification_report
from sklearn.pipeline import Pipeline
import scikitplot as skplt
import matplotlib.pyplot as plt

In [100]:
#We read the data
X_train = pd.read_parquet("../data/processed/X_train.parquet")
y_train = pd.read_parquet("../data/processed/y_train.parquet")
X_test = pd.read_parquet("../data/processed/X_test.parquet")
y_test = pd.read_parquet("../data/processed/y_test.parquet")
X_train_scaled = pd.read_parquet("../data/processed/X_train_scaled.parquet")

In [101]:
#We load the preprocessor
with open('../models/preprocessor.pickle', 'rb') as f:
    preprocessor = pickle.load(f)

In [102]:
#We create a class that predicts the same value (0, not fraud) for every observation
class ModeloBase():
    def __init__(self):
        self.prediccion = 0
        
    def fit(self, y_train):
        # Take the mode (the most frequent value) of the target, which is 0: no fraud
        self.prediccion = mode(y_train)
        
    def predict(self, X):
        return [self.prediccion for _ in range(len(X))]

In [103]:
#We instantiate the model
modelo_base = ModeloBase()

#We train the model
modelo_base.fit(y_train['isfraud'])

#We create the prediction
y_pred_base = modelo_base.predict(X_test)
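For comparison, scikit-learn ships a baseline with the same behaviour, `DummyClassifier(strategy="most_frequent")`. The sketch below is not part of the original notebook. It uses a small made-up dataset (`X_toy` and `y_toy` are hypothetical names) instead of the parquet files, only to show that the dummy classifier always returns the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy data (hypothetical): 9 non-fraud rows and 1 fraud row
X_toy = np.zeros((10, 2))
y_toy = np.array([0] * 9 + [1])

# strategy="most_frequent" always predicts the most frequent training label
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_toy, y_toy)

y_pred_toy = dummy.predict(X_toy)
y_proba_toy = dummy.predict_proba(X_toy)  # one-hot rows on the majority class
print(y_pred_toy)
```

Unlike `ModeloBase`, this estimator also exposes `predict_proba`, so its output could be passed as the `ypred_proba` argument of an evaluation function that reports ROC-AUC. With a constant score, the ROC-AUC would be 0.5.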

In [104]:
#Function to evaluate the model
def evaluate_model(ytest, ypred, ypred_proba = None):
    if ypred_proba is not None:
        print('ROC-AUC score of the model: {}'.format(roc_auc_score(ytest, ypred_proba[:, 1])))
    print('Accuracy of the model: {}\n'.format(accuracy_score(ytest, ypred)))
    print('Classification report: \n{}\n'.format(classification_report(ytest, ypred)))
    print('Confusion matrix: \n{}\n'.format(confusion_matrix(ytest, ypred)))

In [105]:
evaluate_model(y_test, y_pred_base)

Accuracy of the model: 0.9989128102424719




Classification report: 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    209487
           1       0.00      0.00      0.00       228

    accuracy                           1.00    209715
   macro avg       0.50      0.50      0.50    209715
weighted avg       1.00      1.00      1.00    209715


Confusion matrix: 
[[209487      0]
 [   228      0]]



Support is the number of true observations of each class in the test set: 209487 are 0 (not fraud) and 228 are 1 (fraud). The model is useless for the task, because it does not identify a single fraud.

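Precision is computed over the columns of the confusion matrix (the predicted classes). For class 0 it is 209487 / 209715, about 0.999, which the report rounds to 1.00. For class 1 it is reported as 0: the second column is empty because the model never predicts fraud, so the ratio is 0/0, and scikit-learn sets it to 0 and issues a warning.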

Recall is computed over the rows of the confusion matrix (the true classes). The second row shows that all 228 frauds are predicted as 0 (not fraud), so the recall for class 1 is 0, while the recall for class 0 is 1.

The f1-score is the harmonic mean of precision and recall, so it tells the same story: 1.00 for class 0 and 0.00 for class 1.
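As a check on the reading above, the per-class metrics can be recomputed by hand from the confusion matrix. This is a minimal sketch in plain Python. The helper `safe_div` is a hypothetical name introduced here, and it returns 0 for a 0/0 ratio on the assumption that this mirrors the default behaviour of `classification_report`.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = [[209487, 0],
      [228, 0]]

def safe_div(num, den):
    # Return 0.0 when the denominator is 0 (assumed to match the report's default)
    return num / den if den else 0.0

metrics = {}
for k in (0, 1):
    tp = cm[k][k]                        # correctly predicted as class k
    predicted_k = cm[0][k] + cm[1][k]    # column sum: everything predicted as k
    actual_k = cm[k][0] + cm[k][1]       # row sum: everything that truly is k (support)
    precision = safe_div(tp, predicted_k)
    recall = safe_div(tp, actual_k)
    f1 = safe_div(2 * precision * recall, precision + recall)
    metrics[k] = (precision, recall, f1)
    print(k, round(precision, 4), round(recall, 4), round(f1, 4))
```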

The accuracy is high only because the target variable is heavily imbalanced: about 99.89% of the observations are 0, so a model that never predicts fraud is still right almost every time.
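The effect of the imbalance can be made explicit by comparing accuracy with balanced accuracy, the mean of the per-class recalls. Balanced accuracy is not used in the original notebook. The sketch below introduces it only as an illustration and takes the class counts from the report above.

```python
n_non_fraud, n_fraud = 209487, 228   # support values from the report above

# A model that always predicts 0 is right on every non-fraud row
# and wrong on every fraud row
accuracy = n_non_fraud / (n_non_fraud + n_fraud)

# Balanced accuracy: mean of the per-class recalls
recall_0 = n_non_fraud / n_non_fraud   # every true 0 is predicted as 0
recall_1 = 0 / n_fraud                 # no true 1 is predicted as 1
balanced_accuracy = (recall_0 + recall_1) / 2

print(accuracy, balanced_accuracy)
```

The balanced accuracy is 0.5, the same value as the macro-average recall in the classification report, which is what a model with no discriminating power obtains.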

In [129]:
#We save the model
with open('../models/modelo_base.pickle', 'wb') as f:
    pickle.dump(modelo_base, f)