# Fraud Detection - Decision Tree Experimentation

In this notebook we experiment with a decision tree model that is deeper than the one used in the experimentation stage, since the decision tree was the model that produced the best results there.

In [None]:
# Load the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pprint import pprint
import sklearn.metrics

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# plot_confusion_matrix is not imported: it is unused here and was removed in scikit-learn 1.2
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report, confusion_matrix, precision_recall_curve, auc
from sklearn.tree import DecisionTreeClassifier

from imblearn.over_sampling import RandomOverSampler

import pickle

from google.colab import drive

In [None]:
# Connect to Google Drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [None]:
# Load the dataset
df = pd.read_csv('/content/drive/MyDrive/Aprendizaje de máquina 1/TP 1/data/PS_20174392719_1491204439457_log.csv')

In [None]:
# Check that the data was loaded correctly
df.head()

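|   | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|------|------|--------|----------|---------------|----------------|----------|----------------|----------------|---------|----------------|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 0 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 | 0 |
| 3 | 1 | CASH_OUT | 181.0 | C840083671 | 181.0 | 0.0 | C38997010 | 21182.0 | 0.0 | 1 | 0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 | 0 |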


### Data preprocessing

#### 1. Rename columns

In [None]:
# Rename the columns to snake_case
columns = {
    'step': 'step',
    'type': 'type',
    'amount': 'amount',
    'nameOrig': 'name_orig',
    'oldbalanceOrg': 'old_balance_org',
    'newbalanceOrig': 'new_balance_orig',
    'nameDest': 'name_dest',
    'oldbalanceDest': 'old_balance_dest',
    'newbalanceDest': 'new_balance_dest',
    'isFraud': 'is_fraud',
    'isFlaggedFraud': 'is_flagged_fraud',
}

df.rename(columns=columns, inplace=True)

#### 2. Drop unnecessary columns

In [None]:
# Drop the columns that are not useful for the model
df.drop(columns=['name_orig', 'name_dest', 'is_flagged_fraud'], inplace=True)

#### 3. Split into predictor variables (X) and target variable (y)

In [None]:
# Define X and y
X = df[['type', 'step', 'amount', 'old_balance_org', 'new_balance_orig', 'old_balance_dest', 'new_balance_dest']]
y = df[['is_fraud']]

#### 4. Data transformation and preprocessing pipeline

In [None]:
numeric_features = ['step', 'amount', 'old_balance_org', 'new_balance_orig', 'old_balance_dest', 'new_balance_dest']
categorical_features = ['type']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
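
To make the two branches of the preprocessor concrete, here is a minimal hand-rolled sketch of what each one does conceptually: median imputation followed by standardization for numeric columns, and one-hot encoding that maps unseen categories to an all-zero vector for `type`. It uses only the standard library and is not scikit-learn's implementation. The helper names `standardize` and `one_hot`, the missing value and the `'WIRE'` category are made up for illustration.

```python
from statistics import fmean, median, pstdev

def standardize(values):
    """Fill missing values (None) with the median, then scale to zero mean and unit variance."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    # Population standard deviation (ddof=0), the convention StandardScaler follows
    mean, std = fmean(filled), pstdev(filled)
    return [(v - mean) / std for v in filled]

def one_hot(value, categories):
    """Return a 0/1 vector; a category not seen during fit maps to all zeros."""
    return [1 if value == c else 0 for c in categories]

amounts = [9839.64, 1864.28, None, 181.0, 11668.14]  # None is a made-up missing value
categories = ['CASH_OUT', 'PAYMENT', 'TRANSFER']     # categories assumed seen during fit

scaled = standardize(amounts)
known = one_hot('PAYMENT', categories)
unknown = one_hot('WIRE', categories)                # 'WIRE' is a hypothetical unseen category

print([round(v, 3) for v in scaled])
print(known, unknown)
```

Because the preprocessor lives inside the model pipeline, the medians, means and standard deviations are learned when the pipeline is fitted and then reused unchanged at prediction time.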

#### 5. Train/test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
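
With a class as rare as fraud, `stratify=y` matters: it keeps the proportion of each class the same in both partitions. The sketch below illustrates the idea with the standard library only, on toy labels. The function `stratified_split` is a hypothetical helper written for this illustration and does not reproduce the shuffling that `train_test_split` performs internally.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Split indices so that each class contributes test_size of its rows to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

labels = [0] * 990 + [1] * 10  # toy data with 1% positives
train_idx, test_idx = stratified_split(labels)

train_pos = sum(labels[i] for i in train_idx)
test_pos = sum(labels[i] for i in test_idx)
print(len(train_idx), train_pos)
print(len(test_idx), test_pos)
```

Both partitions end up with the same 1% share of positives, which is the property the stratified split provides for the fraud class.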

### Experiment setup

#### 1. Model definition

In [None]:
decision_tree_model = Pipeline(steps=[('preprocessor', preprocessor),
                       ('classifier', DecisionTreeClassifier(criterion='gini', max_depth=30))])
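
`max_depth=30` is an upper bound on the depth of the tree, not a guarantee that the fitted tree reaches it. As a rough feel for how permissive that bound is, the arithmetic below compares the maximum number of leaves of a binary tree of depth 30 with the size of the oversampled training set reported later in this notebook (5,083,526 rows per class). It is a back-of-the-envelope sketch and says nothing about the shape of the tree that is actually learned.

```python
import math

n_train_rows = 5_083_526 * 2  # oversampled training set, both classes balanced
max_depth = 30

# A binary tree of depth d has at most 2**d leaves
max_leaves = 2 ** max_depth

# Depth a perfectly balanced binary tree would need to give every row its own leaf
depth_for_all_rows = math.ceil(math.log2(n_train_rows))

print(f"Maximum leaves at depth {max_depth}: {max_leaves:,}")
print(f"Training rows: {n_train_rows:,}")
print(f"Balanced depth needed for one leaf per row: {depth_for_all_rows}")
```

The leaf budget at depth 30 is about a hundred times the number of training rows, so the depth limit alone does little to restrain the tree.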

#### 2. Oversampling the data

In [None]:
oversampler = RandomOverSampler(sampling_strategy='minority')

# Resample only the training data; the test set keeps its original class balance
X_train_os, y_train_os = oversampler.fit_resample(X_train, y_train)

print('Training set composition:')
print(y_train_os.value_counts())

print('\nTest set composition:')
print(y_test.value_counts())

Training set composition:
is_fraud
0           5083526
1           5083526
dtype: int64

Test set composition:
is_fraud
0           1270881
1              1643
dtype: int64
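
The counts above show both classes at the same size in the training set, while the test set keeps its original imbalance. Conceptually, random oversampling reaches that balance by drawing minority rows with replacement until they match the majority count. The following is a standard-library sketch of that idea on toy data. The function `oversample_minority` is a hypothetical helper, not the `imblearn` implementation.

```python
import random
from collections import Counter

def oversample_minority(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until both classes have the same count."""
    rng = random.Random(seed)
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    majority_count = max(counts.values())

    minority_idx = [i for i, label in enumerate(labels) if label == minority]
    n_extra = majority_count - counts[minority]
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]  # sampling with replacement

    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]

rows = [[float(i)] for i in range(12)]  # toy feature rows
labels = [0] * 10 + [1] * 2             # toy labels: 10 majority, 2 minority

rows_os, labels_os = oversample_minority(rows, labels)
balanced_counts = Counter(labels_os)
print(balanced_counts)
```

Since the added rows are exact copies, oversampling must happen after the train/test split, as done here. Otherwise copies of the same transaction could land in both partitions and inflate the test metrics.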


#### 3. Metric definitions

In [None]:
def metric_report(y_test, y_pred, y_proba):
    # Per-class precision, recall and f1-score
    print(classification_report(y_test, y_pred))
    # y_proba[:, 1] holds the predicted probability of the positive (fraud) class
    print('Area under the ROC curve:', np.round(roc_auc_score(y_test, y_proba[:, 1]), 4))
    precision, recall, _ = precision_recall_curve(y_test, y_proba[:, 1])
    print('Area under the Precision-Recall curve:', np.round(auc(recall, precision), 4))
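
For reference, the per-class numbers in the classification report come from simple counts of true positives, false positives and false negatives. The sketch below computes them by hand for the positive class on made-up labels. The function `precision_recall_f1` is a hypothetical helper for illustration, and the label lists are not taken from the dataset.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and f1-score for one class from label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: 4 actual positives, 4 predicted positives, 3 of them correct
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred_toy = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]

precision_toy, recall_toy, f1_toy = precision_recall_f1(y_true_toy, y_pred_toy)
print(precision_toy, recall_toy, f1_toy)
```

Precision answers "of the transactions flagged as fraud, how many were fraud", while recall answers "of the actual frauds, how many were flagged".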

### Model training

In [None]:
decision_tree_model.fit(X_train_os, y_train_os)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['step', 'amount',
                                                   'old_balance_org',
                                                   'new_balance_orig',
                                                   'old_balance_dest',
                                                   'new_balance_dest']),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['type'])])),
                ('classifier', DecisionTreeClassifier(max_depth=30))])


### Analysis of the results

In [None]:
y_pred = decision_tree_model.predict(X_test)

In [None]:
y_proba = decision_tree_model.predict_proba(X_test)

In [None]:
metric_report(y_test, y_pred, y_proba) 

              precision    recall  f1-score   support

           0       1.00      1.00      1.00   1270881
           1       0.89      0.87      0.88      1643

    accuracy                           1.00   1272524
   macro avg       0.95      0.93      0.94   1272524
weighted avg       1.00      1.00      1.00   1272524

Area under the ROC curve: 0.9327
Area under the Precision-Recall curve: 0.8788
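
The accuracy of 1.00 in the report should be read with the class imbalance in mind. Using only the class supports printed in the report, the arithmetic below shows the accuracy that a trivial classifier predicting "not fraud" for every transaction would obtain on this test set. The variable names are chosen for this illustration.

```python
n_legit = 1_270_881  # support of class 0 in the test set
n_fraud = 1_643      # support of class 1 in the test set
n_total = n_legit + n_fraud

# A classifier that always predicts class 0 gets every legitimate row right
baseline_accuracy = n_legit / n_total
fraud_share = n_fraud / n_total

print(f"Accuracy of always predicting class 0: {baseline_accuracy:.4f}")
print(f"Share of fraud in the test set: {fraud_share:.4%}")
```

Since that trivial baseline also rounds to 1.00, the precision, recall and f1-score of class 1 are the figures that actually describe how well fraud is detected.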


In [None]:
# Save the trained pipeline (preprocessing + classifier) to Google Drive
filename = '/content/drive/MyDrive/Aprendizaje de máquina 1/TP 1/models/decision_tree_30.sav'
with open(filename, 'wb') as f:
    pickle.dump(decision_tree_model, f)
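
Loading the model back is the mirror operation. The sketch below shows a full save and load round trip with `pickle`, using a plain dictionary as a stand-in for the fitted pipeline and a temporary directory instead of the Drive path, so both are assumptions made only to keep the example self-contained.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted decision_tree_model
model = {'classifier': 'DecisionTreeClassifier', 'max_depth': 30}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'decision_tree_30.sav')

    # Save: context managers make sure the file is flushed and closed
    with open(path, 'wb') as f:
        pickle.dump(model, f)

    # Load: read the bytes back into an equivalent object
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == model)
```

Pickle files should only be loaded from trusted sources, and a pickled scikit-learn pipeline is best loaded with the same library version that was used to save it.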

### Conclusion

We obtained a model that performs adequately on the fraud detection task at hand.

**The model is a decision tree with a maximum depth of 30 levels.**

The model performs acceptably on both the majority and the minority class across the three main metrics: precision, recall and f1-score. On the fraud class it reaches a precision of 0.89, a recall of 0.87 and an f1-score of 0.88. The ROC AUC of 0.9327 is close to 1, which indicates that the classes are being separated adequately, and the area under the precision-recall curve is 0.8788.