**Machine learning notebook for classifying fraudulent insurance claims**

This notebook processes the data and walks through the full workflow needed to train and test several machine learning models that help identify fraudulent insurance claims. The aim is to help the company's decision makers improve insurance sales policies, identify impostors more precisely, and avoid paying out claims that are not genuine. The approach relies on automated knowledge discovery: data mining techniques can surface features, and combinations of features, that are not evident at first glance, giving a clearer picture of the warning signs and indicators most strongly associated with past fraud cases so that similar cases can be prevented in the future.

The variable the models will try to classify or predict is fraudfound_p, which indicates whether an insurance claim was fraudulent or not.

Importing the required libraries

In [4]:
# Modeling notebook.
# Install the dependencies if they are missing
#! pip install scikit-learn
#! pip install numpy
#! pip install imblearn
! pip install plotly

# Standard library
import re
from collections import Counter

# Data handling and statistics
import numpy as np
import pandas as pd
import scipy.stats as stats
from scipy.stats import chi2_contingency

# Plotting
import matplotlib.pyplot as plt
from matplotlib import cm
import plotly.express as px

# Preprocessing, resampling and feature selection
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Model selection and evaluation
from sklearn import model_selection
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.inspection import permutation_importance
from sklearn.pipeline import make_pipeline

# Algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Model persistence
from joblib import dump


Collecting plotly
  Downloading plotly-5.15.0-py2.py3-none-any.whl (15.5 MB)
     --------------------------------------- 15.5/15.5 MB 10.7 MB/s eta 0:00:00
Collecting tenacity>=6.2.0
  Downloading tenacity-8.2.2-py3-none-any.whl (24 kB)
Installing collected packages: tenacity, plotly
Successfully installed plotly-5.15.0 tenacity-8.2.2


Reading the data

In [7]:
df=pd.read_csv("C:/Users/USER/OneDrive/Escritorio/DSJRR5/r5-ds-challenge/data/datoseda.csv")

Getting the column names of the dataset

In [8]:
df.columns

Index(['monthh', 'weekofmonth', 'dayofweek', 'make', 'accidentarea',
       'dayofweekclaimed', 'monthclaimed', 'weekofmonthclaimed', 'sex',
       'maritalstatus', 'age', 'fault', 'policytype', 'vehiclecategory',
       'vehicleprice', 'fraudfound_p', 'policynumber', 'repnumber',
       'deductible', 'driverrating', 'days_policy_accident',
       'days_policy_claim', 'pastnumberofclaims', 'ageofvehicle',
       'ageofpolicyholder', 'policereportfiled', 'witnesspresent', 'agenttype',
       'numberofsuppliments', 'addresschange_claim', 'numberofcars', 'yearr',
       'basepolicy'],
      dtype='object')

The policynumber column contains consecutive values that are of no use for prediction, so we drop it.

In [9]:
df = df.drop("policynumber", axis =1 )

Converting the columns that should be categorical to object type

In [10]:
df['weekofmonth'] = df['weekofmonth'].astype(object)
df['weekofmonthclaimed'] = df['weekofmonthclaimed'].astype(object)
df['repnumber'] = df['repnumber'].astype(object)
df['deductible'] = df['deductible'].astype(object)
df['driverrating'] = df['driverrating'].astype(object)
df['policereportfiled'] = df['policereportfiled'].astype(object)
df['witnesspresent'] = df['witnesspresent'].astype(object)
df['yearr'] = df['yearr'].astype(object)
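
The eight casts above can also be expressed as a single call. The following is a minimal sketch on a toy frame; the values are invented for illustration and only the column names come from the notebook.

```python
import pandas as pd

# Toy frame: values are made up, column names mirror the notebook.
toy = pd.DataFrame({
    "weekofmonth": [1, 2, 5],
    "deductible": [400, 400, 500],
    "age": [21.0, 34.0, 50.0],
})

# Columns stored as numeric codes that should be treated as categories.
as_categorical = ["weekofmonth", "deductible"]

# astype accepts a {column: dtype} mapping and leaves other columns untouched.
toy = toy.astype({col: object for col in as_categorical})

print(toy.dtypes)
```

Keeping the column names in one list makes it easier to apply the same cast later to other frames, as the notebook does for the training and test sets.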

In [11]:
df.dtypes

monthh                   object
weekofmonth              object
dayofweek                object
make                     object
accidentarea             object
dayofweekclaimed         object
monthclaimed             object
weekofmonthclaimed       object
sex                      object
maritalstatus            object
age                     float64
fault                    object
policytype               object
vehiclecategory          object
vehicleprice             object
fraudfound_p              int64
repnumber                object
deductible               object
driverrating             object
days_policy_accident     object
days_policy_claim        object
pastnumberofclaims       object
ageofvehicle             object
ageofpolicyholder        object
policereportfiled        object
witnesspresent           object
agenttype                object
numberofsuppliments      object
addresschange_claim      object
numberofcars             object
yearr                    object
basepoli

In [12]:
categoricas = df.select_dtypes(include='object').columns.tolist()
categoricas


['monthh',
 'weekofmonth',
 'dayofweek',
 'make',
 'accidentarea',
 'dayofweekclaimed',
 'monthclaimed',
 'weekofmonthclaimed',
 'sex',
 'maritalstatus',
 'fault',
 'policytype',
 'vehiclecategory',
 'vehicleprice',
 'repnumber',
 'deductible',
 'driverrating',
 'days_policy_accident',
 'days_policy_claim',
 'pastnumberofclaims',
 'ageofvehicle',
 'ageofpolicyholder',
 'policereportfiled',
 'witnesspresent',
 'agenttype',
 'numberofsuppliments',
 'addresschange_claim',
 'numberofcars',
 'yearr',
 'basepolicy']

In [13]:
numericas = df["age"]

Creating X and y to hold the features and the target

In [14]:
X = df.drop("fraudfound_p",axis= 1)
y = df["fraudfound_p"]

In [15]:
X

Unnamed: 0,monthh,weekofmonth,dayofweek,make,accidentarea,dayofweekclaimed,monthclaimed,weekofmonthclaimed,sex,maritalstatus,...,ageofvehicle,ageofpolicyholder,policereportfiled,witnesspresent,agenttype,numberofsuppliments,addresschange_claim,numberofcars,yearr,basepolicy
0,Jul,2,Friday,Toyota,Urban,Thursday,Jul,3,Male,Married,...,7 years,36 to 40,0,0,External,0,no change,1 vehicle,1994,All Perils
1,May,4,Tuesday,Pontiac,Urban,Wednesday,May,5,Female,Single,...,7 years,36 to 40,0,0,External,0,no change,1 vehicle,1994,All Perils
2,Jan,2,Thursday,Toyota,Urban,Thursday,Jan,2,Male,Married,...,more than 7,51 to 65,0,0,External,3 to 5,no change,1 vehicle,1994,Liability
3,Oct,3,Thursday,Pontiac,Urban,Monday,Oct,4,Male,Married,...,7 years,31 to 35,0,0,External,0,no change,1 vehicle,1996,All Perils
4,Aug,2,Monday,Pontiac,Urban,Tuesday,Aug,2,Male,Married,...,6 years,31 to 35,0,0,External,3 to 5,no change,1 vehicle,1996,All Perils
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15415,Aug,4,Tuesday,Toyota,Urban,Tuesday,Aug,4,Male,Married,...,more than 7,51 to 65,0,0,External,more than 5,no change,1 vehicle,1996,Liability
15416,Nov,2,Saturday,Ford,Urban,Thursday,Nov,3,Female,Single,...,4 years,26 to 30,0,0,External,3 to 5,no change,1 vehicle,1996,Liability
15417,Sep,1,Thursday,Pontiac,Urban,Thursday,Sep,1,Male,Married,...,7 years,31 to 35,0,0,External,0,no change,1 vehicle,1996,Liability
15418,Sep,1,Saturday,Mazda,Urban,Thursday,Sep,2,Male,Married,...,7 years,36 to 40,0,0,External,1 to 2,no change,1 vehicle,1996,Collision


In [16]:
y

0        0
1        0
2        0
3        0
4        0
        ..
15415    0
15416    0
15417    0
15418    1
15419    0
Name: fraudfound_p, Length: 15420, dtype: int64

Class distribution before any undersampling or oversampling

In [17]:
print('Original class distribution:   ', y.value_counts())

Original class distribution:    fraudfound_p
0    14497
1      923
Name: count, dtype: int64


In [18]:
categorical_features = X.select_dtypes(include="object").columns.tolist()
numeric_features = X.select_dtypes(include='float64').columns.tolist()

The imbalance in the target class needs to be corrected. If a model is trained with the target classes left as they are, it will learn far more from the non-fraudulent cases and will be poorly equipped to identify instances of the minority class, which is the one we care about because it corresponds to fraud. For this reason, RandomOverSampler is used to balance the classes artificially by randomly duplicating minority-class rows, so that the trained model is better at predicting the minority class.

In [19]:
ros = RandomOverSampler(sampling_strategy='minority', random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)

In [20]:
print('New class distribution:   ', y_resampled.value_counts())

New class distribution:    fraudfound_p
0    14497
1    14497
Name: count, dtype: int64
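
To make concrete what random oversampling does, here is a hand-rolled sketch using only pandas. It is an illustration of the idea, not imblearn's implementation, and the toy values are invented. The point it shows is that every added row is an exact copy of an existing minority row; SMOTE, which is imported above but not used here, is the technique that interpolates new synthetic points instead.

```python
import pandas as pd

# Toy imbalanced data: 6 legitimate claims (0) and 2 fraudulent ones (1).
toy = pd.DataFrame({
    "age": [25, 31, 47, 52, 38, 44, 29, 60],
    "fraudfound_p": [0, 0, 0, 0, 0, 0, 1, 1],
})

majority = toy[toy["fraudfound_p"] == 0]
minority = toy[toy["fraudfound_p"] == 1]

# Draw minority rows with replacement until both classes have the same size.
n_extra = len(majority) - len(minority)
extra = minority.sample(n=n_extra, replace=True, random_state=42)
balanced = pd.concat([majority, minority, extra])

print(balanced["fraudfound_p"].value_counts())

# The fraud rows only ever carry the two original ages: they are duplicates.
fraud_ages = set(balanced.loc[balanced["fraudfound_p"] == 1, "age"])
print(fraud_ages)
```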


In [21]:
# random_state fixes the random seed so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.3, random_state=1,shuffle=True, stratify =y_resampled) # 70% training and 30% test
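
Because oversampled rows are exact copies, splitting after resampling can place copies of the same claim in both the training and the test set, which would tend to inflate test scores. A common alternative, sketched here under the assumption that only the training portion should be rebalanced, is to split first and oversample afterwards. The toy data and the helper name `oversample_minority` are hypothetical and not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def oversample_minority(X, y, random_state=42):
    """Duplicate minority-class rows at random until both classes are balanced."""
    counts = y.value_counts()
    minority_label = counts.idxmin()
    n_extra = counts.max() - counts.min()
    # Index labels of the minority rows to duplicate (sampled with replacement).
    extra_idx = y[y == minority_label].sample(
        n=n_extra, replace=True, random_state=random_state
    ).index
    return pd.concat([X, X.loc[extra_idx]]), pd.concat([y, y.loc[extra_idx]])


# Toy data: 20 rows, 4 of them fraudulent.
X_toy = pd.DataFrame({"age": range(20, 40)})
y_toy = pd.Series([0] * 16 + [1] * 4, name="fraudfound_p")

# 1) Split the original data, stratifying on the imbalanced target.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=1, stratify=y_toy
)

# 2) Rebalance only the training portion; the test set keeps the real class ratio.
X_tr_bal, y_tr_bal = oversample_minority(X_tr, y_tr)

print(y_tr_bal.value_counts())
print(y_te.value_counts())
```

With this ordering no row of the test set can also appear in the training set, and test metrics are measured on the original class proportions.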

In [22]:
X_test.shape

(8699, 31)

In [23]:
y_test.shape

(8699,)

Imputing missing values. For simplicity, and as a first experiment, missing values in categorical variables are imputed with the mode of the dataset, and missing values in numeric variables are imputed with the mean. This approach has its limitations, but we adopt it here for simplicity; it can be improved later.

In [24]:
def imputacion(X_train):
    for col in X_train.columns:
        if X_train[col].dtype == 'object':
            # Categorical column: impute with the mode of the full dataset (df)
            X_train[col] = X_train[col].fillna(df[col].mode()[0])
        else:
            # Numeric column: impute with the column mean
            X_train[col] = X_train[col].fillna(X_train[col].mean())
    return X_train
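
One possible refinement of this imputation, shown as a sketch rather than as the notebook's method, is to learn the fill values from the training data only and reuse them for any other split. The helper name `learn_fill_values` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd


def learn_fill_values(X_train):
    """Return {column: fill value}: mean for numeric columns, mode otherwise."""
    fill = {}
    for col in X_train.columns:
        if pd.api.types.is_numeric_dtype(X_train[col]):
            fill[col] = X_train[col].mean()
        else:
            fill[col] = X_train[col].mode()[0]
    return fill


# Toy frames with missing values in both a categorical and a numeric column.
train = pd.DataFrame({
    "make": ["Toyota", "Toyota", "Honda", None],
    "age": [30.0, 40.0, np.nan, 50.0],
})
test = pd.DataFrame({
    "make": [None, "Mazda"],
    "age": [np.nan, 22.0],
})

fill_values = learn_fill_values(train)

# fillna with a dict fills each column with its own value and returns a copy.
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

print(fill_values)
print(test_filled)
```

Returning copies instead of modifying the input in place also avoids accidentally changing `X_train` or `X_test` when the function is called.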

In [25]:
X_train.dtypes

monthh                   object
weekofmonth              object
dayofweek                object
make                     object
accidentarea             object
dayofweekclaimed         object
monthclaimed             object
weekofmonthclaimed       object
sex                      object
maritalstatus            object
age                     float64
fault                    object
policytype               object
vehiclecategory          object
vehicleprice             object
repnumber                object
deductible               object
driverrating             object
days_policy_accident     object
days_policy_claim        object
pastnumberofclaims       object
ageofvehicle             object
ageofpolicyholder        object
policereportfiled        object
witnesspresent           object
agenttype                object
numberofsuppliments      object
addresschange_claim      object
numberofcars             object
yearr                    object
basepolicy               object
dtype: o

In [26]:
X_train_imputado = imputacion(X_train)
X_train_imputado

Unnamed: 0,monthh,weekofmonth,dayofweek,make,accidentarea,dayofweekclaimed,monthclaimed,weekofmonthclaimed,sex,maritalstatus,...,ageofvehicle,ageofpolicyholder,policereportfiled,witnesspresent,agenttype,numberofsuppliments,addresschange_claim,numberofcars,yearr,basepolicy
18488,Aug,1,Monday,Accura,Urban,Wednesday,Aug,5,Male,Married,...,7 years,31 to 35,0,0,External,0,no change,1 vehicle,1994,All Perils
9309,Nov,1,Thursday,Toyota,Urban,Monday,Dec,2,Male,Married,...,7 years,41 to 50,0,0,External,1 to 2,no change,1 vehicle,1994,Collision
25377,Mar,2,Thursday,Honda,Urban,Monday,Jul,3,Male,Single,...,5 years,31 to 35,0,0,External,0,no change,1 vehicle,1995,Collision
23687,Jan,4,Monday,Chevrolet,Urban,Wednesday,Feb,1,Female,Single,...,7 years,31 to 35,0,0,External,1 to 2,4 to 8 years,2 vehicles,1994,All Perils
18228,Apr,2,Monday,Pontiac,Urban,Friday,Apr,3,Male,Married,...,7 years,36 to 40,0,0,External,0,2 to 3 years,2 vehicles,1994,All Perils
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16268,Aug,5,Wednesday,Toyota,Urban,Wednesday,Sep,4,Male,Married,...,6 years,31 to 35,0,0,External,more than 5,no change,1 vehicle,1994,All Perils
20571,Dec,3,Monday,Accura,Urban,Wednesday,Dec,5,Male,Single,...,7 years,36 to 40,0,0,External,0,no change,1 vehicle,1995,All Perils
16170,Feb,2,Friday,Toyota,Urban,Saturday,Feb,2,Male,Single,...,6 years,31 to 35,0,0,External,0,no change,1 vehicle,1995,All Perils
15343,Jul,5,Saturday,Chevrolet,Urban,Monday,Aug,1,Male,Married,...,more than 7,51 to 65,0,0,External,more than 5,no change,1 vehicle,1996,Liability


In [27]:
X_train_imputado.nunique()


monthh                  12
weekofmonth              5
dayofweek                7
make                    18
accidentarea             2
dayofweekclaimed         7
monthclaimed            12
weekofmonthclaimed       5
sex                      2
maritalstatus            4
age                     66
fault                    2
policytype               9
vehiclecategory          3
vehicleprice             6
repnumber               16
deductible               4
driverrating             4
days_policy_accident     5
days_policy_claim        4
pastnumberofclaims       4
ageofvehicle             8
ageofpolicyholder        9
policereportfiled        2
witnesspresent           2
agenttype                2
numberofsuppliments      4
addresschange_claim      5
numberofcars             5
yearr                    3
basepolicy               3
dtype: int64

In [28]:
X_train_imputado.isnull().any()

monthh                  False
weekofmonth             False
dayofweek               False
make                    False
accidentarea            False
dayofweekclaimed        False
monthclaimed            False
weekofmonthclaimed      False
sex                     False
maritalstatus           False
age                     False
fault                   False
policytype              False
vehiclecategory         False
vehicleprice            False
repnumber               False
deductible              False
driverrating            False
days_policy_accident    False
days_policy_claim       False
pastnumberofclaims      False
ageofvehicle            False
ageofpolicyholder       False
policereportfiled       False
witnesspresent          False
agenttype               False
numberofsuppliments     False
addresschange_claim     False
numberofcars            False
yearr                   False
basepolicy              False
dtype: bool

In [29]:
y_train.isnull().any()

False

In [30]:
X_train_imputado['weekofmonth'] = X_train_imputado['weekofmonth'].astype(object)
X_train_imputado['weekofmonthclaimed'] = X_train_imputado['weekofmonthclaimed'].astype(object)
X_train_imputado['repnumber'] = X_train_imputado['repnumber'].astype(object)
X_train_imputado['deductible'] = X_train_imputado['deductible'].astype(object)
X_train_imputado['driverrating'] = X_train_imputado['driverrating'].astype(object)
X_train_imputado['policereportfiled'] = X_train_imputado['policereportfiled'].astype(object)
X_train_imputado['witnesspresent'] = X_train_imputado['witnesspresent'].astype(object)
X_train_imputado['yearr'] = X_train_imputado['yearr'].astype(object)


In [31]:
X_train_imputado.dtypes

monthh                   object
weekofmonth              object
dayofweek                object
make                     object
accidentarea             object
dayofweekclaimed         object
monthclaimed             object
weekofmonthclaimed       object
sex                      object
maritalstatus            object
age                     float64
fault                    object
policytype               object
vehiclecategory          object
vehicleprice             object
repnumber                object
deductible               object
driverrating             object
days_policy_accident     object
days_policy_claim        object
pastnumberofclaims       object
ageofvehicle             object
ageofpolicyholder        object
policereportfiled        object
witnesspresent           object
agenttype                object
numberofsuppliments      object
addresschange_claim      object
numberofcars             object
yearr                    object
basepolicy               object
dtype: o

In [32]:
categorical_features

['monthh',
 'weekofmonth',
 'dayofweek',
 'make',
 'accidentarea',
 'dayofweekclaimed',
 'monthclaimed',
 'weekofmonthclaimed',
 'sex',
 'maritalstatus',
 'fault',
 'policytype',
 'vehiclecategory',
 'vehicleprice',
 'repnumber',
 'deductible',
 'driverrating',
 'days_policy_accident',
 'days_policy_claim',
 'pastnumberofclaims',
 'ageofvehicle',
 'ageofpolicyholder',
 'policereportfiled',
 'witnesspresent',
 'agenttype',
 'numberofsuppliments',
 'addresschange_claim',
 'numberofcars',
 'yearr',
 'basepolicy']

In [33]:
numeric_features

['age']

In [34]:
print('Class distribution in the training set after oversampling and splitting:', Counter(y_train))

Class distribution in the training set after oversampling and splitting: Counter({0: 10148, 1: 10147})


Scaling the age column so that it is in a range the models can handle well. StandardScaler is used for this: it standardizes the values by subtracting the mean and dividing by the standard deviation, so the column ends up with zero mean and unit variance while the relative differences between values are preserved.

In [35]:
scaler = StandardScaler()

X_train_imputado['age'] = scaler.fit_transform(X_train_imputado[['age']])

X_train_imputado['age']


18488   -1.005944
9309     0.563581
25377   -0.592911
23687   -1.088551
18228    0.067942
           ...   
16268   -1.088551
20571   -0.345091
16170   -1.088551
15343    1.554861
12372    1.472254
Name: age, Length: 20295, dtype: float64
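
The standardization applied above is z = (x - mean) / std. The sketch below reproduces it with NumPy on invented ages and also shows one way to scale a held-out set, assuming the training mean and standard deviation are reused rather than recomputed.

```python
import numpy as np

# Invented ages for illustration.
age_train = np.array([20.0, 30.0, 40.0, 50.0, 60.0])
age_test = np.array([25.0, 70.0])

# Statistics learned from the training data only.
mean = age_train.mean()
std = age_train.std()  # population standard deviation (ddof=0), as StandardScaler uses

z_train = (age_train - mean) / std
z_test = (age_test - mean) / std  # reuse the training statistics

print(z_train)
print(z_test)
```

With scikit-learn the equivalent pattern would be `scaler.fit_transform` on the training column followed by `scaler.transform` on the test column.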

In [36]:
# Convert the categorical columns to pandas Categorical dtype, keeping their original values
for col in categorical_features:
    X_train_imputado[col] = pd.Categorical(X_train_imputado[col])

# Encode the categorical columns with pd.get_dummies
encoded_cols_df = pd.get_dummies(X_train_imputado[categorical_features], columns=categorical_features)

# Replace the original columns with the encoded columns in the DataFrame
X_train_imputado = pd.concat([X_train_imputado.drop(categorical_features, axis=1), encoded_cols_df], axis=1)

In [37]:
X_train_imputado

Unnamed: 0,age,monthh_Apr,monthh_Aug,monthh_Dec,monthh_Feb,monthh_Jan,monthh_Jul,monthh_Jun,monthh_Mar,monthh_May,...,numberofcars_2 vehicles,numberofcars_3 to 4,numberofcars_5 to 8,numberofcars_more than 8,yearr_1994,yearr_1995,yearr_1996,basepolicy_All Perils,basepolicy_Collision,basepolicy_Liability
18488,-1.005944,False,True,False,False,False,False,False,False,False,...,False,False,False,False,True,False,False,True,False,False
9309,0.563581,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,True,False
25377,-0.592911,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,True,False,False,True,False
23687,-1.088551,False,False,False,False,True,False,False,False,False,...,True,False,False,False,True,False,False,True,False,False
18228,0.067942,True,False,False,False,False,False,False,False,False,...,True,False,False,False,True,False,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16268,-1.088551,False,True,False,False,False,False,False,False,False,...,False,False,False,False,True,False,False,True,False,False
20571,-0.345091,False,False,True,False,False,False,False,False,False,...,False,False,False,False,False,True,False,True,False,False
16170,-1.088551,False,False,False,True,False,False,False,False,False,...,False,False,False,False,False,True,False,True,False,False
15343,1.554861,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,True,False,False,True


Selecting the best model using cross-validation, which tests each model on different subsamples of the data and lets us observe the bias-variance tradeoff.

In [38]:
seed =2
models = []

#logistic Regression
models.append(('LR', LogisticRegression(solver='liblinear')))

# Decision Tree classifier
models.append(('CART', DecisionTreeClassifier()))

# Naïve Bayes
models.append(('NB', GaussianNB()))

# Decision tree with the Gini criterion set explicitly (the scikit-learn default, so it matches CART)
models.append(('dt_hyper', DecisionTreeClassifier(criterion= "gini")))
# SVM
models.append(('SVM', SVC(C=1.0, kernel='rbf', max_iter=1000, tol=1e-3)))
# evaluate each model in turn
results = []
names = []
scoring = 'recall'
for name, model in models:
	# K-fold cross-validation for model selection
	kfold = model_selection.KFold(n_splits=10, random_state=seed,shuffle=True)
	# Score on X_train, y_train
	cv_results = model_selection.cross_val_score(model, X_train_imputado, y_train, cv=kfold, scoring=scoring)
	results.append(cv_results)
	names.append(name)
	msg = f"({name}, {cv_results.mean()}, {cv_results.std()}"
	print(msg)

(LR, 0.9047951679849096, 0.010186105626521214
(CART, 1.0, 0.0
(NB, 0.9820709245736948, 0.0040174899139510485
(dt_hyper, 1.0, 0.0




(SVM, 0.8315141969004154, 0.050132592262142735
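
Recall alone does not show how many legitimate claims are flagged by mistake, so it can help to look at precision alongside it. The following sketch uses scikit-learn's `cross_validate` with stratified folds on synthetic data from `make_classification`; the data, the fold count and the variable names are assumptions for illustration and do not reproduce the results above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, moderately imbalanced data standing in for the real features.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Stratified folds keep the class ratio roughly constant in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

scores = cross_validate(
    LogisticRegression(solver="liblinear"),
    X_demo,
    y_demo,
    cv=cv,
    scoring=["recall", "precision"],
)

for metric in ("recall", "precision"):
    fold_scores = scores[f"test_{metric}"]
    print(f"{metric}: mean={fold_scores.mean():.3f}, std={fold_scores.std():.3f}")
```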


In [39]:
results_df = pd.DataFrame(results, index=names).T
px.box(results_df,title = 'Algorithm Comparison')

Using SelectKBest to extract the k highest-scoring features (scored here with f_classif, the ANOVA F-value) and train the models on them, reducing the risk of overfitting caused by an excess of features.

In [40]:
k_best = SelectKBest(f_classif, k=50)

In [41]:
# Apply SelectKBest to the training data
X_train_imputado_kbest = k_best.fit_transform(X_train_imputado, y_train)

In [42]:
selected_bestfeature_names = X_train_imputado.columns[k_best.get_support()]

In [43]:
selected_bestfeature_names

Index(['age', 'monthh_Mar', 'monthh_Nov', 'make_Accura', 'accidentarea_Rural',
       'accidentarea_Urban', 'monthclaimed_Aug', 'monthclaimed_Dec',
       'monthclaimed_Jul', 'monthclaimed_Nov', 'sex_Female', 'sex_Male',
       'fault_Policy Holder', 'fault_Third Party',
       'policytype_Sedan - All Perils', 'policytype_Sedan - Collision',
       'policytype_Sedan - Liability', 'policytype_Sport - Collision',
       'policytype_Utility - All Perils', 'vehiclecategory_Sedan',
       'vehiclecategory_Sport', 'vehiclecategory_Utility',
       'vehicleprice_20000 to 29000', 'vehicleprice_30000 to 39000',
       'vehicleprice_less than 20000', 'vehicleprice_more than 69000',
       'deductible_400', 'deductible_500', 'days_policy_accident_0',
       'days_policy_accident_more than 30', 'pastnumberofclaims_0',
       'pastnumberofclaims_2 to 4', 'pastnumberofclaims_more than 4',
       'ageofvehicle_more than 7', 'ageofpolicyholder_16 to 17',
       'ageofpolicyholder_21 to 25', 'policerep

In [44]:
X_train_best = X_train_imputado[['monthh_Dec', 'monthh_Jul', 'monthh_Nov', 'weekofmonth_1',
       'weekofmonth_2', 'weekofmonth_3', 'weekofmonth_4', 'weekofmonth_5',
       'dayofweek_Friday', 'dayofweek_Monday', 'dayofweek_Saturday',
       'dayofweek_Thursday', 'dayofweek_Tuesday', 'dayofweek_Wednesday',
       'make_Mazda', 'make_Pontiac', 'make_Toyota', 'dayofweekclaimed_Monday',
       'dayofweekclaimed_Thursday', 'dayofweekclaimed_Tuesday',
       'dayofweekclaimed_Wednesday', 'monthclaimed_Dec', 'monthclaimed_Jul',
       'monthclaimed_Nov', 'weekofmonthclaimed_1', 'weekofmonthclaimed_2',
       'weekofmonthclaimed_5', 'sex_Female', 'fault_Policy Holder',
       'fault_Third Party', 'policytype_Sedan - Liability',
       'vehiclecategory_Sedan', 'vehiclecategory_Sport',
       'vehicleprice_20000 to 29000', 'vehicleprice_30000 to 39000',
       'repnumber_11', 'driverrating_1', 'driverrating_2', 'driverrating_3',
       'driverrating_4', 'pastnumberofclaims_1', 'pastnumberofclaims_2 to 4',
       'pastnumberofclaims_more than 4', 'ageofvehicle_7 years',
       'numberofsuppliments_1 to 2', 'numberofsuppliments_3 to 5',
       'numberofsuppliments_more than 5', 'yearr_1995', 'yearr_1996',
       'basepolicy_Liability']]
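
The hard-coded list in the cell above differs from the Index printed by the previous cell: for example, it omits 'age' and includes 'monthh_Dec', which the printed selection does not contain. It may come from an earlier run. One way to keep the two in sync, sketched here on synthetic data with hypothetical names, is to build the subset directly from the selector's support mask.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data standing in for the encoded training features.
X_arr, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(8)])

selector = SelectKBest(f_classif, k=3).fit(X_demo, y_demo)

# get_support() returns a boolean mask aligned with the input columns.
selected = X_demo.columns[selector.get_support()]

# Subset by name so the frame always reflects the current selection.
X_best = X_demo[selected]

print(list(selected))
```

In the notebook this would correspond to indexing `X_train_imputado` with `selected_bestfeature_names` instead of a literal list.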

Using cross-validation again to see how the models behave when trained on a subset of the features, and to evaluate whether their performance improves.

In [45]:
seed =2
models = []

#logistic Regression
models.append(('LR', LogisticRegression(solver='liblinear')))

# Decision Tree classifier
models.append(('CART', DecisionTreeClassifier()))

# Naïve Bayes
models.append(('NB', GaussianNB()))

# Decision tree with the Gini criterion set explicitly (the scikit-learn default, so it matches CART)
models.append(('dt_hyper', DecisionTreeClassifier(criterion= "gini")))
# SVM
models.append(('SVM', SVC(C=1.0, kernel='rbf', max_iter=1000, tol=1e-3)))
# evaluate each model in turn
results = []
names = []
scoring = 'recall'
for name, model in models:
	# K-fold cross-validation for model selection
	kfold = model_selection.KFold(n_splits=10, random_state=seed,shuffle=True)
	# Score on X_train_best, y_train
	cv_results = model_selection.cross_val_score(model, X_train_best, y_train, cv=kfold, scoring=scoring)
	results.append(cv_results)
	names.append(name)
	msg = f"({name}, {cv_results.mean()}, {cv_results.std()}"
	print(msg)

(LR, 0.9125279896761673, 0.011756061142552758
(CART, 1.0, 0.0
(NB, 0.8772087504893692, 0.014259544080449537
(dt_hyper, 1.0, 0.0



Solver terminated early (max_iter=1000).  Consider pre-processing your data with StandardScaler or MinMaxScaler.



(SVM, 0.7542785511676207, 0.06835522454279293


In [46]:
results_df = pd.DataFrame(results, index=names).T
px.box(results_df,title = 'Algorithm Comparison')

## Test with X_test and y_test

In [47]:
y_test

27578    1
16132    1
25827    1
4859     0
17302    1
        ..
24554    1
20416    1
5626     0
23928    1
14831    0
Name: fraudfound_p, Length: 8699, dtype: int64

In [48]:
def imputacion(X_test):
    for col in X_test.columns:
        if X_test[col].dtype == 'object':
            # Categorical column: impute with the mode of the full dataset (df)
            X_test[col] = X_test[col].fillna(df[col].mode()[0])
        else:
            # Numeric column: impute with the column mean
            X_test[col] = X_test[col].fillna(X_test[col].mean())
    return X_test

In [49]:
X_test = imputacion(X_test)
X_test

Unnamed: 0,monthh,weekofmonth,dayofweek,make,accidentarea,dayofweekclaimed,monthclaimed,weekofmonthclaimed,sex,maritalstatus,...,ageofvehicle,ageofpolicyholder,policereportfiled,witnesspresent,agenttype,numberofsuppliments,addresschange_claim,numberofcars,yearr,basepolicy
27578,Sep,1,Sunday,Pontiac,Urban,Thursday,Sep,2,Male,Single,...,6 years,31 to 35,0,0,External,0,no change,1 vehicle,1996,Collision
16132,Oct,2,Friday,Pontiac,Urban,Saturday,Oct,3,Male,Married,...,7 years,36 to 40,0,0,External,0,no change,1 vehicle,1996,Collision
25827,Aug,1,Friday,Chevrolet,Rural,Friday,Aug,1,Male,Single,...,more than 7,41 to 50,0,0,External,1 to 2,no change,1 vehicle,1994,Collision
4859,Sep,1,Monday,Honda,Rural,Tuesday,Sep,1,Male,Married,...,7 years,36 to 40,0,0,External,0,no change,1 vehicle,1994,Liability
17302,Feb,4,Monday,Honda,Urban,Tuesday,Feb,4,Male,Single,...,new,16 to 17,0,0,External,0,1 year,1 vehicle,1994,All Perils
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24554,Apr,2,Wednesday,Mazda,Urban,Thursday,Apr,3,Male,Married,...,7 years,36 to 40,0,0,External,0,no change,1 vehicle,1996,All Perils
20416,Dec,4,Thursday,Chevrolet,Urban,Monday,Jan,1,Male,Married,...,7 years,31 to 35,0,0,External,0,no change,1 vehicle,1994,Collision
5626,Mar,4,Thursday,Toyota,Urban,Thursday,Apr,5,Male,Married,...,7 years,31 to 35,0,0,External,0,no change,1 vehicle,1996,All Perils
23928,Feb,2,Sunday,Mazda,Urban,Thursday,Feb,3,Male,Married,...,7 years,41 to 50,0,0,External,0,no change,1 vehicle,1994,All Perils


In [50]:
X_test.nunique()


monthh                  12
weekofmonth              5
dayofweek                7
make                    18
accidentarea             2
dayofweekclaimed         7
monthclaimed            12
weekofmonthclaimed       5
sex                      2
maritalstatus            4
age                     66
fault                    2
policytype               8
vehiclecategory          3
vehicleprice             6
repnumber               16
deductible               4
driverrating             4
days_policy_accident     5
days_policy_claim        3
pastnumberofclaims       4
ageofvehicle             8
ageofpolicyholder        9
policereportfiled        2
witnesspresent           2
agenttype                2
numberofsuppliments      4
addresschange_claim      5
numberofcars             4
yearr                    3
basepolicy               3
dtype: int64

In [51]:
X_test.isnull().any()

monthh                  False
weekofmonth             False
dayofweek               False
make                    False
accidentarea            False
dayofweekclaimed        False
monthclaimed            False
weekofmonthclaimed      False
sex                     False
maritalstatus           False
age                     False
fault                   False
policytype              False
vehiclecategory         False
vehicleprice            False
repnumber               False
deductible              False
driverrating            False
days_policy_accident    False
days_policy_claim       False
pastnumberofclaims      False
ageofvehicle            False
ageofpolicyholder       False
policereportfiled       False
witnesspresent          False
agenttype               False
numberofsuppliments     False
addresschange_claim     False
numberofcars            False
yearr                   False
basepolicy              False
dtype: bool

In [52]:
X_test['weekofmonth'] = X_test['weekofmonth'].astype(object)
X_test['weekofmonthclaimed'] = X_test['weekofmonthclaimed'].astype(object)
X_test['repnumber'] = X_test['repnumber'].astype(object)
X_test['deductible'] = X_test['deductible'].astype(object)
X_test['driverrating'] = X_test['driverrating'].astype(object)
X_test['policereportfiled'] = X_test['policereportfiled'].astype(object)
X_test['witnesspresent'] = X_test['witnesspresent'].astype(object)
X_test['yearr'] = X_test['yearr'].astype(object)

In [53]:
X_test.dtypes

monthh                   object
weekofmonth              object
dayofweek                object
make                     object
accidentarea             object
dayofweekclaimed         object
monthclaimed             object
weekofmonthclaimed       object
sex                      object
maritalstatus            object
age                     float64
fault                    object
policytype               object
vehiclecategory          object
vehicleprice             object
repnumber                object
deductible               object
driverrating             object
days_policy_accident     object
days_policy_claim        object
pastnumberofclaims       object
ageofvehicle             object
ageofpolicyholder        object
policereportfiled        object
witnesspresent           object
agenttype                object
numberofsuppliments      object
addresschange_claim      object
numberofcars             object
yearr                    object
basepolicy               object
dtype: object

In [54]:
categorical_features = X_test.select_dtypes(include="object").columns.tolist()
numeric_features = X_test.select_dtypes(include='float64').columns.tolist()

In [55]:
categorical_features

['monthh',
 'weekofmonth',
 'dayofweek',
 'make',
 'accidentarea',
 'dayofweekclaimed',
 'monthclaimed',
 'weekofmonthclaimed',
 'sex',
 'maritalstatus',
 'fault',
 'policytype',
 'vehiclecategory',
 'vehicleprice',
 'repnumber',
 'deductible',
 'driverrating',
 'days_policy_accident',
 'days_policy_claim',
 'pastnumberofclaims',
 'ageofvehicle',
 'ageofpolicyholder',
 'policereportfiled',
 'witnesspresent',
 'agenttype',
 'numberofsuppliments',
 'addresschange_claim',
 'numberofcars',
 'yearr',
 'basepolicy']

In [56]:
numeric_features

['age']

Encoding the columns for model training. In this case, the pandas `get_dummies` function is used to turn the categorical variables into numeric indicator (dummy) columns.

In [57]:
# Convert the categorical columns to the pandas categorical dtype, keeping their original values
for col in categorical_features:
    X_test[col] = pd.Categorical(X_test[col])

# Encode the categorical columns with pd.get_dummies
encoded_cols_df = pd.get_dummies(X_test[categorical_features], columns= categorical_features)

# Replace the original columns in the DataFrame with the newly encoded columns
X_test_encoded = pd.concat([X_test.drop(categorical_features, axis=1), encoded_cols_df], axis=1)
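When `get_dummies` is applied separately to the training and test sets, the two results can end up with different columns if a category is missing from one of them. The sketch below, on made-up data (the `train` and `test` frames are hypothetical, not the notebook's), shows one common way to align the test columns to the training columns with `reindex`.

```python
import pandas as pd

# Hypothetical toy frames; 'Honda' never appears in train, 'Mazda' never in test
train = pd.DataFrame({"make": ["Mazda", "Toyota", "Pontiac"], "age": [30.0, 41.0, 52.0]})
test = pd.DataFrame({"make": ["Toyota", "Honda"], "age": [28.0, 44.0]})

train_enc = pd.get_dummies(train, columns=["make"])
test_enc = pd.get_dummies(test, columns=["make"])

# Keep exactly the training columns: add the missing ones as False,
# and drop dummies for categories the model never saw during training
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=False)

print(list(test_aligned.columns))
```

After this step both frames have the same columns in the same order, which is what a fitted scikit-learn estimator expects at prediction time.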

In [58]:
X_test_encoded

Unnamed: 0,age,monthh_Apr,monthh_Aug,monthh_Dec,monthh_Feb,monthh_Jan,monthh_Jul,monthh_Jun,monthh_Mar,monthh_May,...,numberofcars_1 vehicle,numberofcars_2 vehicles,numberofcars_3 to 4,numberofcars_5 to 8,yearr_1994,yearr_1995,yearr_1996,basepolicy_All Perils,basepolicy_Collision,basepolicy_Liability
27578,28.000000,False,False,False,False,False,False,False,False,False,...,True,False,False,False,False,False,True,False,True,False
16132,44.000000,False,False,False,False,False,False,False,False,False,...,True,False,False,False,False,False,True,False,True,False
25827,47.000000,False,True,False,False,False,False,False,False,False,...,True,False,False,False,True,False,False,False,True,False
4859,41.000000,False,False,False,False,False,False,False,False,False,...,True,False,False,False,True,False,False,False,False,True
17302,40.149049,False,False,False,True,False,False,False,False,False,...,True,False,False,False,True,False,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24554,43.000000,True,False,False,False,False,False,False,False,False,...,True,False,False,False,False,False,True,True,False,False
20416,26.000000,False,False,True,False,False,False,False,False,False,...,True,False,False,False,True,False,False,False,True,False
5626,33.000000,False,False,False,False,False,False,False,True,False,...,True,False,False,False,False,False,True,True,False,False
23928,52.000000,False,False,False,True,False,False,False,False,False,...,True,False,False,False,True,False,False,True,False,False


Showing the dataset after encoding the required variables and obtaining the dummies.

In [59]:
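# Create a StandardScaler object
scaler = StandardScaler()

# Standardize the 'age' column.
# Note: this fits a new scaler on the test set; if a scaler was fitted on the
# training data, reusing it via transform() keeps both sets on the same scale.
X_test_encoded['age'] = scaler.fit_transform(X_test_encoded[['age']])

# Display the dataset with the standardized column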
X_test_encoded


Unnamed: 0,age,monthh_Apr,monthh_Aug,monthh_Dec,monthh_Feb,monthh_Jan,monthh_Jul,monthh_Jun,monthh_Mar,monthh_May,...,numberofcars_1 vehicle,numberofcars_2 vehicles,numberofcars_3 to 4,numberofcars_5 to 8,yearr_1994,yearr_1995,yearr_1996,basepolicy_All Perils,basepolicy_Collision,basepolicy_Liability
27578,-1.011662e+00,False,False,False,False,False,False,False,False,False,...,True,False,False,False,False,False,True,False,True,False
16132,3.206720e-01,False,False,False,False,False,False,False,False,False,...,True,False,False,False,False,False,True,False,True,False
25827,5.704847e-01,False,True,False,False,False,False,False,False,False,...,True,False,False,False,True,False,False,False,True,False
4859,7.085941e-02,False,False,False,False,False,False,False,False,False,...,True,False,False,False,True,False,False,False,False,True
17302,5.916751e-16,False,False,False,True,False,False,False,False,False,...,True,False,False,False,True,False,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24554,2.374012e-01,True,False,False,False,False,False,False,False,False,...,True,False,False,False,False,False,True,True,False,False
20416,-1.178204e+00,False,False,True,False,False,False,False,False,False,...,True,False,False,False,True,False,False,False,True,False
5626,-5.953076e-01,False,False,False,False,False,False,False,True,False,...,True,False,False,False,False,False,True,True,False,False
23928,9.868390e-01,False,False,False,True,False,False,False,False,False,...,True,False,False,False,True,False,False,True,False,False
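The scaling cell above calls `fit_transform` on the test set. A common alternative is to learn the mean and standard deviation from the training data only and reuse them on the test data. This is a minimal sketch with made-up ages (`train_age` and `test_age` are hypothetical arrays, not the notebook's columns).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical ages; the training mean is 35 and the population std is sqrt(125)
train_age = np.array([[20.0], [30.0], [40.0], [50.0]])
test_age = np.array([[35.0], [60.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_age)  # learn mean/std from training data
test_scaled = scaler.transform(test_age)        # reuse them, do not refit

print(test_scaled.ravel())
```

With this approach a test value equal to the training mean maps to 0, regardless of how the test ages happen to be distributed.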


In [60]:
X_test_best = X_test_encoded[['monthh_Dec', 'monthh_Jul', 'monthh_Nov', 'weekofmonth_1',
       'weekofmonth_2', 'weekofmonth_3', 'weekofmonth_4', 'weekofmonth_5',
       'dayofweek_Friday', 'dayofweek_Monday', 'dayofweek_Saturday',
       'dayofweek_Thursday', 'dayofweek_Tuesday', 'dayofweek_Wednesday',
       'make_Mazda', 'make_Pontiac', 'make_Toyota', 'dayofweekclaimed_Monday',
       'dayofweekclaimed_Thursday', 'dayofweekclaimed_Tuesday',
       'dayofweekclaimed_Wednesday', 'monthclaimed_Dec', 'monthclaimed_Jul',
       'monthclaimed_Nov', 'weekofmonthclaimed_1', 'weekofmonthclaimed_2',
       'weekofmonthclaimed_5', 'sex_Female', 'fault_Policy Holder',
       'fault_Third Party', 'policytype_Sedan - Liability',
       'vehiclecategory_Sedan', 'vehiclecategory_Sport',
       'vehicleprice_20000 to 29000', 'vehicleprice_30000 to 39000',
       'repnumber_11', 'driverrating_1', 'driverrating_2', 'driverrating_3',
       'driverrating_4', 'pastnumberofclaims_1', 'pastnumberofclaims_2 to 4',
       'pastnumberofclaims_more than 4', 'ageofvehicle_7 years',
       'numberofsuppliments_1 to 2', 'numberofsuppliments_3 to 5',
       'numberofsuppliments_more than 5', 'yearr_1995', 'yearr_1996',
       'basepolicy_Liability']]

# CART EXPERIMENT
We first run the classification with a classification and regression tree (CART) to see what results we get and how they can be improved.

In [61]:
CART = DecisionTreeClassifier()

In [62]:
CART_trained = CART.fit(X_train_best, y_train)

In [63]:

y__test_pred = CART_trained.predict(X_test_best)

In [64]:
print(classification_report(y_test, y__test_pred))

              precision    recall  f1-score   support

           0       1.00      0.92      0.96      4349
           1       0.93      1.00      0.96      4350

    accuracy                           0.96      8699
   macro avg       0.96      0.96      0.96      8699
weighted avg       0.96      0.96      0.96      8699



# NB EXPERIMENT
Second, we use a Naive Bayes model to try a different classification technique and see whether the results improve over the previous model.

In [65]:
NB = GaussianNB()

In [66]:
NB_trained = NB.fit(X_train_best, y_train)

In [67]:
y__test_pred = NB_trained.predict(X_test_best)

In [68]:
print(classification_report(y_test, y__test_pred))

              precision    recall  f1-score   support

           0       0.84      0.61      0.71      4349
           1       0.69      0.88      0.78      4350

    accuracy                           0.75      8699
   macro avg       0.77      0.75      0.74      8699
weighted avg       0.77      0.75      0.74      8699



# LR EXPERIMENT
Finally, we try logistic regression.

In [69]:
LR = LogisticRegression(solver='liblinear')

In [70]:
LR_trained = LR.fit(X_train_best, y_train)

In [71]:
y__test_pred = LR_trained.predict(X_test_best)

In [72]:
print(classification_report(y_test, y__test_pred))

              precision    recall  f1-score   support

           0       0.87      0.59      0.70      4349
           1       0.69      0.91      0.79      4350

    accuracy                           0.75      8699
   macro avg       0.78      0.75      0.75      8699
weighted avg       0.78      0.75      0.75      8699



Hyperparameter tuning of the best model, CART


In [73]:
parameters = {'max_depth': [10,12,13,14,15,16,17,18], 'criterion': ["gini", "entropy", "log_loss"]}

grid_search = GridSearchCV(DecisionTreeClassifier(), parameters, cv=10, return_train_score=True)
grid_search.fit(X_train_best, y_train)

grid_search.best_params_

{'criterion': 'gini', 'max_depth': 18}
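The grid search above uses the estimator's default score (accuracy) to rank candidates. Since the conclusions name recall as the metric of interest, one option is to pass `scoring="recall"` together with an explicit stratified splitter. The sketch below assumes synthetic data from `make_classification` and a smaller, illustrative parameter grid; it is not the notebook's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the claims data (assumption)
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.8, 0.2], random_state=0)

param_grid = {"max_depth": [3, 5, 8], "criterion": ["gini", "entropy"]}

# Stratified folds keep the class proportions in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# scoring="recall" ranks candidates by recall on the positive class (label 1)
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=cv, scoring="recall")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Fixing `random_state` on the tree and on the splitter makes the selected parameters reproducible between runs.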

In [74]:
CART_best = grid_search.best_estimator_

Prediction with the tuned model

In [75]:
CART_best.fit(X_train_best, y_train)
y__test_pred = CART_best.predict(X_test_best)

In [76]:
print(classification_report(y_test, y__test_pred))

              precision    recall  f1-score   support

           0       0.99      0.82      0.90      4349
           1       0.85      0.99      0.91      4350

    accuracy                           0.91      8699
   macro avg       0.92      0.91      0.91      8699
weighted avg       0.92      0.91      0.91      8699



Saving the models with joblib



In [79]:
import joblib
joblib.dump(CART_trained, 'CART1.pkl')
joblib.dump(CART_best, 'CARTbest.pkl')
joblib.dump(LR_trained, 'LogisticRegression.pkl')
joblib.dump(NB_trained, 'NiveBayes.pkl')




['NiveBayes.pkl']
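A saved model can be restored later with `joblib.load` and used for prediction without retraining. The following sketch uses a throwaway model and a temporary file (the name `model_demo.pkl` is hypothetical) to show the round trip.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Small synthetic model, only for illustrating the save/load cycle
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_demo.pkl")
    joblib.dump(model, path)       # serialize the fitted estimator
    restored = joblib.load(path)   # read it back from disk

# The restored model should reproduce the original predictions
same = bool((restored.predict(X) == model.predict(X)).all())
print(same)
```

The restored model expects the same feature columns, in the same order, as the ones used for training, and should be loaded with a compatible scikit-learn version.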

**Conclusions**



To carry out a sound machine learning exercise, the insurance claims data were preprocessed: unnecessary columns were dropped; missing values were imputed with the mean, or with the mode for categorical variables; random oversampling was applied to address class imbalance when training the models; and the categorical variables were encoded as dummy variables for training. Feature selection with the select-k-best method was then used to keep the features that contribute most to the model and thereby reduce overfitting.
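As an illustration of the select-k-best step, the sketch below applies scikit-learn's `SelectKBest` to made-up dummy columns. The chi-squared score function and the column values are assumptions; this section does not show which score function or `k` the notebook used.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)

# Hypothetical dummy columns: the first follows the target about 90% of the
# time, the other two are pure noise
X = pd.DataFrame({
    "fault_Policy Holder": (y ^ (rng.random(n) < 0.1)).astype(int),
    "sex_Female": rng.integers(0, 2, size=n),
    "make_Toyota": rng.integers(0, 2, size=n),
})

# chi2 requires non-negative features, which 0/1 dummies satisfy
selector = SelectKBest(score_func=chi2, k=1).fit(X, y)
selected = X.columns[selector.get_support()].tolist()

print(selected)
```

The selector should be fitted on the training data only, and the same selected columns then applied to the test set.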

The dataset is then partitioned into a training set and a test set so the model can be tested on unseen data. Stratified cross-validation is applied so that each fold preserves the class proportions of the target, and the bias-variance tradeoff is examined to see how the model behaves across repeated samples of the data. The aim is to avoid overfitting, which can be problematic once the model is put into production. Three models are then trained (CART, Naive Bayes, and logistic regression), and the best of them is further tuned.

The metric of interest here is recall, because it is the appropriate metric when false negatives carry a high business cost. In this case, classifying claims that are actually fraudulent as non-fraudulent is the main source of losses and the main problem this exercise addresses.
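To make the definition concrete, the sketch below computes recall and precision by hand from a confusion matrix and compares them with scikit-learn's helpers. The labels are a made-up toy example, not the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels: 1 = fraudulent, 0 = legitimate
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)      # share of real frauds that were caught
precision = tp / (tp + fp)   # share of fraud alerts that were real

print(recall, precision)
```

Here one fraudulent claim is missed (a false negative), which lowers recall, and one legitimate claim is flagged (a false positive), which lowers precision.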

1. CART (classification and regression tree): a CART model is trained on the dataset to classify the variable fraudfound_p, yielding the following performance metrics.

              precision    recall  f1-score   support

           0       1.00      0.92      0.96      4349
           1       0.93      1.00      0.96      4350

    accuracy                           0.96      8699
   macro avg       0.96      0.96      0.96      8699
weighted avg       0.96      0.96      0.96      8699

For this case, it is important that the model correctly captures the claims that were actually fraudulent, bearing in mind that this is the minority class in the original data and that, from a business standpoint, what matters is not labelling truly fraudulent claims as non-fraudulent. This model reaches a recall of 1.00 for the fraud class, which could suggest overfitting; however, the evaluation was run on the test data, which the model did not see during training. One caveat: the test set is balanced (4349 vs. 4350 rows), so if the oversampling was applied before the train/test split, copies of the same claim may appear in both sets and inflate this score. With that caveat, we conclude that this is a candidate model for use by the company.
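One way to probe the overfitting concern is to check whether claims in the test set also appear in the training set, which can happen when random oversampling is applied before the split. The sketch below compares both orderings on made-up data; `row_id`, `fraud`, and the `oversample` helper are hypothetical, and `sklearn.utils.resample` stands in for whichever oversampler the notebook used.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data: 10 fraudulent and 90 legitimate claims
df = pd.DataFrame({"row_id": range(100), "fraud": [1] * 10 + [0] * 90})

def oversample(frame, seed=0):
    """Randomly duplicate minority rows until both classes have equal size."""
    minority = frame[frame["fraud"] == 1]
    majority = frame[frame["fraud"] == 0]
    minority_up = resample(minority, replace=True,
                           n_samples=len(majority), random_state=seed)
    return pd.concat([majority, minority_up])

# (a) Oversample first, then split: duplicates can land on both sides
balanced = oversample(df)
tr_a, te_a = train_test_split(balanced, test_size=0.3,
                              stratify=balanced["fraud"], random_state=0)
leaked_a = int(te_a["row_id"].isin(tr_a["row_id"]).sum())

# (b) Split first, then oversample only the training part: no shared rows
tr_b, te_b = train_test_split(df, test_size=0.3,
                              stratify=df["fraud"], random_state=0)
tr_b = oversample(tr_b)
leaked_b = int(te_b["row_id"].isin(tr_b["row_id"]).sum())

print(leaked_a, leaked_b)
```

In ordering (b) the test set keeps the original class imbalance, so its metrics reflect what the model would see in production.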

2. Naive Bayes: the Naive Bayes model yields the following results:

              precision    recall  f1-score   support

           0       0.84      0.61      0.71      4349
           1       0.69      0.88      0.78      4350
    accuracy                           0.75      8699
   macro avg       0.77      0.75      0.74      8699
weighted avg       0.77      0.75      0.74      8699



With a recall of 0.88 for the fraud class, this model has a good ability to detect fraudulent claims. The cost is lower precision (0.69): it often flags as fraudulent claims that are in fact legitimate.

3. Logistic regression: the logistic regression model yields the following results:


              precision    recall  f1-score   support

           0       0.87      0.59      0.70      4349
           1       0.69      0.91      0.79      4350
    accuracy                           0.75      8699
   macro avg       0.78      0.75      0.75      8699
weighted avg       0.78      0.75      0.75      8699



This model detects the fraud class better than Naive Bayes (recall of 0.91 vs. 0.88), but it makes many errors on non-fraudulent claims (recall of 0.59 for class 0).

4. CART with hyperparameter tuning:

Finally, the metrics for the CART model with hyperparameter tuning are:

              precision    recall  f1-score   support

           0       0.99      0.82      0.90      4349
           1       0.85      0.99      0.91      4350
    accuracy                           0.91      8699
   macro avg       0.92      0.91      0.91      8699
weighted avg       0.92      0.91      0.91      8699




In conclusion, we propose using the CART model with hyperparameter tuning and the Naive Bayes model to support the sales and investigations departments when selling accident insurance and paying out claims. These models can help identify fraudulent cases and reduce the losses the company incurs from paying customers who have lied about the damage suffered in accidents or about the circumstances in which they occurred.