## **Selling Auto Insurance with Machine Learning**

**Naren Castellon**

**May 4, 2021**

<img src="imagen/seguro_auto.jpg" width="500" height="250">

<font size=3.5 > <p style="color:purple">
    We will use the following machine learning models for classification:

1. Logistic Regression
2. KNN
3. Support Vector Machine
4. Decision Tree

The following package must be installed:

`pip install imbalanced-learn`

<font size=5 > <p style="color:purple"> **1. Import the libraries**

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns


from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn import neighbors
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Data balancing
# ==============================================================================
from imblearn.over_sampling import SMOTE


# Suppress warnings
# ==============================================================================
import warnings
warnings.filterwarnings("ignore")

<font size=5 > <p style="color:purple">**2. Load the data**

This is an IBM Watson Analytics dataset. It contains information about an insurer's customers and can be used to predict their behavior in order to retain them. We can also analyze all the relevant customer data and develop programs aimed at retaining as many customers as possible, while building an understanding of customer demographics and purchasing behavior.

We will use predictive analytics to identify the most important and profitable customers and how they interact, and then take targeted actions to increase response, retention, and profitable customer growth.

In [None]:
data = pd.read_csv("https://raw.githubusercontent.com/narencastellon/Python/refs/heads/main/data/WA_Fn-UseC_-Marketing-Customer-Value-Analysis.csv")

In [None]:
data.head()

<font size=5 > <p style="color:purple"> **3. Exploratory Data Analysis (EDA)**

In [None]:
data.columns

In [None]:
data.shape

In [None]:
data.dtypes

In [None]:
data.describe()

Our dataset has 9134 customers, with information on their income, education, gender, residence, and so on. Each customer owns a car and is offered 4 different types of auto insurance. The target of this dataset is the **Response** variable. Response can be "Yes" (the customer accepts the offer) or "No" (the customer does not accept the offer).

We can check whether there are any missing values:

In [None]:
data.isnull().sum()

<font size='3'>We have 0 missing values, which is very good.
Now let's do some EDA with a few plots. First we will see how the responses are distributed across the given factors.

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
ax = data['Response'].value_counts().plot(kind='bar', figsize=(18, 6), fontsize=13, color='#087E8B')
ax.set_title('Offer response (Yes = accepts offer, No = declines offer)', size=20, pad=30)
ax.set_ylabel('Number of customers', fontsize=14)

for i in ax.patches:
    ax.text(i.get_x() + 0.19, i.get_height() + 700, str(round(i.get_height(), 2)), fontsize=15)

Let's visualize how the response relates to the customer's state or region.

In [None]:
data.columns

In [None]:
f, ax = plt.subplots(1, 1, figsize=(18, 6))
sns.countplot(x = "Response", hue="State", data = data)

In [None]:
f, ax = plt.subplots(1, 1, figsize=(18, 6))
sns.countplot(x = "Response", hue="Gender", data = data);

In [None]:
f, ax = plt.subplots(1, 1, figsize=(18, 5))
ax = sns.barplot(x='Response', y='Income', hue='Gender', data=data, palette='cool')

In [None]:
f, ax = plt.subplots(1, 1, figsize=(18, 5))
ax = sns.barplot(x='Response', y='Monthly Premium Auto', hue='Gender', data=data, palette='Reds_r')

In [None]:
f, ax = plt.subplots(1, 1, figsize=(18, 6))
sns.countplot(x = "Response", hue="Vehicle Class", data = data);


In [None]:
f, ax = plt.subplots(1, 1, figsize=(18, 6))
sns.countplot(x = "Response", hue="Sales Channel", data = data,palette='Set1');


In [None]:
data.Response.value_counts()

Only 1308 of the customers have accepted the offer.

In [None]:
print("Only", round(len(data[data.Response == "Yes"]) / len(data.Response) * 100, 2), "%", "of our customers accept an offer made by the sales team.")
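
The same percentage can be read off a boolean mask, which avoids the nested `len` calls. A minimal sketch, assuming a stand-in Series rebuilt from the counts reported in this notebook (1308 "Yes" out of 9134 customers) rather than the real file:

```python
import pandas as pd

# Stand-in for data["Response"], rebuilt from the reported counts
# (assumption: 1308 "Yes" rows and 9134 - 1308 "No" rows).
response = pd.Series(["Yes"] * 1308 + ["No"] * (9134 - 1308), name="Response")

# The mean of a boolean mask is the share of True values.
acceptance_rate = (response == "Yes").mean() * 100
print(f"{acceptance_rate:.2f}% of customers accepted the offer")
```

On the real data, `(data["Response"] == "Yes").mean()` would play the role of the mask above.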

In [None]:
data.groupby("Sales Channel").agg({"Response":"count"})

Most of the offers were made by agents (3477 offers), and the fewest were made through the website.

In [None]:
channel = list(data["Sales Channel"].unique())
for i in channel:
    output = len(data[(data["Sales Channel"] == i) & 
                      (data["Response"] == "Yes")]) /len(data[(data["Sales Channel"] == i)])
    print(round((output * 100), 2), "% of the offers made through the sales channel", i, "were accepted.")
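
The per-channel loop can also be written as a single `groupby`. This is a sketch on toy rows (the values are hypothetical; only the column names match the dataset):

```python
import pandas as pd

# Toy data with hypothetical rows; only the column names match the dataset.
toy = pd.DataFrame({
    "Sales Channel": ["Agent", "Agent", "Web", "Web", "Branch", "Agent"],
    "Response": ["Yes", "No", "No", "No", "Yes", "No"],
})

# Share of "Yes" per channel, in percent.
rate_by_channel = (
    toy["Response"].eq("Yes")
    .groupby(toy["Sales Channel"])
    .mean()
    .mul(100)
    .round(2)
)
print(rate_by_channel)
```

The result is a Series indexed by channel, which is easier to sort or plot than printed strings.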

In [None]:
objects = ["State","Response","Coverage","Education","EmploymentStatus",
           "Gender","Location Code","Marital Status","Policy Type","Policy","Renew Offer Type","Sales Channel",
           "Vehicle Class","Vehicle Size"]

for obj in objects:
    print(data[obj].value_counts())

**Results**

All the categorical features are reasonably well distributed, so we will keep them and encode them as numeric data.

Some variables in our dataset are not very important: for example `Customer`; `Policy` is the same as `Policy Type`; and the date is not important either, so we will drop them.


The data is imbalanced with respect to the response variable.

<font size=5 > <p style="color:purple"> **4. Data Analysis**

In [None]:
data = data.drop(columns=["Customer", "Policy", "Effective To Date"])

In [None]:
# Build a list of the categorical variables

data_categorial = data.select_dtypes(include=["object"])
categories = list(data_categorial.columns)
categories

In [None]:
# Convert the categorical variables to numeric codes with LabelEncoder
lb = LabelEncoder()

for i in categories:
    data[i] = lb.fit_transform(data[i])


In [None]:
data.head()
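
`LabelEncoder` sorts the labels and assigns integer codes in that order, which is why "No" ends up as 0 and "Yes" as 1 in the `Response` column. The sketch below shows the mapping on toy values (hypothetical rows, not the real column):

```python
from sklearn.preprocessing import LabelEncoder

# Toy values standing in for the Response column.
values = ["No", "Yes", "No", "No", "Yes"]

le = LabelEncoder()
codes = le.fit_transform(values)

# classes_ holds the sorted labels; a label's position is its integer code.
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)
print(codes.tolist())
```

Because the loop above reuses a single encoder, only the mapping of the last column remains in `lb.classes_`; keeping one encoder per column would be needed to decode the others later.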

We build the correlation matrix:

In [None]:
f, ax = plt.subplots(1, 1, figsize=(20, 10))
cmap = sns.diverging_palette(10, 240, n=9)
ax = sns.heatmap(data.corr(), annot=True, cmap=cmap)

<font size=5 > <p style="color:purple"> **5. Supervised Machine Learning for imbalanced data**

We start by predicting the response of future customers. For this we need to find a suitable model. Since our target splits into **Yes** and **No**, we can use machine learning classification models. We begin with the following:

* Logistic Regression
* KNeighbours Classifier
* Support Vector Machine
* Decision Tree

In [None]:
y = data["Response"]

In [None]:
X = data.drop(["Response"], axis=1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=29)
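
With an imbalanced target, a plain random split can leave the two sets with slightly different class ratios. One common option, shown here only as a sketch on synthetic data (the 14% positive rate is an assumption chosen to resemble this dataset), is to pass `stratify` to `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target (assumption: 14% positives).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 140 + [0] * 860)

# stratify=y_toy keeps the class ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=29, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```

The notebook's own split does not stratify, so its results are not directly comparable with a stratified run.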

## **1. Logistic Regression**

In [None]:
# Initialize the model
lr = LogisticRegression()

# Fit the model to the training set
model_logistica = lr.fit(X_train, y_train)

# Predict on the test set
y_pred = model_logistica.predict(X_test)

# Accuracy on the held-out test set
acc = lr.score(X_test, y_test) * 100

print("Logistic Regression Test Accuracy", round(acc, 2), "%")

## **Result**

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print (confusion_matrix(y_test,y_pred))
print (classification_report(y_test,y_pred))
print (accuracy_score(y_test, y_pred))

## **2. K-Neighbors Model**

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 2)  # n_neighbors means k
knn.fit(X_train, y_train)
# prediction = knn.predict(x_test)

y_pred_knn = knn.predict(X_test)

acc = knn.score(X_test, y_test)*100
print("2 neighbors KNN Score: ",round(acc,2),"%")

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print (confusion_matrix(y_test,y_pred_knn))
print (classification_report(y_test,y_pred_knn))
print (accuracy_score(y_test, y_pred_knn))

## **3. Support Vector Machine**

In [None]:
from sklearn.svm import SVC
svm = SVC()
svm.fit(X_train, y_train)

y_pred_svc=svm.predict(X_test)
acc = svm.score(X_test,y_test)*100
print("SVM Algorithm Test Accuracy", round(acc, 2),"%")

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print (confusion_matrix(y_test,y_pred_svc))
print (classification_report(y_test,y_pred_svc))
print (accuracy_score(y_test, y_pred_svc))

## **4. Decision Tree**

In [None]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)

# Prediction
y_pred_dtc = dtc.predict(X_test)

acc = dtc.score(X_test, y_test)*100
print("Decision Tree Test Accuracy", round(acc, 2),"%")

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print (confusion_matrix(y_test,y_pred_dtc))
print (classification_report(y_test,y_pred_dtc))
print (accuracy_score(y_test, y_pred_dtc))

**Results**

The models show very high accuracy; the Support Vector Machine appears to be the best choice, with more than 99% accuracy.

However, this is because the data we have worked with so far is imbalanced. About 86% of the values of the Response variable are "No", so a model can score well simply by predicting "No"; it is not very useful and does not give a very accurate picture of the data.
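
To see why accuracy is misleading here, consider a "model" that always answers "No". The sketch below rebuilds the labels from the counts reported for the full dataset (1308 "Yes" out of 9134; the test split itself will differ slightly) and uses plain NumPy:

```python
import numpy as np

# Labels rebuilt from the reported counts (assumption: "Yes" coded as 1,
# "No" coded as 0, with 9134 - 1308 = 7826 "No" rows).
y_true = np.array([1] * 1308 + [0] * (9134 - 1308))

# A "model" that always predicts the majority class.
y_always_no = np.zeros_like(y_true)

accuracy = (y_always_no == y_true).mean()
true_pos = np.sum((y_always_no == 1) & (y_true == 1))
recall_yes = true_pos / np.sum(y_true == 1)

print(f"accuracy: {accuracy:.4f}, recall for 'Yes': {recall_yes:.4f}")
```

The baseline reaches roughly 86% accuracy while finding none of the customers who accept, which is why recall and precision for the "Yes" class matter more than accuracy in this problem.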

<font size=5 > <p style="color:purple">**6. Supervised Machine Learning for balanced data**

This time, to get a better result from our data, we can downsample the majority class of our target. In this particular case that may be preferable to oversampling, because it avoids giving too much weight to one class.

In [None]:
ax = data['Response'].value_counts().plot(kind='bar', figsize=(18, 6), fontsize=13, color='#087E8B')
ax.set_title('Offer response (1 = accepts offer, 0 = declines offer)', size=20, pad=30)
ax.set_ylabel('Number of customers', fontsize=14)

for i in ax.patches:
    ax.text(i.get_x() + 0.19, i.get_height() + 700, str(round(i.get_height(), 2)), fontsize=15)

In [None]:
#Downsampling:

#1. Test-Train Split!!
# concatenate our training data back together

X_down = pd.concat([X_train, y_train], axis=1)

# separate minority and majority classes

no_effect = X_down[X_down.Response==0]
effect = X_down[X_down.Response==1]

# downsample majority

no_effect_downsampled = resample(no_effect,
                               replace = False, # sample without replacement
                               n_samples = len(effect), # match minority n
                               random_state = 27) # reproducible results

# combine minority and downsampled majority

downsampled = pd.concat([no_effect_downsampled, effect])

# checking counts

downsampled.Response.value_counts()

In [None]:
sns.countplot(x = downsampled['Response'], data = downsampled)

In [None]:
downsampled.shape

In [None]:
y_train_down = downsampled.Response

In [None]:
X_train_down = downsampled.drop(["Response"], axis = 1)

## **LOGISTIC REGRESSION**

In [None]:
# Initialize the model
lr = LogisticRegression()

# Fit the model to the downsampled training set
lr.fit(X_train_down, y_train_down)

# Predict on the untouched test set
y_pred1 = lr.predict(X_test)

# Accuracy on the test set
acc = lr.score(X_test, y_test) * 100

print("Prediction", y_pred1[:5])
print("Logistic Regression Test Accuracy", round(acc, 2), "%")

The accuracy is very poor; let's try another model.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print (confusion_matrix(y_test,y_pred1))
print (classification_report(y_test,y_pred1))
print (accuracy_score(y_test, y_pred1))

## **K-NEAREST NEIGHBOUR** 

In [None]:
n_neighbors = 2
knn = KNeighborsClassifier(n_neighbors = n_neighbors)  # n_neighbors means k
knn.fit(X_train_down, y_train_down)

y_pred2 = knn.predict(X_test)

acc = knn.score(X_test, y_test)*100

print("Prediction:", y_pred2[:5])
print(n_neighbors,"neighbors KNN Score: ",round(acc,2),"%")

In [None]:
acc_train = knn.score(X_train, y_train)*100
print("The accuracy score for the training data is: ",round(acc_train,2),"%")
acc_test = knn.score(X_test,y_test)*100
print("The accuracy score for the test data is: ",round(acc_test,2),"%")



In [None]:
cv_results = cross_val_score(knn, X_train_down,y_train_down, cv = 5)
cv_results

Accuracy is better, and the cross-validation scores are consistent across folds.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print (confusion_matrix(y_test,y_pred2))
print (classification_report(y_test,y_pred2))
print (accuracy_score(y_test, y_pred2))

## **DECISION TREE**

In [None]:
dtc = DecisionTreeClassifier()
dtc.fit(X_train_down, y_train_down)

y_pred_dtc3 = dtc.predict(X_test)

acc_dtc = dtc.score(X_test, y_test)*100

print("Prediction", y_pred_dtc3[:5])
print("Decision Tree Test Accuracy", round(acc_dtc, 2),"%")

In [None]:
acc_train = dtc.score(X_train, y_train)*100
print("The accuracy score for the training data is: ",round(acc_train,2),"%")
acc_test = dtc.score(X_test,y_test)*100
print("The accuracy score for the test data is: ",round(acc_test,2),"%")

In [None]:
cv_results = cross_val_score(dtc, X_train_down,y_train_down, cv = 5)
cv_results

In [None]:
cnf_matrix = confusion_matrix(y_test, y_pred_dtc3)
cnf_matrix

In [None]:
dtc_recall = recall_score(y_test, y_pred_dtc3)
dtc_recall

In [None]:
# Manual check of the recall, using values read from the confusion matrix: TP / (TP + FN)
271/(271+4)

In [None]:
dtc_precision = precision_score(y_test,y_pred_dtc3)
dtc_precision
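
Recall and precision can both be derived by hand from the confusion matrix. The sketch below assumes the manual cell above computes recall as TP / (TP + FN), so TP = 271 and FN = 4 are taken from it; the TN and FP values are made up for illustration only:

```python
import numpy as np

# Hypothetical confusion matrix in scikit-learn's layout:
# rows are true classes, columns are predicted classes.
# TP = 271 and FN = 4 follow the manual calculation in the notebook;
# TN = 1400 and FP = 152 are invented values.
cm = np.array([[1400, 152],
               [4, 271]])
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)      # share of actual "Yes" that were found
precision = tp / (tp + fp)   # share of predicted "Yes" that were right
print(round(recall, 4), round(precision, 4))
```

A high recall together with a low precision would mean the model finds almost every accepting customer but also flags many who decline.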

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print (confusion_matrix(y_test,y_pred_dtc3))
print (classification_report(y_test,y_pred_dtc3))
print (accuracy_score(y_test, y_pred_dtc3))

The Decision Tree has the best accuracy. Its recall is very high, which is good. The Decision Tree is therefore the model that predicts reasonably well that a customer will not accept the offer. Since we then know which customers are not worth investing in, we can concentrate on the customers who do accept an offer.

In [None]:
# Import RandomOverSampler
from imblearn.over_sampling import RandomOverSampler


In [None]:
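# In this case we use the RandomOverSampler technique to transform the data.
# Note: X_train_down is already balanced by the downsampling above,
# so this call leaves the class counts unchanged.
ros = RandomOverSampler(random_state=0)

X_resampled, y_resampled = ros.fit_resample(X_train_down, y_train_down)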
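
Oversampling only has an effect when it is applied to an imbalanced training set. As a rough illustration of the idea, the sketch below upsamples the minority class with `sklearn.utils.resample` (a stand-in for `RandomOverSampler`, not the same implementation) on a toy frame with hypothetical values:

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced training frame (hypothetical values).
train = pd.DataFrame({
    "Income": [10, 20, 30, 40, 50, 60, 70, 80],
    "Response": [0, 0, 0, 0, 0, 0, 1, 1],
})

majority = train[train.Response == 0]
minority = train[train.Response == 1]

# Upsample the minority class with replacement to the majority size.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=27
)

upsampled = pd.concat([majority, minority_upsampled])
print(upsampled.Response.value_counts())
```

This mirrors the downsampling cell earlier in the notebook, with the roles of the two classes swapped.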

In [None]:
from time import time

from sklearn.model_selection import RandomizedSearchCV, cross_val_score, cross_val_predict, train_test_split, StratifiedKFold
#from sklearn.metrics import confusion_matrix, plot_confusion_matrix,  roc_curve, auc, accuracy_score, precision_score, classification_report, roc_auc_score

#from sklearn.metrics import precision_recall_curve
#from sklearn.metrics import plot_precision_recall_curve

from sklearn.metrics import average_precision_score

from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

**Stratified K-Folds cross-validator**

It provides train/test indices to split data into train/test sets.

This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.
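
The effect of stratification is easy to check on a small synthetic target. In this sketch (20 positives out of 100 samples, a hypothetical setup) every test fold receives the same number of positives:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic target: 20 positives out of 100 samples (hypothetical).
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.zeros((100, 1))  # features are irrelevant for the split itself

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
positives_per_fold = [
    int(y_toy[test_idx].sum()) for _, test_idx in skf_demo.split(X_toy, y_toy)
]
print(positives_per_fold)
```

A plain `KFold` gives no such guarantee, so some folds could end up with very few positives.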

In [66]:
skf = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)

models = {'LogisticRegression': LogisticRegression(random_state=0),
        #'RidgeClassifier' : RidgeClassifier(random_state=0),
        #'LGBMClassifier' : LGBMClassifier(random_state=0),
        #'KNeighborsClassifier' : KNeighborsClassifier(),
        #'XGBClassifier' : XGBClassifier(random_state=0,eval_metric = 'auc'),
        #'RandomForestClassifier': RandomForestClassifier(random_state=0),
         #"Arbol de Decisión":DecisionTreeClassifier(random_state=0)
          }
        

accuracy = []
precision = []
recall = []
f1 = []
roc_auc = []
times = []

for model_name in models:
    
    start = time()

    # Fit on the resampled training data (this fit is what gets timed)
    models[model_name].fit(X_resampled, y_resampled)
    
    end = time()
    
    # Note: cross_val_score clones the estimator and refits it on folds of
    # (X_test, y_test), so the scores below do not come from the fit above.
    accuracy_ = cross_val_score(models[model_name], X_test, y_test, scoring = 'accuracy', cv = skf, n_jobs = -1)
    precision_ = cross_val_score(models[model_name], X_test, y_test, scoring = 'precision', cv = skf, n_jobs = -1)
    recall_ = cross_val_score(models[model_name], X_test, y_test, scoring = 'recall', cv = skf, n_jobs = -1)
    f1_ = cross_val_score(models[model_name], X_test, y_test, scoring = 'f1', cv = skf, n_jobs = -1)
    roc_auc_ = cross_val_score(models[model_name], X_test, y_test, scoring = 'roc_auc', cv = skf, n_jobs = -1)

    accuracy.append(np.mean(accuracy_))
    precision.append(np.mean(precision_))
    recall.append(np.mean(recall_))
    f1.append(np.mean(f1_))
    roc_auc.append(np.mean(roc_auc_))
    times.append(end-start)
    
pd.concat([pd.DataFrame([models.keys()]).T.rename(columns = {0:'models'}),
           pd.DataFrame({'accuracy':accuracy, 'precision':precision, 'recall':recall, 'f1':f1, 'roc_auc':roc_auc, 'times':times})],
          axis=1)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

| | models | accuracy | precision | recall | f1 | roc_auc | times |
|---|---|---|---|---|---|---|---|
| 0 | LogisticRegression | 0.84948 | 0.0 | 0.0 | 0.0 | 0.507348 | 0.272772 |
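
A precision and recall of zero suggest the scored model never predicts the positive class. One possible reading is that `cross_val_score` clones the estimator and refits it on folds of the imbalanced test set, so the model fitted on resampled data is never the one being scored. The sketch below shows an alternative, on synthetic data and under stated assumptions (about 15% positives, oversampling done with `sklearn.utils.resample` as a stand-in, and a `StandardScaler` added in response to the convergence warning): fit on the resampled training split, then score that fitted model on the untouched test split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Synthetic imbalanced data (assumption: about 15% positives).
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

# Oversample the minority class in the training split only.
minority_idx = np.where(y_tr == 1)[0]
majority_idx = np.where(y_tr == 0)[0]
extra_idx = resample(
    minority_idx, replace=True,
    n_samples=len(majority_idx) - len(minority_idx), random_state=0
)
train_idx = np.concatenate([majority_idx, minority_idx, extra_idx])

# Scaling inside the pipeline is one way to address the convergence warning.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, random_state=0))
clf.fit(X_tr[train_idx], y_tr[train_idx])

# Score the fitted model on the untouched test split.
y_hat = clf.predict(X_te)
test_precision = precision_score(y_te, y_hat, zero_division=0)
test_recall = recall_score(y_te, y_hat, zero_division=0)
print(round(test_precision, 3), round(test_recall, 3))
```

Here the test set keeps its original class ratio, so the reported precision and recall reflect how the model would behave on new, imbalanced data.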


In [None]:
import pandas as pd
import numpy as np
from time import time
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from lightgbm import LGBMClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import RandomOverSampler

# Assumes X, y and the split (X_train, X_test, y_train, y_test) are already defined above

# Define StratifiedKFold
skf = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)

# Define the models
models = {
    'LogisticRegression': LogisticRegression(random_state=0),
    'RidgeClassifier': RidgeClassifier(random_state=0),
    'LGBMClassifier': LGBMClassifier(random_state=0),
    'KNeighborsClassifier': KNeighborsClassifier(),
    'XGBClassifier': XGBClassifier(random_state=0, eval_metric='auc'),
    'RandomForestClassifier': RandomForestClassifier(random_state=0),
    'DecisionTreeClassifier': DecisionTreeClassifier(random_state=0)
}

# Oversample only the training split so that test rows do not leak into training
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)

# Lists to store the results
accuracy = []
precision = []
recall = []
f1 = []
roc_auc = []
times = []

# Train the models and evaluate the metrics
for model_name in models:
    start = time()

    # Train the model (this fit is what gets timed)
    models[model_name].fit(X_resampled, y_resampled)
    
    end = time()
    
    # Evaluate with cross_val_score. Note: it clones the estimator and refits it
    # on folds of (X_test, y_test), so these scores do not come from the fit above.
    accuracy_ = cross_val_score(models[model_name], X_test, y_test, scoring='accuracy', cv=skf, n_jobs=-1)
    precision_ = cross_val_score(models[model_name], X_test, y_test, scoring='precision', cv=skf, n_jobs=-1)
    recall_ = cross_val_score(models[model_name], X_test, y_test, scoring='recall', cv=skf, n_jobs=-1)
    f1_ = cross_val_score(models[model_name], X_test, y_test, scoring='f1', cv=skf, n_jobs=-1)
    roc_auc_ = cross_val_score(models[model_name], X_test, y_test, scoring='roc_auc', cv=skf, n_jobs=-1)

    # Store the mean results
    accuracy.append(np.mean(accuracy_))
    precision.append(np.mean(precision_))
    recall.append(np.mean(recall_))
    f1.append(np.mean(f1_))
    roc_auc.append(np.mean(roc_auc_))
    times.append(end - start)

# Build a DataFrame with the results
results_df = pd.DataFrame({
    'models': list(models.keys()),
    'accuracy': accuracy,
    'precision': precision,
    'recall': recall,
    'f1': f1,
    'roc_auc': roc_auc,
    'times': times
})

print(results_df)
