# Churn Prediction
## M.Sc. Favio Vázquez (XXX Congreso nacional de actuarios)

![](https://debmedia.com/blog/wp-content/uploads/2021/07/21-07-08-Customer-Churn-pre.jpg)

Customer churn occurs when customers or subscribers stop doing business with a company or service.

This dataset contains details about a bank's customers. The target variable is binary and indicates whether the customer left the bank (closed their account) or remains a customer. Each row represents a customer, and each column holds one customer attribute.

Column meanings:

- CreditScore: the customer's credit score;
- Geography: the customer's country;
- Gender: the customer's sex;
- Age: the customer's age;
- Tenure: how long the person has been a customer;
- Balance: how much money the customer holds at the bank;
- NumOfProducts: how many products the customer has;
- HasCrCard: whether the customer has a credit card;
- IsActiveMember: whether the customer is an active member;
- EstimatedSalary: the customer's estimated salary;
- Exited: churn indicator (whether the customer left the bank).
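As a rough illustration of the schema above, the following sketch builds a tiny frame with the same column names and computes the churn rate as the mean of the binary target. The rows are invented for illustration and are not taken from `churn.csv`; the target column is assumed to be named `Exited`, as in the code cells of this notebook.

```python
import pandas as pd

# Hypothetical toy rows that mimic the described schema; all values are invented.
toy = pd.DataFrame(
    {
        "CreditScore": [619, 608, 502, 699, 850, 645],
        "Geography": ["France", "Spain", "France", "France", "Spain", "Germany"],
        "Gender": ["Female", "Female", "Female", "Male", "Female", "Male"],
        "Age": [42, 41, 42, 39, 43, 44],
        "Tenure": [2, 1, 8, 1, 2, 8],
        "Balance": [0.0, 83807.86, 159660.80, 0.0, 125510.82, 113755.78],
        "NumOfProducts": [1, 1, 3, 2, 1, 2],
        "HasCrCard": [1, 0, 1, 0, 1, 1],
        "IsActiveMember": [1, 1, 0, 0, 1, 0],
        "EstimatedSalary": [101348.88, 112542.58, 113931.57, 93826.63, 79084.10, 149756.71],
        "Exited": [1, 0, 1, 0, 0, 1],
    }
)

# Because the target is coded 0/1, its mean is the share of customers who left.
churn_rate = toy["Exited"].mean()

# The same idea per group: share of churned customers in each country.
churn_by_country = toy.groupby("Geography")["Exited"].mean()

print(f"Overall churn rate: {churn_rate:.2f}")
print(churn_by_country)
```

On the real data, the same two expressions applied to `df` would give the overall and per-country churn rates.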

In [None]:
!pip install plotly

In [None]:
pip install --upgrade bamboolib --user

In [None]:
!python -m bamboolib install_nbextensions

In [None]:
!pip install shap

## Importing data

In [None]:
import pandas as pd
import bamboolib as bam
import plotly.express as px
import plotly.graph_objects as go

In [None]:
df = pd.read_csv("../data/churn.csv")

## Data exploration and preparation

In [None]:
df

In [None]:
# Number of rows and columns
rows, columns = df.shape
print(f'Number of rows: {rows}')
print(f'Number of columns: {columns}')

In [None]:
# Missing values per column
df.isnull().sum()

In [None]:
# Data types
df.dtypes

In [None]:
# Summary statistics for the numeric variables
df.describe()

In [None]:
df = df.drop(columns=['RowNumber', 'CustomerId', 'Surname'])
df

In [None]:
fig = px.histogram(df, x='CreditScore', nbins=50, hover_name='CreditScore')
fig

In [None]:
go.Figure(
    data=[go.Histogram(x=df["Tenure"], xbins={"start": 0, "end": 10.0, "size": 0.5})],
    layout=go.Layout(title="Histogram of Tenure", yaxis={"title": "Count"}, bargap=0.05),
    )

In [None]:
fig = px.histogram(df, x='Exited', color_discrete_sequence=px.colors.qualitative.Bold)
fig

In [None]:
bam.correlations(df)

In [None]:
test1 = df.groupby(['NumOfProducts']).agg(Exited_size=('Exited', 'size')).reset_index()
test1

In [None]:
df = pd.get_dummies(df, columns=['Gender', 'Geography'], drop_first=True, dummy_na=False)
df = df.rename(columns={'Gender_Male': 'Gender'})
df = df.rename(columns={'Exited': 'Churn'})
df
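To make the encoding step above easier to follow, here is a minimal sketch on a hypothetical three-row frame. It shows which columns `pd.get_dummies` produces with `drop_first=True` and how the two renames apply; the toy values are invented.

```python
import pandas as pd

# Hypothetical toy frame with only the columns involved in the encoding step.
toy = pd.DataFrame(
    {
        "Gender": ["Female", "Male", "Male"],
        "Geography": ["France", "Germany", "Spain"],
        "Exited": [1, 0, 0],
    }
)

# drop_first=True drops the first category of each variable (Female, France),
# so those categories become the implicit reference level.
encoded = pd.get_dummies(toy, columns=["Gender", "Geography"], drop_first=True)

# Same renames as in the notebook: Gender now means "is male", Exited becomes Churn.
encoded = encoded.rename(columns={"Gender_Male": "Gender", "Exited": "Churn"})

print(list(encoded.columns))
print(encoded)
```

A customer from France is represented by zeros (or `False`) in both `Geography_Germany` and `Geography_Spain`.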

In [None]:
df

In [None]:
fig = px.violin(df, x='CreditScore', color='Churn', box=True)
fig

In [None]:
df

In [None]:
fig = px.histogram(df, x='NumOfProducts', color='Churn', barmode='group')
fig

In [None]:
df

## Modeling

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

In [None]:
X = df.drop(['Churn'], axis=1)
y = df.Churn

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

In [None]:
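# Baseline model. max_iter is raised because the unscaled features can keep
# the default solver from converging within its default iteration limit.
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
# Report accuracy (in %) on the held-out test set rather than on the training data.
acc_log = round(logreg.score(X_test, y_test) * 100, 2)
acc_log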

In [None]:
model_params = {
    'random-forest': {
        'model': RandomForestClassifier(),
        'params': {
            'n_estimators': [5, 10, 15, 20, 25],
            'max_depth': [3, 5, 7, 9, 11, 13],
        },
    },
    'logisticregression': {
        'model': LogisticRegression(),
        'params': {
            'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
        },
    },
    'decision_tree': {
        'model': DecisionTreeClassifier(),
        'params': {
            'max_depth': [3, 5, 7, 9, 11, 13],
        },
    },
}

In [None]:
model_scores = []

for model_name, mp in model_params.items():
    # n_iter=2 samples only two parameter settings per model to keep the run fast
    clf = RandomizedSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False, n_iter=2)
    clf.fit(X_train, y_train)

    # clf.score evaluates the best estimator found by the search
    model_scores.append({
        'model': model_name,
        'score_train': clf.score(X_train, y_train),
        'score_test': clf.score(X_test, y_test),
        'best_params': clf.best_params_,
    })

In [None]:
results = pd.DataFrame(model_scores)
results

## Explanations

In [None]:
model_rf = RandomForestClassifier(n_estimators=5, max_depth=9)
model_rf.fit(X_train, y_train)

In [None]:
model_rf.score(X_test,y_test)

In [None]:
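# Select the first test row as a one-row DataFrame; the double brackets keep
# each column's original dtype instead of casting the row to a single dtype.
sample = X_test.iloc[[0]]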

In [None]:
sample

In [None]:
y_test.iloc[0]

In [None]:
import shap
shap.initjs()
explainer = shap.TreeExplainer(model_rf)

In [None]:
prediction = model_rf.predict_proba(sample)
print("Direct print:", prediction)
print(
    "Probability to be class 0:",
    prediction[0][0],
    "\nProbability to be class 1:",
    prediction[0][1],
)

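`expected_value` returns the baseline for each class, that is, the model's average output before any feature is taken into account. The SHAP values then show how much each feature moves the prediction away from that baseline.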
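The relationship between baseline and feature contributions can be sketched without the `shap` library. The example below computes exact Shapley values by brute force for a hypothetical three-feature scoring function and checks that baseline plus contributions reproduces the prediction. The function `model`, the `background` rows, and the `sample` are all invented for illustration. This is not how `TreeExplainer` computes its values internally; it only demonstrates the additivity property.

```python
from itertools import combinations
from math import factorial


def model(x):
    """Hypothetical scoring function with an interaction term (not the notebook's model)."""
    age, balance, active = x
    return 0.01 * age + 0.2 * balance - 0.3 * active + 0.1 * balance * active


# Invented background rows (age, balance in arbitrary units, active flag).
background = [(30, 0.0, 1), (45, 1.2, 0), (60, 0.5, 1), (25, 2.0, 0)]
sample = (50, 1.0, 0)
n = len(sample)


def value(subset):
    """Average model output when features in `subset` are fixed to the sample's
    values and the remaining features are taken from each background row."""
    total = 0.0
    for row in background:
        mixed = tuple(sample[i] if i in subset else row[i] for i in range(n))
        total += model(mixed)
    return total / len(background)


# Baseline: average output with no feature fixed (the analogue of expected_value).
baseline = value(frozenset())

# Exact Shapley value of each feature: weighted average of its marginal
# contribution over all subsets of the other features.
shapley = []
for i in range(n):
    others = [j for j in range(n) if j != i]
    phi = 0.0
    for size in range(n):
        for combo in combinations(others, size):
            s = frozenset(combo)
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            phi += weight * (value(s | {i}) - value(s))
    shapley.append(phi)

prediction = model(sample)
reconstructed = baseline + sum(shapley)

print("Baseline:", baseline)
print("Contributions:", shapley)
print("Prediction:", prediction, "| Baseline + contributions:", reconstructed)
```

The last line prints two numbers that agree up to floating-point error, which is the same check the notebook performs with `explainer.expected_value` and `shap_values`.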

In [None]:
print(explainer.expected_value)

In [None]:
shap_values = explainer.shap_values(sample)
shap_values

In [None]:
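print("Direct prediction:", prediction)
# Assumes shap_values is a list with one array per class (older SHAP releases);
# newer releases may return a single array with a trailing class axis instead.
aux = shap_values[0].sum() + explainer.expected_value[0]
print("Sum of Baseline + Feature Contributions (class 0):", aux)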

In [None]:
shap.force_plot(explainer.expected_value[0], shap_values[0], sample)

In [None]:
shap_values = explainer.shap_values(X_test)

In [None]:
shap.summary_plot(shap_values[1], X_test)