S10 PROJECT DESCRIPTION

Beta Bank is losing customers little by little, every month. The bankers have found that it is cheaper to retain existing customers than to attract new ones.

We need to predict whether a customer will leave the bank soon. You have data on customers' past behavior and on contract terminations with the bank.

Build a model with the highest F1 score possible. To pass the review, you need an F1 score of at least 0.59. Check F1 on the test set.

In addition, measure the AUC-ROC metric and compare it with the F1 score.

1. DATA LOADING AND PREPARATION

In [151]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

1.1 Load the data

In [152]:
df = pd.read_csv(r"C:\Users\jonat\Desktop\DATA_SCIENTIST\SPRINT_10_Aprendizaje_Supervisado\Churn.csv")
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


1.2 Standardize the column names

In [153]:
df.columns = df.columns.str.lower()

1.3 Minimal safe cleaning

In [154]:
if 'tenure' in df.columns:
    df['tenure'] = df['tenure'].fillna(0).astype(int)

In [155]:
for c in ['surname','geography','gender']:
    if c in df.columns and df[c].dtype == 'object':
        df[c] = df[c].str.lower()

Here I filled every missing value in the tenure column with 0 (909 of the 10,000 rows, per df.info() above) and cast the column to integer. I also lowercased the text of the categorical columns, so the basic data issues should be covered for now.
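
To make the effect of this cleaning step concrete, here is a minimal sketch on a toy frame (the rows are invented for illustration, not taken from Churn.csv). It applies the same `fillna(0).astype(int)` and lowercasing. The `tenure_missing` column is a hypothetical addition that is not part of this notebook: after filling, an imputed 0 can no longer be told apart from a genuine tenure of 0, and a flag like this would be one way to keep that information.

```python
import numpy as np
import pandas as pd

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    "tenure": [2.0, np.nan, 0.0, 8.0],
    "geography": ["France", "Spain", "GERMANY", "france"],
})

# Hypothetical extra column (not in the notebook): remember which rows were missing.
toy["tenure_missing"] = toy["tenure"].isna().astype(int)

# Same cleaning as the notebook: fill with 0, cast to int, lowercase the text.
toy["tenure"] = toy["tenure"].fillna(0).astype(int)
toy["geography"] = toy["geography"].str.lower()

print(toy)
```

Rows 2 and 3 of the toy frame both end up with tenure 0, but only the first of them was originally missing.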

1.4 Define X/y and drop the ID columns that carry no signal, to reduce noise

In [156]:
drop_ids = ['rownumber','customerid','surname']
y = df['exited'].astype(int)
X = df.drop(columns=drop_ids + ['exited'])

1.5 Stratified 60/20/20 split

In [157]:
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.20, random_state=RANDOM_STATE, stratify=y)
X_train, X_valid, y_train, y_valid = train_test_split(X_temp, y_temp, test_size=0.25, random_state=RANDOM_STATE, stratify=y_temp)

print("Shapes -> train:", X_train.shape, "| valid:", X_valid.shape, "| test:", X_test.shape)
print("Proporción Exited=1 (train/valid/test):",
      y_train.mean().round(4), y_valid.mean().round(4), y_test.mean().round(4))

Shapes -> train: (6000, 10) | valid: (2000, 10) | test: (2000, 10)
Proporción Exited=1 (train/valid/test): 0.2038 0.2035 0.2035
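
The second `test_size=0.25` is applied to the 80% that remains after the first split, which is why the result is 60/20/20: 0.25 × 0.80 = 0.20 of the full data. A minimal sketch of that arithmetic follows. The helper `split_sizes` is hypothetical, and it assumes counts are rounded to the nearest row, which is enough here because both products are whole numbers.

```python
def split_sizes(n_rows, test_frac, valid_frac_of_rest):
    """Return (train, valid, test) row counts for two chained hold-out splits."""
    n_test = round(n_rows * test_frac)             # first split: hold out the test set
    n_rest = n_rows - n_test
    n_valid = round(n_rest * valid_frac_of_rest)   # second split: fraction of the remainder
    n_train = n_rest - n_valid
    return n_train, n_valid, n_test

print(split_sizes(10_000, 0.20, 0.25))  # (6000, 2000, 2000)
```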


1.6 Preprocessors

In [158]:
cat_cols = X_train.select_dtypes(include=['object']).columns.tolist()
num_cols = X_train.select_dtypes(include=[np.number]).columns.tolist()
print("Categóricas:", cat_cols)
print("Numéricas:", num_cols)

Categóricas: ['geography', 'gender']
Numéricas: ['creditscore', 'age', 'tenure', 'balance', 'numofproducts', 'hascrcard', 'isactivemember', 'estimatedsalary']


a) RANDOM FOREST (OHE only, no scaling)

In [159]:
ohe_full = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
pre_ohe_only = ColumnTransformer(transformers=[('cat', ohe_full, cat_cols),('num', 'passthrough', num_cols)])

b) LOGISTIC REGRESSION (OHE with drop='first' to avoid the dummy-variable trap, plus scaling)

In [160]:
ohe_drop = OneHotEncoder(handle_unknown='ignore', sparse_output=False, drop='first')
pre_ohe_scale = ColumnTransformer(transformers=[('cat', ohe_drop, cat_cols),('num', StandardScaler(), num_cols)])
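
Why `drop='first'` only for the logistic pipeline: with k categories, the k one-hot columns of every row sum to 1, which duplicates the information already carried by the model's intercept. Dropping one column removes that redundancy, while tree ensembles are not affected by it. A minimal sketch on a toy column (values chosen for illustration) that compares the two encodings:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy column with three categories (invented for illustration).
geo = np.array([["france"], ["spain"], ["germany"], ["france"]])

full = OneHotEncoder().fit_transform(geo).toarray()                 # k columns
dropped = OneHotEncoder(drop="first").fit_transform(geo).toarray()  # k - 1 columns

print(full.shape, dropped.shape)
print(full.sum(axis=1))  # every row sums to 1: redundant with an intercept
print(dropped[0])        # the dropped category is encoded as all zeros
```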

2. CLASS BALANCE + BASELINE (THRESHOLD 0.5)

2.1 Import the metrics I will use

In [161]:
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score, recall_score,f1_score, roc_auc_score, classification_report, precision_recall_curve)
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [162]:
def resumen_clases(y, titulo):
    vc = y.value_counts().rename("count")
    prop = y.value_counts(normalize=True).rename("proportion")
    tab = pd.concat([vc, prop], axis=1)
    print(f"\nResumen ({titulo}):\n", tab)
    print(f"Tasa de churn (media=1) en {titulo}: {y.mean():.4f}")
    return tab

In [163]:
def evaluar_binario(y_true, y_proba, threshold=0.5, nombre="modelo"):
    """Print and return binary metrics for probabilities cut at `threshold`."""
    y_pred = (y_proba >= threshold).astype(int)
    acc = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred, zero_division=0)
    rec  = recall_score(y_true, y_pred, zero_division=0)
    f1   = f1_score(y_true, y_pred, zero_division=0)
    auc  = roc_auc_score(y_true, y_proba)  # uses the probabilities, so it does not depend on the threshold
    # labels=[0, 1] keeps the matrix 2x2 even if only one class shows up
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    print(f"\n[{nombre}] thr={threshold:.2f} | ACC={acc:.4f}  PREC={prec:.4f}  REC={rec:.4f}  F1={f1:.4f}  AUC={auc:.4f}")
    print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")
    print(classification_report(y_true, y_pred, digits=4))
    return {"acc":acc,"precision":prec,"recall":rec,"f1":f1,"auc":auc,
            "tp":tp,"tn":tn,"fp":fp,"fn":fn}

2.2 Examine the class imbalance

In [164]:
resumen_clases(y_train, "y_train")
resumen_clases(y_valid, "y_valid")


Resumen (y_train):
         count  proportion
exited                   
0        4777    0.796167
1        1223    0.203833
Tasa de churn (media=1) en y_train: 0.2038

Resumen (y_valid):
         count  proportion
exited                   
0        1593      0.7965
1         407      0.2035
Tasa de churn (media=1) en y_valid: 0.2035


exited,count,proportion
0,1593,0.7965
1,407,0.2035


2.3 Baseline pipelines (no balancing), threshold 0.5

In [165]:
logit_base = Pipeline([
    ('prep', pre_ohe_scale),  # OHE with drop='first' + scaling
    ('clf', LogisticRegression(max_iter=1000, random_state=RANDOM_STATE))])

rf_base = Pipeline([
    ('prep', pre_ohe_only),   # OHE, no scaling
    ('clf', RandomForestClassifier(n_estimators=400, random_state=RANDOM_STATE, n_jobs=-1))])

2.4 Training

In [166]:
logit_base.fit(X_train, y_train)
rf_base.fit(X_train, y_train)

2.5 Evaluate on VALID

In [167]:
proba_logit_valid = logit_base.predict_proba(X_valid)[:,1]
proba_rf_valid    = rf_base.predict_proba(X_valid)[:,1]
m_logit_base = evaluar_binario(y_valid, proba_logit_valid, 0.5, "Logistic (baseline)")
m_rf_base    = evaluar_binario(y_valid, proba_rf_valid,    0.5, "RandomForest (baseline)")


[Logistic (baseline)] thr=0.50 | ACC=0.8090  PREC=0.5839  REC=0.2138  F1=0.3129  AUC=0.7568
TP=87  TN=1531  FP=62  FN=320
              precision    recall  f1-score   support

           0     0.8271    0.9611    0.8891      1593
           1     0.5839    0.2138    0.3129       407

    accuracy                         0.8090      2000
   macro avg     0.7055    0.5874    0.6010      2000
weighted avg     0.7776    0.8090    0.7718      2000


[RandomForest (baseline)] thr=0.50 | ACC=0.8610  PREC=0.7722  REC=0.4496  F1=0.5683  AUC=0.8524
TP=183  TN=1539  FP=54  FN=224
              precision    recall  f1-score   support

           0     0.8729    0.9661    0.9172      1593
           1     0.7722    0.4496    0.5683       407

    accuracy                         0.8610      2000
   macro avg     0.8225    0.7079    0.7427      2000
weighted avg     0.8524    0.8610    0.8462      2000



In [168]:
pd.DataFrame({
    'model': ['Logistic (baseline)','RandomForest (baseline)'],
    'ACC': [m_logit_base['acc'], m_rf_base['acc']],
    'Precision': [m_logit_base['precision'], m_rf_base['precision']],
    'Recall': [m_logit_base['recall'], m_rf_base['recall']],
    'F1@0.5': [m_logit_base['f1'], m_rf_base['f1']],
    'AUC-ROC': [m_logit_base['auc'], m_rf_base['auc']]
}).sort_values('F1@0.5', ascending=False).reset_index(drop=True)

,model,ACC,Precision,Recall,F1@0.5,AUC-ROC
0,RandomForest (baseline),0.861,0.772152,0.449631,0.568323,0.852366
1,Logistic (baseline),0.809,0.583893,0.213759,0.31295,0.756772
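
The precision, recall and F1 values in the table above follow directly from the confusion-matrix counts printed by `evaluar_binario`. As a quick check, this sketch recomputes them from the validation counts at threshold 0.5 (the helper name `prf_from_counts` is hypothetical):

```python
def prf_from_counts(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)  # algebraically equal to 2PR / (P + R)
    return precision, recall, f1

# Counts copied from the validation output at threshold 0.5.
for name, tp, fp, fn in [("Logistic", 87, 62, 320), ("RandomForest", 183, 54, 224)]:
    p, r, f1 = prf_from_counts(tp, fp, fn)
    print(f"{name}: PREC={p:.4f} REC={r:.4f} F1={f1:.4f}")
```

For both models precision is well above recall at this threshold: most churners are missed (FN exceeds TP), and that is what holds F1 down.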


3. CORRECT THE IMBALANCE + OPTIMIZE THE THRESHOLD (MAX F1 ON VALID)

3.1 Best F1 threshold on the validation set (maximizes F1)

In [169]:
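def mejor_umbral_f1(y_true, y_proba):
    """Return (threshold, F1) for the threshold that maximizes F1 on the given data."""
    p, r, thr = precision_recall_curve(y_true, y_proba)
    # p and r have one more entry than thr (the final point P=1, R=0); drop it to align with thr
    f1s = 2*(p[:-1]*r[:-1])/(p[:-1]+r[:-1]+1e-12)  # 1e-12 avoids division by zero
    idx = np.nanargmax(f1s)
    return float(thr[idx]), float(f1s[idx])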
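
The search done by `mejor_umbral_f1` can be cross-checked by brute force: try every distinct score as a threshold, predict 1 when the score is at or above it, and keep the threshold with the highest F1. Below is a minimal pure-Python sketch of that idea on invented toy scores. The helper `best_f1_threshold` is hypothetical and is not used elsewhere in the notebook.

```python
def best_f1_threshold(y_true, y_score):
    """Brute-force search: (threshold, F1) maximizing F1 with the rule score >= threshold."""
    best_thr, best_f1 = None, -1.0
    for thr in sorted(set(y_score)):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= thr and y == 0)
        fn = sum(1 for y, s in zip(y_true, y_score) if s < thr and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:  # strict comparison keeps the lowest threshold on ties
            best_thr, best_f1 = thr, f1
    return best_thr, best_f1

# Toy labels and scores, invented for illustration.
y_true = [0, 0, 1, 0, 1, 1]
y_score = [0.10, 0.30, 0.35, 0.40, 0.60, 0.80]

thr, f1 = best_f1_threshold(y_true, y_score)
print(thr, round(f1, 4))  # 0.35 0.8571
```

On this toy data the best threshold sits below 0.5, because accepting one false positive recovers a true positive that a 0.5 cut would miss.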

3.2 Evaluate a pipeline: find the optimal threshold on valid and return the metrics

In [170]:
def eval_pipe_f1opt(nombre, pipe, X_tr, y_tr, X_va, y_va):
    pipe.fit(X_tr, y_tr)
    proba = pipe.predict_proba(X_va)[:,1]
    thr_opt, f1_opt = mejor_umbral_f1(y_va, proba)
    auc = roc_auc_score(y_va, proba)
    return {
        "modelo": nombre,
        "thr*": thr_opt,
        "F1* (valid)": f1_opt,
        "AUC (valid)": auc,
        "pipe": pipe
    }


3.4 APPROACH: class_weight

In [171]:
logit_cw = Pipeline([
    ('prep', pre_ohe_scale),
    ('clf', LogisticRegression(max_iter=1000, class_weight='balanced', random_state=RANDOM_STATE))])


In [172]:
rf_cw = Pipeline([
    ('prep', pre_ohe_only),
    ('clf', RandomForestClassifier(
        n_estimators=600,
        class_weight='balanced_subsample',
        random_state=RANDOM_STATE, n_jobs=-1))])
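
For reference, scikit-learn documents the `'balanced'` heuristic as `n_samples / (n_classes * np.bincount(y))`. Plugging in the training counts from section 2.2 shows how much extra weight the churn class receives. `'balanced_subsample'` uses the same formula but recomputes it on each tree's bootstrap sample, so the numbers below are only the approximate per-tree weights for the forest.

```python
# Class counts copied from the y_train summary in section 2.2.
counts = {0: 4777, 1: 1223}

n_samples = sum(counts.values())
n_classes = len(counts)

# 'balanced' heuristic: n_samples / (n_classes * count_of_class)
weights = {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

for cls, w in weights.items():
    print(f"class {cls}: weight {w:.3f}")
print(f"weight ratio (churn / stay): {weights[1] / weights[0]:.2f}")
```

Each churner counts roughly as much as four non-churners in the loss, which matches the roughly 80/20 class split.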

3.5 Evaluate both with an optimized threshold

In [173]:
resultados = []
resultados.append(eval_pipe_f1opt("Logistic + class_weight", logit_cw, X_train, y_train, X_valid, y_valid))
resultados.append(eval_pipe_f1opt("RandomForest + class_weight", rf_cw, X_train, y_train, X_valid, y_valid))

In [174]:
tabla_valid = pd.DataFrame([{k:v for k,v in r.items() if k!="pipe"} for r in resultados])\
    .sort_values("F1* (valid)", ascending=False).reset_index(drop=True)

print("\nRanking por F1* (valid) con umbral optimizado:")
display(tabla_valid)


Ranking por F1* (valid) con umbral optimizado:


,modelo,thr*,F1* (valid),AUC (valid)
0,RandomForest + class_weight,0.343333,0.639697,0.852605
1,Logistic + class_weight,0.472191,0.493355,0.761358


3.6 Identify the best model

In [175]:
best_idx  = tabla_valid.index[0]
best_name = tabla_valid.loc[best_idx, "modelo"]
best_thr  = float(tabla_valid.loc[best_idx, "thr*"])
best_pipe = [r["pipe"] for r in resultados if r["modelo"]==best_name][0]

print(f"Mejor en validación: {best_name}")
print(f"Umbral óptimo (thr*): {best_thr:.3f}")
print(f"F1* (valid): {tabla_valid.loc[best_idx, 'F1* (valid)']:.4f}")
print(f"AUC (valid): {tabla_valid.loc[best_idx, 'AUC (valid)']:.4f}")

Mejor en validación: RandomForest + class_weight
Umbral óptimo (thr*): 0.343
F1* (valid): 0.6397
AUC (valid): 0.8526


4. FINAL TEST WITH THE BEST PIPELINE AND ITS OPTIMAL THRESHOLD

4.1 Retrain the model on TRAIN + VALID

In [176]:
X_trfin = pd.concat([X_train, X_valid], axis=0)
y_trfin = pd.concat([y_train, y_valid], axis=0)
best_pipe.fit(X_trfin, y_trfin)

4.2 Evaluate on TEST with the optimal threshold found on validation

In [177]:
proba_test = best_pipe.predict_proba(X_test)[:, 1]
final_metrics_opt = evaluar_binario(y_test, proba_test, threshold=best_thr,nombre=f"FINAL TEST (RF+class_weight, thr*={best_thr:.3f})")


[FINAL TEST (RF+class_weight, thr*=0.343)] thr=0.34 | ACC=0.8395  PREC=0.6138  REC=0.5700  F1=0.5911  AUC=0.8550
TP=232  TN=1447  FP=146  FN=175
              precision    recall  f1-score   support

           0     0.8921    0.9083    0.9002      1593
           1     0.6138    0.5700    0.5911       407

    accuracy                         0.8395      2000
   macro avg     0.7529    0.7392    0.7456      2000
weighted avg     0.8355    0.8395    0.8373      2000



4.3 Comparison against the classic 0.5 threshold on test

In [178]:
final_metrics_05 = evaluar_binario(y_test, proba_test, threshold=0.50,nombre="FINAL TEST (RF+class_weight, thr=0.50)")


[FINAL TEST (RF+class_weight, thr=0.50)] thr=0.50 | ACC=0.8555  PREC=0.7521  REC=0.4324  F1=0.5491  AUC=0.8550
TP=176  TN=1535  FP=58  FN=231
              precision    recall  f1-score   support

           0     0.8692    0.9636    0.9140      1593
           1     0.7521    0.4324    0.5491       407

    accuracy                         0.8555      2000
   macro avg     0.8107    0.6980    0.7316      2000
weighted avg     0.8454    0.8555    0.8397      2000



In [179]:
print("RESUMEN FINAL")
print(f"Mejor modelo: {best_name}")
print(f"Umbral óptimo (de valid): {best_thr:.3f}")
print(f"F1 (test, thr*): {final_metrics_opt['f1']:.4f} | AUC-ROC (test): {final_metrics_opt['auc']:.4f}")
print(f"(Comparación) F1 (test, thr=0.50): {final_metrics_05['f1']:.4f}")

RESUMEN FINAL
Mejor modelo: RandomForest + class_weight
Umbral óptimo (de valid): 0.343
F1 (test, thr*): 0.5911 | AUC-ROC (test): 0.8550
(Comparación) F1 (test, thr=0.50): 0.5491
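
On comparing the two metrics: AUC-ROC is computed from the ranking of the predicted probabilities, so it is the same (0.8550) in both test evaluations, while F1 depends on the chosen threshold (0.5911 at thr*, 0.5491 at 0.50). The sketch below illustrates this on invented toy data, with AUC computed as the share of (positive, negative) pairs that are ranked correctly. Both helper names are hypothetical.

```python
def auc_pairwise(y_true, y_score):
    """AUC as the share of (positive, negative) pairs where the positive scores higher (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1_at(y_true, y_score, thr):
    """F1 for the positive class using the rule score >= thr."""
    tp = sum(1 for y, s in zip(y_true, y_score) if s >= thr and y == 1)
    fp = sum(1 for y, s in zip(y_true, y_score) if s >= thr and y == 0)
    fn = sum(1 for y, s in zip(y_true, y_score) if s < thr and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Toy labels and scores, invented for illustration.
y_true = [0, 0, 1, 0, 1, 1]
y_score = [0.10, 0.30, 0.35, 0.40, 0.60, 0.80]

print("AUC:", round(auc_pairwise(y_true, y_score), 4))  # one value, no threshold involved
for thr in (0.35, 0.50):
    print(f"F1 at thr={thr:.2f}:", round(f1_at(y_true, y_score, thr), 4))
```

The same scores give a single AUC but a different F1 at each threshold, which is why the notebook reports AUC once per model and F1 per threshold.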
