# Models



## Introduction to Machine Learning Models
In data analysis, Machine Learning models let us identify patterns and make predictions from historical data. One of the most interpretable and efficient models for classification problems is the **Decision Tree**. It splits the data into branches according to segmentation criteria, which makes it easier to interpret the factors that drive a decision. The models trained in this notebook (Random Forest, AdaBoost, Gradient Boosting and XGBoost) are ensembles built from such trees.
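
To make the branching idea concrete, the following is a minimal sketch that fits a shallow tree and prints its split rules. It assumes synthetic data from `make_classification` and hypothetical feature names (`feature_0` to `feature_3`); it is not the project's dataset or final model.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (assumption): 200 rows, 4 numeric features, binary target
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=42
)
feature_names = [f"feature_{i}" for i in range(4)]  # hypothetical names

# A shallow tree keeps the printed rules readable
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(X_demo, y_demo)

# Each indented line is one split; leaves show the predicted class
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

Reading the printed rules from top to bottom shows which feature and threshold define each branch, which is the interpretability property described above.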

## Objective of the Analysis
This study seeks to understand the factors that influence the purchase of **multiple cars** as opposed to a single vehicle. The goal is to build a model that predicts whether a customer will buy **one car or more than one**, and to use it to design strategies that encourage multi-unit purchases.

## Target Variable
The variable of interest in this model is **"Mas_1_coche"** ("more than one car"), which takes two possible values:
- `0`: the customer bought **a single car**.
- `1`: the customer bought **more than one car**.

## Predictor Variables (*Features*)
To predict the target, we use a set of predictor variables covering characteristics of the customer, the vehicle and the purchase history. These variables are selected according to their relevance, with the aim of improving model performance.
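
One possible way to judge relevance is the impurity-based importance reported by a tree ensemble. The sketch below assumes synthetic data and hypothetical column names; the notebook does not state which selection method was actually used.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real features (assumption, not the project's data)
X_arr, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_demo, y_demo)

# Impurity-based importances: one value per column, normalized to sum to 1
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

Impurity-based importances tend to favor high-cardinality features, so they are best read as a first ranking rather than a final verdict.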




## Importing libraries

In [18]:
import os
from itertools import product

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from tabulate import tabulate

# Metrics, validation utilities and models
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix, roc_auc_score,
    roc_curve, precision_recall_curve, precision_score, recall_score, f1_score
)
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.model_selection import (
    train_test_split, cross_val_score, learning_curve, validation_curve,
    GridSearchCV
)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
)
from xgboost import XGBClassifier


In [19]:
df = pd.read_csv('../data/Propensity_Processed.csv')
df.head()

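|   | PRODUCTO | TIPO_CARROCERIA | COMBUSTIBLE | Potencia | TRANS | FORMA_PAGO | ESTADO_CIVIL | GENERO | OcupaciOn | PROVINCIA | ... | Zona_Renta | REV_Garantia | Averia_grave | QUEJA_CAC | COSTE_VENTA | km_anno | Mas_1_coche | Revisiones | Edad_Cliente | Tiempo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0.0 | 1 | 0 | 0 | 1 | 1 | 4 | ... | 1.0 | 0 | 3.0 | 1 | 2892 | 0 | 0 | 2 | 18 | 0 |
| 1 | 0 | 0 | 0 | 0.0 | 1 | 0 | 0 | 0 | 1 | 47 | ... | 1.0 | 1 | 0.0 | 0 | 1376 | 7187 | 0 | 2 | 53 | 0 |
| 2 | 0 | 0 | 0 | 0.0 | 1 | 3 | 0 | 1 | 1 | 30 | ... | 2.0 | 0 | 0.0 | 0 | 1376 | 0 | 1 | 4 | 21 | 3 |
| 3 | 0 | 0 | 0 | 0.0 | 1 | 2 | 0 | 0 | 1 | 32 | ... | 2.0 | 1 | 3.0 | 1 | 2015 | 7256 | 1 | 4 | 48 | 5 |
| 4 | 0 | 0 | 0 | 0.0 | 1 | 2 | 0 | 0 | 2 | 41 | ... | 3.0 | 0 | 0.0 | 0 | 1818 | 0 | 1 | 3 | 21 | 3 |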


## Model training

In [None]:
X = df.drop(columns=['Mas_1_coche', 'Tiempo'])  # Also drop the 'Tiempo' column because of its high correlation
y = df['Mas_1_coche']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
y_train.value_counts()

In [None]:
y_test.value_counts()
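
The two `value_counts()` calls check how the classes are distributed after the split. If the proportions differ noticeably between train and test, passing `stratify` to `train_test_split` keeps them aligned. The sketch below uses synthetic labels with an assumed 70/30 class ratio, not the project's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced synthetic labels (assumption): 70% class 0, 30% class 1
rng = np.random.default_rng(42)
y_demo = np.array([0] * 700 + [1] * 300)
X_demo = rng.normal(size=(1000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, both subsets keep approximately the original class proportion
print("train share of class 1:", y_tr.mean())
print("test share of class 1:", y_te.mean())
```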

In [None]:
rf_model = RandomForestClassifier(
    n_estimators=100,
    min_samples_split=5,
    random_state=42,
)
rf_model.fit(X_train, y_train)

In [None]:
cv_scores = cross_val_score(rf_model, X_train, y_train, cv=5, scoring='accuracy')
cv_scores
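
The array of fold scores is easier to compare across models when summarized as a mean and a standard deviation. The following sketch assumes synthetic data and an explicit `StratifiedKFold` splitter with shuffling, which is a choice made here for illustration rather than the notebook's setting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)

# Explicit splitter: 5 stratified folds, shuffled with a fixed seed
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model_demo = RandomForestClassifier(
    n_estimators=50, min_samples_split=5, random_state=42
)

scores = cross_val_score(model_demo, X_demo, y_demo, cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

A large standard deviation relative to the differences between models suggests that those differences may not be meaningful.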

In [None]:
print("Train set score (Accuracy) =", rf_model.score(X_train, y_train))
print("Test set score (Accuracy) =", rf_model.score(X_test, y_test))

conf_mat = confusion_matrix(y_test, rf_model.predict(X_test))

num_classes = conf_mat.shape[0]

print(tabulate(
    conf_mat,
    headers=[f'Pred Class {i}' for i in range(num_classes)],
    showindex=[f'Real Class {i}' for i in range(num_classes)],
    tablefmt='fancy_grid'
))

print("\nClassification Report:")
print(classification_report(y_test, rf_model.predict(X_test)))

## Random Forest

In [28]:
param_grid = {
    'criterion': ['entropy'],
    'n_estimators': [100, 300, 500],
    'max_depth': [10, 50, 100, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [3, 5, 10],
    'max_features': ['sqrt'],
    'bootstrap': [True],
    'class_weight': ['balanced', 'balanced_subsample'],
    'min_weight_fraction_leaf': [0.0],
    'max_leaf_nodes': [None],
    'warm_start': [False]
}

results = []

# Iterate over every combination of hyperparameters
for params in product(*param_grid.values()):
    criterion, n_estimators, max_depth, min_samples_split, min_samples_leaf, max_features, bootstrap, class_weight, min_weight_fraction_leaf, max_leaf_nodes, warm_start = params

    # Build the RandomForest model with the current parameters
    model = RandomForestClassifier(
        criterion=criterion,
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
        max_features=max_features,
        bootstrap=bootstrap,
        class_weight=class_weight,
        min_weight_fraction_leaf=min_weight_fraction_leaf,
        max_leaf_nodes=max_leaf_nodes,
        warm_start=warm_start,
        random_state=42
    )

    # Train the model
    model.fit(X_train, y_train)

    # Predict probabilities on the test set
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    y_pred = (y_pred_proba > 0.45).astype(int)  # Decision threshold lowered from the default 0.5 to 0.45

    # Compute metrics (note: weighted-average recall is equal to accuracy)
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    roc_auc = roc_auc_score(y_test, y_pred_proba)

    # Cross-validation (scoring='recall' measures recall of the positive class)
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='recall')
    mean_cv_score = np.mean(cv_scores)

    # Store the results
    results.append({
        'criterion': criterion,
        'n_estimators': n_estimators,
        'max_depth': max_depth,
        'min_samples_split': min_samples_split,
        'min_samples_leaf': min_samples_leaf,
        'accuracy': accuracy,
        'f1_score': f1,
        'recall': recall,
        'roc_auc': roc_auc,
        'cv_recall': mean_cv_score
    })

# Convert to a DataFrame and sort by recall
results_df = pd.DataFrame(results).sort_values(by=['recall', 'f1_score'], ascending=False)

# Show the best models, optimized for recall
display(results_df.head(5))

KeyboardInterrupt: 
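
The grid above contains 216 combinations, and the loop fits each one six times (one full fit plus five cross-validation folds), roughly 1,296 fits in total, which may explain why the cell was interrupted. One cheaper alternative is to sample a fixed number of combinations with `RandomizedSearchCV`. The sketch below assumes synthetic data, a reduced grid and `n_iter=5`; all three are illustrative choices, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, mildly imbalanced stand-in data (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.7, 0.3], random_state=42
)

# Reduced search space (assumption), kept small so the sketch runs quickly
param_distributions = {
    "n_estimators": [50, 100],
    "max_depth": [10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [3, 5, 10],
    "class_weight": ["balanced", "balanced_subsample"],
}

# Sample 5 of the 72 possible combinations instead of trying them all
search = RandomizedSearchCV(
    RandomForestClassifier(criterion="entropy", random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    scoring="recall",
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print("best mean CV recall:", round(search.best_score_, 3))
```

Because model selection here relies only on cross-validation over the training data, the test set stays untouched until the final evaluation. Passing `n_jobs=-1` would additionally spread the fits across CPU cores.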

## Boosting

### AdaBoost

In [23]:
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Define the base model
model = AdaBoostClassifier()

# Define the hyperparameter search space
param_grid = {
    'n_estimators': [50, 100, 250],  # Kept small to speed up the search
    'learning_rate': [0.05, 0.2],
    'algorithm': ['SAMME.R'],
    'random_state': [42]  # Fixed seed for reproducibility
}

# Configure GridSearchCV
grid_search = GridSearchCV(
    model,
    param_grid,
    cv=5,  # 5-fold cross-validation
    scoring='recall_weighted',  # Optimize for weighted recall
    n_jobs=-1,  # Parallelize across all cores
    verbose=2  # Show progress
)

# Fit the model while searching over the hyperparameters
grid_search.fit(X_train, y_train)

# Retrieve the best estimator and its parameters
best_model = grid_search.best_estimator_
best_params = grid_search.best_params_

# Predict on the test set
y_pred = best_model.predict(X_test)

# Compute evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')

# Store the results in a DataFrame
results_ada_df = pd.DataFrame([{
    'model': "AdaBoost",
    'best_params': best_params,
    'accuracy': accuracy,
    'f1_score': f1,
    'recall': recall,
    'cv_score': grid_search.best_score_,
    'train_score': best_model.score(X_train, y_train)
}])

# Show the results
display(results_ada_df)


Fitting 5 folds for each of 6 candidates, totalling 30 fits




|   | model | best_params | accuracy | f1_score | recall | cv_score | train_score |
|---|---|---|---|---|---|---|---|
| 0 | AdaBoost | `{'algorithm': 'SAMME.R', 'learning_rate': 0.2,...` | 0.770568 | 0.748965 | 0.770568 | 0.767476 | 0.768081 |
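
The `accuracy` and `recall` columns are identical (0.770568) because recall computed with `average='weighted'` weights each class's recall by its share of samples, which reduces algebraically to overall accuracy. The sketch below illustrates this on hypothetical labels and contrasts it with the recall of the positive class alone.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels and predictions, for illustration only
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 1])

acc = accuracy_score(y_true, y_hat)
rec_weighted = recall_score(y_true, y_hat, average="weighted")
rec_positive = recall_score(y_true, y_hat)  # default: recall of class 1 only

print("accuracy:", acc)
print("weighted recall:", rec_weighted)
print("recall of class 1:", rec_positive)
```

If the goal is to find customers who buy more than one car, the recall of class `1` is probably the more informative figure, since the weighted variant adds nothing beyond accuracy.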


### Gradient Boosting

In [24]:
# Parameters specific to Gradient Boosting
param_grid = {
    'loss': ['log_loss'],
    'learning_rate': [0.05, 0.2],
    'n_estimators': [500],
    'subsample': [1.0],
    'criterion': ['friedman_mse'],
    'min_samples_split': [2],
    'min_samples_leaf': [1],
    'max_depth': [3],
    'random_state': [42],
    'max_features': ['sqrt']
}

# Store the results
results_gb = []

for params in product(*param_grid.values()):
    param_dict = dict(zip(param_grid.keys(), params))

    # Instantiate and train the model
    model = GradientBoostingClassifier(**param_dict)
    model.fit(X_train, y_train)

    # Predict
    y_pred = model.predict(X_test)

    # Compute metrics
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')

    # Cross-validation (default scoring for classifiers: accuracy)
    cv_scores = cross_val_score(model, X_train, y_train, cv=5)
    mean_cv_score = np.mean(cv_scores)

    # Compute the train score
    train_score = model.score(X_train, y_train)

    # Store the results
    param_dict.update({
        'model': "GradientBoosting",
        'accuracy': accuracy,
        'f1_score': f1,
        'recall': recall,
        'cv_score': mean_cv_score,
        'train_score': train_score
    })
    results_gb.append(param_dict)

# Convert to a DataFrame and show the results
results_gb_df = pd.DataFrame(results_gb)
display(results_gb_df.sort_values(by=['f1_score', 'accuracy'], ascending=False).head(3))


|   | loss | learning_rate | n_estimators | subsample | criterion | min_samples_split | min_samples_leaf | max_depth | random_state | max_features | model | accuracy | f1_score | recall | cv_score | train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | log_loss | 0.2 | 500 | 1.0 | friedman_mse | 2 | 1 | 3 | 42 | sqrt | GradientBoosting | 0.832252 | 0.826966 | 0.832252 | 0.827861 | 0.84153 |
| 0 | log_loss | 0.05 | 500 | 1.0 | friedman_mse | 2 | 1 | 3 | 42 | sqrt | GradientBoosting | 0.814171 | 0.804409 | 0.814171 | 0.811035 | 0.813825 |


### XGBoost

In [25]:
# Parameters specific to XGBoost
param_grid = {
    'learning_rate': [0.05, 0.2],
    'n_estimators': [500],
    'max_depth': [3],
    'min_child_weight': [1],
    'subsample': [1.0],
    'colsample_bytree': [0.8],
    'reg_alpha': [0],
    'reg_lambda': [1],
    'gamma': [0],
    'scale_pos_weight': [1],
    'base_score': [0.5],
    'random_state': [42],
    'verbosity': [1]
}

# Store the results
results_xgb = []

for params in product(*param_grid.values()):
    param_dict = dict(zip(param_grid.keys(), params))

    # Instantiate and train the model
    model = XGBClassifier(**param_dict)
    model.fit(X_train, y_train)

    # Predict
    y_pred = model.predict(X_test)

    # Compute metrics
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')

    # Cross-validation (default scoring for classifiers: accuracy)
    cv_scores = cross_val_score(model, X_train, y_train, cv=5)
    mean_cv_score = np.mean(cv_scores)

    # Compute the train score
    train_score = model.score(X_train, y_train)

    # Store the results
    param_dict.update({
        'model': "XGBoost",
        'accuracy': accuracy,
        'f1_score': f1,
        'recall': recall,
        'cv_score': mean_cv_score,
        'train_score': train_score
    })
    results_xgb.append(param_dict)

# Convert to a DataFrame and show the results
results_xgb_df = pd.DataFrame(results_xgb)
display(results_xgb_df.sort_values(by=['f1_score', 'accuracy'], ascending=False).head(3))


|   | learning_rate | n_estimators | max_depth | min_child_weight | subsample | colsample_bytree | reg_alpha | reg_lambda | gamma | scale_pos_weight | base_score | random_state | verbosity | model | accuracy | f1_score | recall | cv_score | train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.2 | 500 | 3 | 1 | 1.0 | 0.8 | 0 | 1 | 0 | 1 | 0.5 | 42 | 1 | XGBoost | 0.835193 | 0.830699 | 0.835193 | 0.830262 | 0.848495 |
| 0 | 0.05 | 500 | 3 | 1 | 1.0 | 0.8 | 0 | 1 | 0 | 1 | 0.5 | 42 | 1 | XGBoost | 0.822649 | 0.815069 | 0.822649 | 0.817869 | 0.822736 |
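
To compare the three boosting models side by side, the best row of each result table can be gathered into a single DataFrame. The sketch below copies the reported figures by hand so that it runs on its own; the `train_test_gap` column is a hypothetical helper added here as a rough overfitting indicator.

```python
import pandas as pd

# Best configuration per model, copied from the result tables in this notebook
comparison = pd.DataFrame(
    [
        {"model": "AdaBoost", "accuracy": 0.770568, "f1_score": 0.748965,
         "cv_score": 0.767476, "train_score": 0.768081},
        {"model": "GradientBoosting", "accuracy": 0.832252, "f1_score": 0.826966,
         "cv_score": 0.827861, "train_score": 0.84153},
        {"model": "XGBoost", "accuracy": 0.835193, "f1_score": 0.830699,
         "cv_score": 0.830262, "train_score": 0.848495},
    ]
).set_index("model")

# Difference between train and test accuracy (hypothetical helper column)
comparison["train_test_gap"] = comparison["train_score"] - comparison["accuracy"]

print(comparison.sort_values("f1_score", ascending=False))
```

On these figures XGBoost has the highest weighted F1 and accuracy, with Gradient Boosting close behind. Both show a train-test gap of about one percentage point, while AdaBoost scores lower on every metric.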
