## Libraries

# Title: Variables with RF
## Author: Jose Chelquer
## Last modified: 25/10/2024
## Description:

Adds features by running a random forest (RF).
Optimizes the RF hyperparameters and adds the resulting predicted probabilities to the dataset as extra variables.


## Parameters

< Description of each parameter used by the job >


In [None]:
usar_gdrive=True    # Set to True when the files live on Google Drive; use False on local machines
#usar_gdrive=False

In [None]:
semillas = [17,19,23,29,31]     # One variable will be created per seed
ganancia_acierto=273000
costo_estimulo=7000

In [None]:
study_name="optimizacion_RF"
reinicializar_study=False           # Use on the first run, or to resume an existing study
#reinicializar_study=True           # Start from scratch, discarding what was already created


In [None]:
# RF hyperparameters to optimize, as [name, type, [low, high]]
params=[
    ['max_depth', 'int', [2, 32]],
    ['min_samples_split', 'int', [2, 2000]],
    ['min_samples_leaf', 'int', [1, 200]],
    ['max_features', 'float', [0.05, 0.7]]
]

n_trials=100

grabar_importancias=True          # Optionally save the feature importances as a secondary output
#grabar_importancias=False
importancias_file='importancias_rf.csv.gz'


## Input

< Data files (csv.gz), with their paths, that the job will consume >

In [None]:
# The script handles both .csv and .csv.gz files
dataset_path='/content/drive/MyDrive/Data Science y similares/Maestría Data Mining Exactas/dmeyf/dmeyf2024/datasets/'
#dataset_path = '/home/jose/buckets/b1/datasets'
dataset_file='competencia_02_aumentada.csv.gz'

## Output

< Files, databases, and models that the job will generate >

In [None]:
# The script handles both .csv and .gz datasets
output_file='competencia_02_conRF.csv.gz'
db_path='/content/drive/MyDrive/Data Science y similares/Maestría Data Mining Exactas/dmeyf/dmeyf2024/db/'
#db_path='/home/jose/buckets/b1/db'
db_file='optimizacion.db'

modelos_path='/content/drive/MyDrive/Data Science y similares/Maestría Data Mining Exactas/dmeyf/dmeyf2024/modelos/'
#modelos_path='/home/jose/buckets/b1/modelos'
modelos_file='modelos_con_RF.pkl'


## Processes

### Required packages

In [None]:
%pip install optuna==3.6.1


## Process code

< All code from here on must run without any further parameterization >

We install, load, and set up the environment.

In [None]:
#%pip install scikit-learn==1.3.2
#%pip install seaborn==0.13.1
#%pip install numpy==1.26.4
#%pip install matplotlib==3.7.1



## Gdrive?

In [None]:
if usar_gdrive:
  from google.colab import drive
  drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os

from sklearn.model_selection import train_test_split
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

from joblib import Parallel, delayed

import optuna
from optuna.visualization import plot_optimization_history, plot_param_importances, plot_slice, plot_contour

from time import time



## Read dataset

In [None]:

data = pd.read_csv(os.path.join(dataset_path, dataset_file))

In [None]:
X = data
y = X['clase_ternaria']
X = X.drop(columns=['clase_ternaria'])

## Gain function

In [None]:
def rf_gan_eval(y_pred, data):
    clase_ternaria = data.get_clase_ternaria()
    # Distinguish BAJA+2 (hit) from everything else (stimulus cost only)
    ganancia = np.where(clase_ternaria == 'BAJA+2', ganancia_acierto, 0) - np.where(clase_ternaria !='BAJA+2', costo_estimulo, 0)
    # Sort the gains by y_pred, from highest to lowest predicted value
    ganancia = ganancia[np.argsort(y_pred)[::-1]]
    # Cumulative gains so far
    ganancia = np.cumsum(ganancia)

    return 'gan_eval', np.max(ganancia) , True

def ganancia_prob(y_hat, y, prop=1, class_index=1, threshold=0.025):
    # Gain of stimulating each row: reward for BAJA+2, cost otherwise
    gain_per_row = np.where(np.asarray(y) == "BAJA+2", ganancia_acierto, -costo_estimulo)
    # Only rows whose predicted probability reaches the threshold are stimulated
    stimulated = y_hat[:, class_index] >= threshold
    return (stimulated * gain_per_row).sum() / prop
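
The default threshold of 0.025 appears to correspond to the break-even probability implied by the two constants: stimulating a client pays off in expectation when `p * ganancia_acierto` exceeds `(1 - p) * costo_estimulo`. The following minimal sketch checks that arithmetic and contrasts the fixed-threshold gain with the best-cutoff gain on a tiny sample. The helper names (`row_gains`, `gain_at_threshold`, `best_cutoff_gain`) and the toy data are assumptions made for illustration, not part of the notebook.

```python
import numpy as np

ganancia_acierto = 273000
costo_estimulo = 7000

# Break-even probability: p * ganancia_acierto == (1 - p) * costo_estimulo
break_even = costo_estimulo / (ganancia_acierto + costo_estimulo)

def row_gains(labels):
    """Gain of stimulating each row: reward for BAJA+2, cost otherwise."""
    labels = np.asarray(labels)
    return np.where(labels == "BAJA+2", ganancia_acierto, -costo_estimulo)

def gain_at_threshold(prob_baja2, labels, threshold):
    """Total gain when every row with probability >= threshold is stimulated."""
    stimulated = np.asarray(prob_baja2) >= threshold
    return int((row_gains(labels) * stimulated).sum())

def best_cutoff_gain(prob_baja2, labels):
    """Maximum cumulative gain over all cutoffs, rows sorted by probability."""
    order = np.argsort(prob_baja2)[::-1]
    return int(np.cumsum(row_gains(labels)[order]).max())

# Toy data, made up for illustration
probs = np.array([0.40, 0.10, 0.03, 0.01])
labels = ["BAJA+2", "CONTINUA", "BAJA+1", "BAJA+2"]

threshold_gain = gain_at_threshold(probs, labels, break_even)
cutoff_gain = best_cutoff_gain(probs, labels)
print(break_even, threshold_gain, cutoff_gain)
```

The two numbers differ because the cumulative version picks the best cutoff after seeing the labels, while the threshold version fixes the cutoff in advance.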



## Impute NaNs

In [None]:
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
Xi = imp_mean.fit_transform(X)
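
As a small illustration of what the mean imputation does, the sketch below applies the same `SimpleImputer` configuration to a made-up 3x2 array. The array and variable names are assumptions for the example only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with one missing value per column (made up for illustration)
X_toy = np.array([[1.0, np.nan],
                  [3.0, 4.0],
                  [np.nan, 8.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X_toy)

# Column means learned from the non-missing values
print(imputer.statistics_)
print(X_filled)
```

By default the result is a plain NumPy array, so column names have to be taken from the original DataFrame when they are needed later.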


The parameters that can be tuned in the **rf** are:

1. **n_estimators**: Number of trees in the forest.
2. **max_depth**: Maximum depth of the trees.
3. **min_samples_split**: Minimum number of samples required to split an internal node.
4. **min_samples_leaf**: Minimum number of samples required at a leaf node.
5. **max_features**: Number of features to consider at each split. **sqrt** is a historical choice.
6. **max_leaf_nodes**: Maximum number of leaf nodes in each tree.
7. **oob_score**: Whether to use the out-of-bag sample to estimate model quality. To avoid a time-consuming **Monte Carlo cross-validation**, we use this option to search for the best model. It is not the best option, but it is not a bad one either.
8. **n_jobs**: Always -1, so that all available cores are used.
9. **max_samples**: Fraction of the samples drawn to train each tree.
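
To make the out-of-bag option more concrete, here is a minimal sketch on synthetic data. The dataset from `make_classification` and the hyperparameter values are assumptions chosen only for illustration; the point is that `oob_decision_function_` gives one row of class probabilities per training sample, computed only from trees that did not see that sample.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary problem, used only for illustration
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=17
)

model = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=17)
model.fit(X_toy, y_toy)

# One row per training sample, one column per class
oob_proba = model.oob_decision_function_
print(oob_proba.shape)
# Accuracy estimated from the out-of-bag predictions
print(model.oob_score_)
```

With very few trees, some samples may never be out-of-bag, in which case scikit-learn warns and their rows contain NaN.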

Finally, our optimization function takes the following form:

## Optimize hyperparameters

In [None]:
def suggest_params(trial, params):
    suggested_params = {}
    for param in params:
        name, param_type, range_values = param
        if param_type == 'int':
            suggested_params[name] = trial.suggest_int(name, *range_values)
        elif param_type == 'float':
            suggested_params[name] = trial.suggest_float(name, *range_values)
    return suggested_params


def objective(trial):
    suggested_params = suggest_params(trial, params)

    # Fixed hyperparameters
    random_state = semillas[0]

    # Build the model with the suggested hyperparameters
    model = RandomForestClassifier(
        random_state=random_state,
        n_jobs=-1,
        oob_score=True,
        **suggested_params  # Unpack the suggested hyperparameters here
    )
    model.fit(Xi, y)
    # Score the trial by the gain computed on the out-of-bag probabilities
    return ganancia_prob(model.oob_decision_function_, y)
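
The way each `[name, type, [low, high]]` entry is unpacked into a suggest call can be seen without running Optuna at all. The sketch below uses `MidpointTrial`, a hypothetical stand-in invented for this example that simply returns the midpoint of each range; a real Optuna trial would sample from the range instead. The `ValueError` branch for unknown types is also an addition, not something the notebook does.

```python
class MidpointTrial:
    """Hypothetical stand-in for an Optuna trial: returns the range midpoint."""

    def suggest_int(self, name, low, high):
        return (low + high) // 2

    def suggest_float(self, name, low, high):
        return (low + high) / 2


def suggest_params(trial, params):
    suggested = {}
    for name, param_type, range_values in params:
        if param_type == 'int':
            # *range_values expands [low, high] into two positional arguments
            suggested[name] = trial.suggest_int(name, *range_values)
        elif param_type == 'float':
            suggested[name] = trial.suggest_float(name, *range_values)
        else:
            # Assumption: fail loudly instead of silently skipping the entry
            raise ValueError(f"Unsupported parameter type: {param_type}")
    return suggested


params = [
    ['max_depth', 'int', [2, 32]],
    ['max_features', 'float', [0.05, 0.7]],
]
result = suggest_params(MidpointTrial(), params)
print(result)
```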





In [None]:
storage_name = "sqlite:///" + os.path.join(db_path, db_file)
if reinicializar_study:
  optuna.delete_study(study_name=study_name, storage=storage_name)
study = optuna.create_study(
    direction="maximize",
    study_name=study_name,
    storage=storage_name,
    load_if_exists=True,
)

[I 2024-10-27 13:27:08,795] Using an existing study with name 'optimizacion_RF' instead of creating a new one.

In [None]:

study.optimize(objective, n_trials=n_trials)

[I 2024-10-27 13:28:30,306] Trial 37 finished with value: 263550000.0 and parameters: {'max_depth': 20, 'min_samples_split': 502, 'min_samples_leaf': 124, 'max_features': 0.3940242002447821, 'n_estimators': 990}. Best is trial 34 with value: 265839000.0.
[I 2024-10-27 13:29:09,075] Trial 38 finished with value: 264488000.0 and parameters: {'max_depth': 15, 'min_samples_split': 234, 'min_samples_leaf': 137, 'max_features': 0.3290891928702739, 'n_estimators': 573}. Best is trial 34 with value: 265839000.0.
[I 2024-10-27 13:29:28,012] Trial 39 finished with value: 264327000.0 and parameters: {'max_depth': 13, 'min_samples_split': 388, 'min_samples_leaf': 172, 'max_features': 0.5696687799602278, 'n_estimators': 200}. Best is trial 34 with value: 265839000.0.
[I 2024-10-27 13:29:50,734] Trial 40 finished with value: 261828000.0 and parameters: {'max_depth': 23, 'min_samples_split': 549, 'min_samples_leaf': 108, 'max_features': 0.16000036503197776, 'n_estimators': 750}. Best is trial 34 with

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py", line 208, in _run_trial
    study._storage.set_trial_state_values(frozen_trial._trial_id, state, values)
  File "/usr/local/lib/python3.10/dist-packages/optuna/storages/_cached_storage.py", line 187, in set_trial_state_values
    return self._backend.set_trial_state_values(trial_id, state=state, values=values)
  File "/usr/local/lib/python3.10/dist-packages/optuna/storages/_rdb/storage.py", line 640, in set_trial_state_values
    with _create_scoped_session(self.scoped_session) as session:
  File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/optuna/storages/_rdb/storage.py", line 73, in _create_scoped_session
    session.commit()
  File "/usr/local/lib/python3.10/dist-packages/sqlalchemy/orm/session.py", line 2028, in commit
    trans.commit(_to_root=True)
  File "<string>", line 2, in commit
  Fil

TypeError: object of type 'NoneType' has no len()

We explore how the hyperparameter search went.

In [None]:
optuna.visualization.plot_optimization_history(study)

In [None]:
plot_param_importances(study)

In [None]:
plot_slice(study)

In [None]:
plot_contour(study)

In [None]:
plot_contour(study, params=["max_depth", "min_samples_split"])

## Fit with the best model

We fit the best model once per seed and create one variable per seed with the probability obtained in each case.

In [None]:
modelos=[]
for i, semilla in enumerate(semillas, start=1):
  print(f'Modelo {i}')
  model_rf = RandomForestClassifier(
      **study.best_params,
      max_samples=0.7,
      random_state=semilla,
      n_jobs=-1,
      oob_score=True
  )
  model_rf.fit(Xi, y)
  modelos.append(model_rf)

  y_pred_rf = model_rf.predict_proba(Xi)

  # Add the probability of class index 1 as a new variable
  data[f'rf_prob_{semilla}'] = y_pred_rf[:,1]



Modelo 1
Modelo 2
Modelo 3
Modelo 4
Modelo 5
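
A minimal sketch of the per-seed feature construction on synthetic data follows. The dataset, the two seeds, and the column prefix `rf_oob_` are assumptions for illustration. Besides the in-sample `predict_proba` column, it also stores the out-of-bag probability, which may be worth considering if the new variables feed a model trained on the same rows.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
toy = pd.DataFrame(X_toy, columns=[f"f{j}" for j in range(X_toy.shape[1])])

seeds = [17, 19]
for seed in seeds:
    rf = RandomForestClassifier(
        n_estimators=50, max_samples=0.7, oob_score=True, random_state=seed
    )
    rf.fit(X_toy, y_toy)
    # In-sample probability of class index 1
    toy[f"rf_prob_{seed}"] = rf.predict_proba(X_toy)[:, 1]
    # Out-of-bag probability of class index 1
    toy[f"rf_oob_{seed}"] = rf.oob_decision_function_[:, 1]

same = np.allclose(toy["rf_prob_17"], toy["rf_prob_19"])
print(same)
```

With distinct `random_state` values the per-seed columns should differ; identical columns would suggest that the seed did not reach the model.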


In [None]:
data


Unnamed: 0,numero_de_cliente,foto_mes,active_quarter,cliente_vip,internet,cliente_edad,cliente_antiguedad,mrentabilidad,mrentabilidad_annual,mcomisiones,...,Visa_mconsumototal,Visa_cconsumos,Visa_cadelantosefectivo,Visa_mpagominimo,clase_ternaria,rf_prob_17,rf_prob_19,rf_prob_23,rf_prob_29,rf_prob_31
0,596043405,202106,1,0,0,49,81,13293.74,218131.73,0.00,...,,,,,BAJA+1,0.383388,0.383388,0.383388,0.383388,0.383388
1,324054194,202106,1,0,0,83,325,757.61,15384.05,0.00,...,,,,,BAJA+1,0.188294,0.188294,0.188294,0.188294,0.188294
2,1200152394,202106,1,0,0,32,59,2523.11,10321.95,2835.66,...,15242.36,11.0,0.0,0.00,BAJA+1,0.153771,0.153771,0.153771,0.153771,0.153771
3,1302835593,202106,1,0,0,40,23,2379.72,89994.61,83.96,...,,,,,BAJA+1,0.143019,0.143019,0.143019,0.143019,0.143019
4,858495587,202106,1,0,0,49,141,884.92,2583.72,306.86,...,628.72,1.0,0.0,0.00,BAJA+1,0.161522,0.161522,0.161522,0.161522,0.161522
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10192,1448948106,202104,1,0,0,23,23,1248.35,3991.75,1697.07,...,3188.22,2.0,0.0,1126.08,CONTINUA,0.201756,0.201756,0.201756,0.201756,0.201756
10193,848154891,202106,1,0,0,40,146,680.18,5576.40,0.00,...,,,,0.00,CONTINUA,0.051549,0.051549,0.051549,0.051549,0.051549
10194,271864454,202106,1,0,0,59,25,1722.09,17776.26,232.92,...,,,,0.00,CONTINUA,0.132796,0.132796,0.132796,0.132796,0.132796
10195,915729108,202104,1,0,0,32,124,1576.08,-6006.76,425.04,...,13675.13,16.0,0.0,1958.91,CONTINUA,0.043279,0.043279,0.043279,0.043279,0.043279


In [None]:
all_feat_importances = pd.DataFrame()

for i, modelo in enumerate(modelos, start=1):
  print(f'Modelo {i}')
  print(modelo.oob_score_)
  features = X.columns
  importances = modelo.feature_importances_
  feat_importances = pd.DataFrame({'modelo':i, 'feature': features, 'importance': importances})
  feat_importances = feat_importances.sort_values('importance', ascending=False)
  print(feat_importances.head(10))

  # Append the rows to the accumulated DataFrame
  all_feat_importances = pd.concat([all_feat_importances, feat_importances], ignore_index=True)

# Save once, after all models have been processed
if grabar_importancias:
  importancias_path = os.path.join(dataset_path, importancias_file)
  if importancias_file.endswith('.gz'):
    all_feat_importances.to_csv(importancias_path, index=False, compression='gzip')
  else:
    all_feat_importances.to_csv(importancias_path, index=False)



Modelo 1
0.8164165931156222
     modelo                      feature  importance
107       1                 ctrx_quarter    0.376500
18        1                 mcaja_ahorro    0.164988
11        1              mpasivos_margen    0.114315
28        1        mtarjeta_visa_consumo    0.067648
51        1                 cpayroll_trx    0.050944
52        1                     mpayroll    0.050673
27        1  ctarjeta_visa_transacciones    0.026425
22        1               mcuentas_saldo    0.023469
25        1                mautoservicio    0.014697
33        1        mprestamos_personales    0.011631
Modelo 2
0.8164165931156222
     modelo                      feature  importance
107       2                 ctrx_quarter    0.376500
18        2                 mcaja_ahorro    0.164988
11        2              mpasivos_margen    0.114315
28        2        mtarjeta_visa_consumo    0.067648
51        2                 cpayroll_trx    0.050944
52        2                     mpayroll   
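
Once the importances of all models are stacked in one table, they can be summarized per feature. The sketch below shows one possible way with `groupby`; the numbers are made up for illustration and only the column layout (`modelo`, `feature`, `importance`) follows the notebook.

```python
import pandas as pd

# Made-up importances for two models, same layout as all_feat_importances
all_feat_importances = pd.DataFrame({
    "modelo": [1, 1, 1, 2, 2, 2],
    "feature": ["ctrx_quarter", "mcaja_ahorro", "mpayroll"] * 2,
    "importance": [0.50, 0.30, 0.20, 0.40, 0.35, 0.25],
})

# Mean and spread of each feature's importance across models
summary = (
    all_feat_importances
    .groupby("feature")["importance"]
    .agg(["mean", "std"])
    .sort_values("mean", ascending=False)
)
top_feature = summary.index[0]
print(summary)
```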

## Save output

In [None]:
# Save the file
output_path = os.path.join(dataset_path, output_file)
if output_file.endswith('.csv.gz'):
    data.to_csv(output_path, index=False, compression='gzip')
else:
    data.to_csv(output_path, index=False)
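
As a side note on the extension check, pandas can infer the compression from the file name on both write and read, since `compression='infer'` is the default. The sketch below verifies this with a round trip through a temporary `.csv.gz` file; the toy DataFrame and the file name are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy frame, made up for illustration
df = pd.DataFrame({"numero_de_cliente": [1, 2], "rf_prob_17": [0.1, 0.9]})

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "toy.csv.gz")

# No compression argument: it is inferred from the ".gz" extension
df.to_csv(path, index=False)

# A gzip file starts with the two magic bytes 0x1f 0x8b
with open(path, "rb") as f:
    magic = f.read(2)

restored = pd.read_csv(path)
print(magic == b"\x1f\x8b", restored.equals(df))
```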




In [None]:
# Save the models
import pickle
# Write the list of models to a single file
modelos_file_path=os.path.join(modelos_path, modelos_file)
with open(modelos_file_path, 'wb') as f:
    pickle.dump(modelos, f)

print("Models saved successfully.")
