# (03) Modeling for income classification: Random Forest

At this point in the project, the following has been done:
* A thorough EDA of the income classification dataset *(notebook 00)*
* Columns with `NaN` values were imputed *(notebook 01)*
* Non-linear relationships were analyzed *(notebook 02)*
* The target column `income` was balanced *(notebook 02)*
* Outliers were analyzed *(notebook 02)*

Next, we start modeling with a `Random Forest`.

**Objectives**:
* **(1)** build an initial model on two datasets: one with outliers and one without
    * evaluate both models with cross-validation, using metrics such as *F1-score, recall, accuracy*
    * assess possible **overfitting** and reduce it

* **(2)** train the final model on the better-performing dataset
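
As a rough illustration of the overfitting check in objective (1), the sketch below compares the F1-score on the training split with the F1-score on the test split, first for an unrestricted forest and then for a shallower one. The data are synthetic (`make_classification`), and the helper `f1_gap` and all hyperparameter values are assumptions chosen for illustration, not the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# synthetic stand-in for the income dataset (assumption)
X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                           flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

def f1_gap(model):
    """Fit the model and return (train F1, test F1, train F1 - test F1)."""
    model.fit(X_train, y_train)
    f1_train = f1_score(y_train, model.predict(X_train))
    f1_test = f1_score(y_test, model.predict(X_test))
    return f1_train, f1_test, f1_train - f1_test

# fully grown trees vs. a forest limited in depth and leaf size
unrestricted = f1_gap(RandomForestClassifier(n_estimators=50, random_state=42))
regularized = f1_gap(RandomForestClassifier(n_estimators=50, max_depth=5,
                                            min_samples_leaf=5, random_state=42))

for name, (f1_train, f1_test, gap) in [('unrestricted', unrestricted),
                                       ('regularized', regularized)]:
    print(f'{name}: train={f1_train:.3f} test={f1_test:.3f} gap={gap:.3f}')
```

A large gap between the training and test scores is the usual sign of overfitting; limiting `max_depth` or raising `min_samples_leaf` is one common way to narrow it.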

Environment: eda (Anaconda)

## Importing libraries, custom modules, and datasets

In [1]:
#libraries
import joblib
import matplotlib.pyplot as plt
import numpy   as np
import pandas  as pd
import pprint

from sklearn.compose          import   ColumnTransformer
from sklearn.ensemble         import   RandomForestClassifier
from sklearn.model_selection  import  (train_test_split as tts,
                                       cross_val_score as cvs,
                                       GridSearchCV)
from sklearn.metrics          import   classification_report, f1_score
from sklearn.pipeline         import   Pipeline
from sklearn.preprocessing    import  (OneHotEncoder as OHE,
                                       StandardScaler as SS,
                                       FunctionTransformer)
from typing import List, Dict

import warnings


In [2]:
# own modules
from import_modules import import_to_nb

# plotting & eda functions
import_to_nb(directory= 'scripts', show_content= False)

# lists, dicts & auxiliar functions (for this notebook)
import_to_nb(directory= 'modules', show_content= False) 

#-# directory: scripts
from load_data import Loader
from utils     import Utils
from utils_initial_exploration     import InitialExploration
from utils_categorical_plots       import CategoricalPlots
from utils_classif_models_plots    import ClassifModelsPlots
from utils_numerical_plots         import NumericalPlots

#-# directory: modules
from module_modeling               import ModelingMethods

In [3]:
load          = Loader()
utils         = Utils()
initial_exp   = InitialExploration()
cat_plots     = CategoricalPlots()
classif_plots = ClassifModelsPlots()
num_plots     = NumericalPlots()
model_methods = ModelingMethods()

In [4]:
warnings.simplefilter('ignore',  category= FutureWarning)

In [5]:
# load appearance settings
utils.load_appereance()

In [6]:
df_with_outliers    = load.load_data(file_name= 'adult_with_outliers'   , dir= 'clean')
df_without_outliers = load.load_data(file_name= 'adult_without_outliers', dir= 'clean')

In [7]:
df_with_outliers.columns

Index(['age', 'hours_per_week', 'marital_status', 'relationship', 'education',
       'occupation', 'capital_net', 'income'],
      dtype='object')

In [8]:
text_000 = f'DF shapes:\nwith outliers:    {df_with_outliers.shape}\nwithout outliers: {df_without_outliers.shape}'
print(text_000)

DF shapes:
with outliers:    (74256, 8)
without outliers: (51832, 8)


## Start of the analysis

In [9]:
df_with_outliers.columns

Index(['age', 'hours_per_week', 'marital_status', 'relationship', 'education',
       'occupation', 'capital_net', 'income'],
      dtype='object')

*net gain = capital_gain - capital_loss*
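
For reference, a minimal sketch of how such a net column can be derived with pandas. The toy values are made up, and the assumption here is that the `capital_net` column was built from `capital_gain` and `capital_loss` columns in an earlier notebook.

```python
import pandas as pd

# toy rows with made-up values (assumption: the raw data has these two columns)
df = pd.DataFrame({'capital_gain': [0, 5000, 0],
                   'capital_loss': [0, 0, 1500]})

# net gain: positive for a net gain, negative for a net loss
df['capital_net'] = df['capital_gain'] - df['capital_loss']
print(df['capital_net'].tolist())  # [0, 5000, -1500]
```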

## (3.1) Choosing the dataset for final training

In [10]:
# evaluate the DataFrames with and without outliers (CV, f1_score, classification_report)
# original path: ./modules/module_modeling.py
model_methods.evaluate_datasets(df_with_outliers, df_without_outliers, target_col= 'income')


Evaluando dataset: with_outliers
F1-score en test: 0.8547
F1-score (CV): 0.8469
                      0             1  accuracy     macro avg  weighted avg
precision      0.864807      0.840479  0.852224      0.852643      0.852644
recall         0.834994      0.869456  0.852224      0.852225      0.852224
f1-score       0.849639      0.854722  0.852224      0.852181      0.852180
support    11139.000000  11138.000000  0.852224  22277.000000  22277.000000
-*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*- 


Evaluando dataset: without_outliers
F1-score en test: 0.8091
F1-score (CV): 0.8148
                     0            1  accuracy     macro avg  weighted avg
precision     0.848175     0.791857  0.821222      0.820016      0.822374
recall        0.816164     0.827204  0.821222      0.821684      0.821222
f1-score      0.831862     0.809145  0.821222      0.820503      0.821454
support    8426.000000  7124.000000  0.821222  15550.000000  15550.000000
-*--*--*--*--*--*--*--

<div class="alert alert-info">
    <b style="font-size: 1.5em;">🔍 Review of the results</b>
    <p>The two datasets give broadly similar results: the F1-score on test is <b>~0.855</b> with outliers vs <b>~0.809</b> without, and the cross-validated F1-score is ~0.847 vs ~0.815.</p>
    <p>The dataframe <b>with outliers</b> performs somewhat better than the one without, both in cross-validation and on the test set. This holds for every metric in the classification reports.</p>
    <p>This suggests the following:</p>
    <ul>
        <li>The outliers carry information that is relevant for generalization.</li>
        <li>Removing the outliers is not strictly necessary for this model on this data.</li>
        <li><b>support</b> is much larger with outliers than without (22277 vs 15550 test rows), which makes sense because that dataset has more rows (74256 vs 51832). This works in its favor, since more data is available to reach better generalization.</li>
    </ul>
    <p>Another point in favor is that a Random Forest will be used <i>(it is not very sensitive to outliers)</i>.</p>
</div>
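
The internals of `model_methods.evaluate_datasets` are not shown in this notebook. A minimal sketch of a comparable evaluation, assuming a plain `RandomForestClassifier`, a stratified hold-out split, and 5-fold cross-validation scored with F1, could look like this. The function name `evaluate_dataset`, the one-hot encoding via `pd.get_dummies`, and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split

def evaluate_dataset(df: pd.DataFrame, target_col: str, random_state: int = 42) -> dict:
    """Return the hold-out F1 and the mean cross-validated F1 for one dataset."""
    X = pd.get_dummies(df.drop(columns=target_col))  # one-hot encode categoricals
    y = df[target_col]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=random_state)

    model = RandomForestClassifier(n_estimators=50, random_state=random_state)
    # cross-validation runs on the training split only, so the test split stays unseen
    f1_cv = cross_val_score(model, X_train, y_train, cv=5, scoring='f1').mean()

    model.fit(X_train, y_train)
    f1_test = f1_score(y_test, model.predict(X_test))
    return {'f1_test': f1_test, 'f1_cv': f1_cv}

# toy data with a made-up rule for the target (assumption)
rng = np.random.default_rng(0)
toy = pd.DataFrame({'age': rng.integers(18, 70, 400),
                    'hours_per_week': rng.integers(10, 80, 400),
                    'education': rng.choice(['hs-grad', 'bachelors', 'masters'], 400)})
toy['income'] = ((toy['age'] > 40) & (toy['hours_per_week'] > 40)).astype(int)

scores = evaluate_dataset(toy, target_col='income')
print(scores)
```

Calling the same function on each dataframe and comparing the two dictionaries gives the kind of side-by-side comparison reviewed above.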

In [11]:
df_with_outliers.columns

Index(['age', 'hours_per_week', 'marital_status', 'relationship', 'education',
       'occupation', 'capital_net', 'income'],
      dtype='object')

## (3.2) Final modeling strategy

Columns selected for training *(based on feature importances and mutual information)*, with `income` as the target:
* `'capital_net', 'age', 'hours_per_week', 'marital_status', 'relationship','education', 'occupation', 'income'`

A **GridSearch** will be used to find the best hyperparameters for the model, with emphasis on:
* number of trees (`n_estimators`)
* depth (`max_depth`)
* minimum number of samples required to split a node (`min_samples_split`)
* minimum number of samples per leaf (`min_samples_leaf`)

Each candidate will be evaluated with the metrics already used above: *F1-score, recall, accuracy*.
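
A minimal sketch of such a search, assuming the model sits in a scikit-learn `Pipeline` whose final step is named `classifier` (the step name matches the `classifier__` prefix in the results recorded below, but the preprocessing, the toy data, and the small grid here are assumptions made to keep the example fast):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# toy data with a made-up rule for the target (assumption)
rng = np.random.default_rng(0)
X = pd.DataFrame({'age': rng.integers(18, 70, 300),
                  'hours_per_week': rng.integers(10, 80, 300),
                  'education': rng.choice(['hs-grad', 'bachelors', 'masters'], 300)})
y = ((X['age'] > 40) & (X['hours_per_week'] > 40)).astype(int)

# one-hot encode the categorical column, pass numeric columns through unchanged
preprocessor = ColumnTransformer(
    transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), ['education'])],
    remainder='passthrough')

pipe = Pipeline([('preprocessor', preprocessor),
                 ('classifier', RandomForestClassifier(random_state=42))])

# the 'classifier__' prefix routes each parameter to the step named 'classifier'
param_grid = {'classifier__n_estimators': [25, 50],
              'classifier__max_depth': [None, 5],
              'classifier__min_samples_leaf': [1, 4]}

search = GridSearchCV(pipe, param_grid, scoring='f1', cv=3)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Keeping the preprocessing inside the pipeline means each cross-validation fold fits the encoder on its own training portion, which avoids leaking information from the validation fold.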


In [13]:
model_path = '../models/randomforest_probe_.pkl'
params_grid = {'n_estimators'     : [50, 250, 300],      # number of trees
               'max_depth'        : [None, 10, 20, 30],  # maximum depth of each tree
               'min_samples_split': [2, 5, 10, 20],      # minimum samples to split a node
               'min_samples_leaf' : [1, 2, 4, 5]}        # minimum samples per leaf

##path: ./modules/module_modeling.py (func 4)
# results_pipeline = (model_methods
#                     .training_rf_bagging(df= df_without_outliers,
#                                             target_col= 'income',
#                                             param_grid= params_grid,
#                                             output_model_path= model_path)
#                   )

#---- RESULTS: ----
# Best hyperparameters: {'classifier__max_depth': 20, 'classifier__min_samples_leaf': 1,
#                        'classifier__min_samples_split': 2, 'classifier__n_estimators': 300}
# F1-score on test: 0.8347
# Classification report:
#               precision    recall  f1-score   support

#           no       0.88      0.80      0.84      8426
#          yes       0.79      0.87      0.83      7124

#     accuracy                           0.83     15550
#    macro avg       0.84      0.84      0.83     15550
# weighted avg       0.84      0.83      0.84     15550

These results could be improved, so we will try a new strategy based on stacking:
* RandomForestClassifier + XGBClassifier *(each one tuned with grid search)*
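
The internals of `model_methods.training_stacking` are not shown here. As a rough sketch of the stacking idea, the example below combines two base models through scikit-learn's `StackingClassifier`. To keep it self-contained, `GradientBoostingClassifier` stands in for XGBoost, and the `LogisticRegression` meta-learner, the synthetic data, and all hyperparameter values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# synthetic stand-in for the income dataset (assumption)
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# base models; in the notebook each one would first be tuned with a grid search
base_models = [('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
               ('gb', GradientBoostingClassifier(n_estimators=50, random_state=42))]

# the meta-learner is trained on out-of-fold predictions of the base models
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(),
                           cv=3)
stack.fit(X_train, y_train)

stack_f1 = f1_score(y_test, stack.predict(X_test))
print(f'F1-score on test: {stack_f1:.4f}')
```

Because the meta-learner only sees out-of-fold predictions, it learns how much to trust each base model without being fed predictions those models made on their own training rows.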

In [None]:
#model_path = '../models/stacking_optimized_0.pkl'
## the param grid is defined inside the function
#results = model_methods.training_stacking(df=df_with_outliers,
#                                          target_col= 'income',
#                                          output_model_path= model_path)

#pprint.pprint(results['classification_report'])

#---- RESULTS: ----
# class 0: 'f1-score': 0.8609150087100027, 'precision': 0.8796252927400469
# overall: 'accuracy': 0.8638057189029044

#Classification Report:
#               precision    recall  f1-score   support

#            0       0.88      0.84      0.86     11139
#            1       0.85      0.88      0.87     11138

#     accuracy                           0.86     22277
#    macro avg       0.86      0.86      0.86     22277
# weighted avg       0.86      0.86      0.86     22277

Optimizando Random Forest...
Fitting 3 folds for each of 36 candidates, totalling 108 fits
[CV] END classifier__max_depth=10, classifier__min_samples_leaf=1, classifier__min_samples_split=2, classifier__n_estimators=100; total time=   6.7s
[CV] END classifier__max_depth=10, classifier__min_samples_leaf=1, classifier__min_samples_split=2, classifier__n_estimators=100; total time=   6.7s
[CV] END classifier__max_depth=10, classifier__min_samples_leaf=1, classifier__min_samples_split=2, classifier__n_estimators=100; total time=   6.7s
[CV] END classifier__max_depth=10, classifier__min_samples_leaf=1, classifier__min_samples_split=2, classifier__n_estimators=150; total time=  10.1s
[CV] END classifier__max_depth=10, classifier__min_samples_leaf=1, classifier__min_samples_split=2, classifier__n_estimators=150; total time=  10.2s
[CV] END classifier__max_depth=10, classifier__min_samples_leaf=1, classifier__min_samples_split=2, classifier__n_estimators=150; total time=  10.3s
[CV] END classi

Parameters: { "use_label_encoder" } are not used.

[CV] END classifier__learning_rate=0.01, classifier__max_depth=3, classifier__n_estimators=100; total time=   0.6s
[CV] END classifier__learning_rate=0.01, classifier__max_depth=3, classifier__n_estimators=100; total time=   0.6s
[CV] END classifier__learning_rate=0.01, classifier__max_depth=3, classifier__n_estimators=100; total time=   0.6s
[CV] END classifier__learning_rate=0.01, classifier__max_depth=3, classifier__n_estimators=150; total time=   0.7s
[CV] END classifier__learning_rate=0.01, classifier__max_depth=3, classifier__n_estimators=150; total time=   0.7s
[CV] END classifier__learning_rate=0.01, classifier__max_depth=3, classifier__n_estimators=150; total time=   0.7s


[CV] END classifier__learning_rate=0.01, classifier__max_depth=5, classifier__n_estimators=100; total time=   0.7s
[CV] END classifier__learning_rate=0.01, classifier__max_depth=5, classifier__n_estimators=100; total time=   0.7s


[CV] END classifier__learning_rate=0.1, classifier__max_depth=3, classifier__n_estimators=100; total time=   0.5s
[CV] END classifier__learning_rate=0.1, classifier__max_depth=3, classifier__n_estimators=100; total time=   0.6s
[CV] END classifier__learning_rate=0.01, classifier__max_depth=5, classifier__n_estimators=100; total time=   0.7s
[CV] END classifier__learning_rate=0.1, classifier__max_depth=3, classifier__n_estimators=100; total time=   0.5s


[CV] END classifier__learning_rate=0.1, classifier__max_depth=3, classifier__n_estimators=150; total time=   0.7s
[CV] END classifier__learning_rate=0.01, classifier__max_depth=5, classifier__n_estimators=150; total time=   1.0s
[CV] END classifier__learning_rate=0.01, classifier__max_depth=5, classifier__n_estimators=150; total time=   1.0s


[CV] END classifier__learning_rate=0.01, classifier__max_depth=5, classifier__n_estimators=150; total time=   1.0s


[CV] END classifier__learning_rate=0.1, classifier__max_depth=3, classifier__n_estimators=150; total time=   0.7s
[CV] END classifier__learning_rate=0.1, classifier__max_depth=3, classifier__n_estimators=150; total time=   0.7s
[CV] END classifier__learning_rate=0.1, classifier__max_depth=5, classifier__n_estimators=100; total time=   0.7s
[CV] END classifier__learning_rate=0.1, classifier__max_depth=5, classifier__n_estimators=100; total time=   0.7s
[CV] END classifier__learning_rate=0.1, classifier__max_depth=5, classifier__n_estimators=100; total time=   0.6s
[CV] END classifier__learning_rate=0.1, classifier__max_depth=5, classifier__n_estimators=150; total time=   0.9s
[CV] END classifier__learning_rate=0.1, classifier__max_depth=5, classifier__n_estimators=150; total time=   0.9s
[CV] END classifier__learning_rate=0.1, classifier__max_depth=5, classifier__n_estimators=150; total time=   0.8s


Mejor XGBoost: {'classifier__learning_rate': 0.1, 'classifier__max_depth': 5, 'classifier__n_estimators': 150}
Entrenando modelo Stacking...



Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.84      0.86     11139
           1       0.85      0.88      0.87     11138

    accuracy                           0.86     22277
   macro avg       0.86      0.86      0.86     22277
weighted avg       0.86      0.86      0.86     22277

Guardando modelo en: ../models/stacking_optimized_0.pkl
{'0': {'f1-score': 0.8609150087100027,
       'precision': 0.8796252927400469,
       'recall': 0.8429841098841907,
       'support': 11139.0},
 '1': {'f1-score': 0.8665787159190853,
       'precision': 0.8492501292880538,
       'recall': 0.8846291973424313,
       'support': 11138.0},
 'accuracy': 0.8638057189029044,
 'macro avg': {'f1-score': 0.863746862314544,
               'precision': 0.8644377110140503,
               'recall': 0.863806653613311,
               'support': 22277.0},
 'weighted avg': {'f1-score': 0.8637467351944828,
                  'precision': 0.864438392774688

With an accuracy and macro F1-score of about 0.86 on the test set, the model seems good enough to move on to deployment.

## (3.3) Prediction

In [32]:
# load the full pipeline
pipeline = joblib.load('../models/stacking_optimized_0.pkl')

# new data for prediction
new_data = pd.DataFrame({'age'           : [26],
                         'hours_per_week': [56],
                         'marital_status': ['married-civ-spouse'],
                         'relationship'  : ['husband'],
                         'education'     : ['bachelors'],
                         'occupation'    : ['exec-managerial'],
                         'capital_net'   : [10000]})

def translate_pred(pipeline, new_data: pd.DataFrame, return_prob: bool= False) -> str | tuple:
    """Print the predicted class and class probabilities for the first row of new_data."""
    pred = pipeline.predict(new_data)
    prob = pipeline.predict_proba(new_data)

    # translate the numeric prediction into a label
    pred_class = 'yes' if int(pred[0]) == 1 else 'no'

    prob_no  = round(prob[0][0], 3)  # probability of 'no'
    prob_yes = round(prob[0][1], 3)  # probability of 'yes'

    print(f'prediction: {pred_class}\nprobabilities\n- no: {prob_no:.2%}\n- yes: {prob_yes:.2%}')

    # return the probabilities only when requested
    if return_prob:
        return pred_class, prob_no, prob_yes
    return pred_class


translate_pred(pipeline, new_data, return_prob= False)


prediction: yes
probabilities
- no: 5.70%
- yes: 94.30%


'yes'

<div class="alert alert-success">
    <b style="font-size: 1.5em;">🎉 Training complete</b>
</div>