# Models Evaluation

The last step before deploying the best model is to evaluate all the trained models on the test dataset. The model with the best performance will then be selected for deployment.

## Required Imports

In [1]:
import sys
sys.path.append('../../')

In [2]:
import os
import pickle
import pandas as pd
import numpy as np
import warnings

from src.metrics.models_metrics import print_sklearn_model_metrics
from src.processing.features_building import build_features
from src.utils.obesity_encoder import ENCODER_NOBESITY, get_class_encoder

In [3]:
warnings.filterwarnings("ignore")

## Constants Definition

In [4]:
DATA_PATH = '../../data/'
RAW_PATH = DATA_PATH + 'raw/'
PROCESSED_PATH = DATA_PATH + 'processed/'

RESULTS_PATH = '../../results/'
MODELS_PATH = RESULTS_PATH + 'models/'
PREDICTIONS_PATH = RESULTS_PATH + 'predictions/'

## Data Loading

In [5]:
test_df = pd.read_csv(RAW_PATH + 'test.csv')

## Data Processing

In [6]:
id_col = test_df.pop('id')
test_df = build_features(test_df)

## Model Loading

In [7]:
with open(MODELS_PATH + 'model.logreg.pkl', 'rb') as file:
    logreg = pickle.load(file)

with open(MODELS_PATH + 'model.xgboost.pkl', 'rb') as file:
    xgboost = pickle.load(file)

with open(MODELS_PATH + 'model.rf.pkl', 'rb') as file:
    rf = pickle.load(file)
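
The three loading blocks above follow the same pattern, so they could be collapsed into a loop. The sketch below is one possible way to do it, assuming the `model.<name>.pkl` naming convention seen above; `load_models` is a hypothetical helper that does not exist in the project, and the demo uses throwaway stand-in objects in a temporary directory rather than the real models.

```python
import pickle
import tempfile
from pathlib import Path


def load_models(models_dir, names):
    """Load every 'model.<name>.pkl' file in models_dir into a dict keyed by name.

    Hypothetical helper: it only mirrors the file naming used in this notebook.
    """
    models = {}
    for name in names:
        path = Path(models_dir) / f"model.{name}.pkl"
        with open(path, "rb") as file:
            models[name] = pickle.load(file)
    return models


# Demo with stand-in objects so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("logreg", "rf"):
        with open(Path(tmp) / f"model.{name}.pkl", "wb") as file:
            pickle.dump({"name": name}, file)
    loaded = load_models(tmp, ["logreg", "rf"])
```

In this notebook the call would look like `load_models(MODELS_PATH, ['logreg', 'xgboost', 'rf'])`. As with any pickle file, this should only be used on model files from a trusted source.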

## Test Metrics

Because the test dataset does not include the target column, we evaluate the models by submitting their predictions to the [Kaggle competition](https://www.kaggle.com/competitions/playground-series-s4e2/). To do so, we write each model's predictions in the submission format the competition expects.

In [8]:
# Predict the encoded class for every test row
y_pred_logreg = logreg.predict(test_df)
logreg_submission_df = pd.DataFrame(
    data=id_col,
    columns=['id']
)

# Map the encoded predictions to class labels with the project encoder
logreg_submission_df['NObesydad_encoded'] = y_pred_logreg
logreg_submission_df['NObesydad'] = logreg_submission_df.apply(
    lambda row: get_class_encoder(ENCODER_NOBESITY, row.NObesydad_encoded), axis=1
)
logreg_submission_df.drop(['NObesydad_encoded'], axis=1, inplace=True)

In [9]:
logreg_submission_df.to_csv(PREDICTIONS_PATH + 'logreg_submission.csv', index=False)

In [10]:
# Predict the encoded class for every test row
y_pred_rf = rf.predict(test_df)
rf_submission_df = pd.DataFrame(
    data=id_col,
    columns=['id']
)

# Map the encoded predictions to class labels with the project encoder
rf_submission_df['NObesydad_encoded'] = y_pred_rf
rf_submission_df['NObesydad'] = rf_submission_df.apply(
    lambda row: get_class_encoder(ENCODER_NOBESITY, row.NObesydad_encoded), axis=1
)
rf_submission_df.drop(['NObesydad_encoded'], axis=1, inplace=True)

In [11]:
rf_submission_df.to_csv(PREDICTIONS_PATH + 'rf_submission.csv', index=False)

In [12]:
# Predict the encoded class for every test row
y_pred_xgboost = xgboost.predict(test_df)
xgboost_submission_df = pd.DataFrame(
    data=id_col,
    columns=['id']
)

# Map the encoded predictions to class labels with the project encoder
xgboost_submission_df['NObesydad_encoded'] = y_pred_xgboost
xgboost_submission_df['NObesydad'] = xgboost_submission_df.apply(
    lambda row: get_class_encoder(ENCODER_NOBESITY, row.NObesydad_encoded), axis=1
)
xgboost_submission_df.drop(['NObesydad_encoded'], axis=1, inplace=True)

In [13]:
xgboost_submission_df.to_csv(PREDICTIONS_PATH + 'xgboost_submission.csv', index=False)
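
The three submission cells above repeat the same steps: predict, attach the ids, and decode the predictions. A minimal sketch of how this could be factored into one function follows. `build_submission` and `EXAMPLE_LABELS` are hypothetical names introduced here for illustration; the label mapping is a stand-in and is not the content of `ENCODER_NOBESITY`.

```python
import pandas as pd

# Hypothetical label mapping, for illustration only (not the real ENCODER_NOBESITY).
EXAMPLE_LABELS = {0: "class_a", 1: "class_b"}


def build_submission(ids, encoded_preds, decode, target_col="NObesydad"):
    """Return a two-column DataFrame: the row id and the decoded class label.

    `decode` is any callable that turns one encoded prediction into its label.
    """
    return pd.DataFrame({
        "id": list(ids),
        target_col: [decode(code) for code in encoded_preds],
    })


# Demo with made-up ids and predictions.
example = build_submission([101, 102, 103], [0, 1, 0], EXAMPLE_LABELS.get)
```

In this notebook, the call for the Random Forest could look like `build_submission(id_col, rf.predict(test_df), lambda code: get_class_encoder(ENCODER_NOBESITY, code))`, followed by the same `to_csv(..., index=False)` call used above.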

## Kaggle Results

![Results](../../public/imgs/Kaggle_Submission_Classic_Models.png)

All the models performed well in the Kaggle evaluation, but the Random Forest scored best, so it is the model that will be deployed.

## Feature Importance

Since the models performed well on both the training and test datasets, looking at the feature importance that each model assigns gives useful insight into their behaviour.
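
As a rough sketch of how feature importance can be read from fitted scikit-learn models, the example below trains a Random Forest and a logistic regression on synthetic stand-in data. The `*_demo` models and the `feature_<i>` column names are assumptions made for illustration and are not the models loaded above. Tree ensembles expose `feature_importances_`, while for logistic regression the mean absolute value of `coef_` across classes is used here as a simple proxy, which is only meaningful when the features share a comparable scale.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; the notebook would use its own feature matrix.
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X = pd.DataFrame(X, columns=feature_names)

# Tree ensembles: impurity-based importances, normalised to sum to 1.
rf_demo = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
rf_importance = pd.Series(
    rf_demo.feature_importances_, index=feature_names
).sort_values(ascending=False)

# Linear models: coef_ has shape (1 or n_classes, n_features);
# average the absolute coefficients over the class axis.
logreg_demo = LogisticRegression(max_iter=1000).fit(X, y)
logreg_importance = pd.Series(
    np.abs(logreg_demo.coef_).mean(axis=0), index=feature_names
).sort_values(ascending=False)

print(rf_importance)
print(logreg_importance)
```

If the pickled models are wrapped in a scikit-learn `Pipeline`, the final estimator would need to be extracted first (for example through `named_steps`) before reading these attributes.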