<a href="https://colab.research.google.com/github/lucasfelipecdm/tech-challenge-fase-1/blob/develop/modeling/model_building_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Treinamento e validação do modelo

Nessa fase faremos o treinamento do nosso modelo utilizando a [tabela pré-processada](https://raw.githubusercontent.com/lucasfelipecdm/tech-challenge-fase-1/develop/data/dataset_preprocessed.csv) durante nossa última [fase](https://github.com/lucasfelipecdm/tech-challenge-fase-1/blob/develop/preprocessing/data_preprocessing.ipynb) começaremos pela separação da base pré-processada em base de treino e teste.

In [93]:
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
import pandas as pd
import numpy as np
import plotly.express as px
import os

import warnings
warnings.filterwarnings("ignore")

np.random.seed(42)
mpl.rc('axes', labelsize=12)
mpl.rc('xtick', labelsize=10)
mpl.rc('ytick', labelsize=10)
plt.figure(figsize=(10,5))

preprocessed_dataset_url = "https://raw.githubusercontent.com/lucasfelipecdm/tech-challenge-fase-1/develop/data/preprocessed.csv"
dataset_preprocessed = pd.read_csv(preprocessed_dataset_url)
dataset_preprocessed.drop(['lines'], axis = 1, inplace = True)
dataset_preprocessed.head(5)

Unnamed: 0,age,sex,bmi,children,smoker,charges
0,19,0,27.9,0,1,16884.924
1,18,1,33.77,1,0,1725.5523
2,28,1,33.0,3,0,4449.462
3,33,1,22.705,0,0,21984.47061
4,32,1,28.88,0,0,3866.8552


<Figure size 1000x500 with 0 Axes>

In [94]:
dataset_preprocessed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   int64  
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   int64  
 5   charges   1338 non-null   float64
dtypes: float64(2), int64(4)
memory usage: 62.8 KB


### Separação das tabelas e remoção da coluna target

Nesse passo vamos separar em tabelas X e y nossa variavel target e nossa tabela com as outras variáveis

In [95]:
X = dataset_preprocessed.drop("charges", axis = 1)
y = dataset_preprocessed.charges

#### Separação entre treino e teste

Aqui usamos o metódo `train_test_split` do sklearn para separarmos nossas base de dados em tabelas de treino e teste.

In [96]:
from sklearn.model_selection import train_test_split, RandomizedSearchCV
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

#### Feature Scaling ( Normalização e Padronização dos dados )

Aqui utilizaremos um metódo também do sklearn para padronização dos dados deixando eles em uma mesma escala para que valores com maior peso não tenha impacto e crie viés em nosso modelo.

In [97]:
from sklearn.preprocessing import StandardScaler
standard_scaler = StandardScaler()
X_train = standard_scaler.fit_transform(X_train)
X_train

array([[ 1.54446486, -1.02597835,  0.10596012, -0.91501097, -0.51298918],
       [ 0.48187425,  0.97467943, -0.49198238, -0.91501097, -0.51298918],
       [ 1.04858924, -1.02597835,  0.23025154,  1.56027883, -0.51298918],
       ...,
       [ 1.33194673,  0.97467943, -0.89928872, -0.91501097, -0.51298918],
       [-0.15568012, -1.02597835,  2.81517714,  0.73518223,  1.94935887],
       [ 1.11942861,  0.97467943, -0.10567121, -0.91501097, -0.51298918]])

Apenas aplicaremos o `.fit_transform` na base de treinamento e não na base de teste.

In [98]:
X_test = standard_scaler.transform(X_test)
X_test

array([[ 0.41103487, -1.02597835, -0.89928872,  0.73518223, -0.51298918],
       [-0.22651949, -1.02597835, -0.08551585, -0.91501097, -0.51298918],
       [ 1.75698298, -1.02597835, -0.61207477, -0.91501097,  1.94935887],
       ...,
       [-1.50162823, -1.02597835, -0.38868613, -0.91501097, -0.51298918],
       [ 1.33194673,  0.97467943,  0.9323301 , -0.91501097, -0.51298918],
       [-1.35994948,  0.97467943, -1.4325661 , -0.08991437, -0.51298918]])

### Criação do modelo

Agora criaremos o modelo e testaremos alguns tipos de regressão, para ser mais exato utilizaremos 5 tipos de regressão, sendo elas: Multi Linear, Floresta Aleatória, Árvore de decisão, XG Boost e Aumento de gradiente:

In [99]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
import xgboost as xgb


Também importaremos alguns metódos que nos ajudaram na avaliação dos resultados do modelo.

In [100]:
from sklearn import metrics
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt

Vamos criar um dicionário para salvar os resultados de cada tipo de regressão

In [101]:
from collections import OrderedDict
model_rmse = OrderedDict()
model_r2 = OrderedDict()

#### Regressão Multi Linear

Vamos começar com a regressão multi linear

In [102]:
linear_regressor = LinearRegression()
linear_regressor.fit(X_train, y_train)

In [103]:
y_pred = linear_regressor.predict(X_test)

In [104]:
### Calculating RMSE and R-squared for the model

mse = round(mean_squared_error(y_test, y_pred), 3)
rmse = round(sqrt(mse), 3)

r2_value = round(r2_score(y_test, y_pred), 3)

model_rmse['Multi Linear Regression'] = rmse
model_r2['Multi Linear Regression'] = r2_value

print('Root Mean Squared Error of the model is : {}'.format(rmse))
print('R-squared value of the model is : {}'.format(r2_value))

Root Mean Squared Error of the model is : 4955.23
R-squared value of the model is : 0.762


In [105]:
decision_tree_regressor = DecisionTreeRegressor()
decision_tree_regressor.fit(X_train, y_train)

In [106]:
Y_pred = decision_tree_regressor.predict(X_test)

In [107]:
mse = round(mean_squared_error(y_test, y_pred), 3)
rmse = round(sqrt(mse), 3)

r2_value = round(r2_score(y_test, y_pred), 3)

model_rmse['Decision Tree Regression'] = rmse
model_r2['Decision Tree Regression'] = r2_value

print('Root Mean Squared Error of the model is : {}'.format(rmse))
print('R-squared value of the model is : {}'.format(r2_value))

Root Mean Squared Error of the model is : 4955.23
R-squared value of the model is : 0.762


In [108]:
random_forest_regressor = RandomForestRegressor(n_estimators = 1000, random_state = 27)
random_forest_regressor.fit(X_train, y_train)

In [109]:
y_pred = random_forest_regressor.predict(X_test)

In [110]:
### Calculating RMSE and R-squared for the model

mse = round(mean_squared_error(y_test, y_pred), 3)
rmse = round(sqrt(mse), 3)

r2_value = round(r2_score(y_test, y_pred), 3)

model_rmse['Random Forest Regression (1000 trees)'] = rmse
model_r2['Random Forest Regression (1000 trees)'] = r2_value

print('Root Mean Squared Error of the model is : {}'.format(rmse))
print('R-squared value of the model is : {}'.format(r2_value))

Root Mean Squared Error of the model is : 4453.517
R-squared value of the model is : 0.808


In [111]:
xgb_regressor = xgb.XGBRegressor()
xgb_regressor.fit(X_train, y_train)

In [112]:
y_pred = xgb_regressor.predict(X_test)

In [113]:
### Calculating RMSE and R-squared for the model

mse = round(mean_squared_error(y_test, y_pred), 3)
rmse = round(sqrt(mse), 3)

r2_value = round(r2_score(y_test, y_pred), 3)

model_rmse['XGBoost Regression'] = rmse
model_r2['XGBoost Regression'] = r2_value

print('Root Mean Squared Error of the model is : {}'.format(rmse))
print('R-squared value of the model is : {}'.format(r2_value))

Root Mean Squared Error of the model is : 4925.127
R-squared value of the model is : 0.765


In [114]:
gradient_boosting_regressor = GradientBoostingRegressor(n_estimators=100, random_state=23)
gradient_boosting_regressor.fit(X_train, y_train)

In [115]:
y_pred = gradient_boosting_regressor.predict(X_test)

In [116]:
### Calculating RMSE and R-squared for the model

mse = round(mean_squared_error(y_test, y_pred), 3)
rmse = round(sqrt(mse), 3)

r2_value = round(r2_score(y_test, y_pred), 3)

model_rmse['Gradient Boosting Regression'] = rmse
model_r2['Gradient Boosting Regression'] = r2_value

print('Root Mean Squared Error of the model is : {}'.format(rmse))
print('R-squared value of the model is : {}'.format(r2_value))

Root Mean Squared Error of the model is : 4269.244
R-squared value of the model is : 0.823


In [117]:
model_rmse

OrderedDict([('Multi Linear Regression', 4955.23),
             ('Decision Tree Regression', 4955.23),
             ('Random Forest Regression (1000 trees)', 4453.517),
             ('XGBoost Regression', 4925.127),
             ('Gradient Boosting Regression', 4269.244)])

In [118]:
model_r2

OrderedDict([('Multi Linear Regression', 0.762),
             ('Decision Tree Regression', 0.762),
             ('Random Forest Regression (1000 trees)', 0.808),
             ('XGBoost Regression', 0.765),
             ('Gradient Boosting Regression', 0.823)])

In [119]:
### Tabulating the results
from tabulate import tabulate
table = []
table.append(['S.No.', 'Classification Model', 'Root Mean Squared Error', 'R-squared'])
count = 1

for model in model_rmse:
    row = [count, model, model_rmse[model], model_r2[model]]
    table.append(row)
    count += 1

print(tabulate(table, headers = 'firstrow', tablefmt = 'fancy_grid'))

╒═════════╤═══════════════════════════════════════╤═══════════════════════════╤═════════════╕
│   S.No. │ Classification Model                  │   Root Mean Squared Error │   R-squared │
╞═════════╪═══════════════════════════════════════╪═══════════════════════════╪═════════════╡
│       1 │ Multi Linear Regression               │                   4955.23 │       0.762 │
├─────────┼───────────────────────────────────────┼───────────────────────────┼─────────────┤
│       2 │ Decision Tree Regression              │                   4955.23 │       0.762 │
├─────────┼───────────────────────────────────────┼───────────────────────────┼─────────────┤
│       3 │ Random Forest Regression (1000 trees) │                   4453.52 │       0.808 │
├─────────┼───────────────────────────────────────┼───────────────────────────┼─────────────┤
│       4 │ XGBoost Regression                    │                   4925.13 │       0.765 │
├─────────┼───────────────────────────────────────┼─────────