<a href="https://colab.research.google.com/github/pchernic/regression/blob/main/%5BLinear_Regression%5D_Insurance_case.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Insurance Case

- **Goal:** Predict insurance costs (`custos_seguro`)
- **Approach:** Supervised learning (regression)

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns


In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


## Data Understanding

In [None]:
df = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/Regression/insurance.xlsx')
df.head()

Unnamed: 0,idade,sexo,imc,quantidade_filhos,fumante,regiao,custos_seguro
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.56,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [None]:
df.shape

(1341, 7)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1341 entries, 0 to 1340
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   idade              1341 non-null   int64  
 1   sexo               1338 non-null   object 
 2   imc                1341 non-null   float64
 3   quantidade_filhos  1341 non-null   int64  
 4   fumante            1341 non-null   object 
 5   regiao             1341 non-null   object 
 6   custos_seguro      1341 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.5+ KB


In [None]:
df.isnull().sum()

idade                0
sexo                 3
imc                  0
quantidade_filhos    0
fumante              0
regiao               0
custos_seguro        0
dtype: int64

In [None]:
df.dropna(inplace=True)

## Label Encoder
Encoding the categorical features (`sexo`, `fumante`, `regiao`) as integers.


In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [None]:
df = pd.DataFrame(df)

In [None]:
le.fit(df.sexo)
df.sexo = le.transform(df.sexo)

le.fit(df.fumante)
df.fumante = le.transform(df.fumante)

le.fit(df.regiao)
df.regiao = le.transform(df.regiao)
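As a small illustration of what the encoder does, the sketch below uses a hypothetical toy frame (not the insurance data). `LabelEncoder` sorts the distinct values and maps them to 0, 1, 2, ..., so the integer codes follow alphabetical order rather than any meaningful ranking, which is worth keeping in mind for `regiao`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, only to show how the codes are assigned
toy = pd.DataFrame({
    "sexo": ["female", "male", "male"],
    "regiao": ["southwest", "southeast", "northwest"],
})

encoders = {}
for col in ["sexo", "regiao"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc  # keep one encoder per column to decode later

# classes_ lists the original labels in the order of their integer codes
sexo_classes = list(encoders["sexo"].classes_)
regiao_classes = list(encoders["regiao"].classes_)
print(sexo_classes, regiao_classes)
```

Keeping one fitted encoder per column, instead of refitting a single shared one, makes it possible to call `inverse_transform` later and recover the original labels.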

Normalizing: MinMaxScaler

  - Normalizing rescales each numerical feature to a fixed range, typically 0 to 1, regardless of the original data distribution. Here every column is scaled, including the target `custos_seguro`.

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

df_norm = pd.DataFrame(scaler.fit_transform(df), index=df.index, columns=df.columns)
df_norm.head()

Unnamed: 0,idade,sexo,imc,quantidade_filhos,fumante,regiao,custos_seguro
0,0.021739,0.0,0.321227,0.0,1.0,1.0,0.251611
1,0.0,1.0,0.47915,0.2,0.0,0.666667,0.009636
2,0.217391,1.0,0.4735,0.6,0.0,0.666667,0.053115
3,0.326087,1.0,0.181464,0.0,0.0,0.333333,0.33301
4,0.304348,1.0,0.347592,0.0,0.0,0.333333,0.043816
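Min-max scaling maps each column with x' = (x - min) / (max - min). The sketch below applies the formula by hand and compares it with `MinMaxScaler`. The toy ages are an assumption, chosen so that the minimum is 18 and the range is 46, which is consistent with the scaled `idade` values shown above (19 maps to 1/46 ≈ 0.021739).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical ages; min 18 and max 64 are assumed for illustration
ages = np.array([[18.0], [19.0], [28.0], [64.0]])

# Manual min-max scaling, column by column
manual = (ages - ages.min(axis=0)) / (ages.max(axis=0) - ages.min(axis=0))

scaler = MinMaxScaler()
scaled = scaler.fit_transform(ages)

print(np.round(scaled.ravel(), 6))

# inverse_transform maps scaled values back to the original units
restored = scaler.inverse_transform(scaled)
```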


### Statsmodels formula

In [None]:
import statsmodels.formula.api as smf


- We want to predict `custos_seguro` from the following variables:
`idade + imc + quantidade_filhos + fumante + regiao`

In [None]:
function = "custos_seguro~idade+imc+quantidade_filhos+fumante+regiao"
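In a formula like the one above, the left side of `~` is the response and the terms joined by `+` are the predictors; an intercept is added by default. As a rough picture of what the fit then does, the sketch below solves the same kind of least-squares problem with NumPy on made-up, noise-free data. The variable names `a` and `b` and the coefficients are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up predictors and a noise-free response: y = 0.5 + 2*a - 1*b
a = rng.uniform(size=50)
b = rng.uniform(size=50)
y = 0.5 + 2.0 * a - 1.0 * b

# Design matrix with a leading column of ones for the intercept,
# which is what a formula like "y ~ a + b" implies
X = np.column_stack([np.ones_like(a), a, b])

# Ordinary least squares: minimise ||X @ beta - y||^2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 6))
```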


### OLS Metrics

##### fit

In [None]:
model = smf.ols(formula=function, data=df_norm).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:          custos_seguro   R-squared:                       0.751
Model:                            OLS   Adj. R-squared:                  0.750
Method:                 Least Squares   F-statistic:                     802.2
Date:                Mon, 17 Jul 2023   Prob (F-statistic):               0.00
Time:                        19:49:50   Log-Likelihood:                 1230.3
No. Observations:                1338   AIC:                            -2449.
Df Residuals:                    1332   BIC:                            -2417.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept            -0.0488      0.00

The p-values are very low, R-squared and adjusted R-squared (0.751 and 0.750) are acceptable, and the model has only five predictors, so we keep all of them and skip backward elimination.

#### Train Test split settings:

In [None]:
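# Predictors: the same five used in the OLS formula (`sexo` is left out)
x = df_norm[["idade", "imc", "quantidade_filhos", "fumante", "regiao"]]

# Response variable, kept as a one-column DataFrame; this 2-D shape is
# what triggers the `column_or_1d` warnings in the fits further down
y = df_norm[["custos_seguro"]]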

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

### Linear Regression

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2, random_state=0)

lr = LinearRegression()

lr.fit(x_train, y_train)

#### LinearRegression Assessment Metric:

r-squared

In [None]:
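from sklearn import metrics
# Caution: x and y are the full dataset (training rows included),
# so this is not a held-out R-squared
r_sq = lr.score(x,y)
print(r_sq)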

0.7505629738411864
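Because `lr.score(x, y)` evaluates on rows the model was trained on, it can overstate how well the model generalizes. A sketch of scoring on the held-out split only is shown below, using synthetic data from `make_regression` as a stand-in for the insurance table; the numbers it prints are not the notebook's results.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: 5 features, like the model above)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

lr = LinearRegression().fit(X_train, y_train)

# R-squared on data the model has not seen
r2_test = lr.score(X_test, y_test)

# For regressors, score() is the same quantity as r2_score on predictions
r2_check = r2_score(y_test, lr.predict(X_test))
print(r2_test)
```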


MAE

In [None]:
# MAE on the training split only, in min-max scaled units
y_pred_train = lr.predict(x_train)
print("MAE:", metrics.mean_absolute_error(y_train, y_pred_train))

MAE: 0.06760649849946623
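MAE is the mean of the absolute differences between targets and predictions. Because `custos_seguro` was min-max scaled, the value above is in scaled units; multiplying it by the original range (max minus min of the unscaled column) converts it back to currency. The sketch uses made-up numbers, and `cost_min` / `cost_max` are hypothetical, not values taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up scaled targets and predictions
y_true = np.array([0.10, 0.40, 0.25, 0.80])
y_pred = np.array([0.15, 0.30, 0.25, 0.70])

# MAE by definition: mean(|y_true - y_pred|)
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

# Hypothetical range of the unscaled target column
cost_min, cost_max = 1_000.0, 60_000.0

# Min-max scaling is linear, so errors scale by the range
mae_original_units = mae_sklearn * (cost_max - cost_min)
print(mae_sklearn, mae_original_units)
```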


# Machine Learning Models

### Random Forest
Parallel learning: the trees are fit independently of one another.

In [None]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()
rf.fit(x_train, y_train)

  rf.fit(x_train, y_train)


##### Random Forest Metrics

In [None]:
from sklearn import metrics
# R-squared on the full dataset (training rows included)
r_sq = rf.score(x,y)
print("R-Squared", r_sq)
# First MAE printed: training split
y_pred_train = rf.predict(x_train)
print("MAE:", metrics.mean_absolute_error(y_train, y_pred_train))

# Second MAE printed: test split
y_pred_test = rf.predict(x_test)
print("MAE:", metrics.mean_absolute_error(y_test, y_pred_test))

R-Squared 0.9512793932618829
MAE: 0.016336230546173686
MAE: 0.04405332013161706


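Compared with linear regression, Random Forest has a higher R-squared (0.951 vs. 0.751, both computed on the full dataset) and a lower training MAE (0.016 vs. 0.068). Its test MAE (0.044) is well above its training MAE, which points to some overfitting. No test MAE was computed for linear regression, so the test comparison is not shown here.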
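To compare models on an equal footing, the same metrics can be computed for each one on the same splits. The sketch below does this with a hypothetical helper named `evaluate` on synthetic stand-in data; `n_estimators=50` and the `random_state` values are arbitrary choices, and the printed numbers are not the notebook's results.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def evaluate(model, X_train, X_test, y_train, y_test):
    """Hypothetical helper: fit a model, return train and test MAE."""
    model.fit(X_train, y_train)
    return {
        "mae_train": mean_absolute_error(y_train, model.predict(X_train)),
        "mae_test": mean_absolute_error(y_test, model.predict(X_test)),
    }


# Synthetic stand-in data
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# train_test_split returns X_train, X_test, y_train, y_test in that order
splits = train_test_split(X, y, test_size=0.2, random_state=0)

results = {
    "linear": evaluate(LinearRegression(), *splits),
    "forest": evaluate(
        RandomForestRegressor(n_estimators=50, random_state=0), *splits
    ),
}
for name, scores in results.items():
    print(name, scores)
```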

### AdaBoost
 - Adaptive Boosting
 - Sequential learning: each new estimator is trained with more weight on the examples the previous ones predicted poorly
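One way to see the sequential nature of boosting is `staged_predict`, which yields the ensemble's prediction after each added estimator. The sketch below uses synthetic data as a stand-in; `n_estimators=20` and `random_state=0` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in data
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

ada = AdaBoostRegressor(n_estimators=20, random_state=0)
ada.fit(X, y)  # y is 1-D here, so no shape warning is raised

# One training MAE per boosting stage, in the order the learners were added
stage_mae = [mean_absolute_error(y, pred) for pred in ada.staged_predict(X)]

# The last stage matches the prediction of the full ensemble
final_mae = mean_absolute_error(y, ada.predict(X))

print(len(ada.estimators_), len(stage_mae))
```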

AdaBoost fit

In [None]:
from sklearn.ensemble import AdaBoostRegressor
ada = AdaBoostRegressor()
ada.fit(x_train, y_train)

  y = column_or_1d(y, warn=True)


#### AdaBoost Metrics:

In [None]:

r_sq = ada.score(x,y)
print(r_sq)
y_pred_train = ada.predict(x_train)
print("MAE:", metrics.mean_absolute_error(y_train, y_pred_train))

y_pred_test = ada.predict(x_test)
print("MAE:", metrics.mean_absolute_error(y_test, y_pred_test))

0.8411259436428901
MAE: 0.06032687030699958
MAE: 0.05970884451912648


## GradientBoost


In [None]:
from sklearn.ensemble import GradientBoostingRegressor
grb = GradientBoostingRegressor()
grb.fit(x_train, y_train)


  y = column_or_1d(y, warn=True)


#### GradientBoost Metrics

In [None]:
r_sq = grb.score(x,y)
print(r_sq)
y_pred_train = grb.predict(x_train)
print("MAE:", metrics.mean_absolute_error(y_train, y_pred_train))

y_pred_test = grb.predict(x_test)
print("MAE:", metrics.mean_absolute_error(y_test, y_pred_test))

0.8972803573144392
MAE: 0.03346236453828395
MAE: 0.03927539439232185


## GridSearch



In [None]:
from sklearn.model_selection import GridSearchCV

parameters = { "max_depth": [5],
              "min_samples_leaf": [4],
              "min_samples_split": [2],
              "n_estimators": [200]}

grid_search = GridSearchCV(grb, parameters, scoring="r2", cv=2, n_jobs=-1)
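Each list in `parameters` above holds a single value, so the grid contains exactly one candidate and the search has nothing to compare. A sketch with several values per parameter follows; the candidate values, `cv=3`, and the synthetic data are assumptions for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Hypothetical candidate values; 2 * 2 * 2 = 8 combinations
param_grid = {
    "max_depth": [2, 3],
    "min_samples_leaf": [1, 4],
    "n_estimators": [50, 100],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="r2",
    cv=3,
)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
print(search.best_params_, n_candidates)

# With the default refit=True, best_estimator_ is already fitted on all
# of the data passed to fit(), so it can predict without further training
preds = search.best_estimator_.predict(X[:5])
```

Since `best_estimator_` comes back fitted, rebuilding the model by hand from `get_params()` is optional.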

GridSearch Fit

In [None]:
grid_search.fit(x_train, y_train)

  y = column_or_1d(y, warn=True)


In [None]:
print(grid_search.best_estimator_)
print(grid_search.best_params_)

GradientBoostingRegressor(max_depth=5, min_samples_leaf=4, n_estimators=200)
{'max_depth': 5, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 200}


In [None]:
best_model = grid_search.best_estimator_

In [None]:
best_model.get_params()

{'alpha': 0.9,
 'ccp_alpha': 0.0,
 'criterion': 'friedman_mse',
 'init': None,
 'learning_rate': 0.1,
 'loss': 'squared_error',
 'max_depth': 5,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 4,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 200,
 'n_iter_no_change': None,
 'random_state': None,
 'subsample': 1.0,
 'tol': 0.0001,
 'validation_fraction': 0.1,
 'verbose': 0,
 'warm_start': False}

In [None]:
# Rebuild the model from the parameters reported by get_params() above.
# `alpha` is only used by the 'huber' and 'quantile' losses, so it has
# no effect with loss='squared_error'.
grb_tunned = GradientBoostingRegressor(
    alpha=0.9,
    ccp_alpha=0.0,
    criterion='friedman_mse',
    init=None,
    learning_rate=0.1,
    loss='squared_error',
    max_depth=5,
    max_features=None,
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    min_samples_leaf=4,
    min_samples_split=2,
    min_weight_fraction_leaf=0.0,
    n_estimators=200,
    n_iter_no_change=None,
    random_state=None,
    subsample=1.0,
    tol=0.0001,
    validation_fraction=0.1,
    verbose=0,
    warm_start=False,
)

In [None]:
grb_tunned.fit(x_train, y_train)

  y = column_or_1d(y, warn=True)


#### GridSearch Metrics

In [None]:
# R-squared on the full dataset (training rows included)
r_sq = grb_tunned.score(x,y)
r_sq

0.9325910627080722

In [None]:
y_pred_train = grb_tunned.predict(x_train)
print("MAE:", metrics.mean_absolute_error(y_train, y_pred_train))

MAE: 0.023830866069655104


In [None]:
y_pred_test = grb_tunned.predict(x_test)
print("MAE:", metrics.mean_absolute_error(y_test, y_pred_test))

MAE: 0.045687335566153855
