### Supervised Machine Learning Regression Project

#### Main objective
The main objective is to predict medical insurance costs from the characteristics of the insured portfolio. <br>
Regression models are trained to predict the target column charges as accurately as possible. <br>
Because this is a regression problem, the coefficient of determination (R²) and the mean squared error (MSE) are used to evaluate the models. <br>
I chose this dataset because I currently work as an actuarial analyst at a reinsurance company :)
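
For reference, the two evaluation metrics can be written out by hand. The sketch below is only an illustration on made-up toy numbers (the function names and values are my own, not part of the analysis); the notebook itself relies on scikit-learn's `r2_score` and `mean_squared_error`.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score_manual(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values, invented purely for illustration.
toy_true = [3.0, 5.0, 7.0]
toy_pred = [2.5, 5.0, 7.5]

toy_mse = mean_squared_error_manual(toy_true, toy_pred)  # 0.5 / 3, about 0.1667
toy_r2 = r2_score_manual(toy_true, toy_pred)             # 1 - 0.5 / 8 = 0.9375
print(toy_mse, toy_r2)
```

An R² of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of the target.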

#### Brief description of the data set
This dataset is obtained from Kaggle.

Kaggle reference: https://www.kaggle.com/datasets/mirichoi0218/insurance?datasetId=13720&language=Python

##### Target
 - charges: Individual medical costs billed by health insurance
 
##### Features
 - age: age of primary beneficiary
 - sex: gender of the insurance contractor (female or male)
 - bmi: body mass index, an objective index of body weight relative to height, computed as weight divided by height squared (kg/m²); ideally 18.5 to 24.9
 - children: number of children covered by health insurance / number of dependents
 - smoker: whether the beneficiary smokes (yes or no)
 - region: the beneficiary's residential area in the US (northeast, southeast, southwest or northwest)
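
As a quick illustration of the bmi feature, here is a minimal sketch of the standard formula (weight in kg divided by height in metres squared). The helper names and the 70 kg / 1.75 m example are hypothetical; the dataset only provides the resulting bmi value, not weight or height.

```python
def bmi(weight_kg, height_m):
    """Body mass index in kg/m^2."""
    return weight_kg / height_m ** 2


def in_ideal_range(value, low=18.5, high=24.9):
    """True if the BMI falls in the 18.5 to 24.9 range quoted above."""
    return low <= value <= high


example_bmi = bmi(70.0, 1.75)       # hypothetical person
print(round(example_bmi, 2))        # 22.86
print(in_ideal_range(example_bmi))  # True
print(in_ideal_range(27.9))         # False
```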

#### Data Exploration and Data Cleaning

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import LabelEncoder, StandardScaler, PolynomialFeatures

In [2]:
df = pd.read_csv('./insurance.csv')
df

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [4]:
df.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

In [5]:
df.describe(include='all')

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
count,1338.0,1338,1338.0,1338.0,1338,1338,1338.0
unique,,2,,,2,4,
top,,male,,,no,southeast,
freq,,676,,,1064,364,
mean,39.207025,,30.663397,1.094918,,,13270.422265
std,14.04996,,6.098187,1.205493,,,12110.011237
min,18.0,,15.96,0.0,,,1121.8739
25%,27.0,,26.29625,0.0,,,4740.28715
50%,39.0,,30.4,1.0,,,9382.033
75%,51.0,,34.69375,2.0,,,16639.912515


In [6]:
le = LabelEncoder()

In [7]:
df.sex = le.fit_transform(df.sex)        # classes are sorted alphabetically: female -> 0, male -> 1
df.smoker = le.fit_transform(df.smoker)  # no -> 0, yes -> 1

In [8]:
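# One-hot encode region (northeast becomes the dropped baseline); dtype=int keeps 0/1 columns on newer pandas
df = pd.concat([df.drop(['region', 'charges'], axis=1), pd.get_dummies(df.region, drop_first=True, dtype=int), df.charges], axis=1)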

In [9]:
df

Unnamed: 0,age,sex,bmi,children,smoker,northwest,southeast,southwest,charges
0,19,0,27.900,0,1,0,0,1,16884.92400
1,18,1,33.770,1,0,0,1,0,1725.55230
2,28,1,33.000,3,0,0,1,0,4449.46200
3,33,1,22.705,0,0,1,0,0,21984.47061
4,32,1,28.880,0,0,1,0,0,3866.85520
...,...,...,...,...,...,...,...,...,...
1333,50,1,30.970,3,0,1,0,0,10600.54830
1334,18,0,31.920,0,0,0,0,0,2205.98080
1335,18,0,36.850,0,0,0,1,0,1629.83350
1336,21,0,25.800,0,0,0,0,1,2007.94500


In [75]:
df_ss = df.copy()
X = df_ss.drop('charges', axis=1)
y = df_ss.charges
X[['age', 'bmi', 'children']] = StandardScaler().fit_transform(X[['age', 'bmi', 'children']])

In [76]:
X

Unnamed: 0,age,sex,bmi,children,smoker,northwest,southeast,southwest
0,-1.438764,0,-0.453320,-0.908614,1,0,0,1
1,-1.509965,1,0.509621,-0.078767,0,0,1,0
2,-0.797954,1,0.383307,1.580926,0,0,1,0
3,-0.441948,1,-1.305531,-0.908614,0,1,0,0
4,-0.513149,1,-0.292556,-0.908614,0,1,0,0
...,...,...,...,...,...,...,...,...
1333,0.768473,1,0.050297,1.580926,0,1,0,0
1334,-1.509965,0,0.206139,-0.908614,0,0,0,0
1335,-1.509965,0,1.014878,-0.908614,0,0,1,0
1336,-1.296362,0,-0.797813,-0.908614,0,0,0,1


There are no missing values in the dataset. <br>
The categorical variables sex, smoker and region are converted into numerical values. <br>
The numerical variables age, bmi and children are standardized for the later regression modelling. <br>
The categorical variables (sex, smoker and the one-hot encoded region) are not standardized. <br>
Most of the features are 0/1 dummies, and powers of a dummy add no new information (0² = 0 and 1² = 1), so polynomial features are not used here.
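
One possible variation on the preprocessing above is to wrap the scaling and encoding in a scikit-learn `Pipeline`, so that the scaler is fitted on the training rows only rather than on the full dataset. The sketch below is not what this notebook does; it runs on a small synthetic table (all values and the names `demo`, `prep` and `model` are invented for illustration) so that it is self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for insurance.csv; the numbers are made up.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    'age': rng.integers(18, 65, n),
    'bmi': rng.normal(30, 6, n),
    'children': rng.integers(0, 4, n),
    'sex': rng.choice(['female', 'male'], n),
    'smoker': rng.choice(['yes', 'no'], n),
    'region': rng.choice(['northeast', 'northwest', 'southeast', 'southwest'], n),
})
demo['charges'] = (250 * demo['age']
                   + 20000 * (demo['smoker'] == 'yes')
                   + rng.normal(0, 1000, n))

num_cols = ['age', 'bmi', 'children']
cat_cols = ['sex', 'smoker', 'region']

# Scale numeric columns; one-hot encode categoricals, dropping the first level.
prep = ColumnTransformer([
    ('num', StandardScaler(), num_cols),
    ('cat', OneHotEncoder(drop='first'), cat_cols),
])
model = Pipeline([('prep', prep), ('reg', LinearRegression())])

X_demo = demo.drop(columns='charges')
y_demo = demo['charges']
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=0)

# fit() learns the scaling statistics from the training rows only.
model.fit(X_tr, y_tr)
demo_r2 = model.score(X_te, y_te)
print(f'r2 on synthetic test rows: {demo_r2:.3f}')
```

With this layout the test rows never influence the means and standard deviations used for scaling, and the same fitted pipeline can be reused inside `GridSearchCV`.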

#### Regression Models

In [77]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import GridSearchCV

In [78]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=8964)

In [79]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(1070, 8)
(268, 8)
(1070,)
(268,)


In [80]:
lr = LinearRegression()
lr.fit(X_train, y_train)
lr_y_pred = lr.predict(X_test)
lr_score = r2_score(y_test, lr_y_pred)
lr_mse = mean_squared_error(y_test, lr_y_pred)
print(f'r2 score: {lr_score}')
print(f'mean squared error: {lr_mse}')

r2 score: 0.6805655402124758
mean squared error: 47444871.45231035


In [81]:
lr_coef = pd.DataFrame(zip(X_train.columns, lr.coef_), columns=['Feature', 'Coefficient']).sort_values('Coefficient')
lr_coef

Unnamed: 0,Feature,Coefficient
7,southwest,-1006.831618
6,southeast,-905.581869
5,northwest,-166.992521
1,sex,93.837597
3,children,643.537767
2,bmi,2113.066862
0,age,3792.066609
4,smoker,24050.46258
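
Because age and bmi were standardized, their coefficients are per standard deviation rather than per year or per BMI point. A rough back-of-the-envelope conversion, using the coefficients above and the standard deviations from `df.describe()`, could look like the sketch below. It assumes that `StandardScaler` divides by the population standard deviation (ddof=0) while `describe()` reports the sample one (ddof=1); the variable names are my own.

```python
import math

n_rows = 1338                 # number of rows, from df.info()
age_std_sample = 14.04996     # from df.describe()
bmi_std_sample = 6.098187     # from df.describe()
age_coef = 3792.066609        # linear regression coefficient above
bmi_coef = 2113.066862        # linear regression coefficient above

# Convert sample std (ddof=1) to population std (ddof=0), as used by StandardScaler.
ddof_factor = math.sqrt((n_rows - 1) / n_rows)
age_std_pop = age_std_sample * ddof_factor
bmi_std_pop = bmi_std_sample * ddof_factor

per_year_of_age = age_coef / age_std_pop
per_bmi_point = bmi_coef / bmi_std_pop
print(f'per year of age: {per_year_of_age:.1f}')
print(f'per BMI point:   {per_bmi_point:.1f}')
```

Under these assumptions the model adds roughly 270 in charges per additional year of age and roughly 347 per additional BMI point, holding the other features fixed.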


In [82]:
las = Lasso()
alphas = [0.005, 0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 80]
las_param_grid = {'alpha' : alphas}
gslas = GridSearchCV(las, param_grid=las_param_grid, cv=3, scoring='neg_mean_squared_error', n_jobs=-1)
gslas.fit(X_train, y_train)
las_best = gslas.best_estimator_
las_best

Lasso(alpha=30)
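
To see how sensitive the cross-validated error is to alpha, rather than only looking at the winner, the `cv_results_` attribute of a fitted `GridSearchCV` can be inspected. The sketch below uses synthetic data and a shorter alpha list (both invented for illustration), so the numbers it prints say nothing about the insurance data.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem, made up for this sketch.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=120)

demo_alphas = [0.005, 0.05, 0.1, 0.3, 1, 3]
search = GridSearchCV(Lasso(), {'alpha': demo_alphas}, cv=3,
                      scoring='neg_mean_squared_error')
search.fit(X_demo, y_demo)

# Scores are negated MSE, so flip the sign to get the mean CV MSE per alpha.
cv_mse = -search.cv_results_['mean_test_score']
for alpha, mse in zip(demo_alphas, cv_mse):
    print(f'alpha={alpha:<6} mean CV MSE={mse:.4f}')
print('best:', search.best_params_)
```

If the mean CV MSE is nearly flat across several alphas, the choice between them matters little and the simpler (more regularized) model may be preferable.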

In [83]:
las_best_y_pred = las_best.predict(X_test)
las_best_score = r2_score(y_test, las_best_y_pred)
las_best_mse = mean_squared_error(y_test, las_best_y_pred)
print(f'r2 score: {las_best_score}')
print(f'mean squared error: {las_best_mse}')

r2 score: 0.6817069019932714
mean squared error: 47275347.591276556


In [84]:
las_best_coef = pd.DataFrame(zip(X_train.columns, las_best.coef_), columns=['Feature', 'Coefficient']).sort_values('Coefficient')
las_best_coef

Unnamed: 0,Feature,Coefficient
7,southwest,-659.769876
6,southeast,-531.233428
1,sex,0.0
5,northwest,0.0
3,children,617.668329
2,bmi,2054.243515
0,age,3767.540588
4,smoker,23862.784978
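
Lasso sets some coefficients (here sex and northwest) to exactly zero rather than just shrinking them. As an intuition only: in the idealized case of orthonormal features, and with scikit-learn's scaling of the lasso objective, each lasso coefficient is the least-squares coefficient passed through a soft-thresholding function at alpha. The features in this notebook are not orthonormal, so the sketch below (with hypothetical coefficient values) illustrates the mechanism and does not reproduce the fitted values above.

```python
def soft_threshold(z, alpha):
    """Shrink z towards zero by alpha; values within [-alpha, alpha] become 0."""
    if z > alpha:
        return z - alpha
    if z < -alpha:
        return z + alpha
    return 0.0


# Hypothetical least-squares coefficients, invented for illustration.
ols_like_coefs = {'large_effect': 24000.0, 'small_effect': 20.0, 'negative_effect': -900.0}
demo_alpha = 30

shrunk = {name: soft_threshold(value, demo_alpha)
          for name, value in ols_like_coefs.items()}
print(shrunk)
```

Large effects lose only alpha from their magnitude, while effects smaller than alpha in magnitude are removed entirely, which is why lasso doubles as a feature selector.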


In [85]:
rid = Ridge()
alphas = [0.005, 0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 80]
rid_param_grid = {'alpha' : alphas}
gsrid = GridSearchCV(rid, param_grid=rid_param_grid, cv=3, scoring='neg_mean_squared_error', n_jobs=-1)
gsrid.fit(X_train, y_train)
rid_best = gsrid.best_estimator_
rid_best

Ridge(alpha=0.3)

In [86]:
rid_best_y_pred = rid_best.predict(X_test)
rid_best_score = r2_score(y_test, rid_best_y_pred)
rid_best_mse = mean_squared_error(y_test, rid_best_y_pred)
print(f'r2 score: {rid_best_score}')
print(f'mean squared error: {rid_best_mse}')

r2 score: 0.6806742077891318
mean squared error: 47428731.30509861


In [87]:
rid_best_coef = pd.DataFrame(zip(X_train.columns, rid_best.coef_), columns=['Feature', 'Coefficient']).sort_values('Coefficient')
rid_best_coef

Unnamed: 0,Feature,Coefficient
7,southwest,-1003.633129
6,southeast,-899.756965
5,northwest,-165.746666
1,sex,96.422258
3,children,643.584054
2,bmi,2112.449996
0,age,3790.29927
4,smoker,24008.048315


In terms of accuracy, lasso regression with alpha=30 is recommended as the final model: it has the highest test r2 score (0.6817) and the lowest mean squared error, although its margin over the other two models is small. <br>
In terms of explainability, ridge regression with alpha=0.3 is recommended instead, because it is common practice to separate cost by gender, whereas lasso regression zeroes out the gender effect.
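
To put the three models side by side, the test metrics printed above can be collected and the MSE converted to a root mean squared error (RMSE), which is in the same units as charges. This is a small convenience sketch; the dictionary layout and names are my own.

```python
import math

# Test-set metrics copied from the outputs above.
reported = {
    'linear': {'r2': 0.6805655402124758, 'mse': 47444871.45231035},
    'lasso (alpha=30)': {'r2': 0.6817069019932714, 'mse': 47275347.591276556},
    'ridge (alpha=0.3)': {'r2': 0.6806742077891318, 'mse': 47428731.30509861},
}

rmse = {name: math.sqrt(m['mse']) for name, m in reported.items()}
best_by_r2 = max(reported, key=lambda name: reported[name]['r2'])

for name, m in reported.items():
    print(f"{name:<18} r2={m['r2']:.4f}  rmse={rmse[name]:.1f}")
print('highest r2:', best_by_r2)
```

All three RMSE values come out close to 6,900 and differ by well under 1%, so the ranking rests on a single train/test split and should be read with caution.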

#### Summary Key Findings and Insights
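Lifestyle factors, notably smoking and BMI, have large effects on medical cost; smoking is by far the strongest predictor, with a coefficient of about 24,000. <br>
In the linear and ridge models, the cost for males is slightly higher than that for females, although lasso shrinks this effect to zero. <br>
Relative to the northeast baseline, medical cost appears to be lower in the southern regions than in the northern regions. <br>
In our regression models, around 68% of the variation in medical cost in the test set is explained by the independent variables.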

#### Next steps in analyzing
To verify our hypothesis that medical cost is lower in the southern regions than in the northern regions, we need to research medical costs by region; such information may be available online. <br>
Other features may help improve our models: cause of claim, public/private hospital indicator, inpatient/outpatient indicator and length of hospital stay, to name but a few. <br>
The treatment year is also important because medical inflation is high; if the data were collected across many years, the costs may be distorted. <br>
We may group individual ages into age bands such as 0-19, 20-39, 40-59, etc. if we want to improve the credibility of the data in each age band.
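
The age banding could be sketched with `pd.cut` as below. The band edges follow the example bands above; the open-ended `60+` band and the sample ages are my own assumptions for illustration.

```python
import pandas as pd

# A few sample ages; 64 is a hypothetical value added to show the top band.
ages = pd.Series([18, 19, 28, 33, 50, 64], name='age')

band_edges = [0, 20, 40, 60, float('inf')]      # left-closed bins: [0, 20), [20, 40), ...
band_labels = ['0-19', '20-39', '40-59', '60+']

age_band = pd.cut(ages, bins=band_edges, labels=band_labels, right=False)
print(age_band.tolist())
```

The resulting categorical column could then be one-hot encoded like region, at the cost of losing the within-band age gradient.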