# Medical Insurance Cost Prediction

Many of the factors that affect how much you pay for health insurance are not within your control. Nonetheless, it is good to understand what they are. Here are some factors that affect how much health insurance premiums cost:

* age: age of primary beneficiary

* sex: gender of the insurance contractor (female or male)

* bmi: body mass index, an objective index of body weight relative to height (kg / m ^ 2), calculated as weight divided by height squared; it indicates whether weight is relatively high or low for a given height, ideally 18.5 to 24.9

* children: number of children covered by health insurance / number of dependents

* smoker: whether the beneficiary smokes (yes or no)

* region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)

* charges: individual medical costs billed by health insurance
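The bmi definition in the list above can be written out as a short calculation. This is a minimal sketch: the function names `bmi` and `in_reference_range` and the example person are illustrative assumptions, not part of the dataset or the notebook.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def in_reference_range(value: float, low: float = 18.5, high: float = 24.9) -> bool:
    """True when the BMI falls inside the 18.5 to 24.9 range quoted in the list."""
    return low <= value <= high


# Hypothetical person: 70 kg and 1.75 m tall
example = bmi(70.0, 1.75)
print(round(example, 2), in_reference_range(example))
```

The dataset already stores the computed bmi value, so this only illustrates where the number comes from.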

# The Code

In [172]:
# Import libraries

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import r2_score, mean_squared_error



In [173]:
# Load the dataset (upload insurance.csv through the Colab file picker)

from google.colab import files
uploaded = files.upload()

import io
dataset = pd.read_csv(io.BytesIO(uploaded['insurance.csv']))

Saving insurance.csv to insurance (1).csv


In [174]:
#show the first 5 rows of dataset

print(dataset.head())

   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520


In [175]:
#number of columns and rows
print("number of rows and columns:\n",dataset.shape)

#Type of Attributes
print("\nType of attributes:\n",dataset.dtypes)

number of rows and columns:
 (1338, 7)

Type of attributes:
 age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object


## Data Cleaning


In [176]:
#checking for null values
dataset.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

#### Surprise! The dataset is already clean: it has no missing values, and every column follows the same naming standard.

### One-Hot Encoding

One-hot encoding is performed for the unordered categorical columns.
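As a minimal sketch of what one-hot encoding does, the example below uses `pandas.get_dummies` on a small toy frame. The toy rows are illustrative, and `get_dummies` is a swapped-in shortcut rather than the encoder-based approach used in this notebook.

```python
import pandas as pd

# Toy rows reusing column names from the insurance data (values are illustrative)
toy = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest"],
})

# One indicator column per category; the original string columns are dropped
toy_ohc = pd.get_dummies(toy, columns=["smoker", "region"], dtype=int)

print(toy_ohc.columns.tolist())
# Net change in column count: (2 - 1) for smoker plus (3 - 1) for region
print(toy_ohc.shape[1] - toy.shape[1])
```

Each categorical column is replaced by as many indicator columns as it has distinct values, so the net increase per column is the number of categories minus one.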

In [177]:
# Select the object (string) columns
# (the np.object alias was removed in NumPy 1.24; the builtin object is equivalent)
obj = dataset.dtypes == object
categorical_cols = dataset.columns[obj]

In [178]:
obj

age         False
sex          True
bmi         False
children    False
smoker       True
region       True
charges     False
dtype: bool

In [179]:
dataset[categorical_cols]

         sex  smoker     region
0     female     yes  southwest
1       male      no  southeast
2       male      no  southeast
3       male      no  northwest
4       male      no  northwest
...      ...     ...        ...
1333    male      no  northwest
1334  female      no  northeast
1335  female      no  southeast
1336  female      no  southwest


In [182]:
# Determine how many extra columns would be created
num_ohc_cols = (dataset[categorical_cols]
                .apply(lambda x: x.nunique())
                .sort_values(ascending=False))
print("number of categories in each object column : \n", num_ohc_cols)

number of categories in each object column : 
 region    4
smoker    2
sex       2
dtype: int64


In [183]:
# No need to encode if there is only one value
small_num_ohc_cols = num_ohc_cols.loc[num_ohc_cols>1]

# Net new columns per feature: one column per category, minus the original
# column that is dropped
small_num_ohc_cols -= 1

# This adds 5 columns in total once the original columns are dropped
print("total of new columns : ", small_num_ohc_cols.sum())

total of new columns :  5


#### Create a new dataset in which all of the categorical features above are one-hot encoded

In [184]:
# Copy of the data
dataset_ohc = dataset.copy()

# The encoders
le = LabelEncoder()
ohc = OneHotEncoder()

for col in num_ohc_cols.index:
    
    # Integer encode the string categories
    data = le.fit_transform(dataset_ohc[col]).astype(int)
    
    # Remove the original column from the dataframe
    dataset_ohc = dataset_ohc.drop(col, axis=1)

    # One hot encode the data--this returns a sparse array
    new_data = ohc.fit_transform(data.reshape(-1,1))

    # Create unique column names
    n_cols = new_data.shape[1]
    col_names = ['_'.join([col, str(x)]) for x in range(n_cols)]

    # Create the new dataframe
    new_df = pd.DataFrame(new_data.toarray(), 
                          index=dataset_ohc.index, 
                          columns=col_names)

    # Append the new data to the dataframe
    dataset_ohc = pd.concat([dataset_ohc, new_df], axis=1)

In [185]:
# Column difference is as calculated above
dataset_ohc.shape[1] - dataset.shape[1]

5

In [34]:
dataset_ohc

      age     bmi  children      charges  region_0  region_1  region_2  region_3  smoker_0  smoker_1  sex_0  sex_1
0      19  27.900         0  16884.92400       0.0       0.0       0.0       1.0       0.0       1.0    1.0    0.0
1      18  33.770         1   1725.55230       0.0       0.0       1.0       0.0       1.0       0.0    0.0    1.0
2      28  33.000         3   4449.46200       0.0       0.0       1.0       0.0       1.0       0.0    0.0    1.0
3      33  22.705         0  21984.47061       0.0       1.0       0.0       0.0       1.0       0.0    0.0    1.0
4      32  28.880         0   3866.85520       0.0       1.0       0.0       0.0       1.0       0.0    0.0    1.0
...   ...     ...       ...          ...       ...       ...       ...       ...       ...       ...    ...    ...
1333   50  30.970         3  10600.54830       0.0       1.0       0.0       0.0       1.0       0.0    0.0    1.0
1334   18  31.920         0   2205.98080       1.0       0.0       0.0       0.0       1.0       0.0    1.0    0.0
1335   18  36.850         0   1629.83350       0.0       0.0       1.0       0.0       1.0       0.0    1.0    0.0
1336   21  25.800         0   2007.94500       0.0       0.0       0.0       1.0       1.0       0.0    1.0    0.0


In [35]:
print(dataset.shape[1])

# Remove the string columns from the dataframe
dataset = dataset.drop(num_ohc_cols.index, axis=1)

print(dataset.shape[1])

7
4


# Linear Regression

In [187]:
y_col = 'charges'

# Split the data that is not one-hot encoded
X_data = dataset.drop(y_col, axis=1)
y_data = dataset[y_col]

X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, 
                                                    test_size=0.3, random_state=42)
# Split the data that is one-hot encoded
X_data_ohc = dataset_ohc.drop(y_col, axis=1)
y_data_ohc = dataset_ohc[y_col]

X_train_ohc, X_test_ohc, y_train_ohc, y_test_ohc = train_test_split(X_data_ohc, y_data_ohc, 
                                                    test_size=0.3, random_state=42)


In [41]:
# Compare the indices to ensure they are identical
(X_train_ohc.index == X_train.index).all()

True

In [145]:
#LinearRegression
lr = LinearRegression()

# Storage for error values
error_df = list()

# Data that have not been one-hot encoded
lr = lr.fit(X_train, y_train)
y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)

error_df.append(pd.Series({'train': mean_squared_error(y_train, y_train_pred),
                           'test' : mean_squared_error(y_test,  y_test_pred)},
                           name='no enc'))

# Data that have been one-hot encoded
lr = lr.fit(X_train_ohc, y_train_ohc)
y_train_ohc_pred = lr.predict(X_train_ohc)
y_test_ohc_pred = lr.predict(X_test_ohc)

error_df.append(pd.Series({'train': mean_squared_error(y_train_ohc, y_train_ohc_pred),
                           'test' : mean_squared_error(y_test_ohc,  y_test_ohc_pred)},
                          name='one-hot enc'))

# Assemble the results
error_df = pd.concat(error_df, axis=1)
error_df

            no enc  one-hot enc
train  123633800.0   37730550.0
test   141243300.0   33780510.0


#### As the results show, the error on the one-hot-encoded data is very different from the error on the original data (numeric columns only): the one-hot-encoded model's MSE is less than a third of the other model's on both the train set and the test set.
#### The train and test errors also behave differently. For the original data, the test error is higher than the train error, while for the one-hot-encoded data the test error is lower than the train error, which means the model performed better on the test set than on the train set. There are many possible reasons for a train error that exceeds the test error; one of them is a test set that is too small. To make this clearer, let's fit further regression models with additional parameters!
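Whether a test error below the train error is an artifact of one particular split can be checked with k-fold cross-validation, which averages the error over several different test folds. The sketch below is an assumption about how such a check might look; it uses synthetic data as a stand-in, not the insurance dataset, and the choice of 5 folds is arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for three numeric features (NOT the real insurance data)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 5.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=300)

cv = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn reports negated MSE so that larger scores are always better
neg_mse = cross_val_score(LinearRegression(), X, y, cv=cv,
                          scoring="neg_mean_squared_error")
fold_mse = -neg_mse

print(fold_mse)                         # one MSE per held-out fold
print(fold_mse.mean(), fold_mse.std())  # average error and its spread
```

If the spread across folds is large compared with the gap between train and test error, the gap is probably not meaningful.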

In [161]:
print(lr.intercept_)
print(lr.coef_)
print(lr.score(X_test_ohc, y_test_ohc))

-1103.4309695991797
[   261.29692414    348.90691516    424.11912829    596.05658924
    109.12197876   -374.91224933   -330.26631867 -11814.18361118
  11814.18361118    -52.40591149     52.40591149]
0.7696118054369011
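The last value printed above comes from `lr.score`, which for `LinearRegression` is the coefficient of determination R². As a minimal sketch with made-up numbers (the arrays and the helper name `r2_by_hand` are illustrative, not taken from the insurance data), R² can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares:

```python
import numpy as np


def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot


# Illustrative targets and predictions
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]
print(r2_by_hand(y_true, y_pred))
```

A model that always predicts the mean of the targets scores 0, and a perfect model scores 1.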


# Lasso Regression

#### Lasso regression for the non-encoded dataset

In [132]:
Ls = Lasso(alpha=0.01, max_iter=10000)
Ls = Ls.fit(X_train, y_train)
y_train_Ls_pred = Ls.predict(X_train)
y_test_Ls_pred = Ls.predict(X_test)

print(Ls.intercept_)
print(Ls.coef_)
print(Ls.score(X_test, y_test))

-4234.123698499838
[ 0.00000000e+00 -4.39110322e+01  4.14771064e+02  1.71683738e+03
  4.41368618e+00 -5.81596636e-01 -2.93483459e+01 -1.77188586e+00
  4.40648882e+01 -4.73024350e+02]
0.05212872055932638


In [133]:
#Mean Squared Error
Lstrain = mean_squared_error(y_train, y_train_Ls_pred)
Lstest = mean_squared_error(y_test, y_test_Ls_pred)

In [138]:
dfLs = pd.DataFrame(np.array([Lstrain, Lstest]),
                   columns=['Lasso Regression'],
                   index = ['train', 'test'])
print('Mean Squared error of Lasso Regression : \n', dfLs)

Mean Squared error of Lasso Regression : 
        Lasso Regression
train      1.236338e+08
test       1.412433e+08


#### Just like simple linear regression on the original data, Lasso regression has a higher error on the test set than on the train set.

# Polynomial Features


In [73]:
y_col = 'charges'

# Split the data that is not one-hot encoded
X_data = dataset.drop(y_col, axis=1)
y_data = dataset[y_col]

pol = PolynomialFeatures(degree=2)
X_pol = pol.fit_transform(X_data)
X_pol_train, X_pol_test, y_pol_train, y_pol_test = train_test_split(X_pol, y_data, test_size=0.3, random_state=73)

pol_reg = LinearRegression()
pol_reg = pol_reg.fit(X_pol_train, y_pol_train)
y_pol_train_pred = pol_reg.predict(X_pol_train)
y_pol_test_pred = pol_reg.predict(X_pol_test)
print(pol_reg.intercept_)
print(pol_reg.coef_)
print(pol_reg.score(X_pol_test, y_pol_test))

-4234.428461542173
[ 0.00000000e+00 -4.39172375e+01  4.14789133e+02  1.71711706e+03
  4.41376821e+00 -5.81578280e-01 -2.93497981e+01 -1.77208891e+00
  4.40598864e+01 -4.73041547e+02]
0.052128768816979365


In [134]:
#Mean Squared Error
poltrain = mean_squared_error(y_pol_train, y_pol_train_pred)
poltest = mean_squared_error(y_pol_test, y_pol_test_pred)


In [141]:
poldf = pd.DataFrame(np.array([poltrain, poltest]),
                   columns=['PolynomialFeatures'],
                   index = ['train', 'test'])
print('Mean Squared error of Polynomial Features : \n', poldf)

Mean Squared error of Polynomial Features : 
        PolynomialFeatures
train        1.236338e+08
test         1.412433e+08


#### As with Lasso regression, the polynomial-features model has a higher error on the test set than on the train set.

## Mean Squared Error Comparison

#### Comparison of the mean squared error of simple linear regression with and without one-hot encoding, Lasso regression, and polynomial features

In [147]:
df = pd.concat([error_df, dfLs, poldf], axis=1)

In [148]:
df

            no enc  one-hot enc  Lasso Regression  PolynomialFeatures
train  123633800.0   37730550.0       123633800.0         123633800.0
test   141243300.0   33780510.0       141243300.0         141243300.0


### According to the dataframe above, we can conclude that:
* Linear regression on the one-hot-encoded data has the lowest error of all the models on both the train set and the test set,
* unlike the others, one-hot-encoded linear regression has a higher error on the train set than on the test set,
* the train and test errors for linear regression without encoding, Lasso regression, and PolynomialFeatures are the same at the precision shown, so any difference between them is too small to appear in the table.
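MSE is expressed in squared units of `charges`, which makes the numbers hard to read. As a small sketch, taking the square root gives the root mean squared error (RMSE) in the same units as `charges`. The MSE values below are copied from the comparison table; only two of the four models are shown because the other two report the same values as "no enc".

```python
from math import sqrt

# MSE values copied from the comparison table
mse = {
    "no enc": {"train": 123633800.0, "test": 141243300.0},
    "one-hot enc": {"train": 37730550.0, "test": 33780510.0},
}

# RMSE is the square root of MSE, so it is in the same units as charges
rmse = {model: {split: sqrt(value) for split, value in splits.items()}
        for model, splits in mse.items()}

for model, splits in rmse.items():
    print(f"{model:12s} train={splits['train']:.0f}  test={splits['test']:.0f}")
```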

# Summary
This dataset is well suited to practising linear regression, especially for beginners, because it is already clean.
For further work, it would be better to scale the features, for example with MinMaxScaler or StandardScaler, and to start from an unclean/raw dataset so that we can do all of the cleaning ourselves.
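One possible way to add scaling is sketched below, assuming a scikit-learn pipeline that standardizes the features before fitting Lasso. The data is synthetic (not the insurance dataset), with columns that only mimic the ranges of age, bmi, and children; the `alpha` and `max_iter` values are the ones used earlier in this notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on different scales (NOT the insurance data)
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(18, 64, size=200),   # age-like column
    rng.uniform(16, 50, size=200),   # bmi-like column
    rng.integers(0, 6, size=200),    # children-like column
])
y = (250.0 * X[:, 0] + 300.0 * X[:, 1] + 500.0 * X[:, 2]
     + rng.normal(scale=1000.0, size=200))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# The scaler is fitted on the training split only and reused on the test split
model = make_pipeline(StandardScaler(), Lasso(alpha=0.01, max_iter=10000))
model.fit(X_train, y_train)

print(model.score(X_test, y_test))  # R² on the held-out split
```

Putting the scaler inside the pipeline keeps test-set statistics out of the fitting step, and it puts the Lasso penalty on comparable coefficient scales.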