# IS 470 Lab 6: Numeric Prediction--Regression Methods

---

For a health insurance company to make money, it needs to collect more
in yearly premiums than it spends on medical care for its beneficiaries. As a result, insurers invest a great deal of time and money in developing models that accurately forecast medical expenses for the insured population.<br>
<br>
Medical expenses are difficult to estimate because the most costly conditions are rare and seemingly random. Still, some conditions are more prevalent for certain segments of the population. For instance, lung cancer is more likely among smokers than non-smokers, and heart disease may be more likely among the obese.<br>
<br>
The goal of this analysis is to use patient data to estimate the average medical
care expenses for such population segments. These estimates can be used to create actuarial tables that set the price of yearly premiums higher or lower, 
depending on the expected treatment costs.<br>
<br>
The insurance data set has 1338 observations of 7 variables.
<br>
We will use this file to predict the medical expenses.
<br>
<br>
VARIABLE DESCRIPTIONS:<br>
age:	      age in years<br>
sex:	      gender<br>
bmi:	      body mass index<br>
children:	how many children do they have?<br>
smoker:	  do they smoke?<br>
region:	  geographic region<br>
expenses:	yearly medical expenses<br>
<br>
Target variable: **expenses**

## 1. Upload and clean data

In [None]:
# Mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Import libraries
! pip install regressors
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.tree import DecisionTreeRegressor
from regressors import stats
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

In [None]:
# Read data
insurance = pd.read_csv("/content/drive/MyDrive/IS470_data/insurance.csv")
insurance

In [None]:
# Show the first rows of the data frame
insurance.head()

In [None]:
# Examine variable type
insurance.dtypes

In [None]:
# Change categorical variables to "category"
insurance['sex'] = insurance['sex'].astype('category')
insurance['smoker'] = insurance['smoker'].astype('category')
insurance['region'] = insurance['region'].astype('category')

In [None]:
# Examine variable type
insurance.dtypes

In [None]:
# Data exploration: some examples
# Histogram of insurance expenses
snsplot = sns.histplot(x='expenses', data = insurance)
snsplot.set_title("Histogram of expenses in the insurance data set")

In [None]:
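# Explore relationships among all numeric variables: correlation matrix
# numeric_only=True skips the categorical columns, which recent pandas versions will not silently drop
insurance.corr(numeric_only=True)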

## 2. Partition the data set for regression model

In [None]:
# Create dummy variables
insurance = pd.get_dummies(insurance, columns=['sex','smoker','region'], drop_first=True)
insurance
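
To see what `drop_first=True` does, here is a minimal sketch on a made-up three-row frame. The `toy` data and its values are assumptions for illustration and are not taken from the insurance file. A categorical column with k levels becomes k-1 indicator columns, and the dropped level acts as the baseline.

```python
import pandas as pd

# Hypothetical toy data, not taken from insurance.csv
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no"],
    "region": ["northeast", "southwest", "northwest"],
})

# drop_first=True drops the first level (in sorted order) of each column
toy_dummies = pd.get_dummies(toy, columns=["smoker", "region"], drop_first=True)
print(toy_dummies.columns.tolist())
```

Here `smoker_yes` is kept and the level `no` is the baseline, so a coefficient on `smoker_yes` is read as the difference between smokers and non-smokers.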

In [None]:
# Partition the data
target = insurance['expenses']
predictors = insurance.drop(['expenses'],axis=1)
predictors_train, predictors_test, target_train, target_test = train_test_split(predictors, target, test_size=0.3, random_state=0)
print(predictors_train.shape, predictors_test.shape, target_train.shape, target_test.shape)
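
As a quick check of how `test_size=0.3` divides the rows, the sketch below splits ten made-up observations. The arrays `X` and `y` are hypothetical stand-ins for the predictors and target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 predictors
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# 30% of the rows go to the testing set; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Every row lands in exactly one of the two sets, so the training and testing sizes add up to the original number of rows.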

In [None]:
# Examine the distribution of target variable for training data set
snsplot = sns.histplot(data = target_train)
snsplot.set_title("Histogram of expenses in the training data set")

In [None]:
# Examine the distribution of target variable for testing data set
snsplot = sns.histplot(data = target_test)
snsplot.set_title("Histogram of expenses in the testing data set")

## 3.Simple linear regression

In [None]:
# Build a simple linear regression model with only bmi as predictor
model1 = linear_model.LinearRegression()
model1.fit(predictors_train[['bmi']], target_train)

In [None]:
# Show model summary
predictor_names = predictors_train[['bmi']].columns.values
stats.summary(model1, predictors_train[['bmi']], target_train, predictor_names)
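
As a reminder of how to read a simple regression slope, here is a minimal sketch on synthetic data. The numbers are made up, with the true slope set to 3; they say nothing about the insurance data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 2 + 3*x exactly
x = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 + 3.0 * x.ravel()

toy_model = LinearRegression().fit(x, y)

# The slope is the change in the prediction when x increases by 1
slope = toy_model.coef_[0]
diff = toy_model.predict([[6.0]])[0] - toy_model.predict([[5.0]])[0]
print(slope, diff)
```

The two printed values agree: moving the predictor up by one unit changes the predicted value by exactly the fitted coefficient.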

Q1. How do the expenses change when bmi increases by 1? <br>


In [None]:
# Make predictions on testing data
prediction_on_test = model1.predict(predictors_test[['bmi']])

In [None]:
# Examine the evaluation results on testing data: MAE and RMSE
MAE = mean_absolute_error(target_test, prediction_on_test)
# Take the square root of MSE directly; newer scikit-learn versions no longer accept squared=False
RMSE = mean_squared_error(target_test, prediction_on_test) ** 0.5
print("MAE:", MAE)
print("RMSE:", RMSE)
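
For reference, both metrics can be computed by hand. The sketch below uses made-up actual and predicted values (not results from this lab) and plain Python, so the arithmetic is easy to follow.

```python
import math

# Made-up actual and predicted values, for illustration only
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 35.0]

errors = [a - p for a, p in zip(actual, predicted)]

# MAE: mean of the absolute errors
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: square root of the mean squared error; large errors weigh more
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

print("MAE:", mae)
print("RMSE:", rmse)
```

Because squaring emphasizes large errors, RMSE is never smaller than MAE on the same predictions. A wide gap between the two suggests a few large misses.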

## 4. Multiple linear regression

In [None]:
# Build a multiple linear regression model with all predictors
model2 = linear_model.LinearRegression()
model2.fit(predictors_train, target_train)

In [None]:
# Show model summary
predictor_names = predictors_train.columns.values
stats.summary(model2, predictors_train, target_train, predictor_names)

Q2. How do the expenses change when bmi increases by 1? <br>


Q3. Do you think bmi is important in predicting expenses? Why? <br>


In [None]:
# Make predictions on testing data
prediction_on_test = model2.predict(predictors_test)

In [None]:
# Examine the evaluation results on testing data: MAE and RMSE
MAE = mean_absolute_error(target_test, prediction_on_test)
# Take the square root of MSE directly; newer scikit-learn versions no longer accept squared=False
RMSE = mean_squared_error(target_test, prediction_on_test) ** 0.5
print("MAE:", MAE)
print("RMSE:", RMSE)

## 5. Improving Model Performance: Adding non-linear relationships

Add a higher-order "age" term
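
A minimal sketch of the idea on synthetic data: the frame `toy`, the column name `age2`, and the quadratic relationship are all assumptions made for illustration. A squared term is just an extra column, and the model stays linear in its coefficients.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): the outcome grows with the square of age
rng = np.random.default_rng(0)
toy = pd.DataFrame({"age": rng.integers(18, 65, size=200).astype(float)})
toy["y"] = 5.0 * toy["age"] ** 2 + rng.normal(0, 100, size=200)

# Add the squared term as an extra column ("age2" is a name chosen here)
toy["age2"] = toy["age"] ** 2

# Compare in-sample R^2 with and without the squared term
linear_fit = LinearRegression().fit(toy[["age"]], toy["y"])
quadratic_fit = LinearRegression().fit(toy[["age", "age2"]], toy["y"])
r2_linear = linear_fit.score(toy[["age"]], toy["y"])
r2_quadratic = quadratic_fit.score(toy[["age", "age2"]], toy["y"])
print(r2_linear, r2_quadratic)
```

On training data, adding a column can never lower R^2, so the fair comparison is on the testing set using MAE and RMSE.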

In [None]:
# add a higher-order "age" term


In [None]:
# Partition the data


In [None]:
# Build a linear regression model with non-linear relationships


In [None]:
# Show model summary


In [None]:
# Make predictions on testing data


In [None]:
# Examine the evaluation results on testing data: MAE and RMSE


Add an interaction effect:

In [None]:
# add an indicator for BMI >= 30
insurance.loc[insurance['bmi'] >= 30, 'bmi30'] = 1
insurance.loc[insurance['bmi'] < 30, 'bmi30'] = 0
insurance['bmi30'] = insurance['bmi30'].astype('category')

In [None]:
# Add an interaction effect: bmi30 * smoker
insurance['bmi30_smoker'] = insurance['bmi30'].astype(float) * insurance['smoker_yes'].astype(float)
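
To make the interaction term concrete, the sketch below builds it on four hypothetical rows that cover every combination of the two indicators. The values are made up for illustration.

```python
import pandas as pd

# Hypothetical rows covering all four combinations
toy = pd.DataFrame({
    "bmi": [25.0, 35.0, 25.0, 35.0],
    "smoker_yes": [0, 0, 1, 1],
})

# Indicator for BMI >= 30, then the product of the two indicators
toy["bmi30"] = (toy["bmi"] >= 30).astype(int)
toy["bmi30_smoker"] = toy["bmi30"] * toy["smoker_yes"]
print(toy)
```

The product equals 1 only for rows that are both obese and smokers, so its coefficient captures the extra expense associated with having both conditions at once, beyond the two separate effects.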

In [None]:
insurance

In [None]:
# Partition the data
target = insurance['expenses']
predictors = insurance.drop(['expenses'],axis=1)
predictors_train, predictors_test, target_train, target_test = train_test_split(predictors, target, test_size=0.3, random_state=0)
print(predictors_train.shape, predictors_test.shape, target_train.shape, target_test.shape)

In [None]:
# Build a linear regression model with non-linear relationships


In [None]:
# Show model summary


In [None]:
# Make predictions on testing data


In [None]:
# Examine the evaluation results on testing data: MAE and RMSE


Q4. Compared to the previous model (model2), does this model (model4) perform better? Why?<br>


## 6. Regression Tree
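
As a small illustration of how a regression tree predicts, the sketch below fits `DecisionTreeRegressor` to synthetic step-shaped data. The data and the name `toy_tree` are assumptions for illustration, not the lab solution.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic step-shaped data (hypothetical): y jumps from 10 to 50 at x = 5
x = np.arange(10, dtype=float).reshape(-1, 1)
y = np.where(x.ravel() < 5, 10.0, 50.0)

# max_depth limits how many successive splits the tree may make
toy_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(x, y)

# Text rendering of the splits; each leaf predicts the mean of its training rows
print(export_text(toy_tree, feature_names=["x"]))
print(toy_tree.predict([[2.0], [8.0]]))
```

`export_text` gives a text-only view of the splits. `sklearn.tree.plot_tree` draws the same structure as a figure when matplotlib is available.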

In [None]:
# Partition the data


In [None]:
# Build a regression tree model with max_depth=3


In [None]:
# plot the tree


In [None]:
# Make predictions on testing data


In [None]:
# Examine the evaluation results on testing data: MAE and RMSE


***Download the html file and submit to BeachBoard***<br>
<br>
1.   ***Download the lab6.ipynb file***
2.   ***Upload the lab6.ipynb file***
3.   ***Run the code below to generate an HTML file***
4.   ***Download the html file and submit to BeachBoard***

In [None]:
!jupyter nbconvert --to html "/content/drive/MyDrive/IS470_lab/IS470_lab6.ipynb"