# **Week 3: Modeling**

Our goal for this project is to create a model that can predict insurance costs based on our dataset. The simplest choice of model is to ignore any relationships between variables and predict the same number for every individual, i.e., to predict a constant. We call this constant a summary statistic because it summarizes the data in our sample, and the model is known as a constant model.
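As a rough illustration of the idea, a constant model can be sketched in a few lines. The helper names `fit_constant` and `predict_constant` and the `toy_charges` values below are made up for this example and are not part of the project code or the insurance dataset.

```python
import numpy as np

def fit_constant(y, summary=np.mean):
    """Reduce the observed values to one number (a summary statistic)."""
    return float(summary(y))

def predict_constant(constant, n):
    """Predict the same constant for each of n individuals."""
    return np.full(n, constant)

# Made-up charges, used only to illustrate the idea
toy_charges = np.array([1000.0, 2000.0, 3000.0, 10000.0])

constant = fit_constant(toy_charges)  # mean of the toy values
predictions = predict_constant(constant, len(toy_charges))
print(constant)     # 4000.0
print(predictions)  # the same value repeated four times
```

Passing a different `summary` function, such as `np.median`, would give a different constant model from the same data.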

Let's see how it works!

In [1]:
import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt
import seaborn as sns

url = 'https://github.com/millenopan/DGMI-Project/blob/master/insurance.csv?raw=true'
data = pd.read_csv(url)

In [2]:
data

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


------------------------------

Let's begin with a rough guess for what the cost might be.

In [3]:
cost_guess = 6000
cost_guess

6000

We now want to see how accurate our guess is, starting with the insurance cost in the first row of our dataset. To do so, we need some metric of how “good” or “bad” a prediction is. This is what a loss function provides: it maps a prediction and the actual value to a single number that grows as the prediction gets worse. For this project, we will look at the squared and absolute loss functions.

In [4]:
first_cost = data["charges"][0]
first_cost

16884.924

In [5]:
squared_error = (first_cost-cost_guess)**2
squared_error

118481570.48577598

In [6]:
absolute_error = abs(first_cost-cost_guess)
absolute_error

10884.923999999999

Both errors are large: the guess is off by about 10,885, roughly 64% of the actual charge, so it does a poor job on the first row. However, one row tells us little; we want to know how "good" or "bad" our model is on average across all rows. Hence, we use the root mean squared error (RMSE) and the mean absolute error (MAE). We take the square root of the mean squared error so that the final error value is in the same units as the variable we're predicting.
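Written out, the two averages look like this. The notation is introduced here for illustration and does not appear elsewhere in the project: $y_i$ is the actual charge in row $i$, $c$ is the constant prediction, and $n$ is the number of rows.

```latex
\mathrm{RMSE}(c) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - c\right)^2},
\qquad
\mathrm{MAE}(c) = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - c\right|
```

Each squared term $(y_i - c)^2$ is in squared units of the charges, so the mean squared error is too; taking the square root brings the RMSE back to the same units as the MAE and the charges themselves.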

In [7]:
# Named rmse_guess so it does not clash with the root_mean_squared_error function defined later
rmse_guess = np.mean((data["charges"]-cost_guess)**2)**0.5
rmse_guess

14120.970448485827

In [8]:
# Named mae_guess so it does not clash with the mean_absolute_error function defined later
mae_guess = np.mean(abs(data["charges"]-cost_guess))
mae_guess

8942.175742719723

The average errors are large as well, which means that we need to tweak our model. One approach is to use a summary statistic of the charges, the mean or the median, as our prediction instead of a guess. To compare them, we first wrap the two error calculations in reusable functions. Let's see how that goes.

In [9]:
def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference."""
    return np.mean((actual - predicted)**2)**0.5

In [10]:
def mean_absolute_error(actual, predicted):
    """Average absolute difference."""
    return np.mean(np.abs(actual - predicted))

In [11]:
mean_cost = np.mean(data["charges"])
actual_costs = data["charges"]
rmse_mean_cost = root_mean_squared_error(actual_costs, mean_cost)
mae_mean_cost = mean_absolute_error(actual_costs, mean_cost)
mean_cost, rmse_mean_cost, mae_mean_cost

(13270.422265141257, 12105.484975561605, 9091.126581137027)

In [12]:
median_cost = np.median(data["charges"])
actual_costs = data["charges"]
rmse_median_cost = root_mean_squared_error(actual_costs, median_cost)
mae_median_cost = mean_absolute_error(actual_costs, median_cost)
median_cost, rmse_median_cost, mae_median_cost

(9382.033, 12714.650509188747, 8351.03963065844)

In general, we decide which loss function to use to evaluate our model before we start, and then tweak our model so that this value is minimized. We can see that the root mean squared error is lower for the mean cost, and the mean absolute error is lower for the median cost. This is no coincidence: among all constants, the mean minimizes the mean squared error (and therefore its square root), while the median minimizes the mean absolute error. Hence, if we used the root mean squared error as our loss function, we would select the mean cost as our constant, and if we used the mean absolute error, we would select the median cost.
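One way to check that the mean and the median are the best constants for their respective losses, and not merely better than each other, is to try many candidate constants and see where each error is smallest. The sketch below does this on a small set of made-up, right-skewed values (`toy_charges` is hypothetical and is not the insurance dataset), searching a coarse grid in steps of 100.

```python
import numpy as np

# Made-up, right-skewed values used only for illustration
toy_charges = np.array([1200.0, 1800.0, 2500.0, 4000.0, 9000.0, 21000.0, 37500.0])

def rmse(actual, predicted):
    return np.mean((actual - predicted) ** 2) ** 0.5

def mae(actual, predicted):
    return np.mean(np.abs(actual - predicted))

# Candidate constants from the smallest to the largest value, in steps of 100
candidates = np.arange(toy_charges.min(), toy_charges.max() + 1, 100.0)

# Evaluate both losses for every candidate constant
rmse_values = np.array([rmse(toy_charges, c) for c in candidates])
mae_values = np.array([mae(toy_charges, c) for c in candidates])

best_rmse_constant = candidates[np.argmin(rmse_values)]
best_mae_constant = candidates[np.argmin(mae_values)]

print(best_rmse_constant, np.mean(toy_charges))    # 11000.0 11000.0
print(best_mae_constant, np.median(toy_charges))   # 4000.0 4000.0
```

On this toy sample the grid search lands exactly on the mean for RMSE and on the median for MAE. The two constants differ a lot here because a few large values pull the mean upward while leaving the median unchanged.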

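Compared with our initial guess of 6000, the RMSE drops from about 14,121 to 12,105 when we predict the mean, and the MAE drops from about 8,942 to 8,351 when we predict the median, so our model has improved! Note that the improvement depends on matching the constant to the loss: the mean's MAE (about 9,091) is actually higher than that of our initial guess. The errors are still large relative to the charges themselves, so we need more complex modeling techniques to predict the insurance cost. We will go over one of these models next week!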