# Week 4: Linear Regression

This week we walk through building a basic linear regression model that uses the BMI variable to predict insurance charges. A linear model has the form $y = a_0 + a_1x_1 + a_2x_2 + \dots + a_nx_n$, where the $x_i$ are our independent variables and the $a_i$ are our coefficients. In this example we use only one independent variable, BMI, so the model has the form $y = mx + c$, where $x$ is the BMI, $m$ is the slope, and $c$ is the intercept. Note that the code below creates the model with fit_intercept=False, which fixes $c = 0$, so the line we actually fit is $y = mx$.
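
To make the single-variable form concrete, here is a minimal sketch that evaluates $y = mx + c$ for a couple of BMI values. The slope and intercept are made-up numbers chosen for illustration, not coefficients fitted to the insurance data.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration (not fitted values).
m = 400.0   # slope: change in predicted charges per unit of BMI
c = 1000.0  # intercept: predicted charges when BMI is 0

bmi = np.array([20.0, 30.0])

# A linear model's prediction is the line evaluated at each x.
predicted_charges = m * bmi + c
print(predicted_charges)
```

Training a linear regression model amounts to choosing $m$ and $c$ so that predictions like these land as close as possible to the observed charges.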

In [1]:
import numpy as np 
import pandas as pd 

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

url = 'https://github.com/millenopan/DGMI-Project/blob/master/insurance.csv?raw=true'
data = pd.read_csv(url)

In [2]:
data

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


First, we divide our data into training data and test data. We use the training data to train our model, and the test data to evaluate how well the model does on unseen data. We use the train_test_split function for this; setting test_size to 0.2 means that 80% of the data is used as training data and 20% as test data. Passing random_state makes the split reproducible.

In [3]:
train, test = train_test_split(data, test_size=0.2, random_state=83)
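
To see what the split does, here is a small sketch on a made-up 10-row DataFrame. The column names, values, and the random_state of 0 are invented for illustration and are not part of the insurance data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration only.
toy = pd.DataFrame({"x": range(10), "y": [2 * i for i in range(10)]})

toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=0)
toy_train_again, _ = train_test_split(toy, test_size=0.2, random_state=0)

# 10 rows with test_size=0.2 gives 8 training rows and 2 test rows.
print(len(toy_train), len(toy_test))
# The same random_state selects the same rows in the same order.
print(toy_train.index.equals(toy_train_again.index))
# No row appears in both pieces.
print(set(toy_train.index) & set(toy_test.index))
```

The rows are shuffled before splitting, which is why the index values in the training and test sets appear out of order.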

In [5]:
X_train = train.loc[:, ["bmi"]]
y_train = train["charges"]
X_test = test.loc[:, ["bmi"]]
y_test = test["charges"]

In [6]:
display(X_train, y_train, X_test, y_test)

Unnamed: 0,bmi
408,21.120
549,46.200
947,34.200
1124,42.750
1227,37.180
...,...
1320,31.065
1335,36.850
1280,33.330
1000,22.990


408      6652.52880
549     45863.20500
947     39047.28500
1124    40904.19950
1227     7162.01220
           ...     
1320     5425.02335
1335     1629.83350
1280     8283.68070
1000    17361.76610
82      37165.16380
Name: charges, Length: 1070, dtype: float64

Unnamed: 0,bmi
665,38.060
579,23.465
84,34.800
244,27.740
1307,28.120
...,...
399,38.170
1070,37.070
802,22.300
439,29.450


665     42560.43040
579      3206.49135
84      39836.51900
244     29523.16560
1307    21472.47880
           ...     
399      1631.66830
1070    39871.70430
802      2103.08000
439      2897.32350
912     14382.70905
Name: charges, Length: 268, dtype: float64

In [7]:
model = LinearRegression(fit_intercept=False)
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=False, n_jobs=None, normalize=False)
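
A side note on the fit_intercept=False argument used above: it forces the fitted line through the origin. The sketch below, on made-up data that follows $y = 2x + 10$ exactly, compares what the model learns with and without an intercept. The coef_ and intercept_ attributes are where scikit-learn stores the fitted slope and intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data invented for illustration: y = 2x + 10 exactly.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 2.0 * X[:, 0] + 10.0

with_intercept = LinearRegression().fit(X, y)
no_intercept = LinearRegression(fit_intercept=False).fit(X, y)

# With an intercept, the model recovers slope 2 and intercept 10.
print(with_intercept.coef_[0], with_intercept.intercept_)
# Without one, the intercept is fixed at 0 and the slope changes to compensate.
print(no_intercept.coef_[0], no_intercept.intercept_)
```

On this toy data the no-intercept slope comes out well above 2, because the slope has to absorb the constant offset that the missing intercept would otherwise capture.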

We can now use our model to predict insurance charges from a BMI input. Let's predict the charges for both the training data and the test data. We can then use the root mean squared error (RMSE) function that we created in the previous week's code to evaluate our model.

In [8]:
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

In [9]:
def root_mean_squared_error(actual, predicted):
  # Square each error, average the squares, then take the square root
  return np.mean((actual-predicted)**2)**0.5

In [10]:
training_error = root_mean_squared_error(y_train, y_pred_train)
test_error = root_mean_squared_error(y_test, y_pred_test)
print("The training error is " + str(training_error) + " and the test error is " + str(test_error))

The training error is 11690.6480282237 and the test error is 12547.670463648361
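
To unpack what this error measures, here is a worked sketch of the same RMSE calculation on two made-up values, small enough to check by hand. The numbers are invented for illustration.

```python
import numpy as np

def root_mean_squared_error(actual, predicted):
    # Square the errors, average them, then take the square root.
    return np.mean((actual - predicted) ** 2) ** 0.5

# Toy values invented for illustration.
actual = np.array([10.0, 20.0])
predicted = np.array([9.0, 27.0])

# Errors are 1 and -7, squared errors are 1 and 49, and their mean is 25.
rmse = root_mean_squared_error(actual, predicted)
print(rmse)  # 5.0
```

Because of the final square root, RMSE is in the same units as the target, so the errors printed above can be read as amounts of charges. Squaring also means that large misses count for more than small ones.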


This week's document describes the error shown above in more detail, to help you understand what is going on. Let's try testing our model on a sample!



In [11]:
test

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
665,43,male,38.060,2,yes,southeast,42560.43040
579,25,female,23.465,0,no,northeast,3206.49135
84,37,female,34.800,2,yes,southwest,39836.51900
244,63,female,27.740,0,yes,northeast,29523.16560
1307,32,male,28.120,4,yes,northwest,21472.47880
...,...,...,...,...,...,...,...
399,18,female,38.170,0,no,southeast,1631.66830
1070,37,male,37.070,1,yes,southeast,39871.70430
802,21,male,22.300,1,no,southwest,2103.08000
439,26,male,29.450,0,no,northeast,2897.32350


In [12]:
model.predict(X_test)[0]

16419.56793266756

Hmm... that actually seems quite a bit off: the model predicts about 16,420, but the actual charge for this sample (row 665) is 42,560.43. Maybe we can do something to make our model better!