<h3><font color='#43aba2'><b>Implementing a Linear Regression model using different methods and testing the RMSE</b></font></h3>

In [1]:
#importing the required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
#loading and reading the data
df = pd.read_csv("/content/sample_data/insurance.csv")
df.head()

Unnamed: 0,age,gender,bmi,bloodpressure,diabetic,children,smoker,region,claim
0,19,female,27.9,107,No,0,Yes,southwest,16884.92
1,18,male,33.8,133,No,1,No,southeast,1725.55
2,28,male,33.0,88,Yes,3,No,southeast,4449.46
3,33,male,22.7,119,Yes,0,No,northwest,21984.47
4,32,male,28.9,91,No,0,No,northwest,3866.86


In [3]:
#exploring and analysing the data
df.shape
df.describe()

Unnamed: 0,age,bmi,bloodpressure,children,claim
count,1338.0,1338.0,1338.0,1338.0,1338.0
mean,39.207025,30.665471,109.397608,1.094918,13270.422414
std,14.04996,6.098382,17.519398,1.205493,12110.01124
min,18.0,16.0,80.0,0.0,1121.87
25%,27.0,26.3,94.0,0.0,4740.2875
50%,39.0,30.4,109.0,1.0,9382.03
75%,51.0,34.7,124.0,2.0,16639.915
max,64.0,53.1,140.0,5.0,63770.43


In [4]:
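#correlation study of the data (numeric columns only; recent pandas versions raise an error on string columns)
df.select_dtypes(include='number').corr()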

Unnamed: 0,age,bmi,bloodpressure,children,claim
age,1.0,0.109341,-0.080593,0.042469,0.299008
bmi,0.109341,1.0,-0.015544,0.012645,0.198576
bloodpressure,-0.080593,-0.015544,1.0,-0.043967,-0.028208
children,0.042469,0.012645,-0.043967,1.0,0.067998
claim,0.299008,0.198576,-0.028208,0.067998,1.0




---



<h2><b>n-1 dummy encoding</b></h2>

In [6]:
#performing n-1 dummy encoding for the categorical variables
dummies = df[['age', 'bmi','bloodpressure','children','diabetic','smoker','gender','region','claim']]
dummies = pd.get_dummies(dummies, drop_first=True)
dummies

Unnamed: 0,age,bmi,bloodpressure,children,claim,diabetic_Yes,smoker_Yes,gender_male,region_northwest,region_southeast,region_southwest
0,19,27.9,107,0,16884.92,0,1,0,0,0,1
1,18,33.8,133,1,1725.55,0,0,1,0,1,0
2,28,33.0,88,3,4449.46,1,0,1,0,1,0
3,33,22.7,119,0,21984.47,1,0,1,1,0,0
4,32,28.9,91,0,3866.86,0,0,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...
1333,50,31.0,98,3,10600.55,0,0,1,1,0,0
1334,18,31.9,138,0,2205.98,0,0,0,0,0,0
1335,18,36.9,114,0,1629.83,0,0,0,0,1,0
1336,21,25.8,121,0,2007.95,1,0,0,0,0,1


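If a categorical variable has n distinct categories, dummy encoding with `drop_first=True` converts it into n-1 indicator variables. The dropped category acts as the baseline and is represented by zeros in all n-1 columns.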
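
As a small illustration of n-1 encoding, the sketch below encodes a made-up `region` column with four categories. The toy frame is an assumption for the example and is not taken from the insurance data. pandas orders the categories alphabetically, so `northeast` is the one dropped.

```python
import pandas as pd

# Toy data (hypothetical): one categorical column with n = 4 categories
toy = pd.DataFrame({"region": ["southwest", "southeast", "northwest", "northeast"]})

# drop_first=True keeps n - 1 = 3 indicator columns
encoded = pd.get_dummies(toy, drop_first=True)

print(list(encoded.columns))
# The dropped category, 'northeast', is the baseline: its row is all zeros
print(encoded.astype(int))
```

This matches the `region_northwest`, `region_southeast` and `region_southwest` columns in the output above, where a row with zeros in all three belongs to the northeast region.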



---



<h2><b>Scaling or Standardization</b></h2>

**Scaling**<br> The process of increasing or decreasing the magnitude of values by a fixed ratio, i.e. changing the size but not the shape of the distribution. <br>
**StandardScaler** standardizes a feature by subtracting the mean and then scaling to unit variance. <br>
The result is a distribution with a mean of 0 and a standard deviation of 1. <br>
***NEED***: It puts features measured on widely different ranges on a comparable scale, which helps scale-sensitive estimators such as gradient-descent-based models (for example SGDRegressor) converge.
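
To make the standardization formula concrete, here is a minimal sketch that compares `StandardScaler` with the hand-computed z-score `(x - mean) / std`. The toy values are an assumption for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature (hypothetical values), shaped (n_samples, 1) as scikit-learn expects
x = np.array([[18.0], [28.0], [33.0], [51.0], [64.0]])

scaled = StandardScaler().fit_transform(x)

# Same result by hand: z = (x - mean) / std, using the population std (ddof=0)
manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(), scaled.std())
```

Note that `StandardScaler` uses the population standard deviation, whereas `df.describe()` reports the sample standard deviation, so the two differ slightly on small data.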


In [11]:
#standardize the numerical variables
from sklearn.preprocessing import StandardScaler

#listing the numeric and categorical data
numeric = dummies[['age', 'bmi', 'bloodpressure', 'children', 'claim']]
categorical = dummies[['diabetic_Yes', 'smoker_Yes', 'gender_male', 'region_northwest', 'region_southeast', 'region_southwest']]

#initialize standard scaler instance
scaler = StandardScaler()

#fit and transform the scaler on numerical column
scaled = scaler.fit_transform(numeric)
scaled

array([[-1.43876426, -0.4536457 , -0.13690567, -0.90861367,  0.29858346],
       [-1.50996545,  0.51418574,  1.34771851, -0.07876719, -0.95368938],
       [-0.79795355,  0.38295436, -1.22182333,  1.58092576, -0.72867485],
       ...,
       [-1.50996545,  1.02270734,  0.26280084, -0.90861367, -0.96159654],
       [-1.29636188, -0.79812808,  0.66250735, -0.90861367, -0.93036111],
       [ 1.55168573, -0.25679863,  1.69032408, -0.90861367,  1.31105343]])

In [12]:
#converting array into dataframe
scaled_df = pd.DataFrame(scaled)
scaled_df

Unnamed: 0,0,1,2,3,4
0,-1.438764,-0.453646,-0.136906,-0.908614,0.298583
1,-1.509965,0.514186,1.347719,-0.078767,-0.953689
2,-0.797954,0.382954,-1.221823,1.580926,-0.728675
3,-0.441948,-1.306650,0.548305,-0.908614,0.719843
4,-0.513149,-0.289606,-1.050521,-0.908614,-0.776802
...,...,...,...,...,...
1333,0.768473,0.054876,-0.650814,1.580926,-0.220551
1334,-1.509965,0.202511,1.633223,-0.908614,-0.914002
1335,-1.509965,1.022707,0.262801,-0.908614,-0.961597
1336,-1.296362,-0.798128,0.662507,-0.908614,-0.930361


In [13]:
#renaming the dataframe columns
scaled_df = scaled_df.rename({0:'age', 1:'bmi', 2:'bloodpressure',3:'children', 4:'claim'}, axis = 1)
scaled_df

Unnamed: 0,age,bmi,bloodpressure,children,claim
0,-1.438764,-0.453646,-0.136906,-0.908614,0.298583
1,-1.509965,0.514186,1.347719,-0.078767,-0.953689
2,-0.797954,0.382954,-1.221823,1.580926,-0.728675
3,-0.441948,-1.306650,0.548305,-0.908614,0.719843
4,-0.513149,-0.289606,-1.050521,-0.908614,-0.776802
...,...,...,...,...,...
1333,0.768473,0.054876,-0.650814,1.580926,-0.220551
1334,-1.509965,0.202511,1.633223,-0.908614,-0.914002
1335,-1.509965,1.022707,0.262801,-0.908614,-0.961597
1336,-1.296362,-0.798128,0.662507,-0.908614,-0.930361


In [15]:
#build the standardized dataframe by joining the scaled numeric columns and the dummy columns on the index
X_std = pd.merge(scaled_df, categorical, left_index=True, right_index=True)
X_std

Unnamed: 0,age,bmi,bloodpressure,children,claim,diabetic_Yes,smoker_Yes,gender_male,region_northwest,region_southeast,region_southwest
0,-1.438764,-0.453646,-0.136906,-0.908614,0.298583,0,1,0,0,0,1
1,-1.509965,0.514186,1.347719,-0.078767,-0.953689,0,0,1,0,1,0
2,-0.797954,0.382954,-1.221823,1.580926,-0.728675,1,0,1,0,1,0
3,-0.441948,-1.306650,0.548305,-0.908614,0.719843,1,0,1,1,0,0
4,-0.513149,-0.289606,-1.050521,-0.908614,-0.776802,0,0,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...
1333,0.768473,0.054876,-0.650814,1.580926,-0.220551,0,0,1,1,0,0
1334,-1.509965,0.202511,1.633223,-0.908614,-0.914002,0,0,0,0,0,0
1335,-1.509965,1.022707,0.262801,-0.908614,-0.961597,0,0,0,0,1,0
1336,-1.296362,-0.798128,0.662507,-0.908614,-0.930361,1,0,0,0,0,1


In [16]:
#separating the features and the label
X = X_std.drop('claim', axis=1)
y = X_std.loc[:, 'claim']

In [18]:
#splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X , y, test_size = 0.2, random_state = 0)



---



<h2><b>Model - 1</b></h2>

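**statsmodels OLS** -  <br>The statsmodels module is used to implement the *Ordinary Least Squares (OLS)* method of linear regression. <br>
The `OLS` class estimates a multiple regression model and provides a variety of fit statistics. It does not add an intercept by itself, which is why `sm.add_constant` is applied to the features first.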
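
As a rough illustration of what an OLS fit with an added constant computes, the sketch below solves the same least-squares problem with NumPy on synthetic data. The data, the true coefficients and the variable names are assumptions made for the example. It does not reproduce the statsmodels summary statistics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): y = 2 + 3*x1 - 1*x2 + small noise
X = rng.normal(size=(200, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Equivalent of adding a constant: prepend a column of ones for the intercept
X_const = np.column_stack([np.ones(len(X)), X])

# Least-squares solution: the beta that minimizes ||X_const @ beta - y||^2
beta, *_ = np.linalg.lstsq(X_const, y, rcond=None)

print(beta)  # close to [2, 3, -1]
```

The first entry of `beta` plays the role of the `const` row in the statsmodels summary.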

In [19]:
#using statsmodels.OLS - model 1
import statsmodels.api as sm
X_train_Sm= sm.add_constant(X_train)
model1 = sm.OLS(y_train,X_train_Sm).fit()
print(model1.summary())



                            OLS Regression Results                            
Dep. Variable:                  claim   R-squared:                       0.737
Model:                            OLS   Adj. R-squared:                  0.735
Method:                 Least Squares   F-statistic:                     297.3
Date:                Sun, 20 Jun 2021   Prob (F-statistic):          3.76e-299
Time:                        16:53:41   Log-Likelihood:                -791.33
No. Observations:                1070   AIC:                             1605.
Df Residuals:                    1059   BIC:                             1659.
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const               -0.3452      0.040  

In [20]:
#predicting for model 1 (on the training data)
y_pred_1 = model1.predict(X_train_Sm)
y_pred_1

#evaluating the performance of model 1
#mean_squared_error returns the MSE; np.sqrt(rms1) gives the RMSE
from sklearn.metrics import mean_squared_error
rms1 = mean_squared_error(y_train, y_pred_1)
rms1

0.2569784131202851
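
The title refers to RMSE, while `mean_squared_error` returns the mean squared error (MSE) by default. A minimal sketch of the relationship between the two, on toy values that are assumptions and not taken from this notebook:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values (hypothetical), not taken from the notebook
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 1.5, 3.5, 3.5])

mse = mean_squared_error(y_true, y_hat)  # mean of the squared errors
rmse = np.sqrt(mse)                      # square root of the MSE, in the units of y

print(mse, rmse)
```

Because `claim` was standardized together with the features, the errors in this notebook are expressed in standard deviations of `claim` rather than in currency units.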



---



<h2><b>Model - 2</b></h2>

**Linear Regression** - <br>
*LinearRegression()* fits a linear model that minimizes the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation.
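
The sketch below, on synthetic data that is an assumption for the example, shows how the residual sum of squares (RSS) relates to the value returned by `score`, which for `LinearRegression` is the coefficient of determination R^2 = 1 - RSS / TSS.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic data (hypothetical): linear signal plus noise
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)

# R^2 = 1 - RSS / TSS, which is what model.score reports
residuals = y - model.predict(X)
rss = np.sum(residuals ** 2)
tss = np.sum((y - y.mean()) ** 2)
r2_manual = 1.0 - rss / tss

print(model.score(X, y), r2_manual)
```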

In [21]:
#using sklearn.linear_model.LinearRegression - model 2
from sklearn.linear_model import LinearRegression
model2 = LinearRegression().fit(X_train, y_train)
#score returns R^2 on the given data (here the training set)
model2.score(X_train, y_train)

0.7373276222331174

In [22]:
#prediction for model 2 (on the test data)
y_pred_2 = model2.predict(X_test)
y_pred_2

#evaluating the performance of model 2
#mean_squared_error returns the MSE; np.sqrt(rms2) gives the RMSE
from sklearn.metrics import mean_squared_error
rms2 = mean_squared_error(y_test, y_pred_2)
rms2

0.21667094530326148



---



<h2><b>Model - 3</b></h2>

**SGDRegressor** - <br>SGD stands for Stochastic Gradient Descent. SGDRegressor fits a regularized linear model by SGD, updating the coefficients one sample at a time. <br>This makes it well suited to large-scale datasets.

**Pipeline** <br>
scikit-learn provides a Pipeline utility to help automate machine learning workflows. A pipeline chains a linear sequence of data transforms together, ending in a final estimator that can be fitted and evaluated as a single object. <br>
It gives a convenient workflow and helps make the work reproducible.
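
A minimal sketch of the pipeline idea on synthetic data follows. The data, the feature scales and `random_state=0` are assumptions chosen for the example. When the pipeline is fitted, the scaler learns its mean and variance from the training split only, and the same transform is reused at prediction time.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Synthetic data (hypothetical) with features on very different scales
X = rng.normal(size=(500, 2)) * np.array([1.0, 100.0])
y = 3.0 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 1 scales the features, step 2 fits the SGD-based linear model
pipe = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
pipe.fit(X_tr, y_tr)

print([name for name, _ in pipe.steps])
print(pipe.score(X_te, y_te))  # R^2 on held-out data
```

Fixing `random_state` makes the SGD result repeatable, which is one way a pipeline supports reproducibility.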


In [23]:
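#using SGDRegressor - model 3
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
#note: X_train is already standardized above, so the StandardScaler step rescales it again (dummy columns included)
model3 = make_pipeline(StandardScaler(), SGDRegressor())
model3.fit(X_train, y_train)
#score returns R^2 on the given data (here the training set)
model3.score(X_train, y_train)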

0.7366701131694342

In [24]:
#predicting for model 3 (on the test data)
y_pred_3 = model3.predict(X_test)

#evaluating the performance of model 3
#mean_squared_error returns the MSE; np.sqrt(rms3) gives the RMSE
rms3 = mean_squared_error(y_test, y_pred_3)
rms3

0.21670712725577101