Create a linear regression model in Python using any dataset of your choice. For this model you can
also create your own data. Find the best-fit line in the data and calculate the SSE (sum of squared errors)
or MSE (mean squared error), the Y intercept, and the slope for the relationship in the data. Explain your
findings and understanding of these terms in detail in the report.



---


Successfully execute the code with a linear regression model and calculate the following:
a. SSE or MSE
b. Y intercept
c. Slope 



---


Mean Absolute Error:  0.34193698828461927


---


Mean Squared Error:  0.25108181197832785


---


Root Mean Squared Error:  0.5010806441864701


---


Y intercept:  0.005664780213938073


---


Slope:  [ 0.29254799 -0.0116126   0.15938599  0.03498638  0.80153266 -0.0290586]

Importing the required packages/libraries


In [1]:
#import packages/libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.linear_model import LinearRegression

Reading the insurance data

In [2]:
data = pd.read_csv('/content/insurance.csv') #reading the csv file


In [3]:
data.head(3) #looking at the first 3 data points

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462


In [4]:
data.describe() 

Unnamed: 0,age,bmi,children,charges
count,1338.0,1338.0,1338.0,1338.0
mean,39.207025,30.663397,1.094918,13270.422265
std,14.04996,6.098187,1.205493,12110.011237
min,18.0,15.96,0.0,1121.8739
25%,27.0,26.29625,0.0,4740.28715
50%,39.0,30.4,1.0,9382.033
75%,51.0,34.69375,2.0,16639.912515
max,64.0,53.13,5.0,63770.42801


Checking for null values

In [5]:
data.isnull().sum() #checking for null values

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

The sex, smoker and region columns need to be converted to numeric codes.
0 is a non-smoker and 1 is a smoker.
0 is female and 1 is male.
The region code follows the alphabetical order of the region names (for example, southeast is 2 and southwest is 3).
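The integer codes can be checked rather than assumed: `LabelEncoder` sorts the labels and uses each label's position in `classes_` as its code. The following is a minimal sketch on toy lists standing in for the three columns (the lists and the `mappings` dictionary are illustrative, not part of the original notebook).

```python
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for the categorical columns (illustrative values only).
columns = {
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
}

mappings = {}
for name, values in columns.items():
    le = LabelEncoder()
    le.fit(values)
    # classes_ is sorted; a label's index in classes_ is its integer code.
    mappings[name] = {str(label): code for code, label in enumerate(le.classes_)}

for name, mapping in mappings.items():
    print(name, mapping)
```

Running the same loop over the real columns would show the exact code assigned to every category in the insurance data.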

In [6]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(data.sex.drop_duplicates()) 
data.sex = le.transform(data.sex) #sex male or female 

#print(data.sex)

le.fit(data.smoker.drop_duplicates()) 
data.smoker = le.transform(data.smoker) #smoker yes or no

#print(data.smoker)

le.fit(data.region.drop_duplicates()) #region 
data.region = le.transform(data.region)

#print(data.region)

In [7]:
data.head(5)

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,0,27.9,0,1,3,16884.924
1,18,1,33.77,1,0,2,1725.5523
2,28,1,33.0,3,0,2,4449.462
3,33,1,22.705,0,0,1,21984.47061
4,32,1,28.88,0,0,1,3866.8552


In [8]:
from sklearn.decomposition import PCA

pca = PCA(whiten=True)
pca.fit(data)
variance = pd.DataFrame(pca.explained_variance_ratio_)
np.cumsum(pca.explained_variance_ratio_)

array([0.99999851, 0.99999974, 0.99999998, 0.99999999, 1.        ,
       1.        , 1.        ])
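The first component accounts for almost all of the variance, most likely because the PCA was fitted on unscaled data, where `charges` (standard deviation of about 12110) dwarfs every other column. The sketch below uses synthetic data, not the insurance file, to show how one large-scale column dominates PCA and how standardizing first changes the picture; the array names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: two small-scale columns and one large-scale column.
small = rng.normal(0.0, 1.0, size=(500, 2))
large = rng.normal(0.0, 10000.0, size=(500, 1))
toy = np.hstack([small, large])

# PCA on the raw columns: the large-scale column dominates.
raw_ratio = PCA().fit(toy).explained_variance_ratio_

# PCA after standardizing: each column contributes on an equal footing.
scaled_ratio = PCA().fit(StandardScaler().fit_transform(toy)).explained_variance_ratio_

print("unscaled:    ", np.round(raw_ratio, 6))
print("standardized:", np.round(scaled_ratio, 6))
```

If the goal is to judge how much information each feature carries, fitting PCA on standardized features (and without the target column) would probably be more informative.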

Creating the feature matrix X and the target variable y

In [9]:
feature_cols = ['age', 'sex', 'bmi', 'children', 'smoker', 'region']

X = data[feature_cols]
y = data.charges


Using preprocessing for standardization

In [10]:
from sklearn import preprocessing

In [11]:
#standardize X and y
X = preprocessing.scale(X) 
y = preprocessing.scale(y)
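`preprocessing.scale` subtracts the column mean and divides by the population standard deviation, so every column ends up with mean 0 and standard deviation 1. A small sketch on three illustrative values (not the full `age` column) confirms the equivalence:

```python
import numpy as np
from sklearn import preprocessing

values = np.array([[18.0], [39.0], [64.0]])  # illustrative values only

scaled = preprocessing.scale(values)
# Manual z-score; np.std uses ddof=0 by default, matching preprocessing.scale.
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(scaled.ravel())
print(manual.ravel())
```

Because y is standardized as well, every error metric computed later is expressed in standard deviations of `charges`, not in currency units.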

Splitting X and y into training and test sets

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1) 

Using the linear regression model for analysis 

In [13]:
linear_model = LinearRegression()

In [14]:
model_fit = linear_model.fit(X_train,y_train) #fitting the model on the training data only

In [15]:
linear_model.score(X_train,y_train) #R squared on the training data

0.7544083642384213

In [16]:
pred = linear_model.predict(X_test) #predicted values for the test set

In [17]:
from sklearn.metrics import r2_score 
r2_score(y_test, pred) #R squared on the test data

0.740367716897532
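R squared ties directly to the SSE: it is one minus the ratio of the SSE to the total sum of squares around the mean of y. The sketch below checks this on synthetic data (the `_toy` names and the generating coefficients are assumptions made for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
# Synthetic data: two features with known slopes plus noise.
X_toy = rng.normal(size=(200, 2))
y_toy = 3.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X_toy, y_toy)
pred_toy = model.predict(X_toy)

sse = np.sum((y_toy - pred_toy) ** 2)      # sum of squared errors
sst = np.sum((y_toy - y_toy.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - sse / sst

print(r2_manual, r2_score(y_toy, pred_toy))
```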

Retrieving the slopes (one coefficient per feature)

In [18]:
linear_model.coef_ #array of weights estimated by linear regression, one per feature; shape (n_features,) for a single target

array([ 0.29254799, -0.0116126 ,  0.15938599,  0.03498638,  0.80153266,
       -0.0290586 ])
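The coefficients come back in the same order as `feature_cols`, so pairing the two makes the slopes easier to read. Since both the features and the target were standardized, each slope is the change in `charges` (in standard deviations) for a one-standard-deviation increase in that feature. The sketch below reuses the values printed above; `slopes` and `ranked` are names introduced here for illustration.

```python
import pandas as pd

feature_cols = ['age', 'sex', 'bmi', 'children', 'smoker', 'region']
# Coefficients copied from the output above.
coef = [0.29254799, -0.0116126, 0.15938599, 0.03498638, 0.80153266, -0.0290586]

slopes = pd.Series(coef, index=feature_cols)
# Rank features by the magnitude of their standardized slope.
ranked = slopes.reindex(slopes.abs().sort_values(ascending=False).index)
print(ranked)
```

Read this way, `smoker` has by far the largest standardized slope, followed by `age` and `bmi`.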

To retrieve the intercept

In [19]:
linear_model.intercept_

0.005664780213938073
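With standardized inputs and target, the intercept is expected to sit near zero; it is presumably not exactly zero here because the model was fitted on the 70% training subset, whose means differ slightly from the full-data means. To report a slope and intercept in the original units, the standardized estimates can be rescaled. The sketch below demonstrates the conversion on synthetic data (all variable names and generating values are assumptions) and checks it against a model fitted directly on the raw data.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
# Synthetic raw data with two features on different scales.
X_raw = rng.normal(loc=[40.0, 30.0], scale=[14.0, 6.0], size=(300, 2))
y_raw = (5000.0 + 250.0 * X_raw[:, 0] + 300.0 * X_raw[:, 1]
         + rng.normal(scale=1000.0, size=300))

# Fit on standardized X and y, as the notebook does.
std_model = LinearRegression().fit(preprocessing.scale(X_raw),
                                   preprocessing.scale(y_raw))

x_mean, x_sd = X_raw.mean(axis=0), X_raw.std(axis=0)
y_mean, y_sd = y_raw.mean(), y_raw.std()

# Undo the scaling: slope_j = b_j * sd_y / sd_xj
slope_orig = std_model.coef_ * y_sd / x_sd
# intercept = mean_y + sd_y * b_0 - sum_j(slope_j * mean_xj)
intercept_orig = y_mean + y_sd * std_model.intercept_ - np.sum(slope_orig * x_mean)

# Reference: fit directly on the raw data.
raw_model = LinearRegression().fit(X_raw, y_raw)
print(slope_orig, raw_model.coef_)
print(intercept_orig, raw_model.intercept_)
```

In the notebook, the same conversion would need the means and standard deviations of the original `X` and `y`, which have to be stored before `preprocessing.scale` overwrites them.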

In [20]:
from sklearn.metrics import mean_squared_error 
from sklearn import metrics

In [22]:
print('Mean Absolute Error: ', metrics.mean_absolute_error(y_test, pred))
print('Mean Squared Error: ', metrics.mean_squared_error(y_test, pred))
print('Root Mean Squared Error: ', np.sqrt(metrics.mean_squared_error(y_test, pred)))
print('Y intercept: ', linear_model.intercept_)
print('Slope: ', linear_model.coef_)

Mean Absolute Error:  0.34193698828461927
Mean Squared Error:  0.25108181197832785
Root Mean Squared Error:  0.5010806441864701
Y intercept:  0.005664780213938073
Slope:  [ 0.29254799 -0.0116126   0.15938599  0.03498638  0.80153266 -0.0290586 ]
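The assignment also names the SSE, which is the MSE multiplied by the number of predictions. The sketch below shows how SSE, MSE, RMSE and MAE relate, using four toy values that are not taken from the insurance data.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([1.0, 2.0, 3.0, 4.0])  # toy values for illustration
y_pred = np.array([1.5, 1.5, 3.0, 5.0])

residuals = y_true - y_pred
sse = np.sum(residuals ** 2)       # sum of squared errors
mse = sse / len(y_true)            # mean squared error = SSE / n
rmse = np.sqrt(mse)                # root mean squared error, in the units of y
mae = np.mean(np.abs(residuals))   # mean absolute error

print("SSE:", sse, "MSE:", mse, "RMSE:", rmse, "MAE:", mae)
```

For the model above, the SSE on the test set could be obtained the same way, for example as `np.sum((y_test - pred) ** 2)`, which equals the reported MSE times `len(y_test)`.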
