## Predicting insurance costs with Linear Regression in scikit-learn

To make a profit, an insurance company (the insurer) must collect more in premiums than it pays out to the insured.
Insurers therefore invest considerable time and money in building models that accurately predict health care costs.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error,r2_score

### Read the data from a CSV file

In [2]:
data = pd.read_csv("insurance.csv")

In [3]:
data.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


### Convert string columns to integers with LabelEncoder

In [4]:
le=preprocessing.LabelEncoder()

In [5]:
le.fit(data['smoker'])
data['smoker']=le.transform(data['smoker'])

In [6]:
le.fit(data['sex'])
data['sex']=le.transform(data['sex'])

In [7]:
le.fit(data['region'])
data['region']=le.transform(data['region'])

In [8]:
data.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,0,27.9,0,1,3,16884.924
1,18,1,33.77,1,0,2,1725.5523
2,28,1,33.0,3,0,2,4449.462
3,33,1,22.705,0,0,1,21984.47061
4,32,1,28.88,0,0,1,3866.8552
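
The integer codes in the table follow from how `LabelEncoder` works: it sorts the unique labels and uses each label's position in that sorted list as its code. The sketch below illustrates this on a few toy region values (typed in by hand, not read from `insurance.csv`).

```python
from sklearn.preprocessing import LabelEncoder

# Toy values mirroring the 'region' column; not read from insurance.csv.
regions = ["southwest", "southeast", "southeast", "northwest", "northeast"]

le = LabelEncoder()
codes = le.fit_transform(regions)

# classes_ holds the sorted unique labels; a label's code is its index there.
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)
print(codes.tolist())
```

For a column with more than two categories, such as region, these integers impose an arbitrary order that a linear model will treat as a numeric scale.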


### Attribute Information:

Body Mass Index (BMI): People with a high BMI pay significantly higher premiums than people with a normal BMI, because a high BMI is associated with ailments such as heart problems, joint problems, and diabetes. People with a higher BMI may also need specialized treatment even for routine care such as pregnancy, which complicates otherwise simple procedures and raises premium rates.

Gender: Women are more likely to visit doctors, take prescription medication, and suffer from chronic diseases.

Age: Younger individuals generally pay much lower premiums because they have fewer diagnosed and undiagnosed conditions than older individuals. Young policyholders are less likely to have health problems and less likely to visit a doctor.

Smoking: Smoking is a heavily negative factor for life insurance premiums. It is closely linked to cardiovascular disease and to numerous forms of cancer.

Region: Residents of a region share factors that affect health, such as climate, a lack of healthy food options, and a cultural aversion to exercise.

In [9]:
# x=np.array(data['region'],dtype='float')
# y=np.array(data['charges'])
# x=x.reshape(-1,1)
# y=y.reshape(-1,1)

Considering bmi alone, the error is 0.184.

Considering age alone, the error is 0.178.

Considering smoker alone, the error is 0.12.

Considering sex alone, the error is 0.19.

Considering children alone, the error is 0.19.

Considering region alone, the error is 0.178.

Because the error is high when only a single feature is used, multiple features are considered instead.
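
The single-feature comparison can be sketched as a small helper that fits a one-column model and reports the hold-out RMSE. This is a minimal sketch, not the notebook's original cell: the helper name `single_feature_rmse`, the fixed `seed`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler


def single_feature_rmse(feature, target, seed=0):
    """Fit a one-feature linear model and return RMSE on a held-out 20% split."""
    x = MinMaxScaler().fit_transform(np.asarray(feature, dtype=float).reshape(-1, 1))
    y = MinMaxScaler().fit_transform(np.asarray(target, dtype=float).reshape(-1, 1))
    x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.2, random_state=seed)
    model = LinearRegression().fit(x_tr, y_tr)
    return float(np.sqrt(mean_squared_error(y_te, model.predict(x_te))))


# Synthetic stand-in data (hypothetical, not the insurance dataset):
# 'informative' drives the target, 'noise' is unrelated to it.
rng = np.random.default_rng(0)
informative = rng.uniform(0, 1, 500)
noise = rng.uniform(0, 1, 500)
target = 10.0 * informative + rng.normal(0, 0.5, 500)

errors = {
    "informative": single_feature_rmse(informative, target),
    "noise": single_feature_rmse(noise, target),
}
print(errors)
```

On the real data, the same helper could be called once per column, for example `single_feature_rmse(data['bmi'], data['charges'])`.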

### Using three features (bmi, age, smoker) as a NumPy array

In [10]:
x = np.array(data[['bmi','age','smoker']])
y = np.array(data['charges'])

### Backward elimination for feature selection
Considering all the features, the root mean squared error (RMSE) is 0.099.

Removing region leaves the RMSE at 0.099, so region has no impact.

Removing sex raises the RMSE to 0.101, so sex has a small impact.

Removing smoker raises the RMSE to 0.18, so smoker has a large impact.

Removing age raises the RMSE to 0.116, so age has an impact.

Removing children lowers the RMSE to 0.095, so children does not help the model either.

Removing bmi raises the RMSE to 0.104, so bmi has an impact.

With bmi, age, and smoker the error is 0.09; adding sex gives 0.099, so sex is not included.
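
The drop-one-feature comparison described here can be sketched as a loop that removes each column in turn and refits the model. The code below is an illustration under stated assumptions: the data is synthetic, the feature names `a`, `b`, `c` and the helper `holdout_rmse` are hypothetical, and the split uses a fixed `random_state` so that differences in RMSE come from the features rather than from a different random split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def holdout_rmse(X, y, seed=0):
    """RMSE of a linear model on a fixed 20% hold-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    return float(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))


# Synthetic stand-in data (hypothetical): only 'a' and 'b' influence y.
rng = np.random.default_rng(1)
names = ["a", "b", "c"]
X = rng.uniform(0, 1, size=(600, 3))
y = 4.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 0.1, 600)

baseline = holdout_rmse(X, y)
# Drop one column at a time and record the RMSE without it.
drop_one = {
    name: holdout_rmse(np.delete(X, i, axis=1), y)
    for i, name in enumerate(names)
}
print(f"all features: {baseline:.3f}")
for name, rmse in drop_one.items():
    print(f"without {name}: {rmse:.3f} (change {rmse - baseline:+.3f})")
```

A feature whose removal barely changes the RMSE is a candidate for elimination. When the differences are as small as a few thousandths, it is worth repeating the comparison over several seeds before drawing a conclusion.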

### Convert y into a 2-D array

In [11]:
y=y.reshape(-1,1)

### Normalize the array

In [12]:
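x_scaler=MinMaxScaler()
y_scaler=MinMaxScaler()

In [13]:
# Separate scalers keep the fitted ranges of both x and y,
# so y_scaler.inverse_transform can map predictions back to charges
X=x_scaler.fit_transform(x)
Y=y_scaler.fit_transform(y)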
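
With its default settings, `MinMaxScaler` maps each column to the range 0 to 1 using `(value - min) / (max - min)`. The sketch below reproduces that arithmetic with plain NumPy on a few hypothetical charges, and shows the inverse mapping back to the original units.

```python
import numpy as np

# Hypothetical charges, used only to illustrate the arithmetic.
charges = np.array([[1000.0], [5000.0], [21000.0]])

lo, hi = charges.min(axis=0), charges.max(axis=0)
scaled = (charges - lo) / (hi - lo)      # what MinMaxScaler computes by default
restored = scaled * (hi - lo) + lo       # what inverse_transform computes

print(scaled.ravel())
print(restored.ravel())
```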

### Splitting the data into train and test sets

In [22]:
xtrain,xtest,ytrain,ytest=train_test_split(X,Y,test_size=0.2)
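
Without a fixed seed, `train_test_split` shuffles differently on every run, so the RMSE and R2 values reported later will vary slightly between runs. A minimal sketch of a repeatable split follows; the placeholder arrays and the value `random_state=42` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same layout as X (3 features) and Y (1 column).
X_demo = np.arange(30, dtype=float).reshape(10, 3)
Y_demo = np.arange(10, dtype=float).reshape(-1, 1)

# random_state=42 is an arbitrary choice; any fixed integer gives a repeatable split.
split_a = train_test_split(X_demo, Y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, Y_demo, test_size=0.2, random_state=42)

print(split_a[0].shape, split_a[1].shape)      # train and test feature shapes
print(np.array_equal(split_a[1], split_b[1]))  # same test rows both times
```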

### Using the LinearRegression model from scikit-learn

In [23]:
regressor = LinearRegression()

### Train the model

In [24]:
regressor.fit(xtrain,ytrain)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

### Predict the output for the test values

In [25]:
y_pred =regressor.predict(xtest)

In [26]:
y_pred

array([[ 0.1712021 ],
       [ 0.19431612],
       [ 0.16737912],
       [ 0.13083191],
       [ 0.50784548],
       [ 0.04948993],
       [ 0.11478121],
       [ 0.02530271],
       [ 0.48421319],
       [ 0.13027298],
       [ 0.18460086],
       [ 0.17725259],
       [ 0.16175682],
       [ 0.0838029 ],
       [ 0.50929722],
       [-0.00644888],
       [ 0.42541083],
       [ 0.13674541],
       [ 0.46803847],
       [ 0.41629991],
       [ 0.06860482],
       [ 0.15834257],
       [ 0.15205054],
       [ 0.18916293],
       [ 0.4957188 ],
       [ 0.20408678],
       [ 0.15761246],
       [ 0.12215543],
       [ 0.12427789],
       [ 0.02847891],
       [ 0.07653822],
       [ 0.1582682 ],
       [ 0.08056256],
       [ 0.44075014],
       [ 0.04410518],
       [ 0.10540928],
       [ 0.03482709],
       [ 0.03472004],
       [ 0.21339482],
       [ 0.42350046],
       [-0.02458737],
       [ 0.06904622],
       [ 0.49926755],
       [ 0.17185336],
       [ 0.07544131],
       [ 0

In [27]:
# for i in zip(y_pred,ytest):
#     print(i)

In [28]:
np.sqrt(mean_squared_error(ytest,y_pred))

0.08729296812698925

### RMSE is 0.087 (on the min-max scaled charges)
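
Because the target was min-max scaled before training, this RMSE is a fraction of the range of `charges`, not an amount of money. Min-max scaling is an affine map, so multiplying the scaled RMSE by `data['charges'].max() - data['charges'].min()` recovers the error in original units. The sketch below checks that relationship on hypothetical values.

```python
import numpy as np

# Hypothetical true and predicted charges in original units.
y_true = np.array([2000.0, 8000.0, 12000.0, 30000.0])
y_pred = np.array([2500.0, 7000.0, 13000.0, 28000.0])

span = y_true.max() - y_true.min()       # range used by min-max scaling
y_true_s = (y_true - y_true.min()) / span
y_pred_s = (y_pred - y_true.min()) / span

rmse_scaled = np.sqrt(np.mean((y_true_s - y_pred_s) ** 2))
rmse_original = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Scaling is affine, so the two RMSE values differ only by the factor `span`.
print(np.isclose(rmse_scaled * span, rmse_original))
```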

In [29]:
r2_score(ytest,y_pred)

0.7563289698461266

### R2 score measures the goodness of fit
(how much of the variance in the target the model explains; 1.0 is the best possible score)
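
The value returned by `r2_score` follows the standard definition of the coefficient of determination, one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch of that formula, using a hypothetical helper `r2_manual` and made-up numbers:

```python
import numpy as np


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
print(r2_manual(y_true, y_true))                 # perfect predictions
print(r2_manual(y_true, [2.5, 2.5, 2.5, 2.5]))   # always predicting the mean
print(r2_manual(y_true, [1.5, 2.0, 3.0, 3.5]))   # somewhere in between
```

A score of 1 means the predictions match exactly, 0 means the model does no better than always predicting the mean, and negative values mean it does worse than that.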