# **Practice Project: Insurance Cost Analysis**

Estimated time needed: **75** minutes

In this project, you will analyze an insurance dataset described by the parameters listed below.

| Parameter |Description| Content type |
|---|----|---|
|age| Age in years| integer |
|gender| Male or Female|integer (1 or 2)|
| bmi | Body mass index | float |
|no_of_children| Number of children | integer|
|smoker| Whether smoker or not | integer (0 or 1)|
|region| Which US region - NW, NE, SW, SE | integer (1,2,3 or 4 respectively)|
|charges| Annual Insurance charges in USD | float|
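
Several columns in the table are stored as integer codes. As a minimal sketch of how such codes can be turned back into readable labels, the snippet below maps the `region` codes using the order given in the table and treats `smoker` as a boolean. The rows in `sample` and the names `REGION_LABELS`, `region_label`, and `is_smoker` are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Mapping taken from the parameter table: NW, NE, SW, SE -> 1, 2, 3, 4
REGION_LABELS = {1: "NW", 2: "NE", 3: "SW", 4: "SE"}

# Hypothetical rows for illustration only; not taken from the dataset
sample = pd.DataFrame({"region": [1, 4, 2, 3], "smoker": [0, 1, 0, 0]})

# Decode the integer codes into labels and a boolean flag
sample["region_label"] = sample["region"].map(REGION_LABELS)
sample["is_smoker"] = sample["smoker"].astype(bool)
print(sample)
```

The models later in the project still need the numeric codes, so a decoded column like this is only useful for plots and summaries.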

## Objectives
In this project, you will:
 - Load the data as a `pandas` dataframe
 - Clean the data, taking care of the blank entries
 - Run exploratory data analysis (EDA) and identify the attributes that most affect the `charges`
 - Develop single-variable and multivariable linear regression models for predicting the `charges`
 - Use Ridge regression to refine the performance of the linear regression models

In [None]:
# Import libraries and define the dataset location
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sbn
import sklearn as sk

HEADERS = ['age', 'gender', 'bmi', 'no_of_children', 'smoker', 'region', 'charges']

filepath = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-Coursera/medical_insurance_dataset.csv'

# Task 1: Import the dataset

Import the dataset into a `pandas` dataframe. Note that there are currently no headers in the CSV file.

Print the first 10 rows of the dataframe to confirm successful loading.

In [None]:
df = pd.read_csv(filepath, header=None)
df.head(10)

In [None]:
df.columns = HEADERS
df.head()

In [None]:
df.replace('?', np.nan, inplace=True)
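
The three steps above (load without a header, assign column names, replace `?` with `NaN`) can possibly be folded into a single `read_csv` call. The sketch below assumes the same `HEADERS` list and uses a small in-memory CSV with made-up rows in place of the real file, so it runs without network access.

```python
import io
import pandas as pd

HEADERS = ['age', 'gender', 'bmi', 'no_of_children', 'smoker', 'region', 'charges']

# Two made-up rows standing in for the header-less CSV file
csv_text = "30,1,25.0,2,0,1,5000.00\n?,2,31.5,0,1,4,20000.00\n"

# names= supplies the headers, na_values= turns '?' into NaN while parsing
demo = pd.read_csv(io.StringIO(csv_text), header=None, names=HEADERS, na_values="?")
print(demo.dtypes)
```

One side effect of `na_values` is that a column with blanks is parsed as numeric right away rather than as text, so the later type conversion may no longer be needed.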

In [None]:
df.info()
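
`df.info()` reports non-null counts per column. A more direct way to see where the blank entries are is to count the `NaN` values per column, as in this sketch. The frame `toy` holds made-up values and only mimics the structure of the project data.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; '?' marks a blank entry as in the project data
toy = pd.DataFrame({
    "age": ["30", "?", "45"],
    "smoker": ["0", "1", "?"],
    "bmi": [25.0, 31.5, 28.2],
})
toy = toy.replace("?", np.nan)

# Number of missing entries in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```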

In [None]:
average_age = df['age'].astype('float64').mean(axis=0)
df['age'] = df['age'].replace(np.nan, average_age)  # assign back; inplace on a column selection is unreliable

In [None]:
most_smoker = df['smoker'].value_counts().idxmax()
df['smoker'] = df['smoker'].replace(np.nan, most_smoker)  # assign back; inplace on a column selection is unreliable

In [None]:
df['age'] = df['age'].astype('float64')
df['smoker'] = df['smoker'].astype('int64')
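
The cleaning cells above fill a continuous column with its mean and a categorical column with its most frequent value. The same idea can be sketched with `fillna` on a toy frame. The values below are invented for illustration, and `fillna` with `mode()` is a substitute for the `replace` and `value_counts().idxmax()` calls used in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up values stored as text, with NaN marking the blank entries
toy = pd.DataFrame({
    "age": ["30", np.nan, "50"],
    "smoker": ["0", "0", np.nan],
}, dtype=object)

# Continuous column: convert to float first, then fill gaps with the mean
toy["age"] = toy["age"].astype("float64")
toy["age"] = toy["age"].fillna(toy["age"].mean())

# Categorical column: fill gaps with the most frequent value, then cast to int
toy["smoker"] = toy["smoker"].fillna(toy["smoker"].mode()[0]).astype("int64")
print(toy)
```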

In [None]:
df['charges'] = np.round(df['charges'], 2)
df['charges'].head()

In [None]:
sbn.regplot(x='bmi', y='charges', data=df, line_kws=dict(color="r"))
plt.ylim(0,)
plt.show()

In [None]:
sbn.boxplot(x='smoker', y='charges', data=df)

In [None]:
df.corr()
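
To read the correlation matrix with the `charges` objective in mind, it can help to keep only the `charges` column and sort the other attributes by absolute correlation. The sketch below does this on made-up numbers; the variable name `ranking` and the toy values are assumptions for illustration, not results from the dataset.

```python
import pandas as pd

# Made-up numbers chosen so that charges rise sharply for smokers
toy = pd.DataFrame({
    "age":     [20, 30, 40, 50, 60, 25],
    "smoker":  [0, 0, 1, 0, 1, 1],
    "charges": [2000.0, 3000.0, 21000.0, 5000.0, 26000.0, 18000.0],
})

# Correlation of every attribute with charges, strongest first
ranking = toy.corr()["charges"].drop("charges").abs().sort_values(ascending=False)
print(ranking)
```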

In [105]:
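from sklearn.linear_model import LinearRegression

lr = LinearRegression()
X = df[['smoker']]                # single predictor
Y = df[['charges']]               # target
Z = df.drop(['charges'], axis=1)  # all other attributes as predictors

# Single-variable model; this score is not displayed because
# a notebook cell shows only its last expression
lr.fit(X, Y)
lr.score(X, Y)

# Multi-variable model; the R^2 shown below belongs to this fit
lr.fit(Z, Y)
lr.score(Z, Y)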

0.7505867314418195
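
The value returned by `score` is the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below checks this on synthetic data by computing the two sums of squares by hand. The data and the names `ss_res`, `ss_tot`, and `r2_manual` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration: y depends linearly on x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(50, 1))
y = 3.0 * x[:, 0] + 2.0 + rng.normal(0, 1.0, size=50)

model = LinearRegression().fit(x, y)
y_pred = model.predict(x)

ss_res = np.sum((y - y_pred) ** 2)    # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(x, y))
```

Both printed numbers should agree up to floating-point error. Note that scoring on the same rows used for fitting, as the notebook cell above does, measures fit quality rather than predictive performance.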

In [109]:
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

Input = [('polynomial', PolynomialFeatures(degree=2)),('scale', StandardScaler()),('Model', LinearRegression())]
pipe = Pipeline(Input)
Z = Z.astype('float64')
pipe.fit(Z, Y)
yhat = pipe.predict(Z)
r2_score(Y, yhat)

0.8453700268104134

In [113]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(Z, Y, test_size=0.2, random_state=1)
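
As a quick sanity check on what `test_size=0.2` and `random_state=1` do, the following sketch splits a tiny made-up array and verifies the resulting shapes and reproducibility. The arrays `features` and `target` are placeholders, not project data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each
features = np.arange(20).reshape(10, 2)
target = np.arange(10)

x_tr, x_te, y_tr, y_te = train_test_split(features, target, test_size=0.2, random_state=1)
print(x_tr.shape, x_te.shape)

# The same random_state yields the same split on every call
_, x_te_again, _, _ = train_test_split(features, target, test_size=0.2, random_state=1)
```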

In [114]:
from sklearn.linear_model import Ridge

RidgeModel=Ridge(alpha=0.1)
RidgeModel.fit(x_train, y_train)
yhat = RidgeModel.predict(x_test)
print(r2_score(y_test,yhat))

0.728592699726808


In [115]:
pr = PolynomialFeatures(degree=2)
x_train_pr = pr.fit_transform(x_train)
x_test_pr = pr.transform(x_test)  # reuse the transformer fitted on the training data
RidgeModel.fit(x_train_pr, y_train)
y_hat = RidgeModel.predict(x_test_pr)
print(r2_score(y_test,y_hat))

0.8259215315586219
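
One possible way to extend the Ridge step is to wrap the polynomial expansion, scaling, and Ridge model in a single `Pipeline`, so that every transformer is fitted on the training split only, and then compare a few `alpha` values on the test split. The sketch below uses synthetic data. The alpha grid, the step names, and the variables `X_syn`, `y_syn`, and `scores` are assumptions for illustration and not part of the original project.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a quadratic term, so degree-2 features are useful
rng = np.random.default_rng(1)
X_syn = rng.normal(size=(200, 3))
y_syn = 2.0 * X_syn[:, 0] + X_syn[:, 1] ** 2 + rng.normal(0, 0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=1)

scores = {}
for alpha in (0.01, 0.1, 1.0, 10.0):
    model = Pipeline([
        ("polynomial", PolynomialFeatures(degree=2)),
        ("scale", StandardScaler()),
        ("ridge", Ridge(alpha=alpha)),
    ])
    model.fit(X_tr, y_tr)                     # transformers see training rows only
    scores[alpha] = model.score(X_te, y_te)   # R^2 on the held-out rows

for alpha, score in scores.items():
    print(alpha, round(score, 4))
```

Choosing `alpha` from test-set scores, as done here for brevity, leaks information from the test split; cross-validation on the training split would be the more careful design.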
