# Project: Medical Insurance Cost Prediction

## Table of Contents

1. [Importing Packages](#section-one)   

2. [Loading the Raw Data](#section-two)

3. [Data Exploration](#section-three)

4. [Data Pre-Processing](#section-four)

    * [Encoding the Categorical Features](#section-four-two)
    * [Splitting the Features and Target](#section-four-three)
    * [Splitting the data into Training and Testing data](#section-four-four)


5. [Model Training](#section-five)

    * [Linear Regression](#section-five-one)
    * [Model Evaluation](#section-five-two)


6. [Building a Predictive System](#section-six)

<a id="section-one"></a>
# 1. Importing Packages

In [611]:
# Libraries used in this project

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Apply seaborn's default plot styling
sns.set()

<a id="section-two"></a>
# 2. Loading the Raw Data

We use the medical health insurance dataset from Kaggle. The file is named 'insurance.csv', and we load it with the 'pandas' library.

In [612]:
# Load the CSV file into a pandas DataFrame

raw_data = pd.read_csv('../input/insurance/insurance.csv')

# Let's have a quick look at the dataset

raw_data.head()

Here are the first 5 rows of the dataset, which give us a feel for its structure. We are going to predict 'charges', so it becomes the target, and the remaining columns are the features.

In [613]:
# let's check the number of rows and columns in the dataset

raw_data.shape

We have 1338 rows and 7 columns in the dataset.

In [614]:
# Some more information about the dataset

raw_data.info()

In [615]:
# let's check the descriptive statistics of the variables

raw_data.describe(include='all')

The output shows that categorical variables lack numerical descriptives (such as mean and std), while numerical variables lack categorical descriptives (such as unique, top, and freq). From this we can tell that the dataset has 3 categorical features: 'sex', 'smoker', and 'region'.
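As a cross-check, the categorical columns can also be listed programmatically instead of being read off the `describe` output. The sketch below uses a small made-up sample that only mimics the column names of 'insurance.csv'; the rows and the names `sample`, `categorical_cols`, and `numeric_cols` are illustrative assumptions.

```python
import pandas as pd

# Made-up rows that only mimic the column layout of insurance.csv
sample = pd.DataFrame({
    "age": [21, 35, 48],
    "sex": ["female", "male", "male"],
    "bmi": [22.4, 31.0, 27.5],
    "children": [0, 2, 1],
    "smoker": ["no", "yes", "no"],
    "region": ["southwest", "southeast", "northwest"],
    "charges": [2100.0, 38000.0, 9200.0],
})

# Columns that are not numeric are treated as categorical here
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

print("categorical:", categorical_cols)
print("numeric:", numeric_cols)
```

On the real data, the same two calls could be applied to `raw_data` directly.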

In [616]:
# Checking the total number of missing values

raw_data.isnull().sum()

Great! We do not have any missing values, so we won't have to drop any rows or variables.

In [617]:
# No missing values were found, so we simply copy the raw data as a checkpoint

data_no_mv = raw_data.copy()

<a id="section-three"></a>
# 3. Data Exploration

Plotting the distribution of a variable is a very useful data exploration step. A histogram approximates the variable's probability density function (PDF) and shows how its values are spread, which makes it simple to identify outliers and other irregularities. The shape of the distribution also frequently serves as the basis for deciding whether to transform a feature.
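To make the link between a histogram and a density concrete, here is a minimal sketch using NumPy only. The values are randomly generated stand-ins, not the insurance data, and the names `values`, `density`, and `area` are assumptions made for this illustration.

```python
import numpy as np

# Made-up, roughly bell-shaped values standing in for a numeric column
rng = np.random.default_rng(0)
values = rng.normal(loc=30.0, scale=6.0, size=1000)

# density=True rescales the bar heights so that the total area equals 1,
# which is what makes the histogram an estimate of the PDF
density, bin_edges = np.histogram(values, bins=20, density=True)
bin_widths = np.diff(bin_edges)
area = float(np.sum(density * bin_widths))

print("area under the histogram:", round(area, 6))
```

Note that `sns.displot` plots raw counts by default; passing `stat="density"` should give the normalized version shown here.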

In [618]:
# distribution of age variable

sns.displot(data_no_mv['age'])
plt.title('Distribution of Age')
plt.show()

The figure shows that the highest density of people is at ages 20-23. From age 24 upward, the distribution is roughly uniform.

In [619]:
# plot of sex variable

sns.countplot(x = 'sex', data = data_no_mv)
plt.title('Distribution of Sex')
plt.show()

We use 'sns.countplot' here because 'sex' is a categorical variable, and a count plot represents it better than a histogram.
The figure shows that the numbers of males and females are almost equal.

In [620]:
# distribution of bmi variable

sns.displot(data_no_mv['bmi'])
plt.title('Distribution of BMI')
plt.show()

This distribution is approximately normal. The figure shows a gradual increase from about 15 up to a peak around 30, followed by a gradual decrease. We may also notice a few outliers; they are left in place in this notebook.

According to the standard BMI classification, the normal range is 18.5 to 24.9. A person above this range is overweight, and a person below it is underweight. We can see that a large number of people in this dataset are overweight!
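The BMI ranges described above can be turned into counts with `pd.cut`. This is a minimal sketch on made-up BMI values; the three labels and the names `bmi`, `bmi_group`, and `counts` are assumptions for illustration, and the bins follow the 18.5 to 24.9 normal range quoted in the text.

```python
import numpy as np
import pandas as pd

# Made-up BMI values, not taken from the dataset
bmi = pd.Series([17.0, 18.5, 22.3, 24.9, 25.0, 30.1, 41.2], name="bmi")

# right=False makes each bin closed on the left: [0, 18.5), [18.5, 25), [25, inf)
bins = [0, 18.5, 25.0, np.inf]
labels = ["underweight", "normal", "overweight"]
bmi_group = pd.cut(bmi, bins=bins, labels=labels, right=False)

# sort=False keeps the counts in the order of the labels
counts = bmi_group.value_counts(sort=False)
print(counts)
```

On the real data, `data_no_mv['bmi']` could be passed in place of the made-up series to check how many people fall into each group.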

In [621]:
# plot of children variable

sns.countplot(x = 'children', data = data_no_mv)
plt.title('Children')
plt.show()

According to the figure, the largest group is people with no children. Fewer people have 1-3 children, and very few have 4-5 children.

In [622]:
# plot of smoker variable

sns.countplot(x = 'smoker', data = data_no_mv)
plt.title('Smoker')
plt.show()

In this dataset, there are more non-smokers than smokers.

In [623]:
# plot of region variable

sns.countplot(x = 'region', data = data_no_mv)
plt.title('Region')
plt.show()

We have four regions: southwest, southeast, northwest, and northeast. People are distributed almost equally across the regions, with southeast having slightly more people than the others.

In [624]:
# distribution of charges variable

sns.displot(data_no_mv['charges'])
plt.title('Distribution of Charges')
plt.show()

Most of the charges fall between roughly 1,000 and 10,000 dollars.
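A visual impression like this can be backed by a number. The sketch below computes the share of values inside a range on a handful of made-up charges; the values and the names `charges`, `in_range`, and `share` are assumptions for illustration only.

```python
import pandas as pd

# Made-up charges, not taken from the dataset
charges = pd.Series([1200.0, 4500.0, 8700.0, 9900.0, 13000.0, 42000.0], name="charges")

# between() is inclusive on both ends by default
in_range = charges.between(1000, 10000)

# The mean of a boolean Series is the fraction of True values
share = in_range.mean()
print(f"{share:.0%} of the sample charges fall between 1,000 and 10,000")
```

Applied to `data_no_mv['charges']`, the same two lines would give the actual share for the dataset.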

<a id="section-four"></a>
# 4. Data Pre-Processing

<a id="section-four-two"></a>
## Encoding the Categorical Features

In [625]:
# Categorical features: sex, smoker, and region.

data = data_no_mv.copy()

# 'smoker' is binary, so we map it directly: yes -> 1, no -> 0

data['smoker'] = data['smoker'].map({'yes':1, 'no':0})

data.head()

In [626]:
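# 'sex' and 'region' are nominal categorical variables,
# so we create dummy variables for 'sex' here
# dtype=int keeps the dummies as 0/1 integers (recent pandas versions default to booleans)

dummies = pd.get_dummies(data['sex'], dtype=int)

# Let's have a look

dummies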

In [627]:
# One-Hot Encoder
# We use it to turn 'region' into numeric columns the model can work with

ohe = OneHotEncoder()

# fit_transform returns a sparse matrix, so we convert it to a dense array
feature_array = ohe.fit_transform(data[['region']]).toarray()

In [647]:
# Let's see the categories found in the 'region' column

feature_labels = ohe.categories_

print(feature_labels)

In [629]:
# Flatten the categories into a single 1-D array

feature_labels = np.array(feature_labels).ravel()

print(feature_labels)

In [630]:
# We are now making a data frame of these labels

features = pd.DataFrame(feature_array, columns = feature_labels)

features.head()

In [631]:
# We now join the dummy variables and the OHE columns to the original dataset

data_new = pd.concat([data, dummies, features], axis=1)

# The original text columns are no longer needed
data_new = data_new.drop(columns=['region', 'sex'])

data_new.head()
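For comparison, both nominal columns can also be encoded in a single `pd.get_dummies` call. The sketch below is an alternative, not the approach used in this notebook, and it runs on a made-up frame; the names `toy`, `encoded`, and `encoded_reduced` are assumptions. Note that the resulting column names carry a prefix (for example `sex_female`), unlike the columns built above.

```python
import pandas as pd

# Made-up rows covering both sexes and all four regions
toy = pd.DataFrame({
    "age": [25, 40, 33, 52],
    "sex": ["female", "male", "male", "female"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

# Encode both columns at once; dtype=int gives 0/1 integers
encoded = pd.get_dummies(toy, columns=["sex", "region"], dtype=int)
print(list(encoded.columns))

# drop_first=True removes one dummy per feature, which avoids
# perfectly collinear columns in a model with an intercept
encoded_reduced = pd.get_dummies(toy, columns=["sex", "region"], drop_first=True, dtype=int)
print(list(encoded_reduced.columns))
```

Keeping every dummy column, as this notebook does, still lets the model predict; dropping one level per feature mainly makes the individual coefficients easier to interpret.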

<a id="section-four-three"></a>
## Splitting the Features and Target

In [632]:
# declare the target (y) and the features (x)

y = data_new['charges']
x = data_new.drop(columns='charges')

In [633]:
# let's have a look at the target variable

print(y)

In [634]:
# let's have a look at the features

print(x)

<a id="section-four-four"></a>
## Splitting the data into Training and Testing data

In [635]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 2)

In [636]:
# The shapes show how many observations go into the training and testing sets

print(x.shape, x_train.shape, x_test.shape)
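To see what an 80/20 split means for a dataset of this size, here is a small sketch on placeholder arrays with the same dimensions (1338 rows, 10 feature columns). The arrays hold zeros and the names `x_demo`, `y_demo`, `x_tr`, and `x_te` are assumptions; only the shapes matter here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same dimensions as the features and target
x_demo = np.zeros((1338, 10))
y_demo = np.zeros(1338)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.2, random_state=2)

print(x_demo.shape, x_tr.shape, x_te.shape)
```

With 1338 rows, this leaves 1070 observations for training and 268 for testing.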

<a id="section-five"></a>
# 5. Model Training

<a id="section-five-one"></a>
## Linear Regression

In [637]:
# Create the linear regression model and fit it on the training data

reg = LinearRegression()
reg.fit(x_train.values, y_train.values)
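After fitting, the learned coefficients and intercept can be inspected to see how each feature contributes to the prediction. The sketch below fits a separate model on made-up, noise-free data so that the true values are known; the feature names and the names `X_demo`, `y_demo`, and `demo_reg` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data generated from a known rule: y = 5 + 2*a - 1*b (no noise)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(50, 2))
y_demo = 5.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]

demo_reg = LinearRegression()
demo_reg.fit(X_demo, y_demo)

# coef_ holds one weight per feature column, intercept_ the constant term
for name, coef in zip(["feature_a", "feature_b"], demo_reg.coef_):
    print(f"{name}: {coef:.3f}")
print(f"intercept: {demo_reg.intercept_:.3f}")
```

For the model in this notebook, the same loop could pair `x_train.columns` with `reg.coef_`.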

<a id="section-five-two"></a>
## Model Evaluation

In [638]:
# prediction on training data

train_data_pred = reg.predict(x_train.values)

In [639]:
# R squared value on the training data

from sklearn import metrics

r2_train = metrics.r2_score(y_train, train_data_pred)

print('R squared value (training): ', r2_train)

In [640]:
# prediction on testing data

test_data_pred = reg.predict(x_test.values)

In [641]:
# R squared value on the testing data

r2_test = metrics.r2_score(y_test, test_data_pred)

print('R squared value (testing): ', r2_test)
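R squared measures the share of the target's variance that the model explains: one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on a few made-up numbers and compares the result with `metrics.r2_score`; the arrays and the names `ss_res`, `ss_tot`, and `r2_manual` are assumptions for illustration.

```python
import numpy as np
from sklearn import metrics

# Made-up targets and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Residual sum of squares: error left after the model's predictions
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: error of always predicting the mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot
r2_sklearn = metrics.r2_score(y_true, y_pred)

print(r2_manual, r2_sklearn)
```

A training score that is much higher than the testing score would point to overfitting, which is why both values are computed above.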

In [642]:
# Residuals = differences between the targets and the predictions
# They serve as estimates of the errors

sns.displot(y_train - train_data_pred)
plt.title('Residuals PDF', size = 18)
plt.show()

<a id="section-six"></a>
# 6. Building a Predictive System

In [643]:
# we pick one set of feature values and let the model predict the corresponding charge
# categorical variables are entered using their encoded values

# features used: age:31, sex:female, bmi:25.74, children:0, smoker:no, region:southeast
# column order: age, bmi, children, smoker, female, male, northeast, northwest, southeast, southwest

input_data = (31, 25.74, 0, 0, 1, 0, 0, 0, 1, 0)
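Writing the encoded tuple by hand is error-prone, because the position of every 0 and 1 has to match the column order produced by the encoding steps (age, bmi, children, smoker, female, male, northeast, northwest, southeast, southwest). A small helper can build it from readable values. The function `build_feature_row` and the constant `REGIONS` below are hypothetical names introduced for this sketch; they are not part of the notebook.

```python
# Region columns in the order created by the one-hot encoding step
REGIONS = ("northeast", "northwest", "southeast", "southwest")


def build_feature_row(age, sex, bmi, children, smoker, region):
    """Return the features in the order: age, bmi, children, smoker,
    female, male, northeast, northwest, southeast, southwest."""
    if sex not in ("female", "male"):
        raise ValueError(f"unknown sex: {sex!r}")
    if smoker not in ("yes", "no"):
        raise ValueError(f"unknown smoker value: {smoker!r}")
    if region not in REGIONS:
        raise ValueError(f"unknown region: {region!r}")

    row = [age, bmi, children, 1 if smoker == "yes" else 0]
    row += [int(sex == "female"), int(sex == "male")]
    row += [int(region == name) for name in REGIONS]
    return tuple(row)


example = build_feature_row(31, "female", 25.74, 0, "no", "southeast")
print(example)
```

For the record used in this section, the helper reproduces the hand-written tuple `(31, 25.74, 0, 0, 1, 0, 0, 0, 1, 0)`.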

In [644]:
# changing input_data into a numpy array

array_data = np.asarray(input_data)

In [645]:
# reshape the array into a single row, since we predict for one observation

array_data_reshaped = array_data.reshape(1,-1)

In [646]:
prediction = reg.predict(array_data_reshaped)

# predict returns an array, so we print its single element
print('The insurance charge is $', prediction[0])

The original charge for this record is 3756.8552. Our model's prediction is close to the original charge, so the predictive system works as intended.

That's all, and thank you! I hope you liked my linear regression model and prediction project. Please do let me know your feedback and advice in the comments :))