
# Exploring a Kaggle Notebook and Dataset

# **Predicting Patient Treatment Costs from the Medical Cost Personal Dataset**

# **Part 1 - DEFINE**

---Step 1. Define the problem---
Accurately predict insurance costs from the Medical Cost Personal dataset.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from collections import defaultdict  # provides default values for missing keys
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score,mean_squared_error
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import RandomForestRegressor

%matplotlib inline



# **Part 2 - DISCOVER**
---Step 2. Load the dataset---
Check the head, info, summary statistics, and shape of the dataset.

In [None]:
# Download the insurance file if it is not already present
import os
import requests

thePath = "./"
theLink = "https://dse200.dev/Day3/insurance.csv"
theFile = "insurance.csv"
localFile = os.path.join(thePath, theFile)

if not os.path.exists(localFile):
    r = requests.get(theLink, timeout=30)
    r.raise_for_status()  # stop here if the download failed
    with open(localFile, 'wb') as f:
        f.write(r.content)

# Load the data
df = pd.read_csv(localFile)
print(df.shape)

In [None]:
df.head(10)

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
print('Number of rows and columns in the data set: ',df.shape)

The dataset is now loaded. Its shape is (1338, 7), so there are m = 1338 training examples and 7 columns. The target variable is charges; the remaining n = 6 columns (age, sex, bmi, children, smoker, region) are the independent variables.
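
As a quick check of how these counts come out of `shape`, here is a minimal sketch on a tiny hand-made frame. The rows are invented for illustration and only mimic the column names of insurance.csv. The names `toy`, `features`, and `n_features` are assumptions of this sketch.

```python
import pandas as pd

# Made-up rows that only mimic the column layout of insurance.csv
toy = pd.DataFrame({
    "age": [19, 33, 52],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 22.7, 30.2],
    "children": [0, 1, 2],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "northwest", "southeast"],
    "charges": [1000.0, 2000.0, 3000.0],
})

m, n_columns = toy.shape                  # number of rows, total number of columns
features = toy.columns.drop("charges")    # every column except the target
n_features = len(features)

print("rows:", m, "columns:", n_columns, "independent variables:", n_features)
```

The same three lines applied to `df` should report 7 columns and 6 independent variables.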

---Step 3. Clean the dataset---

In [None]:
# Check for null count column wise
df.isnull().sum(axis=0)

---Step 4. Explore the data (EDA)---

a. Visualizing the distribution of the target variable, charges


In [None]:
f = plt.figure(figsize=(12,4))
ax = f.add_subplot(121)
# histplot with kde=True replaces the deprecated distplot
sns.histplot(df['charges'], bins=50, color='r', kde=True, ax=ax)
ax.set_title('Distribution of insurance charges')

ax = f.add_subplot(122)
# The data are already log10-transformed, so the axis stays linear
sns.histplot(np.log10(df['charges']), bins=40, color='b', kde=True, ax=ax)
ax.set_title('Distribution of log10(insurance charges)')
plt.show()
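
One way to put a number on what the two panels compare is sample skewness before and after the log transform. The sketch below uses synthetic, right-skewed values as a stand-in for the charges column. Both the data and the helper `skewness` are assumptions of this example, not part of the notebook.

```python
import numpy as np

def skewness(values):
    """Sample skewness: mean of the cubed standardized deviations."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return float(np.mean(centered ** 3) / values.std() ** 3)

# Synthetic right-skewed stand-in for the charges column (not real data)
rng = np.random.default_rng(0)
fake_charges = rng.lognormal(mean=9.0, sigma=0.9, size=5000)

raw_skew = skewness(fake_charges)
log_skew = skewness(np.log10(fake_charges))
print(f"skewness raw: {raw_skew:.2f}, after log10: {log_skew:.2f}")
```

Calling `skewness(df['charges'])` and `skewness(np.log10(df['charges']))` would give the corresponding values for the real data.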


b. Visualizing categorical data with bar plots

- sex
- smoker
- region

In [None]:
plt.figure(figsize=(18,4))
plt.subplot(131)
sns.barplot(x='sex', y='charges', data=df)
plt.subplot(132)
sns.barplot(x='smoker', y='charges', data=df)
plt.subplot(133)
sns.barplot(x='region', y='charges', data=df)
plt.show()

c. Visualizing numerical data with a pair plot
- age
- bmi
- children
- charges

In [None]:
sns.pairplot(df,kind="reg")

In [None]:

# Select only numerical features for correlation analysis
numerical_features = df.select_dtypes(include=['number']).columns
numerical_df = df[numerical_features]

# Plot a heatmap and look at the correlation
sns.heatmap(numerical_df.corr(), cmap='coolwarm', annot=True)

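---Step 5. Label encoding for categorical data---

**Label encoding** transforms category labels into numbers so that the algorithms can operate on them. Below, the two-level columns (sex, smoker) are mapped to 0 and 1, and the four-level region column is expanded into dummy (one-hot) variables.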
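
A minimal sketch of both encodings on made-up values may help. It uses scikit-learn's `LabelEncoder` (imported at the top of the notebook) for a two-level column and `pd.get_dummies` for a multi-level one. The frame `toy` and its rows are assumptions for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up values; not rows from insurance.csv
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "northeast", "southeast", "northeast"],
})

# Label encoding: classes are sorted, so 'no' -> 0 and 'yes' -> 1
encoder = LabelEncoder()
smoker_codes = encoder.fit_transform(toy["smoker"]).tolist()

# One-hot encoding: drop_first removes the first level to avoid a redundant column
dummies = pd.get_dummies(toy[["region"]], columns=["region"], drop_first=True)

print(list(encoder.classes_), smoker_codes)
print(list(dummies.columns))
```

Label encoding suits two-level columns; for region, integer codes would suggest an ordering among regions that does not exist, which is why dummy variables are used there.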



In [None]:
# Let us map the variables with 2 levels to 0 and 1
df['sex']=df['sex'].map({'male':1, 'female':0})
df['smoker']=df['smoker'].map({'yes':1,'no':0})

In [None]:
# Assign dummy variables to the remaining categorical variable, region
# dtype=int keeps the dummy columns as 0/1 integers instead of booleans
df = pd.get_dummies(df, columns=['region'], drop_first=True, dtype=int)
df.head()

# **Part 3 - DEVELOP**
# **Train-Test Split**

In [None]:
from sklearn.model_selection import train_test_split
X = df.drop('charges',axis=1) # Independent variables
y = df['charges'] # Dependent variable

X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=0)
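
Because the call above passes no `test_size`, `train_test_split` falls back to its default of holding out a quarter of the rows. The sketch below checks the resulting sizes on a stand-in array of 1338 row indices. `indices`, `train_idx`, and `test_idx` are names invented for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 1338 rows of the dataset
indices = np.arange(1338)

# Same call pattern as above: default test_size, fixed random_state
train_idx, test_idx = train_test_split(indices, random_state=0)

print("train rows:", len(train_idx), "test rows:", len(test_idx))
print("overlap:", len(set(train_idx) & set(test_idx)))
```

Fixing `random_state` makes the shuffle reproducible, so repeated runs produce the same partition.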

In [None]:
lr = LinearRegression()
lr.fit(X_train,y_train)

In [None]:
y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)
# score() returns the coefficient of determination (R^2) on the test set
print(lr.score(X_test,y_test))

**Now let's add polynomial features and look at the result**

In [None]:
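# Drop the target and the region dummies, keeping age, sex, bmi, children, and smoker
X_poly = df.drop(['charges','region_northwest','region_southeast','region_southwest'], axis = 1)
Y = df.charges

# Expand the features with all degree-2 terms (squares and pairwise products)
quad = PolynomialFeatures(degree = 2)
x_quad = quad.fit_transform(X_poly)

# Separate names keep the earlier X_train/X_test split from being overwritten
X_quad_train, X_quad_test, Y_train, Y_test = train_test_split(x_quad, Y, random_state = 0)

plr = LinearRegression().fit(X_quad_train, Y_train)

Y_train_pred = plr.predict(X_quad_train)
Y_test_pred = plr.predict(X_quad_test)

print(plr.score(X_quad_test, Y_test))  # R^2 on the test set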
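
To see what the degree-2 expansion in the cell above produces, here is a small sketch on a single made-up sample with two features, a = 2 and b = 3. The variable names are assumptions of this example.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One made-up sample with two features: a = 2, b = 3
row = np.array([[2.0, 3.0]])

quad_demo = PolynomialFeatures(degree=2)
expanded = quad_demo.fit_transform(row)

# Column order: bias, a, b, a^2, a*b, b^2
print(expanded)

# With five input columns, degree 2 yields this many output columns
n_out = PolynomialFeatures(degree=2).fit(np.zeros((1, 5))).n_output_features_
print(n_out)
```

With the five columns kept in the cell above (age, sex, bmi, children, smoker), the linear model is fitted on the bias column, the original features, their squares, and every pairwise product.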

# Now let's try a Random Forest

In [None]:
forest = RandomForestRegressor(n_estimators = 100,
                              criterion = 'squared_error', # named 'mse' in scikit-learn versions before 1.0
                              random_state = 1,
                              n_jobs = -1)
forest.fit(X_train,y_train)
forest_train_pred = forest.predict(X_train)
forest_test_pred = forest.predict(X_test)

print('MSE train data: %.3f, MSE test data: %.3f' % (
mean_squared_error(y_train,forest_train_pred),
mean_squared_error(y_test,forest_test_pred)))
print('R2 train data: %.3f, R2 test data: %.3f' % (
r2_score(y_train,forest_train_pred),
r2_score(y_test,forest_test_pred)))
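
For reference, the two metrics printed above can be written out directly. This sketch computes them with NumPy on a few made-up numbers. The helpers `mse` and `r2` are illustrative stand-ins for `mean_squared_error` and `r2_score`, not replacements for them.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared prediction errors."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """1 minus residual sum of squares over total sum of squares."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up values for illustration
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

mse_value = mse(actual, predicted)
r2_value = r2(actual, predicted)
print(f"MSE: {mse_value:.3f}, R2: {r2_value:.3f}")
```

A large gap between the train and test values of either metric is the usual sign that the forest is overfitting.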

In [None]:
plt.figure(figsize=(10,6))

plt.scatter(forest_train_pred,forest_train_pred - y_train,
          c = 'black', marker = 'o', s = 35, alpha = 0.5,
          label = 'Train data')
plt.scatter(forest_test_pred,forest_test_pred - y_test,
          c = 'c', marker = 'o', s = 35, alpha = 0.7,
          label = 'Test data')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.legend(loc = 'upper left')
plt.hlines(y = 0, xmin = 0, xmax = 60000, lw = 2, color = 'red')
plt.show()