
# Data Science Lifecycle
© 2023, Zaka AI, Inc. All Rights Reserved.

---
## Case Study: Insurance Medical Cost Prediction

**Objective:**

In this exercise, you will explore the insurance dataset described below. We will build a model that predicts an individual's cost of treatment from their age, sex, BMI, and other attributes.


## Dataset Description


*   **age**: age of primary beneficiary
*   **sex**: gender of the insurance contractor (female or male)
*   **bmi**: body mass index, an objective measure of whether body weight is high or low relative to height. It is computed as weight divided by height squared (kg / m ^ 2); the ideal range is 18.5 to 24.9
*   **children**: number of children covered by health insurance (number of dependents)
*   **smoker**: whether the beneficiary smokes (yes or no)
*   **region**: the beneficiary's residential area in the US (northeast, southeast, southwest, or northwest)
*   **charges**: Individual medical costs billed by health insurance
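
To make the **bmi** column concrete, here is a minimal sketch of the calculation: weight in kilograms divided by the square of height in metres. The helper name `body_mass_index` and the sample person are invented for illustration and do not come from the dataset.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return BMI in kg/m^2: weight divided by height squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg and 1.75 m tall
bmi = body_mass_index(70.0, 1.75)
print(round(bmi, 1))  # 22.9, inside the ideal 18.5 to 24.9 range
```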

## 1. Data Loading

#### Import necessary Python modules

We will need the following libraries:
 - NumPy: scientific computing (e.g., linear algebra with vectors and matrices).
 - Pandas: high-performance, easy-to-use data reading, manipulation, and analysis.
 - Matplotlib: plotting and visualization.
 - scikit-learn: tools for data mining and machine learning models.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

#### Load the data

You can find the dataset [here](https://github.com/zaka-ai/medical-cost-prediction).

This is a GitHub repo, where you can store many types of files. Cloning it with `!git clone [link to repo]` copies those files to your virtual disk. From there, we change the working directory to the folder that contains the dataset we want to work on.

In [None]:
# clone git repo
!git clone https://github.com/zaka-ai/medical-cost-prediction

# change working directory
%cd medical-cost-prediction/data/

#### Read & visualize data
The data is now stored on disk in a CSV (comma-separated values) file. To load it into our code, we use the **pandas** module, more specifically its **read_csv** function.

In [None]:
# read CSV file in Pandas
data = pd._____

# display first 10 rows
data._____
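
For reference, the loading pattern can be tried on a tiny in-memory CSV. This is only a sketch: the three rows below are invented stand-ins, and `io.StringIO` replaces the real file path so the snippet runs anywhere.

```python
import io

import pandas as pd

# Invented rows that mimic the layout of the dataset
csv_text = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.92
33,male,22.7,0,no,northwest,21984.47
45,female,30.1,2,no,southeast,8240.59
"""

# read_csv accepts a file path or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows (fewer if the frame is shorter)
print(toy.head(2))
```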

## 2. Exploratory Data Analysis

Let's dig deeper and understand our data.

**Task:** How many rows and columns are in our dataset?

In [None]:
# get the number of rows and columns
rows = data._____
columns = data._____

print('There are {} rows and {} columns.'.format(rows,columns))
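
As a small illustration on made-up data, a DataFrame's `shape` attribute is a `(rows, columns)` tuple that can be unpacked directly:

```python
import pandas as pd

# Toy frame with 4 rows and 2 columns (values are invented)
toy = pd.DataFrame({"age": [19, 33, 45, 52], "bmi": [27.9, 22.7, 30.1, 26.4]})

n_rows, n_cols = toy.shape
print(n_rows, n_cols)  # 4 2
```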

Using the function **info()**, we can check:
 - data types (int, float, or object (e.g., string))
 - missing values
 - memory usage
 - number of rows and columns

In [None]:
data.info()

Using the function **describe()**, we can check the count, mean, standard deviation, minimum, quartiles, and maximum of each numerical feature (column).

In [None]:
data.describe()
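
The sketch below runs both summaries on an invented toy frame. It also shows that `describe()` skips non-numerical columns by default, which is why a text column such as `smoker` does not appear in its output.

```python
import pandas as pd

# Invented values; "smoker" is a text column on purpose
toy = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "smoker": ["yes", "no", "no", "yes"],
    "charges": [1000.0, 2000.0, 3000.0, 4000.0],
})

toy.info()                 # dtypes, non-null counts, memory usage
summary = toy.describe()   # numerical columns only by default
print(summary.loc[["mean", "min", "max"]])
```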

#### Distribution of charges

First, let's look at the distribution of charges. This shows the range of treatment costs and how much patients typically spend.

In [None]:
# plot the histogram of the charges
data["charges"]._______
plt.title("Distribution of charges")
plt.xlabel("Charges")
plt.ylabel("Frequency")
plt.show()
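
To see what a histogram actually counts, the sketch below bins a handful of invented charges with `numpy.histogram`. It also compares the mean with the median: when a few very large bills stretch the right tail, the mean can sit far above what a typical patient pays.

```python
import numpy as np

# Invented charges with a long right tail
charges = np.array([1200.0, 1500.0, 1800.0, 2500.0, 3200.0, 9000.0, 15000.0, 42000.0])

# Four equal-width bins between the minimum and the maximum
counts, edges = np.histogram(charges, bins=4)
print(counts)  # [6 1 0 1]
print(edges)   # bin boundaries: 1200, 11400, 21600, 31800, 42000

print(charges.mean())      # 9525.0
print(np.median(charges))  # 2850.0
```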

#### Correlation between smoking and cost of treatment

Let's see if smokers spend more or less on treatment than non-smokers!

First, let's see how many smokers vs non-smokers we have.

In [None]:
# select smokers


# select non smokers


# print the number of smokers and non-smokers
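
One common way to select rows in pandas is a boolean mask. The sketch below uses an invented toy frame, so the counts it prints say nothing about the real dataset.

```python
import pandas as pd

# Invented values for illustration
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes", "no"],
    "charges": [30000.0, 4000.0, 6000.0, 25000.0, 5000.0],
})

# The comparison yields a True/False Series; indexing with it keeps the True rows
smokers = toy[toy["smoker"] == "yes"]
non_smokers = toy[toy["smoker"] == "no"]

print(len(smokers), len(non_smokers))  # 2 3
print(toy["smoker"].value_counts())    # the same counts in one call
```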


Now let's plot the charges for both.

In [None]:
# create the figure
fig = plt.figure(figsize=(12,5))

# add first sub plot for smokers
ax = fig.add_subplot(121)
# draw distribution of charges for smokers
ax.hist______
# set sub plot title
ax.set_title('Distribution of charges for smokers')

# add second sub plot for non smokers
ax = fig.add_subplot(122)
# draw distribution of charges for non-smokers
ax._______
# set sub plot title
ax.set_title('Distribution of charges for non-smokers')
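
A side-by-side comparison is easier to read when both panels share the same x-axis. The sketch below shows that variation with `plt.subplots(..., sharex=True)` on randomly generated numbers; the group names and distributions are invented, and the `Agg` backend line is only needed when running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line in Colab or Jupyter
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
group_a = rng.normal(30000, 8000, size=200)  # invented "high cost" group
group_b = rng.normal(8000, 4000, size=800)   # invented "low cost" group

# One row, two columns, with a shared x-axis so the panels are comparable
fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharex=True)
axes[0].hist(group_a, bins=20, color="r")
axes[0].set_title("Group A")
axes[1].hist(group_b, bins=20, color="b")
axes[1].set_title("Group B")
```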

Smokers tend to spend more on treatment than non-smokers.

#### Correlation between age and cost of treatment

First, let's look at the distribution of age in our dataset, and then look at how age affects the cost of treatment.

In [None]:
# plot histogram for age distribution
plt._______
plt.title("Age distribution")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()

Let's plot the relationship between age and cost of treatment.

In [None]:
# draw a scatter plot to show correlation between age and charges, color = 'b'
plt._______
plt.title("Cost of treatment for different ages")
plt.xlabel("Age")
plt.ylabel("Charges")
plt.show()

Let's check if smoking also affects this curve.

**Task**: Draw a scatter plot of age against cost of treatment, showing smokers in red and non-smokers in blue.

In [None]:
# Exercise
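
One possible pattern for colouring a scatter plot by group is to call `scatter` once per group, which also gives a legend for free. The sketch uses an invented toy frame; the `Agg` backend line is only needed when running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line in Colab or Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Invented values for illustration
toy = pd.DataFrame({
    "age": [20, 35, 50, 25, 40, 60],
    "smoker": ["yes", "yes", "yes", "no", "no", "no"],
    "charges": [18000.0, 25000.0, 40000.0, 2500.0, 6000.0, 13000.0],
})

fig, ax = plt.subplots()
# One scatter call per group, each with its own colour and legend label
for value, color in {"yes": "r", "no": "b"}.items():
    subset = toy[toy["smoker"] == value]
    ax.scatter(subset["age"], subset["charges"], color=color, label=f"smoker = {value}")
ax.set_xlabel("Age")
ax.set_ylabel("Charges")
ax.legend()
```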


#### Correlation between BMI and cost of treatment

**Body Mass Index**

![alt text](https://4.bp.blogspot.com/-nBF9Z1tFGhI/W3MqbdD0j7I/AAAAAAAAAIs/UdyXTIxsBT8Pl8usABxEK_Fusj5S0SnBQCLcBGAs/s1600/HOW%2BTO%2BCALCULATE%2BBODY%2BMASS%2BINDEX%2BBMI.jpg)

**BMI Chart**

![BMI char](https://images.squarespace-cdn.com/content/v1/56fae4be1d07c0c393d8faa5/1551103826935-HCXS8U78500C06GQ1PLJ/ke17ZwdGBToddI8pDm48kNMeyc_nGAbaGjp3EBJ2o08UqsxRUqqbr1mOJYKfIPR7LoDQ9mXPOjoJoqy81S2I8N_N4V1vUb5AoIIIbLZhVYxCRW4BPu10St3TBAUQYVKckzCNDuUMr1wTvf7-fqd2hrX5O2-_PoO3UJ2jNU1VzJbe6G9-F0r9BTATNUu-NBMy/BMI+Chart.jpg)

First, let's look at the distribution of BMI in our dataset, and then look at how it affects the cost of treatment.

In [None]:
# draw a histogram to show the distribution of BMI
data["bmi"]._____
plt.title("BMI distribution")
plt.xlabel("BMI")
plt.ylabel("Frequency")
plt.show()

According to the chart above, obesity starts at BMI = 30. Let's investigate the impact of BMI on cost of treatment.

In [None]:
# select obese people
obese = _____
# select overweight people
overweight = _____
# select healthy people
healthy = _____

print('There are {} obese, {} overweight and {} healthy individuals.'.format(obese.shape[0], overweight.shape[0], healthy.shape[0]))
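
As an alternative to writing one filter per group, `pd.cut` can label every row in a single call. This is a sketch on invented BMI values, not the notebook's intended solution. The 18.5 and 30 cut-offs come from the text above, while the 25 boundary between healthy and overweight is assumed from the standard BMI categories.

```python
import numpy as np
import pandas as pd

# Invented BMI values, including some that sit exactly on a boundary
bmi = pd.Series([17.0, 22.0, 24.9, 25.0, 29.9, 30.0, 41.0])

bins = [0, 18.5, 25, 30, np.inf]
labels = ["underweight", "healthy", "overweight", "obese"]

# right=False makes each bin include its left edge, e.g. [25, 30) and [30, inf)
category = pd.cut(bmi, bins=bins, labels=labels, right=False)
print(category.value_counts().sort_index())
```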

To compare, let's plot the distribution of charges for all three groups (obese, overweight, and healthy) in one plot, showing `obese` data in red, `overweight` data in yellow, and `healthy` data in green.

In [None]:
plt._____
plt._____
plt._____
plt.title("Charges distribution")
plt.xlabel("Charges")
plt.ylabel("Frequency")
plt.show()

Patients with BMI above 30 spend more on treatment!

## 3. Data Preprocessing
"Garbage in, garbage out".

Data should be cleaned and preprocessed before modeling to remove noise.
Preprocessing includes:
 - dealing with missing data
   - remove whole rows (if there are not many of them)
   - infer the value from other columns (e.g., age from date of birth)
   - fill with the mean, the median, or even 0
 - removing unused column(s)
 - converting categorical (non-numerical) data into numerical data
 - normalization: standardize the range of every feature (e.g., between 0 and 1)
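
Two of the missing-data options listed above can be compared on a toy frame. The values are invented; the sketch only contrasts dropping incomplete rows with filling gaps by the column median.

```python
import numpy as np
import pandas as pd

# Invented values with one gap in each column
toy = pd.DataFrame({
    "age": [25, 40, np.nan, 60],
    "bmi": [22.0, np.nan, 31.0, 27.0],
})

print(toy.isnull().sum())  # one missing value per column

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: fill each gap with the median of its own column
filled = toy.fillna(toy.median())

print(len(dropped))  # 2 rows survive
print(filled)
```

Dropping is simple but discards half of this toy frame, which is why it is usually reserved for cases where few rows are affected.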



---



 Let's start by removing missing data.

In [None]:
# print how many missing values are in each column
data.isnull().sum()

In [None]:
# drop rows with missing values
data = data._____
data.isnull().sum()

In [None]:
data.info()

#### Remove unused columns

Let's remove the `region` column, since we will not use it as a feature.

In [None]:
# dropping the region feature
data._____
data.head()
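
One detail worth knowing: `DataFrame.drop` returns a new frame by default and leaves the original untouched, so the result has to be assigned back (or `inplace=True` passed) for the change to stick. A sketch on invented data:

```python
import pandas as pd

# Invented values for illustration
toy = pd.DataFrame({
    "age": [19, 33],
    "region": ["southwest", "northwest"],
    "charges": [16884.92, 21984.47],
})

# drop returns a copy without the column; toy itself is unchanged
trimmed = toy.drop(columns=["region"])

print(list(toy.columns))      # ['age', 'region', 'charges']
print(list(trimmed.columns))  # ['age', 'charges']
```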

#### Convert Categorical columns to numerical

*   We need to convert the sex column from male/female to 0/1.
*   We need to convert the smoker column from no/yes to 0/1.


Let's start with the sex column



In [None]:
# define dictionary
gender = {'male':0, 'female':1}
# replace sex column with 0/1
data['sex'] = ______
# print head to verify
data.head()

And now the smoker column.

In [None]:
# define dictionary
smokers = {'no':0, 'yes':1}
# replace smoker column with 0/1
data['smoker'] = _____
# print head to verify
data.head()
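
A dictionary lookup like this can be applied to a whole column with `Series.map`. The sketch below uses invented rows and also shows a pitfall: any value missing from the dictionary becomes `NaN`, so it is worth checking for unexpected categories first.

```python
import pandas as pd

# Invented values for illustration
toy = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "smoker": ["no", "yes", "no"],
})

# map looks up every value of the column in the dictionary
toy["sex"] = toy["sex"].map({"male": 0, "female": 1})
toy["smoker"] = toy["smoker"].map({"no": 0, "yes": 1})
print(toy)

# A value that is not a key in the dictionary maps to NaN
unmapped = pd.Series(["maybe"]).map({"no": 0, "yes": 1})
print(unmapped.isna().all())  # True
```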

#### Normalization

Let's scale every column by dividing it by its maximum value. Since none of our columns contain negative numbers, every value ends up between 0 and 1.

In [None]:
# get the max of each column
data_max = _____
data_max

In [None]:
# divide each column by its maximum value
data = data.____
data.describe()
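
The sketch below shows max-scaling on an invented toy frame. Dividing a DataFrame by a Series aligns on column names, so each column is divided by its own maximum; keeping the maxima around lets us undo the scaling later.

```python
import pandas as pd

# Invented values for illustration
toy = pd.DataFrame({
    "age": [20, 40, 60],
    "bmi": [20.0, 30.0, 40.0],
    "charges": [1000.0, 5000.0, 10000.0],
})

col_max = toy.max()        # one maximum per column
scaled = toy / col_max     # column-wise division, aligned by column name
print(scaled)

# Multiplying by the same maxima restores the original units
restored = scaled * col_max
```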

## 4. Model Training & Testing



#### Data split

Before training, we need to split the data into training (80%) and testing (20%) sets.

In [None]:
# store all columns except the last one as inputs in X
X = data.iloc[:,0:-1].values
# store the last column as the output (label) in y
y = data.iloc[:,-1].values

# split the dataset in an 80/20 split
x_train, x_test, y_train, y_test = ______

print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)
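
Here is a sketch of an 80/20 split on toy arrays. The `random_state` value is an arbitrary choice made here so the shuffle is reproducible; the notebook does not prescribe one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, y_train.shape)  # (8, 2) (8,)
print(X_test.shape, y_test.shape)    # (2, 2) (2,)
```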

#### Linear Regression Modeling


In [None]:
# define our regression model
model = ______

# train our model
model.______
print('Model trained!')

#### Evaluation

In [None]:
print('Model score {}'.format(model.______))
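
For a regressor such as `LinearRegression`, `score` returns the coefficient of determination R^2, where 1.0 means a perfect fit. The sketch below fits synthetic, noise-free data, so the score and the recovered weights are close to exact; real data will score lower.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic inputs and a noise-free linear target: y = 3*x0 + 2*x1 + 1
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 2))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 1.0

model = LinearRegression().fit(X, y)

r2 = model.score(X, y)   # R^2 on the given data
print(r2)
print(model.coef_, model.intercept_)
```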

#### Prediction
Let's have some fun with some predictions.

In [None]:
# use the linear model to predict


#### BONUS: Feature Importance (weights)


In [None]:
columns_names = data.columns[0:-1].values
features_importance = model.coef_
plt.barh(columns_names, features_importance)
plt.title('Features Importance')
plt.xlabel('importance')
plt.ylabel('feature')
plt.show()