# U.S. Medical Insurance Costs

## Dataset Description
### Context
Machine Learning with R by Brett Lantz is a book that introduces machine learning using R. As far as I can tell, Packt Publishing does not make its datasets available online unless you buy the book and create a user account, which can be a problem if you are checking the book out from the library or borrowing it from a friend. All of these datasets are in the public domain; they simply needed some cleaning up and recoding to match the format in the book.

### Content
- age: age of the primary beneficiary
- sex: gender of the insurance contractor (female, male)
- bmi: body mass index (kg / m^2), an objective index of body weight computed as weight divided by height squared; it indicates whether a person's weight is relatively high or low for their height. The ideal range is 18.5 to 24.9
- children: number of children covered by health insurance / number of dependents
- smoker: whether the beneficiary smokes (yes, no)
- region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)
- charges: individual medical costs billed by health insurance

### Why I want to do this project
The goal of this project is to:
1. practice the data analytics process and Python
2. learn more about the insurance client information (e.g., how different attributes relate to each other) and turn it into further questions and useful insights

In [1]:
import pandas as pd 
import numpy as np

In [2]:
df = pd.read_csv("insurance.csv")

In [3]:
df.head(5)

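|   | age | sex | bmi | children | smoker | region | charges |
|---|-----|-----|-----|----------|--------|--------|---------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |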


In [4]:
# List the data type of every column and check for missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
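
The non-null counts above already suggest there are no missing values. As a minimal sketch of a more explicit check, the cell below counts missing entries per column. It assumes only the first five rows shown by `df.head(5)`, rebuilt by hand so the cell runs without `insurance.csv`; on the real data the same call would be `df.isna().sum()`.

```python
import pandas as pd

# First five rows of the dataset, copied from the df.head(5) output
# (a stand-in so this cell runs without insurance.csv).
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Count missing entries in each column; all zeros means nothing is missing.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```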


##### After looking at the data, I have the following questions:
- What is the average age of the primary beneficiaries?
- What is the average age of primary beneficiaries who have at least one child covered?
- Where are the majority of the individuals from?
- How do medical charges differ for smokers?
- How do medical charges differ across regions?
- What is the most important factor in determining how much is charged?

In [5]:
np.mean(df['age'])

39.20702541106129

The mean age is about 39, which suggests that, at least in this sample, demand for medical insurance comes mainly from middle-aged people.

In [6]:
# Mean age of those who have no children
np.mean(df[df['children']==0]["age"])

38.444250871080136

In [7]:
# Mean age of those who have at least one child
np.mean(df[df['children']>0]["age"])

39.78010471204188

- Just as intuition suggests, clients who have no children are slightly younger than those who have children (about 38.4 versus 39.8 on average).
- But the difference is not large.
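
The two filtered means above can also be computed in one step with `groupby`. The cell below is a minimal sketch that assumes only the first five rows shown by `df.head(5)`, so the numbers it prints are not the full-dataset means; the group labels are my own wording.

```python
import pandas as pd

# age and children for the first five rows, copied from the df.head(5) output.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "children": [0, 1, 3, 0, 0],
})

# Label each row by whether at least one child is covered.
group = sample["children"].gt(0).map(
    {True: "at least one child", False: "no children"}
)

# Mean age within each group, computed in a single call.
mean_age = sample.groupby(group)["age"].mean()
print(mean_age)
```

On the full data, replacing `sample` with `df` should reproduce the two means computed in cells 6 and 7.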

In [8]:
df["region"].value_counts()

southeast    364
southwest    325
northwest    325
northeast    324
Name: region, dtype: int64

- The southeast has the most individuals (364 of 1,338), although that is a plurality rather than a majority
- The other regions have similar numbers of clients (324 to 325 each)

In [9]:
# Let's check how many smokers there are
df['smoker'].value_counts()

no     1064
yes     274
Name: smoker, dtype: int64

In [10]:
# Mean charge for smokers
np.mean(df[df['smoker']=="yes"]["charges"])

32050.23183153285

In [11]:
# Mean charge for non-smokers
np.mean(df[df['smoker']=="no"]["charges"])

8434.268297856199

In [12]:
32050.23183153285/ 8434.268297856199

3.8000014582983206

- Just as expected, clients who smoke are charged a lot more than those who do not (the mean charge is about 3.8 times as high)
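
Instead of copying the two means into a division by hand, the ratio can be derived directly from a grouped result. This is a minimal sketch that assumes only the first five rows shown by `df.head(5)`, which contain a single smoker, so the ratio it prints is not the 3.8 found on the full data.

```python
import pandas as pd

# smoker and charges for the first five rows, copied from the df.head(5) output.
sample = pd.DataFrame({
    "smoker": ["yes", "no", "no", "no", "no"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Mean charge for each smoker status.
mean_by_smoker = sample.groupby("smoker")["charges"].mean()

# How many times larger the smokers' mean charge is than the non-smokers'.
ratio = mean_by_smoker["yes"] / mean_by_smoker["no"]
print(mean_by_smoker)
print(f"ratio: {ratio:.2f}")
```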

In [13]:
# Mean charge in the southeast
np.mean(df[df['region']=="southeast"]["charges"])

14735.411437609895

In [14]:
# Mean charge in the southwest
np.mean(df[df['region']=="southwest"]["charges"])

12346.93737729231

In [15]:
# Mean charge in the northwest
np.mean(df[df['region']=="northwest"]["charges"])

12417.575373969228

In [16]:
# Mean charge in the northeast
np.mean(df[df['region']=="northeast"]["charges"])

13406.3845163858

- In the southeast of the US, the mean charge is somewhat higher than in the other regions (about 14,735 versus 12,347 to 13,406)
- For a deeper exploration, we might ask why that is the case
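
The four per-region cells can be collapsed into one grouped summary that also shows how many rows each mean is based on. The cell below is a minimal sketch assuming only the first five rows shown by `df.head(5)`, in which the northeast does not appear, so its output differs from the full-dataset figures above.

```python
import pandas as pd

# region and charges for the first five rows, copied from the df.head(5) output.
sample = pd.DataFrame({
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Mean charge and row count per region, highest mean first.
by_region = (
    sample.groupby("region")["charges"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(by_region)
```

Keeping the count next to the mean makes it easier to judge how much weight each regional average deserves.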

For the last question, we will fit a random forest regression and use its feature importance scores.

In [17]:
from sklearn.ensemble import RandomForestRegressor 

In [18]:
# define the model
model = RandomForestRegressor()

In [19]:
# separate the independent variables (X) from the dependent variable (y)
X = df.drop(columns="charges")
y = df["charges"]

In [20]:
# to use categorical variables in the ML model,
# we first need to convert them into binary (dummy) variables
for col in df.columns:
    if df[col].dtypes == "O":
        newCol = pd.get_dummies(df[col], prefix = col, drop_first = True)
        X.drop(columns=col, inplace = True)
        X = pd.concat([X, newCol], axis = 1)
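
As far as I can tell, the loop above can be replaced by a single `pd.get_dummies` call on the whole feature frame, which encodes every text column and leaves numeric columns untouched. The cell below is a minimal sketch assuming only the first five rows shown by `df.head(5)`. Because the northeast does not appear in those rows, `drop_first=True` drops `northwest` here, whereas on the full data it drops `northeast`.

```python
import pandas as pd

# Feature columns for the first five rows, copied from the df.head(5) output.
features = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
})

# Encode all text columns at once; drop_first removes one level per
# column so the remaining dummies are not redundant.
encoded = pd.get_dummies(features, drop_first=True)
print(list(encoded.columns))
```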

In [21]:
# fit the model 
model.fit(X, y)

RandomForestRegressor()

In [22]:
# get importance
importance = model.feature_importances_

In [23]:
# summarize feature importance
# sort the feature by importance score
sortedScore = sorted(zip(importance, X.columns), key = lambda t: t[0], reverse=True)
for score, col in sortedScore :
    print(f"{col}:{score:.2f}")

smoker_yes:0.62
bmi:0.21
age:0.13
children:0.02
region_northwest:0.01
sex_male:0.01
region_southeast:0.01
region_southwest:0.00


Observations from the feature importance analysis:
- Whether the client is a smoker is by far the most important feature for predicting charges (0.62)
- bmi and age also matter, but the number of children covered, sex, and region contribute very little
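
Because `region` is spread over three dummy columns, its importance is split across them. One way to compare original features is to add the dummy scores back together. The cell below is a minimal sketch that uses the rounded scores printed above, not a refit model; `original_feature` is a hypothetical helper name. Since the forest was fit without a fixed `random_state`, a rerun may give slightly different scores.

```python
import pandas as pd

# Rounded importance scores copied from the printed output above.
scores = pd.Series({
    "smoker_yes": 0.62, "bmi": 0.21, "age": 0.13, "children": 0.02,
    "region_northwest": 0.01, "sex_male": 0.01,
    "region_southeast": 0.01, "region_southwest": 0.00,
})

# Columns that were one-hot encoded before fitting.
categorical = ["sex", "smoker", "region"]


def original_feature(column: str) -> str:
    """Map a dummy column such as 'region_southeast' back to 'region'."""
    for name in categorical:
        if column.startswith(name + "_"):
            return name
    return column


# Sum the scores of all dummy columns that belong to the same feature.
grouped = scores.groupby(original_feature).sum().sort_values(ascending=False)
print(grouped)
```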

##### Potential bias in this dataset
- There may be outliers in the dataset (unexpectedly high charges); we should explore why they occur so that we can give a more accurate explanation
- Our dataset may suffer from sampling bias, since the sample may not be large enough to be representative
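
One common way to start looking for unexpectedly high charges is the interquartile-range (IQR) rule. This was not applied in the analysis above; the cell below is only a sketch. The function name `flag_high_charges` and the multiplier `k=1.5` are my own choices, and the demo uses just the five charges shown by `df.head(5)`.

```python
import pandas as pd


def flag_high_charges(charges: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a charge lies above Q3 + k * IQR."""
    q1 = charges.quantile(0.25)
    q3 = charges.quantile(0.75)
    upper = q3 + k * (q3 - q1)
    return charges > upper


# charges for the first five rows, copied from the df.head(5) output.
charges = pd.Series([16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552])

flags = flag_high_charges(charges)
print(charges[flags])  # rows flagged as unusually high, if any
```

On the full data the call would be `flag_high_charges(df["charges"])`; any flagged rows are candidates for a closer look rather than rows to delete.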