# U.S. Medical Insurance Costs

Goal: The goal of this project is to optimize the insurance cost in my future. This project will help me decide where to live if I want to have children and what my bmi should be (regarding bmi I will try to inspect if lower means alwas better). 

Data: For this project I am assuming that insurance is provided by one company. Also In the main part I won't focus on sex and smoking since I want change my sex and I will never smoke. So I will filter them out

## Preparing Data
Firstly I have imported pandas library and insurance.csv

In [80]:
import csv
import pandas as pd

starting_df = pd.read_csv("insurance.csv")
test = 1111



Now I want to check if there are any rows with Null or other "bad" values.

In [81]:

def check_for_nulls (df):
    df["nulls"] = df.apply(lambda row: row.isnull().any(), axis=1)
    return df
test_null = check_for_nulls(starting_df)

print(test_null["nulls"].unique())



[False]


In [82]:
starting_df.drop(columns=['nulls'], inplace=True)
print(starting_df.columns)

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')


## Minimizing My Insurance Cost: 

In [83]:
#Jakub's Code

## Evaluating Cost-BMI relationship:
### Steps Brainstorm:
 - What is the average cost of insurance. the ave bmi?
### Approach Breakdown
 - Add column that divides bmi into medical catagories: obese, etc
 - Find Mean and Medium Insurance cost within each bmi catagory
 - Loop through each peron in each bmi catagory, and find the mean and median cost paid by people of same/similar age, sex, region, children, smoking status (everything but bmi) and compare that to the mean and median cost at this bmi catagory. record the difference to see if there are any trends where people in a certain bmi catagory typically pay more or less than other of their same age,sec etc.
 

### 1. Aggregate BMI catagories
Add column with bmi catagories (source: [cdc.gov](https://www.cdc.gov/bmi/adult-calculator/bmi-categories.html))
- Underweight: Less than 18.5
- Healthy Weight: 18.5 to less than 25
- Overweight: 25 to less than 30
- Obesity: 30 or greater
    - Class 1 Obesity: 30 to less than 35
    - Class 2 Obesity: 35 to less than 40
    - Class 3 (Severe) Obesity: 40 or greater

*These catagories have been prefixed with letters to easily sort them in the dataframes by lowest to highest bmi*

In [84]:
#Set pandas to display more columns per line
pd.set_option('display.width',2000)

#function to sort into bmi catagory
def bmi_catagory_sorter(bmi):
    if bmi >= 40:
        return 'f_class_3_obese'
    elif  bmi >= 35:
        return 'e_class_2_obese'
    elif  bmi >= 30:
        return 'd_class_1_obese'
    elif  bmi >= 25:
        return 'c_overweight'
    elif  bmi >= 18.5:
        return 'b_healthy_weight'
    elif  bmi < 18.5:
        return 'a_underweight'
    else:
        return 'err: input should be a number'

#Add columns with Adult BMI Category
bmi_analysis_df = starting_df
bmi_analysis_df['bmi_catagory'] = starting_df.bmi.apply(bmi_catagory_sorter)
print('Sample of df with new column added with bmi catagories:')
print(bmi_analysis_df.head())

aggregate_by_bmi_catagory = bmi_analysis_df.groupby('bmi_catagory').agg({'charges':['mean','count']})
print('\n Grouping data by bmi catagory here are the average insurance costs and counts of each catagory:')
print(aggregate_by_bmi_catagory)

Sample of df with new column added with bmi catagories:
   age     sex     bmi  children smoker     region      charges      bmi_catagory
0   19  female  27.900         0    yes  southwest  16884.92400      c_overweight
1   18    male  33.770         1     no  southeast   1725.55230   d_class_1_obese
2   28    male  33.000         3     no  southeast   4449.46200   d_class_1_obese
3   33    male  22.705         0     no  northwest  21984.47061  b_healthy_weight
4   32    male  28.880         0     no  northwest   3866.85520      c_overweight

 Grouping data by bmi catagory here are the average insurance costs and counts of each catagory:
                       charges      
                          mean count
bmi_catagory                        
a_underweight      8852.200585    20
b_healthy_weight  10409.337709   225
c_overweight      10987.509891   386
d_class_1_obese   14419.674970   391
e_class_2_obese   17022.258883   225
f_class_3_obese   16784.615546    91


### 2. Compare individuals insurance cost to the average cost of similar people
Defining **similar people**
 - age is +/- 2 years
 - sex, region, number of children, and smoking status are all the same
 - bmi is the the variable. We are checking similar people across all  bmi values

Using this definition of **similar people** we calculate the average cost of insurance and compare it to each individual's cost with this formula:
    
### 100*$\frac{*indidualCost* - *avgCost*}{*avgCost*}$
We also record the sample size of **similar people**. With these two new columns we now group by BMI Catagory again identify and coorilation between
BMI Catagroy and people paying more or less then the average cost of similar people across all BMI values.

In [85]:
#Define Functions
#This function gets the count and average of insurance cost of each patiene with the imput age +/- 2 years, sex, children, smoker, region
# - Essentially the average cost of a simlar people across all bmis and count to record the sample size
def get_avg_with_bmi_variable(i_age,i_sex,i_children,i_smoker,i_region):
    return bmi_analysis_df[((bmi_analysis_df.age <= i_age + 2) | (bmi_analysis_df.age >= i_age - 2) )
                            & (bmi_analysis_df.sex == i_sex)
                            & (bmi_analysis_df.children == i_children) 
                            & (bmi_analysis_df.smoker == i_smoker) 
                            & (bmi_analysis_df.region == i_region)].agg({'charges': ['mean', 'count']})

#This function will run on all rows of bmi_analysis_df grouped by bmi_catagory 
# finding the difference between that individual's charges and the average charge of similar people calculated by the previous function
# it returns: [percent diffence in cost from the avg, sample size of average]
def get_percent_dif_from_avg(row):
    #Pull data from row
    i_charges = row['charges']
    i_age = row['age']
    i_sex = row['sex']
    i_children = row['children']
    i_smoker = row['smoker']
    i_region = row['region']
    #Use data to find all people of similar age and same: sex, number of children, smoker status and region
    # Then get the count and average insurance cost of these people 
    avg_cost_of_similar_people = get_avg_with_bmi_variable(i_age,i_sex,i_children,i_smoker,i_region).iloc[0].charges
    similar_people_smpl_size = int(get_avg_with_bmi_variable(i_age,i_sex,i_children,i_smoker,i_region).iloc[1].charges)
    #Calculate the percent difference this person pays to insurce compared to the average of similar people
    percent_diff_from_avg = round(100*((i_charges - avg_cost_of_similar_people)/avg_cost_of_similar_people),2)
    return [percent_diff_from_avg,similar_people_smpl_size]

In [86]:
#Create two new columns: 
# - perenct difference in cost each person compared to the average cost of similar people
# - sample size of similar people average
bmi_analysis_df[['percent_cost_diff_from_similar_ppl_avg','similar_ppl_smpl_size']] = bmi_analysis_df.apply(get_percent_dif_from_avg,axis=1,result_type='expand')
# Print sample of df with new columns
print(bmi_analysis_df.head())


   age     sex     bmi  children smoker     region      charges      bmi_catagory  percent_cost_diff_from_similar_ppl_avg  similar_ppl_smpl_size
0   19  female  27.900         0    yes  southwest  16884.92400      c_overweight                                  -38.44                   10.0
1   18    male  33.770         1     no  southeast   1725.55230   d_class_1_obese                                  -78.89                   32.0
2   28    male  33.000         3     no  southeast   4449.46200   d_class_1_obese                                  -56.76                   12.0
3   33    male  22.705         0     no  northwest  21984.47061  b_healthy_weight                                  192.02                   53.0
4   32    male  28.880         0     no  northwest   3866.85520      c_overweight                                  -48.64                   53.0


In [90]:
#Group df by bmi_catagory and calculate average percent_cost_diff_from_similar_ppl and average similar people sample size
percent_cost_diff_by_bmi_catagory = bmi_analysis_df.groupby('bmi_catagory').agg({'percent_cost_diff_from_similar_ppl_avg':'mean', 'similar_ppl_smpl_size':'mean'})
#cost_diff_by_bmi_catagory = bmi_analysis_df.groupby('bmi_catagory').cost_diff_from_avg.mean()

print('Step 2. Results: Grouped by BMI Catagory \n - Percent difference in insurance cost from average cost of similar people\n - Similar people sample size\n')
print(percent_cost_diff_by_bmi_catagory)

print('\n\nStep 1. Results: Grouped by BMI Catagory\n - Average Insurance cost\n - Number of people\n')
print(aggregate_by_bmi_catagory)

Step 2. Results: Grouped by BMI Catagory 
 - Percent difference in insurance cost from average cost of similar people
 - Similar people sample size

                  percent_cost_diff_from_similar_ppl_avg  similar_ppl_smpl_size
bmi_catagory                                                                   
a_underweight                                 -30.542500              35.700000
b_healthy_weight                              -14.321733              32.368889
c_overweight                                   -5.991736              33.932642
d_class_1_obese                                 3.437801              32.989770
e_class_2_obese                                18.439511              33.888889
f_class_3_obese                                 7.177363              34.318681


Step 1. Results: Grouped by BMI Catagory
 - Average Insurance cost
 - Number of people

                       charges      
                          mean count
bmi_catagory                        
a_underwei

## Reflections on BMI Analysis:
### Coorilations
In both Step 1. and 2. of this analysis we find **positive coorilation between BMI and Insurance costs**. Across 5 of the 6 catagories with see the trend of increased insurance charges with increased BMI

The Class 2 Obesity to Class 3 Obesity Exception:
The one exeption is the trend from Class 2 Obesity to Class 3 Obesity. In this catagory we see a decrease in both average insurance cost (Step 1) and average percent cost difference of simlar people (Step 2). 

In Step 1, the sample size is considarably lower for Class 3 Obsity than all but the Underweight catagory. In addition the Average Cost is only slightly lower than teh Average Cost of Class 2 Obesity. 
These two facts combine would seem to accound for the lapse in positive corrilation.

In Step 2 however the average sample size is similar across all bmi catagories and the drop in Cost Percent Difference from Class 2 to Class 3 is quite dramatic. Therefore it could be warrented to further explore
this sample to identify any other trends across the Class 2 and Class 3 Catagories that might contribute to this outlier in the positive coorilation.

### Implications
BMI is a Weight/Height ratio that was created with a goal of assessing the fat to body mass propotions. This goal is made clear by the names of the catagories given to different level of BMI. Each name clearly implies
that fat levels and health of related to a person BMI. However BMI does not factor in, muscle mass, bone mass, or body proportion. Without accounting for these factors BMI is not a accura with to assess a person's fat levels
or health with any acceptable level of accuratcy.

Therefore the positive coorilation between BMI and 

### Potential Bias
### Future Analysis