# U.S. Medical Insurance Costs
### This project attempts to draw some insights from the medical insurance data set.
### The goal is to identify areas to focus resources to improve certain subgroup members' health, and ultimately lower their insurance costs.

#### As a project by a beginning Python and data science student, it uses only the basic modules and techniques covered so far in the Codecademy course. Graphs would significantly improve the readability of this project, particularly in the "More in depth analysis" section.

# Import modules

In [2]:
import csv
import statistics

# Import data and store in a list of dictionaries

In [5]:
with open('insurance.csv') as insurance_csv:
    reader = csv.DictReader(insurance_csv)
    insuranceList = list(reader)
print(insuranceList[0:10])

[{'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0', 'smoker': 'yes', 'region': 'southwest', 'charges': '16884.924'}, {'age': '18', 'sex': 'male', 'bmi': '33.77', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '1725.5523'}, {'age': '28', 'sex': 'male', 'bmi': '33', 'children': '3', 'smoker': 'no', 'region': 'southeast', 'charges': '4449.462'}, {'age': '33', 'sex': 'male', 'bmi': '22.705', 'children': '0', 'smoker': 'no', 'region': 'northwest', 'charges': '21984.47061'}, {'age': '32', 'sex': 'male', 'bmi': '28.88', 'children': '0', 'smoker': 'no', 'region': 'northwest', 'charges': '3866.8552'}, {'age': '31', 'sex': 'female', 'bmi': '25.74', 'children': '0', 'smoker': 'no', 'region': 'southeast', 'charges': '3756.6216'}, {'age': '46', 'sex': 'female', 'bmi': '33.44', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '8240.5896'}, {'age': '37', 'sex': 'female', 'bmi': '27.74', 'children': '3', 'smoker': 'no', 'region': 'northwest', 'charges'
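As the output above shows, `csv.DictReader` loads every field as a string (`'age': '19'`), which is why the later functions call `int()` and `float()` on each row. A minimal sketch of converting the numeric fields once at load time follows. The helper name `parseRow` is my own, and the two sample rows are copied from the output above so the block runs without `insurance.csv`.

```python
import csv
import io

# Two rows copied from the output above, standing in for insurance.csv
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,33.77,1,no,southeast,1725.5523\n"
)

def parseRow(row):
    # Hypothetical helper: return a copy of the row with numeric fields converted
    parsed = dict(row)
    parsed['age'] = int(row['age'])
    parsed['children'] = int(row['children'])
    parsed['bmi'] = float(row['bmi'])
    parsed['charges'] = float(row['charges'])
    return parsed

typedList = [parseRow(row) for row in csv.DictReader(sample_csv)]
print(typedList[0]['age'] + 1)  # arithmetic works without int(); prints 20
```

Calls such as `int(row['age'])` and `float(row['bmi'])` still work on rows converted this way, so the rest of the notebook would not need to change.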

# Organize Data into subsets
## Define reusable functions to create subsets

In [3]:
def sortBySex(group):
    males = []
    females = []
    for row in group:
        if row['sex'] == 'male':
            males.append(row)
        elif row['sex'] == 'female':
            females.append(row)
    return {"M" : males, "F" : females}

In [4]:
#Male and Female subgroups
allMales = sortBySex(insuranceList)["M"]
allFemales = sortBySex(insuranceList)["F"]
print(len(allMales))
print(len(allFemales))

676
662


In [5]:
def sortByRegion(group):
    sw = []
    se = []
    nw = []
    ne = []
    for row in group:
        if row['region'] == 'southwest':
            sw.append(row)
        elif row['region'] == 'southeast':
            se.append(row)
        elif row['region'] == 'northwest':
            nw.append(row)
        elif row['region'] == 'northeast':
            ne.append(row)
    return {'SW' : sw, 'SE' : se, 'NW' : nw, 'NE' : ne}

In [6]:
# SW, SE, NW, NE subgroups
allSW = sortByRegion(insuranceList)['SW']
allSE = sortByRegion(insuranceList)['SE']
allNW = sortByRegion(insuranceList)['NW']
allNE = sortByRegion(insuranceList)['NE']
# Counts are printed in the order SW, SE, NE, NW
print(len(allSW))
print(len(allSE))
print(len(allNE))
print(len(allNW))

325
364
324
325


In [7]:
def sortBySmoker(group):
    smoker = []
    non = []
    for row in group:
        if row['smoker'] == 'yes':
            smoker.append(row)
        elif row['smoker'] == 'no':
            non.append(row)
    return {'Smoker' : smoker, 'Non-smoker' : non}

In [8]:
# smoker, non_smoker subgroups
smokers = sortBySmoker(insuranceList)['Smoker']
non_smokers = sortBySmoker(insuranceList)['Non-smoker']
print(len(smokers))
print(len(non_smokers))
percentage_smoker = 100 * len(smokers)/(len(smokers) + len(non_smokers))
print(str(round(percentage_smoker, 2)) + "% are smokers.")

274
1064
20.48% are smokers.


In [9]:
def sortByParent(group):
    parent = []
    notParent = []
    for row in group:
        if int(row['children']) == 0:
            notParent.append(row)
        elif int(row['children']) > 0:
            parent.append(row)
    return {'Yes': parent, 'No': notParent}

In [10]:
# parents, nonparents subgroups
parents = sortByParent(insuranceList)['Yes']
nonparents = sortByParent(insuranceList)['No']
print(len(parents))
print(len(nonparents))

764
574
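The four sort functions above share one pattern: walk the rows and append each to a bucket chosen by a label. A possible generic version is sketched below. The name `groupBy` and the toy rows are my own, for illustration only. Unlike the fixed functions, it only creates a key for labels that actually occur in the data.

```python
def groupBy(group, keyFunc):
    # Hypothetical generic helper: bucket rows under whatever label keyFunc returns
    buckets = {}
    for row in group:
        buckets.setdefault(keyFunc(row), []).append(row)
    return buckets

# Made-up rows for illustration only; not taken from insurance.csv
toyRows = [
    {'sex': 'male', 'region': 'southwest', 'smoker': 'no', 'children': '2'},
    {'sex': 'female', 'region': 'southeast', 'smoker': 'yes', 'children': '0'},
    {'sex': 'female', 'region': 'southwest', 'smoker': 'no', 'children': '1'},
]

bySex = groupBy(toyRows, lambda row: row['sex'])
byRegion = groupBy(toyRows, lambda row: row['region'])
byParent = groupBy(toyRows, lambda row: 'Yes' if int(row['children']) > 0 else 'No')
print({key: len(value) for key, value in byParent.items()})
```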


# Create other useful functions for analysis

### For this project, I chose to group ages and BMIs into ranges rather than treat them as linear progressions. A linear approach might be more predictive, but grouping can reveal nonlinear trends and may better identify specific groups to target for improvement.
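For comparison, one way the linear alternative could be sketched with only the `statistics` module is a Pearson correlation coefficient between two numeric columns. This is not used anywhere in the notebook. The function name `pearson` and the sample numbers are assumptions made up for illustration.

```python
import statistics

def pearson(xs, ys):
    # Pearson correlation coefficient computed from its definition
    meanX = statistics.mean(xs)
    meanY = statistics.mean(ys)
    covariance = sum((x - meanX) * (y - meanY) for x, y in zip(xs, ys))
    spreadX = sum((x - meanX) ** 2 for x in xs)
    spreadY = sum((y - meanY) ** 2 for y in ys)
    return covariance / (spreadX * spreadY) ** 0.5

# Made-up numbers for illustration only; not taken from insurance.csv
ages = [20, 30, 40, 50, 60]
costs = [2000.0, 4000.0, 6000.0, 8000.0, 10000.0]
print(round(pearson(ages, costs), 2))
```

A value near 1 or -1 indicates a strong linear relationship. A value near 0 does not rule out a nonlinear one, which is the kind of pattern the grouped approach is meant to expose.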

In [34]:
# average insurance cost
def averageCost(group):
    charges = []
    for row in group:
        charges.append(float(row['charges']))
    return round(statistics.mean(charges), 2)
print(averageCost(insuranceList))

13270.42


In [12]:
# average BMI
def averageBMI(group):
    bmis = []
    for row in group:
        bmis.append(float(row['bmi']))
    return round(statistics.mean(bmis), 2)
print(averageBMI(parents))
print(averageBMI(nonparents))

30.75
30.55


In [13]:
def percentSmoker(group):
    dosmoke = 0
    for row in group:
        if row['smoker'] == 'yes':
            dosmoke += 1
    return round(100 * (dosmoke/len(group)), 2)
print(percentSmoker(insuranceList))

20.48


In [14]:
# sort by age
def byAge(group):
    twenties = []
    thirties = []
    forties = []
    fifties =[]
    sixties = []
    for row in group:
        age = int(row['age'])
        if age < 30:
            twenties.append(row)
        elif 30 <= age < 40:
            thirties.append(row)
        elif 40 <= age < 50:
            forties.append(row)
        elif 50 <= age < 60:
            fifties.append(row)
        elif 60 <= age < 70:
            sixties.append(row)
        
    return {'18-29': twenties, '30s': thirties, '40s': forties, '50s': fifties,
            '60s': sixties}

for key, value in byAge(insuranceList).items():
    print(key, ':', len(value))


18-29 : 417
30s : 257
40s : 279
50s : 271
60s : 114


In [15]:
# Average BMI by age
def bmiByAge(group, groupname):
    # Map each age range to the average BMI of its members
    averages = {}
    for key, value in byAge(group).items():
        averages[key] = averageBMI(value)
    print("Average BMI's by age in {}:".format(groupname))
    print(averages)
bmiByAge(insuranceList, "Total Dataset")

Average BMI's by age in Total Dataset:
{'18-29': 29.85, '30s': 30.44, '40s': 30.71, '50s': 31.51, '60s': 32.02}


In [16]:
# Sort by BMI
def byBmi(group):
    underweight = []
    normal = []
    overweight = []
    obese = []
    for row in group:
        bmi = float(row['bmi'])
        if bmi < 18.5:
            underweight.append(row)
        elif 18.5 <= bmi < 25:
            normal.append(row)
        elif 25 <= bmi < 30:
            overweight.append(row)
        elif bmi >= 30:
            obese.append(row)
    return {'Underweight': underweight, 'Normal': normal, 'Overweight': overweight, 'Obese': obese}

for key, value in byBmi(insuranceList).items():
    print(key, len(value))


Underweight 20
Normal 225
Overweight 386
Obese 707
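The chain of range checks in `byBmi` can also be written as a lookup against a sorted list of cutoffs. The sketch below uses the standard `bisect` module with the same cutoffs as above. The names `BMI_CUTOFFS`, `BMI_LABELS`, and `bmiLabel` are my own.

```python
import bisect

# Cutoffs and labels taken from byBmi above
BMI_CUTOFFS = [18.5, 25, 30]
BMI_LABELS = ['Underweight', 'Normal', 'Overweight', 'Obese']

def bmiLabel(bmi):
    # Hypothetical helper: bisect_right counts how many cutoffs are <= bmi,
    # which is exactly the index of the matching label
    return BMI_LABELS[bisect.bisect_right(BMI_CUTOFFS, bmi)]

print([bmiLabel(value) for value in (17.0, 18.5, 27.9, 33.77)])
```

Adding a new range this way means editing the two lists rather than adding another `elif` branch.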


# Basic Analysis

In [17]:
# Obvious factor in insurance cost... Age!  Sadly, this factor is not addressable.
for key, value in byAge(insuranceList).items():
    print('Age ({}): Average cost of {}'.format(key, averageCost(value)))

Age (18-29): Average cost of 9182.49
Age (30s): Average cost of 11738.78
Age (40s): Average cost of 14399.2
Age (50s): Average cost of 16495.23
Age (60s): Average cost of 21248.02


In [18]:
# The greatest factor in insurance cost: smokers vs. non smokers
print('Cost for Smokers:', averageCost(smokers))
print('Cost for Non Smokers:', averageCost(non_smokers))

Cost for Smokers: 32050.23
Cost for Non Smokers: 8434.27


In [19]:
# percent smokers per region
rs = {}
rs['SE smokers'] = percentSmoker(allSE)
rs['SW smokers'] = percentSmoker(allSW)
rs['NE smokers'] = percentSmoker(allNE)
rs['NW smokers'] = percentSmoker(allNW)
print(rs)

{'SE smokers': 25.0, 'SW smokers': 17.85, 'NE smokers': 20.68, 'NW smokers': 17.85}


In [20]:
# Average BMI per region
bmiPerRegion = {}
bmiPerRegion['SE bmi'] = averageBMI(allSE)
bmiPerRegion['SW bmi'] = averageBMI(allSW)
bmiPerRegion['NE bmi'] = averageBMI(allNE)
bmiPerRegion['NW bmi'] = averageBMI(allNW)
print(bmiPerRegion)

{'SE bmi': 33.36, 'SW bmi': 30.6, 'NE bmi': 29.17, 'NW bmi': 29.2}


In [21]:
# Average insurance cost per region
costPerRegion = {}
costPerRegion['SE cost'] = averageCost(allSE)
costPerRegion['SW cost'] = averageCost(allSW)
costPerRegion['NE cost'] = averageCost(allNE)
costPerRegion['NW cost'] = averageCost(allNW)
print(costPerRegion)
# Is cost of living a factor in insurance costs?
## E.g., does BCBS of CA charge more than MN in order to compensate employees?
### Smoking % is equal in the NW and SW, yet NW insurance costs slightly more than SW despite lower BMI.

{'SE cost': 14735.41, 'SW cost': 12346.94, 'NE cost': 13406.38, 'NW cost': 12417.58}


In [22]:
# percent smokers among parents
print('Parent smoke %:', percentSmoker(parents))
print('Nonparent smoke %:', percentSmoker(nonparents))
# The nearly equal rates suggest the smoking habit is generally "formed" before people become parents

Parent smoke %: 20.81
Nonparent smoke %: 20.03


In [23]:
# percent smokers by sex
print('Males:', percentSmoker(allMales))
print('Females:', percentSmoker(allFemales))

Males: 23.52
Females: 17.37


In [24]:
# insurance cost by sex
print('Cost for Males:', averageCost(allMales))
print('Cost for Females:', averageCost(allFemales))

Cost for Males: 13956.75
Cost for Females: 12569.58


In [25]:
#insurance cost for parents
print('Cost for Parents:', averageCost(parents))
print('Cost for Non Parents:', averageCost(nonparents))

Cost for Parents: 13949.94
Cost for Non Parents: 12365.98


# More in depth analysis
## Differences in BMI by age between groups could identify areas of focus for exercise or diet programs

In [26]:
# how does BMI change for mothers vs. nonmothers as they age?
## -sort out women from group parents 
mothers = sortBySex(parents)['F']
notmothers = sortBySex(nonparents)['F']
### copy bmiByAge for these two groups
momBmiByAge = {}
for key, value in byAge(mothers).items():
    momBmiByAge[key] = averageBMI(value)
notmomBmiByAge = {}
for key, value in byAge(notmothers).items():
    notmomBmiByAge[key] = averageBMI(value)
print('Moms:', momBmiByAge)
print('Not Moms:', notmomBmiByAge)

Moms: {'18-29': 28.97, '30s': 29.79, '40s': 30.58, '50s': 31.58, '60s': 33.71}
Not Moms: {'18-29': 30.0, '30s': 29.63, '40s': 31.82, '50s': 30.65, '60s': 30.41}


In [27]:
# findings of mom BMI are surprising.  How about Dads?
## BmiByAge should be a function. Changed above...
fathers = sortBySex(parents)['M']
notfathers = sortBySex(nonparents)['M']
dadBmiByAge = {}
for key, value in byAge(fathers).items():
    dadBmiByAge[key] = averageBMI(value)
notdadBmiByAge = {}
for key, value in byAge(notfathers).items():
    notdadBmiByAge[key] = averageBMI(value)
print('Dads:', dadBmiByAge)
print('Not Dads:', notdadBmiByAge)
# "Dad Bod" appears to exist until the sixties, then dads improve.  "Granddad Bod"?

Dads: {'18-29': 30.12, '30s': 31.34, '40s': 30.68, '50s': 32.48, '60s': 31.92}
Not Dads: {'18-29': 30.06, '30s': 30.25, '40s': 29.99, '50s': 31.18, '60s': 33.12}


In [28]:
# smokers BMI by age:
bmiByAge(smokers, "Smokers")
bmiByAge(non_smokers, "Non-Smokers")
# Non-smokers' BMI rises about 75% more than smokers' from the 18-29 group to the 60s (2.39 vs. 1.36 points)

Average BMI's by age in Smokers:
{'18-29': 30.43, '30s': 30.54, '40s': 30.14, '50s': 31.66, '60s': 31.79}
Average BMI's by age in Non-Smokers:
{'18-29': 29.7, '30s': 30.42, '40s': 30.87, '50s': 31.48, '60s': 32.09}


In [29]:
# BMI by age per region:
bmiByAge(allNE, "NE")
bmiByAge(allSE, "SE")
bmiByAge(allNW, "NW")
bmiByAge(allSW, "SW")
# Wow, what conclusions does this lead to? 
# For one, hotter climates may lead to a greater increase in BMI from age 50 on.
## Do higher rates of smoking in the east account for any of this?

Average BMI's by age in NE:
{'18-29': 28.0, '30s': 28.46, '40s': 29.59, '50s': 30.45, '60s': 30.98}
Average BMI's by age in SE:
{'18-29': 33.3, '30s': 33.61, '40s': 32.75, '50s': 33.7, '60s': 33.77}
Average BMI's by age in NW:
{'18-29': 28.54, '30s': 28.39, '40s': 30.26, '50s': 29.95, '60s': 29.18}
Average BMI's by age in SW:
{'18-29': 29.08, '30s': 30.95, '40s': 29.87, '50s': 31.82, '60s': 33.91}


## Do smoking habits change by age group?

In [30]:
for key, value in byAge(insuranceList).items():
    smokeByAge = {key: percentSmoker(value)}
    print(smokeByAge)

{'18-29': 20.62}
{'30s': 22.57}
{'40s': 22.22}
{'50s': 15.13}
{'60s': 23.68}


In [31]:
print("Parents:")
for key, value in byAge(parents).items():
    smokeByAge = {key: percentSmoker(value)}
    print(smokeByAge)
print("Nonparents:")
for key, value in byAge(nonparents).items():
    smokeByAge = {key: percentSmoker(value)}
    print(smokeByAge)
# This is probably a reflection of demographics (for example, smoker health education in the 1980s vs. the 1960s),
# combined with better education about smoke and children.
## It may be useful to target empty-nest parents with anti-smoking campaigns.

Parents:
{'18-29': 19.54}
{'30s': 22.05}
{'40s': 21.13}
{'50s': 18.24}
{'60s': 29.41}
Nonparents:
{'18-29': 21.4}
{'30s': 24.19}
{'40s': 25.76}
{'50s': 11.38}
{'60s': 21.25}
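Percentages for small subgroups can swing widely, so it may help to print each group's size next to its smoking rate. Below is a minimal sketch of such a helper. The name `smokerSummary` and the toy rows are assumptions for illustration. It also guards against an empty group, which would make `percentSmoker` divide by zero.

```python
def smokerSummary(group):
    # Hypothetical helper: return (group size, percent smokers) so small groups stand out
    size = len(group)
    if size == 0:
        # No rows: report the size and no percentage instead of dividing by zero
        return (0, None)
    smokerCount = sum(1 for row in group if row['smoker'] == 'yes')
    return (size, round(100 * smokerCount / size, 2))

# Made-up rows for illustration only; not taken from insurance.csv
toyGroup = [{'smoker': 'yes'}, {'smoker': 'no'}, {'smoker': 'no'}, {'smoker': 'no'}]
print(smokerSummary(toyGroup))
print(smokerSummary([]))
```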


## Average Insurance Cost per BMI range

In [32]:
for key, value in byBmi(insuranceList).items():
    print(key, averageCost(value))

Underweight 8852.2
Normal 10409.34
Overweight 10987.51
Obese 15552.34


In [33]:
print('Males:')
for key, value in byBmi(allMales).items():
    print(key,averageCost(value))
print('Females:')
for key, value in byBmi(allFemales).items():
    print(key, averageCost(value))
# BMI appears to be a much stronger factor in insurance cost for men than for women

Males:
Underweight 5611.71
Normal 9868.02
Overweight 11381.95
Obese 16610.45
Females:
Underweight 11012.53
Normal 10909.02
Overweight 10616.85
Obese 14370.67
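One way to put a number on the difference between men and women is to compare how much more the obese group costs than the normal-BMI group for each sex. The figures below are copied from the output above. The helper `percentIncrease` is my own.

```python
# Average costs copied from the output above
maleCost = {'Normal': 9868.02, 'Obese': 16610.45}
femaleCost = {'Normal': 10909.02, 'Obese': 14370.67}

def percentIncrease(base, other):
    # Hypothetical helper: how much larger `other` is than `base`, in percent
    return 100 * (other / base - 1)

maleIncrease = percentIncrease(maleCost['Normal'], maleCost['Obese'])
femaleIncrease = percentIncrease(femaleCost['Normal'], femaleCost['Obese'])
print(round(maleIncrease, 1), round(femaleIncrease, 1))
# prints: 68.3 31.7
```

These are simple averages that do not control for age or smoking, so they describe the groups rather than isolate the effect of BMI.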


# Conclusions

## Smoking is the number one factor in higher health insurance costs.
### Not surprisingly, the data suggests that smokers take up the habit at an early age, so smoking education has to start early. The data indicates that education about children and secondhand smoke might be working, but the expense of smoking may also be a factor in the lower rates of smoking for younger parents. The older parents' (now grandparents') rate of smoking is an identifiable area to target for improvement.

## The unhealthiest region of the U.S. is the Southeast.  
### In the Southeast, the smoking rate is 25%, roughly a third higher than in the rest of the country (about 19%), and BMI numbers are significantly worse. This is an area to focus resources on making exercise programs more easily available and improving diet and smoking education. Improvements in exercise program accessibility would likely pay greater dividends in the southern half of the U.S., as BMIs appear to increase more quickly there.
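The Southeast comparison can be checked from numbers already printed in this notebook: the region sizes and the percent of smokers per region. The sketch below recovers smoker counts from the rounded percentages, which is an approximation on my part, and then compares the Southeast rate with the other three regions combined.

```python
# Group sizes and smoker percentages copied from the outputs above
regionSize = {'SE': 364, 'SW': 325, 'NE': 324, 'NW': 325}
regionSmokerPct = {'SE': 25.0, 'SW': 17.85, 'NE': 20.68, 'NW': 17.85}

# Recover smoker counts from the rounded percentages (approximation)
smokerCount = {r: round(regionSize[r] * regionSmokerPct[r] / 100) for r in regionSize}

# Smoking rate across the three regions other than the Southeast
restSmokers = sum(count for r, count in smokerCount.items() if r != 'SE')
restSize = sum(size for r, size in regionSize.items() if r != 'SE')
restPct = 100 * restSmokers / restSize

# Relative gap: how much higher the Southeast rate is, in percent
relativeGap = 100 * (regionSmokerPct['SE'] / restPct - 1)
print(round(restPct, 2), round(relativeGap, 1))
```

The recovered counts add up to the 274 smokers reported earlier, which suggests the rounding did not distort them.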

## BMI differences account for much larger insurance cost differences in men than in women
### This is likely a reflection of greater overall health consequences of obesity in men versus women. (Weight gain in mothers may be more normal and healthy?) Aging men (particularly nonsmokers and nonparents) appear to be a very useful demographic to target. As a parent of a young child, I expected greater differences in BMI between parents and nonparents. While this doesn't seem to be a significant area of focus, there is a definite leap in mothers' BMIs in their 60s. This could simply be a result of a small sample size, but it could also reflect normal hormonal changes. This demographic might be another group to target for improvement.