# U.S. Medical Insurance Costs

The goal of this project is to investigate how different demographic data affects insurance costs. We will also compare our dataset to available US Census data to evaluate whether our sample data is representative of the US adult population.

## Part 1: Import and Preprocess Data

Sample insurance data has been provided by Codecademy. 

Census data was acquired from the [US Census website](https://data.census.gov/table?t=Age+and+Sex&tid=ACSST1Y2021.S0101) from the 2021 Annual Community Survey.

In [1]:
# Import csv library
import csv

In [2]:
# Read in insurance data
with open('insurance.csv') as insurance_csv:
    insurance_data = list(csv.DictReader(insurance_csv))

In [3]:
# Read in Census data
with open('census_data.csv') as census_csv:
    census_data = {}
    census_file = csv.DictReader(census_csv)
    for record in census_file:
        census_data.update({record['Age_Grouping']: record})

Looking at the data, we can see that our data isn't all in the format we need. Numeric and yes/no values are stored as strings.

In [4]:
print(insurance_data[0:3])
print(census_data['Total_population'])

[{'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0', 'smoker': 'yes', 'region': 'southwest', 'charges': '16884.924'}, {'age': '18', 'sex': 'male', 'bmi': '33.77', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '1725.5523'}, {'age': '28', 'sex': 'male', 'bmi': '33', 'children': '3', 'smoker': 'no', 'region': 'southeast', 'charges': '4449.462'}]
{'Age_Grouping': 'Total_population', 'Total_Estimate': '331893745', 'Male_Estimate': '164350703', 'Female_Estimate': '167543042'}


We'll iterate through the insurance and Census data to reformat the data.

In [5]:
# Reformat numeric values in insurance data, convert yes/no data to 1/0
for record in insurance_data:
    record['age'] = int(record['age'])
    record['bmi'] = float(record['bmi'])
    record['children'] = int(record['children'])
    record['charges'] = float(record['charges'])
    if record['smoker'] == 'yes':
        record['smoker'] = 1
    if record['smoker'] == 'no':
        record['smoker'] = 0

In [6]:
# Reformat numeric values in Census data
for group, record in census_data.items():
    record['Total_Estimate'] = int(record['Total_Estimate'])
    record['Male_Estimate'] = int(record['Male_Estimate'])
    record['Female_Estimate'] = int(record['Female_Estimate'])

Next we'll compare the ages present in the insurance data to the age groupings in the Census data to see if there are ages in one dataset that aren't present in the other. We find and print the maximum and minimum ages in the insurance dataset, and print the age groupings in the Census data.

In [11]:
# Initialize max and min ages
# Min is set to infinity so any value in insurance data will be lower
# Max is set to 0 so any value in insurance data will be higher
ins_min_age = float('inf')
ins_max_age = 0

for record in insurance_data:
    # Check if age in record is less than minimum age
    # If so, set minimum age equal to record's age
    if record['age'] < ins_min_age:
        ins_min_age = record['age']
    # Check if agen in record is greater than maximum age
    # If so, set maximum age equal to record's age
    if record['age'] > ins_max_age:
        ins_max_age = record['age']

# Print max and min insurance ages
print('Insurance Minimum Age: ' + str(ins_min_age) + '\n'\
     'Insurance Maximum Age: ' + str(ins_max_age) + '\n\n')

# Iterate through Census data and print all age groupings
for key in census_data.keys():
    print(key)

Insurance Minimum Age: 18
Insurance Maximum Age: 64


Total_population
Under 5 years
5 to 9 years
10 to 14 years
15 to 19 years
20 to 24 years
25 to 29 years
30 to 34 years
35 to 39 years
40 to 44 years
45 to 49 years
50 to 54 years
55 to 59 years
60 to 64 years
65 to 69 years
70 to 74 years
75 to 79 years
80 to 84 years
85 years and over


We can now see that the ages in the insurance data range from 18 to 64 years while the Census data has age groupings as young as under 5 and as old as 85 years and older. We'll remove the age groups from the Census data that aren't present in the insurance data. Because we have no insurance data for ages 15 through 17, our first age group to compare across datasets will be 20 to 24 years old.

When the unneeded age groups are removed, their dictionaries will be removed from the Census data and their population counts will be removed from the overall population total and the overall male and female population totals.

In [12]:
# Define unneeded age groups
groups_to_remove = ['Under 5 years', 
                    '5 to 9 years', 
                    '10 to 14 years', 
                    '15 to 19 years', 
                    '65 to 69 years', 
                    '70 to 74 years',
                    '75 to 79 years',
                    '80 to 84 years',
                    '85 years and over']

# Store the overall total population and total male and female populations
# These will be used to update the census data with new population totals after unneeded groups are removed
census_total_pop = census_data['Total_population']['Total_Estimate']
census_total_female = census_data['Total_population']['Female_Estimate']
census_total_male = census_data['Total_population']['Male_Estimate']

for group in groups_to_remove:
    # Subtract the total group population from the overall total
    census_total_pop = census_total_pop - census_data[group]['Total_Estimate']
    # Subtract the female group population from the overall female total
    census_total_female = census_total_female - census_data[group]['Female_Estimate']
    # Subtract the male group population from the overall male total
    census_total_male = census_total_male - census_data[group]['Male_Estimate']

# Update Census total populations with new values
census_data['Total_population']['Total_Estimate'] = census_total_pop
census_data['Total_population']['Female_Estimate'] = census_total_female
census_data['Total_population']['Male_Estimate'] = census_total_male

In [13]:
# Remove unneeded group dictionaries from Census data
for group in groups_to_remove:
    census_data.pop(group, None)

The keys currently used in the Census dictionary are long and leave lots of room for error when trying to access a specific key. We'll shorten them for ease of use. Since we're not able to change the keys in existing dictionaries, we'll create a new dictionary using the new keys with the existing Census data.

In [14]:
# Create list of existing keys
census_keys = list(census_data.keys())

# Define new keys
census_keys_new = ['total',
                  '20_24',
                  '25_29',
                  '30_34',
                  '35_39',
                  '40_44',
                  '45_49',
                  '50_54',
                  '55_59',
                  '60_64']

# Create empty dictionary for new keys
census_dat = {}

# Populate dictionary with Census data and new keys
for i in range(len(census_keys)):
    census_dat.update({census_keys_new[i]: census_data[census_keys[i]]})

## Part 2: Store Data in Variables

To make comparison between the Census and insurance data easier, we'll create a dictionary of dictionaries from the insurance data that mimics the form of the list of Census dictionaries.

We'll use the old census keys to populate the Age Grouping for each dictionary.

In [15]:
# Create empty dictionary
ins_data_compare = {}

# Add dictionaries with empty counts for each age group
for i in range(len(census_keys)):
    ins_data_compare.update({census_keys_new[i]: 
                             {'Age_Grouping': census_keys[i], 
                              'Total_Estimate': 0, 
                              'Male_Estimate': 0, 
                              'Female_Estimate': 0}})

In [19]:
# Iterate through insurance data to calculate counts for different groups
for record in insurance_data:
    # For each record in insurance data, add one to overall population count
    ins_data_compare['total']['Total_Estimate'] += 1
    # If record's sex is female, add one to overall female population
    if record['sex'] == 'female':
        ins_data_compare['total']['Female_Estimate'] += 1
    # If record's sex is male, add one to overall male population
    if record['sex'] == 'male':
        ins_data_compare['total']['Male_Estimate'] += 1
    # Find appropriate age and sex group for record, add one to age group population and age group sex population
    for key in ins_data_compare.keys():
        if key == 'total':
            continue
        if record['age'] >= int(key[0:1]) and record['age'] <= int(key[-2:]) and record['sex'] == 'female':
            ins_data_compare[key]['Total_Estimate'] += 1
            ins_data_compare[key]['Female_Estimate'] += 1
        if record['age'] >= int(key[0:1]) and record['age'] <= int(key[-2:]) and record['sex'] == 'male':
            ins_data_compare[key]['Total_Estimate'] += 1
            ins_data_compare[key]['Male_Estimate'] += 1

## Part 3a: Compare Insurance and Census Data

Ages and sexes from insurance and Census data are compared to evaluate if insurance data is representative of the US population in relevant age groups.

In [21]:
# Create new dictionaries to store Census and insurance data as proportions
ins_proportions = {}
census_proportions = {}

def make_proportions(dictionary_in, dictionary_out):
    for group, record in dictionary_in.items():
        # Store total population for all age groups and sexes
        overall_total = dictionary_in['total']['Total_Estimate']
        # Add dictionary for age group
        dictionary_out.update({record['Age_Grouping']: 
                               # Store proportion of age group in total population
                                {'Total': round(record['Total_Estimate'] / overall_total, 4),
                                 # Store proportion of female population in age group
                                'Female_Prop': round(record['Female_Estimate'] / record['Total_Estimate'], 4),
                                 # Store proportion of male population in age group
                                'Male_Prop': round(record['Male_Estimate'] / record['Total_Estimate'], 4)}})

In [22]:
# Populate dictionaries
make_proportions(ins_data_compare, ins_proportions)
make_proportions(census_dat, census_proportions)

In [24]:
# Print comparisons of the insurance and Census proportions for age and gender
# This loop requires the age_group_labels list to be defined above
for group in census_keys:
    # Print header comparing proportion of age group in overall populations
    print('Age Group: ' + group + '\n' + \
         'TOTAL PROPORTION' + '\n' + \
         'Census: ' + str(census_proportions[group]['Total']) + '\n' + \
         'Insurance: ' + str(ins_proportions[group]['Total']) + '\n' + \
         # Print comparison of female proportions of age group
         'FEMALE PROPORTION' + '\n' + \
         'Census: ' + str(census_proportions[group]['Female_Prop']) + '\n' + \
         'Insurance: ' + str(ins_proportions[group]['Female_Prop']) + '\n' + \
         # Print comparisons of male proportions of age group
         'MALE PROPORTION' + '\n' + \
         'Census: ' + str(census_proportions[group]['Male_Prop']) + '\n' + \
         'Insurance: ' + str(ins_proportions[group]['Male_Prop']) + '\n')

Age Group: Total_population
TOTAL PROPORTION
Census: 1.0
Insurance: 1.0
FEMALE PROPORTION
Census: 0.4992
Insurance: 0.4955
MALE PROPORTION
Census: 0.5008
Insurance: 0.5045

Age Group: 20 to 24 years
TOTAL PROPORTION
Census: 0.1104
Insurance: 0.2075
FEMALE PROPORTION
Census: 0.4891
Insurance: 0.482
MALE PROPORTION
Census: 0.5109
Insurance: 0.518

Age Group: 25 to 29 years
TOTAL PROPORTION
Census: 0.1141
Insurance: 0.3112
FEMALE PROPORTION
Census: 0.4927
Insurance: 0.482
MALE PROPORTION
Census: 0.5073
Insurance: 0.518

Age Group: 30 to 34 years
TOTAL PROPORTION
Census: 0.1186
Insurance: 0.4097
FEMALE PROPORTION
Census: 0.4954
Insurance: 0.4845
MALE PROPORTION
Census: 0.5046
Insurance: 0.5155

Age Group: 35 to 39 years
TOTAL PROPORTION
Census: 0.1155
Insurance: 0.503
FEMALE PROPORTION
Census: 0.4939
Insurance: 0.4866
MALE PROPORTION
Census: 0.5061
Insurance: 0.5134

Age Group: 40 to 44 years
TOTAL PROPORTION
Census: 0.1103
Insurance: 0.6037
FEMALE PROPORTION
Census: 0.499
Insurance: 0.488

Based on our available data, all groups in the insurance data appear to be fairly represented along age and gender lines. A few age groups (30 to 34, 35 to 39, and 60 to 64) are under-represented by a 2-3 percentage points, but that shouldn't have a significant impact on our analysis.

## Part 3b: Compare Insurance Costs Across Demographics

Now that we've compared the insurance data to Census data, we'll compare costs for three different demographic groups of interest: sex, smoker status, and region. First we'll create a list of the unique regions.

In [139]:
unique_regions = []

for record in insurance_data:
    # Check if record's region is in unique regions list
    # If not, add to regions list
    if record['region'] not in unique_regions:
        unique_regions.append(record['region'])

print(unique_regions)

['southwest', 'southeast', 'northwest', 'northeast']


In [142]:
def compare_sexes(dict_list):
    # Set counts and costs for each sex group to 0
    female_cost = 0
    female_count = 0
    male_cost = 0
    male_count = 0
    for record in dict_list:
        # Check if record's sex is female
        # If so, add charges to the total and add one to female count
        if record['sex'] == 'female':
            female_cost += record['charges']
            female_count += 1
        # Check if record's sex is male
        # If so, add charges to total and add one to male count
        if record['sex'] == 'male':
            male_cost += record['charges']
            male_count += 1
    # Divide costs by count to find average cost for each sex
    female_avg = female_cost / female_count
    male_avg = male_cost / male_count
    # Print results
    print('Average Female Charges: ' + str(round(female_avg, 4)))
    print("Average Male Charges: " + str(round(male_avg, 4)))
    
compare_sexes(insurance_data)

Average Female Charges: 12569.5788
Average Male Charges: 13956.7512


In [146]:
def compare_smoker(dict_list):
    smoker_cost = 0
    smoker_count = 0
    nonsmoker_cost = 0
    nonsmoker_count = 0
    for record in dict_list:
        if record['smoker'] == 1:
            smoker_cost += record['charges']
            smoker_count += 1
        if record['smoker'] == 0:
            nonsmoker_cost += record['charges']
            nonsmoker_count += 1
    smoker_avg = smoker_cost / smoker_count
    nonsmoker_avg = nonsmoker_cost / nonsmoker_count
    print('Average Smoker Charges: ' + str(round(smoker_avg, 4)))
    print('Average Nonsmoker Charges: ' + str(round(nonsmoker_avg, 4)))
    
compare_smoker(insurance_data)

Average Smoker Charges: 32050.2318
Average Nonsmoker Charges: 8434.2683


In [148]:
def compare_region(dict_list):
    # Set region costs and counts to 0
    southwest_cost = 0
    southwest_count = 0
    southeast_cost = 0
    southeast_count = 0
    northwest_cost = 0
    northwest_count = 0
    northeast_cost = 0
    northeast_count = 0
    # Iterate through dictionary list
    # For appropriate region, add costs to total and increase count by 1
    for record in dict_list:
        if record['region'] == 'southwest':
            southwest_cost += record['charges']
            southwest_count += 1
        if record['region'] == 'southeast':
            southeast_cost += record['charges']
            southeast_count += 1
        if record['region'] == 'northwest':
            northwest_cost += record['charges']
            northwest_count += 1
        if record['region'] == 'northeast':
            northeast_cost += record['charges']
            northeast_count += 1
    # Divide costs by count to find averages
    southwest_avg = southwest_cost / southwest_count
    southeast_avg = southeast_cost / southeast_count
    northwest_avg = northwest_cost / northwest_count
    northeast_avg = northeast_cost / northeast_count
    # Print results
    print('Southwest Average: ' + str(round(southwest_avg, 4)))
    print('Southeast Average: ' + str(round(southeast_avg, 4)))
    print('Northwest Average: ' + str(round(northwest_avg, 4)))
    print('Northeast Average: ' + str(round(northeast_avg, 4)))

compare_region(insurance_data)

Southwest Average: 12346.9374
Southeast Average: 14735.4114
Northwest Average: 12417.5754
Northeast Average: 13406.3845


From our analysis, we can see that males have slightly higher insurance costs on average than females, smokers have much higher insurance costs on average than non-smokers, and people in the southeast have higher insurance costs on average than other regions.