# U.S. Medical Insurance Costs

In this project, we explore insurance costs for individuals. The dataset records each person's age, sex, BMI, number of children, smoker status, region of the country, and insurance charges.

Some preliminary project goals:
- Get the range and average for the numeric metrics (age, BMI, number of children).
- Get minimum, average, and maximum charge for each metric (or metric group).
- Compare stats across different groups.

Table of Contents:
- [Preliminary stats](#Preliminary-stats)
- [Stats by Metric](#Stats-by-Metric)
    - [Region](#Region)
    - [Sex](#Sex)
    - [Smoker status](#Smoker-status)
    - [Number of Children](#Number-of-Children)
    - [BMI](#BMI)
    - [Age](#Age)
- [Conclusion](#Conclusion)
- [TODO](#TODO)

In [166]:
# A bunch of functions and such to help along the way

# Format a number value as a string with the following formatting:
# - ":," adds a comma as a thousands separator and
# - ".2f" limits the string to two decimal places
def format_float(val):
    # Ensure the value provided is a float.
    return "{:,.2f}".format(float(val))

# Format a dollar (number) value as a string.
def format_dollar(val):
    return "${}".format(format_float(val))

def add_row_to_dict(row, dictionary, col_name):
    key = row[col_name]
    record = dictionary.get(key, [])
    record.append(row)
    dictionary[key] = record
    
def person_message(row, superlative):
    smoker_text = 'smoker' if row['smoker'] == 'yes' else 'non-smoker'
    children_text = 'child' if row['children'] == '1' else 'children'
    message = "Person with the {sup} ({charge}) is a {age} {sex} {smoker_text} from the {region} region with {children} {children_text} and a BMI of {bmi}.".format(
        sup=superlative,
        charge=format_dollar(row['charges']),
        age=row['age'],
        sex=row['sex'],
        smoker_text=smoker_text,
        region=row['region'],
        children=row['children'],
        children_text=children_text,
        bmi=row['bmi']
    )
    return message

# Parse through an array of rows and present charge stats.
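def parse_array(arr):
    total_items = len(arr)
    if total_items == 0:
        # Nothing to summarize; this also avoids dividing by zero below.
        print("Total rows: 0")
        print("- - - - - - - - - - - - - -")
        return
    min_item_charge = float("inf")
    max_item_charge = float("-inf")
    total_item_charges = 0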
    for row in arr:
        charge = float(row['charges'])
        total_item_charges += charge
        if charge < min_item_charge:
            min_item_charge = charge
        if charge > max_item_charge:
            max_item_charge = charge
    average_item_charge = total_item_charges / total_items
    print("Total rows: {}".format(total_items))
    print("Total charges: {}".format(format_dollar(total_item_charges)))
    print("Minimum charge: {}".format(format_dollar(min_item_charge)))
    print("Average charge: {}".format(format_dollar(average_item_charge)))
    print("Maximum charge: {}".format(format_dollar(max_item_charge)))
    print("- - - - - - - - - - - - - -")

# Iterate through a dictionary to present some stats on it.
def parse_dict(dictionary):
    keys = sorted(dictionary.keys())
    for key in keys:
        print(key)
        value = dictionary[key]
        parse_array(value)
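
Because `parse_array` only prints its results, comparing groups later means reading the output by eye. A minimal sketch of a variant that returns the values instead is shown below; `summarize_charges` and the sample rows are hypothetical and are not part of the original notebook.

```python
def summarize_charges(rows):
    """Return count, total, min, mean, and max of the 'charges' column.

    Hypothetical helper: it returns the values instead of printing them.
    """
    charges = [float(row['charges']) for row in rows]
    if not charges:
        # An empty group has no meaningful stats (and would divide by zero).
        return None
    total = sum(charges)
    return {
        'count': len(charges),
        'total': total,
        'min': min(charges),
        'mean': total / len(charges),
        'max': max(charges),
    }


# Made-up rows for illustration only (not taken from insurance.csv).
sample_rows = [
    {'charges': '1000.50'},
    {'charges': '2500.00'},
    {'charges': '4000.25'},
]
summary = summarize_charges(sample_rows)
print(summary)
```

Returning a dictionary keeps the formatting helpers (`format_dollar`, `format_float`) separate from the calculation, so the same numbers can be printed, compared, or sorted.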

We start by importing the CSV and grouping its rows by sex, smoker status, region, and number of children, while collecting overall totals and min/max values along the way.

In [165]:
import csv

with open('insurance.csv') as insurance_csv:
    # Convert the file to a Dict CSV var.
    insurance_reader = csv.DictReader(insurance_csv)
    
    # Initialize our empty dicts for the various groups.
    Sex_dict = dict()
    Smoker_dict = dict()
    Region_dict = dict()
    Children_dict = dict()
    
    # Initialize starting stats.
    total_rows = 0
    total_charges = 0
    total_age = 0
    total_bmi = 0
    total_children = 0
    min_age = float("inf")
    max_age = 0
    min_bmi = float("inf")
    max_bmi = 0
    max_children = 0
    
    min_charge_row = None
    max_charge_row = None

    # Iterate through our CSV to populate the group sets.
    for row in insurance_reader:
        # Get vars from the values and format them correctly.
        age = int(row['age'])
        bmi = float(row['bmi'])
        children = int(row['children'])
        charges = float(row['charges'])
        
        # Add the values to the appropriate sets.
        add_row_to_dict(row, Sex_dict, 'sex')
        add_row_to_dict(row, Smoker_dict, 'smoker')
        add_row_to_dict(row, Region_dict, 'region')
        add_row_to_dict(row, Children_dict, 'children')
        
        # Gather our wanted stats.
        total_charges += charges
        total_rows += 1
        total_age += age
        total_bmi += bmi
        total_children += children

        # Do comparisons to get our various min/max stats.
        if age < min_age:
            min_age = age
        if age > max_age:
            max_age = age
        if bmi < min_bmi:
            min_bmi = bmi
        if bmi > max_bmi:
            max_bmi = bmi
        
        # If min/max charge rows have not been set already, do so with current row.
        if min_charge_row is None:
            min_charge_row = row
            max_charge_row = row
        else:
            # Do comparisons against current row
            if charges < float(min_charge_row['charges']):
                min_charge_row = row
            if charges > float(max_charge_row['charges']):
                max_charge_row = row


## Preliminary stats

In [114]:
# Gather our average stats.
average_charge = total_charges / total_rows
average_age = total_age / total_rows
average_bmi = total_bmi / total_rows
average_children = total_children / total_rows

# Print messages
print("Total rows: {}".format(total_rows))
print("Total charges: {}".format(format_dollar(total_charges)))
print("Average charge: {}".format(format_dollar(average_charge)))
print("Age range: {} - {}".format(min_age, max_age))
print("Average age: {}".format(format_float(average_age)))
print("BMI range: {} - {}".format(format_float(min_bmi), format_float(max_bmi)))
# Use the computed average here, not `bmi`, which still holds the last row's value.
print("Average BMI: {}".format(format_float(average_bmi)))
print("Average children count: {}".format(format_float(average_children)))
print(person_message(min_charge_row, 'minimum charge'))
print(person_message(max_charge_row, 'maximum charge'))

Total rows: 1338
Total charges: $17,755,824.99
Average charge: $13,270.42
Age range: 18 - 64
Average age: 39.21
BMI range: 15.96 - 53.13
Average BMI: 30.66
Average children count: 1.09
Average children count: 1.09
Person with the minimum charge ($1,121.87) is a 18 male non-smoker from the southeast region with 0 children and a BMI of 23.21.
Person with the maximum charge ($63,770.43) is a 54 female smoker from the southeast region with 0 children and a BMI of 47.41.
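
Averages built from running totals are easy to get subtly wrong, so a cross-check against the standard library's `statistics` module can help. The sketch below assumes the numeric columns have already been collected into lists; the values and the `describe` helper are made up for illustration.

```python
import statistics

# Made-up values for illustration only (not taken from insurance.csv).
ages = [19, 33, 45, 62]
bmis = [21.4, 27.9, 30.1, 35.6]


def describe(values):
    """Return (minimum, mean, maximum) for a non-empty list of numbers."""
    return min(values), statistics.mean(values), max(values)


age_min, age_mean, age_max = describe(ages)
bmi_min, bmi_mean, bmi_max = describe(bmis)
print("Age range: {} - {}, average {:.2f}".format(age_min, age_max, age_mean))
print("BMI range: {:.2f} - {:.2f}, average {:.2f}".format(bmi_min, bmi_max, bmi_mean))
```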


## Stats by Metric
### Region

In [168]:
print("=================")
print("Charges by Region")
print("=================")
parse_dict(Region_dict)

Charges by Region
northeast
Total rows: 324
Total charges: $4,343,668.58
Minimum charge: $1,694.80
Average charge: $13,406.38
Maximum charge: $58,571.07
- - - - - - - - - - - - - -
northwest
Total rows: 325
Total charges: $4,035,712.00
Minimum charge: $1,621.34
Average charge: $12,417.58
Maximum charge: $60,021.40
- - - - - - - - - - - - - -
southeast
Total rows: 364
Total charges: $5,363,689.76
Minimum charge: $1,121.87
Average charge: $14,735.41
Maximum charge: $63,770.43
- - - - - - - - - - - - - -
southwest
Total rows: 325
Total charges: $4,012,754.65
Minimum charge: $1,241.57
Average charge: $12,346.94
Maximum charge: $52,590.83
- - - - - - - - - - - - - -


### Sex

In [169]:
print("==============")
print("Charges by Sex")
print("==============")
parse_dict(Sex_dict)

Charges by Sex
female
Total rows: 662
Total charges: $8,321,061.19
Minimum charge: $1,607.51
Average charge: $12,569.58
Maximum charge: $63,770.43
- - - - - - - - - - - - - -
male
Total rows: 676
Total charges: $9,434,763.80
Minimum charge: $1,121.87
Average charge: $13,956.75
Maximum charge: $62,592.87
- - - - - - - - - - - - - -


### Smoker status

In [170]:
print("========================")
print("Charges by Smoker status")
print("========================")
parse_dict(Smoker_dict)

Charges by Smoker status
no
Total rows: 1064
Total charges: $8,974,061.47
Minimum charge: $1,121.87
Average charge: $8,434.27
Maximum charge: $36,910.61
- - - - - - - - - - - - - -
yes
Total rows: 274
Total charges: $8,781,763.52
Minimum charge: $12,829.46
Average charge: $32,050.23
Maximum charge: $63,770.43
- - - - - - - - - - - - - -


### Number of Children

In [171]:
print("=============================")
print("Charges by number of Children")
print("=============================")
parse_dict(Children_dict)

Charges by number of Children
0
Total rows: 574
Total charges: $7,098,070.00
Minimum charge: $1,121.87
Average charge: $12,365.98
Maximum charge: $63,770.43
- - - - - - - - - - - - - -
1
Total rows: 324
Total charges: $4,124,899.67
Minimum charge: $1,711.03
Average charge: $12,731.17
Maximum charge: $58,571.07
- - - - - - - - - - - - - -
2
Total rows: 240
Total charges: $3,617,655.30
Minimum charge: $2,304.00
Average charge: $15,073.56
Maximum charge: $49,577.66
- - - - - - - - - - - - - -
3
Total rows: 157
Total charges: $2,410,784.98
Minimum charge: $3,443.06
Average charge: $15,355.32
Maximum charge: $60,021.40
- - - - - - - - - - - - - -
4
Total rows: 25
Total charges: $346,266.41
Minimum charge: $4,504.66
Average charge: $13,850.66
Maximum charge: $40,182.25
- - - - - - - - - - - - - -
5
Total rows: 18
Total charges: $158,148.63
Minimum charge: $4,687.80
Average charge: $8,786.04
Maximum charge: $19,023.26
- - - - - - - - - - - - - -


### BMI
For BMI, we'll use [BMI ranges](https://www.cancer.org/cancer/cancer-causes/diet-physical-activity/body-weight-and-cancer-risk/adult-bmi.html) to create static buckets. These ranges are:
- Underweight: less than 18.5
- Normal weight: 18.5 to 24.9
- Overweight: 25 to 29.9
- Obese: 30 or more

In [172]:
import math

# Bucket properties:
# - 'start' - inclusive
# - 'end' - exclusive 
# - 'items' - array of data items in said bucket
bmi_buckets = [
    # Underweight
    {
        'label': 'Underweight',
        'start': 0.0,
        'end': 18.5,
        'items': [],
    },
    # Normal weight
    {
        'label': 'Normal weight',
        'start': 18.5,
        'end': 25.0,
        'items': [],
    },
    # Overweight
    {
        'label': 'Overweight',
        'start': 25.0,
        'end': 30.0,
        'items': [],
    },
    # Obese
    {
        'label': 'Obese',
        'start': 30.0,
        'end': float("inf"),
        'items': [],
    }
]
    
# Re-open the CSV file, re-iterate through the CSV reader rows, and place rows into appropriate buckets.
with open('insurance.csv') as insurance_csv:
    insurance_reader = csv.DictReader(insurance_csv)
    for row in insurance_reader:
        row_bmi = float(row['bmi'])
        # Place the row in the first bucket whose exclusive 'end' exceeds its BMI.
        for bucket in bmi_buckets:
            if row_bmi < bucket['end']:
                bucket['items'].append(row)
                break

# Print our results
print("====================")
print("Charges by BMI Range")
print("====================")
for bucket in bmi_buckets:
    print("{} ({} - {})".format(bucket['label'], bucket['start'], bucket['end']))
    parse_array(bucket['items'])
    
    

Charges by BMI Range
Underweight (0.0 - 18.5)
Total rows: 20
Total charges: $177,044.01
Minimum charge: $1,621.34
Average charge: $8,852.20
Maximum charge: $32,734.19
- - - - - - - - - - - - - -
Normal weight (18.5 - 25.0)
Total rows: 225
Total charges: $2,342,100.98
Minimum charge: $1,121.87
Average charge: $10,409.34
Maximum charge: $35,069.37
- - - - - - - - - - - - - -
Overweight (25.0 - 30.0)
Total rows: 386
Total charges: $4,241,178.82
Minimum charge: $1,252.41
Average charge: $10,987.51
Maximum charge: $38,245.59
- - - - - - - - - - - - - -
Obese (30.0 - inf)
Total rows: 707
Total charges: $10,995,501.18
Minimum charge: $1,131.51
Average charge: $15,552.34
Maximum charge: $63,770.43
- - - - - - - - - - - - - -
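
The bucket lookup above scans the buckets in order. As an alternative sketch, the same classification can be done with the standard library's `bisect` module. The names `bmi_upper_bounds`, `bmi_labels`, and `bmi_label` are introduced here for illustration; the boundaries are the ones listed at the start of this section.

```python
import bisect

# Exclusive upper bounds of every bucket except the last, open-ended one.
bmi_upper_bounds = [18.5, 25.0, 30.0]
bmi_labels = ['Underweight', 'Normal weight', 'Overweight', 'Obese']


def bmi_label(bmi):
    """Return the bucket label for a BMI value.

    bisect_right counts the bounds that are <= bmi, so a value equal to a
    bound (for example 25.0) falls into the bucket that starts at that bound.
    """
    return bmi_labels[bisect.bisect_right(bmi_upper_bounds, bmi)]


for value in (17.0, 18.5, 24.9, 25.0, 29.99, 30.0, 47.41):
    print(value, bmi_label(value))
```

This keeps the same convention as the buckets above: each bucket's start is inclusive and its end is exclusive.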


### Age
We do a similar thing for age, but for the sake of practice, let's create dynamic buckets of equal width. In a real analysis, we would more likely define discrete groups by age range (e.g. young adult, middle-aged, elderly), as we did for BMI.

In [173]:
# Let's do the same for Age, but let's use dynamic buckets for fun.
num_of_buckets = 5
age_range = max_age - min_age
age_interval = math.ceil(age_range / num_of_buckets)

# Initialize and construct the buckets.
age_buckets = []
for i in range(num_of_buckets):
    age_start = min_age + (i * age_interval)
    age_buckets.append({
        'start': age_start,
        'end': age_start + age_interval,
        'items': [],
    })
    
# Re-open the CSV file, re-iterate through the CSV reader rows, and place rows into appropriate buckets.
with open('insurance.csv') as insurance_csv:
    insurance_reader = csv.DictReader(insurance_csv)
    for row in insurance_reader:
        row_age = int(row['age'])
        # Place the row in the first bucket whose exclusive 'end' exceeds its age.
        for bucket in age_buckets:
            if row_age < bucket['end']:
                bucket['items'].append(row)
                break

# Print our results
print("=====================")
print("Charges by Age ranges")
print("=====================")
for bucket in age_buckets:
    print("Age {} - {}".format(bucket['start'], bucket['end'] - 1))
    parse_array(bucket['items'])

Charges by Age ranges
Age 18 - 27
Total rows: 362
Total charges: $3,293,545.59
Minimum charge: $1,121.87
Average charge: $9,098.19
Maximum charge: $44,501.40
- - - - - - - - - - - - - -
Age 28 - 37
Total rows: 262
Total charges: $3,055,394.64
Minimum charge: $2,689.50
Average charge: $11,661.81
Maximum charge: $58,571.07
- - - - - - - - - - - - - -
Age 38 - 47
Total rows: 272
Total charges: $3,734,571.52
Minimum charge: $5,383.54
Average charge: $13,730.04
Maximum charge: $62,592.87
- - - - - - - - - - - - - -
Age 48 - 57
Total rows: 278
Total charges: $4,430,668.80
Minimum charge: $7,789.64
Average charge: $15,937.66
Maximum charge: $63,770.43
- - - - - - - - - - - - - -
Age 58 - 67
Total rows: 164
Total charges: $3,241,644.44
Minimum charge: $11,345.52
Average charge: $19,766.12
Maximum charge: $52,590.83
- - - - - - - - - - - - - -
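
Because the dynamic buckets all have the same width, the bucket for a given age can also be computed directly with integer division instead of a scan. The sketch below assumes `max_age > min_age`; `age_bucket_index` is a hypothetical helper, not part of the original notebook.

```python
import math


def age_bucket_index(age, min_age, max_age, num_of_buckets):
    """Return the index of the equal-width bucket that contains `age`."""
    interval = math.ceil((max_age - min_age) / num_of_buckets)
    index = (age - min_age) // interval
    # Clamp so that max_age always lands in the last bucket.
    return min(index, num_of_buckets - 1)


# Age range 18 - 64 as reported in the preliminary stats, split into 5 buckets.
for age in (18, 27, 28, 47, 58, 64):
    print(age, age_bucket_index(age, 18, 64, 5))
```

The clamp matters when the age range divides evenly by the number of buckets: the maximum age then equals the last bucket's exclusive end, and a scan that only tests `age < end` would not place that row in any bucket.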


## Conclusion
Based on our findings, smoker status appears to be the strongest indicator of a high insurance cost: smokers average \$32,050.23, compared with \$8,434.27 for non-smokers. Age comes next, with the average cost rising steadily across the age ranges. The average cost also rises with the number of children, but only up to 3 children. After that, it goes _down_, although the groups with 4 and 5 children are small (25 and 18 rows).

Highest average insurance cost by metric:
- **Age range**: 58 - 67 (\$19,766.12)
- **BMI range**: Obese (30.0+) (\$15,552.34)
- **Number of children**: 3 (\$15,355.32)
- **Region**: southeast (\$14,735.41)
- **Sex**: male (\$13,956.75)
- **Smoker status**: yes (\$32,050.23)
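
To put the smoker figure in perspective, the short calculation below compares the averages reported in the outputs earlier in this notebook. The variable names are introduced here for illustration.

```python
# Average charges copied from the outputs earlier in this notebook.
smoker_average = 32050.23
non_smoker_average = 8434.27
overall_average = 13270.42

smoker_ratio = smoker_average / non_smoker_average
smoker_vs_overall = smoker_average / overall_average
print("Smokers vs. non-smokers: {:.2f}x".format(smoker_ratio))
print("Smokers vs. overall average: {:.2f}x".format(smoker_vs_overall))
```

In this dataset the average smoker is charged about 3.8 times as much as the average non-smoker.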

## TODO
A number of improvements and extensions could be made to this project, such as:
- Use libraries (NumPy, pandas) to do the statistical calculations for us.
- Present the data in a more pleasing way, e.g. tables or graphs.
- Compare _within_ groups, such as male smokers in the northeast region.
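
As a starting point for the within-group comparison, rows can be keyed by a tuple of several column values. The sketch below uses only the standard library; `average_charge_by` is a hypothetical helper and the sample rows are made up.

```python
from collections import defaultdict

# Made-up rows for illustration only (not taken from insurance.csv).
sample_rows = [
    {'sex': 'male', 'smoker': 'yes', 'region': 'northeast', 'charges': '30000.00'},
    {'sex': 'male', 'smoker': 'yes', 'region': 'northeast', 'charges': '20000.00'},
    {'sex': 'male', 'smoker': 'no', 'region': 'northeast', 'charges': '5000.00'},
    {'sex': 'female', 'smoker': 'yes', 'region': 'southwest', 'charges': '28000.00'},
]


def average_charge_by(rows, columns):
    """Return {tuple of column values: average charge} for the given columns."""
    charges_by_key = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in columns)
        charges_by_key[key].append(float(row['charges']))
    return {key: sum(vals) / len(vals) for key, vals in charges_by_key.items()}


averages = average_charge_by(sample_rows, ['sex', 'smoker', 'region'])
for key in sorted(averages):
    print(key, "${:,.2f}".format(averages[key]))
```

Passing a different list of columns, such as `['smoker', 'children']`, gives other combinations without any further code changes.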