# U.S. Medical Insurance Costs

## Overview
The provided dataset (`insurance.csv`) contains information on individuals purchasing insurance through company X. It includes the following fields:
- Age
- Sex (recorded as binary: male/female)
- BMI
- Number of Children
- Smoking/Nonsmoking status
- Region
- Insurance Cost

I've come up with the following five questions to explore in the data:
1. How does the average BMI compare across the different regions? Where do the healthiest men and women live?
2. ~~Which region smokes the most? How does their average BMI relate to the number of smokers?~~ Which region pays the lowest cost on average? How does it compare to the national average?
3. ~~How many smokers have children?~~  Rank the attributes from those that affect costs the most, to those that affect costs the least.
4. Which is the youngest region?
5. What are some of the issues with this data set? What information could be gathered to improve analysis?


## Initialization

In [1]:
import csv



## Question One:
### How does the average bmi compare across the different regions? Where do the healthiest men and women live?

First, I'll break up the data, compiling a list of BMI values for each region:

In [3]:
sw_bmi_list = []
nw_bmi_list = []
se_bmi_list = []
ne_bmi_list = []

with open('insurance.csv') as ins_csv:
    ins_reader = csv.DictReader(ins_csv)
    for row in ins_reader:
        if row['region'] == 'southwest':
            sw_bmi_list.append(row['bmi'])
        elif row['region'] == 'northwest':
            nw_bmi_list.append(row['bmi'])
        elif row['region'] == 'southeast':
            se_bmi_list.append(row['bmi'])
        elif row['region'] == 'northeast':
            ne_bmi_list.append(row['bmi'])
        else:
            pass
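
The four parallel lists and the `if`/`elif` chain above could also be collapsed into a single dictionary keyed by region. This is only a minimal sketch, assuming the same `region` and `bmi` column names. The sample rows are made up so the block runs without `insurance.csv`, and `bmi_by_region` is a hypothetical name:

```python
import csv
import io
from collections import defaultdict

# Made-up rows standing in for insurance.csv (not real data)
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "25,female,24.5,0,no,southwest,3000.00\n"
    "40,male,31.5,2,yes,southeast,21000.00\n"
    "33,male,29.0,1,no,southeast,5000.00\n"
)

# Each region name maps to the list of BMI values seen for that region
bmi_by_region = defaultdict(list)
for row in csv.DictReader(sample_csv):
    bmi_by_region[row['region']].append(float(row['bmi']))

print(dict(bmi_by_region))
```

One trade-off: this version picks up any region name found in the file, whereas the `if`/`elif` chain silently skips rows with an unexpected region.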

Next, I'll create a function to get the average bmi:

In [4]:
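def bmi_avg(bmi_list):
    # bmi_list holds BMI values as strings read from the CSV
    # (named bmi_list so it doesn't shadow the built-in list type)
    total_bmi = 0.0
    for i in bmi_list:
        total_bmi += float(i)
    # Mean of the values, rounded to two decimal places
    avg_bmi = round(total_bmi/len(bmi_list), 2)

    return avg_bmi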
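
As a cross-check, the same average can be computed with the standard library's `statistics.mean`. A small sketch with made-up inputs; `bmi_avg_alt` is a hypothetical name, not part of the notebook:

```python
from statistics import mean

def bmi_avg_alt(bmi_strings):
    # Convert the CSV strings to floats, average them, round to 2 places
    return round(mean(float(b) for b in bmi_strings), 2)

# Made-up values, not taken from insurance.csv
print(bmi_avg_alt(['20.0', '25.0', '30.0']))
```

Both versions fail on an empty list: the hand-written loop raises `ZeroDivisionError`, while `statistics.mean` raises `StatisticsError`.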

Finally, let's get our averages for each region:

In [5]:
avg_bmi_dict = {
    'southwest': bmi_avg(sw_bmi_list),
    'northwest': bmi_avg(nw_bmi_list),
    'southeast': bmi_avg(se_bmi_list),
    'northeast': bmi_avg(ne_bmi_list),
    # Unweighted mean of the four regional averages (regions are not weighted by row count)
    'entire us': (
        bmi_avg(sw_bmi_list)+bmi_avg(nw_bmi_list)+bmi_avg(se_bmi_list)+bmi_avg(ne_bmi_list))/4
}
                
for key in avg_bmi_dict:
    print("The average bmi for the {key} is: {value}".format(key=key, value=avg_bmi_dict[key]))
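
Averaging the four regional averages gives every region equal weight, no matter how many rows it has. If the regions differ in size, pooling all rows gives a different number. A small sketch with made-up values (not from `insurance.csv`) shows the difference:

```python
# Made-up BMI values for two regions of different sizes
region_a = [20.0, 22.0]              # 2 people, mean 21.0
region_b = [30.0, 30.0, 30.0, 30.0]  # 4 people, mean 30.0

def mean(values):
    return sum(values) / len(values)

# Mean of the regional means: each region counts equally
mean_of_means = (mean(region_a) + mean(region_b)) / 2

# Pooled mean: each person counts equally
pooled_mean = mean(region_a + region_b)

print(mean_of_means)  # 25.5
print(pooled_mean)    # 27.0
```

Which figure is the right "national average" depends on the question. The pooled mean describes the people in the dataset, while the mean of means describes the typical region.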

### What does this tell us?
First, let's acknowledge that BMI is an inherently flawed measure of health. There is a litany of examples of why, but explaining them is outside this project's scope. With that out of the way, the Northeast has the lowest average BMI of the four regions (29.17), narrowly ahead of the Northwest (29.2), while the Southeast has the highest (33.36). By this measure, the Northeast is the healthiest region and the Southeast the least healthy. It would be interesting to see how each region's GDP stacks up against its BMI, as the SE is likely the most economically depressed region of the US.

### What else can we do with this data?
Getting the average bmi across each region is a bit broad. Now that I've worked with the data a bit, I'd like to find data on more specific demographics.

We can create a class for each region that would break bmi data down by sex, age group, smoking status, and how many children they have:

In [1]:
class BmiBreakdown:
    def __init__(self, region_dict):
        # region_dict maps a row id to one CSV row (column name -> string value)
        self.under_30_bmi_lst = []
        self.over_29_bmi_lst = []
        self.female_bmi_lst = []
        self.male_bmi_lst = []
        #Parents will be segregated by parenthood status as well as number of children (1, 2, and >=3)
        self.nonparent_bmi_lst = []
        self.parent_bmi_lst = []
        self.parent_1_bmi_lst = []
        self.parent_2_bmi_lst = []
        self.parent_3up_bmi_lst = []
        self.smoker_bmi_lst = []
        self.nonsmoker_bmi_lst = []
        for row in region_dict.values():
            bmi = row['bmi']
            # Age group
            if int(row['age']) < 30:
                self.under_30_bmi_lst.append(bmi)
            else:
                self.over_29_bmi_lst.append(bmi)
            # Sex
            if row['sex'] == 'female':
                self.female_bmi_lst.append(bmi)
            elif row['sex'] == 'male':
                self.male_bmi_lst.append(bmi)
            # Parenthood status and number of children
            children = int(row['children'])
            if children > 0:
                self.parent_bmi_lst.append(bmi)
                if children == 1:
                    self.parent_1_bmi_lst.append(bmi)
                elif children == 2:
                    self.parent_2_bmi_lst.append(bmi)
                else:
                    self.parent_3up_bmi_lst.append(bmi)
            else:
                self.nonparent_bmi_lst.append(bmi)
            # Smoking status
            if row['smoker'] == 'yes':
                self.smoker_bmi_lst.append(bmi)
            else:
                self.nonsmoker_bmi_lst.append(bmi)

    @staticmethod
    def _avg(bmi_lst):
        # Shared helper: mean of a list of BMI strings, rounded to two decimal places
        total_bmi = 0.0
        for i in bmi_lst:
            total_bmi += float(i)
        return round(total_bmi / len(bmi_lst), 2)

    #Methods based on age:
    def avg_under_30(self):
        return self._avg(self.under_30_bmi_lst)

    def avg_over_29(self):
        return self._avg(self.over_29_bmi_lst)

    #Methods based on sex:
    def avg_female(self):
        return self._avg(self.female_bmi_lst)

    def avg_male(self):
        return self._avg(self.male_bmi_lst)

    #Methods based on parental status:
    def avg_nonparent(self):
        return self._avg(self.nonparent_bmi_lst)

    def avg_parent(self):
        return self._avg(self.parent_bmi_lst)

    #Methods based on Smoking status:
    def avg_smoker(self):
        return self._avg(self.smoker_bmi_lst)

    def avg_nonsmoker(self):
        return self._avg(self.nonsmoker_bmi_lst)

### Dictionary Creation
The constructor of the class I've created requires a dictionary as an argument, so let's break the data down into four dictionaries, one for each region:

In [7]:
sw_dict = {}
nw_dict = {}
se_dict = {}
ne_dict = {}

with open('insurance.csv') as ins_csv:
    primary_key = 0
    ins_reader = csv.DictReader(ins_csv)
    for row in ins_reader:
        if row['region'] == 'southwest':
            sw_dict[primary_key] = row
            primary_key += 1
        elif row['region'] == 'northwest':
            nw_dict[primary_key] = row
            primary_key += 1
        elif row['region'] == 'southeast':
            se_dict[primary_key] = row
            primary_key += 1
        elif row['region'] == 'northeast':
            ne_dict[primary_key] = row
            primary_key += 1
        else:
            pass
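
For reference, each of these dictionaries maps an integer key to one CSV row, and each row is itself a dictionary of column name to string. The sketch below uses made-up rows in place of `insurance.csv` to show that shape. `region_dicts` is a hypothetical name, and `enumerate` stands in for the manual `primary_key` counter (the two agree as long as every row has one of the four known regions):

```python
import csv
import io

# Made-up rows standing in for insurance.csv (not real data)
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "25,female,24.5,0,no,southwest,3000.00\n"
    "40,male,31.5,2,yes,southeast,21000.00\n"
)

# One inner dictionary per region, keyed by row number
region_dicts = {'southwest': {}, 'northwest': {}, 'southeast': {}, 'northeast': {}}
for primary_key, row in enumerate(csv.DictReader(sample_csv)):
    if row['region'] in region_dicts:
        region_dicts[row['region']][primary_key] = row

print(region_dicts['southeast'])
# Every value is still a string, so numeric fields need int() or float()
print(type(region_dicts['southeast'][1]['bmi']))
```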

## Using the BmiBreakdown class

Let's probe the data a bit and see what we come up with. First, create our objects:

In [8]:
sw_bmi_breakdown = BmiBreakdown(sw_dict)
nw_bmi_breakdown = BmiBreakdown(nw_dict)
se_bmi_breakdown = BmiBreakdown(se_dict)
ne_bmi_breakdown = BmiBreakdown(ne_dict)



### Question 1.a:
How fit is each region's under-30 population? How does it compare to those 30 and older? How does each demographic compare to the mean BMI of the region?

In [10]:
under_30_avg_bmi_dict = {
    'southwest': sw_bmi_breakdown.avg_under_30(),
    'northwest': nw_bmi_breakdown.avg_under_30(),
    'southeast': se_bmi_breakdown.avg_under_30(),
    'northeast': ne_bmi_breakdown.avg_under_30()
}

over_29_avg_bmi_dict = {
    'southwest': sw_bmi_breakdown.avg_over_29(),
    'northwest': nw_bmi_breakdown.avg_over_29(),
    'southeast': se_bmi_breakdown.avg_over_29(),
    'northeast': ne_bmi_breakdown.avg_over_29()
}

for key in under_30_avg_bmi_dict:
    print("The average bmi of people under 30 in the " + key + " is: " + str(under_30_avg_bmi_dict[key]))
    print("The average bmi of people over the age of 29 in the " + key + " is: " + str(over_29_avg_bmi_dict[key]))
    print("The average bmi of the region is: " + str(avg_bmi_dict[key]))
    print("The average bmi of the entire US is: " + str(avg_bmi_dict['entire us']))
    print('')


The average bmi of people under 30 in the southwest is: 29.08
The average bmi of people over the age of 29 in the southwest is: 31.26
The average bmi of the region is: 30.6
The average bmi of the entire US is: 30.5825

The average bmi of people under 30 in the northwest is: 28.54
The average bmi of people over the age of 29 in the northwest is: 29.5
The average bmi of the region is: 29.2
The average bmi of the entire US is: 30.5825

The average bmi of people under 30 in the southeast is: 33.3
The average bmi of people over the age of 29 in the southeast is: 33.38
The average bmi of the region is: 33.36
The average bmi of the entire US is: 30.5825

The average bmi of people under 30 in the northeast is: 28.0
The average bmi of people over the age of 29 in the northeast is: 29.71
The average bmi of the region is: 29.17
The average bmi of the entire US is: 30.5825



#### A note: 
- After completing question 1.a, I think my outputs are a bit verbose and difficult to immediately grasp. Going forward I'm going to assign a *distance from local mean* and *distance to national mean* variable to each demographic.
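
One way those two distances might be computed is as signed differences, where a negative value means the group sits below the mean. This is only a sketch: `mean_distances` is a hypothetical helper, and the inputs are the southwest under-30 figures printed above.

```python
def mean_distances(group_avg, region_avg, national_avg):
    """Return (distance from regional mean, distance from national mean)."""
    local_diff = round(group_avg - region_avg, 2)
    national_diff = round(group_avg - national_avg, 2)
    return local_diff, national_diff

# Southwest under-30 average, southwest average, and the 'entire us' figure
local_diff, national_diff = mean_distances(29.08, 30.6, 30.5825)
print(local_diff, national_diff)
```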

### Takeaways:
To be written later.

### Question 1.b:
Not decided yet. In the meantime, we can reuse the `BmiBreakdown` class to find out how many smokers there are in each region.

In [15]:
print(len(sw_bmi_breakdown.smoker_bmi_lst))
print(len(sw_bmi_breakdown.nonsmoker_bmi_lst))

58
267
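
Raw counts are hard to compare across regions of different sizes, so a percentage may be more useful. A quick sketch using the two southwest counts printed above (the variable names are hypothetical):

```python
# Southwest counts from the output above
smokers = 58
nonsmokers = 267

total = smokers + nonsmokers
# Share of smokers as a percentage, rounded to two decimal places
smoker_share = round(100 * smokers / total, 2)

print(total)
print(smoker_share)
```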
