# U.S. Medical Insurance Costs

## MEDICAL INSURANCE COSTS

**Goals during analysis**

* Find the factor that has the greatest affect on insurance cost.
* Find the difference in affect that each factor has in cost and then decide if it is more effective to focus on one factor in order to lower cost, or on several.

**How will I analyze the data in order to meet these goals?**

* I will get total cost of insurance, as well as the average cost of insurance.
* I will get the total cost for all the women vs the total cost for all the men and compare. I will also look at the total number of children between the 2 groups of sexes as well as the amount of smokers in both. Does one group have a much larger amount of children? Does one group smoke more than the other?
* What percentage of the total cost do the men make up? What percentage of the total cost do the women make up? Is this due to another factor such as children or smoking?
* Does the average cost differ greatly between regions?


In [125]:
import csv
bmis = []
ages = []
sexs = []
children = []
smoker = []
region = []
charges = []
with open("insurance.csv") as insurance_data:
    insurance_csv = csv.DictReader(insurance_data)
    for row in insurance_csv:
        data_row = row
        bmis.append(data_row["bmi"])
        ages.append(data_row["age"])
        sexs.append(data_row["sex"])
        children.append(data_row["children"])
        smoker.append(data_row["smoker"])
        region.append(data_row["region"])
        charges.append(float(data_row["charges"]))
        
def total_cost(costs):
    total = 0
    for i in costs:
        total+=i
    return total

ordered_charges = sorted(charges)

total_charges = round(total_cost(ordered_charges), 2)
average_cost = round(total_charges/len(ordered_charges), 2)
print("The total cost of insurance is " + str(total_charges))
print("The average cost of insurance is " + str(average_cost))
print("The highest cost of insurance is " + str(round(ordered_charges[-1], 2)))
print("The lowest cost of insurance is " + str(round(ordered_charges[0], 2)))
print()

def cost_by_sex(costs):
    men_cost = 0
    women_cost = 0
    counter = 0
    for i in costs:
        if sexs[counter] == "male":
            men_cost+=i
            counter+=1
        else:
            women_cost+=i
            counter+=1
    return men_cost, women_cost
total_costs_men, total_costs_women = cost_by_sex(charges)
print("The total insurance cost for men is " + str(round(total_costs_men, 2)))
print("The total insurance cost for women is " + str(round(total_costs_women, 2)))

men_amt = 0
women_amt = 0

for i in sexs:
    if i == "male":
        men_amt+=1
    else:
        women_amt+=1
print("The data surveys " + str(men_amt) + " men and " + str(women_amt) + " women.")

def children_by_sex():
    men_children_total = 0
    women_children_total = 0
    counter = 0
    for i in sexs:
        if i == "male":
            men_children_total+=int(children[counter])
            counter+=1
        else:
            women_children_total+=int(children[counter])
            counter+=1
    return men_children_total, women_children_total
male_total_kids, female_total_kids = children_by_sex()
print("The total amount of children for men is " + str(male_total_kids))
print("The total amount of children for women is " + str(female_total_kids))

def smokers_by_sex():
    women_smokers = 0
    male_smokers = 0
    for i in sexs:
        if i == "male":
            male_smokers+=1
        else:
            women_smokers+=1
    return women_smokers, male_smokers
wsmokers, msmokers = smokers_by_sex()
print("The total amount of male smokers is " + str(msmokers))
print("The total amount of female smokers is " + str(wsmokers))

def average_bmi():
    male_avg = 0
    female_avg = 0
    counter = 0
    mc = 0
    fc = 0
    for i in sexs:
        if i == "male":
            male_avg += float(bmis[counter])
            counter+=1
            mc+=1
        else:
            female_avg += float(bmis[counter])
            counter+=1
            fc+=1
    male_avg = male_avg/mc
    female_avg = female_avg/fc
    return male_avg, female_avg
m_avg, f_avg = average_bmi()
print("The average male BMI is " + str(round(m_avg, 2)))
print("The average female BMI is " + str(round(f_avg, 2)))


The total cost of insurance is 17755824.99
The average cost of insurance is 13270.42
The highest cost of insurance is 63770.43
The lowest cost of insurance is 1121.87

The total insurance cost for men is 9434763.8
The total insurance cost for women is 8321061.19
The data surveys 676 men and 662 women.
The total amount of children for men is 754
The total amount of children for women is 711
The total amount of male smokers is 676
The total amount of female smokers is 662
The average male BMI is 30.94
The average female BMI is 30.38


## Initial Data Analysis

It seems that the data collected has very similar averages between genders in all data points. Men do pay slightly more, however they are also the gender that was surveyed most by a small margin. Because of this men also lead in all data points including more children, more smokers, and an insignificantly higher average BMI. 

It is difficult to draw significant conclusions from this data so far, except that gender does not play a significant factor in cost of insurance. My next step is to identify the information of the 10 highest paying people and the 10 lowest paying people and extract what data may contribute to those costs.

In [151]:
def highest_lowest_cost(x,y):
    high_cost = {}
    low_cost = {}
    for i in range(x,y):
        index = charges.index(ordered_charges[-i])
        index2 = charges.index(ordered_charges[i-1])
        high_cost.update({i: {"Cost": round(charges[index], 2), "BMI": bmis[index], "Sex": sexs[index], "Children": children[index], "Smoker": smoker[index], "Region": region[index], "Age": ages[index]}})
        low_cost.update({i: {"Cost": round(charges[index2], 2), "BMI": bmis[index2], "Sex": sexs[index2], "Children": children[index2], "Smoker": smoker[index2], "Region": region[index2], "Age": ages[index2]}})
    return high_cost, low_cost
ten_high_cost, ten_low_cost = highest_lowest_cost(1,11)
for i in range(1,11):
    print("The person with the number " + str(i) + " highest cost of " + str(ten_high_cost[i].get("Cost")) + " is " + str(ten_high_cost[i].get("Age")) + " years old. They have " + str(ten_high_cost[i].get("Children")) + " children. They are " + str(ten_high_cost[i].get("Sex")) + " and have a BMI of " + str(ten_high_cost[i].get("BMI")) + ". Do they smoke? " + ten_high_cost[i].get("Smoker"))

print("")
for i in range(1,11):
    print("The person with the number " + str(i) + " lowest cost of " + str(ten_low_cost[i].get("Cost")) + " is " + str(ten_low_cost[i].get("Age")) + " years old. They have " + str(ten_low_cost[i].get("Children")) + " children. They are " + str(ten_low_cost[i].get("Sex")) + " and have a BMI of " + str(ten_low_cost[i].get("BMI")) + ". Do they smoke? " + ten_low_cost[i].get("Smoker"))


The person with the number 1 highest cost of 63770.43 is 54 years old. They have 0 children. They are female and have a BMI of 47.41. Do they smoke? yes
The person with the number 2 highest cost of 62592.87 is 45 years old. They have 0 children. They are male and have a BMI of 30.36. Do they smoke? yes
The person with the number 3 highest cost of 60021.4 is 52 years old. They have 3 children. They are male and have a BMI of 34.485. Do they smoke? yes
The person with the number 4 highest cost of 58571.07 is 31 years old. They have 1 children. They are female and have a BMI of 38.095. Do they smoke? yes
The person with the number 5 highest cost of 55135.4 is 33 years old. They have 0 children. They are female and have a BMI of 35.53. Do they smoke? yes
The person with the number 6 highest cost of 52590.83 is 60 years old. They have 0 children. They are male and have a BMI of 32.8. Do they smoke? yes
The person with the number 7 highest cost of 51194.56 is 28 years old. They have 1 childr

## Secondary Data Analysis

Understandably, 10 data points from the highest paying and lowest paying individuals can only allow me to form very broad conclusions, however it did give me some insight into the larger data. 

* In the top 10 highest paying individuals, all were smokers.
* In the top 10 lowest paying individuals, none were smokers
* In the top 10 lowest paying individuals, none had children. 
    * This could be insignificant, as some of the top 10 highest payers also had no children.

So far it seems that smoking is the factor that most influences cost, however, I haven't begun looking at BMI averages as 10 people on both ends of the data set wouldn't be enough to form a conclusion. My next goal is to calculate the average of the top 100 and the bottom 100.

In [182]:
tophundred, bothundred = highest_lowest_cost(1,101)
def average_bmi(dictionary):
    average = 0.0
    for i in dictionary.values():
       average+=float(i.get("BMI"))
    average = average/100
    return average
print("The average bmi for the top 100 paying insured is " + str(average_bmi(tophundred)))
print("The average bmi for the bottom 100 paying insured is " + str(average_bmi(bothundred)))

def total_smokers(dictionary):
    smoke = 0
    for i in dictionary.values():
        if i.get("Smoker") == "yes":
            smoke+=1
    return smoke
print("Out of the 100 people in the highest paying insured, " + str(total_smokers(tophundred)) + " of them smoke.")
print("Out of the 100 people in the lowest paying insured, " + str(total_smokers(bothundred)) + " of them smoke.")

The average bmi for the top 100 paying insured is 36.4673
The average bmi for the bottom 100 paying insured is 30.014950000000013
Out of the 100 people in the highest paying insured, 100 of them smoke.
Out of the 100 people in the lowest paying insured, 0 of them smoke.


## Final Conclusion

From the data gathered so far, it seems that the single factor that plays the largest role in cost of insurance is whether or not the individual smokes. This is shown by the fact that the top 100 highest paying individuals are all smokers, and the bottom 100 lowest paying individuals are not.

A higher BMI also seems to a factor for higher paying individuals. (This is only based on an average of the top and bottom 100.)