# U.S. Medical Insurance Costs

This project uses a dataset of medical insurance costs and demographic information to calculate summary statistics for the cost of medical insurance in the United States based on three criteria: location, age, and whether an individual smokes.  
The dataset defines location to be one of four regions: northeast, northwest, southeast, or southwest.  
    
All dataset columns are imported into the project, but not all are currently used in the analysis portion.  
The columns not currently being used are:  
gender  
bmi  
number_of_children

Prior to working with this dataset in Python, Excel was used to:  
Rename a few columns for clarity  
Remove one duplicate record  
Check for missing values  
Check for consistent column formatting  
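
The duplicate and missing-value checks listed above were done in Excel. For reference, the same checks could be sketched in plain Python without imports. The function name `check_rows` and the sample lines are hypothetical; the sample values are made up and are not taken from `insurance.csv`.

```python
def check_rows(lines, expected_fields=7):
    """Return (duplicates, incomplete) as lists of 1-based data-row numbers."""
    seen = set()
    duplicates = []
    incomplete = []
    for number, line in enumerate(lines, start=1):
        fields = [field.strip() for field in line.split(",")]
        # A row is incomplete if a field is blank or the field count is off.
        if len(fields) != expected_fields or "" in fields:
            incomplete.append(number)
        # A row is a duplicate if an identical row was already seen.
        key = tuple(fields)
        if key in seen:
            duplicates.append(number)
        seen.add(key)
    return duplicates, incomplete


# Made-up sample rows in the column order used by this project.
sample = [
    "25,male,24.5,0,no,northwest,3000.00\n",
    "41,female,30.1,2,yes,southeast,25000.00\n",
    "25,male,24.5,0,no,northwest,3000.00\n",  # exact repeat of row 1
    "33,female,,1,no,northeast,5000.00\n",    # bmi is missing
]
duplicates, incomplete = check_rows(sample)
print("Duplicate rows:", duplicates)
print("Incomplete rows:", incomplete)
```

This only flags exact duplicates and blank fields; it does not check value ranges or spelling variants.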
  
The original dataset can be found [here](https://www.kaggle.com/datasets/mirichoi0218/insurance).  
  
Finally, the main purpose of this project is to demonstrate general Python fluency, not to conduct a detailed analysis. For this reason, no library imports are used.

In [1]:
# To begin, an empty list is initialized for each column in the dataset, insurance.csv. The dataset is then
# read line by line and each list is populated with its corresponding values.

age_column = []
gender_column = []
bmi_column = []
number_of_children_column = []
smokes_column = []
region_column = []
insurance_cost_column = []

with open('insurance.csv') as insurance:
    next(insurance)  # Skip the header row.
    for row in insurance:
        # Strip the trailing newline, then split the row into its seven fields.
        row = row.strip().split(',')
        age_column.append(int(row[0]))
        gender_column.append(row[1])
        bmi_column.append(float(row[2]))
        number_of_children_column.append(int(row[3]))
        smokes_column.append(row[4])
        region_column.append(row[5])
        insurance_cost_column.append(float(row[6]))
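
To make the parsing step concrete, here is a minimal sketch of how a single line of the file is split and converted. The helper `parse_row` and the sample line are hypothetical (the values are illustrative, not copied from the file); the column order follows the loop above.

```python
def parse_row(line):
    """Split one comma-separated line into typed fields."""
    # strip() removes the trailing newline that file iteration leaves on each line.
    fields = line.strip().split(",")
    return {
        "age": int(fields[0]),
        "gender": fields[1],
        "bmi": float(fields[2]),
        "number_of_children": int(fields[3]),
        "smokes": fields[4],
        "region": fields[5],
        "insurance_cost": float(fields[6]),
    }


# Illustrative sample line, not a row from insurance.csv.
sample_line = "45,female,28.3,2,no,southeast,8500.75\n"
record = parse_row(sample_line)
print(record["age"], record["region"], record["insurance_cost"])
```

Splitting on commas works here because no field contains a comma; a file with quoted fields that include commas would need a real CSV parser.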

In [2]:
# Average age of individuals in the dataset.
print("Average age of individual with medical insurance:", round(sum(age_column) / len(age_column)))

Average age of individual with medical insurance: 39


In [3]:
# Average cost of medical insurance for individuals in the dataset.
print("Average cost of medical insurance:", "$" + str(round(sum(insurance_cost_column) / len(insurance_cost_column), 2)))

Average cost of medical insurance: $13279.12
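
The cells in this notebook repeat the `sum(...) / len(...)` pattern, which raises `ZeroDivisionError` if a list happens to be empty. One possible refactoring is a small helper that guards against that case. The name `average` is hypothetical and is not part of the notebook.

```python
def average(values, digits=2):
    """Return the rounded mean of values, or None if the list is empty."""
    if not values:
        return None
    return round(sum(values) / len(values), digits)


# Made-up numbers to show both branches.
print(average([100.0, 200.0, 300.0]))
print(average([]))
```

Returning `None` for an empty group lets the caller decide how to report it instead of stopping the whole cell.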


In [4]:
# Now we will calculate the average cost of medical insurance for different age groups. First, five empty lists are initialized to
# represent five age groups. These lists are then populated with their corresponding values from insurance_cost_column. The
# actual calculations are done within a final print statement.

cost_age_under30 = []
cost_age30to39 = []
cost_age40to49 = []
cost_age50to59 = []
cost_age60_over = []

age_and_cost = zip(age_column, insurance_cost_column)
for age, cost in age_and_cost:
    if age < 30:
        cost_age_under30.append(cost)
    elif age <= 39:
        cost_age30to39.append(cost)
    elif age <= 49:
        cost_age40to49.append(cost)
    elif age <= 59:
        cost_age50to59.append(cost)
    else:
        cost_age60_over.append(cost)
               
# The .2f format specifier keeps trailing zeros (for example, 14399.20), which round() drops.
print(f"""
AVERAGE COST OF INSURANCE BY AGE
Under 30: ${sum(cost_age_under30) / len(cost_age_under30):.2f}
30-39: ${sum(cost_age30to39) / len(cost_age30to39):.2f}
40-49: ${sum(cost_age40to49) / len(cost_age40to49):.2f}
50-59: ${sum(cost_age50to59) / len(cost_age50to59):.2f}
60+:   ${sum(cost_age60_over) / len(cost_age60_over):.2f}
""")


AVERAGE COST OF INSURANCE BY AGE
Under 30: $9200.62
30-39: $11738.78
40-49: $14399.20
50-59: $16495.23
60+:   $21248.02



In this dataset, age is strongly associated with the cost of medical insurance. The average cost rises with each successive age group, and individuals aged 60 and over pay more than twice as much on average as those under 30.
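
The age grouping above uses one list per group. An alternative sketch keeps running totals in a dictionary keyed by group label, so adding or changing a group only touches one function. The names `age_group` and `average_cost_by_group` are hypothetical, and the sample ages and costs are made up rather than taken from the dataset.

```python
def age_group(age):
    """Map an age to the same five groups used in the notebook."""
    if age < 30:
        return "Under 30"
    elif age <= 39:
        return "30-39"
    elif age <= 49:
        return "40-49"
    elif age <= 59:
        return "50-59"
    return "60+"


def average_cost_by_group(ages, costs):
    """Return {group label: average cost}, rounded to two decimals."""
    totals = {}
    for age, cost in zip(ages, costs):
        group = age_group(age)
        total, count = totals.get(group, (0.0, 0))
        totals[group] = (total + cost, count + 1)
    return {group: round(total / count, 2) for group, (total, count) in totals.items()}


# Made-up sample values for illustration only.
sample_ages = [22, 29, 35, 47, 63]
sample_costs = [2000.0, 4000.0, 6000.0, 9000.0, 15000.0]
averages = average_cost_by_group(sample_ages, sample_costs)
for group, cost in averages.items():
    print(f"{group}: ${cost:.2f}")
```

Groups with no members simply do not appear in the result, so no division by zero can occur.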

In [5]:
# Next we will calculate the average cost of medical insurance by region. This process follows the same structure as before:
# empty lists are initialized and populated, and then a print statement with the calculations is generated.

cost_northeast = []
cost_northwest = []
cost_southeast = []
cost_southwest = []

region_and_cost = zip(region_column, insurance_cost_column)
for region, cost in region_and_cost:
    if region == "northeast":
        cost_northeast.append(cost)
    elif region == "northwest":
        cost_northwest.append(cost)
    elif region == "southeast":
        cost_southeast.append(cost)
    elif region == "southwest":
        cost_southwest.append(cost)
        
print(f"""
AVERAGE COST OF INSURANCE BY REGION
Northeast: ${round(sum(cost_northeast) / len(cost_northeast), 2)}
Northwest: ${round(sum(cost_northwest) / len(cost_northwest), 2)}
Southeast: ${round(sum(cost_southeast) / len(cost_southeast), 2)}
Southwest: ${round(sum(cost_southwest) / len(cost_southwest), 2)}
""")


AVERAGE COST OF INSURANCE BY REGION
Northeast: $13406.38
Northwest: $12450.84
Southeast: $14735.41
Southwest: $12346.94



In this dataset, the average cost of medical insurance is higher in the two eastern regions than in the two western regions. The Southeast has the highest average cost of all four regions.
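
As a quick check of the regional comparison, the four averages printed above can be ranked from highest to lowest. This is only a sketch that reuses the reported figures; the variable names are hypothetical.

```python
# Regional averages copied from the printed output above.
region_averages = {
    "Northeast": 13406.38,
    "Northwest": 12450.84,
    "Southeast": 14735.41,
    "Southwest": 12346.94,
}

# Sort the (region, average) pairs by average cost, highest first.
ranked = sorted(region_averages.items(), key=lambda item: item[1], reverse=True)
for region, cost in ranked:
    print(f"{region}: ${cost:.2f}")

highest_region = ranked[0][0]
print("Highest average cost:", highest_region)
```

Both eastern regions rank above both western regions, which matches the observation above.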

In [6]:
# Lastly, we will determine the average cost of medical insurance based on whether an individual smokes or not. For this last
# process, a different approach will be used. Instead of initializing and populating lists, we will initialize and update four
# variables to be used for sums and counts. These variables will then be used for making calculations in the print statement.

smoking_total = 0
smoking_count = 0
non_smoking_total = 0
non_smoking_count = 0

for smokes, cost in zip(smokes_column, insurance_cost_column):
    if smokes == "yes":
        smoking_total += cost
        smoking_count += 1
    elif smokes == "no":
        non_smoking_total += cost
        non_smoking_count += 1
        
print("Average insurance cost for smoking individuals:", "$" + str(round(smoking_total / smoking_count, 2)))
print("Average insurance cost for non-smoking individuals:", "$" + str(round(non_smoking_total / non_smoking_count, 2)))

Average insurance cost for smoking individuals: $32050.23
Average insurance cost for non-smoking individuals: $8440.66


Whether an individual smokes is clearly associated with the cost of medical insurance: on average, smokers in this dataset pay about 3.8 times as much as non-smokers.
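
The size of the gap between the two groups can be worked out directly from the averages printed above. The sketch below reuses those two reported figures; the variable names are hypothetical.

```python
# Averages copied from the printed output above.
smoker_average = 32050.23
non_smoker_average = 8440.66

ratio = smoker_average / non_smoker_average
difference = smoker_average - non_smoker_average

print(f"Smokers pay about {ratio:.1f} times as much on average")
print(f"Difference between the two averages: ${difference:.2f}")
```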

## Conclusions

In the context of this dataset, age, location, and whether an individual smokes are all associated with the cost of medical insurance. Whether an individual smokes has the largest effect on the average cost, followed by age and then location.  
According to these findings, an individual who is 60 or older, lives in the Southeast, and smokes is likely to have the most expensive medical insurance.
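
One rough way to support the ordering of the three factors is to compare the spread (highest group average minus lowest group average) within each factor, using the averages reported earlier in this notebook. This is a sketch, not a statistical test: the factors have different numbers of groups, and the groups are not examined in combination. The variable names are hypothetical.

```python
# Group averages copied from the printed outputs earlier in the notebook.
reported_averages = {
    "smoking": [32050.23, 8440.66],
    "age": [9200.62, 11738.78, 14399.20, 16495.23, 21248.02],
    "location": [13406.38, 12450.84, 14735.41, 12346.94],
}

# Spread = highest group average minus lowest group average.
spreads = {
    factor: max(values) - min(values)
    for factor, values in reported_averages.items()
}

# Rank the factors from largest spread to smallest.
ranking = sorted(spreads, key=spreads.get, reverse=True)
for factor in ranking:
    print(f"{factor}: ${spreads[factor]:.2f}")
```

The resulting order (smoking, then age, then location) agrees with the conclusion above.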