# U.S. Medical Insurance Costs

In this project, a CSV with medical insurance costs will be analyzed using Python fundamentals. The goal of this project will be to analyze and compare medical costs based on region and the factors that influence their differences. The CSV file will be imported with the `csv` library.

In [8]:
# Import the csv module, which is used to read insurance.csv
import csv

The next step is to look through **insurance.csv** to get familiar with the data types, learn the names of the columns, and identify any missing values. This .csv file does not appear to have any missing values.
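
This inspection could also be done programmatically, as in the sketch below. The helper name `summarize_csv` is hypothetical, and treating an empty string as a missing value is an assumption. The sample rows are made up for illustration and are not taken from **insurance.csv**.

```python
import csv
import io

def summarize_csv(file_obj):
    """Return the column names and a count of empty values per column."""
    reader = csv.DictReader(file_obj)
    missing_counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            # Assumption: a missing value shows up as None or an empty string.
            if row[name] is None or row[name].strip() == '':
                missing_counts[name] += 1
    return reader.fieldnames, missing_counts

# Illustrative data only; for the real file, pass open('insurance.csv', newline='').
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,25.0,0,no,northeast,4000.0\n"
    "45,male,,2,yes,southeast,21000.0\n"
)
columns, missing = summarize_csv(sample)
print(columns)
print(missing)
```

In the made-up sample, the second row has an empty `bmi` field, so only that column reports a missing value.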

Since this analysis will be conducted by region, four dictionaries will be created, one for each region:
- Northeast
- Northwest
- Southeast
- Southwest

Each dictionary will contain the following lists, which start out empty and will be populated by a helper function below:
- Age
- Sex
- BMI
- Number of Children
- Smoker Status
- Insurance Cost

In [10]:
northeast = {
    'age': [],
    'sex': [],
    'bmi': [],
    'num_children': [],
    'smoker_status': [],
    'insurance_cost': []
}
northwest = {
    'age': [],
    'sex': [],
    'bmi': [],
    'num_children': [],
    'smoker_status': [],
    'insurance_cost': []
}
southeast = {
    'age': [],
    'sex': [],
    'bmi': [],
    'num_children': [],
    'smoker_status': [],
    'insurance_cost': []
}
southwest = {
    'age': [],
    'sex': [],
    'bmi': [],
    'num_children': [],
    'smoker_status': [],
    'insurance_cost': []
}
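
Because the four dictionaries share the same keys, they could also be generated from a single list of keys. This is an alternative sketch, not the approach used in this notebook; `METRIC_KEYS`, `make_region_dict`, and `regions` are hypothetical names.

```python
METRIC_KEYS = ['age', 'sex', 'bmi', 'num_children', 'smoker_status', 'insurance_cost']

def make_region_dict():
    """Return a new dictionary with an independent empty list for each metric."""
    return {key: [] for key in METRIC_KEYS}

# One dictionary per region, keyed by region name.
regions = {
    name: make_region_dict()
    for name in ['northeast', 'northwest', 'southeast', 'southwest']
}
print(regions['northeast'])
```

A dictionary comprehension is used instead of `dict.fromkeys(METRIC_KEYS, [])` because `fromkeys` would make every key point at the same list object.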

The following helper functions split the CSV data into the respective lists for each region.

In [18]:
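def append_row_to_dict(region_dict, row):
    """Append each value in a CSV row to the matching list in region_dict."""
    # Some CSV column names differ from the dictionary keys, so each column
    # is mapped explicitly. Values are stored as the strings read from the file.
    for item in row:
        if item == 'age':
            region_dict['age'].append(row[item])
        elif item == 'sex':
            region_dict['sex'].append(row[item])
        elif item == 'bmi':
            region_dict['bmi'].append(row[item])
        elif item == 'smoker':
            region_dict['smoker_status'].append(row[item])
        elif item == 'children':
            region_dict['num_children'].append(row[item])
        elif item == 'charges':
            region_dict['insurance_cost'].append(row[item])

def load_list_data_and_create_dictionaries(csv_file, region_dict, region_name):
    """Fill region_dict with the rows of csv_file whose region is region_name."""
    # newline='' is the file mode recommended by the csv module documentation.
    with open(csv_file, newline='') as file_info:
        csv_dict = csv.DictReader(file_info)
        for row in csv_dict:
            if row['region'] == region_name:
                append_row_to_dict(region_dict, row)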

In [21]:
region_dict_list = [
    (northeast, 'northeast'), 
    (northwest, 'northwest'), 
    (southeast, 'southeast'), 
    (southwest, 'southwest')
]

for region_dict, region_name in region_dict_list:
    load_list_data_and_create_dictionaries('insurance.csv', region_dict, region_name)
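
The loop above opens and scans the file once per region. A single-pass variant is sketched below; `load_all_regions` and `COLUMN_TO_KEY` are hypothetical names, and the in-memory sample is made-up data standing in for **insurance.csv**.

```python
import csv
import io

# Hypothetical mapping from CSV column names to the dictionary keys used above.
COLUMN_TO_KEY = {
    'age': 'age',
    'sex': 'sex',
    'bmi': 'bmi',
    'children': 'num_children',
    'smoker': 'smoker_status',
    'charges': 'insurance_cost',
}

def load_all_regions(file_obj, region_names):
    """Read the CSV once and split its rows into one dictionary per region."""
    regions = {
        name: {key: [] for key in COLUMN_TO_KEY.values()}
        for name in region_names
    }
    for row in csv.DictReader(file_obj):
        region_dict = regions.get(row['region'])
        if region_dict is None:
            continue  # skip rows whose region was not requested
        for column, key in COLUMN_TO_KEY.items():
            region_dict[key].append(row[column])
    return regions

# Illustrative data only; for the real file, pass open('insurance.csv', newline='').
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,25.0,0,no,northeast,4000.0\n"
    "45,male,31.2,2,yes,southeast,21000.0\n"
)
regions = load_all_regions(sample, ['northeast', 'northwest', 'southeast', 'southwest'])
print(regions['southeast']['insurance_cost'])
```

For a file of this size the difference is negligible, but reading once avoids repeating the same scan four times.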

Now that the data is segmented into lists, it is ready for analysis. The goal of this analysis is to determine which region has the most expensive medical costs. This will be completed through the following analyses:
- Average cost by region
- Ratio of smokers to non-smokers
- Average number of children
- Average age
- Female-to-male ratio
- Average BMI

These analyses will be created as methods for the class `RegionalCostsAnalysis`. 

In [67]:
class RegionalCostsAnalysis:
    def __init__(self, region_dict_name_tuple):
        self.region_dict_list = []
        self.region_name_list = []
        for region_dict, region_name in region_dict_name_tuple:
            self.region_dict_list.append(region_dict)
            self.region_name_list.append(region_name)
    
    def compute_averages(self, metric):
        average_values_by_region = {}
        for index, region_dict in enumerate(self.region_dict_list):
            region_len = len(region_dict[metric])
            total_value = 0
            for metric_val in region_dict[metric]:
                total_value += float(metric_val)
            average_metric_val = round(total_value / region_len, 3)
            average_values_by_region[self.region_name_list[index]] = average_metric_val
        
        return average_values_by_region
    
    def calculate_ratios(self, metric, numerator):
        ratios = {}
        for index, region_dict in enumerate(self.region_dict_list):
            numerator_count = 0
            denominator_count = 0
            for metric_val in region_dict[metric]:
                if metric_val == numerator:
                    numerator_count += 1
                else:
                    denominator_count += 1
            ratio = round(numerator_count / denominator_count, 3)
            ratios[self.region_name_list[index]] = ratio
        return ratios
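
Note that `calculate_ratios` divides by the count of non-matching values, so a region in which every entry matches `numerator` (or a region with no entries) would raise a `ZeroDivisionError`. A guarded version of the same calculation might look like the sketch below; `safe_ratio` is a hypothetical helper, and returning `None` for an undefined ratio is one possible choice.

```python
def safe_ratio(values, numerator):
    """Return count(numerator) / count(everything else), or None if undefined."""
    numerator_count = sum(1 for value in values if value == numerator)
    denominator_count = len(values) - numerator_count
    if denominator_count == 0:
        return None  # the ratio is undefined when the denominator group is empty
    return round(numerator_count / denominator_count, 3)

print(safe_ratio(['yes', 'no', 'no', 'no'], 'yes'))  # 0.333
print(safe_ratio(['yes', 'yes'], 'yes'))             # None
```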

In [68]:
region_analysis = RegionalCostsAnalysis(region_dict_list)
region_analysis.compute_averages('age')

{'northeast': 39.269,
 'northwest': 39.197,
 'southeast': 38.94,
 'southwest': 39.455}

The first analysis conducted is based on the average age in each region. Normally, as individuals get older, their healthcare costs start to rise due to more frequent doctor visits. The **southeast** region has the lowest average age at **38.940**, and the **southwest** has the highest at **39.455**, a gap of only about half a year.

In [69]:
region_analysis.calculate_ratios('sex', 'female')

{'northeast': 0.988,
 'northwest': 1.019,
 'southeast': 0.926,
 'southwest': 0.994}

The next analysis conducted is on gender. Normally, women have more expensive health care costs due to pregnancy care. The **northwest** is the only region with more women than men, as its ratio is above 1.000. The **southeast** has the lowest ratio of women to men.

In [70]:
region_analysis.compute_averages('bmi')

{'northeast': 29.174,
 'northwest': 29.2,
 'southeast': 33.356,
 'southwest': 30.597}

The third analysis is the average body mass index (BMI). Individuals with a larger BMI normally have higher health care costs because of the increased health risks associated with obesity. It should be noted that BMI can be misleading for athletes, whose greater muscle mass can place them in the obese range even when that is not an accurate description. So while the **southeast** has the highest BMI at **33.356**, there could potentially be a larger number of athletes in this region.
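
For context, the averages can be placed in the standard adult BMI categories (below 18.5 underweight, 18.5 to under 25 normal weight, 25 to under 30 overweight, 30 and above obese). The sketch below applies those cutoffs to the averages reported above; `bmi_category` is a hypothetical helper. A regional average falling in a category says nothing about how individuals are distributed within that region.

```python
def bmi_category(bmi):
    """Map a BMI value to its standard adult category."""
    if bmi < 18.5:
        return 'underweight'
    if bmi < 25:
        return 'normal weight'
    if bmi < 30:
        return 'overweight'
    return 'obese'

# Average BMI per region, copied from the output above.
average_bmi = {
    'northeast': 29.174,
    'northwest': 29.2,
    'southeast': 33.356,
    'southwest': 30.597,
}
categories = {region: bmi_category(value) for region, value in average_bmi.items()}
print(categories)
```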

In [71]:
region_analysis.compute_averages('num_children')

{'northeast': 1.046,
 'northwest': 1.148,
 'southeast': 1.049,
 'southwest': 1.142}

The fourth analysis is the average number of children. Normally, health costs will increase when families have children, but the increases are usually minimal after the first child. This analysis shows that all four regions are within **0.102** of each other and that all four regions average at least one child. This metric likely has little effect on the differences in cost by region.
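
The **0.102** figure is the gap between the largest and smallest regional averages. A short check, using the values from the output above (the variable names are illustrative):

```python
# Average number of children per region, copied from the output above.
average_children = {
    'northeast': 1.046,
    'northwest': 1.148,
    'southeast': 1.049,
    'southwest': 1.142,
}

# Spread = highest regional average minus lowest regional average.
spread = round(max(average_children.values()) - min(average_children.values()), 3)
print(spread)  # 0.102
```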

In [72]:
region_analysis.calculate_ratios('smoker_status', 'yes')

{'northeast': 0.261,
 'northwest': 0.217,
 'southeast': 0.333,
 'southwest': 0.217}

The fifth analysis is the ratio of smokers to non-smokers. Smokers have higher healthcare costs due to the well-documented health risks associated with cigarette smoking. The **southeast** region has the highest ratio of smokers, with one smoker for every three non-smokers in the region.
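
A smoker-to-non-smoker ratio is not the same as the share of the population that smokes. If the ratio is r = smokers / non-smokers, the share is smokers / (smokers + non-smokers) = r / (1 + r). The sketch below applies that conversion to the ratios reported above; `ratio_to_share` is a hypothetical helper.

```python
def ratio_to_share(ratio):
    """Convert an A-to-B ratio into A's share of the whole, A / (A + B)."""
    return ratio / (1 + ratio)

# Smoker-to-non-smoker ratios per region, copied from the output above.
smoker_ratios = {
    'northeast': 0.261,
    'northwest': 0.217,
    'southeast': 0.333,
    'southwest': 0.217,
}
smoker_shares = {
    region: round(ratio_to_share(ratio), 3)
    for region, ratio in smoker_ratios.items()
}
print(smoker_shares)
```

Under this conversion, one smoker for every three non-smokers in the southeast corresponds to roughly a quarter of that region's records being smokers.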

In [73]:
region_analysis.compute_averages('insurance_cost')

{'northeast': 13406.385,
 'northwest': 12417.575,
 'southeast': 14735.411,
 'southwest': 12346.937}

The final analysis here is the average insurance cost. The **southeast** region is the most expensive of the four. The **southeast** also leads the four regions in smoker ratio and BMI, suggesting that one or both of those measures influence insurance cost.

The **northeast** region has the second-highest average insurance cost. Focusing on the two categories that may be driving the high costs in the **southeast**, the **northeast** has the lowest BMI of the four regions and the second-highest smoker ratio. Therefore, smoker status is likely the biggest influence on insurance cost.
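
One way to probe this hypothesis more directly would be to compare the average cost of smokers and non-smokers within a single region. The sketch below assumes a region dictionary shaped like the ones built earlier (parallel `smoker_status` and `insurance_cost` lists of strings); `average_cost_by_smoker_status` is a hypothetical helper, and the toy values are made up rather than taken from **insurance.csv**.

```python
from collections import defaultdict

def average_cost_by_smoker_status(region_dict):
    """Return the mean insurance cost for each smoker status in one region."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    # The two lists are parallel: entry i of each describes the same person.
    for status, cost in zip(region_dict['smoker_status'], region_dict['insurance_cost']):
        totals[status] += float(cost)
        counts[status] += 1
    return {status: round(totals[status] / counts[status], 3) for status in totals}

# Made-up data for illustration only.
toy_region = {
    'smoker_status': ['yes', 'no', 'no'],
    'insurance_cost': ['30000.0', '5000.0', '7000.0'],
}
result = average_cost_by_smoker_status(toy_region)
print(result)
```

Running the same helper on each of the four region dictionaries would show whether the cost gap between smokers and non-smokers holds across regions.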

The **northwest** and **southwest** have identical smoker ratios; however, the **northwest** has a slightly higher average insurance cost. The **northwest** also has the highest ratio of women, which may be the factor that makes its average cost slightly higher.

Based on this simple analysis, more research would be required to det