# U.S. Medical Insurance Costs

This project investigates a medical insurance cost dataset using Python. We will examine statistics such as:

* average age of patients with children
* average cost of insurance for patients depending on sex, region, smoker status
* region where most patients reside

The first step is to import Python's built-in csv module.

In [1]:
#Import csv library
import csv

Next, we define a function that uses the csv module to read a single column of the dataset **insurance.csv** into a list.

In [2]:
#Function to read data from csv file and add to lists
def read_csv(column_name, column_list):
    with open('insurance.csv', newline = "") as insurance:
        insurance_costs = csv.DictReader(insurance)
        for row in insurance_costs:
            column_list.append(row[column_name])
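
Because `read_csv` reopens the file for every column, loading seven columns means reading the file seven times. As a minimal sketch of an alternative, the hypothetical helper `read_columns` below (not part of the original notebook) collects every column in a single pass. It is shown on a few made-up rows held in memory rather than on **insurance.csv** itself.

```python
import csv
import io

# Hypothetical helper (an assumption, not from the notebook):
# read every column of a CSV file in one pass.
def read_columns(csv_file):
    """Return a dict mapping each column name to a list of its values."""
    reader = csv.DictReader(csv_file)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name in columns:
            columns[name].append(row[name])
    return columns

# Made-up sample rows standing in for insurance.csv
sample_file = io.StringIO("age,sex,charges\n30,female,1000.0\n40,male,2000.0\n")
columns = read_columns(sample_file)
print(columns["age"])
```

With the real file, the same helper could be called inside `with open('insurance.csv', newline="") as insurance:`. All values stay strings, just as they do with `read_csv`.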

**insurance.csv** contains the following columns:
* Patient Age
* Patient Sex 
* Patient BMI
* Patient Number of Children
* Patient Smoking Status
* Patient U.S. Geographical Region
* Patient Yearly Medical Insurance Cost

There are no signs of missing data. To store this information, seven empty lists will be created to hold each column of data.

In [3]:
#Create empty lists to store each column of data from the csv
ages = []
sexes = []
bmis = []
num_children = []
smoker_status = []
regions = []
insurance_charges = []

We can call the read_csv function to load each empty list with data from the dataset.

In [4]:
#Call function to fill lists
read_csv("age", ages)
read_csv("sex", sexes)
read_csv("bmi", bmis)
read_csv("children", num_children)
read_csv("smoker", smoker_status)
read_csv("region", regions)
read_csv("charges", insurance_charges)
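
The observation that no data is missing can be checked once the lists are loaded. The sketch below is one possible check, assuming that a missing value would appear as `None` or as an empty or whitespace-only string. The helper name `count_blank` and the sample lists are made up for illustration.

```python
# Hypothetical helper (an assumption, not from the notebook):
# count entries that are None or contain only whitespace.
def count_blank(values):
    return sum(1 for value in values if value is None or str(value).strip() == "")

# Made-up sample columns, one of which has a blank entry
sample_columns = {
    "age": ["30", "", "40"],
    "sex": ["female", "male", "male"],
}
blank_counts = {name: count_blank(values) for name, values in sample_columns.items()}
print(blank_counts)
```

Applied to the notebook's lists (for example `count_blank(ages)`), a result of 0 for every column would support the claim.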

A dictionary of all the data will need to be created in order to thoroughly investigate this dataset. Each patient's data will be set up as a dictionary value, under a unique numerical key, since the data is anonymous.

In [5]:
#Function to create dictionary of all data, using numbers as the keys since the data is anonymous
def create_dict():
    all_dict = {}
    i = 0
    while i < len(ages):
        single_dict = {}
        single_dict.update({"Age": ages[i], 
                            "Sex": sexes[i], 
                            "BMI": bmis[i], 
                            "Children": num_children[i],
                            "Smoker": smoker_status[i],
                            "Region": regions[i],
                            "Charges": insurance_charges[i]
                           })
        all_dict[i] = single_dict
        i += 1
    return all_dict

insurance_dict = create_dict()
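
The same structure can also be built without a manual counter. The following is a minimal sketch using `enumerate` and `zip`, shown on made-up sample lists with only three of the seven columns. The function name `build_patient_dict` is hypothetical.

```python
# Made-up sample columns (illustrative only, not taken from insurance.csv)
sample_ages = ["30", "40"]
sample_sexes = ["female", "male"]
sample_charges = ["1000.0", "2000.0"]

# Hypothetical alternative to create_dict: zip pairs up the values that
# belong to the same patient, and enumerate supplies the numeric key.
def build_patient_dict(ages, sexes, charges):
    patients = {}
    for i, (age, sex, charge) in enumerate(zip(ages, sexes, charges)):
        patients[i] = {"Age": age, "Sex": sex, "Charges": charge}
    return patients

sample_dict = build_patient_dict(sample_ages, sample_sexes, sample_charges)
print(sample_dict[0])
```

Note that `zip` stops at the shortest list, so lists of unequal length would silently drop patients instead of raising an `IndexError`.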

Now that the dictionary and lists have been created, we can start investigating this dataset.

We may want to examine the average insurance costs for a group of patients. We can create a reusable function to find the average cost of any list of costs.

In [6]:
#Function to find average cost of a list
def find_average_cost(cost_list):
    total_cost = 0
    for cost in cost_list:
        total_cost += float(cost)
    average_cost = total_cost / len(cost_list)
    return round(average_cost,2)
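
`find_average_cost` raises a `ZeroDivisionError` when it is given an empty list, which can happen if a category value is misspelled. As a sketch of one way to guard against that, the hypothetical function `average_cost` below uses `statistics.mean` and returns `None` for an empty list. The sample costs are made up.

```python
from statistics import mean

# Hypothetical variant of find_average_cost (an assumption, not from the notebook):
# returns None instead of raising an error when the list is empty.
def average_cost(cost_list):
    if not cost_list:
        return None
    return round(mean(float(cost) for cost in cost_list), 2)

# Made-up sample costs, stored as strings like the values read from the CSV
print(average_cost(["1000.0", "2000.0", "3000.5"]))
print(average_cost([]))
```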

We can also write a function that builds the list of costs for every patient whose entry under a given key (for example, "Sex") matches a given value (for example, "female").

In [7]:
#Function to create list of values based on another categorical key/value
def create_cost_list(key, value):
    cost_list = []
    for patient in insurance_dict.values():
        if patient[key] == value:
            cost_list.append(patient["Charges"])
    return cost_list
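
Each call to `create_cost_list` scans the whole dictionary for a single value. When the costs for every value of a key are needed, they can be grouped in one pass instead. The sketch below assumes a dictionary shaped like `insurance_dict`. The function name `group_costs` and the sample patients are made up for illustration.

```python
from collections import defaultdict

# Hypothetical helper (an assumption, not from the notebook):
# group charges by the value stored under `key`, in a single pass.
def group_costs(patients, key):
    groups = defaultdict(list)
    for patient in patients.values():
        groups[patient[key]].append(float(patient["Charges"]))
    return dict(groups)

# Made-up sample patients shaped like the entries of insurance_dict
sample_patients = {
    0: {"Region": "southeast", "Charges": "1000.0"},
    1: {"Region": "northwest", "Charges": "2000.0"},
    2: {"Region": "southeast", "Charges": "3000.0"},
}
costs_by_region = group_costs(sample_patients, "Region")
print(costs_by_region)
```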

Here we use the above functions to create a list of insurance costs for females, and a list for males. Then we can find the average cost of each list.

The average cost for males is about 11% higher than for females.

In [8]:
#Call functions to create list of female and male costs, then find the average cost of each
avg_female_cost = find_average_cost(create_cost_list("Sex", "female"))
avg_male_cost = find_average_cost(create_cost_list("Sex", "male"))
print("The average insurance cost for males is " + str(avg_male_cost) + " while the average cost for females is " + str(avg_female_cost))

The average insurance cost for males is 13956.75 while the average cost for females is 12569.58


Here we use the above functions to create a list of insurance costs for smokers, and a list for non-smokers. Then we can find the average cost of each list. 

The average cost for smokers is roughly 3.8 times the average cost for non-smokers.

In [9]:
#Call functions to create list of smoker and non-smoker costs, then find the average cost of each
avg_smoker_cost = find_average_cost(create_cost_list("Smoker", "yes"))
avg_nonsmoker_cost = find_average_cost(create_cost_list("Smoker", "no"))
print("The average insurance cost for smokers is " + str(avg_smoker_cost) + " while the average cost for non-smokers is " + str(avg_nonsmoker_cost))

The average insurance cost for smokers is 32050.23 while the average cost for non-smokers is 8434.27


The dataset divides patients into four regions. Here we use the above functions to find the average cost for patients in each region.

The highest average cost is in the southeast region, and the lowest is in the southwest.

In [10]:
#Call functions to create list of costs by region, then find the average cost of each region
region_list = ["northwest", "northeast", "southeast", "southwest"]
for region in region_list:
    avg_cost = find_average_cost(create_cost_list("Region", region))
    print("The average insurance cost for the " + region + " region is " + str(avg_cost))

The average insurance cost for the northwest region is 12417.58
The average insurance cost for the northeast region is 13406.38
The average insurance cost for the southeast region is 14735.41
The average insurance cost for the southwest region is 12346.94


We can create a function to find the average age of patients in a specific list of patients.

In [11]:
#Function to find average age of patients, truncated to a whole number
def find_average_age(age_list):
    total_age = 0
    for age in age_list:
        total_age += int(age)
    average_age = total_age / len(age_list)
    return int(average_age)
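
Because the function returns `int(average_age)`, the fractional part of the average is dropped, not rounded. The short sketch below, using made-up ages, shows how truncation and rounding can differ.

```python
# Made-up sample ages, stored as strings like the values read from the CSV
sample_ages = ["39", "40", "40"]

mean_age = sum(int(age) for age in sample_ages) / len(sample_ages)  # 119 / 3

truncated_age = int(mean_age)      # drops the fractional part
rounded_age = round(mean_age, 1)   # keeps one decimal place

print(truncated_age, rounded_age)
```

Keeping one decimal place makes small differences between groups visible that truncation would hide.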

We can use that function to examine the average age of all patients in the dataset.

In [12]:
avg_age_all_patients = find_average_age(ages)
print("The average age of all patients is " + str(avg_age_all_patients))

The average age of all patients is 39


It would be interesting to find out the average age of patients who have one or more children. Here we create a function to generate the list of ages of patients with children.

In [13]:
#Function to create list of ages of patients who have one or more children
def create_age_list():
    age_list = []
    for patient in insurance_dict.values():
        if int(patient["Children"]) > 0:
            age_list.append(patient["Age"])
    return age_list

We can call the above functions to generate the list and then calculate the average age of patients with children.

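The average age of patients with children comes out the same as the average age of all patients: 39. Because the average-age function truncates its result to a whole number, the two averages may still differ by a fraction of a year. One possible explanation for the similarity is that the dataset includes mostly patients of child-bearing age.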

In [14]:
#Call functions to create list of ages of patients with at least 1 child, then find average age
avg_age_with_children = find_average_age(create_age_list())
print("The average age of a patient with 1 or more children is " + str(avg_age_with_children))

The average age of a patient with 1 or more children is 39


We can also find out which region has the most patients. First, we create a function that counts patients by their value in a given column and returns the most common value. In this case, we will use the Region column.

In [15]:
#Function to count patients by their value in a column and return the most common value
def find_highest_count(column):
    counts_dict = {}
    for patient in insurance_dict.values():
        patient_key = patient[column]
        if patient_key in counts_dict:
            counts_dict[patient_key] += 1
        else:
            counts_dict[patient_key] = 1
    highest = max(counts_dict, key=counts_dict.get)
    return highest
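
The counting logic above can also be expressed with `collections.Counter` from the standard library. The following is a minimal sketch on a made-up list of regions. Unlike `find_highest_count`, it also reports how many patients fall in the top category.

```python
from collections import Counter

# Made-up sample regions (illustrative only, not taken from insurance.csv)
sample_regions = ["southeast", "northwest", "southeast", "southwest"]

region_counts = Counter(sample_regions)

# most_common(1) returns a list with one (value, count) pair
top_region, top_count = region_counts.most_common(1)[0]
print(top_region, top_count)
```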

In [16]:
#Call function to find the region with the most patients
most_popular_region = find_highest_count("Region")
print("Most patients are from the " + most_popular_region)

Most patients are from the southeast
