# U.S. Medical Insurance Costs

SCOPE OF WORK: PATIENT DEMOGRAPHICS AND COST ANALYSIS 

Goals: 
- Provide insights into the dataset's demographic and cost-related features.
- Key points: average age of patients, geographic trends, and the difference in healthcare costs between smokers and non-smokers.
- Identify limitations of the dataset and assess their potential impact on the results.

Data:
- Key variables: demographic, financial, and lifestyle.
- Quality assessment: examine how balanced the dataset's representation is and check for missing values or inconsistencies.
- Identify possible biases and document their implications for the results.

Analytics:
- Compute average age of individuals and examine geographic distribution.
- Compare costs/age depending on lifestyle/healthcare variables (smokers, children).
- Exploratory insights: correlation between attributes (age/children/smokers vs. costs).





Load the dataset into Python variables.
We will create a list of dictionaries, one for each row in our insurance CSV file.

In [57]:
import csv

# Open csv file
with open("insurance.csv", "r", newline="") as insurance_file:
    # Use reader to read rows as dictionaries
    reader = csv.DictReader(insurance_file)

    # Create an empty list to store each row as a dictionary
    insurance_data = []
    
    for row in reader:
        insurance_data.append({
            "age": int(row["age"]),
            "sex": row["sex"],
            "bmi": float(row["bmi"]),
            "children": int(row["children"]),
            "smoker": row["smoker"],
            "region": row["region"],
            "charges": float(row["charges"])
        })
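
The scope above calls for checking missing values before any analysis. The following is a minimal sketch of such a check, not part of the original notebook: `find_incomplete_rows`, `REQUIRED_FIELDS` and the inline sample rows are hypothetical names and data used only for illustration, and the sample stands in for `insurance.csv`.

```python
import csv
import io

# Hypothetical sample standing in for insurance.csv; the second data row has an empty bmi
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,,1,no,southeast,1725.5523
28,male,33.0,3,no,southeast,4449.462
"""

REQUIRED_FIELDS = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]


def find_incomplete_rows(csv_file):
    """Return (line_number, missing_fields) for rows with empty or absent values."""
    problems = []
    reader = csv.DictReader(csv_file)
    # Data starts on line 2 because line 1 is the header
    for line_number, row in enumerate(reader, start=2):
        missing = [
            field for field in REQUIRED_FIELDS
            if not (row.get(field) or "").strip()
        ]
        if missing:
            problems.append((line_number, missing))
    return problems


incomplete = find_incomplete_rows(io.StringIO(SAMPLE_CSV))
print(incomplete)
```

Running a check like this before the `int()` and `float()` conversions would point to the offending line instead of stopping with a `ValueError` on the first empty field.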

    

Now that our data is easily accessible and organised, we want some quick information about our sample. We will first compute the average age across everyone in the dataset.

In [35]:
# Calculate average age of every person in the dataset
average_age = sum(row["age"] for row in insurance_data) / len(insurance_data)
print(f"Average age: {average_age:.2f}")

Average age: 39.21


We would now like to group people by age range, so that we know how each range is represented in the sample.

In [36]:
# Count the number of people in each age group

under_18 = 0
from_18_to_30 = 0
from_31_to_50 = 0
over_50 = 0

for person in insurance_data:
    if person["age"] < 18:
        under_18 += 1
    elif person["age"] < 31:
        from_18_to_30 += 1
    elif person["age"] < 51:
        from_31_to_50 += 1
    else:
        over_50 += 1

# Display results
print(f"Under 18 years old: {under_18}")
print(f"From 18 to 30 years old: {from_18_to_30}")
print(f"From 31 to 50 years old: {from_31_to_50}")
print(f"Over 50 years old: {over_50}")

Under 18 years old: 0
From 18 to 30 years old: 444
From 31 to 50 years old: 538
Over 50 years old: 356
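
As an alternative to four separate counters, the same bucketing could be sketched with a helper function and `collections.Counter`, which also makes it easy to print each group's share of the sample. `age_group` and `sample_ages` are hypothetical names and values introduced for this illustration only.

```python
from collections import Counter


def age_group(age):
    """Map an age to the same buckets used in the cell above."""
    if age < 18:
        return "Under 18"
    if age < 31:
        return "18 to 30"
    if age < 51:
        return "31 to 50"
    return "Over 50"


# Hypothetical ages, chosen to exercise each bucket boundary
sample_ages = [17, 18, 30, 31, 50, 51, 64]
group_counts = Counter(age_group(age) for age in sample_ages)

# Report each group's count and its share of the sample
for group, count in group_counts.items():
    share = count * 100 / len(sample_ages)
    print(f"{group}: {count} ({share:.1f}%)")
```

With the real data, the generator would iterate over `person["age"] for person in insurance_data` instead of `sample_ages`.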


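The sample contains no one under 18, and the three remaining age ranges hold roughly 33%, 40% and 27% of the 1,338 records, so it is reasonably balanced across our age ranges. Next, we will check whether the sample is also balanced with regard to gender.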

In [37]:
# Check quantities of men and women included in the dataset
female_count = 0
male_count = 0

for person in insurance_data:
    if person["sex"] == "female":
        female_count += 1
    else:
        male_count += 1

print(f"Total females: {female_count}")
print(f"Total males: {male_count}")

Total females: 662
Total males: 676


We will now check that each age range is also well balanced in terms of gender.

In [38]:
male_18_to_30 = 0
female_18_to_30 = 0
male_31_to_50 = 0
female_31_to_50 = 0
male_over_50 = 0
female_over_50 = 0

# Iterate over every person in the dataset to allocate in age and gender group

for person in insurance_data:
    if person["age"] < 31:
        if person["sex"] == "female":
            female_18_to_30 += 1
        else:
            male_18_to_30 += 1
    elif person["age"] < 51:
        if person["sex"] == "female":
            female_31_to_50 += 1
        else: 
            male_31_to_50 += 1
    else:
        if person["sex"] == "female":
            female_over_50 += 1
        else:
            male_over_50 += 1


# Obtain results for each group
print(f"From 18 to 30: {male_18_to_30} males - {female_18_to_30} females.")
print(f"From 31 to 50: {male_31_to_50} males - {female_31_to_50} females.")
print(f"Over 50: {male_over_50} males - {female_over_50} females.")


From 18 to 30: 230 males - 214 females.
From 31 to 50: 271 males - 267 females.
Over 50: 175 males - 181 females.


Our dataset appears well balanced in terms of age and gender: within each age range, the male and female counts differ by at most 16. We will now turn to the geographical variables.

In [39]:
# Create an empty dictionary mapping each region (key) to its number of people in the dataset (value)
locations = {}

for record in insurance_data:
    # If region already in our dictionary increment count by 1
    if record["region"] in locations:
        locations[record["region"]] += 1
    else:
        # If not found in our dictionary, initialize count 
        locations[record["region"]] = 1

print(locations)

{'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324}


Next, we compare the average cost by region.

In [40]:
# Initialize a dictionary to store the averages by region
average_costs_by_region = {}

# Loop through each unique region and calculate the average cost
for region in locations:
    total_charges = sum(row["charges"] for row in insurance_data if row["region"] == region)
    average_costs_by_region[region] = total_charges / locations[region]


# Print the results
for region in average_costs_by_region:
    print(f"{region}: {average_costs_by_region[region]:.2f} ")


southwest: 12346.94 
southeast: 14735.41 
northwest: 12417.58 
northeast: 13406.38 
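
The loop above rescans the whole dataset once per region. A single-pass variant could look like the sketch below, assuming a hypothetical helper `average_by` and made-up sample records; it is an illustration rather than the notebook's own method.

```python
from collections import defaultdict


def average_by(records, key, value):
    """Average records[value] for each distinct records[key] in a single pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        totals[record[key]] += record[value]
        counts[record[key]] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical records for illustration only
sample_records = [
    {"region": "southeast", "smoker": "yes", "charges": 100.0},
    {"region": "southeast", "smoker": "no", "charges": 300.0},
    {"region": "northwest", "smoker": "no", "charges": 50.0},
]

region_averages = average_by(sample_records, "region", "charges")
print(region_averages)
```

Because the grouping key is a parameter, the same helper could be called as `average_by(insurance_data, "smoker", "charges")` to compare average costs for smokers and non-smokers, one of the key points listed in the goals.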


We found that the Southeast region has the highest average cost of the four regions in the dataset. To get some insight into what could be driving this, we will check whether it is also the region with the highest percentage of smokers or the highest average age.

In [41]:
# Group data by location 
region_stats = {}

# Iterate over each region and filter rows corresponding to current region
for region in locations:
    region_data = [row for row in insurance_data if row["region"] == region]

    # Calculate average age
    avg_age = sum(row["age"] for row in region_data) / len(region_data)

    # Calculate percentage of smokers
    total_smokers = sum(1 for row in region_data if row["smoker"] == "yes")
    percentage_smokers = (total_smokers/len(region_data)) * 100

    # Store obtained results in the dictionary
    region_stats[region] = {
        "average_age": avg_age,
        "percentage_smokers": percentage_smokers
    }


# Print results
for region, stats in region_stats.items():
    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    print(f"Region: {region}. Average age: {stats['average_age']:.2f}. Percentage of smokers: {stats['percentage_smokers']:.2f}")



Region: southwest. Average age: 39.46. Percentage of smokers: 17.85
Region: southeast. Average age: 38.94. Percentage of smokers: 25.00
Region: northwest. Average age: 39.20. Percentage of smokers: 17.85
Region: northeast. Average age: 39.27. Percentage of smokers: 20.68


These results suggest an association between smoking and cost: the Southeast region, which has the highest percentage of smokers (25.00%), also has the highest average cost. Average age does not appear to be a major factor, as it is very similar across regions (38.94 to 39.46). With only four regions, this is an observation rather than a measured correlation.
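
To put a number on the relationship between smoking and cost at the level of individual records, one option is Pearson's correlation coefficient with smoker status encoded as 0 or 1. The sketch below assumes that encoding; `pearson_r` and the sample records are hypothetical and are not taken from the dataset.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(variance_x * variance_y)


# Hypothetical records for illustration only
sample_people = [
    {"smoker": "yes", "charges": 30000.0},
    {"smoker": "no", "charges": 5000.0},
    {"smoker": "yes", "charges": 25000.0},
    {"smoker": "no", "charges": 7000.0},
]

# Encode smoker status as 1 (smoker) or 0 (non-smoker)
smoker_flags = [1 if person["smoker"] == "yes" else 0 for person in sample_people]
charges = [person["charges"] for person in sample_people]

r = pearson_r(smoker_flags, charges)
print(f"Correlation between smoking and charges: {r:.3f}")
```

A value near 1 would indicate that smokers tend to have higher charges, while a value near 0 would indicate no linear relationship. The function divides by zero if either sequence is constant, so it assumes the data contains both smokers and non-smokers.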

We now turn to lifestyle variables and compare insurance costs for people with and without children.

In [42]:
have_children_count = 0
no_children_count = 0

for person in insurance_data:
    if person["children"] > 0:
        have_children_count += 1
    else: 
        no_children_count += 1

avg_cost_with_children = sum(row["charges"] for row in insurance_data if row["children"] > 0) / have_children_count

print(f"Average insurance cost for people with children: {avg_cost_with_children:.2f}")

avg_cost_no_children = sum(row["charges"] for row in insurance_data if row["children"] == 0) / no_children_count

print(f"Average insurance cost for people that don't have children: {avg_cost_no_children:.2f}")



Average insurance cost for people with children: 13949.94
Average insurance cost for people that don't have children: 12365.98


On average, people with children have higher insurance costs (13,949.94 versus 12,365.98). We will now group them by number of children and compute the average insurance cost for each group.

In [47]:
# Store new data in separate dictionaries
total_cost_per_children_qty = {}
count_per_children_qty = {}

# Iterate through the records and discard the persons that do not have children
for person in insurance_data:
    if person["children"] == 0:
        continue
    else:
        # Keep track of count of people and total charges for each children quantity
        if person["children"] in total_cost_per_children_qty:
            total_cost_per_children_qty[person["children"]] += person["charges"]
            count_per_children_qty[person["children"]] += 1
        else:
            total_cost_per_children_qty[person["children"]] = person["charges"]
            count_per_children_qty[person["children"]] = 1

# Iterate through total cost per quantity and divide by count to get the average
for children_qty in total_cost_per_children_qty:
    avg_cost_by_children_qty = total_cost_per_children_qty[children_qty] / count_per_children_qty[children_qty]
    print(f"Children quantity: {children_qty}. Average cost of insurance: {avg_cost_by_children_qty:.2f}")



Children quantity: 1. Average cost of insurance: 12731.17
Children quantity: 3. Average cost of insurance: 15355.32
Children quantity: 2. Average cost of insurance: 15073.56
Children quantity: 5. Average cost of insurance: 8786.04
Children quantity: 4. Average cost of insurance: 13850.66
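
The output above follows the order in which each quantity first appears in the file, and it does not show how many people are in each group. A small sketch of a sorted summary that also reports the group size is given below; the totals and counts are hypothetical placeholders for `total_cost_per_children_qty` and `count_per_children_qty`.

```python
# Hypothetical totals and counts, keyed by number of children
sample_total_cost = {1: 300.0, 3: 450.0, 2: 400.0}
sample_count = {1: 3, 3: 1, 2: 2}

# Build (children, group size, average cost) tuples in ascending order of children
summary = [
    (qty, sample_count[qty], sample_total_cost[qty] / sample_count[qty])
    for qty in sorted(sample_total_cost)
]

for qty, group_size, average_cost in summary:
    print(f"Children: {qty}. People: {group_size}. Average cost: {average_cost:.2f}")
```

Showing the group size next to each average helps when judging how much weight an unusually high or low average deserves, since an average over a handful of people is more easily swayed by individual records.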


The average insurance cost for people with 5 children (8,786.04) is considerably lower than for any other group. We will try to determine why. Our earlier regional comparison suggested that a higher percentage of smokers goes along with higher insurance costs. To check whether this could explain the low average, we will look at the smoker status of people with 5 children.

In [56]:
# Initialize count for smokers and non-smokers
smokers_count = 0
non_smokers_count = 0

# Obtain total count of each group
for person in insurance_data:
    if person["children"] == 5:
        if person["smoker"] == "yes":
            smokers_count += 1
        else:
            non_smokers_count += 1

# Calculate the percentage of smokers among people with 5 children
percentage_of_smokers = smokers_count * 100 / (smokers_count + non_smokers_count)

print("Percentage of smokers for people with 5 children: " + str(percentage_of_smokers))


Percentage of smokers for people with 5 children: 5.555555555555555
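
A percentage for one group is easier to interpret next to the same figure for the whole sample. The sketch below assumes a hypothetical helper, `smoker_percentage`, and made-up records; it also returns `None` for an empty group, a case in which the division in the cell above would raise `ZeroDivisionError`.

```python
def smoker_percentage(records):
    """Percentage of records whose smoker field is "yes", or None for an empty group."""
    if not records:
        return None
    smokers = sum(1 for record in records if record["smoker"] == "yes")
    return smokers * 100 / len(records)


# Hypothetical records for illustration only
sample_people = [
    {"children": 5, "smoker": "no"},
    {"children": 5, "smoker": "yes"},
    {"children": 0, "smoker": "yes"},
    {"children": 1, "smoker": "yes"},
]

five_children = [person for person in sample_people if person["children"] == 5]

overall_pct = smoker_percentage(sample_people)
five_children_pct = smoker_percentage(five_children)

print(f"Smokers overall: {overall_pct:.2f}%")
print(f"Smokers among people with 5 children: {five_children_pct:.2f}%")
```

With the real data, `sample_people` would be replaced by `insurance_data`, which would allow the 5-children figure to be compared directly against the overall smoker rate.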
