# U.S. Medical Insurance Costs

Possible ways to analyse the dataset:
1. By gender
2. By region: which region has the highest/lowest total and average insurance cost?
    * In which region are insurance costs greatest for females? For males?
3. By smoking status
    * Percentage of smokers among females and males
4. By the number of children
    * In which region are insurance costs greatest for people with children (1 to 5 children; find the maximum)?
5. By BMI
    * What is the average BMI in each region, overall and by gender?
6. By age group
    * Insurance costs for females versus males in different age groups

Documentation for reference:
* https://docs.python.org/3/library/csv.html
* https://docs.python.org/3/
* https://www.kaggle.com/mirichoi0218/insurance

In [12]:
# Read the csv file
import csv
# Create empty list to store contents of each row of csv file
all_records = []
with open("insurance.csv") as insurance_csv:
    insurance_read = csv.DictReader(insurance_csv)
    for row in insurance_read:
        all_records.append(row)
        print(row)
print(all_records)

{'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0', 'smoker': 'yes', 'region': 'southwest', 'charges': '16884.924'}
{'age': '18', 'sex': 'male', 'bmi': '33.77', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '1725.5523'}
{'age': '28', 'sex': 'male', 'bmi': '33', 'children': '3', 'smoker': 'no', 'region': 'southeast', 'charges': '4449.462'}
{'age': '33', 'sex': 'male', 'bmi': '22.705', 'children': '0', 'smoker': 'no', 'region': 'northwest', 'charges': '21984.47061'}
{'age': '32', 'sex': 'male', 'bmi': '28.88', 'children': '0', 'smoker': 'no', 'region': 'northwest', 'charges': '3866.8552'}
{'age': '31', 'sex': 'female', 'bmi': '25.74', 'children': '0', 'smoker': 'no', 'region': 'southeast', 'charges': '3756.6216'}
{'age': '46', 'sex': 'female', 'bmi': '33.44', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '8240.5896'}
{'age': '37', 'sex': 'female', 'bmi': '27.74', 'children': '3', 'smoker': 'no', 'region': 'northwest', 'charges': '7281.
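
Every value returned by `csv.DictReader` is a string (note the quoted numbers in the output above), which is why later cells wrap values in `float(...)`. A minimal sketch of converting the numeric columns once at load time is shown here. `load_typed_records` and `SAMPLE_CSV` are hypothetical names introduced for illustration, and the three sample rows are copied from the output above so that the snippet runs without `insurance.csv`.

```python
import csv
import io

# Assumption: three sample rows copied from the printed output above,
# used in place of the real insurance.csv file.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33,3,no,southeast,4449.462
"""

def load_typed_records(csv_file):
    """Read rows with csv.DictReader and convert the numeric columns."""
    typed = []
    for row in csv.DictReader(csv_file):
        typed.append({
            "age": int(row["age"]),
            "sex": row["sex"],
            "bmi": float(row["bmi"]),
            "children": int(row["children"]),
            "smoker": row["smoker"],
            "region": row["region"],
            "charges": float(row["charges"]),
        })
    return typed

typed_records = load_typed_records(io.StringIO(SAMPLE_CSV))
print(typed_records[0])
```

With the real file, the same function could be called as `load_typed_records(open("insurance.csv"))` inside a `with` block.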

In [13]:
# Create empty lists to store each individual's age, sex, bmi, children, smoker status, region and charges
age_list = []
sex_list = []
bmi_list = []
children_list = []
smoker_list = []
region_list = []
charges_list = []

In [14]:
for row in all_records:
    age_list.append(row["age"])
    sex_list.append(row["sex"])
    bmi_list.append(row["bmi"])
    children_list.append(row["children"])
    smoker_list.append(row["smoker"])
    region_list.append(row["region"])
    charges_list.append(row["charges"])

In [17]:
print(age_list)
print(sex_list)
print(bmi_list)
print(children_list)

['19', '18', '28', '33', '32', '31', '46', '37', '37', '60', '25', '62', '23', '56', '27', '19', '52', '23', '56', '30', '60', '30', '18', '34', '37', '59', '63', '55', '23', '31', '22', '18', '19', '63', '28', '19', '62', '26', '35', '60', '24', '31', '41', '37', '38', '55', '18', '28', '60', '36', '18', '21', '48', '36', '40', '58', '58', '18', '53', '34', '43', '25', '64', '28', '20', '19', '61', '40', '40', '28', '27', '31', '53', '58', '44', '57', '29', '21', '22', '41', '31', '45', '22', '48', '37', '45', '57', '56', '46', '55', '21', '53', '59', '35', '64', '28', '54', '55', '56', '38', '41', '30', '18', '61', '34', '20', '19', '26', '29', '63', '54', '55', '37', '21', '52', '60', '58', '29', '49', '37', '44', '18', '20', '44', '47', '26', '19', '52', '32', '38', '59', '61', '53', '19', '20', '22', '19', '22', '54', '22', '34', '26', '34', '29', '30', '29', '46', '51', '53', '19', '35', '48', '32', '42', '40', '44', '48', '18', '30', '50', '42', '18', '54', '32', '37', '47', '20

### 1. Insurance costs by gender

In [32]:
# Calculate total insurance cost for females and males
total_charges_female = 0
total_charges_male = 0
#iterate through list of dictionaries
for row in all_records:
    if row["sex"] == "female":
        total_charges_female += float(row["charges"])
    else:
        total_charges_male += float(row["charges"])

#round to 2 dp
total_charges_female = round(total_charges_female, 2)
total_charges_male = round(total_charges_male, 2)
        
# Count no. of females and males using sex_list
count_female = sex_list.count("female")
count_male = sex_list.count("male")

# Calculate average insurance cost for females and males
average_charges_female = round(total_charges_female/count_female, 2)
average_charges_male = round(total_charges_male/count_male, 2)

print("""Total charges for females: ${total_charges_female}\n
Total number of females: {count_female}\n
Average charges for females: ${average_charges_female}
""".format(total_charges_female=total_charges_female, count_female=count_female, average_charges_female=average_charges_female) )

print("""Total charges for males: ${total_charges_male}\n
Total number of males: {count_male}\n
Average charges for males: ${average_charges_male}
""".format(total_charges_male=total_charges_male, count_male=count_male, average_charges_male=average_charges_male) )


Total charges for females: $8321061.19

Total number of females: 662

Average charges for females: $12569.58

Total charges for males: $9434763.8

Total number of males: 676

Average charges for males: $13956.75



The dataset is roughly balanced between females (662) and males (676). On average, females incur lower insurance charges than males ($12,569.58 versus $13,956.75).
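
To put a number on that gap, the following sketch recomputes both averages from the printed totals and counts and expresses the difference relative to the female average. The variable names `gap` and `gap_percent` are introduced here for illustration; the input figures are copied from the output above.

```python
# Figures copied from the printed output above.
total_charges_female = 8321061.19
count_female = 662
total_charges_male = 9434763.8
count_male = 676

# Average = total charges / number of individuals
average_female = total_charges_female / count_female
average_male = total_charges_male / count_male

# Absolute gap and gap relative to the female average
gap = average_male - average_female
gap_percent = 100 * gap / average_female

print("Average gap: ${:.2f} ({:.1f}% above the female average)".format(gap, gap_percent))
```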

In [46]:
# Create dictionary with keys 'female' and 'male', and values being a list of dictionaries containing individual records
records_by_gender = {
    "female": [],
    "male": []
}

# Iterate through the list of records and group them by sex
for record in all_records:
    if record["sex"] == "female":
        records_by_gender["female"].append(record)
    else:
        records_by_gender["male"].append(record)

#print(records_by_gender["male"])

#### Exploratory data analysis by gender ####
# Variables: Age, bmi, children, smoker, region, charges

# Define a function that takes input parameters "sex" and "attribute", and outputs the average value of that attribute.
# Valid attributes are those with numeric values: age, bmi, children, charges
def get_average_attribute_values(sex, attribute):
    # get the list of individual records for that gender
    records_list = records_by_gender[sex]
    # running total of the attribute's values
    sum_attribute_values = 0
    # loop through list of individual records
    for record in records_list:
        sum_attribute_values += float(record[attribute])
    # return the average value of the attribute
    return round(sum_attribute_values/len(records_list), 2)

# Test function: Get the average bmi of females and males
average_bmi_female = get_average_attribute_values("female", "bmi")
average_bmi_male = get_average_attribute_values("male", "bmi")
print("Females: Average BMI = {}".format(average_bmi_female))
print("Males: Average BMI = {}".format(average_bmi_male))

# Test function: Get the average no. of children of females and males
average_children_female = get_average_attribute_values("female", "children")
average_children_male = get_average_attribute_values("male", "children")
print("Females: Average number of children = {}".format(average_children_female))
print("Males: Average number of children = {}".format(average_children_male))



Females: Average BMI = 30.38
Males: Average BMI = 30.94
Females: Average number of children = 1.07
Males: Average number of children = 1.12
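
The same averaging pattern recurs later for region, smoking status and number of children. As a sketch of how it could be generalised, the hypothetical helper `average_by_group` (not part of the original notebook) groups on any column in a single pass using `collections.defaultdict`. The four sample rows are copied from the first records printed earlier.

```python
from collections import defaultdict

# Assumption: a small sample copied from the first rows of the dataset.
sample_records = [
    {'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0', 'smoker': 'yes', 'region': 'southwest', 'charges': '16884.924'},
    {'age': '18', 'sex': 'male', 'bmi': '33.77', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '1725.5523'},
    {'age': '31', 'sex': 'female', 'bmi': '25.74', 'children': '0', 'smoker': 'no', 'region': 'southeast', 'charges': '3756.6216'},
    {'age': '46', 'sex': 'female', 'bmi': '33.44', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '8240.5896'},
]

def average_by_group(records, group_key, attribute):
    """Return {group value: mean of attribute}, rounded to 2 dp."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        group = record[group_key]
        totals[group] += float(record[attribute])
        counts[group] += 1
    return {group: round(totals[group] / counts[group], 2) for group in totals}

bmi_by_sex = average_by_group(sample_records, "sex", "bmi")
charges_by_smoker = average_by_group(sample_records, "smoker", "charges")
print(bmi_by_sex)
print(charges_by_smoker)
```

On the full dataset, `average_by_group(all_records, "sex", "bmi")` would be expected to reproduce the two BMI averages printed above.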


### 2. Insurance costs by region

In [50]:
# Create a list of unique regions
region_list_unique = []
for region in region_list:
    if region not in region_list_unique:
        region_list_unique.append(region)
region_list_unique.sort()
print(region_list_unique)

# Create a dictionary with keys as the regions, and values being a list of individual records in that region
records_by_region = {key: [] for key in region_list_unique}
print(records_by_region)

# Append individual records to values in records_by_region
for record in all_records:
    for region in region_list_unique:
        if record["region"] == region:
            records_by_region[region].append(record)
print(records_by_region)

['northeast', 'northwest', 'southeast', 'southwest']
{'northeast': [], 'northwest': [], 'southeast': [], 'southwest': []}
{'northeast': [{'age': '37', 'sex': 'male', 'bmi': '29.83', 'children': '2', 'smoker': 'no', 'region': 'northeast', 'charges': '6406.4107'}, {'age': '25', 'sex': 'male', 'bmi': '26.22', 'children': '0', 'smoker': 'no', 'region': 'northeast', 'charges': '2721.3208'}, {'age': '52', 'sex': 'female', 'bmi': '30.78', 'children': '1', 'smoker': 'no', 'region': 'northeast', 'charges': '10797.3362'}, {'age': '23', 'sex': 'male', 'bmi': '23.845', 'children': '0', 'smoker': 'no', 'region': 'northeast', 'charges': '2395.17155'}, {'age': '60', 'sex': 'female', 'bmi': '36.005', 'children': '0', 'smoker': 'no', 'region': 'northeast', 'charges': '13228.84695'}, {'age': '34', 'sex': 'female', 'bmi': '31.92', 'children': '1', 'smoker': 'yes', 'region': 'northeast', 'charges': '37701.8768'}, {'age': '63', 'sex': 'female', 'bmi': '23.085', 'children': '0', 'smoker': 'no', 'region': 'n

In [60]:
# Define a function that calculates the average insurance cost per region, optionally restricted to one gender
# (if the gender parameter is neither "female" nor "male", it averages over all individuals in the region)
def av_cost_by_region(region, gender):
    records_list = records_by_region[region]  # list of records of people in this region

    # if a gender is specified, keep only the records of that gender
    if gender == "female" or gender == "male":
        records_list = [record for record in records_list if record["sex"] == gender]

    total_cost = 0
    for record in records_list:
        total_cost += float(record["charges"])
    return round(total_cost/len(records_list), 2)

# Test function
# Calculate the average cost of all individuals in northeast
average_cost_northeast = av_cost_by_region("northeast", "")
print("Average cost in Northeast region: ${}".format(average_cost_northeast))

# Calculate the average cost of all females in northeast
average_cost_northeast_f = av_cost_by_region("northeast", "female")
print("Average cost of females in Northeast region: ${}".format(average_cost_northeast_f))

# Calculate the average cost of all males in northeast
average_cost_northeast_m = av_cost_by_region("northeast", "male")
print("Average cost of males in Northeast region: ${}".format(average_cost_northeast_m))

Average cost in Northeast region: $13406.38
Average cost of females in Northeast region: $12953.2
Average cost of males in Northeast region: $13854.01


In [64]:
# Analyse insurance costs for females in every region
average_cost_northeast_f = av_cost_by_region("northeast", "female")
print("Average cost of females in Northeast region: ${}".format(average_cost_northeast_f))
average_cost_northwest_f = av_cost_by_region("northwest", "female")
print("Average cost of females in Northwest region: ${}".format(average_cost_northwest_f))
average_cost_southeast_f = av_cost_by_region("southeast", "female")
print("Average cost of females in Southeast region: ${}".format(average_cost_southeast_f))
average_cost_southwest_f = av_cost_by_region("southwest", "female")
print("Average cost of females in Southwest region: ${}".format(average_cost_southwest_f))

Average cost of females in Northeast region: $12953.2
Average cost of females in Northwest region: $12479.87
Average cost of females in Southeast region: $13499.67
Average cost of females in Southwest region: $11274.41


On average, females in the Southeast region incur the greatest insurance costs, while insurance is the cheapest for females in the Southwest region.

In [65]:
# Analyse insurance costs for males in every region
average_cost_northeast_m = av_cost_by_region("northeast", "male")
print("Average cost of males in Northeast region: ${}".format(average_cost_northeast_m))
average_cost_northwest_m = av_cost_by_region("northwest", "male")
print("Average cost of males in Northwest region: ${}".format(average_cost_northwest_m))
average_cost_southeast_m = av_cost_by_region("southeast", "male")
print("Average cost of males in Southeast region: ${}".format(average_cost_southeast_m))
average_cost_southwest_m = av_cost_by_region("southwest", "male")
print("Average cost of males in Southwest region: ${}".format(average_cost_southwest_m))

Average cost of males in Northeast region: $13854.01
Average cost of males in Northwest region: $12354.12
Average cost of males in Southeast region: $15879.62
Average cost of males in Southwest region: $13412.88


On average, males in the Southeast region incur the greatest insurance costs, while insurance is the cheapest for males in the Northwest region.
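
The eight near-identical calls in the two cells above could be collapsed into a single pass that builds the whole region-by-gender table. The sketch below assumes a hypothetical helper `average_cost_table` and uses seven sample rows copied from the start of the dataset, so its numbers differ from the full-dataset averages reported above.

```python
from collections import defaultdict

# Assumption: sample rows copied from the first records of the dataset,
# trimmed to the three columns this example needs.
sample_records = [
    {'sex': 'female', 'region': 'southwest', 'charges': '16884.924'},
    {'sex': 'male', 'region': 'southeast', 'charges': '1725.5523'},
    {'sex': 'male', 'region': 'southeast', 'charges': '4449.462'},
    {'sex': 'male', 'region': 'northwest', 'charges': '21984.47061'},
    {'sex': 'male', 'region': 'northwest', 'charges': '3866.8552'},
    {'sex': 'female', 'region': 'southeast', 'charges': '3756.6216'},
    {'sex': 'female', 'region': 'southeast', 'charges': '8240.5896'},
]

def average_cost_table(records):
    """Return {(region, sex): average charges rounded to 2 dp}."""
    charges_by_cell = defaultdict(list)
    for record in records:
        cell = (record["region"], record["sex"])
        charges_by_cell[cell].append(float(record["charges"]))
    return {cell: round(sum(values) / len(values), 2)
            for cell, values in charges_by_cell.items()}

table = average_cost_table(sample_records)
for (region, sex), average in sorted(table.items()):
    print("Average cost of {}s in {} region: ${}".format(sex, region.capitalize(), average))
```

Combinations with no records (for example, males in the southwest in this sample) simply do not appear in the table, which avoids a division by zero.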

### 3. Insurance costs by smoking status

In [71]:
# Create dictionary with keys 'yes' and 'no', and values being a list of dictionaries containing individual records
records_by_smoker = {
    "yes": [],
    "no": []
}

# Iterate through the list of records and group them by smoking status
for record in all_records:
    if record["smoker"] == "yes":
        records_by_smoker["yes"].append(record)
    else:
        records_by_smoker["no"].append(record)

print(records_by_smoker)

{'yes': [{'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0', 'smoker': 'yes', 'region': 'southwest', 'charges': '16884.924'}, {'age': '62', 'sex': 'female', 'bmi': '26.29', 'children': '0', 'smoker': 'yes', 'region': 'southeast', 'charges': '27808.7251'}, {'age': '27', 'sex': 'male', 'bmi': '42.13', 'children': '0', 'smoker': 'yes', 'region': 'southeast', 'charges': '39611.7577'}, {'age': '30', 'sex': 'male', 'bmi': '35.3', 'children': '0', 'smoker': 'yes', 'region': 'southwest', 'charges': '36837.467'}, {'age': '34', 'sex': 'female', 'bmi': '31.92', 'children': '1', 'smoker': 'yes', 'region': 'northeast', 'charges': '37701.8768'}, {'age': '31', 'sex': 'male', 'bmi': '36.3', 'children': '2', 'smoker': 'yes', 'region': 'southwest', 'charges': '38711'}, {'age': '22', 'sex': 'male', 'bmi': '35.6', 'children': '0', 'smoker': 'yes', 'region': 'southwest', 'charges': '35585.576'}, {'age': '28', 'sex': 'male', 'bmi': '36.4', 'children': '1', 'smoker': 'yes', 'region': 'southwest', 

In [78]:
# TODO: define a function that outputs the percentage of females and males among (i) smokers and (ii) non-smokers
# def gender_percent_by_smoker(smoker)

# Define a function that outputs the average insurance cost of smokers or non-smokers, where smoker equals 'yes' or 'no'
def av_cost_by_smoker(smoker):
    # records_by_smoker is already grouped by smoking status, so no further filtering is needed
    total_cost = 0
    count = 0
    for record in records_by_smoker[smoker]:
        total_cost += float(record["charges"])
        count += 1
    return round(total_cost/count, 2)
        
# Calculate average insurance cost of smokers and non-smokers
av_cost_smoker = av_cost_by_smoker("yes")
av_cost_nonsmoker = av_cost_by_smoker("no")
percent_diff = round(100*(av_cost_smoker-av_cost_nonsmoker)/av_cost_nonsmoker, 2)
print("Average insurance cost of a smoker: ${}".format(av_cost_smoker))
print("Average insurance cost of a non-smoker: ${}".format(av_cost_nonsmoker))
print("Percentage cost difference between smoker and non-smoker: {}%".format(percent_diff))


Average insurance cost of a smoker: $32050.23
Average insurance cost of a non-smoker: $8434.27
Percentage cost difference between smoker and non-smoker: 280.0%
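
The analysis plan also lists the percentage of smokers among females and males, which this section does not compute. One possible sketch is given here; `smoker_percent_by_sex` is a hypothetical helper, and the seven sample rows (trimmed to the two relevant columns) come from the start of the dataset, so the percentages are illustrative only.

```python
# Assumption: the first seven rows of the dataset, trimmed to two columns.
sample_records = [
    {'sex': 'female', 'smoker': 'yes'},
    {'sex': 'male', 'smoker': 'no'},
    {'sex': 'male', 'smoker': 'no'},
    {'sex': 'male', 'smoker': 'no'},
    {'sex': 'male', 'smoker': 'no'},
    {'sex': 'female', 'smoker': 'no'},
    {'sex': 'female', 'smoker': 'no'},
]

def smoker_percent_by_sex(records):
    """Return {sex: percentage of that sex who smoke}, rounded to 2 dp."""
    totals = {}
    smokers = {}
    for record in records:
        sex = record["sex"]
        totals[sex] = totals.get(sex, 0) + 1
        if record["smoker"] == "yes":
            smokers[sex] = smokers.get(sex, 0) + 1
    return {sex: round(100 * smokers.get(sex, 0) / totals[sex], 2) for sex in totals}

smoker_percent = smoker_percent_by_sex(sample_records)
for sex, percent in smoker_percent.items():
    print("Percentage of smokers among {}s: {}%".format(sex, percent))
```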


### 4. Insurance costs by no. of children

In [83]:
# Get the unique numbers of children and store them in a list
# (note: this reuses the name children_list and overwrites the per-record list built earlier)
children_list = []
for record in all_records:
    if record["children"] not in children_list:
        children_list.append(record["children"])
children_list.sort(key=int)  # values are strings, so sort numerically rather than alphabetically
print(children_list)

# Create dictionary with keys as the number of children, and values being a list of dictionaries containing individual records
records_by_children = {key: [] for key in children_list}

print(records_by_children)

['0', '1', '2', '3', '4', '5']
{'0': [], '1': [], '2': [], '3': [], '4': [], '5': []}


In [84]:
# Iterate through all records and append each one to the list for its number of children
for record in all_records:
    for num in children_list:
        if record["children"] == num:
            records_by_children[num].append(record)

print(records_by_children)

{'0': [{'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0', 'smoker': 'yes', 'region': 'southwest', 'charges': '16884.924'}, {'age': '33', 'sex': 'male', 'bmi': '22.705', 'children': '0', 'smoker': 'no', 'region': 'northwest', 'charges': '21984.47061'}, {'age': '32', 'sex': 'male', 'bmi': '28.88', 'children': '0', 'smoker': 'no', 'region': 'northwest', 'charges': '3866.8552'}, {'age': '31', 'sex': 'female', 'bmi': '25.74', 'children': '0', 'smoker': 'no', 'region': 'southeast', 'charges': '3756.6216'}, {'age': '60', 'sex': 'female', 'bmi': '25.84', 'children': '0', 'smoker': 'no', 'region': 'northwest', 'charges': '28923.13692'}, {'age': '25', 'sex': 'male', 'bmi': '26.22', 'children': '0', 'smoker': 'no', 'region': 'northeast', 'charges': '2721.3208'}, {'age': '62', 'sex': 'female', 'bmi': '26.29', 'children': '0', 'smoker': 'yes', 'region': 'southeast', 'charges': '27808.7251'}, {'age': '23', 'sex': 'male', 'bmi': '34.4', 'children': '0', 'smoker': 'no', 'region': 'southwes

In [90]:
# Calculate average insurance cost by number of children
def av_cost_by_children(num_of_children):
    # sum the charges of all records with this number of children
    total_cost = 0
    count = 0
    for record in records_by_children[str(num_of_children)]:
        total_cost += float(record["charges"])
        count += 1
    average_cost = round(total_cost/count, 2)
    return "Number of children: {num}; Average insurance cost: {cost}".format(num=num_of_children, cost=average_cost)

print(av_cost_by_children(0))
print(av_cost_by_children(1))
print(av_cost_by_children(2))
print(av_cost_by_children(3))
print(av_cost_by_children(4))
print(av_cost_by_children(5))


# TODO: calculate the average insurance cost of individuals with 'n' or more children, where n is the lower bound

Number of children: 0; Average insurance cost: 12365.98
Number of children: 1; Average insurance cost: 12731.17
Number of children: 2; Average insurance cost: 15073.56
Number of children: 3; Average insurance cost: 15355.32
Number of children: 4; Average insurance cost: 13850.66
Number of children: 5; Average insurance cost: 8786.04
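
The cell above leaves the "n or more children" average as a note. A minimal sketch of that step follows, assuming a hypothetical function `av_cost_min_children`. The comparison converts `children` to `int` first, because the CSV values are strings. The four sample rows are copied from the start of the dataset and trimmed to two columns.

```python
# Assumption: sample rows from the start of the dataset, trimmed to two columns.
sample_records = [
    {'children': '0', 'charges': '16884.924'},
    {'children': '1', 'charges': '1725.5523'},
    {'children': '3', 'charges': '4449.462'},
    {'children': '1', 'charges': '8240.5896'},
]

def av_cost_min_children(records, min_children):
    """Average charges of records with at least min_children children.

    Returns None when no record qualifies, to avoid dividing by zero.
    """
    charges = [float(record["charges"]) for record in records
               if int(record["children"]) >= min_children]
    if not charges:
        return None
    return round(sum(charges) / len(charges), 2)

for n in range(0, 5):
    print("Children >= {}: average insurance cost: {}".format(n, av_cost_min_children(sample_records, n)))
```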


In [99]:
# Manually check the average cost for those with 5 children
lst = []
for record in all_records:
    if record["children"] == '5':
        lst.append(float(record["charges"]))
        print(record["charges"])
print(round(sum(lst)/len(lst), 2))

4687.797
6799.458
4830.63
5080.096
9788.8659
12592.5345
11552.904
6666.243
6653.7886
10096.97
8965.79575
8596.8278
4915.05985
19023.26
9222.4026
8582.3023
5615.369
14478.33015
8786.04


In [138]:
# In which region is insurance the most expensive for people with 'n' children?
# Use the dictionary of records by region
# Define a function that takes the number of children as input parameter and prints the average cost per region
def regional_cost_by_children(num_children):
    # Create new dictionary with regions as keys and values being a list of individuals with n children
    region_dict = {key: [] for key in region_list_unique}
    for key in records_by_region.keys():
        for record in records_by_region[key]:
            if record["children"] == str(num_children):
                region_dict[key].append(record)
    # loop through completed region_dict to average insurance costs per region
    for region in region_list_unique:
        lst = region_dict[region]
        if not lst:
            # avoid dividing by zero when no one in this region has n children
            print("No individuals with {num_children} children in the {region} region.".format(num_children=num_children, region=region))
            continue
        total_charges = 0
        for record in lst:
            total_charges += float(record["charges"])
        av_cost = round(total_charges/len(lst), 2)
        print("""Average insurance cost for an individual with {num_children} children in the {region} region is ${av_cost}.""".format(num_children=num_children, region=region, av_cost=av_cost))
 

In [140]:
regional_cost_by_children(0)

Average insurance cost for an individual with 0 children in the northeast region is $11626.46.
Average insurance cost for an individual with 0 children in the northwest region is $11324.37.
Average insurance cost for an individual with 0 children in the southeast region is $14309.87.
Average insurance cost for an individual with 0 children in the southwest region is $11938.5.


In [141]:
regional_cost_by_children(1)

Average insurance cost for an individual with 1 children in the northeast region is $16310.21.
Average insurance cost for an individual with 1 children in the northwest region is $10230.26.
Average insurance cost for an individual with 1 children in the southeast region is $13687.04.
Average insurance cost for an individual with 1 children in the southwest region is $10406.48.


In [142]:
regional_cost_by_children(2)

Average insurance cost for an individual with 2 children in the northeast region is $13615.15.
Average insurance cost for an individual with 2 children in the northwest region is $13464.31.
Average insurance cost for an individual with 2 children in the southeast region is $15728.47.
Average insurance cost for an individual with 2 children in the southwest region is $17483.49.


In [143]:
regional_cost_by_children(3)

Average insurance cost for an individual with 3 children in the northeast region is $14409.91.
Average insurance cost for an individual with 3 children in the northwest region is $17786.16.
Average insurance cost for an individual with 3 children in the southeast region is $18449.85.
Average insurance cost for an individual with 3 children in the southwest region is $10402.44.


In [144]:
regional_cost_by_children(4)

Average insurance cost for an individual with 4 children in the northeast region is $14485.19.
Average insurance cost for an individual with 4 children in the northwest region is $11347.02.
Average insurance cost for an individual with 4 children in the southeast region is $14451.02.
Average insurance cost for an individual with 4 children in the southwest region is $14933.26.


In [145]:
regional_cost_by_children(5)

Average insurance cost for an individual with 5 children in the northeast region is $6978.97.
Average insurance cost for an individual with 5 children in the northwest region is $8965.8.
Average insurance cost for an individual with 5 children in the southeast region is $10115.44.
Average insurance cost for an individual with 5 children in the southwest region is $8444.16.


Insurance is cheapest in the northwest region for individuals with 0, 1, 2 or 4 children. For those with 3 children it is cheapest in the southwest region ($10,402.44), and for those with 5 children it is cheapest in the northeast region ($6,978.97).
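
Reading the cheapest region off six separate outputs by eye is error-prone, so the comparison can also be done mechanically. The sketch below copies the regional averages from the `regional_cost_by_children` outputs above into a dictionary (`regional_averages`, a name introduced here) and uses `min` and `max` with a `key` function to pick the cheapest and most expensive region for each number of children.

```python
# Regional averages copied from the regional_cost_by_children outputs above.
regional_averages = {
    0: {"northeast": 11626.46, "northwest": 11324.37, "southeast": 14309.87, "southwest": 11938.5},
    1: {"northeast": 16310.21, "northwest": 10230.26, "southeast": 13687.04, "southwest": 10406.48},
    2: {"northeast": 13615.15, "northwest": 13464.31, "southeast": 15728.47, "southwest": 17483.49},
    3: {"northeast": 14409.91, "northwest": 17786.16, "southeast": 18449.85, "southwest": 10402.44},
    4: {"northeast": 14485.19, "northwest": 11347.02, "southeast": 14451.02, "southwest": 14933.26},
    5: {"northeast": 6978.97, "northwest": 8965.8, "southeast": 10115.44, "southwest": 8444.16},
}

cheapest = {}
priciest = {}
for num_children, averages in regional_averages.items():
    # min/max over the region names, ranked by their average cost
    cheapest[num_children] = min(averages, key=averages.get)
    priciest[num_children] = max(averages, key=averages.get)
    print("{} children: cheapest = {}, most expensive = {}".format(
        num_children, cheapest[num_children], priciest[num_children]))
```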