# U.S. Medical Insurance Costs

In this project, we will be using Python fundamentals to explore a CSV file containing medical insurance costs. The objective is to analyze different attributes in the "insurance.csv" file, extract information about the patients, and discover potential applications for the dataset.

By examining this data, we aim to gain insights and better understand the various aspects of insurance costs and patient characteristics.

In [46]:
import csv

In [47]:
ages = []
sexes = []
bmis = []
num_children = []
smoker = []
regions = []
costs = []

In [48]:
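def populate_list(target, source, column):
    # Append every value from one CSV column to target and return it.
    # The parameter is named target so it does not shadow the built-in list.
    with open(source, newline='') as file:
        csvreader = csv.DictReader(file)
        for row in csvreader:
            target.append(row[column])
    return target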

Using the helper function populate_list(), we can fill each list with the matching column from the file.
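
Because populate_list() opens and reads the file once per column, seven columns mean seven passes over insurance.csv. As a hedged alternative, the sketch below reads every column in a single pass. The name load_columns is hypothetical and not part of the original notebook; like populate_list(), it leaves every value as a string.

```python
import csv
from collections import defaultdict

def load_columns(source):
    """Read a CSV file once and return {header: [values as strings]}."""
    columns = defaultdict(list)
    with open(source, newline='') as file:
        for row in csv.DictReader(file):
            # Each row is a dict keyed by header name
            for name, value in row.items():
                columns[name].append(value)
    return dict(columns)
```

With this helper, `columns = load_columns('insurance.csv')` followed by `ages = columns['age']` would stand in for one populate_list() call, assuming the file uses the same header names as above.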

In [49]:
populate_list(ages, 'insurance.csv', 'age')
populate_list(sexes, 'insurance.csv', 'sex')
populate_list(bmis, 'insurance.csv', 'bmi')
populate_list(num_children, 'insurance.csv', 'children')
populate_list(smoker, 'insurance.csv', 'smoker')
populate_list(regions, 'insurance.csv', 'region')
populate_list(costs, 'insurance.csv', 'charges')

['16884.924',
 '1725.5523',
 '4449.462',
 '21984.47061',
 '3866.8552',
 '3756.6216',
 '8240.5896',
 '7281.5056',
 '6406.4107',
 '28923.13692',
 '2721.3208',
 '27808.7251',
 '1826.843',
 '11090.7178',
 '39611.7577',
 '1837.237',
 '10797.3362',
 '2395.17155',
 '10602.385',
 '36837.467',
 '13228.84695',
 '4149.736',
 '1137.011',
 '37701.8768',
 '6203.90175',
 '14001.1338',
 '14451.83515',
 '12268.63225',
 '2775.19215',
 '38711',
 '35585.576',
 '2198.18985',
 '4687.797',
 '13770.0979',
 '51194.55914',
 '1625.43375',
 '15612.19335',
 '2302.3',
 '39774.2763',
 '48173.361',
 '3046.062',
 '4949.7587',
 '6272.4772',
 '6313.759',
 '6079.6715',
 '20630.28351',
 '3393.35635',
 '3556.9223',
 '12629.8967',
 '38709.176',
 '2211.13075',
 '3579.8287',
 '23568.272',
 '37742.5757',
 '8059.6791',
 '47496.49445',
 '13607.36875',
 '34303.1672',
 '23244.7902',
 '5989.52365',
 '8606.2174',
 '4504.6624',
 '30166.61817',
 '4133.64165',
 '14711.7438',
 '1743.214',
 '14235.072',
 '6389.37785',
 '5920.1041',
 '176

Next, we consider whether the data needs cleaning or reformatting. Every value was read from the CSV as a string, so the numeric columns must be converted before any arithmetic. Rounding BMI to one decimal place and costs to two decimal places seems reasonable.

In [56]:
# Clean the bmis list: convert strings to floats and round to one decimal place
bmis = [round(float(num), 1) for num in bmis]
# Clean the costs list: convert strings to floats and round to two decimal places
costs = [round(float(num), 2) for num in costs]
print(costs)

[16884.92, 1725.55, 4449.46, 21984.47, 3866.86, 3756.62, 8240.59, 7281.51, 6406.41, 28923.14, 2721.32, 27808.73, 1826.84, 11090.72, 39611.76, 1837.24, 10797.34, 2395.17, 10602.39, 36837.47, 13228.85, 4149.74, 1137.01, 37701.88, 6203.9, 14001.13, 14451.84, 12268.63, 2775.19, 38711.0, 35585.58, 2198.19, 4687.8, 13770.1, 51194.56, 1625.43, 15612.19, 2302.3, 39774.28, 48173.36, 3046.06, 4949.76, 6272.48, 6313.76, 6079.67, 20630.28, 3393.36, 3556.92, 12629.9, 38709.18, 2211.13, 3579.83, 23568.27, 37742.58, 8059.68, 47496.49, 13607.37, 34303.17, 23244.79, 5989.52, 8606.22, 4504.66, 30166.62, 4133.64, 14711.74, 1743.21, 14235.07, 6389.38, 5920.1, 17663.14, 16577.78, 6799.46, 11741.73, 11946.63, 7726.85, 11356.66, 3947.41, 1532.47, 2755.02, 6571.02, 4441.21, 7935.29, 37165.16, 11033.66, 39836.52, 21098.55, 43578.94, 11073.18, 8026.67, 11082.58, 2026.97, 10942.13, 30184.94, 5729.01, 47291.06, 3766.88, 12105.32, 10226.28, 22412.65, 15820.7, 6186.13, 3645.09, 21344.85, 30942.19, 5003.85, 17560.38

Grouping costs by different variables, starting with cost by region:

In [7]:
costs_by_bmi = list(zip(costs, bmis))
costs_by_age = list(zip(costs, ages))
costs_by_num_children = list(zip(costs, num_children))
costs_by_region = list(zip(regions, costs))

# Insurance costs grouped by region. I started with this because I suspected it might be particularly tricky.
# Unique region names in alphabetical order
regions_list = sorted(set(regions))
print(regions_list)

costs_by_region_dict = {region: [] for region in regions_list}

# Each patient is a (region, cost) pair, so the region can index the dict directly
for region, cost in costs_by_region:
    costs_by_region_dict[region].append(cost)

print(costs_by_region_dict)

# Could we identify which regions are the 'most expensive'?

['northeast', 'northwest', 'southeast', 'southwest']
{'northeast': ['6406.4107', '2721.3208', '10797.3362', '2395.17155', '13228.84695', '37701.8768', '14451.83515', '2198.18985', '39774.2763', '3046.062', '6079.6715', '3393.35635', '2211.13075', '13607.36875', '8606.2174', '6799.458', '2755.02095', '4441.21315', '7935.29115', '30184.9367', '22412.6485', '3645.0894', '21344.8467', '11488.31695', '30259.99556', '1705.6245', '39556.4945', '3385.39915', '12815.44495', '13616.3586', '2457.21115', '27375.90478', '3490.5491', '6334.34355', '19964.7463', '7077.1894', '15518.18025', '10407.08585', '4827.90495', '1694.7964', '8538.28845', '4005.4225', '43753.33705', '14901.5167', '4337.7352', '20984.0936', '6610.1097', '10564.8845', '7358.17565', '9225.2564', '38511.6283', '5354.07465', '29523.1656', '4040.55825', '12829.4551', '41097.16175', '13047.33235', '24869.8368', '14590.63205', '9282.4806', '9617.66245', '9715.841', '22331.5668', '48549.17835', '4237.12655', '11879.10405', '9432.9253', 

With the costs grouped by region, we can now find the average cost per region.

In [8]:
def regional_average_cost(costs_by_region_dict, region):
    # float() is applied because the grouped costs may still be strings
    total_cost = 0
    for cost in costs_by_region_dict[region]:
        total_cost += float(cost)
    return total_cost / len(costs_by_region_dict[region])

# Testing the function
for region in regions_list:
    print('The average insurance cost in the {} region is'.format(region), regional_average_cost(costs_by_region_dict, region), 'dollars')

The average insurance cost in the northeast region is 13406.3845163858 dollars
The average insurance cost in the northwest region is 12417.575373969228 dollars
The average insurance cost in the southeast region is 14735.411437609895 dollars
The average insurance cost in the southwest region is 12346.93737729231 dollars
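
These averages are enough for a first answer to the "most expensive region" question. The sketch below copies the four averages printed above into a dict (the name regional_averages is hypothetical) and ranks them from highest to lowest. An average alone says nothing about spread, so this is only a rough ranking.

```python
# Regional averages copied from the output above, in dollars
regional_averages = {
    'northeast': 13406.3845163858,
    'northwest': 12417.575373969228,
    'southeast': 14735.411437609895,
    'southwest': 12346.93737729231,
}

# Sort (region, average) pairs by average, highest first
ranked = sorted(regional_averages.items(), key=lambda item: item[1], reverse=True)

for region, average in ranked:
    print('{}: {:.2f} dollars'.format(region, average))
```

By this measure the southeast has the highest average cost and the southwest the lowest.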


Can we find the median cost per region?
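
One possible approach, sketched here with the standard library's statistics module. The function name regional_median_cost is hypothetical and mirrors regional_average_cost(); the sample dict holds made-up toy values, not data from insurance.csv.

```python
import statistics

def regional_median_cost(costs_by_region_dict, region):
    """Return the median cost for one region; values may be strings or floats."""
    return statistics.median(float(cost) for cost in costs_by_region_dict[region])

# Made-up toy values for illustration only
sample_costs_by_region = {
    'northeast': [100.0, 300.0, 200.0],
    'southwest': ['50.5', '150.5'],
}

for region in sample_costs_by_region:
    print(region, regional_median_cost(sample_costs_by_region, region))
```

Because the median is less sensitive to a few very large charges than the mean, a region whose mean sits well above its median probably contains some unusually expensive patients.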

Can we find the standard deviation and the interquartile range (IQR)? Comparing the two will at least suggest whether outliers exist in the dataset.
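
A minimal sketch of both measures, again using the statistics module. The helper name spread_summary is hypothetical. statistics.quantiles() requires Python 3.8 or later and uses its default 'exclusive' method here, and statistics.stdev() is the sample standard deviation. The input values are toy numbers, not data from insurance.csv.

```python
import statistics

def spread_summary(costs):
    """Return (sample standard deviation, interquartile range) for a list of costs."""
    values = [float(cost) for cost in costs]  # accept strings or numbers
    q1, _median, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    return statistics.stdev(values), q3 - q1

# Made-up toy values for illustration only
sample_costs = [1, 2, 3, 4, 5, 6, 7, 8]
std_dev, iqr = spread_summary(sample_costs)
print('standard deviation:', std_dev)
print('IQR:', iqr)
```

Passing `costs_by_region_dict[region]` for each region would give a per-region comparison. A standard deviation that is large relative to the IQR hints at values far out in the tails.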