# U.S. Medical Insurance Costs

## Project objectives

- Find out the average age of the patients in the dataset.
- Analyze where a majority of the individuals are from.
- Look at the different costs between smokers vs. non-smokers.
- Figure out what the average age is for someone who has at least one child in this dataset.

In [4]:
import csv
from collections import defaultdict

Usually this would be easier with the pandas library, but we are supposed to use the csv module instead.
We read the data with csv.DictReader, which yields each row as a dictionary describing one insured person. Each dictionary is appended to the list insured.

In [5]:
insured = []
with open('insurance.csv', newline = '') as data_file:
    reader = csv.DictReader(data_file, delimiter = ',')
    for row in reader:
        row['age'] = int(row['age'])
        row['bmi'] = round(float(row['bmi']),1)
        row['children'] = int(row['children'])
        row['charges'] = round(float(row['charges']),2)
        insured.append(row)
for i in range(5):
    print(insured[i])

{'age': 19, 'sex': 'female', 'bmi': 27.9, 'children': 0, 'smoker': 'yes', 'region': 'southwest', 'charges': 16884.92}
{'age': 18, 'sex': 'male', 'bmi': 33.8, 'children': 1, 'smoker': 'no', 'region': 'southeast', 'charges': 1725.55}
{'age': 28, 'sex': 'male', 'bmi': 33.0, 'children': 3, 'smoker': 'no', 'region': 'southeast', 'charges': 4449.46}
{'age': 33, 'sex': 'male', 'bmi': 22.7, 'children': 0, 'smoker': 'no', 'region': 'northwest', 'charges': 21984.47}
{'age': 32, 'sex': 'male', 'bmi': 28.9, 'children': 0, 'smoker': 'no', 'region': 'northwest', 'charges': 3866.86}
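
The explicit conversions in the loading cell are needed because csv.DictReader returns every field as a string. A minimal sketch of that behavior, using a small in-memory sample instead of the file (the two rows are taken from the printed output above; `sample_csv` and `rows` are illustrative names, not part of the notebook):

```python
import csv
import io

# Two rows in the same column layout as insurance.csv, held in memory for illustration.
sample_csv = (
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.92\n"
    "18,male,33.8,1,no,southeast,1725.55\n"
)

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Before conversion, every value is a str, including the numeric columns.
print(type(rows[0]['age']).__name__)   # str

# Convert the numeric columns, as in the loading cell.
for row in rows:
    row['age'] = int(row['age'])
    row['bmi'] = float(row['bmi'])
    row['children'] = int(row['children'])
    row['charges'] = float(row['charges'])

print(rows[0]['age'] + rows[1]['age'])  # 37
```

Without the conversion, `rows[0]['age'] + rows[1]['age']` would concatenate the strings instead of adding the numbers.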


## Calculate the average age

In [6]:
average_age = round(sum(person['age'] for person in insured) / len(insured), 1)
print(average_age)

39.2
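
The same kind of average can be cross-checked with statistics.mean from the standard library. A minimal sketch, assuming a list shaped like insured; `sample` holds only the ages of the first three records printed above, so the result is not the dataset average:

```python
from statistics import mean

# Ages of the first three records shown in the sample output (not the full dataset).
sample = [{'age': 19}, {'age': 18}, {'age': 28}]

sample_average_age = round(mean(person['age'] for person in sample), 1)
print(sample_average_age)  # 21.7
```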


## Calculate some statistics by region

First, we group the records by region with the function build_by_region.
It takes one parameter, insured_people, which is a list of dictionaries with insured people's data.
It returns a dictionary whose keys are the regions and whose values are lists of the dictionaries for the insured people in that region.

In [7]:
def build_by_region(insured_people):
    insured_by_region = defaultdict(list)
    for person in insured_people:
        insured_by_region[person['region']].append(person)
    return insured_by_region

And now we call the function.

In [10]:
insured_by_region = build_by_region(insured)
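
To see where most individuals come from, the sizes of the groups can be compared directly. A minimal sketch, assuming the dictionary-of-lists shape returned by build_by_region; `toy_insured` is a hypothetical four-record sample, not the real dataset:

```python
from collections import defaultdict

# Toy records (hypothetical) carrying only the field needed here.
toy_insured = [
    {'region': 'southwest'}, {'region': 'southeast'},
    {'region': 'southeast'}, {'region': 'northwest'},
]

# Same grouping logic as build_by_region.
toy_by_region = defaultdict(list)
for person in toy_insured:
    toy_by_region[person['region']].append(person)

# Number of people per region, largest group first.
region_counts = sorted(
    ((region, len(people)) for region, people in toy_by_region.items()),
    key=lambda item: item[1],
    reverse=True,
)
largest_region = region_counts[0][0]
print(region_counts)
print(largest_region)  # southeast
```

Applied to insured_by_region, the same loop would rank the four regions by number of insured people.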

Now we calculate some statistics for each region and save them in a dictionary, stats_by_region.
In stats_by_region each key is a region, and the corresponding value is a new dictionary with the stats. An extra 'Total' key holds the same stats for the whole dataset.

In [15]:
def stats_calculation(insured_people):
    """Return a dictionary with summary statistics for a group of insured people.

    insured_people: a list of dictionaries with the data of insured people."""
    num_of_people = len(insured_people)
    average_age = int(sum([person['age'] for person in insured_people]) / num_of_people)
    smokers = [person for person in insured_people if person['smoker'] == 'yes']
    non_smokers = [person for person in insured_people if person['smoker'] == 'no']
    num_smokers, num_non_smokers = len(smokers), len(non_smokers)
    smokers_ratio = round(num_smokers / num_of_people, 4)
    smoker_cost = round(sum([person['charges'] for person in smokers]) / num_smokers, 2)
    non_smoker_cost = round(sum([person['charges'] for person in non_smokers]) / num_non_smokers, 2)
    # People with at least one child.
    parents = [person for person in insured_people if person['children'] > 0]
    num_of_parents = len(parents)
    age_of_parents = round(sum(person['age'] for person in parents) / num_of_parents, 1)
    
    stats = {'num_of_people': num_of_people, 'average_age': average_age, 
                             'smokers_ratio': smokers_ratio, 'non_smoker_cost': non_smoker_cost, 
                             'smoker_cost': smoker_cost, 'age_of_parents': age_of_parents, 
                             'num_of_parents':num_of_parents}
    return stats

stats_by_region = {}

for region in insured_by_region:
    stats_by_region[region] = stats_calculation(insured_by_region[region])
stats_by_region['Total'] = stats_calculation(insured)
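
Note that stats_calculation divides by num_smokers, num_non_smokers and num_of_parents, so it would raise ZeroDivisionError for a group that has no smokers, no non-smokers, or no parents. That does not happen for the four regions here, but a guard is one option if the function is reused on smaller subsets. A minimal sketch, where `safe_mean` is a hypothetical helper that is not part of the notebook:

```python
def safe_mean(values, ndigits=2):
    """Return the rounded mean of values, or None if values is empty."""
    values = list(values)
    if not values:
        return None
    return round(sum(values) / len(values), ndigits)

print(safe_mean([10, 20, 30]))  # 20.0
print(safe_mean([]))            # None
```

Returning None instead of raising is a design choice: the printing code would then need to handle None, for example by showing a placeholder.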


Now we have the calculations in the stats_by_region dictionary and we can print out the results.

In [16]:
    
for region in stats_by_region:
    print(region)
    print('-'*len(region))
    for key in stats_by_region[region].keys():
        print(f'{key:15}', end = ' ')
    print()
    for val in stats_by_region[region].values():
        print(f'{val:^15}', end = ' ')
    print('\n')

southwest
---------
num_of_people   average_age     smokers_ratio   non_smoker_cost smoker_cost     age_of_parents  num_of_parents  
      325             39            0.1785          8019.28        32269.06          40.0             187       

southeast
---------
num_of_people   average_age     smokers_ratio   non_smoker_cost smoker_cost     age_of_parents  num_of_parents  
      364             38             0.25           8032.22         34845.0          39.8             207       

northwest
---------
num_of_people   average_age     smokers_ratio   non_smoker_cost smoker_cost     age_of_parents  num_of_parents  
      325             39            0.1785          8556.46         30192.0          39.5             193       

northeast
---------
num_of_people   average_age     smokers_ratio   non_smoker_cost smoker_cost     age_of_parents  num_of_parents  
      324             39            0.2068          9165.53        29673.54          39.7             177       

Total
-----
