
# Intro

This program uses Python 3 to analyze medical insurance cost data stored in a csv file. The file can be obtained at https://www.kaggle.com/mirichoi0218/insurance. It contains the following fields:
+ age: age of primary beneficiary
+ sex: either female or male
+ bmi: Body mass index, an objective index of body weight relative to height (kg / m ^ 2) that indicates whether a weight is relatively high or low for a given height, ideally 18.5 to 24.9
+ children: Number of children covered by health insurance / Number of dependents
+ smoker: Whether the beneficiary smokes (yes or no)
+ region: The beneficiary's residential area in the US: northeast, southeast, southwest, or northwest
+ charges: Individual medical costs billed by health insurance
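
As a brief illustration of the `bmi` field, the sketch below computes body mass index from weight and height using the kg / m ^ 2 formula given above. The function name `compute_bmi` and the sample measurements are hypothetical and do not come from the dataset.

```python
def compute_bmi(weight_kg, height_m):
    """Return body mass index: weight in kilograms divided by height in meters squared."""
    if height_m <= 0:
        raise ValueError('height_m must be positive')
    return weight_kg / height_m ** 2

# hypothetical measurements, for illustration only
example_bmi = compute_bmi(70, 1.75)
print(round(example_bmi, 2))
```

With these made-up measurements the result falls inside the 18.5 to 24.9 range listed above.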

The goals of this project are to:

+ Read data from a csv file.
+ Organize the data into dictionaries. One dictionary will hold all of the data, and four more will each hold the data for one region.
+ Create a class with methods for analyzing the data in each of those regions.
+ Create objects for each region and analyze the data.


# Step 1: Import the csv file, populate lists, and create a national dictionary

In [18]:
import csv

age = []
sex = []
bmi = []
children = []
smoker = []
region = []
charges = []

# read from the csv file and populate the above lists
with open('insurance.csv') as insurance_csv:
    insurance_reader = csv.DictReader(insurance_csv)
    for row in insurance_reader:
        age.append(row['age'])
        sex.append(row['sex'])
        bmi.append(row['bmi'])
        children.append(row['children'])
        smoker.append(row['smoker'])
        region.append(row['region'])
        charges.append(row['charges'])
        
# create a national dictionary to store all of the lists, and an additional 'Geography' key to act as a name
national_dict = {'Geography': 'all regions',
                 'Age': age, 
                 'Sex': sex, 
                 'BMI': bmi, 
                 'Children': children, 
                 'Smoker': smoker, 
                 'Region': region,
                 'Charges': charges
                }
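
Note that `csv.DictReader` yields every value as a string, so the lists above hold text such as `'19'` rather than numbers. The sketch below shows one way the numeric columns might be converted while reading. It uses an in-memory file with two made-up rows in place of `insurance.csv`, and the `NUMERIC_COLUMNS` mapping and `read_columns` helper are hypothetical names introduced for illustration.

```python
import csv
import io

# made-up rows standing in for insurance.csv (not real data)
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
30,female,25.5,0,no,northeast,4000.50
45,male,31.2,2,yes,southeast,21000.75
"""

# hypothetical mapping from column name to the type it should be converted to
NUMERIC_COLUMNS = {'age': int, 'bmi': float, 'children': int, 'charges': float}

def read_columns(csv_file):
    """Read a csv file object into a dict of column lists, converting numeric columns."""
    reader = csv.DictReader(csv_file)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            # columns without an entry in NUMERIC_COLUMNS stay as strings
            convert = NUMERIC_COLUMNS.get(name, str)
            columns[name].append(convert(value))
    return columns

columns = read_columns(io.StringIO(SAMPLE_CSV))
print(columns['age'])      # ints rather than strings
print(columns['charges'])  # floats rather than strings
```

The program in this notebook keeps the values as strings and converts them with `float()` inside `find_avg` instead, which works equally well for the averages computed here.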


# Step 2: Create a dictionary for each region

In [19]:
# creates a regional dictionary with the same keys as 'national_dict' (minus 'Region', which the
# 'Geography' key already records) but which holds values for an individual region
def create_regional_dict(region):
    regional_dict = {'Geography': region,
                     'Age': [],
                     'Sex': [],
                     'BMI': [],
                     'Children': [],
                     'Smoker': [],
                     'Charges': []
                    }
    for i in range(len(national_dict['Age'])):
        if national_dict['Region'][i] == region:
            regional_dict['Age'].append(national_dict['Age'][i])
            regional_dict['Sex'].append(national_dict['Sex'][i])
            regional_dict['BMI'].append(national_dict['BMI'][i])
            regional_dict['Children'].append(national_dict['Children'][i])
            regional_dict['Smoker'].append(national_dict['Smoker'][i])
            regional_dict['Charges'].append(national_dict['Charges'][i])
    return regional_dict
        
# use the above function to create regional dictionaries
northeast_dict = create_regional_dict('northeast')
southeast_dict = create_regional_dict('southeast')
southwest_dict = create_regional_dict('southwest')
northwest_dict = create_regional_dict('northwest')
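
`create_regional_dict` scans the full national dictionary once per region. An alternative, sketched below, groups every row by region in a single pass. The tiny `sample_national_dict` holds made-up values, and `split_by_region` is a hypothetical helper, not part of the original program.

```python
# made-up stand-in for national_dict with only three patients (not real data)
sample_national_dict = {'Geography': 'all regions',
                        'Age': ['30', '45', '52'],
                        'Sex': ['female', 'male', 'female'],
                        'BMI': ['25.5', '31.2', '28.0'],
                        'Children': ['0', '2', '1'],
                        'Smoker': ['no', 'yes', 'no'],
                        'Region': ['northeast', 'southeast', 'northeast'],
                        'Charges': ['4000.50', '21000.75', '9000.00']
                       }

def split_by_region(national):
    """Group a national dictionary into one dictionary per region in a single pass."""
    data_keys = [key for key in national if key not in ('Geography', 'Region')]
    regional_dicts = {}
    for i, region_name in enumerate(national['Region']):
        # create the regional dictionary the first time a region is seen
        if region_name not in regional_dicts:
            regional_dicts[region_name] = {'Geography': region_name}
            for key in data_keys:
                regional_dicts[region_name][key] = []
        # copy this patient's values into the matching regional lists
        for key in data_keys:
            regional_dicts[region_name][key].append(national[key][i])
    return regional_dicts

sample_regions = split_by_region(sample_national_dict)
print(sample_regions['northeast']['Age'])
```

One trade-off of this approach is that a region with no rows simply does not appear in the result, whereas `create_regional_dict` returns a dictionary with empty lists for it.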


# Step 3: Make a `PatientsData` class with some methods we can use to analyze the data

The `region_details` method prints summary statistics for a given regional or national dataset. The other methods supply the values that `region_details` needs, but they can also be used on their own for further analysis.

In [20]:
class PatientsData:
    def __init__(self, patient_geography, patient_age, patient_sex, patient_bmi, patient_children, 
                 patient_smoker, patient_charges):
        self.patient_geography = patient_geography
        self.patient_age = patient_age
        self.patient_sex = patient_sex
        self.patient_bmi = patient_bmi
        self.patient_children = patient_children
        self.patient_smoker = patient_smoker
        self.patient_charges = patient_charges
    
    # finds the average of a list of numeric values (ints, floats, or numeric strings)
    def find_avg(self, list_name):
        total = sum(float(value) for value in list_name)
        return total / len(list_name)
        
    def find_avg_age(self):
        return int(self.find_avg(self.patient_age))
    
    def find_avg_bmi(self):
        return round(self.find_avg(self.patient_bmi), 2)
    
    def find_avg_children(self):
        return int(self.find_avg(self.patient_children))
    
    def find_avg_charges(self):
        return round(self.find_avg(self.patient_charges), 2)
    
    # prints a summary of a region, including total people, number of males vs females, avg age,
    # avg bmi, avg number of children, number of smokers vs nonsmokers, and avg patient charges
    def region_details(self):
        geography = self.patient_geography
        avg_age = self.find_avg_age()
        total_people = len(self.patient_sex)
        num_male = self.patient_sex.count('male')
        num_female = self.patient_sex.count('female')
        avg_bmi = self.find_avg_bmi()
        avg_children = self.find_avg_children()
        num_smoker = self.patient_smoker.count('yes')
        num_nonsmoker = self.patient_smoker.count('no')
        avg_charges = self.find_avg_charges()
        print('Geography: {}\nTotal people: {}\nMales: {}\nFemales: {}\nAverage age: {}\nAverage BMI: {}\nAverage number of children: {}\nSmokers: {}\nNonsmokers: {}\nAverage patient charge: {}\n'
              .format(geography, total_people, num_male, num_female, avg_age, avg_bmi, avg_children, num_smoker, num_nonsmoker, avg_charges))
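
One detail worth noting is that `find_avg_age` and `find_avg_children` wrap the average in `int()`, which truncates toward zero rather than rounding to the nearest whole number. The short sketch below contrasts the two using made-up ages. It uses `statistics.mean` from the standard library instead of the class's `find_avg`, purely to keep the example self-contained.

```python
from statistics import mean

# made-up ages stored as strings, as they would be when read by csv.DictReader
sample_ages = ['38', '39', '40', '42']

average_age = mean([float(age) for age in sample_ages])

truncated_age = int(average_age)   # drops the fractional part
rounded_age = round(average_age)   # rounds to the nearest whole number

print(average_age, truncated_age, rounded_age)
```

For these sample values the average is 39.75, so truncation reports 39 while rounding reports 40. Either convention is reasonable as long as it is applied consistently when comparing regions.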


# Step 4: Create an instance of the `PatientsData` class for the national data and for each region


In [21]:
national_patients_data = PatientsData(national_dict['Geography'],
                                      national_dict['Age'], 
                                      national_dict['Sex'], 
                                      national_dict['BMI'], 
                                      national_dict['Children'], 
                                      national_dict['Smoker'], 
                                      national_dict['Charges']
                                     )
northeast_patients_data = PatientsData(northeast_dict['Geography'],
                                       northeast_dict['Age'],
                                       northeast_dict['Sex'],
                                       northeast_dict['BMI'],
                                       northeast_dict['Children'],
                                       northeast_dict['Smoker'],
                                       northeast_dict['Charges']
                                      )
southeast_patients_data = PatientsData(southeast_dict['Geography'],
                                       southeast_dict['Age'],
                                       southeast_dict['Sex'],
                                       southeast_dict['BMI'],
                                       southeast_dict['Children'],
                                       southeast_dict['Smoker'],
                                       southeast_dict['Charges']
                                      )
southwest_patients_data = PatientsData(southwest_dict['Geography'],
                                       southwest_dict['Age'],
                                       southwest_dict['Sex'],
                                       southwest_dict['BMI'],
                                       southwest_dict['Children'],
                                       southwest_dict['Smoker'],
                                       southwest_dict['Charges']
                                      )
northwest_patients_data = PatientsData(northwest_dict['Geography'],
                                       northwest_dict['Age'],
                                       northwest_dict['Sex'],
                                       northwest_dict['BMI'],
                                       northwest_dict['Children'],
                                       northwest_dict['Smoker'],
                                       northwest_dict['Charges']
                                      )
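
The five constructor calls above repeat the same seven dictionary lookups. A small helper could remove that repetition, as in the sketch below. `CONSTRUCTOR_KEYS`, `build_from_dict`, and the `StandInPatientsData` class are hypothetical names used only to keep the example self-contained, and the sample dictionary holds made-up values.

```python
# keys in the order that the PatientsData constructor expects its arguments
CONSTRUCTOR_KEYS = ['Geography', 'Age', 'Sex', 'BMI', 'Children', 'Smoker', 'Charges']

def build_from_dict(data_class, data_dict):
    """Create an instance of data_class by passing dictionary values in constructor order."""
    return data_class(*(data_dict[key] for key in CONSTRUCTOR_KEYS))

class StandInPatientsData:
    """Minimal stand-in with the same constructor signature as PatientsData."""
    def __init__(self, patient_geography, patient_age, patient_sex, patient_bmi,
                 patient_children, patient_smoker, patient_charges):
        self.patient_geography = patient_geography
        self.patient_age = patient_age
        self.patient_sex = patient_sex
        self.patient_bmi = patient_bmi
        self.patient_children = patient_children
        self.patient_smoker = patient_smoker
        self.patient_charges = patient_charges

# made-up regional dictionary (not real data)
sample_dict = {'Geography': 'northeast',
               'Age': ['30', '52'],
               'Sex': ['female', 'female'],
               'BMI': ['25.5', '28.0'],
               'Children': ['0', '1'],
               'Smoker': ['no', 'no'],
               'Charges': ['4000.50', '9000.00']
              }

sample_data = build_from_dict(StandInPatientsData, sample_dict)
print(sample_data.patient_geography, sample_data.patient_charges)
```

Under this assumption, each instance in the notebook could be created with a call such as `build_from_dict(PatientsData, northeast_dict)`. Extra keys like `'Region'` in `national_dict` are ignored because only the listed keys are read.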


# Step 5: Analyze!

Finally, we can call `region_details` on each region to compare the data across regions.

The results show that the four US regions are represented fairly evenly in the national dataset, with a slightly larger share from the southeast (364 of 1338 people, versus 324 or 325 for each of the other regions). Each region also has a roughly even male-to-female ratio.

The southeast region has a notably higher average charge (14735.41, versus 13270.42 nationally), which coincides with a higher average BMI (33.36, versus 30.66 nationally) than the other regions. Further analysis would be required to determine whether the relationship is causal.

In [22]:
national_patients_data.region_details()
northeast_patients_data.region_details()
southeast_patients_data.region_details()
southwest_patients_data.region_details()
northwest_patients_data.region_details()

Geography: all regions
Total people: 1338
Males: 676
Females: 662
Average age: 39
Average BMI: 30.66
Average number of children: 1
Smokers: 274
Nonsmokers: 1064
Average patient charge: 13270.42

Geography: northeast
Total people: 324
Males: 163
Females: 161
Average age: 39
Average BMI: 29.17
Average number of children: 1
Smokers: 67
Nonsmokers: 257
Average patient charge: 13406.38

Geography: southeast
Total people: 364
Males: 189
Females: 175
Average age: 38
Average BMI: 33.36
Average number of children: 1
Smokers: 91
Nonsmokers: 273
Average patient charge: 14735.41

Geography: southwest
Total people: 325
Males: 163
Females: 162
Average age: 39
Average BMI: 30.6
Average number of children: 1
Smokers: 58
Nonsmokers: 267
Average patient charge: 12346.94

Geography: northwest
Total people: 325
Males: 161
Females: 164
Average age: 39
Average BMI: 29.2
Average number of children: 1
Smokers: 58
Nonsmokers: 267
Average patient charge: 12417.58
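
The smoker counts printed above are easier to compare across regions when expressed as a share of each region's total. The sketch below does that arithmetic using counts copied from the output. The `smoker_share` helper is a hypothetical addition and not part of the `PatientsData` class.

```python
# (smokers, total people) copied from the region_details output above
smoker_counts = {'all regions': (274, 1338),
                 'northeast': (67, 324),
                 'southeast': (91, 364),
                 'southwest': (58, 325),
                 'northwest': (58, 325)
                }

def smoker_share(num_smokers, total_people):
    """Return the percentage of people who smoke, rounded to one decimal place."""
    return round(100 * num_smokers / total_people, 1)

smoker_shares = {geography: smoker_share(smokers, total)
                 for geography, (smokers, total) in smoker_counts.items()}

for geography, share in smoker_shares.items():
    print('{}: {}% smokers'.format(geography, share))
```

On these counts the southeast has the highest share of smokers, at 25.0%, which may be worth considering alongside BMI in any further analysis of charges.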

