# U.S. Medical Insurance Costs

In this project, I will be investigating US medical insurance costs within **insurance.csv** to learn more about the patient 
information in the file and gain insights into potential use cases for the dataset.

In [169]:
# Import pandas modules
import pandas as pd

In [170]:
# Read and store csv into a variable uses pandas
df = pd.read_csv('insurance.csv')

print(df.head())

   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520


In [171]:
# Store column values into corresponding lists
ages = df['age'].tolist()
sexes = df['sex'].tolist()
bmis = df['bmi'].tolist()
num_children = df['children'].tolist()
smoker_status = df['smoker'].tolist()
regions = df['region'].tolist()
charges = df['charges'].tolist()

In [179]:
# Create a class that will gain perform analysis on insurance datas
class PatientsInfo:
    def __init__(self, patients_ages, patients_sex, patients_bmis, patients_num_children, patients_smoker_status,
                 patients_regions, patients_charges):
        self.patients_ages = patients_ages
        self.patients_sex = patients_sex
        self.patients_bmis = patients_bmis
        self.patients_num_children = patients_num_children
        self.patients_smoker_status = patients_smoker_status
        self.patients_regions = patients_regions
        self.patients_charges = patients_charges
        
    # Method that calculates the average ages of the patients in insurance.csv
    def analyse_ages(self):
        # initialize total age at zero
        total_age = 0
        # iterate through all ages in the ages list
        for age in self.patients_ages:
            # sum of the total age
            total_age += int(age)
        # Return total age divided by the length of the patient list
        return ("Average Patient Age: " + str(round(total_age/len(self.patients_ages), 2)) + " years")
    
    # Method to find the number of patients in region and return the resultant data as a dictionary
    def analyse_regions(self):
        # Initialise dictionary
        region_count = {} # region -> total patients
        # Iterate through list and increment each region accordingly
        for region in self.patients_regions:
            region_count[region] = 1 + region_count.get(region , 0)
        # Return the dictionary
        return region_count
    
    # Method to calculate the average yearly charge
    def average_charges(self):
        # Initialise total charges
        total_charges = 0
        # Iterate through list
        for charge in self.patients_charges:
            total_charges += charge
        # Return the average 
        return ("Average Yearly Medical Insurance Charges: " +  
                str(round(total_charges/len(self.patients_charges), 2)) + " dollars.")
    
    # Method to calculate average insurance cost for each region
    def average_region_cost(self):
        # Initialise empty dictionary
        region_total_charges = {} # region -> total charge
        
        # Use analyse.regions() to return unique region dictionary with its corresponding patient population
        unique_regions = self.analyse_regions()
        
        # Iterate through patients_regions to increment the total charge for each unique region
        for index, region in enumerate(self.patients_regions):
            region_total_charges[region] = self.patients_charges[index] + region_total_charges.get(region, 0)
        
        # Iterate through region_total_charges to calculate average using the unique_region dictionary
        for region in region_total_charges:
            region_total_charges[region] = region_total_charges.get(region, 0) // unique_regions[region]
        return region_total_charges
    
    # Method to calculate the difference between the average non-smoker and smoker insurance cost
    def analyse_smoker(self):
        # Initalise total charges for smoker and non-smoker as well as the total count
        smoker_total_charge = 0
        non_smoker_total_charge = 0
        total_smokers = 0
        total_non_smokers = 0
        
        # Interate through the patient smoker status list
        for index, status in enumerate(self.patients_smoker_status):
            if status == 'yes':
                smoker_total_charge += self.patients_charges[index]
                total_smokers += 1
            non_smoker_total_charge += self.patients_charges[index]
            total_non_smokers += 1
        
        # Calculate the average
        average_smoker_charge = smoker_total_charge // total_smokers
        average_non_smoker_charge = non_smoker_total_charge // total_non_smokers
        
        # Return the average for both groups as well as the difference between them
        return ("The Average Yearly Medical Insurance Charges for Smokers is " + str(average_smoker_charge) + 
               " and for non-smoker is " + str(average_non_smoker_charge) + 
               " The difference between the groups is " + str(average_smoker_charge - average_non_smoker_charge))
    
    # Method to retrieve average charge per age group
    def cost_per_age_group(self):
        # Initialise age group varaibles and set to zero
        total_18_25 = 0
        total_26_35 = 0
        total_36_45 = 0
        total_46_55 = 0
        total_56_65 = 0
        total_over_65 = 0
        count_18_25 = 0
        count_26_35 = 0
        count_36_45 = 0
        count_46_55 = 0
        count_56_65 = 0
        count_over_65 = 0
        # Iterate through patients ages and increment the corresponding total and count variables
        for index, age in enumerate(self.patients_ages):
            if age >= 18 and age <= 25:
                total_18_25 += self.patients_charges[index]
                count_18_25 += 1
            elif age >= 26 and age <= 35:
                total_26_35 += self.patients_charges[index]
                count_26_35 += 1
            elif age >= 36 and age <= 45:
                total_36_45 += self.patients_charges[index]
                count_36_45 += 1
            elif age >= 46 and age <= 55:
                total_46_55 += self.patients_charges[index]
                count_46_55 += 1
            elif age >= 56 and age <= 65:
                total_56_65 += self.patients_charges[index]
                count_56_65 += 1
            else:
                total_over_65 += self.patients_charges[index]
                count_over_65 += 1
                
        # Calculate the average
        avg_18_25 = round((total_18_25 / count_18_25),2)
        avg_26_35 = round((total_26_35 / count_18_25),2)
        avg_36_45 = round((total_36_45 / count_18_25),2)
        avg_46_55 = round((total_46_55 / count_18_25),2)
        avg_56_65 = round((total_56_65 / count_18_25),2)
        avg_over_65 = round((total_over_65 / count_18_25),2)
    
        # Create a dictionary to hold the results
        price_per_age_group = {} # age group -> average cost
        price_per_age_group['Avg 18-25'] = avg_18_25
        price_per_age_group['Avg 26-35'] = avg_26_35
        price_per_age_group['Avg 36-45'] = avg_36_45
        price_per_age_group['Avg 45-55'] = avg_46_55
        price_per_age_group['Avg_56-65'] = avg_56_65
        price_per_age_group['Avg over 65'] = avg_over_65
        
        # Return dictionary 
        return price_per_age_group

In [180]:
# Create an instance of the PatientInfo class to analyse insurance data
patient_info = PatientsInfo(ages, sexes, bmis, num_children, smoker_status, regions, charges)

In [181]:
patient_info.analyse_ages()

'Average Patient Age: 39.21 years'

The average age of the patients in **insurance.csv** is about 39 years old. This is important to check in order to ensure the data in **insurance.csv** is representative for a broader population. If it is decided to use the dataset to make inferences about other populations, the data must abundant and broad enough for such use cases.

A further analysis would have to be done to make sure the range and standard deviation of the patient age group in **insurance.csv** is indicative of a random sampling of individuals. 

In [182]:
patient_info.average_charges()

'Average Yearly Medical Insurance Charges: 13270.42 dollars.'

The average yearly medical insurance charge per individual is 13270 US dollars. Some further analysis could be done to see what patient attributes contribute most strongly to low and/or high medical insurance charges. For example, one could check if patient age correlates with the amount of money they spend yearly.

Since the data of insurance is likely to be skewed by outliers with extremely high insurance cost, finding the IQR range of the charges will give us insight on the spread of insurance cost as well showing us the a better representation of the likely average. The mean calulcated by patient_info is likely to be skewed. Although the data might be biased towards southeast patients

In [183]:
patient_info.analyse_regions()

{'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324}

There are four regions representing the US medical insurance data. The regions all have near the same population with only southeast having a slightly signficant population. The data will represent each region well. 

In [184]:
patient_info.analyse_smoker()

'The Average Yearly Medical Insurance Charges for Smokers is 32050.0 and for non-smoker is 13270.0 The difference between the groups is 18780.0'

The difference between the average smoking and non smoking charge is 18780.0 dollars. It is therefore recommeded to not smoke for a cheaper insurance cost. 

In [185]:
patient_info.cost_per_age_group()

{'Avg 18-25': 9087.02,
 'Avg 26-35': 9191.84,
 'Avg 36-45': 11641.44,
 'Avg 45-55': 14837.52,
 'Avg_56-65': 13267.76,
 'Avg over 65': 0.0}

Medical insurance costs generally increase with age. Patients in the 18-25 and 26-35 age brackets pay significantly less in annual medical insurance bills than older patients. However, the 46-55 age group spends more on annual medical insurance than the 56-65s. There is also no recorded insrance cost for patients over the age of 65. 