# U.S. Medical Insurance Costs

Hello! In this project, we will analyze the provided insurance data to find out how specific factors, such as age, region, and smoking status, affect how much a person is charged.

Here I will import the CSV file into a pandas DataFrame for analysis.
I will use the read_csv function from pandas to load it and inspect its format.
I will also import the statistics module to use for calculations.

In [176]:
import pandas as pd
import statistics as stat
insurance_dataframe=pd.read_csv('insurance.csv')
print(insurance_dataframe)

      age     sex     bmi  children smoker     region      charges
0      19  female  27.900         0    yes  southwest  16884.92400
1      18    male  33.770         1     no  southeast   1725.55230
2      28    male  33.000         3     no  southeast   4449.46200
3      33    male  22.705         0     no  northwest  21984.47061
4      32    male  28.880         0     no  northwest   3866.85520
...   ...     ...     ...       ...    ...        ...          ...
1333   50    male  30.970         3     no  northwest  10600.54830
1334   18  female  31.920         0     no  northeast   2205.98080
1335   18  female  36.850         0     no  southeast   1629.83350
1336   21  female  25.800         0     no  southwest   2007.94500
1337   61  female  29.070         0    yes  northwest  29141.36030

[1338 rows x 7 columns]


We can see that each row documents a person's age, sex, BMI, number of children, smoking status, region, and insurance charges.
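
Before analyzing the columns, it can help to check which ones are numeric and whether any values are missing. The following is a minimal sketch, not part of the original notebook: it rebuilds the first five rows printed above by hand so that it runs without insurance.csv, and the variable names (`sample`, `missing_per_column`, and so on) are my own.

```python
import pandas as pd

# The first five rows from the printed output above, rebuilt by hand
# so the sketch runs without insurance.csv.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.900, 33.770, 33.000, 22.705, 28.880],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520],
})

# Count missing values in each column.
missing_per_column = sample.isna().sum()

# Separate numeric columns from categorical (text) columns.
numeric_columns = sample.select_dtypes(include="number").columns.tolist()
categorical_columns = sample.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print("Numeric:", numeric_columns)
print("Categorical:", categorical_columns)
```

The same three calls should work on the full `insurance_dataframe` in place of `sample`.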

Here I will save each column to a variable for analysis. 

In [22]:
ages=insurance_dataframe['age']
sexes=insurance_dataframe['sex']
bmis=insurance_dataframe['bmi']
children=insurance_dataframe['children']
smokers=insurance_dataframe['smoker']
regions=insurance_dataframe['region']
charges=insurance_dataframe['charges']

First I would like to analyze the average age of this dataset to find any possible bias on this front.

In [168]:
def calculate_avg_age():
    return f'The average age for this dataset is {round(ages.mean())}.'
print(calculate_avg_age())

The average age for this dataset is 39.


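The average age is 39, which falls in the middle-aged range (people aged 35-49). A mean alone does not show how the ages are distributed, so this is only a rough indication of where the dataset is centered, not proof of a bias.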
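
To see why a mean by itself says little about bias, here is a small sketch that counts people per age band with `pd.cut`. The ages below are hypothetical values chosen for illustration, not taken from insurance.csv, and the band edges (18-34, 35-49, 50-64) are an assumption.

```python
import pandas as pd

# Hypothetical ages, for illustration only (not from insurance.csv).
example_ages = pd.Series([18, 19, 21, 28, 32, 33, 40, 50, 61, 64])

# Left-closed bins: [18, 35), [35, 50), [50, 65).
age_bands = pd.cut(
    example_ages,
    bins=[18, 35, 50, 65],
    right=False,
    labels=["18-34", "35-49", "50-64"],
)

# Number of people in each band, in band order.
band_counts = age_bands.value_counts().sort_index()

print("Mean age:", example_ages.mean())
print(band_counts)
```

In this toy example the mean is 36.6, which lands in the 35-49 band, yet six of the ten people are under 35 and only one is actually middle-aged. Running the same counts on the `ages` column would show where the real dataset is concentrated.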


Next, I would like to see how age influences the charge amount.

In [182]:
def compare_charges():
    # Total charges for each distinct age (one row per age)
    ages_charges=insurance_dataframe.groupby('age').charges.sum().reset_index()
    # Note: the strict comparisons leave ages 35 and 50 out of every group
    older_people=ages_charges[ages_charges.age>50]
    middle_aged_people=ages_charges[(ages_charges.age>35) & (ages_charges.age<50)]
    younger_people=ages_charges[ages_charges.age<35]

    # Average of the per-age totals in each group (not a per-person average)
    old_avg=stat.mean(list(older_people.charges))
    middle_avg=stat.mean(list(middle_aged_people.charges))
    young_avg=stat.mean(list(younger_people.charges))

    print(f'For ages under 35, the total charges per age average ${round(young_avg,2)}.\nFor ages 36-49, the total charges per age average ${round(middle_avg,2)}.\nFor ages over 50, the total charges per age average ${round(old_avg,2)}.')

compare_charges()

For ages under 35, the total charges per age average $312391.23.
For ages 36-49, the total charges per age average $376429.42.
For ages over 50, the total charges per age average $459875.39.


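Looking at this data, we can tentatively infer that the older you are, the more you pay in charges. Note that these figures are averages of the total charges at each age, not per-person averages, so they also depend on how many people of each age are in the dataset. The comparisons in the code also leave out people aged exactly 35 and 50.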
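
A more direct comparison is the mean charge per person in each age band. The sketch below shows one way to compute it with `pd.cut` and `groupby`, and contrasts it with the average of per-age totals used above. The rows are hypothetical and the band edges are an assumption; this is an illustration, not the notebook's original method.

```python
import pandas as pd

# Hypothetical rows, for illustration only (not from insurance.csv).
toy = pd.DataFrame({
    "age": [20, 20, 20, 34, 35, 40, 50, 60],
    "charges": [1000.0, 2000.0, 3000.0, 2000.0, 4000.0, 6000.0, 8000.0, 10000.0],
})

# Left-closed bins, so ages 35 and 50 each belong to exactly one band.
toy["age_band"] = pd.cut(
    toy["age"],
    bins=[18, 35, 50, 65],
    right=False,
    labels=["18-34", "35-49", "50-64"],
)

# Mean charge per person within each band.
per_person_mean = toy.groupby("age_band", observed=True)["charges"].mean()

# For contrast: total charges per age, then the average of those totals
# for ages under 35 (the approach taken in compare_charges).
per_age_total = toy.groupby("age")["charges"].sum()
mean_of_totals_under_35 = per_age_total[per_age_total.index < 35].mean()

print(per_person_mean)
print("Average of per-age totals, under 35:", mean_of_totals_under_35)
```

Here the per-person mean for the 18-34 band is 2000.0, while the average of per-age totals is 4000.0, because three people share age 20. The second figure grows with the number of people at each age, which is why the values printed above are so much larger than any single person's charges.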

Next, we would like to find out how region affects the charge amount. We will do this by creating a function that gives us the total and average charge amount per region, along with the average per region broken down by sex.

In [214]:
def region_affect_charge():
    charge_sum_per_region=insurance_dataframe.groupby('region').charges.sum().reset_index()
    charge_sum_per_region.charges=charge_sum_per_region.charges.round()
    print('This is the total amount in charges per region:\n',charge_sum_per_region)
    charge_avg_per_region=insurance_dataframe.groupby('region').charges.mean().reset_index()
    charge_avg_per_region.charges=charge_avg_per_region.charges.round(2)
    print('This is the average amount in charges per region:\n',charge_avg_per_region)

    charge_sex_per_region=insurance_dataframe.groupby(['region','sex']).charges.mean().reset_index().pivot(
        columns='sex',
        index='region',
        values='charges'
    ).reset_index()
    charge_sex_per_region.female=charge_sex_per_region.female.round(2)
    charge_sex_per_region.male=charge_sex_per_region.male.round(2)
    def calculate_diff(x):
        if x['female']>x['male']:
            diff=round(x['female']-x['male'],2)
            return f'{diff} (F)'
        else:
            diff=round(x['male']-x['female'],2)
            return f'{diff} (M)'
    charge_sex_per_region['difference']=charge_sex_per_region.apply(lambda x:
        calculate_diff(x),
        axis=1
    )

    print('This is the average amount in charges per region, organized by sex: \n',charge_sex_per_region)
region_affect_charge()

This is the total amount in charges per region:
       region    charges
0  northeast  4343669.0
1  northwest  4035712.0
2  southeast  5363690.0
3  southwest  4012755.0
This is the average amount in charges per region:
       region   charges
0  northeast  13406.38
1  northwest  12417.58
2  southeast  14735.41
3  southwest  12346.94
This is the average amount in charges per region, organized by sex: 
 sex     region    female      male   difference
0    northeast  12953.20  13854.01   900.81 (M)
1    northwest  12479.87  12354.12   125.75 (F)
2    southeast  13499.67  15879.62  2379.95 (M)
3    southwest  11274.41  13412.88  2138.47 (M)


From our results, we can see that people in the southeast region are charged more on average than people in the other regions. The gap between the sexes is also widest there: males in the southeast are charged about $2379.95 more than females in the southeast.

Conversely, people in the southwest are charged the least on average. Females in that region in particular are charged about $2138.47 less than males in the same region.
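
To put a number on how far each region sits below the most expensive one, the regional averages can be subtracted from the highest average. This sketch copies the four averages from the output above into a Series; the variable names are my own.

```python
import pandas as pd

# Regional average charges, copied from the printed output above.
region_means = pd.Series({
    "northeast": 13406.38,
    "northwest": 12417.58,
    "southeast": 14735.41,
    "southwest": 12346.94,
})

# Region with the highest average charge.
highest_region = region_means.idxmax()

# How far each region's average falls below the highest one.
gap_to_highest = (region_means[highest_region] - region_means).round(2)

print("Highest:", highest_region)
print(gap_to_highest.sort_values())
```

This gives gaps of 1329.03 (northeast), 2317.83 (northwest), and 2388.47 (southwest) relative to the southeast. These are differences between regional averages, which is a separate quantity from the male-female differences in the last table.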

The final test is to see whether being a smoker is associated with higher charges.

In [220]:
def smoker_affect_charges():
    smoker_charges_avg=insurance_dataframe.groupby('smoker').charges.mean().reset_index()
    print('This is the average charge:\n',smoker_charges_avg)

    smoker_charges_avg_sex=insurance_dataframe.groupby(['smoker','sex']).charges.mean().reset_index().pivot(
        columns='sex',
        index='smoker',
        values='charges'
    ).reset_index()

    def calculate_diff(x):
        if x['female']>x['male']:
            diff=round(x['female']-x['male'],2)
            return f'{diff} (F)'
        else:
            diff=round(x['male']-x['female'],2)
            return f'{diff} (M)'
    smoker_charges_avg_sex.female=smoker_charges_avg_sex.female.round(2)
    smoker_charges_avg_sex.male=smoker_charges_avg_sex.male.round(2)
    smoker_charges_avg_sex['difference']=smoker_charges_avg_sex.apply(lambda x:
        calculate_diff(x),
        axis=1
    )

    print('This is the average charge based on sex:\n',smoker_charges_avg_sex)
smoker_affect_charges()

This is the average charge:
   smoker       charges
0     no   8434.268298
1    yes  32050.231832
This is the average charge based on sex:
 sex smoker   female      male   difference
0       no   8762.3   8087.20    675.1 (F)
1      yes  30679.0  33042.01  2363.01 (M)


We can see here that you can expect higher charges if you are a smoker: on average, smokers are charged about $23616 more than non-smokers. Male smokers in particular are charged about $2363.01 more than female smokers.
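
The region and smoker functions repeat the same group, pivot, and difference steps. As a possible refactoring, the sketch below wraps them in one helper. The name `mean_charges_by_sex` and the signed `male_minus_female` column are my own choices rather than part of the original notebook, and the rows are hypothetical.

```python
import pandas as pd

def mean_charges_by_sex(df, group_column):
    """Mean charges per group, split by sex, plus a signed male-minus-female gap."""
    table = df.pivot_table(
        index=group_column,
        columns="sex",
        values="charges",
        aggfunc="mean",
    )
    # Positive values mean males are charged more; negative, females.
    table["male_minus_female"] = table["male"] - table["female"]
    return table.round(2)

# Hypothetical rows, for illustration only (not from insurance.csv).
toy = pd.DataFrame({
    "sex": ["female", "male", "female", "male", "female", "male"],
    "smoker": ["no", "no", "yes", "yes", "no", "yes"],
    "charges": [8000.0, 7000.0, 30000.0, 33000.0, 9000.0, 35000.0],
})

smoker_table = mean_charges_by_sex(toy, "smoker")
print(smoker_table)
```

Calling `mean_charges_by_sex(insurance_dataframe, 'region')` or `mean_charges_by_sex(insurance_dataframe, 'smoker')` should produce tables comparable to the ones above. A signed numeric difference is easier to sort and plot than the `(F)`/`(M)` strings, at the cost of being a little less readable in print.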


Summary:
Based on the results above, we can draw three conclusions:
1. The older you are, the more you pay in charges.
2. If you live in the southeast, you will typically get charged more than people living in other regions: the southeast average of $14735.41 is about $1329 to $2388 higher than the averages of the other three regions.
3. If you are a smoker, you will typically get charged about $23616 more than the average non-smoker.

Thank you for looking through this project! If you have any feedback on my results or queries, please let me know!