# Medical Insurance Costs Analysis
This notebook analyzes data from [Kaggle's Medical Cost Personal Datasets](https://www.kaggle.com/datasets/mirichoi0218/insurance) and aims to answer four main questions. 

1. What is the average age of the patients in the dataset?
2. Where are the majority of the individuals from?
3. Are insurance costs significantly different between smokers and non-smokers?
4. Do children have a significant effect on a patient's insurance cost?

First, let's load the data using Pandas. 

In [36]:
import pandas as pd
import seaborn as sns

insurance_df = pd.read_csv('insurance.csv')

insurance_df.head()

   age     sex     bmi  children smoker     region      charges
0   19  female    27.9         0    yes  southwest    16884.924
1   18    male   33.77         1     no  southeast    1725.5523
2   28    male    33.0         3     no  southeast     4449.462
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male   28.88         0     no  northwest    3866.8552


## 1. Average age of patients
Since the average is heavily influenced by outliers, the median will also be calculated to check whether the average is a meaningful measure of central tendency.

In [31]:
avg_age = round(insurance_df['age'].mean(), 2)
med_age = round(insurance_df['age'].median(), 2)

print(f'The average age of patients in the dataset is {avg_age}')
print(f'The median age of patients in the dataset is {med_age}')


The average age of patients in the dataset is 39.21
The median age of patients in the dataset is 39.0


Since the average and median values are very close, the average age of 39.21 for patients is a meaningful measure of central tendency in the dataset. 
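
To illustrate why comparing the two statistics is a useful check, here is a minimal sketch using only the standard library. The ages are hypothetical and are not taken from `insurance.csv`; the helper `summarize` is introduced here purely for illustration.

```python
from statistics import mean, median

# Hypothetical ages for illustration only (not from insurance.csv)
ages = [18, 25, 33, 39, 41, 52, 64]

# Same ages plus one extreme value, e.g. a data-entry error
ages_with_outlier = ages + [640]


def summarize(values):
    """Return (mean, median) of the values, each rounded to 2 decimals."""
    return round(mean(values), 2), round(median(values), 2)


clean_mean, clean_median = summarize(ages)
skewed_mean, skewed_median = summarize(ages_with_outlier)

print(f'Without outlier: mean={clean_mean}, median={clean_median}')
print(f'With outlier:    mean={skewed_mean}, median={skewed_median}')
```

In this made-up sample, a single extreme value pulls the mean from about 38.86 to 114, while the median only moves from 39 to 40. A large gap between the two would therefore suggest outliers or skew, whereas a small gap (as in the dataset here) suggests the mean is representative.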

## 2. Where are the majority of the individuals from?
Next, the region column will be analyzed to determine where the majority of patients are from. First, let's see which regions we are dealing with.

In [32]:
distinct_regions = insurance_df['region'].unique()

print(f'There are a total of {len(distinct_regions)} distinct regions in the dataset that include: {distinct_regions}')


There are a total of 4 distinct regions in the dataset that include: ['southwest' 'southeast' 'northwest' 'northeast']
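
Knowing the distinct labels does not yet say how many patients fall under each one. As a rough sketch of one way to get that, pandas' `Series.value_counts` tallies rows per label. The frame below is a small hypothetical sample, not the real dataset, and the names `sample_df`, `sample_counts`, and `sample_shares` are assumptions made for this example.

```python
import pandas as pd

# Hypothetical region labels for illustration only (not from insurance.csv)
sample_df = pd.DataFrame({
    'region': ['southeast', 'southwest', 'southeast', 'northwest',
               'northeast', 'southeast', 'southwest', 'northeast'],
})

# Number of rows per region, sorted from most to least frequent
sample_counts = sample_df['region'].value_counts()

# Share of all rows per region (fractions that sum to 1)
sample_shares = sample_df['region'].value_counts(normalize=True)

# Region with the most rows, and whether it holds more than half of them
top_region = sample_counts.idxmax()
has_majority = bool(sample_shares.max() > 0.5)

print(sample_counts.to_dict())
print(f'Largest region: {top_region}, strict majority: {has_majority}')
```

Note the distinction the shares make visible: a region can be the largest without holding a strict majority (more than half) of the patients, which is the case in this made-up sample.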


Furthermore, let's determine whether the majority of patients are in one region or whether patients are evenly distributed among all regions. A dictionary will be created to keep the data organized.

In [43]:
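# Map each region name to its number of patients
region_counts = {}

for reg in distinct_regions:
    # Select only the rows belonging to the current region
    region_rows = insurance_df.loc[insurance_df['region'] == reg]
    region_counts[reg] = len(region_rows)
    print(region_rows)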

      age     sex   bmi  children smoker     region      charges
0      19  female  27.9         0    yes  southwest  16884.92400
12     23    male  34.4         0     no  southwest   1826.84300
15     19    male  24.6         1     no  southwest   1837.23700
18     56    male  40.3         0     no  southwest  10602.38500
19     30    male  35.3         0    yes  southwest  36837.46700
...   ...     ...   ...       ...    ...        ...          ...
1316   19  female  20.6         0     no  southwest   1731.67700
1329   52    male  38.6         2     no  southwest  10325.20600
1331   23  female  33.4         0     no  southwest  10795.93733
1332   52  female  44.7         3     no  southwest  11411.68500
1336   21  female  25.8         0     no  southwest   2007.94500

[325 rows x 7 columns]
      age     sex    bmi  children smoker     region     charges
1      18    male  33.77         1     no  southeast   1725.5523
2      28    male  33.00         3     no  southeast   4449.4620
5