## Capstone Three 
### June 2021 - Junko Takasawa

### Data Wrangling
***

**US Health Insurance Dataset**
>**age** - age of the insured\
>**sex** - sex of the insured\
>**bmi** - body mass index (BMI) of the insured\
>**children** - number of children covered by this insurance\
>**smoker** - whether the insured smokes (yes/no)\
>**region** - region where the insured resides\
>**charges** - total charges for a calendar year
        
Data Source: https://www.kaggle.com/teertha/ushealthinsurancedataset

In [42]:
import pandas as pd
import numpy as np


In [43]:
df = pd.read_csv('insurance.csv')
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


Every column has 1,338 non-null entries, so the dataset has no NaN or missing values.
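
The non-null counts reported by `df.info()` can also be confirmed explicitly. The sketch below is only an illustration: `sample` is a hypothetical stand-in for `insurance.csv`, built from the first three rows shown above so that the cell runs without the file.

```python
import pandas as pd

# Hypothetical stand-in for insurance.csv: the first three rows shown above.
sample = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 33.77, 33.0],
    "children": [0, 1, 3],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
    "charges": [16884.924, 1725.5523, 4449.462],
})

# Count missing values per column; every count should be zero.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Collapse the check into a single boolean for the whole frame.
has_missing = bool(sample.isna().any().any())
print("Any missing values:", has_missing)
```

On the full dataset the same two calls would be applied to `df` instead of `sample`.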

In [45]:
# calculate the average monthly charge (annual charges / 12), rounded to cents

df['monthly_charge'] = (df['charges'] / 12).round(2)
df

Unnamed: 0,age,sex,bmi,children,smoker,region,charges,monthly_charge
0,19,female,27.900,0,yes,southwest,16884.92400,1407.08
1,18,male,33.770,1,no,southeast,1725.55230,143.80
2,28,male,33.000,3,no,southeast,4449.46200,370.79
3,33,male,22.705,0,no,northwest,21984.47061,1832.04
4,32,male,28.880,0,no,northwest,3866.85520,322.24
...,...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830,883.38
1334,18,female,31.920,0,no,northeast,2205.98080,183.83
1335,18,female,36.850,0,no,southeast,1629.83350,135.82
1336,21,female,25.800,0,no,southwest,2007.94500,167.33


In [46]:
# find min, max, and mean of age

print("Age (min):",df['age'].min())
print("Age (max):",df['age'].max())
print("Age (average):", round(df['age'].mean(), 1))

Age (min): 18
Age (max): 64
Age (average): 39.2


In [47]:
# assign each age to an age group (pd.cut bins are right-closed: 0-19, 20-29, ..., 60-80)

df['age_group'] = pd.cut(df.age,[0, 19, 29, 39, 49, 59, 80], labels=['10s', '20s', '30s', '40s', '50s', '60s+'])
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges,monthly_charge,age_group
0,19,female,27.9,0,yes,southwest,16884.924,1407.08,10s
1,18,male,33.77,1,no,southeast,1725.5523,143.8,10s
2,28,male,33.0,3,no,southeast,4449.462,370.79,20s
3,33,male,22.705,0,no,northwest,21984.47061,1832.04,30s
4,32,male,28.88,0,no,northwest,3866.8552,322.24,30s
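
By default `pd.cut` builds right-closed intervals, so edges of 19, 29, 39, ... put ages 19, 29, 39, ... in the lower group. A small sketch that checks a few boundary ages against the same edges and labels (the `boundary_ages` values are chosen here for illustration and are not taken from the dataset):

```python
import pandas as pd

# Same edges and labels as the age_group cell above.
edges = [0, 19, 29, 39, 49, 59, 80]
labels = ['10s', '20s', '30s', '40s', '50s', '60s+']

# Illustrative ages sitting on or next to the bin edges.
boundary_ages = pd.Series([18, 19, 20, 29, 30, 59, 60, 64])
groups = pd.cut(boundary_ages, edges, labels=labels)

# Show each age next to the group it falls into.
print(pd.DataFrame({"age": boundary_ages, "age_group": groups}))
```

Because ages in this dataset are integers between 18 and 64, every row falls inside one of the six bins and none is left as NaN.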


In [48]:
# find min, max, and mean of BMI

print("BMI (min):",df['bmi'].min())
print("BMI (max):",df['bmi'].max())
print("BMI (average):", round(df['bmi'].mean(), 1))

BMI (min): 15.96
BMI (max): 53.13
BMI (average): 30.7
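
The min, max, and mean for several columns can also be gathered in one call with `DataFrame.agg`. A minimal sketch, assuming a hypothetical `sample` frame built from the five rows of `df.head()` shown earlier (so the numbers differ from the full-dataset values printed above):

```python
import pandas as pd

# Hypothetical sample: age and bmi from the five rows of df.head().
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
})

# One row per statistic, one column per variable.
summary = sample[["age", "bmi"]].agg(["min", "max", "mean"]).round(1)
print(summary)
```

On the full data, replacing `sample` with `df` would give age and BMI statistics side by side.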


In [49]:
%%html
<style>
    table {
        display: inline-block
    }
</style>

**BMI and Corresponding Weight Status**

| BMI | Weight Status |
| --- | --- | 
| Below 18.5 | Underweight |
| 18.5 – 24.9 | Normal |
| 25.0 – 29.9 | Overweight |
| 30.0 and Above | Obese |

Data Source: https://www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/index.html

In [50]:
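# assign each BMI to a weight_status group
# left-closed bins so that 18.5, 25.0, and 30.0 each start a new category, as in the table above

df['weight_status'] = pd.cut(df['bmi'], [0, 18.5, 25, 30, 60], right=False, labels=['Underweight', 'Normal', 'Overweight', 'Obese'])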
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges,monthly_charge,age_group,weight_status
0,19,female,27.9,0,yes,southwest,16884.924,1407.08,10s,Overweight
1,18,male,33.77,1,no,southeast,1725.5523,143.8,10s,Obese
2,28,male,33.0,3,no,southeast,4449.462,370.79,20s,Obese
3,33,male,22.705,0,no,northwest,21984.47061,1832.04,30s,Normal
4,32,male,28.88,0,no,northwest,3866.8552,322.24,30s,Overweight
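
Boundary values are where a binning rule most easily drifts from the BMI table. The sketch below checks a handful of edge cases, assuming left-closed bins (`right=False`) with edges at 18.5, 25, and 30. The upper edge of 60 is an assumption that only needs to exceed the largest BMI in the data (53.13), and the `test_bmi` values are illustrative rather than drawn from the dataset.

```python
import pandas as pd

# Left-closed bins: [0, 18.5), [18.5, 25), [25, 30), [30, 60)
bmi_edges = [0, 18.5, 25, 30, 60]
status_labels = ['Underweight', 'Normal', 'Overweight', 'Obese']

# Illustrative BMI values on or just beside each boundary.
test_bmi = pd.Series([18.4, 18.5, 24.95, 25.0, 29.95, 30.0])
status = pd.cut(test_bmi, bmi_edges, labels=status_labels, right=False)

# Show each BMI next to its assigned category.
print(pd.DataFrame({"bmi": test_bmi, "weight_status": status}))
```

With these bins a BMI of exactly 18.5 is Normal and 24.95 stays Normal, which matches the ranges in the table.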


In [51]:
# check the distribution of dependents

df['children'].value_counts()

0    574
1    324
2    240
3    157
4     25
5     18
Name: children, dtype: int64
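
The counts are easier to compare as shares of all policyholders. A sketch that rebuilds the counts printed above as a Series (so it runs without the CSV) and converts them to percentages:

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({0: 574, 1: 324, 2: 240, 3: 157, 4: 25, 5: 18}, name="children")

# Share of policyholders for each number of children.
shares = counts / counts.sum()
print((shares * 100).round(1))
```

On the full frame, `df['children'].value_counts(normalize=True)` returns these shares directly. About 43% of the insured (574 of 1,338) have no children on their plan.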

In [52]:
# export the data to an Excel file (pandas needs an Excel writer such as openpyxl installed)

df.to_excel('insurance_data.xlsx')