# US Medical Insurance Exploration

In this notebook, we explore how common health variables affect the charges for US medical insurance. We start with a quick exploration and cleanup of the data, then analyse how age and BMI affect these costs. Finally, we use these two variables to model an expected insurance cost.

## Basic ETL

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import scipy.stats as stats
import seaborn as sns

In [None]:
df = pd.read_csv("insurance.csv")
# Encode the smoker flag as 1/0 so it can be used in numeric summaries
df["smoker"] = df["smoker"].map({"yes": 1, "no": 0})
df.head()

In [None]:
df.describe()

In [None]:
df.info()

We see that there are no missing values (NaN) and no obviously implausible values, so the data is already clean.
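One way to back up a statement like this with numbers is to count missing values per column and flag values outside Tukey's 1.5 × IQR fences. The sketch below runs on a small hypothetical frame (`toy` and its values are made up for illustration, not taken from `insurance.csv`); the same lines could be applied to the numeric columns of `df`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the insurance data
toy = pd.DataFrame({
    "age": [19, 25, 33, 41, 52, 64],
    "bmi": [27.9, 22.7, 30.1, 28.5, 33.0, 26.2],
    "charges": [1700.0, 2100.0, 4400.0, 6200.0, 9800.0, 60000.0],
})

# Missing values per column; all zeros means there is no NaN
missing_per_column = toy.isna().sum()

# Tukey's rule: flag values more than 1.5 * IQR beyond the quartiles
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
is_outlier = (toy < q1 - 1.5 * iqr) | (toy > q3 + 1.5 * iqr)
outliers_per_column = is_outlier.sum()

print(missing_per_column)
print(outliers_per_column)
```

Tukey's rule is only a heuristic: a skewed variable such as insurance charges can have many flagged values that are perfectly legitimate.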

In [None]:
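# numeric_only=True skips text columns such as sex and region (required in pandas 2.x)
df.corr(numeric_only=True).style.set_caption("General Correlation Table")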

# Age modeling

Here we first visualize the relationship between age and the other features. We start with a general plot, then split the sample into smokers and non-smokers.

In [None]:
# Average every numeric column per age; text columns are skipped
df_age_groupby = df.groupby("age").mean(numeric_only=True)

In [None]:
df_age_groupby.plot(subplots=True,figsize=(20,25))

In [None]:
df_smokers = df[df.smoker==1].copy()
df_smokers.drop(columns=["smoker"],inplace=True)
df_age_smokers_groupby = df_smokers.groupby("age").mean(numeric_only=True)
df_age_smokers_groupby.plot(subplots=True,figsize=(25,20))

In [None]:
df_no_smokers = df[df.smoker==0].copy()
df_no_smokers.drop(columns=["smoker"],inplace=True)
# Group the non-smokers here (the smokers were grouped in the previous cell)
df_age_no_smokers_groupby = df_no_smokers.groupby("age").mean(numeric_only=True)
df_age_no_smokers_groupby.plot(subplots=True,figsize=(25,20))

There is a clearer difference when we plot both groups together in a scatter plot.

In [None]:
plt.figure(figsize=(8,8))
# Label each series directly so the legend cannot get out of order
plt.scatter(df_no_smokers.age,df_no_smokers.charges,alpha=0.5,label="Non-smokers")
plt.scatter(df_smokers.age,df_smokers.charges,alpha=0.5,label="Smokers")

plt.legend()
plt.xlabel("Age")
plt.ylabel("Cost")
plt.title("Insurance Cost by Age")

plt.show()

# Same data, one panel per group, so each panel gets its own y-scale
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
plt.scatter(df_smokers.age,df_smokers.charges,alpha=0.5,color="darkorange")
plt.title("Smokers")

plt.subplot(1,2,2)
plt.scatter(df_no_smokers.age,df_no_smokers.charges,alpha=0.5)
plt.title("Non-smokers")

plt.show()

In [None]:
sns.lmplot(x = 'age', y = 'charges', hue = 'smoker', data = df)
plt.show()
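`sns.lmplot` draws one least-squares regression line per `hue` group. As a rough sketch of what each line represents, the snippet below fits a slope and intercept by hand on hypothetical (age, charges) pairs; `fit_line` and the sample values are illustrative assumptions, not part of the notebook or of seaborn.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical (ages, charges) samples, not taken from insurance.csv
smokers = ([20, 30, 40, 50], [22000.0, 25000.0, 28000.0, 31000.0])
non_smokers = ([20, 30, 40, 50], [2000.0, 4500.0, 7000.0, 9500.0])

slope_s, intercept_s = fit_line(*smokers)
slope_n, intercept_n = fit_line(*non_smokers)

print(f"smokers:     charges ~ {slope_s:.1f} * age + {intercept_s:.1f}")
print(f"non-smokers: charges ~ {slope_n:.1f} * age + {intercept_n:.1f}")
```

Comparing the two intercepts and slopes separates a constant surcharge for a group from a different rate of increase with age.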

We also look for a relationship between age and BMI, but find none.

In [None]:
plt.figure(figsize=(12,12))
plt.scatter(df_no_smokers.age,df_no_smokers.bmi,alpha=0.5)
plt.scatter(df_smokers.age,df_smokers.bmi,alpha=0.5)
plt.legend(["Non Smokers","Smokers"])
plt.show()

In [None]:
# Two-sample t-test: do mean charges differ between smokers and non-smokers?
stats.ttest_ind(df_smokers.charges,df_no_smokers.charges)

In [None]:
# Same test for BMI
stats.ttest_ind(df_smokers.bmi,df_no_smokers.bmi)
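By default `stats.ttest_ind` assumes both groups have the same variance; passing `equal_var=False` switches to Welch's t-test, which drops that assumption and may suit groups of very different size and spread. The sketch below computes Welch's statistic by hand on hypothetical samples, only to show what is being measured; `welch_t` is an illustrative helper, not a SciPy function.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    se2_a = statistics.variance(a) / len(a)  # squared standard error of mean(a)
    se2_b = statistics.variance(b) / len(b)  # squared standard error of mean(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, dof

# Hypothetical samples, not taken from insurance.csv
group_a = [30.0, 32.0, 34.0, 36.0]
group_b = [10.0, 11.0, 12.0, 13.0, 14.0]

t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, dof = {dof:.2f}")
```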

## Smokers by age

We take the opportunity to look at how smokers are distributed across ages. In particular, we split the sample into three age groups, from which we conclude that the data acquisition may have been designed to keep the proportion of smokers and non-smokers similar across ages.

In [None]:
df_smoke_by_age = df.groupby("age").smoker.mean()
plt.figure(figsize=(10,6))
df_smoke_by_age.plot.bar()
plt.show()

Next we look at how smokers are distributed across age groups. We divide the sample into three age groups and apply a $\chi^2$ contingency test to check whether smoking status depends on age group.
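A compact way to build this kind of table is `pd.cut` for the binning plus `pd.crosstab` for the counts. The sketch below uses a hypothetical toy sample, and the bin edges are an assumption chosen to give three groups of integer ages; it is an alternative to the step-by-step construction in this notebook, not a replacement for it.

```python
import pandas as pd

# Hypothetical toy sample; smoker is coded 1/0
toy = pd.DataFrame({
    "age":    [18, 20, 21, 25, 30, 42, 43, 50, 64, 64],
    "smoker": [1,  0,  0,  0,  1,  0,  0,  1,  0,  0],
})

# Right-closed bins: (17, 21], (21, 42], (42, 64]
age_group = pd.cut(
    toy["age"], bins=[17, 21, 42, 64], labels=["18-21", "22-42", "43-64"]
)

# Rows: age group, columns: smoker flag, cells: number of people
table = pd.crosstab(age_group, toy["smoker"])
print(table)
```

A table of this shape can be passed directly to `stats.chi2_contingency`.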

In [None]:
# Count people per (age, smoker) pair; sex is just a column without NaN to count on
df_smoke_by_age = df.groupby(["age","smoker"]).sex.count().reset_index().set_index("smoker")

# loc[0] holds non-smokers and loc[1] holds smokers, so the merged columns come in that order
contingency = pd.merge(df_smoke_by_age.loc[0].reset_index().drop(columns=["smoker"]),
                       df_smoke_by_age.loc[1].reset_index().drop(columns=["smoker"]), on="age")
contingency.columns = ["age","non_smokers","smokers"]

def age_separator(x):
    # Cut points at 64/3 and 2*64/3; labels give the integer ages in each group
    if x < 64/3:
        return "18-21"
    elif x < 2*64/3:
        return "22-42"
    else:
        return "43-64"

mask = contingency.age.apply(age_separator)

# Sum the head counts within each age group (count() would only count rows)
contingency = contingency.groupby(mask).sum().drop(columns=["age"])

print(contingency)

print(f"\nThe resulting p-value is: {stats.chi2_contingency(contingency)[1]}")

A large p-value here means we cannot reject independence between age group and smoking status. From this we conclude that the data may have been collected so as to keep the proportion of smokers and non-smokers similar across age groups.
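To make the test less of a black box, here is a minimal sketch of the Pearson $\chi^2$ statistic that underlies it: each observed count is compared with the count expected if rows and columns were independent. The function name and both tables are hypothetical; `stats.chi2_contingency` additionally converts the statistic into a p-value.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# Hypothetical counts: rows are age groups, columns are (non-smokers, smokers)
balanced = [[80, 20], [160, 40], [240, 60]]    # 20% smokers in every group
unbalanced = [[90, 10], [160, 40], [230, 70]]  # smoker share grows with age

stat_balanced, dof = chi_square_statistic(balanced)
stat_unbalanced, _ = chi_square_statistic(unbalanced)
print(stat_balanced, stat_unbalanced, dof)
```

When every age group has the same share of smokers the statistic is exactly zero; the more the shares differ between groups, the larger it gets.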

# BMI comparison

Another question is how BMI affects the charges. Although BMI is not necessarily required when taking out insurance, it appears strongly correlated with the charges in the smoking population.

In [None]:
plt.scatter(df_smokers.bmi,df_smokers.charges,color="r",alpha=.3)
plt.scatter(df_no_smokers.bmi,df_no_smokers.charges,color="g",alpha=.3)
plt.legend(["Smokers","Non smokers"])
plt.xlabel("BMI")
plt.ylabel("Charges")
plt.show()

This graph does not give us a clear picture, since the two groups need different scales. Let us split it into two plots.

In [None]:
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
plt.scatter(df_smokers.bmi,df_smokers.charges,color="r",alpha=.3)
plt.subplot(1,2,2)
plt.scatter(df_no_smokers.bmi,df_no_smokers.charges,color="g",alpha=.3)
plt.show()


It seems that BMI and charges are strongly correlated for smokers, but not for non-smokers. We can check this by computing their Pearson correlations.

In [None]:
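# numeric_only=True skips text columns such as sex and region (required in pandas 2.x)
df_smokers.corr(numeric_only=True).style.set_caption("Correlation Smoking Population")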

In [None]:
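# numeric_only=True skips text columns such as sex and region (required in pandas 2.x)
df_no_smokers.corr(numeric_only=True).style.set_caption("Correlation Non-smoking Population")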

We indeed conclude that, for smokers, BMI is a better index for pricing than age.
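For reference, `DataFrame.corr` computes Pearson's coefficient by default: the covariance of two columns divided by the product of their standard deviations. A minimal hand-written sketch on hypothetical BMI and charge values (none taken from `insurance.csv`; `pearson_r` is an illustrative helper) shows how a clear linear trend and a flat cloud score differently.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical values for illustration only
bmi = [20.0, 25.0, 30.0, 35.0]
charges_trending = [10000.0, 20000.0, 30000.0, 40000.0]  # rises steadily with BMI
charges_flat = [5000.0, 4000.0, 6000.0, 5000.0]          # no clear trend

r_trending = pearson_r(bmi, charges_trending)
r_flat = pearson_r(bmi, charges_flat)
print(f"trending: r = {r_trending:.3f}, flat: r = {r_flat:.3f}")
```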