In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
from scipy.stats import ttest_ind, chisquare
from statsmodels.stats.proportion import proportions_ztest

## US Health Insurance Dataset [(source)](https://www.kaggle.com/datasets/teertha/ushealthinsurancedataset)

This dataset contains 1338 rows of insured data, where the Insurance charges are given against the following attributes of the insured: Age, Sex, BMI, Number of Children, Smoker and Region. There are no missing or undefined values in the dataset.

For the following exercises, assume that the data was taken from a random sample of users.

In [3]:
df = pd.read_csv('/content/drive/MyDrive/5.9.4 Home Assignment/insurance.csv')
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


## Problem 1: Using the US Health Insurance Dataset, Test if the insurance charges are different for males vs. females.

1. Define the Null and Alternative Hypothesis
1. Implement the applicable hypothesis test and calculate the p-values
1. What is the conclusion at an alpha of 0.05?


In [7]:
"""
Null Hypothesis: There is no difference in mean insurance charges between males and females.
Alternative Hypothesis: Mean insurance charges differ between males and females.
"""

male_charges = df.loc[df.sex=='male', 'charges']
female_charges = df.loc[df.sex=='female', 'charges']

alpha = 0.05

t_stat, p_val = ttest_ind(male_charges, female_charges, alternative='two-sided')    # test statistics and p_value
print ('Test statistic: ', t_stat)
print ('p-value: ', p_val)

if p_val < alpha:    # reject when the p-value falls below alpha
    print (f'Reject the Null Hypothesis (alpha = {alpha}).')
else:
    print (f'Fail to Reject the Null Hypothesis (alpha = {alpha}).')

Test statistic:  2.097546590051688
p-value:  0.03613272100592976
Reject the Null Hypothesis (alpha = 0.05).


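**Conclusion**
Since the p-value (0.036) is below alpha = 0.05, we reject the null hypothesis: insurance charges differ between males and females. The positive test statistic indicates that the mean charge is higher for males.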
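
As a rough sanity check on the reported p-value, the sketch below uses a normal approximation. It assumes every one of the 1338 rows is labelled male or female, so the test has about 1336 degrees of freedom and Student's t is close to the standard normal. The numbers are copied from the output above; the variable names are illustrative only.

```python
from statistics import NormalDist

# Values copied from the printed output of the t-test above.
t_stat = 2.097546590051688
p_reported = 0.03613272100592976

# With roughly 1336 degrees of freedom, Student's t is close to standard normal,
# so the two-sided p-value is approximately 2 * P(Z > |t|).
p_approx = 2 * (1 - NormalDist().cdf(abs(t_stat)))

print(f"normal approximation:  {p_approx:.4f}")
print(f"reported by ttest_ind: {p_reported:.4f}")
print("reject H0 at alpha = 0.05:", p_reported < 0.05)
```

The two values should agree to about three decimal places, which is enough to confirm that the decision at alpha = 0.05 is to reject.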

## Problem 2: Test if the insurance charges are higher for those with children.

1. Define the Null and Alternative Hypothesis
1. Implement the applicable hypothesis test and calculate the p-values
1. What is the conclusion at an alpha of 0.05?


In [9]:
"""
Null Hypothesis: Mean insurance charges for those with children are not higher than for those without children.
Alternative Hypothesis: Mean insurance charges are higher for those with children.
"""

no_child = df.loc[df.children== 0, 'charges']
with_child = df.loc[df.children > 0, 'charges']

alpha = 0.05

# 'greater' tests whether the FIRST sample's mean exceeds the second's,
# so with_child must come first to match the alternative hypothesis
t_stat, p_val = ttest_ind(with_child, no_child, alternative='greater')
print ('Test statistic: ', t_stat)
print ('p-value: ', p_val)

if p_val < alpha:
    print (f'Reject the Null Hypothesis (alpha = {alpha}).')
else:
    print (f'Fail to Reject the Null Hypothesis (alpha = {alpha}).')

Test statistic:  2.3720611301669337
p-value:  0.0089150873417828
Reject the Null Hypothesis (alpha = 0.05).


**Conclusion**
Since the p-value (0.0089) is below alpha = 0.05, we reject the null hypothesis: insurance charges are higher for those with children.
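
For a one-sided test the order of the two samples matters, because `alternative='greater'` asks whether the first sample's mean exceeds the second's. The sketch below computes a pooled (equal-variance) t statistic by hand to show that swapping the arguments only flips the sign. The helper `pooled_t` and the toy values are hypothetical and are not taken from the dataset.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(a, b):
    """Student's t statistic for mean(a) - mean(b), assuming equal variances."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))

# Hypothetical toy charges, for illustration only.
toy_with_child = [12.0, 15.0, 14.0, 16.0, 13.0]
toy_no_child = [10.0, 11.0, 12.0, 9.0, 13.0]

t_forward = pooled_t(toy_with_child, toy_no_child)   # positive: supports "with > without"
t_reversed = pooled_t(toy_no_child, toy_with_child)  # same magnitude, opposite sign
print(t_forward, t_reversed)
```

A negative statistic under `alternative='greater'` yields a p-value near 1, which is a quick signal that the samples may have been passed in the wrong order.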

## Problem 3: Test if the proportion of smokers is more than 20%

1. Define the Null and Alternative Hypothesis
1. Implement the applicable hypothesis test and calculate the p-values
1. What is the conclusion at an alpha of 0.05?

In [10]:
"""
Null Hypothesis: The proportion of smokers is <= 20%
Alternative Hypothesis: The proportion of smokers is > 20%
"""

smokers = df[df.smoker == 'yes']
non_smokers = df[df.smoker == 'no']

alpha = 0.05

# compute the p-value; 'larger' matches the one-sided alternative (proportion > 20%)
z_stat, p_val = proportions_ztest(count=len(smokers),
                                  nobs=len(smokers)+len(non_smokers),
                                  value=0.2,
                                  alternative='larger')
print('p-value:', p_val)

if p_val < alpha:
    print (f'Reject the Null Hypothesis (alpha = {alpha}).')
else:
    print (f'Fail to Reject the Null Hypothesis (alpha = {alpha}).')

p-value: 0.33229937909683
Fail to Reject the Null Hypothesis (alpha = 0.05).


**Conclusion**
We fail to reject the null hypothesis: at alpha = 0.05 there is not enough evidence that the proportion of smokers exceeds 20%.
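
The difference between a two-sided and a one-sided proportion test can be sketched by hand. The example below is a minimal version under stated assumptions: the helper `one_sample_prop_z` is hypothetical, it uses the sample proportion in the standard error, and the counts are made up for illustration rather than taken from the dataset.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_prop_z(count, nobs, p0):
    """z statistic for H0: p = p0, using the sample proportion in the standard error."""
    p_hat = count / nobs
    se = sqrt(p_hat * (1 - p_hat) / nobs)
    return (p_hat - p0) / se

# Hypothetical counts, for illustration only.
z = one_sample_prop_z(count=215, nobs=1000, p0=0.2)

std_normal = NormalDist()
p_two_sided = 2 * (1 - std_normal.cdf(abs(z)))  # H1: p != 0.2
p_larger = 1 - std_normal.cdf(z)                # H1: p > 0.2

print(f"z = {z:.3f}, two-sided p = {p_two_sided:.4f}, one-sided p = {p_larger:.4f}")
```

When the sample proportion is above the hypothesized value, the one-sided p-value is exactly half the two-sided one, so choosing the wrong alternative can change the decision near the alpha threshold.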

## Problem 4: Test if the proportion of smokers from each region follows the table below:

|           | Non-Smoker | Smoker |
|-----------|------------|-----------|
| northeast     | 20%        | 5%        |
| northwest | 20%        | 5%       |
| southeast | 20%        | 5%       |
| southwest | 20%        | 5%       |

1. Define the Null and Alternative Hypothesis
1. Implement the applicable hypothesis test and calculate the p-values
1. What is the conclusion at an alpha of 0.05?
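
The goodness-of-fit statistic behind this problem can be sketched by hand: convert the table's percentages into expected counts, then sum the squared deviations scaled by the expected counts. The observed counts below are hypothetical and only illustrate the arithmetic; the critical value is the 95th percentile of the chi-square distribution with 7 degrees of freedom.

```python
# Expected shares from the table: non-smokers first, then smokers,
# each ordered northeast, northwest, southeast, southwest.
shares = [0.20, 0.20, 0.20, 0.20, 0.05, 0.05, 0.05, 0.05]

# Hypothetical observed counts, for illustration only (not the dataset's).
observed = [210, 195, 205, 190, 55, 45, 60, 40]

n = sum(observed)
expected = [share * n for share in shares]  # expected counts must sum to n

# Pearson chi-square statistic: sum over cells of (observed - expected)^2 / expected
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
dof = len(observed) - 1  # 8 cells, no parameters estimated from the data

CRITICAL_95_DOF7 = 14.067  # 95th percentile of chi-square with 7 degrees of freedom
print(f"chi-square = {chi_sq:.2f}, dof = {dof}")
print("reject H0 at alpha = 0.05:", chi_sq > CRITICAL_95_DOF7)
```

The order of the expected shares has to match the order of the observed counts, which is why the cell ordering is spelled out in the comments.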

In [12]:
"""
Null Hypothesis: The proportions of smokers and non-smokers in each region follow the table.
Alternative Hypothesis: The proportions of smokers and non-smokers in each region do not follow the table.
"""

obs = df.groupby(['smoker', 'region']).charges.count()    # counts per (smoker, region); sorted so 'no' rows come first, then 'yes'
n_obs = len(df)    # number of observations

f_obs = obs.values
f_exp = []    # expected frequencies
for exp in [0.2, 0.2, 0.2, 0.2, 0.05, 0.05, 0.05, 0.05]:
  f_exp.append(exp*n_obs)

alpha = 0.05

chi_sq, p_val = chisquare(f_obs=f_obs, f_exp=f_exp)
print('p-value:', p_val)

if p_val < alpha:
    print (f'Reject the Null Hypothesis (alpha = {alpha}).')
else:
    print (f'Fail to Reject the Null Hypothesis (alpha = {alpha}).')

p-value: 0.11519018608755734
Fail to Reject the Null Hypothesis (alpha = 0.05).


**Conclusion**
We fail to reject the null hypothesis: at alpha = 0.05 there is not enough evidence that the distribution of smokers and non-smokers across regions differs from the table.
