For this assignment, we will be working with the US Health Insurance Dataset from Kaggle.

We have been asked to use our hypothesis testing skills to answer the following questions:

- Q1. Do smokers have higher insurance charges than non-smokers?
- Q2. Are men more likely to smoke than women?
- Q3. Do different regions have different charges, on average?



For each question, make sure to:

1. State your Null Hypothesis and Alternative Hypothesis
2. Select the correct test according to the data type and number of samples
3. Test the assumptions of your selected test.
4. Execute the selected test, or the alternative test (if you do not meet the assumptions)
5. Interpret your p-value and reject or fail to reject your null hypothesis 
6. Show a supporting visualization that helps display the result

# Imports

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats
import scipy

# Load the Data

In [2]:
df = pd.read_csv('Data/insurance - insurance.csv')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


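|   | age | sex | bmi | children | smoker | region | charges |
|---|-----|-----|-----|----------|--------|--------|---------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |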


# Q1: Do smokers have higher insurance charges than non-smokers?

## State the Null and Alternative Hypothesis

- Null Hypothesis: There is no significant difference in insurance charges between smokers and non-smokers.
- Alternative Hypothesis: There is a significant difference in insurance charges for smokers and non-smokers.
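
In symbols, the two-sided hypotheses above can be written as follows. The notation is an assumption introduced here for illustration: $\mu$ denotes the population mean of `charges` for each group.

```latex
H_0:\ \mu_{\text{smoker}} = \mu_{\text{non-smoker}}
\qquad
H_1:\ \mu_{\text{smoker}} \neq \mu_{\text{non-smoker}}
```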

## Determining the Correct Test

- The target ('charges') is numeric.
- There are 2 independent groups: smokers and non-smokers
- Which test to perform? An independent 2-sample t-test

## Visualize and Separate Features for Hypothesis

- The target column is 'charges'
- The column that determines the group is 'smoker'

In [3]:
# checking the col for smokers
df['smoker'].value_counts()

no     1064
yes     274
Name: smoker, dtype: int64

In [4]:
# Split the data into smokers and non-smokers
smoker_df = df.loc[df['smoker'] == 'yes'].copy()
non_smoker_df = df.loc[df['smoker'] == 'no'].copy()

In [5]:
# define feature of interest
smoker = smoker_df['charges']
non_smoker = non_smoker_df['charges']

## Check for and Remove Outliers

In [6]:
# Check for outliers in charges for smokers
zscores = stats.zscore(smoker)
outliers = np.abs(zscores) > 3
np.sum(outliers)

0

In [7]:
# Check for outliers in charges for non-smokers
zscores = stats.zscore(non_smoker)
outliers = np.abs(zscores) > 3
np.sum(outliers)

24
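
The check above flags 24 non-smoker values, but no cell in this notebook actually drops them, so the cells that follow use `non_smoker` as originally defined. A minimal sketch of how a removal step might look is shown below. It runs on a hypothetical synthetic series (`demo_charges`), not on the insurance data, so the counts it prints are illustrative only.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in for a charges column:
# 200 typical values plus 2 deliberately extreme ones
rng = np.random.default_rng(0)
demo_charges = pd.Series(
    np.append(rng.normal(8000, 1000, size=200), [60000, 65000])
)

# Flag values more than 3 standard deviations from the mean
zscores = stats.zscore(demo_charges)
outliers = np.abs(zscores) > 3

# Keep only the values that were NOT flagged
demo_charges_clean = demo_charges[~outliers]

print(f"Removed {outliers.sum()} outliers; {len(demo_charges_clean)} values remain")
```

In this notebook the analogous line would be `non_smoker = non_smoker[~outliers]`, run right after the non-smoker check. Doing so would change the Levene and t-test results reported below.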

## Check for Normality

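Both groups have far more than 20 observations (274 smokers and 1064 non-smokers), so by the central limit theorem the t-test is robust to non-normality in the raw data. I can therefore proceed without a formal normality test.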
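
If a formal check were wanted anyway, one option is `stats.normaltest` (D'Agostino and Pearson's test), applied to each group separately. The sketch below is not part of the original analysis. It uses a hypothetical right-skewed sample (`demo_sample`) in place of the real `charges` values.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed sample standing in for a charges column
rng = np.random.default_rng(42)
demo_sample = rng.lognormal(mean=9, sigma=0.5, size=500)

# Null hypothesis of normaltest: the sample comes from a normal distribution
result = stats.normaltest(demo_sample)
print(result)

if result.pvalue < .05:
    print("The sample does NOT look normally distributed.")
else:
    print("The sample is consistent with a normal distribution.")
```

With large groups, a significant result here does not by itself rule out the t-test, for the sample-size reason given above.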

## Check for Equal Variance

In [8]:
# Test for equal variance
result = stats.levene(smoker, non_smoker)
print(result)

LeveneResult(statistic=332.6135162726081, pvalue=1.5593284881803726e-66)


In [9]:
# if-else to help interpret the p-value
if result.pvalue < .05:
    print("The groups do NOT have equal variance.")
else:
    print("The groups DO have equal variance.")

The groups do NOT have equal variance.


Even though the 2 groups do not have equal variance, I can still continue with the final hypothesis test as long as I set equal_var=False.
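
With `equal_var=False`, `stats.ttest_ind` performs Welch's t-test, which does not pool the two variances. For reference, the statistic and its approximate degrees of freedom are given below. The symbols are introduced here for illustration: $\bar{x}_i$, $s_i^2$ and $n_i$ are the sample mean, sample variance and size of group $i$.

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}
{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
```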

## Final Hypothesis Test

In [10]:
# run final hypothesis test
result = stats.ttest_ind(smoker, non_smoker, equal_var=False)
print(result)
result.pvalue < .05

Ttest_indResult(statistic=32.751887766341824, pvalue=5.88946444671698e-103)


True
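
The test above is two-sided, while the question asks specifically whether smokers have *higher* charges. One way to test that direction directly is the `alternative='greater'` argument of `stats.ttest_ind`, available in SciPy 1.6 and later. The sketch below is not part of the original analysis. It uses hypothetical synthetic groups (`demo_high`, `demo_low`) whose sizes and spreads are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical groups: one with clearly higher values than the other
rng = np.random.default_rng(1)
demo_high = rng.normal(30000, 11000, size=270)
demo_low = rng.normal(8000, 6000, size=1000)

# One-sided Welch's t-test.
# H1: the mean of the first sample is greater than the mean of the second
result = stats.ttest_ind(demo_high, demo_low, equal_var=False, alternative='greater')
print(result)
print(result.pvalue < .05)
```

In this notebook the equivalent call would pass `smoker` and `non_smoker` in that order, since `'greater'` refers to the first argument.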

In [14]:
# show the mean for each group
print(f'The mean insurance charges for smokers is {smoker.mean().round(2)}')
print(f'The mean insurance charges for non-smokers is {non_smoker.mean().round(2)}')

The mean insurance charges for smokers is 32050.23
The mean insurance charges for non-smokers is 8434.27


## Final Conclusion

- The p-value is < 0.05 for the final test, so I reject the null hypothesis: there is a significant difference in insurance charges between smokers and non-smokers. The group means (32050.23 for smokers vs. 8434.27 for non-smokers) show that smokers have the higher charges.
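
Step 6 of the checklist asks for a supporting visualization. A minimal sketch of one option is a bar chart of mean charges per group, with standard-error bars. The data frame below (`demo_df`) is hypothetical synthetic data, and plain matplotlib is used rather than seaborn to keep the example self-contained.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data with the same two columns used in this question
rng = np.random.default_rng(2)
demo_df = pd.DataFrame({
    'smoker': ['yes'] * 50 + ['no'] * 200,
    'charges': np.concatenate([
        rng.normal(32000, 11000, size=50),
        rng.normal(8400, 6000, size=200),
    ]),
})

# Mean and standard error of the mean for each group
summary = demo_df.groupby('smoker')['charges'].agg(['mean', 'sem'])

# Bar chart of group means with error bars
fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(summary.index, summary['mean'], yerr=summary['sem'], capsize=5)
ax.set_xlabel('smoker')
ax.set_ylabel('mean charges')
ax.set_title('Mean insurance charges by smoking status')
```

To plot the real data, `demo_df` would be replaced with `df` from the loading step.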

# Q2. Are men more likely to smoke than women?
