# Hypothesis Testing with Insurance Data
- Clint Atterberry
- 9/6/22

We have been asked to use our hypothesis testing skills to answer the following questions:

1. Do smokers have higher insurance charges than non-smokers?
2. Are men more likely to smoke than women?
3. Do different regions have different charges, on average?



For each question, make sure to:

- State your Null Hypothesis and Alternative Hypothesis
- Select the correct test according to the data type and number of samples
- Test the assumptions of your selected test
- Execute the selected test, or an alternative test if the assumptions are not met
- Interpret your p-value and reject or fail to reject your null hypothesis
- Show a supporting visualization that helps display the result
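
As a rough illustration of this checklist applied to the first question, the sketch below runs the assumption checks and a one-sided two-sample t-test with `scipy.stats`. It uses synthetic data, not the insurance file. The group sizes, means, the `check_and_test` helper name, and the 0.05 significance level are all assumptions made for the example.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05  # assumed significance level


def check_and_test(group_a, group_b, alpha=ALPHA):
    """Hypothetical helper: run assumption checks, then a two-sample t-test.

    H0: the two groups have the same mean.
    H1: group_a has a higher mean than group_b (one-sided).
    """
    # Normality check for each group (D'Agostino-Pearson test)
    normal_a = stats.normaltest(group_a).pvalue > alpha
    normal_b = stats.normaltest(group_b).pvalue > alpha

    # Equal-variance check (Levene's test)
    equal_var = stats.levene(group_a, group_b).pvalue > alpha

    # With equal_var=False this becomes Welch's t-test
    result = stats.ttest_ind(group_a, group_b, equal_var=equal_var,
                             alternative="greater")
    return {
        "normal_a": normal_a,
        "normal_b": normal_b,
        "equal_var": equal_var,
        "p_value": result.pvalue,
        "reject_null": result.pvalue < alpha,
    }


rng = np.random.default_rng(0)
smokers = rng.normal(loc=32000, scale=11000, size=200)    # synthetic
non_smokers = rng.normal(loc=8500, scale=6000, size=800)  # synthetic
summary = check_and_test(smokers, non_smokers)
print(summary)
```

If the normality check fails and the groups are small, a non-parametric test such as `stats.mannwhitneyu` is one possible alternative.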

# Setup

## Import and Clean Data

In [163]:
# imports
import pandas as pd

In [164]:
# load in insurance.csv
df = pd.read_csv('Data/insurance.csv')
df.head(2)

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523


In [165]:
# exploring dataframe
df.info()

# all column names appear appropriate
# no missing values
# 7 columns, 1338 rows

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [166]:
# check for duplicated data
df.duplicated().sum()

# 1 duplicate

1

In [167]:
# remove duplicates
df.drop_duplicates(inplace=True)
df.duplicated().sum()

# success!

0
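
Before dropping, it can help to look at the duplicated rows themselves. The sketch below uses a small made-up frame (not the insurance data) to show `keep=False`, which flags every copy of a duplicated row, followed by a drop that also resets the index.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "age": [19, 18, 19, 28],
    "sex": ["female", "male", "female", "male"],
    "charges": [100.0, 200.0, 100.0, 300.0],
})

# keep=False marks all copies, so both matching rows are visible
dupes = toy[toy.duplicated(keep=False)]
print(dupes)

# Drop the repeats (the first occurrence is kept) and rebuild a contiguous index
deduped = toy.drop_duplicates().reset_index(drop=True)
print(len(deduped))
```

A contiguous index avoids misaligned rows if a frame built with a default index is joined on later.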

In [168]:
# explore numeric columns
df.describe()

# numeric data appears appropriate

Unnamed: 0,age,bmi,children,charges
count,1337.0,1337.0,1337.0,1337.0
mean,39.222139,30.663452,1.095737,13279.121487
std,14.044333,6.100468,1.205571,12110.359656
min,18.0,15.96,0.0,1121.8739
25%,27.0,26.29,0.0,4746.344
50%,39.0,30.4,1.0,9386.1613
75%,51.0,34.7,2.0,16657.71745
max,64.0,53.13,5.0,63770.42801


In [169]:
# explore object columns
df.select_dtypes('object').groupby('sex').value_counts()

# object data appears appropriate

sex     smoker  region   
female  no      southwest    141
                southeast    139
                northwest    135
                northeast    132
        yes     southeast     36
                northeast     29
                northwest     29
                southwest     21
male    no      southeast    134
                northwest    131
                southwest    126
                northeast    125
        yes     southeast     55
                northeast     38
                southwest     37
                northwest     29
dtype: int64
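
The counts above can be collapsed over region into a sex-by-smoker contingency table, which is the input a chi-square test of independence expects for the second question. The sketch below types in the summed counts by hand from the output above; on the real frame, `pd.crosstab(df['sex'], df['smoker'])` should give the same table. Using `scipy.stats.chi2_contingency` is an assumption about which test fits, not a choice stated in the notebook.

```python
import pandas as pd
from scipy import stats

# Counts summed across the four regions from the value_counts output
table = pd.DataFrame(
    {"no": [547, 516], "yes": [115, 159]},
    index=pd.Index(["female", "male"], name="sex"),
)
table.columns.name = "smoker"

# Chi-square test of independence; `expected` holds the counts implied by H0
chi2, p_value, dof, expected = stats.chi2_contingency(table)

# Share of smokers within each sex
smoker_rate = table["yes"] / table.sum(axis=1)
print(table)
print(smoker_rate.round(3))
print(f"chi2={chi2:.3f}, dof={dof}, p={p_value:.4f}")
```

The chi-square test only says whether sex and smoking status are associated; the direction (which sex smokes more) has to be read from the rates.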

# Transform Data

## One-Hot Encode (OHE) Gender

In [170]:
# # filter by gender
# male_df = df[df['sex']=='male']

# # resetting index
# male_df = male_df.reset_index(drop=True)

# # write 1's to 'male' and 0's to 'female'
# male_df[['male','female']] = [1,0]
# male_df.head()

In [171]:
# # filter by gender
# female_df = df[df['sex']=='female']

# # resetting index
# female_df = female_df.reset_index(drop=True)

# # write 1's to 'female' and 0's to 'male'
# female_df[['male','female']] = [0,1]
# female_df.head()

In [172]:
# insurance_df = pd.concat([male_df,female_df], ignore_index=True)

# # reviewing first 5 rows
# # expecting male = 1 and female = 0
# print(insurance_df.head())
# print('\n\n')

# # reviewing last 5 rows
# # expecting male = 0 and female = 1
# print(insurance_df.tail())

## One-Hot Encode (OHE) Smoker

In [173]:
# insurance_df = df.copy()
# cols = list(df.columns)

# for i in cols:
#     insurance_df[i]
    
#     if i == 'sex':
#         for j in insurance_df['sex']:
#             if j == 'male':
#                 insurance_df[['male','female']] = [1,0]
#             else:
#                 insurance_df[['male','female']] = [0,1]
#     elif i == 'smoker':
#         for j in insurance_df['smoker']:
#             if j == 'yes':
#                 insurance_df[['smoker_yes','smoker_no']] = [1,0]
#             else:
#                 insurance_df[['smoker_yes','smoker_no']] = [0,1]

# insurance_df.head()
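
The commented-out loop above assigns a single value to a whole column on every iteration, so all rows end up with the flags of the last row visited. A vectorized alternative is sketched below on three rows shaped like the data. `pd.get_dummies` is a swapped-in technique here, and the resulting column names (`sex_male`, `smoker_yes`, and so on) come from its default prefixing, not from the notebook.

```python
import pandas as pd

# A few rows shaped like the insurance data, for illustration only
toy = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "charges": [16884.924, 1725.5523, 4449.462],
})

# One indicator column per category, computed for all rows at once
encoded = pd.get_dummies(toy, columns=["sex", "smoker"], dtype=int)
print(encoded)
```

Because each indicator is derived row by row from the original column, no filtering, concatenation, or index reset is needed.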

In [190]:
# make a copy of df
ins_df = df.copy()
ins_df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


southeast    364
southwest    325
northwest    324
northeast    324
Name: region, dtype: int64

In [221]:
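from sklearn.preprocessing import OneHotEncoder

# Start from a fresh copy so re-running this cell does not join the same columns twice.
# The earlier version raised
# "ValueError: columns overlap but no suffix specified: RangeIndex(start=0, stop=8, step=1)",
# most likely because ins_df already held the integer-named columns 0-7 from a previous join.
ins_df = df.copy()
obj = ins_df.select_dtypes(include='object')

# one-hot encode the object columns, keeping readable names and the original row index
encoder = OneHotEncoder()
obj_ohe = encoder.fit_transform(obj).toarray()
temp = pd.DataFrame(obj_ohe, columns=encoder.get_feature_names_out(), index=obj.index)
ins_df = ins_df.join(temp)
ins_df.head()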

In [204]:
ins_df.head()

Unnamed: 0,0
0,1
1,0
2,0
3,0
4,0
