# Insurance Case Study

## Problem Statement

MedicaInsure is a medical insurance provider. Customer information is valuable to most businesses. For an insurance company, analyzing customer attributes such as age, sex, and smoking habits can be crucial when deciding the premium to charge.

The insurance company wants to know whether the proportion of female smokers is different from the proportion of male smokers in their customer population.

They have provided a sample dataset of customers and the charges claimed by them.

In [1]:
# Upgrading scipy  

#!pip install --upgrade scipy

import scipy
scipy.__version__

'1.7.3'

In [2]:
# Importing libraries

# numerical and dataset modification
import numpy as np
import pandas as pd

# visualization
import matplotlib.pyplot as plt 
import seaborn as sns
%matplotlib inline

# statistical
import scipy.stats as stats
import statsmodels.stats.proportion as proportion
import statsmodels.stats.multicomp as multicomp


In [3]:
# Mounting Google Drive

from google.colab import drive
drive.mount('/content/drive/')


Mounted at /content/drive/


In [6]:
# Loading dataset

data_orig = pd.read_csv('/content/drive/MyDrive/Python_files/insurance.csv')

# Ensuring the original dataset remains unchanged
df = data_orig.copy()

# general info on dataset
print(df.info())

# peeking first 10 rows
df.head(10)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
None


Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552
5,31,female,25.74,0,no,southeast,3756.6216
6,46,female,33.44,1,no,southeast,8240.5896
7,37,female,27.74,3,no,northwest,7281.5056
8,37,male,29.83,2,no,northeast,6406.4107
9,60,female,25.84,0,no,northwest,28923.13692


In [7]:
# Checking for missing values

df.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

In [15]:
# Forming a smaller data frame that has the number of male/female smokers

male_num = df.loc[df['sex'] == 'male'].shape[0]    # no. of males and females
female_num = df.loc[df['sex'] == 'female'].shape[0]

# number of males and females who smoke
male_smoker_num = df.loc[(df['sex'] == 'male') & (df['smoker'] == 'yes')].shape[0]
female_smoker_num = df.loc[(df['sex'] == 'female') & (df['smoker'] == 'yes')].shape[0]

print(round(male_smoker_num/male_num*100, 2), '% of men smoke.', sep = '')
print(round(female_smoker_num/female_num*100, 2), '% of women smoke.', sep = '')

sex_smoker_df = pd.DataFrame({'sex': ['M', 'F'], 'smoker': [male_smoker_num, female_smoker_num], 
                              'non-smoker': [male_num-male_smoker_num, female_num-female_smoker_num]})

sex_smoker_df

23.52% of men smoke.
17.37% of women smoke.


Unnamed: 0,sex,smoker,non-smoker
0,M,159,517
1,F,115,547
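
As an alternative to counting each cell by hand, the same contingency table can be obtained with `pd.crosstab`. The sketch below is self-contained, so it rebuilds a customer-level frame from the four counts reported above instead of reading `insurance.csv`; the names `demo_df`, `contingency`, and `smoker_share` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Rebuild one row per customer from the counts reported above
# (assumption: only the 'sex' and 'smoker' columns matter here).
counts = {('male', 'yes'): 159, ('male', 'no'): 517,
          ('female', 'yes'): 115, ('female', 'no'): 547}
rows = [(sex, smoker) for (sex, smoker), n in counts.items() for _ in range(n)]
demo_df = pd.DataFrame(rows, columns=['sex', 'smoker'])

# Cross-tabulate sex against smoking status in a single call
contingency = pd.crosstab(demo_df['sex'], demo_df['smoker'])
print(contingency)

# Row-wise proportions: share of smokers and non-smokers within each sex
smoker_share = pd.crosstab(demo_df['sex'], demo_df['smoker'], normalize='index')
print(smoker_share.round(4))
```

On the real data, `pd.crosstab(df['sex'], df['smoker'])` should give the same counts directly.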


### First Approach: Using chi-square test of independence

### Forming Hypotheses:

$H_0$: Sex and smoking status are independent, i.e., the proportion of smokers is the same among males and females.

$H_a$: Sex and smoking status are dependent, i.e., the proportion of smokers differs between males and females. Note that dependence indicates an association and does not by itself show that sex influences smoking status.


In [23]:
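# Using Chi-square test of independence
# Note: for a 2x2 table (dof = 1), chi2_contingency applies Yates' continuity correction by default (correction=True)
chi, p_value, dof, expected = stats.chi2_contingency(sex_smoker_df.drop('sex', axis = 1))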
print('p-value =', round(p_value, 4))
print('chi =', round(chi, 2))
print('dof =', dof)
print('Expected =', expected.round(2))

if p_value >= 0.05:
  print('Fail to reject H0: no evidence that sex and smoking status are dependent.')
else:
  print('Reject H0: sex and smoking status are dependent, i.e., the proportions of smokers among men and women differ.')


p-value = 0.0065
chi = 7.39
dof = 1
Expected = [[138.43 537.57]
 [135.57 526.43]]
Reject H0: sex and smoking status are dependent, i.e., the proportions of smokers among men and women differ.
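
To make the `Expected` array less of a black box, the sketch below recomputes it by hand as (row total × column total) / grand total and then rebuilds the test statistic. It assumes the observed counts shown earlier and assumes that SciPy's default Yates correction amounts to shrinking each |observed − expected| by 0.5, which holds here because every difference is larger than 0.5. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Observed counts: rows = (male, female), columns = (smoker, non-smoker)
observed = np.array([[159, 517],
                     [115, 547]])

# Expected counts under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
n = observed.sum()
expected_manual = row_totals @ col_totals / n

# Chi-square statistic with Yates' continuity correction
chi2_yates = (((np.abs(observed - expected_manual) - 0.5) ** 2) / expected_manual).sum()
p_manual = stats.chi2.sf(chi2_yates, df=1)

# Compare with SciPy's result (correction=True is the default)
chi2_scipy, p_scipy, dof, expected_scipy = stats.chi2_contingency(observed)

print(expected_manual.round(2))
print(round(chi2_yates, 2), round(p_manual, 4))
```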


### Second Approach: Using z-test for proportions

### Forming Hypotheses:

$H_0$: $p_m = p_f$

$H_a$: $p_m \neq p_f$

Here, $p_m$ and $p_f$ denote the proportions of smokers among men and women, respectively.
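
For reference, the two-sample z statistic with a pooled proportion can be written as follows. The symbols $x_m, x_f$ (number of smokers in each group), $n_m, n_f$ (group sizes), and the hats for sample estimates are notation introduced here for illustration. The pooled form is what `proportions_ztest` uses by default when testing for a zero difference.

$$\hat{p} = \frac{x_m + x_f}{n_m + n_f}, \qquad z = \frac{\hat{p}_m - \hat{p}_f}{\sqrt{\hat{p}\,(1-\hat{p})\left(\dfrac{1}{n_m} + \dfrac{1}{n_f}\right)}}, \qquad \hat{p}_m = \frac{x_m}{n_m},\ \hat{p}_f = \frac{x_f}{n_f}$$

Under $H_0$, $z$ is approximately standard normal, so the two-sided p-value is $2\,P(Z \ge |z|)$.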

In [26]:
# Checking for assumptions

p_m = male_smoker_num/male_num
p_f = female_smoker_num/female_num

if(p_m*male_num > 10 and (1-p_m)*male_num > 10 and p_f*female_num > 10 and (1-p_f)*female_num > 10):
  print('Assumptions of two-sample z-test for proportions are satisfied.')
else:
  print('Assumptions of two-sample z-test for proportions are NOT satisfied.')

Assumptions of two-sample z-test for proportions are satisfied.


In [29]:
# Conducting two-sample z-test

test_stat, p_value = proportion.proportions_ztest(count = np.array([male_smoker_num, female_smoker_num]),
                                                  nobs = np.array([male_num, female_num]), alternative = 'two-sided')

print('p-value =', round(p_value, 4))
print('test statistic =', round(test_stat, 2))

if p_value >= 0.05:
  print('Fail to reject H0: the proportions of male and female smokers are NOT significantly different.')
else:
  print('Reject H0: the proportions of male and female smokers are different.')

p-value = 0.0053
test statistic = 2.79
Reject H0: the proportions of male and female smokers are different.
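
The two approaches report slightly different p-values (0.0065 versus 0.0053). A likely explanation is the continuity correction that `chi2_contingency` applies by default to 2x2 tables. The sketch below recomputes the pooled z statistic by hand from the counts shown earlier and compares its square with the uncorrected chi-square statistic. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Smokers and group sizes, taken from the counts shown earlier
x_m, n_m = 159, 676   # male smokers, all males
x_f, n_f = 115, 662   # female smokers, all females

# Pooled two-sample z-test for proportions, computed by hand
p_pooled = (x_m + x_f) / (n_m + n_f)
se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / n_m + 1 / n_f))
z = (x_m / n_m - x_f / n_f) / se
p_z = 2 * stats.norm.sf(abs(z))   # two-sided p-value

# Chi-square test of independence WITHOUT Yates' continuity correction
observed = np.array([[x_m, n_m - x_m],
                     [x_f, n_f - x_f]])
chi2_uncorrected, p_chi2, _, _ = stats.chi2_contingency(observed, correction=False)

print('z =', round(z, 2), ' p =', round(p_z, 4))
print('z^2 =', round(z ** 2, 2), ' chi2 (uncorrected) =', round(chi2_uncorrected, 2),
      ' p =', round(p_chi2, 4))
```

With the correction switched off, the chi-square statistic equals $z^2$ and the two tests give the same p-value, so for a 2x2 table they are two views of the same comparison.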
