### Statistical Estimation and Hypothesis Testing for Population Mean
#### By Mohan Bakshi

In [1]:
# Importing libraries

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as st
import pandas as pd
from IPython.display import display

In [2]:
# Importing cleaned census income dataset
census_income_df = pd.read_csv('datasets/cleaned_census_income.csv')

In [4]:
# Printing shape and head
print(census_income_df.shape)
census_income_df.head(10)

(30162, 15)


Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
1,54,Private,140359,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
2,41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K
3,34,Private,216864,HS-grad,9,Divorced,Other-service,Unmarried,White,Female,0,3770,45,United-States,<=50K
4,38,Private,150601,10th,6,Separated,Adm-clerical,Unmarried,White,Male,0,3770,40,United-States,<=50K
5,74,State-gov,88638,Doctorate,16,Never-married,Prof-specialty,Other-relative,White,Female,0,3683,20,United-States,>50K
6,68,Federal-gov,422013,HS-grad,9,Divorced,Prof-specialty,Not-in-family,White,Female,0,3683,40,United-States,<=50K
7,45,Private,172274,Doctorate,16,Divorced,Prof-specialty,Unmarried,Black,Female,0,3004,35,United-States,>50K
8,38,Self-emp-not-inc,164526,Prof-school,15,Never-married,Prof-specialty,Not-in-family,White,Male,0,2824,45,United-States,>50K
9,52,Private,129177,Bachelors,13,Widowed,Other-service,Not-in-family,White,Female,0,2824,20,United-States,>50K


In [5]:
# Classifying columns into categorical and quantitative columns

def check_columns(df: pd.DataFrame):
    columns = df.columns
    # select_dtypes is the public pandas API for picking numeric columns
    quantitative_columns = list(set(df.select_dtypes(include='number').columns))
    categorical_columns = list(set(columns) - set(quantitative_columns))
    return quantitative_columns, categorical_columns

quan_cols, cat_cols = check_columns(census_income_df)
print(f'Quantitative columns are: {quan_cols}')
print(f'Categorical columns are: {cat_cols}')

Quantitative columns are: ['capital.gain', 'age', 'capital.loss', 'fnlwgt', 'hours.per.week', 'education.num']
Categorical columns are: ['relationship', 'income', 'native.country', 'sex', 'marital.status', 'workclass', 'occupation', 'race', 'education']


### Question 1

The US Department of Labor is interested in studying the population mean of
ages of the American adults who work for the workclass = Private sector and have
an annual income greater than 50k.
- Denote X as the age variable for American adults who work for the Private sector
and have an annual Income greater than 50k (‘>50K’)
- Based on the sample dataset, estimate point estimation for population mean of X
- Based on the sample dataset, estimate 95% confidence interval for population
mean of X
- Situation: A recent study claimed that the average age of American adults working
for the Private sector and having Income > 50K was very young in the 1990s, just
about 25. The head of the US Department of Labor believes that the mean should be
higher.
Based on this dataset sample, conduct a hypothesis testing for the population mean
of X to draw conclusions about those statements.

In [49]:
# Selecting Private-sector workers with an income greater than 50K
income_private_sector_df = census_income_df[
    (census_income_df.workclass == 'Private') & (census_income_df.income == '>50K')
]

# Sample size (number of non-missing ages)
number_of_sample_X = income_private_sector_df.age.count()
print(f'\n Sample size of X: {number_of_sample_X}')

# Sample mean: the point estimate of the population mean
population_mean_X_bar = income_private_sector_df.age.mean()
print(f'\n The point estimate for population mean of X: {population_mean_X_bar:.4f}')

# Sample standard deviation (pandas uses ddof=1 by default)
population_std_X = income_private_sector_df.age.std()
print(f'\n Sample Standard Deviation of X: {population_std_X:.4f}')

# Defining the confidence level
confidence_level_X = 0.95
print(f'\n The Confidence Level: {confidence_level_X}')

# Tail probability alpha / 2 (not a p-value, despite the variable name)
p_value_X = (1 - confidence_level_X) / 2
print(f'\n Tail probability (alpha / 2) : {p_value_X:.4f}')
# Critical value of the t distribution with n - 1 degrees of freedom
t_star = st.t.ppf(q = 1 - p_value_X, df = number_of_sample_X - 1)
print(f'\n T-star : {t_star:.4f}')

# Calculating the lower and upper bounds of the confidence interval
lower_bound_X = population_mean_X_bar - t_star*(population_std_X / np.sqrt(number_of_sample_X))
upper_bound_X = population_mean_X_bar + t_star*(population_std_X / np.sqrt(number_of_sample_X))
print(f"\n 95% Confidence Interval for Population Mean of X: \n Lower Bound = {lower_bound_X:.4f} \n Upper Bound = {upper_bound_X:.4f}")


 Sample size of X: 4876

 The point estimate for population mean of X: 42.8201

 Sample Standard Deviation of X: 9.9757

 The Confidence Level: 0.95

 Tail probability (alpha / 2) : 0.0250

 T-star : 1.9605

 95% Confidence Interval for Population Mean of X: 
 Lower Bound = 42.5401 
 Upper Bound = 43.1002


The point estimate of the population mean of X is the sample mean, 42.8201. It sits at the center of the confidence interval by construction, so its position inside the interval is not evidence of anything. The interval itself is the result: we are 95% confident that the true mean age of American adults who work for the Private sector and have an annual income greater than 50k lies between 42.5401 and 43.1002.
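
As a cross-check, the same interval can be reproduced from the summary statistics printed above with `scipy.stats.t.interval`. This is a minimal sketch: the three inputs are copied from the rounded output rather than recomputed from the dataset, so the bounds may differ from the notebook's in the last decimal place, and the names `n`, `x_bar`, and `s` are illustrative.

```python
import numpy as np
import scipy.stats as st

# Summary statistics copied from the printed output (rounded to 4 decimals)
n = 4876        # sample size
x_bar = 42.8201  # sample mean
s = 9.9757       # sample standard deviation

# Standard error of the mean
standard_error = s / np.sqrt(n)

# The confidence level is passed positionally because its keyword name
# differs between SciPy versions
lower, upper = st.t.interval(0.95, n - 1, loc=x_bar, scale=standard_error)
print(f'95% CI for the mean of X: ({lower:.4f}, {upper:.4f})')
```

The call centers a t distribution with `n - 1` degrees of freedom at the sample mean and scales it by the standard error, which is the same computation as the manual `t_star` formula.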

#### Hypothesis testing for population mean of X 

##### Step1: State the Hypotheses:

- Null Hypothesis (Ho): The mean age of American adults working for the Private sector and having Income > 50K is 25.
- Alternative Hypothesis (Ha): The mean age of American adults working for the Private sector and having Income > 50K is greater than 25.

___H0 : mu = mu_0 = 25___

___Ha: mu > 25___

##### Step2: Collecting and summarizing data using t-test statistics, t-score

In [56]:
# Calculating t_score
t_score_X = (population_mean_X_bar - 25) / (population_std_X / np.sqrt(number_of_sample_X))

print(f'\n T_Score of X : {t_score_X:.4f}')


 T_Score of X : 124.7385
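
For reference, the quantity computed in this step is the one-sample t statistic, where $\bar{x}$ is the sample mean, $s$ the sample standard deviation, $n$ the sample size, and $\mu_0$ the hypothesized mean. Plugging in the rounded values printed earlier reproduces the t-score up to rounding:

$$
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}
  = \frac{42.8201 - 25}{9.9757 / \sqrt{4876}}
  \approx \frac{17.8201}{0.14286}
  \approx 124.74
$$

Under the null hypothesis, $t$ follows a t distribution with $n - 1 = 4875$ degrees of freedom.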


##### Step3: Finding the p-Value of the Test

In [57]:
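# Calculating the right-tailed p-value
# sf (survival function) equals 1 - cdf but avoids the rounding error of the subtraction
p_value_X = st.t.sf(x=t_score_X, df=number_of_sample_X-1)
print(f'\n p value of X : {p_value_X:.4f}')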


 p value of X : 0.0000


##### Step4: Draw conclusions based on p-value

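The p-value prints as 0.0000, far below the 0.05 significance level that corresponds to the 95% confidence level, so the null hypothesis is rejected. The sample gives strong evidence that the mean age of X is greater than 25, which supports the belief of the head of the US Department of Labor rather than the study's claim.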
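
The manual steps above can be cross-checked against SciPy's built-in one-sample t-test. The sketch below is illustrative only: it runs on synthetic ages drawn from a normal distribution (an assumption, not the census data), and it relies on the `alternative` argument of `scipy.stats.ttest_1samp`, which requires SciPy 1.6 or later.

```python
import numpy as np
import scipy.stats as st

# Synthetic stand-in for the age column (assumption: not the census data)
rng = np.random.default_rng(seed=0)
ages = rng.normal(loc=43, scale=10, size=200)

mu_0 = 25  # hypothesized mean under H0

# Manual computation, mirroring Steps 2 and 3
# ddof=1 gives the sample standard deviation (NumPy defaults to ddof=0)
n = ages.size
t_manual = (ages.mean() - mu_0) / (ages.std(ddof=1) / np.sqrt(n))
p_manual = st.t.sf(t_manual, df=n - 1)

# Built-in test; alternative='greater' returns the right-tailed p-value
result = st.ttest_1samp(ages, popmean=mu_0, alternative='greater')

print(f't: manual = {t_manual:.4f}, scipy = {result.statistic:.4f}')
print(f'p: manual = {p_manual:.3e}, scipy = {result.pvalue:.3e}')
```

Both routes should agree, since `ttest_1samp` computes the same statistic and reads the p-value from the same t distribution.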

### Question 2

The US Department of Labor is also interested in studying the population mean of the number of education years ‘education.num’ of American adults who work for the workclass = Private and have an annual Income greater than 50k (‘>50K’) in the 1990s.
- Denote Y as the variable for number of education years ‘education.num’ of American adults who work for the workclass = Private and have an annual Income greater than $50k (‘>50K’) in the 1990s.
- Based on the sample dataset, estimate point estimation for population mean of Y
- Based on the sample dataset, estimate 95% confidence interval for population mean of Y
- Situation: A recent study claimed that the average number of education years ‘education.num’ of American adults who work for the workclass = Private and have an annual Income greater than 50k (‘>50K’) in the 1990s was small, just about 3 years. The head of the US Department of Labor believes that the mean
should be higher.

Based on this dataset sample, conduct a hypothesis testing for the population mean
of Y to draw conclusions about those statements

In [62]:
# Sample size (number of non-missing values)
number_of_sample_Y = income_private_sector_df['education.num'].count()
print(f'\n Sample size of Y: {number_of_sample_Y}')

# Sample mean: the point estimate of the population mean
population_mean_Y_bar = income_private_sector_df['education.num'].mean()
print(f'\n The point estimate for population mean of Y: {population_mean_Y_bar:.4f}')

# Sample standard deviation (pandas uses ddof=1 by default)
population_std_Y = income_private_sector_df['education.num'].std()
print(f'\n Sample Standard Deviation of Y: {population_std_Y:.4f}')

# Defining the confidence level
confidence_level_Y = 0.95
print(f'\n The Confidence Level: {confidence_level_Y}')

# Tail probability alpha / 2 (not a p-value, despite the variable name)
p_value_Y = (1 - confidence_level_Y) / 2
print(f'\n Tail probability (alpha / 2) : {p_value_Y:.4f}')
# Critical value of the t distribution with n - 1 degrees of freedom
t_star_Y = st.t.ppf(q = 1 - p_value_Y, df = number_of_sample_Y - 1)
print(f'\n T-star : {t_star_Y:.4f}')

# Calculating the lower and upper bounds of the confidence interval
lower_bound_Y = population_mean_Y_bar - t_star_Y*(population_std_Y / np.sqrt(number_of_sample_Y))
upper_bound_Y = population_mean_Y_bar + t_star_Y*(population_std_Y / np.sqrt(number_of_sample_Y))
print(f"\n 95% Confidence Interval for Population Mean of Y: \n Lower Bound = {lower_bound_Y:.4f} \n Upper Bound = {upper_bound_Y:.4f}")


 Sample size of Y: 4876

 The point estimate for population mean of Y: 11.4161

 Sample Standard Deviation of Y: 2.2814

 The Confidence Level: 0.95

 Tail probability (alpha / 2) : 0.0250

 T-star : 1.9605

 95% Confidence Interval for Population Mean of Y: 
 Lower Bound = 11.3521 
 Upper Bound = 11.4802


The point estimate of the population mean of Y is the sample mean, 11.4161. As with X, it lies at the center of the interval by construction. We are 95% confident that the true mean number of education years of American adults who work for the Private sector and have an annual income greater than 50k lies between 11.3521 and 11.4802.
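
What "95% confidence" means can be illustrated with a small simulation: if the sampling were repeated many times, roughly 95% of the resulting intervals would contain the true mean. The sketch below assumes a hypothetical normal population whose parameters (`true_mean`, `true_std`) are chosen for illustration only and are not estimated from the census data.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(seed=42)

# Hypothetical population parameters, chosen only for illustration
true_mean, true_std = 11.4, 2.3
sample_size, n_repeats = 50, 2000

# Critical value for a 95% interval with sample_size - 1 degrees of freedom
t_star = st.t.ppf(0.975, df=sample_size - 1)

covered = 0
for _ in range(n_repeats):
    sample = rng.normal(true_mean, true_std, size=sample_size)
    margin = t_star * sample.std(ddof=1) / np.sqrt(sample_size)
    # Count the interval if it contains the true population mean
    if sample.mean() - margin <= true_mean <= sample.mean() + margin:
        covered += 1

coverage = covered / n_repeats
print(f'Share of intervals containing the true mean: {coverage:.3f}')
```

The confidence level describes the long-run behavior of the procedure, not the probability that one particular computed interval contains the mean.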

#### Hypothesis testing for population mean of Y 

##### Step1: State the Hypotheses:

- Null Hypothesis (Ho): The mean number of education years of American adults working for the Private sector and having Income > 50K is 3.
- Alternative Hypothesis (Ha): The mean number of education years of American adults working for the Private sector and having Income > 50K is greater than 3.

___H0 : mu = mu_0 = 3___

___Ha: mu > 3___

##### Step2: Collecting and summarizing data using t-test statistics, t-score

In [58]:
# Calculating t_score
t_score_Y = (population_mean_Y_bar - 3) / (population_std_Y / np.sqrt(number_of_sample_Y))

print(f'\n T_Score of Y : {t_score_Y:.4f}')


 T_Score of Y : 257.6011


##### Step3: Finding the p-Value of the Test

In [59]:
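# Calculating the right-tailed p-value
# sf (survival function) equals 1 - cdf but avoids the rounding error of the subtraction
p_value_Y = st.t.sf(x=t_score_Y, df=number_of_sample_Y-1)
print(f'\n p value of Y : {p_value_Y:.4f}')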


 p value of Y : 0.0000


##### Step4: Draw conclusions based on p-value

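The p-value prints as 0.0000, far below the 0.05 significance level, so the null hypothesis is rejected. The sample gives strong evidence that the mean number of education years of Y is greater than 3, which supports the belief of the head of the US Department of Labor rather than the study's claim.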
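
Because Questions 1 and 2 repeat the same steps, the test can be wrapped in a small helper. The function below is a hypothetical sketch (the name `right_tailed_t_test` is not part of the notebook or of SciPy), and it is applied to the rounded summary statistics printed in the outputs, so its t-scores match the notebook's only up to rounding.

```python
import numpy as np
import scipy.stats as st

def right_tailed_t_test(sample_mean, sample_std, n, mu_0):
    """Return (t statistic, p-value) for H0: mu = mu_0 against Ha: mu > mu_0."""
    t_score = (sample_mean - mu_0) / (sample_std / np.sqrt(n))
    # Right-tail probability of a t distribution with n - 1 degrees of freedom
    p_value = st.t.sf(t_score, df=n - 1)
    return t_score, p_value

# Inputs copied from the printed outputs (rounded to 4 decimals)
t_X, p_X = right_tailed_t_test(42.8201, 9.9757, 4876, mu_0=25)
t_Y, p_Y = right_tailed_t_test(11.4161, 2.2814, 4876, mu_0=3)

print(f'X: t = {t_X:.2f}, reject H0 at 0.05: {p_X < 0.05}')
print(f'Y: t = {t_Y:.2f}, reject H0 at 0.05: {p_Y < 0.05}')
```

Working from summary statistics keeps the sketch independent of the CSV file; with the DataFrame available, the same helper could take `mean()`, `std()`, and `count()` of the relevant column.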