# Lab-5: Statistical Estimation and Hypothesis Testing for Population Proportion

### By Mohan Bakshi

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
from IPython.display import display

In [2]:
# Importing cleaned_census_income dataset
census_income_df = pd.read_csv('cleaned_census_income.csv')

In [3]:
print(census_income_df.shape)
census_income_df.head(10)

(30162, 15)


| | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 82 | Private | 132870 | HS-grad | 9 | Widowed | Exec-managerial | Not-in-family | White | Female | 0 | 4356 | 18 | United-States | <=50K |
| 1 | 54 | Private | 140359 | 7th-8th | 4 | Divorced | Machine-op-inspct | Unmarried | White | Female | 0 | 3900 | 40 | United-States | <=50K |
| 2 | 41 | Private | 264663 | Some-college | 10 | Separated | Prof-specialty | Own-child | White | Female | 0 | 3900 | 40 | United-States | <=50K |
| 3 | 34 | Private | 216864 | HS-grad | 9 | Divorced | Other-service | Unmarried | White | Female | 0 | 3770 | 45 | United-States | <=50K |
| 4 | 38 | Private | 150601 | 10th | 6 | Separated | Adm-clerical | Unmarried | White | Male | 0 | 3770 | 40 | United-States | <=50K |
| 5 | 74 | State-gov | 88638 | Doctorate | 16 | Never-married | Prof-specialty | Other-relative | White | Female | 0 | 3683 | 20 | United-States | >50K |
| 6 | 68 | Federal-gov | 422013 | HS-grad | 9 | Divorced | Prof-specialty | Not-in-family | White | Female | 0 | 3683 | 40 | United-States | <=50K |
| 7 | 45 | Private | 172274 | Doctorate | 16 | Divorced | Prof-specialty | Unmarried | Black | Female | 0 | 3004 | 35 | United-States | >50K |
| 8 | 38 | Self-emp-not-inc | 164526 | Prof-school | 15 | Never-married | Prof-specialty | Not-in-family | White | Male | 0 | 2824 | 45 | United-States | >50K |
| 9 | 52 | Private | 129177 | Bachelors | 13 | Widowed | Other-service | Not-in-family | White | Female | 0 | 2824 | 20 | United-States | >50K |


In [11]:
# Classifying columns into categorical and quantitative columns

def check_columns(df: pd.DataFrame):
    columns = df.columns
    # select_dtypes is the public pandas API for picking out numeric columns
    quantitative_columns = list(set(df.select_dtypes(include='number').columns))
    categorical_columns = list(set(columns) - set(quantitative_columns))
    return quantitative_columns, categorical_columns

quan_cols, cat_cols = check_columns(census_income_df)
print(f'Quantitative columns are: {quan_cols}')
print(f'Categorical columns are: {cat_cols}')

Quantitative columns are: ['age', 'hours.per.week', 'education.num', 'capital.loss', 'fnlwgt', 'capital.gain']
Categorical columns are: ['marital.status', 'workclass', 'education', 'sex', 'occupation', 'race', 'native.country', 'income', 'relationship']


### Question 1

The US Department of Labor is interested in studying the population proportion of American adults who work for the workclass = Private sector and have an annual income greater than 50k.
- Denote X as the variable for American adults who work for the Private sector and have an annual Income greater than 50k (‘>50K’)
- Based on the sample dataset, compute a point estimate for the population proportion of X.
- Based on the sample dataset, compute a 95% confidence interval for the population proportion of X.
- A recent study claimed that the percentage of American adults working for the Private sector and having Income > 50K was very low in the 1990s, just about 25%. The head of the US Department of Labor believes that the percentage should be much higher. Based on this sample dataset, conduct a hypothesis test for the population proportion of X to draw conclusions about those statements.

In [5]:
# Calculating the sample proportion
p_hat_X = len(census_income_df[(census_income_df['workclass'] == 'Private') & (census_income_df['income'] == '>50K')]) / len(census_income_df)
print(f'\nAmerican adults who work for the Private sector and have an annual Income greater than 50k (‘>50K’): {p_hat_X:.4f}')

# Defining the confidence level
confidence_level_X = 0.95
print(f'\nThe Confidence Level: {confidence_level_X}')

# Calculating the z-score corresponding to the confidence level
z_score_X = stats.norm.ppf((1 + confidence_level_X) / 2)
print(f'\nZ score corresponding to the confidence level: {z_score_X:.4f}')

# Calculating the standard error
std_error_X = np.sqrt(p_hat_X * (1 - p_hat_X) / len(census_income_df))
print(f'\nStandard Error: {std_error_X:.4f}')

# Calculating the margin of error
margin_of_error_X = z_score_X * std_error_X
print(f'\nMargin of Error: {margin_of_error_X:.4f}')

# Calculating the lower and upper bounds of the confidence interval
lower_bound_X = p_hat_X - margin_of_error_X
upper_bound_X = p_hat_X + margin_of_error_X
print(f"\n95% Confidence Interval for Population Proportion of X: \n Lower Bound = {lower_bound_X:.4f} \n Upper Bound = {upper_bound_X:.4f}")


American adults who work for the Private sector and have an annual Income greater than 50k (‘>50K’): 0.1617

The Confidence Level: 0.95

Z score corresponding to the confidence level: 1.9600

Standard Error: 0.0021

Margin of Error: 0.0042

95% Confidence Interval for Population Proportion of X: 
 Lower Bound = 0.1575 
 Upper Bound = 0.1658


The point estimate of the population proportion is 0.1617, and the 95% confidence interval built around it is (0.1575, 0.1658). We are therefore 95% confident that the true population proportion of American adults working for the Private sector and having Income > 50K lies between 15.75% and 16.58%.
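
The interval computation above can be wrapped in a small helper. This is a minimal sketch, not the notebook's own code: the function name `wald_interval` and the counts passed to it are hypothetical, and it uses the standard library's `statistics.NormalDist` in place of `scipy.stats.norm` so that it runs without the dataset.

```python
import math
from statistics import NormalDist


def wald_interval(successes: int, n: int, confidence: float = 0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n
    # Two-sided critical value, about 1.96 for 95% confidence
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, p_hat - margin, p_hat + margin


# Hypothetical counts, chosen only for illustration
p_hat, lower, upper = wald_interval(successes=160, n=1000)
print(f"p_hat = {p_hat:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

The point estimate is always the midpoint of this interval, so the interval's width (twice the margin of error) is what carries the information about precision.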

#### Hypothesis testing for population proportion

##### Step1: State the Hypotheses:

- Null Hypothesis (H0): The population proportion of American adults working for the Private sector and having Income > 50K is 25%.
- Alternative Hypothesis (Ha): The population proportion of American adults working for the Private sector and having Income > 50K is greater than 25%.

___H0: p = 0.25___

___Ha: p > 0.25___

##### Step2: Computing the test statistic (z-score)

In [6]:
# Calculating z-score
p0 = 0.25
z_score = (p_hat_X - p0) / np.sqrt((p0 * (1 - p0) / len(census_income_df)))

print(f"Z-Score: {z_score:.4f}")

Z-Score: -35.4311
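
As a check, substituting the rounded values printed earlier into the one-proportion z statistic gives nearly the same number. The small gap from -35.4311 comes from rounding the sample proportion to four decimals.

$$
z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}} = \frac{0.1617 - 0.25}{\sqrt{0.25 \times 0.75 / 30162}} \approx \frac{-0.0883}{0.002493} \approx -35.4
$$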


##### Step3: Finding the p-Value of the Test

In [7]:
# Calculating the right-tailed p-value P(Z > z) with the survival function
p_value_X = stats.norm.sf(z_score)

print(f"P(Z > z_score): {p_value_X:.4f}")

P(Z > z_score): 1.0000


With a z-score of -35.4311, the right-tailed p-value P(Z > z) is 1.0000, far above the significance level of 0.05, so we fail to reject the null hypothesis that the population proportion of American adults working for the Private sector and having Income > 50K is 25%. The sample gives no support for the belief that the proportion is higher than 25%; the sample proportion (16.17%) is in fact well below it.
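
The test above can be sketched as a reusable function. This is an illustration under stated assumptions rather than the lab's method: the name `one_proportion_z_test`, its `alternative` parameter, and the inputs are hypothetical, and the normal tail probability is computed with `math.erfc` from the standard library instead of `scipy.stats.norm`.

```python
import math


def one_proportion_z_test(p_hat: float, p0: float, n: int, alternative: str = "greater"):
    """One-sample z-test for a proportion; returns (z, p_value)."""
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    if alternative == "greater":
        # Upper-tail probability P(Z > z) of a standard normal
        p_value = 0.5 * math.erfc(z / math.sqrt(2))
    elif alternative == "less":
        # Lower-tail probability P(Z < z)
        p_value = 0.5 * math.erfc(-z / math.sqrt(2))
    else:
        raise ValueError("alternative must be 'greater' or 'less'")
    return z, p_value


# Hypothetical inputs: a sample proportion below the hypothesized value
z, p_greater = one_proportion_z_test(p_hat=0.20, p0=0.25, n=400)
_, p_less = one_proportion_z_test(p_hat=0.20, p0=0.25, n=400, alternative="less")
print(f"z = {z:.4f}, P(Z > z) = {p_greater:.4f}, P(Z < z) = {p_less:.4f}")
```

Whenever the sample proportion falls below the hypothesized value, z is negative and the right-tailed p-value exceeds 0.5, so a "greater than" alternative can never be supported by such a sample.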

### Question 2

The US Department of Labor is also interested in studying the population proportion of American adults who work for the workclass Private and have education = Bachelors in the 1990s.
- Denote Y as the variable for American adults who work for the workclass = Private and have education = Bachelors.
- Based on the sample dataset, compute a point estimate for the population proportion of Y.
- Based on the sample dataset, compute a 95% confidence interval for the population proportion of Y.
- A recent study claimed that the percentage of American adults who work for the workclass = Private and have education = Bachelors was just about 5% in the 1990s. The head of the US Department of Labor believes that the percentage should be much higher. Based on this sample dataset, conduct a hypothesis test for the population proportion of Y to draw conclusions about those statements.

In [8]:
# Calculating the sample proportion
p_hat_Y= len(census_income_df[(census_income_df['workclass'] == 'Private') & (census_income_df['education'] == 'Bachelors')]) / len(census_income_df)
print(f'\nAmerican adults who work for the Private sector and have an education = "Bachelors": {p_hat_Y:.4f}')

# Defining the confidence level
confidence_level_Y = 0.95
print(f'\nThe Confidence Level: {confidence_level_Y}')

# Calculating the z-score corresponding to the confidence level
z_score_Y = stats.norm.ppf((1 + confidence_level_Y) / 2)
print(f'\nZ score corresponding to the confidence level: {z_score_Y:.4f}')

# Calculating the standard error
std_error_Y = np.sqrt(p_hat_Y * (1 - p_hat_Y) / len(census_income_df))
print(f'\nStandard Error: {std_error_Y:.4f}')

# Calculating the margin of error
margin_of_error_Y = z_score_Y * std_error_Y
print(f'\nMargin of Error: {margin_of_error_Y:.4f}')

# Calculating the lower and upper bounds of the confidence interval
lower_bound_Y = p_hat_Y - margin_of_error_Y
upper_bound_Y = p_hat_Y + margin_of_error_Y
print(f"\n95% Confidence Interval for Population Proportion of Y: \n Lower Bound = {lower_bound_Y:.4f} \n Upper Bound = {upper_bound_Y:.4f}")


American adults who work for the Private sector and have an education = "Bachelors": 0.1145

The Confidence Level: 0.95

Z score corresponding to the confidence level: 1.9600

Standard Error: 0.0018

Margin of Error: 0.0036

95% Confidence Interval for Population Proportion of Y: 
 Lower Bound = 0.1110 
 Upper Bound = 0.1181


The point estimate of the population proportion is 0.1145, and the 95% confidence interval built around it is (0.1110, 0.1181). We are therefore 95% confident that the true population proportion of American adults working for the Private sector and having an education level of "Bachelors" lies between 11.10% and 11.81%.
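
Both intervals rely on the normal approximation to the sampling distribution of the sample proportion. A common rule of thumb, added here as an assumption since the lab does not state it, is that the expected numbers of successes and failures should each be at least 10. A quick check with the sample size and rounded proportions printed above (the helper name `normal_approx_ok` is hypothetical):

```python
def normal_approx_ok(p_hat: float, n: int, threshold: float = 10) -> bool:
    """Rule-of-thumb check: n*p_hat and n*(1 - p_hat) both reach the threshold."""
    return n * p_hat >= threshold and n * (1 - p_hat) >= threshold


n = 30162  # number of rows reported by census_income_df.shape
results = {label: normal_approx_ok(p_hat, n) for label, p_hat in [("X", 0.1617), ("Y", 0.1145)]}
print(results)
```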

#### Hypothesis testing for population proportion

##### Step1: State the Hypotheses:

- Null Hypothesis (H0): The population proportion of American adults working for the Private sector and having Education = Bachelors is 5%.
- Alternative Hypothesis (Ha): The population proportion of American adults working for the Private sector and having Education = Bachelors is greater than 5%.

___H0: p = 0.05___

___Ha: p > 0.05___

##### Step2: Computing the test statistic (z-score)

In [9]:
# Calculating z-score
p0 = 0.05 
z_score = (p_hat_Y - p0) / np.sqrt((p0 * (1 - p0) / len(census_income_df)))

print(f"Z-Score: {z_score:.4f}")

Z-Score: 51.4360
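
As a check, the same statistic computed by hand from the rounded sample proportion (the small gap from 51.4360 is due to rounding 0.1145):

$$
z = \frac{0.1145 - 0.05}{\sqrt{0.05 \times 0.95 / 30162}} \approx \frac{0.0645}{0.001255} \approx 51.4
$$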


##### Step3: Finding the p-Value of the Test

In [10]:
# Calculating the right-tailed p-value P(Z > z) with the survival function
p_value = stats.norm.sf(z_score)

print(f"P(Z > z_score): {p_value:.4f}")

P(Z > z_score): 0.0000


Since the p-value P(Z > z), which rounds to 0.0000, is less than the chosen significance level of 0.05, we reject the null hypothesis (H0) that the percentage of American adults working for the workclass = Private and having education = Bachelors is 5%. We have sufficient evidence to support the alternative hypothesis (Ha) that the percentage is higher than 5%, as believed by the head of the US Department of Labor.
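
The same decisions can be reached by comparing each test statistic with the right-tailed critical value at the 0.05 significance level. This is a brief sketch of that equivalent approach, which the lab itself does not use; it takes the z-scores printed above as inputs and uses the standard library's `statistics.NormalDist`.

```python
from statistics import NormalDist

alpha = 0.05
# Right-tailed critical value: reject H0 when z exceeds it (about 1.645)
z_critical = NormalDist().inv_cdf(1 - alpha)

# z-scores as printed in the notebook output
z_scores = {"Question 1": -35.4311, "Question 2": 51.4360}
decisions = {
    label: "reject H0" if z > z_critical else "fail to reject H0"
    for label, z in z_scores.items()
}
for label, decision in decisions.items():
    print(f"{label}: z = {z_scores[label]:.4f} -> {decision}")
```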