In this notebook we test whether the disparity we found in 01_eda is statistically significant, using a variety of statistical tests.

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

df = pd.read_csv('../data/adult_clean.csv')

What is Probability, really?

Before hypothesis testing, we need to internalize what probability means in a data context.
When we say "31.2% of men earn >50K", what are we really saying?

We have two interpretations:
- Frequentist: If we sampled men from this population over and over, about 31.2% would earn >50K on average. It's a long-run frequency.
- In the data: Out of every 1000 men in this dataset, roughly 312 are in the >50K bracket.
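To make the "long-run frequency" idea concrete, here is a minimal simulation sketch. It assumes a true rate of 0.3125 (the P(>50K | Male) value computed in this notebook) and a fixed random seed; both are choices made for illustration, not part of the original analysis.

```python
import numpy as np

# Assumption: treat the observed share of men earning >50K as the "true"
# population rate, purely for illustration.
true_rate = 0.3125

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

# Draw 100,000 simulated men; True means "earns >50K".
draws = rng.random(100_000) < true_rate

# Share of >50K earners after each additional draw.
running_share = np.cumsum(draws) / np.arange(1, draws.size + 1)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"after {n:>7,} draws: {running_share[n - 1]:.4f}")
```

With few draws the share can sit well away from the assumed rate; as the number of draws grows it should settle close to it, which is what the frequentist reading describes.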

The next two code cells compute this proportion directly from the data, first for men and then for women.

In [7]:
# first, we are going to get the total men in the dataset
total_men = (df['sex'] == 'Male').sum()

# now we get the men earning >50K 
men_over_50k = ((df['sex'] == 'Male') & (df['class'] == '>50K')).sum()

# we get the probability
prob_men_over_50k = men_over_50k / total_men

# now we print out to see our results
print(f"Total men: {total_men}")
print(f"Men >50K: {men_over_50k}")
print(f"P(>50K | Male): {prob_men_over_50k:.4f}")


Total men: 30527
Men >50K: 9539
P(>50K | Male): 0.3125



We write "P(>50K | Male)" instead of "Probability of Males over 50K" because it is the standard notation for "the probability of >50K given Male". The vertical bar in the middle means "given that".
- This is called "conditional probability notation".

Formula:
P(A | B) = P(A and B) / P(B)

In plain English: "Of all the men (B), what fraction also earn >50K (A and B)?"
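As a quick check of the formula, the sketch below plugs in the counts printed in this notebook's outputs (30527 men, 14695 women, 9539 men earning >50K) and confirms that P(A and B) / P(B) gives the same number as dividing the counts directly. The variable names are only for illustration.

```python
# Counts copied from this notebook's outputs (cleaned Adult dataset).
n_total = 30527 + 14695   # men + women
n_men = 30527             # event B: the person is male
n_men_over_50k = 9539     # event "A and B": male and earning >50K

p_b = n_men / n_total                  # P(B)
p_a_and_b = n_men_over_50k / n_total   # P(A and B)
p_a_given_b = p_a_and_b / p_b          # P(A | B)

print(f"P(B)       = {p_b:.4f}")
print(f"P(A and B) = {p_a_and_b:.4f}")
print(f"P(A | B)   = {p_a_given_b:.4f}")
```

The total cancels out of the ratio, which is why counting men earning >50K and dividing by the number of men is enough.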

So, what's the difference between a p-value and a probability (proportion)?

Probability (proportion):
- What it measures: How often something occurs in your data
- Comes from: Counting occurrences
- Range: 0 to 1
- Example: "31% of men earn >50K"
- Notation: P(A), P(A|B)

P-value:
- What it measures: How likely a result at least as extreme as yours would be if the null hypothesis were true (that is, if only chance were at work)
- Comes from: Running a statistical test
- Range: 0 to 1
- Example: "If there were no real difference, we'd see a gap this large in only 0.5% of samples"
- Notation: p = 0.005
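One way to see what a p-value measures is to simulate the "no real difference" world and count how often chance alone produces a gap as large as the one we observed. The sketch below is an illustration, not the test used later in this notebook: it assumes both sexes share one pooled >50K rate, uses the counts printed in this notebook's outputs, and picks an arbitrary 10,000 simulations with a fixed seed.

```python
import numpy as np

# Counts copied from this notebook's outputs.
n_men, men_over_50k = 30527, 9539
n_women, women_over_50k = 14695, 1669

observed_gap = men_over_50k / n_men - women_over_50k / n_women

# Null hypothesis: sex has no effect, so both groups share one pooled rate.
pooled_rate = (men_over_50k + women_over_50k) / (n_men + n_women)

rng = np.random.default_rng(seed=0)
n_sims = 10_000  # arbitrary choice for this sketch

# Simulate the >50K share in each group under the null, n_sims times.
sim_men = rng.binomial(n_men, pooled_rate, size=n_sims) / n_men
sim_women = rng.binomial(n_women, pooled_rate, size=n_sims) / n_women
null_gaps = sim_men - sim_women

# Two-sided: how often is a chance gap at least as large as the observed one?
n_extreme = int(np.sum(np.abs(null_gaps) >= abs(observed_gap)))
p_estimate = (n_extreme + 1) / (n_sims + 1)  # +1 avoids reporting exactly 0

print(f"Observed gap: {observed_gap:.4f}")
print(f"Largest gap under the null: {np.abs(null_gaps).max():.4f}")
print(f"Simulated p-value estimate: {p_estimate:.5f}")
```

Note that the estimate cannot go below 1 / (n_sims + 1), so a simulation like this only gives an upper bound when the true p-value is very small.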

In [6]:
# first, we are going to get the total women in the dataset
total_women = (df['sex'] == 'Female').sum()

# now we get the women earning >50K
women_over_50k = ((df['sex'] == 'Female') & (df['class'] == '>50K')).sum()

# we get the probability
prob_women_over_50k = women_over_50k / total_women

# now we print out to see our results
print(f"Total women: {total_women}")
print(f"Women >50K: {women_over_50k}")
print(f"P(>50K | Female): {prob_women_over_50k:.4f}")


Total women: 14695
Women >50K: 1669
P(>50K | Female): 0.1136


Chi-Square Test for Independence

What it asks: Are two categorical variables independent, or is there a relationship between them?

For this dataset: Is 'sex' independent of 'class', or does knowing someone's sex tell you something about their income?

The logic:
- Null hypothesis (H0): Sex and income are independent, so there is no relationship
- Alternative hypothesis (H1): Sex and income are *not* independent, so there is a relationship

In [8]:
from scipy.stats import chi2_contingency
# Create a contingency table (raw counts, not proportions)
contingency_table = pd.crosstab(df['sex'], df['class'])
print(contingency_table)

# Perform chi-square test
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
print(f"\nChi-square statistic: {chi2:.2f}")
print(f"P-value: {p_value:.2e}")
print(f"Degrees of freedom: {dof}")

class   <=50K  >50K
sex                
Female  13026  1669
Male    20988  9539

Chi-square statistic: 2104.13
P-value: 0.00e+00
Degrees of freedom: 1
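The statistic above can be reproduced by hand, which shows where it comes from. The sketch below hard-codes the counts from the contingency table printed above, builds the counts we would expect if sex and income were independent, and sums the squared gaps. Two caveats: `chi2_contingency` applies Yates' continuity correction by default for a 2x2 table, so both versions are shown; and the log p-value trick at the end relies on an identity that only holds for 1 degree of freedom.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the contingency table printed above.
observed = np.array([[13026, 1669],    # Female: <=50K, >50K
                     [20988, 9539]])   # Male:   <=50K, >50K

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Counts expected under independence: row total * column total / n.
expected = row_totals * col_totals / n

# Plain Pearson statistic, and the Yates-corrected version for 2x2 tables.
chi2_plain = ((observed - expected) ** 2 / expected).sum()
chi2_yates = ((np.abs(observed - expected) - 0.5) ** 2 / expected).sum()

print(f"Chi-square (no correction): {chi2_plain:.2f}")
print(f"Chi-square (Yates):         {chi2_yates:.2f}")

# The p-value prints as 0.00e+00 because it is too small for a float.
# With 1 degree of freedom, p = 2 * P(Z > sqrt(chi2)), so we can work in logs.
log_p = np.log(2) + stats.norm.logsf(np.sqrt(chi2_yates))
print(f"log10(p-value): {log_p / np.log(10):.1f}")
```

The Yates-corrected value should match the statistic reported by `chi2_contingency` above, and the log10 output gives a sense of how small the underflowed p-value really is.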


In [9]:
n = contingency_table.sum().sum()
cramers_v = np.sqrt(chi2 / (n * (min(contingency_table.shape) - 1)))
print(f"Cramér's V: {cramers_v:.4f}")

Cramér's V: 0.2157


In [10]:
contingency_race = pd.crosstab(df['race'], df['class'])
print(contingency_race)

chi2_race, p_race, dof_race, expected_race = chi2_contingency(contingency_race)
print(f"\nChi-square statistic: {chi2_race:.2f}")
print(f"P-value: {p_race:.2e}")
print(f"Degrees of freedom: {dof_race}")

n_race = contingency_race.sum().sum()
cramers_v_race = np.sqrt(chi2_race / (n_race * (min(contingency_race.shape) - 1)))
print(f"Cramér's V: {cramers_v_race:.4f}")

class               <=50K   >50K
race                            
Amer-Indian-Eskimo    382     53
Asian-Pac-Islander    934    369
Black                3694    534
Other                 308     45
White               28696  10207

Chi-square statistic: 452.30
P-value: 1.38e-96
Degrees of freedom: 4
Cramér's V: 0.1000
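Since the Cramér's V calculation is repeated for sex and race, it could be wrapped in a small function. `cramers_v` below is a hypothetical helper, not something SciPy provides under that name here; it follows the same formula as the cells above and, like them, uses the default `chi2_contingency` settings. The tables are hard-coded from the outputs printed in this notebook so the sketch runs on its own.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramér's V for a contingency table of raw counts (hypothetical helper)."""
    table = np.asarray(table)
    chi2_stat = chi2_contingency(table)[0]   # same default settings as above
    n = table.sum()
    k = min(table.shape) - 1                 # smaller dimension minus one
    return np.sqrt(chi2_stat / (n * k))

# Counts copied from the crosstab outputs in this notebook.
sex_table = np.array([[13026, 1669],
                      [20988, 9539]])
race_table = np.array([[382, 53],
                       [934, 369],
                       [3694, 534],
                       [308, 45],
                       [28696, 10207]])

v_sex = cramers_v(sex_table)
v_race = cramers_v(race_table)
print(f"Cramér's V (sex vs class):  {v_sex:.4f}")
print(f"Cramér's V (race vs class): {v_race:.4f}")
```

In the notebook itself the function could take `pd.crosstab(df['sex'], df['class'])` directly, since `np.asarray` accepts a DataFrame.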


In [15]:
from statsmodels.stats.proportion import proportion_confint

# Men >50K
men_total = (df['sex'] == 'Male').sum()
men_over_50k = ((df['sex'] == 'Male') & (df['class'] == '>50K')).sum()

ci_low, ci_high = proportion_confint(men_over_50k, men_total, alpha=0.05, method='wilson')
print(f"Men >50K: {men_over_50k / men_total:.4f}")
print(f"95% CI: [{ci_low:.4f}, {ci_high:.4f}]")

Men >50K: 0.3125
95% CI: [0.3073, 0.3177]


In [16]:
women_total = (df['sex'] == 'Female').sum()
women_over_50k = ((df['sex'] == 'Female') & (df['class'] == '>50K')).sum()

ci_low_f, ci_high_f = proportion_confint(women_over_50k, women_total, alpha=0.05, method='wilson')
print(f"Women >50K: {women_over_50k / women_total:.4f}")
print(f"95% CI: [{ci_low_f:.4f}, {ci_high_f:.4f}]")

Women >50K: 0.1136
95% CI: [0.1085, 0.1188]
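To see what `method='wilson'` is doing, here is a sketch of the Wilson score interval written out by hand. `wilson_interval` is a hypothetical helper written for this illustration, and the counts are copied from the outputs above; it should agree with the statsmodels intervals to about four decimal places.

```python
import numpy as np
from scipy.stats import norm

def wilson_interval(successes, total, alpha=0.05):
    """Wilson score interval for a proportion (hypothetical helper)."""
    z = norm.ppf(1 - alpha / 2)      # about 1.96 for a 95% interval
    p_hat = successes / total
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half_width = z * np.sqrt(
        p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)
    ) / denom
    return center - half_width, center + half_width

# Counts copied from this notebook's outputs.
men_low, men_high = wilson_interval(9539, 30527)
women_low, women_high = wilson_interval(1669, 14695)

print(f"Men   95% CI: [{men_low:.4f}, {men_high:.4f}]")
print(f"Women 95% CI: [{women_low:.4f}, {women_high:.4f}]")
print(f"Intervals overlap: {men_low <= women_high}")
```

The two intervals do not overlap, which is consistent with the chi-square result earlier in the notebook.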
