# Hypothesis Testing on Voter Turnout

This notebook tests whether voter turnout, measured as the voting-eligible population (VEP) turnout rate, differs across state categories (Red, Blue, Purple) and election types.

## Hypotheses
1. **ANOVA**: Compare mean voter turnout across Red, Blue, and Purple states.
2. **T-test**: Compare mean voter turnout between Red and Blue states.
3. **Chi-square**: Test for independence between voter turnout levels and state categories/election types.

In [None]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Set plot style
sns.set_theme(style="whitegrid")

## Load Data

In [None]:
# Load the dataset
df = pd.read_csv('../../data/processed-data/voter_turnout_selected_states.csv')

# Display first few rows
df.head()
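Before running any tests, it can help to confirm that the loaded frame has the columns the later cells rely on. The following is a minimal sketch, assuming the column names used elsewhere in this notebook; `check_turnout_frame`, `REQUIRED_COLUMNS`, and the tiny `demo` frame (including its `election_type` value) are hypothetical and made up for illustration.

```python
import pandas as pd

# Columns referenced by later cells in this notebook.
REQUIRED_COLUMNS = ["state", "year", "state_color", "election_type", "vep_turnout_rate"]


def check_turnout_frame(frame):
    """Return a list of human-readable problems found in the frame (empty if none)."""
    problems = []
    missing = [c for c in REQUIRED_COLUMNS if c not in frame.columns]
    if missing:
        # Without the required columns the remaining checks cannot run.
        problems.append(f"missing columns: {missing}")
        return problems
    n_missing = int(frame["vep_turnout_rate"].isna().sum())
    if n_missing:
        problems.append(f"{n_missing} rows have no vep_turnout_rate")
    unexpected = set(frame["state_color"].dropna().unique()) - {"red", "blue", "purple"}
    if unexpected:
        problems.append(f"unexpected state_color values: {sorted(unexpected)}")
    return problems


# Made-up frame with one missing rate and one unexpected color label.
demo = pd.DataFrame({
    "state": ["A", "B", "C"],
    "year": [2020, 2020, 2020],
    "state_color": ["red", "blue", "green"],
    "election_type": ["presidential"] * 3,
    "vep_turnout_rate": [0.61, None, 0.70],
})
demo_problems = check_turnout_frame(demo)
print(demo_problems)
```

In the notebook itself, the same helper could be called as `check_turnout_frame(df)` right after loading.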

## 1. ANOVA: Mean Comparison on Voter Turnouts (Red, Blue, Purple)

**Null Hypothesis ($H_0$):** The mean VEP turnout rate is the same for Red, Blue, and Purple states.

**Alternative Hypothesis ($H_1$):** At least one group's mean VEP turnout rate differs from the others.
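One-way ANOVA assumes roughly equal group variances and approximately normal residuals, and a p-value alone says nothing about the size of the difference. The sketch below shows one possible set of companion checks: Levene's test for equal variances, the Kruskal-Wallis test as a rank-based alternative, and eta squared as an effect size. The helper `anova_diagnostics` and the synthetic arrays are hypothetical and are not part of the original analysis.

```python
import numpy as np
from scipy import stats


def anova_diagnostics(*groups):
    """Return Levene and Kruskal-Wallis p-values plus eta squared for the groups."""
    cleaned = []
    for g in groups:
        arr = np.asarray(g, dtype=float)
        cleaned.append(arr[~np.isnan(arr)])  # drop missing values

    # Levene's test (median-centered): H0 is that all groups share one variance.
    _, levene_p = stats.levene(*cleaned, center="median")
    # Kruskal-Wallis: rank-based alternative that does not assume normality.
    _, kruskal_p = stats.kruskal(*cleaned)

    # Eta squared = between-group sum of squares / total sum of squares.
    all_values = np.concatenate(cleaned)
    grand_mean = all_values.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in cleaned)
    ss_total = ((all_values - grand_mean) ** 2).sum()
    return {
        "levene_p": float(levene_p),
        "kruskal_p": float(kruskal_p),
        "eta_squared": float(ss_between / ss_total),
    }


# Synthetic turnout rates, made up purely for illustration.
rng = np.random.default_rng(0)
demo_red = rng.normal(0.55, 0.05, 40)
demo_blue = rng.normal(0.60, 0.05, 40)
demo_purple = rng.normal(0.62, 0.05, 40)
diag = anova_diagnostics(demo_red, demo_blue, demo_purple)
print(diag)
```

With the notebook's data, the call would be `anova_diagnostics(red_turnout, blue_turnout, purple_turnout)` once those series exist.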

In [None]:
# Pool all election years and inspect the distribution by state color first.
# Caveat: pooling treats observations from the same state in different years as
# independent, which is a strong assumption for repeated (time-series) measurements.
# Filtering to a single year (e.g., 2020) would give a snapshot comparison instead.

plt.figure(figsize=(10, 6))
sns.boxplot(x='state_color', y='vep_turnout_rate', data=df, palette={'red': 'red', 'blue': 'blue', 'purple': 'purple'})
plt.title('Voter Turnout Rate by State Color')
plt.show()

# Perform One-Way ANOVA
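# Drop missing rates so NaN values do not propagate into the test statistics
red_turnout = df.loc[df['state_color'] == 'red', 'vep_turnout_rate'].dropna()
blue_turnout = df.loc[df['state_color'] == 'blue', 'vep_turnout_rate'].dropna()
purple_turnout = df.loc[df['state_color'] == 'purple', 'vep_turnout_rate'].dropna()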

f_stat, p_value = stats.f_oneway(red_turnout, blue_turnout, purple_turnout)

print(f"ANOVA F-statistic: {f_stat:.4f}")
print(f"ANOVA p-value: {p_value:.4e}")

if p_value < 0.05:
    print("Reject the Null Hypothesis: There is a significant difference in mean voter turnout between the state groups.")
else:
    print("Fail to reject the Null Hypothesis: No significant difference found.")
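The pooled ANOVA above treats every state-year row as an independent observation. One way to sidestep that assumption is to run the same test separately within each election year. The following is a sketch only: `anova_by_year` is a hypothetical helper, the column names are the ones this notebook already uses, and the demo frame is synthetic.

```python
import numpy as np
import pandas as pd
from scipy import stats


def anova_by_year(frame, min_per_group=2):
    """Run a one-way ANOVA across state colors separately for each year."""
    rows = []
    for year, year_df in frame.groupby("year"):
        groups = [
            g["vep_turnout_rate"].dropna().to_numpy()
            for _, g in year_df.groupby("state_color")
        ]
        # Skip years without at least two groups of usable size.
        if len(groups) < 2 or any(len(g) < min_per_group for g in groups):
            continue
        f_stat, p_value = stats.f_oneway(*groups)
        rows.append({
            "year": year,
            "n": sum(len(g) for g in groups),
            "f_stat": float(f_stat),
            "p_value": float(p_value),
        })
    return pd.DataFrame(rows, columns=["year", "n", "f_stat", "p_value"])


# Synthetic data: two years, three colors, five made-up states per color.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "year": np.repeat([2016, 2020], 15),
    "state_color": np.tile(np.repeat(["red", "blue", "purple"], 5), 2),
    "vep_turnout_rate": rng.normal(0.6, 0.05, 30),
})
demo_results = anova_by_year(demo)
print(demo_results)
```

Running several per-year tests raises the chance of a false positive somewhere, so the per-year p-values are best read as descriptive unless a multiple-comparison adjustment is applied.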

## 2. T-test: Red States vs Blue States

**Null Hypothesis ($H_0$):** The mean VEP turnout rate is the same for Red and Blue states.

**Alternative Hypothesis ($H_1$):** The mean VEP turnout rate is different.

In [None]:
# T-test
# Welch's t-test: equal_var=False means the two groups are not assumed to share a variance
t_stat, p_val_ttest = stats.ttest_ind(red_turnout, blue_turnout, equal_var=False)

print(f"T-test statistic: {t_stat:.4f}")
print(f"T-test p-value: {p_val_ttest:.4e}")

if p_val_ttest < 0.05:
    print("Reject the Null Hypothesis: There is a significant difference in mean voter turnout between Red and Blue states.")
else:
    print("Fail to reject the Null Hypothesis: No significant difference found.")
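Red versus Blue is one of three possible pairwise comparisons among the state groups. If all three pairs are tested, the p-values can be adjusted for multiple comparisons. Below is a minimal sketch using Welch's t-test with a Bonferroni correction; the helper `pairwise_welch` and the synthetic groups are hypothetical, and Bonferroni is only one of several possible adjustments.

```python
from itertools import combinations

import numpy as np
from scipy import stats


def pairwise_welch(groups, alpha=0.05):
    """Welch's t-test for every pair in `groups` (dict: name -> values),
    with Bonferroni-adjusted p-values."""
    cleaned = {}
    for name, values in groups.items():
        arr = np.asarray(values, dtype=float)
        cleaned[name] = arr[~np.isnan(arr)]  # drop missing values

    pairs = list(combinations(cleaned, 2))
    results = []
    for a, b in pairs:
        t_stat, p_raw = stats.ttest_ind(cleaned[a], cleaned[b], equal_var=False)
        # Bonferroni: multiply by the number of comparisons, cap at 1.
        p_adj = min(1.0, float(p_raw) * len(pairs))
        results.append({
            "pair": (a, b),
            "t_stat": float(t_stat),
            "p_raw": float(p_raw),
            "p_bonferroni": p_adj,
            "reject": bool(p_adj < alpha),
        })
    return results


# Synthetic turnout rates, made up purely for illustration.
rng = np.random.default_rng(1)
demo_groups = {
    "red": rng.normal(0.55, 0.05, 30),
    "blue": rng.normal(0.60, 0.05, 30),
    "purple": rng.normal(0.62, 0.05, 30),
}
demo_pairs = pairwise_welch(demo_groups)
for row in demo_pairs:
    print(row["pair"], f"p_bonferroni={row['p_bonferroni']:.4f}", row["reject"])
```

With the notebook's series, the input would be `{"red": red_turnout, "blue": blue_turnout, "purple": purple_turnout}`.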

## 3. Chi-square Test

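We test for independence between **Turnout Level** (High/Low) and each of **State Color** and **Election Type**.

**Null Hypothesis ($H_0$):** Turnout level is independent of the grouping variable (state color in Test 3a, election type in Test 3b).

**Alternative Hypothesis ($H_1$):** Turnout level depends on the grouping variable.

First, we split `vep_turnout_rate` at its median: values at or above the median are labeled 'High', and the rest are labeled 'Low'.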

In [None]:
# Create a binary variable for turnout
median_turnout = df['vep_turnout_rate'].median()
df['turnout_level'] = np.where(df['vep_turnout_rate'] >= median_turnout, 'High', 'Low')

print(f"Median Turnout Rate: {median_turnout:.4f}")
df[['state', 'year', 'vep_turnout_rate', 'turnout_level']].head()

### Test 3a: Turnout Level vs State Color

In [None]:
# Contingency Table
contingency_state = pd.crosstab(df['state_color'], df['turnout_level'])
print("Contingency Table (State Color vs Turnout Level):")
print(contingency_state)

# Chi-square test
chi2, p, dof, expected = stats.chi2_contingency(contingency_state)

print(f"\nChi-square statistic: {chi2:.4f}")
print(f"p-value: {p:.4e}")

if p < 0.05:
    print("Reject the Null Hypothesis: Turnout Level and State Color are dependent.")
else:
    print("Fail to reject the Null Hypothesis: Independence cannot be rejected.")

# Visualization
plt.figure(figsize=(8, 6))
sns.heatmap(contingency_state, annot=True, fmt='d', cmap='YlGnBu')
plt.title('Heatmap of State Color vs Turnout Level')
plt.show()
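The chi-square approximation is usually considered reliable only when expected cell counts are not too small (a common rule of thumb is at least 5), and the statistic itself does not convey the strength of association. The sketch below reports the smallest expected count and Cramér's V alongside the test. The helper `chi_square_summary` and the demo counts are hypothetical.

```python
import numpy as np
from scipy import stats


def chi_square_summary(table):
    """Chi-square test of independence plus min expected count and Cramér's V."""
    observed = np.asarray(table, dtype=float)
    # correction=False so that Cramér's V is based on the uncorrected statistic.
    chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)
    n = observed.sum()
    k = min(observed.shape) - 1  # smaller table dimension minus one
    cramers_v = np.sqrt(chi2 / (n * k))
    return {
        "chi2": float(chi2),
        "p": float(p),
        "dof": int(dof),
        "min_expected": float(expected.min()),
        "cramers_v": float(cramers_v),
    }


# Made-up 3x2 table (rows: state colors, columns: High/Low), not real results.
demo_table = np.array([[30, 10],
                       [10, 30],
                       [20, 20]])
demo_summary = chi_square_summary(demo_table)
print(demo_summary)
```

The function also accepts the output of `pd.crosstab`, so `chi_square_summary(contingency_state)` would work in this notebook.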

### Test 3b: Turnout Level vs Election Type

In [None]:
# Contingency Table
contingency_election = pd.crosstab(df['election_type'], df['turnout_level'])
print("Contingency Table (Election Type vs Turnout Level):")
print(contingency_election)

# Chi-square test (SciPy applies Yates' continuity correction by default when the table is 2x2)
chi2_elec, p_elec, dof_elec, expected_elec = stats.chi2_contingency(contingency_election)

print(f"\nChi-square statistic: {chi2_elec:.4f}")
print(f"p-value: {p_elec:.4e}")

if p_elec < 0.05:
    print("Reject the Null Hypothesis: Turnout Level and Election Type are dependent.")
else:
    print("Fail to reject the Null Hypothesis: Independence cannot be rejected.")

# Visualization
contingency_election.plot(kind='bar', stacked=True, figsize=(8, 6))
plt.title('Turnout Level by Election Type')
plt.ylabel('Count')
plt.show()
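If `election_type` has exactly two levels (an assumption, since the data are not shown here), the contingency table is 2x2. In that case SciPy's `chi2_contingency` applies Yates' continuity correction by default, and Fisher's exact test is a common alternative for small counts. The sketch below compares the three p-values; `compare_2x2_tests` and the demo counts are hypothetical.

```python
import numpy as np
from scipy import stats


def compare_2x2_tests(table):
    """Return p-values from three tests of independence on a 2x2 table."""
    observed = np.asarray(table)
    if observed.shape != (2, 2):
        raise ValueError("expected a 2x2 contingency table")
    _, p_yates, _, _ = stats.chi2_contingency(observed, correction=True)
    _, p_plain, _, _ = stats.chi2_contingency(observed, correction=False)
    _, p_fisher = stats.fisher_exact(observed)  # two-sided by default
    return {
        "chi2_yates_p": float(p_yates),
        "chi2_uncorrected_p": float(p_plain),
        "fisher_exact_p": float(p_fisher),
    }


# Made-up 2x2 counts (rows: two election types, columns: High/Low), not real results.
demo_2x2 = np.array([[12, 5],
                     [4, 11]])
demo_pvalues = compare_2x2_tests(demo_2x2)
print(demo_pvalues)
```

The Yates-corrected p-value is never smaller than the uncorrected one, so the default is the more conservative of the two chi-square variants.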