# Comparing the Population Proportions with Chi-Square

## Introduction

Recently, I've been reading about the chi-square distribution and the many statistical analyses you can perform with it. One of the first analyses we can run is a test of whether two or more population proportions are equal. This is useful when someone claims that there is no difference between the levels of a categorical variable (e.g., permit zone) or claims that there *is* a difference between them (e.g., day range, month, or day).

This notebook contains two analyses comparing the population proportions for awarded entries.

The first analysis looks at the **odds of being awarded a permit between the different zones** and whether the proportions between the zones are equal (spoiler, they are not).

The second analysis explores the claim on the National Forest website that:

>If you really want to do a Friday-Sunday trip in mid-August, by all means apply for that trip, but remember that you’re odds of getting a permit will be less than if you tried for a Monday-Wednesday trip in July.
>
> *"How can I improve my chances of getting a permit?"*, [Forest Service](https://www.fs.usda.gov/detail/okawen/passes-permits/recreation/?cid=fsbdev3_053607)

Consequently, I compared the **proportion of awarded entries for Friday-Sunday trips in mid-August with the proportion of awarded entries for Monday-Wednesday trips in July**.

In [1]:
# Import the data
import pandas as pd
import numpy as np

# Define the data types for specific columns
# preferred_zone                           object
# preferred_entry_date             datetime64[ns]
# minimum_acceptable_group_size             int64
# results_status                           object
# awarded_preference                        int64
# awarded_entry_date               datetime64[ns]
# awarded_entrance_code_name               object
# awarded_group_size                        int64
# processing_sequence                       int64
# state                                    object
# year                                      int64
# awarded                                    bool
# preferred_option                          int64
# preferred_entry_date_month               object
# preferred_entry_date_day                 object

dtype_dict = {
    "preferred_zone": "category",
    "minimum_acceptable_group_size": "int64",
    "results_status": "category",
    "awarded_preference": "int64",
    "awarded_entrance_code_name": "category",
    "awarded_group_size": "int64",
    "processing_sequence": "int64",
    "state": "category",
    "year": "int64",
    "awarded": "bool",
    "preferred_option": "int64",
    "preferred_entry_date_month": "category",
    "preferred_entry_date_day": "category",
}

# Import the combined_results_split_actual.csv file
df = pd.read_csv('combined_results_split_actual.csv',
    # Import was failing to parse date columns, so I had
    # to add the column names
    parse_dates=[
        "preferred_entry_date",
        "awarded_entry_date",
    ],
    date_format="%m-%d-%Y",  # Align format with export format
    dtype=dtype_dict,  # Specify data types for columns
)

df

| | preferred_zone | preferred_entry_date | minimum_acceptable_group_size | results_status | awarded_preference | awarded_entry_date | awarded_entrance_code_name | awarded_group_size | processing_sequence | state | year | awarded | preferred_option | preferred_entry_date_month | preferred_entry_date_day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Colchuck Zone | 2020-06-26 | 0 | Unsuccessful | 0 | 1970-01-01 | | 0 | 0 | | 2020 | False | 1 | June | Friday |
| 1 | Core Enchantment Zone | 2020-08-01 | 0 | Unsuccessful | 0 | 1970-01-01 | | 0 | 0 | | 2020 | False | 1 | August | Saturday |
| 2 | Core Enchantment Zone | 2020-09-19 | 0 | Unsuccessful | 0 | 1970-01-01 | | 0 | 0 | | 2020 | False | 1 | September | Saturday |
| 3 | Core Enchantment Zone | 2020-08-22 | 0 | Unsuccessful | 0 | 1970-01-01 | | 0 | 0 | | 2020 | False | 1 | August | Saturday |
| 4 | Snow Zone | 2020-07-17 | 0 | Awarded | 1 | 2020-07-17 | Snow Zone | 6 | 0 | | 2020 | True | 1 | July | Friday |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 406562 | Core Enchantment Zone | 2023-08-11 | 4 | Unsuccessful | 0 | 1970-01-01 | | 0 | 36252 | WA | 2023 | False | 3 | August | Friday |
| 406563 | Colchuck Zone | 2023-07-24 | 2 | Awarded | 1 | 2023-05-31 | Core Enchantment Zone | 4 | 1504 | WA | 2023 | False | 3 | July | Monday |
| 406564 | Core Enchantment Zone | 2023-07-27 | 5 | Unsuccessful | 0 | 1970-01-01 | | 0 | 33420 | CA | 2023 | False | 3 | July | Thursday |
| 406565 | Core Enchantment Zone | 2023-05-21 | 8 | Unsuccessful | 0 | 1970-01-01 | | 0 | 28321 | NY | 2023 | False | 3 | May | Sunday |


## Compare the proportions of awarded permits between the different zones

In [2]:
# Create a filter for the entries that were awarded their desired zone
awarded_desired_zone = df["preferred_zone"] == df["awarded_entrance_code_name"]
# Create a filter for the entries that were awarded their preferred option
awarded_preferred_option = df["preferred_option"] == df["awarded_preference"]

# Filter for the winners
awarded = df[awarded_desired_zone & awarded_preferred_option]
# Filter for the losers
not_awarded = df[~awarded_desired_zone | ~awarded_preferred_option]

# Create a crosstab of awarded and not awarded entries by zone
zone_crosstab = pd.crosstab(df["preferred_zone"], df["awarded"], margins=True)

zone_crosstab

| preferred_zone | awarded = False | awarded = True | All |
|---|---|---|---|
| Colchuck Zone | 62461 | 1403 | 63864 |
| Core Enchantment Zone | 260327 | 2875 | 263202 |
| Eightmile/Caroline Zone | 7824 | 1112 | 8936 |
| Eightmile/Caroline Zone (stock) | 927 | 147 | 1074 |
| Snow Zone | 42171 | 2544 | 44715 |
| Stuart Zone | 22242 | 1865 | 24107 |
| Stuart Zone (stock) | 600 | 69 | 669 |
| All | 396552 | 10015 | 406567 |


Looking at the above data, we can intuit that the proportions are not going to be equal for the different zones.
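
To put numbers on that intuition, the per-zone award rates can be computed directly from the crosstab. The sketch below is a minimal, standalone illustration: the counts are copied by hand from the table above, and `zone_counts`, `award_rate`, and `zone_rates` are names introduced here rather than part of the notebook.

```python
# (not awarded, awarded) counts copied from the crosstab above
zone_counts = {
    "Colchuck Zone": (62461, 1403),
    "Core Enchantment Zone": (260327, 2875),
    "Eightmile/Caroline Zone": (7824, 1112),
    "Eightmile/Caroline Zone (stock)": (927, 147),
    "Snow Zone": (42171, 2544),
    "Stuart Zone": (22242, 1865),
    "Stuart Zone (stock)": (600, 69),
}


def award_rate(not_awarded, awarded):
    """Share of entries in a zone that were awarded a permit."""
    return awarded / (not_awarded + awarded)


zone_rates = {zone: award_rate(*counts) for zone, counts in zone_counts.items()}

# Print the zones from the lowest award rate to the highest
for zone, rate in sorted(zone_rates.items(), key=lambda item: item[1]):
    print(f"{zone:<34}{rate:.2%}")
```

If the rates were truly equal, every zone would sit near the pooled rate; the spread in the printed values is what the chi-square test below quantifies.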

### Step 1: Identify the null and alternative hypotheses

**H0**: The proportion of awarded entries is the same for every permit zone

**H1**: At least one permit zone has a different proportion of awarded entries

### Step 2: Calculate the expected frequencies

In [3]:
# Calculate overall proportion
overall_proportion = awarded["preferred_zone"].count() / df["preferred_zone"].count()

# Print the raw numbers for the calculation
print("Raw numbers for the calculation:")
print(f"Total entries: {df['preferred_zone'].count()}")
print(f"Total awarded entries: {awarded['preferred_zone'].count()}")

print(f"Pooled proportion: {overall_proportion}")

Raw numbers for the calculation:
Total entries: 406567
Total awarded entries: 10015
Pooled proportion: 0.02463308630557817


In [4]:
# For each zone, count the awarded entries (summing the boolean "awarded" column)
chi_squ_chart = df[["awarded", "preferred_zone"]].groupby("preferred_zone", observed=True).sum()

# Add the total number of entries for each zone
chi_squ_chart["total"] = df["preferred_zone"].value_counts()

# Calculate the expected number of awarded entries for each zone
chi_squ_chart["expected"] = round(chi_squ_chart["total"] * overall_proportion).astype(int)

chi_squ_chart

| preferred_zone | awarded | total | expected |
|---|---|---|---|
| Colchuck Zone | 1403 | 63864 | 1573 |
| Core Enchantment Zone | 2875 | 263202 | 6483 |
| Eightmile/Caroline Zone | 1112 | 8936 | 220 |
| Eightmile/Caroline Zone (stock) | 147 | 1074 | 26 |
| Snow Zone | 2544 | 44715 | 1101 |
| Stuart Zone | 1865 | 24107 | 594 |
| Stuart Zone (stock) | 69 | 669 | 16 |
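
Written out, the expected counts in this table come from applying the pooled proportion to each zone's total. The notation below ($n_i$ for a zone's total entries, $O_i$ for its awarded entries, $\hat{p}$ for the pooled proportion, $E_i$ for the expected awarded count) is introduced here for illustration and does not appear in the notebook.

```latex
\[
\hat{p} = \frac{\sum_i O_i}{\sum_i n_i} = \frac{10015}{406567} \approx 0.024633,
\qquad
E_i = n_i \, \hat{p}
\]
\[
E_{\text{Colchuck}} = 63864 \times 0.024633 \approx 1573
\]
```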


### Step 3: Calculate the chi-squared test statistic

In [5]:
# Calculate the difference between the expected and awarded entries and square it
chi_squ_chart["diff_squ"] = (chi_squ_chart["awarded"] - chi_squ_chart["expected"]).pow(2)

chi_squ_chart

| preferred_zone | awarded | total | expected | diff_squ |
|---|---|---|---|---|
| Colchuck Zone | 1403 | 63864 | 1573 | 28900 |
| Core Enchantment Zone | 2875 | 263202 | 6483 | 13017664 |
| Eightmile/Caroline Zone | 1112 | 8936 | 220 | 795664 |
| Eightmile/Caroline Zone (stock) | 147 | 1074 | 26 | 14641 |
| Snow Zone | 2544 | 44715 | 1101 | 2082249 |
| Stuart Zone | 1865 | 24107 | 594 | 1615441 |
| Stuart Zone (stock) | 69 | 669 | 16 | 2809 |


In [6]:
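# Calculate the test statistic by summing the squared differences and dividing by the total expected number of awarded entries.
# Note: the standard Pearson statistic instead divides each squared difference by its own expected count before summing.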
chi_squ_test_stat = chi_squ_chart["diff_squ"].sum() / chi_squ_chart["expected"].sum()

chi_squ_test_stat

1753.4573055028463
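
For comparison, the textbook Pearson statistic sums (observed - expected)² / expected over every cell of the table, including the not-awarded column. The following is a minimal pure-Python sketch under that definition; the counts are copied from the crosstab earlier in the notebook, and `observed` and `pearson_stat` are names introduced here. SciPy's `scipy.stats.chi2_contingency` performs the same kind of calculation and could serve as a cross-check.

```python
# (not awarded, awarded) counts per zone, copied from the crosstab
observed = [
    (62461, 1403),   # Colchuck Zone
    (260327, 2875),  # Core Enchantment Zone
    (7824, 1112),    # Eightmile/Caroline Zone
    (927, 147),      # Eightmile/Caroline Zone (stock)
    (42171, 2544),   # Snow Zone
    (22242, 1865),   # Stuart Zone
    (600, 69),       # Stuart Zone (stock)
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Pearson chi-square: each cell contributes (O - E)^2 / E,
# where E = row total * column total / grand total
pearson_stat = 0.0
for row, row_total in zip(observed, row_totals):
    for obs, col_total in zip(row, col_totals):
        expected = row_total * col_total / grand_total
        pearson_stat += (obs - expected) ** 2 / expected

# Degrees of freedom for an r x c table: (r - 1) * (c - 1)
dof = (len(observed) - 1) * (len(col_totals) - 1)

print(f"Pearson chi-square: {pearson_stat:.2f} with {dof} degrees of freedom")
```

Under this definition the statistic is also far above the critical value for 6 degrees of freedom, so the decision in the following steps does not depend on which form is used.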

### Step 4: Determine chi-square critical value

In [7]:
# Determine the chi-square critical value with df = 6 (7 zones - 1) and alpha = 0.10 using scipy.stats
from scipy.stats import chi2

chi_squ_crit_val = chi2.ppf(0.90, 6)

chi_squ_crit_val

10.644640675668422

In [8]:
# Calculate the p-value using the chi-square test
chi_squ_p_value = 1 - chi2.cdf(chi_squ_test_stat, 6)

chi_squ_p_value

0.0
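
A p-value of exactly `0.0` is a floating-point artifact: the true tail probability is positive but too small for a 64-bit float. One way to see its magnitude is to work on the log scale. The sketch below uses the closed form of the chi-square survival function for an even number of degrees of freedom; `chi2_log_sf_even_df` is a hypothetical helper written for this illustration, not a SciPy function.

```python
import math


def chi2_log_sf_even_df(x, df):
    """Natural log of P(X > x) for a chi-square variable with even df.

    For df = 2m: P(X > x) = exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only applies to positive even df")
    half = x / 2
    series = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return -half + math.log(series)


test_stat = 1753.4573055028463  # test statistic reported above
log10_p = chi2_log_sf_even_df(test_stat, 6) / math.log(10)

print(f"log10(p-value) is roughly {log10_p:.0f}")
```

The exponent is far below what a double can represent, which is why `1 - chi2.cdf(...)` rounds to zero.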

### Step 5: Compare the Test Statistic to the Critical Value

Because the test statistic is greater than the critical value, we reject the null hypothesis that the proportion of awarded entries is the same for every zone.

### Step 6: State your conclusions

Because the test statistic is so much greater than the critical value and the p-value is essentially zero, we can say that the proportion of awarded permits is not the same across all the permit zones.

## Compare the Population Proportions by Day of the Week and Month

>If you really want to do a Friday-Sunday trip in mid-August, by all means apply for that trip, but remember that you’re odds of getting a permit will be less than if you tried for a Monday-Wednesday trip in July.
>
> *"How can I improve my chances of getting a permit?"*, [Forest Service](https://www.fs.usda.gov/detail/okawen/passes-permits/recreation/?cid=fsbdev3_053607)

The website states that the odds are lower, but it doesn't offer data to back the claim up. Let's test it by comparing the two populations:
1. Friday-Sunday trip in mid-August
2. Monday-Wednesday trip in July


### Step 1: State the null and alternative hypothesis and set a level for alpha

**H0**: The proportions of awarded entries for the two date ranges are equal

**H1**: The proportions of awarded entries for the two date ranges are not equal

**alpha** = 0.10

### Step 2: Calculate the expected frequencies for the two populations

In [9]:
# Get the entries for Friday-Sunday trips in mid-August
friday_sunday_entries = df[
    (df["preferred_entry_date"].dt.dayofweek >= 4) # Friday
    & (df["preferred_entry_date"].dt.dayofweek <= 6) # Sunday
    & (df["preferred_entry_date"].dt.month == 8) # August
    & (df["preferred_entry_date"].dt.day >= 10) # Mid-August
    & (df["preferred_entry_date"].dt.day <= 20) # Mid-August
]

# Get the entries for Monday-Wednesday trips in July
monday_wednesday_entries = df[
    (df["preferred_entry_date"].dt.dayofweek >= 0) # Monday
    & (df["preferred_entry_date"].dt.dayofweek <= 2) # Wednesday
    & (df["preferred_entry_date"].dt.month == 7) # July
]

# Print the number of friday-sunday entries
friday_sunday_entries_count = len(friday_sunday_entries)

print(f"Total Fri-Sun Entries in Mid August: {friday_sunday_entries_count}")

# Print the number of monday-wednesday entries
monday_wednesday_entries_count = len(monday_wednesday_entries)

print(f"Total Mon-Wed Entries in July: {monday_wednesday_entries_count}")

Total Fri-Sun Entries in Mid August: 26948
Total Mon-Wed Entries in July: 37497
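
The two filters rely on pandas' `dayofweek` convention (Monday = 0 through Sunday = 6), which matches `date.weekday()` in the standard library. The sketch below restates the same selection rules for a single date so they are easy to spot-check; `classify_entry_date` is a hypothetical helper, and treating "mid-August" as the 10th through the 20th follows the filter above.

```python
from datetime import date


def classify_entry_date(entry_date):
    """Return the comparison group for a preferred entry date, or None."""
    weekday = entry_date.weekday()  # Monday = 0 ... Sunday = 6
    # Friday-Sunday in mid-August (days 10 through 20, as in the filter above)
    if entry_date.month == 8 and 10 <= entry_date.day <= 20 and 4 <= weekday <= 6:
        return "Friday-Sunday-Mid-August"
    # Monday-Wednesday on any day in July
    if entry_date.month == 7 and 0 <= weekday <= 2:
        return "Monday-Wednesday-July"
    return None


# Dates taken from the sample rows shown at the top of the notebook
print(classify_entry_date(date(2023, 8, 11)))  # a Friday in mid-August
print(classify_entry_date(date(2023, 7, 24)))  # a Monday in July
print(classify_entry_date(date(2020, 8, 22)))  # a Saturday, but after the 20th
```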


In [10]:
# Get the total awarded entries for Friday-Sunday trips in mid-August
friday_sunday_awarded = friday_sunday_entries["awarded"].sum()

# Get the total awarded entries for Monday-Wednesday trips in July
monday_wednesday_awarded = monday_wednesday_entries["awarded"].sum()

# Print the number of awarded entries for Friday-Sunday trips in mid-August
print(f"Total Awarded Fri-Sun Entries in Mid August: {friday_sunday_awarded}")

# Print the number of awarded entries for Monday-Wednesday trips in July
print(f"Total Awarded Mon-Wed Entries in July: {monday_wednesday_awarded}")

Total Awarded Fri-Sun Entries in Mid August: 317
Total Awarded Mon-Wed Entries in July: 770


In [11]:
# Create chart for chi-square test
chi_squ_chart = pd.DataFrame(
    {
        "Entries": [friday_sunday_entries_count, monday_wednesday_entries_count],
        "Awarded": [friday_sunday_awarded, monday_wednesday_awarded],
    },
    index=["Friday-Sunday-Mid-August", "Monday-Wednesday-July"],
)

chi_squ_chart

| | Entries | Awarded |
|---|---|---|
| Friday-Sunday-Mid-August | 26948 | 317 |
| Monday-Wednesday-July | 37497 | 770 |


In [12]:
# Calculate the expected number of awarded entries for each group
chi_squ_chart["expected"] = (chi_squ_chart["Awarded"].sum() / chi_squ_chart["Entries"].sum() * chi_squ_chart["Entries"]).round().astype(int)    

chi_squ_chart

| | Entries | Awarded | expected |
|---|---|---|---|
| Friday-Sunday-Mid-August | 26948 | 317 | 455 |
| Monday-Wednesday-July | 37497 | 770 | 632 |


In [13]:
# Create a column for the observed proportion of awarded entries
chi_squ_chart["obs_odds"] = chi_squ_chart["Awarded"] / chi_squ_chart["Entries"]

# Create a column for the expected proportion of awarded entries
chi_squ_chart["exp_odds"] = chi_squ_chart["expected"] / chi_squ_chart["Entries"]

# Calculate the difference between the observed and expected awarded counts and square it
chi_squ_chart["diff_squ"] = (chi_squ_chart["Awarded"] - chi_squ_chart["expected"]).pow(2)

chi_squ_chart

| | Entries | Awarded | expected | obs_odds | exp_odds | diff_squ |
|---|---|---|---|---|---|---|
| Friday-Sunday-Mid-August | 26948 | 317 | 455 | 0.011763 | 0.016884 | 19044 |
| Monday-Wednesday-July | 37497 | 770 | 632 | 0.020535 | 0.016855 | 19044 |


### Step 3: Calculate the Chi-Square Test Stat

In [14]:
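# Calculate the test statistic by summing the squared differences and dividing by the total expected number of awarded entries.
# Note: the standard Pearson statistic instead divides each squared difference by its own expected count before summing.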
chi_squ_test_stat_dow = chi_squ_chart["diff_squ"].sum() / chi_squ_chart["expected"].sum()

chi_squ_test_stat_dow

35.039558417663294
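
As a cross-check, the textbook Pearson statistic for this 2 x 2 table can be computed over all four cells (awarded and not awarded for each date range). This is a minimal pure-Python sketch with no continuity correction; the counts come from the outputs above, and `observed_dow`, `pearson_stat_dow`, and `p_value_dow` are names introduced here. With one degree of freedom the p-value has a closed form through the complementary error function.

```python
import math

# (not awarded, awarded) counts derived from the outputs above
observed_dow = {
    "Friday-Sunday-Mid-August": (26948 - 317, 317),
    "Monday-Wednesday-July": (37497 - 770, 770),
}

rows = list(observed_dow.values())
row_totals = [sum(row) for row in rows]
col_totals = [sum(col) for col in zip(*rows)]
grand_total = sum(row_totals)

# Each cell contributes (O - E)^2 / E, with E = row total * column total / grand total
pearson_stat_dow = 0.0
for row, row_total in zip(rows, row_totals):
    for obs, col_total in zip(row, col_totals):
        expected = row_total * col_total / grand_total
        pearson_stat_dow += (obs - expected) ** 2 / expected

# For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2))
p_value_dow = math.erfc(math.sqrt(pearson_stat_dow / 2))

print(f"Pearson chi-square: {pearson_stat_dow:.2f}")
print(f"p-value: {p_value_dow:.3g}")
```

This form also lands well above the critical value for alpha = 0.10, so the decision to reject the null hypothesis is the same either way.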

### Step 4: Calculate the Chi-Square Critical Value

In [15]:
# Import the chi-square distribution from scipy.stats
from scipy.stats import chi2

# Determine chi-square critical value with df = 1 and alpha = 0.10
chi_squ_crit_val_dow = chi2.ppf(0.90, 1)

print(f"Chi-square critical value: {chi_squ_crit_val_dow}")

Chi-square critical value: 2.705543454095404


In [16]:
# Calculate the p-value using the chi-square test
chi_squ_p_value_dow = 1 - chi2.cdf(chi_squ_test_stat_dow, 1)

print(f"Chi-square p-value: {chi_squ_p_value_dow}")

Chi-square p-value: 3.230747447346971e-09


### Step 5: Compare the Test Statistic to the Critical Value

The test statistic is well above the critical value, and the p-value is essentially zero. Therefore, we reject the null hypothesis.

### Step 6: State Conclusions

By rejecting the null hypothesis, we have enough evidence to conclude that the proportion of entries awarded a permit for Fri-Sun trips in mid-August is not equal to the proportion awarded for Mon-Wed trips in July.

The observed proportions (about 1.2% for Fri-Sun in mid-August versus about 2.1% for Mon-Wed in July) point in the same direction as the claim made on the National Forest website: an applicant's odds of being awarded a permit are better on a Mon-Wed in July than on a Fri-Sun in mid-August.
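
The chi-square test only says the two proportions differ; the direction comes from the proportions themselves. One way to test the directional claim is a one-sided two-proportion z-test, sketched below with the counts reported above. This is an illustrative alternative rather than the method used in this notebook, and all variable names are introduced here.

```python
import math

# Entry and award counts reported earlier in the notebook
n_fri_sun, awarded_fri_sun = 26948, 317
n_mon_wed, awarded_mon_wed = 37497, 770

p_fri_sun = awarded_fri_sun / n_fri_sun
p_mon_wed = awarded_mon_wed / n_mon_wed
p_pooled = (awarded_fri_sun + awarded_mon_wed) / (n_fri_sun + n_mon_wed)

# Standard error of the difference under H0 (equal proportions)
std_err = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_fri_sun + 1 / n_mon_wed))
z_score = (p_fri_sun - p_mon_wed) / std_err

# One-sided p-value for H1: p_fri_sun < p_mon_wed (lower tail of the standard normal)
one_sided_p = 0.5 * math.erfc(-z_score / math.sqrt(2))

# How many times more likely a Mon-Wed July entry was to be awarded
rate_ratio = p_mon_wed / p_fri_sun

print(f"z = {z_score:.2f}, one-sided p = {one_sided_p:.3g}, rate ratio = {rate_ratio:.2f}")
```

A negative z-score means the Fri-Sun mid-August award rate is the lower of the two, which is the direction the Forest Service describes.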