Following up on observations from the Plotly Dash app, this notebook runs chi-square tests to determine the statistical significance of the under- and overrepresentation of different races among AP test takers.

In [1]:
# Import required libraries: pandas for loading and parsing the data,
# NumPy's nan as a placeholder for not-yet-computed p-values,
# and SciPy's chi2_contingency for significance testing
from numpy import nan
import pandas as pd
from scipy.stats import chi2_contingency

# Load dataframe
df = pd.read_csv('participation_ratios.csv', index_col=1)
df.drop(columns=['Unnamed: 0'], inplace=True) # Drop old index numbers

In [2]:
# Create a list of all combinations of state/race for chi-square tests
groups = ['Native American', 'Asian', 'Hispanic', 'Black', 'White', 'Pacific Islander', 'Multiracial']
states = list(df.index)
checks = [(state, group) for state in states for group in groups]

# Add new columns to df to store p-values from chi-square tests
for group in groups:
    df[(group + ' p-value')] = nan

# Run chi-square tests and fill in new columns
for state, group in checks:
    a = df.loc[state, group + ' Number AP']   # Took AP test & of given race
    b = df.loc[state, group + ' Number'] - a  # No AP test & of given race
    c = df.loc[state, 'Total AP'] - a         # Took AP test & different race
    d = df.loc[state, 'Total'] - a - b - c    # No AP test & different race (i.e. all other students)
    # correction=False disables the Yates continuity correction;
    # the second element of the result is the p-value
    chi2 = chi2_contingency([[a, b], [c, d]], correction=False)
    df.loc[state, group + ' p-value'] = chi2[1]

# See if there are any missing values
df.filter(like='p-value').isnull().sum()

Native American p-value     0
Asian p-value               0
Hispanic p-value            0
Black p-value               0
White p-value               0
Pacific Islander p-value    0
Multiracial p-value         0
dtype: int64
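
To make the 2x2 table behind each test concrete, here is a minimal hand-rolled sketch that rebuilds the four cells for one state/race pair (Hispanic students in the District of Columbia, using the counts from that row of the dataframe) and computes the uncorrected Pearson chi-square by hand. The helper `chi2_2x2` is hypothetical and is not part of the notebook or of SciPy; it uses the standard shortcut formula for a 2x2 table and the fact that, with one degree of freedom, the p-value equals `erfc(sqrt(stat / 2))`.

```python
from math import erfc, sqrt


def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value for [[a, b], [c, d]] (no Yates correction)."""
    n = a + b + c + d
    # Shortcut formula for a 2x2 table: n * (ad - bc)^2 / (product of the four margins)
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(stat / 2))
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value


# District of Columbia, Hispanic students (counts taken from the DC row)
total, total_ap = 82338, 3735
group_n, group_ap = 12720, 703

a = group_ap               # Took AP test & of given race
b = group_n - a            # No AP test & of given race
c = total_ap - a           # Took AP test & different race
d = total - a - b - c      # No AP test & different race

stat, p_value = chi2_2x2(a, b, c, d)
print(f"cells = {(a, b, c, d)}, chi2 = {stat:.2f}, p = {p_value:.3e}")
```

The printed p-value should agree closely with the `Hispanic p-value` stored for the District of Columbia, which is a quick sanity check that the cells `a`, `b`, `c`, and `d` partition the student population as intended.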

---
---
Now that we have added p-values to the dataframe and verified that none are missing, we apply a series of filters and sorts to learn more about trends in the under- and overrepresentation of different races.

In [3]:
# Find total number of state/race pairs with PR>1 for Native American, Pacific Islander, or Multiracial students
df[df['Native American Participation Ratio'] > 1].shape[0] +\
    df[df['Pacific Islander Participation Ratio'] > 1].shape[0] +\
    df[df['Multiracial Participation Ratio'] > 1].shape[0]

19
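
The participation ratio (PR) columns are precomputed in `participation_ratios.csv`. Judging from the values in the file, PR appears to be a group's share of AP test takers divided by its share of total enrollment, so PR > 1 indicates overrepresentation. The sketch below checks that reading against two values from the District of Columbia row; the function name `participation_ratio` is hypothetical and the formula is an inference from the data, not something stated in the source.

```python
def participation_ratio(percent_ap, percent_enrolled):
    """Share of AP test takers divided by share of enrollment (assumed definition of PR)."""
    return percent_ap / percent_enrolled


# District of Columbia values from the dataframe
pr_hispanic = participation_ratio(18.822, 15.4485)  # Hispanic Percent AP / Hispanic Percent
pr_native = participation_ratio(0.16064, 0.181)     # Native American Percent AP / Percent

print(round(pr_hispanic, 4), round(pr_native, 4))
```

Both results match the stored `Participation Ratio` columns for DC to about four decimal places, which supports the assumed definition.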

In [4]:
# Make a list of dataframes filtered to show data with PR>1 and p-value<0.1
# for Native American, Pacific Islander, or Multiracial students
list1 = []
for group in ['Native American', 'Pacific Islander', 'Multiracial']:
    df_temp = df[
        (df[group + ' Participation Ratio']>1) & (df[group + ' p-value']<0.1)
    ].filter(axis='columns', like=group)
    
    list1.append(df_temp)

In [5]:
list1[0] # States where Native American students are overrepresented

State,Native American Number,Native American Percent,Native American Number AP,Native American Percent AP,Native American Participation Ratio,Native American p-value


In [6]:
list1[1] # States where Pacific Islander students are overrepresented

State,Pacific Islander Number,Pacific Islander Percent,Pacific Islander Number AP,Pacific Islander Percent AP,Pacific Islander Participation Ratio,Pacific Islander p-value
South Carolina,1049,0.1369,49,0.178,1.300219,0.060518
Tennessee,1033,0.1035,36,0.1361,1.314976,0.094432


In [7]:
list1[2] # States where multiracial students are overrepresented

State,Multiracial Number,Multiracial Percent,Multiracial Number AP,Multiracial Percent AP,Multiracial Participation Ratio,Multiracial p-value
District of Columbia,1578,1.9165,96,2.57028,1.341132,0.002857


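**Takeaways:** Of 19 possible cases of overrepresentation for Native American, Pacific Islander, and multiracial students, only one is statistically significant: multiracial students in the District of Columbia (p ≈ 0.003). Two more are marginally significant: Pacific Islander students in South Carolina (p ≈ 0.06) and Tennessee (p ≈ 0.09). Even these three cases each involve fewer than 100 students of the given race taking an AP test.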
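
The distinction between "significant" and "marginally significant" can be made explicit with a small labeling helper. The notebook filters on p < 0.1 but does not state the lower cutoff, so the conventional 0.05 threshold used here is an assumption, and `significance_label` is a hypothetical name introduced only for illustration.

```python
def significance_label(p_value, alpha=0.05, marginal_alpha=0.1):
    """Label a p-value; alpha=0.05 is an assumed cutoff, 0.1 is the notebook's filter."""
    if p_value < alpha:
        return "significant"
    if p_value < marginal_alpha:
        return "marginally significant"
    return "not significant"


# p-values copied from the three filtered tables above
cases = {
    ("District of Columbia", "Multiracial"): 0.002857,
    ("South Carolina", "Pacific Islander"): 0.060518,
    ("Tennessee", "Pacific Islander"): 0.094432,
}

labels = {pair: significance_label(p) for pair, p in cases.items()}
for (state, group), label in labels.items():
    print(f"{state} / {group}: {label}")
```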

---
---
The next objective is to examine the statistical significance of Black and Hispanic students being underrepresented in every state.

In [8]:
# Verify that Black PR is <1 for all states
df[df['Black Participation Ratio']>1].filter(axis='columns', like='Black')

State,Black Number,Black Percent,Black Number AP,Black Percent AP,Black Participation Ratio,Black p-value


In [9]:
# Look for states where the underrepresentation of Black students is not statistically significant
df[df['Black p-value']>0.1].filter(axis='columns', like='Black')

State,Black Number,Black Percent,Black Number AP,Black Percent AP,Black Participation Ratio,Black p-value
Hawaii,3532,1.9331,105,1.8088,0.935699,0.484458
Idaho,3184,1.0758,65,0.892,0.82915,0.123599
South Dakota,3933,2.8699,60,2.467105,0.859649,0.22994


In [10]:
# Verify that Hispanic PR is <1 for all states
df[df['Hispanic Participation Ratio']>1].filter(axis='columns', like='Hispanic')

State,Hispanic Number,Hispanic Percent,Hispanic Number AP,Hispanic Percent AP,Hispanic Participation Ratio,Hispanic p-value
District of Columbia,12720,15.4485,703,18.822,1.218371,5.270155e-09


Given this surprise, we take a quick detour to see how the demographics and enrollment size of the District of Columbia compare with those of the states:

In [11]:
with pd.option_context("display.max_columns", None):
    display(df.loc[['District of Columbia']])

State,Total,Native American Number,Native American Percent,Asian Number,Asian Percent,Hispanic Number,Hispanic Percent,Black Number,Black Percent,White Number,White Percent,Pacific Islander Number,Pacific Islander Percent,Multiracial Number,Multiracial Percent,Total AP,Native American Number AP,Native American Percent AP,Asian Number AP,Asian Percent AP,Hispanic Number AP,Hispanic Percent AP,Black Number AP,Black Percent AP,White Number AP,White Percent AP,Pacific Islander Number AP,Pacific Islander Percent AP,Multiracial Number AP,Multiracial Percent AP,Native American Participation Ratio,Asian Participation Ratio,Hispanic Participation Ratio,Black Participation Ratio,White Participation Ratio,Pacific Islander Participation Ratio,Multiracial Participation Ratio,Native American p-value,Asian p-value,Hispanic p-value,Black p-value,White p-value,Pacific Islander p-value,Multiracial p-value
District of Columbia,82338,149,0.181,1211,1.4708,12720,15.4485,58140,70.6114,8447,10.2589,93,0.1129,1578,1.9165,3735,6,0.16064,122,3.2664,703,18.822,2259,60.4819,543,14.5382,6,0.1606,96,2.57028,0.887514,2.220832,1.218371,0.856546,1.41713,1.422498,1.341132,0.764912,1.057176e-20,5.270155e-09,5.612766e-44,1.129086e-18,0.374457,0.002857


In [12]:
# Does DC have one of the lowest percentages of white students? (Yes, the lowest)
df['White Percent'].sort_values().head()

State
District of Columbia    10.2589
Hawaii                  12.8153
New Mexico              23.6775
California              24.0664
Texas                   28.4501
Name: White Percent, dtype: float64

In [13]:
# Does DC have one of the highest percentages of Black students? (Yes, the highest)
df['Black Percent'].sort_values(ascending=False).head()

State
District of Columbia    70.6114
Mississippi             49.6869
Louisiana               44.0924
Georgia                 36.9957
Maryland                34.7630
Name: Black Percent, dtype: float64

In [14]:
# Does DC have some of the fewest students enrolled? (Again, yes)
df['Total'].sort_values().head()

State
District of Columbia     82338
Vermont                  82913
Wyoming                  94722
North Dakota            110469
Alaska                  131895
Name: Total, dtype: int64

In [15]:
# With that out of the way, back to the original plan...
# Look for states where the underrepresentation of Hispanic students is not statistically significant
df[df['Hispanic p-value']>0.1].filter(axis='columns', like='Hispanic')

State,Hispanic Number,Hispanic Percent,Hispanic Number AP,Hispanic Percent AP,Hispanic Participation Ratio,Hispanic p-value
Alaska,8802,6.6735,190,6.327,0.948078,0.441518


**Takeaways:** DC is the only jurisdiction where Hispanic students are overrepresented (PR ≈ 1.22), and this result is statistically significant (p ≈ 5e-9). Notably, DC has unique demographics:
  * It has fewer students enrolled (82,338) than any state
  * It has a smaller percentage of white students (~10%) than any state
  * It has a larger percentage of Black students (~70%) than any state

There are three states where the underrepresentation of Black students is not statistically significant at p < 0.1 (Hawaii, Idaho, and South Dakota), and one where the same holds for Hispanic students (Alaska). Again, each of these cases involves a small number of students (< 200) of the given race taking an AP test.

---
---
The last objective is to examine the statistical significance of white and Asian students being overrepresented in every state.

In [16]:
# Verify that white students are overrepresented in all states
df[df['White Participation Ratio']<1].filter(axis='columns', like='White')

State,White Number,White Percent,White Number AP,White Percent AP,White Participation Ratio,White p-value


In [17]:
# Look for states where the overrepresentation of white students is not statistically significant
df[df['White p-value']>0.1].filter(axis='columns', like='White')

State,White Number,White Percent,White Number AP,White Percent AP,White Participation Ratio,White p-value


In [18]:
# Verify that Asian students are overrepresented in all states
df[df['Asian Participation Ratio']<1].filter(axis='columns', like='Asian')

State,Asian Number,Asian Percent,Asian Number AP,Asian Percent AP,Asian Participation Ratio,Asian p-value


In [19]:
# Look for states where the overrepresentation of Asian students is not statistically significant
df[df['Asian p-value']>0.1].filter(axis='columns', like='Asian')

State,Asian Number,Asian Percent,Asian Number AP,Asian Percent AP,Asian Participation Ratio,Asian p-value


**Takeaways:** White and Asian students are overrepresented in every state (no PR is below 1), and the overrepresentation is statistically significant in every case (no p-value exceeds 0.1).