In [1]:
%load_ext autoreload
%autoreload 2

# Link to source code

https://github.com/juliafairbank7/juliafairbank7.github.io/blob/main/posts/

# Downloading the Data

We will be using data from the American Community Survey’s Public Use Microdata Sample (PUMS). Let's start by downloading a complete set of PUMS data for the state of Montana.

In [4]:
from folktables import ACSDataSource, ACSEmployment, BasicProblem, adult_filter
import numpy as np

STATE = "MT"

data_source = ACSDataSource(survey_year='2018', 
                            horizon='1-Year', 
                            survey='person')

acs_data = data_source.get_data(states=[STATE], download=True)

acs_data.head()

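| | RT | SERIALNO | DIVISION | SPORDER | PUMA | REGION | ST | ADJINC | PWGTP | AGEP | ... | PWGTP71 | PWGTP72 | PWGTP73 | PWGTP74 | PWGTP75 | PWGTP76 | PWGTP77 | PWGTP78 | PWGTP79 | PWGTP80 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | P | 2018GQ0000197 | 8 | 1 | 300 | 4 | 30 | 1013097 | 41 | 61 | ... | 67 | 40 | 42 | 42 | 37 | 40 | 6 | 6 | 6 | 94 |
| 1 | P | 2018GQ0001300 | 8 | 1 | 700 | 4 | 30 | 1013097 | 47 | 57 | ... | 43 | 81 | 87 | 91 | 44 | 8 | 43 | 82 | 42 | 110 |
| 2 | P | 2018GQ0001512 | 8 | 1 | 500 | 4 | 30 | 1013097 | 114 | 18 | ... | 103 | 17 | 104 | 117 | 219 | 182 | 17 | 17 | 200 | 232 |
| 3 | P | 2018GQ0001743 | 8 | 1 | 300 | 4 | 30 | 1013097 | 76 | 28 | ... | 66 | 75 | 71 | 76 | 79 | 132 | 140 | 143 | 70 | 11 |
| 4 | P | 2018GQ0002532 | 8 | 1 | 300 | 4 | 30 | 1013097 | 112 | 18 | ... | 16 | 198 | 101 | 211 | 113 | 95 | 111 | 213 | 195 | 19 |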


As you can see, each record has far more columns than we need. Let's select a small number of features before we start analyzing.

In [5]:
possible_features=['AGEP', 'SCHL', 'MAR', 'RELP', 'DIS', 'ESP', 'CIT', 'MIG', 'MIL', 'ANC', 'NATIVITY', 'DEAR', 'DEYE', 'DREM', 'SEX', 'RAC1P', 'ESR']
acs_data[possible_features].head()

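| | AGEP | SCHL | MAR | RELP | DIS | ESP | CIT | MIG | MIL | ANC | NATIVITY | DEAR | DEYE | DREM | SEX | RAC1P | ESR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 61 | 16.0 | 3 | 16 | 2 | NaN | 1 | 1.0 | 4.0 | 1 | 1 | 2 | 2 | 2.0 | 1 | 1 | 6.0 |
| 1 | 57 | 17.0 | 3 | 16 | 1 | NaN | 1 | 3.0 | 4.0 | 1 | 1 | 2 | 2 | 1.0 | 1 | 1 | 6.0 |
| 2 | 18 | 19.0 | 5 | 17 | 2 | NaN | 1 | 3.0 | 4.0 | 2 | 1 | 2 | 2 | 2.0 | 2 | 1 | 6.0 |
| 3 | 28 | 14.0 | 3 | 16 | 1 | NaN | 1 | 1.0 | 2.0 | 1 | 1 | 2 | 2 | 1.0 | 1 | 1 | 6.0 |
| 4 | 18 | 16.0 | 5 | 17 | 1 | NaN | 1 | 3.0 | 4.0 | 2 | 1 | 2 | 1 | 2.0 | 2 | 9 | 6.0 |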


We have now pulled out a few features, including age (AGEP), educational attainment (SCHL), marital status (MAR), relationship (RELP), disability recode (DIS), race (RAC1P), sex (SEX), and more.

Let's now create a subset for the features we want to use, then construct a BasicProblem to use those features to predict employment status (ESR), using race (RAC1P) as the group label.

In [8]:
features_to_use = [f for f in possible_features if f not in ["ESR", "RAC1P"]]

EmploymentProblem = BasicProblem(
    features=features_to_use,
    target='ESR',
    target_transform=lambda x: x == 1,
    group='RAC1P',
    preprocess=lambda x: x,
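    # Note: the second positional argument of np.nan_to_num is `copy`, not the
    # fill value, so missing values become 0.0 here (use nan=-1 to fill with -1).
    postprocess=lambda x: np.nan_to_num(x, -1),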
)

features, label, group = EmploymentProblem.df_to_numpy(acs_data)

EmploymentProblem.df_to_numpy returns a feature matrix (features), a label vector (label), and a group label vector (group).
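To make the two lambdas passed to BasicProblem concrete, here is a minimal sketch on hypothetical toy arrays (the values are made up and only stand in for ACS columns). It shows that the target transform marks ESR code 1 as True, and that passing -1 positionally to np.nan_to_num fills missing values with 0.0, whereas the nan keyword would be needed to fill with -1.

```python
import numpy as np

# Hypothetical toy values standing in for ACS columns (not real data).
esr = np.array([1.0, 6.0, 3.0, 1.0])       # employment status codes
esp = np.array([np.nan, 2.0, np.nan])      # a column with missing values

# target_transform: True only where the ESR code equals 1.
label = esr == 1

# postprocess as written in the notebook: -1 is interpreted as `copy`,
# so NaN is replaced by the default fill value 0.0.
as_in_notebook = np.nan_to_num(esp, -1)

# Using the keyword argument fills NaN with -1 instead.
with_keyword = np.nan_to_num(esp, nan=-1)

print(label)
print(as_in_notebook)
print(with_keyword)
```

This matches the ESP column in the training dataframe further down, where missing entries appear as 0.0.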

Next, we will perform a train-test split, then get into creating our model!

In [18]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test, group_train, group_test = train_test_split(
    features, label, group, test_size=0.2, random_state=0)
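Passing three arrays to train_test_split matters because the group labels must stay aligned with the rows they describe. The sketch below uses hypothetical toy arrays, built so that row alignment is easy to verify, to illustrate that all arrays are split with the same shuffled indices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): row i of X is [2*i, 2*i + 1], y[i] = i, g[i] = 10*i.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
g = np.arange(10) * 10

X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(
    X, y, g, test_size=0.2, random_state=0
)

# With 10 rows and test_size=0.2, 8 rows go to training and 2 to testing.
print(X_tr.shape, X_te.shape)

# Alignment check: each test row still matches its own label and group.
print(np.array_equal(X_te[:, 0], 2 * y_te))
print(np.array_equal(g_te, 10 * y_te))
```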

In [23]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

We can extract predictions on the test set by calling:

In [24]:
y_hat = model.predict(X_test)

Taking the mean of the element-wise comparison between predictions and true labels gives the overall accuracy in predicting employment:

In [25]:
(y_hat == y_test).mean()

0.7514506769825918
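The accuracy computation works because comparing two arrays yields a boolean array, and the mean of booleans is the fraction that are True. A small sketch with made-up labels and predictions:

```python
import numpy as np

# Hypothetical labels and predictions, for illustration only.
y_true = np.array([True, False, True, True, False])
y_pred = np.array([True, True, True, False, False])

# Element-wise comparison: True where the prediction matches the label.
correct = y_pred == y_true

# The mean of a boolean array is the fraction of True entries (3 of 5 here).
accuracy = correct.mean()
print(correct, accuracy)
```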

We can also group by race (RAC1P) to compare accuracy across groups. Below, we extract the accuracy for predicting the employment of white individuals (RAC1P == 1)...

In [26]:
(y_hat == y_test)[group_test == 1].mean()  # white individuals

0.7554704595185996

... compared to the accuracy for predicting the employment of Black individuals (RAC1P == 2):

In [27]:
(y_hat == y_test)[group_test == 2].mean()  # Black individuals

0.875
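A per-group accuracy is easier to interpret alongside the number of test rows it is based on, since an estimate from a handful of rows is noisy. The helper below, accuracy_by_group, is a hypothetical name introduced here for illustration, and the arrays are toy values rather than the Montana data.

```python
import numpy as np

def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: (accuracy, n_rows)} for each distinct group label."""
    results = {}
    for g in np.unique(groups):
        mask = groups == g  # rows belonging to this group
        acc = float((y_pred[mask] == y_true[mask]).mean())
        results[int(g)] = (acc, int(mask.sum()))
    return results

# Hypothetical toy data: six rows in group 1, two rows in group 2.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
groups = np.array([1, 1, 1, 1, 1, 1, 2, 2])

report = accuracy_by_group(y_true, y_pred, groups)
for g, (acc, n) in report.items():
    print(f"group {g}: accuracy={acc:.3f} on n={n} rows")
```

In the notebook, the same function could be called as accuracy_by_group(y_test, y_hat, group_test).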

Now that we've seen what this model can do, I will be predicting employment status on the basis of demographics excluding race, and then auditing for racial bias.

In [60]:
import pandas as pd
df = pd.DataFrame(X_train, columns = features_to_use)
df["RACE"] = group_train
df["ESR_label"] = y_train
df

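| | AGEP | SCHL | MAR | RELP | DIS | ESP | CIT | MIG | MIL | ANC | NATIVITY | DEAR | DEYE | DREM | SEX | RACE | ESR_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 87.0 | 21.0 | 1.0 | 16.0 | 1.0 | 0.0 | 1.0 | 1.0 | 2.0 | 4.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1 | False |
| 1 | 79.0 | 16.0 | 1.0 | 0.0 | 2.0 | 0.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 | 1.0 | 1 | True |
| 2 | 27.0 | 20.0 | 5.0 | 0.0 | 2.0 | 0.0 | 1.0 | 3.0 | 4.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 1.0 | 1 | True |
| 3 | 94.0 | 16.0 | 2.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 4.0 | 4.0 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 | 1 | False |
| 4 | 40.0 | 20.0 | 5.0 | 15.0 | 2.0 | 0.0 | 1.0 | 1.0 | 4.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8263 | 25.0 | 22.0 | 5.0 | 12.0 | 2.0 | 0.0 | 1.0 | 1.0 | 4.0 | 4.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1 | True |
| 8264 | 10.0 | 6.0 | 5.0 | 2.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.0 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 | 1.0 | 1 | False |
| 8265 | 29.0 | 21.0 | 1.0 | 0.0 | 2.0 | 0.0 | 1.0 | 1.0 | 4.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 1.0 | 1 | True |
| 8266 | 78.0 | 17.0 | 1.0 | 1.0 | 2.0 | 0.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 | 1.0 | 1 | False |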


# Analyzing the Data

## 1. How many individuals are in the dataframe?

There are 8,268 individuals in the training dataframe (the value of df.shape[0]).

## 2. What proportion have a target label equal to 1?

In [61]:
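df.shape[0]     # number of individuals; not echoed, since only the last expression is displayed
y_train.mean()  # proportion of training labels equal to True (employed)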

0.45379777455249154

Of those individuals, the proportion with a target label equal to 1 (employed individuals) is 0.453798, or about 45.4%.

## 3. Of these individuals, how many are in each of the groups?

In [63]:
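# Collapse every non-white category (RAC1P >= 2) into a single group coded 2
df.loc[df["RACE"] >= 2, "RACE"] = 2
# Proportion of employed individuals (label True) within each group
df.groupby("RACE")["ESR_label"].mean()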

RACE
1    0.471005
2    0.316703
Name: ESR_label, dtype: float64

For this analysis, I compare RACE == 1 (white individuals) with RACE == 2. Because the cell above sets every RAC1P value of 2 or higher to 2, the second group covers all other racial groups combined, not only Black or African American individuals. The difference in the proportion of individuals assigned the target label 1 is immediately visible: the proportion is 0.471 among white individuals, while it is 0.317 among individuals in the combined group.

This large discrepancy is already present in the training labels, before any model is fit, and is most likely due to years of systemic bias and racism.
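The heading for this part also asks how many individuals are in each group, which the proportions alone do not show. A minimal sketch, using a hypothetical toy dataframe (not the Montana data) and the same recode of RACE values, that reports both the count and the employed proportion per group:

```python
import pandas as pd

# Hypothetical toy dataframe with the same column names as `df`.
toy = pd.DataFrame({
    "RACE": [1, 1, 1, 2, 3, 9],
    "ESR_label": [True, False, True, False, True, False],
})

# Same recode as in the notebook: every value of 2 or higher becomes 2.
toy.loc[toy["RACE"] >= 2, "RACE"] = 2

# One row per group: number of individuals and share with label True.
summary = toy.groupby("RACE")["ESR_label"].agg(n="count", employed_rate="mean")
print(summary)
```

Running the same agg call on df would give the group sizes behind the 0.471 and 0.317 figures.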