In [6]:
%load_ext autoreload
%autoreload 2

# Link to source code

https://github.com/juliafairbank7/juliafairbank7.github.io/blob/main/posts/

# Downloading the Data

We will be using data from the American Community Survey’s Public Use Microdata Sample (PUMS). Let's start by downloading a complete set of PUMS data for the state of Montana.

In [7]:
from folktables import ACSDataSource, ACSEmployment, BasicProblem, adult_filter
import numpy as np

STATE = "MT"

data_source = ACSDataSource(survey_year='2018', 
                            horizon='1-Year', 
                            survey='person')

acs_data = data_source.get_data(states=[STATE], download=True)

acs_data.head()

ModuleNotFoundError: No module named 'folktables'

As you can see, the raw data contains a large number of columns. Let's narrow it down to a small set of features before we start analyzing.

In [5]:
possible_features=['AGEP', 'SCHL', 'MAR', 'RELP', 'DIS', 'ESP', 'CIT', 'MIG', 'MIL', 'ANC', 'NATIVITY', 'DEAR', 'DEYE', 'DREM', 'SEX', 'RAC1P', 'ESR']
acs_data[possible_features].head()

NameError: name 'acs_data' is not defined

We have now pulled out a few features, including age (AGEP), educational attainment (SCHL), marital status (MAR), relationship (RELP), disability recode (DIS), race (RAC1P), sex (SEX), and more.

Let's now create a subset for the features we want to use, then construct a BasicProblem to use those features to predict employment status (ESR), using race (RAC1P) as the group label.

In [4]:
features_to_use = [f for f in possible_features if f not in ["ESR", "RAC1P"]]

EmploymentProblem = BasicProblem(
    features=features_to_use,
    target='ESR',
    target_transform=lambda x: x == 1,
    group='RAC1P',
    preprocess=lambda x: x,
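    # Note: the second positional argument of np.nan_to_num is `copy`, not the fill value, so NaNs become 0 here.
    postprocess=lambda x: np.nan_to_num(x, -1),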
)

features, label, group = EmploymentProblem.df_to_numpy(acs_data)

NameError: name 'BasicProblem' is not defined

EmploymentProblem.df_to_numpy returns a feature matrix (features), a label vector (label), and a group label vector (group).
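
To make the two lambdas in the BasicProblem definition concrete, here is a minimal sketch on small made-up arrays; the ESR codes and feature values are illustrative and not taken from the PUMS data. It shows that target_transform turns ESR into a boolean label, and that np.nan_to_num(x, -1) fills NaN with 0 (the -1 is read as the copy argument), whereas the nan= keyword is what fills with -1.

```python
import numpy as np

# Hypothetical ESR codes; 1 is the code the target_transform treats as the positive label.
esr = np.array([1, 6, 1, 3, 6])

# Same transform as in the BasicProblem definition: True where ESR == 1.
label = esr == 1

# Hypothetical feature column with one missing value.
x = np.array([30.0, np.nan, 45.0])

filled_positional = np.nan_to_num(x, -1)   # -1 is interpreted as `copy`, so NaN -> 0.0
filled_keyword = np.nan_to_num(x, nan=-1)  # explicit keyword, so NaN -> -1.0

print(label)
print(filled_positional, filled_keyword)
```

Whether 0 or -1 is the better fill value depends on the feature encoding; the sketch only shows which one each call produces.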

Next, we will perform a train-test split, then get into creating our model!

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test, group_train, group_test = train_test_split(
    features, label, group, test_size=0.2, random_state=0)

NameError: name 'features' is not defined
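
The important property of the split is that features, labels, and group labels are shuffled together, so row i of each array still describes the same person afterwards. The sketch below illustrates that idea with a hand-rolled NumPy split on made-up arrays; it is a stand-in for illustration and not how scikit-learn implements train_test_split.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the three aligned arrays (values are made up).
features = np.arange(20).reshape(10, 2)  # row i is [2*i, 2*i + 1]
label = np.arange(10) % 2 == 0
group = np.arange(10)                    # here simply the row index

# One shared permutation keeps features, labels, and groups aligned.
perm = rng.permutation(len(features))
n_test = int(round(0.2 * len(features)))  # hold out 20%, mirroring test_size=0.2
test_idx, train_idx = perm[:n_test], perm[n_test:]

X_train, X_test = features[train_idx], features[test_idx]
y_train, y_test = label[train_idx], label[test_idx]
group_train, group_test = group[train_idx], group[test_idx]

print(X_train.shape, X_test.shape)
```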

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

We can extract predictions on the test set by calling:

In [None]:
y_hat = model.predict(X_test)

Comparing the predictions to the true labels and taking the mean gives the overall accuracy of the employment predictions:

In [None]:
(y_hat == y_test).mean()
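
As a small worked example of why the mean of this comparison is the accuracy, the sketch below uses made-up predictions and labels (not the model's output). The comparison yields a boolean array, and NumPy counts True as 1 and False as 0 when averaging.

```python
import numpy as np

# Made-up true labels and predictions for five individuals.
y_test = np.array([True, False, True, True, False])
y_hat = np.array([True, True, True, False, False])

correct = y_hat == y_test   # element-wise: [True, False, True, False, True]
accuracy = correct.mean()   # fraction of True entries: 3 of 5 are correct

print(accuracy)
```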

We can also group by race (RAC1P) to compare accuracy across groups. First, the accuracy for predicting the employment of white individuals (RAC1P == 1):

In [6]:
(y_hat == y_test)[group_test == 1].mean()  # white individuals

NameError: name 'y_hat' is not defined

Next, for comparison, the accuracy for predicting the employment of Black individuals (RAC1P == 2):

In [7]:
(y_hat == y_test)[group_test == 2].mean()  # Black individuals

NameError: name 'y_hat' is not defined
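
Rather than repeating the expression for each group code, the same boolean-mask idea can be applied in a loop over every group present in the test set. The sketch below runs on made-up arrays; the group codes and predictions are illustrative only.

```python
import numpy as np

# Made-up labels, predictions, and group codes for six individuals.
y_test = np.array([1, 0, 1, 1, 0, 0], dtype=bool)
y_hat = np.array([1, 0, 1, 1, 1, 1], dtype=bool)
group_test = np.array([1, 1, 1, 2, 2, 2])

correct = y_hat == y_test

# Accuracy within each group: mask the comparison by group, then average.
group_accuracy = {
    int(g): float(correct[group_test == g].mean())
    for g in np.unique(group_test)
}

print(group_accuracy)
```

With real data, small groups produce noisy accuracy estimates, so it can help to report the group size alongside each value.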

Now that we've seen what this model can do, I will be predicting employment status on the basis of demographics excluding race, and then auditing for racial bias.

In [8]:
import pandas as pd
df = pd.DataFrame(X_train, columns = features_to_use)
df["RACE"] = group_train
df["ESR_label"] = y_train
df

NameError: name 'X_train' is not defined

# Analyzing the Data

## 1. How many individuals are in the dataframe?

There are 8,268 individuals in this dataframe, which holds the training split.

## 2. What proportion have a target label equal to 1?

In [None]:
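df.shape[0]     # number of individuals (rows) in the training dataframe
y_train.mean()  # proportion of individuals with target label 1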

Of those individuals, the proportion with a target label equal to 1 (employed individuals) is 0.453798, or roughly 45%.

## 3. Of these individuals, how many are in each of the groups?

In [9]:
df.loc[df["RACE"] >= 2, "RACE"] = 2
df.groupby("RACE")["ESR_label"].mean()

NameError: name 'df' is not defined

For this analysis, I compare RACE == 1 (white individuals) with RACE == 2. Because the cell above recodes every RACE value of 2 or greater to 2, group 2 here contains all non-white individuals, not only Black or African American individuals. The difference in the proportion of individuals with target label 1 is immediately visible: 0.471 among white individuals versus 0.317 in group 2.
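
The recoding and the per-group summary can be sketched on a tiny made-up dataframe (the rows below are illustrative, not PUMS records). Aggregating with both size and mean reports how many individuals fall in each group as well as the proportion with target label 1.

```python
import pandas as pd

# Made-up rows using the same column names as the dataframe above.
toy = pd.DataFrame({
    "RACE": [1, 1, 1, 2, 3, 5, 9],
    "ESR_label": [True, True, False, True, False, False, False],
})

# Same recoding as above: every code of 2 or greater becomes 2.
toy.loc[toy["RACE"] >= 2, "RACE"] = 2

# Group size and proportion of positive labels, side by side.
summary = toy.groupby("RACE")["ESR_label"].agg(["size", "mean"])

print(summary)
```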

This large discrepancy is most likely due to years of systemic bias and racism. 

## 4. In each group, what proportion of individuals have target label equal to 1?

## 5. Check for intersectional trends by studying the proportion of positive target labels broken out by your chosen group labels and an additional group label.
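
One possible way to approach this is to group by two columns at once. The sketch below assumes SEX as the additional group label (it is among the features used above) and runs on made-up rows, so the numbers it prints say nothing about the real data.

```python
import pandas as pd

# Made-up rows; the RACE and SEX codes are illustrative.
toy = pd.DataFrame({
    "RACE": [1, 1, 1, 1, 2, 2, 2, 2],
    "SEX": [1, 1, 2, 2, 1, 1, 2, 2],
    "ESR_label": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Proportion of positive labels for every (RACE, SEX) combination,
# reshaped so that rows are RACE groups and columns are SEX groups.
rates = toy.groupby(["RACE", "SEX"])["ESR_label"].mean().unstack("SEX")

print(rates)
```

Reading across a row compares the additional group label within one racial group, and reading down a column compares racial groups within one value of that label.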