# PUMS data for the state of Alabama

This data frame contains approximately 48,000 rows of PUMS (Public Use Microdata Sample) data. Each row corresponds to an individual resident of Alabama who responded to the 2018 American Community Survey, from which the PUMS data is drawn. We will use this dataset to predict employment status from demographic features other than race, and then audit the resulting model for racial bias. The model is a Decision Tree Classifier from scikit-learn, and we use cross-validation to select the max depth that gives the highest accuracy.

In [1]:
from folktables import ACSDataSource, ACSEmployment, BasicProblem, adult_filter
import numpy as np

STATE = "AL"

data_source = ACSDataSource(survey_year='2018', 
                            horizon='1-Year', 
                            survey='person')

acs_data = data_source.get_data(states=[STATE], download=True)

acs_data.head()

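| | RT | SERIALNO | DIVISION | SPORDER | PUMA | REGION | ST | ADJINC | PWGTP | AGEP | ... | PWGTP71 | PWGTP72 | PWGTP73 | PWGTP74 | PWGTP75 | PWGTP76 | PWGTP77 | PWGTP78 | PWGTP79 | PWGTP80 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | P | 2018GQ0000049 | 6 | 1 | 1600 | 3 | 1 | 1013097 | 75 | 19 | ... | 140 | 74 | 73 | 7 | 76 | 75 | 80 | 74 | 7 | 72 |
| 1 | P | 2018GQ0000058 | 6 | 1 | 1900 | 3 | 1 | 1013097 | 75 | 18 | ... | 76 | 78 | 7 | 76 | 80 | 78 | 7 | 147 | 150 | 75 |
| 2 | P | 2018GQ0000219 | 6 | 1 | 2000 | 3 | 1 | 1013097 | 118 | 53 | ... | 117 | 121 | 123 | 205 | 208 | 218 | 120 | 19 | 123 | 18 |
| 3 | P | 2018GQ0000246 | 6 | 1 | 2400 | 3 | 1 | 1013097 | 43 | 28 | ... | 43 | 76 | 79 | 77 | 80 | 44 | 46 | 82 | 81 | 8 |
| 4 | P | 2018GQ0000251 | 6 | 1 | 2701 | 3 | 1 | 1013097 | 16 | 25 | ... | 4 | 2 | 29 | 17 | 15 | 28 | 17 | 30 | 15 | 1 |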


## Narrowing the features

We’ll focus on a relatively small number of features in the modeling tasks of this blog post. Here are all the possible features:

In [2]:
possible_features=['AGEP', 'SCHL', 'MAR', 'RELP', 'DIS', 'ESP', 'CIT', 'MIG', 'MIL', 'ANC', 'NATIVITY', 'DEAR', 'DEYE', 'DREM', 'SEX', 'RAC1P', 'ESR']
acs_data[possible_features].head()

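| | AGEP | SCHL | MAR | RELP | DIS | ESP | CIT | MIG | MIL | ANC | NATIVITY | DEAR | DEYE | DREM | SEX | RAC1P | ESR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | 18.0 | 5 | 17 | 2 | NaN | 1 | 3.0 | 4.0 | 1 | 1 | 2 | 2 | 2.0 | 2 | 1 | 6.0 |
| 1 | 18 | 18.0 | 5 | 17 | 2 | NaN | 1 | 3.0 | 4.0 | 1 | 1 | 2 | 2 | 2.0 | 2 | 2 | 6.0 |
| 2 | 53 | 17.0 | 5 | 16 | 1 | NaN | 1 | 1.0 | 4.0 | 2 | 1 | 2 | 2 | 1.0 | 1 | 1 | 6.0 |
| 3 | 28 | 19.0 | 5 | 16 | 2 | NaN | 1 | 1.0 | 2.0 | 1 | 1 | 2 | 2 | 2.0 | 1 | 1 | 6.0 |
| 4 | 25 | 12.0 | 5 | 16 | 1 | NaN | 1 | 3.0 | 4.0 | 1 | 1 | 2 | 2 | 1.0 | 2 | 1 | 6.0 |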


In [3]:
features_to_use = [f for f in possible_features if f not in ["ESR", "RAC1P"]]

In [4]:
EmploymentProblem = BasicProblem(
    features=features_to_use,
    target='ESR',
    target_transform=lambda x: x == 1,
    group='RAC1P',
    preprocess=lambda x: x,
    postprocess=lambda x: np.nan_to_num(x, -1),
)

features, label, group = EmploymentProblem.df_to_numpy(acs_data)
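One detail of the `postprocess` step above is worth checking: the second positional argument of `np.nan_to_num` is `copy`, not the fill value. The small sketch below, on a made-up array (an assumption for illustration, not the PUMS data), shows the difference between the positional call and the `nan=` keyword.

```python
import numpy as np

# Hypothetical toy array with one missing value.
x = np.array([1.0, np.nan, 3.0])

# The second positional argument is `copy`, so -1 is read as a truthy flag
# and NaN is replaced by the default fill value 0.0.
filled_positional = np.nan_to_num(x, -1)

# Passing the fill value by keyword replaces NaN with -1.0 instead.
filled_keyword = np.nan_to_num(x, nan=-1)

print(filled_positional)
print(filled_keyword)
```

So the call as written fills missing entries with 0 rather than -1. Switching to the keyword form would change the feature values, and therefore possibly the scores reported below.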

In [5]:
# split training and testing data

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test, group_train, group_test = train_test_split(
    features, label, group, test_size=0.2, random_state=0)

# Data Inspection 

In [26]:
import pandas as pd
df = pd.DataFrame(X_train, columns = features_to_use)
df["group"] = group_train
df["label"] = y_train

print(f"Number of individuals: {group_train.size}")
print(f"Proportion of employed individuals: {y_train.mean()}")

Number of individuals: 38221
Proportion of employed individuals: 0.4091468041129222


Of the 38,221 people in our training data, 40.91% have a target label equal to 1, which corresponds to being employed.

In [27]:
# group_train is a NumPy array with no groupby method, so group through
# the DataFrame built above: employment rate and count for each RAC1P code
df.groupby("group")["label"].aggregate(["mean", "size"]).round(2)

# Training a Decision Tree Classifier

We will train a Decision Tree Classifier from scikit-learn on the training data, using 5-fold cross-validation to tune the tree's max depth.

In [7]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score

best_cv_score = 0       # highest mean cross-validation accuracy seen so far

# try max depths from 5 down to 1
for max_depth in range(5, 0, -1):
    DT = make_pipeline(StandardScaler(), DecisionTreeClassifier(max_depth=max_depth))
    cv_scores = cross_val_score(DT, X_train, y_train, cv=5)
    mean_score = cv_scores.mean()
    print(f"Max depth = {max_depth}, score = {mean_score.round(3)}")

    # keep the model whose max_depth has the best cross-validation score
    # (not the best training accuracy, which always favors deeper trees)
    if mean_score > best_cv_score:
        best_cv_score = mean_score
        DT.fit(X_train, y_train)                     # refit on the full training set
        best_score_DT = DT.score(X_train, y_train)   # training accuracy of that model
        best_DT = DT
        best_max_depth = max_depth

Max depth = 5, score = 0.814
Max depth = 4, score = 0.809
Max depth = 3, score = 0.794
Max depth = 2, score = 0.767
Max depth = 1, score = 0.636


In [8]:
print(f"Best max depth: {best_max_depth}")
print(f"Training accuracy at best depth: {best_score_DT}")

Best max depth: 5
Training accuracy at best depth: 0.8143429004997252
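The manual search over max depth can also be written with scikit-learn's `GridSearchCV`, which runs the cross-validation and keeps the best setting for us. The following is a minimal sketch under stated assumptions, not the code used for the results above: it runs on synthetic data from `make_classification` so that it is self-contained, and `X_demo` and `y_demo` are hypothetical stand-ins for `X_train` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train (assumption: not the PUMS data).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

pipeline = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))

# make_pipeline names each step after its lowercased class name,
# so the tree's parameter is addressed as decisiontreeclassifier__max_depth.
param_grid = {"decisiontreeclassifier__max_depth": [1, 2, 3, 4, 5]}

# 5-fold cross-validation over the grid, scored by accuracy.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(f"Best max depth: {search.best_params_['decisiontreeclassifier__max_depth']}")
print(f"Best mean CV accuracy: {search.best_score_:.3f}")
```

With the default `refit=True`, `search.best_estimator_` is the pipeline refit on all of the data passed to `fit`, so it could play the role of `best_DT`.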


# Auditing for Bias

We now audit the model for racial bias by comparing its performance on the test set overall and across racial groups.

## Overall Measures

In [9]:
y_hat = best_DT.predict(X_test)

print("The overall accuracy in predicting whether someone is employed is: ")
print((y_hat == y_test).mean())

The overall accuracy in predicting whether someone is employed is: 
0.8115320217664295


In [10]:
matrix = confusion_matrix(y_test, y_hat)

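# rows are true labels, columns are predicted labels
tp = matrix[1][1]   # employed, predicted employed
tn = matrix[0][0]   # not employed, predicted not employed
fp = matrix[0][1]   # not employed, predicted employed
fn = matrix[1][0]   # employed, predicted not employed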

ppv = tp / (tp + fp)
print(f"\nPPV: {ppv}")

print(f"\nFalse negative: {fn}")
print(f"\nFalse positive: {fp}")


PPV: 0.7652757078986587

False negative: 856

False positive: 945


The overall accuracy of our model is 81%, with a positive predictive value of 0.77. On the test set it produced 856 false negatives and 945 false positives.

## By-Group Measures

In [11]:
# RAC1P codes: 1 = White alone, 2 = Black or African American alone
print("The accuracy for white individuals is: ")
print((y_hat == y_test)[group_test == 1].mean())

print("\nThe accuracy for black individuals is: ")
print((y_hat == y_test)[group_test == 2].mean())

The accuracy for white individuals is: 
0.810126582278481

The accuracy for black individuals is: 
0.8151589242053789
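The boolean-mask comparison above extends to every group code at once with a pandas `groupby`. This is a minimal sketch on made-up arrays; `y_true_demo`, `y_pred_demo`, and `group_demo` are hypothetical stand-ins for `y_test`, `y_hat`, and `group_test`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy arrays standing in for y_test, y_hat, and group_test.
y_true_demo = np.array([True, False, True, True, False, False, True, False])
y_pred_demo = np.array([True, False, False, True, True, False, True, False])
group_demo = np.array([1, 1, 1, 1, 2, 2, 2, 6])

audit = pd.DataFrame({"group": group_demo, "correct": y_true_demo == y_pred_demo})

# The mean of the boolean `correct` column is the accuracy within each group;
# `size` gives the number of individuals the estimate is based on.
by_group = audit.groupby("group")["correct"].agg(accuracy="mean", n="size")
print(by_group)
```

Reporting the group size next to each accuracy helps when reading the result, since accuracies for very small groups are noisy.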


In [12]:
# white sub group
matrix_white = confusion_matrix(y_test[group_test == 1], y_hat[group_test == 1])

tp = matrix_white[1][1]
tn = matrix_white[0][0]
fp = matrix_white[0][1]
fn = matrix_white[1][0]

ppv = tp / (tp + fp)
print(f"\nPPV: {ppv}")

print(f"\nFalse negative for white individuals: {fn}")
print(f"\nFalse positive for white individuals: {fp}")


PPV: 0.7782139352306182

False negative for white individuals: 672

False positive for white individuals: 678


In [13]:
# black sub group
matrix_black = confusion_matrix(y_test[group_test == 2], y_hat[group_test == 2])

tp = matrix_black[1][1]
tn = matrix_black[0][0]
fp = matrix_black[0][1]
fn = matrix_black[1][0]

ppv = tp / (tp + fp)
print(f"\nPPV: {ppv}")

print(f"\nFalse negative for black individuals: {fn}")
print(f"\nFalse positive for black individuals: {fp}")


PPV: 0.7238805970149254

False negative for black individuals: 156

False positive for black individuals: 222
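Raw false negative and false positive counts are hard to compare across groups of different sizes, so it can help to convert them into rates. The sketch below wraps the repeated confusion-matrix steps in a helper; the name `error_rates` and the toy arrays are assumptions for illustration and are not part of the analysis above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def error_rates(y_true, y_pred):
    """Return PPV, false positive rate, and false negative rate."""
    # labels= pins the layout so ravel() always yields tn, fp, fn, tp.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[False, True]).ravel()
    return {
        "ppv": tp / (tp + fp),   # of those predicted employed, share truly employed
        "fpr": fp / (fp + tn),   # of those not employed, share predicted employed
        "fnr": fn / (fn + tp),   # of those employed, share predicted not employed
    }

# Hypothetical toy data: six individuals in group 1, four in group 2.
y_true_demo = np.array([True, True, True, False, False, False, True, False, False, True])
y_pred_demo = np.array([True, True, False, True, False, False, True, True, False, False])
group_demo = np.array([1, 1, 1, 1, 1, 1, 2, 2, 2, 2])

rates = {}
for g in (1, 2):
    mask = group_demo == g
    rates[g] = error_rates(y_true_demo[mask], y_pred_demo[mask])
    print(g, rates[g])
```

Applied to the real arrays, the call would look like `error_rates(y_test[group_test == 1], y_hat[group_test == 1])`, which gives rates that can be compared directly between groups.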
