## Homework: Fair prediction

In this homework you will build a logistic regression classifier on the Machine Bias data, then tune it to get equal false positive rates between black and white defendants.

### Part 0. Loading the data and building the feature matrix.
Free code, copied from our class notebook.

In [15]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn import metrics
%matplotlib inline

In [16]:
# Select between data on overall arrests and arrests for violent crimes
# This allows quick comparisons of the difference between these two data sets
violent = False

if violent:
    fname ='compas-scores-two-years-violent.csv'
    decile_col = 'v_decile_score'
    score_col = 'v_score_text'
else:
    fname ='compas-scores-two-years.csv'
    decile_col = 'decile_score'
    score_col = 'score_text'


In [17]:
cv = pd.read_csv(fname)

In [18]:
# Data cleaning à la ProPublica
cv = cv[
    (cv.days_b_screening_arrest <= 30) &  
    (cv.days_b_screening_arrest >= -30) &  
    (cv.is_recid != -1) &
    (cv.c_charge_degree != 'O') &
    (cv[score_col] != 'N/A')
]

# Keep only black and white races for this analysis
# cv = cv[(cv.race == 'African-American') | (cv.race=='Caucasian')]
         
# renumber the rows from 0 again
cv.reset_index(inplace=True, drop=True) 
cv.shape

(6172, 53)

In [19]:
# build up dummy variables for age, sex, and charge degree, plus the raw priors count
features = pd.concat(
    [pd.get_dummies(cv.age_cat, prefix='age'),
     pd.get_dummies(cv.sex, prefix='sex'),
     pd.get_dummies(cv.c_charge_degree, prefix='degree'), # felony or misdemeanor charge ('F' or 'M')
     cv.priors_count],
    axis=1)

# We should have one fewer dummy variable than the number of categories, to avoid the "dummy variable trap"
# See https://www.quora.com/When-do-I-fall-in-the-dummy-variable-trap
features.drop(['age_25 - 45', 'sex_Female', 'degree_M'], axis=1, inplace=True)

# Try to predict whether someone is re-arrested
target = cv.two_year_recid
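
As a small illustration of the dummy-variable trap mentioned above, here is a sketch on a toy age column (made-up values, not the COMPAS data). It uses `drop_first=True`, which drops the alphabetically first category rather than a hand-picked one, so the reference level can differ from the ones chosen in the notebook.

```python
import pandas as pd

# Toy stand-in for cv.age_cat; the values are illustrative only
age_cat = pd.Series(['Less than 25', '25 - 45', 'Greater than 45', '25 - 45'])

# Full encoding: one column per category, so each row's columns always sum to 1
full = pd.get_dummies(age_cat, prefix='age')

# Reduced encoding: drop one category to serve as the reference level
reduced = pd.get_dummies(age_cat, prefix='age', drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Because the full encoding always sums to 1 across its columns, it is perfectly collinear with the intercept, which is why one category is dropped.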

### Part 1. Your basic logistic regression

Fit a logistic regression to this data. Print out the accuracy, PPV, and FPR overall, and separately for black and white defendants.

Most of the code you need can be found in the class notebook.

In [29]:
# Fit a logistic regression
x = features.values
y = target.values
lr = LogisticRegression()
lr.fit(x,y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [21]:
# Predict the result on the training data
y_pred = lr.predict(x)
guessed=pd.Series(y_pred)==1
# print(guessed)
actual=cv.two_year_recid==1

cm = pd.crosstab(guessed, actual, rownames=['guessed'], colnames=['actual'])
# cm

In [22]:
# Free code for you!

# cm is a confusion matrix. The rows are guessed, the columns are actual 
def print_ppv_fpv(cm):
    # the indices here are [col][row] or [actual][guessed]
    TN = cm[False][False]   
    TP = cm[True][True]
    FN = cm[True][False]
    FP = cm[False][True]
    print('Accuracy: ', (TN+TP)/(TN+TP+FN+FP))
    print('PPV: ', TP / (TP + FP))
    print('FPR: ', FP / (FP + TN))
    print('FNR: ', FN / (FN + TP))
    print()
#     return (FP / (FP + TN))

def print_metrics(guessed, actual):
    cm = pd.crosstab(guessed, actual, rownames=['guessed'], colnames=['actual'])
    print(cm)
    print()
    print_ppv_fpv(cm)    


In [23]:
# Print out the accuracy, PPV, FPR, FNR for
#  - everyone
print("Everyone")
print_ppv_fpv(cm)
#  - just white defendants
print("White defendants")
white_def = cv.race == 'Caucasian'
print_metrics(guessed[white_def], actual[white_def])
#  - just black defendants
print("Black defendants")
black_def = cv.race == 'African-American'
print_metrics(guessed[black_def], actual[black_def])

Everyone
Accuracy:  0.670447180816591
PPV:  0.6638477801268499
FPR:  0.23639607493309545
FNR:  0.4410822356710573

White defendants
actual   False  True 
guessed              
False     1068    494
True       213    328

Accuracy:  0.6638135996195911
PPV:  0.6062846580406654
FPR:  0.16627634660421545
FNR:  0.6009732360097324

Black defendants
actual   False  True 
guessed              
False     1026    564
True       488   1097

Accuracy:  0.6686614173228347
PPV:  0.6921135646687697
FPR:  0.32232496697490093
FNR:  0.3395544852498495



### Part 2. Equalizing false positive rates
Now you'll build your own classifier that equalizes the false positive rates between white and black defendants. There are many ways to do this. We're going to use race explicitly to set a different threshold for white and black defendants.

To begin with, we are going to write our own prediction function, starting with this one:

In [30]:
# This takes a trained LogisticRegression, a set of features, and a threshold
# Predicts true wherever the regression gives a probability > threshold
# Note: returns a numpy array, not a dataframe
def predict_threshold(classifier, features, threshold):
    # predict_proba returns one column per class, in the order of classifier.classes_:
    # column 0 is the probability of false (0), column 1 is the probability of true (1)
    # [:,1] selects the second column
    rez = classifier.predict_proba(features)[:,1]
    return rez > threshold

In [31]:
# This is the same as lr.predict(x) when we use a threshold of 0.5
y_pred2 = predict_threshold(lr, x, 0.5)
print(y_pred2)

guessed2=pd.Series(y_pred2)==1
actual=cv.two_year_recid==1
cm = pd.crosstab(guessed2, actual, rownames=['guessed'], colnames=['actual'])

[False False  True ... False False  True]
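
The claim that a 0.5 threshold reproduces `lr.predict(x)` can be checked directly. The sketch below does so on synthetic data (the arrays `X` and `y` here are made up, not the COMPAS features), and also prints `classes_`, which gives the column order of `predict_proba`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (hypothetical, not the COMPAS features)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# classes_ gives the column order of predict_proba; column 1 is P(y == 1)
print(clf.classes_)

# Compare the built-in prediction with thresholding the probability at 0.5
same = np.array_equal(clf.predict(X) == 1, clf.predict_proba(X)[:, 1] > 0.5)
print(same)
```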


Now adapt this function so it takes two thresholds, `a_threshold` and `b_threshold`, and a boolean column `use_b`: rows where `use_b` is true use `b_threshold`, and all other rows use `a_threshold`. The idea is to allow us to adjust the thresholds independently on two different groups.

In [32]:
# Write a function which takes the following arguments
def predict_threshold_groups(classifier, features, a_threshold, b_threshold, use_b):
    # calculate probabilities from our classifier once
    probs = classifier.predict_proba(features)[:,1]

    # Build one boolean array which is True where the probabilities are bigger than a_threshold,
    # and another for b_threshold
    a_result = probs > a_threshold
    b_result = probs > b_threshold

    # Then combine them, selecting values from either array according to use_b
    # np.asarray makes the selection positional, so it does not depend on use_b's index labels
    return np.where(np.asarray(use_b), b_result, a_result)

Now use this function with different thresholds for black and white defendants. Print out the confusion matrix, accuracy, FPR, and PPV for the results -- again, overall and for each race.

In [33]:
# Predict recidivism with different thresholds for black and white
# Print out metrics for everyone, black, and white
y_pred3 = predict_threshold_groups(lr, x, 0.5, 0.5, black_def)
# print(y_pred3)

guessed3=pd.Series(y_pred3)==1
actual=cv.two_year_recid==1

print("Everyone")
cm = pd.crosstab(guessed3, actual, rownames=['guessed'], colnames=['actual'])
# cm
print_ppv_fpv(cm)
#  - just white defendants
print("White defendants")
white_def = cv.race == 'Caucasian'
print_metrics(guessed3[white_def], actual[white_def])
#  - just black defendants
print("Black defendants")
black_def = cv.race == 'African-American'
print_metrics(guessed3[black_def], actual[black_def])

Everyone
Accuracy:  0.670447180816591
PPV:  0.6638477801268499
FPR:  0.23639607493309545
FNR:  0.4410822356710573

White defendants
actual   False  True 
guessed              
False     1068    494
True       213    328

Accuracy:  0.6638135996195911
PPV:  0.6062846580406654
FPR:  0.16627634660421545
FNR:  0.6009732360097324

Black defendants
actual   False  True 
guessed              
False     1026    564
True       488   1097

Accuracy:  0.6686614173228347
PPV:  0.6921135646687697
FPR:  0.32232496697490093
FNR:  0.3395544852498495



In [34]:
y_pred3 = predict_threshold_groups(lr, x, 0.8, 0.88, black_def)
# print(y_pred3)

guessed3=pd.Series(y_pred3)==1
actual=cv.two_year_recid==1
cm = pd.crosstab(guessed3, actual, rownames=['guessed'], colnames=['actual'])
print(cm)
print()
print("Everyone")
print_ppv_fpv(cm)
#  - just white defendants
print("White defendants")
white_def = cv.race == 'Caucasian'
print_metrics(guessed3[white_def], actual[white_def])
#  - just black defendants
print("Black defendants")
black_def = cv.race == 'African-American'
print_metrics(guessed3[black_def], actual[black_def])

actual   False  True 
guessed              
False     3322   2627
True        41    182

Everyone
Accuracy:  0.5677252106286454
PPV:  0.8161434977578476
FPR:  0.012191495688373476
FNR:  0.9352082591669634

White defendants
actual   False  True 
guessed              
False     1269    784
True        12     38

Accuracy:  0.6214931050879696
PPV:  0.76
FPR:  0.00936768149882904
FNR:  0.9537712895377128

Black defendants
actual   False  True 
guessed              
False     1492   1536
True        22    125

Accuracy:  0.5092913385826772
PPV:  0.8503401360544217
FPR:  0.01453104359313078
FNR:  0.9247441300421433



Tune the thresholds so the False Positive Rate is the same for white and black defendants.
- What did you change to achieve this?
- What effect does this have on the overall accuracy, FPR, FNR, and PPV?
- What effect does this have on the PPV for white and black?
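
One way to search for such thresholds, rather than tuning by hand, is sketched below on synthetic scores. The helper names `fpr_at` and `threshold_for_fpr`, the data, and the target rate of 0.2 are all assumptions for illustration. The idea is that the FPR at a threshold is the share of actual negatives scored above it, so a quantile of the negatives' scores gives a threshold with roughly the desired FPR.

```python
import numpy as np

def fpr_at(probs, actual, threshold):
    """False positive rate: share of actual negatives scored above threshold."""
    negatives = probs[~actual]
    return np.mean(negatives > threshold)

def threshold_for_fpr(probs, actual, target_fpr):
    """Threshold whose FPR is approximately target_fpr (a quantile of the negatives' scores)."""
    negatives = probs[~actual]
    return np.quantile(negatives, 1 - target_fpr)

# Synthetic scores for two hypothetical groups; group b's scores run higher
rng = np.random.default_rng(0)
n = 2000
group_b = rng.random(n) < 0.5
actual = rng.random(n) < 0.45
probs = np.clip(0.35 + 0.2 * actual + 0.1 * group_b + rng.normal(0, 0.15, n), 0, 1)

target = 0.2
for name, mask in [('a', ~group_b), ('b', group_b)]:
    t = threshold_for_fpr(probs[mask], actual[mask], target)
    print(name, round(t, 3), round(fpr_at(probs[mask], actual[mask], t), 3))
```

The two thresholds this produces could then be passed as `a_threshold` and `b_threshold`. Which target FPR to aim for remains a judgment call that the search does not answer.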


**Changes made:**

I experimented with changing both the thresholds:
 - threshold A: since this is the threshold applied to Caucasian defendants, it gives a lower FPR. I tried lowering it so that their FPR would rise to match that of African Americans. The closest I could get is a threshold of 0.41, where FPR = 0.301, against FPR = 0.330 for African Americans.
 - threshold B: I tried raising this threshold to reduce the FPR for African Americans. The closest I could get is a threshold of 0.587, where FPR = 0.178, which is close to the FPR for Caucasians at 0.171.

**Overall Accuracy, FPR, FNR, PPV:**

The overall accuracy decreased when either threshold was changed. The drop was larger when threshold B was changed (0.649 vs. 0.658), but the difference may not be significant.

The overall FPR falls when threshold B is raised - this makes sense, as the threshold for African Americans is raised precisely to reduce their FPR. On the other hand, lowering threshold A increases the overall FPR, for the mirror-image reason. I'm pretty certain that the FPRs can be made equal at other values of the thresholds, but I am choosing to ignore the question of where we should set the threshold (and hence, define bias) for this assignment.

FNR and FPR move in opposite directions when a threshold changes: raising a threshold flags fewer people, so false positives fall and false negatives rise. So FNR increases when I try to reduce the FPR for African Americans, and decreases when I increase the FPR for Caucasians. This raises interesting questions about the costs we are willing to pay to achieve equality, and which kind of equality. Should FPR be increased for both races so that the FNR remains low, or would you rather keep FPRs low and accept a higher FNR?
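
The trade-off between FPR and FNR can be seen by sweeping the threshold. The sketch below uses `sklearn.metrics.roc_curve` on synthetic scores (the data is made up for illustration); FNR is obtained as `1 - TPR`.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic labels and scores (hypothetical, not the COMPAS model output)
rng = np.random.default_rng(1)
actual = (rng.random(1000) < 0.45).astype(int)
scores = np.clip(0.4 + 0.2 * actual + rng.normal(0, 0.15, 1000), 0, 1)

# roc_curve returns FPR and TPR at decreasing thresholds; FNR = 1 - TPR
fpr, tpr, thresholds = roc_curve(actual, scores)
fnr = 1 - tpr

# As the threshold falls, FPR never decreases and FNR never increases
print(np.all(np.diff(fpr) >= 0), np.all(np.diff(fnr) <= 0))
```

The relationship is monotone but not a fixed proportion: how much FNR rises for a given drop in FPR depends on how the scores are distributed.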

PPV for the overall system increases when threshold B is increased (honestly, I'm a little surprised). This could be one of the indicators for choosing which thresholds should be changed. If overall precision of a system increases, that could be one of the ways to justify parameter decisions.

**PPV effect on Caucasians and African Americans:**

Again, PPV seems to rise with the threshold in these runs. A higher threshold increases PPV for African Americans, and a lower threshold for Caucasians decreases it. On the other hand, raising both thresholds while keeping FPR equal increases PPV for both races as well as overall, but FNR also rises sharply as a result, which is an interesting trade-off to keep in mind.
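
The PPV trend can be checked against the confusion matrices printed earlier in this notebook. The sketch below only redoes that arithmetic with the counts copied from those tables; it shows the pattern in these particular runs and is not a general guarantee that PPV rises with the threshold.

```python
def ppv(tp, fp):
    """Positive predictive value: share of flagged defendants who were re-arrested."""
    return tp / (tp + fp)

# (TP, FP) counts copied from the printed confusion matrices
white_050, white_080 = (328, 213), (38, 12)     # thresholds 0.5 and 0.8
black_050, black_088 = (1097, 488), (125, 22)   # thresholds 0.5 and 0.88

print(round(ppv(*white_050), 3), round(ppv(*white_080), 3))  # 0.606 0.76
print(round(ppv(*black_050), 3), round(ppv(*black_088), 3))  # 0.692 0.85
```

Note how few people are flagged at the higher thresholds (50 white and 147 black defendants), which is the other side of the very high FNR.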

In [41]:
cv[cv.sex == 'Female'][decile_col].value_counts()

1     261
2     175
3     140
5     127
4     123
6     114
7      84
8      56
9      53
10     42
Name: decile_score, dtype: int64

### Bonus: Predicting race and the impossibility of blinding
So far we've excluded race as a predictive variable, hoping that this would make the results unbiased. But is race encoded in the other data points? To find out, alter the regression above to try to predict race from the other demographic and criminal history variables.

How accurately can you predict race just on these factors alone?
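
As a rough sketch of the cross-validation mechanics, here is `cross_val_score` applied to synthetic data. The arrays `X_demo` and `group_demo` are hypothetical stand-ins for the feature matrix and a race indicator, and logistic regression is just one possible choice of classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins: X_demo plays the role of `features`,
# group_demo the role of a binary race indicator
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
group_demo = (X_demo[:, 0] + rng.normal(size=500) > 0).astype(int)

# 5-fold cross-validated accuracy: each fold is scored on data the model did not see
scores = cross_val_score(LogisticRegression(), X_demo, group_demo, cv=5, scoring='accuracy')
print(scores.mean())
```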

In [None]:
# Use cross validation and the classifier of your choice to see how well you can predict race
from sklearn.model_selection import cross_val_score



Let's compare this accuracy to just guessing one race all the time. Which race is more common in this data, and what would the accuracy be if we just always guessed that race?
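
The always-guess-the-majority baseline is simply the share of the most common label. A minimal sketch on toy labels (illustrative values, not the COMPAS race column):

```python
import pandas as pd

# Toy labels (illustrative only)
labels = pd.Series(['A', 'A', 'B', 'A', 'C', 'B', 'A', 'B'])

# Share of each label; the largest share is the accuracy of always guessing that label
shares = labels.value_counts(normalize=True)
most_common = shares.idxmax()
baseline_accuracy = shares.max()

print(most_common, baseline_accuracy)  # A 0.5
```

A classifier is only informative to the extent that its cross-validated accuracy exceeds this baseline.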

In [None]:
# What is the most common race in our arrest data?


In [None]:
# What is the accuracy if we always guess the most common race?


Based on this, how much information about race "leaks" into our original recidivism predictor, even if we don't give it the race variable as a feature?

(your answer here)