# INFO 370 Problem Set 8: Applied Modeling
*Name:* Israel Martinez

This problem set has three goals:

1. Use confusion matrices to understand a recent controversy around racial equality and the criminal justice system.

2. Use your logistic regression skills to develop and validate a model, analogous to the proprietary COMPAS model that caused the above-mentioned controversy.

3. Encourage you to think over the role of statistical tools and AI in our policymaking process.

## 1 Is COMPAS fair? (60pt)

### Background
The Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) algorithm is a commercial risk assessment tool that attempts to estimate a criminal defendant's risk of recidivism (when a criminal reoffends, i.e. commits another crime). COMPAS is reportedly one of the most widely used tools of its kind in the US. It is often used in the US criminal justice system to inform sentencing guidelines by judges, although specific rules and regulations vary.

In 2016, ProPublica published an investigative report arguing that racial bias was evident in the COMPAS algorithm. ProPublica had constructed a dataset from Florida public records, and used logistic regression and confusion matrices in its analysis. COMPAS's owners disputed this analysis, and other academics noted that for people with the same COMPAS score, but different races, the recidivism rates are effectively the same.

The COMPAS algorithm is proprietary and not public. We know it includes 137 features, and deliberately excludes race. However, another study showed that a logistic regression with only 7 of those features was equally accurate! There is also some discussion (admittedly the text is rather raw) in the lecture notes, ch. 12.2.3.

Note: Links are optional but very helpful readings for this problem set!

### Dataset
The dataset you will be working with is based on ProPublica's dataset, compiled from public records in Florida. However, it has been cleaned up for simplicity. You will only use a subset of the variables in the dataset for this exercise:

age - Age in years

c_charge_degree - Classifier for an individual's crime: F for felony, M for misdemeanor

race - Classifier for the recorded race of each individual in this dataset. We will mainly consider Caucasian and African-American individuals here.

age_cat - Classifies individuals as under 25, between 25 and 45, and older than 45

sex - “Male” or “Female”.

priors_count - Numeric, the number of previous crimes the individual has committed.

decile_score - COMPAS classification of each individual's risk of recidivism (1 = low . . . 10 = high). This is the score computed by the proprietary model.

two_year_recid - Binary variable, 1 if the individual recidivated within 2 years, 0 otherwise. This is the central outcome variable for our purpose.

Note that we limit the analysis to the two-year period following the first crime.

Your tasks are the following:

1. (2pt) Load the COMPAS data, and perform the basic sanity checks.

In [1]:
# 3:45
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [6]:
compas = pd.read_csv("./data/compas-score-data.csv.bz2", sep="\t")

In [7]:
compas.head()

Unnamed: 0,age,c_charge_degree,race,age_cat,sex,priors_count,decile_score,two_year_recid
0,69,F,Other,Greater than 45,Male,0,1,0
1,34,F,African-American,25 - 45,Male,0,3,1
2,24,F,African-American,Less than 25,Male,4,4,1
3,44,M,Other,25 - 45,Male,0,1,0
4,41,F,Caucasian,25 - 45,Male,14,6,1


In [8]:
compas.shape

(6172, 8)

In [9]:
compas.isna().sum()

age                0
c_charge_degree    0
race               0
age_cat            0
sex                0
priors_count       0
decile_score       0
two_year_recid     0
dtype: int64

There are 6172 rows, 8 columns, and no missing data.

2. (2pt) Filter the data to keep only Caucasians and African-Americans. There are just too few offenders of other races.

In [11]:
compas2 = compas[(compas['race'] == "Caucasian") | 
                 (compas['race'] == "African-American")]
compas2.head()

Unnamed: 0,age,c_charge_degree,race,age_cat,sex,priors_count,decile_score,two_year_recid
1,34,F,African-American,25 - 45,Male,0,3,1
2,24,F,African-American,Less than 25,Male,4,4,1
4,41,F,Caucasian,25 - 45,Male,14,6,1
6,39,M,Caucasian,25 - 45,Female,0,1,0
7,27,F,Caucasian,25 - 45,Male,0,4,0


3. (2pt) Create a new dummy variable based on the COMPAS risk score (decile_score), which indicates if an individual was classified as low risk (score 1-4) or high risk (score 5-10).

In [12]:
compas2['dec_dummy'] = pd.cut(compas2.decile_score,
                              bins=[-np.inf, 5, np.inf],
                              labels=['low_risk', 'high_risk'],
                              right=False)  # [-inf, 5) = low, [5, inf) = high
compas2.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  compas2['dec_dummy'] = pd.cut(compas.decile_score,


Unnamed: 0,age,c_charge_degree,race,age_cat,sex,priors_count,decile_score,two_year_recid,dec_dummy
1,34,F,African-American,25 - 45,Male,0,3,1,low_risk
2,24,F,African-American,Less than 25,Male,4,4,1,low_risk
4,41,F,Caucasian,25 - 45,Male,14,6,1,high_risk
6,39,M,Caucasian,25 - 45,Female,0,1,0,low_risk
7,27,F,Caucasian,25 - 45,Male,0,4,0,low_risk
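The `SettingWithCopyWarning` shown above appears because `compas2` was created by filtering `compas`, so pandas cannot tell whether the new column is meant to end up in the original frame as well. Below is a minimal sketch of one common way to avoid it, assuming a small made-up frame (`toy`, `subset` and their values are hypothetical, not the COMPAS data): take an explicit `.copy()` after filtering and only then add the column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the COMPAS frame
toy = pd.DataFrame({
    "race": ["Other", "African-American", "Caucasian", "Caucasian"],
    "decile_score": [1, 3, 6, 5],
})

# .copy() makes the filtered frame independent of `toy`,
# so adding a column to it is unambiguous
subset = toy[toy["race"].isin(["Caucasian", "African-American"])].copy()

# Scores 1-4 -> low_risk, 5-10 -> high_risk (same rule as in the cell above)
subset["dec_dummy"] = np.where(subset["decile_score"] >= 5,
                               "high_risk", "low_risk")
print(subset)
```

The original frame `toy` is left untouched; only `subset` gains the new column.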


4. (6pt) Now analyze the offenders across this new risk category:

    (a) What is the recidivism rate (percentage of offenders who re-commit the crime) for low-risk and high-risk individuals?

    (b) What are the recidivism rates for African-Americans and Caucasians?

In [14]:
# (a)
l_risk = compas2[compas2['dec_dummy'] == "low_risk"]
h_risk = compas2[compas2['dec_dummy'] == "high_risk"]
l_risk.two_year_recid.mean(), h_risk.two_year_recid.mean()

(0.3200145296040683, 0.6344554455445545)

For low-risk individuals, the recidivism rate is 32%. For high-risk individuals, it is 63%.

In [15]:
# (b)
aa_indiv = compas2[compas2['race'] == "African-American"]
white_indiv = compas2[compas2['race'] == "Caucasian"]
aa_indiv.two_year_recid.mean(), white_indiv.two_year_recid.mean()

(0.5231496062992126, 0.3908701854493581)

For African-Americans, the recidivism rate is 52%. For Caucasian individuals, it is 39%.
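Because `two_year_recid` is coded 0/1, a group's recidivism rate is simply the mean of that column within the group. The sketch below shows the same calculation with `groupby` instead of one filtered frame per group; the frame `toy` and its values are made up for illustration only.

```python
import pandas as pd

# Hypothetical toy data; the real analysis would use compas2
toy = pd.DataFrame({
    "race": ["Caucasian", "Caucasian",
             "African-American", "African-American"],
    "two_year_recid": [0, 1, 1, 1],
})

# The mean of a 0/1 outcome within each group is that group's rate
rates = toy.groupby("race")["two_year_recid"].mean()
print(rates)
```

The same pattern should work with `dec_dummy` as the grouping column for part (a).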

5. (10 pt) Now create a confusion matrix comparing COMPAS predictions for recidivism (low risk/high risk) and the actual two-year recidivism and interpret the results. In order to be on the same page, let's mark recidivists as positives.

    Note: you do not have to predict anything here. COMPAS has made the prediction for you, this is the variable you created in 3 based on decile_score. See the referred articles about the controversy around COMPAS methodology.

    Note 2: Do not just output a confusion matrix with accompanying text like “accuracy = x%, precision = y%”. Interpret your results such as “z% of recidivists were falsely classified as low-risk, COMPAS accurately classified N% of individuals, etc.”

In [16]:
from sklearn.metrics import confusion_matrix

In [17]:
compas2['dum_num'] = pd.cut(compas2.decile_score,
                           bins=[-np.inf, 5, np.inf],
                           labels=[0, 1], right=False)
mtx_decile = confusion_matrix(compas2.two_year_recid, 
                              compas2.dum_num)
mtx_decile

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  compas2['dum_num'] = pd.cut(compas2.decile_score,


array([[1872,  923],
       [ 881, 1602]])

In [18]:
# Accuracy
# sklearn lays the matrix out as [[tn, fp], [fn, tp]] (rows = actual, columns = predicted)
tn, fp, fn, tp = mtx_decile.ravel()
t = tp + tn + fp + fn
print('Accuracy:', (tn + tp)/t)

Accuracy: 0.6582038651004168


In [19]:
# Precision
print('Precision:', tp / (tp + fp))

Precision: 0.6344554455445545


In [20]:
# Recall
print('Recall:', tp / (tp + fn))

Recall: 0.6451872734595248


In [22]:
from sklearn.metrics import f1_score
print('F-Score:', f1_score(compas2.two_year_recid, compas2.dum_num))

F-Score: 0.639776357827476


The accuracy tells me that COMPAS classified 66% of individuals correctly. The precision tells me that 63% of the individuals classified as high-risk actually recidivated within two years. The recall tells me that about 65% of actual recidivists were classified as high-risk, which means that the remaining 35% of recidivists were falsely classified as low-risk.
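The four counts in the confusion matrix above are enough to derive every rate discussed here, including the false positive and false negative rates. The following is a small sketch in plain Python; the helper name `rates_from_counts` is made up for this illustration, and the counts are copied from the matrix output above.

```python
def rates_from_counts(tn, fp, fn, tp):
    """Return common confusion-matrix summaries as fractions."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,   # all correct classifications
        "precision": tp / (tp + fp),     # high-risk labels that were right
        "recall": tp / (tp + fn),        # recidivists labelled high-risk
        "fpr": fp / (fp + tn),           # non-recidivists labelled high-risk
        "fnr": fn / (fn + tp),           # recidivists labelled low-risk
    }

# Counts copied from the confusion matrix printed above
overall = rates_from_counts(tn=1872, fp=923, fn=881, tp=1602)
for name, value in overall.items():
    print(f"{name}: {value:.3f}")
```

Note that recall and the false negative rate always sum to 1, since both are fractions of the actual recidivists.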

6. (12pt) Note the accuracy of the COMPAS classification, and also how its errors were distributed. Would you feel comfortable having a judge use COMPAS to inform sentencing guidelines? At what point would the error/misclassification risk be acceptable for you?

    Remember: human judges are not perfect either!

Since the accuracy of the model was only 66%, I wouldn't feel comfortable having a judge use COMPAS to inform sentencing guidelines. It means that only 2/3 of cases were accurately predicted. Although this seems like a good number, according to the data there were 923 false positive cases, meaning this many people were falsely predicted to recidivate within 2 years. Personally, I would find the misclassification risk acceptable only if accuracy were 85% or higher, because this ultimately decides the fate of real people. Their lives shouldn't be taken lightly.

7. (14pt) Now repeat your confusion matrix calculation and analysis from 5. But this time do it separately for African-Americans and for Caucasians:

    (a) How accurate is the COMPAS classification for African-American individuals? For Caucasians?
    
    (b) What are the false positive rates (false recidivism rates) FPR = FP/N = FP/(FP + TN)?
    
    (c) The false negative rates (false no-recidivism rates) FNR = FN/P = FN/(FN + TP)?

In [27]:
from sklearn.metrics import precision_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score

In [24]:
# African-Americans
aa_compas = compas2[compas2['race'] == "African-American"]
mtx_aa = confusion_matrix(aa_compas.two_year_recid, aa_compas.dum_num)
mtx_aa

array([[ 873,  641],
       [ 473, 1188]])

In [28]:
fp_aa = mtx_aa[0, 1]
tn_aa = mtx_aa[0, 0]
fn_aa = mtx_aa[1, 0]
tp_aa = mtx_aa[1, 1]
print("African-Americans")
print("Accuracy:", accuracy_score(aa_compas.two_year_recid,
                                 aa_compas.dum_num))
print("Precision:", precision_score(aa_compas.two_year_recid,
                                 aa_compas.dum_num))
print("False Positives:", (fp_aa / (fp_aa + tn_aa)))
print("False Negatives:", (fn_aa / (fn_aa + tp_aa)))

African-Americans
Accuracy: 0.6491338582677165
Precision: 0.6495352651722253
False Positives: 0.4233817701453104
False Negatives: 0.2847682119205298


In [25]:
# Whites
white_compas = compas2[compas2['race'] == "Caucasian"]
mtx_white = confusion_matrix(white_compas.two_year_recid, 
                             white_compas.dum_num)
mtx_white

array([[999, 282],
       [408, 414]])

In [30]:
fp_white = mtx_white[0, 1]
tn_white = mtx_white[0, 0]
fn_white = mtx_white[1, 0]
tp_white = mtx_white[1, 1]
print("Caucasians")
print("Accuracy:", accuracy_score(white_compas.two_year_recid,
                                 white_compas.dum_num))
print("Precision:", precision_score(white_compas.two_year_recid,
                                 white_compas.dum_num))
print("False Positives:", (fp_white / (fp_white + tn_white)))
print("False Negatives:", (fn_white / (fn_white + tp_white)))

Caucasians
Accuracy: 0.6718972895863052
Precision: 0.5948275862068966
False Positives: 0.22014051522248243
False Negatives: 0.49635036496350365


(a) The COMPAS classification is about 65% accurate for African-American individuals and 67% accurate for Caucasians.

(b) The false positive rate is 42% for African-Americans and 22% for Caucasians.

(c) The false negative rate is 28% for African-Americans and nearly 50% for Caucasians.
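The per-group rates can be recomputed directly from the two confusion matrices printed above. The sketch below does that and also reports precision (PPV) for each group, since the disagreement mentioned in the background is partly about whether fairness means equal error rates or equal meaning of a given score. The tuple layout and the names `groups` and `summary` are choices made for this illustration.

```python
# Confusion-matrix counts copied from the outputs above,
# laid out as (tn, fp, fn, tp) with recidivists as positives
groups = {
    "African-American": (873, 641, 473, 1188),
    "Caucasian": (999, 282, 408, 414),
}

summary = {}
for group, (tn, fp, fn, tp) in groups.items():
    summary[group] = {
        "fpr": fp / (fp + tn),  # non-recidivists flagged high risk
        "fnr": fn / (fn + tp),  # recidivists flagged low risk
        "ppv": tp / (tp + fp),  # high-risk labels that were correct
    }
    print(group, {k: round(v, 3) for k, v in summary[group].items()})
```

In these counts the error rates differ far more between the groups than the precision does, which is one way to see why the two fairness definitions lead to different conclusions.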

8. (12pt) If you have done this correctly, you will find that COMPAS’s true negative and true positive percentages are fairly similar for African-American and Caucasian individuals, but that false positive rates and false negative rates are different. Look again at the overall recidivism rates in the dataset for Black and White individuals. In your opinion, is the COMPAS algorithm fair? Justify your answer.

    Hint: This is not a trick question. If you read the first two recommended readings, you will find that people disagree on how to define fairness. Your answer will not be graded on which side you take, but on your justification.
    
I personally do not think that the COMPAS algorithm is fair because its errors are distributed unequally. The false positive rate for Caucasians is roughly half that for African-Americans (22% vs. 42%), while the false negative rate for African-Americans is roughly half that for Caucasians (28% vs. nearly 50%). This means that judges who follow the algorithm would be more forgiving toward Caucasians and harsher toward African-Americans. I believe that the false negative rate for Caucasians is particularly high, at almost 50%. Following this algorithm is dangerous to the public and dangerous for the freedoms of African-Americans.

## 2 Can you beat COMPAS? (40pt)
The COMPAS model has created quite a bit of controversy. One issue frequently brought up is that it is “closed source”, i.e. its inner workings are available neither to the public nor to the judges who are actually making the decisions. But is it a big problem? Maybe you can devise as good a model as COMPAS to predict recidivism? Maybe you can do even better? Let’s try!

We proceed as follows:

• Note that you should not use the variable score_text, which originates from the COMPAS model. Do you see why?

• First we devise a model that explicitly does not include gender and race. Your task is to use cross-validation to develop the best model you can do based on the available variables.

• Thereafter we add gender and see if gender improves the model performance.

• And finally we also add race and see if race has an additional explanatory effect, i.e. does race help to improve the performance of the model.


More detailed tasks are here:

1. (8pt) Before we start: what do you think, what is an appropriate model performance measure here? A, P, R, F or something else? Maybe you want to report multiple measures? Explain!

I believe that the new COMPAS model should prioritize precision, because precision is the measure to use when false positives are worse than false negatives. A false positive in the COMPAS model means that an individual was falsely predicted to recidivate. Among the other measures, I would rank recall as the second most important because it reflects false negatives, i.e. people who were inaccurately predicted not to recidivate.

2. (6pt) Now it is time to do the modeling. Create a logistic regression model that contains all explanatory variables you have in the data. (Some of these you have to convert to dummies.) Do not include the variables discussed above, and do not include race and gender in this model either, to avoid explicit gender/racial bias.

    Use 10-fold CV to compute its relevant performance measure(s) you discussed above.

In [32]:
compas2['charge_dum'] = np.where(compas2.c_charge_degree == 'F', 1, 0)
compas2.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  compas2['charge_dum'] = np.where(compas2.c_charge_degree == 'F', 1, 0)


Unnamed: 0,age,c_charge_degree,race,age_cat,sex,priors_count,decile_score,two_year_recid,dec_dummy,dum_num,charge_dum
1,34,F,African-American,25 - 45,Male,0,3,1,low_risk,0,1
2,24,F,African-American,Less than 25,Male,4,4,1,low_risk,0,1
4,41,F,Caucasian,25 - 45,Male,14,6,1,high_risk,1,1
6,39,M,Caucasian,25 - 45,Female,0,1,0,low_risk,0,0
7,27,F,Caucasian,25 - 45,Male,0,4,0,low_risk,0,1


In [37]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [36]:
X = compas2[['charge_dum', 'age', 'decile_score', 'priors_count']]
y = compas2.two_year_recid.values
m = LogisticRegression().fit(X, y)

In [38]:
cval_accuracy = cross_val_score(m, X, y, scoring="accuracy", cv=10)
cval_precision = cross_val_score(m, X, y, scoring="precision", cv=10)
cval_accuracy.mean(), cval_precision.mean()

(0.679228120867115, 0.6800350481668023)
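When several performance measures are of interest, scikit-learn's `cross_validate` accepts a list of metric names and returns one array of fold scores per metric, so the model is fitted once per fold rather than once per metric. A minimal sketch on synthetic stand-in data follows; `X_demo`, `y_demo` and the data-generating rule are assumptions for illustration, not the COMPAS data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data: two features, binary outcome driven by the first
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 2))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

metrics = ["accuracy", "precision", "recall", "f1"]

# One call evaluates all listed metrics on the same 10 folds
scores = cross_validate(LogisticRegression(), X_demo, y_demo,
                        scoring=metrics, cv=10)
for metric in metrics:
    print(metric, scores[f"test_{metric}"].mean())
```

The estimator passed in does not need to be fitted beforehand, because cross-validation refits a fresh clone on each training fold.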

3. (6pt) Experiment with different models to find the best model according to your performance indicator. (Include/exclude different variables; you may also do feature engineering, e.g. create different age groups, include variables like age^2, age^3, interaction effects, etc. But do not include race and gender.)

    Report what you tried (but no need to report the full results of all unsuccessful models you tried), and your best model’s performance. Is it better or worse than for the COMPAS model?

In [39]:
compas2['dec_sq5'] = compas2['decile_score']**5
compas2['priors_sq'] = compas2['priors_count']**2
compas2['dec_priors'] = compas2['decile_score'] * compas2['priors_count']
compas2.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  compas2['dec_sq5'] = compas2['decile_score']**5
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  compas2['priors_sq'] = compas2['priors_count']**2
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  compas2['dec_priors'] = compas2['decile_score'] * compas2['priors_count']


Unnamed: 0,age,c_charge_degree,race,age_cat,sex,priors_count,decile_score,two_year_recid,dec_dummy,dum_num,charge_dum,dec_sq5,priors_sq,dec_priors
1,34,F,African-American,25 - 45,Male,0,3,1,low_risk,0,1,243,0,0
2,24,F,African-American,Less than 25,Male,4,4,1,low_risk,0,1,1024,16,16
4,41,F,Caucasian,25 - 45,Male,14,6,1,high_risk,1,1,7776,196,84
6,39,M,Caucasian,25 - 45,Female,0,1,0,low_risk,0,0,1,0,0
7,27,F,Caucasian,25 - 45,Male,0,4,0,low_risk,0,1,1024,0,0


In [40]:
X = compas2[['charge_dum', 'age', 'dec_sq5', 'priors_sq', 'dec_priors']]
m = LogisticRegression().fit(X, y)

In [41]:
cval_accuracy = cross_val_score(m, X, y, scoring="accuracy", cv=10)
cval_precision = cross_val_score(m, X, y, scoring="precision", cv=10)
cval_accuracy.mean(), cval_precision.mean()

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


(0.6670979242136739, 0.738844693355416)

I tried transforming the age variable, but I got worse scores than before, so I dropped those transformations and kept age as it was. I then experimented with the decile_score variable and noticed that the precision increases as the power increases, which is why it is raised to the 5th power. This gave a precision score of around 73%, which is 5 points higher than before. I also added the square of priors_count to get better precision. Finally, I added the product of decile_score and priors_count and got a precision score that was nearly 6 points higher than the baseline (74%). I'm not sure why I get a big red warning message here.
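The red message is a convergence warning from the solver, most likely triggered because a fifth power of a 1-10 score ranges up to 100000 while the other features stay small, so the optimizer runs out of its default 100 iterations. The warning text itself suggests scaling the data or raising `max_iter`. Below is a minimal sketch of both, using a pipeline so that scaling is learned on each training fold only; the data are synthetic stand-ins and the names `score`, `X_demo`, `y_demo` and `model` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: a 1-10 score and its fifth power (like dec_sq5)
rng = np.random.default_rng(0)
score = rng.integers(1, 11, size=600)
X_demo = np.column_stack([score, score.astype(float) ** 5])
y_demo = (score + rng.normal(scale=3, size=600) > 5.5).astype(int)

# StandardScaler brings both columns to mean 0 and unit variance;
# max_iter is raised as an extra safety margin
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

precision = cross_val_score(model, X_demo, y_demo,
                            scoring="precision", cv=10)
print(precision.mean())
```

Scaling changes the coefficients, so cross-validated scores after scaling may differ somewhat from those reported with the warning present.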

4. (4pt) Now add sex to the model. Does it help to improve the performance?

In [42]:
compas2['sex_dum'] = np.where(compas2.sex == 'Male', 1, 0)
X = compas2[['charge_dum', 'age', 'dec_sq5', 'priors_sq', 'dec_priors',
            'sex_dum']]
m = LogisticRegression().fit(X, y)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  compas2['sex_dum'] = np.where(compas2.sex == 'Male', 1, 0)


In [43]:
cval_accuracy = cross_val_score(m, X, y, scoring="accuracy", cv=10)
cval_precision = cross_val_score(m, X, y, scoring="precision", cv=10)
cval_accuracy.mean(), cval_precision.mean()

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


(0.6693742452992929, 0.7357323877194704)

It slightly improved accuracy, by 0.2 percentage points, but precision, which is the most important measure here, got worse by 0.3 percentage points.

5. (4pt) And finally add race. Does the model improve?

In [44]:
compas2['race_dum'] = np.where(compas2.race == 'African-American', 1, 0)
X = compas2[['charge_dum', 'age', 'dec_sq5', 'priors_sq', 'dec_priors',
            'sex_dum', 'race_dum']]
m = LogisticRegression().fit(X, y)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  compas2['race_dum'] = np.where(compas2.race == 'African-American', 1, 0)


In [45]:
cval_accuracy = cross_val_score(m, X, y, scoring="accuracy", cv=10)
cval_precision = cross_val_score(m, X, y, scoring="precision", cv=10)
cval_accuracy.mean(), cval_precision.mean()

(0.6706964090621585, 0.7300937768079361)

It improved accuracy slightly (from 66.9% to 67.1%), but precision dropped further (from 73.6% to 73.0%).

6. (12pt) Discuss the results. Did you manage to be as good as COMPAS? Did you create a better model? Do gender and race help to improve your predictions? What should judges do when having access to such models? Should they use such models?

My model managed to be at least as good as COMPAS: it had higher precision and accuracy values, meaning that it performs better. When adding gender, the accuracy got slightly better, and it got better again when adding race. However, after adding those variables, the precision score got worse and worse. When judges have access to such models, they should be conscious of the bias that is embedded in these models and remember that no model is perfect, despite how technologically advanced it seems. Judges should still rely on their own judicial judgement to decide the fate of criminals because no model can replace this process.

*I spent 5