# Evaluating COMPAS Criminal Recidivism Prediction Model

---

Maxwell Jayne

&emsp;&emsp;The Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) algorithm is used to assess defendants for recidivism risk: the risk that a defendant will commit another crime. COMPAS is a commercial risk assessment tool developed and owned by Equivant (formerly Northpointe). It has been used in US courts and in pretrial detention decisions, with a reported accuracy of 65% in recidivism prediction. COMPAS or similar risk assessment models are used in 46 states across the US, and many, like COMPAS, rely on proprietary algorithms, meaning the general public cannot learn how these recidivism risk scores are calculated. The COMPAS questionnaire itself is made up of 137 questions, none of which asks about race.

&emsp;&emsp;In 2016, ProPublica, a non-profit online news organization based in New York, published an investigation of COMPAS and found the algorithm to be racially biased, writing, "Black defendants were also twice as likely as white defendants to be misclassified as being a higher risk of violent recidivism. And white violent recidivists were 63% more likely to have been misclassified as a low risk of violent recidivism, compared with black violent recidivists" (Larson, et al.). Additionally, the investigation found that of the defendants COMPAS predicted would commit violent crimes, only 20% actually went on to do so. Equivant responded with a statement critiquing the ProPublica investigation and disagreeing with its claims of racial bias. The investigation was also criticized by a criminal justice think tank, Community Resources for Justice, which stated that the results of ProPublica's investigation contradicted existing studies concluding that risk can be calculated without racial or gender bias (Larson, et al.; Angwin, et al.).

&emsp;&emsp;This study examines COMPAS scoring data using a dataset compiled by ProPublica from Florida public records, and evaluates the model in terms of statistical accuracy and racial bias, specifically its treatment of Caucasian and African American individuals. An initial analysis is conducted first, and confusion matrices are constructed for comparison. Similar prediction models are then built and compared to the COMPAS algorithm's results.

### Data Preparation & Initial Analysis

&emsp;&emsp;We begin by importing Python packages for analysis, loading a simplified COMPAS recidivism scoring dataset, and running preliminary checks to make sure there are no missing values. A few sample rows of the dataset are displayed below. The variables in the dataset are: <br>
* **age** - Age in years
* **c_charge_degree** - Type of crime committed: F for felony, M for misdemeanor
* **race** - Recorded race of individual
* **age_cat** - Classifies individuals as under 25, 25-45, or older than 45
* **sex** - Recorded sex of individual
* **priors_count** - The number of previous crimes the individual has committed
* **decile_score** - COMPAS's classification of an individual's risk of recidivism (1 = low . . . 10 = high). This is the score computed by the proprietary model.
* **two_year_recid** - Whether or not the individual recidivated (committed another crime): 1 if yes, 0 if no

&emsp;&emsp;Note: For this study, we only consider recidivism within two years of the initial offense.

In [94]:
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
import warnings
warnings.filterwarnings('ignore')

In [92]:
compas = pd.read_csv("compas-score-data.csv.bz2", sep="\t")
print("Dataset Dimensions: ", compas.shape)
print("Missing Values: ", compas.isna().sum().sum())
compas.sample(3)

Dataset Dimensions:  (6172, 8)
Missing Values:  0


Unnamed: 0,age,c_charge_degree,race,age_cat,sex,priors_count,decile_score,two_year_recid
2424,40,F,Other,25 - 45,Male,8,7,1
5905,37,M,African-American,25 - 45,Male,4,4,1
1125,43,F,Caucasian,25 - 45,Male,2,1,0


&emsp;&emsp;Below we filter out rows with values other than 'African-American' or 'Caucasian' in the race column and create a secondary score measure with a binary encoding. This second score tells us whether an individual's score from the proprietary COMPAS model indicates they are at high risk of recidivating (a decile score greater than or equal to 5) or at low risk (a decile score less than 5). The new column, called 'high_score', contains a 1 if the individual is at high risk of recidivating (score >= 5), or a 0 if the individual is low risk (score < 5).

In [52]:
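# Keep only the two groups compared in this study; .copy() avoids
# pandas' SettingWithCopyWarning when new columns are added below.
compas = compas[(compas.race == "African-American") | (compas.race == "Caucasian")].copy()
print("Dataset Dimensions: ", compas.shape)
# Binary encoding of the decile score: 1 = high risk (>= 5), 0 = low risk (< 5)
compas.loc[compas["decile_score"] < 5, 'high_score'] = 0
compas.loc[compas["decile_score"] >= 5, 'high_score'] = 1
compas.sample(3)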

Unnamed: 0,age,c_charge_degree,race,age_cat,sex,priors_count,decile_score,two_year_recid,high_score
3281,29,M,Caucasian,25 - 45,Male,0,2,0,0.0
4554,69,F,African-American,Greater than 45,Male,1,1,1,0.0
6041,38,M,Caucasian,25 - 45,Male,1,1,1,0.0


&emsp;&emsp;Next we analyze recidivism rates between low and high scoring individuals according to the COMPAS algorithm, and between African American and Caucasian offenders. In this dataset, the recidivism rate is 32% for low scoring individuals and 63% for high scoring individuals. For African American individuals the recidivism rate is 52%, and for Caucasian individuals it is 39%.

In [67]:
print("Recidivism rates for individuals with high and low COMPAS scores:")
print()
print(compas.groupby("high_score").two_year_recid.mean())
print()
print("Recidivism rates for African American and Caucasian individuals:")
print()
print(compas.groupby("race").two_year_recid.mean())
print()
print("Rate of high COMPAS scores (score >= 5) for African American and Caucasian individuals:")
print()
print(compas.groupby("race").high_score.mean())

Recidivism rates for individuals with high and low COMPAS scores:

high_score
0.0    0.320015
1.0    0.634455
Name: two_year_recid, dtype: float64

Recidivism rates for African American and Caucasian individuals:

race
African-American    0.52315
Caucasian           0.39087
Name: two_year_recid, dtype: float64

Rate of high COMPAS scores (score >= 5) for African American and Caucasian individuals:

race
African-American    0.576063
Caucasian           0.330956
Name: high_score, dtype: float64


&emsp;&emsp;Of the individuals the model assigns a high score (a COMPAS score of 5 or higher), only 63% actually re-offend within two years. This is in line with the reported accuracy of 65%, though it is a narrower measure, since it only considers individuals with COMPAS scores of 5 or higher. From the metrics above, we can also see that African Americans are more likely to receive a high COMPAS score, at a rate of 58% compared to 33% for Caucasians. The high score rate is higher than the actual recidivism rate for African Americans (52%) and lower than the actual rate for Caucasians (39%).

### Confusion Matrix Analysis

&emsp;&emsp;Below we construct a confusion matrix to compare COMPAS predictions with actual recidivism cases. Note that positive indicates recidivism, negative indicates no re-offense within two years, true indicates the COMPAS algorithm predicted correctly, and false indicates the algorithm's prediction was wrong.

In [16]:
tp = compas[(compas.two_year_recid == 1) & (compas.high_score == 1)]
tp = len(tp)
print("True positive (tp) - ", tp)
tn = compas[(compas.two_year_recid == 0) & (compas.high_score == 0)]
tn = len(tn)
print("True negative (tn) - ", tn)
fp = compas[(compas.two_year_recid == 0) & (compas.high_score == 1)]
fp = len(fp)
print("False positive (fp) - ", fp)
fn = compas[(compas.two_year_recid == 1) & (compas.high_score == 0)]
fn = len(fn)
print("False negative (fn) - ", fn)

True positive (tp) -  1602
True negative (tn) -  1872
False positive (fp) -  923
False negative (fn) -  881


| COMPAS | Predicted Negative | Predicted Positive |
| -------- | ------------: | ------:|
| **Actual Negative** | 1872 | 923 |
| **Actual Positive** | 881 | 1602 |

&emsp;&emsp;Our confusion matrix shows that, of 5,278 cases, the COMPAS model correctly identifies repeat offenders in 1,602 cases and non-repeat offenders in 1,872 cases, giving an accuracy of 65.82%, close to the reported accuracy. However, the model misidentifies individuals in a large share of the cases. COMPAS falsely classified 923 individuals as high risk, or about 17.5% of the sample. COMPAS also misidentified 881 individuals as low risk, or about 16.7% of the sample.
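
&emsp;&emsp;As an aside, the four counts discussed above can also be tallied in a single pass over paired labels. The following is a minimal sketch rather than the method used in this notebook; the function name `confusion_counts` and the toy labels are hypothetical and are not drawn from the COMPAS dataset.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Tally (tn, fp, fn, tp) for binary labels where 1 means positive."""
    # Each (actual, predicted) pair falls into exactly one of four cells.
    tally = Counter(zip(actual, predicted))
    tn = tally[(0, 0)]  # did not recidivate, scored low
    fp = tally[(0, 1)]  # did not recidivate, scored high
    fn = tally[(1, 0)]  # recidivated, scored low
    tp = tally[(1, 1)]  # recidivated, scored high
    return tn, fp, fn, tp

# Hypothetical toy labels, not rows from the COMPAS dataset.
toy_actual = [1, 0, 0, 1, 1, 0]
toy_predicted = [1, 0, 1, 0, 1, 0]
toy_counts = confusion_counts(toy_actual, toy_predicted)
print(toy_counts)
```

The tuple is ordered to match the row-major layout of the table above: actual negatives first, then actual positives.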

&emsp;&emsp;Below we calculate accuracy, precision, and recall for the COMPAS predictions. As noted above, COMPAS is 65.8% accurate in its predictions. From the similar precision (0.634) and recall (0.645), and the similar false positive and false negative counts, we can see the misidentifications are fairly evenly distributed, though the model is slightly more likely to produce a false positive than a false negative, at least in this sample. A false positive indicates an individual had a high COMPAS score and did not recidivate. A high score may influence a judge's decision in court, which may have consequences for an individual who will not recidivate. As our justice system follows the tenet 'innocent until proven guilty', this study places greater focus on false positives than on false negatives.

In [63]:
A = (tp + tn) / (tp + tn + fp + fn)
P = tp / (tp + fp)
R = tp / (tp + fn)
print("Accuracy - ", A)
print("Precision - ", P)
print("Recall - ", R)

Accuracy -  0.6582038651004168
Precision -  0.6344554455445545
Recall -  0.6451872734595248
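
&emsp;&emsp;For reference, the three metrics computed above follow the standard definitions below, shown with the counts from the COMPAS confusion matrix substituted in (TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives):

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{1602 + 1872}{5278} \approx 0.658
$$

$$
\text{Precision} = \frac{TP}{TP + FP} = \frac{1602}{1602 + 923} \approx 0.634
\qquad
\text{Recall} = \frac{TP}{TP + FN} = \frac{1602}{1602 + 881} \approx 0.645
$$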


&emsp;&emsp;Below we create confusion matrices for both African Americans and Caucasians:

In [18]:
compas_AA = compas[compas.race == "African-American"]
# compas_AA.shape
compas_C = compas[compas.race == "Caucasian"]
# compas_C.shape

In [65]:
print("African American")
print()
tp_AA = compas_AA[(compas_AA.two_year_recid == 1) & (compas_AA.high_score == 1)]
tp_AA = len(tp_AA)
print("True positive - ", tp_AA)
tn_AA = compas_AA[(compas_AA.two_year_recid == 0) & (compas_AA.high_score == 0)]
tn_AA = len(tn_AA)
print("True negative - ", tn_AA)
fp_AA = compas_AA[(compas_AA.two_year_recid == 0) & (compas_AA.high_score == 1)]
fp_AA = len(fp_AA)
print("False positive - ", fp_AA)
fn_AA = compas_AA[(compas_AA.two_year_recid == 1) & (compas_AA.high_score == 0)]
fn_AA = len(fn_AA)
print("False negative - ", fn_AA)

print()
print("Caucasian")
print()
tp_C = compas_C[(compas_C.two_year_recid == 1) & (compas_C.high_score == 1)]
tp_C = len(tp_C)
print("True positive - ", tp_C)
tn_C = compas_C[(compas_C.two_year_recid == 0) & (compas_C.high_score == 0)]
tn_C = len(tn_C)
print("True negative - ", tn_C)
fp_C = compas_C[(compas_C.two_year_recid == 0) & (compas_C.high_score == 1)]
fp_C = len(fp_C)
print("False positive - ", fp_C)
fn_C = compas_C[(compas_C.two_year_recid == 1) & (compas_C.high_score == 0)]
fn_C = len(fn_C)
print("False negative - ", fn_C)

African American

True positive -  1188
True negative -  873
False positive -  641
False negative -  473

Caucasian

True positive -  414
True negative -  999
False positive -  282
False negative -  408


| African American | Predicted Negative | Predicted Positive | | Caucasian | Predicted Negative | Predicted Positive |
| -------- | ------------: | ------:| | -------- | ------------: | ------:|
| **Actual Negative** | 873 | 641 | | **Actual Negative** | 999 | 282 |
| **Actual Positive** | 473 | 1188 | | **Actual Positive** | 408 | 414 |

In [68]:
print("African American")
print()
A_AA = (tp_AA + tn_AA) / (tp_AA + tn_AA + fp_AA + fn_AA)
FPR_AA = fp_AA / (tn_AA + fp_AA)
FNR_AA = fn_AA / (tp_AA + fn_AA)
print("Accuracy - ", A_AA)
print("False positive rate - ", FPR_AA)
print("False negative rate - ", FNR_AA)
print()

print("Caucasian")
print()
A_C = (tp_C + tn_C) / (tp_C + tn_C + fp_C + fn_C)
FPR_C = fp_C / (tn_C + fp_C)
FNR_C = fn_C / (tp_C + fn_C)
print("Accuracy - ", A_C)
print("False positive rate - ", FPR_C)
print("False negative rate - ", FNR_C)

African American

Accuracy -  0.6491338582677165
False positive rate -  0.4233817701453104
False negative rate -  0.2847682119205298

Caucasian

Accuracy -  0.6718972895863052
False positive rate -  0.22014051522248243
False negative rate -  0.49635036496350365


&emsp;&emsp;From our calculations above, we can see that the model is more accurate for Caucasians than for African Americans, though only by 2.3 percentage points. More importantly, the false positive rate is far higher for African Americans (42.3% versus 22.0%) and the false negative rate is much lower (28.5% versus 49.6%). This indicates that an African American individual is far more likely than a Caucasian individual to be falsely classified as high risk, and far less likely to be falsely classified as low risk. This is consistent with ProPublica's findings and indicates the COMPAS algorithm may be unfair in its scoring with regard to race.
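
&emsp;&emsp;One way to summarize the gap is as a ratio of error rates between the two groups. The sketch below recomputes the rates from the per-group counts printed earlier; the helper `error_rates` and the ratio variables are illustrative names introduced here, not part of the original analysis.

```python
# Counts copied from the per-group confusion matrices reported above.
groups = {
    "African-American": {"tp": 1188, "tn": 873, "fp": 641, "fn": 473},
    "Caucasian": {"tp": 414, "tn": 999, "fp": 282, "fn": 408},
}

def error_rates(counts):
    """Return (false positive rate, false negative rate) for one group."""
    fpr = counts["fp"] / (counts["fp"] + counts["tn"])
    fnr = counts["fn"] / (counts["fn"] + counts["tp"])
    return fpr, fnr

fpr_aa, fnr_aa = error_rates(groups["African-American"])
fpr_c, fnr_c = error_rates(groups["Caucasian"])

# How many times more often each group experiences each kind of error.
fpr_ratio = fpr_aa / fpr_c  # false positives: African American vs. Caucasian
fnr_ratio = fnr_c / fnr_aa  # false negatives: Caucasian vs. African American
print(f"FPR ratio (AA / C): {fpr_ratio:.2f}")
print(f"FNR ratio (C / AA): {fnr_ratio:.2f}")
```

In this sample the false positive rate for African American individuals is a little under twice the rate for Caucasian individuals, which is in the same direction as the "twice as likely" figure ProPublica reported for violent recidivism scores.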

### Comparing Alternate Prediction Models

&emsp;&emsp;Below we construct alternative machine learning models to compare against the COMPAS algorithm. This study's prediction models ignore race and sex entirely, and attempt to reach accuracy similar to COMPAS using the other features available in the dataset. When evaluating these models, the most appropriate metrics to track are accuracy, a good indicator of a model's overall ability to predict correctly, and precision, which is sensitive to false positives. Because a positive corresponds to a high risk classification, and a false positive may have harmful consequences for an individual going to trial, precision is a focus. Recall is also calculated for additional insight.

&emsp;&emsp;First, two new binary variables are created: one indicating whether or not an individual has any prior charges (**priors2**, 1 if there are any priors and 0 otherwise), and the other indicating whether or not an individual is younger than 30 (**age2**, 0 if younger than 30 and 1 otherwise). Three feature sets are then constructed. The feature sets are:
* \[ age, c_charge_degree, age_cat, priors_count \]
* \[ age, c_charge_degree, age2, priors2 \]
* \[ age, c_charge_degree, age_cat, priors_count, priors2 \]

&emsp;&emsp;Three models are then constructed: a logistic regression model, a decision tree model with a maximum depth of 100, and a nearest neighbors model considering 70 neighbors. The logistic regression model is fit to the first feature set, and the decision tree and nearest neighbors models are fit to all three feature sets. Using 10-fold cross validation, accuracy, precision, and recall are calculated and listed below.

In [80]:
compas.loc[compas["priors_count"] == 0, 'priors2'] = 0 
compas.loc[compas["priors_count"] > 0, 'priors2'] = 1
compas.loc[compas["age"] < 30, 'age2'] = 0
compas.loc[compas["age"] >= 30, 'age2'] = 1

X = pd.get_dummies(
    compas[["age", "c_charge_degree", "age_cat", "priors_count"]],
    columns = ["c_charge_degree", "age_cat"]).values

X2 = pd.get_dummies(
    compas[["age", "c_charge_degree", "age2", "priors2"]],
    columns = ["c_charge_degree", "age2", "priors2"]).values

X3 = pd.get_dummies(
    compas[["age", "c_charge_degree", "age_cat", "priors_count", "priors2"]],
    columns = ["c_charge_degree", "age_cat", "priors2"]).values

y = compas.two_year_recid.values

m1 = LogisticRegression()
m2 = DecisionTreeClassifier(max_depth=100)
m3 = KNeighborsClassifier(70)
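
&emsp;&emsp;To illustrate what a nearest neighbors classifier does, the sketch below implements the basic idea in plain Python: find the k training points closest to a query and take a majority vote of their labels. It assumes Euclidean distance and equal-weight votes, which match scikit-learn's defaults for `KNeighborsClassifier`, but it is a simplified illustration and not the library's implementation. The function `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Predict a label for `query` by majority vote among its k nearest training points."""
    # Rank training rows by Euclidean distance to the query point.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    # Count the labels of the k closest rows and return the most common one.
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (age, priors_count) pairs with made-up outcomes.
toy_X = [(22, 5), (24, 3), (23, 6), (50, 0), (47, 1), (55, 0)]
toy_y = [1, 1, 1, 0, 0, 0]

young_many_priors = knn_predict(toy_X, toy_y, (25, 4), k=3)
older_no_priors = knn_predict(toy_X, toy_y, (52, 0), k=3)
print(young_many_priors, older_no_priors)
```

With 70 neighbors, as in `m3`, each prediction is effectively the majority outcome among the 70 most similar individuals in the training data.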

In [95]:
# Pre:   This function takes a model object to fit the data, a feature list or array, and a target list or array (y)
# Post:  The function will perform 10 fold cross validation and return a dict with key-mapped arrays of scoring metrics
def cross_val(model, features, y):
    scores = cross_validate(model, features, y,
                scoring=["accuracy", "recall", "precision"],
                cv=10)
    return scores
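
&emsp;&emsp;For intuition about `cv=10`, the sketch below shows how plain k-fold splitting partitions the 5,278 rows into ten held-out folds. This is a simplification: for a classifier with an integer `cv`, scikit-learn's `cross_validate` uses stratified folds that also preserve the class balance in each fold, which this sketch does not attempt. The generator `kfold_indices` is a hypothetical helper written for illustration.

```python
def kfold_indices(n_samples, n_splits=10):
    """Yield (train_idx, test_idx) lists for plain, unshuffled k-fold splitting."""
    # The first (n_samples % n_splits) folds receive one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        # Every row outside the held-out fold is used for training.
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# 5278 is the number of rows left after filtering to the two race groups.
splits = list(kfold_indices(5278, 10))
fold_sizes = [len(test_idx) for _, test_idx in splits]
print(fold_sizes)
```

Each row is held out exactly once, so every reported score is an average over ten models that never saw the rows they were scored on.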

In [97]:
# Pre:   This function takes the dict of score arrays returned by
#        sklearn.model_selection.cross_validate
# Post:  The function prints the mean test accuracy, precision, and recall across folds
def results(cv_scores):
    print("Accuracy - ", np.mean(cv_scores['test_accuracy']))
    print("Precision - ", np.mean(cv_scores['test_precision']))
    print("Recall - ", np.mean(cv_scores['test_recall']))
    print()

In [83]:
cv1 = cross_val(m1, X, y)
cv2 = cross_val(m2, X, y)
cv3 = cross_val(m3, X, y)
cv4 = cross_val(m2, X2, y)
cv5 = cross_val(m3, X2, y)
cv6 = cross_val(m2, X3, y)
cv7 = cross_val(m3, X3, y)

In [85]:
print("Logistic Regression - Feature Set 1")
results(cv1)
print("Decision Tree Classifier - Feature Set 1")
results(cv2)
print("Nearest Neighbors Classifier - Feature set 1")
results(cv3)
print("Decision Tree Classifier - Feature Set 2")
results(cv4)
print("Nearest Neighbors Classifier - Feature set 2")
results(cv5)
print("Decision Tree Classifier - Feature Set 3")
results(cv6)
print("Nearest Neighbors Classifier - Feature set 3")
results(cv7)

Logistic Regression - Feature Set 1
Accuracy -  0.6703291213846242
Precision -  0.6604111653377751
Recall -  0.6161630392537893

Decision Tree Classifier - Feature Set 1
Accuracy -  0.6464536254384452
Precision -  0.6484584297536495
Recall -  0.5424747376603186

Nearest Neighbors Classifier - Feature set 1
Accuracy -  0.6788561612328216
Precision -  0.6864363192147438
Recall -  0.5843600207280736

Decision Tree Classifier - Feature Set 2
Accuracy -  0.6305423782416193
Precision -  0.6057220974960358
Recall -  0.6133647493198602

Nearest Neighbors Classifier - Feature set 2
Accuracy -  0.6333879592892877
Precision -  0.6131658607606252
Recall -  0.5960454722114263

Decision Tree Classifier - Feature Set 3
Accuracy -  0.6462642314990512
Precision -  0.6484861189078318
Recall -  0.5416699054281642

Nearest Neighbors Classifier - Feature set 3
Accuracy -  0.6773377752860675
Precision -  0.6776957166066594
Recall -  0.5988550978105972



&emsp;&emsp;Based solely upon the accuracy, precision, and recall of the models built above, a nearest neighbors classifier that does not consider race or gender (using 70 nearest neighbors and the features age, age category, charge degree, and prior charges) does a better job of predicting than the COMPAS algorithm, at least on this data sample. The nearest neighbors model predicts with an accuracy of 67.89% compared to COMPAS's 65.82%. Precision for the nearest neighbors model is 0.6864, better than COMPAS's precision of 0.6345. This indicates that the nearest neighbors model is less likely to falsely identify someone as a high risk individual. On the other hand, the nearest neighbors model has a lower recall (0.5844) than COMPAS (0.6452), meaning the nearest neighbors model is more likely to falsely identify someone as a low risk individual. The nearest neighbors model suggests that risk assessment software can achieve similar if not better accuracy without considering features that can be affected by race or gender (apart from charge degree, which may reflect earlier bias in court). This is supported by the rest of the prediction models, which all score accuracies of 63.05% or greater, close to the reported accuracy of COMPAS (65%). The higher precision and lower recall of the nearest neighbors model also align more closely with our justice system's value of innocent until proven guilty.

### Confusion Matrix Comparison

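&emsp;&emsp;Using the nearest neighbors classifier with 70 neighbors discussed above, recidivism outcomes are predicted for every individual in the dataset. Note that the model is fit and evaluated on the same rows here, so these counts are in-sample rather than cross-validated. Confusion matrices and false positive/negative rates for both the nearest neighbors model and COMPAS scoring are displayed below: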

In [98]:
m3.fit(X, y)
yhat = m3.predict(X)
cm = confusion_matrix(y, yhat)
print(cm)

[[2107  688]
 [ 979 1504]]


| Nearest Neighbors | Predicted Negative | Predicted Positive | | COMPAS | Predicted Negative | Predicted Positive |
| -------- | ------------: | ------:| | -------- | ------------: | ------:|
| **Actual Negative** | 2107 | 688 | | **Actual Negative** | 1872 | 923 |
| **Actual Positive** | 979 | 1504 | | **Actual Positive** | 881 | 1602 |

In [99]:
print("Nearest Neighbors Model")
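# confusion_matrix returns [[tn, fp], [fn, tp]] for labels 0 and 1
nn_tn, nn_fp, nn_fn, nn_tp = cm.ravel()
print("False Positive Rate - ", nn_fp / (nn_fp + nn_tn))
print("False Negative Rate - ", nn_fn / (nn_fn + nn_tp))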
print()
print("COMPAS Algorithm")
print("False Positive Rate - ", fp / (tn + fp))
print("False Negative Rate - ", fn / (tp + fn))

Nearest Neighbors Model
False Positive Rate -  0.24615384615384617
False Negative Rate -  0.39428111155859846

COMPAS Algorithm
False Positive Rate -  0.3302325581395349
False Negative Rate -  0.35481272654047524


&emsp;&emsp;From our results above we can see the nearest neighbors model predicts more negative (low risk) values than the COMPAS algorithm. The nearest neighbors model correctly identifies more low risk individuals than the COMPAS algorithm, and misidentifies fewer individuals who do not recidivate within two years as high risk for re-offending. The nearest neighbors model is less accurate at predicting high risk individuals who do recidivate, and misidentifies more individuals who do recidivate as low risk. Overall, the nearest neighbors model aligns better than the COMPAS algorithm with the tenet that an individual is innocent until proven guilty.
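
&emsp;&emsp;Because `m3.fit(X, y)` and `m3.predict(X)` use the same rows, the nearest neighbors confusion matrix above is an in-sample result and can be somewhat optimistic. As a rough check, the sketch below compares the accuracy implied by that matrix with the 10-fold cross-validated accuracy reported earlier for the same model and feature set. All numbers are copied from the outputs above; the variable names are introduced here for illustration.

```python
# Counts copied from the nearest neighbors confusion matrix printed above.
nn_tn, nn_fp, nn_fn, nn_tp = 2107, 688, 979, 1504
in_sample_accuracy = (nn_tp + nn_tn) / (nn_tn + nn_fp + nn_fn + nn_tp)

# Mean 10-fold accuracy reported earlier for nearest neighbors on feature set 1.
cv_accuracy = 0.6788561612328216

accuracy_gap = in_sample_accuracy - cv_accuracy
print(f"In-sample accuracy:       {in_sample_accuracy:.4f}")
print(f"Cross-validated accuracy: {cv_accuracy:.4f}")
print(f"Gap:                      {accuracy_gap:.4f}")
```

The gap is small (well under one percentage point), which suggests the comparison with COMPAS above is not driven by evaluating the model on its own training data.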

### Conclusion

&emsp;&emsp;After examining the COMPAS model, there is evidence that its predictions are racially biased. This disparity is especially noticeable in the difference in false positive and false negative rates between African Americans and Caucasians, where an African American individual is more likely to be given a high risk score and not recidivate, and a Caucasian individual is more likely to be given a low risk score and recidivate. While the developers may not have intended this disparity, there is statistical evidence of a difference in treatment with regard to race. Additionally, this study found that similar prediction algorithms can predict with similar, if not better, accuracy than the COMPAS algorithm on the sample data while not considering factors that can be affected by race or gender. With these findings in mind, it can be concluded that race and gender are not necessary factors to consider in commercial risk assessment tools, and that they likely cause more damage to African American and potentially other minority individuals when considered.

&emsp;&emsp;Judges with access to these models need to use them carefully, as they are not perfect indicators of a person's risk of re-offending. While a judge may consider such a model and take insight from it, they should not use it as a primary measure of a person's guilt or innocence, and should not let the model create preconceived notions about an individual. Models like this heavily oversimplify the variables at play in any individual's circumstances, and do little, if anything, to actually help the people represented in the dataset.

### Bibliography

Angwin, Julia, et al. “Machine Bias.” ProPublica, 23 May 2016, www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing. 

Larson, Jeff, et al. “How We Analyzed the Compas Recidivism Algorithm.” ProPublica, 23 May 2016, www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm/. 