# COMPAS analysis

We recreate the first section of the [ProPublica COMPAS analysis](https://github.com/propublica/compas-analysis) in Python.

What follows are the calculations performed for ProPublica's analysis of the COMPAS Recidivism Risk Scores. It may help to open [the methodology](https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm/) in another tab while reading.



## Loading the Data

We select fields for severity of charge, number of priors, demographics, age, sex, COMPAS scores, and whether each person was accused of a crime within two years.



In [20]:
import pandas as pd
import datetime

In [21]:
raw_data = pd.read_csv('./compas-scores-two-years.csv')
print('Num rows: %d' %len(raw_data))

Num rows: 7214


However, not all of the rows are usable for the first round of analysis.

There are a number of reasons to remove rows because of missing data:

 - If the charge date of a defendant's COMPAS-scored crime was not within 30 days of when the person was arrested, we assume that, for data quality reasons, we do not have the right offense.
 - We coded the recidivist flag -- is_recid -- to be -1 if we could not find a COMPAS case at all.
 - In a similar vein, ordinary traffic offenses -- those with a c_charge_degree of 'O' -- will not result in jail time and are removed (only two of them).
 - We filtered the underlying data from Broward County to include only those rows representing people who had either recidivated within two years or had at least two years outside of a correctional facility.
 - We remove rows where there is no score_text ('N/A').

In [24]:
df = raw_data[((raw_data['days_b_screening_arrest'] <=30) & 
      (raw_data['days_b_screening_arrest'] >= -30) &
      (raw_data['is_recid'] != -1) &
      (raw_data['c_charge_degree'] != 'O') & 
      (raw_data['score_text'] != 'N/A')
     )]

print('Num rows filtered: %d' % len(df))

Num rows filtered: 6172


In [22]:
raw_data.head()

Unnamed: 0,id,name,first,last,compas_screening_date,sex,dob,age,age_cat,race,...,v_decile_score,v_score_text,v_screening_date,in_custody,out_custody,priors_count.1,start,end,event,two_year_recid
0,1,miguel hernandez,miguel,hernandez,2013-08-14,Male,1947-04-18,69,Greater than 45,Other,...,1,Low,2013-08-14,2014-07-07,2014-07-14,0,0,327,0,0
1,3,kevon dixon,kevon,dixon,2013-01-27,Male,1982-01-22,34,25 - 45,African-American,...,1,Low,2013-01-27,2013-01-26,2013-02-05,0,9,159,1,1
2,4,ed philo,ed,philo,2013-04-14,Male,1991-05-14,24,Less than 25,African-American,...,3,Low,2013-04-14,2013-06-16,2013-06-16,4,0,63,0,1
3,5,marcu brown,marcu,brown,2013-01-13,Male,1993-01-21,23,Less than 25,African-American,...,6,Medium,2013-01-13,,,1,0,1174,0,0
4,6,bouthy pierrelouis,bouthy,pierrelouis,2013-03-26,Male,1973-01-22,43,25 - 45,Other,...,1,Low,2013-03-26,,,2,0,1102,0,0


In [3]:
# TODO: implement the filtering explained above
# Hint: you should end up with 6172

Higher COMPAS scores are slightly correlated with a longer length of stay.



In [4]:
import numpy as np
from datetime import datetime
from scipy.stats import pearsonr

In [5]:
def date_from_str(s):
    return datetime.strptime(s, '%Y-%m-%d %H:%M:%S')

In [6]:
# TODO: find the Pearson correlation between length of stay (jail_out - jail_in) and COMPAS decile
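One possible way to approach this TODO is sketched below. It assumes the jail entry and exit timestamps live in columns named `c_jail_in` and `c_jail_out` (an assumption, since the `head()` output above elides the middle columns) and that they use the format parsed by `date_from_str`. The `toy` frame is made-up illustrative data, not rows from the COMPAS file, and `length_of_stay_correlation` is a hypothetical helper name.

```python
from datetime import datetime

import pandas as pd
from scipy.stats import pearsonr


def date_from_str(s):
    return datetime.strptime(s, '%Y-%m-%d %H:%M:%S')


def length_of_stay_correlation(frame):
    """Pearson r between whole days in jail and the COMPAS decile score."""
    # Rows without both timestamps cannot contribute a length of stay.
    valid = frame.dropna(subset=['c_jail_in', 'c_jail_out'])
    stay = (valid['c_jail_out'].apply(date_from_str)
            - valid['c_jail_in'].apply(date_from_str))
    days = stay.dt.days
    r, p_value = pearsonr(days, valid['decile_score'])
    return r, p_value


# Made-up rows for illustration only (assumed column names).
toy = pd.DataFrame({
    'c_jail_in': ['2013-01-01 08:00:00', '2013-02-01 08:00:00',
                  '2013-03-01 08:00:00', None],
    'c_jail_out': ['2013-01-02 08:00:00', '2013-02-04 08:00:00',
                   '2013-03-11 08:00:00', None],
    'decile_score': [1, 4, 9, 5],
})
r, p_value = length_of_stay_correlation(toy)
print('Pearson r on toy data: %.3f' % r)
```

On the real data the same call would be `length_of_stay_correlation(df)`, provided the column names match.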

After filtering we have the following demographic breakdown:

In [7]:
# TODO: find counts for each age group
# TODO: find counts and percentages for each race group
# TODO: find counts of each score_text group
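A minimal sketch of the counting step, using `value_counts` on a single column. The helper name `count_and_percent` and the `toy` frame are illustrative assumptions; only the column names `race`, `age_cat`, and `score_text` come from the data shown above.

```python
import pandas as pd


def count_and_percent(frame, column):
    """Return the count and percentage share of each value in one column."""
    counts = frame[column].value_counts()
    percent = (counts * 100 / len(frame)).round(2)
    return pd.DataFrame({'count': counts, 'percent': percent})


# Made-up rows for illustration only.
toy = pd.DataFrame({
    'race': ['African-American', 'African-American', 'Caucasian', 'Other'],
    'age_cat': ['25 - 45', 'Less than 25', '25 - 45', 'Greater than 45'],
})
race_summary = count_and_percent(toy, 'race')
print(race_summary)
```

The same helper should work unchanged for `age_cat`, `score_text`, or `sex` on the filtered `df`.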

In [8]:
# TODO: create interaction table of counts between race/sex interactions
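One way to build the interaction table is `pd.crosstab`, which counts rows for every combination of two columns. The sketch below uses made-up rows; on the filtered data the call would be `pd.crosstab(df['sex'], df['race'])`.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    'sex': ['Male', 'Male', 'Female', 'Male', 'Female'],
    'race': ['African-American', 'Caucasian', 'Caucasian',
             'African-American', 'Other'],
})

# Rows are sexes, columns are races, cells are counts.
sex_by_race = pd.crosstab(toy['sex'], toy['race'])
print(sex_by_race)
```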

In [9]:
# TODO: find counts and percentages for each gender group

In [10]:
# TODO: How many defendants recidivated within two years? What percentage of all defendants is that?
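Because `two_year_recid` is a 0/1 flag, its sum is the number of recidivists and its share of the rows is the recidivism rate. A small sketch, with a hypothetical helper name and made-up rows:

```python
import pandas as pd


def recidivism_summary(frame):
    """Return (number of recidivists, percentage of all rows)."""
    n_recid = int(frame['two_year_recid'].sum())
    percent = 100 * n_recid / len(frame)
    return n_recid, percent


# Made-up rows for illustration only.
toy = pd.DataFrame({'two_year_recid': [1, 0, 0, 1, 0]})
n_recid, percent = recidivism_summary(toy)
print('%d defendants (%.2f%%)' % (n_recid, percent))
```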

Judges are often presented with two sets of scores from the COMPAS system -- one that classifies people into High, Medium and Low risk, and a corresponding decile score. For white defendants, the number of defendants falls steadily as the decile score increases.



In [11]:
%matplotlib inline

from matplotlib import pyplot as plt

In [12]:
# TODO: plot decile scores by count for African-American defendants

In [13]:
# TODO: plot decile scores by count for Caucasian defendants
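The two plotting TODOs can share one helper. The sketch below draws a bar chart of decile-score counts for one race value; `plot_decile_counts` is a hypothetical name, the `toy` frame is made-up data, and the labels are only a suggestion.

```python
import pandas as pd
from matplotlib import pyplot as plt


def plot_decile_counts(frame, race, ax=None):
    """Bar chart of how many defendants of one race fall in each decile."""
    if ax is None:
        _, ax = plt.subplots()
    counts = (frame.loc[frame['race'] == race, 'decile_score']
              .value_counts()
              .reindex(range(1, 11), fill_value=0))  # keep empty deciles
    ax.bar(counts.index, counts.values)
    ax.set_xticks(range(1, 11))
    ax.set_xlabel('Decile score')
    ax.set_ylabel('Count')
    ax.set_title("%s defendants' decile scores" % race)
    return ax


# Made-up rows for illustration only.
toy = pd.DataFrame({
    'race': ['African-American'] * 3 + ['Caucasian'] * 3,
    'decile_score': [2, 2, 9, 1, 1, 3],
})
fig, axes = plt.subplots(1, 2, sharey=True, figsize=(10, 4))
plot_decile_counts(toy, 'African-American', ax=axes[0])
plot_decile_counts(toy, 'Caucasian', ax=axes[1])
```

Sharing the y-axis makes the two distributions directly comparable; pass `df` instead of `toy` to plot the filtered data.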

In [14]:
# TODO: Create a pivot table between decile_score and race
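A table of counts with one row per decile and one column per race can be built by grouping on both columns and unstacking. This is one possible approach; the helper name and `toy` data are illustrative assumptions.

```python
import pandas as pd


def decile_by_race(frame):
    """Counts of defendants per decile_score (rows) and race (columns)."""
    return (frame.groupby(['decile_score', 'race'])
                 .size()
                 .unstack(fill_value=0))


# Made-up rows for illustration only.
toy = pd.DataFrame({
    'race': ['African-American'] * 3 + ['Caucasian'] * 3,
    'decile_score': [2, 2, 9, 1, 1, 3],
})
table = decile_by_race(toy)
print(table)
```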

## Racial Bias in Compas

After filtering out bad rows, our first question is whether there is a significant difference in Compas scores between races. To do so we need to change some variables into factors, and run a logistic regression, comparing low scores to high scores.


In [15]:
from sklearn.linear_model import LogisticRegression

In [16]:
lr = LogisticRegression(solver='lbfgs')

In [17]:
# TODO: create columns for all options for charge_degree, age, race, sex, and 

In [18]:
# TODO: Run these input features to predict whether score_text is Low or Med/High using a Logistic Regression

In [19]:
# TODO: What is the regression coefficient for African_American? What is the odds ratio?
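The last three TODOs can be sketched together: one-hot encode the categorical columns, drop one reference level per column, fit the regression on Low versus Medium/High, and exponentiate the coefficients to get odds ratios. Everything named here beyond `score_text`, `race`, and `LogisticRegression` is an assumption for illustration: the helper name, the `=` separator in column names, and the choice of Caucasian as the reference level. Note also that scikit-learn applies an L2 penalty by default, so the sketch sets a large `C` to weaken it; coefficients may still differ slightly from an unpenalized fit.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def fit_score_model(frame, feature_cols, reference_levels):
    """Fit Low vs. Medium/High on one-hot features; return coefs and odds ratios."""
    # One indicator column per category, named like 'race=Caucasian'.
    X = pd.get_dummies(frame[feature_cols], prefix_sep='=')
    # Drop the reference level of each factor so coefficients are relative to it.
    X = X.drop(columns=['%s=%s' % (col, level)
                        for col, level in reference_levels.items()])
    y = (frame['score_text'] != 'Low').astype(int)
    # Large C weakens scikit-learn's default L2 penalty (an assumption of this sketch).
    model = LogisticRegression(solver='lbfgs', C=1e6, max_iter=1000)
    model.fit(X.astype(float), y)
    coefs = pd.Series(model.coef_[0], index=X.columns)
    return coefs, np.exp(coefs)


# Made-up rows for illustration only: 7 of 10 vs. 3 of 10 scored Medium/High.
toy = pd.DataFrame({
    'race': ['African-American'] * 10 + ['Caucasian'] * 10,
    'score_text': (['High'] * 4 + ['Medium'] * 3 + ['Low'] * 3
                   + ['Medium'] * 3 + ['Low'] * 7),
})
coefs, odds_ratios = fit_score_model(toy, ['race'], {'race': 'Caucasian'})
print(coefs)
print(odds_ratios)
```

On the filtered data, `feature_cols` would likely include `c_charge_degree`, `age_cat`, `race`, and `sex`, each with its own reference level.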