# COMPAS Analysis

What follows is the preprocessing performed for ProPublica's analysis of the COMPAS Recidivism Risk Scores.

## Loading the Data

We select fields for severity of charge, number of priors, demographics, age, sex, COMPAS scores, and whether each person was accused of a crime within two years.

In [8]:
import pandas as pd
raw_data = pd.read_csv("./compas-scores-two-years.csv")
len(raw_data)

7214

However, not all of the rows are usable for the first round of analysis.

There are a number of reasons to remove rows because of missing data:
* If the charge date of a defendant's COMPAS-scored crime was not within 30 days of the person's arrest, we assume that, for data quality reasons, we do not have the right offense.
* We coded the recidivist flag -- `is_recid` -- to be -1 if we could not find a COMPAS case at all.
* In a similar vein, ordinary traffic offenses -- those with a `c_charge_degree` of 'O' -- will not result in jail time and are removed (only two of them).
* We filtered the underlying data from Broward County to include only those rows representing people who had either recidivated within two years or had at least two years outside of a correctional facility.

In [9]:
# Keep only the columns used in the analysis.
df = raw_data[[
    'age', 'c_charge_degree', 'race', 'age_cat', 'score_text', 'sex',
    'priors_count', 'days_b_screening_arrest', 'decile_score', 'is_recid',
    'two_year_recid', 'c_jail_in', 'c_jail_out',
]]
# Apply the row filters described above.
df = df.loc[
    (df['days_b_screening_arrest'] <= 30)
    & (df['days_b_screening_arrest'] >= -30)
    & (df['is_recid'] != -1)
    & (df['c_charge_degree'] != 'O')
    & (df['score_text'] != 'N/A')
]
# is_recid was only needed for filtering.
df = df.drop(columns=['is_recid'])
len(df)

6172
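
To see how much each exclusion rule contributes, the conditions can be evaluated one at a time. The following is a minimal sketch on a hypothetical six-row frame (the `toy` data and the rule labels are made up for illustration and are not taken from the Broward County file); it counts the rows each rule would drop on its own and then combines the rules as the cell above does.

```python
import pandas as pd

# Hypothetical toy rows; the notebook itself reads compas-scores-two-years.csv.
# Built in memory, so 'N/A' stays a literal string here.
toy = pd.DataFrame({
    "days_b_screening_arrest": [-1.0, 45.0, 0.0, None, 10.0, -31.0],
    "is_recid": [0, 1, -1, 0, 1, 0],
    "c_charge_degree": ["F", "M", "F", "M", "O", "F"],
    "score_text": ["Low", "High", "Medium", "Low", "Low", "N/A"],
})

# One boolean mask per exclusion rule (True means the row is kept).
# between() is inclusive on both ends and evaluates to False for missing values.
keep = {
    "screening within 30 days": toy["days_b_screening_arrest"].between(-30, 30),
    "COMPAS case found": toy["is_recid"] != -1,
    "not an ordinary traffic offense": toy["c_charge_degree"] != "O",
    "score available": toy["score_text"] != "N/A",
}

# Rows each rule would drop if it were applied alone.
dropped = {name: int((~mask).sum()) for name, mask in keep.items()}

# A row survives only if every rule keeps it (same as chaining with `&`).
combined = pd.concat(keep, axis=1).all(axis=1)
filtered = toy[combined]

print(dropped)
print(len(filtered))
```

Because the rules overlap, the per-rule counts need not add up to the total number of rows removed.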

We now prepare the data so that it can be used by the FairnessLab.

In [10]:
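# Decision: 1 if the decile score is at most 4, else 0.
df['D'] = (df['decile_score'] <= 4).astype(int)
# Invert and rescale the decile score so that higher values mean lower predicted risk.
inverted_scores = 1 - df['decile_score'] / 10
df['scores'] = inverted_scores
# Outcome: 1 if the person did not recidivate within two years, else 0.
df['Y'] = (df['two_year_recid'] == 0).astype(int)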
df

index,age,c_charge_degree,race,age_cat,score_text,sex,priors_count,days_b_screening_arrest,decile_score,two_year_recid,c_jail_in,c_jail_out,D,scores,Y
0,69,F,Other,Greater than 45,Low,Male,0,-1.0,1,0,2013-08-13 06:03:42,2013-08-14 05:41:20,1,0.9,1
1,34,F,African-American,25 - 45,Low,Male,0,-1.0,3,1,2013-01-26 03:45:27,2013-02-05 05:36:53,1,0.7,0
2,24,F,African-American,Less than 25,Low,Male,4,-1.0,4,1,2013-04-13 04:58:34,2013-04-14 07:02:04,1,0.6,0
5,44,M,Other,25 - 45,Low,Male,0,0.0,1,0,2013-11-30 04:50:18,2013-12-01 12:28:56,1,0.9,1
6,41,F,Caucasian,25 - 45,Medium,Male,14,-1.0,6,1,2014-02-18 05:08:24,2014-02-24 12:18:30,0,0.4,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7209,23,F,African-American,Less than 25,Medium,Male,0,-1.0,7,0,2013-11-22 05:18:27,2013-11-24 02:59:20,0,0.3,1
7210,23,F,African-American,Less than 25,Low,Male,0,-1.0,3,0,2014-01-31 07:13:54,2014-02-02 04:03:52,1,0.7,1
7211,57,F,Other,Greater than 45,Low,Male,0,-1.0,1,0,2014-01-13 05:48:01,2014-01-14 07:49:46,1,0.9,1
7212,33,M,African-American,25 - 45,Low,Female,3,-1.0,2,0,2014-03-08 08:06:02,2014-03-09 12:18:04,1,0.8,1
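
The three derived columns all use 1 for the favourable case: `D` is 1 when the decile score is at most 4, `scores` is the decile score inverted and rescaled so that larger values mean lower predicted risk, and `Y` is 1 when `two_year_recid` is 0. As a small worked example, here is the same encoding applied to four hypothetical rows (the `toy` frame is invented for illustration):

```python
import pandas as pd

# Hypothetical rows covering both sides of the decile threshold.
toy = pd.DataFrame({
    "decile_score": [1, 4, 5, 10],
    "two_year_recid": [0, 1, 1, 0],
})

# Same encoding as in the cell above.
toy["D"] = (toy["decile_score"] <= 4).astype(int)    # 1 = low-risk decision
toy["scores"] = 1 - toy["decile_score"] / 10         # higher = lower predicted risk
toy["Y"] = (toy["two_year_recid"] == 0).astype(int)  # 1 = no recidivism in two years

print(toy)
```

With this encoding, a decile score of 4 still yields `D = 1`, while a score of 5 yields `D = 0`; a decile score of 10 maps to a `scores` value of 0.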


In [11]:
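# Binary encodings: 1 for 'Male' and 1 for 'Caucasian', 0 otherwise.
df['sex-binary'] = (df['sex'] == 'Male').astype(int)
df['sensitive-attribute'] = (df['race'] == 'Caucasian').astype(int)
# Keep only the two racial groups being compared.
df = df.loc[df['race'].isin(['Caucasian', 'African-American'])]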
df

index,age,c_charge_degree,race,age_cat,score_text,sex,priors_count,days_b_screening_arrest,decile_score,two_year_recid,c_jail_in,c_jail_out,D,scores,Y,sex-binary,sensitive-attribute
1,34,F,African-American,25 - 45,Low,Male,0,-1.0,3,1,2013-01-26 03:45:27,2013-02-05 05:36:53,1,0.7,0,1,0
2,24,F,African-American,Less than 25,Low,Male,4,-1.0,4,1,2013-04-13 04:58:34,2013-04-14 07:02:04,1,0.6,0,1,0
6,41,F,Caucasian,25 - 45,Medium,Male,14,-1.0,6,1,2014-02-18 05:08:24,2014-02-24 12:18:30,0,0.4,0,1,1
8,39,M,Caucasian,25 - 45,Low,Female,0,-1.0,1,0,2014-03-15 05:35:34,2014-03-18 04:28:46,1,0.9,1,0,1
10,27,F,Caucasian,25 - 45,Low,Male,0,-1.0,4,0,2013-11-25 06:31:06,2013-11-26 08:26:57,1,0.6,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7207,30,M,African-American,25 - 45,Low,Male,0,-1.0,2,1,2014-05-09 10:01:33,2014-05-10 08:28:12,1,0.8,0,1,0
7208,20,F,African-American,Less than 25,High,Male,0,-1.0,9,0,2013-10-19 11:17:15,2013-10-20 08:13:06,0,0.1,1,1,0
7209,23,F,African-American,Less than 25,Medium,Male,0,-1.0,7,0,2013-11-22 05:18:27,2013-11-24 02:59:20,0,0.3,1,1,0
7210,23,F,African-American,Less than 25,Low,Male,0,-1.0,3,0,2014-01-31 07:13:54,2014-02-02 04:03:52,1,0.7,1,1,0


In [13]:
df.to_csv('compas.csv', index=False)
df.to_json('compas.json', orient='records')

In [14]:
without_scores = df.drop(columns=['scores'])
without_scores.to_csv('compas_without_scores.csv', index=False)
without_scores.to_json('compas_without_scores.json', orient='records')
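
As a quick sanity check on the exports, the files can be read back and compared with the frame that produced them. The sketch below is only an illustration: it uses a hypothetical two-row frame and writes to a temporary directory rather than to the notebook's working directory.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the prepared DataFrame.
toy = pd.DataFrame({
    "race": ["Caucasian", "African-American"],
    "D": [1, 0],
    "scores": [0.9, 0.3],
    "Y": [1, 0],
    "sensitive-attribute": [1, 0],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "compas.csv")
    json_path = os.path.join(tmp, "compas.json")

    # Same export calls as in the notebook: no index column in the CSV,
    # one JSON object per row.
    toy.to_csv(csv_path, index=False)
    toy.to_json(json_path, orient="records")

    # Read both files back while the temporary directory still exists.
    from_csv = pd.read_csv(csv_path)
    from_json = pd.read_json(json_path, orient="records")

# Both exports should carry the same columns and row count as the source.
same_columns = sorted(from_csv.columns) == sorted(from_json.columns) == sorted(toy.columns)
same_rows = len(from_csv) == len(from_json) == len(toy)
print(same_columns, same_rows)
```

Note that `index=False` means the original row labels are not written to the CSV, so rows read back from the file are renumbered from 0.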