# COMPAS Dataset

The COMPAS dataset comes with predictions (the COMPAS risk scores), so we do not have to train a model. Here we clean the data and convert it into a form suitable for FairVis.

The data is from ProPublica: https://github.com/propublica/compas-analysis

In [2]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv("data/compas/compas-scores-two-years.csv")

Select the features we want to focus on.

In [4]:
cols = ["age", 
        "c_charge_degree", 
        "race",  
        "sex", 
        "priors_count", 
        "days_b_screening_arrest", 
        "decile_score", 
        "is_recid", 
        "two_year_recid", 
        "c_jail_in", 
        "c_jail_out"]
df = df[cols]

Process and clean the data as done in the ProPublica analysis.

In [5]:
# keep rows where the COMPAS screening fell within 30 days of the arrest (ProPublica methodology)
# .copy() avoids a SettingWithCopyWarning when new columns are assigned later
df = df.loc[df['days_b_screening_arrest'].between(-30, 30)].copy()

# length of the jail stay in whole days
jail_length = pd.to_datetime(df['c_jail_out']) - pd.to_datetime(df['c_jail_in'])
df['jail_length_days'] = jail_length.dt.days

df = df.drop(columns=['c_jail_in', 'c_jail_out'])

# make charge degree more legible 
df["c_charge_degree"] = df['c_charge_degree'].map({'F': 'Felony', 'M': 'Misdemeanor'})

# model output: 1 if the decile score is 5 or higher (higher risk), else 0
higher_risk = df['decile_score'] >= 5
df['out'] = higher_risk.astype(int)
df = df.drop(columns='decile_score')

# ground-truth label: 1 if either recidivism flag is set
df['class'] = (df['two_year_recid'] | df['is_recid'])

df = df.drop(columns=['is_recid', 'two_year_recid'])
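
The cleaning steps above can be checked on a few hand-made rows. The sketch below is only an illustration: the `toy` frame and its values are invented, not taken from the ProPublica file, and it reuses the same column names so each derived column can be verified by eye.

```python
import pandas as pd

# Toy rows standing in for the real CSV (all values are made up for illustration).
toy = pd.DataFrame({
    "days_b_screening_arrest": [-1.0, 45.0, 0.0, None],
    "decile_score": [3, 9, 5, 7],
    "is_recid": [0, 1, 1, 0],
    "two_year_recid": [0, 1, 0, 0],
    "c_jail_in": ["2013-08-13 06:03:42", "2013-01-26 03:45:27",
                  "2013-04-13 04:58:34", "2013-03-01 00:00:00"],
    "c_jail_out": ["2013-08-14 05:41:20", "2013-02-05 05:36:53",
                   "2013-04-14 07:02:04", "2013-03-02 00:00:00"],
})

# Keep rows screened within 30 days of arrest; missing values fail the test and are dropped.
kept = toy[toy["days_b_screening_arrest"].between(-30, 30)].copy()

# Jail stay in whole days (the days component of the timedelta).
kept["jail_length_days"] = (
    pd.to_datetime(kept["c_jail_out"]) - pd.to_datetime(kept["c_jail_in"])
).dt.days

# Binary model output and ground-truth label, as in the cell above.
kept["out"] = (kept["decile_score"] >= 5).astype(int)
kept["class"] = kept["two_year_recid"] | kept["is_recid"]

result = kept[["jail_length_days", "out", "class"]]
print(result)
```

Two details are worth noticing: rows with a missing `days_b_screening_arrest` are removed by the filter, and a stay shorter than 24 hours counts as 0 jail days because only the whole-day component of the difference is kept.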

In [6]:
df.to_csv('processed/compas_out.csv', index=False)
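
Once the file is written, the `out` and `class` columns can be compared directly as a quick sanity check. The sketch below assumes a frame with `race`, `out`, and `class` columns and computes a per-group false positive rate; the rows and the group labels "A" and "B" are hypothetical placeholders, not values from the real data.

```python
import pandas as pd

# Hypothetical rows with the same column names as the processed file.
toy = pd.DataFrame({
    "race": ["A", "A", "A", "B", "B", "B"],
    "out": [1, 0, 1, 0, 0, 1],     # model output (1 = flagged as higher risk)
    "class": [0, 0, 1, 0, 1, 1],   # ground truth (1 = recidivated)
})

# False positive rate per group: share of true negatives that were flagged.
negatives = toy[toy["class"] == 0]
fpr = negatives.groupby("race")["out"].mean()
print(fpr)
```

On the real data, `toy` would be replaced by `pd.read_csv('processed/compas_out.csv')`.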