Import our tools

In [1]:
import pandas as pd

Load in the relevant data, parsing the date fields to a format pandas understands on the way.

In [2]:
df = pd.read_csv(
    'compas-scores.csv',
    usecols=[
        'sex',
        'age',
        'age_cat',
        'race',
        'juv_fel_count', 
        'decile_score', 
        'juv_misd_count',
        'juv_other_count', 
        'priors_count', 
        'days_b_screening_arrest',
        'c_jail_in', 
        'c_jail_out',
        'c_offense_date',
        'c_arrest_date', 
        'c_days_from_compas', 
        'c_charge_degree',
        'c_charge_desc', 
        'is_recid', 
        'num_r_cases', 
        'r_charge_degree', 
        'r_days_from_arrest', 
        'r_offense_date',
        'r_charge_desc', 
        'r_jail_in', 
        'r_jail_out', 
        'is_violent_recid',
        'num_vr_cases', 
        'vr_charge_degree', 
        'vr_offense_date',
        'vr_charge_desc', 
        #'v_decile_score',
        #'v_score_text', 
        'v_screening_date', 
        #'score_text', 
    ],
    parse_dates=[
        'c_jail_in', 
        'c_jail_out', 
        'c_offense_date', 
        'c_arrest_date', 
        'r_offense_date', 
        'r_jail_in', 
        'r_jail_out', 
        'vr_offense_date',
        'v_screening_date'
    ]
)

Transform all string data to lowercase and remove extra whitespace

In [3]:
# Lowercase and trim every string cell; leave non-string cells untouched.
df = df.applymap(lambda s: s.lower().strip() if isinstance(s, str) else s)

Transform all string columns to groups of binary columns

In [4]:
df = pd.get_dummies(df)
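
To make the effect of this step concrete, here is a minimal sketch on a made-up toy frame (the `toy` data is hypothetical, not taken from the COMPAS file). Numeric columns pass through unchanged, while every distinct value of a string column gets its own indicator column. That is presumably why free-text columns such as the charge descriptions blow the frame up to many hundreds of columns.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one string column.
toy = pd.DataFrame({
    'age': [25, 40, 31],
    'sex': ['male', 'female', 'male'],
})

# Numeric columns are kept as they are; each distinct string value
# becomes a separate indicator column named <column>_<value>.
toy_encoded = pd.get_dummies(toy)
print(list(toy_encoded.columns))
```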

Check if we have any more object columns (should be none)

In [5]:
df.dtypes.value_counts()

uint8             982
datetime64[ns]      9
int64               8
float64             5
dtype: int64

Calculate jail stay lengths

In [6]:
df['c_days_in_jail'] = (df['c_jail_out'] - df['c_jail_in']).dt.days.fillna(0).astype(int)
df['r_days_in_jail'] = (df['r_jail_out'] - df['r_jail_in']).dt.days.fillna(0).astype(int)
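
A small sketch of what this computation does on made-up dates (the `toy` frame and its column names are hypothetical). Subtracting two datetime columns gives a timedelta column, `.dt.days` keeps only the whole days, and rows where a date is missing end up as 0. One thing to keep in mind is that a missing stay then looks the same as a same-day release.

```python
import pandas as pd

# Hypothetical toy data: one complete stay and one with missing dates.
toy = pd.DataFrame({
    'jail_in': pd.to_datetime(['2013-01-01 08:00', None]),
    'jail_out': pd.to_datetime(['2013-01-03 20:00', None]),
})

# Timedelta per row; NaT wherever either date is missing.
stay = toy['jail_out'] - toy['jail_in']

# Whole days only (2 days 12 hours becomes 2), missing values become 0.
days = stay.dt.days.fillna(0).astype(int)
print(days.tolist())
```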

Drop date columns

In [7]:
df.drop([
        'c_jail_in', 
        'c_jail_out', 
        'c_offense_date', 
        'c_arrest_date', 
        'r_offense_date', 
        'r_jail_in', 
        'r_jail_out', 
        'vr_offense_date',
        'v_screening_date'
    ],
    axis='columns',
    inplace=True
)

Check dtypes again, should be no dates remaining

In [8]:
df.dtypes.value_counts()

uint8      982
int64       10
float64      5
dtype: int64

Still a few floats, see what that is all about

In [9]:
df.select_dtypes(include=['float64'])

Unnamed: 0,days_b_screening_arrest,c_days_from_compas,num_r_cases,r_days_from_arrest,num_vr_cases
0,-1.0,1.0,,,
1,,,,,
2,-1.0,1.0,,,
3,-1.0,1.0,,0.0,
4,,1.0,,,
5,,76.0,,,
6,0.0,0.0,,,
7,-1.0,1.0,,0.0,
8,-1.0,1.0,,,
9,-1.0,1.0,,,


At least two look like only NaNs, check to be sure

In [10]:
print(df['num_vr_cases'].value_counts())
print(df['num_r_cases'].value_counts())

Series([], Name: num_vr_cases, dtype: int64)
Series([], Name: num_r_cases, dtype: int64)
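
The empty results above are what we would expect for all-NaN columns, because `value_counts()` skips NaNs by default. A more direct way to find such columns, sketched here on a made-up frame (the `toy` data is hypothetical), is to ask for columns where every entry is missing:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column 'b' contains nothing but NaNs.
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, np.nan, np.nan],
})

# True for each column in which every single value is missing.
all_nan = toy.isna().all()
empty_columns = all_nan[all_nan].index.tolist()
print(empty_columns)
```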


Yeah, NaNs only. We can drop those.

In [11]:
df.drop([
        'num_r_cases',
        'num_vr_cases'
    ],
    axis='columns',
    inplace=True
)

Replace other NaNs with column means (imputation) and check the columns again

In [12]:
df.fillna(df.mean(), inplace=True)
df.select_dtypes(include=['float64'])

Unnamed: 0,days_b_screening_arrest,c_days_from_compas,r_days_from_arrest
0,-1.000000,1.000000,20.410569
1,-0.878037,63.587653,20.410569
2,-1.000000,1.000000,20.410569
3,-1.000000,1.000000,0.000000
4,-0.878037,1.000000,20.410569
5,-0.878037,76.000000,20.410569
6,0.000000,0.000000,20.410569
7,-1.000000,1.000000,0.000000
8,-1.000000,1.000000,20.410569
9,-1.000000,1.000000,20.410569
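For reference, a minimal sketch of the same mean imputation on a made-up frame (the `toy` data is hypothetical). `fillna` with a Series of column means fills each column with its own mean. Note that the means in the cell above are computed over the whole data set, so they also include rows that later end up in test folds.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    'x': [1.0, np.nan, 3.0],
    'y': [np.nan, 10.0, 20.0],
})

# toy.mean() skips NaNs; each gap is filled with the mean of its own column.
imputed = toy.fillna(toy.mean())
print(imputed)
```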


There should be no NaNs left, so we can move on to learning.

Split to target and explanatory variables

In [13]:
predicted_variable = 'decile_score'
X = df.loc[:, df.columns != predicted_variable]
y = df.loc[:, df.columns == predicted_variable]
print(X.shape, y.shape)

(11757, 994) (11757, 1)


Split to train and test sets

In [14]:
from sklearn import model_selection

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.1, random_state=42)

Train a linear regression model and check the $R^2$ score. A score of 1 is the best possible; values can go all the way down to $-\infty$. We do a 5-fold cross validation to get a better sense of the model fit.

In [15]:
from sklearn import linear_model

model = linear_model.LinearRegression(normalize=True)
scores = model_selection.cross_val_score(model, X, y, cv=5)
scores

array([ -6.83044025e+27,  -1.19054010e+28,  -5.69988528e+27,
        -9.25803996e+27,  -8.88523580e+27])
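
To put scores like these in perspective, here is a minimal pure-Python sketch of the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$ (the helper `r_squared` and the toy numbers are illustrative assumptions, not part of the notebook). Predicting the mean of the targets gives 0, and predictions that are far off give hugely negative values, which is what seems to be happening on the held-out folds here.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical toy targets.
y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(y_true, y_true)        # exact predictions
baseline = r_squared(y_true, [2.5] * 4)    # always predict the mean
wild = r_squared(y_true, [1e6] * 4)        # predictions far off the scale

print(perfect, baseline, wild)
```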

Absolute rubbish, try linear support vector regression instead

In [None]:
from sklearn import svm

model = svm.LinearSVR(epsilon=0.5)
scores = model_selection.cross_val_score(model, X, y.values.ravel(), cv=5, n_jobs=2)
scores

array([  3.01579093e-01,   2.00690892e-02,  -4.35889931e-01,
         1.29654965e-01,  -5.52683440e+01])

Marginally better, still pretty bad. Let's try support vector regression with an RBF kernel. In principle it has an infinite-dimensional feature space, so that should get to something reasonable at the cost of possibly overfitting.

In [None]:
from sklearn import svm

model = svm.SVR(epsilon=0.5)
scores = model_selection.cross_val_score(model, X, y.values.ravel(), cv=3, n_jobs=3)
scores

It's... not bad(?) considering we are using $\frac{1}{3}$ of the data for testing on each round of the cross validation. The resulting $R^2$ values indicate $R \approx 0.6$ which is by no means an abysmal degree of correlation. At the same time, we would expect a significantly higher degree of success based on what the task actually **is**.

Might be a problem with the nature of the data, more specifically not enough data for practical learning considering the size of the (unmodified) feature space. We only have roughly 10x as many data points as features.

Might want to try, say, a random forest or alternatively doing some tricks with sample generation. There might also be issues with collinearity, but that's a bit beyond me, to be quite honest.
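
As a starting point for the random forest idea, a minimal sketch using the same `cross_val_score` pattern as above. It runs on synthetic data (`X_demo` and `y_demo` are made up for illustration), so it says nothing about how well a forest would do on the COMPAS features. The hyperparameters are arbitrary assumptions.

```python
import numpy as np
from sklearn import ensemble, model_selection

# Synthetic stand-in data: the target depends on the first feature plus noise.
rng = np.random.RandomState(42)
X_demo = rng.rand(200, 5)
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

# Same evaluation pattern as the other models: 5-fold CV, R^2 per fold.
forest = ensemble.RandomForestRegressor(n_estimators=50, random_state=42)
rf_scores = model_selection.cross_val_score(forest, X_demo, y_demo, cv=5)
print(rf_scores)
```

On the real data this would amount to swapping in `X` and `y.values.ravel()` for the synthetic arrays.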