Import our tools

In [1]:
import pandas as pd

Load in the relevant data, parsing the date fields to a format pandas understands on the way.

In [2]:
df = pd.read_csv(
    'compas-scores.csv',
    usecols=[
        'sex',
        'age',
        #'age_cat',
        'race',
        'juv_fel_count', 
        'decile_score', 
        'juv_misd_count',
        'juv_other_count', 
        'priors_count', 
        'days_b_screening_arrest',
        'c_jail_in', 
        'c_jail_out',
        'c_offense_date',
        'c_arrest_date', 
        'c_days_from_compas', 
        'c_charge_degree',
        'c_charge_desc', 
        'is_recid', 
        'num_r_cases', 
        'r_charge_degree', 
        'r_days_from_arrest', 
        'r_offense_date',
        'r_charge_desc', 
        'r_jail_in', 
        'r_jail_out', 
        'is_violent_recid',
        'num_vr_cases', 
        'vr_charge_degree', 
        'vr_offense_date',
        'vr_charge_desc', 
        #'v_decile_score',
        #'v_score_text', 
        'v_screening_date', 
        #'score_text', 
    ],
    parse_dates=[
        'c_jail_in', 
        'c_jail_out', 
        'c_offense_date', 
        'c_arrest_date', 
        'r_offense_date', 
        'r_jail_in', 
        'r_jail_out', 
        'vr_offense_date',
        'v_screening_date'
    ]
)

In [3]:
print(df.shape)
df = df[df['c_days_from_compas'] <= 10]
print(df.shape)

(11757, 30)
(8938, 30)


Transform all string data to lowercase and remove extra whitespace. This handles some problematic cases where we have e.g. 'Id theft' and 'ID Theft' as separate labels.

In [4]:
df = df.applymap(lambda s: s.lower().strip() if isinstance(s, str) else s)
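
A small sketch of what the normalization does, on a made-up toy frame (the `toy` data and the `normalize` helper are illustrative, not part of the notebook). It maps column by column, which should behave the same as the cell above.

```python
import pandas as pd

# Toy frame (made up for illustration) with labels that differ only in case/whitespace
toy = pd.DataFrame({
    'c_charge_desc': ['Id theft', 'ID Theft ', ' id theft'],
    'priors_count': [0, 2, 1],
})


def normalize(value):
    """Lowercase and strip strings; leave every other value untouched."""
    return value.strip().lower() if isinstance(value, str) else value


# Map each column element-wise; non-string values pass through unchanged
normalized = toy.apply(lambda column: column.map(normalize))

n_before = toy['c_charge_desc'].nunique()
n_after = normalized['c_charge_desc'].nunique()
print(n_before, n_after)
```

The three spellings collapse into a single label, so they end up as one binary column instead of three.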

Transform all string columns to groups of binary columns (one-hot encoding). With this many distinct crime labels that is probably not the smartest way to go about it. The smart thing to do would probably be some type of crime classification system with flags such as is_violent, is_with_weapon, etc. But that would require a lot of manual work.

In [5]:
df = pd.get_dummies(df)
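
For comparison, a rough sketch of what a crime classification system could look like. The keyword lists and the `categorize` helper are hypothetical; a real scheme would need someone to go through the labels by hand.

```python
import pandas as pd

# Hypothetical keyword lists, chosen only for illustration
CATEGORY_KEYWORDS = {
    'is_violent': ('battery', 'assault', 'robbery'),
    'is_drug': ('cocaine', 'cannabis', 'deliver'),
    'is_theft': ('theft', 'burglary'),
}


def categorize(descriptions):
    """Return one 0/1 column per coarse category for a Series of charge descriptions."""
    descriptions = descriptions.fillna('')
    flags = {}
    for category, keywords in CATEGORY_KEYWORDS.items():
        pattern = '|'.join(keywords)
        flags[category] = descriptions.str.contains(pattern, regex=True).astype(int)
    return pd.DataFrame(flags)


# A few (already lowercased) charge descriptions plus one missing value
charges = pd.Series([
    'felony petit theft',
    'deliver cocaine',
    'felony battery w/prior convict',
    None,
])
category_flags = categorize(charges)
print(category_flags)
```

This would replace hundreds of sparse label columns with a handful of dense ones, at the price of whatever the keyword matching gets wrong.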

Check if we have any more object columns (should be none)

In [6]:
df.dtypes.value_counts()

uint8             865
datetime64[ns]      9
int64               8
float64             5
dtype: int64

Calculate jail stay lengths

In [7]:
df['c_days_in_jail'] = (df['c_jail_out'] - df['c_jail_in']).dt.days.fillna(0).astype(int)
df['r_days_in_jail'] = (df['r_jail_out'] - df['r_jail_in']).dt.days.fillna(0).astype(int)
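
A quick sketch of how this calculation behaves on made-up timestamps (the `toy` frame and the `was_missing` flag are assumptions for illustration). Note that a missing stay and a stay shorter than a day both come out as 0.

```python
import pandas as pd

# Made-up timestamps: a multi-day stay, a same-day stay, and a missing record
toy = pd.DataFrame({
    'c_jail_in': pd.to_datetime(['2013-01-01 08:00', '2013-02-10 12:00', None]),
    'c_jail_out': pd.to_datetime(['2013-01-03 20:00', '2013-02-10 18:00', None]),
})

stay = toy['c_jail_out'] - toy['c_jail_in']   # timedelta column, NaT where dates are missing
days = stay.dt.days.fillna(0).astype(int)     # whole days only, missing becomes 0
was_missing = stay.isna()                     # optional flag to tell the two zeros apart

print(days.tolist(), was_missing.tolist())
```

If the difference between "no jail record" and "released the same day" matters, keeping a flag like `was_missing` as an extra column would be one way to preserve it.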

Drop date columns

In [8]:
df.drop([
        'c_jail_in', 
        'c_jail_out', 
        'c_offense_date', 
        'c_arrest_date', 
        'r_offense_date', 
        'r_jail_in', 
        'r_jail_out', 
        'vr_offense_date',
        'v_screening_date'
    ],
    axis='columns',
    inplace=True
)

Check dtypes again, should be no dates remaining

In [9]:
df.dtypes.value_counts()

uint8      865
int64       10
float64      5
dtype: int64

Still a few floats, see what that is all about

In [10]:
df.select_dtypes(include=['float64'])

    days_b_screening_arrest  c_days_from_compas  num_r_cases  r_days_from_arrest  num_vr_cases
0                      -1.0                 1.0          NaN                 NaN           NaN
2                      -1.0                 1.0          NaN                 NaN           NaN
3                      -1.0                 1.0          NaN                 0.0           NaN
4                       NaN                 1.0          NaN                 NaN           NaN
6                       0.0                 0.0          NaN                 NaN           NaN
7                      -1.0                 1.0          NaN                 0.0           NaN
8                      -1.0                 1.0          NaN                 NaN           NaN
9                      -1.0                 1.0          NaN                 NaN           NaN
10                     -1.0                 1.0          NaN                 NaN           NaN
11                     -1.0                 1.0          NaN                 NaN           NaN


At least two look like only NaNs, check to be sure

In [11]:
print(df['num_vr_cases'].value_counts())
print(df['num_r_cases'].value_counts())

Series([], Name: num_vr_cases, dtype: int64)
Series([], Name: num_r_cases, dtype: int64)


Yeah, NaNs only. We can drop those. The others look pretty meaningless as well, so drop those too.

In [12]:
df.drop([
        'num_r_cases',
        'num_vr_cases',
        'days_b_screening_arrest',
        'c_days_from_compas',
        'r_days_from_arrest'

    ],
    axis='columns',
    inplace=True
)

Check if we have any more NaNs hanging around

In [13]:
for col in df:
    count = len(df[col]) - df[col].count()
    if count:
        print(col, count)
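
The same check can be written more compactly with `isna()`. A minimal sketch on a made-up frame (the `toy` data is illustrative only):

```python
import numpy as np
import pandas as pd

# Toy frame with NaNs in exactly one column
toy = pd.DataFrame({
    'age': [25, 31, 40],
    'r_days_in_jail': [1.0, np.nan, np.nan],
    'priors_count': [0, 3, 1],
})

nan_counts = toy.isna().sum()            # NaN count per column
remaining = nan_counts[nan_counts > 0]   # keep only columns that still have NaNs
print(remaining)
```

An empty result here means the same thing as the loop above printing nothing.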

Should be no more NaNs to go, so we can continue to learning.

Split to target and explanatory variables

In [14]:
predicted_variable = 'decile_score'
X = df.loc[:, df.columns != predicted_variable]
y = df.loc[:, df.columns == predicted_variable]
print(X.shape, y.shape)

(8938, 874) (8938, 1)


Split to train and test sets

In [15]:
from sklearn import model_selection

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.1, random_state=42)

Train a ridge regression model and check the $R^2$ score. A score of 1 is the best possible, and values can go all the way down to $-\infty$. RidgeCV automatically cross-validates (leave-one-out by default) to figure out the best value of the hyperparameter $\alpha$.
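
To make the range of $R^2$ concrete, here is a small hand-rolled sketch of the usual definition, $R^2 = 1 - SS_{res} / SS_{tot}$. The `r_squared` helper and the toy numbers are made up for illustration.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared error of predicting the mean
    return 1.0 - ss_res / ss_tot


y_true = [1, 2, 3, 4]
perfect = r_squared(y_true, [1, 2, 3, 4])          # exact predictions
mean_only = r_squared(y_true, [2.5] * 4)           # always predict the mean
reversed_order = r_squared(y_true, [4, 3, 2, 1])   # worse than predicting the mean
print(perfect, mean_only, reversed_order)
```

A model that always predicts the mean scores 0, so anything negative is doing worse than that baseline.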

In [26]:
from sklearn import linear_model

# 'explained_variance' was rejected as a scoring value by the scikit-learn
# version used here; 'r2' is among the options it lists as valid.
model = linear_model.RidgeCV(scoring='r2')
model.fit(X_train, y_train)
print(model.score(X_test, y_test))

That's... not horrible. Let's check which features have the greatest effect on the predicted score.

In [17]:
coeffs = [(coef, col) for col, coef in zip(X.columns, model.coef_[0])]
coeffs = sorted(coeffs, reverse=True)

for a, b in coeffs[:20]: print(a, b)
    
print('...')

orig = {col:coef for coef, col in coeffs}
abss = sorted([(abs(coef), col) for coef, col in coeffs])
near_zero = sorted([(orig[col], col) for coef, col in abss[:50]], reverse=True)
for a, b in near_zero: print(a, b)


print('...')

for a, b in coeffs[-20:]: print(a, b)

1.39733712844 c_charge_desc_felony petit theft
1.33286223897 c_charge_desc_deliver 3,4 methylenediox
1.00966859115 c_charge_desc_tampering with physical evidence
0.871561651017 c_charge_desc_disorderly intoxication
0.864869177347 c_charge_desc_grand theft dwell property
0.849941928109 c_charge_desc_poss cocaine/intent to del/sel
0.803461438168 c_charge_desc_prowling/loitering
0.786103837616 c_charge_desc_possession of cocaine
0.783340329506 c_charge_desc_felony dui (level 3)
0.764838759329 c_charge_desc_poss pyrrolidinovalerophenone
0.750748098897 c_charge_desc_deliver cannabis
0.748838722175 c_charge_desc_petit theft
0.747573463182 c_charge_desc_deliver cocaine
0.746547854705 vr_charge_desc_felony battery w/prior convict
0.734804176959 r_charge_desc_unlaw use false name/identity
0.698627644687 c_charge_desc_grand theft (motor vehicle)
0.697080695875 c_charge_desc_poss of cocaine w/i/d/s 1000ft park
0.694179963057 r_charge_desc_robbery / no weapon
0.671362857916 race_african-american
0
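
The same ranking can also be sketched with a pandas Series, which makes sorting by magnitude a bit less fiddly. The values below are a handful of coefficients from this notebook's outputs, rounded; the variable names are made up. One caveat, stated as a general assumption rather than something checked here: coefficients on unscaled features such as age are not directly comparable to those on 0/1 columns.

```python
import pandas as pd

# A few coefficients from the notebook's outputs, rounded for illustration
coefs = pd.Series({
    'c_charge_desc_felony petit theft': 1.397,
    'race_african-american': 0.671,
    'c_days_in_jail': 0.005,
    'age': -0.091,
    'race_hispanic': -0.295,
})

# Order by absolute size while keeping the original signs
order = coefs.abs().sort_values(ascending=False).index
by_magnitude = coefs.reindex(order)

# Features whose coefficient is practically zero (threshold is arbitrary)
near_zero_features = coefs[coefs.abs() < 0.01]

print(by_magnitude)
print(near_zero_features)
```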

Let's check the non-crime label ones specifically

In [18]:
for coef, col in coeffs:
    if 'charge_desc' not in col:
        print(coef, col)

0.671362857916 race_african-american
0.532643550234 race_native american
0.517917289476 is_violent_recid
0.358295678854 vr_charge_degree_(f7)
0.355421952635 is_recid
0.292999683579 vr_charge_degree_(m2)
0.26034588717 juv_fel_count
0.252770759163 juv_other_count
0.222022233675 priors_count
0.212178042404 vr_charge_degree_(f6)
0.204834840463 r_charge_degree_f
0.133357706964 juv_misd_count
0.0813856044311 sex_female
0.0740144190665 c_charge_degree_f
0.0655252589349 race_caucasian
0.0641869599376 vr_charge_degree_(f1)
0.0322916860824 vr_charge_degree_(mo3)
0.00483148934211 c_days_in_jail
0.00193975864752 r_days_in_jail
-0.00168447175154 c_charge_degree_o
-0.00415002850227 vr_charge_degree_(f3)
-0.0239280909414 vr_charge_degree_(f5)
-0.072329947315 c_charge_degree_m
-0.0813856044311 sex_male
-0.0909063601894 age
-0.0918325400547 r_charge_degree_m
-0.113002300408 r_charge_degree_o
-0.205276207035 vr_charge_degree_(m1)
-0.208680434903 vr_charge_degree_(f2)
-0.295009621143 race_hispanic
-0.353
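
Note that sex_female and sex_male get coefficients of the same size with opposite signs. That is consistent with the two dummy columns carrying the same information: each row has exactly one of them set. A minimal sketch on made-up data of what that looks like, and of `drop_first=True` as one possible way to avoid it (not something the notebook does):

```python
import pandas as pd

# Toy column, made up for illustration
toy = pd.DataFrame({'sex': ['male', 'female', 'male', 'female', 'male']})

full = pd.get_dummies(toy).astype(int)                      # sex_female and sex_male
reduced = pd.get_dummies(toy, drop_first=True).astype(int)  # only sex_male remains

row_sums = full.sum(axis=1)                                 # always 1: the columns are redundant
correlation = full['sex_female'].corr(full['sex_male'])     # perfectly anti-correlated

print(list(full.columns), list(reduced.columns))
print(row_sums.tolist(), correlation)
```

The same redundancy applies to every group of dummy columns created from a single string column.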

Try linear support vector regression next

In [21]:
from sklearn import svm

model = svm.LinearSVR(epsilon=0.5)
scores = model_selection.cross_val_score(model, X, y.values.ravel(), cv=10, n_jobs=2)
scores

array([-0.61469311, -3.66407706,  0.34482374,  0.01771783, -0.46750105,
       -0.14204074,  0.23770379,  0.37259487,  0.20467988,  0.26196131])

Marginally better, still pretty bad. Let's try support vector regression with an RBF kernel. In principle it has an infinite-dimensional feature space, so that should get to something reasonable at the cost of possibly overfitting.

In [22]:
from sklearn import svm

model = svm.SVR(epsilon=0.5)
scores = model_selection.cross_val_score(model, X, y.values.ravel(), cv=10, n_jobs=3)
scores

KeyboardInterrupt: 

It's... not bad(?) considering we are using $\frac{1}{3}$ of the data for testing on each round of the cross validation. The resulting $R^2$ values indicate $R \approx 0.6$ which is by no means an abysmal degree of correlation. At the same time, we would expect a significantly higher degree of success based on what the task actually **is**.

Might be a problem with the nature of the data, more specifically not enough data for practical learning considering the size of the (unmodified) feature space. We only have about 10 times as many data points as features.
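
One cheap way to shrink the feature space would be to drop binary columns that are set for only a handful of rows. A sketch under that assumption; the `drop_rare_indicators` helper, the threshold, and the toy frame are all hypothetical.

```python
import pandas as pd


def drop_rare_indicators(frame, min_count):
    """Drop 0/1 columns that are set in fewer than min_count rows."""
    # Caveat: a count column that happens to contain only 0s and 1s is treated as an indicator too
    is_indicator = frame.isin([0, 1]).all()
    positives = frame.sum()
    rare = is_indicator & (positives < min_count)
    return frame.loc[:, ~rare]


# Toy frame: one numeric column, one common label, one rare label
toy = pd.DataFrame({
    'age': [25, 31, 40, 22],
    'charge_a': [1, 1, 0, 1],
    'charge_b': [0, 0, 1, 0],
})
pruned = drop_rare_indicators(toy, min_count=2)
print(list(pruned.columns))
```

On the real data, the number of columns left for a given threshold would show how much of the feature space consists of labels that occur only once or twice.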

Might want to try, say, a random forest or alternatively doing some tricks with sample generation. Might also be issues regarding collinearity, but that's a bit beyond me, to be quite honest.
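
A minimal sketch of what the random forest attempt could look like. It runs on synthetic data from `make_regression` so that it is self-contained; on the real data, `X_demo` and `y_demo` would be replaced by `X` and `y.values.ravel()`. The hyperparameters are arbitrary placeholders, not tuned values.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real features and target
X_demo, y_demo = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=10.0, random_state=42
)

# Arbitrary settings; n_estimators and cv would need tuning on the real data
forest = RandomForestRegressor(n_estimators=50, random_state=42)
forest_scores = cross_val_score(forest, X_demo, y_demo, cv=5, scoring='r2')

print(forest_scores)
print(forest_scores.mean())
```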