# TODO:
- Check which crimes are correlated with socio-economic status (proxy: race, and possibly sex and age)
- Check whether race/age/sex is correlated with COMPAS scores
- Check whether race/age/sex is correlated with OUR scores
- Try out some other models as well (SVMs, Random Forests, Naive Bayes)

# Trying stuff out
Import our tools

In [1]:
import pandas as pd

Load in the relevant data, parsing the date fields to a format pandas understands on the way.

In [2]:
df = pd.read_csv(
    'compas-scores.csv',
    usecols=[
        'sex',
        'age',
        #'age_cat',
        'race',
        'juv_fel_count', 
        #'decile_score', 
        'juv_misd_count',
        'juv_other_count', 
        'priors_count', 
        'days_b_screening_arrest',
        'c_jail_in', 
        'c_jail_out',
        'c_offense_date',
        'c_arrest_date', 
        'c_days_from_compas', 
        'c_charge_degree',
        'c_charge_desc', 
        'is_recid', 
        #'num_r_cases', 
        #'r_charge_degree', 
        #'r_days_from_arrest', 
        #'r_offense_date',
        #'r_charge_desc', 
        #'r_jail_in', 
        #'r_jail_out', 
        #'is_violent_recid',
        #'num_vr_cases', 
        #'vr_charge_degree', 
        #'vr_offense_date',
        #'vr_charge_desc', 
        #'v_decile_score',
        #'v_score_text', 
        #'v_screening_date', 
        #'score_text', 
    ],
    parse_dates=[
        'c_jail_in', 
        'c_jail_out', 
        'c_offense_date', 
        'c_arrest_date', 
        #'r_offense_date', 
        #'r_jail_in', 
        #'r_jail_out', 
        #'vr_offense_date',
        #'v_screening_date'
    ]
)

In [3]:
print(df.shape)
df = df[df['c_days_from_compas'] <= 10]
print(df.shape)

(11757, 16)
(8938, 16)


In [4]:
# Dump the raw charge label counts to a CSV file for manual inspection.
df['c_charge_desc'].value_counts().to_csv('charge_desc_counts_raw.csv')

Transform all string data to lowercase and remove extra whitespace. This handles some problematic cases where we have e.g. 'Id theft' and 'ID Theft' as separate labels.

In [5]:
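# Lowercase and strip every string cell; leave non-string values untouched.
df = df.applymap(lambda s: s.lower().strip() if isinstance(s, str) else s)
# Dump the label counts again, this time after normalization.
df['c_charge_desc'].value_counts().to_csv('charge_desc_counts_normalized.csv')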
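To see what the normalization buys us, here is a minimal sketch on a handful of made-up labels (the list `raw_labels` and the helper `normalize` are illustrative, not part of the dataset or the notebook). It only shows how case and whitespace variants collapse into one label.

```python
from collections import Counter

# Hypothetical raw labels, including the 'Id theft' / 'ID Theft' case.
raw_labels = ['Id theft', 'ID Theft', ' Battery', 'battery ', 'Petit Theft']


def normalize(value):
    """Lowercase and strip strings; pass every other type through unchanged."""
    return value.lower().strip() if isinstance(value, str) else value


raw_counts = Counter(raw_labels)
clean_counts = Counter(normalize(label) for label in raw_labels)

# Number of distinct labels before and after normalization.
print(len(raw_counts), len(clean_counts))
print(clean_counts)
```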

Transform all string columns to groups of binary columns (one-hot encoding). This is probably not the smartest way to go about it, given that we have so many crime labels. The smart thing to do would probably be some kind of crime classification scheme, e.g. is_violent, is_with_weapon, etc. But that would require a lot of manual work.

In [6]:
df = pd.get_dummies(df)
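For reference, a rough sketch of what a keyword-based classification could look like as an alternative to one dummy column per charge label. Everything here is an assumption for illustration: the category names other than `is_with_weapon`, the keyword lists, and the helper `add_category_flags` are hypothetical, and a real scheme would need every label reviewed by hand.

```python
import re

import pandas as pd

# Hypothetical keyword lists; substring matches only, no legal accuracy implied.
CATEGORY_KEYWORDS = {
    'is_property': ['theft', 'burg'],
    'is_with_weapon': ['weapon', 'f/arm', 'shoot'],
    'is_drug_related': ['cocaine', 'alprazolam', 'carisoprodol'],
}


def add_category_flags(charges):
    """Return one 0/1 column per category, set when any keyword occurs in the label."""
    flags = {}
    for name, keywords in CATEGORY_KEYWORDS.items():
        pattern = '|'.join(re.escape(keyword) for keyword in keywords)
        # na=False treats missing charge descriptions as "no match".
        flags[name] = charges.str.contains(pattern, na=False).astype(int)
    return pd.DataFrame(flags)


# A few already-normalized labels of the kind found in c_charge_desc.
charges = pd.Series([
    'petit theft',
    'poss f/arm delinq',
    'solicit purchase cocaine',
    'aiding escape',
])
flags = add_category_flags(charges)
print(flags)
```

This would replace hundreds of sparse columns with a handful of dense ones, at the cost of losing whatever the keywords fail to capture.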

Check if we have any more object columns (should be none)

In [7]:
df.dtypes.value_counts()

uint8             449
int64               6
datetime64[ns]      4
float64             2
dtype: int64

Calculate jail stay lengths

In [8]:
df['c_days_in_jail'] = (df['c_jail_out'] - df['c_jail_in']).dt.days.fillna(0).astype(int)
#df['r_days_in_jail'] = (df['r_jail_out'] - df['r_jail_in']).dt.days.fillna(0).astype(int)
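A small sketch of how this expression behaves, on a toy frame (the dates below are made up). Worth noting: `.dt.days` keeps only whole days, so a stay shorter than 24 hours ends up as 0, the same value that missing dates get from `fillna(0)`.

```python
import pandas as pd

# Toy data: a stay of almost two days, a one-hour stay, and a missing record.
toy = pd.DataFrame({
    'c_jail_in': pd.to_datetime(['2013-01-01 08:00', '2013-02-10 23:30', None]),
    'c_jail_out': pd.to_datetime(['2013-01-03 07:00', '2013-02-11 00:30', None]),
})

stay = toy['c_jail_out'] - toy['c_jail_in']  # timedelta column, NaT where dates are missing
toy['c_days_in_jail'] = stay.dt.days.fillna(0).astype(int)

print(toy['c_days_in_jail'].tolist())
```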

Drop date columns

In [9]:
df.drop([
        'c_jail_in', 
        'c_jail_out', 
        'c_offense_date', 
        'c_arrest_date', 
        #'r_offense_date', 
        #'r_jail_in', 
        #'r_jail_out', 
        #'vr_offense_date',
        #'v_screening_date'
    ],
    axis='columns',
    inplace=True
)

Check dtypes again, should be no dates remaining

In [10]:
df.dtypes.value_counts()

uint8      449
int64        7
float64      2
dtype: int64

Still a few floats, see what that is all about

In [11]:
df.select_dtypes(include=['float64'])

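| | days_b_screening_arrest | c_days_from_compas |
|---|---|---|
| 0 | -1.0 | 1.0 |
| 2 | -1.0 | 1.0 |
| 3 | -1.0 | 1.0 |
| 4 | NaN | 1.0 |
| 6 | 0.0 | 0.0 |
| 7 | -1.0 | 1.0 |
| 8 | -1.0 | 1.0 |
| 9 | -1.0 | 1.0 |
| 10 | -1.0 | 1.0 |
| 11 | -1.0 | 1.0 |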
11,-1.0,1.0


At least two (`num_vr_cases` and `num_r_cases`, when they are loaded) look like only NaNs, check to be sure

In [12]:
#print(df['num_vr_cases'].value_counts())
#print(df['num_r_cases'].value_counts())

Yeah, NaNs only. We can drop those. The others look pretty meaningless as well, so drop those too.

In [13]:
df.drop([
        #'num_r_cases',
        #'num_vr_cases',
        'days_b_screening_arrest',
        'c_days_from_compas',
        #'r_days_from_arrest'

    ],
    axis='columns',
    inplace=True
)

Check if we have any more NaNs hanging around

In [14]:
for col in df:
    count = len(df[col]) - df[col].count()
    if count:
        print(col, count)
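The same check can be done without the explicit loop. A minimal sketch on a toy frame (the values are made up), using `DataFrame.isna().sum()` to get the per-column NaN counts in one go:

```python
import numpy as np
import pandas as pd

# Toy frame with a single missing value.
toy = pd.DataFrame({
    'age': [25, 31, 40],
    'days_b_screening_arrest': [-1.0, np.nan, 0.0],
})

nan_counts = toy.isna().sum()            # NaN count per column
with_nans = nan_counts[nan_counts > 0]   # keep only columns that have any
print(with_nans)
```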

Should be no more NaNs to go, so we can continue to learning.

Split to target and explanatory variables

In [15]:
predicted_variable = 'is_recid'
X = df.loc[:, df.columns != predicted_variable]
y = df.loc[:, df.columns == predicted_variable]
print(X.shape, y.shape)

(8938, 455) (8938, 1)


Split to train and test sets

In [16]:
from sklearn import model_selection

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.1, random_state=42)

Train a logistic regression model and check its mean accuracy on the held-out test set. A score of 1 is the best possible, 0 the worst. No hyperparameter search is done here; the model uses scikit-learn's default regularization strength.

In [23]:
from sklearn import linear_model

model = linear_model.LogisticRegression()
model.fit(X_train, y_train.values.ravel())
print(model.score(X_test, y_test))

0.733780760626
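For a classifier such as `LogisticRegression`, `score` returns mean accuracy, and an accuracy figure is easier to judge next to what a trivial model would get. A minimal sketch, assuming we compare against always predicting the most common training label (the helper `majority_baseline_accuracy` and the toy labels are made up for illustration):

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    hits = sum(1 for label in test_labels if label == majority)
    return hits / len(test_labels)


# Toy labels: 0 is the majority class in the training part.
toy_train = [0, 0, 0, 1, 1, 0, 1, 0]
toy_test = [0, 1, 0, 0]
print(majority_baseline_accuracy(toy_train, toy_test))
```

In the notebook this could be called as `majority_baseline_accuracy(y_train.values.ravel(), y_test.values.ravel())`; the result is not shown here because it depends on the data.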


That's... not horrible. Let's check which features have the greatest effect on the thing. 

In [25]:
coeffs = [(coef, col) for col, coef in zip(X.columns, model.coef_[0])]
coeffs = sorted(coeffs, reverse=True)
for a, b in coeffs: print(a, b)

1.29233676478 c_charge_desc_felony committing prostitution
1.21420375941 c_charge_desc_petit theft
1.13329455331 c_charge_desc_poss f/arm delinq
1.04835792942 c_charge_desc_poss alprazolam w/int sell/del
1.02120651253 c_charge_desc_neglect child / bodily harm
0.983438032543 c_charge_desc_corrupt public servant
0.940950020878 c_charge_desc_carrying a concealed weapon
0.870246068063 c_charge_desc_harm public servant or family
0.832521665792 c_charge_desc_aiding escape
0.807823947984 c_charge_desc_viol pretrial release dom viol
0.799983465416 c_charge_desc_shoot into vehicle
0.793784060193 c_charge_desc_attempted burg/convey/unocc
0.793165957547 c_charge_desc_possession of carisoprodol
0.77534927392 c_charge_desc_solicit purchase cocaine
0.75957826002 c_charge_desc_petit theft/ prior conviction
0.733270743709 c_charge_desc_fighting/baiting animals
0.732961823164 c_charge_desc_viol injunct domestic violence
0.721250786906 c_charge_desc_use of anti-shoplifting device
0.671019254476 c_charge

In [24]:
coeffs = [(coef, col) for col, coef in zip(X.columns, model.coef_[0])]
coeffs = sorted(coeffs, reverse=True)

for a, b in coeffs[:20]: print(a, b)
    
print('...')

orig = {col:coef for coef, col in coeffs}
abss = sorted([(abs(coef), col) for coef, col in coeffs])
near_zero = sorted([(orig[col], col) for coef, col in abss[:50]], reverse=True)
for a, b in near_zero: print(a, b)


print('...')

for a, b in coeffs[-20:]: print(a, b)

1.29233676478 c_charge_desc_felony committing prostitution
1.21420375941 c_charge_desc_petit theft
1.13329455331 c_charge_desc_poss f/arm delinq
1.04835792942 c_charge_desc_poss alprazolam w/int sell/del
1.02120651253 c_charge_desc_neglect child / bodily harm
0.983438032543 c_charge_desc_corrupt public servant
0.940950020878 c_charge_desc_carrying a concealed weapon
0.870246068063 c_charge_desc_harm public servant or family
0.832521665792 c_charge_desc_aiding escape
0.807823947984 c_charge_desc_viol pretrial release dom viol
0.799983465416 c_charge_desc_shoot into vehicle
0.793784060193 c_charge_desc_attempted burg/convey/unocc
0.793165957547 c_charge_desc_possession of carisoprodol
0.77534927392 c_charge_desc_solicit purchase cocaine
0.75957826002 c_charge_desc_petit theft/ prior conviction
0.733270743709 c_charge_desc_fighting/baiting animals
0.732961823164 c_charge_desc_viol injunct domestic violence
0.721250786906 c_charge_desc_use of anti-shoplifting device
0.671019254476 c_charge
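The near-zero selection in the cell above goes through a dict and two sorts. A shorter sketch of the same idea, sorting by absolute value with a `key` function (the toy coefficients are made up; ties may be ordered differently than in the original cell):

```python
# Toy (coefficient, column) pairs in the same shape as `coeffs`.
toy_coeffs = [(1.29, 'a'), (0.05, 'b'), (-0.0005, 'c'), (-0.03, 'd'), (-0.9, 'e')]

# Take the 3 coefficients closest to zero, then list them from largest to smallest.
closest = sorted(toy_coeffs, key=lambda pair: abs(pair[0]))[:3]
near_zero = sorted(closest, reverse=True)

for coef, col in near_zero:
    print(coef, col)
```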

Let's check the non-crime label ones specifically

In [19]:
for coef, col in coeffs:
    if 'charge_desc' not in col:
        print(coef, col)

0.0798295169387 c_charge_degree_o
0.0501782325222 juv_other_count
0.0488913367833 race_african-american
0.0414765488602 sex_male
0.0263776120591 priors_count
0.0157476772352 race_caucasian
0.00309473681351 juv_fel_count
-0.000468122783317 c_days_in_jail
-0.00320791348621 race_native american
-0.00559965551894 juv_misd_count
-0.00642519604519 age
-0.00745456376594 race_other
-0.0229589024523 race_hispanic
-0.0307111532929 c_charge_degree_m
-0.0310176343141 race_asian
-0.0414765488602 sex_female
-0.0491183636458 c_charge_degree_f
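The `sex_male` and `sex_female` coefficients above are exact mirror images of each other, which is consistent with the two dummy columns carrying the same information (they always sum to 1). A minimal sketch on a toy frame (values made up) of how `pd.get_dummies(..., drop_first=True)` drops one column per category to avoid that redundancy:

```python
import pandas as pd

toy = pd.DataFrame({'sex': ['male', 'female', 'male'], 'age': [25, 31, 40]})

full = pd.get_dummies(toy)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category becomes the implicit baseline

print(list(full.columns))
print(list(reduced.columns))
```

Whether this changes the accuracy here is untested; it mainly makes the remaining coefficients easier to read.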


Try linear support vector regression next

In [20]:
from sklearn import svm

#model = svm.LinearSVR(epsilon=0.5)
#scores = model_selection.cross_val_score(model, X, y.values.ravel(), cv=10, n_jobs=2)
#scores

Marginally better, still pretty bad. Let's try support vector regression with an RBF kernel. In principle it has an infinite-dimensional feature space, so it should get to something reasonable at the cost of possibly overfitting.

In [21]:
from sklearn import svm

#model = svm.SVR(epsilon=0.5)
#scores = model_selection.cross_val_score(model, X, y.values.ravel(), cv=10, n_jobs=3)
#scores

It's... not bad(?) considering we are using $\frac{1}{3}$ of the data for testing on each round of the cross validation. The resulting $R^2$ values indicate $R \approx 0.6$ which is by no means an abysmal degree of correlation. At the same time, we would expect a significantly higher degree of success based on what the task actually **is**.

Might be a problem with the nature of the data, more specifically not enough data for practical learning given the size of the (unmodified) feature space. We only have on the order of 10x as many data points as features.

Might want to try, say, a random forest, or alternatively some tricks with sample generation. There might also be issues with collinearity, but that's a bit beyond me, to be quite honest.
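A rough sketch of what trying a random forest could look like. To keep it self-contained it runs on synthetic data from `make_classification`, not on the COMPAS frame, and the hyperparameters (`n_estimators=50`, `cv=5`) are arbitrary assumptions, so the printed score says nothing about this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (X, y); sizes chosen only so that this runs quickly.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=42
)

forest = RandomForestClassifier(n_estimators=50, random_state=42)
scores = cross_val_score(forest, X_toy, y_toy, cv=5)  # accuracy per fold
print(scores.mean())
```

On the real data the call would be `cross_val_score(forest, X, y.values.ravel(), cv=5)`, mirroring the commented-out SVM cells above.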