### PREDICTING PAROLE VIOLATORS

In many criminal justice systems around the world, inmates deemed not to be a threat to society are released from prison under the parole system prior to completing their sentence. They are still considered to be serving their sentence while on parole, and they can be returned to prison if they violate the terms of their parole.

Parole boards are charged with identifying which inmates are good candidates for release on parole. They seek to release inmates who will not commit additional crimes after release. In this problem, we will build and validate a model that predicts whether an inmate will violate the terms of his or her parole. Such a model could be useful to a parole board when deciding whether to approve or deny an application for parole.

For this prediction task, we will use data from the United States 2004 National Corrections Reporting Program, a nationwide census of parole releases that occurred during 2004. We limited our focus to parolees who served no more than 6 months in prison and whose maximum sentence for all charges did not exceed 18 months. The dataset contains all such parolees who either successfully completed their term of parole during 2004 or violated the terms of their parole during that year. The dataset contains the following variables:

- male: 1 if the parolee is male, 0 if female
- race: 1 if the parolee is white, 2 otherwise
- age: the parolee's age (in years) when he or she was released from prison
- state: a code for the parolee's state. 2 is Kentucky, 3 is Louisiana, 4 is Virginia, and 1 is any other state. The three states were selected due to having a high representation in the dataset.
- time.served: the number of months the parolee served in prison (limited by the inclusion criteria to not exceed 6 months).
- max.sentence: the maximum sentence length for all charges, in months (limited by the inclusion criteria to not exceed 18 months).
- multiple.offenses: 1 if the parolee was incarcerated for multiple offenses, 0 otherwise.
- crime: a code for the parolee's main crime leading to incarceration. 2 is larceny, 3 is drug-related crime, 4 is driving-related crime, and 1 is any other crime.
- violator: 1 if the parolee violated the parole, and 0 if the parolee completed the parole without violation.
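
Since several of these variables are integer codes, a small lookup table can make later output easier to read. The sketch below simply transcribes the codes from the list above; the names `STATE_LABELS`, `CRIME_LABELS`, and `describe` are illustrative and not part of the dataset.

```python
# Code-to-label lookups transcribed from the variable list above.
# The dictionary and function names are illustrative assumptions.
STATE_LABELS = {1: "Other", 2: "Kentucky", 3: "Louisiana", 4: "Virginia"}
CRIME_LABELS = {1: "Other", 2: "Larceny", 3: "Drug-related", 4: "Driving-related"}


def describe(state_code, crime_code):
    """Return a readable (state, crime) pair for one parolee's codes."""
    return STATE_LABELS[int(state_code)], CRIME_LABELS[int(crime_code)]


print(describe(1, 4))  # ('Other', 'Driving-related')
```

The `int(...)` calls let the lookup keep working after the codes have been converted to strings.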

In [16]:
# Owen Wichiencharoen's standard Python Imports:

import pandas as pd
from pandas import Series, DataFrame
import numpy as np
import seaborn as sns
sns.set_style('darkgrid')

from scipy import stats
import statsmodels.formula.api as smf
import statsmodels.api as sm

import matplotlib.pyplot as plt
%matplotlib inline


# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split, cross_val_score
# from sklearn import datasets, metrics
import sklearn.linear_model as lm

# from sklearn.tree import DecisionTreeClassifier, export_graphviz
# from sklearn.ensemble import RandomForestClassifier
# import pydot
# from os import system
# from sklearn.externals.six import StringIO
# from IPython.display import Image

#import itertools
#import pandas_datareader.data as pdweb
#from pandas_datareader.data import DataReader
#from datetime import datetime
#from io import StringIO

In [17]:
parole = pd.read_csv('../data/parole.csv')
parole[:5]

Unnamed: 0,male,race,age,state,time.served,max.sentence,multiple.offenses,crime,violator
0,1,1,33.2,1,5.5,18,0,4,0
1,0,1,39.7,1,5.4,12,0,3,0
2,1,2,29.5,1,5.6,12,0,3,0
3,1,1,22.4,1,5.7,18,0,1,0
4,1,2,21.6,1,5.4,12,0,1,0


In [18]:
parole.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 675 entries, 0 to 674
Data columns (total 9 columns):
male                 675 non-null int64
race                 675 non-null int64
age                  675 non-null float64
state                675 non-null int64
time.served          675 non-null float64
max.sentence         675 non-null int64
multiple.offenses    675 non-null int64
crime                675 non-null int64
violator             675 non-null int64
dtypes: float64(2), int64(7)
memory usage: 47.5 KB


In [19]:
# Inspect the column names: the dots (e.g. 'time.served') won't work in formulas

parole.columns

Index(['male', 'race', 'age', 'state', 'time.served', 'max.sentence',
       'multiple.offenses', 'crime', 'violator'],
      dtype='object')

In [20]:
# Replace the dots in the column names with underscores
parole.columns = parole.columns.str.replace('.', '_', regex=False)
parole.columns

Index(['male', 'race', 'age', 'state', 'time_served', 'max_sentence',
       'multiple_offenses', 'crime', 'violator'],
      dtype='object')

In [21]:
# How many of the 675 parolees broke their parole?

parole['violator'].value_counts()

0    597
1     78
Name: violator, dtype: int64
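
These counts also fix the baseline that any model has to beat. As a quick check of the arithmetic, using only the two counts printed above (the variable names are illustrative):

```python
# Counts taken from the value_counts() output above.
non_violators = 597
violators = 78
total = non_violators + violators  # 675 parolees

violation_rate = violators / total
# Accuracy of a trivial model that predicts "no violation" for everyone.
baseline_accuracy = non_violators / total

print(f"violation rate:    {violation_rate:.3f}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
```

Roughly 11.6% of the parolees violated their parole, so always predicting "no violation" would already be right about 88.4% of the time on the full dataset.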

In [22]:
# There are categorical vars ("factors") that look like numerical vars

parole['race'].value_counts()

1    389
2    286
Name: race, dtype: int64

In [23]:
# Convert the categorical variables with more than two levels to strings,
# so they are treated as categories rather than as numeric values

parole['state'] = parole['state'].astype(str)
parole['crime'] = parole['crime'].astype(str)
parole.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 675 entries, 0 to 674
Data columns (total 9 columns):
male                 675 non-null int64
race                 675 non-null int64
age                  675 non-null float64
state                675 non-null object
time_served          675 non-null float64
max_sentence         675 non-null int64
multiple_offenses    675 non-null int64
crime                675 non-null object
violator             675 non-null int64
dtypes: float64(2), int64(5), object(2)
memory usage: 47.5+ KB
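
To see what the string conversion buys us, here is a minimal sketch of one common encoding: dummy (one-hot) variables with the first level dropped as the reference. It uses a toy `state` column, not the parole data, and `pd.get_dummies` is only one option; a formula interface can perform the same expansion itself.

```python
import pandas as pd

# Toy column for illustration only; these are not rows from parole.csv.
toy = pd.DataFrame({"state": ["1", "2", "3", "4", "1"]})

# One indicator column per level, dropping the first level ("1") as the reference.
dummies = pd.get_dummies(toy["state"], prefix="state", drop_first=True)
print(list(dummies.columns))  # ['state_2', 'state_3', 'state_4']
```

Each remaining column then measures the difference between that state and the reference level, instead of forcing a single slope across the codes 1 to 4.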


In [24]:
train, test = train_test_split(parole, test_size=0.3, random_state=144)
print(len(train))
print(len(test))

472
203
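
Because only 78 of the 675 parolees are violators, a purely random split can leave the two sets with somewhat different violation rates. One option, not used in the notebook above, is a stratified split via the `stratify` argument of `train_test_split`. The sketch below uses a stand-in frame that reproduces only the class counts, so that it runs without `parole.csv`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data: only the 597/78 class counts come from the notebook.
demo = pd.DataFrame({"violator": [0] * 597 + [1] * 78})

train_s, test_s = train_test_split(
    demo,
    test_size=0.3,
    random_state=144,
    stratify=demo["violator"],  # keep the violator share similar in both sets
)

print(len(train_s), len(test_s))
print(train_s["violator"].mean(), test_s["violator"].mean())
```

On the real data the same call would take `parole` and `parole['violator']` in place of the stand-in frame.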


Using glm with a binomial family (in statsmodels this is family=sm.families.Binomial(), the counterpart of R's family="binomial"), train a logistic regression model on the training set. The dependent variable is "violator", and all of the other variables should be used as independent variables.

Which variables are significant in this model?