# Crime in Chicago

The objective of this project is to predict whether a person who committed a particular crime was arrested for the city of Chicago. The city of Chicago Data Portal has every crime dating back to 2001 in it's database with location and crime information for each crime.  This dataset will be combined with NOAA weather data and a model will be created for arrests.

In [1]:
import pandas as pd
import numpy as np
import pickle
import feather
import seaborn as sns
import matplotlib.pyplot as plt
import joblib

from sqlalchemy import create_engine  
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import Pipeline
from joblib import dump, load

%matplotlib inline

Ignoring the warnings was included at the end for a cleaner looking notebook.

In [2]:
import warnings
warnings.filterwarnings('ignore')

For this project I loaded the chicago crime data onto a remote server and pulled the data down as needed. Using postgreSQL, I created the schema for the data to be loaded onto the remote server.

SQL Query for remote server:

CREATE TABLE IF NOT EXISTS ChicagoCrime (  
        ID integer,  
        CaseNumber varchar(20),  
        Date varchar(50),  
        Block varchar(50),  
        IUCR varchar(10),  
        PrimaryType varchar(50),  
        Description varchar(100),  
        LocationDescription varchar(150),  
        Arrest varchar(10),  
        Domestic varchar(10),  
        Beat integer,  
        District real,  
        Ward real,  
        CommunityArea real,  
        FBICode varchar(10),  
        XCoordinate varchar(20),  
        YCoordinate varchar(20),  
        Year integer,  
        UpdatedOn varchar(50),  
        Latitude varchar(15),  
        Longitude varchar(15),  
        Location varchar(50)  
    );  


## Pull Data From Server

Data was pulled in from the remote server to be analyzed.

In [5]:
cnx = create_engine('postgresql://ubuntu@ec2-100-24-40-180.compute-1.amazonaws.com/chicago')
df = pd.read_sql_query('''SELECT * FROM chicagocrime''', cnx)

The datetimes will need to be converted to datetime format for pandas to recognize the type as a date. I also went ahead and checked to see what the arrest rate was for all crimes committed in chicago.

In [8]:
df['datetime'] = pd.to_datetime(df['date'], infer_datetime_format=True)
mask = df['arrest']  == 'true'
print('Percent of Crimes ending in Arrest: ' + str(len(df[mask])/len(df))b)

Percent of Crimes ending in Arrest: 0.27685069987994154


Only 27.7% of crimes in Chicago end in arrest. Yikes!

In order to work with the data locally we will serialize the data with feather.

In [42]:
#df.to_feather('chicago_crime.feather')
df = feather.read_dataframe('chicago_crime.feather')

## Read in Weather Data

Weather data from NOAA was pulled in as it may have some predictive ability for our problem. 

In [97]:
weather_data_path = '/Users/kevin/Downloads/1598904.csv'
df_weather = pd.read_csv(weather_data_path)

Weather Column Descriptions:

WT03 - Thunder  
WT04 - Ice pellets, sleet, snow pellets, or small hail"  
PRCP - Precipitation  
WT05 - Hail (may include small hail)  
WV03 - Thunder  
WT06 - Glaze or rime   
WT07 - Dust, volcanic ash, blowing dust, blowing sand, or blowing obstruction  
WT08 - Smoke or haze   
SNWD - Snow depth  
WT09 - Blowing or drifting snow  
WDF2 - Direction of fastest 2-minute wind  
WDF5 - Direction of fastest 5-second wind  
PGTM - Peak gust time  
WT11 - High or damaging winds  
TMAX - Maximum temperature  
WT13 - Mist  
WSF2 - Fastest 2-minute wind speed  
FMTM - Time of fastest mile or fastest 1-minute wind  
WSF5 - Fastest 5-second wind speed  
SNOW - Snowfall  
WT14 - Drizzle  
WT15 - Freezing drizzle   
WT16 - Rain (may include freezing rain, drizzle, and freezing drizzle)"   
WT17 - Freezing rain   
WT18 - Snow, snow pellets, snow grains, or ice crystals  
WT19 - Unknown source of precipitation   
AWND - Average wind speed  
WT21 - Ground fog  
WT22 - Ice fog or freezing fog  
WV20 - Rain or snow shower  
WT01 - Fog, ice fog, or freezing fog (may include heavy fog)  
WESD - Water equivalent of snow on the ground  
WT02 - Heavy fog or heaving freezing fog (not always distinguished from fog)  
TAVG - Average Temperature.  
TMIN - Minimum temperature  
TSUN - Total sunshine for the period  

The weather data will need to be merged with the crime data by date so converted the weather dates to datetimes. I also made the columns lowercase for consistency.  

In [98]:
df_weather.columns = map(str.lower, df_weather.columns)
df_weather['datetime'] = pd.to_datetime(df_weather['date'], infer_datetime_format=True)

## Merge Weather Data and Crime Data

Now that both weather and crime data are loaded in I merged the two dataframes into one.

In [100]:
df_weather = df_weather.sort_values('datetime')
df = df.sort_values('datetime')
cw_df = pd.merge_asof(df, df_weather, on = 'datetime', direction = 'backward', tolerance = pd.Timedelta('1 day')) 
df = cw_df.reset_index()
#df.to_feather('chicago_crime_and_weather.feather')

## Clean Dataset

After analyzing the combined weather and crime dataset I decided which columns to drop. Each column was analyzed and the decision to drop the column was made individually, hence why there are several repetitive drop methods.    

In [112]:
df = df.drop('index', axis = 1)
df = df.drop('casenumber', axis = 1)
df = df.drop('id', axis = 1)
df = df.drop('block', axis = 1)
df = df.drop('station', axis = 1)
df = df.drop('fmtm', axis = 1)
df = df.drop('pgtm', axis = 1)
df = df.drop('snwd', axis = 1)
df = df.drop('xcoordinate', axis = 1)
df = df.drop('ycoordinate', axis = 1)
df = df.drop('datetime', axis = 1)
df = df.drop('tavg', axis = 1)
df = df.drop('date_y', axis = 1)
df = df.drop('iucr', axis = 1)
df = df.drop('name', axis = 1)
df = df.drop('year', axis = 1)
df = df.drop('updatedon', axis = 1)
df = df.drop('location', axis = 1)
df = df.drop('fbicode', axis = 1)
df = df.drop('description', axis = 1)
df = df.drop('date_x', axis = 1)
df = df.drop(['wdf2', 'wdf5', 'wesd', 'wsf2', 'wsf5', 'wt01',
       'wt02', 'wt03', 'wt04', 'wt05', 'wt06', 'wt07', 'wt08', 'wt09', 'wt11',
       'wt13', 'wt14', 'wt15', 'wt16', 'wt17', 'wt18', 'wt19', 'wt21', 'wt22',
       'wv03', 'wv20', 'tsun'], axis = 1)


Now that I have all the data I want in my dataframe, I will convert all data types to the proper type so they can be fed into a sklearn classifer.

In [None]:
df['primarytype'] = df['primarytype'].astype('category')
df['description'] = df['description'].astype('category')
df['locationdescription'] = df['locationdescription'].astype('category')
df['arrest'].replace('true', 1, inplace = True)
df['arrest'].replace('false', 0, inplace = True)
df['domestic'].replace('true', 1, inplace = True)
df['domestic'].replace('false', 0, inplace = True)
df['fbicode'] = df['fbicode'].astype('category')
df['xcoordinate'] = df['xcoordinate'].fillna(value=np.nan)
df['xcoordinate'] = df['xcoordinate'].astype('int64', errors = 'ignore')
df['ycoordinate'] = df['ycoordinate'].fillna(value=np.nan)
df['ycoordinate'] = df['ycoordinate'].astype('int64', errors = 'ignore')
df['latitude'] = df['latitude'].astype('float64', errors = 'ignore')
df['longitude'] = df['longitude'].astype('float64', errors = 'ignore')
df['station'] = df['station'].astype('category')
df = df.dropna(subset=['district'])
df = df.dropna(subset=['latitude'])
df['locationdescription'] = df.locationdescription.fillna(value='OTHER')
df['communityarea'] = df.sort_values(by=['beat', 'district', 'ward'])['communityarea'].fillna(method='ffill')
df['ward'] = df.sort_values(by=['beat', 'district', 'communityarea'])['ward'].fillna(method='ffill')
df = df.reset_index()

Now that the dataset is cleaned, I wanted to make sure there are no nulls in the dataset that would cause issues in a model.

In [None]:
for header in df.columns:
    
    nulls_count = df[f'{header}'].isnull().sum()
    
    print(f'There are {nulls_count} in {header}')

The dataset is now cleaned and ready to be examined closer. The work done so far will be saved in a feather file for quick data loading in the future.

In [None]:
#df.to_feather('chicago_crime_cleaned.feather')
df = feather.read_dataframe('chicago_crime_cleaned.feather')

## EDA

In [None]:
#df.to_feather('chicago_crime_final.feather')
df = feather.read_dataframe('chicago_crime_final.feather')

I first wanted to check for linear relationships in the data. I ran a pearson correlation to see if there were any relationships between the features and the label. I also wanted to check for colinearity between the features, which could lead to model overfitting.

In [4]:
df.corr()

Unnamed: 0,index,arrest,domestic,beat,district,ward,communityarea,latitude,longitude,awnd,prcp,snow,tmax,tmin
index,1.0,-0.055044,0.04337,-0.035996,-0.004956,0.013127,0.004968,-0.005265,0.001056,0.036246,0.016684,0.020514,0.01734,0.038502
arrest,-0.055044,1.0,-0.069274,-0.015993,-0.01678,-0.015836,-0.008292,0.002096,-0.031477,0.001616,-0.009167,0.00233,-0.023662,-0.025416
domestic,0.04337,-0.069274,1.0,-0.041821,-0.038657,-0.050101,0.072056,-0.075669,0.004518,0.002332,0.002825,0.002082,0.004467,0.003772
beat,-0.035996,-0.015993,-0.041821,1.0,0.939092,0.635785,-0.506381,0.61265,-0.473687,-0.003126,-0.000468,0.000737,-0.002075,-0.002319
district,-0.004956,-0.01678,-0.038657,0.939092,1.0,0.68874,-0.499337,0.620597,-0.528367,-0.001122,-2.7e-05,0.000919,-0.001339,-0.00122
ward,0.013127,-0.015836,-0.050101,0.635785,0.68874,1.0,-0.532559,0.626385,-0.432463,5.9e-05,-1.1e-05,0.001221,-4.9e-05,0.000588
communityarea,0.004968,-0.008292,0.072056,-0.506381,-0.499337,-0.532559,1.0,-0.747118,0.240317,0.000821,0.001185,-0.000435,0.001802,0.001377
latitude,-0.005265,0.002096,-0.075669,0.61265,0.620597,0.626385,-0.747118,1.0,-0.410834,-2.8e-05,-0.000483,0.001649,-0.003313,-0.00263
longitude,0.001056,-0.031477,0.004518,-0.473687,-0.528367,-0.432463,0.240317,-0.410834,1.0,-0.002323,0.000345,-0.002221,0.007822,0.007829
awnd,0.036246,0.001616,0.002332,-0.003126,-0.001122,5.9e-05,0.000821,-2.8e-05,-0.002323,1.0,0.080271,0.099045,-0.250733,-0.215913


All categorical variables in the dataframe will need to be converted to dummies before they are loaded into the model.  After the dummies are added to the model, the original categorical columns can be dropped.

In [5]:
df = pd.concat([df, pd.get_dummies(df['primarytype'])], axis = 1)
df = df.drop('primarytype', axis = 1)
df = pd.concat([df, pd.get_dummies(df['locationdescription'])], axis = 1)
df = df.drop('locationdescription', axis = 1)

### Create an Evaluation Function

An evaluation function was created to consistently analyze all of the models to determine which model was the best.  The AUC score was the score that will tell me which model was the best performing.  The validation and testing accuracy scores will tell me how accurate the model was with out of sample data.  The training accuracy can be compared to the validation and testing accuracy and will let me know if the model is overfitting.

In [3]:
def evaluate_model(clf):
    
    train_preds = clf.predict(X_train)
    train_auc = roc_auc_score(y_train, train_preds)
    val_preds = clf.predict(X_val)
    val_auc = roc_auc_score(y_val, val_preds)
    test_preds = clf.predict(X_test)
    test_auc = roc_auc_score(y_test, test_preds)
    train_score = clf.score(X_train, y_train)
    val_score = clf.score(X_val, y_val)
    test_score = clf.score(X_test, y_test)
    confusion_mat = confusion_matrix(y_test, test_preds)
    
    return print(f"AUC for training set: {train_auc} \nAUC for validation set: {val_auc} \nAUC for test set: {test_auc} \nScore for training set: {train_score}\nScore for validation set: {val_score} \nScore for test set: {test_score} \nConfusion Matrix: \n{confusion_mat}")

## Create a Model

In [9]:
#df.to_feather('chicago_crime_model_data.feather')
df = feather.read_dataframe('chicago_crime_model_data.feather')

I decided to drop several more variables which were somewhat redundant in the dataset. 

In [10]:
df = df.drop(['index', 'domestic', 'beat', 'district', 'ward', 'communityarea'], axis = 1)

The features and labels were split before loading into the model.

In [11]:
y = df['arrest']
X = df.drop('arrest', axis = 1)

Due to the size of the dataset (~6.7 million rows) I decided to use a single validation set while testing my model. Using cross validation is the most reliable way to create a model.  However, it is computationally expensive and would be time consuming to do so.  

In [12]:
#Test/train split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [13]:
#Train/validation split
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.3, random_state=42)

### Logistic Regression

The first model I ran was a logistic regression model. This model required that the data was scaled before it was added to the model. I used a sklearn pipeline to handle the scaling and running of the model. 

In [14]:
clf_logistic_pipeline = Pipeline([('scale_train', StandardScaler()),  ('lr', LogisticRegression())])

In [None]:
clf_logistic_pipeline.fit(X_train, y_train)

In [None]:
evaluate_model(clf_logistic_pipeline)

AUC for training set: 0.7819176581804969  
AUC for validation set: 0.7817091792999603   
AUC for test set: 0.7818304472777092   
Score for training set: 0.8653519925078904  
Score for validation set: 0.8651978977998159  
Score for test set: 0.8653141251789568  
Confusion Matrix:   
[[1415779   45369]  
 [ 226796  332795]]  

In order to use this trained model in the future, I saved the model using joblib.

In [None]:
dump(clf_logistic_pipeline, 'clf_logistic_pipeline.joblib') 
#clf_logistic_pipeline = load('filename.joblib') 

Logistic Regression allows for the betas to be analyzed. Each individual value of e^beta is the likelihood of an arrest of an individual crime as compared to all other crimes. I went ahead and looked at the top 10 crimes that had the highest probability of arrest as well as the bottom 10 crimes.  

In [None]:
importance_list = []


for tup in zip(X_train.columns, np.exp(clf_logistic_pipeline.named_steps['lr'].coef_[0])):
    
    importance_list.append(tup) 
    sorted_importance_list = sorted(importance_list, key=lambda tup: tup[1], reverse = True)
    sorted_importance_list_reversed = sorted(importance_list, key=lambda tup: tup[1], reverse = False)
sorted_importance_list[0:10], sorted_importance_list_reversed[0:10]


([('NARCOTICS', 6.449520558275791),  
  ('PROSTITUTION', 1.94090044673702),  
  ('DEPARTMENT STORE', 1.3555689429902227),  
  ('CRIMINAL TRESPASS', 1.3481573587031486),  
  ('GAMBLING', 1.3177199360656247),  
  ('GROCERY FOOD STORE', 1.3097096245166744),  
  ('LIQUOR LAW VIOLATION', 1.2945964863184407),  
  ('WEAPONS VIOLATION', 1.273162809482963),  
  ('DRUG STORE', 1.1821135829524616),  
  ('INTERFERENCE WITH PUBLIC OFFICER', 1.1708414586148312)],  
 [('THEFT', 0.509949611367234),  
  ('CRIMINAL DAMAGE', 0.6028952015130741),  
  ('BURGLARY', 0.6737965347995939),  
  ('ROBBERY', 0.7442151529506927),  
  ('MOTOR VEHICLE THEFT', 0.762481021645242),  
  ('RESIDENCE', 0.8164103236220748),  
  ('DECEPTIVE PRACTICE', 0.8252460436674496),  
  ('RESIDENCE-GARAGE', 0.907916971075862),  
  ('OTHER OFFENSE', 0.9155231101266241),  
  ('BATTERY', 0.9170271875463851)])

### Random Forest

The next model that I chose was a random forest. This model was chosen since it usually a good predictive model. In many cases the model is too good of a predictor on the training set and overfits the data.  The forest usually needs some pruning to prevent overfitting.

In [None]:
rf = RandomForestClassifier(n_jobs=-1)
rf.fit(X_train, y_train)

In [10]:
clf_rf = RandomForestClassifier(n_estimators = 50, max_depth = 10, min_samples_leaf = 2, oob_score=True, n_jobs=-1)
clf_rf.fit(X_train, y_train)


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=10, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=2, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=-1,
            oob_score=True, random_state=None, verbose=0, warm_start=False)

In [None]:
evaluate_model(clf_rf)

AUC for training set: 0.7378734801893847  
AUC for validation set: 0.7380699441781224   
AUC for test set: 0.7381047408102378  
Score for training set: 0.8523847165569017  
Score for validation set: 0.8523207198494469   
Score for test set: 0.8525074242640934   
Confusion Matrix:   
[[1453149    7999]  
 [ 290045  269546]]  

Random forests have a somewhat built in validation set called the out-of-bag score that is good to check to ensure the model is not overfitting.

In [12]:
clf_rf.oob_score_

0.84806602800330488

Random Forests also have an important attribute called the feature importance score.  Random forest classifiers split trees by the purity of the data so the feature importance is the number of times a feature shows up in a tree while weighting features higher that show up closer to the root node of the trees.

In [13]:
# You should create function for this and put it in a class
importance_list = []

for tup in zip(X_train.columns, clf_rf.feature_importances_):
    
    importance_list.append(tup) 
    sorted_importance_list = sorted(importance_list, key=lambda tup: tup[1], reverse = True)
sorted_importance_list[0:10]

[('NARCOTICS', 0.52285329801007996),
 ('CRIMINAL TRESPASS', 0.071447538463131677),
 ('THEFT', 0.06582852637974039),
 ('PROSTITUTION', 0.050390058821393059),
 ('SIDEWALK', 0.044579003890294074),
 ('CRIMINAL DAMAGE', 0.032731838878306781),
 ('WEAPONS VIOLATION', 0.030913909633955301),
 ('DEPARTMENT STORE', 0.017539930207047666),
 ('RESIDENCE', 0.017514975629403056),
 ('BURGLARY', 0.016004138556858317)]

 Narcotics is the most important feature. Since most narcotics arrests are the result of raids, it can be inferred intuitively that it would be an important feature in determining arrests.  Similar intuition can be used for theft.  Theft is common and most of the time there are no arrests for theft so if the crime is theft, there is not a high likelihood of arrest.

In [18]:
dump(clf_rf, 'clf_rf.joblib') 
#clf_rd = load('filename.joblib') 

['clf_rf.joblib']

## Grid Search RF

I wanted to see if the hyperparameters for the random forest model could be further optimized.  I used a grid searh over several models to see if the hyperparameters could be tuned.

In [31]:
rfc = RandomForestClassifier(n_jobs=-1)
parameters = {'n_estimators':[10,20,30], 'max_depth' : [3,7, 10, None], 'min_samples_leaf':[1,3,5,7]}
rfc_clf = GridSearchCV(rfc, parameters, cv=5)

In [None]:
%%time
rfc_clf.fit(X_train, y_train)

The best parameters were called for the model below.

In [None]:
rfc_clf.best_params

GridSearchCV(cv=5, error_score='raise',  
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',  
            max_depth=None, max_features='auto', max_leaf_nodes=None,  
            min_impurity_decrease=0.0, min_impurity_split=None,  
            min_samples_leaf=1, min_samples_split=2,  
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,  
            oob_score=False, random_state=None, verbose=0,  
            warm_start=False),  
       fit_params=None, iid=True, n_jobs=1,  
       param_grid={'n_estimators': [10, 20, 30], 'max_depth': [3, 7, 10, None], 'min_samples_leaf': [1, 3, 5, 7]},  
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',  
       scoring=None, verbose=0)  

In [9]:
rfb = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [10]:
rfb.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

## KNN

The next model that I tried was K nearest neighbors.  This model is the slowest to run since the euclidean distance needs to be calculated for every point to every other point so it is computationally expensive.  

In [11]:
clf_knn_pipeline = Pipeline([('scale_train', StandardScaler()),  ('lr', KNeighborsClassifier(n_neighbors=5, n_jobs=-1))])

In [12]:
clf_knn_pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('scale_train', StandardScaler(copy=True, with_mean=True, with_std=True)), ('lr', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=-1, n_neighbors=5, p=2,
           weights='uniform'))])

In [13]:
evaluate_model(clf_knn_pipeline)

AUC for training set: 0.8304842129492409 
AUC for validation set: 0.7900891238840703 
AUC for test set: 0.7887500002362207 
Score for training set: 0.8863387755102041
Score for validation set: 0.8525904761904762 
Score for test set: 0.8523133333333334 
Confusion Matrix: 
[[199987  12419]
 [ 31887  55707]]


In [14]:
pd.to_pickle(clf_knn_pipeline, 'knn_clf_pipeline.p')

In [50]:
nbrs = KNeighborsClassifier(n_neighbors=5)

In [None]:
%%time
nbrs.fit(X_train_scaled, y_train)

CPU times: user 18min 58s, sys: 1.53 s, total: 18min 59s
Wall time: 18min 57s


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')

In [None]:
evaluate_model(nbrs, X_train=X_train_scaled, X_val=X_val_scaled, X_test=X_test_scaled)

In [73]:
#%% time
train_preds = nbrs.predict(X_train)
roc_auc_score(y_train, train_preds)

0.81331357275038685

In [72]:
#test score
# n_neighbors=5 scores 0.7659
preds = nbrs.predict(X_val)
roc_auc_score(y_val, preds)

0.76138047840754519

In [74]:
confusion_matrix(y_val, preds)

array([[139939,   8366],
       [ 25963,  35732]])

## Gradient Boosting

The mext model that I chose was gradient boosting.

In [21]:
gb_clf = GradientBoostingClassifier(learning_rate=0.01)
gb_clf.fit(X_train, y_train)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.01, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              n_iter_no_change=None, presort='auto', random_state=None,
              subsample=1.0, tol=0.0001, validation_fraction=0.1,
              verbose=0, warm_start=False)

In [23]:
evaluate_model(gb_clf)

AUC for training set: 0.7363564703752371 
AUC for validation set: 0.7366049415361069 
AUC for test set: 0.7366194163182307 
Score for training set: 0.8500505523491769
Score for validation set: 0.8500096852779533 
Score for test set: 0.8501983680227877 
Confusion Matrix: 
[[1448281   12867]
 [ 289843  269748]]


## Balancing the Dataset

Since only 27.7% of crimes ended up with an arrest, I decided to balance the dataset and try running models on the balanced dataset.

In [42]:
ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_sample(X_train,y_train)

In [45]:
# Yay, balanced classes!
len(y_resampled), len(X_resampled)

(693394, 693394)

In [53]:
rfb_balanced = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [55]:
rfb_balanced.fit(X_resampled, y_resampled)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [57]:
preds = rfb_balanced.predict(X_val)
roc_auc_score(y_val, preds)

0.79100646692502496

In [58]:
rfb_balanced.score(X_val, y_val)

0.849547619047619

In [60]:
rfb_balanced.score(X_test, y_test)

0.84939666666666669

In [62]:
confusion_matrix(y_val, preds)

array([[138361,   9944],
       [ 21651,  40044]])

## Ensemble of Several Models

I then wanted to try to put several of the models together in an ensemble and run them.

In [27]:
model_list = [('lr', clf_logistic_pipeline), ('rf', clf_rf), ('gb', gb_clf)]

In [28]:
# create voting classifier
voting_classifer = VotingClassifier(estimators=model_list,
                                    voting='hard', #<-- sklearn calls this hard voting
                                    n_jobs=-1)
voting_classifer.fit(X_train, y_train)

VotingClassifier(estimators=[('lr', Pipeline(memory=None,
     steps=[('scale_train', StandardScaler(copy=True, with_mean=True, with_std=True)), ('lr', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, ...    subsample=1.0, tol=0.0001, validation_fraction=0.1,
              verbose=0, warm_start=False))],
         flatten_transform=None, n_jobs=-1, voting='hard', weights=None)

In [30]:
evaluate_model(voting_classifer)

AUC for training set: 0.7437367358291241 
AUC for validation set: 0.7436991551717599 
AUC for test set: 0.743873117622716 
Score for training set: 0.8540023311343996
Score for validation set: 0.8538194635911314 
Score for test set: 0.8540613112331676 
Confusion Matrix: 
[[1447775   13373]
 [ 281531  278060]]


Overall the logistic regression model performed the best.  These models could all be further tweaked for better performance, however, model with 88% accuracy predicting arrests for the city of Chicago is a good model.