# **Predicting “Successful” Criminal Activity in New York City**

The following code is designed to predict whether crimes committed in NYC will be “successful” or “unsuccessful.” Multiple classification models are tested to determine which has the highest accuracy on the NYPD Complaint Data dataset, which includes all valid felony, misdemeanor, and violation crimes reported to the New York City Police Department (NYPD) in the first quarter of 2018.

## **1. Load the dataset**

In [55]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd

pd.set_option('display.max_columns', 50)

df = pd.read_csv('NYPD_Complaint_Data_Current_YTD.csv')

## **2. Data preprocessing**

### Remove unnecessary columns

In [56]:
columns_remove = ['CMPLNT_NUM', 'ADDR_PCT_CD', 'BORO_NM', 'CMPLNT_FR_DT', 'CMPLNT_FR_TM', 'CMPLNT_TO_DT', 
                  'CMPLNT_TO_TM', 'HADEVELOPT', 'HOUSING_PSA', 'JURISDICTION_CODE', 'JURIS_DESC', 'KY_CD', 
                  'LOC_OF_OCCUR_DESC', 'PARKS_NM', 'PATROL_BORO', 'PD_CD', 
                  'RPT_DT', 'STATION_NAME', 'TRANSIT_DISTRICT', 'X_COORD_CD',
                  'Y_COORD_CD', 'Latitude', 'Longitude', 'Lat_Lon']

df = df.drop(columns_remove, axis=1)

df.head()

| | CRM_ATPT_CPTD_CD | LAW_CAT_CD | OFNS_DESC | PD_DESC | PREM_TYP_DESC | SUSP_AGE_GROUP | SUSP_RACE | SUSP_SEX | VIC_AGE_GROUP | VIC_RACE | VIC_SEX |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | COMPLETED | MISDEMEANOR | DANGEROUS WEAPONS | WEAPONS, POSSESSION, ETC | STREET | | | | UNKNOWN | UNKNOWN | E |
| 1 | COMPLETED | FELONY | RAPE | RAPE 2 | RESIDENCE - PUBLIC HOUSING | 18-24 | UNKNOWN | M | <18 | BLACK | F |
| 2 | COMPLETED | MISDEMEANOR | OFF. AGNST PUB ORD SENSBLTY & | AGGRAVATED HARASSMENT 2 | RESIDENCE-HOUSE | 25-44 | BLACK | M | 18-24 | BLACK | F |
| 3 | COMPLETED | FELONY | ROBBERY | ROBBERY,DELIVERY PERSON | RESIDENCE - APT. HOUSE | UNKNOWN | WHITE HISPANIC | M | 25-44 | WHITE HISPANIC | M |
| 4 | COMPLETED | VIOLATION | HARRASSMENT 2 | HARASSMENT,SUBD 3,4,5 | RESIDENCE - PUBLIC HOUSING | 45-64 | WHITE HISPANIC | F | 25-44 | BLACK | F |


### Remove rows with null values

In [57]:
import numpy as np

# Replace UNKNOWN and erroneous values with nulls
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
df.replace('UNKNOWN', np.nan, inplace=True)
df.replace('E', np.nan, inplace=True)
df.replace('D', np.nan, inplace=True)
df.replace('U', np.nan, inplace=True)

print('Number of rows before removing rows with missing values: ' + str(df.shape[0]))

# Remove rows with null values
df.dropna(axis=0, inplace=True)

print('Number of rows after removing rows with missing values: ' + str(df.shape[0]))

Number of rows before removing rows with missing values: 109543
Number of rows after removing rows with missing values: 31597
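
The replacements above apply to every column, so a single-letter code such as `D` would also be nulled in any column where it is a legitimate value. A minimal sketch of a column-scoped alternative follows; the toy data, the `NOTE` column, and the per-column sentinel lists are assumptions made for illustration, not values taken from the dataset.

```python
import pandas as pd

# Toy frame (made-up values); 'NOTE' is a hypothetical column where 'D' is valid
toy = pd.DataFrame({
    'NOTE': ['D', 'A', 'B', 'D'],
    'SUSP_SEX': ['M', 'U', 'F', 'M'],
    'VIC_SEX': ['F', 'F', 'D', 'E'],
})

# Sentinel codes treated as missing, listed per column (an assumption)
sentinels = {
    'SUSP_SEX': ['U', 'UNKNOWN'],
    'VIC_SEX': ['E', 'D', 'U', 'UNKNOWN'],
}

cleaned = toy.copy()
for col, codes in sentinels.items():
    # mask() sets entries to NaN wherever the condition is True
    cleaned[col] = cleaned[col].mask(cleaned[col].isin(codes))

# Drop rows that now contain a missing value
cleaned = cleaned.dropna(axis=0)
print(cleaned)
```

Only the first row survives, and its `NOTE` value of `D` is left untouched because that column is not listed in `sentinels`.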


## **3. Get the feature and target vectors**

In [58]:
# Get the feature vector
X = df.drop('CRM_ATPT_CPTD_CD', axis = 1)

# Get the target vector
y = df['CRM_ATPT_CPTD_CD']

print('X shape: ' + str(X.shape))
print('y shape: ' + str(y.shape))

X shape: (31597, 10)
y shape: (31597,)


## **4. Encoding**

### Encode the target vector (y)

In [59]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y)
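
`LabelEncoder` assigns integer codes in sorted order of the class names, so it helps to check which integer stands for which outcome. The sketch below uses a toy list; the label `ATTEMPTED` is an assumption about the other value of `CRM_ATPT_CPTD_CD`, since only `COMPLETED` appears in the rows shown above.

```python
from sklearn.preprocessing import LabelEncoder

# Toy target values; 'ATTEMPTED' is an assumed label for illustration
toy_y = ['COMPLETED', 'ATTEMPTED', 'COMPLETED']

toy_le = LabelEncoder()
codes = toy_le.fit_transform(toy_y)

# classes_ is sorted, and each class's position in it is its integer code
mapping = dict(zip(toy_le.classes_, toy_le.transform(toy_le.classes_)))
print(codes)
print(mapping)
```

Under that assumption, class 1 corresponds to completed crimes and class 0 to attempted ones.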

### Define custom encoder for the feature vector (X)

This custom encoder will encode all features in the feature vector, since LabelEncoder is only capable of encoding 1-d arrays.

In [60]:
class MultiColumnLabelEncoder:
    
    def __init__(self, columns=None):
        # Array of column names to encode
        self.columns = columns

    def fit(self, X):
        return self

    def transform(self, X):
        # Transforms columns of X specified in self.columns using LabelEncoder()
        # If no columns specified, transforms all columns in X
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname, col in output.items():
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self, X):
        return self.fit(X).transform(X)

### Encode the feature vector (X)

In [61]:
X = MultiColumnLabelEncoder().fit_transform(X)
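
The custom encoder refits a `LabelEncoder` inside `transform`, which works here because all of X is encoded in one call before the split. If the encoder ever had to be fitted on one set and applied to another, scikit-learn's built-in `OrdinalEncoder` may be a safer choice, since it accepts 2-D input and remembers the categories it was fitted on. A minimal sketch with made-up rows:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data reusing two of the dataset's column names (values are illustrative)
train = pd.DataFrame({'LAW_CAT_CD': ['FELONY', 'MISDEMEANOR', 'VIOLATION'],
                      'VIC_SEX': ['F', 'M', 'F']})
test = pd.DataFrame({'LAW_CAT_CD': ['VIOLATION', 'FELONY'],
                     'VIC_SEX': ['M', 'M']})

enc = OrdinalEncoder()
train_codes = enc.fit_transform(train)  # learns the categories of each column
test_codes = enc.transform(test)        # reuses the learned categories

print(enc.categories_)
print(test_codes)
```

In this sketch `M` stays encoded as 1 in the test rows, whereas refitting a `LabelEncoder` on a column containing only `M` would encode it as 0.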

## **5. Divide data into training and testing sets**

In [62]:
from sklearn.model_selection import train_test_split

# Randomly choose 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

# Show the class counts in each split
print('y train class counts: ' + str(np.unique(y_train, return_counts=True)))
print('y test class counts: ' + str(np.unique(y_test, return_counts=True)))

y train class counts: (array([0, 1]), array([  345, 21772]))
y test class counts: (array([0, 1]), array([ 148, 9332]))
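
The counts above show that the two classes are heavily imbalanced, so it can be useful to know what accuracy a model would reach by always predicting the majority class. The sketch below rebuilds label vectors with the same counts and scores a `DummyClassifier`; the placeholder feature arrays are an assumption, since this baseline ignores the features entirely.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Label vectors with the same class counts as the split above
y_train_toy = np.array([0] * 345 + [1] * 21772)
y_test_toy = np.array([0] * 148 + [1] * 9332)

# Placeholder features: the most_frequent strategy never looks at them
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train_toy, y_train_toy)

# Accuracy of always predicting class 1 on the test labels
baseline_acc = baseline.score(X_test_toy, y_test_toy)
print(round(baseline_acc, 4))  # 0.9844
```

Any accuracy reported later is best read against this baseline figure.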


## **6. Model Selection: LR, MLP, DT, RF**

### Create dictionary of classifiers:
- Logistic Regression
- Multi-Layer Perceptron
- Decision Tree
- Random Forest

In [9]:
# The key is the classifier acronym and the value is the classifier (with random_state=0)
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

clfs = {'lr': LogisticRegression(random_state=0),
        'mlp': MLPClassifier(random_state=0),
        'dt': DecisionTreeClassifier(random_state=0),
        'rf': RandomForestClassifier(random_state=0)}

### Create dictionary of pipelines

In [10]:
# The key is the classifier acronym and the value is the pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe_clfs = {}

for name, clf in clfs.items():
    pipe_clfs[name] = Pipeline([('StandardScaler', StandardScaler()), ('clf', clf)])
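
The step name `'clf'` matters because scikit-learn addresses a step's hyperparameters as `<step name>__<parameter>`, which is why the parameter grids use keys such as `clf__C`. A minimal sketch with one of the pipelines built above:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([('StandardScaler', StandardScaler()),
                 ('clf', LogisticRegression(random_state=0))])

# 'clf__C' routes the value to the C parameter of the step named 'clf'
pipe.set_params(clf__C=0.1)

c_value = pipe.named_steps['clf'].C
has_key = 'clf__C' in pipe.get_params()
print(c_value, has_key)
```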

### Create dictionary of parameter grids

In [None]:
# The key is the classifier acronym and the value is the parameter grid of the classifier
param_grids = {}

### Create parameter grids for each classifier and set values for hyper parameter tuning

In [11]:
# Set C range
C_range = [10 ** i for i in range(-4, 5)]

# Create parameter grid for Logistic Regression
# Hyper parameters being tuned are multi_class, solver, C
param_grid = [{'clf__multi_class': ['ovr'], 
               'clf__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
               'clf__C': C_range},
              {'clf__multi_class': ['multinomial'],
               'clf__solver': ['newton-cg', 'lbfgs', 'sag', 'saga'],
               'clf__C': C_range}]
param_grids['lr'] = param_grid

# Create parameter grid for Multi-Layer Perceptron
# Hyper parameters being tuned are hidden_layer_sizes, activation
param_grid = [{'clf__hidden_layer_sizes': [10, 100, 200, 500],
               'clf__activation': ['identity', 'logistic', 'tanh', 'relu']}]
param_grids['mlp'] = param_grid

# Create parameter grid for Decision Tree
# Hyper parameters being tuned are min_samples_split, min_samples_leaf
param_grid = [{'clf__min_samples_split': [2, 10, 30],
               'clf__min_samples_leaf': [1, 10, 30]}]
param_grids['dt'] = param_grid

# Create parameter grid for Random Forest
# Hyper parameters being tuned are n_estimators, min_samples_split, min_samples_leaf
param_grid = [{'clf__n_estimators': [2, 10, 30],
               'clf__min_samples_split': [2, 10, 30],
               'clf__min_samples_leaf': [1, 10, 30]}]
param_grids['rf'] = param_grid

### Tune hyper parameters using GridSearchCV in conjunction with StratifiedKFold

Note: This may take up to an hour given the number of samples in X_train and y_train.

In [12]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

# List of [best_score_, best_params_, best_estimator_]
best_score_param_estimators = []

# Use GridSearchCV on each classifier
for name in pipe_clfs.keys():
    gs = GridSearchCV(estimator=pipe_clfs[name],
                      param_grid=param_grids[name],
                      scoring='accuracy',
                      n_jobs=1,
                      cv=StratifiedKFold(n_splits=20, shuffle=True, random_state=0))
    
    # Fit the pipeline
    gs = gs.fit(X_train, y_train)
    
    # Update best_score_param_estimators
    best_score_param_estimators.append([gs.best_score_, gs.best_params_, gs.best_estimator_])
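
The long running time follows from the number of model fits: each parameter combination is trained once per fold. The sketch below counts the fits with `ParameterGrid`, repeating the grids defined earlier so that it runs on its own. It does not count the one extra refit that `GridSearchCV` performs per classifier by default.

```python
from sklearn.model_selection import ParameterGrid

C_range = [10 ** i for i in range(-4, 5)]

# Same grids as defined above, repeated so this sketch is self-contained
grids = {
    'lr': [{'clf__multi_class': ['ovr'],
            'clf__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
            'clf__C': C_range},
           {'clf__multi_class': ['multinomial'],
            'clf__solver': ['newton-cg', 'lbfgs', 'sag', 'saga'],
            'clf__C': C_range}],
    'mlp': [{'clf__hidden_layer_sizes': [10, 100, 200, 500],
             'clf__activation': ['identity', 'logistic', 'tanh', 'relu']}],
    'dt': [{'clf__min_samples_split': [2, 10, 30],
            'clf__min_samples_leaf': [1, 10, 30]}],
    'rf': [{'clf__n_estimators': [2, 10, 30],
            'clf__min_samples_split': [2, 10, 30],
            'clf__min_samples_leaf': [1, 10, 30]}],
}

n_splits = 20  # folds used by StratifiedKFold above

# Fits per classifier = number of parameter combinations * number of folds
fits = {name: len(ParameterGrid(grid)) * n_splits for name, grid in grids.items()}
total_fits = sum(fits.values())
print(fits)
print(total_fits)
```

Since `n_jobs=1` runs these fits one after another, setting `n_jobs=-1` or lowering `n_splits` are the usual ways to shorten the search.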

### Print best score and parameters for each classifier (estimator)

In [14]:
# Sort best_score_param_estimators in descending order of best_score_
best_score_param_estimators = sorted(best_score_param_estimators, key=lambda x : x[0], reverse=True)

# For each [best_score_, best_params_, best_estimator_], print the score, the parameters,
# and the classifier type (best_estimator_ is a pipeline)
for best_score_param_estimator in best_score_param_estimators:
    print([best_score_param_estimator[0], best_score_param_estimator[1], 
           type(best_score_param_estimator[2].named_steps['clf'])], end='\n\n')

[0.985034136636976, {'clf__min_samples_leaf': 1, 'clf__min_samples_split': 30, 'clf__n_estimators': 30}, <class 'sklearn.ensemble.forest.RandomForestClassifier'>]

[0.9848080661934259, {'clf__min_samples_leaf': 10, 'clf__min_samples_split': 30}, <class 'sklearn.tree.tree.DecisionTreeClassifier'>]

[0.9844011393950355, {'clf__C': 0.0001, 'clf__multi_class': 'ovr', 'clf__solver': 'newton-cg'}, <class 'sklearn.linear_model.logistic.LogisticRegression'>]

[0.9844011393950355, {'clf__activation': 'identity', 'clf__hidden_layer_sizes': 10}, <class 'sklearn.neural_network.multilayer_perceptron.MLPClassifier'>]



### Run the highest scoring classifier and associated parameters on the test data

In [44]:
from sklearn.metrics import accuracy_score

y_pred = best_score_param_estimators[0][2].predict(X_test)

print('Classifier:', end=' ')
print(type(best_score_param_estimators[0][2].named_steps['clf']), end='\n\n')
print('Accuracy:', end=' ')
# accuracy_score takes (y_true, y_pred); micro-averaged precision gives the same value here
print(accuracy_score(y_test, y_pred))

Classifier: <class 'sklearn.ensemble.forest.RandomForestClassifier'>

Accuracy: 0.9850210970464135
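
Accuracy alone does not show how the classifier treats each class. A minimal sketch of per-class metrics follows, using small made-up label arrays rather than the notebook's actual predictions:

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             confusion_matrix)

# Made-up labels for illustration: 2 samples of class 0, 8 of class 1
y_true_toy = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
y_pred_toy = [1, 0, 1, 1, 1, 1, 1, 1, 1, 1]

acc = accuracy_score(y_true_toy, y_pred_toy)
# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy)
# Mean of the per-class recalls
bal_acc = balanced_accuracy_score(y_true_toy, y_pred_toy)

print(acc)
print(cm)
print(bal_acc)
```

In this toy case the accuracy is 0.9 while the balanced accuracy is 0.75, because half of the minority class is misclassified. The same calls could be applied to `y_test` and `y_pred` above.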


## **7. Model Selection: Deep Neural Network with Keras**

In [63]:
# NOTE: requires keras and tensorflow (pip install keras tensorflow)
from keras.models import Sequential
from keras.layers import Dense

# Create model
model = Sequential()
model.add(Dense(12, input_dim=10, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit model to training data
model.fit(X_train, y_train, epochs=5, batch_size=10)

# Evaluate model on test data
scores = model.evaluate(X_test, y_test)
print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5

acc: 98.44%
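
Unlike the scikit-learn pipelines in section 6, the network above is trained on the unscaled integer codes. One way to make the comparison more even would be to standardize the inputs first. The sketch below shows only the scaling step on random stand-in data; the array shapes and values are assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in integer-coded features with 10 columns, like X (values are random)
rng = np.random.default_rng(0)
X_train_toy = rng.integers(0, 50, size=(100, 10)).astype(float)
X_test_toy = rng.integers(0, 50, size=(40, 10)).astype(float)

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler().fit(X_train_toy)
X_train_scaled = scaler.transform(X_train_toy)
X_test_scaled = scaler.transform(X_test_toy)

print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))
```

The scaled arrays would then take the place of `X_train` and `X_test` in `model.fit` and `model.evaluate`.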


## **8. Conclusion**
After testing multiple models (Logistic Regression, Multi-Layer Perceptron, Decision Tree, Random Forest, and a Deep Neural Network), Random Forest performed best, predicting whether crimes in NYC will be successful or unsuccessful with an accuracy of 0.985 on the test set. The margin was small: the other models all scored roughly 0.984 to 0.985.