# **Chicago Crime Analysis**

**Author:** Meg Hutch

**Date:** June 7, 2020

**Data source:** Data was accessed from [data.cityofchicago](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2). 

As described on their website:
> "This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system."

In this analysis, several machine learning and deep learning methods are implemented to examine how useful they are for predicting crime in the city of Chicago. We also include exploratory sub-analyses of crimes reported during the civil unrest of late May 2020.

In [None]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.colors

In [None]:
crime_df = pd.read_csv(r'C:\\Users\\User\\Box Sync/Projects/Chicago_Crime/Crimes_-_2001_to_present.csv')

In [None]:
crime_areas = pd.read_csv(r'C:\\Users\\User\\Box Sync/Projects/Chicago_Crime/CommAreas.csv')

In [None]:
list(crime_df.columns) 

## **Data Pre-Processing**

In [None]:
crime_df.columns = crime_df.columns.str.replace(' ', '_')

In [None]:
crime_df["Primary_Type_Description"] = crime_df["Primary_Type"] + " " +  crime_df["Description"]

In [None]:
crime_df['Year'] = crime_df['Year'].astype(object)
crime_df['Community_Area'] = crime_df['Community_Area'].astype(object)

In [None]:
#crime_areas = crime_areas[['AREA_NUMBE','COMMUNITY']]
#crime_areas = crime_areas.dropna()
#crime_areas.columns =['Community_Area', 'Community_Name']

In [None]:
# remove '-' from the longitude/latitude community_area entries
#crime_areas = crime_areas[~crime_areas.Community_Area.str.contains("-")]

# transition the column to type 'float'
#crime_df['Community_Area'] = crime_df['Community_Area'].astype(float)
#crime_areas['Community_Area'] = crime_areas['Community_Area'].astype(float)

# merge the separate dataframes
#crime_df = pd.merge(crime_df, crime_areas, on='Community_Area')
#crime_df.head(20)

# **Descriptives**

* Show the counts of each variable, including the number of unique descriptions.
* Note: exclude collinear variables such as IUCR, and explain why.
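One way to make the collinearity note concrete: the data portal documents the IUCR code as directly linked to the primary type and description, so using it as a predictor would amount to handing the model the outcome. The sketch below runs on a small hypothetical frame (the codes and rows are made up for illustration and are not taken from the dataset). It counts the unique values per column and checks whether each code maps to exactly one `Primary_Type`.

```python
import pandas as pd

# Hypothetical toy rows; the codes and labels are illustrative only.
toy = pd.DataFrame({
    "IUCR": ["A1", "A1", "B2", "C3", "C3", "D4"],
    "Primary_Type": ["THEFT", "THEFT", "BATTERY", "ASSAULT", "ASSAULT", "THEFT"],
    "Description": ["RETAIL", "RETAIL", "SIMPLE", "SIMPLE", "SIMPLE", "FROM BUILDING"],
})

# Number of distinct values in each column.
unique_counts = toy.nunique()
print(unique_counts)

# For each code, count how many distinct primary types it appears with.
types_per_code = toy.groupby("IUCR")["Primary_Type"].nunique()

# True when every code determines its primary type, i.e. the code is
# redundant with (and leaks) the outcome.
iucr_determines_type = bool((types_per_code == 1).all())
print("IUCR determines Primary_Type:", iucr_determines_type)
```

The same two calls (`nunique` and a grouped `nunique`) could be applied to `crime_df` to confirm the relationship on the real data before dropping the column.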

**Top 10 Crimes**

The top 10 crimes included Theft, Battery, Criminal Damage, Narcotics, Assault, Other Offense, Motor Vehicle Theft, Deceptive Practice and Robbery.  

In [None]:
crime_df10 = crime_df.Primary_Type.value_counts()
crime_df10 = crime_df10.head(10)

crime_df10 = pd.DataFrame(crime_df10)

plt1 = crime_df10.plot(kind="bar", color = "tomato")
plt1.tick_params(axis="x", labelsize = 10, labelrotation = 90)
plt1.set_title("Top 10 Crimes")

In [None]:
crime_df.Description.value_counts()
crime_df.Primary_Type.value_counts() # Description has many more distinct values than Primary_Type

In [None]:
crime_desc10 = crime_df.Description.value_counts()
crime_desc10 = crime_desc10.head(10)

crime_desc10 = pd.DataFrame(crime_desc10)

plt1 = crime_desc10.plot(kind="bar", color = "tomato")
plt1.tick_params(axis="x", labelsize = 10, labelrotation = 90)
plt1.set_title("Top 10 Crime Descriptions (Overall)")

**Top 10 Crimes and Descriptions**

In [None]:
crime_df10 = crime_df.Primary_Type_Description.value_counts()
crime_df10 = crime_df10.head(10)

crime_df10 = pd.DataFrame(crime_df10)

plt1 = crime_df10.plot(kind="bar", color = "tomato")
plt1.tick_params(axis="x", labelsize = 10, labelrotation = 85)
plt1.set_title("Top 10 Primary Crimes & Descriptions")

**Top 10 Crime Locations**

In [None]:
crime_df10 = crime_df.Location_Description.value_counts()
crime_df10 = crime_df10.head(10)

crime_df10 = pd.DataFrame(crime_df10)

plt1 = crime_df10.plot(kind="bar", color = "tomato")
plt1.tick_params(axis="x", labelsize = 10, labelrotation = 90)
plt1.set_title("Top 10 Locations")

**Top 10 Districts**

**Where do these map to?**

In [None]:
crime_df10 = crime_df.District.value_counts()
crime_df10 = crime_df10.head(10)

crime_df10 = pd.DataFrame(crime_df10)

plt1 = crime_df10.plot(kind="bar", color = "tomato")
plt1.tick_params(axis="x", labelsize = 10, labelrotation = 90)
plt1.set_title("Top 10 Districts")

**Top 10 Communities**

In [None]:
crime_df10 = crime_df.Community_Area.value_counts()
crime_df10 = crime_df10.head(10)

crime_df10 = pd.DataFrame(crime_df10)

plt1 = crime_df10.plot(kind="bar", color = "tomato")
plt1.tick_params(axis="x", labelsize = 10, labelrotation = 90)
plt1.set_title("Top 10 Communities")

# **Machine Learning Methods to Predict Crimes**

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score, accuracy_score, auc, precision_recall_fscore_support, f1_score, log_loss
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import SelectFromModel 
from sklearn.model_selection import StratifiedShuffleSplit

**Pre-Process Data**

(make sure that the Community_Name is okay...) 

In [None]:
crime_model = crime_df[['Primary_Type', 'Location_Description', 'Year', 'Community_Area']]

In [None]:
# check for missing values
crime_m = crime_model.dropna()
print(len(crime_model))
print(len(crime_m))

In [None]:
# drop rows with missing values
crime_model = crime_model.dropna()

In [None]:
# keep only crime types with more than 100 records
crime_model['freq'] = crime_model.groupby('Primary_Type')['Primary_Type'].transform('count')
crime_model = crime_model[crime_model['freq'] > 100]

# build predictors and outcome after filtering so the filter applies to both
crime_x = crime_model[['Location_Description', 'Year', 'Community_Area']]
crime_y = crime_model[['Primary_Type']].copy()

In [None]:
len(crime_model)

In [None]:
crime_x = pd.get_dummies(crime_x)
crime_x.head(5)

In [None]:
#crime_y['Primary_Type'] = LabelEncoder().fit_transform(crime_y.Primary_Type)
# encode each crime type as an integer label (0, 1, 2, ...)
crime_y = crime_y.assign(Primary_Type=pd.factorize(crime_y['Primary_Type'])[0])

In [None]:
crime_y.Primary_Type.value_counts()

In [None]:
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
sss.get_n_splits(crime_x, crime_y)

In [None]:
crime_x = np.asarray(crime_x)
crime_y = np.asarray(crime_y).ravel()  # flatten to 1-D, the shape sklearn expects for the outcome

In [None]:
for train_index, test_index in sss.split(crime_x, crime_y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = crime_x[train_index], crime_x[test_index]
    y_train, y_test = crime_y[train_index], crime_y[test_index]

In [None]:
unique1, counts1 = np.unique(y_train, return_counts=True)
print(counts1/len(y_train)*100)

unique2, counts2 = np.unique(y_test, return_counts=True)
print(counts2/len(y_test)*100)

**Evaluate model on training data**
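Because `Primary_Type` has more than two classes, `roc_auc_score` needs the full matrix of class probabilities and an explicit `multi_class` strategy. As a minimal sketch, the block below uses synthetic data from `make_classification` (an assumption for illustration, not the crime data) to compute the weighted one-vs-rest and one-vs-one AUROC side by side.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic 3-class problem standing in for the crime data.
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, n_informative=5,
    n_classes=3, random_state=0,
)

clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# One probability column per class; rows sum to 1.
proba = clf.predict_proba(X_toy)

# One-vs-rest: each class against all others, weighted by class prevalence.
auc_ovr = roc_auc_score(y_toy, proba, multi_class="ovr", average="weighted")
# One-vs-one: every pair of classes, weighted by class prevalence.
auc_ovo = roc_auc_score(y_toy, proba, multi_class="ovo", average="weighted")

print("OvR weighted AUROC: {:0.3f}".format(auc_ovr))
print("OvO weighted AUROC: {:0.3f}".format(auc_ovo))
```

The two strategies can give different values on imbalanced data, so it helps to report the same one for cross-validation and for the held-out set.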

In [None]:
def evaluate_model(model, x, y, cv=True):
    """Print common multiclass classification metrics and plot an ROC curve.

    Keyword arguments:
    model -- a fitted sklearn classifier that implements predict_proba
    x -- predictor matrix (numpy array, required)
    y -- outcome vector (numpy array, required)
    cv -- if True, also print the weighted one-vs-one AUROC from 5-fold
          stratified cross-validation (boolean, default True)
    """
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import StratifiedKFold

    if cv:
        # cross_val_score refits clones of the model on each training fold
        cv_results = cross_val_score(model, x, y, scoring='roc_auc_ovo_weighted', cv=StratifiedKFold(5))
        print("across 5-fold cv on the training set, the model had \n",
              "mean auroc: {:0.3f}".format(np.mean(cv_results)), "\n",
              "std auroc: {:0.3f}".format(np.std(cv_results))
              )

    print("###metrics on provided dataset:###")
    # basic model performance
    y_hat = model.predict(x)  # predicted classes (highest predicted probability)
    y_proba = model.predict_proba(x)  # predicted probabilities, one column per class
    # weighted one-vs-rest AUROC; named auc_score to avoid shadowing sklearn.metrics.auc
    auc_score = roc_auc_score(y, y_proba, multi_class='ovr', average='weighted')

    print('the AUC is: {:0.3f}'.format(auc_score))
    print("confusion matrix:\n ", confusion_matrix(y, y_hat))
    print("classification report:\n ", classification_report(y, y_hat, digits=3))

    # ROC curve for the class labelled 1 versus all other classes
    ez_roc(model, x, y, pos_label=1)
    plt.show()

In [None]:
def ez_roc(model, x, y, pos_label=1):
    """Plot a basic Receiver Operating Characteristic (ROC) curve.

    For a multiclass model, the curve is one-vs-rest for pos_label.

    Keyword arguments:
    model -- a fitted sklearn classifier that implements predict_proba
    x -- predictor matrix (numpy array, required)
    y -- outcome vector (numpy array, required)
    pos_label -- label in y treated as the positive class (int, default 1)
    """
    from sklearn.metrics import roc_curve, auc

    model_name = type(model).__name__  # class name of the model, used in the legend

    # select the probability column that corresponds to pos_label
    pos_col = list(model.classes_).index(pos_label)
    y_proba = model.predict_proba(x)[:, pos_col]

    fpr, tpr, thresholds = roc_curve(y, y_proba, pos_label=pos_label)
    roc_auc = auc(fpr, tpr)

    plt.title('ROC curve')
    plt.plot(fpr, tpr, 'b', label='%s AUC = %0.3f' % (model_name, roc_auc), linewidth=2)
    plt.legend(loc='lower right')
    plt.plot([0, 1], [0, 1], 'r--')  # chance line
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')

In [None]:
x = X_train[0:200000]
y = y_train[0:200000]

In [None]:
y = y.reshape(-1)
y.shape

In [None]:
unique, counts = np.unique(y, return_counts=True)

# class labels with 10 or fewer samples in this subset
to_remove = unique[counts <= 10]
to_remove
# to list the labels that would be kept instead:
#to_keep = unique[counts > 10]

In [None]:
lr= LogisticRegression(penalty='l2', solver='newton-cg', random_state = 12345)
#fit model
lr.fit(x, y)
#evaluate model (on training data)
evaluate_model(lr, x, y, cv = False)

In [None]:
#fit model
lr.fit(x, y)

y_hat = lr.predict(x) # predicted classes (highest predicted probability)
y_proba = lr.predict_proba(x) # predicted probabilities, one column per class

roc_auc_score(y, y_proba, multi_class = 'ovo', average = 'weighted')

In [None]:
#define model
lr= LogisticRegression(penalty='l2', solver='newton-cg', random_state = 12345)
#fit model
lr.fit(X_train, y_train)
#evaluate model (on training data)
evaluate_model(lr, X_train, y_train, cv=True)

In [None]:
y_hat = lr.predict(X_train) # predicted classes (highest predicted probability)
y_proba = lr.predict_proba(X_train) # predicted probabilities, one column per class
evaluate_model(lr, X_train, y_train, cv=True)

In [None]:
#define model
lr= LogisticRegression(penalty='l2', solver='newton-cg', random_state = 12345)

#fit model
lr.fit(X_train, y_train)

# mcTrain_x / mcTrain_y were never defined in this notebook; the training split is used instead
y_hat = lr.predict(X_train) # predicted classes (highest predicted probability)
y_proba = lr.predict_proba(X_train) # predicted probabilities, one column per class

roc_auc_score(y_train, y_proba, multi_class = 'ovo', average = 'weighted')

#evaluate model (on training data)

#evaluate_model(lr, X_train, y_train, cv=True)

In [None]:

# split into training/test sets
# stratified k-folds
# cross-validation
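
The three steps noted above can be sketched end to end on synthetic data. Everything in this block is an assumption made for illustration: the data come from `make_classification`, the variable names are hypothetical, and accuracy is used only as a simple stand-in metric.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic 3-class data standing in for the encoded crime features.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=4,
    n_classes=3, random_state=1,
)

# 1) Stratified train/test split: class proportions are preserved in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# 2) Stratified 5-fold splitter, applied to the training portion only.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# 3) Cross-validation: one score per fold; the test set stays untouched.
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_tr, y_tr, cv=skf, scoring="accuracy"
)
print("fold accuracies:", np.round(scores, 3))
print("mean: {:0.3f}, std: {:0.3f}".format(scores.mean(), scores.std()))
```

Keeping the test split out of the cross-validation loop means it can still serve as an unbiased final check once a model has been chosen.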

In [None]:
# Data to keep in the model 

# Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split



# Random Forest

# Deep Learning

# K-Nearest-Neighbors

In [None]:
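from sklearn.neighbors import NearestNeighbors

# unsupervised nearest-neighbour search; X was undefined, so the training predictors are assumed here
neigh = NearestNeighbors(n_neighbors=2)
neigh.fit(X_train)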

## is there a way to validate these...check past homework assignment perhaps?
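
One possible answer to the question above, offered as a sketch rather than as this notebook's method: the supervised `KNeighborsClassifier` can be validated like any other classifier, for instance by choosing `n_neighbors` with cross-validated grid search. The synthetic data and the candidate values of k below are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data standing in for the encoded crime features.
X_knn, y_knn = make_classification(
    n_samples=400, n_features=6, n_informative=4,
    n_classes=3, random_state=2,
)

# Try a few hypothetical values of k, scoring each with stratified 5-fold CV.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 11]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X_knn, y_knn)

best_k = search.best_params_["n_neighbors"]
print("best k:", best_k)
print("mean cross-validated accuracy: {:0.3f}".format(search.best_score_))
```

On the full crime data, KNN prediction cost grows with the number of training rows, so a subsample such as the one used for logistic regression may be needed.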

In [None]:
# - KNN
# - Logistic Regression
# - Random Forest
# - PyTorch Neural Network
# - Crime location map
# - sub-analysis since civil unrest