<h3>ML Homicide-Free Community Area Prediction Based on Standardized Hardship Indicators</h3>

Dataset: HIHOM20142017.xlsx

We will use the 2014 hardship index (HI) data to train a classification model that answers the question: did this community have at least one homicide? (1 = at least one homicide, 0 = homicide-free).

We will then test the model on 2017 HI data.

Step 1) Import the usual libraries and hardship index + homicide data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df=pd.read_excel("standardizedindicators.xlsx")
df.head(2)

Unnamed: 0,Community,index,HI14,UNEMP14,NOHS14,DEP14,HOUS14,POV14,INC14,HI17,UNEMP17,NOHS17,DEP17,HOUS17,POV17,INC17,HOM14,HOM17,LAT,LON
0,Rogers Park,0,39.7,-0.788798,-0.261084,-1.21849,0.720716,0.137627,-0.094348,39.4,-0.631354,-0.273276,-1.430761,0.865889,0.142165,-0.181587,8,4,41.97,-87.72
1,West Ridge,1,44.3,-0.705481,-0.199145,0.441181,0.868467,-0.431007,-0.148259,47.3,-0.518122,0.083923,0.385205,1.24907,-0.390925,-0.248222,3,2,41.81,-87.73


Step 2) Create two new columns H14 and H17 with value 0 if the community has 0 homicides and 1 if the community has at least 1 homicide in the given year.

In [2]:
# 1 if the community had at least one homicide that year, 0 if it had none
df["H14"]=(df["HOM14"]>0).astype(int)
df["H17"]=(df["HOM17"]>0).astype(int)
df.head(5)

Unnamed: 0,Community,index,HI14,UNEMP14,NOHS14,DEP14,HOUS14,POV14,INC14,HI17,...,DEP17,HOUS17,POV17,INC17,HOM14,HOM17,LAT,LON,H14,H17
0,Rogers Park,0,39.7,-0.788798,-0.261084,-1.21849,0.720716,0.137627,-0.094348,39.4,...,-1.430761,0.865889,0.142165,-0.181587,8,4,41.97,-87.72,1,1
1,West Ridge,1,44.3,-0.705481,-0.199145,0.441181,0.868467,-0.431007,-0.148259,47.3,...,0.385205,1.24907,-0.390925,-0.248222,3,2,41.81,-87.73,1,1
2,Uptown,2,29.9,-0.693579,-0.756591,-1.801618,-0.254438,0.129011,0.74582,31.5,...,-1.725662,0.204031,0.02268,0.511639,5,5,41.8333,-87.6333,1,1
3,Lincoln Square,3,23.8,-0.95543,-0.774288,-1.397914,-0.904541,-0.835943,0.898993,21.7,...,-1.492846,-0.841007,-0.942398,0.890934,0,1,41.75,-87.71,0,1
4,North Center,4,14.9,-1.348205,-1.181312,-1.128779,-1.200043,-1.404576,2.034007,16.9,...,-0.499497,-1.189353,-1.457106,2.050376,0,0,41.74,-87.66,0,0


Step 3) Import Machine Learning libraries:

In [3]:
from sklearn import datasets
from sklearn import model_selection
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

Step 4) Create the training and validation sets using 2014 and 2017 data respectively.

In [4]:
homicide_data14=df[["UNEMP14","NOHS14","DEP14","HOUS14","POV14","INC14","H14"]]
homicide_data14.columns=["UNEMP","NOHS","DEP","HOUS","POV","INC","H"]
homicide_data14.head(2)

Unnamed: 0,UNEMP,NOHS,DEP,HOUS,POV,INC,H
0,-0.788798,-0.261084,-1.21849,0.720716,0.137627,-0.094348,1
1,-0.705481,-0.199145,0.441181,0.868467,-0.431007,-0.148259,1


In [5]:
homicide_data17=df[["UNEMP17","NOHS17","DEP17","HOUS17","POV17","INC17","H17"]]
homicide_data17.columns=["UNEMP","NOHS","DEP","HOUS","POV","INC","H"]
homicide_data17.head(2)

Unnamed: 0,UNEMP,NOHS,DEP,HOUS,POV,INC,H
0,-0.631354,-0.273276,-1.430761,0.865889,0.142165,-0.181587,1
1,-0.518122,0.083923,0.385205,1.24907,-0.390925,-0.248222,1


Step 4, continued) Create the input dataset X and output dataset y used for training, and the validation sets X_valid and y_valid, by executing the following commands. (The validation set will be used later to check how well our model predicts the output.)

In [6]:
training = homicide_data14
validation = homicide_data17
X=training[["UNEMP","NOHS","DEP","HOUS","POV","INC"]]
y=training["H"]
X_valid=validation[["UNEMP","NOHS","DEP","HOUS","POV","INC"]]
y_valid=validation["H"]

Step 5) Execute the following to compare the prediction accuracy of different ML classification algorithms: K Nearest Neighbors, Decision Tree and Support Vector Machines. Using n_splits=7, the 77 communities are split into 7 groups of 11. Because KFold does not shuffle by default, the groups are consecutive blocks of rows rather than random samples. Each group is used to test the accuracy of the model built on the other 6 groups. The reported accuracy is the mean over the 7 splits, with the standard deviation in parentheses.

In [7]:
seed=7
score='accuracy'
models=[]
models.append(('KNN',KNeighborsClassifier(n_neighbors=5)))
models.append(('DTC',DecisionTreeClassifier(random_state = seed)))
models.append(('SVM',SVC(gamma='auto')))
#Evaluate each model
results=[]
names=[]
for name, model in models:
    kfold=model_selection.KFold(n_splits=7)
    cv_results=model_selection.cross_val_score(model,X,y,cv=kfold,scoring=score)
    results.append(cv_results)
    names.append(name)
    msg="%s: %f (%f)" % (name, cv_results.mean(),cv_results.std())
    print(msg)

KNN: 0.818182 (0.068721)
DTC: 0.688312 (0.144617)
SVM: 0.831169 (0.075727)


In predicting whether a community will have at least one homicide, KNN has an estimated accuracy of 81.8%, DTC 68.8% and SVM 83.1%.
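The 7-fold split described in Step 5 can be sketched in plain Python. This is only an illustration of how unshuffled k-fold partitions 77 rows; the helper `consecutive_folds` is a hypothetical name, not part of scikit-learn, and it is not the library's implementation.

```python
# Plain-Python sketch of unshuffled k-fold splitting over 77 rows.
# It mimics the default behaviour of KFold(n_splits=7); it is not scikit-learn's code.

def consecutive_folds(n_samples, n_splits):
    """Return a list of (train_indices, test_indices) pairs for unshuffled k-fold."""
    # When n_samples is not divisible by n_splits, the first folds get one extra row.
    base, extra = divmod(n_samples, n_splits)
    folds = []
    start = 0
    for k in range(n_splits):
        size = base + (1 if k < extra else 0)
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds

folds = consecutive_folds(n_samples=77, n_splits=7)
for k, (train, test) in enumerate(folds):
    print(f"fold {k}: test rows {test[0]}-{test[-1]}, train size {len(train)}")
```

Each fold holds out 11 consecutive rows and trains on the remaining 66, so the order of the communities in the spreadsheet determines which ones are grouped together.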

Step 6) Executing the next cell gives more information about the accuracy of the SVM model in a single trial: the model is trained on all of the 2014 data and used to predict which communities had at least one homicide in 2017 (as opposed to the previous result, which reports an average accuracy over several train/test splits of the 2014 data).

In [8]:
#Import required metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
svm=SVC()
svm.fit(X,y)
predictions=svm.predict(X_valid)
print(accuracy_score(y_valid,predictions))
print(confusion_matrix(y_valid,predictions))
print(classification_report(y_valid,predictions))

0.9090909090909091
[[ 5  6]
 [ 1 65]]
             precision    recall  f1-score   support

          0       0.83      0.45      0.59        11
          1       0.92      0.98      0.95        66

avg / total       0.90      0.91      0.90        77



<table style="width:40%">
  <tr>
    <th>TRUE POSITIVE (TP)</th>
    <th>FALSE NEGATIVE (FN)</th>
  </tr>
  <tr>
    <td>FALSE POSITIVE (FP)</td>
    <td>TRUE NEGATIVE (TN)</td>
  </tr>
</table>
<p>Note: scikit-learn prints the confusion matrix with rows as actual classes and columns as predicted classes, ordered 0 then 1. With class 1 (at least one homicide) as the positive class, the matrices printed in this notebook therefore read [[TN, FP], [FN, TP]].</p>
<ul>
  <li><b>Accuracy</b>: the number of correct predictions out of the total dataset, (TP+TN)/(TP+TN+FP+FN).</li>
  <li><b>Recall (true positive rate)</b>: how many of the positives in the dataset are predicted as positive, TP/(TP+FN). High recall means most positives are correctly predicted.</li>
  <li><b>Precision</b>: a measure of the correctness of a positive prediction, TP/(TP+FP). If a result is predicted as positive, how sure can you be of the prediction?</li>
</ul>
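As a worked check of these three formulas, the sketch below recomputes them from the SVM confusion matrix printed in Step 6. It assumes class 1 (at least one homicide) is the positive class and uses scikit-learn's layout, where rows are actual classes and columns are predicted classes.

```python
# Confusion matrix printed for the SVM in Step 6.
# scikit-learn layout: rows = actual class, columns = predicted class, classes ordered 0, 1.
svm_matrix = [[5, 6],
              [1, 65]]

# Treat class 1 (at least one homicide) as the positive class.
(tn, fp), (fn, tp) = svm_matrix

accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct predictions / all predictions
recall = tp / (tp + fn)                      # positives found / all actual positives
precision = tp / (tp + fp)                   # correct positives / all predicted positives

print(f"accuracy  = {accuracy:.2f}")
print(f"recall    = {recall:.2f}")
print(f"precision = {precision:.2f}")
```

The values agree with the accuracy score and the class 1 row of the classification report in Step 6 (precision 0.92, recall 0.98).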

Step 7) KNN model accuracy for a single trial:

In [9]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
predictions=knn.predict(X_valid)
print(accuracy_score(y_valid,predictions))
print(confusion_matrix(y_valid,predictions))
print(classification_report(y_valid,predictions))

0.8701298701298701
[[ 6  5]
 [ 5 61]]
             precision    recall  f1-score   support

          0       0.55      0.55      0.55        11
          1       0.92      0.92      0.92        66

avg / total       0.87      0.87      0.87        77



Step 8) DTC model accuracy for a single trial:

In [10]:
dtc = DecisionTreeClassifier(random_state = seed)
dtc.fit(X, y)
predictions=dtc.predict(X_valid)
print(accuracy_score(y_valid,predictions))
print(confusion_matrix(y_valid,predictions))
print(classification_report(y_valid,predictions))

0.7792207792207793
[[ 5  6]
 [11 55]]
             precision    recall  f1-score   support

          0       0.31      0.45      0.37        11
          1       0.90      0.83      0.87        66

avg / total       0.82      0.78      0.80        77



Conclusions: comparing model accuracies of single trials, SVM has 90.9%, KNN has 87.0%, and DTC has 77.9% accuracy.
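One way to put these single-trial accuracies in context is a majority-class baseline, that is, the accuracy of always predicting the more common class. This baseline is not part of the original analysis. The sketch below uses only the class counts from the "support" column and the accuracies printed in Steps 6 to 8.

```python
# Class counts in the 2017 validation set, taken from the "support" column above.
support = {0: 11, 1: 66}   # 0 = homicide-free, 1 = at least one homicide

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

# Single-trial accuracies printed in Steps 6-8, rounded to four decimals.
model_accuracy = {"SVM": 0.9091, "KNN": 0.8701, "DTC": 0.7792}

print(f"always predict {majority_class}: {baseline_accuracy:.3f}")
for name, acc in model_accuracy.items():
    print(f"{name}: {acc:.3f} ({acc - baseline_accuracy:+.3f} vs baseline)")
```

Because 66 of the 77 communities had at least one homicide in 2017, a model that always predicts 1 would already reach about 85.7% accuracy, so accuracy alone says little about how well the homicide-free communities are identified.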

<h3> Questions</h3>
1) Why does the model appear to predict only positives?

2) What is the confusion matrix for KNN and DTC?
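For Question 2, a minimal sketch of how the matrices printed in Steps 7 and 8 can be read cell by cell. It assumes class 1 (at least one homicide) is the positive class; `named_cells` is a hypothetical helper introduced only for this illustration.

```python
# Matrices printed in Steps 7 and 8 (rows = actual class, columns = predicted class).
matrices = {
    "KNN": [[6, 5], [5, 61]],
    "DTC": [[5, 6], [11, 55]],
}

def named_cells(matrix):
    """Label the four cells, treating class 1 (at least one homicide) as positive."""
    (tn, fp), (fn, tp) = matrix
    return {"TN": tn, "FP": fp, "FN": fn, "TP": tp}

cells = {name: named_cells(m) for name, m in matrices.items()}
for name, c in cells.items():
    print(name, c)
```

Read this way, TN counts the homicide-free communities each model identified correctly, and FP counts the homicide-free communities it predicted would have a homicide.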

Notes: 
I changed line 11 so that 0 indicates neighborhoods with 0 homicides, and 1 indicates neighborhoods with 1 or more homicides in the given year.
I removed "random_state = seed" from the arguments of the KFold function because it was triggering a warning message. This did not change the output or accuracy of the KNN method: KFold only uses random_state when shuffle=True, so among the calls here the parameter only matters for the DTC function.

I added code for single trials of KNN and DTC methods (lines 20 and 21).
SVM had the highest accuracy using both methods, the split method and single trials (83.1%, 90.9%), followed by KNN (81.8%, 87.0%) and then DTC (68.8%, 77.9%).
Overall, the single-trial accuracies on the 2017 data are higher than the cross-validated estimates on the 2014 data.