#**Private versus Public?: Classifying NYC Colleges' Incident Reports Through Decision Trees and KNN Models**\
##Navpreet Kaur and Emily Yih

###Abstract:
<p>In this project, the 2020 annual security reports of various New York City colleges were analyzed to extract their 2019 incident statistics and to classify the type of institution at which each incident was reported: public or private. Three models were built to classify the data and were compared in terms of accuracy, precision, and recall. Two of the models were decision trees, one using Gini impurity and the other using entropy to decide which attribute to split on; the third was a k-nearest neighbors (KNN) model. In the end, the decision tree using entropy was determined to be the best model for the dataset, and the decision tree using Gini the least effective.</p>

###Introduction:
<p>Crime occurs in all parts of New York City and at all types of schools, and it is essential to recognize it promptly and take preventative measures. The goal of this project is therefore to build and compare multiple models that classify the type of school where each crime was committed: public or private. All colleges and universities have to create and share an annual security report containing the school's crime statistics from the past years, so by collecting the reports of various schools located throughout New York City, a dataset of 525 examples was assembled.</p>
<p>Two of the models are decision tree classifiers. A decision tree has a tree-like structure in which the root is at the top and the tree grows downwards. Several decision trees in this project represent the same data, so they are compared with each other in terms of size and comprehensibility. Tree models differ depending on how the attribute to split on is chosen. The first model was built using Gini, so each split used whichever attribute produced the lowest Gini score. The second model was built using entropy, so the best feature to split on was identified by the entropy after the split; as with Gini, the best feature is the one that produces the smallest entropy value. Aside from size and comprehensibility, the decision trees can also be assessed and compared with respect to their accuracy, precision, and recall. The last model, a KNN model, uses the nearest neighbor algorithm, which classifies an example according to the closest points in the training data. This type of model offers different advantages from decision trees: it is extremely expressive and robust to noise. However, unlike decision trees, it is sensitive to irrelevant and redundant features.</p>
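<p>To make the two splitting criteria concrete, the sketch below computes Gini impurity and entropy for two candidate splits of a small, made-up set of labels. The labels, the two splits, and the helper names (<code>gini</code>, <code>entropy</code>, <code>weighted_impurity</code>) are illustrative assumptions and are not taken from the project's dataset or code.</p>

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum(-p_k * log2(p_k)) over the class proportions p_k."""
    n = len(labels)
    probs = [count / n for count in Counter(labels).values()]
    return sum(-p * log2(p) for p in probs)


def weighted_impurity(groups, impurity):
    """Size-weighted average impurity of the child nodes produced by a split."""
    total = sum(len(group) for group in groups)
    return sum(len(group) / total * impurity(group) for group in groups)


# Hypothetical toy labels (0 = private, 1 = public), not real data.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
split_a = ([0, 0, 0, 1], [0, 1, 1, 1])  # both children are still mixed
split_b = ([0, 0, 0, 0], [1, 1, 1, 1])  # both children are pure

for name, split in [("split_a", split_a), ("split_b", split_b)]:
    print(name,
          "gini =", round(weighted_impurity(split, gini), 3),
          "entropy =", round(weighted_impurity(split, entropy), 3))
```

<p>Under either criterion the second split scores lower (zero impurity in both children), so a tree builder would prefer it. The two criteria usually rank splits similarly, which is why trees built with Gini and with entropy often differ only in a few nodes.</p>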

###Experiment Methodology:
<p>The dataset we gathered consisted of 25 colleges/universities across NYC in all 5 boroughs [Queens: Queens College, St. John's University, LaGuardia Community College, York College, Plaza College, Queensborough Community College; Manhattan: Fordham University Lincoln Center, Baruch College, Columbia University, NYU, Pace University, NYIT; Brooklyn: Brooklyn College, Pratt Institute, New York City College of Technology, NYU Tandon, LIU, Kingsborough Community College; Bronx: Fordham University Rose Hill, Lehman College, Bronx Community College, Manhattan College; Staten Island: College of Staten Island, Wagner College, St. John's University]. From each college, we collected 2019 data on the incidents that were reported on campus, in residence halls, on non-campus property, and on public property. The type of each reported incident was also recorded. The incident types were aggravated assault, arson, burglary, motor vehicle theft, murder/non-negligent manslaughter, manslaughter by negligence, robbery, rape, fondling, incest, statutory rape, domestic violence, dating violence, stalking, drug abuse violations arrest, liquor law violations arrest, weapons possession arrest, unfounded crime, drug abuse violations (referral), liquor law violations (referral), and weapons possession (referral). We also noted whether each college/university was public or private, since this is what we assigned as the class of the dataset.</p>

In [2]:
#View the data collected
import pandas as pd

df = pd.read_csv("incident_report2019_data.csv")
print(df)

                             college_name   location  num_location  \
0    Fordham University at Lincoln Center  Manhattan             0   
1    Fordham University at Lincoln Center  Manhattan             0   
2    Fordham University at Lincoln Center  Manhattan             0   
3    Fordham University at Lincoln Center  Manhattan             0   
4    Fordham University at Lincoln Center  Manhattan             0   
..                                    ...        ...           ...   
520       Queensborough Community College     Queens             4   
521       Queensborough Community College     Queens             4   
522       Queensborough Community College     Queens             4   
523       Queensborough Community College     Queens             4   
524       Queensborough Community College     Queens             4   

    type_of_university  num_type_of_university  length_of_college_in_years  \
0              Private                       0                           4   
1  

<p>Some attributes were non-numerical, so we later converted them to numerical values so that they could be used with the Python decision tree algorithm. The dataset also included many values that were reported as '0', which was a bit of a concern. Some schools do not have residence halls, so there were also missing data points; those values were replaced with 0's. Due to time constraints, we were only able to gather data from 25 schools, but if we explore this topic further, we would like to increase that number and gather more data that could be useful.</p>
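<p>The conversion described here can be sketched in pandas roughly as follows. The three-row sample is made up. The codes for Manhattan (0), Queens (4), and Private (0) match the printed output above, while the code for Public (1) and the use of <code>map</code> and <code>fillna</code> are assumptions about how the encoding might have been done.</p>

```python
import pandas as pd

# Hypothetical mini-sample, not rows from incident_report2019_data.csv.
raw = pd.DataFrame({
    "location": ["Manhattan", "Queens", "Queens"],
    "type_of_university": ["Private", "Public", "Private"],
    "residence_halls": [2, None, 0],  # None: the school has no residence halls
})

# Manhattan = 0, Queens = 4 and Private = 0 follow the printed data;
# Public = 1 is an assumption made for this sketch.
location_codes = {"Manhattan": 0, "Queens": 4}
type_codes = {"Private": 0, "Public": 1}

encoded = raw.assign(
    num_location=raw["location"].map(location_codes),
    num_type_of_university=raw["type_of_university"].map(type_codes),
    # Missing counts are replaced with 0, as described in the text.
    residence_halls=raw["residence_halls"].fillna(0).astype(int),
)
print(encoded)
```

<p>One caveat of filling with 0 is that "no residence halls" and "residence halls with no incidents" become indistinguishable to the model.</p>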
<p>We also produced some scatter plots in Weka to get a better understanding of the dataset and to identify patterns and outliers. Below is an example of some of the scatter plots produced.</p>
<img src = "scatter_plots.jpg">
<p>Using the dataset with the type of university as the class, we first built decision tree models and obtained the accuracy and confusion matrix for each of them. Then, we built a KNN model to see whether it was a better classifier. In total, three models were used: two decision tree models and one nearest-neighbor model. For all three models, we used accuracy, precision, and recall to compare the models and to understand what each measure tells us about a model's behavior.</p>
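<p>As a reminder of how the three measures relate to the confusion-matrix counts, here is a minimal sketch in plain Python. The function name <code>classification_metrics</code>, the choice of class 1 as the positive class, and the example labels are all hypothetical and are not results from this project.</p>

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision and recall for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return {
        # Share of all predictions that are correct.
        "accuracy": (tp + tn) / len(pairs),
        # Of the examples predicted positive, the share that really are positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of the truly positive examples, the share that were found.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical labels: tp = 2, fn = 2, fp = 1, tn = 5.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
print(classification_metrics(y_true, y_pred))
```

<p>In this toy case accuracy is 0.7 while recall is only 0.5, which illustrates why accuracy alone can hide how often one class is missed.</p>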
<p>The first decision tree used the default scikit-learn settings, under which splits are chosen using Gini. We experimented with the size of the test set and, for the final model discussed in the next section, held out 2% of the examples for testing, the same fraction that one fold of 50-fold cross-validation would hold out. Note that the code performs a single such split with <code>train_test_split</code> rather than rotating through all 50 folds. There was no pruning in this model, so <code>max_depth</code> kept its default of <code>None</code>, which lets the tree keep growing and can lead to overfitting. We then calculated the test accuracy to determine the best model for the test data. This step was done for all three models.</p>
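<p>For comparison, a full 50-fold cross-validation, in which every example is used for testing exactly once, could be sketched as below. This is not the procedure used for the reported results. The data here are synthetic stand-ins generated with NumPy, and the use of <code>KFold</code> with <code>cross_val_score</code> is only one possible way to set it up.</p>

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (500 rows, 6 integer features); not the incident dataset.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(500, 6))
y = (X[:, 0] + X[:, 1] > 4).astype(int)

# 50 folds: each fold trains on 98% of the rows and tests on the remaining 2%.
cv = KFold(n_splits=50, shuffle=True, random_state=1)
scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, y,
                         cv=cv, scoring="accuracy")

print("folds:", len(scores))
print("mean accuracy:", scores.mean())
print("std of fold accuracies:", scores.std())
```

<p>Averaging over all folds gives a more stable estimate than a single 2% hold-out, where one or two test examples can move the accuracy by several percentage points.</p>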

In [9]:
pip install graphviz



In [17]:
#Decision Tree #1
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import confusion_matrix

from sklearn.tree import export_graphviz
from io import StringIO #standard library; six is not needed on Python 3
from IPython.display import Image
import pydotplus
import graphviz 
from sklearn.tree import export_text

#Feature selection: choose the independent variables (X) and the dependent variable (y)
features = ['num_location','num_specified_type','on_campus','residence_halls',
'noncampus_property','public_property']

X = df[features]
y = df.num_type_of_university #Target variable

#Split the data into a training set and a test set
#This is a single hold-out split: train on 98% of the data and test on 2%,
# the same proportions as one fold of a 50-fold cross-validation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.02,
random_state=1)

model = DecisionTreeClassifier()
model = model.fit(X_train,y_train)
y_pred = model.predict(X_test)

print("Accuracy: ", metrics.accuracy_score(y_test,y_pred))

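#Print the confusion matrix so that we get a better understanding
#Note: scikit-learn's convention is confusion_matrix(y_true, y_pred); with the arguments
# in the order used here, rows are predicted classes and columns are actual classes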
conf_matrix = confusion_matrix(y_pred, y_test)
print(conf_matrix)

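#Draw the decision tree (rendering needs the Graphviz executables on the system PATH,
# which `pip install graphviz` alone does not provide)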
dot_data = StringIO()
export_graphviz(model, out_file=dot_data, filled=True, rounded=True, special_characters=True, feature_names=features,
class_names=['0','1'])
graph = pydotplus.graphviz.graph_from_dot_data(dot_data.getvalue())
graph.write_png('typemodel1.png')
Image(graph.create_png())

Accuracy:  0.6363636363636364
[[6 3]
 [1 1]]


InvocationException: GraphViz's executables not found
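
<p>The <code>InvocationException</code> above occurs because pydotplus calls the Graphviz <code>dot</code> program, and the pip package only installs the Python bindings. One way to inspect a tree without Graphviz is scikit-learn's <code>export_text</code>, which the cell above imports but does not use. The sketch below applies it to a tiny made-up dataset; the feature values and labels are hypothetical.</p>

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data (0 = private, 1 = public); not the incident dataset.
features = ["on_campus", "residence_halls"]
X = [[0, 0], [1, 0], [5, 2], [7, 3], [0, 1], [6, 0]]
y = [1, 1, 0, 0, 1, 0]

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Plain-text rendering of the fitted tree; no Graphviz executables required.
rules = export_text(model, feature_names=features)
print(rules)
```

<p>The same call should work on the fitted <code>model</code> and <code>features</code> list from the cell above, and <code>sklearn.tree.plot_tree</code> offers a matplotlib-based drawing if a picture is preferred.</p>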