#**Private versus Public?: Classifying NYC Colleges' Incident Reports Through Decision Trees and KNN Models**
##Navpreet Kaur and Emily Yih

<style>
    h3 {
        font-family: verdana;
        font-size: 14px; 
        font-weight: 500; 
        color: cadetblue;
    }
    p {
        font-size: 12px;
        color: aliceblue;
    }
</style>
<h3>Abstract:</h3>
<p>In this project, the 2020 annual security reports of various New York City colleges were analyzed to extract their 2019 incident statistics and to classify the type of institution, public or private, at which each incident was reported. Three models were built to classify the data and were compared in terms of accuracy, precision, and recall. Two of the models were decision trees, one using Gini impurity and the other using entropy to decide which attributes to split on, and the third was a k-nearest neighbors (KNN) model. In the end, the decision tree using entropy was determined to be the best model for the dataset, and the decision tree using Gini was the least effective.</p>
<h3>Introduction:</h3>
<p>Crime happens in all parts of New York City and at all types of schools, and it is essential to recognize it immediately and take preventative measures. Therefore, the goal of this project is to build and compare multiple models that classify the type of school where each crime was committed: public or private. All colleges and universities have to create and share an annual security report containing the school's crime statistics from the past years, so a dataset of 525 examples was assembled from the reports of various schools located throughout New York City.</p>
<p>Two of the models use decision trees as the classification technique. A decision tree classifier has a tree-like structure in which the root is at the top and the tree grows downwards. Several decision trees in this project represent the same data, so they are compared with each other in terms of size and comprehensibility. Tree models can be built differently depending on how the attributes are split. The first model was built using Gini impurity, so at each step the attribute chosen for the split was the one that produced the lowest Gini score. The second model was built using entropy, so the best feature to split on was identified by the entropy after the split. As with Gini, the best feature to split on is the one that produces the smallest entropy value. Aside from size and comprehensibility, the decision trees can also be assessed and compared with respect to their accuracy, precision, and recall. The last model, a KNN model, uses the nearest neighbor algorithm, which classifies an example based on the points closest to it. This type of model offers different advantages from decision trees: it is extremely expressive and robust to noise; however, unlike decision trees, it is sensitive to irrelevant and redundant features.</p>
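
<p>The two split criteria described above can be illustrated with a minimal sketch. This is not the code used to build the project's models; the helper names (<code>gini</code>, <code>entropy</code>, <code>weighted_impurity</code>) and the toy label lists are hypothetical and serve only to show how a candidate split would be scored.</p>

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def weighted_impurity(groups, impurity):
    """Impurity of a split: each child's impurity weighted by its share of rows."""
    total = sum(len(group) for group in groups)
    return sum(len(group) / total * impurity(group) for group in groups)


# Hypothetical toy labels (not taken from the project's dataset).
left = ["Public", "Public", "Public", "Private"]
right = ["Private", "Private", "Public", "Private"]
parent = left + right

print("Gini before split:   ", gini(parent))
print("Gini after split:    ", weighted_impurity([left, right], gini))
print("Entropy before split:", entropy(parent))
print("Entropy after split: ", weighted_impurity([left, right], entropy))
```

<p>Under either criterion, a tree builder would score every candidate attribute this way and split on the one with the lowest weighted impurity.</p>
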
<h3>Experiment Methodology:</h3>
<p>The dataset we gathered consists of 25 colleges/universities across all 5 boroughs of NYC [Queens: Queens College, St. John's University, LaGuardia Community College, York College, Plaza College, Queensborough Community College; Manhattan: Fordham University Lincoln Center, Baruch College, Columbia University, NYU, Pace University, NYIT; Brooklyn: Brooklyn College, Pratt Institute, New York City College of Technology, NYU Tandon, LIU, Kingsborough Community College; Bronx: Fordham University Rose Hill, Lehman College, Bronx Community College, Manhattan College; Staten Island: College of Staten Island, Wagner College, St. John's University]. From each college, we collected 2019 data on the incidents that were reported on campus, in residence halls, on non-campus property, and on public property. The type of each reported incident was also identified. The incident types were aggravated assault, arson, burglary, motor vehicle theft, murder/non-negligent manslaughter, manslaughter by negligence, robbery, rape, fondling, incest, statutory rape, domestic violence, dating violence, stalking, drug abuse violations arrest, liquor law violations arrest, weapons possession arrest, unfounded crime, drug abuse violations (referral), liquor law violations (referral), and weapons possession (referral). We also noted whether each college/university was public or private, since this is what we assigned as the class of the dataset.</p>


In [1]:
#View the data collected
import pandas as pd

df = pd.read_csv("incident_report2019_data.csv")
print(df)

                             college_name   location  num_location  \
0    Fordham University at Lincoln Center  Manhattan             0   
1    Fordham University at Lincoln Center  Manhattan             0   
2    Fordham University at Lincoln Center  Manhattan             0   
3    Fordham University at Lincoln Center  Manhattan             0   
4    Fordham University at Lincoln Center  Manhattan             0   
..                                    ...        ...           ...   
520       Queensborough Community College     Queens             4   
521       Queensborough Community College     Queens             4   
522       Queensborough Community College     Queens             4   
523       Queensborough Community College     Queens             4   
524       Queensborough Community College     Queens             4   

    type_of_university  num_type_of_university  length_of_college_in_years  \
0              Private                       0                           4   
1  