# Machine Learning

### In this notebook, we determine which model to fit to predict the location of a school shooting. We also analyze the performance of the "best" model.

In [1]:
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')

xlsx = pd.ExcelFile('/content/drive/MyDrive/Colab Notebooks/DATA 301 Final Project/SSDB_Raw_Data.xlsx')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
df_incidents_updated = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/DATA 301 Final Project/incidents_1991_2017.csv")
df_incidents_updated.head()

Unnamed: 0,Incident_ID,Sources,Number_News,Media_Attention,Reliability,Date,Quarter,School,City,State,...,Officer_Involved,Bullied,Domestic_Violence,Gang_Related,Preplanned,Shots_Fired,Active_Shooter_FBI,Year,Month,Day
0,20171231WAPIM,https://www.heraldnet.com/news/police-say-they...,,,2,2017-12-31,Winter,Pinewood Elementary School,Marysville,WA,...,No,No,No,No,No,60,No,2017,12,31
1,20171231LAEDA,https://www.nola.com/crime/index.ssf/2017/01/a...,,,2,2017-12-31,Winter,Edna Karr High School,Algiers,LA,...,No,No,No,No,No,<30,No,2017,12,31
2,20171227CALIL,https://www.dailynews.com/2017/12/28/man-argui...,,,2,2017-12-27,Winter,Lincoln Elementary School,Lancaster,CA,...,No,No,No,No,No,2,No,2017,12,27
3,20171219MIBEB,http://www.wnem.com/story/37105109/breaking-po...,,,2,2017-12-19,Winter,Beecher High School,Beecher,MI,...,No,No,No,No,No,,No,2017,12,19
4,20171214TXELD,https://www.nbcdfw.com/news/local/Dallas-ISD-G...,,,2,2017-12-14,Winter,Elisha M. Pease Elementary School,Dallas,TX,...,No,No,No,No,No,1,No,2017,12,14


**We want to use the type of weapon to help predict the location of a school shooting, so we read in the weapons data and merge it into the existing dataframe.**

In [3]:
df_weapons_updated = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/DATA 301 Final Project/weapons.csv")
df_weapons_updated.head()

Unnamed: 0,Incident_ID,Weapon_Type
0,19700105DCHIW,Handgun
1,19700105DCSOW,Handgun
2,19700105DCUNW,Handgun
3,19700206OHJOC,Handgun
4,19700323CADAL,Handgun


In [4]:
df_incidents_updated = df_incidents_updated.merge(df_weapons_updated, on="Incident_ID")
df_incidents_updated.head()

Unnamed: 0,Incident_ID,Sources,Number_News,Media_Attention,Reliability,Date,Quarter,School,City,State,...,Bullied,Domestic_Violence,Gang_Related,Preplanned,Shots_Fired,Active_Shooter_FBI,Year,Month,Day,Weapon_Type
0,20171231WAPIM,https://www.heraldnet.com/news/police-say-they...,,,2,2017-12-31,Winter,Pinewood Elementary School,Marysville,WA,...,No,No,No,No,60,No,2017,12,31,No Data
1,20171231LAEDA,https://www.nola.com/crime/index.ssf/2017/01/a...,,,2,2017-12-31,Winter,Edna Karr High School,Algiers,LA,...,No,No,No,No,<30,No,2017,12,31,Multiple Unknown
2,20171227CALIL,https://www.dailynews.com/2017/12/28/man-argui...,,,2,2017-12-27,Winter,Lincoln Elementary School,Lancaster,CA,...,No,No,No,No,2,No,2017,12,27,Handgun
3,20171219MIBEB,http://www.wnem.com/story/37105109/breaking-po...,,,2,2017-12-19,Winter,Beecher High School,Beecher,MI,...,No,No,No,No,,No,2017,12,19,Handgun
4,20171214TXELD,https://www.nbcdfw.com/news/local/Dallas-ISD-G...,,,2,2017-12-14,Winter,Elisha M. Pease Elementary School,Dallas,TX,...,No,No,No,No,1,No,2017,12,14,Handgun
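A merge on `Incident_ID` with the default inner join can both drop and duplicate incident rows, depending on how the weapons table is laid out. The notebook does not show whether `weapons.csv` has more than one row per incident, so the sketch below uses small made-up tables (all values are hypothetical) to illustrate how one might check.

```python
import pandas as pd

# Toy stand-ins for the incidents and weapons tables (hypothetical values).
incidents = pd.DataFrame({
    "Incident_ID": ["A", "B", "C"],
    "Location": ["Classroom", "Parking Lot", "Classroom"],
})
weapons = pd.DataFrame({
    "Incident_ID": ["A", "A", "B"],
    "Weapon_Type": ["Handgun", "Rifle", "Handgun"],
})

# Default inner join: incident A appears twice (two weapons) and
# incident C disappears (no weapon row).
merged = incidents.merge(weapons, on="Incident_ID")
print(merged)

# A left join with indicator=True flags incidents that had no match.
checked = incidents.merge(weapons, on="Incident_ID", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Incident_ID"].tolist()
print("incidents without a weapon row:", unmatched)
```

If duplicated incidents turn up in the real data, they would be counted more than once when fitting and scoring, which is worth knowing before interpreting the metrics.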


**As we worked through fitting various models, we realized that some location values rarely appeared in our dataset. Such rare classes leave too few examples for fitting the model and for cross-validation. Therefore, we decided to keep only the top 3 locations.**

In [5]:
locations = df_incidents_updated["Location"].value_counts().head(n=3)
locations = locations.index.tolist()
locations

['Parking Lot', 'Classroom', 'Beside Building']

In [6]:
# isin() already returns a boolean mask, so no comparison with True is needed
df_incidents_updated = df_incidents_updated[df_incidents_updated["Location"].isin(locations)]
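The top-3 filter above can be sketched on a toy column to see exactly which rows survive. The counts and the extra `"Gym"` category are made up for illustration.

```python
import pandas as pd

# Hypothetical location counts: 4, 3, 2 and 1 rows respectively.
toy = pd.DataFrame({
    "Location": ["Parking Lot"] * 4 + ["Classroom"] * 3
                + ["Beside Building"] * 2 + ["Gym"]
})

# value_counts() sorts by frequency, so head(3) gives the three most common.
top = toy["Location"].value_counts().head(3).index.tolist()

# Keep only rows whose location is one of those three.
filtered = toy[toy["Location"].isin(top)]
print(top)
print(len(toy), "->", len(filtered))
```

Rows in any other category are discarded entirely, so the fitted model can only ever predict one of the three retained locations.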

In [7]:
y_train = df_incidents_updated["Location"]

In [8]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def get_f1(variable):
  ct = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), variable)
    )

  pipeline = make_pipeline(
    ct,
    KNeighborsClassifier(n_neighbors=6)
    )
  
  scores = cross_val_score(pipeline,
                           X=df_incidents_updated[variable],
                           y=y_train,
                           scoring="f1_macro",
                           cv=10)
  
  return scores.mean()

# Give the empty Series an explicit dtype to avoid pandas' default-dtype warning
f1 = pd.Series(dtype=float)
for variable in [["State"],
                 ["School_Level"],
                 ["Time_Period"],
                 ["During_School"],
                 ["Targets"],
                 ["Situation"],
                 ["Bullied"],
                 ["Preplanned"],
                 ["Weapon_Type"],
                 ["State", "School_Level"],
                 ["State", "School_Level", "Time_Period"],
                 ["State", "School_Level", "Time_Period", "During_School"],
                 ["State", "School_Level", "Time_Period", "During_School", "Targets"],
                 ["State", "School_Level", "Time_Period", "During_School", "Targets", "Situation"],
                 ["State", "School_Level", "Time_Period", "During_School", "Targets", "Situation", "Bullied"],
                 ["State", "School_Level", "Time_Period", "During_School", "Targets", "Situation", "Bullied", "Preplanned"],
                 ["State", "School_Level", "Time_Period", "During_School", "Targets", "Situation", "Bullied", "Preplanned", "Weapon_Type"],
                 ["State", "School_Level", "Time_Period", "During_School", "Situation", "Preplanned", "Weapon_Type"],
                 ["State", "School_Level", "Time_Period", "During_School", "Situation", "Bullied", "Weapon_Type"],
                 ["State", "School_Level", "Time_Period", "During_School", "Situation", "Bullied", "Preplanned","Weapon_Type"],
                 ["State", "School_Level", "Time_Period", "During_School", "Situation"],
                 ["State", "School_Level", "Time_Period", "Situation"]
                 ]:
  f1[str(variable)] = get_f1(variable)

f1



['State']                                                                                                                    0.359158
['School_Level']                                                                                                             0.295463
['Time_Period']                                                                                                              0.506604
['During_School']                                                                                                            0.273958
['Targets']                                                                                                                  0.369851
['Situation']                                                                                                                0.487261
['Bullied']                                                                                                                  0.274152
['Preplanned']                                                

**We use the combination of variables that gives us the highest F1 score.**
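Picking the winner from a Series of scores does not have to be done by eye. As a minimal sketch, the Series below holds only the scores that are visible in the output above (the rest of that output is truncated, so this is not the full comparison), and `idxmax` returns the label of the largest value. Since the notebook stores each variable list as a string, `ast.literal_eval` turns it back into a list.

```python
import ast
import pandas as pd

# Only the scores visible in the truncated output; the multi-variable
# combinations are missing here.
f1 = pd.Series({
    "['State']": 0.359158,
    "['School_Level']": 0.295463,
    "['Time_Period']": 0.506604,
    "['During_School']": 0.273958,
    "['Targets']": 0.369851,
    "['Situation']": 0.487261,
    "['Bullied']": 0.274152,
})

best_key = f1.idxmax()                   # label of the highest score
best_columns = ast.literal_eval(best_key)  # back to a list of column names
print(best_columns, f1[best_key])
```

On the full Series from the notebook, the same two lines would return the column list that is then passed to `X_train`.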

In [9]:
X_train = df_incidents_updated[['State', 'School_Level', 'Time_Period', 'During_School', 'Situation', 'Preplanned', 'Weapon_Type']]

In [10]:
from sklearn.metrics import accuracy_score

ct = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ['State', 'School_Level', 'Time_Period', 'During_School', 'Situation', 'Preplanned', 'Weapon_Type'])
    )

pipeline = make_pipeline(
    ct,
    KNeighborsClassifier(n_neighbors=6)
    )
  
pipeline.fit(X=X_train, y=y_train)

y_train_ = pipeline.predict(X_train)
accuracy_score(y_train, y_train_)

0.7149643705463183

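**This model correctly predicts 71.5% of the locations in the training data. Because the model is scored on the same rows it was fit to, this figure is likely optimistic, and it is still a fairly low accuracy even though this is the best model we fit.**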
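The accuracy above is measured on the rows the model was fit to. One way to get an out-of-sample figure is `cross_val_predict`, which predicts each row with a model that did not see it during fitting. The sketch below uses synthetic data (the category values and random labels are assumptions, not the SSDB data) with the same kind of pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (hypothetical): two categorical features and a
# three-class label drawn independently of them.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "State": rng.choice(["CA", "TX", "MI", "WA", "LA"], size=n),
    "Time_Period": rng.choice(["Morning", "Lunch", "Afternoon"], size=n),
})
y = pd.Series(rng.choice(["Parking Lot", "Classroom", "Beside Building"], size=n))

pipeline = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"), ["State", "Time_Period"])
    ),
    KNeighborsClassifier(n_neighbors=6),
)

# In-sample accuracy: fit and predict on the same rows.
train_acc = accuracy_score(y, pipeline.fit(X, y).predict(X))

# Out-of-fold accuracy: each row is predicted by a model that never saw it.
cv_pred = cross_val_predict(pipeline, X, y, cv=10)
cv_acc = accuracy_score(y, cv_pred)

print(f"training accuracy:        {train_acc:.3f}")
print(f"cross-validated accuracy: {cv_acc:.3f}")
```

Applied to `X_train` and `y_train`, the cross-validated number would be the fairer one to quote alongside the macro F1 scores, which were already computed with 10-fold cross-validation.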

In [11]:
from sklearn.model_selection import GridSearchCV

pipeline = make_pipeline(
    ct,
    KNeighborsClassifier()
)

grid_search = GridSearchCV(pipeline,
                           param_grid={
                               "kneighborsclassifier__n_neighbors": range(1, 20)
                           },
                           scoring="f1_macro",
                           cv=10)
grid_search.fit(X_train, y_train)
grid_search.best_estimator_

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('onehotencoder',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['State', 'School_Level',
                                                   'Time_Period',
                                                   'During_School', 'Situation',
                                                   'Preplanned',
                                                   'Weapon_Type'])])),
                ('kneighborsclassifier', KNeighborsClassifier(n_neighbors=6))])

**Among k = 1 through 19, the value with the best cross-validated macro F1 score is k = 6.**
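Instead of reading k off the printed pipeline, the fitted search object exposes it directly through `best_params_`, along with the matching cross-validated score in `best_score_`. A small sketch on synthetic numeric data (generated with `make_classification`, so not the notebook's dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class problem standing in for the real data.
X, y = make_classification(n_samples=150, n_features=5, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)

grid = GridSearchCV(
    make_pipeline(StandardScaler(), KNeighborsClassifier()),
    param_grid={"kneighborsclassifier__n_neighbors": range(1, 20)},
    scoring="f1_macro",
    cv=5,
)
grid.fit(X, y)

# The parameter key is "<step name>__<parameter name>".
best_k = grid.best_params_["kneighborsclassifier__n_neighbors"]
print("best k:", best_k)
print("mean cross-validated macro F1:", round(grid.best_score_, 3))
```

`grid.cv_results_["mean_test_score"]` holds the score for every k tried, which helps judge whether the best k is a clear winner or only marginally ahead of its neighbors.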

In [12]:
from sklearn.metrics import precision_score, recall_score, f1_score

(precision_score(y_train == "Parking Lot", y_train_ == "Parking Lot"), 
 recall_score(y_train == "Parking Lot", y_train_ == "Parking Lot"),
 f1_score(y_train == "Parking Lot", y_train_ == "Parking Lot"))

(0.7897727272727273, 0.695, 0.7393617021276595)

**79% of shootings predicted to be in a parking lot were actually in a parking lot. 69.5% of shootings actually in a parking lot were predicted to be in a parking lot. Our model is decent.**

In [13]:
(precision_score(y_train == "Classroom", y_train_ == "Classroom"), 
 recall_score(y_train == "Classroom", y_train_ == "Classroom"),
 f1_score(y_train == "Classroom", y_train_ == "Classroom"))

(0.7377049180327869, 0.8108108108108109, 0.7725321888412018)

**74% of shootings predicted to be in a classroom were actually in a classroom. 81% of shootings actually in a classroom were predicted to be in a classroom. Our model is decent.**

In [14]:
(precision_score(y_train == "Beside Building", y_train_ == "Beside Building"), 
 recall_score(y_train == "Beside Building", y_train_ == "Beside Building"),
 f1_score(y_train == "Beside Building", y_train_ == "Beside Building"))

(0.5853658536585366, 0.6545454545454545, 0.6180257510729614)

**58.5% of shootings predicted to be beside a building were actually beside a building. 65.5% of shootings actually beside a building were predicted to be beside a building. This is the weakest of the three classes, but our model is still decent.**
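The three cells above compute one-vs-rest metrics one class at a time. As a sketch, `precision_recall_fscore_support` with `average=None` returns the same per-class numbers in a single call, and the macro F1 used for model selection is just the unweighted mean of the per-class F1 values. The labels below are toy values, not the notebook's predictions.

```python
from sklearn.metrics import f1_score, precision_recall_fscore_support

labels = ["Parking Lot", "Classroom", "Beside Building"]

# Toy true and predicted labels (hypothetical).
y_true = ["Parking Lot", "Parking Lot", "Parking Lot",
          "Classroom", "Classroom", "Beside Building"]
y_pred = ["Parking Lot", "Parking Lot", "Classroom",
          "Classroom", "Parking Lot", "Beside Building"]

# average=None returns one value per class, in the order given by labels.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, average=None
)
for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name:16s} precision={p:.3f} recall={r:.3f} f1={f:.3f}")

# Macro F1 is the unweighted mean of the per-class F1 scores.
macro_f1 = f1_score(y_true, y_pred, average="macro")
print("macro F1:", round(macro_f1, 3))
```

With the notebook's `y_train` and `y_train_` in place of the toy lists, this would reproduce the three tuples shown above.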