# 1. Introduction
Road traffic injuries are currently estimated to be the eighth leading cause of death across all age groups globally.

The machine learning solution developed here aims to predict the probability of a collision and its severity from the traffic records provided by the Seattle Police Department. The solution is intended to incorporate a range of factors, including weather conditions, road conditions, location, special events, construction, traffic jams, vehicle type, and more.

Local, state, and federal governments would be highly interested in accurate predictions of accident severity, both to reduce the costs incurred from accident damage and to save lives.

# 2. Data
The dataset used for this project covers car accidents that took place in the city of Seattle, Washington, from 2004 to 2020. It records the severity of each accident along with the time and conditions under which it occurred. The model aims to predict accident severity, which is given by the SEVERITYCODE variable in the training dataset. The variable takes two values: 1 (Property Damage Only Collision) and 2 (Injury Collision).

The data is labeled, so we built a supervised learning model. In total there are 37 attributes. Some attributes are not useful for making predictions, so they were removed (e.g., COLDETKEY, INCKEY, and REPORTNO).

From a feature engineering standpoint, we used the following variables to make our predictions: WEATHER, ROADCOND, SPEEDING, LIGHTCOND, UNDERINFL, and INATTENTIONIND.

# 3. Data Ingestion and Cleaning

In [87]:
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn import preprocessing
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

import matplotlib.pyplot as plt

import datetime as dt

In [44]:
df = pd.read_csv(r"C:\Users\575862\Documents\Python Scripts\Data-Collisions.csv", index_col=0)
df.head(5)

INCKEY,SEVERITYCODE,X,Y,OBJECTID,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,LOCATION,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
1307,2,-122.323148,47.70314,1,1307,3502005,Matched,Intersection,37475.0,5TH AVE NE AND NE 103RD ST,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
52200,1,-122.347294,47.647172,2,52200,2607959,Matched,Block,,AURORA BR BETWEEN RAYE ST AND BRIDGE WAY N,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
26700,1,-122.33454,47.607871,3,26700,1482393,Matched,Block,,4TH AVE BETWEEN SENECA ST AND UNIVERSITY ST,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
1144,1,-122.334803,47.604803,4,1144,3503937,Matched,Block,,2ND AVE BETWEEN MARION ST AND MADISON ST,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
17700,2,-122.306426,47.545739,5,17700,1807429,Matched,Intersection,34387.0,SWIFT AVE S AND SWIFT AV OFF RP,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 194673 entries, 1307 to 308220
Data columns (total 36 columns):
SEVERITYCODE      194673 non-null int64
X                 189339 non-null float64
Y                 189339 non-null float64
OBJECTID          194673 non-null int64
COLDETKEY         194673 non-null int64
REPORTNO          194673 non-null object
STATUS            194673 non-null object
ADDRTYPE          192747 non-null object
INTKEY            65070 non-null float64
LOCATION          191996 non-null object
EXCEPTRSNCODE     84811 non-null object
EXCEPTRSNDESC     5638 non-null object
SEVERITYDESC      194673 non-null object
COLLISIONTYPE     189769 non-null object
PERSONCOUNT       194673 non-null int64
PEDCOUNT          194673 non-null int64
PEDCYLCOUNT       194673 non-null int64
VEHCOUNT          194673 non-null int64
INCDATE           194673 non-null object
INCDTTM           194673 non-null object
JUNCTIONTYPE      188344 non-null object
SDOT_COLCODE      194673 non-null

In [41]:
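#Convert Severity Code from (1/2) to (0/1)
#Note: the later cells still filter on the raw codes 1 and 2, so update those filters if this cell is applied
severity_code = df["SEVERITYCODE"].values

labels = preprocessing.LabelEncoder()
labels.fit([1, 2])  #fit on the raw codes; an encoder fitted on [0, 1] rejects the label 2
severity_code = labels.transform(severity_code)

df["SEVERITYCODE"] = severity_code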
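As a quick check of the encoding step, the sketch below runs `LabelEncoder` on a small hypothetical array of severity codes. It assumes the raw codes are exactly 1 and 2, as described in the Data section. An encoder fitted on `[0, 1]` has never seen the value 2 and raises a `ValueError`, so the encoder here is fitted on the raw codes instead.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of raw severity codes (1 = property damage, 2 = injury)
raw_codes = np.array([1, 2, 2, 1, 1])

encoder = LabelEncoder()
encoder.fit([1, 2])                  # learn the labels that occur in the data
encoded = encoder.transform(raw_codes)

print(encoder.classes_)              # sorted labels the encoder learned
print(encoded)                       # codes remapped to 0 and 1

# An encoder fitted on [0, 1] has never seen the label 2, so transform fails
mismatched = LabelEncoder().fit([0, 1])
try:
    mismatched.transform(raw_codes)
    failed = False
except ValueError:
    failed = True
print(failed)
```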

In [46]:
#Encode inattention (0 = No, 1 = Yes)
df["INATTENTIONIND"] = df["INATTENTIONIND"].replace("Y", 1).replace(np.nan, 0)

#Encode under the influence (0 = No, 1 = Yes)
df["UNDERINFL"] = df["UNDERINFL"].replace("N", 0).replace("Y", 1)

#Encode speeding (0 = No, 1 = Yes)
df["SPEEDING"] = df["SPEEDING"].replace("Y", 1).replace(np.nan, 0)

#Encode light conditions (0 = Light, 1 = Medium, 2 = Dark)
#"Other" is folded into the existing "Unknown" category
df["LIGHTCOND"] = df["LIGHTCOND"].replace({
    "Daylight": 0,
    "Dark - Street Lights On": 1,
    "Dusk": 1,
    "Dawn": 1,
    "Dark - No Street Lights": 2,
    "Dark - Street Lights Off": 2,
    "Dark - Unknown Lighting": 2,
    "Other": "Unknown",
})

#Encode weather conditions (0 = Clear, 1 = Overcast and Cloudy, 2 = Windy, 3 = Rain and Snow)
df["WEATHER"] = df["WEATHER"].replace({
    "Clear": 0,
    "Overcast": 1,
    "Partly Cloudy": 1,
    "Fog/Smog/Smoke": 2,
    "Blowing Sand/Dirt": 2,
    "Severe Crosswind": 2,
    "Raining": 3,
    "Snowing": 3,
    "Sleet/Hail/Freezing Rain": 3,
    "Other": "Unknown",
})

#Encode road conditions (0 = Dry, 1 = Mushy, 2 = Wet)
df["ROADCOND"] = df["ROADCOND"].replace({
    "Dry": 0,
    "Snow/Slush": 1,
    "Sand/Mud/Dirt": 1,
    "Wet": 2,
    "Ice": 2,
    "Standing Water": 2,
    "Oil": 2,
    "Other": "Unknown",
})


#Making new dataframe with only variables and unique keys
features=df[["X", "Y", "INATTENTIONIND", "UNDERINFL", "WEATHER", "ROADCOND", "LIGHTCOND", "SPEEDING", "SEVERITYCODE"]]
df_feature=features.copy()
df_feature.dropna(axis=0, how='any', inplace=True)
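An alternative way to express the encodings above is a dictionary passed to `Series.map`. The sketch below uses a small hypothetical `LIGHTCOND` column, and the mapping values follow the scheme in the cell above. Letting unmapped categories such as `Other` and `Unknown` become `NaN`, so that `dropna` removes them, is an assumption and not what the notebook currently does.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of light-condition values
toy = pd.DataFrame({
    "LIGHTCOND": ["Daylight", "Dusk", "Dark - No Street Lights",
                  "Unknown", "Other", np.nan],
})

# Same scheme as above: 0 = light, 1 = medium, 2 = dark
light_map = {
    "Daylight": 0,
    "Dark - Street Lights On": 1,
    "Dusk": 1,
    "Dawn": 1,
    "Dark - No Street Lights": 2,
    "Dark - Street Lights Off": 2,
    "Dark - Unknown Lighting": 2,
}

# map() returns NaN for any value missing from the dictionary,
# so "Unknown" and "Other" never survive as strings
toy["LIGHTCOND_CODE"] = toy["LIGHTCOND"].map(light_map)

# Drop the unmapped rows and store the codes as integers
cleaned = toy.dropna(subset=["LIGHTCOND_CODE"]).copy()
cleaned["LIGHTCOND_CODE"] = cleaned["LIGHTCOND_CODE"].astype(int)
print(cleaned)
```

With this approach the column ends up purely numeric, at the cost of discarding rows whose condition was not recorded.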

In [49]:
df_feature["SEVERITYCODE"].value_counts()

1    128154
2     56013
Name: SEVERITYCODE, dtype: int64

In [74]:
# Separate majority and minority classes
df_majority = df_feature[df_feature.SEVERITYCODE==1]
df_minority = df_feature[df_feature.SEVERITYCODE==2]
 
# Upsample minority class
df_minority_upsampled = resample(df_minority, 
                                 replace=True,     # sample with replacement
                                 n_samples=128154,    # to match majority class
                                 random_state=123) # reproducible results
 
# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])

# Display new class counts
df_upsampled.SEVERITYCODE.value_counts()

2    128154
1    128154
Name: SEVERITYCODE, dtype: int64
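The same upsampling step can be sketched on a toy frame without hard-coding the class size. The data below is hypothetical; the only change from the cell above is that `n_samples` is taken from `len(toy_majority)` rather than typed in by hand.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced data: six rows of class 1, two rows of class 2
toy = pd.DataFrame({
    "SPEEDING": [0, 0, 1, 0, 0, 1, 1, 0],
    "SEVERITYCODE": [1, 1, 1, 1, 1, 1, 2, 2],
})

toy_majority = toy[toy["SEVERITYCODE"] == 1]
toy_minority = toy[toy["SEVERITYCODE"] == 2]

# Sample the minority class with replacement until it matches the majority
toy_minority_up = resample(
    toy_minority,
    replace=True,
    n_samples=len(toy_majority),  # derived from the data, not hard-coded
    random_state=123,
)

toy_balanced = pd.concat([toy_majority, toy_minority_up])
counts = toy_balanced["SEVERITYCODE"].value_counts()
print(counts)
```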

In [77]:
df_upsampled.head(5)

INCKEY,X,Y,INATTENTIONIND,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,SPEEDING,SEVERITYCODE
52200,-122.347294,47.647172,0.0,0,3,2,1,0.0,1
26700,-122.33454,47.607871,0.0,0,1,0,0,0.0,1
1144,-122.334803,47.604803,0.0,0,0,0,0,0.0,1
320840,-122.387598,47.690575,0.0,0,0,0,0,0.0,1
83300,-122.338485,47.618534,0.0,0,3,2,0,0.0,1


### Define Independent and Dependent Variables

In [78]:
X = np.asarray(df_upsampled[["INATTENTIONIND", "UNDERINFL", "WEATHER", "ROADCOND", "LIGHTCOND", "SPEEDING"]])
X[0:5]

array([[0.0, '0', 3, 2, 1, 0.0],
       [0.0, '0', 1, 0, 0, 0.0],
       [0.0, 0, 0, 0, 0, 0.0],
       [0.0, 0, 0, 0, 0, 0.0],
       [0.0, '0', 3, 2, 0, 0.0]], dtype=object)

In [80]:
y = np.asarray(df_upsampled[["SEVERITYCODE"]])
y[0:5]

array([[1],
       [1],
       [1],
       [1],
       [1]], dtype=int64)

In [89]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
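Because upsampling duplicates minority rows, splitting after upsampling can place copies of the same row in both the training and test sets. One possible alternative, sketched here on hypothetical data, is to split first and upsample only the training portion. This is a suggestion under that assumption, not the procedure used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced data: 14 rows of class 1, 6 rows of class 2
toy = pd.DataFrame({
    "WEATHER": [0, 1, 3, 0, 2, 1, 0, 3, 1, 0, 0, 3, 1, 2, 3, 3, 0, 1, 2, 3],
    "SEVERITYCODE": [1] * 14 + [2] * 6,
})

# Split first; stratify keeps the class ratio similar in both parts
train, test = train_test_split(
    toy, test_size=0.3, random_state=42, stratify=toy["SEVERITYCODE"]
)

# Upsample the minority class inside the training set only
train_major = train[train["SEVERITYCODE"] == 1]
train_minor = train[train["SEVERITYCODE"] == 2]
train_minor_up = resample(
    train_minor, replace=True, n_samples=len(train_major), random_state=123
)
train_balanced = pd.concat([train_major, train_minor_up])

# The test set keeps its original rows, none of which appear in training
overlap = set(train_balanced.index) & set(test.index)
balanced_counts = train_balanced["SEVERITYCODE"].value_counts()
print(balanced_counts)
print(len(overlap))
```

The test set stays imbalanced in this variant, which reflects the class ratio a deployed model would actually encounter.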

In [91]:
## K-Nearest Neighbors
##k = 25
##neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train, y_train)
##neigh

##K_yhat = neigh.predict(X_test)

ValueError: could not convert string to float: 'Unknown'
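
The `ValueError` above is raised because the feature matrix still contains the string `'Unknown'` (and, as the `dtype=object` output of `X[0:5]` shows, string digits such as `'0'`), which scikit-learn cannot convert to floats. A minimal sketch of one possible fix on hypothetical data follows: coerce each feature column with `pd.to_numeric(errors="coerce")`, drop the rows that become `NaN`, and then fit the classifier. Dropping the `Unknown` rows is an assumption; treating them as a separate category would be another option.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical rows mixing numbers, string digits, and "Unknown"
toy = pd.DataFrame({
    "UNDERINFL": ["0", 0, "1", 0, 1, "0", 0, 1],
    "WEATHER": [3, 1, "Unknown", 0, 3, 1, 0, 2],
    "ROADCOND": [2, 0, 2, "Unknown", 2, 0, 0, 1],
    "SEVERITYCODE": [1, 1, 2, 1, 2, 1, 1, 2],
})
feature_cols = ["UNDERINFL", "WEATHER", "ROADCOND"]

# Strings that are not numbers become NaN; "0" and "1" become 0 and 1
numeric = toy[feature_cols].apply(pd.to_numeric, errors="coerce")
clean = pd.concat([numeric, toy["SEVERITYCODE"]], axis=1).dropna()

X_toy = clean[feature_cols].to_numpy(dtype=float)
y_toy = clean["SEVERITYCODE"].to_numpy()  # 1-D target array

# k is small only because the toy data is small
knn = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)
predictions = knn.predict(X_toy)
print(X_toy.shape, predictions)
```

Applied to the real data, the same coercion would run on `df_feature` before the resampling step, so that the class counts reflect the rows that remain.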