# Seattle Car Accident Severity

In [40]:
import itertools
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.ticker import NullFormatter
from sklearn import preprocessing

%matplotlib inline
warnings.filterwarnings('ignore')

# Business Understanding

Reducing traffic accidents is an important public safety challenge around the world. 
The number of car accidents in Seattle has risen despite the city government's efforts to curb them. 
In response, the Seattle government is enabling safer routes and making cost-effective improvements to the transportation infrastructure in order to make the roads safer [1]. 

The results of the model provide advice that stakeholders can use when deciding how to reduce accident severity given the current weather, road and visibility conditions. 
The stakeholders are the local Seattle government, the police, and car insurance companies. 
The target audience of the project is people who drive a car. 

# Data

In [41]:
df = pd.read_csv('Data_Collisions.csv')
df.head()

| Unnamed: 0 | SEVERITYCODE | X | Y | OBJECTID | INCKEY | COLDETKEY | REPORTNO | STATUS | ADDRTYPE | INTKEY | ... | ROADCOND | LIGHTCOND | PEDROWNOTGRNT | SDOTCOLNUM | SPEEDING | ST_COLCODE | ST_COLDESC | SEGLANEKEY | CROSSWALKKEY | HITPARKEDCAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | -122.323148 | 47.70314 | 1 | 1307 | 1307 | 3502005 | Matched | Intersection | 37475.0 | ... | Wet | Daylight | | | | 10 | Entering at angle | 0 | 0 | N |
| 1 | 1 | -122.347294 | 47.647172 | 2 | 52200 | 52200 | 2607959 | Matched | Block | | ... | Wet | Dark - Street Lights On | | 6354039.0 | | 11 | From same direction - both going straight - bo... | 0 | 0 | N |
| 2 | 1 | -122.33454 | 47.607871 | 3 | 26700 | 26700 | 1482393 | Matched | Block | | ... | Dry | Daylight | | 4323031.0 | | 32 | One parked--one moving | 0 | 0 | N |
| 3 | 1 | -122.334803 | 47.604803 | 4 | 1144 | 1144 | 3503937 | Matched | Block | | ... | Dry | Daylight | | | | 23 | From same direction - all others | 0 | 0 | N |
| 4 | 2 | -122.306426 | 47.545739 | 5 | 17700 | 17700 | 1807429 | Matched | Intersection | 34387.0 | ... | Wet | Daylight | | 4028032.0 | | 10 | Entering at angle | 0 | 0 | N |


The data was collected by the Seattle Police Department (SPD) and recorded by Traffic Records, covering the period from 2004 to the present.
The data consists of 37 independent variables and 194,673 rows. The dependent variable, “SEVERITYCODE”, 
measures the severity of an accident on a scale from 0 to 3. 

Severity codes are as follows:

- 3: fatality
- 2b: serious injury
- 2: injury
- 1: property damage
- 0: unknown

Only codes 1 and 2 occur in this dataset, as the class counts in the preprocessing step show.

# Data Preprocessing

The original data is not ready for analysis. First, I decided to keep only four columns: the severity target and three features (weather, road conditions and light conditions). Second, I converted the object (string) columns into numerical codes. Third, I balanced the classes of the target variable by downsampling the majority class. 

In [42]:
# Work on an explicit copy so the new columns added later do not write into a view of df
Feature = df[['SEVERITYCODE', 'WEATHER', 'ROADCOND', 'LIGHTCOND']].copy()
print(Feature.dtypes)

SEVERITYCODE     int64
WEATHER         object
ROADCOND        object
LIGHTCOND       object
dtype: object


In [62]:
# Convert the object columns to pandas categoricals so that .cat.codes is available
Feature["WEATHER"] = Feature["WEATHER"].astype('category')
Feature["ROADCOND"] = Feature["ROADCOND"].astype('category')
Feature["LIGHTCOND"] = Feature["LIGHTCOND"].astype('category')

# Encode each category as an integer code (missing values get the code -1)
Feature["WEATHER_CAT"] = Feature["WEATHER"].cat.codes
Feature["ROADCOND_CAT"] = Feature["ROADCOND"].cat.codes
Feature["LIGHTCOND_CAT"] = Feature["LIGHTCOND"].cat.codes

print(Feature.dtypes)

SEVERITYCODE        int64
WEATHER          category
ROADCOND         category
LIGHTCOND        category
WEATHER_CAT          int8
ROADCOND_CAT         int8
LIGHTCOND_CAT        int8
dtype: object
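
To interpret the integer codes later, it can help to keep the mapping from code to label. The following is a minimal sketch on a toy Series rather than the real column; the name `roadcond` and its values are illustrative assumptions based on the road conditions visible in the preview above.

```python
import pandas as pd

# Toy stand-in for one categorical column (values are illustrative, not the full data).
roadcond = pd.Series(["Wet", "Dry", None, "Wet", "Dry"], dtype="category")

# Categories are sorted, and cat.codes numbers them in that order.
# Missing values have no category and receive the code -1.
codes = roadcond.cat.codes
mapping = dict(enumerate(roadcond.cat.categories))

print(mapping)
print(codes.tolist())
```

Because missing values map to -1, they stay in the feature matrix as if they were a regular category unless they are dropped or imputed first.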


In [44]:
Feature['SEVERITYCODE'].value_counts()

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64

In [45]:
from sklearn.utils import resample

In [46]:
df_maj = Feature[Feature.SEVERITYCODE==1]
df_min = Feature[Feature.SEVERITYCODE==2]

df_maj_dsample = resample(df_maj,
                             replace=False,
                             n_samples=58188,
                             random_state=123)

balanced_df = pd.concat([df_maj_dsample, df_min])

balanced_df.SEVERITYCODE.value_counts()

2    58188
1    58188
Name: SEVERITYCODE, dtype: int64
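
The downsampling step above hard-codes the minority count (58188). As a hedged sketch, the same idea can be written so that the count is read from the data; the helper name `downsample_to_minority` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd
from sklearn.utils import resample


def downsample_to_minority(frame, target, random_state=123):
    """Downsample every class of `target` to the size of the smallest class."""
    n_min = frame[target].value_counts().min()
    parts = [
        # Sampling without replacement keeps each retained row unique.
        resample(group, replace=False, n_samples=n_min, random_state=random_state)
        for _, group in frame.groupby(target)
    ]
    return pd.concat(parts)


# Toy data: seven rows of class 1 and three rows of class 2.
toy = pd.DataFrame({"SEVERITYCODE": [1] * 7 + [2] * 3, "ROADCOND_CAT": range(10)})
balanced_toy = downsample_to_minority(toy, "SEVERITYCODE")
print(balanced_toy["SEVERITYCODE"].value_counts())
```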

# Methodology


I used three machine learning models:

1. K Nearest Neighbor (KNN)
2. Decision Tree
3. Logistic Regression

In [47]:
X = np.asarray(balanced_df[['WEATHER_CAT', 'ROADCOND_CAT', 'LIGHTCOND_CAT']])
X[0:5]

array([[ 6,  8,  2],
       [ 1,  0,  5],
       [10,  7,  8],
       [ 1,  0,  5],
       [ 1,  0,  5]], dtype=int8)
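
The integer codes impose an arbitrary order on categories that have no natural ordering, which distance-based models such as KNN and linear models such as logistic regression will treat as meaningful. One common alternative is one-hot encoding. The sketch below uses `pd.get_dummies` on a toy frame; the frame and the name `X_onehot` are assumptions for illustration, and this is not what the notebook itself does.

```python
import pandas as pd

# Toy frame with two of the categorical columns (values taken from the preview above).
toy = pd.DataFrame({
    "ROADCOND": ["Wet", "Dry", "Wet"],
    "LIGHTCOND": ["Daylight", "Dark - Street Lights On", "Daylight"],
})

# Each category becomes its own 0/1 indicator column.
X_onehot = pd.get_dummies(toy, columns=["ROADCOND", "LIGHTCOND"])
print(X_onehot.columns.tolist())
print(X_onehot.astype(int))
```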

In [48]:
y = np.asarray(balanced_df['SEVERITYCODE'])
y[0:5]

array([1, 1, 1, 1, 1])

In [49]:
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]

array([[ 1.15236718,  1.52797946, -1.21648407],
       [-0.67488   , -0.67084969,  0.42978835],
       [ 2.61416492,  1.25312582,  2.07606076],
       [-0.67488   , -0.67084969,  0.42978835],
       [-0.67488   , -0.67084969,  0.42978835]])

In [50]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Train set: (81463, 3) (81463,)
Test set: (34913, 3) (34913,)
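
In the cells above, the scaler is fit on the full data before the split, so statistics from the test rows influence the scaling. A minimal sketch of one way to avoid this is to split first and put the scaler in a pipeline, so it is fit on the training rows only. The data here is synthetic and the names `X_raw`, `y_toy` and `model` are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three encoded features and the binary target.
rng = np.random.RandomState(4)
X_raw = rng.randint(0, 11, size=(200, 3)).astype(float)
y_toy = rng.randint(1, 3, size=200)

# Split first, then let the pipeline fit the scaler on the training rows only.
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_toy, test_size=0.3, random_state=4)
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15))
model.fit(X_tr, y_tr)

print(X_tr.shape, X_te.shape)
print(model.score(X_te, y_te))
```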


# K Nearest Neighbor (KNN)

In [51]:
from sklearn.neighbors import KNeighborsClassifier
k = 15

# Fit KNN on the training split and preview the first five test predictions
kNN_model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
yhat = kNN_model.predict(X_test)
yhat[0:5]

array([2, 2, 1, 1, 2])
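
The notebook fixes k = 15 without showing how it was chosen. One common approach, sketched here on synthetic data, is to compare validation accuracy over a range of k values. The data, the range of k, and the names `scores` and `best_k` are assumptions; ideally the comparison uses a validation split or cross-validation rather than the final test set.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary problem with labels 1 and 2, loosely driven by the first feature.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(300, 3))
y_demo = np.where(X_demo[:, 0] + 0.5 * rng.normal(size=300) > 0, 2, 1)

X_fit, X_val, y_fit, y_val = train_test_split(X_demo, y_demo, test_size=0.3, random_state=4)

# Score odd values of k from 1 to 25 on the held-out validation rows.
scores = {}
for k in range(1, 26, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_fit, y_fit)
    scores[k] = accuracy_score(y_val, knn.predict(X_val))

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```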

# Decision Tree

In [52]:
from sklearn.tree import DecisionTreeClassifier
DT_model = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
DT_model.fit(X_train,y_train)
DT_model

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [53]:
yhat = DT_model.predict(X_test)
yhat

array([2, 2, 1, ..., 2, 2, 2])
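
A shallow tree like this one can be read directly. As a sketch, `sklearn.tree.export_text` prints the learned split rules; the example below fits a tree on synthetic data with a made-up rule so that it is self-contained, and the feature names are reused from the notebook only as labels.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data: the label depends only on the second feature (an invented rule).
rng = np.random.RandomState(1)
X_demo = rng.randint(0, 9, size=(200, 3)).astype(float)
y_demo = np.where(X_demo[:, 1] > 4, 2, 1)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
tree.fit(X_demo, y_demo)

# Print the split rules using readable feature names.
rules = export_text(tree, feature_names=["WEATHER_CAT", "ROADCOND_CAT", "LIGHTCOND_CAT"])
print(rules)
```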

# Logistic Regression

In [54]:
from sklearn.linear_model import LogisticRegression
LR_model = LogisticRegression(C=0.01).fit(X_train,y_train)
LR_model

LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [55]:
yhat = LR_model.predict(X_test)
yhat

array([1, 2, 1, ..., 2, 2, 2])
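
Logistic regression also yields class probabilities, which can be more informative than hard labels when the classes are hard to separate. The following sketch, on synthetic data, uses `predict_proba` and `log_loss`; the data and the names `lr` and `proba` are assumptions, and `solver="liblinear"` is set explicitly because the default solver differs between scikit-learn versions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Synthetic binary problem with labels 1 and 2.
rng = np.random.RandomState(2)
X_demo = rng.normal(size=(300, 3))
y_demo = np.where(X_demo[:, 0] + rng.normal(size=300) > 0, 2, 1)

lr = LogisticRegression(C=0.01, solver="liblinear").fit(X_demo, y_demo)

# Columns of proba follow the order of lr.classes_.
proba = lr.predict_proba(X_demo)
print(lr.classes_)
print(proba[:3])
print(log_loss(y_demo, proba))
```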

# Model Evaluation using Test set

In [56]:
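# Note: jaccard_similarity_score was deprecated and later removed from scikit-learn.
# For single-label targets it returns the same value as accuracy_score.
from sklearn.metrics import jaccard_similarity_score
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score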

In [57]:
knn_yhat = kNN_model.predict(X_test)
print("KNN Jaccard index: %.2f" % jaccard_similarity_score(y_test, knn_yhat))
print("KNN F1-score: %.2f" % f1_score(y_test, knn_yhat, average='weighted') )
print("KNN Accuracy: %.2f" % accuracy_score(y_test, knn_yhat))

KNN Jaccard index: 0.56
KNN F1-score: 0.55
KNN Accuracy: 0.56


In [58]:
DT_yhat = DT_model.predict(X_test)
print("DT Jaccard index: %.2f" % jaccard_similarity_score(y_test, DT_yhat))
print("DT F1-score: %.2f" % f1_score(y_test, DT_yhat, average='weighted'))
print("DT Accuracy: %.2f" % accuracy_score(y_test, DT_yhat))

DT Jaccard index: 0.56
DT F1-score: 0.48
DT Accuracy: 0.56


In [59]:
LR_yhat = LR_model.predict(X_test)
print("LR Jaccard index: %.2f" % jaccard_similarity_score(y_test, LR_yhat))
print("LR F1-score: %.2f" % f1_score(y_test, LR_yhat, average='weighted'))
print("LR Accuracy: %.2f" % accuracy_score(y_test, LR_yhat))

LR Jaccard index: 0.53
LR F1-score: 0.51
LR Accuracy: 0.53
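
The Jaccard and accuracy values coincide for every model because `jaccard_similarity_score` is equivalent to accuracy for single-label targets. It was deprecated in scikit-learn 0.21 and removed in 0.23 in favour of `jaccard_score`, which computes the set-based Jaccard index for a chosen class. A minimal sketch on toy labels (the arrays and the choice of class 2 as the positive label are assumptions):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, jaccard_score

# Toy labels using the same class codes as the notebook (1 and 2).
y_true = np.array([1, 1, 2, 2, 2, 1])
y_pred = np.array([1, 2, 2, 2, 1, 1])

acc = accuracy_score(y_true, y_pred)              # 4 of 6 correct
jac = jaccard_score(y_true, y_pred, pos_label=2)  # TP / (TP + FP + FN) for class 2
f1 = f1_score(y_true, y_pred, average="weighted")

print("Accuracy: %.2f" % acc)
print("Jaccard (class 2): %.2f" % jac)
print("F1-score: %.2f" % f1)
```

With `jaccard_score`, the value generally differs from accuracy, so results computed this way are not directly comparable to the Jaccard column reported here.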


# Report

This is a summary of the results I obtained:

| Algorithm           | Jaccard | F1-score | Accuracy |
|---------------------|---------|----------|----------|
| KNN                 | 0.56    | 0.55     | 0.56     |
| Decision Tree       | 0.56    | 0.48     | 0.56     |
| Logistic Regression | 0.53    | 0.51     | 0.53     |

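As the table shows, K Nearest Neighbor (KNN) is the best of the three models for predicting car accident severity: it ties with the Decision Tree on Jaccard index and accuracy (0.56) and has the highest F1-score (0.55). Because the two classes were balanced, all three models are only modestly better than a chance baseline of roughly 0.50.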
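
To put these scores in context, it helps to compare them against a trivial baseline. The sketch below uses scikit-learn's `DummyClassifier` on a toy balanced target; the arrays are assumptions and only illustrate that always predicting one class scores 0.5 when the two classes are equally frequent.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy balanced target: 50 rows of class 1 and 50 rows of class 2.
y_bal = np.array([1] * 50 + [2] * 50)
X_dummy = np.zeros((100, 1))  # features are ignored by the dummy model

# Always predict a single class, regardless of the input.
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_bal)
baseline_acc = accuracy_score(y_bal, baseline.predict(X_dummy))
print("Baseline accuracy: %.2f" % baseline_acc)
```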

# Conclusion

According to these models, weather, road and light conditions have some impact on the severity of an accident, that is, on whether it results in property damage only (class 1) or in injuries (class 2). Of the three algorithms, KNN captures the effect of these conditions best.