# **Applied Data Science Capstone - Car accident severity**

## **Introduction/Business Understanding**

Car collisions are a major problem for society and are often fatal or cause serious casualties. It is therefore important to analyze previously collected data and predict the severity of collisions before they happen. The primary goal of this capstone project is to build an appropriate machine learning model that predicts the severity code, one of the main parameters describing how severe an accident is, for collision cases.

Classifying accidents by severity code could help reduce casualties and damage in the future, because the people responsible for this problem can use the results to improve the environment, for example road conditions, and so reduce total property and human losses.

----

## **Data**

The provided example dataset, which covers all collisions in Seattle from 2004 to the present, is used for this project. The file has 38 columns in total, including the severity code (23 are dropped in STEP 2 and 15 are kept). As we do not need all the attributes, those that look irrelevant to our modeling will be deleted and further data cleaning will be done.

#### **STEP 1: Import libraries and load the data**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pylab as pl
import scipy.optimize as opt
from sklearn import preprocessing,metrics
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
# jaccard_similarity_score exists only in older scikit-learn releases (it was removed in 0.23).
from sklearn.metrics import confusion_matrix,jaccard_similarity_score,log_loss
%matplotlib inline

In [2]:
!wget -O Data_Collisions.csv https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv

--2020-10-30 07:19:57--  https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv
Resolving s3.us.cloud-object-storage.appdomain.cloud (s3.us.cloud-object-storage.appdomain.cloud)... 67.228.254.196
Connecting to s3.us.cloud-object-storage.appdomain.cloud (s3.us.cloud-object-storage.appdomain.cloud)|67.228.254.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 73917638 (70M) [text/csv]
Saving to: ‘Data_Collisions.csv’


2020-10-30 07:20:01 (19.8 MB/s) - ‘Data_Collisions.csv’ saved [73917638/73917638]



#### **STEP 2: Choose which attributes to use to train machine learning model**

As stated above, not all attributes are needed to build effective models. I therefore explored the CSV file in advance and dropped attributes, such as free-text descriptions, that seem irrelevant to our goal. The dropped and retained attributes are shown below:

In [4]:
df = pd.read_csv('Data_Collisions.csv')
df.drop(['X','Y','OBJECTID','INCKEY','COLDETKEY','REPORTNO','STATUS','INTKEY','LOCATION','SEVERITYCODE.1','SEVERITYDESC','EXCEPTRSNCODE','EXCEPTRSNDESC','INCDTTM','JUNCTIONTYPE','SDOT_COLCODE','SDOT_COLDESC','SDOTCOLNUM','ST_COLCODE','ST_COLDESC','SEGLANEKEY','CROSSWALKKEY','HITPARKEDCAR'], axis=1, inplace=True)
df.head()


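| | SEVERITYCODE | ADDRTYPE | COLLISIONTYPE | PERSONCOUNT | PEDCOUNT | PEDCYLCOUNT | VEHCOUNT | INCDATE | INATTENTIONIND | UNDERINFL | WEATHER | ROADCOND | LIGHTCOND | PEDROWNOTGRNT | SPEEDING |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Intersection | Angles | 2 | 0 | 0 | 2 | 2013/03/27 00:00:00+00 | NaN | N | Overcast | Wet | Daylight | NaN | NaN |
| 1 | 1 | Block | Sideswipe | 2 | 0 | 0 | 2 | 2006/12/20 00:00:00+00 | NaN | 0 | Raining | Wet | Dark - Street Lights On | NaN | NaN |
| 2 | 1 | Block | Parked Car | 4 | 0 | 0 | 3 | 2004/11/18 00:00:00+00 | NaN | 0 | Overcast | Dry | Daylight | NaN | NaN |
| 3 | 1 | Block | Other | 3 | 0 | 0 | 3 | 2013/03/29 00:00:00+00 | NaN | N | Clear | Dry | Daylight | NaN | NaN |
| 4 | 2 | Intersection | Angles | 2 | 0 | 0 | 2 | 2004/01/28 00:00:00+00 | NaN | 0 | Raining | Wet | Daylight | NaN | NaN |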


In [5]:
print(df.dtypes)

SEVERITYCODE       int64
ADDRTYPE          object
COLLISIONTYPE     object
PERSONCOUNT        int64
PEDCOUNT           int64
PEDCYLCOUNT        int64
VEHCOUNT           int64
INCDATE           object
INATTENTIONIND    object
UNDERINFL         object
WEATHER           object
ROADCOND          object
LIGHTCOND         object
PEDROWNOTGRNT     object
SPEEDING          object
dtype: object


Looking at the data types in our data set, many of the attributes are of type 'object' rather than a numerical type such as 'int' or 'float'. We will convert these to numerical types in STEP 4 and STEP 6.

#### **STEP 3: Balance the data by downsampling**

In [6]:
df['SEVERITYCODE'].value_counts()

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64

CODE 1 rows outnumber CODE 2 rows by more than two to one (136485 vs. 58188), and we need to fix this to prevent the final result from being biased. By resampling the original data, we can match the sizes of the two classes. Downsampling creates a random subset of the larger class that has the same size as the smaller class. We will apply downsampling to the CODE 1 rows:

In [7]:
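# Split by class: code 1 is the majority class, code 2 the minority class.
df_max = df[df['SEVERITYCODE'] == 1]
df_min = df[df['SEVERITYCODE'] == 2]
# Sample the majority class without replacement down to the minority size (58188 rows).
df_max_ds = resample(df_max, replace=False, n_samples=len(df_min))

df_balanced = pd.concat([df_max_ds, df_min])
print(df_balanced['SEVERITYCODE'].value_counts())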

2    58188
1    58188
Name: SEVERITYCODE, dtype: int64


Now CODE 1 and CODE 2 have the same number of rows.
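
As a minimal, library-free sketch of what downsampling does, the majority class can be sampled without replacement down to the minority class size. The toy rows and the fixed seed below are illustrative assumptions, not values from the Seattle data:

```python
import random
from collections import Counter

# Toy (row_id, label) pairs standing in for SEVERITYCODE:
# 7 rows of class 1 and 3 rows of class 2.
rows = [(i, 1) for i in range(7)] + [(i, 2) for i in range(7, 10)]

majority = [r for r in rows if r[1] == 1]
minority = [r for r in rows if r[1] == 2]

# Sample the majority class without replacement down to the minority size.
# The fixed seed is an assumption of this sketch; it makes the subset reproducible.
rng = random.Random(0)
majority_ds = rng.sample(majority, k=len(minority))

balanced = majority_ds + minority
counts = Counter(label for _, label in balanced)
print(counts)
```

With scikit-learn's `resample`, passing `random_state` should give the same kind of reproducibility.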

In [8]:
# Renumber the rows 0..n-1 and discard the old index.
df_balanced = df_balanced.reset_index(drop=True)

#### **STEP 4: Give additional cleanings for easy analysis**

I cleaned the data by dropping unnecessary values and converting non-numerical data types into numerical ones, which makes the analysis easier.

First I checked all the attributes and their values using the value_counts() function and removed rows whose category values occur very rarely (fewer than 50 times), treating them as outliers.

In [9]:
df_balanced = df_balanced.drop(df_balanced[df_balanced['WEATHER'].isin(["Blowing Sand/Dirt", "Severe Crosswind", "Partly Cloudy"])].index)
df_balanced = df_balanced.drop(df_balanced[df_balanced['ROADCOND'].isin(["Sand/Mud/Dirt", "Oil"])].index)
df_balanced = df_balanced.drop(df_balanced[df_balanced['LIGHTCOND'].isin(["Dark - Unknown Lighting"])].index)
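
The rare categories in the cell above are listed by hand. A possible way to derive such a list from a count threshold is sketched below; the helper name `rare_values`, the toy column, and the scaled-down threshold are all hypothetical:

```python
from collections import Counter

def rare_values(values, min_count):
    """Return the set of values that occur fewer than min_count times."""
    counts = Counter(values)
    return {v for v, c in counts.items() if c < min_count}

# Toy WEATHER column; the threshold of 2 is scaled down from the 50 used in the text.
weather = ["Clear", "Clear", "Raining", "Raining", "Severe Crosswind", "Clear"]
rare = rare_values(weather, min_count=2)

# Keep only the rows whose category is not rare.
kept = [w for w in weather if w not in rare]
print(rare)
print(kept)
```

In pandas the counts come from `value_counts()`, and the resulting set can be passed to `isin` in the same way as the hand-written lists.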

Next I filled in the missing values and converted categorical values into numerical ones.

In [10]:
df_balanced['INATTENTIONIND'] = df_balanced['INATTENTIONIND'].replace('Y',1).replace(np.nan,0)
df_balanced['UNDERINFL'] = df_balanced['UNDERINFL'].replace('Y',1).replace('N',0).replace(np.nan,0)
df_balanced['PEDROWNOTGRNT'] = df_balanced['PEDROWNOTGRNT'].replace('Y',1).replace(np.nan,0)
df_balanced['SPEEDING'] = df_balanced['SPEEDING'].replace('Y',1).replace(np.nan,0)
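
The first `df.head()` output shows that UNDERINFL mixes 'N' and '0', so the indicator columns can hold letters, digit strings, and missing values at the same time. The sketch below illustrates the intended mapping on plain Python values; the helper name `flag_to_int` is hypothetical and is not part of the notebook:

```python
def flag_to_int(value):
    """Map a raw indicator value to 0 or 1; missing values count as 0."""
    if value is None:
        return 0
    text = str(value).strip().upper()
    return 1 if text in {"Y", "1"} else 0

# Mixed raw values of the kind seen in the indicator columns (illustrative).
raw = ["Y", "N", "0", "1", None, 1, 0]
flags = [flag_to_int(v) for v in raw]
print(flags)  # [1, 0, 0, 1, 0, 1, 0]
```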

In [11]:
df_balanced['ADDRTYPE'] = df_balanced['ADDRTYPE'].replace(np.nan,'Unknown')
df_balanced['COLLISIONTYPE'] = df_balanced['COLLISIONTYPE'].replace(np.nan,'Unknown')

In [12]:
df_balanced['WEATHER'] = df_balanced['WEATHER'].replace(np.nan,'Unknown')
df_balanced['ROADCOND'] = df_balanced['ROADCOND'].replace(np.nan,'Unknown')
df_balanced['LIGHTCOND'] = df_balanced['LIGHTCOND'].replace(np.nan,'Unknown')

In [13]:
df_balanced = df_balanced.astype({'INATTENTIONIND':int,'UNDERINFL':int,'PEDROWNOTGRNT':int,'SPEEDING':int})
df_balanced.head()

| | SEVERITYCODE | ADDRTYPE | COLLISIONTYPE | PERSONCOUNT | PEDCOUNT | PEDCYLCOUNT | VEHCOUNT | INCDATE | INATTENTIONIND | UNDERINFL | WEATHER | ROADCOND | LIGHTCOND | PEDROWNOTGRNT | SPEEDING |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Intersection | Angles | 2 | 0 | 0 | 2 | 2019/12/25 00:00:00+00 | 0 | 0 | Clear | Dry | Daylight | 0 | 0 |
| 1 | 1 | Block | Parked Car | 2 | 0 | 0 | 2 | 2015/09/12 00:00:00+00 | 0 | 0 | Unknown | Unknown | Unknown | 0 | 0 |
| 2 | 1 | Intersection | Left Turn | 2 | 0 | 0 | 2 | 2005/03/28 00:00:00+00 | 0 | 0 | Raining | Wet | Daylight | 0 | 0 |
| 3 | 1 | Block | Rear Ended | 2 | 0 | 0 | 2 | 2005/09/16 00:00:00+00 | 0 | 0 | Raining | Wet | Daylight | 0 | 0 |
| 4 | 1 | Block | Rear Ended | 2 | 0 | 0 | 2 | 2016/10/24 00:00:00+00 | 1 | 0 | Clear | Dry | Daylight | 0 | 0 |


#### **STEP 5: Define the feature dataset X and the target y**

Based on some simple data visualizations, I decided to use the following attributes as the feature data set for the machine learning models: ADDRTYPE, WEATHER, ROADCOND, LIGHTCOND, PERSONCOUNT, and PEDCOUNT.

In [17]:
X = df_balanced[['ADDRTYPE','WEATHER','ROADCOND','LIGHTCOND','PERSONCOUNT','PEDCOUNT']].values
X[0:5]

array([['Intersection', 'Clear', 'Dry', 'Daylight', 2, 0],
       ['Block', 'Unknown', 'Unknown', 'Unknown', 2, 0],
       ['Intersection', 'Raining', 'Wet', 'Daylight', 2, 0],
       ['Block', 'Raining', 'Wet', 'Daylight', 2, 0],
       ['Block', 'Clear', 'Dry', 'Daylight', 2, 0]], dtype=object)

Obviously, our target is SEVERITYCODE.

In [18]:
y = df_balanced['SEVERITYCODE'].values
y[0:5]

array([1, 1, 1, 1, 1])

#### **STEP 6: Convert categorical values into numerical values**

In the feature data set X, every column except PERSONCOUNT and PEDCOUNT consists of categorical values. I used a label encoder to convert these columns to integer codes.

In [19]:
le_addr = preprocessing.LabelEncoder()
le_addr.fit(['Block','Intersection','Unknown','Alley'])
X[:,0] = le_addr.transform(X[:,0])

le_weather = preprocessing.LabelEncoder()
le_weather.fit(['Clear','Raining','Overcast','Unknown','Snowing','Other','Fog/Smog/Smoke','Sleet/Hail/Freezing Rain'])
X[:,1] = le_weather.transform(X[:,1])

le_road = preprocessing.LabelEncoder()
le_road.fit(['Dry','Wet','Unknown','Ice','Snow/Slush','Other','Standing Water'])
X[:,2] = le_road.transform(X[:,2])

le_light = preprocessing.LabelEncoder()
le_light.fit(['Daylight','Dark - Street Lights On','Unknown','Dusk','Dawn','Dark - No Street Lights','Dark - Street Lights Off','Other'])
X[:,3] = le_light.transform(X[:,3])

X[0:5]

array([[2, 0, 0, 4, 2, 0],
       [1, 7, 5, 7, 2, 0],
       [2, 4, 6, 4, 2, 0],
       [1, 4, 6, 4, 2, 0],
       [1, 0, 0, 4, 2, 0]], dtype=object)
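
The integer codes in the output can be reproduced by hand. To my understanding, `LabelEncoder` numbers the classes in sorted order, so the order of the list passed to `fit` does not matter. A small sketch using the class lists from the cell above:

```python
# Codes are assigned by position in the sorted list of class labels.
addr_classes = sorted(['Block', 'Intersection', 'Unknown', 'Alley'])
addr_code = {label: code for code, label in enumerate(addr_classes)}
print(addr_code)
# {'Alley': 0, 'Block': 1, 'Intersection': 2, 'Unknown': 3}

weather_classes = sorted(['Clear', 'Raining', 'Overcast', 'Unknown', 'Snowing',
                          'Other', 'Fog/Smog/Smoke', 'Sleet/Hail/Freezing Rain'])
weather_code = {label: code for code, label in enumerate(weather_classes)}
print(weather_code['Clear'], weather_code['Unknown'], weather_code['Raining'])
# 0 7 4
```

This matches the first three rows of the encoded X: 'Intersection' becomes 2, 'Block' becomes 1, and the weather values 'Clear', 'Unknown', and 'Raining' become 0, 7, and 4.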

#### **STEP 7: Normalize the data**

It is necessary to normalize the data so that the raw magnitudes of the numerical values in our feature data set are not treated as weights on each attribute. I used standardization to give each feature zero mean and unit variance.

In [20]:
# Cast the object array to float, then fit the scaler and transform in one call.
X = preprocessing.StandardScaler().fit_transform(X.astype(float))
X[0:5]



array([[ 1.20126503, -0.76533134, -0.7164001 ,  0.21811474, -0.37222737,
        -0.23769606],
       [-0.76169461,  2.27744764,  1.12174506,  2.44336761, -0.37222737,
        -0.23769606],
       [ 1.20126503,  0.9733995 ,  1.48937409,  0.21811474, -0.37222737,
        -0.23769606],
       [-0.76169461,  0.9733995 ,  1.48937409,  0.21811474, -0.37222737,
        -0.23769606],
       [-0.76169461, -0.76533134, -0.7164001 ,  0.21811474, -0.37222737,
        -0.23769606]])
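
Standardization replaces each value x in a column with (x - mean) / std. The sketch below applies that formula to one toy column using only the standard library; the values are illustrative, and the population standard deviation is used because, as far as I know, that is what `StandardScaler` computes:

```python
from statistics import fmean, pstdev

def standardize(column):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = fmean(column)
    sigma = pstdev(column)
    return [(x - mu) / sigma for x in column]

# Toy PERSONCOUNT-like column (illustrative values only).
person_count = [2, 2, 4, 3, 2, 5]
z = standardize(person_count)
print([round(v, 3) for v in z])
```

After the transformation the column has mean 0 and standard deviation 1, which is why the most common value in PERSONCOUNT (2) shows up as the same negative number in every row of the output above.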

#### **STEP 8: Split the data into train set and test set**

I set aside 20% of the data as the test set and used the rest as the training set.

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Train set: (92981, 6) (92981,)
Test set: (23246, 6) (23246,)
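
The two shapes can be checked by hand. The sketch below assumes that, for a fractional `test_size`, scikit-learn rounds the test share up and assigns the remainder to the training set, which is consistent with the printed shapes:

```python
import math

n_total = 92981 + 23246   # rows left after cleaning: 116227
test_size = 0.2

# Round the test share up; the training set receives the remaining rows.
n_test = math.ceil(test_size * n_total)
n_train = n_total - n_test
print(n_train, n_test)  # 92981 23246

# Rows removed in STEP 4, relative to the balanced set of 2 * 58188 rows.
print(2 * 58188 - n_total)  # 149
```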


-----

## **Methodology**

I chose three classification algorithms - K-Nearest-Neighbors, Decision Trees, and Logistic Regression - for the machine learning models. Each model is built in two steps: finding the best parameter, then training the model with that parameter to predict.
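
The two-step pattern can be sketched without scikit-learn on a toy one-dimensional problem. Everything below is illustrative: the data, the helper names `knn_predict` and `accuracy`, and the use of a separate validation list for choosing k are assumptions of this sketch (the notebook scores the candidates on its test set).

```python
from collections import Counter

def knn_predict(train, query, k):
    """Predict a label by majority vote among the k nearest 1-D training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train, data, k):
    """Fraction of (feature, label) pairs in data that are predicted correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in data)
    return hits / len(data)

# Toy (feature, label) points; two of them, (0.1, 2) and (1.1, 1), act as label noise.
train = [(0.0, 1), (0.1, 2), (0.2, 1), (0.3, 1), (0.4, 1),
         (1.0, 2), (1.1, 1), (1.2, 2), (1.3, 2), (1.4, 2)]
valid = [(0.12, 1), (0.35, 1), (1.08, 2), (1.25, 2)]

# Step 1: score each candidate k on held-out data and keep the best one.
scores = {k: accuracy(train, valid, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)

# Step 2: predict with the chosen k.
print(scores, best_k, knn_predict(train, 0.25, best_k))
```

With k = 1 the noisy points mislead the nearest-neighbor vote, while larger k values smooth them out, which is the reason for searching over the parameter at all.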

#### **METHOD 1: K-Nearest-Neighbors**

In [22]:
Ks = 20
mean_acc = np.zeros((Ks-1))
std_acc = np.zeros((Ks-1))
# Evaluate k = 1..19, recording the test-set accuracy and its standard error for each k.
for n in range(1,Ks):
    neigh = KNeighborsClassifier(n_neighbors = n).fit(X_train,y_train)
    yhat = neigh.predict(X_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)
    std_acc[n-1] = np.std(yhat==y_test)/np.sqrt(yhat.shape[0])

print("The best accuracy was with", mean_acc.max(), "with k =", mean_acc.argmax()+1) 

The best accuracy was with 0.6415727436978405 with k = 17


In [23]:
k_max = mean_acc.argmax()+1
neigh = KNeighborsClassifier(n_neighbors = k_max).fit(X_train,y_train)
neigh

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=17, p=2,
           weights='uniform')

In [24]:
yhat_knn = neigh.predict(X_test)
yhat_knn[0:5]

array([1, 1, 2, 1, 2])

#### **METHOD 2: Decision Tree**

In [25]:
Ds = 20
mean_acc = np.zeros(Ds-1)
std_acc = np.zeros(Ds-1)
# Evaluate max_depth = 1..19, recording the test-set accuracy for each depth.
for n in range(1,Ds):
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=n).fit(X_train,y_train)
    pred = tree.predict(X_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test, pred)
    std_acc[n-1] = np.std(pred==y_test)/np.sqrt(pred.shape[0])

print("The best accuracy was with", mean_acc.max(), "with max depth =", mean_acc.argmax()+1) 

The best accuracy was with 0.6524133184203734 with max depth = 7


In [26]:
d_max = mean_acc.argmax()+1
carTree = DecisionTreeClassifier(criterion="entropy", max_depth=d_max).fit(X_train,y_train)
carTree

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=7,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [27]:
predTree = carTree.predict(X_test)
predTree

array([1, 1, 2, ..., 2, 1, 1])

#### **METHOD 3: Logistic Regression**

In [28]:
Cs = 20
mean_acc = np.zeros(Cs-1)
std_acc = np.zeros(Cs-1)
# Evaluate C = 0.00005*n for n = 1..19 (a smaller C means stronger regularization).
for n in range(1,Cs):
    LR = LogisticRegression(C=0.00005*n, solver='liblinear').fit(X_train,y_train)
    yhat = LR.predict(X_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)
    std_acc[n-1] = np.std(yhat==y_test)/np.sqrt(yhat.shape[0])
    
print("The best accuracy was with", mean_acc.max(), "with C =", round(0.00005*(mean_acc.argmax()+1),5))

The best accuracy was with 0.6301729329777166 with C = 0.0002


In [29]:
c_max = round(0.00005*(mean_acc.argmax()+1),5)
LR = LogisticRegression(C=c_max, solver='liblinear').fit(X_train,y_train)
LR

LogisticRegression(C=0.0002, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=None, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [30]:
yhat_lr = LR.predict(X_test)
yhat_lr

array([1, 1, 2, ..., 2, 1, 1])

In [31]:
yhat_lr_prob = LR.predict_proba(X_test)

-----

## **Results/Discussion**

Now we will calculate how often each model predicts the correct SEVERITYCODE class on the test set, to see how well each model works.

In [32]:
acc1 = metrics.accuracy_score(y_test, yhat_knn)
j1 = jaccard_similarity_score(y_test, yhat_knn)
print("KNN accuracy:", acc1)
print("KNN Jaccard similarity score:", j1)

KNN accuracy: 0.6415727436978405
KNN Jaccard similarity score: 0.6415727436978405


In [33]:
acc2 = metrics.accuracy_score(y_test, predTree)
j2 = jaccard_similarity_score(y_test, predTree)
print("Decision Tree accuracy:", acc2)
print("Decision Tree Jaccard similarity score:", j2)

Decision Tree accuracy: 0.6524133184203734
Decision Tree Jaccard similarity score: 0.6524133184203734


In [34]:
acc3 = metrics.accuracy_score(y_test, yhat_lr)
# NOTE: `yhat` holds the predictions from the last iteration of the tuning loop in In [28]
# (C = 0.00095), not the final model's `yhat_lr`, so this value differs from the accuracy.
j3 = jaccard_similarity_score(y_test, yhat)
l3 = log_loss(y_test, yhat_lr_prob)
print("Logistic Regression accuracy:", acc3)
print("Logistic Regression Jaccard similarity score:", j3)
print("Logistic Regression log loss:", l3)

Logistic Regression accuracy: 0.6301729329777166
Logistic Regression Jaccard similarity score: 0.6285382431386045
Logistic Regression log loss: 0.6367656833415959


In [35]:
accuracy = [acc1, acc2, acc3]
jaccard = [j1, j2, j3]
# Named log_loss_row so that the imported log_loss function is not overwritten.
log_loss_row = ['NA', 'NA', l3]
res = pd.DataFrame(data=[accuracy, jaccard, log_loss_row], index=['Accuracy','Jaccard score','Log Loss'], columns=['KNN','Decision Tree','Logistic Regression'])
res

| | KNN | Decision Tree | Logistic Regression |
|---|---|---|---|
| Accuracy | 0.641573 | 0.652413 | 0.630173 |
| Jaccard score | 0.641573 | 0.652413 | 0.628538 |
| Log Loss | NA | NA | 0.636766 |
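
The scores above come from `metrics.accuracy_score`, `jaccard_similarity_score`, and `log_loss`. The sketch below computes accuracy, the Jaccard index, and log loss by hand on toy data, with F1 added for comparison, to show that they are different quantities. The labels and probabilities are illustrative assumptions, and class 2 is treated as the positive class:

```python
import math

# Toy labels (1 or 2) and predicted probabilities of class 2; illustrative only.
y_true = [1, 1, 1, 2, 2, 2, 1, 1]
p_class2 = [0.2, 0.1, 0.6, 0.9, 0.7, 0.4, 0.1, 0.2]
y_pred = [2 if p > 0.5 else 1 for p in p_class2]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 2 and p == 2 for t, p in pairs)  # class 2 predicted correctly
fp = sum(t == 1 and p == 2 for t, p in pairs)  # class 1 predicted as class 2
fn = sum(t == 2 and p == 1 for t, p in pairs)  # class 2 predicted as class 1

accuracy = sum(t == p for t, p in pairs) / len(pairs)
f1_class2 = 2 * tp / (2 * tp + fp + fn)
jaccard_class2 = tp / (tp + fp + fn)

# Log loss: mean negative log-probability assigned to the true class.
log_loss_value = -sum(
    math.log(p if t == 2 else 1 - p) for t, p in zip(y_true, p_class2)
) / len(y_true)

print(accuracy, round(f1_class2, 3), jaccard_class2, round(log_loss_value, 3))
```

In the older scikit-learn releases that provide `jaccard_similarity_score`, that function is, as far as I know, equivalent to accuracy for single-label targets, which is consistent with the identical KNN and decision tree values above.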


-----

## **Conclusion**

To predict the severity codes for car collisions, I used three classification algorithms - K-nearest neighbors, decision tree, and logistic regression - for machine learning modeling. From the results, all three models reached an accuracy above 0.6 on the test set, and among them the decision tree performed best with an accuracy of 0.652413.