# Determining severity of an accident
## Coursera Capstone

## Table of Contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)

## Introduction: Business Problem <a name="introduction"></a>

In this project we will try to predict the severity of a road accident, that is, whether it results in property damage or personal injury, from the weather, road and light conditions at the time of the journey. The report is targeted at insurance companies. With this model an insurance company may be able to handle claims more efficiently, as it can predict whether a claim is likely to be a personal injury claim or a property damage claim. If a claimant enters the circumstances of their accident when making a claim, the model should allow them to be directed to the correct claims department.

## Data <a name="data"></a>

Based on the business problem we will use the Coursera-provided data set (link shown later) and extract particular attributes of interest that can help predict whether an accident results in property damage or injury. Some key factors are:

* Road Conditions
* Light Conditions
* Weather Conditions

From this raw data set it is necessary to remove unwanted attributes and then go through various data cleaning stages to ensure the data is in the right format for modelling.

Rows with missing values in the attributes of interest are removed, and rows where an attribute is recorded as 'Unknown' are also removed, as these will not help the predictive power of our model.

Categorical data is encoded into a binary representation by one-hot encoding in Python.
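As a minimal sketch of what one-hot encoding does, the example below applies `pd.get_dummies` to a small made-up frame. The rows are hypothetical and only borrow category labels that appear in the data set; the `toy` and `encoded` names are ours.

```python
import pandas as pd

# Hypothetical rows; only the category labels are taken from the data set
toy = pd.DataFrame({
    "ROADCOND": ["Wet", "Dry", "Wet"],
    "LIGHTCOND": ["Daylight", "Daylight", "Dusk"],
})

# One 0/1 indicator column is created per (attribute, category) pair
encoded = pd.get_dummies(toy, columns=["ROADCOND", "LIGHTCOND"], dtype=int)
print(encoded)
```

Each original column is replaced by one indicator column per category, so exactly one indicator per attribute is set to 1 in every row.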

The target variable is the severity code (column 'SEVERITYCODE'). In our data set it has two outcomes: property damage or injury.

Once the data has been cleaned we are left with a data set of 167,857 samples and 26 features.

In [36]:
#import necessary python libraries
import itertools
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import pandas as pd
import matplotlib.ticker as ticker
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.linear_model import LogisticRegression
#note: jaccard_similarity_score was removed in scikit-learn 0.23; newer versions provide jaccard_score instead
from sklearn.metrics import jaccard_similarity_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss

In [2]:
#data set - using the Coursera provided example dataset
data_set = "https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv"

In [3]:
#read data set into a pandas dataframe
df = pd.read_csv(data_set)
df.head()


Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


In [4]:
df.shape

(194673, 38)

In [5]:
#describe summary statistics of the dataset
df.describe(include = "all")

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
count,194673.0,189339.0,189339.0,194673.0,194673.0,194673.0,194673.0,194673,192747,65070.0,...,189661,189503,4667,114936.0,9333,194655.0,189769,194673.0,194673.0,194673
unique,,,,,,,194670.0,2,3,,...,9,9,1,,1,115.0,62,,,2
top,,,,,,,1776526.0,Matched,Block,,...,Dry,Daylight,Y,,Y,32.0,One parked--one moving,,,N
freq,,,,,,,2.0,189786,126926,,...,124510,116137,4667,,9333,27612.0,44421,,,187457
mean,1.298901,-122.330518,47.619543,108479.36493,141091.45635,141298.811381,,,,37558.450576,...,,,,7972521.0,,,,269.401114,9782.452,
std,0.457778,0.029976,0.056157,62649.722558,86634.402737,86986.54211,,,,51745.990273,...,,,,2553533.0,,,,3315.776055,72269.26,
min,1.0,-122.419091,47.495573,1.0,1001.0,1001.0,,,,23807.0,...,,,,1007024.0,,,,0.0,0.0,
25%,1.0,-122.348673,47.575956,54267.0,70383.0,70383.0,,,,28667.0,...,,,,6040015.0,,,,0.0,0.0,
50%,1.0,-122.330224,47.615369,106912.0,123363.0,123363.0,,,,29973.0,...,,,,8023022.0,,,,0.0,0.0,
75%,2.0,-122.311937,47.663664,162272.0,203319.0,203459.0,,,,33973.0,...,,,,10155010.0,,,,0.0,0.0,


In [6]:
#Severity code will be the target attribute
df['SEVERITYCODE'].value_counts()

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64
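The two classes are not balanced, which matters when reading accuracy figures later on. As a rough sketch, the counts printed above can be turned into class shares and a majority-class baseline. The counts are for the raw data before cleaning, and the variable names are ours.

```python
import pandas as pd

# Counts copied from the value_counts() output above (raw data, before cleaning)
counts = pd.Series({1: 136485, 2: 58188}, name="SEVERITYCODE")

# Share of each severity code in the raw data
shares = counts / counts.sum()

# Accuracy of a trivial model that always predicts the most frequent code
baseline_accuracy = shares.max()

print(shares.round(3))
print("Majority-class baseline accuracy:", round(baseline_accuracy, 3))
```

Any classifier trained on this data should be compared against that baseline rather than against 50%.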

In [7]:
#info is a method and must be called to print the column summary
df.info()

In [8]:
df.dtypes

SEVERITYCODE        int64
X                 float64
Y                 float64
OBJECTID            int64
INCKEY              int64
COLDETKEY           int64
REPORTNO           object
STATUS             object
ADDRTYPE           object
INTKEY            float64
LOCATION           object
EXCEPTRSNCODE      object
EXCEPTRSNDESC      object
SEVERITYCODE.1      int64
SEVERITYDESC       object
COLLISIONTYPE      object
PERSONCOUNT         int64
PEDCOUNT            int64
PEDCYLCOUNT         int64
VEHCOUNT            int64
INCDATE            object
INCDTTM            object
JUNCTIONTYPE       object
SDOT_COLCODE        int64
SDOT_COLDESC       object
INATTENTIONIND     object
UNDERINFL          object
WEATHER            object
ROADCOND           object
LIGHTCOND          object
PEDROWNOTGRNT      object
SDOTCOLNUM        float64
SPEEDING           object
ST_COLCODE         object
ST_COLDESC         object
SEGLANEKEY          int64
CROSSWALKKEY        int64
HITPARKEDCAR       object
dtype: object

In [9]:
df_final = df.filter(['SEVERITYCODE','ADDRTYPE','JUNCTIONTYPE', 'WEATHER', 'ROADCOND','LIGHTCOND'], axis=1)
df_final.head()

Unnamed: 0,SEVERITYCODE,ADDRTYPE,JUNCTIONTYPE,WEATHER,ROADCOND,LIGHTCOND
0,2,Intersection,At Intersection (intersection related),Overcast,Wet,Daylight
1,1,Block,Mid-Block (not related to intersection),Raining,Wet,Dark - Street Lights On
2,1,Block,Mid-Block (not related to intersection),Overcast,Dry,Daylight
3,1,Block,Mid-Block (not related to intersection),Clear,Dry,Daylight
4,2,Intersection,At Intersection (intersection related),Raining,Wet,Daylight


In [10]:
#check for missing data
missing_data = df_final.isnull()
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("") 

SEVERITYCODE
False    194673
Name: SEVERITYCODE, dtype: int64

ADDRTYPE
False    192747
True       1926
Name: ADDRTYPE, dtype: int64

JUNCTIONTYPE
False    188344
True       6329
Name: JUNCTIONTYPE, dtype: int64

WEATHER
False    189592
True       5081
Name: WEATHER, dtype: int64

ROADCOND
False    189661
True       5012
Name: ROADCOND, dtype: int64

LIGHTCOND
False    189503
True       5170
Name: LIGHTCOND, dtype: int64
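A more compact way to get the same information is to sum the boolean mask, which gives one missing-value count per column. The sketch below uses a small hypothetical frame rather than the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing entries
toy = pd.DataFrame({
    "WEATHER": ["Clear", np.nan, "Raining"],
    "ROADCOND": ["Dry", "Wet", np.nan],
    "LIGHTCOND": ["Daylight", "Dusk", "Dawn"],
})

# isnull() marks missing cells as True; summing counts them per column
missing_counts = toy.isnull().sum()
print(missing_counts)
```

On the real data the equivalent call would be `df_final.isnull().sum()`.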



In [11]:
#given the missing values do not take up a large portion of the data set we will drop missing values
df_final.dropna(subset=['ADDRTYPE','JUNCTIONTYPE', 'WEATHER', 'ROADCOND','LIGHTCOND'], axis=0, inplace=True)
df_final.shape


(182914, 6)

In [12]:
#check for missing data again, to ensure processing has worked correctly
missing_data_updated = df_final.isnull()
for column in missing_data_updated.columns.values.tolist():
    print(column)
    print (missing_data_updated[column].value_counts())
    print("")


SEVERITYCODE
False    182914
Name: SEVERITYCODE, dtype: int64

ADDRTYPE
False    182914
Name: ADDRTYPE, dtype: int64

JUNCTIONTYPE
False    182914
Name: JUNCTIONTYPE, dtype: int64

WEATHER
False    182914
Name: WEATHER, dtype: int64

ROADCOND
False    182914
Name: ROADCOND, dtype: int64

LIGHTCOND
False    182914
Name: LIGHTCOND, dtype: int64



In [13]:
df_final.groupby(['ADDRTYPE'])['SEVERITYCODE'].value_counts()

ADDRTYPE      SEVERITYCODE
Alley         1                 206
              2                  29
Block         1               90112
              2               29254
Intersection  1               35958
              2               27355
Name: SEVERITYCODE, dtype: int64

In [14]:
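#encode address type as integers; assign the result back rather than replacing in place on a column selection
df_final['ADDRTYPE'] = df_final['ADDRTYPE'].replace(to_replace=['Alley','Block','Intersection'], value=[0,1,2])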
df_final.head()

Unnamed: 0,SEVERITYCODE,ADDRTYPE,JUNCTIONTYPE,WEATHER,ROADCOND,LIGHTCOND
0,2,2,At Intersection (intersection related),Overcast,Wet,Daylight
1,1,1,Mid-Block (not related to intersection),Raining,Wet,Dark - Street Lights On
2,1,1,Mid-Block (not related to intersection),Overcast,Dry,Daylight
3,1,1,Mid-Block (not related to intersection),Clear,Dry,Daylight
4,2,2,At Intersection (intersection related),Raining,Wet,Daylight


In [18]:
df_final.groupby(['JUNCTIONTYPE'])['SEVERITYCODE'].value_counts()

JUNCTIONTYPE                                       SEVERITYCODE
At Intersection (but not related to intersection)  1                1439
                                                   2                 616
At Intersection (intersection related)             1               34492
                                                   2               26729
Driveway Junction                                  1                7324
                                                   2                3195
Mid-Block (but intersection related)               1               15153
                                                   2                7188
Mid-Block (not related to intersection)            1               67754
                                                   2               18859
Ramp Junction                                      1                 110
                                                   2                  50
Unknown                                            1        

In [19]:
df_final.drop(df_final[df_final.JUNCTIONTYPE == 'Unknown'].index, inplace=True)

In [21]:
df_final.groupby(['WEATHER'])['SEVERITYCODE'].value_counts()

WEATHER                   SEVERITYCODE
Blowing Sand/Dirt         1                  36
                          2                  13
Clear                     1               73476
                          2               35585
Fog/Smog/Smoke            1                 370
                          2                 186
Other                     1                 632
                          2                 114
Overcast                  1               18513
                          2                8675
Partly Cloudy             2                   3
                          1                   2
Raining                   1               21560
                          2               11089
Severe Crosswind          1                  18
                          2                   7
Sleet/Hail/Freezing Rain  1                  85
                          2                  27
Snowing                   1                 714
                          2                 167
U

In [30]:
df_final.drop(df_final[df_final.WEATHER == 'Unknown'].index, inplace=True)

In [23]:
df_final.groupby(['ROADCOND'])['SEVERITYCODE'].value_counts()

ROADCOND        SEVERITYCODE
Dry             1               81761
                2               39617
Ice             1                 850
                2                 264
Oil             1                  34
                2                  24
Other           1                  69
                2                  42
Sand/Mud/Dirt   1                  40
                2                  21
Snow/Slush      1                 731
                2                 160
Standing Water  1                  75
                2                  29
Unknown         1                1098
                2                 146
Wet             1               30748
                2               15563
Name: SEVERITYCODE, dtype: int64

In [24]:
df_final.drop(df_final[df_final.ROADCOND == 'Unknown'].index, inplace=True)

In [25]:
df_final.groupby(['LIGHTCOND'])['SEVERITYCODE'].value_counts()

LIGHTCOND                 SEVERITYCODE
Dark - No Street Lights   1                1039
                          2                 319
Dark - Street Lights Off  1                 782
                          2                 307
Dark - Street Lights On   1               31911
                          2               14201
Dark - Unknown Lighting   1                   5
                          2                   4
Dawn                      1                1578
                          2                 805
Daylight                  1               73245
                          2               37909
Dusk                      1                3677
                          2                1891
Other                     1                 135
                          2                  49
Unknown                   1                1936
                          2                 235
Name: SEVERITYCODE, dtype: int64

In [26]:
df_final.drop(df_final[df_final.LIGHTCOND == 'Unknown'].index, inplace=True)
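The per-column drops above can also be written as a single filter. The following is only a sketch on a hypothetical frame, not the code used in this notebook; the `condition_cols` list and the other names are ours.

```python
import pandas as pd

# Hypothetical rows; 'Unknown' appears once in each condition column
toy = pd.DataFrame({
    "WEATHER": ["Clear", "Unknown", "Raining", "Overcast"],
    "ROADCOND": ["Dry", "Wet", "Unknown", "Wet"],
    "LIGHTCOND": ["Daylight", "Dusk", "Dawn", "Unknown"],
})

condition_cols = ["WEATHER", "ROADCOND", "LIGHTCOND"]

# True for every row where at least one condition column equals 'Unknown'
has_unknown = toy[condition_cols].eq("Unknown").any(axis=1)

# Keep only the rows without any 'Unknown' value
cleaned = toy.loc[~has_unknown].reset_index(drop=True)
print(cleaned)
```

Adding 'JUNCTIONTYPE' to the list would cover the junction filter as well.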

In [27]:
df_final.head()

Unnamed: 0,SEVERITYCODE,ADDRTYPE,JUNCTIONTYPE,WEATHER,ROADCOND,LIGHTCOND
0,2,2,At Intersection (intersection related),Overcast,Wet,Daylight
1,1,1,Mid-Block (not related to intersection),Raining,Wet,Dark - Street Lights On
2,1,1,Mid-Block (not related to intersection),Overcast,Dry,Daylight
3,1,1,Mid-Block (not related to intersection),Clear,Dry,Daylight
4,2,2,At Intersection (intersection related),Raining,Wet,Daylight


In [44]:

Feature = pd.concat([df_final,pd.get_dummies(df_final[['WEATHER','ROADCOND','LIGHTCOND']])], axis=1)
Feature = Feature.drop(['SEVERITYCODE','ADDRTYPE','JUNCTIONTYPE','WEATHER','ROADCOND','LIGHTCOND'],axis=1)
Feature.head()

Unnamed: 0,WEATHER_Blowing Sand/Dirt,WEATHER_Clear,WEATHER_Fog/Smog/Smoke,WEATHER_Other,WEATHER_Overcast,WEATHER_Partly Cloudy,WEATHER_Raining,WEATHER_Severe Crosswind,WEATHER_Sleet/Hail/Freezing Rain,WEATHER_Snowing,...,ROADCOND_Standing Water,ROADCOND_Wet,LIGHTCOND_Dark - No Street Lights,LIGHTCOND_Dark - Street Lights Off,LIGHTCOND_Dark - Street Lights On,LIGHTCOND_Dark - Unknown Lighting,LIGHTCOND_Dawn,LIGHTCOND_Daylight,LIGHTCOND_Dusk,LIGHTCOND_Other
0,0,0,0,0,1,0,0,0,0,0,...,0,1,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,1,0,0,0,...,0,1,0,0,1,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,0,0,0,0,0,0,1,0,0,0,...,0,1,0,0,0,0,0,1,0,0


In [45]:
X = Feature
X[0:5]

Unnamed: 0,WEATHER_Blowing Sand/Dirt,WEATHER_Clear,WEATHER_Fog/Smog/Smoke,WEATHER_Other,WEATHER_Overcast,WEATHER_Partly Cloudy,WEATHER_Raining,WEATHER_Severe Crosswind,WEATHER_Sleet/Hail/Freezing Rain,WEATHER_Snowing,...,ROADCOND_Standing Water,ROADCOND_Wet,LIGHTCOND_Dark - No Street Lights,LIGHTCOND_Dark - Street Lights Off,LIGHTCOND_Dark - Street Lights On,LIGHTCOND_Dark - Unknown Lighting,LIGHTCOND_Dawn,LIGHTCOND_Daylight,LIGHTCOND_Dusk,LIGHTCOND_Other
0,0,0,0,0,1,0,0,0,0,0,...,0,1,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,1,0,0,0,...,0,1,0,0,1,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,0,0,0,0,0,0,1,0,0,0,...,0,1,0,0,0,0,0,1,0,0


In [46]:
y = df_final['SEVERITYCODE'].values
y[0:5]

array([2, 1, 1, 1, 2])

## K Nearest Neighbour


In [47]:
#split data into train and test
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Train set: (134285, 26) (134285,)
Test set: (33572, 26) (33572,)


In [48]:
#Normalise data - give data zero mean and unit variance
#fit the scaler on the training set only and reuse it for the test set
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [55]:
from sklearn import metrics
Ks = 8
mean_acc = np.zeros(4)

for n in range(4,Ks):
    
    #Train Model and Predict  
    neigh = KNeighborsClassifier(n_neighbors = n).fit(X_train,y_train)
    yhat=neigh.predict(X_test)
    mean_acc[n-4] = metrics.accuracy_score(y_test, yhat)


mean_acc


KeyboardInterrupt: 
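Predicting with KNN on a training set of this size is slow, which is probably why the search over k was interrupted. One option, which is an assumption on our part and not what the notebook does, is to run the search on a smaller random sample first. The sketch below shows the same loop on synthetic stand-in data with the same number of features; the data and the names `X_demo`, `y_demo` and `accuracy_by_k` are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)

# Synthetic stand-in for the 26 one-hot features and the severity codes (1 or 2)
X_demo = rng.integers(0, 2, size=(2000, 26))
y_demo = rng.integers(1, 3, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=4)

# Test accuracy for each candidate k, keyed by k
accuracy_by_k = {}
for k in range(4, 8):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    accuracy_by_k[k] = accuracy_score(y_te, model.predict(X_te))

# The k with the highest test accuracy
best_k = max(accuracy_by_k, key=accuracy_by_k.get)
print(accuracy_by_k, "best k:", best_k)
```

Because the labels here are random, the accuracies carry no meaning; the point is only the shape of the search.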

In [None]:
k = 7
#Train Model and Predict  
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
neigh

In [None]:
#calculate stats for knn algorithm
yhat_knn = neigh.predict(X_test)
knn_jac = jaccard_similarity_score(y_test, yhat_knn)
knn_f1 = f1_score(y_test, yhat_knn)
print("KNN jaccard score is: ", knn_jac)
print("KNN f1 score is: ", knn_f1)
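On recent scikit-learn versions the Jaccard metric is available as `jaccard_score`. The sketch below shows it on four hypothetical labels, next to the F1 score and accuracy. It assumes that code 2 denotes injury collisions and treats that as the positive class; that choice is ours.

```python
from sklearn.metrics import accuracy_score, f1_score, jaccard_score

# Hypothetical true and predicted severity codes
y_true = [1, 1, 2, 2]
y_pred = [1, 2, 2, 2]

# Jaccard index for label 2: TP / (TP + FP + FN)
jac_injury = jaccard_score(y_true, y_pred, pos_label=2)

# F1 score for label 2: harmonic mean of precision and recall
f1_injury = f1_score(y_true, y_pred, pos_label=2)

# Fraction of labels predicted correctly
acc = accuracy_score(y_true, y_pred)

print(jac_injury, f1_injury, acc)
```

Both `jaccard_score` and `f1_score` report on a single positive class by default, so the result depends on which `pos_label` is chosen.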

## Decision Tree

In [None]:
dt = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
dt # it shows the default parameters

In [None]:
dt.fit(X_train,y_train)

In [None]:
#calculate stats for decision tree algorithm
predTree = dt.predict(X_test)
dectree_jac = jaccard_similarity_score(y_test, predTree)
dectree_f1 = f1_score(y_test, predTree)
print("Decision tree jaccard score is: ", dectree_jac)
print("Decision tree f1 score is: ", dectree_f1)

## Support Vector Machines

In [None]:
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train) 

In [None]:
#calculate stats for svm algorithm
svm_hat = clf.predict(X_test)
svm_jac = jaccard_similarity_score(y_test, svm_hat)
svm_f1 = f1_score(y_test, svm_hat)
print("SVM jaccard score is: ", svm_jac)
print("SVM f1 score is: ", svm_f1)

## Logistic Regression

In [None]:
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
LR

In [None]:
#calculate stats for logistic algorithm
log_hat = LR.predict(X_test)
log_jac = jaccard_similarity_score(y_test, log_hat)
log_f1 = f1_score(y_test, log_hat)
#log loss is computed from predicted class probabilities rather than hard labels
log_logloss = log_loss(y_test, LR.predict_proba(X_test))
print("Logistic jaccard score is: ", log_jac)
print("Logistic f1 score is: ", log_f1)
print("Logistic log loss score is: ", log_logloss)
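Log loss measures how well predicted probabilities match the true labels, so it needs probabilities as input. As a small sketch with hypothetical numbers, the binary log loss can be written out by hand and compared with `log_loss`; the probabilities and variable names are made up for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 2, 2, 1])            # severity codes
p_code2 = np.array([0.2, 0.7, 0.9, 0.4])   # hypothetical predicted P(code == 2)

# Binary log loss written out: -mean(y*log(p) + (1-y)*log(1-p))
is_code2 = (y_true == 2).astype(float)
manual = -np.mean(is_code2 * np.log(p_code2) + (1 - is_code2) * np.log(1 - p_code2))

# scikit-learn expects one probability column per class, ordered by sorted label
proba = np.column_stack([1 - p_code2, p_code2])
from_sklearn = log_loss(y_true, proba, labels=[1, 2])

print(manual, from_sklearn)
```

The two values agree, and a lower value means the predicted probabilities are closer to the true outcomes.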