In [17]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hello Capstone Project Course!



## Introduction / Business Problem

### Model Purpose
The purpose of this model is to predict the severity of a potential road accident (a binary measurement in this case). The aim is to avoid being stuck in traffic caused by a severe accident (denoted as 2, whereas a less severe accident is denoted as 1). The severity of an accident will be predicted based on the weather, road, and light conditions at a given point in time. 

### Primary Audience / Stakeholders

This type of model is useful for anyone who is planning a trip and wishes to avoid being stuck in traffic caused by accidents. It is perhaps especially useful for those who value their time and are able to plan their travel. For example, delivery drivers may not benefit as much from this model as people travelling for leisure or vacation, since the former are often on a schedule outside their control and may be forced to travel irrespective of what the model suggests. However, the model can still be used to plan deliveries (and the like) more carefully, and fewer leisure drivers on the road during accident-prone times will also benefit those with less choice.

### Potential Limitations of Model

While the data used to create the model is based on traffic accident data in Seattle, it is likely that the model generalizes to situations outside of Seattle as well. However, its predictive ability likely decreases the more the predicted area differs from Seattle. For instance, Seattle is located in the Pacific Northwest of the United States, and the model will likely perform best for areas that have similar driving laws, driving culture, and climate. Thus, this model is most useful to stakeholders travelling through areas of this type. 


## Data Understanding / Exploration

In [18]:
# loads data
data = pd.read_csv('Data-Collisions.csv')
data.head()

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


In [19]:
# counts number of empty values in ROADCOND
data['ROADCOND'].isnull().sum()

5012

In [20]:
# counts the occurrences of each value in the severity code column (to determine if the label is balanced)
data['SEVERITYCODE'].value_counts()

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64
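
The counts above can be turned into class shares to make the imbalance explicit. The following is a minimal sketch that re-enters the two counts by hand rather than reading the CSV; the variable names `counts`, `shares`, and `majority_baseline` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({1: 136485, 2: 58188}, name="SEVERITYCODE")

# Share of each class; the same numbers value_counts(normalize=True) would give.
shares = counts / counts.sum()

# Accuracy of a trivial baseline that always predicts the majority class.
majority_baseline = shares.max()

print(shares.round(3))
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any classifier trained on this data should be compared against that majority-class baseline, since a model that always predicts class 1 already reaches that accuracy.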

In [21]:
# lists all the features in the dataset and some information about them
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194673 entries, 0 to 194672
Data columns (total 38 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   SEVERITYCODE    194673 non-null  int64  
 1   X               189339 non-null  float64
 2   Y               189339 non-null  float64
 3   OBJECTID        194673 non-null  int64  
 4   INCKEY          194673 non-null  int64  
 5   COLDETKEY       194673 non-null  int64  
 6   REPORTNO        194673 non-null  object 
 7   STATUS          194673 non-null  object 
 8   ADDRTYPE        192747 non-null  object 
 9   INTKEY          65070 non-null   float64
 10  LOCATION        191996 non-null  object 
 11  EXCEPTRSNCODE   84811 non-null   object 
 12  EXCEPTRSNDESC   5638 non-null    object 
 13  SEVERITYCODE.1  194673 non-null  int64  
 14  SEVERITYDESC    194673 non-null  object 
 15  COLLISIONTYPE   189769 non-null  object 
 16  PERSONCOUNT     194673 non-null  int64  
 

## Excluding Features

As we can see, there are a lot of different features to choose from. There are too many, in fact, and we need to investigate which of these are relevant for solving the outlined problem. This will be achieved partly by reading Metadata.pdf, which contains a detailed description of each feature and is useful for immediately excluding some of them. Others will require further exploration to determine their relevance. 

### Irrelevant Features
To start, we will exclude features that are obviously irrelevant, such as features containing various identification numbers. We also remove dates/timestamps, on the assumption that accident severity does not depend on a particular calendar day. 

We remove:
* OBJECTID
* INCKEY 
* COLDETKEY
* REPORTNO
* STATUS
* INTKEY
* EXCEPTRSNCODE
* EXCEPTRSNDESC
* SEVERITYCODE.1 (Duplicate)
* SEVERITYDESC
* INCDATE
* INCDTTM
* SDOTCOLNUM

Additionally, there are features that detail how many vehicles/persons were part of a particular accident. Naturally, the more people/vehicles involved, the more severe an accident is likely to be. However, these counts are only known after an accident has happened, so using them for predictive purposes is not practical, and they are likewise excluded.

We remove:
* PEDCOUNT (number of pedestrians)
* PEDCYLCOUNT (number of bicycles)
* PERSONCOUNT (number of persons)
* VEHCOUNT (number of vehicles)


In [22]:
features_to_exclude = ['OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO', 'STATUS', 'INTKEY', 'EXCEPTRSNCODE', 'EXCEPTRSNDESC', 'SEVERITYCODE.1', 'SEVERITYDESC', 'INCDATE', 'INCDTTM', 'SDOTCOLNUM', 'PEDCOUNT', 'PEDCYLCOUNT', 'PERSONCOUNT', 'VEHCOUNT']

### A Priori / Illegal Features

We also exclude features which obviously increase the risk of a traffic incident (and which are illegal) such as driving under the influence, not paying attention, driving without adequate rest, or speeding. While these obviously are correlated with the risk and severity of traffic accidents, they present no additional information to a prudent driver seeking to avoid accidents.

We remove:
* SPEEDING (speeding driver)
* UNDERINFL (drunk driver)
* INATTENTIONIND (inattentive driver)
* PEDROWNOTGRNT (not giving pedestrian right of way - illegal driving)

In [23]:
features_to_exclude += ['SPEEDING', 'UNDERINFL', 'INATTENTIONIND', 'PEDROWNOTGRNT']

### Too Detailed Features

Some features contain overly detailed information about the crashes. They cannot readily be used to avoid accidents, although they may be useful for understanding them. That understanding is more relevant to car manufacturers designing safer cars or to road engineers designing safer roads; such insights are not relevant for this application. Furthermore, many of these features are too specific to individual accidents to generalize readily.

The following features correspond to this type of information, and are thus excluded:
* ST_COLCODE & ST_COLDESC (details of which direction a car was traveling in before an accident)
* SDOT_COLCODE & SDOT_COLDESC (details of the specific accident)
* SEGLANEKEY (details of which lane segment an accident occurred in; not practical to avoid lane changing)
* CROSSWALKKEY (which crosswalk an accident occurred in)
* HITPARKEDCAR (if a parked car was hit)
* COLLISIONTYPE (also contains information about what the driver was doing when the accident occurred)
* JUNCTIONTYPE (which junction type an accident occurred in; not practical to avoid any of these)
* ADDRTYPE (similar to JUNCTIONTYPE)

One pair of features that contains significant information about traffic accidents is the location where an accident was recorded (X & Y). However, using coordinates is beyond the scope of this model; it would be neat to incorporate them into a model that takes a detailed travel path as input (e.g. as in Google Maps).

We also exclude:
* X & Y (contains specific coordinates; might be relevant but too difficult to use in this model)
* LOCATION (detailed description of the location)

In [24]:
features_to_exclude += ['ST_COLCODE', 'ST_COLDESC', 'SDOT_COLCODE', 'SDOT_COLDESC', 'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR', 'COLLISIONTYPE', 'JUNCTIONTYPE', 'ADDRTYPE', 'X', 'Y', 'LOCATION']

### Surviving Features

After removing all the discussed features we are left with only these three:
* WEATHER
* LIGHTCOND
* ROADCOND

These are great for predictive purposes as they are easy to obtain when making a prediction; you just need to check the weather. They are also not tied to any specific location in the Seattle area, or indeed to the Seattle area itself, so the model can more readily be extended to other areas. 

Each of these features contain the following values:
* ROADCOND: Dry, Ice, Oil, Other, Sand/Mud/Dirt, Snow/Slush, Standing Water, Unknown, Wet
* LIGHTCOND: Dark - No Street Lights, Dark - Street Lights Off, Dark - Street Lights On, Dark - Unknown Lighting, Dawn, Daylight, Dusk, Other, Unknown
* WEATHER: Blowing Sand/Dirt, Clear, Fog/Smog/Smoke, Other, Overcast, Partly Cloudy, Raining, Severe Crosswind, Sleet/Hail/Freezing Rain, Snowing, Unknown

In particular, weather is important to include here, as a single poor weather phenomenon combined with otherwise good road conditions may still make for less safe driving, e.g. dry roads with severe crosswinds.


## Final Target & Feature Selection

In [27]:
# target feature 
target = 'SEVERITYCODE'

# all the features in the dataset
all_features = list(data)

# creates list of features we've decided to keep (i.e. all columns except the excluded features and the target variable)
# note: a set has no guaranteed order, so the order of `features` may differ between runs
features = list(set(all_features).difference(features_to_exclude+[target]))

# new dataframe only containing the target and feature variables
df = data[[target]+features]
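
Because the cell above builds `features` from a `set`, the later lookups `features[0]`, `features[1]`, and `features[2]` depend on an arbitrary ordering. A possible order-preserving alternative is sketched below on a tiny made-up frame; `toy`, `toy_target`, `toy_exclude`, and `toy_features` are hypothetical names, and the values are illustrative rather than taken from the dataset.

```python
import pandas as pd

# Tiny stand-in for the collisions data; the values are made up for illustration.
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 2],
    "OBJECTID": [1, 2],
    "WEATHER": ["Clear", "Raining"],
    "ROADCOND": ["Dry", "Wet"],
    "LIGHTCOND": ["Daylight", "Dusk"],
})
toy_target = "SEVERITYCODE"
toy_exclude = ["OBJECTID", "SPEEDING"]  # SPEEDING is absent from the toy frame on purpose

# Keep columns in their original order, skipping the target and the excluded names.
skip = set(toy_exclude) | {toy_target}
toy_features = [col for col in toy.columns if col not in skip]

# Target first, then the surviving features in a stable order.
toy_df = toy[[toy_target] + toy_features]
print(toy_features)
```

With this approach the feature order follows the column order of the source data, so indexing into the list gives the same column on every run.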

In [28]:
df.groupby(target)[features[0]].value_counts().unstack()


SEVERITYCODE \ ROADCOND,Dry,Ice,Oil,Other,Sand/Mud/Dirt,Snow/Slush,Standing Water,Unknown,Wet
1,84446,936,40,89,52,837,85,14329,31719
2,40064,273,24,43,23,167,30,749,15755
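
Raw counts are hard to compare across road conditions because the categories differ greatly in size. As a rough sketch, the counts from the ROADCOND table above can be re-entered by hand and converted into the share of severity-2 accidents per condition; `road_counts` and `severe_share` are illustrative names that do not appear in the original notebook.

```python
import pandas as pd

# Counts copied from the ROADCOND table above (rows: SEVERITYCODE 1 and 2).
road_counts = pd.DataFrame(
    {
        "Dry": [84446, 40064],
        "Ice": [936, 273],
        "Oil": [40, 24],
        "Other": [89, 43],
        "Sand/Mud/Dirt": [52, 23],
        "Snow/Slush": [837, 167],
        "Standing Water": [85, 30],
        "Unknown": [14329, 749],
        "Wet": [31719, 15755],
    },
    index=pd.Index([1, 2], name="SEVERITYCODE"),
)

# Fraction of accidents within each road condition that were severity 2.
severe_share = road_counts.loc[2] / road_counts.sum(axis=0)
print(severe_share.sort_values(ascending=False).round(3))
```

Shares that sit close to the overall class-2 rate suggest that a category carries little information about severity on its own.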


In [100]:
df.groupby(target)[features[1]].value_counts().unstack()

SEVERITYCODE \ LIGHTCOND,Dark - No Street Lights,Dark - Street Lights Off,Dark - Street Lights On,Dark - Unknown Lighting,Dawn,Daylight,Dusk,Other,Unknown
1,1203,883,34032,7,1678,77593,3958,183,12868
2,334,316,14475,4,824,38544,1944,52,605


In [29]:
df.groupby(target)[features[2]].value_counts().unstack()

SEVERITYCODE \ WEATHER,Blowing Sand/Dirt,Clear,Fog/Smog/Smoke,Other,Overcast,Partly Cloudy,Raining,Severe Crosswind,Sleet/Hail/Freezing Rain,Snowing,Unknown
1,41,75295,382,716,18969,2,21969,18,85,736,14275
2,15,35840,187,116,8745,3,11176,7,28,171,816


### Data Discussion

The data encompasses traffic accident data recorded in the Seattle area. There are 37 different features and 1 target variable in the dataset. These features contain some irrelevant data (at least for prediction purposes), such as report numbers and object IDs, that simply will not be used. 
 
There is also detailed data of how each crash occurred, such as which lane segment, if the accident occurred in an intersection, on a block or something similar. These details are likewise not very interesting as travelling through intersections, changing lanes, and passing by city blocks is an intrinsic part of travel and cannot be avoided. Other irrelevant data is if the accident occurred while the driver was under the influence or he/she was speeding. It should go without saying that this should not be done while driving and is thus unnecessary data. 
 
Furthermore, a lot of the data is not independent of other variables. For instance, weather conditions (WEATHER) overlap with road and light conditions (ROADCOND and LIGHTCOND respectively), but WEATHER is kept because it can capture hazards that the other two miss, such as severe crosswinds on dry roads. Likewise, the number of pedestrians or vehicles involved in a crash is assumed to be built into the severity of the accident. 
 
For these reasons, we will only use WEATHER, ROADCOND and LIGHTCOND for our model. These features are categorical, as is the target variable, so a supervised classification algorithm is appropriate. 

## Preprocessing

In [101]:
# one hot encoding of the chosen features
X = pd.get_dummies(df[features])
Y = df[target]
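
Since ROADCOND contains missing values (5012 nulls were counted earlier), it helps to see how `pd.get_dummies` treats them. The sketch below uses a made-up four-row column; `toy`, `encoded`, and `encoded_with_na` are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy column with one missing value; the values are illustrative only.
toy = pd.DataFrame({"ROADCOND": ["Dry", "Wet", None, "Dry"]})

# Default behaviour: a missing value becomes a row with no category set.
encoded = pd.get_dummies(toy)

# dummy_na=True adds an explicit indicator column for missing values.
encoded_with_na = pd.get_dummies(toy, dummy_na=True)

print(encoded.astype(int))
print(encoded_with_na.astype(int))
```

With the default settings, rows with missing values are not dropped; they are silently encoded as all zeros, which the classifier then treats as its own implicit category.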

In [102]:
# splitting the data into train/test sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
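
The split above is random and unseeded, so the class ratio in the test set and the reported scores can vary slightly between runs. One option, shown here as a sketch on synthetic labels rather than as the notebook's method, is to pass `stratify` and `random_state` to `train_test_split`; the names `X_toy`, `y_toy`, and the 70/30 label mix are assumptions chosen to mimic the imbalance seen earlier.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the same 70/30 imbalance as SEVERITYCODE.
y_toy = np.array([1] * 70 + [2] * 30)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

# stratify keeps the class ratio the same in both splits;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

print("train share of class 2:", (y_tr == 2).mean())
print("test share of class 2:", (y_te == 2).mean())
```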

In [103]:
# we use the ensemble random forest classifier for this problem
from sklearn.ensemble import RandomForestClassifier

# trains the forest models
clf = RandomForestClassifier()
clf.fit(X_train, Y_train)

# predicts results
Y_pred = clf.predict(X_test) 

In [104]:
from sklearn.metrics import f1_score, confusion_matrix, precision_score, recall_score

# rows of the confusion matrix are the true classes (1, 2), columns are the predicted classes
print('Confusion Matrix Tree : \n', confusion_matrix(Y_test, Y_pred, normalize='all'),'\n')
# pos_label defaults to 1, so the scores below describe the less severe class (1), not class 2
print('The precision for Tree is ', precision_score(Y_test, Y_pred), '\n') 
print('The recall for Tree is ', recall_score(Y_test, Y_pred),'\n')  
print('The F1-score for Tree is ', f1_score(Y_test, Y_pred),'\n') 

Confusion Matrix Tree : 
 [[7.02529857e-01 7.70514961e-05]
 [2.97393091e-01 0.00000000e+00]] 

The precision for Tree is  0.7025839926024864 

The recall for Tree is  0.9998903348442755 

The F1-score for Tree is  0.8252775766352884 
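
The scores printed above use scikit-learn's default `pos_label=1`, so they describe the less severe class. As a worked check, the per-class precision and recall can be recomputed directly from the normalized confusion matrix; the helper `precision_recall` below is a hypothetical function written for this illustration, and the matrix values are copied from the output above.

```python
import numpy as np

# Normalized confusion matrix copied from the output above.
# Rows are true classes (1, 2); columns are predicted classes (1, 2).
cm = np.array([
    [7.02529857e-01, 7.70514961e-05],
    [2.97393091e-01, 0.00000000e+00],
])

def precision_recall(matrix, k):
    """Precision and recall for the class at row/column index k."""
    true_pos = matrix[k, k]
    predicted = matrix[:, k].sum()  # everything predicted as class k
    actual = matrix[k, :].sum()     # everything that truly is class k
    precision = true_pos / predicted if predicted > 0 else 0.0
    recall = true_pos / actual if actual > 0 else 0.0
    return precision, recall

p1, r1 = precision_recall(cm, 0)  # class 1 (less severe)
p2, r2 = precision_recall(cm, 1)  # class 2 (severe)

print(f"class 1: precision={p1:.4f}, recall={r1:.4f}")
print(f"class 2: precision={p2:.4f}, recall={r2:.4f}")
```

The class-1 values reproduce the printed precision and recall, while the class-2 values show that the model identifies none of the severe accidents in the test set; it predicts class 1 almost everywhere.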

