# Predicting Road Accidents

Luis Terán

## 1. Problem statement

<p style="text-align:justify">Motor vehicle accidents continue to be one of the leading causes of accidental deaths and injuries in the United States. They are responsible for billions in property damage and other economic losses each year. More than 38,000 people die every year in crashes on U.S. roadways. The U.S. traffic fatality rate is 12.4 deaths per 100,000 inhabitants. An additional 4.4 million are injured seriously enough to require medical attention. Road crashes are the leading cause of death in the U.S. for people aged 1-54 (ASIRT, 2020). </p>

<p style="text-align:justify"> The aim of this project is to identify the factors that most influence car accidents and to obtain a quantitative estimate of the significance of the relationship between those factors and road accidents.</p>

<p style="text-align:justify">Even though the data is a sample from Seattle, the behavior patterns found in these car accidents may also apply to other states, or even to countries where less data on this problem is available. The results can therefore serve as a feature reference for the prevention of car accidents elsewhere.</p>

## 2. Data wrangling

### 2.1 Importing the dataset

In [239]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

The original dataset was obtained from the Seattle government's page at:

- <https://data.seattle.gov/Land-Base/Collisions/9kas-rb8d> 

The dataset is available for public access. It was created on April 8, 2020, and the last registered update is from August 27, 2020. Further information about the dataset is available at:

- <https://www.seattle.gov/Documents/Departments/SDOT/GIS/Collisions_OD.pdf>


In [267]:
# A random seed is set for reproducibility purposes
np.random.seed(12345)

In [265]:
df = pd.read_csv("Collisions.csv")
df.head(3)

| | X | Y | OBJECTID | INCKEY | COLDETKEY | REPORTNO | STATUS | ADDRTYPE | INTKEY | LOCATION | ... | ROADCOND | LIGHTCOND | PEDROWNOTGRNT | SDOTCOLNUM | SPEEDING | ST_COLCODE | ST_COLDESC | SEGLANEKEY | CROSSWALKKEY | HITPARKEDCAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1256033.0 | 240501.215914 | 1 | 332176 | 333676 | EA40602 | Matched | Intersection | 26581.0 | 28TH AVE W AND W DRAVUS ST | ... | Dry | Daylight | NaN | NaN | NaN | 14.0 | From same direction - both going straight - on... | 0 | 0 | N |
| 1 | 1282438.0 | 223443.774169 | 2 | 328504 | 330004 | EA10294 | Unmatched | Block | NaN | LAKE WASHINGTON BLVD BETWEEN LAKESIDE AVE AND ... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 0 | Y |
| 2 | 1269233.0 | 229465.525407 | 3 | 329091 | 330591 | EA15604 | Matched | Block | NaN | WESTLAKE AVE N BETWEEN DENNY WAY AND JOHN ST | ... | Dry | Daylight | NaN | NaN | NaN | 11.0 | From same direction - both going straight - bo... | 0 | 0 | N |
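
Since "SEVERITYCODE" contains the label "2b" alongside numeric codes, pandas may infer inconsistent types for it when reading a large file. The following is a minimal sketch, not part of the original notebook, of forcing a string type at load time. The inline CSV text is made-up sample data standing in for `Collisions.csv`.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt standing in for "Collisions.csv"
sample_csv = io.StringIO(
    "OBJECTID,SEVERITYCODE,UNDERINFL\n"
    "1,2,N\n"
    "2,2b,0\n"
    "3,1,Y\n"
)

# Reading the mixed-value columns as strings keeps '2' and '2b' comparable
sample = pd.read_csv(sample_csv, dtype={"SEVERITYCODE": str, "UNDERINFL": str})
print(sample["SEVERITYCODE"].tolist())
```

With the real file, the same `dtype` argument could be passed to the `pd.read_csv("Collisions.csv")` call.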


A first view of the data is displayed above. The dataset consists of 40 variables and 220937 observations:

In [241]:
df.shape

(220937, 40)

The data is split into two dataframes: one with the features used for prediction and one with additional information.

<p style="text-align:justify"> Before selecting the features, it was necessary to state the main prediction objective. We want to know the probability and magnitude of an accident given some characteristics, and the **“Severity code”** variable is helpful for this. It describes the severity of the collision in a road accident and is the variable we aim to predict.</p>

<p style="text-align:justify"> The complete dataset was split into two different dataframes. The first one is for feature selection: the variables that can contribute to an accident were selected and joined into this dataframe. The second one holds other variables selected as additional information; its purpose is to give a better understanding of the data and the problem.</p>


In [242]:
features = ['INCDATE', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND', 'SPEEDING', 'SEVERITYCODE']
additionalInfo = ['X', 'Y', 'SEVERITYDESC', 'COLLISIONTYPE', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'INJURIES', 
                 'SERIOUSINJURIES', 'FATALITIES', 'SEVERITYCODE']

In [243]:
dfInfo = df[additionalInfo].copy()
df = df[features].copy()

In [244]:
df.head()

| | INCDATE | UNDERINFL | WEATHER | ROADCOND | LIGHTCOND | SPEEDING | SEVERITYCODE |
|---|---|---|---|---|---|---|---|
| 0 | 2020/06/09 00:00:00+00 | N | Overcast | Dry | Daylight | NaN | 2 |
| 1 | 2020/02/02 00:00:00+00 | NaN | NaN | NaN | NaN | NaN | 0 |
| 2 | 2020/02/12 00:00:00+00 | N | Clear | Dry | Daylight | NaN | 1 |
| 3 | 2020/01/23 00:00:00+00 | N | Raining | Wet | Dark - Street Lights On | NaN | 1 |
| 4 | 2019/11/26 00:00:00+00 | N | Clear | Dry | Daylight | NaN | 2 |


#### 2.1 Null values

Before we start pre-processing, we first need to take care of NULL values. A NULL value is a field in a table that appears to be blank. We need to know how many NULL values the dataset contains to understand how they affect it.

In [246]:
nulls = df.isna().sum().to_frame()
nulls.columns = ['Null count']
nulls['Null percentage'] = nulls['Null count']/(df.shape[0])*100
nulls

| | Null count | Null percentage |
|---|---|---|
| INCDATE | 0 | 0.0 |
| UNDERINFL | 26251 | 11.881668 |
| WEATHER | 26460 | 11.976265 |
| ROADCOND | 26380 | 11.940055 |
| LIGHTCOND | 26548 | 12.016095 |
| SPEEDING | 211039 | 95.51999 |
| SEVERITYCODE | 1 | 0.000453 |


<p style="text-align:justify"> As we can see, most of the columns have about 12% NULL values. The exceptions are the "INCDATE" and "SEVERITYCODE" columns, which have almost none, and especially the "SPEEDING" column, which has over 95% NULL values; this will be explained later. For now, since the NULL values in the other columns are a modest share of the data and the dataset has many observations, we can simply drop every row that contains a NULL value, excluding those in the "SPEEDING" column. We still keep 194208 of the 220937 observations in the original dataset (about 88% of the data). </p>

In [247]:
df.dropna(subset=['INCDATE', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND', 'SEVERITYCODE'], inplace = True)

In [248]:
df.shape

(194208, 7)
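
The effect of `dropna` with `subset` can be checked on a small made-up frame: rows with a gap in any listed column are removed, while gaps in "SPEEDING" are tolerated. This is only an illustrative sketch, and the `toy` data is invented.

```python
import numpy as np
import pandas as pd

# Made-up data with gaps in different columns
toy = pd.DataFrame({
    "WEATHER": ["Clear", np.nan, "Raining", "Overcast"],
    "ROADCOND": ["Dry", "Wet", np.nan, "Dry"],
    "SPEEDING": [np.nan, "Y", np.nan, np.nan],
})

# Percentage of missing values per column
null_pct = toy.isna().mean() * 100

# Drop rows with gaps in the listed columns only; "SPEEDING" gaps are kept
kept = toy.dropna(subset=["WEATHER", "ROADCOND"])
retained_pct = len(kept) / len(toy) * 100
print(null_pct)
print(retained_pct)
```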

#### 2.2 Severity code

<p style="text-align:justify"> Next, the data needs to be in a format that Machine Learning algorithms can process. For that, the categorical data needs to be transformed into an integer equivalent. In the "SEVERITYCODE" column there are 4 different severity levels (1, 2, 2b, 3). These levels are transformed to an equivalent integer scale (1, 2, 3, 4). </p>

In [249]:
df['SEVERITYCODE'].value_counts().to_frame()

| | SEVERITYCODE |
|---|---|
| 1 | 133272 |
| 2 | 57565 |
| 2b | 3033 |
| 3 | 338 |


In [250]:
# Replacing 3's with 4's (done first, so the new 3's are not overwritten)
df['SEVERITYCODE'] = df['SEVERITYCODE'].replace('3', '4')
# Replacing 2b's with 3's
df['SEVERITYCODE'] = df['SEVERITYCODE'].replace('2b', '3')
# Converting the string values to integers (the result must be assigned back)
df['SEVERITYCODE'] = df['SEVERITYCODE'].astype(int)
# New set of values
df['SEVERITYCODE'].value_counts().to_frame()

| | SEVERITYCODE |
|---|---|
| 1 | 133272 |
| 2 | 57565 |
| 3 | 3033 |
| 4 | 338 |
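
As an alternative sketch, not the approach used above, the same recoding can be written as a single explicit mapping with `Series.map`, which removes any dependence on the order of the replacements. The sample values below are made up.

```python
import pandas as pd

# Made-up sample of raw severity codes, stored as strings
severity = pd.Series(["1", "2", "2b", "3", "2b"])

# Explicit mapping from the original codes to the integer scale
severity_map = {"1": 1, "2": 2, "2b": 3, "3": 4}
recoded = severity.map(severity_map)

# Values missing from the mapping become NaN, so check before casting
if recoded.isna().any():
    raise ValueError("Found severity codes that are not in the mapping")

recoded = recoded.astype(int)
print(recoded.tolist())  # [1, 2, 3, 4, 3]
```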


#### 2.3 Influence of drugs or alcohol

Similarly, it is necessary to do the same for the "UNDERINFL" column. In this case, the value should be "0" for no influence of drugs or alcohol and "1" for influence. Instead, we have a mix of 0/1 and Y/N observations. We will replace all the character observations with their corresponding number.

In [251]:
df['UNDERINFL'].value_counts().to_frame()

| | UNDERINFL |
|---|---|
| N | 102947 |
| 0 | 81658 |
| Y | 5373 |
| 1 | 4230 |


In [252]:
# Replacing N's with 0's
df['UNDERINFL'] = df['UNDERINFL'].replace('N', '0')
# Replacing Y's with 1's
df['UNDERINFL'] = df['UNDERINFL'].replace('Y', '1')
# Converting the string values to integers (the result must be assigned back)
df['UNDERINFL'] = df['UNDERINFL'].astype(int)
# New set of values
df['UNDERINFL'].value_counts().to_frame()

| | UNDERINFL |
|---|---|
| 0 | 184605 |
| 1 | 9603 |


#### 2.4 Weather

In [253]:
df['WEATHER'].value_counts().to_frame()

| | WEATHER |
|---|---|
| Clear | 114157 |
| Raining | 33974 |
| Overcast | 28460 |
| Unknown | 15079 |
| Snowing | 913 |
| Other | 843 |
| Fog/Smog/Smoke | 576 |
| Sleet/Hail/Freezing Rain | 116 |
| Blowing Sand/Dirt | 55 |
| Severe Crosswind | 26 |


In [254]:
# Dictionary of reference for replacement in weather
cleanup_nums = {"WEATHER": {"Clear": 1, "Raining": 2, "Overcast": 3, "Snowing": 4, "Fog/Smog/Smoke": 5, 
                            "Sleet/Hail/Freezing Rain": 6, "Blowing Sand/Dirt": 7, "Severe Crosswind": 8, 
                            "Partly Cloudy": 9, "Other": 10, "Unknown": 10}}
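# Replacing all the elements with the numbers defined in the dictionary
df.replace(cleanup_nums, inplace = True)
# Converting the encoded weather values to integers
df['WEATHER'] = df['WEATHER'].astype(int)
# Making sure "UNDERINFL" is stored as integers as well
df['UNDERINFL'] = df['UNDERINFL'].astype(int)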

#### 2.5 Road condition

In the case of the "ROADCOND" column, its values are all categorical and need to be transformed into integer numbers. This is done with a dictionary that replaces all the values in the column.

In [255]:
df['ROADCOND'].value_counts().to_frame()

| | ROADCOND |
|---|---|
| Dry | 127863 |
| Wet | 48633 |
| Unknown | 15081 |
| Ice | 1228 |
| Snow/Slush | 1009 |
| Other | 135 |
| Standing Water | 119 |
| Sand/Mud/Dirt | 76 |
| Oil | 64 |


In [256]:
# Dictionary of reference for replacement in road condition
cleanup_nums = {"ROADCOND": {"Dry": 1, "Wet": 2, "Ice": 3, "Snow/Slush": 4, "Standing Water": 5, 
                            "Sand/Mud/Dirt": 6, "Oil": 7, "Other": 8, "Unknown": 8}}
# Replacing all the elements with the numbers defined in the dictionary
df.replace(cleanup_nums, inplace = True)
# Converting the string values to integers
df['ROADCOND'] = df['ROADCOND'].astype(int)
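
`astype(int)` raises an error if any label is missing from the dictionary, so it can help to check coverage first. The following is a minimal sketch with a hypothetical helper, `unmapped_labels`, applied to made-up data in which "Gravel" is deliberately left out of the mapping.

```python
import pandas as pd

def unmapped_labels(column: pd.Series, mapping: dict) -> set:
    """Return the labels present in the column but absent from the mapping."""
    return set(column.dropna().unique()) - set(mapping)

# Made-up road conditions; "Gravel" has no entry in the mapping
road = pd.Series(["Dry", "Wet", "Ice", "Gravel", "Dry"])
road_map = {"Dry": 1, "Wet": 2, "Ice": 3}

missing = unmapped_labels(road, road_map)
print(missing)  # {'Gravel'}
```

An empty set means every label has an integer code and the cast to `int` is safe.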

#### 2.6 Light condition

As with the previous column, the "LIGHTCOND" column needs to be transformed into integer numbers through a dictionary.

In [257]:
df['LIGHTCOND'].value_counts().to_frame()

| | LIGHTCOND |
|---|---|
| Daylight | 119017 |
| Dark - Street Lights On | 49969 |
| Unknown | 13502 |
| Dusk | 6061 |
| Dawn | 2596 |
| Dark - No Street Lights | 1571 |
| Dark - Street Lights Off | 1229 |
| Other | 243 |
| Dark - Unknown Lighting | 20 |


In [258]:
# Dictionary of reference for replacement in light condition
cleanup_nums = {"LIGHTCOND": {"Daylight": 1, "Dark - Street Lights On": 2, "Dusk": 3, "Dawn": 4, "Dark - No Street Lights": 5, 
                            "Dark - Street Lights Off": 6, "Dark - Unknown Lighting": 7, "Other": 8, "Unknown": 8}}
# Replacing all the elements with the numbers defined in the dictionary
df.replace(cleanup_nums, inplace = True)
# Converting the string values to integers
df['LIGHTCOND'] = df['LIGHTCOND'].astype(int)
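
Integer codes such as these imply an ordering (for instance, that "Dusk" = 3 lies between "Dark - Street Lights On" = 2 and "Dawn" = 4) that some models may treat as meaningful. If that turns out to matter for the chosen algorithm, one-hot encoding is a common alternative. The sketch below uses `pd.get_dummies` on made-up data and is not part of the original pipeline.

```python
import pandas as pd

# Made-up light conditions
light = pd.DataFrame({"LIGHTCOND": ["Daylight", "Dusk", "Daylight", "Dawn"]})

# One indicator column per category, stored as 0/1 integers
one_hot = pd.get_dummies(light, columns=["LIGHTCOND"], dtype=int)
print(one_hot)
```

Each row then has exactly one indicator set to 1, so no category is implied to be "larger" than another.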

#### 2.7 Speeding

For the last feature, we can see that it has only one value, with relatively few counts. Since it is a boolean variable, the missing values can be interpreted as the opposite case of the filled ones. For that reason, all the missing values are set to "0", and the string values "Y" are transformed into their equivalent "1".

In [259]:
df['SPEEDING'].value_counts().to_frame()

| | SPEEDING |
|---|---|
| Y | 9881 |


In [260]:
# Replacing all null values (anything other than 'Y') with 0
df.loc[df['SPEEDING'] != 'Y', 'SPEEDING'] = 0
# Replacing Y's with 1's
df['SPEEDING'] = df['SPEEDING'].replace('Y', '1')
# Converting the values to integers
df['SPEEDING'] = df['SPEEDING'].astype(int)

In [261]:
df['SPEEDING'].value_counts().to_frame()

| | SPEEDING |
|---|---|
| 0 | 184327 |
| 1 | 9881 |
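
The same rule (a missing value means "not speeding") can also be written as a single comparison. This is a minimal sketch on made-up values, offered as an equivalent alternative rather than the code used above.

```python
import numpy as np
import pandas as pd

# Made-up flags: 'Y' where speeding was recorded, missing otherwise
speeding = pd.Series(["Y", np.nan, np.nan, "Y", np.nan])

# True where the flag is 'Y'; missing values compare as False, hence 0
speeding_flag = speeding.eq("Y").astype(int)
print(speeding_flag.tolist())  # [1, 0, 0, 1, 0]
```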


Finally, apart from "INCDATE", every column is now properly defined as an integer for the remaining process:

In [263]:
df.head()

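| | INCDATE | UNDERINFL | WEATHER | ROADCOND | LIGHTCOND | SPEEDING | SEVERITYCODE |
|---|---|---|---|---|---|---|---|
| 0 | 2020/06/09 00:00:00+00 | 0 | 3 | 1 | 1 | 0 | 2 |
| 2 | 2020/02/12 00:00:00+00 | 0 | 1 | 1 | 1 | 0 | 1 |
| 3 | 2020/01/23 00:00:00+00 | 0 | 2 | 2 | 2 | 0 | 1 |
| 4 | 2019/11/26 00:00:00+00 | 0 | 1 | 1 | 1 | 0 | 2 |
| 5 | 2013/03/25 00:00:00+00 | 1 | 1 | 1 | 1 | 0 | 1 |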
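
The "INCDATE" column is still stored as text. If date-based features are wanted in later steps, it could be parsed as sketched below. Keeping only the first ten characters is an assumption made here to sidestep the "+00" suffix, and the derived column names `MONTH` and `DAYOFWEEK` are hypothetical.

```python
import pandas as pd

# Two date strings in the same format as the "INCDATE" column
dates = pd.DataFrame({"INCDATE": ["2020/06/09 00:00:00+00", "2019/11/26 00:00:00+00"]})

# Keep only the "YYYY/MM/DD" part and parse it with an explicit format
dates["INCDATE"] = pd.to_datetime(dates["INCDATE"].str[:10], format="%Y/%m/%d")

# Hypothetical derived features
dates["MONTH"] = dates["INCDATE"].dt.month
dates["DAYOFWEEK"] = dates["INCDATE"].dt.dayofweek  # Monday = 0
print(dates)
```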
