# Capstone Project - Traffic Accident Severity Prediction
Applied Data Science Capstone by IBM/Coursera

## Introduction <a name="introduction"></a>

Traffic accidents are a major threat to public health and safety. They cause loss of property and life, for individuals and for society as a whole. Research on traffic accident severity prediction identifies the key factors that contribute to a car accident. Successful prediction can improve public traffic safety and transportation efficiency through several measures, such as reinforcing aging infrastructure at critical spots to reduce accident risk, redistributing emergency resources for timely rescue, and alerting drivers to accident-prone conditions.

This project examines Seattle collision data from 2004 to 2020, compares several classification algorithms to select the best model for accident severity prediction, and uses clustering to identify dangerous situations for drivers so that appropriate prevention strategies and actions can be formulated.



## Data <a name="data"></a>

**Data source**<br>

Collisions - Seattle GeoData - ArcGIS Online: https://data-seattlecitygis.opendata.arcgis.com/datasets/5b5c745e0f1f48e7a53acec63a0022ab_0
<br>
Attribute Information: https://www.seattle.gov/Documents/Departments/SDOT/GIS/Collisions_OD.pdf
<br>
GeoService: https://gisdata.seattle.gov/server/rest/services/SDOT/SDOT_Collisions/MapServer/0/query?outFields=*&where=1%3D1
<br>
GeoJSON: https://opendata.arcgis.com/datasets/5b5c745e0f1f48e7a53acec63a0022ab_0.geojson

The source includes all types of collisions in the city of Seattle from 2004/01/01 to 2020/05/20. There are 35 attributes and 212453 records in total. The target variable is the severity of the collision:
* 3—fatality (325)
* 2b—serious injury (2950)
* 2—injury (55964)
* 1—prop damage (131672)
* 0—unknown (20668)

There are 194673 records in 4 categories once the records with missing severity information (0) are excluded. To simplify the problem, they were divided into 2 categories: **1-prop damage** (1) and **2-injury** (2, 2b, 3).
<br>
The geographic information (latitude and longitude of each collision) is integrated into the master source. An example of the data is available here: https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv
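
The two-class grouping described above can be sketched as a simple lookup. This is a minimal illustration, not the notebook's own code: it assumes the raw severity codes are stored as strings, and the sample values and the `TWO_CLASS` name are hypothetical.

```python
import pandas as pd

# Hypothetical sample of raw severity codes (assumed to be strings).
raw = pd.Series(["3", "2b", "2", "1", "0", "1"], name="SEVERITYCODE")

# Codes 2, 2b and 3 are merged into class 2 (injury); code 1 stays class 1.
# Code "0" (unknown) is deliberately left out of the mapping.
TWO_CLASS = {"1": 1, "2": 2, "2b": 2, "3": 2}

# Unmapped codes become NaN and are dropped before casting to int.
binary_severity = raw.map(TWO_CLASS).dropna().astype(int)

print(binary_severity.tolist())
```

Leaving the unknown code out of the mapping drops those rows in the same step, which matches the exclusion of category 0 described in the text.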

**Data selection**

184920 of the 194673 valid records are selected by dropping "Unmatched" values in the "STATUS" column and "Not Enough Information, or Insufficient Location Information" values in the "EXCEPTRSNDESC" column. In addition, the 3 records with missing values in the "ADDRTYPE" column are deleted. As a result, around 5% of the records cannot be used for training and are removed from the dataset.
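
The selection rules above could be expressed as a boolean mask. The sketch below runs on a small hypothetical frame (`toy`) rather than the real data, and the variable names are illustrative only; the column names and the excluded values are those quoted in the text.

```python
import pandas as pd

EXCLUDED_REASON = "Not Enough Information, or Insufficient Location Information"

# Hypothetical toy rows covering each exclusion rule once.
toy = pd.DataFrame({
    "STATUS": ["Matched", "Unmatched", "Matched", "Matched"],
    "EXCEPTRSNDESC": [None, None, EXCLUDED_REASON, None],
    "ADDRTYPE": ["Block", "Block", "Intersection", None],
})

keep = (
    (toy["STATUS"] != "Unmatched")                      # drop unmatched reports
    & ~toy["EXCEPTRSNDESC"].isin([EXCLUDED_REASON])     # drop insufficient-information rows
    & toy["ADDRTYPE"].notna()                           # drop rows with no address type
)
selected = toy[keep].reset_index(drop=True)

print(len(selected), "of", len(toy), "rows kept")
```

Using `isin` for the exception reason keeps rows whose `EXCEPTRSNDESC` is missing, since a missing value is never a member of the excluded set.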

**Feature selection**

Of the 37 features, 22 may contribute to accident severity in some way; they have to be transformed into an appropriate data format for further processing and exploration. The table below summarizes the treatment of the different features.

| Format | Selected Features |
| --- | --- |
| Binary| INATTENTIONIND, UNDERINFL, PEDROWNOTGRNT, SPEEDING, SEGLANEKEY, CROSSWALKKEY, HITPARKEDCAR| 
| Float| X, Y| 
| Date| INCDATE| 
| Time| INCDTTM| 
| Encode Categorical| ADDRTYPE, COLLISIONTYPE, JUNCTIONTYPE, WEATHER, ROADCOND, LIGHTCOND, ST_COLCODE| 
| Int| PERSONCOUNT, PEDCOUNT, PEDCYLCOUNT, VEHCOUNT| 
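
For later processing it can help to keep the table above as a plain mapping from format to column names. This is only a sketch: the `FEATURE_GROUPS` name and the group keys are illustrative choices, while the column names are copied from the table.

```python
# Feature groups copied from the table; the dict name and keys are illustrative.
FEATURE_GROUPS = {
    "binary": ["INATTENTIONIND", "UNDERINFL", "PEDROWNOTGRNT", "SPEEDING",
               "SEGLANEKEY", "CROSSWALKKEY", "HITPARKEDCAR"],
    "float": ["X", "Y"],
    "date": ["INCDATE"],
    "time": ["INCDTTM"],
    "categorical": ["ADDRTYPE", "COLLISIONTYPE", "JUNCTIONTYPE", "WEATHER",
                    "ROADCOND", "LIGHTCOND", "ST_COLCODE"],
    "int": ["PERSONCOUNT", "PEDCOUNT", "PEDCYLCOUNT", "VEHCOUNT"],
}

# Flatten the groups into a single list of selected columns.
selected_features = [col for cols in FEATURE_GROUPS.values() for col in cols]

print(len(selected_features))
```

The flattened list contains the 22 selected features and can be used to subset the DataFrame once it is loaded.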


In [2]:
#import necessary library
import pandas as pd
import numpy as np
import datetime

In [None]:
#download the master data source as csv
!wget -O collision_train.csv https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv

In [46]:
#read the data source
df = pd.read_csv('collision_train.csv')
df.head()

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


**Data cleaning**

- format the factors related to date and time, to understand whether weekday/weekend, season, or morning/evening affects the accident rate

In [47]:
#convert to date time object
df['INCDATE']=pd.to_datetime(df['INCDATE'])
df['INCDTTM']=pd.to_datetime(df['INCDTTM'])

In [54]:
#add the column year, month, day of the week
df['INC_year']=df['INCDATE'].dt.year
df['INC_month']=df['INCDATE'].dt.month
df['INC_day_of_week']=df['INCDATE'].dt.dayofweek

#the distribution of values in each time/date related feature
print(df['INC_day_of_week'].value_counts(dropna=False),'\n-------------------------------------\n',\
      df['INC_year'].value_counts(dropna=False),'\n-------------------------------------\n',\
      df['INC_month'].value_counts(dropna=False))

4    32333
3    29324
2    28778
1    28556
5    27389
0    26338
6    21955
Name: INC_day_of_week, dtype: int64 
-------------------------------------
 2006    15188
2005    15115
2007    14456
2008    13660
2015    12995
2004    11865
2014    11841
2009    11734
2016    11659
2011    10919
2012    10907
2017    10873
2010    10808
2013    10577
2018    10419
2019     9412
2020     2245
Name: INC_year, dtype: int64 
-------------------------------------
 10    17768
5     16763
11    16582
6     16566
1     16407
7     16364
8     16296
3     16150
4     15978
9     15864
12    15545
2     14390
Name: INC_month, dtype: int64
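
The weekday/weekend and season questions raised above can be answered from the same date column. The following is a minimal sketch on hypothetical dates; the season boundaries (meteorological, Northern Hemisphere) and the names `is_weekend` and `season` are assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical incident dates: a Saturday, a Wednesday, a Sunday and a Monday.
dates = pd.Series(pd.to_datetime(["2020-01-04", "2020-04-15", "2020-07-19", "2020-10-05"]))

# pandas numbers the days Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
is_weekend = (dates.dt.dayofweek >= 5).astype(int)

# Assumed season definition: meteorological seasons for the Northern Hemisphere.
MONTH_TO_SEASON = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}
season = dates.dt.month.map(MONTH_TO_SEASON)

print(is_weekend.tolist())
print(season.tolist())
```

With this numbering, the value 4 that leads the `INC_day_of_week` counts above corresponds to Friday.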


In [63]:
#get the hour of the accident; timestamps at exactly 00:00 (no time recorded) are replaced with NaT
df['time']=df['INCDTTM'].dt.time
df['time']=df['time'].apply(lambda x: pd.NaT if (x==datetime.time(0,0))  else x.hour)
df['time'].value_counts(dropna=False)

NaN     30526
17.0    12947
16.0    12122
15.0    11514
14.0    10615
12.0    10384
13.0    10219
18.0     9743
8.0      8570
11.0     8209
9.0      8052
10.0     7465
19.0     7256
7.0      6543
20.0     6236
21.0     5571
22.0     5468
23.0     4611
0.0      3855
2.0      3606
1.0      3408
6.0      3199
5.0      1667
3.0      1665
4.0      1222
Name: time, dtype: int64
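
The same hour extraction can be written without a row-wise `apply`. This is a sketch on hypothetical timestamps, and like the cell above it assumes that a timestamp at exactly 00:00:00 means no time was recorded, so genuine midnight accidents are also treated as missing.

```python
import pandas as pd

# Hypothetical timestamps; a date-only value parses as midnight.
stamps = pd.Series(pd.to_datetime([
    "2013-03-27 14:54:00",
    "2006-12-20 00:00:00",
    "2004-11-18 00:30:00",
]))

# True where the time component is exactly 00:00:00.
is_midnight = stamps == stamps.dt.normalize()

# Keep the hour where a time was recorded, NaN elsewhere.
hour = stamps.dt.hour.where(~is_midnight)

print(hour.tolist())
```

An accident at 00:30 still yields hour 0, which is why the value 0.0 appears in the counts above alongside the missing entries.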

- convert these factors to binary

In [5]:
df[['INATTENTIONIND', 'UNDERINFL', 'PEDROWNOTGRNT', 'SPEEDING', 'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR']]

Unnamed: 0,INATTENTIONIND,UNDERINFL,PEDROWNOTGRNT,SPEEDING,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,,N,,,0,0,N
1,,0,,,0,0,N
2,,0,,,0,0,N
3,,N,,,0,0,N
4,,0,,,0,0,N
...,...,...,...,...,...,...,...
194668,,N,,,0,0,N
194669,Y,N,,,0,0,N
194670,,N,,,0,0,N
194671,,N,,,4308,0,N


In [43]:
#value distribution before conversion to binary
binary_list=['INATTENTIONIND', 'UNDERINFL', 'PEDROWNOTGRNT', 'SPEEDING', 'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR']
for b_l in binary_list:
    print(df[b_l].value_counts(dropna=False),'\n-------------------------------------')

NaN    164868
Y       29805
Name: INATTENTIONIND, dtype: int64 
-------------------------------------
N      100274
0       80394
Y        5126
NaN      4884
1        3995
Name: UNDERINFL, dtype: int64 
-------------------------------------
NaN    190006
Y        4667
Name: PEDROWNOTGRNT, dtype: int64 
-------------------------------------
NaN    185340
Y        9333
Name: SPEEDING, dtype: int64 
-------------------------------------
0         191907
6532          19
6078          16
12162         15
10336         14
           ...  
35157          1
10817          1
15043          1
525169         1
16376          1
Name: SEGLANEKEY, Length: 1955, dtype: int64 
-------------------------------------
0         190862
523609        17
520838        15
525567        13
521707        10
           ...  
521019         1
630862         1
25545          1
523322         1
27186          1
Name: CROSSWALKKEY, Length: 2198, dtype: int64 
-------------------------------------
N    187457
Y     

In [45]:
#convert to binary (missing, 0, '0' and 'N' become 0; everything else becomes 1) and show the result
for b_l in binary_list:
    df[b_l] = df[b_l].fillna(0)
    df[b_l]=df[b_l].apply(lambda x: 0 if ( x==0 or x=='0' or x=='N')  else 1)
    print(df[b_l].value_counts(dropna=False),'\n-------------------------------------')

0    164868
1     29805
Name: INATTENTIONIND, dtype: int64 
-------------------------------------
0    185552
1      9121
Name: UNDERINFL, dtype: int64 
-------------------------------------
0    190006
1      4667
Name: PEDROWNOTGRNT, dtype: int64 
-------------------------------------
0    185340
1      9333
Name: SPEEDING, dtype: int64 
-------------------------------------
0    191907
1      2766
Name: SEGLANEKEY, dtype: int64 
-------------------------------------
0    190862
1      3811
Name: CROSSWALKKEY, dtype: int64 
-------------------------------------
0    187457
1      7216
Name: HITPARKEDCAR, dtype: int64 
-------------------------------------
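
The conversion rule used above (missing, `0`, `'0'` and `'N'` map to 0, anything else to 1) can also be written with vectorized operations. This is a sketch on a hypothetical frame (`toy`), not a replacement for the cell above; the `NEGATIVE` name is illustrative.

```python
import pandas as pd

# Hypothetical mixed-type columns resembling the raw data.
toy = pd.DataFrame({
    "UNDERINFL": pd.Series(["N", "0", "Y", "1", None], dtype=object),
    "SEGLANEKEY": [0, 6532, 0, 12162, 0],
})

# Values treated as "no"; missing values are handled separately with isna().
NEGATIVE = [0, "0", "N"]

for col in ["UNDERINFL", "SEGLANEKEY"]:
    is_negative = toy[col].isna() | toy[col].isin(NEGATIVE)
    toy[col] = (~is_negative).astype(int)

print(toy)
```

For key columns such as `SEGLANEKEY`, this turns any non-zero key into 1, so the flag records only whether a key was present, not which one.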


- encode the categorical independent variables

In [None]:
#before operation into binary type