# Capstone Project - Traffic Accident Severity Prediction
Applied Data Science Capstone by IBM/Coursera

## Introduction <a name="introduction"></a>

Traffic accidents are a major threat to public health and safety. They cause loss of property and life, for individuals and for society as a whole. Research on traffic accident severity prediction aims to identify the key factors that contribute to a car accident. Successful prediction can improve public traffic safety and transportation efficiency through several measures, such as reinforcing aging infrastructure at critical spots to reduce accident risk, redistributing emergency resources for timely rescue, and alerting drivers to pay more attention in accident-prone conditions.

This project examines Seattle collision data from 2004 to 2020, compares different classification algorithms to select the best model for accident severity prediction, and uses clustering to identify dangerous situations for drivers so that appropriate prevention strategies and actions can be formulated.



## Data <a name="data"></a>

**Data source**<br>

Collisions - Seattle GeoData - ArcGIS Online: https://data-seattlecitygis.opendata.arcgis.com/datasets/5b5c745e0f1f48e7a53acec63a0022ab_0
<br>
Attribute Information: https://www.seattle.gov/Documents/Departments/SDOT/GIS/Collisions_OD.pdf
<br>
GeoJService: https://gisdata.seattle.gov/server/rest/services/SDOT/SDOT_Collisions/MapServer/0/query?outFields=*&where=1%3D1
<br>
GeoJSON: https://opendata.arcgis.com/datasets/5b5c745e0f1f48e7a53acec63a0022ab_0.geojson

The source includes all types of collision data in the city of Seattle from 2004/01/01 to 2020/05/20. There are 35 attributes and a total of 212453 records. The target variable is the severity of the collision:
* 3—fatality (325)
* 2b—serious injury (2950)
* 2—injury (55964)
* 1—prop damage (131672)
* 0—unknown (20668)

There are 194673 records in 4 categories after excluding those with missing severity information (0). To simplify the problem, they were divided into 2 categories: **1-prop damage** (1) and **2-injury** (2, 2b, 3).
<br>
The geographic information (latitude and longitude of each collision) is integrated into the master source. An example of the data is available here: https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv
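The two-class grouping described above can be sketched as follows. This is a minimal illustration on toy data, assuming the raw severity codes are stored as strings; the mapping dictionary and the `SEVERITY_CLASS` column name are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the raw severity codes; storing them as strings is an assumption.
raw = pd.DataFrame({"SEVERITYCODE": ["3", "2b", "2", "1", "0", "1"]})

# Hypothetical mapping: property damage -> 1, any injury or fatality -> 2.
severity_to_class = {"1": 1, "2": 2, "2b": 2, "3": 2}

# Drop records with unknown severity ("0"), then collapse the remaining codes.
known = raw[raw["SEVERITYCODE"] != "0"].copy()
known["SEVERITY_CLASS"] = known["SEVERITYCODE"].map(severity_to_class)

class_counts = known["SEVERITY_CLASS"].value_counts().to_dict()
print(class_counts)
```

Keeping the mapping in one dictionary makes it easy to revisit the grouping later, for example to separate serious injuries from minor ones.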

**Data selection**

184920 out of 194673 valid records are selected by ignoring "Unmatched" values in the "STATUS" column and "Not Enough Information, or Insufficient Location Information" values in the "EXCEPTRSNDESC" column. In addition, the 3 missing values in the "ADDRTYPE" column are also deleted. As a result, around 5% of the records cannot be used for training and are removed from the dataset.
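As a rough sketch of this selection step, the two filters can be combined into a single boolean mask. The toy rows below are invented for illustration; only the column names and the filter values come from the text.

```python
import pandas as pd

# Toy records; only the column names and filter values follow the text above.
toy = pd.DataFrame({
    "STATUS": ["Matched", "Unmatched", "Matched", "Matched"],
    "EXCEPTRSNDESC": [
        None,
        None,
        "Not Enough Information, or Insufficient Location Information",
        None,
    ],
    "ADDRTYPE": ["Block", "Block", "Alley", "Intersection"],
})

insufficient = "Not Enough Information, or Insufficient Location Information"

# Keep matched records whose exception description is not the "insufficient" label.
keep = (toy["STATUS"] == "Matched") & (toy["EXCEPTRSNDESC"] != insufficient)
selected = toy[keep]

removed_share = 1 - len(selected) / len(toy)
print(f"removed {removed_share:.1%} of {len(toy)} toy records")
```

With the figures quoted above, the removed share is (194673 - 184920) / 194673, which is about 5.0%.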

**Feature selection**

Of the 37 features, 22 may contribute to accident severity in some way. They have to be transformed into an appropriate data format for further processing and exploration. The table below summarizes the treatment of the different features:

| Format | Selected Features |
| --- | --- |
| Binary| INATTENTIONIND, UNDERINFL, PEDROWNOTGRNT, SPEEDING, SEGLANEKEY, CROSSWALKKEY, HITPARKEDCAR| 
| Float| X, Y| 
| Date| INCDATE| 
| Time| INCDTTM| 
| Encode Categorical| ADDRTYPE, COLLISIONTYPE, JUNCTIONTYPE, WEATHER, ROADCOND, LIGHTCOND, ST_COLCODE| 
| Int| PERSONCOUNT, PEDCOUNT, PEDCYLCOUNT, VEHCOUNT| 


In [1]:
#import necessary library
import pandas as pd
import numpy as np
import datetime

In [None]:
#download the master data source as csv
!wget -O collision_train.csv https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv

In [2]:
#read the data source
df = pd.read_csv('collision_train.csv')

#filter out the invalid data
df = df[(df['EXCEPTRSNDESC']!='Not Enough Information, or Insufficient Location Information')]
df = df[(df['STATUS']=='Matched')]
df.columns,df.shape


(Index(['SEVERITYCODE', 'X', 'Y', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO',
        'STATUS', 'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE',
        'EXCEPTRSNDESC', 'SEVERITYCODE.1', 'SEVERITYDESC', 'COLLISIONTYPE',
        'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'INCDATE',
        'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE', 'SDOT_COLDESC',
        'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND',
        'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE', 'ST_COLDESC',
        'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR'],
       dtype='object'),
 (184920, 38))

**Data cleaning**

- format the factors related to date and time, to understand whether weekday/weekend, season or morning/evening affects the accident rate

In [69]:
#convert to date time object
df['INCDATE']=pd.to_datetime(df['INCDATE'])
df['INCDTTM']=pd.to_datetime(df['INCDTTM'])

In [70]:
#add the column year, month, day of the week
df['INC_year']=df['INCDATE'].dt.year
df['INC_month']=df['INCDATE'].dt.month
df['INC_day_of_week']=df['INCDATE'].dt.dayofweek

#the distribution of values in each time/date related feature
print(df['INC_day_of_week'].value_counts(dropna=False),'\n-------------------------------------\n',\
      df['INC_year'].value_counts(dropna=False),'\n-------------------------------------\n',\
      df['INC_month'].value_counts(dropna=False))

4    30660
3    27809
2    27246
1    27049
5    26247
0    24909
6    21000
Name: INC_day_of_week, dtype: int64 
-------------------------------------
 2006    15122
2005    15043
2007    14345
2008    13482
2004    11830
2009    11546
2015    11040
2014    10902
2011    10771
2010    10658
2016    10374
2017    10236
2012    10168
2013     9766
2018     9660
2019     8729
2020     1248
Name: INC_year, dtype: int64 
-------------------------------------
 10    16945
6     15889
5     15837
11    15744
7     15629
1     15496
8     15493
3     15429
9     15266
4     15198
12    14704
2     13290
Name: INC_month, dtype: int64


In [71]:
#get the hour of the accident; a time of exactly 00:00 is treated as missing and replaced with NaT
df['time']=df['INCDTTM'].dt.time
df['time']=df['time'].apply(lambda x: pd.NaT if (x==datetime.time(0,0))  else x.hour)
df['time'].value_counts(dropna=False)

NaN     25373
17.0    12580
16.0    11817
15.0    11212
14.0    10353
12.0     9979
13.0     9978
18.0     9431
8.0      8308
11.0     7994
9.0      7823
10.0     7202
19.0     7055
7.0      6342
20.0     6045
21.0     5431
22.0     5323
23.0     4488
0.0      3775
2.0      3531
1.0      3339
6.0      3098
5.0      1625
3.0      1617
4.0      1201
Name: time, dtype: int64
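The weekday/weekend, season and morning/evening questions raised above could be explored with derived columns along these lines. This is only a sketch on toy timestamps: the column names `is_weekend`, `season` and `part_of_day` and the bin edges are assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy timestamps standing in for INCDTTM; the derived column names are hypothetical.
toy = pd.DataFrame({"INCDTTM": pd.to_datetime([
    "2019-01-05 08:30:00",  # Saturday morning in winter
    "2019-07-10 18:15:00",  # Wednesday evening in summer
    "2019-10-13 23:05:00",  # late Sunday evening in autumn
])})

# dayofweek counts from Monday=0, so 5 and 6 are Saturday and Sunday.
toy["is_weekend"] = (toy["INCDTTM"].dt.dayofweek >= 5).astype(int)

# Meteorological seasons: Dec-Feb, Mar-May, Jun-Aug, Sep-Nov.
season_names = {0: "winter", 1: "spring", 2: "summer", 3: "autumn"}
toy["season"] = ((toy["INCDTTM"].dt.month % 12) // 3).map(season_names)

# Coarse part-of-day buckets; the bin edges are an arbitrary choice.
toy["part_of_day"] = pd.cut(
    toy["INCDTTM"].dt.hour,
    bins=[0, 6, 12, 18, 24],
    labels=["night", "morning", "afternoon", "evening"],
    right=False,  # intervals are [0, 6), [6, 12), [12, 18), [18, 24)
)
print(toy)
```

Bucketing the hour this way would also need a rule for the records whose time is missing, which this sketch does not cover.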

- convert these indicator factors into binary (0/1) values

In [5]:
df[['INATTENTIONIND', 'UNDERINFL', 'PEDROWNOTGRNT', 'SPEEDING', 'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR']]

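| | INATTENTIONIND | UNDERINFL | PEDROWNOTGRNT | SPEEDING | SEGLANEKEY | CROSSWALKKEY | HITPARKEDCAR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | NaN | N | NaN | NaN | 0 | 0 | N |
| 1 | NaN | 0 | NaN | NaN | 0 | 0 | N |
| 2 | NaN | 0 | NaN | NaN | 0 | 0 | N |
| 3 | NaN | N | NaN | NaN | 0 | 0 | N |
| 4 | NaN | 0 | NaN | NaN | 0 | 0 | N |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 194668 | NaN | N | NaN | NaN | 0 | 0 | N |
| 194669 | Y | N | NaN | NaN | 0 | 0 | N |
| 194670 | NaN | N | NaN | NaN | 0 | 0 | N |
| 194671 | NaN | N | NaN | NaN | 4308 | 0 | N |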


In [3]:
#value distribution before conversion to binary type
binary_list=['INATTENTIONIND', 'UNDERINFL', 'PEDROWNOTGRNT', 'SPEEDING', 'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR']
for b_l in binary_list:
    print(df[b_l].value_counts(dropna=False),'\n-------------------------------------')

NaN    155675
Y       29245
Name: INATTENTIONIND, dtype: int64 
-------------------------------------
N    96132
0    79724
Y     5072
1     3992
Name: UNDERINFL, dtype: int64 
-------------------------------------
NaN    180294
Y        4626
Name: PEDROWNOTGRNT, dtype: int64 
-------------------------------------
NaN    175681
Y        9239
Name: SPEEDING, dtype: int64 
-------------------------------------
0        182230
6532         19
6078         16
12162        15
10336        14
          ...  
24899         1
20933         1
8651          1
13001         1
23093         1
Name: SEGLANEKEY, Length: 1906, dtype: int64 
-------------------------------------
0         181248
523609        17
520838        15
525567        13
521707        10
           ...  
523322         1
521275         1
525381         1
29899          1
524097         1
Name: CROSSWALKKEY, Length: 2144, dtype: int64 
-------------------------------------
N    179225
Y      5695
Name: HITPARKEDCAR, dtype: int64

In [6]:
#convert to binary type: missing, 0, '0' and 'N' become 0, anything else becomes 1
for b_l in binary_list:
    df[b_l] = df[b_l].fillna(0)
    df[b_l]=df[b_l].apply(lambda x: 0 if ( x==0 or x=='0' or x=='N')  else 1)
    print(df[b_l].value_counts(dropna=False),'\n-------------------------------------')

0    155675
1     29245
Name: INATTENTIONIND, dtype: int64 
-------------------------------------
0    175856
1      9064
Name: UNDERINFL, dtype: int64 
-------------------------------------
0    180294
1      4626
Name: PEDROWNOTGRNT, dtype: int64 
-------------------------------------
0    175681
1      9239
Name: SPEEDING, dtype: int64 
-------------------------------------
0    182230
1      2690
Name: SEGLANEKEY, dtype: int64 
-------------------------------------
0    181248
1      3672
Name: CROSSWALKKEY, dtype: int64 
-------------------------------------
0    179225
1      5695
Name: HITPARKEDCAR, dtype: int64 
-------------------------------------
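The same conversion rule can also be written without a row-wise `apply`. The sketch below is one possible vectorized variant, shown on a toy column; the helper name `to_binary_flag` is hypothetical.

```python
import numpy as np
import pandas as pd


def to_binary_flag(column: pd.Series) -> pd.Series:
    """Return 0 for missing, 0, '0' or 'N' entries and 1 for anything else."""
    is_negative = column.isna() | column.isin([0, "0", "N"])
    return (~is_negative).astype(int)


# Toy column mixing the encodings seen in the value counts above.
toy = pd.Series([np.nan, "Y", "N", "0", "1", 0, 4308], dtype=object)
flags = to_binary_flag(toy)
print(flags.tolist())
```

Note that for SEGLANEKEY and CROSSWALKKEY this rule turns any non-zero key into 1, so the resulting flag only records whether a lane segment or crosswalk key was present.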


- encode the categorical independent variables

In [15]:
categorical_feature=['ADDRTYPE','JUNCTIONTYPE', 'COLLISIONTYPE', 'ST_COLDESC', 'WEATHER', 'ROADCOND', 'LIGHTCOND']
for c_f in categorical_feature:
    print(df[c_f].value_counts(dropna=False),\
          '\n','Nan_value_count_%: ',df[c_f].isnull().sum(axis = 0),'/',"{:0.4%}".format(df[c_f].isnull().sum(axis = 0)/len(df[c_f])),\
          '\n-------------------------------------')


Block           121089
Intersection     63083
Alley              745
NaN                  3
Name: ADDRTYPE, dtype: int64 
 Nan_value_count_%:  3 / 0.0016% 
-------------------------------------
Mid-Block (not related to intersection)              85537
At Intersection (intersection related)               60871
Mid-Block (but intersection related)                 22354
Driveway Junction                                    10527
NaN                                                   3440
At Intersection (but not related to intersection)     2027
Ramp Junction                                          159
Unknown                                                  5
Name: JUNCTIONTYPE, dtype: int64 
 Nan_value_count_%:  3440 / 1.8603% 
-------------------------------------
Parked Car    44766
Angles        34521
Rear Ended    33611
Other         23350
Sideswipe     18278
Left Turn     13635
Pedestrian     6481
Cycles         5332
Right Turn     2921
Head On        2007
NaN              18
Name:

ADDRTYPE vs JUNCTIONTYPE: <br>
These are similar features. Cross-comparing the grouped result suggests that **JUNCTIONTYPE** better summarises the location type of a collision, so ADDRTYPE can be dropped.

In [17]:
#Grouped and Count
df_junctiontype_vs_addtype=df[['JUNCTIONTYPE','ADDRTYPE', 'COLLISIONTYPE']]\
.replace({'Unknown':'Other'})\
.fillna('NaN')\
.rename(columns={'COLLISIONTYPE': 'Count'}) #change column name to 'Count'

print(
    df_junctiontype_vs_addtype.groupby(by=['JUNCTIONTYPE','ADDRTYPE']).count(),\
    '\n----------------------------------------------------------------------------------------------\n',\
    df_junctiontype_vs_addtype.groupby(['ADDRTYPE','JUNCTIONTYPE']).count()
)

                                                                Count
JUNCTIONTYPE                                      ADDRTYPE           
At Intersection (but not related to intersection) Alley             1
                                                  Block             1
                                                  Intersection   2025
At Intersection (intersection related)            Block             4
                                                  Intersection  60867
Driveway Junction                                 Alley            58
                                                  Block         10469
Mid-Block (but intersection related)              Block         22353
                                                  Intersection      1
Mid-Block (not related to intersection)           Alley           175
                                                  Block         85350
                                                  Intersection     11
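A contingency table is another way to compare two overlapping categorical features like these. The sketch below uses `pd.crosstab` on invented toy rows; on the real data the two arguments would be the corresponding `df` columns.

```python
import pandas as pd

# Toy rows; the real notebook would pass df["JUNCTIONTYPE"] and df["ADDRTYPE"].
toy = pd.DataFrame({
    "JUNCTIONTYPE": [
        "Driveway Junction",
        "Driveway Junction",
        "At Intersection (intersection related)",
        None,
    ],
    "ADDRTYPE": ["Block", "Alley", "Intersection", "Block"],
})

# Fill missing values with a visible label so they get their own row.
table = pd.crosstab(toy["JUNCTIONTYPE"].fillna("NaN"), toy["ADDRTYPE"])
print(table)
```

Each cell counts the records with that pair of values, so rows dominated by a single column indicate that one feature largely determines the other.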
                    

COLLISIONTYPE vs ST_COLDESC: <br>
These are similar features. Cross-comparing the grouped result shows that ST_COLDESC can be considered a sub-division of COLLISIONTYPE. The grouping yields 63 distinct COLLISIONTYPE/ST_COLDESC combinations, which could be overwhelming as a feature, so we keep **COLLISIONTYPE**.

In [18]:
#Grouped and Count
df_coltype_vs_coldesc=df[['COLLISIONTYPE','ST_COLDESC', 'JUNCTIONTYPE']]\
.replace({'Unknown':'Other'})\
.fillna('NaN')\
.rename(columns={'JUNCTIONTYPE': 'Count'}) #change column name to 'Count'
    
print(
    df_coltype_vs_coldesc.groupby(['COLLISIONTYPE','ST_COLDESC'], sort=True).count(),\
    '\n----------------------------------------------------------------------------------------------',\
    df_coltype_vs_coldesc.groupby(['ST_COLDESC','COLLISIONTYPE']).count()
)

                                                                  Count
COLLISIONTYPE ST_COLDESC                                               
Angles        Entering at angle                                   34521
Cycles        Pedalcyclist All Other Involvements ONE UNIT - ...     23
              Pedalcyclist Strikes Moving Vehicle                   257
              Pedalcyclist Strikes Pedalcyclist or Pedestrian        18
              Vehicle - Pedalcyclist                               4621
...                                                                 ...
Sideswipe     From same direction - both going straight - one...   2386
              Same direction -- both turning left -- both mov...    826
              Same direction -- both turning left -- one stop...     35
              Same direction -- both turning right -- both mo...   1172
              Same direction -- both turning right -- one sto...     73

[63 rows x 1 columns] 
----------------------------------------
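Whether one feature is a strict sub-division of another can also be checked programmatically. The following sketch, on a few toy rows using labels from the output above, counts how many COLLISIONTYPE values each ST_COLDESC maps to; the variable names are hypothetical.

```python
import pandas as pd

# Toy rows built from labels that appear in the grouped output above.
toy = pd.DataFrame({
    "COLLISIONTYPE": ["Angles", "Cycles", "Cycles", "Cycles"],
    "ST_COLDESC": [
        "Entering at angle",
        "Vehicle - Pedalcyclist",
        "Pedalcyclist Strikes Moving Vehicle",
        "Pedalcyclist Strikes Pedalcyclist or Pedestrian",
    ],
})

# For a true sub-division, every description belongs to exactly one collision type.
parents_per_desc = toy.groupby("ST_COLDESC")["COLLISIONTYPE"].nunique()
is_subdivision = bool((parents_per_desc <= 1).all())

n_types = toy["COLLISIONTYPE"].nunique()
n_descs = toy["ST_COLDESC"].nunique()
print(is_subdivision, n_types, n_descs)
```

If the check fails on the full data, the descriptions that map to several collision types are the ones worth inspecting before dropping ST_COLDESC.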

Dropping the roughly 2% of rows with missing values in the selected categorical features should have little influence on the result.

In [24]:
selected_categorical_feature=['JUNCTIONTYPE', 'COLLISIONTYPE', 'WEATHER', 'ROADCOND', 'LIGHTCOND']
print('dropped_rows_%: ',"{:0.4%}".format(
    (df.shape[0]-df[selected_categorical_feature].dropna().shape[0])/df.shape[0])
     )

dropped_rows_%:  2.0841%
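The removal itself is not shown at this point; one way it could be done is with `dropna(subset=...)`, sketched below on toy rows. The WEATHER and ROADCOND labels are placeholders chosen for illustration, and the variable names `complete` and `dropped_share` are hypothetical.

```python
import numpy as np
import pandas as pd

selected = ["JUNCTIONTYPE", "COLLISIONTYPE", "WEATHER", "ROADCOND", "LIGHTCOND"]

# Toy rows; the WEATHER and ROADCOND labels are placeholders for illustration.
toy = pd.DataFrame({
    "JUNCTIONTYPE": ["Driveway Junction", np.nan, "Ramp Junction", "Driveway Junction"],
    "COLLISIONTYPE": ["Angles", "Cycles", "Other", "Head On"],
    "WEATHER": ["Clear", "Raining", "Clear", np.nan],
    "ROADCOND": ["Dry", "Wet", "Dry", "Dry"],
    "LIGHTCOND": ["Daylight", "Dusk", "Dawn", "Daylight"],
    "PERSONCOUNT": [2, 1, 3, 2],
})

# Drop a row only if one of the selected categorical features is missing.
complete = toy.dropna(subset=selected)
dropped_share = 1 - len(complete) / len(toy)
print(f"dropped {dropped_share:.1%} of the toy rows")
```

Restricting `dropna` to the selected columns avoids discarding rows because of gaps in features that are not used by the model.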


In [25]:
#transform JUNCTIONTYPE
dummy_JUNCTIONTYPE = pd.get_dummies(df["JUNCTIONTYPE"]\
                                    .drop(df[df["JUNCTIONTYPE"]=='Unknown'].index)\
                                    .replace({'Mid-Block (not related to intersection)':'JUNCTION_Mid_Block_Not_Intersection_related',
                                              'At Intersection (intersection related)':'JUNCTION_At_Intersection_Intersection_related',
                                              'Mid-Block (but intersection related)':'JUNCTION_Mid_Block_Intersection_related',
                                              'At Intersection (but not related to intersection)':'JUNCTION_At_Intersection_Not_Intersection_related',
                                              'Driveway Junction':'JUNCTION_Driveway',
                                              'Ramp Junction':'JUNCTION_Ramp'}))

#create new column under multiple condition
#dummy_JUNCTIONTYPE['At Intersection Junction']=np.where(
#    dummy_JUNCTIONTYPE['At Intersection (but not related to intersection)']+dummy_JUNCTIONTYPE['At Intersection (intersection related)']>0,1, 0)



dummy_JUNCTIONTYPE.head()

| | JUNCTION_At_Intersection_Intersection_related | JUNCTION_At_Intersection_Not_Intersection_related | JUNCTION_Driveway | JUNCTION_Mid_Block_Intersection_related | JUNCTION_Mid_Block_Not_Intersection_related | JUNCTION_Ramp |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 |


In [27]:
#transform LIGHTCOND
dummy_LIGHTCOND = pd.get_dummies(df["LIGHTCOND"]\
                                 .drop(df[df["LIGHTCOND"]=='Dark - Unknown Lighting'].index)\
                                 .replace({'Unknown':'Other_LIGHTCOND',
                                           'Daylight':'LIGHTCOND_Daylight',
                                           'Dusk':'LIGHTCOND_Dusk',
                                           'Dawn':'LIGHTCOND_Dawn',
                                           'Other':'Other_LIGHTCOND',
                                           'Dark - No Street Lights':'LIGHTCOND_Dark_Street_Lights_Absent/Off',
                                           'Dark - Street Lights Off':'LIGHTCOND_Dark_Street_Lights_Absent/Off',
                                           'Dark - Street Lights On':'LIGHTCOND_Dark_Street_Lights_On'}))

dummy_LIGHTCOND.head()

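| | LIGHTCOND_Dark_Street_Lights_Absent/Off | LIGHTCOND_Dark_Street_Lights_On | LIGHTCOND_Dawn | LIGHTCOND_Daylight | LIGHTCOND_Dusk | Other_LIGHTCOND |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 | 0 |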
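The encoding pattern used for JUNCTIONTYPE and LIGHTCOND can be summarised in a small sketch: relabel the categories, one-hot encode them, and attach the result to the other features. The toy frame and the `features` variable are assumptions for illustration; how the notebook later merges the dummy columns is not shown in this section.

```python
import pandas as pd

# Toy frame with one categorical and one already-binary feature.
toy = pd.DataFrame({
    "LIGHTCOND": [
        "Daylight",
        "Dark - No Street Lights",
        "Dark - Street Lights Off",
        "Dusk",
    ],
    "SPEEDING": [0, 1, 0, 0],
})

# Two raw categories share one label, so they collapse into a single dummy column.
labels = {
    "Daylight": "LIGHTCOND_Daylight",
    "Dusk": "LIGHTCOND_Dusk",
    "Dark - No Street Lights": "LIGHTCOND_Dark_Street_Lights_Absent/Off",
    "Dark - Street Lights Off": "LIGHTCOND_Dark_Street_Lights_Absent/Off",
}

# dtype=int keeps 0/1 columns; recent pandas versions default to booleans.
dummies = pd.get_dummies(toy["LIGHTCOND"].replace(labels), dtype=int)

# join aligns on the index, so each row keeps its own dummy values.
features = toy.drop(columns="LIGHTCOND").join(dummies)
print(features)
```

Because `join` aligns on the index, any rows dropped before encoding would show up as missing values in the dummy columns and would need to be removed from the feature frame as well.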

