# Capstone Project - Car accident severity


## Applied Data Science Capstone by IBM/Coursera

### Table of contents
* <a href='#section-1'>Introduction: Business Problem and Background</a>
<br />
* <a href='#section-2'>Data</a>

## <a id='section-1'>Introduction: Business Problem and Background</a>

Car accidents in Seattle have increased sharply in recent years, and the city government and related departments want insights that can be turned into action to prevent avoidable accidents.

Most accidents are caused by driver inattention, drug or alcohol use, or speeding. In addition, the impact of factors outside a driver's control, such as weather, visibility, and road conditions, can be reduced by collecting data and issuing warnings to the relevant parties: local government, police, and drivers.

## <a id='section-2'>Data</a>


We chose the imbalanced dataset provided by the Seattle Department of Transportation Traffic Management Division, with 194673 rows (accidents) and 37 columns (features), in which each accident is assigned a severity code. It covers accidents from January 2004 to May 2020. Features include, but are not limited to, the severity code, the location/address of the accident, the weather conditions at the incident site, the driver's state (whether under the influence or not), and the collision type. We therefore consider it a good general-purpose dataset that will help us build an accurate predictive model. The class imbalance with respect to the severity code is as follows.

| SEVERITYCODE | Count |
|---|---|
| 1 | 136485 |
| 2 | 58188 |
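
The degree of imbalance can be quantified directly from the two counts above. The following is a minimal sketch; the variable names are illustrative and do not appear in the original notebook.

```python
# Class counts as reported above (severity code -> number of accidents)
counts = {1: 136485, 2: 58188}

total = sum(counts.values())  # 194673 rows in total

# Share of each severity class in the dataset
shares = {code: n / total for code, n in counts.items()}

# Ratio of the majority class (1) to the minority class (2)
imbalance_ratio = counts[1] / counts[2]

for code, share in shares.items():
    print(f"Severity {code}: {share:.1%}")
print(f"Majority/minority ratio: {imbalance_ratio:.2f}")
```

Roughly 70% of the rows belong to severity code 1, so a model that always predicts code 1 would already reach about 70% accuracy. This is worth keeping in mind when judging any classifier trained on this data.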

Other important variables include:

* ADDRTYPE: Collision address type: Alley, Block, Intersection<br />
* LOCATION: Description of the general location of the collision<br />
* PERSONCOUNT: The total number of people involved in the collision; helps identify severity level<br />
* PEDCOUNT: The number of pedestrians involved in the collision; helps identify severity level<br />
* PEDCYLCOUNT: The number of bicycles involved in the collision; helps identify severity level<br />
* VEHCOUNT: The number of vehicles involved in the collision; helps identify severity level<br />
* JUNCTIONTYPE: Category of junction at which collision took place helps identify where most collisions occur<br />
* WEATHER: A description of the weather conditions during the time of the collision<br />
* ROADCOND: The condition of the road during the collision<br />
* LIGHTCOND: The light conditions during the collision<br />
* SPEEDING: Whether or not speeding was a factor in the collision (Y/N)<br />
* SEGLANEKEY: A key for the lane segment in which the collision occurred<br />
* CROSSWALKKEY: A key for the crosswalk at which the collision occurred<br />
* HITPARKEDCAR: Whether or not the collision involved hitting a parked car<br />

## <a id='section-3'>Methodology</a>

In [41]:
#reading the data
import pandas as pd
import numpy as np

In [42]:
df = pd.read_csv("Data-Collisions.csv")
with pd.option_context('display.max_rows', 5, 'display.max_columns', None): 
    display(df)

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,LOCATION,EXCEPTRSNCODE,EXCEPTRSNDESC,SEVERITYCODE.1,SEVERITYDESC,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INCDATE,INCDTTM,JUNCTIONTYPE,SDOT_COLCODE,SDOT_COLDESC,INATTENTIONIND,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.703140,1,1307,1307,3502005,Matched,Intersection,37475.0,5TH AVE NE AND NE 103RD ST,,,2,Injury Collision,Angles,2,0,0,2,2013/03/27 00:00:00+00,3/27/2013 2:54:00 PM,At Intersection (intersection related),11,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END AT ANGLE",,N,Overcast,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,AURORA BR BETWEEN RAYE ST AND BRIDGE WAY N,,,1,Property Damage Only Collision,Sideswipe,2,0,0,2,2006/12/20 00:00:00+00,12/20/2006 6:55:00 PM,Mid-Block (not related to intersection),16,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, LEFT SIDE SIDESWIPE",,0,Raining,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - both moving - sideswipe,0,0,N
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
194671,2,-122.355317,47.678734,219546,309514,310794,3810083,Matched,Intersection,24349.0,GREENWOOD AVE N AND N 68TH ST,,,2,Injury Collision,Cycles,2,0,1,1,2019/01/15 00:00:00+00,1/15/2019 4:48:00 PM,At Intersection (intersection related),51,PEDALCYCLIST STRUCK MOTOR VEHICLE FRONT END AT ANGLE,,N,Clear,Dry,Dusk,,,,5,Vehicle Strikes Pedalcyclist,4308,0,N
194672,1,-122.289360,47.611017,219547,308220,309500,E868008,Matched,Block,,34TH AVE BETWEEN E MARION ST AND E SPRING ST,,,1,Property Damage Only Collision,Rear Ended,2,0,0,2,2018/11/30 00:00:00+00,11/30/2018 3:45:00 PM,Mid-Block (not related to intersection),14,"MOTOR VEHICLE STRUCK MOTOR VEHICLE, REAR END",,N,Clear,Wet,Daylight,,,,14,From same direction - both going straight - one stopped - rear-end,0,0,N


In [43]:
print(df["SEVERITYCODE"].value_counts())
print('-'*50)
y = df["SEVERITYCODE"].values
df.drop(["SEVERITYCODE.1"], axis=1, inplace=True)
print("Number of data points in data", df.shape)
print("Number of data points in label", y.shape)

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64
--------------------------------------------------
Number of data points in data (194673, 37)
Number of data points in label (194673,)
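
The `SEVERITYCODE.1` column dropped above appears to be a copy of `SEVERITYCODE` (pandas appends `.1` to a repeated column name when reading a CSV). Before dropping such a column, it can be worth confirming that it really is a duplicate. The sketch below does this on a small made-up frame named `toy`, which is a hypothetical stand-in for the real data.

```python
import pandas as pd

# Toy stand-in for the collisions data (values are made up for illustration)
toy = pd.DataFrame({
    "SEVERITYCODE": [2, 1, 1, 2],
    "SEVERITYCODE.1": [2, 1, 1, 2],
    "WEATHER": ["Overcast", "Raining", "Clear", "Clear"],
})

# True only if both columns hold identical values in every row
is_duplicate = toy["SEVERITYCODE"].equals(toy["SEVERITYCODE.1"])

if is_duplicate:
    # Safe to drop: no information is lost
    toy = toy.drop(columns=["SEVERITYCODE.1"])

print(is_duplicate, list(toy.columns))
```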


In [44]:
df.columns

Index(['SEVERITYCODE', 'X', 'Y', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO',
       'STATUS', 'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE',
       'EXCEPTRSNDESC', 'SEVERITYDESC', 'COLLISIONTYPE', 'PERSONCOUNT',
       'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'INCDATE', 'INCDTTM',
       'JUNCTIONTYPE', 'SDOT_COLCODE', 'SDOT_COLDESC', 'INATTENTIONIND',
       'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND', 'PEDROWNOTGRNT',
       'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE', 'ST_COLDESC', 'SEGLANEKEY',
       'CROSSWALKKEY', 'HITPARKEDCAR'],
      dtype='object')

## <a id='section-3'>Data Preprocessing</a>

1. Removal of irrelevant columns<br />

Columns containing descriptions and identification numbers that would not help in the classification are dropped to reduce the complexity and dimensionality of the dataset. ‘OBJECTID’, ‘INCKEY’, ‘COLDETKEY’, ‘REPORTNO’, ‘STATUS’, ‘INTKEY’, ‘EXCEPTRSNCODE’ and others belong to this category. Certain other categorical features, for example ‘LOCATION’, were removed because they have a large number of distinct values.

In [45]:
df.drop(["OBJECTID", "INCKEY", "COLDETKEY", "REPORTNO", "STATUS","INTKEY", "EXCEPTRSNCODE", "EXCEPTRSNDESC", "INATTENTIONIND", "UNDERINFL", "PEDROWNOTGRNT", "SDOT_COLDESC", "LOCATION"], axis=1, inplace=True)
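
One way to spot categorical features with a large number of distinct values, such as ‘LOCATION’, is to count unique values per column. This is a minimal sketch on a made-up frame; `toy` and the `threshold` rule are assumptions chosen for illustration, not the criterion used in the notebook.

```python
import pandas as pd

# Toy stand-in for the collisions data (values are made up for illustration)
toy = pd.DataFrame({
    "LOCATION": ["A ST AND B AVE", "C ST AND D AVE", "E ST AND F AVE",
                 "G ST AND H AVE", "I ST AND J AVE", "K ST AND L AVE"],
    "ADDRTYPE": ["Intersection", "Block", "Intersection",
                 "Block", "Intersection", "Block"],
    "WEATHER": ["Clear", "Raining", "Clear", "Overcast", "Clear", "Raining"],
})

# Number of distinct values in each column
cardinality = toy.nunique()

# Hypothetical cut-off: flag columns where more than half the rows are distinct
threshold = 0.5 * len(toy)
high_cardinality = cardinality[cardinality > threshold].index.tolist()

print(cardinality.to_dict())
print(high_cardinality)
```

On the real data, the same `nunique()` call on `df` would show which columns are effectively identifiers or free text.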

In [48]:
df.shape

(194673, 24)

In [47]:
df.head()

Unnamed: 0,SEVERITYCODE,X,Y,ADDRTYPE,SEVERITYDESC,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INCDATE,INCDTTM,JUNCTIONTYPE,SDOT_COLCODE,WEATHER,ROADCOND,LIGHTCOND,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,Intersection,Injury Collision,Angles,2,0,0,2,2013/03/27 00:00:00+00,3/27/2013 2:54:00 PM,At Intersection (intersection related),11,Overcast,Wet,Daylight,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,Block,Property Damage Only Collision,Sideswipe,2,0,0,2,2006/12/20 00:00:00+00,12/20/2006 6:55:00 PM,Mid-Block (not related to intersection),16,Raining,Wet,Dark - Street Lights On,6354039.0,,11,From same direction - both going straight - both moving - sideswipe,0,0,N
2,1,-122.33454,47.607871,Block,Property Damage Only Collision,Parked Car,4,0,0,3,2004/11/18 00:00:00+00,11/18/2004 10:20:00 AM,Mid-Block (not related to intersection),14,Overcast,Dry,Daylight,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,Block,Property Damage Only Collision,Other,3,0,0,3,2013/03/29 00:00:00+00,3/29/2013 9:26:00 AM,Mid-Block (not related to intersection),11,Clear,Dry,Daylight,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,Intersection,Injury Collision,Angles,2,0,0,2,2004/01/28 00:00:00+00,1/28/2004 8:04:00 AM,At Intersection (intersection related),11,Raining,Wet,Daylight,4028032.0,,10,Entering at angle,0,0,N


2. Identification and handling of missing values<br />

In [50]:
# Treat empty or whitespace-only strings as missing; inplace=True keeps the result
df.replace(r'^\s*$', np.nan, regex=True, inplace=True)
df.replace("Unknown", np.nan, inplace=True)
df.replace("Other", np.nan, inplace=True)
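
`DataFrame.replace` returns a new DataFrame unless `inplace=True` is passed, so a call whose result is neither assigned nor made in place has no effect. The sketch below shows the assignment form on a made-up frame (`toy` is a hypothetical stand-in for the real data) and then counts the resulting missing values per column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the collisions data (values are made up for illustration)
toy = pd.DataFrame({
    "WEATHER": ["Clear", "Unknown", "  ", "Raining"],
    "ROADCOND": ["Dry", "Wet", "Other", ""],
})

# Assign the result back; otherwise the replaced copy is discarded
toy = toy.replace(r"^\s*$", np.nan, regex=True)

# Treat the placeholder categories as missing as well
toy = toy.replace(["Unknown", "Other"], np.nan)

# Number of missing values per column
missing = toy.isna().sum()
print(missing.to_dict())
```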

In [51]:
df["WEATHER"].value_counts()

Clear                       111135
Raining                      33145
Overcast                     27714
Snowing                        907
Fog/Smog/Smoke                 569
Sleet/Hail/Freezing Rain       113
Blowing Sand/Dirt               56
Severe Crosswind                25
Partly Cloudy                    5
Name: WEATHER, dtype: int64

In [53]:
df["ROADCOND"].value_counts()

Dry               124510
Wet                47474
Ice                 1209
Snow/Slush          1004
Standing Water       115
Sand/Mud/Dirt         75
Oil                   64
Name: ROADCOND, dtype: int64

In [54]:
df["LIGHTCOND"].value_counts()

Daylight                    116137
Dark - Street Lights On      48507
Dusk                          5902
Dawn                          2502
Dark - No Street Lights       1537
Dark - Street Lights Off      1199
Dark - Unknown Lighting         11
Name: LIGHTCOND, dtype: int64

In [55]:
df["SEVERITYDESC"].value_counts()

Property Damage Only Collision    136485
Injury Collision                   58188
Name: SEVERITYDESC, dtype: int64