This notebook contains the code for reporting the project topic and for the initial data preparation and analysis.

In [1]:
import pandas as pd
import numpy as np

### Introduction and Business Problem

For the final capstone project in Coursera's 'IBM Data Science Professional Certificate', I will use a dataset collected by the Seattle Police Department from its traffic records. It contains information on the severity of vehicle collisions in the Seattle area from 2004 to the present. After a quick study of the dataset's attributes, I decided to investigate how road, weather and light conditions, together with whether a driver was under the influence, affect the severity of accidents in the Seattle area. I will first explore the dataset more thoroughly to see the basic relationships between my chosen independent variables and accident severity, and then build a machine learning model that predicts the severity of future accidents.

This information could prove useful to people who commute regularly, because they will be able to take preventative measures depending on the weather. Similarly, roads that are prone to poor conditions because of weather or lighting issues can be closed or improved based on the findings of this investigation. The analysis can also help the Seattle Police Department decide on staffing numbers and locations under different conditions, and whether to crack down more heavily on drivers under the influence.

### Description of Data

The first 5 rows of the full dataset for this project are shown below. Because I limit my investigation to 4 independent variables and a single dependent variable, I will also limit the dataset to the variables 'SEVERITYCODE', 'ROADCOND', 'LIGHTCOND', 'WEATHER', and 'UNDERINFL', all of which are described in the second table below.

In [5]:
# The code was removed by Watson Studio for sharing.

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


In [22]:
severity_desc = "A code that corresponds to the severity of the collision"

In [26]:
# Build a small lookup table describing each variable used in the analysis.
info_report = pd.DataFrame(
    data=[
        ["SEVERITYCODE", severity_desc],
        ["WEATHER", "A description of the weather conditions during the time of the collision."],
        ["ROADCOND", "The condition of the road during the collision."],
        ["LIGHTCOND", "The light conditions during the collision."],
        ["UNDERINFL", "Whether or not a driver involved was under the influence of drugs or alcohol."],
    ],
    columns=["Variable", "Description"],
)
info_report = info_report.set_index("Variable")
# 'center' is the valid CSS value ('middle' is not); display the styled table.
info_report.style.set_properties(**{"text-align": "center"})

Variable,Description
SEVERITYCODE,A code that corresponds to the severity of the collision
WEATHER,A description of the weather conditions during the time of the collision.
ROADCOND,The condition of the road during the collision.
LIGHTCOND,The light conditions during the collision.
UNDERINFL,Whether or not a driver involved was under the influence of drugs or alcohol.


Severity Code Description:
- 3: fatality
- 2b: serious injury
- 2: injury
- 1: property damage
- 0: unknown

The focused dataset is shown below and only contains data regarding the severity of each accident, the weather, road and light conditions, together with data relating to the presence of a driver under the influence.

In [14]:
df_data = df_data_1.filter(["SEVERITYCODE", "WEATHER", "ROADCOND", "LIGHTCOND", "UNDERINFL"], axis = 1)
df_data.head()

Unnamed: 0,SEVERITYCODE,WEATHER,ROADCOND,LIGHTCOND,UNDERINFL
0,2,Overcast,Wet,Daylight,N
1,1,Raining,Wet,Dark - Street Lights On,0
2,1,Overcast,Dry,Daylight,0
3,1,Clear,Dry,Daylight,N
4,2,Raining,Wet,Daylight,0


To prepare the data, I first looked at the data types I will be working with. As the output below shows, severity is stored as integers while the other variables are object types. Before modifying the data types, I wanted to see more basic information about my chosen variables.

In [15]:
df_data.dtypes

SEVERITYCODE     int64
WEATHER         object
ROADCOND        object
LIGHTCOND       object
UNDERINFL       object
dtype: object

First, I examined the road conditions column and found that most accidents occurred in dry road conditions. This variable can therefore be converted into a binary categorical variable: by grouping all other road conditions together as a single 'other' category, I can store it as an integer, with 1 representing dry and 0 representing other.

In [10]:
df_data_1["ROADCOND"].value_counts().to_frame()

Unnamed: 0,ROADCOND
Dry,124510
Wet,47474
Unknown,15078
Ice,1209
Snow/Slush,1004
Other,132
Standing Water,115
Sand/Mud/Dirt,75
Oil,64
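
The binary encoding described above could look like the following minimal sketch. The small `sample` frame and the `ROADCOND_DRY` column name are made up for illustration, and counting missing values as 'other' is an assumption rather than a decision taken in this notebook.

```python
import pandas as pd

# Hypothetical sample standing in for the ROADCOND column of the full dataset.
sample = pd.DataFrame({"ROADCOND": ["Dry", "Wet", "Unknown", "Dry", None, "Ice"]})

# 1 where the road was dry, 0 for every other condition.
# Assumption: missing values compare unequal to "Dry" and so fall into 0.
sample["ROADCOND_DRY"] = sample["ROADCOND"].eq("Dry").astype(int)
```

Writing the result to a new column keeps the original labels available for later checks.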


Similarly, light conditions can be converted to a binary variable by grouping every condition other than 'Daylight' together, especially since most of the other light conditions can be considered relatively dark.

In [11]:
df_data_1["LIGHTCOND"].value_counts().to_frame()

Unnamed: 0,LIGHTCOND
Daylight,116137
Dark - Street Lights On,48507
Unknown,13473
Dusk,5902
Dawn,2502
Dark - No Street Lights,1537
Dark - Street Lights Off,1199
Other,235
Dark - Unknown Lighting,11


Like the preceding variables, the weather variable can be converted into a binary categorical variable, with clear weather represented by 1 and all other weather types by 0.

In [12]:
df_data_1["WEATHER"].value_counts().to_frame()

Unnamed: 0,WEATHER
Clear,111135
Raining,33145
Overcast,27714
Unknown,15091
Snowing,907
Other,832
Fog/Smog/Smoke,569
Sleet/Hail/Freezing Rain,113
Blowing Sand/Dirt,56
Severe Crosswind,25
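
Since road, light and weather conditions all receive the same treatment, the conversion can be sketched as one helper applied to all three columns. The function name `encode_binary`, the `POSITIVE_LABELS` mapping and the sample rows are hypothetical; only the choice of 'Dry', 'Daylight' and 'Clear' as the labels coded 1 comes from the text.

```python
import pandas as pd

# Assumed mapping from each column to the label that becomes 1,
# following the choices described in the text.
POSITIVE_LABELS = {"ROADCOND": "Dry", "LIGHTCOND": "Daylight", "WEATHER": "Clear"}


def encode_binary(frame: pd.DataFrame, positive_labels: dict) -> pd.DataFrame:
    """Return a copy in which each listed column is 1 for its positive label, else 0."""
    encoded = frame.copy()
    for column, label in positive_labels.items():
        encoded[column] = encoded[column].eq(label).astype(int)
    return encoded


# Hypothetical rows, not taken from the real data.
sample = pd.DataFrame({
    "WEATHER": ["Clear", "Raining", "Overcast"],
    "ROADCOND": ["Dry", "Wet", "Dry"],
    "LIGHTCOND": ["Daylight", "Dark - Street Lights On", "Dusk"],
})
encoded = encode_binary(sample, POSITIVE_LABELS)
```

Note that this sketch also places 'Unknown' in the 0 group; whether unknown values should instead be dropped is a separate decision.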


Lastly, the 'Under the Influence' variable is coded inconsistently, with 'Y' and '1' both representing yes and 'N' and '0' both representing no. This variable can be cleaned up to use only the binary numbers 1 and 0 to represent yes and no respectively.

In [13]:
df_data_1["UNDERINFL"].value_counts().to_frame()

Unnamed: 0,UNDERINFL
N,100274
0,80394
Y,5126
1,3995
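
One way to make this variable uniform is a dictionary lookup, sketched below on a hypothetical sample. It assumes that '0' and '1' are stored as strings (the column has the object dtype); the name `UNDERINFL_CODES` is introduced here for illustration.

```python
import pandas as pd

# Hypothetical sample mixing the four codes seen in the value counts above.
sample = pd.DataFrame({"UNDERINFL": ["N", "0", "Y", "1", "N"]})

# Map every spelling of yes/no onto a single integer code.
# Assumption: "0" and "1" are strings, as the object dtype suggests.
UNDERINFL_CODES = {"Y": 1, "1": 1, "N": 0, "0": 0}
sample["UNDERINFL"] = sample["UNDERINFL"].map(UNDERINFL_CODES)
```

Any value that is not in the dictionary, including a missing one, becomes NaN after `map`, so such rows would need to be handled before converting the column to an integer type.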


Once I have prepared the full dataset with all of the categorical independent variables encoded, I should be able to build a simple machine learning model that predicts accident severity from them.