# IBM Coursera Capstone Project

**Introduction:** In 2017, motor vehicle accidents would have ranked 13th in the CDC's report on the number of deaths and leading causes of death. However, they were not officially listed because motor vehicle accident-related deaths fall into two groups: 1) deaths at the scene of the accident, and 2) deaths elsewhere that resulted from injuries caused by motor vehicle accidents. According to the U.S. Department of Transportation's National Highway Traffic Safety Administration, there were 6,734,000 motor vehicle crashes in 2018.

While not all crashes result in death, the volume of crashes and deaths raises an interesting question about the factors at play during a crash. Indeed, the video prompt for this final project asks: what if, based on weather and road conditions, we could determine whether it is safe to drive or whether we should stay home? This is the question the model below aims to answer.

**Data:** Our dataset contains more than 194,000 crash records described by 38 attributes. It was recommended by the IBM Data Science Certification program on Coursera, so we presume it is reliable. On initial review, all of the relevant variables are categorical. The dependent variable is SEVERITYCODE (how severe the crash was). The independent variables are:
- WEATHER: weather conditions
- ROADCOND: road conditions
- LIGHTCOND: light conditions
- ADDRTYPE: location type of the crash
- UNDERINFL: whether the driver was under the influence

**Methodology:** After building a new data frame with only the relevant variables, we drop the rows that contain N/A values. Because the goal is a model that helps a person decide whether to stay home, we choose a decision tree as our machine learning algorithm. Since scikit-learn's decision tree expects numeric inputs, we convert the categorical variables into dummy variables.
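
To make the dummy-variable step concrete, here is a minimal sketch on made-up data. The frame `toy` and its values are hypothetical and only mimic the capstone columns; `dtype=int` is passed so the indicators display as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Toy data: values are invented for illustration, not taken from the dataset.
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1],
    "ROADCOND": ["Wet", "Dry", "Wet"],
})

# Each category level of ROADCOND becomes its own 0/1 indicator column.
toy_dummies = pd.get_dummies(toy, columns=["ROADCOND"], dtype=int)
print(toy_dummies)
```

Columns that are not listed in `columns` (here `SEVERITYCODE`) are passed through unchanged.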

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

#### Data importing and initial preprocessing

In [16]:
data = pd.read_csv(r'C:\Users\Moira\Documents\IBM Data Science Capstone Data.csv')

In [17]:
data.head()

| | SEVERITYCODE | X | Y | OBJECTID | INCKEY | COLDETKEY | REPORTNO | STATUS | ADDRTYPE | INTKEY | ... | ROADCOND | LIGHTCOND | PEDROWNOTGRNT | SDOTCOLNUM | SPEEDING | ST_COLCODE | ST_COLDESC | SEGLANEKEY | CROSSWALKKEY | HITPARKEDCAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | -122.323148 | 47.70314 | 1 | 1307 | 1307 | 3502005 | Matched | Intersection | 37475.0 | ... | Wet | Daylight | | | | 10 | Entering at angle | 0 | 0 | N |
| 1 | 1 | -122.347294 | 47.647172 | 2 | 52200 | 52200 | 2607959 | Matched | Block | | ... | Wet | Dark - Street Lights On | | 6354039.0 | | 11 | From same direction - both going straight - bo... | 0 | 0 | N |
| 2 | 1 | -122.33454 | 47.607871 | 3 | 26700 | 26700 | 1482393 | Matched | Block | | ... | Dry | Daylight | | 4323031.0 | | 32 | One parked--one moving | 0 | 0 | N |
| 3 | 1 | -122.334803 | 47.604803 | 4 | 1144 | 1144 | 3503937 | Matched | Block | | ... | Dry | Daylight | | | | 23 | From same direction - all others | 0 | 0 | N |
| 4 | 2 | -122.306426 | 47.545739 | 5 | 17700 | 17700 | 1807429 | Matched | Intersection | 34387.0 | ... | Wet | Daylight | | 4028032.0 | | 10 | Entering at angle | 0 | 0 | N |


In [18]:
data.dtypes

SEVERITYCODE        int64
X                 float64
Y                 float64
OBJECTID            int64
INCKEY              int64
COLDETKEY           int64
REPORTNO           object
STATUS             object
ADDRTYPE           object
INTKEY            float64
LOCATION           object
EXCEPTRSNCODE      object
EXCEPTRSNDESC      object
SEVERITYCODE.1      int64
SEVERITYDESC       object
COLLISIONTYPE      object
PERSONCOUNT         int64
PEDCOUNT            int64
PEDCYLCOUNT         int64
VEHCOUNT            int64
INCDATE            object
INCDTTM            object
JUNCTIONTYPE       object
SDOT_COLCODE        int64
SDOT_COLDESC       object
INATTENTIONIND     object
UNDERINFL          object
WEATHER            object
ROADCOND           object
LIGHTCOND          object
PEDROWNOTGRNT      object
SDOTCOLNUM        float64
SPEEDING           object
ST_COLCODE         object
ST_COLDESC         object
SEGLANEKEY          int64
CROSSWALKKEY        int64
HITPARKEDCAR       object
dtype: object
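
Before selecting the modeling columns, it can help to check how many values are missing and which category levels appear. The sketch below does this on a small made-up frame; `toy`, `missing_per_column`, and `underinfl_levels` are hypothetical names, and the values are invented rather than taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data with a few gaps and mixed labels, invented for illustration.
toy = pd.DataFrame({
    "WEATHER": ["Clear", np.nan, "Raining", "Unknown"],
    "UNDERINFL": ["N", "0", np.nan, "Y"],
})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()

# Frequency of each level, counting missing values as their own level.
underinfl_levels = toy["UNDERINFL"].value_counts(dropna=False)

print(missing_per_column)
print(underinfl_levels)
```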

In [26]:
data = data.astype({'SEVERITYCODE':object})

In [28]:
df = data[['SEVERITYCODE', 'WEATHER', 'ROADCOND', 'LIGHTCOND', 'ADDRTYPE', 'UNDERINFL']].copy()
df.dtypes

SEVERITYCODE    object
WEATHER         object
ROADCOND        object
LIGHTCOND       object
ADDRTYPE        object
UNDERINFL       object
dtype: object

In [37]:
df = df.dropna()  # dropna() returns a new frame, so assign the result to keep it
df

| | SEVERITYCODE | WEATHER | ROADCOND | LIGHTCOND | ADDRTYPE | UNDERINFL |
|---|---|---|---|---|---|---|
| 0 | 2 | Overcast | Wet | Daylight | Intersection | N |
| 1 | 1 | Raining | Wet | Dark - Street Lights On | Block | 0 |
| 2 | 1 | Overcast | Dry | Daylight | Block | 0 |
| 3 | 1 | Clear | Dry | Daylight | Block | N |
| 4 | 2 | Raining | Wet | Daylight | Intersection | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 194668 | 2 | Clear | Dry | Daylight | Block | N |
| 194669 | 1 | Raining | Wet | Daylight | Block | N |
| 194670 | 2 | Clear | Dry | Daylight | Intersection | N |
| 194671 | 2 | Clear | Dry | Dusk | Intersection | N |
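
One detail worth keeping in mind: by default `DataFrame.dropna()` returns a new frame and leaves the original untouched, so the result has to be assigned for the rows to stay dropped. A minimal sketch on made-up data (`toy`, `rows_before`, and `rows_after` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Toy data: only the first row is complete.
toy = pd.DataFrame({
    "WEATHER": ["Clear", np.nan, "Raining"],
    "ROADCOND": ["Dry", "Wet", np.nan],
})

toy.dropna()            # result is discarded, so toy is unchanged
rows_before = len(toy)

toy = toy.dropna()      # assign the result to keep the cleaned frame
rows_after = len(toy)

print(rows_before, rows_after)
```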


In [64]:
# Make the UNDERINFL categories uniform: 'Y'/'N' become '1'/'0'
df['UNDERINFL'] = df['UNDERINFL'].replace({'Y': '1', 'N': '0'})
# Build X without 'WEATHER_Unknown', 'ROADCOND_Unknown', 'LIGHTCOND_Dark - Unknown Lighting', 'LIGHTCOND_Unknown'
# Build X without 'WEATHER_Other', 'ROADCOND_Other', 'LIGHTCOND_Other'
# We want to eliminate any unknown elements
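
The comments above describe building X without the "Unknown" and "Other" indicator columns, but the notebook does not show that step. One possible way to do it, sketched on made-up data, is to drop those columns by name after encoding. The column names come from the comments; `toy`, `toy_dt`, `unwanted`, and `X_known` are hypothetical, and this is not necessarily how the author implemented it.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 2],
    "WEATHER": ["Clear", "Unknown", "Other", "Raining"],
})
toy_dt = pd.get_dummies(toy, columns=["WEATHER"], dtype=int)

# Indicator columns named in the notebook comments.
unwanted = [
    "WEATHER_Unknown", "ROADCOND_Unknown",
    "LIGHTCOND_Dark - Unknown Lighting", "LIGHTCOND_Unknown",
    "WEATHER_Other", "ROADCOND_Other", "LIGHTCOND_Other",
]

# errors="ignore" skips names that are absent from this toy frame.
X_known = toy_dt.drop(columns=["SEVERITYCODE"] + unwanted, errors="ignore")
print(X_known.columns.tolist())
```

Dropping the columns keeps every row, so a crash recorded as "Unknown" ends up with all-zero indicators for that variable. Filtering those rows out before encoding is an alternative if they should not be modeled at all.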

#### Data preprocessing towards decision-tree-building

In [55]:
df_dt = pd.get_dummies(df, columns=['WEATHER', 'ROADCOND', 'LIGHTCOND', 'ADDRTYPE', 'UNDERINFL'])
df_dt

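| | SEVERITYCODE | WEATHER_Blowing Sand/Dirt | WEATHER_Clear | WEATHER_Fog/Smog/Smoke | WEATHER_Other | WEATHER_Overcast | WEATHER_Partly Cloudy | WEATHER_Raining | WEATHER_Severe Crosswind | WEATHER_Sleet/Hail/Freezing Rain | ... | LIGHTCOND_Dawn | LIGHTCOND_Daylight | LIGHTCOND_Dusk | LIGHTCOND_Other | LIGHTCOND_Unknown | ADDRTYPE_Alley | ADDRTYPE_Block | ADDRTYPE_Intersection | UNDERINFL_0 | UNDERINFL_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 3 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 4 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 194668 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 194669 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 194670 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 194671 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |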


In [65]:
df_dt.dtypes

SEVERITYCODE                          object
WEATHER_Blowing Sand/Dirt              uint8
WEATHER_Clear                          uint8
WEATHER_Fog/Smog/Smoke                 uint8
WEATHER_Other                          uint8
WEATHER_Overcast                       uint8
WEATHER_Partly Cloudy                  uint8
WEATHER_Raining                        uint8
WEATHER_Severe Crosswind               uint8
WEATHER_Sleet/Hail/Freezing Rain       uint8
WEATHER_Snowing                        uint8
WEATHER_Unknown                        uint8
ROADCOND_Dry                           uint8
ROADCOND_Ice                           uint8
ROADCOND_Oil                           uint8
ROADCOND_Other                         uint8
ROADCOND_Sand/Mud/Dirt                 uint8
ROADCOND_Snow/Slush                    uint8
ROADCOND_Standing Water                uint8
ROADCOND_Unknown                       uint8
ROADCOND_Wet                           uint8
LIGHTCOND_Dark - No Street Lights      uint8
...

In [67]:
X = df_dt.drop('SEVERITYCODE', axis=1)
Y = df_dt['SEVERITYCODE']
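
With X and Y defined, the decision tree described in the methodology could be fit roughly as follows. This is a sketch on made-up data, not the author's model: the split ratio, `criterion`, `max_depth`, and all `toy` names are assumptions. The labels are cast to `int` because scikit-learn classifiers can raise an "Unknown label type" error when the target is stored with `object` dtype, as SEVERITYCODE is here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy encoded frame, invented for illustration.
toy_dt = pd.DataFrame({
    "SEVERITYCODE": pd.Series([1, 2, 1, 2, 1, 2, 1, 2], dtype=object),
    "ROADCOND_Wet": [0, 1, 0, 1, 0, 1, 0, 1],
    "LIGHTCOND_Daylight": [1, 0, 1, 1, 0, 0, 1, 0],
})

X_toy = toy_dt.drop("SEVERITYCODE", axis=1)
y_toy = toy_dt["SEVERITYCODE"].astype(int)  # object labels cast to integers

# Hold out a quarter of the rows, keeping the class balance (assumed settings).
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

# Hyperparameters are illustrative assumptions, not tuned values.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

predictions = tree.predict(X_test)
print(predictions)
```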