# IBM Data Science Capstone Project
Case Study: Predict the severity of an accident

by Ariella Goldman

## Business Understanding

The initial phase is to understand the project's objective from the business or application perspective. Then, you need to translate this knowledge into a machine learning problem with a preliminary plan to achieve the objectives.

The objective of this project is to predict the severity of an accident. If something could warn you, given the weather and road conditions, about the likelihood of getting into a car accident and how severe it would be, you might drive more carefully or change your travel plans.

In [75]:
import pandas as pd
import numpy as np

In [78]:
print("Hello Capstone Project Course!")

Hello Capstone Project Course!


### About the data

This data set describes the severity of car collisions and includes all collision types. Collisions are displayed at the intersection or mid-block of a street segment. The data is updated weekly and covers 2004 to the present. It comes from the Seattle Department of Transportation.

This data set contains 194673 rows and 38 columns: the label `SEVERITYCODE` plus 37 attributes, some numeric and some categorical. Some attributes have missing data, and the labels are not balanced. Metadata about the dataset can be found at https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Metadata.pdf.
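The label imbalance noted above can be quantified by measuring how often each severity code occurs. The following is a minimal sketch on a made-up frame: only the column name `SEVERITYCODE` comes from the dataset, while the helper `class_shares` and the toy rows are assumptions for illustration.

```python
import pandas as pd


def class_shares(frame: pd.DataFrame, label: str = "SEVERITYCODE") -> pd.Series:
    """Return the fraction of rows that carry each label value."""
    return frame[label].value_counts(normalize=True).sort_index()


# Toy rows for illustration only; these are not values from the collision data.
toy = pd.DataFrame({"SEVERITYCODE": [1, 1, 1, 2]})
shares = class_shares(toy)
print(shares)
```

Running the same helper on the loaded `df` would show how far the real labels are from an even split.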

## Data Understanding

In this phase, you collect or extract the dataset from sources such as a CSV file or an SQL database. You then determine the attributes (columns) to use for training the machine learning model, and assess the condition of the chosen attributes by looking for trends, patterns, skewed information, correlations, and so on.

### Load Data from CSV file

In [67]:
df = pd.read_csv('Data-Collisions.csv', dtype={'SEVERITYCODE': 'int64', 'X': float, 'Y': float, 'OBJECTID': 'int64',
                                               'INCKEY': float, 'COLDETKEY': float, 'REPORTNO': object,
                                               'STATUS': object, 'ADDRTYPE': object, 'INTKEY': float,
                                               'LOCATION': object, 'EXCEPTRSNCODE': object, 'EXCEPTRSNDESC': object,
                                               'SEVERITYCODE.1': object, 'SEVERITYDESC': object, 'PERSONCOUNT': float,
                                               'PEDCOUNT': float, 'PEDCYLCOUNT': float, 'VEHCOUNT': float,
                                               'INCDATE': object, 'INCDTTM': object, 'JUNCTIONTYPE': object,
                                               'SDOT_COLCODE': object, 'SDOT_COLDESC': object, 'HITPARKEDCAR': object,
                                               'INATTENTIONIND': object, 'UNDERINFL': object, 'WEATHER': object,
                                               'ROADCOND': object, 'LIGHTCOND': object, 'PEDROWNOTGRNT': object,
                                               'SDOTCOLNUM': object, 'SPEEDING': object, 'ST_COLCODE': object,
                                               'ST_COLDESC': object, 'SEGLANEKEY': float, 'CROSSWALKKEY': float})
df.head()

| | SEVERITYCODE | X | Y | OBJECTID | INCKEY | COLDETKEY | REPORTNO | STATUS | ADDRTYPE | INTKEY | ... | ROADCOND | LIGHTCOND | PEDROWNOTGRNT | SDOTCOLNUM | SPEEDING | ST_COLCODE | ST_COLDESC | SEGLANEKEY | CROSSWALKKEY | HITPARKEDCAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | -122.323148 | 47.70314 | 1 | 1307.0 | 1307.0 | 3502005 | Matched | Intersection | 37475.0 | ... | Wet | Daylight | | | | 10 | Entering at angle | 0.0 | 0.0 | N |
| 1 | 1 | -122.347294 | 47.647172 | 2 | 52200.0 | 52200.0 | 2607959 | Matched | Block | | ... | Wet | Dark - Street Lights On | | 6354039.0 | | 11 | From same direction - both going straight - bo... | 0.0 | 0.0 | N |
| 2 | 1 | -122.33454 | 47.607871 | 3 | 26700.0 | 26700.0 | 1482393 | Matched | Block | | ... | Dry | Daylight | | 4323031.0 | | 32 | One parked--one moving | 0.0 | 0.0 | N |
| 3 | 1 | -122.334803 | 47.604803 | 4 | 1144.0 | 1144.0 | 3503937 | Matched | Block | | ... | Dry | Daylight | | | | 23 | From same direction - all others | 0.0 | 0.0 | N |
| 4 | 2 | -122.306426 | 47.545739 | 5 | 17700.0 | 17700.0 | 1807429 | Matched | Intersection | 34387.0 | ... | Wet | Daylight | | 4028032.0 | | 10 | Entering at angle | 0.0 | 0.0 | N |


In [68]:
df.shape

(194673, 38)

In [69]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194673 entries, 0 to 194672
Data columns (total 38 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   SEVERITYCODE    194673 non-null  int64  
 1   X               189339 non-null  float64
 2   Y               189339 non-null  float64
 3   OBJECTID        194673 non-null  int64  
 4   INCKEY          194673 non-null  float64
 5   COLDETKEY       194673 non-null  float64
 6   REPORTNO        194673 non-null  object 
 7   STATUS          194673 non-null  object 
 8   ADDRTYPE        192747 non-null  object 
 9   INTKEY          65070 non-null   float64
 10  LOCATION        191996 non-null  object 
 11  EXCEPTRSNCODE   84811 non-null   object 
 12  EXCEPTRSNDESC   5638 non-null    object 
 13  SEVERITYCODE.1  194673 non-null  object 
 14  SEVERITYDESC    194673 non-null  object 
 15  COLLISIONTYPE   189769 non-null  object 
 16  PERSONCOUNT     194673 non-null  float64
 17  PEDCOUNT  

In [70]:
df.describe(include="all")

| | SEVERITYCODE | X | Y | OBJECTID | INCKEY | COLDETKEY | REPORTNO | STATUS | ADDRTYPE | INTKEY | ... | ROADCOND | LIGHTCOND | PEDROWNOTGRNT | SDOTCOLNUM | SPEEDING | ST_COLCODE | ST_COLDESC | SEGLANEKEY | CROSSWALKKEY | HITPARKEDCAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 194673.0 | 189339.0 | 189339.0 | 194673.0 | 194673.0 | 194673.0 | 194673.0 | 194673 | 192747 | 65070.0 | ... | 189661 | 189503 | 4667 | 114936.0 | 9333 | 194655.0 | 189769 | 194673.0 | 194673.0 | 194673 |
| unique | | | | | | | 194670.0 | 2 | 3 | | ... | 9 | 9 | 1 | 114932.0 | 1 | 63.0 | 62 | | | 2 |
| top | | | | | | | 1782439.0 | Matched | Block | | ... | Dry | Daylight | Y | 4116048.0 | Y | 32.0 | One parked--one moving | | | N |
| freq | | | | | | | 2.0 | 189786 | 126926 | | ... | 124510 | 116137 | 4667 | 2.0 | 9333 | 44421.0 | 44421 | | | 187457 |
| mean | 1.298901 | -122.330518 | 47.619543 | 108479.36493 | 141091.45635 | 141298.811381 | | | | 37558.450576 | ... | | | | | | | | 269.401114 | 9782.452 | |
| std | 0.457778 | 0.029976 | 0.056157 | 62649.722558 | 86634.402737 | 86986.54211 | | | | 51745.990273 | ... | | | | | | | | 3315.776055 | 72269.26 | |
| min | 1.0 | -122.419091 | 47.495573 | 1.0 | 1001.0 | 1001.0 | | | | 23807.0 | ... | | | | | | | | 0.0 | 0.0 | |
| 25% | 1.0 | -122.348673 | 47.575956 | 54267.0 | 70383.0 | 70383.0 | | | | 28667.0 | ... | | | | | | | | 0.0 | 0.0 | |
| 50% | 1.0 | -122.330224 | 47.615369 | 106912.0 | 123363.0 | 123363.0 | | | | 29973.0 | ... | | | | | | | | 0.0 | 0.0 | |
| 75% | 2.0 | -122.311937 | 47.663664 | 162272.0 | 203319.0 | 203459.0 | | | | 33973.0 | ... | | | | | | | | 0.0 | 0.0 | |


In [71]:
df['ROADCOND'].isna().sum()

5012
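The single-column check above extends to every column at once. The sketch below summarizes missing values per column as a count and a percentage. The helper name `missing_summary` and the toy rows are assumptions for illustration; only the column names are taken from the dataset.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, largest first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns that actually have gaps.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy rows for illustration only; these are not values from the collision data.
toy = pd.DataFrame({
    "ROADCOND": ["Wet", None, "Dry", None],
    "WEATHER": ["Raining", "Clear", None, "Clear"],
    "SEVERITYCODE": [2, 1, 1, 1],
})
report = missing_summary(toy)
print(report)
```

A report like this helps decide, column by column, whether to fill the gaps, drop the rows, or drop the attribute entirely.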

## Data Preparation

Data preparation includes all the activities required to construct the final dataset that will be fed into the modeling tools. It can be performed multiple times and includes balancing the labeled data, transforming features, filling in missing data, and cleaning the dataset.
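One common way to balance labeled data is to downsample every class to the size of the smallest one. The sketch below shows that idea on toy rows. It is one option among several (oversampling and class weights are alternatives), and the helper name `downsample_to_smallest_class` is hypothetical rather than part of this project.

```python
import pandas as pd


def downsample_to_smallest_class(frame, label="SEVERITYCODE", seed=0):
    """Randomly sample every class down to the size of the smallest class."""
    smallest = frame[label].value_counts().min()
    parts = [
        group.sample(n=smallest, random_state=seed)
        for _, group in frame.groupby(label)
    ]
    # Shuffle so the classes do not end up in contiguous blocks.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy rows for illustration only: six rows of class 1, two rows of class 2.
toy = pd.DataFrame({
    "SEVERITYCODE": [1] * 6 + [2] * 2,
    "ROADCOND": ["Dry", "Wet"] * 4,
})
balanced = downsample_to_smallest_class(toy)
print(balanced["SEVERITYCODE"].value_counts())
```

Downsampling discards rows, so it is usually applied to the training split only, leaving the evaluation data with its original class distribution.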

In [72]:
df['INCDATE'].head()

0    2013/03/27 00:00:00+00
1    2006/12/20 00:00:00+00
2    2004/11/18 00:00:00+00
3    2013/03/29 00:00:00+00
4    2004/01/28 00:00:00+00
Name: INCDATE, dtype: object

In [74]:
incident_ts = pd.to_datetime(df['INCDATE'])
df['INCTIME'] = incident_ts.dt.time  # every row has 00:00:00 as its time component
df['INCDATE'] = incident_ts.dt.date
df['INCTIME'].head()

0    00:00:00
1    00:00:00
2    00:00:00
3    00:00:00
4    00:00:00
Name: INCTIME, dtype: object

## Modeling 

In this phase, various algorithms and methods, including supervised machine learning techniques, can be selected and applied to build the model. You can choose SVM, XGBoost, a decision tree, or any other technique, and you can use a single model or several for the same data mining problem. Stepping back to the data preparation phase is often required at this point.
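As an illustration of the decision tree option, the sketch below fits scikit-learn's `DecisionTreeClassifier` on one-hot encoded categorical attributes. It assumes scikit-learn is installed. The toy rows, the choice of `WEATHER` and `ROADCOND` as features, and the `max_depth` setting are all assumptions for illustration, not the model used in this project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy rows for illustration only; these are not values from the collision data.
toy = pd.DataFrame({
    "WEATHER": ["Clear", "Raining"] * 4,
    "ROADCOND": ["Dry", "Wet"] * 4,
    "SEVERITYCODE": [1, 2] * 4,
})

# Fill gaps with a placeholder category, then one-hot encode the categoricals.
features = pd.get_dummies(toy[["WEATHER", "ROADCOND"]].fillna("Unknown"))
labels = toy["SEVERITYCODE"]

# Stratify so both severity classes appear in the training and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=0
)

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions)
```

The toy labels are perfectly separable by weather, so the tree fits them exactly. Real collision data will not be this clean, which is why the evaluation phase matters.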

## Evaluation 

Before proceeding to the deployment stage, the model needs to be evaluated thoroughly to ensure that the business or the applications' objectives are achieved. Certain metrics can be used for the model evaluation such as accuracy, recall, F1-score, precision, and others.
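The metrics named above can be computed directly from true and predicted labels. The sketch below treats severity code 2 as the positive class, which is an assumption made for illustration. The helper name `binary_metrics` and the label lists are made up and are not results from this project.

```python
def binary_metrics(y_true, y_pred, positive=2):
    """Return accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration only.
y_true = [1, 1, 1, 2, 2, 2, 1, 2]
y_pred = [1, 1, 2, 2, 2, 1, 1, 2]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On imbalanced labels, accuracy alone can look good for a model that mostly predicts the majority class, so precision, recall, and F1 for the minority class are worth reporting alongside it.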

## Deployment

Deployment requirements vary from project to project. Deployment can be as simple as creating a report or developing an interactive visualization, or it can mean making the machine learning model available in a production environment, where customers or end users can reach it through an API, a website, or similar.