# Car Accident Severity Report

## Data Science Capstone Project for Coursera

### 1. Introduction and Business Understanding

With traffic volumes increasing each year, the number of accidents caused by external factors such as weather and road conditions is also rising. A system that could predict the probability of an accident based on these factors would be a valuable tool for drivers. Such a system could use real-time incoming data to alert drivers to the likely severity of an accident, which could reduce the probability of accidents occurring. It would also contribute to more efficient traffic flow, reducing the time spent driving and, in turn, benefiting the environment through reduced carbon emissions.

### 2. Data Understanding
The dataset used is from the SDOT Traffic Management Division. It contains a severity field (the "SEVERITYCODE" column), which is our target variable and describes how severe an accident was under the recorded conditions. The dataset has two severity levels, or codes:

- Code 1: Property damage
- Code 2: Injury

We use the weather, light, and road conditions to predict the severity of an accident. These are our independent variables (features).
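
As a minimal sketch of how the target could be inspected, the snippet below maps the two severity codes described above to readable labels and counts each class. The `sample` frame, the `SEVERITY_LABELS` mapping, and the `severity_label` column are illustrative assumptions and are not part of the SDOT data.

```python
import pandas as pd

# Hypothetical mapping from severity code to its description.
SEVERITY_LABELS = {1: "Property damage", 2: "Injury"}

# Toy stand-in for the real dataset; the values are made up for illustration.
sample = pd.DataFrame({"SEVERITYCODE": [1, 1, 2, 1, 2, 1]})

# Attach a readable label and count how many records fall in each class.
sample["severity_label"] = sample["SEVERITYCODE"].map(SEVERITY_LABELS)
class_counts = sample["severity_label"].value_counts()
print(class_counts)
```

Checking the class counts early shows whether one severity level dominates, which matters when a model is evaluated later.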

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import matplotlib.ticker as ticker
from sklearn import preprocessing
from matplotlib.ticker import NullFormatter
%matplotlib inline
import types  # needed for types.MethodType further down
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
client_451f054f57bf4e8897661efbcb545ebd = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='G2drd56SSK20JifMN_c3gjNR3zI-xXDrujnVGlejsoko',
    ibm_auth_endpoint="https://iam.cloud.ibm.com/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3.eu-geo.objectstorage.service.networklayer.com')

body = client_451f054f57bf4e8897661efbcb545ebd.get_object(Bucket='pythonvisualizationassignment-donotdelete-pr-rssnshdzoqwhbv',Key='Data-Collisions.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df1 = pd.read_csv(body)
df1.head()

| | SEVERITYCODE | X | Y | OBJECTID | INCKEY | COLDETKEY | REPORTNO | STATUS | ADDRTYPE | INTKEY | ... | ROADCOND | LIGHTCOND | PEDROWNOTGRNT | SDOTCOLNUM | SPEEDING | ST_COLCODE | ST_COLDESC | SEGLANEKEY | CROSSWALKKEY | HITPARKEDCAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | -122.323148 | 47.70314 | 1 | 1307 | 1307 | 3502005 | Matched | Intersection | 37475.0 | ... | Wet | Daylight | | | | 10 | Entering at angle | 0 | 0 | N |
| 1 | 1 | -122.347294 | 47.647172 | 2 | 52200 | 52200 | 2607959 | Matched | Block | | ... | Wet | Dark - Street Lights On | | 6354039.0 | | 11 | From same direction - both going straight - bo... | 0 | 0 | N |
| 2 | 1 | -122.33454 | 47.607871 | 3 | 26700 | 26700 | 1482393 | Matched | Block | | ... | Dry | Daylight | | 4323031.0 | | 32 | One parked--one moving | 0 | 0 | N |
| 3 | 1 | -122.334803 | 47.604803 | 4 | 1144 | 1144 | 3503937 | Matched | Block | | ... | Dry | Daylight | | | | 23 | From same direction - all others | 0 | 0 | N |
| 4 | 2 | -122.306426 | 47.545739 | 5 | 17700 | 17700 | 1807429 | Matched | Intersection | 34387.0 | ... | Wet | Daylight | | 4028032.0 | | 10 | Entering at angle | 0 | 0 | N |


### 3. Data Preparation
In this section, the original dataset has already been loaded into a dataframe. We keep only the target column (SEVERITYCODE) and the feature columns: the road, light, and weather conditions, plus the person and vehicle counts. We drop all rows that contain "Unknown" or "Other" as a value, since these unspecified conditions do not help with accurate prediction. We then convert the categorical values into numerical values using one-hot encoding and drop several rare categories, each of which accounts for less than 1% of the records. This also shrinks the dataset somewhat, reducing the computational effort.
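
The preparation steps described above can be sketched on a small toy frame. The rows below are made up for illustration, and the use of `pd.get_dummies(..., columns=...)` with prefixed column names is an assumption made for readability; the notebook cells that follow encode each column separately without prefixes.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the collision data; the rows are made up for illustration.
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 1, 2],
    "ROADCOND": ["Dry", "Wet", "Unknown", "Dry", "Wet"],
    "WEATHER": ["Clear", "Raining", "Clear", "Other", "Overcast"],
    "OBJECTID": [1, 2, 3, 4, 5],  # a column that is not needed for modelling
})

# 1. Keep only the target and the feature columns.
prepared = toy[["SEVERITYCODE", "ROADCOND", "WEATHER"]].copy()

# 2. Treat "Unknown" and "Other" as missing and drop those rows.
prepared = prepared.replace(["Unknown", "Other"], np.nan).dropna()

# 3. One-hot encode the remaining categorical columns.
encoded = pd.get_dummies(prepared, columns=["ROADCOND", "WEATHER"])
print(encoded.columns.tolist())
```

Prefixed names such as `ROADCOND_Wet` make it harder for categories from different columns to collide, at the cost of longer column names.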

In [195]:
# Keep only the target and the candidate feature columns
df = df1[['SEVERITYCODE','ROADCOND','LIGHTCOND','WEATHER','PERSONCOUNT','VEHCOUNT']].copy()
df.shape

(194673, 6)

In [196]:
df.head()

| | SEVERITYCODE | ROADCOND | LIGHTCOND | WEATHER | PERSONCOUNT | VEHCOUNT |
|---|---|---|---|---|---|---|
| 0 | 2 | Wet | Daylight | Overcast | 2 | 2 |
| 1 | 1 | Wet | Dark - Street Lights On | Raining | 2 | 2 |
| 2 | 1 | Dry | Daylight | Overcast | 4 | 3 |
| 3 | 1 | Dry | Daylight | Clear | 3 | 3 |
| 4 | 2 | Wet | Daylight | Raining | 2 | 2 |


In [197]:
df.groupby(['SEVERITYCODE'])['LIGHTCOND'].value_counts(normalize=True)

SEVERITYCODE  LIGHTCOND               
1             Daylight                    0.586028
              Dark - Street Lights On     0.257030
              Unknown                     0.097187
              Dusk                        0.029893
              Dawn                        0.012673
              Dark - No Street Lights     0.009086
              Dark - Street Lights Off    0.006669
              Other                       0.001382
              Dark - Unknown Lighting     0.000053
2             Daylight                    0.675050
              Dark - Street Lights On     0.253512
              Dusk                        0.034047
              Dawn                        0.014431
              Unknown                     0.010596
              Dark - No Street Lights     0.005850
              Dark - Street Lights Off    0.005534
              Other                       0.000911
              Dark - Unknown Lighting     0.000070
Name: LIGHTCOND, dtype: float64

In [198]:
df.groupby(['SEVERITYCODE'])['ROADCOND'].value_counts(normalize=True)

SEVERITYCODE  ROADCOND      
1             Dry               0.637170
              Wet               0.239329
              Unknown           0.108116
              Ice               0.007062
              Snow/Slush        0.006315
              Other             0.000672
              Standing Water    0.000641
              Sand/Mud/Dirt     0.000392
              Oil               0.000302
2             Dry               0.701302
              Wet               0.275784
              Unknown           0.013111
              Ice               0.004779
              Snow/Slush        0.002923
              Other             0.000753
              Standing Water    0.000525
              Oil               0.000420
              Sand/Mud/Dirt     0.000403
Name: ROADCOND, dtype: float64

In [199]:
df.groupby(['SEVERITYCODE'])['WEATHER'].value_counts(normalize=True)

SEVERITYCODE  WEATHER                 
1             Clear                       0.568316
              Raining                     0.165819
              Overcast                    0.143175
              Unknown                     0.107746
              Snowing                     0.005555
              Other                       0.005404
              Fog/Smog/Smoke              0.002883
              Sleet/Hail/Freezing Rain    0.000642
              Blowing Sand/Dirt           0.000309
              Severe Crosswind            0.000136
              Partly Cloudy               0.000015
2             Clear                       0.627627
              Raining                     0.195713
              Overcast                    0.153142
              Unknown                     0.014290
              Fog/Smog/Smoke              0.003275
              Snowing                     0.002995
              Other                       0.002031
              Sleet/Hail/Freezing Rain    0

In [200]:
# Replace "Unknown" and "Other" with NaN, then drop every row that has a missing value
df = df.replace(['Unknown', 'Other'], np.nan)
df = df.dropna()
df.shape

(169957, 6)
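
To put the two shapes in perspective, the short calculation below works out how many rows the cleaning step removed. It uses only the row counts reported above. Note that `dropna()` also removes rows that were already missing a value, so the figure covers more than just the "Unknown" and "Other" entries.

```python
rows_before = 194673  # rows before replacing "Unknown"/"Other" and calling dropna()
rows_after = 169957   # rows after dropna()

rows_dropped = rows_before - rows_after
share_dropped = rows_dropped / rows_before
print(f"Dropped {rows_dropped} rows ({share_dropped:.1%} of the data)")
```

Roughly one row in eight is removed, which is worth keeping in mind when interpreting the class proportions shown earlier.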

In [213]:
# One-hot encoding the categorical values and preparing our dataframe for the model
Feature = pd.concat([df, pd.get_dummies(df['LIGHTCOND']), pd.get_dummies(df['ROADCOND']), pd.get_dummies(df['WEATHER'])], axis=1)
Feature.drop(['Dark - Unknown Lighting','Dark - Street Lights Off','Dark - No Street Lights','LIGHTCOND','Standing Water','Oil','Sand/Mud/Dirt','ROADCOND',
        'Partly Cloudy','Severe Crosswind','Blowing Sand/Dirt','Sleet/Hail/Freezing Rain','WEATHER'], axis=1, inplace=True)
Feature.head()

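| | SEVERITYCODE | PERSONCOUNT | VEHCOUNT | Dark - Street Lights On | Dawn | Daylight | Dusk | Dry | Ice | Snow/Slush | Wet | Clear | Fog/Smog/Smoke | Overcast | Raining | Snowing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1 | 4 | 3 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 1 | 3 | 3 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 2 | 2 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |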
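
The cell above drops the rare categories through a hard-coded list of column names. As an alternative sketch, the rare columns could be found programmatically from their share of rows. The toy `light` series and the `MIN_SHARE` cut-off are assumptions for illustration; a strict 1% threshold would not reproduce the notebook's selection exactly, since the notebook keeps some categories that fall below 1% (for example Ice and Snowing).

```python
import pandas as pd

# Toy categorical column; the values are made up for illustration.
light = pd.Series(
    ["Daylight"] * 150 + ["Dusk"] * 49 + ["Dark - Unknown Lighting"] * 1,
    name="LIGHTCOND",
)

# Hypothetical cut-off: categories below this share of rows are dropped.
MIN_SHARE = 0.01

dummies = pd.get_dummies(light)

# The mean of an indicator column equals the share of rows in that category.
shares = dummies.mean()
rare_columns = shares[shares < MIN_SHARE].index.tolist()

reduced = dummies.drop(columns=rare_columns)
print(rare_columns)
print(reduced.columns.tolist())
```

A threshold keeps the rule explicit and repeatable if the data is refreshed, whereas a hand-written list lets the analyst keep rare but meaningful categories.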


In [214]:
Feature = Feature.rename(columns={'SEVERITYCODE' : 'SC', 'Dark - Street Lights On' : 'Dark', 'PERSONCOUNT' : 'Persons', 'VEHCOUNT' : 'Vehicles'})
Feature.head()

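| | SC | Persons | Vehicles | Dark | Dawn | Daylight | Dusk | Dry | Ice | Snow/Slush | Wet | Clear | Fog/Smog/Smoke | Overcast | Raining | Snowing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1 | 4 | 3 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 1 | 3 | 3 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 2 | 2 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |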


In [215]:
Feature.shape

(169957, 16)

In [216]:
# Preparing the X and y datasets for the model
y = Feature['SC'].values
y[0:8]
Feature = Feature.drop(['SC'], axis=1)
X = Feature

In [217]:
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]


array([[-0.35811011,  0.05158027, -0.61597105, -0.12000911,  0.71354465,
        -0.18540296, -1.58324162, -0.07996991, -0.07018105,  1.63366941,
        -1.33422795, -0.0571348 ,  2.30493093, -0.48761694, -0.06984158],
       [-0.35811011,  0.05158027,  1.62345292, -0.12000911, -1.401454  ,
        -0.18540296, -1.58324162, -0.07996991, -0.07018105,  1.63366941,
        -1.33422795, -0.0571348 , -0.43385248,  2.0507901 , -0.06984158],
       [ 1.06002029,  1.77895517, -0.61597105, -0.12000911,  0.71354465,
        -0.18540296,  0.63161554, -0.07996991, -0.07018105, -0.61211895,
        -1.33422795, -0.0571348 ,  2.30493093, -0.48761694, -0.06984158],
       [ 0.35095509,  1.77895517, -0.61597105, -0.12000911,  0.71354465,
        -0.18540296,  0.63161554, -0.07996991, -0.07018105, -0.61211895,
         0.74949711, -0.0571348 , -0.43385248, -0.48761694, -0.06984158],
       [-0.35811011,  0.05158027, -0.61597105, -0.12000911,  0.71354465,
        -0.18540296, -1.58324162, -0.07996991, 
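
To illustrate what the scaling step does, the sketch below compares `StandardScaler` with the same calculation done by hand: each column has its mean subtracted and is divided by its population standard deviation. The `X_toy` matrix is made up for illustration and is not taken from the collision data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix; the values are made up for illustration.
X_toy = np.array([[2.0, 0.0],
                  [4.0, 1.0],
                  [6.0, 0.0],
                  [8.0, 1.0]])

# StandardScaler centres each column and scales it to unit variance.
X_scaled = StandardScaler().fit_transform(X_toy)

# The same result computed by hand (NumPy's std uses ddof=0 by default).
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1, which is why the one-hot columns in the output above appear as non-integer values rather than 0 and 1.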

In [218]:
y[0:5]

array([2, 1, 1, 1, 2])