# Capstone Project - Accident Severity predictor
### Applied Data Science Capstone by IBM/Coursera

<img align=center width = 600 src="https://static.seattletimes.com/wp-content/uploads/2017/03/03172017_traffic_185443-780x559.jpg" />
...

# <h1 align=center><font size = 5>Coursera Capstone Project - Accident data analysis to predict accident severity</font></h1>

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

In the United States and throughout much of the world, car accidents are a leading cause of serious injury and death. In the U.S. alone, at least 38,800 people were killed in motor vehicle collisions in 2019. The risk of motor vehicle crashes is higher among 16-19-year-olds than among any other age group. In 2015, teens ages 16-19 in the United States accounted for 2,333 fatalities and 233,845 injuries due to car accidents. Accident-related costs put a major burden on government budgets. Road conditions and weather play a major part in accidents, and numerous studies have sought to understand the causes of accidents and reduce their impact.

In this project we try to predict the severity of a traffic accident. The analysis is aimed at the following stakeholders.

A. Mobile map applications
<br>
Users can be alerted to the likely accident severity along their travel route, based on various indicators.

B. Vehicle insurance providers
<br>
Insurance providers may use this analysis to develop quotes based on accident severity statistics and various indicators.

C. Department of Motor Vehicles and other government bodies
<br>
This analysis can be used as input for posting appropriate alert signage on roads. It can also be used to improve driving conditions and to post appropriate speed limits and warnings.


## Data <a name="data"></a>


Based on our problem definition, the factors expected to influence accident severity are:

* driving under the influence of alcohol or other substances
* speed of the vehicle
* weather / light conditions
* road conditions

For this project we use the sample collision data provided with the course:
<br>
https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv

Metadata is described here:
<br>
https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Metadata.pdf


In [1]:
import pandas as pd
import numpy as np
from sklearn.utils import resample
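# Note: jaccard_similarity_score was deprecated in scikit-learn 0.21 and later removed;
# for binary labels it returned the same value as accuracy_score.
from sklearn.metrics import jaccard_similarity_score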
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
from sklearn import preprocessing

In [2]:

#! pip install seaborn
!conda install -c anaconda seaborn -y

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [3]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

In [4]:
print("Hello Capstone Project Course!")

Hello Capstone Project Course!


In [5]:
collisions_df = pd.read_csv('Data-Collisions.csv')



In [6]:
collisions_df.head()

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


In [7]:
collisions_df.describe(include = "all")

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
count,194673.0,189339.0,189339.0,194673.0,194673.0,194673.0,194673.0,194673,192747,65070.0,...,189661,189503,4667,114936.0,9333,194655.0,189769,194673.0,194673.0,194673
unique,,,,,,,194670.0,2,3,,...,9,9,1,,1,115.0,62,,,2
top,,,,,,,1782439.0,Matched,Block,,...,Dry,Daylight,Y,,Y,32.0,One parked--one moving,,,N
freq,,,,,,,2.0,189786,126926,,...,124510,116137,4667,,9333,27612.0,44421,,,187457
mean,1.298901,-122.330518,47.619543,108479.36493,141091.45635,141298.811381,,,,37558.450576,...,,,,7972521.0,,,,269.401114,9782.452,
std,0.457778,0.029976,0.056157,62649.722558,86634.402737,86986.54211,,,,51745.990273,...,,,,2553533.0,,,,3315.776055,72269.26,
min,1.0,-122.419091,47.495573,1.0,1001.0,1001.0,,,,23807.0,...,,,,1007024.0,,,,0.0,0.0,
25%,1.0,-122.348673,47.575956,54267.0,70383.0,70383.0,,,,28667.0,...,,,,6040015.0,,,,0.0,0.0,
50%,1.0,-122.330224,47.615369,106912.0,123363.0,123363.0,,,,29973.0,...,,,,8023022.0,,,,0.0,0.0,
75%,2.0,-122.311937,47.663664,162272.0,203319.0,203459.0,,,,33973.0,...,,,,10155010.0,,,,0.0,0.0,


In [8]:
print("Earliest incident date: ", collisions_df["INCDATE"].min())
print("Latest incident date: ", collisions_df["INCDATE"].max())


Earliest incident date:  2004/01/01 00:00:00+00
Latest incident date:  2020/05/20 00:00:00+00


In [9]:
collisions_df.isnull().sum()

SEVERITYCODE           0
X                   5334
Y                   5334
OBJECTID               0
INCKEY                 0
COLDETKEY              0
REPORTNO               0
STATUS                 0
ADDRTYPE            1926
INTKEY            129603
LOCATION            2677
EXCEPTRSNCODE     109862
EXCEPTRSNDESC     189035
SEVERITYCODE.1         0
SEVERITYDESC           0
COLLISIONTYPE       4904
PERSONCOUNT            0
PEDCOUNT               0
PEDCYLCOUNT            0
VEHCOUNT               0
INCDATE                0
INCDTTM                0
JUNCTIONTYPE        6329
SDOT_COLCODE           0
SDOT_COLDESC           0
INATTENTIONIND    164868
UNDERINFL           4884
WEATHER             5081
ROADCOND            5012
LIGHTCOND           5170
PEDROWNOTGRNT     190006
SDOTCOLNUM         79737
SPEEDING          185340
ST_COLCODE            18
ST_COLDESC          4904
SEGLANEKEY             0
CROSSWALKKEY           0
HITPARKEDCAR           0
dtype: int64
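
The raw null counts above are easier to compare as a share of all rows. The following is a minimal sketch on a small made-up frame; the `toy` data and the `missing_share` name are illustrative and not part of the notebook. On the real data the same expression could be applied to `collisions_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for collisions_df
toy = pd.DataFrame({
    "SPEEDING": ["Y", np.nan, np.nan, np.nan],
    "WEATHER": ["Clear", "Raining", np.nan, "Clear"],
    "SEVERITYCODE": [1, 2, 1, 1],
})

# Fraction of missing values per column, largest first
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Columns whose missing share is close to 1 carry little usable signal, which is one way to decide which columns to drop.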

In [10]:
collisions_df.isnull().any()

SEVERITYCODE      False
X                  True
Y                  True
OBJECTID          False
INCKEY            False
COLDETKEY         False
REPORTNO          False
STATUS            False
ADDRTYPE           True
INTKEY             True
LOCATION           True
EXCEPTRSNCODE      True
EXCEPTRSNDESC      True
SEVERITYCODE.1    False
SEVERITYDESC      False
COLLISIONTYPE      True
PERSONCOUNT       False
PEDCOUNT          False
PEDCYLCOUNT       False
VEHCOUNT          False
INCDATE           False
INCDTTM           False
JUNCTIONTYPE       True
SDOT_COLCODE      False
SDOT_COLDESC      False
INATTENTIONIND     True
UNDERINFL          True
WEATHER            True
ROADCOND           True
LIGHTCOND          True
PEDROWNOTGRNT      True
SDOTCOLNUM         True
SPEEDING           True
ST_COLCODE         True
ST_COLDESC         True
SEGLANEKEY        False
CROSSWALKKEY      False
HITPARKEDCAR      False
dtype: bool

In [11]:
collisions_df['INC_MONTH'] = pd.to_datetime(collisions_df['INCDATE']).dt.month

In [12]:
Features_df = collisions_df.drop(["X","Y","OBJECTID", "INCKEY", "COLDETKEY", "REPORTNO", "STATUS", "INTKEY", "LOCATION", "EXCEPTRSNCODE",
                    "EXCEPTRSNDESC","SEVERITYCODE.1","SEVERITYDESC","COLLISIONTYPE","PERSONCOUNT","PEDCOUNT","PEDCYLCOUNT","VEHCOUNT","INCDATE", 
                    "INCDTTM", "SDOT_COLCODE", "SDOT_COLDESC","PEDROWNOTGRNT","SDOTCOLNUM",
                    "ST_COLCODE","ST_COLDESC","SEGLANEKEY","CROSSWALKKEY","HITPARKEDCAR", "JUNCTIONTYPE", "INATTENTIONIND", "SPEEDING"],axis=1)


In [13]:
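# Preview of the distinct feature combinations; the result is not assigned,
# so Features_df itself keeps all of its rows
Features_df.drop_duplicates()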

Unnamed: 0,SEVERITYCODE,ADDRTYPE,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,INC_MONTH
0,2,Intersection,N,Overcast,Wet,Daylight,3
1,1,Block,0,Raining,Wet,Dark - Street Lights On,12
2,1,Block,0,Overcast,Dry,Daylight,11
3,1,Block,N,Clear,Dry,Daylight,3
4,2,Intersection,0,Raining,Wet,Daylight,1
...,...,...,...,...,...,...,...
194462,1,Intersection,N,Fog/Smog/Smoke,Wet,Dawn,11
194469,1,Intersection,N,Clear,Unknown,Unknown,11
194553,1,Block,Y,Clear,Dry,Dark - No Street Lights,1
194623,1,Intersection,N,Clear,Unknown,Daylight,12


In [14]:
Features_df.describe(include = "all")

Unnamed: 0,SEVERITYCODE,ADDRTYPE,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,INC_MONTH
count,194673.0,192747,189789,189592,189661,189503,194673.0
unique,,3,4,11,9,9,
top,,Block,N,Clear,Dry,Daylight,
freq,,126926,100274,111135,124510,116137,
mean,1.298901,,,,,,6.549825
std,0.457778,,,,,,3.430056
min,1.0,,,,,,1.0
25%,1.0,,,,,,4.0
50%,1.0,,,,,,7.0
75%,2.0,,,,,,10.0


In [15]:
print(Features_df.groupby(['INC_MONTH','SEVERITYCODE'])['SEVERITYCODE'].count())

INC_MONTH  SEVERITYCODE
1          1               11704
           2                4703
2          1               10293
           2                4097
3          1               11415
           2                4735
4          1               11216
           2                4762
5          1               11567
           2                5196
6          1               11638
           2                4928
7          1               11227
           2                5137
8          1               11214
           2                5082
9          1               11053
           2                4811
10         1               12273
           2                5495
11         1               11683
           2                4899
12         1               11202
           2                4343
Name: SEVERITYCODE, dtype: int64
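
Raw counts per month are hard to compare by eye, so it may help to turn them into the share of severity-2 accidents per month. The sketch below copies the counts from the output above into a plain dictionary; the names `counts` and `share_sev2` are illustrative only.

```python
# (severity 1 count, severity 2 count) per month, copied from the output above
counts = {
    1: (11704, 4703), 2: (10293, 4097), 3: (11415, 4735),
    4: (11216, 4762), 5: (11567, 5196), 6: (11638, 4928),
    7: (11227, 5137), 8: (11214, 5082), 9: (11053, 4811),
    10: (12273, 5495), 11: (11683, 4899), 12: (11202, 4343),
}

# Share of severity-2 accidents among all accidents in each month
share_sev2 = {m: s2 / (s1 + s2) for m, (s1, s2) in counts.items()}

for month, share in share_sev2.items():
    print(f"month {month:2d}: {share:.3f}")
```

The share stays within a few percentage points across all twelve months, with July highest and December lowest.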


In [16]:
print(Features_df["LIGHTCOND"].value_counts())

print(Features_df.groupby(['LIGHTCOND','SEVERITYCODE'])['SEVERITYCODE'].count())

Daylight                    116137
Dark - Street Lights On      48507
Unknown                      13473
Dusk                          5902
Dawn                          2502
Dark - No Street Lights       1537
Dark - Street Lights Off      1199
Other                          235
Dark - Unknown Lighting         11
Name: LIGHTCOND, dtype: int64
LIGHTCOND                 SEVERITYCODE
Dark - No Street Lights   1                1203
                          2                 334
Dark - Street Lights Off  1                 883
                          2                 316
Dark - Street Lights On   1               34032
                          2               14475
Dark - Unknown Lighting   1                   7
                          2                   4
Dawn                      1                1678
                          2                 824
Daylight                  1               77593
                          2               38544
Dusk                      1             

In [17]:
collisions_df.dtypes

SEVERITYCODE        int64
X                 float64
Y                 float64
OBJECTID            int64
INCKEY              int64
COLDETKEY           int64
REPORTNO           object
STATUS             object
ADDRTYPE           object
INTKEY            float64
LOCATION           object
EXCEPTRSNCODE      object
EXCEPTRSNDESC      object
SEVERITYCODE.1      int64
SEVERITYDESC       object
COLLISIONTYPE      object
PERSONCOUNT         int64
PEDCOUNT            int64
PEDCYLCOUNT         int64
VEHCOUNT            int64
INCDATE            object
INCDTTM            object
JUNCTIONTYPE       object
SDOT_COLCODE        int64
SDOT_COLDESC       object
INATTENTIONIND     object
UNDERINFL          object
WEATHER            object
ROADCOND           object
LIGHTCOND          object
PEDROWNOTGRNT      object
SDOTCOLNUM        float64
SPEEDING           object
ST_COLCODE         object
ST_COLDESC         object
SEGLANEKEY          int64
CROSSWALKKEY        int64
HITPARKEDCAR       object
INC_MONTH   

In [18]:
print(Features_df["ROADCOND"].value_counts())
print("")
print(Features_df.groupby(['ROADCOND','SEVERITYCODE'])['SEVERITYCODE'].count())

Dry               124510
Wet                47474
Unknown            15078
Ice                 1209
Snow/Slush          1004
Other                132
Standing Water       115
Sand/Mud/Dirt         75
Oil                   64
Name: ROADCOND, dtype: int64

ROADCOND        SEVERITYCODE
Dry             1               84446
                2               40064
Ice             1                 936
                2                 273
Oil             1                  40
                2                  24
Other           1                  89
                2                  43
Sand/Mud/Dirt   1                  52
                2                  23
Snow/Slush      1                 837
                2                 167
Standing Water  1                  85
                2                  30
Unknown         1               14329
                2                 749
Wet             1               31719
                2               15755
Name: SEVERITYCODE, dtype: int64

In [19]:
Features_df["UNDERINFL"].value_counts()

N    100274
0     80394
Y      5126
1      3995
Name: UNDERINFL, dtype: int64

In [20]:
print(Features_df["ADDRTYPE"].value_counts())
print("")
print(Features_df.groupby(['ADDRTYPE','SEVERITYCODE'])['SEVERITYCODE'].count())

Block           126926
Intersection     65070
Alley              751
Name: ADDRTYPE, dtype: int64

ADDRTYPE      SEVERITYCODE
Alley         1                 669
              2                  82
Block         1               96830
              2               30096
Intersection  1               37251
              2               27819
Name: SEVERITYCODE, dtype: int64


In [21]:
Features_df["SEVERITYCODE"].value_counts()

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64

In [22]:
print(Features_df["WEATHER"].value_counts())
print("")
print(Features_df.groupby(['WEATHER','SEVERITYCODE'])['SEVERITYCODE'].count())

Clear                       111135
Raining                      33145
Overcast                     27714
Unknown                      15091
Snowing                        907
Other                          832
Fog/Smog/Smoke                 569
Sleet/Hail/Freezing Rain       113
Blowing Sand/Dirt               56
Severe Crosswind                25
Partly Cloudy                    5
Name: WEATHER, dtype: int64

WEATHER                   SEVERITYCODE
Blowing Sand/Dirt         1                  41
                          2                  15
Clear                     1               75295
                          2               35840
Fog/Smog/Smoke            1                 382
                          2                 187
Other                     1                 716
                          2                 116
Overcast                  1               18969
                          2                8745
Partly Cloudy             1                   2
                   

In [23]:
Features_df.shape

(194673, 7)

In [24]:
print(Features_df.dtypes)

SEVERITYCODE     int64
ADDRTYPE        object
UNDERINFL       object
WEATHER         object
ROADCOND        object
LIGHTCOND       object
INC_MONTH        int64
dtype: object


In [25]:
Features_df.describe(include="all")

Unnamed: 0,SEVERITYCODE,ADDRTYPE,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,INC_MONTH
count,194673.0,192747,189789,189592,189661,189503,194673.0
unique,,3,4,11,9,9,
top,,Block,N,Clear,Dry,Daylight,
freq,,126926,100274,111135,124510,116137,
mean,1.298901,,,,,,6.549825
std,0.457778,,,,,,3.430056
min,1.0,,,,,,1.0
25%,1.0,,,,,,4.0
50%,1.0,,,,,,7.0
75%,2.0,,,,,,10.0


In [26]:
Features_df['UNDERINFL'] = Features_df['UNDERINFL'].replace(['Y'],'1')
Features_df['UNDERINFL'] = Features_df['UNDERINFL'].replace(['N'],'0')
Features_df['UNDERINFL'] = Features_df['UNDERINFL'].replace(np.nan,'0')
Features_df["UNDERINFL"] = Features_df["UNDERINFL"].astype(int)
Features_df['UNDERINFL'].value_counts()

0    185552
1      9121
Name: UNDERINFL, dtype: int64
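
The same normalization can also be written as a single lookup, which keeps the four spellings (`N`, `0`, `Y`, `1`) visible in one place. This is a sketch on a made-up sample, assuming the raw column may mix strings and integers; `raw` and `flag_map` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mixing the encodings seen in UNDERINFL
raw = pd.Series(["N", "0", "Y", "1", np.nan, 0, 1])

# One lookup table covering both string and integer spellings
flag_map = {"N": 0, "0": 0, 0: 0, "Y": 1, "1": 1, 1: 1}

# Unmapped or missing values become NaN, then default to 0 as in the cell above
underinfl = raw.map(flag_map).fillna(0).astype(int)
print(underinfl.tolist())
```

As in the original cell, missing values are treated as "not under the influence", which is an assumption about the data rather than something the dataset states.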

In [27]:
# Drop data where the accidents included DUI 
Features_df.drop(Features_df[Features_df['UNDERINFL']==1].index, inplace = True) 
Features_df.shape

(185552, 7)

In [28]:
Features_df.drop(Features_df[Features_df['WEATHER']=="Unknown"].index, inplace = True) 
Features_df.shape

(170508, 7)

In [29]:
Features_df.drop(Features_df[Features_df['ROADCOND']=="Unknown"].index, inplace = True) 
Features_df.shape

(169056, 7)

In [30]:
Features_df.drop(Features_df[Features_df['LIGHTCOND']=="Unknown"].index, inplace = True) 
Features_df.shape

(166749, 7)

In [31]:
Features_df["SEVERITYCODE"].value_counts()

1    113292
2     53457
Name: SEVERITYCODE, dtype: int64

In [32]:
# Label Encoding
# Convert column to category
Features_df["WEATHER"] = Features_df["WEATHER"].astype('category')
Features_df["ROADCOND"] = Features_df["ROADCOND"].astype('category')
Features_df["LIGHTCOND"] = Features_df["LIGHTCOND"].astype('category')

# Assign variable to new column for analysis
Features_df["WEATHER_CAT"] = Features_df["WEATHER"].cat.codes
Features_df["ROADCOND_CAT"] = Features_df["ROADCOND"].cat.codes
Features_df["LIGHTCOND_CAT"] = Features_df["LIGHTCOND"].cat.codes

Features_df.head()

Unnamed: 0,SEVERITYCODE,ADDRTYPE,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,INC_MONTH,WEATHER_CAT,ROADCOND_CAT,LIGHTCOND_CAT
0,2,Intersection,0,Overcast,Wet,Daylight,3,4,7,5
1,1,Block,0,Raining,Wet,Dark - Street Lights On,12,6,7,2
2,1,Block,0,Overcast,Dry,Daylight,11,4,0,5
3,1,Block,0,Clear,Dry,Daylight,3,1,0,5
4,2,Intersection,0,Raining,Wet,Daylight,1,6,7,5
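
The integer codes produced by `.cat.codes` are only positions in the alphabetically sorted category list, so it can be useful to print the mapping back to labels. The sketch below uses a made-up series; `road` and `code_to_label` are illustrative names.

```python
import pandas as pd

# Hypothetical sample of road conditions, including one missing value
road = pd.Series(["Wet", "Dry", "Ice", "Dry", None]).astype("category")

# Categories are sorted alphabetically; the code is the position in that order
code_to_label = dict(enumerate(road.cat.categories))
print(code_to_label)            # {0: 'Dry', 1: 'Ice', 2: 'Wet'}
print(road.cat.codes.tolist())  # [2, 0, 1, 0, -1]
```

Note that missing values receive the code `-1`, and that the numeric order of the codes carries no meaning about severity or risk.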


In [33]:
# Separate majority and minority classes
Features_df_major = Features_df[Features_df.SEVERITYCODE==1]
Features_df_minor = Features_df[Features_df.SEVERITYCODE==2]

# Downsample majority class
Features_df_major_downsampled = resample(Features_df_major,
                                replace=False,
                                n_samples=53457,
                                random_state=123)

# Combine minority class with downsampled majority class
Features_df_balanced = pd.concat([Features_df_major_downsampled, Features_df_minor])

# Display new class counts
print(Features_df_balanced.SEVERITYCODE.value_counts())

X = Features_df_balanced[["WEATHER_CAT","ROADCOND_CAT","LIGHTCOND_CAT"]]
y = Features_df_balanced["SEVERITYCODE"].values

2    53457
1    53457
Name: SEVERITYCODE, dtype: int64
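
The cell above hardcodes `n_samples=53457`. A variant that derives the sample size from the minority class is sketched below on a made-up frame. It swaps in pandas `DataFrame.sample` for `sklearn.utils.resample`, and all names (`df`, `major`, `minor`, `balanced`) are hypothetical.

```python
import pandas as pd

# Hypothetical imbalanced frame: six rows of class 1, two rows of class 2
df = pd.DataFrame({"SEVERITYCODE": [1, 1, 1, 1, 1, 1, 2, 2],
                   "WEATHER_CAT": [0, 1, 0, 2, 1, 0, 1, 2]})

major = df[df["SEVERITYCODE"] == 1]
minor = df[df["SEVERITYCODE"] == 2]

# Draw as many majority rows as there are minority rows, without replacement
major_down = major.sample(n=len(minor), replace=False, random_state=123)

balanced = pd.concat([major_down, minor])
print(balanced["SEVERITYCODE"].value_counts().to_dict())
```

Deriving the size from `len(minor)` keeps the classes balanced even if earlier filtering steps change the row counts.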


In [34]:
print(Features_df_balanced.groupby(['INC_MONTH','SEVERITYCODE'])['SEVERITYCODE'].count())

INC_MONTH  SEVERITYCODE
1          1               4535
           2               4306
2          1               4041
           2               3707
3          1               4461
           2               4324
4          1               4432
           2               4398
5          1               4638
           2               4802
6          1               4570
           2               4548
7          1               4533
           2               4762
8          1               4412
           2               4740
9          1               4443
           2               4479
10         1               4829
           2               5042
11         1               4430
           2               4436
12         1               4133
           2               3913
Name: SEVERITYCODE, dtype: int64


In [35]:
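# Note: this reassigns X and y from the full (unbalanced) Features_df and adds INC_MONTH,
# replacing the balanced selection made in In [33]
X = Features_df[["WEATHER_CAT","ROADCOND_CAT","LIGHTCOND_CAT","INC_MONTH"]]
y = Features_df["SEVERITYCODE"].values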

In [36]:
Features_df_balanced.corr()

Unnamed: 0,SEVERITYCODE,UNDERINFL,INC_MONTH,WEATHER_CAT,ROADCOND_CAT,LIGHTCOND_CAT
SEVERITYCODE,1.0,,0.009229,0.015386,0.014756,0.065311
UNDERINFL,,,,,,
INC_MONTH,0.009229,,1.0,0.022108,0.024883,-0.019284
WEATHER_CAT,0.015386,,0.022108,1.0,0.805803,0.016593
ROADCOND_CAT,0.014756,,0.024883,0.805803,1.0,-0.053897
LIGHTCOND_CAT,0.065311,,-0.019284,0.016593,-0.053897,1.0


In [37]:
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]



array([[ 0.75573579,  1.65613165,  0.57789432, -1.03610938],
       [ 1.68819884,  1.65613165, -1.24272008,  1.60408484],
       [ 0.75573579, -0.59545162,  0.57789432,  1.31072993],
       [-0.6429588 , -0.59545162,  0.57789432, -1.03610938],
       [ 1.68819884,  1.65613165,  0.57789432, -1.6228192 ]])
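
For reference, the transformation applied by `StandardScaler` is a per-column z-score: subtract the column mean and divide by the column's population standard deviation. The sketch below reproduces that arithmetic with the standard library on a made-up column; `codes` and `scaled` are illustrative names.

```python
from statistics import fmean, pstdev

# Hypothetical category codes for one feature column
codes = [4, 6, 4, 1, 6]

mu = fmean(codes)
sigma = pstdev(codes)  # population standard deviation, as StandardScaler uses

# z-score: subtract the column mean, divide by the column standard deviation
scaled = [(c - mu) / sigma for c in codes]
print([round(z, 3) for z in scaled])
```

After scaling, each column has mean 0 and standard deviation 1, which matters for distance-based models such as KNN.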

In [38]:
# WEATHER as potential predictor variable of severity
# sns.regplot(x="WEATHER_CAT", y="SEVERITYCODE", data=Features_df_balanced)
# plt.ylim(0,)

In [39]:
# ROADCOND as potential predictor variable of severity
# sns.regplot(x="ROADCOND_CAT", y="SEVERITYCODE", data=X)


In [40]:
# LIGHTCOND as potential predictor variable of severity
# sns.regplot(x="LIGHTCOND_CAT", y="SEVERITYCODE", data=X)


In [41]:
from sklearn.model_selection import train_test_split

In [42]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

In [43]:
print("Size of X_train", X_train.shape)
print("Size of y_train", y_train.shape)
print("Size of X_test ", X_test.shape)
print("Size of y_test ", y_test.shape)


Size of X_train (116724, 4)
Size of y_train (116724,)
Size of X_test  (50025, 4)
Size of y_test  (50025,)


# K Nearest Neighbor (KNN)

In [44]:
# Modeling
from sklearn.neighbors import KNeighborsClassifier

In [45]:
# Best k: evaluate test accuracy for k = 1..7
Ks = 8
mean_acc = np.zeros(Ks - 1)
std_acc = np.zeros(Ks - 1)
for n in range(1, Ks):
    # Train model and predict
    kNN_model = KNeighborsClassifier(n_neighbors=n).fit(X_train, y_train)
    yhat = kNN_model.predict(X_test)

    # Accuracy and its standard error on the test set
    mean_acc[n-1] = np.mean(yhat == y_test)
    std_acc[n-1] = np.std(yhat == y_test) / np.sqrt(yhat.shape[0])
mean_acc

array([0.57783108, 0.62948526, 0.60315842, 0.63498251, 0.62574713,
       0.65965017, 0.63432284])
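
The best `k` can be read off the array programmatically rather than by eye. The sketch below copies the accuracies from the output above into a plain list; inside the notebook, `mean_acc.argmax() + 1` on the NumPy array would give the same result.

```python
# Test accuracies for k = 1..7, copied from the output above
mean_acc = [0.57783108, 0.62948526, 0.60315842, 0.63498251,
            0.62574713, 0.65965017, 0.63432284]

# Index of the highest accuracy, shifted by one because k starts at 1
best_k = max(range(len(mean_acc)), key=mean_acc.__getitem__) + 1
print(best_k, mean_acc[best_k - 1])  # 6 0.65965017
```

This matches the choice of `k = 6` in the next cell.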

In [46]:
k = 6
#Train Model and Predict  
kNN_model = KNeighborsClassifier(n_neighbors=k).fit(X_train,y_train)
kNN_model

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=6, p=2,
           weights='uniform')

In [47]:
kNN_yhat = kNN_model.predict(X_test)
kNN_yhat[0:5]

array([1, 1, 1, 1, 1])

# Decision Tree

In [48]:
from sklearn.tree import DecisionTreeClassifier

In [49]:
DT_model = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
DT_model.fit(X_train,y_train)
DT_model

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [50]:
DT_yhat = DT_model.predict(X_test)
DT_yhat

array([1, 1, 1, ..., 1, 1, 1])

# Logistic Regression

In [51]:
from sklearn.linear_model import LogisticRegression

In [52]:
LR_model = LogisticRegression(C=0.01).fit(X_train,y_train)
LR_model



LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [53]:
LR_yhat = LR_model.predict(X_test)
LR_yhat

array([1, 1, 1, ..., 1, 1, 1])

## Methodology <a name="methodology"></a>

In this project we focus on detecting factors that affect accidents, particularly those of high severity.

In the first step we collected the required **data**: severity code, incident date/time and various other attributes or factors related to the accident.

Of the various attributes available to us, these were identified for further analysis:

* Incident date/time
* Inattention indicator
* Driving-under-the-influence indicator
* Weather
* Road condition
* Light condition
* Speeding indicator

The second step in our analysis was to cleanse the data and filter out rows/columns with null values. About 95% of the accident records had no speeding value and about 85% had no inattention value, so both columns were eliminated. Next, we removed accidents where DUI was a factor, along with rows whose weather, road or light condition was recorded as "Unknown". The weather, road and light condition attributes were converted to pandas categories and encoded as integer codes to facilitate analysis.

In the third step we focus on the filtered attributes and split the data into training and test sets. We apply the following **machine learning** algorithms to the training set:

* **K Nearest Neighbors (KNN)**
* **Decision Tree**
* **Logistic Regression**

In the final step we evaluate the machine learning models on the test set, measuring each model with the **Jaccard index** and the **F1 score**.


# Model Evaluation using Test set

In [54]:
print("KNN Jaccard index: %.2f" % jaccard_similarity_score(y_test, kNN_yhat))
print("KNN F1-score: %.2f" % f1_score(y_test, kNN_yhat, average='weighted') )


KNN Jaccard index: 0.66
KNN F1-score: 0.56


In [55]:
print("DT Jaccard index: %.2f" % jaccard_similarity_score(y_test, DT_yhat))
print("DT F1-score: %.2f" % f1_score(y_test, DT_yhat, average='weighted') )


DT Jaccard index: 0.67
DT F1-score: 0.54




In [56]:
LR_yhat_prob = LR_model.predict_proba(X_test)
print("LR Jaccard index: %.2f" % jaccard_similarity_score(y_test, LR_yhat))
print("LR F1-score: %.2f" % f1_score(y_test, LR_yhat, average='weighted') )
print("LR LogLoss: %.2f" % log_loss(y_test, LR_yhat_prob))

LR Jaccard index: 0.67
LR F1-score: 0.54
LR LogLoss: 0.63




## Results and Discussion <a name="results"></a>

Going into the analysis, our assumption was that bad weather and poor road or light conditions may lead to more severe accidents. We also thought certain months may have more accidents than others: some months, such as the summer months, may see more leisure travel, and others may have more accidents due to inclement weather. A quick analysis of accident counts by severity showed that the counts were in the same ballpark across months.

The original dataset has about 195K observations. Data wrangling and cleaning left us with about 167K observations for analysis. We split the cleaned data 70/30 into training and test sets. Our results did not improve with an 80/20 split.

Our analysis evaluated several machine learning models, namely **K Nearest Neighbors (KNN)**, **Decision Tree** and **Logistic Regression**, to determine whether accident severity can be predicted from factors such as **month**, **road conditions**, **weather conditions** and **light conditions**.

The **Jaccard index score was 67%** for Decision Tree and Logistic Regression (66% for KNN), meaning that predicted and actual labels in the test set matched about two-thirds of the time. The weighted **F1 score** was **54%** (56% for KNN). Prediction quality at this level is not adequate and may not be acceptable to the stakeholders.
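
To see how a match rate of about 67% can sit alongside a weighted F1 score of about 54%, a toy case may help. The sketch below uses made-up labels and hand-written helpers (`accuracy` and `weighted_f1` are illustrative names, not scikit-learn calls); it is not a reproduction of the notebook's models.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with each class weighted by its true count."""
    total = 0.0
    for label in set(y_true):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        # F1 = 2*TP / (2*TP + FP + FN); taken as 0 when there are no true positives
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += f1 * y_true.count(label)
    return total / len(y_true)

# Made-up labels: 67 accidents of severity 1 and 33 of severity 2
y_true = [1] * 67 + [2] * 33
# A trivial "model" that always predicts the majority class
y_pred = [1] * 100

print(round(accuracy(y_true, y_pred), 2))     # 0.67
print(round(weighted_f1(y_true, y_pred), 2))  # 0.54
```

This does not establish that the notebook's models behave this way, but a classifier that never predicts severity 2 lands near the scores reported above, so inspecting a confusion matrix before interpreting these metrics may be worthwhile.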


## Conclusion <a name="conclusion"></a>

The purpose of this project was to identify whether certain factors lead to more severe accidents. We used machine learning models to look for patterns that predict accident severity.

**Our results were inconclusive and there were no clear indicators on whether certain factors lead to more severe accidents than others.** We will need to revisit the original dataset and research other attributes/parameters to see if these results could be improved upon. We might also need to pull other relevant datasets for more detailed analysis.