# Capstone Project - Car Accident Severity

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Data Cleaning](#datacleaning)
* [Methodology](#methodology)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction/Business Problem

This project aims to predict accident severity and to identify which factors, such as weather, road condition, light condition, speeding, or the type of collision, have the most impact on it.

### Background Discussion

Motor vehicle crashes affect society as a whole: accident victims and their families, their employers, insurance firms, emergency and health care personnel, and many others. It would be valuable to estimate the safety of a trip from real-time conditions, so that a driver can decide beforehand, based on reliable information, whether to take the risk.

## Data

The data was collected by the Seattle Department of Transportation (SDOT) Traffic Management Division and provided by Coursera via a link. The dataset is updated weekly and covers 2004 to the present. It contains information such as severity code, address type, location, collision type, weather, road condition, and speeding, among others.

There are 194,673 observations and 38 variables in this dataset. Since we would like to identify the factors that contribute to accidents and their level of severity, we will use SEVERITYCODE as our dependent variable Y and try different combinations of independent variables X. Because the dataset is large and contains many gaps, we first filter out rows with missing values and drop unrelated columns. We can then select the factors that may have more impact on accidents, such as address type, weather, road condition, and light condition.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv("Data-Collisions.csv", low_memory=False)
df.head()

In [None]:
df.shape

## Choosing relevant variables

From the complete dataset, we keep only the variables that might have an impact on model training.

In [None]:
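# .copy() makes col_data an independent frame, so later column assignments
# do not trigger pandas' SettingWithCopyWarning
col_data = df[['SEVERITYCODE', 'X', 'Y', 'ADDRTYPE', 'COLLISIONTYPE',
               'PERSONCOUNT', 'VEHCOUNT', 'JUNCTIONTYPE',  'WEATHER', 'ROADCOND', 'LIGHTCOND', 
               'SPEEDING', 'UNDERINFL', 'INATTENTIONIND']].copy()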
col_data.head()

Now let's explore the data and get the frequency of each category for a given feature.

In [None]:
# Print the columns in which a single category accounts for more than 80% of all rows
for col in col_data.columns:
    if (col_data[col].value_counts() / len(col_data) > 0.8).any():
        print(col)

In [None]:
def list_count(columns, df):
    for col in columns:
        print(col)
        print(df[col].value_counts())
        print()

data_columns = ['SEVERITYCODE','ADDRTYPE', 'COLLISIONTYPE', 'JUNCTIONTYPE', 'WEATHER', 
 'ROADCOND','LIGHTCOND', 'SPEEDING', 'UNDERINFL', 'INATTENTIONIND']

#Use value_counts() method in each column
list_count(data_columns, col_data)

## Data Cleaning

Some of the categories, such as 'Other' and 'Unknown', do not provide enough information. Therefore, we drop these entries from our dataset.

In [None]:
filterCond = (col_data.LIGHTCOND == 'Other') | (col_data.LIGHTCOND == 'Unknown') | \
                      (col_data.LIGHTCOND == 'Dark - Unknown Lighting') |\
                      (col_data.ROADCOND == 'Other') | (col_data.ROADCOND == 'Unknown') | \
                      (col_data.WEATHER == 'Other') | (col_data.WEATHER == 'Unknown') | \
                      (col_data.JUNCTIONTYPE == 'Other') | (col_data.JUNCTIONTYPE == 'Unknown') | \
                      (col_data.COLLISIONTYPE == 'Other')
col_data = col_data.drop(col_data[filterCond].index)
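The same kind of filter can also be written more compactly with `isin`. The following is a minimal sketch on a small made-up frame; the `toy` data and the `uninformative` mapping are illustrative assumptions, and only a subset of the columns filtered above is shown.

```python
import pandas as pd

# Toy stand-in for col_data; the rows are invented for illustration only.
toy = pd.DataFrame({
    "WEATHER": ["Clear", "Unknown", "Raining", "Other"],
    "ROADCOND": ["Dry", "Dry", "Wet", "Wet"],
    "LIGHTCOND": ["Daylight", "Daylight", "Dark - Unknown Lighting", "Daylight"],
})

# Labels treated as uninformative, listed per column (hypothetical mapping).
uninformative = {
    "WEATHER": ["Other", "Unknown"],
    "ROADCOND": ["Other", "Unknown"],
    "LIGHTCOND": ["Other", "Unknown", "Dark - Unknown Lighting"],
}

# Build one boolean mask: True where any column holds an uninformative label.
mask = pd.Series(False, index=toy.index)
for col, labels in uninformative.items():
    mask |= toy[col].isin(labels)

cleaned = toy[~mask]
print(cleaned)
```

Keeping the labels in a dictionary makes it easier to review which categories are dropped from which column.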

In [None]:
col_data["LIGHTCOND"] = col_data["LIGHTCOND"].replace("Dark - Street Lights Off", "Dark - No Street Lights")
col_data["UNDERINFL"] = col_data["UNDERINFL"].replace("N", 0)
col_data["UNDERINFL"] = col_data["UNDERINFL"].replace("0", 0)
col_data["UNDERINFL"] = col_data["UNDERINFL"].replace("1", 1)
col_data["UNDERINFL"] = col_data["UNDERINFL"].replace("Y", 1)
col_data["INATTENTIONIND"] = col_data["INATTENTIONIND"].replace("Y", 1)
col_data["SPEEDING"] = col_data["SPEEDING"].replace("Y", 1)

In [None]:
# Count the NaN values in each column
col_data.isna().sum()

There are many missing values, especially in the 'SPEEDING' and 'INATTENTIONIND' columns. Since these are boolean indicator columns, we fill the missing values with 0.

In [None]:
col_data['UNDERINFL'] = col_data['UNDERINFL'].fillna(0)
col_data['INATTENTIONIND'] = col_data['INATTENTIONIND'].fillna(0)
col_data['SPEEDING'] = col_data['SPEEDING'].fillna(0)

In [None]:
col_data.dropna(inplace=True)

In [None]:
col_data.info()

Now our data is cleaned. We have 143,741 observations.

In [None]:
col_data['SEVERITYCODE'].unique()

In [None]:
# Recode SEVERITYCODE from {1, 2} to {0, 1}
col_data["SEVERITYCODE"] = col_data["SEVERITYCODE"].replace(1, 0)
col_data["SEVERITYCODE"] = col_data["SEVERITYCODE"].replace(2, 1)

In [None]:
# One hot encoding for the relevant dataset
feature = pd.concat([pd.get_dummies(col_data['WEATHER']), 
                     pd.get_dummies(col_data['ROADCOND']),
                     pd.get_dummies(col_data['LIGHTCOND'])], axis=1)
feature.head()
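When several categorical columns are encoded together, it can help to keep the source column visible in each indicator name. The sketch below is one possible way to do this with the `prefix` argument of `pd.get_dummies`; the `toy` frame and its labels are invented for illustration and do not cover all categories in the data.

```python
import pandas as pd

# Toy rows; the category labels are examples, not the full set in the data.
toy = pd.DataFrame({
    "WEATHER": ["Clear", "Raining", "Clear"],
    "ROADCOND": ["Dry", "Wet", "Wet"],
})

# prefix= keeps the source column visible in each indicator column name;
# dtype=int yields 0/1 integers instead of booleans.
encoded = pd.get_dummies(
    toy,
    columns=["WEATHER", "ROADCOND"],
    prefix=["WEATHER", "ROADCOND"],
    dtype=int,
)
print(encoded)
```

Each original column becomes one 0/1 column per category, so every row has exactly one 1 among the `WEATHER_*` columns and one among the `ROADCOND_*` columns.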

In [None]:
col_data.columns

In [None]:
import seaborn as sns
sns.countplot(x="ADDRTYPE", hue="SEVERITYCODE", data=col_data)

The plot above shows that more property-damage collisions occur on blocks than at intersections.

In [None]:
plt.figure(figsize=(10,5))
ax= sns.countplot(x="COLLISIONTYPE", hue="SEVERITYCODE", data=col_data)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45)

Most accidents involved parked cars and resulted in property damage.

In [None]:
plt.figure(figsize=(8,5))
sns.countplot(y="JUNCTIONTYPE", hue="SEVERITYCODE", data=col_data)

In [None]:
plt.figure(figsize=(10,5))
ax= sns.countplot(x="WEATHER", hue="SEVERITYCODE", data=col_data)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45)

In [None]:
plt.figure(figsize=(8,5))
sns.countplot(y="ROADCOND", hue="SEVERITYCODE", data=col_data)

Surprisingly, most property-damage collisions happened in clear weather; snow, rain, and other weather conditions account for far fewer. Likewise, dry roads see more property-damage collisions than any other road condition.

In [None]:
plt.figure(figsize=(8,5))
sns.countplot(y="LIGHTCOND", hue="SEVERITYCODE", data=col_data)

More accidents happen in daylight than under any other light condition.
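Raw counts largely reflect how often each condition occurs, so it can be useful to also look at the share of the more severe class within each category. The following is a minimal sketch on invented rows (the `toy` frame and the `severe_share` column name are assumptions for illustration), using the 0/1 recoding of SEVERITYCODE from above.

```python
import pandas as pd

# Toy data: SEVERITYCODE 1 marks the more severe class, 0 the other.
toy = pd.DataFrame({
    "WEATHER": ["Clear", "Clear", "Clear", "Clear", "Raining", "Raining"],
    "SEVERITYCODE": [0, 0, 0, 1, 0, 1],
})

# For a 0/1 target, the group mean is the share of rows with value 1.
summary = (
    toy.groupby("WEATHER")["SEVERITYCODE"]
       .agg(count="count", severe_share="mean")
       .sort_values("severe_share", ascending=False)
)
print(summary)
```

In this toy example 'Clear' has more rows, but 'Raining' has the higher share of severe outcomes, which a count plot alone would not reveal.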

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
sns.countplot(x="SPEEDING", hue="SEVERITYCODE", data=col_data, ax=axes[0])
sns.countplot(x="UNDERINFL", hue="SEVERITYCODE", data=col_data, ax=axes[1])
sns.countplot(x="INATTENTIONIND", hue="SEVERITYCODE", data=col_data, ax=axes[2])

In [None]:
# !conda install -c conda-forge folium=0.5.0 --yes
import folium

print('Folium installed and imported!')

In [None]:
from folium import plugins
seattle_long= -122.335167
seattle_lat= 47.608013
# start with a clean map centered on Seattle
seattle_map = folium.Map(location=[seattle_lat, seattle_long], zoom_start=4)

# instantiate a marker cluster object for the incidents in the dataframe
incidents = plugins.MarkerCluster().add_to(seattle_map)

# loop through the dataframe and add each data point to the marker cluster
for lat, lng in zip(col_data.Y, col_data.X):
    folium.Marker(
        location=[lat, lng],
        icon=None,
        #popup=label,
    ).add_to(incidents)

# display map
seattle_map

## Methodology

In this project we explore which areas and conditions in Seattle are associated with accidents.

In the first step, we analyzed the data using charts and tables.

In the second step, we explore the 'severity density' across different types of conditions in Seattle, using count plots to identify which factors have the most impact on property damage.

In the third and final step, we focus on the conditions associated with the most property damage, group those conditions into clusters, and look for the most accurate model to predict severity.

## Train/Test Split

In [None]:
# Defining X matrix and y vector
X = feature
y = col_data['SEVERITYCODE'].values

In [None]:
# Splitting the data, then normalizing with statistics fitted on the training set only
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
scaler = preprocessing.StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

## KNN 

In [None]:
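# jaccard_similarity_score is unavailable in newer scikit-learn; for single-label targets it equals accuracy
try:
    from sklearn.metrics import jaccard_similarity_score
except ImportError:
    from sklearn.metrics import accuracy_score as jaccard_similarity_score
from sklearn.metrics import f1_score, log_loss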

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

In [None]:
Ks = 10
mean_acc = np.zeros((Ks-1))
std_acc = np.zeros((Ks-1))
for n in range(1, Ks):
    # Train the model with n neighbors and predict on the test set
    neigh = KNeighborsClassifier(n_neighbors=n).fit(X_train, y_train)
    yhat = neigh.predict(X_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)
    # Standard error of the per-sample correctness indicator
    std_acc[n-1] = np.std(yhat == y_test) / np.sqrt(yhat.shape[0])

In [None]:
import matplotlib.pyplot as plt
plt.plot(range(1,Ks),mean_acc,'g')
plt.fill_between(range(1,Ks),mean_acc - 1 * std_acc,mean_acc + 1 * std_acc, alpha=0.10)
plt.legend(('Accuracy ', '+/- 1xstd'))
plt.ylabel('Accuracy ')
plt.xlabel('Number of Neighbors (K)')
plt.tight_layout()
plt.show()

In [None]:
print( "The best accuracy was with", mean_acc.max(), "with k=", mean_acc.argmax()+1) 
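The loop above scores each k on the test set, so the best accuracy it reports may be slightly optimistic as an estimate for unseen data. One common alternative is to choose k by cross-validation on the training set only. The following is a minimal sketch of that idea on synthetic data; `X_demo`, `y_demo`, and the pipeline step names are assumptions for illustration, not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded features and the binary target.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4
)

# The pipeline refits the scaler inside each fold, so validation rows never
# influence the scaling statistics.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
search = GridSearchCV(
    pipe,
    {"knn__n_neighbors": list(range(1, 10))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

best_k = search.best_params_["knn__n_neighbors"]
held_out_acc = search.score(X_te, y_te)
print("best k from cross-validation:", best_k)
print("held-out accuracy:", held_out_acc)
```

With this setup the test set is touched only once, after k has been fixed.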

In [None]:
k = 4
neigh6 = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
yhat6 = neigh6.predict(X_test)
print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh6.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat6))

In [None]:
# predicted y from the k = 4 model fitted above (not the last model of the search loop)
yhat_knn = neigh6.predict(X_test)

# jaccard
jaccard_knn = jaccard_similarity_score(y_test, yhat_knn)
print("KNN Jaccard index: ", jaccard_knn)

# f1_score
f1_score_knn = f1_score(y_test, yhat_knn, average='weighted')
print("KNN F1-score: ", f1_score_knn)
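The metrics used in this section can be checked by hand on a tiny example. The sketch below uses invented labels and `jaccard_score`, which in recent scikit-learn versions computes the Jaccard index of the positive class; the older `jaccard_similarity_score` instead coincides with accuracy for single-label targets, so the two are not interchangeable.

```python
from sklearn.metrics import accuracy_score, f1_score, jaccard_score

# Invented labels, small enough to verify by hand.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Positive class (label 1): TP = 2, FP = 1, FN = 1.
acc = accuracy_score(y_true, y_pred)                 # 4 correct out of 6
jac = jaccard_score(y_true, y_pred)                  # TP / (TP + FP + FN) = 2 / 4
f1w = f1_score(y_true, y_pred, average="weighted")   # both classes have F1 = 2/3

print(acc, jac, f1w)
```

Here accuracy is 4/6 while the positive-class Jaccard index is 0.5, which shows how much the choice of function can change the reported number.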

## Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
severityTree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
severityTree.fit(X_train, y_train)
# predicted y
yhat_dt = severityTree.predict(X_test)

# jaccard
jaccard_dt = jaccard_similarity_score(y_test, yhat_dt)
print("DT Jaccard index: ", jaccard_dt)

# f1_score
f1_score_dt = f1_score(y_test, yhat_dt, average='weighted')
print("DT F1-score: ", f1_score_dt)

## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=0.01).fit(X_train,y_train)
LR

In [None]:
yhat_lg = LR.predict(X_test)
yhat_lg_prob = LR.predict_proba(X_test)

# jaccard
jaccard_lg = jaccard_similarity_score(y_test, yhat_lg)
print("LR Jaccard index: ", jaccard_lg)

# f1_score
f1_score_lg = f1_score(y_test, yhat_lg, average='weighted')
print("LR F1-score: ", f1_score_lg)

# logloss
logloss_lg = log_loss(y_test, yhat_lg_prob)
print("LR log loss: ", logloss_lg)
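Log loss is the average negative logarithm of the probability that the model assigned to the true class, so confident wrong predictions are penalized heavily. A minimal sketch with invented probabilities, comparing a manual calculation with `log_loss`:

```python
import math
from sklearn.metrics import log_loss

# Invented example: true labels and predicted probability of class 1.
y_true = [0, 1, 1]
p_pos = [0.1, 0.8, 0.6]

# Average negative log of the probability assigned to the true class.
manual = -sum(
    math.log(p if y == 1 else 1.0 - p) for y, p in zip(y_true, p_pos)
) / len(y_true)

library = log_loss(y_true, p_pos)
print(manual, library)
```

Lower values are better; a model that always predicts a probability of 0.5 would score ln 2, about 0.693.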

## Support Vector Machine (SVM)

In [None]:
from sklearn import svm
# training
clf = svm.SVC()
clf.fit(X_train, y_train)

In [None]:
# predicted y
yhat_svm = clf.predict(X_test)

# jaccard
jaccard_svm = jaccard_similarity_score(y_test, yhat_svm)
print("SVM Jaccard index: ", jaccard_svm)

# f1_score
f1_score_svm = f1_score(y_test, yhat_svm, average='weighted')
print("SVM F1-score: ", f1_score_svm)

### Report

| Algorithm          | Jaccard | F1-score | LogLoss |
|--------------------|---------|----------|---------|
| KNN                | 0.6215  | 0.5654   | NA      |
| Decision Tree      | 0.6638  | 0.5297   | NA      |
| SVM                | 0.6635  | 0.5296   | NA      |
| LogisticRegression | 0.6638  | 0.5297   | 0.6377  |
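To judge whether scores like these reflect real predictive signal, it helps to compare them with a trivial baseline that always predicts the most frequent class. The sketch below uses invented labels with an assumed class balance, so its numbers are not comparable to the table; on the real data, `X_train`, `y_train`, `X_test`, and `y_test` would be used instead.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Invented labels with roughly two thirds in class 0, for illustration only.
rng = np.random.default_rng(4)
y_demo = (rng.random(1000) < 0.33).astype(int)
X_demo = np.zeros((len(y_demo), 1))   # features are ignored by this baseline

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
y_base = baseline.predict(X_demo)

base_acc = accuracy_score(y_demo, y_base)
base_f1 = f1_score(y_demo, y_base, average="weighted", zero_division=0)
print("baseline accuracy:", base_acc)
print("baseline weighted F1:", base_f1)
```

If a trained model scores close to this baseline, it has probably learned little beyond the class frequencies.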

## Results and Discussion

This analysis could be helpful for the Seattle transportation department. Before the analysis, I expected weather, road, and light conditions to be the main drivers of accidents, but the results did not support this. However, the accidents do appear to be strongly related to specific locations. The traffic management division could therefore try to improve safety instructions or other factors at those locations to reduce accidents.

Furthermore, some places have more accidents when it is dark. For those places, adding lights might be a good way to reduce collisions.

## Conclusion

The purpose of this project was to predict accident severity in Seattle and to identify which conditions have the greatest impact on it, in order to help stakeholders or the government narrow down the search for the best way to reduce collisions. We used four different algorithms to predict severity. Decision Tree and Logistic Regression achieved the highest Jaccard score (0.6638), while KNN achieved the highest F1-score (0.5654).

The final decision on the best solution will be made by stakeholders based on the specific characteristics of conditions and locations.