# <span style='background:salmon'> TRAFFIC ACCIDENT PREDICTION MODEL </span> #

## 1. Context

John is driving to another city for work or to visit some friends. It is rainy and windy, and on the way John comes across a terrible traffic jam on the other side of the highway.

<img src="https://i.dailymail.co.uk/i/pix/2011/05/23/article-1390203-01226D9E000004B0-277_468x307.jpg">

Long lines of cars are barely moving. As John keeps driving, police cars appear in the distance, shutting down the highway. It is an accident, and a helicopter is transporting those involved in the crash to the nearest hospital. They must be in critical condition for all of this to be happening. Wouldn't it be great if something could warn John, given the weather and the road conditions, about the possibility of getting into a car accident and how severe it could be, so that he would drive more carefully or even change his travel plans if he is able to?

<img src="https://cdn.techinasia.com/wp-content/uploads/2016/02/traffic-jam-india.jpg">

## 2. Data benefits

By looking at variables such as:
* road condition
* light condition
* weather
* speed
* lane

we can build a model to see whether these variables affect the severity of a traffic accident.

This model classifies the severity of an accident, which gives the driver the 'worst-case scenario' rather than a probabilistic estimate of an accident occurring.
This should help induce an appropriate level of caution in the driver!

This model uses two features: speeding and road condition.

## 3. Data Cleaning

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression

print("Done")

Done


In [10]:
# Read the CSV
df = pd.read_csv("Data-Collisions.csv")
# Select the target and the two features
selected = ['SEVERITYCODE', 'SPEEDING', 'ROADCOND']
df = df[selected]
# Check the unique values of each feature
for feature in ["SPEEDING", "ROADCOND"]:
    print(df[feature].unique())

[nan 'Y']
['Wet' 'Dry' nan 'Unknown' 'Snow/Slush' 'Ice' 'Other' 'Sand/Mud/Dirt'
 'Standing Water' 'Oil']


<img src="https://image.flaticon.com/icons/svg/1491/1491416.svg" width=50 align="left"> 

### &nbsp; _Assumptions for missing data:_

1. Drivers are not speeding
2. Road condition is unknown
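
Before filling, it may help to see how many rows each assumption touches. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are illustrative and are not taken from `Data-Collisions.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the collision records.
toy = pd.DataFrame({
    "SPEEDING": [np.nan, "Y", np.nan, np.nan],
    "ROADCOND": ["Wet", "Dry", np.nan, "Ice"],
})

# Share of missing values per column, i.e. the rows the assumptions will fill.
missing_share = toy.isna().mean()
print(missing_share)

# Apply the two assumptions: missing SPEEDING -> 'N', missing ROADCOND -> 'Unknown'.
filled = toy.fillna({"SPEEDING": "N", "ROADCOND": "Unknown"})
print(filled)
```

A large missing share for a column means the corresponding assumption shapes most of that feature, which is worth keeping in mind when reading the results.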

In [11]:
# Replace NaN with 'N' for speeding
df['SPEEDING'] = df['SPEEDING'].fillna('N')
# Replace NaN with 'Unknown' for road condition
df['ROADCOND'] = df['ROADCOND'].fillna('Unknown')

# checking value once again...
for feature in ["SPEEDING", "ROADCOND"]:
    print(df[feature].unique())

['N' 'Y']
['Wet' 'Dry' 'Unknown' 'Snow/Slush' 'Ice' 'Other' 'Sand/Mud/Dirt'
 'Standing Water' 'Oil']


In [12]:
df.shape

(194673, 3)

<img src="https://image.flaticon.com/icons/svg/1491/1491416.svg" width=50 align="left"> 

### &nbsp; _Assumptions for data analysis:_

We assume that every road condition is undesirable ("bad") except for `Dry`, `Unknown` and `Other`.
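
The grouping assumed above can also be written as an explicit lookup table. This is only a sketch: `ROAD_GROUPS` and `toy_road` are hypothetical names introduced for illustration, and the sample values are not drawn from the dataset.

```python
import pandas as pd

# Explicit lookup for the assumption above: only Dry, Unknown and Other count as 'Good'.
ROAD_GROUPS = {
    "Dry": "Good", "Unknown": "Good", "Other": "Good",
    "Wet": "Bad", "Snow/Slush": "Bad", "Ice": "Bad",
    "Sand/Mud/Dirt": "Bad", "Standing Water": "Bad", "Oil": "Bad",
}

# Hypothetical sample of ROADCOND values (illustrative only).
toy_road = pd.Series(["Wet", "Dry", "Unknown", "Oil"], name="ROADCOND")

grouped = toy_road.map(ROAD_GROUPS)

# map() yields NaN for categories missing from the lookup, so this check
# fails loudly if a new road condition appears in the data.
assert grouped.notna().all(), "unmapped ROADCOND category"
print(grouped.tolist())
```

Keeping the mapping in one dictionary makes the assumption easy to review and change in a single place.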

In [13]:
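# Group ROADCOND values into 'Good' and 'Bad'.
# Assigning the result back avoids calling inplace=True on a column selection,
# which newer pandas versions warn about or ignore.
df['ROADCOND'] = df['ROADCOND'].replace(
    to_replace=['Wet','Dry','Unknown','Snow/Slush','Ice','Other','Sand/Mud/Dirt','Standing Water','Oil'],
    value=['Bad','Good','Good','Bad','Bad','Good','Bad','Bad','Bad'])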

In [14]:
# Encode both features as numerical values (assigned back instead of inplace=True)
df["SPEEDING"] = df["SPEEDING"].map({'N': 0, 'Y': 1})
df['ROADCOND'] = df['ROADCOND'].map({'Good': 0, 'Bad': 1})
# Defining dataset
test_condition = df[['SPEEDING','ROADCOND']]
test_condition.head()

| | SPEEDING | ROADCOND |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 1 |
| 2 | 0 | 0 |
| 3 | 0 | 0 |
| 4 | 0 | 1 |


## 4. Data Analysis

The proportion of severity-2 accidents is higher when the driver is speeding:

In [15]:
speed_analysis = df.groupby(['SPEEDING'])['SEVERITYCODE'].value_counts(normalize=True)
speed_analysis

SPEEDING  SEVERITYCODE
0         1               0.705099
          2               0.294901
1         1               0.621665
          2               0.378335
Name: SEVERITYCODE, dtype: float64

The proportion of severity-2 accidents is also higher when the road condition is bad:

In [16]:
road_analysis = df.groupby(['ROADCOND'])['SEVERITYCODE'].value_counts(normalize=True)
road_analysis

ROADCOND  SEVERITYCODE
0         1               0.710389
          2               0.289611
1         1               0.674176
          2               0.325824
Name: SEVERITYCODE, dtype: float64

### This suggests that both features are associated with the severity of an accident when one happens.
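
To put a number on the difference, the severity-2 shares from the two groupby outputs above can be compared as ratios. This is a small arithmetic sketch; the dictionary layout and variable names are illustrative, and the ratios describe an association in this dataset rather than a causal effect.

```python
# Severity-2 shares copied from the two groupby outputs above.
sev2_share = {
    "SPEEDING": {"no": 0.294901, "yes": 0.378335},
    "ROADCOND": {"good": 0.289611, "bad": 0.325824},
}

# How many times more common severity 2 is in the "worse" group.
speed_ratio = sev2_share["SPEEDING"]["yes"] / sev2_share["SPEEDING"]["no"]
road_ratio = sev2_share["ROADCOND"]["bad"] / sev2_share["ROADCOND"]["good"]

print(f"speeding vs. not speeding: {speed_ratio:.2f}x")
print(f"bad vs. good road:         {road_ratio:.2f}x")
```

The ratios come out at roughly 1.28 for speeding and 1.13 for bad road conditions, so in this data speeding shows the larger relative increase.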

## 5. Methodology

Each model is evaluated on a held-out 20% test set using two metrics, the weighted F1-score and accuracy:

In [17]:
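# Features (x) and target (y); severity codes are kept as string class labels
x = test_condition
y = df['SEVERITYCODE'].values.astype(str)
# Standardize the features to zero mean and unit variance
x = preprocessing.StandardScaler().fit(x).transform(x)
# Hold out 20% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1234)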

# verifying set dimensions
print("Training set: ", x_train.shape, y_train.shape)
print("Testing set: ", x_test.shape, y_test.shape)

Training set:  (155738, 2) (155738,)
Testing set:  (38935, 2) (38935,)
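
The cell above fits the scaler on all rows before splitting. A common alternative, sketched here on hypothetical random data, is to split first and let a scikit-learn `Pipeline` fit the scaler on the training rows only. This is not the approach used in this notebook; `X_toy`, `y_toy` and the `stratify` argument are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical binary features and string labels standing in for the real data.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(200, 2)).astype(float)
y_toy = rng.integers(1, 3, size=200).astype(str)

# Split first; stratify keeps the class shares similar in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1234, stratify=y_toy
)

# The pipeline fits the scaler on the training rows only and reuses
# those statistics when transforming the test rows.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

scaler_mean = pipe.named_steps["standardscaler"].mean_
toy_accuracy = pipe.score(X_te, y_te)
print(scaler_mean, toy_accuracy)
```

With two binary features and this many rows the practical difference is likely tiny, but the pattern keeps test-set statistics out of the preprocessing step.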


### Method I: K-Nearest Neighbors

In [None]:
KNN_model = KNeighborsClassifier(n_neighbors = 4).fit(x_train, y_train)
predicted = KNN_model.predict(x_test)
KNN_f1 = f1_score(y_test, predicted, average='weighted')
KNN_acc = accuracy_score(y_test, predicted)

### Method II: Decision Tree

In [None]:
Tree_model = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
Tree_model.fit(x_train, y_train)
predicted = Tree_model.predict(x_test)
Tree_f1 = f1_score(y_test, predicted, average='weighted')
Tree_acc = accuracy_score(y_test, predicted)

### Method III: Logistic Regression

In [None]:
LR_model = LogisticRegression(C=0.01, solver='liblinear').fit(x_train, y_train)
predicted = LR_model.predict(x_test)

LR_f1 = f1_score(y_test, predicted, average='weighted')
LR_acc = accuracy_score(y_test, predicted)

## 6. Results

In [22]:
table = {
    "Algorithm": ["KNN", "Decision Tree", "LogisticRegression"],
    "F1-score": [KNN_f1, Tree_f1, LR_f1],
    "Accuracy": [KNN_acc, Tree_acc, LR_acc]
}

table = pd.DataFrame(table)
table

| | Algorithm | F1-score | Accuracy |
|---|---|---|---|
| 0 | KNN | 0.591378 | 0.696751 |
| 1 | Decision Tree | 0.576051 | 0.699679 |
| 2 | LogisticRegression | 0.576051 | 0.699679 |
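
One sanity check worth running on these scores is a comparison with a majority-class baseline. The arithmetic sketch below works out the weighted F1-score that a classifier would get if it predicted severity 1 for every test sample, using only the accuracy reported in the table; the variable names are illustrative.

```python
# Accuracy reported above for the Decision Tree and LogisticRegression rows.
acc = 0.699679

# If a classifier predicts class 1 for every test sample, its accuracy equals
# the share of class 1 in the test set, and:
#   precision(1) = acc, recall(1) = 1, so F1(1) = 2 * acc / (1 + acc)
#   F1(2) = 0 because class 2 is never predicted
f1_class1 = 2 * acc / (1 + acc)
f1_class2 = 0.0

# Weighted F1 averages the per-class F1 scores by class share in the test set.
weighted_f1 = acc * f1_class1 + (1 - acc) * f1_class2
print(round(weighted_f1, 6))
```

The result agrees with the 0.576051 reported for the Decision Tree and LogisticRegression rows. That is consistent with, though not proof of, both models predicting severity 1 for every test sample; comparing `predicted` against `y_test` per class, for example with `classification_report`, would confirm it.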


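I will choose the logistic regression (LR) model: it ties with the decision tree for the highest accuracy, and its coefficients can be inspected directly.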

In [101]:

table = {
    "Intercept": LR_model.intercept_,
    "Coef:SPEEDING ": LR_model.coef_[:,0],
    "Coef:ROADCOND ": LR_model.coef_[:,1],
}

table = pd.DataFrame(table)
table

| | Intercept | Coef:SPEEDING | Coef:ROADCOND |
|---|---|---|---|
| 0 | -0.853729 | 0.067702 | 0.068295 |


As both coefficients are positive, I conclude that the two conditions (speeding and bad road conditions) are associated with increased accident severity.
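
To read the size of these coefficients, they can be converted to odds ratios. The sketch below uses only the values from the table above. Because the features were standardized before fitting, each ratio refers to a one-standard-deviation change in the feature, not to a switch from 0 to 1.

```python
import math

# Coefficients copied from the table above (fitted on standardized features).
coefficients = {"SPEEDING": 0.067702, "ROADCOND": 0.068295}

# exp(coef) is the multiplicative change in the odds of severity 2
# for a one-standard-deviation increase in the standardized feature.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio per standard deviation = {ratio:.3f}")
```

Both ratios are only slightly above 1, so the estimated effects point in the expected direction but are small on this scale.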

## 7. Conclusion

This model provides empirical evidence that speeding and bad road conditions are associated with more severe accidents.