# Coursera Capstone Project


## Introduction/Business Problem

Emergency services all around the world have a limited number of resources available to respond to any number of situations that might require their intervention.
Accidents happen in every city and individual events cannot be predicted; they are random. Fortunately, thanks to advances in statistics, data analysis and computer science, the likelihood of these events can be estimated and their root causes traced.

***

The purpose of this report is to identify which external factors have the greatest effect on the severity of accidents. Major accidents can strain emergency services by tying up all available personnel, so there is a clear need to know when these events are more likely to happen. Urban planners and emergency managers must allocate resources accordingly in these kinds of situations.


There are 3 main points of action that this report will try to identify:


1. **What are the main external factors driving the severity of accidents?** 

Weather, road conditions, junction type and light conditions are all external factors that may or may not influence the likelihood of an accident. Urban planners and emergency managers need to know which are the most important in order to enact policies that change the urban environment and to allocate resources by location and expected conditions.


2. **Which of the external conditions have the most influence?**

Weather and light conditions cannot be changed, but if their influence is identified, it may give an edge to emergency services. They could temporarily increase the amount of resources available, change shifts or put more personnel on call.


3. **Which urban configurations have the most influence?**

Road conditions and junction type are urban configurations that may be playing a role. Are parked cars increasing the amount of damage in an accident? If so, urban planners could push to replace ground-level parking spaces with underground parking and free up space for pedestrians.
  
***
    
This report is aimed at city emergency services and urban planners. It will not try to assess the relation between driver condition and severity. Driver condition cannot be known beforehand and is therefore of no use for accident response or urban planning. That kind of data is out of scope: it may be useful for public information, statistics or awareness campaigns, but it is not the main purpose of this Data Science Project.

## Data

The data used for this project is the [collisions dataset](https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv) provided by the *Coursera Capstone Project*, with its [metadata description](https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Metadata.pdf).

***

The target or label column will be the accident "severity". The machine learning model will predict the severity of the accident. This severity is graded as:

- 3: fatality
- 2b: serious injury
- 2: injury
- 1: property damage
- 0: unknown

As we are not interested in predicting "unknown" severity, we will remove those cases and set the scale from 1 to 2:

1. Property damage
2. Injuries
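
The reduction to two classes can be sketched on a toy column. This is only an illustration: the values below are made up, the codes are stored as strings so that "2b" can be represented, and it assumes that every row with a code other than 1 or 2 is simply discarded.

```python
import pandas as pd

# Toy frame standing in for the collision data; the values are invented.
toy = pd.DataFrame({"SEVERITYCODE": ["1", "2", "0", "3", "2b", "1"]})

# Assumption: only the two classes used in this report are kept,
# 1 (property damage) and 2 (injuries); all other rows are dropped.
kept = toy[toy["SEVERITYCODE"].isin(["1", "2"])].copy()
kept["SEVERITYCODE"] = kept["SEVERITYCODE"].astype(int)

print(kept["SEVERITYCODE"].value_counts().sort_index())
```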

***

In a first stage we will use the following data to predict outcomes:

- Time: extracted from 'INCDTTM' by removing the date and keeping only the time of day.
- Weather: WEATHER
- Light Conditions: LIGHTCOND
- Junction Type: JUNCTIONTYPE
- Road Conditions: ROADCOND
- Hit Parked car: HITPARKEDCAR


Then the influence on the outcome will be studied separately for two groups of variables:

- External conditions (Emergency Services information):
    - Time: INCDTTM
    - Weather: WEATHER
    - Light Conditions: LIGHTCOND
    
    
- Urban configuration (Urban Planners information):
    - Junction Type: JUNCTIONTYPE
    - Road Conditions: ROADCOND
    - Hit Parked car: HITPARKEDCAR
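
One possible way to keep the two groups apart in code is to store the column names in two lists. The list names and the toy rows below are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical names for the two groups of columns described above.
EXTERNAL_CONDITIONS = ["INCDTTM", "WEATHER", "LIGHTCOND"]
URBAN_CONFIGURATION = ["JUNCTIONTYPE", "ROADCOND", "HITPARKEDCAR"]

# Two invented rows with the same column names as the dataset.
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 2],
    "INCDTTM": [8, 17],
    "WEATHER": ["Clear", "Raining"],
    "LIGHTCOND": ["Daylight", "Dark"],
    "JUNCTIONTYPE": ["Mid-Block", "Intersection"],
    "ROADCOND": ["Dry", "Wet"],
    "HITPARKEDCAR": ["N", "N"],
})

# Each group can then be selected and studied on its own.
X_external = toy[EXTERNAL_CONDITIONS]
X_urban = toy[URBAN_CONFIGURATION]
print(X_external.columns.tolist(), X_urban.columns.tolist())
```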

## Methodology 

Importing libraries and reading data:

In [None]:
import pandas as pd
import numpy as np

In [None]:
df_acc=pd.read_csv('https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv')

First we check our database:

In [None]:
df_acc.head()

And we check the columns present:

In [None]:
df_acc.columns

As described in the Data section, we will focus on only a few variables, so we create a dataframe with those variables:

In [None]:
# .copy() gives an independent dataframe, so later edits do not affect df_acc
df_sev=df_acc[["SEVERITYCODE", "INCDTTM", "WEATHER", "LIGHTCOND", "JUNCTIONTYPE", "ROADCOND", "HITPARKEDCAR"]].copy()

In [None]:
df_sev.head()

We want time to be a discrete variable for prediction, so that we can check which hours are the most dangerous.

In [None]:
df_sev["INCDTTM"].head(10)

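Watch out! There are timestamps without a time of day! We keep only the rows whose time falls between 00:00:01 and 23:59:59:

In [None]:
# Parse the timestamps and use them as a temporary index
myseries=pd.to_datetime(df_sev['INCDTTM'], errors='coerce')
dtidx = pd.DatetimeIndex(myseries)
df_sev.index = dtidx
# Keep rows from 00:00:01 to 23:59:59; entries at exactly 00:00:00 are discarded
df_sev=df_sev.between_time('00:00:01', '23:59:59')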

In [None]:
df_sev=df_sev.reset_index(drop=True)
df_sev.head(10)
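
Note that a time-window filter of this kind also removes any accident recorded at exactly midnight. A possible alternative, sketched below on invented strings, is to look at the raw text instead. It assumes that date-only entries never contain a colon, which should be checked against the real column before relying on it.

```python
import pandas as pd

# Invented examples in the style of the INCDTTM column.
raw = pd.Series([
    "3/27/2013 2:54:00 PM",
    "12/20/2006",
    "11/18/2004 10:20:00 AM",
])

# Assumption: an entry carries a time of day only if it contains a colon.
has_time = raw.str.contains(":", regex=False)
with_time = raw[has_time].reset_index(drop=True)
print(with_time)
```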

We replace the timestamp by the hour of day:

In [None]:
# Parse the timestamps again and keep only the hour of day (0 to 23)
myseries=pd.to_datetime(df_sev['INCDTTM'], errors='coerce')
df_sev = df_sev.assign(INCDTTM=myseries.dt.hour)

In [None]:
df_sev["INCDTTM"].head(10)
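
As a small illustration of the hour extraction, the sketch below parses a few invented timestamps with an explicit format string. The format `%m/%d/%Y %I:%M:%S %p` is an assumption about how the column is written and is not taken from the metadata.

```python
import pandas as pd

# Invented timestamps; the format string below is an assumption.
raw = pd.Series([
    "3/27/2013 2:54:00 PM",
    "11/18/2004 10:20:00 AM",
    "6/1/2010 12:05:00 AM",
])

parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p")

# .dt.hour returns the hour on a 24-hour clock.
hours = parsed.dt.hour
print(hours.tolist())
```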

### We check empty values:

In [None]:
df_sev.isnull().sum()


All the variables but 'time' and 'hitparkedcar' have missing values. We will drop the rows with missing values:

In [None]:
# Drop every row with a missing value in any of these four columns
df_sev.dropna(subset=["WEATHER", "LIGHTCOND", "JUNCTIONTYPE", "ROADCOND"], axis=0, inplace=True)

# Reset the index, because rows were dropped
df_sev.reset_index(drop=True, inplace=True)

In [None]:
df_sev.describe(include = "all")

In [None]:
df_sev.isnull().sum()
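
The effect of dropping rows with missing values can be checked on a toy frame. The rows below are invented; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "WEATHER": ["Clear", np.nan, "Raining", "Clear"],
    "LIGHTCOND": ["Daylight", "Dark", np.nan, "Daylight"],
    "HITPARKEDCAR": ["N", "N", "Y", "Y"],
})

# A row is dropped if any column listed in `subset` is missing.
clean = toy.dropna(subset=["WEATHER", "LIGHTCOND"]).reset_index(drop=True)

print(clean)
print(clean.isnull().sum())
```

Only the first and last rows survive, and the index is renumbered from 0.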

### We remove unknowns:

In [None]:
df_sev["WEATHER"].unique()

In [None]:
df_sev["LIGHTCOND"].unique()

In [None]:
df_sev["JUNCTIONTYPE"].unique()

In [None]:
df_sev["ROADCOND"].unique()

In [None]:
df_sev["HITPARKEDCAR"].unique()

In [None]:
# Treat "Unknown", "Other" and the string "nan" as missing values
df_sev.replace(["Unknown", "Other", "nan"], np.nan, inplace = True)

# Drop every row where any of these four columns is now missing
df_sev.dropna(subset=["WEATHER", "LIGHTCOND", "JUNCTIONTYPE", "ROADCOND"], axis=0, inplace=True)

# Reset the index, because rows were dropped
df_sev.reset_index(drop=True, inplace=True)

In [None]:
df_sev.head(10)
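
A slightly more defensive variant, sketched here on invented rows, limits the replacement to the categorical columns so that no other column can be touched by accident. The choice of columns in `cols` is an assumption made for this example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "WEATHER": ["Clear", "Unknown", "Raining", "Other"],
    "ROADCOND": ["Dry", "Wet", "Unknown", "Dry"],
    "SEVERITYCODE": [1, 2, 1, 2],
})

# Only these columns are scanned for the placeholder labels.
cols = ["WEATHER", "ROADCOND"]
toy[cols] = toy[cols].replace(["Unknown", "Other"], np.nan)

# Rows with a placeholder in either column are now missing and get dropped.
clean = toy.dropna(subset=cols).reset_index(drop=True)
print(clean)
```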

### We group inside variables

#### First light conditions

In [None]:
df_sev.groupby(['LIGHTCOND'])['SEVERITYCODE'].value_counts(normalize=True)

We group all Dark categories together because we consider light conditions to be an external factor:

In [None]:
df_sev.replace("Dark - No Street Lights", "Dark", inplace = True)
df_sev.replace("Dark - Street Lights Off", "Dark", inplace = True)
df_sev.replace("Dark - Street Lights On", "Dark", inplace = True)
df_sev.replace("Dark - Unknown Lighting", "Dark", inplace = True)

In [None]:
df_sev['LIGHTCOND'].unique()
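
The same grouping can be written as a single rule on the LIGHTCOND column alone. This is a sketch on invented values, and it assumes the column no longer contains missing values at this point.

```python
import pandas as pd

light = pd.Series(
    ["Daylight", "Dark - Street Lights On", "Dusk", "Dark - No Street Lights"],
    name="LIGHTCOND",
)

# Keep the value where it does not start with "Dark", otherwise use "Dark".
grouped = light.where(~light.str.startswith("Dark"), "Dark")
print(grouped.unique())
```

Working on one column avoids replacing a matching string in any other column of the dataframe.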

#### Second weather

In [None]:
df_sev.groupby(['WEATHER'])['SEVERITYCODE'].value_counts(normalize=True)

In [None]:
df_sev['WEATHER'].unique()

We do not change weather


#### Third Junction Type

In [None]:
df_sev.groupby(['JUNCTIONTYPE'])['SEVERITYCODE'].value_counts(normalize=True)

We consider that this column gives the location where the accident was reported, but we want to know where it actually happened. So the categories marked (intersection related) are converted to Intersection, the categories marked (not related to intersection) are converted to Mid-Block, and Ramp Junction and Driveway Junction are left as they are.

In [None]:
df_sev.replace("At Intersection (but not related to intersection)", "Mid-Block", inplace = True)
df_sev.replace("At Intersection (intersection related)", "Intersection", inplace = True)
df_sev.replace("Mid-Block (not related to intersection)", "Mid-Block", inplace = True)
df_sev.replace("Mid-Block (but intersection related)", "Intersection", inplace = True)

In [None]:
df_sev.groupby(['JUNCTIONTYPE'])['SEVERITYCODE'].value_counts(normalize=True)
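
The conversion described above can also be expressed as one lookup table, which keeps the whole rule in a single place. The dictionary name and the toy series are hypothetical; the category labels are the ones used in the cell above.

```python
import pandas as pd

# Hypothetical lookup table holding the same four conversions.
JUNCTION_MAP = {
    "At Intersection (but not related to intersection)": "Mid-Block",
    "At Intersection (intersection related)": "Intersection",
    "Mid-Block (not related to intersection)": "Mid-Block",
    "Mid-Block (but intersection related)": "Intersection",
}

junction = pd.Series([
    "At Intersection (intersection related)",
    "Mid-Block (not related to intersection)",
    "Mid-Block (but intersection related)",
    "Ramp Junction",
])

# Labels that are not keys of the dictionary are left unchanged.
simplified = junction.replace(JUNCTION_MAP)
print(simplified.tolist())
```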

#### Fourth Road Conditions

In [None]:
df_sev.groupby(['ROADCOND'])['SEVERITYCODE'].value_counts(normalize=True)

We do not change

### We are going to check some plots


This gives us an idea of which variables have the most influence:

In [None]:
!conda install -c anaconda seaborn -y

In [None]:
#importing missing libraries
import itertools
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter

import matplotlib.ticker as ticker
from sklearn import preprocessing
%matplotlib inline

In [None]:
import seaborn as sns

# One bin per hour: 25 edges (0 to 24) give 24 bins, assuming hours run from 0 to 23
bins = np.arange(0, 25)
g = sns.FacetGrid(df_sev, col="WEATHER", hue="SEVERITYCODE", palette="Set1", col_wrap=2)
g.map(plt.hist, 'INCDTTM', bins=bins, ec="k")

g.axes[-1].legend()
plt.show()

In [None]:
# One bin per hour: 25 edges (0 to 24) give 24 bins, assuming hours run from 0 to 23
bins = np.arange(0, 25)
g = sns.FacetGrid(df_sev, col="LIGHTCOND", hue="SEVERITYCODE", palette="Set1", col_wrap=2)
g.map(plt.hist, 'INCDTTM', bins=bins, ec="k")

g.axes[-1].legend()
plt.show()

In [None]:
# One bin per hour: 25 edges (0 to 24) give 24 bins, assuming hours run from 0 to 23
bins = np.arange(0, 25)
g = sns.FacetGrid(df_sev, col="JUNCTIONTYPE", hue="SEVERITYCODE", palette="Set1", col_wrap=2)
g.map(plt.hist, 'INCDTTM', bins=bins, ec="k")

g.axes[-1].legend()
plt.show()

We can start obtaining some insights from data:

1. Accidents are more severe at Intersections than at Ramp Junctions
2. Bad weather increases the severity of accidents, not their number
3. Light conditions are inconclusive
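
Observations like these can be backed by a number, for example the share of accidents with injuries in each category. The sketch below uses invented rows, so its figures say nothing about the real data.

```python
import pandas as pd

toy = pd.DataFrame({
    "JUNCTIONTYPE": ["Intersection"] * 4 + ["Mid-Block"] * 4,
    "SEVERITYCODE": [2, 2, 1, 1, 1, 1, 1, 2],
})

# The mean of a boolean column is the fraction of True values,
# here the fraction of accidents with severity 2 (injuries) per category.
injury_share = (toy["SEVERITYCODE"] == 2).groupby(toy["JUNCTIONTYPE"]).mean()
print(injury_share)
```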

### Data is ready.

We are going to perform some machine learning using a **decision tree** classifier:

In [None]:
X = df_sev[["INCDTTM", "WEATHER", "LIGHTCOND", "JUNCTIONTYPE", "ROADCOND", "HITPARKEDCAR"]].values

Some features in this dataset are categorical, such as **WEATHER** or **JUNCTIONTYPE**. Unfortunately, Sklearn Decision Trees do not handle categorical variables, but we can still convert these features to numerical values.
Here we use label encoding, which maps each category to an integer code.


In [None]:
from sklearn import preprocessing
le_hit = preprocessing.LabelEncoder()
le_hit.fit(['N','Y'])
X[:,5] = le_hit.transform(X[:,5]) 


le_road = preprocessing.LabelEncoder()
le_road.fit([ 'Wet', 'Dry', 'Unknown', 'Ice', 'Snow/Slush', 'Other',
       'Sand/Mud/Dirt', 'Standing Water', 'Oil'])
X[:,4] = le_road.transform(X[:,4])


le_junc = preprocessing.LabelEncoder()
le_junc.fit([ 'Driveway Junction', 'Intersection', 'Mid-Block', 'Ramp Junction'])
X[:,3] = le_junc.transform(X[:,3]) 

le_light = preprocessing.LabelEncoder()
le_light.fit([ 'Daylight', 'Dark', 'Dusk', 'Dawn'])
X[:,2] = le_light.transform(X[:,2]) 

le_wea = preprocessing.LabelEncoder()
le_wea.fit([ 'Overcast', 'Raining', 'Clear', 'Snowing', 'Fog/Smog/Smoke',
       'Sleet/Hail/Freezing Rain', 'Blowing Sand/Dirt',
       'Severe Crosswind', 'Partly Cloudy'])
X[:,1] = le_wea.transform(X[:,1]) 


X[0:5]
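
Label encoding gives the categories an arbitrary numeric order, which a tree will treat as if it were meaningful. An alternative that is not used in this notebook is one-hot encoding with `pd.get_dummies`, sketched here on invented rows.

```python
import pandas as pd

toy = pd.DataFrame({
    "INCDTTM": [8, 17, 23],
    "WEATHER": ["Clear", "Raining", "Clear"],
    "LIGHTCOND": ["Daylight", "Daylight", "Dark"],
})

# Each category becomes its own indicator column; numeric columns are kept as they are.
encoded = pd.get_dummies(toy, columns=["WEATHER", "LIGHTCOND"])
print(encoded.columns.tolist())
```

The price is a wider feature matrix: one column per category instead of one column per variable.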


In [None]:
y=df_sev['SEVERITYCODE']

### Setting up the Decision Tree

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

In [None]:
X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, y, test_size=0.3, random_state=3)

In [None]:
sevTree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
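
With `criterion="entropy"`, the tree picks the splits that reduce the entropy of the class labels the most. The helper below is a hypothetical, minimal illustration of that quantity on toy labels. It is not the scikit-learn implementation.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    h = -(p * np.log2(p)).sum()
    return float(abs(h))  # abs() only turns -0.0 into 0.0 for pure nodes

parent = [1, 1, 2, 2]          # evenly mixed node
left, right = [1, 1], [2, 2]   # a split that separates the classes perfectly

# Information gain: parent entropy minus the size-weighted entropy of the children.
weighted_children = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
gain = entropy(parent) - weighted_children
print(entropy(parent), gain)
```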

In [None]:
sevTree.fit(X_trainset,y_trainset)

In [None]:
predTree = sevTree.predict(X_testset)

In [None]:
print (predTree [0:5])
print (y_testset [0:5])

In [None]:
from sklearn import metrics
print("Decision Tree's Accuracy: ", metrics.accuracy_score(y_testset, predTree))
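
Accuracy alone can be misleading if one severity class is much more frequent than the other, so it may help to compare it with a majority-class baseline and to look at the confusion matrix. The arrays below are invented and are not the model's output.

```python
import numpy as np
from sklearn import metrics

# Invented labels and predictions, for illustration only.
y_true = np.array([1, 1, 1, 1, 2, 2])
y_pred = np.array([1, 1, 1, 1, 1, 2])

accuracy = metrics.accuracy_score(y_true, y_pred)

# Baseline: accuracy obtained by always predicting the most frequent class.
_, counts = np.unique(y_true, return_counts=True)
baseline = counts.max() / counts.sum()

# Rows are true classes, columns are predicted classes, in the order given by `labels`.
cm = metrics.confusion_matrix(y_true, y_pred, labels=[1, 2])
print(accuracy, baseline)
print(cm)
```

A model is only informative if its accuracy is clearly above the baseline.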

## Results

Let's visualize the results:

In [None]:
# Notice: You might need to uncomment and install the pydotplus and graphviz libraries if you have not installed these before
!conda install -c conda-forge pydotplus -y
!conda install -c conda-forge python-graphviz -y

In [None]:
# sklearn.externals.six was removed from scikit-learn; the standard library provides StringIO
from io import StringIO
import pydotplus
import matplotlib.image as mpimg
from sklearn import tree
%matplotlib inline 

In [None]:
dot_data = StringIO()
filename = "sevtree.png"
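# Feature names must follow the column order used to build X
featureNames = ["INCDTTM", "WEATHER", "LIGHTCOND", "JUNCTIONTYPE", "ROADCOND", "HITPARKEDCAR"]
# export_graphviz expects the class names as strings, in ascending class order
targetNames = [str(c) for c in np.unique(y_trainset)]
out=tree.export_graphviz(sevTree,feature_names=featureNames, out_file=dot_data, class_names=targetNames, filled=True,  special_characters=True,rotate=False)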
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png(filename)
img = mpimg.imread(filename)
plt.figure(figsize=(100, 200))
plt.imshow(img,interpolation='nearest')
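
If graphviz is not available, the rules of a tree can also be printed as plain text with `sklearn.tree.export_text`, and `feature_importances_` gives a rough ranking of the features. The sketch below trains a tiny tree on invented, already encoded data, so its output is unrelated to the real model.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented, already label-encoded toy data: the class depends only on the second feature.
feature_names = ["WEATHER", "JUNCTIONTYPE"]
X_toy = [[0, 0], [1, 0], [0, 1], [1, 1]]
y_toy = [1, 1, 2, 2]

toy_tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
toy_tree.fit(X_toy, y_toy)

# Text version of the decision rules.
rules = export_text(toy_tree, feature_names=feature_names)
print(rules)

# Share of the total impurity reduction attributed to each feature.
importances = dict(zip(feature_names, toy_tree.feature_importances_))
print(importances)
```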

## Discussion 

As the results show, accidents are highly unpredictable, but some rough predictions can be made:
    
- Accidents are more severe at Intersections than at Ramp Junctions
- Bad weather increases the severity of accidents, not their number
- Light conditions are inconclusive

Further analysis is needed to obtain more insights.

## Conclusion 

The severity of accidents is hard to predict, and many factors can influence the outcome. Both emergency services and urban planners need some kind of prediction model.

This model provides an idea of when accidents may be more severe (by time, weather or road conditions) and where (junction type is a major influence). With this information they can respond more effectively and design junctions and roads more efficiently.

The model may be used to predict the seriousness of an accident under certain conditions.