# Coursera Capstone Project

# Week 1

## 1. Introduction/Business Problem

### 1.1 Background

Say you are driving to another city for work or to visit friends. It is rainy and windy, and on the way you come across a terrible traffic jam on the other side of the highway: long lines of cars barely moving. As you keep driving, police cars start appearing in the distance, shutting down the highway. There has been an accident, and a helicopter is transporting those involved in the crash to the nearest hospital. They must be in critical condition for all of this to be happening. Now, wouldn't it be great if something were in place that could warn you, given the weather and road conditions, about the possibility of getting into a car accident and how severe it would be, so that you could drive more carefully or even change your travel plans if you are able to?

### 1.2 Problem

The purpose of this project is to use data related to weather and road conditions to predict the possibility of getting into a car accident and how severe it would be. This could help the user to make decisions on whether to change travel plans or drive more carefully under certain conditions.

### 1.3 Interest

Insurance companies that want to understand how certain weather patterns affect accident severity can use such information to calculate appropriate premiums. A more severe accident tends to cost more, so an increased likelihood of one, based on weather information, would result in a higher premium.

## 2. Data Acquisition and Cleaning

### 2.1 Data Source

The data is provided with the course materials. To understand the data and decide what cleaning is needed, we use the pandas library to import the CSV file and inspect the first rows.

In [5]:
import itertools
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn import preprocessing
%matplotlib inline
data = pd.read_csv("Data-Collisions.csv", delimiter = ',', low_memory = False)
data.head()

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


A first look at the data confirms that SEVERITYCODE is the target we want our model to predict.

### 2.2 Data Cleaning

To perform data cleaning, we analyze the possible values for our target parameter.

In [10]:
data['SEVERITYCODE'].value_counts()

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64

Looking at the target parameter, the dataset contains far more cases of SEVERITYCODE 1 (136485) than of SEVERITYCODE 2 (58188). Such an imbalance may bias the model we train toward the majority class, so the data should be balanced before we proceed with machine learning. The machine learning model will be trained to predict accident severity from attributes such as weather, road conditions, and light conditions. The data will be split into training and testing sets to determine which model gives the highest classification accuracy.
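One common way to balance the classes is to downsample the majority class to the size of the minority class. The following is a minimal sketch of that idea on a small, made-up frame (the `toy` data and the `balanced` name are illustrative assumptions, not part of the notebook); the same pattern could be applied to `data`.

```python
import pandas as pd

# Hypothetical stand-in for the collisions data: 6 rows of class 1, 3 rows of class 2.
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "WEATHER": ["Clear", "Raining", "Clear", "Overcast", "Clear", "Raining",
                "Raining", "Clear", "Overcast"],
})

# Size of the smallest class; every class is sampled down to this many rows.
n_min = toy["SEVERITYCODE"].value_counts().min()

# Draw n_min rows from each class, then shuffle the combined result.
parts = [group.sample(n=n_min, random_state=0)
         for _, group in toy.groupby("SEVERITYCODE")]
balanced = pd.concat(parts).sample(frac=1, random_state=0).reset_index(drop=True)

print(balanced["SEVERITYCODE"].value_counts())
```

Downsampling discards majority-class rows; oversampling the minority class or using class weights are alternatives with different trade-offs.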

### 2.3 Feature Selection

We look at all the columns first to determine which features may be relevant.

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194673 entries, 0 to 194672
Data columns (total 38 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   SEVERITYCODE    194673 non-null  int64  
 1   X               189339 non-null  float64
 2   Y               189339 non-null  float64
 3   OBJECTID        194673 non-null  int64  
 4   INCKEY          194673 non-null  int64  
 5   COLDETKEY       194673 non-null  int64  
 6   REPORTNO        194673 non-null  object 
 7   STATUS          194673 non-null  object 
 8   ADDRTYPE        192747 non-null  object 
 9   INTKEY          65070 non-null   float64
 10  LOCATION        191996 non-null  object 
 11  EXCEPTRSNCODE   84811 non-null   object 
 12  EXCEPTRSNDESC   5638 non-null    object 
 13  SEVERITYCODE.1  194673 non-null  int64  
 14  SEVERITYDESC    194673 non-null  object 
 15  COLLISIONTYPE   189769 non-null  object 
 16  PERSONCOUNT     194673 non-null  int64  
 17  PEDCOUNT  

The features related to external conditions that we can identify are WEATHER, ROADCOND, and LIGHTCOND.
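Before encoding these three columns, it can help to check whether they contain missing values, since rows with a missing category would otherwise end up with all-zero indicator columns. A small sketch on made-up data (the `toy` frame and its values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with a few missing condition values.
toy = pd.DataFrame({
    "WEATHER": ["Clear", np.nan, "Raining", "Overcast"],
    "ROADCOND": ["Dry", "Wet", np.nan, "Dry"],
    "LIGHTCOND": ["Daylight", "Dusk", "Daylight", "Daylight"],
    "SEVERITYCODE": [1, 2, 1, 1],
})

cols = ["WEATHER", "ROADCOND", "LIGHTCOND"]

# Count missing entries per selected feature.
missing_per_column = toy[cols].isna().sum()
print(missing_per_column)

# One option: keep only rows where all three features are present.
complete = toy.dropna(subset=cols)
print(len(complete))
```

Whether to drop such rows or treat missing values as their own category is a modeling choice; this sketch only shows how to inspect and drop them.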

# Week 2

## 3. Data Preparation

The features we wish to use (WEATHER, ROADCOND, and LIGHTCOND) hold categorical strings rather than numbers, so we one-hot encode them and build a feature dataframe.

In [11]:
data.groupby(['LIGHTCOND'])['SEVERITYCODE'].value_counts(normalize = True)

LIGHTCOND                 SEVERITYCODE
Dark - No Street Lights   1               0.782694
                          2               0.217306
Dark - Street Lights Off  1               0.736447
                          2               0.263553
Dark - Street Lights On   1               0.701589
                          2               0.298411
Dark - Unknown Lighting   1               0.636364
                          2               0.363636
Dawn                      1               0.670663
                          2               0.329337
Daylight                  1               0.668116
                          2               0.331884
Dusk                      1               0.670620
                          2               0.329380
Other                     1               0.778723
                          2               0.221277
Unknown                   1               0.955095
                          2               0.044905
Name: SEVERITYCODE, dtype: float64

In [6]:
data[['WEATHER', 'ROADCOND', 'LIGHTCOND']].head()

Unnamed: 0,WEATHER,ROADCOND,LIGHTCOND
0,Overcast,Wet,Daylight
1,Raining,Wet,Dark - Street Lights On
2,Overcast,Dry,Daylight
3,Clear,Dry,Daylight
4,Raining,Wet,Daylight


In [10]:
Feature = data[[]]
Feature = pd.concat([Feature, pd.get_dummies(data['WEATHER'])], axis = 1)
Feature = pd.concat([Feature, pd.get_dummies(data['ROADCOND'])], axis = 1)
Feature = pd.concat([Feature, pd.get_dummies(data['LIGHTCOND'])], axis = 1)
Feature.head()

Unnamed: 0,Blowing Sand/Dirt,Clear,Fog/Smog/Smoke,Other,Overcast,Partly Cloudy,Raining,Severe Crosswind,Sleet/Hail/Freezing Rain,Snowing,...,Wet,Dark - No Street Lights,Dark - Street Lights Off,Dark - Street Lights On,Dark - Unknown Lighting,Dawn,Daylight,Dusk,Other.1,Unknown
0,0,0,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
1,0,0,0,0,0,0,1,0,0,0,...,1,0,0,1,0,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,0,0,0,0,0,0,1,0,0,0,...,1,0,0,0,0,0,1,0,0,0
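Note that the header above contains both `Other` and `Other.1`: the category name "Other" occurs in more than one source column, so concatenating unprefixed dummies produces repeated column names. A hedged alternative is to let `pd.get_dummies` encode all three columns at once, which prefixes each indicator with its source column. The `toy` frame below is an illustrative assumption:

```python
import pandas as pd

# Hypothetical rows where "Other" appears in all three columns.
toy = pd.DataFrame({
    "WEATHER": ["Overcast", "Raining", "Other"],
    "ROADCOND": ["Wet", "Wet", "Other"],
    "LIGHTCOND": ["Daylight", "Other", "Daylight"],
})

# Encoding the columns together yields names like WEATHER_Other and ROADCOND_Other.
features = pd.get_dummies(toy, columns=["WEATHER", "ROADCOND", "LIGHTCOND"], dtype=int)

print(features.columns.tolist())
```

Unique column names make later steps, such as labeling the nodes of a plotted tree, easier to read.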


In [12]:
X = Feature
Y = data ['SEVERITYCODE'].values

In [15]:
X = preprocessing.StandardScaler().fit(X).transform(X)
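The cell above fits the scaler on the full feature matrix before any split. A variant that is often preferred is to split first and fit the scaler on the training rows only, so that no statistics from the test rows leak into preprocessing. The sketch below uses randomly generated 0/1 features (`X_toy`, `y_toy` are assumptions, not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical one-hot style features and alternating class labels.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(20, 3)).astype(float)
y_toy = np.array([1, 2] * 10)

# Split first, using the same test_size and random_state as the notebook.
x_train, x_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=3
)

# Fit the scaler on the training rows only, then apply it to both splits.
scaler = StandardScaler().fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

print(x_train_scaled.shape, x_test_scaled.shape)
```

For a decision tree, the splits depend only on the ordering of feature values, so standardizing is not expected to change the fitted tree's predictions; the step matters more for distance-based or linear models.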

In [21]:
# use a train/test split to train and evaluate the model
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
#from sklearn.externals.six import StringIO

#split the data into training and testing folds
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state = 3)
    
best_depth = 1
lists =[[],[]]
maxdepth = 20

for c in range (3, maxdepth):
    tree = DecisionTreeClassifier(criterion="entropy", max_depth = c)
    tree.fit(x_train, y_train)
    prediction = tree.predict(x_test)
    
    accuracy = metrics.accuracy_score(y_test, prediction)
    lists[0].append(accuracy)
    lists[1].append(c)

    if accuracy == max(lists[0]):
        best_accuracy = accuracy
        best_depth = c
        best_tree = tree

print ("The best accuracy is ", best_accuracy, " with a max depth of ", best_depth)

The best accuracy is  0.6994109790760591  with a max depth of  6
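This accuracy is easier to interpret next to a majority-class baseline. In the full dataset, 136485 of 194673 rows (about 70%) are SEVERITYCODE 1, so a model that always predicts 1 would already score close to the figure reported above. A minimal sketch of that baseline on made-up labels (`y_test_toy` is an assumption for illustration):

```python
import numpy as np

# Hypothetical test labels with roughly the 70/30 split seen in SEVERITYCODE.
y_test_toy = np.array([1, 1, 1, 1, 1, 1, 1, 2, 2, 2])

# Find the most frequent label and predict it for every row.
labels, counts = np.unique(y_test_toy, return_counts=True)
majority_label = labels[np.argmax(counts)]
baseline_pred = np.full_like(y_test_toy, majority_label)

# Fraction of rows the constant prediction gets right.
baseline_accuracy = float(np.mean(baseline_pred == y_test_toy))
print(majority_label, baseline_accuracy)
```

Replacing `y_test_toy` with the notebook's `y_test` would give the baseline for the actual split, which can then be compared against the tree's accuracy.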


In [22]:
!conda install -c conda-forge pydotplus -y
!conda install -c conda-forge python-graphviz -y

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: D:\anaconda3

  added / updated specs:
    - pydotplus


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    conda-4.8.4                |           py38_0         2.9 MB
    graphviz-2.38.0            |                7        37.7 MB  conda-forge
    pydotplus-2.0.2            |             py_2          23 KB  conda-forge
    ------------------------------------------------------------
                                           Total:        40.6 MB

The following NEW packages will be INSTALLED:

  graphviz           conda-forge/win-32::graphviz-2.38.0-7
  pydotplus          conda-forge/noarch::pydotplus-2.0.2-py_2

The following packages will be UPDATED:

  conda                                        4.8.3-py38_0 --> 4.8.4-py38_0





In [31]:
from io import StringIO
import pydotplus
import matplotlib.image as mpimg
from sklearn import tree
%matplotlib inline

dot_data = StringIO()
filename = "loan_tree.png"
# label the nodes with the one-hot encoded feature columns the tree was trained on
featureNames = Feature.columns.tolist()
# export_graphviz needs class names as strings; passing the integer labels
# raised: TypeError: can only concatenate str (not "numpy.int64") to str
targetNames = [str(c) for c in best_tree.classes_]
out=tree.export_graphviz(best_tree,feature_names=featureNames, out_file=dot_data, class_names=targetNames, filled=True,  special_characters=True,rotate=False)  
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png(filename)
img = mpimg.imread(filename)
plt.figure(figsize=(100, 200))
plt.imshow(img,interpolation='nearest')