# Capstone Project - Car accident severity (Week 2)
### Applied Data Science Capstone by Pablo César López

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data Visualization and pre-processing](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

### Introduction: Business Problem <a name="introduction"></a>

This analysis aims to build a machine learning model that predicts the severity of car accidents in Seattle based on features recorded for each accident. Several algorithms will be compared to determine which one best fits the provided test data.

### Data Visualization and pre-processing <a name="data"></a>

Import the Python libraries, download the data set from IBM Cloud, and load it into a pandas DataFrame:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pylab as pltlab
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import matplotlib.ticker as ticker
from sklearn import preprocessing
%matplotlib inline


In [2]:
# np.object was removed in NumPy 1.24; the built-in object is the equivalent dtype
df = pd.read_csv("https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv", dtype={'ST_COLCODE': object})
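
The notebook does not say why `ST_COLCODE` is read with an explicit `object` dtype. One plausible reason is to keep the collision codes as text instead of letting pandas infer a numeric type. The following is a minimal sketch of that effect on a made-up CSV snippet; the sample values, including the leading zero, are assumptions and are not taken from the real file.

```python
import io
import pandas as pd

# Hypothetical CSV content; only the column names come from the data set.
csv_text = "ST_COLCODE,VEHCOUNT\n10,2\n05,1\n32,2\n"

# Default behaviour: pandas infers an integer column and the leading zero is lost.
inferred = pd.read_csv(io.StringIO(csv_text))

# Explicit object dtype: the codes are kept as the strings found in the file.
forced = pd.read_csv(io.StringIO(csv_text), dtype={"ST_COLCODE": object})

print(inferred["ST_COLCODE"].tolist())
print(forced["ST_COLCODE"].tolist())
```

Keeping codes as strings also makes it explicit that they are categories, not quantities.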

In [3]:
df.head()

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


There are 38 columns and 194673 rows. The label is SEVERITYCODE, the value we want to predict. The other columns are candidate features for training the ML models. The next step lists the column names:

In [4]:
df.columns

Index(['SEVERITYCODE', 'X', 'Y', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO',
       'STATUS', 'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE',
       'EXCEPTRSNDESC', 'SEVERITYCODE.1', 'SEVERITYDESC', 'COLLISIONTYPE',
       'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'INCDATE',
       'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE', 'SDOT_COLDESC',
       'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND',
       'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE', 'ST_COLDESC',
       'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR'],
      dtype='object')

Next, we inspect the columns to check their data types:

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194673 entries, 0 to 194672
Data columns (total 38 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   SEVERITYCODE    194673 non-null  int64  
 1   X               189339 non-null  float64
 2   Y               189339 non-null  float64
 3   OBJECTID        194673 non-null  int64  
 4   INCKEY          194673 non-null  int64  
 5   COLDETKEY       194673 non-null  int64  
 6   REPORTNO        194673 non-null  object 
 7   STATUS          194673 non-null  object 
 8   ADDRTYPE        192747 non-null  object 
 9   INTKEY          65070 non-null   float64
 10  LOCATION        191996 non-null  object 
 11  EXCEPTRSNCODE   84811 non-null   object 
 12  EXCEPTRSNDESC   5638 non-null    object 
 13  SEVERITYCODE.1  194673 non-null  int64  
 14  SEVERITYDESC    194673 non-null  object 
 15  COLLISIONTYPE   189769 non-null  object 
 16  PERSONCOUNT     194673 non-null  int64  
 17  PEDCOUNT  

Next, we convert the INCDATE and INCDTTM columns to datetime:

In [6]:
df['INCDATE'] = pd.to_datetime(df['INCDATE'])
df['INCDTTM'] = pd.to_datetime(df['INCDTTM'])
df.head()

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N
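
Once INCDATE is a datetime column, calendar features can be derived through the `.dt` accessor. The names `INCDATE_DAYOFWEEK` and `INCDATE_DAY_MTH` come from the feature list in this notebook, but their exact definition is not given there, so reading them as day of week and day of month is an assumption. The sample dates below are hypothetical.

```python
import pandas as pd

# Hypothetical sample dates standing in for df['INCDATE'].
toy = pd.DataFrame({"INCDATE": pd.to_datetime(["2024-01-01", "2024-01-07"])})

# Day of week as an integer: Monday=0 ... Sunday=6.
toy["INCDATE_DAYOFWEEK"] = toy["INCDATE"].dt.dayofweek

# Day of the month (1-31); assumed meaning of INCDATE_DAY_MTH.
toy["INCDATE_DAY_MTH"] = toy["INCDATE"].dt.day

print(toy)
```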


In [7]:
df['SEVERITYCODE'].value_counts()

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64
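
The counts above show an imbalanced label: class 1 accounts for about 70% of the rows (136485 of 194673). A quick way to read the shares directly is `value_counts(normalize=True)`. The sketch below uses a small made-up series with a similar ratio, not the real data.

```python
import pandas as pd

# Toy stand-in for df['SEVERITYCODE'] with a hypothetical 7:3 ratio.
severity = pd.Series([1] * 7 + [2] * 3, name="SEVERITYCODE")

# normalize=True returns the share of each class instead of raw counts.
class_share = severity.value_counts(normalize=True)
print(class_share)
```

With an imbalance like this, plain accuracy can look good for a model that mostly predicts class 1, so per-class metrics are worth checking when comparing models.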

In [8]:
df['SEVERITYCODE.1'].value_counts()

1    136485
2     58188
Name: SEVERITYCODE.1, dtype: int64
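
Matching value counts suggest that the two columns are the same, but they do not prove it, because two columns can have equal totals with values in different rows. A stricter check compares them row by row. The following is a small sketch on a made-up frame; the values are hypothetical.

```python
import pandas as pd

# Toy frame with a duplicated label column, mimicking SEVERITYCODE / SEVERITYCODE.1.
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 1],
    "SEVERITYCODE.1": [1, 2, 1, 1],
})

# Element-wise comparison, then all(): True only if every row matches.
is_duplicate = bool((toy["SEVERITYCODE"] == toy["SEVERITYCODE.1"]).all())
print(is_duplicate)
```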

The SEVERITYCODE column is repeated as SEVERITYCODE.1 with the same value counts, so we drop the duplicate:

In [10]:
df.drop(columns="SEVERITYCODE.1",inplace=True)

In [11]:
df.shape

(194673, 37)

In [12]:
# Print the value counts of every column to get a feel for the data
for column in df.columns:
    print(column)
    print(df[column].value_counts())
    print("")

SEVERITYCODE
1    136485
2     58188
Name: SEVERITYCODE, dtype: int64

X
-122.332653    265
-122.344896    254
-122.328079    252
-122.344997    239
-122.299160    231
              ... 
-122.322768      1
-122.288680      1
-122.405699      1
-122.323578      1
-122.343898      1
Name: X, Length: 23563, dtype: int64

Y
47.708655    265
47.717173    254
47.604161    252
47.725036    239
47.579673    231
            ... 
47.556705      1
47.709101      1
47.513899      1
47.565438      1
47.563521      1
Name: Y, Length: 23839, dtype: int64

OBJECTID
2047     1
1194     1
58550    1
64693    1
62644    1
        ..
96890    1
90745    1
92792    1
70263    1
2049     1
Name: OBJECTID, Length: 194673, dtype: int64

INCKEY
266238    1
81549     1
104088    1
126615    1
124566    1
         ..
164613    1
176899    1
178946    1
172801    1
295445    1
Name: INCKEY, Length: 194673, dtype: int64

COLDETKEY
266238    1
122129    1
111900    1
101659    1
99610     1
         ..
137750    1
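
Printing the value counts of every column is verbose for identifier-like columns such as OBJECTID, where each value occurs once. A more compact view is the number of distinct values per column. The sketch below is one possible way to do this on a made-up frame; the `id_like` flag is a hypothetical helper column, not part of the original notebook.

```python
import pandas as pd

# Toy frame: one identifier-like column and one categorical column (hypothetical values).
toy = pd.DataFrame({
    "OBJECTID": [1, 2, 3, 4],
    "ADDRTYPE": ["Block", "Block", "Intersection", None],
})

# nunique() counts distinct non-null values per column.
summary = pd.DataFrame({"n_unique": toy.nunique(), "n_rows": len(toy)})

# A column whose distinct count equals the row count behaves like a row identifier.
summary["id_like"] = summary["n_unique"] == summary["n_rows"]
print(summary)
```

Columns flagged this way carry no information that generalizes across rows, which is consistent with dropping them before training.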


#### Identify missing values
##### Evaluating for missing data

In [13]:
missing_data = df.isnull()
missing_data.head(5)

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,False,False,False,False,False,False,False,False,False,False,...,False,False,True,True,True,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,True,...,False,False,True,False,True,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,True,...,False,False,True,False,True,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,True,...,False,False,True,True,True,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,True,False,True,False,False,False,False,False


"True" marks a missing value, while "False" marks a value that is present.

##### Count missing values in each column
We check the number of missing values in each column. As mentioned above, "True" represents a missing value, "False" means the value is present in the dataset:

In [14]:
# Recompute the mask so this cell does not depend on In [13] having been run first
missing_data = df.isnull()
for column in missing_data.columns:
    print(column)
    print(missing_data[column].value_counts())
    print("")
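
As a more compact alternative to looping over the boolean mask, the missing values can be counted per column in one step, since `True` sums as 1. The following is a minimal sketch on a made-up frame; the values are hypothetical and only the column names come from the data set.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing entries (hypothetical values).
toy = pd.DataFrame({
    "X": [-122.32, np.nan, -122.33],
    "SPEEDING": [None, None, "Y"],
    "SEVERITYCODE": [2, 1, 1],
})

mask = toy.isnull()

# sum() gives the count of missing values, mean() the share per column.
report = pd.DataFrame({"missing": mask.sum(), "share": mask.mean()})
report = report.sort_values("missing", ascending=False)
print(report)
```

Sorting by the count puts the sparsest columns first, which helps when deciding between dropping a column and dropping rows.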

Now we choose the features for training the ML model:

| Column | Decision |
|---|---|
| SEVERITYCODE | label |
| X | feature; drop nulls |
| Y | feature; drop nulls |
| OBJECTID | drop column |
| INCKEY | drop column |
| COLDETKEY | drop column |
| REPORTNO | drop column |
| STATUS | drop column |
| ADDRTYPE | feature; one-hot encoding, drop nulls |
| INTKEY | drop column |
| LOCATION | drop column |
| EXCEPTRSNCODE | drop column |
| EXCEPTRSNDESC | drop column |
| SEVERITYDESC | drop column |
| COLLISIONTYPE | one-hot encoding |
| PERSONCOUNT | feature |
| PEDCOUNT | feature |
| PEDCYLCOUNT | feature |
| VEHCOUNT | feature |
| INCDATE_DAYOFWEEK | feature calculated from INCDATE |
| INCDATE_DAY_MTH | feature calculated from INCDATE |
| JUNCTIONTYPE | one-hot encoding |
| SDOT_COLCODE | OK (kept as is) |
| SDOT_COLDESC | drop column because it is descriptive |
| INATTENTIONIND | drop column because it cannot be known before a forensic analysis |
| UNDERINFL | drop column because it cannot be known before a forensic analysis |
| WEATHER | one-hot encoding |
| ROADCOND | drop column because it cannot be known before a forensic analysis |
| LIGHTCOND | drop column because it cannot be known before a forensic analysis |
| PEDROWNOTGRNT | drop column because it cannot be known before a forensic analysis |
| SPEEDING | drop column because it cannot be known before a forensic analysis |
| ST_COLCODE | feature |
| ST_COLDESC | drop column because it is descriptive |
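
The decisions listed above can be sketched as a short preprocessing step: drop the unused columns, drop rows with nulls in the selected features, and one-hot encode the categorical columns with `pd.get_dummies`. This is only an illustration on a made-up frame with a handful of the columns; the sample values are hypothetical and the notebook's actual preprocessing code may differ.

```python
import numpy as np
import pandas as pd

# Toy frame with a subset of the columns (hypothetical values).
toy = pd.DataFrame({
    "SEVERITYCODE": [2, 1, 1, 2],
    "X": [-122.32, -122.35, np.nan, -122.31],
    "Y": [47.70, 47.65, 47.61, 47.55],
    "OBJECTID": [1, 2, 3, 4],
    "ADDRTYPE": ["Intersection", "Block", "Block", None],
    "WEATHER": ["Overcast", "Raining", "Clear", "Clear"],
    "PERSONCOUNT": [2, 2, 4, 3],
})

# 1. Drop columns marked "drop column" (only OBJECTID exists in this toy frame).
features = toy.drop(columns=["OBJECTID"])

# 2. Drop rows with nulls in the columns marked "drop nulls".
features = features.dropna(subset=["X", "Y", "ADDRTYPE"])

# 3. One-hot encode the categorical columns.
features = pd.get_dummies(features, columns=["ADDRTYPE", "WEATHER"])

# 4. Separate the label from the feature matrix.
y = features.pop("SEVERITYCODE")
print(features.columns.tolist())
```

Dropping rows before encoding avoids creating dummy columns for categories that only appear in discarded rows.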

## Methodology <a name="methodology"></a>

The analytical approach is predictive modeling. In line with the main goal of this project, we will determine the best ML model for predicting the severity of car crashes in Seattle. We will train different algorithms and compare them against each other to select the one that best fits the data set.
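
A comparison of this kind can be sketched as a loop over candidate classifiers that are trained on the same split and scored with the same metric. This section does not name the algorithms, so the three models below (k-nearest neighbors, decision tree, logistic regression), their hyperparameters, the weighted F1 metric, and the synthetic data are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced two-class data standing in for the prepared features.
X, y = make_classification(n_samples=500, n_features=8, weights=[0.7, 0.3], random_state=0)

# Stratify so both splits keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical candidate models; hyperparameters are placeholders.
candidates = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "logreg": LogisticRegression(max_iter=1000),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = f1_score(y_test, model.predict(X_test), average="weighted")

best = max(scores, key=scores.get)
print(scores)
print("best:", best)
```

Using one shared split and one metric keeps the comparison fair; which model wins on the real data cannot be inferred from this synthetic example.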

## Analysis <a name="analysis"></a>