# Analysis of Collision Data

This is the notebook I will be using to complete my capstone project for the IBM Data Science Professional Certificate.

In [None]:
import pandas as pd
import numpy as np

## Business Understanding/Introduction

From 2004 to the present, the Seattle Police Department has reported 194,673 collisions to the Seattle Department of Transportation (SDOT). Of those, 58,188 (about 30%) involved an injury. We aim to reduce the number of accidents, especially those with injuries, in order to improve the wellbeing and longevity of our community. We will prepare a presentation for SDOT and the Vision Zero Network, "a collaborative campaign helping communities reach their goals of Vision Zero -- eliminating all traffic fatalities and severe injuries -- while increasing safe, healthy, equitable mobility for all." (https://visionzeronetwork.org)


## Data

We will examine attributes such as the following, listed with the non-null counts and dtypes reported by `df.info()`:

INATTENTIONIND    29805 non-null object
UNDERINFL         189789 non-null object
WEATHER           189592 non-null object
ROADCOND          189661 non-null object
LIGHTCOND         189503 non-null object
PEDROWNOTGRNT     4667 non-null object
SDOTCOLNUM        114936 non-null float64
SPEEDING
2. Data Requirements: We’re going to use available employee data including work hours and on-the-job experience (to rule out errors made from inexperience), patient data including numbers of admissions and types of illnesses (to look at whether the most errors are occurring with more obscure or hard to manage illness), and the data we have on the errors that have occurred both before and after the recent layoffs.


3. Data Collection: We have collected our available employee, patient, and error data, but we decide to collect additional data by conducting surveys of our employees and patients to review their experiences and general understanding of what is happening in the hospital.


4. Data Understanding & Preparation: Now we will use descriptive statistics and visualization techniques to examine the data we have collected. We find that there is a pattern between the number of hours that the medical professionals are working and the number of errors that have occurred. Before the recent layoffs most employees worked 55 hours a week, but since the staff was reduced many have been working 75 and more hours. The number of errors has doubled in that time. We realize we need to know the cost of the errors as well as the savings from the layoffs so we circle back to the data collection stage to gather this information. It is now clear that the savings to the hospital have been lost with the added costs of time and resources fixing errors, not to mention the human costs of the errors. We are now able to enter the preparation stage where we will combine all the data from our surveys, employee data, patient data, and data on errors. We also take this time to thoroughly clean the data of all inaccuracies, duplicates, inconsistencies, missing values, etcetera.

In [4]:
# The code was removed by Watson Studio for sharing.



Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194673 entries, 0 to 194672
Data columns (total 38 columns):
SEVERITYCODE      194673 non-null int64
X                 189339 non-null float64
Y                 189339 non-null float64
OBJECTID          194673 non-null int64
INCKEY            194673 non-null int64
COLDETKEY         194673 non-null int64
REPORTNO          194673 non-null object
STATUS            194673 non-null object
ADDRTYPE          192747 non-null object
INTKEY            65070 non-null float64
LOCATION          191996 non-null object
EXCEPTRSNCODE     84811 non-null object
EXCEPTRSNDESC     5638 non-null object
SEVERITYCODE.1    194673 non-null int64
SEVERITYDESC      194673 non-null object
COLLISIONTYPE     189769 non-null object
PERSONCOUNT       194673 non-null int64
PEDCOUNT          194673 non-null int64
PEDCYLCOUNT       194673 non-null int64
VEHCOUNT          194673 non-null int64
INCDATE           194673 non-null object
INCDTTM           194673 non-null object
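The non-null counts above can be converted into missing-value shares, which helps when deciding which attributes are usable. The following is a minimal sketch that assumes only the counts printed in this notebook: the `non_null` values are typed in by hand from the `df.info()` output rather than read from the data, and the variable names are illustrative.

```python
import pandas as pd

total_rows = 194673  # RangeIndex size reported by df.info()

# Non-null counts copied by hand from the df.info() output in this notebook.
non_null = pd.Series({
    "X": 189339,
    "ADDRTYPE": 192747,
    "INTKEY": 65070,
    "EXCEPTRSNDESC": 5638,
    "COLLISIONTYPE": 189769,
    "INATTENTIONIND": 29805,
    "WEATHER": 189592,
    "PEDROWNOTGRNT": 4667,
})

# Share of rows in which each column is missing, largest first.
missing_share = (1 - non_null / total_rows).sort_values(ascending=False)
print(missing_share.round(3))
```

On the full DataFrame, `df.isnull().mean()` should give the same shares for every column at once. A very high missing share in an indicator column such as `INATTENTIONIND` may mean "not flagged" rather than "unknown"; that interpretation is an assumption to check against the dataset's metadata.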

In [6]:
df['PEDCOUNT'].describe()

count    194673.000000
mean          0.037139
std           0.198150
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max           6.000000
Name: PEDCOUNT, dtype: float64

In [7]:
df['PEDCOUNT'].unique()

array([0, 1, 2, 3, 4, 5, 6])

In [8]:
df['PEDCOUNT'].isnull().sum()

0

In [9]:
df['PEDCOUNT'].value_counts()

0    187734
1      6685
2       226
3        22
4         4
6         1
5         1
Name: PEDCOUNT, dtype: int64
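As a quick consistency check, the counts above can be used to recompute the mean from `describe()` and to get the share of collisions that involved at least one pedestrian. This is a small sketch that re-enters the printed counts by hand; `ped_counts` and the other names are illustrative, not part of the original analysis.

```python
import pandas as pd

# PEDCOUNT value -> number of collisions, copied from the value_counts() output.
ped_counts = pd.Series({0: 187734, 1: 6685, 2: 226, 3: 22, 4: 4, 5: 1, 6: 1})

total = ped_counts.sum()

# Collisions with at least one pedestrian involved.
ped_involved = ped_counts[ped_counts.index > 0].sum()
share_with_ped = ped_involved / total

# Weighted mean of PEDCOUNT, which should agree with describe()['mean'].
mean_ped = (ped_counts.index.to_numpy() * ped_counts.to_numpy()).sum() / total

print(f"collisions involving pedestrians: {ped_involved} ({share_with_ped:.2%})")
print(f"mean PEDCOUNT: {mean_ped:.6f}")
```

The total should match the 194,673 rows reported by `df.info()`, and the recomputed mean should match the `describe()` output, which suggests the two summaries describe the same column without hidden missing values.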

In [10]:
df['SEVERITYCODE.1'].value_counts()

1    136485
2     58188
Name: SEVERITYCODE.1, dtype: int64
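The two severity classes are not equally frequent, which matters if severity is later used as a prediction target. A minimal sketch of the class shares, using only the two counts printed above (re-entered by hand; the variable names are illustrative):

```python
import pandas as pd

# Severity code -> number of collisions, copied from the value_counts() output.
# Code 1 is property damage only, code 2 is an injury collision.
severity_counts = pd.Series({1: 136485, 2: 58188}, name="SEVERITYCODE.1")

# Fraction of all collisions in each class.
severity_share = severity_counts / severity_counts.sum()
print(severity_share.round(3))
```

On the full DataFrame, `df['SEVERITYCODE.1'].value_counts(normalize=True)` should return the same shares directly. Roughly 70% of the records are property-damage-only, so a model that always predicts code 1 would already be right about 70% of the time; any later accuracy figure should be read against that baseline.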

In [11]:
df['SEVERITYCODE.1'].unique()

array([2, 1])

In [12]:
df['SEVERITYDESC'].unique()

array(['Injury Collision', 'Property Damage Only Collision'], dtype=object)

In [13]:
df.columns

Index(['SEVERITYCODE', 'X', 'Y', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO',
       'STATUS', 'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE',
       'EXCEPTRSNDESC', 'SEVERITYCODE.1', 'SEVERITYDESC', 'COLLISIONTYPE',
       'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'INCDATE',
       'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE', 'SDOT_COLDESC',
       'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND',
       'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE', 'ST_COLDESC',
       'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR'],
      dtype='object')
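The column list contains both `SEVERITYCODE` and `SEVERITYCODE.1`. A `.1` suffix is what `pd.read_csv` appends when a header name is repeated, so the source file probably carries the same column twice; since the loading cell was removed, this is an assumption. The sketch below reproduces the behaviour on a made-up three-row CSV (the rows are hypothetical, not taken from the dataset) and drops the copy only if it is identical.

```python
import io

import pandas as pd

# Hypothetical toy CSV with a repeated header name, for illustration only.
csv_text = "SEVERITYCODE,PEDCOUNT,SEVERITYCODE\n2,0,2\n1,0,1\n1,1,1\n"

sample = pd.read_csv(io.StringIO(csv_text))
loaded_columns = list(sample.columns)
print(loaded_columns)  # the second SEVERITYCODE is renamed by pandas

# Drop the duplicate only after confirming both columns hold the same values.
is_duplicate = sample["SEVERITYCODE"].equals(sample["SEVERITYCODE.1"])
if is_duplicate:
    sample = sample.drop(columns=["SEVERITYCODE.1"])

print(list(sample.columns))
```

Running the same `equals` check on `df` would show whether `SEVERITYCODE.1` can be dropped safely before modelling.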

In [14]:
df['COLLISIONTYPE'].unique()

array(['Angles', 'Sideswipe', 'Parked Car', 'Other', 'Cycles',
       'Rear Ended', 'Head On', nan, 'Left Turn', 'Pedestrian',
       'Right Turn'], dtype=object)

In [15]:
# Keep only the columns needed to relate collision type to severity.
df_test = df[['COLLISIONTYPE', 'SEVERITYDESC', 'SEVERITYCODE.1', 'PEDCOUNT']]
df_test.head(10)

Unnamed: 0,COLLISIONTYPE,SEVERITYDESC,SEVERITYCODE.1,PEDCOUNT
0,Angles,Injury Collision,2,0
1,Sideswipe,Property Damage Only Collision,1,0
2,Parked Car,Property Damage Only Collision,1,0
3,Other,Property Damage Only Collision,1,0
4,Angles,Injury Collision,2,0
5,Angles,Property Damage Only Collision,1,0
6,Angles,Property Damage Only Collision,1,0
7,Cycles,Injury Collision,2,0
8,Parked Car,Property Damage Only Collision,1,0
9,Angles,Injury Collision,2,0


In [16]:
(df_test['SEVERITYDESC'] == "Injury Collision").sum()

58188

In [17]:
(df_test['SEVERITYDESC'] == "Property Damage Only Collision").sum()

136485
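A natural next step is to compare injury rates across collision types. The sketch below shows one way to do that with `groupby`, using only the ten `df_test` rows displayed earlier, re-entered by hand. Ten rows are far too few to draw conclusions from; the point is the shape of the calculation, and the names `sample`, `is_injury`, and `by_type` are illustrative.

```python
import pandas as pd

# The ten df_test rows shown earlier in this notebook, typed in by hand.
sample = pd.DataFrame({
    "COLLISIONTYPE": ["Angles", "Sideswipe", "Parked Car", "Other", "Angles",
                      "Angles", "Angles", "Cycles", "Parked Car", "Angles"],
    "SEVERITYCODE.1": [2, 1, 1, 1, 2, 1, 1, 2, 1, 2],
})

# Severity code 2 corresponds to 'Injury Collision'.
sample["is_injury"] = sample["SEVERITYCODE.1"] == 2

# Number of collisions and share with an injury, per collision type.
by_type = (
    sample.groupby("COLLISIONTYPE")["is_injury"]
    .agg(collisions="count", injury_rate="mean")
    .sort_values("injury_rate", ascending=False)
)
print(by_type)
```

Applied to the full `df_test`, the same aggregation would give an injury rate for every collision type. Note that `groupby` drops rows where `COLLISIONTYPE` is `nan` by default, so those records would need separate handling.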