# Introduction: Overview of Traffic Accident Prediction Capstone Project

In this project I will build a model that predicts the **outcome of traffic accidents**.

The analysis is based on the accident reports of the **Seattle Police Department**. The dataset contains historical data on every accident since 2004, roughly 194 thousand events. Each record describes an accident by date and time of day, circumstances, participants, etc., and states the **severity of the outcome**: whether the accident caused property damage only or also personal injury.


## Business Problem

### Who can use the findings of our predictions, and how?

First and foremost, we expect to produce an actionable list for the **Seattle Police Department**:

- knowing _where_ severe accidents happen, they can take precautionary steps, such as changing traffic signs, maintaining a continuous presence, adjusting speed limits, etc.
- knowing which factors contribute to a potentially severe accident, they can initiate targeted checks, such as alcohol/drug tests, etc.

Besides the police, **insurance companies** could also profit from our predictions:

- they might raise insurance fees for residents of risky areas
- they might calculate insurance fees for personal injuries
- they would also gain insight into the traffic risks of cyclists

On the subject of risk, **real estate agents** or **prospective buyers** might also find it worthwhile to know which areas are the safest.


## The Data

The source of the data is the Seattle Police Department.

The raw data contains 38 columns and 194673 rows.

In [2]:
import pandas as pd
import numpy as np

# Import the raw data
path = 'Data-Collisions.csv'
df = pd.read_csv(path)


Columns (33) have mixed types.Specify dtype option on import or set low_memory=False.
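
The position in the warning above is zero-based; in the column listing printed by `df.dtypes` it corresponds to `ST_COLCODE`. One possible way to avoid the warning is to read that column as text explicitly. The sketch below uses a tiny in-memory stand-in for the CSV (the sample rows are made up for illustration; only the column names come from the dataset):

```python
import io
import pandas as pd

# Made-up stand-in for 'Data-Collisions.csv'; only the column names are real.
sample_csv = io.StringIO(
    "SEVERITYCODE,ST_COLCODE\n"
    "2,10\n"
    "1,\n"
    "1, \n"
)

# Reading the code column as text keeps numeric-looking and blank codes
# in one consistent dtype instead of letting pandas guess per chunk.
sample = pd.read_csv(sample_csv, dtype={"ST_COLCODE": str})
print(sample.dtypes)
```

Passing `low_memory=False`, as the warning itself suggests, is the other option; the explicit `dtype` has the advantage of documenting the intended type.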



In [15]:
df.head()

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


In [3]:
print(df.dtypes)

SEVERITYCODE        int64
X                 float64
Y                 float64
OBJECTID            int64
INCKEY              int64
COLDETKEY           int64
REPORTNO           object
STATUS             object
ADDRTYPE           object
INTKEY            float64
LOCATION           object
EXCEPTRSNCODE      object
EXCEPTRSNDESC      object
SEVERITYCODE.1      int64
SEVERITYDESC       object
COLLISIONTYPE      object
PERSONCOUNT         int64
PEDCOUNT            int64
PEDCYLCOUNT         int64
VEHCOUNT            int64
INCDATE            object
INCDTTM            object
JUNCTIONTYPE       object
SDOT_COLCODE        int64
SDOT_COLDESC       object
INATTENTIONIND     object
UNDERINFL          object
WEATHER            object
ROADCOND           object
LIGHTCOND          object
PEDROWNOTGRNT      object
SDOTCOLNUM        float64
SPEEDING           object
ST_COLCODE         object
ST_COLDESC         object
SEGLANEKEY          int64
CROSSWALKKEY        int64
HITPARKEDCAR       object
dtype: object

## Renaming columns to verbose names

Most of the column names speak for themselves; however, we would like a more readable format, at least for the columns that will be used in the analysis.


In [4]:
# Map the original column names to more readable ones
df.rename(columns = {
    'SEVERITYCODE': 'Severity_level',
    'SEVERITYDESC': 'Severity_desc',
    'ADDRTYPE': 'Location_type',
    'LOCATION': 'Location',
    'COLLISIONTYPE': 'Collision_type',
    'PERSONCOUNT': 'Affected_persons',
    'PEDCOUNT': 'Affected_pedestrians',
    'PEDCYLCOUNT': 'Affected_bikers',
    'VEHCOUNT': 'Affected_vehicles',
    'INCDATE': 'Accident_date',
    'INCDTTM': 'Accident_time',
    'JUNCTIONTYPE': 'Junctiontype',
    'SDOT_COLCODE': 'Accident_code',
    'SDOT_COLDESC': 'Accident_desc',
    'INATTENTIONIND': 'Inattention',
    'UNDERINFL': 'Under_influence',
    'WEATHER': 'Weather',
    'ROADCOND': 'Road_condition',
    'LIGHTCOND': 'Light_condition',
    'PEDROWNOTGRNT': 'Pedestrian_right',
    'SPEEDING': 'Speeding',
    'ST_COLCODE': 'Collision_code',
    'ST_COLDESC': 'Collision_description',
}, inplace = True)
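
By default, `rename` silently skips a key that does not match any column, so a typo in one of the original names would go unnoticed. As a small sketch on a made-up two-column frame, passing `errors="raise"` turns such a typo into a `KeyError` (the misspelled `ROADCONDITION` key is deliberate and hypothetical):

```python
import pandas as pd

# Toy frame using two of the original column names; the values are made up.
toy = pd.DataFrame({"SEVERITYCODE": [1, 2], "ROADCOND": ["Dry", "Wet"]})

# A non-inplace rename returns a new frame and fails loudly on unknown keys.
renamed = toy.rename(
    columns={"SEVERITYCODE": "Severity_level", "ROADCOND": "Road_condition"},
    errors="raise",
)

# A misspelled source name is reported instead of being ignored.
try:
    toy.rename(columns={"ROADCONDITION": "Road_condition"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True

print(list(renamed.columns), typo_detected)
```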

In [5]:
print(df.dtypes)

Severity_level             int64
X                        float64
Y                        float64
OBJECTID                   int64
INCKEY                     int64
COLDETKEY                  int64
REPORTNO                  object
STATUS                    object
Location_type             object
INTKEY                   float64
Location                  object
EXCEPTRSNCODE             object
EXCEPTRSNDESC             object
SEVERITYCODE.1             int64
Severity_desc             object
Collision_type            object
Affected_persons           int64
Affected_pedestrians       int64
Affected_bikers            int64
Affected_vehicles          int64
Accident_date             object
Accident_time             object
Junctiontype              object
Accident_code              int64
Accident_desc             object
Inattention               object
Under_influence           object
Weather                   object
Road_condition            object
Light_condition           object
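Pedestrian_right          object
SDOTCOLNUM               float64
Speeding                  object
Collision_code            object
Collision_description     object
SEGLANEKEY                 int64
CROSSWALKKEY               int64
HITPARKEDCAR              object
dtype: object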

In [6]:
df.shape

(194673, 38)

## Cleaning data

1. The column EXCEPTRSNCODE flags accidents that do not have enough data recorded (value 'NEI').

In [7]:
df.loc[df['EXCEPTRSNCODE'] == 'NEI']

Unnamed: 0,Severity_level,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,Location_type,INTKEY,...,Road_condition,Light_condition,Pedestrian_right,SDOTCOLNUM,Speeding,Collision_code,Collision_description,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
74,1,,,86,5721,5721,1786583,Unmatched,,,...,Dry,Daylight,,3239035.0,,32,One parked--one moving,0,0,N
103,1,-122.340986,47.664662,119,330807,332307,EA29618,Matched,Block,,...,Dry,Daylight,,,,32,One parked--one moving,0,0,Y
106,1,-122.377451,47.562455,122,1319,1319,3615284,Matched,Block,,...,Dry,Daylight,,,,32,One parked--one moving,0,0,N
116,1,-122.331859,47.617610,132,1303,1303,3645762,Matched,Intersection,29172.0,...,Dry,Daylight,,,,16,From same direction - one right turn - one str...,0,0,N
171,1,,,193,320833,322333,E937155,Matched,,,...,Dry,Daylight,,,,32,One parked--one moving,0,0,N
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
194600,1,-122.351301,47.650281,219462,309036,310316,E878456,Unmatched,Block,,...,,,,,,,,0,0,N
194621,1,-122.320058,47.642273,219485,307425,308705,E863226,Matched,Block,,...,Unknown,Daylight,,,,32,One parked--one moving,0,0,Y
194623,1,-122.392754,47.515273,219487,312221,313641,3746065,Matched,Intersection,38057.0,...,Unknown,Daylight,,,,14,From same direction - both going straight - on...,0,0,N
194657,1,-122.337137,47.610709,219525,307834,309114,3811279,Matched,Block,,...,Wet,Dark - Street Lights On,,,,11,From same direction - both going straight - bo...,0,0,N


The missing information is usually the **geolocation** of the accident (X, Y coordinates), the **road condition**, the **accident type** or the **location type**. I consider these fields crucial for the model, so I drop these rows.

This affects less than 3% of the data (5638 of 194673 rows), which is an acceptable loss.
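
Both claims above (which fields tend to be missing, and how large the loss is) can be checked directly. The following is a minimal sketch on a made-up miniature table, not the real data; only the column names are taken from the notebook:

```python
import numpy as np
import pandas as pd

# Made-up miniature of the collision table; values are illustrative only.
toy = pd.DataFrame({
    "EXCEPTRSNCODE": ["NEI", None, "NEI", None, None],
    "X": [np.nan, -122.33, -122.34, -122.30, -122.31],
    "Road_condition": ["Dry", "Wet", None, "Dry", "Wet"],
})

# Rows flagged as not having enough information.
flagged = toy["EXCEPTRSNCODE"] == "NEI"

# Fraction of all rows that would be dropped.
share_dropped = flagged.mean()

# Fraction of missing values per column, within the flagged rows only.
missing_in_flagged = toy.loc[flagged, ["X", "Road_condition"]].isna().mean()

print(f"{share_dropped:.1%} of rows flagged")
print(missing_in_flagged)
```

Run against the real `df`, the same two expressions would show whether the flagged rows are in fact the ones with missing coordinates or road conditions.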

In [8]:
df = df[df.EXCEPTRSNCODE != 'NEI']
df.shape

(189035, 38)

In [13]:
import pandas as pd

!conda install -c conda-forge folium=0.5.0 --yes
import folium

print('Folium installed and imported!')

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 4.8.3
  latest version: 4.8.4

Please update conda by running

    $ conda update -n base -c defaults conda



# All requested packages already installed.

Folium installed and imported!


In [None]:
# folium expects the map center as [latitude, longitude]
seattle_loc = folium.Map(
       location=[47.6062, -122.3321],
       zoom_start=12)

# keep only the accidents that have coordinates
filtered_df = df.dropna(subset=['X', 'Y'])

# folium expects [latitude, longitude], i.e. [Y, X] in this dataset
locations = filtered_df[['Y', 'X']]

locationlist = locations.values.tolist()
len(locationlist)
locationlist[111114]

In [36]:
limit = 40000
#df_incidents = filtered_df.iloc[0:limit, :]

df_incidents = filtered_df[filtered_df.Accident_date >= '2020/01/01 00:00:00+00']


latitude = 47.6062
longitude = -122.3321

# create map and display it
seattle_map = folium.Map(location=[latitude, longitude], zoom_start=12)

# display the map of Seattle
seattle_map
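
The date filter in this cell compares dates as text, which works only as long as every value follows the same zero-padded `YYYY/MM/DD ...` layout. A hedged alternative, assuming that layout holds for the whole column (the sample values below are made up), is to parse the date part first and compare real timestamps:

```python
import pandas as pd

# Made-up sample in the layout implied by the filter string used in the notebook.
toy = pd.DataFrame({
    "Accident_date": [
        "2019/12/31 00:00:00+00",
        "2020/01/01 00:00:00+00",
        "2020/03/15 00:00:00+00",
    ]
})

# Keep only the 'YYYY/MM/DD' prefix and parse it with an explicit format.
parsed = pd.to_datetime(toy["Accident_date"].str[:10], format="%Y/%m/%d")

# Compare timestamps instead of strings.
recent = toy[parsed >= pd.Timestamp("2020-01-01")]
print(len(recent))
```

An explicit format also makes a malformed value raise an error at parse time rather than slip through the comparison.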

In [37]:
# instantiate a feature group for the incidents in the dataframe
incidents = folium.map.FeatureGroup()

# loop through the selected accidents and add each to the incidents feature group
for lat, lng in zip(df_incidents.Y, df_incidents.X):
    incidents.add_child(
        folium.features.CircleMarker(
            [lat, lng],
            radius=1, # define how big you want the circle markers to be
            color='yellow',
            fill=True,
            fill_color='blue',
            fill_opacity=0.6
        )
    )

# add incidents to map
seattle_map.add_child(incidents)
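
Drawing one marker per accident can become slow for large selections. The unused `limit` variable defined earlier hints at capping the number of points; one possible way to do that, shown here on made-up coordinates, is a reproducible random sample rather than simply the first rows:

```python
import pandas as pd

# The notebook defines limit = 40000; a small value is used for this toy data.
limit = 3

# Made-up coordinates in the same X (longitude) / Y (latitude) layout.
toy = pd.DataFrame({
    "X": [-122.30, -122.31, -122.32, -122.33, -122.34],
    "Y": [47.60, 47.61, 47.62, 47.63, 47.64],
})

# Sample at most `limit` rows; random_state makes the selection repeatable.
capped = toy.sample(n=min(limit, len(toy)), random_state=0)

# Build [latitude, longitude] pairs, the order used for the markers.
points = capped[["Y", "X"]].values.tolist()
print(len(points))
```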

In [40]:
from folium.plugins import MarkerCluster

# let's start again with a clean copy of the map of Seattle
seattle_map = folium.Map(location = [latitude, longitude], zoom_start = 12)

# instantiate a marker cluster object for the incidents in the dataframe
incidents = MarkerCluster().add_to(seattle_map)

# loop through the dataframe and add each data point to the marker cluster
for lat, lng, label in zip(df_incidents.Y, df_incidents.X, df_incidents.Severity_desc):
    folium.Marker(
        location=[lat, lng],
        icon=None,
        popup=label,
    ).add_to(incidents)

# display map
seattle_map

In [None]:
# ROADCOND was renamed to Road_condition earlier in the notebook
df["Road_condition"].value_counts()

In [7]:
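# correlate the numeric columns only (recent pandas versions raise on text columns)
df.select_dtypes(include='number').corr()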

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,INTKEY,SEVERITYCODE.1,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,SDOT_COLCODE,SDOTCOLNUM,SEGLANEKEY,CROSSWALKKEY
SEVERITYCODE,1.0,0.010309,0.017737,0.020131,0.022065,0.022079,0.006553,1.0,0.130949,0.246338,0.214218,-0.054686,0.188905,0.004226,0.104276,0.175093
X,0.010309,1.0,-0.160262,0.009956,0.010309,0.0103,0.120754,0.010309,0.012887,0.011304,-0.001752,-0.012168,0.010904,-0.001016,-0.001618,0.013586
Y,0.017737,-0.160262,1.0,-0.023848,-0.027396,-0.027415,-0.114935,0.017737,-0.01385,0.010178,0.026304,0.017058,-0.019694,-0.006958,0.004618,0.009508
OBJECTID,0.020131,0.009956,-0.023848,1.0,0.946383,0.945837,0.046929,0.020131,-0.062333,0.024604,0.034432,-0.09428,-0.037094,0.969276,0.028076,0.056046
INCKEY,0.022065,0.010309,-0.027396,0.946383,1.0,0.999996,0.048524,0.022065,-0.0615,0.024918,0.031342,-0.107528,-0.027617,0.990571,0.019701,0.048179
COLDETKEY,0.022079,0.0103,-0.027415,0.945837,0.999996,1.0,0.048499,0.022079,-0.061403,0.024914,0.031296,-0.107598,-0.027461,0.990571,0.019586,0.048063
INTKEY,0.006553,0.120754,-0.114935,0.046929,0.048524,0.048499,1.0,0.006553,0.001886,-0.004784,0.000531,-0.012929,0.007114,0.032604,-0.01051,0.01842
SEVERITYCODE.1,1.0,0.010309,0.017737,0.020131,0.022065,0.022079,0.006553,1.0,0.130949,0.246338,0.214218,-0.054686,0.188905,0.004226,0.104276,0.175093
PERSONCOUNT,0.130949,0.012887,-0.01385,-0.062333,-0.0615,-0.061403,0.001886,0.130949,1.0,-0.023464,-0.038809,0.380523,-0.12896,0.011784,-0.021383,-0.032258
PEDCOUNT,0.246338,0.011304,0.010178,0.024604,0.024918,0.024914,-0.004784,0.246338,-0.023464,1.0,-0.01692,-0.261285,0.260393,0.021461,0.00181,0.565326


In [9]:
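# absolute correlations between the numeric columns
c = df.select_dtypes(include='number').corr().abs()

# flatten the matrix into (column, column) pairs and sort ascending
s = c.unstack()
so = s.sort_values(kind="quicksort")
so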

PEDCYLCOUNT     INTKEY          0.000531
INTKEY          PEDCYLCOUNT     0.000531
X               SDOTCOLNUM      0.001016
SDOTCOLNUM      X               0.001016
X               SEGLANEKEY      0.001618
                                  ...   
PEDCYLCOUNT     PEDCYLCOUNT     1.000000
VEHCOUNT        VEHCOUNT        1.000000
SDOT_COLCODE    SDOT_COLCODE    1.000000
SEVERITYCODE.1  SEVERITYCODE    1.000000
CROSSWALKKEY    CROSSWALKKEY    1.000000
Length: 256, dtype: float64
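
The sorted series above lists every pair twice and ends with the trivial self-correlations of 1.0. The sketch below shows one way to keep each pair only once, using a made-up numeric frame. It also illustrates the kind of exact duplicate that `SEVERITYCODE.1` appears to be, judging by its correlation of 1.0 with `SEVERITYCODE`:

```python
import numpy as np
import pandas as pd

# Made-up numeric sample; the duplicated column mimics SEVERITYCODE.1.
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 2, 1],
    "SEVERITYCODE.1": [1, 2, 1, 2, 1],
    "PEDCOUNT": [0, 1, 0, 0, 0],
})

c = toy.corr().abs()

# Keep only the cells strictly above the diagonal:
# one entry per pair and no self-pairs.
upper = np.triu(np.ones(c.shape, dtype=bool), k=1)
pairs = c.where(upper).stack().dropna().sort_values(ascending=False)
print(pairs)

# A suspected duplicate can be confirmed by direct comparison before dropping it.
is_duplicate = bool((toy["SEVERITYCODE"] == toy["SEVERITYCODE.1"]).all())
print(is_duplicate)
```

If the same check holds on the real data, dropping the duplicated column before modelling would avoid leaking the target into the features.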