# Cleaning the fatalities/injuries data
This notebook explains the cleaning process for the fatalities/injuries data, which is executed in the `make-master.py` cleaning script in the `cleaning-workflow/cleaning-scripts` folder of the data repository. The cleaning makes comparisons across years and mapping possible, but it comes with important caveats, explained below, that should be noted before doing much comparing.

In [1]:
import pandas as pd
import numpy as np
import re
from datetime import date, time

pd.set_option('display.max_columns', None)

First, let's look at the fatality/injury data structure in our different source data files.

In [2]:
# read in the csv
crash_df = pd.read_csv('../data/source-data/moco-crash-2022.csv')  
crash_df[['Number Dead','Number Injured']]

Unnamed: 0,Number Dead,Number Injured
0,0,0
1,0,0
2,0,0
3,0,0
4,0,0
...,...,...
3641,0,0
3642,0,1
3643,0,2
3644,0,0


This formatting is simple and useful. It gives a number of fatalities and injuries as an `int` for each crash. This is the same format in the files for 2021, 2020 and 2019.

Let's look at the 2013-2018 file.

In [3]:
# read in the csv
crash_df_1318 = pd.read_csv('../data/source-data/moco-crash-2013-2018.csv', encoding='unicode_escape')  
crash_df_1318[['DEAD','INJ']]

Unnamed: 0,DEAD,INJ
0,0.0,0.0
1,0.0,0.0
2,0.0,0.0
3,0.0,0.0
4,0.0,1.0
...,...,...
22406,0.0,0.0
22407,0.0,0.0
22408,0.0,0.0
22409,0.0,1.0


Very similar: the column names differ, and the values are floats instead of ints.
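A minimal sketch of how the 2013-2018 columns could be aligned with the newer files. The toy DataFrame is a hypothetical stand-in for the real file, and this is not necessarily how `make-master.py` does it. The nullable `Int64` dtype is used on the assumption that the float columns might contain missing values, which the output above does not confirm.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the 2013-2018 file.
toy_1318 = pd.DataFrame({'DEAD': [0.0, 0.0, 1.0], 'INJ': [0.0, 1.0, None]})

# Rename to the column names used in the 2019-2022 files.
renamed = toy_1318.rename(columns={'DEAD': 'Number Dead', 'INJ': 'Number Injured'})

# Nullable 'Int64' keeps missing values as <NA> instead of forcing them to 0.
renamed = renamed.astype({'Number Dead': 'Int64', 'Number Injured': 'Int64'})
print(renamed.dtypes)
```

Keeping missing values as `<NA>` rather than filling them with 0 avoids silently turning "unknown" into "no one was hurt".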

The real issue is with the 2003-2015 data. Let's see what's going on.

In [4]:
# read in the csv
crash_df_0315 = pd.read_csv('../data/source-data/moco-crash-2003-2015.csv', encoding='unicode_escape')  
crash_df_0315['Injury Type'].unique()

array(['No injury/unknown', 'Non-incapacitating', 'Incapacitating',
       'Fatal'], dtype=object)

In [5]:
crash_df_0315['Collision Type']

0        2-Car
1        2-Car
2        2-Car
3        2-Car
4        2-Car
         ...  
53938    2-Car
53939    1-Car
53940    2-Car
53941    2-Car
53942    2-Car
Name: Collision Type, Length: 53943, dtype: object

There are no numbers indicating how many people were killed or injured. Instead, a single string field, `Injury Type`, only tells us whether a crash involved at least one injury or fatality.

This will still allow us to do some comparisons across data years and generate some estimates for fatalities and injuries, **but the fatality/injury numbers for 2003-2015 should not be taken as true counts, because they are only lower-bound estimates: each flagged crash is counted as exactly one injury or one fatality, however many people were actually hurt.**

However, let's go ahead and create columns to estimate these values based on the flags.

In [6]:
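def injury_estimate(injury_type):
    # Either injury category counts as one injured person (a lower bound);
    # 'Fatal' and 'No injury/unknown' count as zero injuries.
    if injury_type in ('Non-incapacitating', 'Incapacitating'):
        return 1
    return 0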

In [7]:
def fatality_estimate(injury_type):
    # A 'Fatal' crash counts as one death (a lower bound).
    return 1 if injury_type == 'Fatal' else 0

In [8]:
crash_df_0315['Injury Type'].apply(fatality_estimate).value_counts()

0    53828
1      115
Name: Injury Type, dtype: int64

In [9]:
crash_df_0315['Injury Type'].apply(injury_estimate).value_counts()

0    41718
1    12225
Name: Injury Type, dtype: int64
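The same estimates can also be computed without `apply`, using vectorized comparisons. This is only a sketch on a hypothetical sample Series, not the code in `make-master.py`; the category strings are the four values returned by `unique()` above.

```python
import pandas as pd

# Hypothetical sample of 'Injury Type' values (the four categories in the 2003-2015 file).
injury_type = pd.Series(
    ['No injury/unknown', 'Non-incapacitating', 'Incapacitating', 'Fatal', 'Fatal'],
    name='Injury Type',
)

# 1 if the crash is flagged with either injury category, else 0.
number_injured = injury_type.isin(['Non-incapacitating', 'Incapacitating']).astype(int)

# 1 if the crash is flagged as fatal, else 0.
number_dead = injury_type.eq('Fatal').astype(int)

print(number_injured.tolist(), number_dead.tolist())
```

Note that a crash labeled `Fatal` contributes 0 to the injury estimate, because the single field cannot tell us whether anyone else was hurt in that crash.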

In [10]:
crash_df_0315['Number Dead'] = crash_df_0315['Injury Type'].apply(fatality_estimate)
crash_df_0315['Number Injured'] = crash_df_0315['Injury Type'].apply(injury_estimate)
crash_df_0315

Unnamed: 0,Master Record Number,Year,Month,Day,Weekend?,Hour,Collision Type,Injury Type,Primary Factor,Reported_Location,Latitude,Longitude,Number Dead,Number Injured
0,902363382,2015,1,5,Weekday,0.0,2-Car,No injury/unknown,OTHER (DRIVER) - EXPLAIN IN NARRATIVE,1ST & FESS,39.159207,-86.525874,0,0
1,902364268,2015,1,6,Weekday,1500.0,2-Car,No injury/unknown,FOLLOWING TOO CLOSELY,2ND & COLLEGE,39.161440,-86.534848,0,0
2,902364412,2015,1,6,Weekend,2300.0,2-Car,Non-incapacitating,DISREGARD SIGNAL/REG SIGN,BASSWOOD & BLOOMFIELD,39.149780,-86.568890,0,1
3,902364551,2015,1,7,Weekend,900.0,2-Car,Non-incapacitating,FAILURE TO YIELD RIGHT OF WAY,GATES & JACOBS,39.165655,-86.575956,0,1
4,902364615,2015,1,7,Weekend,1100.0,2-Car,No injury/unknown,FAILURE TO YIELD RIGHT OF WAY,W 3RD,39.164848,-86.579625,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53938,900084526,2003,10,6,Weekday,1700.0,2-Car,No injury/unknown,IMPROPER LANE USAGE,DUNN & WHITE LOT WEST,0.000000,0.000000,0,0
53939,900089213,2003,11,3,Weekday,800.0,1-Car,No injury/unknown,UNSAFE SPEED,RED OAK & SR446,0.000000,0.000000,0,0
53940,900095322,2003,12,5,Weekday,1200.0,2-Car,No injury/unknown,BRAKE FAILURE OR DEFECTIVE,2ND ST & WALNUT,0.000000,0.000000,0,0
53941,900099922,2003,12,1,Weekend,700.0,2-Car,No injury/unknown,UNSAFE BACKING,NINETH & NORTH,0.000000,0.000000,0,0


Great, now that we have the estimates for fatalities and injuries, we can also take a look at the `Collision Type` field, which also has information we can standardize to be more similar to the other datasets. 

In [11]:
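def num_vehicles_estimate(collision_type):
    # '3+ Cars' is recorded as 3, so it is a minimum, not an exact count.
    if collision_type == '1-Car':
        return 1
    elif collision_type == '2-Car':
        return 2
    elif collision_type == '3+ Cars':
        return 3
    else:
        # Any other collision type (e.g. 'Pedestrian', 'Cyclist') gets 0.
        return 0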

In [12]:
crash_df_0315['Number Vehicles'] = crash_df_0315['Collision Type'].apply(num_vehicles_estimate)
crash_df_0315['Number Vehicles']

0        2
1        2
2        2
3        2
4        2
        ..
53938    2
53939    1
53940    2
53941    2
53942    2
Name: Number Vehicles, Length: 53943, dtype: int64
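As an alternative sketch, the vehicle estimate can be expressed as a dictionary lookup with `Series.map`. The sample Series and the `vehicles_is_minimum` flag are hypothetical additions for illustration; the flag marks rows where `3` really means "three or more".

```python
import pandas as pd

# Lookup table mirroring the branches of num_vehicles_estimate.
vehicle_counts = {'1-Car': 1, '2-Car': 2, '3+ Cars': 3}

# Hypothetical sample of 'Collision Type' values.
collision_type = pd.Series(
    ['1-Car', '2-Car', '3+ Cars', 'Pedestrian', 'Cyclist'],
    name='Collision Type',
)

# Values missing from the lookup table become NaN, then 0, as in the else branch.
number_vehicles = collision_type.map(vehicle_counts).fillna(0).astype(int)

# Hypothetical flag: True where the count is only a minimum.
vehicles_is_minimum = collision_type.eq('3+ Cars')

print(number_vehicles.tolist())
```

A 0 here means "no vehicle count available from this field", not "no vehicles involved", so these rows are worth excluding before averaging.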

One final data point from this column actually gives us more information for 2003-2015 than we have for the more recent data. `Collision Type` includes two values, `Pedestrian` and `Cyclist`, that identify crashes involving pedestrians and cyclists, which the recent public crash data does not record. Let's turn those into their own columns to make analysis on these fields easier.

In [13]:
def ped_involved(collision_type):
    # True only when the collision type is exactly 'Pedestrian'.
    return collision_type == 'Pedestrian'

In [14]:
def cyclist_involved(collision_type):
    # True only when the collision type is exactly 'Cyclist'.
    return collision_type == 'Cyclist'

In [15]:
crash_df_0315['Pedestrian Involved'] = crash_df_0315['Collision Type'].apply(ped_involved)
crash_df_0315['Pedestrian Involved'].value_counts()

False    53334
True       609
Name: Pedestrian Involved, dtype: int64

In [16]:
crash_df_0315['Cyclist Involved'] = crash_df_0315['Collision Type'].apply(cyclist_involved)
crash_df_0315['Cyclist Involved'].value_counts()

False    53475
True       468
Name: Cyclist Involved, dtype: int64
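Because both flags are derived from the same single-valued `Collision Type` column, they can never both be `True` for one crash. The sketch below checks this on a hypothetical sample; it is an illustration of the limitation, not part of the cleaning script.

```python
import pandas as pd

# Hypothetical sample of 'Collision Type' values.
collision_type = pd.Series(
    ['2-Car', 'Pedestrian', 'Cyclist', '1-Car'],
    name='Collision Type',
)

# Vectorized equivalents of ped_involved and cyclist_involved.
flags = pd.DataFrame({
    'Pedestrian Involved': collision_type.eq('Pedestrian'),
    'Cyclist Involved': collision_type.eq('Cyclist'),
})

# Count rows where both flags are set; with one value per crash this is always 0.
both = int((flags['Pedestrian Involved'] & flags['Cyclist Involved']).sum())
print(both)
```

In practice this means a crash involving both a pedestrian and a cyclist would show up under only one of the two flags.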

In [17]:
crash_df_0315

Unnamed: 0,Master Record Number,Year,Month,Day,Weekend?,Hour,Collision Type,Injury Type,Primary Factor,Reported_Location,Latitude,Longitude,Number Dead,Number Injured,Number Vehicles,Pedestrian Involved,Cyclist Involved
0,902363382,2015,1,5,Weekday,0.0,2-Car,No injury/unknown,OTHER (DRIVER) - EXPLAIN IN NARRATIVE,1ST & FESS,39.159207,-86.525874,0,0,2,False,False
1,902364268,2015,1,6,Weekday,1500.0,2-Car,No injury/unknown,FOLLOWING TOO CLOSELY,2ND & COLLEGE,39.161440,-86.534848,0,0,2,False,False
2,902364412,2015,1,6,Weekend,2300.0,2-Car,Non-incapacitating,DISREGARD SIGNAL/REG SIGN,BASSWOOD & BLOOMFIELD,39.149780,-86.568890,0,1,2,False,False
3,902364551,2015,1,7,Weekend,900.0,2-Car,Non-incapacitating,FAILURE TO YIELD RIGHT OF WAY,GATES & JACOBS,39.165655,-86.575956,0,1,2,False,False
4,902364615,2015,1,7,Weekend,1100.0,2-Car,No injury/unknown,FAILURE TO YIELD RIGHT OF WAY,W 3RD,39.164848,-86.579625,0,0,2,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53938,900084526,2003,10,6,Weekday,1700.0,2-Car,No injury/unknown,IMPROPER LANE USAGE,DUNN & WHITE LOT WEST,0.000000,0.000000,0,0,2,False,False
53939,900089213,2003,11,3,Weekday,800.0,1-Car,No injury/unknown,UNSAFE SPEED,RED OAK & SR446,0.000000,0.000000,0,0,1,False,False
53940,900095322,2003,12,5,Weekday,1200.0,2-Car,No injury/unknown,BRAKE FAILURE OR DEFECTIVE,2ND ST & WALNUT,0.000000,0.000000,0,0,2,False,False
53941,900099922,2003,12,1,Weekend,700.0,2-Car,No injury/unknown,UNSAFE BACKING,NINETH & NORTH,0.000000,0.000000,0,0,2,False,False


## The End
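This notebook has standardized how fatalities, injuries, number of vehicles, and pedestrian/cyclist involvement are represented in the dataset to allow easier comparison. Keep in mind that the 2003-2015 fatality and injury figures are estimates, not true counts.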
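One way to keep that caveat visible downstream is to carry a flag alongside the counts when the files are combined. The sketch below is an assumption-laden illustration: the toy frames, the `Year` column in the recent data, and the `Counts Estimated` column name are all hypothetical and not taken from `make-master.py`.

```python
import pandas as pd

# Hypothetical rows standing in for a recent file (true counts)...
recent = pd.DataFrame({'Year': [2022, 2022], 'Number Dead': [0, 1], 'Number Injured': [2, 0]})
# ...and for the 2003-2015 file (flag-based estimates).
older = pd.DataFrame({'Year': [2004, 2004], 'Number Dead': [0, 1], 'Number Injured': [1, 0]})

# Hypothetical flag recording whether the counts are estimates.
recent['Counts Estimated'] = False
older['Counts Estimated'] = True

combined = pd.concat([recent, older], ignore_index=True)

# Yearly totals, with the flag kept so estimated years stay distinguishable.
yearly = combined.groupby(['Year', 'Counts Estimated'], as_index=False)[
    ['Number Dead', 'Number Injured']
].sum()
print(yearly)
```

With the flag in place, charts or tables built from the combined data can label the estimated years instead of presenting them as exact totals.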