# Car Crash Data Cleaning
Preliminary cleaning/exploring the car crash data for Monroe County, IN. (https://data.bloomington.in.gov/dataset/traffic-data)

In [48]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
import re
from datetime import date, time

pd.set_option('display.max_columns', None)

## Exploring the data

Let's take a look at the first csv, which includes data for 2021. The most recent three years (2019-21) of data are each stored in their own table, while 2013-18 and 2003-15 are combined into multi-year files.

In [166]:
# read in the csv
crash_df = pd.read_csv('../source-data/moco-crash-2021.csv')  
# take a look at the columns (there's so many that doing .head() won't show all of them)
crash_df.dtypes

_id                        int64
Agency                    object
City                      object
Collision Date            object
Collision Time            object
Vehicles Involved          int64
Trailers Involved        float64
Number Injured           float64
Number Dead                int64
Number Deer                int64
House Number              object
Roadway Interchange       object
Roadway Ramp              object
Roadway Id                object
Intersecting Road         object
Interchange               object
Feet From                float64
Direction                 object
Latitude                 float64
Longitude                float64
Roadway Class             object
Hit and Run?              object
Locality                  object
School Zone?              object
Rumble Strips?            object
Construction?             object
Construction Type         object
Light Condition           object
Weather Conditions        object
Surface Condition         object
Type of Me

In [50]:
crash_df.head()

Unnamed: 0,_id,Agency,City,Collision Date,Collision Time,Vehicles Involved,Trailers Involved,Number Injured,Number Dead,Number Deer,House Number,Roadway Interchange,Roadway Ramp,Roadway Id,Intersecting Road,Interchange,Feet From,Direction,Latitude,Longitude,Roadway Class,Hit and Run?,Locality,School Zone?,Rumble Strips?,Construction?,Construction Type,Light Condition,Weather Conditions,Surface Condition,Type of Median,Roadway Junction Type,Road Character,Roadway Surface,Primary Factor,Manner of Collision,Unique Location Id,Traffic Control
0,1,MONROE SD,BLOOMINGTON,2021-12-31T00:00:00,12:30 AM,1,0.0,1.0,0,0,,,,W ROCK EAST RD,,,1300.0,W,,,LOCAL/CITY ROAD,N,URBAN,N,N,N,,DARK (NOT LIGHTED),RAIN,WET,,NO JUNCTION INVOLVED,CURVE/LEVEL,ASPHALT,SPEED TOO FAST FOR WEATHER CONDITIONS,RAN OFF ROAD,MOUNTZIONRDROCKEASTRD,NONE
1,2,MONROE SD,BLOOMINGTON,2021-12-31T00:00:00,12:50 AM,1,0.0,1.0,0,0,,,,E SMITHVILLE RD,,,300.0,E,,,LOCAL/CITY ROAD,N,RURAL,N,N,N,,DARK (NOT LIGHTED),RAIN,WET,,NO JUNCTION INVOLVED,CURVE/GRADE,ASPHALT,SPEED TOO FAST FOR WEATHER CONDITIONS,RAN OFF ROAD,POPULARESTSMITHVILLERD,LANE CONTROL
2,3,BLOOMINGTON PD,BLOOMINGTON,2021-12-31T00:00:00,12:31 PM,2,,0.0,0,0,,,,SR 46,E 3RD ST,,0.0,,39.164278,-86.498384,LOCAL/CITY ROAD,,URBAN,N,N,N,,DAYLIGHT,CLEAR,DRY,,FOUR-WAY INTERSECTION,,ASPHALT,FAILURE TO MAINTAIN LANE,SAME DIRECTION SIDESWIPE,STATERD46E3RDST,
3,4,MONROE SD,BLOOMINGTON,2021-12-31T00:00:00,10:15 PM,2,0.0,0.0,0,0,7533.0,,,N CAPTAINS WAY,,,,,,,LOCAL/CITY ROAD,Y,RURAL,N,N,N,,DARK (NOT LIGHTED),RAIN,WET,,NO JUNCTION INVOLVED,NON-ROADWAY CRASH,GRAVEL,UNSAFE BACKING,BACKING CRASH,NCAPTAINSWAY,NONE
4,5,MONROE SD,BLOOMINGTON,2021-12-30T00:00:00,2:41 AM,1,0.0,0.0,0,0,,,,W VERNAL PIKE,N OARD RD,,,,39.181344,-86.637104,COUNTY ROAD,N,RURAL,N,N,N,,DARK (NOT LIGHTED),CLEAR,WET,,Y-INTERSECTION,CURVE/LEVEL,ASPHALT,RAN OFF ROAD RIGHT,RAN OFF ROAD,OARDRDVERNALPIKE,NONE


In [51]:
crash_df.describe()

Unnamed: 0,_id,Vehicles Involved,Trailers Involved,Number Injured,Number Dead,Number Deer,Feet From,Latitude,Longitude,Type of Median
count,3057.0,3057.0,2659.0,3054.0,3057.0,3057.0,1318.0,2807.0,2807.0,0.0
mean,1529.0,1.785411,0.021813,0.263916,0.003271,0.055283,441.749621,35.033724,-77.416053,
std,882.624212,0.551423,0.148652,0.593469,0.05711,0.231415,761.578352,12.030684,26.584704,
min,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,-87.054654,
25%,765.0,1.0,0.0,0.0,0.0,0.0,41.25,39.131741,-86.560235,
50%,1529.0,2.0,0.0,0.0,0.0,0.0,171.0,39.16426,-86.533568,
75%,2293.0,2.0,0.0,0.0,0.0,0.0,500.0,39.176629,-86.503742,
max,3057.0,4.0,2.0,5.0,1.0,2.0,6864.0,39.657788,0.0,


Looks like we've got a dataset with 3,057 records. Each car crash has info about when it happened, where it happened and how it happened. There are several different columns describing the location/conditions, including `Roadway Class`, `Construction?`, `School Zone?` and `Number Deer`. There's also latitude and longitude for most rows, which will be helpful for mapping the crashes. 

I'm intrigued by the `Number Deer` category — it seems like deer-involved crashes must happen a lot to be included in this city dataset.

In [52]:
crash_df['Number Deer'].where(crash_df['Number Deer'] > 0).count() / crash_df['Number Deer'].count()

0.05462872096826955

Looks like deer were involved in 5.5% of crashes last year. Interesting!

I wonder how many cities are included in this data, which records all the crashes in Monroe County.

In [53]:
crash_df['City'].unique()

array(['BLOOMINGTON', 'ELLETTSVILLE', nan, 'CHAPELHILL', 'HARRODSBURG',
       'UNIONVILLE(MONROE)', 'STINESVILLE', 'STANFORD'], dtype=object)

I'm also curious about which weather conditions are included in the dataset.

In [54]:
crash_df['Weather Conditions'].unique()

array(['RAIN', 'CLEAR', 'CLOUDY', 'BLOWING SAND/SOIL/SNOW',
       'SLEET/HAIL/FREEZING RAIN', nan, 'FOG/SMOKE/SMOG', 'SNOW'],
      dtype=object)

Let's see how many records actually have latitude/longitude coordinates. Missing data will affect the robustness of mapping visualizations when visualizing this dataset.

In [55]:
print('Total rows:', crash_df.shape[0], '\nRows w lat/lon:', crash_df['Latitude'].dropna().count())

Total rows: 3057 
Rows w lat/lon: 2807


Nearly 92% of the rows contain lat/lon data, so that's not terrible.

## Cleaning the data

In [167]:
crash_df_clean = crash_df

### Figuring out the date/time situation
The 2021 dataset has two columns to indicate the time the collision happened, `Collision Date` and `Collision Time`. This isn't the most useful setup for data analysis. Ideally we probably want the day and time combined into one datetime object, which can then be easily manipulated using `pandas` built-in datetime formulas.

So let's create a clean new column with datetime objects.

In [168]:
# given a date string (`Collision Date` field) and a time string (`Collision Time` field),,,
def get_datetime(date_string, time_string):
    time_string = str(time_string)
    # combine the date and time into one `datetime` object
    date_val = date(pd.to_datetime(date_string).year,
                 pd.to_datetime(date_string).month,
                 pd.to_datetime(date_string).day
                 )
    if time_string:
        # if the time string is `NaN`, set it to midnight
        try:
            time_val = time(pd.to_datetime(time_string).hour,
                      pd.to_datetime(time_string).minute
                     )
        except:
            time_val = time(0,0,0)
        
    return(pd.Timestamp.combine(date_val,time_val))
                            

In [58]:
# use regex to test how many rows have certain formats

In [169]:
# example of how the function works
get_datetime('2021-12-31T00:00:00','12:30 AM')

Timestamp('2021-12-31 00:30:00')

In [170]:
# apply the function to the `Collision Date` and `Collision Time` fields,
# and create a new `DateTime` field with the combined information
crash_df_clean['DateTime'] = crash_df_clean.apply(lambda row: get_datetime(row['Collision Date'], row['Collision Time']), axis=1)


In [171]:
crash_df_clean['DateTime']

0      2021-12-31 00:30:00
1      2021-12-31 00:50:00
2      2021-12-31 12:31:00
3      2021-12-31 22:15:00
4      2021-12-30 02:41:00
               ...        
3052   2021-01-01 07:59:00
3053   2021-01-01 09:40:00
3054   2021-01-01 09:55:00
3055   2021-01-01 10:35:00
3056   2021-01-01 23:21:00
Name: DateTime, Length: 3057, dtype: datetime64[ns]

### Dropping unhelpful columns

Now that there's a new, clean datetime col, we can remove the previous time/day cols. But we should also check the usefulness of the other columns, because we won't need all of them.


In [62]:
crash_df_clean['House Number'].count() / crash_df_clean.shape[0]

0.09388289172391233

Looks like only 10% of rows have a House Number, and that's not really useful anyway, especially since there's lat/lon data. So let's drop it.

In [63]:
crash_df_clean.count()

_id                      3057
Agency                   3057
City                     3051
Collision Date           3057
Collision Time           3057
Vehicles Involved        3057
Trailers Involved        2659
Number Injured           3054
Number Dead              3057
Number Deer              3057
House Number              287
Roadway Interchange        57
Roadway Ramp               68
Roadway Id               3057
Intersecting Road        2447
Interchange                 4
Feet From                1318
Direction                1002
Latitude                 2807
Longitude                2807
Roadway Class            3028
Hit and Run?             2659
Locality                 3048
School Zone?             3054
Rumble Strips?           3053
Construction?            3057
Construction Type          47
Light Condition          3057
Weather Conditions       3053
Surface Condition        3052
Type of Median              0
Roadway Junction Type    3056
Road Character           2659
Roadway Su

Looking at the other column counts, there are several columns with very few non-NaN rows. So let's also drop all of those.

In [64]:
crash_df_clean['Traffic Control'].where(crash_df_clean['Traffic Control'] == 'NONE').dropna().count()

1181

While technically there are 1K+ entries for this, most of them say "none". so let's also drop it

In [176]:
cols_to_drop = [
    'Collision Date',
    'Collision Time',
    'House Number',
    'Roadway Interchange',
    'Roadway Ramp',
    'Interchange',
    'Feet From',
    'Direction',
    'Construction Type',
    'Type of Median',
    '_id',
    'Traffic Control'
]

In [66]:
crash_df_clean = crash_df_clean.drop(columns=cols_to_drop)

In [67]:
print(crash_df_clean.count())

Agency                   3057
City                     3051
Vehicles Involved        3057
Trailers Involved        2659
Number Injured           3054
Number Dead              3057
Number Deer              3057
Roadway Id               3057
Intersecting Road        2447
Latitude                 2807
Longitude                2807
Roadway Class            3028
Hit and Run?             2659
Locality                 3048
School Zone?             3054
Rumble Strips?           3053
Construction?            3057
Light Condition          3057
Weather Conditions       3053
Surface Condition        3052
Roadway Junction Type    3056
Road Character           2659
Roadway Surface          3056
Primary Factor           3013
Manner of Collision      3028
Unique Location Id       3057
DateTime                 3057
dtype: int64


That looks fine for now.

## Combining with previous years

In [174]:
crash_df_20 = pd.read_csv('../source-data/moco-crash-2020.csv')  
crash_df_20.head()

Unnamed: 0,Agency,City,Collision Date,Collision Time,Vehicles Involved,Trailers Involved,Number Injured,Number Dead,Number Deer,House Number,Roadway Interchange,Roadway Ramp,Roadway Id,Intersecting Road,Interchange,Feet From,Direction,Latitude,Longitude,Roadway Class,Hit and Run?,Locality,School Zone?,Rumble Strips?,Construction?,Construction Type,Light Condition,Weather Conditions,Surface Condition,Type of Median,Roadway Junction Type,Road Character,Roadway Surface,Primary Factor,Manner of Collision,Unique Location Id,Traffic Control
0,MONROE SD,BLOOMINGTON,1/1/2020,2:50 AM,1,0.0,1.0,0.0,0,,,,N CURRY PIKE,,,1000.0,N,0.0,0.0,COUNTY ROAD,N,RURAL,N,N,N,,DARK (NOT LIGHTED),CLEAR,DRY,,NO JUNCTION INVOLVED,STRAIGHT/GRADE,ASPHALT,RAN OFF ROAD RIGHT,RAN OFF ROAD,NCURRYPIKESTONEBRANCHDR,NONE
1,ISP BLOOMINGTON 33,CLEAR CREEK,1/1/2020,11:45 AM,2,,0.0,0.0,0,,,,E SMITHVILLE RD,S FAIRFAX RD,,0.0,S,39.071151,-86.503186,LOCAL/CITY ROAD,,RURAL,N,N,N,,DAYLIGHT,CLEAR,DRY,,FOUR-WAY INTERSECTION,,ASPHALT,FAILURE TO YIELD RIGHT OF WAY,RIGHT ANGLE,ESMITHVILLERDSFAIRFAXRD,
2,ISP BLOOMINGTON 33,BLOOMINGTON,1/1/2020,12:10 PM,2,,0.0,0.0,0,,Y,116C,I-69 S,W TAPP RD,,0.0,S,39.140799,-86.573278,INTERSTATE,,URBAN,N,N,N,,DAYLIGHT,CLEAR,DRY,,RAMP,,ASPHALT,FOLLOWING TOO CLOSELY,REAR END,116.2I-69,
3,MONROE SD,ELLETTSVILLE,1/1/2020,1:09 PM,1,0.0,0.0,0.0,0,1724.0,,,N RIDGEWAY DR,W VALLEY VIEW DR (SE),,,,39.239952,-86.606704,LOCAL/CITY ROAD,N,URBAN,N,N,N,,DAYLIGHT,CLEAR,DRY,,NO JUNCTION INVOLVED,STRAIGHT/LEVEL,ASPHALT,RAN OFF ROAD RIGHT,RAN OFF ROAD,NRIDGEWAYDRWVALLEYVIEWDR,NONE
4,ISP BLOOMINGTON 33,BLOOMINGTON,1/1/2020,5:20 PM,1,,0.0,0.0,1,,,,I-69 N,,,898.0,W,39.070974,-86.664678,INTERSTATE,,RURAL,N,N,N,,DAYLIGHT,CLEAR,DRY,,NO JUNCTION INVOLVED,,ASPHALT,ANIMAL/OBJECT IN ROADWAY,COLLISION WITH DEER,I-69I69,


This table has very similar columns to the first one, with the same field names. So it can be cleaned with the same functions used above.

It is missing the `_id` field, so we can remove that from the list of cols to drop.

In [177]:
cols_to_drop.remove('_id')

In [178]:
crash_df_20_clean = crash_df_20

crash_df_20_clean['DateTime'] = crash_df_20_clean.apply(lambda row: get_datetime(row['Collision Date'], row['Collision Time']), axis=1)

crash_df_20_clean = crash_df_20_clean.drop(columns=cols_to_drop)

crash_df_20_clean.head()

Unnamed: 0,Agency,City,Vehicles Involved,Trailers Involved,Number Injured,Number Dead,Number Deer,Roadway Id,Intersecting Road,Latitude,Longitude,Roadway Class,Hit and Run?,Locality,School Zone?,Rumble Strips?,Construction?,Light Condition,Weather Conditions,Surface Condition,Roadway Junction Type,Road Character,Roadway Surface,Primary Factor,Manner of Collision,Unique Location Id,DateTime
0,MONROE SD,BLOOMINGTON,1,0.0,1.0,0.0,0,N CURRY PIKE,,0.0,0.0,COUNTY ROAD,N,RURAL,N,N,N,DARK (NOT LIGHTED),CLEAR,DRY,NO JUNCTION INVOLVED,STRAIGHT/GRADE,ASPHALT,RAN OFF ROAD RIGHT,RAN OFF ROAD,NCURRYPIKESTONEBRANCHDR,2020-01-01 02:50:00
1,ISP BLOOMINGTON 33,CLEAR CREEK,2,,0.0,0.0,0,E SMITHVILLE RD,S FAIRFAX RD,39.071151,-86.503186,LOCAL/CITY ROAD,,RURAL,N,N,N,DAYLIGHT,CLEAR,DRY,FOUR-WAY INTERSECTION,,ASPHALT,FAILURE TO YIELD RIGHT OF WAY,RIGHT ANGLE,ESMITHVILLERDSFAIRFAXRD,2020-01-01 11:45:00
2,ISP BLOOMINGTON 33,BLOOMINGTON,2,,0.0,0.0,0,I-69 S,W TAPP RD,39.140799,-86.573278,INTERSTATE,,URBAN,N,N,N,DAYLIGHT,CLEAR,DRY,RAMP,,ASPHALT,FOLLOWING TOO CLOSELY,REAR END,116.2I-69,2020-01-01 12:10:00
3,MONROE SD,ELLETTSVILLE,1,0.0,0.0,0.0,0,N RIDGEWAY DR,W VALLEY VIEW DR (SE),39.239952,-86.606704,LOCAL/CITY ROAD,N,URBAN,N,N,N,DAYLIGHT,CLEAR,DRY,NO JUNCTION INVOLVED,STRAIGHT/LEVEL,ASPHALT,RAN OFF ROAD RIGHT,RAN OFF ROAD,NRIDGEWAYDRWVALLEYVIEWDR,2020-01-01 13:09:00
4,ISP BLOOMINGTON 33,BLOOMINGTON,1,,0.0,0.0,1,I-69 N,,39.070974,-86.664678,INTERSTATE,,RURAL,N,N,N,DAYLIGHT,CLEAR,DRY,NO JUNCTION INVOLVED,,ASPHALT,ANIMAL/OBJECT IN ROADWAY,COLLISION WITH DEER,I-69I69,2020-01-01 17:20:00


Let's clean 2019 & 2022 the same way

In [179]:
crash_df_19 = pd.read_csv('../source-data/moco-crash-2019.csv')  

In [180]:
crash_df_19_clean = crash_df_19
crash_df_19_clean['DateTime'] = crash_df_19_clean.apply(lambda row: get_datetime(row['Collision Date'], row['Collision Time']), axis=1)
crash_df_19_clean = crash_df_19_clean.drop(columns=cols_to_drop)

# crash_df_19_clean.head()



In [181]:
crash_df_22 = pd.read_csv('../source-data/moco-crash-2022.csv')  

In [182]:
crash_df_22_clean = crash_df_22
crash_df_22_clean['DateTime'] = crash_df_22_clean.apply(lambda row: get_datetime(row['Collision Date'], row['Collision Time']), axis=1)
crash_df_22_clean = crash_df_22_clean.drop(columns=['Collision Date','Collision Time','House Number','Roadway Interchange','Roadway Ramp','Interchange','Feet From','Direction','Construction Type','Type of Median','Traffic Control'])

# crash_df_22_clean.head()

The next dataset combines 2013-2018. Unlike 2019-21 data, this one has quite different columns/values for each record :o

In [183]:
crash_df_13_18 = pd.read_csv('../source-data/moco-crash-2013-2018.csv',encoding='unicode_escape')  
crash_df_13_18.head()

Unnamed: 0,Agency,City,DATE,TIME,VEH#,Trailers,INJ,DEAD,DEER,House#,Roadway Id,Intersect Rd.,Interchange,Ramp,Property Type,Feet From,Dir,Latitude,Longitude,Road Class,H&R,Locality,School,Rumble Strips,CN Zone,CN Type,Light,Weather,Surf Con,Median,Rd Junction,Road Char,Surface,Primary Factor,Collision Type,Unique Id,Traffic Control
0,MCSD,BLOOMINGTON,7/30/2014,1:16 AM,1,0,0.0,0.0,0.0,,1ST AVE,SANDERS SECOND,,,OTHER,250.0,S,39.052982,-86.51332,CR,N,RURAL,N,N,N,,DARK,CLEAR,DRY,NONE,,CURVE/GRADE,ASP,RAN OFF ROAD RIGHT,RAN OFF ROAD,1STAVENUESANDERSAVES FAIRFAXRD,LANE CONTROL
1,BPD,BLOOMINGTON,5/11/2015,9:50 AM,2,0,0.0,0.0,0.0,203.0,1ST ST,,,,,,,39.159344,-86.532373,UNK,N,URBAN,N,N,N,,DAYLIGHT,CLOUDY,DRY,,,STRAIGHT/LEVEL,ASP,OTHER (DRIVER),RIGHT ANGLE,W 1STST,NONE
2,BPD,BLOOMINGTON,1/4/2016,2:06 PM,2,0,0.0,0.0,0.0,709.0,1ST ST,,,,,,,39.17664,-86.541082,CITY,N,URBAN,N,N,N,,DAYLIGHT,CLEAR,DRY,,,NON-ROADWAY CRASH,ASP,UNSAFE BACKING,BACKING CRASH,W1STST,NONE
3,BPD,BLOOMINGTON,6/14/2016,3:00 PM,2,0,0.0,0.0,0.0,1007.0,1ST ST,,,,,,,39.159344,-86.522435,CITY,Y,URBAN,N,N,N,,DAYLIGHT,RAIN,WET,,,STRAIGHT/GRADE,GRAVEL,IMPROPER TURNING,RIGHT TURN,1STST,NONE
4,BPD,BLOOMINGTON,8/6/2013,1340,1,0,1.0,0.0,0.0,1210.0,1ST ST,,,,OTHER,,,39.159344,-86.518936,CITY,N,URBAN,N,N,N,,DAYLIGHT,CLOUDY,DRY,NONE,,STRAIGHT/LEVEL,ASP,DRIVER ILLNESS,RAN OFF ROAD,E1STST,NONE


In [184]:
print(crash_df_13_18.dtypes)

Agency              object
City                object
DATE                object
TIME                object
VEH#                 int64
Trailers             int64
INJ                float64
DEAD               float64
DEER               float64
House#              object
Roadway Id          object
Intersect Rd.       object
Interchange         object
Ramp                object
Property Type       object
Feet From          float64
Dir                 object
Latitude           float64
Longitude          float64
Road Class          object
H&R                 object
Locality            object
School              object
Rumble Strips       object
CN Zone             object
CN Type             object
Light               object
Weather             object
Surf Con            object
Median              object
Rd Junction         object
Road Char           object
Surface             object
Primary Factor      object
Collision Type      object
Unique Id           object
Traffic Control     object
d

Several of the columns correlate, like `INJ` with `Number Injured` and `DEER` with `Number Deer`. But many are missing. 

Let's rename the cols to match the naming conventions of 2019-21 to make it easier to combine the tables later.

In [185]:
crash_df_13_18_clean = crash_df_13_18.rename(columns={
    "DATE": "Collision Date", 
    "TIME": "Collision Time", 
    "Trailers": "Trailers Involved",
    "INJ": "Number Injured",
    "DEAD": "Number Dead",
    "DEER": "Number Deer",
    "House#": "House Number",
    "VEH#": "Vehicles Involved",
    "Surf Con": "Surface Condition",
    "Collision Type": "Manner of Collision",
    "Rd Junction": "Roadway Junction Type",
    "Weather": "Weather Conditions",
    "Road Char": "Road Character",
    "Surface": "Roadway Surface",
    "CN Zone": "Construction?",
    "Unique Id": "Unique Location Id",
    'Intersect Rd.': "Intersecting Road"
})

Now let's take a look at the time data. It's not all formatted the same way, so we'll have to clean it.

In [186]:
crash_df_13_18_clean['Collision Time'].sample(30)

21366     7:20 PM
19261     4:00 PM
13648     7:03 AM
6825     10:21 AM
4075      9:37 PM
18739     3:15 PM
6322     11:06 PM
21024     5:07 PM
16772        1246
16748     7:50 AM
8101      8:54 AM
18433        1309
6492      7:21 AM
10600        2305
13460         715
4163      6:00 PM
22113     8:11 AM
21551     8:50 AM
9424      2:00 PM
9037      1:10 PM
5410      5:00 PM
12275     6:18 PM
6133     10:30 PM
5729      3:34 AM
2093     10:03 AM
2246      9:50 AM
15211     6:25 PM
2527      5:30 PM
8829      4:06 PM
11389     6:09 PM
Name: Collision Time, dtype: object

Looks like some of the times were input as 4-digit numbers. I'm assuming those indicate military time, so `1046` would mean `10:46 a.m.` and `1420` would mean `2:20 p.m.`. There are also some 3-digit numbers, like `145`. I'm going to assume this means `1:45 p.m.`. 

In [187]:
# Function to clean the times
def Clean_Times(Time):
    
    # Search for 3- or 4-digit times
    # these are in military time. so need to be converted to regular time to match the rest.
    if re.search('\d{3,4}', Time):
        
        # ex 1230, 1024
        if len(Time) == 4:
            if int(Time[0:2])>12:
                return str(int(Time[0:2]) - 12) + ":" + Time[2:4] + " PM"
            elif Time[0:2] == '12':
                return Time[0:2] + ":" + Time[2:4] + " PM"
            else:
                return Time[0:2] + ":" + Time[2:4] + " AM"
            
        # ex 130, 930, 725
        if len(Time) == 3:
            if Time[0:1] == '0':
                return "12:" + Time[1:3] + " AM"
            else:
                return Time[0:1] + ":" + Time[1:3] + " AM" 
        return Time
    else:
        # if clean up not needed, return the same name
        return Time

In [188]:
# remove all times with one or two digit numbers 
def FindWrongTimes(Time):
    if not re.search('\d{1,2}:\d{2} [A,P]M', Time):
        return float('NaN')
    else:
        return Time

Find the number of time entries with only one or two digit numbers, which is not helpful in showing the time.

In [189]:
crash_df_13_18_clean['Collision Time'].count() - crash_df_13_18_clean['Collision Time'].apply(FindWrongTimes).dropna().count()


3061

In [190]:
# Updated the time columns
crash_df_13_18_clean['Collision Time'] = crash_df_13_18_clean['Collision Time'].apply(Clean_Times)
crash_df_13_18_clean['Collision Time'] = crash_df_13_18_clean['Collision Time'].apply(FindWrongTimes)

In [191]:
crash_df_13_18_clean['Collision Time']

0        1:16 AM
1        9:50 AM
2        2:06 PM
3        3:00 PM
4        1:40 PM
          ...   
22406    5:58 PM
22407    3:21 AM
22408    7:50 AM
22409    7:28 PM
22410    1:44 AM
Name: Collision Time, Length: 22411, dtype: object

And now we can clean it the same way as the others.

In [192]:
crash_df_13_18_clean['DateTime'] = crash_df_13_18_clean.apply(lambda x: get_datetime(x['Collision Date'],x['Collision Time']), axis=1)

crash_df_13_18_clean = crash_df_13_18_clean.drop(columns=['Collision Date','Collision Time','House Number','Interchange','Feet From','Traffic Control', 'CN Type'])


In [196]:
crash_df_13_18_clean.sample()

Unnamed: 0,Agency,City,Vehicles Involved,Trailers Involved,Number Injured,Number Dead,Number Deer,Roadway Id,Intersecting Road,Ramp,Property Type,Dir,Latitude,Longitude,Road Class,H&R,Locality,School,Rumble Strips,Construction?,Light,Weather Conditions,Surface Condition,Median,Roadway Junction Type,Road Character,Roadway Surface,Primary Factor,Manner of Collision,Unique Location Id,DateTime
10577,IUPD,BLOOMINGTON,1,0,1.0,0.0,0.0,JORDAN AVE,ATWATER AVE,,,,39.163168,-86.516384,CITY,N,URBAN,N,N,N,DARK,CLEAR,DRY,,,STRAIGHT/LEVEL,ASP,HEADLIGHT DEFECTIVE OR NOT ON,LEFT TURN,EATWATERAVESJORDANAVE,2017-04-27 00:15:00


### Cleaning the last csv
The last table is the least similar to the others. It only has a few columns with far less data, but it also has way more years of data. Also, the data from 2013-2015 is included in this table as well as the previous one, so there's some weird overlap. For simplicty, I'll drop all the repeat data from the oldest table, since it has the least information about each record.

In [197]:
crash_df_03_15 = pd.read_csv('../source-data/moco-crash-2003-2015.csv',encoding='unicode_escape')  
crash_df_03_15.head()

Unnamed: 0,Master Record Number,Year,Month,Day,Weekend?,Hour,Collision Type,Injury Type,Primary Factor,Reported_Location,Latitude,Longitude
0,902363382,2015,1,5,Weekday,0.0,2-Car,No injury/unknown,OTHER (DRIVER) - EXPLAIN IN NARRATIVE,1ST & FESS,39.159207,-86.525874
1,902364268,2015,1,6,Weekday,1500.0,2-Car,No injury/unknown,FOLLOWING TOO CLOSELY,2ND & COLLEGE,39.16144,-86.534848
2,902364412,2015,1,6,Weekend,2300.0,2-Car,Non-incapacitating,DISREGARD SIGNAL/REG SIGN,BASSWOOD & BLOOMFIELD,39.14978,-86.56889
3,902364551,2015,1,7,Weekend,900.0,2-Car,Non-incapacitating,FAILURE TO YIELD RIGHT OF WAY,GATES & JACOBS,39.165655,-86.575956
4,902364615,2015,1,7,Weekend,1100.0,2-Car,No injury/unknown,FAILURE TO YIELD RIGHT OF WAY,W 3RD,39.164848,-86.579625


In [198]:
# drop overlapping years, 2013-2015
# recast the dataframe with only rows where `Year` is less than 2013
crash_df_03_12 = crash_df_03_15[crash_df_03_15['Year'] < 2013]

In [130]:
crash_df_03_12['Injury Type'].unique()

array(['No injury/unknown', 'Non-incapacitating', 'Incapacitating',
       'Fatal'], dtype=object)

Adding column to estimate fatalities based on injury column

In [131]:
def FatalityEstimate(String):
    if String == 'Fatal':
        return 1
    else:
        return 0

In [132]:
crash_df_03_12['Number Dead'] = crash_df_03_12['Injury Type'].apply(FatalityEstimate)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  crash_df_03_12['Number Dead'] = crash_df_03_12['Injury Type'].apply(FatalityEstimate)


This 2003-2015 dataset has some clear issues that are fixed in the others we looked at. The `Hour` category is strangely filled out with floats in the thousands, probably a result of them not noting the minutes and also not using the correct data type when creating this table. Also several categories which were represented with numbers, like Injuries and Collision type, are now recorded with descriptive strings that will be tricky to compare well with the previous datasets. Also there are way fewer data points so we can't compare things like the number of deer for the early years of MoCo car crash data. 

However, we do still have the dates, times and locations, which will allow a good amount of visualization across all years we have data (2003-21). So I'm still going to clean this data table like the others, and create a concatenated table of all the data. The years 2003-12 will just have a lot of NaN entries for the columns they don't have recorded.

So let's go ahead and rename columns, fix datetime formatting and drop unneeded columns as we did above.

In [199]:
crash_df_03_12_clean = crash_df_03_12.rename(columns={
    "Collision Type": "Vehicles Involved"
})

In [200]:
crash_df_03_12_clean['Collision Date'] = pd.to_datetime(crash_df_03_12_clean[['Year', 'Month', 'Day']])

In [201]:
# is hour reliable? see how many rows don't have an `Hour` field filled in
crash_df_03_12_clean['Hour'].isna().sum()

225

In [202]:
def CleanTime0315(hour_string):
    # test if float exists
    if hour_string == hour_string:
        return str(time(int(hour_string / 100)))

In [203]:
CleanTime0315(2300)

'23:00:00'

In [204]:
crash_df_03_12_clean['Collision Time'] = crash_df_03_12_clean['Hour'].apply(CleanTime0315)
crash_df_03_12_clean = crash_df_03_12_clean.drop(columns=["Year","Month","Day","Weekend?","Hour"])


In [206]:
crash_df_03_12_clean.sample()

Unnamed: 0,Master Record Number,Vehicles Involved,Injury Type,Primary Factor,Reported_Location,Latitude,Longitude,Collision Date,Collision Time
17697,901697134,3+ Cars,Non-incapacitating,FOLLOWING TOO CLOSELY,SR37S & VERNAL,39.176544,-86.5624,2011-09-07,14:00:00


In [208]:
# now clean it the same way as the others
# crash_df_03_12_clean['DateTime'] = 
crash_df_03_12_clean.apply(lambda x: get_datetime(x['Collision Date'], x['Collision Time']), axis=1)
# crash_df_13_18_clean.apply(lambda x: get_datetime(x['Collision Date'],x['Collision Time']), axis=1)



12538   2012-04-01 16:00:00
12539   2012-08-06 12:00:00
12540   2012-01-05 11:00:00
12541   2012-12-04 21:00:00
12542   2012-03-06 07:00:00
                ...        
53938   2003-10-06 17:00:00
53939   2003-11-03 08:00:00
53940   2003-12-05 12:00:00
53941   2003-12-01 07:00:00
53942   2003-12-07 17:00:00
Length: 41405, dtype: datetime64[ns]

In [211]:
cols_to_drop_0312 = [
    'Master Record Number',
]

In [212]:
crash_df_03_12_clean = crash_df_03_12_clean.drop(columns=cols_to_drop_0312)

In [201]:
crash_df_03_12_clean.head()

Unnamed: 0,Vehicles Involved,Injury Type,Primary Factor,Reported_Location,Latitude,Longitude,Number Dead,DateTime
12538,1-Car,No injury/unknown,RAN OFF ROAD RIGHT,8618 N BEAN BLOSSOM RD & E ANDERSON,40.135524,-86.432148,0,2012-04-01 16:00:00
12539,2-Car,No injury/unknown,FAILURE TO YIELD RIGHT OF WAY,W RILEY DR,39.984323,-85.614558,0,2012-08-06 12:00:00
12540,2-Car,No injury/unknown,FAILURE TO YIELD RIGHT OF WAY,48 W & DANIELS WAY,39.566661,-86.59911,0,2012-01-05 11:00:00
12541,1-Car,No injury/unknown,ANIMAL/OBJECT IN ROADWAY,N BURMA & N MT PLEASANT RD,39.54282,-86.576707,0,2012-12-04 21:00:00
12542,1-Car,No injury/unknown,RAN OFF ROAD RIGHT,BOLTINGHOUSE RD & EARL YOUNG,39.442154,-86.47338,0,2012-03-06 07:00:00


Let's create a column for `Number Injured`.

In [202]:
crash_df_03_12_clean['Injury Type'].unique()

array(['No injury/unknown', 'Non-incapacitating', 'Incapacitating',
       'Fatal'], dtype=object)

To get an estimate at least, even though this won't give us the accurate number of injuries from each crash, let's just add `1` for the `Number Injured` estimate if the field is `Non-incapacitating` or `Incapacitating`. Unfortunately this won't compare very well, but we can note this limitation in the Methodology.

In [204]:
def InjuryEstimate(String):
    if String == 'Non-incapacitating':
        return 1
    elif String == 'Incapacitating':
        return 1
    else:
        return 0

In [205]:
crash_df_03_12_clean['Number Injured'] = crash_df_03_12_clean['Injury Type'].apply(InjuryEstimate)

We can also extract the number of vehicles involved as an integer estimate from the current string format.

Also, the 2003-2012 data has the benefit of having some documentation of whether cyclists or pedestrians were involved. We can at least show this information for these years, even though it's not publicly accessible for 2015 - 2022. Let's add a col for each of those flags.

In [206]:
crash_df_03_12_clean['Vehicles Involved'].unique()

array(['1-Car', '2-Car', 'Moped/Motorcycle', '3+ Cars', 'Cyclist', 'Bus',
       'Pedestrian', nan], dtype=object)

In [207]:
def NumberVehiclesEstimate(String):
    if String == '1-Car':
        return 1
    elif String == '2-Car':
        return 2
    elif String == '3+ Cars':
        return 3
    else:
        return 0

In [208]:
crash_df_03_12_clean['Vehicles Involved'] = crash_df_03_12_clean['Vehicles Involved'].apply(NumberVehiclesEstimate)

In [209]:
def PedestrianInvolved(String):
    if String == 'Pedestrian':
        return True
    else:
        return False

In [135]:
crash_df_03_12_clean['Pedestrian Involved'] = crash_df_03_12_clean['Vehicles Involved'].apply(PedestrianInvolved)

In [136]:
def CyclistInvolved(String):
    if String == 'Cyclist':
        return True
    else:
        return False

In [210]:
crash_df_03_12_clean['Cyclist Involved'] = crash_df_03_12_clean['Vehicles Involved'].apply(CyclistInvolved)

In [211]:
crash_df_03_12_clean = crash_df_03_12_clean.drop(columns=['Vehicles Involved', 'Injury Type'])

In [212]:
crash_df_03_12_clean

Unnamed: 0,Primary Factor,Reported_Location,Latitude,Longitude,Number Dead,DateTime,Number Injured,Cyclist Involved
12538,RAN OFF ROAD RIGHT,8618 N BEAN BLOSSOM RD & E ANDERSON,40.135524,-86.432148,0,2012-04-01 16:00:00,0,False
12539,FAILURE TO YIELD RIGHT OF WAY,W RILEY DR,39.984323,-85.614558,0,2012-08-06 12:00:00,0,False
12540,FAILURE TO YIELD RIGHT OF WAY,48 W & DANIELS WAY,39.566661,-86.599110,0,2012-01-05 11:00:00,0,False
12541,ANIMAL/OBJECT IN ROADWAY,N BURMA & N MT PLEASANT RD,39.542820,-86.576707,0,2012-12-04 21:00:00,0,False
12542,RAN OFF ROAD RIGHT,BOLTINGHOUSE RD & EARL YOUNG,39.442154,-86.473380,0,2012-03-06 07:00:00,0,False
...,...,...,...,...,...,...,...,...
53938,IMPROPER LANE USAGE,DUNN & WHITE LOT WEST,0.000000,0.000000,0,2003-10-06 17:00:00,0,False
53939,UNSAFE SPEED,RED OAK & SR446,0.000000,0.000000,0,2003-11-03 08:00:00,0,False
53940,BRAKE FAILURE OR DEFECTIVE,2ND ST & WALNUT,0.000000,0.000000,0,2003-12-05 12:00:00,0,False
53941,UNSAFE BACKING,NINETH & NORTH,0.000000,0.000000,0,2003-12-01 07:00:00,0,False


## Combining all the data into one table
Even though there are differences in the columns between the different tables, specifically in the 2003-15 one, combining all the data into one concatenated table will be useful for visualizing across time and space.

Another script will be devoted to cleaning the address data in the master file, as it will be easier to clean that once the data is all combined.

In [213]:
dfs = [crash_df_22_clean, crash_df_clean, crash_df_20_clean, crash_df_19_clean, crash_df_13_18_clean, crash_df_03_12_clean]
combo_df = pd.concat(dfs)

In [214]:
combo_df

Unnamed: 0,_id,Agency,City,Vehicles Involved,Trailers Involved,Number Injured,Number Dead,Number Deer,Roadway Id,Intersecting Road,Latitude,Longitude,Roadway Class,Hit and Run?,Locality,School Zone?,Rumble Strips?,Construction?,Light Condition,Weather Conditions,Surface Condition,Roadway Junction Type,Road Character,Roadway Surface,Primary Factor,Manner of Collision,Unique Location Id,DateTime,Ramp,Property Type,Dir,Road Class,H&R,School,Rumble Strips,Light,Median,Reported_Location,Cyclist Involved
0,1.0,MONROE SD,BLOOMINGTON,1.0,0.0,0.0,0.0,0.0,I69N,STATE RD 37,38.329723,-86.509226,INTERSTATE,N,RURAL,N,N,N,DARK (NOT LIGHTED),CLEAR,DRY,NO JUNCTION INVOLVED,STRAIGHT/GRADE,ASPHALT,ANIMAL/OBJECT IN ROADWAY,COLLISION WITH ANIMAL OTHER,I69NSTATERD37RD,2022-01-07 05:14:00,,,,,,,,,,,
1,2.0,ELLETTSVILLE PD,ELLETTSVILLE,1.0,0.0,0.0,0.0,1.0,SR46W,DEER PARK,39.212153,-86.587526,STATE ROAD,N,URBAN,N,N,N,DARK (NOT LIGHTED),CLOUDY,DRY,T-INTERSECTION,CURVE/LEVEL,ASPHALT,ANIMAL/OBJECT IN ROADWAY,COLLISION WITH DEER,DEERPARKDRSR46W,2022-01-08 08:35:00,,,,,,,,,,,
2,3.0,MONROE SD,ELLETTSVILLE,1.0,0.0,0.0,0.0,0.0,W REEVES,,39.235012,-86.676553,LOCAL/CITY ROAD,N,RURAL,N,N,N,DAWN/DUSK,CLEAR,ICE,NO JUNCTION INVOLVED,CURVE/HILLCREST,ASPHALT,RAN OFF ROAD RIGHT,RAN OFF ROAD,WREEVESRD,2022-01-17 07:33:00,,,,,,,,,,,
3,4.0,INDIANA UNIV BLOOMINGTON PD,BLOOMINGTON,2.0,0.0,0.0,0.0,0.0,THIRD,S HAWTHORNE,39.156888,-86.520324,LOCAL/CITY ROAD,N,URBAN,N,N,N,DAYLIGHT,CLEAR,DRY,FOUR-WAY INTERSECTION,STRAIGHT/LEVEL,ASPHALT,UNSAFE LANE MOVEMENT,SAME DIRECTION SIDESWIPE,SHAWTHORNEDRTHIRDST,2022-01-04 12:32:00,,,,,,,,,,,
4,5.0,BLOOMINGTON PD,BLOOMINGTON,2.0,0.0,0.0,0.0,0.0,S HENDERSON,E HILLSIDE,39.150640,-86.526960,LOCAL/CITY ROAD,N,URBAN,Y,N,N,DARK (LIGHTED),RAIN,WET,FOUR-WAY INTERSECTION,STRAIGHT/LEVEL,ASPHALT,FAILURE TO YIELD RIGHT OF WAY,RIGHT ANGLE,EHILLSIDEDRSHENDERSONST,2022-01-01 05:33:00,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53938,,,,,,0.0,0.0,,,,0.000000,0.000000,,,,,,,,,,,,,IMPROPER LANE USAGE,,,2003-10-06 17:00:00,,,,,,,,,,DUNN & WHITE LOT WEST,False
53939,,,,,,0.0,0.0,,,,0.000000,0.000000,,,,,,,,,,,,,UNSAFE SPEED,,,2003-11-03 08:00:00,,,,,,,,,,RED OAK & SR446,False
53940,,,,,,0.0,0.0,,,,0.000000,0.000000,,,,,,,,,,,,,BRAKE FAILURE OR DEFECTIVE,,,2003-12-05 12:00:00,,,,,,,,,,2ND ST & WALNUT,False
53941,,,,,,0.0,0.0,,,,0.000000,0.000000,,,,,,,,,,,,,UNSAFE BACKING,,,2003-12-01 07:00:00,,,,,,,,,,NINETH & NORTH,False


Now that we can see the whole dataset combined, I'll do one more round of dropping columns that I don't plan to use in my visualizations.

In [215]:
final_cols_to_drop = [
    '_id',
    'Agency',
    'City',
    'Trailers Involved',
    'Number Deer',
    'Roadway Class',
    'Hit and Run?',
    'Locality',
    'School Zone?',
    'Rumble Strips?',
    'Construction?',
    'Roadway Junction Type',
    'Road Character',
    'Roadway Surface',
    'Ramp',
    'Property Type',
    'Dir',
    'Road Class',
    'H&R',
    'School ',
    'Light',
    'Median',
    'Rumble Strips',
]

In [216]:
combo_df = combo_df.drop(columns=final_cols_to_drop)

In [219]:
combo_df['DateTime'].sample(10)

17251   2011-11-02 09:00:00
24958   2009-06-03 22:00:00
49384   2003-01-07 09:00:00
33375   2007-09-07 00:00:00
21275   2010-01-02 09:00:00
611     2021-10-28 07:31:00
53858   2003-11-04 21:00:00
50227   2003-08-06 16:00:00
7556    2013-09-30 11:20:00
2346    2013-05-07 11:16:00
Name: DateTime, dtype: datetime64[ns]

In [220]:
combo_df.shape

(74622, 16)

In [221]:
no_deaths_injuries = combo_df[(combo_df['Number Injured']) == 0 & (combo_df['Number Dead'] == 0)]

In [222]:
only_injuries = combo_df[(combo_df['Number Injured']) > 0 & (combo_df['Number Dead'] == 0)]

In [223]:
deaths = combo_df[(combo_df['Number Dead'] > 0)]

In [224]:
combo_df.shape[0] - no_deaths_injuries.shape[0] - only_injuries.shape[0] - deaths.shape[0]

-152

In [225]:
combo_df[combo_df['Number Dead'] != combo_df['Number Dead']]

Unnamed: 0,Vehicles Involved,Number Injured,Number Dead,Roadway Id,Intersecting Road,Latitude,Longitude,Light Condition,Weather Conditions,Surface Condition,Primary Factor,Manner of Collision,Unique Location Id,DateTime,Reported_Location,Cyclist Involved
21,2.0,,,W 12TH ST,N LINDBERGH DR,39.174352,-86.551181,DARK (LIGHTED),CLEAR,DRY,UNSAFE LANE MOVEMENT,SAME DIRECTION SIDESWIPE,12THSTLINDBERGHDR,2020-01-05 23:15:00,,
2138,1.0,,,N FEE,E 13TH ST,39.175633,-86.518987,DAYLIGHT,CLEAR,DRY,PEDESTRIAN ACTION,OTHER - EXPLAIN IN NARRATIVE,13THSTFEELN,2019-09-23 11:00:00,,
2894,2.0,,,4TH ST,GRANT ST,39.165373,-86.529716,,CLEAR,DRY,,,4THSTGRANTST,2018-02-06 13:30:00,,
4847,2.0,,,11TH ST,FAIRVIEW ST,39.173152,-86.5408,,CLOUDY,WET,,,NFAIRVIEWSTW11THST,2018-06-21 21:52:00,,
8518,1.0,,,DUNN ST,,39.187891,-86.528476,,RAIN,WET,RAN OFF ROAD RIGHT,RAN OFF ROAD,NDUNNST,2018-09-10 01:16:00,,
10715,2.0,,,JORDAN AVE,,39.17559,-86.51442,,,DRY,,,NJORDANAVE,2018-04-28 12:00:00,,
12159,1.0,,,MOORES PIKE,SARE RD,39.150355,-86.49861,,CLOUDY,DRY,,RAN OFF ROAD,EMOORESPIKESSARERD,2018-06-23 06:00:00,,
12328,2.0,,,NORTH,,39.144832,-86.527381,,,,,,ENORTHDR,2018-09-13 09:00:00,,


I'll save it to a csv which can be used in other scripts to visualize this data.

In [152]:
combo_df.to_csv(r'./data_output/master_crash.csv', index=False)

I'll also make three CSVs separating out crashes with injuries or deaths, for mapping purposes.

In [153]:
no_deaths_injuries.to_csv(r'./data_output/master_minor.csv',index=False)
only_injuries.to_csv(r'./data_output/master_injuries.csv',index=False)
deaths.to_csv(r'./data_output/master_deaths.csv',index=False)