# Processing DfT Road Traffic 'Accidents' data

__May 2022__

I will be using the terminology of 'crash' instead of 'accident'.

This notebook builds on the `../process-dft-data.R` script, which formats the raw data taken from here: https://data.gov.uk/dataset/cb7ae6f0-4be6-4935-9277-47e5ce24a11f/road-safety-data. That script is written in R because there is a handy 'stats19' package that does a lot of the work for us.

With that data pulled and filtered to 2000 - 2020, this notebook then:
1. Filters to crashes within a grid covering most of London
2. Filters to junction-only crashes
3. Filters vehicles and casualties based on the remaining crash IDs
4. Filters to only crashes that involved a cyclist
5. Recalculates the crash severity based on the most serious cyclist casualty

Will move this to a .py script in future if useful.

__Note -__ there is a question mark around the 2020 data. I'm not 100% sure whether the format (how the data was encoded) changed that year, which may have affected how the stats19 package formatted the data.

In [1]:
import pandas as pd
import numpy as np

In [2]:
def accident_severity_counts(group):
    '''
    Count casualties of each severity within one crash.

    Called via groupby().apply(), so `group` is the DataFrame of
    casualties belonging to a single crash, not a single row.
    '''
    severities = group['casualty_severity'].tolist()

    fatal = severities.count('Fatal')
    serious = severities.count('Serious')
    slight = severities.count('Slight')

    return fatal, serious, slight


def get_danger_metric(row):
    '''
    Upweights more severe collisions for junction comparison.

    A fatal casualty counts three times as much as a serious one;
    slight casualties do not contribute.
    '''
    fatal = row['fatal_cyclist_casualties']
    serious = row['serious_cyclist_casualties']

    total_severity = 3 * fatal + serious

    return total_severity


def get_recency_danger_metric(row):
    '''
    Same as get_danger_metric, but also upweights more recent collisions.

    The recency weight is log10(accident_year - 2009), so it is 0 for 2010
    and not a finite number for 2009 or earlier.
    '''
    year = row['accident_year']
    fatal = row['fatal_cyclist_casualties']
    serious = row['serious_cyclist_casualties']

    recency_weight = np.log10(year - 2009)

    total_severity = 3 * fatal + serious
    weighted_severity = total_severity * recency_weight

    return weighted_severity


def get_max_cyclist_severity(row):
    '''
    Label a crash with the severity of its most serious cyclist casualty.
    '''
    if row['fatal_cyclist_casualties'] > 0:
        return 'fatal'
    if row['serious_cyclist_casualties'] > 0:
        return 'serious'
    if row['slight_cyclist_casualties'] > 0:
        return 'slight'
    return None
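
As a quick sanity check of the two metrics defined above, the sketch below recomputes them in vectorised form on a small made-up table. The column names match the ones the functions expect; the `toy` frame and its values are invented purely for illustration.

```python
import numpy as np
import pandas as pd

# Invented rows, purely for illustration: one fatal crash in 2011,
# one serious crash in 2019 and one slight-only crash in 2015.
toy = pd.DataFrame({
    'accident_year': [2011, 2019, 2015],
    'fatal_cyclist_casualties': [1, 0, 0],
    'serious_cyclist_casualties': [0, 1, 0],
    'slight_cyclist_casualties': [0, 0, 1],
})

# Same weighting as get_danger_metric: a fatality counts three times a serious injury.
toy['danger_metric'] = (
    3 * toy['fatal_cyclist_casualties'] + toy['serious_cyclist_casualties']
)

# Same recency weight as get_recency_danger_metric: log10(year - 2009).
toy['recency_danger_metric'] = (
    toy['danger_metric'] * np.log10(toy['accident_year'] - 2009)
)

print(toy[['accident_year', 'danger_metric', 'recency_danger_metric']])
```

Note that a slight-only crash scores 0 on both metrics, and any crash from 2010 scores 0 on the recency metric regardless of severity.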

## Read in data

Three datasets:
- Crashes (aka 'accidents')
- Casualties
- Vehicles

The 'London grid' is a rough bounding box, taken from coordinates that cover the area from Wembley to Bromley.

In [3]:
# define grid
min_lat = 51.404683
max_lat = 51.571305
min_lon = -0.332720
max_lon = 0.037394

crashes = pd.read_csv('../data/collision-data/crashes.csv', low_memory=False)
casualties = pd.read_csv('../data/collision-data/casualties.csv', low_memory=False)
vehicles = pd.read_csv('../data/collision-data/vehicles.csv', low_memory=False)

print(crashes.shape)
crashes = crashes[
    (crashes['latitude'] >= min_lat) &
    (crashes['latitude'] <= max_lat) &
    (crashes['longitude'] >= min_lon) &
    (crashes['longitude'] <= max_lon)
]
print(crashes.shape)

(1320056, 37)
(141222, 37)
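
The bounding-box filter above could also be written with `Series.between`, which is inclusive at both ends by default and so matches the `>=` / `<=` comparisons. This is only a sketch on invented points (the `points` frame is not part of the real data), using the same grid corners.

```python
import pandas as pd

# Grid corners copied from the cell above.
min_lat, max_lat = 51.404683, 51.571305
min_lon, max_lon = -0.332720, 0.037394

# Invented points for illustration: the first is inside the grid,
# the second is too far north, the third has no coordinates.
points = pd.DataFrame({
    'accident_index': ['a', 'b', 'c'],
    'latitude': [51.50, 53.48, None],
    'longitude': [-0.12, -2.24, None],
})

# Build one boolean mask from the latitude and longitude checks.
in_grid = (
    points['latitude'].between(min_lat, max_lat)
    & points['longitude'].between(min_lon, max_lon)
)
print(points[in_grid])
```

In both forms, rows with missing coordinates are dropped, because comparisons against NaN evaluate to False.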


## View sample of data

To see what data is available and which fields it contains.

In [4]:
crashes.sample(5).T.head(50)

Unnamed: 0,170566,727997,592677,298685,585439
accident_index,201201WW50207,2016010019639,201501LX50617,201301CW10824,201501EK41165
accident_year,2012,2016,2015,2013,2015
accident_reference,01WW50207,010019639,01LX50617,01CW10824,01EK41165
location_easting_osgr,522080.0,530790.0,529070.0,529510.0,528220.0
location_northing_osgr,175400.0,171560.0,173170.0,180650.0,185600.0
longitude,-0.243977,-0.120079,-0.144238,-0.135152,-0.151949
latitude,51.464359,51.427906,51.442824,51.509891,51.554723
police_force,Metropolitan Police,Metropolitan Police,Metropolitan Police,Metropolitan Police,Metropolitan Police
accident_severity,Slight,Slight,Slight,Slight,Slight
number_of_vehicles,2,2,1,2,2


## Filter to junction only crashes

This can be done using the `junction_detail` field. I need to look more into how this field is recorded, but for now I will exclude the following:
- 'Not at junction or within 20 metres'
- 'Data missing or out of range'
- 'unknown (self reported)'

In [5]:
crashes.junction_detail.unique()

junction_types = [
    # 'Not at junction or within 20 metres',
    'Other junction',
    'Crossroads',
    'T or staggered junction',
    'More than 4 arms (not roundabout)',
    'Private drive or entrance',
    'Roundabout',
    'Mini-roundabout',
    'Slip road',
    # 'Data missing or out of range',
    # 'unknown (self reported)'
]

crashes = crashes[crashes.junction_detail.isin(junction_types)]

# make sure to apply filtered crashes to casualty and vehicle data...

print(casualties.shape)
casualties = casualties[casualties.accident_index.isin(crashes.accident_index)]
print(casualties.shape)

print(vehicles.shape)
vehicles = vehicles[vehicles.accident_index.isin(crashes.accident_index)]
print(vehicles.shape)

(1745725, 18)
(125014, 18)
(2422908, 27)
(193248, 27)
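
To illustrate how the junction filter propagates to the child tables, here is a minimal sketch on invented miniature tables (all `*_toy` names and values are made up). It ends with a simple check that no casualty is left pointing at a crash that has been filtered out.

```python
import pandas as pd

# Invented miniature tables, for illustration only.
crashes_toy = pd.DataFrame({
    'accident_index': ['a', 'b', 'c'],
    'junction_detail': [
        'Crossroads', 'Not at junction or within 20 metres', 'Roundabout'
    ],
})
casualties_toy = pd.DataFrame({'accident_index': ['a', 'a', 'b', 'c']})

# Keep only the junction crashes (a shortened version of the list above).
junction_types_toy = ['Crossroads', 'Roundabout']
crashes_toy = crashes_toy[crashes_toy['junction_detail'].isin(junction_types_toy)]

# Propagate the filter to the child table via the shared crash ID.
casualties_toy = casualties_toy[
    casualties_toy['accident_index'].isin(crashes_toy['accident_index'])
]

# Every remaining casualty should now point at a remaining crash.
orphans = set(casualties_toy['accident_index']) - set(crashes_toy['accident_index'])
print(len(casualties_toy), orphans)
```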


## Filter to only crashes involving at least one cyclist

Using the `casualty_type` field, so this keeps crashes in which at least one of the casualties was a cyclist.

In [6]:
cyclist_crash_ids = casualties[
    casualties['casualty_type'] == 'Cyclist'
]['accident_index'].unique()

print(f'{len(cyclist_crash_ids)} crash IDs in data')

crashes = crashes[crashes.accident_index.isin(cyclist_crash_ids)]
casualties = casualties[casualties.accident_index.isin(cyclist_crash_ids)]
vehicles = vehicles[vehicles.accident_index.isin(cyclist_crash_ids)]

28956 crash IDs in data
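
One detail worth spelling out is that the filter works at crash level: once a crash has a cyclist casualty, all of its casualties are kept, whatever their type. A small sketch on invented rows (`casualties_toy` is made up for illustration):

```python
import pandas as pd

# Invented casualties, for illustration only.
casualties_toy = pd.DataFrame({
    'accident_index': ['a', 'a', 'b'],
    'casualty_type': ['Cyclist', 'Pedestrian', 'Pedestrian'],
})

# IDs of crashes with at least one cyclist casualty.
cyclist_ids = casualties_toy.loc[
    casualties_toy['casualty_type'] == 'Cyclist', 'accident_index'
].unique()

# Crash 'a' is kept with both of its casualties; crash 'b' is dropped.
kept = casualties_toy[casualties_toy['accident_index'].isin(cyclist_ids)]
print(kept)
```

This is why the next step restricts to `casualty_type == 'Cyclist'` again before counting severities.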


## Recalculate crash severity

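The existing `accident_severity` field is tagged based on any casualties involved, but we want to look at just the most serious cyclist casualty, so the severity is recalculated here from the cyclist casualty records.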

In [7]:
recalculated_severities = (
    casualties[casualties['casualty_type'] == 'Cyclist']
    .groupby(['accident_index', 'accident_year'])
    .apply(accident_severity_counts)
    .reset_index()
)

# split out cols
new_cols = [
    'fatal_cyclist_casualties', 'serious_cyclist_casualties', 'slight_cyclist_casualties'
]
recalculated_severities[new_cols] = pd.DataFrame(
    recalculated_severities[0].tolist(),
    index=recalculated_severities.index
)

recalculated_severities['danger_metric'] = recalculated_severities.apply(get_danger_metric, axis=1)
recalculated_severities['recency_danger_metric'] = recalculated_severities.apply(get_recency_danger_metric, axis=1)
recalculated_severities['max_cyclist_severity'] = recalculated_severities.apply(get_max_cyclist_severity, axis=1)

# remove unrequired cols
recalculated_severities.drop(columns=[0, 'accident_year'], inplace=True)

recalculated_severities.head()

Unnamed: 0,accident_index,fatal_cyclist_casualties,serious_cyclist_casualties,slight_cyclist_casualties,danger_metric,recency_danger_metric,max_cyclist_severity
0,201101BS70088,0,0,1,0,0.0,slight
1,201101BS70110,0,0,1,0,0.0,slight
2,201101BS70125,0,0,1,0,0.0,slight
3,201101BS70132,0,0,1,0,0.0,slight
4,201101BS70172,0,0,1,0,0.0,slight
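
A possible alternative to the `groupby(...).apply(...)` step above is `pd.crosstab`, which counts casualties per crash and severity in a single pass. This is a sketch on invented data rather than a drop-in replacement: `cyclists_toy` is made up, and it assumes the severity labels are exactly 'Fatal', 'Serious' and 'Slight', as used in `accident_severity_counts`.

```python
import pandas as pd

# Invented cyclist casualties, for illustration only.
cyclists_toy = pd.DataFrame({
    'accident_index': ['a', 'a', 'b'],
    'casualty_severity': ['Serious', 'Slight', 'Fatal'],
})

# Count casualties per crash and severity in one pass.
counts = pd.crosstab(
    cyclists_toy['accident_index'], cyclists_toy['casualty_severity']
)

# Make sure all three columns exist, in a fixed order, even if a severity is absent.
counts = counts.reindex(columns=['Fatal', 'Serious', 'Slight'], fill_value=0)
counts.columns = [
    'fatal_cyclist_casualties',
    'serious_cyclist_casualties',
    'slight_cyclist_casualties',
]
counts = counts.reset_index()
print(counts)
```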


In [8]:
recalculated_severities[recalculated_severities['max_cyclist_severity'] == 'fatal'].head()

Unnamed: 0,accident_index,fatal_cyclist_casualties,serious_cyclist_casualties,slight_cyclist_casualties,danger_metric,recency_danger_metric,max_cyclist_severity
2551,201101TD00022,1,0,0,3,0.90309,fatal
2552,201101TD00060,1,0,0,3,0.90309,fatal
2553,201101TD00069,1,0,0,3,0.90309,fatal
2554,201101TD00084,1,0,0,3,0.90309,fatal
2555,201101TD00115,1,0,0,3,0.90309,fatal


In [9]:
# join back to the datasets with severity in it
crashes = crashes.merge(recalculated_severities, how='left', on='accident_index')
casualties = casualties.merge(recalculated_severities, how='left', on='accident_index')
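
Since the severity table should hold exactly one row per crash, the join can be guarded with the `validate` argument of `DataFrame.merge`. The sketch below uses invented tables (`casualties_toy`, `severities_toy`) to show the idea; it is a suggestion, not what the cell above does.

```python
import pandas as pd

# Invented tables, for illustration only.
casualties_toy = pd.DataFrame({'accident_index': ['a', 'a', 'b']})
severities_toy = pd.DataFrame({
    'accident_index': ['a', 'b'],
    'max_cyclist_severity': ['serious', 'fatal'],
})

# validate='many_to_one' raises pandas.errors.MergeError if the right-hand
# table has duplicate crash IDs, which would otherwise duplicate rows silently.
merged = casualties_toy.merge(
    severities_toy, how='left', on='accident_index', validate='many_to_one'
)

# A left join against unique keys should not change the number of rows on the left.
assert len(merged) == len(casualties_toy)
print(merged)
```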

## Output datasets back to csv

Ready for analysis in next notebook

In [10]:
crashes.to_csv('../data/london-crashes.csv', index=False)
casualties.to_csv('../data/london-casualties.csv', index=False)
vehicles.to_csv('../data/london-vehicles.csv', index=False)