# Processing DfT Road Traffic 'Accidents' data

__May 2022__

I use the term 'crash' rather than 'accident' throughout, except where the data's own field names (e.g. `accident_index`) use 'accident'.

This notebook builds on the `../process-dft-data.R` script, which formats the raw data taken from here: https://data.gov.uk/dataset/cb7ae6f0-4be6-4935-9277-47e5ce24a11f/road-safety-data. That script uses R because there is a handy 'stats19' package that does a lot of the work for us.

With that data pulled and filtered to 2000-2020, this notebook then:
1. Filters to crashes within a grid covering most of London
2. Filters to junction-only crashes
3. Filters vehicles and casualties based on the remaining crash IDs
4. Filters to only crashes that involved a cyclist
5. Recalculates the crash severity based on the most serious cyclist casualty

I may move this to a .py script in future if useful.

__Note -__ there is a question mark around the 2020 data. I'm not 100% sure whether the encoding of the raw data changed that year, which may have affected how the stats19 package formatted it.

In [15]:
import pandas as pd

## Read in data

Three datasets:
- Crashes (aka 'accidents')
- Casualties
- Vehicles

The 'London grid' is a rough bounding box, taken from coordinates that span approximately Wembley to Bromley.

In [16]:
# define grid
min_lat = 51.404683
max_lat = 51.571305
min_lon = -0.332720
max_lon = 0.037394

crashes = pd.read_csv('../data/crashes.csv', low_memory=False)
casualties = pd.read_csv('../data/casualties.csv', low_memory=False)
vehicles = pd.read_csv('../data/vehicles.csv', low_memory=False)

print(crashes.shape)
crashes = crashes[
    (crashes['latitude'] >= min_lat) &
    (crashes['latitude'] <= max_lat) &
    (crashes['longitude'] >= min_lon) &
    (crashes['longitude'] <= max_lon)
]
print(crashes.shape)

(3484560, 37)
(314173, 37)
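
As an aside, the grid filter above can be wrapped in a small helper. This is a minimal sketch, assuming a hypothetical function name `filter_to_bbox` and made-up toy rows (not real DfT data). It uses `Series.between`, which is inclusive at both ends by default, and it shows that rows with missing coordinates are dropped by the filter.

```python
import pandas as pd


def filter_to_bbox(df, min_lat, max_lat, min_lon, max_lon):
    """Keep rows whose latitude/longitude fall inside the box (bounds inclusive)."""
    in_box = (
        df['latitude'].between(min_lat, max_lat)
        & df['longitude'].between(min_lon, max_lon)
    )
    return df[in_box]


# toy rows for illustration only, not real crash records
toy = pd.DataFrame({
    'accident_index': ['a', 'b', 'c'],
    'latitude': [51.50, 52.50, None],    # 'b' is too far north, 'c' has no latitude
    'longitude': [-0.10, -0.10, -0.10],
})

toy_london = filter_to_bbox(toy, 51.404683, 51.571305, -0.332720, 0.037394)
print(toy_london['accident_index'].tolist())
```

Only row `a` survives here, so it may be worth counting crashes with missing coordinates before filtering if that loss matters.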


## View sample of data

A quick look at a few rows to see what data is available and which fields it contains.

In [17]:
crashes.sample(5).T.head(50)

Unnamed: 0,1512464,1694085,470260,2467178,904425
accident_index,200701WW50641,200801VW49045,200201HT00545,201301HT20285,200401HT00297
accident_year,2007,2008,2002,2013,2004
accident_reference,01WW50641,01VW49045,01HT00545,01HT20285,01HT00297
location_easting_osgr,528110.0,528300.0,535170.0,534730.0,537290.0
location_northing_osgr,170890.0,170210.0,181270.0,181310.0,180980.0
longitude,-0.158849,-0.156364,-0.053402,-0.059724,-0.022982
latitude,51.422496,51.416342,51.514137,51.514602,51.511021
police_force,Metropolitan Police,Metropolitan Police,Metropolitan Police,Metropolitan Police,Metropolitan Police
accident_severity,Slight,Slight,Slight,Slight,Slight
number_of_vehicles,4,2,2,2,1


## Filter to junction only crashes

This can be done using the `junction_detail` field. I need to look more into how this field is recorded, but for now I will exclude the following:
- 'Not at junction or within 20 metres'
- 'Data missing or out of range'
- 'unknown (self reported)'

In [18]:
crashes.junction_detail.unique()

junction_types = [
    # 'Not at junction or within 20 metres',
    'Other junction',
    'Crossroads',
    'T or staggered junction',
    'More than 4 arms (not roundabout)',
    'Private drive or entrance',
    'Roundabout',
    'Mini-roundabout',
    'Slip road',
    # 'Data missing or out of range',
    # 'unknown (self reported)'
]

crashes = crashes[crashes.junction_detail.isin(junction_types)]

# make sure to apply filtered crashes to casualty and vehicle data...

print(casualties.shape)
casualties = casualties[casualties.accident_index.isin(crashes.accident_index)]
print(casualties.shape)

print(vehicles.shape)
vehicles = vehicles[vehicles.accident_index.isin(crashes.accident_index)]
print(vehicles.shape)

(4692269, 18)
(277900, 18)
(6394069, 27)
(423565, 27)
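
Because the junction filter relies on exact string matches, a misspelt label would silently drop rows. A minimal sketch of a check for that, assuming a hypothetical helper `unmatched_labels` and a made-up toy series:

```python
import pandas as pd


def unmatched_labels(series, wanted):
    """Return the labels in `wanted` that never occur in `series`."""
    present = set(series.dropna().unique())
    return sorted(set(wanted) - present)


# toy values for illustration only
toy_junctions = pd.Series(['Crossroads', 'Roundabout', 'Mini-roundabout', 'Crossroads'])

missing = unmatched_labels(toy_junctions, ['Crossroads', 'Roundabout', 'Slip road'])
print(missing)
```

In the notebook this would be called as `unmatched_labels(crashes.junction_detail, junction_types)` before the filter is applied. An empty list means every label in `junction_types` matched at least one row.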


## Filter to only crashes involving at least one cyclist

Using the `casualty_type` field.

In [19]:
cyclist_crash_ids = casualties[
    casualties['casualty_type'] == 'Cyclist'
]['accident_index'].unique()

print(f'{len(cyclist_crash_ids)} crash IDs in data')

crashes = crashes[crashes.accident_index.isin(cyclist_crash_ids)]
casualties = casualties[casualties.accident_index.isin(cyclist_crash_ids)]
vehicles = vehicles[vehicles.accident_index.isin(cyclist_crash_ids)]

50214 crash IDs in data
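
The same `isin` filter is applied to all three tables, so it could be pulled into one helper. A minimal sketch, assuming a hypothetical function `keep_crashes` and made-up toy tables:

```python
import pandas as pd


def keep_crashes(ids, *frames):
    """Filter each frame to rows whose accident_index is in `ids`."""
    return [df[df['accident_index'].isin(ids)] for df in frames]


# toy tables for illustration only
toy_crashes = pd.DataFrame({'accident_index': ['x', 'y', 'z']})
toy_casualties = pd.DataFrame({
    'accident_index': ['x', 'x', 'y', 'z'],
    'casualty_type': ['Cyclist', 'Pedestrian', 'Pedestrian', 'Cyclist'],
})

toy_ids = toy_casualties.loc[
    toy_casualties['casualty_type'] == 'Cyclist', 'accident_index'
].unique()

toy_crashes, toy_casualties = keep_crashes(toy_ids, toy_crashes, toy_casualties)
print(toy_crashes['accident_index'].tolist())
print(len(toy_casualties))
```

Note that the pedestrian in crash `x` is kept: the filter works at the crash level, so non-cyclist casualties in cyclist crashes stay in the casualties table.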


## Recalculate crash severity

The existing `accident_severity` field is based on the most serious casualty of any type in the crash, but we want the severity of the most serious cyclist casualty only.

In [20]:
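def calculate_accident_severity(severities):
    """Return the most serious label among one crash's cyclist casualties."""
    severities = severities.tolist()

    # checked from most to least serious; STATS19 calls the top level 'Fatal'
    if 'Fatal' in severities:
        return 'Fatal'
    if 'Serious' in severities:
        return 'Serious'
    if 'Slight' in severities:
        return 'Slight'
    # falls through and returns None if no recognised label is present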

recalculated_severities = (
    casualties[casualties['casualty_type'] == 'Cyclist']
    .groupby('accident_index')['casualty_severity']
    .apply(calculate_accident_severity)
    .reset_index(name='max_cyclist_severity')
)

recalculated_severities

Unnamed: 0,accident_index,max_cyclist_severity
0,200001BS00005,Slight
1,200001BS00010,Slight
2,200001BS00019,Slight
3,200001BS00024,Serious
4,200001BS00036,Serious
...,...,...
50209,2020480999117,Slight
50210,2020481001417,Slight
50211,2020481001449,Serious
50212,2020481002082,Serious
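
An alternative to the chain of `if` checks is to map each label to a rank and take the maximum per crash. This is a minimal sketch on made-up toy rows, assuming the three severity labels are 'Slight', 'Serious' and 'Fatal'; the name `severity_order` is hypothetical.

```python
import pandas as pd

# assumed labels, ordered from least to most serious
severity_order = ['Slight', 'Serious', 'Fatal']
rank = {label: i for i, label in enumerate(severity_order)}

# toy rows for illustration only
toy_casualties = pd.DataFrame({
    'accident_index': ['x', 'x', 'y', 'z', 'z'],
    'casualty_type': ['Cyclist', 'Cyclist', 'Cyclist', 'Cyclist', 'Pedestrian'],
    'casualty_severity': ['Slight', 'Serious', 'Slight', 'Slight', 'Fatal'],
})

cyclists = toy_casualties[toy_casualties['casualty_type'] == 'Cyclist'].copy()
cyclists['severity_rank'] = cyclists['casualty_severity'].map(rank)

# highest rank per crash, then translate the rank back to its label
worst_rank = cyclists.groupby('accident_index')['severity_rank'].max()
toy_max = (
    worst_rank.map(dict(enumerate(severity_order)))
    .reset_index(name='max_cyclist_severity')
)
print(toy_max)
```

Crash `z` comes out as 'Slight' even though a pedestrian died in it, which is the point of recalculating on cyclist casualties only.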


In [21]:
# join the recalculated cyclist severity back onto the crashes and casualties
crashes = crashes.merge(recalculated_severities, how='left', on='accident_index')
casualties = casualties.merge(recalculated_severities, how='left', on='accident_index')
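
A left join can silently duplicate rows if the right-hand table has repeated keys. A minimal sketch of guarding against that with the `validate` argument of `DataFrame.merge`, using made-up toy tables:

```python
import pandas as pd

# toy tables for illustration only
toy_crashes = pd.DataFrame({
    'accident_index': ['x', 'y'],
    'number_of_vehicles': [2, 1],
})
toy_severities = pd.DataFrame({
    'accident_index': ['x', 'y'],
    'max_cyclist_severity': ['Serious', 'Slight'],
})

# validate raises pandas.errors.MergeError if either side has duplicate keys
merged = toy_crashes.merge(
    toy_severities, how='left', on='accident_index', validate='one_to_one'
)

print(len(merged) == len(toy_crashes))
print(merged['max_cyclist_severity'].notna().all())
```

For the casualties table, where one crash has several rows, `validate='many_to_one'` would be the equivalent check.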

## Output datasets back to csv

Ready for analysis in the next notebook.

In [22]:
crashes.to_csv('../data/london-crashes.csv', index=False)
casualties.to_csv('../data/london-casualties.csv', index=False)
vehicles.to_csv('../data/london-vehicles.csv', index=False)
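
One thing to watch when reading these files back in: some `accident_index` values (e.g. the 2020 ones shown above) are all digits, so pandas may infer a numeric column if a file only contains such IDs. A minimal sketch of pinning the type on read, using an in-memory buffer and toy IDs rather than the real files:

```python
import io

import pandas as pd

# toy IDs for illustration only, written to an in-memory buffer instead of disk
toy = pd.DataFrame({'accident_index': ['2020480999117', '2020481001417']})
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# default read: the all-digit IDs are inferred as integers
buffer.seek(0)
inferred = pd.read_csv(buffer)

# pinned read: the IDs stay as text, so joins on accident_index keep working
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={'accident_index': str})

print(inferred['accident_index'].dtype)
print(as_text['accident_index'].tolist())
```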