In [1]:
import pandas as pd
from utils import read_data

pd.set_option('display.max_rows', 150)
pd.set_option('display.max_columns', 150)

# Our task

In this activity, we're asked to "[look] for accident / fatality risk factors, and [provide] an understanding of the factors that contribute to accidents and the severity of accidents." Examples include location, time of day, weather, and road surfaces. To answer these questions we use road accident and safety data from Great Britain, which can be found [here](https://www.data.gov.uk/dataset/cb7ae6f0-4be6-4935-9277-47e5ce24a11f/road-safety-data). We're to create a Jupyter Notebook that presents these findings and communicates suggested improvements to the Department for Transport. Additionally, if we're able, we are to provide information that car manufacturers can use to improve their products.

# Modeling strategy

Implicitly, this asks us to create a causal model of the factors leading to an accident or fatality. However, **we lack data about non-accidents and non-fatalities, as well as a plausible exogenous source of variation from which a causal effect could be identified** (for more, see [endogeneity](https://en.wikipedia.org/wiki/Endogeneity_(econometrics))). For this reason I interpret model results as **correlational, not causal**.

This task is also distinctly different from classical machine learning modeling efforts, where **prediction** is chiefly of interest. **Rather, we are interested in model coefficients** à la regression analysis. 
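To make the coefficient-focused reading concrete, here is a minimal sketch on invented data. It assumes a linear probability model fit by ordinary least squares with NumPy; the variable names (`wet_road`, `severe`) and the values are hypothetical and are not taken from the dataset or from the models used later.

```python
import numpy as np

# Invented toy data: 8 vehicles, a binary road-surface flag and a binary outcome.
wet_road = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
severe = np.array([0, 0, 0, 1, 0, 1, 1, 1], dtype=float)

# Design matrix with an intercept column followed by the single regressor.
X = np.column_stack([np.ones_like(wet_road), wet_road])

# Ordinary least squares; coef[0] is the intercept, coef[1] the slope.
coef, *_ = np.linalg.lstsq(X, severe, rcond=None)
intercept, slope = coef

# With one binary regressor, the slope equals the difference in group means.
print(f"share severe on dry roads: {intercept:.2f}")
print(f"difference for wet roads:  {slope:.2f}")
```

The slope is read as an association: in this toy sample, vehicles on wet roads have a higher share of severe outcomes. It says nothing about what would happen if the road surface were changed.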

# 1.0 Data preprocessing & data selection

These data are hierarchical in nature, presenting some initial modeling and data challenges in their raw form. There are three distinct entities and corresponding datasets: accidents (i.e. general information on where, when, and why), vehicles (i.e. cars, vans, buses, trolleys, bicycles, and other vehicles involved in an accident), and casualties (i.e. people within a particular vehicle, but also potentially pedestrians struck by a vehicle). Casualties are nested within vehicles, which in turn are nested within accidents.

I focus on *vehicles* as the level of analysis: I aggregate casualty information to the vehicle level (e.g. total casualties, mean age of casualties, share male) and merge it onto the vehicle file. In a final step I merge the accident data onto the vehicle file as well. Because substantial data preprocessing was required, I outsourced this to several functions in `utils.py`. In addition to this data restructuring, I drop several features that I suspect have no causal effect on the outcome, recode unknown category values to missing, engineer some new datetime features, consolidate the number of distinct vehicle types down from 
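The restructuring described above can be sketched roughly as follows. This is not the code in `utils.py`, which is not shown here; the function name `build_vehicle_table`, the toy frames, the raw column names `sex_of_casualty` and `age_of_casualty`, and the coding of 1 as male are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the three raw files; every value is invented.
accidents = pd.DataFrame({
    "accident_reference": ["A1", "A2"],
    "speed_limit": [30, 60],
    "weather_conditions": [1, -1],  # -1 marks an unknown value
})
vehicles = pd.DataFrame({
    "accident_reference": ["A1", "A1", "A2"],
    "vehicle_reference": [1, 2, 1],
    "age_of_driver": [58, -1, 24],
})
casualties = pd.DataFrame({
    "accident_reference": ["A1", "A1", "A1", "A2"],
    "vehicle_reference": [1, 1, 1, 1],
    "sex_of_casualty": [1, 2, 2, 2],  # assumed coding: 1 = male
    "age_of_casualty": [58, 60, 56, 24],
})


def build_vehicle_table(accidents, vehicles, casualties):
    """Aggregate casualties per vehicle, then attach accident-level fields."""
    keys = ["accident_reference", "vehicle_reference"]
    per_vehicle = (
        casualties
        .assign(is_male=casualties["sex_of_casualty"].eq(1))
        .groupby(keys)
        .agg(
            casualty_total=("age_of_casualty", "size"),
            casualty_mean_age=("age_of_casualty", "mean"),
            casualty_share_male=("is_male", "mean"),
        )
        .reset_index()
    )
    # Left merge keeps vehicles that carried no casualties.
    out = vehicles.merge(per_vehicle, on=keys, how="left")
    out["casualty_total"] = out["casualty_total"].fillna(0).astype(int)
    # Many vehicles map to one accident; validate guards against duplicate keys.
    out = out.merge(accidents, on="accident_reference", how="left",
                    validate="many_to_one")
    # Recode the -1 "unknown" sentinel to missing.
    return out.replace(-1, np.nan)


vehicle_level = build_vehicle_table(accidents, vehicles, casualties)
print(vehicle_level)
```

Using a left merge from the vehicle file keeps one row per vehicle, which is what makes the vehicle the unit of analysis. Note that a blanket `replace(-1, np.nan)` is only safe if no column uses -1 as a real value.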

In [2]:
df = read_data()

In [3]:
df.head()

Unnamed: 0,accident_year,accident_reference,vehicle_reference,vehicle_type,towing_and_articulation,vehicle_manoeuvre,vehicle_location_restricted_lane,junction_location,skidding_and_overturning,hit_object_in_carriageway,vehicle_leaving_carriageway,hit_object_off_carriageway,first_point_of_impact,vehicle_left_hand_drive,journey_purpose_of_driver,sex_of_driver,age_of_driver,engine_capacity_cc,propulsion_code,age_of_vehicle,driver_imd_decile,driver_home_area_type,casualty_class_0,casualty_class_1,casualty_class_2,casualty_class_3,casualty_severity_1,casualty_severity_2,casualty_severity_3,casualty_severity_4,casualty_worst,casualty_share_male,casualty_mean_age,casualty_total,longitude,latitude,number_of_vehicles,local_authority_district,first_road_class,road_type,speed_limit,junction_detail,junction_control,second_road_class,light_conditions,weather_conditions,road_surface_conditions,special_conditions_at_site,carriageway_hazards,urban_or_rural_area,datetime,month,day,dayw,hour,elapsed_time
0,2019,10128300,1,9,0,-1,-1,-1,-1,-1,-1,-1,4,-1,-1,1,58,-1,-1,-1,2,1,0,1,2,0,0,0,3,0,3,0.333333,58.0,3,-0.153842,51.508057,2,1,3,1,30,1,2,3,1,1,1,0,0,1,2019-02-18 17:50:00,2,18,0,17,4211040.0
1,2019,10128300,2,9,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,2,1,1,0,0,0,0,0,0,1,4,-1.0,-1.0,1,-0.153842,51.508057,2,1,3,1,30,1,2,3,1,1,1,0,0,1,2019-02-18 17:50:00,2,18,0,17,4211040.0
2,2019,10152270,1,9,0,18,-1,0,-1,-1,-1,-1,1,-1,-1,2,24,-1,-1,-1,3,1,0,1,0,0,0,0,1,0,3,0.0,24.0,1,-0.127949,51.436208,2,9,3,2,30,0,-1,-1,4,1,1,0,0,1,2019-01-15 21:45:00,1,15,1,21,1287540.0
3,2019,10152270,2,9,0,18,-1,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,6,1,1,0,0,0,0,0,0,1,4,-1.0,-1.0,1,-0.127949,51.436208,2,9,3,2,30,0,-1,-1,4,1,1,0,0,1,2019-01-15 21:45:00,1,15,1,21,1287540.0
4,2019,10155191,1,9,0,3,0,1,0,0,0,0,2,1,-1,1,45,-1,-1,-1,4,1,1,0,0,0,0,0,0,1,4,-1.0,-1.0,1,-0.124193,51.526795,2,2,4,6,30,3,4,6,4,1,1,0,0,1,2019-01-01 01:50:00,1,1,1,1,6240.0
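The datetime-derived columns in this preview (`month`, `day`, `dayw`, `hour`, `elapsed_time`) could be produced along the following lines. This is a sketch under assumptions rather than the code in `utils.py`: it assumes `dayw` follows pandas' convention of Monday = 0 and that `elapsed_time` counts seconds since the earliest record in the data. The helper name `add_datetime_features` is hypothetical.

```python
import pandas as pd


def add_datetime_features(frame, column="datetime"):
    """Return a copy of `frame` with calendar features derived from `column`."""
    out = frame.copy()
    stamp = pd.to_datetime(out[column])
    out["month"] = stamp.dt.month
    out["day"] = stamp.dt.day
    out["dayw"] = stamp.dt.dayofweek  # Monday = 0, Sunday = 6
    out["hour"] = stamp.dt.hour
    # Assumption: elapsed time is measured from the earliest timestamp present.
    out["elapsed_time"] = (stamp - stamp.min()).dt.total_seconds()
    return out


# Three timestamps taken from the preview rows.
sample = pd.DataFrame({
    "datetime": [
        "2019-01-01 01:50:00",
        "2019-01-15 21:45:00",
        "2019-02-18 17:50:00",
    ]
})
features = add_datetime_features(sample)
print(features)
```

Because the reference point here is the earliest timestamp in `sample`, the `elapsed_time` values differ from the preview by a constant offset; the gaps between rows are the same.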
