# Road Accident Analysis and Prediction

This dataset provides detailed information on reported road accident casualties. It covers accident status, vehicle and casualty references, demographics, and casualty severity, along with pedestrian details, casualty types, road maintenance worker involvement, and the Index of Multiple Deprivation (IMD) decile of each casualty's home area. The file used in this notebook is the provisional, unvalidated mid-year 2022 release.

The dataset is provided by the UK Department for Transport and is updated annually. It is a valuable resource for understanding the causes and effects of road accidents and for developing strategies to prevent them.

In this notebook, we will perform an exploratory data analysis (EDA) to understand the dataset's structure and contents. We will also build a machine learning model to predict the severity of road accidents based on various attributes.

Finally, we will suggest strategies for preventing road accidents, based on the insights gained from the analysis.

## Importing Libraries


In [1]:
import warnings
# Suppress all warnings to keep the notebook output uncluttered
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score




## Load the Data

In [2]:
df = pd.read_csv("dataset/dft-road-casualty-statistics-casualty-provisional-mid-year-unvalidated-2022 (1).csv").drop_duplicates()
print(f'shape of dataset is {df.shape}')
df.head().T

shape of dataset is (61352, 20)


| | 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- |
| status | Unvalidated | Unvalidated | Unvalidated | Unvalidated | Unvalidated |
| accident_index | 2022070151244 | 2022070152668 | 2022070154696 | 2022070154696 | 2022070154696 |
| accident_year | 2022 | 2022 | 2022 | 2022 | 2022 |
| accident_reference | 070151244 | 070152668 | 070154696 | 070154696 | 070154696 |
| vehicle_reference | 2 | 1 | 1 | 2 | 3 |
| casualty_reference | 1 | 1 | 1 | 3 | 2 |
| casualty_class | 1 | 1 | 1 | 1 | 1 |
| sex_of_casualty | 2 | 1 | 2 | 2 | 1 |
| age_of_casualty | 46 | 30 | 58 | 78 | 63 |
| age_band_of_casualty | 8 | 6 | 9 | 11 | 9 |


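**Columns** (names are lowercase in the data, and categorical fields are stored as integer codes rather than text labels):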



**Status**: The validation status of the record. Every row shown above is marked `Unvalidated`, as expected for a provisional release.

**Accident_Index**: A unique identifier for each reported accident.

**Accident_Year**: The year in which the accident occurred.

**Accident_Reference**: A reference number associated with the accident.

**Vehicle_Reference**: A reference number for the involved vehicle in the accident.

**Casualty_Reference**: A reference number for the casualty involved in the accident.

**Casualty_Class**: Indicates the class of the casualty (e.g., driver, passenger, pedestrian).

**Sex_of_Casualty**: The gender of the casualty (male or female).

**Age_of_Casualty**: The age of the casualty.

**Age_Band_of_Casualty**: Age group to which the casualty belongs (e.g., 0-5, 6-10, 11-15).

**Casualty_Severity**: The severity of the casualty's injuries (e.g., fatal, serious, slight).

**Pedestrian_Location**: The location of the pedestrian at the time of the accident.

**Pedestrian_Movement**: The movement of the pedestrian during the accident.

**Car_Passenger**: Indicates whether the casualty was a car passenger and, if so, whether they were in a front or rear seat.

**Bus_or_Coach_Passenger**: Indicates whether the casualty was a bus or coach passenger and, if so, whether they were boarding, alighting, standing, or seated.

**Pedestrian_Road_Maintenance_Worker**: Indicates whether the casualty was a pedestrian working as a road maintenance worker (yes, no, or not known).

**Casualty_Type**: The type of road user the casualty was (e.g., pedestrian, cyclist, car occupant).

**Casualty_Home_Area_Type**: The type of area in which the casualty resides (e.g., urban, rural).

**Casualty_IMD_Decile**: The IMD decile of the area where the casualty resides (a measure of deprivation).

**LSOA_of_Casualty**: The Lower Layer Super Output Area (LSOA) in which the casualty resides.
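
Because the categorical columns hold integer codes, it can help to attach readable labels before plotting or reporting. The sketch below shows one way to do that for three columns. The code-to-label mappings are assumptions taken from the published STATS19 data guide and should be checked against the guide for the release in use; `decode_codes` and the `*_label` column names are hypothetical helpers, not part of the dataset.

```python
import pandas as pd

# Code-to-label mappings, ASSUMED from the STATS19 data guide;
# verify them against the guide that accompanies your release.
CASUALTY_CLASS = {1: "Driver or rider", 2: "Passenger", 3: "Pedestrian"}
CASUALTY_SEVERITY = {1: "Fatal", 2: "Serious", 3: "Slight"}
SEX_OF_CASUALTY = {1: "Male", 2: "Female"}


def decode_codes(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a *_label column added for each mapped column."""
    out = frame.copy()
    out["casualty_class_label"] = out["casualty_class"].map(CASUALTY_CLASS)
    out["casualty_severity_label"] = out["casualty_severity"].map(CASUALTY_SEVERITY)
    out["sex_of_casualty_label"] = out["sex_of_casualty"].map(SEX_OF_CASUALTY)
    return out


# Toy rows shaped like the dataset (hypothetical values, not real records)
sample = pd.DataFrame(
    {
        "casualty_class": [1, 3],
        "casualty_severity": [3, 2],
        "sex_of_casualty": [2, 1],
    }
)
decoded = decode_codes(sample)
print(decoded)
```

Codes that are absent from a mapping (for example `-1`) come back as `NaN` in the label column, which makes unmapped values easy to spot.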

## Data Preprocessing

In [3]:
# Check for missing values
df.isnull().sum()

status                                0
accident_index                        0
accident_year                         0
accident_reference                    0
vehicle_reference                     0
casualty_reference                    0
casualty_class                        0
sex_of_casualty                       0
age_of_casualty                       0
age_band_of_casualty                  0
casualty_severity                     0
pedestrian_location                   0
pedestrian_movement                   0
car_passenger                         0
bus_or_coach_passenger                0
pedestrian_road_maintenance_worker    0
casualty_type                         0
casualty_home_area_type               0
casualty_imd_decile                   0
lsoa_of_casualty                      0
dtype: int64
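
The zero counts above only reflect `NaN` values. Later in this notebook `-1` is treated as the missing-value marker, so `isnull()` cannot see those gaps. A minimal sketch of a per-column sentinel count follows; `count_sentinel` is a hypothetical helper and the toy frame uses made-up values.

```python
import pandas as pd


def count_sentinel(frame: pd.DataFrame, sentinel: int = -1) -> pd.Series:
    """Count, per column, how many cells equal the sentinel value."""
    return (frame == sentinel).sum()


# Toy frame with made-up values; -1 stands in for "missing"
toy_missing = pd.DataFrame(
    {
        "age_of_casualty": [46, -1, 58],
        "sex_of_casualty": [2, 1, -1],
        "casualty_severity": [3, 3, 2],
    }
)
sentinel_counts = count_sentinel(toy_missing)
print(sentinel_counts)
print(toy_missing.isnull().sum().sum())  # NaN count stays at zero
```

Note that a text column holding the string `"-1"` would not match the integer `-1` here and would need a separate check.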

In [4]:
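# Check how many records are marked as unvalidated
print((df["status"] == "Unvalidated").value_counts())
# Drop the status, identifier, year, and LSOA columns, which are not used in the analysis
df.drop(columns=['status','accident_index','accident_reference', 'accident_year', 'lsoa_of_casualty'], inplace=True)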
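
Before dropping columns such as `status` or `accident_year`, it may be worth confirming that they really carry a single value. The sketch below lists columns with at most one distinct value; `constant_columns` is a hypothetical helper and the toy frame is illustrative only.

```python
import pandas as pd


def constant_columns(frame: pd.DataFrame) -> list:
    """Return the names of columns that hold at most one distinct value."""
    distinct = frame.nunique(dropna=False)
    return distinct[distinct <= 1].index.tolist()


# Toy frame with illustrative values
toy_constant = pd.DataFrame(
    {
        "status": ["Unvalidated", "Unvalidated", "Unvalidated"],
        "accident_year": [2022, 2022, 2022],
        "age_of_casualty": [46, 30, 58],
    }
)
print(constant_columns(toy_constant))
```

A constant column cannot help a model distinguish between rows, so it is a safe candidate for removal.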


In [17]:
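# The dataset encodes missing values as -1: convert them to NaN, then drop the affected rows
df.replace(-1, np.nan, inplace=True)
df.dropna(inplace=True)
# Introducing NaN turns columns into floats, so convert everything back to int
df = df.astype(int)
df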

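| | vehicle_reference | casualty_reference | casualty_class | sex_of_casualty | age_of_casualty | age_band_of_casualty | casualty_severity | pedestrian_location | pedestrian_movement | car_passenger | bus_or_coach_passenger | pedestrian_road_maintenance_worker | casualty_type | casualty_home_area_type | casualty_imd_decile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2 | 1 | 1 | 2 | 46 | 8 | 3 | 0 | 0 | 0 | 0 | 0 | 9 | 1 | 9 |
| 1 | 1 | 1 | 1 | 1 | 30 | 6 | 3 | 0 | 0 | 0 | 0 | 0 | 9 | 1 | 2 |
| 2 | 1 | 1 | 1 | 2 | 58 | 9 | 3 | 0 | 0 | 0 | 0 | 0 | 9 | 1 | 10 |
| 3 | 2 | 3 | 1 | 2 | 78 | 11 | 3 | 0 | 0 | 0 | 0 | 0 | 9 | 2 | 10 |
| 4 | 3 | 2 | 1 | 1 | 63 | 9 | 3 | 0 | 0 | 0 | 0 | 0 | 9 | 3 | 7 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 61346 | 1 | 1 | 1 | 1 | 69 | 10 | 2 | 0 | 0 | 0 | 0 | 0 | 4 | 1 | 7 |
| 61347 | 1 | 1 | 3 | 2 | 56 | 9 | 3 | 10 | 9 | 0 | 0 | 0 | 0 | 1 | 10 |
| 61349 | 2 | 1 | 1 | 1 | 42 | 7 | 3 | 0 | 0 | 0 | 0 | 0 | 9 | 1 | 5 |
| 61350 | 1 | 1 | 1 | 2 | 40 | 7 | 3 | 0 | 0 | 0 | 0 | 0 | 9 | 1 | 3 |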
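
As a quick sanity check on the cleaned data, the age band can be re-derived from the age and compared with the stored code. The sketch below uses the age and band pairs visible in the rows above. The band edges (0-5, 6-10, 11-15, 16-20, 21-25, 26-35, 36-45, 46-55, 56-65, 66-75, over 75) are an assumption based on the STATS19 data guide, and `derive_age_band` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Bin edges for the age bands, ASSUMED from the STATS19 data guide
AGE_BAND_EDGES = [0, 5, 10, 15, 20, 25, 35, 45, 55, 65, 75, np.inf]
AGE_BAND_CODES = list(range(1, 12))  # band codes 1..11


def derive_age_band(ages: pd.Series) -> pd.Series:
    """Map ages to band codes using right-closed bins, with age 0 in band 1."""
    bands = pd.cut(
        ages,
        bins=AGE_BAND_EDGES,
        labels=AGE_BAND_CODES,
        right=True,
        include_lowest=True,
    )
    return bands.astype(int)


# Age / band pairs copied from the rows displayed above
check = pd.DataFrame(
    {
        "age_of_casualty": [46, 30, 58, 78, 63, 69, 56, 42, 40],
        "age_band_of_casualty": [8, 6, 9, 11, 9, 10, 9, 7, 7],
    }
)
check["derived_band"] = derive_age_band(check["age_of_casualty"])
mismatches = check[check["derived_band"] != check["age_band_of_casualty"]]
print(f"rows where stored and derived bands disagree: {len(mismatches)}")
```

If the two columns always agree, `age_band_of_casualty` adds no information beyond `age_of_casualty`, which is worth keeping in mind when selecting model features.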
