# Introduction

This is a multiclass classification project that predicts the severity of road accidents. It is based on real-world data, and the dataset is highly imbalanced. The target variable has three classes of injury: minor, severe, and fatal.

Road accidents are a major cause of unnatural deaths around the world. Governments work to raise awareness of the rules and regulations that drivers must follow in order to reduce fatalities. A mechanism that predicts the severity of such accidents could therefore help reduce fatalities.

# Objective

To learn how to build and use a scikit-learn Pipeline

# Problem statement

The target feature is "Accident_severity", a multi-class variable. The task is to classify it from the other 31 features, working step by step through each stage of the data science process. The evaluation metrics are the F1 score, the confusion matrix, and the classification report.


# Dataset description

Time — Time of the accident (in 24-hour format)

Day_of_week — A day when an accident occurred

Age_band_of_driver — The age group of the driver

Sex_of_driver — Gender of driver

Educational_level — Driver’s highest education level

Vehicle_driver_relation — The driver's relation to the vehicle

Driving_experience — How many years of driving experience the driver has

Type_of_vehicle — What’s the type of vehicle

Owner_of_vehicle — Who’s the owner of the vehicle

Service_year_of_vehicle — The last service year of the vehicle

Defect_of_vehicle — Is there any defect on the vehicle or not?

Area_accident_occured — Locality of an accident site

Lanes_or_Medians — Are there any lanes or medians at the accident site?

Road_allignment — Road alignment with the terrain of the land

Types_of_Junction — Type of junction at the accident site

Road_surface_type — A surface type of road

Road_surface_conditions — What was the condition of the road surface?

Light_conditions — Lighting conditions at the site

Weather_conditions — Weather situation at the site of an accident

Type_of_collision — What is the type of collision

Number_of_vehicles_involved — Total number of vehicles involved in an accident

Number_of_casualties — Total number of casualties in an accident

Vehicle_movement — How the vehicle was moving before the accident occurred

Casualty_class — The role of the casualty, for example driver or rider

Sex_of_casualty — Gender of the casualty

Age_band_of_casualty — Age group of casualty

Casualty_severity — How severely the casualty was injured

Work_of_casuality — What was the work of the casualty

Fitness_of_casuality — Fitness level of casualty

Pedestrian_movement — Was there any pedestrian movement on the road?

Cause_of_accident — What was the cause of the accident?


Accident_severity — How severe the accident was (target variable)

# Load Dataset

In [None]:
# importing pandas 
import pandas as pd 

# using pandas read_csv function to load the dataset 
df = pd.read_csv("10 pipe dataset.csv") 

df.head()

In [None]:
df.shape

In [None]:
# print the dataset information
df.info()

In [None]:
# target variable classes counts and bar plot
print(df['Accident_severity'].value_counts())
df['Accident_severity'].value_counts().plot(kind='bar')

# Exploratory data analysis of the dataset

In [None]:
# Education levels of car drivers
df['Educational_level'].value_counts().plot(kind='bar')

In [None]:
# plot the bar plot of road_surface_type and accident severity feature
import matplotlib.pyplot as plt
import seaborn as sns 


plt.figure(figsize=(6,5))
sns.countplot(x='Road_surface_type', hue='Accident_severity', data=df)
plt.xlabel('Road surface type')
plt.xticks(rotation=60)
plt.show()

# Data Preparation

We will start pre-processing the dataset by changing the “Time” column datatype to the “datetime” datatype. We will then extract the hour of the day feature to prepare the data for modeling.


In [None]:
# convert object type column into datetime datatype column
df['Time'] = pd.to_datetime(df['Time'])

# Extracting the 'Hour_of_Day' feature from the Time column
new_df = df.copy()
new_df['Hour_of_Day'] = df['Time'].dt.hour
new_df.drop('Time',axis=1,inplace=True)
new_df.head()
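
The hour extraction can be tried on a few hand-written values. This is a minimal sketch that assumes the `Time` strings look like `HH:MM:SS`; the sample values and the `toy` frame are made up for illustration, and passing an explicit `format` is optional.

```python
import pandas as pd

# Hypothetical sample values; the real column may use a different layout
toy = pd.DataFrame({"Time": ["17:02:00", "01:06:00", "23:59:59"]})

# An explicit format avoids per-row format guessing
toy["Time"] = pd.to_datetime(toy["Time"], format="%H:%M:%S")

# .dt.hour returns the hour as an integer from 0 to 23
toy["Hour_of_Day"] = toy["Time"].dt.hour
print(toy["Hour_of_Day"].tolist())
```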

# Encode Target Column

In [None]:
# import labelencoder from sklearn.preprocessing
from sklearn.preprocessing import LabelEncoder

# create labelencoder object
lb = LabelEncoder()
new_df['Accident_severity'] = lb.fit_transform(new_df['Accident_severity'])

new_df.head()
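
`LabelEncoder` assigns integer codes in sorted label order, which is why the prediction cells later treat 2 as slight, 1 as serious, and 0 as fatal. A small sketch of that mapping, with label strings assumed for illustration (the exact spelling in the dataset may differ):

```python
from sklearn.preprocessing import LabelEncoder

# Label strings assumed for illustration
labels = ["Slight Injury", "Serious Injury", "Fatal Injury", "Slight Injury"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ is sorted, so the position of each label is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns predicted codes back into readable labels
decoded = encoder.inverse_transform([2, 0]).tolist()
print(decoded)
```

Keeping the fitted encoder around (here `encoder`, `lb` in the notebook) makes it possible to decode predictions without hard-coding the numbers.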

# Balance Dataset

In [None]:
new_df['Accident_severity'].value_counts()

In [None]:
from imblearn.over_sampling import RandomOverSampler

# Define X and y
X = new_df.drop(columns=['Accident_severity'])
y = new_df['Accident_severity']

# Resample the dataset
oversampler = RandomOverSampler(random_state=42)
X_resampled, y_resampled = oversampler.fit_resample(X, y)


In [None]:
y_resampled.value_counts(), X_resampled.shape

# Train/Test Split

In [None]:
# train/test/split
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X_resampled,
                                                 y_resampled,
                                                 test_size=0.2,
                                                random_state=42)
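
Because the oversampler runs before the split in this notebook, copies of the same row can land in both the training and the test set, which may make test scores look better than they are. The sketch below shows the alternative order on made-up data: split first, then oversample only the training part. It uses `sklearn.utils.resample` as a stand-in for `RandomOverSampler` so that it needs only pandas and scikit-learn; the names `toy`, `train`, `test`, and `balanced` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Made-up imbalanced data: 16 rows of class 2, 4 rows of class 1
toy = pd.DataFrame({"feature": range(20), "target": [2] * 16 + [1] * 4})

# Split first, keeping class proportions in both parts
train, test = train_test_split(
    toy, test_size=0.25, stratify=toy["target"], random_state=42
)

# Oversample each class in the training part up to the largest class size
largest = train["target"].value_counts().max()
balanced = pd.concat([
    resample(group, replace=True, n_samples=largest, random_state=42)
    for _, group in train.groupby("target")
])

print(balanced["target"].value_counts().to_dict())
# Rows shared between the balanced training data and the test set (should be none)
print(set(balanced["feature"]) & set(test["feature"]))
```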

# Fill Missing Values

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Define the strategies for each column
strategies = {
    3: 'most_frequent',   # Educational_level
    4: 'most_frequent',   # Vehicle_driver_relation
    5: 'most_frequent',   # Driving_experience
    6: 'most_frequent',   # Type_of_vehicle
    8: 'constant',        # Service_year_of_vehicle
    9: 'constant',        # Defect_of_vehicle
    10: 'most_frequent',  # Area_accident_occured
    11: 'most_frequent',  # Lanes_or_Medians
    12: 'most_frequent',  # Road_allignment
    13: 'most_frequent',  # Types_of_Junction
    14: 'most_frequent',  # Road_surface_type
    18: 'most_frequent',  # Type_of_collision
    21: 'most_frequent',  # Vehicle_movement
    26: 'most_frequent',  # Work_of_casuality
    27: 'most_frequent'   # Fitness_of_casuality
}

# Create a ColumnTransformer for data preprocessing
tf1 = ColumnTransformer([
    ('impute_educational_level', SimpleImputer(strategy=strategies[3]), [3]),
    ('impute_Vehicle_driver_relation', SimpleImputer(strategy=strategies[4]), [4]),
    ('impute_Driving_experience', SimpleImputer(strategy=strategies[5]), [5]),
    ('impute_Type_of_vehicle', SimpleImputer(strategy=strategies[6]), [6]),
    ('impute_Service_year_of_vehicle', SimpleImputer(strategy=strategies[8], fill_value='Unknown'), [8]),
    ('impute_Defect_of_vehicle', SimpleImputer(strategy=strategies[9], fill_value='Unknown'), [9]),
    ('impute_Area_accident_occured', SimpleImputer(strategy=strategies[10]), [10]),
    ('impute_Lanes_or_Medians', SimpleImputer(strategy=strategies[11]), [11]),
    ('impute_Road_allignment', SimpleImputer(strategy=strategies[12]), [12]),
    ('impute_Types_of_Junction', SimpleImputer(strategy=strategies[13]), [13]),
    ('impute_Road_surface_type', SimpleImputer(strategy=strategies[14]), [14]),
    ('impute_Type_of_collision', SimpleImputer(strategy=strategies[18]), [18]),
    ('impute_Vehicle_movement', SimpleImputer(strategy=strategies[21]), [21]),
    ('impute_Work_of_casuality', SimpleImputer(strategy=strategies[26]), [26]),
    ('impute_Fitness_of_casuality', SimpleImputer(strategy=strategies[27]), [27])
], remainder='passthrough')
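
One detail worth knowing about `remainder='passthrough'`: `ColumnTransformer` places the outputs of its transformers first and appends the untouched columns after them, so column positions after this step differ from the original ones. In this notebook the next step encodes every column, so the shift does not change which columns get encoded. A minimal sketch on a made-up three-column array:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Three made-up columns; only column 1 has a missing value
X = np.array([
    ["a", "x", 1],
    ["b", np.nan, 2],
    ["c", "x", 3],
], dtype=object)

ct = ColumnTransformer(
    [("impute_col1", SimpleImputer(strategy="most_frequent"), [1])],
    remainder="passthrough",
)
out = ct.fit_transform(X)

# The imputed column now comes first, followed by original columns 0 and 2
print(out.tolist())
```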


# Encode Categorical Columns

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Indices of the columns to encode. The list covers all 31 columns (0 to 30),
# so the numeric columns are one-hot encoded along with the object columns.
object_columns_indices = list(range(31))

# Create a ColumnTransformer that one-hot encodes each listed column.
# Note: scikit-learn >= 1.2 names this argument sparse_output; older releases use sparse.
tf2 = ColumnTransformer([
    (f'ohe_{col}', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), [col])
    for col in object_columns_indices
], remainder='passthrough')

# Fit on the training data alone to check the encoded shape (the pipeline refits this step later)
X_train_encoded = tf2.fit_transform(X_train)
X_train_encoded.shape
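
The `handle_unknown='ignore'` setting matters at prediction time: a category that was not present during fitting is encoded as a row of zeros instead of raising an error. A short sketch with made-up values for a single column (it uses the encoder's default sparse output and converts it for printing):

```python
from sklearn.preprocessing import OneHotEncoder

# Made-up training and inference values for a single column
train_col = [["Dry"], ["Wet"], ["Dry"]]
new_col = [["Wet"], ["Snow"]]  # "Snow" was never seen during fit

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train_col)

# The default output is a sparse matrix; toarray() makes it easy to print
encoded = enc.transform(new_col).toarray()
print(enc.categories_[0].tolist())
print(encoded.tolist())
```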

# Scaling

In [None]:
# # Scaling
# from sklearn.preprocessing import MinMaxScaler

# tf3 = ColumnTransformer([
#     ('scale',MinMaxScaler(),slice(# give proper slicing))
# ])

# Feature selection using the ‘Chi2’ Statistic

chi2 is one of the scoring functions available for feature selection in scikit-learn. It computes the chi-squared statistic between each feature and the target variable (here, accident severity) to estimate how relevant each feature is. It requires non-negative feature values, such as one-hot encoded columns, and is commonly used when the target variable is categorical.

In [None]:
# feature selection method using chi2 for categorical output, categorical input
from sklearn.feature_selection import SelectKBest, chi2

tf4 = SelectKBest(chi2, k=50)
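
To see what `SelectKBest(chi2, ...)` does, it helps to run it on a tiny made-up example where one binary feature follows the label exactly and the other is unrelated to it. The arrays below are hypothetical, and `k=1` is chosen only to keep the output small.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Column 0 equals the label; column 1 is split evenly within each class
X = np.array([[1, 1], [1, 0], [1, 1], [1, 0],
              [0, 1], [0, 0], [0, 1], [0, 0]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

selector = SelectKBest(chi2, k=1)
selector.fit(X, y)

# Higher scores mean a stronger association with the target
print(selector.scores_.tolist())
# Boolean mask of the columns that were kept
print(selector.get_support().tolist())
```

The informative column gets a positive score and is kept, while the unrelated column scores zero and is dropped.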

# Model (Random Forest Classifier)

In [None]:
# import the necessary libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, f1_score


tf5 = RandomForestClassifier()

# Create Pipeline

In [None]:
# Construct the pipeline
from sklearn.pipeline import Pipeline
 
pipe = Pipeline([
    ('trf1', tf1),
    ('trf2', tf2),
    ('trf4', tf4),
    ('trf5', tf5)
])

# Train the pipeline
pipe.fit(X_train, y_train)

In [None]:
# Predict
y_pred = pipe.predict(X_test)
y_pred

# Explore the pipeline

In [None]:
# List the steps of the fitted pipeline by name
pipe.named_steps

# Accuracy Score

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

# Classification Report

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))
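
The classification report prints F1 per class. When a single number is needed, `f1_score` can return the per-class values or an average of them. A sketch on made-up labels (the arrays `y_true` and `y_hat` are hypothetical, and the macro average is one reasonable choice for imbalanced classes, not the only one):

```python
from sklearn.metrics import f1_score

# Made-up labels for illustration (0 = fatal, 1 = serious, 2 = slight)
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_hat = [0, 1, 1, 1, 2, 2, 2, 0]

# One F1 value per class, in label order
per_class = f1_score(y_true, y_hat, average=None)
# Macro average: unweighted mean of the per-class values
macro = f1_score(y_true, y_hat, average="macro")

print(per_class.round(3).tolist(), round(macro, 3))
```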

# Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,y_pred)

# Save pipe

In [None]:
import pickle

# Save the fitted pipeline; the with block closes the file afterwards
with open("10 pipe.pkl", "wb") as f:
    pickle.dump(pipe, f)

# Use Pipe For Website (New user inputs)

# Prediction System

In [None]:
# Load the saved pipeline; the with block closes the file afterwards
with open("10 pipe.pkl", "rb") as f:
    pipe = pickle.load(f)

In [None]:
import numpy as np

# Record taken from row 10 of X_test
print("Prediction :",pipe.predict(np.array(['Thursday', '31-50', 'Male', 'Junior high school', 'Owner', 'Unknown', 'Long lorry', 'Owner', 
                       'Unknown', 'Unknown', 'Other', 'Two-way (divided with solid lines road marking)',
                       'Tangent road with flat terrain','Unknown', 'Unknown', 'Dry', 'Daylight', 'Normal', 
                       'Collision with animals', 2, 1, 'Going straight', 'Driver or rider','Male', '18-30', 3, 'Driver',
                       'Normal', 'Not a Pedestrian', 'Changing lane to the left', 12,],dtype=object).reshape(1,-1)))
print("Actual :",y_test.iloc[10])

In [None]:
def pred(Day_of_week, Age_band_of_driver, Sex_of_driver, Educational_level, Vehicle_driver_relation,
         Driving_experience, Type_of_vehicle, Owner_of_vehicle, Service_year_of_vehicle,
         Defect_of_vehicle, Area_accident_occured, Lanes_or_Medians, Road_allignment,
         Types_of_Junction, Road_surface_type, Road_surface_conditions, Light_conditions,
         Weather_conditions, Type_of_collision, Number_of_vehicles_involved,
         Number_of_casualties, Vehicle_movement, Casualty_class, Sex_of_casualty,
         Age_band_of_casualty, Casualty_severity, Work_of_casuality, Fitness_of_casuality,
         Pedestrian_movement, Cause_of_accident, Hour_of_Day):
    
    # Collect the inputs in the same column order used for training
    values = [Day_of_week, Age_band_of_driver, Sex_of_driver, Educational_level, Vehicle_driver_relation,
         Driving_experience, Type_of_vehicle, Owner_of_vehicle, Service_year_of_vehicle,
         Defect_of_vehicle, Area_accident_occured, Lanes_or_Medians, Road_allignment,
         Types_of_Junction, Road_surface_type, Road_surface_conditions, Light_conditions,
         Weather_conditions, Type_of_collision, Number_of_vehicles_involved,
         Number_of_casualties, Vehicle_movement, Casualty_class, Sex_of_casualty,
         Age_band_of_casualty, Casualty_severity, Work_of_casuality, Fitness_of_casuality,
         Pedestrian_movement, Cause_of_accident, Hour_of_Day]

    # Build a single-row object array so numeric inputs stay numeric.
    # None is mapped to np.nan, the marker SimpleImputer treats as missing by default.
    features = np.array([[np.nan if v is None else v for v in values]], dtype=object)
    
    results = pipe.predict(features)
    return results
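
When a row mixes strings and numbers, the dtype of the NumPy array decides whether the numbers survive. A short sketch with a made-up three-value row:

```python
import numpy as np

row = ["Thursday", 2, 12]

# Without dtype=object NumPy picks a common string type, so 2 becomes "2"
as_default = np.array([row])
# With dtype=object each value keeps its original Python type
as_object = np.array([row], dtype=object)

print(type(as_default[0, 1]).__name__, type(as_object[0, 1]).__name__)
```

A numeric value that has been turned into a string no longer matches the numeric categories learned during fitting, so the encoder would treat it as unknown.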


In [None]:
predicted_class = pred(Day_of_week="Thursday", 
                                Age_band_of_driver='31-50',
                                Sex_of_driver='Male',
                                Educational_level='Junior high school',
                                Vehicle_driver_relation='Owner',
                                Driving_experience=None,
                                Type_of_vehicle='Long lorry',
                                Owner_of_vehicle='Owner',
                                Service_year_of_vehicle='Unknown',
                                Defect_of_vehicle=None,
                                Area_accident_occured='Other',
                                Lanes_or_Medians='Two-way (divided with solid lines road marking)',
                                Road_allignment='Tangent road with flat terrain',
                                Types_of_Junction=None,
                                Road_surface_type=None,
                                Road_surface_conditions='Dry',
                                Light_conditions='Daylight',
                                Weather_conditions='Normal',
                                Type_of_collision='Collision with animals',
                                Number_of_vehicles_involved=2,
                                Number_of_casualties=1,
                                Vehicle_movement='Going straight',
                                Casualty_class='Driver or rider',
                                Sex_of_casualty='Male',
                                Age_band_of_casualty='18-30',
                                Casualty_severity=3,
                                Work_of_casuality='Driver',
                                Fitness_of_casuality='Normal',
                                Pedestrian_movement='Not a Pedestrian',
                                Cause_of_accident='Changing lane to the left',
                                Hour_of_Day=12)

if predicted_class[0] == 2:
    print("Slight Injury")
elif predicted_class[0] == 1:
    print("Serious Injury")
else:
    print("Fatal Injury")

In [None]:
# test 2
predicted_class = pred(Day_of_week="Friday", 
                       Age_band_of_driver='31-50',
                       Sex_of_driver='Male',
                       Educational_level='Elementary school',
                       Vehicle_driver_relation='Employee',
                       Driving_experience='1-2yr',
                       Type_of_vehicle='Lorry (41?100Q)',
                       Owner_of_vehicle='Owner',
                       Service_year_of_vehicle=None,
                       Defect_of_vehicle='No defect',
                       Area_accident_occured='Office areas',
                       Lanes_or_Medians='Two-way (divided with broken lines road marking)',
                       Road_allignment='Tangent road with flat terrain',
                       Types_of_Junction='Y Shape',
                       Road_surface_type='Asphalt roads',
                       Road_surface_conditions='Dry',
                       Light_conditions='Daylight',
                       Weather_conditions='Normal',
                       Type_of_collision='Vehicle with vehicle collision',
                       Number_of_vehicles_involved=2,
                       Number_of_casualties=2,
                       Vehicle_movement='Going straight',
                       Casualty_class='na',
                       Sex_of_casualty='na',
                       Age_band_of_casualty='na',
                       Casualty_severity='na',
                       Work_of_casuality='Driver',
                       Fitness_of_casuality='Normal',
                       Pedestrian_movement='Not a Pedestrian',
                       Cause_of_accident='Changing lane to the left',
                       Hour_of_Day=1)

if predicted_class[0] == 2:
    print("Slight Injury")
elif predicted_class[0] == 1:
    print("Serious Injury")
else:
    print("Fatal Injury")
