# Traffic Accident Injury Prediction
This project aims to predict the total number of injuries in a traffic accident using machine learning models. The dataset, sourced from Kaggle, contains detailed information on traffic accidents across various regions and time periods. It includes accident-related features such as weather conditions, lighting conditions, crash types, and vehicle involvement.

The data preprocessing pipeline includes:
* Handling missing values (imputation).
* Standardizing numerical features.
* Encoding categorical variables using one-hot encoding.

To find the best model, several regression models were trained, including:
* Linear Regression
* Decision Tree Regressor
* Random Forest Regressor

The models are evaluated using Root Mean Squared Error (RMSE), and cross-validation is applied to assess their generalization performance. The final trained model can be used to estimate the expected number of injuries in future traffic accidents.
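
For reference, RMSE has the standard definition below. The symbols are introduced here for illustration and do not appear in the notebook: $n$ is the number of accidents, $y_i$ the observed number of injuries, and $\hat{y}_i$ the model's prediction.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

Lower values are better, and an RMSE of 0 means every prediction matches the observed value exactly.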

The data comes from Oktay Ördekçi's Kaggle dataset:
https://www.kaggle.com/datasets/oktayrdeki/traffic-accidents

In [2]:
# Loading data
import pandas as pd
from pathlib import Path

def loadData():
    return pd.read_csv(Path('./data/traffic_accidents.csv'))

accidentsData = loadData()

In [3]:
# A quick overview of the data
accidentsData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209306 entries, 0 to 209305
Data columns (total 24 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   crash_date                     209306 non-null  object 
 1   traffic_control_device         209306 non-null  object 
 2   weather_condition              209306 non-null  object 
 3   lighting_condition             209306 non-null  object 
 4   first_crash_type               209306 non-null  object 
 5   trafficway_type                209306 non-null  object 
 6   alignment                      209306 non-null  object 
 7   roadway_surface_cond           209306 non-null  object 
 8   road_defect                    209306 non-null  object 
 9   crash_type                     209306 non-null  object 
 10  intersection_related_i         209306 non-null  object 
 11  damage                         209306 non-null  object 
 12  prim_contributory_cause       

In [4]:
# Date transformation
accidentsData['crash_date'] = pd.to_datetime(accidentsData['crash_date'], errors='coerce')
accidentsData['crash_year'] = accidentsData['crash_date'].dt.year
accidentsData['crash_month'] = accidentsData['crash_date'].dt.month
accidentsData['crash_day_of_week'] = accidentsData['crash_date'].dt.dayofweek

accidentsData = accidentsData.drop(columns=['crash_date'])




## Data Preprocessing Pipelines

To handle different types of data effectively, we create separate preprocessing pipelines for **numerical** and **categorical** features.

### 1. **Numerical Pipeline**
The numerical pipeline performs the following steps:
- **Imputation**: Replaces missing values with the **median** of the column.
- **Scaling**: Standardizes numerical features using **StandardScaler**, which ensures that each feature has a mean of 0 and a standard deviation of 1.

### 2. **Categorical Pipeline**
The categorical pipeline processes categorical data using:

- **Imputation**: Fills missing values with the most frequent category.
- **One-Hot Encoding**: Converts categorical variables into numerical format using **OneHotEncoder**, ignoring unknown categories.

In [5]:
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
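
To make the behaviour of the two pipelines concrete, here is a minimal sketch on a made-up four-row frame. The column names `speed` and `weather` and all values are hypothetical and are not taken from the accidents dataset. The sketch shows median imputation followed by scaling, most-frequent imputation followed by one-hot encoding, and what `handle_unknown='ignore'` does with a category that was not seen during fitting.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; "speed" and "weather" are not columns of the accidents dataset
train = pd.DataFrame({
    "speed": [30.0, np.nan, 50.0, 40.0],
    "weather": pd.Series(["CLEAR", "RAIN", "CLEAR", np.nan], dtype=object),
})

num_demo = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
cat_demo = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# The missing speed becomes the median (40.0) before scaling
num_out = num_demo.fit_transform(train[["speed"]])

# The missing weather becomes "CLEAR", the most frequent category
cat_out = cat_demo.fit_transform(train[["weather"]]).toarray()

# A category never seen during fit is encoded as an all-zero row
unseen = pd.DataFrame({"weather": pd.Series(["SNOW"], dtype=object)})
unseen_out = cat_demo.transform(unseen).toarray()

print(num_out.ravel())
print(cat_out)
print(unseen_out)
```

The all-zero row is the practical meaning of "ignoring unknown categories": the pipeline does not fail on new category values at prediction time, but it also carries no information about them.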

## Feature Transformation with `make_column_transformer`

To apply different preprocessing steps to **numerical** and **categorical** features, we use `make_column_transformer`. Combined with `make_column_selector`, it lets us select and transform columns automatically based on their data type.

### **Building the Preprocessing Pipeline**
We use `make_column_transformer` to combine:
- The **numerical pipeline** (`num_pipeline`) for columns with numerical data.
- The **categorical pipeline** (`cat_pipeline`) for columns with categorical data.

### How It Works:
* `make_column_selector(dtype_include=np.number)`: Automatically selects numerical columns.
* `make_column_selector(dtype_include=object)`: Automatically selects categorical columns.
* The preprocessing pipeline ensures that all features are properly processed before feeding them into a machine learning model.

This automated feature transformation simplifies data preprocessing, making it more flexible and scalable.

In [6]:
from sklearn.compose import make_column_selector, make_column_transformer
import numpy as np
preprocessing = make_column_transformer(
    (num_pipeline, make_column_selector(dtype_include=np.number)),
    (cat_pipeline, make_column_selector(dtype_include=object))
)

In [7]:
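# Apply the preprocessing pipeline to the dataset.
# This transforms numerical features (imputation + scaling) and one-hot encodes categorical features.
# The output is a sparse matrix; it is converted to a dense array in the next cell.
# Note: the target column injuries_total is numeric, so it is scaled here along with the features.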
accidentsData_prepared = preprocessing.fit_transform(accidentsData)

In [8]:
# Wrapping accidentsData_prepared in a Pandas DataFrame
accidentsData_prepared = pd.DataFrame(
    accidentsData_prepared.toarray(),
    columns=preprocessing.get_feature_names_out(),
    index=accidentsData.index
)
accidentsData_prepared.head(2)

Unnamed: 0,pipeline-1__num_units,pipeline-1__injuries_total,pipeline-1__injuries_fatal,pipeline-1__injuries_incapacitating,pipeline-1__injuries_non_incapacitating,pipeline-1__injuries_reported_not_evident,pipeline-1__injuries_no_indication,pipeline-1__crash_hour,pipeline-1__crash_day_of_week,pipeline-1__crash_month,...,pipeline-2__prim_contributory_cause_TURNING RIGHT ON RED,pipeline-2__prim_contributory_cause_UNABLE TO DETERMINE,pipeline-2__prim_contributory_cause_UNDER THE INFLUENCE OF ALCOHOL/DRUGS (USE WHEN ARREST IS EFFECTED),"pipeline-2__prim_contributory_cause_VISION OBSCURED (SIGNS, TREE LIMBS, BUILDINGS, ETC.)",pipeline-2__prim_contributory_cause_WEATHER,pipeline-2__most_severe_injury_FATAL,pipeline-2__most_severe_injury_INCAPACITATING INJURY,pipeline-2__most_severe_injury_NO INDICATION OF INJURY,pipeline-2__most_severe_injury_NONINCAPACITATING INJURY,"pipeline-2__most_severe_injury_REPORTED, NOT EVIDENT"
0,-0.159843,-0.478565,-0.039126,-0.162855,-0.359765,-0.269518,0.6091,-0.06657,1.037877,0.066571,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,-0.159843,-0.478565,-0.039126,-0.162855,-0.359765,-0.269518,-0.19659,-2.386418,1.553809,0.358322,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [10]:

from sklearn.model_selection import train_test_split
# Split the dataset into training and testing sets.
# 80% of the data is used for training, and 20% for testing.
# The random_state is set to ensure reproducibility of the split.
train_set, test_set = train_test_split(accidentsData_prepared, test_size=0.2, random_state=42)

# Split the training data into features (accidentsTrain) and labels (accidentsTrain_labels).
# 'pipeline-1__injuries_total' is the target variable (number of injuries, standardized by the preprocessing step).
# We drop the target variable from the features and store it separately as the label.
accidentsTrain = train_set.drop("pipeline-1__injuries_total", axis=1)  # Features without the target variable
accidentsTrain_labels = train_set['pipeline-1__injuries_total'].copy()   # The target variable (number of injuries)
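
Note that the split above is applied to data that has already been preprocessed, so the imputer and scaler statistics were computed from all rows, including the future test rows. A common alternative, sketched below, is to separate the target and split the raw rows first, then let the pipeline fit the preprocessing on the training rows only. The data here is synthetic, and names such as `raw`, `X_train`, and `model` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the raw accidents frame (all values are made up)
rng = np.random.default_rng(42)
raw = pd.DataFrame({
    "num_units": rng.integers(1, 5, size=200),
    "weather_condition": pd.Series(
        rng.choice(["CLEAR", "RAIN", "SNOW"], size=200), dtype=object
    ),
})
raw["injuries_total"] = rng.poisson(0.2 * raw["num_units"].to_numpy())

# 1. Separate the target from the features, 2. split the raw rows
X = raw.drop(columns=["injuries_total"])
y = raw["injuries_total"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. The preprocessing is fitted inside the pipeline, on training rows only
model = make_pipeline(
    make_column_transformer(
        (StandardScaler(), make_column_selector(dtype_include=np.number)),
        (OneHotEncoder(handle_unknown="ignore"),
         make_column_selector(dtype_include=object)),
    ),
    LinearRegression(),
)
model.fit(X_train, y_train)

# The held-out rows are only transformed, never used for fitting
test_predictions = model.predict(X_test)
print(test_predictions[:5])
```

With this ordering the target stays in its original units (number of injuries), and the held-out rows give an estimate of performance on unseen data.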


## Training and Comparing Machine Learning Models

In this section, we train multiple machine learning models to predict the total number of injuries in traffic accidents and compare their performance.

### 1. **Linear Regression**

Linear Regression is a simple baseline model for predicting continuous variables. It is trained on the training set, and its predictions on that same set are evaluated using Root Mean Squared Error (RMSE).

In [11]:
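from sklearn.linear_model import LinearRegression
# accidentsTrain is already numeric, so `preprocessing` only re-imputes and re-scales it here
lin_reg = make_pipeline(preprocessing, LinearRegression())
lin_reg.fit(accidentsTrain, accidentsTrain_labels)
# Predict on the training set itself (this measures fit, not generalization)
predictions = lin_reg.predict(accidentsTrain)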

In [12]:
# Import the root mean squared error (RMSE) function from sklearn.metrics
from sklearn.metrics import root_mean_squared_error

# Calculate the RMSE between the true labels (accidentsTrain_labels) and the predictions made by the model
# RMSE gives an idea of how far off the model's predictions are from the actual values
lin_rmse = root_mean_squared_error(accidentsTrain_labels, predictions)

# Output the RMSE value for the linear regression model
lin_rmse

1.3256252688793172e-13
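
An RMSE of about 1.3e-13 is zero up to floating-point error, which usually points to target leakage and not to a genuinely perfect model. The feature set still contains the per-severity injury counts (`injuries_fatal`, `injuries_incapacitating`, and so on), and if these add up to `injuries_total`, a linear model can reconstruct the target exactly. The sketch below reproduces that effect on synthetic data where the sum holds by construction. Whether it holds exactly in the real dataset is an assumption worth checking, and the helper `train_rmse` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: the total is defined as the sum of its components (assumption)
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "injuries_fatal": rng.poisson(0.01, n),
    "injuries_incapacitating": rng.poisson(0.05, n),
    "injuries_non_incapacitating": rng.poisson(0.2, n),
    "injuries_reported_not_evident": rng.poisson(0.1, n),
    "num_units": rng.integers(1, 5, n),
})
component_cols = [
    "injuries_fatal",
    "injuries_incapacitating",
    "injuries_non_incapacitating",
    "injuries_reported_not_evident",
]
demo["injuries_total"] = demo[component_cols].sum(axis=1)


def train_rmse(feature_cols):
    """Fit a linear regression on the given columns and return its training RMSE."""
    model = LinearRegression().fit(demo[feature_cols], demo["injuries_total"])
    errors = demo["injuries_total"] - model.predict(demo[feature_cols])
    return float(np.sqrt(np.mean(errors ** 2)))


leaky_rmse = train_rmse(component_cols + ["num_units"])  # components included
clean_rmse = train_rmse(["num_units"])                   # components removed

print(f"with component columns:    {leaky_rmse:.2e}")
print(f"without component columns: {clean_rmse:.2e}")
```

If the same pattern shows up on the real data, dropping the injury-count columns (and possibly `most_severe_injury`) from the features would give a more honest picture of how well injuries can be predicted from the crash conditions alone.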

### 2. Decision Tree Regressor
Decision Tree Regressor is a non-linear model that works well for capturing complex relationships between features. We also evaluate its performance using RMSE.

In [13]:
from sklearn.tree import DecisionTreeRegressor

# Initialize the decision tree regressor model
tree_reg = make_pipeline(preprocessing, DecisionTreeRegressor(random_state=42))

# Train the model on the training data
tree_reg.fit(accidentsTrain, accidentsTrain_labels)

# Make predictions and calculate RMSE
predictions = tree_reg.predict(accidentsTrain)
tree_rmse = root_mean_squared_error(accidentsTrain_labels, predictions)

In [None]:
# Import cross_val_score for cross-validation evaluation
from sklearn.model_selection import cross_val_score

# Perform 10-fold cross-validation on the decision tree regressor model.
# scikit-learn scorers follow a "higher is better" convention, so the
# "neg_root_mean_squared_error" scorer returns the negative RMSE.
# Negating the result recovers the ordinary (positive) RMSE for each fold.
tree_rmses = -cross_val_score(tree_reg, accidentsTrain, accidentsTrain_labels,
                              scoring="neg_root_mean_squared_error", cv=10)

# Convert the RMSE values to a pandas Series and display statistical summary (mean, std, etc.)
# This gives us an overview of how the model performs across different folds of the cross-validation.
pd.Series(tree_rmses).describe()


count    10.000000
mean      0.058139
std       0.018787
min       0.037425
25%       0.044544
50%       0.053367
75%       0.069773
max       0.089090
dtype: float64
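
Cross-validation matters most for models that can memorize their training data. The following sketch, on synthetic data unrelated to the accidents dataset, shows the typical pattern for an unconstrained decision tree: the training RMSE is zero, while the cross-validated RMSE is clearly larger. The variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: one informative feature plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = X[:, 0] + rng.normal(scale=0.5, size=300)

tree = DecisionTreeRegressor(random_state=42)

# RMSE on the same rows the tree was trained on
tree.fit(X, y)
train_rmse = float(np.sqrt(np.mean((y - tree.predict(X)) ** 2)))

# RMSE on held-out folds (10-fold cross-validation)
cv_rmses = -cross_val_score(
    tree, X, y, scoring="neg_root_mean_squared_error", cv=10
)

print(f"training RMSE: {train_rmse:.4f}")
print(f"mean CV RMSE:  {cv_rmses.mean():.4f}")
```

The gap between the two numbers is a rough measure of overfitting, which is why the cross-validated scores are the more informative ones when comparing models.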

### 3. Random Forest Regressor
Random Forest Regressor is an ensemble method that combines multiple decision trees. Averaging many trees usually reduces variance, so a random forest tends to generalize better than a single decision tree and is less prone to overfitting.

In [18]:
from sklearn.ensemble import RandomForestRegressor

# Initialize the random forest regressor model
full_pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("random_forest", RandomForestRegressor(random_state=42)),
])

# Train the model on the training data
full_pipeline.fit(accidentsTrain, accidentsTrain_labels)

# Make predictions with the trained random forest model
predictions_rf = full_pipeline.predict(accidentsTrain)

# Calculate the RMSE on the training set
rmse_rf = root_mean_squared_error(accidentsTrain_labels, predictions_rf)

# Output the RMSE for the Random Forest model
rmse_rf


0.01791849140226002
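
Because the target column was standardized together with the other numeric features, the RMSE values reported here are in units of standard deviations of `injuries_total`, not in injuries. Multiplying by the scaler's standard deviation converts them back. The sketch below checks this relationship on made-up numbers; `y`, `pred`, and `scaler` are illustrative names and not objects from the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up injury counts and predictions in original units
y = np.array([0.0, 0.0, 1.0, 3.0, 0.0, 2.0])
pred = np.array([0.0, 1.0, 1.0, 2.0, 0.0, 2.0])

# Standardize both with a scaler fitted on the true values
scaler = StandardScaler()
y_scaled = scaler.fit_transform(y.reshape(-1, 1)).ravel()
pred_scaled = scaler.transform(pred.reshape(-1, 1)).ravel()

rmse_scaled = float(np.sqrt(np.mean((y_scaled - pred_scaled) ** 2)))
rmse_original = float(np.sqrt(np.mean((y - pred) ** 2)))

# Scaling divides every error by the same constant, so RMSE scales with it
rmse_converted = rmse_scaled * float(scaler.scale_[0])

print(rmse_scaled, rmse_original, rmse_converted)
```

In the notebook, the corresponding standard deviation could be read from the fitted `StandardScaler` inside `preprocessing`, at the position of the `injuries_total` column.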

## Saving the Model

Once the model is trained and evaluated, it's important to save it for later use. This allows you to load the model without retraining it, making predictions on new data quickly.

### 1. **Saving the Model Using `joblib`**

In this step, we save the trained model pipeline, which includes both the preprocessing steps and the Random Forest model. The model is stored in a `.pkl` file, which can be loaded later for inference or further evaluation.


In [20]:
import joblib

# Save the entire model pipeline (preprocessing + Random Forest)
joblib.dump(full_pipeline, "Traffic_Accident_Injury_Predictor_Model.pkl")

['Traffic_Accident_Injury_Predictor_Model.pkl']
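
Loading works the same way in reverse with `joblib.load`. The sketch below is self-contained: it trains a small stand-in pipeline on synthetic data, saves it to a temporary directory, reloads it, and checks that the restored pipeline gives the same predictions. The file name `demo_model.pkl` and the stand-in pipeline are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in pipeline trained on synthetic data
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 2))
y = X[:, 0] * 2.0

pipeline = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=10, random_state=42),
)
pipeline.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_model.pkl"  # hypothetical file name
    joblib.dump(pipeline, model_path)

    # Later, or in another process: restore the fitted pipeline
    restored = joblib.load(model_path)

same_predictions = bool(np.allclose(pipeline.predict(X), restored.predict(X)))
print(same_predictions)
```

For the model saved above, the equivalent call would be `joblib.load("Traffic_Accident_Injury_Predictor_Model.pkl")`. Pickled scikit-learn models are generally tied to the library version they were saved with, and pickle files should only be loaded from trusted sources.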