# A Data-Driven and Interpretable Approach to Traffic Crash Severity Prediction 

# Business Understanding
## Business Overview

Road traffic crashes remain a major public safety challenge, leading to injuries, loss of life, and substantial economic costs. While large volumes of traffic crash data are routinely collected, there is a need for analytical approaches that not only predict crash-related outcomes but also clearly explain the factors that contribute to accidents.

This project develops a machine learning framework that uses both interpretable (white-box) models and more complex (black-box) models to predict the primary contributory cause of a traffic crash. The models leverage information about vehicles, people involved, roadway characteristics, traffic controls, and environmental conditions to identify patterns associated with different crash causes.

White-box models are used to provide transparent, rule-based insights that are easily understood by non-technical stakeholders, while black-box models are employed to capture more complex relationships within the data. By comparing these approaches, the project balances predictive performance with explainability.

The resulting insights are intended to support vehicle safety boards, transportation agencies, and city authorities in identifying high-risk crash causes, prioritizing safety interventions, and designing evidence-based policies aimed at reducing preventable traffic accidents.

## Problem Statement
Transportation agencies and safety organizations often rely on descriptive statistics and manual reporting to understand the causes of traffic crashes. While useful, these approaches may fail to capture complex interactions among driver behavior, vehicle characteristics, roadway design, and environmental conditions.

The core problem addressed in this project is the lack of interpretable and accurate predictive tools for identifying the primary contributory cause of traffic crashes. High-performing black-box models can achieve strong predictive accuracy but often lack transparency, making their outputs difficult to trust or justify in policy and safety contexts. Conversely, fully interpretable models may sacrifice predictive performance.

This project seeks to address this challenge by developing and evaluating both white-box and black-box machine learning models to predict crash contributory causes. By comparing model performance and interpretability, the project aims to identify an approach that provides reliable predictions while offering clear explanations that can be communicated to non-technical decision-makers. The ultimate goal is to support transparent, accountable, and data-driven traffic safety interventions.

## Business Objectives
The primary objective of this project is to support traffic safety decision-making by developing an interpretable machine learning model that predicts the primary contributory cause of a traffic crash using information related to vehicles, occupants, roadway characteristics, traffic controls, and environmental conditions.
By identifying the most likely cause of a crash, this project aims to help transportation authorities and safety organizations better understand underlying risk patterns and design targeted strategies to reduce traffic accidents.

### Specific Objectives
1. Predict the Primary Contributory Cause of Crashes:
Develop supervised machine learning models capable of classifying traffic crashes according to their primary contributory cause based on vehicle-level, person-level, roadway, and environmental features.

2. Ensure Model Interpretability:
Apply interpretable modeling techniques and explanation methods to make both global model behavior and individual crash predictions understandable, allowing stakeholders to see how specific factors influence predicted crash causes.

3. Identify High-Impact Risk Factors:
Determine the most influential factors associated with different contributory causes of crashes, including roadway conditions, environmental factors, vehicle characteristics, driver behavior, and temporal patterns.

4. Compare Interpretability and Performance Trade-offs:
Evaluate and compare interpretable (white-box) models with more complex (black-box) models to understand the trade-offs between predictive accuracy and explainability when predicting crash causes.

5. Support Actionable Safety Interventions:
Translate model insights into practical recommendations that can inform infrastructure improvements, targeted enforcement strategies, driver education programs, and policy decisions aimed at reducing preventable crash causes.

6. Promote Transparent and Accountable Decision-Making:
Provide clear, defensible explanations that can be effectively communicated to non-technical stakeholders such as vehicle safety boards and city transportation authorities, supporting responsible and evidence-based traffic safety initiatives.

## Data Understanding

This project uses the Traffic Crashes dataset provided by the City of Chicago and collected under the jurisdiction of the Chicago Police Department (CPD). The data originates from an electronic crash reporting system and contains detailed records of reported traffic crashes.

Each observation represents a single crash event and includes information related to vehicles involved, individuals in the vehicles, roadway characteristics, traffic control devices, and environmental conditions at the time of the crash. The dataset also includes a labeled field identifying the primary contributory cause of each crash, which serves as the target variable for this study.

This project aims to use the dataset to develop both interpretable (white-box) and high-performing (black-box) machine learning models capable of predicting the primary contributory cause of traffic crashes. Beyond prediction, the analysis emphasizes understanding how different factors contribute to specific crash causes, ensuring that model outputs remain transparent, explainable, and actionable for traffic safety stakeholders.

By leveraging this dataset, the project seeks to uncover meaningful patterns that can support data-driven decision-making by vehicle safety boards and city transportation authorities, with the ultimate goal of reducing preventable traffic accidents.

## Data Analysis
### Importing libraries
The Python libraries needed for data cleaning, visualization, and modeling are imported. These include pandas, NumPy, scikit-learn, and others.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import classification_report

### Data loading and Inspection
In this section, the Traffic Crashes dataset is loaded from a CSV file into a pandas DataFrame. Basic inspection functions are applied to confirm successful loading and to preview the structure of the dataset before further analysis.

In [2]:
# Loading the dataset
df1 = pd.read_csv("../Traffic_Crashes.csv")
# display first five rows.
df1.head()


Unnamed: 0,CRASH_RECORD_ID,CRASH_DATE_EST_I,CRASH_DATE,POSTED_SPEED_LIMIT,TRAFFIC_CONTROL_DEVICE,DEVICE_CONDITION,WEATHER_CONDITION,LIGHTING_CONDITION,FIRST_CRASH_TYPE,TRAFFICWAY_TYPE,...,INJURIES_NON_INCAPACITATING,INJURIES_REPORTED_NOT_EVIDENT,INJURIES_NO_INDICATION,INJURIES_UNKNOWN,CRASH_HOUR,CRASH_DAY_OF_WEEK,CRASH_MONTH,LATITUDE,LONGITUDE,LOCATION
0,97f1975e8f3e9a1b53ae1abfb6982a374074d8649d9e97...,,1/28/2026 22:56,30,NO CONTROLS,NO CONTROLS,SNOW,"DARKNESS, LIGHTED ROAD",FIXED OBJECT,NOT DIVIDED,...,0.0,0.0,1.0,0.0,22,4,1,41.713829,-87.551093,POINT (-87.551093105845 41.713829100033)
1,1a00190102664f10ee5c2ee8767d45c331991692f12dfc...,,1/28/2026 22:25,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,"DARKNESS, LIGHTED ROAD",REAR END,NOT DIVIDED,...,0.0,0.0,2.0,0.0,22,4,1,41.796711,-87.755202,POINT (-87.755202215729 41.796710893317)
2,a4fc7133c8193ec53288a9acec055321dee47515621012...,Y,1/28/2026 22:10,30,OTHER,OTHER,OTHER,UNKNOWN,PARKED MOTOR VEHICLE,OTHER,...,0.0,0.0,2.0,0.0,22,4,1,41.813005,-87.603823,POINT (-87.603822899265 41.813004951227)
3,e79f2db27a528710d42b2eb1991876b7a9bf029aee3685...,,1/28/2026 22:10,30,STOP SIGN/FLASHER,FUNCTIONING PROPERLY,CLEAR,"DARKNESS, LIGHTED ROAD",REAR END,FOUR WAY,...,0.0,0.0,3.0,0.0,22,4,1,41.868335,-87.705668,POINT (-87.705668192505 41.868335288795)
4,48040347f534c316e38421a60b65ab7017ae47cb4a0c3c...,,1/28/2026 22:05,30,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",PARKED MOTOR VEHICLE,NOT DIVIDED,...,0.0,0.0,2.0,0.0,22,4,1,41.866618,-87.696128,POINT (-87.696128029764 41.866617682133)


In [3]:
df_cleaned = pd.read_csv("../cleaned_traffic_crashes.csv", low_memory=False)
df_cleaned.head()

Unnamed: 0,CRASH_RECORD_ID,CRASH_DATE_EST_I,CRASH_DATE,POSTED_SPEED_LIMIT,TRAFFIC_CONTROL_DEVICE,DEVICE_CONDITION,WEATHER_CONDITION,LIGHTING_CONDITION,FIRST_CRASH_TYPE,TRAFFICWAY_TYPE,...,INJURIES_NON_INCAPACITATING,INJURIES_REPORTED_NOT_EVIDENT,INJURIES_NO_INDICATION,INJURIES_UNKNOWN,CRASH_HOUR,CRASH_DAY_OF_WEEK,CRASH_MONTH,LATITUDE,LONGITUDE,LOCATION
0,97f1975e8f3e9a1b53ae1abfb6982a374074d8649d9e97...,,01/28/2026 10:56:00 PM,30,NO CONTROLS,NO CONTROLS,SNOW,"DARKNESS, LIGHTED ROAD",FIXED OBJECT,NOT DIVIDED,...,0.0,0.0,1.0,0.0,22,4,1,41.713829,-87.551093,POINT (-87.551093105845 41.713829100033)
1,1a00190102664f10ee5c2ee8767d45c331991692f12dfc...,,01/28/2026 10:25:00 PM,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,"DARKNESS, LIGHTED ROAD",REAR END,NOT DIVIDED,...,0.0,0.0,2.0,0.0,22,4,1,41.796711,-87.755202,POINT (-87.755202215729 41.796710893317)
2,a4fc7133c8193ec53288a9acec055321dee47515621012...,Y,01/28/2026 10:10:00 PM,30,OTHER,OTHER,OTHER,UNKNOWN,PARKED MOTOR VEHICLE,OTHER,...,0.0,0.0,2.0,0.0,22,4,1,41.813005,-87.603823,POINT (-87.603822899265 41.813004951227)
3,e79f2db27a528710d42b2eb1991876b7a9bf029aee3685...,,01/28/2026 10:10:00 PM,30,STOP SIGN/FLASHER,FUNCTIONING PROPERLY,CLEAR,"DARKNESS, LIGHTED ROAD",REAR END,FOUR WAY,...,0.0,0.0,3.0,0.0,22,4,1,41.868335,-87.705668,POINT (-87.705668192505 41.868335288795)
4,48040347f534c316e38421a60b65ab7017ae47cb4a0c3c...,,01/28/2026 10:05:00 PM,30,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",PARKED MOTOR VEHICLE,NOT DIVIDED,...,0.0,0.0,2.0,0.0,22,4,1,41.866618,-87.696128,POINT (-87.696128029764 41.866617682133)


In [4]:
# Checking the columns in the dataset
df_cleaned.columns

Index(['CRASH_RECORD_ID', 'CRASH_DATE_EST_I', 'CRASH_DATE',
       'POSTED_SPEED_LIMIT', 'TRAFFIC_CONTROL_DEVICE', 'DEVICE_CONDITION',
       'WEATHER_CONDITION', 'LIGHTING_CONDITION', 'FIRST_CRASH_TYPE',
       'TRAFFICWAY_TYPE', 'LANE_CNT', 'ALIGNMENT', 'ROADWAY_SURFACE_COND',
       'ROAD_DEFECT', 'REPORT_TYPE', 'CRASH_TYPE', 'INTERSECTION_RELATED_I',
       'NOT_RIGHT_OF_WAY_I', 'HIT_AND_RUN_I', 'DAMAGE', 'DATE_POLICE_NOTIFIED',
       'PRIM_CONTRIBUTORY_CAUSE', 'SEC_CONTRIBUTORY_CAUSE', 'STREET_NO',
       'STREET_DIRECTION', 'STREET_NAME', 'BEAT_OF_OCCURRENCE',
       'PHOTOS_TAKEN_I', 'STATEMENTS_TAKEN_I', 'DOORING_I', 'WORK_ZONE_I',
       'WORK_ZONE_TYPE', 'WORKERS_PRESENT_I', 'NUM_UNITS',
       'MOST_SEVERE_INJURY', 'INJURIES_TOTAL', 'INJURIES_FATAL',
       'INJURIES_INCAPACITATING', 'INJURIES_NON_INCAPACITATING',
       'INJURIES_REPORTED_NOT_EVIDENT', 'INJURIES_NO_INDICATION',
       'INJURIES_UNKNOWN', 'CRASH_HOUR', 'CRASH_DAY_OF_WEEK', 'CRASH_MONTH',
       'LATITUDE', 

In [5]:
# checking the structure of our target variable
df_cleaned['PRIM_CONTRIBUTORY_CAUSE']

0                    UNABLE TO DETERMINE
1                    UNABLE TO DETERMINE
2                    UNABLE TO DETERMINE
3            IMPROPER OVERTAKING/PASSING
4                    IMPROPER LANE USAGE
                       ...              
1024024              UNABLE TO DETERMINE
1024025    FAILING TO YIELD RIGHT-OF-WAY
1024026              UNABLE TO DETERMINE
1024027              UNABLE TO DETERMINE
1024028              IMPROPER LANE USAGE
Name: PRIM_CONTRIBUTORY_CAUSE, Length: 1024029, dtype: object

### Dataset Description

 - Records: 1,024,029 traffic crash events

 - Columns: 48 features, including categorical, numerical, and temporal data

 - Target variable: PRIM_CONTRIBUTORY_CAUSE (primary cause of crash)

 - Categorical features: e.g., weather, lighting, traffic control, crash type

 - Numerical features: e.g., speed limit, number of units, injury counts, crash hour/day/month

 - Geographic info: Latitude, longitude, and location for spatial analysis
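The class balance of the target can be summarised in the same spirit as the description above. The following is a minimal sketch on a toy frame; the helper name `describe_target` and the toy rows are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd


def describe_target(frame: pd.DataFrame, target: str) -> pd.DataFrame:
    """Return the count and percentage share of each class in `target`."""
    counts = frame[target].value_counts(dropna=False)
    percent = (counts / len(frame) * 100).round(2)
    return pd.DataFrame({"count": counts, "percent": percent})


# Toy stand-in for df_cleaned (hypothetical rows, real column name)
toy = pd.DataFrame(
    {
        "PRIM_CONTRIBUTORY_CAUSE": [
            "UNABLE TO DETERMINE",
            "UNABLE TO DETERMINE",
            "IMPROPER LANE USAGE",
            "FAILING TO YIELD RIGHT-OF-WAY",
        ]
    }
)

summary = describe_target(toy, "PRIM_CONTRIBUTORY_CAUSE")
print(summary)
```

On the real data the call would be `describe_target(df_cleaned, 'PRIM_CONTRIBUTORY_CAUSE')`, which makes dominant classes visible before any modeling decisions are taken.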

## Data cleaning

In [6]:
# checking for data types
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1024029 entries, 0 to 1024028
Data columns (total 48 columns):
 #   Column                         Non-Null Count    Dtype  
---  ------                         --------------    -----  
 0   CRASH_RECORD_ID                1024029 non-null  object 
 1   CRASH_DATE_EST_I               74318 non-null    object 
 2   CRASH_DATE                     1024029 non-null  object 
 3   POSTED_SPEED_LIMIT             1024029 non-null  int64  
 4   TRAFFIC_CONTROL_DEVICE         1024029 non-null  object 
 5   DEVICE_CONDITION               1024029 non-null  object 
 6   WEATHER_CONDITION              1024029 non-null  object 
 7   LIGHTING_CONDITION             1024029 non-null  object 
 8   FIRST_CRASH_TYPE               1024029 non-null  object 
 9   TRAFFICWAY_TYPE                1024029 non-null  object 
 10  LANE_CNT                       199035 non-null   object 
 11  ALIGNMENT                      1024029 non-null  object 
 12  ROADWAY_SURFAC

In [7]:
# Checking for the shape of the dataset
df_cleaned.shape

(1024029, 48)

In [8]:
# checking for null values
df_cleaned.isna().sum()

CRASH_RECORD_ID                        0
CRASH_DATE_EST_I                  949711
CRASH_DATE                             0
POSTED_SPEED_LIMIT                     0
TRAFFIC_CONTROL_DEVICE                 0
DEVICE_CONDITION                       0
WEATHER_CONDITION                      0
LIGHTING_CONDITION                     0
FIRST_CRASH_TYPE                       0
TRAFFICWAY_TYPE                        0
LANE_CNT                          824994
ALIGNMENT                              0
ROADWAY_SURFACE_COND                   0
ROAD_DEFECT                            0
REPORT_TYPE                        34014
CRASH_TYPE                             0
INTERSECTION_RELATED_I            788579
NOT_RIGHT_OF_WAY_I                978151
HIT_AND_RUN_I                     702756
DAMAGE                                 0
DATE_POLICE_NOTIFIED                   0
PRIM_CONTRIBUTORY_CAUSE                0
SEC_CONTRIBUTORY_CAUSE                 0
STREET_NO                              0
STREET_DIRECTION

Most critical columns, such as PRIM_CONTRIBUTORY_CAUSE, CRASH_DATE, and POSTED_SPEED_LIMIT, have no missing values.

Some optional or situational columns have a high number of missing values, including CRASH_DATE_EST_I, LANE_CNT, INTERSECTION_RELATED_I, NOT_RIGHT_OF_WAY_I, HIT_AND_RUN_I, WORK_ZONE_I, and PHOTOS_TAKEN_I.

A few columns have a low number of missing values, such as STREET_DIRECTION, STREET_NAME, BEAT_OF_OCCURRENCE, and the injury-related fields.

These gaps can be resolved by either dropping the affected columns or imputing the missing values.


In [9]:
# checking the percentage of missing values.
n_percent= (df_cleaned.isnull().mean() * 100).sort_values(ascending=False)
n_percent

WORKERS_PRESENT_I                99.861625
DOORING_I                        99.681454
WORK_ZONE_TYPE                   99.586828
WORK_ZONE_I                      99.458218
PHOTOS_TAKEN_I                   98.572990
STATEMENTS_TAKEN_I               97.610322
NOT_RIGHT_OF_WAY_I               95.519853
CRASH_DATE_EST_I                 92.742588
LANE_CNT                         80.563539
INTERSECTION_RELATED_I           77.007487
HIT_AND_RUN_I                    68.626572
REPORT_TYPE                       3.321586
LOCATION                          0.760428
LATITUDE                          0.760428
LONGITUDE                         0.760428
MOST_SEVERE_INJURY                0.217767
INJURIES_UNKNOWN                  0.216400
INJURIES_INCAPACITATING           0.216400
INJURIES_NON_INCAPACITATING       0.216400
INJURIES_NO_INDICATION            0.216400
INJURIES_REPORTED_NOT_EVIDENT     0.216400
INJURIES_TOTAL                    0.216400
INJURIES_FATAL                    0.216400
BEAT_OF_OCC

In [10]:
df_cleaned.columns

Index(['CRASH_RECORD_ID', 'CRASH_DATE_EST_I', 'CRASH_DATE',
       'POSTED_SPEED_LIMIT', 'TRAFFIC_CONTROL_DEVICE', 'DEVICE_CONDITION',
       'WEATHER_CONDITION', 'LIGHTING_CONDITION', 'FIRST_CRASH_TYPE',
       'TRAFFICWAY_TYPE', 'LANE_CNT', 'ALIGNMENT', 'ROADWAY_SURFACE_COND',
       'ROAD_DEFECT', 'REPORT_TYPE', 'CRASH_TYPE', 'INTERSECTION_RELATED_I',
       'NOT_RIGHT_OF_WAY_I', 'HIT_AND_RUN_I', 'DAMAGE', 'DATE_POLICE_NOTIFIED',
       'PRIM_CONTRIBUTORY_CAUSE', 'SEC_CONTRIBUTORY_CAUSE', 'STREET_NO',
       'STREET_DIRECTION', 'STREET_NAME', 'BEAT_OF_OCCURRENCE',
       'PHOTOS_TAKEN_I', 'STATEMENTS_TAKEN_I', 'DOORING_I', 'WORK_ZONE_I',
       'WORK_ZONE_TYPE', 'WORKERS_PRESENT_I', 'NUM_UNITS',
       'MOST_SEVERE_INJURY', 'INJURIES_TOTAL', 'INJURIES_FATAL',
       'INJURIES_INCAPACITATING', 'INJURIES_NON_INCAPACITATING',
       'INJURIES_REPORTED_NOT_EVIDENT', 'INJURIES_NO_INDICATION',
       'INJURIES_UNKNOWN', 'CRASH_HOUR', 'CRASH_DAY_OF_WEEK', 'CRASH_MONTH',
       'LATITUDE', 

In [11]:
# Checking  for duplicates
df_cleaned.duplicated().value_counts()

False    1024029
dtype: int64
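A full-row duplicate check includes the record identifier, so rows that differ only in their ID are never flagged. The sketch below, on a hypothetical toy frame, shows two complementary checks one might add: duplicates on the ID column alone, and duplicates on the content with the ID removed. Whether either is needed for this dataset is an assumption, not something the notebook establishes.

```python
import pandas as pd

# Toy frame with real column names but hypothetical values
toy = pd.DataFrame(
    {
        "CRASH_RECORD_ID": ["a1", "b2", "c3", "c3"],
        "CRASH_HOUR": [22, 22, 9, 9],
        "WEATHER_CONDITION": ["SNOW", "SNOW", "CLEAR", "CLEAR"],
    }
)

# Rows identical in every column (what df.duplicated() reports)
full_row_dupes = int(toy.duplicated().sum())

# Rows that repeat an already-seen record ID
id_dupes = int(toy.duplicated(subset=["CRASH_RECORD_ID"]).sum())

# Rows whose content repeats once the ID is ignored
content_dupes = int(toy.drop(columns=["CRASH_RECORD_ID"]).duplicated().sum())

print(full_row_dupes, id_dupes, content_dupes)
```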

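Here we resorted to dropping the columns with a large share of missing values, since more than 30% of their entries were missing. The same cell also drops CRASH_RECORD_ID, LATITUDE, LONGITUDE, and CRASH_DATE, which have few or no missing values.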

In [12]:
# Dropping unnecessary columns 
df = df_cleaned.drop(
    columns=[
        'CRASH_RECORD_ID',
        'HIT_AND_RUN_I',
        'NOT_RIGHT_OF_WAY_I',
        'LANE_CNT',
        'INTERSECTION_RELATED_I',
        'CRASH_DATE_EST_I',
        'PHOTOS_TAKEN_I',
        'STATEMENTS_TAKEN_I',
        'DOORING_I',
        'WORK_ZONE_I',
        'WORK_ZONE_TYPE',
        'WORKERS_PRESENT_I',
        'LATITUDE',
        'LONGITUDE',
        'CRASH_DATE'
    ], errors='ignore'
) 
df.head()

Unnamed: 0,POSTED_SPEED_LIMIT,TRAFFIC_CONTROL_DEVICE,DEVICE_CONDITION,WEATHER_CONDITION,LIGHTING_CONDITION,FIRST_CRASH_TYPE,TRAFFICWAY_TYPE,ALIGNMENT,ROADWAY_SURFACE_COND,ROAD_DEFECT,...,INJURIES_FATAL,INJURIES_INCAPACITATING,INJURIES_NON_INCAPACITATING,INJURIES_REPORTED_NOT_EVIDENT,INJURIES_NO_INDICATION,INJURIES_UNKNOWN,CRASH_HOUR,CRASH_DAY_OF_WEEK,CRASH_MONTH,LOCATION
0,30,NO CONTROLS,NO CONTROLS,SNOW,"DARKNESS, LIGHTED ROAD",FIXED OBJECT,NOT DIVIDED,STRAIGHT AND LEVEL,SNOW OR SLUSH,NO DEFECTS,...,0.0,0.0,0.0,0.0,1.0,0.0,22,4,1,POINT (-87.551093105845 41.713829100033)
1,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,"DARKNESS, LIGHTED ROAD",REAR END,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,22,4,1,POINT (-87.755202215729 41.796710893317)
2,30,OTHER,OTHER,OTHER,UNKNOWN,PARKED MOTOR VEHICLE,OTHER,STRAIGHT AND LEVEL,OTHER,UNKNOWN,...,0.0,0.0,0.0,0.0,2.0,0.0,22,4,1,POINT (-87.603822899265 41.813004951227)
3,30,STOP SIGN/FLASHER,FUNCTIONING PROPERLY,CLEAR,"DARKNESS, LIGHTED ROAD",REAR END,FOUR WAY,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,3.0,0.0,22,4,1,POINT (-87.705668192505 41.868335288795)
4,30,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",PARKED MOTOR VEHICLE,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,22,4,1,POINT (-87.696128029764 41.866617682133)


In [13]:
# confirming the percentage of missing values in remaining columns 
n_percent= (df.isnull().mean() * 100).sort_values(ascending=False)
n_percent

REPORT_TYPE                      3.321586
LOCATION                         0.760428
MOST_SEVERE_INJURY               0.217767
INJURIES_NO_INDICATION           0.216400
INJURIES_REPORTED_NOT_EVIDENT    0.216400
INJURIES_NON_INCAPACITATING      0.216400
INJURIES_INCAPACITATING          0.216400
INJURIES_FATAL                   0.216400
INJURIES_TOTAL                   0.216400
INJURIES_UNKNOWN                 0.216400
BEAT_OF_OCCURRENCE               0.000488
STREET_DIRECTION                 0.000391
STREET_NAME                      0.000098
DEVICE_CONDITION                 0.000000
TRAFFICWAY_TYPE                  0.000000
ROADWAY_SURFACE_COND             0.000000
ALIGNMENT                        0.000000
WEATHER_CONDITION                0.000000
FIRST_CRASH_TYPE                 0.000000
LIGHTING_CONDITION               0.000000
TRAFFIC_CONTROL_DEVICE           0.000000
ROAD_DEFECT                      0.000000
STREET_NO                        0.000000
CRASH_TYPE                       0

In [14]:
# Checking the shape of our cleaned dataset
df.shape

(1024029, 33)

The dataset has been reduced from 48 columns to 33 columns.

Next, the remaining null values are filled using the mode for categorical columns and the median for numerical columns.
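The 30% missing-value threshold can also be applied programmatically instead of listing columns by hand. This is a minimal sketch under that assumption; the helper `drop_sparse_columns` and the toy values are hypothetical, and it would not remove the identifier, coordinate, or date columns that the notebook drops explicitly.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame: pd.DataFrame, max_missing_pct: float = 30.0):
    """Drop columns whose share of missing values exceeds `max_missing_pct`."""
    missing_pct = frame.isna().mean() * 100
    to_drop = missing_pct[missing_pct > max_missing_pct].index.tolist()
    return frame.drop(columns=to_drop), to_drop


# Toy frame: LANE_CNT is 75% missing, REPORT_TYPE is 25% missing
toy = pd.DataFrame(
    {
        "POSTED_SPEED_LIMIT": [30, 30, 25, 40],
        "LANE_CNT": [np.nan, np.nan, np.nan, 2.0],
        "REPORT_TYPE": ["ON SCENE", np.nan, "ON SCENE", "ON SCENE"],
    }
)

reduced, dropped = drop_sparse_columns(toy)
print(dropped)
print(reduced.columns.tolist())
```

Keeping the threshold as a parameter makes the rule easy to revisit if a sparsely populated column later turns out to be informative.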

In [15]:
# Using mode to fill categorical columns
cat_cols = df.select_dtypes(include='object').columns

df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])

In [16]:
# Using median to fill numerical columns
# 'number' selects every integer and float dtype regardless of platform
num_cols = df.select_dtypes(include='number').columns

df[num_cols] = df[num_cols].fillna(
    df[num_cols].median()
)
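The same mode/median strategy can be expressed with `SimpleImputer`, which is already imported at the top of the notebook. The sketch below is a variant rather than what the notebook does: it learns the fill values from training rows only and reuses them on test rows. The toy frames and values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical train/test rows using two of the notebook's column names
train = pd.DataFrame(
    {
        "weather_condition": ["CLEAR", "CLEAR", "SNOW", np.nan],
        "injuries_total": [0.0, 1.0, np.nan, 3.0],
    }
)
test = pd.DataFrame(
    {
        "weather_condition": [np.nan, "RAIN"],
        "injuries_total": [np.nan, 2.0],
    }
)

cat_imputer = SimpleImputer(strategy="most_frequent")  # mode for categoricals
num_imputer = SimpleImputer(strategy="median")         # median for numericals

# Fill values are learned from the training rows only
train_cat = cat_imputer.fit_transform(train[["weather_condition"]])
train_num = num_imputer.fit_transform(train[["injuries_total"]])

# The learned values are then reused for the test rows
test_cat = cat_imputer.transform(test[["weather_condition"]])
test_num = num_imputer.transform(test[["injuries_total"]])

print(test_cat.ravel(), test_num.ravel())
```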

In [17]:
# Using isna to confirm we have dealt with all missing values
df.isna().sum()

POSTED_SPEED_LIMIT               0
TRAFFIC_CONTROL_DEVICE           0
DEVICE_CONDITION                 0
WEATHER_CONDITION                0
LIGHTING_CONDITION               0
FIRST_CRASH_TYPE                 0
TRAFFICWAY_TYPE                  0
ALIGNMENT                        0
ROADWAY_SURFACE_COND             0
ROAD_DEFECT                      0
REPORT_TYPE                      0
CRASH_TYPE                       0
DAMAGE                           0
DATE_POLICE_NOTIFIED             0
PRIM_CONTRIBUTORY_CAUSE          0
SEC_CONTRIBUTORY_CAUSE           0
STREET_NO                        0
STREET_DIRECTION                 0
STREET_NAME                      0
BEAT_OF_OCCURRENCE               0
NUM_UNITS                        0
MOST_SEVERE_INJURY               0
INJURIES_TOTAL                   0
INJURIES_FATAL                   0
INJURIES_INCAPACITATING          0
INJURIES_NON_INCAPACITATING      0
INJURIES_REPORTED_NOT_EVIDENT    0
INJURIES_NO_INDICATION           0
INJURIES_UNKNOWN    

In [18]:
n_percent= (df.isnull().mean() * 100).sort_values(ascending=False)
n_percent

LOCATION                         0.0
SEC_CONTRIBUTORY_CAUSE           0.0
TRAFFIC_CONTROL_DEVICE           0.0
DEVICE_CONDITION                 0.0
WEATHER_CONDITION                0.0
LIGHTING_CONDITION               0.0
FIRST_CRASH_TYPE                 0.0
TRAFFICWAY_TYPE                  0.0
ALIGNMENT                        0.0
ROADWAY_SURFACE_COND             0.0
ROAD_DEFECT                      0.0
REPORT_TYPE                      0.0
CRASH_TYPE                       0.0
DAMAGE                           0.0
DATE_POLICE_NOTIFIED             0.0
PRIM_CONTRIBUTORY_CAUSE          0.0
STREET_NO                        0.0
CRASH_MONTH                      0.0
STREET_DIRECTION                 0.0
STREET_NAME                      0.0
BEAT_OF_OCCURRENCE               0.0
NUM_UNITS                        0.0
MOST_SEVERE_INJURY               0.0
INJURIES_TOTAL                   0.0
INJURIES_FATAL                   0.0
INJURIES_INCAPACITATING          0.0
INJURIES_NON_INCAPACITATING      0.0
I

In [19]:
# Stripping whitespaces and changing column names to lower case
df.columns = df.columns.str.strip().str.lower()
df.columns

Index(['posted_speed_limit', 'traffic_control_device', 'device_condition',
       'weather_condition', 'lighting_condition', 'first_crash_type',
       'trafficway_type', 'alignment', 'roadway_surface_cond', 'road_defect',
       'report_type', 'crash_type', 'damage', 'date_police_notified',
       'prim_contributory_cause', 'sec_contributory_cause', 'street_no',
       'street_direction', 'street_name', 'beat_of_occurrence', 'num_units',
       'most_severe_injury', 'injuries_total', 'injuries_fatal',
       'injuries_incapacitating', 'injuries_non_incapacitating',
       'injuries_reported_not_evident', 'injuries_no_indication',
       'injuries_unknown', 'crash_hour', 'crash_day_of_week', 'crash_month',
       'location'],
      dtype='object')

In [20]:
# Saving the changes made during cleaning
df.to_csv("cleaned_traffic_crashes.csv", index=False)


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1024029 entries, 0 to 1024028
Data columns (total 33 columns):
 #   Column                         Non-Null Count    Dtype  
---  ------                         --------------    -----  
 0   posted_speed_limit             1024029 non-null  int64  
 1   traffic_control_device         1024029 non-null  object 
 2   device_condition               1024029 non-null  object 
 3   weather_condition              1024029 non-null  object 
 4   lighting_condition             1024029 non-null  object 
 5   first_crash_type               1024029 non-null  object 
 6   trafficway_type                1024029 non-null  object 
 7   alignment                      1024029 non-null  object 
 8   roadway_surface_cond           1024029 non-null  object 
 9   road_defect                    1024029 non-null  object 
 10  report_type                    1024029 non-null  object 
 11  crash_type                     1024029 non-null  object 
 12  damage        

## Consolidating Crash Causes

The dataset contains over 40 specific crash contributory causes, which can be highly granular and sparse for modeling. To simplify the analysis and improve interpretability, we grouped these causes into 5 broader categories or “buckets”:

 - Driver Error – Includes causes such as following too closely, failing to yield, improper turning, distracted driving, and other errors made by the driver.

 - External Factors – Includes causes outside the driver’s control, such as weather, road defects, visual obstructions, or animals.

 - Vehicle Defect – Covers mechanical issues such as brake failure or other equipment malfunctions.

 - Unknown – Cases where the contributory cause could not be determined or is not applicable.

 - Other – Any remaining causes not captured in the above categories.

Using a mapping dictionary, each record in prim_contributory_cause was mapped to one of these 5 categories, with unmapped causes defaulting to “Other.”

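This step reduces noise and replaces many sparse classes with a handful of larger ones, which makes the predictive model easier to interpret. The resulting classes are still imbalanced, with Driver Error and Unknown accounting for most records.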

After applying the mapping, we verified the distribution of crashes across the new categories to understand the relative prevalence of each broad cause.

In [22]:
# Create a mapping dictionary
# This reduces 40 specific causes into 5 broad "Buckets"
cause_mapping = {
    # DRIVER ERROR (The biggest category)
    'FOLLOWING TOO CLOSELY': 'Driver Error',
    'FAILING TO YIELD RIGHT-OF-WAY': 'Driver Error',
    'FAILING TO REDUCE SPEED TO AVOID CRASH': 'Driver Error',
    'IMPROPER BACKING': 'Driver Error',
    'IMPROPER OVERTAKING/PASSING': 'Driver Error',
    'IMPROPER TURNING/NO SIGNAL': 'Driver Error',
    'DRIVING SKILLS/KNOWLEDGE/EXPERIENCE': 'Driver Error',
    'DISREGARDING TRAFFIC SIGNALS': 'Driver Error',
    'OPERATING VEHICLE IN ERRATIC, RECKLESS, CARELESS, NEGLIGENT OR AGGRESSIVE MANNER': 'Driver Error',
    'TEXTING': 'Driver Error',
    'DISTRACTION - FROM INSIDE VEHICLE': 'Driver Error',
    'DISTRACTION - FROM OUTSIDE VEHICLE': 'Driver Error',
    'PHYSICAL CONDITION OF DRIVER': 'Driver Error',
    
    # EXTERNAL FACTORS
    'WEATHER': 'External Factors',
    'ROAD ENGINEERING/SURFACE/MARKING DEFECTS': 'External Factors',
    'VISION OBSCURED (SIGNS, TREE LIMBS, BUILDINGS, ETC.)': 'External Factors',
    'ANIMAL': 'External Factors',
    
    # VEHICLE DEFECTS
    'EQUIPMENT - VEHICLE CONDITION': 'Vehicle Defect',
    'BRAKESLESS/FAILURE': 'Vehicle Defect',
    
    # UNKNOWN (Usually the biggest or second biggest)
    'UNABLE TO DETERMINE': 'Unknown',
    'NOT APPLICABLE': 'Unknown'
}

# 1. Apply the mapping
# If a cause is NOT in the dictionary, we default it to 'Other'
df['Crash_Cause'] = df['prim_contributory_cause'].map(cause_mapping).fillna('Other')

# 2. Check the new counts
print(df['Crash_Cause'].value_counts())

Driver Error        464480
Unknown             456049
Other                72885
External Factors     24364
Vehicle Defect        6251
Name: Crash_Cause, dtype: int64


We can now see the consolidated cause categories and how they are distributed in the dataset.
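Because unmapped causes silently fall into "Other", it can help to audit the mapping dictionary against the values actually present in the data. The following sketch uses a hypothetical helper `audit_mapping` and a toy series; it lists raw causes that have no mapping entry and dictionary keys that never occur in the data (for example because of a spelling difference).

```python
import pandas as pd


def audit_mapping(raw_causes: pd.Series, mapping: dict):
    """Return (counts of unmapped raw causes, mapping keys absent from the data)."""
    observed = set(raw_causes.dropna().unique())
    unmapped = raw_causes[~raw_causes.isin(list(mapping))].value_counts()
    unused_keys = sorted(set(mapping) - observed)
    return unmapped, unused_keys


# Toy data: one cause has no mapping entry, one mapping key never appears
toy_causes = pd.Series(
    [
        "FOLLOWING TOO CLOSELY",
        "IMPROPER LANE USAGE",
        "IMPROPER LANE USAGE",
        "WEATHER",
        "UNABLE TO DETERMINE",
    ]
)
toy_mapping = {
    "FOLLOWING TOO CLOSELY": "Driver Error",
    "WEATHER": "External Factors",
    "UNABLE TO DETERMINE": "Unknown",
    "TEXTING": "Driver Error",
}

unmapped, unused_keys = audit_mapping(toy_causes, toy_mapping)
print(unmapped)
print(unused_keys)
```

On the real data the call would be `audit_mapping(df['prim_contributory_cause'], cause_mapping)`; any large entry in `unmapped` is a candidate for an explicit bucket.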

## Defining X and Y
Here we define the target variable and the feature variables so that supervised machine learning methods can be applied in our white-box models.

In [23]:
X = df.drop(['Crash_Cause', 'prim_contributory_cause'], axis=1)
y = df['Crash_Cause']

## Train test split
Here we split the dataset into two parts, i.e., the training set and the testing set, for modeling and prediction. The split is stratified on the target so that both parts keep the same class proportions.

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
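The effect of `stratify=y` can be checked by comparing class shares in the two parts. This is a small sketch on synthetic labels; the 50/40/10 class sizes and the single feature column are assumptions chosen so the proportions are easy to read.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced target with 100 rows (hypothetical class sizes)
y = pd.Series(
    ["Driver Error"] * 50 + ["Unknown"] * 40 + ["Other"] * 10,
    name="Crash_Cause",
)
X = pd.DataFrame({"crash_hour": range(100)})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Class shares should match between the two parts
train_share = y_train.value_counts(normalize=True)
test_share = y_test.value_counts(normalize=True)
print(pd.DataFrame({"train": train_share, "test": test_share}))
```

Without stratification, a rare class such as Vehicle Defect could end up under-represented in the test set purely by chance.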

In [25]:
# Checking X train
X_train

Unnamed: 0,posted_speed_limit,traffic_control_device,device_condition,weather_condition,lighting_condition,first_crash_type,trafficway_type,alignment,roadway_surface_cond,road_defect,...,injuries_fatal,injuries_incapacitating,injuries_non_incapacitating,injuries_reported_not_evident,injuries_no_indication,injuries_unknown,crash_hour,crash_day_of_week,crash_month,location
899405,40,NO CONTROLS,NO CONTROLS,CLEAR,DARKNESS,FIXED OBJECT,DIVIDED - W/MEDIAN BARRIER,"CURVE, LEVEL",DRY,NO DEFECTS,...,0.0,0.0,1.0,0.0,0.0,0.0,3,1,11,POINT (-87.618091911783 41.898389053094)
479809,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,REAR END,NOT DIVIDED,STRAIGHT AND LEVEL,UNKNOWN,UNKNOWN,...,0.0,0.0,0.0,0.0,10.0,0.0,13,1,9,POINT (-87.642384512979 41.940186722574)
553121,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,SNOW,"DARKNESS, LIGHTED ROAD",ANGLE,FOUR WAY,STRAIGHT AND LEVEL,SNOW OR SLUSH,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,22,3,1,POINT (-87.80634529093 41.930744417308)
992763,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,"DARKNESS, LIGHTED ROAD",REAR END,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,4.0,0.0,22,2,8,POINT (-87.70096006787 41.877305760362)
514753,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,PARKED MOTOR VEHICLE,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,UNKNOWN,...,0.0,0.0,0.0,0.0,1.0,0.0,13,7,6,POINT (-87.688304588055 41.953491697799)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
919601,30,OTHER,OTHER,CLEAR,DAYLIGHT,TURNING,DIVIDED - W/MEDIAN BARRIER,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,15,2,9,POINT (-87.660174752888 41.991780377892)
784211,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,PARKED MOTOR VEHICLE,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,UNKNOWN,...,0.0,0.0,0.0,0.0,1.0,0.0,15,5,11,POINT (-87.663815590987 41.907808782674)
673016,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,TURNING,DIVIDED - W/MEDIAN (NOT RAISED),STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,5,6,10,POINT (-87.704257223316 41.811600339039)
236334,40,NO CONTROLS,NO CONTROLS,CLEAR,DUSK,ANGLE,DIVIDED - W/MEDIAN BARRIER,STRAIGHT AND LEVEL,DRY,WORN SURFACE,...,0.0,0.0,0.0,0.0,2.0,0.0,16,5,12,POINT (-87.653110814421 41.985449532208)


In [26]:
# checking X test
X_test

Unnamed: 0,posted_speed_limit,traffic_control_device,device_condition,weather_condition,lighting_condition,first_crash_type,trafficway_type,alignment,roadway_surface_cond,road_defect,...,injuries_fatal,injuries_incapacitating,injuries_non_incapacitating,injuries_reported_not_evident,injuries_no_indication,injuries_unknown,crash_hour,crash_day_of_week,crash_month,location
426857,30,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",SIDESWIPE SAME DIRECTION,FOUR WAY,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,23,1,3,POINT (-87.766253945447 41.880333955527)
42008,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,SIDESWIPE SAME DIRECTION,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,16,6,9,POINT (-87.70508348897 41.798881170398)
39997,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,REAR END,FOUR WAY,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,9,6,9,POINT (-87.562908182042 41.76619520231)
620460,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,REAR END,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,15,6,5,POINT (-87.617719481874 41.758471711463)
731576,30,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",REAR END,PARKING LOT,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,4.0,0.0,22,6,4,POINT (-87.614709812077 41.721991872532)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
640257,20,NO CONTROLS,NO CONTROLS,UNKNOWN,UNKNOWN,PARKED MOTOR VEHICLE,ONE-WAY,STRAIGHT AND LEVEL,UNKNOWN,UNKNOWN,...,0.0,0.0,0.0,0.0,3.0,0.0,18,1,2,POINT (-87.631180891189 41.89093246868)
969443,30,NO CONTROLS,NO CONTROLS,CLEAR,DUSK,TURNING,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,9.0,0.0,15,4,1,POINT (-87.607532022114 41.767416946505)
304252,30,NO CONTROLS,NO CONTROLS,CLEAR,UNKNOWN,PARKED MOTOR VEHICLE,DIVIDED - W/MEDIAN BARRIER,STRAIGHT AND LEVEL,UNKNOWN,NO DEFECTS,...,0.0,0.0,0.0,0.0,1.0,0.0,11,7,5,POINT (-87.765519558311 41.880346236952)
971835,30,NO CONTROLS,NO CONTROLS,CLEAR,DAWN,PARKED MOTOR VEHICLE,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,1.0,0.0,0,2,12,POINT (-87.575890342787 41.753911617689)


## Preprocessing

In this step we prepare the data for modeling. Numeric features are standardized so that no single feature dominates because of its scale, and categorical variables are one-hot encoded into binary indicator columns that the models can work with. The transformations are fitted on X_train only and then applied to X_test, which avoids data leakage.

In [27]:
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    ConfusionMatrixDisplay,
    accuracy_score,
    f1_score
)

from sklearn.inspection import permutation_importance

In [28]:
# Make copies first (optional but safe)
X_train = X_train.copy()
X_test  = X_test.copy()

# Group rare categories → fewer one-hot columns
rare_threshold = 200  # adjust if needed

for col in ['street_name', 'first_crash_type', 'trafficway_type']:
    if col in X_train.columns:
        counts = X_train[col].value_counts()
        common = counts[counts >= rare_threshold].index
        
        # Safe assignment using .loc
        X_train.loc[:, col] = X_train[col].where(X_train[col].isin(common), 'Rare/Other')
        X_test.loc[:, col]  = X_test[col].where(X_test[col].isin(common), 'Rare/Other')
        
        print(f"Reduced {col}: {len(common)} common categories kept")


Reduced street_name: 509 common categories kept
Reduced first_crash_type: 17 common categories kept
Reduced trafficway_type: 19 common categories kept


In [29]:
categorical_features = X_train.select_dtypes(include=['object']).columns
numeric_features = X_train.select_dtypes(exclude=['object']).columns

In [30]:
# Drop very high-cardinality categorical columns from OHE to save memory.
# Column names in X_train are lowercase, so the names listed here must match that case.
high_card = ['location', 'street_name', 'date_police_notified']
cat_reduced = [c for c in categorical_features if c not in high_card]

preprocessor_light = ColumnTransformer(
    transformers=[
        # OneHotEncoder returns a sparse matrix by default
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_reduced),
        ('num', StandardScaler(), numeric_features)
    ]
)

In [31]:

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")), # handles missing values
    ("ohe", OneHotEncoder(handle_unknown="ignore")) # encoding categories
])

preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])

In [32]:
from sklearn.linear_model import LogisticRegression

logreg_pipe = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", LogisticRegression(
        max_iter=2000,
        class_weight='balanced',
        random_state=42 ))
])

In [33]:
# Model Pipeline
tree_pipe = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", DecisionTreeClassifier(max_depth=5, random_state=42))
])

rf_pipe = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", RandomForestClassifier(
        n_estimators=50,
        class_weight='balanced_subsample',
        random_state=42,
        n_jobs=-1
    ))
])

In [34]:
import matplotlib.pyplot as plt

def evaluate(name, pipe):
    """Fit the pipeline on the training split, report test metrics, and plot the confusion matrix."""
    pipe.fit(X_train, y_train)
    preds = pipe.predict(X_test)

    acc = accuracy_score(y_test, preds)
    macro_f1 = f1_score(y_test, preds, average="macro")

    print("\n" + "="*70)
    print(name)
    print(f"Accuracy: {acc:.4f}")
    print(f"Macro F1 : {macro_f1:.4f}")
    print("\nClassification report:")
    print(classification_report(y_test, preds))

    cm = confusion_matrix(y_test, preds, labels=pipe.classes_)
    ConfusionMatrixDisplay(cm, display_labels=pipe.classes_).plot(xticks_rotation=30)
    plt.title(f"Confusion Matrix - {name}")
    plt.show()

    return {"Model": name, "Accuracy": acc, "Macro_F1": macro_f1}

### Categorical Feature Summary
A few features have a very large number of unique values (e.g., date and location fields), which may increase model complexity.

Most categorical variables have a low to moderate number of unique values, making them suitable for encoding and interpretation.

Street name has very many unique values but is an important column, so street names that occur fewer than 200 times in the training data were grouped into a single 'Rare/Other' category.

The categorical variables in X_train are then one-hot encoded.

## Numerical Features
These are the features in the dataset that hold numeric values.

## Scaling numeric columns
Here we standardize the numeric columns so that each one has a mean of zero and unit variance, which keeps features with large ranges from dominating the model.
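
As a small illustration of what the scaling step does, the sketch below standardizes a handful of made-up speed-limit values and checks the result against the z-score formula. The values and variable names are assumptions for demonstration only, not project data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical speed-limit values used only for illustration
speeds = np.array([[20.0], [30.0], [30.0], [40.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(speeds)

# StandardScaler computes z = (x - mean) / std using the population std (ddof=0)
manual = (speeds - speeds.mean(axis=0)) / speeds.std(axis=0)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

After scaling, the column has mean zero and standard deviation one, so its original units no longer affect how strongly it weighs in scale-sensitive models such as logistic regression.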

We now have a fully processed training set for training the white-box and black-box models used for prediction and comparison.


### Preprocessing X_test
To ensure a fair and unbiased evaluation, the test dataset was preprocessed using the same transformations learned from the training data. The preprocessing pipeline fitted on the training set was applied to X_test using the transform method only, preventing data leakage and maintaining consistency between the training and testing data.
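
The fit-on-train, transform-on-test pattern can be sketched on a toy example. The two small DataFrames below are hypothetical stand-ins for X_train and X_test, and `toy_preprocessor` is an illustrative name rather than an object from this notebook. Note how a category that appears only in the test rows is encoded as all zeros because `handle_unknown="ignore"` is set.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-ins for X_train / X_test (hypothetical values, not project data)
toy_train = pd.DataFrame({
    "posted_speed_limit": [20, 30, 30, 40],
    "weather_condition": ["CLEAR", "SNOW", "CLEAR", "RAIN"],
})
toy_test = pd.DataFrame({
    "posted_speed_limit": [30, 25],
    "weather_condition": ["CLEAR", "FOG"],  # "FOG" never appears in training
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["posted_speed_limit"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["weather_condition"]),
    ],
    sparse_threshold=0,  # always return a dense array so it is easy to inspect
)

# fit_transform learns the mean/std and the category list from the training rows only
train_matrix = toy_preprocessor.fit_transform(toy_train)
# transform reuses those learned statistics; nothing is re-estimated from the test rows
test_matrix = toy_preprocessor.transform(toy_test)

print(train_matrix.shape, test_matrix.shape)
print(test_matrix)
```

When the preprocessor is wrapped in a `Pipeline`, as in the model pipelines defined earlier, calling `fit` on the training data and `predict` on the test data applies this same separation automatically.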

## Encoding the Target Variable

To prepare the target variable for model training, the categorical crash cause labels were encoded into numerical values using LabelEncoder. This transformation converts each unique class into an integer, enabling compatibility with machine learning algorithms.

In [35]:
from sklearn.preprocessing import LabelEncoder

# Instantiate
le = LabelEncoder()

# Encode the NEW grouped column
df['Target_Encoded'] = le.fit_transform(df['Crash_Cause'])

# Update your y variable
y = df['Target_Encoded']

print("Encoding complete. Classes are:", le.classes_)

Encoding complete. Classes are: ['Driver Error' 'External Factors' 'Other' 'Unknown' 'Vehicle Defect']
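
For reference, the sketch below shows how the integer codes relate to the class names printed above and how predictions can be mapped back to readable labels. The class names come from the output above; the sample order and the `demo_le` and `mapping` names are assumptions made for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Class names copied from the output above; the sample order is made up
sample_causes = [
    "Unknown", "Driver Error", "Vehicle Defect",
    "Driver Error", "Other", "External Factors",
]

demo_le = LabelEncoder()
codes = demo_le.fit_transform(sample_causes)

# Classes are sorted alphabetically, so each label's code is its position in classes_
mapping = {label: int(code) for code, label in enumerate(demo_le.classes_)}
print(mapping)

# inverse_transform turns integer predictions back into readable labels
print(demo_le.inverse_transform([0, 4]))
```

The integer codes are arbitrary identifiers, not an ordering of the causes, so they should only be used as classification targets.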


## Modeling: White-Box and Black-Box Models

With the preprocessing complete and the target variable encoded, the fully prepared training dataset was used to train and evaluate both white-box and black-box machine learning models. This dual-modeling approach allows for a balanced comparison between model interpretability and predictive performance.

### White-Box Models

White-box models are inherently interpretable, meaning their decision-making processes can be easily understood and explained. These models are particularly valuable in domains where transparency and accountability are critical.

In this project, white-box models were trained to:

 - Understand the key factors contributing to traffic crash causes

 - Provide clear explanations for predictions

 - Establish a strong baseline for comparison

Examples of white-box models used include:

 - Logistic Regression

 - Decision Trees

These models enable direct interpretation of feature importance and decision logic, making them suitable for policy-making and safety analysis.
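
As a rough sketch of what this direct interpretation can look like, the example below fits a shallow decision tree and a logistic regression on synthetic data, then prints the tree's rules and the regression coefficients. The data, the feature names, and the binary target are placeholders chosen for brevity; they are not the project's crash features or its five-class target.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; the feature names are placeholders, not project columns
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=42
)
feature_names = ["feat_a", "feat_b", "feat_c", "feat_d"]

# A shallow tree can be printed as nested if/else rules
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_demo, y_demo)
rules = export_text(tree, feature_names=feature_names)
print(rules)

# Logistic regression exposes one coefficient per feature (one row in the binary case)
logreg = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
for name, coef in zip(feature_names, logreg.coef_[0]):
    print(f"{name}: {coef:+.3f}")
```

With a multi-class target such as the crash cause, `coef_` has one row per class, so the coefficients are read per class rather than as a single list.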

### Black-Box Models

Black-box models prioritize predictive accuracy over interpretability. While they often achieve higher performance, their internal decision processes are more complex and less transparent.

In this project, black-box models were trained to:

 - Capture complex, non-linear relationships within the data

 - Improve prediction performance over simpler models

 - Serve as a benchmark for evaluating the trade-off between accuracy and explainability

Examples of black-box models used include:

 - Random Forest

 - Gradient Boosting / XGBoost

Although these models are less interpretable by default, post-hoc explanation techniques can be applied to better understand their predictions.
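
Permutation importance, which is already imported earlier in this notebook, is one such post-hoc technique. The sketch below applies it to a random forest trained on synthetic data. The dataset, split, and variable names are assumptions made for illustration, and macro F1 is used as the scoring metric to mirror the evaluation function defined earlier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not project data): 5 features, 2 of them informative
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=2, n_redundant=0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle one feature at a time on held-out data and measure the drop in macro F1
result = permutation_importance(
    forest, X_te, y_te, scoring="f1_macro", n_repeats=5, random_state=42
)

# List features from most to least important
for idx in result.importances_mean.argsort()[::-1]:
    mean = result.importances_mean[idx]
    std = result.importances_std[idx]
    print(f"feature_{idx}: {mean:.4f} +/- {std:.4f}")
```

Because the importance is computed on held-out data, a feature that the model relies on only through overfitting tends to receive a score near zero.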

## Modeling Objective (Model Interpretability Focus)

The primary objective of the modeling stage is to balance predictive performance with interpretability. By training both white-box and black-box models, this project aims to understand not only what predictions are made, but why they are made.

Specifically, the modeling objectives are to:

 - Develop white-box models that provide transparent and easily interpretable decision rules, enabling clear identification of the factors that contribute to different crash causes.

 - Train black-box models to capture complex, non-linear relationships in the data and maximize predictive performance.

 - Compare the explanations produced by interpretable models with insights derived from black-box models using post-hoc interpretation techniques.

 - Evaluate the trade-offs between model accuracy and explainability to determine which models are most suitable for real-world decision-making.

 - Support actionable insights by identifying key features that influence crash outcomes in a way that can be communicated to both technical and non-technical stakeholders.

This approach ensures that model selection is guided not only by performance metrics but also by the ability to provide meaningful, trustworthy, and explainable insights.

In [36]:
# Making necessary imports
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
import matplotlib.pyplot as plt