# A Data-Driven and Interpretable Approach to Traffic Crash Severity Prediction 

# Business Understanding
## Business Overview

Road traffic crashes continue to pose a significant public safety and economic challenge, particularly when they result in serious injuries or fatalities. Transportation authorities and public safety agencies are responsible for making critical decisions about infrastructure investments, traffic enforcement, and emergency response planning, often under constrained budgets and limited resources. To be effective, these decisions must be supported by data-driven insights that are not only accurate but also transparent and explainable.

Conventional traffic safety analyses typically focus on historical trends and aggregated statistics, which may fail to capture the complex interactions between roadway conditions, environmental factors, driver behavior, and temporal patterns that contribute to severe crash outcomes. While advanced machine learning models can improve predictive performance, their limited interpretability can reduce trust and limit their usefulness in policy and operational contexts where accountability and justification are required.

This project addresses these challenges by applying interpretable machine learning models to traffic crash data in order to predict crash severity and explain the key factors associated with injury- and fatal-level crashes. By integrating both interpretable (white-box) models and high-performing (black-box) models enhanced with explainability techniques, the analysis identifies how factors such as speed limits, lighting conditions, roadway characteristics, hit-and-run involvement, and time of day influence crash severity.

The insights produced by this project are intended to directly support decision-making related to roadway safety improvements, targeted enforcement strategies, and resource prioritization. By translating model outputs into clear, actionable explanations, this analysis enables stakeholders to move beyond reactive responses and toward proactive, evidence-based interventions aimed at reducing the frequency and severity of traffic-related injuries and fatalities.

## Problem Statement
Traffic safety stakeholders face ongoing challenges in reducing the severity of road traffic crashes while operating under limited resources and increasing system complexity. Although large volumes of traffic crash data are available, decision-makers often lack clear, explainable insights into the factors that most strongly contribute to injury- and fatal-level crashes. Existing analyses frequently focus on descriptive statistics or predictive accuracy alone, which limits their usefulness for guiding infrastructure investments, enforcement strategies, and policy interventions.

The core business problem addressed in this project is the absence of transparent, interpretable models that can both predict crash severity and clearly explain why certain crashes are more likely to result in severe outcomes. Without interpretable insights, stakeholders risk making decisions that are difficult to justify, inefficiently targeted, or misaligned with real-world risk factors. There is a need for a data-driven approach that balances predictive performance with explainability to support accountable and evidence-based traffic safety decisions.

## Business Objectives
The primary objective of this project is to support traffic safety decision-making by developing an interpretable machine learning framework that predicts crash severity and identifies the key factors driving severe crash outcomes.

### Specific Objectives
1. Predict Crash Severity
Develop machine learning models to classify traffic crashes based on severity, with a focus on identifying crashes that result in injuries or fatalities.

2. Ensure Model Interpretability
Apply interpretable modeling techniques to explain both global and individual-level predictions, enabling stakeholders to understand how specific features influence crash severity.

3. Identify High-Impact Risk Factors
Determine the most influential factors associated with severe crashes, including roadway conditions, environmental factors, driver-related behaviors, and temporal patterns.

4. Compare Interpretability and Performance Trade-offs
Evaluate the performance of interpretable (white-box) models against more complex (black-box) models to assess the balance between accuracy and explainability.

5. Support Actionable Safety Interventions
Translate model insights into practical recommendations that can inform infrastructure improvements, targeted enforcement strategies, and resource allocation.

6. Promote Transparent and Accountable Decision-Making
Provide clear, defensible explanations that can be communicated to non-technical stakeholders, supporting responsible and evidence-based traffic safety policies.
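
As a minimal sketch of the first objective, a binary severity target could be derived from the injury count columns that appear in the dataset. The `is_severe` name and the rule itself (at least one fatal or incapacitating injury) are assumptions made for illustration, not a definition taken from this notebook.

```python
import pandas as pd

# Toy rows using injury count columns that appear in the crash dataset.
crashes = pd.DataFrame({
    "injuries_fatal": [0.0, 0.0, 1.0, 0.0],
    "injuries_incapacitating": [0.0, 2.0, 0.0, 0.0],
    "injuries_non_incapacitating": [0.0, 0.0, 0.0, 1.0],
})

# Assumed rule: a crash is "severe" if anyone was killed or incapacitated.
crashes["is_severe"] = (
    (crashes["injuries_fatal"] > 0) | (crashes["injuries_incapacitating"] > 0)
).astype(int)

print(crashes["is_severe"].tolist())
```

A stricter or looser threshold (for example, counting any reported injury) would change the class balance and should be agreed with stakeholders before modeling.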

# Data Preprocessing

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import classification_report

In [3]:
df1 = pd.read_csv("Traffic_Crashes.csv")
df1.head()


Unnamed: 0,CRASH_RECORD_ID,CRASH_DATE_EST_I,CRASH_DATE,POSTED_SPEED_LIMIT,TRAFFIC_CONTROL_DEVICE,DEVICE_CONDITION,WEATHER_CONDITION,LIGHTING_CONDITION,FIRST_CRASH_TYPE,TRAFFICWAY_TYPE,...,INJURIES_NON_INCAPACITATING,INJURIES_REPORTED_NOT_EVIDENT,INJURIES_NO_INDICATION,INJURIES_UNKNOWN,CRASH_HOUR,CRASH_DAY_OF_WEEK,CRASH_MONTH,LATITUDE,LONGITUDE,LOCATION
0,e268f2eeda8ac7b5a4c2a1df4fd2ce3754bde4e92bfbfc...,,01/29/2026 10:30:00 PM,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,"DARKNESS, LIGHTED ROAD",SIDESWIPE SAME DIRECTION,ONE-WAY,...,0.0,0.0,3.0,0.0,22,5,1,41.87856,-87.636524,POINT (-87.636523872055 41.878560153624)
1,c157b0c950338c89f8de0f1c5de1fad8751cf2a5e4189a...,,01/29/2026 10:10:00 PM,25,STOP SIGN/FLASHER,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",TURNING,FOUR WAY,...,0.0,0.0,3.0,0.0,22,5,1,42.021287,-87.673023,POINT (-87.673023252655 42.021287350749)
2,9e6b31d4cd88bb220fa8ff4c0f92426f99ae64edbd8cd6...,,01/29/2026 09:44:00 PM,30,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",SIDESWIPE OPPOSITE DIRECTION,NOT DIVIDED,...,0.0,0.0,2.0,0.0,21,5,1,41.953595,-87.741477,POINT (-87.741477234719 41.953595431054)
3,d5656bd91be03c1848369fb76427a0d6b05c5e300b665c...,,01/29/2026 09:40:00 PM,20,STOP SIGN/FLASHER,FUNCTIONING PROPERLY,CLEAR,"DARKNESS, LIGHTED ROAD",TURNING,T-INTERSECTION,...,0.0,0.0,2.0,0.0,21,5,1,41.874687,-87.76451,POINT (-87.764510374021 41.874686639254)
4,764fd30db9bef388872f692ded03776c1b58fe96584ece...,,01/29/2026 09:05:00 PM,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,SNOW,"DARKNESS, LIGHTED ROAD",REAR END,DIVIDED - W/MEDIAN (NOT RAISED),...,0.0,0.0,3.0,0.0,21,5,1,41.874401,-87.725592,POINT (-87.725592011617 41.874401315544)


In [4]:
df_cleaned = pd.read_csv("cleaned_traffic_crashes.csv")

In [5]:
df_cleaned = pd.read_csv("cleaned_traffic_crashes.csv", low_memory=False)
df_cleaned.head()

Unnamed: 0,posted_speed_limit,traffic_control_device,device_condition,weather_condition,lighting_condition,first_crash_type,trafficway_type,alignment,roadway_surface_cond,road_defect,...,injuries_fatal,injuries_incapacitating,injuries_non_incapacitating,injuries_reported_not_evident,injuries_no_indication,injuries_unknown,crash_hour,crash_day_of_week,crash_month,location
0,30,NO CONTROLS,NO CONTROLS,SNOW,"DARKNESS, LIGHTED ROAD",FIXED OBJECT,NOT DIVIDED,STRAIGHT AND LEVEL,SNOW OR SLUSH,NO DEFECTS,...,0.0,0.0,0.0,0.0,1.0,0.0,22,4,1,POINT (-87.551093105845 41.713829100033)
1,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,"DARKNESS, LIGHTED ROAD",REAR END,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,22,4,1,POINT (-87.755202215729 41.796710893317)
2,30,OTHER,OTHER,OTHER,UNKNOWN,PARKED MOTOR VEHICLE,OTHER,STRAIGHT AND LEVEL,OTHER,UNKNOWN,...,0.0,0.0,0.0,0.0,2.0,0.0,22,4,1,POINT (-87.603822899265 41.813004951227)
3,30,STOP SIGN/FLASHER,FUNCTIONING PROPERLY,CLEAR,"DARKNESS, LIGHTED ROAD",REAR END,FOUR WAY,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,3.0,0.0,22,4,1,POINT (-87.705668192505 41.868335288795)
4,30,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",PARKED MOTOR VEHICLE,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,22,4,1,POINT (-87.696128029764 41.866617682133)


In [6]:
df_cleaned['prim_contributory_cause']

0                    UNABLE TO DETERMINE
1                    UNABLE TO DETERMINE
2                    UNABLE TO DETERMINE
3            IMPROPER OVERTAKING/PASSING
4                    IMPROPER LANE USAGE
                       ...              
1024024              UNABLE TO DETERMINE
1024025    FAILING TO YIELD RIGHT-OF-WAY
1024026              UNABLE TO DETERMINE
1024027              UNABLE TO DETERMINE
1024028              IMPROPER LANE USAGE
Name: prim_contributory_cause, Length: 1024029, dtype: object

## Data Cleaning

In [7]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1024029 entries, 0 to 1024028
Data columns (total 33 columns):
 #   Column                         Non-Null Count    Dtype  
---  ------                         --------------    -----  
 0   posted_speed_limit             1024029 non-null  int64  
 1   traffic_control_device         1024029 non-null  object 
 2   device_condition               1024029 non-null  object 
 3   weather_condition              1024029 non-null  object 
 4   lighting_condition             1024029 non-null  object 
 5   first_crash_type               1024029 non-null  object 
 6   trafficway_type                1024029 non-null  object 
 7   alignment                      1024029 non-null  object 
 8   roadway_surface_cond           1024029 non-null  object 
 9   road_defect                    1024029 non-null  object 
 10  report_type                    1024029 non-null  object 
 11  crash_type                     1024029 non-null  object 
 12  damage        

In [8]:
df_cleaned.shape

(1024029, 33)

In [9]:
df_cleaned.isna().sum()

posted_speed_limit               0
traffic_control_device           0
device_condition                 0
weather_condition                0
lighting_condition               0
first_crash_type                 0
trafficway_type                  0
alignment                        0
roadway_surface_cond             0
road_defect                      0
report_type                      0
crash_type                       0
damage                           0
date_police_notified             0
prim_contributory_cause          0
sec_contributory_cause           0
street_no                        0
street_direction                 0
street_name                      0
beat_of_occurrence               0
num_units                        0
most_severe_injury               0
injuries_total                   0
injuries_fatal                   0
injuries_incapacitating          0
injuries_non_incapacitating      0
injuries_reported_not_evident    0
injuries_no_indication           0
injuries_unknown    

In [10]:
n_percent= (df_cleaned.isnull().mean() * 100).sort_values(ascending=False)
n_percent

location                         0.0
sec_contributory_cause           0.0
traffic_control_device           0.0
device_condition                 0.0
weather_condition                0.0
lighting_condition               0.0
first_crash_type                 0.0
trafficway_type                  0.0
alignment                        0.0
roadway_surface_cond             0.0
road_defect                      0.0
report_type                      0.0
crash_type                       0.0
damage                           0.0
date_police_notified             0.0
prim_contributory_cause          0.0
street_no                        0.0
crash_month                      0.0
street_direction                 0.0
street_name                      0.0
beat_of_occurrence               0.0
num_units                        0.0
most_severe_injury               0.0
injuries_total                   0.0
injuries_fatal                   0.0
injuries_incapacitating          0.0
injuries_non_incapacitating      0.0
i

In [11]:
df_cleaned.columns

Index(['posted_speed_limit', 'traffic_control_device', 'device_condition',
       'weather_condition', 'lighting_condition', 'first_crash_type',
       'trafficway_type', 'alignment', 'roadway_surface_cond', 'road_defect',
       'report_type', 'crash_type', 'damage', 'date_police_notified',
       'prim_contributory_cause', 'sec_contributory_cause', 'street_no',
       'street_direction', 'street_name', 'beat_of_occurrence', 'num_units',
       'most_severe_injury', 'injuries_total', 'injuries_fatal',
       'injuries_incapacitating', 'injuries_non_incapacitating',
       'injuries_reported_not_evident', 'injuries_no_indication',
       'injuries_unknown', 'crash_hour', 'crash_day_of_week', 'crash_month',
       'location'],
      dtype='object')

In [12]:
df_cleaned.duplicated().value_counts()

False    1023967
True          62
dtype: int64
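
The check above reports 62 fully duplicated rows, but no later cell removes them. A minimal sketch of counting and dropping exact duplicates on a toy frame follows; the names `frame` and `deduped` are hypothetical. Whether those 62 rows are true duplicate records or distinct crashes that happen to share every recorded attribute cannot be determined from the cleaned file alone, since the record ID is no longer present.

```python
import pandas as pd

# Toy frame with one exact duplicate row (rows 0 and 2 match).
frame = pd.DataFrame({
    "posted_speed_limit": [30, 25, 30],
    "weather_condition": ["CLEAR", "SNOW", "CLEAR"],
})

n_dupes = int(frame.duplicated().sum())  # rows that repeat an earlier row
deduped = frame.drop_duplicates().reset_index(drop=True)

print(n_dupes, deduped.shape)
```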

In [13]:
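# Drop identifier, flag, and raw date/coordinate columns not used for modeling.
# Note: these names are uppercase, while df_cleaned already has lowercase
# column names and contains none of these columns, so with errors='ignore'
# this call drops nothing here (the shape stays (1024029, 33)).
df = df_cleaned.drop(
    columns=[
        'CRASH_RECORD_ID',
        'HIT_AND_RUN_I',
        'NOT_RIGHT_OF_WAY_I',
        'LANE_CNT',
        'INTERSECTION_RELATED_I',
        'CRASH_DATE_EST_I',
        'PHOTOS_TAKEN_I',
        'STATEMENTS_TAKEN_I',
        'DOORING_I',
        'WORK_ZONE_I',
        'WORK_ZONE_TYPE',
        'WORKERS_PRESENT_I',
        'LATITUDE',
        'LONGITUDE',
        'CRASH_DATE'
    ],
    errors='ignore'
)
df.head()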

Unnamed: 0,posted_speed_limit,traffic_control_device,device_condition,weather_condition,lighting_condition,first_crash_type,trafficway_type,alignment,roadway_surface_cond,road_defect,...,injuries_fatal,injuries_incapacitating,injuries_non_incapacitating,injuries_reported_not_evident,injuries_no_indication,injuries_unknown,crash_hour,crash_day_of_week,crash_month,location
0,30,NO CONTROLS,NO CONTROLS,SNOW,"DARKNESS, LIGHTED ROAD",FIXED OBJECT,NOT DIVIDED,STRAIGHT AND LEVEL,SNOW OR SLUSH,NO DEFECTS,...,0.0,0.0,0.0,0.0,1.0,0.0,22,4,1,POINT (-87.551093105845 41.713829100033)
1,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,"DARKNESS, LIGHTED ROAD",REAR END,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,22,4,1,POINT (-87.755202215729 41.796710893317)
2,30,OTHER,OTHER,OTHER,UNKNOWN,PARKED MOTOR VEHICLE,OTHER,STRAIGHT AND LEVEL,OTHER,UNKNOWN,...,0.0,0.0,0.0,0.0,2.0,0.0,22,4,1,POINT (-87.603822899265 41.813004951227)
3,30,STOP SIGN/FLASHER,FUNCTIONING PROPERLY,CLEAR,"DARKNESS, LIGHTED ROAD",REAR END,FOUR WAY,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,3.0,0.0,22,4,1,POINT (-87.705668192505 41.868335288795)
4,30,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",PARKED MOTOR VEHICLE,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,22,4,1,POINT (-87.696128029764 41.866617682133)


In [14]:
n_percent= (df.isnull().mean() * 100).sort_values(ascending=False)
n_percent

location                         0.0
sec_contributory_cause           0.0
traffic_control_device           0.0
device_condition                 0.0
weather_condition                0.0
lighting_condition               0.0
first_crash_type                 0.0
trafficway_type                  0.0
alignment                        0.0
roadway_surface_cond             0.0
road_defect                      0.0
report_type                      0.0
crash_type                       0.0
damage                           0.0
date_police_notified             0.0
prim_contributory_cause          0.0
street_no                        0.0
crash_month                      0.0
street_direction                 0.0
street_name                      0.0
beat_of_occurrence               0.0
num_units                        0.0
most_severe_injury               0.0
injuries_total                   0.0
injuries_fatal                   0.0
injuries_incapacitating          0.0
injuries_non_incapacitating      0.0
i

In [15]:
df.shape

(1024029, 33)

Filling null values for categorical and numerical columns

In [16]:
cat_cols = df.select_dtypes(include='object').columns

df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])



In [17]:
# 'number' matches every numeric dtype regardless of platform integer width
num_cols = df.select_dtypes(include='number').columns

df[num_cols] = df[num_cols].fillna(
    df[num_cols].median()
)
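
The two cells above compute the mode and median over the full dataset. This file has no missing values, so they change nothing here. If there were gaps, a common alternative is to learn the fill values from the training rows only and reuse them on the test rows. A minimal sketch with toy data follows; `train`, `test`, and `fill_values` are hypothetical names.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({
    "posted_speed_limit": [30.0, np.nan, 40.0, 30.0],
    "weather_condition": ["CLEAR", "CLEAR", None, "SNOW"],
})
test = pd.DataFrame({
    "posted_speed_limit": [np.nan, 25.0],
    "weather_condition": [None, "RAIN"],
})

num_cols = train.select_dtypes(include="number").columns
cat_cols = train.select_dtypes(exclude="number").columns

# Learn fill values from the training rows only: median for numeric
# columns, most frequent value for categorical columns.
fill_values = {
    **train[num_cols].median().to_dict(),
    **train[cat_cols].mode().iloc[0].to_dict(),
}

# Apply the same learned values to both splits.
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

print(test_filled)
```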

In [18]:
df.isna().sum()

posted_speed_limit               0
traffic_control_device           0
device_condition                 0
weather_condition                0
lighting_condition               0
first_crash_type                 0
trafficway_type                  0
alignment                        0
roadway_surface_cond             0
road_defect                      0
report_type                      0
crash_type                       0
damage                           0
date_police_notified             0
prim_contributory_cause          0
sec_contributory_cause           0
street_no                        0
street_direction                 0
street_name                      0
beat_of_occurrence               0
num_units                        0
most_severe_injury               0
injuries_total                   0
injuries_fatal                   0
injuries_incapacitating          0
injuries_non_incapacitating      0
injuries_reported_not_evident    0
injuries_no_indication           0
injuries_unknown    

In [19]:
n_percent= (df.isnull().mean() * 100).sort_values(ascending=False)
n_percent

location                         0.0
sec_contributory_cause           0.0
traffic_control_device           0.0
device_condition                 0.0
weather_condition                0.0
lighting_condition               0.0
first_crash_type                 0.0
trafficway_type                  0.0
alignment                        0.0
roadway_surface_cond             0.0
road_defect                      0.0
report_type                      0.0
crash_type                       0.0
damage                           0.0
date_police_notified             0.0
prim_contributory_cause          0.0
street_no                        0.0
crash_month                      0.0
street_direction                 0.0
street_name                      0.0
beat_of_occurrence               0.0
num_units                        0.0
most_severe_injury               0.0
injuries_total                   0.0
injuries_fatal                   0.0
injuries_incapacitating          0.0
injuries_non_incapacitating      0.0
i

In [20]:
df.columns = df.columns.str.strip().str.lower()
df.columns

Index(['posted_speed_limit', 'traffic_control_device', 'device_condition',
       'weather_condition', 'lighting_condition', 'first_crash_type',
       'trafficway_type', 'alignment', 'roadway_surface_cond', 'road_defect',
       'report_type', 'crash_type', 'damage', 'date_police_notified',
       'prim_contributory_cause', 'sec_contributory_cause', 'street_no',
       'street_direction', 'street_name', 'beat_of_occurrence', 'num_units',
       'most_severe_injury', 'injuries_total', 'injuries_fatal',
       'injuries_incapacitating', 'injuries_non_incapacitating',
       'injuries_reported_not_evident', 'injuries_no_indication',
       'injuries_unknown', 'crash_hour', 'crash_day_of_week', 'crash_month',
       'location'],
      dtype='object')

In [21]:
# Save the cleaned data (this overwrites the file that was read in above)
df.to_csv("cleaned_traffic_crashes.csv", index=False)


In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1024029 entries, 0 to 1024028
Data columns (total 33 columns):
 #   Column                         Non-Null Count    Dtype  
---  ------                         --------------    -----  
 0   posted_speed_limit             1024029 non-null  int64  
 1   traffic_control_device         1024029 non-null  object 
 2   device_condition               1024029 non-null  object 
 3   weather_condition              1024029 non-null  object 
 4   lighting_condition             1024029 non-null  object 
 5   first_crash_type               1024029 non-null  object 
 6   trafficway_type                1024029 non-null  object 
 7   alignment                      1024029 non-null  object 
 8   roadway_surface_cond           1024029 non-null  object 
 9   road_defect                    1024029 non-null  object 
 10  report_type                    1024029 non-null  object 
 11  crash_type                     1024029 non-null  object 
 12  damage        

In [23]:
# Create a mapping dictionary
# This reduces 40 specific causes into 5 broad "Buckets"
cause_mapping = {
    # DRIVER ERROR (The biggest category)
    'FOLLOWING TOO CLOSELY': 'Driver Error',
    'FAILING TO YIELD RIGHT-OF-WAY': 'Driver Error',
    'FAILING TO REDUCE SPEED TO AVOID CRASH': 'Driver Error',
    'IMPROPER BACKING': 'Driver Error',
    'IMPROPER OVERTAKING/PASSING': 'Driver Error',
    'IMPROPER TURNING/NO SIGNAL': 'Driver Error',
    'DRIVING SKILLS/KNOWLEDGE/EXPERIENCE': 'Driver Error',
    'DISREGARDING TRAFFIC SIGNALS': 'Driver Error',
    'OPERATING VEHICLE IN ERRATIC, RECKLESS, CARELESS, NEGLIGENT OR AGGRESSIVE MANNER': 'Driver Error',
    'TEXTING': 'Driver Error',
    'DISTRACTION - FROM INSIDE VEHICLE': 'Driver Error',
    'DISTRACTION - FROM OUTSIDE VEHICLE': 'Driver Error',
    'PHYSICAL CONDITION OF DRIVER': 'Driver Error',
    
    # EXTERNAL FACTORS
    'WEATHER': 'External Factors',
    'ROAD ENGINEERING/SURFACE/MARKING DEFECTS': 'External Factors',
    'VISION OBSCURED (SIGNS, TREE LIMBS, BUILDINGS, ETC.)': 'External Factors',
    'ANIMAL': 'External Factors',
    
    # VEHICLE DEFECTS
    'EQUIPMENT - VEHICLE CONDITION': 'Vehicle Defect',
    'BRAKESLESS/FAILURE': 'Vehicle Defect',
    
    # UNKNOWN (Usually the biggest or second biggest)
    'UNABLE TO DETERMINE': 'Unknown',
    'NOT APPLICABLE': 'Unknown'
}

# 1. Apply the mapping
# If a cause is NOT in the dictionary, we default it to 'Other'
df['Crash_Cause'] = df['prim_contributory_cause'].map(cause_mapping).fillna('Other')

# 2. Check the new counts
print(df['Crash_Cause'].value_counts())

Driver Error        464480
Unknown             456049
Other                72885
External Factors     24364
Vehicle Defect        6251
Name: Crash_Cause, dtype: int64
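
Because every cause missing from the dictionary silently becomes 'Other', it can help to list which raw labels fall through. For example, `IMPROPER LANE USAGE`, which appears in the column preview earlier, has no key in the mapping. The sketch below uses a shortened stand-in for the mapping and toy data; `unmapped` is a hypothetical name.

```python
import pandas as pd

# Shortened stand-in for the notebook's mapping dictionary.
cause_mapping = {
    "FOLLOWING TOO CLOSELY": "Driver Error",
    "WEATHER": "External Factors",
    "UNABLE TO DETERMINE": "Unknown",
}

causes = pd.Series([
    "FOLLOWING TOO CLOSELY", "IMPROPER LANE USAGE", "WEATHER",
    "IMPROPER LANE USAGE", "UNABLE TO DETERMINE",
], name="prim_contributory_cause")

# Raw labels with no entry in the mapping, with their frequencies.
unmapped = causes[~causes.isin(list(cause_mapping))].value_counts()
print(unmapped)
```

Reviewing this list makes it easier to decide whether a label belongs in an existing bucket or is correctly left in 'Other'.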


## Define X and y

In [24]:
X = df.drop(['Crash_Cause', 'prim_contributory_cause'], axis=1)
y = df['Crash_Cause']

## Train-Test Split

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
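
Passing `stratify=y` asks the split to keep the class proportions of the target roughly equal in both parts, which matters when classes are as imbalanced as the cause buckets above. A minimal sketch on a toy target shows one way to check this; the data and the `shares` name are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 80 / 15 / 5 rows per class.
y = pd.Series(["Driver Error"] * 80 + ["Unknown"] * 15 + ["Vehicle Defect"] * 5)
X = pd.DataFrame({"row_id": range(len(y))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Side-by-side class shares; with stratify=y they should match closely.
shares = pd.DataFrame({
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
}).round(3)
print(shares)
```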

In [26]:
X_train

Unnamed: 0,posted_speed_limit,traffic_control_device,device_condition,weather_condition,lighting_condition,first_crash_type,trafficway_type,alignment,roadway_surface_cond,road_defect,...,injuries_fatal,injuries_incapacitating,injuries_non_incapacitating,injuries_reported_not_evident,injuries_no_indication,injuries_unknown,crash_hour,crash_day_of_week,crash_month,location
899405,40,NO CONTROLS,NO CONTROLS,CLEAR,DARKNESS,FIXED OBJECT,DIVIDED - W/MEDIAN BARRIER,"CURVE, LEVEL",DRY,NO DEFECTS,...,0.0,0.0,1.0,0.0,0.0,0.0,3,1,11,POINT (-87.618091911783 41.898389053094)
479809,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,REAR END,NOT DIVIDED,STRAIGHT AND LEVEL,UNKNOWN,UNKNOWN,...,0.0,0.0,0.0,0.0,10.0,0.0,13,1,9,POINT (-87.642384512979 41.940186722574)
553121,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,SNOW,"DARKNESS, LIGHTED ROAD",ANGLE,FOUR WAY,STRAIGHT AND LEVEL,SNOW OR SLUSH,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,22,3,1,POINT (-87.80634529093 41.930744417308)
992763,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,"DARKNESS, LIGHTED ROAD",REAR END,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,4.0,0.0,22,2,8,POINT (-87.70096006787 41.877305760362)
514753,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,PARKED MOTOR VEHICLE,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,UNKNOWN,...,0.0,0.0,0.0,0.0,1.0,0.0,13,7,6,POINT (-87.688304588055 41.953491697799)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
919601,30,OTHER,OTHER,CLEAR,DAYLIGHT,TURNING,DIVIDED - W/MEDIAN BARRIER,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,15,2,9,POINT (-87.660174752888 41.991780377892)
784211,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,PARKED MOTOR VEHICLE,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,UNKNOWN,...,0.0,0.0,0.0,0.0,1.0,0.0,15,5,11,POINT (-87.663815590987 41.907808782674)
673016,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,TURNING,DIVIDED - W/MEDIAN (NOT RAISED),STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,5,6,10,POINT (-87.704257223316 41.811600339039)
236334,40,NO CONTROLS,NO CONTROLS,CLEAR,DUSK,ANGLE,DIVIDED - W/MEDIAN BARRIER,STRAIGHT AND LEVEL,DRY,WORN SURFACE,...,0.0,0.0,0.0,0.0,2.0,0.0,16,5,12,POINT (-87.653110814421 41.985449532208)


In [27]:
X_test

Unnamed: 0,posted_speed_limit,traffic_control_device,device_condition,weather_condition,lighting_condition,first_crash_type,trafficway_type,alignment,roadway_surface_cond,road_defect,...,injuries_fatal,injuries_incapacitating,injuries_non_incapacitating,injuries_reported_not_evident,injuries_no_indication,injuries_unknown,crash_hour,crash_day_of_week,crash_month,location
426857,30,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",SIDESWIPE SAME DIRECTION,FOUR WAY,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,23,1,3,POINT (-87.766253945447 41.880333955527)
42008,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,SIDESWIPE SAME DIRECTION,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,16,6,9,POINT (-87.70508348897 41.798881170398)
39997,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,REAR END,FOUR WAY,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,9,6,9,POINT (-87.562908182042 41.76619520231)
620460,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,REAR END,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,15,6,5,POINT (-87.617719481874 41.758471711463)
731576,30,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",REAR END,PARKING LOT,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,4.0,0.0,22,6,4,POINT (-87.614709812077 41.721991872532)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
640257,20,NO CONTROLS,NO CONTROLS,UNKNOWN,UNKNOWN,PARKED MOTOR VEHICLE,ONE-WAY,STRAIGHT AND LEVEL,UNKNOWN,UNKNOWN,...,0.0,0.0,0.0,0.0,3.0,0.0,18,1,2,POINT (-87.631180891189 41.89093246868)
969443,30,NO CONTROLS,NO CONTROLS,CLEAR,DUSK,TURNING,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,9.0,0.0,15,4,1,POINT (-87.607532022114 41.767416946505)
304252,30,NO CONTROLS,NO CONTROLS,CLEAR,UNKNOWN,PARKED MOTOR VEHICLE,DIVIDED - W/MEDIAN BARRIER,STRAIGHT AND LEVEL,UNKNOWN,NO DEFECTS,...,0.0,0.0,0.0,0.0,1.0,0.0,11,7,5,POINT (-87.765519558311 41.880346236952)
971835,30,NO CONTROLS,NO CONTROLS,CLEAR,DAWN,PARKED MOTOR VEHICLE,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,1.0,0.0,0,2,12,POINT (-87.575890342787 41.753911617689)


## Preprocessing X_train

In [28]:
X_train_cat = X_train.select_dtypes(include= ['object', 'string']).copy()
X_train_cat.head()

Unnamed: 0,traffic_control_device,device_condition,weather_condition,lighting_condition,first_crash_type,trafficway_type,alignment,roadway_surface_cond,road_defect,report_type,crash_type,damage,date_police_notified,sec_contributory_cause,street_direction,street_name,most_severe_injury,location
899405,NO CONTROLS,NO CONTROLS,CLEAR,DARKNESS,FIXED OBJECT,DIVIDED - W/MEDIAN BARRIER,"CURVE, LEVEL",DRY,NO DEFECTS,ON SCENE,INJURY AND / OR TOW DUE TO CRASH,"OVER $1,500",11/19/2017 03:46:00 AM,NOT APPLICABLE,N,LAKE SHORE DR SB,NONINCAPACITATING INJURY,POINT (-87.618091911783 41.898389053094)
479809,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,REAR END,NOT DIVIDED,STRAIGHT AND LEVEL,UNKNOWN,UNKNOWN,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,"$501 - $1,500",09/20/2021 04:20:00 PM,UNABLE TO DETERMINE,W,BELMONT AVE,NO INDICATION OF INJURY,POINT (-87.642384512979 41.940186722574)
553121,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,SNOW,"DARKNESS, LIGHTED ROAD",ANGLE,FOUR WAY,STRAIGHT AND LEVEL,SNOW OR SLUSH,NO DEFECTS,ON SCENE,INJURY AND / OR TOW DUE TO CRASH,"OVER $1,500",01/19/2021 10:15:00 PM,NOT APPLICABLE,N,HARLEM AVE,NO INDICATION OF INJURY,POINT (-87.80634529093 41.930744417308)
992763,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,"DARKNESS, LIGHTED ROAD",REAR END,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,"$501 - $1,500",08/08/2016 10:40:00 PM,UNABLE TO DETERMINE,S,SACRAMENTO BLVD,NO INDICATION OF INJURY,POINT (-87.70096006787 41.877305760362)
514753,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,PARKED MOTOR VEHICLE,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,UNKNOWN,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,"OVER $1,500",06/06/2021 01:43:00 PM,UNABLE TO DETERMINE,N,WESTERN AVE,NO INDICATION OF INJURY,POINT (-87.688304588055 41.953491697799)


In [29]:
X_train_cat.shape


(819223, 18)

In [30]:
X_train_cat.nunique().sort_values(ascending=False)

date_police_notified      648600
location                  301590
street_name                 1626
sec_contributory_cause        40
trafficway_type               20
traffic_control_device        19
first_crash_type              18
weather_condition             12
device_condition               8
roadway_surface_cond           7
road_defect                    7
alignment                      6
lighting_condition             6
most_severe_injury             5
street_direction               4
damage                         3
report_type                    3
crash_type                     2
dtype: int64
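
With more than 300,000 distinct values, `location` is too high-cardinality to one-hot encode directly. Before discarding it, one option is to parse the coordinates out of the `POINT (longitude latitude)` text and bin them coarsely. This is a sketch of that alternative, not a step this notebook performs; the column names `lat_bin` and `lon_bin` and the two-decimal rounding are assumptions.

```python
import pandas as pd

location = pd.Series([
    "POINT (-87.618091911783 41.898389053094)",
    "POINT (-87.642384512979 41.940186722574)",
])

# WKT points are written as "POINT (longitude latitude)".
coords = location.str.extract(
    r"POINT \((?P<longitude>-?\d+(?:\.\d+)?) (?P<latitude>-?\d+(?:\.\d+)?)\)"
).astype(float)

# Coarse spatial bins (two decimals is an arbitrary choice for illustration).
coords["lat_bin"] = coords["latitude"].round(2)
coords["lon_bin"] = coords["longitude"].round(2)
print(coords)
```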

In [31]:
X_train_cat = X_train_cat.drop(columns=['date_police_notified','location'], errors='ignore')
X_train_cat.head()

Unnamed: 0,traffic_control_device,device_condition,weather_condition,lighting_condition,first_crash_type,trafficway_type,alignment,roadway_surface_cond,road_defect,report_type,crash_type,damage,sec_contributory_cause,street_direction,street_name,most_severe_injury
899405,NO CONTROLS,NO CONTROLS,CLEAR,DARKNESS,FIXED OBJECT,DIVIDED - W/MEDIAN BARRIER,"CURVE, LEVEL",DRY,NO DEFECTS,ON SCENE,INJURY AND / OR TOW DUE TO CRASH,"OVER $1,500",NOT APPLICABLE,N,LAKE SHORE DR SB,NONINCAPACITATING INJURY
479809,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,REAR END,NOT DIVIDED,STRAIGHT AND LEVEL,UNKNOWN,UNKNOWN,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,"$501 - $1,500",UNABLE TO DETERMINE,W,BELMONT AVE,NO INDICATION OF INJURY
553121,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,SNOW,"DARKNESS, LIGHTED ROAD",ANGLE,FOUR WAY,STRAIGHT AND LEVEL,SNOW OR SLUSH,NO DEFECTS,ON SCENE,INJURY AND / OR TOW DUE TO CRASH,"OVER $1,500",NOT APPLICABLE,N,HARLEM AVE,NO INDICATION OF INJURY
992763,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,"DARKNESS, LIGHTED ROAD",REAR END,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,"$501 - $1,500",UNABLE TO DETERMINE,S,SACRAMENTO BLVD,NO INDICATION OF INJURY
514753,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,PARKED MOTOR VEHICLE,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,UNKNOWN,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,"OVER $1,500",UNABLE TO DETERMINE,N,WESTERN AVE,NO INDICATION OF INJURY


In [32]:
X_train_cat.isna().sum()

traffic_control_device    0
device_condition          0
weather_condition         0
lighting_condition        0
first_crash_type          0
trafficway_type           0
alignment                 0
roadway_surface_cond      0
road_defect               0
report_type               0
crash_type                0
damage                    0
sec_contributory_cause    0
street_direction          0
street_name               0
most_severe_injury        0
dtype: int64

In [33]:
X_train_cat.shape

(819223, 16)

In [34]:
# Keep only top 50 streets, rest as 'Other'
top_streets = X_train_cat['street_name'].value_counts().nlargest(50).index
X_train_cat['street_grouped'] = X_train_cat['street_name'].where(X_train_cat['street_name'].isin(top_streets), 'Other')
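
The list of frequent streets is learned from the training rows, so the same list should be reused when grouping the test rows rather than recomputed there. A minimal sketch with toy data follows; `group_streets` is a hypothetical helper, and the top 2 streets stand in for the top 50 used above.

```python
import pandas as pd

train_streets = pd.Series(
    ["WESTERN AVE", "WESTERN AVE", "PULASKI RD", "STATE ST", "PULASKI RD"]
)
test_streets = pd.Series(["WESTERN AVE", "ELM ST", "STATE ST"])

# Learn the frequent streets from the training rows only.
top_streets = train_streets.value_counts().nlargest(2).index


def group_streets(streets: pd.Series) -> pd.Series:
    """Keep streets found in `top_streets`; replace everything else with 'Other'."""
    return streets.where(streets.isin(top_streets), "Other")


print(group_streets(test_streets).tolist())
```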


In [35]:
X_train_cat.columns

Index(['traffic_control_device', 'device_condition', 'weather_condition',
       'lighting_condition', 'first_crash_type', 'trafficway_type',
       'alignment', 'roadway_surface_cond', 'road_defect', 'report_type',
       'crash_type', 'damage', 'sec_contributory_cause', 'street_direction',
       'street_name', 'most_severe_injury', 'street_grouped'],
      dtype='object')

In [36]:
X_train_cat = X_train_cat.drop(columns=["street_name"], errors="ignore")

In [37]:
X_train_cat.nunique().sort_values(ascending=False)

street_grouped            51
sec_contributory_cause    40
trafficway_type           20
traffic_control_device    19
first_crash_type          18
weather_condition         12
device_condition           8
road_defect                7
roadway_surface_cond       7
alignment                  6
lighting_condition         6
most_severe_injury         5
street_direction           4
damage                     3
report_type                3
crash_type                 2
dtype: int64

One-hot encoding

In [38]:
# `sparse` was renamed to `sparse_output` in scikit-learn 1.2 and removed in 1.4
ohe = OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore')

In [39]:
X_train_encoded = ohe.fit_transform(X_train_cat)




In [40]:
X_train_encoded

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [41]:
X_train_encoded = pd.DataFrame(
    X_train_encoded, 
    columns=ohe.get_feature_names_out(X_train_cat.columns),
    index=X_train.index
)


In [42]:
X_train_encoded.head()

Unnamed: 0,traffic_control_device_DELINEATORS,traffic_control_device_FLASHING CONTROL SIGNAL,traffic_control_device_LANE USE MARKING,traffic_control_device_NO CONTROLS,traffic_control_device_NO PASSING,traffic_control_device_OTHER,traffic_control_device_OTHER RAILROAD CROSSING,traffic_control_device_OTHER REG. SIGN,traffic_control_device_OTHER WARNING SIGN,traffic_control_device_PEDESTRIAN CROSSING SIGN,...,street_grouped_MONTROSE AVE,street_grouped_NORTH AVE,street_grouped_OGDEN AVE,street_grouped_Other,street_grouped_PULASKI RD,street_grouped_ROOSEVELT RD,street_grouped_SHERIDAN RD,street_grouped_STATE ST,street_grouped_STONY ISLAND AVE,street_grouped_WESTERN AVE
899405,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
479809,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
553121,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
992763,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
514753,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


Numeric columns

In [43]:
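# Note: the output below contains only float columns; columns that appear to hold integers
# (posted_speed_limit, crash_hour, crash_day_of_week, crash_month) are not selected here.
# include="number" would select every numeric dtype.
numerical_cols = df.select_dtypes(include=["int", "float"]).copy()
numerical_cols.head()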

Unnamed: 0,beat_of_occurrence,injuries_total,injuries_fatal,injuries_incapacitating,injuries_non_incapacitating,injuries_reported_not_evident,injuries_no_indication,injuries_unknown
0,431.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,814.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0
2,222.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0
3,1134.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0
4,1135.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0
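
If the integer-valued columns are meant to be part of the numeric feature set, selecting on the generic `"number"` dtype is one way to pick them up regardless of integer width. This is a small sketch on a toy frame; the frame `toy` and its values are assumptions made for illustration, and whether these columns should be modeled as numeric at all is a separate decision.

```python
import pandas as pd

# Toy frame with one int64 column, one float64 column, and one text column
toy = pd.DataFrame(
    {
        "posted_speed_limit": pd.Series([30, 20, 30], dtype="int64"),
        "injuries_total": pd.Series([0.0, 1.0, 0.0], dtype="float64"),
        "weather_condition": ["CLEAR", "SNOW", "CLEAR"],
    }
)

# "number" matches all integer and float dtypes
numeric_names = toy.select_dtypes(include="number").columns.tolist()
print(numeric_names)
```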


In [44]:
X_train_num = X_train[numerical_cols.columns].copy()
X_train_num

Unnamed: 0,beat_of_occurrence,injuries_total,injuries_fatal,injuries_incapacitating,injuries_non_incapacitating,injuries_reported_not_evident,injuries_no_indication,injuries_unknown
899405,1833.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
479809,1925.0,0.0,0.0,0.0,0.0,0.0,10.0,0.0
553121,2512.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0
992763,1124.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
514753,1921.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...
919601,2433.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0
784211,1433.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
673016,821.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0
236334,2022.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0


In [45]:
X_train_num.shape

(819223, 8)

Scaling numeric columns

In [46]:
from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_num)
X_train_scaled

array([[ 8.29250829e-01,  1.39068515e+00, -3.09646156e-02, ...,
        -2.01962075e-01, -1.73180673e+00,  0.00000000e+00],
       [ 9.59841970e-01, -3.42921568e-01, -3.09646156e-02, ...,
        -2.01962075e-01,  6.94207650e+00,  0.00000000e+00],
       [ 1.79307022e+00, -3.42921568e-01, -3.09646156e-02, ...,
        -2.01962075e-01,  2.96991692e-03,  0.00000000e+00],
       ...,
       [-6.07251712e-01, -3.42921568e-01, -3.09646156e-02, ...,
        -2.01962075e-01,  2.96991692e-03,  0.00000000e+00],
       [ 1.09753045e+00, -3.42921568e-01, -3.09646156e-02, ...,
        -2.01962075e-01,  2.96991692e-03,  0.00000000e+00],
       [ 2.31654417e-01, -3.42921568e-01, -3.09646156e-02, ...,
        -2.01962075e-01,  2.96991692e-03,  0.00000000e+00]])

In [47]:
X_train_scaled = pd.DataFrame(
    X_train_scaled, 
    columns= X_train_num.columns,
    index=X_train.index
)
X_train_scaled

Unnamed: 0,beat_of_occurrence,injuries_total,injuries_fatal,injuries_incapacitating,injuries_non_incapacitating,injuries_reported_not_evident,injuries_no_indication,injuries_unknown
899405,0.829251,1.390685,-0.030965,-0.11818,2.096091,-0.201962,-1.731807,0.0
479809,0.959842,-0.342922,-0.030965,-0.11818,-0.257890,-0.201962,6.942077,0.0
553121,1.793070,-0.342922,-0.030965,-0.11818,-0.257890,-0.201962,0.002970,0.0
992763,-0.177153,-0.342922,-0.030965,-0.11818,-0.257890,-0.201962,1.737747,0.0
514753,0.954164,-0.342922,-0.030965,-0.11818,-0.257890,-0.201962,-0.864418,0.0
...,...,...,...,...,...,...,...,...
919601,1.680932,-0.342922,-0.030965,-0.11818,-0.257890,-0.201962,0.002970,0.0
784211,0.261463,-0.342922,-0.030965,-0.11818,-0.257890,-0.201962,-0.864418,0.0
673016,-0.607252,-0.342922,-0.030965,-0.11818,-0.257890,-0.201962,0.002970,0.0
236334,1.097530,-0.342922,-0.030965,-0.11818,-0.257890,-0.201962,0.002970,0.0


In [48]:
# join the two dataframes to make one large dataframe with numeric and categorical variables
X_train_full = pd.concat([X_train_scaled, X_train_encoded], axis=1)
X_train_full

Unnamed: 0,beat_of_occurrence,injuries_total,injuries_fatal,injuries_incapacitating,injuries_non_incapacitating,injuries_reported_not_evident,injuries_no_indication,injuries_unknown,traffic_control_device_DELINEATORS,traffic_control_device_FLASHING CONTROL SIGNAL,...,street_grouped_MONTROSE AVE,street_grouped_NORTH AVE,street_grouped_OGDEN AVE,street_grouped_Other,street_grouped_PULASKI RD,street_grouped_ROOSEVELT RD,street_grouped_SHERIDAN RD,street_grouped_STATE ST,street_grouped_STONY ISLAND AVE,street_grouped_WESTERN AVE
899405,0.829251,1.390685,-0.030965,-0.11818,2.096091,-0.201962,-1.731807,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
479809,0.959842,-0.342922,-0.030965,-0.11818,-0.257890,-0.201962,6.942077,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
553121,1.793070,-0.342922,-0.030965,-0.11818,-0.257890,-0.201962,0.002970,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
992763,-0.177153,-0.342922,-0.030965,-0.11818,-0.257890,-0.201962,1.737747,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
514753,0.954164,-0.342922,-0.030965,-0.11818,-0.257890,-0.201962,-0.864418,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
919601,1.680932,-0.342922,-0.030965,-0.11818,-0.257890,-0.201962,0.002970,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
784211,0.261463,-0.342922,-0.030965,-0.11818,-0.257890,-0.201962,-0.864418,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
673016,-0.607252,-0.342922,-0.030965,-0.11818,-0.257890,-0.201962,0.002970,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
236334,1.097530,-0.342922,-0.030965,-0.11818,-0.257890,-0.201962,0.002970,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [49]:
X_train_full.shape

(819223, 203)

Preprocessing X_test

In [50]:
X_test

Unnamed: 0,posted_speed_limit,traffic_control_device,device_condition,weather_condition,lighting_condition,first_crash_type,trafficway_type,alignment,roadway_surface_cond,road_defect,...,injuries_fatal,injuries_incapacitating,injuries_non_incapacitating,injuries_reported_not_evident,injuries_no_indication,injuries_unknown,crash_hour,crash_day_of_week,crash_month,location
426857,30,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",SIDESWIPE SAME DIRECTION,FOUR WAY,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,23,1,3,POINT (-87.766253945447 41.880333955527)
42008,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,SIDESWIPE SAME DIRECTION,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,16,6,9,POINT (-87.70508348897 41.798881170398)
39997,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,REAR END,FOUR WAY,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,9,6,9,POINT (-87.562908182042 41.76619520231)
620460,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,REAR END,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,15,6,5,POINT (-87.617719481874 41.758471711463)
731576,30,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",REAR END,PARKING LOT,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,4.0,0.0,22,6,4,POINT (-87.614709812077 41.721991872532)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
640257,20,NO CONTROLS,NO CONTROLS,UNKNOWN,UNKNOWN,PARKED MOTOR VEHICLE,ONE-WAY,STRAIGHT AND LEVEL,UNKNOWN,UNKNOWN,...,0.0,0.0,0.0,0.0,3.0,0.0,18,1,2,POINT (-87.631180891189 41.89093246868)
969443,30,NO CONTROLS,NO CONTROLS,CLEAR,DUSK,TURNING,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,9.0,0.0,15,4,1,POINT (-87.607532022114 41.767416946505)
304252,30,NO CONTROLS,NO CONTROLS,CLEAR,UNKNOWN,PARKED MOTOR VEHICLE,DIVIDED - W/MEDIAN BARRIER,STRAIGHT AND LEVEL,UNKNOWN,NO DEFECTS,...,0.0,0.0,0.0,0.0,1.0,0.0,11,7,5,POINT (-87.765519558311 41.880346236952)
971835,30,NO CONTROLS,NO CONTROLS,CLEAR,DAWN,PARKED MOTOR VEHICLE,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,1.0,0.0,0,2,12,POINT (-87.575890342787 41.753911617689)


In [51]:
X_test_cat = X_test.select_dtypes(include=["object", "string"]).copy()
X_test_cat.head()

Unnamed: 0,traffic_control_device,device_condition,weather_condition,lighting_condition,first_crash_type,trafficway_type,alignment,roadway_surface_cond,road_defect,report_type,crash_type,damage,date_police_notified,sec_contributory_cause,street_direction,street_name,most_severe_injury,location
426857,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",SIDESWIPE SAME DIRECTION,FOUR WAY,STRAIGHT AND LEVEL,DRY,NO DEFECTS,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,"$501 - $1,500",03/21/2022 01:50:00 PM,UNABLE TO DETERMINE,W,MADISON ST,NO INDICATION OF INJURY,POINT (-87.766253945447 41.880333955527)
42008,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,SIDESWIPE SAME DIRECTION,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,"OVER $1,500",09/05/2025 05:00:00 PM,NOT APPLICABLE,S,SAWYER AVE,NO INDICATION OF INJURY,POINT (-87.70508348897 41.798881170398)
39997,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,REAR END,FOUR WAY,STRAIGHT AND LEVEL,DRY,NO DEFECTS,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,"OVER $1,500",09/12/2025 10:00:00 AM,UNABLE TO DETERMINE,S,SOUTH SHORE DR,NO INDICATION OF INJURY,POINT (-87.562908182042 41.76619520231)
620460,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,REAR END,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,"OVER $1,500",05/01/2020 04:10:00 PM,UNABLE TO DETERMINE,E,75TH ST,NO INDICATION OF INJURY,POINT (-87.617719481874 41.758471711463)
731576,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",REAR END,PARKING LOT,STRAIGHT AND LEVEL,DRY,NO DEFECTS,ON SCENE,NO INJURY / DRIVE AWAY,$500 OR LESS,04/26/2019 10:25:00 PM,NOT APPLICABLE,E,95TH ST,NO INDICATION OF INJURY,POINT (-87.614709812077 41.721991872532)


In [52]:
X_test_cat.shape

(204806, 18)

In [53]:
X_test_cat.nunique().sort_values(ascending=False)

date_police_notified      191126
location                  115750
street_name                 1471
sec_contributory_cause        40
trafficway_type               20
traffic_control_device        19
first_crash_type              18
weather_condition             12
device_condition               8
roadway_surface_cond           7
road_defect                    7
alignment                      6
lighting_condition             6
most_severe_injury             5
street_direction               4
damage                         3
report_type                    3
crash_type                     2
dtype: int64

In [54]:
X_test_cat = X_test_cat.drop(columns=['date_police_notified','location'], errors='ignore')
X_test_cat.head()

Unnamed: 0,traffic_control_device,device_condition,weather_condition,lighting_condition,first_crash_type,trafficway_type,alignment,roadway_surface_cond,road_defect,report_type,crash_type,damage,sec_contributory_cause,street_direction,street_name,most_severe_injury
426857,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",SIDESWIPE SAME DIRECTION,FOUR WAY,STRAIGHT AND LEVEL,DRY,NO DEFECTS,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,"$501 - $1,500",UNABLE TO DETERMINE,W,MADISON ST,NO INDICATION OF INJURY
42008,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,SIDESWIPE SAME DIRECTION,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,"OVER $1,500",NOT APPLICABLE,S,SAWYER AVE,NO INDICATION OF INJURY
39997,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,REAR END,FOUR WAY,STRAIGHT AND LEVEL,DRY,NO DEFECTS,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,"OVER $1,500",UNABLE TO DETERMINE,S,SOUTH SHORE DR,NO INDICATION OF INJURY
620460,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,REAR END,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,"OVER $1,500",UNABLE TO DETERMINE,E,75TH ST,NO INDICATION OF INJURY
731576,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",REAR END,PARKING LOT,STRAIGHT AND LEVEL,DRY,NO DEFECTS,ON SCENE,NO INJURY / DRIVE AWAY,$500 OR LESS,NOT APPLICABLE,E,95TH ST,NO INDICATION OF INJURY


In [55]:
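# Keep only top 50 streets, rest as 'Other'
# Caution: this recomputes the top-50 list from the test set, so the grouped categories
# can differ from the training ones; reusing the training top_streets keeps them aligned.
top_streets = X_test_cat['street_name'].value_counts().nlargest(50).index
X_test_cat['street_grouped'] = X_test_cat['street_name'].where(X_test_cat['street_name'].isin(top_streets), 'Other')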


In [56]:
X_test_cat.columns

Index(['traffic_control_device', 'device_condition', 'weather_condition',
       'lighting_condition', 'first_crash_type', 'trafficway_type',
       'alignment', 'roadway_surface_cond', 'road_defect', 'report_type',
       'crash_type', 'damage', 'sec_contributory_cause', 'street_direction',
       'street_name', 'most_severe_injury', 'street_grouped'],
      dtype='object')

In [57]:
X_test_cat = X_test_cat.drop(columns=["street_name"], errors="ignore")

In [58]:
X_test_cat.nunique().sort_values(ascending=False)

street_grouped            51
sec_contributory_cause    40
trafficway_type           20
traffic_control_device    19
first_crash_type          18
weather_condition         12
device_condition           8
road_defect                7
roadway_surface_cond       7
alignment                  6
lighting_condition         6
most_severe_injury         5
street_direction           4
damage                     3
report_type                3
crash_type                 2
dtype: int64

One-hot encoding

In [59]:
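# Caution: fit_transform refits the encoder on the test set, so the columns can differ from training (e.g. street_grouped_WENTWORTH AVE below); ohe.transform(X_test_cat) would reuse the training categories.
X_test_encoded = ohe.fit_transform(X_test_cat)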



In [60]:
X_test_encoded

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [61]:
X_test_encoded = pd.DataFrame(
    X_test_encoded, 
    columns=ohe.get_feature_names_out(X_test_cat.columns),
    index=X_test.index
)
X_test_encoded

Unnamed: 0,traffic_control_device_DELINEATORS,traffic_control_device_FLASHING CONTROL SIGNAL,traffic_control_device_LANE USE MARKING,traffic_control_device_NO CONTROLS,traffic_control_device_NO PASSING,traffic_control_device_OTHER,traffic_control_device_OTHER RAILROAD CROSSING,traffic_control_device_OTHER REG. SIGN,traffic_control_device_OTHER WARNING SIGN,traffic_control_device_PEDESTRIAN CROSSING SIGN,...,street_grouped_NORTH AVE,street_grouped_OGDEN AVE,street_grouped_Other,street_grouped_PULASKI RD,street_grouped_ROOSEVELT RD,street_grouped_SHERIDAN RD,street_grouped_STATE ST,street_grouped_STONY ISLAND AVE,street_grouped_WENTWORTH AVE,street_grouped_WESTERN AVE
426857,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
42008,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
39997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
620460,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
731576,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
640257,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
969443,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
304252,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
971835,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Numeric columns

In [62]:
X_test_num = X_test[numerical_cols.columns].copy()
X_test_num

Unnamed: 0,beat_of_occurrence,injuries_total,injuries_fatal,injuries_incapacitating,injuries_non_incapacitating,injuries_reported_not_evident,injuries_no_indication,injuries_unknown
426857,1513.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0
42008,822.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0
39997,334.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0
620460,323.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0
731576,633.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
...,...,...,...,...,...,...,...,...
640257,1831.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0
969443,321.0,0.0,0.0,0.0,0.0,0.0,9.0,0.0
304252,1513.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
971835,414.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [63]:
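# Caution: this creates a new scaler and fits it on the test set, so the test features
# are standardized with test-set statistics instead of the training mean and standard
# deviation. Reusing the scaler fitted on X_train_num and calling
# scaler.transform(X_test_num) would apply the training statistics.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_test_scaled = scaler.fit_transform(X_test_num)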
X_test_scaled

array([[ 0.37793243, -0.34062459, -0.03206598, ..., -0.20276126,
         0.00441706,  0.        ],
       [-0.60242974, -0.34062459, -0.03206598, ..., -0.20276126,
         0.00441706,  0.        ],
       [-1.29478393, -0.34062459, -0.03206598, ..., -0.20276126,
         0.00441706,  0.        ],
       ...,
       [ 0.37793243, -0.34062459, -0.03206598, ..., -0.20276126,
        -0.8671054 ,  0.        ],
       [-1.18128324, -0.34062459, -0.03206598, ..., -0.20276126,
        -0.8671054 ,  0.        ],
       [ 0.26443175, -0.34062459, -0.03206598, ..., -0.20276126,
         0.00441706,  0.        ]])

In [64]:
X_test_scaled = pd.DataFrame(
    X_test_scaled, 
    columns= X_test_num.columns,
    index=X_test.index
)
X_test_scaled

Unnamed: 0,beat_of_occurrence,injuries_total,injuries_fatal,injuries_incapacitating,injuries_non_incapacitating,injuries_reported_not_evident,injuries_no_indication,injuries_unknown
426857,0.377932,-0.340625,-0.032066,-0.118477,-0.254285,-0.202761,0.004417,0.0
42008,-0.602430,-0.340625,-0.032066,-0.118477,-0.254285,-0.202761,0.004417,0.0
39997,-1.294784,-0.340625,-0.032066,-0.118477,-0.254285,-0.202761,0.004417,0.0
620460,-1.310390,-0.340625,-0.032066,-0.118477,-0.254285,-0.202761,0.004417,0.0
731576,-0.870575,-0.340625,-0.032066,-0.118477,-0.254285,-0.202761,1.747462,0.0
...,...,...,...,...,...,...,...,...
640257,0.829098,-0.340625,-0.032066,-0.118477,-0.254285,-0.202761,0.875940,0.0
969443,-1.313228,-0.340625,-0.032066,-0.118477,-0.254285,-0.202761,6.105074,0.0
304252,0.377932,-0.340625,-0.032066,-0.118477,-0.254285,-0.202761,-0.867105,0.0
971835,-1.181283,-0.340625,-0.032066,-0.118477,-0.254285,-0.202761,-0.867105,0.0
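
Standardization is normally fitted on the training split and then applied to the test split, so that both are expressed relative to the same mean and standard deviation. A minimal sketch on toy data follows; `train_num`, `test_num`, and `train_scaler` are illustrative names and values, not the notebook's variables.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

train_num = pd.DataFrame({"injuries_no_indication": [1.0, 2.0, 3.0]})
test_num = pd.DataFrame({"injuries_no_indication": [2.0, 4.0]})

# Learn mean and standard deviation from the training split only
train_scaler = StandardScaler().fit(train_num)

# Apply the training statistics to the test split
test_scaled = pd.DataFrame(
    train_scaler.transform(test_num),
    columns=test_num.columns,
    index=test_num.index,
)

print(train_scaler.mean_, train_scaler.scale_)
print(test_scaled)
```

A test value equal to the training mean (2.0 here) maps to 0, which would not hold in general if the scaler were refitted on the test data.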


In [65]:
# join the two dataframes to make one large dataframe with numeric and categorical variables
X_test_full = pd.concat([X_test_scaled, X_test_encoded], axis=1)
X_test_full

Unnamed: 0,beat_of_occurrence,injuries_total,injuries_fatal,injuries_incapacitating,injuries_non_incapacitating,injuries_reported_not_evident,injuries_no_indication,injuries_unknown,traffic_control_device_DELINEATORS,traffic_control_device_FLASHING CONTROL SIGNAL,...,street_grouped_NORTH AVE,street_grouped_OGDEN AVE,street_grouped_Other,street_grouped_PULASKI RD,street_grouped_ROOSEVELT RD,street_grouped_SHERIDAN RD,street_grouped_STATE ST,street_grouped_STONY ISLAND AVE,street_grouped_WENTWORTH AVE,street_grouped_WESTERN AVE
426857,0.377932,-0.340625,-0.032066,-0.118477,-0.254285,-0.202761,0.004417,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
42008,-0.602430,-0.340625,-0.032066,-0.118477,-0.254285,-0.202761,0.004417,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
39997,-1.294784,-0.340625,-0.032066,-0.118477,-0.254285,-0.202761,0.004417,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
620460,-1.310390,-0.340625,-0.032066,-0.118477,-0.254285,-0.202761,0.004417,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
731576,-0.870575,-0.340625,-0.032066,-0.118477,-0.254285,-0.202761,1.747462,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
640257,0.829098,-0.340625,-0.032066,-0.118477,-0.254285,-0.202761,0.875940,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
969443,-1.313228,-0.340625,-0.032066,-0.118477,-0.254285,-0.202761,6.105074,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
304252,0.377932,-0.340625,-0.032066,-0.118477,-0.254285,-0.202761,-0.867105,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
971835,-1.181283,-0.340625,-0.032066,-0.118477,-0.254285,-0.202761,-0.867105,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [66]:
from sklearn.preprocessing import LabelEncoder

# Instantiate
le = LabelEncoder()

# Encode the grouped Crash_Cause column as integer labels
df['Target_Encoded'] = le.fit_transform(df['Crash_Cause'])

# Use the encoded labels as the target variable
y = df['Target_Encoded']

print("Encoding complete. Classes are:", le.classes_)

Encoding complete. Classes are: ['Driver Error' 'External Factors' 'Other' 'Unknown' 'Vehicle Defect']
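
`LabelEncoder` assigns integer codes in the sorted order of the class names, so the printed `classes_` array doubles as the lookup table from code to label. The sketch below rebuilds that mapping on a toy list using the five class names printed above; the list `causes` and the variable names are assumptions for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Toy target values using the class names from the output above
causes = [
    "Driver Error",
    "Unknown",
    "Other",
    "Driver Error",
    "Vehicle Defect",
    "External Factors",
]

encoder = LabelEncoder()
codes = encoder.fit_transform(causes)

# Position in classes_ is the integer code assigned to each label
mapping = {label: code for code, label in enumerate(encoder.classes_)}

# inverse_transform turns predicted codes back into readable labels
decoded = encoder.inverse_transform(codes)

print(mapping)
print(codes.tolist())
```

Keeping the fitted encoder around makes it possible to report model predictions with the original class names rather than integer codes.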
