# A Data-Driven and Interpretable Approach to Traffic Crash Severity Prediction 

# Business Understanding
## Business Overview

Road traffic crashes continue to pose a significant public safety and economic challenge, particularly when they result in serious injuries or fatalities. Transportation authorities and public safety agencies are responsible for making critical decisions about infrastructure investments, traffic enforcement, and emergency response planning, often under constrained budgets and limited resources. To be effective, these decisions must be supported by data-driven insights that are not only accurate but also transparent and explainable.

Conventional traffic safety analyses typically focus on historical trends and aggregated statistics, which may fail to capture the complex interactions between roadway conditions, environmental factors, driver behavior, and temporal patterns that contribute to severe crash outcomes. While advanced machine learning models can improve predictive performance, their limited interpretability can reduce trust and limit their usefulness in policy and operational contexts where accountability and justification are required.

This project addresses these challenges by applying interpretable machine learning models to traffic crash data in order to predict crash severity and explain the key factors associated with injury- and fatal-level crashes. By integrating both interpretable (white-box) models and high-performing (black-box) models enhanced with explainability techniques, the analysis identifies how factors such as speed limits, lighting conditions, roadway characteristics, hit-and-run involvement, and time of day influence crash severity.

The insights produced by this project are intended to directly support decision-making related to roadway safety improvements, targeted enforcement strategies, and resource prioritization. By translating model outputs into clear, actionable explanations, this analysis enables stakeholders to move beyond reactive responses and toward proactive, evidence-based interventions aimed at reducing the frequency and severity of traffic-related injuries and fatalities.

## Problem Statement
Traffic safety stakeholders face ongoing challenges in reducing the severity of road traffic crashes while operating under limited resources and increasing system complexity. Although large volumes of traffic crash data are available, decision-makers often lack clear, explainable insights into the factors that most strongly contribute to injury- and fatal-level crashes. Existing analyses frequently focus on descriptive statistics or predictive accuracy alone, which limits their usefulness for guiding infrastructure investments, enforcement strategies, and policy interventions.

The core business problem addressed in this project is the absence of transparent, interpretable models that can both predict crash severity and clearly explain why certain crashes are more likely to result in severe outcomes. Without interpretable insights, stakeholders risk making decisions that are difficult to justify, inefficiently targeted, or misaligned with real-world risk factors. There is a need for a data-driven approach that balances predictive performance with explainability to support accountable and evidence-based traffic safety decisions.

## Business Objectives
The primary objective of this project is to support traffic safety decision-making by developing an interpretable machine learning framework that predicts crash severity and identifies the key factors driving severe crash outcomes.

### Specific Objectives
1. Predict Crash Severity
Develop machine learning models to classify traffic crashes based on severity, with a focus on identifying crashes that result in injuries or fatalities.

2. Ensure Model Interpretability
Apply interpretable modeling techniques to explain both global and individual-level predictions, enabling stakeholders to understand how specific features influence crash severity.

3. Identify High-Impact Risk Factors
Determine the most influential factors associated with severe crashes, including roadway conditions, environmental factors, driver-related behaviors, and temporal patterns.

4. Compare Interpretability and Performance Trade-offs
Evaluate the performance of interpretable (white-box) models against more complex (black-box) models to assess the balance between accuracy and explainability.

5. Support Actionable Safety Interventions
Translate model insights into practical recommendations that can inform infrastructure improvements, targeted enforcement strategies, and resource allocation.

6. Promote Transparent and Accountable Decision-Making
Provide clear, defensible explanations that can be communicated to non-technical stakeholders, supporting responsible and evidence-based traffic safety policies.

Data Preprocessing 

In [64]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import classification_report

In [65]:
df1 = pd.read_csv("Traffic_Crashes.csv")
df1.head()

  df1 = pd.read_csv("Traffic_Crashes.csv")


Unnamed: 0,CRASH_RECORD_ID,CRASH_DATE_EST_I,CRASH_DATE,POSTED_SPEED_LIMIT,TRAFFIC_CONTROL_DEVICE,DEVICE_CONDITION,WEATHER_CONDITION,LIGHTING_CONDITION,FIRST_CRASH_TYPE,TRAFFICWAY_TYPE,...,INJURIES_NON_INCAPACITATING,INJURIES_REPORTED_NOT_EVIDENT,INJURIES_NO_INDICATION,INJURIES_UNKNOWN,CRASH_HOUR,CRASH_DAY_OF_WEEK,CRASH_MONTH,LATITUDE,LONGITUDE,LOCATION
0,97f1975e8f3e9a1b53ae1abfb6982a374074d8649d9e97...,,01/28/2026 10:56:00 PM,30,NO CONTROLS,NO CONTROLS,SNOW,"DARKNESS, LIGHTED ROAD",FIXED OBJECT,NOT DIVIDED,...,0.0,0.0,1.0,0.0,22,4,1,41.713829,-87.551093,POINT (-87.551093105845 41.713829100033)
1,1a00190102664f10ee5c2ee8767d45c331991692f12dfc...,,01/28/2026 10:25:00 PM,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,"DARKNESS, LIGHTED ROAD",REAR END,NOT DIVIDED,...,0.0,0.0,2.0,0.0,22,4,1,41.796711,-87.755202,POINT (-87.755202215729 41.796710893317)
2,a4fc7133c8193ec53288a9acec055321dee47515621012...,Y,01/28/2026 10:10:00 PM,30,OTHER,OTHER,OTHER,UNKNOWN,PARKED MOTOR VEHICLE,OTHER,...,0.0,0.0,2.0,0.0,22,4,1,41.813005,-87.603823,POINT (-87.603822899265 41.813004951227)
3,e79f2db27a528710d42b2eb1991876b7a9bf029aee3685...,,01/28/2026 10:10:00 PM,30,STOP SIGN/FLASHER,FUNCTIONING PROPERLY,CLEAR,"DARKNESS, LIGHTED ROAD",REAR END,FOUR WAY,...,0.0,0.0,3.0,0.0,22,4,1,41.868335,-87.705668,POINT (-87.705668192505 41.868335288795)
4,48040347f534c316e38421a60b65ab7017ae47cb4a0c3c...,,01/28/2026 10:05:00 PM,30,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",PARKED MOTOR VEHICLE,NOT DIVIDED,...,0.0,0.0,2.0,0.0,22,4,1,41.866618,-87.696128,POINT (-87.696128029764 41.866617682133)


In [66]:
df_cleaned = pd.read_csv("cleaned_traffic_crashes.csv")

In [67]:
df_cleaned = pd.read_csv("cleaned_traffic_crashes.csv", low_memory=False)
df_cleaned.head()

Unnamed: 0,posted_speed_limit,traffic_control_device,device_condition,weather_condition,lighting_condition,first_crash_type,trafficway_type,alignment,roadway_surface_cond,road_defect,...,injuries_fatal,injuries_incapacitating,injuries_non_incapacitating,injuries_reported_not_evident,injuries_no_indication,injuries_unknown,crash_hour,crash_day_of_week,crash_month,location
0,30,NO CONTROLS,NO CONTROLS,SNOW,"DARKNESS, LIGHTED ROAD",FIXED OBJECT,NOT DIVIDED,STRAIGHT AND LEVEL,SNOW OR SLUSH,NO DEFECTS,...,0.0,0.0,0.0,0.0,1.0,0.0,22,4,1,POINT (-87.551093105845 41.713829100033)
1,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,"DARKNESS, LIGHTED ROAD",REAR END,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,22,4,1,POINT (-87.755202215729 41.796710893317)
2,30,OTHER,OTHER,OTHER,UNKNOWN,PARKED MOTOR VEHICLE,OTHER,STRAIGHT AND LEVEL,OTHER,UNKNOWN,...,0.0,0.0,0.0,0.0,2.0,0.0,22,4,1,POINT (-87.603822899265 41.813004951227)
3,30,STOP SIGN/FLASHER,FUNCTIONING PROPERLY,CLEAR,"DARKNESS, LIGHTED ROAD",REAR END,FOUR WAY,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,3.0,0.0,22,4,1,POINT (-87.705668192505 41.868335288795)
4,30,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",PARKED MOTOR VEHICLE,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,22,4,1,POINT (-87.696128029764 41.866617682133)


In [103]:
df_cleaned['prim_contributory_cause']

0                    UNABLE TO DETERMINE
1                    UNABLE TO DETERMINE
2                    UNABLE TO DETERMINE
3            IMPROPER OVERTAKING/PASSING
4                    IMPROPER LANE USAGE
                       ...              
1024024              UNABLE TO DETERMINE
1024025    FAILING TO YIELD RIGHT-OF-WAY
1024026              UNABLE TO DETERMINE
1024027              UNABLE TO DETERMINE
1024028              IMPROPER LANE USAGE
Name: prim_contributory_cause, Length: 1024029, dtype: object

Data cleaning

In [68]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1024029 entries, 0 to 1024028
Data columns (total 33 columns):
 #   Column                         Non-Null Count    Dtype  
---  ------                         --------------    -----  
 0   posted_speed_limit             1024029 non-null  int64  
 1   traffic_control_device         1024029 non-null  object 
 2   device_condition               1024029 non-null  object 
 3   weather_condition              1024029 non-null  object 
 4   lighting_condition             1024029 non-null  object 
 5   first_crash_type               1024029 non-null  object 
 6   trafficway_type                1024029 non-null  object 
 7   alignment                      1024029 non-null  object 
 8   roadway_surface_cond           1024029 non-null  object 
 9   road_defect                    1024029 non-null  object 
 10  report_type                    1024029 non-null  object 
 11  crash_type                     1024029 non-null  object 
 12  damage        

In [69]:
df_cleaned.shape

(1024029, 33)

In [70]:
df_cleaned.isna().sum()

posted_speed_limit               0
traffic_control_device           0
device_condition                 0
weather_condition                0
lighting_condition               0
first_crash_type                 0
trafficway_type                  0
alignment                        0
roadway_surface_cond             0
road_defect                      0
report_type                      0
crash_type                       0
damage                           0
date_police_notified             0
prim_contributory_cause          0
sec_contributory_cause           0
street_no                        0
street_direction                 0
street_name                      0
beat_of_occurrence               0
num_units                        0
most_severe_injury               0
injuries_total                   0
injuries_fatal                   0
injuries_incapacitating          0
injuries_non_incapacitating      0
injuries_reported_not_evident    0
injuries_no_indication           0
injuries_unknown    

In [71]:
n_percent= (df_cleaned.isnull().mean() * 100).sort_values(ascending=False)
n_percent

posted_speed_limit               0.0
street_direction                 0.0
crash_month                      0.0
crash_day_of_week                0.0
crash_hour                       0.0
injuries_unknown                 0.0
injuries_no_indication           0.0
injuries_reported_not_evident    0.0
injuries_non_incapacitating      0.0
injuries_incapacitating          0.0
injuries_fatal                   0.0
injuries_total                   0.0
most_severe_injury               0.0
num_units                        0.0
beat_of_occurrence               0.0
street_name                      0.0
street_no                        0.0
traffic_control_device           0.0
sec_contributory_cause           0.0
prim_contributory_cause          0.0
date_police_notified             0.0
damage                           0.0
crash_type                       0.0
report_type                      0.0
road_defect                      0.0
roadway_surface_cond             0.0
alignment                        0.0
t

In [72]:
df_cleaned.columns

Index(['posted_speed_limit', 'traffic_control_device', 'device_condition',
       'weather_condition', 'lighting_condition', 'first_crash_type',
       'trafficway_type', 'alignment', 'roadway_surface_cond', 'road_defect',
       'report_type', 'crash_type', 'damage', 'date_police_notified',
       'prim_contributory_cause', 'sec_contributory_cause', 'street_no',
       'street_direction', 'street_name', 'beat_of_occurrence', 'num_units',
       'most_severe_injury', 'injuries_total', 'injuries_fatal',
       'injuries_incapacitating', 'injuries_non_incapacitating',
       'injuries_reported_not_evident', 'injuries_no_indication',
       'injuries_unknown', 'crash_hour', 'crash_day_of_week', 'crash_month',
       'location'],
      dtype='object')

In [73]:
df_cleaned.duplicated().value_counts()

False    1023967
True          62
Name: count, dtype: int64

In [74]:
df = df_cleaned.drop(
    columns=[
        'CRASH_RECORD_ID',
        'HIT_AND_RUN_I',
        'NOT_RIGHT_OF_WAY_I',
        'LANE_CNT',
        'INTERSECTION_RELATED_I',
        'CRASH_DATE_EST_I',
        'PHOTOS_TAKEN_I',
        'STATEMENTS_TAKEN_I',
        'DOORING_I',
        'WORK_ZONE_I',
        'WORK_ZONE_TYPE',
        'WORKERS_PRESENT_I',
        'LATITUDE',
        'LONGITUDE',
        'CRASH_DATE'
    ], errors='ignore'
) 
df.head()

Unnamed: 0,posted_speed_limit,traffic_control_device,device_condition,weather_condition,lighting_condition,first_crash_type,trafficway_type,alignment,roadway_surface_cond,road_defect,...,injuries_fatal,injuries_incapacitating,injuries_non_incapacitating,injuries_reported_not_evident,injuries_no_indication,injuries_unknown,crash_hour,crash_day_of_week,crash_month,location
0,30,NO CONTROLS,NO CONTROLS,SNOW,"DARKNESS, LIGHTED ROAD",FIXED OBJECT,NOT DIVIDED,STRAIGHT AND LEVEL,SNOW OR SLUSH,NO DEFECTS,...,0.0,0.0,0.0,0.0,1.0,0.0,22,4,1,POINT (-87.551093105845 41.713829100033)
1,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,"DARKNESS, LIGHTED ROAD",REAR END,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,22,4,1,POINT (-87.755202215729 41.796710893317)
2,30,OTHER,OTHER,OTHER,UNKNOWN,PARKED MOTOR VEHICLE,OTHER,STRAIGHT AND LEVEL,OTHER,UNKNOWN,...,0.0,0.0,0.0,0.0,2.0,0.0,22,4,1,POINT (-87.603822899265 41.813004951227)
3,30,STOP SIGN/FLASHER,FUNCTIONING PROPERLY,CLEAR,"DARKNESS, LIGHTED ROAD",REAR END,FOUR WAY,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,3.0,0.0,22,4,1,POINT (-87.705668192505 41.868335288795)
4,30,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",PARKED MOTOR VEHICLE,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,22,4,1,POINT (-87.696128029764 41.866617682133)


In [75]:
n_percent= (df.isnull().mean() * 100).sort_values(ascending=False)
n_percent

posted_speed_limit               0.0
street_direction                 0.0
crash_month                      0.0
crash_day_of_week                0.0
crash_hour                       0.0
injuries_unknown                 0.0
injuries_no_indication           0.0
injuries_reported_not_evident    0.0
injuries_non_incapacitating      0.0
injuries_incapacitating          0.0
injuries_fatal                   0.0
injuries_total                   0.0
most_severe_injury               0.0
num_units                        0.0
beat_of_occurrence               0.0
street_name                      0.0
street_no                        0.0
traffic_control_device           0.0
sec_contributory_cause           0.0
prim_contributory_cause          0.0
date_police_notified             0.0
damage                           0.0
crash_type                       0.0
report_type                      0.0
road_defect                      0.0
roadway_surface_cond             0.0
alignment                        0.0
t

In [76]:
df.shape

(1024029, 33)

Filling null values for categorical and numerical columns

In [77]:
cat_cols = df.select_dtypes(include='object').columns

df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])



In [78]:
num_cols = df.select_dtypes(include=['int', 'float']).columns

df[num_cols] = df[num_cols].fillna(
    df[num_cols].median()
)

In [79]:
df.isna().sum()

posted_speed_limit               0
traffic_control_device           0
device_condition                 0
weather_condition                0
lighting_condition               0
first_crash_type                 0
trafficway_type                  0
alignment                        0
roadway_surface_cond             0
road_defect                      0
report_type                      0
crash_type                       0
damage                           0
date_police_notified             0
prim_contributory_cause          0
sec_contributory_cause           0
street_no                        0
street_direction                 0
street_name                      0
beat_of_occurrence               0
num_units                        0
most_severe_injury               0
injuries_total                   0
injuries_fatal                   0
injuries_incapacitating          0
injuries_non_incapacitating      0
injuries_reported_not_evident    0
injuries_no_indication           0
injuries_unknown    

In [80]:
n_percent= (df.isnull().mean() * 100).sort_values(ascending=False)
n_percent

posted_speed_limit               0.0
street_direction                 0.0
crash_month                      0.0
crash_day_of_week                0.0
crash_hour                       0.0
injuries_unknown                 0.0
injuries_no_indication           0.0
injuries_reported_not_evident    0.0
injuries_non_incapacitating      0.0
injuries_incapacitating          0.0
injuries_fatal                   0.0
injuries_total                   0.0
most_severe_injury               0.0
num_units                        0.0
beat_of_occurrence               0.0
street_name                      0.0
street_no                        0.0
traffic_control_device           0.0
sec_contributory_cause           0.0
prim_contributory_cause          0.0
date_police_notified             0.0
damage                           0.0
crash_type                       0.0
report_type                      0.0
road_defect                      0.0
roadway_surface_cond             0.0
alignment                        0.0
t

In [81]:
df.columns = df.columns.str.strip().str.lower()
df.columns

Index(['posted_speed_limit', 'traffic_control_device', 'device_condition',
       'weather_condition', 'lighting_condition', 'first_crash_type',
       'trafficway_type', 'alignment', 'roadway_surface_cond', 'road_defect',
       'report_type', 'crash_type', 'damage', 'date_police_notified',
       'prim_contributory_cause', 'sec_contributory_cause', 'street_no',
       'street_direction', 'street_name', 'beat_of_occurrence', 'num_units',
       'most_severe_injury', 'injuries_total', 'injuries_fatal',
       'injuries_incapacitating', 'injuries_non_incapacitating',
       'injuries_reported_not_evident', 'injuries_no_indication',
       'injuries_unknown', 'crash_hour', 'crash_day_of_week', 'crash_month',
       'location'],
      dtype='object')

In [82]:
# Saving the changes made during cleaning
df.to_csv("cleaned_traffic_crashes.csv", index=False)


In [83]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1024029 entries, 0 to 1024028
Data columns (total 33 columns):
 #   Column                         Non-Null Count    Dtype  
---  ------                         --------------    -----  
 0   posted_speed_limit             1024029 non-null  int64  
 1   traffic_control_device         1024029 non-null  object 
 2   device_condition               1024029 non-null  object 
 3   weather_condition              1024029 non-null  object 
 4   lighting_condition             1024029 non-null  object 
 5   first_crash_type               1024029 non-null  object 
 6   trafficway_type                1024029 non-null  object 
 7   alignment                      1024029 non-null  object 
 8   roadway_surface_cond           1024029 non-null  object 
 9   road_defect                    1024029 non-null  object 
 10  report_type                    1024029 non-null  object 
 11  crash_type                     1024029 non-null  object 
 12  damage        

In [138]:
# Create a mapping dictionary
# This reduces 40 specific causes into 5 broad "Buckets"
cause_mapping = {
    # DRIVER ERROR (The biggest category)
    'FOLLOWING TOO CLOSELY': 'Driver Error',
    'FAILING TO YIELD RIGHT-OF-WAY': 'Driver Error',
    'FAILING TO REDUCE SPEED TO AVOID CRASH': 'Driver Error',
    'IMPROPER BACKING': 'Driver Error',
    'IMPROPER OVERTAKING/PASSING': 'Driver Error',
    'IMPROPER TURNING/NO SIGNAL': 'Driver Error',
    'DRIVING SKILLS/KNOWLEDGE/EXPERIENCE': 'Driver Error',
    'DISREGARDING TRAFFIC SIGNALS': 'Driver Error',
    'OPERATING VEHICLE IN ERRATIC, RECKLESS, CARELESS, NEGLIGENT OR AGGRESSIVE MANNER': 'Driver Error',
    'TEXTING': 'Driver Error',
    'DISTRACTION - FROM INSIDE VEHICLE': 'Driver Error',
    'DISTRACTION - FROM OUTSIDE VEHICLE': 'Driver Error',
    'PHYSICAL CONDITION OF DRIVER': 'Driver Error',
    
    # EXTERNAL FACTORS
    'WEATHER': 'External Factors',
    'ROAD ENGINEERING/SURFACE/MARKING DEFECTS': 'External Factors',
    'VISION OBSCURED (SIGNS, TREE LIMBS, BUILDINGS, ETC.)': 'External Factors',
    'ANIMAL': 'External Factors',
    
    # VEHICLE DEFECTS
    'EQUIPMENT - VEHICLE CONDITION': 'Vehicle Defect',
    'BRAKESLESS/FAILURE': 'Vehicle Defect',
    
    # UNKNOWN (Usually the biggest or second biggest)
    'UNABLE TO DETERMINE': 'Unknown',
    'NOT APPLICABLE': 'Unknown'
}

# 1. Apply the mapping
# If a cause is NOT in the dictionary, we default it to 'Other'
df['Crash_Cause'] = df['prim_contributory_cause'].map(cause_mapping).fillna('Other')

# 2. Check the new counts
print(df['Crash_Cause'].value_counts())

Crash_Cause
Driver Error        464480
Unknown             456049
Other                72885
External Factors     24364
Vehicle Defect        6251
Name: count, dtype: int64


Define X and Y

In [139]:
X = df.drop(columns=['Crash_Cause'])
y = df['Crash_Cause']

Train test split

In [85]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [86]:
X_train

Unnamed: 0,posted_speed_limit,traffic_control_device,device_condition,weather_condition,lighting_condition,first_crash_type,trafficway_type,alignment,roadway_surface_cond,road_defect,...,injuries_fatal,injuries_incapacitating,injuries_non_incapacitating,injuries_reported_not_evident,injuries_no_indication,injuries_unknown,crash_hour,crash_day_of_week,crash_month,location
497148,15,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,REAR END,ONE-WAY,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,2.0,5.0,0.0,18,4,7,POINT (-87.689579145955 41.775369783644)
620748,20,NO CONTROLS,NO CONTROLS,RAIN,DARKNESS,PARKED MOTOR VEHICLE,NOT DIVIDED,STRAIGHT AND LEVEL,WET,NO DEFECTS,...,0.0,0.0,0.0,0.0,1.0,0.0,21,4,4,POINT (-87.540171498289 41.704689925426)
146471,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,REAR END,DIVIDED - W/MEDIAN (NOT RAISED),STRAIGHT AND LEVEL,DRY,UNKNOWN,...,0.0,0.0,0.0,0.0,3.0,0.0,17,4,9,POINT (-87.77384449185 41.938369065236)
963960,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,FIXED OBJECT,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,1.0,0.0,16,7,2,POINT (-87.589533763454 41.762293310966)
812905,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,TURNING,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,3.0,0.0,15,2,8,POINT (-87.722977534605 41.784223692235)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
770760,30,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",PARKED MOTOR VEHICLE,DIVIDED - W/MEDIAN (NOT RAISED),STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,0,5,12,POINT (-87.823259819486 41.984388863291)
518958,30,NO CONTROLS,NO CONTROLS,CLOUDY/OVERCAST,DAYLIGHT,REAR END,ONE-WAY,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,17,1,5,POINT (-87.905309125103 41.976201139024)
148564,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,TURNING,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,1.0,1.0,0.0,16,5,9,POINT (-87.621934031166 41.896876961996)
429850,35,NO CONTROLS,UNKNOWN,SNOW,"DARKNESS, LIGHTED ROAD",PARKED MOTOR VEHICLE,ONE-WAY,STRAIGHT AND LEVEL,SNOW OR SLUSH,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,0,6,3,POINT (-87.715917090236 41.849728913432)


In [87]:
X_test

Unnamed: 0,posted_speed_limit,traffic_control_device,device_condition,weather_condition,lighting_condition,first_crash_type,trafficway_type,alignment,roadway_surface_cond,road_defect,...,injuries_fatal,injuries_incapacitating,injuries_non_incapacitating,injuries_reported_not_evident,injuries_no_indication,injuries_unknown,crash_hour,crash_day_of_week,crash_month,location
389985,15,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,PEDESTRIAN,PARKING LOT,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,1.0,0.0,1.0,0.0,8,5,7,POINT (-87.603551470724 41.658029023473)
932984,50,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",SIDESWIPE SAME DIRECTION,DIVIDED - W/MEDIAN BARRIER,CURVE ON GRADE,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,21,7,7,POINT (-87.613763431332 41.891615435111)
700032,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,PARKED MOTOR VEHICLE,ONE-WAY,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,1.0,0.0,14,6,7,POINT (-87.627135874478 41.714253498018)
890389,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,SIDESWIPE SAME DIRECTION,DIVIDED - W/MEDIAN BARRIER,STRAIGHT AND LEVEL,WET,NO DEFECTS,...,0.0,0.0,0.0,0.0,3.0,0.0,14,2,12,POINT (-87.60479457816 41.790692096732)
519475,35,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,FIXED OBJECT,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,1.0,0.0,9,7,5,POINT (-87.740922859385 41.740911065365)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
710896,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,REAR END,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,13,3,6,POINT (-87.765398533209 41.891185870372)
730933,30,NO CONTROLS,NO CONTROLS,CLEAR,DUSK,PEDESTRIAN,ONE-WAY,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,1.0,0.0,2.0,0.0,20,1,4,POINT (-87.711276264979 41.84434082995)
601561,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,TURNING,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,6,6,7,POINT (-87.613819632519 41.715247546229)
100810,30,NO CONTROLS,NO CONTROLS,UNKNOWN,DAYLIGHT,TURNING,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,8,2,3,POINT (-87.646165835591 41.857659313588)


Preprocessing X train 

In [88]:
X_train_cat = X_train.select_dtypes(include= ['object', 'string']).copy()
X_train_cat.head()

Unnamed: 0,traffic_control_device,device_condition,weather_condition,lighting_condition,first_crash_type,trafficway_type,alignment,roadway_surface_cond,road_defect,report_type,crash_type,damage,date_police_notified,sec_contributory_cause,street_direction,street_name,most_severe_injury,location
497148,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,REAR END,ONE-WAY,STRAIGHT AND LEVEL,DRY,NO DEFECTS,ON SCENE,INJURY AND / OR TOW DUE TO CRASH,"$501 - $1,500",07/28/2021 06:31:00 PM,UNABLE TO DETERMINE,W,65TH ST,"REPORTED, NOT EVIDENT",POINT (-87.689579145955 41.775369783644)
620748,NO CONTROLS,NO CONTROLS,RAIN,DARKNESS,PARKED MOTOR VEHICLE,NOT DIVIDED,STRAIGHT AND LEVEL,WET,NO DEFECTS,ON SCENE,NO INJURY / DRIVE AWAY,"OVER $1,500",04/30/2020 11:05:00 AM,NOT APPLICABLE,E,105TH ST,NO INDICATION OF INJURY,POINT (-87.540171498289 41.704689925426)
146471,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,REAR END,DIVIDED - W/MEDIAN (NOT RAISED),STRAIGHT AND LEVEL,DRY,UNKNOWN,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,"OVER $1,500",09/25/2024 05:15:00 PM,NOT APPLICABLE,W,BELMONT AVE,NO INDICATION OF INJURY,POINT (-87.77384449185 41.938369065236)
963960,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,FIXED OBJECT,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,ON SCENE,INJURY AND / OR TOW DUE TO CRASH,"OVER $1,500",02/11/2017 04:40:00 PM,NOT APPLICABLE,S,DANTE AVE,NO INDICATION OF INJURY,POINT (-87.589533763454 41.762293310966)
812905,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,TURNING,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,"OVER $1,500",08/13/2018 04:05:00 PM,DRIVING SKILLS/KNOWLEDGE/EXPERIENCE,W,60TH ST,NO INDICATION OF INJURY,POINT (-87.722977534605 41.784223692235)


In [89]:
X_train_cat.head()
X_train_cat.shape


(819223, 18)

In [90]:
X_train_cat.nunique().sort_values(ascending=False)

date_police_notified      648348
location                  301756
street_name                 1647
sec_contributory_cause        40
trafficway_type               20
traffic_control_device        19
first_crash_type              18
weather_condition             12
device_condition               8
roadway_surface_cond           7
road_defect                    7
alignment                      6
lighting_condition             6
most_severe_injury             5
street_direction               4
damage                         3
report_type                    3
crash_type                     2
dtype: int64

In [91]:
X_train_cat = X_train_cat.drop(columns=['date_police_notified','location'], errors='ignore')
X_train_cat.head()

Unnamed: 0,traffic_control_device,device_condition,weather_condition,lighting_condition,first_crash_type,trafficway_type,alignment,roadway_surface_cond,road_defect,report_type,crash_type,damage,sec_contributory_cause,street_direction,street_name,most_severe_injury
497148,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,REAR END,ONE-WAY,STRAIGHT AND LEVEL,DRY,NO DEFECTS,ON SCENE,INJURY AND / OR TOW DUE TO CRASH,"$501 - $1,500",UNABLE TO DETERMINE,W,65TH ST,"REPORTED, NOT EVIDENT"
620748,NO CONTROLS,NO CONTROLS,RAIN,DARKNESS,PARKED MOTOR VEHICLE,NOT DIVIDED,STRAIGHT AND LEVEL,WET,NO DEFECTS,ON SCENE,NO INJURY / DRIVE AWAY,"OVER $1,500",NOT APPLICABLE,E,105TH ST,NO INDICATION OF INJURY
146471,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,REAR END,DIVIDED - W/MEDIAN (NOT RAISED),STRAIGHT AND LEVEL,DRY,UNKNOWN,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,"OVER $1,500",NOT APPLICABLE,W,BELMONT AVE,NO INDICATION OF INJURY
963960,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,FIXED OBJECT,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,ON SCENE,INJURY AND / OR TOW DUE TO CRASH,"OVER $1,500",NOT APPLICABLE,S,DANTE AVE,NO INDICATION OF INJURY
812905,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,TURNING,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,"OVER $1,500",DRIVING SKILLS/KNOWLEDGE/EXPERIENCE,W,60TH ST,NO INDICATION OF INJURY


In [92]:
X_train_cat.isna().sum()

traffic_control_device    0
device_condition          0
weather_condition         0
lighting_condition        0
first_crash_type          0
trafficway_type           0
alignment                 0
roadway_surface_cond      0
road_defect               0
report_type               0
crash_type                0
damage                    0
sec_contributory_cause    0
street_direction          0
street_name               0
most_severe_injury        0
dtype: int64

In [93]:
X_train_cat.shape

(819223, 16)

In [94]:
# Keep only top 50 streets, rest as 'Other'
top_streets = X_train_cat['street_name'].value_counts().nlargest(50).index
X_train_cat['street_grouped'] = X_train_cat['street_name'].where(X_train_cat['street_name'].isin(top_streets), 'Other')


In [95]:
X_train_cat.columns

Index(['traffic_control_device', 'device_condition', 'weather_condition',
       'lighting_condition', 'first_crash_type', 'trafficway_type',
       'alignment', 'roadway_surface_cond', 'road_defect', 'report_type',
       'crash_type', 'damage', 'sec_contributory_cause', 'street_direction',
       'street_name', 'most_severe_injury', 'street_grouped'],
      dtype='object')

In [96]:
X_train_cat = X_train_cat.drop(columns=["street_name"], errors="ignore")

In [97]:
X_train_cat.nunique().sort_values(ascending=False)

street_grouped            51
sec_contributory_cause    40
trafficway_type           20
traffic_control_device    19
first_crash_type          18
weather_condition         12
device_condition           8
roadway_surface_cond       7
road_defect                7
lighting_condition         6
alignment                  6
most_severe_injury         5
street_direction           4
report_type                3
damage                     3
crash_type                 2
dtype: int64

One Hot encoding

In [98]:
ohe = OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore')

In [99]:
X_train_encoded = ohe.fit_transform(X_train_cat)


In [100]:
X_train_encoded

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [105]:
X_train_encoded = pd.DataFrame(
    X_train_encoded, 
    columns=ohe.get_feature_names_out(X_train_cat.columns),
    index=X_train.index
)


In [106]:
X_train_encoded.head()

Unnamed: 0,traffic_control_device_DELINEATORS,traffic_control_device_FLASHING CONTROL SIGNAL,traffic_control_device_LANE USE MARKING,traffic_control_device_NO CONTROLS,traffic_control_device_NO PASSING,traffic_control_device_OTHER,traffic_control_device_OTHER RAILROAD CROSSING,traffic_control_device_OTHER REG. SIGN,traffic_control_device_OTHER WARNING SIGN,traffic_control_device_PEDESTRIAN CROSSING SIGN,...,street_grouped_MONTROSE AVE,street_grouped_NORTH AVE,street_grouped_OGDEN AVE,street_grouped_Other,street_grouped_PULASKI RD,street_grouped_ROOSEVELT RD,street_grouped_SHERIDAN RD,street_grouped_STATE ST,street_grouped_STONY ISLAND AVE,street_grouped_WESTERN AVE
497148,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
620748,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
146471,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
963960,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
812905,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


Numeric columns

In [107]:
numerical_cols = df.select_dtypes(include=["int", "float"]).copy()
numerical_cols.head()

Unnamed: 0,posted_speed_limit,street_no,beat_of_occurrence,num_units,injuries_total,injuries_fatal,injuries_incapacitating,injuries_non_incapacitating,injuries_reported_not_evident,injuries_no_indication,injuries_unknown,crash_hour,crash_day_of_week,crash_month
0,30,9954,431.0,1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,22,4,1
1,30,5725,814.0,2,0.0,0.0,0.0,0.0,0.0,2.0,0.0,22,4,1
2,30,4505,222.0,2,0.0,0.0,0.0,0.0,0.0,2.0,0.0,22,4,1
3,30,3158,1134.0,2,0.0,0.0,0.0,0.0,0.0,3.0,0.0,22,4,1
4,30,2804,1135.0,2,0.0,0.0,0.0,0.0,0.0,2.0,0.0,22,4,1


In [108]:
X_train_num = X_train[numerical_cols.columns].copy()
X_train_num

Unnamed: 0,posted_speed_limit,street_no,beat_of_occurrence,num_units,injuries_total,injuries_fatal,injuries_incapacitating,injuries_non_incapacitating,injuries_reported_not_evident,injuries_no_indication,injuries_unknown,crash_hour,crash_day_of_week,crash_month
497148,15,2631,831.0,2,2.0,0.0,0.0,0.0,2.0,5.0,0.0,18,4,7
620748,20,3426,432.0,2,0.0,0.0,0.0,0.0,0.0,1.0,0.0,21,4,4
146471,30,5901,2514.0,2,0.0,0.0,0.0,0.0,0.0,3.0,0.0,17,4,9
963960,30,7301,324.0,1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,16,7,2
812905,30,4000,813.0,2,0.0,0.0,0.0,0.0,0.0,3.0,0.0,15,2,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
770760,30,7843,1614.0,2,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0,5,12
518958,30,3,1653.0,2,0.0,0.0,0.0,0.0,0.0,2.0,0.0,17,1,5
148564,30,220,1833.0,2,1.0,0.0,0.0,0.0,1.0,1.0,0.0,16,5,9
429850,35,3621,1013.0,2,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0,6,3


In [109]:
X_train_num.shape

(819223, 14)

Scaling numeric columns

In [110]:
from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_num)
X_train_scaled

array([[-2.22329774, -0.36601806, -0.59149489, ...,  0.86369117,
        -0.0595197 ,  0.08770978],
       [-1.39531195, -0.08991308, -1.15782542, ...,  1.40204344,
        -0.0595197 , -0.78983758],
       [ 0.26065963,  0.76965905,  1.79731283, ...,  0.68424041,
        -0.0595197 ,  0.67274136],
       ...,
       [ 0.26065963, -1.20336287,  0.83071862, ...,  0.50478966,
         0.44604199,  0.67274136],
       [ 1.08864542, -0.02218921, -0.33316869, ..., -2.36642245,
         0.95160368, -1.08235337],
       [ 0.26065963, -1.15126759, -0.04787436, ..., -0.21301337,
        -0.56508139,  0.67274136]])

In [111]:
X_train_scaled = pd.DataFrame(
    X_train_scaled, 
    columns= X_train_num.columns,
    index=X_train.index
)
X_train_scaled

Unnamed: 0,posted_speed_limit,street_no,beat_of_occurrence,num_units,injuries_total,injuries_fatal,injuries_incapacitating,injuries_non_incapacitating,injuries_reported_not_evident,injuries_no_indication,injuries_unknown,crash_hour,crash_day_of_week,crash_month
497148,-2.223298,-0.366018,-0.591495,-0.078076,3.129536,-0.03099,-0.118459,-0.25704,5.781311,2.609089,0.0,0.863691,-0.059520,0.087710
620748,-1.395312,-0.089913,-1.157825,-0.078076,-0.343041,-0.03099,-0.118459,-0.25704,-0.202923,-0.864903,0.0,1.402043,-0.059520,-0.789838
146471,0.260660,0.769659,1.797313,-0.078076,-0.343041,-0.03099,-0.118459,-0.25704,-0.202923,0.872093,0.0,0.684240,-0.059520,0.672741
963960,0.260660,1.255882,-1.311118,-2.303534,-0.343041,-0.03099,-0.118459,-0.25704,-0.202923,-0.864903,0.0,0.504790,1.457165,-1.374869
812905,0.260660,0.109438,-0.617044,-0.078076,-0.343041,-0.03099,-0.118459,-0.25704,-0.202923,0.872093,0.0,0.325339,-1.070643,0.380226
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
770760,0.260660,1.444119,0.519876,-0.078076,-0.343041,-0.03099,-0.118459,-0.25704,-0.202923,0.003595,0.0,-2.366422,0.446042,1.550289
518958,0.260660,-1.278727,0.575231,-0.078076,-0.343041,-0.03099,-0.118459,-0.25704,-0.202923,0.003595,0.0,0.684240,-1.576205,-0.497322
148564,0.260660,-1.203363,0.830719,-0.078076,1.393247,-0.03099,-0.118459,-0.25704,2.789194,-0.864903,0.0,0.504790,0.446042,0.672741
429850,1.088645,-0.022189,-0.333169,-0.078076,-0.343041,-0.03099,-0.118459,-0.25704,-0.202923,0.003595,0.0,-2.366422,0.951604,-1.082353


In [112]:
# join the two dataframes to make one large dataframe with numeric and categorical variables
X_train_full = pd.concat([X_train_scaled, X_train_encoded], axis=1)
X_train_full

Unnamed: 0,posted_speed_limit,street_no,beat_of_occurrence,num_units,injuries_total,injuries_fatal,injuries_incapacitating,injuries_non_incapacitating,injuries_reported_not_evident,injuries_no_indication,...,street_grouped_MONTROSE AVE,street_grouped_NORTH AVE,street_grouped_OGDEN AVE,street_grouped_Other,street_grouped_PULASKI RD,street_grouped_ROOSEVELT RD,street_grouped_SHERIDAN RD,street_grouped_STATE ST,street_grouped_STONY ISLAND AVE,street_grouped_WESTERN AVE
497148,-2.223298,-0.366018,-0.591495,-0.078076,3.129536,-0.03099,-0.118459,-0.25704,5.781311,2.609089,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
620748,-1.395312,-0.089913,-1.157825,-0.078076,-0.343041,-0.03099,-0.118459,-0.25704,-0.202923,-0.864903,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
146471,0.260660,0.769659,1.797313,-0.078076,-0.343041,-0.03099,-0.118459,-0.25704,-0.202923,0.872093,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
963960,0.260660,1.255882,-1.311118,-2.303534,-0.343041,-0.03099,-0.118459,-0.25704,-0.202923,-0.864903,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
812905,0.260660,0.109438,-0.617044,-0.078076,-0.343041,-0.03099,-0.118459,-0.25704,-0.202923,0.872093,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
770760,0.260660,1.444119,0.519876,-0.078076,-0.343041,-0.03099,-0.118459,-0.25704,-0.202923,0.003595,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
518958,0.260660,-1.278727,0.575231,-0.078076,-0.343041,-0.03099,-0.118459,-0.25704,-0.202923,0.003595,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
148564,0.260660,-1.203363,0.830719,-0.078076,1.393247,-0.03099,-0.118459,-0.25704,2.789194,-0.864903,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
429850,1.088645,-0.022189,-0.333169,-0.078076,-0.343041,-0.03099,-0.118459,-0.25704,-0.202923,0.003595,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


In [113]:
X_train_full.shape

(819223, 209)

Preprocessing X_test

In [114]:
X_test

Unnamed: 0,posted_speed_limit,traffic_control_device,device_condition,weather_condition,lighting_condition,first_crash_type,trafficway_type,alignment,roadway_surface_cond,road_defect,...,injuries_fatal,injuries_incapacitating,injuries_non_incapacitating,injuries_reported_not_evident,injuries_no_indication,injuries_unknown,crash_hour,crash_day_of_week,crash_month,location
389985,15,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,PEDESTRIAN,PARKING LOT,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,1.0,0.0,1.0,0.0,8,5,7,POINT (-87.603551470724 41.658029023473)
932984,50,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",SIDESWIPE SAME DIRECTION,DIVIDED - W/MEDIAN BARRIER,CURVE ON GRADE,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,21,7,7,POINT (-87.613763431332 41.891615435111)
700032,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,PARKED MOTOR VEHICLE,ONE-WAY,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,1.0,0.0,14,6,7,POINT (-87.627135874478 41.714253498018)
890389,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,SIDESWIPE SAME DIRECTION,DIVIDED - W/MEDIAN BARRIER,STRAIGHT AND LEVEL,WET,NO DEFECTS,...,0.0,0.0,0.0,0.0,3.0,0.0,14,2,12,POINT (-87.60479457816 41.790692096732)
519475,35,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,FIXED OBJECT,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,1.0,0.0,9,7,5,POINT (-87.740922859385 41.740911065365)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
710896,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,REAR END,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,13,3,6,POINT (-87.765398533209 41.891185870372)
730933,30,NO CONTROLS,NO CONTROLS,CLEAR,DUSK,PEDESTRIAN,ONE-WAY,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,1.0,0.0,2.0,0.0,20,1,4,POINT (-87.711276264979 41.84434082995)
601561,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,TURNING,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,6,6,7,POINT (-87.613819632519 41.715247546229)
100810,30,NO CONTROLS,NO CONTROLS,UNKNOWN,DAYLIGHT,TURNING,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,...,0.0,0.0,0.0,0.0,2.0,0.0,8,2,3,POINT (-87.646165835591 41.857659313588)


In [115]:
X_test_cat = X_test.select_dtypes(include=["object", "string"]).copy()
X_test_cat.head()

Unnamed: 0,traffic_control_device,device_condition,weather_condition,lighting_condition,first_crash_type,trafficway_type,alignment,roadway_surface_cond,road_defect,report_type,crash_type,damage,date_police_notified,sec_contributory_cause,street_direction,street_name,most_severe_injury,location
389985,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,PEDESTRIAN,PARKING LOT,STRAIGHT AND LEVEL,DRY,NO DEFECTS,ON SCENE,INJURY AND / OR TOW DUE TO CRASH,$500 OR LESS,07/21/2022 09:05:00 AM,IMPROPER TURNING/NO SIGNAL,S,EVANS AVE,NONINCAPACITATING INJURY,POINT (-87.603551470724 41.658029023473)
932984,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",SIDESWIPE SAME DIRECTION,DIVIDED - W/MEDIAN BARRIER,CURVE ON GRADE,DRY,NO DEFECTS,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,"$501 - $1,500",07/29/2017 10:00:00 PM,FAILING TO YIELD RIGHT-OF-WAY,N,LAKE SHORE DR NB,NO INDICATION OF INJURY,POINT (-87.613763431332 41.891615435111)
700032,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,PARKED MOTOR VEHICLE,ONE-WAY,STRAIGHT AND LEVEL,DRY,NO DEFECTS,ON SCENE,INJURY AND / OR TOW DUE TO CRASH,"OVER $1,500",07/26/2019 04:00:00 PM,NOT APPLICABLE,S,LA SALLE ST,NO INDICATION OF INJURY,POINT (-87.627135874478 41.714253498018)
890389,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,SIDESWIPE SAME DIRECTION,DIVIDED - W/MEDIAN BARRIER,STRAIGHT AND LEVEL,WET,NO DEFECTS,ON SCENE,NO INJURY / DRIVE AWAY,"$501 - $1,500",12/18/2017 02:35:00 PM,UNABLE TO DETERMINE,S,MARYLAND AVE,NO INDICATION OF INJURY,POINT (-87.60479457816 41.790692096732)
519475,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,FIXED OBJECT,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,ON SCENE,NO INJURY / DRIVE AWAY,"OVER $1,500",05/22/2021 10:00:00 AM,UNABLE TO DETERMINE,S,CICERO AVE,NO INDICATION OF INJURY,POINT (-87.740922859385 41.740911065365)


In [116]:
X_test_cat.shape

(204806, 18)

In [117]:
X_test_cat.nunique().sort_values(ascending=False)

date_police_notified      191340
location                  115738
street_name                 1431
sec_contributory_cause        40
trafficway_type               20
traffic_control_device        19
first_crash_type              18
weather_condition             12
device_condition               8
roadway_surface_cond           7
road_defect                    7
alignment                      6
lighting_condition             6
most_severe_injury             5
street_direction               4
damage                         3
report_type                    3
crash_type                     2
dtype: int64

In [118]:
X_test_cat = X_test_cat.drop(columns=['date_police_notified','location'], errors='ignore')
X_test_cat.head()

Unnamed: 0,traffic_control_device,device_condition,weather_condition,lighting_condition,first_crash_type,trafficway_type,alignment,roadway_surface_cond,road_defect,report_type,crash_type,damage,sec_contributory_cause,street_direction,street_name,most_severe_injury
389985,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,PEDESTRIAN,PARKING LOT,STRAIGHT AND LEVEL,DRY,NO DEFECTS,ON SCENE,INJURY AND / OR TOW DUE TO CRASH,$500 OR LESS,IMPROPER TURNING/NO SIGNAL,S,EVANS AVE,NONINCAPACITATING INJURY
932984,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",SIDESWIPE SAME DIRECTION,DIVIDED - W/MEDIAN BARRIER,CURVE ON GRADE,DRY,NO DEFECTS,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,"$501 - $1,500",FAILING TO YIELD RIGHT-OF-WAY,N,LAKE SHORE DR NB,NO INDICATION OF INJURY
700032,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,PARKED MOTOR VEHICLE,ONE-WAY,STRAIGHT AND LEVEL,DRY,NO DEFECTS,ON SCENE,INJURY AND / OR TOW DUE TO CRASH,"OVER $1,500",NOT APPLICABLE,S,LA SALLE ST,NO INDICATION OF INJURY
890389,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,SIDESWIPE SAME DIRECTION,DIVIDED - W/MEDIAN BARRIER,STRAIGHT AND LEVEL,WET,NO DEFECTS,ON SCENE,NO INJURY / DRIVE AWAY,"$501 - $1,500",UNABLE TO DETERMINE,S,MARYLAND AVE,NO INDICATION OF INJURY
519475,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,FIXED OBJECT,NOT DIVIDED,STRAIGHT AND LEVEL,DRY,NO DEFECTS,ON SCENE,NO INJURY / DRIVE AWAY,"OVER $1,500",UNABLE TO DETERMINE,S,CICERO AVE,NO INDICATION OF INJURY


In [121]:
# Keep only top 50 streets, rest as 'Other'
top_streets = X_test_cat['street_name'].value_counts().nlargest(50).index
X_test_cat['street_grouped'] = X_test_cat['street_name'].where(X_test_cat['street_name'].isin(top_streets), 'Other')


In [122]:
X_test_cat.columns

Index(['traffic_control_device', 'device_condition', 'weather_condition',
       'lighting_condition', 'first_crash_type', 'trafficway_type',
       'alignment', 'roadway_surface_cond', 'road_defect', 'report_type',
       'crash_type', 'damage', 'sec_contributory_cause', 'street_direction',
       'street_name', 'most_severe_injury', 'street_grouped'],
      dtype='object')

In [123]:
X_test_cat = X_test_cat.drop(columns=["street_name"], errors="ignore")

In [124]:
X_test_cat.nunique().sort_values(ascending=False)

street_grouped            51
sec_contributory_cause    40
trafficway_type           20
traffic_control_device    19
first_crash_type          18
weather_condition         12
device_condition           8
roadway_surface_cond       7
road_defect                7
lighting_condition         6
alignment                  6
most_severe_injury         5
street_direction           4
report_type                3
damage                     3
crash_type                 2
dtype: int64

One Hot encoding 

In [125]:
X_test_encoded = ohe.fit_transform(X_test_cat)

In [126]:
X_test_encoded

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [128]:
X_test_encoded = pd.DataFrame(
    X_test_encoded, 
    columns=ohe.get_feature_names_out(X_test_cat.columns),
    index=X_test.index
)
X_test_encoded

Unnamed: 0,traffic_control_device_DELINEATORS,traffic_control_device_FLASHING CONTROL SIGNAL,traffic_control_device_LANE USE MARKING,traffic_control_device_NO CONTROLS,traffic_control_device_NO PASSING,traffic_control_device_OTHER,traffic_control_device_OTHER RAILROAD CROSSING,traffic_control_device_OTHER REG. SIGN,traffic_control_device_OTHER WARNING SIGN,traffic_control_device_PEDESTRIAN CROSSING SIGN,...,street_grouped_MONTROSE AVE,street_grouped_NORTH AVE,street_grouped_OGDEN AVE,street_grouped_Other,street_grouped_PULASKI RD,street_grouped_ROOSEVELT RD,street_grouped_SHERIDAN RD,street_grouped_STATE ST,street_grouped_STONY ISLAND AVE,street_grouped_WESTERN AVE
389985,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
932984,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
700032,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
890389,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
519475,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
710896,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
730933,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
601561,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100810,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


Numeric columns

In [129]:
X_test_num = X_test[numerical_cols.columns].copy()
X_test_num

Unnamed: 0,posted_speed_limit,street_no,beat_of_occurrence,num_units,injuries_total,injuries_fatal,injuries_incapacitating,injuries_non_incapacitating,injuries_reported_not_evident,injuries_no_indication,injuries_unknown,crash_hour,crash_day_of_week,crash_month
389985,15,13052,533.0,2,1.0,0.0,0.0,1.0,0.0,1.0,0.0,8,5,7
932984,50,519,1834.0,2,0.0,0.0,0.0,0.0,0.0,2.0,0.0,21,7,7
700032,30,9903,511.0,2,0.0,0.0,0.0,0.0,0.0,1.0,0.0,14,6,7
890389,30,5721,235.0,2,0.0,0.0,0.0,0.0,0.0,3.0,0.0,14,2,12
519475,35,8333,834.0,1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,9,7,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
710896,30,600,1511.0,2,0.0,0.0,0.0,0.0,0.0,2.0,0.0,13,3,6
730933,30,2600,1032.0,2,1.0,0.0,0.0,1.0,0.0,2.0,0.0,20,1,4
601561,30,9839,511.0,2,0.0,0.0,0.0,0.0,0.0,2.0,0.0,6,6,7
100810,30,734,1235.0,2,0.0,0.0,0.0,0.0,0.0,2.0,0.0,8,2,3


In [130]:
from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()
X_test_scaled = scaler.fit_transform(X_test_num)
X_test_scaled

array([[-2.21689182,  3.29555393, -1.01881397, ..., -0.93140708,
         0.44666845,  0.08665069],
       [ 3.56829557, -1.1175156 ,  0.82748835, ...,  1.4022122 ,
         1.45764658,  0.08665069],
       [ 0.2624742 ,  2.18674072, -1.05003507, ...,  0.14564797,
         0.95215752,  0.08665069],
       ...,
       [ 0.2624742 ,  2.1642053 , -1.05003507, ..., -1.29042543,
         0.95215752,  0.08665069],
       [ 0.2624742 , -1.04181067, -0.02257705, ..., -0.93140708,
        -1.06979875, -1.08613422],
       [ 0.2624742 , -0.17067574, -0.18152083, ...,  0.50466632,
         0.44666845, -0.79293799]])

In [131]:
X_test_scaled = pd.DataFrame(
    X_test_scaled, 
    columns= X_test_num.columns,
    index=X_test.index
)
X_test_scaled

Unnamed: 0,posted_speed_limit,street_no,beat_of_occurrence,num_units,injuries_total,injuries_fatal,injuries_incapacitating,injuries_non_incapacitating,injuries_reported_not_evident,injuries_no_indication,injuries_unknown,crash_hour,crash_day_of_week,crash_month
389985,-2.216892,3.295554,-1.018814,-0.077159,1.375447,-0.031938,-0.117341,2.078717,-0.199006,-0.865150,0.0,-0.931407,0.446668,0.086651
932984,3.568296,-1.117516,0.827488,-0.077159,-0.340184,-0.031938,-0.117341,-0.257617,-0.199006,0.001914,0.0,1.402212,1.457647,0.086651
700032,0.262474,2.186741,-1.050035,-0.077159,-0.340184,-0.031938,-0.117341,-0.257617,-0.199006,-0.865150,0.0,0.145648,0.952158,0.086651
890389,0.262474,0.714192,-1.441718,-0.077159,-0.340184,-0.031938,-0.117341,-0.257617,-0.199006,0.868977,0.0,0.145648,-1.069799,1.552632
519475,1.088930,1.633919,-0.591653,-2.304435,-0.340184,-0.031938,-0.117341,-0.257617,-0.199006,-0.865150,0.0,-0.751898,1.457647,-0.499742
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
710896,0.262474,-1.088994,0.369106,-0.077159,-0.340184,-0.031938,-0.117341,-0.257617,-0.199006,0.001914,0.0,-0.033861,-0.564310,-0.206546
730933,0.262474,-0.384762,-0.310663,-0.077159,1.375447,-0.031938,-0.117341,2.078717,-0.199006,0.001914,0.0,1.222703,-1.575288,-0.792938
601561,0.262474,2.164205,-1.050035,-0.077159,-0.340184,-0.031938,-0.117341,-0.257617,-0.199006,0.001914,0.0,-1.290425,0.952158,0.086651
100810,0.262474,-1.041811,-0.022577,-0.077159,-0.340184,-0.031938,-0.117341,-0.257617,-0.199006,0.001914,0.0,-0.931407,-1.069799,-1.086134


In [132]:
# join the two dataframes to make one large dataframe with numeric and categorical variables
X_test_full = pd.concat([X_test_scaled, X_test_encoded], axis=1)
X_test_full

Unnamed: 0,posted_speed_limit,street_no,beat_of_occurrence,num_units,injuries_total,injuries_fatal,injuries_incapacitating,injuries_non_incapacitating,injuries_reported_not_evident,injuries_no_indication,...,street_grouped_MONTROSE AVE,street_grouped_NORTH AVE,street_grouped_OGDEN AVE,street_grouped_Other,street_grouped_PULASKI RD,street_grouped_ROOSEVELT RD,street_grouped_SHERIDAN RD,street_grouped_STATE ST,street_grouped_STONY ISLAND AVE,street_grouped_WESTERN AVE
389985,-2.216892,3.295554,-1.018814,-0.077159,1.375447,-0.031938,-0.117341,2.078717,-0.199006,-0.865150,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
932984,3.568296,-1.117516,0.827488,-0.077159,-0.340184,-0.031938,-0.117341,-0.257617,-0.199006,0.001914,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
700032,0.262474,2.186741,-1.050035,-0.077159,-0.340184,-0.031938,-0.117341,-0.257617,-0.199006,-0.865150,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
890389,0.262474,0.714192,-1.441718,-0.077159,-0.340184,-0.031938,-0.117341,-0.257617,-0.199006,0.868977,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
519475,1.088930,1.633919,-0.591653,-2.304435,-0.340184,-0.031938,-0.117341,-0.257617,-0.199006,-0.865150,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
710896,0.262474,-1.088994,0.369106,-0.077159,-0.340184,-0.031938,-0.117341,-0.257617,-0.199006,0.001914,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
730933,0.262474,-0.384762,-0.310663,-0.077159,1.375447,-0.031938,-0.117341,2.078717,-0.199006,0.001914,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
601561,0.262474,2.164205,-1.050035,-0.077159,-0.340184,-0.031938,-0.117341,-0.257617,-0.199006,0.001914,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100810,0.262474,-1.041811,-0.022577,-0.077159,-0.340184,-0.031938,-0.117341,-0.257617,-0.199006,0.001914,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


In [140]:
from sklearn.preprocessing import LabelEncoder

# Instantiate
le = LabelEncoder()

# Encode the NEW grouped column
df['Target_Encoded'] = le.fit_transform(df['Crash_Cause'])

# Update your y variable
y = df['Target_Encoded']

print("Encoding complete. Classes are:", le.classes_)

Encoding complete. Classes are: ['Driver Error' 'External Factors' 'Other' 'Unknown' 'Vehicle Defect']
