# Business Understanding

## Introduction
Traffic accidents are a critical public safety issue, causing injuries, fatalities and significant economic losses. Stakeholders such as traffic authorities and emergency services often struggle to predict and mitigate injury severity in crashes. Understanding the factors that influence injury outcomes can inform policy, resource allocation and public safety measures that reduce injury severity and save lives.

## Use Cases
- Use the model to identify high-risk conditions (e.g., weather, lighting) and implement measures such as improved signage, speed limits or road design to reduce injury severity in traffic accidents.
- Predict the severity of injuries based on crash conditions, enabling emergency services to prioritize resources and respond more effectively to severe accidents. 

## Value Proposition
This project aims to develop a classification model that predicts injury severity in traffic crashes. By identifying the key high-risk factors contributing to severe injuries, stakeholders can implement proactive measures to:
    - Reduce injury severity in traffic accidents through targeted interventions
    - Enhance decision-making and resource allocation for emergency services
    - Improve public safety and save lives

# Business Objective
- The task is to predict the severity of injuries based on the given features:
    - Environment: The environment in which the accident occurred.
        - POSTED_SPEED_LIMIT: The posted speed limit.
        - WEATHER_CONDITION: The weather condition.
        - LIGHTING_CONDITION: The lighting condition.
        - ROADWAY_SURFACE_COND: The roadway surface condition.
        - ROAD_DEFECT: Whether or not the road was defective.
        - TRAFFICWAY_TYPE: The type of trafficway.
        - TRAFFIC_CONTROL_DEVICE: The traffic control device present at the location of the accident.
    - Crash Dynamics: The dynamics of the crash.
        - FIRST_CRASH_TYPE: The type of the first crash.
        - ALIGNMENT: The alignment of the road.
        - LANE_CNT: The number of through lanes in either direction.
        - CRASH_HOUR: The hour of the crash.
        - CRASH_DAY_OF_WEEK: The day of the week of the crash.
        - CRASH_MONTH: The month of the crash.
    - Human Factors:
        - PRIM_CONTRIBUTORY_CAUSE: The primary contributory cause of the accident.
        - SEC_CONTRIBUTORY_CAUSE: The secondary contributory cause of the accident.
        - HIT_AND_RUN_I: Whether or not the crash involved a hit and run.
        - NOT_RIGHT_OF_WAY_I: Whether or not the crash involved a violation of the right of way.
        - WORK_ZONE_I: Whether or not the crash occurred in a work zone.
    - Location Factors:
        - LATITUDE: The latitude of the location of the crash.
        - LONGITUDE: The longitude of the location of the crash.
        - BEAT_OF_OCCURRENCE: The police beat of occurrence.
    - Target:
        - MOST_SEVERE_INJURY: Multi-class classification target with five classes: FATAL; INCAPACITATING INJURY; NONINCAPACITATING INJURY; REPORTED, NOT EVIDENT; NO INDICATION OF INJURY.


# Data Understanding

## Introduction
The dataset contains information about traffic accidents in Chicago. Stakeholders need reliable data-driven insights to mitigate injury severity and optimize their strategies. The dataset in this project is directly related to the task of predicting injury severity in traffic accidents.

## Data Description
- The dataset includes detailed records of traffic accidents covering various features such as environment, crash dynamics, human factors, location factors and target variable MOST_SEVERE_INJURY.

## Data Quality
- The dataset is very large with over 400,000 records and 49 features, providing a rich source of information for analysis.
- The dataset comes from the City of Chicago's [open data portal](https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if) and is updated daily making it a reliable source of information for stakeholders.

## Data Relevance
- Use data on crash conditions (e.g., weather) to identify high-risk conditions and take proactive measures.
- Predict injury severity to prioritize emergency services and allocate resources more effectively. 

## Conclusion
The dataset is robust, relevant and continually updated, making it an indispensable resource for the task of predicting injury severity in traffic accidents. 

# Data Preparation

## Assembly
- The source data is comprised of three CSV files:
    - [Crash Data](https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if/about_data)
    - [Driver/Passenger Data](https://data.cityofchicago.org/Transportation/Traffic-Crashes-People/u6pd-qa9d/about_data)
    - [Vehicles Data](https://data.cityofchicago.org/Transportation/Traffic-Crashes-Vehicles/68nd-jvt3/about_data)
- The data will be assembled into a single dataset by joining the three tables on the common key CRASH_RECORD_ID.

## Cleaning
- Irrelevant columns that do not contribute to the task will be dropped.
- Missing values will be imputed or dropped.

## Transformation
- Categorical features will be encoded using one-hot encoding.
- Numerical features will be scaled using standard scaling.

## Splitting
- The dataset will be split into training and testing sets using a standard 80/20 split.

## Import Libraries

In [100]:
# import libraries (duplicates removed, grouped by package)
import json

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from scipy.sparse import hstack
from scipy.stats import chi2_contingency, spearmanr
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import chi2, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    average_precision_score,
    classification_report,
    confusion_matrix,
    precision_recall_curve,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import (
    GridSearchCV,
    ParameterGrid,
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)
# FunctionTransformer is defined in sklearn.preprocessing, not sklearn.pipeline
from sklearn.preprocessing import (
    FunctionTransformer,
    LabelEncoder,
    OneHotEncoder,
    StandardScaler,
)
from statsmodels.stats.outliers_influence import variance_inflation_factor

## Load Data

In [101]:
# load data
data = pd.read_csv('./data/Traffic_Crashes_-_Crashes_20250127.csv')
data_vehicles = pd.read_csv('./data/Traffic_Crashes_-_Vehicles_20250127.csv')
data_people = pd.read_csv('./data/Traffic_Crashes_-_People_20250127.csv')


In [102]:
# merge data
data = data.merge(data_vehicles, on='CRASH_RECORD_ID')
data = data.merge(data_people, on='CRASH_RECORD_ID')
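
Because a crash can involve several vehicles and several people, merging both tables on CRASH_RECORD_ID alone pairs every vehicle row with every person row of the same crash. The toy frames below (hypothetical values, not drawn from the dataset) sketch how the row count grows and how a quick size check makes the multiplication visible.

```python
import pandas as pd

# Toy stand-ins for the three tables; all values are illustrative only.
crashes = pd.DataFrame({
    "CRASH_RECORD_ID": ["a", "b"],
    "MOST_SEVERE_INJURY": ["FATAL", "NO INDICATION OF INJURY"],
})
vehicles = pd.DataFrame({
    "CRASH_RECORD_ID": ["a", "a", "b"],
    "VEHICLE_TYPE": ["PASSENGER", "BUS", "PASSENGER"],
})
people = pd.DataFrame({
    "CRASH_RECORD_ID": ["a", "a", "a", "b"],
    "AGE": [30, 41, 17, 55],
})

merged = crashes.merge(vehicles, on="CRASH_RECORD_ID").merge(people, on="CRASH_RECORD_ID")

# Crash "a": 2 vehicles x 3 people = 6 rows; crash "b": 1 x 1 = 1 row.
rows_per_crash = merged.groupby("CRASH_RECORD_ID").size()
print(len(crashes), "crashes ->", len(merged), "merged rows")
print(rows_per_crash)
```

Printing `data.shape` before and after each merge on the real data gives the same kind of check; the target counts printed later in this notebook add up to about 4.2 million rows, so each crash is represented many times.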

In [103]:
# assign target variable and drop it from the features
target = data['MOST_SEVERE_INJURY']
data.drop('MOST_SEVERE_INJURY', axis=1, inplace=True)

# combine y and y_enc into a DataFrame
label_encoder = LabelEncoder()
target_enc = label_encoder.fit_transform(target)
y_data = pd.DataFrame({'Original_Target': target, 'Encoded_Target': target_enc})

# save to CSV
y_data.to_csv('./checkpoint/target.csv', index=False)

In [104]:
# Load the CSV file
y_data = pd.read_csv('./checkpoint/target.csv')

# Access the columns
y = y_data['Original_Target']  # Original target values
y_enc = y_data['Encoded_Target']  # Encoded target values

# If needed, convert y_enc back to integers (it might be loaded as floats)
y_enc = y_enc.astype(int)
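
`LabelEncoder` numbers the classes in alphabetical order of their names, so the encoded integers do not follow injury severity. The sketch below shows the resulting mapping and, as an assumed alternative, an explicit severity ranking that could be used wherever an ordinal target is wanted (for example in rank correlations). The name `severity_order` and the ordering itself are assumptions made for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed ranking from least to most severe (hypothetical ordering).
classes = [
    "NO INDICATION OF INJURY",
    "REPORTED, NOT EVIDENT",
    "NONINCAPACITATING INJURY",
    "INCAPACITATING INJURY",
    "FATAL",
]
target = pd.Series(classes)

# LabelEncoder sorts the class names alphabetically before numbering them.
encoder = LabelEncoder().fit(target)
alphabetical = {name: code for code, name in enumerate(encoder.classes_.tolist())}
print(alphabetical)  # FATAL maps to 0, NO INDICATION OF INJURY maps to 2

# Explicit ordinal encoding based on the assumed severity ranking.
severity_order = {name: rank for rank, name in enumerate(classes)}
target_ordinal = target.map(severity_order)
print(target_ordinal.tolist())
```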


## Initial Clean Up
- Drop features unlikely to influence injury severity.
    - ids
    - location
    - date/time
    - miscellaneous
    - vehicle details
    - hazmat details
    - commercial vehicle details

In [105]:
categorical_drop = [
    # IDs
    'CRASH_RECORD_ID', 'PERSON_ID', 'USDOT_NO', 'CCMC_NO', 'ILCC_NO', 
    'UN_NO', 'EMS_RUN_NO', 'IDOT_PERMIT_NO', 'UNIT_NO',

    # Dates
    'DATE_POLICE_NOTIFIED', 'CRASH_DATE_EST_I', 'CRASH_DATE_x', 
    'CRASH_DATE_y', 'CRASH_DATE',

    # Geographic
    'CITY', 'STATE', 'ZIPCODE', 'LATITUDE', 'LONGITUDE', 'LOCATION', 
    'STREET_NAME', 'STREET_DIRECTION', 'CARRIER_STATE', 'CARRIER_CITY', 
    'STREET_NO', 'BEAT_OF_OCCURRENCE', 'TRAVEL_DIRECTION',

    # Miscellaneous
    'TOWED_BY', 'TOWED_TO', 'AREA_00_I', 'AREA_01_I', 'AREA_02_I', 
    'AREA_03_I', 'AREA_04_I', 'AREA_05_I', 'AREA_06_I', 'AREA_07_I', 
    'AREA_08_I', 'AREA_09_I', 'AREA_10_I', 'AREA_11_I', 'AREA_12_I', 
    'AREA_99_I', 'WORK_ZONE_TYPE', 'PHOTOS_TAKEN_I', 'STATEMENTS_TAKEN_I', 
    'DOORING_I', 'WIDE_LOAD_I', 'REPORT_TYPE', 'CRASH_TYPE',

    # Vehicle
    'VEHICLE_ID', 'MAKE', 'MODEL', 'LIC_PLATE_STATE', 
    'TRAILER1_WIDTH', 'TRAILER2_WIDTH',

    # Hazardous Materials
    'HAZMAT_PLACARDS_I', 'HAZMAT_NAME', 'HAZMAT_PRESENT_I', 
    'HAZMAT_REPORT_I', 'HAZMAT_REPORT_NO', 'HAZMAT_VIO_CAUSE_CRASH_I', 
    'HAZMAT_OUT_OF_SERVICE_I',

    # Commercial Vehicle
    'COMMERCIAL_SRC', 'CARGO_BODY_TYPE', 'VEHICLE_CONFIG', 'GVWR', 
    'CARRIER_NAME', 'MCS_VIO_CAUSE_CRASH_I', 'MCS_REPORT_I', 
    'MCS_REPORT_NO', 'MCS_OUT_OF_SERVICE_I',

    # High/Inf VIF
    'INJURIES_TOTAL', 'INJURIES_FATAL', 'INJURIES_INCAPACITATING', 
    'INJURIES_NON_INCAPACITATING', 'INJURIES_REPORTED_NOT_EVIDENT', 
    'CRASH_UNIT_ID', 'VEHICLE_ID_x', 'VEHICLE_ID_y',

    # Related to Target
    'INJURY_CLASSIFICATION', 'INJURIES_NO_INDICATION', 'INJURIES_UNKNOWN',

    # Potential Drops
    'VEHICLE_USE', 'BAC_RESULT', 'DRIVERS_LICENSE_STATE', 

    # Additional Drops
    'CMRC_VEH_I', 'TRAVEL_DIRECTION', 'TRAILER1_LENGTH', 'TRAILER2_LENGTH', 
    'TOTAL_VEHICLE_LENGTH', 'AXLE_CNT', 'LOAD_TYPE', 'HAZMAT_CLASS', 
    'SEAT_NO', 'DRIVERS_LICENSE_CLASS', 'HOSPITAL', 'EMS_AGENCY', 
    'PEDPEDAL_ACTION', 'PEDPEDAL_VISIBILITY', 'PEDPEDAL_LOCATION', 
    'BAC_RESULT VALUE', 'CELL_PHONE_USE', 'CMV_ID',
    'TOWED_I', 'FIRE_I'
]

# Drop columns
data.drop(columns=categorical_drop, errors='ignore', inplace=True)

## Data Preparation

### Remove Features with High Rate of Missing Values
- Drop features with high rate of missing values.

In [106]:
# calculate null percentages
null_percentage = data.isnull().mean() * 100

# drop columns with more than 50% missing values
columns_to_drop = null_percentage[null_percentage > 50].index
data = data.drop(columns=columns_to_drop)

### Impute Missing Values
- Median imputation for numerical features.
- Mode imputation for categorical features.

### Grouping Rare Feature Values

In [107]:
# simplify all categorical features by grouping rare categories into a single category by proportion
def simplify_all_categorical_features(df, rare_threshold=0.01, new_category='OTHER'):
    for column in df.select_dtypes(include='object').columns:
        total = len(df)
        value_counts = df[column].value_counts()
        rare_categories = value_counts[value_counts / total < rare_threshold].index
        df[column] = df[column].replace(rare_categories, new_category)
    return df

data = simplify_all_categorical_features(data, rare_threshold=0.01)
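
As a quick illustration of the grouping step, the sketch below applies the same proportion rule to a toy column. The data and the 5% threshold are made up for the example.

```python
import pandas as pd

# Toy column: 'SNOW' makes up 2 of 100 rows (2%), below the 5% threshold used here.
weather = pd.Series(["CLEAR"] * 70 + ["RAIN"] * 28 + ["SNOW"] * 2)

shares = weather.value_counts() / len(weather)
rare = shares[shares < 0.05].index
grouped = weather.replace(list(rare), "OTHER")

print(grouped.value_counts())
```

Missing values are not counted by `value_counts()`, so they stay missing after this step and are left for the imputation that follows.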

### Feature Engineering

In [108]:
# fill categorical columns with mode
for column in data.select_dtypes(include='object').columns:
    mode_value = data[column].mode()[0]
    data[column] = data[column].fillna(mode_value)

# fill numerical columns with median
for column in data.select_dtypes(include=['float64', 'int64']).columns:
    median_value = data[column].median()
    data[column] = data[column].fillna(median_value)
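
The loops above compute the mode and median on the full dataset, before the train/test split. One alternative, sketched here under the assumption that leakage from the test rows is a concern, is to learn the fill values on the training rows only with scikit-learn's `SimpleImputer`. The values are toy examples.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({"AGE": [20.0, 30.0, np.nan, 40.0]})
test = pd.DataFrame({"AGE": [np.nan, 90.0]})

# Learn the median from the training rows only, then reuse it on the test rows.
imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train)
test_filled = imputer.transform(test)

print(imputer.statistics_)   # median of 20, 30 and 40
print(test_filled.ravel())
```

For categorical columns, `strategy="most_frequent"` plays the role of the mode fill.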

In [109]:
# simplify LIGHTING_CONDITION
data['LIGHTING_CONDITION'] = data['LIGHTING_CONDITION'].replace({'DUSK': 'LOW LIGHT', 'DAWN': 'LOW LIGHT'})
# data['Daylight'] = data['LIGHTING_CONDITION'].apply(lambda x: 'Daylight' if x == 'DAYLIGHT' else 'Other')

# simplify contact point to FRONT, SIDE, REAR, OTHER
data['Contact_Broad'] = data['FIRST_CONTACT_POINT'].apply(
    lambda x: 'Front' if 'FRONT' in x.upper() else (
        'Side' if 'SIDE' in x.upper() else (
            'Rear' if 'REAR' in x.upper() else 'Other'
        )
    )
)

# create RUSH_HOUR feature
data['CRASH_HOUR'] = pd.to_datetime(data['CRASH_HOUR'], format='%H').dt.hour
data['RUSH_HOUR'] = data['CRASH_HOUR'].apply(
    lambda x: 'Rush Hour' if 7 <= x <= 9 or 16 <= x <= 18 else 'Not Rush Hour'
)

# simplify vehicle types
data['VEHICLE_TYPE'] = data['VEHICLE_TYPE'].replace({
    'TRUCK - SINGLE UNIT': 'Truck',
    'TRACTOR W/ SEMI-TRAILER': 'Truck',
    'TRACTOR W/O SEMI-TRAILER': 'Truck',
    'SINGLE UNIT TRUCK WITH TRAILER': 'Truck',
    'OTHER VEHICLE WITH TRAILER': 'Truck',
    'BUS OVER 15 PASS.': 'Bus',
    'BUS UP TO 15 PASS.': 'Bus',
    'MOTORCYCLE (OVER 150CC)': 'Motorcycle',
    '3-WHEELED MOTORCYCLE (2 REAR WHEELS)': 'Motorcycle',
    'AUTOCYCLE': 'Motorcycle',
    'ALL-TERRAIN VEHICLE (ATV)': 'Motorcycle'
})

# group physical conditions
data['PHYSICAL_CONDITION_GROUP'] = data['PHYSICAL_CONDITION'].replace({
    'IMPAIRED - ALCOHOL': 'Impaired',
    'IMPAIRED - DRUGS': 'Impaired',
    'IMPAIRED - ALCOHOL AND DRUGS': 'Impaired',
    'MEDICATED': 'Impaired',
    'EMOTIONAL': 'Other',
    'FATIGUED/ASLEEP': 'Other',
    'ILLNESS/FAINTED': 'Other',
    'HAD BEEN DRINKING': 'Impaired',
    'NORMAL': 'Normal',
    'UNKNOWN': 'Unknown'
})

In [110]:
# print features and their sum of missing values
print(data.isnull().sum())

# drop missing values
data = data.dropna()

POSTED_SPEED_LIMIT          0
TRAFFIC_CONTROL_DEVICE      0
DEVICE_CONDITION            0
WEATHER_CONDITION           0
LIGHTING_CONDITION          0
FIRST_CRASH_TYPE            0
TRAFFICWAY_TYPE             0
ALIGNMENT                   0
ROADWAY_SURFACE_COND        0
ROAD_DEFECT                 0
DAMAGE                      0
PRIM_CONTRIBUTORY_CAUSE     0
SEC_CONTRIBUTORY_CAUSE      0
NUM_UNITS                   0
CRASH_HOUR                  0
CRASH_DAY_OF_WEEK           0
CRASH_MONTH                 0
UNIT_TYPE                   0
VEHICLE_YEAR                0
VEHICLE_DEFECT              0
VEHICLE_TYPE                0
MANEUVER                    0
OCCUPANT_CNT                0
FIRST_CONTACT_POINT         0
PERSON_TYPE                 0
SEX                         0
AGE                         0
SAFETY_EQUIPMENT            0
AIRBAG_DEPLOYED             0
EJECTION                    0
DRIVER_ACTION               0
DRIVER_VISION               0
PHYSICAL_CONDITION          0
Contact_Br

In [111]:
# checkpoint
data.to_csv('./checkpoint/data_post_feat_eng.csv', index=False)

In [112]:
# load data
data = pd.read_csv('./checkpoint/data_post_feat_eng.csv')

## Data Preprocessing & Feature Selection

### Significance Testing
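- Pearson and Spearman correlation between numerical features and the encoded target.
- Cramer's V between categorical features and the target.
- Chi-squared test of independence between categorical features and the target.
- One-way ANOVA of numerical features across the target classes.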

In [113]:
# separate numeric and categorical features
numeric_features = data.select_dtypes(include=['float64', 'int64']).columns.tolist()
categorical_features = data.select_dtypes(include=['object', 'category']).columns.tolist()

# add the target variable back to the dataset for correlation
data_with_target = data.copy()
data_with_target['target'] = y_enc

#### Pearson Correlation

In [None]:

# compute correlation for numeric features
pearson_corr = data_with_target[numeric_features + ['target']].corr(method='pearson')['target']

# display correlations sorted by absolute value
print("Numeric Feature Correlations:")
print(pearson_corr.abs().sort_values(ascending=False))

# checkpoint results
pearson_corr.to_csv('./checkpoint/pearson_corr.csv', index=True)

#### Spearman Correlation

In [86]:
# add the encoded target variable to the dataset temporarily
data_with_target = data.copy()
data_with_target['target'] = y_enc

significant_features = {}
for feature in numeric_features:
    # calculate Spearman correlation and p-value
    corr, p_value = spearmanr(data_with_target[feature], y_enc)
    significant_features[feature] = (corr, p_value)

# filter features with p-value < 0.05
significant_features = {k: v for k, v in significant_features.items() if v[1] < 0.05}

# display significant features
print("Significant Features (Spearman Correlation):")
for feature, (corr, p_value) in significant_features.items():
    print(f"{feature}: Correlation={corr:.4f}, P-value={p_value:.4e}")

# checkpoint results
with open('./checkpoint/significant_features.json', 'w') as f:
    json.dump(significant_features, f)

Significant Features (Spearman Correlation):
POSTED_SPEED_LIMIT: Correlation=0.0517, P-value=0.0000e+00
NUM_UNITS: Correlation=0.1156, P-value=0.0000e+00
CRASH_HOUR: Correlation=0.0089, P-value=9.5240e-74
CRASH_DAY_OF_WEEK: Correlation=-0.0047, P-value=7.9192e-22
CRASH_MONTH: Correlation=0.0089, P-value=3.0901e-74
VEHICLE_YEAR: Correlation=-0.0035, P-value=3.4643e-13
OCCUPANT_CNT: Correlation=0.0957, P-value=0.0000e+00
AGE: Correlation=-0.0368, P-value=0.0000e+00
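
With several million rows, even very weak monotonic relationships produce p-values that round to zero, so the coefficient itself is the more informative number. The synthetic sketch below (random data, not the crash dataset) builds a relationship with a correlation of roughly 0.02 and shows that it is still flagged as significant.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(42)
n = 200_000

# y depends on x only very weakly; almost all of its variance is noise.
x = rng.normal(size=n)
y = 0.02 * x + rng.normal(size=n)

corr, p_value = spearmanr(x, y)
print(f"Correlation={corr:.4f}, P-value={p_value:.4e}")
```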


#### Cramer's V

In [87]:
data_with_target = data.copy()
# note the y variable is unencoded
data_with_target['target'] = y

# calculate bias-corrected Cramer's V for two categorical series
def cramers_v(x, y):
    # local names chosen so they do not shadow the imported confusion_matrix and chi2
    contingency = pd.crosstab(x, y)
    chi2_stat = chi2_contingency(contingency)[0]
    n = contingency.sum().sum()
    phi2 = chi2_stat / n
    r, k = contingency.shape
    # bias correction for phi^2 and for the table dimensions
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))

# compute correlation for categorical features
categorical_corr = pd.Series(index=categorical_features, dtype='float64')
for feature in categorical_features:
    categorical_corr[feature] = cramers_v(data_with_target[feature], data_with_target['target'])
    
# display correlations sorted by absolute value
print("Categorical Feature Correlations:")
print(categorical_corr.abs().sort_values(ascending=False))

# checkpoint results
categorical_corr.to_csv('./checkpoint/categorical_corr.csv', index=True)

Categorical Feature Correlations:
FIRST_CRASH_TYPE            0.202951
PERSON_TYPE                 0.156600
AIRBAG_DEPLOYED             0.148654
UNIT_TYPE                   0.142980
PRIM_CONTRIBUTORY_CAUSE     0.116963
DAMAGE                      0.107828
PHYSICAL_CONDITION_GROUP    0.098137
PHYSICAL_CONDITION          0.098137
FIRST_CONTACT_POINT         0.085649
TRAFFICWAY_TYPE             0.082182
SAFETY_EQUIPMENT            0.081633
MANEUVER                    0.074288
DRIVER_ACTION               0.071252
DEVICE_CONDITION            0.068263
EJECTION                    0.067016
Contact_Broad               0.063084
TRAFFIC_CONTROL_DEVICE      0.061936
SEX                         0.060986
SEC_CONTRIBUTORY_CAUSE      0.058448
DRIVER_VISION               0.051830
LIGHTING_CONDITION          0.046018
ROADWAY_SURFACE_COND        0.037424
VEHICLE_TYPE                0.035304
WEATHER_CONDITION           0.035051
RUSH_HOUR                   0.030136
ROAD_DEFECT                 0.025459
ALIG
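
A quick sanity check for a Cramer's V implementation is to feed it a perfectly associated pair and an independent pair. The sketch below restates the bias-corrected formula in a self-contained form (the function and variable names are this example's own) and applies it to toy data. 3x3 tables are used because `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v_corrected(x, y):
    """Bias-corrected Cramer's V for two categorical series."""
    table = pd.crosstab(x, y)
    chi2_stat = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    phi2corr = max(0.0, chi2_stat / n - (k - 1) * (r - 1) / (n - 1))
    rcorr = r - (r - 1) ** 2 / (n - 1)
    kcorr = k - (k - 1) ** 2 / (n - 1)
    return float(np.sqrt(phi2corr / min(kcorr - 1, rcorr - 1)))


# 297 rows: 'levels' cycles a, b, c; 'blocks' is x for the first 99 rows, then y, then z,
# so every (level, block) cell holds exactly 33 rows.
levels = pd.Series(np.tile(["a", "b", "c"], 99))
blocks = pd.Series(np.repeat(["x", "y", "z"], 99))

v_same = cramers_v_corrected(levels, levels)         # perfect association
v_independent = cramers_v_corrected(levels, blocks)  # no association

print(f"identical series: {v_same:.4f}")
print(f"independent series: {v_independent:.4f}")
```

Values near 1 for the identical pair and 0 for the independent pair indicate the formula behaves as intended; by comparison, the largest value in the output above is about 0.20.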

#### Chi-Square

In [None]:
data_with_target = data.copy()
data_with_target['target'] = y

# chi-squared test for categorical features
chi2_results = {}
for feature in categorical_features:
    # calculate chi-squared statistic and p-value
    # (named chi2_stat so the imported sklearn chi2 function is not overwritten)
    contingency_table = pd.crosstab(data_with_target[feature], data_with_target['target'])
    chi2_stat, p_value, _, _ = chi2_contingency(contingency_table)
    chi2_results[feature] = (chi2_stat, p_value)
    
# filter features with p-value < 0.05
chi2_results = {k: v for k, v in chi2_results.items() if v[1] < 0.05}

# display significant features
print("Significant Features (Chi-Squared Test):")
for feature, (chi2_stat, p_value) in chi2_results.items():
    print(f"{feature}: Chi2={chi2_stat:.4f}, P-value={p_value:.4e}")

# checkpoint results
chi2_results_df = pd.DataFrame(chi2_results).T
chi2_results_df.columns = ['Chi2', 'P-value']
chi2_results_df.to_csv('./checkpoint/chi2_results.csv', index=True)

Significant Features (Chi-Squared Test):
TRAFFIC_CONTROL_DEVICE: Chi2=64648.2315, P-value=0.0000e+00
DEVICE_CONDITION: Chi2=58895.3214, P-value=0.0000e+00
WEATHER_CONDITION: Chi2=20719.7128, P-value=0.0000e+00
LIGHTING_CONDITION: Chi2=35694.9952, P-value=0.0000e+00
FIRST_CRASH_TYPE: Chi2=694019.9929, P-value=0.0000e+00
TRAFFICWAY_TYPE: Chi2=113829.2689, P-value=0.0000e+00
ALIGNMENT: Chi2=2171.5625, P-value=0.0000e+00
ROADWAY_SURFACE_COND: Chi2=23613.1996, P-value=0.0000e+00
ROAD_DEFECT: Chi2=5468.0663, P-value=0.0000e+00
DAMAGE: Chi2=97955.1220, P-value=0.0000e+00
PRIM_CONTRIBUTORY_CAUSE: Chi2=230548.7098, P-value=0.0000e+00
SEC_CONTRIBUTORY_CAUSE: Chi2=57597.9448, P-value=0.0000e+00
UNIT_TYPE: Chi2=258340.3797, P-value=0.0000e+00
VEHICLE_DEFECT: Chi2=711.3812, P-value=2.5367e-148
VEHICLE_TYPE: Chi2=21027.5064, P-value=0.0000e+00
MANEUVER: Chi2=93022.8528, P-value=0.0000e+00
FIRST_CONTACT_POINT: Chi2=123665.1882, P-value=0.0000e+00
PERSON_TYPE: Chi2=309901.0956, P-value=0.0000e+00
SEX:

#### ANOVA

In [89]:
# ANOVA test for numerical features
from scipy.stats import f_oneway

data_with_target = data.copy()
data_with_target['target'] = y

anova_results = {}

for feature in numeric_features:
    # calculate ANOVA F-statistic and p-value
    groups = [group[1] for group in data_with_target.groupby('target')[feature]]
    f_statistic, p_value = f_oneway(*groups)
    anova_results[feature] = (f_statistic, p_value)

# filter features with p-value < 0.05
anova_results = {k: v for k, v in anova_results.items() if v[1] < 0.05}

# display significant features
print("Significant Features (ANOVA Test):")
for feature, (f_statistic, p_value) in anova_results.items():
    print(f"{feature}: F-statistic={f_statistic:.4f}, P-value={p_value:.4e}")

# checkpoint results
anova_results_df = pd.DataFrame(anova_results).T
anova_results_df.columns = ['F-statistic', 'P-value']
anova_results_df.to_csv('./checkpoint/anova_results.csv', index=True)

Significant Features (ANOVA Test):
POSTED_SPEED_LIMIT: F-statistic=6693.1945, P-value=0.0000e+00
NUM_UNITS: F-statistic=43090.0581, P-value=0.0000e+00
CRASH_HOUR: F-statistic=40.5055, P-value=5.3934e-34
CRASH_DAY_OF_WEEK: F-statistic=105.6114, P-value=3.9672e-90
CRASH_MONTH: F-statistic=210.2451, P-value=1.0626e-180
VEHICLE_YEAR: F-statistic=17.1931, P-value=4.1226e-14
OCCUPANT_CNT: F-statistic=7827.2069, P-value=0.0000e+00
AGE: F-statistic=1467.7508, P-value=0.0000e+00


#### Drop Features with Low Significance
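- Drop features judged to have low relevance across the tests above. Every feature shown passes p < 0.05 at this sample size, so the selection relies on the strength of association (Cramer's V and correlation coefficients) rather than on the p-value alone.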

In [90]:
# drop low relevance features based on anova, chi2, spearman, and cramers_v
data.drop(columns=[
    'VEHICLE_DEFECT', 
    'ALIGNMENT', 
    'Divided_Trafficway',
    'VEHICLE_YEAR',
    'Adverse_Weather',
    'CRASH_DAY_BINARY',
    'TRAFFICWAY_TYPE',
    'DRIVER_ACTION',
    'MANEUVER',
    'ROADWAY_SURFACE_COND',
    'LIGHTING_CONDITION',
    'SEC_CAUSE_GROUP',
    'ROAD_DEFECT',
    'CRASH_MONTH',
    'CRASH_HOUR',
    'CRASH_DAY_OF_WEEK',
    'WEATHER_CONDITION',
    'OLD_VEHICLE',
    'DRIVER_VISION'
    
    
    ], errors='ignore', inplace=True)

In [91]:
# print the number of features after dropping
print("Number of features after dropping:", data.shape[1])

Number of features after dropping: 22


In [92]:
# checkpoint data
data.to_csv('./checkpoint/data_post_feat_sel.csv', index=False)

# Modeling

## Test Train Split

In [93]:
# load data
X = pd.read_csv('./checkpoint/data_post_feat_sel.csv')
# y_enc = pd.read_csv('./checkpoint/target.csv')['Encoded_Target']
y = pd.read_csv('./checkpoint/target.csv')['Original_Target']

In [None]:
# check if there are any missing values
missing_values = X.isnull().sum()
print("Missing Values:")
print(missing_values[missing_values > 0])

Missing Values:
Series([], dtype: int64)


In [95]:
# align indices of X and y, drop any rows with NaNs in either, and perform a train-test split
def train_test_split_wrapper(data, y):
    # align indices of data and y
    data, y = data.align(y, join="inner", axis=0)

    # drop rows with NaNs in either the features or the target
    combined = pd.concat([data, y], axis=1)
    combined = combined.dropna()

    # split the cleaned data and target
    data_cleaned = combined.iloc[:, :-1]  # All columns except the last (features)
    y_cleaned = combined.iloc[:, -1]  # Last column (target)

    # train-test split
    X_train, X_test, y_train, y_test = train_test_split(
        data_cleaned, y_cleaned, test_size=0.2, random_state=42, stratify=y_cleaned
    )

    return X_train, X_test, y_train, y_test

In [96]:
# perform train test split
X_train, X_test, y_train, y_test = train_test_split_wrapper(X, y)
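
Passing `stratify` keeps the class proportions the same in both partitions, which matters here because the rarest class is a very small fraction of the data. A minimal sketch on toy labels:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy target: 90% majority class, 10% minority class.
y_toy = pd.Series(["NO INDICATION OF INJURY"] * 90 + ["FATAL"] * 10)
X_toy = pd.DataFrame({"feature": range(len(y_toy))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both partitions keep the 90/10 ratio.
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```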

## Preprocessing
- One-hot encoding for categorical features.
- Standard scaling for numerical features.

In [97]:
def preprocess_data(df, scaler=None, encoder=None, train=True):
    # identify numeric & categorical features
    numeric_features = df.select_dtypes(include=['float64', 'int64']).columns.tolist()
    categorical_features = df.select_dtypes(include=['object', 'category']).columns.tolist()

    # scale numeric features
    if train:
        scaler = StandardScaler()
        df_numeric_scaled = pd.DataFrame(scaler.fit_transform(df[numeric_features]), 
                                         columns=numeric_features, index=df.index)
    else:
        if scaler is None:
            raise ValueError("Scaler cannot be None when train=False")
        df_numeric_scaled = pd.DataFrame(scaler.transform(df[numeric_features]), 
                                         columns=numeric_features, index=df.index)

    # encode categorical features
    if train:
        encoder = OneHotEncoder(handle_unknown='ignore', drop='first', sparse_output=False)
        df_categorical_encoded = pd.DataFrame(encoder.fit_transform(df[categorical_features]), 
                                              columns=encoder.get_feature_names_out(categorical_features), 
                                              index=df.index)
    else:
        if encoder is None:
            raise ValueError("Encoder cannot be None when train=False")
        df_categorical_encoded = pd.DataFrame(encoder.transform(df[categorical_features]), 
                                              columns=encoder.get_feature_names_out(categorical_features), 
                                              index=df.index)

    # combine processed features
    df_processed = pd.concat([df_numeric_scaled, df_categorical_encoded], axis=1)

    return df_processed, scaler, encoder 

In [98]:
# ensure scaler & encoder are passed for test data
X_train_processed, scaler, encoder = preprocess_data(X_train, train=True)
X_test_processed, _, _ = preprocess_data(X_test, scaler=scaler, encoder=encoder, train=False)
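
The same fit-on-train, transform-on-test pattern can also be expressed with scikit-learn's `ColumnTransformer`, which the notebook imports but does not use. This is a sketch of that alternative rather than the method used here; the toy columns `speed` and `weather` are hypothetical, and `drop='first'` is left out to keep the example simple.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train = pd.DataFrame({
    "speed": [20.0, 30.0, 40.0, 30.0],
    "weather": ["CLEAR", "RAIN", "CLEAR", "SNOW"],
})
test = pd.DataFrame({"speed": [30.0], "weather": ["FOG"]})  # 'FOG' never appears in train

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["speed"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["weather"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_train_demo = preprocessor.fit_transform(train)  # statistics are learned here only
X_test_demo = preprocessor.transform(test)        # reused on test data, never refit

print(X_train_demo.shape, X_test_demo.shape)
print(X_test_demo)  # the unseen category is encoded as all zeros
```

Bundling the scaler and encoder in one object means a single fitted transformer is carried to the test set, instead of passing `scaler` and `encoder` around separately.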

# Baseline Model
- Logistic Regression

In [None]:
# train baseline model
model = LogisticRegression(
    class_weight='balanced',
)

# fit the model
model.fit(X_train_processed, y_train)

KeyboardInterrupt: 

## Evaluation

In [68]:
# print target value counts
print("Target Value Counts:")
print(y.value_counts())

Target Value Counts:
Original_Target
NO INDICATION OF INJURY     3401601
NONINCAPACITATING INJURY     448135
REPORTED, NOT EVIDENT        257391
INCAPACITATING INJURY         97964
FATAL                          7036
Name: count, dtype: int64
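
These counts give a useful reference point: a model that always predicts the majority class would already reach the accuracy computed below, so accuracy alone says little about how well the injury classes are predicted.

```python
# Class counts copied from the value_counts() output above.
counts = {
    "NO INDICATION OF INJURY": 3401601,
    "NONINCAPACITATING INJURY": 448135,
    "REPORTED, NOT EVIDENT": 257391,
    "INCAPACITATING INJURY": 97964,
    "FATAL": 7036,
}

total = sum(counts.values())
majority_share = max(counts.values()) / total
fatal_share = counts["FATAL"] / total

print(f"Total rows: {total}")
print(f"Majority-class accuracy: {majority_share:.4f}")
print(f"FATAL share: {fatal_share:.4%}")
```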


In [None]:
def evaluate_model(model, X, y, X_test, y_test, output_path='./checkpoint/evaluation_metrics.json', cv_folds=5):
    # classification report
    print("🔹 Classification Report:")
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)

    report = classification_report(y_test, y_pred)
    print(report)

    # AUC-ROC (macro-averaged one-vs-rest); labels ties the probability columns to model.classes_
    roc_auc = roc_auc_score(y_test, y_pred_proba, multi_class="ovr", labels=model.classes_)
    print(f"🔹 AUC-ROC: {roc_auc:.4f}")

    # confusion matrix
    print("🔹 Confusion Matrix:")
    cm = confusion_matrix(y_test, y_pred)
    print(cm)

    # cross-validation
    print("\n🔹 Running Cross-Validation...")
    cv = StratifiedKFold(n_splits=cv_folds, shuffle=True, random_state=42)
    cv_scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')

    print(f"🔹 Cross-Validation Accuracy: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

    # save evaluation metrics
    evaluation_metrics = {
        'Classification Report': classification_report(y_test, y_pred, output_dict=True),
        'AUC-ROC': roc_auc,
        'Confusion Matrix': cm.tolist(),
        'Cross-Validation Accuracy': {
            'mean': cv_scores.mean(),
            'std_dev': cv_scores.std(),
            'scores': cv_scores.tolist()
        }
    }

    with open(output_path, 'w') as f:
        json.dump(evaluation_metrics, f, indent=4)

    return evaluation_metrics

In [None]:
# evaluate the baseline model
evaluation_metrics = evaluate_model(model, X_train_processed, y_train, X_test_processed, y_test, output_path='./checkpoint/evaluation_metrics_baseline.json')
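
`evaluate_model` scores cross-validation with plain accuracy. On imbalanced targets, metrics that average over classes can be more telling. The toy sketch below compares accuracy, balanced accuracy and macro F1 for a predictor that always outputs the majority class. Passing `scoring='f1_macro'` or `scoring='balanced_accuracy'` to `cross_val_score` would be one way to apply this; it is offered as a suggestion, not as the choice made in this notebook.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Toy labels: 90 majority-class rows and 10 minority-class rows.
y_true = ["NO INDICATION OF INJURY"] * 90 + ["FATAL"] * 10
y_pred = ["NO INDICATION OF INJURY"] * 100  # always predict the majority class

accuracy = accuracy_score(y_true, y_pred)
balanced = balanced_accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy={accuracy:.3f}  balanced_accuracy={balanced:.3f}  macro_f1={macro_f1:.3f}")
```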

# Decision Tree
- Use a decision tree classifier to predict injury severity.

In [69]:
# train a decision tree model
from sklearn.tree import DecisionTreeClassifier

# train the model
model = DecisionTreeClassifier(
    class_weight='balanced',
    random_state=42
)
model.fit(X_train_processed, y_train)

# get evaluation metrics
evaluation_metrics = evaluate_model(model, X_train_processed, y_train, X_test_processed, y_test, output_path='./checkpoint/evaluation_metrics_decision_tree.json')

Classification Report:
                          precision    recall  f1-score   support

                   FATAL       0.53      0.64      0.58      1407
   INCAPACITATING INJURY       0.47      0.55      0.51     19593
 NO INDICATION OF INJURY       0.92      0.89      0.90    680321
NONINCAPACITATING INJURY       0.53      0.58      0.55     89627
   REPORTED, NOT EVIDENT       0.35      0.44      0.39     51478

                accuracy                           0.82    842426
               macro avg       0.56      0.62      0.59    842426
            weighted avg       0.83      0.82      0.82    842426

AUC-ROC: 0.7670
Confusion Matrix:
[[   895     96    192    160     64]
 [   134  10799   4490   2993   1177]
 [   261   7099 602791  36396  33774]
 [   311   3657  27594  51737   6328]
 [    76   1344  21387   5941  22730]]


{'Classification Report': {'FATAL': {'precision': 0.5336911150864639,
   'recall': 0.6361051883439943,
   'f1-score': 0.5804150453955902,
   'support': 1407.0},
  'INCAPACITATING INJURY': {'precision': 0.4696238312676669,
   'recall': 0.5511662328382586,
   'f1-score': 0.5071381609843149,
   'support': 19593.0},
  'NO INDICATION OF INJURY': {'precision': 0.918253221093938,
   'recall': 0.886039090370575,
   'f1-score': 0.9018585775467075,
   'support': 680321.0},
  'NONINCAPACITATING INJURY': {'precision': 0.5321258498153805,
   'recall': 0.5772479275218405,
   'f1-score': 0.5537692529996682,
   'support': 89627.0},
  'REPORTED, NOT EVIDENT': {'precision': 0.3547516114431976,
   'recall': 0.4415478456816504,
   'f1-score': 0.393419355955379,
   'support': 51478.0},
  'accuracy': 0.8178190131833538,
  'macro avg': {'precision': 0.5616891257413295,
   'recall': 0.6184212569512637,
   'f1-score': 0.587320078576332,
   'support': 842426.0},
  'weighted avg': {'precision': 0.831662294074588
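
The per-class figures in the report can be recovered directly from the confusion matrix above, where rows are true classes and columns are predicted classes, both in alphabetical order. A short sketch:

```python
import numpy as np

labels = [
    "FATAL",
    "INCAPACITATING INJURY",
    "NO INDICATION OF INJURY",
    "NONINCAPACITATING INJURY",
    "REPORTED, NOT EVIDENT",
]

# Confusion matrix copied from the decision tree output above.
cm = np.array([
    [895, 96, 192, 160, 64],
    [134, 10799, 4490, 2993, 1177],
    [261, 7099, 602791, 36396, 33774],
    [311, 3657, 27594, 51737, 6328],
    [76, 1344, 21387, 5941, 22730],
])

recall = cm.diagonal() / cm.sum(axis=1)     # correct / all true members of the class
precision = cm.diagonal() / cm.sum(axis=0)  # correct / all predictions of the class
accuracy = cm.trace() / cm.sum()

for name, p, r in zip(labels, precision, recall):
    print(f"{name:<26} precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.4f}")
```

Read this way, the first row shows that 192 of the 1,407 fatal cases in the test set were predicted as NO INDICATION OF INJURY.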

## Evaluation Results
- Decision Tree Classifier performs better overall than the baseline model.
    - ROC-AUC: Logistic Regression is slightly better at distinguishing between classes (Decision Tree: 0.7670).
    - Weighted Average F1 Score: Decision Tree Classifier is better at predicting injury severity (0.82).
    - Accuracy: Decision Tree is better overall at predicting injury severity (0.82).
- Logistic Regression suffers from class imbalance and does not predict injury severity effectively.

Logistic Regression:
- It over-predicts the majority class ("NO INDICATION OF INJURY"), including for cases that belong to rare classes.
    - Fatal injuries (FATAL): Poor recall (69%), likely due to underrepresentation.
    - It struggles to separate the injury classes (e.g., INCAPACITATING INJURY vs. NONINCAPACITATING INJURY).

Decision Tree:
- Better balance in predictions, but slightly worse AUC-ROC.
    - Higher recall for all injury types, meaning it captures more true injuries.
    - Slightly more false positives in injury-related classes, which could be reduced with tuning.

# Model Improvement
- Investigate techniques for handling class imbalance to improve model performance.
    - Under-sampling
    - Class weights

SMOTE is not considered because it is computationally expensive on a dataset of this size.

## Undersampling 
- Use undersampling to balance the classes.

In [70]:
undersampler = RandomUnderSampler(sampling_strategy='auto', random_state=42)
X_train_undersampled, y_train_undersampled = undersampler.fit_resample(X_train_processed, y_train)
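
With `sampling_strategy='auto'`, `RandomUnderSampler` downsamples every class except the minority to the size of the minority class. The pandas sketch below mimics that behavior on toy labels to show the resulting class counts; it illustrates the idea and is not imblearn's implementation.

```python
import pandas as pd

# Toy imbalanced target.
y_toy = pd.Series(
    ["NO INDICATION OF INJURY"] * 80
    + ["NONINCAPACITATING INJURY"] * 15
    + ["FATAL"] * 5
)
X_toy = pd.DataFrame({"feature": range(len(y_toy))})

# Sample each class down to the size of the smallest class.
n_min = y_toy.value_counts().min()
keep_idx = y_toy.groupby(y_toy).sample(n=n_min, random_state=42).index

X_under, y_under = X_toy.loc[keep_idx], y_toy.loc[keep_idx]
print(y_under.value_counts())
```

Applied to the training split here, every class shrinks to the number of FATAL rows in the training data, so most majority-class rows are discarded.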

In [None]:
# train the model
model.fit(X_train_undersampled, y_train_undersampled)

In [None]:
# get evaluation metrics
evaluation_metrics = evaluate_model(model, X_train_undersampled, y_train_undersampled, X_test_processed, y_test, output_path='./checkpoint/evaluation_metrics_decision_tree_undersampled.json')