# Business Understanding

## Introduction
Traffic accidents are a critical public safety issue, causing injuries, fatalities and significant economic losses. Stakeholders such as traffic authorities and emergency services often struggle to predict and mitigate injury severity in crashes. Understanding the factors that influence injury outcomes can inform better policies, resource allocation and public safety measures that reduce injury severity and save lives.

## Use Cases
- Use the model to identify high-risk conditions (e.g. weather, lighting) and implement measures such as improved signage, speed limits or road design to reduce injury severity in traffic accidents.
- Predict the severity of injuries based on crash conditions, enabling emergency services to prioritize resources and respond more effectively to severe accidents. 

## Value Proposition
This project aims to develop a classification model that predicts injury severity in traffic crashes. By identifying the key high-risk factors contributing to severe injuries, stakeholders can implement proactive measures to:

- Reduce injury severity in traffic accidents through targeted interventions
- Enhance decision-making and resource allocation for emergency services
- Improve public safety and save lives

# Business Objective
- The task is to predict the severity of injuries based on the given features:
    - Environment: The environment in which the accident occurred.
        - POSTED_SPEED_LIMIT: The posted speed limit.
        - WEATHER_CONDITION: The weather condition.
        - LIGHTING_CONDITION: The lighting condition.
        - ROADWAY_SURFACE_COND: The roadway surface condition.
        - ROAD_DEFECT: Whether or not the road was defective.
        - TRAFFICWAY_TYPE: The type of trafficway.
        - TRAFFIC_CONTROL_DEVICE: The traffic control device present at the location of the accident.
    - Crash Dynamics: The dynamics of the crash.
        - FIRST_CRASH_TYPE: The type of the first crash.
        - ALIGNMENT: The alignment of the road.
        - LANE_CNT: The number of through lanes in either direction.
        - CRASH_HOUR: The hour of the crash.
        - CRASH_DAY_OF_WEEK: The day of the week of the crash.
        - CRASH_MONTH: The month of the crash.
    - Human Factors:
        - PRIM_CONTRIBUTORY_CAUSE: The primary contributory cause of the accident.
        - SEC_CONTRIBUTORY_CAUSE: The secondary contributory cause of the accident.
        - HIT_AND_RUN_I: Whether or not the crash involved a hit and run.
        - NOT_RIGHT_OF_WAY_I: Whether or not the crash involved a violation of the right of way.
        - WORK_ZONE_I: Whether or not the crash occurred in a work zone.
    - Location Factors:
        - LATITUDE: The latitude of the location of the crash.
        - LONGITUDE: The longitude of the location of the crash.
        - BEAT_OF_OCCURRENCE: The police beat of occurrence.
    - Target:
        - MOST_SEVERE_INJURY: Multi-class classification target (e.g. FATAL; INCAPACITATING INJURY; NONINCAPACITATING INJURY; REPORTED, NOT EVIDENT; NO INDICATION OF INJURY).


# Data Understanding

## Introduction
The dataset contains information about traffic accidents in Chicago. Stakeholders need reliable data-driven insights to mitigate injury severity and optimize their strategies. The dataset in this project is directly related to the task of predicting injury severity in traffic accidents.

## Data Description
- The dataset includes detailed records of traffic accidents, covering environment, crash dynamics, human factors and location factors, plus the target variable MOST_SEVERE_INJURY.

## Data Quality
- The dataset is very large: the crashes table loaded in this notebook has over 900,000 records and 48 columns, providing a rich source of information for analysis.
- The dataset comes from the City of Chicago's [open data portal](https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if) and is updated daily, making it a reliable source of information for stakeholders.

## Data Relevance
- Use data on crash conditions (e.g. weather) to identify high-risk conditions and take proactive measures.
- Predict injury severity to prioritize emergency services and allocate resources more effectively. 

## Conclusion
The dataset is robust, relevant and continually updated, making it an indispensable resource for the task of predicting injury severity in traffic accidents. 

# Data Preparation

## Assembly
- The source data consists of three CSV files:
    - [Crash Data](https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if/about_data)
    - [Driver/Passenger Data](https://data.cityofchicago.org/Transportation/Traffic-Crashes-People/u6pd-qa9d/about_data)
    - [Vehicles Data](https://data.cityofchicago.org/Transportation/Traffic-Crashes-Vehicles/68nd-jvt3/about_data)
- The data will be assembled into a single dataset by joining the three tables on the common key CRASH_RECORD_ID.

## Cleaning
- Irrelevant columns that do not contribute to the task will be dropped.
- Missing values will be imputed or dropped.
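
The cleaning step can be sketched on a toy frame as follows. The 50% missing-value threshold, median imputation for numeric columns, and the `"UNKNOWN"` placeholder for categorical columns are illustrative assumptions, not choices stated in this notebook, and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the merged crash data (values are made up).
toy = pd.DataFrame({
    "POSTED_SPEED_LIMIT": [30, 25, np.nan, 35],
    "WEATHER_CONDITION": ["CLEAR", None, "RAIN", "CLEAR"],
    "LANE_CNT": [np.nan, np.nan, np.nan, 2.0],
})

# Assumed threshold: drop columns with more than 50% missing values.
max_missing = 0.5
missing_share = toy.isna().mean()
cleaned = toy.loc[:, missing_share <= max_missing].copy()

# Impute numeric columns with the median and
# categorical columns with a placeholder label (both are assumptions).
num_cols = cleaned.select_dtypes(include="number").columns
cat_cols = cleaned.columns.difference(num_cols)
cleaned[num_cols] = cleaned[num_cols].fillna(cleaned[num_cols].median())
cleaned[cat_cols] = cleaned[cat_cols].fillna("UNKNOWN")
```

In this toy frame LANE_CNT is dropped because three of its four values are missing, which mirrors the low non-null count reported for LANE_CNT in the crashes table.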

## Transformation
- Categorical features will be encoded using one-hot encoding.
- Numerical features will be scaled using standard scaling.
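
One possible way to combine the two transformations is scikit-learn's `ColumnTransformer`. The sketch below uses a made-up toy frame; which columns count as numeric or categorical is an assumption for illustration, and `sparse_threshold=0` is set only to force a dense array so the result is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with two numeric and one categorical feature (values are made up).
toy = pd.DataFrame({
    "POSTED_SPEED_LIMIT": [30, 25, 35, 30],
    "CRASH_HOUR": [8, 17, 23, 12],
    "WEATHER_CONDITION": ["CLEAR", "RAIN", "SNOW", "CLEAR"],
})

numeric_cols = ["POSTED_SPEED_LIMIT", "CRASH_HOUR"]   # assumed grouping
categorical_cols = ["WEATHER_CONDITION"]              # assumed grouping

preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Two scaled columns followed by one indicator column per weather category.
X = preprocessor.fit_transform(toy)
```

When a train/test split is used, the transformer would normally be fitted on the training set only and then applied to the test set with `preprocessor.transform`, so that test data does not influence the scaling statistics.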

## Splitting
- The dataset will be split into training and testing sets using a standard 80/20 split.
- These sets will be saved to disk for future use as:
    - [X_train](./data/X_train.csv)
    - [X_test](./data/X_test.csv)
    - [y_train](./data/y_train.csv)
    - [y_test](./data/y_test.csv)
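
A minimal sketch of the 80/20 split on toy data is shown below. The `random_state` value and the use of `stratify` (to keep class proportions similar in both sets) are assumptions, and the files are written to a temporary directory rather than `./data` so the example runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: one feature and a two-class target (values are made up).
toy = pd.DataFrame({
    "POSTED_SPEED_LIMIT": [30, 25, 35, 30, 40, 20, 30, 45, 25, 35],
    "MOST_SEVERE_INJURY": ["NO INDICATION OF INJURY"] * 5
                          + ["INCAPACITATING INJURY"] * 5,
})

X = toy.drop(columns="MOST_SEVERE_INJURY")
y = toy["MOST_SEVERE_INJURY"]

# 80/20 split; stratify and random_state are illustrative assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Save the four parts; a temporary directory stands in for ./data here.
out_dir = Path(tempfile.mkdtemp())
parts = {"X_train": X_train, "X_test": X_test,
         "y_train": y_train, "y_test": y_test}
for name, part in parts.items():
    part.to_csv(out_dir / f"{name}.csv", index=False)
```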

# Modeling

## Import Libraries

In [15]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV



## Load Data

In [16]:
# load data
data_crashes = pd.read_csv('./data/Traffic_Crashes_-_Crashes_20250122.csv')
data_vehicles = pd.read_csv('./data/Traffic_Crashes_-_Vehicles_20250122.csv')
data_people = pd.read_csv('./data/Traffic_Crashes_-_People_20250122.csv')
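
Because the three exports are large, it may help to read only the columns that are needed and to state their types up front. The sketch below illustrates the `usecols` and `dtype` parameters of `pd.read_csv` on a tiny in-memory CSV; the rows, the selected columns, and the chosen dtypes are assumptions for illustration.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for the crashes export (rows are made up).
csv_text = io.StringIO(
    "CRASH_RECORD_ID,CRASH_DATE,WEATHER_CONDITION,LANE_CNT,REPORT_TYPE\n"
    "a1,01/01/2024 08:00:00 AM,CLEAR,2,ON SCENE\n"
    "b2,01/02/2024 05:30:00 PM,RAIN,,NOT ON SCENE (DESK REPORT)\n"
)

# Read only the columns of interest and declare their types explicitly.
wanted = ["CRASH_RECORD_ID", "WEATHER_CONDITION", "LANE_CNT"]
crashes_small = pd.read_csv(
    csv_text,
    usecols=wanted,
    dtype={"CRASH_RECORD_ID": "string", "WEATHER_CONDITION": "category"},
)
```

Declaring dtypes also avoids type inference on columns that mix numbers and text, which pandas otherwise has to guess chunk by chunk.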



## Inspect Data and Select Relevant Features

In [17]:
# create a variable to store the features to be used in the model
features = []

In [18]:
# inspect data
data_crashes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 911691 entries, 0 to 911690
Data columns (total 48 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   CRASH_RECORD_ID                911691 non-null  object 
 1   CRASH_DATE_EST_I               67246 non-null   object 
 2   CRASH_DATE                     911691 non-null  object 
 3   POSTED_SPEED_LIMIT             911691 non-null  int64  
 4   TRAFFIC_CONTROL_DEVICE         911691 non-null  object 
 5   DEVICE_CONDITION               911691 non-null  object 
 6   WEATHER_CONDITION              911691 non-null  object 
 7   LIGHTING_CONDITION             911691 non-null  object 
 8   FIRST_CRASH_TYPE               911691 non-null  object 
 9   TRAFFICWAY_TYPE                911691 non-null  object 
 10  LANE_CNT                       199023 non-null  float64
 11  ALIGNMENT                      911691 non-null  object 
 12  ROADWAY_SURFACE_COND          

## Feature Selection for Crash Data
- WEATHER_CONDITION, LIGHTING_CONDITION
- ROADWAY_SURFACE_COND, ROAD_DEFECT, ALIGNMENT, LANE_CNT
- TRAFFICWAY_TYPE, TRAFFIC_CONTROL_DEVICE, POSTED_SPEED_LIMIT
- DAMAGE
- CRASH_HOUR, CRASH_DAY_OF_WEEK, CRASH_MONTH
- LATITUDE, LONGITUDE, BEAT_OF_OCCURRENCE
- PRIM_CONTRIBUTORY_CAUSE, SEC_CONTRIBUTORY_CAUSE, HIT_AND_RUN_I, NOT_RIGHT_OF_WAY_I, WORK_ZONE_I
- MOST_SEVERE_INJURY, INJURIES_TOTAL, INJURIES_FATAL, INJURIES_INCAPACITATING, INJURIES_NON_INCAPACITATING, INJURIES_REPORTED_NOT_EVIDENT, INJURIES_NO_INDICATION


In [19]:
# add selected features from crashes data
features += [
    'WEATHER_CONDITION', 
    'LIGHTING_CONDITION', 

    'ROADWAY_SURFACE_COND', 
    'ROAD_DEFECT',
    'ALIGNMENT',
    'LANE_CNT',

    'TRAFFICWAY_TYPE',
    'TRAFFIC_CONTROL_DEVICE',
    'POSTED_SPEED_LIMIT',
    
    'DAMAGE',

    'CRASH_HOUR',
    'CRASH_DAY_OF_WEEK',
    'CRASH_MONTH',
    
    'LATITUDE',
    'LONGITUDE',
    'BEAT_OF_OCCURRENCE',
    
    'PRIM_CONTRIBUTORY_CAUSE',
    'SEC_CONTRIBUTORY_CAUSE',
    'HIT_AND_RUN_I',
    'NOT_RIGHT_OF_WAY_I',
    'WORK_ZONE_I',
    
    # target variable
    'MOST_SEVERE_INJURY',
    'INJURIES_TOTAL',
    'INJURIES_FATAL',
    'INJURIES_INCAPACITATING',
    'INJURIES_NON_INCAPACITATING',
    'INJURIES_REPORTED_NOT_EVIDENT',
    'INJURIES_NO_INDICATION'

    ]

In [20]:
data_vehicles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1859619 entries, 0 to 1859618
Data columns (total 71 columns):
 #   Column                    Dtype  
---  ------                    -----  
 0   CRASH_UNIT_ID             int64  
 1   CRASH_RECORD_ID           object 
 2   CRASH_DATE                object 
 3   UNIT_NO                   int64  
 4   UNIT_TYPE                 object 
 5   NUM_PASSENGERS            float64
 6   VEHICLE_ID                float64
 7   CMRC_VEH_I                object 
 8   MAKE                      object 
 9   MODEL                     object 
 10  LIC_PLATE_STATE           object 
 11  VEHICLE_YEAR              float64
 12  VEHICLE_DEFECT            object 
 13  VEHICLE_TYPE              object 
 14  VEHICLE_USE               object 
 15  TRAVEL_DIRECTION          object 
 16  MANEUVER                  object 
 17  TOWED_I                   object 
 18  FIRE_I                    object 
 19  OCCUPANT_CNT              float64
 20  EXCEED_SPEED_LIMIT_I    

## Feature Selection for Vehicles
- OCCUPANT_CNT
- VEHICLE_DEFECT, VEHICLE_TYPE, VEHICLE_USE
- EXCEED_SPEED_LIMIT_I
- FIRST_CONTACT_POINT

In [21]:
# add selected features from vehicles data
features += [
    'OCCUPANT_CNT',
    'VEHICLE_DEFECT', 
    'VEHICLE_TYPE', 
    'VEHICLE_USE', 
    'EXCEED_SPEED_LIMIT_I',
    'FIRST_CONTACT_POINT'
    ]

In [22]:
data_people.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000830 entries, 0 to 2000829
Data columns (total 29 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   PERSON_ID              object 
 1   PERSON_TYPE            object 
 2   CRASH_RECORD_ID        object 
 3   VEHICLE_ID             float64
 4   CRASH_DATE             object 
 5   SEAT_NO                float64
 6   CITY                   object 
 7   STATE                  object 
 8   ZIPCODE                object 
 9   SEX                    object 
 10  AGE                    float64
 11  DRIVERS_LICENSE_STATE  object 
 12  DRIVERS_LICENSE_CLASS  object 
 13  SAFETY_EQUIPMENT       object 
 14  AIRBAG_DEPLOYED        object 
 15  EJECTION               object 
 16  INJURY_CLASSIFICATION  object 
 17  HOSPITAL               object 
 18  EMS_AGENCY             object 
 19  EMS_RUN_NO             object 
 20  DRIVER_ACTION          object 
 21  DRIVER_VISION          object 
 22  PHYSICAL_CONDITION

## Feature Selection for Driver/Passenger
- AGE, SEX
- AIRBAG_DEPLOYED, EJECTION, SAFETY_EQUIPMENT
- CELL_PHONE_USE
- PHYSICAL_CONDITION 
- INJURY_CLASSIFICATION

In [23]:
# add selected features from people data
features += [
    'AGE',
    'SEX',
    'AIRBAG_DEPLOYED',
    'EJECTION',
    'SAFETY_EQUIPMENT',
    'CELL_PHONE_USE',
    'PHYSICAL_CONDITION',
    'INJURY_CLASSIFICATION'
]

## Merge Data from Three Tables
- The three tables will be merged on the common key CRASH_RECORD_ID.
- Use a left join (`how='left'`) so that every record from the Crash Data table is retained.
    - The Crash Data table is the primary table, so crashes without matching vehicle or people records should not be lost.

In [24]:
# merge data on CRASH_RECORD_ID 
data = data_crashes.merge(data_vehicles, on='CRASH_RECORD_ID', how='left')
data = data.merge(data_people, on='CRASH_RECORD_ID', how='left')

# inspect resulting data
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4206348 entries, 0 to 4206347
Columns: 146 entries, CRASH_RECORD_ID to CELL_PHONE_USE
dtypes: float64(24), int64(8), object(114)
memory usage: 4.6+ GB
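
The merged frame has many more rows than the crashes table because both the vehicles and the people tables hold several rows per crash, and joining both on CRASH_RECORD_ID alone pairs every vehicle with every person in the same crash. A toy sketch of this fan-out, with made-up values:

```python
import pandas as pd

# One crash with two vehicles and three people (values are made up).
crashes = pd.DataFrame({"CRASH_RECORD_ID": ["c1"],
                        "WEATHER_CONDITION": ["CLEAR"]})
vehicles = pd.DataFrame({"CRASH_RECORD_ID": ["c1", "c1"],
                         "VEHICLE_TYPE": ["PASSENGER", "TRUCK"]})
people = pd.DataFrame({"CRASH_RECORD_ID": ["c1", "c1", "c1"],
                       "AGE": [34, 51, 8]})

merged = (
    crashes.merge(vehicles, on="CRASH_RECORD_ID", how="left")
           .merge(people, on="CRASH_RECORD_ID", how="left")
)

# 1 crash x 2 vehicles x 3 people = 6 rows for this crash.
rows_per_crash = merged.groupby("CRASH_RECORD_ID").size()
```

Whether this pairing is acceptable depends on the modeling unit. If one row per crash or per person is wanted, the tables would need to be aggregated or joined on a vehicle-level key first; which key links people to vehicles should be checked against the portal's column descriptions.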


## Select Relevant Features & Remove Duplicates
- Remove features that are not relevant to the task.
- Remove duplicate records.

In [25]:
# select relevant features
data = data[features]

# drop repeated column names, keeping the first occurrence
data = data.loc[:, ~data.columns.duplicated()]

# drop duplicate rows (optional)
data = data.drop_duplicates()

## Checkpoint the Data for Future Use
- Save the merged dataset to disk for future use as:
    - [data](./data/data.csv)

In [26]:
# save merged and cleaned data
data.to_csv('./data/data.csv', index=False)