## Chicago Car Crash Analysis

## Overview

Road safety remains a critical public health concern in major cities around the world, including Chicago. In an effort to improve traffic safety and reduce the frequency and severity of car accidents, the Chicago Department of Transportation has initiated a data-driven investigation into the primary causes of motor vehicle crashes within the city.

This project aims to develop a predictive model that identifies the primary contributory cause of car accidents using data drawn from three key sources: vehicle characteristics, driver and passenger information, and environmental factors such as road conditions and weather. By analyzing these diverse datasets, the model will help uncover underlying patterns and risk factors associated with traffic accidents.

The insights generated will support data-informed decision-making, enabling city officials to implement targeted safety interventions, design better infrastructure, and ultimately make Chicago’s roads safer for all users.

## Business Understanding

The Chicago Department of Transportation aims to investigate the root causes of car crashes within the city. This project seeks to build a predictive model that identifies the primary contributory cause of a car accident by analyzing comprehensive data from vehicles, drivers/passengers, and the environment. By leveraging this multi-source dataset, the goal is to uncover patterns that can inform targeted interventions, enhance road safety, and reduce accident rates across Chicago.

## Stakeholders


1. Chicago Department of Transportation: to guide policies and interventions that improve road safety.

2. Vehicle Safety Boards: to influence regulations on vehicle safety features.

3. Public Health & Law Enforcement: to identify high-risk patterns and take preventive action.

## Key Questions This Project Aims to Answer

1. What are the leading causes of accidents in Chicago?

2. Can we predict why a crash occurred based on who was driving, what they were driving, and under what conditions?

3. Are there specific car types, road conditions, or human behaviors consistently linked with particular causes?

4. How can the city use this information to reduce crash frequency and severity?


## Data Understanding

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [3]:
vehicles_df = pd.read_csv("Vehicles_Data.csv")
vehicles_df


Unnamed: 0,CRASH_UNIT_ID,CRASH_RECORD_ID,CRASH_DATE,UNIT_NO,UNIT_TYPE,NUM_PASSENGERS,VEHICLE_ID,CMRC_VEH_I,MAKE,MODEL,...,TRAILER1_LENGTH,TRAILER2_LENGTH,TOTAL_VEHICLE_LENGTH,AXLE_CNT,VEHICLE_CONFIG,CARGO_BODY_TYPE,LOAD_TYPE,HAZMAT_OUT_OF_SERVICE_I,MCS_OUT_OF_SERVICE_I,HAZMAT_CLASS
0,10,2e31858c0e411f0bdcb337fb7c415aa93763cf2f23e02f...,08/04/2015 12:40:00 PM,1,DRIVER,,10.0,,FORD,Focus,...,,,,,,,,,,
1,100,e73b35bd7651b0c6693162bee0666db159b28901437009...,07/31/2015 05:50:00 PM,1,DRIVER,,96.0,,NISSAN,Pathfinder,...,,,,,,,,,,
2,1000,f2b1adeb85a15112e4fb7db74bff440d6ca53ff7a21e10...,09/02/2015 11:45:00 AM,1,DRIVER,,954.0,,FORD,F150,...,,,,,,,,,,
3,10000,15a3e24fce3ce7cd2b02d44013d1a93ff2fbdca80632ec...,10/31/2015 09:30:00 PM,2,DRIVER,,9561.0,,HYUNDAI,SONATA,...,,,,,,,,,,
4,100000,1d3c178880366c77deaf06b8c3198429112a1c8e8807ed...,11/16/2016 01:00:00 PM,2,PARKED,,96745.0,,"TOYOTA MOTOR COMPANY, LTD.",RAV4 (sport utility),...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1969763,2109651,006e6cb8786cf311d43f007a0c6e713d8024092582f607...,07/18/2025 11:05:00 PM,2,DRIVER,,2011167.0,,HONDA,CR-V,...,,,,,,,,,,
1969764,2109660,27b87030a620e0f94174913d2d2ce3a6002b127fbab220...,07/18/2025 04:45:00 PM,1,DRIVER,1.0,2011171.0,,CADILLAC,XTS,...,,,,,,,,,,
1969765,2109661,27b87030a620e0f94174913d2d2ce3a6002b127fbab220...,07/18/2025 04:45:00 PM,2,DRIVER,,2011179.0,,VOLKSWAGEN,TIGUAN,...,,,,,,,,,,
1969766,2109685,1b10f16633e40b81a7ed96a162c427614abccb7f5849f4...,07/18/2025 07:00:00 AM,1,DRIVER,,2011192.0,,INFINITI,Q50,...,,,,,,,,,,


In [4]:
print("\nData Types:\n", vehicles_df.dtypes)


Data Types:
 CRASH_UNIT_ID               int64
CRASH_RECORD_ID            object
CRASH_DATE                 object
UNIT_NO                     int64
UNIT_TYPE                  object
                            ...  
CARGO_BODY_TYPE            object
LOAD_TYPE                  object
HAZMAT_OUT_OF_SERVICE_I    object
MCS_OUT_OF_SERVICE_I       object
HAZMAT_CLASS               object
Length: 71, dtype: object
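The dtype listing shows `CRASH_DATE` stored as `object`, meaning the timestamps are still plain strings. A minimal sketch of converting them follows, assuming every value uses the `MM/DD/YYYY hh:mm:ss AM/PM` layout seen in the preview rows. The `sample` frame and the derived `CRASH_HOUR` and `CRASH_WEEKDAY` columns are illustrative and not part of the original notebook.

```python
import pandas as pd

# Tiny stand-in for vehicles_df, using CRASH_DATE strings from the preview rows.
sample = pd.DataFrame({
    "CRASH_UNIT_ID": [10, 100, 1000],
    "CRASH_DATE": [
        "08/04/2015 12:40:00 PM",
        "07/31/2015 05:50:00 PM",
        "09/02/2015 11:45:00 AM",
    ],
})

# Explicit format: month/day/year with a 12-hour clock and an AM/PM marker.
# errors="coerce" turns unparseable strings into NaT instead of raising.
sample["CRASH_DATE"] = pd.to_datetime(
    sample["CRASH_DATE"], format="%m/%d/%Y %I:%M:%S %p", errors="coerce"
)

# Hypothetical derived features (assumed names, not from the notebook).
sample["CRASH_HOUR"] = sample["CRASH_DATE"].dt.hour
sample["CRASH_WEEKDAY"] = sample["CRASH_DATE"].dt.day_name()

print(sample.dtypes)
print(sample)
```

Passing an explicit `format` avoids per-row format inference, which tends to matter on a frame of roughly two million rows.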


In [5]:
print("\nMissing Values:\n", vehicles_df.isnull().sum())


Missing Values:
 CRASH_UNIT_ID                    0
CRASH_RECORD_ID                  0
CRASH_DATE                       0
UNIT_NO                          0
UNIT_TYPE                     2335
                            ...   
CARGO_BODY_TYPE            1954200
LOAD_TYPE                  1954903
HAZMAT_OUT_OF_SERVICE_I    1956150
MCS_OUT_OF_SERVICE_I       1955911
HAZMAT_CLASS               1968527
Length: 71, dtype: int64
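Raw counts are hard to judge without the row total next to them. The sketch below expresses missingness as a percentage and sorts the columns, so the sparsest ones appear first. `missing_summary` and the `toy` frame are hypothetical helpers for illustration; on the real data the call would be `missing_summary(vehicles_df)`.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, most-missing first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)

# Small stand-in frame; column names are borrowed from the vehicles dataset.
toy = pd.DataFrame({
    "CRASH_UNIT_ID": [10, 100, 1000, 10000],
    "UNIT_TYPE": ["DRIVER", "DRIVER", None, "PARKED"],
    "HAZMAT_CLASS": [np.nan, np.nan, np.nan, "MISCELLANEOUS"],
})

summary = missing_summary(toy)
print(summary)
```

For scale, the output above reports `HAZMAT_CLASS` missing in 1,968,527 of the 1,969,768 rows, which is about 99.9% of the dataset.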


In [6]:
print("\nSample Records:\n", vehicles_df.head())


Sample Records:
    CRASH_UNIT_ID                                    CRASH_RECORD_ID  \
0             10  2e31858c0e411f0bdcb337fb7c415aa93763cf2f23e02f...   
1            100  e73b35bd7651b0c6693162bee0666db159b28901437009...   
2           1000  f2b1adeb85a15112e4fb7db74bff440d6ca53ff7a21e10...   
3          10000  15a3e24fce3ce7cd2b02d44013d1a93ff2fbdca80632ec...   
4         100000  1d3c178880366c77deaf06b8c3198429112a1c8e8807ed...   

               CRASH_DATE  UNIT_NO UNIT_TYPE  NUM_PASSENGERS  VEHICLE_ID  \
0  08/04/2015 12:40:00 PM        1    DRIVER             NaN        10.0   
1  07/31/2015 05:50:00 PM        1    DRIVER             NaN        96.0   
2  09/02/2015 11:45:00 AM        1    DRIVER             NaN       954.0   
3  10/31/2015 09:30:00 PM        2    DRIVER             NaN      9561.0   
4  11/16/2016 01:00:00 PM        2    PARKED             NaN     96745.0   

  CMRC_VEH_I                        MAKE                 MODEL  ...  \
0        NaN               

In [7]:
print("\nColumn Names:\n", vehicles_df.columns.tolist())



Column Names:
 ['CRASH_UNIT_ID', 'CRASH_RECORD_ID', 'CRASH_DATE', 'UNIT_NO', 'UNIT_TYPE', 'NUM_PASSENGERS', 'VEHICLE_ID', 'CMRC_VEH_I', 'MAKE', 'MODEL', 'LIC_PLATE_STATE', 'VEHICLE_YEAR', 'VEHICLE_DEFECT', 'VEHICLE_TYPE', 'VEHICLE_USE', 'TRAVEL_DIRECTION', 'MANEUVER', 'TOWED_I', 'FIRE_I', 'OCCUPANT_CNT', 'EXCEED_SPEED_LIMIT_I', 'TOWED_BY', 'TOWED_TO', 'AREA_00_I', 'AREA_01_I', 'AREA_02_I', 'AREA_03_I', 'AREA_04_I', 'AREA_05_I', 'AREA_06_I', 'AREA_07_I', 'AREA_08_I', 'AREA_09_I', 'AREA_10_I', 'AREA_11_I', 'AREA_12_I', 'AREA_99_I', 'FIRST_CONTACT_POINT', 'CMV_ID', 'USDOT_NO', 'CCMC_NO', 'ILCC_NO', 'COMMERCIAL_SRC', 'GVWR', 'CARRIER_NAME', 'CARRIER_STATE', 'CARRIER_CITY', 'HAZMAT_PLACARDS_I', 'HAZMAT_NAME', 'UN_NO', 'HAZMAT_PRESENT_I', 'HAZMAT_REPORT_I', 'HAZMAT_REPORT_NO', 'MCS_REPORT_I', 'MCS_REPORT_NO', 'HAZMAT_VIO_CAUSE_CRASH_I', 'MCS_VIO_CAUSE_CRASH_I', 'IDOT_PERMIT_NO', 'WIDE_LOAD_I', 'TRAILER1_WIDTH', 'TRAILER2_WIDTH', 'TRAILER1_LENGTH', 'TRAILER2_LENGTH', 'TOTAL_VEHICLE_LENGTH', 

In [8]:
print("Shape of the dataset:", vehicles_df.shape)

Shape of the dataset: (1969768, 71)


In [9]:
vehicles_df.describe()

Unnamed: 0,CRASH_UNIT_ID,UNIT_NO,NUM_PASSENGERS,VEHICLE_ID,VEHICLE_YEAR,OCCUPANT_CNT,CMV_ID,TRAILER1_LENGTH,TRAILER2_LENGTH,TOTAL_VEHICLE_LENGTH,AXLE_CNT
count,1969768.0,1969768.0,291632.0,1923407.0,1620192.0,1923407.0,19670.0,2486.0,70.0,3048.0,4624.0
mean,1053651.0,3.482699,1.469362,1002109.0,2014.572,1.079456,10963.675292,48.460579,44.271429,53.165026,9.315528
std,609681.4,2691.896,1.065047,579478.9,138.6615,0.7819697,6329.397305,20.42685,28.00824,31.530108,382.441301
min,2.0,0.0,1.0,2.0,1900.0,0.0,1.0,1.0,1.0,1.0,1.0
25%,525195.8,1.0,1.0,500589.5,2007.0,1.0,5405.25,45.0,24.25,35.0,2.0
50%,1055198.0,2.0,1.0,1000024.0,2013.0,1.0,11022.5,53.0,50.0,53.0,3.0
75%,1581963.0,2.0,2.0,1503526.0,2017.0,1.0,16446.75,53.0,53.0,66.0,5.0
max,2109686.0,3778035.0,59.0,2011193.0,9999.0,99.0,21845.0,740.0,123.0,999.0,26009.0
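Several maxima in this summary look like placeholder or data-entry values rather than real measurements, for example `VEHICLE_YEAR` of 9999, `OCCUPANT_CNT` of 99 and `AXLE_CNT` of 26009. One possible way to handle them is to mask out-of-range values as missing before modelling. The sketch below assumes plausibility bounds that are not taken from the notebook and should be tuned with domain knowledge; `BOUNDS` and `mask_out_of_range` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical plausibility bounds (inclusive); these are assumptions.
BOUNDS = {
    "VEHICLE_YEAR": (1900, 2026),
    "OCCUPANT_CNT": (0, 60),
}

def mask_out_of_range(df: pd.DataFrame, bounds: dict) -> pd.DataFrame:
    """Return a copy with values outside [low, high] replaced by NaN."""
    out = df.copy()
    for col, (low, high) in bounds.items():
        # between() is inclusive on both ends; NaN inputs evaluate to False
        in_range = out[col].between(low, high)
        # where() keeps in-range values and sets everything else to NaN
        out[col] = out[col].where(in_range)
    return out

# Small stand-in frame with two suspicious values.
toy = pd.DataFrame({
    "VEHICLE_YEAR": [2007.0, 2013.0, 9999.0, np.nan],
    "OCCUPANT_CNT": [1.0, 99.0, 2.0, 0.0],
})

checked = mask_out_of_range(toy, BOUNDS)
print(checked)
```

Masking instead of dropping keeps the rest of each row available, and the resulting gaps can be imputed later alongside the genuinely missing values.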


In [10]:
vehicles_df.describe(include='all')

Unnamed: 0,CRASH_UNIT_ID,CRASH_RECORD_ID,CRASH_DATE,UNIT_NO,UNIT_TYPE,NUM_PASSENGERS,VEHICLE_ID,CMRC_VEH_I,MAKE,MODEL,...,TRAILER1_LENGTH,TRAILER2_LENGTH,TOTAL_VEHICLE_LENGTH,AXLE_CNT,VEHICLE_CONFIG,CARGO_BODY_TYPE,LOAD_TYPE,HAZMAT_OUT_OF_SERVICE_I,MCS_OUT_OF_SERVICE_I,HAZMAT_CLASS
count,1969768.0,1969768,1969768,1969768.0,1967433,291632.0,1923407.0,36557,1923402,1923257,...,2486.0,70.0,3048.0,4624.0,16274,15568,14865,13618,13857,1241
unique,,965825,637303,,9,,,2,1435,2839,...,,,,,8,9,6,2,2,8
top,,67c687ee6be6bb4420406e6b5dde7c729d82e24a662cfd...,02/06/2025 08:00:00 AM,,DRIVER,,,Y,CHEVROLET,OTHER (EXPLAIN IN NARRATIVE),...,,,,,TRACTOR/SEMI-TRAILER,VAN/ENCLOSED BOX,OTHER,N,N,MISCELLANEOUS
freq,,18,74,,1649045,,,22864,220533,194784,...,,,,,6260,6578,8110,13600,13804,1095
mean,1053651.0,,,3.482699,,1.469362,1002109.0,,,,...,48.460579,44.271429,53.165026,9.315528,,,,,,
std,609681.4,,,2691.896,,1.065047,579478.9,,,,...,20.42685,28.00824,31.530108,382.441301,,,,,,
min,2.0,,,0.0,,1.0,2.0,,,,...,1.0,1.0,1.0,1.0,,,,,,
25%,525195.8,,,1.0,,1.0,500589.5,,,,...,45.0,24.25,35.0,2.0,,,,,,
50%,1055198.0,,,2.0,,1.0,1000024.0,,,,...,53.0,50.0,53.0,3.0,,,,,,
75%,1581963.0,,,2.0,,2.0,1503526.0,,,,...,53.0,53.0,66.0,5.0,,,,,,
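The `unique` row reports 1,435 distinct `MAKE` values and 2,839 distinct `MODEL` values, which is a lot of levels to encode directly. A common option, shown here only as a sketch, is to normalize the text and fold infrequent values into a single bucket. `keep_top_categories`, the `top_n` cut-off and the `"OTHER"` label are all assumptions for illustration.

```python
import pandas as pd

def keep_top_categories(series: pd.Series, top_n: int,
                        other_label: str = "OTHER") -> pd.Series:
    """Keep the top_n most frequent values; replace the rest with other_label."""
    # Normalize case and surrounding whitespace so "ford " and "FORD" match.
    normalized = series.str.strip().str.upper()
    top = normalized.value_counts().nlargest(top_n).index
    # Keep frequent values and leave missing values missing.
    keep = normalized.isin(top) | normalized.isna()
    return normalized.where(keep, other_label)

# Small stand-in for the MAKE column.
makes = pd.Series(
    ["FORD", "ford ", "CHEVROLET", "CHEVROLET", "CHEVROLET", "INFINITI", None]
)

grouped = keep_top_categories(makes, top_n=2)
print(grouped.value_counts(dropna=False))
```

Missing values are left untouched on purpose, so they can be handled by whatever imputation step follows.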


## Data Cleaning

In [11]:
vehicles_df.isnull().sum()

CRASH_UNIT_ID                    0
CRASH_RECORD_ID                  0
CRASH_DATE                       0
UNIT_NO                          0
UNIT_TYPE                     2335
                            ...   
CARGO_BODY_TYPE            1954200
LOAD_TYPE                  1954903
HAZMAT_OUT_OF_SERVICE_I    1956150
MCS_OUT_OF_SERVICE_I       1955911
HAZMAT_CLASS               1968527
Length: 71, dtype: int64

In [13]:
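# Caution: dropna() defaults to how="any", so a row is removed if any of its
# 71 columns is NaN. Columns such as HAZMAT_CLASS are empty in almost every
# row, so few or no rows are likely to survive this call.
vehicles_df_cleaned = vehicles_df.dropna()
vehicles_df_cleaned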

Unnamed: 0,CRASH_UNIT_ID,CRASH_RECORD_ID,CRASH_DATE,UNIT_NO,UNIT_TYPE,NUM_PASSENGERS,VEHICLE_ID,CMRC_VEH_I,MAKE,MODEL,...,TRAILER1_LENGTH,TRAILER2_LENGTH,TOTAL_VEHICLE_LENGTH,AXLE_CNT,VEHICLE_CONFIG,CARGO_BODY_TYPE,LOAD_TYPE,HAZMAT_OUT_OF_SERVICE_I,MCS_OUT_OF_SERVICE_I,HAZMAT_CLASS
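A blanket `dropna()` is very aggressive on a frame where the commercial-vehicle and hazmat columns are almost entirely empty. A gentler alternative, sketched here under assumed settings, is to drop mostly-empty columns first and then drop only the rows that lack a field the model needs. The helper `clean_sparse`, the 50% threshold and the choice of `UNIT_TYPE` as a required field are illustrative assumptions, not the notebook's method.

```python
import numpy as np
import pandas as pd

def clean_sparse(df: pd.DataFrame, max_missing_frac: float = 0.5,
                 required=()) -> pd.DataFrame:
    """Drop mostly-empty columns, then rows missing any required field."""
    # Fraction of missing values per column.
    missing_frac = df.isnull().mean()
    keep_cols = missing_frac[missing_frac <= max_missing_frac].index
    out = df[keep_cols]
    # Only enforce required fields that survived the column filter.
    required_present = [c for c in required if c in out.columns]
    if required_present:
        out = out.dropna(subset=required_present)
    return out.reset_index(drop=True)

# Small stand-in frame: HAZMAT_CLASS is mostly empty, UNIT_TYPE has one gap.
toy = pd.DataFrame({
    "CRASH_UNIT_ID": [10, 100, 1000, 10000],
    "UNIT_TYPE": ["DRIVER", "DRIVER", None, "PARKED"],
    "MAKE": ["FORD", "NISSAN", "FORD", "HYUNDAI"],
    "HAZMAT_CLASS": [np.nan, np.nan, "MISCELLANEOUS", np.nan],
})

# A blanket dropna() keeps only rows with no missing value in any column.
print("rows after blanket dropna:", len(toy.dropna()))

cleaned = clean_sparse(toy, max_missing_frac=0.5, required=["UNIT_TYPE"])
print(cleaned)
```

On the real data the equivalent call would be `clean_sparse(vehicles_df, ...)`, with the threshold and required columns chosen after reviewing the missing-value counts.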
