# Determining the Primary Causes of Traffic Accidents in Chicago

This project's raw dataset originates from the [City of Chicago's website](https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if/about_data), where it is updated daily. I downloaded the data on May 1st, 2024 and [uploaded that snapshot to Kaggle](https://www.kaggle.com/datasets/joelmott/chicago-traffic-crashes-may-2024).

This dataset consists of three separate CSV files: one for general crash information, one for the people involved in each crash, and one for each vehicle. When merged, the resulting dataset contains over 150 columns and 3.8 million records. To improve modeling results and interpretability, I narrowed that down to 15 feature columns and one target column in this [data engineering notebook](https://github.com/joeldmott/chicago_auto_accidents_project/blob/main/data_engineering_notebook.ipynb).
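
As a rough illustration of why merging the three files inflates the row count, here is a minimal sketch with pandas. The toy values are invented, and joining on a shared `CRASH_RECORD_ID` key is an assumption about the raw files; the actual merge logic lives in the data engineering notebook and may differ.

```python
import pandas as pd

# Toy stand-ins for the three raw files; every value here is invented.
crashes = pd.DataFrame({
    "CRASH_RECORD_ID": ["a", "b"],
    "POSTED_SPEED_LIMIT": [30, 15],
})
people = pd.DataFrame({
    "CRASH_RECORD_ID": ["a", "a", "b"],
    "PERSON_TYPE": ["DRIVER", "BICYCLE", "DRIVER"],
})
vehicles = pd.DataFrame({
    "CRASH_RECORD_ID": ["a", "a", "b"],
    "MANEUVER": ["STRAIGHT AHEAD", "BACKING", "TURNING LEFT"],
})

# Assumed join key: CRASH_RECORD_ID. Each crash row is repeated once per
# (person, vehicle) pair, which is how the merged table grows so large.
merged = (
    crashes
    .merge(people, on="CRASH_RECORD_ID", how="inner")
    .merge(vehicles, on="CRASH_RECORD_ID", how="inner")
)
print(merged.shape)
```

Crash "a" has two people and two vehicles, so it alone contributes four rows to the merged table.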

In this project, I use the CSV file that resulted from that data engineering effort.

In [1]:
import json
import os
from pathlib import Path
from google.colab import userdata

# api key for importing Kaggle and downloading the datasets
os.environ["KAGGLE_KEY"] = userdata.get('KAGGLE_KEY')
os.environ["KAGGLE_USERNAME"] = userdata.get('KAGGLE_USERNAME')
# api key for the json object below
api_key = userdata.get('API_KEY')

# uses pathlib Path
kaggle_path = Path('/root/.kaggle')
os.makedirs(kaggle_path, exist_ok=True)

# open the file and dump the API key to it as JSON
with open(kaggle_path/'kaggle.json', 'w') as handl:
    json.dump(api_key, handl)

# restrict the file to owner read/write; the mode must be octal (0o600), not decimal 600
os.chmod(kaggle_path/'kaggle.json', 0o600)
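
For reference, a minimal sketch of writing a credentials file in the shape the Kaggle CLI reads (`username` and `key` fields) and locking down its permissions. The credentials are placeholders and a temporary directory stands in for `/root/.kaggle`; both are assumptions for illustration only.

```python
import json
import os
import stat
import tempfile
from pathlib import Path

# Placeholder credentials; in Colab these would come from userdata.get(...).
credentials = {"username": "example_user", "key": "example_key"}

# A temporary directory stands in for /root/.kaggle in this sketch.
config_dir = Path(tempfile.mkdtemp()) / ".kaggle"
config_dir.mkdir(parents=True, exist_ok=True)
config_file = config_dir / "kaggle.json"

with open(config_file, "w") as handle:
    json.dump(credentials, handle)

# os.chmod takes an integer mode; owner read/write is octal 0o600.
# Decimal 600 is a different bit pattern (0o1130) and sets other permissions.
os.chmod(config_file, 0o600)
mode = stat.S_IMODE(config_file.stat().st_mode)
print(oct(mode))
```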

In [7]:
import kaggle
! kaggle datasets download joelmott/chicago-traffic-crashes-may-2024 -f trimmed_chicago_crashes_data.csv

Dataset URL: https://www.kaggle.com/datasets/joelmott/chicago-traffic-crashes-may-2024
License(s): CC0-1.0
Downloading trimmed_chicago_crashes_data.csv.zip to /content
 47% 5.00M/10.6M [00:00<00:00, 36.3MB/s]
100% 10.6M/10.6M [00:00<00:00, 56.1MB/s]


In [10]:
!unzip /content/trimmed_chicago_crashes_data.csv

Archive:  /content/trimmed_chicago_crashes_data.csv.zip
  inflating: trimmed_chicago_crashes_data.csv  


In [11]:
import pandas as pd
df = pd.read_csv('/content/trimmed_chicago_crashes_data.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,POSTED_SPEED_LIMIT,WEATHER_CONDITION,LIGHTING_CONDITION,ROADWAY_SURFACE_COND,NUM_UNITS,CRASH_HOUR,CRASH_DAY_OF_WEEK,CRASH_MONTH,LATITUDE,LONGITUDE,PERSON_TYPE,SEX,AGE,TRAVEL_DIRECTION,MANEUVER,TARGET
0,0,30,CLEAR,DAYLIGHT,DRY,2,17,6,8,41.942976,-87.761883,BICYCLE,M,14,S,STRAIGHT AHEAD,1
1,1,30,CLEAR,DAYLIGHT,DRY,2,17,6,8,41.942976,-87.761883,DRIVER,M,36,S,STRAIGHT AHEAD,1
2,2,15,CLEAR,DAYLIGHT,DRY,2,12,4,9,41.744152,-87.585945,DRIVER,M,55,W,BACKING,1
3,3,15,CLEAR,DAYLIGHT,DRY,2,12,4,9,41.744152,-87.585945,DRIVER,M,55,S,SLOW/STOP IN TRAFFIC,1
4,4,30,CLEAR,DAYLIGHT,DRY,2,11,4,9,41.937252,-87.776321,DRIVER,M,39,S,STRAIGHT AHEAD,1


In [13]:
# An extra index column ('Unnamed: 0') came along with the CSV; let's drop it
df.drop('Unnamed: 0', axis=1, inplace=True)
# Now let's check that the data looks like it did back in the data engineering notebook:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 884772 entries, 0 to 884771
Data columns (total 16 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   POSTED_SPEED_LIMIT    884772 non-null  int64  
 1   WEATHER_CONDITION     884772 non-null  object 
 2   LIGHTING_CONDITION    884772 non-null  object 
 3   ROADWAY_SURFACE_COND  884772 non-null  object 
 4   NUM_UNITS             884772 non-null  int64  
 5   CRASH_HOUR            884772 non-null  int64  
 6   CRASH_DAY_OF_WEEK     884772 non-null  int64  
 7   CRASH_MONTH           884772 non-null  int64  
 8   LATITUDE              884772 non-null  float64
 9   LONGITUDE             884772 non-null  float64
 10  PERSON_TYPE           884772 non-null  object 
 11  SEX                   884772 non-null  object 
 12  AGE                   884772 non-null  int64  
 13  TRAVEL_DIRECTION      884772 non-null  object 
 14  MANEUVER              884772 non-null  object 
 15  

This file is more manageable than the gargantuan raw data, but it still contains plenty of records. Consequently, we'll validate this project using a train-test split as opposed to cross-validation.

In [14]:
from sklearn.model_selection import train_test_split
X = df.drop('TARGET', axis=1)
y = df['TARGET']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=210)

In [24]:
#establishing which features are numeric or categorical
numeric_features = ['POSTED_SPEED_LIMIT', 'NUM_UNITS', 'CRASH_HOUR', 'LATITUDE', 'LONGITUDE', 'AGE']

categorical_features = ['WEATHER_CONDITION', 'LIGHTING_CONDITION', 'ROADWAY_SURFACE_COND',
                        'CRASH_DAY_OF_WEEK', 'PERSON_TYPE', 'SEX', 'TRAVEL_DIRECTION', 'MANEUVER']

#splitting them up for preprocessing
X_train_numeric = X_train[numeric_features]
X_train_categorical = X_train[categorical_features]
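
When feature lists are typed by hand, it can be worth confirming that every column has been assigned to exactly one of them. A minimal sketch of that check follows; the column names are copied from the `df.info()` listing above (minus `TARGET`), and the helper variable names are hypothetical.

```python
# Feature columns as listed by df.info() above, excluding TARGET.
feature_columns = [
    'POSTED_SPEED_LIMIT', 'WEATHER_CONDITION', 'LIGHTING_CONDITION',
    'ROADWAY_SURFACE_COND', 'NUM_UNITS', 'CRASH_HOUR', 'CRASH_DAY_OF_WEEK',
    'CRASH_MONTH', 'LATITUDE', 'LONGITUDE', 'PERSON_TYPE', 'SEX', 'AGE',
    'TRAVEL_DIRECTION', 'MANEUVER',
]

numeric_features = ['POSTED_SPEED_LIMIT', 'NUM_UNITS', 'CRASH_HOUR',
                    'LATITUDE', 'LONGITUDE', 'AGE']
categorical_features = ['WEATHER_CONDITION', 'LIGHTING_CONDITION',
                        'ROADWAY_SURFACE_COND', 'CRASH_DAY_OF_WEEK',
                        'PERSON_TYPE', 'SEX', 'TRAVEL_DIRECTION', 'MANEUVER']

# Columns that appear in neither list are silently left out of preprocessing.
assigned = set(numeric_features) | set(categorical_features)
unassigned = [col for col in feature_columns if col not in assigned]

# Columns that appear in both lists would be processed twice.
overlap = sorted(set(numeric_features) & set(categorical_features))

print(unassigned)  # → ['CRASH_MONTH']
print(overlap)     # → []
```

With the lists as written, `CRASH_MONTH` falls into neither group, so it does not appear in the preprocessed training data.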

## Preprocessing the Training Data

First, we'll standardize the numeric features.

In [27]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_numeric_scaled = scaler.fit_transform(X_train_numeric)
X_train_numeric_scaled = pd.DataFrame(X_train_numeric_scaled,
                                      columns = X_train_numeric.columns,
                                      index = X_train_numeric.index)
X_train_numeric_scaled.head()

Unnamed: 0,POSTED_SPEED_LIMIT,NUM_UNITS,CRASH_HOUR,LATITUDE,LONGITUDE,AGE
571273,0.132509,1.141696,-1.013182,0.107484,-0.085609,-1.109211
751169,0.132509,-0.333483,-1.013182,-0.112652,-0.012405,0.459883
43397,0.132509,1.141696,-2.335921,-0.233132,0.113552,-1.046447
847787,0.132509,-0.333483,1.254371,0.310612,-0.137801,-1.485793
542731,0.132509,-0.333483,-0.068368,0.106912,0.022791,0.39712
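
The scaler above is fit on the training data only. A minimal sketch of how the same fitted scaler would later be applied to held-out data is shown below, using invented toy values; the point is that the test set is transformed with the training mean and standard deviation rather than refit.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy numeric data; the values are invented for illustration.
train = pd.DataFrame({"AGE": [20.0, 40.0, 60.0], "CRASH_HOUR": [8.0, 12.0, 16.0]})
test = pd.DataFrame({"AGE": [40.0, 80.0], "CRASH_HOUR": [12.0, 20.0]})

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training rows.
train_scaled = pd.DataFrame(scaler.fit_transform(train),
                            columns=train.columns, index=train.index)

# transform reuses those training statistics, so no test information leaks in.
test_scaled = pd.DataFrame(scaler.transform(test),
                           columns=test.columns, index=test.index)

print(train_scaled.mean().round(6).tolist())
print(test_scaled)
```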


Looks good; let's one-hot encode the categorical variables. Since we'll be using regularization to optimize our models, we don't need to drop one dummy variable from each category to avoid collinearity. Keeping every category also helps with model interpretability, because each one gets its own coefficient.

In [32]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse_output=False)
X_train_categorical_ohe = ohe.fit_transform(X_train_categorical)
X_train_categorical_ohe = pd.DataFrame(X_train_categorical_ohe,
                                       columns = ohe.get_feature_names_out(),
                                       index = X_train_categorical.index)
X_train_categorical_ohe.head()

array([[0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.]])

In [44]:
X_train_categorical_ohe.info()

<class 'pandas.core.frame.DataFrame'>
Index: 707817 entries, 571273 to 303626
Data columns (total 63 columns):
 #   Column                                       Non-Null Count   Dtype  
---  ------                                       --------------   -----  
 0   WEATHER_CONDITION_BLOWING SAND, SOIL, DIRT   707817 non-null  float64
 1   WEATHER_CONDITION_BLOWING SNOW               707817 non-null  float64
 2   WEATHER_CONDITION_CLEAR                      707817 non-null  float64
 3   WEATHER_CONDITION_CLOUDY/OVERCAST            707817 non-null  float64
 4   WEATHER_CONDITION_FOG/SMOKE/HAZE             707817 non-null  float64
 5   WEATHER_CONDITION_FREEZING RAIN/DRIZZLE      707817 non-null  float64
 6   WEATHER_CONDITION_RAIN                       707817 non-null  float64
 7   WEATHER_CONDITION_SEVERE CROSS WIND GATE     707817 non-null  float64
 8   WEATHER_CONDITION_SLEET/HAIL                 707817 non-null  float64
 9   WEATHER_CONDITION_SNOW                       707817 non-nul

In [45]:
X_train_preprocessed = pd.concat([X_train_numeric_scaled, X_train_categorical_ohe], axis=1)
X_train_preprocessed.head()

Unnamed: 0,POSTED_SPEED_LIMIT,NUM_UNITS,CRASH_HOUR,LATITUDE,LONGITUDE,AGE,"WEATHER_CONDITION_BLOWING SAND, SOIL, DIRT",WEATHER_CONDITION_BLOWING SNOW,WEATHER_CONDITION_CLEAR,WEATHER_CONDITION_CLOUDY/OVERCAST,...,MANEUVER_SKIDDING/CONTROL LOSS,MANEUVER_SLOW/STOP - LEFT TURN,MANEUVER_SLOW/STOP - LOAD/UNLOAD,MANEUVER_SLOW/STOP - RIGHT TURN,MANEUVER_SLOW/STOP IN TRAFFIC,MANEUVER_STARTING IN TRAFFIC,MANEUVER_STRAIGHT AHEAD,MANEUVER_TURNING LEFT,MANEUVER_TURNING RIGHT,MANEUVER_U-TURN
571273,0.132509,1.141696,-1.013182,0.107484,-0.085609,-1.109211,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
751169,0.132509,-0.333483,-1.013182,-0.112652,-0.012405,0.459883,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
43397,0.132509,1.141696,-2.335921,-0.233132,0.113552,-1.046447,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
847787,0.132509,-0.333483,1.254371,0.310612,-0.137801,-1.485793,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
542731,0.132509,-0.333483,-0.068368,0.106912,0.022791,0.39712,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [47]:
X_train_preprocessed.shape

(707817, 69)

That's still too many columns, but I can't drop any more in good conscience without further information. Let's make use of Principal Component Analysis (PCA) to see which features are more significant.
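
As a minimal sketch of the kind of information PCA exposes, the example below fits scikit-learn's `PCA` to synthetic data (not the crash data) and inspects the explained variance ratios and component loadings. The toy columns `a`, `b`, and `c` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic data: columns "a" and "b" are nearly duplicates, "c" is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

pca = PCA(n_components=3).fit(toy)

# Share of total variance captured by each component, in descending order.
explained = pca.explained_variance_ratio_
cumulative = np.cumsum(explained)

# Loadings: how strongly each original feature contributes to each component.
loadings = pd.DataFrame(pca.components_, columns=toy.columns)

print(explained.round(3))
print(loadings.round(2))
```

Because `a` and `b` carry almost the same information, the last component explains very little variance, which is the signal that the data could be represented in fewer dimensions.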