# Chicago, IL : Car Crash Analysis & Predictive Modeling


### *Predicting severity of car crashes with Machine Learning Models*

Authors: [Christos Maglaras](mailto:Christo111M@gmail.com), [Marcos Panyagua](mailto:marcosvppfernandes@gmail.com), [Jamie Dowat](mailto:jamie_dowat44@yahoo.com)

Date: 3/12/2021

![chicago](img/chicago_night_drive.jpg)

## Stakeholder: Chicago Department of Transportation

![cdot](img/cdot.png)

### Business Understanding

Just a week ago, the National Safety Council released a [report](https://www.nsc.org/newsroom/motor-vehicle-deaths-2020-estimated-to-be-highest) containing some disturbing statistics from 2020. The first paragraph begins as follows:
> For the first time since 2007, preliminary data from the National Safety Council show that as many as 42,060 people are estimated to have died in motor vehicle crashes in 2020. That marks an 8% increase over 2019 in a year where people drove significantly less frequently because of the pandemic.

According to their data, the US hasn't seen an increase like this since **1924**.

Following this trend, the **Governors Highway Safety Association** reported that the [*pedestrian* fatality rate](https://www.smartcitiesdive.com/news/ghsa-projects-highest-pedestrian-death-rate-since-1988/573203/) has reached a **30-year high**, with nighttime pedestrian fatalities up 67% and daytime fatalities up 16%, highlighting the need for *safer road crossings* and increased efforts to make pedestrians and vehicles more *visible*.

Narrowing our focus even further, in **Illinois**, around **1000** people were KILLED in motor vehicle crashes in **2019** alone. 

**Advocates for Highway and Auto Safety** have scored all US states against their [Roadmap for State Highway Safety Laws](https://saferoads.org/wp-content/uploads/2020/01/Advocates-for-Highway-and-Auto-Safety-2020-Roadmap-of-State-Highway-Safety-Laws.pdf), a set of 16 laws that cover occupant protection (seat belt, helmet laws), child protection, and teen driving.

![scoring](img/scoringsafety.png)

When Illinois is [scored](https://saferoads.org/state/illinois/) against this Roadmap, it receives a yellow rating (Caution), since it still lacks the following safety laws:

* All-Rider Motorcycle Helmet Law
* Booster Seat Law
* GDL (Graduated Driver's License) – Minimum Age 16 for Learner’s Permit
* GDL – Stronger Nighttime Restriction Provision
* GDL – Stronger Passenger Restriction Provision
* GDL – Age 18 for Unrestricted License

Currently, the Chicago Department of Transportation is working with the city's new initiative, **Vision Zero**, to reduce accidents on the road. In Vision Zero's [report](https://8gq.ef1.myftpupload.com/wp-content/uploads/2016/05/17_0612-VZ-Action-Plan_FOR-WEB.pdf) and action plan, they used crash data to identify high crash corridors in the city as well as other important trends to guide education, road safety improvements, and more.

![quotes](img/visionzeroquotes.png)

With all of this in mind, we let these current safety movements and road safety problems guide our exploration and modeling of the data.

More importantly, this business understanding was the sole driver of our choice of target: *severity of crash based on injury*.

******

## Predictive Modeling Preview

We found that one of the most useful things a predictive model can do for this business problem is show how different factors of a crash relate to the severity of the injuries sustained in it.

We tried Logistic Regression, K-Nearest Neighbors, Decision Tree, Naive Bayes, and Random Forest classifiers before settling on our final model, a Bayesian-optimized XGBoost classifier.

We experimented with both BINARY and TERNARY classification, and our final model uses TERNARY classification. The targets are defined as follows, using the MOST_SEVERE_INJURY column in the Crashes dataset:

* BINARY:
    * Class 0: No Injury
    * Class 1: Injury
    
* TERNARY:
    * Class 0: No Injury
    * Class 1: NON-INCAPACITATING injuries
    * Class 2: INCAPACITATING or FATAL injuries
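
The class definitions above can be sketched as a simple label mapping. This is a minimal illustration, not the project's actual code: the label strings are assumed to match the portal's `most_severe_injury` values, and placing "REPORTED, NOT EVIDENT" in Class 1 is an assumption of this sketch, since the source does not say where that label belongs. `TERNARY_MAP` and `make_targets` are hypothetical names.

```python
import pandas as pd

# Assumed label strings; "REPORTED, NOT EVIDENT" -> class 1 is this sketch's choice.
TERNARY_MAP = {
    "NO INDICATION OF INJURY": 0,
    "REPORTED, NOT EVIDENT": 1,
    "NONINCAPACITATING INJURY": 1,
    "INCAPACITATING INJURY": 2,
    "FATAL": 2,
}


def make_targets(most_severe_injury: pd.Series) -> pd.DataFrame:
    """Return binary and ternary targets; missing or unmapped labels stay NaN."""
    ternary = most_severe_injury.str.upper().str.strip().map(TERNARY_MAP)
    # Binary target: any injury class (1 or 2) collapses to 1.
    binary = (ternary > 0).astype(float).where(ternary.notna())
    return pd.DataFrame({"binary": binary, "ternary": ternary})


demo = pd.Series(
    ["NO INDICATION OF INJURY", "NONINCAPACITATING INJURY", "FATAL", None]
)
targets = make_targets(demo)
print(targets)
```

Keeping unmapped labels as NaN rather than silently assigning them to a class makes it easy to check how many rows would be dropped before modeling.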

******

## Data: [Chicago City Data Portal](https://data.cityofchicago.org/)

![ccdp](img/chicagocitydataportal.jpg)

### [Crashes](https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if):

##### Number of Rows: 482,866

*Shows crash data from the Chicago Police Department's **E-Crash** system*

**"All crashes are recorded as per the format specified in the Traffic Crash Report, SR1050, of the Illinois Department of Transportation."**

| Column Name                 | Description                |
| --------------------------- | -------------------------- |
| crash_record_id  |  Can be used to link to the same crash in the Vehicles and People datasets. |
| rd_no | Chicago Police Department report number|
| crash_date | Date and time of crash as entered by the reporting officer |
| posted_speed_limit  | Posted speed limit, as determined by reporting officer |
| traffic_control_device | Traffic control device present at crash location, as determined by reporting officer (signals, stop sign, etc) |
| device_condition  | Condition of traffic control device, as determined by reporting officer |
| weather_condition | Weather condition at time of crash, as determined by reporting officer |
| lighting_condition | Light condition at time of crash, as determined by reporting officer |
| first_crash_type | Type of first collision in crash |
| trafficway_type  | Trafficway type, as determined by reporting officer |
| lane_ct | Total number of through lanes in either direction, excluding turn lanes, as determined by reporting officer (0 = intersection)|
| alignment | Street alignment at crash location, as determined by reporting officer |
| roadway_surface_cond        | Road surface condition, as determined by reporting officer |
| road_defect | Road defects, as determined by reporting officer |
| crash_type | A general severity classification for the crash. Can be either Injury and/or Tow Due to Crash or No Injury / Drive Away |
| damage | A field observation of estimated damage. |
| prim_contributory_cause   | The factor which was most significant in causing the crash, as determined by officer judgment |
| sec_contributory_cause | The factor which was second most significant in causing the crash, as determined by officer judgment |
| street_name | Street address name of crash location, as determined by reporting officer|
| num_units | Number of units involved in the crash. A unit can be a motor vehicle, a pedestrian, a bicyclist, or another non-passenger roadway user. Each unit represents a mode of traffic with an independent trajectory. |
| most_severe_injury | Most severe injury sustained by any person involved in the crash |
| injuries_total | Total persons sustaining fatal, incapacitating, non-incapacitating, and possible injuries as determined by the reporting officer |
| injuries_fatal | Total persons sustaining fatal injuries in the crash |
| injuries_incapacitating | Total persons sustaining incapacitating/serious injuries in the crash as determined by the reporting officer. Any injury other than fatal injury, which prevents the injured person from walking, driving, or normally continuing the activities they were capable of performing before the injury occurred. Includes severe lacerations, broken limbs, skull or chest injuries, and abdominal injuries. |
| injuries_non_incapacitating | Total persons sustaining non-incapacitating injuries in the crash as determined by the reporting officer. Any injury, other than fatal or incapacitating injury, which is evident to observers at the scene of the crash. Includes lump on head, abrasions, bruises, and minor lacerations. |
| crash_hour | The hour of the day component of CRASH_DATE. |
| crash_day_of_week | The day of the week component of CRASH_DATE. Sunday=1 |
| latitude | The latitude of the crash location, as determined by reporting officer, as derived from the reported address of crash |
| longitude | The longitude of the crash location, as determined by reporting officer, as derived from the reported address of crash |


### [People](https://data.cityofchicago.org/Transportation/Traffic-Crashes-People/u6pd-qa9d):

##### Number of Rows: 1,068,637

*Information about people involved in a crash and if any injuries were sustained.*

| Column Name                 | Description                |
| --------------------------- | -------------------------- |
| crash_record_id | This number can be used to link to the same crash in the Crashes and Vehicles datasets. This number also serves as a unique ID in the Crashes dataset. |
| person_type | Type of roadway user involved in crash |
| rd_no | Chicago Police Department report number. For privacy reasons, this column is blank for recent crashes. |
| crash_date | Date and time of crash as entered by the reporting officer |
| seat_no | Code for seating position of motor vehicle occupant: 1= driver, 2= center front, 3 = front passenger, 4 = second row left, 5 = second row center, 6 = second row right, 7 = enclosed passengers, 8 = exposed passengers, 9= unknown position, 10 = third row left, 11 = third row center, 12 = third row right |
| city | City of residence of person involved in crash |
| state | State of residence of person involved in crash |
| zipcode | ZIP Code of residence of person involved in crash |
| sex | Gender of person involved in crash, as determined by reporting officer |
| age | Age of person involved in crash |
| drivers_license_state | State issuing driver's license of person involved in crash |
| drivers_license_class | Class of driver's license of person involved in crash |
| safety_equipment | Safety equipment used by vehicle occupant in crash, if any |
| airbag_deployed | Whether vehicle occupant airbag deployed as result of crash |
| ejection | Whether vehicle occupant was ejected or extricated from the vehicle as a result of crash |
| injury_classification | Severity of injury person sustained in the crash |
| driver_action | Driver action that contributed to the crash, as determined by reporting officer |
| driver_vision | What, if any, objects obscured the driver’s vision at time of crash |
| physical_condition | Driver’s apparent physical condition at time of crash, as observed by the reporting officer |
| pedpedal_action | Action of pedestrian or cyclist at the time of crash |
| pedpedal_visibility | Visibility of pedestrian or cyclist safety equipment in use at time of crash |
| pedpedal_location | Location of pedestrian or cyclist at the time of crash |
| bac_result | Status of blood alcohol concentration testing for driver or other person involved in crash |
| bac_result value | Driver’s blood alcohol concentration test result (fatal crashes may include pedestrian or cyclist results) |
| cell_phone_use | Whether person was/was not using cellphone at the time of the crash, as determined by the reporting officer |

### [Vehicles](https://data.cityofchicago.org/Transportation/Traffic-Crashes-Vehicles/68nd-jvt3):

##### Number of Rows: 987,148

*Information about vehicles ("units") involved in a traffic crash.*

| Column Name                 | Description                |
| --------------------------- | -------------------------- |
| crash_record_id | This number can be used to link to the same crash in the Crashes and People datasets. This number also serves as a unique ID in the Crashes dataset. |
| rd_no | Chicago Police Department report number. For privacy reasons, this column is blank for recent crashes. |
| crash_date | Date and time of crash as entered by the reporting officer |
| unit_type | The type of unit (e.g., driver, parked, pedestrian, bicycle) |
| num_passengers | Number of passengers in the vehicle. The driver is not included. More information on passengers is in the People dataset. |
| make | The make (brand) of the vehicle, if relevant |
| model | The model of the vehicle, if relevant |
| lic_plate_state | The state issuing the license plate of the vehicle, if relevant |
| vehicle_year | The model year of the vehicle, if relevant |
| vehicle_defect | Indicates part of car containing defect (brakes, wheels, etc.) |
| vehicle_type | The type of vehicle, if relevant (passenger, truck, bus, etc) |
| vehicle_use | The normal use of the vehicle, if relevant |
| maneuver | The action the unit was taking prior to the crash, as determined by the reporting officer |
| towed_I | Indicator of whether the vehicle was towed |
| occupant_cnt | The number of people in the unit, as determined by the reporting officer |
| exceed_speed_limit_I | Indicator of whether the unit was speeding, as determined by the reporting officer |
| first_contact_point | Indicates orientation on car that was hit (front, rear, etc) |



In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', None)

In [119]:
crashes = pd.read_csv('data/traffic_crashes_chicago.csv', low_memory=False)

In [120]:
people = pd.read_csv('data/traffic_crashes_people.csv', low_memory=False)

In [121]:
vehicles = pd.read_csv('data/traffic_crashes_vehicles.csv', low_memory=False)
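
The data dictionaries above describe `crash_record_id` as the key that links the three datasets. Because People has several rows per crash, one possible way to join it to Crashes is to aggregate to one row per crash first. The sketch below uses toy frames with made-up values; `crashes_demo`, `people_demo`, `n_people`, and `min_age` are hypothetical names, and this is not necessarily how the project built its joined dataset.

```python
import pandas as pd

crashes_demo = pd.DataFrame({
    "crash_record_id": ["a1", "b2"],
    "posted_speed_limit": [30, 35],
})
people_demo = pd.DataFrame({
    "crash_record_id": ["a1", "a1", "b2"],
    "person_type": ["DRIVER", "PASSENGER", "DRIVER"],
    "age": [34, 12, 51],
})

# Aggregate to one row per crash so the join does not duplicate crash rows.
people_per_crash = (
    people_demo.groupby("crash_record_id")
    .agg(n_people=("person_type", "size"), min_age=("age", "min"))
    .reset_index()
)

# validate raises if either side turns out to have repeated keys.
joined = crashes_demo.merge(
    people_per_crash, on="crash_record_id", how="left", validate="one_to_one"
)
print(joined)
```

A left join keeps crashes that have no matching People rows, which then show up as missing values instead of disappearing.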

## Exploratory Data Analysis (EDA)

In [95]:
from src import data_cleaning

In [96]:
crashes, people, vehicles = data_cleaning.column_mask([crashes, people, vehicles])

In [100]:
from src import eda
%matplotlib inline

In [46]:
from pandas.tseries.holiday import USFederalHolidayCalendar
cal = USFederalHolidayCalendar()
holidays = cal.holidays(start='2013-01-01', end='2022-12-31', return_name=True)
holidays = holidays.reset_index(name='holiday').rename(columns={'index':'date'})



# Keep four major holidays; holiday names vary across pandas versions.
major_holidays = ['July 4th', 'Thanksgiving', 'Christmas', 'New Years Day']
holidays = holidays[holidays['holiday'].isin(major_holidays)]

# Vectorized parse instead of a row-wise apply.
crashes['crash_date'] = pd.to_datetime(crashes['crash_date'])
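
One way a holiday table like this could be used is to flag crashes that happened on a holiday. The sketch below is an illustration on a toy frame, not the project's feature engineering: it flags every US federal holiday rather than the four selected above, which avoids depending on holiday names. `crash_demo` and `is_holiday` are hypothetical names.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

crash_demo = pd.DataFrame({
    "crash_date": pd.to_datetime(
        ["2020-12-25 14:30", "2020-11-26 09:05", "2020-03-12 18:45"]
    )
})

# Without return_name, holidays() returns a DatetimeIndex of holiday dates.
holiday_dates = USFederalHolidayCalendar().holidays(
    start="2020-01-01", end="2020-12-31"
)

# normalize() drops the time of day so timestamps can match holiday dates.
crash_demo["is_holiday"] = (
    crash_demo["crash_date"].dt.normalize().isin(holiday_dates)
)
print(crash_demo)
```

Note that this calendar applies observance rules, so a holiday that falls on a weekend is listed on the nearest workday rather than on its calendar date.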

## Predictive Modeling

In [48]:
# Focus metric: RECALL
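
Recall is the share of actual positives that the model catches; here, the share of injury crashes that are predicted as injury crashes. A plausible reading of this choice is that missing an injury crash costs more than raising a false alarm, though the notebook does not spell that out. A small worked example on made-up labels (the function name `recall_precision` is hypothetical):

```python
def recall_precision(y_true, y_pred, positive=1):
    """Compute recall and precision for one positive class from raw labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Three real injury crashes; the model finds two and raises two false alarms.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]
recall, precision = recall_precision(y_true, y_pred)
print(f"recall={recall:.3f} precision={precision:.3f}")
```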

In [56]:
for col in crash_mod.columns:
    print(col, '\n\n', crash_mod[col].value_counts(), '\n\n\n')

traffic_control_device 

 No_device         127586
device_present    117573
Name: traffic_control_device, dtype: int64 



weather_condition 

 Clear        198998
Not_Clear     46161
Name: weather_condition, dtype: int64 



first_crash_type 

 REAR END                        66207
TURNING                         42481
SIDESWIPE SAME DIRECTION        40069
PARKED MOTOR VEHICLE            34114
ANGLE                           31392
FIXED OBJECT                    10225
PEDESTRIAN                       5488
PEDALCYCLIST                     3590
SIDESWIPE OPPOSITE DIRECTION     3546
HEAD ON                          2448
OTHER OBJECT                     1821
REAR TO FRONT                    1594
REAR TO SIDE                      900
OTHER NONCOLLISION                674
REAR TO REAR                      256
ANIMAL                            201
OVERTURNED                        132
TRAIN                              21
Name: first_crash_type, dtype: int64 



roadway_surface_cond 

 Dry  
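
The counts above show that several raw columns were collapsed into two levels before modeling (for example `No_device` / `device_present` and `Clear` / `Not_Clear`). The binning rule itself is not shown, so the sketch below is only one way it might be done: the raw labels `NO CONTROLS` and `CLEAR` are assumed portal values, and treating everything else as the second level is an assumption of this example.

```python
import numpy as np
import pandas as pd

# Toy rows with assumed raw labels.
raw = pd.DataFrame({
    "traffic_control_device": ["NO CONTROLS", "TRAFFIC SIGNAL", "STOP SIGN/FLASHER"],
    "weather_condition": ["CLEAR", "RAIN", "SNOW"],
})

binned = pd.DataFrame({
    # Assumption: any value other than "NO CONTROLS" counts as a device.
    "traffic_control_device": np.where(
        raw["traffic_control_device"] == "NO CONTROLS", "No_device", "device_present"
    ),
    # Assumption: any value other than "CLEAR" counts as not clear.
    "weather_condition": np.where(
        raw["weather_condition"] == "CLEAR", "Clear", "Not_Clear"
    ),
})
print(binned["weather_condition"].value_counts())
```

With this rule, missing or unknown values would also land in the second level, so it may be worth handling them explicitly before binning.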

In [57]:
from src import models

In [58]:
pp = models.Preprocessor(target_col_name = 'injured', df=crash_mod)

In [59]:
target_split = pp.target_split([crash_mod])

In [60]:
tr_te_split = pp.split_train_test(target_split)

In [None]:
# !pip install xgboost
# !brew install libomp

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier

In [61]:
X_train = tr_te_split[2]['train']['X']

y_train = tr_te_split[2]['train']['y']

lr = LogisticRegression(max_iter=100000, C=0.05)

In [122]:
rec, prec, acc = models.kfold_validation(X_train, y_train, classifier=lr, 
                        continuous_cols=['crash_hour'], 
                        categorical_cols=tr_te_split[0]['train']['X'].drop(labels=['crash_hour'], axis=1).columns, 
                        smote=True,
                        minority_size=0.7, 
                        majority_reduce=0.7)

In [125]:
prec

[0.386401524172422]
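
The internals of `models.kfold_validation` are not shown here. As a rough sketch of the general idea (cross-validating a classifier while rebalancing classes), the example below resamples only the training portion of each fold so that validation scores reflect the real class ratio. It swaps in plain random oversampling for SMOTE, does not reproduce the `minority_size` or `majority_reduce` options, and runs on synthetic data; `oversample_minority` and `cv_recall_precision` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import StratifiedKFold


def oversample_minority(X, y, rng):
    """Duplicate random class-1 rows until both classes have equal size.

    Assumes class 1 is the minority class.
    """
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]


def cv_recall_precision(X, y, n_splits=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    recalls, precisions = [], []
    for train_idx, val_idx in folds.split(X, y):
        # Resample the training fold only; validation keeps the real ratio.
        X_bal, y_bal = oversample_minority(X[train_idx], y[train_idx], rng)
        model = LogisticRegression(max_iter=1000, C=0.05).fit(X_bal, y_bal)
        y_hat = model.predict(X[val_idx])
        recalls.append(recall_score(y[val_idx], y_hat))
        precisions.append(precision_score(y[val_idx], y_hat, zero_division=0))
    return float(np.mean(recalls)), float(np.mean(precisions))


# Synthetic imbalanced data, for illustration only.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(600, 3))
y = (X[:, 0] + data_rng.normal(scale=0.5, size=600) > 1.1).astype(int)
mean_recall, mean_precision = cv_recall_precision(X, y)
print(f"recall={mean_recall:.3f} precision={mean_precision:.3f}")
```

Resampling before splitting would let copies of the same minority rows appear in both training and validation data, which tends to inflate the scores.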

In [63]:
# prepped_data = pp.scale_and_ohe(tr_te_split, continuous_cols=['crash_hour'],
#                                categorical_cols=tr_te_split[0]['train']['X'].drop(labels=['crash_hour'], axis=1).columns)
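
The helper name `scale_and_ohe` suggests scaling continuous columns and one-hot encoding categorical ones. Its implementation is not shown, so the following is only a sketch of that kind of preprocessing with scikit-learn's `ColumnTransformer`, on a toy frame whose column names mirror the ones used above; `X_demo` and `preprocess` are hypothetical names.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X_demo = pd.DataFrame({
    "crash_hour": [0, 8, 17, 23],
    "weather_condition": ["Clear", "Not_Clear", "Clear", "Clear"],
    "traffic_control_device": [
        "No_device", "device_present", "No_device", "device_present"
    ],
})
continuous_cols = ["crash_hour"]
categorical_cols = [c for c in X_demo.columns if c not in continuous_cols]

preprocess = ColumnTransformer([
    ("scale", StandardScaler(), continuous_cols),
    # handle_unknown="ignore" encodes categories unseen during fit as all zeros.
    ("ohe", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Fit on training rows only, then reuse the fitted transformer on test rows.
X_prepped = preprocess.fit_transform(X_demo)
print(X_prepped.shape)
```

The result has one scaled column plus one indicator column per category level, so the toy frame becomes a 4 x 5 matrix.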

In [64]:
# smoted = pp.balance_classes(prepped_data)

In [65]:
# lr = models.LogisticRegression(max_iter=100000, C=0.1)

# X_train = prepped_data[2]['train']['X']
# y_train = prepped_data[2]['train']['y']

# lr.fit(X_train, y_train)

In [66]:
# from sklearn.metrics import recall_score, precision_score, accuracy_score, \
#                             plot_confusion_matrix, confusion_matrix


# y_pred = lr.predict(prepped_data[2]['test']['X'])

# cm = confusion_matrix(np.array(prepped_data[2]['test']['y']).reshape(-1,1), lr.predict(prepped_data[2]['test']['X']))

# ax = plt.subplot()
# sns.set(font_scale=1.5) # Adjust to fit
# sns.heatmap(cm, annot=True, ax=ax, cmap="rocket", fmt="g");  

# # Labels, title and ticks
# label_font = {'size':'12'}  # Adjust to fit
# ax.set_xlabel('Predicted labels', fontdict=label_font);
# ax.set_ylabel('Observed labels', fontdict=label_font);

# title_font = {'size':'15'}  # Adjust to fit
# ax.set_title('Confusion Matrix', fontdict=title_font);

# ax.tick_params(axis='both', which='major', labelsize=12)  # Adjust to fit
# ax.xaxis.set_ticklabels(['False', 'True']);
# ax.yaxis.set_ticklabels(['False', 'True']);



In [67]:
# y_pred = lr.predict(prepped_data[2]['test']['X'])

# recall_score(np.array(prepped_data[2]['test']['y']).reshape(-1,1), y_pred)

In [68]:
# precision_score(np.array(prepped_data[2]['test']['y']).reshape(-1,1), y_pred)

In [69]:
# lr2 = models.LogisticRegression(max_iter=100000, C=0.1)

# X_train = smoted[3]['train']['SMOTE_even_split']['X']
# y_train = smoted[3]['train']['SMOTE_even_split']['y']

# lr2.fit(X_train, y_train)

In [70]:
# X_test = smoted[3]['test']['X']
# y_test = smoted[3]['test']['y']


# lr2.score(X_test, y_test)

In [71]:
# accuracy_score(y_test, lr2.predict(X_test))

In [73]:
# from xgboost import XGBClassifier

# X_train = smoted[3]['train']['SMOTE_even_split']['X']
# y_train = smoted[3]['train']['SMOTE_even_split']['y']

# classifier1 = XGBClassifier().fit(X_train, y_train)

In [74]:
# train_p1 = classifier1.predict(smoted[3]['test']['X'])

In [75]:
# accuracy_score(smoted[3]['test']['y'], train_p1)

In [76]:
# knn = models.KNeighborsClassifier()

# X_train = smoted[3]['train']['SMOTE_even_split']['X']
# y_train = smoted[3]['train']['SMOTE_even_split']['y']

# knn.fit(X_train, y_train)

In [77]:
# X_test = smoted[3]['test']['X']
# y_test = smoted[3]['test']['y']

# knn.score(X_test, y_test)

In [78]:
# y_pred = knn.predict(X_test)

# recall_score(y_test, y_pred)

In [79]:
# precision_score(y_test, y_pred)

In [80]:
# from sklearn.ensemble import RandomForestClassifier

# rfc = RandomForestClassifier(n_estimators=100, class_weight={0:1, 1:6})

# rfc.fit(X_train, y_train)

In [81]:
# y_pred = rfc.predict(prepped_data[2]['test']['X'])

In [82]:
# recall_score(prepped_data[2]['test']['y'], y_pred)

In [83]:
# from sklearn.naive_bayes import GaussianNB

# gnb = GaussianNB()

# gnb.fit(X_train, y_train)


In [84]:
# y_pred = gnb.predict(prepped_data[2]['test']['X'])

In [85]:
# precision_score(prepped_data[2]['test']['y'], y_pred)

In [86]:
### Model dataset: people

In [87]:
### Model dataset: vehicles

In [88]:
### Model dataset: joined (all 3)

In [89]:
# Model with crashes - LogReg, RandomForest, XGBoost, Naive Bayes, RandomizedSearch

# from xgboost import XGBClassifier
# classifier1 = XGBClassifier().fit(X_train, Y_train)

# train_p1 = classifier1.predict(X_train)

# from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, hamming_loss
# print(classification_report(Y_train, train_p1))

In [90]:
# Model with all three - LogReg, RandomForest, XGBoost, Naive Bayes, RandomizedSearch

# from sklearn.model_selection import train_test_split
# X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2)
# D_train = xgb.DMatrix(X_train, label=Y_train)
# D_test  = xgb.DMatrix(X_test, label=Y_test)

# params = {
#         'min_child_weight': [5,6,7,8],
#         'gamma'           : [1.1,1.2,1.3],
#         'subsample'       : [.7,.8,.9],
#         'max_depth'       : [10,11,12,13],
#         'eta'             : [.2,.3,.4],
#         'colsample_bytree': [.4,.5,.6]        
#         }


# xgb = XGBClassifier(learning_rate=0.02,
#                     n_estimators=600,
#                     objective='binary:logistic',
#                     silent=True,
#                     nthread=1,
#                     tree_method= 'gpu_hist'
# #                     verbosity=0,
# #                    scale_pos_weight = 7
#                    )







## Sources

##### Source Code
* [Changing font size for Confusion Matrix](https://stackoverflow.com/questions/59839782/confusion-matrix-font-size)
* [SMOTE-ing Data Tutorial](https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/)

##### Business Understanding

* [Speeding Increases During Pandemic, Prompting Safety Groups To Take Action](http://www.ghsa.org/about/news/Forbes/Speed-Pilot21)
* [Calls for safer streets intensify amid 45% spike in pedestrian deaths](https://www.smartcitiesdive.com/news/calls-for-safer-streets-intensify-amid-45-spike-in-pedestrian-deaths/596420/)
* [Chicago unveils its West Side Vision Zero Traffic Safety Plan](https://chi.streetsblog.org/2019/09/12/the-city-of-chicago-unveils-its-west-side-vision-zero-traffic-safety-plan/)