In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer,  make_column_selector as selector
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay, recall_score,\
    accuracy_score, precision_score, f1_score

from sklearn.dummy import DummyClassifier

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImPipeline

from sklearn.multioutput import MultiOutputClassifier

from sklearn.tree import DecisionTreeClassifier

import pickle

# Overview

This project uses crash data from Chicago to create a model that can predict what type of crash has occurred. The model uses data from the crash, person, and vehicle databases.

Chicago can use this model to determine what kind of crash has occurred by inputting only 23 simple answers.

This project used an iterative approach to answering the business problem, so this project will be explained chronologically to best illustrate how the problem was approached.

## Business Problem

The cause of a crash, known as the `crash type`, can be difficult to surmise. There are lots of variables at play, and in more serious cases people can lie or be unresponsive when questioned.

Our goal is to create a model that, given 23 simple inputs, can predict the cause of a car accident in Chicago.

## Intake Data

The intake datasets include records from 2015 to August 2022.

`Traffic_Crashes_-_Vehicles.csv` - Information about vehicle status at time of crash. Includes variables such as vehicle defect, number of crashes, and vehicle length.

`Traffic_Crashes_-_People.csv` - Information about person status at time of crash. Includes variables such as driver's license type, origin, sex, and age.

`Traffic_Crashes_-_Crashes.csv` - Information about the crash's environmental factors. Includes variables such as weather, street defect, and beat of occurrence.


# Data Preparation (Round 1)

The first step was to examine the `crashes` data using the profile tool. We identified 34 steps to set up the dataframe for modeling. Notable steps included removing columns that weren't helpful or had a lot of missing values.

Several variables were also engineered at this point to simplify the modeling. The biggest was condensing all of the `injury` fields into a single `INJURY_LEVEL` to reduce complexity. This level increases with the severity of the injury (no injury / non-incapacitating / incapacitating / fatal).
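
As an illustration of that condensing step, here is a minimal sketch. The real logic lives in `EDA_CRASHES.ipynb`; the helper name `add_injury_level`, the 0-3 coding, and the toy rows are assumptions made for this example, and only the three injury count columns shown in the raw column list are used.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw crashes table; only the injury count columns matter here.
toy = pd.DataFrame({
    "INJURIES_FATAL":              [0, 0, 0, 1],
    "INJURIES_INCAPACITATING":     [0, 0, 2, 0],
    "INJURIES_NON_INCAPACITATING": [0, 1, 1, 3],
})

def add_injury_level(df):
    """Hypothetical helper: collapse injury counts into one ordinal level (0-3)."""
    # np.select picks the first matching condition, so the worst outcome wins.
    conditions = [
        df["INJURIES_FATAL"] > 0,
        df["INJURIES_INCAPACITATING"] > 0,
        df["INJURIES_NON_INCAPACITATING"] > 0,
    ]
    levels = [3, 2, 1]  # fatal, incapacitating, non-incapacitating
    out = df.copy()     # leave the caller's frame untouched
    out["INJURY_LEVEL"] = np.select(conditions, levels, default=0)  # 0 = no injury
    return out

toy_levels = add_injury_level(toy)
print(toy_levels["INJURY_LEVEL"].tolist())
```

Ordering the conditions from most to least severe is what makes a crash with both a fatality and minor injuries land in the fatal level.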

The `crashes` dataframe was reduced from 49 to 18 columns.
Although some difficult values remain, we can still use this data to create our first model.

It is also becoming evident that further cleaning may be required. Most of these variables are categorical, with many (10+) categories. This dataset alone is also fairly large at 643k datapoints.
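
A quick way to see how heavy the categorical columns are is to count distinct values per column. The sketch below uses a made-up three-column frame rather than the real `crashes_clean` data; the sum of the counts gives a rough upper bound on how many columns one-hot encoding would create.

```python
import pandas as pd

# Made-up rows; column names follow the cleaned crashes dataframe.
toy = pd.DataFrame({
    "WEATHER_CONDITION": ["CLEAR", "RAIN", "SNOW", "CLEAR"],
    "LIGHTING_CONDITION": ["DAYLIGHT", "DARKNESS", "DAYLIGHT", "DUSK"],
    "POSTED_SPEED_LIMIT": [30, 30, 35, 25],
})

# Distinct values per non-numeric column, largest first.
cardinality = toy.select_dtypes(exclude="number").nunique().sort_values(ascending=False)
print(cardinality)

# Rough upper bound on the number of columns one-hot encoding would produce.
ohe_width = int(cardinality.sum())
print(ohe_width)
```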

(FROM ./EDA_cleaning/EDA_CRASHES.ipynb)

In [8]:
crashes = pd.read_csv('./Datasets/Traffic_Crashes_-_Crashes.csv')

In [9]:
crashes.columns

Index(['CRASH_RECORD_ID', 'RD_NO', 'CRASH_DATE_EST_I', 'CRASH_DATE',
       'POSTED_SPEED_LIMIT', 'TRAFFIC_CONTROL_DEVICE', 'DEVICE_CONDITION',
       'WEATHER_CONDITION', 'LIGHTING_CONDITION', 'FIRST_CRASH_TYPE',
       'TRAFFICWAY_TYPE', 'LANE_CNT', 'ALIGNMENT', 'ROADWAY_SURFACE_COND',
       'ROAD_DEFECT', 'REPORT_TYPE', 'CRASH_TYPE', 'INTERSECTION_RELATED_I',
       'NOT_RIGHT_OF_WAY_I', 'HIT_AND_RUN_I', 'DAMAGE', 'DATE_POLICE_NOTIFIED',
       'PRIM_CONTRIBUTORY_CAUSE', 'SEC_CONTRIBUTORY_CAUSE', 'STREET_NO',
       'STREET_DIRECTION', 'STREET_NAME', 'BEAT_OF_OCCURRENCE',
       'PHOTOS_TAKEN_I', 'STATEMENTS_TAKEN_I', 'DOORING_I', 'WORK_ZONE_I',
       'WORK_ZONE_TYPE', 'WORKERS_PRESENT_I', 'NUM_UNITS',
       'MOST_SEVERE_INJURY', 'INJURIES_TOTAL', 'INJURIES_FATAL',
       'INJURIES_INCAPACITATING', 'INJURIES_NON_INCAPACITATING',
       'INJURIES_REPORTED_NOT_EVIDENT', 'INJURIES_NO_INDICATION',
       'INJURIES_UNKNOWN', 'CRASH_HOUR', 'CRASH_DAY_OF_WEEK', 'CRASH_MONTH',
       'LA

In [13]:
# Use a context manager so the file handle is closed after loading
with open('./Datasets/clean_crashes.pkl', 'rb') as unpickleFile:
    crashes_clean = pickle.load(unpickleFile, encoding='bytes')

In [14]:
crashes_clean.columns

Index(['CRASH_RECORD_ID', 'POSTED_SPEED_LIMIT', 'WEATHER_CONDITION',
       'LIGHTING_CONDITION', 'FIRST_CRASH_TYPE', 'TRAFFICWAY_TYPE',
       'ALIGNMENT', 'ROADWAY_SURFACE_COND', 'ROAD_DEFECT', 'CRASH_TYPE',
       'PRIM_CONTRIBUTORY_CAUSE', 'SEC_CONTRIBUTORY_CAUSE', 'STREET_DIRECTION',
       'BEAT_OF_OCCURRENCE', 'CRASH_HOUR', 'CRASH_DAY_OF_WEEK', 'CRASH_MONTH',
       'INJURY_LEVEL'],
      dtype='object')

# Modeling

## First Simple Model

With several components of this clean crashes dataframe, we went ahead and created our first simple model.

This first simple model is only trained on `WEATHER_CONDITION` and `POSTED_SPEED_LIMIT` to predict the target of `PRIM_CONTRIBUTORY_CAUSE`. We chose these variables arbitrarily; looking back, we could have first fit a quick model to find the most important variables.

We also made sure to exclude variables that described events that had no effect on the crash. These mostly include internal report metrics by Chicago.

For our first model we made sure to use a train-test split and pipelines to reduce data leakage. This also let us practice the process and catch problems now that might otherwise affect later models. We did run into some problems using SimpleImputer with this data, and circled back to them later.

The model was a MultiOutputClassifier wrapping LogisticRegression, and it had an accuracy of only 0.38 on the test set. At this point we knew that we needed to add more features and reduce the number of target classes to improve model performance.
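
The general shape of such a leakage-safe setup can be sketched as follows. This is not the notebook's actual FSM: it uses made-up rows, a plain LogisticRegression instead of the MultiOutputClassifier wrapper, and imputation strategies chosen only for illustration. The point is that the split happens first and every preprocessing step is fitted inside the pipeline on the training rows only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows standing in for crashes_clean.
toy = pd.DataFrame({
    "WEATHER_CONDITION": ["CLEAR", "RAIN", "SNOW", "CLEAR"] * 10,
    "POSTED_SPEED_LIMIT": [30, 25, 35, 30] * 10,
    "PRIM_CONTRIBUTORY_CAUSE": ["FOLLOWING TOO CLOSELY",
                                "FAILING TO REDUCE SPEED TO AVOID CRASH",
                                "FAILING TO REDUCE SPEED TO AVOID CRASH",
                                "FOLLOWING TOO CLOSELY"] * 10,
})

X = toy[["WEATHER_CONDITION", "POSTED_SPEED_LIMIT"]]
y = toy["PRIM_CONTRIBUTORY_CAUSE"]

# Split BEFORE fitting any transformer so test rows never influence preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

preprocess = ColumnTransformer([
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("ohe", OneHotEncoder(handle_unknown="ignore")),
    ]), ["WEATHER_CONDITION"]),
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), ["POSTED_SPEED_LIMIT"]),
])

fsm_sketch = Pipeline([
    ("prep", preprocess),
    ("logreg", LogisticRegression(max_iter=1000)),
])
fsm_sketch.fit(X_train, y_train)            # imputer, encoder, scaler fitted on train only
fsm_score = fsm_sketch.score(X_test, y_test)  # mean accuracy on held-out rows
print(fsm_score)
```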

(FROM ./EDA_cleaning/EDA_CRASHES.ipynb)

In [16]:
with open('./EDA_cleaning/FSM.sav', 'rb') as unpickleFile:
    FSM = pickle.load(unpickleFile, encoding='bytes')

In [17]:
with open('./EDA_cleaning/FSM_Xtest_ct.sav', 'rb') as unpickleFile:
    FSM_X_test_ct = pickle.load(unpickleFile, encoding='bytes')

with open('./EDA_cleaning/FSM_ytest.sav', 'rb') as unpickleFile:
    FSM_y_test = pickle.load(unpickleFile, encoding='bytes')

In [18]:
FSM.score(FSM_X_test_ct, FSM_y_test)

0.37907858741606565
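
One way to put an accuracy like this in context is to compare it with a majority-class baseline. The sketch below uses toy labels, not the crash data, and shows what `DummyClassifier` (already imported above) would report; whether the FSM beat such a baseline is not something this example can tell us.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels with a 60/30/10 class split.
y_toy = np.array(["A"] * 6 + ["B"] * 3 + ["C"] * 1)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by DummyClassifier

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_score = baseline.score(X_toy, y_toy)  # equals the majority class share
print(baseline_score)
```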

# Data Preparation (Round 2)

## Combine `crash` with `person` and `vehicle`

Next, we'll go ahead and examine the `person` and `vehicle` datasets.

`person` - Again, the work focused on reducing complexity. No variables were engineered at this point; instead we dropped many columns that weren't helpful, reducing the variables from 30 to 7. We also dropped rows containing passenger data, since we assume the driver was controlling the vehicle when it crashed.

`vehicle` - Reduced complexity by cutting the variables from 72 to 11.

These dataframes were then combined with the `crash` dataset, so that every row describes a single vehicle operator, along with information about their vehicle and the crash they were in.

With this dataframe, we could create models that use each driver's information to predict the type of crash that driver got into.
We ended up with 29 variables, 1 categorical target, and 1.1 million datapoints.

- (FROM ./EDA_cleaning/EDA_People.ipynb)
- (FROM ./EDA_cleaning/EDA_Vehicles.ipynb)
- (FROM ./EDA_cleaning/joined_dataset(person-vehicle-crash).ipynb)
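
The join described above could look roughly like the sketch below. The join keys are assumptions: `CRASH_RECORD_ID` appears in the crashes columns shown earlier, while `VEHICLE_ID` and `PERSON_TYPE` are assumed names for the person/vehicle link and the driver flag. The actual merge is in the joined-dataset notebook, and the rows here are made up.

```python
import pandas as pd

# Made-up miniature versions of the three cleaned tables.
crashes = pd.DataFrame({
    "CRASH_RECORD_ID": ["c1", "c2"],
    "WEATHER_CONDITION": ["CLEAR", "RAIN"],
})
vehicles = pd.DataFrame({
    "CRASH_RECORD_ID": ["c1", "c1", "c2"],
    "VEHICLE_ID": [10, 11, 12],               # assumed key name
    "VEHICLE_TYPE": ["PASSENGER", "PASSENGER", "BUS"],
})
people = pd.DataFrame({
    "CRASH_RECORD_ID": ["c1", "c1", "c1", "c2"],
    "VEHICLE_ID": [10, 10, 11, 12],           # assumed key name
    "PERSON_TYPE": ["DRIVER", "PASSENGER", "DRIVER", "DRIVER"],  # assumed column
    "AGE": [34, 8, 51, 45],
})

# Keep drivers only, then attach their vehicle and their crash.
drivers = people[people["PERSON_TYPE"] == "DRIVER"]
joined = (
    drivers
    .merge(vehicles, on=["CRASH_RECORD_ID", "VEHICLE_ID"], how="inner")
    .merge(crashes, on="CRASH_RECORD_ID", how="inner")
)
print(joined.shape)  # one row per vehicle operator
```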

In [19]:
with open('./Datasets/clean_joined_df.pkl', 'rb') as unpickleFile:
    combined_df = pickle.load(unpickleFile, encoding='bytes')

In [20]:
combined_df.columns

Index(['DRIVERS_LICENSE_CLASS', 'SEX', 'AGE', 'SAFETY_EQUIPMENT', 'UNIT_TYPE',
       'NUM_PASSENGERS', 'MAKE', 'MODEL', 'VEHICLE_DEFECT', 'VEHICLE_TYPE',
       'VEHICLE_USE', 'MANEUVER', 'FIRST_CONTACT_POINT', 'POSTED_SPEED_LIMIT',
       'WEATHER_CONDITION', 'LIGHTING_CONDITION', 'FIRST_CRASH_TYPE',
       'TRAFFICWAY_TYPE', 'ALIGNMENT', 'ROADWAY_SURFACE_COND', 'ROAD_DEFECT',
       'CRASH_TYPE', 'PRIM_CONTRIBUTORY_CAUSE', 'SEC_CONTRIBUTORY_CAUSE',
       'STREET_DIRECTION', 'BEAT_OF_OCCURRENCE', 'CRASH_HOUR',
       'CRASH_DAY_OF_WEEK', 'CRASH_MONTH', 'INJURY_LEVEL'],
      dtype='object')

In [21]:
combined_df.shape

(1088146, 30)

In [23]:
combined_df.dtypes

DRIVERS_LICENSE_CLASS       object
SEX                         object
AGE                        float64
SAFETY_EQUIPMENT            object
UNIT_TYPE                   object
NUM_PASSENGERS             float64
MAKE                        object
MODEL                       object
VEHICLE_DEFECT              object
VEHICLE_TYPE                object
VEHICLE_USE                 object
MANEUVER                    object
FIRST_CONTACT_POINT         object
POSTED_SPEED_LIMIT         float64
WEATHER_CONDITION           object
LIGHTING_CONDITION          object
FIRST_CRASH_TYPE            object
TRAFFICWAY_TYPE             object
ALIGNMENT                   object
ROADWAY_SURFACE_COND        object
ROAD_DEFECT                 object
CRASH_TYPE                  object
PRIM_CONTRIBUTORY_CAUSE     object
SEC_CONTRIBUTORY_CAUSE      object
STREET_DIRECTION            object
BEAT_OF_OCCURRENCE         float64
CRASH_HOUR                 float64
CRASH_DAY_OF_WEEK          float64
CRASH_MONTH         

# Gridsearch (Round 1)

After practicing pipelines with the FSM, I felt comfortable moving on to gridsearches to find a better performing model. Each search used a pipeline, with the proper steps taken to avoid data leakage.

In the first round of gridsearches, I found that after combining the cleaned versions of each dataset, gridsearching was just not working.

It took 3 days to troubleshoot the causes, likely due to my own inexperience. It may also be because I did not double-check that the tools I was using were all appropriate.

These issues are detailed below:

- SMOTE USAGE: 
Our targets were very imbalanced, as you can see on the graph below. The top class was larger than the 2nd and 3rd classes added together. I wanted to use SMOTE to correct this and create a better model, but it did not work because of the size of the dataset (1.3 million). Running SMOTE on such a large dataset, with so many variables after applying OHE, was too computationally expensive. FIX: Use undersampling or a model that compensates for class imbalance.
![target imbalance](./graphs/all_causes.jpg)


- TOO MANY TARGETS:
As mentioned above, there was some fairly serious class imbalance. The top cause was 'unable to determine' which also isn't helpful for predicting. How is it helpful to feed the model information and the model tells you that it's 'unable to determine' the cause? FIX: Reducing the targets. This would also decrease the size of the dataset which would increase the speed of gridsearches.


- TOO MANY VARIABLES/TOO COMPLEX:
We were able to reduce the original data to only 29 variables. However, most of these variables are categorical, which makes the dataframe absolutely gigantic once OHE is applied. FIX: Apply additional data cleaning/engineering.


- SIMPLEIMPUTER NOT OPERATING CORRECTLY:
I'm still unsure of the cause of this issue, but using the imputer as part of the pipelines in my gridsearches made them nonfunctional.

(FROM ./Modeling_gridsearch/gridsearch1.ipynb)
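
The undersampling fix suggested for the SMOTE issue can be sketched with plain pandas. This is an illustration on made-up rows, not the approach the notebook ended up using: every class is randomly cut down to the size of the rarest class, which is cheap even on large frames because no synthetic rows are generated.

```python
import pandas as pd

# Made-up imbalanced target: 8 / 3 / 2 rows per class.
toy = pd.DataFrame({
    "PRIM_CONTRIBUTORY_CAUSE": ["UNABLE TO DETERMINE"] * 8
                               + ["FOLLOWING TOO CLOSELY"] * 3
                               + ["FAILING TO YIELD RIGHT-OF-WAY"] * 2,
    "CRASH_HOUR": range(13),
})

target = "PRIM_CONTRIBUTORY_CAUSE"
n_min = int(toy[target].value_counts().min())  # size of the rarest class

# Randomly keep n_min rows from every class.
balanced = toy.groupby(target).sample(n=n_min, random_state=42)
print(balanced[target].value_counts())
```

The other route named in the fix, a model that compensates for imbalance, is available in scikit-learn through the `class_weight="balanced"` option on estimators such as `RandomForestClassifier` and `LogisticRegression`.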

# Data Preparation (Round 3)

We needed to apply the fixes for the issues found in the first round of gridsearching. At this point, I was also running critically low on time to have the deliverables ready, so the problem called for a creative approach.

Although this may have introduced some data leakage or muddied the data, I made the decision to create engineered variables for most categorical values.

For example, with `weather`, instead of inputting 'clear', 'other', 'snow', 'sleet', etc., you choose from only four options (clear/unknown, rain, snow, other).

This has a couple of benefits:
- greatly reduced complexity
- removes the need for additional cleaning steps in the pipelines (OHE only)

I also screened the data to keep only the top 4 causes, excluding the #1 cause of 'unable to determine'.
![target selection](./graphs/top_4_causes.jpg)
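
The target selection can be expressed in a few lines of pandas. The sketch below uses placeholder cause labels ("A" through "E") instead of the real values, and assumes the uninformative label is spelled `UNABLE TO DETERMINE` in the data.

```python
import pandas as pd

# Placeholder causes with distinct counts so the ranking is unambiguous.
causes = pd.Series(
    ["UNABLE TO DETERMINE"] * 6 + ["A"] * 5 + ["B"] * 4 + ["C"] * 3 + ["D"] * 2 + ["E"] * 1,
    name="PRIM_CONTRIBUTORY_CAUSE",
)

counts = causes.value_counts()  # sorted from most to least frequent

# Drop the uninformative label, then keep the four most frequent causes.
top_causes = counts.drop("UNABLE TO DETERMINE", errors="ignore").head(4).index.tolist()

# Keep only the rows whose cause is one of the selected targets.
screened = causes[causes.isin(top_causes)]
print(top_causes, len(screened))
```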

We also took additional steps to screen the data and eliminate excessively complicated variables. The cleaning steps are detailed below, along with the functions created to do a final clean pass on the data.

We were again able to reduce the complexity of the dataframe, going down to 23 columns (11 of them simplified variables) and only 4 target classes. The data is now down to 230k data points, which will greatly speed up gridsearches.

(FROM ./EDA_cleaning/final_data_prep.ipynb)

In [24]:
# list of variables to screen with
# WILL TRY ON TARGET FIRST AND THEN SEE HOW MUCH SCREENING IS NECESSARY

# 'PRIM_CONTRIBUTORY_CAUSE' - limit to top 4:
# FAILING TO YIELD RIGHT-OF-WAY                                                       133013
# FOLLOWING TOO CLOSELY                                                               132795
# IMPROPER OVERTAKING/PASSING                                                          57294
# FAILING TO REDUCE SPEED TO AVOID CRASH                                               51012


# 'DRIVERS_LICENSE_CLASS' - only take people with drivers licenses (D)
# 'UNIT_TYPE' - only for drivers
# 'AGE' - only people 16 or above can use a drivers license

# list of variables to engineer
# SAFETY_EQUIPMENT - safety belt used/not used
# 'NUM_PASSENGERS' - has/does not have passengers
# 'VEHICLE_DEFECT' - defective/not defective
# 'VEHICLE_TYPE' - motorcycle/passenger/large passenger/large
# 'VEHICLE_USE' - personal/not-personal
# 'MANEUVER' - straight/turn/traffic/other
# 'FIRST_CONTACT_POINT' - LEAVE AS IS FOR NOW
# 'POSTED_SPEED_LIMIT' - BUCKET(LOW/MED/HIGH)
# 'WEATHER_CONDITION' - clear+unknown/rain/snow
# 'ROADWAY_SURFACE_COND' - DRY/WET/OTHER
# 'ROAD_DEFECT' - no defect/possible defect
# 'TRAFFICWAY_TYPE' - not divided/divided/other
# 'ALIGNMENT' - straight/curved

# Overly complicated: remove with prejudice
# 'MAKE', 'MODEL', 'FIRST_CRASH_TYPE', 'SEC_CONTRIBUTORY_CAUSE', 'BEAT_OF_OCCURRENCE'

# list of variables to remove after above steps
# 'DRIVERS_LICENSE_CLASS', 'SAFETY_EQUIPMENT', 'UNIT_TYPE', 'NUM_PASSENGERS', 'VEHICLE_DEFECT', 'VEHICLE_TYPE',
# 'VEHICLE_USE', 'MANEUVER', 'POSTED_SPEED_LIMIT', 'WEATHER_CONDITION', 'ROADWAY_SURFACE_COND', 'ALIGNMENT',
# 'ROAD_DEFECT'



In [27]:
def screen_vars(data):
    """Filter rows to the modeling population and drop overly complex columns."""

    # SCREEN pass 1: target reduction (top 4 causes, excluding 'UNABLE TO DETERMINE')
    cause_list = ['FAILING TO YIELD RIGHT-OF-WAY', 'FOLLOWING TOO CLOSELY',
                  'IMPROPER OVERTAKING/PASSING', 'FAILING TO REDUCE SPEED TO AVOID CRASH']
    data = data[data['PRIM_CONTRIBUTORY_CAUSE'].isin(cause_list)]

    # SCREEN pass 2: has standard drivers license (class D)
    # not looking at crashes from drivers of industrial applications
    data = data[data['DRIVERS_LICENSE_CLASS'].isin(['D'])]

    # SCREEN pass 3: UNIT_TYPE only for drivers
    data = data[data['UNIT_TYPE'].isin(['DRIVER'])]

    # SCREEN pass 4: age 16 or above
    # Can't have a drivers license below 16
    data = data[data['AGE'] >= 16]

    # DELETE pass: remove overly complex variables.
    # Return a new frame: dropping in place on a filtered slice can raise
    # SettingWithCopyWarning, and without a return the caller gets nothing back.
    return data.drop(columns=['MAKE', 'MODEL', 'FIRST_CRASH_TYPE',
                              'SEC_CONTRIBUTORY_CAUSE', 'BEAT_OF_OCCURRENCE'])

In [28]:
# ENGINEER pass: create function to engineer new variables for all required cols

def Engineer_Vars(data):
    
    # 'SAFETY_EQUIPMENT' - safety belt or helmet used/not used+unknown
    data['se_simple'] = np.where(((data['SAFETY_EQUIPMENT']=='SAFETY BELT USED')|
                               (data['SAFETY_EQUIPMENT']=='DOT COMPLIANT MOTORCYCLE HELMET')), 'safety belt/helmet used', 
                                 'NO safety belt/helmet + unknown')
    
    # 'NUM_PASSENGERS' - has/does not have passengers
    data['passengers_simple'] = np.where(data['NUM_PASSENGERS'] > 0, 'has passengers', 'no passengers')
    
    # 'VEHICLE_DEFECT' - not defective/defective+unknown
    data['defect_simple'] = np.where(data['VEHICLE_DEFECT'] == 'NONE', 'not defective', 'defective/unknown')
    
    # 'VEHICLE_TYPE' - passenger/other
    data['vehicletype_simple'] = np.where(data['VEHICLE_TYPE'] == 'PASSENGER', 'passenger car', 'other')
    
    # 'VEHICLE_USE' - personal/not-personal
    data['vehicleuse_simple'] = np.where(data['VEHICLE_USE'] == 'PERSONAL', 'personal vehicle', 'non-personal vehicle')  
    
    # 'MANEUVER' - straight/turn/traffic/other    
#     Leave as is for now. too complicated

    # 'FIRST_CONTACT_POINT' - LEAVE AS IS FOR NOW
    
    # 'POSTED_SPEED_LIMIT' - BUCKET(LOW/MED/HIGH)
    conditions = [
        (data['POSTED_SPEED_LIMIT'] < 30),
        (data['POSTED_SPEED_LIMIT'] >= 30) & (data['POSTED_SPEED_LIMIT'] < 40),
        (data['POSTED_SPEED_LIMIT'] >= 40)]
    
    choices = ['low', 'med', 'high']
    data['speedlimit_simple'] = np.select(conditions, choices, default='low')
    
    # 'WEATHER_CONDITION' - clear+unknown/rain/snow/other
    conditions = [
        data['WEATHER_CONDITION'].isin(['CLEAR', 'UNKNOWN']),
        # cloudy/overcast is grouped with rain (the original listed it twice)
        data['WEATHER_CONDITION'].isin(['RAIN', 'CLOUDY/OVERCAST']),
        (data['WEATHER_CONDITION'] == 'SNOW')]
    
    choices = ['clear/unknown', 'rain', 'snow']
    data['weather_simple'] = np.select(conditions, choices, default='other')
    
    # 'ROADWAY_SURFACE_COND' - DRY/H20/OTHER
    conditions = [
        (data['ROADWAY_SURFACE_COND'] == 'DRY'),
        (data['ROADWAY_SURFACE_COND'] == 'WET') | (data['ROADWAY_SURFACE_COND'] == 'SNOW OR SLUSH') |
            (data['ROADWAY_SURFACE_COND'] == 'ICE')]
    
    choices = ['dry', 'H20']
    data['roadcond_simple'] = np.select(conditions, choices, default='other')
    
    # 'ROAD_DEFECT' - no defect/possible defect
    data['roaddef_simple'] = np.where(data['ROAD_DEFECT'] == 'NO DEFECTS', 'no road defect', 'possible road defect')
    
    # 'TRAFFICWAY_TYPE' - not divided/divided/other
    conditions = [
        (data['TRAFFICWAY_TYPE'] == 'NOT DIVIDED') | (data['TRAFFICWAY_TYPE'] == 'ONE-WAY'),
        (data['TRAFFICWAY_TYPE'] == 'DIVIDED - W/MEDIAN (NOT RAISED)') | (data['TRAFFICWAY_TYPE'] == 'DIVIDED - W/MEDIAN BARRIER')
        ]
    
    choices = ['not divided', 'divided']
    data['trafficway_simple'] = np.select(conditions, choices, default='other')
    
    # 'ALIGNMENT' - straight/curved
    data['alignment_simple'] = np.where(data['ALIGNMENT'].str.contains('STRAIGHT'), 'straight', 'curved')

    # Return the frame so the engineered columns are available to the caller
    return data

In [29]:
with open('./Datasets/final_data.pkl', 'rb') as unpickleFile:
    final_df = pickle.load(unpickleFile, encoding='bytes')
final_df.columns

Index(['SEX', 'AGE', 'MANEUVER', 'FIRST_CONTACT_POINT', 'LIGHTING_CONDITION',
       'CRASH_TYPE', 'PRIM_CONTRIBUTORY_CAUSE', 'STREET_DIRECTION',
       'CRASH_HOUR', 'CRASH_DAY_OF_WEEK', 'CRASH_MONTH', 'INJURY_LEVEL',
       'se_simple', 'passengers_simple', 'defect_simple', 'vehicletype_simple',
       'vehicleuse_simple', 'speedlimit_simple', 'weather_simple',
       'roadcond_simple', 'roaddef_simple', 'trafficway_simple',
       'alignment_simple'],
      dtype='object')

In [30]:
final_df.shape

(230514, 23)

# Gridsearch (Round 2)

Now that we have a dataframe of a manageable size, with an appropriate number of targets, we can continue with some real gridsearches!

For our gridsearches, we had to apply several transformations:
- Categorical data: apply OneHotEncoder so that models can be built from categorical data
- Numeric data: scale it so that it can be modeled better
- Target data: encode it so that it can be modeled

I ran gridsearches with RandomForestClassifier, DecisionTreeClassifier, and GradientBoostingClassifier.
Although I didn't run as many gridsearches as I would have liked, the best model ended up being a RandomForestClassifier with these hyperparameters:

{'rfc__criterion': 'gini',
 'rfc__max_features': 'sqrt',
 'rfc__min_samples_leaf': 2}
 
This model achieved an accuracy of 0.64 on the test data.

(FROM ./Modeling_gridsearch/gridsearch2.ipynb)
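
A small-scale sketch of this kind of search is shown below. It is not the notebook's search: the data is random, the grid values other than the reported best ones are guesses, and `n_estimators` and `cv` are kept tiny so it runs in seconds. The step name `rfc` is chosen so the parameter keys match the `rfc__...` names reported above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Random toy data with a few of the final dataframe's column names.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "AGE": rng.integers(16, 80, size=n).astype(float),
    "CRASH_HOUR": rng.integers(0, 24, size=n).astype(float),
    "weather_simple": rng.choice(["clear/unknown", "rain", "snow", "other"], size=n),
    "speedlimit_simple": rng.choice(["low", "med", "high"], size=n),
})
causes = rng.choice(["FOLLOWING TOO CLOSELY", "FAILING TO YIELD RIGHT-OF-WAY"], size=n)

# Encode the target as integers (scikit-learn classifiers also accept strings).
y = LabelEncoder().fit_transform(causes)
X_train, X_test, y_train, y_test = train_test_split(
    toy, y, test_size=0.25, stratify=y, random_state=42)

# Scale numeric columns, one-hot encode everything else.
ct = ColumnTransformer([
    ("num", StandardScaler(), selector(dtype_include="number")),
    ("cat", OneHotEncoder(handle_unknown="ignore"), selector(dtype_exclude="number")),
])
pipe = Pipeline([
    ("ct", ct),
    ("rfc", RandomForestClassifier(n_estimators=25, random_state=42)),
])

# Keys use the <step name>__<parameter> convention.
param_grid = {
    "rfc__criterion": ["gini", "entropy"],
    "rfc__max_features": ["sqrt", "log2"],
    "rfc__min_samples_leaf": [1, 2],
}
search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)  # transformers are refitted inside every CV fold

print(search.best_params_)
test_accuracy = search.score(X_test, y_test)
```

Because the preprocessing sits inside the pipeline, each cross-validation fold fits the scaler and encoder on its own training portion, which is what keeps the search free of leakage.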

In [31]:
with open('./Modeling_gridsearch/rfc_model_final.sav', 'rb') as unpickleFile:
    final_model = pickle.load(unpickleFile, encoding='bytes')

with open('./Modeling_gridsearch/rfc_model_final_Xtest.sav', 'rb') as unpickleFile:
    final_model_Xtest = pickle.load(unpickleFile, encoding='bytes')

with open('./Modeling_gridsearch/rfc_model_final_ytest.sav', 'rb') as unpickleFile:
    final_model_ytest = pickle.load(unpickleFile, encoding='bytes')

In [32]:
final_model.score(final_model_Xtest, final_model_ytest)

0.6370924360998803

# Final Model Analysis

The final model has the following performance statistics:
- Recall: 0.64
- Accuracy: 0.64
- Precision: 0.64
- F1: 0.64

![confusion matrix](./graphs/confusion_matrix.jpg)
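These stats are all the same because we used the 'micro' average setting for the scoring. Micro averaging adds up the true positives, false positives, and false negatives across all classes in the confusion matrix before applying division, so for a single-label multiclass problem like this one, precision, recall, and F1 all equal accuracy. Because every crash counts equally, the larger classes dominate a micro-averaged score.

Using 'macro' would have computed each metric per class and then taken the unweighted mean across classes, so the performance statistics would likely have been lower if the smaller classes are predicted less accurately.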
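
The difference between the two averaging modes is easy to check on a toy example. The labels below are made up (they are not the model's predictions): class 0 is large and predicted perfectly, while classes 1 and 2 are small and each half wrong.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 6 samples of class 0, 2 each of classes 1 and 2.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 2]

acc = accuracy_score(y_true, y_pred)

# Micro: pool TP/FP/FN over all classes, then divide once.
micro = {
    "precision": precision_score(y_true, y_pred, average="micro"),
    "recall": recall_score(y_true, y_pred, average="micro"),
    "f1": f1_score(y_true, y_pred, average="micro"),
}

# Macro: compute the metric per class, then take the unweighted mean.
macro_recall = recall_score(y_true, y_pred, average="macro")
macro_f1 = f1_score(y_true, y_pred, average="macro")

print(acc, micro, macro_recall, macro_f1)
```

Here the micro scores all match accuracy, while macro recall is pulled down by the two small classes that are only half right.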

We also wanted to look at which factors matter most when determining the type of crash. Unsurprisingly, the most important factors are sex and age, which are the first two columns fed into the model.

(FROM ./Modeling_gridsearch/gridsearch2.ipynb)

In [33]:
importance = final_model.feature_importances_
for i,v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i,v))

Feature: 0, Score: 0.07904
Feature: 1, Score: 0.06944
Feature: 2, Score: 0.05088
Feature: 3, Score: 0.06060
Feature: 4, Score: 0.01246
Feature: 5, Score: 0.01247
Feature: 6, Score: 0.00004
Feature: 7, Score: 0.00094
Feature: 8, Score: 0.00594
Feature: 9, Score: 0.00522
Feature: 10, Score: 0.00000
Feature: 11, Score: 0.00002
Feature: 12, Score: 0.00386
Feature: 13, Score: 0.00958
Feature: 14, Score: 0.00071
Feature: 15, Score: 0.00163
Feature: 16, Score: 0.00014
Feature: 17, Score: 0.00143
Feature: 18, Score: 0.03821
Feature: 19, Score: 0.00180
Feature: 20, Score: 0.00065
Feature: 21, Score: 0.00021
Feature: 22, Score: 0.00044
Feature: 23, Score: 0.04480
Feature: 24, Score: 0.00141
Feature: 25, Score: 0.01870
Feature: 26, Score: 0.02782
Feature: 27, Score: 0.00001
Feature: 28, Score: 0.00386
Feature: 29, Score: 0.00064
Feature: 30, Score: 0.00148
Feature: 31, Score: 0.01474
Feature: 32, Score: 0.01704
Feature: 33, Score: 0.01641
Feature: 34, Score: 0.00169
Feature: 35, Score: 0.01593
Fe

# Conclusions

- We were able to create a model that has an accuracy of 64% in predicting the type of a crash from the top 4 types.

- This model uses 23 input factors to predict the type of crash.

- The most important factors in finding the type of a crash are sex and age.

# Next Steps

Continue to iterate upon model:
- This model was only trained on 21% of total crashes, and only predicts the top 4 crash types. Training on additional crash types will allow the model to predict them as well.
- Improve performance by running additional gridsearches and checking other model types to see if they have better results.

Look more into ‘Unable to determine’ cause:
- It's likely that many crashes get lumped into this cause when there is a true cause at hand. Can we discover the true cause using the model?
- The number one crash type, containing more than half of all crashes, has to be thrown out due to being uninterpretable. Is there a better way of interpreting this cause?

Deploy model:
- Answer 23 simple questions to find the cause of a crash. This may be useful to law enforcement.

