# Austin's Car Crash
#### By: Luca Comba, Hung Tran, Steven Tran

<img src="https://upload.wikimedia.org/wikipedia/en/thumb/a/a0/Seal_of_Austin%2C_TX.svg/1024px-Seal_of_Austin%2C_TX.svg.png" width="100" height="100">

#### Table of Contents
1. [Introduction](#introduction)
2. [Data](#data)
3. [Models](#models)
4. [Conclusion](#conclusion)


# Introduction

<div id="introduction" />

The original dataset includes records of traffic accidents in Austin, Texas, from 2010 to today, with 216,088 instances and 45 features, including both numerical and categorical data. The dataset can be found at [Austin Crash Report Data](https://catalog.data.gov/dataset/vision-zero-crash-report-data).

# Data
<div id="data" />

In the following section we will set up the dataset and the features we will be using for our models.

In [1]:
# imports
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor


from utils.utils import (
    print_linear_regression_scores
)

RANDOM_SEED=42

In [2]:
# read data
df = pd.read_csv('data/austin_car_crash_cleaned.csv')

## Data Cleaning

The data cleaning steps are documented in [cleaning.ipynb](./cleaning.ipynb).


## Exploratory Data Analysis

The exploratory data analysis is documented in [exploratory.ipynb](./exploratory.ipynb).

## Feature Selection
<div id="feature-selection" />

Based on the results of the [feature_selection.ipynb](./feature_selection.ipynb) notebook, we drop some features from the dataset.

We first drop `primary_address`, `secondary_address`, `latitude`, `longitude`, and `timestamp_us_central` because their types (address strings, raw coordinates, and a timestamp) are not suitable as direct model inputs.

In [3]:
df = df.drop(columns=['primary_address', 'secondary_address', 'timestamp_us_central', 'latitude', 'longitude'])

The **Backward Elimination** method selected the following features:

```
'speed_limit', 'nonincap_injry_cnt', 'poss_injry_cnt',
'non_injry_cnt', 'unkn_injry_cnt', 'tot_injry_cnt',
'motor_vehicle_death_count', 'bicycle_death_count',
'pedestrian_death_count', 'motorcycle_death_count',
'severity_incapacitating_injury', 'severity_non_incapacitating_injury',
'severity_not_injured', 'severity_possible_injury', 'severity_unknown',
'unit_involved_motor_vehicle_other', 'unit_involved_motorcycle',
'unit_involved_passenger_car', 'hour', 'weekend', 'hour_sin',
'hour_cos', 'month_cos'
```

This means we can drop the remaining features from the dataset.

In [4]:
print(f"Features before backwards elimination: {len(df.columns)}")

to_keep = ['estimated_total_comprehensive_cost']

backward_elimination_features = [
    *to_keep,
    'speed_limit', 'nonincap_injry_cnt', 'poss_injry_cnt',
    'non_injry_cnt', 'unkn_injry_cnt', 'tot_injry_cnt',
    'motor_vehicle_death_count', 'bicycle_death_count',
    'pedestrian_death_count', 'motorcycle_death_count',
    'severity_incapacitating_injury', 'severity_non_incapacitating_injury',
    'severity_not_injured', 'severity_possible_injury', 'severity_unknown',
    'unit_involved_motor_vehicle_other', 'unit_involved_motorcycle',
    'unit_involved_passenger_car', 'hour', 'weekend', 'hour_sin',
    'hour_cos', 'month_cos'
]

cols = df.columns.tolist()
df = df.drop(columns=[col for col in cols if col not in backward_elimination_features])

print(f"Features after backwards elimination: {len(df.columns)}")

Features before backwards elimination: 47
Features after backwards elimination: 24
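
The selection itself lives in feature_selection.ipynb and is not reproduced here. As a rough sketch of the idea only: backward elimination starts from all features and repeatedly removes the one whose removal hurts a score the least. The example below uses scikit-learn's `SequentialFeatureSelector` on synthetic data. The feature names, the estimator, and `n_features_to_select=2` are illustrative assumptions and may differ from what the notebook actually does (for example, a p-value based elimination).

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): only x0 and x1 drive the target
rng = np.random.default_rng(42)
feature_names = ["x0", "x1", "noise_a", "noise_b"]
X_demo = rng.normal(size=(300, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

# Start from all four features and remove the least useful one at each step,
# judged by the cross-validated score of a linear regression
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction="backward",
    cv=5,
)
selector.fit(X_demo, y_demo)

# get_support() returns a boolean mask over the input columns
selected = [name for name, keep in zip(feature_names, selector.get_support()) if keep]
print(selected)
```

On this toy data the two noise columns are the ones eliminated, since removing either informative column would sharply lower the cross-validated score.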


In [5]:
display(df.columns)

Index(['speed_limit', 'nonincap_injry_cnt', 'poss_injry_cnt', 'non_injry_cnt',
       'unkn_injry_cnt', 'tot_injry_cnt', 'motor_vehicle_death_count',
       'bicycle_death_count', 'pedestrian_death_count',
       'motorcycle_death_count', 'estimated_total_comprehensive_cost',
       'severity_incapacitating_injury', 'severity_non_incapacitating_injury',
       'severity_not_injured', 'severity_possible_injury', 'severity_unknown',
       'unit_involved_motor_vehicle_other', 'unit_involved_motorcycle',
       'unit_involved_passenger_car', 'hour', 'weekend', 'hour_sin',
       'hour_cos', 'month_cos'],
      dtype='object')
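
The `hour_sin`, `hour_cos`, and `month_cos` columns look like cyclical encodings of time features. Their construction is part of the cleaning step and is not shown here, so the following is only a sketch of the usual technique, assuming a period of 24 for hours. The function names are hypothetical.

```python
import math

def encode_cyclical(value, period):
    """Map a periodic value (e.g. hour of day) to a point on the unit circle."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def encoded_gap(a, b, period=24):
    """Euclidean distance between two encoded values."""
    sin_a, cos_a = encode_cyclical(a, period)
    sin_b, cos_b = encode_cyclical(b, period)
    return math.hypot(sin_a - sin_b, cos_a - cos_b)

# 23:00 and 00:00 are one hour apart, just like 00:00 and 01:00
print(encoded_gap(23, 0), encoded_gap(0, 1))
# 00:00 and 12:00 are opposite points on the circle
print(encoded_gap(0, 12))
```

With the raw `hour` column, 23 and 0 are 23 units apart. The encoding places them next to each other, which is the motivation for keeping the sine and cosine features.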

## Split

In [6]:
# Define the target column
target_col = 'estimated_total_comprehensive_cost'

# Split into features (X) and target (y)
X = df.drop(columns=[target_col])
y = df[target_col]

In [7]:
# Split 60-20-20 train-validation-test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=RANDOM_SEED)
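
The two `test_size` values can look inconsistent with the 60-20-20 comment at first glance. The arithmetic below, using exact fractions, shows why the second split uses 0.25: it is applied to the 80% that remains after the first split. The variable names are illustrative only.

```python
from fractions import Fraction

test_fraction = Fraction(1, 5)      # first split: test_size=0.2
val_of_remainder = Fraction(1, 4)   # second split: test_size=0.25

# Share of the full dataset left after holding out the test set
remainder = 1 - test_fraction

# The second split takes a quarter of that remainder for validation
val_fraction = remainder * val_of_remainder
train_fraction = remainder - val_fraction

print(train_fraction, val_fraction, test_fraction)
```

The resulting shares are 3/5, 1/5, and 1/5 of the full dataset. Actual row counts can differ by one because of rounding.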

## Feature Scaling

In [8]:
# Select numeric columns
numeric_cols = X_train.select_dtypes(include=[np.number]).columns.tolist()

print(f"Numeric columns: {numeric_cols}")

Numeric columns: ['speed_limit', 'nonincap_injry_cnt', 'poss_injry_cnt', 'non_injry_cnt', 'unkn_injry_cnt', 'tot_injry_cnt', 'motor_vehicle_death_count', 'bicycle_death_count', 'pedestrian_death_count', 'motorcycle_death_count', 'unit_involved_motor_vehicle_other', 'unit_involved_motorcycle', 'unit_involved_passenger_car', 'hour', 'weekend', 'hour_sin', 'hour_cos', 'month_cos']


In [9]:
# Numeric columns to exclude from scaling
cols_to_remove = ['unit_involved_motor_vehicle_other', 'unit_involved_motorcycle', 'unit_involved_passenger_car']

# 'unit_involved_bicycle',
# 'unit_involved_large_passenger_vehicle',
# 'unit_involved_motor_vehicle_other',
# 'unit_involved_motorcycle',
# 'unit_involved_other_unknown',
# 'unit_involved_passenger_car',
# 'unit_involved_pedestrian',


for col in cols_to_remove:
    numeric_cols.remove(col)

In [10]:
# Select categorical columns
categorical_cols = X_train.select_dtypes(exclude=[np.number]).columns.tolist()
print(f"Categorical columns: {categorical_cols}")

Categorical columns: ['severity_incapacitating_injury', 'severity_non_incapacitating_injury', 'severity_not_injured', 'severity_possible_injury', 'severity_unknown']


In [11]:
# Summary statistics before scaling, for comparison with the scaled values
print(f'Train before scaler. mean: {np.mean(X_train[numeric_cols], axis=0)} std: {np.std(X_train[numeric_cols], axis=0)}')
print(f'Test before scaler. mean: {np.mean(X_test[numeric_cols], axis=0)} std: {np.std(X_test[numeric_cols], axis=0)}')

Train before scaler. mean: speed_limit                  45.911993
nonincap_injry_cnt            0.284557
poss_injry_cnt                0.358758
non_injry_cnt                 1.848982
unkn_injry_cnt                0.109255
tot_injry_cnt                 0.678243
motor_vehicle_death_count     0.002899
bicycle_death_count           0.000136
pedestrian_death_count        0.001874
motorcycle_death_count        0.000879
hour                         12.880414
weekend                       0.262670
hour_sin                     -0.156446
hour_cos                     -0.190531
month_cos                     0.010656
dtype: float64 std: speed_limit                  12.681201
nonincap_injry_cnt            0.643845
poss_injry_cnt                0.754040
non_injry_cnt                 1.675248
unkn_injry_cnt                0.386797
tot_injry_cnt                 0.965032
motor_vehicle_death_count     0.058967
bicycle_death_count           0.011664
pedestrian_death_count        0.043965
motorcycle_death_c

In [12]:
# Scale numeric features
scaler = StandardScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_val[numeric_cols] = scaler.transform(X_val[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])
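
As a minimal sketch of what `StandardScaler` does under its default settings: it learns each column's mean and population standard deviation from the training data and reuses them for the other splits. The plain-Python version below uses made-up numbers for a single column. The function names `fit_scaler` and `transform_values` are hypothetical.

```python
from statistics import fmean, pstdev

def fit_scaler(values):
    """Learn the mean and population standard deviation of one column."""
    return fmean(values), pstdev(values)

def transform_values(values, mean, std):
    """Standardize values with statistics learned elsewhere."""
    return [(v - mean) / std for v in values]

train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
test = [5.0, 9.0, 11.0]

# Fit on the training column only, then apply to both splits
mean, std = fit_scaler(train)
train_scaled = transform_values(train, mean, std)
test_scaled = transform_values(test, mean, std)

print(mean, std)
print(fmean(train_scaled), pstdev(train_scaled))
print(test_scaled)
```

Because the statistics come from the training split only, the scaled validation and test values need not have mean 0 or standard deviation 1. That is expected, and it avoids leaking information from the held-out data into the model.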

In [13]:
print(f'Train after scaler. mean: {np.mean(X_train[numeric_cols], axis=0)} std: {np.std(X_train[numeric_cols], axis=0)}')
print(f'Test after scaler. mean: {np.mean(X_test[numeric_cols], axis=0)} std: {np.std(X_test[numeric_cols], axis=0)}')

Train after scaler. mean: speed_limit                 -1.463656e-16
nonincap_injry_cnt          -7.467028e-17
poss_injry_cnt               6.098569e-17
non_injry_cnt               -5.079661e-17
unkn_injry_cnt              -1.204839e-17
tot_injry_cnt               -6.277063e-17
motor_vehicle_death_count    3.837636e-17
bicycle_death_count          4.239249e-18
pedestrian_death_count       5.726705e-18
motorcycle_death_count      -2.766668e-17
hour                         1.001801e-16
weekend                      9.311473e-17
hour_sin                     8.181007e-18
hour_cos                     3.019535e-17
month_cos                    1.993191e-17
dtype: float64 std: speed_limit                  1.0
nonincap_injry_cnt           1.0
poss_injry_cnt               1.0
non_injry_cnt                1.0
unkn_injry_cnt               1.0
tot_injry_cnt                1.0
motor_vehicle_death_count    1.0
bicycle_death_count          1.0
pedestrian_death_count       1.0
motorcycle_death_count     

# Models
<div id="models" />

To predict `estimated_total_comprehensive_cost`, we consider the following regression models:
1. Linear Regression
2. Ridge Regression
3. Lasso Regression
4. Decision Tree Regression
5. Random Forest Regression
6. Support Vector Regression (SVR), tentatively (it is commented out in the model definitions)
7. K-Nearest Neighbors (KNN) Regression

## Pipeline

scikit-learn's `Pipeline` lets us define every model with the same preprocessing and regression steps.

In [14]:
preprocessor = 'passthrough'  # No preprocessing needed yet; 'passthrough' makes this step a no-op

In [15]:
models = {
    'linear_regression': Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', LinearRegression())
    ]),
    "ridge_regression": Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', Ridge(alpha=1.0))
    ]),
    "lasso_regression": Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', Lasso(alpha=0.1))
    ]),
    "decision_tree_regressor": Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', DecisionTreeRegressor(random_state=RANDOM_SEED))
    ]),
    "random_forest_regressor": Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', RandomForestRegressor(n_estimators=100, random_state=RANDOM_SEED))
    ]),
    # Need a GPU wrapper (disabled for now).
    # Note: scikit-learn's SVR does not accept an n_jobs parameter.
    # "svr_pipeline": Pipeline([
    #     ('preprocessor', preprocessor),
    #     ('regressor', SVR(kernel='rbf', C=1.0, epsilon=0.1))
    # ]),
    "k_neighbors_regressor": Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', KNeighborsRegressor(n_neighbors=90, n_jobs=-1))
    ])
}

### Validation Set

We train each model on the training set and compare the models on the validation set. The test set is held back for the final evaluation.

In [16]:
validation_results = {}
for model_name, model_pipeline in models.items():
    print(f"Training {model_name}")

    # Fit on the training split only
    model_pipeline.fit(X_train, y_train)

    # Predict on the validation split
    y_hat = model_pipeline.predict(X_val)

    # Print the metrics and store them together with the predictions
    metrics_dict = print_linear_regression_scores(model_name, y_val, y_hat)
    print()
    validation_results[model_name] = {"y_hat": y_hat, **metrics_dict}

Training linear_regression
linear_regression Scores:
MAE: 596.2416 MSE: 6161738.2372 RMSE: 2482.2849 RMSE%: 0.81% R^2: 1.0000

Training ridge_regression
ridge_regression Scores:
MAE: 642.9466 MSE: 6413593.3502 RMSE: 2532.5073 RMSE%: 0.83% R^2: 1.0000

Training lasso_regression
lasso_regression Scores:
MAE: 964.4849 MSE: 13347059.2477 RMSE: 3653.3627 RMSE%: 1.20% R^2: 1.0000

Training decision_tree_regressor
decision_tree_regressor Scores:
MAE: 2866.2937 MSE: 9961257631.0414 RMSE: 99806.1002 RMSE%: 32.71% R^2: 0.9805

Training random_forest_regressor
random_forest_regressor Scores:
MAE: 3116.4263 MSE: 6163358736.0273 RMSE: 78507.0617 RMSE%: 25.73% R^2: 0.9879

Training k_neighbors_regressor
k_neighbors_regressor Scores:
MAE: 57567.6659 MSE: 135830501613.4747 RMSE: 368551.8981 RMSE%: 120.79% R^2: 0.7343
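
`print_linear_regression_scores` is imported from `utils.utils` and its source is not shown here. The sketch below is an assumption about what such a helper might compute, written in plain Python. In particular, expressing RMSE% as RMSE divided by the mean of the true values is a guess that is consistent with the printed numbers, not a confirmed definition. The function name `regression_scores` is hypothetical.

```python
import math

def regression_scores(y_true, y_pred):
    """Compute common regression metrics for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    # R^2 compares the squared error to the variance around the mean
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return {
        "mae": mae,
        "mse": mse,
        "rmse": rmse,
        "rmse_pct": 100.0 * rmse / mean_true,  # assumed definition of RMSE%
        "r2": r2,
    }

scores = regression_scores([100.0, 200.0, 300.0], [110.0, 190.0, 310.0])
print(scores)
```

RMSE is expressed in the units of the target, so it is easier to interpret next to the target's typical size than MSE is. This is also why a model can have a high R^2 and still a large RMSE% when a few crashes have very large costs.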



### Testing Set

Next, we adjust a few hyperparameters, refit the models on the training set, and evaluate them on the held-out test set.

#### Hyperparameter tuning

In [17]:
models['decision_tree_regressor'].set_params(regressor__max_depth=10)

In [18]:
models['random_forest_regressor'].set_params(regressor__n_estimators=200)

In [19]:
models['k_neighbors_regressor'].set_params(regressor__n_neighbors=200)
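
The values above are set by hand. One common alternative, sketched here on synthetic data, is to pick a hyperparameter by comparing scores on the validation split, so that the test set is used only once at the end. The data, the candidate values, and the variable names are assumptions for illustration. `set_params` with the `regressor__` prefix is the same mechanism used in the cells above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline

# Synthetic one-feature regression problem (assumption)
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(600, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=600)

# Simple train/validation split of the synthetic data
X_tr, y_tr = X_demo[:400], y_demo[:400]
X_va, y_va = X_demo[400:], y_demo[400:]

pipe = Pipeline([
    ("preprocessor", "passthrough"),
    ("regressor", KNeighborsRegressor()),
])

# Refit for each candidate and record the validation RMSE
candidate_ks = [5, 50, 200]
val_rmse = {}
for k in candidate_ks:
    pipe.set_params(regressor__n_neighbors=k)
    pipe.fit(X_tr, y_tr)
    y_hat_demo = pipe.predict(X_va)
    val_rmse[k] = float(np.sqrt(mean_squared_error(y_va, y_hat_demo)))

# Keep the candidate with the lowest validation error
best_k = min(val_rmse, key=val_rmse.get)
print(val_rmse, best_k)
```

On this toy problem a very large neighborhood averages over too much of the curve and scores worse than a small one. Whether the same holds for the crash data would need to be checked on the real validation split.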

In [20]:
testing_results = {}
for model_name, model_pipeline in models.items():
    print(f"Training {model_name}")

    # Refit on the training split with the updated hyperparameters
    model_pipeline.fit(X_train, y_train)

    # Predict on the held-out test split
    y_hat = model_pipeline.predict(X_test)

    # Print the metrics and store them together with the predictions
    metrics_dict = print_linear_regression_scores(model_name, y_test, y_hat)
    print()
    testing_results[model_name] = {"y_hat": y_hat, **metrics_dict}

Training linear_regression
linear_regression Scores:
MAE: 566.6099 MSE: 4124417.7936 RMSE: 2030.8663 RMSE%: 0.65% R^2: 1.0000

Training ridge_regression
ridge_regression Scores:
MAE: 620.2677 MSE: 4506290.5176 RMSE: 2122.8025 RMSE%: 0.68% R^2: 1.0000

Training lasso_regression
lasso_regression Scores:
MAE: 975.5219 MSE: 14163911.3537 RMSE: 3763.4972 RMSE%: 1.20% R^2: 1.0000

Training decision_tree_regressor
decision_tree_regressor Scores:
MAE: 11617.9526 MSE: 25671788985.0981 RMSE: 160224.1835 RMSE%: 51.23% R^2: 0.9572

Training random_forest_regressor
random_forest_regressor Scores:
MAE: 4986.5060 MSE: 18777444761.3485 RMSE: 137030.8168 RMSE%: 43.81% R^2: 0.9687

Training k_neighbors_regressor
k_neighbors_regressor Scores:
MAE: 84510.5269 MSE: 230800997218.1838 RMSE: 480417.5238 RMSE%: 153.60% R^2: 0.6153



# Conclusion
<div id="conclusion" />

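In this project we prepared the Austin crash dataset and trained several regression models to predict the `estimated_total_comprehensive_cost` of a car crash in Austin, Texas. We compared the models on the validation set and evaluated them on the test set. The best model on the test set was Linear Regression, with an RMSE of 2030.87 (0.65%) and an R^2 of 1.0000, closely followed by Ridge and Lasso Regression. Random Forest Regression reached an R^2 of 0.9687, and K-Nearest Neighbors Regression performed worst with an R^2 of 0.6153. The near-perfect linear fit suggests that the target is close to a linear function of the selected injury and death count features.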