# Austin's Car Crash
#### By: Luca Comba, Hung Tran, Steven Tran

<img src="https://upload.wikimedia.org/wikipedia/en/thumb/a/a0/Seal_of_Austin%2C_TX.svg/1024px-Seal_of_Austin%2C_TX.svg.png" width="100" height="100">

#### Table of Contents
1. [Introduction](#introduction)
2. [Data](#data)
3. [Models](#models)
4. [Conclusion](#conclusion)


In [1]:
# imports
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

RANDOM_SEED = 42

In [2]:
# read data
df = pd.read_csv('data/austin_car_crash_cleaned.csv')

# Introduction

<div id="introduction" />

The original dataset records traffic crashes in Austin, Texas, from 2010 to the present. It contains 216,088 instances and 45 features, both numerical and categorical. The dataset is available at [Austin Crash Report Data](https://catalog.data.gov/dataset/vision-zero-crash-report-data).

# Data
<div id="data" />

This section prepares the dataset and selects the features used by our models.

## Data Cleaning

The dataset cleaning is covered in [cleaning.ipynb](./cleaning.ipynb).


## Exploratory Data Analysis

The exploratory data analysis is covered in [exploratory.ipynb](./exploratory.ipynb).

## Feature Selection
<div id="feature-selection" />

Based on the results of the [feature_selection.ipynb](./feature_selection.ipynb) notebook, we drop some features from the dataset.

We drop `primary_address`, `secondary_address`, `latitude`, `longitude`, and `timestamp_us_central` because of their data types.

In [3]:
df = df.drop(columns=['primary_address', 'secondary_address', 'timestamp_us_central', 'latitude', 'longitude'])

## Split

In [4]:
# Define the target column
target_col = 'estimated_total_comprehensive_cost'

# Split into features (X) and target (y)
X = df.drop(columns=[target_col])
y = df[target_col]

In [5]:
# Split 60-20-20 train-validation-test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=RANDOM_SEED)
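
The two-step split above yields 60-20-20 because the second call takes 25% of the remaining 80%, which is 20% of the original rows. The sketch below checks that arithmetic on a hypothetical stand-in array (`X_demo`, `y_demo` and the 1,000-row size are assumptions for illustration, not the crash data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42

# Hypothetical stand-in data: 1,000 rows, 3 features
X_demo = np.arange(3000).reshape(1000, 3)
y_demo = np.arange(1000)

# Step 1: hold out 20% of all rows for the test set
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=RANDOM_SEED
)
# Step 2: hold out 25% of the remaining 80% for validation (0.25 * 0.8 = 0.2)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=RANDOM_SEED
)

split_sizes = {"train": len(X_tr), "val": len(X_va), "test": len(X_te)}
print(split_sizes)  # {'train': 600, 'val': 200, 'test': 200}
```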

## Feature Scaling

In [6]:
# Select numeric columns
numeric_cols = X_train.select_dtypes(include=[np.number]).columns.tolist()

print(f"Numeric columns: {numeric_cols}")

Numeric columns: ['speed_limit', 'sus_serious_injry_cnt', 'nonincap_injry_cnt', 'poss_injry_cnt', 'non_injry_cnt', 'unkn_injry_cnt', 'tot_injry_cnt', 'death_cnt', 'motor_vehicle_death_count', 'motor_vehicle_serious_injury_count', 'bicycle_death_count', 'bicycle_serious_injury_count', 'pedestrian_death_count', 'pedestrian_serious_injury_count', 'motorcycle_death_count', 'motorcycle_serious_injury_count', 'other_death_count', 'other_serious_injury_count', 'micromobility_serious_injury_count', 'micromobility_death_count', 'law_enforcement_fatality_count', 'unit_involved_bicycle', 'unit_involved_large_passenger_vehicle', 'unit_involved_motor_vehicle_other', 'unit_involved_motorcycle', 'unit_involved_other_unknown', 'unit_involved_passenger_car', 'unit_involved_pedestrian', 'hour', 'day_of_week', 'month', 'year', 'day_of_month', 'weekend', 'hour_sin', 'hour_cos', 'month_sin', 'month_cos']


In [7]:
# Exclude the unit_involved_* columns from the list of columns to be scaled
cols_to_remove = [
    'unit_involved_bicycle',
    'unit_involved_large_passenger_vehicle',
    'unit_involved_motor_vehicle_other',
    'unit_involved_motorcycle',
    'unit_involved_other_unknown',
    'unit_involved_passenger_car',
    'unit_involved_pedestrian',
]

for col in cols_to_remove:
    numeric_cols.remove(col)

In [8]:
# Select categorical columns
categorical_cols = X_train.select_dtypes(exclude=[np.number]).columns.tolist()
print(f"Categorical columns: {categorical_cols}")

Categorical columns: ['fatal_crash', 'construction_zone', 'severity_incapacitating_injury', 'severity_killed', 'severity_non_incapacitating_injury', 'severity_not_injured', 'severity_possible_injury', 'severity_unknown']


In [9]:
print(f'Train before scaler. mean: {np.mean(X_train[numeric_cols], axis=0)} std: {np.std(X_train[numeric_cols], axis=0)}')
print(f'Test before scaler. mean: {np.mean(X_test[numeric_cols], axis=0)} std: {np.std(X_test[numeric_cols], axis=0)}')

Train before scaler. mean: speed_limit                             45.911993
sus_serious_injry_cnt                    0.034929
nonincap_injry_cnt                       0.284557
poss_injry_cnt                           0.358758
non_injry_cnt                            1.848982
unkn_injry_cnt                           0.109255
tot_injry_cnt                            0.678243
death_cnt                                0.005788
motor_vehicle_death_count                0.002899
motor_vehicle_serious_injury_count       0.024441
bicycle_death_count                      0.000136
bicycle_serious_injury_count             0.001790
pedestrian_death_count                   0.001874
pedestrian_serious_injury_count          0.003517
motorcycle_death_count                   0.000879
motorcycle_serious_injury_count          0.005171
other_death_count                        0.000000
other_serious_injury_count               0.000010
micromobility_serious_injury_count       0.000000
micromobility_death_coun

In [10]:
# Scale numeric features
scaler = StandardScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_val[numeric_cols] = scaler.transform(X_val[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])
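
As a possible alternative to scaling the columns by hand, the scaler could live inside the pipeline, so that the scaling statistics are always learned from the training rows only. The sketch below is a minimal illustration on a hypothetical toy frame: the column names only mimic the real dataset, and `demo`, `demo_y`, `preprocess`, and `scaled_pipe` are names invented for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical toy frame (not the crash data)
demo = pd.DataFrame({
    "speed_limit": rng.integers(20, 80, size=200).astype(float),
    "tot_injry_cnt": rng.integers(0, 5, size=200).astype(float),
    "construction_zone": rng.integers(0, 2, size=200).astype(bool),
})
# Toy target: an exact linear function of two of the features
demo_y = 1000.0 * demo["tot_injry_cnt"] + 10.0 * demo["speed_limit"]

X_tr, X_te, y_tr, y_te = train_test_split(
    demo, demo_y, test_size=0.2, random_state=42
)

# Scale only the listed numeric columns; pass the rest through unchanged
preprocess = ColumnTransformer(
    transformers=[("num", StandardScaler(), ["speed_limit", "tot_injry_cnt"])],
    remainder="passthrough",
)
scaled_pipe = make_pipeline(preprocess, LinearRegression())

# fit() learns the scaling statistics from the training rows only
scaled_pipe.fit(X_tr, y_tr)
demo_r2 = scaled_pipe.score(X_te, y_te)
print(f"Toy test R2: {demo_r2:.4f}")
```

With this layout, `predict` and `score` apply the already-fitted scaler automatically, so there is no separate `transform` call to forget on the validation or test split.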

In [11]:
print(f'Train after scaler. mean: {np.mean(X_train[numeric_cols], axis=0)} std: {np.std(X_train[numeric_cols], axis=0)}')
print(f'Test after scaler. mean: {np.mean(X_test[numeric_cols], axis=0)} std: {np.std(X_test[numeric_cols], axis=0)}')

Train after scaler. mean: speed_limit                          -1.463656e-16
sus_serious_injry_cnt                -1.517205e-17
nonincap_injry_cnt                   -7.467028e-17
poss_injry_cnt                        6.098569e-17
non_injry_cnt                        -5.079661e-17
unkn_injry_cnt                       -1.204839e-17
tot_injry_cnt                        -6.277063e-17
death_cnt                            -2.000628e-17
motor_vehicle_death_count             3.837636e-17
motor_vehicle_serious_injury_count   -4.209500e-17
bicycle_death_count                   4.239249e-18
bicycle_serious_injury_count          1.896506e-18
pedestrian_death_count                5.726705e-18
pedestrian_serious_injury_count       1.502330e-17
motorcycle_death_count               -2.766668e-17
motorcycle_serious_injury_count       4.060754e-17
other_death_count                     0.000000e+00
other_serious_injury_count           -2.714607e-18
micromobility_serious_injury_count    0.000000e+00
micro
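
The training means printed above are on the order of 1e-16 rather than exactly zero because of floating-point rounding. The sketch below, using a hypothetical single-feature array `x`, shows that `StandardScaler` matches the manual z-score computed with the population standard deviation (`ddof=0`), which is also what `np.std` uses by default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature column (values chosen only for illustration)
x = np.array([[20.0], [35.0], [45.0], [60.0], [70.0]])

scaler_demo = StandardScaler().fit(x)
z_sklearn = scaler_demo.transform(x)

# Manual z-score using the population standard deviation (ddof=0)
z_manual = (x - x.mean(axis=0)) / x.std(axis=0, ddof=0)

print(np.allclose(z_sklearn, z_manual))  # True
print(abs(z_sklearn.mean()) < 1e-12)     # True: zero up to rounding error
```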

# Models
<div id="models" />

To predict `estimated_total_comprehensive_cost`, we consider the following regression models:
1. Linear Regression
2. Decision Tree Regression
3. Random Forest Regression
4. Support Vector Regression (SVR)
5. K-Nearest Neighbors (KNN) Regression
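
The candidates listed above could be compared with a loop along the following lines. This is only a sketch under stated assumptions: it runs on synthetic data from `make_regression` rather than the crash dataset, uses mostly default hyperparameters, and the names `candidates` and `val_scores` are invented for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

RANDOM_SEED = 42

# Synthetic stand-in data, not the crash dataset
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=RANDOM_SEED
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=RANDOM_SEED
)

candidates = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=RANDOM_SEED),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=RANDOM_SEED),
    "SVR": SVR(),
    "KNN": KNeighborsRegressor(n_neighbors=5),
}

val_scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    # score() returns R2 for scikit-learn regressors
    val_scores[name] = model.score(X_va, y_va)
    print(f"{name}: validation R2 = {val_scores[name]:.3f}")
```

Scoring on the validation split keeps the test set untouched until a final model has been chosen.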

## Pipeline

In [12]:
# Create a pipeline using the make_pipeline method
pipe = make_pipeline(LinearRegression())

# Train the model on the training set
pipe.fit(X_train, y_train)

# Validate the model on the validation set
y_val_pred = pipe.predict(X_val)
val_mse = mean_squared_error(y_val, y_val_pred)
val_r2 = r2_score(y_val, y_val_pred)
print('Validation MSE:', val_mse)
print('Validation R2:', val_r2)

# Final evaluation on the test set
y_test_pred = pipe.predict(X_test)
test_mse = mean_squared_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)
print('Test MSE:', test_mse)
print('Test R2:', test_r2)


Validation MSE: 6162207.852094665
Validation R2: 0.9999879458211386
Test MSE: 4124926.833899759
Test R2: 0.9999931244298819
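
MSE is expressed in squared units of the target, which makes the values above hard to read directly. Taking the square root gives the RMSE, which is in the same units as `estimated_total_comprehensive_cost`. The sketch below simply reuses the two MSE values printed above; the variable names are chosen for this example.

```python
import math

# MSE values copied from the printed output above
val_mse = 6162207.852094665
test_mse = 4124926.833899759

# RMSE is the square root of MSE, in the same units as the target
val_rmse = math.sqrt(val_mse)
test_rmse = math.sqrt(test_mse)

print(f"Validation RMSE: {val_rmse:.2f}")
print(f"Test RMSE: {test_rmse:.2f}")
```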


# Conclusion
<div id="conclusion" />