## Effect of Passive Restraints on Vehicle Collision Occupant Status
##### Description:
* US data, for 1997-2002, from police-reported car crashes in which there is a harmful event (people or property),<br> and from which at least one vehicle was towed. Data are restricted to front-seat occupants, include only a subset<br> of the variables recorded, and are restricted in other ways also.
##### dvcat
* Ordered factor with levels (estimated impact speeds) 1-9km/h, 10-24, 25-39, 40-54, 55+
##### weight
* Observation weights, albeit of uncertain accuracy, designed to account for varying sampling probabilities.
##### dead
* Factor with levels alive dead
##### airbag
* A factor with levels none airbag
##### seatbelt
* A factor with levels none belted
##### frontal
* A numeric vector; 0 = non-frontal, 1 = frontal impact
##### sex
* A factor with levels f m
##### ageOFocc
* Age of occupant in years
##### yearacc
* Year of accident
##### yearVeh
* Year of model of vehicle; a numeric vector
##### abcat
* Did one or more (driver or passenger) airbag(s) deploy? This factor has levels deploy nodeploy unavail
##### occRole
* A factor with levels driver pass
##### deploy
* A numeric vector: 0 if an airbag was unavailable or did not deploy; 1 if one or more bags deployed.
##### injSeverity
* A numeric vector: <br>0=None, 1=Possible Injury, 2=No Incapacity, 3=Incapacity, 4=Killed, 5=Unknown, 6=Prior Death
##### caseid
* A character string created by pasting together the population sampling unit, the case number, and the vehicle number.<br> Within each year, use this to uniquely identify the vehicle.
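The factor levels listed above can double as a lightweight data dictionary. The following is a minimal sketch, assuming rows are available as plain dictionaries; the names `LEVELS`, `invalid_fields`, and `sample_row` are hypothetical and not part of the original notebook.

```python
# Allowed levels, transcribed from the variable descriptions above.
LEVELS = {
    "dvcat": ["1-9km/h", "10-24", "25-39", "40-54", "55+"],
    "dead": ["alive", "dead"],
    "airbag": ["none", "airbag"],
    "seatbelt": ["none", "belted"],
    "sex": ["f", "m"],
    "abcat": ["deploy", "nodeploy", "unavail"],
    "occRole": ["driver", "pass"],
}

def invalid_fields(row):
    """Return the names of categorical fields whose value is not a documented level."""
    return [name for name, levels in LEVELS.items()
            if name in row and row[name] not in levels]

# Illustrative row (hypothetical values chosen from the documented levels).
sample_row = {"dvcat": "25-39", "dead": "alive", "airbag": "none",
              "seatbelt": "belted", "sex": "f", "abcat": "unavail",
              "occRole": "driver"}
print(invalid_fields(sample_row))
```

A check like this can be run before encoding to catch unexpected labels early.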



### Import Dependencies / Machine Learning

In [1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

### Import Dataset

In [2]:
# Import the dataset from Google Drive:
url = ('https://drive.google.com/file/d/1t3Z8Blgy2BPmBB4FqrQkC_jie9IwYuQb/view?usp=sharing')
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
crash_1 = pd.read_csv(path,index_col=0)
crash_1.head()

Unnamed: 0,dvcat,weight,dead,airbag,seatbelt,frontal,sex,ageOFocc,yearacc,yearVeh,abcat,occRole,deploy,injSeverity,caseid
1,25-39,25.069,alive,none,belted,1,f,26,1997,1990.0,unavail,driver,0,3.0,2:3:1
2,10-24,25.069,alive,airbag,belted,1,f,72,1997,1995.0,deploy,driver,1,1.0,2:3:2
3,10-24,32.379,alive,none,none,1,f,69,1997,1988.0,unavail,driver,0,4.0,2:5:1
4,25-39,495.444,alive,airbag,belted,1,f,53,1997,1995.0,deploy,driver,1,1.0,2:10:1
5,25-39,25.069,alive,none,belted,1,f,32,1997,1988.0,unavail,driver,0,3.0,2:11:1
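The download path above is built by pulling the file ID out of the Google Drive share link. A small helper makes that step explicit; this is only a sketch, and the function name `drive_download_url` is an assumption introduced here for illustration.

```python
def drive_download_url(share_url):
    """Turn a '.../file/d/<id>/view?...' share link into a direct-download link."""
    parts = share_url.split('/')
    # The file ID is the path segment that follows the literal 'd' segment.
    file_id = parts[parts.index('d') + 1]
    return 'https://drive.google.com/uc?export=download&id=' + file_id

share = 'https://drive.google.com/file/d/1t3Z8Blgy2BPmBB4FqrQkC_jie9IwYuQb/view?usp=sharing'
print(drive_download_url(share))
```

Looking up the segment after `d` rather than using a fixed index from the end should be slightly more tolerant of share links with extra trailing path segments.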


### Initialize Dataset for Analysis

In [3]:
# Remove columns not needed for this analysis:

# 'weight': Observation weights of uncertain accuracy; not used as a feature here
# 'yearacc': Year the accident occurred (1997-2002)
# 'caseid': Not an individual accident identifier; numerous incidents share a single ID
# 'airbag' & 'deploy': Values are duplicated in the 'abcat' column

crash_2 = crash_1.drop(['weight','yearacc','caseid','airbag','deploy'], axis=1)

# Rename the column titles for better clarity:
# ('airbag' and 'yearacc' were dropped above, so they need no new names.)
crash_2.rename(columns={'dvcat':'est_impact_kmh',
                        'dead':'ultimate_outcome',
                        'frontal':'front_impact',
                        'ageOFocc':'occupant_age',
                        'yearVeh':'vehicle_year',
                        'abcat':'airbag_deployment',
                        'occRole':'occupant_role',
                        'injSeverity':'injury_severity'},inplace=True)

crash_2.head()

Unnamed: 0,est_impact_kmh,ultimate_outcome,seatbelt,front_impact,sex,occupant_age,vehicle_year,airbag_deployment,occupant_role,injury_severity
1,25-39,alive,belted,1,f,26,1990.0,unavail,driver,3.0
2,10-24,alive,belted,1,f,72,1995.0,deploy,driver,1.0
3,10-24,alive,none,1,f,69,1988.0,unavail,driver,4.0
4,25-39,alive,belted,1,f,53,1995.0,deploy,driver,1.0
5,25-39,alive,belted,1,f,32,1988.0,unavail,driver,3.0


In [4]:
# Drop all rows with null values, then confirm that none remain:
crash_3 = crash_2.dropna()
for column in crash_3.columns:
    print(f'Column {column} has {crash_3[column].isnull().sum()}\
    null values')

Column est_impact_kmh has 0    null values
Column ultimate_outcome has 0    null values
Column seatbelt has 0    null values
Column front_impact has 0    null values
Column sex has 0    null values
Column occupant_age has 0    null values
Column vehicle_year has 0    null values
Column airbag_deployment has 0    null values
Column occupant_role has 0    null values
Column injury_severity has 0    null values


In [5]:
# Shorten the label of the lowest speed bin in est_impact_kmh:
crash_3['est_impact_kmh'] = crash_3['est_impact_kmh'].replace({'1-9km/h':'1-9'})

In [6]:
# Change the values to reflect belted or not belted:
crash_3['seatbelt'] = crash_3['seatbelt'].replace({'none':'not_belted'})
belted = crash_3.seatbelt.value_counts()
belted

belted        18465
not_belted     7598
Name: seatbelt, dtype: int64

In [7]:
crash_3

Unnamed: 0,est_impact_kmh,ultimate_outcome,seatbelt,front_impact,sex,occupant_age,vehicle_year,airbag_deployment,occupant_role,injury_severity
1,25-39,alive,belted,1,f,26,1990.0,unavail,driver,3.0
2,10-24,alive,belted,1,f,72,1995.0,deploy,driver,1.0
3,10-24,alive,not_belted,1,f,69,1988.0,unavail,driver,4.0
4,25-39,alive,belted,1,f,53,1995.0,deploy,driver,1.0
5,25-39,alive,belted,1,f,32,1988.0,unavail,driver,3.0
...,...,...,...,...,...,...,...,...,...,...
26213,25-39,alive,belted,1,m,17,1985.0,unavail,driver,0.0
26214,10-24,alive,belted,1,m,54,2002.0,nodeploy,driver,2.0
26215,10-24,alive,belted,1,f,27,1990.0,deploy,driver,3.0
26216,25-39,alive,belted,1,f,18,1999.0,deploy,driver,0.0


In [8]:
severity = crash_3.injury_severity.value_counts()
severity

3.0    8495
0.0    6478
1.0    5595
2.0    4242
4.0    1118
5.0     133
6.0       2
Name: injury_severity, dtype: int64
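To make the counts above easier to read, the numeric codes can be paired with the labels from the `injSeverity` description. This is a small sketch using the counts exactly as printed; `SEVERITY_LABELS` and `counts` are names introduced here for illustration.

```python
# Labels transcribed from the injSeverity description.
SEVERITY_LABELS = {0: "None", 1: "Possible Injury", 2: "No Incapacity",
                   3: "Incapacity", 4: "Killed", 5: "Unknown", 6: "Prior Death"}

# Counts copied from the value_counts() output above.
counts = {3: 8495, 0: 6478, 1: 5595, 2: 4242, 4: 1118, 5: 133, 6: 2}

total = sum(counts.values())
for code, n in sorted(counts.items()):
    print(f"{code}  {SEVERITY_LABELS[code]:<16} {n:>5}  {n / total:6.1%}")
```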

In [13]:

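# Drop 'injury_severity' (its level 4 = Killed overlaps with the target 'ultimate_outcome').
# The outputs shown from here on still list 'injury_severity', so they appear to
# come from a run in which this drop had not been applied.
crash_4 = crash_3.drop(['injury_severity'], axis=1)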
crash_4

Unnamed: 0,est_impact_kmh,ultimate_outcome,seatbelt,front_impact,sex,occupant_age,vehicle_year,airbag_deployment,occupant_role,injury_severity
1,25-39,alive,belted,1,f,26,1990.0,unavail,driver,3.0
2,10-24,alive,belted,1,f,72,1995.0,deploy,driver,1.0
3,10-24,alive,not_belted,1,f,69,1988.0,unavail,driver,4.0
4,25-39,alive,belted,1,f,53,1995.0,deploy,driver,1.0
5,25-39,alive,belted,1,f,32,1988.0,unavail,driver,3.0
...,...,...,...,...,...,...,...,...,...,...
26213,25-39,alive,belted,1,m,17,1985.0,unavail,driver,0.0
26214,10-24,alive,belted,1,m,54,2002.0,nodeploy,driver,2.0
26215,10-24,alive,belted,1,f,27,1990.0,deploy,driver,3.0
26216,25-39,alive,belted,1,f,18,1999.0,deploy,driver,0.0


### Integer Encoding

In [14]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

crash_5 = crash_4.copy()
crash_5['est_impact_kmh'] = le.fit_transform(crash_5['est_impact_kmh']) 
crash_5['ultimate_outcome'] = le.fit_transform(crash_5['ultimate_outcome'])
crash_5['seatbelt'] = le.fit_transform(crash_5['seatbelt'])
crash_5['sex'] = le.fit_transform(crash_5['sex'])
crash_5['airbag_deployment'] = le.fit_transform(crash_5['airbag_deployment'])
crash_5['occupant_role'] = le.fit_transform(crash_5['occupant_role'])

crash_5.head()

Unnamed: 0,est_impact_kmh,ultimate_outcome,seatbelt,front_impact,sex,occupant_age,vehicle_year,airbag_deployment,occupant_role,injury_severity
1,2,0,0,1,0,26,1990.0,2,0,3.0
2,1,0,0,1,0,72,1995.0,0,0,1.0
3,1,0,1,1,0,69,1988.0,2,0,4.0
4,2,0,0,1,0,53,1995.0,0,0,1.0
5,2,0,0,1,0,32,1988.0,2,0,3.0
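`LabelEncoder` assigns integer codes in the sorted order of the unique labels. For the impact-speed bins, that lexicographic order happens to coincide with the natural low-to-high order, which the following plain-Python sketch reproduces (the variable names are illustrative only).

```python
# The five speed bins after '1-9km/h' was shortened to '1-9'.
impact_levels = ['25-39', '1-9', '55+', '10-24', '40-54']

# Same rule LabelEncoder applies: codes follow the sorted unique labels.
codes = {label: i for i, label in enumerate(sorted(impact_levels))}
print(codes)

# Keeping the inverse mapping allows decoding later, which is useful
# because a single reused encoder only remembers the last column it was fit on.
decode = {i: label for label, i in codes.items()}
```

The same rule gives `belted` = 0 and `not_belted` = 1, and `deploy` = 0, `nodeploy` = 1, `unavail` = 2, matching the encoded preview above.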


### Split Data into Training & Testing

In [15]:
# Separate the features (X) from the target (y):
y = crash_5['ultimate_outcome']
X = crash_5.drop(columns='ultimate_outcome')

# Split data into training & testing:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)
X_train.shape

(19446, 9)
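The training shape can be sanity-checked with a little arithmetic. When no `test_size` is given, `train_test_split` holds out 25% of the rows, and `stratify=y` keeps the class proportions roughly equal in both parts. The sketch below uses the training row count above together with the per-class test support (6187 and 295) reported in the classification reports of this notebook.

```python
import math

n_train = 19446          # rows in X_train, from X_train.shape
n_test = 6187 + 295      # per-class support in the test set
n_total = n_train + n_test

# Default hold-out: 25% of the rows, rounded up.
expected_test = math.ceil(0.25 * n_total)
print(n_total, expected_test)

# Share of the minority class (encoded as 1) in the test set.
minority_share_test = 295 / n_test
print(f"{minority_share_test:.2%}")
```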

### Balanced Random Forest Classifier

In [16]:
# Train the BalancedRandomForestClassifier (it undersamples each bootstrap sample internally):
from imblearn.ensemble import BalancedRandomForestClassifier

BRFC = BalancedRandomForestClassifier(n_estimators=100, random_state=1)
BRFC.fit(X_train, y_train)

BalancedRandomForestClassifier(random_state=1)

In [17]:
# Calculate the balanced accuracy score:
from sklearn.metrics import balanced_accuracy_score

BFRC_pred = BRFC.predict(X_test)
balanced_accuracy_score(y_test, BFRC_pred)

0.9480145630669008

In [18]:
# Display the confusion matrix:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, BFRC_pred)

array([[6110,   77],
       [  27,  268]])
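Balanced accuracy is the mean of the per-class recalls, so it can be recomputed by hand from the confusion matrix above (rows are true classes, columns are predicted classes). A short sketch:

```python
# Confusion matrix copied from the output above.
cm = [[6110, 77],
      [27, 268]]

# Recall of each class = diagonal entry divided by its row total.
recalls = [row[i] / sum(row) for i, row in enumerate(cm)]
balanced_accuracy = sum(recalls) / len(recalls)
print(recalls, balanced_accuracy)
```

The result agrees with the `balanced_accuracy_score` value reported for this model.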

In [20]:
# Print the imbalanced classification report:
from imblearn.metrics import classification_report_imbalanced

print(classification_report_imbalanced(y_test, BFRC_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       1.00      0.99      0.91      0.99      0.95      0.90      6187
          1       0.78      0.91      0.99      0.84      0.95      0.89       295

avg / total       0.99      0.98      0.91      0.98      0.95      0.90      6482
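The columns of the imbalanced report (`pre`, `rec`, `spe`, `f1`, `geo`) can all be derived from the four cells of the confusion matrix. The sketch below recomputes the class-1 row for the balanced random forest; the variable names are chosen here for illustration.

```python
import math

# Cells of the balanced random forest confusion matrix, with class 1 as "positive".
tn, fp = 6110, 77
fn, tp = 27, 268

precision = tp / (tp + fp)            # pre: flagged cases that were truly class 1
recall = tp / (tp + fn)               # rec: class 1 cases that were found
specificity = tn / (tn + fp)          # spe: class 0 cases correctly left alone
f1 = 2 * precision * recall / (precision + recall)
geo = math.sqrt(recall * specificity) # geo: geometric mean of rec and spe

print(f"pre={precision:.4f} rec={recall:.4f} spe={specificity:.4f} "
      f"f1={f1:.4f} geo={geo:.4f}")
```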



In [21]:
# List the features sorted in descending order by feature importance:
importances = BRFC.feature_importances_
sorted(zip(importances, X.columns), reverse=True)

[(0.6810093179313453, 'injury_severity'),
 (0.1396865046231222, 'est_impact_kmh'),
 (0.07260846938756133, 'occupant_age'),
 (0.03807350400968621, 'seatbelt'),
 (0.03383656506462821, 'vehicle_year'),
 (0.012448528664141633, 'front_impact'),
 (0.00800275547226642, 'airbag_deployment'),
 (0.0076026304887454686, 'sex'),
 (0.006731724358503153, 'occupant_role')]

### Easy Ensemble AdaBoost Classifier

In [22]:
# Train the classifier:
from imblearn.ensemble import EasyEnsembleClassifier

EEC = EasyEnsembleClassifier(n_estimators=100, random_state=1)
EEC.fit(X_train, y_train)

EasyEnsembleClassifier(n_estimators=100, random_state=1)

In [23]:
# Calculate the balanced accuracy score:
EEC_pred = EEC.predict(X_test)
balanced_accuracy_score(y_test, EEC_pred)

0.9488991406256421

In [24]:
# Display the confusion matrix:
confusion_matrix(y_test, EEC_pred)

array([[6079,  108],
       [  25,  270]])

In [25]:
# Print the imbalanced classification report:
print(classification_report_imbalanced(y_test, EEC_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       1.00      0.98      0.92      0.99      0.95      0.91      6187
          1       0.71      0.92      0.98      0.80      0.95      0.89       295

avg / total       0.98      0.98      0.92      0.98      0.95      0.90      6482



### Naive Random Oversampling

In [26]:
# Resample the training data with the RandomOversampler:
from imblearn.over_sampling import RandomOverSampler

ROS = RandomOverSampler(random_state=1)
X_resampled, y_resampled = ROS.fit_resample(X_train, y_train)
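Conceptually, random oversampling duplicates randomly chosen minority rows until every class is as large as the biggest one. The following pure-Python sketch illustrates that idea on toy data; it is a simplified stand-in written for this explanation, not the `RandomOverSampler` implementation, and `random_oversample` is a hypothetical name.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=1):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - n):
            j = rng.choice(idx)          # sample with replacement
            X_out.append(X[j])
            y_out.append(label)
    return X_out, y_out

# Toy data: six rows of class 0 and two rows of class 1.
X_toy = [[i] for i in range(8)]
y_toy = [0] * 6 + [1] * 2
X_bal, y_bal = random_oversample(X_toy, y_toy)
print(Counter(y_bal))
```

Because the added rows are exact copies, no new information is created; the classifier simply sees the minority rows more often.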

In [27]:
# Train the Logistic Regression model using the resampled data:
from sklearn.linear_model import LogisticRegression

ROS_model = LogisticRegression(solver='lbfgs', random_state=1)
ROS_model.fit(X_resampled, y_resampled)

LogisticRegression(random_state=1)

In [28]:
# Calculate the balanced accuracy score:
ROS_pred = ROS_model.predict(X_test)
balanced_accuracy_score(y_test, ROS_pred)

0.9516424542438628

In [29]:
# Display the confusion matrix:
confusion_matrix(y_test, ROS_pred)

array([[6071,  116],
       [  23,  272]])

In [30]:
# Print the imbalanced classification report:
print(classification_report_imbalanced(y_test, ROS_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       1.00      0.98      0.92      0.99      0.95      0.91      6187
          1       0.70      0.92      0.98      0.80      0.95      0.90       295

avg / total       0.98      0.98      0.92      0.98      0.95      0.91      6482



### SMOTE Oversampling

In [31]:
# Resample the training data with SMOTE:
from imblearn.over_sampling import SMOTE

X_resampled, y_resampled = SMOTE(random_state=1, sampling_strategy='auto').fit_resample(X_train, y_train)
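Unlike random oversampling, SMOTE creates new minority rows by interpolating between an existing minority row and one of its minority-class neighbours. The sketch below is a deliberately simplified illustration that always uses the single nearest neighbour (the library chooses among several); the function names and the toy rows are assumptions made for this example.

```python
import random

def nearest_neighbor(point, candidates):
    """Return the candidate closest to `point` (squared Euclidean distance)."""
    return min(candidates,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(point, c)))

def smote_sample(minority, rng):
    """Create one synthetic row between a random minority row and its nearest minority neighbour."""
    base = rng.choice(minority)
    others = [row for row in minority if row is not base]
    neighbor = nearest_neighbor(base, others)
    gap = rng.random()  # position along the segment, in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbor)]

# Hypothetical minority rows: (occupant_age, integer-encoded sex)
minority = [[70.0, 0.0], [72.0, 1.0], [30.0, 1.0]]
synthetic = smote_sample(minority, random.Random(1))
print(synthetic)
```

Note that interpolation can produce fractional values for integer-encoded categorical columns such as `sex`. imbalanced-learn also provides `SMOTENC` for data that mixes categorical and continuous features, which may be worth considering here.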

In [32]:
# Train the Logistic Regression model using the resampled data:
Smote_model = LogisticRegression(solver='lbfgs', random_state=1)
Smote_model.fit(X_resampled, y_resampled)

LogisticRegression(random_state=1)

In [33]:
# Calculate the balanced accuracy score:
SMOTE_pred = Smote_model.predict(X_test)
balanced_accuracy_score(y_test, SMOTE_pred)

0.9490673445962421

In [34]:
# Display the confusion matrix
confusion_matrix(y_test, SMOTE_pred)

array([[6144,   43],
       [  28,  267]])

In [35]:
# Print the imbalanced classification report
print(classification_report_imbalanced(y_test, SMOTE_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       1.00      0.99      0.91      0.99      0.95      0.91      6187
          1       0.86      0.91      0.99      0.88      0.95      0.89       295

avg / total       0.99      0.99      0.91      0.99      0.95      0.91      6482



### Undersampling

In [36]:
# Resample the training data with the ClusterCentroids resampler:
from imblearn.under_sampling import ClusterCentroids

CC = ClusterCentroids(random_state=1)
X_resampled, y_resampled = CC.fit_resample(X_train, y_train)

In [37]:
# Train the Logistic Regression model using the resampled data:
CC_model = LogisticRegression(solver='lbfgs', random_state=78)
CC_model.fit(X_resampled, y_resampled)

LogisticRegression(random_state=78)

In [38]:
# Calculate the balanced accuracy score:
CC_pred = CC_model.predict(X_test)
balanced_accuracy_score(y_test, CC_pred)

0.8839633676955234

In [39]:
# Display the confusion matrix:
confusion_matrix(y_test, CC_pred)

array([[4877, 1310],
       [   6,  289]])
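The confusion matrix above shows the trade-off made by this undersampled model: almost every class-1 case is caught, at the cost of many false alarms. A short worked calculation, using only the four cells printed above:

```python
# Cells of the ClusterCentroids confusion matrix, with class 1 as "positive".
tn, fp = 4877, 1310
fn, tp = 6, 289

recall = tp / (tp + fn)               # share of class 1 cases that were flagged
precision = tp / (tp + fp)            # share of flagged cases that were truly class 1
false_alarms_per_hit = fp / tp        # false positives for each true positive
balanced_accuracy = (recall + tn / (tn + fp)) / 2

print(f"recall={recall:.3f} precision={precision:.3f} "
      f"false alarms per hit={false_alarms_per_hit:.2f} "
      f"balanced accuracy={balanced_accuracy:.4f}")
```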

In [40]:
# Print the imbalanced classification report:
print(classification_report_imbalanced(y_test, CC_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       1.00      0.79      0.98      0.88      0.88      0.76      6187
          1       0.18      0.98      0.79      0.31      0.88      0.79       295

avg / total       0.96      0.80      0.97      0.85      0.88      0.76      6482



### Combination (Over and Under) Sampling

In [41]:
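# Resample the training data with SMOTEENN:
from imblearn.combine import SMOTEENN

SMTN = SMOTEENN(random_state=0)
X_resampled2, y_resampled2 = SMTN.fit_resample(X_train, y_train)

In [42]:
# Train the Logistic Regression model using the SMOTEENN-resampled data.
# The outputs displayed below equal the ClusterCentroids results, consistent with
# a fit on the earlier X_resampled/y_resampled; re-run to obtain SMOTEENN scores.
SMTN_model = LogisticRegression(solver='lbfgs', random_state=1)
SMTN_model.fit(X_resampled2, y_resampled2)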

LogisticRegression(random_state=1)
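SMOTEENN combines SMOTE oversampling with an Edited Nearest Neighbours (ENN) cleaning step that removes rows whose label disagrees with their neighbourhood. The sketch below illustrates the cleaning idea only, using a simplified majority-vote rule on toy one-dimensional data; the library's default rule may be stricter, and `edited_nearest_neighbours` is a hypothetical name used for this example.

```python
from collections import Counter

def edited_nearest_neighbours(X, y, k=3):
    """Keep only rows whose label matches the majority label of their k nearest neighbours."""
    keep = []
    for i, xi in enumerate(X):
        # Squared distances from row i to every other row, nearest first.
        others = sorted(
            (sum((a - b) ** 2 for a, b in zip(xi, X[j])), j)
            for j in range(len(X)) if j != i
        )
        votes = Counter(y[j] for _, j in others[:k])
        if votes.most_common(1)[0][0] == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data: the last row is labelled 1 but sits in the middle of class 0.
X_toy = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0], [1.5]]
y_toy = [0, 0, 0, 1, 1, 1, 1]
X_clean, y_clean = edited_nearest_neighbours(X_toy, y_toy)
print(X_clean, y_clean)
```

In this toy case only the stray row at 1.5 is removed, which is the kind of overlap between classes that the cleaning step is meant to reduce.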

In [43]:
# Calculate the balanced accuracy score:
SMTN_pred = SMTN_model.predict(X_test)
balanced_accuracy_score(y_test, SMTN_pred)

0.8839633676955234

In [44]:
# Display the confusion matrix:
confusion_matrix(y_test, SMTN_pred)

array([[4877, 1310],
       [   6,  289]])

In [45]:
# Print the imbalanced classification report:
print(classification_report_imbalanced(y_test, SMTN_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       1.00      0.79      0.98      0.88      0.88      0.76      6187
          1       0.18      0.98      0.79      0.31      0.88      0.79       295

avg / total       0.96      0.80      0.97      0.85      0.88      0.76      6482
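For a side-by-side view, the balanced accuracy scores printed in this notebook can be collected and ranked. The sketch below copies the values exactly as displayed; the dictionary keys are descriptive labels chosen here, and the SMOTEENN entry is the value shown in its cell, which is identical to the ClusterCentroids score.

```python
# Balanced accuracy scores copied from the cell outputs in this notebook.
scores = {
    "Balanced Random Forest": 0.9480145630669008,
    "Easy Ensemble": 0.9488991406256421,
    "Random oversampling + Logistic Regression": 0.9516424542438628,
    "SMOTE + Logistic Regression": 0.9490673445962421,
    "ClusterCentroids + Logistic Regression": 0.8839633676955234,
    "SMOTEENN cell (as displayed)": 0.8839633676955234,
}

ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for name, score in ranked:
    print(f"{name:<45} {score:.4f}")
```

Balanced accuracy alone hides the precision differences visible in the reports (for example 0.86 for SMOTE versus 0.70 for random oversampling on class 1), so the ranking is best read together with the confusion matrices.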

