# Airbags and Other Influences on Accident Fatalities
##### Description:
* US data, for 1997-2002, from police-reported car crashes in which there is a harmful event (to people or property),<br> and from which at least one vehicle was towed. Data are restricted to front-seat occupants, include only a subset<br> of the variables recorded, and are also restricted in other ways.
##### dvcat
* Ordered factor with levels (estimated impact speeds) 1-9km/h, 10-24, 25-39, 40-54, 55+
##### weight
* Observation weights, albeit of uncertain accuracy, designed to account for varying sampling probabilities.
##### dead
* Factor with levels alive dead
##### airbag
* A factor with levels none airbag
##### seatbelt
* A factor with levels none belted
##### frontal
* A numeric vector; 0 = non-frontal, 1=frontal impact
##### sex
* A factor with levels f m
##### ageOFocc
* Age of occupant in years
##### yearacc
* Year of accident
##### yearVeh
* Year of model of vehicle; a numeric vector
##### abcat
* Did one or more (driver or passenger) airbag(s) deploy? This factor has levels deploy nodeploy unavail
##### occRole
* A factor with levels driver pass
##### deploy
* A numeric vector: 0 if an airbag was unavailable or did not deploy; 1 if one or more bags deployed.
##### injSeverity
* A numeric vector: <br>0=None, 1=Possible Injury, 2=No Incapacity, 3=Incapacity, 4=Killed, 5=Unknown, 6=Prior Death
##### caseid
* A character string created by pasting together the population's sampling unit, the case number, and the vehicle number.<br> Within each year, use this to uniquely identify the vehicle.



# Import Dependencies / Machine Learning

In [1]:
# Disable the Python deprecation warnings:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

In [2]:
from pathlib import Path
from collections import Counter

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [3]:
# accuracy_score is already imported in the previous cell:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# Import Dataset

In [4]:
# Import the dataset from Google Drive:
url = ('https://drive.google.com/file/d/1t3Z8Blgy2BPmBB4FqrQkC_jie9IwYuQb/view?usp=sharing')
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
crash_df = pd.read_csv(path,index_col=0)
crash_df.head(25)

Unnamed: 0,dvcat,weight,dead,airbag,seatbelt,frontal,sex,ageOFocc,yearacc,yearVeh,abcat,occRole,deploy,injSeverity,caseid
1,25-39,25.069,alive,none,belted,1,f,26,1997,1990.0,unavail,driver,0,3.0,2:3:1
2,10-24,25.069,alive,airbag,belted,1,f,72,1997,1995.0,deploy,driver,1,1.0,2:3:2
3,10-24,32.379,alive,none,none,1,f,69,1997,1988.0,unavail,driver,0,4.0,2:5:1
4,25-39,495.444,alive,airbag,belted,1,f,53,1997,1995.0,deploy,driver,1,1.0,2:10:1
5,25-39,25.069,alive,none,belted,1,f,32,1997,1988.0,unavail,driver,0,3.0,2:11:1
6,40-54,25.069,alive,none,belted,1,f,22,1997,1985.0,unavail,driver,0,3.0,2:11:2
7,55+,27.078,alive,none,belted,1,m,22,1997,1984.0,unavail,driver,0,3.0,2:13:1
8,55+,27.078,dead,none,none,1,m,32,1997,1987.0,unavail,driver,0,4.0,2:13:2
9,10-24,812.869,alive,none,belted,0,m,40,1997,1984.0,unavail,driver,0,1.0,2:14:1
10,10-24,812.869,alive,none,belted,1,f,18,1997,1987.0,unavail,driver,0,0.0,2:14:2
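The import cell rewrites a Google Drive share link into a direct-download link by slicing the URL on `/`. A minimal sketch of the same idea as a small helper is shown here; the function name `drive_download_url` is hypothetical, and it assumes the share link follows the `/file/d/<id>/view` layout used above.

```python
from urllib.parse import urlparse


def drive_download_url(share_url: str) -> str:
    """Turn a Google Drive '/file/d/<id>/view' share link into a download link."""
    # Path looks like: /file/d/<file_id>/view
    parts = urlparse(share_url).path.strip('/').split('/')
    file_id = parts[parts.index('d') + 1]
    return f'https://drive.google.com/uc?export=download&id={file_id}'


share_url = 'https://drive.google.com/file/d/1t3Z8Blgy2BPmBB4FqrQkC_jie9IwYuQb/view?usp=sharing'
download_url = drive_download_url(share_url)
print(download_url)
```

The resulting string can be passed to `pd.read_csv` exactly as `path` is in the cell above.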


# Initialize Dataset for Analysis

In [5]:
# Remove unneeded columns:
crash_v1 = crash_df.drop(['weight','yearacc','caseid','airbag','deploy'], axis=1)

# Rename the columns so they are easier to understand:
crash_v1.rename(columns={'dvcat':'est_impact_kmh',
                         'frontal':'front_impact',
                         'ageOFocc':'occupant_age',
                         'yearVeh':'vehicle_year',
                         'abcat':'airbag_deployment',
                         'occRole':'occupant_role',
                         'dead':'dead_or_alive'},inplace=True)

crash_v1.head(20)

Unnamed: 0,est_impact_kmh,dead_or_alive,seatbelt,front_impact,sex,occupant_age,vehicle_year,airbag_deployment,occupant_role,injSeverity
1,25-39,alive,belted,1,f,26,1990.0,unavail,driver,3.0
2,10-24,alive,belted,1,f,72,1995.0,deploy,driver,1.0
3,10-24,alive,none,1,f,69,1988.0,unavail,driver,4.0
4,25-39,alive,belted,1,f,53,1995.0,deploy,driver,1.0
5,25-39,alive,belted,1,f,32,1988.0,unavail,driver,3.0
6,40-54,alive,belted,1,f,22,1985.0,unavail,driver,3.0
7,55+,alive,belted,1,m,22,1984.0,unavail,driver,3.0
8,55+,dead,none,1,m,32,1987.0,unavail,driver,4.0
9,10-24,alive,belted,0,m,40,1984.0,unavail,driver,1.0
10,10-24,alive,belted,1,f,18,1987.0,unavail,driver,0.0


In [6]:
# Check the dataset for any null values:
for column in crash_v1.columns:
    print(f'Column {column} has {crash_v1[column].isnull().sum()}\
    null values')    

Column est_impact_kmh has 0    null values
Column dead_or_alive has 0    null values
Column seatbelt has 0    null values
Column front_impact has 0    null values
Column sex has 0    null values
Column occupant_age has 0    null values
Column vehicle_year has 1    null values
Column airbag_deployment has 0    null values
Column occupant_role has 0    null values
Column injSeverity has 153    null values
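The per-column loop above can also be written as a single vectorized call. The sketch below uses a small made-up frame (not the crash data) to show `isnull().sum()` returning one count per column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for crash_v1; the values are made up for illustration.
toy = pd.DataFrame({
    'vehicle_year': [1990.0, np.nan, 1995.0, 1988.0],
    'injSeverity': [3.0, 1.0, np.nan, np.nan],
    'seatbelt': ['belted', 'none', 'belted', 'belted'],
})

null_counts = toy.isnull().sum()       # one null count per column
print(null_counts[null_counts > 0])    # show only the columns that have gaps
```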


In [7]:
# Drop every row that contains a null value:
crash_v2 = crash_v1.dropna()
for column in crash_v2.columns:
    print(f'Column {column} has {crash_v2[column].isnull().sum()}\
    null values')

Column est_impact_kmh has 0    null values
Column dead_or_alive has 0    null values
Column seatbelt has 0    null values
Column front_impact has 0    null values
Column sex has 0    null values
Column occupant_age has 0    null values
Column vehicle_year has 0    null values
Column airbag_deployment has 0    null values
Column occupant_role has 0    null values
Column injSeverity has 0    null values
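`dropna()` with no arguments removes every row that has a null in any column, so it is worth checking how many rows are lost. The following sketch uses a made-up frame to count the affected rows and to show the `subset` argument as one way to limit the drop to selected columns.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    'vehicle_year': [1990.0, np.nan, 1995.0, 1988.0],
    'injSeverity': [3.0, 1.0, np.nan, np.nan],
})

rows_with_nulls = int(toy.isnull().any(axis=1).sum())   # rows that dropna() would remove
cleaned = toy.dropna()
rows_dropped = len(toy) - len(cleaned)

# Only require vehicle_year to be present; injSeverity gaps are kept.
year_only = toy.dropna(subset=['vehicle_year'])

print(rows_with_nulls, rows_dropped, len(year_only))
```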


In [8]:
# Print out the est_impact_kmh value counts:
impact = crash_v2.est_impact_kmh.value_counts()
impact

10-24      12766
25-39       8165
40-54       2965
55+         1491
1-9km/h      676
Name: est_impact_kmh, dtype: int64

In [9]:
# Regroup the categories into 'Under40' and 'Over40' (km/h) groups:
U = 'Under40'
O = 'Over40'

crash_v2['est_impact_kmh'] = crash_v2['est_impact_kmh'].replace({'1-9km/h':U,'10-24':U,'25-39':U})
crash_v2['est_impact_kmh'] = crash_v2['est_impact_kmh'].replace({'40-54':O,'55+':O})

# Print out the est_impact_kmh value counts:
impact = crash_v2.est_impact_kmh.value_counts()
impact

Under40    21607
Over40      4456
Name: est_impact_kmh, dtype: int64
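The regrouping can be checked against the original counts. This sketch rebuilds the five-level counts printed earlier, applies one mapping dictionary (the name `group_of` is hypothetical), and confirms that the grouped totals match the output above.

```python
import pandas as pd

# Counts copied from the first est_impact_kmh value_counts() output.
counts = pd.Series({'10-24': 12766, '25-39': 8165, '40-54': 2965,
                    '55+': 1491, '1-9km/h': 676})

# A single mapping from each original level to its new group.
group_of = {'1-9km/h': 'Under40', '10-24': 'Under40', '25-39': 'Under40',
            '40-54': 'Over40', '55+': 'Over40'}

grouped = counts.groupby(group_of).sum()
print(grouped)

# The same dictionary also works row by row with Series.map:
impact = pd.Series(['1-9km/h', '55+', '25-39'])
mapped = impact.map(group_of).tolist()
print(mapped)
```

Using one dictionary for all five levels keeps the grouping rule in a single place instead of two separate `replace` calls.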

In [10]:
crash_v2.head()

Unnamed: 0,est_impact_kmh,dead_or_alive,seatbelt,front_impact,sex,occupant_age,vehicle_year,airbag_deployment,occupant_role,injSeverity
1,Under40,alive,belted,1,f,26,1990.0,unavail,driver,3.0
2,Under40,alive,belted,1,f,72,1995.0,deploy,driver,1.0
3,Under40,alive,none,1,f,69,1988.0,unavail,driver,4.0
4,Under40,alive,belted,1,f,53,1995.0,deploy,driver,1.0
5,Under40,alive,belted,1,f,32,1988.0,unavail,driver,3.0


In [12]:
# Print out the dead_or_alive value counts:
survive = crash_v2.dead_or_alive.value_counts()
survive

alive    24883
dead      1180
Name: dead_or_alive, dtype: int64

In [13]:
# Print out the airbag value counts:
airbag = crash_v2.airbag_deployment.value_counts()
airbag

unavail     11727
deploy       8799
nodeploy     5537
Name: airbag_deployment, dtype: int64

In [15]:
# Print out the seatbelt value counts:
seatbelt = crash_v2.seatbelt.value_counts()
seatbelt

belted    18465
none       7598
Name: seatbelt, dtype: int64

In [16]:
# Change the values to reflect belted or not belted:
crash_v2['seatbelt'] = crash_v2['seatbelt'].replace({'none':'not_belted'})
belted = crash_v2.seatbelt.value_counts()
belted

belted        18465
not_belted     7598
Name: seatbelt, dtype: int64

In [17]:
# Print out the front_impact value counts:
front = crash_v2.front_impact.value_counts()
front

1    16775
0     9288
Name: front_impact, dtype: int64

In [18]:
# Print out the vehicle_year value counts: 
year = crash_v2.vehicle_year.value_counts()
year

1995.0    2026
1997.0    1885
1994.0    1832
1996.0    1813
1998.0    1809
1993.0    1622
1999.0    1568
1992.0    1415
1991.0    1406
1989.0    1352
1990.0    1317
2000.0    1259
1988.0    1239
1987.0    1019
1986.0     905
2001.0     708
1985.0     708
1984.0     519
2002.0     362
1983.0     267
1982.0     191
1981.0     143
1979.0     129
1978.0     122
1980.0     102
1977.0      58
1976.0      37
1973.0      34
2003.0      31
1975.0      28
1974.0      25
1969.0      23
1972.0      23
1966.0      17
1971.0      17
1970.0      16
1968.0      13
1967.0       9
1963.0       4
1965.0       4
1956.0       2
1961.0       1
1964.0       1
1953.0       1
1959.0       1
Name: vehicle_year, dtype: int64

In [19]:
# Print out the airbag_deployment value counts:
#deploy = crash_v2.airbag_deployment.value_counts()
#deploy

In [20]:
# Remove all crashes where the airbags were not available:
#unavailable = crash_v2[crash_v2['airbag_deployment'] == 'unavail'].index
#crash_v2.drop(unavailable, inplace=True)
#crash_v2.head()

In [21]:
# Verify 'unavail' has been removed from the airbag_deployment column: 
#unavail_removed = crash_v2.airbag_deployment.value_counts()
#unavail_removed

In [22]:
# Print out the occupant_role value counts: 
occupant = crash_v2.occupant_role.value_counts()
occupant

driver    20541
pass       5522
Name: occupant_role, dtype: int64

In [23]:
# Create a new DataFrame with only vehicle model years 1990 and newer:
#crash_v3 = crash_v2.loc[crash_v2['vehicle_year'] >= 1990]
#crash_v3.head()

In [24]:
# Create a new DataFrame with only front-impact crashes:
crash_v4 = crash_v2.loc[crash_v2['front_impact'] == 1]
crash_v4.head()

Unnamed: 0,est_impact_kmh,dead_or_alive,seatbelt,front_impact,sex,occupant_age,vehicle_year,airbag_deployment,occupant_role,injSeverity
1,Under40,alive,belted,1,f,26,1990.0,unavail,driver,3.0
2,Under40,alive,belted,1,f,72,1995.0,deploy,driver,1.0
3,Under40,alive,not_belted,1,f,69,1988.0,unavail,driver,4.0
4,Under40,alive,belted,1,f,53,1995.0,deploy,driver,1.0
5,Under40,alive,belted,1,f,32,1988.0,unavail,driver,3.0


In [25]:
# Delete the front_impact column:
#crash_v5 = crash_v4.drop(['front_impact'], axis=1)
#crash_v5.head()

In [28]:
# Print out the dead_or_alive value counts:
survive2 = crash_v4.dead_or_alive.value_counts()
survive2

alive    16193
dead       582
Name: dead_or_alive, dtype: int64
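Raw counts hide how unbalanced the outcome is. The sketch below converts the two `dead_or_alive` outputs in this notebook into proportions. These are unweighted sample shares (the `weight` column was dropped earlier), so they should not be read as population fatality rates.

```python
import pandas as pd

# Counts copied from the value_counts() outputs in this notebook.
all_crashes = pd.Series({'alive': 24883, 'dead': 1180})
frontal_only = pd.Series({'alive': 16193, 'dead': 582})


def share(counts: pd.Series) -> pd.Series:
    """Convert raw counts to proportions that sum to 1."""
    return counts / counts.sum()


comparison = pd.DataFrame({'all_crashes': share(all_crashes),
                           'frontal_only': share(frontal_only)})
print(comparison.round(4))
```

On a DataFrame column, `value_counts(normalize=True)` gives the same proportions directly.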

In [30]:
# Print out the injSeverity value counts:
injury = crash_v4.injSeverity.value_counts()
injury

3.0    5477
0.0    4245
1.0    3493
2.0    2919
4.0     556
5.0      83
6.0       2
Name: injSeverity, dtype: int64

# Integer Encoding

In [29]:
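le = LabelEncoder()

# Note: as written, this cell depends on an earlier notebook state. crash_v5 is only
# created in the commented-out cell In [25], and the 'occupant_status' and 'airbag'
# columns do not exist in crash_v4 as it is built above.
crash_v6 = crash_v5.copy()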
crash_v6['est_impact_kmh'] = le.fit_transform(crash_v6['est_impact_kmh']) 
crash_v6['occupant_status'] = le.fit_transform(crash_v6['occupant_status'])
crash_v6['airbag'] = le.fit_transform(crash_v6['airbag'])
crash_v6['seatbelt'] = le.fit_transform(crash_v6['seatbelt'])
crash_v6['airbag_deployment'] = le.fit_transform(crash_v6['airbag_deployment'])
crash_v6['occupant_role'] = le.fit_transform(crash_v6['occupant_role'])

crash_v6.head(30)

Unnamed: 0,est_impact_kmh,occupant_status,airbag,seatbelt,occupant_age,vehicle_year,airbag_deployment,occupant_role
2,1,0,0,0,72,1995.0,0,0
4,1,0,0,0,53,1995.0,0,0
13,1,0,0,0,67,1991.0,0,0
21,1,0,0,1,20,1995.0,0,0
25,1,0,0,0,23,1995.0,0,0
31,1,0,0,0,26,1992.0,0,0
37,1,0,0,1,19,1993.0,0,0
44,1,0,0,0,32,1995.0,0,0
57,1,0,0,1,74,1993.0,0,0
63,1,0,0,0,32,1995.0,1,0


In [30]:
# Separate the features (X) from the target (y):
y = crash_v6['est_impact_kmh']
X = crash_v6.drop(columns='est_impact_kmh')

# Split Data Into Training & Testing

In [31]:
# Split data into training & testing:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)
X_train.shape

(6656, 7)


In [32]:
X.describe()

Unnamed: 0,occupant_status,airbag,seatbelt,occupant_age,vehicle_year,airbag_deployment,occupant_role
count,8875.0,8875.0,8875.0,8875.0,8875.0,8875.0,8875.0
mean,0.024338,0.0,0.233239,37.348845,1996.728,0.205296,0.168113
std,0.154105,0.0,0.422917,17.347832,2.778653,0.40394,0.373987
min,0.0,0.0,0.0,16.0,1986.0,0.0,0.0
25%,0.0,0.0,0.0,23.0,1995.0,0.0,0.0
50%,0.0,0.0,0.0,33.0,1997.0,0.0,0.0
75%,0.0,0.0,0.0,48.0,1999.0,0.0,0.0
max,1.0,0.0,1.0,97.0,2003.0,1.0,1.0
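The split above relies on two defaults of `train_test_split`: a 25% test share and, because `stratify=y` is passed, class proportions that are preserved in both parts. The sketch below reproduces this with synthetic arrays. The 8,875 rows and 7 columns come from the outputs above; the class counts (1,279 zeros, 7,596 ones) are an assumption chosen to approximate the `est_impact_kmh` share in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows, n_features = 8875, 7
X_demo = np.zeros((n_rows, n_features))          # placeholder features
y_demo = np.array([0] * 1279 + [1] * 7596)       # assumed class counts

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo)

print(Xd_train.shape, Xd_test.shape)
print(y_demo.mean(), yd_train.mean(), yd_test.mean())   # class-1 share in each part
```

The training shape matches the `(6656, 7)` reported above, and the class-1 share is nearly identical in the full, training, and test sets.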


# Create a Logistic Regression Model

In [34]:
classifier = LogisticRegression(solver='lbfgs',
                                max_iter=200,
                                random_state=1)

In [35]:
# Fit (train) the model using the training data:
classifier.fit(X_train, y_train)

LogisticRegression(max_iter=200, random_state=1)
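The features here mix 0/1 flags with `vehicle_year` values near 2000, and solvers such as `lbfgs` often converge more easily when features share a similar scale. The following is a sketch of one common option, a `StandardScaler` placed in front of the same classifier; it is not part of the original analysis, and the data are synthetic stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 500
# Synthetic stand-ins: a binary flag, an age-like column, a year-like column.
X_demo = np.column_stack([
    rng.integers(0, 2, n),
    rng.integers(16, 98, n),
    rng.integers(1986, 2004, n),
])
y_demo = rng.integers(0, 2, n)   # random labels; only the mechanics matter here

scaled_model = make_pipeline(
    StandardScaler(),            # fitted on the training data only, inside the pipeline
    LogisticRegression(solver='lbfgs', max_iter=200, random_state=1),
)
scaled_model.fit(X_demo, y_demo)
demo_pred = scaled_model.predict(X_demo)
coef_shape = scaled_model.named_steps['logisticregression'].coef_.shape
print(coef_shape)
```

With a pipeline, coefficients refer to standardized features, so they are comparable in size across columns but no longer expressed in the original units.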

In [36]:
classifier.coef_

array([[-2.22854569e+00,  0.00000000e+00, -4.74727993e-01,
         6.45192227e-03,  7.61564648e-04,  2.40393455e+00,
         1.08137830e-01]])
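The raw coefficient array is easier to read when paired with feature names. The sketch below copies the values printed above and labels them in the column order shown by `X.describe()`; exponentiating a coefficient gives the multiplicative change in the odds of class 1 per unit increase in that feature.

```python
import numpy as np
import pandas as pd

# Column order taken from the X.describe() output of the first model.
feature_names = ['occupant_status', 'airbag', 'seatbelt', 'occupant_age',
                 'vehicle_year', 'airbag_deployment', 'occupant_role']

# Values copied from the classifier.coef_ output above.
coef = np.array([-2.22854569e+00, 0.0, -4.74727993e-01, 6.45192227e-03,
                 7.61564648e-04, 2.40393455e+00, 1.08137830e-01])

coef_table = pd.DataFrame({'coefficient': coef,
                           'odds_ratio': np.exp(coef)},
                          index=feature_names)
print(coef_table.sort_values('coefficient'))
```

In the notebook itself the same table could be built from `classifier.coef_[0]` and `X.columns`. The `airbag` coefficient is exactly zero because that column is constant in this subset (its standard deviation is 0 in the summary table), so it carries no information for the model.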

In [37]:
# Make predictions:
y_pred = classifier.predict(X_test)
results = pd.DataFrame({"Prediction": y_pred, "Actual": y_test}).reset_index(drop=True)
results.head(10)

Unnamed: 0,Prediction,Actual
0,1,1
1,1,1
2,1,1
3,1,0
4,1,1
5,1,1
6,1,1
7,1,1
8,1,1
9,1,1


In [38]:
# Determine prediction accuracy:
print(accuracy_score(y_test, y_pred))

0.8634520054078414


In [42]:
confusion_matrix(y_test, y_pred)


array([[  33,  287],
       [  16, 1883]])
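The headline accuracy can be recomputed from the confusion matrix, which also shows how it compares with always predicting the larger class. This is a worked sketch using only the four counts printed above.

```python
import numpy as np

# Confusion matrix printed above: rows are actual classes, columns are predictions.
cm = np.array([[33, 287],
               [16, 1883]])

total = cm.sum()
accuracy = np.trace(cm) / total                      # correct / all predictions
recall_per_class = np.diag(cm) / cm.sum(axis=1)      # share of each actual class found
precision_per_class = np.diag(cm) / cm.sum(axis=0)   # share of each prediction that is right
balanced_accuracy = recall_per_class.mean()          # mean of the per-class recalls
majority_baseline = cm.sum(axis=1).max() / total     # always predict the larger class

print(f'accuracy           {accuracy:.4f}')
print(f'majority baseline  {majority_baseline:.4f}')
print(f'recall per class   {recall_per_class.round(4)}')
print(f'balanced accuracy  {balanced_accuracy:.4f}')
```

The accuracy is less than one percentage point above the majority-class baseline, and only about 10% of the class-0 cases are identified, so accuracy alone overstates how well this model separates the two classes.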

In [44]:
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.67      0.10      0.18       320
           1       0.87      0.99      0.93      1899

    accuracy                           0.86      2219
   macro avg       0.77      0.55      0.55      2219
weighted avg       0.84      0.86      0.82      2219



In [42]:
# Set up to predict seatbelt usage instead of est_impact_kmh with Logistic Regression:
crash_v6.head()

Unnamed: 0,est_impact_kmh,occupant_status,airbag,seatbelt,occupant_age,vehicle_year,airbag_deployment,occupant_role,deployment
2,1,0,0,0,72,1995.0,0,0,1
4,1,0,0,0,53,1995.0,0,0,1
13,1,0,0,0,67,1991.0,0,0,1
21,1,0,0,1,20,1995.0,0,0,1
25,1,0,0,0,23,1995.0,0,0,1


# Create a second Logistic Regression Model

In [49]:
# Change the target from est_impact_kmh to seatbelt:
y = crash_v6['seatbelt']
X = crash_v6.drop(columns='seatbelt')

In [50]:
# Split data into training & testing:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)
X_train.shape

(6656, 8)

In [51]:
X.describe()

Unnamed: 0,est_impact_kmh,occupant_status,airbag,occupant_age,vehicle_year,airbag_deployment,occupant_role,deployment
count,8875.0,8875.0,8875.0,8875.0,8875.0,8875.0,8875.0,8875.0
mean,0.855887,0.024338,0.0,37.348845,1996.728,0.205296,0.168113,0.794704
std,0.351224,0.154105,0.0,17.347832,2.778653,0.40394,0.373987,0.40394
min,0.0,0.0,0.0,16.0,1986.0,0.0,0.0,0.0
25%,1.0,0.0,0.0,23.0,1995.0,0.0,0.0,1.0
50%,1.0,0.0,0.0,33.0,1997.0,0.0,0.0,1.0
75%,1.0,0.0,0.0,48.0,1999.0,0.0,0.0,1.0
max,1.0,1.0,0.0,97.0,2003.0,1.0,1.0,1.0
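In the summary above, `airbag` has zero variance, and the mean of `deployment` equals one minus the mean of `airbag_deployment` with the same standard deviation, which suggests the two columns are complements. A sketch of how such uninformative or redundant columns might be detected, using made-up rows with the same column names:

```python
import pandas as pd

# Toy rows; the values are made up to mirror the pattern in the summary table.
toy = pd.DataFrame({
    'airbag': [0, 0, 0, 0],               # constant column
    'airbag_deployment': [0, 1, 0, 0],
    'deployment': [1, 0, 1, 1],           # 1 - airbag_deployment
    'occupant_age': [72, 53, 67, 20],
})

# Columns with a single distinct value cannot help a model.
constant_cols = [c for c in toy.columns if toy[c].nunique() <= 1]

# A correlation of -1 or +1 flags a column that duplicates another.
corr = toy.drop(columns=constant_cols).corr()
is_complement = bool((toy['airbag_deployment'] + toy['deployment'] == 1).all())

print(constant_cols)
print(corr.loc['airbag_deployment', 'deployment'])
print(is_complement)
```

Dropping one column from each redundant pair keeps the coefficients of a logistic regression easier to interpret.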


In [52]:
classifier = LogisticRegression(solver='lbfgs',
                                max_iter=200,
                                random_state=1)

In [53]:
# Fit (train) the model using the training data:
classifier.fit(X_train, y_train)

LogisticRegression(max_iter=200, random_state=1)

In [54]:
# Make predictions:
y_pred = classifier.predict(X_test)
results = pd.DataFrame({"Prediction": y_pred, "Actual": y_test}).reset_index(drop=True)
results.head(10)

Unnamed: 0,Prediction,Actual
0,0,0
1,0,0
2,0,0
3,0,0
4,0,0
5,0,0
6,0,1
7,0,0
8,0,1
9,0,0


In [56]:
# Determine prediction accuracy with seatbelts as the target:
print(accuracy_score(y_test, y_pred))

0.7715187021180712
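To put this accuracy in context, it can be compared with the share of the larger seatbelt class. The sketch below uses the `seatbelt` mean from the first model's `X.describe()` output (0.233239 over all 8,875 rows); because the split is stratified, the test-set share should be nearly the same, but that is an approximation rather than a value printed in the notebook.

```python
reported_accuracy = 0.7715187021180712   # accuracy printed above
seatbelt_mean = 0.233239                 # share of rows encoded as 1 (from X.describe())

# Accuracy obtained by always predicting the more common class.
majority_share = max(seatbelt_mean, 1 - seatbelt_mean)
lift = reported_accuracy - majority_share

print(f'majority-class share: {majority_share:.4f}')
print(f'lift over baseline:   {lift:.4f}')
```

The lift is under half a percentage point, so these features appear to say little about seatbelt use beyond the class proportions themselves.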
