# Airbags and Other Influences on Accident Fatalities
##### Description:
* US data, for 1997-2002, from police-reported car crashes in which there is a harmful event (people or property),<br> and from which at least one vehicle was towed. Data are restricted to front-seat occupants, include only a subset<br> of the variables recorded, and are restricted in other ways also.
##### dvcat
* Ordered factor with levels (estimated impact speeds) 1-9km/h, 10-24, 25-39, 40-54, 55+
##### weight
* Observation weights, albeit of uncertain accuracy, designed to account for varying sampling probabilities.
##### dead
* Factor with levels alive dead
##### airbag
* A factor with levels none airbag
##### seatbelt
* A factor with levels none belted
##### frontal
* A numeric vector; 0 = non-frontal, 1 = frontal impact
##### sex
* A factor with levels f m
##### ageOFocc
* Age of occupant in years
##### yearacc
* Year of accident
##### yearVeh
* Year of model of vehicle; a numeric vector
##### abcat
* Did one or more (driver or passenger) airbag(s) deploy? This factor has levels deploy nodeploy unavail
##### occRole
* A factor with levels driver pass
##### deploy
* A numeric vector: 0 if an airbag was unavailable or did not deploy; 1 if one or more bags deployed.
##### injSeverity
* A numeric vector: <br>0=None, 1=Possible Injury, 2=No Incapacity, 3=Incapacity, 4=Killed, 5=Unknown, 6=Prior Death
##### caseid
* A character string created by pasting together the population sampling unit, the case number, and the vehicle number.<br> Within each year, use this to uniquely identify the vehicle.



# Import Dependencies / Machine Learning

In [38]:
# Suppress all Python warnings (including deprecation warnings):
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

In [59]:
from pathlib import Path
from collections import Counter

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Import Dataset

In [3]:
# Import the dataset from Google Drive:
url = ('https://drive.google.com/file/d/1t3Z8Blgy2BPmBB4FqrQkC_jie9IwYuQb/view?usp=sharing')
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
crash_df = pd.read_csv(path,index_col=0)
crash_df.head()

Unnamed: 0,dvcat,weight,dead,airbag,seatbelt,frontal,sex,ageOFocc,yearacc,yearVeh,abcat,occRole,deploy,injSeverity,caseid
1,25-39,25.069,alive,none,belted,1,f,26,1997,1990.0,unavail,driver,0,3.0,2:3:1
2,10-24,25.069,alive,airbag,belted,1,f,72,1997,1995.0,deploy,driver,1,1.0,2:3:2
3,10-24,32.379,alive,none,none,1,f,69,1997,1988.0,unavail,driver,0,4.0,2:5:1
4,25-39,495.444,alive,airbag,belted,1,f,53,1997,1995.0,deploy,driver,1,1.0,2:10:1
5,25-39,25.069,alive,none,belted,1,f,32,1997,1988.0,unavail,driver,0,3.0,2:11:1
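The import cell rewrites a Google Drive share link into a direct-download link by picking out the file id with `url.split('/')[-2]`. A minimal sketch of the same idea as a reusable helper follows; the function name `drive_share_to_download` is hypothetical, and it assumes the share link has the `.../file/d/<id>/view` shape used above.

```python
def drive_share_to_download(share_url: str) -> str:
    """Turn a Google Drive 'file/d/<id>/view' share link into a direct-download URL."""
    parts = share_url.split('/')
    # The file id is the path segment that follows the literal 'd' segment.
    file_id = parts[parts.index('d') + 1]
    return f'https://drive.google.com/uc?export=download&id={file_id}'


share = 'https://drive.google.com/file/d/1t3Z8Blgy2BPmBB4FqrQkC_jie9IwYuQb/view?usp=sharing'
direct = drive_share_to_download(share)
print(direct)
```

Locating the id relative to the `d` segment, rather than counting from the end of the URL, keeps the helper working if extra path segments are appended after `view`.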


# Initialize Dataset for Analysis

In [4]:
# Remove unneeded columns:
crash_v1 = crash_df.drop(['weight','injSeverity','yearacc','caseid','sex'], axis=1)

# Rename the columns so they are easier to understand:
crash_v1.rename(columns={'dvcat':'est_impact_kmh',
                         'frontal':'front_impact',
                         'ageOFocc':'occupant_age',
                         'yearVeh':'vehicle_year',
                         'abcat':'airbag_deployment',
                         'occRole':'occupant_role',
                         'deploy':'deployment',
                         'dead':'occupant_status'},inplace=True)

crash_v1.head()

Unnamed: 0,est_impact_kmh,occupant_status,airbag,seatbelt,front_impact,occupant_age,vehicle_year,airbag_deployment,occupant_role,deployment
1,25-39,alive,none,belted,1,26,1990.0,unavail,driver,0
2,10-24,alive,airbag,belted,1,72,1995.0,deploy,driver,1
3,10-24,alive,none,none,1,69,1988.0,unavail,driver,0
4,25-39,alive,airbag,belted,1,53,1995.0,deploy,driver,1
5,25-39,alive,none,belted,1,32,1988.0,unavail,driver,0


In [5]:
# Check the dataset for any null values:
for column in crash_v1.columns:
    print(f'Column {column} has {crash_v1[column].isnull().sum()}\
    null values')    

Column est_impact_kmh has 0    null values
Column occupant_status has 0    null values
Column airbag has 0    null values
Column seatbelt has 0    null values
Column front_impact has 0    null values
Column occupant_age has 0    null values
Column vehicle_year has 1    null values
Column airbag_deployment has 0    null values
Column occupant_role has 0    null values
Column deployment has 0    null values


In [6]:
# Drop the null row (copy so later column assignments act on an independent DataFrame):
crash_v2 = crash_v1.dropna().copy()
for column in crash_v2.columns:
    print(f'Column {column} has {crash_v2[column].isnull().sum()}\
    null values')

Column est_impact_kmh has 0    null values
Column occupant_status has 0    null values
Column airbag has 0    null values
Column seatbelt has 0    null values
Column front_impact has 0    null values
Column occupant_age has 0    null values
Column vehicle_year has 0    null values
Column airbag_deployment has 0    null values
Column occupant_role has 0    null values
Column deployment has 0    null values
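The per-column loop above can also be written as a single vectorized call. The sketch below uses a small made-up DataFrame (`toy` is not the crash data) to show `isnull().sum()` returning all the counts at once, and confirms how many rows `dropna()` removes.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the crash data: one missing vehicle_year.
toy = pd.DataFrame({'occupant_age': [26, 72, 69],
                    'vehicle_year': [1990.0, np.nan, 1988.0]})

# One Series with the null count of every column.
null_counts = toy.isnull().sum()
print(null_counts)

# Drop rows containing any null and report how many were removed.
cleaned = toy.dropna()
print(len(toy) - len(cleaned), 'row(s) dropped')
```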


In [7]:
# Print out the est_impact_kmh value counts:
impact = crash_v2.est_impact_kmh.value_counts()
impact

10-24      12847
25-39       8214
40-54       2977
55+         1492
1-9km/h      686
Name: est_impact_kmh, dtype: int64

In [8]:
# Rename categories to form 'Under40' and 'Over40' groups:
U = 'Under40'
O = 'Over40'

crash_v2['est_impact_kmh'] = crash_v2['est_impact_kmh'].replace({'1-9km/h':U,'10-24':U,'25-39':U})
crash_v2['est_impact_kmh'] = crash_v2['est_impact_kmh'].replace({'40-54':O,'55+':O})

# Print out the est_impact_kmh value counts:
impact = crash_v2.est_impact_kmh.value_counts()
impact

Under40    21747
Over40      4469
Name: est_impact_kmh, dtype: int64
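The two `replace` calls collapse five speed bands into two groups. One way to express that, sketched here on a made-up Series, is a single dictionary that lists every original level, applied with `map`. Unlike `replace`, `map` turns any level missing from the dictionary into `NaN`, which makes an overlooked category easy to spot. The names `speed_group` and `toy` are illustrative only.

```python
import pandas as pd

# Every original dvcat level and the group it belongs to.
speed_group = {'1-9km/h': 'Under40', '10-24': 'Under40', '25-39': 'Under40',
               '40-54': 'Over40', '55+': 'Over40'}

# Made-up sample containing each level once.
toy = pd.Series(['10-24', '55+', '25-39', '1-9km/h', '40-54'], name='est_impact_kmh')

grouped = toy.map(speed_group)

# Any level not covered by the dictionary would show up as NaN here.
print('unmapped values:', int(grouped.isna().sum()))
print(grouped.value_counts())
```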

In [9]:
crash_v2.head()

Unnamed: 0,est_impact_kmh,occupant_status,airbag,seatbelt,front_impact,occupant_age,vehicle_year,airbag_deployment,occupant_role,deployment
1,Under40,alive,none,belted,1,26,1990.0,unavail,driver,0
2,Under40,alive,airbag,belted,1,72,1995.0,deploy,driver,1
3,Under40,alive,none,none,1,69,1988.0,unavail,driver,0
4,Under40,alive,airbag,belted,1,53,1995.0,deploy,driver,1
5,Under40,alive,none,belted,1,32,1988.0,unavail,driver,0


In [10]:
# Print out the occupant_status value counts:
survive = crash_v2.occupant_status.value_counts()
survive

alive    25036
dead      1180
Name: occupant_status, dtype: int64

In [11]:
# Print out the airbag value counts:
airbag = crash_v2.airbag.value_counts()
airbag

airbag    14418
none      11798
Name: airbag, dtype: int64

In [12]:
# Change the values to reflect installed or not installed:
crash_v2['airbag'] = crash_v2['airbag'].replace({'airbag':'installed','none':'not installed'})
installed = crash_v2.airbag.value_counts()
installed

installed        14418
not installed    11798
Name: airbag, dtype: int64

In [13]:
# Print out the seatbelt value counts:
seatbelt = crash_v2.seatbelt.value_counts()
seatbelt

belted    18572
none       7644
Name: seatbelt, dtype: int64

In [14]:
# Rename 'none' to 'not_belted' so the value is self-explanatory:
crash_v2['seatbelt'] = crash_v2['seatbelt'].replace({'none':'not_belted'})
belted = crash_v2.seatbelt.value_counts()
belted

belted        18572
not_belted     7644
Name: seatbelt, dtype: int64

In [15]:
# Print out the front_impact value counts:
front = crash_v2.front_impact.value_counts()
front

1    16865
0     9351
Name: front_impact, dtype: int64

In [26]:
# Print out the vehicle_year value counts: 
year = crash_v2.vehicle_year.value_counts()
year

1995.0    2037
1997.0    1895
1994.0    1842
1998.0    1821
1996.0    1820
1993.0    1633
1999.0    1588
1992.0    1420
1991.0    1412
1989.0    1358
1990.0    1320
2000.0    1265
1988.0    1247
1987.0    1026
1986.0     908
1985.0     712
2001.0     710
1984.0     524
2002.0     367
1983.0     270
1982.0     192
1981.0     145
1979.0     129
1978.0     123
1980.0     104
1977.0      60
1976.0      37
1973.0      34
2003.0      31
1975.0      28
1974.0      25
1972.0      23
1969.0      23
1970.0      17
1971.0      17
1966.0      17
1968.0      13
1967.0       9
1963.0       4
1965.0       4
1956.0       2
1961.0       1
1964.0       1
1953.0       1
1959.0       1
Name: vehicle_year, dtype: int64

In [27]:
# Print out the airbag_deployment value counts:
deploy = crash_v2.airbag_deployment.value_counts()
deploy

unavail     11798
deploy       8835
nodeploy     5583
Name: airbag_deployment, dtype: int64

In [30]:
# Remove all crashes where the airbags were not available:
unavailable = crash_v2[crash_v2['airbag_deployment'] == 'unavail'].index
crash_v2.drop(unavailable, inplace=True)
crash_v2.head()

Unnamed: 0,est_impact_kmh,occupant_status,airbag,seatbelt,front_impact,occupant_age,vehicle_year,airbag_deployment,occupant_role,deployment
2,Under40,alive,installed,belted,1,72,1995.0,deploy,driver,1
4,Under40,alive,installed,belted,1,53,1995.0,deploy,driver,1
13,Under40,alive,installed,belted,1,67,1991.0,deploy,driver,1
14,Under40,dead,installed,belted,0,54,1994.0,nodeploy,driver,0
19,Under40,alive,installed,belted,0,33,1995.0,nodeploy,driver,0


In [33]:
# Verify 'unavail' has been removed from the airbag_deployment column: 
unavail_removed = crash_v2.airbag_deployment.value_counts()
unavail_removed

deploy      8835
nodeploy    5583
Name: airbag_deployment, dtype: int64
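Dropping rows in place by index works, but it permanently changes `crash_v2`, so re-running earlier cells out of order can give different results. A possible alternative, sketched on a made-up DataFrame, is to filter with a boolean mask and keep the original untouched.

```python
import pandas as pd

# Made-up stand-in for the crash data.
toy = pd.DataFrame({'airbag_deployment': ['unavail', 'deploy', 'nodeploy', 'unavail'],
                    'occupant_age': [26, 72, 54, 32]})

# Keep only rows where an airbag was available; .copy() gives an independent DataFrame.
kept = toy[toy['airbag_deployment'] != 'unavail'].copy()

print(kept['airbag_deployment'].value_counts())
print('original rows:', len(toy), '| kept rows:', len(kept))
```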

In [32]:
# Print out the occupant_role value counts: 
occupant = crash_v2.occupant_role.value_counts()
occupant

driver    11729
pass       2689
Name: occupant_role, dtype: int64

In [35]:
# Create a new DataFrame with just vehicle model years 1990 and newer:
crash_v3 = crash_v2.loc[crash_v2['vehicle_year'] >= 1990]
crash_v3.head()

Unnamed: 0,est_impact_kmh,occupant_status,airbag,seatbelt,front_impact,occupant_age,vehicle_year,airbag_deployment,occupant_role,deployment
2,Under40,alive,installed,belted,1,72,1995.0,deploy,driver,1
4,Under40,alive,installed,belted,1,53,1995.0,deploy,driver,1
13,Under40,alive,installed,belted,1,67,1991.0,deploy,driver,1
14,Under40,dead,installed,belted,0,54,1994.0,nodeploy,driver,0
19,Under40,alive,installed,belted,0,33,1995.0,nodeploy,driver,0


In [36]:
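# Create a new DataFrame with just front-impact crashes.
# Note: this filters crash_v2, not crash_v3, so the 1990+ model-year filter is not applied downstream: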
crash_v4 = crash_v2.loc[crash_v2['front_impact'] == 1]
crash_v4.head()

Unnamed: 0,est_impact_kmh,occupant_status,airbag,seatbelt,front_impact,occupant_age,vehicle_year,airbag_deployment,occupant_role,deployment
2,Under40,alive,installed,belted,1,72,1995.0,deploy,driver,1
4,Under40,alive,installed,belted,1,53,1995.0,deploy,driver,1
13,Under40,alive,installed,belted,1,67,1991.0,deploy,driver,1
21,Under40,alive,installed,not_belted,1,20,1995.0,deploy,driver,1
25,Under40,alive,installed,belted,1,23,1995.0,deploy,driver,1


In [45]:
# Delete the front_impact column:
crash_v5 = crash_v4.drop(['front_impact'], axis=1)
crash_v5.head()

Unnamed: 0,est_impact_kmh,occupant_status,airbag,seatbelt,occupant_age,vehicle_year,airbag_deployment,occupant_role,deployment
2,Under40,alive,installed,belted,72,1995.0,deploy,driver,1
4,Under40,alive,installed,belted,53,1995.0,deploy,driver,1
13,Under40,alive,installed,belted,67,1991.0,deploy,driver,1
21,Under40,alive,installed,not_belted,20,1995.0,deploy,driver,1
25,Under40,alive,installed,belted,23,1995.0,deploy,driver,1


In [46]:
# Print out the occupant_status value counts:
survive2 = crash_v5.occupant_status.value_counts()
survive2

alive    8659
dead      216
Name: occupant_status, dtype: int64

# Integer Encoding

In [47]:
le = LabelEncoder()

crash_v6 = crash_v5.copy()
crash_v6['est_impact_kmh'] = le.fit_transform(crash_v6['est_impact_kmh']) 
crash_v6['occupant_status'] = le.fit_transform(crash_v6['occupant_status'])
crash_v6['airbag'] = le.fit_transform(crash_v6['airbag'])
crash_v6['seatbelt'] = le.fit_transform(crash_v6['seatbelt'])
crash_v6['airbag_deployment'] = le.fit_transform(crash_v6['airbag_deployment'])
crash_v6['occupant_role'] = le.fit_transform(crash_v6['occupant_role'])

crash_v6.head()

Unnamed: 0,est_impact_kmh,occupant_status,airbag,seatbelt,occupant_age,vehicle_year,airbag_deployment,occupant_role,deployment
2,1,0,0,0,72,1995.0,0,0,1
4,1,0,0,0,53,1995.0,0,0,1
13,1,0,0,0,67,1991.0,0,0,1
21,1,0,0,1,20,1995.0,0,0,1
25,1,0,0,0,23,1995.0,0,0,1
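`LabelEncoder` assigns integer codes in sorted order of the labels, which is why the `Under40` rows above show `1` rather than `0`. Because the notebook reuses one encoder for every column, only the last fit is retained. The sketch below, using a few of the label sets from this notebook, keeps one encoder per column so the code-to-label mapping can be inspected later; the `encoders` dictionary is an illustrative addition, not part of the original workflow.

```python
from sklearn.preprocessing import LabelEncoder

# Label sets taken from the columns encoded above.
columns = {
    'est_impact_kmh': ['Under40', 'Over40'],
    'occupant_status': ['alive', 'dead'],
    'seatbelt': ['belted', 'not_belted'],
}

encoders = {}
for name, values in columns.items():
    enc = LabelEncoder().fit(values)
    encoders[name] = enc
    # classes_ is sorted; the position of each label is its integer code.
    mapping = {label: code for code, label in enumerate(enc.classes_)}
    print(name, mapping)
```

With this mapping, `occupant_status == 1` means `dead`, so positive model coefficients point toward a higher predicted probability of death.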


In [50]:
# Separate the features (X) from the target (y):
y = crash_v6['occupant_status']
X = crash_v6.drop(columns='occupant_status')

In [53]:
# Split data into training & testing:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)
X_train.shape

(6656, 8)
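The split uses the default 75/25 proportions (6,656 of the 8,875 rows go to training) and `stratify=y`, which keeps the rare `dead` class at roughly the same share in both parts. A small sketch on made-up labels shows that behavior; `X_toy` and `y_toy` are synthetic, with 10% positives.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 90 negatives and 10 positives.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=1, stratify=y_toy)

# Default test_size is 0.25, so 75 rows train and 25 rows test.
print('train size:', len(y_tr), '| test size:', len(y_te))

# With stratification, the positive share stays close to 10% in both parts.
print('train positive share:', y_tr.mean())
print('test positive share:', y_te.mean())
```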

# Split Data Into Training & Testing

In [None]:
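# Create our features (one-hot encoded alternative to the integer encoding above).
# Column names are lowercase after the earlier rename, so use 'occupant_status':
X = crash_v2.drop(columns='occupant_status')
X = pd.get_dummies(X)

# Create our target
target = ['occupant_status']
y = crash_v2.loc[:, target].copy()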

In [None]:
X.describe()

# Create a Logistic Regression Model

In [55]:
classifier = LogisticRegression(solver='lbfgs',
                                max_iter=200,
                                random_state=1)

In [56]:
# Fit (train) our model using the training data:
classifier.fit(X_train, y_train)

LogisticRegression(max_iter=200, random_state=1)

In [57]:
classifier.coef_

array([[-2.30735766e+00,  0.00000000e+00,  1.69406226e+00,
         3.51742939e-02, -2.26593285e-03, -1.26189069e-01,
         1.30697203e-01,  1.23702257e-01]])
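The raw coefficient array is easier to read when paired with feature names and converted to odds ratios via `exp(coef)`. The sketch below copies the coefficients printed above and assumes the feature order matches the columns of `crash_v6` with `occupant_status` removed. Because the features are unscaled and some overlap (`airbag_deployment` and `deployment` carry the same information), the values are best treated as rough indications, not causal effects.

```python
import numpy as np
import pandas as pd

# Assumed order: crash_v6 columns without the target.
feature_names = ['est_impact_kmh', 'airbag', 'seatbelt', 'occupant_age',
                 'vehicle_year', 'airbag_deployment', 'occupant_role', 'deployment']

# Coefficients copied from the classifier.coef_ output above.
coef = np.array([-2.30735766e+00, 0.00000000e+00, 1.69406226e+00, 3.51742939e-02,
                 -2.26593285e-03, -1.26189069e-01, 1.30697203e-01, 1.23702257e-01])

# exp(coef) is the multiplicative change in the odds of class 1 per unit increase.
summary = pd.DataFrame({'coefficient': coef, 'odds_ratio': np.exp(coef)},
                       index=feature_names)
print(summary.round(3))
```

The zero coefficient for `airbag` is consistent with that column being constant: once the `unavail` rows are removed, every remaining row is `installed`, so the column carries no information.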

In [58]:
# Make predictions:
y_pred = classifier.predict(X_test)
results = pd.DataFrame({"Prediction": y_pred, "Actual": y_test}).reset_index(drop=True)
results.head(20)

Unnamed: 0,Prediction,Actual
0,0,0
1,0,1
2,0,0
3,0,0
4,0,0
5,0,0
6,0,0
7,0,0
8,0,0
9,0,0


In [60]:
# Determine prediction accuracy:
print(accuracy_score(y_test, y_pred))

0.9765660207300586
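An accuracy near 0.977 needs context: deaths make up only 216 of the 8,875 rows, so a model that always predicts `alive` would score about 0.976 on its own. Metrics that look at each class separately give a clearer picture. The sketch below uses made-up labels (95 negatives, 5 positives, and a model that never predicts the positive class) to show how the confusion matrix, recall, and balanced accuracy expose what plain accuracy hides.

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             confusion_matrix, recall_score)

# Made-up labels: 5% positives, and predictions that are always the majority class.
y_true_toy = [0] * 95 + [1] * 5
y_pred_toy = [0] * 100

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true_toy, y_pred_toy)
acc = accuracy_score(y_true_toy, y_pred_toy)

# Recall for class 1: share of actual positives that were found.
rec = recall_score(y_true_toy, y_pred_toy)

# Balanced accuracy: average of the recall for each class.
bal = balanced_accuracy_score(y_true_toy, y_pred_toy)

print(cm)
print(f'accuracy: {acc:.2f} | recall (class 1): {rec:.2f} | balanced accuracy: {bal:.2f}')
```

Running the same four calls on `y_test` and `y_pred` from this notebook would show how many of the actual deaths the classifier identifies. If that number is low, passing `class_weight='balanced'` to `LogisticRegression` is one option worth trying.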
