## 4.0 Feature selection

The aim of this notebook is to identify the features that best explain the label and to reduce the dimensionality of the dataset using a feature selection method, recursive feature elimination (RFE).

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

## 4.1.0 Preprocessing Data

1. Convert categorical variables into binary (one-hot) columns.
2. Encode the label as a number.
3. data is the full encoded dataset, including the 0/1 label.
4. X holds all encoded predictors.
5. y holds the encoded labels [NotFatal:0, Fatal:1].

In [2]:
data = pd.read_csv('datas/data_cleaned.csv')
data.head()

Unnamed: 0,REPORT_ID,UND_UNIT_NUMBER,CASUALTY_NUMBER,cas_type,cas_gender,cas_age,cas_pos_in_veh,thrown_out,fatality,seat_belt,...,time,area_speed,acc_pos,hor_align,ver_align,moist_cond,wea_cond,dayNight,crash_type,traf_ctrls
0,2017-1-15/08/2019,1,1,Driver,Female,34.0,Driver,Not Thrown Out,NotFatal,Worn,...,peak,60,T-Junction,Straight road,Level,Dry,Not Raining,Daylight,Rear End,No Control
1,2017-5-15/08/2019,2,1,Driver,Female,41.0,Driver,Not Thrown Out,NotFatal,Worn,...,peak,60,T-Junction,Straight road,Level,Dry,Not Raining,Daylight,Right Turn,No Control
2,2017-9-15/08/2019,1,1,Driver,Male,39.889159,Driver,Not Thrown Out,NotFatal,Worn,...,peak,60,Divided Road,Straight road,Level,Dry,Not Raining,Daylight,Right Angle,No Control
3,2017-10-15/08/2019,1,1,Driver,Male,19.0,Driver,Not Thrown Out,NotFatal,Worn,...,peak,60,Freeway,"CURVED, VIEW OPEN",Level,Dry,Not Raining,Daylight,Head On,No Control
4,2017-10-15/08/2019,2,1,Driver,Male,48.0,Driver,Not Thrown Out,NotFatal,Worn,...,peak,60,Freeway,"CURVED, VIEW OPEN",Level,Dry,Not Raining,Daylight,Head On,No Control


In [3]:
# Encode the label as integers: NotFatal -> 0, Fatal -> 1
data['fatality'] = data['fatality'].map({'NotFatal': 0, 'Fatal': 1}).astype(int)
# Set the identifier columns aside; they are not predictors
ids = data[['REPORT_ID','UND_UNIT_NUMBER','CASUALTY_NUMBER']]
data = data.drop(['REPORT_ID','UND_UNIT_NUMBER','CASUALTY_NUMBER'],axis=1)
y = data['fatality']
data = data.drop('fatality', axis=1)
# One-hot encode every object-typed (categorical) predictor
categorical = [col for col, dtype in zip(data.columns, data.dtypes) if dtype=='object']
data = pd.get_dummies(data,columns=categorical, prefix=categorical)
# Re-attach the label so that data is the full encoded dataset
data['fatality']=y
X = data.drop('fatality', axis=1)
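
The encoding steps listed above can be checked on a small example. This is a minimal sketch on a toy frame: the column names mirror the dataset, but the rows and the `_toy` variable names are made up for illustration.

```python
import pandas as pd

# Toy frame with hypothetical rows; column names mirror the dataset above.
toy = pd.DataFrame({
    "cas_age": [34.0, 19.0, 48.0],
    "cas_gender": ["Female", "Male", "Male"],
    "seat_belt": ["Worn", "NotWorn", "Worn"],
    "fatality": ["NotFatal", "Fatal", "NotFatal"],
})

# Encode the label as integers: NotFatal -> 0, Fatal -> 1.
y_toy = toy["fatality"].map({"NotFatal": 0, "Fatal": 1}).astype(int)

# One-hot encode every non-numeric predictor.
X_toy = toy.drop(columns="fatality")
categorical_toy = [
    col for col in X_toy.columns
    if not pd.api.types.is_numeric_dtype(X_toy[col])
]
X_toy = pd.get_dummies(X_toy, columns=categorical_toy, prefix=categorical_toy)

print(list(X_toy.columns))
print(y_toy.tolist())
```

Each categorical column becomes one binary column per category, which is why the full dataset ends up with far more columns than the raw file.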

In [4]:
for col in X.columns:
    print(col)

cas_age
n_occupants
total_units
cas_total
area_speed
cas_type_Driver
cas_type_Passenger
cas_type_Pedestrian
cas_type_Rider
cas_gender_Female
cas_gender_Male
cas_gender_Unknown
cas_pos_in_veh_Back of Enclosed Van
cas_pos_in_veh_Driver
cas_pos_in_veh_Front Seat Left Passenger
cas_pos_in_veh_Front Seat Middle Passenger
cas_pos_in_veh_NotApplicable
cas_pos_in_veh_Other
cas_pos_in_veh_Passenger of Motorcycle
cas_pos_in_veh_Passenger on Multi-Passenger Vehicle
cas_pos_in_veh_Rear Seat Left Passenger
cas_pos_in_veh_Rear Seat Middle Passenger
cas_pos_in_veh_Rear Seat Right Passenger
thrown_out_Not Thrown Out
thrown_out_NotApplicable
thrown_out_Thrown Out
seat_belt_NotWorn
seat_belt_Unknown
seat_belt_Worn
unit_type_Large
unit_type_MCycling
unit_type_Medium
unit_type_Other Defined Special Vehicle
unit_type_PCycling
unit_type_Pedestrian on Footpath/Carpark
unit_type_Pedestrian on Road
unit_type_Small
unit_type_Utility
unit_type_Wheelchair
lic_type_Disqualified
lic_type_Full
lic_type_Learners
lic_

## 4.1.1 Standardisation 

We standardise the predictors so that features measured on different scales are comparable when selecting the features that best explain fatality in road crash incidents.

In [5]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
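
The cell above fits the scaler on the full dataset. A common variant, sketched here on synthetic data, fits the scaler on the training split only so that test-set statistics do not influence the transform. The `_demo` names and the generated values are assumptions made for illustration, not part of this notebook's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the predictors (hypothetical values).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "cas_age": rng.uniform(16, 90, 200),
    "area_speed": rng.choice([50, 60, 80, 110], 200),
})
y_demo = rng.integers(0, 2, 200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=45, stratify=y_demo
)

# Learn mean and standard deviation from the training rows only,
# then apply the same transform to the held-out rows.
scaler_demo = StandardScaler()
X_tr_s = pd.DataFrame(
    scaler_demo.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index
)
X_te_s = pd.DataFrame(
    scaler_demo.transform(X_te), columns=X_te.columns, index=X_te.index
)

print(X_tr_s.mean().round(6).tolist())
```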

## 4.2.0 Feature selection

We aim to remove the least important predictors in order to improve the predictive performance of the model. In this section, we perform RFE with four different algorithms (decision tree classifier, random forest, logistic regression and RUSBoost classifier) to select the features that best explain the label.

In [6]:
# Labels are encoded NotFatal:0, Fatal:1
print('NotFatal', round(data['fatality'].value_counts()[0]/len(data)*100,2), '% of the dataset')
print('Fatal', round(data['fatality'].value_counts()[1]/len(data)*100,2), '% of the dataset')

NotFatal 98.54 % of the dataset
Fatal 1.46 % of the dataset
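
Because the label is encoded NotFatal:0, Fatal:1, naming the classes explicitly helps keep each printed percentage attached to the right class. A minimal sketch on a made-up label series (the 97/3 split is hypothetical, not the dataset's):

```python
import pandas as pd

# Hypothetical label column: 97 non-fatal and 3 fatal casualties.
labels = pd.Series([0] * 97 + [1] * 3, name="fatality")

# Share of each class in percent, indexed by class name rather than by 0/1.
share = (
    labels.value_counts(normalize=True)
    .rename({0: "NotFatal", 1: "Fatal"})
    .mul(100)
    .round(2)
)
print(share)
```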


## 4.2.1 RFE

RFE is used to select features. Four different models are each wrapped in RFE, and the results are compared using the feature ranking (the ranking_ attribute) that RFE produces for each model.
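
To show what RFE returns, here is a minimal sketch on synthetic data from `make_classification`; the dataset and the `_demo` names are assumptions for illustration. With `n_features_to_select=1` and `step=1`, RFE drops one feature per round, so `ranking_` orders every feature from 1 (kept to the end) to the number of features (dropped first).

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 8 features, 3 of them informative (hypothetical setup).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Eliminate one feature per round until a single feature remains.
rfe_demo = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=1, step=1)
rfe_demo.fit(X_demo, y_demo)

print(rfe_demo.ranking_)   # rank per feature, 1 = most important
print(rfe_demo.support_)   # True only for the single retained feature
```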

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=45,stratify=y,shuffle=True)
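
With so few fatal cases, `stratify=y` matters: it keeps the class proportions the same in both splits. A minimal sketch on a made-up label vector with a 2% positive rate (the numbers are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 20 positives out of 1000 rows.
y_demo = np.array([1] * 20 + [0] * 980)
X_demo = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=45, stratify=y_demo, shuffle=True
)

# Both splits should show roughly the same share of positives.
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```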

In [11]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score,make_scorer
from sklearn.model_selection import StratifiedKFold
from imblearn.ensemble import RUSBoostClassifier
classifiers = {
    # The L1 penalty needs a solver that supports it, e.g. liblinear
    "LogisiticRegression": LogisticRegression(penalty='l1', solver='liblinear'),
    "DecisionTreeClassifier": DecisionTreeClassifier(),
    'RandomForestClassifer':RandomForestClassifier(),
    'RUSBoostClassifier':RUSBoostClassifier(sampling_strategy='auto')
}
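# Wrap each model in RFE; eliminating down to one feature yields a full ranking
for key, classifier in classifiers.items():
    selector = RFE(classifier,step=1,n_features_to_select=1)
    selector = selector.fit(X_train, y_train)
    # Replace the unfitted model with its fitted selector
    classifiers[key]=selector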
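
The cell above imports `cohen_kappa_score`, `make_scorer` and `StratifiedKFold` without using them. One way they could be combined, shown here only as a hedged sketch on synthetic data, is cross-validated RFE (`RFECV`) scored with Cohen's kappa, which also suggests how many features to keep. This is an assumption about a possible extension, not the method used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced data (hypothetical): about 10% positives.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=4, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)

# Cohen's kappa corrects for chance agreement, which helps with imbalance.
kappa = make_scorer(cohen_kappa_score)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=45)

selector_cv = RFECV(LogisticRegression(max_iter=1000), step=1, cv=cv, scoring=kappa)
selector_cv.fit(X_demo, y_demo)

print(selector_cv.n_features_)  # number of features with the best CV score
```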

In [14]:
ranking = pd.DataFrame()
for key,classifier  in classifiers.items():
    ranking[key]=pd.Series(classifier.ranking_,index=X.columns).sort_values().index
# Rank 1 is the most important feature; derive the length instead of hard-coding it
ranking.index = range(1, len(ranking) + 1)
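
The conversion from `ranking_` to an ordered list of feature names can be seen on a small example. The ranking values and the `DemoModel` column name below are hypothetical.

```python
import numpy as np
import pandas as pd

feature_names = ["cas_age", "area_speed", "cas_total", "total_units"]
ranking_demo = np.array([1, 3, 2, 4])  # hypothetical RFE ranking_, 1 = best

# Sort the features by rank, keeping only their names in order.
ordered = pd.Series(ranking_demo, index=feature_names).sort_values().index.tolist()

# One column per model, indexed by rank position starting at 1.
table = pd.DataFrame({"DemoModel": ordered}, index=range(1, len(ordered) + 1))
print(table)
```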

## 4.2.2 Results

The following table shows the top 10 features ranked as most significant in explaining the label by each model.

In [16]:
ranking[:10]

Rank,LogisiticRegression,DecisionTreeClassifier,RandomForestClassifer,RUSBoostClassifier
1,seat_belt_NotWorn,cas_age,cas_age,cas_age
2,cas_pos_in_veh_Driver,cas_total,area_speed,area_speed
3,crash_type_Rear End,area_speed,cas_total,thrown_out_Not Thrown Out
4,area_speed,total_units,total_units,crash_type_Head On
5,cas_gender_Male,unit_type_Medium,n_occupants,crash_type_Hit Fixed Object
6,traf_ctrls_Roundabout,"hor_align_CURVED, VIEW OPEN",unit_type_Small,crash_type_Rear End
7,cas_age,n_occupants,"hor_align_CURVED, VIEW OPEN",dayNight_Daylight
8,moist_cond_Dry,time_peak,ver_align_Level,cas_gender_Male
9,wea_cond_Raining,stat_area_2 Metropolitan,thrown_out_Not Thrown Out,cas_total
10,ver_align_Bottom of Hill,ver_align_Slope,dayNight_Night,dayNight_Night


## 4.2.3 Discussion
The table above shows that all four models rank age and area speed among their top 10 features, as expected from the data visualisation notebook, while seat belt use (seat_belt_NotWorn) appears in the top 10 for logistic regression only. However, the rankings produced by the different models are inconsistent. This may be due to random noise, and the models have not yet been tuned for best performance. As such, no conclusion can be drawn on whether the regression or the tree-based algorithms give better predictability. We therefore recommend not removing any features, to avoid losing information. In the next notebook, data analysis, we aim to train and fit different models on the entire dataset and select the model with the best performance.
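
The agreement between the models can be checked directly from the top 10 lists in the results table. The sketch below copies those lists by hand and counts how many models include each feature; the variable names are made up for illustration.

```python
from collections import Counter

# Top 10 features per model, copied from the results table above.
top10 = {
    "LogisiticRegression": [
        "seat_belt_NotWorn", "cas_pos_in_veh_Driver", "crash_type_Rear End",
        "area_speed", "cas_gender_Male", "traf_ctrls_Roundabout", "cas_age",
        "moist_cond_Dry", "wea_cond_Raining", "ver_align_Bottom of Hill",
    ],
    "DecisionTreeClassifier": [
        "cas_age", "cas_total", "area_speed", "total_units", "unit_type_Medium",
        "hor_align_CURVED, VIEW OPEN", "n_occupants", "time_peak",
        "stat_area_2 Metropolitan", "ver_align_Slope",
    ],
    "RandomForestClassifer": [
        "cas_age", "area_speed", "cas_total", "total_units", "n_occupants",
        "unit_type_Small", "hor_align_CURVED, VIEW OPEN", "ver_align_Level",
        "thrown_out_Not Thrown Out", "dayNight_Night",
    ],
    "RUSBoostClassifier": [
        "cas_age", "area_speed", "thrown_out_Not Thrown Out", "crash_type_Head On",
        "crash_type_Hit Fixed Object", "crash_type_Rear End", "dayNight_Daylight",
        "cas_gender_Male", "cas_total", "dayNight_Night",
    ],
}

# Features that every model places in its top 10.
common = set.intersection(*(set(features) for features in top10.values()))

# For each feature, the number of models that place it in their top 10.
counts = Counter(feature for features in top10.values() for feature in features)

print(sorted(common))
print(counts.most_common(5))
```

Only two features are shared by all four lists, which illustrates how far the rankings diverge.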

In [9]:
data.to_csv('datas/final_dataset.csv',index=False)