## Project: Tabular dataset

The dataset, called FARS, is a collection of statistics on US road traffic accidents. The class label describes the severity of the accident. It has 29 features (30 columns including the class label) and over 100K examples. The dataset is available in Canvas as a CSV file, in which the last column contains the class labels: https://ncl.instructure.com/courses/53509/files/7652449/download?download_frd=1

Experiments on the tabular dataset are relatively fast. We evaluate a broad range of options for the design of the machine learning pipeline, including (but not limited to) data normalisation, feature/instance selection, class imbalance correction, several (appropriate) machine learning models, hyperparameter tuning and cross-validation evaluation.

### The code block below is used for importing required libraries

In [None]:
# importing required libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Read the dataset and store it in pandas dataframe

data = pd.read_csv("fars.csv")
data

Unnamed: 0,CASE_STATE,AGE,SEX,PERSON_TYPE,SEATING_POSITION,RESTRAINT_SYSTEM-USE,AIR_BAG_AVAILABILITY/DEPLOYMENT,EJECTION,EJECTION_PATH,EXTRICATION,...,DRUG_TEST_RESULTS_(2_of_3),DRUG_TEST_TYPE_(3_of_3),DRUG_TEST_RESULTS_(3_of_3),HISPANIC_ORIGIN,TAKEN_TO_HOSPITAL,RELATED_FACTOR_(1)-PERSON_LEVEL,RELATED_FACTOR_(2)-PERSON_LEVEL,RELATED_FACTOR_(3)-PERSON_LEVEL,RACE,INJURY_SEVERITY
0,Alabama,34,Male,Driver,Front_Seat_-_Left_Side_(Drivers_Side),None_Used/Not_Applicable,Air_Bag_Available_but_Not_Deployed_for_this_Seat,Totally_Ejected,Unknown,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Non-Hispanic,No,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,White,Fatal_Injury
1,Alabama,20,Male,Driver,Front_Seat_-_Left_Side_(Drivers_Side),None_Used/Not_Applicable,Deployed_Air_Bag_from_Front,Totally_Ejected,Unknown,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Non-Hispanic,No,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,White,Fatal_Injury
2,Alabama,43,Male,Driver,Front_Seat_-_Left_Side_(Drivers_Side),Lap_and_Shoulder_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Extricated,...,0,Not_Tested_for_Drugs,0,Non-Hispanic,No,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Black,Fatal_Injury
3,Alabama,38,Female,Passenger_of_a_Motor_Vehicle_in_Transport,Front_Seat_-_Right_Side,Lap_and_Shoulder_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Extricated,...,0,Not_Tested_for_Drugs,0,Not_a_Fatality_(Not_Applicable),Yes,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_a_Fatality_(Not_Applicable),Incapaciting_Injury
4,Alabama,50,Male,Driver,Front_Seat_-_Left_Side_(Drivers_Side),Lap_and_Shoulder_Belt,Deployed_Air_Bag_from_Front,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Non-Hispanic,Yes,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Black,Fatal_Injury
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100963,Wyoming,10,Female,Passenger_of_a_Motor_Vehicle_in_Transport,Second_Seat_-_Left_Side,Lap_and_Shoulder_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Not_a_Fatality_(Not_Applicable),Yes,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_a_Fatality_(Not_Applicable),Possible_Injury
100964,Wyoming,9,Female,Passenger_of_a_Motor_Vehicle_in_Transport,Second_Seat_-_Right_Side,Lap_and_Shoulder_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Not_a_Fatality_(Not_Applicable),Yes,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_a_Fatality_(Not_Applicable),Possible_Injury
100965,Wyoming,7,Female,Passenger_of_a_Motor_Vehicle_in_Transport,Second_Seat_-_Middle,Lap_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Not_a_Fatality_(Not_Applicable),Yes,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_a_Fatality_(Not_Applicable),Possible_Injury
100966,Wyoming,4,Female,Passenger_of_a_Motor_Vehicle_in_Transport,Second_Seat_-_Middle,Lap_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Not_a_Fatality_(Not_Applicable),Yes,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_a_Fatality_(Not_Applicable),Possible_Injury


## Let us check the data types of columns present in the dataset
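The cell for this step is not shown here, so the following is only a minimal sketch of how column types could be inspected with pandas. The miniature frame `toy` and its values are hypothetical stand-ins for `data`; on the real dataset the same calls would be made on `data`.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the FARS data
toy = pd.DataFrame({
    "CASE_STATE": ["Alabama", "Wyoming", "Alabama"],
    "AGE": [34, 10, 99],
    "SEX": ["Male", "Female", "Unknown"],
})

# Data type of every column (text columns are non-numeric, AGE is an integer)
print(toy.dtypes)

# Split the column names into numeric and non-numeric groups
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print("Numeric:", numeric_cols)
print("Non-numeric:", categorical_cols)
```

Knowing which columns are non-numeric up front makes it easier to decide which ones need encoding later.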



## Checking if there are any missing values present in the dataset



In [None]:
print(data.isnull())

noofmissingvalues = data.isnull().sum()
print(noofmissingvalues)

        CASE_STATE    AGE    SEX  PERSON_TYPE  SEATING_POSITION  \
0            False  False  False        False             False   
1            False  False  False        False             False   
2            False  False  False        False             False   
3            False  False  False        False             False   
4            False  False  False        False             False   
...            ...    ...    ...          ...               ...   
100963       False  False  False        False             False   
100964       False  False  False        False             False   
100965       False  False  False        False             False   
100966       False  False  False        False             False   
100967       False  False  False        False             False   

        RESTRAINT_SYSTEM-USE  AIR_BAG_AVAILABILITY/DEPLOYMENT  EJECTION  \
0                      False                            False     False   
1                      False                 

### The output above shows that there are no missing values in the dataset: `data.isnull()` returns "False" for every cell, and the number of missing values for each column is 0.
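Printing the full boolean frame is hard to read for 100K rows, so a more compact check may be useful. The sketch below uses a hypothetical miniature frame `toy`. It also counts the placeholder category "Unknown", which appears in columns such as SEX and EJECTION_PATH and which `isnull()` does not treat as missing. Whether such placeholders should count as missing is a modelling decision, not something the dataset settles.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the FARS data
toy = pd.DataFrame({
    "AGE": [34, 99, 20],
    "SEX": ["Male", "Unknown", "Female"],
})

# Total number of NaN cells in the whole frame; 0 means nothing is missing
total_missing = int(toy.isnull().sum().sum())
print("Total missing cells:", total_missing)

# Placeholder categories are ordinary strings, so isnull() does not count them
unknown_per_column = (toy == "Unknown").sum()
print(unknown_per_column)
```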

## Checking if any duplicates are present in the dataset

In [None]:
# checking duplicate data present in our provided dataset

duplicate_data = data[data.duplicated()]
duplicate_data

Unnamed: 0,CASE_STATE,AGE,SEX,PERSON_TYPE,SEATING_POSITION,RESTRAINT_SYSTEM-USE,AIR_BAG_AVAILABILITY/DEPLOYMENT,EJECTION,EJECTION_PATH,EXTRICATION,...,DRUG_TEST_RESULTS_(2_of_3),DRUG_TEST_TYPE_(3_of_3),DRUG_TEST_RESULTS_(3_of_3),HISPANIC_ORIGIN,TAKEN_TO_HOSPITAL,RELATED_FACTOR_(1)-PERSON_LEVEL,RELATED_FACTOR_(2)-PERSON_LEVEL,RELATED_FACTOR_(3)-PERSON_LEVEL,RACE,INJURY_SEVERITY
69,Alabama,1,Male,Passenger_of_a_Motor_Vehicle_in_Transport,Second_Seat_-_Right_Side,Child_Safety_Seat,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Not_a_Fatality_(Not_Applicable),Yes,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_a_Fatality_(Not_Applicable),Incapaciting_Injury
126,Alabama,99,Unknown,Passenger_of_a_Motor_Vehicle_in_Transport,Front_Seat_-_Right_Side,Lap_and_Shoulder_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Not_a_Fatality_(Not_Applicable),No,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_a_Fatality_(Not_Applicable),No_Injury
170,Alabama,99,Unknown,Passenger_of_a_Motor_Vehicle_in_Transport,Front_Seat_-_Right_Side,Lap_and_Shoulder_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Not_a_Fatality_(Not_Applicable),No,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_a_Fatality_(Not_Applicable),No_Injury
173,Alabama,32,Female,Driver,Front_Seat_-_Left_Side_(Drivers_Side),Lap_and_Shoulder_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Not_a_Fatality_(Not_Applicable),Yes,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_a_Fatality_(Not_Applicable),Incapaciting_Injury
225,Alabama,1,Male,Passenger_of_a_Motor_Vehicle_in_Transport,Second_Seat_-_Right_Side,Child_Safety_Seat,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Not_a_Fatality_(Not_Applicable),Yes,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_a_Fatality_(Not_Applicable),Incapaciting_Injury
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100539,Wisconsin,20,Male,Driver,Front_Seat_-_Left_Side_(Drivers_Side),Lap_and_Shoulder_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Not_a_Fatality_(Not_Applicable),No,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_a_Fatality_(Not_Applicable),No_Injury
100542,Wisconsin,32,Female,Driver,Front_Seat_-_Left_Side_(Drivers_Side),Lap_and_Shoulder_Belt,Deployed_Air_Bag_from_Front,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Not_a_Fatality_(Not_Applicable),Yes,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_a_Fatality_(Not_Applicable),Incapaciting_Injury
100786,Wyoming,46,Male,Driver,Front_Seat_-_Left_Side_(Drivers_Side),Lap_and_Shoulder_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Not_a_Fatality_(Not_Applicable),No,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_a_Fatality_(Not_Applicable),No_Injury
100855,Wyoming,39,Male,Driver,Front_Seat_-_Left_Side_(Drivers_Side),Lap_and_Shoulder_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Not_a_Fatality_(Not_Applicable),No,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_a_Fatality_(Not_Applicable),No_Injury


In [None]:
duplicate_data.shape

(7964, 30)

Here we can see that 7964 rows are duplicates of earlier rows in the dataset. Duplicate data increases processing time and can distort the analysis and model training, so we remove it.
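To make the counting explicit, here is a small sketch on a hypothetical frame `toy`. By default `duplicated()` flags only the repeats after the first occurrence, which is why removing the 7964 flagged rows from the 100968 original rows leaves 93004.

```python
import pandas as pd

# Hypothetical miniature frame with two duplicate groups
toy = pd.DataFrame({
    "AGE": [1, 99, 1, 99, 32],
    "SEX": ["Male", "Unknown", "Male", "Unknown", "Female"],
})

# Default keep="first": only the later repeats of a row are flagged
n_repeats = int(toy.duplicated().sum())

# keep=False flags every member of a duplicate group, including the first
n_in_groups = int(toy.duplicated(keep=False).sum())

# drop_duplicates() keeps one row per group, so the sizes add up
deduped = toy.drop_duplicates()
print(n_repeats, n_in_groups, len(deduped))
```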

## First we will remove the duplicate data that is present in the dataset

In [None]:
# Removing duplicate data

unique_data = data.drop_duplicates()
unique_data

Unnamed: 0,CASE_STATE,AGE,SEX,PERSON_TYPE,SEATING_POSITION,RESTRAINT_SYSTEM-USE,AIR_BAG_AVAILABILITY/DEPLOYMENT,EJECTION,EJECTION_PATH,EXTRICATION,...,DRUG_TEST_RESULTS_(2_of_3),DRUG_TEST_TYPE_(3_of_3),DRUG_TEST_RESULTS_(3_of_3),HISPANIC_ORIGIN,TAKEN_TO_HOSPITAL,RELATED_FACTOR_(1)-PERSON_LEVEL,RELATED_FACTOR_(2)-PERSON_LEVEL,RELATED_FACTOR_(3)-PERSON_LEVEL,RACE,INJURY_SEVERITY
0,Alabama,34,Male,Driver,Front_Seat_-_Left_Side_(Drivers_Side),None_Used/Not_Applicable,Air_Bag_Available_but_Not_Deployed_for_this_Seat,Totally_Ejected,Unknown,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Non-Hispanic,No,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,White,Fatal_Injury
1,Alabama,20,Male,Driver,Front_Seat_-_Left_Side_(Drivers_Side),None_Used/Not_Applicable,Deployed_Air_Bag_from_Front,Totally_Ejected,Unknown,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Non-Hispanic,No,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,White,Fatal_Injury
2,Alabama,43,Male,Driver,Front_Seat_-_Left_Side_(Drivers_Side),Lap_and_Shoulder_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Extricated,...,0,Not_Tested_for_Drugs,0,Non-Hispanic,No,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Black,Fatal_Injury
3,Alabama,38,Female,Passenger_of_a_Motor_Vehicle_in_Transport,Front_Seat_-_Right_Side,Lap_and_Shoulder_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Extricated,...,0,Not_Tested_for_Drugs,0,Not_a_Fatality_(Not_Applicable),Yes,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_a_Fatality_(Not_Applicable),Incapaciting_Injury
4,Alabama,50,Male,Driver,Front_Seat_-_Left_Side_(Drivers_Side),Lap_and_Shoulder_Belt,Deployed_Air_Bag_from_Front,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Non-Hispanic,Yes,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Black,Fatal_Injury
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100963,Wyoming,10,Female,Passenger_of_a_Motor_Vehicle_in_Transport,Second_Seat_-_Left_Side,Lap_and_Shoulder_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Not_a_Fatality_(Not_Applicable),Yes,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_a_Fatality_(Not_Applicable),Possible_Injury
100964,Wyoming,9,Female,Passenger_of_a_Motor_Vehicle_in_Transport,Second_Seat_-_Right_Side,Lap_and_Shoulder_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Not_a_Fatality_(Not_Applicable),Yes,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_a_Fatality_(Not_Applicable),Possible_Injury
100965,Wyoming,7,Female,Passenger_of_a_Motor_Vehicle_in_Transport,Second_Seat_-_Middle,Lap_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Not_a_Fatality_(Not_Applicable),Yes,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_a_Fatality_(Not_Applicable),Possible_Injury
100966,Wyoming,4,Female,Passenger_of_a_Motor_Vehicle_in_Transport,Second_Seat_-_Middle,Lap_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Not_a_Fatality_(Not_Applicable),Yes,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_a_Fatality_(Not_Applicable),Possible_Injury


In [None]:
unique_data.shape

(93004, 30)

### Now we will convert the non-numeric features in the dataset into numeric format using one-hot encoding. The numeric columns and the class label are set aside first.

In [None]:
columns_to_drop = ['AGE', 'INJURY_SEVERITY','ALCOHOL_TEST_RESULT','DRUG_TEST_RESULTS_(1_of_3)','DRUG_TEST_RESULTS_(2_of_3)','DRUG_TEST_RESULTS_(3_of_3)']
workdata1 = unique_data
workdata1 = workdata1.drop(columns=columns_to_drop)
workdata1

Unnamed: 0,CASE_STATE,SEX,PERSON_TYPE,SEATING_POSITION,RESTRAINT_SYSTEM-USE,AIR_BAG_AVAILABILITY/DEPLOYMENT,EJECTION,EJECTION_PATH,EXTRICATION,NON_MOTORIST_LOCATION,...,METHOD_OF_DRUG_DETERMINATION,DRUG_TEST_TYPE_(1_of_3),DRUG_TEST_TYPE_(2_of_3),DRUG_TEST_TYPE_(3_of_3),HISPANIC_ORIGIN,TAKEN_TO_HOSPITAL,RELATED_FACTOR_(1)-PERSON_LEVEL,RELATED_FACTOR_(2)-PERSON_LEVEL,RELATED_FACTOR_(3)-PERSON_LEVEL,RACE
0,Alabama,Male,Driver,Front_Seat_-_Left_Side_(Drivers_Side),None_Used/Not_Applicable,Air_Bag_Available_but_Not_Deployed_for_this_Seat,Totally_Ejected,Unknown,Not_Extricated,Not_Applicable_-_Vehicle_Occupant,...,Not_Reported,Unknown_if_Tested_for_Drugs,Not_Tested_for_Drugs,Not_Tested_for_Drugs,Non-Hispanic,No,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,White
1,Alabama,Male,Driver,Front_Seat_-_Left_Side_(Drivers_Side),None_Used/Not_Applicable,Deployed_Air_Bag_from_Front,Totally_Ejected,Unknown,Not_Extricated,Not_Applicable_-_Vehicle_Occupant,...,Not_Reported,Not_Tested_for_Drugs,Not_Tested_for_Drugs,Not_Tested_for_Drugs,Non-Hispanic,No,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,White
2,Alabama,Male,Driver,Front_Seat_-_Left_Side_(Drivers_Side),Lap_and_Shoulder_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Extricated,Not_Applicable_-_Vehicle_Occupant,...,Not_Reported,Not_Tested_for_Drugs,Not_Tested_for_Drugs,Not_Tested_for_Drugs,Non-Hispanic,No,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Black
3,Alabama,Female,Passenger_of_a_Motor_Vehicle_in_Transport,Front_Seat_-_Right_Side,Lap_and_Shoulder_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Extricated,Not_Applicable_-_Vehicle_Occupant,...,Not_Reported,Not_Tested_for_Drugs,Not_Tested_for_Drugs,Not_Tested_for_Drugs,Not_a_Fatality_(Not_Applicable),Yes,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_a_Fatality_(Not_Applicable)
4,Alabama,Male,Driver,Front_Seat_-_Left_Side_(Drivers_Side),Lap_and_Shoulder_Belt,Deployed_Air_Bag_from_Front,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,Not_Applicable_-_Vehicle_Occupant,...,Not_Reported,Unknown_if_Tested_for_Drugs,Not_Tested_for_Drugs,Not_Tested_for_Drugs,Non-Hispanic,Yes,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Black
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100963,Wyoming,Female,Passenger_of_a_Motor_Vehicle_in_Transport,Second_Seat_-_Left_Side,Lap_and_Shoulder_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,Not_Applicable_-_Vehicle_Occupant,...,Not_Reported,Not_Tested_for_Drugs,Not_Tested_for_Drugs,Not_Tested_for_Drugs,Not_a_Fatality_(Not_Applicable),Yes,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_a_Fatality_(Not_Applicable)
100964,Wyoming,Female,Passenger_of_a_Motor_Vehicle_in_Transport,Second_Seat_-_Right_Side,Lap_and_Shoulder_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,Not_Applicable_-_Vehicle_Occupant,...,Not_Reported,Not_Tested_for_Drugs,Not_Tested_for_Drugs,Not_Tested_for_Drugs,Not_a_Fatality_(Not_Applicable),Yes,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_a_Fatality_(Not_Applicable)
100965,Wyoming,Female,Passenger_of_a_Motor_Vehicle_in_Transport,Second_Seat_-_Middle,Lap_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,Not_Applicable_-_Vehicle_Occupant,...,Not_Reported,Not_Tested_for_Drugs,Not_Tested_for_Drugs,Not_Tested_for_Drugs,Not_a_Fatality_(Not_Applicable),Yes,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_a_Fatality_(Not_Applicable)
100966,Wyoming,Female,Passenger_of_a_Motor_Vehicle_in_Transport,Second_Seat_-_Middle,Lap_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,Not_Applicable_-_Vehicle_Occupant,...,Not_Reported,Not_Tested_for_Drugs,Not_Tested_for_Drugs,Not_Tested_for_Drugs,Not_a_Fatality_(Not_Applicable),Yes,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_a_Fatality_(Not_Applicable)


In [None]:
from sklearn.preprocessing import OneHotEncoder

# Dense output is needed to build a DataFrame; scikit-learn >= 1.2 names this
# parameter sparse_output (older releases call it sparse)
encoder = OneHotEncoder(sparse_output=False)
encoded_data = encoder.fit_transform(workdata1)
categorical_cols = workdata1.columns
feature_names = encoder.get_feature_names_out(categorical_cols)
encoded_data = pd.DataFrame(encoded_data, columns=feature_names)
encoded_data



Unnamed: 0,CASE_STATE_Alabama,CASE_STATE_Alaska,CASE_STATE_Arizona,CASE_STATE_Arkansas,CASE_STATE_California,CASE_STATE_Colorado,CASE_STATE_Connecticut,CASE_STATE_Delaware,CASE_STATE_District_of_Columbia,CASE_STATE_Florida,...,RACE_Japanese,RACE_Korean,RACE_Multiple_Races_(Individual_races_not_specified;_ex._mixed),RACE_Not_a_Fatality_(Not_Applicable),RACE_Other_Asian_or_Pacific_Islander,RACE_Other_Indian_(Includes_South_and_Central_America),RACE_Samoan,RACE_Unknown,RACE_Vietnamese,RACE_White
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
92999,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
93000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
93001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
93002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
from sklearn.preprocessing import LabelEncoder

label_encoded = LabelEncoder()

# Work on an explicit copy so that the assignment below does not trigger
# pandas' SettingWithCopyWarning
unique_data = unique_data.copy()
unique_data['INJURY_SEVERITY'] = label_encoded.fit_transform(unique_data['INJURY_SEVERITY'])
unique_data['INJURY_SEVERITY']

0         1
1         1
2         1
3         2
4         1
         ..
100963    6
100964    6
100965    6
100966    6
100967    1
Name: INJURY_SEVERITY, Length: 93004, dtype: int64

In [None]:
workdata3 = unique_data[['AGE','ALCOHOL_TEST_RESULT','DRUG_TEST_RESULTS_(1_of_3)','DRUG_TEST_RESULTS_(2_of_3)','DRUG_TEST_RESULTS_(3_of_3)', 'INJURY_SEVERITY']]
encoded_data.reset_index(drop=True, inplace=True)
workdata3.reset_index(drop=True, inplace=True)
new_data = pd.concat([encoded_data, workdata3], axis =1)
new_data

Unnamed: 0,CASE_STATE_Alabama,CASE_STATE_Alaska,CASE_STATE_Arizona,CASE_STATE_Arkansas,CASE_STATE_California,CASE_STATE_Colorado,CASE_STATE_Connecticut,CASE_STATE_Delaware,CASE_STATE_District_of_Columbia,CASE_STATE_Florida,...,RACE_Samoan,RACE_Unknown,RACE_Vietnamese,RACE_White,AGE,ALCOHOL_TEST_RESULT,DRUG_TEST_RESULTS_(1_of_3),DRUG_TEST_RESULTS_(2_of_3),DRUG_TEST_RESULTS_(3_of_3),INJURY_SEVERITY
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,34,97,999,0,0,1
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,20,96,0,0,0,1
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,43,96,0,0,0,1
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,38,96,0,0,0,2
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,50,97,999,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
92999,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,10,96,0,0,0,6
93000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,9,96,0,0,0,6
93001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,7,96,0,0,0,6
93002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,4,96,0,0,0,6


## Now let us split the data into training and testing set

The data is split 80%/20%: the first part is used for training the models and the latter for testing.

In [None]:
from sklearn.model_selection import train_test_split

X=new_data.drop(columns=['INJURY_SEVERITY'])
Y=new_data['INJURY_SEVERITY']
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.2, random_state=42)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (74403, 362)
X_test shape: (18601, 362)
y_train shape: (74403,)
y_test shape: (18601,)
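The split above is purely random. With classes as rare as the smallest ones in this dataset, a stratified split may be worth considering, because it keeps the class proportions the same in both parts. The following is a minimal sketch on hypothetical toy labels, not a change to the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0 and 20 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the 80/20 class ratio in both the training and the test part
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Class 1 in training set:", int(y_train.sum()), "of", len(y_train))
print("Class 1 in test set:", int(y_test.sum()), "of", len(y_test))
```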


In [None]:
# Importing required libraries
from imblearn.over_sampling import RandomOverSampler
from collections import Counter


# Creating the instance of RandomOverSampler
random_oversampler = RandomOverSampler(random_state=42)

# Fitting the features and label of training set that needs to be oversampled
x_train_oversampled, y_train_oversampled = random_oversampler.fit_resample(X_train,y_train)

# Print the class distribution of the full dataset (before the split) and of the oversampled training set
print('Original dataset shape', Counter(Y))
print('Resampled dataset shape', Counter(y_train_oversampled))

Original dataset shape Counter({1: 41442, 4: 15642, 2: 14230, 5: 12945, 6: 8104, 7: 399, 3: 233, 0: 9})
Resampled dataset shape Counter({4: 33224, 1: 33224, 5: 33224, 6: 33224, 2: 33224, 7: 33224, 3: 33224, 0: 33224})
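To illustrate what random oversampling does, here is a rough NumPy sketch on hypothetical toy labels. It is not imblearn's implementation, only the underlying idea: rows of each minority class are redrawn with replacement until every class matches the size of the largest one (33224 per class in the output above).

```python
from collections import Counter

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical training labels: class 1 is the majority, class 0 has one row
y_train = np.array([1, 1, 1, 1, 1, 1, 4, 4, 4, 0])
X_train = np.arange(len(y_train)).reshape(-1, 1)

counts = Counter(y_train.tolist())
target = max(counts.values())  # size of the largest class

# Keep all original rows, then draw extra rows per class with replacement
indices = [np.arange(len(y_train))]
for label, count in counts.items():
    members = np.flatnonzero(y_train == label)
    indices.append(rng.choice(members, size=target - count, replace=True))
indices = np.concatenate(indices)

X_resampled, y_resampled = X_train[indices], y_train[indices]
resampled_counts = Counter(y_resampled.tolist())
print(resampled_counts)
```

Note that the single class-0 row is simply copied until it reaches the target count. Oversampling balances the class sizes but adds no new information about the rarest classes.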


# Exploring several machine learning models for classification

## Decision tree algorithm

In [None]:
# Decision tree

# Importing the required libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Creating an instance of DecisionTreeClassifier
decision_tree = DecisionTreeClassifier(random_state=42)

# Here we are training the model with our training data
decision_tree.fit(x_train_oversampled, y_train_oversampled)

# Next we will make predictions on testing set of our data
decision_tree_predictions = decision_tree.predict(X_test)

# Evaluating the model
accuracy1 = accuracy_score(y_test, decision_tree_predictions)
print(f"Accuracy of decision tree classifier: {accuracy1:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, decision_tree_predictions))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, decision_tree_predictions))


Accuracy of decision tree classifier: 0.7277

Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         4
           1       1.00      1.00      1.00      8218
           2       0.50      0.51      0.50      2881
           3       0.31      0.41      0.35        39
           4       0.83      0.80      0.81      3108
           5       0.36      0.35      0.36      2615
           6       0.23      0.24      0.23      1650
           7       0.43      0.45      0.44        86

    accuracy                           0.73     18601
   macro avg       0.46      0.47      0.46     18601
weighted avg       0.73      0.73      0.73     18601


Confusion Matrix:
[[   0    0    0    0    3    0    1    0]
 [   0 8214    1    0    0    0    0    3]
 [   0    0 1461   16    8 1014  381    1]
 [   0    0    6   16    1   10    6    0]
 [   0    0    5    2 2485  114  473   29]
 [   0    0 1078   13  105  927  481   11]
 [ 



## Hyperparameter(s) optimized decision tree algorithm

Next, we optimize the hyperparameters of the decision tree and then train and test it on our data. Hyperparameters are the settings fixed before the model begins learning, such as the maximum depth of the tree or the minimum number of samples per leaf. Tuning them helps to avoid overfitting or underfitting and can improve the model's performance. Here we use a grid search with 5-fold cross-validation over the parameter grid defined below.
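To get a feel for the cost of such a search, the number of model fits can be counted before running it. The sketch below uses scikit-learn's `ParameterGrid` on the same grid as the cell that follows. The variable names `n_candidates` and `n_fits` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as used for the decision tree search
param_grid = {
    'criterion': ['entropy', 'gini'],
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 4, 8],
    'min_samples_leaf': [1, 2, 4]
}

# Every combination of values is one candidate model
n_candidates = len(ParameterGrid(param_grid))

# With cv=5, each candidate is fitted once per fold
n_fits = n_candidates * 5
print(n_candidates, "candidates,", n_fits, "fits during the search")
```

On the oversampled training set each of these fits sees several hundred thousand rows, which explains why the grid search is by far the slowest cell so far.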

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics


# Defining the parameter grid to search
param_grid = {
    'criterion': ['entropy', 'gini'],
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 4, 8],
    'min_samples_leaf': [1, 2, 4]
}

# Creating an instance of GridSearchCV
gridsearch = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)

# Fit the grid search to the data
gridsearch.fit(x_train_oversampled, y_train_oversampled)

# Getting and printing the best parameters found by GridSearchCV
best_params = gridsearch.best_params_
print("Best hyperparameters:", best_params)

# Using the best parameters found to create the optimized decision tree classifier
optimized_decision_tree = DecisionTreeClassifier(**best_params)

# Training the optimized decision tree classifier
optimized_decision_tree.fit(x_train_oversampled, y_train_oversampled)

# Making predictions on the testing data
optimized_decision_tree_predictions = optimized_decision_tree.predict(X_test)

# Model evaluation with the optimized decision tree classifier
# (all metrics use the tuned tree's predictions, not decision_tree_predictions)
accuracy_optimized = metrics.accuracy_score(y_test, optimized_decision_tree_predictions)
print(f"Accuracy of optimized decision tree: {accuracy_optimized:.4f}")
print("\nClassification Report:")
print(metrics.classification_report(y_test, optimized_decision_tree_predictions))
print("\nConfusion Matrix:")
print(metrics.confusion_matrix(y_test, optimized_decision_tree_predictions))

Best hyperparameters: {'criterion': 'gini', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2}
Accuracy of optimized decision tree: 0.7283

Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         4
           1       1.00      1.00      1.00      8218
           2       0.50      0.51      0.50      2881
           3       0.31      0.41      0.35        39
           4       0.83      0.80      0.81      3108
           5       0.36      0.35      0.36      2615
           6       0.23      0.24      0.23      1650
           7       0.43      0.45      0.44        86

    accuracy                           0.73     18601
   macro avg       0.46      0.47      0.46     18601
weighted avg       0.73      0.73      0.73     18601


Confusion Matrix:
[[   0    0    0    0    3    0    1    0]
 [   0 8214    1    0    0    0    0    3]
 [   0    0 1461   16    8 1014  381    1]
 [   0    0    6   16  

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


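Here we observe only a marginal improvement in accuracy after hyperparameter optimization of the decision tree algorithm: accuracy changed from 0.7277 to 0.7283, an increase of about 0.06 percentage points. The selected hyperparameters are the same as scikit-learn's default values for DecisionTreeClassifier, so the small difference most likely comes from randomness in tree construction rather than from the search itself.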
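The search-then-retrain pattern used above can be shortened. By default, GridSearchCV refits the best configuration on the whole training set, so the fitted model is available as `best_estimator_`. The following is a minimal sketch on synthetic data; the `make_classification` data set, the reduced grid and all `demo_*` names are assumptions for illustration and are not part of the FARS pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: this is not the FARS data set)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Reduced, hypothetical grid to keep the example fast
demo_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 3, 5],
    "min_samples_split": [2, 5],
}

# refit=True is the default, so the best model is retrained on all of X_tr
demo_search = GridSearchCV(DecisionTreeClassifier(random_state=0), demo_grid, cv=3)
demo_search.fit(X_tr, y_tr)

# The refitted model can be used directly for prediction and scoring
best_tree = demo_search.best_estimator_
demo_accuracy = best_tree.score(X_te, y_te)
print("Best hyperparameters:", demo_search.best_params_)
print(f"Test accuracy: {demo_accuracy:.4f}")
```

Fixing `random_state` on the estimator makes repeated runs comparable, which helps when the difference between two models is as small as the one reported here.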

## Random forest algorithm

In [None]:
# Random forest

# Importing necessary libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Creating an instance of RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the Random Forest classifier using training data
random_forest.fit(x_train_oversampled, y_train_oversampled)

# Make predictions on the testing data
random_forest_predictions = random_forest.predict(X_test)

# Evaluate the model's performance
random_forest_accuracy1 = accuracy_score(y_test, random_forest_predictions)
print("Accuracy of random forest classifier: ", random_forest_accuracy1)
print("\nClassification Report:")
print(classification_report(y_test, random_forest_predictions))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, random_forest_predictions))

Accuracy of random forest classifier:  0.7480780603193377

Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         4
           1       1.00      1.00      1.00      8218
           2       0.53      0.55      0.54      2881
           3       0.19      0.18      0.18        39
           4       0.83      0.86      0.84      3108
           5       0.40      0.40      0.40      2615
           6       0.25      0.22      0.24      1650
           7       0.57      0.42      0.48        86

    accuracy                           0.75     18601
   macro avg       0.47      0.45      0.46     18601
weighted avg       0.74      0.75      0.75     18601


Confusion Matrix:
[[   0    0    0    0    3    1    0    0]
 [   0 8215    0    0    0    0    0    3]
 [   0    0 1573   13    6  969  320    0]
 [   0    0   11    7    1   14    6    0]
 [   2    0    4    1 2665   76  345   15]
 [   0    0 1027   10  111 1048  

## Hyperparameter-optimized random forest algorithm

In [None]:
# Import all the necessary libraries
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

# Defining the parameter grid and their values to search
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10, 15],
    'min_samples_split': [2, 5, 10]
}

# Creating an instance of Random Forest classifier
random_forest = RandomForestClassifier()

# Perform GridSearchCV to find the best hyperparameters
# Creating an instance of GridSearchCV
grid_search = GridSearchCV(random_forest, param_grid, cv=5)

# Fit the grid search to the data
grid_search.fit(x_train_oversampled, y_train_oversampled)

# Getting and printing the best hyperparameters
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

# Use the best model
optimized_random_forest = grid_search.best_estimator_

# Make predictions on the test set using the best model
optimized_random_forest_predictions = optimized_random_forest.predict(X_test)

# Evaluate model performance (using the metrics module imported in this cell)
accuracy = metrics.accuracy_score(y_test, optimized_random_forest_predictions)
print(f"\nAccuracy of the optimized oversampled Random Forest classifier: {accuracy}")
print("\nClassification Report:")
print(metrics.classification_report(y_test, optimized_random_forest_predictions))
print("\nConfusion Matrix:")
print(metrics.confusion_matrix(y_test, optimized_random_forest_predictions))

Best hyperparameters: {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 200}

Accuracy of the optimized oversampled Random Forest classifier: 0.7484543841728939

Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         4
           1       1.00      1.00      1.00      8218
           2       0.53      0.55      0.54      2881
           3       0.21      0.21      0.21        39
           4       0.83      0.86      0.84      3108
           5       0.40      0.41      0.40      2615
           6       0.25      0.22      0.23      1650
           7       0.55      0.42      0.48        86

    accuracy                           0.75     18601
   macro avg       0.47      0.46      0.46     18601
weighted avg       0.74      0.75      0.75     18601


Confusion Matrix:
[[   0    0    0    0    3    1    0    0]
 [   0 8215    0    0    0    0    0    3]
 [   0    0 1576   13    6  974  312    0]
 [   0

Here we can observe the best hyperparameters selected for the random forest algorithm. The random forest classifier with these hyperparameters gives an accuracy of 0.7485, a slight increase over the non-optimized random forest classifier (0.7481).

# Support Vector Machine (SVM) Algorithm

In [None]:
# Importing required libraries
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# Creating an instance of support vector classifier
svm = SVC()

# Training the algorithm using training set
svm.fit(x_train_oversampled, y_train_oversampled)

# Making predictions on the testing set of the data
svm_predictions = svm.predict(X_test)

# Evaluating the model's performance
print("Accuracy of SVM model: ", accuracy_score(y_test, svm_predictions))
print("\nClassification Report:")
print(classification_report(y_test, svm_predictions))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, svm_predictions))

Accuracy of SVM model:  0.4418042040750497

Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         4
           1       0.44      1.00      0.61      8218
           2       0.00      0.00      0.00      2881
           3       0.00      0.00      0.00        39
           4       0.00      0.00      0.00      3108
           5       0.00      0.00      0.00      2615
           6       0.00      0.00      0.00      1650
           7       0.00      0.00      0.00        86

    accuracy                           0.44     18601
   macro avg       0.06      0.12      0.08     18601
weighted avg       0.20      0.44      0.27     18601


Confusion Matrix:
[[   0    4    0    0    0    0    0    0]
 [   0 8218    0    0    0    0    0    0]
 [   0 2881    0    0    0    0    0    0]
 [   0   39    0    0    0    0    0    0]
 [   0 3108    0    0    0    0    0    0]
 [   0 2615    0    0    0    0    0    0]
 [   

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [None]:
new_data

Unnamed: 0,CASE_STATE_Alabama,CASE_STATE_Alaska,CASE_STATE_Arizona,CASE_STATE_Arkansas,CASE_STATE_California,CASE_STATE_Colorado,CASE_STATE_Connecticut,CASE_STATE_Delaware,CASE_STATE_District_of_Columbia,CASE_STATE_Florida,...,RACE_Samoan,RACE_Unknown,RACE_Vietnamese,RACE_White,AGE,ALCOHOL_TEST_RESULT,DRUG_TEST_RESULTS_(1_of_3),DRUG_TEST_RESULTS_(2_of_3),DRUG_TEST_RESULTS_(3_of_3),INJURY_SEVERITY
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,34,97,999,0,0,1
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,20,96,0,0,0,1
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,43,96,0,0,0,1
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,38,96,0,0,0,2
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,50,97,999,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
92999,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,10,96,0,0,0,6
93000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,9,96,0,0,0,6
93001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,7,96,0,0,0,6
93002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,4,96,0,0,0,6


# Data Normalization

Data normalization scales numeric features to a consistent range. We perform normalization using MinMaxScaler, which scales each feature to the range 0 to 1. The scaler is fitted on the training features only and then applied to the test features; the labels are not normalized.

In [None]:
# Importing required libraries
from sklearn.preprocessing import MinMaxScaler

# Initialize the MinMaxScaler
scaler = MinMaxScaler()

# Fit the features training data and transform it
X_train_normalized = scaler.fit_transform(X_train)

# Transform the features testing data
X_test_normalized = scaler.transform(X_test)

X_train_normalized

array([[0., 0., 0., ..., 1., 1., 1.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 1., 1.],
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

Here we can see that all the training features have now been scaled to the range 0 to 1.
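To make the scaling step concrete, the sketch below applies MinMaxScaler to a small toy matrix and checks it against the formula (x - min) / (max - min) computed per column. The toy values and the `toy_*` names are hypothetical and only loosely resemble an age column and a test-result code column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix (hypothetical values, not taken from the FARS data)
toy_train = np.array([[20.0, 0.0],
                      [34.0, 96.0],
                      [50.0, 999.0]])
toy_test = np.array([[60.0, 97.0]])

# Fit on the training data only, then reuse the same min/max for the test data
toy_scaler = MinMaxScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)

# The same result computed by hand, column by column
col_min = toy_train.min(axis=0)
col_max = toy_train.max(axis=0)
manual_scaled = (toy_train - col_min) / (col_max - col_min)

print(toy_train_scaled)
print(toy_test_scaled)
```

Because the minimum and maximum come from the training data, a test value outside the training range (here an age of 60 against a training maximum of 50) is mapped outside the interval 0 to 1.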

# Decision tree algorithm on non-oversampled data

In [None]:
# Decision tree

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Creating an instance of DecisionTreeClassifier
decision_tree1 = DecisionTreeClassifier(random_state=42)

# Training the model on the original (non-oversampled) training data
decision_tree1.fit(X_train, y_train)

# Making predictions on the testing set
decision_tree_predictions1 = decision_tree1.predict(X_test)

# Evaluating the model
accuracy2 = accuracy_score(y_test, decision_tree_predictions1)
print(f"Accuracy of decision tree classifier on non oversampled data: {accuracy2:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, decision_tree_predictions1))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, decision_tree_predictions1))

Accuracy of decision tree classifier on non oversampled data: 0.7385

Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         4
           1       1.00      1.00      1.00      8218
           2       0.51      0.52      0.52      2881
           3       0.23      0.26      0.24        39
           4       0.83      0.84      0.83      3108
           5       0.38      0.38      0.38      2615
           6       0.25      0.23      0.24      1650
           7       0.36      0.34      0.35        86

    accuracy                           0.74     18601
   macro avg       0.44      0.45      0.44     18601
weighted avg       0.74      0.74      0.74     18601


Confusion Matrix:
[[   0    0    0    0    3    0    1    0]
 [   0 8215    1    0    0    0    0    2]
 [   0    0 1511    8    9 1011  339    3]
 [   0    0   10   10    1   11    7    0]
 [   0    0    5    2 2599  102  379   21]
 [   0    0 1065   18 

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


## Cross Validation Evaluation

Cross-validation can be used to estimate the performance of a machine learning model. In k-fold cross-validation we divide the entire dataset into k folds. In each round, one fold is held out for testing and the remaining k-1 folds are used for training the model, so each fold is used for testing exactly once. The scores from the k rounds are then averaged.
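As a small illustration of how the folds are assigned, the sketch below splits a toy array of six samples into three folds with KFold. The toy array and the `toy_*` names are hypothetical; only the `KFold(n_splits=3)` setting mirrors the cells that follow.

```python
import numpy as np
from sklearn.model_selection import KFold

# Six toy samples (hypothetical data), one feature each
toy_X = np.arange(6).reshape(-1, 1)

# Same splitter configuration as in the notebook: 3 folds, no shuffling
toy_kfold = KFold(n_splits=3)

test_folds = []
for fold_number, (train_idx, test_idx) in enumerate(toy_kfold.split(toy_X), start=1):
    # Each round holds out one fold for testing and trains on the rest
    test_folds.append(test_idx.tolist())
    print(f"Fold {fold_number}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Every sample index appears in exactly one test fold, and without shuffling the folds follow the row order of the data.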

### Cross validation on decision tree

In [None]:
# Importing required libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, KFold

# Creating an instance of decision tree classifier
d_t = DecisionTreeClassifier()
# Running 3-fold cross-validation on the full feature matrix X and labels Y
dt_kfold = KFold(n_splits=3)
d_t_score = cross_val_score(d_t, X, Y, cv=dt_kfold)
print("Cross Validation Scores are {}".format(d_t_score))
print("Average Cross Validation score :{}".format(d_t_score.mean()))

Cross Validation Scores are [0.74643571 0.75062095 0.75723364]
Average Cross Validation score :0.7514300997202904


### Cross validation on random forest

In [None]:
# Importing required libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, KFold

# Creating an instance of random forest classifier
random_fc = RandomForestClassifier()
kfold = KFold(n_splits=3)
random_forest_score = cross_val_score(random_fc, X, Y, cv=kfold)
print("Cross Validation Scores are {}".format(random_forest_score))
print("Average Cross Validation score :{}".format(random_forest_score.mean()))

Cross Validation Scores are [0.77017612 0.77071707 0.77787813]
Average Cross Validation score :0.7729237747586888
