# CS452/CS552 Assignment 2: Car Rollover Prediction

**Release Date: 23.11.2021** <br>
**Submission Deadline: 12.12.2021 23.55**

In [1]:
# Author: Mert Erkol
# Department: Computer Science
# Degree: BSc. 

In [1]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

import requests as rq
import io
import random

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, recall_score, precision_score, f1_score, fbeta_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler, RobustScaler
from sklearn.inspection import permutation_importance
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

from sklearn.naive_bayes import GaussianNB, CategoricalNB, ComplementNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

pd.set_option('display.max_columns', 150)
pd.set_option('display.max_rows', 10000)
random.seed(42) # DO NOT CHANGE
np.random.seed(42) # DO NOT CHANGE

# Part 1: Data Loading and Cleaning
> Get FARS datasets through the API provided by NHTSA using requests library.<br>

Two types of datasets will be used in the project:<br>
*person* dataset provides the information of persons involved in a crash.<br>
*vindecode* dataset the information of vehicles involved in a crash thanks to decoded Vehicle Identification Number (VIN).

In [3]:
api = "https://crashviewer.nhtsa.dot.gov/CrashAPI/FARSData/GetFARSData"
case_years = np.arange(2014, 2020, dtype=int) # DO NOT CHANGE

In [4]:
def get_persons_data(year):
    response = rq.get(url = api, params = {'dataset':'person', 'caseYear':year, 'format':'csv'})

    if response.ok:
        csv_data = response.content
        persons = pd.read_csv(io.StringIO(csv_data.decode('utf-8')))
        return persons
    
    print("Failed to get the data")
    print(response.status)
    
    return None

def get_vehicles_data(year):
    response = rq.get(url = api, params = {'dataset':'vindecode', 'caseYear':year, 'format':'csv'})

    if response.ok:
        csv_data = response.content
        vehicles = pd.read_csv(io.StringIO(csv_data.decode('utf-8')))
        return vehicles
    
    print("Failed to get the data")
    print(response.status)
    
    return None

In [5]:
persons = get_persons_data(2014)
np.delete(case_years,0)
for i in case_years:
    persons = pd.concat([persons,get_persons_data(i)])

  persons = get_persons_data(2014)
  persons = pd.concat([persons,get_persons_data(i)])
  persons = pd.concat([persons,get_persons_data(i)])


In [6]:
vehicles = get_vehicles_data(2014)
for i in case_years:
     vehicles = pd.concat([vehicles,get_vehicles_data(i)])    

  vehicles = get_vehicles_data(2014)
  vehicles = pd.concat([vehicles,get_vehicles_data(i)])
  vehicles = pd.concat([vehicles,get_vehicles_data(i)])
  vehicles = pd.concat([vehicles,get_vehicles_data(i)])
  vehicles = pd.concat([vehicles,get_vehicles_data(i)])


1.1) Use the above methods to get vehicles and persons datasets from 2014 to 2019.

1.2) Determine the useless columns to drop from the dataframes.

**Hint 1**: An encoded and a decoded (actual-valued) column exist for most of the features. Drop the encoded columns.<br>
**Hint 2**: Some features have mostly null values or a single value. <br>
**Hint 3**: Some features provides information that can be obtained after the accident. <br>
You can refer to [FARS User’s Manual](https://crashstats.nhtsa.dot.gov/Api/Public/ViewPublication/813023) if needed.

In [7]:
persons.columns[0::]

Index(['caseyear', 'state', 'statename', 'st_case', 've_forms', 'veh_no',
       'per_no', 'str_veh', 'str_vehname', 'county',
       ...
       'hispanic', 'hispanicname', 'race', 'racename', 'location',
       'locationname', 'func_sys', 'func_sysname', 'rur_urb', 'rur_urbname'],
      dtype='object', length=131)

In [8]:
persons.isna().sum()

caseyear             0
state                0
statename            0
st_case              0
ve_forms             0
                 ...  
locationname         0
func_sys        148696
func_sysname    148696
rur_urb         148696
rur_urbname     148696
Length: 131, dtype: int64

In [10]:
# Be aware that this is a mutable method.
def drop_useless_columns(vehicles, persons):

    vec_cols_to_drop = [
        'state', 'ncicmake', 'vehtype', 'vintrim_t', 'vintrim1_t',
        'vintrim1_t', 'vintrim2_t', 'vintrim3_t', 'vintrim4_t', 'bodystyl',
        'mfg', 'cycles', 'fuel', 'fuelinj', 'carbtype', 'carbbrls', 'gvwrange',
        'gvwrange_t', 'tiredesc_f', 'tiredesc_r', 'rearsize', 'tonrating',
        'drivetyp', 'salectry', 'salectry_t', 'abs', 'security', 'security_t',
        'drl', 'rstrnt', 'rstrnt_t', 'tkcab', 'tkcab_t', 'tkaxlef',
        'tkaxlef_t', 'tkaxler', 'tkaxler_t', 'tkbrak', 'engmfg', 'tkduty',
        'tkbedl', 'segmnt', 'plant', 'plntcity', 'plntctry', 'plntstat',
        'plntstat_t', 'origin', 'enghead', 'incomplt', 'battyp', 'battyp_t',
        'batkwrtg', 'batvolt', 'supchrgr', 'supchrgr_t', 'turbo', 'turbo_t',
        'engvvt', 'mcyusage', 'mcyusage_t','tkbrak_t','tkduty_t','tkbedl_t','engmodel','displcc','psi_f','psi_r'
    ]
    per_cols_to_drop = [
        'state', 'str_veh', 'county', 'countyname', 'day', 'month',
        'monthname', 'hour', 'hourname', 'minute', 'minutename', 'road_fnc',
        'road_fncname', 'harm_ev', 'harm_evname', 'man_coll', 'man_collname',
        'sch_bus', 'sch_busname', 'make', 'mak_mod', 'body_typ', 'mod_year',
        'tow_veh', 'tow_vehname', 'spec_use', 'spec_usename', 'emer_use',
        'emer_usename', 'rollover', 'impact1', 'impact1name', 'fire_exp',
        'fire_expname', 'age', 'sex', 'per_typ', 'inj_sev', 'inj_sevname',
        'seat_pos', 'seat_posname', 'rest_use', 'rest_usename', 'rest_mis',
        'rest_misname', 'air_bag', 'air_bagname', 'ejection', 'ejectionname',
        'ej_path', 'ej_pathname', 'extricat', 'extricatname', 'drinking',
        'drinkingname', 'alc_det', 'alc_status', 'alc_statusname', 'atst_typ',
        'atst_typname', 'alc_res', 'alc_resname', 'drugs', 'drugsname',
        'drug_det', 'dstatus', 'drugtst1', 'drugtst1name', 'drugres1',
        'drugres1name', 'drugtst2', 'drugtst2name', 'drugres2', 'drugres2name',
        'drugtst3', 'drugtst3name', 'drugres3', 'drugres3name', 'hospital',
        'hospitalname', 'doa', 'doaname', 'death_da', 'death_daname',
        'death_mo', 'death_moname', 'death_yr', 'death_yrname', 'death_hr',
        'death_hrname', 'death_mn', 'death_mnname', 'death_tm', 'lag_hrs',
        'lag_mins', 'p_sf1', 'p_sf1name', 'p_sf2', 'p_sf2name', 'p_sf3name',
        'cert_no', 'work_inj', 'work_injname', 'hispanic', 'race', 'location',
        'locationname', 'func_sys', 'func_sysname', 'rur_urb', 'rur_urbname',
        'dstatusname', 'p_sf3','hispanicname','racename'
    ]

    vehicles.drop(vec_cols_to_drop, axis=1, inplace=True)
    persons.drop(per_cols_to_drop, axis=1, inplace=True)

In [11]:
# YOUR CODE HERE
drop_useless_columns(vehicles,persons)

In [12]:
persons.head()

Unnamed: 0,caseyear,statename,st_case,ve_forms,veh_no,per_no,str_vehname,makename,body_typname,mod_yearname,rollovername,agename,sexname,per_typname,alc_detname,drug_detname
0,2014,Alabama,10001,1,1,1,Occupant of a Motor Vehicle,Toyota,"4-door sedan, hardtop",2011,"Rollover, Tripped by Object/Vehicle",24 Years,Male,Driver of a Motor Vehicle In-Transport,Observed,Not Reported
1,2014,Alabama,10001,1,1,2,Occupant of a Motor Vehicle,Toyota,"4-door sedan, hardtop",2011,"Rollover, Tripped by Object/Vehicle",30 Years,Female,Passenger of a Motor Vehicle In-Transport,Not Reported,Not Reported
2,2014,Alabama,10002,1,1,1,Occupant of a Motor Vehicle,Dodge,"Standard pickup (GVWR 4,500 to 10,00 lbs.)(Jee...",1997,No Rollover,52 Years,Male,Driver of a Motor Vehicle In-Transport,Not Reported,Not Reported
3,2014,Alabama,10003,2,1,1,Occupant of a Motor Vehicle,Chevrolet,"4-door sedan, hardtop",2004,No Rollover,22 Years,Male,Driver of a Motor Vehicle In-Transport,Not Reported,Not Reported
4,2014,Alabama,10003,2,1,2,Occupant of a Motor Vehicle,Chevrolet,"4-door sedan, hardtop",2004,No Rollover,21 Years,Female,Passenger of a Motor Vehicle In-Transport,Not Reported,Not Reported


In [13]:
# Getting the persons only for driver
latest_person = persons.loc[persons["per_typname"] == "Driver of a Motor Vehicle In-Transport"]

1.3.1) Complete the following method returning a single DataFrame named 'accidents' whose rows are singular for person and vehicle data. Then, merge the dataframes belongs to the same year.

**Hint 1**: You need to define a key from some columns like year, st_case, and veh_no to have an unique value for each row. <br>
**Hint 2**: You might use such methods ```<DataFrame>.merge```, ```<DataFrame>.join```, or ```pd.concat```.

In [14]:
latest_person.head()

Unnamed: 0,caseyear,statename,st_case,ve_forms,veh_no,per_no,str_vehname,makename,body_typname,mod_yearname,rollovername,agename,sexname,per_typname,alc_detname,drug_detname
0,2014,Alabama,10001,1,1,1,Occupant of a Motor Vehicle,Toyota,"4-door sedan, hardtop",2011,"Rollover, Tripped by Object/Vehicle",24 Years,Male,Driver of a Motor Vehicle In-Transport,Observed,Not Reported
2,2014,Alabama,10002,1,1,1,Occupant of a Motor Vehicle,Dodge,"Standard pickup (GVWR 4,500 to 10,00 lbs.)(Jee...",1997,No Rollover,52 Years,Male,Driver of a Motor Vehicle In-Transport,Not Reported,Not Reported
3,2014,Alabama,10003,2,1,1,Occupant of a Motor Vehicle,Chevrolet,"4-door sedan, hardtop",2004,No Rollover,22 Years,Male,Driver of a Motor Vehicle In-Transport,Not Reported,Not Reported
7,2014,Alabama,10003,2,2,1,Occupant of a Motor Vehicle,Toyota,"4-door sedan, hardtop",1997,No Rollover,20 Years,Female,Driver of a Motor Vehicle In-Transport,Not Reported,Not Reported
10,2014,Alabama,10004,3,1,1,Occupant of a Motor Vehicle,Toyota,"Compact pickup (GVWR <4,500 lbs.) (D50,Colt P/...",1999,No Rollover,81 Years,Male,Driver of a Motor Vehicle In-Transport,Not Reported,Not Reported


In [15]:
def merge_vehicles_and_persons(vehicles, persons) -> pd.DataFrame:
    
    # YOUR CODE HERE
    accidents = pd.merge(vehicles,persons, on = ['caseyear','st_case','veh_no','statename'])

    return accidents

In [16]:
# YOUR CODE HERE merging..
accident = pd.DataFrame(merge_vehicles_and_persons(vehicles, latest_person))

1.3.2) Obtain a single dataframe named 'data' by merging accidents dataframes of all available years. There might be some columns which do not present in all years. <br> 
Save this dataframe as a csv or xlsx file so that after having the final data you can skip the previous steps while doing the assignment.


In [None]:
# YOUR CODE HERE

In [789]:
# Read the saved final data file as a DataFrame.
# Remark: Your final dataset should comprise approximately 60 columns. 

data = accident.to_csv("out.csv")

In [2]:
data = pd.read_csv("out.csv")

  exec(code_obj, self.user_global_ns, self.user_ns)


In [3]:
data = data.drop(columns=["Unnamed: 0"])

In [4]:
data.head()

Unnamed: 0,caseyear,statename,st_case,veh_no,vinyear,vehtype_t,vinmake_t,vinmodel_t,bodystyl_t,doors,wheels,drivwhls,mfg_t,displci,cylndrs,fuel_t,fuelinj_t,carbtype_t,whlbsh,whlblg,tiresz_f,tiresz_f_t,rearsize_t,shipweight,msrp,drivetyp_t,abs_t,drl_t,engmfg_t,segmnt_t,plntctry_t,origin_t,dispclmt,blocktype,enghead_t,vlvclndr,vlvtotal,engvincd,ve_forms,per_no,str_vehname,makename,body_typname,mod_yearname,rollovername,agename,sexname,per_typname,alc_detname,drug_detname
0,2014,Alabama,10001,1,2011.0,Passenger Car,TOYOTA,COROLLA,Sedan,4.0,4.0,2.0,TOYOTA,110.0,4.0,Gas,,Fuel Injection,102.4,102.4,29.0,15R195,15R195,2734.0,15600.0,Front Wheel Drive,All Wheel Std,Standard,,Non Luxury Traditional Compact,Canada,Import Built in North America,1.8,In-Line,Double Overhead Camshaft,4.0,16.0,U,1,1,Occupant of a Motor Vehicle,Toyota,"4-door sedan, hardtop",2011,"Rollover, Tripped by Object/Vehicle",24 Years,Male,Driver of a Motor Vehicle In-Transport,Observed,Not Reported
1,2014,Alabama,10001,1,2011.0,Passenger Car,TOYOTA,COROLLA,Sedan,4.0,4.0,2.0,TOYOTA,110.0,4.0,Gas,,Fuel Injection,102.4,102.4,29.0,15R195,15R195,2734.0,15600.0,Front Wheel Drive,All Wheel Std,Standard,,Non Luxury Traditional Compact,Canada,Import Built in North America,1.8,In-Line,Double Overhead Camshaft,4.0,16.0,U,1,1,Occupant of a Motor Vehicle,Toyota,"4-door sedan, hardtop",2011,"Rollover, Tripped by Object/Vehicle",24 Years,Male,Driver of a Motor Vehicle In-Transport,Observed,Not Reported
2,2014,Alabama,10001,1,2011.0,Passenger Car,TOYOTA,COROLLA,Sedan,4.0,4.0,2.0,TOYOTA,110.0,4.0,Gas,,Fuel Injection,102.4,102.4,29.0,15R195,15R195,2734.0,15600.0,Front Wheel Drive,All Wheel Std,Standard,,Non Luxury Traditional Compact,Canada,Import Built in North America,1.8,In-Line,Double Overhead Camshaft,4.0,16.0,U,1,1,Occupant of a Motor Vehicle,Toyota,"4-door sedan, hardtop",2011,"Rollover, Tripped by Object/Vehicle",24 Years,Male,Driver of a Motor Vehicle In-Transport,Observed,Not Reported
3,2014,Alabama,10001,1,2011.0,Passenger Car,TOYOTA,COROLLA,Sedan,4.0,4.0,2.0,TOYOTA,110.0,4.0,Gas,,Fuel Injection,102.4,102.4,29.0,15R195,15R195,2734.0,15600.0,Front Wheel Drive,All Wheel Std,Standard,,Non Luxury Traditional Compact,Canada,Import Built in North America,1.8,In-Line,Double Overhead Camshaft,4.0,16.0,U,1,1,Occupant of a Motor Vehicle,Toyota,"4-door sedan, hardtop",2011,"Rollover, Tripped by Object/Vehicle",24 Years,Male,Driver of a Motor Vehicle In-Transport,Observed,Not Reported
4,2014,Alabama,10002,1,1997.0,Truck,DODGE,RAM 2500,Pickup,2.0,4.0,2.0,DAIMLER-CHRYSLER,360.0,8.0,Gas,Unknown,Fuel Injection,138.7,154.7,,,,4787.0,20775.0,Rear Wheel Drive,Other Std,Not Available,CHRYSLER,Non Luxury Full Size 3qtr to 1 Ton Pickup,Mexico,Domestic,5.9,V-type,Overhead Valve,2.0,16.0,Z,1,1,Occupant of a Motor Vehicle,Dodge,"Standard pickup (GVWR 4,500 to 10,00 lbs.)(Jee...",1997,No Rollover,52 Years,Male,Driver of a Motor Vehicle In-Transport,Not Reported,Not Reported


#  Part-2: Exploratory Data Analysis (EDA)

> In this part, explore categorical and numerical features and report some prominent characteristics of data.<br>
Try to get insights from the data for the feature engineering part.<br>

>Before the analysis, inspect the target variable.<br>
Decide on what type of problem is better to learn the car rollover phenomenon by ML models.<br>
Setting problem type is an crucial decision for achievements of ML projects.<br>
**Hint**: You should take an action changing the target variable.<br>
- Detecting missing values, e.g. 0, NaN, 999, -1, representative words (unknown, unavailable, etc.), ...
- Dropping duplicate and empty rows/columns
- histograms, scatter and bar plots
- value counts of categorical features
- statistical tables

In [5]:
#Inspect our data
data.shape

(439291, 50)

In [6]:
# Inspect our data
data.describe()

Unnamed: 0,caseyear,st_case,veh_no,vinyear,doors,wheels,drivwhls,displci,whlbsh,whlblg,shipweight,msrp,dispclmt,vlvclndr,vlvtotal,ve_forms,per_no
count,439291.0,439291.0,439291.0,420670.0,420670.0,420670.0,420670.0,420670.0,341196.0,341192.0,420670.0,420670.0,364201.0,420670.0,420670.0,439291.0,439291.0
mean,2015.776214,276398.946721,1.46573,2005.554049,2.934835,2.783764,1.877695,237.512064,115.962057,119.918731,3089.602458,21303.124145,4.108588,1.730825,9.326617,1.928155,1.000553
std,1.830911,163251.636654,1.174434,6.745418,1.576735,2.03638,1.557569,191.330061,15.815118,26.44072,1708.909565,12259.558655,2.6608,1.846041,10.096606,1.852402,0.032947
min,2014.0,10001.0,1.0,1981.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
25%,2014.0,122011.0,1.0,2001.0,2.0,0.0,0.0,132.0,105.7,105.7,2530.0,14799.0,2.4,0.0,0.0,1.0,1.0
50%,2015.0,270269.0,1.0,2006.0,4.0,4.0,2.0,207.0,110.5,110.5,3330.0,21231.0,3.5,0.0,0.0,2.0,1.0
75%,2017.0,420825.0,2.0,2011.0,4.0,4.0,4.0,293.0,120.2,123.1,4129.0,27710.0,5.0,4.0,16.0,2.0,1.0
max,2019.0,560131.0,64.0,2020.0,5.0,14.0,8.0,1099.0,254.0,960.0,14795.0,441600.0,16.1,16.0,48.0,64.0,8.0


In [7]:
#Checking the value counts
data['rearsize_t'].value_counts().sum()

101900

In [8]:
data['engmfg_t'].value_counts().sum()

178051

In [9]:
data['enghead_t'].value_counts()

Double Overhead Camshaft    146254
Overhead Valve               55283
Single Overhead Camshaft     51162
Unknown                       1539
Name: enghead_t, dtype: int64

In [10]:
data['fuelinj_t'].value_counts()

Unknown         215368
Sequential       43401
Direct           33220
Multiport        18589
Common Rail       2357
Port               685
Throttlebody        68
Name: fuelinj_t, dtype: int64

In [11]:
data['drl_t'].value_counts()

Standard         140091
Not Available    118046
Optional          62753
Unknown           12379
Available            34
STANDARD              5
Name: drl_t, dtype: int64

In [12]:
data.isnull().sum()

caseyear             0
statename            0
st_case              0
veh_no               0
vinyear          18621
vehtype_t        18621
vinmake_t        18621
vinmodel_t       18648
bodystyl_t       18621
doors            18621
wheels           18621
drivwhls         18621
mfg_t            29399
displci          18621
cylndrs          19078
fuel_t           62753
fuelinj_t       125603
carbtype_t       97625
whlbsh           98095
whlblg           98099
tiresz_f        178797
tiresz_f_t      178797
rearsize_t      337391
shipweight       18621
msrp             18621
drivetyp_t       63095
abs_t           112693
drl_t           105983
engmfg_t        261240
segmnt_t         62640
plntctry_t       23244
origin_t         20104
dispclmt         75090
blocktype        98693
enghead_t       185053
vlvclndr         18621
vlvtotal         18621
engvincd         50466
ve_forms             0
per_no               0
str_vehname          0
makename             0
body_typname         0
mod_yearnam

In [13]:
# Dropping unneccery columns after I evaulated I decided to drop them
data = data.drop([
    'rearsize_t', 'engmfg_t', 'agename', 'sexname', 'body_typname', 'engvincd',
    'mod_yearname', 'tiresz_f', 'segmnt_t', 'alc_detname', 'drug_detname',
    'fuelinj_t', 'whlbsh', 'whlblg', 'blocktype', 'origin_t'
],
                 axis=1)

In [14]:
#Inspecting our target value
data['rollovername'].value_counts()

No Rollover                            369460
Rollover, Tripped by Object/Vehicle     57641
Rollover, Untripped                      9067
Rollover, Unknown Type                   3123
Name: rollovername, dtype: int64

In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 439291 entries, 0 to 439290
Data columns (total 34 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   caseyear      439291 non-null  int64  
 1   statename     439291 non-null  object 
 2   st_case       439291 non-null  int64  
 3   veh_no        439291 non-null  int64  
 4   vinyear       420670 non-null  float64
 5   vehtype_t     420670 non-null  object 
 6   vinmake_t     420670 non-null  object 
 7   vinmodel_t    420643 non-null  object 
 8   bodystyl_t    420670 non-null  object 
 9   doors         420670 non-null  float64
 10  wheels        420670 non-null  float64
 11  drivwhls      420670 non-null  float64
 12  mfg_t         409892 non-null  object 
 13  displci       420670 non-null  float64
 14  cylndrs       420213 non-null  object 
 15  fuel_t        376538 non-null  object 
 16  carbtype_t    341666 non-null  object 
 17  tiresz_f_t    260494 non-null  object 
 18  ship

In [16]:
# Do not include record identifier columns (e.g. st_case, veh_no, etc.) into the following lists!
categorical = [
    'statename', 'vehtype_t', 'vinmake_t', 'vinmodel_t', 'bodystyl_t', 'mfg_t',
    'cylndrs', 'fuel_t', 'carbtype_t', 'tiresz_f_t', 'drivetyp_t', 'abs_t',
    'drl_t', 'plntctry_t', 'enghead_t', 'str_vehname', 'makename',
    'per_typname'
]  # names of columns having categorical values
numeric = [
    'vinyear', 'doors', 'wheels', 'drivwhls', 'displci', 'shipweight', 'msrp',
    'dispclmt', 'vlvclndr', 'vlvtotal', 've_forms', 'per_no'
]  # names of columns having numerical values
target_name = [
    "rollovername"
]  # set the column name of target variable (it is named 'rollovername' in FARS datasets)

#YOUR CODE HERE

features = categorical + numeric

In [17]:
# YOUR CODE HERE
# Replacing Unknown with nan
for i in data[features]:
    data = data.replace('Unknown',np.nan)

In [18]:
data.isnull().sum()

caseyear             0
statename            0
st_case              0
veh_no               0
vinyear          18621
vehtype_t        18621
vinmake_t        18621
vinmodel_t       18648
bodystyl_t       18621
doors            18621
wheels           18621
drivwhls         18621
mfg_t            29399
displci          18621
cylndrs          19078
fuel_t           62802
carbtype_t       97671
tiresz_f_t      180866
shipweight       18621
msrp             18621
drivetyp_t       63095
abs_t           114120
drl_t           118362
plntctry_t       23246
dispclmt         75090
enghead_t       186592
vlvclndr         18621
vlvtotal         18621
ve_forms             0
per_no               0
str_vehname          0
makename             0
rollovername         0
per_typname          0
dtype: int64

In [19]:
# Dropping duplicates
data = data.drop_duplicates(subset=categorical)

In [20]:
data[features].isnull().sum()

statename          0
vehtype_t       1192
vinmake_t       1192
vinmodel_t      1213
bodystyl_t      1192
mfg_t           5128
cylndrs         1419
fuel_t         13809
carbtype_t     20855
tiresz_f_t     46690
drivetyp_t     13929
abs_t          27703
drl_t          29789
plntctry_t      2871
enghead_t      49451
str_vehname        0
makename           0
per_typname        0
vinyear         1192
doors           1192
wheels          1192
drivwhls        1192
displci         1192
shipweight      1192
msrp            1192
dispclmt       16670
vlvclndr        1192
vlvtotal        1192
ve_forms           0
per_no             0
dtype: int64

In [21]:
data.duplicated().sum()

0

In [22]:
#Dropping na values
data.dropna(subset=categorical,inplace=True)

In [23]:
data.isnull().sum()

caseyear         0
statename        0
st_case          0
veh_no           0
vinyear          0
vehtype_t        0
vinmake_t        0
vinmodel_t       0
bodystyl_t       0
doors            0
wheels           0
drivwhls         0
mfg_t            0
displci          0
cylndrs          0
fuel_t           0
carbtype_t       0
tiresz_f_t       0
shipweight       0
msrp             0
drivetyp_t       0
abs_t            0
drl_t            0
plntctry_t       0
dispclmt        10
enghead_t        0
vlvclndr         0
vlvtotal         0
ve_forms         0
per_no           0
str_vehname      0
makename         0
rollovername     0
per_typname      0
dtype: int64

In [25]:
data.describe()

Unnamed: 0,caseyear,st_case,veh_no,vinyear,doors,wheels,drivwhls,displci,shipweight,msrp,dispclmt,vlvclndr,vlvtotal,ve_forms,per_no
count,61197.0,61197.0,61197.0,61197.0,61197.0,61197.0,61197.0,61197.0,61197.0,61197.0,61187.0,61197.0,61197.0,61197.0,61197.0
mean,2016.007059,283544.21403,1.499355,2007.427439,3.705999,2.794189,1.922447,200.800758,3734.797703,27557.340964,3.294417,2.880223,16.016112,1.999575,1.000768
std,1.613512,158653.443807,1.270709,6.17204,0.710638,1.83557,1.500179,78.148986,1025.474632,12382.92573,1.277522,1.590126,8.965531,2.063411,0.039806
min,2014.0,10001.0,1.0,1985.0,0.0,0.0,0.0,55.0,0.0,0.0,0.9,0.0,0.0,1.0,1.0
25%,2015.0,130816.0,1.0,2003.0,4.0,0.0,0.0,144.0,3108.0,19680.0,2.4,2.0,16.0,1.0,1.0
50%,2016.0,280465.0,1.0,2008.0,4.0,4.0,2.0,191.0,3526.0,25165.0,3.1,4.0,16.0,2.0,1.0
75%,2017.0,420305.0,2.0,2013.0,4.0,4.0,4.0,232.0,4229.0,32595.0,3.8,4.0,24.0,2.0,1.0
max,2019.0,560131.0,63.0,2019.0,5.0,4.0,4.0,512.0,8687.0,441600.0,8.4,16.0,48.0,64.0,7.0


In [26]:
data['vehtype_t'].value_counts()

Passenger Car    33731
Truck            27466
Name: vehtype_t, dtype: int64

In [27]:
data['rollovername'].value_counts()

No Rollover                            51552
Rollover, Tripped by Object/Vehicle     8158
Rollover, Untripped                     1042
Rollover, Unknown Type                   445
Name: rollovername, dtype: int64

#  Part-3: Data Cleaning and Feature Engineering

### 3.1. Handling with Missing Values: Dropping and/or Imputation
> You may drop some rows and columns.<br>
> Imputation with statistical values (mean, median, mode)<br>
> Imputation with a keyword like missing, unknown, etc.<br>

In [28]:
# YOUR CODE HERE

### 3.2. Reducing cardinality of categorical features 
> Some categorical features have a cardinality which will cause to have lots of dummy variables. <br>
> Therefore, it is certainly needed to do some operations on that features. You may even consider dropping some of them. <br>
> Be aware of some risk of information loss caused by operations done in this part.<br>
> **Hint**: There are 37 different values for *bodystyl_t* feature which means vehicle body type. You can map the values into some higher level categories like [TRUCK, VAN, BUS, SPORT, STANDARD, OTHER, ...]. <br>
> You can also utilize [FARS User’s Manual](https://crashstats.nhtsa.dot.gov/Api/Public/ViewPublication/813023) for this step.

# From now on I am just reducing the cardinality I decided a threshold for each column that I reduced cardinality

# I replaced other categories with name 'OTHER'

In [29]:
# YOUR CODE HERE
data[categorical].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 61197 entries, 0 to 439290
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   statename    61197 non-null  object
 1   vehtype_t    61197 non-null  object
 2   vinmake_t    61197 non-null  object
 3   vinmodel_t   61197 non-null  object
 4   bodystyl_t   61197 non-null  object
 5   mfg_t        61197 non-null  object
 6   cylndrs      61197 non-null  object
 7   fuel_t       61197 non-null  object
 8   carbtype_t   61197 non-null  object
 9   tiresz_f_t   61197 non-null  object
 10  drivetyp_t   61197 non-null  object
 11  abs_t        61197 non-null  object
 12  drl_t        61197 non-null  object
 13  plntctry_t   61197 non-null  object
 14  enghead_t    61197 non-null  object
 15  str_vehname  61197 non-null  object
 16  makename     61197 non-null  object
 17  per_typname  61197 non-null  object
dtypes: object(18)
memory usage: 8.9+ MB


In [30]:
data[categorical].describe()

Unnamed: 0,statename,vehtype_t,vinmake_t,vinmodel_t,bodystyl_t,mfg_t,cylndrs,fuel_t,carbtype_t,tiresz_f_t,drivetyp_t,abs_t,drl_t,plntctry_t,enghead_t,str_vehname,makename,per_typname
count,61197,61197,61197,61197,61197,61197,61197,61197,61197,61197,61197,61197,61197,61197,61197,61197,61197,61197
unique,52,2,53,690,31,42,18,7,3,72,6,7,5,30,3,1,50,1
top,California,Passenger Car,FORD,ACCORD,SEDAN,General Motors,6,Gas,Fuel Injection,16R205,Front Wheel Drive,All Wheel Std,Standard,United States,Double Overhead Camshaft,Occupant of a Motor Vehicle,Ford,Driver of a Motor Vehicle In-Transport
freq,3698,33731,8452,1777,17918,10438,22951,53002,61190,4728,31897,51258,27001,32781,39711,61197,8451,61197


In [31]:
# Making it upper to stack our variables
data['cylndrs'] = data['cylndrs'].str.upper()

In [32]:
data['cylndrs'].value_counts().loc[lambda x : x >=8000]

6    22951
4    21530
8    10467
Name: cylndrs, dtype: int64

In [33]:
data['plntctry_t'] = data['plntctry_t'].str.upper()

In [34]:
## threshold is 1000 here
data['plntctry_t'].value_counts().loc[lambda x : x >=1000]

UNITED STATES                             32781
JAPAN                                      8488
CANADA                                     6953
MEXICO                                     4854
GERMANY                                    2756
KOREA, REPUBLIC OF                         2003
KOREA, DEMOCRATIC PEOPLE'S REPUBLIC OF     1370
Name: plntctry_t, dtype: int64

In [35]:
data['tiresz_f_t'] = data['tiresz_f_t'].str.upper()

In [36]:
data['tiresz_f_t'].value_counts().loc[lambda x : x >=2600]

16R205    4728
17R225    4518
16R215    4405
15R205    3886
15R195    3846
16R225    3758
17R215    3486
17R265    2695
17R245    2629
Name: tiresz_f_t, dtype: int64

In [37]:
data['vinmake_t'] = data['vinmake_t'].str.upper()

In [38]:
data['vinmake_t'].value_counts().loc[lambda x : x >=1800]

FORD         8452
CHEVROLET    7849
TOYOTA       5327
HONDA        4219
DODGE        4214
NISSAN       3567
CHRYSLER     2274
HYUNDAI      2031
KIA          1995
Name: vinmake_t, dtype: int64

In [39]:
data['mfg_t'] = data['mfg_t'].str.upper()

In [40]:
data['mfg_t'].value_counts().loc[lambda x : x >= 845]

GENERAL MOTORS        13559
FORD                  11837
TOYOTA                 6183
CHRYSLER GROUP LLC     5379
HONDA                  5185
NISSAN                 4346
FCA                    2195
HYUNDAI                2034
KIA                    2001
DAIMLER-CHRYSLER       1978
BMW                    1683
VOLKSWAGEN             1356
SUBARU                 1228
MITSUBISHI              888
Name: mfg_t, dtype: int64

In [41]:
data['bodystyl_t'] = data['bodystyl_t'].str.upper()

In [42]:
data['bodystyl_t'].value_counts().loc[lambda x : x >= 9000]

SEDAN                    23670
SPORT UTILITY VEHICLE    17760
Name: bodystyl_t, dtype: int64

In [43]:
data['statename'] = data['statename'].str.upper()

In [44]:
data['statename'].value_counts().loc[lambda x : x >=3300]

CALIFORNIA    3698
TEXAS         3521
Name: statename, dtype: int64

In [45]:
data['makename'] = data['makename'].str.upper()

In [46]:
data['makename'].value_counts().loc[lambda x : x >=1700]

FORD             8451
CHEVROLET        7860
TOYOTA           5224
DODGE            4967
HONDA            4217
CHRYSLER         2278
NISSAN/DATSUN    2080
HYUNDAI          2040
KIA              1995
Name: makename, dtype: int64

In [47]:
data['vinmodel_t'].value_counts().loc[lambda x : x >=900]

ACCORD       1777
SILVERADO    1171
EXPLORER     1090
FOCUS         937
Name: vinmodel_t, dtype: int64

In [48]:
data['bodystyl_t'] = data['bodystyl_t'].str.upper()

In [49]:
data['bodystyl_t'].value_counts().loc[lambda x : x >=1000]

SEDAN                    23670
SPORT UTILITY VEHICLE    17760
PICKUP                    5219
COUPE                     4420
HATCHBACK                 3338
VAN PASSENGER             3013
CONVERTIBLE               1339
Name: bodystyl_t, dtype: int64

In [50]:
top = [
    'SILVERADO', 'F150', 'MALIBU', 'FUSION', 'FOCUS', 'SIERRA','FOCUS','CIVIC'
]

In [51]:
top_mfg = ['GENERAL MOTORS','FORD','CHRYSLER GROUP LLC','FCA','HONDA','HYUNDAI','TOYOTA','KIA','VOLKSWAGEN']

In [52]:
top_makename = ['CHEVROLET','FORD','DODGE','HYUNDAI','TOYOTA']

In [53]:
top_state = ['TEXAS','FLORIDA','CALIFORNIA']

In [54]:
top_body = ['SEDAN','SPORT UTILITY VEHICLE','PICKUP']

In [55]:
top_vinmake = ['FORD','CHEVROLET','DODGE','HONDA','JEEP']

In [56]:
top_tire = ['17R225', '17R265', '17R215', '16R205', '17R245']

In [57]:
top_plnt  = ['UNITED STATES','CANADA','MEXICO','JAPAN','KOREA, REPUBLIC OF','GERMANY']

In [58]:
top_clyndr = ['4','6','8']

In [59]:
data.loc[(data.cylndrs != top_clyndr[0]) & (data.cylndrs != top_clyndr[1]) &
         (data.cylndrs != top_clyndr[2]), 'cylndrs'] = 'OTHER'

In [60]:
data.loc[(data.plntctry_t != top_plnt[0]) & (data.plntctry_t != top_plnt[1]) &
         (data.plntctry_t != top_plnt[2]) & (data.plntctry_t != top_plnt[3]) &
         (data.plntctry_t != top_plnt[4]) & (data.plntctry_t != top_plnt[5]),
         'plntctry_t'] = 'OTHER'

In [61]:
data.loc[(data.vinmake_t != top_vinmake[0])
         & (data.vinmake_t != top_vinmake[1]) &
         (data.vinmake_t != top_vinmake[2]) &
         (data.vinmake_t != top_vinmake[3]) &
         (data.vinmake_t != top_vinmake[4]),
         'vinmake_t'] = 'OTHER'

In [62]:
data.loc[(data.tiresz_f_t != top_tire[0])
         & (data.tiresz_f_t != top_tire[1]) & (data.tiresz_f_t != top_tire[2])
         & (data.tiresz_f_t != top_tire[3]) & (data.tiresz_f_t != top_tire[4]),
         'tiresz_f_t'] = 'OTHER'

In [63]:
data.loc[(data.vinmodel_t != top[0]) & (data.vinmodel_t != top[1]) &
         (data.vinmodel_t != top[2]) & (data.vinmodel_t != top[3]) &
         (data.vinmodel_t != top[4]) & (data.vinmodel_t != top[5]) &
         (data.vinmodel_t != top[6]), 'vinmodel_t'] = 'OTHER'

In [64]:
data.loc[(data.statename != top_state[0]) & (data.statename != top_state[1]) &
         (data.statename != top_state[2]), 'statename'] = 'OTHER'

In [65]:
data.loc[(data.makename != top_makename[0]) &
         (data.makename != top_makename[1]) &
         (data.makename != top_makename[2]) &
         (data.makename != top_makename[3]) &
         (data.makename != top_makename[4]), 'makename'] = 'OTHER'

In [66]:
data.loc[(data.mfg_t != top_mfg[0]) &
         (data.mfg_t != top_mfg[1]) &
         (data.mfg_t != top_mfg[2]) &
         (data.mfg_t != top_mfg[3]) &
         (data.mfg_t != top_mfg[4]) &
         (data.mfg_t != top_mfg[5]), 'mfg_t'] = 'OTHER'

In [67]:
data.loc[(data.bodystyl_t != top_body[0]) & (data.bodystyl_t != top_body[1]) &
         (data.bodystyl_t != top_body[2]), 'bodystyl_t'] = 'OTHER'

In [68]:
data['vinmodel_t'].value_counts()

OTHER        56587
SILVERADO     1171
FOCUS          937
F150           844
SIERRA         628
MALIBU         561
FUSION         469
Name: vinmodel_t, dtype: int64

In [69]:
data['statename'].value_counts()

OTHER         50681
CALIFORNIA     3698
TEXAS          3521
FLORIDA        3297
Name: statename, dtype: int64

In [70]:
data['makename'].value_counts()

OTHER        32655
FORD          8451
CHEVROLET     7860
TOYOTA        5224
DODGE         4967
HYUNDAI       2040
Name: makename, dtype: int64

In [71]:
data['mfg_t'].value_counts()

OTHER                 21008
GENERAL MOTORS        13559
FORD                  11837
CHRYSLER GROUP LLC     5379
HONDA                  5185
FCA                    2195
HYUNDAI                2034
Name: mfg_t, dtype: int64

In [72]:
data['bodystyl_t'].value_counts()

SEDAN                    23670
SPORT UTILITY VEHICLE    17760
OTHER                    14548
PICKUP                    5219
Name: bodystyl_t, dtype: int64

In [73]:
data[categorical].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 61197 entries, 0 to 439290
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   statename    61197 non-null  object
 1   vehtype_t    61197 non-null  object
 2   vinmake_t    61197 non-null  object
 3   vinmodel_t   61197 non-null  object
 4   bodystyl_t   61197 non-null  object
 5   mfg_t        61197 non-null  object
 6   cylndrs      61197 non-null  object
 7   fuel_t       61197 non-null  object
 8   carbtype_t   61197 non-null  object
 9   tiresz_f_t   61197 non-null  object
 10  drivetyp_t   61197 non-null  object
 11  abs_t        61197 non-null  object
 12  drl_t        61197 non-null  object
 13  plntctry_t   61197 non-null  object
 14  enghead_t    61197 non-null  object
 15  str_vehname  61197 non-null  object
 16  makename     61197 non-null  object
 17  per_typname  61197 non-null  object
dtypes: object(18)
memory usage: 8.9+ MB


In [74]:
# Dropping for numerics
data = data.dropna()

### BONUS: 3.3. Identifying and handling with outliers 
(not required, might help for linear and probabilistic models, be aware of the risk of information loss) 

In [75]:
# YOUR CODE HERE

> You may want to repeat some parts of EDA for the processed data at the end of this part, so, you may review your insights from the data to serve for the feature selection.

#  Part-4: Data Splitting and Transformation
> Firstly, split your data into training and test datasets before the data transformation. <br>
Then, scale numerical features and encode categorical features based on the training dataset.

To split the dataset, which of the following strategies is better?:

(1) Using accidents in 2019 for testing. <br>
(2) Getting random 30% of data for testing.

Which choice you selected will not matter for grading, but, how you choose it matters for the learning task. Think about which testing approach is more reasonable/meaningful regarding to create a business value. 

**Note:** If you have included year as a feature and you select (1), you should scale it properly for testing dataset. <br>

In [76]:
# Data splitting
y = data[target_name]
x = pd.get_dummies(data[features])

In [77]:
#YOUR CODE HERE (Selected 2)
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.30,random_state = 42)

In [78]:
# Scaling
scaler = MinMaxScaler(feature_range=(0,1))
scaled_x_train = scaler.fit_transform(x_train)
scaled_x_test = scaler.transform(x_test) 

In [79]:
scaled_x_train

array([[0.73529412, 0.8       , 1.        , ..., 0.        , 0.        ,
        0.        ],
       [0.26470588, 0.8       , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.47058824, 0.8       , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.5       , 0.8       , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.73529412, 0.8       , 1.        , ..., 1.        , 0.        ,
        0.        ],
       [0.79411765, 0.8       , 1.        , ..., 0.        , 0.        ,
        0.        ]])

#  Part-5: Feature Selection

> You are required to present **at most 30** useful features at the end of the project.<br>
> It is not required to have 30 features before training. You should eliminate some features before and after the data learning phase to report at the end. <br>
> So, you need to select one or multiple feature subsets to experiment ML models.<br>
> It is worthy to note that you are not encouraged to be minimalistic (like using 5 features) while the selection because it might sacrifice some learning performance. But maybe only 5 features can be useful, it is unknown for now and it will turn out with your selection approach.

You can either select features manually and/or in an automated way.
If you deploy an automated feature selection method, you **must explain how it works** in the report.

Some manual ways: Thresholding correlations, multicollinearity inspection, considering value distributions with respect to the target variable, finding non-informative features, ...

Some automated ways: 
- selecting the best k variables based on a statistical score 
- forward/backward/recursive feature elimination
- identifying important features using a decision-tree based model

In [80]:
#YOUR CODE HERE
# deciding the best 20 features using selecting K best here  
features = SelectKBest(score_func=chi2, k=20)
fit = features.fit(x,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(x.columns)
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']  
print(featureScores.nlargest(20,'Score')) 

                                 Specs         Score
6                                 msrp  79655.537166
4                              displci   3073.921533
10                            ve_forms   2500.517103
9                             vlvtotal   1463.021152
8                             vlvclndr    518.529034
67   drivetyp_t_Rear Wheel Drive w/4x4    213.783851
34    bodystyl_t_SPORT UTILITY VEHICLE    209.940910
76                 drl_t_Not Available    199.576826
5                           shipweight    186.011123
33                    bodystyl_t_SEDAN    151.618789
64        drivetyp_t_Front Wheel Drive    127.437886
42                           cylndrs_4    102.454241
89  enghead_t_Single Overhead Camshaft     96.955361
66         drivetyp_t_Rear Wheel Drive     95.258048
13                   statename_FLORIDA     89.846792
77                      drl_t_Optional     83.488382
49      fuel_t_Electric and Gas Hybrid     66.141822
58                   tiresz_f_t_17R225     65.

In [81]:
# Modifiying our train and test sets deciding on the features that we have chosen
x_train = x_train.filter(items=[
    'msrp', 'displci', 've_forms', 'vlvtotal', 'vlvclndr',
    'drivetyp_t_Rear Wheel Drive w/4x4', 'bodystyl_t_SPORT UTILITY VEHICLE',
    'drl_t_Not Available', 'shipweight', 'bodystyl_t_SEDAN',
    'drivetyp_t_Front Wheel Drive', 'cylndrs_4',
    'enghead_t_Single Overhead Camshaft', 'drivetyp_t_Rear Wheel Drive',
    'statename_FLORIDA', 'drl_t_Optional', 'drl_t_Optional',
    'fuel_t_Electric and Gas Hybrid', 'tiresz_f_t_17R225', 'vehtype_t_Truck', 'plntctry_t_CANADA'
])

In [82]:
x_test = x_test.filter(items=[
    'msrp', 'displci', 've_forms', 'vlvtotal', 'vlvclndr',
    'drivetyp_t_Rear Wheel Drive w/4x4', 'bodystyl_t_SPORT UTILITY VEHICLE',
    'drl_t_Not Available', 'shipweight', 'bodystyl_t_SEDAN',
    'drivetyp_t_Front Wheel Drive', 'cylndrs_4',
    'enghead_t_Single Overhead Camshaft', 'drivetyp_t_Rear Wheel Drive',
    'statename_FLORIDA', 'drl_t_Optional', 'drl_t_Optional',
    'fuel_t_Electric and Gas Hybrid', 'tiresz_f_t_17R225', 'vehtype_t_Truck', 'plntctry_t_CANADA'
])

In [83]:
scaled_x_train = scaler.fit_transform(x_train)

In [84]:
scaled_x_test =  scaler.transform(x_test)

#  Part-6: Training and Performance Evaluation

> You need to determine a score which is proper for the learning task and the distribution of target variable to benchmark the models. Note that even a model is successful for a target metric, it might performs poorly in terms of other performance scores. So, you should observe multiple metrics. <br>
> **Hint:** Think about which type of error made in predictions is more harmful from a car insurance business perspective: False negative  or false positive predictions? <br>

> According EDA done in the previous parts, you might need to adress any imbalanced dataset problem. (Not expected to use advanced methods or to do an extensive experimentation) <br>
**Hint**: Inspect model parameters which are related to class weightening or loss weightening.

> You have three Naive-Bayes-based model options for benchmarking with other types of models: ```GaussianNB```, ```CategoricalNB```, and ```ComplementNB```.
Read the descriptions provided in Sklearn library or search them on the Internet, and choose one of them regarding which sounds more reasonable considering the dataset characteristics and the learning task.<br>
**BONUS**: Experiment a simple prediction technique ensembling multiple Naive-Bayes models. (not required)

In [85]:
#YOUR CODE HERE
model = CategoricalNB()
model_1 = GaussianNB()
model_2 = ComplementNB()

In [86]:
# fitting
model.fit(scaled_x_train,y_train)
model_1.fit(scaled_x_train,y_train)
model_2.fit(scaled_x_train,y_train)

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


ComplementNB()

In [87]:
# Predicting
y_predict=model.predict(scaled_x_test)

In [88]:
y_predict_1=model_1.predict(scaled_x_test)

In [89]:
y_predict_2=model_2.predict(scaled_x_test)

In [90]:
scaled_x_test

array([[0.21108641, 0.70678337, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.09264186, 0.45295405, 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.09220453, 0.28008753, 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.09909253, 0.31947484, 0.01587302, ..., 0.        , 0.        ,
        0.        ],
       [0.05300849, 0.0940919 , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.08179963, 0.06564551, 0.01587302, ..., 0.        , 1.        ,
        0.        ]])

In [91]:
# Evaluation
print("Scores for CategoricalNB")
print("Recall score : ", recall_score(y_test, y_predict, average='micro'))
print("Precision score : ",precision_score(y_test, y_predict , average='micro'))
print("F1 score : ",f1_score(y_test, y_predict , average='micro'))

Scores for CategoricalNB
Recall score :  0.8415318407147138
Precision score :  0.8415318407147138
F1 score :  0.8415318407147138


In [92]:
# Evaluation
print("Scores for GaussianNB")
print("Recall score : ", recall_score(y_test, y_predict_1, average='micro'))
print("Precision score : ",precision_score(y_test, y_predict_1 , average='micro'))
print("F1 score : ",f1_score(y_test, y_predict_1 , average='micro'))

Scores for GaussianNB
Recall score :  0.15656152966170944
Precision score :  0.15656152966170944
F1 score :  0.15656152966170944


In [93]:
# Evaluation
print("Scores for ComplmentNB")
print("Recall score : ", recall_score(y_test, y_predict_2, average='micro'))
print("Precision score : ",precision_score(y_test, y_predict_2 , average='micro'))
print("F1 score : ",f1_score(y_test, y_predict_2 , average='micro'))

Scores for ComplmentNB
Recall score :  0.5622923135588603
Precision score :  0.5622923135588603
F1 score :  0.5622923135588603


> Plot training and testing performance of all models with a bar chart.

In [None]:
#YOUR CODE HERE

# Part-7: Interpretation
7.1) Visualize rollover risk distributions of the models for training and testing datasets separately. Report the visual results.<br>
**Hint**: Use ```predict_proba``` function of the fitted models.

In [None]:
# YOUR CODE HERE


7.2) Visualize year-by-year rollover risk distributions by the predictions of the best model you determined. Interpret the results in the report.

7.3) Determine the features which useful to explain the car rollover phenomenon.<br>
**Reminder**: Determine **at most 30** features. It is not saying that report 30 features. Maybe, after the top 20 features, others have much less importance. You should report a short-listed features.

You can follow several strategies:
- Using ```coef_``` attribute of linear models
- Using ```feature_log_prob_``` attribute of probabilistic models
- Using ```feature_importances_``` attribute of the decision-tree based models
- Using ```permutation_importance``` function of sklearn (**REQUIRED** to explain how it works in the report)
    
You can either rely on the best model or all models to determine the good features.
For example, if you have multiple successful models, you can determine the common important features for them. 
    
After the creating a short-listed features, you can ensure whether they are enough to explain the phenemona via observing performances of re-trained model(s) using only these features.

In [None]:
# YOUR CODE HERE