# Group Project - KSI data - Classification problem

***Target Column***
ACCLASS<br>
Must be transformed into binary (0, 1):<br>
'Fatal' --> 1, <br>
'Non-Fatal Injury' --> 0, <br>
'Property Damage Only' --> 0, <br>
***This column has 5 NaN values; we can drop those rows.***



The following columns need their missing values filled:<br>
'PEDESTRIAN', 'CYCLIST', 'AUTOMOBILE', 'MOTORCYCLE', 'TRUCK',
'TRSN_CITY_VEH', 'EMERG_VEH', 'PASSENGER', 'SPEEDING', 'AG_DRIV',
'REDLIGHT', 'ALCOHOL', 'DISABILITY'<br>
Fill NaN as 'No', then transform to 0 and 1<br>
(by default these columns contain only 'Yes' or NaN)<br>
ROAD_CLASS: fill with the most frequent value<br>
DISTRICT: fill with the most frequent value
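
The fill rules above can be sketched on a small toy frame. This is only an illustration: the rows are made up, `toy` and `flag_columns` are hypothetical names, and comparing against `'Yes'` is one possible way to get 0/1 values in a single step instead of filling `'No'` first.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the KSI data (values are made up for illustration)
toy = pd.DataFrame({
    'SPEEDING':   ['Yes', np.nan, np.nan, 'Yes'],
    'ALCOHOL':    [np.nan, np.nan, 'Yes', np.nan],
    'ROAD_CLASS': ['Major Arterial', np.nan, 'Major Arterial', 'Local'],
})

flag_columns = ['SPEEDING', 'ALCOHOL']

# 'Yes' -> 1, anything else (here only NaN) -> 0
toy[flag_columns] = toy[flag_columns].eq('Yes').astype(int)

# Fill a categorical column with its most frequent value
toy['ROAD_CLASS'] = toy['ROAD_CLASS'].fillna(toy['ROAD_CLASS'].mode()[0])

print(toy)
```

Because a missing flag is treated as "No", the comparison gives the same result as `fillna('No')` followed by a Yes/No to 1/0 mapping.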

Questionable column:<br>
CYCCOND: multiple categories; fill NaN with the most frequent value?


The following columns from the dataset are unnecessary:<br>
ObjectId, NEIGHBOURHOOD_158, NEIGHBOURHOOD_140, CYCLISTYPE (too many categories and too many NaN values),<br>
PEDCOND (too many categories and too many NaN values), PEDACT (too many categories and too many NaN values),<br>
PEDTYPE (too many categories and too many NaN values), DRIVCOND (includes 'Other', so the value is not precise), DRIVACT (includes 'Other', so the value is not precise), MANOEUVER (includes 'Other', so the value is not precise)<br>
FATAL_NO, INVTYPE, DATE, YEAR, ACCNUM, INDEX_, STREET1, STREET2, OFFSET, X, Y, INJURY



In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
dataset_path = r'dataset\KSI.csv'

df = pd.read_csv(dataset_path)

In [3]:
df.head(5)

Unnamed: 0,X,Y,INDEX_,ACCNUM,YEAR,DATE,TIME,STREET1,STREET2,OFFSET,...,AG_DRIV,REDLIGHT,ALCOHOL,DISABILITY,HOOD_158,NEIGHBOURHOOD_158,HOOD_140,NEIGHBOURHOOD_140,DIVISION,ObjectId
0,-8844611.0,5412414.0,3387730,892658.0,2006,2006/03/11 05:00:00+00,852,BLOOR ST W,DUNDAS ST W,,...,Yes,,,,88,High Park North,88,High Park North (88),D11,1
1,-8844611.0,5412414.0,3387731,892658.0,2006,2006/03/11 05:00:00+00,852,BLOOR ST W,DUNDAS ST W,,...,Yes,,,,88,High Park North,88,High Park North (88),D11,2
2,-8816480.0,5434843.0,3388101,892810.0,2006,2006/03/11 05:00:00+00,915,MORNINGSIDE AVE,SHEPPARD AVE E,,...,Yes,Yes,,,146,Malvern East,132,Malvern (132),D42,3
3,-8829728.0,5419071.0,3389067,893184.0,2006,2006/01/01 05:00:00+00,236,WOODBINE AVE,O CONNOR DR,,...,Yes,,Yes,,60,Woodbine-Lumsden,60,Woodbine-Lumsden (60),D55,4
4,-8816480.0,5434843.0,3388102,892810.0,2006,2006/03/11 05:00:00+00,915,MORNINGSIDE AVE,SHEPPARD AVE E,,...,Yes,Yes,,,146,Malvern East,132,Malvern (132),D42,5


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18194 entries, 0 to 18193
Data columns (total 57 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   X                  18194 non-null  float64
 1   Y                  18194 non-null  float64
 2   INDEX_             18194 non-null  int64  
 3   ACCNUM             13264 non-null  float64
 4   YEAR               18194 non-null  int64  
 5   DATE               18194 non-null  object 
 6   TIME               18194 non-null  int64  
 7   STREET1            18194 non-null  object 
 8   STREET2            16510 non-null  object 
 9   OFFSET             3402 non-null   object 
 10  ROAD_CLASS         17818 non-null  object 
 11  DISTRICT           18089 non-null  object 
 12  WARDNUM            17332 non-null  float64
 13  LATITUDE           18194 non-null  float64
 14  LONGITUDE          18194 non-null  float64
 15  LOCCOORD           18099 non-null  object 
 16  ACCLOC             127

In [5]:
df.columns.values

array(['X', 'Y', 'INDEX_', 'ACCNUM', 'YEAR', 'DATE', 'TIME', 'STREET1',
       'STREET2', 'OFFSET', 'ROAD_CLASS', 'DISTRICT', 'WARDNUM',
       'LATITUDE', 'LONGITUDE', 'LOCCOORD', 'ACCLOC', 'TRAFFCTL',
       'VISIBILITY', 'LIGHT', 'RDSFCOND', 'ACCLASS', 'IMPACTYPE',
       'INVTYPE', 'INVAGE', 'INJURY', 'FATAL_NO', 'INITDIR', 'VEHTYPE',
       'MANOEUVER', 'DRIVACT', 'DRIVCOND', 'PEDTYPE', 'PEDACT', 'PEDCOND',
       'CYCLISTYPE', 'CYCACT', 'CYCCOND', 'PEDESTRIAN', 'CYCLIST',
       'AUTOMOBILE', 'MOTORCYCLE', 'TRUCK', 'TRSN_CITY_VEH', 'EMERG_VEH',
       'PASSENGER', 'SPEEDING', 'AG_DRIV', 'REDLIGHT', 'ALCOHOL',
       'DISABILITY', 'HOOD_158', 'NEIGHBOURHOOD_158', 'HOOD_140',
       'NEIGHBOURHOOD_140', 'DIVISION', 'ObjectId'], dtype=object)

# Exploration <br>
Use the code below to display the value counts and the null count of a categorical column

In [6]:
print(df['DRIVCOND'].value_counts())
print(df['DRIVCOND'].isnull().sum())

DRIVCOND
Normal                                5849
Inattentive                           1581
Unknown                               1100
Medical or Physical Disability         177
Had Been Drinking                      163
Ability Impaired, Alcohol Over .08     126
Ability Impaired, Alcohol              121
Other                                   52
Fatigue                                 51
Ability Impaired, Drugs                 20
Name: count, dtype: int64
8954
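
To avoid repeating the two `print` calls for every column, the same check could be wrapped in a small helper. This is a sketch under assumptions: `summarize_columns` is a hypothetical name, and the toy frame below only imitates the shape of the KSI columns.

```python
import numpy as np
import pandas as pd

def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return null count, null share and number of distinct values per column."""
    summary = pd.DataFrame({
        'null_count': frame.isnull().sum(),
        'null_share': frame.isnull().mean().round(3),
        'n_unique': frame.nunique(dropna=True),
    })
    # Columns with the largest share of missing values come first
    return summary.sort_values('null_share', ascending=False)

# Toy data (made up) with one sparse column and one complete column
toy = pd.DataFrame({
    'DRIVCOND': ['Normal', np.nan, 'Inattentive', np.nan],
    'LIGHT': ['Daylight', 'Dark', 'Daylight', 'Daylight'],
})
report = summarize_columns(toy)
print(report)
```

Calling `summarize_columns(df)` on the full dataframe would give one table from which columns with many missing values or many categories can be picked out.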


# Determine columns

In [6]:
#Since 5 rows are missing target values (ACCLASS), we will remove them
df = df.dropna(subset=['ACCLASS'])

#We will remove the columns that are not useful for our model
meaningless_columns = ['INDEX_', 'ACCNUM', 'YEAR', 'DATE', 'TIME', 'STREET1',
                       'STREET2', 'OFFSET', 'FATAL_NO', 'NEIGHBOURHOOD_158', 'NEIGHBOURHOOD_140',
                       'ObjectId', 'WARDNUM', 'DIVISION']

too_much_missing = ['PEDTYPE','CYCACT', 'CYCLISTYPE', 'PEDACT', 'CYCCOND', 'MANOEUVER']

#We will remove the columns with duplicate information
# X and Y encode the same location as LONGITUDE and LATITUDE
# VEHTYPE, PEDCOND, DRIVCOND and IMPACTYPE repeat information that other columns already carry
# (for example the Yes/NaN flag columns); LOCCOORD is dropped here as well
duplicated_columns = ['X', 'Y', 'VEHTYPE', 'PEDCOND', 'DRIVCOND', 'IMPACTYPE','LOCCOORD']


#Columns whose NaN values need to be filled
binary_map = {np.nan: 'No'}
fill_nan_columns = ['PEDESTRIAN', 'CYCLIST', 'AUTOMOBILE', 'MOTORCYCLE', 'TRUCK',
                    'TRSN_CITY_VEH', 'EMERG_VEH', 'PASSENGER', 'SPEEDING', 'AG_DRIV',
                    'REDLIGHT', 'ALCOHOL', 'DISABILITY']

#columns which contain categorical data
categorical_columns = ['LIGHT', 'INVAGE', 'RDSFCOND', 
                       'DISTRICT', 'INITDIR', 'ROAD_CLASS', 'TRAFFCTL', 
                       'ACCLOC', 'VISIBILITY','INVTYPE']



In [7]:
# Create a copy of the dataframe
df_origin = df.copy()

In [8]:
#Drop meaningless columns, duplicated columns and columns with too many missing values
df = df.drop(columns=meaningless_columns)
df = df.drop(columns=duplicated_columns)
df = df.drop(columns=too_much_missing)


# Dealing with columns that contain many categories

In [9]:
#Simplify the categorical data
# LIGHT
# Daylight                10385
# Dark                     3687
# Dark, artificial         3300
# Dusk                      240
# Dusk, artificial          219
# Daylight, artificial      141
# Dawn                      110
# Dawn, artificial          101
# Other                       6

#We will simplify the LIGHT column to Daylight, Dark, Dusk, Dawn, Other
light_map = {
    'Daylight': 'Daylight', 
    'Dark': 'Dark', 
    'Dark, artificial': 'Dark', 
    'Dusk': 'Dusk', 
    'Dusk, artificial': 'Dusk',         
    'Daylight, artificial': 'Daylight', 
    'Dawn': 'Dawn', 
    'Dawn, artificial': 'Dawn', 
    'Other': 'Other'
    }

df['LIGHT'] = df['LIGHT'].map(light_map)

# IMPACTYPE
# Pedestrian Collisions     7293
# Turning Movement          2792
# Cyclist Collisions        1795
# Rear End                  1746
# SMV Other                 1457
# Angle                     1283
# Approaching                928
# Sideswipe                  506
# Other                      195
# SMV Unattended Vehicle     190
# Name: count, dtype: int64
# Null Values: 4

#The PEDESTRIAN and CYCLIST flags already represent the IMPACTYPE information, so IMPACTYPE is dropped (see duplicated_columns)

# INVAGE
# unknown     2609
# 20 to 24    1710
# 25 to 29    1638
# 30 to 34    1384
# 35 to 39    1311
# 50 to 54    1302
# 40 to 44    1274
# 45 to 49    1239
# 55 to 59    1098
# 60 to 64     877
# 15 to 19     852
# 65 to 69     681
# 70 to 74     529
# 75 to 79     434
# 80 to 84     336
# 10 to 14     249
# 85 to 89     212
# 5 to 9       199
# 0 to 4       177
# 90 to 94      63
# Over 95       15
# Name: count, dtype: int64
# Null Values: 0

#We will simplify the INVAGE column to 0 to 20, 20 to 40, 40 to 60, 60 to 80, over 80
invage_map = {
    'unknown': 'unknown',
    '20 to 24': '20 to 40',
    '25 to 29': '20 to 40',
    '30 to 34': '20 to 40',
    '35 to 39': '20 to 40',
    '50 to 54': '40 to 60',
    '40 to 44': '40 to 60',
    '45 to 49': '40 to 60',
    '55 to 59': '40 to 60',
    '60 to 64': '60 to 80',
    '15 to 19': '0 to 20',
    '65 to 69': '60 to 80',
    '70 to 74': '60 to 80',
    '75 to 79': '60 to 80',
    '80 to 84': 'over 80',
    '10 to 14': '0 to 20',
    '85 to 89': 'over 80',
    '5 to 9': '0 to 20',
    '0 to 4': '0 to 20',
    '90 to 94': 'over 80',
    'Over 95': 'over 80'
    }

df['INVAGE'] = df['INVAGE'].map(invage_map)

# RDSFCOND
# Dry                     14594
# Wet                      3021
# Loose Snow                169
# Other                     145
# Slush                     102
# Ice                        77
# Packed Snow                44
# Loose Sand or Gravel       11
# Spilled liquid              1
# Name: count, dtype: int64
# Null Values: 25

#We will simplify the RDSFCOND column to Dry, Wet, Snow, Ice, Other
rdsfcond_map = {
    'Dry': 'Dry',
    'Wet': 'Wet',
    'Loose Snow': 'Snow',
    'Other': 'Other',
    'Slush': 'Snow',
    'Ice': 'Ice',
    'Packed Snow': 'Snow',
    'Loose Sand or Gravel': 'Other',
    'Spilled liquid': 'Other'
    }

df['RDSFCOND'] = df['RDSFCOND'].map(rdsfcond_map)

# fill the missing values with other
df['RDSFCOND'] = df['RDSFCOND'].fillna('Other')

# DISTRICT
# Toronto and East York    6125
# Etobicoke York           4207
# Scarborough              4111
# North York               3637
# Toronto East York           4
# Name: count, dtype: int64
# Null Values: 105

# DISTRICT column has 105 missing values, we will fill them with the most frequent value
df['DISTRICT'] = df['DISTRICT'].fillna(df['DISTRICT'].mode()[0])

# DRIVACT
# Driving Properly                4221
# Failed to Yield Right of Way    1541
# Lost control                     975
# Improper Turn                    573
# Other                            504
# Disobeyed Traffic Control        475
# Following too Close              251
# Exceeding Speed Limit            246
# Speed too Fast For Condition     208
# Improper Lane Change             122
# Improper Passing                 112
# Wrong Way on One Way Road          9
# Speed too Slow                     4
# Name: count, dtype: int64
# Null Values: 8948

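#REDLIGHT, SPEEDING, AG_DRIV, ALCOHOL and DISABILITY already represent the DRIVACT information, so DRIVACT can be dropped
#Note: DRIVACT is not in any of the drop lists above, so it is still present in df at this point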

# INITDIR
# East       3259
# West       3197
# South      3106
# North      3066
# Unknown     510
# Name: count, dtype: int64
# Null Values: 5051

# INITDIR column has 5051 missing values, we will fill them with Unknown
df['INITDIR'] = df['INITDIR'].fillna('Unknown')

# ROAD_CLASS
# Major Arterial         12951
# Minor Arterial          2840
# Collector                996
# Local                    841
# Expressway               132
# Other                     25
# Laneway                   11
# Expressway Ramp            9
# Pending                    7
# Major Arterial Ramp        1
# Name: count, dtype: int64
# Null Values: 376

# Simplify the ROAD_CLASS column to Major Arterial, Minor Arterial, Collector, Local, Other
road_class_map = {
    'Major Arterial': 'Major Arterial',
    'Minor Arterial': 'Minor Arterial',
    'Collector': 'Collector',
    'Local': 'Local',
    'Expressway': 'Other',
    'Other': 'Other',
    'Laneway': 'Other',
    'Expressway Ramp': 'Other',
    'Pending': 'Other',
    'Major Arterial Ramp': 'Other'
    }

df['ROAD_CLASS'] = df['ROAD_CLASS'].map(road_class_map)

# Fill the missing values with Other
df['ROAD_CLASS'] = df['ROAD_CLASS'].fillna('Other')

# TRAFFCTL
# No Control              8788
# Traffic Signal          7635
# Stop Sign               1380
# Pedestrian Crossover     198
# Traffic Controller       108
# Yield Sign                21
# Streetcar (Stop for)      16
# Traffic Gate               5
# School Guard               2
# Police Control             2
# Name: count, dtype: int64
# Null Values: 34

# Simplify the TRAFFCTL column to No Control, Traffic Signal, Stop Sign, Other
traffctl_map = {
    'No Control': 'No Control',
    'Traffic Signal': 'Traffic Signal',
    'Stop Sign': 'Stop Sign',
    'Pedestrian Crossover': 'Other',
    'Traffic Controller': 'Other',
    'Yield Sign': 'Other',
    'Streetcar (Stop for)': 'Other',
    'Traffic Gate': 'Other',
    'School Guard': 'Other',
    'Police Control': 'Other'
    }

df['TRAFFCTL'] = df['TRAFFCTL'].map(traffctl_map)

# Fill the missing values with Other
df['TRAFFCTL'] = df['TRAFFCTL'].fillna('Other')

# ACCLOC
# At Intersection          8689
# Non Intersection         2420
# Intersection Related     1200
# At/Near Private Drive     379
# Overpass or Bridge         17
# Laneway                    14
# Private Driveway           13
# Underpass or Tunnel         6
# Trail                       1
# Name: count, dtype: int64
# Null Values: 5450

# Simplify the ACCLOC column to At Intersection, Non Intersection, Other
accloc_map = {
    'At Intersection': 'At Intersection',
    'Non Intersection': 'Non Intersection',
    'Intersection Related': 'At Intersection',
    'At/Near Private Drive': 'Other',
    'Overpass or Bridge': 'Other',
    'Laneway': 'Other',
    'Private Driveway': 'Other',
    'Underpass or Tunnel': 'Other',
    'Trail': 'Other'
    }   

df['ACCLOC'] = df['ACCLOC'].map(accloc_map)

# Fill the missing values with Other
df['ACCLOC'] = df['ACCLOC'].fillna('Other')

# VISIBILITY
# Clear                     15714
# Rain                       1879
# Snow                        351
# Other                        97
# Fog, Mist, Smoke, Dust       50
# Freezing Rain                47
# Drifting Snow                21
# Strong wind                  10
# Name: count, dtype: int64
# Null Values: 20

# Simplify the VISIBILITY column to Clear, Rain, Snow, Other

visibility_map = {
    'Clear': 'Clear',
    'Rain': 'Rain',
    'Snow': 'Snow',
    'Other': 'Other',
    'Fog, Mist, Smoke, Dust': 'Other',
    'Freezing Rain': 'Other',
    'Drifting Snow': 'Other',
    'Strong wind': 'Other'
    }

df['VISIBILITY'] = df['VISIBILITY'].map(visibility_map)

# Fill the missing values with Other
df['VISIBILITY'] = df['VISIBILITY'].fillna('Other')

# INVTYPE
# Driver                  8274
# Pedestrian              3110
# Passenger               2766
# Vehicle Owner           1637
# Cyclist                  784
# Motorcycle Driver        697
# Truck Driver             346
# Other Property Owner     257
# Other                    186
# Motorcycle Passenger      39
# Moped Driver              30
# Driver - Not Hit          17
# Wheelchair                17
# In-Line Skater             5
# Cyclist Passenger          3
# Trailer Owner              2
# Pedestrian - Not Hit       1
# Witness                    1
# Moped Passenger            1
# Name: count, dtype: int64
# Null Values: 16

# Simplify the INVTYPE column to Driver, Pedestrian, Passenger, Vehicle Owner, Cyclist, Other

invtype_map = {
    'Driver': 'Driver',
    'Pedestrian': 'Pedestrian',
    'Passenger': 'Passenger',
    'Vehicle Owner': 'Vehicle Owner',
    'Cyclist': 'Cyclist',
    'Motorcycle Driver': 'Driver',
    'Truck Driver': 'Driver',
    'Other Property Owner': 'Other',
    'Other': 'Other',
    'Motorcycle Passenger': 'Passenger',
    'Moped Driver': 'Other',
    'Driver - Not Hit': 'Other',
    'Wheelchair': 'Other',
    'In-Line Skater': 'Other',
    'Cyclist Passenger': 'Passenger',
    'Trailer Owner': 'Vehicle Owner',
    'Pedestrian - Not Hit': 'Other',
    'Witness': 'Other',
    'Moped Passenger': 'Passenger'
    }

df['INVTYPE'] = df['INVTYPE'].map(invtype_map)

# Fill the missing values with Other
df['INVTYPE'] = df['INVTYPE'].fillna('Other')

# MANOEUVER
# Going Ahead                            6265
# Turning Left                           1786
# Stopped                                 620
# Turning Right                           476
# Slowing or Stopping                     282
# Changing Lanes                          216
# Parked                                  183
# Other                                   181
# Reversing                               122
# Unknown                                 122
# Making U Turn                           106
# Overtaking                               91
# Pulling Away from Shoulder or Curb       40
# Pulling Onto Shoulder or towardCurb      18
# Merging                                  18
# Disabled                                  4
# Name: count, dtype: int64
# Null Values: 7659

# Too difficult to simplify and too many missing values, so MANOEUVER is dropped (see too_much_missing)
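
One detail of the dictionary approach used in this cell: `Series.map` returns NaN for any value that is not a key of the dictionary, which is why each map lists every category. As an alternative, rare categories could be collapsed by frequency. The sketch below is hypothetical (`collapse_rare` and its `min_count` threshold are not part of the notebook), and it groups by count rather than by meaning, so it would not merge 'Slush' into 'Snow' the way the hand-written maps do.

```python
import pandas as pd

def collapse_rare(series: pd.Series, min_count: int, other: str = 'Other') -> pd.Series:
    """Replace categories seen fewer than min_count times with `other`; NaN stays NaN."""
    counts = series.value_counts()
    keep = counts[counts >= min_count].index
    # Keep frequent values and missing values, overwrite everything else
    return series.where(series.isin(keep) | series.isna(), other)

# Toy road-surface values (made up)
toy = pd.Series(['Dry', 'Dry', 'Dry', 'Wet', 'Wet', 'Ice', 'Slush', None])

# A partial dictionary silently turns 'Ice' and 'Slush' into NaN
mapped = toy.map({'Dry': 'Dry', 'Wet': 'Wet'})

# The frequency-based helper sends them to 'Other' instead
collapsed = collapse_rare(toy, min_count=2)
print(collapsed.value_counts(dropna=False))
```

Missing values are left untouched by the helper, so a separate `fillna` is still needed afterwards.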



In [10]:
df.columns.values

array(['ROAD_CLASS', 'DISTRICT', 'LATITUDE', 'LONGITUDE', 'ACCLOC',
       'TRAFFCTL', 'VISIBILITY', 'LIGHT', 'RDSFCOND', 'ACCLASS',
       'INVTYPE', 'INVAGE', 'INJURY', 'INITDIR', 'DRIVACT', 'PEDESTRIAN',
       'CYCLIST', 'AUTOMOBILE', 'MOTORCYCLE', 'TRUCK', 'TRSN_CITY_VEH',
       'EMERG_VEH', 'PASSENGER', 'SPEEDING', 'AG_DRIV', 'REDLIGHT',
       'ALCOHOL', 'DISABILITY', 'HOOD_158', 'HOOD_140'], dtype=object)

In [14]:
for item in df.columns.values:
    print(df[item].value_counts())
    print('Null Values:', df[item].isnull().sum())

ROAD_CLASS
Major Arterial    12951
Minor Arterial     2840
Collector           996
Local               841
Other               561
Name: count, dtype: int64
Null Values: 0
DISTRICT
Toronto and East York    6230
Etobicoke York           4207
Scarborough              4111
North York               3637
Toronto East York           4
Name: count, dtype: int64
Null Values: 0
LATITUDE
43.740245    48
43.650845    35
43.682345    34
43.656345    30
43.654945    29
             ..
43.615945     1
43.632745     1
43.682849     1
43.749804     1
43.661425     1
Name: count, Length: 4498, dtype: int64
Null Values: 0
LONGITUDE
-79.251190    36
-79.386590    32
-79.327990    30
-79.383790    26
-79.420090    25
              ..
-79.587390     1
-79.349437     1
-79.536365     1
-79.610454     1
-79.445216     1
Name: count, Length: 4935, dtype: int64
Null Values: 0
LOCCOORD
Intersection                           11963
Mid-Block                               6110
Mid-Block (Abnormal)                 

In [10]:
#Columns we can try including or excluding
try_columns = ['DIVISION', 'LOCCOORD', 'INJURY']

In [11]:
#Find remaining columns
#Note: fill_zero_columns is not defined in any cell shown here; define it before running this cell
remaining_columns = list(set(df.columns.values) - set(meaningless_columns) - set(too_much_missing) - set(duplicated_columns) - set(fill_zero_columns) - set(fill_nan_columns) - set(categorical_columns) - set(try_columns))
print(remaining_columns)

['LONGITUDE', 'LATITUDE', 'HOOD_158', 'HOOD_140', 'ACCLASS']


In [18]:
#Check remaining columns for missing values
df[remaining_columns].isnull().sum()

HOOD_140     0
ACCLASS      0
LATITUDE     0
LONGITUDE    0
HOOD_158     0
dtype: int64

In [29]:
#Make a copy of the dataframe
df_copy = df.copy()

#Drop meaningless columns (errors='ignore' skips columns that were already removed from df)
df_copy = df_copy.drop(columns=meaningless_columns, errors='ignore')
#Drop columns with too many missing values
df_copy = df_copy.drop(columns=too_much_missing, errors='ignore')
#Drop duplicated columns
df_copy = df_copy.drop(columns=duplicated_columns, errors='ignore')
#Fill missing values with 'No'
df_copy[fill_nan_columns] = df_copy[fill_nan_columns].fillna(value='No')
#Fill missing values with 0
#(fill_zero_columns is not defined in any cell shown here; define it before running this line)
df_copy[fill_zero_columns] = df_copy[fill_zero_columns].fillna(value=0)

#Drop the target column from the features
df_copy = df_copy.drop(columns=['ACCLASS'])

#Set y as target variable
y = df['ACCLASS']

In [33]:
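#train test split
#Note: test_size=0.8 holds out 80% of the rows for testing and trains on only 20%
#(see the shapes printed below); test_size=0.2 is the more common choice
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_copy, y, test_size=0.8, random_state=58)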

In [34]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(3637, 34) (14552, 34) (3637,) (14552,)
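
Since the target is imbalanced, a stratified split keeps the share of fatal cases the same in the training and test parts. The following is a minimal sketch on made-up data, assuming an 80/20 train/test split is what is wanted; `X_toy` and `y_toy` are hypothetical stand-ins for `df_copy` and `y`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 4 of them 'Fatal' (20% minority class); values are made up
X_toy = pd.DataFrame({'feature': range(20)})
y_toy = pd.Series(['Fatal'] * 4 + ['Non-Fatal Injury'] * 16)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,    # 20% of the rows are held out for testing
    stratify=y_toy,   # keep the class ratio the same in both parts
    random_state=58,
)

print(X_tr.shape, X_te.shape)
print(y_te.value_counts())
```

On the real data the same call would be `train_test_split(df_copy, y, test_size=0.2, stratify=y, random_state=58)`.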


In [31]:
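#Merge 'Property Damage Only' into 'Non-Fatal Injury' so the target has two classes
#(reassigning instead of inplace=True avoids modifying a view of df)
y = y.replace({'Property Damage Only': 'Non-Fatal Injury'})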

In [32]:
y.value_counts()

ACCLASS
Non-Fatal Injury    15616
Fatal                2573
Name: count, dtype: int64

In [36]:
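#Target values are imbalanced, so we use SMOTE to balance the training set
#Note: SMOTE interpolates between numeric feature vectors, so categorical columns must be encoded first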
from imblearn import over_sampling
smote = over_sampling.SMOTE(random_state=58)
X_train, y_train = smote.fit_resample(X_train, y_train)


ImportError: cannot import name '_MissingValues' from 'sklearn.utils._param_validation' (c:\ProgramData\anaconda3\Lib\site-packages\sklearn\utils\_param_validation.py)
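
An ImportError like this one usually points to mismatched versions of imbalanced-learn and scikit-learn, so aligning the two packages is the direct fix. Until then, plain random oversampling can stand in for SMOTE. Note that this is a different, simpler technique: it duplicates existing minority rows instead of synthesizing new ones. The sketch below uses only pandas; `random_oversample` is a hypothetical helper, and it assumes `X` and `y` share a unique index.

```python
import pandas as pd

def random_oversample(X: pd.DataFrame, y: pd.Series, random_state: int = 58):
    """Duplicate minority-class rows at random until every class matches the largest one."""
    target_size = y.value_counts().max()
    parts = []
    for label in y.unique():
        rows = X[y == label]
        extra = target_size - len(rows)
        if extra > 0:
            # Sample with replacement so a small class can grow past its own size
            rows = pd.concat([rows, rows.sample(n=extra, replace=True, random_state=random_state)])
        parts.append(rows)
    X_bal = pd.concat(parts)
    # Look up labels by index; repeated index values give repeated labels
    y_bal = y.loc[X_bal.index]
    return X_bal.reset_index(drop=True), y_bal.reset_index(drop=True)

# Toy data (made up): one 'Fatal' row against five 'Non-Fatal Injury' rows
X_toy = pd.DataFrame({'feature': [0, 1, 2, 3, 4, 5]})
y_toy = pd.Series(['Fatal'] + ['Non-Fatal Injury'] * 5)
X_bal, y_bal = random_oversample(X_toy, y_toy)
print(y_bal.value_counts())
```

As with SMOTE, the resampling should be applied to the training split only, so that the test split keeps the original class ratio.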

In [None]:
#Create a pipeline to transform the data

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

#Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='most_frequent')

#Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

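#ACCLASS is the target and was dropped from df_copy, so exclude it from the numeric features
numerical_columns = [col for col in remaining_columns if col != 'ACCLASS']

#Bundle preprocessing for numerical and categorical data
#Columns not listed in a transformer are dropped, because remainder='drop' is the default
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_columns),
        ('cat', categorical_transformer, categorical_columns)
    ])

#Define the model
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=0)

#Bundle preprocessing and modeling code in a pipeline
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)
                     ])

#Preprocessing of training data, fit model
#Fit on the training split only so the test split stays unseen
clf.fit(X_train, y_train)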

In [19]:
#Mapping for target column
print(df['ACCLASS'].value_counts())

# Our prediction only distinguishes Fatal from Non-Fatal
# 1 for Fatal, 0 for Non-Fatal
data_map = {
    'Fatal': 1, 
    'Non-Fatal Injury': 0, 
    'Property Damage Only': 0
    }

ACCLASS
Non-Fatal Injury        15599
Fatal                    2573
Property Damage Only       17
Name: count, dtype: int64
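
The cell above defines `data_map` but does not apply it. A minimal sketch of the missing step, using a made-up `acclass` series in place of `df['ACCLASS']`:

```python
import pandas as pd

data_map = {
    'Fatal': 1,
    'Non-Fatal Injury': 0,
    'Property Damage Only': 0
    }

# Toy target values (made up) covering all three labels
acclass = pd.Series(['Fatal', 'Non-Fatal Injury', 'Property Damage Only', 'Fatal'])
y_binary = acclass.map(data_map)

# map() returns NaN for labels missing from the dictionary, so check before casting
assert y_binary.notna().all(), 'unmapped ACCLASS label found'
y_binary = y_binary.astype(int)

print(y_binary.tolist())  # [1, 0, 0, 1]
```

On the real data this would be `y = df['ACCLASS'].map(data_map)`, run after the rows with a missing ACCLASS have been dropped.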


In [24]:
df.describe()

Unnamed: 0,X,Y,INDEX_,ACCNUM,YEAR,TIME,WARDNUM,LATITUDE,LONGITUDE,FATAL_NO,ObjectId
count,18194.0,18194.0,18194.0,13264.0,18194.0,18194.0,17332.0,18194.0,18194.0,827.0,18194.0
mean,-8838345.0,5420748.0,38188700.0,424844400.0,2012.934869,1362.615917,2521.028,43.710459,-79.396201,29.073761,9097.5
std,11625.33,8682.16,37264630.0,1065503000.0,4.754258,630.816048,184480.3,0.056369,0.104432,17.803627,5252.299734
min,-8865305.0,5402162.0,3363207.0,25301.0,2006.0,0.0,1.0,43.589678,-79.63839,1.0,1.0
25%,-8846591.0,5413242.0,5391370.0,1021229.0,2009.0,920.0,7.0,43.661727,-79.47028,14.0,4549.25
50%,-8838448.0,5419556.0,7644612.0,1197308.0,2012.0,1450.0,13.0,43.702745,-79.397132,28.0,9097.5
75%,-8829671.0,5427813.0,80782610.0,1365020.0,2017.0,1850.0,22.0,43.756345,-79.318286,42.0,13645.75
max,-8807929.0,5443099.0,81706060.0,4008024000.0,2022.0,2359.0,17162220.0,43.855445,-79.122974,78.0,18194.0
