#### Killed or Seriously Injured (KSI) Toronto
- **ACCNUM**: Accident number (unique identifier for each incident)
- **YEAR**: Year when the incident occurred
- **MONTH**: Month when the incident occurred
- **DAY**: Day of the month when the incident occurred
- **HOUR**: Hour of the day when the incident occurred
- **MINUTES**: Minutes within the hour when the incident occurred
- **WEEKDAY**: Day of the week when the incident occurred, stored as an integer code (the values in this file range from 0 to 6, as the `df.describe()` output shows)

#### Location Information
- **LATITUDE**: Latitude coordinates of the incident location
- **LONGITUDE**: Longitude coordinates of the incident location
- **Ward_Name**: Name of the ward where the incident occurred
- **Ward_ID**: Identifier for the ward where the incident occurred
- **Hood_Name**: Name of the neighborhood where the incident occurred
- **Hood_ID**: Identifier for the neighborhood where the incident occurred

#### Road and Traffic Information
- **Division**: Police division responsible for the area where the incident occurred
- **District**: Police district where the incident occurred
- **STREET1**: Primary street involved in the incident
- **STREET2**: Secondary street involved in the incident
- **OFFSET**: Location offset description (e.g., intersection, mid-block)
- **ROAD_CLASS**: Classification of the road (e.g., major road, local street)
- **LOCCOORD**: Location coordinates description (e.g., at intersection, mid-block)
- **ACCLOC**: Specific location of the incident on the road

#### Environmental Conditions
- **TRAFFCTL**: Type of traffic control device present (e.g., stop sign, traffic light)
- **VISIBILITY**: Visibility conditions at the time of the incident (e.g., clear, fog)
- **LIGHT**: Lighting conditions at the time of the incident (e.g., daylight, dark)
- **RDSFCOND**: Road surface conditions at the time of the incident (e.g., dry, wet)

#### Incident Details
- **ACCLASS**: Accident class (e.g., fatal, injury, property damage)
- **IMPACTYPE**: Impact type (e.g., rear-end, side-impact)
- **INVTYPE**: Involvement type of the person (e.g., driver, pedestrian)
- **INVAGE**: Age of the person involved
- **INJURY**: Injury severity level (e.g., fatal, major, minor)
- **FATAL_NO**: Fatality number in case of a fatal accident

#### Vehicle and Maneuver Information
- **INITDIR**: Initial direction of travel
- **VEHTYPE**: Vehicle type involved in the incident (e.g., car, motorcycle)
- **MANOEUVER**: Vehicle maneuver at the time of the incident (e.g., going straight, turning)
- **DRIVACT**: Driver action at the time of the incident (e.g., speeding, lost control)
- **DRIVCOND**: Driver condition at the time of the incident (e.g., sober, impaired)

#### Pedestrian and Cyclist Information
- **PEDTYPE**: Pedestrian type involved (if any)
- **PEDACT**: Pedestrian action at the time of the incident (if any)
- **PEDCOND**: Pedestrian condition at the time of the incident (if any)
- **CYCLISTYPE**: Cyclist type involved (if any)
- **CYCACT**: Cyclist action at the time of the incident (if any)
- **CYCCOND**: Cyclist condition at the time of the incident (if any)

#### Indicators (1/0)
- **PEDESTRIAN**: Was a pedestrian involved?
- **CYCLIST**: Was a cyclist involved?
- **AUTOMOBILE**: Was an automobile involved?
- **MOTORCYCLE**: Was a motorcycle involved?
- **TRUCK**: Was a truck involved?
- **TRSN_CITY_VEH**: Was a city transit vehicle involved?
- **EMERG_VEH**: Was an emergency vehicle involved?
- **PASSENGER**: Was a passenger involved?

#### Contributing Factors
- **SPEEDING**: Was speeding a factor in the incident?
- **AG_DRIV**: Was aggressive driving a factor?
- **REDLIGHT**: Was running a red light a factor?
- **ALCOHOL**: Was alcohol a factor?
- **DISABILITY**: Was disability a factor?
- **FATAL**: Was the incident fatal?


In [136]:
# Load libraries
import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_validate, cross_val_predict
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import joblib


In [2]:
# Read the dataset
df = pd.read_csv("KSI_CLEAN.csv")

In [3]:
# Column names, non-null counts, and dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12557 entries, 0 to 12556
Data columns (total 56 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   ACCNUM         12557 non-null  int64  
 1   YEAR           12557 non-null  int64  
 2   MONTH          12557 non-null  int64  
 3   DAY            12557 non-null  int64  
 4   HOUR           12557 non-null  int64  
 5   MINUTES        12557 non-null  int64  
 6   WEEKDAY        12557 non-null  int64  
 7   LATITUDE       12557 non-null  float64
 8   LONGITUDE      12557 non-null  float64
 9   Ward_Name      12557 non-null  object 
 10  Ward_ID        12557 non-null  int64  
 11  Hood_Name      12557 non-null  object 
 12  Hood_ID        12557 non-null  int64  
 13  Division       12557 non-null  object 
 14  District       12557 non-null  object 
 15  STREET1        12557 non-null  object 
 16  STREET2        12557 non-null  object 
 17  OFFSET         12557 non-null  object 
 18  ROAD_C

In [4]:
df.head()

Unnamed: 0,ACCNUM,YEAR,MONTH,DAY,HOUR,MINUTES,WEEKDAY,LATITUDE,LONGITUDE,Ward_Name,...,TRUCK,TRSN_CITY_VEH,EMERG_VEH,PASSENGER,SPEEDING,AG_DRIV,REDLIGHT,ALCOHOL,DISABILITY,FATAL
0,1249781,2011,8,4,23,18,3,43.651545,-79.38349,Toronto Centre-Rosedale (27),...,0,1,0,0,0,0,0,0,0,0
1,1311542,2012,8,19,23,18,6,43.780445,-79.30049,Scarborough-Agincourt (40),...,0,0,0,1,1,1,0,0,0,0
2,5002235651,2015,12,30,23,39,2,43.682342,-79.328266,Toronto-Danforth (30),...,0,0,0,0,0,1,0,0,0,1
3,1311542,2012,8,19,23,18,6,43.780445,-79.30049,Scarborough-Agincourt (40),...,0,0,0,1,1,1,0,0,0,0
4,1311542,2012,8,19,23,18,6,43.780445,-79.30049,Scarborough-Agincourt (40),...,0,0,0,1,1,1,0,0,0,0


In [5]:
df.shape

(12557, 56)

In [6]:
df.describe()

Unnamed: 0,ACCNUM,YEAR,MONTH,DAY,HOUR,MINUTES,WEEKDAY,LATITUDE,LONGITUDE,Ward_ID,...,TRUCK,TRSN_CITY_VEH,EMERG_VEH,PASSENGER,SPEEDING,AG_DRIV,REDLIGHT,ALCOHOL,DISABILITY,FATAL
count,12557.0,12557.0,12557.0,12557.0,12557.0,12557.0,12557.0,12557.0,12557.0,12557.0,...,12557.0,12557.0,12557.0,12557.0,12557.0,12557.0,12557.0,12557.0,12557.0,12557.0
mean,1576768000.0,2011.547822,6.76746,15.646333,13.167317,28.246317,2.987099,43.710715,-79.395989,22.56582,...,0.063391,0.065063,0.001195,0.368878,0.174405,0.511189,0.084335,0.043163,0.028351,0.136657
std,2541023000.0,3.104151,3.27867,8.861354,6.242482,17.515703,1.965007,0.056025,0.104216,12.531213,...,0.243674,0.246647,0.034543,0.48252,0.379472,0.499895,0.277901,0.203232,0.165979,0.343498
min,128407.0,2007.0,1.0,1.0,0.0,0.0,0.0,43.592048,-79.63839,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1108321.0,2009.0,4.0,8.0,9.0,13.0,1.0,43.662645,-79.467434,12.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,1276036.0,2011.0,7.0,16.0,14.0,30.0,3.0,43.702345,-79.39649,23.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
75%,4002384000.0,2014.0,10.0,23.0,18.0,44.0,5.0,43.755992,-79.31809,34.0,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
max,7003292000.0,2017.0,12.0,31.0,23.0,59.0,6.0,43.855445,-79.125897,44.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [7]:
df.describe(include="O")

Unnamed: 0,Ward_Name,Hood_Name,Division,District,STREET1,STREET2,OFFSET,ROAD_CLASS,LOCCOORD,ACCLOC,...,VEHTYPE,MANOEUVER,DRIVACT,DRIVCOND,PEDTYPE,PEDACT,PEDCOND,CYCLISTYPE,CYCACT,CYCCOND
count,12557,12557,12557,12557,12557,12557.0,12557.0,12557,12557,12557,...,12557,12557.0,12557.0,12557.0,12557.0,12557.0,12557.0,12557.0,12557.0,12557.0
unique,44,140,17,6,1352,2093.0,283.0,8,5,10,...,28,17.0,14.0,11.0,17.0,16.0,11.0,23.0,12.0,11.0
top,Trinity-Spadina (20),Waterfront Communities-The Island (77),D42,Toronto East York,YONGE ST,,,Major Arterial,Intersection,At Intersection,...,"Automobile, Station Wagon",,,,,,,,,
freq,722,492,1237,4200,252,1064.0,11104.0,8640,8331,5772,...,5140,5443.0,6356.0,6359.0,10540.0,10543.0,10506.0,12028.0,12026.0,12027.0


## Data Cleaning

In [8]:
df["STREET2"].value_counts()

STREET2
                            1064
LAWRENCE AVE E               100
BATHURST ST                   96
FINCH AVE E                   86
ISLINGTON AVE                 86
                            ... 
BAYVIEW BLOOR DVP S RAMP       1
POWER St                       1
WILSON AVENUE                  1
27 S 427 C S RAMP              1
DONLANDS AVE                   1
Name: count, Length: 2093, dtype: int64

In [9]:
df.isna().sum()

ACCNUM              0
YEAR                0
MONTH               0
DAY                 0
HOUR                0
MINUTES             0
WEEKDAY             0
LATITUDE            0
LONGITUDE           0
Ward_Name           0
Ward_ID             0
Hood_Name           0
Hood_ID             0
Division            0
District            0
STREET1             0
STREET2             0
OFFSET              0
ROAD_CLASS          0
LOCCOORD            0
ACCLOC              0
TRAFFCTL            0
VISIBILITY          0
LIGHT               0
RDSFCOND            0
ACCLASS             0
IMPACTYPE           0
INVTYPE             0
INVAGE              0
INJURY           4576
FATAL_NO            0
INITDIR             0
VEHTYPE             0
MANOEUVER           0
DRIVACT             0
DRIVCOND            0
PEDTYPE             0
PEDACT              0
PEDCOND             0
CYCLISTYPE          0
CYCACT              0
CYCCOND             0
PEDESTRIAN          0
CYCLIST             0
AUTOMOBILE          0
MOTORCYCLE

#### Find values that contain only a space (" ")

In [10]:
# Replace cells that hold a single space with NaN so pandas counts them as missing
df = df.replace(' ', np.nan)
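The replacement above matches only cells that are exactly one space. As a minimal sketch, assuming the file might also contain empty strings or longer runs of whitespace (not confirmed by the output shown here), a regular expression catches all of these at once. The `toy` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the KSI data (values are made up for illustration).
toy = pd.DataFrame({
    "STREET2": ["YORK ST", " ", "", "   "],
    "OFFSET": [" ", "65.6 M E of", " ", " "],
})

# Exact match: only the single-space string becomes NaN.
exact = toy.replace(" ", np.nan)

# Regex match: empty strings and any run of whitespace become NaN.
blank = toy.replace(r"^\s*$", np.nan, regex=True)

exact_missing = exact.isna().sum().to_dict()
blank_missing = blank.isna().sum().to_dict()
print(exact_missing)
print(blank_missing)
```

Comparing the two counts on the real data is a quick way to check whether the exact-match replacement missed anything.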

In [11]:
df.head()

Unnamed: 0,ACCNUM,YEAR,MONTH,DAY,HOUR,MINUTES,WEEKDAY,LATITUDE,LONGITUDE,Ward_Name,...,TRUCK,TRSN_CITY_VEH,EMERG_VEH,PASSENGER,SPEEDING,AG_DRIV,REDLIGHT,ALCOHOL,DISABILITY,FATAL
0,1249781,2011,8,4,23,18,3,43.651545,-79.38349,Toronto Centre-Rosedale (27),...,0,1,0,0,0,0,0,0,0,0
1,1311542,2012,8,19,23,18,6,43.780445,-79.30049,Scarborough-Agincourt (40),...,0,0,0,1,1,1,0,0,0,0
2,5002235651,2015,12,30,23,39,2,43.682342,-79.328266,Toronto-Danforth (30),...,0,0,0,0,0,1,0,0,0,1
3,1311542,2012,8,19,23,18,6,43.780445,-79.30049,Scarborough-Agincourt (40),...,0,0,0,1,1,1,0,0,0,0
4,1311542,2012,8,19,23,18,6,43.780445,-79.30049,Scarborough-Agincourt (40),...,0,0,0,1,1,1,0,0,0,0


In [12]:
# Count fully duplicated rows
df.duplicated().sum()

447

In [13]:
# Drop fully duplicated rows, keeping the first occurrence
df.drop_duplicates(inplace=True)
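Note that `drop_duplicates()` removes only rows that are identical in every column. In the `df.head()` output, several rows share the same `ACCNUM` (1311542) while differing elsewhere, which suggests the data holds one row per involved party rather than one per collision. The sketch below illustrates the difference on a toy frame; the `INVTYPE` and `VEHTYPE` values assigned to each row are invented for the example.

```python
import pandas as pd

# Toy frame: ACCNUM values come from the head() output, the rest is illustrative.
toy = pd.DataFrame({
    "ACCNUM": [1249781, 1311542, 5002235651, 1311542, 1311542, 1311542],
    "INVTYPE": ["Driver", "Passenger", "Pedestrian", "Vehicle Owner", "Driver", "Driver"],
    "VEHTYPE": ["Municipal Transit Bus (TTC)", "Other", "Other", "Other",
                "Automobile, Station Wagon", "Automobile, Station Wagon"],
})

# Rows identical in every column (the last row repeats the one before it).
full_dupes = int(toy.duplicated().sum())

# Removing them still leaves several rows for the same collision.
deduped = toy.drop_duplicates()
rows_per_accident = deduped.groupby("ACCNUM").size()

print(full_dupes)
print(rows_per_accident)
```

Any later count of "accidents" taken from `len(df)` is therefore a count of involved parties; `df["ACCNUM"].nunique()` would count collisions instead.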

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12110 entries, 0 to 12556
Data columns (total 56 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   ACCNUM         12110 non-null  int64  
 1   YEAR           12110 non-null  int64  
 2   MONTH          12110 non-null  int64  
 3   DAY            12110 non-null  int64  
 4   HOUR           12110 non-null  int64  
 5   MINUTES        12110 non-null  int64  
 6   WEEKDAY        12110 non-null  int64  
 7   LATITUDE       12110 non-null  float64
 8   LONGITUDE      12110 non-null  float64
 9   Ward_Name      12110 non-null  object 
 10  Ward_ID        12110 non-null  int64  
 11  Hood_Name      12110 non-null  object 
 12  Hood_ID        12110 non-null  int64  
 13  Division       12110 non-null  object 
 14  District       12109 non-null  object 
 15  STREET1        12110 non-null  object 
 16  STREET2        11080 non-null  object 
 17  OFFSET         1408 non-null   object 
 18  ROAD_CLASS 

In [15]:
df.shape

(12110, 56)

In [16]:
df.isna().sum()

ACCNUM               0
YEAR                 0
MONTH                0
DAY                  0
HOUR                 0
MINUTES              0
WEEKDAY              0
LATITUDE             0
LONGITUDE            0
Ward_Name            0
Ward_ID              0
Hood_Name            0
Hood_ID              0
Division             0
District             1
STREET1              0
STREET2           1030
OFFSET           10702
ROAD_CLASS           0
LOCCOORD            77
ACCLOC            4502
TRAFFCTL            22
VISIBILITY           2
LIGHT                2
RDSFCOND             7
ACCLASS              0
IMPACTYPE            0
INVTYPE              4
INVAGE               0
INJURY            5723
FATAL_NO             0
INITDIR           3387
VEHTYPE           1603
MANOEUVER         5044
DRIVACT           5924
DRIVCOND          5927
PEDTYPE          10106
PEDACT           10109
PEDCOND          10073
CYCLISTYPE       11581
CYCACT           11579
CYCCOND          11580
PEDESTRIAN           0
CYCLIST    

In [17]:
df.isna().mean()*100

ACCNUM            0.000000
YEAR              0.000000
MONTH             0.000000
DAY               0.000000
HOUR              0.000000
MINUTES           0.000000
WEEKDAY           0.000000
LATITUDE          0.000000
LONGITUDE         0.000000
Ward_Name         0.000000
Ward_ID           0.000000
Hood_Name         0.000000
Hood_ID           0.000000
Division          0.000000
District          0.008258
STREET1           0.000000
STREET2           8.505367
OFFSET           88.373245
ROAD_CLASS        0.000000
LOCCOORD          0.635838
ACCLOC           37.175888
TRAFFCTL          0.181668
VISIBILITY        0.016515
LIGHT             0.016515
RDSFCOND          0.057803
ACCLASS           0.000000
IMPACTYPE         0.000000
INVTYPE           0.033031
INVAGE            0.000000
INJURY           47.258464
FATAL_NO          0.000000
INITDIR          27.968621
VEHTYPE          13.236994
MANOEUVER        41.651528
DRIVACT          48.918249
DRIVCOND         48.943022
PEDTYPE          83.451693
P

In [18]:
df.columns

Index(['ACCNUM', 'YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTES', 'WEEKDAY',
       'LATITUDE', 'LONGITUDE', 'Ward_Name', 'Ward_ID', 'Hood_Name', 'Hood_ID',
       'Division', 'District', 'STREET1', 'STREET2', 'OFFSET', 'ROAD_CLASS',
       'LOCCOORD', 'ACCLOC', 'TRAFFCTL', 'VISIBILITY', 'LIGHT', 'RDSFCOND',
       'ACCLASS', 'IMPACTYPE', 'INVTYPE', 'INVAGE', 'INJURY', 'FATAL_NO',
       'INITDIR', 'VEHTYPE', 'MANOEUVER', 'DRIVACT', 'DRIVCOND', 'PEDTYPE',
       'PEDACT', 'PEDCOND', 'CYCLISTYPE', 'CYCACT', 'CYCCOND', 'PEDESTRIAN',
       'CYCLIST', 'AUTOMOBILE', 'MOTORCYCLE', 'TRUCK', 'TRSN_CITY_VEH',
       'EMERG_VEH', 'PASSENGER', 'SPEEDING', 'AG_DRIV', 'REDLIGHT', 'ALCOHOL',
       'DISABILITY', 'FATAL'],
      dtype='object')

In [19]:
# Inspect the columns that contain many missing values
df[["STREET2","OFFSET","ACCLOC","INJURY","INITDIR","VEHTYPE","MANOEUVER","DRIVACT","DRIVCOND","PEDTYPE","PEDACT","CYCLISTYPE","CYCACT","CYCCOND"]]

Unnamed: 0,STREET2,OFFSET,ACCLOC,INJURY,INITDIR,VEHTYPE,MANOEUVER,DRIVACT,DRIVCOND,PEDTYPE,PEDACT,CYCLISTYPE,CYCACT,CYCCOND
0,YORK ST,,,,West,Municipal Transit Bus (TTC),Going Ahead,Driving Properly,Normal,,,,,
1,AMETHYST RD,,,Minimal,,Other,,,,,,,,
2,GILLARD AVE,,At Intersection,Fatal,South,,,,,Vehicle is going straight thru inter.while ped...,"Crossing, no Traffic Control",,,
3,AMETHYST RD,,,,,Other,,,,,,,,
4,AMETHYST RD,,,,West,"Automobile, Station Wagon",Going Ahead,Failed to Yield Right of Way,Normal,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12552,SPRUCE ST,,,Major,South,Motorcycle,Going Ahead,Lost control,Inattentive,,,,,
12553,SPRUCE ST,,,,South,Other,Parked,,,,,,,
12554,SPRUCE ST,,,Minimal,East,Other,,,,Pedestrian hit at mid-block,Walking on Roadway with Traffic,,,
12555,BONNYCASTLE ST,65.6 M E of,Non Intersection,Fatal,,,,,,,,,,


In [20]:
df["PEDTYPE"].value_counts()

PEDTYPE
Pedestrian hit at mid-block                                                                     511
Vehicle turns left while ped crosses with ROW at inter.                                         413
Vehicle is going straight thru inter.while ped cross without ROW                                343
Pedestrian hit on sidewalk or shoulder                                                          141
Vehicle is going straight thru inter.while ped cross with ROW                                   115
Vehicle turns right while ped crosses with ROW at inter.                                        110
Pedestrian involved in a collision with transit vehicle anywhere along roadway                  107
Vehicle is reversing and hits pedestrian                                                         59
Other / Undefined                                                                                53
Vehicle turns left while ped crosses without ROW at inter.                                  

In [21]:
df["DRIVCOND"].value_counts()

DRIVCOND
Normal                                3975
Inattentive                           1061
Unknown                                664
Medical or Physical Disability         122
Had Been Drinking                      113
Ability Impaired, Alcohol               85
Ability Impaired, Alcohol Over .08      82
Fatigue                                 39
Other                                   33
Ability Impaired, Drugs                  9
Name: count, dtype: int64

In [22]:
df["DRIVACT"].value_counts()

DRIVACT
Driving Properly                2843
Failed to Yield Right of Way    1035
Lost control                     623
Improper Turn                    389
Other                            326
Disobeyed Traffic Control        320
Following too Close              205
Exceeding Speed Limit            151
Speed too Fast For Condition     137
Improper Lane Change              80
Improper Passing                  67
Wrong Way on One Way Road          7
Speed too Slow                     3
Name: count, dtype: int64

In [23]:
df["MANOEUVER"].value_counts()

MANOEUVER
Going Ahead                            4128
Turning Left                           1200
Stopped                                 447
Turning Right                           299
Slowing or Stopping                     195
Changing Lanes                          149
Other                                   143
Parked                                  122
Unknown                                  96
Reversing                                93
Making U Turn                            69
Overtaking                               67
Pulling Away from Shoulder or Curb       31
Merging                                  12
Pulling Onto Shoulder or towardCurb      11
Disabled                                  4
Name: count, dtype: int64

In [24]:
df["VEHTYPE"].value_counts()

VEHTYPE
Automobile, Station Wagon           5119
Other                               3707
Bicycle                              537
Motorcycle                           416
Municipal Transit Bus (TTC)          199
Truck - Open                         150
Pick Up Truck                         88
Passenger Van                         74
Delivery Van                          35
Truck - Closed (Blazer, etc)          31
Street Car                            24
Truck - Dump                          22
Taxi                                  22
Truck-Tractor                         20
Moped                                 19
Bus (Other) (Go Bus, Gray Coach)      11
Truck (other)                          7
Intercity Bus                          7
Truck - Tank                           4
Tow Truck                              4
Police Vehicle                         3
School Bus                             2
Construction Equipment                 2
Truck - Car Carrier                    1
Fire Veh

In [25]:
df["INITDIR"].value_counts()

INITDIR
East       2211
West       2117
South      2069
North      2008
Unknown     318
Name: count, dtype: int64

In [26]:
df["INJURY"].value_counts()

INJURY
Major      4119
Minor       932
Minimal     745
Fatal       591
Name: count, dtype: int64

In [27]:
df["ACCLOC"].value_counts()

ACCLOC
At Intersection          5594
Non Intersection         1177
Intersection Related      661
At/Near Private Drive     146
Private Driveway           12
Laneway                     7
Underpass or Tunnel         6
Overpass or Bridge          4
Trail                       1
Name: count, dtype: int64

In [28]:
# Drop the columns with a large share of missing values
df.drop(columns=["OFFSET","PEDTYPE","PEDACT","PEDCOND","CYCLISTYPE","CYCACT","CYCCOND","INJURY","DRIVCOND","MANOEUVER","DRIVACT"],inplace=True)
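The column list above was typed by hand. A minimal sketch of a threshold-based alternative follows; the helper name `drop_sparse_columns` and the 40% cutoff are assumptions made for illustration. Judging from the visible missing-value counts, a 40% cutoff would pick the same columns (MANOEUVER at about 41.7% is dropped, ACCLOC at about 37.2% is kept), but the notebook does not state which rule was used.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, max_missing=0.40):
    """Drop columns whose share of missing values exceeds max_missing.

    Hypothetical helper; returns the reduced frame and the dropped names.
    """
    missing_share = frame.isna().mean()
    sparse = missing_share[missing_share > max_missing].index.tolist()
    return frame.drop(columns=sparse), sparse

# Toy frame with made-up values.
toy = pd.DataFrame({
    "ACCLOC": ["At Intersection", np.nan, "Non Intersection", "At Intersection", "Laneway"],  # 20% missing
    "OFFSET": [np.nan, np.nan, np.nan, np.nan, "65.6 M E of"],                                # 80% missing
    "INJURY": ["Fatal", np.nan, np.nan, "Major", np.nan],                                     # 60% missing
})

kept, dropped = drop_sparse_columns(toy)
print(dropped)
print(list(kept.columns))
```

Deriving the list from a threshold keeps the cleaning step reproducible if the source file is refreshed.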

In [29]:
df[["ACCLOC","INITDIR","VEHTYPE"]]

Unnamed: 0,ACCLOC,INITDIR,VEHTYPE
0,,West,Municipal Transit Bus (TTC)
1,,,Other
2,At Intersection,South,
3,,,Other
4,,West,"Automobile, Station Wagon"
...,...,...,...
12552,,South,Motorcycle
12553,,South,Other
12554,,East,Other
12555,Non Intersection,,


In [30]:
df["VEHTYPE"].value_counts()

VEHTYPE
Automobile, Station Wagon           5119
Other                               3707
Bicycle                              537
Motorcycle                           416
Municipal Transit Bus (TTC)          199
Truck - Open                         150
Pick Up Truck                         88
Passenger Van                         74
Delivery Van                          35
Truck - Closed (Blazer, etc)          31
Street Car                            24
Truck - Dump                          22
Taxi                                  22
Truck-Tractor                         20
Moped                                 19
Bus (Other) (Go Bus, Gray Coach)      11
Truck (other)                          7
Intercity Bus                          7
Truck - Tank                           4
Tow Truck                              4
Police Vehicle                         3
School Bus                             2
Construction Equipment                 2
Truck - Car Carrier                    1
Fire Veh

In [31]:
df["INITDIR"].value_counts()

INITDIR
East       2211
West       2117
South      2069
North      2008
Unknown     318
Name: count, dtype: int64

In [32]:
# Treat "Unknown" directions as missing
df["INITDIR"] = df["INITDIR"].replace("Unknown", np.nan)
df["INITDIR"].value_counts()

INITDIR
East     2211
West     2117
South    2069
North    2008
Name: count, dtype: int64

In [33]:
df["ACCLOC"].value_counts()

ACCLOC
At Intersection          5594
Non Intersection         1177
Intersection Related      661
At/Near Private Drive     146
Private Driveway           12
Laneway                     7
Underpass or Tunnel         6
Overpass or Bridge          4
Trail                       1
Name: count, dtype: int64

In [34]:
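# Backward-fill the remaining gaps: each missing value takes the next valid value below it
# (fillna(method='bfill') is deprecated in pandas 2.x; DataFrame.bfill() is the replacement)
df = df.bfill()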


In [35]:
df.isnull().sum()

ACCNUM           0
YEAR             0
MONTH            0
DAY              0
HOUR             0
MINUTES          0
WEEKDAY          0
LATITUDE         0
LONGITUDE        0
Ward_Name        0
Ward_ID          0
Hood_Name        0
Hood_ID          0
Division         0
District         0
STREET1          0
STREET2          0
ROAD_CLASS       0
LOCCOORD         0
ACCLOC           0
TRAFFCTL         0
VISIBILITY       0
LIGHT            0
RDSFCOND         0
ACCLASS          0
IMPACTYPE        0
INVTYPE          0
INVAGE           0
FATAL_NO         0
INITDIR          0
VEHTYPE          0
PEDESTRIAN       0
CYCLIST          0
AUTOMOBILE       0
MOTORCYCLE       0
TRUCK            0
TRSN_CITY_VEH    0
EMERG_VEH        0
PASSENGER        0
SPEEDING         0
AG_DRIV          0
REDLIGHT         0
ALCOHOL          0
DISABILITY       0
FATAL            0
dtype: int64
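Backward fill copies the next valid value upward, so imputed values depend on row order rather than on anything about the row itself. Its effect is visible in this notebook: "At Intersection" in `ACCLOC` grows from 5,594 rows before the fill to 8,929 after it. The sketch below, on a made-up toy frame, shows two things worth checking: backward fill leaves trailing gaps untouched, and a per-column mode fill is one order-independent alternative. It is offered as a comparison, not as what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values and gaps at the start, middle, and end.
toy = pd.DataFrame({
    "ACCLOC": [np.nan, "At Intersection", np.nan, "Non Intersection", "At Intersection", np.nan],
    "INITDIR": ["West", np.nan, "South", "West", np.nan, np.nan],
})

# Backward fill: gaps with no valid value below them stay missing.
back_filled = toy.bfill()
remaining = back_filled.isna().sum().to_dict()

# Mode fill: every gap in a column gets that column's most frequent value.
modes = toy.mode().iloc[0]
mode_filled = toy.fillna(modes)

print(remaining)
print(mode_filled["ACCLOC"].tolist())
```

On the full dataset the null counts after the fill are all zero, so no trailing gaps remained there.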

In [36]:
df.drop(columns=["Ward_ID","Hood_ID"],inplace=True)

In [37]:
df.head()

Unnamed: 0,ACCNUM,YEAR,MONTH,DAY,HOUR,MINUTES,WEEKDAY,LATITUDE,LONGITUDE,Ward_Name,...,TRUCK,TRSN_CITY_VEH,EMERG_VEH,PASSENGER,SPEEDING,AG_DRIV,REDLIGHT,ALCOHOL,DISABILITY,FATAL
0,1249781,2011,8,4,23,18,3,43.651545,-79.38349,Toronto Centre-Rosedale (27),...,0,1,0,0,0,0,0,0,0,0
1,1311542,2012,8,19,23,18,6,43.780445,-79.30049,Scarborough-Agincourt (40),...,0,0,0,1,1,1,0,0,0,0
2,5002235651,2015,12,30,23,39,2,43.682342,-79.328266,Toronto-Danforth (30),...,0,0,0,0,0,1,0,0,0,1
3,1311542,2012,8,19,23,18,6,43.780445,-79.30049,Scarborough-Agincourt (40),...,0,0,0,1,1,1,0,0,0,0
4,1311542,2012,8,19,23,18,6,43.780445,-79.30049,Scarborough-Agincourt (40),...,0,0,0,1,1,1,0,0,0,0


In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12110 entries, 0 to 12556
Data columns (total 43 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   ACCNUM         12110 non-null  int64  
 1   YEAR           12110 non-null  int64  
 2   MONTH          12110 non-null  int64  
 3   DAY            12110 non-null  int64  
 4   HOUR           12110 non-null  int64  
 5   MINUTES        12110 non-null  int64  
 6   WEEKDAY        12110 non-null  int64  
 7   LATITUDE       12110 non-null  float64
 8   LONGITUDE      12110 non-null  float64
 9   Ward_Name      12110 non-null  object 
 10  Hood_Name      12110 non-null  object 
 11  Division       12110 non-null  object 
 12  District       12110 non-null  object 
 13  STREET1        12110 non-null  object 
 14  STREET2        12110 non-null  object 
 15  ROAD_CLASS     12110 non-null  object 
 16  LOCCOORD       12110 non-null  object 
 17  ACCLOC         12110 non-null  object 
 18  TRAFFCTL   

In [39]:
df["HOUR"].value_counts()

HOUR
18    799
17    780
16    739
14    712
19    684
15    672
13    658
20    650
21    595
10    556
12    551
11    522
8     500
9     487
22    451
7     413
0     391
23    377
6     366
2     331
3     297
1     257
5     208
4     114
Name: count, dtype: int64

### Feature Engineering

In [40]:
def get_day_status(hour):
    if 5 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 17:
        return 'Afternoon'
    elif 17 <= hour < 21:
        return 'Evening'
    else:
        return 'Night'



df['Day_Status'] = df['HOUR'].apply(get_day_status)

In [41]:
df["Day_Status"].value_counts()

Day_Status
Afternoon    3332
Morning      3052
Evening      2913
Night        2813
Name: count, dtype: int64
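The same bucketing can be done without a per-row Python function. The sketch below uses `numpy.select` with the same hour boundaries as `get_day_status` (5 to 11 Morning, 12 to 16 Afternoon, 17 to 20 Evening, everything else Night); the function name `day_status` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

def day_status(hours):
    """Vectorized equivalent of get_day_status for a sequence of hours (0-23)."""
    hours = pd.Series(hours)
    conditions = [
        hours.between(5, 11),    # 5 <= hour < 12
        hours.between(12, 16),   # 12 <= hour < 17
        hours.between(17, 20),   # 17 <= hour < 21
    ]
    labels = ["Morning", "Afternoon", "Evening"]
    # Anything not matched above (21-23 and 0-4) falls through to "Night".
    return pd.Series(np.select(conditions, labels, default="Night"), index=hours.index)

sample = day_status([0, 4, 5, 11, 12, 16, 17, 20, 21, 23])
print(sample.tolist())
```

On a frame of this size the speed difference is negligible; the main benefit is that the boundaries sit side by side in one place.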

In [42]:
df["region"] = df["STREET1"]+" "+ df["STREET2"]

In [43]:
df.drop(columns=["STREET1","STREET2"],inplace=True)

In [44]:
df["region"].value_counts()

region
FINCH AVE W WESTON RD                   23
DIXON RD ISLINGTON AVE                  21
DON VALLEY PARKWAY  S LAWRENCE AVE E    21
QUEENS QUAY W LOWER SIMCOE ST           20
EGLINTON AVE W MARTIN GROVE RD          19
                                        ..
JUDSON ST HAROLD ST                      1
QUEENS PARK CRES W QUEENS PARK           1
1240 WARDEN AVE CLOVERLAWN AVE           1
7 INVERLEIGH DR KIPLING AVE              1
KINGSTON RD OLD KINGSTON RD              1
Name: count, Length: 3633, dtype: int64
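Plain concatenation keeps whatever spacing and casing the street names already have: the output above contains a double space in "DON VALLEY PARKWAY  S LAWRENCE AVE E", and the earlier `STREET2` counts include mixed-case entries such as "POWER St". A minimal sketch of a normalizing version follows; `build_region` is a hypothetical helper and the toy rows are made up.

```python
import pandas as pd

def build_region(street1, street2):
    """Join two street columns, collapse repeated whitespace, and upper-case."""
    combined = street1.str.cat(street2, sep=" ")
    return combined.str.replace(r"\s+", " ", regex=True).str.strip().str.upper()

toy = pd.DataFrame({
    "STREET1": ["DON VALLEY PARKWAY  S", "QUEEN ST E", "Power St "],
    "STREET2": ["LAWRENCE AVE E", "POWER St", "QUEEN ST E"],
})
toy["region"] = build_region(toy["STREET1"], toy["STREET2"])
print(toy["region"].tolist())
```

The last two toy rows describe the same intersection but still produce different labels because the street order differs; sorting the two names before joining would be one way to merge them, if that is desired.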

In [45]:
df["ACCLOC"].value_counts()

ACCLOC
At Intersection          8929
Non Intersection         1831
Intersection Related     1031
At/Near Private Drive     281
Private Driveway           14
Laneway                    12
Underpass or Tunnel         7
Overpass or Bridge          4
Trail                       1
Name: count, dtype: int64

In [46]:
df["LOCCOORD"].value_counts()

LOCCOORD
Intersection                           8131
Mid-Block                              3975
Park, Private Property, Public Lane       2
Entrance Ramp Westbound                   2
Name: count, dtype: int64

In [47]:
df["TRAFFCTL"].value_counts()

TRAFFCTL
No Control              5806
Traffic Signal          5112
Stop Sign                946
Pedestrian Crossover     135
Traffic Controller        77
Streetcar (Stop for)      16
Yield Sign                 9
Traffic Gate               5
Police Control             2
School Guard               2
Name: count, dtype: int64

In [48]:
df["VISIBILITY"].value_counts()

VISIBILITY
Clear                     10415
Rain                       1272
Snow                        271
Other                        71
Fog, Mist, Smoke, Dust       34
Freezing Rain                29
Drifting Snow                13
Strong wind                   5
Name: count, dtype: int64

In [49]:
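def VISIBIL(x):
    # Collapse the rare visibility categories into the common ones.
    # Note: fog/mist/smoke/dust is folded into 'Clear', and 'Other' becomes 'normal'.
    if x in ["Fog, Mist, Smoke, Dust"]:
        return 'Clear'
    elif x in ["Freezing Rain","Strong wind"]:
        return 'Rain'
    elif x in ["Drifting Snow"]:
        return 'Snow'
    elif x in ["Other"]:
        return 'normal'
    else:
        return x


df['VISIBILITY'] = df['VISIBILITY'].apply(VISIBIL)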

In [50]:
df["VISIBILITY"].value_counts()

VISIBILITY
Clear     10449
Rain       1306
Snow        284
normal       71
Name: count, dtype: int64
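The grouping can also be written as a lookup table, which makes each merge decision easy to review. The sketch below reproduces the notebook's grouping as given (including fog folded into "Clear", a choice that may be worth revisiting); the constant name `VISIBILITY_GROUPS` is an assumption and the sample series holds one made-up row per original category.

```python
import pandas as pd

# Same grouping as the VISIBIL function, expressed as data.
VISIBILITY_GROUPS = {
    "Fog, Mist, Smoke, Dust": "Clear",
    "Freezing Rain": "Rain",
    "Strong wind": "Rain",
    "Drifting Snow": "Snow",
    "Other": "normal",
}

# One illustrative row per original category.
toy = pd.Series(
    ["Clear", "Rain", "Snow", "Other", "Fog, Mist, Smoke, Dust",
     "Freezing Rain", "Drifting Snow", "Strong wind"],
    name="VISIBILITY",
)

# Values not listed in the dictionary pass through unchanged.
grouped = toy.replace(VISIBILITY_GROUPS)
group_counts = grouped.value_counts().to_dict()
print(group_counts)
```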

In [51]:
df["LIGHT"].value_counts()

LIGHT
Daylight                7132
Dark                    2631
Dark, artificial        1831
Dusk                     185
Dusk, artificial         121
Dawn                      75
Daylight, artificial      67
Dawn, artificial          66
Other                      2
Name: count, dtype: int64

In [52]:
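def day_LIGHT(x):
    # Reduce lighting to two groups: daylight, dawn, and 'Other' count as 'Light';
    # everything else (dark and dusk, with or without artificial light) counts as 'Dark'.
    if x in ["Daylight","Dawn","Daylight, artificial","Dawn, artificial","Other"]:
        return 'Light'
    else:
        return "Dark"


df['LIGHT'] = df['LIGHT'].apply(day_LIGHT)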

In [53]:
df["LIGHT"].value_counts()

LIGHT
Light    7342
Dark     4768
Name: count, dtype: int64

In [54]:
df["RDSFCOND"].value_counts()

RDSFCOND
Dry                     9661
Wet                     2046
Loose Snow               126
Other                    103
Slush                     74
Ice                       56
Packed Snow               38
Loose Sand or Gravel       5
Spilled liquid             1
Name: count, dtype: int64

In [55]:
df["ACCLASS"].value_counts()

ACCLASS
Non-Fatal Injury        10443
Fatal                    1655
Property Damage Only       12
Name: count, dtype: int64

In [56]:
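def ACCLASS(x):
    # Merge the 12 'Property Damage Only' rows into the non-fatal class,
    # leaving a binary Fatal / Non-Fatal label.
    if x in ["Property Damage Only","Non-Fatal Injury"]:
        return 'Non-Fatal'
    else:
        return "Fatal"


df['ACCLASS'] = df['ACCLASS'].apply(ACCLASS)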

In [57]:
df["ACCLASS"].value_counts()

ACCLASS
Non-Fatal    10455
Fatal         1655
Name: count, dtype: int64

In [58]:
df["IMPACTYPE"].value_counts()

IMPACTYPE
Pedestrian Collisions     4927
Turning Movement          1817
Cyclist Collisions        1248
Rear End                  1194
SMV Other                  898
Angle                      831
Approaching                586
Sideswipe                  356
Other                      134
SMV Unattended Vehicle     119
Name: count, dtype: int64

In [59]:
df["INVTYPE"].value_counts()

INVTYPE
Driver                  5595
Pedestrian              2063
Passenger               1648
Vehicle Owner           1218
Cyclist                  544
Motorcycle Driver        416
Truck Driver             240
Other Property Owner     181
Other                    127
Moped Driver              25
Motorcycle Passenger      21
Driver - Not Hit          13
Wheelchair                10
In-Line Skater             3
Trailer Owner              2
Runaway - No Driver        1
Unknown - FTR              1
Pedestrian - Not Hit       1
Witness                    1
Name: count, dtype: int64

In [60]:
df["INVAGE"].value_counts()

INVAGE
unknown     1865
20 to 24    1093
25 to 29    1057
35 to 39     897
30 to 34     893
50 to 54     879
40 to 44     866
45 to 49     853
55 to 59     728
60 to 64     579
15 to 19     500
65 to 69     452
70 to 74     361
75 to 79     292
80 to 84     216
85 to 89     155
10 to 14     146
5 to 9       116
0 to 4       110
90 to 94      45
Over 95        7
Name: count, dtype: int64
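`INVAGE` is stored as text bands, so it sorts alphabetically ("10 to 14" before "5 to 9") and has no numeric order. A minimal sketch of one way to recover an ordering is to extract the lower bound of each band; the helper name `age_band_lower_bound` is an assumption, and treating "unknown" as missing is a choice made for the example.

```python
import re

def age_band_lower_bound(band):
    """Return the lower bound of an INVAGE band as an int, or None if there is none."""
    match = re.search(r"\d+", str(band))
    return int(match.group()) if match else None

bands = ["0 to 4", "5 to 9", "20 to 24", "Over 95", "unknown"]
lower_bounds = [age_band_lower_bound(b) for b in bands]
print(lower_bounds)
```

Applied with `df["INVAGE"].map(age_band_lower_bound)`, this would give a numeric column that can be sorted or binned; the 1,865 "unknown" rows would need separate handling.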

In [61]:
df["FATAL_NO"].value_counts()

FATAL_NO
0     11567
26       10
14       10
8        10
3        10
      ...  
75        1
66        1
69        1
70        1
74        1
Name: count, Length: 78, dtype: int64

In [62]:
df["INITDIR"].value_counts()

INITDIR
East     3190
West     3121
South    2949
North    2850
Name: count, dtype: int64

In [63]:
df["VEHTYPE"].value_counts()

VEHTYPE
Automobile, Station Wagon           6093
Other                               4027
Bicycle                              584
Motorcycle                           489
Municipal Transit Bus (TTC)          229
Truck - Open                         156
Pick Up Truck                        119
Passenger Van                        113
Truck - Closed (Blazer, etc)          45
Delivery Van                          44
Street Car                            41
Truck - Dump                          36
Truck-Tractor                         28
Moped                                 24
Taxi                                  23
Bus (Other) (Go Bus, Gray Coach)      14
Truck (other)                         14
Intercity Bus                          7
Truck - Tank                           6
Police Vehicle                         5
Tow Truck                              5
School Bus                             2
Construction Equipment                 2
Truck - Car Carrier                    1
Fire Veh

In [64]:
df["SPEEDING"].value_counts()

SPEEDING
0    10073
1     2037
Name: count, dtype: int64

In [65]:
df["AG_DRIV"].value_counts()

AG_DRIV
1    6160
0    5950
Name: count, dtype: int64

In [66]:
df["REDLIGHT"].value_counts()

REDLIGHT
0    11107
1     1003
Name: count, dtype: int64

In [67]:
df["ALCOHOL"].value_counts()

ALCOHOL
0    11605
1      505
Name: count, dtype: int64

In [68]:
df["DISABILITY"].value_counts()

DISABILITY
0    11772
1      338
Name: count, dtype: int64

In [69]:
df["FATAL"].value_counts()

FATAL
0    10455
1     1655
Name: count, dtype: int64

## Data Analysis

### Q1: Distribution of Accidents Over Months

In [70]:
df["MONTH"].value_counts()

MONTH
6     1229
10    1190
8     1162
9     1151
7     1109
11    1048
5     1007
4      892
12     871
1      862
3      846
2      743
Name: count, dtype: int64

In [71]:
fig = px.histogram(df, x="MONTH",text_auto=True,title="Distribution of Accidents Over Months")

fig.update_xaxes(categoryorder='category ascending', tickmode='linear', tick0=0, dtick=1)
fig.update_layout(bargap=0.2)

fig.show()


#### June (month 6) has the most accidents (1,229); February (month 2) has the fewest (743)
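
The same conclusion can be read off programmatically rather than by eye. A minimal sketch, using the monthly counts printed above and the standard-library `calendar` module to turn month numbers into names; the helper name `busiest_and_quietest` is illustrative and not part of the notebook:

```python
import calendar

# Monthly accident counts copied from the value_counts() output above.
month_counts = {6: 1229, 10: 1190, 8: 1162, 9: 1151, 7: 1109, 11: 1048,
                5: 1007, 4: 892, 12: 871, 1: 862, 3: 846, 2: 743}

def busiest_and_quietest(counts):
    """Return (key with the largest count, key with the smallest count)."""
    busiest = max(counts, key=counts.get)
    quietest = min(counts, key=counts.get)
    return busiest, quietest

busiest, quietest = busiest_and_quietest(month_counts)
print(calendar.month_name[busiest], month_counts[busiest])
print(calendar.month_name[quietest], month_counts[quietest])
```

On the real frame, `df["MONTH"].value_counts().idxmax()` and `.idxmin()` would give the same two month numbers.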

### Q2: Distribution of Accidents Over Years

In [72]:
fig = px.histogram(df, x="YEAR", text_auto=True, title="Distribution of Accidents Over Years")

fig.update_xaxes(categoryorder='category ascending', tickmode='linear', tick0=0, dtick=1)
fig.update_layout(bargap=0.2)

fig.show()

#### 2007 has the most accidents; 2017 has the fewest

### Q3: Distribution of Accidents Over Weekdays

In [73]:
fig = px.histogram(df, x="WEEKDAY", text_auto=True, title="Distribution of Accidents Over Weekdays")

fig.update_xaxes(categoryorder='category ascending', tickmode='linear', tick0=0, dtick=1)
fig.update_layout(bargap=0.2)

fig.show()

#### Weekday 4 has the most accidents; weekday 5 has the fewest
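
Weekday numbers are easier to read as names. A small sketch, assuming the coding given in the WEEKDAY column description (1=Sunday through 7=Saturday); `WEEKDAY_NAMES` and `weekday_name` are illustrative names, not part of the notebook:

```python
# Coding taken from the WEEKDAY column description: 1=Sunday, ..., 7=Saturday.
WEEKDAY_NAMES = {1: "Sunday", 2: "Monday", 3: "Tuesday", 4: "Wednesday",
                 5: "Thursday", 6: "Friday", 7: "Saturday"}

def weekday_name(code):
    """Translate a WEEKDAY code (1-7) into its day name."""
    return WEEKDAY_NAMES[code]

most, fewest = weekday_name(4), weekday_name(5)
print(most, fewest)
```

If that coding holds for this data, `df["WEEKDAY"].map(WEEKDAY_NAMES)` would give a labeled column that can be passed to the histogram instead of the raw codes.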

### Q4: Location of Each Incident

In [74]:
fig = px.scatter_mapbox(df, 
                        lat="LATITUDE", 
                        lon="LONGITUDE", 
                        hover_name="LOCCOORD",  
                        zoom=10,  
                        height=600)

fig.update_layout(mapbox_style="carto-positron")

fig.update_layout(title="Incident Locations Map")

fig.show()


### Q5: Distribution of Accidents Over Ward Name


In [75]:
px.histogram(df, x="Ward_Name",text_auto=True,title="Distribution of Accidents Over Ward_Name",color="Ward_Name")

### Q6: The twenty neighborhoods with the most accidents

In [76]:
df["count"] = 1

In [77]:
# Count records per neighborhood and keep the 20 largest.
df_grouped = (
    df.groupby("Hood_Name")["count"].count()
    .reset_index()
    .sort_values(by="count", ascending=False)
    .head(20)
)


In [78]:
px.histogram(df_grouped,x="Hood_Name",y="count",color="Hood_Name")

### Q7: Which police division has had the most accidents?

In [79]:
px.histogram(df, x="Division", text_auto=True, title="Accidents by Police Division", color="Division")



### Q8: Which road class has had the most accidents?

In [80]:
px.histogram(df, x="ROAD_CLASS", text_auto=True, title="Accidents by Road Class", color="ROAD_CLASS")

#### Major Arterial roads have the most accidents; Laneways have the fewest

### Q9: Number of accidents by location coordinate type (LOCCOORD)

In [81]:
px.histogram(df, x="LOCCOORD", text_auto=True, title="Number of Accidents by Location Coordinate Type (LOCCOORD)", color="LOCCOORD")


### Q10: Number of accidents at each specific location (ACCLOC)

In [82]:
px.histogram(df, x="ACCLOC", text_auto=True, title="Number of Accidents at Each Specific Location", color="ACCLOC")



#### Most accidents happen at intersections (ACCLOC = "At Intersection")

In [83]:
df["ACCLOC"].value_counts()

ACCLOC
At Intersection          8929
Non Intersection         1831
Intersection Related     1031
At/Near Private Drive     281
Private Driveway           14
Laneway                    12
Underpass or Tunnel         7
Overpass or Bridge          4
Trail                       1
Name: count, dtype: int64

### Q11: How does the type of traffic control device affect incident frequency?

In [84]:
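px.histogram(df, x="TRAFFCTL", text_auto=True, title="How does the type of traffic control device affect incident frequency?", color="TRAFFCTL")


#### The largest number of accidents occurs where there is no traffic control (TRAFFCTL = "No Control"); this is a frequency count, not evidence that the absence of control causes accidents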

### Q12: What is the relationship between visibility conditions and incident occurrence?

In [85]:
px.histogram(df, x="VISIBILITY",text_auto=True,title="What is the relationship between visibility conditions and incident occurrence?",color="VISIBILITY")

### Q13: How do lighting conditions at the time of the incident impact the frequency of incidents?

In [86]:
px.histogram(df, x="LIGHT",text_auto=True,title="How do lighting conditions at the time of the incident impact the frequency of incidents?",color="LIGHT")

### Q14: How do road surface conditions influence incident rates?

In [87]:
px.histogram(df, x="RDSFCOND",text_auto=True,title="How do road surface conditions influence incident rates?",color="RDSFCOND")

### Q15: Are there any interactions between traffic control devices and visibility conditions?

In [88]:
interaction_analysis = pd.crosstab(df['TRAFFCTL'], df['VISIBILITY'])
interaction_analysis.head()

| TRAFFCTL | Clear | Rain | Snow | normal |
|---|---|---|---|---|
| No Control | 5051 | 566 | 164 | 25 |
| Pedestrian Crossover | 111 | 16 | 6 | 2 |
| Police Control | 2 | 0 | 0 | 0 |
| School Guard | 2 | 0 | 0 | 0 |
| Stop Sign | 848 | 74 | 18 | 6 |
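
Raw counts are dominated by the largest row, so an interaction is easier to judge from row proportions: if traffic control and visibility were unrelated, each row would show roughly the same split across visibility conditions. A minimal sketch using two rows of the counts above (the names `counts` and `row_shares` are illustrative):

```python
import pandas as pd

# Counts copied from the crosstab output above (two rows shown for brevity).
counts = pd.DataFrame(
    {"Clear": [5051, 848], "Rain": [566, 74], "Snow": [164, 18], "normal": [25, 6]},
    index=pd.Index(["No Control", "Stop Sign"], name="TRAFFCTL"),
)

# Divide each row by its total so that every row sums to 1.
row_shares = counts.div(counts.sum(axis=1), axis=0)
print(row_shares.round(3))
```

On the full frame, `pd.crosstab(df['TRAFFCTL'], df['VISIBILITY'], normalize='index')` produces the same kind of table in one call.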


### Q16: What is the distribution of accident classes (ACCLASS)?


In [89]:
px.histogram(df, x="ACCLASS", text_auto=True, title="Distribution of Accident Classes (ACCLASS)", color="ACCLASS")

### Q17: Most Common Impact Type (IMPACTYPE) in Fatal Accidents

In [90]:
fatal_accidents = df[df['ACCLASS'] == 'Fatal']
impact_type_distribution = fatal_accidents['IMPACTYPE'].value_counts().reset_index()
impact_type_distribution.columns = ['Impact Type', 'Number of Fatal Accidents']

fig = px.bar(impact_type_distribution, 
             x='Impact Type', 
             y='Number of Fatal Accidents', 
             title='Impact Types in Fatal Accidents', 
             labels={'Impact Type': 'Impact Type', 'Number of Fatal Accidents': 'Number of Fatal Accidents'})

fig.show()


### Q18: Distribution of Involvement Types (INVTYPE) Across Accident Classes


In [91]:
involvement_distribution = pd.crosstab(df['INVTYPE'], df['ACCLASS']).reset_index()

involvement_distribution_melted = involvement_distribution.melt(id_vars='INVTYPE', var_name='Accident Class', value_name='Number of Incidents')

fig = px.bar(involvement_distribution_melted, 
             x='INVTYPE', 
             y='Number of Incidents', 
             color='Accident Class', 
             title='Involvement Type Distribution Across Accident Classes', 
             labels={'INVTYPE': 'Involvement Type', 'Number of Incidents': 'Number of Incidents'})

fig.show()


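### Q19: Fatalities (FATAL_NO) by Impact Type (IMPACTYPE)


In [92]:
# Caution: FATAL_NO may be a sequential fatality identifier rather than a count
# (its value_counts above show 78 distinct values running into the 70s), in which
# case this sum is not a number of deaths.
fatalities_by_impact = df.groupby('IMPACTYPE')['FATAL_NO'].sum().reset_index()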
fatalities_by_impact.columns = ['Impact Type', 'Total Number of Fatalities']

fig = px.bar(fatalities_by_impact, 
             x='Impact Type', 
             y='Total Number of Fatalities', 
             title='Fatalities by Impact Type', 
             labels={'Impact Type': 'Impact Type', 'Total Number of Fatalities': 'Total Number of Fatalities'})

fig.show()


### Q20: How does the initial direction of travel (INITDIR) influence the likelihood of fatal incidents?

In [93]:
fatal_by_direction = df.groupby('INITDIR')['FATAL'].mean().reset_index()
fig = px.bar(fatal_by_direction, x='INITDIR', y='FATAL', title='Fatality Rate by Initial Direction of Travel',color="INITDIR",text_auto=True)
fig.show()


### Q21: What is the relationship between speeding and the involvement of different vehicle types?

In [94]:
vehicle_columns = ['AUTOMOBILE', 'MOTORCYCLE', 'TRUCK']
speed_corr = df[vehicle_columns + ['SPEEDING']].corr()
fig = px.imshow(speed_corr, title='Correlation Between Speeding and Vehicle Types',text_auto=True)
fig.show()


### Q22: How does the involvement of pedestrians (PEDESTRIAN) correlate with other contributing factors like alcohol or aggressive driving?

In [95]:
pedestrian_factors = df[['PEDESTRIAN', 'ALCOHOL', 'AG_DRIV']].corr()
fig = px.imshow(pedestrian_factors, title='Correlation Between Pedestrian Involvement and Contributing Factors',text_auto=True)
fig.show()


### Q23: What is the impact of aggressive driving (AG_DRIV) on incidents involving multiple vehicle types?

In [96]:
df['MULTI_VEH_INV'] = df[['AUTOMOBILE', 'TRUCK', 'MOTORCYCLE']].sum(axis=1) > 1
ag_driv_multi_veh = df.groupby('MULTI_VEH_INV')['AG_DRIV'].mean().reset_index()
fig = px.bar(ag_driv_multi_veh, x='MULTI_VEH_INV', y='AG_DRIV', title='Aggressive Driving in Multi-Vehicle Incidents')
fig.show()


### Q24: Are incidents involving cyclists (CYCLIST) more likely to occur in certain initial directions (INITDIR)?

In [97]:
cyclist_by_direction = df[df['CYCLIST'] == 1].groupby('INITDIR').size().reset_index(name='Count')
fig = px.bar(cyclist_by_direction, x='INITDIR', y='Count', title='Cyclist-Involved Incidents by Initial Direction',color="INITDIR")
fig.show()


### Q25: What is the relationship between running red lights (REDLIGHT) and the involvement of specific vehicle types?

In [98]:
redlight_vehicle_corr = df[vehicle_columns + ['REDLIGHT']].corr()
fig = px.imshow(redlight_vehicle_corr, title='Correlation Between Red Light Violations and Vehicle Types',text_auto=True)
fig.show()


### Q26: How does the presence of passengers (PASSENGER) affect the likelihood of fatalities in speeding-related incidents?

In [99]:
speeding_fatality = df[df['SPEEDING'] == 1].groupby('PASSENGER')['FATAL'].mean().reset_index()
fig = px.bar(speeding_fatality, x='PASSENGER', y='FATAL', title='Fatality Rate in Speeding Incidents with Passengers',color="PASSENGER")
fig.show()


### Q27: Is there a higher incidence of disability-related incidents (DISABILITY) involving certain vehicle types?

In [100]:
disability_by_vehicle = df.groupby('DISABILITY')[vehicle_columns].sum().reset_index()
fig = px.bar(disability_by_vehicle, x='DISABILITY', y=vehicle_columns, title='Disability-Related Incidents by Vehicle Type')
fig.show()


### Q28: What is the distribution of incidents involving emergency vehicles (EMERG_VEH) and their contributing factors?

In [101]:
emergency_factors = df[df['EMERG_VEH'] == 1][['SPEEDING', 'AG_DRIV', 'REDLIGHT', 'ALCOHOL', 'FATAL']].sum().reset_index(name='Count')
fig = px.bar(emergency_factors, x='index', y='Count', title='Contributing Factors in Emergency Vehicle Incidents')
fig.show()


### Q29: How do incidents involving city transit vehicles (TRSN_CITY_VEH) compare to those involving other types of vehicles in terms of fatalities?

In [102]:
transit_vs_others = df.groupby('TRSN_CITY_VEH')['FATAL'].mean().reset_index()
fig = px.bar(transit_vs_others, x='TRSN_CITY_VEH', y='FATAL', title='Fatality Rate in City Transit Vehicle Incidents vs Others')
fig.show()


### Q30: What is the distribution of accident Day Status?

In [103]:
px.histogram(df, x="Day_Status", text_auto=True, title="Distribution of Accident Day Status", color="Day_Status")

In [104]:
df["SPEEDING"].value_counts()

SPEEDING
0    10073
1     2037
Name: count, dtype: int64

In [105]:
df.columns

Index(['ACCNUM', 'YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTES', 'WEEKDAY',
       'LATITUDE', 'LONGITUDE', 'Ward_Name', 'Hood_Name', 'Division',
       'District', 'ROAD_CLASS', 'LOCCOORD', 'ACCLOC', 'TRAFFCTL',
       'VISIBILITY', 'LIGHT', 'RDSFCOND', 'ACCLASS', 'IMPACTYPE', 'INVTYPE',
       'INVAGE', 'FATAL_NO', 'INITDIR', 'VEHTYPE', 'PEDESTRIAN', 'CYCLIST',
       'AUTOMOBILE', 'MOTORCYCLE', 'TRUCK', 'TRSN_CITY_VEH', 'EMERG_VEH',
       'PASSENGER', 'SPEEDING', 'AG_DRIV', 'REDLIGHT', 'ALCOHOL', 'DISABILITY',
       'FATAL', 'Day_Status', 'region', 'count', 'MULTI_VEH_INV'],
      dtype='object')

## Data Preprocessing

In [106]:
df2 = df.copy()

In [107]:
df2.columns

Index(['ACCNUM', 'YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTES', 'WEEKDAY',
       'LATITUDE', 'LONGITUDE', 'Ward_Name', 'Hood_Name', 'Division',
       'District', 'ROAD_CLASS', 'LOCCOORD', 'ACCLOC', 'TRAFFCTL',
       'VISIBILITY', 'LIGHT', 'RDSFCOND', 'ACCLASS', 'IMPACTYPE', 'INVTYPE',
       'INVAGE', 'FATAL_NO', 'INITDIR', 'VEHTYPE', 'PEDESTRIAN', 'CYCLIST',
       'AUTOMOBILE', 'MOTORCYCLE', 'TRUCK', 'TRSN_CITY_VEH', 'EMERG_VEH',
       'PASSENGER', 'SPEEDING', 'AG_DRIV', 'REDLIGHT', 'ALCOHOL', 'DISABILITY',
       'FATAL', 'Day_Status', 'region', 'count', 'MULTI_VEH_INV'],
      dtype='object')

In [108]:
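# Drop identifiers, date/time and location fields, and high-cardinality columns.
# ACCLASS and FATAL_NO are also dropped because they encode the outcome and
# would likely leak the FATAL target into the features.
cols_to_drop = [
    'ACCNUM', 'YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTES', 'WEEKDAY',
    'LATITUDE', 'LONGITUDE', 'Ward_Name', 'Hood_Name', 'Division',
    'ROAD_CLASS', 'LOCCOORD', 'ACCLOC', 'TRAFFCTL', 'RDSFCOND',
    'ACCLASS', 'IMPACTYPE', 'INVTYPE', 'INVAGE', 'FATAL_NO', 'VEHTYPE',
    'region', 'count', 'MULTI_VEH_INV',
]
df2.drop(columns=cols_to_drop, inplace=True)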

In [109]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12110 entries, 0 to 12556
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   District       12110 non-null  object
 1   VISIBILITY     12110 non-null  object
 2   LIGHT          12110 non-null  object
 3   INITDIR        12110 non-null  object
 4   PEDESTRIAN     12110 non-null  int64 
 5   CYCLIST        12110 non-null  int64 
 6   AUTOMOBILE     12110 non-null  int64 
 7   MOTORCYCLE     12110 non-null  int64 
 8   TRUCK          12110 non-null  int64 
 9   TRSN_CITY_VEH  12110 non-null  int64 
 10  EMERG_VEH      12110 non-null  int64 
 11  PASSENGER      12110 non-null  int64 
 12  SPEEDING       12110 non-null  int64 
 13  AG_DRIV        12110 non-null  int64 
 14  REDLIGHT       12110 non-null  int64 
 15  ALCOHOL        12110 non-null  int64 
 16  DISABILITY     12110 non-null  int64 
 17  FATAL          12110 non-null  int64 
 18  Day_Status     12110 non-null  

In [110]:
df2["LIGHT"].value_counts()

LIGHT
Light    7342
Dark     4768
Name: count, dtype: int64

In [111]:
for col in df2.select_dtypes('object').columns:
    print(f'{col} -- {df2[col].nunique()}')

District -- 5
VISIBILITY -- 4
LIGHT -- 2
INITDIR -- 4
Day_Status -- 4


In [112]:
cols_cat = ["District","VISIBILITY","LIGHT","INITDIR","Day_Status"]
for i in cols_cat:
    df2[i] = LabelEncoder().fit_transform(df2[i])
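
`LabelEncoder` assigns integer codes to the distinct labels in sorted order, so the resulting numbers carry no meaning beyond alphabetical position. A pure-Python sketch of that mapping; the helper `sorted_label_codes` is illustrative and only mimics the encoder's ordering, it is not scikit-learn code:

```python
def sorted_label_codes(values):
    """Map each distinct label to an integer, assigned in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}

# Labels taken from the LIGHT and INITDIR value_counts() outputs in this notebook.
light_codes = sorted_label_codes(["Light", "Dark", "Light"])
direction_codes = sorted_label_codes(["East", "West", "South", "North"])
print(light_codes)
print(direction_codes)
```

Because the loop above creates a fresh `LabelEncoder()` for each column and discards it, the mapping is not stored anywhere; keeping each fitted encoder (for example in a dict keyed by column name) would make `classes_` and `inverse_transform` available later.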

In [113]:
df2.head()

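| | District | VISIBILITY | LIGHT | INITDIR | PEDESTRIAN | CYCLIST | AUTOMOBILE | MOTORCYCLE | TRUCK | TRSN_CITY_VEH | EMERG_VEH | PASSENGER | SPEEDING | AG_DRIV | REDLIGHT | ALCOHOL | DISABILITY | FATAL | Day_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 0 | 0 | 3 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 1 | 3 | 0 | 0 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 3 |
| 2 | 4 | 0 | 0 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 3 |
| 3 | 3 | 0 | 0 | 3 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 3 |
| 4 | 3 | 0 | 0 | 3 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 3 |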


In [114]:
df2.corr()

| | District | VISIBILITY | LIGHT | INITDIR | PEDESTRIAN | CYCLIST | AUTOMOBILE | MOTORCYCLE | TRUCK | TRSN_CITY_VEH | EMERG_VEH | PASSENGER | SPEEDING | AG_DRIV | REDLIGHT | ALCOHOL | DISABILITY | FATAL | Day_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| District | 1.0 | 0.009259 | 0.003374 | -0.001723 | 0.067777 | 0.140221 | -0.068669 | 0.012343 | -0.025594 | 0.020897 | -0.016026 | -0.064382 | -0.053411 | -0.075417 | -0.030142 | -0.004312 | -0.021041 | -0.010065 | 0.006589 |
| VISIBILITY | 0.009259 | 1.0 | -0.141652 | -0.023015 | 0.055532 | -0.05518 | -0.019366 | -0.081961 | -0.006326 | 0.021532 | -0.012803 | -0.005542 | 0.010222 | 0.00258 | -0.004964 | 0.012173 | -0.038336 | 0.003947 | 0.038416 |
| LIGHT | 0.003374 | -0.141652 | 1.0 | -0.003756 | -0.076285 | 0.077694 | -0.054233 | 0.059474 | 0.078022 | 0.03164 | 0.018769 | -0.066563 | -0.129222 | 0.018372 | -0.002511 | -0.177693 | 0.066777 | -0.050871 | -0.557418 |
| INITDIR | -0.001723 | -0.023015 | -0.003756 | 1.0 | 0.013525 | -0.007198 | -0.012623 | -0.002599 | 0.002063 | 0.024596 | 0.005301 | 0.013687 | 0.001667 | -0.01112 | -0.014239 | 0.006451 | 0.015659 | 0.01224 | 0.006004 |
| PEDESTRIAN | 0.067777 | 0.055532 | -0.076285 | 0.013525 | 1.0 | -0.276419 | -0.114514 | -0.216372 | -0.023748 | 0.038295 | -0.019587 | -0.35511 | -0.204539 | -0.145001 | -0.141966 | -0.088551 | -0.10656 | 0.106796 | 0.005348 |
| CYCLIST | 0.140221 | -0.05518 | 0.077694 | -0.007198 | -0.276419 | 1.0 | -0.068735 | -0.099786 | -0.001171 | -0.025393 | 0.025068 | -0.168427 | -0.119599 | -0.111554 | -0.064698 | -0.051043 | -0.059669 | -0.074113 | -0.036559 |
| AUTOMOBILE | -0.068669 | -0.019366 | -0.054233 | -0.012623 | -0.114514 | -0.068735 | 1.0 | -0.099662 | -0.363109 | -0.443609 | -0.020412 | 0.127519 | 0.085042 | 0.146977 | 0.089763 | 0.059546 | 0.053502 | -0.08913 | 0.012973 |
| MOTORCYCLE | 0.012343 | -0.081961 | 0.059474 | -0.002599 | -0.216372 | -0.099786 | -0.099662 | 1.0 | -0.048044 | -0.063313 | -0.010368 | -0.056501 | 0.077824 | 0.061346 | -0.022125 | -0.024826 | -0.046188 | -0.023973 | -0.048153 |
| TRUCK | -0.025594 | -0.006326 | 0.078022 | 0.002063 | -0.023748 | -0.001171 | -0.363109 | -0.048044 | 1.0 | -0.023967 | -0.009068 | -0.014702 | 0.003053 | -0.010964 | -0.017825 | -0.034898 | -0.027023 | 0.113582 | -0.025701 |
| TRSN_CITY_VEH | 0.020897 | 0.021532 | 0.03164 | 0.024596 | 0.038295 | -0.025393 | -0.443609 | -0.063313 | -0.023967 | 1.0 | -0.009081 | 0.014013 | -0.033781 | -0.071764 | -0.018008 | -0.007661 | -0.037473 | 0.069421 | 0.005687 |


In [115]:
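# Features are every column except the FATAL target.
X, Y = df2.drop(columns=['FATAL']), df2['FATAL']
# stratify=Y keeps the fatal/non-fatal ratio the same in the train and test splits.
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)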

## Data Modeling

In [137]:
model = DecisionTreeClassifier(random_state=42)

model.fit(x_train, y_train)

y_pred_train = model.predict(x_train)
y_pred_test = model.predict(x_test)


print("Accuracy_train:", accuracy_score(y_train, y_pred_train))
print("Accuracy_test:", accuracy_score(y_test, y_pred_test))
print("classification_report",classification_report(y_test, y_pred_test))


Accuracy_train: 0.9277456647398844
Accuracy_test: 0.8876961189099918
classification_report               precision    recall  f1-score   support

           0       0.91      0.97      0.94      2091
           1       0.65      0.38      0.48       331

    accuracy                           0.89      2422
   macro avg       0.78      0.67      0.71      2422
weighted avg       0.87      0.89      0.87      2422
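
The overall accuracy hides how the model does on the fatal class (label 1). Its F1 score is the harmonic mean of precision and recall, which can be checked from the rounded figures in the report above. A small sketch; `f1_from` is an illustrative helper, and because the inputs are rounded the result is approximate:

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values for the fatal class (label 1) from the report above.
precision_fatal, recall_fatal, support_fatal = 0.65, 0.38, 331

f1_fatal = f1_from(precision_fatal, recall_fatal)
# Recall times support approximates how many fatal records were identified.
approx_found = round(recall_fatal * support_fatal)

print(round(f1_fatal, 2), approx_found)
```

Read this way, a recall of 0.38 means that roughly 126 of the 331 fatal records in the test split were identified and the rest were predicted as non-fatal.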



In [138]:
model = RandomForestClassifier()
model.fit(x_train, y_train)

y_pred_train = model.predict(x_train)
y_pred_test = model.predict(x_test)


print("Accuracy_train:", accuracy_score(y_train, y_pred_train))
print("Accuracy_test:", accuracy_score(y_test, y_pred_test))
print("classification_report",classification_report(y_test, y_pred_test))


Accuracy_train: 0.9277456647398844
Accuracy_test: 0.8885218827415359
classification_report               precision    recall  f1-score   support

           0       0.90      0.97      0.94      2091
           1       0.68      0.34      0.46       331

    accuracy                           0.89      2422
   macro avg       0.79      0.66      0.70      2422
weighted avg       0.87      0.89      0.87      2422



In [118]:
for i in df2.columns:
    print(df2[i].value_counts())

District
4    4062
0    2873
3    2836
2    2328
1      11
Name: count, dtype: int64
VISIBILITY
0    10449
1     1306
2      284
3       71
Name: count, dtype: int64
LIGHT
1    7342
0    4768
Name: count, dtype: int64
INITDIR
0    3190
3    3121
2    2949
1    2850
Name: count, dtype: int64
PEDESTRIAN
0    7187
1    4923
Name: count, dtype: int64
CYCLIST
0    10774
1     1336
Name: count, dtype: int64
AUTOMOBILE
1    10948
0     1162
Name: count, dtype: int64
MOTORCYCLE
0    11144
1      966
Name: count, dtype: int64
TRUCK
0    11357
1      753
Name: count, dtype: int64
TRSN_CITY_VEH
0    11355
1      755
Name: count, dtype: int64
EMERG_VEH
0    12095
1       15
Name: count, dtype: int64
PASSENGER
0    7774
1    4336
Name: count, dtype: int64
SPEEDING
0    10073
1     2037
Name: count, dtype: int64
AG_DRIV
1    6160
0    5950
Name: count, dtype: int64
REDLIGHT
0    11107
1     1003
Name: count, dtype: int64
ALCOHOL
0    11605
1      505
Name: count, dtype: int64
DISABILITY
0    11772
1

In [119]:
# joblib.dump(model,"model.pkl")

### **Hyperparameter Tuning**

In [120]:
X , Y

(       District  VISIBILITY  LIGHT  INITDIR  PEDESTRIAN  CYCLIST  AUTOMOBILE  \
 0             4           0      0        3           1        0           0   
 1             3           0      0        2           1        0           1   
 2             4           0      0        2           1        0           1   
 3             3           0      0        3           1        0           1   
 4             3           0      0        3           1        0           1   
 ...         ...         ...    ...      ...         ...      ...         ...   
 12552         4           0      1        2           1        0           0   
 12553         4           0      1        2           1        0           0   
 12554         4           0      1        0           1        0           0   
 12555         4           0      1        0           0        0           1   
 12556         4           0      1        0           0        0           1   
 
        MOTORCYCLE  TRUCK 

In [121]:
dt = DecisionTreeClassifier()

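# Note: 'r2' is a regression metric, which is why the test scores below are negative;
# a classification metric such as scoring='accuracy' or scoring='f1' is the more usual choice.
cv = cross_validate(dt, X, Y, cv=10, return_train_score=True, scoring='r2', return_estimator=True)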

In [122]:
cv


{'fit_time': array([0.04042339, 0.05679321, 0.06349945, 0.08005881, 0.0484519 ,
        0.0481751 , 0.06407976, 0.04753399, 0.05641365, 0.04796982]),
 'score_time': array([0.00802779, 0.00752187, 0.00803304, 0.01595044, 0.        ,
        0.00842023, 0.00799942, 0.00813007, 0.00803709, 0.00799656]),
 'estimator': [DecisionTreeClassifier(),
  DecisionTreeClassifier(),
  DecisionTreeClassifier(),
  DecisionTreeClassifier(),
  DecisionTreeClassifier(),
  DecisionTreeClassifier(),
  DecisionTreeClassifier(),
  DecisionTreeClassifier(),
  DecisionTreeClassifier(),
  DecisionTreeClassifier()],
 'test_score': array([-0.69100759, -0.4524422 , -0.48050872, -0.31912625, -0.47349209,
        -0.45205511, -0.79412579, -0.89186026, -0.99657578, -0.54280855]),
 'train_score': array([0.39050103, 0.38428165, 0.36562352, 0.36018156, 0.37417516,
        0.37771072, 0.36837638, 0.37537714, 0.38704506, 0.36448707])}
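
The negative `test_score` values follow from applying R² to 0/1 labels. R² compares the squared error of the predictions with the squared error of always predicting the mean label, and on imbalanced classes even a reasonably accurate classifier can fall below that reference. A toy sketch in plain Python (the labels are made up for illustration and `r2_for_binary_predictions` is a hypothetical helper):

```python
def r2_for_binary_predictions(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, applied to 0/1 labels."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy labels: 2 positives out of 10, with every row predicted as 0 (majority class).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
all_zero = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, all_zero)) / len(y_true)
r2 = r2_for_binary_predictions(y_true, all_zero)
print(accuracy, r2)
```

Here accuracy is 0.8 while R² is about -0.25. In general an all-zero predictor on data with positive rate p scores R² = -p / (1 - p); with this dataset's 1,655 fatal records out of 12,110 that is about -0.16.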

In [123]:
index_max = cv['test_score'].argmax()
cv['estimator'][index_max]

In [124]:
cv['estimator'][index_max].score(x_train, y_train)

0.9162881915772089

In [125]:
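# Caution: the CV estimators were fit on folds of the full X, which overlap x_test, so this is not a true held-out score.
cv['estimator'][index_max].score(x_test, y_test)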

0.9174236168455822

In [126]:
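rf = RandomForestClassifier()

# Note: the training split has about 9,688 rows, so min_samples_leaf=20000 can never
# be satisfied by a split and min_samples_leaf=2000 allows only a few leaves per tree.
params = {
    'max_depth': [4, 6, 8],
    'min_samples_leaf': [2000, 20000],
    'max_leaf_nodes': [10, 20],
    'criterion': ['gini', 'entropy'],
}

grid = GridSearchCV(rf, params)
grid.fit(x_train, y_train)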

In [127]:
grid.best_params_

{'criterion': 'gini',
 'max_depth': 4,
 'max_leaf_nodes': 10,
 'min_samples_leaf': 2000}

In [128]:
df2.columns

Index(['District', 'VISIBILITY', 'LIGHT', 'INITDIR', 'PEDESTRIAN', 'CYCLIST',
       'AUTOMOBILE', 'MOTORCYCLE', 'TRUCK', 'TRSN_CITY_VEH', 'EMERG_VEH',
       'PASSENGER', 'SPEEDING', 'AG_DRIV', 'REDLIGHT', 'ALCOHOL', 'DISABILITY',
       'FATAL', 'Day_Status'],
      dtype='object')

In [129]:
grid.score(x_test, y_test)

0.8633360858794384

In [130]:
grid.score(x_train, y_train)

0.8633360858794384
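
Identical train and test scores are a hint that the tuned model may be predicting a single class. A quick check is to compare the score with the accuracy of always predicting the majority class, using the class counts from `df["FATAL"].value_counts()` earlier in the notebook (the variable names below are illustrative):

```python
# Class counts copied from df["FATAL"].value_counts() earlier in the notebook.
non_fatal, fatal = 10455, 1655

# Accuracy of a model that always predicts the majority class (non-fatal).
baseline_accuracy = non_fatal / (non_fatal + fatal)

# Test-split class counts from the classification reports (support column).
test_baseline = 2091 / (2091 + 331)

print(baseline_accuracy, test_baseline)
```

Both values agree with the 0.8633 reported by `grid.score`, which is consistent with the model assigning every row to class 0; because the split is stratified, the class ratio and therefore this baseline is the same in the train and test sets.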

In [131]:
dt = DecisionTreeClassifier()

# Note: with min_samples_leaf=20000 (more than the roughly 9,688 training rows)
# the tree cannot split at all and reduces to a single leaf.
params = {
    'max_depth': [4, 6, 8],
    'min_samples_leaf': [2000, 20000],
    'max_leaf_nodes': [10, 20],
    'criterion': ['gini', 'entropy'],
}

rand = RandomizedSearchCV(dt, params, n_iter=5)
rand.fit(x_train, y_train)

In [132]:
rand.best_params_

{'min_samples_leaf': 20000,
 'max_leaf_nodes': 20,
 'max_depth': 4,
 'criterion': 'gini'}

In [133]:
rand.score(x_train, y_train), rand.score(x_test, y_test)

(0.8633360858794384, 0.8633360858794384)

In [134]:
dt = RandomForestClassifier()

# Define the parameter grid
params = {
    'n_estimators': [50, 100, 200],            # Number of trees in the forest
    'max_depth': [4, 6, 8, 10],                # Depth of the tree
    'min_samples_split': [2, 10, 50],          # Minimum samples required to split a node
    'min_samples_leaf': [1, 50, 100],          # Minimum samples required at a leaf node
    'max_leaf_nodes': [None, 10, 20, 30],      # Maximum number of leaf nodes
    'criterion': ['gini', 'entropy']           # Criterion for splitting
}

# Setup the grid search
grid = GridSearchCV(dt, params, cv=5, scoring='accuracy', n_jobs=-1)

# Fit the model
grid.fit(x_train, y_train)

# Get the best accuracy
best_accuracy = grid.best_score_
best_params = grid.best_params_

print("Best Accuracy: ", best_accuracy)
print("Best Parameters: ", best_params)

Best Accuracy:  0.8781995606709385
Best Parameters:  {'criterion': 'gini', 'max_depth': 10, 'max_leaf_nodes': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
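
`best_score_` is the mean cross-validated accuracy on the training folds, not a held-out result, and accuracy says little about the minority class. One possible follow-up, sketched here on synthetic data that merely stands in for the notebook's `X` and `Y` (all names ending in `_demo`, the small grid, and the choice of `scoring="f1"` are assumptions for illustration), is to tune on F1 and then score the refit estimator on the untouched test split:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data standing in for the notebook's features and target.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.86, 0.14], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# A deliberately tiny grid; scoring="f1" rewards finding the minority class.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=42),
    {"max_depth": [4, 8], "class_weight": [None, "balanced"]},
    cv=3,
    scoring="f1",
)
search.fit(X_tr, y_tr)

# best_estimator_ is refit on all of X_tr; the test split is used only here.
test_f1 = f1_score(y_te, search.best_estimator_.predict(X_te))
print(search.best_params_, round(test_f1, 3))
```

In the notebook the equivalent step would be `grid.best_estimator_.predict(x_test)` followed by `classification_report(y_test, ...)`, so the tuned forest can be compared with the untuned models on the same split.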


In [135]:
df2.columns

Index(['District', 'VISIBILITY', 'LIGHT', 'INITDIR', 'PEDESTRIAN', 'CYCLIST',
       'AUTOMOBILE', 'MOTORCYCLE', 'TRUCK', 'TRSN_CITY_VEH', 'EMERG_VEH',
       'PASSENGER', 'SPEEDING', 'AG_DRIV', 'REDLIGHT', 'ALCOHOL', 'DISABILITY',
       'FATAL', 'Day_Status'],
      dtype='object')