<a href="https://colab.research.google.com/github/narasimhachikkala/Capstone_606/blob/main/src/Capstone_606_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Road Accident Severity Prediction

## 2. Background

This project predicts the severity of road accidents in a specific region from historical accident data for that region. A predictive model is trained on the historical records; once trained, it can estimate the severity of a new accident from its details. The goal is a tool that assesses accident severity based on historical patterns, which could support accident prevention and response efforts in the region. This matters because better severity prediction has the potential to improve road safety, save lives, and optimize resource allocation.

## 3. DATA

**Description :** This dataset reports details of all traffic collisions occurring on county and local roadways within Montgomery County, Maryland, from 2015 to September 2023. Each row contains details from a crash report.

**Data Source :** https://catalog.data.gov/dataset/crash-reporting-drivers-data

## 3.1 Import necessary libraries

In [5]:
import pandas as pd
import numpy as np
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

## 3.2 Reading the file from Google Drive

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [7]:
!ls "/content/drive/My Drive/Crash_Reporting_Drivers_Data.csv"

'/content/drive/My Drive/Crash_Reporting_Drivers_Data.csv'


In [8]:
data=pd.read_csv("/content/drive/My Drive/Crash_Reporting_Drivers_Data.csv")

#### Displaying Top 5 records

In [9]:
data.head()

Unnamed: 0,Report Number,Local Case Number,Agency Name,ACRS Report Type,Crash Date/Time,Route Type,Road Name,Cross-Street Type,Cross-Street Name,Off-Road Description,Municipality,Related Non-Motorist,Collision Type,Weather,Surface Condition,Light,Traffic Control,Driver Substance Abuse,Non-Motorist Substance Abuse,Person ID,Driver At Fault,Injury Severity,Circumstance,Driver Distracted By,Drivers License State,Vehicle ID,Vehicle Damage Extent,Vehicle First Impact Location,Vehicle Second Impact Location,Vehicle Body Type,Vehicle Movement,Vehicle Continuing Dir,Vehicle Going Dir,Speed Limit,Driverless Vehicle,Parked Vehicle,Vehicle Year,Vehicle Make,Vehicle Model,Equipment Problems,Latitude,Longitude,Location
0,MCP3040003N,190026050,Montgomery County Police,Property Damage Crash,05/31/2019 03:00:00 PM,,,,,PARKING LOT OF 3215 SPARTAN RD,,,OTHER,CLEAR,,DAYLIGHT,,UNKNOWN,,DE2A24CD-7919-4F8D-BABF-5B75CE12D21E,Yes,NO APPARENT INJURY,,UNKNOWN,,165AD539-A8C8-4004-AF73-B7DCAAA8B3CC,SUPERFICIAL,ONE OCLOCK,ONE OCLOCK,PASSENGER CAR,PARKING,North,North,15,No,No,2004,HONDA,TK,UNKNOWN,39.150044,-77.063089,"(39.15004368, -77.06308884)"
1,MCP1307000K,190024786,Montgomery County Police,Property Damage Crash,05/24/2019 05:00:00 PM,,,,,PARKING LOT,,,,CLEAR,,DAYLIGHT,,,,6208FA7B-5DC4-4B54-AD60-0C06DFE2AE81,Yes,NO APPARENT INJURY,,NOT DISTRACTED,XX,10239493-D667-42F9-A3D2-820FE184CB6C,FUNCTIONAL,ONE OCLOCK,ONE OCLOCK,PASSENGER CAR,PARKING,Unknown,Unknown,0,No,No,0,UNK,UNK,,39.199047,-77.250743,"(39.19904667, -77.25074333)"
2,MCP2846008X,230034260,Montgomery County Police,Property Damage Crash,07/17/2023 10:45:00 AM,County,SELFRIDGE RD,County,RANDOLPH RD,,,,OTHER,CLEAR,DRY,DARK LIGHTS ON,TRAFFIC SIGNAL,UNKNOWN,,9ACC5A7E-47A1-438F-BF0E-40B0A8632055,Yes,NO APPARENT INJURY,,INATTENTIVE OR LOST IN THOUGHT,MD,8B61B8E0-5473-4C78-A654-6029684ABD03,SUPERFICIAL,SEVEN OCLOCK,SEVEN OCLOCK,PASSENGER CAR,MOVING CONSTANT SPEED,East,East,35,No,No,2003,FORD,TK,NO MISUSE,39.054588,-77.085974,"(39.05458848, -77.08597423)"
3,MCP32610017,230034668,Montgomery County Police,Property Damage Crash,07/20/2023 11:40:00 PM,Maryland (State),MUNCASTER MILL RD,County,SHADY GROVE RD,,,,OTHER,,DRY,DARK LIGHTS ON,TRAFFIC SIGNAL,,,E611A3F8-5F7D-465B-8DE0-3814027998F1,No,NO APPARENT INJURY,,NOT DISTRACTED,MD,1A592482-AF1F-49CE-8554-77EF7C55966B,SUPERFICIAL,ELEVEN OCLOCK,ELEVEN OCLOCK,PASSENGER CAR,MAKING RIGHT TURN,South,East,45,No,No,2023,TOYT,CP,NO MISUSE,39.148721,-77.147111,"(39.14872076, -77.14711061)"
4,EJ78520081,230033429,Gaithersburg Police Depar,Property Damage Crash,07/13/2023 05:40:00 PM,Municipality,PERRY PKWY,Unknown,ENT TO SHOPPING CENTER,,,,SAME DIR REAR END,,DRY,DAYLIGHT,NO CONTROLS,,,3C7F6951-1701-44DC-9824-88DF4E32352E,Yes,NO APPARENT INJURY,,LOOKED BUT DID NOT SEE,MD,C2EF337E-5881-48ED-9B06-36D0BE00557C,SUPERFICIAL,TWELVE OCLOCK,TWELVE OCLOCK,"MEDIUM/HEAVY TRUCKS 3 AXLES (OVER 10,000LBS (4...",MOVING CONSTANT SPEED,Unknown,Unknown,25,No,No,2001,KENWORTH,TRUCK,,39.149085,-77.210731,"(39.14908542, -77.21073135)"


## 3.3 Dataset info

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 164925 entries, 0 to 164924
Data columns (total 43 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   Report Number                   164925 non-null  object 
 1   Local Case Number               164925 non-null  object 
 2   Agency Name                     164925 non-null  object 
 3   ACRS Report Type                164925 non-null  object 
 4   Crash Date/Time                 164925 non-null  object 
 5   Route Type                      148660 non-null  object 
 6   Road Name                       149609 non-null  object 
 7   Cross-Street Type               148632 non-null  object 
 8   Cross-Street Name               149598 non-null  object 
 9   Off-Road Description            15314 non-null   object 
 10  Municipality                    18312 non-null   object 
 11  Related Non-Motorist            5210 non-null    object 
 12  Collision Type  

### Data shape

In [11]:
data.shape

(164925, 43)

### Target Variable

In [12]:
data['Injury Severity'].value_counts()

NO APPARENT INJURY          135177
POSSIBLE INJURY              16773
SUSPECTED MINOR INJURY       11471
SUSPECTED SERIOUS INJURY      1354
FATAL INJURY                   150
Name: Injury Severity, dtype: int64
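
The counts above are heavily skewed toward NO APPARENT INJURY. As a minimal sketch, the snippet below re-enters the same counts by hand and converts them to percentage shares, which makes the class imbalance easier to read. The names `severity_counts` and `severity_share` are illustrative and not part of the notebook; on the loaded data, `data['Injury Severity'].value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Counts copied from the value_counts() output above; `severity_counts` is an illustrative name.
severity_counts = pd.Series({
    'NO APPARENT INJURY': 135177,
    'POSSIBLE INJURY': 16773,
    'SUSPECTED MINOR INJURY': 11471,
    'SUSPECTED SERIOUS INJURY': 1354,
    'FATAL INJURY': 150,
})

# Share of each class as a percentage of all rows.
severity_share = (severity_counts / severity_counts.sum() * 100).round(2)
print(severity_share)
```

An imbalance of this size is worth keeping in mind when choosing evaluation metrics later, since plain accuracy would be dominated by the largest class.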

In [13]:
# Plotting the Injury severity
value_counts = data['Injury Severity'].value_counts().reset_index()
value_counts.columns = ['Injury Severity', 'Count']

fig = px.bar(value_counts, x='Injury Severity', y='Count', title='Injury Severity Distribution')
fig.update_xaxes(type='category')

fig.show()

## 4. Exploratory Data Analysis (EDA)

In [14]:
# Parse the crash timestamp, then split it into separate date and time parts
data['Crash Date/Time'] = pd.to_datetime(data['Crash Date/Time'], format='%m/%d/%Y %I:%M:%S %p')

# Create new columns for the year, month, day of month, and time of day
data['Year'] = data['Crash Date/Time'].dt.year
data['Month'] = data['Crash Date/Time'].dt.month
data['Date'] = data['Crash Date/Time'].dt.day
data['Time'] = data['Crash Date/Time'].dt.time

# Break the time of day into hour, minute, and second
data['Hour'] = data['Crash Date/Time'].dt.hour
data['Minute'] = data['Crash Date/Time'].dt.minute
data['Second'] = data['Crash Date/Time'].dt.second

data.columns

Index(['Report Number', 'Local Case Number', 'Agency Name', 'ACRS Report Type',
       'Crash Date/Time', 'Route Type', 'Road Name', 'Cross-Street Type',
       'Cross-Street Name', 'Off-Road Description', 'Municipality',
       'Related Non-Motorist', 'Collision Type', 'Weather',
       'Surface Condition', 'Light', 'Traffic Control',
       'Driver Substance Abuse', 'Non-Motorist Substance Abuse', 'Person ID',
       'Driver At Fault', 'Injury Severity', 'Circumstance',
       'Driver Distracted By', 'Drivers License State', 'Vehicle ID',
       'Vehicle Damage Extent', 'Vehicle First Impact Location',
       'Vehicle Second Impact Location', 'Vehicle Body Type',
       'Vehicle Movement', 'Vehicle Continuing Dir', 'Vehicle Going Dir',
       'Speed Limit', 'Driverless Vehicle', 'Parked Vehicle', 'Vehicle Year',
       'Vehicle Make', 'Vehicle Model', 'Equipment Problems', 'Latitude',
       'Longitude', 'Location', 'Year', 'Month', 'Date', 'Time', 'Hour',
       'Minute', 'Second'],

In [15]:
del data['Crash Date/Time']

In [16]:
data.head()

Unnamed: 0,Report Number,Local Case Number,Agency Name,ACRS Report Type,Route Type,Road Name,Cross-Street Type,Cross-Street Name,Off-Road Description,Municipality,Related Non-Motorist,Collision Type,Weather,Surface Condition,Light,Traffic Control,Driver Substance Abuse,Non-Motorist Substance Abuse,Person ID,Driver At Fault,Injury Severity,Circumstance,Driver Distracted By,Drivers License State,Vehicle ID,Vehicle Damage Extent,Vehicle First Impact Location,Vehicle Second Impact Location,Vehicle Body Type,Vehicle Movement,Vehicle Continuing Dir,Vehicle Going Dir,Speed Limit,Driverless Vehicle,Parked Vehicle,Vehicle Year,Vehicle Make,Vehicle Model,Equipment Problems,Latitude,Longitude,Location,Year,Month,Date,Time,Hour,Minute,Second
0,MCP3040003N,190026050,Montgomery County Police,Property Damage Crash,,,,,PARKING LOT OF 3215 SPARTAN RD,,,OTHER,CLEAR,,DAYLIGHT,,UNKNOWN,,DE2A24CD-7919-4F8D-BABF-5B75CE12D21E,Yes,NO APPARENT INJURY,,UNKNOWN,,165AD539-A8C8-4004-AF73-B7DCAAA8B3CC,SUPERFICIAL,ONE OCLOCK,ONE OCLOCK,PASSENGER CAR,PARKING,North,North,15,No,No,2004,HONDA,TK,UNKNOWN,39.150044,-77.063089,"(39.15004368, -77.06308884)",2019,5,31,15:00:00,15,0,0
1,MCP1307000K,190024786,Montgomery County Police,Property Damage Crash,,,,,PARKING LOT,,,,CLEAR,,DAYLIGHT,,,,6208FA7B-5DC4-4B54-AD60-0C06DFE2AE81,Yes,NO APPARENT INJURY,,NOT DISTRACTED,XX,10239493-D667-42F9-A3D2-820FE184CB6C,FUNCTIONAL,ONE OCLOCK,ONE OCLOCK,PASSENGER CAR,PARKING,Unknown,Unknown,0,No,No,0,UNK,UNK,,39.199047,-77.250743,"(39.19904667, -77.25074333)",2019,5,24,17:00:00,17,0,0
2,MCP2846008X,230034260,Montgomery County Police,Property Damage Crash,County,SELFRIDGE RD,County,RANDOLPH RD,,,,OTHER,CLEAR,DRY,DARK LIGHTS ON,TRAFFIC SIGNAL,UNKNOWN,,9ACC5A7E-47A1-438F-BF0E-40B0A8632055,Yes,NO APPARENT INJURY,,INATTENTIVE OR LOST IN THOUGHT,MD,8B61B8E0-5473-4C78-A654-6029684ABD03,SUPERFICIAL,SEVEN OCLOCK,SEVEN OCLOCK,PASSENGER CAR,MOVING CONSTANT SPEED,East,East,35,No,No,2003,FORD,TK,NO MISUSE,39.054588,-77.085974,"(39.05458848, -77.08597423)",2023,7,17,10:45:00,10,45,0
3,MCP32610017,230034668,Montgomery County Police,Property Damage Crash,Maryland (State),MUNCASTER MILL RD,County,SHADY GROVE RD,,,,OTHER,,DRY,DARK LIGHTS ON,TRAFFIC SIGNAL,,,E611A3F8-5F7D-465B-8DE0-3814027998F1,No,NO APPARENT INJURY,,NOT DISTRACTED,MD,1A592482-AF1F-49CE-8554-77EF7C55966B,SUPERFICIAL,ELEVEN OCLOCK,ELEVEN OCLOCK,PASSENGER CAR,MAKING RIGHT TURN,South,East,45,No,No,2023,TOYT,CP,NO MISUSE,39.148721,-77.147111,"(39.14872076, -77.14711061)",2023,7,20,23:40:00,23,40,0
4,EJ78520081,230033429,Gaithersburg Police Depar,Property Damage Crash,Municipality,PERRY PKWY,Unknown,ENT TO SHOPPING CENTER,,,,SAME DIR REAR END,,DRY,DAYLIGHT,NO CONTROLS,,,3C7F6951-1701-44DC-9824-88DF4E32352E,Yes,NO APPARENT INJURY,,LOOKED BUT DID NOT SEE,MD,C2EF337E-5881-48ED-9B06-36D0BE00557C,SUPERFICIAL,TWELVE OCLOCK,TWELVE OCLOCK,"MEDIUM/HEAVY TRUCKS 3 AXLES (OVER 10,000LBS (4...",MOVING CONSTANT SPEED,Unknown,Unknown,25,No,No,2001,KENWORTH,TRUCK,,39.149085,-77.210731,"(39.14908542, -77.21073135)",2023,7,13,17:40:00,17,40,0


### Removing 2023 data from the dataset

In [17]:
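# The data for 2023 only runs through September (see Description), so that partial year is excluded.
# .copy() makes the filtered frame independent, so later column assignments do not act on a slice.
data = data[data['Year'] != 2023].copy()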
data.shape

(153181, 49)

### Checking the Null values

In [18]:
data.isna().sum()

Report Number                          0
Local Case Number                      0
Agency Name                            0
ACRS Report Type                       0
Route Type                         14990
Road Name                          14172
Cross-Street Type                  15014
Cross-Street Name                  14177
Off-Road Description              139011
Municipality                      136194
Related Non-Motorist              148373
Collision Type                       509
Weather                            12029
Surface Condition                  17979
Light                               1278
Traffic Control                    23168
Driver Substance Abuse             27857
Non-Motorist Substance Abuse      149369
Person ID                              0
Driver At Fault                        0
Injury Severity                        0
Circumstance                      124875
Driver Distracted By                   0
Drivers License State               8406
Vehicle ID      

### Filling the Null values

In [19]:
# Replace missing values in the categorical columns with an 'Unknown' / 'UNKNOWN' placeholder
data["Collision Type"] = data["Collision Type"].fillna('Unknown')
data["Weather"] = data["Weather"].fillna('UNKNOWN')
data["Road Name"] = data["Road Name"].fillna('Unknown')
data[["Route Type","Cross-Street Type"]] = data[["Route Type","Cross-Street Type"]].fillna('Unknown')
data[["Cross-Street Name","Driver Distracted By"]] = data[["Cross-Street Name","Driver Distracted By"]].fillna('Unknown')
data[["Driver At Fault","Vehicle Damage Extent","Vehicle Movement"]] = data[["Driver At Fault","Vehicle Damage Extent","Vehicle Movement"]].fillna('Unknown')
data[["Surface Condition","Vehicle Body Type","Equipment Problems","Light"]] = data[["Surface Condition","Vehicle Body Type","Equipment Problems","Light"]].fillna('UNKNOWN')

### Injury Severity Each month

In [20]:
injury_monthly_counts = data.groupby(['Month', 'Injury Severity']).size().reset_index(name='Count')

fig = px.bar(injury_monthly_counts, x='Month', y='Count', color='Injury Severity',
             title='Injury Severity Month-wise',template='plotly_dark',text='Count',
             labels={'Month': 'Month', 'Count': 'Number of Accidents', 'Injury Severity': 'Severity'})

fig.update_traces(texttemplate='%{text:.3s}', textposition='outside',width=[.5,.5,.5,.5])
fig.show()

### Accidents recorded each month

In [21]:

monthly_accident_counts = data['Month'].value_counts().sort_index().reset_index()
monthly_accident_counts.columns = ['Month', 'Accident Count']

fig = px.line(monthly_accident_counts, x='Month', y='Accident Count',
              title='Accidents Recorded Each Month',
              labels={'Month': 'Month', 'Accident Count': 'Accident Count'},
             template='ggplot2')

fig.update_xaxes(type='category', tickmode='array', tickvals=list(range(1, 13)),
                 ticktext=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
fig.update_layout(showlegend=True)
fig.show()

### Accidents recorded each year

In [22]:
yearly_accident_counts = data['Year'].value_counts().sort_index().reset_index()
yearly_accident_counts.columns = ['Year', 'Accident Count']

fig = px.line(yearly_accident_counts, x='Year', y='Accident Count',
              title='Accidents Recorded Each Year',
              labels={'Year': 'Year', 'Accident Count': 'Accident Count'},
              template='presentation')

fig.update_xaxes(type='category')
fig.update_traces(line=dict(width=3))
fig.update_layout(showlegend=True)
fig.show()

In [23]:
# Average number of accidents per month in Jan-Jun (first half) and Jul-Dec (second half) of each year

data['HalfYear'] = data['Month'].apply(lambda x: 1 if x <= 6 else 2)

halfyearly_accident_counts = data.groupby(['Year', 'HalfYear']).size().unstack().reset_index()
halfyearly_accident_counts.columns = ['Year', 'First Half', 'Second Half']

halfyearly_accident_counts['Average First Half'] = halfyearly_accident_counts['First Half'] / 6
halfyearly_accident_counts['Average Second Half'] = halfyearly_accident_counts['Second Half'] / 6

fig = px.line(halfyearly_accident_counts, x='Year', y=['Average First Half', 'Average Second Half'],
              title='Average Accidents for First Half and Second Half of Each Year',
              labels={'Year': 'Year', 'value': 'Average Accident Count'},
              template='presentation')

fig.update_xaxes(type='category')
fig.update_traces(line=dict(width=3))

fig.update_layout(showlegend=True)
fig.update_traces(patch={"line": {"color": "blue", "width": 3, "dash": 'dot'}}, selector={"legendgroup": "Average First Half"})
fig.update_traces(patch={"line": {"color": "red", "width": 3, "dash": 'dash'}}, selector={"legendgroup": "Average Second Half"})


fig.show()

In [24]:
# vehicle make vs accidents


In [25]:
data.isna().sum()

Report Number                          0
Local Case Number                      0
Agency Name                            0
ACRS Report Type                       0
Route Type                             0
Road Name                              0
Cross-Street Type                      0
Cross-Street Name                      0
Off-Road Description              139011
Municipality                      136194
Related Non-Motorist              148373
Collision Type                         0
Weather                                0
Surface Condition                      0
Light                                  0
Traffic Control                    23168
Driver Substance Abuse             27857
Non-Motorist Substance Abuse      149369
Person ID                              0
Driver At Fault                        0
Injury Severity                        0
Circumstance                      124875
Driver Distracted By                   0
Drivers License State               8406
Vehicle ID      

In [26]:
data.columns

Index(['Report Number', 'Local Case Number', 'Agency Name', 'ACRS Report Type',
       'Route Type', 'Road Name', 'Cross-Street Type', 'Cross-Street Name',
       'Off-Road Description', 'Municipality', 'Related Non-Motorist',
       'Collision Type', 'Weather', 'Surface Condition', 'Light',
       'Traffic Control', 'Driver Substance Abuse',
       'Non-Motorist Substance Abuse', 'Person ID', 'Driver At Fault',
       'Injury Severity', 'Circumstance', 'Driver Distracted By',
       'Drivers License State', 'Vehicle ID', 'Vehicle Damage Extent',
       'Vehicle First Impact Location', 'Vehicle Second Impact Location',
       'Vehicle Body Type', 'Vehicle Movement', 'Vehicle Continuing Dir',
       'Vehicle Going Dir', 'Speed Limit', 'Driverless Vehicle',
       'Parked Vehicle', 'Vehicle Year', 'Vehicle Make', 'Vehicle Model',
       'Equipment Problems', 'Latitude', 'Longitude', 'Location', 'Year',
       'Month', 'Date', 'Time', 'Hour', 'Minute', 'Second', 'HalfYear'],
      dtype='

Each of the columns below is missing in more than 80% of the rows, so they are removed from the dataset; with so few values present, they are unlikely to contribute to the prediction.

* Off-Road Description              
* Municipality                      
* Related Non-Motorist              
* Non-Motorist Substance Abuse      
* Circumstance    
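
One way to make this selection reproducible is to compute the fraction of missing values per column and drop everything above a cutoff. The sketch below uses a tiny made-up frame and an assumed cutoff of 0.8; the `toy` data and the names `NULL_THRESHOLD`, `null_fraction`, and `mostly_null` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the crash data (values are made up for illustration).
toy = pd.DataFrame({
    'Weather': ['CLEAR'] * 9 + [np.nan],
    'Municipality': ['ROCKVILLE'] + [np.nan] * 9,
    'Circumstance': [np.nan] * 10,
})

NULL_THRESHOLD = 0.8  # assumed cutoff, not taken from the notebook

# Fraction of missing values in each column.
null_fraction = toy.isna().mean()

# Columns whose missing fraction exceeds the cutoff.
mostly_null = null_fraction[null_fraction > NULL_THRESHOLD].index.tolist()

cleaned = toy.drop(columns=mostly_null)
print(mostly_null)
print(cleaned.columns.tolist())
```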

In [27]:
data.drop(columns=['Off-Road Description','Municipality','Related Non-Motorist','Non-Motorist Substance Abuse','Circumstance'],inplace=True)

In [28]:
# adding "_" in between the spaces in the column names
data.columns = data.columns.str.replace(' ', '_')

The following columns are removed for now because they are not expected to have much impact on the prediction. They may be brought back later, depending on the model's evaluation metrics.

- 'Report_Number'
- 'Local_Case_Number'
- 'Collision_Type'
- 'Traffic_Control'
- 'Person_ID'
- 'Drivers_License_State'
- 'Vehicle_ID'
- 'Vehicle_First_Impact_Location'
- 'Vehicle_Second_Impact_Location'
- 'Vehicle_Model'
- 'Vehicle_Continuing_Dir'
- 'Vehicle_Going_Dir'
- 'Vehicle_Make'
- 'Road_Name'
- 'Cross-Street_Type'
- 'Cross-Street_Name'
- 'HalfYear'
- 'ACRS_Report_Type'
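
Because these columns may be needed again, one option is to set them aside rather than discard them. The sketch below is only an illustration on a made-up frame; `toy`, `DEFERRED_COLUMNS`, `deferred`, and `working` are hypothetical names. It keeps the deferred columns in a separate frame that shares the original index, so any of them can be joined back later.

```python
import pandas as pd

# Made-up rows standing in for the cleaned crash data.
toy = pd.DataFrame({
    'Report_Number': ['A1', 'B2', 'C3'],
    'Vehicle_Make': ['HONDA', 'FORD', 'TOYT'],
    'Speed_Limit': [15, 35, 45],
    'Injury_Severity': ['NO APPARENT INJURY', 'POSSIBLE INJURY', 'NO APPARENT INJURY'],
})

# Hypothetical subset of the columns listed above.
DEFERRED_COLUMNS = ['Report_Number', 'Vehicle_Make']

# Set the deferred columns aside, then drop them from the working frame.
deferred = toy[DEFERRED_COLUMNS].copy()
working = toy.drop(columns=DEFERRED_COLUMNS)

# Later, bring one column back by joining on the shared index.
restored = working.join(deferred[['Vehicle_Make']])
print(restored.columns.tolist())
```

This only works as long as the row index is left unchanged between the split and the join.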

In [29]:
data.drop(columns=['Report_Number','Local_Case_Number','Collision_Type','Traffic_Control','Person_ID','Drivers_License_State','Vehicle_ID',
                   'Vehicle_First_Impact_Location','Vehicle_Second_Impact_Location','Vehicle_Model','Vehicle_Continuing_Dir','Vehicle_Going_Dir',
                  'Vehicle_Make','Road_Name','Cross-Street_Type','Cross-Street_Name','HalfYear','ACRS_Report_Type'],inplace=True)

In [30]:
data['Driver_Substance_Abuse'].unique()

array(['UNKNOWN', nan, 'NONE DETECTED', 'ALCOHOL CONTRIBUTED',
       'ALCOHOL PRESENT', 'OTHER', 'ILLEGAL DRUG PRESENT',
       'ILLEGAL DRUG CONTRIBUTED', 'COMBINED SUBSTANCE PRESENT',
       'MEDICATION PRESENT', 'MEDICATION CONTRIBUTED',
       'COMBINATION CONTRIBUTED'], dtype=object)

In [31]:
data["Driver_Substance_Abuse"] = data["Driver_Substance_Abuse"].fillna('UNKNOWN')

In [32]:
data["Driver_Substance_Abuse"].value_counts()

NONE DETECTED                 109207
UNKNOWN                        38260
ALCOHOL PRESENT                 3701
ALCOHOL CONTRIBUTED             1330
ILLEGAL DRUG PRESENT             248
MEDICATION PRESENT               108
ILLEGAL DRUG CONTRIBUTED          90
COMBINED SUBSTANCE PRESENT        79
MEDICATION CONTRIBUTED            60
OTHER                             56
COMBINATION CONTRIBUTED           42
Name: Driver_Substance_Abuse, dtype: int64

In [33]:
data.Driverless_Vehicle.value_counts()

No         152509
Unknown       672
Name: Driverless_Vehicle, dtype: int64

In [34]:
# Driverless_Vehicle is 'No' for all but 672 rows, so it carries almost no information; drop it
del data['Driverless_Vehicle']

In [35]:
data.columns

Index(['Agency_Name', 'Route_Type', 'Weather', 'Surface_Condition', 'Light',
       'Driver_Substance_Abuse', 'Driver_At_Fault', 'Injury_Severity',
       'Driver_Distracted_By', 'Vehicle_Damage_Extent', 'Vehicle_Body_Type',
       'Vehicle_Movement', 'Speed_Limit', 'Parked_Vehicle', 'Vehicle_Year',
       'Equipment_Problems', 'Latitude', 'Longitude', 'Location', 'Year',
       'Month', 'Date', 'Time', 'Hour', 'Minute', 'Second'],
      dtype='object')

In [36]:
data.head()

Unnamed: 0,Agency_Name,Route_Type,Weather,Surface_Condition,Light,Driver_Substance_Abuse,Driver_At_Fault,Injury_Severity,Driver_Distracted_By,Vehicle_Damage_Extent,Vehicle_Body_Type,Vehicle_Movement,Speed_Limit,Parked_Vehicle,Vehicle_Year,Equipment_Problems,Latitude,Longitude,Location,Year,Month,Date,Time,Hour,Minute,Second
0,Montgomery County Police,Unknown,CLEAR,UNKNOWN,DAYLIGHT,UNKNOWN,Yes,NO APPARENT INJURY,UNKNOWN,SUPERFICIAL,PASSENGER CAR,PARKING,15,No,2004,UNKNOWN,39.150044,-77.063089,"(39.15004368, -77.06308884)",2019,5,31,15:00:00,15,0,0
1,Montgomery County Police,Unknown,CLEAR,UNKNOWN,DAYLIGHT,UNKNOWN,Yes,NO APPARENT INJURY,NOT DISTRACTED,FUNCTIONAL,PASSENGER CAR,PARKING,0,No,0,UNKNOWN,39.199047,-77.250743,"(39.19904667, -77.25074333)",2019,5,24,17:00:00,17,0,0
326,Montgomery County Police,County,UNKNOWN,DRY,DAYLIGHT,UNKNOWN,Yes,POSSIBLE INJURY,NOT DISTRACTED,NO DAMAGE,TRANSIT BUS,PARKING,25,No,2014,NO MISUSE,38.997744,-77.032177,"(38.9977444, -77.03217719)",2019,5,17,17:07:00,17,7,0
444,Montgomery County Police,Unknown,CLEAR,UNKNOWN,DAYLIGHT,UNKNOWN,No,NO APPARENT INJURY,NOT DISTRACTED,FUNCTIONAL,PASSENGER CAR,PARKED,0,Yes,2017,UNKNOWN,39.119345,-77.165318,"(39.119345, -77.16531833)",2019,6,1,10:29:00,10,29,0
1038,Montgomery County Police,Unknown,CLEAR,UNKNOWN,DAYLIGHT,UNKNOWN,No,NO APPARENT INJURY,NOT DISTRACTED,SUPERFICIAL,PASSENGER CAR,PARKED,10,Yes,2009,UNKNOWN,39.012262,-77.199998,"(39.01226167, -77.19999833)",2019,6,14,12:08:00,12,8,0


In [37]:
data.isna().sum()

Agency_Name               0
Route_Type                0
Weather                   0
Surface_Condition         0
Light                     0
Driver_Substance_Abuse    0
Driver_At_Fault           0
Injury_Severity           0
Driver_Distracted_By      0
Vehicle_Damage_Extent     0
Vehicle_Body_Type         0
Vehicle_Movement          0
Speed_Limit               0
Parked_Vehicle            0
Vehicle_Year              0
Equipment_Problems        0
Latitude                  0
Longitude                 0
Location                  0
Year                      0
Month                     0
Date                      0
Time                      0
Hour                      0
Minute                    0
Second                    0
dtype: int64

In [38]:
# Create a new DataFrame that counts the combinations of Weather and Surface Condition
weather_surface_counts = data.groupby(['Weather', 'Surface_Condition']).size().reset_index(name='Count')

# Create a pie chart using Plotly Express
fig = px.pie(weather_surface_counts, names='Weather', values='Count', title='Accidents by Weather')
fig.update_traces(textposition='inside',textinfo='label+percent+value',pull=[0.02, 0.02, 0.02, 0.02], opacity=0.7, rotation=180)
fig.show()

In [39]:
# Count the number of accidents for each Surface Condition
surface_condition_counts = data['Surface_Condition'].value_counts().reset_index()
surface_condition_counts.columns = ['Surface_Condition', 'Accident_Count']

# Sort the values by Accident Count and select the top 6
top_6_surface_conditions = surface_condition_counts.head(6)

# Create a pie chart using Plotly Express with annotations
fig = px.pie(top_6_surface_conditions, names='Surface_Condition', values='Accident_Count',
             title='Top 6 Surface Conditions for Accidents')

# Show the label, share, and accident count inside each slice
fig.update_traces(textposition='inside',textinfo='label+percent+value',pull=[0.02, 0.02, 0.02, 0.02], opacity=0.7, rotation=180)

fig.show()

In [40]:
data['Agency_Name'].unique()

array(['Montgomery County Police', 'Gaithersburg Police Depar',
       'Takoma Park Police Depart', 'Rockville Police Departme',
       'Maryland-National Capital', 'MONTGOMERY', 'TAKOMA', 'ROCKVILLE',
       'GAITHERSBURG', 'MCPARK'], dtype=object)

In [41]:
data['Injury_Severity'].unique()

array(['NO APPARENT INJURY', 'POSSIBLE INJURY', 'SUSPECTED MINOR INJURY',
       'SUSPECTED SERIOUS INJURY', 'FATAL INJURY'], dtype=object)

In [42]:
# Group the data by 'Agency Name' and 'Injury Severity' and count the occurrences
counts = data.groupby(['Agency_Name', 'Injury_Severity']).size().reset_index(name='Count')

# Create a scatter plot using Plotly Express
fig = px.scatter(counts, x='Agency_Name', y='Injury_Severity', size='Count',color='Injury_Severity',
                 title='Count of Injury Severity by Agency Names',
                 labels={'Agency_Name': 'Agency Name', 'Injury_Severity': 'Injury Severity'},template='plotly_dark',
                 size_max=50)

# Customize the layout
fig.update_layout(xaxis_title='Agency Name', yaxis_title='Injury Severity')
fig.show()


In [43]:
# Flag each record by hour of day: 0 = day (06:00-18:59), 1 = night (all other hours)
data['Day/Night'] = data['Hour'].apply(lambda x: 0 if 6 <= x <= 18 else 1)
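
The same flag can be computed without a Python-level `apply`. This is a minimal sketch on made-up hours, assuming the same day window of hours 6 to 18 inclusive; `toy_hours`, `is_day`, and `day_night` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up hours covering both sides of the day window.
toy_hours = pd.Series([0, 5, 6, 12, 18, 19, 23])

# between() is inclusive on both ends by default, matching 6 <= x <= 18.
is_day = toy_hours.between(6, 18)

# 0 for day, 1 for night, as in the cell above.
day_night = np.where(is_day, 0, 1)
print(day_night.tolist())
```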

In [44]:
day_night_counts = data.groupby(['Year', 'Day/Night']).size().reset_index(name='Count')
day_night_counts

Unnamed: 0,Year,Day/Night,Count
0,2015,0,15692
1,2015,1,4594
2,2016,0,16950
3,2016,1,4828
4,2017,0,16805
5,2017,1,4736
6,2018,0,16552
7,2018,1,4490
8,2019,0,16270
9,2019,1,4671


In [45]:
print('Sum of all the accidents reported in daylight: ',day_night_counts[day_night_counts['Day/Night']==0]['Count'].sum())

Sum of all the accidents reported in daylight:  117706


In [46]:
print('Sum of all the accidents reported in the night: ',day_night_counts[day_night_counts['Day/Night']==1]['Count'].sum())

Sum of all the accidents reported in the night:  35475


In [47]:
# Total the yearly day (0) and night (1) counts computed above
day_sum = day_night_counts[day_night_counts['Day/Night'] == 0]['Count'].sum()
night_sum = day_night_counts[day_night_counts['Day/Night'] == 1]['Count'].sum()

# Create a DataFrame for the sums
sums_df = pd.DataFrame({'Day/Night': ['Daylight', 'Night'], 'Sum of Accidents': [day_sum, night_sum]})

# Create a bar graph
fig = px.bar(sums_df, x='Day/Night', y='Sum of Accidents',
             labels={'Sum of Accidents': 'Total Accidents'},text='Sum of Accidents',template='ggplot2',
             title='Total Accidents Reported in Daylight vs. Night')

fig.update_traces(texttemplate='%{text:.3s}', textposition='outside',width=[.5,.5,.5,.5])

fig.show()

In [48]:
import plotly.graph_objects as go

# Split the yearly counts into day (0) and night (1) series
day_counts = day_night_counts[day_night_counts['Day/Night'] == 0]
night_counts = day_night_counts[day_night_counts['Day/Night'] == 1]

# Start from an empty figure so that each series is drawn exactly once
fig = go.Figure()

fig.add_trace(
    go.Scatter(x=day_counts['Year'], y=day_counts['Count'], mode='lines',
               name='Day', line=dict(color='blue'))
)

fig.add_trace(
    go.Scatter(x=night_counts['Year'], y=night_counts['Count'], mode='lines',
               name='Night', line=dict(color='red'))
)

fig.update_layout(title='Accidents by Day/Night per Year',
                  xaxis_title='Year', yaxis_title='Number of Accidents')
fig.show()

In [49]:
data['Speed_Limit'].unique()

array([15,  0, 25, 10, 40,  5, 35, 20, 30, 50, 45, 55, 65, 60, 70])

In [50]:
# Number of accidents per speed limit, broken down by year
speed_limit_counts = data.groupby(['Year', 'Speed_Limit']).size().reset_index(name='Count')

# Create a Plotly bar graph to visualize the relationship
fig = px.bar(speed_limit_counts, x='Speed_Limit', y='Count', color='Year',
             labels={'Count': 'Number of Accidents', 'Speed_Limit': 'Speed Limit (mph)'},
             title='Number of Accidents vs. Speed Limit (Year-wise)')



fig.show()

## Interpretations
- Most accidents happened during the day: 117,706 of the 153,181 records fall between 06:00 and 18:59. This likely reflects more people travelling during the day than at night.
- Most accidents happened in clear weather on a dry road surface.
- Most accidents happened on roads with speed limits of 25-40 mph.
- Most accidents resulted in no apparent injury.
- The number of accidents decreased during the COVID period (2019-2020) and then trended upward through 2022.
- Most accidents happen in the second half of the year, particularly in the fall and early winter months (Sept-Dec), and most are recorded by the Montgomery County Police agency.
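
The day versus night comparison can be put in numbers using the totals printed earlier (117706 daytime and 35475 night-time records). The snippet below is a small arithmetic check and not part of the original notebook. The day window (hours 6 to 18 inclusive) is longer than the night window, so a per-hour rate is a fairer comparison than the raw totals.

```python
# Totals copied from the printed sums above.
day_total = 117706
night_total = 35475
all_records = day_total + night_total

# Share of records that fall in the day window.
day_share = day_total / all_records * 100

# Hours 6..18 inclusive are flagged as day; the remaining hours are night.
day_hours = len(range(6, 19))
night_hours = 24 - day_hours

# Records per clock hour in each window.
day_per_hour = day_total / day_hours
night_per_hour = night_total / night_hours

print(f"Day share: {day_share:.1f}%")
print(f"Records per hour, day vs night: {day_per_hour:.0f} vs {night_per_hour:.0f}")
```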

In [51]:
data.columns

Index(['Agency_Name', 'Route_Type', 'Weather', 'Surface_Condition', 'Light',
       'Driver_Substance_Abuse', 'Driver_At_Fault', 'Injury_Severity',
       'Driver_Distracted_By', 'Vehicle_Damage_Extent', 'Vehicle_Body_Type',
       'Vehicle_Movement', 'Speed_Limit', 'Parked_Vehicle', 'Vehicle_Year',
       'Equipment_Problems', 'Latitude', 'Longitude', 'Location', 'Year',
       'Month', 'Date', 'Time', 'Hour', 'Minute', 'Second', 'Day/Night'],
      dtype='object')

In [52]:
# Treat implausible model years (before 1950 or after 2023) as missing by setting them to 0
data.loc[data['Vehicle_Year'] < 1950, 'Vehicle_Year'] = 0
data.loc[data['Vehicle_Year'] > 2023, 'Vehicle_Year'] = 0
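
Setting out-of-range years to 0 keeps the column numeric but leaves a sentinel value that later steps have to remember. As an alternative sketch (a swapped-in approach, not what the notebook does), the same range check can mark those years as missing with `Series.where`; the `toy_years` values below are made up.

```python
import pandas as pd

# Made-up model years, including values outside the 1950-2023 range.
toy_years = pd.Series([2004, 0, 1900, 2015, 2200, 2023])

# Keep years inside the range used above; everything else becomes NaN.
valid = toy_years.between(1950, 2023)
clean_years = toy_years.where(valid)

print(clean_years.tolist())
print(int(clean_years.isna().sum()))
```

Whether NaN or 0 is more convenient depends on how the downstream model handles missing values.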

In [53]:
#Drop if vehicle year is not present?
data.Vehicle_Year.unique()

array([2004,    0, 2014, 2017, 2009, 1999, 2007, 2019, 2013, 2005, 2020,
       2021, 2006, 2015, 2012, 2018, 2016, 2000, 2003, 2010, 2008, 2022,
       2011, 1997, 1998, 2002, 1994, 2001, 1995, 1985, 1996, 1992, 1990,
       2023, 1989, 1991, 1988, 1993, 1987, 1969, 1980, 1971, 1978, 1966,
       1967, 1975, 1986, 1982, 1965, 1983, 1968, 1972, 1984, 1970, 1960,
       1979, 1976, 1961, 1977, 1955, 1974, 1963, 1981, 1959])

In [54]:
data.Vehicle_Year.value_counts()

2015    10689
2014    10500
2013     9900
2016     9877
2012     8487
        ...  
1965        2
1961        1
1960        1
1967        1
1959        1
Name: Vehicle_Year, Length: 64, dtype: int64

In [56]:
data.isnull().sum()

Agency_Name               0
Route_Type                0
Weather                   0
Surface_Condition         0
Light                     0
Driver_Substance_Abuse    0
Driver_At_Fault           0
Injury_Severity           0
Driver_Distracted_By      0
Vehicle_Damage_Extent     0
Vehicle_Body_Type         0
Vehicle_Movement          0
Speed_Limit               0
Parked_Vehicle            0
Vehicle_Year              0
Equipment_Problems        0
Latitude                  0
Longitude                 0
Location                  0
Year                      0
Month                     0
Date                      0
Time                      0
Hour                      0
Minute                    0
Second                    0
Day/Night                 0
dtype: int64
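
A zero in every row of `isnull().sum()` only shows that no cell is `NaN`. Rows sampled in this notebook contain string placeholders such as `UNKNOWN`, which pandas does not count as missing. A rough sketch for counting such placeholders per column follows; the frame is made up and `placeholders` is a hypothetical set chosen for illustration.

```python
import pandas as pd

# Made-up rows that mimic the categorical columns of the crash data.
toy = pd.DataFrame({
    "Weather": ["CLEAR", "UNKNOWN", "RAINING", "CLEAR"],
    "Surface_Condition": ["UNKNOWN", "DRY", "UNKNOWN", "WET"],
    "Driver_At_Fault": ["Yes", "No", "Unknown", "Yes"],
})

# Hypothetical set of labels to treat as "effectively missing".
placeholders = {"UNKNOWN"}

# Compare case-insensitively so that "Unknown" and "UNKNOWN" are both counted.
placeholder_counts = toy.apply(
    lambda col: col.str.upper().isin(placeholders).sum()
)

print(placeholder_counts)
```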

In [57]:
data.shape

(153181, 27)

In [58]:
# Null values have been removed and data discrepancies rectified
data.sample(10)

Unnamed: 0,Agency_Name,Route_Type,Weather,Surface_Condition,Light,Driver_Substance_Abuse,Driver_At_Fault,Injury_Severity,Driver_Distracted_By,Vehicle_Damage_Extent,Vehicle_Body_Type,Vehicle_Movement,Speed_Limit,Parked_Vehicle,Vehicle_Year,Equipment_Problems,Latitude,Longitude,Location,Year,Month,Date,Time,Hour,Minute,Second,Day/Night
137971,Gaithersburg Police Depar,Interstate (State),CLEAR,UNKNOWN,DARK LIGHTS ON,ALCOHOL PRESENT,Yes,NO APPARENT INJURY,UNKNOWN,SUPERFICIAL,PASSENGER CAR,ACCELERATING,55,No,2004,UNKNOWN,39.115003,-77.192431,"(39.11500319, -77.1924305)",2017,3,27,23:46:00,23,46,0,1
14630,Montgomery County Police,Municipality,CLEAR,DRY,DARK LIGHTS ON,UNKNOWN,Yes,SUSPECTED SERIOUS INJURY,UNKNOWN,DESTROYED,PASSENGER CAR,UNKNOWN,25,No,1997,UNKNOWN,38.972893,-77.009936,"(38.97289299, -77.00993622)",2020,7,16,22:49:00,22,49,0,1
133069,Gaithersburg Police Depar,Maryland (State),CLEAR,DRY,DAYLIGHT,NONE DETECTED,No,NO APPARENT INJURY,NOT DISTRACTED,DISABLING,PASSENGER CAR,STOPPED IN TRAFFIC LANE,35,No,2014,NO MISUSE,39.131778,-77.188462,"(39.13177833, -77.18846167)",2016,8,15,14:30:00,14,30,0,0
49874,Montgomery County Police,Maryland (State),CLEAR,DRY,DAYLIGHT,NONE DETECTED,Unknown,POSSIBLE INJURY,NOT DISTRACTED,DISABLING,PASSENGER CAR,MAKING U TURN,35,No,2016,NO MISUSE,39.03009,-77.075268,"(39.03009, -77.07526833)",2021,1,21,14:00:00,14,0,0,0
140005,Montgomery County Police,County,RAINING,WET,DAYLIGHT,NONE DETECTED,No,POSSIBLE INJURY,NOT DISTRACTED,FUNCTIONAL,(SPORT) UTILITY VEHICLE,MOVING CONSTANT SPEED,30,No,2011,NO MISUSE,39.071175,-77.101927,"(39.071175, -77.10192667)",2016,9,19,09:27:00,9,27,0,0
45831,Rockville Police Departme,Municipality,CLEAR,DRY,DAYLIGHT,NONE DETECTED,Yes,NO APPARENT INJURY,LOOKED BUT DID NOT SEE,SUPERFICIAL,PASSENGER CAR,MOVING CONSTANT SPEED,20,No,2009,NO MISUSE,39.060627,-77.166813,"(39.06062667, -77.1668125)",2016,9,22,08:22:00,8,22,0,0
65267,Montgomery County Police,Maryland (State),CLEAR,DRY,DAYLIGHT,UNKNOWN,Yes,NO APPARENT INJURY,LOOKED BUT DID NOT SEE,SUPERFICIAL,PASSENGER CAR,MAKING LEFT TURN,35,No,2011,NO MISUSE,39.043543,-77.051802,"(39.04354333, -77.05180167)",2021,9,15,13:47:00,13,47,0,0
107972,Montgomery County Police,Maryland (State),CLEAR,DRY,DARK LIGHTS ON,NONE DETECTED,No,POSSIBLE INJURY,NOT DISTRACTED,FUNCTIONAL,PASSENGER CAR,STOPPED IN TRAFFIC LANE,35,No,1998,NO MISUSE,39.042342,-77.052052,"(39.04234167, -77.05205167)",2018,3,18,20:20:00,20,20,0,1
51362,Montgomery County Police,County,UNKNOWN,DRY,DAYLIGHT,NONE DETECTED,No,NO APPARENT INJURY,NOT DISTRACTED,FUNCTIONAL,PASSENGER CAR,ACCELERATING,20,No,2016,NO MISUSE,39.192713,-77.27732,"(39.19271333, -77.27732)",2017,5,8,08:54:00,8,54,0,0
6506,Montgomery County Police,Unknown,CLOUDY,UNKNOWN,DAYLIGHT,NONE DETECTED,No,NO APPARENT INJURY,NOT DISTRACTED,FUNCTIONAL,POLICE VEHICLE/NON EMERGENCY,MOVING CONSTANT SPEED,10,No,2015,NO MISUSE,39.164197,-77.205753,"(39.16419667, -77.20575333)",2021,3,25,15:30:00,15,30,0,0


## Assigning integer codes 0-4 to the target variable

In [59]:
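# Integer codes follow the alphabetical order of the labels, not injury severity.
class_mapping = {
    "FATAL INJURY": 0,
    "NO APPARENT INJURY": 1,
    "POSSIBLE INJURY": 2,
    "SUSPECTED MINOR INJURY": 3,
    "SUSPECTED SERIOUS INJURY": 4,
}

# Map each severity label to its integer code in a new 'Injury' column.
data['Injury'] = data['Injury_Severity'].map(class_mapping)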

data.head()

Unnamed: 0,Agency_Name,Route_Type,Weather,Surface_Condition,Light,Driver_Substance_Abuse,Driver_At_Fault,Injury_Severity,Driver_Distracted_By,Vehicle_Damage_Extent,Vehicle_Body_Type,Vehicle_Movement,Speed_Limit,Parked_Vehicle,Vehicle_Year,Equipment_Problems,Latitude,Longitude,Location,Year,Month,Date,Time,Hour,Minute,Second,Day/Night,Injury
0,Montgomery County Police,Unknown,CLEAR,UNKNOWN,DAYLIGHT,UNKNOWN,Yes,NO APPARENT INJURY,UNKNOWN,SUPERFICIAL,PASSENGER CAR,PARKING,15,No,2004,UNKNOWN,39.150044,-77.063089,"(39.15004368, -77.06308884)",2019,5,31,15:00:00,15,0,0,0,1
1,Montgomery County Police,Unknown,CLEAR,UNKNOWN,DAYLIGHT,UNKNOWN,Yes,NO APPARENT INJURY,NOT DISTRACTED,FUNCTIONAL,PASSENGER CAR,PARKING,0,No,0,UNKNOWN,39.199047,-77.250743,"(39.19904667, -77.25074333)",2019,5,24,17:00:00,17,0,0,0,1
326,Montgomery County Police,County,UNKNOWN,DRY,DAYLIGHT,UNKNOWN,Yes,POSSIBLE INJURY,NOT DISTRACTED,NO DAMAGE,TRANSIT BUS,PARKING,25,No,2014,NO MISUSE,38.997744,-77.032177,"(38.9977444, -77.03217719)",2019,5,17,17:07:00,17,7,0,0,2
444,Montgomery County Police,Unknown,CLEAR,UNKNOWN,DAYLIGHT,UNKNOWN,No,NO APPARENT INJURY,NOT DISTRACTED,FUNCTIONAL,PASSENGER CAR,PARKED,0,Yes,2017,UNKNOWN,39.119345,-77.165318,"(39.119345, -77.16531833)",2019,6,1,10:29:00,10,29,0,0,1
1038,Montgomery County Police,Unknown,CLEAR,UNKNOWN,DAYLIGHT,UNKNOWN,No,NO APPARENT INJURY,NOT DISTRACTED,SUPERFICIAL,PASSENGER CAR,PARKED,10,Yes,2009,UNKNOWN,39.012262,-77.199998,"(39.01226167, -77.19999833)",2019,6,14,12:08:00,12,8,0,0,1
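
The codes in the `Injury` column follow the alphabetical order of the labels, so the integers do not increase with severity (`FATAL INJURY` is 0 while `NO APPARENT INJURY` is 1). That is fine for a purely nominal multi-class target. If an ordered target were wanted instead, one possible sketch is shown below. The severity ordering is an assumption based on the wording of the labels, and `severity_order` and `Injury_Ordinal` are hypothetical names that do not appear in the notebook.

```python
import pandas as pd

# Assumed ordering from least to most severe (not stated in the source data).
severity_order = [
    "NO APPARENT INJURY",
    "POSSIBLE INJURY",
    "SUSPECTED MINOR INJURY",
    "SUSPECTED SERIOUS INJURY",
    "FATAL INJURY",
]

# Made-up rows for illustration.
toy = pd.DataFrame(
    {"Injury_Severity": ["POSSIBLE INJURY", "FATAL INJURY", "NO APPARENT INJURY"]}
)

# An ordered categorical yields codes 0..4 that follow the list above.
ordered = pd.Categorical(
    toy["Injury_Severity"], categories=severity_order, ordered=True
)
toy["Injury_Ordinal"] = ordered.codes

print(toy)
```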


In [60]:
data.columns

Index(['Agency_Name', 'Route_Type', 'Weather', 'Surface_Condition', 'Light',
       'Driver_Substance_Abuse', 'Driver_At_Fault', 'Injury_Severity',
       'Driver_Distracted_By', 'Vehicle_Damage_Extent', 'Vehicle_Body_Type',
       'Vehicle_Movement', 'Speed_Limit', 'Parked_Vehicle', 'Vehicle_Year',
       'Equipment_Problems', 'Latitude', 'Longitude', 'Location', 'Year',
       'Month', 'Date', 'Time', 'Hour', 'Minute', 'Second', 'Day/Night',
       'Injury'],
      dtype='object')

## One-Hot Encoding

In [61]:
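# One-hot encode the categorical columns; numeric and date/time columns are left unchanged.
categorical_cols = [
    'Agency_Name', 'Route_Type', 'Weather', 'Surface_Condition', 'Light',
    'Driver_Substance_Abuse', 'Driver_At_Fault', 'Driver_Distracted_By',
    'Vehicle_Damage_Extent', 'Vehicle_Body_Type', 'Vehicle_Movement',
    'Parked_Vehicle', 'Equipment_Problems',
]
data = pd.get_dummies(data, columns=categorical_cols, dtype=int)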
data.head()

Unnamed: 0,Injury_Severity,Speed_Limit,Vehicle_Year,Latitude,Longitude,Location,Year,Month,Date,Time,Hour,Minute,Second,Day/Night,Injury,Agency_Name_GAITHERSBURG,Agency_Name_Gaithersburg Police Depar,Agency_Name_MCPARK,Agency_Name_MONTGOMERY,Agency_Name_Maryland-National Capital,Agency_Name_Montgomery County Police,Agency_Name_ROCKVILLE,Agency_Name_Rockville Police Departme,Agency_Name_TAKOMA,Agency_Name_Takoma Park Police Depart,Route_Type_County,Route_Type_Government,Route_Type_Interstate (State),Route_Type_Maryland (State),Route_Type_Municipality,Route_Type_Other Public Roadway,Route_Type_Ramp,Route_Type_Service Road,Route_Type_US (State),Route_Type_Unknown,"Weather_BLOWING SAND, SOIL, DIRT",Weather_BLOWING SNOW,Weather_CLEAR,Weather_CLOUDY,Weather_FOGGY,Weather_OTHER,Weather_RAINING,Weather_SEVERE WINDS,Weather_SLEET,Weather_SNOW,Weather_UNKNOWN,Weather_WINTRY MIX,Surface_Condition_DRY,Surface_Condition_ICE,"Surface_Condition_MUD, DIRT, GRAVEL",Surface_Condition_OIL,Surface_Condition_OTHER,Surface_Condition_SAND,Surface_Condition_SLUSH,Surface_Condition_SNOW,Surface_Condition_UNKNOWN,Surface_Condition_WATER(STANDING/MOVING),Surface_Condition_WET,Light_DARK -- UNKNOWN LIGHTING,Light_DARK LIGHTS ON,Light_DARK NO LIGHTS,Light_DAWN,Light_DAYLIGHT,Light_DUSK,Light_OTHER,Light_UNKNOWN,Driver_Substance_Abuse_ALCOHOL CONTRIBUTED,Driver_Substance_Abuse_ALCOHOL PRESENT,Driver_Substance_Abuse_COMBINATION CONTRIBUTED,Driver_Substance_Abuse_COMBINED SUBSTANCE PRESENT,Driver_Substance_Abuse_ILLEGAL DRUG CONTRIBUTED,Driver_Substance_Abuse_ILLEGAL DRUG PRESENT,Driver_Substance_Abuse_MEDICATION CONTRIBUTED,Driver_Substance_Abuse_MEDICATION PRESENT,Driver_Substance_Abuse_NONE DETECTED,Driver_Substance_Abuse_OTHER,Driver_Substance_Abuse_UNKNOWN,Driver_At_Fault_No,Driver_At_Fault_Unknown,Driver_At_Fault_Yes,Driver_Distracted_By_ADJUSTING AUDIO AND OR CLIMATE CONTROLS,Driver_Distracted_By_BY MOVING OBJECT IN VEHICLE,Driver_Distracted_By_BY OTHER 
OCCUPANTS,Driver_Distracted_By_DIALING CELLULAR PHONE,Driver_Distracted_By_DISTRACTED BY OUTSIDE PERSON OBJECT OR EVENT,Driver_Distracted_By_EATING OR DRINKING,Driver_Distracted_By_INATTENTIVE OR LOST IN THOUGHT,Driver_Distracted_By_LOOKED BUT DID NOT SEE,Driver_Distracted_By_NO DRIVER PRESENT,Driver_Distracted_By_NOT DISTRACTED,Driver_Distracted_By_OTHER CELLULAR PHONE RELATED,Driver_Distracted_By_OTHER DISTRACTION,Driver_Distracted_By_OTHER ELECTRONIC DEVICE (NAVIGATIONAL PALM PILOT),Driver_Distracted_By_SMOKING RELATED,Driver_Distracted_By_TALKING OR LISTENING TO CELLULAR PHONE,Driver_Distracted_By_TEXTING FROM A CELLULAR PHONE,Driver_Distracted_By_UNKNOWN,Driver_Distracted_By_USING DEVICE OBJECT BROUGHT INTO VEHICLE,Driver_Distracted_By_USING OTHER DEVICE CONTROLS INTEGRAL TO VEHICLE,Vehicle_Damage_Extent_DESTROYED,Vehicle_Damage_Extent_DISABLING,Vehicle_Damage_Extent_FUNCTIONAL,Vehicle_Damage_Extent_NO DAMAGE,Vehicle_Damage_Extent_OTHER,Vehicle_Damage_Extent_SUPERFICIAL,Vehicle_Damage_Extent_UNKNOWN,Vehicle_Damage_Extent_Unknown,Vehicle_Body_Type_(SPORT) UTILITY VEHICLE,Vehicle_Body_Type_ALL TERRAIN VEHICLE (ATV),Vehicle_Body_Type_AMBULANCE/EMERGENCY,Vehicle_Body_Type_AMBULANCE/NON EMERGENCY,Vehicle_Body_Type_AUTOCYCLE,"Vehicle_Body_Type_CARGO VAN/LIGHT TRUCK 2 AXLES (OVER 10,000LBS (4,536 KG))",Vehicle_Body_Type_CROSS COUNTRY BUS,Vehicle_Body_Type_FARM VEHICLE,Vehicle_Body_Type_FIRE VEHICLE/EMERGENCY,Vehicle_Body_Type_FIRE VEHICLE/NON EMERGENCY,Vehicle_Body_Type_LIMOUSINE,Vehicle_Body_Type_LOW SPEED VEHICLE,"Vehicle_Body_Type_MEDIUM/HEAVY TRUCKS 3 AXLES (OVER 10,000LBS (4,536KG))",Vehicle_Body_Type_MOPED,Vehicle_Body_Type_MOTORCYCLE,Vehicle_Body_Type_OTHER,Vehicle_Body_Type_OTHER BUS,"Vehicle_Body_Type_OTHER LIGHT TRUCKS (10,000LBS (4,536KG) OR LESS)",Vehicle_Body_Type_PASSENGER CAR,Vehicle_Body_Type_PICKUP TRUCK,Vehicle_Body_Type_POLICE VEHICLE/EMERGENCY,Vehicle_Body_Type_POLICE VEHICLE/NON EMERGENCY,Vehicle_Body_Type_RECREATIONAL 
VEHICLE,Vehicle_Body_Type_SCHOOL BUS,Vehicle_Body_Type_SNOWMOBILE,Vehicle_Body_Type_STATION WAGON,Vehicle_Body_Type_TRANSIT BUS,Vehicle_Body_Type_TRUCK TRACTOR,Vehicle_Body_Type_UNKNOWN,Vehicle_Body_Type_VAN,Vehicle_Movement_ACCELERATING,Vehicle_Movement_BACKING,Vehicle_Movement_CHANGING LANES,Vehicle_Movement_DRIVERLESS MOVING VEH.,Vehicle_Movement_ENTERING TRAFFIC LANE,Vehicle_Movement_LEAVING TRAFFIC LANE,Vehicle_Movement_MAKING LEFT TURN,Vehicle_Movement_MAKING RIGHT TURN,Vehicle_Movement_MAKING U TURN,Vehicle_Movement_MOVING CONSTANT SPEED,Vehicle_Movement_NEGOTIATING A CURVE,Vehicle_Movement_OTHER,Vehicle_Movement_PARKED,Vehicle_Movement_PARKING,Vehicle_Movement_PASSING,Vehicle_Movement_RIGHT TURN ON RED,Vehicle_Movement_SKIDDING,Vehicle_Movement_SLOWING OR STOPPING,Vehicle_Movement_STARTING FROM LANE,Vehicle_Movement_STARTING FROM PARKED,Vehicle_Movement_STOPPED IN TRAFFIC LANE,Vehicle_Movement_UNKNOWN,Vehicle_Movement_Unknown,Parked_Vehicle_No,Parked_Vehicle_Yes,Equipment_Problems_AIR BAG FAILED,Equipment_Problems_BELT(S) MISUSED,Equipment_Problems_BELTS/ANCHORS BROKE,Equipment_Problems_FACING WRONG WAY,Equipment_Problems_NO MISUSE,Equipment_Problems_NOT STREPPED RIGHT,Equipment_Problems_OTHER,Equipment_Problems_SIZE/TYPE IMPROPER,Equipment_Problems_STRAP/TETHER LOOSE,Equipment_Problems_UNKNOWN
0,NO APPARENT INJURY,15,2004,39.150044,-77.063089,"(39.15004368, -77.06308884)",2019,5,31,15:00:00,15,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1
1,NO APPARENT INJURY,0,0,39.199047,-77.250743,"(39.19904667, -77.25074333)",2019,5,24,17:00:00,17,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1
326,POSSIBLE INJURY,25,2014,38.997744,-77.032177,"(38.9977444, -77.03217719)",2019,5,17,17:07:00,17,7,0,0,2,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0
444,NO APPARENT INJURY,0,2017,39.119345,-77.165318,"(39.119345, -77.16531833)",2019,6,1,10:29:00,10,29,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1
1038,NO APPARENT INJURY,10,2009,39.012262,-77.199998,"(39.01226167, -77.19999833)",2019,6,14,12:08:00,12,8,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1


In [62]:
data.shape

(153181, 172)
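
The encoded header above contains pairs such as `Vehicle_Damage_Extent_UNKNOWN` / `Vehicle_Damage_Extent_Unknown` and `Vehicle_Movement_UNKNOWN` / `Vehicle_Movement_Unknown`, which suggests the same category was recorded with different capitalization. Below is a minimal sketch of normalizing case before encoding, using made-up rows. Whether these labels truly mean the same thing in the source data is an assumption worth checking first.

```python
import pandas as pd

# Made-up rows containing two spellings of the same label.
toy = pd.DataFrame({
    "Vehicle_Movement": ["PARKED", "Unknown", "UNKNOWN", "BACKING"],
    "Speed_Limit": [10, 25, 35, 15],
})

# Without normalization, "Unknown" and "UNKNOWN" become two separate columns.
raw = pd.get_dummies(toy, columns=["Vehicle_Movement"], dtype=int)

# Strip whitespace and upper-case the labels, then encode.
normalized = toy.assign(
    Vehicle_Movement=toy["Vehicle_Movement"].str.strip().str.upper()
)
encoded = pd.get_dummies(normalized, columns=["Vehicle_Movement"], dtype=int)

print(raw.shape, encoded.shape)
```

Applied to the real frame, the same normalization would be expected to reduce the 172 columns slightly, since duplicate spellings collapse into one indicator each.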

In [63]:
# The text label is redundant now that 'Injury' holds the encoded target.
del data['Injury_Severity']

In [64]:
data.shape

(153181, 171)

In [65]:
data.reset_index(drop=True,inplace=True)

In [66]:
data.sample(10)

Unnamed: 0,Speed_Limit,Vehicle_Year,Latitude,Longitude,Location,Year,Month,Date,Time,Hour,Minute,Second,Day/Night,Injury,Agency_Name_GAITHERSBURG,Agency_Name_Gaithersburg Police Depar,Agency_Name_MCPARK,Agency_Name_MONTGOMERY,Agency_Name_Maryland-National Capital,Agency_Name_Montgomery County Police,Agency_Name_ROCKVILLE,Agency_Name_Rockville Police Departme,Agency_Name_TAKOMA,Agency_Name_Takoma Park Police Depart,Route_Type_County,Route_Type_Government,Route_Type_Interstate (State),Route_Type_Maryland (State),Route_Type_Municipality,Route_Type_Other Public Roadway,Route_Type_Ramp,Route_Type_Service Road,Route_Type_US (State),Route_Type_Unknown,"Weather_BLOWING SAND, SOIL, DIRT",Weather_BLOWING SNOW,Weather_CLEAR,Weather_CLOUDY,Weather_FOGGY,Weather_OTHER,Weather_RAINING,Weather_SEVERE WINDS,Weather_SLEET,Weather_SNOW,Weather_UNKNOWN,Weather_WINTRY MIX,Surface_Condition_DRY,Surface_Condition_ICE,"Surface_Condition_MUD, DIRT, GRAVEL",Surface_Condition_OIL,Surface_Condition_OTHER,Surface_Condition_SAND,Surface_Condition_SLUSH,Surface_Condition_SNOW,Surface_Condition_UNKNOWN,Surface_Condition_WATER(STANDING/MOVING),Surface_Condition_WET,Light_DARK -- UNKNOWN LIGHTING,Light_DARK LIGHTS ON,Light_DARK NO LIGHTS,Light_DAWN,Light_DAYLIGHT,Light_DUSK,Light_OTHER,Light_UNKNOWN,Driver_Substance_Abuse_ALCOHOL CONTRIBUTED,Driver_Substance_Abuse_ALCOHOL PRESENT,Driver_Substance_Abuse_COMBINATION CONTRIBUTED,Driver_Substance_Abuse_COMBINED SUBSTANCE PRESENT,Driver_Substance_Abuse_ILLEGAL DRUG CONTRIBUTED,Driver_Substance_Abuse_ILLEGAL DRUG PRESENT,Driver_Substance_Abuse_MEDICATION CONTRIBUTED,Driver_Substance_Abuse_MEDICATION PRESENT,Driver_Substance_Abuse_NONE DETECTED,Driver_Substance_Abuse_OTHER,Driver_Substance_Abuse_UNKNOWN,Driver_At_Fault_No,Driver_At_Fault_Unknown,Driver_At_Fault_Yes,Driver_Distracted_By_ADJUSTING AUDIO AND OR CLIMATE CONTROLS,Driver_Distracted_By_BY MOVING OBJECT IN VEHICLE,Driver_Distracted_By_BY OTHER OCCUPANTS,Driver_Distracted_By_DIALING CELLULAR 
PHONE,Driver_Distracted_By_DISTRACTED BY OUTSIDE PERSON OBJECT OR EVENT,Driver_Distracted_By_EATING OR DRINKING,Driver_Distracted_By_INATTENTIVE OR LOST IN THOUGHT,Driver_Distracted_By_LOOKED BUT DID NOT SEE,Driver_Distracted_By_NO DRIVER PRESENT,Driver_Distracted_By_NOT DISTRACTED,Driver_Distracted_By_OTHER CELLULAR PHONE RELATED,Driver_Distracted_By_OTHER DISTRACTION,Driver_Distracted_By_OTHER ELECTRONIC DEVICE (NAVIGATIONAL PALM PILOT),Driver_Distracted_By_SMOKING RELATED,Driver_Distracted_By_TALKING OR LISTENING TO CELLULAR PHONE,Driver_Distracted_By_TEXTING FROM A CELLULAR PHONE,Driver_Distracted_By_UNKNOWN,Driver_Distracted_By_USING DEVICE OBJECT BROUGHT INTO VEHICLE,Driver_Distracted_By_USING OTHER DEVICE CONTROLS INTEGRAL TO VEHICLE,Vehicle_Damage_Extent_DESTROYED,Vehicle_Damage_Extent_DISABLING,Vehicle_Damage_Extent_FUNCTIONAL,Vehicle_Damage_Extent_NO DAMAGE,Vehicle_Damage_Extent_OTHER,Vehicle_Damage_Extent_SUPERFICIAL,Vehicle_Damage_Extent_UNKNOWN,Vehicle_Damage_Extent_Unknown,Vehicle_Body_Type_(SPORT) UTILITY VEHICLE,Vehicle_Body_Type_ALL TERRAIN VEHICLE (ATV),Vehicle_Body_Type_AMBULANCE/EMERGENCY,Vehicle_Body_Type_AMBULANCE/NON EMERGENCY,Vehicle_Body_Type_AUTOCYCLE,"Vehicle_Body_Type_CARGO VAN/LIGHT TRUCK 2 AXLES (OVER 10,000LBS (4,536 KG))",Vehicle_Body_Type_CROSS COUNTRY BUS,Vehicle_Body_Type_FARM VEHICLE,Vehicle_Body_Type_FIRE VEHICLE/EMERGENCY,Vehicle_Body_Type_FIRE VEHICLE/NON EMERGENCY,Vehicle_Body_Type_LIMOUSINE,Vehicle_Body_Type_LOW SPEED VEHICLE,"Vehicle_Body_Type_MEDIUM/HEAVY TRUCKS 3 AXLES (OVER 10,000LBS (4,536KG))",Vehicle_Body_Type_MOPED,Vehicle_Body_Type_MOTORCYCLE,Vehicle_Body_Type_OTHER,Vehicle_Body_Type_OTHER BUS,"Vehicle_Body_Type_OTHER LIGHT TRUCKS (10,000LBS (4,536KG) OR LESS)",Vehicle_Body_Type_PASSENGER CAR,Vehicle_Body_Type_PICKUP TRUCK,Vehicle_Body_Type_POLICE VEHICLE/EMERGENCY,Vehicle_Body_Type_POLICE VEHICLE/NON EMERGENCY,Vehicle_Body_Type_RECREATIONAL VEHICLE,Vehicle_Body_Type_SCHOOL 
BUS,Vehicle_Body_Type_SNOWMOBILE,Vehicle_Body_Type_STATION WAGON,Vehicle_Body_Type_TRANSIT BUS,Vehicle_Body_Type_TRUCK TRACTOR,Vehicle_Body_Type_UNKNOWN,Vehicle_Body_Type_VAN,Vehicle_Movement_ACCELERATING,Vehicle_Movement_BACKING,Vehicle_Movement_CHANGING LANES,Vehicle_Movement_DRIVERLESS MOVING VEH.,Vehicle_Movement_ENTERING TRAFFIC LANE,Vehicle_Movement_LEAVING TRAFFIC LANE,Vehicle_Movement_MAKING LEFT TURN,Vehicle_Movement_MAKING RIGHT TURN,Vehicle_Movement_MAKING U TURN,Vehicle_Movement_MOVING CONSTANT SPEED,Vehicle_Movement_NEGOTIATING A CURVE,Vehicle_Movement_OTHER,Vehicle_Movement_PARKED,Vehicle_Movement_PARKING,Vehicle_Movement_PASSING,Vehicle_Movement_RIGHT TURN ON RED,Vehicle_Movement_SKIDDING,Vehicle_Movement_SLOWING OR STOPPING,Vehicle_Movement_STARTING FROM LANE,Vehicle_Movement_STARTING FROM PARKED,Vehicle_Movement_STOPPED IN TRAFFIC LANE,Vehicle_Movement_UNKNOWN,Vehicle_Movement_Unknown,Parked_Vehicle_No,Parked_Vehicle_Yes,Equipment_Problems_AIR BAG FAILED,Equipment_Problems_BELT(S) MISUSED,Equipment_Problems_BELTS/ANCHORS BROKE,Equipment_Problems_FACING WRONG WAY,Equipment_Problems_NO MISUSE,Equipment_Problems_NOT STREPPED RIGHT,Equipment_Problems_OTHER,Equipment_Problems_SIZE/TYPE IMPROPER,Equipment_Problems_STRAP/TETHER LOOSE,Equipment_Problems_UNKNOWN
120106,40,2012,39.227872,-77.265927,"(39.22787167, -77.26592667)",2018,5,4,14:45:00,14,45,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0
19616,15,2015,39.051413,-77.101961,"(39.05141317, -77.10196117)",2021,2,21,21:20:00,21,20,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1
122662,45,2018,39.046778,-76.986577,"(39.04677833, -76.98657667)",2021,4,8,15:40:00,15,40,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0
50999,35,2018,38.993211,-77.065008,"(38.99321067, -77.065008)",2021,10,5,16:00:00,16,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0
34386,40,2011,39.068193,-77.016597,"(39.06819333, -77.01659667)",2019,1,4,12:50:00,12,50,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0
25791,30,2015,39.055686,-76.959317,"(39.055686, -76.9593175)",2022,1,1,14:35:00,14,35,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1
74700,30,2014,39.022362,-76.977708,"(39.02236169, -76.97770779)",2018,11,20,08:29:00,8,29,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1
56331,35,2012,39.060287,-77.089187,"(39.06028667, -77.08918667)",2021,10,3,14:16:00,14,16,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0
129859,35,2015,39.005612,-77.038343,"(39.00561201, -77.03834295)",2016,10,8,23:11:00,23,11,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0
99776,35,2012,39.099583,-77.099018,"(39.09958333, -77.09901833)",2022,6,8,15:56:00,15,56,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1


# Feature Selection

In [67]:
# Separate the features (X) from the target (y); 'Injury' is the column at position 13.
X = data.drop(columns='Injury')
y = data['Injury']

In [68]:
X

Unnamed: 0,Speed_Limit,Vehicle_Year,Latitude,Longitude,Location,Year,Month,Date,Time,Hour,Minute,Second,Day/Night,Agency_Name_GAITHERSBURG,Agency_Name_Gaithersburg Police Depar,Agency_Name_MCPARK,Agency_Name_MONTGOMERY,Agency_Name_Maryland-National Capital,Agency_Name_Montgomery County Police,Agency_Name_ROCKVILLE,Agency_Name_Rockville Police Departme,Agency_Name_TAKOMA,Agency_Name_Takoma Park Police Depart,Route_Type_County,Route_Type_Government,Route_Type_Interstate (State),Route_Type_Maryland (State),Route_Type_Municipality,Route_Type_Other Public Roadway,Route_Type_Ramp,Route_Type_Service Road,Route_Type_US (State),Route_Type_Unknown,"Weather_BLOWING SAND, SOIL, DIRT",Weather_BLOWING SNOW,Weather_CLEAR,Weather_CLOUDY,Weather_FOGGY,Weather_OTHER,Weather_RAINING,Weather_SEVERE WINDS,Weather_SLEET,Weather_SNOW,Weather_UNKNOWN,Weather_WINTRY MIX,Surface_Condition_DRY,Surface_Condition_ICE,"Surface_Condition_MUD, DIRT, GRAVEL",Surface_Condition_OIL,Surface_Condition_OTHER,Surface_Condition_SAND,Surface_Condition_SLUSH,Surface_Condition_SNOW,Surface_Condition_UNKNOWN,Surface_Condition_WATER(STANDING/MOVING),Surface_Condition_WET,Light_DARK -- UNKNOWN LIGHTING,Light_DARK LIGHTS ON,Light_DARK NO LIGHTS,Light_DAWN,Light_DAYLIGHT,Light_DUSK,Light_OTHER,Light_UNKNOWN,Driver_Substance_Abuse_ALCOHOL CONTRIBUTED,Driver_Substance_Abuse_ALCOHOL PRESENT,Driver_Substance_Abuse_COMBINATION CONTRIBUTED,Driver_Substance_Abuse_COMBINED SUBSTANCE PRESENT,Driver_Substance_Abuse_ILLEGAL DRUG CONTRIBUTED,Driver_Substance_Abuse_ILLEGAL DRUG PRESENT,Driver_Substance_Abuse_MEDICATION CONTRIBUTED,Driver_Substance_Abuse_MEDICATION PRESENT,Driver_Substance_Abuse_NONE DETECTED,Driver_Substance_Abuse_OTHER,Driver_Substance_Abuse_UNKNOWN,Driver_At_Fault_No,Driver_At_Fault_Unknown,Driver_At_Fault_Yes,Driver_Distracted_By_ADJUSTING AUDIO AND OR CLIMATE CONTROLS,Driver_Distracted_By_BY MOVING OBJECT IN VEHICLE,Driver_Distracted_By_BY OTHER OCCUPANTS,Driver_Distracted_By_DIALING CELLULAR 
PHONE,Driver_Distracted_By_DISTRACTED BY OUTSIDE PERSON OBJECT OR EVENT,Driver_Distracted_By_EATING OR DRINKING,Driver_Distracted_By_INATTENTIVE OR LOST IN THOUGHT,Driver_Distracted_By_LOOKED BUT DID NOT SEE,Driver_Distracted_By_NO DRIVER PRESENT,Driver_Distracted_By_NOT DISTRACTED,Driver_Distracted_By_OTHER CELLULAR PHONE RELATED,Driver_Distracted_By_OTHER DISTRACTION,Driver_Distracted_By_OTHER ELECTRONIC DEVICE (NAVIGATIONAL PALM PILOT),Driver_Distracted_By_SMOKING RELATED,Driver_Distracted_By_TALKING OR LISTENING TO CELLULAR PHONE,Driver_Distracted_By_TEXTING FROM A CELLULAR PHONE,Driver_Distracted_By_UNKNOWN,Driver_Distracted_By_USING DEVICE OBJECT BROUGHT INTO VEHICLE,Driver_Distracted_By_USING OTHER DEVICE CONTROLS INTEGRAL TO VEHICLE,Vehicle_Damage_Extent_DESTROYED,Vehicle_Damage_Extent_DISABLING,Vehicle_Damage_Extent_FUNCTIONAL,Vehicle_Damage_Extent_NO DAMAGE,Vehicle_Damage_Extent_OTHER,Vehicle_Damage_Extent_SUPERFICIAL,Vehicle_Damage_Extent_UNKNOWN,Vehicle_Damage_Extent_Unknown,Vehicle_Body_Type_(SPORT) UTILITY VEHICLE,Vehicle_Body_Type_ALL TERRAIN VEHICLE (ATV),Vehicle_Body_Type_AMBULANCE/EMERGENCY,Vehicle_Body_Type_AMBULANCE/NON EMERGENCY,Vehicle_Body_Type_AUTOCYCLE,"Vehicle_Body_Type_CARGO VAN/LIGHT TRUCK 2 AXLES (OVER 10,000LBS (4,536 KG))",Vehicle_Body_Type_CROSS COUNTRY BUS,Vehicle_Body_Type_FARM VEHICLE,Vehicle_Body_Type_FIRE VEHICLE/EMERGENCY,Vehicle_Body_Type_FIRE VEHICLE/NON EMERGENCY,Vehicle_Body_Type_LIMOUSINE,Vehicle_Body_Type_LOW SPEED VEHICLE,"Vehicle_Body_Type_MEDIUM/HEAVY TRUCKS 3 AXLES (OVER 10,000LBS (4,536KG))",Vehicle_Body_Type_MOPED,Vehicle_Body_Type_MOTORCYCLE,Vehicle_Body_Type_OTHER,Vehicle_Body_Type_OTHER BUS,"Vehicle_Body_Type_OTHER LIGHT TRUCKS (10,000LBS (4,536KG) OR LESS)",Vehicle_Body_Type_PASSENGER CAR,Vehicle_Body_Type_PICKUP TRUCK,Vehicle_Body_Type_POLICE VEHICLE/EMERGENCY,Vehicle_Body_Type_POLICE VEHICLE/NON EMERGENCY,Vehicle_Body_Type_RECREATIONAL VEHICLE,Vehicle_Body_Type_SCHOOL 
BUS,Vehicle_Body_Type_SNOWMOBILE,Vehicle_Body_Type_STATION WAGON,Vehicle_Body_Type_TRANSIT BUS,Vehicle_Body_Type_TRUCK TRACTOR,Vehicle_Body_Type_UNKNOWN,Vehicle_Body_Type_VAN,Vehicle_Movement_ACCELERATING,Vehicle_Movement_BACKING,Vehicle_Movement_CHANGING LANES,Vehicle_Movement_DRIVERLESS MOVING VEH.,Vehicle_Movement_ENTERING TRAFFIC LANE,Vehicle_Movement_LEAVING TRAFFIC LANE,Vehicle_Movement_MAKING LEFT TURN,Vehicle_Movement_MAKING RIGHT TURN,Vehicle_Movement_MAKING U TURN,Vehicle_Movement_MOVING CONSTANT SPEED,Vehicle_Movement_NEGOTIATING A CURVE,Vehicle_Movement_OTHER,Vehicle_Movement_PARKED,Vehicle_Movement_PARKING,Vehicle_Movement_PASSING,Vehicle_Movement_RIGHT TURN ON RED,Vehicle_Movement_SKIDDING,Vehicle_Movement_SLOWING OR STOPPING,Vehicle_Movement_STARTING FROM LANE,Vehicle_Movement_STARTING FROM PARKED,Vehicle_Movement_STOPPED IN TRAFFIC LANE,Vehicle_Movement_UNKNOWN,Vehicle_Movement_Unknown,Parked_Vehicle_No,Parked_Vehicle_Yes,Equipment_Problems_AIR BAG FAILED,Equipment_Problems_BELT(S) MISUSED,Equipment_Problems_BELTS/ANCHORS BROKE,Equipment_Problems_FACING WRONG WAY,Equipment_Problems_NO MISUSE,Equipment_Problems_NOT STREPPED RIGHT,Equipment_Problems_OTHER,Equipment_Problems_SIZE/TYPE IMPROPER,Equipment_Problems_STRAP/TETHER LOOSE,Equipment_Problems_UNKNOWN
0,15,2004,39.150044,-77.063089,"(39.15004368, -77.06308884)",2019,5,31,15:00:00,15,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1
1,0,0,39.199047,-77.250743,"(39.19904667, -77.25074333)",2019,5,24,17:00:00,17,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1
2,25,2014,38.997744,-77.032177,"(38.9977444, -77.03217719)",2019,5,17,17:07:00,17,7,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0
3,0,2017,39.119345,-77.165318,"(39.119345, -77.16531833)",2019,6,1,10:29:00,10,29,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1
4,10,2009,39.012262,-77.199998,"(39.01226167, -77.19999833)",2019,6,14,12:08:00,12,8,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
153176,25,2016,38.972560,-76.997466,"(38.97255976, -76.99746609)",2016,3,1,10:01:00,10,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0
153177,35,2008,39.004640,-77.108502,"(39.00464, -77.10850167)",2017,7,19,14:22:00,14,22,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0
153178,40,2008,39.228963,-77.236757,"(39.22896333, -77.23675667)",2020,11,23,07:37:00,7,37,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0
153179,35,2018,39.120440,-77.180047,"(39.12043995, -77.18004738)",2019,11,23,23:23:00,23,23,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0


In [69]:
y.unique()

array([1, 2, 3, 4, 0])

In [70]:
# Drop the Location column; Latitude and Longitude are standardized in a later cell.
del X['Location']

In [71]:
# print(X.shape)
# X.reset_index(drop=True,inplace=True)

In [72]:
X.tail(5)

Unnamed: 0,Speed_Limit,Vehicle_Year,Latitude,Longitude,Year,Month,Date,Time,Hour,Minute,Second,Day/Night,Agency_Name_GAITHERSBURG,Agency_Name_Gaithersburg Police Depar,Agency_Name_MCPARK,Agency_Name_MONTGOMERY,Agency_Name_Maryland-National Capital,Agency_Name_Montgomery County Police,Agency_Name_ROCKVILLE,Agency_Name_Rockville Police Departme,Agency_Name_TAKOMA,Agency_Name_Takoma Park Police Depart,Route_Type_County,Route_Type_Government,Route_Type_Interstate (State),Route_Type_Maryland (State),Route_Type_Municipality,Route_Type_Other Public Roadway,Route_Type_Ramp,Route_Type_Service Road,Route_Type_US (State),Route_Type_Unknown,"Weather_BLOWING SAND, SOIL, DIRT",Weather_BLOWING SNOW,Weather_CLEAR,Weather_CLOUDY,Weather_FOGGY,Weather_OTHER,Weather_RAINING,Weather_SEVERE WINDS,Weather_SLEET,Weather_SNOW,Weather_UNKNOWN,Weather_WINTRY MIX,Surface_Condition_DRY,Surface_Condition_ICE,"Surface_Condition_MUD, DIRT, GRAVEL",Surface_Condition_OIL,Surface_Condition_OTHER,Surface_Condition_SAND,Surface_Condition_SLUSH,Surface_Condition_SNOW,Surface_Condition_UNKNOWN,Surface_Condition_WATER(STANDING/MOVING),Surface_Condition_WET,Light_DARK -- UNKNOWN LIGHTING,Light_DARK LIGHTS ON,Light_DARK NO LIGHTS,Light_DAWN,Light_DAYLIGHT,Light_DUSK,Light_OTHER,Light_UNKNOWN,Driver_Substance_Abuse_ALCOHOL CONTRIBUTED,Driver_Substance_Abuse_ALCOHOL PRESENT,Driver_Substance_Abuse_COMBINATION CONTRIBUTED,Driver_Substance_Abuse_COMBINED SUBSTANCE PRESENT,Driver_Substance_Abuse_ILLEGAL DRUG CONTRIBUTED,Driver_Substance_Abuse_ILLEGAL DRUG PRESENT,Driver_Substance_Abuse_MEDICATION CONTRIBUTED,Driver_Substance_Abuse_MEDICATION PRESENT,Driver_Substance_Abuse_NONE DETECTED,Driver_Substance_Abuse_OTHER,Driver_Substance_Abuse_UNKNOWN,Driver_At_Fault_No,Driver_At_Fault_Unknown,Driver_At_Fault_Yes,Driver_Distracted_By_ADJUSTING AUDIO AND OR CLIMATE CONTROLS,Driver_Distracted_By_BY MOVING OBJECT IN VEHICLE,Driver_Distracted_By_BY OTHER OCCUPANTS,Driver_Distracted_By_DIALING CELLULAR PHONE,Driver_Distracted_By_DISTRACTED BY OUTSIDE PERSON OBJECT OR EVENT,Driver_Distracted_By_EATING OR DRINKING,Driver_Distracted_By_INATTENTIVE OR LOST IN THOUGHT,Driver_Distracted_By_LOOKED BUT DID NOT SEE,Driver_Distracted_By_NO DRIVER PRESENT,Driver_Distracted_By_NOT DISTRACTED,Driver_Distracted_By_OTHER CELLULAR PHONE RELATED,Driver_Distracted_By_OTHER DISTRACTION,Driver_Distracted_By_OTHER ELECTRONIC DEVICE (NAVIGATIONAL PALM PILOT),Driver_Distracted_By_SMOKING RELATED,Driver_Distracted_By_TALKING OR LISTENING TO CELLULAR PHONE,Driver_Distracted_By_TEXTING FROM A CELLULAR PHONE,Driver_Distracted_By_UNKNOWN,Driver_Distracted_By_USING DEVICE OBJECT BROUGHT INTO VEHICLE,Driver_Distracted_By_USING OTHER DEVICE CONTROLS INTEGRAL TO VEHICLE,Vehicle_Damage_Extent_DESTROYED,Vehicle_Damage_Extent_DISABLING,Vehicle_Damage_Extent_FUNCTIONAL,Vehicle_Damage_Extent_NO DAMAGE,Vehicle_Damage_Extent_OTHER,Vehicle_Damage_Extent_SUPERFICIAL,Vehicle_Damage_Extent_UNKNOWN,Vehicle_Damage_Extent_Unknown,Vehicle_Body_Type_(SPORT) UTILITY VEHICLE,Vehicle_Body_Type_ALL TERRAIN VEHICLE (ATV),Vehicle_Body_Type_AMBULANCE/EMERGENCY,Vehicle_Body_Type_AMBULANCE/NON EMERGENCY,Vehicle_Body_Type_AUTOCYCLE,"Vehicle_Body_Type_CARGO VAN/LIGHT TRUCK 2 AXLES (OVER 10,000LBS (4,536 KG))",Vehicle_Body_Type_CROSS COUNTRY BUS,Vehicle_Body_Type_FARM VEHICLE,Vehicle_Body_Type_FIRE VEHICLE/EMERGENCY,Vehicle_Body_Type_FIRE VEHICLE/NON EMERGENCY,Vehicle_Body_Type_LIMOUSINE,Vehicle_Body_Type_LOW SPEED VEHICLE,"Vehicle_Body_Type_MEDIUM/HEAVY TRUCKS 3 AXLES (OVER 10,000LBS (4,536KG))",Vehicle_Body_Type_MOPED,Vehicle_Body_Type_MOTORCYCLE,Vehicle_Body_Type_OTHER,Vehicle_Body_Type_OTHER BUS,"Vehicle_Body_Type_OTHER LIGHT TRUCKS (10,000LBS (4,536KG) OR LESS)",Vehicle_Body_Type_PASSENGER CAR,Vehicle_Body_Type_PICKUP TRUCK,Vehicle_Body_Type_POLICE VEHICLE/EMERGENCY,Vehicle_Body_Type_POLICE VEHICLE/NON EMERGENCY,Vehicle_Body_Type_RECREATIONAL VEHICLE,Vehicle_Body_Type_SCHOOL BUS,Vehicle_Body_Type_SNOWMOBILE,Vehicle_Body_Type_STATION WAGON,Vehicle_Body_Type_TRANSIT BUS,Vehicle_Body_Type_TRUCK TRACTOR,Vehicle_Body_Type_UNKNOWN,Vehicle_Body_Type_VAN,Vehicle_Movement_ACCELERATING,Vehicle_Movement_BACKING,Vehicle_Movement_CHANGING LANES,Vehicle_Movement_DRIVERLESS MOVING VEH.,Vehicle_Movement_ENTERING TRAFFIC LANE,Vehicle_Movement_LEAVING TRAFFIC LANE,Vehicle_Movement_MAKING LEFT TURN,Vehicle_Movement_MAKING RIGHT TURN,Vehicle_Movement_MAKING U TURN,Vehicle_Movement_MOVING CONSTANT SPEED,Vehicle_Movement_NEGOTIATING A CURVE,Vehicle_Movement_OTHER,Vehicle_Movement_PARKED,Vehicle_Movement_PARKING,Vehicle_Movement_PASSING,Vehicle_Movement_RIGHT TURN ON RED,Vehicle_Movement_SKIDDING,Vehicle_Movement_SLOWING OR STOPPING,Vehicle_Movement_STARTING FROM LANE,Vehicle_Movement_STARTING FROM PARKED,Vehicle_Movement_STOPPED IN TRAFFIC LANE,Vehicle_Movement_UNKNOWN,Vehicle_Movement_Unknown,Parked_Vehicle_No,Parked_Vehicle_Yes,Equipment_Problems_AIR BAG FAILED,Equipment_Problems_BELT(S) MISUSED,Equipment_Problems_BELTS/ANCHORS BROKE,Equipment_Problems_FACING WRONG WAY,Equipment_Problems_NO MISUSE,Equipment_Problems_NOT STREPPED RIGHT,Equipment_Problems_OTHER,Equipment_Problems_SIZE/TYPE IMPROPER,Equipment_Problems_STRAP/TETHER LOOSE,Equipment_Problems_UNKNOWN
153176,25,2016,38.97256,-76.997466,2016,3,1,10:01:00,10,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0
153177,35,2008,39.00464,-77.108502,2017,7,19,14:22:00,14,22,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0
153178,40,2008,39.228963,-77.236757,2020,11,23,07:37:00,7,37,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0
153179,35,2018,39.12044,-77.180047,2019,11,23,23:23:00,23,23,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0
153180,35,2011,39.106847,-77.15838,2015,1,21,09:02:00,9,2,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1


In [73]:
# Latitude and Longitude sit among the one-hot encoded columns, so standardize them
# in a separate DataFrame and concatenate the result back onto X.

from sklearn.preprocessing import StandardScaler

lat_long_df = X[['Latitude', 'Longitude']]

# Standardize latitude and longitude
scaler = StandardScaler()
lat_long_standardized = scaler.fit_transform(lat_long_df)

# Create a DataFrame for the standardized lat and long.
# Reusing X's index guarantees that pd.concat aligns the rows correctly.
lat_long_standardized_df = pd.DataFrame(lat_long_standardized, columns=['Lat', 'Long'], index=X.index)
print(lat_long_standardized_df.shape)
# Concatenate the standardized lat and long columns with the rest of X
X_standardized = pd.concat([X, lat_long_standardized_df], axis=1)
# X_standardized.head(2)
print(X_standardized.shape)

(153181, 2)
(153181, 171)
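
Note that the cell above fits the scaler on all 153,181 rows before the train/test split, so the test rows contribute to the mean and standard deviation. A minimal sketch of the leakage-free alternative follows, assuming a small made-up frame (`toy_X` and its values are hypothetical, not taken from the crash data): split first, fit `StandardScaler` on the training rows only, then apply the same transform to the test rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for X (hypothetical values, for illustration only)
toy_X = pd.DataFrame({
    "Latitude": [39.12, 39.01, 38.97, 39.00, 39.23, 39.11],
    "Longitude": [-77.17, -77.20, -77.00, -77.11, -77.24, -77.16],
    "Speed_Limit": [0, 10, 25, 35, 40, 35],
})

# Split first, so the scaler never sees the test rows
toy_train, toy_test = train_test_split(toy_X, test_size=0.33, random_state=42)
toy_train = toy_train.copy()
toy_test = toy_test.copy()

coord_cols = ["Latitude", "Longitude"]
coord_scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only
toy_train[coord_cols] = coord_scaler.fit_transform(toy_train[coord_cols])
# Reuse the training statistics for the test rows
toy_test[coord_cols] = coord_scaler.transform(toy_test[coord_cols])

print(toy_train)
print(toy_test)
```

With only two coordinate columns the effect on this notebook is probably small, but the same pattern applies to any preprocessing step that learns statistics from the data.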


In [74]:
del X_standardized['Latitude']
del X_standardized['Longitude']

In [75]:
# Final check of X and y
X_standardized.sample(5)
# y.sample(5)

Unnamed: 0,Speed_Limit,Vehicle_Year,Year,Month,Date,Time,Hour,Minute,Second,Day/Night,Agency_Name_GAITHERSBURG,Agency_Name_Gaithersburg Police Depar,Agency_Name_MCPARK,Agency_Name_MONTGOMERY,Agency_Name_Maryland-National Capital,Agency_Name_Montgomery County Police,Agency_Name_ROCKVILLE,Agency_Name_Rockville Police Departme,Agency_Name_TAKOMA,Agency_Name_Takoma Park Police Depart,Route_Type_County,Route_Type_Government,Route_Type_Interstate (State),Route_Type_Maryland (State),Route_Type_Municipality,Route_Type_Other Public Roadway,Route_Type_Ramp,Route_Type_Service Road,Route_Type_US (State),Route_Type_Unknown,"Weather_BLOWING SAND, SOIL, DIRT",Weather_BLOWING SNOW,Weather_CLEAR,Weather_CLOUDY,Weather_FOGGY,Weather_OTHER,Weather_RAINING,Weather_SEVERE WINDS,Weather_SLEET,Weather_SNOW,Weather_UNKNOWN,Weather_WINTRY MIX,Surface_Condition_DRY,Surface_Condition_ICE,"Surface_Condition_MUD, DIRT, GRAVEL",Surface_Condition_OIL,Surface_Condition_OTHER,Surface_Condition_SAND,Surface_Condition_SLUSH,Surface_Condition_SNOW,Surface_Condition_UNKNOWN,Surface_Condition_WATER(STANDING/MOVING),Surface_Condition_WET,Light_DARK -- UNKNOWN LIGHTING,Light_DARK LIGHTS ON,Light_DARK NO LIGHTS,Light_DAWN,Light_DAYLIGHT,Light_DUSK,Light_OTHER,Light_UNKNOWN,Driver_Substance_Abuse_ALCOHOL CONTRIBUTED,Driver_Substance_Abuse_ALCOHOL PRESENT,Driver_Substance_Abuse_COMBINATION CONTRIBUTED,Driver_Substance_Abuse_COMBINED SUBSTANCE PRESENT,Driver_Substance_Abuse_ILLEGAL DRUG CONTRIBUTED,Driver_Substance_Abuse_ILLEGAL DRUG PRESENT,Driver_Substance_Abuse_MEDICATION CONTRIBUTED,Driver_Substance_Abuse_MEDICATION PRESENT,Driver_Substance_Abuse_NONE DETECTED,Driver_Substance_Abuse_OTHER,Driver_Substance_Abuse_UNKNOWN,Driver_At_Fault_No,Driver_At_Fault_Unknown,Driver_At_Fault_Yes,Driver_Distracted_By_ADJUSTING AUDIO AND OR CLIMATE CONTROLS,Driver_Distracted_By_BY MOVING OBJECT IN VEHICLE,Driver_Distracted_By_BY OTHER OCCUPANTS,Driver_Distracted_By_DIALING CELLULAR PHONE,Driver_Distracted_By_DISTRACTED BY OUTSIDE PERSON OBJECT OR EVENT,Driver_Distracted_By_EATING OR DRINKING,Driver_Distracted_By_INATTENTIVE OR LOST IN THOUGHT,Driver_Distracted_By_LOOKED BUT DID NOT SEE,Driver_Distracted_By_NO DRIVER PRESENT,Driver_Distracted_By_NOT DISTRACTED,Driver_Distracted_By_OTHER CELLULAR PHONE RELATED,Driver_Distracted_By_OTHER DISTRACTION,Driver_Distracted_By_OTHER ELECTRONIC DEVICE (NAVIGATIONAL PALM PILOT),Driver_Distracted_By_SMOKING RELATED,Driver_Distracted_By_TALKING OR LISTENING TO CELLULAR PHONE,Driver_Distracted_By_TEXTING FROM A CELLULAR PHONE,Driver_Distracted_By_UNKNOWN,Driver_Distracted_By_USING DEVICE OBJECT BROUGHT INTO VEHICLE,Driver_Distracted_By_USING OTHER DEVICE CONTROLS INTEGRAL TO VEHICLE,Vehicle_Damage_Extent_DESTROYED,Vehicle_Damage_Extent_DISABLING,Vehicle_Damage_Extent_FUNCTIONAL,Vehicle_Damage_Extent_NO DAMAGE,Vehicle_Damage_Extent_OTHER,Vehicle_Damage_Extent_SUPERFICIAL,Vehicle_Damage_Extent_UNKNOWN,Vehicle_Damage_Extent_Unknown,Vehicle_Body_Type_(SPORT) UTILITY VEHICLE,Vehicle_Body_Type_ALL TERRAIN VEHICLE (ATV),Vehicle_Body_Type_AMBULANCE/EMERGENCY,Vehicle_Body_Type_AMBULANCE/NON EMERGENCY,Vehicle_Body_Type_AUTOCYCLE,"Vehicle_Body_Type_CARGO VAN/LIGHT TRUCK 2 AXLES (OVER 10,000LBS (4,536 KG))",Vehicle_Body_Type_CROSS COUNTRY BUS,Vehicle_Body_Type_FARM VEHICLE,Vehicle_Body_Type_FIRE VEHICLE/EMERGENCY,Vehicle_Body_Type_FIRE VEHICLE/NON EMERGENCY,Vehicle_Body_Type_LIMOUSINE,Vehicle_Body_Type_LOW SPEED VEHICLE,"Vehicle_Body_Type_MEDIUM/HEAVY TRUCKS 3 AXLES (OVER 10,000LBS (4,536KG))",Vehicle_Body_Type_MOPED,Vehicle_Body_Type_MOTORCYCLE,Vehicle_Body_Type_OTHER,Vehicle_Body_Type_OTHER BUS,"Vehicle_Body_Type_OTHER LIGHT TRUCKS (10,000LBS (4,536KG) OR LESS)",Vehicle_Body_Type_PASSENGER CAR,Vehicle_Body_Type_PICKUP TRUCK,Vehicle_Body_Type_POLICE VEHICLE/EMERGENCY,Vehicle_Body_Type_POLICE VEHICLE/NON EMERGENCY,Vehicle_Body_Type_RECREATIONAL VEHICLE,Vehicle_Body_Type_SCHOOL BUS,Vehicle_Body_Type_SNOWMOBILE,Vehicle_Body_Type_STATION WAGON,Vehicle_Body_Type_TRANSIT BUS,Vehicle_Body_Type_TRUCK TRACTOR,Vehicle_Body_Type_UNKNOWN,Vehicle_Body_Type_VAN,Vehicle_Movement_ACCELERATING,Vehicle_Movement_BACKING,Vehicle_Movement_CHANGING LANES,Vehicle_Movement_DRIVERLESS MOVING VEH.,Vehicle_Movement_ENTERING TRAFFIC LANE,Vehicle_Movement_LEAVING TRAFFIC LANE,Vehicle_Movement_MAKING LEFT TURN,Vehicle_Movement_MAKING RIGHT TURN,Vehicle_Movement_MAKING U TURN,Vehicle_Movement_MOVING CONSTANT SPEED,Vehicle_Movement_NEGOTIATING A CURVE,Vehicle_Movement_OTHER,Vehicle_Movement_PARKED,Vehicle_Movement_PARKING,Vehicle_Movement_PASSING,Vehicle_Movement_RIGHT TURN ON RED,Vehicle_Movement_SKIDDING,Vehicle_Movement_SLOWING OR STOPPING,Vehicle_Movement_STARTING FROM LANE,Vehicle_Movement_STARTING FROM PARKED,Vehicle_Movement_STOPPED IN TRAFFIC LANE,Vehicle_Movement_UNKNOWN,Vehicle_Movement_Unknown,Parked_Vehicle_No,Parked_Vehicle_Yes,Equipment_Problems_AIR BAG FAILED,Equipment_Problems_BELT(S) MISUSED,Equipment_Problems_BELTS/ANCHORS BROKE,Equipment_Problems_FACING WRONG WAY,Equipment_Problems_NO MISUSE,Equipment_Problems_NOT STREPPED RIGHT,Equipment_Problems_OTHER,Equipment_Problems_SIZE/TYPE IMPROPER,Equipment_Problems_STRAP/TETHER LOOSE,Equipment_Problems_UNKNOWN,Lat,Long
133722,40,2016,2018,9,3,21:28:00,21,28,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,-0.34611,-0.084699
50842,35,2014,2020,6,23,08:51:00,8,51,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,-0.335309,0.568486
109595,50,0,2021,10,2,22:19:00,22,19,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0.463832,-0.969345
114843,35,2022,2022,10,13,21:42:00,21,42,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,-1.235999,0.809592
79956,35,2017,2022,12,27,15:10:00,15,10,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,-0.41306,0.072922


In [76]:
X_standardized.shape

(153181, 169)

In [77]:
X_standardized.isna().sum()

Speed_Limit                              0
Vehicle_Year                             0
Year                                     0
Month                                    0
Date                                     0
                                        ..
Equipment_Problems_SIZE/TYPE IMPROPER    0
Equipment_Problems_STRAP/TETHER LOOSE    0
Equipment_Problems_UNKNOWN               0
Lat                                      0
Long                                     0
Length: 169, dtype: int64

In [78]:
del X_standardized['Time']

In [79]:
# Train/test split (80/20), stratified on the target so both parts keep the same class proportions

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_standardized, y, test_size=0.2, stratify=y, random_state=42)

In [80]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(122544, 168)
(30637, 168)
(122544,)
(30637,)
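
Passing `stratify=y` keeps the share of each severity class the same in the training and test parts, which matters when some classes are rare. The following sketch illustrates this on made-up labels (`toy_y` and `toy_features` are hypothetical names and values, not the crash data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 rows of class 1 and 10 rows of class 0
toy_y = np.array([1] * 90 + [0] * 10)
toy_features = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    toy_features, toy_y, test_size=0.2, stratify=toy_y, random_state=42
)

# With stratification, the rare class has the same share in both parts
print("train share of class 0:", np.mean(y_tr == 0))
print("test share of class 0:", np.mean(y_te == 0))
```

Without `stratify`, a random 20% sample could by chance contain very few or no rows of the rarest class.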


## Balancing dataset

In [81]:
# pip install imbalanced-learn   (the package that provides the imblearn module)

### Applying SMOTE (Synthetic Minority Over-sampling Technique) to the training data

In [82]:
from imblearn.over_sampling import SMOTE
from collections import Counter

# Assuming X_train, X_test, y_train, y_test are already defined

# Check class distribution in the original training set
print('Class distribution before SMOTE:', Counter(y_train))

# Apply SMOTE to balance the classes
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

# Check class distribution in the balanced training set
print('Class distribution after SMOTE:', Counter(y_train_balanced))

# Now you can use X_train_balanced, y_train_balanced for training your model
# Use X_test, y_test for testing


Class distribution before SMOTE: Counter({1: 100272, 2: 12532, 3: 8622, 4: 1008, 0: 110})
Class distribution after SMOTE: Counter({1: 100272, 2: 100272, 4: 100272, 3: 100272, 0: 100272})
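
To make the oversampling step less of a black box, here is a simplified sketch of the core idea behind SMOTE: each synthetic row is placed at a random point on the line segment between a minority-class row and one of its nearest minority-class neighbours. This is an illustration under stated assumptions, not imblearn's implementation; `smote_like_samples`, `toy_minority`, and the parameter values are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Simplified SMOTE-style interpolation (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each point is its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
    _, neighbour_idx = nn.kneighbors(minority)

    synthetic = np.empty((n_new, minority.shape[1]))
    for row in range(n_new):
        base = rng.integers(len(minority))                 # random minority row
        neighbour = rng.choice(neighbour_idx[base, 1:])    # one of its k neighbours
        gap = rng.random()                                 # position along the segment
        synthetic[row] = minority[base] + gap * (minority[neighbour] - minority[base])
    return synthetic


# Hypothetical minority-class points on the corners of the unit square
toy_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_synthetic = smote_like_samples(toy_minority, n_new=6)
print(toy_synthetic)
```

Because the new rows are interpolated, 0/1 dummy columns can end up with fractional values whenever the two source rows differ in that column. For data that is mostly one-hot encoded, as here, imblearn's `SMOTENC` (which treats listed columns as categorical) may be worth considering.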


In [83]:
#Training
print(X_train_balanced.shape)
print(y_train_balanced.shape)
print("--------------")
#Testing
print(X_test.shape)
print(y_test.shape)

(501360, 168)
(501360,)
--------------
(30637, 168)
(30637,)


# Feature selection

In [84]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score

In [85]:
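# Step 1: Reduce the data to 10 principal components with PCA.
# Note: PCA builds new axes from linear combinations of all 168 columns;
# it does not pick a subset of the original features.
n_selected_components = 10
pca = PCA(n_components=n_selected_components)
X_train_pca = pca.fit_transform(X_train_balanced)
X_test_pca = pca.transform(X_test)

# For each component, list the original column names ordered by ascending loading.
# argsort works row-wise on the (10, 168) loading matrix and [::-1] reverses the
# order of the components, so the result is a list of 10 Index objects of length 168.
selected_feature_indices = np.argsort(pca.components_)[::-1][:n_selected_components]
selected_feature_names = [X_train_balanced.columns[i] for i in selected_feature_indices]
print("Selected Features:", selected_feature_names)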

Selected Features: [Index(['Lat', 'Long', 'Surface_Condition_WET', 'Day/Night',
       'Light_DARK LIGHTS ON', 'Weather_RAINING',
       'Driver_Distracted_By_NOT DISTRACTED',
       'Vehicle_Movement_MOVING CONSTANT SPEED',
       'Vehicle_Body_Type_PASSENGER CAR', 'Driver_At_Fault_No',
       ...
       'Vehicle_Damage_Extent_FUNCTIONAL',
       'Driver_Distracted_By_LOOKED BUT DID NOT SEE',
       'Driver_Substance_Abuse_UNKNOWN',
       'Driver_Substance_Abuse_NONE DETECTED', 'Driver_Distracted_By_UNKNOWN',
       'Driver_At_Fault_Yes', 'Route_Type_Maryland (State)', 'Light_DAYLIGHT',
       'Surface_Condition_DRY', 'Weather_CLEAR'],
      dtype='object', length=168), Index(['Driver_Distracted_By_NOT DISTRACTED',
       'Driver_Substance_Abuse_NONE DETECTED', 'Driver_At_Fault_No',
       'Equipment_Problems_NO MISUSE', 'Light_DAYLIGHT',
       'Vehicle_Damage_Extent_DISABLING', 'Vehicle_Body_Type_PASSENGER CAR',
       'Route_Type_Maryland (State)', 'Vehicle_Damage_Extent_FUNCTIONA
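
The output above lists all 168 column names once per component, which is hard to read. A more compact way to inspect a fitted PCA is to look at the explained variance of each component and at the few columns with the largest absolute loadings. The sketch below shows this on made-up data; `toy`, `loadings`, and `top_features` are hypothetical names, and the values are not from the crash dataset.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical data: five columns, where f1 is roughly three times f0
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(200, 5)), columns=["f0", "f1", "f2", "f3", "f4"])
toy["f1"] = toy["f0"] * 3 + rng.normal(scale=0.1, size=200)

toy_pca = PCA(n_components=2).fit(toy)

# Rows are components, columns are the original features
loadings = pd.DataFrame(toy_pca.components_, columns=toy.columns, index=["PC1", "PC2"])

# For each component, keep the three features with the largest absolute loading
top_features = {
    pc: loadings.loc[pc].abs().sort_values(ascending=False).head(3).index.tolist()
    for pc in loadings.index
}

print("Explained variance ratio:", toy_pca.explained_variance_ratio_)
print("Top loading features:", top_features)
```

PCA is sensitive to feature scale, so columns with large numeric ranges can dominate the first components unless all inputs are standardized beforehand.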

In [86]:
# Step 2: Define a Random Forest classifier with fixed hyperparameters (it is fitted later)
rf_classifier = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42)

In [None]:
# Step 3: Create a learning curve
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.metrics import accuracy_score, classification_report

# learning_curve refits the classifier on growing subsets using 5-fold cross-validation
train_sizes, train_scores, test_scores = learning_curve(
    rf_classifier, X_train_pca, y_train_balanced, cv=5, scoring='accuracy')

plt.figure(figsize=(10, 6))
plt.plot(train_sizes, np.mean(train_scores, axis=1), label='Training Accuracy')
# test_scores holds the scores on the held-out CV folds, not on X_test
plt.plot(train_sizes, np.mean(test_scores, axis=1), label='Cross-Validation Accuracy')
plt.xlabel('Number of Training Examples')
plt.ylabel('Accuracy')
plt.title('Learning Curve')
plt.legend()
plt.grid()
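
Because the training data was oversampled before cross-validation, synthetic rows derived from a training-fold row can land in a validation fold, which tends to make the cross-validation curve look optimistic. One way around this is to oversample inside each fold, using only that fold's training rows. The sketch below shows the pattern with plain random oversampling instead of SMOTE so that it depends only on NumPy and scikit-learn; `oversample_minority`, the toy data, and the hyperparameters are hypothetical. With imblearn, placing `SMOTE` in an `imblearn.pipeline.Pipeline` achieves the same effect.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold


def oversample_minority(features, labels, seed=0):
    """Random oversampling: duplicate rows of smaller classes up to the largest class size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(labels))]  # keep every original row
    for cls, count in zip(classes, counts):
        if count < target:
            extra = rng.choice(np.flatnonzero(labels == cls), size=target - count, replace=True)
            keep.append(extra)
    keep = np.concatenate(keep)
    return features[keep], labels[keep]


# Hypothetical imbalanced data: roughly 16% positives
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(300, 4))
toy_y = (toy_X[:, 0] > 1.0).astype(int)

fold_scores = []
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in splitter.split(toy_X, toy_y):
    # Balance only the training part; the validation fold keeps its real distribution
    X_bal, y_bal = oversample_minority(toy_X[train_idx], toy_y[train_idx])
    model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42)
    model.fit(X_bal, y_bal)
    fold_scores.append(balanced_accuracy_score(toy_y[val_idx], model.predict(toy_X[val_idx])))

print("Balanced accuracy per fold:", fold_scores)
```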

In [None]:
# Step 4: Make predictions and compute classification metrics
rf_classifier.fit(X_train_pca, y_train_balanced)

y_pred = rf_classifier.predict(X_test_pca)
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

print('Accuracy:', accuracy)
print('Classification Report:\n', classification_rep)

# Display the plots
plt.show()
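
The test set keeps the original class imbalance (class 1 makes up about 82% of the training rows before SMOTE, and the stratified split preserves that share), so plain accuracy can look good even for a model that ignores the rare classes. The sketch below, on made-up labels, compares accuracy with balanced accuracy and macro-averaged F1 for a model that always predicts the majority class; `toy_true` and `toy_pred` are hypothetical.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    confusion_matrix,
    f1_score,
)

# Hypothetical labels: 90 rows of class 1, 8 of class 2, 2 of class 0
toy_true = np.array([1] * 90 + [2] * 8 + [0] * 2)
# A "model" that always predicts the majority class
toy_pred = np.ones_like(toy_true)

toy_accuracy = accuracy_score(toy_true, toy_pred)
toy_balanced = balanced_accuracy_score(toy_true, toy_pred)  # mean of per-class recall
toy_macro_f1 = f1_score(toy_true, toy_pred, average="macro", zero_division=0)

print("Accuracy:", toy_accuracy)
print("Balanced accuracy:", toy_balanced)
print("Macro F1:", toy_macro_f1)
print(confusion_matrix(toy_true, toy_pred, labels=[0, 1, 2]))
```

The per-class rows of `classification_report` and the macro averages are therefore likely to be more informative here than the single accuracy figure.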

In [None]:
# # Train a machine learning algorithm (e.g., RandomForestClassifier) with hyperparameter tuning
# param_grid = {
#     'n_estimators': [50, 100, 200],
#     'max_depth': [None, 10, 20, 30]
# }

# rf_classifier = RandomForestClassifier(random_state=42)
# grid_search = GridSearchCV(rf_classifier, param_grid, cv=5)
# grid_search.fit(X_train_pca, y_train_balanced)

# # Print the best parameters found by GridSearch
# print("Best Parameters: ", grid_search.best_params_)

# # Evaluate the model using cross-validation
# cross_val_scores = cross_val_score(grid_search.best_estimator_, X_train_pca, y_train_balanced, cv=5)
# print("Cross-Validation Scores: ", cross_val_scores)

# # Evaluate the model on the test set
# y_pred = grid_search.best_estimator_.predict(X_test_pca)
# accuracy = accuracy_score(y_test, y_pred)
# print('Accuracy:', accuracy)

In [None]:
# Plot training curves to diagnose overfitting and underfitting

In [None]:
# import numpy as np
# import matplotlib.pyplot as plt
# from sklearn.model_selection import learning_curve


# # Define a function to plot the learning curve
# def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
#                         n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
#     plt.figure()
#     plt.title(title)
#     if ylim is not None:
#         plt.ylim(*ylim)
#     plt.xlabel("Training examples")
#     plt.ylabel("Score")
#     train_sizes, train_scores, test_scores = learning_curve(
#         estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
#     train_scores_mean = np.mean(train_scores, axis=1)
#     train_scores_std = np.std(train_scores, axis=1)
#     test_scores_mean = np.mean(test_scores, axis=1)
#     test_scores_std = np.std(test_scores, axis=1)
#     plt.grid()

#     plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
#                      train_scores_mean + train_scores_std, alpha=0.1,
#                      color="r")
#     plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
#                      test_scores_mean + test_scores_std, alpha=0.1, color="g")
#     plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
#              label="Training score")
#     plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
#              label="Cross-validation score")

#     plt.legend(loc="best")
#     return plt

# # Plot the learning curve
# title = "Learning Curves (Random Forest)"
# plot_learning_curve(grid_search.best_estimator_, title, X_train_pca, y_train_balanced, cv=5)
# plt.show()