<h1 align="center"><font size="5">Colorado Car Accident Severity Prediction</font></h1>
<h2 align="center"><font size="5">by Joanna Johnson</font></h2>

In this notebook, I use the Colorado_Accidents dataset (a subset of the US_Accidents dataset) to predict the severity level of a car accident, measured by its impact on traffic. The prediction can help roadway users (e.g. drivers) and navigation service providers decide whether to take an alternative route. It can also help authorities estimate more accurately when an accident-caused traffic jam will ease, so that they can redirect traffic away from the affected roadway and minimize the impact.


I will build and train several models using classification algorithms that we learned, then choose the best-performing model for this project.
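As a rough illustration of what comparing a few classifiers and choosing one can look like, here is a minimal sketch. It is not the notebook's actual pipeline: the data are synthetic (generated with `make_classification`), the three candidate models and their hyperparameters are assumptions, and mean 5-fold cross-validated accuracy is only one possible selection criterion.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the accident features and severity labels (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Hypothetical candidate set; the models used later in the notebook may differ.
candidates = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Mean 5-fold cross-validated accuracy for each candidate.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Keep the candidate with the highest mean accuracy.
best_name = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print("best:", best_name)
```

Accuracy is used here only for simplicity; if the severity classes turn out to be imbalanced, a metric such as macro-averaged F1 (`scoring="f1_macro"`) may be a better basis for the comparison.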

In [10]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
import itertools
from matplotlib.ticker import NullFormatter
import matplotlib.ticker as ticker
%matplotlib inline
import seaborn as sns

In [88]:
# Read Colorado_Accidents_new.csv (a subset of US_Accidents_June20.csv, downloaded from https://smoosavi.org/datasets/us_accidents)
# First upload the file to JupyterLab, then read it with pandas read_csv and assign it to the co_accidents_df variable
co_accidents_df=pd.read_csv('Colorado_Accidents_new.csv', encoding = "ISO-8859-1", engine='python')
co_accidents_df.head()

Unnamed: 0,Rush_Hour,Duration_in_Hour,Interstate,ID,Source,TMC,Severity,Start_Time,End_Time,Start_Lat,...,Roundabout,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Sunrise_Sunset,Civil_Twilight,Nautical_Twilight,Astronomical_Twilight
0,0,2.26,0,A-512419,MapQuest,241.0,2,5/24/2020 11:23,5/24/2020 13:38,39.26,...,False,False,False,False,False,False,Day,Day,Day,Day
1,0,0.49,1,A-512420,MapQuest,245.0,3,5/24/2020 14:24,5/24/2020 14:54,39.05,...,False,False,False,False,False,False,Day,Day,Day,Day
2,1,1.49,1,A-512421,MapQuest,241.0,3,5/24/2020 16:17,5/24/2020 17:47,39.78,...,False,False,False,False,False,False,Day,Day,Day,Day
3,1,0.99,0,A-512422,MapQuest,201.0,2,5/24/2020 16:22,5/24/2020 17:22,39.72,...,False,False,False,False,True,False,Day,Day,Day,Day
4,1,0.49,0,A-512423,MapQuest,241.0,2,5/24/2020 16:54,5/24/2020 17:23,39.56,...,False,False,False,False,False,False,Day,Day,Day,Day


In [89]:
# Show the number of rows and columns in the DataFrame
co_accidents_df.shape

(49731, 52)

In [90]:
# show dtype for all columns
co_accidents_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49731 entries, 0 to 49730
Data columns (total 52 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Rush_Hour              49731 non-null  int64  
 1   Duration_in_Hour       49731 non-null  float64
 2   Interstate             49731 non-null  int64  
 3   ID                     49731 non-null  object 
 4   Source                 49731 non-null  object 
 5   TMC                    30865 non-null  float64
 6   Severity               49731 non-null  int64  
 7   Start_Time             49731 non-null  object 
 8   End_Time               49731 non-null  object 
 9   Start_Lat              49731 non-null  float64
 10  Start_Lng              49731 non-null  float64
 11  End_Lat                18866 non-null  float64
 12  End_Lng                18866 non-null  float64
 13  Distance(mi)           49731 non-null  float64
 14  Description            49731 non-null  object 
 15  Nu

In [92]:
# Drop the columns that are not being used
columns_to_drop = [
    'ID', 'Source', 'TMC', 'Start_Time', 'End_Time', 'Start_Lat', 'Start_Lng',
    'End_Lat', 'End_Lng', 'Description', 'Number', 'Street', 'Side', 'City',
    'County', 'Zipcode', 'State', 'Country', 'Timezone', 'Airport_Code',
    'Weather_Timestamp', 'Wind_Direction', 'Amenity', 'Sunrise_Sunset',
    'Civil_Twilight', 'Nautical_Twilight',
]
co_accidents_df.drop(columns=columns_to_drop, inplace=True)
co_accidents_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49731 entries, 0 to 49730
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Rush_Hour              49731 non-null  int64  
 1   Duration_in_Hour       49731 non-null  float64
 2   Interstate             49731 non-null  int64  
 3   Severity               49731 non-null  int64  
 4   Distance(mi)           49731 non-null  float64
 5   Temperature(F)         49266 non-null  float64
 6   Wind_Chill(F)          33652 non-null  float64
 7   Humidity(%)            49251 non-null  float64
 8   Pressure(in)           49362 non-null  float64
 9   Visibility(mi)         49089 non-null  float64
 10  Wind_Speed(mph)        44579 non-null  float64
 11  Precipitation(in)      23870 non-null  float64
 12  Weather_Condition      49215 non-null  object 
 13  Bump                   49731 non-null  bool   
 14  Crossing               49731 non-null  bool   
 15  Gi
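
The non-null counts above show that several of the remaining columns are only partly populated; for example, Wind_Chill(F) has 33652 non-null values and Precipitation(in) has 23870, out of 49731 rows. One way to make this easier to scan is to compute the fraction of missing values per column. The sketch below uses a small toy DataFrame with invented values (only the column names are borrowed from the output above); the same expression could be applied to `co_accidents_df`.

```python
import numpy as np
import pandas as pd

# Toy frame reusing a few column names from the output above; values are invented.
toy_df = pd.DataFrame(
    {
        "Severity": [2, 3, 3, 2],
        "Wind_Chill(F)": [np.nan, 41.0, np.nan, 55.0],
        "Precipitation(in)": [np.nan, np.nan, np.nan, 0.0],
    }
)

# Fraction of missing values per column, largest first.
missing_fraction = toy_df.isna().mean().sort_values(ascending=False)
print(missing_fraction)
```

Columns with a high missing fraction are candidates for either dropping or imputing; which is appropriate depends on how the column is used later.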