### College of Computing and Informatics, Drexel University
### INFO 213: Data Science Programming II
---

## Project Proposal

## Project Title: Countrywide Car Accidents Analysis and Forecasting

## Student(s): Khanh Tran, Mark Amon, Amanjyot Singh

#### Date: August 10, 2020
---

#### Purpose
---
You are asked to propose a final project and present in the class. This proposal should describe the problem, the data sets, and the goal(s) of the project. Use the Project Requirements at the end of this notebook for choosing and scoping your project. 

### 1. Introduction
---
*(Introduce the project and describe the objectives.)* 

With a good amount of data and thoroughly executed analytics, one can possibly unveil the many faces of a problem or phenomenon. Data science has been being considered the most direct and reliable way to attack a problem, tracing it to the root and predicting what and when next consequences will take place. This project will follow the same direction and try to solve a specific real-world problem: what can data analytics do to reduce the number of car accidents in the U.S. The analytics will be based on “A Countrywide Traffic Accident Dataset” by Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. In this project, we will strive for understanding the cause and effect rules of the accidents, and from that, we will try to build several machine learning models that can help with the future accidents forecasting.

### 2. Problem Definition
---
*(Define the problem that will be solved in this data analytics project.)*

On average, there are 6 million car accidents in the U.S. every year. That's roughly 16,438 per day. Over 37,000 Americans die in automobile crashes per year, and there is an additional 3 million injured or disabled annually. Economically, traffic accidents cost the country $871 billion a year, and that was 6 years ago. These are only a few quick car crash statistics happening right now in the U.S. Even though the country is standing at 110th on the list of countries with the highest traffic-related death rate, the number can still be lowered tremendously if science-based solutions are carried out in a mission to improve the safety of the people on the roads. With a good dataset, data analysis can be an efficient method to extract useful information in order to figure out the cause and effect rules of the accidents, which will result in improved accident prevention.

### 3. Data Sources
---
*(Describe the origin of the data sources. What is the format of the original data? How to access the data?)*

As the dataset was acquired on Kaggle and because of its size, downloading it to local computers will be quite time-consuming. Using Kaggle notebook will solve this problem as we don't have to manually download the dataset to use it. Kaggle allows their users to get access to the datasets available on their website. There are currently about 3.5 million accident records in this dataset. It covers 49 states of the USA, and the data were collected from February 2016 to June 2020, using two APIs that provide streaming traffic incident (or event) data. Along with the large number of records, this dataset also provide a wide range of attributes for each accident. With 49 columns, analysts can observe and discover many faces of the accidents such as starting-ending time, exact starting-ending location, address, weather conditions, existed crossings, junctions, or bumps, etc. Our goals are planned upon this variety of features. We will also make use of pandas, numpy, matplotlib.pyplot, math, and sklearn packages of Python to effectively analyze, visualize, and model our data.

Acknowledgements

- Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. “A Countrywide Traffic Accident Dataset.”, 2019.

- Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. "Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights." In proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.

https://www.kaggle.com/sobhanmoosavi/us-accidents

### 4. The Goal(s) of the predictions
---
*(What are the expected results of the project?)*

The project's mission is to provide assitance for this battle against car accidents with statistics-based findings and data-based analysis. To be more specific, we strive for determining the importance of each attribute toward predicting severity levels of accidents. The process is to create two similar models that predict severity level of available accidents based on a set of attributes. One model will take in all attributes except for the target attribute, for example weather conditions, and the other one will take in every attribute including the target attribute. Two sets of metric scores will be calculated and compared to see if adding the target attribute to the model will improve its performance or hurt it. Additionally, the level of influence of each target attribute will also be evaluated to find out which one plays the most important role and which one plays the least in supporting the performance of the algorithms. The target attributes are:

- Weather Conditions
- Locations
- Time of the day
- Time of the year

For each attribute, we will create a separate set of models. We will try to implement as many machine learning algorithms as possible. Each of the attribute listed above will be carefully processed and feeded into the models, making sure they retain their full features and hopefully are influential enough to affect the performance of the algorithms for the better or worse. 

### 5. Preliminary Models
---

In this section, we will examine the possibility of achieving the goals mentioned above. We will use 8 different models which are listed below in the code. The goal is to make sure the models work fluently without errors, quickly examine the performance of the models, and find out if the target attribute in this example improves or hurts the accuracy. Weather conditions will be the target attribute for these preliminary models.

In [2]:
# Import models 
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import r2_score, mean_squared_error
import matplotlib.pyplot as plt
import pandas as  pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("../input/us-accidents/US_Accidents_June20.csv")
df.head()

Unnamed: 0,ID,Source,TMC,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,End_Lat,End_Lng,...,Roundabout,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Sunrise_Sunset,Civil_Twilight,Nautical_Twilight,Astronomical_Twilight
0,A-1,MapQuest,201.0,3,2016-02-08 05:46:00,2016-02-08 11:00:00,39.865147,-84.058723,,,...,False,False,False,False,False,False,Night,Night,Night,Night
1,A-2,MapQuest,201.0,2,2016-02-08 06:07:59,2016-02-08 06:37:59,39.928059,-82.831184,,,...,False,False,False,False,False,False,Night,Night,Night,Day
2,A-3,MapQuest,201.0,2,2016-02-08 06:49:27,2016-02-08 07:19:27,39.063148,-84.032608,,,...,False,False,False,False,True,False,Night,Night,Day,Day
3,A-4,MapQuest,201.0,3,2016-02-08 07:23:34,2016-02-08 07:53:34,39.747753,-84.205582,,,...,False,False,False,False,False,False,Night,Day,Day,Day
4,A-5,MapQuest,201.0,2,2016-02-08 07:39:07,2016-02-08 08:09:07,39.627781,-84.188354,,,...,False,False,False,False,True,False,Day,Day,Day,Day


For proposal purposes, we have decided to keep 30k records to decrease the running times. This will be changed in our final report. Furthermore, we will use more attributes in our final report. These are used just for proposal purposes. 

In [3]:
# With the Weather_Condition column
df2 = df[["Distance(mi)", 
          "Temperature(F)", 
          "Wind_Chill(F)", 
          "Humidity(%)", 
          "Pressure(in)", 
          "Visibility(mi)", 
          "Precipitation(in)", 
          "Weather_Condition",
         "Severity"]]

# Without the Weather_Condition column
df1 = df[["Distance(mi)",  
          "Temperature(F)", 
          "Wind_Chill(F)", 
          "Humidity(%)", 
          "Pressure(in)", 
          "Visibility(mi)", 
          "Precipitation(in)",
          "Severity"
          ]]

df1.replace(-1, np.nan, inplace=True)  
df1 = df1.dropna()

df2.replace(-1, np.nan, inplace=True)  
df2 = df2.dropna()

# Keep 30000 to decrease running times
df1 = df1[:30000] 
df2 = df2[:30000] 


Y1 = df1.Severity.values
X1 = df1.loc[:, df1.columns != 'Severity']

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  method=method,
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  method=method,


In [16]:
print(Y1.shape)
print(X1.shape)

(30000,)
(30000, 7)


In [6]:
X1.head()

Unnamed: 0,Distance(mi),Temperature(F),Wind_Chill(F),Humidity(%),Pressure(in),Visibility(mi),Precipitation(in)
5,0.01,37.9,35.5,97.0,29.63,7.0,0.03
9,0.01,37.4,33.8,100.0,29.62,3.0,0.02
11,0.01,37.4,33.8,100.0,29.62,3.0,0.02
14,0.01,37.4,33.8,100.0,29.62,3.0,0.02
20,0.0,33.8,29.6,100.0,29.62,2.0,0.01


In [7]:
X_train1, X_test1,Y_train1,Y_test1 = train_test_split(X1, Y1, test_size=0.33, random_state=99)
#Without weather

svc = SVC()
svc.fit(X_train1, Y_train1)
Y_pred = svc.predict(X_test1)
acc_svc1 = round(svc.score(X_test1, Y_test1) * 100, 2)
print("Accuracy SVC: " , acc_svc1)

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train1, Y_train1)
Y_pred = knn.predict(X_test1)
acc_knn1 = round(knn.score(X_test1, Y_test1) * 100, 2)
print("Accuracy KNN: " , acc_knn1)


# Logistic Regression
logreg = LogisticRegression(max_iter = 2000)
logreg.fit(X_train1, Y_train1)
Y_pred = logreg.predict(X_test1)
acc_log1 = round(logreg.score(X_train1, Y_train1) * 100, 2)
print("Accuracy Log: ", acc_log1)


# Gaussian Naive Bayes

gaussian = GaussianNB()
gaussian.fit(X_train1, Y_train1)
Y_pred = gaussian.predict(X_test1)
acc_gaussian1 = round(gaussian.score(X_test1, Y_test1) * 100, 2)
print("Accuracy Gaussian: ", acc_gaussian1)

# Perceptron

perceptron = Perceptron()
perceptron.fit(X_train1, Y_train1)
Y_pred = perceptron.predict(X_test1)
acc_perceptron1 = round(perceptron.score(X_test1, Y_test1) * 100, 2)
print("Accuracy Perceptron: ", acc_perceptron1)

# Linear SVC

#linear_svc = LinearSVC(max_iter = 10000)
#linear_svc.fit(X_train1, Y_train1)
#Y_pred = linear_svc.predict(X_test1)
#acc_linear_svc1 = round(linear_svc.score(X_test1, Y_test1) * 100, 2)
#print("Accuracy Linear SVC: ", acc_linear_svc1)

# Stochastic Gradient Descent

sgd = SGDClassifier()
sgd.fit(X_train1, Y_train1)
Y_pred = sgd.predict(X_test1)
acc_sgd1 = round(sgd.score(X_test1, Y_test1) * 100, 2)
print("Accuracy SGD: ", acc_sgd1)

# Decision Tree

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train1, Y_train1)
Y_pred = decision_tree.predict(X_test1)
acc_decision_tree1 = round(decision_tree.score(X_test1, Y_test1) * 100, 2)
print("Accuracy Decision Tree: ", acc_decision_tree1)

# Random Forest

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train1, Y_train1)
Y_pred = random_forest.predict(X_test1)
random_forest.score(X_train1, Y_train1)
acc_random_forest1 = round(random_forest.score(X_test1, Y_test1) * 100, 2)
print("Accuracy Random Forest: ", acc_random_forest1)

Accuracy SVC:  66.34
Accuracy KNN:  63.0
Accuracy Log:  67.77
Accuracy Gaussian:  25.65
Accuracy Perceptron:  40.52
Accuracy SGD:  57.87
Accuracy Decision Tree:  64.1
Accuracy Random Forest:  68.62


When feeding weather conditions to the models, we have to use a technique call one-hot encoding since this is a categorical variable. In other words, we will map the values of this attribute to numerical values.

In [13]:
df2["Weather_Condition"].unique()

array(['Light Rain', 'Light Snow', 'Overcast', 'Mostly Cloudy', 'Snow',
       'Light Freezing Drizzle', 'Rain', 'Heavy Rain',
       'Light Freezing Rain', 'Cloudy', 'Clear', 'Light Freezing Fog',
       'Scattered Clouds', 'Haze', 'Partly Cloudy', 'Fair', 'Fog',
       'Smoke', 'Blowing Dust / Windy', 'Fair / Windy',
       'Light Rain / Windy', 'Light Thunderstorms and Rain',
       'Showers in the Vicinity', 'Light Rain Shower',
       'Light Rain with Thunder', 'Mostly Cloudy / Windy',
       'Partly Cloudy / Windy', 'Light Drizzle',
       'Thunder in the Vicinity', 'T-Storm', 'Thunder', 'Heavy T-Storm',
       'Heavy T-Storm / Windy', 'Blowing Snow', 'Drizzle',
       'Thunderstorms and Rain', 'Light Thunderstorms and Snow',
       'Heavy Thunderstorms and Rain', 'Heavy Snow', 'Light Ice Pellets',
       'Light Rain Showers', 'Mist', 'Ice Pellets', 'Heavy Drizzle',
       'N/A Precipitation', 'Cloudy / Windy',
       'Heavy Thunderstorms and Snow', 'Rain / Windy',
       'Heavy 

As there are a variety of different values indicating different weather conditions, for the sake of the main goal of this example, we will simplify it to "Rain", "Snow, "Fog", and "Other". "Rain" will be mapped to 1, "Snow" is 2, "Fog" is 3, and "Other" is 4.

In [15]:
# One-hot encoding
encoded_cons = []
for con in df2["Weather_Condition"].values:
    if "Rain" in con.split(" "):
        encoded_cons.append(1)
    elif "Snow" in con.split(" "):
        encoded_cons.append(2)
    elif "Fog" in con.split(" "):
        encoded_cons.append(3)
    else:
        encoded_cons.append(4)

# New column and delete the original Weather_Condition column
df2['Encoded_Weather'] = encoded_cons
del df2["Weather_Condition"]

df2

Unnamed: 0,Distance(mi),Temperature(F),Wind_Chill(F),Humidity(%),Pressure(in),Visibility(mi),Precipitation(in),Severity,Encoded_Weather
5,0.01,37.9,35.5,97.0,29.63,7.0,0.03,3,1
9,0.01,37.4,33.8,100.0,29.62,3.0,0.02,3,1
11,0.01,37.4,33.8,100.0,29.62,3.0,0.02,3,1
14,0.01,37.4,33.8,100.0,29.62,3.0,0.02,2,1
20,0.00,33.8,29.6,100.0,29.62,2.0,0.01,2,2
...,...,...,...,...,...,...,...,...,...
527539,0.00,64.0,64.0,32.0,27.44,10.0,0.00,2,4
527540,0.00,59.0,59.0,36.0,30.10,10.0,0.00,2,4
527541,0.00,63.0,63.0,54.0,30.07,10.0,0.00,2,4
527542,0.00,59.0,59.0,60.0,30.05,10.0,0.00,2,4


In [17]:
Y = df2.Severity.values
X = df2.loc[:, df2.columns != 'Severity']

In [19]:
X.head()

Unnamed: 0,Distance(mi),Temperature(F),Wind_Chill(F),Humidity(%),Pressure(in),Visibility(mi),Precipitation(in),Encoded_Weather
5,0.01,37.9,35.5,97.0,29.63,7.0,0.03,1
9,0.01,37.4,33.8,100.0,29.62,3.0,0.02,1
11,0.01,37.4,33.8,100.0,29.62,3.0,0.02,1
14,0.01,37.4,33.8,100.0,29.62,3.0,0.02,1
20,0.0,33.8,29.6,100.0,29.62,2.0,0.01,2


In [20]:
print(Y.shape)
print(X.shape)

(30000,)
(30000, 8)


Pay attention that these new models that are trained on weather conditions will be tested with X_test1 and Y_test1 in order to produce an unbiased accuracy score. 

In [24]:
X_train2, X_test2, Y_train2, Y_test2 = train_test_split(X, Y, test_size=0.33, random_state=99)
#With weather

svc = SVC()
svc.fit(X_train2, Y_train2)
Y_pred = svc.predict(X_test1)
acc_svc2 = round(svc.score(X_test1, Y_test1) * 100, 2)
print("Accuracy SVC: " , acc_svc2)
print("Improvement: ", acc_svc2 > acc_svc1)


knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train2, Y_train2)
Y_pred = knn.predict(X_test1)
acc_knn2 = round(knn.score(X_test1, Y_test1) * 100, 2)
print("Accuracy KNN: " , acc_knn2)
print("Improvement: ", acc_knn2 > acc_knn1)


# Logistic Regression
logreg = LogisticRegression(max_iter = 2000)
logreg.fit(X_train2, Y_train2)
Y_pred = logreg.predict(X_test1)
acc_log2 = round(logreg.score(X_train1, Y_train1) * 100, 2)
print("Accuracy Log: ", acc_log2)
print("Improvement: ", acc_log2 > acc_log1)


# Gaussian Naive Bayes

gaussian = GaussianNB()
gaussian.fit(X_train2, Y_train2)
Y_pred = gaussian.predict(X_test1)
acc_gaussian2 = round(gaussian.score(X_test1, Y_test1) * 100, 2)
print("Accuracy Gaussian: ", acc_gaussian2)
print("Improvement: ", acc_gaussian2 > acc_gaussian1)

# Perceptron

perceptron = Perceptron()
perceptron.fit(X_train2, Y_train2)
Y_pred = perceptron.predict(X_test1)
acc_perceptron2 = round(perceptron.score(X_test1, Y_test1) * 100, 2)
print("Accuracy Perceptron: ", acc_perceptron2)
print("Improvement: ", acc_perceptron2 > acc_perceptron1)

# Linear SVC

#linear_svc = LinearSVC(max_iter = 10000)
#linear_svc.fit(X_train1, Y_train1)
#Y_pred = linear_svc.predict(X_test1)
#acc_linear_svc1 = round(linear_svc.score(X_test1, Y_test1) * 100, 2)
#print("Accuracy Linear SVC: ", acc_linear_svc1)

# Stochastic Gradient Descent

sgd = SGDClassifier()
sgd.fit(X_train2, Y_train2)
Y_pred = sgd.predict(X_test1)
acc_sgd2 = round(sgd.score(X_test1, Y_test1) * 100, 2)
print("Accuracy SGD: ", acc_sgd2)
print("Improvement: ", acc_sgd2 > acc_sgd1)


# Decision Tree

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train2, Y_train2)
Y_pred = decision_tree.predict(X_test1)
acc_decision_tree2 = round(decision_tree.score(X_test1, Y_test1) * 100, 2)
print("Accuracy Decision Tree: ", acc_decision_tree2)
print("Improvement: ", acc_decision_tree2 > acc_decision_tree1)


# Random Forest

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train2, Y_train2)
Y_pred = random_forest.predict(X_test1)
random_forest.score(X_train1, Y_train1)
acc_random_forest2 = round(random_forest.score(X_test1, Y_test1) * 100, 2)
print("Accuracy Random Forest: ", acc_random_forest2)
print("Improvement: ", acc_random_forest2 > acc_random_forest1)

Accuracy SVC:  66.22
Improvement:  False
Accuracy KNN:  63.6
Improvement:  False
Accuracy Log:  67.71
Improvement:  False
Accuracy Gaussian:  43.08
Improvement:  False
Accuracy Perceptron:  66.22
Improvement:  False
Accuracy SGD:  66.05
Improvement:  True
Accuracy Decision Tree:  64.93
Improvement:  False
Accuracy Random Forest:  68.36
Improvement:  False


The results indicated that adding weather-related features to a machine learning algorithm in predicting severity of an accident did not substantially improve the accuracy of models in this PARTICULAR example. However, in order to firmly conclude that whether weather-related features, and other target features (time of the day, time of the year, locations), actually hurt the models or not, we need to take into consideration the possibility of an imbalance dataframe, df1 and df2 in this case. Not all four classes of severity levels are evidently evenly distributed in the training set. Another thing to consider is the size of the training set; 30,000 can possibly be an insufficient number considering we have quite a few attributes feeding into the models. 

In our final report, we will focus more on data preprocessing to create an unbiased experiment as well as increase the training set size to fully utilize the machine learning algorithms. Other metric scores (precision, recall, and F1 score) will also be calculated along side with accuracy. Visualization will be selectively added to plot out the difference between models' performances. Lastly, we will expand the input to consist of more attributes considering the great resource of the original dataset. 

---
(*Use the following requirements for writing your reports. DO NOT DELETE THE CELLS BELLOW*)

# Project Requirements

This final project examines the level of knowledge the students have learned from the course. The following course outcomes will be checked against the content of the report:

Upon successful completion of this course, a student will be able to:
* Describe the key Python tools and libraries that related to a typical data analytics project. 
* Identify data science libraries, frameworks, modules, and toolkits in Python that efficiently implement the most common data science algorithms and techniques.
* Apply latest Python techniques in data acquisition, transformation and predictive analytics for data science projects.
* Discuss the underlying principles and main characteristics of the most common methods and techniques for data analytics. 
* Build data analytic and predictive models for real world data sets using existing Python libraries.

** Marking will be foucsed on both presentation and content.** 

## Written Presentation Requirements
The report will be judged on the basis of visual appearance, grammatical correctness, and quality of writing, as well as its contents. Please make sure that the text of your report is well-structured, using paragraphs, full sentences, and other features of well-written presentation.

## Technical Content:
* Is the problem well defined and described thoroughly?
* Is the size and complexity of the data set used in this project comparable to that of the example data sets used in the lectures and assignments?
* Did the report describe the charactriatics of the data?
* Did the report describe the goals of the data analysis?
* Did the analysis conduct exploratory analyses on the data?
* Did the analysis build models of the data and evaluated the performance of the models?
* Overall, what is the rating of this project?