### College of Computing and Informatics, Drexel University
### INFO 213: Data Science Programming II
---

## Project Proposal

## Project Title: Countrywide Car Accidents Analysis and Forecasting

## Student(s): Khanh Tran, Mark Amon, Amanjyot Singh

#### Date: August 10, 2020
---

#### Purpose
---
You are asked to propose a final project and present it in class. This proposal should describe the problem, the data sets, and the goal(s) of the project. Use the Project Requirements at the end of this notebook for choosing and scoping your project.

### 1. Introduction
---
*(Introduce the project and describe the objectives.)* 

With enough data and carefully executed analytics, one can uncover the many facets of a problem or phenomenon. Data science is widely regarded as a direct and reliable way to attack a problem: tracing it to its root and predicting what consequences will follow and when. This project takes the same approach to a specific real-world question: what can data analytics do to reduce the number of car accidents in the U.S.? The analysis is based on “A Countrywide Traffic Accident Dataset” by Sobhan Moosavi, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. We aim to understand the cause-and-effect rules behind the accidents and, from that understanding, to build several machine learning models that can help forecast future accidents.

### 2. Problem Definition
---
*(Define the problem that will be solved in this data analytics project.)*

On average, there are 6 million car accidents in the U.S. every year, or roughly 16,438 per day. Over 37,000 Americans die in automobile crashes each year, and an additional 3 million are injured or disabled. Economically, traffic accidents cost the country $871 billion a year, a figure that is already 6 years old. These are only a few quick statistics on car crashes in the U.S. today. Even though the country ranks 110th on the list of countries with the highest traffic-related death rate, the numbers can still be lowered tremendously if science-based solutions are applied to improve the safety of people on the roads. With a good dataset, data analysis can be an efficient way to extract the cause-and-effect rules of accidents, which in turn can improve accident prevention.
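The per-day figure follows directly from the annual total. A quick check, assuming a 365-day year (the variable names are ours, chosen for illustration):

```python
# Annual figure quoted in the problem definition.
ACCIDENTS_PER_YEAR = 6_000_000
DAYS_PER_YEAR = 365  # assumption: a non-leap year

accidents_per_day = ACCIDENTS_PER_YEAR / DAYS_PER_YEAR
print(f"{accidents_per_day:,.0f} accidents per day")  # prints "16,438 accidents per day"
```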

### 3. Data Sources
---
*(Describe the origin of the data sources. What is the format of the original data? How to access the data?)*

The dataset was acquired from Kaggle. Because of its size, downloading it to local computers would be quite time-consuming, so we will work in a Kaggle notebook, which gives direct access to the datasets hosted on the site without a manual download. The dataset currently contains about 3.5 million accident records covering 49 states of the USA. The data were collected from February 2016 to June 2020 using two APIs that provide streaming traffic incident (or event) data. Along with the large number of records, the dataset provides a wide range of attributes for each accident. With 49 columns, analysts can examine many aspects of an accident, such as start and end time, exact start and end location, address, weather conditions, and the presence of crossings, junctions, or bumps. Our goals are built on this variety of features. We will use the pandas, numpy, matplotlib.pyplot, math, and sklearn packages of Python to analyze, visualize, and model the data.
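With roughly 3.5 million rows and 49 columns, it may help to load only the columns a given analysis needs. The sketch below is a minimal illustration, not our final loading code: it assumes the data is available as a CSV file, it assumes column names such as `Start_Time`, `State`, and `Weather_Condition`, and it reads a few made-up sample rows from memory in place of the real file. The helper `load_accidents` is a hypothetical name.

```python
import io

import pandas as pd

# Made-up sample rows standing in for the real file (assumed to be a CSV).
# The column names are assumptions about the dataset's schema.
SAMPLE_CSV = """ID,Severity,Start_Time,State,Weather_Condition
A-1,3,2016-02-08 05:46:00,OH,Light Rain
A-2,2,2016-02-08 06:07:59,OH,Light Rain
A-3,2,2017-06-14 17:30:00,CA,Clear
A-4,4,2019-12-24 08:15:00,PA,Snow
"""


def load_accidents(source, columns):
    """Read only the requested columns and parse the start time as a datetime.

    `columns` must include "Start_Time" because it is parsed as a date.
    """
    return pd.read_csv(source, usecols=columns, parse_dates=["Start_Time"])


accidents = load_accidents(
    io.StringIO(SAMPLE_CSV),
    columns=["ID", "Severity", "Start_Time", "State"],
)
print(accidents.shape)
print(accidents.dtypes)
```

In a Kaggle notebook, `source` would instead be the path to the dataset file under the notebook's input directory.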

Acknowledgements

- Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. “A Countrywide Traffic Accident Dataset.” 2019.

- Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. “Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights.” In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.

https://www.kaggle.com/sobhanmoosavi/us-accidents

### 4. The Goal(s) of the predictions
---
*(What are the expected results of the project?)*

The project's mission is to provide assistance in the battle against car accidents with statistics-based findings and data-based analysis. More specifically, we strive for the following goals in this project:

- Real-time car accident prediction
- Studying car accidents hotspot locations
- Casualty analysis
- Extracting cause and effect rules to predict car accidents
- Contrast frequency of rural and urban accident rates
- Contrast frequency of accident rates in time of year
- Weather contributing factor analysis
- Accidents from left side or right => Indicates safety features and blind spots

The project's progress will be based on these goals.

Because we plan to achieve multiple goals, there are multiple ways to define and measure our outcomes. We therefore discuss the expected outcome of each goal below.

- Real-time car accident prediction: A good outcome for this task is a functional model that can locate upcoming accidents based on real-time influencing factors such as time of day, day of the week, surrounding traffic conditions, and weather. A bad outcome fails to meet these criteria. The worst possible outcome is a model that produces wrong predictions.
- Studying car accidents hotspot locations: The locations of accidents need to be studied thoroughly, with visualizations of the total number of accidents at each location over the period covered. We also need to examine the connections between large hotspots to see whether they form clusters or are isolated. These locations also need to be examined against other factors, such as weather, to discover potential patterns.
- Casualty analysis: Apart from examining total casualties across time and space, we also need to compare the casualty analysis against the hotspot location analysis. The desired outcome is to find out whether the two follow a common trend.
- Extracting cause and effect rules to predict car accidents: This task takes place before "Real-time car accident prediction", and its result determines the quality of the predictions. A good outcome needs to identify specific patterns, dependencies, and cause-and-effect factors in existing accidents. Time, space, and weather conditions are the targets of the analysis.
- Contrast frequency of rural and urban accident rates: This analysis aims to further break down the records into categories and needs to demonstrate the contrast in frequency between them. A good outcome will support "Studying car accidents hotspot locations".
- Contrast frequency of accident rates in time of year: This analysis aims to further break down the records into categories and needs to demonstrate the contrast in frequency between them. A good outcome will support "Studying car accidents hotspot locations".
- Weather contributing factor analysis: The outcome should include an in-depth analysis of how frequently different weather factors occur during accidents. Based on the result, we should be able to feed these factors into our prediction model. Temperature, humidity, rain, etc. should all be included in the analysis.
- Accidents from left side or right => Indicates safety features and blind spots: Quick fact: 70% of accidents occur on the car's left rear passenger quarter panel. This analysis aims to provide statistics-based suggestions for immediately improving drivers' safety by raising awareness of their driving behaviours. This task should therefore produce a statistical outcome that shows a tendency (if there is any).
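Several of the goals above (time of year, weather factors, hotspot locations) start from the same operation: counting records per category. The following is a minimal sketch of that step with pandas, using a small made-up DataFrame in place of the real data. The column names `Start_Time`, `Weather_Condition`, and `State` are assumptions about the dataset's schema, and the values are invented for illustration only.

```python
import pandas as pd

# Made-up records for illustration; the column names are assumed.
records = pd.DataFrame(
    {
        "Start_Time": pd.to_datetime(
            [
                "2016-02-08",
                "2017-06-14",
                "2018-12-24",
                "2019-12-30",
                "2019-12-31",
                "2020-06-01",
            ]
        ),
        "Weather_Condition": ["Light Rain", "Clear", "Snow", "Snow", "Clear", "Clear"],
        "State": ["OH", "CA", "PA", "PA", "PA", "CA"],
    }
)

# Time of year: number of accidents per calendar month (1 = January).
accidents_per_month = records["Start_Time"].dt.month.value_counts().sort_index()

# Weather: share of accidents recorded under each weather condition.
weather_share = records["Weather_Condition"].value_counts(normalize=True)

# Hotspots (coarse): number of accidents per state, largest first.
accidents_per_state = records["State"].value_counts()

print(accidents_per_month)
print(weather_share)
print(accidents_per_state)
```

Raw counts like these show where and when records concentrate, but they do not account for exposure (for example, traffic volume or population), so they are a starting point for the contrasts described above and not a final result.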

Again, the goal of the project is to assist the authorities, or any organization with the appropriate power, in improving the efficiency of accident prevention methods. In other words, the statistics-based results of this project will ultimately serve to produce real solutions, but how those solutions are created and how they use the results is outside the scope of this project.

---
(*Use the following requirements for writing your reports. DO NOT DELETE THE CELLS BELOW*)

# Project Requirements

This final project examines the level of knowledge the students have learned from the course. The following course outcomes will be checked against the content of the report:

Upon successful completion of this course, a student will be able to:
* Describe the key Python tools and libraries that relate to a typical data analytics project.
* Identify data science libraries, frameworks, modules, and toolkits in Python that efficiently implement the most common data science algorithms and techniques.
* Apply latest Python techniques in data acquisition, transformation and predictive analytics for data science projects.
* Discuss the underlying principles and main characteristics of the most common methods and techniques for data analytics. 
* Build data analytic and predictive models for real world data sets using existing Python libraries.

**Marking will be focused on both presentation and content.**

## Written Presentation Requirements
The report will be judged on the basis of visual appearance, grammatical correctness, and quality of writing, as well as its contents. Please make sure that the text of your report is well-structured, using paragraphs, full sentences, and other features of well-written presentation.

## Technical Content:
* Is the problem well defined and described thoroughly?
* Is the size and complexity of the data set used in this project comparable to that of the example data sets used in the lectures and assignments?
* Did the report describe the characteristics of the data?
* Did the report describe the goals of the data analysis?
* Did the analysis conduct exploratory analyses on the data?
* Did the analysis build models of the data and evaluate the performance of the models?
* Overall, what is the rating of this project?