## 1. Introduction

For the final capstone project in the IBM certificate course, we analyze accident “severity” in terms of human fatality, traffic delay, property damage, or any other adverse impact of an accident. The data was collected by the Seattle SDOT Traffic Management Division and provided by Coursera via a link. The dataset is updated weekly and covers 2004 to the present. It contains information such as severity code, address type, location, collision type, weather, road condition, and speeding, among others.

The target audience of this study is anyone who cares about traffic records, especially staff in the transportation department. We also want to identify the causes of collisions and help reduce accidents in the future.

## 2. Data

The data used to train the model and predict the severity level is the shared data fetched from GISWeb and the Seattle government. 
It contains detailed traffic accident records from 2004 to the present and is refreshed weekly.

A total of 194,673 accidents are recorded, and each record contains 38 properties. An example record:

| Index          | Value                                                  |
|----------------|--------------------------------------------------------|
| SEVERITYCODE   | 2                                                      |
| X              | -122.32314840000002                                    |
| Y              | 47.70314032                                            |
| OBJECTID       | 1                                                      |
| INCKEY         | 1307                                                   |
| COLDETKEY      | 1307                                                   |
| REPORTNO       | 3502005                                                |
| STATUS         | Matched                                                |
| ADDRTYPE       | Intersection                                           |
| INTKEY         | 37475.0                                                |
| LOCATION       | 5TH AVE NE AND NE 103RD ST                             |
| EXCEPTRSNCODE  |                                                        |
| EXCEPTRSNDESC  | nan                                                    |
| SEVERITYCODE.1 | 2                                                      |
| SEVERITYDESC   | Injury Collision                                       |
| COLLISIONTYPE  | Angles                                                 |
| PERSONCOUNT    | 2                                                      |
| PEDCOUNT       | 0                                                      |
| PEDCYLCOUNT    | 0                                                      |
| VEHCOUNT       | 2                                                      |
| INCDATE        | 2013/03/27 00:00:00+00                                 |
| INCDTTM        | 3/27/2013 2:54:00 PM                                   |
| JUNCTIONTYPE   | At Intersection (intersection related)                 |
| SDOT_COLCODE   | 11                                                     |
| SDOT_COLDESC   | MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END AT ANGLE |
| INATTENTIONIND | nan                                                    |
| UNDERINFL      | N                                                      |
| WEATHER        | Overcast                                               |
| ROADCOND       | Wet                                                    |
| LIGHTCOND      | Daylight                                               |
| PEDROWNOTGRNT  | nan                                                    |
| SDOTCOLNUM     | nan                                                    |
| SPEEDING       | nan                                                    |
| ST_COLCODE     | 10                                                     |
| ST_COLDESC     | Entering at angle                                      |
| SEGLANEKEY     | 0                                                      |
| CROSSWALKKEY   | 0                                                      |
| HITPARKEDCAR   | N                                                      |

The parameters are used as follows
(the full description is in [Metadata.pdf](https://github.com/Mick235711/Coursera_Capstone/blob/main/Metadata.pdf)):
- `SEVERITYCODE` (1 or 2), `PERSONCOUNT` (0-81), `PEDCOUNT` (0-6), `PEDCYLCOUNT` (0-2), `VEHCOUNT` (0-12), `ST_COLCODE` (0-84): 
  These parameters describe the severity of the accident. We can combine them into a single, uniform severity value. 
  The example record has a severity code of 2, person and vehicle counts of 2 each, and pedestrian and cyclist counts of 0. The state collision code is 10.
  From these values, the final severity value is 6, which matches the severity code plus the four counts (2 + 2 + 0 + 0 + 2); the state collision code is not added.
- `COLLISIONTYPE` (10 different types): Can be mapped to values 0-9 for prediction. The example has a collision type of `Angles` (1).
- `INCDATE` & `INCDTTM` (date & time): Although not used for prediction, these can be used to plot the frequency of accidents in each year. The example happened on March 27, 2013.
- `JUNCTIONTYPE` (6 different types): Changed to 0-5. The 9 `Unknown` records can be assigned to the majority type, `Mid-Block (not related to intersection)`.
  The example has a junction type of `At Intersection (intersection related)` (1).
- `INATTENTIONIND` (Y/N): Changed to 0-1. NaN values represent No, so they need processing. The example has a value of No (0).
- `UNDERINFL` (Y/N): Changed to 0-1. The example has a value of No (0).
- `WEATHER` (11 different types): Can be combined into 
  `Clear` (`Clear`, `Partly Cloudy`, `Overcast`), 
  `Waterdrop` (`Raining`, `Snowing`, `Sleet/Hail/Freezing Rain`), 
  `Severe Condition` (`Blowing Sand/Dirt`, `Severe Crosswind`, `Fog/Smog/Smoke`),
  `Other` (`Other`, `Unknown`) and then changed to 0-3. The example has weather of `Overcast` (`Clear`, 0).
- `ROADCOND` (9 different types): Can be combined into 
  `Good` (`Dry`), 
  `Sweeping` (`Wet`, `Sand/Mud/Dirt`, `Oil`), 
  `Bad` (`Ice`, `Standing Water`, `Snow/Slush`), 
  `Other` (`Other`, `Unknown`) and then changed to 0-3. The example has a road condition of `Wet` (`Sweeping`, 1).
- `LIGHTCOND` (9 different types): `Dark - Unknown Lighting` can be assumed to mean street lights on, as that is the majority among the dark conditions. The types can then be combined into 
  `Light` (`Daylight`), 
  `Partial Light` (`Dawn`, `Dusk`, `Dark - Street Lights On`, `Dark - Unknown Lighting`),
  `Dark` (`Dark - No Street Lights`, `Dark - Street Lights Off`),
  `Other` (`Other`, `Unknown`) and then changed to 0-3. The example has a light condition of `Daylight` (`Light`, 0).
- `PEDROWNOTGRNT` (Y/N): NaN = No, then changed to 0-1. The example has a value of No (0).
- `SPEEDING` (Y/N): NaN = No, then changed to 0-1. The example has a value of No (0).
- `HITPARKEDCAR` (Y/N): Changed to 0-1. The example has a value of No (0).
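
The groupings and Y/N encodings listed above can be sketched with pandas as follows. This is a minimal illustration, not the notebook's actual code: the dictionary names, `FLAG_COLUMNS`, and `encode_conditions` are hypothetical, and treating every value other than `Y` (including NaN) as 0 is an assumption based on the "NaN = No" rule.

```python
import numpy as np
import pandas as pd

# Integer codes follow the 0-3 order in which the groups are listed above.
WEATHER_GROUPS = {
    "Clear": 0, "Partly Cloudy": 0, "Overcast": 0,                       # Clear
    "Raining": 1, "Snowing": 1, "Sleet/Hail/Freezing Rain": 1,           # Waterdrop
    "Blowing Sand/Dirt": 2, "Severe Crosswind": 2, "Fog/Smog/Smoke": 2,  # Severe Condition
    "Other": 3, "Unknown": 3,                                            # Other
}
ROADCOND_GROUPS = {
    "Dry": 0,                                          # Good
    "Wet": 1, "Sand/Mud/Dirt": 1, "Oil": 1,            # Sweeping
    "Ice": 2, "Standing Water": 2, "Snow/Slush": 2,    # Bad
    "Other": 3, "Unknown": 3,                          # Other
}
LIGHTCOND_GROUPS = {
    "Daylight": 0,                                               # Light
    "Dawn": 1, "Dusk": 1,
    "Dark - Street Lights On": 1, "Dark - Unknown Lighting": 1,  # Partial Light
    "Dark - No Street Lights": 2, "Dark - Street Lights Off": 2, # Dark
    "Other": 3, "Unknown": 3,                                    # Other
}
FLAG_COLUMNS = ["INATTENTIONIND", "UNDERINFL", "PEDROWNOTGRNT", "SPEEDING", "HITPARKEDCAR"]


def encode_conditions(df):
    """Return a copy with grouped condition codes and 0/1 flags."""
    out = df.copy()
    out["WEATHER"] = out["WEATHER"].map(WEATHER_GROUPS)
    out["ROADCOND"] = out["ROADCOND"].map(ROADCOND_GROUPS)
    out["LIGHTCOND"] = out["LIGHTCOND"].map(LIGHTCOND_GROUPS)
    for col in FLAG_COLUMNS:
        # "Y" becomes 1; anything else (including NaN) becomes 0 (assumption).
        out[col] = out[col].map({"Y": 1}).fillna(0).astype(int)
    return out


# The example record from the table above, restricted to the encoded columns.
example = pd.DataFrame([{
    "WEATHER": "Overcast", "ROADCOND": "Wet", "LIGHTCOND": "Daylight",
    "INATTENTIONIND": np.nan, "UNDERINFL": "N", "PEDROWNOTGRNT": np.nan,
    "SPEEDING": np.nan, "HITPARKEDCAR": "N",
}])
encoded = encode_conditions(example)
print(encoded.iloc[0].to_dict())
```

Condition values missing from a mapping are left as NaN rather than silently assigned to a group, so unexpected categories stay visible.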

Together, we have 10 independent input features and 1 target (severity). The values for the example record are shown below:

|   SEVERITY |   COLLISIONTYPE |   JUNCTIONTYPE |   INATTENTIONIND |   UNDERINFL |   WEATHER |   ROADCOND |   LIGHTCOND |    PEDROWNOTGRNT |   SPEEDING |   HITPARKEDCAR |
|------------|-----------------|----------------|------------------|-------------|-----------|------------|-------------|------------------|------------|----------------|
|          6 |               1 |              1 |                0 |           0 |         0 |          1 |           0 |                0 |          0 |              0 |
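
The severity value of 6 in the table above can be reproduced with a short pandas sketch. The text does not spell out exactly which columns enter the sum, so the choice below (severity code plus the four count columns, with `ST_COLCODE` left out) is an assumption; it is the combination that yields 6 for the example record. `SEVERITY_PARTS` and `composite_severity` are hypothetical names.

```python
import pandas as pd

# Columns assumed to enter the combined severity value.
SEVERITY_PARTS = ["SEVERITYCODE", "PERSONCOUNT", "PEDCOUNT", "PEDCYLCOUNT", "VEHCOUNT"]


def composite_severity(df):
    """Row-wise sum of the severity code and the count columns."""
    return df[SEVERITY_PARTS].sum(axis=1)


# Values taken from the example record.
example = pd.DataFrame([{
    "SEVERITYCODE": 2, "PERSONCOUNT": 2, "PEDCOUNT": 0,
    "PEDCYLCOUNT": 0, "VEHCOUNT": 2, "ST_COLCODE": 10,
}])
example["SEVERITY"] = composite_severity(example)
print(example["SEVERITY"].iloc[0])  # 6
```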

We can then use machine learning models to perform the prediction.



## 3. Methodology

We used Jupyter Notebook for the data analysis. To generate the tables and graphs for the dataset, we imported the Python libraries pandas, NumPy, Matplotlib, and Seaborn.
First, we imported the data with `pd.read_csv`. It had 194,673 rows and 38 columns, so we narrowed it down to 8 columns ('Severity', 'X', 'Y', 'Location', 'Vehcount', 'Weather', 'Roadcond', 'Lightcond') and deleted the rows with missing values, which left a final dataset of 184,167 observations and 8 variables.
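
The column selection and missing-value removal described above might look like the sketch below. It assumes the eight shortened names correspond to the dataset columns `SEVERITYCODE`, `X`, `Y`, `LOCATION`, `VEHCOUNT`, `WEATHER`, `ROADCOND`, and `LIGHTCOND`. The helper `select_and_clean` is hypothetical, and the toy frame stands in for the real CSV, which would be loaded with `pd.read_csv`.

```python
import numpy as np
import pandas as pd

# Assumed mapping of the eight kept variables to the dataset's column names.
KEEP = ["SEVERITYCODE", "X", "Y", "LOCATION", "VEHCOUNT",
        "WEATHER", "ROADCOND", "LIGHTCOND"]


def select_and_clean(df):
    """Keep the eight analysis columns and drop rows with any missing value."""
    return df[KEEP].dropna().reset_index(drop=True)


# Toy stand-in for the full dataset: the example record plus a copy with a
# missing WEATHER value (the second row is made up for illustration).
record = {
    "OBJECTID": 1, "SEVERITYCODE": 2,
    "X": -122.32314840000002, "Y": 47.70314032,
    "LOCATION": "5TH AVE NE AND NE 103RD ST", "VEHCOUNT": 2,
    "WEATHER": "Overcast", "ROADCOND": "Wet", "LIGHTCOND": "Daylight",
}
toy = pd.DataFrame([record, {**record, "WEATHER": np.nan}])

clean = select_and_clean(toy)
print(toy.shape, "->", clean.shape)  # (2, 9) -> (1, 8)
```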

Since most of the variables were categorical, it was hard to build a regression model. So, in this study, we focused on graphical summaries and value counts for the different categories. There were around 135,000 (about 2/3) level 1 accidents and 60,000 (about 1/3) level 2 accidents.
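
A value count of the kind described above can be sketched as follows. The function name `class_balance` is hypothetical and the three-value series is made-up toy data, not the real severity column; on the real data the same call would report the level 1 and level 2 counts and their shares.

```python
import pandas as pd


def class_balance(severity):
    """Count each severity level and report its share of all records."""
    counts = severity.value_counts().sort_index()
    shares = (counts / counts.sum()).round(3)
    return pd.DataFrame({"count": counts, "share": shares})


# Made-up toy data: two level 1 accidents and one level 2 accident.
toy = pd.Series([1, 1, 2], name="SEVERITYCODE")
balance = class_balance(toy)
print(balance)
```

The same helper works for any categorical column, such as `WEATHER` or `ROADCOND`, which fits the category-by-category comparison used in this study.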