# Analysis of Car Accident Severity in the US

This notebook analyzes the severity of car accidents in the US, using accident records collected from February 2016 to June 2020. The analysis includes the following parts:

* [Introduction](#Part_1)
* [Data](#Part_2)
* [Methodology](#Part_3)
    + [Exploratory Data Analysis](#Part_3_1)
    + [Severity Prediction](#Part_3_2)

written by Harry Li

## Introduction <a id='Part_1'></a>

Traffic accidents have become a major threat to public safety and cause substantial economic loss around the world. A global status report on road safety indicated that the number of road traffic deaths continued to increase steadily, reaching 1.35 million in 2016 [[WHO, 2018]](#cite-world2018global). An important task for safety analysts and policymakers who want to mitigate the consequences of accidents is therefore to assess historical traffic accidents comprehensively and, on that basis, to improve the predictability of accidents.

Accident analysis and prediction have been discussed in many previous studies and cover a broad range of categories, including, for example, ***Environmental Stimuli Analysis***, ***Accident Frequency Prediction***, and ***Accident Risk Prediction*** [[Moosavi et al., 2019b]](#cite-moosavi2019accident). ***Environmental Stimuli Analysis*** assesses the environmental conditions (e.g. weather and road conditions) that are correlated with the possibility or severity of traffic accidents. ***Accident Frequency Prediction*** aims to predict the number of traffic accidents for a specific road segment or geographical region. ***Accident Risk Prediction*** is similar to the previous category. However, instead of predicting the number of accidents, it focuses on predicting the possibility of road traffic accidents for real-time applications.

The analysis in this work belongs to the first category, as I sought to identify potential environmental stimuli of road traffic accidents. I used several exploratory data analysis (EDA) tools to investigate heterogeneity in the environmental factors and assessed the impact of environmental stimuli on the severity of accidents in the US using several different machine learning models. The results of my analysis may help policymakers decide whether new regulations are needed for specific roads or weather conditions to reduce the risk of traffic accidents. They may also help drivers decide to avoid certain road segments or to be vigilant under certain environmental conditions.

## Data <a id='Part_2'></a>

The data set employed in this study is a countrywide traffic accident data set (US-Accidents), which covers 49 states of the United States [[Moosavi et al., 2019a]](#cite-moosavi2019countrywide). The data were collected continuously from February 2016 to June 2020 and contain about 3.5 million accident records in total. The data set contains various attributes, including the time, location, severity, and description of accidents, weather conditions, and point-of-interest annotations (e.g. whether there is a stop sign in a nearby location). A summary of all data attributes is shown in [Table 1](#Table_1). Details of the attributes and the data acquisition strategy can be found in [Moosavi et al., 2019a](#cite-moosavi2019countrywide). The data set is available on Kaggle.com (https://www.kaggle.com/sobhanmoosavi/us-accidents).

### <center>Table 1: Attributes in the US-Accidents data set (https://smoosavi.org/datasets/us_accidents) <a id='Table_1'></a></center>

|#|	Attribute|	Description|	Nullable|
| :-: | :-: | :- | :-: |
|1|	ID|	This is a unique identifier of the accident record.|	No|
|2|	Source|	Indicates the source of the accident report (i.e. the API which reported the accident).|	No|
|3|	TMC|	A traffic accident may have a Traffic Message Channel (TMC) code which provides more detailed description of the event.|	Yes|
|4|	Severity|	Shows the severity of the accident, a number between 1 and 4, where 1 indicates the least impact on traffic (i.e., short delay as a result of the accident) and 4 indicates a significant impact on traffic (i.e., long delay).|	No|
|5|	Start_Time|	Shows start time of the accident in local time zone.|	No|
|6|	End_Time|	Shows end time of the accident in local time zone. End time here refers to when the impact of accident on traffic flow was dismissed.|	No|
|7|	Start_Lat|	Shows latitude in GPS coordinate of the start point.|	No|
|8|	Start_Lng|	Shows longitude in GPS coordinate of the start point.|	No|
|9|	End_Lat|	Shows latitude in GPS coordinate of the end point.|	Yes|
|10|	End_Lng|	Shows longitude in GPS coordinate of the end point.|	Yes|
|11|	Distance(mi)|	The length of the road extent affected by the accident.|	No|
|12|	Description|	Shows natural language description of the accident.|	No|
|13|	Number|	Shows the street number in address field.|	Yes|
|14|	Street|	Shows the street name in address field.|	Yes|
|15|	Side|	Shows the relative side of the street (Right/Left) in address field.|	Yes|
|16|	City|	Shows the city in address field.|	Yes|
|17|	County|	Shows the county in address field.|	Yes|
|18|	State|	Shows the state in address field.|	Yes|
|19|	Zipcode|	Shows the zipcode in address field.|	Yes|
|20|	Country|	Shows the country in address field.|	Yes|
|21|	Timezone|	Shows timezone based on the location of the accident (eastern, central, etc.).|	Yes|
|22|	Airport_Code|	Denotes an airport-based weather station which is the closest one to location of the accident.|	Yes|
|23|	Weather_Timestamp|	Shows the time-stamp of weather observation record (in local time).|	Yes|
|24|	Temperature(F)|	Shows the temperature (in Fahrenheit).|	Yes|
|25|	Wind_Chill(F)|	Shows the wind chill (in Fahrenheit).|	Yes|
|26|	Humidity(%)|	Shows the humidity (in percentage).|	Yes|
|27|	Pressure(in)|	Shows the air pressure (in inches).|	Yes|
|28|	Visibility(mi)|	Shows visibility (in miles).|	Yes|
|29|	Wind_Direction|	Shows wind direction.|	Yes|
|30|	Wind_Speed(mph)|	Shows wind speed (in miles per hour).|	Yes|
|31|	Precipitation(in)|	Shows precipitation amount in inches, if there is any.|	Yes|
|32|	Weather_Condition|	Shows the weather condition (rain, snow, thunderstorm, fog, etc.).|	Yes|
|33|	Amenity|	A POI annotation which indicates presence of amenity in a nearby location.|	No|
|34|	Bump|	A POI annotation which indicates presence of speed bump or hump in a nearby location.|	No|
|35|	Crossing|	A POI annotation which indicates presence of crossing in a nearby location.|	No|
|36|	Give_Way|	A POI annotation which indicates presence of give_way in a nearby location.|	No|
|37|	Junction|	A POI annotation which indicates presence of junction in a nearby location.|	No|
|38|	No_Exit|	A POI annotation which indicates presence of no_exit in a nearby location.|	No|
|39|	Railway|	A POI annotation which indicates presence of railway in a nearby location.|	No|
|40|	Roundabout|	A POI annotation which indicates presence of roundabout in a nearby location.|	No|
|41|	Station|	A POI annotation which indicates presence of station in a nearby location.|	No|
|42|	Stop|	A POI annotation which indicates presence of stop in a nearby location.|	No|
|43|	Traffic_Calming|	A POI annotation which indicates presence of traffic_calming in a nearby location.|	No|
|44|	Traffic_Signal|	A POI annotation which indicates presence of traffic_signal in a nearby location.|	No|
|45|	Turning_Loop|	A POI annotation which indicates presence of turning_loop in a nearby location.|	No|
|46|	Sunrise_Sunset|	Shows the period of day (i.e. day or night) based on sunrise/sunset.|	Yes|
|47|	Civil_Twilight|	Shows the period of day (i.e. day or night) based on civil twilight.|	Yes|
|48|	Nautical_Twilight|	Shows the period of day (i.e. day or night) based on nautical twilight.|	Yes|
|49|	Astronomical_Twilight|	Shows the period of day (i.e. day or night) based on astronomical twilight.|	Yes|
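
Because many attributes in Table 1 are nullable, a first look at the data usually involves measuring how much of each column is missing and parsing the timestamps. The following is a minimal sketch of that step, not the notebook's actual code. It uses a tiny synthetic frame with made-up values for a few of the Table 1 columns; the helper name `summarize_missing` and the derived `Hour` column are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for a few US-Accidents columns (all values are made up).
# With the real data, the frame would instead come from pd.read_csv on the
# downloaded Kaggle file.
sample = pd.DataFrame({
    "ID": ["A-1", "A-2", "A-3", "A-4"],
    "Severity": [2, 3, 2, 4],
    "Start_Time": ["2016-02-08 05:46:00", "2016-02-08 06:07:59",
                   "2017-07-01 17:30:00", "2019-12-24 23:15:10"],
    "Temperature(F)": [36.9, np.nan, 88.0, 30.2],
    "Precipitation(in)": [0.02, np.nan, np.nan, 0.0],
    "Traffic_Signal": [False, True, False, False],
})


def summarize_missing(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


# Parse the timestamps so that time-based features (here, the hour) can be derived.
sample["Start_Time"] = pd.to_datetime(sample["Start_Time"])
sample["Hour"] = sample["Start_Time"].dt.hour

missing_share = summarize_missing(sample)
print(missing_share)
```

Columns with a large missing share are candidates for imputation or removal before modeling; which threshold to use is a judgment call that depends on the attribute.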

## Methodology <a id='Part_3'></a>

### Exploratory Data Analysis <a id='Part_3_1'></a>

Exploratory data analysis (EDA) was performed on the US-Accidents data set to reveal the heterogeneity in the data attributes. The distributions of accidents over time, location, and weather conditions were analyzed to gain a comprehensive understanding of the characteristics of the data set. Cross-correlations between attributes were also calculated, both to investigate the relationship between the environmental stimuli and accident severity and to reduce the number of input features: when several attributes are highly correlated, only one of them is needed as an input feature of the models.
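
The correlation-based feature reduction described above can be sketched as follows. This is an illustrative example under stated assumptions, not the notebook's implementation: the function name `drop_highly_correlated`, the 0.9 threshold, and the synthetic weather values are all hypothetical. The idea is to compute the absolute correlation matrix, look only at its upper triangle so each pair is considered once, and drop the later column of every pair whose correlation exceeds the threshold.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.9):
    """Drop one column from each pair whose absolute correlation exceeds threshold.

    Returns the reduced frame and the list of dropped column names.
    """
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears exactly once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Synthetic weather-like features (made-up values): wind chill is constructed
# to track temperature closely, while humidity is independent of both.
rng = np.random.default_rng(0)
temperature = rng.normal(60.0, 15.0, 200)
features = pd.DataFrame({
    "Temperature(F)": temperature,
    "Wind_Chill(F)": temperature - rng.normal(2.0, 0.5, 200),
    "Humidity(%)": rng.uniform(20.0, 100.0, 200),
})

reduced, dropped = drop_highly_correlated(features, threshold=0.9)
print("Dropped:", dropped)
print("Kept:", list(reduced.columns))
```

Which member of a correlated pair is kept depends on column order in this sketch; in practice one might prefer to keep the attribute with fewer missing values.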

### Severity Prediction <a id='Part_3_2'></a>

Multiple machine learning models, including a multivariate logistic model, Random Forest, LightGBM, and XGBoost, were employed to predict the severity of the accidents. The performance of each model was evaluated and compared based on <em>F1-Score</em>, Recall, and Precision. I also fine-tuned the XGBoost model and analyzed the importance of features using the Recursive Feature Elimination method.
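
The evaluation workflow described above might look roughly like the sketch below. It is a simplified illustration, not the notebook's code: it uses scikit-learn only, so a Random Forest stands in for LightGBM and XGBoost, and the data are synthetic four-class samples from `make_classification` rather than the US-Accidents records. The macro averaging of the metrics and the choice of five features for Recursive Feature Elimination are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Synthetic 4-class data; labels 0-3 stand in for severity levels 1-4.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_redundant=2,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Two of the model families named in the text; the gradient-boosting libraries
# are omitted to keep the sketch dependency-free beyond scikit-learn.
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, model.predict(X_test), average="macro", zero_division=0
    )
    scores[name] = {"precision": precision, "recall": recall, "f1": f1}

for name, metrics in scores.items():
    print(name, {key: round(value, 3) for key, value in metrics.items()})

# Recursive Feature Elimination: repeatedly fit the estimator and discard the
# least important feature until the requested number of features remains.
selector = RFE(
    RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=5,
)
selector.fit(X_train, y_train)
print("Selected feature mask:", selector.support_)
print("Feature ranking (1 = kept):", selector.ranking_)
```

Macro averaging weights every severity class equally, which can matter when the classes are imbalanced; a weighted average would instead emphasize the most frequent classes.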

## References

* Moosavi, S., Samavatian, M. H., Parthasarathy, S., & Ramnath, R. (2019a). A Countrywide Traffic Accident Dataset. arXiv preprint arXiv:1906.05409. <a id='cite-moosavi2019countrywide'></a>

* Moosavi, S., Samavatian, M. H., Parthasarathy, S., Teodorescu, R., & Ramnath, R. (2019b, November). Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights. In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (pp. 33-42). <a id='cite-moosavi2019accident'></a>

* World Health Organization. (2018). Global status report on road safety 2018: Summary (No. WHO/NMH/NVI/18.20). World Health Organization. <a id='cite-world2018global'></a> 