# US Traffic Accidents Exploratory Data Analysis




## Introduction

This notebook is an exploratory data analysis (EDA), written in **Python**, of the "US Traffic Accidents" dataset. Its purpose is to identify and visualize the data's main characteristics and trends using statistical methods and data visualization techniques.


# Phase 1 Ask

## About the data

The data can be found on [Kaggle](https://www.kaggle.com/sobhanmoosavi/us-accidents). This is a countrywide car accident dataset, which covers **49 states** of the USA. The accident data were collected **from February 2016 to December 2020**, using multiple APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by a variety of entities, such as the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks. Currently, there are about **1.5 million** accident records in this dataset. The Kaggle page linked above has more details about the dataset.

- Note that the dataset does not contain information about Alaska (AK) and Hawaii (HI).


## Objective

The purpose of this analysis is to answer the following questions:

 - Which **States** and **Cities** have the most traffic accidents?
 - Which **time of day**, **day of the week**, and **month** have the highest number of accidents?
 - How many accidents occur per **year**, and is the trend increasing or decreasing?
 - What are the most common weather conditions on the days of the accidents?
 - How many accidents have a severity level of 1, 2, 3 and 4?

# Phase 2 Data Preparation



### Importing libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style("darkgrid")
import folium
from folium import plugins
from folium.plugins import HeatMap
from wordcloud import WordCloud

### Loading the dataset with Pandas

In [None]:
# Reading the dataset
accidents = pd.read_csv('../input/us-accidents/US_Accidents_Dec20_updated.csv')

### Exploring the dataset

In [None]:
accidents

### Column names and types

In [None]:
accidents.info()

### Statistical description of each column

In [None]:
accidents.describe()

### Number of numeric columns

In [None]:
# int, float and boolean data
print(accidents.count(numeric_only=True))
print("Total No. of Numerical Columns:", len(accidents.count(numeric_only=True)))

# Phase 3 Process 

### Checking missing data
#### Percentage of missing data for each column
Because of the high share of missing data, the columns 'Number', 'Precipitation', and 'Wind_Chill(F)' are not used in this analysis.

In [None]:
missing_percentage = accidents.isna().sum().sort_values(ascending=False) / len(accidents)*100
missing_percentage

In [None]:
missing_percentage[missing_percentage != 0].plot(kind='barh')
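
The column exclusion described above could also be automated. The sketch below is one possible approach, not the notebook's own method: `drop_sparse_columns` and the 40% cutoff are hypothetical choices made for illustration, and the toy frame (with made-up values) stands in for `accidents`.

```python
import pandas as pd

def drop_sparse_columns(df, max_missing_pct=40.0):
    """Return a copy of df without columns whose NaN share exceeds max_missing_pct."""
    # isna().mean() gives the fraction of missing values per column
    missing_pct = df.isna().mean() * 100
    keep = missing_pct[missing_pct <= max_missing_pct].index
    return df[keep].copy()

# Toy frame standing in for `accidents` (values are invented for illustration)
toy = pd.DataFrame({
    "ID": ["A-1", "A-2", "A-3", "A-4"],
    "Number": [None, None, None, 12.0],           # 75% missing
    "Temperature(F)": [60.0, None, 71.0, 55.0],   # 25% missing
})

cleaned = drop_sparse_columns(toy, max_missing_pct=40.0)
print(list(cleaned.columns))  # ['ID', 'Temperature(F)']
```

The cutoff is a judgment call; a stricter or looser value may suit other analyses better.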

### Checking for duplicates

In [None]:
accidents['ID'].duplicated().any()

# Phase 4 Exploratory Data Analysis

## Analysing the number of accidents by location

### State
#### Number of states in the dataset

- Remember that the dataset does not contain information about Alaska (AK) and Hawaii (HI).
- The total number of states in the dataset is 49.


In [None]:
#Distinct states in the dataset
states = accidents.State.unique()
states

In [None]:
#How many states in the dataset
len(states)

#### States with the most number of accidents

In [None]:
accidents_by_state = accidents.State.value_counts()
accidents_by_state

In [None]:
fig, ax = plt.subplots(figsize = (20,5))
state_plot = sns.countplot(x=accidents.State, data=accidents, order=accidents.State.value_counts().iloc[:49].index, orient = 'v', palette = "viridis")
state_plot.set_title("No. of Accidents by State")
state_plot

- California (CA) is the most populated state, followed by Texas (TX) and Florida (FL); all three are also among the top 5 states by number of accidents.
- Oregon (OR) has the 3rd highest number of accidents, although it is only the 27th most populated state in the US.

### Cities

#### Number of cities in the dataset
- The dataset has 10658 distinct cities



In [None]:
cities = accidents.City.unique()
len(cities)

#### Accidents per city

In [None]:
#How many accidents by city
accidents_by_city = accidents.City.value_counts()
accidents_by_city

#### Top 20 cities with the most number of accidents

In [None]:
#top 20 cities with the most number of accidents
accidents_by_city[:20]

In [None]:
fig, ax = plt.subplots(figsize = (20,5))
city_plot = sns.countplot(x=accidents.City, data=accidents, order=accidents.City.value_counts().iloc[:50].index, orient = 'v', palette = "crest")
city_plot.set_title("No. of Accidents by City - Top 50 cities")
city_plot.set_xticklabels(city_plot.get_xticklabels(), rotation=90)
city_plot

#### Percentage of cities with more and less than 1000 accidents
- Only 2.35% of the cities have 1000 or more accidents

In [None]:
# Calculating the number of cities with more and less than 1000 accidents
high_accident_city = accidents_by_city[accidents_by_city >=1000]
low_accident_city = accidents_by_city[accidents_by_city <1000]

In [None]:
# Percentage of cities with 1000 or more accidents
len(high_accident_city) / len(cities)*100

In [None]:
# Percentage of cities with fewer than 1000 accidents
len(low_accident_city) / len(cities)*100
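
One detail worth checking in the two percentages above: `unique()` keeps a missing city name as a distinct value, while `value_counts()` drops it, so if `City` contains NaN the two shares may not add up to exactly 100. The sketch below uses a single denominator instead. `city_share_by_threshold` is a hypothetical helper, and the toy series is invented for illustration.

```python
import pandas as pd

def city_share_by_threshold(city, threshold=1000):
    """Return (pct of cities with >= threshold accidents, pct with fewer)."""
    counts = city.value_counts()  # NaN city names are excluded by default
    high_pct = (counts >= threshold).mean() * 100
    return high_pct, 100 - high_pct

# Toy data standing in for accidents.City; one record has no city name
toy_city = pd.Series(["Houston"] * 3 + ["Miami"] * 2 + ["Austin", "Salem", None])

high, low = city_share_by_threshold(toy_city, threshold=2)
print(high, low)  # 50.0 50.0
```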

### Distribution on a map

#### Creating a dataset sample of 10%

In [None]:
sample_accidents = accidents.sample(int(0.1 * len(accidents)))
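
Because `sample` draws rows at random, the map changes on every run. If a repeatable sample is wanted, `DataFrame.sample` accepts `frac` and `random_state`. The snippet below is a small sketch on a toy frame; the seed value 42 is arbitrary.

```python
import pandas as pd

# Toy coordinates standing in for the accidents data
toy = pd.DataFrame({
    "Start_Lat": [30.0 + i * 0.5 for i in range(20)],
    "Start_Lng": [-100.0 + i * 0.5 for i in range(20)],
})

# frac=0.1 takes 10% of the rows; a fixed random_state makes the draw repeatable
sample_a = toy.sample(frac=0.1, random_state=42)
sample_b = toy.sample(frac=0.1, random_state=42)

print(len(sample_a))              # 2
print(sample_a.equals(sample_b))  # True
```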

In [None]:
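# Named accident_map to avoid shadowing Python's built-in map()
accident_map = folium.Map(location = [40, -102], zoom_start = 4)
# HeatMap takes a list of [lat, lng] points, so the zip is materialized into a list
folium.plugins.HeatMap(list(zip(sample_accidents.Start_Lat, sample_accidents.Start_Lng)), scale_radius = False, radius = 12).add_to(accident_map)
accident_map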

- There are fewer accidents in the central regions, which are also the least populated.
- Both coasts have a higher number of accidents.

## Analysing the Timestamp of the accidents 

### Time

#### Checking the Start_Time column

In [None]:
accidents.Start_Time

#### Converting Start_Time column to a 'datetime' format

In [None]:
accidents.Start_Time = pd.to_datetime(accidents.Start_Time)
accidents.Start_Time[0]
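
If some timestamps in the file are malformed, `pd.to_datetime` raises an error by default. A defensive variant is sketched below, assuming the timestamps follow the `%Y-%m-%d %H:%M:%S` pattern (an assumption, the real file may differ): `errors="coerce"` turns unparseable values into `NaT` so they can be counted and inspected.

```python
import pandas as pd

# Toy strings standing in for accidents.Start_Time; the last one is deliberately invalid
raw = pd.Series(["2016-02-08 05:46:00", "2020-12-31 23:59:59", "not a timestamp"])

# The format string is an assumption about the file; unparseable values become NaT
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

print(parsed.isna().sum())  # 1
print(parsed.iloc[0])       # 2016-02-08 05:46:00
```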

#### Number of accidents per hour of the day

In [None]:
fig, ax = plt.subplots(figsize = (10,5))
hour_plot = sns.countplot(x=accidents.Start_Time.dt.hour, data=accidents, orient = 'v', palette = "crest")
hour_plot.set_title("No. of Accidents by Hour")
hour_plot

#### Number of accidents by day of the week

In [None]:
dayofweek_plot = sns.countplot(x=accidents.Start_Time.dt.dayofweek, data=accidents, orient = 'v', palette = "crest")
dayofweek_plot.set_title("No. of Accidents by day of the week")
dayofweek_plot
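
In pandas, `dt.dayofweek` numbers the days from 0 (Monday) to 6 (Sunday), which is easy to misread on the x-axis. The sketch below, on a few invented timestamps, shows one way to count by day name instead while keeping calendar order.

```python
import pandas as pd

# Toy timestamps: one Sunday, two Mondays, one Thursday
start = pd.to_datetime(pd.Series([
    "2020-12-06 10:00:00",
    "2020-12-07 08:30:00",
    "2020-12-07 17:15:00",
    "2020-12-10 09:00:00",
]))

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Count accidents per day name; reindex keeps calendar order and fills empty days with 0
by_day = start.dt.day_name().value_counts().reindex(day_order, fill_value=0)
print(by_day)
```

The resulting index could be passed as `order=day_order` to `sns.countplot` when plotting the day names.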

#### Number of accidents per hour on Sundays
- Sunday is the day of the week with the lowest number of accidents

In [None]:
fig, ax = plt.subplots(figsize = (10,5))
# dayofweek == 6 selects Sundays (Monday is 0)
sundays_start_time = accidents.Start_Time[accidents.Start_Time.dt.dayofweek == 6]
dayofweek_plot = sns.countplot(x=sundays_start_time.dt.hour, data=accidents, orient = 'v', palette = "crest")
dayofweek_plot.set_title("No. of Accidents per hour on Sundays")
dayofweek_plot

#### Number of accidents per hour on Thursday
- Thursday is used here as a representative weekday for comparison with Sunday

In [None]:
fig, ax = plt.subplots(figsize = (10,5))
# dayofweek == 3 selects Thursdays (Monday is 0)
thursdays_start_time = accidents.Start_Time[accidents.Start_Time.dt.dayofweek == 3]
dayofweek_plot = sns.countplot(x=thursdays_start_time.dt.hour, data=accidents, orient = 'v', palette = "crest")
dayofweek_plot.set_title("No. of Accidents per hour on Thursdays")
dayofweek_plot

#### Number of accidents by Month

In [None]:
fig, ax = plt.subplots(figsize = (10,5))
month_plot = sns.countplot(x=accidents.Start_Time.dt.month, data=accidents, orient = 'v', palette = "crest")
month_plot.set_title("No. of Accidents by Month")
month_plot

#### Number of accidents by Year

In [None]:
# Number of accidents by year
year_plot = sns.countplot(x=accidents.Start_Time.dt.year, data=accidents, orient = 'v', palette = "crest")
year_plot.set_title("No. of Accidents by Year")
year_plot
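
To put a number on the yearly trend, the counts per year can be compared with `pct_change`. This is only a sketch on invented timestamps. Keep in mind that the real data starts in February 2016, so the first year is incomplete and its change figure should be read with care.

```python
import pandas as pd

# Toy timestamps: 2 accidents in 2016, 3 in 2017, 6 in 2018
start_time = pd.to_datetime(pd.Series([
    "2016-02-08 05:46:00", "2016-11-30 17:10:00",
    "2017-01-15 08:00:00", "2017-06-01 12:30:00", "2017-12-24 18:45:00",
    "2018-03-03 07:20:00", "2018-04-10 16:00:00", "2018-07-19 08:05:00",
    "2018-09-01 13:40:00", "2018-10-31 17:55:00", "2018-12-12 06:30:00",
]))

# Accidents per year in chronological order
per_year = start_time.dt.year.value_counts().sort_index()

# Year-over-year change in percent; the first year has no predecessor, so it is NaN
yoy_change_pct = per_year.pct_change() * 100
print(per_year)
print(yoy_change_pct)
```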

## Analyzing the Weather impact on accidents

### Weather conditions

#### Weather condition with the most accidents

In [None]:
#Top 50 weather conditions with the most accidents
weather = accidents.Weather_Condition.value_counts()
weather[:50]

In [None]:
fig, ax = plt.subplots(figsize = (20,5))
wc_plot = sns.countplot(x=accidents.Weather_Condition, data=accidents,order=accidents.Weather_Condition.value_counts().iloc[:20].index, orient = 'v', palette = "crest")
wc_plot.set_title("No. of Accidents by Weather Condition")
wc_plot.set_xticklabels(wc_plot.get_xticklabels(), rotation=90)
wc_plot

In [None]:
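# The 'seaborn' style was renamed 'seaborn-v0_8' in Matplotlib 3.6; use whichever name is available
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')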
plt.figure(figsize=(20,5))
weather_words = accidents['Weather_Condition'].value_counts().to_dict()
wc = WordCloud(scale=5, max_words=100,background_color ='white').generate_from_frequencies(weather_words)
plt.imshow(wc)
plt.axis('off')
plt.title('Weather condition', color='b')
plt.show()

### Temperature

#### Most common temperatures on the days of accidents

In [None]:
temperature = accidents['Temperature(F)'].value_counts()
temperature[:20]

In [None]:
fig, ax = plt.subplots(figsize = (20,5))
temp_plot = sns.countplot(x=accidents['Temperature(F)'], data=accidents,order=accidents['Temperature(F)'].value_counts().iloc[:50].index, orient = 'v', palette = "crest")
temp_plot.set_title("No. of Accidents by Temperature")
temp_plot.set_xticklabels(temp_plot.get_xticklabels(), rotation=90)
temp_plot
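
Counting exact temperature values spreads the data over many narrow bars. Grouping the readings into wider ranges can make the pattern easier to see. The sketch below bins invented readings into 10°F ranges and includes a Fahrenheit to Celsius helper; `fahrenheit_to_celsius` and the bin width are illustrative choices, not part of the original analysis.

```python
import pandas as pd

def fahrenheit_to_celsius(temp_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9

# Toy readings standing in for accidents['Temperature(F)']; None marks a missing value
temps = pd.Series([48.0, 52.0, 59.0, 61.0, 68.0, 73.0, 75.0, 90.0, None])

# Lower edge of each 10°F bin, e.g. 52 -> 50 and 75 -> 70; missing values are dropped first
bin_start = (temps.dropna() // 10 * 10).astype(int)
per_bin = bin_start.value_counts().sort_index()
print(per_bin)

print(fahrenheit_to_celsius(50))            # 10.0
print(round(fahrenheit_to_celsius(75), 1))  # 23.9
```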

## Analyzing the level of severity of the accidents

### Severity
- Shows the severity of the accident, a number between 1 and 4, where 1 indicates the least impact on traffic (i.e., short delay as a result of the accident) and 4 indicates a significant impact on traffic (i.e., long delay).

In [None]:
severity = accidents.Severity.value_counts()/ len(accidents.Severity)*100
severity

In [None]:
severity.plot.pie(subplots=True,  figsize=(20, 10))
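
The same percentages can be obtained with `value_counts(normalize=True)`, which returns shares directly. The sketch below uses invented severity values and sorts the result by level, so the order follows the scale from 1 to 4 rather than frequency.

```python
import pandas as pd

# Toy severity levels standing in for accidents.Severity
toy_severity = pd.Series([2, 2, 2, 3, 2, 4, 1, 2, 3, 2])

# normalize=True returns fractions; multiply by 100 for percentages, then sort by level
severity_pct = toy_severity.value_counts(normalize=True).sort_index() * 100
print(severity_pct.round(1))
```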

# Phase 5 Findings

#### About the location
- The dataset does not contain information about Alaska (AK) and Hawaii (HI).
- California (CA) is the most populated state, followed by Texas (TX) and Florida (FL); all three are also among the top 5 states by number of accidents.
- Oregon (OR) has the 3rd highest number of accidents, although it is only the 27th most populated state in the US.
- Only 2.35% of the cities have 1000 or more accidents
- 1167 cities reported just 1 accident (needs further investigation)
- The number of accidents per city decreases exponentially
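
The per-city figures in this list can be checked from the `value_counts` result. The sketch below, on invented city names, counts the cities with exactly one accident and measures how concentrated the accidents are in the top cities; the variable names and the choice of the top 2 are illustrative.

```python
import pandas as pd

# Toy data standing in for accidents.City
toy_city = pd.Series(["Houston"] * 5 + ["Miami"] * 3 + ["Austin", "Salem"])

counts = toy_city.value_counts()  # sorted from most to fewest accidents

# Number of cities that reported exactly one accident
single_accident_cities = int((counts == 1).sum())

# Share of all accidents that happened in the 2 cities with the most accidents
top2_share_pct = counts.head(2).sum() / counts.sum() * 100

print(single_accident_cities)  # 2
print(top2_share_pct)          # 80.0
```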

#### About the Timestamp
- In the mornings, accidents start to increase at 5 am and peak at 8 am.
- In the afternoon, accidents start to increase at 1 pm and peak at 5 pm.
- The number of accidents is lower on weekends.
- On weekends, the hourly distribution of accidents differs from weekdays: it increases at 12 am and starts to decrease at 1 am.
- The number of accidents is higher in October, November and December. December is the month with the most accidents.
- Note that the dataset starts in February 2016, so January 2016 is missing.
- 2020 has a significantly higher number of accidents. This needs further investigation.

#### About the Weather
- Interestingly, most accidents happen on days with Fair weather, followed by Mostly Cloudy days.
- Most accidents happen on days with temperatures between 50°F and 75°F (10°C and 24°C)

#### About the Severity
- Most accidents (79.96%) have a severity level of 2, which means a lower impact on traffic. Level 4, the highest impact on traffic, comes in third place with 7.5%.

### Acknowledgements
Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. **“A Countrywide Traffic Accident Dataset.”**, 2019.

Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. **"Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights."** In proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.