# US Accidents Exploratory Data Analysis

In this project, we perform exploratory data analysis on a Kaggle dataset of countrywide car accidents covering 49 states of the USA.
The accident data were collected from February 2016 to December 2020, using multiple APIs that provide streaming traffic incident (or event) data.
There are about 1.5 million accident records in this dataset.

Aim of the analysis:
Authorities can use this analysis to identify car accident hotspots and to take preventive measures that help curb such incidents in the future.

### Step One: Download the dataset and pre-process/clean it using Pandas and Numpy

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
accident_dataset = pd.read_csv("C:\\Users\\Meghna\\Downloads\\archive\\US_Accidents_Dec20_updated.csv")

Steps we'll be performing here are:
1. Look at the information in the dataset and get a basic understanding of the data
2. Fill in any missing values in the dataset and clean it

In [None]:
accident_dataset.head()

In [None]:
accident_dataset.shape
#The dataset has 1.5 million rows and 47 columns

In [None]:
accident_dataset.info()

In [None]:
accident_dataset.describe()

In [None]:
#How many  numeric columns are there in the dataset?
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numeric_dataset = accident_dataset.select_dtypes(include=numerics)
print(numeric_dataset.shape)
numeric_dataset

So, there are 14 columns with numeric values. All the other columns hold categorical, date-time, or other non-numeric values.

In [None]:
#How many NULL values are there in each column?

accident_dataset.isnull().sum().sort_values(ascending=False)

In [None]:
#What is the percentage of NULL values in each column?
missing_val = accident_dataset.isnull().sum().sort_values(ascending=False)/len(accident_dataset) * 100
missing_val[missing_val>0]

In [None]:
#Visualize the number of missing values in our dataset
import missingno as msn
msn.bar(accident_dataset)

Since the Number column has a lot of missing values, we'll drop that column altogether and not use it in our analysis.
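
The notebook does not show the drop itself, so here is a minimal sketch of how it could be done. It runs on a small made-up frame (`toy`) rather than on `accident_dataset`, and the 50% cut-off (`threshold`) is an assumption chosen for illustration, not a value taken from the analysis.

```python
import numpy as np
import pandas as pd

# Toy stand-in for accident_dataset; the values are made up for illustration.
toy = pd.DataFrame({
    "Number": [np.nan, np.nan, 101.0, np.nan],
    "City": ["Miami", "Houston", None, "Dallas"],
    "Severity": [2, 3, 2, 4],
})

# Share of missing values per column, as a percentage.
missing_pct = toy.isnull().mean() * 100

# Hypothetical cut-off: drop any column that is more than 50% empty.
threshold = 50
sparse_columns = missing_pct[missing_pct > threshold].index.tolist()

# drop() returns a new frame, so the original stays untouched.
cleaned = toy.drop(columns=sparse_columns)

print("Dropped columns:", sparse_columns)
print("Remaining columns:", cleaned.columns.tolist())
```

On the real data, the same pattern would be applied to `accident_dataset` in place of `toy`.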

In [None]:
accident_dataset['Weather_Condition'].isnull().sum()
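
Step One mentions filling in missing values, but no cell does so. The sketch below shows one possible approach on a made-up frame: an explicit `"Unknown"` label for the categorical column and the median for the numeric one. Both choices are assumptions, and the column name `Temperature(F)` is assumed as well since it does not appear in the cells above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for two columns of accident_dataset; values are made up.
toy = pd.DataFrame({
    "Weather_Condition": ["Clear", None, "Rain", "Clear", None],
    "Temperature(F)": [70.0, np.nan, 55.0, 61.0, 64.0],
})

filled = toy.copy()

# Categorical column: keep the gap visible with an explicit label
# instead of guessing a weather condition.
filled["Weather_Condition"] = filled["Weather_Condition"].fillna("Unknown")

# Numeric column: the median is less sensitive to extreme readings than the mean.
median_temp = filled["Temperature(F)"].median()
filled["Temperature(F)"] = filled["Temperature(F)"].fillna(median_temp)

print(filled)
print("Remaining missing values:", int(filled.isnull().sum().sum()))
```

Whether to fill, label, or drop missing rows depends on the question being asked, so this is only one of several reasonable options.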

### Step Two: Exploratory analysis and visualization

We'll be analysing and exploring the following columns in our dataset:
1. City: Shows the city in address record.
2. Start Time: Shows start time of the accident in local time zone.
3. Start Lat, Start Long: Show the latitude and longitude (GPS coordinates) of the start point of the accident.
4. Temperature: Shows the temperature (in Fahrenheit).
5. Weather condition: Shows the weather condition (rain, snow, thunderstorm, fog, etc.)

#### Analysing City column

In [None]:
cities = accident_dataset['City']
cities

#### Q. What are the unique cities of USA where accidents have taken place?

In [None]:
unique_cities = cities.unique()
print("The number of unique cities where accidents have taken place in USA: ",len(unique_cities))
unique_cities

#### Q. What is the number of accidents in each city?

In [None]:
unique_city_count = cities.value_counts()
unique_city_count

#### Q. What are the Top 10 Cities of USA with highest number of accidents?

In [None]:
print("The Top 10 Cities of USA with highest number of accidents are:")
unique_city_count[:10]

In [None]:
'New York' in unique_city_count

In [None]:
unique_city_count[:20].plot(kind='barh')

#### Q. What is the percentage of cities having more than 1000 accidents?

In [None]:
#unique_city_count covers the whole dataset period, not a single year
percent_moreThan1000 = len(unique_city_count[unique_city_count>1000]) / len(unique_city_count) * 100
print("The percentage of cities with more than 1000 accidents over the dataset period is: ", percent_moreThan1000, ", which is less than 5%.")
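
The counts above are totals over the whole dataset period rather than per year. If a per-year figure is wanted, one way is to group by city and calendar year first. The sketch below does this on a made-up frame; the threshold is scaled down to 1 for the toy data, and the rule "more than the threshold in at least one year" is an assumption about what "yearly" should mean.

```python
import pandas as pd

# Toy stand-in for the City and Start_Time columns; values are made up.
toy = pd.DataFrame({
    "City": ["Miami", "Miami", "Miami", "Dallas", "Dallas", "Austin"],
    "Start_Time": pd.to_datetime([
        "2019-01-05 08:00", "2019-06-11 17:30", "2020-03-02 09:15",
        "2019-02-20 16:45", "2020-07-04 12:00", "2020-11-30 07:50",
    ]),
})

# Accidents per city per calendar year (rows: city, columns: year).
yearly_counts = (
    toy.assign(Year=toy["Start_Time"].dt.year)
       .groupby(["City", "Year"])
       .size()
       .unstack(fill_value=0)
)

# Hypothetical threshold, scaled down for the toy data (the notebook uses 1000).
threshold = 1

# A city qualifies if it exceeds the threshold in at least one year.
busy = (yearly_counts > threshold).any(axis=1)
percent_busy = busy.mean() * 100

print(yearly_counts)
print("Percentage of cities above the threshold in some year:", percent_busy)
```

Other definitions are possible, for example requiring the threshold to be exceeded in every year (`.all(axis=1)`) or comparing the yearly mean against it.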

In [None]:
high_accident_cities = unique_city_count[unique_city_count>1000]
#distplot is deprecated in recent seaborn releases; histplot is its replacement
sns.histplot(high_accident_cities)

#### Q. What are the cities having only 1 accident?

In [None]:
one_accident_cities = unique_city_count[unique_city_count == 1]
print(len(one_accident_cities))
#Keep the Series as the last expression so the notebook displays it
one_accident_cities

This must be investigated further, as these 1167 cities have had only 1 accident over the period covered by the dataset. The preventive measures taken there should be analysed.

#### Analysing Start Time column

In [None]:
accident_dataset['Start_Time']

In [None]:
accident_dataset['Start_Time'] = pd.to_datetime(accident_dataset['Start_Time'])
accident_dataset['Start_Time']
#We converted object data type to datetime data type

In [None]:
accident_dataset['Start_Time'].dt.hour

#### The distribution plot for the hours in the day during which an accident takes place

In [None]:
#distplot is deprecated in recent seaborn releases; one bin per hour of the day
sns.histplot(accident_dataset['Start_Time'].dt.hour, bins=24)

#### Q. During which hour do the maximum number of accidents take place?

In [None]:
accident_dataset['Start_Time'].dt.hour.value_counts()

#### Q. During which day of the week do the maximum number of accidents take place?

In [None]:
accident_dataset['Start_Time'].dt.dayofweek.value_counts()
#The day of the week with Monday=0, Sunday=6.

In [None]:
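#distplot is deprecated in recent seaborn releases; discrete=True gives one bar per day
sns.histplot(accident_dataset['Start_Time'].dt.dayofweek, discrete=True)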

#### Q. Is the distribution of accidents by hour the same on Sundays as on weekdays?

In [None]:
start_time_dataset = accident_dataset['Start_Time']
start_time_dataset_weekend = start_time_dataset[accident_dataset['Start_Time'].dt.dayofweek == 6]
start_time_dataset_weekend

In [None]:
#distplot is deprecated in recent seaborn releases; one bin per hour of the day
sns.histplot(start_time_dataset_weekend.dt.hour, bins=24)

It's roughly the same, but there are a lot more accidents in the morning on Sundays than on weekdays.
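
Two separate plots are hard to compare by eye, partly because weekdays contribute far more rows than Sundays. One way to put them on the same scale is to convert the hourly counts to shares. The sketch below does this on a handful of made-up timestamps; `hourly_share` is a hypothetical helper introduced here, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for accident_dataset['Start_Time']; values are made up.
start_times = pd.Series(pd.to_datetime([
    "2020-01-06 08:10",  # Monday
    "2020-01-06 17:20",  # Monday
    "2020-01-07 17:45",  # Tuesday
    "2020-01-08 08:30",  # Wednesday
    "2020-01-12 11:05",  # Sunday
    "2020-01-12 11:40",  # Sunday
    "2020-01-12 17:15",  # Sunday
]))

# dayofweek: Monday=0 ... Sunday=6
is_weekday = start_times.dt.dayofweek < 5
is_sunday = start_times.dt.dayofweek == 6


def hourly_share(times):
    """Fraction of accidents falling in each hour of the day (0-23)."""
    counts = times.dt.hour.value_counts().reindex(range(24), fill_value=0)
    return counts / counts.sum()


comparison = pd.DataFrame({
    "weekday": hourly_share(start_times[is_weekday]),
    "sunday": hourly_share(start_times[is_sunday]),
})

# Show only the hours that have at least one accident in either group.
print(comparison[(comparison > 0).any(axis=1)])
```

Because each column sums to 1, the two groups can be compared hour by hour regardless of how many accidents each one contains.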

#### Q. Which US state has the highest number of accidents?

In [None]:
state_wise_counts= accident_dataset['State'].value_counts()[:20]
state_wise_counts

As we can clearly see, the state with the highest number of accidents is California, followed by Florida.

In [None]:
state_wise_counts[:11].plot(kind='barh')

#### Analysing Start Lat and Start Long

In [None]:
accident_dataset['Start_Lat']

In [None]:
accident_dataset['Start_Lng']

In [None]:
sns.scatterplot(x=accident_dataset['Start_Lng'], y=accident_dataset['Start_Lat'])

As we can see, the density of points is higher along the eastern and western coasts than in the middle of the country.

In [None]:
list(zip(accident_dataset['Start_Lat'],accident_dataset['Start_Lng']))[:10]

In [None]:
import folium  #needed for folium.Map() in the next cell
from folium import plugins
from folium.plugins import HeatMap

In [None]:
mapWorld = folium.Map()
HeatMap(zip(accident_dataset['Start_Lat'],accident_dataset['Start_Lng'])).add_to(mapWorld)
mapWorld

### Step Three: Summarizing the inferences and drawing a conclusion

1. The number of unique cities where accidents have taken place in USA: 10658
2. The Top 5 cities of USA with the maximum number of accidents over the dataset period are: Los Angeles, Miami, Charlotte, Houston, Dallas
3. The percentage of cities with more than 1000 accidents over the dataset period is: 2.3552594538800786%
4. 1167 cities of USA have had only one accident over the dataset period!
5. The maximum number of accidents took place at around 4-5 P.M., which may be because most people are travelling home after work, causing a rush hour.
6. Surprisingly, the maximum number of accidents occurred on a Thursday and not on a weekend. This may mean that fewer people travel on weekends.