# Car Accidents in the Contiguous United States and Determining Ideal Driving Conditions
## A data analysis and visualization by Montel Hardy

### The following is the cleaning, analysis, and visualization of a dataset covering nearly four years of car accident data in the United States. The dataset features about three million accidents spanning 48 states from February 2016 to December 2019, collected from Bing and MapQuest. Below are some of the insights from the data I've visualized; many also appear next to their visualizations.

* Three times as many accidents occurred during the day as at night.
* Daytime driving during inclement weather is the scenario that holds the highest risk of accidents for drivers.
* Severe accidents aren't occurring more frequently at night than during the day, but reduced visibility makes them more likely.
* For more severe accidents (rating of 3 or 4), there is a strong correlation between reduced visibility and accident severity.
* Aside from California, states with large populations don't necessarily have a large number of accidents. Part of this could be due to the accessibility of public transportation in some states.


## References & Disclaimer


#### Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. “A Countrywide Traffic Accident Dataset.” arXiv preprint arXiv:1906.05409 (2019).

#### Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. “Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights.” In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.


#### This dataset is being distributed only for research purposes, under the Creative Commons Attribution-Noncommercial-ShareAlike license (https://creativecommons.org/licenses/by-nc-sa/4.0/). You may cite the above papers if you use this dataset.

#### The dataset can be found here: https://osu.app.box.com/v/us-accidents-dec19


# Getting Started

###  First, I'll confirm that I'm running Python and import the proper libraries. Next, I will read in the csv file.

In [None]:
import sys
print(sys.version)  # confirm which Python version is running

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import geopandas as gpd
import plotly.graph_objects as go
import matplotlib.ticker as ticker



print ('All Set!')

In [None]:
import folium

print('Folium installed and imported!')

In [None]:
df=pd.read_csv('../input/2019-accident-data/US_Accidents_Dec19.csv')

us_accidents = df
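### A note on the assignment above: `us_accidents = df` gives the same DataFrame a second name rather than making a copy, so the in-place cleaning steps later in this notebook change `df` as well. The sketch below illustrates this on a tiny made-up frame (the values and the names `toy_df`, `alias`, and `independent` are hypothetical, not part of the dataset).

```python
import pandas as pd

# Toy stand-in for the accident data (hypothetical values, not from the dataset).
toy_df = pd.DataFrame({"State": ["CA", "TX", "FL"], "TMC": [201.0, None, 241.0]})

alias = toy_df               # same object under a second name, like `us_accidents = df`
independent = toy_df.copy()  # a separate copy that later edits will not touch

# An in-place drop through the alias also changes `toy_df`.
alias.drop(["TMC"], axis=1, inplace=True)

print(alias is toy_df)            # one object, two names
print(list(toy_df.columns))       # the column is gone here too
print(list(independent.columns))  # the copy still has both columns
```

### If the raw data should stay untouched for later comparison, `df.copy()` is one way to keep the two apart, at the cost of extra memory for a dataset this size.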

# Data Cleaning

### First, let's inspect the dataset. Using the shape attribute and the head() and isnull() methods, we can get a better look at our dataset and identify how best to prepare the data for analysis. In the cell below, we discover that the dataset has 49 columns and just under three million rows.

In [None]:
us_accidents.shape

### We can see a preview of the aforementioned rows and columns below. 

In [None]:
us_accidents.head()

### Calling isnull().sum() returns the number of rows with missing data for each column. As you can see below, a good number of columns have missing data. Rows that have no value or an unintelligible value in the columns we need will be deleted from the dataset.

### We'll also delete a few columns that we won't be using. This process will give us a dataset with no missing values and a few fewer columns, and it will still leave over two million accidents for analysis and visualization.

In [None]:
us_accidents.isnull().sum()
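### Raw missing counts can be hard to compare across columns, so it may help to view them as a percentage of all rows. The sketch below does this on a small made-up frame; the values are hypothetical, and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; only the column names come from the dataset.
toy = pd.DataFrame({
    "Severity": [2, 3, 2, 4],
    "Visibility(mi)": [10.0, np.nan, 5.0, 10.0],
    "Precipitation(in)": [np.nan, np.nan, np.nan, 0.1],
})

# Count missing values per column, then express each count as a share of all rows.
missing = toy.isnull().sum().to_frame("Missing")
missing["Percent"] = (missing["Missing"] / len(toy) * 100).round(1)

# Put the columns with the most missing data first.
missing = missing.sort_values("Percent", ascending=False)
print(missing)
```

### A column that is mostly empty is a candidate for dropping entirely, while a column with only a few gaps is usually better handled by dropping the affected rows.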

### Below is where I remove the columns from the dataframe.

In [None]:
us_accidents.drop(['Astronomical_Twilight','Nautical_Twilight','Civil_Twilight','TMC','End_Lng','End_Lat','Number','Wind_Chill(F)',
                  'Precipitation(in)'], axis=1,inplace=True)

In [None]:
us_accidents.dropna(subset=["Sunrise_Sunset","Description","Zipcode","Timezone",
                            "Airport_Code","Weather_Timestamp","Temperature(F)","Humidity(%)","Pressure(in)","Visibility(mi)",
                           "Wind_Direction","Weather_Condition"], axis=0, how= 'any',inplace= True)

In [None]:
us_accidents.drop(['Wind_Speed(mph)'], axis=1,inplace=True)

### After dropping the necessary columns and rows, our data no longer has missing values. We'll use the shape attribute to confirm that we still have a very large amount of clean data to work with. With that confirmed, we can move on to the next part of this project.

In [None]:
us_accidents.isnull().sum()

In [None]:
us_accidents.shape
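### As an alternative to the in-place calls used above, the same drop-then-dropna pattern can be chained so that the original frame is left unchanged and the row counts before and after are easy to compare. This is only a sketch on made-up data: the values are hypothetical and the column lists are shortened for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; only the column names come from the dataset.
toy = pd.DataFrame({
    "State": ["CA", "TX", "FL", "NY"],
    "TMC": [201.0, np.nan, 241.0, np.nan],
    "Visibility(mi)": [10.0, 2.0, np.nan, 7.0],
    "Weather_Condition": ["Clear", None, "Rain", "Fog"],
})

rows_before = len(toy)

# Drop an unused column, then drop rows missing any of the required fields.
# Without inplace=True, `toy` itself is not modified.
cleaned = (
    toy.drop(columns=["TMC"])
       .dropna(subset=["Visibility(mi)", "Weather_Condition"], how="any")
)

print(rows_before, len(cleaned))          # rows before and after cleaning
print(int(cleaned.isnull().sum().sum()))  # total missing values left
```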

# Data Visualization

### For our first visualization we'll use the US map to visualize the accident data across the 48 states. In order to do this, we have to read in a world map file and download a GeoJSON file.

In [None]:
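# Note: gpd.datasets was deprecated in GeoPandas 0.14 and removed in 1.0, so this call needs an older GeoPandas release.
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))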

In [None]:
!wget --quiet https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DV0101EN/labs/Data_Files/world_countries.json -O world_countries.json
    
print('GeoJSON file downloaded!')

In [None]:
# Series.value_counts() replaces the deprecated top-level pd.value_counts()
state_count = df['State'].value_counts()

fig = go.Figure(data=go.Choropleth(
    locations=state_count.index,
    z = state_count.values.astype(float),  
    locationmode = 'USA-states',     
    colorscale = 'plasma',
    colorbar_title = "Count",
))

fig.update_layout(
    title_text = 'United States Accidents Visualization (2016-2019)',
    geo_scope='usa', 
)

fig.show()



### The above map shows us that California, Texas, and Florida have the largest number of car accidents. After those three, the totals of the remaining high-accident states level off.

### To get a better look at states with high car accident numbers, I created a top ten list. I used the groupby() function to gather these numbers in a dataframe, and I varied the colors to make the chart more visually appealing.

In [None]:


df_top = df.groupby('State').size().to_frame('Counts')
df_top = df_top.reset_index().sort_values('Counts', ascending = False)[:10]
df_top = df_top[::-1]   

colors = ['cyan', 'gold', 'coral', 'dodgerblue',
     'palevioletred', 'peru', 'lightblue', 'lightsalmon','crimson', 'lawngreen',]

fig, ax=plt.subplots(figsize=(15,8))
ax.barh(df_top['State'], df_top['Counts'], color = colors)

for i, (value, name) in enumerate(zip(df_top['Counts'], df_top['State'])):
        ax.text(value, i,     name,           size=14, weight=600, ha='right', va='bottom')
        ax.text(value, i-.25,     f'{value:,.0f}',  size=14, ha='left',  va='center')
        
ax.text(0, 1.06, 'by State', transform=ax.transAxes, size=12, color='#777777')
ax.xaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'))
ax.xaxis.set_ticks_position('top')
ax.tick_params(axis='x', colors='#777777', labelsize=12)
ax.set_yticks([])
ax.margins(0, 0.01)
ax.grid(which='major', axis='x', linestyle='-')
ax.set_axisbelow(True)
ax.text(0, 1.12, 'Top 10 States with the Highest Number of Accidents',
            transform=ax.transAxes, size=24, weight=600, ha='left')
plt.box(False)
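### The groupby, sort, and slice steps used for the top ten list can also be written more compactly with value_counts(), which already sorts from most to least frequent. A minimal sketch on made-up data (the state list is hypothetical, and a top two stands in for the top ten):

```python
import pandas as pd

# Toy frame with a hypothetical handful of accidents per state.
toy = pd.DataFrame({"State": ["CA", "CA", "CA", "TX", "TX", "FL", "NY"]})

# value_counts() sorts in descending order, so head(n) gives the top n states.
top_states = toy["State"].value_counts().head(2)
print(top_states)

# barh draws the first row at the bottom, so reverse the order
# to place the largest bar at the top of the chart.
top_for_barh = top_states.iloc[::-1]
print(list(top_for_barh.index))
```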



In [None]:
us_accidents.columns = list(map(str, us_accidents.columns))

In [None]:
us_accidents['Sunrise_Sunset'].unique()


### Next, we explore how the frequency of accidents varies by part of day (day or night). I use the groupby() function to gather the data, then visualize it with a bar chart.

In [None]:
us_accidents.groupby('Sunrise_Sunset').size()


In [None]:

df.groupby('Sunrise_Sunset').size().plot(kind = 'barh', 
                                  color= 'lawngreen',
                                 align = 'center',
                                edgecolor = 'b',
                                 linewidth = 0.9,
                                 width = 0.3,
                                xerr=np.std(df.groupby('Sunrise_Sunset').size()),
                                 grid = True,figsize=(10, 6));


plt.title('Accidents by Time of Day', fontsize=20)
plt.xlabel('Accidents (Millions)', fontsize=16)
plt.ylabel('Time of Day', fontsize=16)


### These findings show us that three times as many accidents happen during the day. A limitation of the dataset we're using is that it doesn't take note of the total number of drivers on the road, but a safe assumption could be that there are a significantly larger number of drivers on the road during the day.  
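### The day-to-night comparison can be read straight off the grouped counts. The sketch below shows one way to compute the ratio and each daypart's share; it runs on made-up data chosen to give a 3:1 split, so the printed numbers illustrate the calculation and are not results from the dataset.

```python
import pandas as pd

# Toy frame: six hypothetical daytime accidents and two at night.
toy = pd.DataFrame({
    "Sunrise_Sunset": ["Day", "Day", "Day", "Night", "Day", "Night", "Day", "Day"]
})

counts = toy.groupby("Sunrise_Sunset").size()

# Share of all accidents in each daypart, and the day-to-night ratio.
share = counts / counts.sum()
ratio = counts["Day"] / counts["Night"]

print(share)
print(f"Day-to-night ratio: {ratio:.1f}")
```

### Because the dataset has no count of drivers on the road, this ratio describes accident totals only and not the risk per driver.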

### In the scatter plots below, we shift gears to analyze the impact visibility has on accident severity and frequency.

In [None]:
df_sev_day= us_accidents[['Severity','Sunrise_Sunset']]

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))
ax.set_title('Accident Severity by Visibility', fontsize=18)
ax.plot(df['Severity'], df['Visibility(mi)'], 'ko');
plt.xlabel("Severity")
plt.ylabel('Visibility (mi)')

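fig, ax = plt.subplots(figsize=(10, 4))
ax.set_title('Visibility by daypart', fontsize=18)
# Use the 'o' marker only; the original 'ko' format string also set black, which conflicted with color='red'
ax.plot(df['Visibility(mi)'], df['Sunrise_Sunset'], 'o', color='red');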
plt.xlabel("Visibility (mi)")
plt.ylabel('Daypart')

### The distribution in the first scatter plot shows us that most accidents in the dataset have a severity of either two or three. The next takeaway is that more serious accidents occur more frequently when drivers have 80 miles or less of visibility.

### The second plot shows that visibility is a big part of what makes night driving more dangerous, but reduced visibility can create an even more dangerous scenario for daytime drivers during inclement weather. According to our data, a large number of drivers on the road with reduced visibility due to inclement weather creates a high likelihood of accidents.
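### Scatter plots with millions of overlapping points can hide how the values are distributed, so a small summary table may be a useful companion. The sketch below summarizes visibility by severity and the share of low-visibility accidents by daypart. It runs on made-up data, and the 5-mile cutoff for "low visibility" is an assumption for illustration, not a value taken from the dataset or the analysis above.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names come from the dataset.
toy = pd.DataFrame({
    "Severity": [2, 2, 3, 3, 4, 4],
    "Visibility(mi)": [10.0, 8.0, 6.0, 4.0, 2.0, 1.0],
    "Sunrise_Sunset": ["Day", "Night", "Day", "Night", "Day", "Night"],
})

# Typical visibility at each severity level.
vis_by_severity = toy.groupby("Severity")["Visibility(mi)"].agg(["mean", "median", "count"])
print(vis_by_severity)

# Share of accidents in each daypart that happened under low visibility.
LOW_VIS_MILES = 5  # assumed cutoff, for illustration only
low_vis = toy["Visibility(mi)"] < LOW_VIS_MILES
low_vis_share = low_vis.groupby(toy["Sunrise_Sunset"]).mean()
print(low_vis_share)
```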

# Conclusion


### This notebook featured data cleaning, analysis, and visualization of a car accident dataset with well over two million entries. This was a fun, timely dataset with some insights that may be worth sharing with a friend, coworker, or family member.


### Thanks for reading and be safe!

### - Montel N. Hardy
