# U.S. Traffic Casualty Analysis - Data Wrangling & EDA

Traffic fatalities are a significant issue in the United States. This project uses data science to identify factors contributing to traffic accidents, with the goal of developing targeted interventions to reduce the number of fatalities. It draws on data from 2016-2021.

According to the National Highway Traffic Safety Administration (NHTSA), the number of traffic fatalities in the United States from 2016 to 2021 is as follows:
* 2016: 37,806 
* 2017: 37,133 
* 2018: 36,835 
* 2019: 36,096 
* 2020: 38,680 
* 2021: 42,915 

Fatalities rose in 2020 (from 36,096 to 38,680) even though the COVID-19 pandemic reduced the number of vehicles on the road. The reasons are complex and multifactorial, but contributing factors include an increase in risky behaviors such as speeding and distracted driving, as well as increased alcohol and drug use.
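
To put the yearly totals in perspective, the year-over-year change can be computed directly from the figures listed above. The snippet below is a minimal sketch that re-enters the six NHTSA totals by hand; the names `fatalities` and `yoy_pct` are illustrative and not part of the original notebook.

```python
import pandas as pd

# NHTSA totals exactly as listed above (year -> fatalities)
fatalities = pd.Series(
    {2016: 37806, 2017: 37133, 2018: 36835, 2019: 36096, 2020: 38680, 2021: 42915},
    name="fatalities",
)

# Percentage change relative to the previous year (2016 has no prior year, so NaN)
yoy_pct = fatalities.pct_change() * 100

print(yoy_pct.round(2))
```

With these figures, 2020 and 2021 are the only years in the series with a positive change.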

## 1.1 Imports

In [2]:
# Import pandas, matplotlib.pyplot, seaborn, and os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

**Data** \
accidents: a countrywide car accident dataset covering 49 U.S. states. The accident data were collected from February 2016 to December 2021 using multiple APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by a variety of entities, such as the U.S. and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks. The dataset currently contains about 2.8 million accident records.

weather: a countrywide weather events dataset that includes 7.5 million events and covers 49 U.S. states. Examples of weather events are rain, snow, storms, and freezing conditions. Some of the events are extreme (e.g. storms) and some could be regarded as regular (e.g. rain and snow). The data were collected from January 2016 to December 2021 using historical weather reports from 2,071 airport-based weather stations across the nation.

constructions: a countrywide dataset of road construction and closure events covering 49 U.S. states. Construction events in this dataset can be any roadwork, ranging from pavement repairs to substantial projects that take months to finish. The data were collected from January 2016 to December 2021.

In [6]:
accidents = pd.read_csv('accidents.csv')
weather = pd.read_csv('weather.csv')
constructions = pd.read_csv('constructions.csv')

In [8]:
accidents.head()

Unnamed: 0,ID,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,End_Lat,End_Lng,Distance(mi),Description,...,Roundabout,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Sunrise_Sunset,Civil_Twilight,Nautical_Twilight,Astronomical_Twilight
0,A-1,3,2016-02-08 00:37:08,2016-02-08 06:37:08,40.10891,-83.09286,40.11206,-83.03187,3.23,Between Sawmill Rd/Exit 20 and OH-315/Olentang...,...,False,False,False,False,False,False,Night,Night,Night,Night
1,A-2,2,2016-02-08 05:56:20,2016-02-08 11:56:20,39.86542,-84.0628,39.86501,-84.04873,0.747,At OH-4/OH-235/Exit 41 - Accident.,...,False,False,False,False,False,False,Night,Night,Night,Night
2,A-3,2,2016-02-08 06:15:39,2016-02-08 12:15:39,39.10266,-84.52468,39.10209,-84.52396,0.055,At I-71/US-50/Exit 1 - Accident.,...,False,False,False,False,False,False,Night,Night,Night,Day
3,A-4,2,2016-02-08 06:51:45,2016-02-08 12:51:45,41.06213,-81.53784,41.06217,-81.53547,0.123,At Dart Ave/Exit 21 - Accident.,...,False,False,False,False,False,False,Night,Night,Day,Day
4,A-5,3,2016-02-08 07:53:43,2016-02-08 13:53:43,39.172393,-84.492792,39.170476,-84.501798,0.5,At Mitchell Ave/Exit 6 - Accident.,...,False,False,False,False,False,False,Day,Day,Day,Day


In [9]:
weather.head()

Unnamed: 0,EventId,Type,Severity,StartTime(UTC),EndTime(UTC),Precipitation(in),TimeZone,AirportCode,LocationLat,LocationLng,City,County,State,ZipCode
0,W-1,Snow,Light,2016-01-06 23:14:00,2016-01-07 00:34:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
1,W-2,Snow,Light,2016-01-07 04:14:00,2016-01-07 04:54:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
2,W-3,Snow,Light,2016-01-07 05:54:00,2016-01-07 15:34:00,0.03,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
3,W-4,Snow,Light,2016-01-08 05:34:00,2016-01-08 05:54:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
4,W-5,Snow,Light,2016-01-08 13:54:00,2016-01-08 15:54:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0


In [10]:
constructions.head()

Unnamed: 0,ID,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,End_Lat,End_Lng,Distance(mi),Description,...,Roundabout,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Sunrise_Sunset,Civil_Twilight,Nautical_Twilight,Astronomical_Twilight
0,C-1,4,2019-04-05 16:00:00.000000000,2020-09-29 11:53:57.000000000,32.83836,-93.152378,32.85074,-93.164388,1.103497,Construction on LA-534 WB near EDMONDS LOOP Ro...,...,False,False,False,False,False,False,Day,Day,Day,Day
1,C-2,2,2021-11-12 07:59:00.000000000,2021-11-12 08:22:30.000000000,30.221331,-92.008625,30.216642,-92.003809,0.433173,Slow traffic on US-90 E from US-167/Louisiana ...,...,False,False,False,False,False,False,Day,Day,Day,Day
2,C-3,2,2021-10-12 07:17:30.000000000,2021-10-12 09:18:55.000000000,39.653153,-104.910224,39.65312,-104.913838,0.192266,Slow traffic on CO-30 from S Tamarac Dr (E Ham...,...,False,True,False,False,False,False,Day,Day,Day,Day
3,C-4,4,2021-02-10 02:46:10.000000000,2021-02-17 23:59:00.000000000,33.961506,-118.029339,33.961919,-118.029082,0.032112,Closed road from Whittier to College Ave due t...,...,False,False,False,False,False,False,Night,Night,Night,Night
4,C-5,2,2020-09-24 15:58:00.000000000,2020-09-25 21:04:54.000000000,40.008734,-79.599696,40.022822,-79.595703,0.996057,Construction on US-119 NB near SAMPSON ST Allo...,...,False,False,False,False,False,False,Day,Day,Day,Day


In [11]:
accidents.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2845342 entries, 0 to 2845341
Data columns (total 47 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   ID                     object 
 1   Severity               int64  
 2   Start_Time             object 
 3   End_Time               object 
 4   Start_Lat              float64
 5   Start_Lng              float64
 6   End_Lat                float64
 7   End_Lng                float64
 8   Distance(mi)           float64
 9   Description            object 
 10  Number                 float64
 11  Street                 object 
 12  Side                   object 
 13  City                   object 
 14  County                 object 
 15  State                  object 
 16  Zipcode                object 
 17  Country                object 
 18  Timezone               object 
 19  Airport_Code           object 
 20  Weather_Timestamp      object 
 21  Temperature(F)         float64
 22  Wind_Chill(F)     
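
`accidents.info()` reports `Start_Time` and `End_Time` as `object`, meaning they were read as plain strings. A minimal sketch of converting them to datetimes follows, using three rows copied from the `accidents.head()` output so that it runs without the CSV files. The `sample` frame and the `Duration` column are illustrative assumptions, not steps taken in the original notebook.

```python
import pandas as pd

# Three rows copied from the accidents.head() output above
sample = pd.DataFrame(
    {
        "ID": ["A-1", "A-2", "A-3"],
        "Start_Time": ["2016-02-08 00:37:08", "2016-02-08 05:56:20", "2016-02-08 06:15:39"],
        "End_Time": ["2016-02-08 06:37:08", "2016-02-08 11:56:20", "2016-02-08 12:15:39"],
    }
)

# Parse the string columns; unparseable values become NaT instead of raising
for col in ["Start_Time", "End_Time"]:
    sample[col] = pd.to_datetime(sample[col], errors="coerce")

# Hypothetical derived column: how long each record lasted
sample["Duration"] = sample["End_Time"] - sample["Start_Time"]

print(sample.dtypes)
print(sample[["ID", "Duration"]])
```

Alternatively, `pd.read_csv` accepts a `parse_dates` argument that performs the same conversion at load time.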

In [12]:
missing_accidents = pd.concat([accidents.isnull().sum(), 100 * accidents.isnull().mean()], axis=1)
missing_accidents.columns=['count','%']
missing_accidents.sort_values(by=['%'], ascending=False)

Column,count,%
Number,1743911,61.290031
Precipitation(in),549458,19.310789
Wind_Chill(F),469643,16.505678
Wind_Speed(mph),157944,5.550967
Wind_Direction,73775,2.592834
Humidity(%),73092,2.56883
Weather_Condition,70636,2.482514
Visibility(mi),70546,2.47935
Temperature(F),69274,2.434646
Pressure(in),59200,2.080593


The three fields with the highest % of missing values in the 'accidents' dataset are `Number`, `Precipitation(in)`, and `Wind_Chill(F)`.
* `Number` is the street number in the address field.
* `Precipitation(in)` is the precipitation amount in inches, if there is any.
* `Wind_Chill(F)` is the wind chill (in Fahrenheit).
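
The count-and-percentage table above is built from `isnull().sum()` and `isnull().mean()`. Because the same steps apply to each dataset, they can be wrapped in a small helper. This is a sketch under assumptions: `missing_summary` is a hypothetical name, and the toy frame stands in for the real data.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the missing-value count and percentage per column, highest first."""
    is_missing = df.isnull()
    summary = pd.concat([is_missing.sum(), 100 * is_missing.mean()], axis=1)
    summary.columns = ["count", "%"]
    return summary.sort_values(by="%", ascending=False)


# Toy data for illustration only (not taken from the real datasets)
toy = pd.DataFrame(
    {
        "Number": [np.nan, np.nan, np.nan, 12.0],
        "Precipitation(in)": [0.0, np.nan, 0.1, 0.0],
        "Severity": [2, 3, 2, 2],
    }
)

summary = missing_summary(toy)
print(summary)
```

On the real data, the helper would be called as `missing_summary(accidents)`, `missing_summary(weather)`, and `missing_summary(constructions)`.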

In [13]:
missing_weather = pd.concat([weather.isnull().sum(), 100 * weather.isnull().mean()], axis=1)
missing_weather.columns=['count','%']
missing_weather.sort_values(by=['%'], ascending=False)

Column,count,%
ZipCode,59234,0.791987
City,14563,0.194714
EventId,0,0.0
Type,0,0.0
Severity,0,0.0
StartTime(UTC),0,0.0
EndTime(UTC),0,0.0
Precipitation(in),0,0.0
TimeZone,0,0.0
AirportCode,0,0.0


In [14]:
missing_constructions = pd.concat([constructions.isnull().sum(), 100 * constructions.isnull().mean()], axis=1)
missing_constructions.columns=['count','%']
missing_constructions.sort_values(by=['%'], ascending=False)

Column,count,%
Number,2674829,43.347767
Precipitation(in),922444,14.948951
Wind_Chill(F),776566,12.58488
End_Lat,634048,10.275261
End_Lng,634048,10.275261
Wind_Speed(mph),340277,5.514464
Wind_Direction,162549,2.634238
Visibility(mi),142246,2.305211
Weather_Condition,138203,2.239691
Humidity(%),122560,1.986184


Similar to the accidents dataset, the three fields with the highest % of missing values in the 'constructions' dataset are `Number`, `Precipitation(in)`, and `Wind_Chill(F)`.
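
In the 'constructions' table above, `End_Lat` and `End_Lng` show the same missing count (634,048), which suggests, but does not prove, that they are missing in the same rows. A quick way to check this is sketched below on a toy frame; `same_rows_missing` is a hypothetical helper name, and the values are made up for illustration.

```python
import numpy as np
import pandas as pd


def same_rows_missing(df: pd.DataFrame, col_a: str, col_b: str) -> bool:
    """True if col_a and col_b are missing in exactly the same rows."""
    return bool((df[col_a].isnull() == df[col_b].isnull()).all())


# Toy data for illustration only
toy = pd.DataFrame(
    {
        "End_Lat": [32.85, np.nan, 39.65, np.nan],
        "End_Lng": [-93.16, np.nan, -104.91, np.nan],
        "Wind_Chill(F)": [np.nan, 40.0, np.nan, 35.0],
    }
)

print(same_rows_missing(toy, "End_Lat", "End_Lng"))        # True
print(same_rows_missing(toy, "End_Lat", "Wind_Chill(F)"))  # False
```

If the check holds on the real data, the two columns can be treated as a pair when deciding how to handle the missing end coordinates.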