# Dallas Accident Data Analysis
Data source: US Accidents dataset from Kaggle
A Countrywide Traffic Accident Dataset (2016 - 2019)
Link: https://www.kaggle.com/sobhanmoosavi/us-accidents


# Business Case

This dataset contains information that can help identify vehicle accident trends in Dallas over previous years and show where safety enhancements are needed.

According to Sobhan Moosavi (https://www.kaggle.com/sobhanmoosavi/us-accidents), the data was originally collected for “real-time accident prediction, studying accident hotspot locations, casualty analysis and extracting cause and effect rules to predict accidents, or studying the impact of precipitation or other environmental stimuli on accident occurrence.”

The dataset description best explains how the data was collected (https://www.kaggle.com/sobhanmoosavi/us-accidents): "This is a countrywide traffic accident dataset, which covers 49 states of the United States. The data is collected from February 2016 to December 2019, using several data providers, including two APIs that provide streaming traffic incident data. These APIs broadcast traffic data captured by a variety of entities, such as the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road-networks."

As noted on https://www.nbcdfw.com/news/local/dallas-ranks-among-top-in-us-for-fatal-crashes-report/273443/, Dallas ranks among the top of the 25 largest U.S. cities for fatal crashes. According to https://www.dallasnews.com/business/autos/2019/06/26/study-drivers-are-46-more-likely-to-get-into-accidents-on-dallas-roads-than-the-rest-of-the-u-s/, a study by Allstate found that Dallas drivers are “46% more likely to get into a wreck than the average U.S. driver.”

The goal of analyzing this dataset is to find auto accident trends and hotspots that can help the city not only identify where most accidents occur, but also decide which locations should get higher priority for safety enhancements.
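One simple way to look for hotspots is to bin accident coordinates into a grid and count accidents per cell. The sketch below is only an illustration of that idea, not the method used later in this analysis. The function name `hotspot_cells`, the 0.01 degree cell size, and the sample coordinates are all assumptions made for the example.

```python
from collections import Counter

def hotspot_cells(points, cell_deg=0.01, top_n=3):
    """Count accidents per lat/lng grid cell and return the busiest cells.

    points   : iterable of (lat, lng) pairs in decimal degrees
    cell_deg : cell size in degrees (assumed; 0.01 deg of latitude is roughly 1.1 km)
    top_n    : number of cells to return
    """
    counts = Counter()
    for lat, lng in points:
        # Floor division gives integer cell indices and bins negative longitudes consistently.
        cell = (int(lat // cell_deg), int(lng // cell_deg))
        counts[cell] += 1
    return counts.most_common(top_n)

# Toy coordinates, made up for illustration (not rows from the dataset)
sample = [
    (32.7781, -96.7821), (32.7784, -96.7825), (32.7789, -96.7829),
    (32.8641, -96.6612), (32.8645, -96.6618),
    (32.6622, -96.9432),
]
top = hotspot_cells(sample)
print(top)
```

With the accident DataFrame loaded, the same function could be called as `hotspot_cells(zip(df['Start_Lat'], df['Start_Lng']))`. A fixed grid is crude (cells are not square in distance terms and hotspots can straddle cell borders), so it is best treated as a first pass.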

If our predictions are correct, we would expect that once a location has been identified as an accident hotspot and “safety enhanced” by the city, there is at least a 75% chance that accidents around that location drop. We cannot guarantee a reduction because some accidents come down to simply bad driving rather than the location.
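One possible way to check this expectation later is to compare accident counts before and after the enhancement at each treated location and see what share of locations improved. The sketch below reads the 75% figure as "at least 75% of treated hotspots show a drop", which is an assumption about how the target would be measured. The site names and counts are hypothetical.

```python
def share_with_drop(before, after):
    """Fraction of treated locations whose accident count fell."""
    improved = sum(1 for loc in before if after[loc] < before[loc])
    return improved / len(before)

# Hypothetical yearly accident counts at four treated hotspots (illustrative numbers only)
before = {"site_a": 40, "site_b": 25, "site_c": 31, "site_d": 18}
after = {"site_a": 28, "site_b": 19, "site_c": 33, "site_d": 12}

share = share_with_drop(before, after)
meets_target = share >= 0.75  # 0.75 is the target stated in the paragraph above
print(share, meets_target)
```

A raw before/after comparison ignores citywide trends and year-to-year noise, so a real evaluation would need comparison locations as well.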

Note: more will be added as we dig into the data.


In [5]:
import pandas as pd
import numpy as np

print('Pandas:', pd.__version__)
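print('Numpy:', np.__version__)

# The relative path '../data/US_Accidents_Dec19.csv' did not resolve in this environment,
# so the file is read from the working directory instead.
# df = pd.read_csv('../data/US_Accidents_Dec19.csv')
df = pd.read_csv('US_Accidents_Dec19.csv')  # read in the CSV file

# Keep Texas rows only; State holds two-letter codes, so an exact match is enough.
df = df[df['State'] == 'TX']
# Substring match on City: keeps every city whose name contains 'Dallas'.
# na=False treats rows with a missing City as non-matches.
df = df[df['City'].str.contains('Dallas', na=False)]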

df

Pandas: 0.25.1
Numpy: 1.16.5


| | ID | Source | TMC | Severity | Start_Time | End_Time | Start_Lat | Start_Lng | End_Lat | End_Lng | ... | Roundabout | Station | Stop | Traffic_Calming | Traffic_Signal | Turning_Loop | Sunrise_Sunset | Civil_Twilight | Nautical_Twilight | Astronomical_Twilight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 261012 | A-261014 | MapQuest | 201.0 | 2 | 2016-11-30 16:10:04 | 2016-11-30 17:25:00 | 32.662193 | -96.943153 | NaN | NaN | ... | False | False | False | False | False | False | Day | Day | Day | Day |
| 261013 | A-261015 | MapQuest | 201.0 | 3 | 2016-11-30 16:05:32 | 2016-11-30 17:24:00 | 32.778790 | -96.782021 | NaN | NaN | ... | False | False | False | False | False | False | Day | Day | Day | Day |
| 261014 | A-261016 | MapQuest | 201.0 | 2 | 2016-11-30 16:10:46 | 2016-11-30 17:27:00 | 32.724277 | -96.762245 | NaN | NaN | ... | False | False | False | False | False | False | Day | Day | Day | Day |
| 261015 | A-261017 | MapQuest | 201.0 | 2 | 2016-11-30 15:45:59 | 2016-11-30 17:18:00 | 32.708355 | -96.700043 | NaN | NaN | ... | False | False | False | False | False | False | Day | Day | Day | Day |
| 261016 | A-261018 | MapQuest | 201.0 | 3 | 2016-11-30 16:06:04 | 2016-11-30 17:20:42 | 32.864021 | -96.661140 | NaN | NaN | ... | False | False | False | False | False | False | Day | Day | Day | Day |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2973607 | A-2973631 | Bing | NaN | 3 | 2019-08-22 17:54:40 | 2019-08-22 18:23:27 | 32.678270 | -96.822740 | 32.67153 | -96.82262 | ... | False | False | False | False | False | False | Day | Day | Day | Day |
| 2973611 | A-2973635 | Bing | NaN | 3 | 2019-08-22 19:18:27 | 2019-08-22 19:47:43 | 33.119510 | -97.032410 | 33.12436 | -97.03547 | ... | False | False | False | False | False | False | Day | Day | Day | Day |
| 2973612 | A-2973636 | Bing | NaN | 3 | 2019-08-22 23:50:30 | 2019-08-23 00:19:59 | 32.926010 | -96.820616 | 32.92601 | -96.82072 | ... | False | False | False | False | False | False | Night | Night | Night | Night |
| 2974057 | A-2974081 | Bing | NaN | 3 | 2019-08-23 14:12:33 | 2019-08-23 14:41:14 | 32.904220 | -96.769110 | 32.89449 | -96.76926 | ... | False | False | False | False | False | False | Day | Day | Day | Day |
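The city filter in the cell above uses `str.contains`, which is a substring match, so it keeps any city whose name merely includes "Dallas". The sketch below contrasts that with an exact match on a toy frame. The city names in the toy data are illustrative only and are not a claim about which cities appear in the dataset.

```python
import pandas as pd

# Toy frame; values are made up to show the difference between the two filters
toy = pd.DataFrame({
    "State": ["TX", "TX", "TX", "CA", "TX"],
    "City": ["Dallas", "Lake Dallas", "Houston", "Dallas", None],
})

in_texas = toy["State"] == "TX"

# Substring match: keeps every Texas city whose name contains "Dallas"
substring = toy[in_texas & toy["City"].str.contains("Dallas", na=False)]

# Exact match: keeps only Texas rows where the city is exactly "Dallas"
exact = toy[in_texas & (toy["City"] == "Dallas")]

print(substring["City"].tolist())
print(exact["City"].tolist())
```

Which filter is right depends on whether the analysis is meant to cover the city proper or nearby places with "Dallas" in the name; comparing `df['City'].value_counts()` under both filters would show how much the choice matters here.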


In [6]:
df.describe()  # by default this summarizes numeric columns only

| | TMC | Severity | Start_Lat | Start_Lng | End_Lat | End_Lng | Distance(mi) | Number | Temperature(F) | Wind_Chill(F) | Humidity(%) | Pressure(in) | Visibility(mi) | Wind_Speed(mph) | Precipitation(in) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 47962.0 | 58086.0 | 58086.0 | 58086.0 | 10124.0 | 10124.0 | 58086.0 | 15916.0 | 57927.0 | 16738.0 | 57919.0 | 57989.0 | 57906.0 | 54629.0 | 16395.0 |
| mean | 209.710771 | 2.382312 | 32.811202 | -96.796752 | 32.820906 | -96.79847 | 0.095629 | 4747.669703 | 69.001927 | 58.560867 | 63.597783 | 29.902372 | 9.431069 | 9.679602 | 0.01984 |
| std | 17.980151 | 0.50288 | 0.086146 | 0.07264 | 0.090912 | 0.074821 | 0.279749 | 3921.85442 | 17.112714 | 23.191381 | 20.893385 | 0.31982 | 2.227893 | 5.171056 | 0.087521 |
| min | 201.0 | 1.0 | 32.620309 | -97.07235 | 32.620225 | -97.08661 | 0.0 | 1.0 | 12.9 | 2.0 | 4.0 | 28.75 | 0.1 | 0.0 | 0.0 |
| 25% | 201.0 | 2.0 | 32.75256 | -96.840315 | 32.76644 | -96.84087 | 0.0 | 1939.0 | 57.0 | 38.0 | 47.0 | 29.77 | 10.0 | 6.0 | 0.0 |
| 50% | 201.0 | 2.0 | 32.798169 | -96.799248 | 32.81177 | -96.79286 | 0.0 | 3498.0 | 71.6 | 62.5 | 65.0 | 29.96 | 10.0 | 9.2 | 0.0 |
| 75% | 201.0 | 3.0 | 32.88543 | -96.752739 | 32.90585 | -96.75305 | 0.01 | 6999.0 | 82.0 | 78.0 | 81.0 | 30.1 | 10.0 | 12.7 | 0.0 |
| max | 406.0 | 4.0 | 33.129891 | -96.557159 | 33.153665 | -96.564205 | 6.92 | 39725.0 | 109.9 | 103.0 | 100.0 | 30.86 | 111.0 | 255.0 | 2.28 |
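The `count` row of the `df.describe()` output doubles as a quick missing-data check, because it reports non-null values per column. The sketch below turns a few of those counts into missing shares. It takes 58,086 as the row count, which is the count reported for `Severity` and the other fully populated columns; the selection of columns is arbitrary.

```python
# Non-null counts copied from the `count` row of the df.describe() output
total_rows = 58086  # count reported for Severity, which has no missing values
non_null = {
    "TMC": 47962,
    "End_Lat": 10124,
    "Number": 15916,
    "Wind_Chill(F)": 16738,
    "Precipitation(in)": 16395,
}

# Share of rows where each column is missing
missing_share = {col: 1 - n / total_rows for col, n in non_null.items()}

# Print columns from most to least sparse
for col, share in sorted(missing_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:<18} {share:.1%}")
```

On the live DataFrame, `df.isna().mean()` gives the same shares for every column at once. Note that the `TMC` and `End_Lat` counts add up to the row count (47,962 + 10,124 = 58,086), which matches the preview where MapQuest rows have a TMC code but no end coordinates and Bing rows are the reverse.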


In [7]:
print(df.dtypes)
print('======================================')
df.info()  # info() prints its report directly and returns None, so no print() is needed

ID                        object
Source                    object
TMC                      float64
Severity                   int64
Start_Time                object
End_Time                  object
Start_Lat                float64
Start_Lng                float64
End_Lat                  float64
End_Lng                  float64
Distance(mi)             float64
Description               object
Number                   float64
Street                    object
Side                      object
City                      object
County                    object
State                     object
Zipcode                   object
Country                   object
Timezone                  object
Airport_Code              object
Weather_Timestamp         object
Temperature(F)           float64
Wind_Chill(F)            float64
Humidity(%)              float64
Pressure(in)             float64
Visibility(mi)           float64
Wind_Direction            object
Wind_Speed(mph)          float64
Precipitat