# Exploratory Data Analysis for the Mapbox Open Data Challenge

## Challenge

The [challenge](https://splashthat.com/sites/view/opendata-contest.splashthat.com?cmsPage=1#sfid-199769342) is hosted by [Mapbox](https://www.mapbox.com).

A better understanding of the causes of crashes can help transportation planners and engineers reduce their frequency and severity. Traffic volume at a given location is often regarded as a risk factor. But the contributing causes of the crashes themselves (weather, driver distraction, etc.) and a high frequency of non-severe or non-fatal crashes can also indicate an increased probability of severe or fatal crashes.

* Create a visualization of crash locations around Bloomington, IN by type (vehicle, bicycle, pedestrian).
    * Include accident cause and traffic volume where available.
    * Make sure to normalize traffic volume by converting average daily traffic (ADT) to
      million entering vehicles (MEV).
    * Separate crash types to provide insight into whether patterns in car crashes involving
      bicycles and pedestrians hold true for car-on-car crashes.
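
The ADT-to-MEV normalization can be sketched as below. This assumes the usual intersection crash-rate convention, where MEV is ADT times 365 days times the number of years, divided by one million. The function names and the example numbers are hypothetical and are not taken from the challenge datasets.

```python
def million_entering_vehicles(adt, years=1.0, days_per_year=365):
    """Convert average daily traffic (ADT) to million entering vehicles (MEV)."""
    return adt * days_per_year * years / 1_000_000


def crashes_per_mev(crash_count, adt, years=1.0):
    """Crash rate per million entering vehicles for a single location."""
    mev = million_entering_vehicles(adt, years)
    if mev <= 0:
        raise ValueError("ADT and years must be positive")
    return crash_count / mev


# Hypothetical numbers for illustration only: 12 crashes over 3 years
# at a location with an ADT of 20,000 vehicles.
rate = crashes_per_mev(crash_count=12, adt=20_000, years=3)
print(round(rate, 3))  # 0.548
```

Dividing by MEV rather than comparing raw crash counts keeps a busy intersection from looking dangerous just because more vehicles pass through it.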
      
## Data

* [Crash data](https://data.bloomington.in.gov/dataset/117733fb-31cb-480a-8b30-fbf425a690cd/resource/8673744e-53f2-42d1-9d05-4e412bd55c94/download/monroe-county-crash-data2003-to-2015.cs) 
* [Bicycle & Pedestrian Counts](https://data.bloomington.in.gov/dataset/117733fb-31cb-480a-8b30-fbf425a690cd/resource/2b2a4280-964c-4845-b397-3105e227a1ae/download/pedestrian-and-bicyclist-counts.csv)
* [Traffic Counts](https://data.bloomington.in.gov/dataset/117733fb-31cb-480a-8b30-fbf425a690cd/resource/d5ba88f9-5798-46cd-888a-189eb59f7b46/download/traffic-counts2013-2015.csv)

In [27]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os

%matplotlib inline

plt.style.use('fivethirtyeight')

In [28]:
crashPath = '/Users/joshisaacson-work/Desktop/DesktopRoot/Projects/MapboxChallenge/Datasets/crash_data_2003-2015.csv'

crashDf = pd.read_csv(crashPath) 
crashDf.head()

Unnamed: 0,Master Record Number,Year,Month,Day,Weekend?,Hour,Collision Type,Injury Type,Primary Factor,Reported_Location,Latitude,Longitude
0,902363382,2015,1,5,Weekday,0.0,2-Car,No injury/unknown,OTHER (DRIVER) - EXPLAIN IN NARRATIVE,1ST & FESS,39.159207,-86.525874
1,902364268,2015,1,6,Weekday,1500.0,2-Car,No injury/unknown,FOLLOWING TOO CLOSELY,2ND & COLLEGE,39.16144,-86.534848
2,902364412,2015,1,6,Weekend,2300.0,2-Car,Non-incapacitating,DISREGARD SIGNAL/REG SIGN,BASSWOOD & BLOOMFIELD,39.14978,-86.56889
3,902364551,2015,1,7,Weekend,900.0,2-Car,Non-incapacitating,FAILURE TO YIELD RIGHT OF WAY,GATES & JACOBS,39.165655,-86.575956
4,902364615,2015,1,7,Weekend,1100.0,2-Car,No injury/unknown,FAILURE TO YIELD RIGHT OF WAY,W 3RD,39.164848,-86.579625


In [38]:
crashDf.describe()

Unnamed: 0,Master Record Number,Year,Month,Day,Hour,Latitude,Longitude
count,53943.0,53943.0,53943.0,53943.0,53718.0,53913.0,53913.0
mean,674811900.0,2008.968059,6.662162,4.196912,1347.265349,35.582109,-78.619224
std,390756300.0,3.78976,3.51463,1.90944,531.654039,11.289883,24.957587
min,14705.0,2003.0,1.0,1.0,0.0,0.0,-88.959213
25%,1991074.0,2006.0,4.0,3.0,1000.0,39.142048,-86.55152
50%,901124100.0,2009.0,7.0,4.0,1400.0,39.16443,-86.530992
75%,901903900.0,2012.0,10.0,6.0,1700.0,39.173344,-86.508288
max,902639400.0,2015.0,12.0,7.0,2300.0,41.228665,86.596363
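
The summary shows a minimum Latitude of 0.0 and a maximum Longitude of 86.596363 (positive), and the mean Latitude (35.58) sits well below the median (39.16). None of that fits Bloomington, so some coordinates are probably placeholders or have a flipped sign. Here is a minimal sketch of how those rows could be set aside before mapping. The bounding box is a deliberately loose assumption of mine, not an official boundary, and the sample rows are made up.

```python
import pandas as pd

# Assumed, deliberately loose bounding box around Monroe County, IN.
LAT_RANGE = (38.9, 39.5)
LON_RANGE = (-86.9, -86.2)


def split_by_bounding_box(df, lat_col="Latitude", lon_col="Longitude"):
    """Return (inside, outside) frames; rows with missing coordinates land in outside."""
    inside_mask = df[lat_col].between(*LAT_RANGE) & df[lon_col].between(*LON_RANGE)
    return df[inside_mask], df[~inside_mask]


# Made-up sample mirroring the kinds of values seen in the summary above.
sample = pd.DataFrame({
    "Latitude": [39.159207, 0.0, 39.16144, None],
    "Longitude": [-86.525874, 0.0, 86.534848, -86.53],
})
inside, outside = split_by_bounding_box(sample)
print(len(inside), len(outside))  # 1 3
```

In the notebook this would be called as `split_by_bounding_box(crashDf)`, and the `outside` frame can be inspected before deciding whether to drop or repair those rows.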


Mapbox Studio only accepts CSV files under 5 MB, so I split the crash data into multiple files of about 2 MB each using the following shell command:

![Screen%20Shot%202017-11-19%20at%206.56.04%20PM.png](attachment:Screen%20Shot%202017-11-19%20at%206.56.04%20PM.png)
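
For anyone who would rather stay in Python, a similar split can be sketched as below. This is not the shell command from the screenshot, just a rough equivalent. It assumes no quoted field contains an embedded newline, and the helper name `split_csv_by_size` is my own. The `_P1`, `_P2`, ... suffixes follow the file names in the directory listing.

```python
import os


def split_csv_by_size(src_path, out_dir, max_bytes=2_000_000):
    """Split a CSV into parts of at most roughly max_bytes, repeating the header in each part."""
    base = os.path.splitext(os.path.basename(src_path))[0]
    part_paths = []
    out = None
    written = 0
    with open(src_path, "r", newline="", encoding="utf-8") as src:
        header = src.readline()
        for line in src:
            size = len(line.encode("utf-8"))
            # Start a new part when the current one would exceed the limit.
            if out is None or written + size > max_bytes:
                if out is not None:
                    out.close()
                part_path = os.path.join(out_dir, f"{base}_P{len(part_paths) + 1}.csv")
                part_paths.append(part_path)
                out = open(part_path, "w", newline="", encoding="utf-8")
                out.write(header)
                written = len(header.encode("utf-8"))
            out.write(line)
            written += size
    if out is not None:
        out.close()
    return part_paths


# Example call (paths are placeholders):
# split_csv_by_size("crash_data_2003-2015.csv", ".", max_bytes=2_000_000)
```

Repeating the header in every part means each file can be uploaded on its own without losing the column names.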

In [30]:
os.listdir('/Users/joshisaacson-work/Desktop/DesktopRoot/Projects/MapboxChallenge/Datasets/')

['crash_data_2003-2015_P4.csv',
 'traffic_counts_2013-2015.csv',
 'crash_data_2003-2015_P2.csv',
 '.DS_Store',
 'crash_data_2003-2015_P3.csv',
 'crash_data_2003-2015_P1.csv',
 'pedestrian_and_bicyclist_counts.csv',
 'crash_data_2003-2015.csv']

It would be nice to see how prevalent NaN values are across the crash records.

In [37]:
# Count every NaN cell in the DataFrame.
nan_count = np.count_nonzero(crashDf.isnull().values)
print("There are " + str(nan_count) + " NaN values.")

# Select the rows that contain at least one NaN.
rows_with_nan = crashDf[crashDf.isnull().any(axis=1)]
print(rows_with_nan)


There are 1515 NaN values.
       Master Record Number  Year  Month  Day Weekend?    Hour Collision Type  \
18                902365255  2015      1    1  Weekend  1600.0          2-Car   
132               902375614  2015      1    6  Weekend  1900.0          2-Car   
243               902401558  2015      2    3  Weekday  1800.0          2-Car   
297               902404235  2015      2    6  Weekday  1200.0          2-Car   
349               902410100  2015      3    2  Weekday   100.0          1-Car   
356               902410776  2015      3    7  Weekend  1800.0          2-Car   
406               902412717  2015      3    5  Weekday  1800.0          2-Car   
418               902413187  2015      3    5  Weekday  2200.0          2-Car   
443               902414629  2015      3    3  Weekday   500.0          2-Car   
472               902432846  2015      4    3  Weekday  2200.0          2-Car   
535               902434780  2015      4    1  Weekend     0.0          2-Car   
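
The total count doesn't say which columns the NaN values come from. The `describe()` counts already hint at it: Hour has 53,718 non-null values out of 53,943 rows, and Latitude and Longitude have 53,913 each. A per-column breakdown could be sketched like this. The helper name `nan_summary` is my own, and the sample frame is made up.

```python
import numpy as np
import pandas as pd


def nan_summary(df):
    """Per-column NaN counts and percentages, largest first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "nan_count": counts,
        "nan_pct": (counts / len(df) * 100).round(2),
    })
    return summary.sort_values("nan_count", ascending=False)


# Made-up sample using a few of the crash columns.
sample = pd.DataFrame({
    "Hour": [0.0, np.nan, 2300.0, 900.0],
    "Latitude": [39.159207, np.nan, np.nan, 39.165655],
    "Primary Factor": ["FOLLOWING TOO CLOSELY", None,
                       "FAILURE TO YIELD RIGHT OF WAY", "FOLLOWING TOO CLOSELY"],
})
summary = nan_summary(sample)
print(summary)
```

In the notebook, `nan_summary(crashDf)` would show whether the missing values are concentrated in a single column or spread across several.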

### Primary Factor for Crash Breakdown