# New York City Motor Vehicle Collisions

Data Description: The dataset I've chosen is Motor Vehicle Collisions - Crashes (Public Safety), published by the NYPD on NYC Open Data (https://opendata.cityofnewyork.us). As the name suggests, each record describes a motor vehicle collision, including the people injured or killed in it, which ties the dataset to public safety.


In [1]:
import os
import calendar
import datetime
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
import matplotlib.dates as mdates
import seaborn as sns
from pywaffle import Waffle
import geopandas as gpd


In [None]:
raw_data = pd.read_csv('datasets/Motor_Vehicle_Collisions_-_Crashes.csv')

print(raw_data.shape)
raw_data.head(3)

Column Details: 

CRASH DATE: The date of the collision.

CRASH TIME: The time of the collision.

BOROUGH: Borough in which the collision occurred.

LATITUDE, LONGITUDE and LOCATION: Geographical coordinates of the collision.

ON STREET NAME: Street on which the collision occurred.

CROSS STREET NAME: Nearest cross street to the collision.

OFF STREET NAME: Street address (if known).

NUMBER OF PERSONS INJURED, NUMBER OF PERSONS KILLED, NUMBER OF PEDESTRIANS INJURED, NUMBER OF PEDESTRIANS KILLED, NUMBER OF CYCLIST INJURED, NUMBER OF CYCLIST KILLED, NUMBER OF MOTORIST INJURED and NUMBER OF MOTORIST KILLED: Details about the number of people injured or killed in the accident.

CONTRIBUTING FACTOR VEHICLE 1-5: Factors contributing to the collision for designated vehicle.

COLLISION_ID: Unique record code generated by the system.
 
VEHICLE TYPE CODE 1-5: Type of vehicle based on the selected vehicle category.

### Processing and cleaning of the data

In [None]:
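# Keep only the columns where fewer than 34% of the values are missing
mask = raw_data.isna().sum() / len(raw_data) < 0.34
raw_data = raw_data.loc[:, mask].copy()

# Removing columns that don't have a large contributing factor to EDA and Predictions
cols_to_drop = ['ZIP CODE', 'LOCATION', 
                'CONTRIBUTING FACTOR VEHICLE 2', 'VEHICLE TYPE CODE 2']
# errors = 'ignore' skips any of these that the missing-value filter already removed
raw_data.drop(cols_to_drop, axis = 1, inplace = True, errors = 'ignore')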
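
The column filter in the cell above keeps a column only when fewer than 34% of its values are missing. A minimal sketch of the same rule on a made-up three-column frame (the values are invented for illustration, only the column names come from the dataset):

```python
import pandas as pd

# Made-up rows; only the column names are taken from the dataset.
toy = pd.DataFrame({
    'BOROUGH': ['QUEENS', None, 'BRONX', 'BROOKLYN'],     # 25% missing
    'OFF STREET NAME': [None, None, None, '12 MAIN ST'],  # 75% missing
    'COLLISION_ID': [1, 2, 3, 4],                         # complete
})

# Fraction of missing values per column (same as isna().sum() / len(toy))
missing_share = toy.isna().mean()

# Keep the columns below the 34% threshold
kept = toy.loc[:, missing_share < 0.34].copy()

print(missing_share)
print(kept.columns.tolist())
```

Printing `missing_share` before filtering makes it easy to see which columns sit close to the threshold.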

In [None]:
raw_data['CRASH_DATE_TIME'] = raw_data['CRASH DATE'] + ' ' + raw_data['CRASH TIME']

cols_to_drop = ['CRASH DATE', 'CRASH TIME']
raw_data.drop(cols_to_drop, axis = 1, inplace = True)

In [None]:
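# The export stores dates month-first (MM/DD/YYYY), so dayfirst=True would swap day and month.
# Assumption: values look like '03/04/2019 2:39'; adjust the format if the file differs.
raw_data['CRASH_DATE_TIME'] = pd.to_datetime(raw_data['CRASH_DATE_TIME'], 
                                             format='%m/%d/%Y %H:%M', errors='coerce')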

idx = raw_data[raw_data['CRASH_DATE_TIME'].isnull()].index
raw_data.drop(idx, axis = 0, inplace = True)

print(raw_data.shape)
raw_data.head(3)
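
Crash dates such as `03/04/2019` are ambiguous unless the parser knows the field order. The sketch below assumes the month-first `MM/DD/YYYY` layout and an `H:MM` time, and uses made-up rows; adjust `format` if the export differs.

```python
import pandas as pd

# Made-up rows in the assumed layout: MM/DD/YYYY plus H:MM
sample = pd.DataFrame({
    'CRASH DATE': ['03/04/2019', '12/25/2019'],
    'CRASH TIME': ['2:39', '14:05'],
})
stamp = sample['CRASH DATE'] + ' ' + sample['CRASH TIME']

# An explicit format removes the day/month ambiguity
parsed = pd.to_datetime(stamp, format='%m/%d/%Y %H:%M', errors='coerce')

print(parsed.dt.month.tolist())  # months read from the first field
print(parsed.isna().sum())       # number of rows that failed to parse
```

Checking the count of unparsed rows right after conversion shows how many records the subsequent `drop` will remove.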

### Borough Analysis

In [None]:
borough_wise = raw_data.groupby(['BOROUGH']).size().reset_index(name='NoOfAccidents')
borough_wise.head()

The [GIS data](https://data.cityofnewyork.us/City-Government/Borough-Boundaries/tqmj-j8zm) with the Boundaries of Boroughs for New York City is obtained from NYC Open Data. The data is provided by the Department of City Planning (DCP).
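
Joining the collision counts to the boundary data only works if the borough names match exactly. The sketch below shows the idea with plain pandas DataFrames standing in for the GeoDataFrame; the counts are made up, and the mixed-case spelling of `boro_name` is an assumption about the shapefile.

```python
import pandas as pd

# Stand-in for the boundary table (a plain DataFrame here, not a GeoDataFrame)
boundaries = pd.DataFrame({'boro_name': ['Bronx', 'Staten Island', 'Queens']})

# Made-up counts keyed by the upper-case names used in the collision data
counts = pd.DataFrame({'BOROUGH': ['BRONX', 'QUEENS'], 'NoOfAccidents': [10, 25]})

# Normalise the case, then left-join the counts onto the boundaries
boundaries['boro_name'] = boundaries['boro_name'].str.upper()
joined = boundaries.set_index('boro_name').join(counts.set_index('BOROUGH'))

# A left join keeps every boundary row; names without a match show up as NaN
unmatched = joined.index[joined['NoOfAccidents'].isna()].tolist()
print(unmatched)
```

Listing the unmatched names after the join is a cheap way to catch spelling differences before they turn into blank regions on the map.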

In [None]:
fp = 'datasets/Borough Boundaries/geo_export_87071461-9196-46f3-8d1b-52fed88fb835.shp'
borough_geo = gpd.read_file(fp)
borough_geo['boro_name'] = borough_geo['boro_name'].str.upper() 

borough_wise = borough_geo.set_index('boro_name').join(borough_wise.set_index('BOROUGH'))

In [None]:
fig, ax = plt.subplots(1, figsize=(10, 7))

borough_wise.plot(column = 'NoOfAccidents', cmap = 'Reds', linewidth = 0.8, 
                      ax = ax, edgecolor = '0.8')

ax.axis('off')
ax.set_title('Motor Vehicle Collisions in NYC', size = 16)
ax.annotate('Source: NYC Open Data', xy = (0.1, .08),  
            xycoords = 'figure fraction', horizontalalignment = 'left', verticalalignment = 'top', 
            fontsize = 12, color = '#555555')

sm = plt.cm.ScalarMappable(cmap = 'Reds', 
                           norm = plt.Normalize(vmin = 22822, vmax = 189648))
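# Pass the axes explicitly; newer Matplotlib versions cannot infer them for a standalone mappable
cbar = fig.colorbar(sm, ax = ax)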

fig.savefig('plots/borough_wise_accidents.png', dpi=300)

**Analysis: Most motor vehicle collisions in the dataset occurred in Brooklyn and Queens. Manhattan comes next, followed by the Bronx, with Staten Island recording the fewest.**

In [None]:
injuries_and_fatalities = raw_data.groupby(['BOROUGH'])\
[['NUMBER OF PERSONS KILLED', 'NUMBER OF PERSONS INJURED']].agg('sum').reset_index()

injuries_and_fatalities['Total Accidents'] = raw_data.groupby(['BOROUGH']).size().\
reset_index(name='NoOfAccidents').NoOfAccidents

injuries_and_fatalities['Injury%'] = round((injuries_and_fatalities['NUMBER OF PERSONS INJURED']\
/ injuries_and_fatalities['Total Accidents'] * 100), 1)
injuries_and_fatalities['Fatality%'] = round((injuries_and_fatalities['NUMBER OF PERSONS KILLED']\
/ injuries_and_fatalities['Total Accidents'] * 100), 3)

injuries_and_fatalities.head()

In [None]:
injuries_and_fatalities.drop('Total Accidents', axis = 1, inplace = True)
injuries_and_fatalities = borough_geo.set_index('boro_name').join(injuries_and_fatalities.set_index('BOROUGH'))

In [None]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, sharex=True, sharey=True, figsize=(10, 7))

injuries_and_fatalities.plot(column = 'NUMBER OF PERSONS INJURED', cmap = 'PuRd', linewidth = 0.8, 
                      ax = ax1, edgecolor = '0.8')
injuries_and_fatalities.plot(column = 'NUMBER OF PERSONS KILLED', cmap = 'Reds', linewidth = 0.8, 
                      ax = ax2, edgecolor = '0.8')
injuries_and_fatalities.plot(column = 'Injury%', cmap = 'PuRd', linewidth = 0.8, 
                      ax = ax3, edgecolor = '0.8')
injuries_and_fatalities.plot(column = 'Fatality%', cmap = 'Reds', linewidth = 0.8, 
                      ax = ax4, edgecolor = '0.8')

ax1.axis('off'); ax2.axis('off'); ax3.axis('off'); ax4.axis('off')
ax1.set_title('Total number of people injured in NYC', size = 10)
ax2.set_title('Total number of people killed in NYC', size = 10)
ax3.set_title('Percentage of people injured in vehicle collisions', size = 9)
ax4.set_title('Percentage of people killed in vehicle collisions', size = 9)

sm = plt.cm.ScalarMappable(cmap = 'PuRd', norm = plt.Normalize(vmin = 5800, vmax = 53000))
cbar = fig.colorbar(sm, ax = ax1)

sm = plt.cm.ScalarMappable(cmap = 'Reds', norm = plt.Normalize(vmin = 30, vmax = 210))
cbar = fig.colorbar(sm, ax = ax2)

sm = plt.cm.ScalarMappable(cmap = 'PuRd', norm = plt.Normalize(vmin = 15, vmax = 30))
cbar = fig.colorbar(sm, ax = ax3)

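# Take the colour limits from the data so that vmin <= vmax
sm = plt.cm.ScalarMappable(cmap = 'Reds', norm = plt.Normalize(vmin = injuries_and_fatalities['Fatality%'].min(), 
                                                               vmax = injuries_and_fatalities['Fatality%'].max()))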
cbar = fig.colorbar(sm, ax = ax4)

fig.savefig('plots/borough_wise_injury_percentage.png', dpi=500)

**Analysis: Brooklyn and the Bronx report the highest injury percentages. Queens and Staten Island follow in third and fourth place, only a couple of percentage points behind. Manhattan, by contrast, reports the lowest injury percentage of the five boroughs.**
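
Note that `Injury%` as computed in this notebook divides the number of injured people by the number of collisions, so it is "people injured per 100 collisions" and can exceed 100 when a single crash injures several people. The sketch below uses made-up rows to contrast it with the share of collisions that injured at least one person; the output column names are hypothetical.

```python
import pandas as pd

# Made-up collisions; one Bronx crash injures three people
toy = pd.DataFrame({
    'BOROUGH': ['BRONX', 'BRONX', 'QUEENS', 'QUEENS', 'QUEENS', 'QUEENS'],
    'NUMBER OF PERSONS INJURED': [3, 0, 0, 2, 0, 0],
    'NUMBER OF PERSONS KILLED': [0, 0, 0, 0, 1, 0],
})

# Totals and collision counts in a single groupby
summary = toy.groupby('BOROUGH').agg(
    injured=('NUMBER OF PERSONS INJURED', 'sum'),
    killed=('NUMBER OF PERSONS KILLED', 'sum'),
    accidents=('NUMBER OF PERSONS INJURED', 'size'),
)

# Same definitions as the notebook: people per 100 collisions
summary['Injury%'] = (summary['injured'] / summary['accidents'] * 100).round(1)
summary['Fatality%'] = (summary['killed'] / summary['accidents'] * 100).round(3)

# Alternative: share of collisions with at least one injured person
with_injury = (toy['NUMBER OF PERSONS INJURED'] > 0).groupby(toy['BOROUGH']).mean() * 100
summary['Crashes with injury %'] = with_injury

print(summary)
```

Which of the two measures is more useful depends on the question; the maps in this section use the first one.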

### Contributing Factor Analysis

In the dataset, the column `CONTRIBUTING FACTOR VEHICLE 1` gives the factor contributing to the collision for the first designated vehicle.

In [None]:
factor_wise = raw_data.groupby(['CONTRIBUTING FACTOR VEHICLE 1'])\
[['NUMBER OF PERSONS KILLED', 'NUMBER OF PERSONS INJURED']].agg('sum').reset_index()

factor_wise['Total Accidents'] = raw_data.groupby(['CONTRIBUTING FACTOR VEHICLE 1']).size().\
reset_index(name='NoOfAccidents').NoOfAccidents

factor_wise = factor_wise.sort_values('Total Accidents', ascending = False).head(10).iloc[1:]

factor_wise['Injury%'] = round((factor_wise['NUMBER OF PERSONS INJURED']/factor_wise['Total Accidents'] * 100), 1)
factor_wise['Fatality%'] = round((factor_wise['NUMBER OF PERSONS KILLED']/factor_wise['Total Accidents'] * 100), 3)

factor_wise = factor_wise[:-1]
factor_wise.head(3)

#### Common reasons for accidents

In [None]:
factor_accidents = factor_wise.sort_values('Total Accidents', ascending = False).head(10)
factor_accidents.head(3)

In [None]:
fig, ax = plt.subplots(1, figsize=(14, 6))

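# Reverse the colour order only; flipping every axis would also reverse the RGBA channels
color = np.flip(cm.Reds(np.linspace(.2,.6, 10)), axis = 0)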

factor_accidents.plot(x = 'CONTRIBUTING FACTOR VEHICLE 1', 
                      y = 'Total Accidents', kind = 'bar', 
                      color = color, stacked = True, ax = ax)

ax.set_title('Factors causing the most number of accidents', size = 12)
ax.set_xlabel('Contributing Factor', size = 12)
ax.set_ylabel('Number of Accidents', size = 12)
ax.tick_params(labelrotation = 20)

fig.savefig('plots/factor_accidents.png', dpi=500)

waf_df = factor_accidents[['CONTRIBUTING FACTOR VEHICLE 1', 'Total Accidents']].\
set_index('CONTRIBUTING FACTOR VEHICLE 1')

waf = plt.figure(
    FigureClass = Waffle, 
    rows = 5, 
    values = ((waf_df['Total Accidents'] / 485767) * 100) ,
    title={'label': 'Factors causing the most number of accidents', 
           'loc': 'center', 'size': 22},
    labels=["{0} ({1}%)".format(k, round((v / 485767) * 100, 2)) for k, v in waf_df['Total Accidents'].items()],
    legend={'loc': 'lower left', 'bbox_to_anchor': (0, -0.15), 'ncol': len(waf_df), 'framealpha': 0},
    starting_location='NW',
    figsize=(22, 8)
)

waf.gca().set_facecolor('#EEEEEE')
waf.set_facecolor('#EEEEEE')

waf.savefig('plots/factor_accidents_waffle.png', dpi=500)

**Analysis: Driver Distraction is by far the most common factor leading to accidents in New York City.**
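
The waffle chart labels express each factor as a percentage of a fixed total. A minimal sketch of deriving such percentages from the counts themselves is shown below; the counts are made up, and the denominator here is the sum of the plotted categories, which is an assumption that differs from the fixed total used in the cell above.

```python
import pandas as pd

# Made-up counts for three factors
counts = pd.Series(
    {
        'Driver Inattention/Distraction': 60,
        'Failure to Yield Right-of-Way': 25,
        'Following Too Closely': 15,
    },
    name='Total Accidents',
)

# Percentage share of each factor among the plotted factors
share = (counts / counts.sum() * 100).round(2)

# Legend labels in the same "name (x%)" style as the waffle chart
labels = ['{0} ({1}%)'.format(k, v) for k, v in share.items()]
print(labels)
```

Computing the denominator from the data keeps the labels consistent if the dataset is refreshed.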

#### Factors that contributed to the highest injury and fatality rates

In [None]:
factor_inj_rate = factor_wise.sort_values('Injury%', ascending = False).head(10)
factor_fat_rate = factor_wise.sort_values('Fatality%', ascending = False).head(10)

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(18, 6))

color = np.flip(cm.plasma(np.linspace(.2,.6, 10)), axis = 0)
color2 = cm.autumn(np.linspace(.2,.6, 10))

factor_inj_rate.plot(x = 'CONTRIBUTING FACTOR VEHICLE 1', 
                      y = 'Injury%', kind = 'bar', 
                      color = color, stacked = True, ax = ax1)

factor_fat_rate.plot(x = 'CONTRIBUTING FACTOR VEHICLE 1', 
                      y = 'Fatality%', kind = 'bar', 
                      color = color2, stacked = True, ax = ax2)

ax1.set_title('Factors with the highest rate of injury', size = 12)
ax1.set_xlabel('Contributing Factor', size = 12)
ax1.set_ylabel('Rate of Injury (%)', size = 12)
ax1.tick_params(labelrotation = 30)

ax2.set_title('Factors with the highest rate of fatality', size = 12)
ax2.set_xlabel('Contributing Factor', size = 12)
ax2.set_ylabel('Rate of Fatality (%)', size = 12)
ax2.tick_params(labelrotation = 30)

fig.savefig('plots/factor_inj_fat_rate.png', dpi=500)

**Analysis: The foremost cause of fatalities in accidents is "Failure to Yield Right-of-Way," responsible for more than 16% of all incidents resulting in death. Following at a notable distance is "Driver Inattention/Distraction," contributing to approximately 6%. Conversely, the incidence of injuries is markedly higher, with "Failure to Yield Right-of-Way" once again leading, accounting for over 45% of injuries. Additionally, "Following Too Closely," "Driver Inattention/Distraction," and "Other Vehicular" also exhibit significantly high injury rates, ranging between 20% and 30%.**
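
`Injury%` and `Fatality%` in `factor_wise` are computed within each factor (people injured or killed per 100 collisions attributed to that factor). That is a different quantity from a factor's share of all injuries or deaths, and the two can rank factors differently. A small sketch with made-up numbers and hypothetical column names:

```python
import pandas as pd

# Made-up totals for two hypothetical factors
by_factor = pd.DataFrame({
    'factor': ['A', 'B'],
    'collisions': [1000, 200],
    'killed': [2, 1],
}).set_index('factor')

# Rate within the factor: deaths per 100 collisions of that factor
by_factor['rate_per_100'] = by_factor['killed'] / by_factor['collisions'] * 100

# Share of all deaths that fall under the factor
by_factor['share_of_deaths'] = by_factor['killed'] / by_factor['killed'].sum() * 100

print(by_factor)
```

In this toy data factor B has the higher rate while factor A accounts for the larger share of deaths, so it matters which of the two a chart or sentence refers to.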

### Vehicle Type Analysis

In the dataset, the column `VEHICLE TYPE CODE 1` gives the type of the vehicle which was involved in motor collisions.

In [None]:
vehicle_wise = raw_data.groupby(['VEHICLE TYPE CODE 1'])\
[['NUMBER OF PERSONS KILLED', 'NUMBER OF PERSONS INJURED']].agg('sum').reset_index()

vehicle_wise['Total Accidents'] = raw_data.groupby(['VEHICLE TYPE CODE 1']).size().\
reset_index(name='NoOfAccidents').NoOfAccidents

vehicle_wise = vehicle_wise.sort_values('Total Accidents', ascending = False)

vehicle_wise['Injury%'] = round((vehicle_wise['NUMBER OF PERSONS INJURED']/vehicle_wise['Total Accidents'] * 100), 1)
vehicle_wise['Fatality%'] = round((vehicle_wise['NUMBER OF PERSONS KILLED']/vehicle_wise['Total Accidents'] * 100), 3)

mask = vehicle_wise['Total Accidents'] > 100
vehicle_wise = vehicle_wise[mask]

vehicle_wise.head(3)

In [None]:
vehicle_accidents = vehicle_wise.sort_values('Total Accidents', ascending = False).head(10)
vehicle_accidents.head(5)

In [None]:
fig, ax = plt.subplots(1, figsize=(14, 6))

color = np.flip(cm.Reds(np.linspace(.2,.6, 10)), axis = 0)

vehicle_accidents.plot(x = 'VEHICLE TYPE CODE 1', 
                      y = 'Total Accidents', kind = 'bar', 
                      color = color, stacked = True, ax = ax)

ax.set_title('Vehicle types involved in the most number of accidents', size = 12)
ax.set_xlabel('Vehicle Type', size = 12)
ax.set_ylabel('Number of Accidents', size = 12)
ax.tick_params(labelrotation = 10)

fig.savefig('plots/vehicle_type_accidents.png', dpi=500)

**Analysis: Passenger vehicles are by far the most common vehicle type involved in accidents on the roads of New York City, followed by sedans, SUVs, station wagons, etc. These are also the most commonly used vehicle types in New York City.**

In [None]:
vehicle_inj_rate = vehicle_wise.sort_values('Injury%', ascending = False).head(10)
vehicle_fat_rate = vehicle_wise.sort_values('Fatality%', ascending = False).head(10)

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(18, 6))

color = np.flip(cm.plasma(np.linspace(.2,.6, 10)), axis = 0)
color2 = cm.autumn(np.linspace(.2,.6, 10))

vehicle_inj_rate.plot(x = 'VEHICLE TYPE CODE 1', 
                      y = 'Injury%', kind = 'bar', 
                      color = color, stacked = True, ax = ax1)

vehicle_fat_rate.plot(x = 'VEHICLE TYPE CODE 1', 
                      y = 'Fatality%', kind = 'bar', 
                      color = color2, stacked = True, ax = ax2)

ax1.set_title('Vehicle Types with the highest rate of injury', size = 12)
ax1.set_xlabel('Vehicle Types', size = 12)
ax1.set_ylabel('Rate of Injury (%)', size = 12)
ax1.tick_params(labelrotation = 30)

ax2.set_title('Vehicle Types with the highest rate of fatality', size = 12)
ax2.set_xlabel('Vehicle Types', size = 12)
ax2.set_ylabel('Rate of Fatality (%)', size = 12)
ax2.tick_params(labelrotation = 30)

fig.savefig('plots/vehicle_inj_fat_rate.png', dpi=500)

### What role do date and time play? 

In [None]:
date_only = raw_data.copy() 
date_only['Date'] = date_only['CRASH_DATE_TIME'].dt.date

date_wise = date_only.groupby(['Date'])\
[['NUMBER OF PERSONS KILLED', 'NUMBER OF PERSONS INJURED']].agg('sum').reset_index()

date_wise['Total Accidents'] = date_only.groupby(['Date']).size().\
reset_index(name='NoOfAccidents').NoOfAccidents

date_wise['Injury%'] = round((date_wise['NUMBER OF PERSONS INJURED']/date_wise['Total Accidents'] * 100), 1)
date_wise['Fatality%'] = round((date_wise['NUMBER OF PERSONS KILLED']/date_wise['Total Accidents'] * 100), 3)

date_wise = date_wise.sort_values('Total Accidents', ascending = False)

In [None]:
date_accidents = date_wise.sort_values('Total Accidents', ascending = False).head(10)

fig, ax = plt.subplots(1, figsize=(12, 4))

color = np.flip(cm.Oranges(np.linspace(.2,.6, 10)), axis = 0)

date_accidents.plot(x = 'Date', 
                      y = 'Total Accidents', kind = 'bar', 
                      color = color, stacked = True, ax = ax)

ax.set_title('Dates on which the most number of accidents occurred', size = 12)
ax.set_xlabel('Date', size = 12)
ax.set_ylabel('Number of Accidents', size = 12)
ax.tick_params(labelrotation = 10)

fig.savefig('plots/date_accidents.png', dpi=500)

**Analysis: Here are the top ten dates noted for the highest number of recorded accidents in New York City.**

#### What role do year and month play? 

In [None]:
month_only = raw_data.copy() 
month_only['Date'] = month_only['CRASH_DATE_TIME']
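# Keep crashes from 2015-01-01 through 2019-12-31
mask = month_only['Date'] >= '2015-01-01' 
month_only = month_only[mask]

mask2 = month_only['Date'] < '2020-01-01' 
month_only = month_only[mask2]

# Map each crash to the last weekday (Mon-Fri) of its month, so that grouping on
# this column yields one row per month (the column is still called 'Year')
month_only['Year'] = month_only['Date'].map(
    lambda x: datetime.datetime(
        x.year,
        x.month,
        max(calendar.monthcalendar(x.year, x.month)[-1][:5])
    )
)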

month_wise = month_only.groupby(['Year'])\
[['NUMBER OF PERSONS KILLED', 'NUMBER OF PERSONS INJURED']].agg('sum').reset_index()

month_wise['Total Accidents'] = month_only.groupby(['Year']).size().\
reset_index(name='NoOfAccidents').NoOfAccidents

month_wise['Injury%'] = round((month_wise['NUMBER OF PERSONS INJURED']/month_wise['Total Accidents'] * 100), 1)
month_wise['Fatality%'] = round((month_wise['NUMBER OF PERSONS KILLED']/month_wise['Total Accidents'] * 100), 3)

month_wise = month_wise.sort_values('Total Accidents', ascending = False)

month_wise.head()

In [None]:
fig, ax = plt.subplots(1, figsize=(18, 10))

color = np.flip(cm.Oranges(np.linspace(.2,.6, 10)), axis = 0)
color2 = np.flip(cm.plasma(np.linspace(.2,.6, 10)), axis = 0)

month_wise.plot(x = 'Year', y = 'Total Accidents', kind = 'line', 
                color = color, stacked = True, ax = ax)
month_wise.plot(x = 'Year', y = 'NUMBER OF PERSONS INJURED', kind = 'line', 
                color = color2, stacked = True, ax = ax)

ax.set_title('Number of Accidents and Injuries over the years (monthly distribution)', size = 12)
ax.set_xlabel('Year', size = 12)
ax.tick_params(labelrotation = 90)

locator = mdates.MonthLocator()
fmt = mdates.DateFormatter('%b')

x_axis = plt.gca().xaxis
x_axis.set_major_locator(locator)

x_axis.set_major_formatter(fmt)

plt.axvspan('2015-01-30', '2015-12-31', alpha = 0.14, color = 'xkcd:blue')
plt.text('2015-06-25', 10000, '2015')

plt.axvspan('2016-01-01', '2016-12-31', alpha = 0.14, color = 'xkcd:cornflower')
plt.text('2016-06-25', 10000, '2016')

plt.axvspan('2017-01-01', '2017-12-31', alpha = 0.14, color = 'xkcd:periwinkle blue')
plt.text('2017-06-25', 10000, '2017')

plt.axvspan('2018-01-01', '2018-12-31', alpha = 0.14, color = 'xkcd:lightish blue')
plt.text('2018-06-25', 10000, '2018')

plt.axvspan('2019-01-01', '2019-12-31', alpha = 0.14, color = 'xkcd:sky')
plt.text('2019-06-25', 10000, '2019')

fig.savefig('plots/yearly_month_accidents.png', dpi=500)

**Analysis: In the latter half of 2016, there was a substantial drop in the number of accidents as well as injuries reported. Following this, from 2016 onwards, there was a gradual increase in both metrics until 2019, when a slight decline was observed.**
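
The monthly series behind this plot comes from grouping every crash by the month it occurred in. A compact sketch of that step is shown below, using `dt.to_period('M')` as the month key instead of the month-end date built in the notebook; the rows are made up and the output column names are hypothetical.

```python
import pandas as pd

# Made-up crashes: two in January 2015, one in February 2015
toy = pd.DataFrame({
    'CRASH_DATE_TIME': pd.to_datetime(
        ['2015-01-03 08:00', '2015-01-20 17:30', '2015-02-11 09:15']
    ),
    'NUMBER OF PERSONS INJURED': [1, 0, 2],
})

# One period per calendar month, e.g. 2015-01
month = toy['CRASH_DATE_TIME'].dt.to_period('M')

# Collisions and injured people per month
monthly = toy.groupby(month).agg(
    accidents=('NUMBER OF PERSONS INJURED', 'size'),
    injured=('NUMBER OF PERSONS INJURED', 'sum'),
)

print(monthly)
```

The resulting index is sorted chronologically, which is convenient for line plots over time.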

In [None]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows = 2, ncols = 2, figsize=(18, 14))

color = np.flip(cm.Oranges(np.linspace(.2,.6, 10)), axis = 0)
color2 = np.flip(cm.plasma(np.linspace(.2,.6, 10)), axis = 0)

fig.suptitle('Accident Statistics over the years', fontsize=16)

month_wise.plot(x = 'Year', y = 'NUMBER OF PERSONS INJURED', kind = 'line', 
                color = color, stacked = True, ax = ax1)
month_wise.plot(x = 'Year', y = 'Injury%', kind = 'line', 
                color = color, stacked = True, ax = ax2)
month_wise.plot(x = 'Year', y = 'NUMBER OF PERSONS KILLED', kind = 'line', 
                color = 'xkcd:lightish blue', stacked = True, ax = ax3)
month_wise.plot(x = 'Year', y = 'Fatality%', kind = 'line', 
                color = 'xkcd:lightish blue', stacked = True, ax = ax4)

fig.savefig('plots/yearly_month_accidents_stats.png', dpi=500)

In [None]:
month_group = raw_data.copy() 

month_group['Month'] = month_group['CRASH_DATE_TIME'].dt.month

month_grouped = month_group.groupby(['Month'])\
[['NUMBER OF PERSONS KILLED', 'NUMBER OF PERSONS INJURED']].agg('sum').reset_index()

month_grouped['Total Accidents'] = month_group.groupby(['Month']).size().\
reset_index(name='NoOfAccidents').NoOfAccidents

month_grouped['Injury%'] = round((month_grouped['NUMBER OF PERSONS INJURED']/month_grouped['Total Accidents'] * 100), 1)
month_grouped['Fatality%'] = round((month_grouped['NUMBER OF PERSONS KILLED']/month_grouped['Total Accidents'] * 100), 3)

month_grouped = month_grouped.sort_values('Month')

month_grouped.head()

In [None]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows = 2, ncols = 2, figsize=(18, 14))
fig2, ax = plt.subplots(1, figsize=(18, 10))

month_grouped.plot(x = 'Month', y = 'Total Accidents', kind = 'line', 
                color = color, stacked = True, ax = ax)
month_grouped.plot(x = 'Month', y = 'NUMBER OF PERSONS INJURED', kind = 'line', 
                color = color2, stacked = True, ax = ax)

ax.set_title('Number of Accidents and Injuries vs Month', size = 12)
ax.set_xlabel('Months', size = 12)
ax.tick_params(labelrotation = 90)

fig.suptitle('Accident Statistics vs Month', fontsize=16)

month_grouped.plot(x = 'Month', y = 'NUMBER OF PERSONS INJURED', kind = 'bar', 
                color = 'xkcd:light orange', stacked = True, ax = ax1)
month_grouped.plot(x = 'Month', y = 'Injury%', kind = 'bar', 
                color = 'xkcd:light orange', stacked = True, ax = ax2)
month_grouped.plot(x = 'Month', y = 'NUMBER OF PERSONS KILLED', kind = 'bar', 
                color = 'xkcd:lightish blue', stacked = True, ax = ax3)
month_grouped.plot(x = 'Month', y = 'Fatality%', kind = 'bar', 
                color = 'xkcd:lightish blue', stacked = True, ax = ax4)

fig.savefig('plots/month_wise_accidents_stats.png', dpi=500)
fig2.savefig('plots/month_wise_accidents.png', dpi=500)

#### Day of the week vs Accidents

In [None]:
week_only = raw_data.copy() 
week_only['Weekday'] = week_only['CRASH_DATE_TIME'].dt.day_name()

week_wise = week_only.groupby(['Weekday'])\
[['NUMBER OF PERSONS KILLED', 'NUMBER OF PERSONS INJURED']].agg('sum').reset_index()

week_wise['Total Accidents'] = week_only.groupby(['Weekday']).size().\
reset_index(name='NoOfAccidents').NoOfAccidents

week_wise['Injury%'] = round((week_wise['NUMBER OF PERSONS INJURED']/week_wise['Total Accidents'] * 100), 1)
week_wise['Fatality%'] = round((week_wise['NUMBER OF PERSONS KILLED']/week_wise['Total Accidents'] * 100), 3)

cats = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
week_wise['Week_day'] = pd.Categorical(week_wise['Weekday'], 
                                   categories=cats, 
                                   ordered=True)

week_wise = week_wise.sort_values('Week_day')

week_wise.head()
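
Day names sort alphabetically by default, which puts Friday first and scrambles the week. The cell above avoids this with an ordered `pd.Categorical`; the sketch below isolates that step on three made-up rows.

```python
import pandas as pd

cats = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Made-up rows in arbitrary order
df = pd.DataFrame({'Weekday': ['Sunday', 'Monday', 'Wednesday'], 'Total Accidents': [5, 9, 7]})

# Plain string sort: alphabetical, not calendar order
alphabetical = df.sort_values('Weekday')['Weekday'].tolist()

# Ordered categorical sort: follows the order given in `cats`
df['Week_day'] = pd.Categorical(df['Weekday'], categories=cats, ordered=True)
calendar_order = df.sort_values('Week_day')['Weekday'].tolist()

print(alphabetical)
print(calendar_order)
```

Days that are missing from the data simply do not appear; the categories only control the order of the rows that exist.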

In [None]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows = 2, ncols = 2, figsize=(18, 10))

fig2, ax = plt.subplots(1, figsize=(18, 6))

color = np.flip(cm.Oranges(np.linspace(.2,.6, 10)), axis = 0)
color2 = np.flip(cm.plasma(np.linspace(.2,.6, 10)), axis = 0)

week_wise.plot(x = 'Weekday', y = 'Total Accidents', kind = 'bar', 
                color = color, stacked = True, ax = ax)
week_wise.plot(x = 'Weekday', y = 'Total Accidents', kind = 'line', 
                color = 'xkcd:lightish blue', stacked = True, ax = ax)

fig.suptitle('Days of the weeks vs Accident Statistics', fontsize=16)

week_wise.plot(x = 'Weekday', y = 'NUMBER OF PERSONS INJURED', kind = 'line', 
                color = color, stacked = True, ax = ax1)
week_wise.plot(x = 'Weekday', y = 'Injury%', kind = 'line', 
                color = color, stacked = True, ax = ax2)
week_wise.plot(x = 'Weekday', y = 'NUMBER OF PERSONS KILLED', kind = 'line', 
                color = 'xkcd:lightish blue', stacked = True, ax = ax3)
week_wise.plot(x = 'Weekday', y = 'Fatality%', kind = 'line', 
                color = 'xkcd:lightish blue', stacked = True, ax = ax4)

ax.set_title('Weekdays vs Number of Accidents', size = 16)
ax.set_xlabel('Weekday', size = 12)
ax.set_ylabel('No. of Accidents', size = 12)
ax.tick_params(labelrotation = 90)

fig.savefig('plots/weekday_accidents_stats.png', dpi=500)
fig2.savefig('plots/weekday_accidents.png', dpi=500)

#### Does the time play a role?

In [None]:
time_only = raw_data.copy() 
time_only['Time'] = time_only['CRASH_DATE_TIME'].dt.hour

time_wise = time_only.groupby(['Time'])\
[['NUMBER OF PERSONS KILLED', 'NUMBER OF PERSONS INJURED']].agg('sum').reset_index()

time_wise['Total Accidents'] = time_only.groupby(['Time']).size().\
reset_index(name='NoOfAccidents').NoOfAccidents

time_wise['Injury%'] = round((time_wise['NUMBER OF PERSONS INJURED']/time_wise['Total Accidents'] * 100), 1)
time_wise['Fatality%'] = round((time_wise['NUMBER OF PERSONS KILLED']/time_wise['Total Accidents'] * 100), 3)

time_wise = time_wise.sort_values('Time', ascending = True)

time_wise.head()

In [None]:
fig, ax = plt.subplots(1, figsize=(18, 10))

color = np.flip(cm.Oranges(np.linspace(.2,.6, 10)), axis = 0)
color2 = np.flip(cm.plasma(np.linspace(.2,.6, 10)), axis = 0)

time_wise.plot(x = 'Time', y = 'Total Accidents', kind = 'line', 
                color = color, stacked = True, ax = ax)
time_wise.plot(x = 'Time', y = 'NUMBER OF PERSONS INJURED', kind = 'line', 
                color = color2, stacked = True, ax = ax)

ax.set_title('Time vs Number of Accidents and Injuries', size = 12)
ax.set_xlabel('Time (in hours)', size = 12)
ax.tick_params(labelrotation = 90)

fig.savefig('plots/time_accidents.png', dpi=500)

In [None]:
date_wise['Total Accidents'].mean()

Multiple dates exhibit an unusually high number of accidents. Given the influential role of weather conditions in motor accidents, we intend to investigate the correlation by examining weather conditions on days when more than 536 accidents (exceeding the average) were reported.

On these specific days, an average of 170 individuals were injured, and a total of 764 fatalities occurred. Thus, identifying correlations between these days and other factors could support authorities in improving road safety measures.
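
Selecting the days with an above-average number of accidents is a simple threshold filter on the daily table. A minimal sketch with made-up daily totals (the real table is `date_wise`; the name `busy` is hypothetical):

```python
import pandas as pd

# Made-up daily totals
daily = pd.DataFrame({
    'Date': pd.to_datetime(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04']).date,
    'Total Accidents': [400, 600, 500, 700],
    'NUMBER OF PERSONS INJURED': [100, 180, 120, 200],
})

# Average number of accidents per day
threshold = daily['Total Accidents'].mean()

# Days that exceed the average
busy = daily[daily['Total Accidents'] > threshold]

print(threshold)
print(busy['NUMBER OF PERSONS INJURED'].mean())  # average injured on those days
```

The same filtered table can then be summarised (mean injuries, total fatalities) or joined to external data such as weather.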

#### Weather Matching

In [None]:
NYC_LAT = '40.730610'
NYC_LONG = '-73.935242'

In [None]:
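# Assumption: `vehicle_accidents_500` is not defined in the cells shown; it is taken here to be
# the days with more accidents than the daily average, as described in the text above.
vehicle_accidents_500 = date_wise[date_wise['Total Accidents'] > date_wise['Total Accidents'].mean()]
vehicle_accidents_500_date = vehicle_accidents_500.copy()['Date'].head(10) 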

frame = {'Date': vehicle_accidents_500_date} 
vehicle_accidents_500_date_df = pd.DataFrame(frame) 

In [None]:
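# Assumption: the API key comes from a hypothetical DARKSKY_API_KEY environment variable.
# Dark Sky retired its API in 2023, so these requests need another provider to run today.
secret_key = os.environ['DARKSKY_API_KEY']
casts = []

for date in vehicle_accidents_500_date_df['Date'].values.tolist():
    # Ask for the conditions at noon on each date
    dt = str(date)
    date_time = dt + "T12:00:00"
    link = "https://api.darksky.net/forecast/{}/{},{},{}".format(secret_key, NYC_LAT, NYC_LONG, date_time)
    
    r = requests.get(url = link, timeout = 30)
    r.raise_for_status()  # stop on HTTP errors instead of parsing an error body
    
    data = r.json() 
    to_cast = data['currently']['summary']
    
    casts.append(to_cast)

vehicle_accidents_500_date_df['Summary'] = casts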

In [None]:
vehicle_accidents_500_date_df = date_wise.set_index('Date').\
join(vehicle_accidents_500_date_df.set_index('Date'))

In [None]:
to_plot = vehicle_accidents_500_date_df.head(10)

# Each row of `to_plot` is one day, so sum the daily accident counts per weather summary
# (the group size would only count days, not accidents)
to_plot_grouped = to_plot.groupby(['Summary'])\
[['NUMBER OF PERSONS KILLED', 'NUMBER OF PERSONS INJURED', 'Total Accidents']].agg('sum').reset_index()

to_plot_grouped['Injury%'] = round((to_plot_grouped['NUMBER OF PERSONS INJURED']/to_plot_grouped['Total Accidents'] * 100), 1)
to_plot_grouped['Fatality%'] = round((to_plot_grouped['NUMBER OF PERSONS KILLED']/to_plot_grouped['Total Accidents'] * 100), 3)

to_plot_grouped = to_plot_grouped.sort_values('Total Accidents', ascending = False)

In [None]:
fig, ax = plt.subplots(1, figsize=(14, 6))

color = np.flip(cm.Reds(np.linspace(.2,.6, 10)), axis = 0)

to_plot_grouped.plot(x = 'Summary', y = 'Total Accidents', 
             kind = 'bar', color = color, 
             stacked = True, ax = ax)

ax.set_title('Weather Condition vs Number of Accidents', size = 12)
ax.set_xlabel('Weather Condition', size = 12)
ax.set_ylabel('Number of Accidents', size = 12)
ax.tick_params(labelrotation = 90)

fig.savefig('plots/weather_summary_accidents.png', dpi=500)

In [None]:
weather_inj_rate = to_plot_grouped.sort_values('Injury%', ascending = False).head(10)
weather_inj = to_plot_grouped.sort_values('NUMBER OF PERSONS INJURED', ascending = False).head(10)
weather_fat_rate = to_plot_grouped.sort_values('Fatality%', ascending = False).head(10)
weather_fat = to_plot_grouped.sort_values('NUMBER OF PERSONS KILLED', ascending = False).head(10)

In [None]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, figsize=(18, 18))

color = np.flip(cm.plasma(np.linspace(.2,.6, 10)), axis = 0)
color2 = cm.PuRd(np.linspace(.2,.6, 10))

weather_inj_rate.plot(x = 'Summary', 
                      y = 'Injury%', kind = 'bar', 
                      color = color, stacked = True, ax = ax1)

weather_inj.plot(x = 'Summary', y = 'NUMBER OF PERSONS INJURED', kind = 'bar', 
                 color = color, stacked = True, ax = ax3)

weather_fat_rate.plot(x = 'Summary', 
                      y = 'Fatality%', kind = 'bar', 
                      color = color2, stacked = True, ax = ax2)

weather_fat.plot(x = 'Summary', y = 'NUMBER OF PERSONS KILLED', kind = 'bar', 
                 color = color2, stacked = True, ax = ax4)

ax1.set_title('Weather Condition with the highest rate of injury', size = 12)
ax1.set_ylabel('Rate of Injury (%)', size = 12)
ax1.set_xlabel(' ', size = 12)
ax1.tick_params(labelrotation = 30)

ax2.set_title('Weather Condition with the highest rate of fatality', size = 12)
ax2.set_ylabel('Rate of Fatality (%)', size = 12)
ax2.set_xlabel(' ', size = 12)
ax2.tick_params(labelrotation = 30)

ax3.set_title('Weather Condition vs Injuries', size = 12)
ax3.set_xlabel('Weather Condition', size = 12)
ax3.set_ylabel('Number of Injured People', size = 12)
ax3.tick_params(labelrotation = 30)

ax4.set_title('Weather Condition vs Fatalities', size = 12)
ax4.set_xlabel('Weather Condition', size = 12)
ax4.set_ylabel('Number of Deaths', size = 12)
ax4.tick_params(labelrotation = 30)

fig.savefig('plots/weather_inj_fat_rate.png', dpi=500)