## Preparing data for data visualization
In this notebook, I prepare the data from the CSV file `nyc_bike_accidents.csv` for visualization.

In [None]:
import pandas as pd
import numpy as np

Load the data.

In [None]:
df = pd.read_csv('data/nyc_bike_accidents.csv')

In [None]:
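# Parse the crash dates so the year can be extracted
df['CRASH DATE'] = pd.to_datetime(df['CRASH DATE'])

# Store the year as a string (e.g. '2019'); later cells filter on these strings
df['YEAR'] = df['CRASH DATE'].dt.strftime('%Y')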
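
Because `YEAR` is built with `strftime`, it holds strings rather than integers, which is why later cells compare it with `'2022'` and not `2022`. The following is a minimal sketch on two made-up dates; the sample values and the `MM/DD/YYYY` format are assumptions for illustration, not taken from the CSV.

```python
import pandas as pd

# Hypothetical sample dates, not taken from the real CSV.
sample = pd.DataFrame({'CRASH DATE': ['01/15/2019', '12/31/2022']})
sample['CRASH DATE'] = pd.to_datetime(sample['CRASH DATE'], format='%m/%d/%Y')

# strftime('%Y') yields strings, while dt.year yields integers.
year_as_text = sample['CRASH DATE'].dt.strftime('%Y')
year_as_int = sample['CRASH DATE'].dt.year

print(year_as_text.tolist())  # ['2019', '2022']
print(year_as_int.tolist())   # [2019, 2022]
```

Either representation works, as long as every later filter and merge key uses the same type.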

Here I drop the rows without coordinates and keep only the data for 2018 to 2022.

In [None]:
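# Drop rows without coordinates, since they cannot be placed on a map
df.dropna(subset=['LATITUDE', 'LONGITUDE'], inplace=True)

# Keep only the years 2018-2022 and the columns needed for the visualization.
# .copy() avoids a SettingWithCopyWarning when new columns are added below.
years = ['2018', '2019', '2020', '2021', '2022']
columns = ['YEAR', 'LATITUDE', 'LONGITUDE', 'NUMBER OF CYCLIST INJURED', 'NUMBER OF CYCLIST KILLED']
nyc_bike_accidents_2018_2022 = df.loc[df['YEAR'].isin(years), columns].copy()

# Create the column used for visualization. 0 marks crashes with no injured or killed cyclist;
# the object dtype lets the column hold both this placeholder and the text labels.
nyc_bike_accidents_2018_2022['ACCIDENT TYPE'] = 0
nyc_bike_accidents_2018_2022['ACCIDENT TYPE'] = nyc_bike_accidents_2018_2022['ACCIDENT TYPE'].astype(object)

# 'Fatal' is assigned last, so it takes precedence when a crash has both injuries and deaths
nyc_bike_accidents_2018_2022.loc[nyc_bike_accidents_2018_2022['NUMBER OF CYCLIST INJURED'] > 0, 'ACCIDENT TYPE'] = 'Injured'
nyc_bike_accidents_2018_2022.loc[nyc_bike_accidents_2018_2022['NUMBER OF CYCLIST KILLED'] > 0, 'ACCIDENT TYPE'] = 'Fatal'

nyc_bike_accidents_2018_2022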
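
The labeling rule in the cell above can be tried on a toy frame. This is only a sketch with invented counts. It uses `np.select` as one possible alternative to the two `.loc` assignments, and the default label `'No cyclist harmed'` is a hypothetical stand-in for the `0` placeholder used in the notebook.

```python
import numpy as np
import pandas as pd

# Invented counts covering all four combinations of injured/killed.
toy = pd.DataFrame({
    'NUMBER OF CYCLIST INJURED': [1, 0, 1, 0],
    'NUMBER OF CYCLIST KILLED': [0, 1, 1, 0],
})

# np.select picks the first matching condition, so 'Fatal' is listed first
# to take precedence over 'Injured', as in the notebook.
conditions = [
    toy['NUMBER OF CYCLIST KILLED'] > 0,
    toy['NUMBER OF CYCLIST INJURED'] > 0,
]
labels = ['Fatal', 'Injured']

# 'No cyclist harmed' is a hypothetical label, not one used in the notebook.
toy['ACCIDENT TYPE'] = np.select(conditions, labels, default='No cyclist harmed')

print(toy)
```

A crash with both an injured and a killed cyclist ends up as `'Fatal'`, which matches the order of the assignments in the cell above.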

In [None]:
nyc_bike_accidents_2018_2022.to_csv('data/nyc_bike_accidents_2018_2022.csv', index=False)

### Reasons for accidents
In this section I explore the reasons for bike accidents.

In [None]:
# Work on a copy so the changes below do not leak into df
bike_df = df.copy()

# Standardize the vehicle types (replace returns a new object, so the result must be assigned)
bike_df = bike_df.replace(['E-Bike', 'BICYCLE', 'E-Bik'], 'Bike')

# Create a column that specifies if the vehicle type was a bike or not (0 = bike, 1 = any other vehicle).
bike_df['vehicle_1_bike'] = np.where(bike_df['VEHICLE TYPE CODE 1'] == 'Bike', 0, 1)

# Group by the vehicle type and the reason for the accident.
# numeric_only=True keeps the sum from failing on date and text columns.
bike_df = (
    bike_df.groupby(['vehicle_1_bike', 'CONTRIBUTING FACTOR VEHICLE 1'])
    .sum(numeric_only=True)
    .sort_values('NUMBER OF PERSONS INJURED', ascending=False)
    .reset_index()
)
bike_df = bike_df[['CONTRIBUTING FACTOR VEHICLE 1', 'NUMBER OF CYCLIST INJURED', 'NUMBER OF CYCLIST KILLED', 'vehicle_1_bike']]

# Drop the rows where the reason was unspecified.
bike_df = bike_df[~bike_df['CONTRIBUTING FACTOR VEHICLE 1'].str.contains('Unspecified')]
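
To make the group, sort and filter steps easier to follow, here is a small sketch on invented rows. The data is made up, and it sorts by the cyclist injury count only because the toy frame has no other numeric column.

```python
import numpy as np
import pandas as pd

# Invented sample rows; the real CSV has many more columns.
toy = pd.DataFrame({
    'VEHICLE TYPE CODE 1': ['Bike', 'Sedan', 'Sedan', 'Bike', 'Sedan'],
    'CONTRIBUTING FACTOR VEHICLE 1': [
        'Unspecified',
        'Driver Inattention/Distraction',
        'Driver Inattention/Distraction',
        'Failure to Yield Right-of-Way',
        'Unspecified',
    ],
    'NUMBER OF CYCLIST INJURED': [1, 1, 2, 1, 3],
})

# Same encoding as in the notebook: 0 = bike, 1 = any other vehicle.
toy['vehicle_1_bike'] = np.where(toy['VEHICLE TYPE CODE 1'] == 'Bike', 0, 1)

# Sum the numeric columns per (vehicle flag, reason) pair and rank the pairs.
summary = (
    toy.groupby(['vehicle_1_bike', 'CONTRIBUTING FACTOR VEHICLE 1'])
    .sum(numeric_only=True)
    .sort_values('NUMBER OF CYCLIST INJURED', ascending=False)
    .reset_index()
)

# Remove the 'Unspecified' reasons after ranking.
summary = summary[~summary['CONTRIBUTING FACTOR VEHICLE 1'].str.contains('Unspecified')]

print(summary)
```

Because the frame is already sorted, taking the first rows afterwards yields the most frequent specified reasons.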

For the visualization, I am only interested in the top 5 reasons, so I filter out the rest.

In [None]:
bike_df_5 = bike_df.head(5)

In [None]:
bike_df_5.to_csv('data/nyc_accidents_reason_5.csv')

### Prepare accident data for mapping

In this section of the notebook, I prepare the accident data to be mapped as dots on a Datawrapper map. The map needs the location of each accident and whether the cyclist was injured or killed, which determines the color of the dot.

In [None]:
# Yearly totals of all numeric columns, as a quick overview
df.groupby('YEAR').sum(numeric_only=True)

In [None]:
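# Get only the data from 2022
df_2022 = df[df['YEAR'] == '2022']

# Get only the columns we need; .copy() avoids a SettingWithCopyWarning when adding columns below
df_2022 = df_2022[['CRASH DATE', 'CRASH TIME', 'BOROUGH', 'ZIP CODE', 'LATITUDE', 'LONGITUDE', 'NUMBER OF CYCLIST KILLED', 'NUMBER OF CYCLIST INJURED', 'ON STREET NAME', 'CROSS STREET NAME']].copy()

# And create new columns that will be used for visualization.
# 0 marks crashes with no injured or killed cyclist; the object dtype lets the column also hold text labels.
df_2022['ACCIDENT TYPE'] = 0
df_2022['ACCIDENT TYPE'] = df_2022['ACCIDENT TYPE'].astype(object)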

df_2022.loc[df_2022['NUMBER OF CYCLIST INJURED'] > 0, 'ACCIDENT TYPE'] = 'Injured'
df_2022.loc[df_2022['NUMBER OF CYCLIST KILLED'] > 0, 'ACCIDENT TYPE'] = 'Fatal'

df_2022['INJURED OR DEATH'] = df_2022['NUMBER OF CYCLIST KILLED'] + df_2022['NUMBER OF CYCLIST INJURED']

df_2022.dropna(subset=['LATITUDE', 'LONGITUDE'], inplace=True)

In [None]:
df_2022.to_csv('data/nyc_bike_accidents_2022.csv')

I also check whether any streets are particularly dangerous for cyclists.

In [None]:
# numeric_only=True keeps the sum from failing on the date and text columns
df_2022.groupby('ON STREET NAME').sum(numeric_only=True).sort_values('INJURED OR DEATH', ascending=False).head(50)

### Development of accidents over time
In this part, I explore how the number of accidents has changed over time. This includes standardizing the data to account for the different number of bikers each year.

In [None]:
accidents = pd.read_csv('data/nyc_bike_accidents.csv')

bikers = pd.read_csv('data/nyc_bikerides_numbers.csv')

In [None]:
accidents['CRASH DATE'] = pd.to_datetime(accidents['CRASH DATE'])

accidents['YEAR'] = accidents['CRASH DATE'].dt.strftime('%Y')

# Sum the injured and killed cyclists per year
accidents_year = accidents.groupby('YEAR')[['NUMBER OF CYCLIST INJURED', 'NUMBER OF CYCLIST KILLED']].sum()

In [None]:
accidents_year = accidents_year.reset_index()

accidents_year.rename(columns={"YEAR": "year", "NUMBER OF CYCLIST INJURED": "injured", "NUMBER OF CYCLIST KILLED": "killed"}, inplace=True)


In [None]:
# The year is stored in the unnamed first column of the CSV
bikers['year'] = bikers['Unnamed: 0']
bikers = bikers[['year', 'Total Daily Cycling Trips']].copy()

In [None]:
bikers.drop([13], inplace=True)

In [None]:
bikers_accidents = pd.merge(accidents_year, bikers, on=['year'], how='left')


# Standardize the injuries and deaths per 1 million rides
bikers_accidents['injury_rate'] = bikers_accidents['injured'] / (bikers_accidents['Total Daily Cycling Trips'] * 365) * 1000000
bikers_accidents['fatality_rate'] = bikers_accidents['killed'] / (bikers_accidents['Total Daily Cycling Trips'] * 365) * 1000000

bikers_accidents.rename({'Total Daily Cycling Trips': 'total_daily_bikerides'}, axis=1, inplace=True)
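
The standardization can be checked by hand on round numbers. The sketch below uses invented figures, and `rate_per_million` is a hypothetical helper that does not exist in the notebook. Like the cell above, it assumes 365 days per year and ignores leap years.

```python
import pandas as pd


def rate_per_million(events, daily_trips, days_per_year=365):
    """Hypothetical helper: events per 1 million rides, given average daily trips."""
    annual_trips = daily_trips * days_per_year
    return events / annual_trips * 1_000_000


# Invented figures chosen so the rates come out as round numbers.
toy = pd.DataFrame({
    'year': ['2020', '2021'],
    'injured': [365, 730],
    'killed': [73, 0],
    'Total Daily Cycling Trips': [100_000, 200_000],
})

toy['injury_rate'] = rate_per_million(toy['injured'], toy['Total Daily Cycling Trips'])
toy['fatality_rate'] = rate_per_million(toy['killed'], toy['Total Daily Cycling Trips'])

print(toy[['year', 'injury_rate', 'fatality_rate']].round(2))
```

In this toy data the injury count doubles from 2020 to 2021, but so does the number of rides, so the injury rate stays at about 10 per million rides. This is the effect the standardization is meant to capture.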

In [None]:
bikers_accidents_rates = bikers_accidents[['year', 'injury_rate', 'fatality_rate']]

In [None]:
bikers_accidents_rates.to_csv('data/nyc_accidents_development_2012_2021.csv', index=False)

In [None]:
bikers_injuries = bikers_accidents[['year', 'injured']]
bikers_injuries.to_csv('data/nyc_injuries_development_2012_2021.csv', index=False)

In [None]:
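# Reshape to long format: one row per year and rate type
melt = pd.melt(bikers_accidents, id_vars=['year'], value_vars=['injury_rate', 'fatality_rate'], var_name='accident_type', value_name='rate')

# Drop the years without a rate; .copy() avoids a SettingWithCopyWarning on the next assignment
melt = melt[melt['rate'].notna()].copy()
melt['year'] = melt['year'].astype(int)

melt.to_csv('data/nyc_accidents_development_2012_2021_long.csv', index=False)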
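
For readers unfamiliar with `pd.melt`, this sketch shows the wide-to-long reshaping on a two-row frame with invented rates. The missing value stands in for a year without ride counts.

```python
import numpy as np
import pandas as pd

# Invented rates; NaN stands in for a year without ride counts.
wide = pd.DataFrame({
    'year': ['2020', '2021'],
    'injury_rate': [10.0, 12.0],
    'fatality_rate': [2.0, np.nan],
})

# One row per (year, rate type) instead of one column per rate type.
long = pd.melt(
    wide,
    id_vars=['year'],
    value_vars=['injury_rate', 'fatality_rate'],
    var_name='accident_type',
    value_name='rate',
)

# Drop the missing rates and convert the year strings to integers.
long = long[long['rate'].notna()].copy()
long['year'] = long['year'].astype(int)

print(long)
```

The long format has one `rate` column plus an `accident_type` column, which is convenient for charting tools that draw one line per category.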
